# Problem Session 7
## Classifying Cancer I

In this notebook you will work with the Breast Cancer Wisconsin (Diagnostic) data set, which can be found at <a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29">https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29</a>. Specifically, we will introduce the data, perform some exploratory data analysis (EDA) and build a couple of models.

The problems in this notebook draw on the material from our `Classification` notebooks, including:
- `Adjustments for Classification`,
- `k Nearest Neighbors`,
- `The Confusion Matrix` and
- `Logistic Regression`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

##### 1. Load the data.

The data for this problem are stored in `sklearn`. The documentation page is <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html">https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html</a>.

Run this code chunk to load in the data.

Use `np.unique`, <a href="https://numpy.org/doc/stable/reference/generated/numpy.unique.html">https://numpy.org/doc/stable/reference/generated/numpy.unique.html</a> to see the split between `0` and `1` in the data set. Then perform a stratified train test split.

Note we will flip the labels of the data because `sklearn` confusingly uses `0` for malignant tumors and `1` for benign tumors.
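
As a rough sketch of the steps described above (one possible approach, not the only one), the cell below counts the labels with `np.unique`, flips them, and makes a stratified split. The `_demo` names, `test_size=0.2` and `random_state=440` are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

## Load the features and labels as pandas objects
cancer_demo = load_breast_cancer(as_frame=True)
X_demo = cancer_demo["data"]

## Flip the labels so that 1 = malignant and 0 = benign
y_demo = 1 - cancer_demo["target"]

## The unique labels and how many times each one occurs
values, counts = np.unique(y_demo, return_counts=True)
print("labels:", values, "proportions:", counts / len(y_demo))

## Stratifying on y keeps the class proportions (nearly) equal
## in the training and test sets
## test_size and random_state are arbitrary choices for this sketch
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=True, random_state=440, stratify=y_demo
)

print("malignant share in train:", y_tr.mean())
print("malignant share in test: ", y_te.mean())
```

Because the labels are `0` and `1`, the mean of `y` is the share of malignant tumors, which makes it easy to check that the split was stratified.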

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
## Loads the data from sklearn 
cancer = load_breast_cancer(as_frame=True)

## the 'data' entry contains the features
X = cancer['data']

## the 'target' entry contains what we would like to predict
y = cancer['target']

## Changing the labels around, so 0 becomes 1 and 1 becomes 0
y = -y + 1

In [None]:
## Find the unique values of y
## and the split those values have
print("y takes on the values 0 and 1", np.unique(y))
print("with a", np.unique(y, return_counts=True)[1]/len(y), "split.")

Note that `y=0` now corresponds to a benign tumor and `y=1` corresponds to a malignant one.

In [None]:
## Import train_test_split
from sklearn.model_selection import train_test_split

In [None]:
## Make the train test split
X_train, X_test, y_train, y_test = train_test_split()

##### 2. Learn how the data were generated

Read through the following:

In this problem you will build a model to predict whether a tumor is malignant ($y=1$) or benign ($y=0$).

The features you will use in this model are a selection of measurements from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. An FNA of a breast mass looks something like this:

<img src="fna_img.png" width="60%"></img>

<i>Source: Nuclear Feature Extraction For Breast Tumor Diagnosis, W. Nick Street, William H. Wolberg and O.L. Mangasarian, Center for Parallel Optimization, Computer Sciences Technical Report #1131 (1992). <a href="https://minds.wisconsin.edu/bitstream/handle/1793/59692/TR1131.pdf?sequence=1">https://minds.wisconsin.edu/bitstream/handle/1793/59692/TR1131.pdf?sequence=1</a>.</i>

For each FNA we have the mean, standard error and "worst" (mean of the three largest values) measurements of the following features:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
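
To see how the list above maps onto the 30 columns of the data, a short sketch like the following groups the column names by the kind of summary they record. The variable names are made up for this illustration.

```python
from sklearn.datasets import load_breast_cancer

## The names of the 30 feature columns
feature_names = list(load_breast_cancer().feature_names)

## Group the columns by the kind of summary they record
mean_cols = [name for name in feature_names if name.startswith("mean ")]
error_cols = [name for name in feature_names if name.endswith(" error")]
worst_cols = [name for name in feature_names if name.startswith("worst ")]

print(len(feature_names), "features in total")
print(len(mean_cols), "mean,", len(error_cols), "error,", len(worst_cols), "worst")
print(mean_cols)
```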

##### 3. Choosing an appropriate performance metric

Consider the goal of this "project": we want to determine whether an FNA is an image of a malignant or a benign tumor. Is accuracy an appropriate metric for choosing a model here? Why or why not?

If accuracy is not appropriate, what are some good alternatives? Select a performance metric (or metrics) for the models you will build later and give a reason why.
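
As food for thought, here is a small made-up example (the labels and predictions are invented, not taken from the data set) in which two hypothetical classifiers have the same accuracy but make different kinds of mistakes. It uses `accuracy_score`, `recall_score` and `precision_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

## Invented labels: 4 malignant (1) and 6 benign (0) tumors
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

## Hypothetical classifier A misses two malignant tumors
pred_a = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

## Hypothetical classifier B flags two benign tumors as malignant
pred_b = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(
        name,
        "accuracy:", accuracy_score(y_toy, pred),
        "recall:", recall_score(y_toy, pred),
        "precision:", np.round(precision_score(y_toy, pred), 4),
    )
```

Both classifiers get 8 of 10 observations right, so accuracy alone cannot tell them apart. Which kind of mistake matters more for this problem is for you to argue in your answer.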

##### Write here






##### 4. Exploratory Data Analysis

- Make histograms of each feature, split by the value of $y$, <a href="https://matplotlib.org/3.5.0/api/_as_gen/matplotlib.pyplot.hist.html">https://matplotlib.org/3.5.0/api/_as_gen/matplotlib.pyplot.hist.html</a>

- Make a few scatter plots colored by the value of $y$. Don't try to make all of the possible scatter plots: there are 30 features, so there are 30 choose 2 = 435 possible scatter plots to examine.
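
If you get stuck on the plotting, the sketch below shows one way to draw a single class-split histogram. It reloads the data under `_demo` names so that it runs on its own, and the helper `plot_feature_by_class` is a made-up name for this illustration, not part of any library.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

## Reload the data so this sketch is self-contained
cancer_demo = load_breast_cancer(as_frame=True)
X_demo = cancer_demo["data"]
y_demo = 1 - cancer_demo["target"]  # 1 = malignant, 0 = benign


def plot_feature_by_class(X, y, column, bins=30):
    """Overlay histograms of one column for y = 0 and y = 1."""
    fig, ax = plt.subplots(figsize=(14, 4))

    ## Boolean masks on y select the rows belonging to each class
    ax.hist(X.loc[y == 0, column], color="b", alpha=0.5, label="y=0", bins=bins)
    ax.hist(X.loc[y == 1, column], color="r", alpha=0.5, label="y=1", bins=bins)

    ax.legend(fontsize=14)
    ax.set_title(column, fontsize=16)
    return fig, ax


fig, ax = plot_feature_by_class(X_demo, y_demo, "mean radius")
```

In a notebook the figure displays when the cell finishes; in a script you would call `plt.show()`. Wrapping the call in a loop over the columns gives one histogram per feature.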

In [None]:
## histogram code
## complete the missing pieces of the loop

## loop through the columns of X_train
for :
    plt.figure(figsize=(14,4))
    
    ## Get all observations for this column where y = 0
    plt.hist(, color='b', alpha=.5, label="y=0", bins=30)
    
    ## Get all observations for this column where y = 1
    plt.hist(, color='r', alpha=.5, label="y=1", bins=30)
    
    
    plt.legend(fontsize=14)
    plt.title(column,fontsize=16)
    plt.show()

In [None]:
## Make one or two scatter plots here






##### 5. Making some logistic regression models

Using your exploratory histograms choose a few different features that seem to separate the malignant tumors from the benign. Make a separate logistic regression model regressing `y` on each feature you have chosen. For example if you choose `mean radius`, `mean area` and `mean perimeter` you would build three models:
- one regressing `y` on `mean radius`,
- one regressing `y` on `mean area` and
- one regressing `y` on `mean perimeter`.

Of the models you build, find the one with the best average cross-validation value of the performance metric you chose in 3.
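
A minimal sketch of this comparison, under some assumptions, might look like the cell below. It uses the three example features named above, recall as a stand-in metric (yours may differ), and `LogisticRegression` with its default settings, which include L2 regularization. The `_demo` names and `random_state` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, train_test_split

## Reload and split the data so this sketch is self-contained
cancer_demo = load_breast_cancer(as_frame=True)
X_demo, y_demo = cancer_demo["data"], 1 - cancer_demo["target"]
X_tr, _, y_tr, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=440, stratify=y_demo
)

features = ["mean radius", "mean area", "mean perimeter"]
n_splits = 5
kfold_demo = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=440)

## rows are CV splits, columns are the candidate features
recalls = np.zeros((n_splits, len(features)))

for i, (train_index, test_index) in enumerate(kfold_demo.split(X_tr, y_tr)):
    X_tt, X_ho = X_tr.iloc[train_index], X_tr.iloc[test_index]
    y_tt, y_ho = y_tr.iloc[train_index], y_tr.iloc[test_index]

    for j, feature in enumerate(features):
        ## one model per feature; double brackets keep X two-dimensional
        log_reg = LogisticRegression(max_iter=1000)
        log_reg.fit(X_tt[[feature]], y_tt)
        pred = log_reg.predict(X_ho[[feature]])
        recalls[i, j] = recall_score(y_ho, pred)

## average over the splits to get one number per feature
avg_recalls = recalls.mean(axis=0)
for feature, avg in zip(features, avg_recalls):
    print(feature, np.round(avg, 4))
```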

##### Make any notes you need here




In [None]:
## Import the stuff you need here

## import StratifiedKFold
from sklearn.model_selection 

## Import LogisticRegression
from sklearn.linear_model 

## import performance metrics
from sklearn.metrics 

In [None]:
## Make the kfold object here
kfold = 

## Make zero arrays to track your performance metrics here


## Write your kfold loop here
for train_index, test_index in kfold.split(X_train, y_train):
    X_tt = X_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_tt = y_train.iloc[train_index]
    y_ho = y_train.iloc[test_index]
    
    

In [None]:
## Examine your avg CV performance here






##### 6. A $k$NN model

Perform $5$-fold cross-validation to find the value of $k$ that optimizes your performance metric of choice.
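
The search over $k$ can be sketched as follows, again with recall standing in for whatever metric you chose and with arbitrary `_demo` names and `random_state` values. Note that this sketch uses all 30 features without scaling them, which is a simplification since $k$NN depends on distances.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

## Reload and split the data so this sketch is self-contained
cancer_demo = load_breast_cancer(as_frame=True)
X_demo, y_demo = cancer_demo["data"], 1 - cancer_demo["target"]
X_tr, _, y_tr, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=440, stratify=y_demo
)

n_splits = 5
max_neighbors = 40
kfold_demo = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=440)

## rows are CV splits, columns are the values of k
knn_recalls = np.zeros((n_splits, max_neighbors))

for i, (train_index, test_index) in enumerate(kfold_demo.split(X_tr, y_tr)):
    X_tt, X_ho = X_tr.iloc[train_index], X_tr.iloc[test_index]
    y_tt, y_ho = y_tr.iloc[train_index], y_tr.iloc[test_index]

    for j, k in enumerate(range(1, max_neighbors + 1)):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_tt.values, y_tt.values)
        pred = knn.predict(X_ho.values)
        knn_recalls[i, j] = recall_score(y_ho.values, pred)

## column j holds k = j + 1, so shift the argmax by one
avg_knn_recalls = knn_recalls.mean(axis=0)
best_k = int(np.argmax(avg_knn_recalls)) + 1
print("best k by average CV recall:", best_k)
```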

In [None]:
## import KNN here
from sklearn.neighbors import 

In [None]:
## Make the kfold object
kfold = 

## we'll test from 1 to max_neighbors knn
max_neighbors = 40

## Make arrays to keep track of your metrics


## counter for cv split
i = 0
for train_index, test_index in kfold.split(X_train, y_train.values):
    X_tt = X_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_tt = y_train.iloc[train_index]
    y_ho = y_train.iloc[test_index]
    
    ## counter for neighbor choice
    j = 0
    for k in range(1,max_neighbors+1):
        ## make the KNeighbors Model
        knn =
        
        ## fit the model
        knn.fit(X_tt.values, y_tt.values)
        
        ## Get the prediction on the holdout set
        pred = knn.predict(X_ho.values)
        
        ## record the performance metrics for this split
        
        
        j = j + 1
    i = i + 1

In [None]:
## Examine your metrics





##### 7. Choose a model

Choose a model from the ones you tested. Write down what you chose below.

##### Write here





##### 8. Interpreting your model

Common questions for diagnostic models concern estimating the probability that an individual does or does not have a disease if the model says (or does not say) they have one. We can estimate such a statistic using Bayes' rule.

$$
P\left(\text{Has Cancer} | \text{Classified } 1\right)
$$

$$
= \frac{P\left(\text{Classified } 1 | \text{Has Cancer} \right) P\left( \text{Has Cancer}  \right)}{P\left(\text{Classified } 1 | \text{Has Cancer} \right) P\left( \text{Has Cancer}  \right) + P\left(\text{Classified } 1 | \text{Does Not Have Cancer} \right) P\left( \text{Does Not Have Cancer}  \right)},
$$

similarly

$$
P\left(\text{Has Cancer} | \text{Classified } 0\right)
$$

$$
= \frac{P\left(\text{Classified } 0 | \text{Has Cancer} \right) P\left( \text{Has Cancer}  \right)}{P\left(\text{Classified } 0 | \text{Has Cancer} \right) P\left( \text{Has Cancer}  \right) + P\left(\text{Classified } 0 | \text{Does Not Have Cancer} \right) P\left( \text{Does Not Have Cancer}  \right)}.
$$

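We can estimate $P\left(\text{Classified } 1 | \text{Has Cancer} \right)$ with the true positive rate, $P\left(\text{Classified } 1 | \text{Does Not Have Cancer} \right)$ with the false positive rate, $P\left(\text{Classified } 0 | \text{Has Cancer} \right)$ with the false negative rate and $P\left(\text{Classified } 0 | \text{Does Not Have Cancer} \right)$ with the true negative rate. We can estimate $P(\text{Has Cancer})$ and $P(\text{Does Not Have Cancer})$ using the rates from the training set.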

Estimate the true positive and true negative rates for your classifier using cross-validation. Then estimate $P\left(\text{Has Cancer} | \text{Classified } 1\right)$ and $P\left(\text{Has Cancer} | \text{Classified } 0\right)$ for your classifier.
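
To make the arithmetic concrete, here is a small worked example on invented labels and predictions (not the real data). It reads the four rates off a `confusion_matrix` and plugs them into the two formulas above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

## Invented holdout labels and predictions, for illustration only
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

## For 0/1 labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_toy, pred_toy).ravel()

tpr = tp / (tp + fn)  # P(Classified 1 | Has Cancer)
fnr = fn / (tp + fn)  # P(Classified 0 | Has Cancer)
tnr = tn / (tn + fp)  # P(Classified 0 | Does Not Have Cancer)
fpr = fp / (tn + fp)  # P(Classified 1 | Does Not Have Cancer)

## Rates of cancer and no cancer in these labels
p_cancer = np.mean(y_toy == 1)
p_no_cancer = np.mean(y_toy == 0)

## Bayes' rule, term by term as in the formulas above
p_cancer_given_1 = tpr * p_cancer / (tpr * p_cancer + fpr * p_no_cancer)
p_cancer_given_0 = fnr * p_cancer / (fnr * p_cancer + tnr * p_no_cancer)

print("P(Has Cancer | Classified 1) =", np.round(p_cancer_given_1, 4))
print("P(Has Cancer | Classified 0) =", np.round(p_cancer_given_0, 4))
```

Here the counts are TP = 3, FN = 1, FP = 1 and TN = 5, so the two probabilities come out to 3/4 and 1/6. In the exercise you would compute the rates on each holdout set and then average across the splits.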

In [None]:
## kfold object
kfold = StratifiedKFold(5, shuffle=True, random_state=14235)

## These will keep track of the relevant rates
tprs = []
tnrs = []
fprs = []
fnrs = []
has_cancer = []
no_cancer = []



for train_index, test_index in kfold.split(X_train, y_train):
    X_tt = X_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_tt = y_train.iloc[train_index]
    y_ho = y_train.iloc[test_index]
    
    ## make the model object
    
    
    ## Fit the model
    
    
    ## get your prediction here
    pred = 
    
    ## Calculate the confusion_matrix
    conf_mat = confusion_matrix(y_ho.values, pred)
    
    ## Append the relevant rates
    tprs.append()
    tnrs.append()
    fprs.append()
    fnrs.append()
    
    ## Estimate the rate of cancer or not
    has_cancer.append(np.sum(y_ho==1)/len(y_ho))
    no_cancer.append(np.sum(y_ho==0)/len(y_ho))
    
## Turn the lists into arrays for easy calculation purposes
tprs = np.array(tprs)
tnrs = np.array(tnrs)
fnrs = np.array(fnrs)
fprs = np.array(fprs)

has_cancer = np.array(has_cancer)
no_cancer = np.array(no_cancer)

In [None]:
## Use the formulae above to calculate
## these probabilities
p_has_cancer_given_1 = 
p_has_cancer_given_0 = 

In [None]:
print("If our classifier says a patient has cancer, we estimate",
      "a", np.round(np.mean(p_has_cancer_given_1),4),
      "probability that they actually have cancer.")

print()
print()

print("If our classifier says a patient does not have cancer, we estimate",
      "a", np.round(np.mean(p_has_cancer_given_0),4),
      "probability that they actually do have cancer.")

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph. D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)