# Problem Session 8
## Classifying Cancer II

In this notebook you will continue to work with the breast cancer data set, which can be found at <a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29">https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29</a>.

The problems in this notebook draw on material from some of our `Classification` notebooks as well as some of our `Dimension Reduction` notebooks. In particular, we will use content from:
- `Classification/Adjustments for Classification`,
- `Classification/k Nearest Neighbors`,
- `Classification/The Confusion Matrix`,
- `Classification/Logistic Regression`,
- `Classification/Diagnostic Curves`,
- `Classification/Bayes Based Classifiers` and
- `Dimension Reduction/Principal Components Analysis`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

##### 1. Load the data.

The data for this problem is stored in `sklearn`. The documentation page is <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html">https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html</a>.

Run this code chunk to load in the data.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

In [None]:
## Loads the data from sklearn 
cancer = load_breast_cancer(as_frame=True)

## the 'data' entry contains the features
X = cancer['data']

## the 'target' entry contains what we would like to predict
y = cancer['target']

## Changing the labels around so that 1 = malignant and 0 = benign
y = -y + 1

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X.copy(), y.copy(),
                                                       shuffle=True,
                                                       random_state=214,
                                                       stratify=y,
                                                       test_size=.2)

##### 2. Remind yourselves

Take a few minutes to review `Problem Session 7` if you need to remind yourselves about this data set and the overall goal of the problem.

##### 3. Dimension reduction

The cancer data set has $30$ features. While this does not seem like a large number of features, there may be a number of features that add <i>noise</i> to the data set with respect to separating our variable of interest, $y$.

One way to denoise the data is to run it through a dimension reduction technique. 

Run the features of the training set through principal components analysis (PCA) and reduce the $30$ features down to $2$. Make a scatter plot with the first PCA values plotted on the horizontal axis and the second PCA values plotted on the vertical axis, and color the points by their $y$ values. Comment on what you see.

<i>Hint: Remember that you have to scale the data prior to fitting the PCA.</i>
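As a rough illustration of the hint, here is a minimal sketch of scaling followed by PCA inside a single pipeline. It uses a small synthetic array (`X_toy`, a hypothetical stand-in and not the cancer data) whose columns live on very different scales. The step names `'scale'` and `'pca'` are arbitrary choices made for this sketch.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## X_toy is a hypothetical stand-in for a feature matrix,
## its five columns are on very different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5)) * np.array([1, 10, 100, 1000, 10000])

## scale first, then project onto the first two principal components
toy_pipe = Pipeline([('scale', StandardScaler()),
                     ('pca', PCA(n_components=2))])

## fit_transform fits both steps and returns the projected data
toy_fit = toy_pipe.fit_transform(X_toy)

print(toy_fit.shape)
```

Without the scaling step the column with the largest variance would dominate the first principal component, which is why the scaler is placed before the PCA.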

##### Sample Solution

In [None]:
## Import what you'll need
from sklearn.preprocessing import 
from sklearn.pipeline import 
from sklearn.decomposition import 

In [None]:
## Make a pipeline for your PCA
pipe = Pipeline()

## fit the pipeline


## Get the PCA transformed data here
fit = 

In [None]:
## This code plots the Second PCA value
## against the First PCA value for you
plt.figure(figsize=(8,8))

plt.scatter(fit[y_train==0, 0],
               fit[y_train==0, 1],
               c = 'b',
               alpha = .6,
               label='Benign')

plt.scatter(fit[y_train==1, 0],
               fit[y_train==1, 1],
               c = 'orange',
               marker = 'v',
               alpha = .6,
               label='Malignant')

plt.legend(fontsize=14)

plt.xlabel("First PCA Value", fontsize=16)
plt.ylabel("Second PCA Value", fontsize=16)

plt.show()

##### 4. Build a Model 

Fill in the cross-validation code provided below to build a $k$-nearest neighbors model using the PCA processed data you generated above. What were the average cross-validation TPR, FPR and precision for this model?
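As a reminder of how the three metrics relate to the confusion matrix, here is a small sketch on made-up labels. The arrays `y_true_toy` and `y_pred_toy` are hypothetical and chosen only so the arithmetic is easy to follow. They are not results from the cancer data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

## hypothetical true labels and predictions, for illustration only
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

## rows are the actual class, columns are the predicted class,
## so for 0/1 labels the layout is [[TN, FP], [FN, TP]]
conf_mat_toy = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = conf_mat_toy.ravel()

## TPR = TP/(TP + FN), FPR = FP/(FP + TN)
tpr_toy = tp / (tp + fn)
fpr_toy = fp / (fp + tn)

## precision = TP/(TP + FP)
prec_toy = precision_score(y_true_toy, y_pred_toy)

print(conf_mat_toy)
print(tpr_toy, fpr_toy, prec_toy)
```

Here three of the four actual positives are caught, and one of the six actual negatives is flagged, so the TPR is $3/4$, the FPR is $1/6$ and the precision is $3/4$.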

In [None]:
## Import what you need here
from sklearn.model_selection import 
from sklearn.metrics import 
from sklearn.neighbors import 

In [None]:
## Make the kfold for you
kfold = StratifiedKFold(5, shuffle=True, random_state=14235)

## Make zero arrays to store the metrics
## over the CV
knn_tprs = 
knn_fprs = 
knn_precs = 


## counter to keep track of the split
i = 0
for train_index, test_index in kfold.split(X_train, y_train.values):
    X_tt = X_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_tt = y_train.iloc[train_index]
    y_ho = y_train.iloc[test_index]
    
    ## Make the knn pipeline here
    ## Use 10 neighbors in the KNN
    knn_pipe = Pipeline()
    
    ## fit the pipeline here
    knn_pipe

    ## Get the prediction from the pipeline
    pred = knn_pipe.predict(X_ho.values)

    ## Compute the precision, tpr and fpr here
    knn_precs[i] = 
    
    conf_mat = 

    knn_tprs[i] = 
    knn_fprs[i] = 

    ## increasing the counter
    i = i + 1

In [None]:
## Here are the AVG CV metrics
print("Mean CV TPR =", np.round(np.mean(knn_tprs),3))
print()
print("Mean CV FPR =", np.round(np.mean(knn_fprs),3))
print()
print("Mean CV Prec =", np.round(np.mean(knn_precs),3))

##### 5. Optimizing explained variance

In our PCA lecture notebooks we discussed how you could choose a number of PCA components using the explained variance ratio. However, when you use PCA as a preprocessing step for a supervised learning algorithm you can perform a cross-validation optimization instead.

Fill in the missing code in the chunks below to tune the fraction of total explained variance used in the PCA step.

What are the explained variance ratios with the best average CV TPR, FPR and precision?
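For reference, `PCA` accepts a float strictly between $0$ and $1$ for `n_components` and then keeps the smallest number of components whose cumulative explained variance ratio reaches that fraction. The sketch below shows this on synthetic data (`X_toy` is a hypothetical stand-in). Setting `svd_solver='full'` explicitly is a cautious choice made for this sketch, not something the problem requires.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

## hypothetical correlated features: 10 columns driven by 3 latent factors
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
X_scaled = StandardScaler().fit_transform(X_toy)

## a float n_components is read as a target fraction of explained variance
pca_frac = PCA(n_components=0.9, svd_solver='full')
pca_frac.fit(X_scaled)

## the number of components actually kept, and the variance they explain
n_kept = pca_frac.n_components_
var_kept = np.sum(pca_frac.explained_variance_ratio_)

print(n_kept, var_kept)
```

This is what lets `frac` be treated as a hyperparameter in the cross-validation loop: each value of `frac` can correspond to a different number of retained components.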

In [None]:
## The Explained Variance Ratios we'll try
fracs = np.arange(.01, 1, .01)

## These will track the performance across split and 
## explained variance ratio
pca_tprs = np.zeros((5, len(fracs)))
pca_fprs = np.zeros((5, len(fracs)))
pca_precs = np.zeros((5, len(fracs)))

## split counter
i = 0
for train_index, test_index in kfold.split(X_train, y_train.values):
    X_tt = X_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_tt = y_train.iloc[train_index]
    y_ho = y_train.iloc[test_index]
    
    ## explained variance ratio counter
    j = 0
    for frac in fracs:
        ## Build the PCA pipeline here
        ## again use 10 for k
        ## and frac for the PCA input
        knn_pipe = Pipeline()
        
        ## Fit the pipe here
        knn_pipe
        
        ## Get the predictions on the holdout set
        pred = 

        ## Record the metrics
        pca_precs[i,j] = 
        
        
        conf_mat = 
        
        pca_tprs[i,j] = 
        pca_fprs[i,j] = 
        j = j + 1

    i = i + 1

In [None]:
## This will print out the avg performance metrics
print("TPR")
print("==============================")
print("The explained variance ratio with the highest avg. cv TPR was",
          fracs[np.argmax(np.mean(pca_tprs, axis=0))])
print("This produced a model with avg. cv. TPR of",np.round(np.max(np.mean(pca_tprs, axis=0)),4))
print()

print("FPR")
print("==============================")
print("The explained variance ratio with the lowest avg. cv FPR was",
          fracs[np.argmin(np.mean(pca_fprs, axis=0))])
print("This produced a model with avg. cv. FPR of",np.round(np.min(np.mean(pca_fprs, axis=0)),4))
print()

print("Precision")
print("==============================")
print("The explained variance ratio with the highest avg. cv Precision was",
          fracs[np.argmax(np.mean(pca_precs, axis=0))])
print("This produced a model with avg. cv. Precision of",np.round(np.max(np.mean(pca_precs, axis=0)),4))

##### 6. Make some diagnostic curves

While you found the explained variance ratio with the best TPR, FPR and precision, we are often interested in the tradeoffs between such metrics. For example, a true positive rate of $1$ could be useless if it comes with a very high false positive rate.

Plot the average CV performance for all three of these metrics against the explained variance ratio in the same figure.

Then thinking in terms of what it translates to for our problem of classifying cancer, select a value for the explained variance ratio to form a final PCA $k$NN model.
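One possible way to lay out such a figure is sketched below. The helper `plot_cv_metrics` is a hypothetical name introduced for this sketch, and the arrays passed to it are random placeholders with the same shape as `pca_tprs`, `pca_fprs` and `pca_precs`. They are not model results.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cv_metrics(fracs, tprs, fprs, precs):
    ## each metric array has shape (number of splits, len(fracs)),
    ## averaging over axis 0 gives one value per explained variance ratio
    fig, ax = plt.subplots(figsize=(10, 6))

    ax.plot(fracs, np.mean(tprs, axis=0), label="Mean CV TPR")
    ax.plot(fracs, np.mean(fprs, axis=0), '--', label="Mean CV FPR")
    ax.plot(fracs, np.mean(precs, axis=0), '-.', label="Mean CV Precision")

    ax.set_xlabel("Explained Variance Ratio", fontsize=16)
    ax.set_ylabel("Metric Value", fontsize=16)
    ax.legend(fontsize=14)

    return fig, ax

## random placeholder arrays, used only to show the call
rng = np.random.default_rng(2)
toy_fracs = np.arange(.1, 1, .1)
toy_tprs = rng.uniform(size=(5, len(toy_fracs)))
toy_fprs = rng.uniform(size=(5, len(toy_fracs)))
toy_precs = rng.uniform(size=(5, len(toy_fracs)))

fig, ax = plot_cv_metrics(toy_fracs, toy_tprs, toy_fprs, toy_precs)
```

In the notebook you would pass `fracs` and the three `pca_` arrays from the previous problem instead of the placeholders, then call `plt.show()`.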

In [None]:
## Plot here






##### 7. Trying Bayes based classifiers

Build LDA, QDA and naive Bayes models on these data by filling in the missing code for the cross-validation below.

Do these outperform your PCA-$k$NN models from above?

<i>Hint: You should scale the data prior to fitting the models.</i>
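To illustrate the hint, here is a minimal sketch of placing each of the three classifiers behind a scaler in its own pipeline. It is fit on synthetic data from `make_classification` (a stand-in, not the cancer data). The dictionary `toy_models` and the step names are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

## synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=200,
                                   n_features=6,
                                   n_informative=4,
                                   n_redundant=0,
                                   random_state=0)

## one scaled pipeline per model
toy_models = {'lda': Pipeline([('scale', StandardScaler()),
                               ('lda', LinearDiscriminantAnalysis())]),
              'qda': Pipeline([('scale', StandardScaler()),
                               ('qda', QuadraticDiscriminantAnalysis())]),
              'nb': Pipeline([('scale', StandardScaler()),
                              ('nb', GaussianNB())])}

## fit each pipeline and store its predictions
toy_preds = {}
for name, model in toy_models.items():
    model.fit(X_toy, y_toy)
    toy_preds[name] = model.predict(X_toy)

print({name: pred.shape for name, pred in toy_preds.items()})
```

Inside the cross-validation loop the same pattern would be fit on `X_tt` and used to predict on `X_ho`, so that the scaler never sees the holdout data.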

In [None]:
## Import what you need here
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

In [None]:
## Makes kfold object
kfold = StratifiedKFold(5, shuffle=True, random_state=14235)

## these keep track for you
da_tprs = np.zeros((5,3))
da_fprs = np.zeros((5,3))
da_precs = np.zeros((5,3))



i = 0
for train_index, test_index in kfold.split(X_train, y_train.values):
    X_tt = X_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_tt = y_train.iloc[train_index]
    y_ho = y_train.iloc[test_index]
    
    
    ## Build the three models
    
    
    
    ## Fit the three models
    
    
    
    ## Get the predictions for the three models


    
    
    ## Record the precisions for all 3 models
    da_precs[i,0] = 
    da_precs[i,1] = 
    da_precs[i,2] = 

    
    ## Record the confusion matrices
    lda_conf_mat = 
    qda_conf_mat = 
    nb_conf_mat = 


    ## Record the TPRs and FPRs here
    da_tprs[i,0] = 
    da_fprs[i,0] = 

    da_tprs[i,1] = 
    da_fprs[i,1] = 
    
    da_tprs[i,2] = 
    da_fprs[i,2] = 



    i = i + 1

In [None]:
## Examine the performance here





##### 8. LDA for supervised dimensionality reduction

While we introduced linear discriminant analysis (LDA) as a classification algorithm, it was originally proposed by Fisher as a supervised dimension reduction technique; see <a href="https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15227/1/138.pdf">https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15227/1/138.pdf</a>. In particular, the initial goal was to project the features, $X$, corresponding to a binary output, $y$, onto a single dimension that best separates the possible classes. This single dimension has come to be known as <i>Fisher's discriminant</i>.

Walk through the code below to perform this supervised dimension reduction technique on these cancer data.

In [None]:
## First we make a validation set for demonstration purposes
X_tt, X_val, y_tt, y_val = train_test_split(X_train.copy(), y_train,
                                               shuffle=True,
                                               random_state=302,
                                               test_size = .2,
                                               stratify = y_train)

In [None]:
## Make a pipeline that scales the data
## and ends with LDA
pipe = 

In [None]:
## Fit the pipeline like you would
## for a classifier


In [None]:
## Now instead of pipe.predict
## call pipe.transform, like a PCA object
## use the training features as input


Here we have projected this $30$-dimensional data onto a $1$-dimensional space that maximizes the separation between classes $0$ and $1$. We can visualize this by plotting a histogram split by class.

In [None]:
plt.figure(figsize=(10,6))

## Place the Fisher discriminant for the two classes of tumor here
plt.hist(, color='blue', label="Benign")

## And here
plt.hist(, color='orange', hatch='/', alpha=.6, label="Malignant")

plt.xlabel("Fisher Discriminant", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.legend(fontsize=14)

plt.xlim([-5,8])

plt.show()

From this we can see there is very little overlap in the values of the Fisher discriminant for the two types of tumor.

We could use the discriminant in order to make classifications, for example by setting a simple cutoff value or as input into a different classification algorithm.
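A simple cutoff rule of this kind might look like the sketch below. It is fit on synthetic data (not the cancer data), and the choice of the midpoint between the two class means as the cutoff is an assumption made for illustration, not a recommended rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

## synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=200,
                                   n_features=6,
                                   n_informative=4,
                                   n_redundant=0,
                                   n_clusters_per_class=1,
                                   random_state=0)

lda_pipe = Pipeline([('scale', StandardScaler()),
                     ('lda', LinearDiscriminantAnalysis())])
lda_pipe.fit(X_toy, y_toy)

## with two classes, transform returns a single column
disc = lda_pipe.transform(X_toy)[:, 0]

## hypothetical cutoff: the midpoint between the two class means
mean_0 = disc[y_toy == 0].mean()
mean_1 = disc[y_toy == 1].mean()
cutoff = (mean_0 + mean_1) / 2

## check which side of the cutoff class 1 falls on
## rather than assuming the sign of the discriminant
if mean_1 > mean_0:
    cutoff_pred = (disc > cutoff).astype(int)
else:
    cutoff_pred = (disc < cutoff).astype(int)

cutoff_acc = np.mean(cutoff_pred == y_toy)
print(cutoff_acc)
```

The accuracy printed here is measured on the same data the LDA was fit on, so it is optimistic for the reason discussed next.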

However, it is important to note that the LDA algorithm maximizes the separation of the two classes among observations of the training set. It is possible that such good separation would not occur using data the algorithm was not trained on.

In this example we can visually inspect this by plotting a histogram of the Fisher discriminant values for the validation set we created. Does the separation seem as pronounced on the validation data?

In [None]:
## The same plot as above but now on the validation set
plt.figure(figsize=(10,6))

## Place the Fisher discriminant for the two classes of tumor here
plt.hist(, color='blue', label="Benign")

## and here
plt.hist(, color='orange', hatch='/', alpha=.6, label="Malignant")

plt.xlabel("Fisher Discriminant", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.legend(fontsize=14)

plt.xlim([-5,8])

plt.show()

I would say that there does appear to be slightly more overlap for the validation data than the training data, but this is still a good amount of separation.

<i>For those interested in how the supervised dimension reduction aspect of LDA works see the `Bayes Based Classifiers` `Practice Problems` notebook.</i>

--------------------------

This notebook was written for the Erd&#337;s Institute C&#337;de Data Science Boot Camp by Matthew Osborne, Ph. D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)