# Classifying Cancer Cases

We'll now move on to using logistic regression on real data. This notebook will be a breakout session exercise. Remember to complete as much as possible in the time available. It is okay if you don't finish it all right now :)

## What You'll Accomplish
- You'll build a logistic regression model on a cancer data set.

## The Data

The data we're considering comes from a 1995 study on breast tumor cells. Each row of the data set presents the features of a digitized image of a fine needle aspirate (FNA) of a breast mass. The features describe characteristics of the cell nuclei present in the image.

Each observation has 30 features and is labeled as benign or malignant. The data was originally presented to demonstrate an algorithm that linearly separates the two classes of tumor.

In this notebook you'll do your best to build a logistic regression classifier using these data. (Note that the algorithm in the paper was not logistic regression, so we likely won't do as well as the original paper.)


In [None]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

In [None]:
# import the data
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

X,y = load_breast_cancer(return_X_y = True)
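
Before modeling, it helps to confirm how the two classes are encoded in `y`. The following is a minimal sketch, not part of the original exercise, that uses the `target_names` attribute of the loaded data set; the variable `class_counts` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# target_names is ordered by label, so entry 0 names class 0
for label, name in enumerate(cancer.target_names):
    print(label, name, int(np.sum(y == label)))

# Number of observations carrying each label
class_counts = np.bincount(y)
```

In scikit-learn's copy of this data set, label 0 is malignant and label 1 is benign. This matters for plot legends and for any metric that needs a "positive" class.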

In [None]:
from sklearn.model_selection import train_test_split

# test train split
X_train,X_test,y_train,y_test = train_test_split(X, y,
                                                 test_size = .25,
                                                 random_state = 614,
                                                 shuffle = True,
                                                 stratify = y)
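
The `stratify = y` argument asks for the class proportions to be roughly preserved in both pieces of the split. A quick way to check this, sketched here with illustrative variable names (`full_frac`, `train_frac`, `test_frac`) that are not in the original notebook:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=614, shuffle=True, stratify=y
)

# Fraction of observations with label 1 in the full data and in each piece
full_frac = y.mean()
train_frac = y_train.mean()
test_frac = y_test.mean()
print(round(full_frac, 3), round(train_frac, 3), round(test_frac, 3))
```

If the three fractions are close to one another, the stratification did its job.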

In [None]:
# Let's view the training data
# Note this may take a bit.
fig, axes = plt.subplots(15, 2, figsize = (12,30))

ax = axes.ravel()

for i in range(30):
    _, bins = np.histogram(X_train[:, i], bins = 50)
    ## In sklearn's encoding, 0 is malignant and 1 is benign
    ax[i].hist(X_train[y_train == 0, i], bins = bins, color = 'red', alpha = .5)
    ax[i].hist(X_train[y_train == 1, i], bins = bins, color = 'blue', alpha = .5)
    ax[i].set_title("feature " + str(i) + ": " + cancer.feature_names[i])
    ax[i].set_yticks(())

## The legend order matches the plotting order above
ax[0].legend(['malignant', 'benign'], loc = 'best')

fig.tight_layout()

### Practice!

#### Problem 1

Let's start with a simple one-predictor model. Looking at the histograms of the 30 input features above, choose one to use as the input for your model and build a logistic regression classifier below.
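
If the histograms are hard to compare by eye, one rough numerical aid is to rank the features by how far apart the two class means sit on the training set. This is only a screening heuristic assumed here for illustration, not a method from the original notebook; the names `separation` and `ranking` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=614, shuffle=True, stratify=y
)

# Gap between the class means, in units of each feature's overall standard deviation
mean_0 = X_train[y_train == 0].mean(axis=0)
mean_1 = X_train[y_train == 1].mean(axis=0)
separation = np.abs(mean_0 - mean_1) / X_train.std(axis=0)

# Feature indices sorted from most to least separated
ranking = np.argsort(separation)[::-1]
for i in ranking[:5]:
    print(i, cancer.feature_names[i], round(float(separation[i]), 2))
```

A large gap suggests, but does not guarantee, that a feature will work well on its own, so it is still worth looking at the corresponding histogram.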


Do your best to find the probability cutoff that seems to have the best generalization accuracy. What technique should you use to figure this out? 
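
One possible shape for this search is stratified k-fold cross-validation over a grid of cutoffs. The sketch below makes several assumptions that are not from the original notebook: it uses column 0 as a placeholder feature, five folds, a grid of 19 cutoffs, and `max_iter=1000` on an otherwise default `LogisticRegression`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=614, shuffle=True, stratify=y
)

feature = 0  # placeholder column; swap in the feature you chose
X_one = X_train[:, [feature]]  # double brackets keep the 2D shape (n_samples, 1)

cutoffs = np.linspace(0.05, 0.95, 19)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=614)
cv_accs = np.zeros((5, len(cutoffs)))
base_model = LogisticRegression(max_iter=1000)

for i, (fit_idx, holdout_idx) in enumerate(kfold.split(X_one, y_train)):
    # Fit a fresh copy of the model on this fold's training portion
    model = clone(base_model).fit(X_one[fit_idx], y_train[fit_idx])
    # Column 1 of predict_proba is P(y = 1), which is P(benign) in this data set
    prob_1 = model.predict_proba(X_one[holdout_idx])[:, 1]
    for j, cutoff in enumerate(cutoffs):
        pred = (prob_1 >= cutoff).astype(int)
        cv_accs[i, j] = accuracy_score(y_train[holdout_idx], pred)

# Average over folds, then pick the cutoff with the highest mean accuracy
mean_accs = cv_accs.mean(axis=0)
best_cutoff = cutoffs[np.argmax(mean_accs)]
print("best cutoff:", round(float(best_cutoff), 2))
print("mean CV accuracy:", round(float(mean_accs.max()), 3))
```

Only the training set is touched here, so the test set stays available for a final, unbiased check.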

In [None]:
## Find the column of X_train that contains
## the variable you're interested in here


In [None]:
## Import logistic regression here


In [None]:
## Make your model object here



## Fit your model object here


In [None]:
## Import the accuracy_score metric here,
## docs, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html


In [None]:
## import cross-validation and clone here, remember 
## we'll need to stratify our data in each cv split


In [None]:
## Perform kfold cross validation here

## make a kfold object


## Make an array to store your cv accuracies in


## Perform the cross-validation loop for various probability
## cutoffs here



In [None]:
## Plot the average cross validation accuracy here
plt.figure(figsize=(10,8))






plt.show()

In [None]:
## Which cutoff provided the highest mean CV accuracy?




#### Problem 2

Is accuracy the best performance measure here? Think about the end goal of the algorithm: diagnosing malignant tumors. Which measure do you think is most appropriate?

Once you decide on a new performance measure, look back at your model building process and make whatever changes you think are necessary to build the model that will generalize best in terms of that measure. Still use only one predictor.
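
One detail to check before swapping metrics: scikit-learn's `precision_score`, `recall_score`, and `f1_score` treat label 1 as the positive class by default, while in this data set label 0 is malignant. The sketch below uses made-up toy labels, not real results, to show how the `pos_label` argument changes which class is scored.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels using this data set's encoding: 0 = malignant, 1 = benign
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

# The default pos_label=1 scores the benign class
recall_benign = recall_score(y_true, y_pred)

# pos_label=0 scores the malignant class instead
recall_malignant = recall_score(y_true, y_pred, pos_label=0)
precision_malignant = precision_score(y_true, y_pred, pos_label=0)

print(recall_benign, recall_malignant, precision_malignant)
```

In this toy example the benign recall is 5/6 while the malignant recall is only 1/2, so the default setting can hide exactly the errors a diagnostic model is meant to avoid.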

In [None]:
## Import additional sklearn metrics here

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [None]:
## Rerun the cross validation using whatever metric
## you settle on



In [None]:
## Sample Solution

## plot the mean cv value of that metric as 
## a function of the probability cutoff here
plt.figure(figsize=(10,8))




plt.show()

#### Problem 3

We're only using one predictor as input in this model. At the beginning we said the published model used all 30 features of the data. Surely our model would be better if we incorporated more predictors into the model.

What is wrong with this line of thinking?

#### Problem 4

Using your final model and performance measure, evaluate your algorithm on the test set to find its test performance.
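
The final step might look like the following sketch. The feature index, the cutoff, and the choice of malignant-class recall as the measure are all placeholders assumed for illustration; substitute the ones your cross-validation selected.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=614, shuffle=True, stratify=y
)

feature = 0   # placeholder column; use the feature you chose
cutoff = 0.5  # placeholder; use the cutoff selected by cross-validation

# Refit on the full training set, then score once on the test set
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_train[:, [feature]], y_train)

# Column 1 of predict_proba is P(y = 1), which is P(benign) in this data set
prob_1 = final_model.predict_proba(X_test[:, [feature]])[:, 1]
test_pred = (prob_1 >= cutoff).astype(int)

# pos_label=0 because label 0 is the malignant class
test_recall = recall_score(y_test, test_pred, pos_label=0)
print("test recall for the malignant class:", round(float(test_recall), 3))
```

The test set should be used only once, after all modeling choices are fixed, so that the reported number is an honest estimate of generalization.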

In [None]:
## Make your model and fit it here



In [None]:
## Find the test error here


#### Problem 5

Do you think that logistic regression, using only the data in hand without any additional machine learning techniques, is the best model for this data set? Why or why not?

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2021.

Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).