## Breast Cancer Cell Detection

You work for the data team at a local research hospital. You've been tasked with developing a tool to help doctors diagnose breast cancer. You've been given data about biopsied breast cells, each labeled as either benign (not harmful) or malignant (cancerous).

### Data

The dataset consists of 699 cells for which you have the following features:

- Sample code number: id number
- Clump Thickness: 1 - 10
- Uniformity of Cell Size: 1 - 10
- Uniformity of Cell Shape: 1 - 10
- Marginal Adhesion: 1 - 10
- Single Epithelial Cell Size: 1 - 10
- Bare Nuclei: 1 - 10
- Bland Chromatin: 1 - 10
- Normal Nucleoli: 1 - 10
- Mitoses: 1 - 10
- Class: 2 for benign, 4 for malignant

The dataset is also available here: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
headers = ['sample', 'clump_thickness', 'unif_cell_size', 'unif_cell_shape', 'marginal_adhesion', \
           'single_epi_size', 'bare_nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer_class']
raw_data = pd.read_csv('breast-cancer-wisconsin.csv', header=None, names=headers)
data = raw_data.copy()
data.head()

In [None]:
# Check the different types of values we have for each column
for i in headers:
    print(i, "\n", data[i].value_counts(), "\n")

In [None]:
# Everything looks good except for the bare_nuclei column, which uses '?' for missing values. Drop those rows.
print(len(data))
data = data[data.bare_nuclei != "?"].copy()  # .copy() avoids a SettingWithCopyWarning on the next line
data['bare_nuclei'] = data['bare_nuclei'].astype(int)
print(len(data))
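
An alternative to filtering on the literal `'?'` string is to coerce the column to numeric and drop whatever fails to parse, which also catches any other non-numeric entries. The sketch below is only an illustration: `toy` and `clean` are hypothetical names, and the values are made up rather than taken from the hospital data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw data; '?' marks a missing value.
toy = pd.DataFrame({
    "bare_nuclei": ["1", "10", "?", "3"],
    "cancer_class": [2, 4, 2, 4],
})

# Coerce to numeric: anything that cannot be parsed (here '?') becomes NaN.
toy["bare_nuclei"] = pd.to_numeric(toy["bare_nuclei"], errors="coerce")

# Drop the rows that failed to parse, then store the column as integers.
clean = toy.dropna(subset=["bare_nuclei"]).copy()
clean["bare_nuclei"] = clean["bare_nuclei"].astype(int)

print(len(toy), "->", len(clean))
```

Whether dropping rows is preferable to imputing the missing scores depends on how many rows are affected, so treat this as one option rather than the required approach.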

In [None]:
# Confirm that bare_nuclei now contains only numeric values
print(data['bare_nuclei'].value_counts())

In [None]:
# Take a look at the distribution of each variable.
data[headers].hist(bins=10, figsize=(20,15))

# Classifiers

In [None]:
# Split data into features and outcome variables
# Recode benign (2) as 0 and malignant (4) as 1; replace() returns a new Series, so data is left untouched
outcome = data['cancer_class'].replace({2: 0, 4: 1})
features = data.drop(['cancer_class', 'sample'], axis=1)
print(list(features))

In [None]:
# Quick feature engineering. 
features['ovr_unif'] = data['unif_cell_size'] * data['unif_cell_shape']

In [None]:
# Some packages
import matplotlib
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Some functions
def bin_metrics(x, y):
    '''Prints five common metrics for evaluating binary classification predictions.

    x: true labels; y: predicted labels.
    '''
    print('Accuracy:', round(metrics.accuracy_score(x, y), 4))
    print('Precision:', round(metrics.precision_score(x, y), 4))
    print('Recall:', round(metrics.recall_score(x, y), 4))
    print('ROC_AUC:', round(metrics.roc_auc_score(x, y), 4))
    print('F1:', round(metrics.f1_score(x, y), 4))

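def plot_cm(x, y):
    '''Plots the confusion matrix as a heatmap. x: true labels; y: predicted labels.'''
    cm = confusion_matrix(x, y)
    df_cm = pd.DataFrame(cm, columns=np.unique(x), index=np.unique(x))
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    # annot_kws sets the font size of the cell annotations
    sns.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 20}, fmt='g')
    # Set the y-limits explicitly so both rows are shown in full (class 0 ends up on the bottom row)
    plt.ylim([0, 2])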

## Logistic Classifier
- Advantages: less prone to over-fitting (except with high-dimensional data); gives size and direction of predictors; easy to implement, interpret, and train.
- Disadvantages: assumes a linear relationship between the features and the log-odds of the outcome.

In [None]:
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(features,  outcome, test_size=0.20, random_state=649)
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
bin_metrics(y_test, y_pred)
plot_cm(y_test, y_pred)
print(list(features), lr.coef_)

## What features of a cell are the largest drivers of malignancy?
1. Clump thickness
1. Bare nuclei
1. Uniformity of cell shape
1. Uniformity of cell size
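
Raw logistic regression coefficients depend on each feature's scale, and the engineered `ovr_unif` column ranges from 1 to 100 while the other features range from 1 to 10. One way to make coefficient magnitudes more comparable is to standardize the features before fitting. The sketch below uses synthetic data from `make_classification` and placeholder feature names (both are assumptions for illustration, not the biopsy data).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 5 features with placeholder names.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

# Standardize first so coefficient magnitudes are comparable across features.
model = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))
model.fit(X, y)

# Rank features by absolute coefficient size, keeping the sign for direction.
coefs = pd.Series(model.named_steps["logisticregression"].coef_[0], index=names)
ranking = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranking)
```

With the real data, `features` and `outcome` would take the place of `X` and `y`. The ranking could differ from the one based on unscaled coefficients.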

## What features drive your false positive rate? 
- Marginal adhesion and chromatin appear to be the main drivers of false positives

In [None]:
FP = X_test[(y_test == 0) & (y_pred == 1)] # Isolate the false positive cases
TP = X_test[(y_test == 1) & (y_pred == 1)] # Isolate the true positive cases
FN = X_test[(y_test == 1) & (y_pred == 0)] # Isolate the false negative cases

In [None]:
# Compare distribution of features across false positives and true positives
print("TRUE POSITIVE DISTRIBUTIONS")
TP_hds = list(TP)
TP[TP_hds].hist(bins=10, figsize=(15, 10))

In [None]:
print("\nFALSE POSITIVE DISTRIBUTIONS")
FP_hds = list(FP)
FP[FP_hds].hist(bins=10, figsize=(15, 10))
# Looks like chromatin, nucleoli, and cell size might all differ between TP and FP.

In [None]:
# Another way to look at the distribution comparison
# Each panel overlays the TP and FP histograms for one feature: (column name, axis label)
panels = [('bare_nuclei', 'Bare Nuclei'), ('chromatin', 'Chromatin'), ('clump_thickness', 'Clump Thickness'),
          ('marginal_adhesion', 'Marginal Adhesion'), ('mitoses', 'Mitoses'), ('nucleoli', 'Nucleoli'),
          ('single_epi_size', 'Single Epithelial Size'), ('unif_cell_shape', 'Uniform Cell Shape'),
          ('unif_cell_size', 'Uniform Cell Size')]
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
for ax, (col, label) in zip(axes.flat, panels):
    ax.hist(TP[col], label='TP')
    ax.hist(FP[col], label='FP', alpha=0.5)
    ax.set_xlabel(label)
axes.flat[0].legend()  # One legend is enough; the colors are the same in every panel
plt.show()

## What features drive your false negative rate?
- With only one false negative, it's hard to say.

## How would a physician use your product?
- The physician would enter the feature scores for a biopsied cell, and the tool would return the estimated probability that the cell is malignant. Framing effects matter (reporting the probability of malignancy versus the probability of being benign), so the framing could itself be a deployment decision.
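
As a rough illustration of returning a probability rather than a hard label, scikit-learn's `predict_proba` gives one probability per class for each row. The sketch below fits on synthetic data (an assumption; `clf` and `new_cell` are hypothetical names) and reports the probability of class 1, which corresponds to malignant under the recoding used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (assumption, not the biopsy data).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(solver="lbfgs").fit(X, y)

# One new "cell": predict_proba returns [P(class 0), P(class 1)] for each row.
new_cell = X[:1]
p_benign, p_malignant = clf.predict_proba(new_cell)[0]
print(f"Estimated probability of malignancy: {p_malignant:.1%}")
```

In the notebook itself, the fitted `lr` model and a one-row frame with the same columns as `features` would play the roles of `clf` and `new_cell`.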

## How would you go about determining the most cost-effective method of detecting malignancy?
- Probably by weighing how much predictive information each feature adds against the cost of collecting it.
- For example, single epithelial cell size does not seem to be predictive at all, whereas clump thickness seems to be really important data to collect.
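
One possible way to estimate how much information each feature adds is to refit the model with that feature left out and see how much the cross-validated ROC AUC drops. This is a minimal sketch on synthetic data with placeholder column names (both assumptions); the helper `cv_auc` is hypothetical, and collection costs would still have to come from the hospital.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); column names are placeholders.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])

def cv_auc(frame):
    """Mean 5-fold cross-validated ROC AUC for a logistic regression."""
    model = LogisticRegression(solver="lbfgs")
    return cross_val_score(model, frame, y, cv=5, scoring="roc_auc").mean()

baseline = cv_auc(X)
# AUC lost when each feature is left out; a loss near zero suggests the
# feature adds little beyond what the others already provide.
auc_drop = {col: baseline - cv_auc(X.drop(columns=col)) for col in X.columns}
print(pd.Series(auc_drop).sort_values(ascending=False))
```

Dividing each feature's AUC drop by its collection cost would give a crude value-per-cost ranking, though correlated features can mask each other's contribution in a leave-one-out comparison like this.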