# Breast Cancer Detection

## Preprocessing

* Number of instances: 569

* Number of attributes: 32

* Attribute information:
   
   1) ID number

   2) Diagnosis (M = malignant, B = benign)
   3-32) Ten real-valued features are computed for each cell nucleus:

  a) radius (mean of distances from center to points on the perimeter)

  b) texture (standard deviation of gray-scale values)

  c) perimeter

  d) area

  e) smoothness (local variation in radius lengths)

  f) compactness (perimeter^2 / area - 1.0)

  g) concavity (severity of concave portions of the contour)

  h) concave points (number of concave portions of the contour)

  i) symmetry

  j) fractal dimension ("coastline approximation" - 1)


* Missing attribute values: None

In [None]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Import and read the breast cancer data from data.csv
df = pd.read_csv("../Resources/data.csv")
df.head()

The key challenge in breast cancer detection is classifying tumors as malignant (cancerous) or benign (non-cancerous). We ask you to complete the analysis of classifying these tumors using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) Dataset.

In [None]:
# Find null values
for column in df.columns:
    print(f"Column {column} has {df[column].isnull().sum()} null values")

In [None]:
# Find duplicate entries
print(f"Duplicate entries: {df.duplicated().sum()}")

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
# Encode the diagnosis as a binary target: 1 for malignant (M), 0 for benign (B)
def diagnosis_to_replace(diagnosis):
    if diagnosis == "M":
        return 1
    else:
        return 0
    

df["diagnosis"] = df["diagnosis"].apply(diagnosis_to_replace)
df.head()

In [None]:
# Split the preprocessed data into feature and target arrays; the id column is dropped because it carries no predictive information
X = df.drop(["diagnosis", "id"], axis='columns')
y = df["diagnosis"]
X.shape

# Explore the dataset

In [None]:
# A quick look at the features to see if the data is distributed normally or not
X.hist(figsize=(15,15))
plt.show()

There is a significant number of features that have a strong right skew, so for some models the data will need to be transformed.
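
As a rough illustration of that point, the sketch below measures skewness before and after a `log1p` transform. It uses synthetic lognormal data rather than the notebook's columns, and `sample_skewness` is a hypothetical helper written for this example, so treat it as one possible approach rather than the transformation used later.

```python
import numpy as np

def sample_skewness(values):
    """Return the (biased) Fisher-Pearson skewness of a 1-D array."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed data standing in for one skewed feature (assumption)
rng = np.random.default_rng(74)
skewed_feature = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# log1p is safe here because every value is non-negative
transformed_feature = np.log1p(skewed_feature)

skew_before = sample_skewness(skewed_feature)
skew_after = sample_skewness(transformed_feature)
print(f"Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness close to zero suggests a roughly symmetric distribution, so the same check could be run per column to decide which features to transform.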

## Now we choose to look at our data from an outcome point of view to see if we can get any insights.

In [None]:
# Select the benign rows and drop the non-feature columns in one step (avoids modifying a slice of df in place)
benign = df.loc[df['diagnosis']==0].drop(columns=['id','diagnosis'])

In [None]:
benign

In [None]:
# Select the malignant rows and drop the non-feature columns in one step (avoids modifying a slice of df in place)
malign = df.loc[df['diagnosis']==1].drop(columns=['id','diagnosis'])

In [None]:
malign

In [None]:
# have a look at how the malign data looks compared to the benign in a few categories
plt.hist(benign['radius_mean'], alpha=.5, label='B')
plt.hist(malign['radius_mean'], alpha=.5, label='M')
plt.legend(loc='upper right')
plt.show()

In [None]:
plt.hist(benign['texture_mean'], alpha=.5, label='B')
plt.hist(malign['texture_mean'], alpha=.5, label='M')
plt.legend(loc='upper right')
plt.show()

In [None]:
plt.hist(benign['perimeter_mean'], alpha=.5, label='B')
plt.hist(malign['perimeter_mean'], alpha=.5, label='M')
plt.legend(loc='upper right')
plt.show()

In [None]:
plt.hist(benign['smoothness_mean'], alpha=.5, label='B')
plt.hist(malign['smoothness_mean'], alpha=.5, label='M')
plt.legend(loc='upper right')
plt.show()

In [None]:
# We can come back to this when we want to tweak the features for improvements

In [None]:
# check for any negative values in df
(df < 0).values.any()

### Since the label classes are not balanced, the model will have a slight bias towards predicting benign results. This can be corrected by randomly removing some of the benign rows so that both classes have the same count, which should help reduce the number of false negatives. The drawback is that the more rows we remove, the less data the model has to train on, which may hurt performance.

## Note:  Every time the next cell is run we will get different performance based on the randomness of the data that gets removed

In [None]:
# We want to create smaller datasets with a different bias

# balanced classes for no training bias
balanced_df = df.drop(   df.loc[df['diagnosis'] == 0].sample(n=(benign.shape[0] - malign.shape[0])).index ).reset_index(drop=True)

# slight bias towards malign
bias = 50
nb_df = df.drop(   df.loc[df['diagnosis'] == 0].sample(n=(benign.shape[0] - malign.shape[0] + bias)).index ).reset_index(drop=True)
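
If repeatable subsets are preferred, the random removal can be seeded. The sketch below is an assumption-laden illustration on a toy frame: `undersample_majority` is a hypothetical helper, and the seed value is arbitrary. It relies only on the `random_state` argument of `DataFrame.sample`.

```python
import pandas as pd

def undersample_majority(frame, label_col, majority_label, n_remove, seed):
    """Drop n_remove randomly chosen majority-class rows, reproducibly."""
    majority_rows = frame.loc[frame[label_col] == majority_label]
    to_drop = majority_rows.sample(n=n_remove, random_state=seed).index
    return frame.drop(to_drop).reset_index(drop=True)

# Toy frame: 6 benign (0) rows and 3 malignant (1) rows
toy_df = pd.DataFrame({
    "diagnosis": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "radius_mean": [11.0, 12.1, 10.5, 13.2, 12.8, 11.7, 18.3, 20.1, 17.5],
})

# Number of surplus benign rows that must go to balance the classes
counts = toy_df["diagnosis"].value_counts()
n_extra = int(counts[0] - counts[1])

# The same seed yields the same subset on every run
balanced_a = undersample_majority(toy_df, "diagnosis", 0, n_extra, seed=74)
balanced_b = undersample_majority(toy_df, "diagnosis", 0, n_extra, seed=74)
print(balanced_a["diagnosis"].value_counts().to_dict())
```

Fixing the seed trades away the run-to-run variation, so it is most useful when comparing models on identical subsets.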

In [None]:
# We now create features and labels based on these datasets with different bias
b_X = balanced_df.drop(["diagnosis", "id"], axis='columns')
b_y = balanced_df['diagnosis']

nb_X = nb_df.drop(["diagnosis", "id"], axis='columns')
nb_y = nb_df['diagnosis']

In [None]:
# split the different datasets: normal (positive bias), balanced and negative bias (after some exploration we found the results were best with a 31% test size)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=74, stratify=y, test_size=0.31)

b_X_train, b_X_test, b_y_train, b_y_test = train_test_split(b_X, b_y, random_state=74, stratify=b_y, test_size=0.31)

nb_X_train, nb_X_test, nb_y_train, nb_y_test = train_test_split(nb_X, nb_y, random_state=74, stratify=nb_y, test_size=0.31)

In [None]:
# Create MinMaxScaler instances since all values are positive to try for better results
scaler = MinMaxScaler()

# Fit a separate MinMaxScaler per dataset; refitting one shared instance would overwrite the earlier fits
X_scaler = scaler.fit(X_train)
b_X_scaler = MinMaxScaler().fit(b_X_train)
nb_X_scaler = MinMaxScaler().fit(nb_X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

b_X_train_scaled = b_X_scaler.transform(b_X_train)
b_X_test_scaled = b_X_scaler.transform(b_X_test)

nb_X_train_scaled = nb_X_scaler.transform(nb_X_train)
nb_X_test_scaled = nb_X_scaler.transform(nb_X_test)

#### In all cases it should be noted that we got different results each time we ran the code, even with the same parameters and seeds. Without exhaustive testing to find the mean and variance of each model's performance, it is hard to prove which parameters gave the best fit. For that reason we will treat results that are within about 2% of each other as roughly equivalent.
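
One way to estimate that mean and variance is repeated stratified cross-validation. The sketch below is not part of the original analysis: it uses scikit-learn's bundled copy of the dataset (which encodes malignant as 0, the opposite of this notebook) and an arbitrary choice of 5 folds repeated 3 times.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# scikit-learn's bundled copy of the dataset (note: it encodes malignant as 0)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so each fold is scaled on its own training part
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))

# 5 folds repeated 3 times gives 15 accuracy estimates
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=74)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Accuracy: {scores.mean():.4f} +/- {scores.std():.4f} over {len(scores)} runs")
```

If two configurations differ by less than the reported spread, the 2% rule of thumb above is a reasonable way to call them equivalent.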

## Compile, Train and Evaluate the Model

# Standard Machine Learning Models

## Logistic Regression Classifier

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state = 1, max_iter=10000)
lr.fit(X_train_scaled, y_train)

b_lr = LogisticRegression(random_state = 1, max_iter=10000)
b_lr.fit(b_X_train_scaled, b_y_train)

nb_lr = LogisticRegression(random_state = 1, max_iter=10000)
nb_lr.fit(nb_X_train_scaled, nb_y_train)


# Score each model on its own dataset
print(f"Normal Training Data Score: {lr.score(X_train_scaled, y_train)}")
print(f"Normal Testing Data Score: {lr.score(X_test_scaled, y_test)}\n")

print(f"Balanced Training Data Score: {b_lr.score(b_X_train_scaled, b_y_train)}")
print(f"Balanced Testing Data Score: {b_lr.score(b_X_test_scaled, b_y_test)}\n")

print(f"Bias Training Data Score: {nb_lr.score(nb_X_train_scaled, nb_y_train)}")
print(f"Bias Testing Data Score: {nb_lr.score(nb_X_test_scaled, nb_y_test)}")

What we are really interested in is the number of false negatives, as that is the worst way to fail. We can look at the confusion matrix and see what percentage of actual malignant cases are predicted as benign. One thing we discovered is that the smaller dataframes are more susceptible to variation.
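
To make the quantity concrete, here is a small worked sketch with made-up labels. `false_negative_rate` is a hypothetical helper, and the encoding (1 = malignant) follows the transformation used earlier in this notebook.

```python
import numpy as np

def false_negative_rate(y_true, y_pred):
    """Fraction of actual positives (1 = malignant) predicted as negative."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return fn / (fn + tp)

# Made-up labels: 4 malignant cases, one of which is missed
example_true = [1, 1, 1, 1, 0, 0, 0, 0]
example_pred = [1, 1, 1, 0, 0, 0, 1, 0]

example_fnr = false_negative_rate(example_true, example_pred)
print(f"False negative rate: {example_fnr:.2f}")  # 1 miss out of 4 -> 0.25
```

The false positive in the example does not affect the result, because the false negative rate only looks at the actual malignant cases.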

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# Predict with each model on the test split that matches its own training data and scaler
lr_y_pred = lr.predict(X_test_scaled)
b_lr_y_pred = b_lr.predict(b_X_test_scaled)
nb_lr_y_pred = nb_lr.predict(nb_X_test_scaled)

In [None]:
lr_cm = confusion_matrix(y_test, lr_y_pred)
lr_cm

In [None]:
# Since the method is the same we will just pick the normal dataset to see how the results are biased
tn, fp, fn, tp = lr_cm.ravel()

In [None]:
# Now we can see the important false negative rate
fnr = fn/(tp+fn)
print(f'False negative rate: {fnr:.4f}')

For logistic regression we found the results to be very similar for each dataset.

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=74, n_estimators=200).fit(X_train_scaled, y_train)
b_rfc = RandomForestClassifier(random_state=74, n_estimators=200).fit(b_X_train_scaled, b_y_train)
nb_rfc = RandomForestClassifier(random_state=74, n_estimators=200).fit(nb_X_train_scaled, nb_y_train)

print(f'Training Score: {rfc.score(X_train_scaled, y_train)}')
print(f'Testing Score: {rfc.score(X_test_scaled, y_test)}\n')

print(f'Balanced Training Score: {b_rfc.score(b_X_train_scaled, b_y_train)}')
print(f'Balanced Testing Score: {b_rfc.score(b_X_test_scaled, b_y_test)}\n')

print(f'Bias Training Score: {nb_rfc.score(nb_X_train_scaled, nb_y_train)}')
print(f'Bias Testing Score: {nb_rfc.score(nb_X_test_scaled, nb_y_test)}\n')

In [None]:
y_pred = rfc.predict(X_test_scaled)
b_y_pred = b_rfc.predict(b_X_test_scaled)
nb_y_pred = nb_rfc.predict(nb_X_test_scaled)

In [None]:
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

b_cm = confusion_matrix(b_y_test, b_y_pred)
btn, bfp, bfn, btp = confusion_matrix(b_y_test, b_y_pred).ravel()

nb_cm = confusion_matrix(nb_y_test, nb_y_pred)
nbtn, nbfp, nbfn, nbtp = confusion_matrix(nb_y_test, nb_y_pred).ravel()

#### This time we will look at all the biases between all the datasets

In [None]:
# Sensitivity, hit rate, recall, or true positive rate
tpr = tp/(tp+fn)
btpr = btp/(btp+bfn)
nbtpr = nbtp/(nbtp+nbfn)

# Specificity or true negative rate
tnr = tn/(tn+fp) 
btnr = btn/(btn+bfp) 
nbtnr = nbtn/(nbtn+nbfp)

# Precision or positive predictive value
ppv = tp/(tp+fp)
bppv = btp/(btp+bfp)
nbppv = nbtp/(nbtp+nbfp)

# Negative predictive value
npv = tn/(tn+fn)
bnpv = btn/(btn+bfn)
nbnpv = nbtn/(nbtn+nbfn)

# Fall out or false positive rate
fpr = fp/(fp+tn)
bfpr = bfp/(bfp+btn)
nbfpr = nbfp/(nbfp+nbtn)

# False negative rate
fnr = fn/(tp+fn)
bfnr = bfn/(btp+bfn)
nbfnr = nbfn/(nbtp+nbfn)

# False discovery rate
fdr = fp/(tp+fp)
bfdr = bfp/(btp+bfp)
nbfdr = nbfp/(nbtp+nbfp)

# Overall accuracy
acc = (tp+tn)/(tp+fp+fn+tn)
bacc = (btp+btn)/(btp+bfp+bfn+btn)
nbacc = (nbtp+nbtn)/(nbtp+nbfp+nbfn+nbtn)


In [None]:
print(f'True positive rates: {tpr:.4f} b: {btpr:.4f} nb: {nbtpr:.4f}')
print(f'True negative rates: {tnr:.4f} b: {btnr:.4f} nb: {nbtnr:.4f}')
print(f'Positive predictive values: {ppv:.4f} b: {bppv:.4f} nb: {nbppv:.4f}')
print(f'Negative predictive values: {npv:.4f} b: {bnpv:.4f} nb: {nbnpv:.4f}')
print(f'False positive rates: {fpr:.4f} b: {bfpr:.4f} nb: {nbfpr:.4f}')
print(f'False negative rates: {fnr:.4f} b: {bfnr:.4f} nb: {nbfnr:.4f}')
print(f'False discovery rates: {fdr:.4f} b: {bfdr:.4f} nb: {nbfdr:.4f}')
print(f'Overall Accuracies: {acc:.4f} b: {bacc:.4f} nb: {nbacc:.4f}')

## K Nearest Neighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Loop through different k values to find which has the highest accuracy
train_scores = []
btrain_scores = []
nbtrain_scores = []
test_scores = []
btest_scores = []
nbtest_scores = []


for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    bknn = KNeighborsClassifier(n_neighbors=k)
    nbknn = KNeighborsClassifier(n_neighbors=k)

    knn.fit(X_train_scaled, y_train)
    bknn.fit(b_X_train_scaled, b_y_train)
    nbknn.fit(nb_X_train_scaled, nb_y_train)

    train_score = knn.score(X_train_scaled, y_train)
    btrain_score = bknn.score(b_X_train_scaled, b_y_train)
    nbtrain_score = nbknn.score(nb_X_train_scaled, nb_y_train)

    test_score = knn.score(X_test_scaled, y_test)
    btest_score = bknn.score(b_X_test_scaled, b_y_test)
    nbtest_score = nbknn.score(nb_X_test_scaled, nb_y_test)
    
    train_scores.append(train_score)
    btrain_scores.append(btrain_score)
    nbtrain_scores.append(nbtrain_score)

    test_scores.append(test_score)
    btest_scores.append(btest_score)
    nbtest_scores.append(nbtest_score)


    print(f"k: {k}, Train/Test Score: {train_score:.4f}/{test_score:.4f}")
    print(f"k: {k}, bTrain/bTest Score: {btrain_score:.4f}/{btest_score:.4f}")
    print(f"k: {k}, nbTrain/nbTest Score: {nbtrain_score:.4f}/{nbtest_score:.4f}\n")

In [None]:
# Looks like k = 3 is best with the normal (larger dataset) 
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
print('k=3 Test Acc: %.4f' % knn.score(X_test_scaled, y_test))

In [None]:
y_pred_knn = knn.predict(X_test_scaled)
cm_knn = confusion_matrix(y_test, y_pred_knn)
cm_knn

In [None]:
ktn, kfp, kfn, ktp = confusion_matrix(y_test, y_pred_knn).ravel()

kfnr = kfn/(ktp+kfn)

print(f'False negative rate: {kfnr:.4f}')

This shows that knn performed as well as the balanced random forest in both accuracy and false negative rate.

## C-Support Vector Classification

In [None]:
from sklearn.svm import SVC
lsvc = SVC(kernel='linear')
psvc2 = SVC(kernel='poly', degree=2)
psvc3 = SVC(kernel='poly', degree=3)
psvc8 = SVC(kernel='poly', degree=8)
psvc9 = SVC(kernel='poly', degree=9)
gsvc = SVC(kernel='rbf')
ssvc = SVC(kernel='sigmoid')
lsvc.fit(X_train, y_train)
psvc2.fit(X_train, y_train)
psvc3.fit(X_train, y_train)
psvc8.fit(X_train, y_train)
psvc9.fit(X_train, y_train)
gsvc.fit(X_train, y_train)
ssvc.fit(X_train, y_train)

In [None]:
y_pred_l = lsvc.predict(X_test)
y_pred_p2 = psvc2.predict(X_test)
y_pred_p3 = psvc3.predict(X_test)
y_pred_p8 = psvc8.predict(X_test)
y_pred_p9 = psvc9.predict(X_test)
y_pred_g = gsvc.predict(X_test)
y_pred_s = ssvc.predict(X_test)

In [None]:
# for a quicker summary (though without the false negative rate) we can use a classification report.
# This can save time in deciding if a model is worth further investigation.

from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_pred_l))
print(classification_report(y_test, y_pred_p2))
print(classification_report(y_test, y_pred_p3))
print(classification_report(y_test, y_pred_p8))
print(classification_report(y_test, y_pred_p9))
print(classification_report(y_test, y_pred_g))
print(classification_report(y_test, y_pred_s))

# Dimension Reduction and Unsupervised Model

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [None]:
# First we reduce the dimensionality (we will first try cutting it in half)
pca = PCA(n_components=15)

In [None]:
# Fit a fresh scaler on the full feature set so the scalers fitted on the training splits are left untouched
X_scaled = MinMaxScaler().fit_transform(X)
pca_X = pca.fit_transform(X_scaled)

In [None]:
# t-SNE can be sensitive to perplexity depending on data density, so a few values are worth trying; we only want to see 2 dimensions
# The iteration count is left at its default of 1000 because the name of that argument differs between scikit-learn versions
tsne = TSNE(n_components=2, perplexity=15, learning_rate=150)

In [None]:
pca_X.shape

In [None]:
tsne_features = tsne.fit_transform(pca_X)

In [None]:
plt.scatter(tsne_features[:,0],tsne_features[:,1], c=y)
plt.show()

In [None]:
# We can now see how well kmeans can tell the data apart.  Since it is a binary test we can only have 2 clusters though.
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=74)
km.fit(pca_X)
cluster_pred = km.predict(pca_X)

In [None]:
plt.scatter(tsne_features[:,0], tsne_features[:,1], c=cluster_pred)
plt.show()

In [None]:
pca.explained_variance_ratio_

In [None]:
# Total the explained variance without shadowing the built-in sum()
total_explained_variance = pca.explained_variance_ratio_.sum()

# here we can see that the 15 components account for almost 99% of the variance
print(total_explained_variance)
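
Instead of fixing the number of components up front, PCA can be asked for a target share of variance. The sketch below is an illustration only: it uses scikit-learn's bundled copy of the dataset, and the 0.99 threshold is an assumption chosen to mirror the figure above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Bundled copy of the dataset: 569 rows, 30 features
X_demo, _ = load_breast_cancer(return_X_y=True)
X_demo_scaled = MinMaxScaler().fit_transform(X_demo)

# A float in (0, 1) asks PCA to keep enough components to reach that variance share
pca_99 = PCA(n_components=0.99, svd_solver="full")
X_demo_reduced = pca_99.fit_transform(X_demo_scaled)

kept = pca_99.n_components_
explained = pca_99.explained_variance_ratio_.sum()
print(f"{kept} components explain {explained:.4f} of the variance")
```

The number of components kept this way depends on the scaling applied beforehand, so it may not match the 15 chosen above.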

In [None]:
print(classification_report(y, cluster_pred))
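
One caveat when reading this report: KMeans numbers its clusters arbitrarily, so cluster 0 is not guaranteed to correspond to the benign label. A minimal sketch of one way to align the two before scoring follows; `align_binary_clusters` is a hypothetical helper and the arrays are made up.

```python
import numpy as np

def align_binary_clusters(y_true, cluster_ids):
    """Flip 0/1 cluster ids when they agree with the labels less than half the time."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    agreement = np.mean(y_true == cluster_ids)
    return 1 - cluster_ids if agreement < 0.5 else cluster_ids

# Made-up example where the clusters are mostly right but numbered the other way round
labels = np.array([0, 0, 0, 1, 1, 1])
clusters = np.array([1, 1, 1, 0, 0, 1])

aligned = align_binary_clusters(labels, clusters)
print(aligned)
```

If the report shows very low precision and recall for both classes, flipped cluster ids are a likely cause.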

# Neural Net approach

#### Use Keras-Tuner to find best parameters to optimize the model

In [None]:
# here we can create a function to return our false negatives for use as a metric (https://scorrea92.medium.com/useful-metrics-functions-for-keras-and-tensorflow-b82af9b22c9e)
import tensorflow.keras.backend as K

def fn(y_true, y_pred):
    return K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0)
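
Because the metric above multiplies by the raw sigmoid output, it behaves as a "soft" count rather than a count of misclassified cases or a rate. The NumPy sketch below, on made-up values, shows the difference between the soft count, the thresholded count and the false negative rate; the 0.5 threshold is an assumption.

```python
import numpy as np

# Made-up labels and sigmoid outputs; 1 = malignant
y_true_demo = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
y_prob_demo = np.array([0.9, 0.6, 0.2, 0.1, 0.7])

# Same formula as the metric above: sums (1 - probability) over the positives
soft_fn = np.sum(y_true_demo * (1.0 - y_prob_demo))

# Thresholding at 0.5 first gives a whole-number count of missed positives
y_hard_demo = (y_prob_demo >= 0.5).astype(float)
hard_fn = np.sum(y_true_demo * (1.0 - y_hard_demo))

# Dividing by the number of actual positives turns the count into a rate
hard_fnr = hard_fn / np.sum(y_true_demo)

print(f"soft FN: {soft_fn:.2f}, hard FN: {hard_fn:.0f}, FNR: {hard_fnr:.3f}")
```

Keras also ships `tf.keras.metrics.FalseNegatives` and `tf.keras.metrics.Recall`, which threshold the predictions; the false negative rate is then one minus recall.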

In [None]:
# Create a method that creates a new Sequential model with hyperparameter options:  method based off of class activities
def create_model(hp):
    nn_model = tf.keras.models.Sequential()

    # Allow keras-tuner to decide which activation function to use in hidden layers
    activation = hp.Choice('activation',['relu','selu'])
    
    # Allow keras-tuner to decide number of neurons in first layer
    # Input dimensions set to X_train_scaled.shape[1] to be automatically set to the number of features(columns)
    nn_model.add(tf.keras.layers.Dense(units=hp.Int('first_units',
        min_value=1,
        max_value=80,
        step=5), activation=activation, input_dim=X_train_scaled.shape[1]))

    # Allow keras-tuner to decide number of hidden layers and neurons in hidden layers
    for i in range(hp.Int('num_layers', 1, 5)):
        nn_model.add(tf.keras.layers.Dense(units=hp.Int('units_' + str(i),
            min_value=1,
            max_value=80,
            step=10),
            activation=activation))
    
    nn_model.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

    # Compile the model
    nn_model.compile(loss="binary_crossentropy", optimizer='adam', metrics=['accuracy', fn])
    
    return nn_model

In [None]:
# Import the keras-tuner library
import keras_tuner as kt

tuner = kt.Hyperband(
    create_model,
    overwrite=True,
    objective="val_accuracy",
    max_epochs=100,
    hyperband_iterations=2)

# Warning:  This will take more than 15 minutes until I setup an early stop
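
The early stop mentioned in the warning usually means halting once the validation loss has stopped improving for a set number of epochs. The plain-Python sketch below illustrates that patience rule on made-up losses; `early_stop_epoch` is a hypothetical helper, not the Keras implementation.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best value: remember it and reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 3
demo_losses = [0.70, 0.52, 0.41, 0.43, 0.42, 0.45, 0.44]
stop_at = early_stop_epoch(demo_losses, patience=3)
print(f"Training would stop after epoch {stop_at}")
```

In Keras the same behavior is available through `tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)`, which can be passed to `tuner.search` via `callbacks=[...]`; the patience of 5 is only an example value.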

In [None]:
# Run the keras-tuner search for best hyperparameters
tuner.search(X_train_scaled,y_train,epochs=10,validation_data=(X_test_scaled,y_test))

In [None]:
tuner.oracle.get_best_trials(num_trials=1)[0].hyperparameters.values

In [None]:
# Save the best hyperparameters from the search to put into the model
best_hp = tuner.get_best_hyperparameters()[0]

In [None]:
# Define the model using the parameters from the tuner
best_model = tuner.hypermodel.build(best_hp)
# Check the structure of the model
best_model.summary()

In [None]:
# Train the model
fit_model = best_model.fit(X_train_scaled, y_train, epochs=34)

In [None]:
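# Evaluate the tuned model using the test data
# Note: the custom metric is a soft false-negative count (averaged over batches by Keras), not a percentage
model_loss, model_accuracy, fn_metric = best_model.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy*100} %, False negative metric: {fn_metric}")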

## We were able to get the best results with the neural network by getting the highest accuracy and lowest false negative rate

In [None]:
# I got the tuner to say 100% accuracy, but when I ran it, it only got 98.7% and the fn rate was 0.35%