<a href="https://colab.research.google.com/github/chris-lovejoy/CodingForMedicine/blob/main/Breast_cancer_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Uses the **'Breast Cancer Wisconsin' Kaggle dataset** (available [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/downloads/breast-cancer-wisconsin-data.zip/2)) and **scikit-learn** (documentation [here](https://scikit-learn.org/)).

Initial inspiration from [this exercise by CodeMD](http://codemd.co.uk/data-science-with-breast-cancer-data/).


## Upload the data and get into the right directory


Download the Kaggle dataset by clicking [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/downloads/breast-cancer-wisconsin-data.zip/2), and unzip it on your computer.

Upload the 'data.csv' file using the panel on the left: click 'Files', then 'Upload'.

In [0]:
cd ../

[0m[01;34mbin[0m/      [01;34mdatalab[0m/  [01;34mlib[0m/    [01;34mmnt[0m/   [01;34mrun[0m/    [01;34msys[0m/                   [01;34musr[0m/
[01;34mboot[0m/     [01;34mdev[0m/      [01;34mlib32[0m/  [01;34mopt[0m/   [01;34msbin[0m/   [01;34mtensorflow-2.0.0-rc0[0m/  [01;34mvar[0m/
[01;34mcontent[0m/  [01;34metc[0m/      [01;34mlib64[0m/  [01;34mproc[0m/  [01;34msrv[0m/    [30;42mtmp[0m/
data.csv  [01;34mhome[0m/     [01;34mmedia[0m/  [01;34mroot[0m/  [01;34mswift[0m/  [01;34mtools[0m/


In [0]:
ls

In [0]:
# Import dependencies
import pandas as pd


In [0]:
# Load our data as a dataframe
df = pd.read_csv('data.csv')

## Visualise our data

In [0]:
# Look at the top of the table
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [0]:
# Look at the whole table
df

## Clean up our table

In [0]:
# Remove the patient ID and the empty trailing column; neither is a feature
df = df.drop(columns=['id', 'Unnamed: 32'])

In [0]:
# Encode the diagnosis as a number: malignant (M) = 1, benign (B) = 0
df['diagnosis'] = df['diagnosis'].map({'M':1,'B':0})
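
The clean-up can be sketched on a tiny, made-up table instead of the real `data.csv` (the three rows and the name `toy` are assumptions for illustration). The sketch also checks that every label was mapped, because `map` silently returns `NaN` for any value missing from the dictionary.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for data.csv. The trailing comma on each
# line makes pandas create an empty 'Unnamed: ...' column, as in the real file.
raw = io.StringIO(
    "id,diagnosis,perimeter_mean,\n"
    "101,M,122.8,\n"
    "102,B,77.6,\n"
    "103,M,130.0,\n"
)
toy = pd.read_csv(raw)

# Drop the identifier plus any column that is entirely empty.
empty_cols = [c for c in toy.columns if toy[c].isna().all()]
toy = toy.drop(columns=["id"] + empty_cols)

# Encode the label: malignant -> 1, benign -> 0.
toy["diagnosis"] = toy["diagnosis"].map({"M": 1, "B": 0})

# map() gives NaN for anything outside the dictionary, so check for that.
unmapped = int(toy["diagnosis"].isna().sum())
print("Unmapped labels:", unmapped)
print(toy)
```

Detecting empty columns by content rather than by name avoids relying on the exact `Unnamed: 32` label.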


## Choose our variables for the model


In [0]:
# Modify this as you wish, to select the variables you want to include
prediction_var = ['perimeter_mean', 'compactness_mean', 'concavity_mean']

## Create our training and test set data

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
train, test = train_test_split(df, test_size = 0.15)

In [0]:
train_x = train[prediction_var]
train_y = train.diagnosis

test_x = test[prediction_var]
test_y = test.diagnosis
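
Because the split is random, the rows that land in the test set change on every run. A minimal sketch of a repeatable, class-balanced split follows, assuming synthetic data in place of the real table. `random_state` and `stratify` are real `train_test_split` parameters; the values and the `toy_` names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned table: 200 rows, 30% malignant.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "perimeter_mean": rng.normal(90, 20, size=200),
    "diagnosis": [1] * 60 + [0] * 140,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.15,
    random_state=0,             # same split on every run
    stratify=toy["diagnosis"],  # keep the class ratio in both parts
)

print("Rows:", len(toy_train), len(toy_test))
print("Share malignant:",
      toy_train["diagnosis"].mean(), toy_test["diagnosis"].mean())
```

With a fixed seed, any row numbers you look up later stay valid when the notebook is re-run.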

## Train our model

In [0]:
# Import multiple options, to enable us to try out different classifiers

from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


We will train a simple neural network:

In [0]:
# Choose the model architecture and regularisation strength:
# two hidden layers (5 and 2 neurons) and alpha, the L2 penalty

model = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2))

In [0]:
# Training the model to fit the data in our training sets

model.fit(train_x, train_y)
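
Neural networks tend to train more reliably when the features share a similar scale, and the selected features do not (perimeter values are around 100, compactness around 0.1). One possible variation, not part of the original exercise, is to standardise the inputs inside a pipeline. The sketch below uses synthetic data, and `scaled_model`, `max_iter=1000` and the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-feature data standing in for train_x / train_y.
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    random_state=0,
)
X[:, 0] = X[:, 0] * 100  # exaggerate one feature's scale, like perimeter_mean

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y
)

# The scaler is fitted on the training data only, then applied before the MLP.
scaled_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(5, 2),
                  max_iter=1000, random_state=0),
)
scaled_model.fit(X_train, y_train)

scaled_accuracy = scaled_model.score(X_test, y_test)
print("Test accuracy: %.2f" % scaled_accuracy)
```

The pipeline behaves like a single model, so `fit`, `predict` and `score` work the same way as on the plain classifier.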

## Testing our model on specific examples

In [0]:
test_1 = test_x.iloc[0,0:3]


In [0]:
# The 'Name: 349' line in the output shows that 'test_1' came from row 349 of df
# (your row number will differ, because the split is random)
test_1

In [0]:
# Therefore, to see its actual diagnosis we can use:

df.loc[349,'diagnosis']

In [0]:
# Now, let's test our model...

model.predict([test_1])

In [0]:
# Let's try two more test points

test_2 = test_x.iloc[1,0:3]
test_3 = test_x.iloc[45,0:3]

In [0]:
test_2

In [0]:
test_3

In [0]:
# Rows 430 and 524 are the 'Name:' values printed for test_2 and test_3 in this run
df.loc[430,'diagnosis']


In [0]:
df.loc[524,'diagnosis']


In [0]:
# Then test the model again
model.predict([test_2])

In [0]:
model.predict([test_3])
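
The row numbers used above (349, 430, 524) come from one particular random split. A minimal sketch of looking up the true diagnosis without typing the row number by hand is shown below, using a made-up two-row test set (`toy_x`, `toy_y` and their values are hypothetical).

```python
import pandas as pd

# Hypothetical mini test set. The index values play the role of the
# original row numbers (such as 349) that survive the split.
toy_x = pd.DataFrame(
    {
        "perimeter_mean": [85.1, 120.3],
        "compactness_mean": [0.08, 0.21],
        "concavity_mean": [0.04, 0.25],
    },
    index=[349, 430],
)
toy_y = pd.Series([0, 1], index=[349, 430], name="diagnosis")

toy_1 = toy_x.iloc[0]           # first row of the test set
row_label = toy_1.name          # its original row number, here 349
actual = toy_y.loc[row_label]   # true diagnosis for that row

print("Row", row_label, "has diagnosis", actual)
```

In the notebook itself the same idea would read `test_y.loc[test_1.name]`, which keeps working whichever rows the split selects.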

## Calculate our performance metrics

In [0]:
from sklearn import metrics

In [0]:
y_true = test_y
y_pred = model.predict(test_x)

In [0]:
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)

print("The recall score is ", "%.2f" %recall)
print("The precision score is ", "%.2f" %precision)
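
To make the two scores concrete, here is a small worked sketch that computes them by hand from made-up labels (the `toy_` lists are assumptions, with 1 = malignant and 0 = benign).

```python
# Made-up labels for illustration: 1 = malignant, 0 = benign.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 1, 1, 0]

pairs = list(zip(toy_true, toy_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant, predicted malignant
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, predicted malignant
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant, predicted benign

toy_recall = tp / (tp + fn)     # share of malignant cases that were found
toy_precision = tp / (tp + fp)  # share of malignant predictions that were right

print("recall = %.2f, precision = %.2f" % (toy_recall, toy_precision))
# prints "recall = 0.75, precision = 0.60"
```

Recall answers "how many cancers did we catch?", while precision answers "how many of our alarms were real?".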

The equation for F1 score is:

![F1 equation](https://wikimedia.org/api/rest_v1/media/math/render/svg/1bf179c30b00db201ce1895d88fe2915d58e6bfd)



In [0]:
# Therefore, we can calculate it this way:
F1 = 2 * (precision * recall) / (precision + recall)
print("%.2f" %F1)

0.92


In [0]:
# Or this way:
from sklearn.metrics import f1_score

# Store the result under a new name so the f1_score function is not overwritten
f1 = f1_score(y_true, y_pred)

print("The f1 score is ", "%.2f" %f1)


In [0]:
# And finally, accuracy and AUC:

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score 

accuracy = accuracy_score(y_true, y_pred)
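# AUC ranks cases by predicted probability, so pass the malignant-class
# probabilities rather than the hard 0/1 predictions
AUC = roc_auc_score(y_true, model.predict_proba(test_x)[:, 1])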

print("The accuracy is ", "%.2f" %accuracy)
print("The AUC is ", "%.2f" %AUC)
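
AUC measures how well a model ranks malignant cases above benign ones, so it is usually computed from predicted probabilities rather than from 0/1 predictions. The sketch below illustrates the difference on made-up values (the `toy_` lists and the 0.5 threshold are assumptions). In the notebook, such probabilities could come from `model.predict_proba(test_x)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities of malignancy.
toy_true = [0, 0, 1, 1, 1]
toy_scores = [0.2, 0.6, 0.4, 0.7, 0.9]

# Thresholding at 0.5 throws away the ranking information.
toy_labels = [1 if s >= 0.5 else 0 for s in toy_scores]

auc_from_scores = roc_auc_score(toy_true, toy_scores)
auc_from_labels = roc_auc_score(toy_true, toy_labels)

print("AUC from probabilities: %.2f" % auc_from_scores)
print("AUC from 0/1 labels:    %.2f" % auc_from_labels)
```

Here the probabilities rank five of the six malignant/benign pairs correctly, which the thresholded labels cannot show.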


## Create a 'confusion matrix'

In [0]:
from sklearn.metrics import confusion_matrix


In [0]:
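# Use a new name so the confusion_matrix function is not overwritten
cm = confusion_matrix(y_true, y_pred)


In [0]:
# Rows are the actual classes (0, 1); columns are the predicted classes
cm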
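
For a two-class problem, the four cells of the matrix can be unpacked into named counts with `.ravel()`. A minimal sketch on made-up labels follows (the `toy_` names are assumptions). Sensitivity is another name for recall, and specificity is the equivalent measure for the benign class.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = malignant, 0 = benign.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 1, 1, 0]

toy_cm = confusion_matrix(toy_true, toy_pred)

# For labels 0 and 1 the flattened order is: TN, FP, FN, TP.
tn, fp, fn, tp = toy_cm.ravel()

sensitivity = tp / (tp + fn)  # malignant cases correctly identified
specificity = tn / (tn + fp)  # benign cases correctly identified

print(toy_cm)
print("sensitivity = %.2f, specificity = %.2f" % (sensitivity, specificity))
```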

## Now let's try some other classification methods!

Go back to the 'Train our model' section and try the following classifiers:
- Random Forest
- Nearest Neighbours
- Support Vector Machine

Use the [scikit-learn](https://scikit-learn.org/) documentation as required.

Compare the performance metrics and see which performs best!
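
One way to run the comparison is to loop over the classifiers and score each on the same test set. The sketch below uses synthetic data and default settings; the `candidates` dictionary, the choice of F1 as the metric and the `random_state` values are assumptions, not the exercise's prescribed solution.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic three-feature data standing in for the notebook's train/test sets.
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y
)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Nearest Neighbours": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
}

# Fit each classifier on the same training data and score it on the same test data.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = f1_score(y_test, clf.predict(X_test))
    print("%s: F1 = %.2f" % (name, scores[name]))
```

To use it on the real data, replace `X_train`, `y_train`, `X_test` and `y_test` with `train_x`, `train_y`, `test_x` and `test_y`.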

## Next steps

Once you've tried all of the above, try changing the selected features, the model hyperparameters (for example `hidden_layer_sizes` and `alpha`) and the train/test split size, and see how this affects performance.
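
Trying hyperparameters by hand works, but scikit-learn can also search a small grid automatically with cross-validation. A minimal sketch on synthetic data follows; the grid values, `cv=3`, `max_iter=500` and the F1 scoring choice are assumptions picked to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic three-feature data standing in for train_x / train_y.
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    random_state=0,
)

# Every combination of these values is tried: 2 x 2 = 4 settings.
param_grid = {
    "hidden_layer_sizes": [(5, 2), (10,)],
    "alpha": [1e-5, 1e-2],
}

search = GridSearchCV(
    MLPClassifier(solver="lbfgs", max_iter=500, random_state=0),
    param_grid,
    cv=3,          # 3-fold cross-validation on the data passed to fit()
    scoring="f1",
)
search.fit(X, y)

print("Best settings:", search.best_params_)
print("Best cross-validated F1: %.2f" % search.best_score_)
```

Run the search on the training set only, so the test set remains an untouched check of the final model.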
