# Breast Cancer Patient Outcomes

The purpose of this project is to investigate using supervised machine learning methods to predict patient outcomes in breast cancer cases. I begin by loading in the dataset. This dataset was uploaded to Kaggle by Kreesh Rajani and can be found here: 
https://www.kaggle.com/datasets/kreeshrajani/breast-cancer-survival-dataset

## Exploring the Data
Looking at the dataframe head below, we can see that we have 14 possible predictors for our 1 outcome variable, Patient_Status. Patient_Status is a binary variable, bluntly given as (Alive/Dead), which gives us the patient outcome.

In [82]:
import pandas as pd
import numpy as np

df = pd.read_csv('breast_cancer_survival.csv')
df.head()

Unnamed: 0,Age,Gender,Protein1,Protein2,Protein3,Protein4,Tumour_Stage,Histology,ER status,PR status,HER2 status,Surgery_type,Date_of_Surgery,Date_of_Last_Visit,Patient_Status
0,42,FEMALE,0.95256,2.15,0.007972,-0.04834,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,20-May-18,26-Aug-18,Alive
1,54,FEMALE,0.0,1.3802,-0.49803,-0.50732,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,26-Apr-18,25-Jan-19,Dead
2,63,FEMALE,-0.52303,1.764,-0.37019,0.010815,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Lumpectomy,24-Aug-18,08-Apr-20,Alive
3,78,FEMALE,-0.87618,0.12943,-0.37038,0.13219,I,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,16-Nov-18,28-Jul-20,Alive
4,42,FEMALE,0.22611,1.7491,-0.54397,-0.39021,II,Infiltrating Ductal Carcinoma,Positive,Positive,Positive,Lumpectomy,12-Dec-18,05-Jan-19,Alive


Below we can see a summary of the numerical data. We have 334 total rows, and NA values appear in only two columns: Date_of_Last_Visit and Patient_Status.

In [83]:
print(df.describe())

print('\nNA counts:')
print(df.isna().sum())

              Age    Protein1    Protein2    Protein3    Protein4
count  334.000000  334.000000  334.000000  334.000000  334.000000
mean    58.886228   -0.029991    0.946896   -0.090204    0.009819
std     12.961212    0.563588    0.911637    0.585175    0.629055
min     29.000000   -2.340900   -0.978730   -1.627400   -2.025500
25%     49.000000   -0.358888    0.362173   -0.513748   -0.377090
50%     58.000000    0.006129    0.992805   -0.173180    0.041768
75%     68.000000    0.343598    1.627900    0.278353    0.425630
max     90.000000    1.593600    3.402200    2.193400    1.629900

NA counts:
Age                    0
Gender                 0
Protein1               0
Protein2               0
Protein3               0
Protein4               0
Tumour_Stage           0
Histology              0
ER status              0
PR status              0
HER2 status            0
Surgery_type           0
Date_of_Surgery        0
Date_of_Last_Visit    17
Patient_Status        13
dtype: int64


### Cleaning the Data

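The NA values for Patient_Status are especially problematic, but there are not very many of them, so we will drop the rows that contain NA values. Note that dropna() with no arguments removes a row if any column is missing, so rows that lack only Date_of_Last_Visit are dropped as well. We can see below that we end up with 317 rows, so we have only lost about $5\%$ of our data.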

In [84]:
df = df.dropna()
print(df.describe())

              Age    Protein1    Protein2    Protein3    Protein4
count  317.000000  317.000000  317.000000  317.000000  317.000000
mean    58.725552   -0.027232    0.949557   -0.095104    0.006713
std     12.827374    0.543858    0.906153    0.589027    0.625965
min     29.000000   -2.144600   -0.978730   -1.627400   -2.025500
25%     49.000000   -0.350600    0.368840   -0.531360   -0.382240
50%     58.000000    0.005649    0.997130   -0.193040    0.038522
75%     67.000000    0.336260    1.612000    0.251210    0.436250
max     90.000000    1.593600    3.402200    2.193400    1.629900
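Because dropna() checks every column by default, it can remove rows whose only gap is in a column that is discarded later (here, the date columns). A minimal sketch on a made-up three-row frame, with illustrative values that are not taken from the dataset, shows the difference between the default and restricting the check with the subset parameter:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one row lacks only the visit date,
# another lacks the outcome itself.
toy = pd.DataFrame({
    'Age': [42, 54, 63],
    'Date_of_Last_Visit': ['26-Aug-18', np.nan, '08-Apr-20'],
    'Patient_Status': ['Alive', 'Dead', np.nan],
})

dropped_any = toy.dropna()                              # NA in any column
dropped_target = toy.dropna(subset=['Patient_Status'])  # NA in the target only

print(len(dropped_any), len(dropped_target))
```

Whether the extra rows are worth keeping depends on whether the column with the gap is used downstream; this notebook keeps the simpler default.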


A few columns have spaces in their names, which could cause syntax problems later on, so I will rename them to remove the space. They all end in the word 'status', so I will drop that as well. Once that is done, we turn our attention to our non-numeric data.

In the Gender column, I will map 'FEMALE' to 0 and 'MALE' to 1. For ER, PR, and HER2 I will map 'Negative' to 0 and 'Positive' to 1. For the Tumour_Stage variable, I will map each Roman numeral to its corresponding integer. The 'Histology' column has three possible values: 'Infiltrating Ductal Carcinoma', 'Infiltrating Lobular Carcinoma', and 'Mucinous Carcinoma', which I will map to 0, 1, and 2, respectively. The 'Surgery_type' column has the following values: 'Other', 'Lumpectomy', 'Modified Radical Mastectomy', and 'Simple Mastectomy', which I will map to 0, 1, 2, and 3, respectively.

Under the assumption that particular surgery dates or particular appointment dates have no meaningful impact (i.e. meeting with one's doctor on a Tuesday does not increase one's chances of survival), but may introduce additional noise, we will drop these columns. It is entirely possible that this assumption is mistaken and for an applied study a domain expert should be consulted.

Finally, our target variable 'Patient_Status' has two values, 'Dead' and 'Alive', which I will map to 0 and 1, respectively.

Our final dimensions then are 317 rows with 12 predictors for our target variable.

Our new, clean dataframe looks like this:

In [85]:
df = df.rename(columns={'ER status': 'ER', 'PR status': 'PR', 'HER2 status': 'HER2'})

df['Gender'] = df['Gender'].map({'FEMALE' : 0, 'MALE' : 1})

df['ER'] = df['ER'].map({'Negative' : 0, 'Positive' : 1})

df['PR'] = df['PR'].map({'Negative' : 0, 'Positive' : 1})

df['HER2'] = df['HER2'].map({'Negative' : 0, 'Positive' : 1})

df['Tumour_Stage'] = df['Tumour_Stage'].map({'I' : 1, 'II' : 2, 'III': 3, 'IV': 4})

df['Histology'] = df['Histology'].map({'Infiltrating Ductal Carcinoma' : 0, 'Infiltrating Lobular Carcinoma' : 1, 'Mucinous Carcinoma' : 2})

df['Surgery_type'] = df['Surgery_type'].map({'Other' : 0, 'Lumpectomy' : 1, 'Modified Radical Mastectomy' : 2,'Simple Mastectomy' : 3})

# columns= already names the axis, so a separate axis argument is not needed
df = df.drop(columns=['Date_of_Surgery', 'Date_of_Last_Visit'])

df['Patient_Status'] = df['Patient_Status'].map({'Dead' : 0, 'Alive' : 1})

df.head()

Unnamed: 0,Age,Gender,Protein1,Protein2,Protein3,Protein4,Tumour_Stage,Histology,ER,PR,HER2,Surgery_type,Patient_Status
0,42,0,0.95256,2.15,0.007972,-0.04834,2,0,1,1,0,0,1
1,54,0,0.0,1.3802,-0.49803,-0.50732,2,0,1,1,0,0,0
2,63,0,-0.52303,1.764,-0.37019,0.010815,2,0,1,1,0,1,1
3,78,0,-0.87618,0.12943,-0.37038,0.13219,1,0,1,1,0,0,1
4,42,0,0.22611,1.7491,-0.54397,-0.39021,2,0,1,1,1,1,1
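The dictionary mappings above rely on every label being spelled exactly as listed. Series.map returns NaN for any value that is missing from the dictionary, so a quick NaN count after mapping is a cheap safeguard. The sketch below uses a made-up column with one deliberately mismatched label ('ii'); the values are hypothetical and not from the dataset.

```python
import pandas as pd

# Toy column (hypothetical values) with one label the mapping does not cover.
stage = pd.Series(['I', 'II', 'III', 'ii'])
stage_map = {'I': 1, 'II': 2, 'III': 3, 'IV': 4}

encoded = stage.map(stage_map)

# Series.map leaves NaN wherever a value is missing from the dictionary,
# so counting NaNs after mapping reveals unexpected labels.
n_unmapped = int(encoded.isna().sum())
unmapped_labels = sorted(stage[encoded.isna()].unique())
print(n_unmapped, unmapped_labels)
```

Running the same check on each mapped column of the real dataframe would confirm that no rows were silently turned into NaN.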


### Visualizations

Before we move on with our analysis, let's take a look at any possible correlation between variables that might indicate collinearity or other issues in our predictors.

In [86]:
# TODO
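A minimal sketch of one way to approach this, assuming a pairwise Pearson correlation matrix is an acceptable first look at collinearity. It runs on a small synthetic frame (the column names are borrowed from the dataset, but the values are randomly generated and Protein2 is deliberately built to track Protein1), so the numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned dataframe (hypothetical values).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'Protein1': a,
    'Protein2': a * 0.9 + rng.normal(scale=0.1, size=200),  # built to track Protein1
    'Age': rng.integers(29, 91, size=200),
})

corr = demo.corr()  # pairwise Pearson correlations

# Keep each pair once (upper triangle, no diagonal) and sort by absolute strength.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs)
```

Applied to the real dataframe, the same steps would use df in place of demo; pairs with a large absolute correlation are the ones worth a closer look.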

## Modeling the Data

Now that our data is clean, we can split it for training and testing. I played around a bit with test and training sizes, and with this dataset it seems like we get the best results when the test size is relatively large. This makes sense: we don't want to be testing on a small sample. Thus, I've set the test size here to 70% of the data, which leaves 95 rows for training and 222 for testing. For reproducibility, I've set random_state to 0.

In [87]:
from sklearn.model_selection import train_test_split

x = df.iloc[:,:-1]

y = df.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.7, random_state=0)
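To make the consequence of test_size=.7 concrete, the sketch below splits a stand-in array of 317 row indices (one per cleaned row, used here instead of the real dataframe) with the same arguments and reports the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(317)  # stand-in indices, one per cleaned row

# Same test_size and random_state as the split above
train_idx, test_idx = train_test_split(rows, test_size=0.7, random_state=0)
print(len(train_idx), len(test_idx))
```

With a float test_size, scikit-learn rounds the test count up, so the models in the rest of this notebook are fit on the smaller of the two pieces.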


### Regression
Now that our training and testing data is split we can start our modeling process. Because this problem has a binary target value, this is a classification problem. Below we see a first pass at a logistic regression model using all possible predictors. The results are not impressive but not hopeless. This may point to a possible issue in our choice (or lack thereof) of predictors.

In [88]:
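from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings

# Silence all warnings (including any solver convergence warnings) to keep the output readable
warnings.filterwarnings('ignore')

# Fit on the training split using all 12 predictors
reg_mod = LogisticRegression().fit(x_train, y_train)

y_pred = reg_mod.predict(x_test)

# accuracy_score expects (y_true, y_pred)
print('Accuracy of first Logistic Regression model: ' + str(round(accuracy_score(y_test, y_pred),3)))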

Accuracy of first Logistic Regression model: 0.797


#### Feature Selection

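scikit-learn provides a handy tool for feature selection. The SequentialFeatureSelector performs greedy forward or backward selection, scoring each candidate set by cross-validation on the data it is fit on, and returns the subset it selects for a given number of features. Because the search is greedy, the result is not guaranteed to be the best possible subset of that size.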

In [89]:
from sklearn.feature_selection import SequentialFeatureSelector
feature_names = np.array(x_train.columns)

for i in range(1,12):
    log_reg = LogisticRegression()
    forward_selector = SequentialFeatureSelector(log_reg, n_features_to_select=i, direction='forward').fit(x_train,y_train)
    features = feature_names[forward_selector.get_support()]
    print(features)


['Age']
['Age' 'Gender']
['Age' 'Gender' 'Protein2']
['Age' 'Gender' 'Protein1' 'Protein2']
['Age' 'Gender' 'Protein1' 'Protein2' 'Protein3']
['Age' 'Gender' 'Protein1' 'Protein2' 'Protein3' 'Protein4']
['Age' 'Gender' 'Protein1' 'Protein2' 'Protein3' 'Protein4' 'Tumour_Stage']
['Age' 'Gender' 'Protein1' 'Protein2' 'Protein3' 'Protein4' 'Tumour_Stage'
 'Histology']
['Age' 'Gender' 'Protein1' 'Protein2' 'Protein3' 'Protein4' 'Tumour_Stage'
 'Histology' 'HER2']
['Age' 'Gender' 'Protein1' 'Protein2' 'Protein3' 'Protein4' 'Tumour_Stage'
 'Histology' 'ER' 'HER2']
['Age' 'Gender' 'Protein1' 'Protein2' 'Protein3' 'Protein4' 'Tumour_Stage'
 'Histology' 'ER' 'PR' 'HER2']
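For reference, here is a minimal, self-contained sketch of the selector with its cross-validation settings written out explicitly. It uses synthetic data from make_classification rather than the breast cancer dataset, and the feature names f0 to f5 are placeholders, so the selected features have no bearing on the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data (not the breast cancer dataset): 6 features, 2 informative.
X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
names = np.array([f'f{i}' for i in range(X.shape[1])])

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction='forward',
    cv=5,                # candidates are scored by 5-fold cross-validation
    scoring='accuracy',  # on the data passed to fit
).fit(X, y)

# get_support() returns a boolean mask in column order, not selection order
selected = names[selector.get_support()]
print(selected)
```

Since get_support() returns a mask, the printed names always appear in column order, which is why the lists above do not show the order in which features were added.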


Below I trained a series of models using each set of features. We can see that accuracy is identical for 1, 2, and 3 features, then drops slightly and never recovers beyond 3 features.

In [91]:
# Feature subsets chosen by forward selection above, from 1 to 11 features
feature_sets = [
    ['Age'],
    ['Age', 'Gender'],
    ['Age', 'Gender', 'Protein2'],
    ['Age', 'Gender', 'Protein1', 'Protein2'],
    ['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3'],
    ['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3', 'Protein4'],
    ['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3', 'Protein4', 'Tumour_Stage'],
    ['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3', 'Protein4', 'Tumour_Stage', 'Histology'],
    ['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3', 'Protein4', 'Tumour_Stage', 'Histology', 'HER2'],
    ['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3', 'Protein4', 'Tumour_Stage', 'Histology', 'ER', 'HER2'],
    ['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3', 'Protein4', 'Tumour_Stage', 'Histology', 'ER', 'PR', 'HER2'],
]

# Fit one model per subset on the training split and score it on the test split
for features in feature_sets:
    model = LogisticRegression().fit(x_train[features], y_train)
    y_pred = model.predict(x_test[features])
    score = accuracy_score(y_test, y_pred)  # (y_true, y_pred)
    print('With ' + str(len(features)) + ' features, accuracy is ' + str(score))

With 1 features, accuracy is 0.8108108108108109
With 2 features, accuracy is 0.8108108108108109
With 3 features, accuracy is 0.8108108108108109
With 4 features, accuracy is 0.8063063063063063
With 5 features, accuracy is 0.8063063063063063
With 6 features, accuracy is 0.8063063063063063
With 7 features, accuracy is 0.8018018018018018
With 8 features, accuracy is 0.8063063063063063
With 9 features, accuracy is 0.8063063063063063
With 10 features, accuracy is 0.8063063063063063
With 11 features, accuracy is 0.8063063063063063


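Let's check that choice with a second run of the selector. We'll assume that 3 features is the sweet spot. Note that the cell below passes direction='forward', so it repeats the forward search for 3 features rather than performing backward selection, and it returns the same Age, Gender, and Protein2 subset. A true backward pass would use direction='backward', starting from all 12 predictors and removing one at a time, and it is not guaranteed to pick the same three.

In [92]:
# direction='forward' repeats the forward search; direction='backward' would run backward elimination
backward_selector = SequentialFeatureSelector(log_reg, n_features_to_select=3, direction='forward').fit(x_train,y_train)
features = feature_names[backward_selector.get_support()]
print(features)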

['Age' 'Gender' 'Protein2']


So, our final logistic regression model here uses 3 features: 'Age', 'Gender', and 'Protein2'. As we can see below, it has an accuracy of $81.1\%$ (180 of the 222 test cases), which is better than the $79.7\%$ (177 of 222) where we started but still not great.

In [96]:
final_features = ['Age', 'Gender', 'Protein2']
x = x_train[final_features]
log_mod = LogisticRegression().fit(x, y_train)
y_pred = log_mod.predict(x_test[final_features])
score = accuracy_score(y_test, y_pred)  # (y_true, y_pred)
print('Accuracy of final logistic regression model is: ' + str(round(score,3)))

Accuracy of final logistic regression model is: 0.811
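To judge whether an accuracy like this is "not great", it helps to compare it with a model that always predicts the most common class. The sketch below shows that comparison with scikit-learn's DummyClassifier on made-up labels (80 'Alive' and 20 'Dead'); these counts are hypothetical and are not the class balance of this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 80 'Alive' (1) and 20 'Dead' (0); not the real class counts.
y_demo = np.array([1] * 80 + [0] * 20)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)
```

On the real data, fitting the dummy model on x_train and y_train and scoring it on x_test and y_test would give the number that the logistic regression accuracy needs to beat.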
