# Breast Cancer Prognosis using TensorFlow

**Objective**: To classify whether or not cancer will recur in a patient within 24 months <br>
**Dataset used**: Wisconsin Prognostic Breast Cancer dataset

Like the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, the Wisconsin Prognostic Breast Cancer (WPBC) dataset describes the cell nuclei in each image with 10 features: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Each feature is recorded as three values (the mean, the standard error, and the worst value over the nuclei), which gives 30 feature columns. This data was tabulated for each of the 198 patients. In addition to the 32 columns that mirror the WDBC dataset (an ID, a class label, and the 30 features), this dataset contains 3 more: the tumor diameter, the number of lymph nodes removed from the patient, and a time column. For non-recurring patients, the time is the number of months the patient has been disease free since the initial treatment; for recurring patients, it is the number of months after the initial treatment at which the cancer recurred. This yields a table of 198 samples x 35 columns.

**Note:** Unlike the WDBC dataset, the WPBC dataset requires a few additional pre-processing steps. For one, there are some missing values for certain patients. Additionally, the labelling is conditional: only recurring patients whose recurrence time is less than 24 months are labelled as 1, while the rest are labelled as 0.

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np



In [2]:
df = pd.read_csv("Datasets/Prognosis_Breast_Cancer.csv")

In [3]:
df.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,...,Column26,Column27,Column28,Column29,Column30,Column31,Column32,Column33,Column34,Column35
0,119513,N,31,18.02,27.6,117.5,1013.0,0.09489,0.1036,0.1086,...,139.7,1436.0,0.1195,0.1926,0.314,0.117,0.2677,0.08113,5.0,5
1,8423,N,61,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,3.0,2
2,842517,N,116,21.37,17.44,137.5,1373.0,0.08836,0.1189,0.1255,...,159.1,1949.0,0.1188,0.3449,0.3414,0.2032,0.4334,0.09067,2.5,0
3,843483,N,123,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,2.0,0
4,843584,R,27,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,3.5,0


### Labelling Columns

In [4]:
df.rename(columns={'Column1': 'ID No', 
                   'Column2': 'Outcome',
                   'Column3': 'Time',
                   'Column4': 'Radius1',
                   'Column5': 'Texture1',
                   'Column6': 'Perimeter1',
                   'Column7': 'Area1',
                   'Column8': 'Smoothness1',
                   'Column9': 'Compactness1',
                   'Column10': 'Concavity1',
                   'Column11': 'ConcavePoints1',
                   'Column12': 'Symmetry1',
                   'Column13': 'FractalDim1',
                   'Column14': 'Radius2',
                   'Column15': 'Texture2',
                   'Column16': 'Perimeter2',
                   'Column17': 'Area2',
                   'Column18': 'Smoothness2',
                   'Column19': 'Compactness2',
                   'Column20': 'Concavity2',
                   'Column21': 'ConcavePoints2',
                   'Column22': 'Symmetry2',
                   'Column23': 'FractalDim2',
                   'Column24': 'Radius3',
                   'Column25': 'Texture3',
                   'Column26': 'Perimeter3',
                   'Column27': 'Area3',
                   'Column28': 'Smoothness3',
                   'Column29': 'Compactness3',
                   'Column30': 'Concavity3',
                   'Column31': 'ConcavePoints3',
                   'Column32': 'Symmetry3',
                   'Column33': 'FractalDim3',
                   'Column34': 'Tumor Diameter',
                   'Column35': 'Lymph Nodes Removed'},inplace=True)

In [5]:
df.head()

Unnamed: 0,ID No,Outcome,Time,Radius1,Texture1,Perimeter1,Area1,Smoothness1,Compactness1,Concavity1,...,Perimeter3,Area3,Smoothness3,Compactness3,Concavity3,ConcavePoints3,Symmetry3,FractalDim3,Tumor Diameter,Lymph Nodes Removed
0,119513,N,31,18.02,27.6,117.5,1013.0,0.09489,0.1036,0.1086,...,139.7,1436.0,0.1195,0.1926,0.314,0.117,0.2677,0.08113,5.0,5
1,8423,N,61,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,3.0,2
2,842517,N,116,21.37,17.44,137.5,1373.0,0.08836,0.1189,0.1255,...,159.1,1949.0,0.1188,0.3449,0.3414,0.2032,0.4334,0.09067,2.5,0
3,843483,N,123,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,2.0,0
4,843584,R,27,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,3.5,0


In [6]:
df['Outcome'].unique()

array(['N', 'R'], dtype=object)

In [7]:
# Intended rule: recurring patients whose recurrence time is under 24 months
# form one class; non-recurring patients and later recurrences form the other.
# Step 1 marks the early-recurrence rows with 1 (note that the code uses <= 24).
df.loc[(df['Time'] <= 24) & (df['Outcome'] == 'R'), 'Outcome'] = 1
# Step 2 is True for every row that was NOT marked in step 1, so after the
# conversion below early recurrence ends up as 0 and all other patients as 1.
df['Outcome'] = df['Outcome'].values != 1

# Replacing True/False column with 1/0 column
df['Outcome'].replace(False, 0, inplace=True)
df['Outcome'].replace(True, 1, inplace=True)
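
The labelling rule can also be written as a single boolean expression. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration); it follows the encoding stated in the note at the top, where 1 means recurrence in under 24 months, and it is not the cell used for the outputs in this notebook.

```python
import pandas as pd

# Toy frame (hypothetical values) with the two columns the rule depends on.
toy = pd.DataFrame({
    "Outcome": ["N", "R", "R", "N"],
    "Time":    [31, 10, 40, 12],
})

# 1 = cancer recurred in under 24 months, 0 = every other patient.
recurred_early = (toy["Outcome"] == "R") & (toy["Time"] < 24)
toy["Label"] = recurred_early.astype(int)

print(toy["Label"].tolist())  # [0, 1, 0, 0]
```

Only the second patient is labelled 1: the third recurred too late, and the fourth has a short time value but did not recur.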


In [8]:
df.head()

Unnamed: 0,ID No,Outcome,Time,Radius1,Texture1,Perimeter1,Area1,Smoothness1,Compactness1,Concavity1,...,Perimeter3,Area3,Smoothness3,Compactness3,Concavity3,ConcavePoints3,Symmetry3,FractalDim3,Tumor Diameter,Lymph Nodes Removed
0,119513,1.0,31,18.02,27.6,117.5,1013.0,0.09489,0.1036,0.1086,...,139.7,1436.0,0.1195,0.1926,0.314,0.117,0.2677,0.08113,5.0,5
1,8423,1.0,61,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,3.0,2
2,842517,1.0,116,21.37,17.44,137.5,1373.0,0.08836,0.1189,0.1255,...,159.1,1949.0,0.1188,0.3449,0.3414,0.2032,0.4334,0.09067,2.5,0
3,843483,1.0,123,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,2.0,0
4,843584,1.0,27,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,3.5,0


### Addressing Missing Values

As mentioned previously, the WPBC dataset contains some missing values (specifically in the **Lymph Nodes Removed** column, where they are recorded as `?`). Data can be missing for a variety of reasons, which are usually grouped into three categories:
1. Missing at Random: the propensity for a data point to be missing is not related to the missing value itself, but it is related to some of the observed data.
2. Missing Completely at Random: the fact that a certain value is missing has nothing to do with its hypothetical value or with the values of other variables.
3. Missing Not at Random: either the missingness depends on the hypothetical value itself (e.g. people with high salaries are often reluctant to reveal their incomes in surveys), or it depends on some other variable's value (e.g. if respondents of one gender were less willing to reveal their age, the missing values in the age variable would be driven by the gender variable).

In the first two cases, it is generally safe to remove the records with missing values, provided they are few, while in the third case removing observations with missing values can introduce a bias in the model.

Ref: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4

Since it could be hypothesised that the missing values in the **Lymph Nodes Removed** column are missing at random, we address this issue by simply removing the records of the patients with missing values.

In [9]:
df = df.drop(df[df['Lymph Nodes Removed']=='?'].index)
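
Before dropping rows, it can help to count the placeholders and convert the column to a numeric type, since a column that contained `?` is read as text. The following is a rough sketch on a hypothetical toy column, not on the actual dataset; `pd.to_numeric` with `errors="coerce"` turns anything unparseable into `NaN`.

```python
import pandas as pd

# Toy column (hypothetical values): a numeric field stored as text,
# with '?' standing in for missing entries.
toy = pd.DataFrame({"Lymph Nodes Removed": ["5", "2", "?", "0", "?"]})

# Count the placeholder entries before deciding how to handle them.
n_missing = (toy["Lymph Nodes Removed"] == "?").sum()

# Coerce to numbers ('?' becomes NaN), then drop the incomplete rows.
toy["Lymph Nodes Removed"] = pd.to_numeric(
    toy["Lymph Nodes Removed"], errors="coerce"
)
clean = toy.dropna(subset=["Lymph Nodes Removed"])

print(n_missing, len(clean))  # 2 3
```

After this step the column holds floats rather than strings, which avoids relying on a later implicit conversion.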

In [10]:
df['Outcome'].value_counts()

1.0    166
0.0     28
Name: Outcome, dtype: int64

### Features and Labels

In [11]:
x_data = df.drop(["ID No","Outcome"],axis=1)
y_labels = df['Outcome']

### Converting pandas DataFrame into numpy matrix

In [12]:
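# DataFrame.as_matrix() was deprecated and later removed from pandas;
# .values returns the same NumPy array.
x_data = x_data.values
y_labels = y_labels.values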

### Train-Test Split

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(x_data,y_labels,test_size=0.3,random_state=101)
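
Because the two classes are imbalanced (166 vs 28 after cleaning), one option worth considering is a stratified split, which keeps the class ratio roughly equal in both partitions. The sketch below uses synthetic stand-in data with the same class counts; the feature values are random and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class balance as the cleaned data.
y = np.array([1] * 166 + [0] * 28)
X = np.random.RandomState(0).rand(len(y), 33)

# stratify=y asks scikit-learn to preserve the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

print(X_tr.shape, X_te.shape)
print("minority share:", (y_tr == 0).mean(), (y_te == 0).mean())
```

Without `stratify`, a random split of so few minority samples can leave the test set with noticeably more or fewer of them than the training set.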

In [15]:
X_train

array([[68, 14.19, 26.02, ..., 0.1061, 1.4, '0'],
       [17, 19.71, 19.06, ..., 0.08621000000000001, 4.0, '15'],
       [19, 19.55, 28.77, ..., 0.1005, 6.0, '15'],
       ...,
       [91, 13.77, 22.29, ..., 0.09333, 1.2, '0'],
       [117, 15.85, 23.95, ..., 0.06287000000000001, 1.0, '0'],
       [67, 20.51, 27.81, ..., 0.08327999999999999, 9.0, '24']],
      dtype=object)

### Normalise Data

In [16]:
from sklearn.preprocessing import MinMaxScaler

In [17]:
scaler = MinMaxScaler()

In [18]:
scaled_x_train = scaler.fit_transform(X_train)
scaled_x_test = scaler.transform(X_test)
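
What the scaler does can be sketched in plain NumPy: the per-column minimum and range are learned from the training data only and then reused for the test data. The arrays below are hypothetical toy values.

```python
import numpy as np

train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

# "Fit": record per-column minimum and range from the training data only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# "Transform": apply the same shift and scale to both sets.
scaled_train = (train - col_min) / col_range
scaled_test = (test - col_min) / col_range

print(scaled_train)
print(scaled_test)  # test values may fall outside [0, 1]
```

Fitting on the training set alone keeps information about the test set out of the preprocessing, which is why the cell above calls `fit_transform` on the training data and only `transform` on the test data.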



In [19]:
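# One-hot encode the training labels (column 0 = class 0, column 1 = class 1).
# as_matrix() is no longer available in pandas; .values is equivalent, and the
# cast keeps the array numeric on pandas versions that return boolean dummies.
onehot_y_train = pd.get_dummies(y_train).values.astype(np.float32)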
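
For two classes, the same one-hot encoding can be sketched directly in NumPy by indexing an identity matrix. The label values here are hypothetical.

```python
import numpy as np

labels = np.array([1.0, 0.0, 1.0, 1.0])  # hypothetical 0/1 labels

# Row i of the identity matrix is the one-hot vector for class i.
onehot = np.eye(2, dtype=np.float32)[labels.astype(int)]

# argmax along axis 1 recovers the original class indices.
recovered = onehot.argmax(axis=1)

print(onehot)
print(recovered)
```

The `argmax` round trip is the same operation the accuracy and prediction ops later apply to the network output.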

### Defining the Network

In [20]:
X_train.shape

(135, 33)

In [21]:
num_inputs = 33
num_hidden1 = 30
num_hidden2 = 30
num_outputs = 2
learning_rate = 0.003
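
As a quick sanity check on model size, the number of trainable parameters implied by these layer sizes can be worked out by hand. This is a small illustrative calculation, assuming fully connected layers with one bias per output unit, as defined in the cells that follow.

```python
# Layer sizes taken from the cell above: inputs, two hidden layers, outputs.
layer_sizes = [33, 30, 30, 2]

# Each dense layer has (inputs x outputs) weights plus one bias per output.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)  # [1020, 930, 62] 2012
```

With 135 training rows, the network has roughly 15 parameters per training sample, which is one reason regularisation such as dropout matters here.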

### Placeholders

In [22]:
X = tf.placeholder(tf.float32,[None,num_inputs],name="X")
y_true = tf.placeholder(tf.float32,[None,num_outputs],name="Labels")

### Initialise Weights

In [23]:
W1 = tf.Variable(tf.random_normal([num_inputs,num_hidden1],stddev=0.01),name="W1")
b1 = tf.Variable(tf.random_normal([num_hidden1],stddev=0.01),name="b1")
W2 = tf.Variable(tf.random_normal([num_hidden1,num_hidden2],stddev=0.01),name="W2")
b2 = tf.Variable(tf.random_normal([num_hidden2],stddev=0.01),name="b2")
W3 = tf.Variable(tf.random_normal([num_hidden2,num_outputs],stddev=0.01),name="W3")
b3 = tf.Variable(tf.random_normal([num_outputs],stddev=0.01),name="b3")


w1s = tf.summary.histogram("W1",W1)
b1s = tf.summary.histogram("b1",b1)
w2s = tf.summary.histogram("W2",W2)
b2s = tf.summary.histogram("b2",b2)
w3s = tf.summary.histogram("W3",W3)
b3s = tf.summary.histogram("b3",b3)

### Choose activation function

In [24]:
actf = tf.nn.relu

### Operation

In [25]:
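with tf.name_scope("NeuralNetwork"):
    # Hidden layer 1: affine transform, ReLU, then dropout.
    # In TF 1.x the second positional argument of tf.nn.dropout is keep_prob,
    # so 0.25 keeps about 25% of the activations (it is not the drop rate).
    O = tf.add(tf.matmul(X,W1),b1)
    Z = actf(O)
    Z = tf.nn.dropout(Z,0.25)

    # Hidden layer 2: same pattern with keep_prob = 0.2.
    O1 = tf.add(tf.matmul(Z,W2),b2)
    Z1 = actf(O1)
    Z1 = tf.nn.dropout(Z1,0.2)

    # Output layer: raw logits; softmax is applied inside the loss.
    # Dropout is hard-coded above, so it also stays active at prediction time.
    output = tf.add(tf.matmul(Z1,W3),b3)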
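
The effect of the `keep_prob` argument can be illustrated with a NumPy sketch of inverted dropout. This is an illustration of the technique, not TensorFlow's implementation; the helper name `dropout` and the array sizes are chosen for this example only.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob,
    then rescale the survivors by 1 / keep_prob."""
    mask = rng.random_sample(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
Z_demo = np.ones((1000, 30))
dropped = dropout(Z_demo, keep_prob=0.25, rng=rng)

# Roughly a quarter of the units survive, each scaled from 1.0 to 4.0.
kept_fraction = (dropped != 0).mean()
print(kept_fraction, np.unique(dropped))
```

The rescaling keeps the expected activation unchanged, so a layer can be used without dropout at evaluation time, provided the dropout op is actually switched off (for example by feeding `keep_prob` through a placeholder).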

### Error Calculation

In [26]:
with tf.name_scope("CrossEntropyError"):
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_true,logits=output))
cs = tf.summary.scalar('cross_entropy_error',cost)
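
For reference, the quantity computed by the cost op can be sketched in NumPy: a softmax over the logits followed by the cross-entropy against the one-hot labels, averaged over the batch. The logits and labels below are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(labels, logits):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross-entropy: negative log-probability of the true class.
    return -(labels * log_probs).sum(axis=1)

logits = np.array([[0.0, 0.0], [2.0, -1.0]])
labels = np.array([[1.0, 0.0], [1.0, 0.0]])

losses = softmax_cross_entropy(labels, logits)
cost_value = losses.mean()  # analogous to tf.reduce_mean over the batch

print(losses, cost_value)
```

The first sample has equal logits, so its loss is ln 2; the second is confidently correct, so its loss is much smaller.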

### Accuracy Calculation

In [27]:
with tf.name_scope('Accuracy'):
    with tf.name_scope('correct_prediction'):
        correct_prediction = tf.equal(tf.argmax(y_true, 1), tf.argmax(output, 1))
    with tf.name_scope('accuracy'):
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    acc = tf.summary.scalar('accuracy', accuracy)

### Optimizer

In [28]:
optimizer = tf.train.AdamOptimizer(learning_rate)
train = optimizer.minimize(cost)

In [29]:
predict = tf.argmax(output,axis=1)

### Initialise Global Variables

In [30]:
init = tf.global_variables_initializer()

### Saving Model

In [31]:
saver = tf.train.Saver()

In [32]:
training_steps = 5000

with tf.Session() as sess:
    sess.run(init)
    writer = tf.summary.FileWriter("Prognosis/Logs",sess.graph)
    summaries = tf.summary.merge([w1s,b1s,w2s,b2s,w3s,b3s,cs,acc])
    train_feed = {X:scaled_x_train,y_true:onehot_y_train}
    for i in range(training_steps):
        sess.run(train,feed_dict=train_feed)
        if i % 500 == 0:
            # Print out accuracy, computed in NumPy so that no new ops are
            # added to the graph on every logging step
            pred = sess.run(predict,feed_dict=train_feed)
            print("Training Accuracy:",np.mean(y_train == pred,dtype=np.float32))

            s = sess.run(summaries,feed_dict=train_feed)
            writer.add_summary(s, global_step=i)

    # Get predictions on the test set
    logits = output.eval(feed_dict={X:scaled_x_test})
    results = np.argmax(logits,axis=1)
    writer.close()
    saver.save(sess,'Prognosis/Models/my_base_model.ckpt')

Training Accuracy: 0.8666667
Training Accuracy: 0.95555556
Training Accuracy: 0.9851852
Training Accuracy: 0.93333334
Training Accuracy: 0.97037035
Training Accuracy: 0.94814813
Training Accuracy: 0.9777778
Training Accuracy: 0.97037035
Training Accuracy: 0.9259259
Training Accuracy: 0.9851852


### Evaluating Performance on Test set

In [33]:
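from sklearn.metrics import confusion_matrix,classification_report
# classification_report expects (y_true, y_pred). The call below passes the
# predictions first, so in the report that follows "support" counts predicted
# labels, and the precision and recall columns are swapped.
print(classification_report(results,y_test))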

             precision    recall  f1-score   support

          0       0.70      0.78      0.74         9
          1       0.96      0.94      0.95        50

avg / total       0.92      0.92      0.92        59
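
To make the effect of argument order concrete, the sketch below computes precision and recall by hand for hypothetical labels and predictions, then repeats the calculation with the two arrays swapped. The helper `precision_recall` is written for this example and is not part of scikit-learn.

```python
import numpy as np

y_true_demo = np.array([1, 1, 1, 0, 0, 1])  # hypothetical ground truth
y_pred_demo = np.array([1, 1, 0, 0, 0, 1])  # hypothetical predictions

def precision_recall(y_true, y_pred, positive):
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return tp / (tp + fp), tp / (tp + fn)

# Correct order: ground truth first, predictions second.
p, r = precision_recall(y_true_demo, y_pred_demo, positive=0)

# Swapped order: precision and recall trade places.
p_swapped, r_swapped = precision_recall(y_pred_demo, y_true_demo, positive=0)

print(p, r)
print(p_swapped, r_swapped)
```

The F1 score is unaffected by the swap because it is symmetric in precision and recall, but the per-class support is not.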

