# CNN Cancer Detection Kaggle Mini Project

The goal of this Kaggle competition, is to create an algorithm to identify metastatic cancer 
in small image patches taken from larger digital pathology scans. 
The data for this competition are 96x96 pixel images separated in both training and test sets.

## Introduction: Step 1
    
We will see how many training and test samples we have as well as verify the shapes of the samples are
96x96x3 (96x96 pixels, 3 colors for RGB).  Also, as this is a binary classification problem, I will
want to make sure the classes are balanced so as not to introduce any bias into the predictions

# I AM NOT SURE WHY THE IMAGES DID NOT LOAD WHEN DOWNLOADING MY FILE FROM KAGGLE.  PLEASE SEE THE NOTEBOOK HERE FOR THE VISUALIZATIONS:

https://www.kaggle.com/code/dayivy/cnn-cancer-detection-kaggle-mini-project


In [1]:
#import necessary packages and training labels
import numpy as np
import pandas as pd

train_labels = pd.read_csv('../input/histopathologic-cancer-detection/train_labels.csv', dtype=str)
print(train_labels.shape)

In [2]:
train_labels.head()

In [3]:
train_labels.dtypes

In [4]:
train_labels['label'] = train_labels['label'].astype(float)

In [5]:
import os
print(len(os.listdir('../input/histopathologic-cancer-detection/train/')))
print(len(os.listdir('../input/histopathologic-cancer-detection/test/')))

So, we see we have 220,025 training samples and 57,458 test samples.

In [6]:
len(train_labels)

## Step 2: Exploratory Data Analysis

I will now look to do any data cleaning and figure out my plan of how to approach this task.  Initially I will want to determine if the classes in the training set are unbalanced.

In [7]:
train_labels['label'].value_counts()

In [8]:
train_labels['label'].value_counts().plot(kind='pie')

There is a class imbalance in the training data with significantly more non-cancerous images (label '0') versus cancerous (label '1').  I will reduce the size of label 0 by random undersampling of that set in order to match the same size as the set of images with label 1.

In [9]:
#split into two sets based on labels
train_labels_pos = train_labels[train_labels['label']==1]
train_labels_neg = train_labels[train_labels['label']==0]

In [10]:
#take a random sample of the neg labels of the same size as the set of pos labels
train_labels_neg = train_labels_neg.sample(n = train_labels_pos.shape[0])

In [11]:
#confirm both sets are of the same size
print(train_labels_neg.shape[0])
print(train_labels_pos.shape[0])

In [12]:
#combine and randomize the two sets
train_labels_balanced = pd.concat([train_labels_neg,train_labels_pos]).sample(frac=1, random_state=12345).reset_index(drop=True)
train_labels_balanced.head()

In [13]:
#confirm final set has the expected amount and shape
train_labels_balanced.shape

In [14]:
#confirm final set has the expected value counts
train_labels_balanced['label'].value_counts()

In [15]:
train_labels_balanced['label'].value_counts().plot(kind='pie')

Now I will look at a several images to see what we are actually trying to classif and confirm they are of the expected size

In [16]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread(f'../input/histopathologic-cancer-detection/train/{train_labels_balanced.iloc[47,0]}.tif')
imgplot = plt.imshow(img)

In [17]:
print(img.shape)

In [18]:
sample_imgs = np.random.choice(train_labels_balanced.index,15)


In [19]:
fig, ax = plt.subplots(5, 3,figsize=(20,20))

for i in range(0, sample_imgs.shape[0]):
    ax = plt.subplot(5, 3, i+1)
    img = mpimg.imread(f'../input/histopathologic-cancer-detection/train/{train_labels_balanced.iloc[sample_imgs[i],0]}.tif')
    ax.imshow(img)
    lab = train_labels_balanced.iloc[sample_imgs[i],1]
    ax.set_title('Label: %s'%lab)
    
plt.tight_layout()

So, we can see some images are vastly different from others, and with no medical training I cant possibly tell which images are indicative of cancer and which arent.  Hopefully, with deep learning, the model will be able to!

## Step 3: Model Architecture

I will initially attempt to use a model architecture based on the VGGNet model.  This utilizes blocks of convulutions in a [Conv-Conv-MaxPool] set repeated n times.  This is a fairly simple architecture for a beginner like my self to develop and understand so I will stick with that for this assignment.

I will try a couple of different block sizes then look to further tune other hyperparameters such as optimization methods and activation functions.

For the first step, I will split the training set into a training subset and a validation subset.  I will use these same subsets throughout this process in order to remain consistent.

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
train_df, valid_df = train_test_split(train_labels_balanced, test_size=0.25, random_state=1234, stratify=train_labels_balanced.label)

In [27]:
#import tensorflow and keras as well as any necessary packages
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow import keras
from keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization
from keras.layers import Conv2D, MaxPooling2D
from keras import regularizers, optimizers
from keras.layers import PReLU
from keras.initializers import Constant

from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [23]:
train_df['id'] = train_df['id']+'.tif'
valid_df['id'] = valid_df['id']+'.tif'

In [24]:
train_df['label'] = train_df['label'].astype(str)
valid_df['label'] = valid_df['label'].astype(str)

In [25]:
#create the training and validation subsets
train_datagen=ImageDataGenerator(rescale=1/255)

train_generator=train_datagen.flow_from_dataframe(dataframe=train_df,directory="../input/histopathologic-cancer-detection/train/",
                x_col="id",y_col="label",batch_size=64,seed=1234,shuffle=True,
                class_mode="binary",target_size=(96,96))

valid_generator=train_datagen.flow_from_dataframe(dataframe=valid_df,directory="../input/histopathologic-cancer-detection/train/",
                x_col="id",y_col="label",batch_size=64,seed=1234,shuffle=True,
                class_mode="binary",target_size=(96,96))


In [54]:
#initial model with 4 sets of 2 convolutional layers
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',input_shape=(96,96,3)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())

model.add(Conv2D(128, (3, 3)))
model.add(Activation('relu'))
model.add(Conv2D(128, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())

model.add(Conv2D(256, (3, 3)))
model.add(Activation('relu'))
model.add(Conv2D(256, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
          
model.add(Flatten())
model.add(Dropout(0.25))
model.add(Dense(512))
model.add(Activation('relu'))

model.add(Dropout(0.25))
model.add(Dense(256))
model.add(Activation('relu'))

model.add(Dropout(0.25))
model.add(Dense(64))
model.add(Activation('relu')) 

model.add(Dropout(0.25))
model.add(Dense(1, activation='sigmoid'))
opt = tf.keras.optimizers.Adam(0.001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])


In [55]:
model.summary()

In [56]:
STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=valid_generator.n//valid_generator.batch_size

history = model.fit_generator(generator=train_generator,
                    steps_per_epoch=STEP_SIZE_TRAIN,
                    validation_data=valid_generator,
                    validation_steps=STEP_SIZE_VALID,
                    epochs=30, verbose=1
)

In [58]:
#next model with 3 sets of 5 convolutional layers
model2 = Sequential()
model2.add(Conv2D(32, (3, 3), padding='same',input_shape=(96,96,3)))
model2.add(Activation('relu'))
model2.add(Conv2D(32, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(32, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(32, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(32, (3, 3)))
model2.add(Activation('relu'))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(BatchNormalization())

model2.add(Conv2D(64, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(64, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(64, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(64, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(64, (3, 3)))
model2.add(Activation('relu'))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(BatchNormalization())

model2.add(Conv2D(128, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(128, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(128, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(128, (3, 3)))
model2.add(Activation('relu'))
model2.add(Conv2D(128, (3, 3)))
model2.add(Activation('relu'))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(BatchNormalization())
          
model2.add(Flatten())
model2.add(Dropout(0.25))
model2.add(Dense(512))
model2.add(Activation('relu'))

model2.add(Dropout(0.25))
model2.add(Dense(256))
model2.add(Activation('relu'))

model2.add(Dropout(0.25))
model2.add(Dense(64))
model2.add(Activation('relu')) 

model2.add(Dropout(0.25))
model2.add(Dense(1, activation='sigmoid'))
opt = tf.keras.optimizers.Adam(0.001)
model2.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])


In [59]:
model2.summary()

In [60]:
STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=valid_generator.n//valid_generator.batch_size

history2 = model2.fit_generator(generator=train_generator,
                    steps_per_epoch=STEP_SIZE_TRAIN,
                    validation_data=valid_generator,
                    validation_steps=STEP_SIZE_VALID,
                    epochs=30, verbose=1
)

I will now take a look at the plots of the accuracy of the training vs validation sets to see which model architecture appears to be the best.  I will then move forward with further hyperparameter tuning.

In [67]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [68]:
plt.plot(history2.history['accuracy'])
plt.plot(history2.history['val_accuracy'])
plt.title('model2 accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

So, we see the second model with more convolutional layers seems to be slightly better on the validation set.  We are able to reach about 90% accuracy after 10-15 epochs.  Using this model going forward, I will compare a different optimization method and finally different activations in order to tune the hyperparameters.

In [71]:
#next model with 3 sets of 5 convolutional layers
model3 = Sequential()
model3.add(Conv2D(32, (3, 3), padding='same',input_shape=(96,96,3)))
model3.add(Activation('relu'))
model3.add(Conv2D(32, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(32, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(32, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(32, (3, 3)))
model3.add(Activation('relu'))
model3.add(MaxPooling2D(pool_size=(2, 2)))
model3.add(BatchNormalization())

model3.add(Conv2D(64, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(64, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(64, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(64, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(64, (3, 3)))
model3.add(Activation('relu'))
model3.add(MaxPooling2D(pool_size=(2, 2)))
model3.add(BatchNormalization())

model3.add(Conv2D(128, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(128, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(128, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(128, (3, 3)))
model3.add(Activation('relu'))
model3.add(Conv2D(128, (3, 3)))
model3.add(Activation('relu'))
model3.add(MaxPooling2D(pool_size=(2, 2)))
model3.add(BatchNormalization())
          
model3.add(Flatten())
model3.add(Dropout(0.25))
model3.add(Dense(512))
model3.add(Activation('relu'))

model3.add(Dropout(0.25))
model3.add(Dense(256))
model3.add(Activation('relu'))

model3.add(Dropout(0.25))
model3.add(Dense(64))
model3.add(Activation('relu')) 

model3.add(Dropout(0.25))
model3.add(Dense(1, activation='sigmoid'))
opt = tf.keras.optimizers.RMSprop(0.001)
model3.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])


In [72]:
model3.summary()

In [73]:
STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=valid_generator.n//valid_generator.batch_size

history3 = model3.fit_generator(generator=train_generator,
                    steps_per_epoch=STEP_SIZE_TRAIN,
                    validation_data=valid_generator,
                    validation_steps=STEP_SIZE_VALID,
                    epochs=30, verbose=1
)

In [74]:
plt.plot(history3.history['accuracy'])
plt.plot(history3.history['val_accuracy'])
plt.title('model3 accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

So, we see the model with RMSprop as the optimizer does not seem to be as stable at least in the validation set and the accuracy still appears to be improving even after 30 epochs whereas the model utilizing Adam optimizer had converged by that point.  I will continue with the model utilizing Adam optimization, comparing the activation function (relu vs prelu) before determining what the final model will be.

In [28]:
#next model with 3 sets of 5 convolutional layers, using prelu activations
model4 = Sequential()
model4.add(Conv2D(32, (3, 3), padding='same',input_shape=(96,96,3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(32, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(32, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(32, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(32, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(MaxPooling2D(pool_size=(2, 2)))
model4.add(BatchNormalization())

model4.add(Conv2D(64, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(64, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(64, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(64, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(64, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(MaxPooling2D(pool_size=(2, 2)))
model4.add(BatchNormalization())

model4.add(Conv2D(128, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(128, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(128, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(128, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(Conv2D(128, (3, 3)))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))
model4.add(MaxPooling2D(pool_size=(2, 2)))
model4.add(BatchNormalization())
          
model4.add(Flatten())
model4.add(Dropout(0.25))
model4.add(Dense(512))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))

model4.add(Dropout(0.25))
model4.add(Dense(256))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))

model4.add(Dropout(0.25))
model4.add(Dense(64))
model4.add(PReLU(alpha_initializer=Constant(value=0.25)))

model4.add(Dropout(0.25))
model4.add(Dense(1, activation='sigmoid'))
opt = tf.keras.optimizers.RMSprop(0.001)
model4.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])


In [29]:
model4.summary()

In [30]:
STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=valid_generator.n//valid_generator.batch_size

history4 = model4.fit_generator(generator=train_generator,
                    steps_per_epoch=STEP_SIZE_TRAIN,
                    validation_data=valid_generator,
                    validation_steps=STEP_SIZE_VALID,
                    epochs=30, verbose=1
)

In [31]:
plt.plot(history4.history['accuracy'])
plt.plot(history4.history['val_accuracy'])
plt.title('model3 accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

So, we see with PReLU activation, the model converged faster (reaching nearly 100% accuracy on the training set within 30 epochs) and the validation set accuracy was much more stable.  This makes sense as the ReLU activation suffers from vanishing gradients and can essentially "turn off" nodes which may actually be the nodes we need for different instances.  From the images, we know that not every image is perfectly centered so the important "nodes" in our model could really occur anywhere and we shouldnt look to turn them completely off if at all possible.

In [32]:
test_set = os.listdir('../input/histopathologic-cancer-detection/test/')


In [33]:
test_df = pd.DataFrame(test_set)
test_df.columns = ['id']
test_df.head()

In [34]:
test_datagen=ImageDataGenerator(rescale=1/255)

test_generator=test_datagen.flow_from_dataframe(dataframe=test_df,directory="../input/histopathologic-cancer-detection/test/",
                x_col="id",batch_size=64,seed=1234,shuffle=False,
                class_mode=None,target_size=(96,96))

In [None]:
STEP_SIZE_TEST=test_generator.n/2

preds = model4.predict_generator(generator=test_generator,steps=STEP_SIZE_TEST, verbose = 1)

In [None]:
predictions = []

for pred in preds:
    if pred >= 0.5:
        predictions.append(1)
    else:
        predictions.append(0)
        
predictions[:10]

In [None]:
submission = test_df.copy()
submission['id']=submission['id'].str[:-4]
submission['label']=predictions
submission.head()

In [None]:
submission.to_csv('submission.csv',index=False)