<a href="https://colab.research.google.com/github/edgarbarr1/colon-cancer-cnn/blob/main/colon_cancer_image.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Colon Cancer #
### _Classifying images of colon tissue to predict cancer_ ###

### Business Understanding ###
Colon cancer is the third most common cancer in the world, according to the World Cancer Research Fund. Given this statistic, it is no surprise that approximately 19 million colonoscopies are performed each year in the United States.

Some experts believe that the main causes of this cancer include the Western diet, a sedentary lifestyle, and obesity. Unfortunately, according to the CDC, obesity in the US appears to be on an upward trend, which in turn increases the likelihood that men and women develop colorectal cancers.

Although the mortality rate appears to be relatively low for the most part (80% survival rate), there is always room for improvement, whether in the accuracy of test results, the time it takes to report those results, or the resources available to compile them.

Currently, as per the American Cancer Society, it takes 2-3 days to report the findings of a colonoscopy biopsy.

### Objective ###
The objective of this notebook is to identify the populations most deeply affected by colon cancer and to build a Convolutional Neural Network that comes close to the 1-2% accuracy figure reported for current tests. We will also strive for an efficient model that can give accurate results faster than 2-3 days, ideally within a "same-day" time frame.

Before doing so, we will look at some mortality rates among different populations and determine whether the economic status of a population affects the mortality rate of colon cancer.



In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import cv2
import random

import PIL
import PIL.Image
import pathlib
# Packages to import and preprocess images
import glob
import shutil
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from google.colab import drive

# Packages for our models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Flatten, Conv2D, MaxPooling2D, LeakyReLU
from sklearn.metrics import confusion_matrix
%matplotlib inline

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
print(tf.__version__)

2.5.0


All folders from the Zenodo dataset were deleted except NORM (normal cells) and TUM (cancer cells).

The function that creates the directories was inspired by [this video](https://www.youtube.com/watch?v=_L2uYfVV48I).

To keep the classes balanced, we will use 16,000 images for model training (8,000 per class), 800 images for validation (400 per class), and 720 images for the test dataset used to generate predictions (360 per class). This brings the total number of images used in this Convolutional Neural Network to 17,520.
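As a quick sanity check on the split sizes, the per-class sample counts can be tallied as below. The counts are taken from the `random.sample(...)` calls later in this notebook and agree with the "Found 16000 images" and "Found 800 images" generator output; the variable names are illustrative only.

```python
# Per-class sample sizes, as used in the random.sample(...) calls of this notebook.
per_class = {"train": 8000, "validation": 400, "test": 360}
n_classes = 2  # NORM (normal) and TUM (cancer)

# Using the same count for every class keeps each split balanced.
split_totals = {name: count * n_classes for name, count in per_class.items()}
total_images = sum(split_totals.values())

for name, count in split_totals.items():
    print(f"{name}: {count} images ({count / total_images:.1%} of the total)")
print(f"total: {total_images} images")
```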

In [None]:
# PIL.Image.open('/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/NORM/NORM-AAAKGLVQ.tif')

In [None]:
# os.listdir()

In [None]:
# normal_image_count = len(list(glob.glob('/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/NORM/*.tif')))
# cancer_image_count = len(list(glob.glob('/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/TUM/*.tif')))
# print('Normal images: {}'.format(normal_image_count))
# print('Cancer images: {}'.format(cancer_image_count))

The images in the dataset are in TIF format. Let's convert them to JPEG.
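The conversion can be planned before any file is written. The sketch below uses only the standard library to list every `.tif` file that does not yet have a `.jpg` next to it, which mirrors the "skip if a JPEG already exists" check in the conversion cell. The function name `plan_conversions` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def plan_conversions(root):
    """Return (tif_path, jpg_path) pairs for every .tif under root with no .jpg yet."""
    pairs = []
    for tif_path in sorted(Path(root).rglob("*")):
        # Compare the lower-cased extension so that .TIF files are included too.
        if tif_path.is_file() and tif_path.suffix.lower() == ".tif":
            jpg_path = tif_path.with_suffix(".jpg")
            if not jpg_path.exists():
                pairs.append((tif_path, jpg_path))
    return pairs


# With Pillow available, each planned pair could then be converted with, e.g.:
#     PIL.Image.open(tif_path).save(jpg_path, "JPEG", quality=100)
```

Separating the planning step from the conversion makes it easy to do a dry run on Google Drive, where every file operation is slow.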

In [None]:
# Cell to convert the TIF images to jpeg

# paths = ['/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/NORM',
#          '/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/TUM']
# for path in paths:
#   for root, dirs, files in os.walk(path, topdown=False,):
#       for name in files:
#           print(os.path.join(root, name))
#           #if os.path.splitext(os.path.join(root, name))[1].lower() == ".tiff":
#           if os.path.splitext(os.path.join(root, name))[1].lower() == ".tif":
#               if os.path.isfile(os.path.splitext(os.path.join(root, name))[0] + ".jpg"):
#                   print ("A jpeg file already exists for %s" % name)
#               # If a jpeg with the name does *NOT* exist, convert one from the tif.
#               else:
#                   outputfile = os.path.splitext(os.path.join(root, name))[0] + ".jpg"
#                   try:
#                       im = PIL.Image.open(os.path.join(root, name))
#                       print ("Converting jpeg for %s" % name)
#                       im.thumbnail(im.size)
#                       im.save(outputfile, "JPEG", quality=100)
#                   except Exception as e: 
#                     print(e)

# Jpegs have been converted. Now separate into subdirectories.

Now let's divide the images into subdirectories.

In [None]:
# os.chdir('drive/MyDrive/colon_dataset')

In [None]:
# os.listdir()

In [None]:
# Makes the subdirectories for the training, validation, and testing data.
# if os.path.isdir('train/normal') is False:
#     os.makedirs('train/normal')
#     os.makedirs('train/cancer')
#     os.makedirs('validation/normal')
#     os.makedirs('validation/cancer')
#     os.makedirs('test/')

Currently we have TIF and JPEG duplicates in our directory. Now we will move the JPEG images to the newly created directories.
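A minimal sketch of the move step, assuming a single shuffled list per class that is then sliced into disjoint train, validation, and test parts. The helper `split_class_images` and its arguments are hypothetical. Unlike the notebook, which moves all test images into one `test/` folder, this sketch also places test images in a per-class subfolder; that is an assumption made so the labels stay recoverable from the directory layout.

```python
import random
import shutil
from pathlib import Path


def split_class_images(src_dir, dest_root, label, sizes, seed=20):
    """Move disjoint random samples of src_dir/*.jpg into dest_root/<split>/<label>/."""
    images = sorted(Path(src_dir).glob("*.jpg"))
    needed = sum(sizes.values())
    if len(images) < needed:
        raise ValueError(f"need {needed} images, found {len(images)}")

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    rng.shuffle(images)

    moved = {}
    start = 0
    for split, count in sizes.items():
        target = Path(dest_root) / split / label
        target.mkdir(parents=True, exist_ok=True)
        # Consecutive slices of one shuffled list can never overlap.
        for image in images[start:start + count]:
            shutil.move(str(image), str(target / image.name))
        moved[split] = count
        start += count
    return moved
```

A call such as `split_class_images('.../NCT-CRC-HE-100K/NORM', '.../colon_dataset', 'normal', {'train': 8000, 'validation': 400, 'test': 360})` would reproduce the per-class counts used in this notebook.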

In [None]:
# Run this code to move the images from the original dataset directory to the newly created dataset that the model will read.
# for image in random.sample(glob.glob('/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/NORM/*.jpg'), 8000):
#   shutil.move(image, '/content/drive/MyDrive/colon_dataset/train/normal/')


# for image in random.sample(glob.glob('/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/TUM/*.jpg'), 8000):
#   shutil.move(image, '/content/drive/MyDrive/colon_dataset/train/cancer/')


# for image in random.sample(glob.glob('/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/NORM/*.jpg'), 400):
#   shutil.move(image, '/content/drive/MyDrive/colon_dataset/validation/normal/')


# for image in random.sample(glob.glob('/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/TUM/*.jpg'), 400):
#   shutil.move(image, '/content/drive/MyDrive/colon_dataset/validation/cancer/')


# for image in random.sample(glob.glob('/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/NORM/*.jpg'), 360):
#   shutil.move(image, '/content/drive/MyDrive/colon_dataset/test/')


# for image in random.sample(glob.glob('/content/drive/MyDrive/colon_dataset/NCT-CRC-HE-100K/TUM/*.jpg'), 360):
#   shutil.move(image, '/content/drive/MyDrive/colon_dataset/test/')

In [None]:
# current_path_list = ['/content/drive/MyDrive/colon_dataset/validation',
#                      '/content/drive/MyDrive/colon_dataset/test',
#                      '/content/drive/MyDrive/colon_dataset/train']

# Image Data Generator #

In [None]:
data_gen = ImageDataGenerator(
    rescale = 1./255,
    zoom_range = (0.95,0.95),
    brightness_range = [0.5, 1.0]
)

In [None]:
# Training batches: 224x224 RGB images with binary labels, reshuffled every epoch.
train_generator = data_gen.flow_from_directory(
    '/content/drive/MyDrive/colon_dataset/train',
    target_size = (224,224),
    batch_size = 20,
    color_mode = 'rgb',
    shuffle = True,
    class_mode = 'binary',
    seed = 20
)
# Validation batches come from their own directory, so no `subset` argument is
# needed (it only has an effect together with `validation_split`).
validation_generator = data_gen.flow_from_directory(
    '/content/drive/MyDrive/colon_dataset/validation',
    target_size = (224,224),
    batch_size = 20,
    color_mode = 'rgb',
    shuffle = True,
    class_mode = 'binary',
    seed = 20
)

Found 16000 images belonging to 2 classes.
Found 800 images belonging to 2 classes.


# Model 1 #

Let's create our first model based on the data generator that we created above.

Our first model will be a rather simple one with the following layers:

1.   A convolutional layer with 32 filters and a (5,5) kernel size, i.e. the height and width of our convolutional window.
2.   A max pooling layer with a pool height and width of (3,3).
3.   A dense layer with an output size of 64.
4.   A flatten layer followed by a single-unit dense output layer with a sigmoid activation.

In [None]:
model = Sequential()
model.add(Conv2D(32, kernel_size=(5,5), activation='relu' , input_shape = (224,224,3)))
model.add(MaxPooling2D(3,3))
model.add(Dense(64, activation='relu'))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 220, 220, 32)      2432      
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 73, 73, 32)        0         
_________________________________________________________________
dense (Dense)                (None, 73, 73, 64)        2112      
_________________________________________________________________
flatten (Flatten)            (None, 341056)            0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 341057    
Total params: 345,601
Trainable params: 345,601
Non-trainable params: 0
_________________________________________________________________
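The shapes and parameter counts in the summary above can be checked by hand. The sketch below reproduces them with plain arithmetic, assuming `padding='valid'` (the Keras default) for both the convolution and the pooling layer; the helper names are illustrative.

```python
def conv2d_valid(size, kernel, stride=1):
    """Output height/width of a Conv2D layer with padding='valid'."""
    return (size - kernel) // stride + 1


def max_pool_valid(size, pool):
    """Output height/width of MaxPooling2D when the stride equals the pool size."""
    return (size - pool) // pool + 1


def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(in_features, units):
    """Weights plus one bias per unit; Dense acts on the last axis only."""
    return in_features * units + units


conv_out = conv2d_valid(224, 5)          # 220
pool_out = max_pool_valid(conv_out, 3)   # 73
flat = pool_out * pool_out * 64          # 341056 values after Flatten

total = conv2d_params(5, 3, 32) + dense_params(32, 64) + dense_params(flat, 1)
print(conv_out, pool_out, flat, total)   # 220 73 341056 345601
```

Almost all of the 345,601 parameters sit in the final dense layer, because the 64-unit dense layer is applied before `Flatten` and therefore leaves the 73x73 spatial grid intact.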


In [None]:
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.layers import Dropout
import seaborn as sns

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[Precision(), Recall()])

In [None]:
history = model.fit(x= train_generator,
                    validation_data = validation_generator,
                    epochs = 5)

Epoch 1/5

In [None]:
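# Predict from unshuffled generators so that the order of the predictions matches
# `.labels` (the generators above use shuffle=True, which would misalign the two).
eval_kwargs = dict(target_size=(224,224), batch_size=20, class_mode='binary', shuffle=False)
train_eval_generator = data_gen.flow_from_directory(
    '/content/drive/MyDrive/colon_dataset/train', **eval_kwargs)
validation_eval_generator = data_gen.flow_from_directory(
    '/content/drive/MyDrive/colon_dataset/validation', **eval_kwargs)
preds_train_1 = model.predict(train_eval_generator, verbose=1, workers=8)
preds_val_1 = model.predict(validation_eval_generator, verbose=1, workers=8)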



In [None]:
model_metrics = model.evaluate(validation_generator)
model_metrics



[0.47471725940704346, 0.782608687877655, 0.8550000190734863]
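The list returned by `evaluate` is ordered as loss, precision, recall, following the metrics passed to `compile`. A minimal sketch of combining the last two into a single F1 score is shown below; the `f1_score` helper is illustrative and not part of the notebook. Assuming the default alphabetical class indexing of `flow_from_directory` (cancer as 0, normal as 1), these metrics treat "normal" as the positive class.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluate() output above: [loss, precision, recall].
loss, precision, recall = [0.47471725940704346, 0.782608687877655, 0.8550000190734863]
f1 = f1_score(precision, recall)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```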

In [None]:
perf_df = pd.DataFrame(columns=['model', 'loss', 'precision', 'recall'])
perf_df.loc[len(perf_df.index)] = ['model_1'] + model_metrics

In [None]:
perf_df

In [None]:
def plot_confusion_matrix(y_true, y_pred):
    """Plot a labeled confusion matrix for the binary cancer/normal labels."""
    cm = confusion_matrix(y_true, y_pred)

    ax = plt.subplot()
    # annot=True to annotate cells, fmt='g' to disable scientific notation
    sns.heatmap(cm, annot=True, ax=ax, fmt='g', cmap='magma', linewidths=1, linecolor='black')

    # labels, title and ticks
    # flow_from_directory indexes classes alphabetically: cancer -> 0, normal -> 1
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')
    ax.xaxis.set_ticklabels(['CANCER', 'NORMAL'])
    ax.yaxis.set_ticklabels(['CANCER', 'NORMAL'])
    plt.show()

In [None]:
plot_confusion_matrix(train_generator.labels, np.rint(preds_train_1))

In [None]:
def visualize_training_results_1(history):
    '''
    From https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/
    
    Input: keras history object (output from trained model)
    '''
    fig, (ax1, ax2, ax3) = plt.subplots(3, sharex=True)
    fig.suptitle('Model Results')

    # summarize history for recall
    ax1.plot(history.history['recall'])
    ax1.plot(history.history['val_recall'])
    ax1.set_ylabel('Recall')
    ax1.legend(['train', 'test'], loc='upper left')
    # summarize history for loss
    ax2.plot(history.history['loss'])
    ax2.plot(history.history['val_loss'])
    ax2.set_ylabel('Loss')
    ax2.legend(['train', 'test'], loc='upper left')
    
    ax3.plot(history.history['precision'])
    ax3.plot(history.history['val_precision'])
    ax3.set_ylabel('Precision')
    ax3.legend(['train', 'test'], loc='upper left')
    
    plt.xlabel('Epoch')
    plt.show()

In [None]:
visualize_training_results_1(history)

# Model 2 #

In [None]:
model_2 = Sequential()
model_2.add(Conv2D(16, kernel_size=(5,5), padding='valid', input_shape = (224,224,3)))
model_2.add(MaxPooling2D(3,3))
model_2.add(Dense(32, activation='relu'))
model_2.add(Flatten())
model_2.add(Dense(128, activation='relu'))
model_2.add(LeakyReLU(alpha=(.3)))
model_2.add(Dropout(.20))
model_2.add(Dense(256, activation='relu'))
model_2.add(Dense(1, activation='sigmoid'))

In [None]:
model_2.summary()

In [None]:
model_2.compile(loss='binary_crossentropy', optimizer='adam', metrics=[Precision(), Recall()])

In [None]:
# The batch size is already set by the generators (batch_size=20), so it is not
# passed to fit() again.
model_2.fit(
    x = train_generator,
    validation_data = validation_generator,
    epochs = 10
)

In [None]:
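# Predict from unshuffled generators so that the order of the predictions matches
# `.labels` (the generators used for training have shuffle=True).
eval_kwargs = dict(target_size=(224,224), batch_size=20, class_mode='binary', shuffle=False)
train_eval_generator = data_gen.flow_from_directory(
    '/content/drive/MyDrive/colon_dataset/train', **eval_kwargs)
validation_eval_generator = data_gen.flow_from_directory(
    '/content/drive/MyDrive/colon_dataset/validation', **eval_kwargs)
preds_train_2 = model_2.predict(train_eval_generator, verbose=1, workers=8)
preds_val_2 = model_2.predict(validation_eval_generator, verbose=1, workers=8)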

In [None]:
model_metrics = model_2.evaluate(validation_generator)
model_metrics

In [None]:
# Append to the existing table instead of re-creating it, so the first model's row is kept.
perf_df.loc[len(perf_df.index)] = ['model_2'] + model_metrics

In [None]:
plot_confusion_matrix(train_generator.labels, np.rint(preds_train_2))

In [None]:
model_2.history.history

In [None]:
def visualize_training_results(history, iteration):
    '''
    From https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/
    
    Input: keras history object (output from trained model)
    '''
    fig, (ax1, ax2, ax3) = plt.subplots(3, sharex=True)
    fig.suptitle('Model Results')

    # summarize history for recall
    ax1.plot(history.history['recall_{}'.format(iteration)])
    ax1.plot(history.history['val_recall_{}'.format(iteration)])
    ax1.set_ylabel('Recall')
    ax1.legend(['train', 'test'], loc='upper left')
    # summarize history for loss
    ax2.plot(history.history['loss'])
    ax2.plot(history.history['val_loss'])
    ax2.set_ylabel('Loss')
    ax2.legend(['train', 'test'], loc='upper left')
    
    ax3.plot(history.history['precision_{}'.format(iteration)])
    ax3.plot(history.history['val_precision_{}'.format(iteration)])
    ax3.set_ylabel('Precision')
    ax3.legend(['train', 'test'], loc='upper left')
    
    plt.xlabel('Epoch')
    plt.show()

In [None]:
visualize_training_results(model_2.history, 3)
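Keras appends a numeric suffix to a metric name when the same metric class is instantiated more than once in a session (for example `recall_1`), so the `iteration` argument above depends on how many models have been compiled so far. One way to avoid hard-coding it is to look the key up by its base name. The helper below is a hypothetical sketch, and the history dictionary uses made-up placeholder values rather than results from this notebook.

```python
def find_metric_key(history_dict, base_name, validation=False):
    """Return the history key for base_name, whatever numeric suffix Keras gave it."""
    prefix = ("val_" if validation else "") + base_name
    matches = []
    for key in history_dict:
        if not key.startswith(prefix):
            continue
        suffix = key[len(prefix):]
        # Accept 'recall' as well as 'recall_1', 'recall_2', ...
        if suffix == "" or (suffix[0] == "_" and suffix[1:].isdigit()):
            matches.append(key)
    if len(matches) != 1:
        raise KeyError(f"expected exactly one key for {prefix!r}, found {matches}")
    return matches[0]


# Made-up placeholder values, shaped like `model_2.history.history`.
example_history = {
    "loss": [0.6, 0.5],
    "precision_1": [0.70, 0.75],
    "recall_1": [0.80, 0.82],
    "val_loss": [0.65, 0.55],
    "val_precision_1": [0.68, 0.72],
    "val_recall_1": [0.78, 0.80],
}

print(find_metric_key(example_history, "recall"))                      # recall_1
print(find_metric_key(example_history, "precision", validation=True))  # val_precision_1
```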

# Model 3 #

In [None]:
model_3 = Sequential()
model_3.add(Conv2D(256, kernel_size=(10,10), activation='relu', input_shape = (224,224,3)))
model_3.add(MaxPooling2D(7,7))
model_3.add(Dense(256, activation='relu'))
model_3.add(Conv2D(128, kernel_size=(7,7), activation='relu'))
model_3.add(MaxPooling2D(5,5))
model_3.add(Dense(64))
model_3.add(Flatten())
model_3.add(Dropout(.15))
model_3.add(Dense(512,activation='relu'))
model_3.add(LeakyReLU(alpha=.2))
model_3.add(Dropout(.15))
model_3.add(Dense(512, activation='relu'))
model_3.add(Dense(1, activation='sigmoid'))

In [None]:
model_3.summary()

In [None]:
model_3.compile(loss='binary_crossentropy', optimizer='adam', metrics=[Precision(), Recall()])

In [None]:
model_3.fit(x=train_generator,
            validation_data = validation_generator,
            epochs=10)

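In [None]:
# Directory paths for the batches below. The train and validation paths are the
# ones used by the generators earlier in this notebook. `test_hold_path` was never
# defined in the original notebook, so its value here is an assumption; the test
# folder also needs 'cancer' and 'normal' subfolders for any images to be found.
train_path = '/content/drive/MyDrive/colon_dataset/train'
validation_path = '/content/drive/MyDrive/colon_dataset/validation'
test_hold_path = '/content/drive/MyDrive/colon_dataset/test'  # assumed

According to the dataset documentation, the images in the Zenodo dataset are 224x224 pixels, while the images in the Kaggle dataset are 768x768. We will therefore make 224x224 the standard size across all images.

In [None]:
train_dataset_batch = ImageDataGenerator(rescale=1./255).flow_from_directory(directory=train_path, 
                                                                             target_size=(224,224), 
                                                                             classes=['cancer', 'normal'], 
                                                                             batch_size=100)
validation_dataset_batch = ImageDataGenerator(rescale=1./255).flow_from_directory(directory=validation_path,
                                                                                  target_size=(224,224), 
                                                                                  classes=['cancer', 'normal'], 
                                                                                  batch_size=100)
test_dataset_batch = ImageDataGenerator(rescale=1./255).flow_from_directory(directory=test_hold_path,
                                                                            target_size=(224,224), 
                                                                            classes=['cancer', 'normal'], 
                                                                            batch_size=100)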

In [None]:
images_train, label_train = next(train_dataset_batch)
images_validation, label_validation = next(validation_dataset_batch)

In [None]:
def plot_image(img):
    # Show up to ten images from a batch side by side, without axes.
    fig, axes = plt.subplots(1,10, figsize=(10,10))
    axes = axes.flatten()
    for image, ax in zip(img, axes):
        ax.imshow(image)
        ax.axis('off')
    plt.tight_layout()
    plt.show()

Based on the order in which we listed the classes when defining our batches (`classes=['cancer', 'normal']`), [1,0] refers to a cancer cell and [0,1] refers to a normal cell.
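When `classes=['cancer', 'normal']` is passed explicitly and `class_mode` is left at its default (`'categorical'`), the position of a name in that list is its class index, and the labels are one-hot vectors over those indices. The sketch below illustrates the mapping in plain Python; the helper names are illustrative and not part of the notebook.

```python
classes = ["cancer", "normal"]  # same order as the `classes` argument above

# Position in the list becomes the class index: cancer -> 0, normal -> 1.
class_indices = {name: index for index, name in enumerate(classes)}


def one_hot(name):
    """One-hot label vector for a class name."""
    vector = [0] * len(classes)
    vector[class_indices[name]] = 1
    return vector


def decode(vector):
    """Class name for a one-hot (or probability) vector: the position of the maximum."""
    best = max(range(len(vector)), key=lambda i: vector[i])
    return classes[best]


print(one_hot("cancer"))  # [1, 0]
print(one_hot("normal"))  # [0, 1]
print(decode([0, 1]))     # normal
```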

In [None]:
print(images_train.shape)
print(images_validation.shape)

## Build the first CNN ##

In [None]:
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(3,3), activation='tanh', padding='same', input_shape=(224,224,3)))
model.add(MaxPooling2D(pool_size=(4,4)))
model.add(Flatten())
model.add(Dense(64, activation='tanh'))
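# Two output units match the one-hot labels; softmax turns them into class
# probabilities (an unbounded ReLU output is not a valid probability).
model.add(Dense(2, activation='softmax'))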

In [None]:
model.summary()

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[Precision()])

In [None]:
model.fit(x=train_dataset_batch,
          validation_data = (validation_dataset_batch),
          epochs=5)