# Colon Cancer #
### _Predicting the outcomes of colon cells to predict cancer_ ###

### Business Understanding ###
Colon cancer is the third most common cancer in the world, according to the World Cancer Research Fund. Given this statistic, it is no surprise that approximately 19 million colonoscopies are performed each year in the United States.

Some experts believe that the main causes of this cancer include the Western diet, a sedentary lifestyle, and obesity. Unfortunately, according to the CDC, obesity in the US is on an upward trend, which in turn increases the likelihood that men and women will develop colorectal cancers.

Although the mortality rate appears to be relatively low (80% survival rate), there is still room for improvement in the accuracy of test results, the time it takes to report those results, and the resources needed to compile them.

Currently, as per the American Cancer Society, it takes 2-3 days to report the findings of a colonoscopy biopsy.

### Objective ###
This notebook aims to identify the populations most deeply affected by colon cancer and to build a Convolutional Neural Network that comes close to the 1-2% accuracy of current tests. We will also strive for an efficient model that can give accurate results faster than 2-3 days, ideally within a "same-day" time frame.

Before doing so, we will look at some mortality rates among different populations and determine whether the economic status of a population affects the mortality rate of colon cancer.



In [8]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import cv2
import random

import PIL
import PIL.Image
import pathlib
# Packages to import and preprocess images
import glob
import shutil
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Packages for our models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Flatten, Conv2D, MaxPooling2D
%matplotlib inline

In [7]:
print(tf.__version__)

2.0.0


All folders from the Zenodo dataset were deleted except NORM (normal tissue) and TUM (cancer cells).

The directory-creation code below was inspired by [this video](https://www.youtube.com/watch?v=_L2uYfVV48I).

To keep the classes balanced, we use 24,000 images for model training, 1,800 images for validation, and 1,720 images for the test dataset used to generate predictions. This brings the total used in this Convolutional Neural Network to 27,520 images.

In [None]:
# os.chdir('../data/')
# if os.path.isdir('train/normal') is False:
#     os.makedirs('train/normal')
#     os.makedirs('train/cancer')
#     os.makedirs('validation/normal')
#     os.makedirs('validation/cancer')
#     os.makedirs('test/normal')
#     os.makedirs('test/cancer')

    
#     for image in random.sample(glob.glob('NCT-CRC-HE-100K/NORM/NORM*'), 8000):
#         shutil.move(image, 'train/normal')
#     for image in random.sample(glob.glob('NCT-CRC-HE-100K/TUM/TUM*'), 8000):
#         shutil.move(image, 'train/cancer')
#     for image in random.sample(glob.glob('NCT-CRC-HE-100K/NORM/NORM*'), 400):
#         shutil.move(image, 'validation/normal')
#     for image in random.sample(glob.glob('NCT-CRC-HE-100K/TUM/TUM*'), 400):
#         shutil.move(image, 'validation/cancer')
#     for image in random.sample(glob.glob('NCT-CRC-HE-100K/NORM/NORM*'), 360):
#         shutil.move(image, 'test/normal')
#     for image in random.sample(glob.glob('NCT-CRC-HE-100K/TUM/TUM*'), 360):
#         shutil.move(image, 'test/cancer')
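
The cell above keeps the three splits separate because each moved file disappears from later `glob` results. A minimal sketch of the same idea without touching the file system is shown below; `split_paths` and the file names are hypothetical, and the counts simply mirror the per-class numbers in the commented-out cell.

```python
import random

def split_paths(paths, n_train, n_val, n_test, seed=20):
    """Return three disjoint lists drawn at random from `paths`."""
    total = n_train + n_val + n_test
    if total > len(paths):
        raise ValueError("not enough files for the requested split")
    rng = random.Random(seed)                 # local RNG so the split is reproducible
    picked = rng.sample(list(paths), total)   # sampling without replacement
    train = picked[:n_train]
    val = picked[n_train:n_train + n_val]
    test = picked[n_train + n_val:]
    return train, val, test

# Hypothetical file names standing in for the contents of one class folder
fake_paths = [f"NORM-{i:05d}.tif" for i in range(10000)]
train, val, test = split_paths(fake_paths, 8000, 400, 360)
print(len(train), len(val), len(test))
```

Running the same function once per class gives equally sized `normal` and `cancer` splits, which is what keeps the dataset balanced.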

In [None]:
data_gen = ImageDataGenerator(
    rescale = 1./255,
    zoom_range = (0.95,0.95),
    brightness_range = [0.5, 1.0]
)

In [None]:
train_generator = data_gen.flow_from_directory(
    '../data/train',
    target_size = (224,224),
    batch_size = 20,
    color_mode = 'rgb',
    shuffle = True,
    class_mode = 'binary',
    seed = 20
)
validation_generator = data_gen.flow_from_directory(
    '../data/validation',
    target_size = (224,224),
    batch_size = 20,
    color_mode = 'rgb',
    shuffle = True,
    class_mode = 'binary',
    seed = 20
)

In [None]:
train_path = '../data/train/'
validation_path = '../data/validation/'
test_hold_path = '../data/test/'
categories = ['cancer', 'normal']

In [None]:
for category in categories:
    path_train = os.path.join(train_path,category)
    for img in os.listdir(path_train):
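        img_array_train = cv2.imread(os.path.join(path_train,img))
        # cv2 loads images as BGR; convert to RGB so matplotlib shows the true colors
        plt.imshow(cv2.cvtColor(img_array_train, cv2.COLOR_BGR2RGB))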
        plt.show()
        break
    break
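
One detail to keep in mind when displaying images read with `cv2.imread`: OpenCV stores channels in BGR order, while matplotlib expects RGB. The sketch below uses a tiny made-up array (not an image from the dataset) to show that reversing the last axis swaps the two orders.

```python
import numpy as np

# A 1x2 "image" in BGR order: one pure-blue pixel and one pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the channel axis turns BGR into RGB
rgb = bgr[..., ::-1]

print(rgb[0, 0].tolist())  # blue pixel in RGB order: [0, 0, 255]
print(rgb[0, 1].tolist())  # red pixel in RGB order: [255, 0, 0]
```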

In [None]:
train_data = []
validation_data = []
test_data = []

def create_training_data():
    """
    Loops over the images in train_path, resizes each one to 224x224, and appends it together
    with its category label to train_data to create our training data set.
    """
    for category in categories:
        path_train = os.path.join(train_path,category)
        class_num = categories.index(category)
        for img in os.listdir(path_train):
            img_array_train = cv2.imread(os.path.join(path_train,img))
            train_array = cv2.resize(img_array_train,(224, 224))
            train_data.append([train_array, class_num])
            
def create_validation_data():
    """
    Loops over the images in validation_path, resizes each one to 224x224, and appends it together
    with its category label to validation_data to create our validation data set.
    """
    for category in categories:
        path_test = os.path.join(validation_path,category)
        class_num = categories.index(category)
        for img in os.listdir(path_test):
            img_array_test = cv2.imread(os.path.join(path_test,img))
            test_array = cv2.resize(img_array_test,(224, 224))
            validation_data.append([test_array, class_num])
            
            
def create_testing_data():
    """
    Loops over the images in test_hold_path, resizes each one to 224x224, and appends it together
    with its category label to test_data to create our testing data set.
    """
    for category in categories:
        path_test = os.path.join(test_hold_path,category)
        class_num = categories.index(category)
        for img in os.listdir(path_test):
            img_array_test = cv2.imread(os.path.join(path_test,img))
            test_array = cv2.resize(img_array_test,(224, 224))
            test_data.append([test_array, class_num])

In [None]:
create_training_data()

In [None]:
train_data[0][0]

In [None]:
create_validation_data()

In [None]:
validation_data[0][0]

In [None]:
create_testing_data()

In [None]:
test_data[0][0]

In [None]:
random.seed(20)
random.shuffle(train_data)
random.shuffle(validation_data)
random.shuffle(test_data)

In [None]:
for sample in train_data[:1]:
    print('Train array', sample[0])
    print('Train label', sample[1])
    
    
for sample in validation_data[:1]:
    print('Validation array', sample[0])
    print('Validation label', sample[1])
    
for sample in test_data[:1]:
    print('Testing array', sample[0])
    print('Testing label', sample[1])

In [None]:
X_train, y_train, X_valid, y_valid, X_test, y_test = ([] for _ in range(6))

In [None]:
def create_features_label_from_array(array, x, y):
    """
    Separates the features and the label from a specified array and appends it to specified X list and y list.
    """
    for features, label in array:
        x.append(features)
        y.append(label)

In [None]:
create_features_label_from_array(train_data, X_train, y_train)
create_features_label_from_array(validation_data, X_valid, y_valid)
create_features_label_from_array(test_data, X_test, y_test)

In [None]:
print(len(X_test))
print(len(y_test))

In [None]:
X_train = np.array(X_train)
X_valid = np.array(X_valid)
X_test = np.array(X_test)

In [None]:
print('Training shape: ', X_train.shape)
print('Validation shape: ', X_valid.shape)
print('Testing shape: ', X_test.shape)

In [None]:
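# Scale pixel values to [0, 1]; float32 halves the memory used by the default float64
X_train = X_train.astype('float32') / 255.0
X_valid = X_valid.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0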

In [None]:
y_train = tf.keras.utils.to_categorical(y_train, num_classes=2)
y_test = tf.keras.utils.to_categorical(y_test, num_classes=2)
y_valid = tf.keras.utils.to_categorical(y_valid, num_classes=2)

In [None]:
print('Training shape: ', y_train.shape)
print('Validation shape: ', y_valid.shape)
print('Testing shape: ', y_test.shape)
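
The integer labels come from `categories.index(category)`, so `cancer` is 0 and `normal` is 1, and one-hot encoding turns each label into a row with a single 1. The sketch below illustrates that encoding with plain NumPy; `one_hot` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Sketch of one-hot encoding: row i has a 1 in column labels[i]."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype='float32')
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

categories = ['cancer', 'normal']
# Labels are the position of each class name in `categories`
example_labels = [categories.index(name) for name in ['cancer', 'normal', 'normal']]
encoded_example = one_hot(example_labels, num_classes=2)
print(encoded_example)
```

Each row has as many columns as there are classes, which is why the label arrays printed above have a second dimension of 2.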

# Model 1 #

In [None]:
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(3,3), activation='tanh', padding='same', input_shape=(224,224,3)))
model.add(MaxPooling2D(pool_size=(4,4)))
model.add(Flatten())
model.add(Dense(64, activation='tanh'))
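# The generators use class_mode='binary' (one 0/1 label per image), so a single sigmoid unit is used
model.add(Dense(1, activation='sigmoid'))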

In [None]:
model.summary()

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.fit(x=train_generator,
          validation_data = validation_generator,
          epochs=5)

According to the dataset documentation, the images in the Zenodo dataset are 224x224 pixels, while the images in the Kaggle dataset are 768x768. We will therefore make 224x224 the standard size across all images.

In [None]:
train_dataset_batch = ImageDataGenerator(rescale=1./255).flow_from_directory(directory=train_path, 
                                                                             target_size=(224,224), 
                                                                             classes=['cancer', 'normal'], 
                                                                             batch_size=100)
validation_dataset_batch = ImageDataGenerator(rescale=1./255).flow_from_directory(directory=validation_path,
                                                                                  target_size=(224,224), 
                                                                                  classes=['cancer', 'normal'], 
                                                                                  batch_size=100)
test_dataset_batch = ImageDataGenerator(rescale=1./255).flow_from_directory(directory=test_hold_path,
                                                                            target_size=(224,224), 
                                                                            classes=['cancer', 'normal'], 
                                                                            batch_size=100)

In [None]:
images_train, label_train = next(train_dataset_batch)
images_validation, label_validation = next(validation_dataset_batch)

In [None]:
def plot_image(img):
    fig, axes = plt.subplots(1,10, figsize=(10,10))
    axes = axes.flatten()
    for image, ax in zip(img, axes):
        ax.imshow(image)
        ax.axis('off')
    plt.tight_layout()
    plt.show()

Based on the order in which we listed the classes when defining our batches (`['cancer', 'normal']`), [1,0] refers to a cancer cell and [0,1] refers to a normal cell.

In [None]:
print(images_train.shape)
print(images_validation.shape)

## Build the first CNN ##

In [None]:
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(3,3), activation='tanh', padding='same', input_shape=(224,224,3)))
model.add(MaxPooling2D(pool_size=(4,4)))
model.add(Flatten())
model.add(Dense(64, activation='tanh'))
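# Labels from these batches are one-hot vectors, so the output layer uses softmax to produce class probabilities
model.add(Dense(2, activation='softmax'))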

In [None]:
model.summary()
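
As a sanity check on the summary, the parameter counts can be worked out by hand from the layer definitions. The sketch below uses two small hypothetical helpers and assumes the architecture defined in the cell above (a 224x224x3 input, 32 filters of 3x3 with `padding='same'`, 4x4 max pooling, then dense layers of 64 and 2 units).

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

height, width, channels = 224, 224, 3
conv = conv2d_params(3, 3, channels, 32)

# padding='same' keeps the 224x224 size; 4x4 pooling divides each side by 4
pooled_h, pooled_w = height // 4, width // 4
flat = pooled_h * pooled_w * 32          # size of the Flatten output

dense_1 = dense_params(flat, 64)
dense_2 = dense_params(64, 2)
total_params = conv + dense_1 + dense_2
print(conv, flat, dense_1, dense_2, total_params)
```

Almost all of the parameters sit in the first dense layer, because the flattened feature map is large.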

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.fit(x=train_dataset_batch,
          validation_data = (validation_dataset_batch),
          epochs=5)