<a href="https://colab.research.google.com/github/briankosiadi/Capstone-4/blob/master/Lung_Cancer_Neural_Networks_Part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

For my final capstone, I will build a convolutional neural network on a chest radiograph image dataset. Lung cancer is the most fatal form of cancer, and it is also one of the most difficult conditions for radiologists to detect and diagnose. The NIH Clinical Center released a dataset of over 100,000 de-identified [images](https://nihcc.app.box.com/v/ChestXray-NIHCC) of chest radiographs, which I will use for my model. The aim of this project is to create a model that performs well enough to be used by medical clinics as an additional opinion when reading chest radiographs.

<h2>1. Data Acquisition and Preprocessing
</h2>

In [0]:
# I uploaded the image data I'm working with to google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# I use tensorflow 2.1.0 in this notebook
try: 
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

TensorFlow 2.x selected.
Version:  2.1.0
Eager mode:  True
GPU is available


In [0]:
import pandas as pd
import numpy as np
import os
import cv2
import glob
import shutil
from tensorflow.keras import optimizers
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.image import ImageDataGenerator, img_to_array
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Activation, Dropout, Flatten, Dense, BatchNormalization
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
from tensorflow.keras.models import load_model
import warnings
warnings.filterwarnings("ignore")

In [0]:
# for image in glob.glob('/content/drive/My Drive/Thinkful Datasets/final_images/train/positives/*.png'):
#   print(image)

Part of my data consists of CSV files that contain the image labels. Here I create dataframes from the CSV files so that I can sort the images into the proper directories in my drive.

In [0]:
# df_test = pd.read_csv('/content/drive/My Drive/Thinkful Datasets/test_labels.csv')

In [0]:
# df_train = pd.read_csv('/content/drive/My Drive/Thinkful Datasets/validation_labels.csv')

The dataset is labeled with 14 different diagnoses, but for this project I am only using two: nodules and masses.

In [0]:
# df_train['label'] = df_train['Finding Labels'].apply(lambda x: True if ('Mass' in x or 'Nodule' in x) else False)

In [0]:
# df_test['label'] = df_test['Finding Labels'].apply(lambda x: True if ('Mass' in x or 'Nodule' in x) else False)
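
The labelling rule in the cells above can be sketched as a small standalone function. This is a minimal sketch rather than the code used for the project: the helper name `has_target_finding` is hypothetical, and it assumes that `Finding Labels` stores multiple findings separated by `|` (for example `Mass|Nodule`), which should be checked against the CSV files. Unlike the substring test, it compares whole finding names.

```python
# Hypothetical helper; assumes findings in 'Finding Labels' are '|'-separated.
TARGET_FINDINGS = {"Mass", "Nodule"}


def has_target_finding(finding_labels: str) -> bool:
    """Return True if any '|'-separated finding is a mass or a nodule."""
    findings = {part.strip() for part in finding_labels.split("|")}
    return not TARGET_FINDINGS.isdisjoint(findings)


# Made-up label strings for illustration only.
examples = ["Mass", "Effusion|Nodule", "No Finding", "Cardiomegaly|Edema"]
flags = [has_target_finding(text) for text in examples]
print(flags)  # [True, True, False, False]
```

With a dataframe, the same function could be applied as `df_train['label'] = df_train['Finding Labels'].apply(has_target_finding)`.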

In [0]:
# train_positives = df_train.loc[df_train['label']==True]
# train_negatives = df_train.loc[df_train['label']==False]

In [0]:
# test_positives = df_test.loc[df_test['label']==True]
# test_negatives = df_test.loc[df_test['label']==False]

In [0]:
# train_positives = train_positives[['Image Index', 'Finding Labels']]
# train_negatives = train_negatives[['Image Index', 'Finding Labels']]

In [0]:
# test_positives = test_positives[['Image Index', 'Finding Labels']]
# test_negatives = test_negatives[['Image Index', 'Finding Labels']]

In [0]:
# train_negatives.info()

For the following several cells, I ran the code once for each of the 12 image folders to move the images into directories by label, but I had to delete already sorted images to free up space. Thus, only the twelfth and final iteration is shown below. Afterwards, I manually moved images in my drive to create a 50/50 class balance and a 70/30 train/test split.

In [0]:
# for image in train_positives['Image Index']:
#   try:
#     current_loc = '/content/drive/My Drive/Thinkful Datasets/images_12/'+image
#     target_loc = '/content/drive/My Drive/Thinkful Datasets/final_images/train/positives/'
#     shutil.move(current_loc, target_loc)
#   except:
#     pass

In [0]:
# for image in train_negatives['Image Index']:
#   try:
#     current_loc = '/content/drive/My Drive/Thinkful Datasets/images_12/'+image
#     target_loc = '/content/drive/My Drive/Thinkful Datasets/final_images/train/negatives/'
#     shutil.move(current_loc, target_loc)
#   except:
#     pass

In [0]:
# for image in test_positives['Image Index']:
#   try:
#     current_loc = '/content/drive/My Drive/Thinkful Datasets/images_12/'+image
#     target_loc = '/content/drive/My Drive/Thinkful Datasets/final_images/test/positives/'
#     shutil.move(current_loc, target_loc)
#   except:
#     pass

In [0]:
# for image in test_negatives['Image Index']:
#   try:
#     current_loc = '/content/drive/My Drive/Thinkful Datasets/images_12/'+image
#     target_loc = '/content/drive/My Drive/Thinkful Datasets/final_images/test/negatives/'
#     shutil.move(current_loc, target_loc)
#   except:
#     pass
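
The move loops above use a bare `except` to skip image names that are not in the current folder, which also hides other problems such as a mistyped path. A possible alternative is sketched here, assuming the same "move if present" intent: the helper name `move_images` and the file names are hypothetical, and the demonstration runs on temporary directories rather than on the drive.

```python
import shutil
import tempfile
from pathlib import Path


def move_images(image_names, source_dir, target_dir):
    """Move the named files that exist in source_dir; return (moved, missing)."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    moved, missing = [], []
    for name in image_names:
        source_path = source_dir / name
        if source_path.is_file():
            shutil.move(str(source_path), str(target_dir / name))
            moved.append(name)
        else:
            # Expected when the image lives in one of the other folders.
            missing.append(name)
    return moved, missing


# Demonstration on throwaway directories with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "images_12", Path(tmp) / "positives"
    src.mkdir()
    (src / "a.png").write_bytes(b"")
    (src / "b.png").write_bytes(b"")
    moved, missing = move_images(["a.png", "c.png"], src, dst)
    remaining = sorted(p.name for p in src.iterdir())
    print(moved, missing, remaining)
```

Returning the list of missing names makes it possible to confirm, after the last folder, that every labeled image was found somewhere.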

In [0]:
train_data_dir = '/content/drive/My Drive/Thinkful Datasets/final_images/train/'
test_data_dir = '/content/drive/My Drive/Thinkful Datasets/final_images/test/'
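
The 50/50 class balance and 70/30 split were created by hand. A minimal sketch of doing the same thing programmatically follows, assuming undersampling to the smaller class and a fixed random seed; the function name `balanced_split`, the file names, and the count of 1000 negatives are hypothetical, while 440 images per class matches the 308 + 132 images per class used in this notebook.

```python
import random


def balanced_split(positives, negatives, train_fraction=0.7, seed=0):
    """Undersample both classes to the same size, then split each into train/test."""
    rng = random.Random(seed)
    n_per_class = min(len(positives), len(negatives))
    n_train = round(n_per_class * train_fraction)
    split = {}
    for class_name, files in (("positives", positives), ("negatives", negatives)):
        # sample() draws without replacement, so train and test never overlap.
        chosen = rng.sample(list(files), n_per_class)
        split[class_name] = {"train": chosen[:n_train], "test": chosen[n_train:]}
    return split


# Made-up file names: 440 positives and 1000 negatives.
pos = [f"pos_{i}.png" for i in range(440)]
neg = [f"neg_{i}.png" for i in range(1000)]
split = balanced_split(pos, neg)
sizes = {name: (len(s["train"]), len(s["test"])) for name, s in split.items()}
print(sizes)  # {'positives': (308, 132), 'negatives': (308, 132)}
```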

Here, I load images into the notebook using ImageDataGenerator.

In [0]:
img_width, img_height = 250, 250
batch_size = 40

if K.image_data_format() == 'channels_first':
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

In [0]:
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.3,
    zoom_range=.1,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

In [0]:
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    test_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    #shuffle=False,
    class_mode='binary')

Found 616 images belonging to 2 classes.
Found 264 images belonging to 2 classes.


In [0]:
train_len = (len(glob.glob('/content/drive/My Drive/Thinkful Datasets/final_images/train/positives/*.png'))
  +len(glob.glob('/content/drive/My Drive/Thinkful Datasets/final_images/train/negatives/*.png')))
print(train_len)

616


In [0]:
test_len = (len(glob.glob('/content/drive/My Drive/Thinkful Datasets/final_images/test/positives/*.png'))
  +len(glob.glob('/content/drive/My Drive/Thinkful Datasets/final_images/test/negatives/*.png')))
print(test_len)

264


In [0]:
print('# of train positive images:', len(glob.glob('/content/drive/My Drive/Thinkful Datasets/final_images/train/positives/*.png')))
print('# of train negative images:', len(glob.glob('/content/drive/My Drive/Thinkful Datasets/final_images/train/negatives/*.png')))
print('# of test positive images:', len(glob.glob('/content/drive/My Drive/Thinkful Datasets/final_images/test/positives/*.png')))
print('# of test negative images:', len(glob.glob('/content/drive/My Drive/Thinkful Datasets/final_images/test/negatives/*.png')))

# of train positive images: 308
# of train negative images: 308
# of test positive images: 132
# of test negative images: 132
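
The per-class counts above repeat one `glob` call per folder. As a sketch of a more compact check, the hypothetical helper `count_images` below walks a `split/class/` directory layout like the one used here; it is demonstrated on a temporary directory with made-up files, not on the real drive.

```python
import tempfile
from pathlib import Path


def count_images(root_dir, pattern="*.png"):
    """Return {(split, class): file count} for folders laid out as root/split/class/."""
    counts = {}
    for class_dir in sorted(Path(root_dir).glob("*/*")):
        if class_dir.is_dir():
            key = (class_dir.parent.name, class_dir.name)
            counts[key] = sum(1 for _ in class_dir.glob(pattern))
    return counts


# Demonstration layout with made-up, empty image files.
layout = [("train", "positives", 3), ("train", "negatives", 3),
          ("test", "positives", 1), ("test", "negatives", 1)]
with tempfile.TemporaryDirectory() as tmp:
    for split_name, class_name, n_files in layout:
        folder = Path(tmp) / split_name / class_name
        folder.mkdir(parents=True)
        for i in range(n_files):
            (folder / f"img_{i}.png").write_bytes(b"")
    demo_counts = count_images(tmp)
    print(demo_counts)
```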


<h2>
3. Model Building
</h2>

I experimented with Sequential a bit, but decided to try AutoKeras to optimize my model's performance. Ultimately, I ran into AutoKeras bugs that prevented me from using an exact copy of the high-performing models it created. Instead, I replicated some of the layers it used and tuned the hyperparameters from there.

In [0]:
# !pip install tensorflow-gpu==2.1.0 --quiet

In [0]:
!pip install autokeras --quiet



In [0]:
import autokeras as ak

In [0]:
clf = ak.ImageClassifier(max_trials=10)

In [0]:
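# Use a separate iterator so that train_generator (batch size 40) is not overwritten.
# A single batch of 616 holds the whole training set, with augmentation applied, as arrays.
itr_train = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=616,
    class_mode='binary')

X_train, y_train = next(itr_train)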

Found 616 images belonging to 2 classes.


In [0]:
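# Use a separate iterator so that validation_generator (batch size 40) is not overwritten.
# A single batch of 264 holds the whole test set as arrays.
itr_test = test_datagen.flow_from_directory(
    test_data_dir,
    target_size=(img_width, img_height),
    batch_size=264,
    #shuffle=False,
    class_mode='binary')

X_test, y_test = next(itr_test)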

Found 264 images belonging to 2 classes.


In [0]:
clf.fit(X_train, y_train)

Train for 16 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 12/1000

Train for 16 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 17/1000

Train for 16 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 36/1000

Train for 16 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 27/1000

Train for 16 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 31/1000

Train for 16 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 14/1000

Train for 16 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 11/1000

Train for 16 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 24/1000

Train for 16 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 34/1000

Train for 16 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 36/1000

INFO:tensorflow:Oracle triggered exit
Train for 20 steps, validate for 4 steps
Epoch 1/1000 ... Epoch 11/1000


In [0]:
auto_model = clf.export_model()

In [0]:
auto_model.save('/content/drive/My Drive/Capstone 4/auto_model.h5')

In [0]:
from tensorflow.keras.layers import Input, Dense, Activation, Dropout, Conv2D, MaxPooling2D, LayerNormalization, GlobalAveragePooling2D
from tensorflow.keras import Model

In [0]:
auto_model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 250, 250, 3)]     0         
_________________________________________________________________
normalization (Normalization (None, 250, 250, 3)       7         
_________________________________________________________________
conv2d (Conv2D)              (None, 248, 248, 64)      1792      
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 246, 246, 64)      36928     
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 123, 123, 64)      0         
_________________________________________________________________
dropout (Dropout)            (None, 123, 123, 64)      0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 121, 121, 32)      18464 
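
The `Param #` column for the convolution layers in the summary above can be checked by hand: each filter has one weight per kernel position and input channel, plus one bias. The sketch below assumes 3x3 kernels, which is inferred from each layer shrinking the feature map by 2 pixels rather than read from the model configuration; the function name `conv2d_params` is hypothetical.

```python
def conv2d_params(kernel_size, in_channels, filters, use_bias=True):
    """Number of trainable parameters in a standard 2D convolution layer."""
    kernel_height, kernel_width = kernel_size
    weights = kernel_height * kernel_width * in_channels * filters
    return weights + (filters if use_bias else 0)


# Input channels come from the previous layer's output shape in the summary.
auto_model_params = [
    conv2d_params((3, 3), 3, 64),    # conv2d
    conv2d_params((3, 3), 64, 64),   # conv2d_1
    conv2d_params((3, 3), 64, 32),   # conv2d_2
]
print(auto_model_params)  # [1792, 36928, 18464]
```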

<h2>
3. Model Building
</h2>

Here is where I create my custom model based on the layers of the AutoKeras model.

In [0]:
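inputs = Input(shape=(250, 250, 3))
# BatchNormalization is used here in place of the AutoKeras Normalization layer.
normalization = BatchNormalization()(inputs)
# Three blocks of two convolutions followed by max pooling.
# The Conv2D layers keep the Keras defaults: 'valid' padding and no activation.
conv2d = Conv2D(32, kernel_size=(7,7))(normalization)
conv2d_1 = Conv2D(16, kernel_size=(3,3))(conv2d)
max_pooling2d = MaxPooling2D()(conv2d_1)
conv2d_2 = Conv2D(16, kernel_size=(3,3))(max_pooling2d)
conv2d_3 = Conv2D(8, kernel_size=(3,3))(conv2d_2)
max_pooling2d_1 = MaxPooling2D()(conv2d_3)
conv2d_4 = Conv2D(16, kernel_size=(3,3))(max_pooling2d_1)
conv2d_5 = Conv2D(32, kernel_size=(3,3))(conv2d_4)
max_pooling2d_2 = MaxPooling2D()(conv2d_5)
# Global average pooling reduces each feature map to a single value.
global_average_pooling2d = GlobalAveragePooling2D()(max_pooling2d_2)
dropout_1 = Dropout(0.5)(global_average_pooling2d)
# Flatten leaves the pooled output unchanged because it is already 2D (batch, features).
flatten = Flatten()(dropout_1)
# A single sigmoid unit outputs the probability of the positive class.
dense = Dense(1, activation='sigmoid')(flatten)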

In [0]:
model = Model(inputs = inputs, outputs = dense)
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 250, 250, 3)]     0         
_________________________________________________________________
batch_normalization (BatchNo (None, 250, 250, 3)       12        
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 244, 244, 32)      4736      
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 242, 242, 16)      4624      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 121, 121, 16)      0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 119, 119, 16)      2320      
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 117, 117, 8)       1160
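
The output shapes in the summary above follow from the default 'valid' padding: a convolution shrinks each spatial dimension by `kernel - 1`, and 2x2 max pooling halves it (rounding down). The following sketch traces the spatial size through the custom model's layers; the function names are hypothetical, and the values after `conv2d_7` are computed from the formula rather than taken from the summary.

```python
def conv_output(size, kernel, stride=1):
    """Spatial size after a convolution with 'valid' padding."""
    return (size - kernel) // stride + 1


def pool_output(size, pool=2):
    """Spatial size after max pooling with 'valid' padding and stride equal to pool size."""
    return (size - pool) // pool + 1


# (layer type, kernel or pool size) in the order used by the custom model.
layers = [("conv", 7), ("conv", 3), ("pool", 2),
          ("conv", 3), ("conv", 3), ("pool", 2),
          ("conv", 3), ("conv", 3), ("pool", 2)]

size = 250
sizes = []
for kind, k in layers:
    size = conv_output(size, k) if kind == "conv" else pool_output(size, k)
    sizes.append(size)
print(sizes)  # [244, 242, 121, 119, 117, 58, 56, 54, 27]
```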

In [0]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [0]:
model.fit(
    train_generator,
    steps_per_epoch = 15,
    epochs=50,
    validation_data = validation_generator,
    validation_steps = 5)
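
The step counts passed to `model.fit` determine how many batches are drawn per epoch. As a rough check, assuming the generators use the `batch_size = 40` set earlier, 15 training steps draw 600 of the 616 training images and 5 validation steps draw 200 of the 264 test images. The sketch below computes the number of steps needed to cover every image once; the function name `steps_to_cover` is hypothetical.

```python
import math


def steps_to_cover(num_samples, batch_size):
    """Smallest number of batches that covers every sample once."""
    return math.ceil(num_samples / batch_size)


batch_size = 40
train_steps = steps_to_cover(616, batch_size)
val_steps = steps_to_cover(264, batch_size)
print(train_steps, val_steps)  # 16 7

# Images drawn per epoch with the step counts used in this notebook.
drawn_train = min(15 * batch_size, 616)
drawn_val = min(5 * batch_size, 264)
print(drawn_train, drawn_val)  # 600 200
```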

In [0]:
model.save('/content/drive/My Drive/Capstone 4/best_model.h5')