# Dissertation Project Code: Common Chest X-ray Classification and Localization with Deep Learning

This code is intended for my Dissertation Project at The University of Nottingham

Dataset available at: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345

Let's start with importing some of the libraries that we are going to use

In [None]:
# general libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import os
import sys
from glob import glob

# keras tensorflow and other image processing libraries
import cv2
import keras.backend as K
from keras.preprocessing.image import ImageDataGenerator, image
from keras.models import Sequential, Model, load_model
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dropout, Flatten, Dense
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.applications import VGG16, VGG19, MobileNet, MobileNetV2, InceptionResNetV2, InceptionV3, ResNet50, DenseNet121, DenseNet169, DenseNet201
from keras import regularizers, optimizers

# scikit-learn libraries for utility
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

# set random seed
import random
random.seed(111)

print(os.listdir("../input")) # list items inside the directory

Listing the contents of the input directory is a quick check that we are in the correct working directory: if the dataset files are printed, the relative paths used below will resolve.

## 1. Exploring, Visualizing, and Pre-processing the Dataset

General patient information is stored in 'Data_Entry_2017.csv'. It acts as a kind of medical record containing each patient's personal information (age and gender), the image file name, and the disease labels

In [None]:
# load data
data = pd.read_csv('../input/Data_Entry_2017.csv')

total_data = len(data)
print('Total number of data:', total_data) # total number of data

data.head(5) # view first 5 rows

Keep in mind that the directory holds 112,120 images in total, and the released PNG files are 1024x1024 pixels each. It is a **VERY BIG DATASET**

Before jumping into the image processing and training part, it is good practice to analyse the .csv data. We might get some useful insights for building the training model later

### I. Analysing Patients' Data

Let's group male and female patients according to their ages

In [None]:
# visualize the distribution of age by its gender
g = sns.catplot(data=data, col='Patient Gender', x='Patient Age', kind='count')
g.set_xticklabels(np.arange(0,100));
g.set_xticklabels(step=10);
g.fig.suptitle('Age Distribution by Gender',fontsize=11);
g.fig.subplots_adjust(top=.9)

Patients in their 50s and 60s form the largest age group for both female and male patients. There are also more male patients than female patients

### II. Analysing the Diseases

Let's split the diseases into separate columns by name. Keep in mind that a single chest x-ray can show several diseases, which are separated by the '|' sign (e.g. Mass|Hernia|Nodule)

In [None]:
# create an array of 14 diseases for one-hot encoding
diseases = ['Atelectasis', 'Consolidation', 'Infiltration', 'Pneumothorax', 'Edema', 'Emphysema', 'Fibrosis', 'Effusion', 'Pneumonia', 'Pleural_Thickening', 
'Cardiomegaly', 'Nodule', 'Mass', 'Hernia'] # taken from paper

for label in diseases:
    data[label] = data['Finding Labels'].map(lambda result: 1 if label in result else 0)
data.head(20) # check the data

Notice that we created new columns with the disease names as their headers. These columns are useful when we build our training model. This method is commonly called **one-hot encoding**; because one x-ray can have several 1s, it is strictly a multi-label (multi-hot) encoding
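
As a minimal sketch of what this encoding does for a single row, the snippet below splits a 'Finding Labels' string on '|' and marks each disease that appears. The helper name `encode_findings` and the shortened disease list are assumptions made for illustration and are not part of the notebook. Unlike the substring test used above, it compares whole label names.

```python
# Hypothetical helper for illustration; not part of the original notebook.
def encode_findings(finding_labels, disease_names):
    """Return a list of 0/1 flags, one per disease, for a '|'-separated label string."""
    present = set(finding_labels.split('|'))  # exact label names found on this x-ray
    return [1 if name in present else 0 for name in disease_names]

sample_diseases = ['Infiltration', 'Nodule', 'Mass', 'Hernia']  # shortened list for the example
multi = encode_findings('Mass|Hernia|Nodule', sample_diseases)
healthy = encode_findings('No Finding', sample_diseases)
print(multi)
print(healthy)
```

An x-ray labelled 'No Finding' ends up with all flags set to 0, which is how the model will later see a healthy image.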

In the next block of code, we create a temporary dataframe that starts with only 'Image Index', 'Finding Labels', 'Follow-up #', 'Patient ID', 'Patient Age', and 'Patient Gender' for analysing the diseases.
This way we do not change anything in our original dataframe (data)

In [None]:
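# creates new temporary dataframe for further analysis
# .copy() makes it independent of `data`, so adding columns below is safe and raises no warning
temp = data[['Image Index','Finding Labels','Follow-up #','Patient ID','Patient Age','Patient Gender']].copy()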
for i in diseases:
    temp[i] = data['Finding Labels'].map(lambda x: 1 if i in x else 0)
temp['Nothing'] = data['Finding Labels'].map(lambda x: 1 if 'No Finding' in x else 0)

In [None]:
temp.head()

Now we can use the 'temp' dataframe for the analysis

In [None]:
# visualize the diseases against the patients' gender
gender_split = pd.melt(temp,
             id_vars='Patient Gender',
             value_vars=diseases,
             var_name='Category',
             value_name='Count')
gender_split = gender_split.loc[gender_split.Count>0]
g = sns.countplot(y='Category',hue='Patient Gender',data=gender_split, order = gender_split['Category'].value_counts().index)
g.set_title('Individual Disease Count by Gender')

In [None]:
# visualize patients with 'No Finding' labels against the patients' gender
gender_split_nothing = pd.melt(temp,
             id_vars='Patient Gender',
             value_vars = list(['Nothing']),
             var_name = 'Category',
             value_name = 'Count')
gender_split_nothing = gender_split_nothing.loc[gender_split_nothing.Count>0]
g = sns.countplot(y='Category',hue='Patient Gender',data=gender_split_nothing)
g.set_title('No Finding Count by Gender')

A few findings stand out:
* Infiltration has the highest number of patients and Hernia the lowest.
* Male patients outnumber female patients for every disease except Cardiomegaly and Hernia
* Male patients also outnumber female patients among the 'No Finding' x-rays

**Assumption: This class imbalance might affect our trained model, since it can become biased towards x-rays labelled with Infiltration**

Remember that this is only an assumption based on the graphs above

Now let's look at the 'Finding Labels' for each data

In [None]:
# plot the distribution of data
count_per_unique_label = data['Finding Labels'].value_counts() # get frequency counts per label
df_count_per_unique_label = count_per_unique_label.to_frame() # convert series to dataframe for plotting purposes

print(df_count_per_unique_label) # view tabular results

In total, there are 836 unique label combinations across the chest x-rays

For simplicity, we can look at only the 20 most frequent combinations and plot them as a graph

In [None]:
top_20 = count_per_unique_label[:20] # 20 most frequent label combinations
g = sns.barplot(x=top_20.index, y=top_20.values) # plot the Series directly, so the result does not depend on the column name
plt.xticks(rotation=90)
plt.title('Label Distribution')

We can see that the 'No Finding' label outnumbers every disease by a wide margin. This is not a problem in itself, since such imbalance is typical of the real-world data we encounter in medical imaging

Now let's leave out the 'No Finding' label and count each disease individually, so that a patient with multiple diseases contributes to the count of every disease they have.

For example:
A patient with Infiltration | Nodule | Mass is counted once each under Infiltration, Nodule, and Mass.
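
The counting rule in this example can be sketched in plain Python. The label strings below are made up for illustration; they only follow the format of the 'Finding Labels' column.

```python
from collections import Counter

# Hypothetical label strings, in the same format as the 'Finding Labels' column.
example_labels = ['Infiltration|Nodule|Mass', 'Infiltration', 'No Finding', 'Mass']

per_disease = Counter()
for label_string in example_labels:
    for name in label_string.split('|'):
        if name != 'No Finding':  # leave the 'No Finding' label out
            per_disease[name] += 1

print(per_disease.most_common())
```

The notebook reaches the same totals by summing the 0/1 disease columns, which is faster on the full dataframe.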

In [None]:
# create clean labels for the diseases which exclude the 'No Finding' label
clean_labels = data[diseases].sum().sort_values(ascending= False) # get sorted value_count for clean labels
print(clean_labels) # view tabular results

# plot the data
sns.barplot(x=clean_labels.index, y=clean_labels.values) # plot the sorted Series directly
plt.xticks(rotation=90)
plt.title('Diseases distribution without "No Finding" label')

Infiltration is the most common disease overall

Let's break down the graph above into individual graphs, one per disease

In [None]:
# visualize individual disease against patients' age, associated with gender
f, ax = plt.subplots(7, 2, sharex=True, figsize=(15, 20)) # 7x2 subplots with shared x value, 15x20 size
i = j = 0 # variables to iterate between subplots
x=np.arange(0,100,10)
for disease in diseases:
    g = sns.countplot(x='Patient Age', hue='Patient Gender', data=data[data['Finding Labels'] == disease], ax=ax[i,j])
    ax[i, j].set_title(disease)   
    g.set_xlim(0,90)
    g.set_xticks(x)
    g.set_xticklabels(x)
    j=(j+1)%2
    if j==0:
        i=(i+1)%7
f.subplots_adjust(hspace=0.3)

Do you see something unusual here?
> Most graphs show a roughly bell-shaped age distribution. The peak in the middle of each graph marks the age group with the most patients, and the width of the curve reflects how widely the ages are spread.

However, this is not the case for patients with Hernia. Let's do some quick research on this disease (https://www.niddk.nih.gov/health-information/digestive-diseases/inguinal-hernia#who)

Here's what I found:
* **Adults aged 75-80** are the most likely to have a hernia
* **Children** can also get a hernia between **ages 0-5**
* Hernias are also commonly found in **premature infants**

However, in the Hernia graph, patients with this disease are most commonly found in the 10s to 30s age groups. This is an interesting finding because it contradicts the information from the NIH website above, although that page describes inguinal hernia specifically, which may differ from the hernias visible on a chest x-ray.
For now, let's move on to the next analysis

We can compare x-rays that show multiple diseases with x-rays that show only one. We can also count how many unique labels occur in the dataset

In [None]:
# count unique labels in the dataset
num_unique_labels = data['Finding Labels'].nunique()
print('Number of unique labels:',num_unique_labels)

In [None]:
# compare between single and multiple diseases
multi_and_single_disease_split = temp.groupby('Finding Labels').count().sort_values('Patient ID', ascending=False)
multi = multi_and_single_disease_split[['|' in i for i in multi_and_single_disease_split.index]].copy()
single = multi_and_single_disease_split[['|' not in i for i in multi_and_single_disease_split.index]]
single = single[['No Finding' not in i for i in single.index]].copy() # copy so a column can be added below without a warning

single['Finding Labels'] = single.index.values
multi['Finding Labels'] = multi.index.values

In [None]:
# visualize between single vs multiple diseases
f, ax = plt.subplots(sharex=True,figsize=(8, 5))
sns.countplot(y='Category', data=gender_split, ax=ax, order=gender_split['Category'].value_counts().index,
              color='b', label="Multiple Diseases")
sns.barplot(x='Patient ID', y='Finding Labels', data=single, ax=ax, color='r', label='Single Disease')
ax.legend(ncol=2, loc="center right", frameon=True, fontsize=10) 
ax.set(ylabel="Type of disease", xlabel="Number of Patients")
ax.set_title("Comparison between Single vs Multiple Diseases",fontsize=12) 
sns.despine(left=True)

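For each disease, the blue bar counts every x-ray on which the disease appears, alone or together with others, while the red bar counts only the x-rays on which it appears alone. The gap between the two shows that most occurrences of a disease come from x-rays with multiple diseases.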

### III. Analysing the Images

Reading the images is rather inconvenient with this dataset. Here's the problem.

It is very common for an image dataset to be organised in a directory like this:

* Main Directory
    * CLASS_1
        * Train
        * Validation
        * Test
    * CLASS_2
        * Train
        * Validation
        * Test
    * CLASS_3
        * Train
        * Validation
        * Test

When the directory is organised like this, image processing and data augmentation become easier, since the data has already been "tidied up"

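This dataset is not organised that way: the images sit in several image folders, and the labels live in the .csv file. This is why we collect all image paths with glob and then create a new column that stores the full path to each image.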
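
As a self-contained sketch of this lookup, the snippet below builds a small throwaway directory tree and maps bare file names to full paths. The folder and file names are made up for the example and do not claim to mirror the real dataset layout.

```python
import os
import tempfile
from glob import glob

# Build a throwaway directory tree that mimics images spread over several folders.
# The folder and file names here are made up for the example.
root = tempfile.mkdtemp()
for folder, file_name in [('images_001', 'a.png'), ('images_002', 'b.png')]:
    os.makedirs(os.path.join(root, folder, 'images'))
    open(os.path.join(root, folder, 'images', file_name), 'w').close()

# One wildcard pattern collects every image regardless of which folder holds it.
found = glob(os.path.join(root, 'images*', 'images', '*.png'))

# Map each bare file name to its full path, ready for a dataframe lookup.
name_to_path = {os.path.basename(p): p for p in found}
print(sorted(name_to_path))
```

A dictionary keyed by file name makes the later `map` call on 'Image Index' a simple lookup per row.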

In [None]:
# collect the paths of all images with glob (returns a list of path strings)
my_glob = glob('../input/images*/images/*.png')
print('Number of Observations: ', len(my_glob)) # should be 112120

In [None]:
# store the image paths onto data
full_img_paths = {os.path.basename(x): x for x in my_glob}
data['full_path'] = data['Image Index'].map(full_img_paths.get)

In [None]:
data.head()

We can plot some images using our new "full_path" column to check that the paths are correct

In [None]:
# group the image paths according to the disease label
image_split = data.groupby(['Finding Labels', 'full_path']).count().index

In [None]:
# Emphysema
disease_path = image_split[['Emphysema' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])
    
fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Atelectasis
disease_path = image_split[['Atelectasis' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Consolidation
disease_path = image_split[['Consolidation' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Infiltration
disease_path = image_split[['Infiltration' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Pneumothorax
disease_path = image_split[['Pneumothorax' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Edema
disease_path = image_split[['Edema' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Fibrosis
disease_path = image_split[['Fibrosis' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Effusion
disease_path = image_split[['Effusion' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Pneumonia
disease_path = image_split[['Pneumonia' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Pleural Thickening
disease_path = image_split[['Pleural_Thickening' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Cardiomegaly
disease_path = image_split[['Cardiomegaly' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Nodule
disease_path = image_split[['Nodule' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Mass
disease_path = image_split[['Mass' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')

In [None]:
# Hernia
disease_path = image_split[['Hernia' in i for i in image_split]]
disease_img = []
for i in range(len(disease_path)):
    disease_img.append(disease_path[i][1])

fig, ax = plt.subplots(1, 5, figsize=(20,20))
ax = ax.flatten()
for (x, a) in zip(disease_img, ax):
    t = plt.imread(x)
    a.imshow(t, cmap='gray')
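
The cells above repeat the same steps for each disease. As a hedged sketch, the selection step could live in one small helper that is called in a loop over `diseases`; the helper name and the example pairs below are hypothetical, and `pairs` stands in for the (label, path) tuples held in `image_split`.

```python
# Hypothetical helper; `pairs` stands in for the (label, path) tuples in image_split.
def first_paths_for_label(pairs, label, n=5):
    """Return up to n paths whose label string is exactly `label`."""
    return [path for found_label, path in pairs if found_label == label][:n]

example_pairs = [('Mass', 'm1.png'), ('Hernia', 'h1.png'),
                 ('Mass|Nodule', 'x.png'), ('Mass', 'm2.png')]
mass_paths = first_paths_for_label(example_pairs, 'Mass', n=5)
print(mass_paths)
```

Like the cells above, this keeps only x-rays labelled with that single disease; the plotting code would then run once per returned list.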

## 2. Data Augmentation

One last thing to do. Remember the one-hot encoding we created at the beginning of this notebook? Now we have to combine those 0s and 1s into a single array per row, stored in one column.

In [None]:
# creates 'target_vector' column to combine the one-hot encoding
# each row gets one float array of 14 values, in the same order as `diseases`
data['target_vector'] = list(data[diseases].values.astype('float32'))

In [None]:
data.head() # check the 'target_vector' column

Great! Now we are ready to build the model

The data we are going to use is split into train/validation/test with split of 80/10/10 respectively.
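
Because the two splits are applied one after the other, the second fraction is taken from what remains after the first. A small sketch of that arithmetic with exact fractions follows; the function name is hypothetical.

```python
from fractions import Fraction

# Hypothetical helper for illustration.
def second_split_fraction(first_frac, target_frac):
    """Fraction to request from the remainder so the second hold-out equals target_frac of the full data."""
    return target_frac / (1 - first_frac)

tenth = Fraction(1, 10)
second = second_split_fraction(tenth, tenth)   # share of the remaining 90% to hold out
train_share = (1 - tenth) * (1 - second)       # what is left for training
naive_train_share = (1 - tenth) * (1 - tenth)  # if 0.1 is simply requested twice

print(second, train_share, naive_train_share)
```

Requesting 0.1 twice would leave 81% for training and 9% for testing, so the second call needs a fraction of 1/9 to reach an exact 80/10/10 split.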

In [None]:
# split the data into training, validation, and testing set
from sklearn.model_selection import train_test_split
# 80/10/10 split: hold out 10% for validation, then 1/9 of the remaining 90% (10% of the total) for testing
train_set, val_set = train_test_split(data, test_size=0.1, random_state=1993)
train_set, test_set = train_test_split(train_set, test_size=1/9, random_state=1993)

print('Training: ', len(train_set))
print('Validation: ', len(val_set))
print('Testing: ', len(test_set))
print('Total data: ', len(data))
print(len(train_set)+len(test_set)+len(val_set) == len(data)) # double check

Keras has a convenient class for data augmentation, ImageDataGenerator. Data augmentation is one way to make the training data more varied by applying image transformations such as zoom, rotation, shift, shear, and flip.

Notice that **we only normalize the test set and do not apply any data augmentation to it**. This is due to the nature of medical imaging: when we have our chest scanned in a hospital, we do not receive a zoomed, rotated, shifted, or flipped image of our chest x-ray
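
To make the two kinds of processing concrete, here is a small NumPy sketch on a made-up image. It is not the Keras implementation; it only shows rescaling, which every set receives, and a horizontal flip, one of the augmentations applied to training images.

```python
import numpy as np

# A tiny made-up 2x3 grayscale "image" with 8-bit pixel values.
image = np.array([[0, 128, 255],
                  [64, 32, 16]], dtype=np.uint8)

rescaled = image.astype(np.float32) / 255.0  # what rescale=1./255 does: values end up in [0, 1]
flipped = np.fliplr(rescaled)                # a horizontal flip mirrors the columns

print(rescaled.min(), rescaled.max())
print(flipped[0])
```

Rescaling changes the value range but not the picture, whereas the flip changes where structures appear, which is why it is kept out of the test set.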

In [None]:
# creates image data generator for both training and testing images
from keras.preprocessing.image import ImageDataGenerator
train_data_gen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        rotation_range=20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        horizontal_flip=True)

test_data_gen = ImageDataGenerator(rescale=1./255) # we don't want to apply zoom, rotation, shear, etc. in the test set

The flow_from_dataframe function below helps us load the images by referring to the paths in the dataframe that we created earlier

In [None]:
# Credits to Kevin Mader who created this function
# https://www.kaggle.com/kmader/train-simple-xray-cnn

# flow_from_dataframe is a function that takes the full path of each image from the dataframe we created earlier,
# instead of reading it from a class-based directory layout
def flow_from_dataframe(img_data_gen, in_df, path_col, y_col, **dflow_args):
    base_dir = os.path.dirname(in_df[path_col].values[0])
    print('## Ignore next message from keras, values are replaced anyways')
    df_gen = img_data_gen.flow_from_directory(base_dir, 
                                     class_mode = 'sparse',
                                    **dflow_args)
    df_gen.filenames = in_df[path_col].values
    df_gen.classes = np.stack(in_df[y_col].values)
    df_gen.samples = in_df.shape[0]
    df_gen.n = in_df.shape[0]
    df_gen._set_index_array()
    df_gen.directory = '' # since we already have full_path column, we can set this to None or ''
    print('Reinserting dataframe: {} images'.format(in_df.shape[0]))
    return df_gen

Now we can create the generators for the training, validation, and testing data.
* Here we set color_mode to RGB because we are going to use pre-trained models, and these pre-trained models expect input with 3 color channels.
* We can try different input sizes by changing IMG_SIZE; most pre-trained models accept input images of 224x224 pixels
* We set the batch size to 1024 on the test set so that a single call yields a large batch for evaluation, while training and validation can use 32, 64, 128 or higher. If a memory error occurs, lower the batch size
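
A rough way to judge whether a batch size will fit in memory is to estimate the size of the input batch itself. The sketch below assumes float32 images (4 bytes per value) and counts the inputs only; the function name is hypothetical.

```python
# Hypothetical helper for illustration.
def batch_megabytes(batch_size, height=224, width=224, channels=3, bytes_per_value=4):
    """Approximate size in MiB of one batch of float32 images (inputs only)."""
    return batch_size * height * width * channels * bytes_per_value / 2**20

for size in (32, 256, 1024):
    print(size, round(batch_megabytes(size), 1), 'MiB')
```

The activations inside the network need considerably more memory than the inputs, so this figure is a lower bound rather than a full estimate.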

In [None]:
IMG_SIZE = (224, 224) # image re-sizing
TRAIN_BATCH_SIZE = 32
VAL_BATCH_SIZE = 256
TEST_BATCH_SIZE = 1024

# train dataset
train_gen = flow_from_dataframe(train_data_gen, train_set, path_col = 'full_path', y_col = 'target_vector',
                                target_size = IMG_SIZE, color_mode = 'rgb', batch_size = TRAIN_BATCH_SIZE)
# validation dataset
valid_X, valid_Y = next(flow_from_dataframe(train_data_gen, val_set, path_col = 'full_path',
                                            y_col = 'target_vector', target_size = IMG_SIZE, 
                                            color_mode = 'rgb', batch_size = VAL_BATCH_SIZE))
# test dataset
test_X, test_Y = next(flow_from_dataframe(test_data_gen, test_set, path_col = 'full_path', y_col = 'target_vector', 
                                          target_size = IMG_SIZE, color_mode = 'rgb', batch_size = TEST_BATCH_SIZE))

Let's look at some images from the training data, which have been augmented by the ImageDataGenerator, and from the test data, which have only been rescaled

In [None]:
# show the images from the train/validation data generator
t_x, t_y = next(train_gen)
fig, m_axs = plt.subplots(4, 4, figsize = (16, 16))
for (c_x, c_y, c_ax) in zip(t_x, t_y, m_axs.flatten()):
    c_ax.imshow(c_x[:,:,0], cmap='bone')
    c_ax.set_title(', '.join([n_class for n_class, n_score in zip(diseases, c_y) 
                             if n_score>0.5])) # n_score will find the '1' inside 14 indexes
    c_ax.axis('off')

In [None]:
# show the images from the test data generator
t_x, t_y = test_X, test_Y
fig, m_axs = plt.subplots(4, 4, figsize = (16, 16))
for (c_x, c_y, c_ax) in zip(t_x, t_y, m_axs.flatten()):
    c_ax.imshow(c_x[:,:,0], cmap = 'bone')
    c_ax.set_title(', '.join([n_class for n_class, n_score in zip(diseases, c_y) 
                             if n_score>0.5])) # n_score will find the '1' inside 14 indexes
    c_ax.axis('off')

## 3. Building the Model

This part is for experiments. We can build our own model or use the pre-trained models with various hyperparameters

### Build our own model

In [None]:
# # Create the model
# my_model = Sequential()

# my_model.add(Conv2D(filters = 32, kernel_size = 3, padding = 'same', activation = 'relu', input_shape = test_X.shape[1:]))
# # my_model.add(MaxPooling2D(pool_size = 2))
# my_model.add(Dropout(0.2))

# my_model.add(Conv2D(filters = 64, kernel_size = 3, padding = 'same', activation = 'relu'))
# # my_model.add(MaxPooling2D(pool_size = 2))
# my_model.add(Dropout(0.2))
          
# my_model.add(Conv2D(filters = 128, kernel_size = 3, padding = 'same', activation = 'relu'))
# # my_model.add(MaxPooling2D(pool_size = 2))
# my_model.add(Dropout(0.2))

# my_model.add(Conv2D(filters = 256, kernel_size = 3, padding = 'same', activation = 'relu'))
# # my_model.add(MaxPooling2D(pool_size = 2))
# my_model.add(Dropout(0.2))

# my_model.add(GlobalAveragePooling2D())
# my_model.add(Dropout(0.2))

# my_model.add(Flatten())
# my_model.add(Dense(256, activation = 'relu'))
# my_model.add(Dropout(0.2))
# my_model.add(Dense(len(diseases), activation = 'softmax'))

# my_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['binary_accuracy', 'mae'])
# my_model.summary()

### Base model with transfer learning
Base : VGG16, VGG19, MobileNet, MobileNetV2, InceptionResNetV2, InceptionV3, ResNet50, DenseNet121, DenseNet169, DenseNet201


In [None]:
# choose the pre-trained base model; all of them expect 3-channel (RGB) input

base_choice = 'VGG16'
# VGG16
if base_choice.upper() == 'VGG16':
    base_model = VGG16(include_top=False, input_shape=(224,224,3), weights='imagenet')
# VGG19
elif base_choice.upper() == 'VGG19':   
    base_model = VGG19(include_top=False, input_shape=(224,224,3))
# MobileNet
elif base_choice.upper() == 'MOBILE':
    base_model = MobileNet(include_top=False, input_shape=(224,224,3))
# MobileNetV2
elif base_choice.upper() == 'MOBILEV2':
    base_model = MobileNetV2(include_top=False, input_shape=(224,224,3))
# InceptionResNetV2
elif base_choice.upper() == 'INCEPTIONV2':
    base_model = InceptionResNetV2(include_top=False, input_shape=(224,224,3))
# InceptionV3
elif base_choice.upper() == 'INCEPTIONV3':
    base_model = InceptionV3(include_top=False, input_shape=(224,224,3))
# ResNet50
elif base_choice.upper() == 'RESNET50':
    base_model = ResNet50(include_top=False, input_shape=(224,224,3))
# DenseNet 121
elif base_choice.upper() == 'DENSE121':
    base_model = DenseNet121(include_top=False, input_shape=(224,224,3))
# DenseNet 169
elif base_choice.upper() == 'DENSE169':
    base_model = DenseNet169(include_top=False, input_shape=(224,224,3))
# DenseNet 201
elif base_choice.upper() == 'DENSE201':
    base_model = DenseNet201(include_top=False, input_shape=(224,224,3))
    
print("Base pre-trained model:", base_choice)
base_model.summary()

If we are using a pre-trained model, we freeze its layers so that their weights are not updated by backpropagation during training. Only the new dense layers that we add on top are trained.

In [None]:
# freeze the base model
for layer in base_model.layers:
    layer.trainable = False
        
# adds our own dense layers
output = base_model.output
output = Flatten()(output)
output = Dense(256, activation='relu', kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=regularizers.l2(0.01))(output)
last_output = Dense(14, activation='sigmoid')(output) # sigmoid scores each of the 14 labels independently (multi-label)
# construct final model
final_model = Model(base_model.input, last_output)
# compile the model
# opt = optimizers.Adamax(lr=0.02)
final_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['binary_accuracy'])
final_model.summary()
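
Because one x-ray can carry several of the 14 labels, the choice of activation on the output layer matters. As a rough NumPy illustration with made-up scores (independent of Keras), the sketch below compares softmax, where the scores compete with each other, against sigmoid, where each label is scored on its own.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up raw scores for three labels; the first two are both strongly "present".
logits = np.array([3.0, 3.0, -2.0])

softmax_scores = softmax(logits)  # scores compete and sum to 1
sigmoid_scores = sigmoid(logits)  # each label is scored independently

print(softmax_scores.round(3), softmax_scores.sum())
print(sigmoid_scores.round(3))
```

With softmax, two equally strong labels cannot both exceed 0.5, whereas sigmoid lets both pass that threshold, which pairs naturally with the binary cross-entropy loss.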

Saving the weights during or after training is important for reusability. This checkpoint is triggered whenever val_loss improves, and it saves the model in our directory under the specified name.
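
The naming and saving rule can be imitated in plain Python. This is a sketch of the idea only, not Keras's actual implementation, and the validation losses are made up.

```python
# The filepath template is filled in with the epoch number and the logged metrics.
template = 'weights.' + 'VGG16' + '.{epoch:d}-{val_loss:.2f}.hdf5'

# Made-up validation losses for three epochs.
val_losses = [0.2531, 0.1834, 0.2102]

best = float('inf')
saved = []
for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best:  # save_best_only: keep a file only when val_loss improves
        best = val_loss
        saved.append(template.format(epoch=epoch, val_loss=val_loss))

print(saved)
```

The third epoch is skipped because its loss is worse than the best seen so far, so only two files would be written.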

In [None]:
# set up a callbacks
checkpointer = ModelCheckpoint(filepath='weights.'+base_choice+ '.{epoch:d}-{val_loss:.2f}.hdf5', verbose=1, save_best_only = True)
callbacks_list = [checkpointer]

The training phase will take some time. Run it on a GPU if you have one, or use Kaggle for 30 hours of free GPU time per week

In [None]:
# fit the model
# len(train_gen) is already the number of batches per epoch; the checkpoint callback is passed in so it can run
fitted_model = final_model.fit_generator(generator=train_gen, steps_per_epoch=len(train_gen), epochs=5,
                                         validation_data=(valid_X, valid_Y), callbacks=callbacks_list)

Uncomment the code below to save the model after training has finished

In [None]:
# save the model (the {epoch}/{val_loss} placeholders are only filled in by ModelCheckpoint, so a fixed name is used here)
# final_model.save('weights.' + base_choice + '.final.hdf5')


## 4. Evaluating Model

The ROC (Receiver Operating Characteristic) curve is one way to measure the performance of a classifier.
>It tells us how well our model separates the classes (predicting 0s as 0s and 1s as 1s)


The model separates the classes well if the AUC (area under the curve) is close to 1. An AUC of about 0.5 means the model does no better than random guessing, and an AUC close to 0 means its predictions are inverted
![AUC-ROC Curve](https://miro.medium.com/max/451/1*pk05QGzoWhCgRiiFbz-oKQ.png)
Image taken from https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
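
One way to build intuition for the AUC is that it equals the fraction of (positive, negative) pairs that the model ranks in the right order. The pure-Python sketch below uses that definition on made-up scores; the function name is hypothetical, and the notebook itself relies on scikit-learn's `roc_curve` and `auc`.

```python
# Hypothetical helper for illustration.
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
print(pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9]))  # perfectly separated
print(pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
print(pairwise_auc(labels, [0.9, 0.8, 0.2, 0.1]))  # ranking inverted
```

The three cases correspond to the ends and the middle of the AUC scale: perfect separation, no information, and a fully inverted ranking.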

In [None]:
# create predictions based on the trained model
# batch_size only controls how many images are processed at once; 32 keeps memory use modest
deep_model_predictions = final_model.predict(test_X, batch_size=32, verbose=1)
# plot ROC curve based on the predictions
fig, c_ax = plt.subplots(1,1, figsize = (9, 9))
for (i, label) in enumerate(diseases):
    fpr, tpr, thresholds = roc_curve(test_Y[:,i].astype(int), deep_model_predictions[:,i])
    c_ax.plot(fpr, tpr, label = '%s (AUC:%0.2f)'  % (label, auc(fpr, tpr)))

# set labels for plot
c_ax.legend()
c_ax.set_xlabel('False Positive Rate')
c_ax.set_ylabel('True Positive Rate')
fig.savefig(base_choice+'_roc.png') # save the roc

Or evaluate the model with the accuracy and loss metrics against the test set

In [None]:
# evaluate model with the test set (batch_size is the number of images processed at once)
final_model.evaluate(test_X, test_Y, batch_size=32)

Plot the loss and accuracy into curves

In [None]:
print(fitted_model.history.keys())

In [None]:
plt.plot(fitted_model.history['loss'])
plt.plot(fitted_model.history['val_loss'])
plt.title('Model Loss vs Validation Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

In [None]:
plt.plot(fitted_model.history['binary_accuracy'])
plt.plot(fitted_model.history['val_binary_accuracy'])
plt.title('Training Accuracy vs Validation Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

## 5. Localisation

If you are running out of memory, uncomment the code below, change the batch_size if necessary, and run it

In [None]:
# test dataset
# test_X, test_Y = next(flow_from_dataframe(test_data_gen, test_set, path_col = 'full_path', y_col = 'target_vector', 
#                                           target_size = IMG_SIZE, color_mode = 'rgb', batch_size = 32))

Make predictions for the test images with the model, then inspect the prediction for the first image

In [None]:
pred_Y = final_model.predict(test_X)

In [None]:
pred_Y[0]

Now for the localisation, we are going to use the CAM (Class Activation Map) algorithm. Later, in a different notebook, I will use Grad-CAM, which is an "upgraded" version of CAM

We have defined our model architecture and it can now make predictions, but what does it actually look at in an image when it makes them?

In [None]:
# extract the weights from the final layer of the model
saved_weights = final_model.layers[-1].get_weights()[0]
saved_weights.shape
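# create a new model that outputs both a convolutional feature map and the final predictions
# note: with the VGG16 base, layers[-14] is an intermediate convolutional layer with 256 channels
# (matching the 256 weights per class in the final layer), not the last convolutional layer
cam_model = Model(inputs=final_model.input, outputs=(final_model.layers[-14].output, final_model.layers[-1].output))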

features, res = cam_model.predict(test_X, batch_size=32) # make a new prediction with that model

To obtain the CAM, these are the general steps:
1. The output layer (the last layer of our model) gives the predicted probabilities for all classes
2. The final convolutional layer of our model produces the feature maps that respond to the more complex patterns in the images. This is why we want the **final convolutional layer**: to see what the model "sees"
3. Next, calculate the **dot product between the weights from the final layer and the feature maps to produce the CAM (Class Activation Map)**, and plot the result in subplots to turn those floating-point numbers into a more readable format: color
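
The dot product in step 3 can be sketched with NumPy on random data. The shapes below (a 7x7 feature map with 256 channels and 14 classes) and the class index are assumptions chosen for illustration, not values read from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up shapes: a 7x7 feature map with 256 channels and 14 output classes.
feature_maps = rng.random((7, 7, 256))  # stands in for one image's convolutional features
class_weights = rng.random((256, 14))   # stands in for the final Dense layer's kernel

predicted_class = 3                                    # hypothetical predicted class index
weights_for_class = class_weights[:, predicted_class]  # one weight per channel

# Weighted sum over channels: each spatial position gets one activation score.
cam = np.dot(feature_maps, weights_for_class)

# Scale to [0, 1] so the map can be shown as a heatmap.
cam_normalised = (cam - cam.min()) / (cam.max() - cam.min())
print(cam.shape)
```

The result has one value per spatial position, so it can be resized to the input resolution and overlaid on the x-ray.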

In [None]:
# order the test images so that those with at least one finding come first
sickest_idx = np.argsort(np.sum(test_Y, 1)<1)
# create subplots to plot different images
fig, m_axs = plt.subplots(4, 2, figsize = (16, 16))
for (idx, c_ax) in zip(sickest_idx, m_axs.flatten()):
    indv_features = features[idx,:,:,:]
    pred = np.argmax(res[idx]) # use np.argmax to find the highest probability

    cam_features = indv_features # feature maps
    cam_weights = saved_weights[:, pred] # the weights from the last layer
    cam_output = np.dot(cam_features, cam_weights) # calculate the dot product
    
    c_ax.imshow(cam_output, cmap='jet')
#     c_ax.imshow(test_X[idx,:,:,0], cmap = 'bone')
    stat_str = [n_class for n_class, n_score in zip(diseases, test_Y[idx]) if n_score>0.5]
    pred_str = ['%s:%2.0f%%' % (n_class, p_score*100) for n_class, n_score, p_score in zip(diseases, test_Y[idx], pred_Y[idx]) 
                if (n_score>0.5) or (p_score>0.5)]
    c_ax.set_title('Label(s): '+', '.join(stat_str)+'\n Prediction(s): '+', '.join(pred_str))
    c_ax.axis('off')

The result is rather confusing and does not tell us much about where the disease(s) are. The colored pixels only indicate how strongly each region responds to the pattern the model found: "warmer" colors mean a stronger response, while "cooler" colors mean a weaker one. One likely reason for the unclear maps is that the original CAM formulation assumes a global average pooling layer right before the final dense layer, which this model does not have.

I will introduce another algorithm called Grad-CAM, which is an upgraded version of this algorithm