<h1>Building a Convolutional Neural Network (CNN) to Detect COVID-19 in X-Ray Images</h1>
<h4>Anthony Preza</h4>
<p>In this notebook, we will load and preprocess X-ray data from <a href="https://github.com/ieee8023/covid-chestxray-dataset">this open-source dataset</a>. We will then build a CNN that leverages transfer learning from VGG19. The goal is to predict a diagnosis of COVID-19.</p>

In [1]:
import os
import random
import sys
import logging

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as display
import tensorflow as tf

from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from PIL import Image

In [2]:
tf.__version__

'2.0.0'

<h4>Load and Preprocess Data</h4>
<p>We will determine the shape of the data and perform any preprocessing operations. In order to use the VGG19 base model, we prepare the data for transfer learning.</p>
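
<p>For reference, here is a minimal NumPy sketch of what <code>keras.applications.vgg19.preprocess_input</code> does in its default mode: it reorders the channels from RGB to BGR and subtracts the per-channel ImageNet means, without rescaling. The helper name <code>vgg_style_preprocess</code> is hypothetical and only illustrates the arithmetic; the notebook itself calls the Keras function.</p>

```python
import numpy as np

# Per-channel ImageNet means in BGR order (Caffe-style VGG preprocessing).
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg_style_preprocess(rgb_image):
    """Hypothetical helper: RGB -> BGR, then zero-center each channel."""
    img = np.asarray(rgb_image, dtype=np.float32)
    bgr = img[..., ::-1]              # reverse the last (channel) axis
    return bgr - IMAGENET_BGR_MEANS   # broadcast over height and width

# a single gray pixel with intensity 128 in every channel
pixel = np.full((1, 1, 3), 128.0)
print(vgg_style_preprocess(pixel))
```

<p>Because the means are subtracted from raw 0-255 intensities, the images should not be scaled to [0, 1] before this step.</p>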

In [3]:
LOGGER = logging.getLogger(__name__)
LOGGER.setLevel(logging.INFO)
LOGGER.addHandler(logging.StreamHandler(stream=sys.stdout))

<h4>Split data into training, validation, and testing sets</h4>
<p>Now that the data has been preprocessed, we can split it into training, validation, and testing sets. The training set is used to train the CNN, the validation set is used to tune the model's hyperparameters, and the testing set lets us evaluate final performance.</p>
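
<p>The cell below reads images from <code>train</code>, <code>val</code>, and <code>test</code> directories that are assumed to already exist. As a rough sketch of how such a split could be produced, the following pure-Python helper splits (filename, label) pairs class by class so that every set keeps all classes. The function name <code>stratified_split</code>, the 15% fractions, and the file names are assumptions made for illustration, not part of the original pipeline.</p>

```python
import random
from collections import defaultdict

def stratified_split(samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Split (filename, label) pairs into train/val/test, class by class."""
    by_label = defaultdict(list)
    for filename, label in samples:
        by_label[label].append((filename, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_val = int(len(items) * val_frac)
        n_test = int(len(items) * test_frac)
        val.extend(items[:n_val])
        test.extend(items[n_val:n_val + n_test])
        train.extend(items[n_val + n_test:])
    return train, val, test

# made-up file names, purely for illustration
samples = [(f'img_{i}.jpeg', 'COVID-19' if i % 2 else 'normal') for i in range(40)]
train, val, test = stratified_split(samples)
print(len(train), len(val), len(test))
```

<p>Splitting per class matters here because the classes are imbalanced; a purely random split could leave the validation or testing set with very few COVID-19 images.</p>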

In [4]:
# initialize preprocessing constants
keras.applications.vgg19.preprocess_input(tf.zeros((224, 224, 3)))

classes = ['vp', 'normal', 'COVID-19', 'bp']
indices = tf.range(len(classes), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(classes, indices)
table = tf.lookup.StaticVocabularyTable(table_init, 1)

def make_dataset(directory, num_parallel_calls=tf.data.experimental.AUTOTUNE, shuffle=100, repeat=1, batch=30):
    set_ = []
    
    def preprocess(tensor):
        img_file = tensor[0]
        
        # read the image file
        img_str = tf.io.read_file(img_file)
                
        # decode image as jpeg
        img_decoded = tf.image.decode_jpeg(img_str, channels=3)
                
        # cast image data to float32
        img = tf.cast(img_decoded, tf.float32)
        
        # resize image to proper shape
        resized_img = tf.image.resize(img, (224, 224))
        
        # process for vgg19
        final_img = keras.applications.vgg19.preprocess_input(resized_img)
                
        return final_img, tf.one_hot(table.lookup(tensor[1]), depth=len(classes) + 1)[:len(classes)]
    
    for class_ in classes:
        curdir = f'{directory}/{class_}'
        set_.extend([(f'{curdir}/{filename}', class_ ) for filename in os.listdir(curdir) if '.jpe' in filename or '.png' in filename])
        
    # shuffle the (filename, label) tuples once every class has been collected
    random.shuffle(set_)
        
    dataset = tf.data.Dataset.from_tensor_slices(tf.stack(set_))
    dataset = dataset.map(preprocess, num_parallel_calls=num_parallel_calls)
    dataset = dataset.shuffle(shuffle).repeat(repeat)
    
    return dataset.batch(batch).prefetch(1)

train_dataset = make_dataset('train')
test_dataset = make_dataset('test')
val_dataset = make_dataset('val')

In [5]:
for item in train_dataset.take(1):
    print(item[0].shape)
    print(item[1])

(30, 224, 224, 3)
tf.Tensor(
[[0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]], shape=(30, 4), dtype=float32)


<h4>Build and Train CNN</h4>
<p>Here we build a CNN that uses the VGG19 architecture as a base model and adds global average pooling, batch normalization, a dense layer with dropout, and an output layer on top. The top layers are trained for up to 15 epochs, with early stopping at a patience of 10 epochs. The base model stays frozen during this training session, until the top layers are well trained.</p>
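
<p>To make the patience setting concrete, here is a small pure-Python sketch of the stopping rule: training stops once the monitored validation loss has failed to improve for <code>patience</code> consecutive epochs. The function <code>early_stopping_epoch</code> and the loss values are made up for illustration; in the notebook this role is played by <code>keras.callbacks.EarlyStopping</code>.</p>

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# made-up validation losses: three epochs of improvement, then a plateau
losses = [1.4, 1.1, 0.9] + [0.95] * 12
print(early_stopping_epoch(losses, patience=10))
```

<p>With a patience of 10, a run needs more than 10 epochs before early stopping can trigger at all, so the patience should be chosen relative to the epoch budget.</p>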

In [6]:
LOGGER.info('Building CNN model...')
base_model = keras.applications.vgg19.VGG19(weights='imagenet',
                                            include_top=False,
                                            input_shape=(224, 224, 3))

avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
norm = keras.layers.BatchNormalization()(avg)
dense = keras.layers.Dense(64, activation=keras.activations.relu)(norm)
norm = keras.layers.BatchNormalization()(dense)
dropout = keras.layers.Dropout(0.5)(norm)
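# softmax yields one probability distribution over the 4 mutually exclusive classes,
# which is what categorical cross-entropy expects
output = keras.layers.Dense(4, activation=keras.activations.softmax)(dropout)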
model = keras.Model(inputs=base_model.input, outputs=output)

Building CNN model...


In [7]:
lr = 0.2
epochs = 15

# freeze the VGG19 base so that only the new top layers are trained
for layer in base_model.layers:
    layer.trainable = False

LOGGER.info('Compiling model...')
# time-based decay: lr_t = lr / (1 + decay * step)
model.compile(loss=keras.losses.categorical_crossentropy, 
              optimizer=keras.optimizers.Adam(learning_rate=lr, decay=lr / epochs),
              metrics=['accuracy'])

LOGGER.info('Training top layers of model...')
# early stopping is passed to fit() as a callback
history = model.fit(train_dataset, epochs=epochs, validation_data=val_dataset,
                    callbacks=[keras.callbacks.EarlyStopping(patience=10, monitor='val_loss')])

Compiling model...
Training top layers of model...
Epoch 1/15
     40/Unknown - 224s 6s/step - loss: 1.4227 - accuracy: 0.5205

KeyboardInterrupt: 

In [None]:
# plot training data
fig, ax = plt.subplots(figsize=[10, 6])

plt_data = pd.DataFrame(history.history)

plt.plot(plt_data)
plt.legend(labels=plt_data.columns, fontsize=10)

ax.spines["top"].set_visible(False)    
ax.spines["bottom"].set_visible(False)    
ax.spines["right"].set_visible(False)    
ax.spines["left"].set_visible(False)  

plt.xlabel('Epoch',fontsize=12)
plt.yticks(np.array(range(0, 11, 1)) / 10)
plt.xticks(fontsize=10)

for y in np.array(range(0, 11, 1)) / 10:    
    plt.plot([-1,31], [y] * 2, "--", lw=0.5, color="black", alpha=0.3)  

plt.tick_params(axis="both", which="both", bottom=False, top=False,    
                labelbottom=True, left=False, right=False, labelleft=True)

plt.title('Training History', fontsize=18)
plt.gca().set_ylim(0,1)
plt.show()

In [None]:
loss, acc = model.evaluate(test_dataset)

In [None]:
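import seaborn as sn
from sklearn.metrics import confusion_matrix

# collect labels and predictions in a single pass, because the dataset
# is reshuffled every time it is iterated
y_true, y_pred = [], []
for images, labels in test_dataset:
    y_true.append(np.argmax(labels.numpy(), axis=1))
    y_pred.append(np.argmax(model.predict(images), axis=1))

conf_mx = confusion_matrix(np.concatenate(y_true), np.concatenate(y_pred),
                           labels=list(range(len(classes))))

# plot confusion matrix (rows: true class, columns: predicted class)
sn.set(font_scale=1.4) # for label size
sn.heatmap(conf_mx, annot=True, fmt='d', annot_kws={"size": 16}, # font size
           xticklabels=classes, yticklabels=classes)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion matrix')
plt.show()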
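
<p>Since the goal is to detect COVID-19, per-class precision and recall are often more informative than overall accuracy. The sketch below derives both from a confusion matrix laid out as rows = true class, columns = predicted class. The helper <code>per_class_metrics</code> is hypothetical and the counts are made up for illustration; they are not results from this notebook.</p>

```python
classes = ['vp', 'normal', 'COVID-19', 'bp']

def per_class_metrics(conf_mx, class_names):
    """Rows are true classes, columns are predicted classes."""
    metrics = {}
    for k, name in enumerate(class_names):
        tp = conf_mx[k][k]
        actual = sum(conf_mx[k])                    # all samples of class k
        predicted = sum(row[k] for row in conf_mx)  # all predictions of class k
        metrics[name] = {
            'precision': tp / predicted if predicted else 0.0,
            'recall': tp / actual if actual else 0.0,
        }
    return metrics

# made-up counts, NOT results from this notebook
example = [
    [8, 1, 0, 1],
    [0, 9, 1, 0],
    [1, 0, 4, 0],
    [2, 0, 0, 8],
]
for name, m in per_class_metrics(example, classes).items():
    print(f"{name}: precision={m['precision']:.2f}, recall={m['recall']:.2f}")
```

<p>For a screening task, recall on the COVID-19 class (the fraction of true COVID-19 cases that the model catches) is usually the number to watch first.</p>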