## **COVID-19 Diagnosis Using Chest X-Ray Data**
This project uses chest X-ray data to train a deep neural network to help diagnose COVID-19.

### Introduction
COVID-19 has severely affected the health of people worldwide. Early detection of infected patients is a crucial step in controlling the disease. Prior literature shows that COVID-19 causes abnormalities that are visible in chest X-rays, so radiography can support early detection.

We begin by importing necessary packages for our model.


In [1]:
import cv2
import os
import numpy as np
import pandas as pd

### Data Collection
No large, clinically verified, publicly available COVID-19 dataset exists. However, a small composite dataset with X-rays of COVID-19-positive patients recently became publicly available through [DeepCovid](https://github.com/shervinmin/DeepCovid), which compiled its data from:

[Covid-Chestxray-Dataset](https://github.com/ieee8023/covid-chestxray-dataset) for COVID-19 X-ray samples

[CheXpert Dataset](https://stanfordmlgroup.github.io/competitions/chexpert/) for non-COVID samples


Our data is stored on Google Drive, so we mount the drive to access it.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
from zipfile import ZipFile

file_name = 'drive/My Drive/dataset.zip'

# Use a name other than zip so the built-in zip() is not shadowed
with ZipFile(file_name, 'r') as archive:
    archive.extractall()
    print("Data extracted!")

Data extracted!


Next, we build the features and labels. We use paths.list_images from imutils to list the path of every X-ray image in the dataset.

In [4]:
from imutils import paths

data = []
labels = []

# Grab list of image paths using paths.list_images
imagePaths = list(paths.list_images('data_upload_v3/train/covid')) + list(paths.list_images('data_upload_v3/train/non'))

# Label and resize the images 

for imagePath in imagePaths:
    # Extract the class label from the parent directory name ('covid' or 'non')
    if os.path.basename(os.path.dirname(imagePath)) == 'covid':
        label = 'covid'
    else:
        label = 'normal'

    image = cv2.imread(imagePath)
    image = cv2.resize(image, (224, 224))

    data.append(image)
    labels.append(label)


# Convert data and labels to a Numpy Array and normalize the pixel values
data = np.array(data) / 255.0
labels = np.array(labels)

In [5]:
(unique, counts) = np.unique(labels, return_counts=True)
frequencies = np.asarray((unique, counts)).T
print(frequencies)

[['covid' '84']
 ['normal' '2000']]


In [6]:
# Extract test data

test_data = []
test_labels = []

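# Grab list of image paths using paths.list_images
# NOTE: these are the same train/ folders loaded above, so test_data holds
# the same 2,084 images as data rather than a held-out set
test_imagePaths = list(paths.list_images('data_upload_v3/train/covid')) + list(paths.list_images('data_upload_v3/train/non'))

# Label and resize the images

for imagePath in test_imagePaths:
    # Extract the class label from the parent directory name ('covid' or 'non')
    if os.path.basename(os.path.dirname(imagePath)) == 'covid':
        label = 'covid'
    else:
        label = 'normal'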

    image = cv2.imread(imagePath)
    image = cv2.resize(image, (224, 224))

    test_data.append(image)
    test_labels.append(label)


# Convert data and labels to a Numpy Array and normalize the pixel values
test_data = np.array(test_data) / 255.0
test_labels = np.array(test_labels)

We use scikit-learn's LabelBinarizer together with Keras's to_categorical to one-hot encode our labels.

In [7]:
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.utils import to_categorical

lb = LabelBinarizer()
labels = lb.fit_transform(labels)
labels = to_categorical(labels)
# Reuse the encoder fitted above so both label sets share one class mapping
test_labels = lb.transform(test_labels)
test_labels = to_categorical(test_labels)
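
To make the encoding concrete, the sketch below reproduces the two steps in plain Python for a few toy labels. It assumes, as LabelBinarizer does, that classes are ordered alphabetically, so covid maps to index 0 and normal to index 1. The helper name one_hot_encode and the sample labels are illustrative and not part of the notebook.

```python
def one_hot_encode(labels):
    """Map string labels to one-hot rows, with classes in sorted order."""
    classes = sorted(set(labels))                 # e.g. ['covid', 'normal']
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for name in labels:
        row = [0.0] * len(classes)
        row[index[name]] = 1.0                    # set the slot for this class
        rows.append(row)
    return classes, rows


# Illustrative labels, not taken from the dataset
classes, encoded = one_hot_encode(['covid', 'normal', 'normal'])
print(classes)   # ['covid', 'normal']
print(encoded)   # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

With this ordering, column 0 corresponds to covid and column 1 to normal, which is why the argmax of the model outputs can be passed to classification_report together with lb.classes_.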

### Data Splitting
Now that the data is loaded, we split it into training and validation sets.

In [8]:
from sklearn.model_selection import train_test_split
# Partition data into 80% training and 20% validation, stratified by class.
# Note: testX and testY hold the validation split.
trainX, testX, trainY, testY = train_test_split(
    data, labels, stratify=labels, test_size=0.20, random_state=123)
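
As a rough check of what stratification does here, the sketch below splits each class separately using the class counts printed earlier (84 covid, 2000 normal). It is a simplified per-class illustration, not scikit-learn's actual algorithm, and stratified_counts is a hypothetical helper.

```python
def stratified_counts(class_counts, test_size):
    """Return (train, validation) counts per class for a per-class split."""
    split = {}
    for name, count in class_counts.items():
        n_val = round(count * test_size)          # validation share of this class
        split[name] = (count - n_val, n_val)
    return split


split = stratified_counts({'covid': 84, 'normal': 2000}, test_size=0.20)
for name, (n_train, n_val) in split.items():
    print(f"{name}: {n_train} train, {n_val} validation")
```

The resulting validation counts, 17 covid and 400 normal, agree with the support column of the validation report later in the notebook, and the share of covid images stays near 4% in both splits.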

### Set hyperparameters

In [9]:
EPOCHS = 20
BATCH_SIZE = 16
LR = 1e-4

### Base model
We use ResNet152V2 as the base model. The pretrained weights come from ImageNet, and the input shape is (224, 224, 3).

In [10]:
from tensorflow.keras.applications import ResNet152V2
from tensorflow.keras.layers import Input

base = ResNet152V2(include_top=False, weights = 'imagenet', input_shape = (224,224,3))

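### Build Model
We add a small classification head on top of the ResNet base: average pooling, batch normalization, a 128-unit dense layer, and a 2-unit output layer. The base layers are frozen so that only the head is trained.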

In [11]:
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC
from tensorflow.keras.losses import categorical_crossentropy

# Construct the classification head on top of the ResNet output
head = base.output
head = AveragePooling2D(pool_size=(7,7))(head)
head = BatchNormalization()(head)
head = Flatten()(head)
head = Dense(128, activation='relu')(head)
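# Two output units, one per class. Softmax is the more common pairing with
# categorical cross-entropy; sigmoid is kept here to match the recorded results.
head = Dense(2, activation='sigmoid')(head)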

# Combine the model
model = Model(inputs=base.input, outputs=head)

# Freeze base layers
base.trainable = False

# Compile the model
model.compile(loss=categorical_crossentropy, optimizer=Adam(learning_rate = LR), metrics=[AUC(curve="PR")])
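
The shapes flowing through the head can be traced by hand. The sketch below assumes that ResNet152V2 without its top maps a 224x224 input to a 7x7x2048 feature map, and it uses the standard parameter-count formulas for dense and batch normalization layers. The variable names are illustrative.

```python
# Assumed output of the frozen ResNet152V2 base for a (224, 224, 3) input
height, width, channels = 7, 7, 2048

# AveragePooling2D with a 7x7 window (stride defaults to the pool size)
pooled = (height // 7, width // 7, channels)          # (1, 1, 2048)

# Flatten collapses the pooled map into one feature vector
flat = pooled[0] * pooled[1] * pooled[2]              # 2048

# BatchNormalization keeps gamma, beta, moving mean and moving variance per channel
bn_params = 4 * channels                              # 8192, half of them trainable

# Dense layers: inputs * units weights plus one bias per unit
dense_128_params = flat * 128 + 128                   # 262272
dense_2_params = 128 * 2 + 2                          # 258

print(pooled, flat, bn_params, dense_128_params, dense_2_params)
```

Because the base is frozen, only these head parameters are updated during training.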

### Data augmentation
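We apply random data augmentation to the training set to make the model more robust: rotations of up to 20 degrees, horizontal and vertical shifts of up to 20% of the image size, and horizontal flips.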

In [12]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

trainAug = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)
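
To give a feel for these settings, the sketch below works out the maximum shift in pixels for a 224x224 image and applies a horizontal flip to a tiny grid in plain Python. It only illustrates the transformations and is not how ImageDataGenerator is implemented. The grid values and names are made up.

```python
IMAGE_SIZE = 224
SHIFT_RANGE = 0.2   # matches width_shift_range and height_shift_range above

# A fractional shift range is relative to the image width or height
max_shift_pixels = SHIFT_RANGE * IMAGE_SIZE
print(f"maximum shift: about {max_shift_pixels:.1f} pixels")


def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


# Tiny made-up "image" to show the effect of the flip
image = [[1, 2, 3],
         [4, 5, 6]]
print(horizontal_flip(image))   # [[3, 2, 1], [6, 5, 4]]
```

Each training batch is transformed with freshly sampled parameters, so the model rarely sees exactly the same image twice.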

### Training


In [13]:
# Add EarlyStopping to halt training once validation loss has not improved for 5 epochs
from tensorflow.keras.callbacks import EarlyStopping
callback = EarlyStopping(monitor = 'val_loss', mode = 'min', patience = 5)
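
The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the callback's behavior, assuming the default min_delta of 0. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # new best validation loss, reset the counter
            wait = 0
        else:
            wait += 1        # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses, not taken from the training run
losses = [0.50, 0.40, 0.30, 0.35, 0.36, 0.34, 0.31, 0.32, 0.20]
print(early_stopping_epoch(losses, patience=5))   # 8
```

In this toy sequence the best loss occurs at epoch 3, so training stops five epochs later at epoch 8 and never reaches the lower loss at epoch 9. By the same logic, if the run below ended after epoch 18 because of this callback, the last improvement in validation loss would have come at epoch 13.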

In [14]:
# Train model
H = model.fit(
    trainAug.flow(trainX, trainY, batch_size=BATCH_SIZE), 
    validation_data = (testX, testY), 
    steps_per_epoch=len(trainX) // BATCH_SIZE, 
    epochs = EPOCHS,
    callbacks = [callback]
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20


### Predictions
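Here are some important metrics to see how well our model performs on the test set. model.evaluate returns the loss followed by the compiled metric, here the area under the precision-recall curve.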

In [15]:
model.evaluate(test_data, test_labels)



[0.026741014793515205, 0.9979087710380554]

In [16]:
from sklearn.metrics import classification_report

# Classification report on the 20% validation split (testX, testY)
predIdxs = np.argmax(model.predict(testX, batch_size=BATCH_SIZE), axis=1)

print(classification_report(testY.argmax(axis=1), predIdxs,
                            target_names=lb.classes_))

              precision    recall  f1-score   support

       covid       0.88      0.41      0.56        17
      normal       0.98      1.00      0.99       400

    accuracy                           0.97       417
   macro avg       0.93      0.70      0.77       417
weighted avg       0.97      0.97      0.97       417



In [17]:
# Classification report on test_data
predIdxs = np.argmax(model.predict(test_data, batch_size=BATCH_SIZE), axis=1)

print(classification_report(test_labels.argmax(axis=1), predIdxs,
                            target_names=lb.classes_))

              precision    recall  f1-score   support

       covid       0.96      0.81      0.88        84
      normal       0.99      1.00      1.00      2000

    accuracy                           0.99      2084
   macro avg       0.97      0.90      0.94      2084
weighted avg       0.99      0.99      0.99      2084



We can also compute the confusion matrix to find the sensitivity and specificity.
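
A minimal sketch of that computation follows, treating covid as the positive class. The counts are an assumption: they are reconstructed from the rounded precision, recall, and support values in the validation report above (7 of 17 covid images detected, 1 of 400 normal images flagged as covid), not read from an actual confusion matrix. The function name is illustrative.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Compute sensitivity (true positive rate) and specificity (true negative rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Counts reconstructed from the validation report (assumption, see text)
tp, fn = 7, 10      # covid images predicted covid / predicted normal
fp, tn = 1, 399     # normal images predicted covid / predicted normal

sens, spec = sensitivity_specificity(tp, fn, fp, tn)
print(f"sensitivity = {sens:.4f}, specificity = {spec:.4f}")
```

On real predictions, the same four counts can be obtained with confusion_matrix(true_labels, predicted_labels).ravel() from sklearn.metrics. With covid encoded as class 0, the values come back in the order covid-as-covid, covid-as-normal, normal-as-covid, normal-as-normal.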