# Phase 1: Introduction to the Project & Dataset Exploration

### Brief Description of the Dataset:

Link to the Dataset: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia

#### <u>The dataset is comprised of 3 folders:</u>

1. **train** - *contains the images that the model will be trained on* (the training set)
   - **1341 images** are x-rays of **'normal'** lungs, **3875 images** are x-rays of lungs with **pneumonia**
2. **test** - *contains the images that the model will be tested on to evaluate its performance, providing guidelines for model tuning* (the test set)
   - **234 images** are x-rays of **'normal'** lungs, **390 images** are x-rays of lungs with **pneumonia**
3. **val** - *contains the images that will be used to assess the final performance of the fully trained model* (the validation set)
   - **8 images** are x-rays of **'normal'** lungs, **8 images** are x-rays of lungs with **pneumonia**

The dataset contains images of lungs with both 'bacterial' and 'viral' pneumonia; however, the model will not differentiate between the two (it will simply aim to classify a given x-ray image of a lung as showing pneumonia or not).

### **Pointing out the class imbalance**
With a total of 5216 images in the training set, roughly **74.3%** are **x-ray images of pneumonia-infected lungs**, leaving ***only around 25.7%*** to be ***x-ray images of healthy lungs***. **This is quite a significant class imbalance** (almost 3x as many pneumonia-infected lungs as healthy lungs), and it will need to be handled later on.
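
The percentages above can be reproduced from the image counts listed in the dataset description. The snippet below is only a small arithmetic check; the variable names are illustrative and not part of the original notebook.

```python
# Image counts for the training set, as listed in the dataset description above.
train_counts = {"NORMAL": 1341, "PNEUMONIA": 3875}

total = sum(train_counts.values())
shares = {name: count / total for name, count in train_counts.items()}
ratio = train_counts["PNEUMONIA"] / train_counts["NORMAL"]

for name, share in shares.items():
    print(f"{name}: {share:.1%} of {total} training images")
print(f"Pneumonia-to-normal ratio: {ratio:.2f}")
```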

# Phase 2: Data Preprocessing

### Importing Tensorflow & the ImageDataGenerator Class

In [None]:
import tensorflow
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## 1. Preprocessing the Training Set

In [None]:
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)

training_set = train_datagen.flow_from_directory('data/train',
                                                 target_size = (64,64),
                                                 batch_size = 32,
                                                 color_mode = 'grayscale',
                                                 class_mode = 'binary')

Keras' *flow_from_directory* expects a directory structure in which the main directory (passed as the first argument) contains one **subdirectory** per class, which is how this dataset is structured. Each image is labeled according to its folder name: because class folders are indexed in alphabetical order, images in 'NORMAL' are assigned 0 and images in 'PNEUMONIA' are assigned 1.
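
As a rough illustration of the labeling rule described above, the mapping can be mimicked in plain Python by sorting the folder names and enumerating them. This is a sketch of the idea rather than Keras' own code; on the real generator, the mapping actually used can be inspected through `training_set.class_indices`.

```python
# Folder names as they appear under data/train in this dataset.
class_folders = ["PNEUMONIA", "NORMAL"]

# Sorting the names and enumerating them reproduces the label assignment
# described above: the first folder gets 0, the second gets 1.
class_indices = {name: index for index, name in enumerate(sorted(class_folders))}
print(class_indices)
```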

In [None]:
import matplotlib.pyplot as plt

images, labels = next(training_set)

plt.figure(figsize=(12, 12))
for i in range(6):
    plt.subplot(2, 3, i + 1)
    plt.imshow(images[i], cmap='gray')
    plt.title(f"Label: {int(labels[i])}") #0 for normal, 1 for pneumonia
    plt.axis("off")
plt.show()

## 2. Preprocessing the Test Set

The test set should **not be augmented** because it should reflect *real-world conditions*; otherwise, the model would be tested on **altered images** that aren't actual medical scans.

In [None]:
test_datagen = ImageDataGenerator(rescale=1.0/255.0)

test_set = test_datagen.flow_from_directory('data/test',
                                                 target_size = (64,64),
                                                 batch_size = 32,
                                                 color_mode = 'grayscale',
                                                 class_mode = 'binary')

#### Handling the Class Imbalance: Using Class Weighting instead of Resampling techniques

Since the class imbalance is not too extreme, and resampling can lead to overfitting (oversampling) or loss of valuable information (undersampling), I will proceed with *cost-sensitive learning* by using sklearn's **compute_class_weight** function.

In [None]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

class_labels = np.array([0,1]) #(0 = Normal, 1 = Pneumonia)

class_weights = compute_class_weight(
    class_weight = "balanced",
    classes = class_labels,
    y = [0]*1341 + [1]*3875)

class_weight_dict = {i: weight for i, weight in enumerate(class_weights)}
print("Computed Class Weights:", class_weight_dict)

During training, the CNN minimizes a **loss function** over successive epochs. With class weighting, the loss function applies **higher penalties** to misclassified samples from the *minority class* (the normal lungs). This handles the class imbalance by urging the model to pay more attention to the minority class.
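
To make the effect of the weights concrete, here is a simplified NumPy sketch. It recomputes the "balanced" weights by hand using the heuristic `n_samples / (n_classes * count_per_class)` and applies them to a binary cross-entropy. The `weighted_bce` helper and the prediction values are hypothetical, and the exact way Keras reduces the weighted losses may differ from the plain mean used here.

```python
import numpy as np

counts = np.array([1341, 3875])  # training images: [normal, pneumonia]

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
weights = counts.sum() / (len(counts) * counts)

def weighted_bce(y_true, y_pred, class_weights):
    """Mean binary cross-entropy with each sample scaled by its class weight."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 1e-7, 1 - 1e-7)
    per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    sample_weights = np.where(y_true == 1, class_weights[1], class_weights[0])
    return float(np.mean(sample_weights * per_sample))

# Hypothetical predictions: the same error of 0.4 on one sample of each class.
loss_normal_missed = weighted_bce([0], [0.4], weights)
loss_pneumonia_missed = weighted_bce([1], [0.6], weights)

print("Class weights [normal, pneumonia]:", weights)
print("Weighted loss, normal sample:", loss_normal_missed)
print("Weighted loss, pneumonia sample:", loss_pneumonia_missed)
```

The unweighted error is identical for both samples, so any difference between the two printed losses comes purely from the class weights.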

# Phase 3: Model Development

In [None]:
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPool2D
from tensorflow.keras.models import Sequential

### 1. Convolution

In [None]:
cnn = Sequential()

In [None]:
cnn.add(Conv2D(filters=32,
               kernel_size=3,
               activation='relu',
               input_shape=(64,64,1)))

### 2. Pooling

In [None]:
cnn.add(MaxPool2D(pool_size=2,
                  strides=2))

**Multiple convolution layers may be added to make the architecture more complex.** This is needed since detecting pneumonia in grayscale x-ray images is already very difficult for the human eye: the more difficult it is to detect the relevant features, the greater the need for a more complex model.

In [None]:
#Second Layer:
cnn.add(Conv2D(filters=32,
                 kernel_size=3,
                 activation='relu'))

cnn.add(MaxPool2D(pool_size=2,
                  strides=2))

#Third Layer:
cnn.add(Conv2D(filters=32,
                 kernel_size=3,
                 activation='relu'))

cnn.add(MaxPool2D(pool_size=2,
                  strides=2))

### 3. Flattening

In [None]:
cnn.add(Flatten())
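
It can help to trace how large the feature maps are by the time they reach the flatten step. The sketch below assumes the Keras defaults used in this notebook (no padding, convolution stride 1) and follows the three Conv2D + MaxPool2D blocks. The helper names `conv_out` and `pool_out` are illustrative only.

```python
def conv_out(size, kernel=3, stride=1):
    """Output size of a convolution with no padding ('valid')."""
    return (size - kernel) // stride + 1

def pool_out(size, pool=2, stride=2):
    """Output size of max pooling with no padding ('valid')."""
    return (size - pool) // stride + 1

size = 64       # input images are 64x64
filters = 32    # every Conv2D layer in this notebook uses 32 filters
trace = [size]
for _ in range(3):  # three Conv2D + MaxPool2D blocks
    size = conv_out(size)
    trace.append(size)
    size = pool_out(size)
    trace.append(size)

flattened = size * size * filters
print("Spatial size after each layer:", trace)
print("Length of the flattened vector:", flattened)
```

Under these assumptions, `cnn.summary()` should report the same shapes and can be used to confirm them on the actual model.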

### 4. Full Connection

In [None]:
cnn.add(Dense(units=128,
              activation='relu'))

#final output layer
cnn.add(Dense(units=1,
              activation='sigmoid'))

In [None]:
from tensorflow.keras.metrics import Precision, Recall
cnn.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics=['accuracy', Precision(name='precision'), Recall(name='recall')])

In [None]:
history = cnn.fit(x=training_set, validation_data=test_set, epochs=10, class_weight = class_weight_dict)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()

plt.subplot(1,2,2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.show()

There are relatively erratic fluctuations in validation loss. While the fluctuations are few in number, there is no clear downward trend: the model is clearly struggling to generalize.

The model also seems to be overfitting, as indicated by **lower validation accuracy** than training accuracy and **higher validation loss** than training loss.

### Precision Metrics

In [None]:
print("Training Precision:", history.history['precision'], '\n')
print("Validation Precision:", history.history['val_precision'], '\n')

The model seems to be **slightly weak** at minimizing false positives on unseen data: validation precision is considerably lower than training precision across all epochs.

### Recall Metrics

In [None]:
print("Training Recall:", history.history['recall'], '\n')
print("Validation Recall:", history.history['val_recall'], '\n')

The model seems to be **relatively strong** at detecting actual pneumonia cases when it comes to unseen data

## Model Tuning

### Trying a lower learning rate

In [None]:
#default learning rate of adam optimizer is 0.001, try 0.0005
from tensorflow.keras.optimizers import Adam
adam_optimizer = Adam(learning_rate=0.0005)

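#Note: compile() does not reset the weights, so this run continues training from the weights learned above
cnn.compile(optimizer = adam_optimizer, loss = 'binary_crossentropy', metrics=['accuracy', Precision(name='precision'), Recall(name='recall')])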
history = cnn.fit(x=training_set, validation_data=test_set, epochs=15, class_weight=class_weight_dict)

In [None]:
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()

plt.subplot(1,2,2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.show()

There are slight upward/downward trends in validation accuracy/loss respectively; however, there are still erratic fluctuations. **Slight improvement nonetheless.**

This may suggest batch sensitivity and/or model instability. Overall, this shows the model is still inconsistent. Additionally, the model is still overfitting, as shown by the significant difference between training and validation accuracy/loss.

### Trying a larger batch size

In [None]:
#Try larger batch size
training_set = train_datagen.flow_from_directory('data/train',
                                                 target_size = (64,64),
                                                 batch_size = 64,
                                                 color_mode = 'grayscale',
                                                 class_mode = 'binary')

test_set = test_datagen.flow_from_directory('data/test',
                                                 target_size = (64,64),
                                                 batch_size = 64,
                                                 color_mode = 'grayscale',
                                                 class_mode = 'binary')

cnn.compile(optimizer = adam_optimizer, loss = 'binary_crossentropy', metrics=['accuracy', Precision(name='precision'), Recall(name='recall')])
history = cnn.fit(x=training_set, validation_data=test_set, epochs=15, class_weight=class_weight_dict)

In [None]:
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()

plt.subplot(1,2,2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.show()

Model seems even more unstable. This may have occurred because larger batch sizes lead to **less frequent weight updates**: the model's weights update after each batch, so larger batches mean fewer batches per epoch and therefore fewer updates.
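
The number of weight updates per epoch follows directly from the training set size (5216 images) and the batch size. The snippet below is a quick arithmetic sketch of that relationship; the variable names are illustrative.

```python
import math

n_train = 5216  # training images, from the dataset description

# One weight update per batch; the last batch may be smaller, hence the ceiling.
updates_per_epoch = {
    batch_size: math.ceil(n_train / batch_size) for batch_size in (32, 64)
}

for batch_size, updates in updates_per_epoch.items():
    print(f"batch_size={batch_size}: {updates} weight updates per epoch")
```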

In [None]:
#Reducing batch size back to 32
training_set = train_datagen.flow_from_directory('data/train',
                                                 target_size = (64,64),
                                                 batch_size = 32,
                                                 color_mode = 'grayscale',
                                                 class_mode = 'binary')

test_set = test_datagen.flow_from_directory('data/test',
                                                 target_size = (64,64),
                                                 batch_size = 32,
                                                 color_mode = 'grayscale',
                                                 class_mode = 'binary')

### Trying an even lower learning rate and introducing early stopping and L2 regularization

In [None]:
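#Try lower learning rate + introduce Early Stopping
from tensorflow.keras.callbacks import EarlyStopping

adam_optimizer = Adam(learning_rate=0.0003) #0.0005 --> 0.0003

#Stop once val_loss has not improved for 4 consecutive epochs, then restore the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=4, restore_best_weights=True)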
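
The patience rule can be illustrated in plain Python. This is a minimal sketch of the idea, not the Keras implementation: the `early_stopping_epoch` helper is hypothetical and the validation losses are made up purely for illustration.

```python
def early_stopping_epoch(val_losses, patience=4):
    """Return (stop_index, best_index) for a list of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the
    best loss seen so far; best_index is the epoch whose weights would be kept.
    """
    best_loss = float("inf")
    best_index = -1
    wait = 0
    for index, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_index, wait = loss, index, 0
        else:
            wait += 1
            if wait >= patience:
                return index, best_index
    # Patience never ran out: training ran for all epochs.
    return len(val_losses) - 1, best_index

# Hypothetical validation losses, made up for illustration only.
losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.63, 0.50]
stop_index, best_index = early_stopping_epoch(losses)
print(f"Stopped at epoch index {stop_index}, best epoch index {best_index}")
```

In this made-up sequence the final value (0.50) is never reached, which shows the trade-off of a small patience: training may stop before a later improvement.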

In [None]:
#Redefining model architecture with Regularization
from tensorflow.keras.regularizers import l2

cnn = Sequential()

# First Convolutional Layer
cnn.add(Conv2D(32, kernel_size=3, activation='relu', kernel_regularizer=l2(0.001), input_shape=(64, 64, 1)))
cnn.add(MaxPool2D(pool_size=2, strides=2))

# Second Convolutional Layer
cnn.add(Conv2D(32, kernel_size=3, activation='relu', kernel_regularizer=l2(0.001)))
cnn.add(MaxPool2D(pool_size=2, strides=2))

# Third Convolutional Layer
cnn.add(Conv2D(32, kernel_size=3, activation='relu', kernel_regularizer=l2(0.001)))
cnn.add(MaxPool2D(pool_size=2, strides=2))

# Flatten layer
cnn.add(Flatten())

# Fully Connected Layer
cnn.add(Dense(units=128, activation='relu'))

#Output Layer
cnn.add(Dense(units=1, activation='sigmoid'))

cnn.compile(optimizer=adam_optimizer, loss='binary_crossentropy', metrics=['accuracy', Precision(name='precision'), Recall(name='recall')])
cnn.fit(x=training_set, validation_data=test_set, epochs=20, class_weight=class_weight_dict, callbacks=[early_stop])

In [None]:
history = cnn.history #History object recorded by the fit() call above
epochs = range(len(history.history['accuracy']))

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs, history.history['accuracy'], label='Training Accuracy')
plt.plot(epochs, history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.title('Training & Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs, history.history['loss'], label='Training Loss')
plt.plot(epochs, history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.title('Training & Validation Loss')

plt.show()

#### Excluding observation of 1st epoch (usually has instability due to untrained weights)

In [None]:
epochs = range(1, len(history.history['accuracy'])) #one less epoch

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs, history.history['accuracy'][1:], label='Training Accuracy')
plt.plot(epochs, history.history['val_accuracy'][1:], label='Validation Accuracy')
plt.legend()
plt.title('Training & Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs, history.history['loss'][1:], label='Training Loss')
plt.plot(epochs, history.history['val_loss'][1:], label='Validation Loss')
plt.legend()
plt.title('Training & Validation Loss')

plt.show()

**Significant improvement**. The model looks more stable (although there are still some erratic changes in validation accuracy/loss), shows less overfitting (smaller difference between training and validation accuracy), and has clear upward/downward trends for validation accuracy/loss respectively.

In [None]:
print("Training Precision:", history.history['precision'], '\n')
print("Validation Precision:", history.history['val_precision'], '\n')

print("Training Recall:", history.history['recall'], '\n')
print("Validation Recall:", history.history['val_recall'], '\n')

## Making a Single Prediction

In [None]:
from tensorflow.keras.preprocessing import image

img = image.load_img('data/val/NORMAL/NORMAL2-IM-1442-0001.jpeg', target_size=(64, 64), color_mode='grayscale')
img_array = image.img_to_array(img)
img_array = np.expand_dims(img_array, axis=0)
img_array = img_array / 255.0

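prediction = cnn.predict(img_array)
probability = prediction[0][0] #sigmoid output: estimated probability of pneumonia (class 1)
class_label = "Pneumonia" if probability > 0.5 else "Normal"

print(f"Prediction: {class_label} (Pneumonia probability: {probability:.4f})")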

- Model Predicts 5/8 Normal Lungs **correctly** in the Validation Set
- Model Predicts 8/8 Pneumonia Lungs **correctly** in the Validation Set
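
The two counts above can be turned into the usual summary metrics. The sketch below treats pneumonia as the positive class and derives the confusion-matrix entries from the bullet points. With only 16 images, these figures are a rough indication rather than a reliable estimate.

```python
# Counts taken from the two bullet points above (pneumonia is the positive class).
true_positives = 8    # pneumonia images predicted as pneumonia
false_negatives = 0   # pneumonia images predicted as normal
true_negatives = 5    # normal images predicted as normal
false_positives = 3   # normal images predicted as pneumonia

total = true_positives + false_negatives + true_negatives + false_positives
accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")
```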

## Saving the model

In [None]:
cnn.save("../../web-app/david-boules/model.keras")

# References:

- Data Augmentation Techniques: https://www.linkedin.com/advice/0/how-do-you-implement-data-augmentation-techniques
- Handling Class Imbalance in Image Classification: Techniques and Best Practices: https://medium.com/@okeshakarunarathne/handling-class-imbalance-in-image-classification-techniques-and-best-practices-c539214440b0
- Handling Class Imbalances using Class Weights: https://medium.com/@ravi.abhinav4/improving-class-imbalance-with-class-weights-in-machine-learning-af072fdd4aa4
- The Ultimate Guide to Convolutional Neural Networks: https://www.superdatascience.com/blogs/the-ultimate-guide-to-convolutional-neural-networks-cnn
- Building a Convolutional Neural Network using TensorFlow: https://www.analyticsvidhya.com/blog/2021/06/building-a-convolutional-neural-network-using-tensorflow-keras/
- Fixing Overfitting on a CNN: https://www.geeksforgeeks.org/what-are-the-possible-approaches-to-fixing-overfitting-on-a-cnn/
- Regularization in Deep Learning: https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/