# Phase 1: Introduction to the Project & Dataset Exploration

### Quick Description of the Dataset:

Link to the Dataset: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia

#### <u>The dataset comprises 3 folders:</u>

1. **train** - *contains the images that the model will be trained on* (the training set)
   - **1341 images** are x-rays of **'normal'** lungs, **3875 images** are x-rays of lungs with **pneumonia**
2. **test** - *contains held-out images used to assess the final performance of the fully trained model* (the test set)
   - **234 images** are x-rays of **'normal'** lungs, **390 images** are x-rays of lungs with **pneumonia**
3. **val** - *contains the images used to evaluate the model during development, providing guidelines for model tuning* (the validation set)
   - **8 images** are x-rays of **'normal'** lungs, **8 images** are x-rays of lungs with **pneumonia**

The dataset contains images of lungs with both 'bacterial' and 'viral' pneumonia. However, the model will not differentiate between the two: it will simply classify a given lung x-ray as showing pneumonia or not.

### **CLASS IMBALANCE ALERT!!!**
The training set contains 5216 images in total. Roughly **74.3%** are **x-ray images of pneumonia-infected lungs**, leaving ***only around 25.7%*** as ***x-ray images of healthy lungs***. **This is a significant class imbalance** (almost 3x as many pneumonia-infected lungs as healthy lungs) that will need to be handled later on.
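The percentages above follow directly from the folder counts. A minimal sketch of the arithmetic, assuming only the image counts listed for the **train** folder (the variable names are illustrative):

```python
# Image counts per class in the training folder, as listed above.
train_counts = {"normal": 1341, "pneumonia": 3875}

total = sum(train_counts.values())

# Share of the training set taken up by each class.
shares = {label: count / total for label, count in train_counts.items()}

# How many pneumonia images there are per normal image.
imbalance_ratio = train_counts["pneumonia"] / train_counts["normal"]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"pneumonia-to-normal ratio: {imbalance_ratio:.2f}")
```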

# Phase 2: Data Preprocessing

### Importing Tensorflow & the ImageDataGenerator Class

In [2]:
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## 1. Preprocessing Training Set

<u>*The 'what' and 'why' behind the key steps in image preprocessing*:</u>
- **Image Resizing**: ensure all images are the same size - neural networks require inputs of a fixed size


- **Pixel Normalization**: scale pixel values into a smaller fixed range (e.g. 0 to 1) - helps the model converge faster by keeping input values on a consistent scale


- **Data Augmentation**: apply random transformations (e.g. rotations/flips) to images - increases the variety in the training set, which reduces overfitting and improves the generalization of the model
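The three steps can be illustrated on a toy array. This is only a sketch with NumPy, not how Keras implements them: the random array stands in for an 8-bit grayscale x-ray, and the slicing-based downsampling is a crude substitute for proper image resizing.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy stand-in for an 8-bit grayscale x-ray: 128x128 pixels, values 0..255.
image = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)

# Image resizing: keep every second pixel (crude nearest-neighbour
# downsampling) so the image reaches a fixed 64x64 size.
resized = image[::2, ::2]

# Pixel normalization: map values from [0, 255] into [0, 1].
normalized = resized.astype(np.float32) / 255.0

# Data augmentation: a horizontal flip reverses the column order.
flipped = normalized[:, ::-1]

print(resized.shape, float(normalized.min()), float(normalized.max()))
```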

In [3]:
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)

training_set = train_datagen.flow_from_directory('data/train',
                                                 target_size = (64,64), #try different sizes
                                                 batch_size = 32,
                                                 color_mode = 'grayscale',
                                                 class_mode = 'binary')

Found 5216 images belonging to 2 classes.


#### Explaining each parameter:

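<u>**ImageDataGenerator Parameters:**</u>

- **rescale**: multiplies every pixel value by the given factor; with 1/255, 8-bit pixel values are normalized from [0, 255] to [0, 1] (a form of *'min-max scaling'*)
- **shear_range**: applies a random shear (a transvection, which slants the image along one axis); the value sets the maximum shear intensity
- **zoom_range**: applies a random zoom augmentation within a given range ([1-value, 1+value])
    -  e.g. if value = 0.2, range is [0.8, 1.2]
- **horizontal_flip**: randomly flips images horizontally (mirrors them about the vertical axis)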
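The zoom interval described above can be sketched in a few lines. The helper names here are hypothetical and not part of Keras, and drawing the factor uniformly from the interval is an assumption made for illustration:

```python
import random

def zoom_interval(zoom_range):
    """Return the (lower, upper) zoom factors for a scalar zoom_range."""
    return (1 - zoom_range, 1 + zoom_range)

def sample_zoom_factor(zoom_range, rng):
    """Draw one zoom factor from the interval (assumed uniform)."""
    lower, upper = zoom_interval(zoom_range)
    return rng.uniform(lower, upper)

rng = random.Random(42)
lower, upper = zoom_interval(0.2)
factor = sample_zoom_factor(0.2, rng)
print(f"interval: [{lower:.1f}, {upper:.1f}], sampled factor: {factor:.3f}")
```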

*other transformations that may be used later on:*
- *rotation_range*: applies a random rotation by an angle (in degrees) between -value and +value
- *width_shift_range*: applies a horizontal shift to the image
- *height_shift_range*: applies a vertical shift to the image

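<u>**flow_from_directory Parameters:**</u>

- **(first argument)**: path to the folder, which contains one subfolder per class
- **target_size**: the (height, width) that every image is resized to
- **batch_size**: how many images are in each batch for each training iteration
- **color_mode**: whether images are loaded with 1 channel ('grayscale') or 3 channels ('rgb')
- **class_mode**: determines the type of label array(s) that is/are returned ('binary' gives a single 0/1 label per image)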
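To make the effect of **batch_size** concrete, the number of batches per pass over each set can be worked out from the image counts reported above. A minimal sketch, assuming the generator also yields a final, smaller batch when the count is not a multiple of 32 (the helper name is illustrative):

```python
import math

def batches_per_epoch(num_images, batch_size):
    """Number of batches needed to cover every image once."""
    return math.ceil(num_images / batch_size)

batch_size = 32
train_batches = batches_per_epoch(5216, batch_size)
test_batches = batches_per_epoch(624, batch_size)

# Images left over for the final test batch.
last_test_batch = 624 - (test_batches - 1) * batch_size

print(train_batches, test_batches, last_test_batch)
```

With `target_size = (64,64)` and `color_mode = 'grayscale'`, each full batch of images would then be expected to have shape `(32, 64, 64, 1)`.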

#### Notes on Image Preprocessing

##### Data Augmentation: Geometric & Photometric augmentation techniques:

**Geometric**: 'Modify shape/position of the image' e.g. zooming/cropping, flipping (horizontal/vertical), rotating, shifting

**Photometric**: 'Modify appearance/color of image' e.g. altering brightness, contrast, saturation, hue

Depending on the data available, specific techniques/transformations are used in combination. In the case of this dataset (grayscale X-ray images), I will avoid ***photometric augmentation techniques***: pneumonia detection relies on detecting **gray/white interstitial patterns**, and altering parameters such as brightness and contrast could obscure these patterns or make them unrecognizable.
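A toy example shows how a brightness change can erase a faint pattern. The pixel values below are made up for illustration and do not come from the dataset, and `shift_brightness` is a hypothetical helper rather than a Keras function:

```python
import numpy as np

# Hypothetical 2x2 patch: a faint opacity (230) next to lighter tissue (200).
patch = np.array([[200, 230],
                  [230, 200]], dtype=np.uint8)

def shift_brightness(pixels, offset):
    """Add a constant offset and clip to the valid 8-bit range."""
    shifted = pixels.astype(np.int16) + offset
    return np.clip(shifted, 0, 255).astype(np.uint8)

brightened = shift_brightness(patch, 60)

# Contrast here is simply the gap between brightest and darkest pixel.
contrast_before = int(patch.max()) - int(patch.min())
contrast_after = int(brightened.max()) - int(brightened.min())

print(contrast_before, contrast_after)
```

Both values saturate at 255 after the shift, so the difference that made the pattern visible is gone.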

## 2. Preprocessing Test Set

The test set should **not be augmented** because it should reflect *real-world conditions*. If it were augmented, the model would be tested on **altered images** that aren't actual medical scans.

In [4]:
test_datagen = ImageDataGenerator(rescale=1.0/255.0)

test_set = test_datagen.flow_from_directory('data/test',
                                            target_size = (64,64), #try different sizes
                                            batch_size = 32,
                                            color_mode = 'grayscale',
                                            class_mode = 'binary')

Found 624 images belonging to 2 classes.


#### Handling the Class Imbalance: Using Class Weighting instead of Resampling techniques

The class imbalance is not too extreme, and resampling carries its own risks: oversampling can lead to overfitting, while undersampling discards valuable information. I will therefore proceed with cost-sensitive learning, applying the **inverse class frequency method** through sklearn's **compute_class_weight** function.

In [3]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

class_labels = np.array([0,1]) #(0 = Normal, 1 = Pneumonia)

class_weights = compute_class_weight(
    class_weight = "balanced",
    classes = class_labels,
    y = [0]*1341 + [1]*3875)

class_weight_dict = {i: weight for i, weight in enumerate(class_weights)}
print("Computed Class Weights:", class_weight_dict)

Computed Class Weights: {0: 1.9448173005219984, 1: 0.6730322580645162}
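The `"balanced"` setting corresponds to the inverse class frequency formula `n_samples / (n_classes * class_count)`. A minimal sketch that reproduces the weights by hand, assuming only the training counts listed in Phase 1 (the variable names are illustrative):

```python
# Training-set image counts per class (0 = Normal, 1 = Pneumonia).
class_counts = {0: 1341, 1: 3875}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Inverse class frequency: rarer classes receive larger weights.
manual_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

print("Manual class weights:", manual_weights)
```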


The CNN model to be built is trained by minimizing a **loss function** over successive epochs. With class weighting, the loss function applies **higher penalties** to misclassified samples from the *minority class* (the normal lungs). This handles the class imbalance by urging the model to pay more attention to the minority class.
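The effect of the weights on the loss can be sketched with a hand-written weighted binary cross-entropy. This assumes the weight simply multiplies each sample's loss. `weighted_bce` is a hypothetical helper for illustration, not the Keras implementation:

```python
import math

# Class weights computed above (0 = Normal, 1 = Pneumonia).
class_weight = {0: 1.9448173005219984, 1: 0.6730322580645162}

def weighted_bce(y_true, p_pred, weights, eps=1e-7):
    """Binary cross-entropy for one sample, scaled by its class weight."""
    p = min(max(p_pred, eps), 1 - eps)  # keep log() finite
    loss = -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))
    return weights[y_true] * loss

# Two equally confident mistakes, one per class.
loss_normal = weighted_bce(0, 0.8, class_weight)     # normal predicted as pneumonia
loss_pneumonia = weighted_bce(1, 0.2, class_weight)  # pneumonia predicted as normal

print(f"normal: {loss_normal:.3f}, pneumonia: {loss_pneumonia:.3f}")
```

The mistake on the normal lung costs about 2.9 times as much, matching the pneumonia-to-normal ratio in the training set. In Keras, a dictionary like `class_weight_dict` is typically passed to `model.fit` through its `class_weight` argument.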

# Phase 3: Model Development

# References:

- Data Augmentation Techniques: https://www.linkedin.com/advice/0/how-do-you-implement-data-augmentation-techniques
- Handling Class Imbalance in Image Classification: Techniques and Best Practices: https://medium.com/@okeshakarunarathne/handling-class-imbalance-in-image-classification-techniques-and-best-practices-c539214440b0
- Handling Class Imbalances using Class Weights: https://medium.com/@ravi.abhinav4/improving-class-imbalance-with-class-weights-in-machine-learning-af072fdd4aa4