<a href="https://colab.research.google.com/github/aramirezfr/Aircraft-Acquisition-Proposal/blob/master/CNN_Pneumonia_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Pneumonia Image Binary Classifier Model** \
By: Adriana Ramirez Franco. \
Email: aramirezfr20@gmail.com

# Business Understanding

**About Pneumonia:** \
Pneumonia is a serious respiratory condition that can lead to severe complications, especially if not diagnosed early. Timely and accurate diagnosis is crucial to initiate appropriate treatment and reduce morbidity and mortality, particularly among vulnerable populations like children, the elderly, and individuals with compromised immune systems.


**Why not rely only on traditional in-person diagnosis by doctors?** \
Traditional diagnosis of pneumonia relies heavily on radiologists interpreting chest X-rays, which can be time-consuming and prone to human error, especially under high workloads or in resource-limited settings.

**How can we support a professional doctor's diagnosis?** \
To address these challenges, this project aims to develop a convolutional neural network (CNN)-based binary classification model to automatically identify pneumonia from chest X-ray images. By distinguishing between normal and pneumonia-affected lungs, the model will assist healthcare professionals in making quicker, more accurate decisions. This model can improve diagnostic efficiency, alleviate the burden on radiologists, enhance patient outcomes, and provide valuable support in remote or underserved areas where access to specialized radiology expertise is limited.

## Benefits of implementing a Medical Image Classifier

**Medical image classification using machine learning** is critically important for several reasons:  

### **1. Faster Diagnosis and Treatment**  
Convolutional neural networks (CNNs) can analyze medical images much faster than humans. This reduces the time required for diagnosis, enabling quicker initiation of treatment, which is particularly crucial for conditions like pneumonia, cancer, or strokes where delays can have life-threatening consequences.

### **2. Improved Accuracy and Consistency**  
Machine learning systems can match or exceed the diagnostic accuracy of radiologists in specific tasks, as they learn from large datasets and can identify patterns that may be difficult for human experts to detect. This ensures consistency in diagnosis, reducing human errors caused by fatigue or cognitive bias.

### **3. Addressing Resource Gaps**  
In many regions, especially remote or low-resource settings, there is a shortage of radiologists and specialized healthcare professionals. Machine learning models can act as decision-support tools to help non-specialists make informed diagnoses or prioritize cases that need expert attention.

### **4. Reduced Workload for Healthcare Professionals**  
With the increasing demand for medical imaging, radiologists often have to analyze hundreds of images per day. AI systems can help pre-screen images or highlight abnormal cases, allowing radiologists to focus their expertise on the most critical cases, improving workflow efficiency.

### **5. Continuous Learning and Scalability**  
Machine learning models can continuously improve as they are trained with new data, making them adaptable to emerging medical conditions. They are also scalable, meaning once a model is developed, it can be deployed across multiple healthcare systems globally with minimal modifications.

### **6. Enabling Preventive Healthcare**  
Automated image classification can also aid in early detection of diseases that may not exhibit symptoms initially, facilitating preventive interventions. For instance, AI models used for early screening of pneumonia or lung cancer can detect subtle abnormalities that might be missed in routine examinations.

In summary, machine learning-based medical image classification is transforming healthcare by enhancing diagnostic accuracy, improving efficiency, and expanding access to quality care, making it an essential tool for modern medicine.

# Data Understanding

1. **Source and Properties of the Data**:
   - This dataset, published by Paul Mooney on Kaggle, originates from the Guangzhou Women and Children’s Medical Center. It contains labeled chest X-ray images grouped into "Pneumonia" (with bacterial and viral categories) and "Normal."
   - The images are grayscale and show clear lung structures. Because each image carries a label (pneumonia or normal), the dataset is well suited to supervised binary classification.



2. **Size of Data and Descriptive Statistics of Features**:
   - The dataset comprises 5,863 images, divided into training, validation, and test sets. The training set holds 5,216 images; the validation and test sets are much smaller.
   - Key features include image pixel intensity values, which represent lung opacity patterns. The training set contains approximately three times more pneumonia cases than normal ones, so it is clearly imbalanced.

3. **Feature Suitability**:
   The primary feature—chest X-ray images—allows for visual detection of pneumonia markers, such as lung opacity and structure irregularities. This aligns well with the objective to classify cases of pneumonia versus healthy lungs based on these medical imaging patterns.



4. **Limitations of Using This Data**:
   - *Challenges*: The dataset's class imbalance (more pneumonia cases than normal) could affect model performance. The variation in image quality and possible label inconsistencies may also introduce noise, impacting model accuracy and generalizability.
   - *Generalization Limits*: Since the dataset was sourced from a specific medical center, models trained on it might not generalize well to X-rays from different machines or patient demographics.


For further details, refer to the Kaggle dataset page: [Chest X-Ray Images (Pneumonia) on Kaggle](https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia).

# Data Preparation

I will begin by downloading the data file from Kaggle and unzipping it, and then import the libraries needed for this project.

In [None]:
#Downloading the data file from Kaggle
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia

In [None]:
#Unzip the data folder
!unzip chest-xray-pneumonia.zip -d data

In [None]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Activation
from tensorflow.keras.utils import load_img, img_to_array
from sklearn.metrics import confusion_matrix, classification_report
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

import warnings
warnings.filterwarnings('ignore')

import os

-----------------

In [None]:
#setting the directory that contains the kaggle data file
directory=os.listdir('data/chest_xray')
print(directory) #listing contents of directory

There are 3 directories inside the chest_xray folder: training, validation, and testing.

In [None]:
#setting variables to set the directories
train_dir='data/chest_xray/train'
val_dir='data/chest_xray/val'
test_dir='data/chest_xray/test'

Using the **image_dataset_from_directory** function, I will create a dataset from the images stored in each of the previous directories.
This gives 3 different datasets, one for each directory.

In [None]:
#creating datasets for all directories
train=keras.utils.image_dataset_from_directory(
    directory=train_dir,
    shuffle=True,
    labels='inferred',
    label_mode='binary',
    batch_size=32,
    image_size=(150, 150))

test=keras.utils.image_dataset_from_directory(
    directory=test_dir,
    shuffle=True,
    labels='inferred',
    label_mode='binary',
    batch_size=32,
    image_size=(150, 150))

validation=keras.utils.image_dataset_from_directory(
    directory=val_dir,
    shuffle=True,
    labels='inferred',
    label_mode='binary',
    batch_size=32,
    image_size=(150, 150))

In [None]:
#checking the name of the classes in the files
print(train.class_names)
print(test.class_names)
print(validation.class_names)

Each directory in 'chest_xray' includes a set of **"Normal"** x-rays and a set of **"Pneumonia"** x-rays.

------------------

**Training data:**

The model uses training data to learn patterns, relationships, and rules that map inputs to outputs. A well-trained model should generalize from the training data to make accurate predictions on new, unseen data. The quality and diversity of the training data significantly affect the model's ability to generalize. A model trained on imbalanced data may not generalize well to real-world scenarios where the distribution of classes is different, which can lead to poor performance in practical applications.

In [None]:
#defining directories of training images
pneumonia_dir = 'data/chest_xray/train/PNEUMONIA'
normal_dir = 'data/chest_xray/train/NORMAL'

#list files in each directory
pneumonia_files = os.listdir(pneumonia_dir)
normal_files = os.listdir(normal_dir)

#checking the quantity of images in each directory
len(pneumonia_files)+len(normal_files)

There are 5216 images to train on that belong to the subgroups **'PNEUMONIA'** and **'NORMAL'**.

-----------------------------------------------------------------------------------------------------------------------------

Taking a look to see the difference between X-rays that show lungs with pneumonia and normal healthy lungs.

In [None]:
#display the first few images from a directory side by side
def display_images(image_files, image_dir, num_images=5, title=''):
    plt.figure(figsize=(10, 5))
    plt.suptitle(title, fontsize=16)

    for i, image_name in enumerate(image_files[:num_images]):
        image_path = os.path.join(image_dir, image_name)
        img = mpimg.imread(image_path)
        plt.subplot(1, num_images, i + 1)
        plt.imshow(img, cmap='gray')  #using 'gray' for grayscale images
        plt.title(image_name)
        plt.axis('off')

#display images from PNEUMONIA class
display_images(pneumonia_files, pneumonia_dir, num_images=3, title='PNEUMONIA')

#display images from NORMAL class
display_images(normal_files, normal_dir, num_images=3, title='NORMAL')

* Of the 5,216 training images, 74% are pneumonia images and 26% are normal lung images. This demonstrates class imbalance.
* In an ideally balanced dataset, each class would have roughly the same number of instances, about 50% each for a binary classification problem.
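
The percentages above follow directly from the per-class counts. As a rough sketch, the arithmetic looks like this; the two counts are illustrative assumptions chosen to agree with the 5,216 total and the 74% / 26% split, and in the notebook they would come from `len(pneumonia_files)` and `len(normal_files)`.

```python
# Per-class image counts for the training split.
# Assumption: illustrative values consistent with the 5,216-image total
# and the 74% / 26% split quoted in the text.
counts = {"PNEUMONIA": 3875, "NORMAL": 1341}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts["PNEUMONIA"] / counts["NORMAL"]

for label, share in shares.items():
    print(f"{label}: {counts[label]} images ({share:.0%})")
print(f"Pneumonia-to-normal ratio: {imbalance_ratio:.1f} to 1")
```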

In [None]:
#checking the count of normal and x-rays with pneumonia
len(pneumonia_files), len(normal_files)

Plot the count of the images in the training directory to check for imbalance.

In [None]:
#plotting the count of pneumonia and normal class
# create a variable with the counts:
pneumonia_count = len(pneumonia_files)
normal_count = len(normal_files)

#define the labels and their corresponding counts
labels = ['Pneumonia', 'Normal']
counts = [pneumonia_count, normal_count]

#plot
plt.figure(figsize=(5, 5))
plt.bar(labels, counts) #labels on the x-axis and counts on the y-axis
plt.ylabel('Count')
plt.title('Class Imbalance')
plt.show()

## Data Augmentation:

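Because the training set is heavily imbalanced, I will use ImageDataGenerator to create randomly transformed versions of the training images. Augmentation increases the variety of images the model sees and helps prevent overfitting. As applied here it transforms both classes in the same way, so it does not change the class ratio by itself.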
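
One common complement to augmentation, which this notebook does not currently use, is to weight the loss by class frequency. The sketch below computes such weights with the usual `total / (n_classes * class_count)` heuristic. The helper name `balanced_class_weights` is hypothetical and the counts are illustrative assumptions.

```python
def balanced_class_weights(counts):
    """Return weights inversely proportional to class frequency.

    Uses the common heuristic total / (n_classes * class_count).
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Keys follow the alphabetical class indices that flow_from_directory assigns:
# 0 = NORMAL, 1 = PNEUMONIA. The counts are illustrative assumptions.
class_weight = balanced_class_weights({0: 1341, 1: 3875})
print(class_weight)
```

The resulting dictionary could then be passed to Keras as `model.fit(..., class_weight=class_weight)`, so that errors on the rarer normal class count for more during training.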

In [None]:
#setting parameters
img_size=(100,100)
SHAPE=(100,100,3)
batch_size=32

In [None]:
#setting the data generator for training (with augmentation)
datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
                             rotation_range=20,
                             width_shift_range=0.2,
                             height_shift_range=0.2,
                             shear_range=0.2,
                             zoom_range=0.2,
                             horizontal_flip=False,
                             fill_mode='nearest')

#setting the data generator for validation and test (preprocessing only, no augmentation)
test_datagen=ImageDataGenerator(preprocessing_function=preprocess_input)

In [None]:
#applying the data generators
#training set: augmented and shuffled
train_set=datagen.flow_from_directory(train_dir,
                                      class_mode='binary',
                                      target_size=img_size,
                                      batch_size=batch_size,
                                      #shuffle=False,
                                      seed=42)

#validation set: not augmented, so validation metrics are comparable across epochs
val_set=test_datagen.flow_from_directory(val_dir,
                                      class_mode='binary',
                                      target_size=img_size,
                                      batch_size=batch_size,
                                      shuffle=False,
                                      seed=42)
test_set=test_datagen.flow_from_directory(test_dir,
                                      class_mode='binary',
                                      target_size=img_size,
                                      batch_size=batch_size,
                                      shuffle=False,
                                      seed=42)

From here on, the data is read through the **ImageDataGenerator** iterators, so the models use train_set, val_set and test_set instead of the raw directories (train_dir, val_dir, test_dir).

# Modeling:

**Model building** is an iterative process that moves from a baseline model to more complex models, with the rationale for each change revisited on every iteration.\
I will start with a baseline model that will be used for comparison.
Each later iteration is justified by the results of the models that came before it.
The goal is to find the model that gives the best result for this binary classifier.

## Baseline model:

In [None]:
#setting metrics for all the models to be trained
METRICS=['accuracy',
         tf.keras.metrics.Precision(name='precision'),
         tf.keras.metrics.Recall(name='recall')]

In [None]:
#setting up the baseline model
base_model = models.Sequential([

    #first convolutional layer
    layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(100,100, 3)),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),

    #second convolutional layer
    layers.Conv2D(64, kernel_size=(3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),

    #flatten the feature maps
    layers.Flatten(),

    #fully connected layer
    layers.Dense(128, activation='relu'),

    #output layer with a single neuron for binary classification
    layers.Dense(1, activation='sigmoid')])
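
To see where most of the baseline's parameters sit, the layer sizes can be worked out by hand. The sketch below assumes the Keras defaults used above ('valid' padding, stride 1 for the convolutions, pooling stride equal to the pool size); the helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after max pooling with stride equal to the pool size."""
    return size // pool

size = 100                       # input is 100 x 100 x 3
size = pool_out(conv_out(size))  # conv with 32 filters -> 98, pool -> 49
size = pool_out(conv_out(size))  # conv with 64 filters -> 47, pool -> 23
flat = size * size * 64          # length of the flattened feature vector

# weights + biases per layer
params = {
    "conv1": 3 * 3 * 3 * 32 + 32,
    "conv2": 3 * 3 * 32 * 64 + 64,
    "dense": flat * 128 + 128,
    "output": 128 * 1 + 1,
}
print(size, flat, sum(params.values()))
```

Nearly all of the parameters belong to the first dense layer, which is one reason the later iterations add convolution and pooling stages before flattening. The total can be checked against `base_model.summary()`.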

In [None]:
#compile the model
base_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=METRICS)

**Early Stopping:**

Early stopping is used to prevent overfitting and improve the training process: training halts once the monitored validation metric stops improving, which also saves time and computational resources.
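
The stopping rule itself is simple. The following is a simplified sketch of the patience logic, not the Keras implementation; the function name `early_stopping_epoch` and the score values are made up for illustration.

```python
def early_stopping_epoch(val_scores, patience):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without
    improving on the best score seen so far (higher is better).
    Returns len(val_scores) if the stop condition never triggers.
    """
    best = float("-inf")
    waited = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_scores)

scores = [0.70, 0.78, 0.81, 0.80, 0.79, 0.81, 0.80]  # made-up validation accuracies
print(early_stopping_epoch(scores, patience=3))
```

With these values the best score is reached at epoch 3 and training stops at epoch 6, after three epochs without improvement. A patience at least as large as the number of epochs never triggers.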

In [None]:
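#defining callbacks for early stopping and for saving the best model
#the monitored name must match the compiled metric: 'accuracy' -> 'val_accuracy'
#note: with patience=10 and epochs=10, early stopping cannot trigger before training ends
early_stopping=[EarlyStopping(monitor='val_accuracy', patience=10),
                ModelCheckpoint(filepath='best_model.keras',
                                monitor='val_accuracy',
                                save_best_only=True)]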

In [None]:
base_model.summary()

In [None]:
#len() of a generator is already the number of batches per pass over the data,
#so it must not be divided by batch_size again
steps_per_epoch=len(train_set)

#number of batches needed to cover the validation set
validation_steps=len(val_set)
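
As a quick check on the batch arithmetic, the number of steps per epoch is the sample count divided by the batch size, rounded up. This is a small sketch with a hypothetical helper name, `steps_for`.

```python
import math

def steps_for(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

print(steps_for(5216, 32))  # the 5,216 training images at batch size 32
print(steps_for(100, 32))   # a partial final batch still counts as a step
```

For the training set this gives 163 steps, which is what `len()` of the training generator should report.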

In [None]:
#train base model
base_model_hist = base_model.fit(train_set,
                                 steps_per_epoch=steps_per_epoch,
                                 epochs=10,
                                 callbacks=early_stopping,
                                 validation_data=val_set,
                                 shuffle=False)

In [None]:
#evaluate base model on the test set
results_base_test=base_model.evaluate(test_set)
results_base_test

In [None]:
#unpacking values
basetest_loss, basetest_accuracy, basetest_precision, basetest_recall = results_base_test

#print the result values
print(f"Test accuracy: {basetest_accuracy:.2f}")
print(f"Test precision: {basetest_precision:.2f}")
print(f"Test recall: {basetest_recall:.2f}")

Plotting the training metrics per epoch is useful for several reasons:

**Visual Diagnosis:**
* Detecting overfitting/underfitting: By plotting these metrics, you can visually assess whether your model is overfitting or underfitting. Overfitting is indicated by a large gap between training and validation metrics, where training performance is much better than validation performance. Underfitting is suggested when both metrics are poor.
* Training stability: You can see whether the training process is stable or the model's performance is volatile, which might point to problems with the learning rate or data quality.

**Focusing on Precision and Recall:**
* Precision: Important when the cost of false positives is high. A precision plot shows how well the model maintains this during training and validation.
* Recall: Crucial when missing positive instances is costly. Plotting recall helps confirm the model is identifying positive cases throughout training.

**Hyperparameter Tuning:** By visualizing these metrics, you can better understand how different hyperparameters (e.g., batch size, learning rate) affect model performance. This facilitates more informed decisions during hyperparameter tuning.
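
For reference, precision and recall reduce to simple counts over the predictions. The sketch below uses made-up labels, with 1 standing for pneumonia and 0 for normal; the function name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# made-up labels: 1 = pneumonia, 0 = normal
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here one false positive lowers precision and one missed pneumonia case lowers recall; in a screening setting the missed case is usually the more costly error.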

In [None]:
#plot precision and recall
#(train)reflect how well the model is learning the patterns in the training data
plt.plot(base_model_hist.history['precision'], label='Train Precision')

#(validation)model's performance on unseen data
plt.plot(base_model_hist.history['val_precision'], label='Validation Precision')
plt.plot(base_model_hist.history['recall'], label='Train Recall')
plt.plot(base_model_hist.history['val_recall'], label='Validation Recall')

plt.xlabel('Epoch')
plt.ylabel('Score')
plt.title('Precision and Recall over Epochs')
plt.legend()
plt.show()

In [None]:
#plot the training history
plt.plot(base_model_hist.history['accuracy'], label='train accuracy')
plt.plot(base_model_hist.history['val_accuracy'], label='val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

## Testing Model 1:

In [None]:
#Adding a third layer and dropout layer to compare results with base model
test1_model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(100, 100, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    #extra third layer
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    #adding dropout layer
    Dropout(0.5),  #to prevent overfitting
    Dense(1, activation='sigmoid')  #output layer for binary classification
])

In [None]:
#compile the model
test1_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=METRICS)

In [None]:
#train the test1_model
test1_history = test1_model.fit(train_set,
                                 steps_per_epoch=steps_per_epoch,
                                 epochs=10,
                                 callbacks=early_stopping,
                                 validation_data=val_set,
                                 shuffle=False)

In [None]:
test1_results= test1_model.evaluate(test_set)
print("Evaluation results:", test1_results)

In [None]:
test1_loss, test1_accuracy, test1_precision, test1_recall = test1_results

# Print the results
print(f"Test accuracy: {test1_accuracy:.2f}")
print(f"Test precision: {test1_precision:.2f}")
print(f"Test recall: {test1_recall:.2f}")

In [None]:
#plot the training history
plt.plot(test1_history.history['accuracy'], label='train accuracy')
plt.plot(test1_history.history['val_accuracy'], label='val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

## Testing Model 2 (main model trial):

In [None]:
test2_model = models.Sequential([
    #input layer (32 filters, 3x3 kernel size, "relu" activation, input shape 100x100x3 to match the image generator target size)
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(100, 100, 3)),

    #layer with 2x2 pool size
    layers.MaxPooling2D((2, 2)),

    #64 filters
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    layers.Flatten(), #flatten layer

    #Dense layer with 512 neurons
    layers.Dense(512, activation='relu'),#this dense layer input matches the flattened output

    layers.BatchNormalization(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.1),
    layers.BatchNormalization(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.2),
    layers.BatchNormalization(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.2),
    layers.BatchNormalization(),
    #output layer: dense layer with a single sigmoid neuron
    layers.Dense(1, activation='sigmoid')  #for binary classification
])


In [None]:
test2_model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=METRICS)

In [None]:
#print summary model of architecture
test2_model.summary()

In [None]:
#train the model
test2_history = test2_model.fit(train_set,
                                 steps_per_epoch=steps_per_epoch,
                                 epochs=10,
                                 callbacks=early_stopping,
                                 validation_data=val_set,
                                 shuffle=False)


In [None]:
test2_results= test2_model.evaluate(test_set)
print("Evaluation results:", test2_results)

In [None]:
test2_loss, test2_accuracy, test2_precision, test2_recall = test2_results

# Print the results
print(f"Test accuracy: {test2_accuracy:.2f}")
print(f"Test precision: {test2_precision:.2f}")
print(f"Test recall: {test2_recall:.2f}")

In [None]:
#plot the training history
plt.plot(test2_history.history['accuracy'], label='train accuracy')
plt.plot(test2_history.history['val_accuracy'], label='val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

---------------------

## Train the model:

In [None]:
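#assumption: the main model trial above (test2_model) is the model carried forward
model = test2_model

#train on the generator output (100x100 images, matching the model input)
#and validate on the validation set so the test set stays unseen until evaluation
history = model.fit(train_set,
                    epochs=10,
                    validation_data=val_set)

# Evaluation:

In [None]:
#checking metrics on the training set (the generator already defines the batch size)
results_train = model.evaluate(train_set)
results_train

In [None]:
results_test = model.evaluate(test_set)
results_test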

In [None]:
pd.DataFrame(history.history).plot(figsize=(10,10))


Create a confusion matrix to analyze the performance of the classification model on the test data.

In [None]:
#true labels, in the order the generator yields them (test_set uses shuffle=False)
y_test = test_set.classes

#predicted probabilities, rounded to 0/1 class labels
y_pred = model.predict(test_set)
y_pred_classes = np.round(y_pred).astype(int).ravel()

#generate confusion matrix
cm = confusion_matrix(y_test, y_pred_classes)

#plot
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Predicted Negative', 'Predicted Positive'],
            yticklabels=['Actual Negative', 'Actual Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()


In [None]:
#Create a classification report.
report = classification_report(y_test, y_pred_classes)
print("Classification Report:\n", report)

# Summary:

## Results:

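Still to be reported once the final training run is complete:

* Training process: number of epochs run.
* Time per epoch.
* Total training time.
* Final results on the test set.

Confusion matrix outcomes to examine:

* True positives
* True negatives
* False positives
* False negatives (the error to minimize)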