<a href="https://colab.research.google.com/github/dR0ski/MIT-Food-CNN/blob/main/Reference_Notebook_Malaria_Detection_Full_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Malaria Detection**

# Problem Definition for Malaria Detection Using CNN

## Context
Malaria is a severe public health issue caused by Plasmodium parasites, which are transmitted through the bites of infected female Anopheles mosquitoes. The disease predominantly affects tropical and subtropical areas. Prompt and accurate diagnosis is essential for effective treatment and control, and it directly reduces the mortality and morbidity associated with malaria.

## Objectives
The primary objective is to develop a Convolutional Neural Network (CNN) model that can automatically classify microscopic images of red blood cells into two categories: parasitized and uninfected. This deep learning approach aims to improve the speed and accuracy of malaria detection, thus facilitating quicker clinical decision-making and enabling timely and appropriate treatment interventions.

## Key Questions
1. How can we effectively train a CNN model to classify red blood cells as parasitized or uninfected with high accuracy using the provided image data?
2. What are the optimal architectures and hyperparameters for the CNN model suited for this task?
3. How can the CNN model be integrated seamlessly into existing healthcare infrastructures to offer real-time diagnostic support?
4. What are the expected improvements in diagnosis time and accuracy when using the CNN model compared to traditional diagnostic methods?
5. How can the model be adapted and scaled to operate efficiently across different regions, especially where healthcare resources are limited?

## Problem Formulation
The challenge involves using data science and deep learning to create a predictive CNN model that accurately classifies images of red blood cells as either parasitized or uninfected. This involves:
- Preprocessing and augmenting image data to train the CNN effectively.
- Experimenting with different CNN architectures to identify the most effective model for recognizing patterns indicative of malaria parasites.
- Validating the model's performance through rigorous testing to ensure high sensitivity and specificity.
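The sensitivity and specificity named in the last point can be computed directly from true and predicted labels. The sketch below is illustrative only: the function name, the label encoding (0 = parasitized as the positive class, 1 = uninfected, following the folder order used later in the notebook), and the sample labels are assumptions rather than part of the original notebook.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred, positive=0):
    """Return (sensitivity, specificity) for binary labels.

    Assumes `positive` marks the parasitized class (0 here, by assumption).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    is_pos_true = y_true == positive
    is_pos_pred = y_pred == positive
    tp = np.sum(is_pos_true & is_pos_pred)    # parasitized cells caught
    fn = np.sum(is_pos_true & ~is_pos_pred)   # parasitized cells missed
    tn = np.sum(~is_pos_true & ~is_pos_pred)  # uninfected cells cleared
    fp = np.sum(~is_pos_true & is_pos_pred)   # uninfected cells flagged
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"Sensitivity: {sens:.2f}, Specificity: {spec:.2f}")
```

For a screening task like this one, sensitivity is usually the metric to watch most closely, since a missed parasitized cell is more costly than a false alarm.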

## Impact on Population, Hospitals, and Economy
**Population:** Enhanced diagnostic techniques will lead to more effective malaria control, reducing both the prevalence and spread of the disease, improving overall public health outcomes.

**Hospitals:** By automating malaria diagnostics, hospitals can manage their resources better and improve patient care efficiency, particularly in high-burden areas.

**Economy:** A decrease in malaria prevalence can significantly affect economic stability by reducing disease-induced absenteeism and preserving valuable human resources. Improved health outcomes can further contribute to economic growth and stability.

Incorporating a CNN-based approach in malaria diagnostics promises substantial improvements across health, societal, and economic sectors, especially in regions most affected by the disease.

## <b>Data Description </b>

The dataset contains a total of 24,958 training images and 2,600 test images, all in color, taken from microscopic images. The images fall into the following categories:<br>


**Parasitized:** The parasitized cells contain the Plasmodium parasite which causes malaria<br>
**Uninfected:** The uninfected cells are free of the Plasmodium parasites<br>



### <b>Mount the Drive</b>

In [None]:
#------------------------------------------------------------------------------
# Reusable Variables Declarations
#------------------------------------------------------------------------------
nl = "\n"
ln = "--------------------------------------------------------------------------"
tb = "\t"


In [None]:
#------------------------------------------------------------------------------
# Mounting Google drive / Google.Colab
#------------------------------------------------------------------------------
print("---> Commence Loading Google Drive")
from google.colab import drive
drive.mount('/content/drive')
print("---> Google Drive Mounted Successfully")

---> Commence Loading Google Drive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
---> Google Drive Mounted Successfully


In [None]:
#------------------------------------------------------------------------------
# Store the directory path for the cell_images.zip
#------------------------------------------------------------------------------
pathZip = '/content/drive/MyDrive/Capstone/cell_images.zip'

#------------------------------------------------------------------------------
# Store the directory path for the file to be unzipped to
#------------------------------------------------------------------------------
pathToExtract = '/content/drive/MyDrive/Capstone/'

### <b>Loading libraries</b>

In [None]:
# ------------------------------------------------------------------------------
# Imports the required data and ML libraries
# ------------------------------------------------------------------------------
print(f"{nl}{ln}{nl}> Commence importation of all required data & ML libraries{nl}{ln}{nl}")
import numpy as np
import matplotlib.pyplot as plt
import os
import zipfile
import pandas as pd
import seaborn as sns
import cv2
import tensorflow as tf
import keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D, BatchNormalization, Activation, Input, LeakyReLU, Dropout, Flatten
from tensorflow.keras import backend
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam, SGD, RMSprop
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras import losses, optimizers
from tensorflow.keras.preprocessing.image import load_img # opens image file from directory
from google.colab.patches import cv2_imshow
print(f"{nl}{ln}{nl}> Python Library Import Completed{nl}{ln}{nl}")


--------------------------------------------------------------------------
> Commence importation of all required data & ML libraries
--------------------------------------------------------------------------


--------------------------------------------------------------------------
> Python Library Import Completed
--------------------------------------------------------------------------



### <b>Let us load the data</b>

**Note:**
- You must download the dataset from the link provided on Olympus and upload the same to your Google Drive. Then unzip the folder.

In [None]:
# Unzips file
# ------------------------------------------------------------------------------
with zipfile.ZipFile(pathZip, 'r') as zip_ref:
    zip_ref.extractall(pathToExtract)

In [None]:
# ------------------------------------------------------------------------------
# LOADING DATA : TRAINING & TESTING DATA SET
# ------------------------------------------------------------------------------
# Brute force approach for loading data
# ------------------------------------------------------------------------------
print(f"{nl}{ln}{nl}> Commence loading of data {nl}{ln}{nl}")

DATADIR = '/content/drive/MyDrive/Capstone/cell_images/train' # Path to training data set
DATADIR_test = '/content/drive/MyDrive/Capstone/cell_images/test' # Path to testing data set
CATEGORIES = ['parasitized', 'uninfected'] # Folders within the data set under the training folder
IMG_SIZE = 150 # Set the image size so that the system doesn't run out of memory



--------------------------------------------------------------------------
> Commence loading of data 
--------------------------------------------------------------------------



The extracted folder contains separate folders for the train and test data. Each of them holds images of varying sizes for parasitized and uninfected cells, stored in a subfolder named after the class.

The size of all images must be the same and should be converted to 4D arrays so that they can be used as an input for the convolutional neural network. Also, we need to create the labels for both types of images to be able to train and test the model.

Let's do the same for the training data first and then we will use the same code for the test data as well.
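To make the "4D array" requirement concrete, the sketch below stacks a handful of equally sized images into one array of shape (N, height, width, channels) and builds a matching label array. The zero-filled images and the label encoding are stand-ins chosen for illustration; real images would come from the resizing step.

```python
import numpy as np

IMG_SIZE = 150  # same value the notebook uses

# Stand-ins for already-resized images; real ones would come from cv2.resize
images = [np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8) for _ in range(4)]
labels = [0, 0, 1, 1]  # assumed encoding: 0 = parasitized, 1 = uninfected

x = np.stack(images)  # shape (N, H, W, C), the 4D input a CNN expects
y = np.array(labels)  # shape (N,), one label per image
print(x.shape, y.shape)
```

Stacking only works when every image has the same shape, which is why resizing has to happen first.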

In [None]:
# ------------------------------------------------------------------------------
# LOADING DATA
# ------------------------------------------------------------------------------
# Brute force approach for loading a data set (used for both train and test)
# ------------------------------------------------------------------------------
def create_training_data(directory):
  print(f"{nl}{ln}{nl}> Leveraging brute force approach to loading data sets. {nl}{ln}{nl}")
  t_data = []
  for category in CATEGORIES:
    path = os.path.join(directory, category)
    class_num = CATEGORIES.index(category) # Numeric label: 0 = parasitized, 1 = uninfected

    for img in os.listdir(path):
      img_array = cv2.imread(os.path.join(path, img)) # Read each image file; OpenCV returns BGR channel order
      if img_array is None: # cv2.imread returns None for files it cannot decode, so skip them
        continue
      new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE)) # Resize each image to IMG_SIZE x IMG_SIZE pixels to save memory
      t_data.append([new_array, class_num]) # Append the image and its label to the 't_data' Python list
  print(f"{nl}{ln}{nl}> Data loading & resizing completed.{nl}{ln}{nl}")
  return t_data

In [None]:
# ------------------------------------------------------------------------------
# LOADING DATA : TRAINING DATA SET
# ------------------------------------------------------------------------------
training_data = create_training_data(DATADIR)

Brute force approach for loading training data
--------------------------------------------------------------------------



In [None]:
# ------------------------------------------------------------------------------
# LOADING DATA : TESTING DATA SET
# ------------------------------------------------------------------------------
testing_data = create_training_data(DATADIR_test)


--------------------------------------------------------------------------
> Leveraging brute force approach to loading data sets. 
--------------------------------------------------------------------------


--------------------------------------------------------------------------
> Data loading & resizing completed.
--------------------------------------------------------------------------



### <b>Check the shape of train and test images</b>

In [None]:
# ------------------------------------------------------------------------------
# Data Shape : TRAINING DATA SET
# ------------------------------------------------------------------------------
training_data[0][0].shape

(150, 150, 3)

In [None]:
# ------------------------------------------------------------------------------
# Data Shape : TESTING DATA SET
# ------------------------------------------------------------------------------
testing_data[0][0].shape

(150, 150, 3)

### <b>Check the shape of train and test labels</b>

In [None]:
# ------------------------------------------------------------------------------
# Load Image Filenames : DATA SET AGNOSTIC
# ------------------------------------------------------------------------------
def loading_labels(data_dir, cat_num):
  # ----------------------------------------------------------------------------
  # Stores the names of the image files for one category into a Python list
  # ----------------------------------------------------------------------------
  labels = os.listdir(os.path.join(data_dir, CATEGORIES[cat_num]))
  # ----------------------------------------------------------------------------
  # Prints the length of the list, i.e. the number of files in the category
  # ----------------------------------------------------------------------------
  print(f"{nl}{ln}{nl}> List Length: {len(labels)}{nl}{ln}{nl}{nl}")

  # ----------------------------------------------------------------------------
  # Selects nine (9) random image filenames, without repeats,
  # and returns them as a numpy array
  # ----------------------------------------------------------------------------
  load_numpy = np.random.choice(labels, 9, replace=False)
  return load_numpy


In [6]:
print(f"{nl}{nl}Parasitized Count")
select_train_parasitized = loading_labels(DATADIR, 0)


In [None]:
print(f"{nl}{nl}Uninfected Count (Test Set)")
select_test_uninfected = loading_labels(DATADIR_test, 1)



Uninfected Count (Test Set)

--------------------------------------------------------------------------
> List Length: 1300
--------------------------------------------------------------------------




In [None]:
# ------------------------------------------------------------------------------
# SPLITTING THE DATA SET FUNCTION
# ------------------------------------------------------------------------------
# Function that takes a list of [image, label] pairs as input
# ------------------------------------------------------------------------------

def split_data_along_axis(data_pairs):
  """
  Shuffles a list of [image, label] pairs in place and splits it into
  an image array ("x_axis") and a label array ("y_axis").
  """
  # ----------------------------------------------------------------------------
  # Shuffle so that the classes are no longer grouped by folder order
  # ----------------------------------------------------------------------------
  np.random.shuffle(data_pairs)

  # ----------------------------------------------------------------------------
  # Loads images to the x axis & labels to the y axis
  # ----------------------------------------------------------------------------
  x_axis_values = [] # Images
  y_axis_values = [] # Labels
  for img, label in data_pairs:
    x_axis_values.append(img)
    y_axis_values.append(label)

  # ----------------------------------------------------------------------------
  # Errors are left to propagate: swallowing them here would return None
  # and make the dictionary lookups in the calling cells fail instead
  # ----------------------------------------------------------------------------
  return {"x_axis": np.array(x_axis_values), "y_axis": np.array(y_axis_values)}


In [None]:
# ------------------------------------------------------------------------------
# SPLITTING THE TRAINING DATA SET
# ------------------------------------------------------------------------------
t_data = split_data_along_axis(training_data)
x_train = t_data["x_axis"]
y_train = t_data["y_axis"]

print(f"Training Data X Axis Shape: {x_train.shape}{nl}Training Data Y Axis Shape: {y_train.shape}")


In [None]:
# ------------------------------------------------------------------------------
# SPLITTING THE TEST DATA SET
# ------------------------------------------------------------------------------
tst_data = split_data_along_axis(testing_data)
x_tst = tst_data["x_axis"]
y_tst = tst_data["y_axis"]

print(f"Testing Data X Axis Shape: {x_tst.shape}{nl}Testing Data Y Axis Shape: {y_tst.shape}")


Testing Data X Axis Shape: (2600, 150, 150, 3)
Testing Data Y Axis Shape: (2600,)


#### <b>Observations and insights: _____</b>


### <b>Check the minimum and maximum range of pixel values for train and test images</b>
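A quick way to check the pixel range is to call `min()` and `max()` on the image array before and after scaling. The sketch below uses a small random array as a stand-in for the image data, so the variable names and values are assumptions for illustration only.

```python
import numpy as np

# Stand-in for the image array before normalization: random 8-bit images
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(8, 150, 150, 3), dtype=np.uint8)

print("dtype:", x.dtype, "min:", x.min(), "max:", x.max())

# Scaling to [0, 1]; float32 uses half the memory of NumPy's default float64
x_scaled = x.astype("float32") / 255.0
print("dtype:", x_scaled.dtype, "min:", x_scaled.min(), "max:", x_scaled.max())
```

For 8-bit images the raw range is at most 0 to 255, so dividing by 255 maps it into 0 to 1.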

In [None]:
# ------------------------------------------------------------------------------
# Scale the pixel values in the training array from the 0-255 range to 0-1
# float32 keeps the result at half the memory of NumPy's default float64
# ------------------------------------------------------------------------------
x_train = x_train.astype("float32") / 255.0

In [None]:
# ------------------------------------------------------------------------------
# Scale the pixel values in the test array from the 0-255 range to 0-1
# float32 keeps the result at half the memory of NumPy's default float64
# ------------------------------------------------------------------------------
x_tst = x_tst.astype("float32") / 255.0


#### <b>Observations and insights: _____</b>



### <b>Count the number of values in both uninfected and parasitized</b>
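One possible way to count the images per class is `np.unique` with `return_counts=True`. The label array below is made up, and the integer encoding (index into `CATEGORIES`) is an assumption; with string labels the same call works, only the lookup into `CATEGORIES` would be dropped.

```python
import numpy as np

CATEGORIES = ['parasitized', 'uninfected']
y = np.array([0, 1, 1, 0, 0, 1, 0, 1])  # made-up labels, not the real data

# np.unique returns the sorted distinct labels and how often each occurs
values, counts = np.unique(y, return_counts=True)
for value, count in zip(values, counts):
    print(f"{CATEGORIES[value]}: {count}")
```

Comparing the two counts shows at a glance whether the classes are balanced.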

### <b>Normalize the images</b>

#### <b>Observations and insights: _____</b>

### <b>Plot to check if the data is balanced</b>

#### <b>Observations and insights: _____</b>

### <b>Data Exploration</b>
Let's visualize the images from the train data

#### <b>Observations and insights: _____</b>

### <b>Visualize the images with subplot(6, 6) and figsize = (12, 12)</b>

In [None]:
# ------------------------------------------------------------------------------
# VISUALIZING DATA: TRAINING DATA SET
# ------------------------------------------------------------------------------
# Function that takes a numpy array of image filenames as input
# ------------------------------------------------------------------------------
def visualize_data_plt_figure(image_category, dir_position_num, data_dir):
  """
  Plots the images whose filenames are listed in image_category.
  dir_position_num is the index into CATEGORIES; data_dir is the data set path.
  """
  fig = plt.figure(figsize=(12,12))
  for i, file_name in enumerate(image_category):
    ax = fig.add_subplot(6,6, i+1) # 6x6 grid, so at most 36 images fit
    fp = os.path.join(data_dir, CATEGORIES[dir_position_num], file_name)

    fn = load_img(fp, target_size = (IMG_SIZE, IMG_SIZE))
    ax.imshow(fn)
    ax.axis('off')
  plt.show()





#### <b>Observations and insights:</b>

### <b>Plotting the mean images for parasitized and uninfected</b>

<b>Mean image for parasitized</b>

<b>Mean image for uninfected</b>
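A mean image is the pixel-wise average over all images of one class. The sketch below shows one way to compute it with boolean masks; the random images and the label encoding (0 = parasitized, 1 = uninfected) are assumptions used only so the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((6, 150, 150, 3)).astype("float32")  # stand-in for normalized images
y = np.array([0, 0, 0, 1, 1, 1])                    # assumed integer labels

# Select the images of each class with a boolean mask, then average over axis 0
mean_parasitized = x[y == 0].mean(axis=0)
mean_uninfected = x[y == 1].mean(axis=0)
print(mean_parasitized.shape, mean_uninfected.shape)
```

Each result has the shape of a single image, so it can be passed to `plt.imshow` like any other image.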

#### <b>Observations and insights: _____</b>

### <b>Converting RGB to HSV of Images using OpenCV</b>

### <b>Converting the train data</b>

### <b>Converting the test data</b>

#### <b>Observations and insights: _____</b>

### <b>Processing Images using Gaussian Blurring</b>

### <b>Gaussian Blurring on train data</b>

### <b>Gaussian Blurring on test data</b>

#### **Observations and insights: _____**

**Think About It:** Would blurring help us for this problem statement in any way? What else can we try?

## **Model Building**

### **Base Model**

**Note:** The Base Model has been fully built and evaluated with all outputs shown to give an idea about the process of the creation and evaluation of the performance of a CNN architecture. A similar process can be followed in iterating to build better-performing CNN architectures.

### <b>Importing the required libraries for building and training our Model</b>

#### <b>One Hot Encoding the train and test labels</b>
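One-hot encoding turns each integer label into a vector with a single 1 at the position of its class. The notebook imports `to_categorical` from `tensorflow.keras.utils`, which should give the same result for integer labels; the NumPy version below is a swapped-in equivalent so the sketch carries no TensorFlow dependency. The labels are made up.

```python
import numpy as np

y = np.array([0, 1, 1, 0])  # made-up integer labels
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i
y_one_hot = np.eye(num_classes, dtype="float32")[y]
print(y_one_hot)
```

This step needs integer labels; string labels such as 'parasitized' would have to be mapped to integers first.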

### <b>Building the model</b>
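One possible shape for a baseline CNN is sketched below. The number of layers, the filter counts, the dropout rate, and the optimizer are illustrative assumptions rather than the reference solution; only the input size of 150 and the two output classes follow from the notebook.

```python
import tensorflow as tf
from tensorflow.keras import layers

IMG_SIZE = 150  # matches the notebook's image size

# Hypothetical baseline: layer counts and filter sizes are illustrative choices
model = tf.keras.Sequential([
    tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(64, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(2, activation="softmax"),  # one unit per class
])

# categorical_crossentropy expects one-hot labels of shape (N, 2)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

A two-unit softmax output pairs with one-hot labels; a single sigmoid unit with `binary_crossentropy` would be an equally valid design for a two-class problem.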

In [None]:
#

### <b>Compiling the model</b>

<b>Using Callbacks</b>

<b>Fit and train our Model</b>
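The sketch below shows how a callback plugs into `model.fit`. To keep it quick and self-contained it trains a deliberately tiny model on random data, so the data, the architecture, and the values for `patience`, `epochs`, and `batch_size` are all assumptions; only the `EarlyStopping` callback itself is one the notebook imports.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

# Random stand-in data so the sketch runs anywhere; not real cell images
rng = np.random.default_rng(0)
x = rng.random((32, 32, 32, 3)).astype("float32")
y = np.eye(2, dtype="float32")[rng.integers(0, 2, size=32)]  # one-hot labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Stop once validation loss has not improved for 2 epochs and keep the best weights
early_stop = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)

history = model.fit(x, y, validation_split=0.25, epochs=5, batch_size=8,
                    callbacks=[early_stop], verbose=0)
print(sorted(history.history.keys()))
```

The returned `history.history` dictionary holds the per-epoch training and validation metrics, which is what the train and validation curves are plotted from.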

### <b>Evaluating the model on test data</b>

<b>Plotting the confusion matrix</b>

<b>Plotting the train and validation curves</b>

Now let's build another model with a few additional layers and check whether it improves on the base model. Add layers where needed and try different activation functions.

### <b>Model 1</b>
#### <b>Trying to improve the performance of our model by adding new layers</b>


### <b>Building the Model</b>

### <b>Compiling the model</b>

<b>Using Callbacks</b>

<b>Fit and Train the model</b>

### <b>Evaluating the model</b>

<b>Plotting the confusion matrix</b>

<b>Plotting the train and the validation curves</b>

### <b>Think about it:</b><br>
Now let's build a model with LeakyReLU as the activation function.

*  Can the model performance be improved if we change our activation function to LeakyReLU?
*  Can BatchNormalization improve our model?

Let us try to build a model using BatchNormalization and using LeakyReLU as our activation function.

### <b>Model 2 with Batch Normalization</b>
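A sketch of how BatchNormalization and LeakyReLU might be combined in a convolutional block is given below. The helper name `conv_block`, the ordering (convolution, then normalization, then activation), and the filter counts are assumptions for illustration; other orderings are also used in practice.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(filters):
    """Conv -> BatchNorm -> LeakyReLU -> MaxPool (an illustrative ordering)."""
    return [
        layers.Conv2D(filters, kernel_size=3, padding="same"),  # no built-in activation
        layers.BatchNormalization(),
        layers.LeakyReLU(),  # default negative slope
        layers.MaxPooling2D(pool_size=2),
    ]

model = tf.keras.Sequential(
    [tf.keras.Input(shape=(150, 150, 3))]
    + conv_block(32)
    + conv_block(64)
    + [layers.Flatten(), layers.Dense(2, activation="softmax")]
)
model.summary()
```

LeakyReLU is added as its own layer here so that the normalization can sit between the convolution and the activation.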

### <b>Building the Model</b>

### <b>Compiling the model</b>

<b>Using callbacks</b>

<b>Fit and train the model</b>

<b>Plotting the train and validation accuracy</b>

### <b>Evaluating the model</b>

#### <b>Observations and insights: ____</b>

<b>Generate the classification report and confusion matrix</b>

### **Think About It:**<br>

* Can we improve the model with Image Data Augmentation?
* References to image data augmentation can be seen below:
  *   [Image Augmentation for Computer Vision](https://www.mygreatlearning.com/blog/understanding-data-augmentation/)
  *   [How to Configure Image Data Augmentation in Keras?](https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/)





### <b>Model 3 with Data Augmentation</b>

### <b>Use image data generator</b>
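The idea behind augmentation is to feed the model randomly transformed copies of the training images. The sketch below uses Keras preprocessing layers (`RandomFlip`, `RandomRotation`, `RandomZoom`) as a swapped-in alternative to `ImageDataGenerator`, because they need no extra dependencies; the transformation strengths and the random stand-in batch are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Swapped-in technique: preprocessing layers instead of ImageDataGenerator
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.1),  # up to +/- 10% of a full turn (36 degrees)
    layers.RandomZoom(0.1),      # zoom in or out by up to 10%
])

rng = np.random.default_rng(0)
batch = rng.random((4, 150, 150, 3)).astype("float32")  # stand-in images

# training=True is needed; these layers pass images through unchanged at inference
augmented = augment(batch, training=True)
print(augmented.shape)
```

Flips and rotations are plausible choices for cell images because a blood cell has no canonical orientation.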

### **Think About It:**<br>

*  Check if the performance of the model can be improved by changing different parameters in the ImageDataGenerator.



#### <b>Visualizing Augmented images</b>

#### <b>Observations and insights: ____</b>

### <b>Building the Model</b>

<b>Using Callbacks</b>

<b>Fit and Train the model</b>

### <b>Evaluating the model</b>

<b>Plot the train and validation accuracy</b>

<b>Plotting the classification report and confusion matrix</b>

<b>Now, let us try to use a pretrained model like VGG16 and check how it performs on our data.</b>

### **Pre-trained model (VGG16)**
- Import the VGG16 network up to any layer you choose
- Add Fully Connected Layers on top of it
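The two steps above can be sketched as follows. The cut-off layer (`block5_pool`), the size of the dense layer, and the dropout rate are illustrative assumptions. The sketch passes `weights=None` only so that it runs without downloading anything; for actual transfer learning, `weights="imagenet"` would load the pretrained filters.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

# weights=None avoids a download in this sketch; weights="imagenet" loads pretrained filters
base = VGG16(weights=None, include_top=False, input_shape=(150, 150, 3))
base.trainable = False  # freeze the convolutional base

# Cut the network at a chosen layer; block5_pool is the last pooling layer
features = base.get_layer("block5_pool").output

# Fully connected layers on top of the frozen base
x = layers.Flatten()(features)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(2, activation="softmax")(x)

model = Model(inputs=base.input, outputs=outputs)
print(model.output_shape)
```

Cutting at an earlier layer, such as `block3_pool`, keeps more generic low-level features and gives a smaller model, which may be worth trying on images that look very different from ImageNet photos.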

### <b>Compiling the model</b>

<b>Using callbacks</b>

<b>Fit and Train the model</b>

<b>Plot the train and validation accuracy</b>

### **Observations and insights: _____**

*   What can be observed from the validation and train curves?

### <b>Evaluating the model</b>

<b>Plotting the classification report and confusion matrix</b>

### <b>Think about it:</b>
*  What observations and insights can be drawn from the confusion matrix and classification report?
*  Choose the model with the best accuracy scores from all the above models and save it as a final model.


#### <b>Observations and Conclusions drawn from the final model: _____</b>



**Improvements that can be done:**<br>


*  Can the model performance be improved using other pre-trained models or different CNN architecture?
*  You can try to build a model using these HSV images and compare them with your other models.

#### **Insights**

#### **Refined insights**:
- What are the most meaningful insights from the data relevant to the problem?

#### **Comparison of various techniques and their relative performance**:
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

#### **Proposal for the final solution design**:
- What model do you propose to be adopted? Why is this the best solution to adopt?