<a href="https://colab.research.google.com/github/ThabangIsaac1/Malaria/blob/main/final_model_malaria_detection_in_red_blood_cells_using_cnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Malaria Detection In Red Blood Cells Using Convolutional Neural Networks**

Malaria is one of the most severe public health problems worldwide. It is caused by tiny Plasmodium parasites, which are transmitted to humans through the bites of mosquitoes acting as vectors and then infect red blood cells, leading to adverse health effects. Manual examination of microscope slides by specialists is slow and tedious, which has delayed global responses and contributed to numerous deaths and infections. This project presents a proposed solution inspired by the work of Laxmi Kant, a machine learning enthusiast.

Improvements to Laxmi Kant's model include:


1.   Code Cleaning
2.   Accuracy improvements through a new proposed convolutional layer, a VGG19 model, image augmentation, and transfer learning.
3.   Model visualisations and analytics

Laxmi Kant's repo: https://github.com/laxmimerit/Malaria-Classification-Using-CNN/blob/master/TensorFlow_2_0_Tutorial_for_Beginners_15_Malaria_Parasite_Detection_Using_CNN.ipynb


Improvements will be clarified in the documentation.





In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from IPython.display import display, Image
display(Image(filename='/content/drive/MyDrive/Colab Notebooks/redblood.jpg'))

The dataset used in this project is provided by the researchers at the Lister Hill National Center for Biomedical Communications (LHNCBC), part of the National Library of Medicine (NLM).




# **Library Imports of All Needed Dependencies**






In [None]:
import os
import glob
import numpy as np
import pandas as pnd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from collections import Counter
import cv2
from concurrent import futures
import threading
import matplotlib.pyplot as mplt
%matplotlib inline
import tensorflow as tf
import datetime
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, BatchNormalization, Dropout
from keras.models import Sequential
from keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import roc_curve, auc 
from sklearn import metrics




# **Red Blood Cell Image Dataset Import & Summarization**

In [None]:
#Load the red blood cell image dataset from the directories into two variables, categorised as infected and uninfected cells.

main_folder = os.path.join('/content/drive/My Drive/Colab Notebooks/Red_Blood_Dataset/cell_images')
infected_dir = os.path.join(main_folder,'Infected')
healthy_dir = os.path.join(main_folder,'Uninfected')

infected_files = glob.glob(infected_dir+'/*.png')
healthy_files = glob.glob(healthy_dir+'/*.png')
len(infected_files), len(healthy_files)

In [None]:
#View the first 5 images of Red blood cells in the dataset and also explore infected and non infected cells.

np.random.seed(42)

dataset_df = pnd.DataFrame({
    'filename': infected_files + healthy_files,
    'label': ['malaria'] * len(infected_files) + ['healthy'] * len(healthy_files)
}).sample(frac=1, random_state=42).reset_index(drop=True)

dataset_df.head()

# **Split Dataset into Training, Validation, and Testing Data**

In [None]:
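#Prepare data to be used for training, validation, and testing.
#test_size=0.3 holds out 30% for testing; test_size=0.1 then takes 10% of the remaining 70% for validation.
#The effective split is therefore roughly 63:7:30 (train:validation:test), not 60:10:30.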

training_files, testing_files, training_labels, testing_labels = train_test_split(dataset_df['filename'].values,
                                                                      dataset_df['label'].values,
                                                                      test_size=0.3, random_state=42)
training_files, validation_files, training_labels, validation_labels = train_test_split(training_files,
                                                                    training_labels,
                                                                    test_size=0.1, random_state=42)

print(training_files.shape, validation_files.shape, testing_files.shape)
print('Training:', Counter(training_labels), '\nValidation:', Counter(validation_labels), '\nTesting:', Counter(testing_labels))
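
Because the second `train_test_split` call takes its 10% from the data left over after the test split, the effective proportions are not round numbers. The arithmetic below is a small sketch of that calculation; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Fractions passed as test_size in the two split calls.
test_fraction = 0.3          # first call: held out for testing
val_fraction_of_rest = 0.1   # second call: taken from what remains

remaining = 1.0 - test_fraction
val_fraction = remaining * val_fraction_of_rest
train_fraction = remaining - val_fraction

print(f"train={train_fraction:.2f}, val={val_fraction:.2f}, test={test_fraction:.2f}")
```

To get an exact 60:10:30 split instead, the second call would need `test_size=1/7`, since 10% of the whole is one seventh of the remaining 70%.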

# **Statistics and Summarization of Training Data**

**Visualise image dimensions and sizes to reach consensus on the best standard scale to use**

In [None]:
#Get summary statistics of image dimensions and orientations

def get_image_shape_parallel(keyx, img, total_imgs):
    if keyx % 5000 == 0 or keyx == (total_imgs - 1):
        print('{}: working on the image num: {}'.format(threading.current_thread().name,
                                                  keyx))
    return cv2.imread(img).shape
 
ex = futures.ThreadPoolExecutor(max_workers=None)
traindata_inp = [(keyx, img, len(training_files)) for keyx, img in enumerate(training_files)]
print('Initializing Image shape computation:')
training_img_dimensions_map = ex.map(get_image_shape_parallel,
                            [record[0] for record in traindata_inp],
                            [record[1] for record in traindata_inp],
                            [record[2] for record in traindata_inp])

train_images_dimensions = list(training_img_dimensions_map)
print('Minimum Dimensions:', np.min(train_images_dimensions, axis=0))
print('Average Dimensions:', np.mean(train_images_dimensions, axis=0))
print('Median Dimensions:', np.median(train_images_dimensions, axis=0))
print('Maximum Dimensions:', np.max(train_images_dimensions, axis=0))


# **Image Resize and Parallel Processing Application**



1.   Parallel Processing applied to speed up image-reading operations.
2.   Based on the summary statistics, images will be scaled to standard dimensions of 125 x 125 pixels.
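
The parallel-read pattern from point 1 can be shown in isolation. This is a minimal sketch using only the standard library: `describe` is a hypothetical stand-in for an I/O-bound task such as reading an image, and the file names are made up.

```python
from concurrent import futures

def describe(idx, name, total):
    # Stand-in for an I/O-bound task such as reading an image from disk.
    return f"{idx + 1}/{total}: {name}"

names = ["a.png", "b.png", "c.png"]  # hypothetical file names
indices = range(len(names))
totals = [len(names)] * len(names)

# The context manager shuts the pool down once all tasks have finished.
with futures.ThreadPoolExecutor(max_workers=4) as pool:
    # map() takes one iterable per positional argument and yields results
    # in input order, regardless of which thread finishes first.
    results = list(pool.map(describe, indices, names, totals))

print(results)
```

Because `map()` preserves input order, the loaded images stay aligned with their labels even though the reads run concurrently.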



In [None]:
IMAGE_DIMENSIONS = (125, 125)

def get_parallel_imagedata(idx, img, total_imgs):
    if idx % 5000 == 0 or idx == (total_imgs - 1):
        print('{}: working on img num: {}'.format(threading.current_thread().name,
                                                  idx))
    img = cv2.imread(img)
    img = cv2.resize(img, dsize=IMAGE_DIMENSIONS,
                     interpolation=cv2.INTER_CUBIC)
    img = np.array(img, dtype=np.float32)
    return img

ex = futures.ThreadPoolExecutor(max_workers=None)
training_data_inp = [(idx, img, len(training_files)) for idx, img in enumerate(training_files)]
validation_data_inp = [(idx, img, len(validation_files)) for idx, img in enumerate(validation_files)]
testing_data_inp = [(idx, img, len(testing_files)) for idx, img in enumerate(testing_files)]

print('Unpack Training Images:')
training_data_map = ex.map(get_parallel_imagedata,
                        [record[0] for record in training_data_inp],
                        [record[1] for record in training_data_inp],
                        [record[2] for record in training_data_inp])
training_data = np.array(list(training_data_map))

print('\nUnpack Validation Images:')
validation_data_map = ex.map(get_parallel_imagedata,
                        [record[0] for record in validation_data_inp],
                        [record[1] for record in validation_data_inp],
                        [record[2] for record in validation_data_inp])
validation_data = np.array(list(validation_data_map))

print('\nUnpack Test Images:')
testing_data_map = ex.map(get_parallel_imagedata,
                        [record[0] for record in testing_data_inp],
                        [record[1] for record in testing_data_inp],
                        [record[2] for record in testing_data_inp])
testing_data = np.array(list(testing_data_map))

training_data.shape, validation_data.shape, testing_data.shape
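
Loading every image into one dense `float32` array is convenient but memory-hungry, so it can be worth estimating the footprint before running the cell. The helper below is an illustrative sketch; `float32_array_bytes` is a hypothetical name, and the image count is arbitrary rather than the size of this dataset.

```python
def float32_array_bytes(num_images, height=125, width=125, channels=3):
    """Bytes needed for a dense float32 array of shape (N, H, W, C)."""
    bytes_per_value = 4  # float32
    return num_images * height * width * channels * bytes_per_value

# 10,000 is an arbitrary illustrative count, not the size of this dataset.
size_bytes = float32_array_bytes(10_000)
print(f"{size_bytes / 1e9:.3f} GB")
```

Dividing such an array by 255 creates a second array of the same size, which is one reason the augmentation generators later in the notebook rescale batch by batch.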

# **Dataset Visualization**



1.   Randomly visualizing sample datasets after resizing



In [None]:
mplt.figure(1 , figsize = (8 , 8))
n = 0
for i in range(16):
    n += 1
    r = np.random.randint(0 , training_data.shape[0] , 1)
    mplt.subplot(4 , 4 , n)
    mplt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
    mplt.imshow(training_data[r[0]]/255.)
    mplt.title('{}'.format(training_labels[r[0]]))
    mplt.xticks([]) , mplt.yticks([])

# **Configurations Setup**



1.   Encoding the categorical class labels
2.   Setting Batch size and epochs
3.   Implementing Tensorflow 



In [None]:
NUM_CLASSES = 2


train_imgs_scaled = training_data / 255.
val_imgs_scaled = validation_data / 255.

# Encode the text class labels as integers.

le = LabelEncoder()
le.fit(training_labels)
training_labels_enc = le.transform(training_labels)
val_labels_enc = le.transform(validation_labels)

print(training_labels[:6], training_labels_enc[:6])
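
`LabelEncoder` assigns integers according to the sorted order of the class names, which determines what a sigmoid output close to 1 means later on. The sketch below mimics that mapping in plain Python rather than calling scikit-learn. The label list and the `decode` helper are illustrative assumptions.

```python
labels = ["malaria", "healthy", "healthy", "malaria"]  # illustrative labels

# LabelEncoder stores the sorted unique classes and maps each to its index.
classes = sorted(set(labels))
to_index = {name: i for i, name in enumerate(classes)}
encoded = [to_index[name] for name in labels]

print(classes)
print(encoded)

def decode(probability, threshold=0.5):
    # Reverse step for a sigmoid output: threshold, then look up the class.
    return classes[1 if probability > threshold else 0]

print(decode(0.9), decode(0.1))
```

With these two class names, `healthy` becomes 0 and `malaria` becomes 1, so the model's output can be read as the predicted probability of malaria.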


In [None]:
tf.random.set_seed(42)
tf.__version__


# **Model Design**

My CNN model below is similar to Laxmi's proposed model, but with major improvements.
___________________________
Laxmi's Model Dimensions:
Total params: 406,625
Trainable params: 406,625
Non-trainable params: 0
________________________


1.   The model will be trained accordingly.
2.   After training the model will be refined using transfer learning to improve the accuracy levels of the model.



In [None]:
inp = tf.keras.layers.Input(shape=(125, 125, 3))

conv1 = tf.keras.layers.Conv2D(32, kernel_size=(3, 3),
                               activation='relu', padding='same')(inp)
pool1 = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(conv1)
conv2 = tf.keras.layers.Conv2D(64, kernel_size=(3, 3),
                               activation='relu', padding='same')(pool1)
pool2 = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(conv2)
conv3 = tf.keras.layers.Conv2D(128, kernel_size=(3, 3),
                               activation='relu', padding='same')(pool2)
pool3 = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(conv3)

flat = tf.keras.layers.Flatten()(pool3)

hidden1 = tf.keras.layers.Dense(512, activation='relu')(flat)
drop1 = tf.keras.layers.Dropout(rate=0.3)(hidden1)
hidden2 = tf.keras.layers.Dense(512, activation='relu')(drop1)
drop2 = tf.keras.layers.Dropout(rate=0.3)(hidden2)

out = tf.keras.layers.Dense(1, activation='sigmoid')(drop2)

model = tf.keras.Model(inputs=inp, outputs=out)
model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
model.summary()
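
The parameter count of this architecture can be worked out by hand, which makes for a direct comparison with the 406,625 parameters quoted for Laxmi's model. The sketch below assumes the Keras defaults used in the cell: `'same'` padding on the convolutions and `'valid'` padding on the 2x2 max-pooling layers. The helper names are illustrative.

```python
def conv2d_params(kernel, in_channels, filters):
    # One kernel x kernel x in_channels weight block per filter, plus one bias each.
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    # One weight per input-unit pair, plus one bias per unit.
    return inputs * units + units

side = 125
for _ in range(3):
    side //= 2  # each 2x2 max-pool with 'valid' padding floors the halved size

flat = side * side * 128  # length of the flattened feature vector

total = (
    conv2d_params(3, 3, 32)
    + conv2d_params(3, 32, 64)
    + conv2d_params(3, 64, 128)
    + dense_params(flat, 512)
    + dense_params(512, 512)
    + dense_params(512, 1)
)
print(side, flat, total)
```

Almost all of the parameters sit in the first dense layer after `Flatten`, which is worth keeping in mind when looking at the overfitting discussed later.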

In [None]:
# TensorBoard logs
logdir = os.path.join('/content/drive/MyDrive/Colab Notebooks/Red_Blood_Dataset/Tensorlogs',
                      datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))

In [None]:
#train model
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=2, min_lr=0.000001)
callbacks = [reduce_lr, tensorboard_callback]

history = model.fit(x=train_imgs_scaled, y=training_labels_enc,
                    batch_size=64,
                    epochs=10,
                    validation_data=(val_imgs_scaled, val_labels_enc),
                    callbacks=callbacks,
                    verbose=1)
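
The effect of `ReduceLROnPlateau` with `factor=0.5` and `patience=2` can be seen in a simplified simulation. This is a sketch, not the Keras implementation: it ignores the cooldown option, assumes Adam's default starting rate of 0.001 and the callback's default `min_delta` of 1e-4, and uses made-up validation losses.

```python
def simulate_reduce_lr(val_losses, lr=1e-3, factor=0.5, patience=2,
                       min_lr=1e-6, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # halve, but never below min_lr
                wait = 0
        schedule.append(lr)
    return schedule

# Hypothetical validation losses, for illustration only.
schedule = simulate_reduce_lr([0.50, 0.40, 0.41, 0.42, 0.43, 0.44])
print(schedule)
```

In this made-up run the rate is halved after the fourth epoch and again after the sixth, because validation loss stops improving after epoch 2.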

# **Plotting the training, validation accuracy and loss curves.**

**Noted Issues**



1.   Validation accuracy of 96.0% 
2.   Model Overfitting based on the 99.8% training accuracy

The proposed malaria detection model is promising but faces the challenge of overfitting. This means that noise or random fluctuations in the training data are picked up and learned as concepts by the model. These concepts do not apply to new data, which reduces the model's ability to generalize.
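
One simple way to quantify the overfitting is the gap between training and validation accuracy, together with the epoch at which validation loss was lowest. The helpers below are an illustrative sketch with hypothetical names. The accuracy figures are the ones reported above, while the list of validation losses is invented purely to demonstrate the second helper.

```python
def accuracy_gap(train_acc, val_acc):
    """Gap between training and validation accuracy, in percentage points."""
    return (train_acc - val_acc) * 100

def best_epoch(val_losses):
    """1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

gap = accuracy_gap(0.998, 0.960)  # accuracies reported above
print(f"gap: {gap:.1f} points")

# Hypothetical validation losses, only to show the helper; not real results.
print(best_epoch([0.30, 0.18, 0.15, 0.17, 0.21]))
```

With a real run, `history.history['val_loss']` could be passed to `best_epoch` to see where training might reasonably have stopped.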


**The graphs below provide a statistical point of view.**



In [None]:
#Graphs for model performance

f, (ax1, ax2) = mplt.subplots(1, 2, figsize=(12, 4))
t = f.suptitle('CNN Model Performance', fontsize=12)
f.subplots_adjust(top=0.85, wspace=0.3)

max_epoch = len(history.history['accuracy'])+1
epoch_list = list(range(1,max_epoch))
ax1.plot(epoch_list, history.history['accuracy'], label='Train Accuracy')
ax1.plot(epoch_list, history.history['val_accuracy'], label='Validation Accuracy')
ax1.set_xticks(np.arange(1, max_epoch, 5))
ax1.set_ylabel('Accuracy Value')
ax1.set_xlabel('Epoch')
ax1.set_title('Accuracy')
l1 = ax1.legend(loc="best")

ax2.plot(epoch_list, history.history['loss'], label='Train Loss')
ax2.plot(epoch_list, history.history['val_loss'], label='Validation Loss')
ax2.set_xticks(np.arange(1, max_epoch, 5))
ax2.set_ylabel('Loss Value')
ax2.set_xlabel('Epoch')
ax2.set_title('Loss')
l2 = ax2.legend(loc="best")

# **Implementing Transfer Learning & Augmentation**

Transfer learning is a deep learning concept of taking the knowledge of a pretrained model and adapting it to another domain for a target task.

Our CNN model is promising, but its accuracy is not yet satisfying. It does better than Laxmi Kant's model, which had an accuracy of 93.5%, whereas ours reached 96.0% validation accuracy during training. However, our model experienced overfitting.

To solve this, we are going to use image augmentation to improve our model.
Steps


1.   Save Our Current model
2.   Create image augmentors to fine tune our model
3.   Implement transfer learning and train our new model.




In [None]:
#Save our first model
model.save('ModelOne_cnn.h5')

In [None]:
#Building Image Augmentors
training_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255,
                                                                zoom_range=0.05,
                                                                rotation_range=25,
                                                                width_shift_range=0.05,
                                                                height_shift_range=0.05,
                                                                shear_range=0.05, horizontal_flip=True,
                                                                fill_mode='nearest')

validation_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)


training_generator = training_datagen.flow(training_data, training_labels_enc, batch_size=64, shuffle=True)
validation_generator = validation_datagen.flow(validation_data, val_labels_enc, batch_size=64, shuffle=False)

In [None]:
#Results from a batch of image augmentation transforms

sample_idx = 0  # avoid shadowing the built-in id()
generated_sample = training_datagen.flow(training_data[sample_idx:sample_idx+1], training_labels[sample_idx:sample_idx+1],
                                      batch_size=1)
sample = [next(generated_sample) for i in range(0,5)]
fig, ax = mplt.subplots(1,5, figsize=(16, 6))
print('Labels:', [item[1][0] for item in sample])
l = [ax[i].imshow(sample[i][0][0]) for i in range(0,5)]

VGG concept 

VGG19 was considered because it has been pretrained on ImageNet, a huge dataset with a vast range of image categories. It has therefore already learned a hierarchy of features that are largely spatially, rotationally, and translationally invariant, as CNN models tend to learn. After the model is built, it will be trained on the augmented training dataset.

In [None]:
vggmodel = tf.keras.applications.vgg19.VGG19(include_top=False, weights='imagenet', 
                                        input_shape=((125, 125, 3)))
# Fine-tune only the layers from block4_conv1 onward; all earlier layers are frozen by the loop below
vggmodel.trainable = True

set_trainable = False
for layer in vggmodel.layers:
    if layer.name in ['block5_conv1', 'block4_conv1']:
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False
    
base_vgg = vggmodel
base_out = base_vgg.output
pool_out = tf.keras.layers.Flatten()(base_out)
hidden1 = tf.keras.layers.Dense(512, activation='relu')(pool_out)
drop1 = tf.keras.layers.Dropout(rate=0.3)(hidden1)
hidden2 = tf.keras.layers.Dense(512, activation='relu')(drop1)
drop2 = tf.keras.layers.Dropout(rate=0.3)(hidden2)

out = tf.keras.layers.Dense(1, activation='sigmoid')(drop2)

model = tf.keras.Model(inputs=base_vgg.input, outputs=out)
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-5),  # 'lr' is the deprecated argument name
                loss='binary_crossentropy',
                metrics=['accuracy'])

print("Total Layers:", len(model.layers))
print("Total trainable layers:", sum([1 for l in model.layers if l.trainable]))
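
The freezing loop can be traced without loading any weights by running the same logic over the layer names of the VGG19 convolutional base. This is a sketch under stated assumptions: the block and layer names follow the Keras VGG19 naming scheme, and `"input"` is a placeholder because the input layer's name varies between Keras versions.

```python
# Layer names of VGG19 without the top classifier (include_top=False).
# "input" is a placeholder; the real input layer name depends on the Keras version.
vgg19_layers = ["input"]
for block, n_convs in enumerate([2, 2, 4, 4, 4], start=1):
    vgg19_layers += [f"block{block}_conv{i}" for i in range(1, n_convs + 1)]
    vgg19_layers.append(f"block{block}_pool")

trainable = {}
unfreeze = False
for name in vgg19_layers:
    if name == "block4_conv1":
        unfreeze = True  # everything from here on is fine-tuned
    trainable[name] = unfreeze

frozen = [n for n, t in trainable.items() if not t]
tuned = [n for n, t in trainable.items() if t]
print(len(vgg19_layers), len(frozen), len(tuned))
```

Blocks 1 to 3 stay frozen and keep their ImageNet features, while blocks 4 and 5 are adapted to the cell images. The pooling layers are flagged as trainable too, but they have no weights to update.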

In [None]:
#train model
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=2, min_lr=0.000001)

callbacks = [reduce_lr, tensorboard_callback]
train_steps_per_epoch = training_generator.n // training_generator.batch_size
val_steps_per_epoch = validation_generator.n // validation_generator.batch_size
history = model.fit(training_generator, steps_per_epoch=train_steps_per_epoch, epochs=20,
                              validation_data=validation_generator, validation_steps=val_steps_per_epoch,
                              callbacks=callbacks,  # pass the callbacks defined above, which were otherwise unused
                              verbose=1)
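
The step counts above use floor division, which requests one batch fewer whenever the sample count is not a multiple of the batch size. The sketch below compares this with rounding up. The `steps` helper and the sample count are hypothetical; only the batch size of 64 comes from the generators.

```python
import math

def steps(n_samples, batch_size, keep_partial_batch):
    """Number of batches per pass, with or without the final partial batch."""
    if keep_partial_batch:
        return math.ceil(n_samples / batch_size)
    return n_samples // batch_size

n, batch = 1000, 64  # hypothetical sample count; 64 matches the generators
floor_steps = steps(n, batch, False)
ceil_steps = steps(n, batch, True)
print(floor_steps, ceil_steps, n - floor_steps * batch)
```

For validation in particular, rounding up may be preferable so that every validation image contributes to the reported metrics.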

In [None]:
#Visualising model training accuracies and losses

f, (ax1, ax2) = mplt.subplots(1, 2, figsize=(12, 4))
t = f.suptitle('VGG Performance Stats', fontsize=12)
f.subplots_adjust(top=0.85, wspace=0.3)

max_epoch = len(history.history['accuracy'])+1
epoch_list = list(range(1,max_epoch))
ax1.plot(epoch_list, history.history['accuracy'], label='Train Accuracy')
ax1.plot(epoch_list, history.history['val_accuracy'], label='Validation Accuracy')
ax1.set_xticks(np.arange(1, max_epoch, 5))
ax1.set_ylabel('Accuracy Value')
ax1.set_xlabel('Epoch')
ax1.set_title('Accuracy')
l1 = ax1.legend(loc="best")

ax2.plot(epoch_list, history.history['loss'], label='Train Loss')
ax2.plot(epoch_list, history.history['val_loss'], label='Validation Loss')
ax2.set_xticks(np.arange(1, max_epoch, 5))
ax2.set_ylabel('Loss Value')
ax2.set_xlabel('Epoch')
ax2.set_title('Loss')
l2 = ax2.legend(loc="best")

In [None]:
#Save final fine tuned model
model.save('final_model.h5')

# **Model Evaluations**


1.   Test model prediction accuracies using Test Dataset


In [None]:
test_images_scaled = testing_data / 255.
test_images_scaled.shape, testing_labels.shape

In [None]:
# Initialize saved models

model1 = tf.keras.models.load_model('./ModelOne_cnn.h5')
finalmodel = tf.keras.models.load_model('./final_model.h5')

# Make Predictions on Test Data
model1_cnn_preds = model1.predict(test_images_scaled, batch_size=512)
model2_preds = finalmodel.predict(test_images_scaled, batch_size=512)

model1_cnn_pred_labels = le.inverse_transform([1 if pred > 0.5 else 0
                                                  for pred in model1_cnn_preds.ravel()])
main_model2_pred_labels = le.inverse_transform([1 if pred > 0.5 else 0
                                                  for pred in model2_preds.ravel()])

In [None]:
def get_metrics(true_labels, predicted_labels):
    
    return {
        'Accuracy': np.round(
                        metrics.accuracy_score(true_labels, 
                                               predicted_labels),
                        4),
        'Precision': np.round(
                        metrics.precision_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        4),
        'Recall': np.round(
                        metrics.recall_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        4),
        'F1 Score': np.round(
                        metrics.f1_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        4)
    }
    
#Evaluate model using 4 metrics

modelone_performance = get_metrics(true_labels=testing_labels, predicted_labels=model1_cnn_pred_labels)

finalvgg_performance = get_metrics(true_labels=testing_labels, predicted_labels=main_model2_pred_labels)

pnd.DataFrame([modelone_performance, finalvgg_performance],
             index=['First Model CNN', 'Fine Tuned VGG Model'])
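
The `average='weighted'` option used above computes each metric per class and then averages the results weighted by class support. The sketch below reproduces that calculation in plain Python instead of calling scikit-learn. The function name and the toy labels are illustrative and are not results from this notebook.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted precision and recall over all classes in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = 0.0
    for cls, n_cls in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / n_cls
        weight = n_cls / total
        precision += weight * cls_precision
        recall += weight * cls_recall
    return precision, recall

# Toy labels for illustration: 4 malaria and 6 healthy cells.
y_true = ["malaria"] * 4 + ["healthy"] * 6
y_pred = ["malaria"] * 3 + ["healthy"] * 6 + ["malaria"]

w_precision, w_recall = weighted_scores(y_true, y_pred)
print(round(w_precision, 4), round(w_recall, 4))
```

Weighted recall always equals plain accuracy, because the class weights cancel the per-class denominators. The Accuracy and Recall columns in the table above should therefore match, and per-class recall for the malaria class may be the more informative figure for a screening task.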