# In this Notebook, we will work on the [**C-NMC_Leukemia** Dataset](https://www.kaggle.com/datasets/avk256/cnmc-leukemia/data)  

##### Note: This notebook is configured to run on Kaggle, so the file paths and directories are Kaggle-specific. Running the code cells elsewhere won't work unless the dataset is available _locally_ and the file paths are adjusted to match your system's directories.  

This dataset consists entirely of images of blood samples from patients and non-patients. Every image is an input sample of $450 \times 450$ pixels, which is a manageable size for `Convolutional Neural Networks`.  

First, let's import necessary libraries:  

In [8]:
import kagglehub
import os
import sys
import shutil
import numpy as np
import tensorflow as tf
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import absl.logging

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, BatchNormalization, Conv2D, MaxPooling2D, Flatten, Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report, confusion_matrix


path = kagglehub.dataset_download("avk256/cnmc-leukemia")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/cnmc-leukemia


In [9]:
# Silence stderr and TensorFlow/absl log messages to keep the notebook output clean
sys.stderr = open('/dev/null', 'w')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
absl.logging.set_verbosity(absl.logging.ERROR)
tf.debugging.set_log_device_placement(False)

## 1- Data Preprocessing  

It's important to know the structure of the dataset: _you can view it by enabling the sidebar from the `view` menu_


The training data (the `fold_0`, `fold_1` and `fold_2` directories) is split further below into _train_, _validation_ and _test_ sets, all of which are used during the **training phase** to train and test the model. The data in the `validation_data` directory, in contrast, is meant for the final model evaluation that estimates the model's real-world performance.  

The following cell consolidates all data samples from the three folds into one dataset and stores it in a new directory, `output_dir`, containing the 2 class labels `all` and `hem`, which correspond to **True** and **False** respectively.
> #### all --> Acute Lymphoblastic Leukemia, hem --> healthy.
> #### True --> Has cancer, False --> No Cancer. 

In [10]:
output_dir = "/kaggle/working/unified_dataset"
fold_dirs = ['fold_0/fold_0', 'fold_1/fold_1', 'fold_2/fold_2']

os.makedirs(os.path.join(output_dir, "all"), exist_ok=True)
os.makedirs(os.path.join(output_dir, "hem"), exist_ok=True)

for fold in fold_dirs:
    fold_path = os.path.join(path, fold)
    
    for label in ["all", "hem"]:
        source_dir = os.path.join(fold_path, label)
        target_dir = os.path.join(output_dir, label)
        
        for img_file in os.listdir(source_dir):
            if img_file.endswith(".bmp"): 
                src_path = os.path.join(source_dir, img_file)
                dest_path = os.path.join(target_dir, img_file)
                
                if not os.path.exists(dest_path):
                    shutil.copy(src_path, dest_path)
                    
print("All images consolidated into unified directories!")

All images consolidated into unified directories!


And now, let's build a DataFrame that maps every image to its target label, based on the folder the image is in:  

In [11]:
image_paths, labels = [], []

for label_dir in ["all", "hem"]:  
    class_dir = os.path.join(output_dir, label_dir)
    label = label_dir 
    
    for img_file in os.listdir(class_dir):
        if img_file.endswith(".bmp"):  # Ensuring only image files are included
            image_paths.append(os.path.join(class_dir, img_file))
            labels.append(label)

data = pd.DataFrame({"image_path": image_paths, "label": labels})

The goal of converting the data into a `pd.DataFrame` is to be able to manipulate it easily and to use the `train_test_split` function from the scikit-learn library.  
We use it to split the data twice: first into train/test, then the train portion into train/validation, as follows:  

In [12]:
from sklearn.model_selection import train_test_split

train_val_data, test_data = train_test_split(data, test_size=0.2, 
                                             stratify=data["label"], random_state=42)
train_data, val_data = train_test_split(train_val_data, test_size=0.25, 
                                        stratify=train_val_data["label"], random_state=42)

print("Training data:", train_data["label"].value_counts())
print("Validation data:", val_data["label"].value_counts())
print("Testing data:", test_data["label"].value_counts())

Training data: label
all    4363
hem    2033
Name: count, dtype: int64
Validation data: label
all    1454
hem     678
Name: count, dtype: int64
Testing data: label
all    1455
hem     678
Name: count, dtype: int64
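
As a quick sanity check on the proportions: holding out 20% for testing and then 25% of the remaining 80% for validation should give roughly a 60/20/20 split, with the same class balance in each part thanks to `stratify`. The sketch below redoes that arithmetic using the counts printed above; the variable names are only illustrative.

```python
# Counts copied from the split output above.
counts = {
    "train": {"all": 4363, "hem": 2033},
    "val": {"all": 1454, "hem": 678},
    "test": {"all": 1455, "hem": 678},
}

total = sum(sum(c.values()) for c in counts.values())

# test_size=0.2 on the full set, then test_size=0.25 on the remaining 80%.
test_frac = 0.2
val_frac = (1 - test_frac) * 0.25
train_frac = 1 - test_frac - val_frac

# Share of the whole dataset in each split, and share of ALL images per split.
split_share = {name: sum(c.values()) / total for name, c in counts.items()}
all_share = {name: c["all"] / sum(c.values()) for name, c in counts.items()}

print(f"Total images: {total}")
print(f"Expected: train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
for name in counts:
    print(f"{name}: {split_share[name]:.3f} of the data, ALL share {all_share[name]:.3f}")
```

About 68% of the images in every split belong to the `all` class, so the classes are imbalanced, which is worth keeping in mind when reading accuracy figures later.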


We need to resize the input samples to $224\times224$, since many pre-trained CNN models, _if we end up using one of them_, work with this input size. We also want to _augment_ the data, as this increases its diversity and helps the model generalize well.  
We can do both using the `ImageDataGenerator` class from `tensorflow.keras.preprocessing.image`.  

In [13]:
datagen = ImageDataGenerator(rescale=1.0/255, width_shift_range=0.05, height_shift_range=0.05, 
                             shear_range=0.05, horizontal_flip=True, fill_mode='nearest')
test_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = datagen.flow_from_dataframe(dataframe=train_data, x_col="image_path", y_col="label", 
                                              target_size=(224, 224), batch_size=32, class_mode="binary")

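# Validation and test images are only rescaled (no augmentation), so metrics reflect unmodified data
val_generator = test_datagen.flow_from_dataframe(dataframe=val_data, x_col="image_path", y_col="label", 
                                                 target_size=(224, 224), batch_size=32, class_mode="binary")

# shuffle=False keeps predictions aligned with test_generator.classes
test_generator = test_datagen.flow_from_dataframe(dataframe=test_data, x_col="image_path", y_col="label", 
                                                  target_size=(224, 224), batch_size=32, class_mode="binary", shuffle=False)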

print("Class indices:", train_generator.class_indices)

Found 6396 validated image filenames belonging to 2 classes.
Found 2132 validated image filenames belonging to 2 classes.
Found 2133 validated image filenames belonging to 2 classes.
Class indices: {'all': 0, 'hem': 1}
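
Note that the generator assigns indices in alphanumeric order of the class names, so `all` is encoded as `0` and `hem` as `1`. With `class_mode="binary"` and a single sigmoid output, the predicted value is therefore read as the probability of class `1`, i.e. `hem`. Here is a minimal sketch of decoding such probabilities back into label names, assuming a 0.5 threshold; the helper `decode_predictions` and the probabilities are made up for illustration.

```python
class_indices = {"all": 0, "hem": 1}  # as printed by the generator above

# Invert the mapping so an index can be turned back into a class name.
index_to_class = {index: name for name, index in class_indices.items()}


def decode_predictions(probabilities, threshold=0.5):
    """Map sigmoid outputs (probability of class 1) to class names."""
    return [index_to_class[int(p > threshold)] for p in probabilities]


example_probs = [0.03, 0.48, 0.51, 0.97]  # made-up values for illustration
print(decode_predictions(example_probs))
```

In other words, a low output means the model leans towards `all` (leukemia) and a high output means it leans towards `hem` (healthy).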


In [15]:
train_generator

<keras.src.legacy.preprocessing.image.DataFrameIterator at 0x7df052339000>

## 2- Building the model  

For image classification, a **CNN** model is a natural choice; alternatively, we could use a pre-trained model like **VGG16**.  
A **`Conv2D`** layer is what makes a neural network _convolutional_. Because its filters share weights across the whole image, it also needs far fewer trainable parameters than a fully connected **`Dense`** layer.  
A **`BatchNormalization`** layer is often used after a `Conv2D` layer to normalize its activations, which leads to faster training by keeping their distribution more stable. It also has a mild regularizing effect, which can help the model generalize if it's overfitting.  
A **`MaxPooling2D`** layer downsamples the spatial dimensions, _height and width_, of the previous layer's output while preserving the strongest features. It has no parameters of its own, but the smaller feature maps reduce the number of parameters in the later `Dense` layer, which helps limit overfitting.  
A **`Flatten`** layer is used after the convolutional blocks to flatten the multi-dimensional feature maps into a **1D** feature vector, which is a necessary transformation before feeding the data into a `Dense` layer, _fully connected NN_.  
A **`Dropout`** layer is used to randomly _drop out_ some of the neurons during training. It helps the model generalize better, but in most applications that already use `BatchNormalization` layers, a high dropout rate is not required.  

In [None]:
model = Sequential([
    Input(shape=(224, 224, 3)), 
     
    Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'),
    BatchNormalization(), 
    MaxPooling2D(pool_size=(2, 2)),
    
    Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same'),
    BatchNormalization(), 
    MaxPooling2D(pool_size=(2, 2)),
    
    Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same'),
    BatchNormalization(), 
    MaxPooling2D(pool_size=(2, 2)),
    
    Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same'),
    BatchNormalization(), 
    MaxPooling2D(pool_size=(2, 2)),
    
    Flatten(),
    Dense(128, activation='relu'),
    BatchNormalization(), 
    
    Dropout(0.45),
    Dense(1, activation='sigmoid'), ])
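
To get a feel for where the parameters of this architecture sit, we can redo the shape and parameter arithmetic by hand. The sketch below assumes the layer settings from the cell above (`3x3` kernels, `padding='same'`, `2x2` pooling) and ignores the `BatchNormalization` parameters; the helper functions are illustrative and not part of Keras.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_features, units):
    """Weights plus one bias per unit."""
    return in_features * units + units


height = width = 224
channels = 3
conv_total = 0

for filters in (32, 64, 128, 256):
    conv_total += conv2d_params(3, 3, channels, filters)
    channels = filters
    # padding='same' keeps height/width; each 2x2 max-pool halves them.
    height //= 2
    width //= 2

flat_features = height * width * channels
first_dense = dense_params(flat_features, 128)

print(f"Feature map before Flatten: {height}x{width}x{channels}")
print(f"Flattened features: {flat_features}")
print(f"Conv2D parameters (all four layers): {conv_total}")
print(f"Dense(128) parameters: {first_dense}")
```

Under these assumptions, the `Dense(128)` layer that follows `Flatten` holds far more parameters than the four convolutional layers combined, which is one reason the pooling layers matter so much here.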

## 3- Compiling the model  

We're using the `Adam` optimizer, as it's a standard and efficient choice for most ML and DL applications.  
We're monitoring the `accuracy` of the model, as it gives a general idea of how good the model is, _even if we care more about **precision** in such tasks_.  

In [None]:
adam = tf.keras.optimizers.Adam(learning_rate=1e-5)
model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
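
Because the classes are imbalanced, accuracy on its own can look better than it is. As a rough illustration, the sketch below computes accuracy, precision and recall for a hypothetical baseline that labels every test image as `all`, using the test-split class counts printed earlier; `binary_metrics` is an illustrative helper, and "positive" is assumed to mean `all`.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Class totals of the test split; "positive" here means ALL.
n_all, n_hem = 1455, 678

# Hypothetical baseline that predicts ALL for every image:
# every ALL image is a true positive, every hem image a false positive.
accuracy, precision, recall = binary_metrics(tp=n_all, fp=n_hem, fn=0, tn=0)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
```

Any trained model should therefore be compared against roughly 68% accuracy rather than 50%.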

## 4- Training the model  

We're using the `EarlyStopping` callback to stop training once the validation loss stops improving, which keeps the model from overfitting the data over a large number of epochs. It then restores the weights from the epoch with the lowest `val_loss`.  

In [None]:
early_stopping = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)

history = model.fit(train_generator, validation_data=val_generator,
                    epochs=40, callbacks=[early_stopping])

print(f"Best validation accuracy: {max(history.history['val_accuracy']) * 100:.2f}%")
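
To make the role of `patience` concrete, here is a simplified simulation of the stopping rule in plain Python. It is a sketch of the idea, not Keras's actual implementation, and the loss values are made up.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epochs_run, best_epoch) for a simplified patience rule.

    Training stops once `patience` consecutive epochs pass without a new
    lowest validation loss. Epoch numbers are 1-based.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up loss curve: improves for three epochs, then degrades.
losses = [0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.66]
epochs_run, best_epoch = simulate_early_stopping(losses, patience=3)
print(f"Stopped after epoch {epochs_run}; best weights come from epoch {best_epoch}")
```

Under this simplified rule, `patience=20` with `epochs=40` can only cut training short if the lowest `val_loss` occurs within the first 20 epochs. Otherwise all 40 epochs run, and `restore_best_weights=True` is what brings the best weights back.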

## 5- Model evaluation

In [None]:
y_pred_probs = model.predict(test_generator, verbose=1)
y_pred = (y_pred_probs > 0.5).astype(int).ravel()  # flatten (N, 1) to (N,) to match y_true
y_true = test_generator.classes

accuracy = classification_report(y_true, y_pred, output_dict=True)['accuracy']
print("Test Accuracy: " + str(accuracy) + '\n' + classification_report(y_true, y_pred))

plt.plot(history.history['accuracy'], label='Training Accuracy', color='black')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy', color='lightgreen')
plt.axhline(y=accuracy, color='red', linestyle='--', label="Test Accuracy")

plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

In [None]:
cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(6, 5))
# Tick labels follow the generator's class_indices: {'all': 0, 'hem': 1}
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['All (0)', 'Hem (1)'], yticklabels=['All (0)', 'Hem (1)'])
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()
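
When reading the matrix, keep in mind that rows are the true classes and columns are the predicted classes, both ordered by label index (`all` = 0, `hem` = 1, per the class indices printed earlier). The sketch below rebuilds that layout by hand on a few made-up labels; `confusion_counts` is an illustrative helper, not the scikit-learn function.

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix


class_names = ["all", "hem"]  # index 0 and 1, per the generator's class_indices

# Made-up labels for illustration only.
y_true_demo = [0, 0, 0, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 0]

matrix = confusion_counts(y_true_demo, y_pred_demo)
for name, row in zip(class_names, matrix):
    print(f"true {name}: {row}")
```

With this layout, the top-right cell counts `all` images that were predicted as `hem`, i.e. leukemia cases the model would miss.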

##### **Now let's save the model**:  

In [None]:
model.save('/kaggle/working/leukemia_classifier.h5')