# Image Classification - Pap smear images for Cervical Cancer screening
**A simple implementation of an image classifier using Keras**

Data source: SipakMed https://www.kaggle.com/datasets/prahladmehandiratta/cervical-cancer-largest-dataset-sipakmed

In [5]:
import matplotlib.pyplot as plt
import os
import zipfile
import shutil
import random
from PIL import Image

# Tensorflow/Keras imports
import tensorflow
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.preprocessing import image
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow_addons.metrics import F1Score
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import load_model

from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

Note: you may run into version conflicts with these imports when replicating this code. I pinned TensorFlow 2.10 because I wanted to use the GPU/CUDA, and 2.10 was the last TensorFlow release that supported GPU on native Windows.
For more information check: https://www.tensorflow.org/install/pip#windows-native
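
As a quick sanity check before installing anything, the version constraint above can be expressed as a tiny helper. This is only a sketch: `supports_native_windows_gpu` is a hypothetical name I made up, and it simply encodes the "2.10 or older" rule from the note rather than querying TensorFlow itself.

```python
def supports_native_windows_gpu(tf_version: str) -> bool:
    """Return True if the version string is 2.10 or older (hypothetical helper)."""
    # Only major and minor matter here, e.g. "2.10.1" -> (2, 10)
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    return (major, minor) <= (2, 10)


print(supports_native_windows_gpu("2.10.1"))  # True
print(supports_native_windows_gpu("2.11.0"))  # False
```

In a live environment you could pass `tensorflow.__version__` to the helper.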

In [6]:
# Checking if GPU is available
print("Num GPUs Available: ", len(tensorflow.config.list_physical_devices('GPU')))

Num GPUs Available:  1


## 1 - Extract the images stored in 'archive' zip file

For simplicity, I had already downloaded the data from Kaggle as a local archive.zip file.

You can do the same by following this guide: https://www.geeksforgeeks.org/how-to-download-kaggle-datasets-into-jupyter-notebook/

In [7]:
# Let's extract our images from archive.zip
def extract_cropped_images(zip_path, extraction_path):
    # Open the zip file
    with zipfile.ZipFile(zip_path, 'r') as archive:
        # Iterate through each file
        for file_info in archive.infolist():
            # Check if the file is a .bmp file within a CROPPED subfolder
            if file_info.filename.endswith('.bmp') and 'CROPPED' in file_info.filename:
                # Split the path to get the necessary components
                parts = file_info.filename.split('/')
                # Extract the first subfolder name as the label
                label = parts[1]
                # Get the image filename
                image_filename = parts[-1]
                # Destination path
                destination_dir = os.path.join(extraction_path, label)
                destination_path = os.path.join(destination_dir, image_filename)
                # Create the directory (if it doesn't exist)
                os.makedirs(destination_dir, exist_ok=True)
                # Extract the images
                with archive.open(file_info) as source_file:
                    with open(destination_path, 'wb') as dest_file:
                        shutil.copyfileobj(source_file, dest_file)
    print("Images extracted!")

TLDR: we just want to grab the .bmp images from the CROPPED folders inside archive.zip, and we want to keep these images inside their respective subfolders, because these subfolders represent the target labels.
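
To see the filtering and labelling rule in isolation, here is a minimal dry run on a tiny in-memory zip. The folder layout and file names below are made up for illustration (I am assuming the label sits at `parts[1]`, as in the function above), and `count_cropped_by_label` is a hypothetical helper that only counts files instead of writing them to disk.

```python
import io
import zipfile
from collections import Counter


def count_cropped_by_label(zip_file):
    """Count .bmp members inside a CROPPED folder, grouped by parts[1]."""
    counts = Counter()
    with zipfile.ZipFile(zip_file, 'r') as archive:
        for name in archive.namelist():
            if name.endswith('.bmp') and 'CROPPED' in name:
                counts[name.split('/')[1]] += 1
    return counts


# Build a small archive in memory with a hypothetical layout
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('im_Parabasal/im_Parabasal/CROPPED/001_01.bmp', b'')
    archive.writestr('im_Parabasal/im_Parabasal/CROPPED/001_02.bmp', b'')
    archive.writestr('im_Parabasal/im_Parabasal/001.bmp', b'')              # not in CROPPED: skipped
    archive.writestr('im_Metaplastic/im_Metaplastic/CROPPED/001_01.bmp', b'')
    archive.writestr('im_Metaplastic/im_Metaplastic/CROPPED/001_01.dat', b'')  # not .bmp: skipped

label_counts = count_cropped_by_label(buffer)
print(dict(label_counts))
```

Only the three cropped .bmp files are counted, two under `im_Parabasal` and one under `im_Metaplastic`.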

In [8]:
zip_path = 'archive.zip'
extraction_path = 'cropped_images'

In [9]:
extract_cropped_images(zip_path, extraction_path)

Images extracted!


In [10]:
# Checking the number of images of each label
for subfolder in os.listdir(extraction_path):
    subfolder_path = os.path.join(extraction_path, subfolder)
    if os.path.isdir(subfolder_path):
        # Count the number of .bmp files in the subfolder
        num_images = len([name for name in os.listdir(subfolder_path) if name.endswith('.bmp')])
        print(f"{num_images} images in '{subfolder}'.")

813 images in 'im_Dyskeratotic'.
825 images in 'im_Koilocytotic'.
793 images in 'im_Metaplastic'.
787 images in 'im_Parabasal'.
831 images in 'im_Superficial-Intermediate'.
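
A quick way to judge class balance from the counts printed above is to turn them into shares of the total. This is just arithmetic on the printed numbers, not a new measurement.

```python
# Counts copied from the output above
class_counts = {
    'im_Dyskeratotic': 813,
    'im_Koilocytotic': 825,
    'im_Metaplastic': 793,
    'im_Parabasal': 787,
    'im_Superficial-Intermediate': 831,
}

total = sum(class_counts.values())
shares = {label: n / total for label, n in class_counts.items()}

print(f"Total images: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Every class lands close to 20% of the 4049 images, so the dataset is roughly balanced.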


## 2 - Data Generators

In [11]:
img_width, img_height = 224, 224
batch_size = 32

datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_generator = datagen.flow_from_directory(
    extraction_path,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='categorical',
    subset='training'
)

validation_generator = datagen.flow_from_directory(
    extraction_path,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='categorical',
    subset='validation',
    shuffle=False
)

Found 3241 images belonging to 5 classes.
Found 808 images belonging to 5 classes.
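
The 3241/808 split can be reproduced by hand. The sketch below assumes that `flow_from_directory` reserves `int(validation_split * n)` files from each class folder for validation and uses the rest for training; that rule is my reading of the behaviour, so treat it as an assumption rather than documented API.

```python
# Per-class counts from the earlier cell
class_counts = {
    'im_Dyskeratotic': 813,
    'im_Koilocytotic': 825,
    'im_Metaplastic': 793,
    'im_Parabasal': 787,
    'im_Superficial-Intermediate': 831,
}
validation_split = 0.2

# Assumed rule: truncate validation_split * n per class, remainder goes to training
val_counts = {label: int(validation_split * n) for label, n in class_counts.items()}
train_counts = {label: n - val_counts[label] for label, n in class_counts.items()}

print(sum(train_counts.values()), sum(val_counts.values()))  # 3241 808
print(val_counts)
```

Under that assumption the totals match the generator output, and the per-class validation counts (162, 165, 158, 157, 166) match the support column of the classification report at the end of the notebook.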


## 3 - Model

In [12]:
# Building a simple CNN model
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    
    Conv2D(128, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    
    Conv2D(256, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),

    GlobalAveragePooling2D(),
    Flatten(),  # no-op here: GlobalAveragePooling2D already returns a (batch, 256) tensor
    Dense(256, activation='relu'),
    Dropout(rate=0.2),
    Dense(256, activation='relu'),
    Dropout(rate=0.2),
    Dense(5, activation='softmax')
])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 222, 222, 32)      896       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 111, 111, 32)     0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 111, 111, 64)      18496     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 55, 55, 64)       0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 55, 55, 128)       73856     
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 27, 27, 128)      0
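
The shapes and parameter counts in the rows shown above can be checked with a few lines of arithmetic. The helper names are my own; the formulas are the usual ones for a 3x3 convolution with bias (`(k*k*in_channels + 1) * filters`), 'valid' versus 'same' padding, and 2x2 max pooling with floor division.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def conv_output_size(size, kernel, padding):
    """Spatial size after a stride-1 convolution."""
    return size if padding == 'same' else size - kernel + 1


def pool_output_size(size, pool=2):
    """Spatial size after max pooling with stride equal to the pool size."""
    return size // pool


size, channels = 224, 3
params, conv_sizes, pool_sizes = [], [], []

# First three conv blocks of the model: (filters, padding)
for filters, padding in [(32, 'valid'), (64, 'same'), (128, 'same')]:
    params.append(conv2d_params(3, channels, filters))
    size = conv_output_size(size, 3, padding)
    conv_sizes.append(size)
    size = pool_output_size(size)
    pool_sizes.append(size)
    channels = filters

print(params)      # [896, 18496, 73856]
print(conv_sizes)  # [222, 111, 55]
print(pool_sizes)  # [111, 55, 27]
```

Note that only the first convolution shrinks the image (224 to 222), because it is the only one without `padding='same'`.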

In [13]:
# Compile
opt_adam = Adam(learning_rate=0.001)
f1 = F1Score(num_classes=5, average='weighted')

model.compile(optimizer=opt_adam, loss='categorical_crossentropy', metrics=['accuracy',f1])
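
For reading the loss values later on, it helps to know what `categorical_crossentropy` gives for a model that guesses uniformly across the 5 classes. The function below is a plain-Python sketch of the per-sample formula, not the Keras implementation.

```python
import math


def categorical_crossentropy(y_true, y_pred):
    """Per-sample cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


num_classes = 5
one_hot = [1, 0, 0, 0, 0]

# A model with no information assigns 1/5 to every class
uniform = [1 / num_classes] * num_classes
chance_loss = categorical_crossentropy(one_hot, uniform)

# A confident, correct prediction gives a much smaller loss
confident = [0.9, 0.025, 0.025, 0.025, 0.025]
confident_loss = categorical_crossentropy(one_hot, confident)

print(f"{chance_loss:.4f}")  # 1.6094
print(f"{confident_loss:.4f}")
```

So a validation loss noticeably below ln(5) ≈ 1.609 suggests the model has learned something beyond uniform guessing.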

In [14]:
# ModelCheckpoint callback - save best weights
checkpoint = ModelCheckpoint(filepath='best_weights.hdf5', save_best_only=True, verbose=1)

# EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True, mode='min')

In [15]:
# Train the model
history = model.fit(train_generator,
                    epochs=20,
                    validation_data=validation_generator,
                    callbacks=[checkpoint, early_stop])

Epoch 1/20
Epoch 1: val_loss improved from inf to 1.35906, saving model to best_weights.hdf5
Epoch 2/20
Epoch 2: val_loss improved from 1.35906 to 1.04118, saving model to best_weights.hdf5
Epoch 3/20
Epoch 3: val_loss improved from 1.04118 to 0.96544, saving model to best_weights.hdf5
Epoch 4/20
Epoch 4: val_loss did not improve from 0.96544
Epoch 5/20
Epoch 5: val_loss did not improve from 0.96544
Epoch 6/20
Epoch 6: val_loss did not improve from 0.96544
Epoch 7/20
Epoch 7: val_loss did not improve from 0.96544
Epoch 8/20
Epoch 8: val_loss did not improve from 0.96544
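
Training stopped at epoch 8 of 20 because of the `patience=5` setting. The sketch below mimics that rule in plain Python. The first three losses come from the log above; the values for epoch 4 onward are hypothetical placeholders (the log only says they did not improve), and `early_stopping_epoch` is an illustrative helper, not the Keras callback.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch where training stops, or None if it never triggers."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0          # improvement: reset the counter
        else:
            wait += 1         # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Epochs 1-3 from the training log; epochs 4-20 are hypothetical non-improving values
logged = [1.35906, 1.04118, 0.96544]
hypothetical = [1.0] * 17
stop_epoch = early_stopping_epoch(logged + hypothetical, patience=5)

print(stop_epoch)  # 8
```

The best loss is reached at epoch 3, and five epochs without improvement later (epoch 8) the run ends.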


In [16]:
# Evaluation
model.load_weights('best_weights.hdf5')
# Use a separate name so the F1Score metric object `f1` is not overwritten
loss, accuracy, val_f1 = model.evaluate(validation_generator)

print("Val Loss:", loss)
print("Val Accuracy:", accuracy)
print("Val F1 Score:", val_f1)

Val Loss: 0.9654361009597778
Val Accuracy: 0.6262376308441162
Val F1 Score: 0.6071136593818665


In [17]:
# Get the predictions (shuffle=False in the validation generator keeps them aligned with the labels)
preds = model.predict(validation_generator)
predicted_classes = preds.argmax(axis=1)
# Get the true classes
true_classes = validation_generator.classes
# Get the classification report
print(classification_report(true_classes, predicted_classes, target_names=list(validation_generator.class_indices.keys())))

                             precision    recall  f1-score   support

            im_Dyskeratotic       0.86      0.73      0.79       162
            im_Koilocytotic       0.43      0.35      0.38       165
             im_Metaplastic       0.86      0.83      0.84       158
               im_Parabasal       0.56      0.99      0.71       157
im_Superficial-Intermediate       0.42      0.27      0.33       166

                   accuracy                           0.63       808
                  macro avg       0.62      0.63      0.61       808
               weighted avg       0.62      0.63      0.61       808
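
The two summary rows can be recomputed from the per-class rows: the macro average is a plain mean of the class F1 scores, while the weighted average weights each class by its support. The inputs below are the rounded values from the report, so this is a consistency check rather than an exact reproduction.

```python
# (f1-score, support) per class, copied from the report above
rows = {
    'im_Dyskeratotic': (0.79, 162),
    'im_Koilocytotic': (0.38, 165),
    'im_Metaplastic': (0.84, 158),
    'im_Parabasal': (0.71, 157),
    'im_Superficial-Intermediate': (0.33, 166),
}

total_support = sum(support for _, support in rows.values())

# Macro: every class counts equally
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)
# Weighted: classes count in proportion to their number of samples
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(f"{macro_f1:.2f} {weighted_f1:.2f}")  # 0.61 0.61
```

Because the supports are nearly equal, the two averages coincide at two decimals; the weak spots are clearly Koilocytotic and Superficial-Intermediate.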

