Cancer is a serious disease that changes people's lives forever. Modern medicine provides techniques to identify and diagnose cancer, but many people notice the warning signs too late for treatment to succeed. Machine learning models could make it quicker and easier to screen for signs of skin cancer.
In this notebook, I use a dataset of 270 images of benign and malignant skin lesions to train a model that classifies a lesion as benign or malignant.

In [1]:
import os
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

# Split files into benign and malignant
benign_image_filenames = []
malignant_image_filenames = []
for dirname, _, filenames in os.walk('/kaggle/input/skin-cancer-dataset/train_cancer'):
    label = os.path.basename(dirname)  # Folder name is the class label
    for filename in filenames:
        if label == "benign":
            benign_image_filenames.append(os.path.join(dirname, filename))
        else:
            # Every file outside the benign folder is treated as malignant
            malignant_image_filenames.append(os.path.join(dirname, filename))

# Display the number of benign and malignant images
print("Benign: " + str(len(benign_image_filenames)))
print("Malignant: " + str(len(malignant_image_filenames)))

# Display the first few image filenames
print(benign_image_filenames[:5])  
print(malignant_image_filenames[:5]) 





Benign: 30
Malignant: 240
['/kaggle/input/skin-cancer-dataset/train_cancer/benign/20.jpg', '/kaggle/input/skin-cancer-dataset/train_cancer/benign/6.jpg', '/kaggle/input/skin-cancer-dataset/train_cancer/benign/30.jpg', '/kaggle/input/skin-cancer-dataset/train_cancer/benign/38.jpg', '/kaggle/input/skin-cancer-dataset/train_cancer/benign/42.jpg']
['/kaggle/input/skin-cancer-dataset/train_cancer/malignant/45.jpg', '/kaggle/input/skin-cancer-dataset/train_cancer/malignant/56.jpg', '/kaggle/input/skin-cancer-dataset/train_cancer/malignant/89.jpg', '/kaggle/input/skin-cancer-dataset/train_cancer/malignant/20.jpg', '/kaggle/input/skin-cancer-dataset/train_cancer/malignant/212.jpg']
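
The counts above are heavily skewed toward the malignant class, which matters when reading any accuracy figure for this dataset. As a minimal sketch (the helper name `majority_baseline` is hypothetical and not part of the notebook), the following computes the accuracy a classifier would get by always predicting the more common class:

```python
def majority_baseline(class_counts):
    """Return (majority_label, accuracy) for a classifier that always
    predicts the most frequent class. `class_counts` maps label -> count."""
    total = sum(class_counts.values())
    majority_label = max(class_counts, key=class_counts.get)
    return majority_label, class_counts[majority_label] / total

# Counts taken from the cell output above
label, baseline_accuracy = majority_baseline({"benign": 30, "malignant": 240})
print(f"Always predicting '{label}' scores {baseline_accuracy:.4f}")
```

With 30 benign and 240 malignant images, that baseline is 240/270, or about 0.889, so a useful model has to beat that number rather than 0.5.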


After splitting the images, I normalized the pixel values by dividing them by 255 so that every input lies in [0, 1]. This keeps the inputs in a small, consistent numeric range, which generally makes training more stable. Note that this rescaling does not remove lighting differences between images: a darker image is still darker after scaling.

In [2]:
# Normalize images
def load_normalized(filenames):
    """Load each image and scale its pixel values from [0, 255] to [0, 1]."""
    images = []
    for filename in filenames:
        with Image.open(filename) as img:
            img_array = np.array(img)  # Convert image to NumPy array
            images.append(img_array / 255.0)  # Normalize to [0, 1]
    return images

benign_normalized_images = load_normalized(benign_image_filenames)
malignant_normalized_images = load_normalized(malignant_image_filenames)
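
To make the effect of this step concrete, here is a small sketch with two made-up 2x2 grayscale "images". The pixel values are hypothetical, and the sketch uses plain Python lists rather than the notebook's arrays. Dividing by 255 moves both images into [0, 1], but the brightness gap between them is only rescaled by the same factor:

```python
def scale_to_unit(pixels):
    """Divide every 8-bit pixel value by 255 so the result lies in [0, 1]."""
    return [[value / 255.0 for value in row] for row in pixels]

def mean_brightness(pixels):
    """Average pixel value over the whole image."""
    flat = [value for row in pixels for value in row]
    return sum(flat) / len(flat)

# Hypothetical images: the same pattern under dim and bright lighting
dim_image = [[20, 40], [60, 80]]
bright_image = [[120, 140], [160, 180]]

dim_scaled = scale_to_unit(dim_image)
bright_scaled = scale_to_unit(bright_image)

# The gap between the two images shrinks by a factor of 255 but does not vanish
gap_before = mean_brightness(bright_image) - mean_brightness(dim_image)
gap_after = mean_brightness(bright_scaled) - mean_brightness(dim_scaled)
print(gap_before, gap_after)
```

If lighting differences turn out to matter, they would need a separate step, such as per-image standardization or brightness augmentation.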

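While there are 240 malignant images, there are only 30 benign images, which isn't enough to train an accurate model. I decided to augment the existing images to create additional, varied examples for the model to learn from. Every image, benign or malignant, gets two augmented copies, so the 1:8 ratio between the classes stays the same.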

In [3]:
# Function that augments a list of images
def augment(images):
    # Initialize the ImageDataGenerator
    datagen = ImageDataGenerator(
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest'
    )

    # Augmented images per original
    num_augmented_images = 2

    # Store augmented images
    augmented_images = []

    # Generate augmented images
    for image in images:
        image = image.reshape((1,) + image.shape)
        for _ in range(num_augmented_images):
            for augmented_image in datagen.flow(image, batch_size=1):
                augmented_images.append(augmented_image[0]) 
                break 

    return np.array(augmented_images)
benign_augmented_images = augment(benign_normalized_images)
malignant_augmented_images = augment(malignant_normalized_images)
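
One thing to watch with this step is ordering: if augmented copies are made before the train/test split, transformed versions of test images can end up in the training data. The sketch below shows the split-first order on stand-in data. It uses plain Python and string placeholders instead of real images, and `split_then_augment` and `fake_augment` are hypothetical names, not functions from this notebook or from Keras.

```python
import random

def split_then_augment(items, test_fraction, augment_fn, copies=2, seed=42):
    """Hold out a test set first, then augment only the training items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test_items = shuffled[:n_test]
    train_items = shuffled[n_test:]
    # Augmented copies are derived from training items only
    augmented = [augment_fn(item, i) for item in train_items for i in range(copies)]
    return train_items + augmented, test_items

def fake_augment(item, copy_index):
    """Stand-in for a real image transform: tags the item with its source."""
    return f"{item}-aug{copy_index}"

images = [f"img{i}" for i in range(30)]
train, test = split_then_augment(images, test_fraction=0.2, augment_fn=fake_augment)
print(len(train), len(test))
```

With this ordering, no item in the training list is derived from a held-out item, so the test set stays unseen.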


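Next, I split the data into training and test sets. All of the augmented images go into the training set to give the model more examples to learn from. One caveat: the augmented images were generated from every original image, including the ones that end up in the test set, so the training set contains transformed copies of test images.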

In [4]:
# Split the original images into training and testing sets
benign_normalized_images = np.array(benign_normalized_images)  # Original benign images
malignant_normalized_images = np.array(malignant_normalized_images)  # Original malignant images
 
X_train_benign, X_test_benign = train_test_split(benign_normalized_images, test_size=0.2, random_state=42)
X_train_malignant, X_test_malignant = train_test_split(malignant_normalized_images, test_size=0.2, random_state=42)

# Create labels for the test set
y_test_benign = np.zeros(X_test_benign.shape[0])  # Label 0 for benign
y_test_malignant = np.ones(X_test_malignant.shape[0])  # Label 1 for malignant

# Combine test data and labels
X_test = np.concatenate((X_test_benign, X_test_malignant), axis=0)
y_test = np.concatenate((y_test_benign, y_test_malignant), axis=0)

# Combine training set with augmented images
all_benign_train = np.concatenate((X_train_benign, benign_augmented_images), axis=0)
all_malignant_train = np.concatenate((X_train_malignant, malignant_augmented_images), axis=0)

# Training Set Labels
benign_labels_train = np.zeros(all_benign_train.shape[0])  # Label 0 for benign
malignant_labels_train = np.ones(all_malignant_train.shape[0])  # Label 1 for malignant

# Combine
X_train = np.concatenate((all_benign_train, all_malignant_train), axis=0)
y_train = np.concatenate((benign_labels_train, malignant_labels_train), axis=0)
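
Because both classes receive the same number of augmented copies, the training labels stay imbalanced. One common way to compensate, shown here as a sketch rather than something this notebook does, is to weight each class inversely to its frequency. The helper name `balanced_class_weights` is hypothetical. The label counts follow from the split and augmentation sizes above (24 + 60 benign, 192 + 480 malignant).

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by total / (n_classes * class_count), so that
    rarer classes contribute more to the loss."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * count) for label, count in counts.items()}

# Stand-in for y_train: 84 benign (0) and 672 malignant (1) labels
labels = [0] * 84 + [1] * 672
class_weights = balanced_class_weights(labels)
print(class_weights)
```

A dictionary of this shape can be passed to Keras `model.fit` through its `class_weight` argument. Whether it helps on this dataset would need to be checked on held-out data.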



I decided to use the pre-trained ResNet50 model because of its strong performance on image tasks. I froze its weights and added two layers on top: a global average pooling layer to reduce dimensionality, and a single sigmoid unit that outputs the probability that a lesion is malignant.

In [5]:
# ResNet Model
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the base model
base_model.trainable = False

# Stack the frozen base model, the pooling layer, and the sigmoid output layer
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation='sigmoid') 
])

# Compile
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

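# Train
# Note: validation_data reuses the training set, so the val_* metrics
# logged below are not held-out estimates
history = model.fit(
    X_train, y_train,
    validation_data=(X_train, y_train),
    epochs=4,
    batch_size=16
)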

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
94765736/94765736 ━━━━━━━━━━━━━━━━━━━━ 4s 0us/step
Epoch 1/4
48/48 ━━━━━━━━━━━━━━━━━━━━ 118s 2s/step - accuracy: 0.8663 - loss: 0.3957 - val_accuracy: 0.8889 - val_loss: 0.3463
Epoch 2/4
48/48 ━━━━━━━━━━━━━━━━━━━━ 107s 2s/step - accuracy: 0.8876 - loss: 0.3534 - val_accuracy: 0.8889 - val_loss: 0.3533
Epoch 3/4
48/48 ━━━━━━━━━━━━━━━━━━━━ 142s 2s/step - accuracy: 0.8854 - loss: 0.3657 - val_accuracy: 0.8889 - val_loss: 0.3454
Epoch 4/4
48/48 ━━━━━━━━━━━━━━━━━━━━ 142s 2s/step - accuracy: 0.8905 - loss: 0.3461 - val_accuracy: 0.8889 - val_loss: 0.3455


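Model evaluation. Note that the cell below evaluates on the training data (X_train, y_train) rather than the held-out test set (X_test, y_test), so the reported figures are not an estimate of performance on unseen images. The accuracy of 0.8889 equals the share of malignant images in the training set (672 of 756), which is what a model that predicts malignant for every image would score.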

In [6]:
# Model Evaluation
# Scores the training data; pass X_test, y_test instead to score the held-out set
val_loss, val_accuracy = model.evaluate(X_train, y_train)
print(f'Validation loss: {val_loss}, Validation accuracy: {val_accuracy}')

24/24 ━━━━━━━━━━━━━━━━━━━━ 55s 2s/step - accuracy: 0.6765 - loss: 0.8137
Validation loss: 0.3454526364803314, Validation accuracy: 0.8888888955116272
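
Accuracy alone can hide what happens to the minority class. As a sketch (plain Python, with the hypothetical helper name `binary_report`), the following computes accuracy together with sensitivity and specificity. It is applied to a made-up prediction vector in which every test image is labelled malignant. The test-set sizes of 6 benign and 48 malignant follow from the 20% split earlier in the notebook; the predictions are illustrative and are not output from the trained model.

```python
def binary_report(y_true, y_pred):
    """Return accuracy, sensitivity (malignant recall), and specificity
    (benign recall) for labels where 1 = malignant and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Illustrative only: 6 benign and 48 malignant test labels,
# with a classifier that answers "malignant" every time
y_true = [0] * 6 + [1] * 48
y_pred = [1] * 54
report = binary_report(y_true, y_pred)
print(report)
```

In this illustration the accuracy is 48/54 (about 0.889) while the specificity is 0, meaning every benign lesion is flagged as malignant. Reporting these figures for the real predictions on X_test would show whether the trained model behaves differently.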
