
# Data Preprocessing for Skin Cancer Detection
This notebook contains the steps for preprocessing the skin lesion images.

### The following steps will be implemented for preprocessing the images: 
• Load the dataset \
• Remove duplicate images \
• Split the dataset into training, validation, and test sets \
• Process the images \
• Resize the images and rescale the pixel values \
• Apply any other steps needed to train the models efficiently.


## 1. Import Libraries

First, we need to import the necessary libraries.

In [10]:
import os
import shutil
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## 2. Define Directories

Next, we define the directories for the raw data and the processed data.

In [3]:
# Disable oneDNN optimizations
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"

#Base directory
BASE_DIR = r"G:\OneDrive\ML-MinorProject\ISIC Dataset"

#Train datasets
TRAIN_2020_DIR = os.path.join(BASE_DIR, "ISIC_2020_Training_JPEG", "train")
TRAIN_2019_DIR= os.path.join(BASE_DIR,"ISIC_2019_Training_Input", "ISIC_2019_Training_Input")
TRAIN_2018_DIR= os.path.join(BASE_DIR,"ISIC2018_Task3_Training_Input", "ISIC2018_Task3_Training_Input" )

#Test Datasets
TEST_2020_DIR = os.path.join(BASE_DIR, "ISIC_2020_Test_JPEG", "ISIC_2020_Test_Input")
TEST_2019_DIR= os.path.join(BASE_DIR, "ISIC_2019_Test_Input", "ISIC_2019_Test_Input")
TEST_2018_DIR = os.path.join(BASE_DIR, "ISIC2018_Task3_Test_Input", "ISIC2018_Task3_Test_Input")

#Processed directories
PROCESSED_TRAIN_DIR = os.path.join(BASE_DIR, "processed_train")
PROCESSED_VAL_DIR = os.path.join(BASE_DIR, "processed_val")
PROCESSED_TEST_DIR = os.path.join(BASE_DIR, "processed_test")
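
Because every later cell depends on these paths, it can help to confirm that the raw-data folders exist before running anything that moves or deletes files. The following is a minimal sketch of such a check. `missing_directories` is a hypothetical helper that is not part of the original notebook, and the demo uses a temporary folder in place of `BASE_DIR`.

```python
import os
import tempfile

def missing_directories(paths):
    """Return the subset of paths that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]

# Demo on a throwaway directory tree (stands in for BASE_DIR).
with tempfile.TemporaryDirectory() as base:
    present = os.path.join(base, "train")
    absent = os.path.join(base, "test")
    os.makedirs(present)
    missing = [os.path.basename(p) for p in missing_directories([present, absent])]

print(missing)
```

In the notebook, the same helper could be called with the `TRAIN_*_DIR` and `TEST_*_DIR` constants, stopping early if the returned list is not empty.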

## 3. Find Duplicates

Since a list of duplicates is provided for the 2020 dataset, we remove the duplicate images from the training directory.


In [11]:
# Load duplicate image list
duplicate_2020_csv_path = os.path.join(BASE_DIR, "ISIC_2020_Training_Duplicates.csv")
duplicate_2020_df = pd.read_csv(duplicate_2020_csv_path)

# Keep image_name_1 of each pair; image_name_2 is the copy to delete
duplicate_2020_images = set(duplicate_2020_df['image_name_2'] + ".jpg")

# Collect image filenames
image_filenames_2020 = glob.glob(os.path.join(TRAIN_2020_DIR, "*.jpg"))

# Select only the files that are listed as duplicates
duplicate_filenames_2020 = [img for img in image_filenames_2020 if os.path.basename(img) in duplicate_2020_images]

# Delete the duplicate files
for fname in duplicate_filenames_2020:
    if os.path.exists(fname):
        os.remove(fname)
    else:
        print(f"File not found: {fname}")
print("Duplicates removed successfully.")

Duplicates removed successfully.
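
Since deleting files cannot be undone, a dry run that only lists what would be deleted is a cheap safeguard. The sketch below shows the selection step in isolation. `select_duplicates` and the image IDs are made up for illustration, and the demo runs in a temporary folder rather than in `TRAIN_2020_DIR`.

```python
import os
import tempfile

def select_duplicates(image_dir, duplicate_ids, ext=".jpg"):
    """Return sorted paths in image_dir whose file name matches a listed duplicate ID."""
    targets = {f"{image_id}{ext}" for image_id in duplicate_ids}
    return sorted(
        os.path.join(image_dir, name)
        for name in os.listdir(image_dir)
        if name in targets
    )

with tempfile.TemporaryDirectory() as image_dir:
    # Create three empty placeholder images with made-up IDs.
    for image_id in ["ISIC_A", "ISIC_B", "ISIC_C"]:
        open(os.path.join(image_dir, image_id + ".jpg"), "w").close()

    # Nothing is deleted here; we only compute the two lists.
    to_delete = select_duplicates(image_dir, ["ISIC_B"])
    to_delete_names = [os.path.basename(p) for p in to_delete]
    kept_names = sorted(set(os.listdir(image_dir)) - set(to_delete_names))

print("would delete:", to_delete_names)
print("would keep:", kept_names)
```

Only files named in the duplicate list are selected; every other image stays in place.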


## 4. Create Directories for Organized Dataset

We create directories for the processed training, validation, and test datasets.

In [12]:
# Create directories if they don't exist
os.makedirs(PROCESSED_TRAIN_DIR, exist_ok=True)
os.makedirs(PROCESSED_VAL_DIR, exist_ok=True)
os.makedirs(PROCESSED_TEST_DIR, exist_ok=True)

for category in ["benign", "malignant"]:
    os.makedirs(os.path.join(PROCESSED_TRAIN_DIR, category), exist_ok=True)
    os.makedirs(os.path.join(PROCESSED_VAL_DIR, category), exist_ok=True)
    os.makedirs(os.path.join(PROCESSED_TEST_DIR, category), exist_ok=True)


## 5. Load Test Ground Truth

We load the test ground truth data from the `ISIC_2020_Test_GroundTruth.csv`, `ISIC_2019_Test_GroundTruth.csv`, and `ISIC_2018_Test_GroundTruth.csv` files.

In [26]:
# Load test ground truth
test_2020_ground_truth_csv_path = os.path.join(BASE_DIR, "ISIC_2020_Test_GroundTruth.csv")
test_2019_ground_truth_csv_path = os.path.join(BASE_DIR, "ISIC_2019_Test_GroundTruth.csv")
test_2018_ground_truth_csv_path = os.path.join(BASE_DIR, "ISIC_2018_Test_GroundTruth.csv")

#Create dataframes
test_2020_ground_truth_df = pd.read_csv(test_2020_ground_truth_csv_path)
test_2019_ground_truth_df = pd.read_csv(test_2019_ground_truth_csv_path)
test_2018_ground_truth_df = pd.read_csv(test_2018_ground_truth_csv_path)

# Map image IDs ('isic_id' for 2020, 'image' for 2019 and 2018) to file paths
test_2020_ground_truth_df["filename"] = test_2020_ground_truth_df["isic_id"].apply(lambda x: os.path.join(TEST_2020_DIR, f"{x}.jpg"))
test_2019_ground_truth_df["filename"] = test_2019_ground_truth_df["image"].apply(lambda x: os.path.join(TEST_2019_DIR, f"{x}.jpg"))
test_2018_ground_truth_df["filename"] = test_2018_ground_truth_df["image"].apply(lambda x: os.path.join(TEST_2018_DIR, f"{x}.jpg"))
# Debug: Check the first few rows of the ground truth DataFrame
print("test_2020_ground_truth_df:")
print(test_2020_ground_truth_df.head())
print()
print("test_2019_ground_truth_df:")
print(test_2019_ground_truth_df.head())
print()
print("test_2018_ground_truth_df:")
print(test_2018_ground_truth_df.head())
print()

# Filter out rows where the file does not exist in test directories
test_2020_ground_truth_df = test_2020_ground_truth_df[test_2020_ground_truth_df["filename"].apply(os.path.exists)]
test_2019_ground_truth_df = test_2019_ground_truth_df[test_2019_ground_truth_df["filename"].apply(os.path.exists)]
test_2018_ground_truth_df = test_2018_ground_truth_df[test_2018_ground_truth_df["filename"].apply(os.path.exists)]

# Debug: Check the number of benign and malignant images
print("Number of benign images in 2020 ground truth:", len(test_2020_ground_truth_df[test_2020_ground_truth_df["Target"] == 0]))
print("Number of malignant images in 2020 ground truth:", len(test_2020_ground_truth_df[test_2020_ground_truth_df["Target"] == 1]))
print()
print("Number of benign images in 2019 ground truth:", len(test_2019_ground_truth_df[test_2019_ground_truth_df["Target"] == 0]))
print("Number of malignant images in 2019 ground truth:", len(test_2019_ground_truth_df[test_2019_ground_truth_df["Target"] == 1]))
print()
print("Number of benign images in 2018 ground truth:", len(test_2018_ground_truth_df[test_2018_ground_truth_df["Target"] == 0]))
print("Number of malignant images in 2018 ground truth:", len(test_2018_ground_truth_df[test_2018_ground_truth_df["Target"] == 1]))

#combine all the dataframes into one
test_df=pd.concat([test_2020_ground_truth_df, test_2019_ground_truth_df, test_2018_ground_truth_df], ignore_index=True)
print(test_df.head())

test_2020_ground_truth_df:
        isic_id benign_malignant  Target  \
0  ISIC_0052060           benign       0   
1  ISIC_0052349           benign       0   
2  ISIC_0058510           benign       0   
3  ISIC_0073313           benign       0   
4  ISIC_0073502           benign       0   

                                            filename  
0  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_...  
1  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_...  
2  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_...  
3  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_...  
4  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_...  

test_2019_ground_truth_df:
          image  Target                                           filename
0  ISIC_0034321       0  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC2...
1  ISIC_0034322       0  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC2...
2  ISIC_0034323       0  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC2...
3  ISIC_0034324       0  G:\OneDrive\ML-MinorProje
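
The three ground-truth files go through the same two steps: build a `filename` column from the ID column, then drop rows whose file is absent. One way to avoid repeating those steps per year is a small helper, sketched below with pandas. `attach_existing_files` is a hypothetical name, and the toy table and temporary folder stand in for the real CSV files and test directories.

```python
import os
import tempfile
import pandas as pd

def attach_existing_files(df, image_dir, id_column, ext=".jpg"):
    """Add a 'filename' column and keep only rows whose file exists on disk."""
    out = df.copy()
    out["filename"] = [os.path.join(image_dir, f"{i}{ext}") for i in out[id_column]]
    return out[out["filename"].map(os.path.exists)].reset_index(drop=True)

with tempfile.TemporaryDirectory() as image_dir:
    # Only ISIC_A has a file on disk; ISIC_B is listed but missing.
    open(os.path.join(image_dir, "ISIC_A.jpg"), "w").close()
    truth = pd.DataFrame({"image": ["ISIC_A", "ISIC_B"], "Target": [0, 1]})
    kept = attach_existing_files(truth, image_dir, id_column="image")
    kept_ids = kept["image"].tolist()

print(kept_ids)
```

Passing the ID column name as a parameter covers the difference between `isic_id` in the 2020 file and `image` in the 2019 and 2018 files.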

## 6. Move test images
We now move the test images to `processed_test` directory according to their classes.


In [25]:
# Move test images to processed_test directory
for _, row in test_df.iterrows():
    label = ("benign" if row["Target"]== 0 else "malignant") 
    src_path = row["filename"]
    dest_dir = os.path.join(PROCESSED_TEST_DIR, label)

    # Ensure the destination directory exists
    os.makedirs(dest_dir, exist_ok=True)

    # Check if the source file exists
    if os.path.exists(src_path):
        try:
            # Move the file
            shutil.move(src_path, os.path.join(dest_dir, os.path.basename(src_path)))
        except Exception as e:
            print(f"Error moving file {src_path} to {dest_dir}: {e}")
    else:
        print(f"File not found: {src_path}")

# Debugging: Check the number of malignant and benign images in processed_test
print(f"Number of benign images in processed_test: {len(os.listdir(os.path.join(PROCESSED_TEST_DIR, 'benign')))}")
print(f"Number of malignant images in processed_test: {len(os.listdir(os.path.join(PROCESSED_TEST_DIR, 'malignant')))}")

File not found: G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_2020_Test_JPEG\ISIC_2020_Test_Input\ISIC_0052060.jpg
File not found: G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_2020_Test_JPEG\ISIC_2020_Test_Input\ISIC_0052349.jpg
File not found: G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_2020_Test_JPEG\ISIC_2020_Test_Input\ISIC_0058510.jpg
File not found: G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_2020_Test_JPEG\ISIC_2020_Test_Input\ISIC_0073313.jpg
File not found: G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_2020_Test_JPEG\ISIC_2020_Test_Input\ISIC_0073502.jpg
File not found: G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_2020_Test_JPEG\ISIC_2020_Test_Input\ISIC_0074618.jpg
File not found: G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_2020_Test_JPEG\ISIC_2020_Test_Input\ISIC_0076801.jpg
File not found: G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_2020_Test_JPEG\ISIC_2020_Test_Input\ISIC_0077586.jpg
File not found: G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_2020_Test_JPEG\ISI
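
A move loop like this reports missing files whenever it is run a second time, because the sources are gone after the first pass. Whether that is the cause of the messages above cannot be told from the output alone. The sketch below shows one way to make the step safe to re-run by checking the destination first. `move_if_needed` and its status strings are hypothetical, and the demo uses a temporary folder.

```python
import os
import shutil
import tempfile

def move_if_needed(src_path, dest_dir):
    """Move src_path into dest_dir and report what happened instead of failing on re-runs."""
    dest_path = os.path.join(dest_dir, os.path.basename(src_path))
    if os.path.exists(dest_path):
        return "already_moved"
    if not os.path.exists(src_path):
        return "missing"
    os.makedirs(dest_dir, exist_ok=True)
    shutil.move(src_path, dest_path)
    return "moved"

with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "ISIC_A.jpg")
    open(src, "w").close()
    dest = os.path.join(root, "processed_test", "benign")

    first = move_if_needed(src, dest)    # file is moved
    second = move_if_needed(src, dest)   # re-run: file is already in place
    third = move_if_needed(os.path.join(root, "ISIC_Z.jpg"), dest)  # never existed

print(first, second, third)
```

Counting the returned statuses gives a short summary instead of one printed line per file.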

## 7. Load Train Ground Truth 

In [4]:
# Load train ground truth
train_2020_ground_truth_csv_path = os.path.join(BASE_DIR, "ISIC_2020_Training_GroundTruth.csv")
train_2020_ground_truth_df = pd.read_csv(train_2020_ground_truth_csv_path)

train_2019_ground_truth_csv_path = os.path.join(BASE_DIR, "ISIC_2019_Training_GroundTruth.csv")
train_2019_ground_truth_df = pd.read_csv(train_2019_ground_truth_csv_path)

train_2018_ground_truth_csv_path = os.path.join(BASE_DIR, "ISIC_2018_Training_GroundTruth.csv")
train_2018_ground_truth_df = pd.read_csv(train_2018_ground_truth_csv_path)

# Create dataframe for train data
train_2020_filenames = glob.glob(os.path.join(TRAIN_2020_DIR, "*.jpg"))
train_2020_ground_truth_df["filename"] = train_2020_ground_truth_df["image_name"].apply(lambda x: os.path.join(TRAIN_2020_DIR, f"{x}.jpg"))
train_2020_ground_truth_df["label"] = train_2020_ground_truth_df["Target"].apply(lambda x: "malignant" if x == 1 else "benign")

train_2019_filenames = glob.glob(os.path.join(TRAIN_2019_DIR, "*.jpg"))
train_2019_ground_truth_df["filename"] = train_2019_ground_truth_df["image"].apply(lambda x: os.path.join(TRAIN_2019_DIR, f"{x}.jpg"))
train_2019_ground_truth_df["label"] = train_2019_ground_truth_df["Target"].apply(lambda x: "malignant" if x == 1 else "benign")

train_2018_filenames = glob.glob(os.path.join(TRAIN_2018_DIR, "*.jpg"))
train_2018_ground_truth_df["filename"] = train_2018_ground_truth_df["image"].apply(lambda x: os.path.join(TRAIN_2018_DIR, f"{x}.jpg"))
train_2018_ground_truth_df["label"] = train_2018_ground_truth_df["Target"].apply(lambda x: "malignant" if x == 1 else "benign")

# Filter out rows where the file does not exist in train directories
train_2020_ground_truth_df = train_2020_ground_truth_df[train_2020_ground_truth_df["filename"].apply(os.path.exists)]
train_2019_ground_truth_df = train_2019_ground_truth_df[train_2019_ground_truth_df["filename"].apply(os.path.exists)]
train_2018_ground_truth_df = train_2018_ground_truth_df[train_2018_ground_truth_df["filename"].apply(os.path.exists)]

# Debug: Check the number of benign and malignant images
print("Number of benign images in 2020 ground truth:", len(train_2020_ground_truth_df[train_2020_ground_truth_df["Target"] == 0]))
print("Number of malignant images in 2020 ground truth:", len(train_2020_ground_truth_df[train_2020_ground_truth_df["Target"] == 1]))
print()
print("Number of benign images in 2019 ground truth:", len(train_2019_ground_truth_df[train_2019_ground_truth_df["Target"] == 0]))
print("Number of malignant images in 2019 ground truth:", len(train_2019_ground_truth_df[train_2019_ground_truth_df["Target"] == 1]))
print()
print("Number of benign images in 2018 ground truth:", len(train_2018_ground_truth_df[train_2018_ground_truth_df["Target"] == 0]))
print("Number of malignant images in 2018 ground truth:", len(train_2018_ground_truth_df[train_2018_ground_truth_df["Target"] == 1]))

#combine all the dataframes into one
train_df=pd.concat([train_2020_ground_truth_df, train_2019_ground_truth_df, train_2018_ground_truth_df], ignore_index=True)
print(train_df.head())

Number of benign images in 2020 ground truth: 29785
Number of malignant images in 2020 ground truth: 537

Number of benign images in 2019 ground truth: 20809
Number of malignant images in 2019 ground truth: 4522

Number of benign images in 2018 ground truth: 8902
Number of malignant images in 2018 ground truth: 1113
     image_name benign_malignant  Target  \
0  ISIC_2637011           benign       0   
1  ISIC_0015719           benign       0   
2  ISIC_0052212           benign       0   
3  ISIC_0068279           benign       0   
4  ISIC_0074311           benign       0   

                                            filename   label image  
0  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_...  benign   NaN  
1  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_...  benign   NaN  
2  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_...  benign   NaN  
3  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_...  benign   NaN  
4  G:\OneDrive\ML-MinorProject\ISIC Dataset\ISIC_...  benign   NaN  
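
The per-year counts printed above can be added up to see how imbalanced the combined training data is. The short calculation below uses only the numbers from that output; the variable names are illustrative.

```python
# Counts copied from the printed output above: (benign, malignant) per year.
counts = {
    "2020": (29785, 537),
    "2019": (20809, 4522),
    "2018": (8902, 1113),
}

benign_total = sum(b for b, _ in counts.values())
malignant_total = sum(m for _, m in counts.values())
imbalance_ratio = benign_total / malignant_total

print(f"benign: {benign_total}, malignant: {malignant_total}")
print(f"benign images per malignant image: {imbalance_ratio:.1f}")
```

This gives 59,496 benign and 6,172 malignant images, or roughly 9.6 benign images for every malignant one.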


## 8. Splitting the Train images
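We split the training images into a training set (80%) and a validation set (20%), stratified by label so that both sets keep the same benign-to-malignant ratio.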

In [6]:
# Split training data into train and validation sets
train_split_df, val_split_df = train_test_split(
    train_df, test_size=0.2, random_state=42, stratify=train_df["label"]
)
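
To illustrate what the `stratify` argument achieves, the sketch below splits each class separately so that both parts keep the original class proportions. It is a from-scratch illustration with made-up data, not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(items, labels, test_fraction, seed):
    """Split items so each label sends about test_fraction of its members to the test part."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        members = by_label[label][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test += [(m, label) for m in members[:n_test]]
        train += [(m, label) for m in members[n_test:]]
    return train, test

# Toy data: 90 benign and 10 malignant samples.
labels = ["benign"] * 90 + ["malignant"] * 10
items = list(range(100))
train_part, val_part = stratified_split(items, labels, test_fraction=0.2, seed=42)

train_counts = Counter(label for _, label in train_part)
val_counts = Counter(label for _, label in val_part)
print(dict(train_counts), dict(val_counts))
```

Both parts keep the 9:1 class ratio of the toy data, which is the property a stratified split is meant to preserve for a rare class.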

## 9. Move Images
We move the images into the `processed_train` and `processed_val` directories according to the split.

In [8]:
# Move training images to processed_train directory
for _, row in train_split_df.iterrows():
    label = row["label"]
    src_path = row["filename"]
    dest_dir = os.path.join(PROCESSED_TRAIN_DIR, label)
    if os.path.exists(src_path):  # Ensure the file exists before moving
        shutil.move(src_path, os.path.join(dest_dir, os.path.basename(src_path)))

# Move validation images to processed_val directory
for _, row in val_split_df.iterrows():
    label = row["label"]
    src_path = row["filename"]
    dest_dir = os.path.join(PROCESSED_VAL_DIR, label)
    if os.path.exists(src_path):  # Ensure the file exists before moving
        shutil.move(src_path, os.path.join(dest_dir, os.path.basename(src_path)))
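
The move loops above skip missing files silently, so counting the files that actually arrived in each class folder is a useful check. A minimal sketch follows. `count_images` is a hypothetical helper, and the demo builds a small temporary folder instead of reading `PROCESSED_TRAIN_DIR`.

```python
import os
import tempfile

def count_images(split_dir, classes=("benign", "malignant"), ext=".jpg"):
    """Count files ending in ext inside each class subfolder of split_dir."""
    counts = {}
    for cls in classes:
        class_dir = os.path.join(split_dir, cls)
        names = os.listdir(class_dir) if os.path.isdir(class_dir) else []
        counts[cls] = sum(name.lower().endswith(ext) for name in names)
    return counts

with tempfile.TemporaryDirectory() as split_dir:
    os.makedirs(os.path.join(split_dir, "benign"))
    os.makedirs(os.path.join(split_dir, "malignant"))
    for name in ["a.jpg", "b.jpg"]:
        open(os.path.join(split_dir, "benign", name), "w").close()
    open(os.path.join(split_dir, "malignant", "c.jpg"), "w").close()

    demo_counts = count_images(split_dir)

print(demo_counts)
```

Comparing these counts with `train_split_df["label"].value_counts()` and `val_split_df["label"].value_counts()` would show whether any rows failed to move.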

## 10. Downsample benign class
The benign class far outnumbers the malignant class in every directory, so we downsample the benign class in the training, validation, and test sets.

In [9]:
import random

#define benign class directory of each set
BENIGN_TRAIN_DIR=os.path.join(PROCESSED_TRAIN_DIR, "benign")
BENIGN_VAL_DIR=os.path.join(PROCESSED_VAL_DIR, "benign")
BENIGN_TEST_DIR=os.path.join(PROCESSED_TEST_DIR, "benign")

#Collect image filenames
benign_train=glob.glob(os.path.join(BENIGN_TRAIN_DIR, "*.jpg"))
benign_val=glob.glob(os.path.join(BENIGN_VAL_DIR, "*.jpg"))
benign_test=glob.glob(os.path.join(BENIGN_TEST_DIR, "*.jpg"))

#declare number of images to remove
train_x=32000
val_x=6500
test_x=10500

#randomly select the specified number of files in each directory
del_train=random.sample(benign_train, train_x)
del_val=random.sample(benign_val, val_x)
del_test=random.sample(benign_test, test_x)

#downsampling training data
for fname in del_train:
    try:
        os.remove(fname)
    except OSError:
        print(f"{fname} already missing")
print("Downsampling of training data is complete.")

#downsampling validation data
for fname in del_val:
    try:
        os.remove(fname)
    except OSError:
        print(f"{fname} already missing")
print("Downsampling of validation data is complete.")

#downsampling test data
for fname in del_test:
    try:
        os.remove(fname)
    except OSError:
        print(f"{fname} already missing")
print("Downsampling of test data is complete.")

Downsampling of training data is complete.
G:\OneDrive\ML-MinorProject\ISIC Dataset\processed_train\benign\ISIC_2050288.jpg already missing
G:\OneDrive\ML-MinorProject\ISIC Dataset\processed_train\benign\ISIC_8843901.jpg already missing
G:\OneDrive\ML-MinorProject\ISIC Dataset\processed_train\benign\ISIC_0061652.jpg already missing
G:\OneDrive\ML-MinorProject\ISIC Dataset\processed_train\benign\ISIC_0068642.jpg already missing
G:\OneDrive\ML-MinorProject\ISIC Dataset\processed_train\benign\ISIC_4370988.jpg already missing
G:\OneDrive\ML-MinorProject\ISIC Dataset\processed_train\benign\ISIC_6069595.jpg already missing
G:\OneDrive\ML-MinorProject\ISIC Dataset\processed_train\benign\ISIC_5814641.jpg already missing
G:\OneDrive\ML-MinorProject\ISIC Dataset\processed_train\benign\ISIC_3397468.jpg already missing
G:\OneDrive\ML-MinorProject\ISIC Dataset\processed_train\benign\ISIC_9768834.jpg already missing
G:\OneDrive\ML-MinorProject\ISIC Dataset\processed_train\benign\ISIC_0032237.jpg alr



Downsampling of test data is complete.
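
The removal counts above are hard-coded, and `random.sample` raises a `ValueError` if a count exceeds the number of files available. One alternative is to derive the count from a target class ratio and to seed the random generator so the selection is reproducible. The sketch below illustrates this. `pick_files_to_remove`, the `max_ratio` of 2.0, and the file names are assumptions for illustration, not values used in this notebook.

```python
import random

def pick_files_to_remove(benign_files, n_malignant, max_ratio, seed):
    """Choose benign files to drop so that at most max_ratio * n_malignant remain."""
    n_keep = min(len(benign_files), int(max_ratio * n_malignant))
    n_remove = len(benign_files) - n_keep
    rng = random.Random(seed)  # seeded so the same files are picked on every run
    return rng.sample(sorted(benign_files), n_remove)

# Toy data: 100 benign files against 10 malignant ones.
benign_files = [f"benign_{i}.jpg" for i in range(100)]
to_remove = pick_files_to_remove(benign_files, n_malignant=10, max_ratio=2.0, seed=42)
remaining = len(benign_files) - len(to_remove)

print(len(to_remove), remaining)
```

Because the count is capped by the number of files present, the selection cannot request more files than exist, and re-running it on an already downsampled folder selects nothing further.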


## Conclusion

The data preprocessing is complete, and the images are ready for training the models.