# Data Preprocessing

To recap, in the EDA step, we found that:

1. The dataset contains a total of 10015 images and all of them are referenced in the metadata file.
2. There are 7 diagnostic categories and by far the most common is "melanocytic nevi" (nv), which means that the dataset is highly imbalanced.
3. Some lesions are represented by more than one image (same lesion_id, but different image_ids). The maximum number of images per lesion is 6 and the average is about 1, so most lesions are represented by a single image.
4. The diagnostic categories are generally distinguishable in the images, but in some cases images belonging to different categories look very similar.
5. In some images there are hairs covering the lesion.
6. All the images in the dataset have the same dimension, which is more than 600x600.
7. The pixel values are between 0 and 255.

Given this information, this notebook has 2 objectives:

1. Resize all the images to a dimension suitable for model training.
2. Split the dataset (both the metadata and the images) into training, validation and test sets.

The images in which hairs cover the lesion area might also need further preprocessing. Since this would add complexity, we leave it for a later update and, for now, try to mitigate the issue through data augmentation.

### Image Resizing

Let's start with the first point. For now, we want to resize the images to 224x224.

In [6]:
import PIL.Image as Image
from pathlib import Path

input_folders = [
    Path("../data/raw/HAM10000_images_part_1"),
    Path("../data/raw/HAM10000_images_part_2")
]

output_folder = Path("../data/processed")

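# Make sure the output folder exists before saving
output_folder.mkdir(parents=True, exist_ok=True)

for folder in input_folders:
    for img_path in folder.glob("*.jpg"):
        with Image.open(img_path) as img:
            # resize() returns a new image; it does not modify img in place
            resized = img.resize((224, 224), resample=Image.Resampling.LANCZOS)
        resized.save(output_folder / img_path.name)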

print(f"All images resized to 224x224 and saved to {output_folder}")

All images resized to 224x224 and saved to ..\data\processed
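
Since `resize` in Pillow returns a new image instead of modifying the original in place, it is worth checking that the saved files really have the target size. The following is a minimal sketch of such a check: the helper `find_wrong_size` is a hypothetical name, and the demonstration uses a temporary folder with two blank toy images of an arbitrary size (640x480), not the real dataset.

```python
import tempfile
from pathlib import Path

from PIL import Image


def find_wrong_size(folder, expected=(224, 224)):
    """Return the names of the .jpg files in `folder` whose size differs from `expected`."""
    wrong = []
    for img_path in sorted(Path(folder).glob("*.jpg")):
        with Image.open(img_path) as img:
            if img.size != expected:
                wrong.append(img_path.name)
    return wrong


# Toy demonstration in a temporary folder (hypothetical data, not the HAM10000 images)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # One image saved after resizing, one saved without resizing
    Image.new("RGB", (640, 480)).resize((224, 224)).save(tmp / "resized.jpg")
    Image.new("RGB", (640, 480)).save(tmp / "not_resized.jpg")
    wrong_files = find_wrong_size(tmp)

print(wrong_files)
```

In the notebook, the same helper could be called as `find_wrong_size(output_folder)`; an empty list would mean that every processed image is 224x224.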


### Dataset Splitting

At this point we need to split the dataset into training, validation and test sets. Based on the information obtained from the EDA step, we perform a stratified split at the lesion level such that:

1. All the images representing the same lesion (same lesion_id) are put inside the same subset.
2. Each subset contains approximately the same percentage of images belonging to each diagnostic category.

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the metadata file
metadata_path = Path('../data/raw/HAM10000_metadata.csv')
df = pd.read_csv(metadata_path)

# Keep one row per lesion (its lesion_id and diagnosis): the split is done on lesions, not on images
lesions = df.groupby('lesion_id').first().reset_index()[['lesion_id', 'dx']]

# Split lesions 70/30, stratified by diagnosis (dx)
train_lesions, temp_lesions = train_test_split(
    lesions,
    test_size=0.3,
    stratify=lesions['dx'],
    random_state=42
)
# Split the remaining 30% of lesions evenly into validation and test (15% each)
val_lesions, test_lesions = train_test_split(
    temp_lesions,
    test_size=0.5,
    stratify=temp_lesions['dx'],
    random_state=42
)

# Assign all images to their lesion's split
train_df = df[df['lesion_id'].isin(train_lesions['lesion_id'])]
val_df = df[df['lesion_id'].isin(val_lesions['lesion_id'])]
test_df = df[df['lesion_id'].isin(test_lesions['lesion_id'])]

# Save the splits (create the output folder if it does not exist yet)
splits_dir = Path('../data/splits')
splits_dir.mkdir(parents=True, exist_ok=True)
columns_to_save = ['image_id', 'lesion_id', 'dx']
train_df[columns_to_save].to_csv(splits_dir / 'train.csv', index=False)
val_df[columns_to_save].to_csv(splits_dir / 'val.csv', index=False)
test_df[columns_to_save].to_csv(splits_dir / 'test.csv', index=False)

print("Dataset split into training, validation and test sets.")
print(f"Train: {len(train_df)} images, Val: {len(val_df)}, Test: {len(test_df)}")

Dataset split into training, validation and test sets.
Train: 6981 images, Val: 1532, Test: 1502
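
The two requirements of the split (no lesion shared between subsets, similar class proportions in each subset) can be verified directly. The sketch below is one possible way to do it: `check_split` is a hypothetical helper, and the toy DataFrames only mimic the `lesion_id` and `dx` columns of the metadata, with one lesion leaked on purpose.

```python
import pandas as pd


def check_split(train_df, val_df, test_df):
    """Return (lesion_ids found in more than one subset, class proportions per subset)."""
    splits = {"train": train_df, "val": val_df, "test": test_df}
    ids = {name: set(part["lesion_id"]) for name, part in splits.items()}
    # Any lesion_id shared by two subsets is a leak
    leaked = (
        (ids["train"] & ids["val"])
        | (ids["train"] & ids["test"])
        | (ids["val"] & ids["test"])
    )
    # One column per subset, one row per diagnostic category
    proportions = pd.DataFrame(
        {name: part["dx"].value_counts(normalize=True) for name, part in splits.items()}
    ).fillna(0.0)
    return leaked, proportions


# Toy data (hypothetical): lesion L1 appears in both train and test on purpose
toy_train = pd.DataFrame(
    {"lesion_id": ["L1", "L1", "L2", "L3"], "dx": ["nv", "nv", "mel", "nv"]}
)
toy_val = pd.DataFrame({"lesion_id": ["L4", "L5"], "dx": ["nv", "mel"]})
toy_test = pd.DataFrame({"lesion_id": ["L6", "L1"], "dx": ["mel", "nv"]})

leaked, proportions = check_split(toy_train, toy_val, toy_test)
print("Leaked lesions:", leaked)
print(proportions)
```

On the real split, `check_split(train_df, val_df, test_df)` should return an empty set of leaked lesions. Because the stratification is applied to lesions rather than images, the per-image proportions are expected to be close across subsets but not identical.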


Finally, to deal with the class imbalance, we compute class weights.
In this way, we ensure that rare classes get higher weights while common classes (like "nv") get lower weights. These weights will then be used during training.
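
To make the weighting concrete: with `class_weight='balanced'`, scikit-learn assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on a toy label list; the function name `balanced_class_weights` and the 8-versus-2 label counts are illustrative assumptions, not values from the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (hypothetical): 8 common "nv" samples and 2 rare "mel" samples
toy_labels = ["nv"] * 8 + ["mel"] * 2
weights = balanced_class_weights(toy_labels)
print(weights)  # {'nv': 0.625, 'mel': 2.5}
```

The rare class ends up with a weight above 1 and the common class with a weight below 1, which is the behaviour we want for "nv" and the rarer categories.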

In [13]:
import json
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# Compute class weights from training set
classes = train_df['dx'].unique()
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=classes,
    y=train_df['dx']
)

# Convert to dict for easy lookup
class_weights_dict = {cls: weight for cls, weight in zip(classes, class_weights)}

# Save to JSON
with open("../data/splits/class_weights.json", "w") as f:
    json.dump(class_weights_dict, f, indent=2)

print("Class weights:", class_weights_dict)

Class weights: {'bkl': np.float64(1.291820873427091), 'nv': np.float64(0.2129587260913334), 'df': np.float64(14.046277665995976), 'mel': np.float64(1.290149695065607), 'vasc': np.float64(10.073593073593074), 'bcc': np.float64(2.7625643055005935), 'akiec': np.float64(4.492277992277992)}
