# **Preprocessing Guide for Datasets**

In this file, we explain how the SIIM-ISIC and COCO-Stuff datasets were trimmed down. The original datasets are large, which made them difficult to store, so below we provide the scripts used to trim both of them.

As mentioned in the other files, the trimmed-down datasets have been uploaded to Hugging Face and can be downloaded via the prepared scripts (found [here](../scripts/download_cocostuff) and [here](../scripts/download_siim)).

Before starting, we import the required dependencies:

In [None]:
import os
import json
import numpy as np
import pandas as pd
import shutil

## **COCO-Stuff**

Because the original dataset was set up for multi-object classification, the processing required is somewhat involved. In overview:

1. For each target label, select the images that contain it.
2. Group the labels by the image that contains them and export the result as a JSON dictionary file.
3. Collect the unique image paths, since an image can carry multiple class labels and may therefore be selected more than once.
4. Save all the unique images in a new folder.

This should reproduce the dataset in the link, assuming all the files are in place.
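
Steps 1 to 3 above can be illustrated on a toy example before running them on the full annotation files. This is only a minimal sketch: the image ids, filenames, and the two target labels are made up for illustration, and it uses plain dictionaries rather than the helper functions defined in this notebook.

```python
# Toy COCO-style annotation dict; the ids and filenames are made up for illustration.
toy_annotations = {
    "images": [
        {"id": 1, "file_name": "a.jpg"},
        {"id": 2, "file_name": "b.jpg"},
        {"id": 3, "file_name": "c.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 3},
        {"image_id": 1, "category_id": 6},
        {"image_id": 1, "category_id": 3},  # repeated label on the same image
        {"image_id": 2, "category_id": 6},
        {"image_id": 3, "category_id": 31},
    ],
}

# Group the class ids by image filename, keeping each label only once (step 2).
id_to_name = {img["id"]: img["file_name"] for img in toy_annotations["images"]}
labels_per_image = {}
for ann in toy_annotations["annotations"]:
    labels = labels_per_image.setdefault(id_to_name[ann["image_id"]], [])
    if ann["category_id"] not in labels:
        labels.append(ann["category_id"])

# Select the images containing each target label (step 1).
targets = [3, 6]
selected = []
for target in targets:
    selected.extend(name for name, labels in labels_per_image.items() if target in labels)

# Deduplicate, since a.jpg is selected for both targets (step 3).
unique_images = sorted(set(selected))

print(labels_per_image)
print(unique_images)
```

Here `a.jpg` is picked up once per target label, which is why the deduplication step is needed before copying any files.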

In [None]:
# here we first define helper functions to ease the processing
def idx2img(json_file, image_dir):
    """
    Arguments:
    json_file: the loaded dict originating from the annotation JSON file
    image_dir: the directory containing all the images (either `train2017` or `val2017`)
    
    Outputs:
    img_dict: dictionary containing mappings of '{<image index>: image_dir + <image filename>}'
    """
    img_dict = {}
    for data in json_file['images']:
        # map the numeric image id to the image path inside `image_dir`
        img_dict[data['id']] = os.path.join(image_dir, data['file_name'])

    return img_dict


def clean_data(annot_dir, image_dir):
    """
    Arguments:
    annot_dir: the annotation JSON file path (from `annotations_trainval2017.zip` as the authors here 
               appear to focus more on `things` annotations)
    image_dir: the directory containing all the images (either `train2017` or `val2017`)
    
    Outputs:
    img_idx_dict: dictionary containing mappings of '{image_dir + <image filename>: <class index list>}'
    """
    # load the annotation file, closing the handle once it has been parsed
    with open(annot_dir) as f:
        data = json.load(f)
    img_dict = idx2img(data, image_dir)

    # maps each image path to the list of class ids annotated in that image
    img_idx_dict = {}

    # extract the data
    for segment in data['annotations']:
        class_id = segment['category_id']
        image_path = img_dict[segment['image_id']]
        # create the label list the first time an image is seen
        labels = img_idx_dict.setdefault(image_path, [])
        if class_id not in labels:
            labels.append(class_id)

    return img_idx_dict

# Note: this function is similar to the one used in `coco_stuff.py`
def load_coco_data(annot_dict, target:int, n_samples:int=500,
                   seed:int=42):
    """
    Arguments:
    annot_dict: the loaded dict originating from the annotation JSON file
    target: the target class index for the dataset (see `./data/coco_target_indexes.txt` for reference)
    n_samples: the number of samples to take (ideally an even number)
    seed: the random seed which determines the samples taken
    
    Outputs:
    datalist: list of filenames to take
    """
    # set the random seed for numpy
    np.random.seed(seed)

    # even split between positive and negative samples
    npc = n_samples // 2

    # select images based on whether or not the image has a segmentation of the target object
    pos_data = []
    neg_data = []
    for imgc in annot_dict.keys():
        if target in annot_dict[imgc]:
            pos_data.append(imgc)

        else:
            neg_data.append(imgc)

    # sample with replacement if too little data is present
    replace_pos = len(pos_data) < npc
    replace_neg = len(neg_data) < npc
    pos_sample = np.random.choice(pos_data, size=npc, replace=replace_pos)
    neg_sample = np.random.choice(neg_data, size=npc, replace=replace_neg)

    datalist = list(np.concatenate((pos_sample, neg_sample), axis=None))

    return datalist

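# this function is just in case there are discrepancies between the filenames present
def sync_data(json_dict:dict, filelist:list):
    """
    Arguments:
    json_dict: dictionary mapping image paths to class index lists (as returned by `clean_data`)
    filelist: list of image paths to keep

    Outputs:
    final_dict: the entries of `json_dict` whose keys appear in `filelist`
    """
    final_dict = {}
    for file in filelist:
        if file in json_dict:
            final_dict[file] = json_dict[file]

    return final_dict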
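
The sampling rule in `load_coco_data` (half positive, half negative, with replacement only when a class has too few images) can be tried out on a toy dictionary. The sketch below is a hypothetical stand-in, not the function used in this notebook: `balanced_sample` is a made-up name, and it relies on the standard-library `random` module instead of `np.random.choice` so that it runs without NumPy. The images it draws will therefore not match those drawn by the notebook's function.

```python
import random

def balanced_sample(annot_dict, target, n_samples=4, seed=42):
    """Hypothetical stand-in for `load_coco_data`, built on the stdlib `random` module."""
    rng = random.Random(seed)
    npc = n_samples // 2  # samples per class (positive / negative)

    pos = [img for img, labels in annot_dict.items() if target in labels]
    neg = [img for img, labels in annot_dict.items() if target not in labels]

    def draw(pool):
        if len(pool) < npc:
            # too few images: sample with replacement, so duplicates are expected
            return rng.choices(pool, k=npc)
        # enough images: sample without replacement
        return rng.sample(pool, npc)

    return draw(pos) + draw(neg)

# Toy label dictionary; the filenames are made up for illustration.
toy = {"a.jpg": [3, 6], "b.jpg": [6], "c.jpg": [31], "d.jpg": [31]}
sample = balanced_sample(toy, target=3, n_samples=4)
print(sample)
```

Only `a.jpg` contains label 3, so it is drawn twice on the positive side. This is also why the number of unique files left after deduplication can be smaller than the number of samples requested.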

In [None]:
# this is separated because loading the JSON files takes a bit of time
train_json = clean_data("annotations/instances_train2017.json", "train2017")
test_json = clean_data("annotations/instances_val2017.json", "val2017")

In [None]:
# now we take all the needed files for training and test (from each folder)
train_files = []
test_files = []

targets = [3, 6, 31, 35, 36, 37, 40, 41, 43, 46, 47, 50, 53, 64, 75, 76, 78, 80, 85, 89]

for target in targets:
    data = load_coco_data(train_json, target)
    train_files.extend(data)

for target in targets:
    data = load_coco_data(test_json, target, 250)
    test_files.extend(data)

In [None]:
# here we remove the duplicates
train_files = list(set(train_files))
test_files = list(set(test_files))

# and add an extra step as a guarantee that all files are the same between objects
final_train = sync_data(train_json, train_files)
final_test = sync_data(test_json, test_files)

# ...and then export!
os.makedirs("trimmed", exist_ok=True)  # make sure the output folder exists

with open("trimmed/labels_train.json", "w") as out_tr:
    json.dump(final_train, out_tr, indent=2)

with open("trimmed/labels_val.json", "w") as out_ts:
    json.dump(final_test, out_ts, indent=2)

In [None]:
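# this is for copying the files over to a new folder
for filename in train_files:
    dest = os.path.join('trimmed', filename)
    # create `trimmed/train2017` if it does not exist yet
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copyfile(filename, dest)

for filename in test_files:
    dest = os.path.join('trimmed', filename)
    # create `trimmed/val2017` if it does not exist yet
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copyfile(filename, dest)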
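
After copying, it can be worth checking that every image listed in an exported label file actually exists in the trimmed folder. The following is a minimal sketch of such a check, run here against a throwaway directory. The helper name `find_missing_files` and the toy filenames are assumptions made for illustration and are not part of the original scripts.

```python
import json
import os
import tempfile

def find_missing_files(labels_path, root):
    """Return the keys of the label JSON that have no matching file under `root`."""
    with open(labels_path) as f:
        labels = json.load(f)
    return [name for name in labels if not os.path.isfile(os.path.join(root, name))]

# Build a throwaway folder mimicking `trimmed/`; the filenames are made up.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "train2017"))
    open(os.path.join(root, "train2017", "a.jpg"), "w").close()

    labels_path = os.path.join(root, "labels_train.json")
    with open(labels_path, "w") as f:
        json.dump({"train2017/a.jpg": [3], "train2017/b.jpg": [6]}, f)

    missing = find_missing_files(labels_path, root)

print(missing)
```

On the real data, the equivalent call would be `find_missing_files("trimmed/labels_train.json", "trimmed")`, and an empty list would indicate that the labels and the copied images are in sync.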

## **SIIM-ISIC**

For this dataset, the process mainly involved selecting a subset of the original images (2000 benign and 500 malignant cases, taken as fixed row ranges in the code below) and ensuring that the label split is correct.

In [None]:
df_dir = "./artifacts/data/SIIM_ISIC/metadata.csv" # path to the original metadata file for SIIM-ISIC
df = pd.read_csv(df_dir)
df = df.sort_values(by=['benign_malignant'])

df_ben = df.loc[df['benign_malignant'] == 'benign'][2000:4000]
df_mal = df.loc[df['benign_malignant'] == 'malignant'][50:550]
full_df = pd.concat([df_ben, df_mal])

target_dir = "data"

full_df.to_csv("metadata_small.csv", index=False) # here we have a different filename to prevent overwriting
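
To confirm that the label split came out as intended, the class counts in the exported metadata can be tallied. The sketch below does this with the standard-library `csv` module rather than pandas, purely to stay dependency-free. The helper name `label_counts`, the `image_name` column, and the toy rows are assumptions for illustration; only the `benign_malignant` column name is taken from the code above.

```python
import csv
import io
from collections import Counter

def label_counts(csv_file, column="benign_malignant"):
    """Count how often each label occurs in the given column of a CSV file object."""
    return Counter(row[column] for row in csv.DictReader(csv_file))

# Toy stand-in for `metadata_small.csv`; the `image_name` column and its values are made up.
toy_csv = io.StringIO(
    "image_name,benign_malignant\n"
    "img_0,benign\n"
    "img_1,benign\n"
    "img_2,malignant\n"
)
counts = label_counts(toy_csv)
print(counts)
```

On the real file, one would pass `open("metadata_small.csv", newline="")` instead of the toy buffer. Given the slices used above, the counts should come out as 2000 benign and 500 malignant, provided the original metadata has enough rows of each class.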