## Formatting the SCIN dataset for DISCO

This notebook shows how to create a sample dataset ready to be used with DISCO. The resulting sample dataset is available [here](https://storage.googleapis.com/deai-313515.appspot.com/scin_sample.zip). The full SCIN dataset is available at https://github.com/google-research-datasets/scin; downloading it requires the Google Cloud CLI.

In [None]:
# Install the gcloud CLI: https://cloud.google.com/sdk/docs/install-sdk

# Init the gcloud CLI
!gcloud init

In [None]:
# Install the dependencies
%pip install pandas tqdm

In [14]:
import os
import shutil
import pandas as pd
from tqdm import tqdm


### Option 1: Use the Google Storage API

We only use a subset of the dataset, so rather than downloading the whole dataset locally, we can use the Google Cloud Storage API to select the subset and download only those images. Be aware that downloading images through the Python package is much slower: downloading 1.5k images takes ~20 min, while downloading 10k images with the `gsutil` command takes 5 min.

In [None]:
!gcloud auth application-default login

In [None]:
%pip install google-cloud-storage

In [2]:
# here we will NOT use the Storage API but instead download the dataset locally
USE_STORAGE_API = False 

In [3]:
if USE_STORAGE_API:
    from google.cloud import storage
    # Google Cloud constants
    gcs_storage_client = storage.Client('dx-scin-public')
    # GCS bucket with data to read
    gcs_bucket = gcs_storage_client.bucket('dx-scin-public-data')

### Option 2: Download the dataset locally

The dataset is about 11 GiB, but downloading it with `gsutil` is much faster than going through the Python storage API.

In [101]:
# Download the dataset from Cloud Storage
# This command downloads the dataset into a local `dataset` folder

!gsutil -m cp -r "gs://dx-scin-public-data/dataset"  .

In [5]:
cases_csv = './dataset/scin_cases.csv' # case metadata
labels_csv = './dataset/scin_labels.csv' # label metadata
images_dir = './dataset/images/'

if USE_STORAGE_API:
    import io
    # replace the paths with the bucket paths (and remove the ./ prefixes)
    # download_as_bytes replaces the deprecated download_as_string alias
    cases_csv = io.BytesIO(gcs_bucket.blob(cases_csv[2:]).download_as_bytes())
    labels_csv = io.BytesIO(gcs_bucket.blob(labels_csv[2:]).download_as_bytes())

#### Get the image-label mapping

In [21]:
# Read the csv
cases_df = pd.read_csv(cases_csv, dtype={'case_id': str})
labels_df = pd.read_csv(labels_csv, dtype={'case_id': str})

# For the sake of simplicity we only keep one image per case
df = pd.merge(cases_df, labels_df, on='case_id')[['image_1_path', 'weighted_skin_condition_label']]
df.columns = ['filename', 'label']
df

Unnamed: 0,filename,label
0,dataset/images/-3205742176803893704.png,"{'Inflicted skin lesions': 0.41, 'Eczema': 0.4..."
1,dataset/images/-4762289084741430925.png,"{'Prurigo nodularis': 0.41, 'SCC/SCCIS': 0.41,..."
2,dataset/images/-4027806997035329030.png,"{'Impetigo': 0.55, 'Herpes Zoster': 0.23, 'Bul..."
3,dataset/images/-5332065579713135540.png,{}
4,dataset/images/-3799298995660217860.png,"{'Lichen planus/lichenoid eruption': 0.33, 'Fo..."
...,...,...
5028,dataset/images/32575980331712012.png,"{'CD - Contact dermatitis': 0.33, 'Allergic Co..."
5029,dataset/images/-5315065439551573643.png,{}
5030,dataset/images/-4723634841049886674.png,"{'Impetigo': 0.5, 'Foreign body': 0.5}"
5031,dataset/images/-3758258982362095839.png,"{'Erythema gyratum repens': 0.33, 'Seborrheic ..."


### Preprocess the data

In [22]:
# DISCO currently doesn't support multi-label classification (i.e. a variable number of labels per image),
# so we only keep one label per image
import ast

# Keep the label with the greatest weight
def getFirstLabel(label: str):
    if label.startswith('{'):
        # literal_eval only parses Python literals, unlike eval which can run arbitrary code
        label_dict = ast.literal_eval(label)
        return max(label_dict, key=label_dict.get)
    else:
        return label

# Filter out empty labels; copy so the assignment below doesn't trigger SettingWithCopyWarning
df = df.query("label != '{}'").copy()
df['label'] = df['label'].apply(getFirstLabel)
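
To make the label selection concrete, here is a minimal standalone sketch of how a weighted label string can be reduced to a single label. The helper name `top_label` is hypothetical, and the first example string is a shortened illustration in the same shape as the `weighted_skin_condition_label` values shown above; the other two shapes appear in the table as is.

```python
import ast
from typing import Optional


def top_label(raw: str) -> Optional[str]:
    """Return the highest-weighted condition, or None for an empty dict."""
    weights = ast.literal_eval(raw)  # parses Python literals only, never runs code
    if not weights:
        return None
    # max() returns the first key among ties, following dict insertion order
    return max(weights, key=weights.get)


examples = [
    "{'Impetigo': 0.55, 'Herpes Zoster': 0.23}",  # shortened, illustrative
    "{'Impetigo': 0.5, 'Foreign body': 0.5}",     # a tie
    "{}",                                         # no label at all
]
for raw in examples:
    print(raw, '->', top_label(raw))
```

Note that a tie resolves to whichever condition is listed first in the string, so the choice between equally weighted labels is arbitrary rather than clinically meaningful.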

In [23]:
df.label.value_counts()

label
Eczema                         488
Allergic Contact Dermatitis    270
Urticaria                      214
Insect Bite                    185
Folliculitis                   142
                              ... 
Acne keloidalis                  1
Eruptive xanthoma                1
Localized skin infection         1
Keratolysis exfoliativa          1
Chicken pox exanthem             1
Name: count, Length: 211, dtype: int64

In [24]:
# We will only keep the most frequent labels for the classification task
label_count = df.label.value_counts()
label_subset = label_count[label_count > 200]
label_subset = list(label_subset.index)
label_subset

['Eczema', 'Allergic Contact Dermatitis', 'Urticaria']

In [25]:
# Only keep images with labels in the subset
df = df.query('label in @label_subset')
df

Unnamed: 0,filename,label
6,dataset/images/-6942912841265248602.png,Urticaria
8,dataset/images/-217828380359571871.png,Eczema
10,dataset/images/-3712452163219577722.png,Eczema
11,dataset/images/-1677898261371801194.png,Urticaria
21,dataset/images/7624703560142571231.png,Urticaria
...,...,...
5006,dataset/images/-2443949714670739112.png,Urticaria
5007,dataset/images/-1147438805108165437.png,Urticaria
5014,dataset/images/-683645753310790631.png,Allergic Contact Dermatitis
5021,dataset/images/-4308398826912860778.png,Eczema


### Create the sample dataset

DISCO offers two ways of connecting data:
1. Selecting images class by class
2. Selecting a CSV that maps image filenames to labels, then connecting all the images at once. The CSV header must contain exactly the columns `filename` and `label`, and the image filenames must not include file extensions.

Here we will go with the second option, as it is more practical as the number of categories grows. This notebook can easily be adapted to save and split the data into different folders according to label.
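
For the first option (class by class), the images could instead be copied into one folder per label. The following is a minimal sketch under the assumption that the images are already available locally and come as `(image_path, label)` pairs; the helper name `copy_into_label_folders` and the folder layout are hypothetical, not something DISCO prescribes.

```python
import os
import shutil
from typing import Iterable, Tuple


def copy_into_label_folders(rows: Iterable[Tuple[str, str]], output_root: str) -> None:
    """Copy each (image_path, label) pair into output_root/<label>/."""
    for image_path, label in rows:
        # Labels are used verbatim as folder names (spaces included)
        label_dir = os.path.join(output_root, label)
        os.makedirs(label_dir, exist_ok=True)
        shutil.copy(image_path, label_dir)
```

With the dataframe used in this notebook, the pairs could be built with `zip(df['filename'], df['label'])`, since each `filename` is a path relative to the working directory.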

In [30]:
N_SAMPLES = 400
sample_df = df.sample(N_SAMPLES)

In [33]:
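output_folder = './sample/'
output_image_folder = os.path.join(output_folder, 'images')
# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs(output_image_folder, exist_ok=True)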

sample_labels = []
for i, row in tqdm(sample_df.iterrows(), total=sample_df.shape[0]):
    image_path = row['filename']
    new_image_name = image_path.split('/')[-1]
    image_output_path = os.path.join(output_image_folder, new_image_name)
    image_without_ext = os.path.splitext(new_image_name)[0]  # drop the extension, whatever its length
    sample_labels.append([image_without_ext, row['label']])

    if USE_STORAGE_API:
        gcs_bucket.blob(image_path).download_to_filename(image_output_path)
    else:
        shutil.copy('./' + image_path, image_output_path)

100%|██████████| 400/400 [00:00<00:00, 499.72it/s]


In [34]:
local_mapping = pd.DataFrame(sample_labels, columns=['filename', 'label'])
local_mapping

Unnamed: 0,filename,label
0,-2850393001491389427,Urticaria
1,-6272108417182441416,Allergic Contact Dermatitis
2,-1495727048511165464,Eczema
3,-8297415963972830482,Urticaria
4,-4982141201108309967,Urticaria
...,...,...
395,-729778047815019786,Eczema
396,-6438545414301250763,Eczema
397,5717632452093413738,Eczema
398,123261771205042552,Eczema


In [35]:
# DISCO expects a CSV file with a `filename,label` header and filenames without extensions
local_mapping.to_csv(output_folder + 'labels.csv', index=False, header=True)
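
Before connecting the data to DISCO, it can help to check that the CSV and the image folder agree. The sketch below is one possible sanity check, not part of DISCO; `find_mismatches` is a hypothetical helper that assumes the CSV has a `filename` column holding names without extensions.

```python
import csv
import os


def find_mismatches(labels_csv: str, image_folder: str):
    """Compare CSV filenames (no extension) with the image files on disk."""
    with open(labels_csv, newline='') as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        listed = {row['filename'] for row in reader}
    # Strip extensions so disk names are comparable to the CSV entries
    on_disk = {os.path.splitext(name)[0] for name in os.listdir(image_folder)}
    # Returns: header, names in the CSV without an image, images absent from the CSV
    return header, listed - on_disk, on_disk - listed
```

Calling `find_mismatches('./sample/labels.csv', './sample/images')` on the output of this notebook should return the header `['filename', 'label']` and two empty sets if every image was copied and listed.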