### Creating a CSV file to connect data to DISCO

DISCO offers two ways to connect data:
1. Selecting files by label (the "GROUP" button), where you drag and drop the files of each class
2. Connecting a CSV which specifies each file path and the corresponding label

The second option is usually more practical if a dataset has many labels. This notebook shows how to create such a CSV, taking as an example a multi-class skin lesion dataset. The HAM10000 dataset is available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T.

The dataset contains multiple files:
- HAM10000_images_part_1.zip
- HAM10000_images_part_2.zip
- HAM10000_metadata
- HAM10000_segmentations_lesion_tschandl.zip
- ISIC2018_Task3_Test_GroundTruth.csv
- ISIC2018_Task3_Test_Images.zip
- ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv

For simplicity, we will only work with `HAM10000_images_part_1`, which contains 5000 images. Unzip `HAM10000_images_part_1.zip` into a folder named `HAM10000_images_part_1` next to this notebook. `HAM10000_metadata` is a CSV file mapping images to labels.
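
If you prefer to script the extraction, the following is a minimal sketch using only the standard library. The helper name `extract_archive` is hypothetical, and the sketch assumes the archive stores the images at its top level, so that they end up directly inside the target folder; check the folder contents after extraction, since the archive layout is not confirmed here.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and list what landed there."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # safe to call again on a re-run
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Return the top-level entries so the layout can be inspected
    return sorted(p.name for p in target.iterdir())
```

For this notebook the call would be `extract_archive('HAM10000_images_part_1.zip', 'HAM10000_images_part_1')`.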

In [None]:
# Running this notebook requires installing pandas:
# !pip install pandas

In [1]:
import os
import shutil
import pandas as pd

In [2]:
labels = pd.read_csv('./HAM10000_metadata')[['image_id', 'dx']]  # dx is the label

In [3]:
labels

Unnamed: 0,image_id,dx
0,ISIC_0027419,bkl
1,ISIC_0025030,bkl
2,ISIC_0026769,bkl
3,ISIC_0025661,bkl
4,ISIC_0031633,bkl
...,...,...
10010,ISIC_0033084,akiec
10011,ISIC_0033550,akiec
10012,ISIC_0033536,akiec
10013,ISIC_0032854,akiec


In [4]:
labels.dx.unique()

array(['bkl', 'nv', 'df', 'mel', 'vasc', 'bcc', 'akiec'], dtype=object)

In [5]:
# Create a dataframe with all image names in the part_1 folder without their file extensions.
# Merging these names with the metadata keeps only the filename-label pairs of images available locally.
images = []
for file in os.listdir('HAM10000_images_part_1'):
    if file.endswith('.jpg'):
        images.append([file[:-4]])

local_images = pd.DataFrame(images, columns=['image_id'])
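
Slicing off the last four characters works because every file here ends in `.jpg`. As a hedged alternative, the sketch below uses `pathlib` so that the extension length and letter case no longer matter. The helper name `list_image_ids` and the accepted extensions are assumptions made for illustration, not part of DISCO or of the dataset.

```python
from pathlib import Path


def list_image_ids(folder, extensions=(".jpg", ".jpeg")):
    """Return the sorted file stems in folder whose suffix matches, ignoring case."""
    return sorted(
        p.stem  # file name without its extension
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
```

The result can be passed to `pd.DataFrame(..., columns=['image_id'])` in place of the `images` list built above.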

In [6]:
# Number of images to sample for the subset of the dataset
N_IMAGES = 300

In [7]:
subset = labels.merge(local_images, on='image_id').sample(N_IMAGES, random_state=42)
assert subset.dx.nunique() == 7 # make sure that we're not leaving out a class
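
The assertion above can fail for a small `N_IMAGES` or a different seed, because a uniform random sample may miss a rare class. One possible alternative, sketched below, samples each class separately. The helper `sample_per_class` is hypothetical and is not what this notebook uses; note that it also changes the class distribution, since every class contributes at most `n_per_class` rows.

```python
import pandas as pd


def sample_per_class(df, label_col, n_per_class, seed=42):
    """Draw up to n_per_class rows from each label so that no class is left out."""
    parts = [
        # Classes with fewer rows than requested are kept in full
        group.sample(n=min(n_per_class, len(group)), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)
```

For example, `sample_per_class(labels.merge(local_images, on='image_id'), 'dx', 40)` would return at most 40 images per diagnosis.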

In [8]:
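# Copy the sampled images into a new folder; the CSV will be saved next to it
output_folder = f'./{N_IMAGES}/'
image_folder = output_folder + 'data'
os.makedirs(image_folder, exist_ok=True)  # do not fail if the notebook is re-run

for image_id in subset['image_id']:
    shutil.copy(f"./HAM10000_images_part_1/{image_id}.jpg", image_folder)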

In [9]:
# DISCO expects a CSV with this exact header and filenames without extensions
subset.columns = ['filename', 'label']
subset.head()

Unnamed: 0,filename,label
1501,ISIC_0028424,nv
2586,ISIC_0028691,nv
2653,ISIC_0028485,nv
1055,ISIC_0025321,vasc
705,ISIC_0027261,mel


In [10]:
# Save the CSV without the indices and with the header
subset.to_csv(output_folder + 'labels.csv', index=False, header=True)
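
Before loading the files into DISCO, it can be worth checking that the CSV and the image folder agree. The following is a minimal sketch using only the standard library. The helper name `find_missing_images` is hypothetical, and the `.jpg` extension and the `filename,label` header are taken from the cells above.

```python
import csv
from pathlib import Path


def find_missing_images(csv_path, image_folder, extension=".jpg"):
    """Return the filenames listed in the CSV that have no matching image file."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        # The header written above is exactly: filename,label
        if reader.fieldnames != ["filename", "label"]:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        return [
            row["filename"]
            for row in reader
            if not (Path(image_folder) / (row["filename"] + extension)).is_file()
        ]
```

With the folders created in this notebook, `find_missing_images('./300/labels.csv', './300/data')` should return an empty list.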

You can now use this sample dataset in DISCO to train models in the Skin Disease task: first select the CSV file, then the corresponding image files.