First, we need to download the data and create the folds on which the model will be trained.
We'll use data from the 2020 competition as well as the 2019 data (which also includes the 2018 and 2017 data).

The first step is to download the Triple-Stratified leak-free KFold 2020 data. The Triple-Stratified fold is explained [here](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/165526), and the JPEG files with the metadata can be downloaded from [here](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/164092). We'll only focus on `384x384` images to begin with. The corresponding files should be downloaded and unzipped in `/data/siim-isic-melanoma/raw/2020/`.

Next, we'll download the data from the 2019 competition from [here](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/164910). Again, we'll only focus on `384x384` images. The corresponding files should be downloaded and unzipped in `/data/siim-isic-melanoma/raw/2019/`.

Your `/data/siim-isic-melanoma/raw` should look like this:

```
raw
├── 2019
│   ├── 384x384
│   │   └── train
│   └── train.csv
└── 2020
    ├── 384x384
    │   ├── test
    │   └── train
    ├── sample_submission.csv
    ├── test.csv
    └── train.csv
```

where the `train` and `test` folders contain the respective images.
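Before going further, it can help to confirm that the download produced this layout. The following is a minimal sketch using only the standard library; the helper name `find_missing_paths` and the `EXPECTED_PATHS` list are illustrative and not part of the codebase.

```python
from pathlib import Path

# Relative paths expected under the raw data root, taken from the tree above.
EXPECTED_PATHS = [
    "2019/384x384/train",
    "2019/train.csv",
    "2020/384x384/test",
    "2020/384x384/train",
    "2020/sample_submission.csv",
    "2020/test.csv",
    "2020/train.csv",
]


def find_missing_paths(raw_root):
    """Return the expected relative paths that do not exist under raw_root."""
    root = Path(raw_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = find_missing_paths("/data/siim-isic-melanoma/raw")
    print("missing:", missing or "nothing")
```

An empty result means every folder and CSV file from the tree is in place; it does not check that the image folders are actually populated.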

We can split the 2019 data into the examples that are new in 2019 and those that were already part of the 2018 and 2017 competitions. This can be done based on the original image sizes, as mentioned under **Description** [here](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/164910): all images of shape `1024x1024` (which belong to 2019 only) are stored in odd-numbered TFRecords, and all others are in even-numbered TFRecords. The TFRecord number of each image is available in the `tfrecord` column of `train.csv` for the 2019 data.
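As a small illustration of this parity rule, the sketch below splits a toy dataframe by the `tfrecord` column, following the convention used by the fold-building code later in this notebook (odd IDs for 2019-only images, even IDs for 2018 + 2017). The rows are made up for illustration. One detail worth guarding against, in case any rows carry `tfrecord=-1`: `-1 % 2` evaluates to `1` in pandas, so such rows would pass the odd-ID test unless they are dropped first.

```python
import pandas as pd

# Toy stand-in for the 2019 train.csv; the values are made up for illustration.
toy_2019 = pd.DataFrame({
    "image_name": ["img_a", "img_b", "img_c", "img_d", "img_e"],
    "tfrecord": [0, 1, 4, 7, -1],
})

# -1 % 2 evaluates to 1 in pandas, so drop any rows marked -1 first.
valid = toy_2019[toy_2019["tfrecord"] >= 0]

only_2019 = valid[valid["tfrecord"] % 2 == 1]        # odd ids: 2019-only images
only_2018_2017 = valid[valid["tfrecord"] % 2 == 0]   # even ids: 2018 + 2017 images

print(only_2019["image_name"].tolist())        # ['img_b', 'img_d']
print(only_2018_2017["image_name"].tolist())   # ['img_a', 'img_c']
```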

For now, we'll build the folds from the 2020 data and add the 2018 + 2017 data to the training split of every fold, while leaving out the 2019-only data.

We will also remove the duplicates, which are marked with `-1` in the `tfrecord` column.

In [59]:
from os.path import join
from os import makedirs
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from coreml.utils.io import save_yml

In [2]:
data_root = '/data/siim-isic-melanoma/raw/'

In [3]:
dir_2019 = join(data_root, '2019')
dir_2020 = join(data_root, '2020')

In [25]:
IMAGE_SIZE = 384

# Load the csv files from 2020 competition

In [5]:
train_df_2020 = pd.read_csv(join(dir_2020, 'train.csv'))
test_df_2020 = pd.read_csv(join(dir_2020, 'test.csv'))

In [6]:
len(train_df_2020), len(test_df_2020)

(33126, 10982)

## Convert test_df_2020 to the format required

In [66]:
test_df_2020.head()

|   | image_name | patient_id | sex | age_approx | anatom_site_general_challenge |
|---|---|---|---|---|---|
| 0 | ISIC_0052060 | IP_3579794 | male | 70.0 | |
| 1 | ISIC_0052349 | IP_7782715 | male | 40.0 | lower extremity |
| 2 | ISIC_0058510 | IP_7960270 | female | 55.0 | torso |
| 3 | ISIC_0073313 | IP_6375035 | female | 50.0 | torso |
| 4 | ISIC_0073502 | IP_0589375 | female | 45.0 | lower extremity |


In [71]:
test_df = {
    'file': [join(dir_2020, f'{IMAGE_SIZE}x{IMAGE_SIZE}/test/{file}.jpg') for file in test_df_2020['image_name'].values],

    # add dummy target labels, as the codebase needs a target as input;
    # build one dict per row so the entries are not all the same object
    'label': [{'classification': 0} for _ in range(len(test_df_2020))]
}

In [72]:
test_df.keys()

dict_keys(['file', 'label'])

# Load the csv files from 2019 competition

The flags `INCLUDE_2019` and `INCLUDE_2018` specify whether to include the 2019-only data and/or the 2018 + 2017 data.
A value of 0 excludes the corresponding data; 1 includes it.

In [73]:
INCLUDE_2019 = 0
INCLUDE_2018 = 1

In [74]:
train_df_2019 = pd.read_csv(join(dir_2019, 'train.csv'))

In [75]:
train_df_2019.head()

|   | image_name | patient_id | sex | age_approx | anatom_site_general_challenge | diagnosis | benign_malignant | target | tfrecord | width | height |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ISIC_0000000 | -1 | female | 55.0 | anterior torso | NV | benign | 0 | 4 | 1022 | 767 |
| 1 | ISIC_0000001 | -1 | female | 30.0 | anterior torso | NV | benign | 0 | 18 | 1022 | 767 |
| 2 | ISIC_0000002 | -1 | female | 60.0 | upper extremity | MEL | malignant | 1 | 0 | 1022 | 767 |
| 3 | ISIC_0000003 | -1 | male | 30.0 | upper extremity | NV | benign | 0 | 24 | 1022 | 767 |
| 4 | ISIC_0000004 | -1 | male | 80.0 | posterior torso | MEL | malignant | 1 | 14 | 1022 | 767 |


# Remove duplicates

It was officially announced that the dataset contains duplicates, which can cause leakage between the training and validation splits if included. They are marked with `tfrecord=-1` so that they can be removed.

In [76]:
train_df_2020 = train_df_2020[train_df_2020['tfrecord'] != -1]

In [77]:
# reset the index so that it runs from 0 to len - 1 again after dropping rows
train_df_2020 = train_df_2020.reset_index(drop=True)

In [78]:
len(train_df_2020)

32692

In [79]:
train_df_2020['tfrecord'].value_counts()

12    2198
2     2193
13    2186
1     2185
3     2182
0     2182
9     2178
8     2177
11    2176
6     2175
14    2174
10    2174
7     2174
5     2171
4     2167
Name: tfrecord, dtype: int64

# Defining version and save directory

The codebase requires that each dataset is packaged into a version file and stored under `/data/siim-isic-melanoma/processed/versions`. The naming of the version follows this convention:

`vX.Y.Z`

- `X`: major version, which changes only when the dataset size changes
- `Y`: minor version, which changes when a dataset of the same size is split in a different way
- `Z`: fold number, since a given way of splitting the files can have multiple folds
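To make the convention concrete, here is a minimal sketch of two helpers that build and parse such version strings. The function names `fold_version` and `parse_version` are hypothetical and not part of the codebase; the fold-saving code in this notebook simply appends a zero-based fold index to `save_version`.

```python
def fold_version(major, minor, fold_index):
    """Build a version string such as 'v3.0.1' from its three parts."""
    return f"v{major}.{minor}.{fold_index}"


def parse_version(version):
    """Split 'vX.Y.Z' into integers (major, minor, fold_index)."""
    if not version.startswith("v"):
        raise ValueError(f"expected a leading 'v': {version!r}")
    parts = version[1:].split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three dot-separated parts: {version!r}")
    major, minor, fold_index = (int(part) for part in parts)
    return major, minor, fold_index


print(fold_version(3, 0, 1))      # v3.0.1
print(parse_version("v3.0.4"))    # (3, 0, 4)
```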

In [80]:
save_dir = '/data/siim-isic-melanoma/processed/versions'
makedirs(save_dir, exist_ok=True)

save_version = 'v3.0'

# Create cross-validation folds using KFold

As noted in the [2019 data discussion](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/164910), one can separate the 2019 data from that of 2018 + 2017 as follows:

All the images that have original image size of `1024x1024` are in odd numbered TFRecords `(1,3,5,7,9...)` and the other images are in even numbered TFRecords `(0,2,4,6,8,...)`. This way you can choose to only include the not-1024x1024 (which is like only including 2018 and 2017) if you like by using the following code

```python
files_train += tf.io.gfile.glob([GCS_PATH2 + '/train%.2i*.tfrec'%(2*x) for x in range(15)])
```

This extra data is always added to the train fold for each of the folds.

We'll split the 15 TFRecords of the 2020 data into 5 folds, so that each fold trains on 12 TFRecords and validates on the remaining 3.
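The sketch below shows this splitting step in isolation, assuming the same `KFold` settings as the notebook (5 splits, shuffling, seed 42). Splitting the TFRecord IDs, rather than individual rows, means that all images within one TFRecord always land on the same side of a split, which is presumably what keeps the leak-free stratification intact. The checks confirm the 12/3 partition and that every TFRecord is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

tfrecord_ids = np.arange(15)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

val_ids_seen = []
for train_idx, val_idx in kfold.split(tfrecord_ids):
    # split() yields positions; map them back to TFRecord IDs explicitly
    train_ids, val_ids = tfrecord_ids[train_idx], tfrecord_ids[val_idx]
    assert len(train_ids) == 12 and len(val_ids) == 3
    assert not set(train_ids) & set(val_ids)
    val_ids_seen.extend(val_ids.tolist())

# every TFRecord is used for validation exactly once across the 5 folds
assert sorted(val_ids_seen) == list(range(15))
```

Because the IDs are `0..14`, the positions returned by `split()` coincide with the IDs themselves, which is why the notebook can pass them straight to `isin`.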

In [21]:
N_FOLDS = 5
SEED = 42
NUM_TFRECORDS = 15

In [81]:
kfold = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

for fold_index, (train_tfrecord_indices, val_tfrecord_indices) in enumerate(kfold.split(np.arange(NUM_TFRECORDS))):
    print(f'Fold #{fold_index + 1} indices')
    print(f'train: {train_tfrecord_indices}')
    print(f'val: {val_tfrecord_indices}')
    print()
    
    # get the corresponding rows in the dataframe for the train and val instance
    fold_train_df_2020 = train_df_2020[train_df_2020['tfrecord'].isin(train_tfrecord_indices)]
    fold_val_df_2020 = train_df_2020[train_df_2020['tfrecord'].isin(val_tfrecord_indices)]

    fold_train_files = [join(dir_2020, f'{IMAGE_SIZE}x{IMAGE_SIZE}/train/{file}.jpg') for file in fold_train_df_2020['image_name'].values]
    fold_train_labels = fold_train_df_2020['target'].values.tolist()
    
    fold_val_files = [join(dir_2020, f'{IMAGE_SIZE}x{IMAGE_SIZE}/train/{file}.jpg') for file in fold_val_df_2020['image_name'].values]
    fold_val_labels = fold_val_df_2020['target'].values.tolist()
    
    print('lengths without extra data: \ntrain, val')
    print(len(fold_train_files), len(fold_val_files))
    print()
    
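    # include all rows with odd tfrecord id to include 2019 data;
    # require a non-negative id because -1 % 2 == 1, so any rows marked -1 would otherwise slip in
    if INCLUDE_2019:
        is_odd_tfrecord = (train_df_2019['tfrecord'] >= 0) & (train_df_2019['tfrecord'] % 2 == 1)
        train_df_only_2019 = train_df_2019[is_odd_tfrecord]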

        print('Adding 2019 data')
        # add this to overall train data
        train_files_only_2019 = [join(dir_2019, f'{IMAGE_SIZE}x{IMAGE_SIZE}/train/{file}.jpg') for file in train_df_only_2019['image_name'].values]
        train_labels_only_2019 = train_df_only_2019['target'].values.tolist()
        print(f'Length of data to be added: {len(train_labels_only_2019)}')
        
        fold_train_files += train_files_only_2019
        fold_train_labels += train_labels_only_2019

    # include all rows with even tfrecord id to include 2018 + 2017 data
    if INCLUDE_2018:
        train_df_only_2018_2017 = train_df_2019[train_df_2019['tfrecord'] % 2 == 0]

        print('Adding 2018 + 2017 data')
        # add this to overall train data
        train_files_only_2018_2017 = [join(dir_2019, f'{IMAGE_SIZE}x{IMAGE_SIZE}/train/{file}.jpg') for file in train_df_only_2018_2017['image_name'].values]
        train_labels_only_2018_2017 = train_df_only_2018_2017['target'].values.tolist()
        print(f'Length of data to be added: {len(train_labels_only_2018_2017)}')
        
        fold_train_files += train_files_only_2018_2017
        fold_train_labels += train_labels_only_2018_2017
    
    print()
    print('lengths with extra data: \ntrain, val')
    print(len(fold_train_files), len(fold_val_files))
    
    print('Checking for repetitions')
    assert len(fold_train_files) == len(set(fold_train_files))
    assert len(fold_val_files) == len(set(fold_val_files))
    
    print('Checking for leakage')
    fold_train_files_set = set(fold_train_files)
    assert not [f for f in fold_val_files if f in fold_train_files_set]
      
    data_config = {}
    
    data_config['train'] = {
        'file': fold_train_files,
        'label': [{'classification': label} for label in fold_train_labels]
    }
    
    data_config['val'] = {
        'file': fold_val_files,
        'label': [{'classification': label} for label in fold_val_labels]
    }
    
    data_config['test'] = test_df
    
    fold_save_version = f'{save_version}.{fold_index}'
    fold_save_path = join(save_dir, f'{fold_save_version}.yml')
    print(f'Saving this fold to {fold_save_path}')
    save_yml(fold_save_path, data_config)
    
    print('========================================')
    print('\n')

Fold #1 indices
train: [ 1  2  3  4  5  6  7  8 10 12 13 14]
val: [ 0  9 11]

lengths without extra data: 
train, val
26156 6536

Adding 2018 + 2017 data
Length of data to be added: 12859

lengths with extra data: 
train, val
39015 6536
Checking for repetitions
Checking for leakage
Saving this fold to /data/siim-isic-melanoma/processed/versions/v3.0.0.yml


Fold #2 indices
train: [ 0  1  2  3  4  6  7  9 10 11 12 14]
val: [ 5  8 13]

lengths without extra data: 
train, val
26158 6534

Adding 2018 + 2017 data
Length of data to be added: 12859

lengths with extra data: 
train, val
39017 6534
Checking for repetitions
Checking for leakage
Saving this fold to /data/siim-isic-melanoma/processed/versions/v3.0.1.yml


Fold #3 indices
train: [ 0  3  4  5  6  7  8  9 10 11 12 13]
val: [ 1  2 14]

lengths without extra data: 
train, val
26140 6552

Adding 2018 + 2017 data
Length of data to be added: 12859

lengths with extra data: 
train, val
38999 6552
Checking for repetitions
Checking for leaka