# EDA (augmented dataset)

[Source](https://www.kaggle.com/competitions/siim-isic-melanoma-classification/discussion/159101)

> Hello everyone.
>
>I have created a dataset which (partially) solves the class imbalance of the provided data.
>As you know, the original competition data has ~98% benign cases and only ~2% malignant.
>
>So I scraped only the malignant melanoma images from the 2019 ISIC competition and added them to the original dataset, which leaves us with ~86% benign and ~14% malignant cases.
>
>Then all images were center cropped and resized to 256x256.
>I also have a kernel which uses this dataset and scores 0.925 on the LB.
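
The quoted post does not say how the center crop and resize were done. As a minimal sketch of the idea, assuming images are available as NumPy arrays of shape (height, width, channels), the function below crops the largest centered square and resizes it with nearest-neighbour sampling. The function name and the dummy image are hypothetical, and nearest-neighbour sampling is a stand-in for whatever interpolation the dataset author actually used.

```python
import numpy as np


def center_crop_and_resize(image, size=256):
    """Center-crop an (H, W, C) image to a square, then resize it to size x size."""
    height, width = image.shape[:2]
    side = min(height, width)
    top = (height - side) // 2
    left = (width - side) // 2
    square = image[top:top + side, left:left + side]

    # Nearest-neighbour resize: map each output pixel back to a source pixel.
    # An image library would normally use a smoother interpolation here.
    source_index = np.arange(size) * side // size
    return square[source_index][:, source_index]


# Hypothetical 450x600 RGB image filled with random pixel values
rng = np.random.default_rng(0)
dummy_image = rng.integers(0, 256, size=(450, 600, 3), dtype=np.uint8)
resized = center_crop_and_resize(dummy_image)
print(resized.shape)
```

Cropping before resizing keeps the aspect ratio of the lesion intact, at the cost of discarding the image borders.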



### Import libraries

In [1]:
import numpy as np
import os
import pandas as pd


from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedGroupKFold

### Utils

In [2]:
def sep():
    print("-"*117)

### Paths

In [3]:
TRAIN_PATH = '/Users/alejopaullier/Desktop/alejo/aidmed/data/melanoma-classification/train_clean.csv'
TEST_PATH = '/Users/alejopaullier/Desktop/alejo/aidmed/data/melanoma-classification/test_clean.csv'

### Load data

In [4]:
train_df = pd.read_csv(TRAIN_PATH, sep=',')
test_df = pd.read_csv(TEST_PATH, sep=',')
display(train_df.head())
display(test_df.head())
print(f"Train dataframe has {train_df.shape[0]} rows and {train_df.shape[1]} columns")
sep()
print(f"Test dataframe has {test_df.shape[0]} rows and {test_df.shape[1]} columns")
sep()
target_counts = train_df.target.value_counts()
print(f"Train dataframe has {target_counts[0]} benign and {target_counts[1]} malignant")
sep()

Unnamed: 0,dcm_name,ID,sex,age,anatomy,target
0,ISIC_2637011,IP_7279968,0.022217,0.999753,0.0,0
1,ISIC_0015719,IP_3075186,0.0,0.993884,0.110432,0
2,ISIC_0052212,IP_2842074,0.0,0.9998,0.019996,0
3,ISIC_0068279,IP_6890425,0.0,1.0,0.0,0
4,ISIC_0074268,IP_8723313,0.0,0.995893,0.090536,0


Unnamed: 0,dcm_name,ID,sex,age,anatomy
0,ISIC_0052060,IP_3579794,0.014261,0.99827,0.057044
1,ISIC_0052349,IP_7782715,0.024984,0.999376,0.024984
2,ISIC_0058510,IP_7960270,0.0,0.997366,0.072536
3,ISIC_0073313,IP_6375035,0.0,0.996815,0.079745
4,ISIC_0073502,IP_0589375,0.0,0.999753,0.022217


Train dataframe has 37648 rows and 6 columns
---------------------------------------------------------------------------------------------------------------------
Test dataframe has 10982 rows and 5 columns
---------------------------------------------------------------------------------------------------------------------
Train dataframe has 32542 benign and 5106 malignant
---------------------------------------------------------------------------------------------------------------------
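
The class counts printed above can be checked against the "~86% benign and ~14% malignant" figure from the quoted post. The helper below is a hypothetical illustration (it is not part of the original notebook) that turns the two counts into percentages.

```python
def class_shares(benign, malignant):
    """Return the (benign, malignant) shares of a dataset as percentages."""
    total = benign + malignant
    return 100 * benign / total, 100 * malignant / total


# Counts reported in the output above
benign_pct, malignant_pct = class_shares(benign=32542, malignant=5106)
print(f"{benign_pct:.2f}% benign, {malignant_pct:.2f}% malignant")
```

The malignant share comes out at roughly 13.56%, which matches the per-fold percentages printed later in the notebook.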

### Validation

In [5]:
def create_folds(df, k):
    """
    Creates stratified, patient-grouped folds for training.
    :param df: a dataframe with a "target" column and a "patient_id" column.
    :param k: number of folds
    :return: a generator of (train_indices, validation_indices) tuples, one per fold
    """
    # Stratify on the target while keeping each patient's images in a single fold
    group_fold = StratifiedGroupKFold(n_splits=k)

    # Only the number of rows matters for X, so a placeholder array is enough
    folds = group_fold.split(X=np.zeros(len(df)),
                             y=df['target'],
                             groups=df['patient_id'].tolist())
    return folds


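def target_distribution(df):
    """
    Computes the class balance of a dataframe with a binary "target" column.
    :return: malignant %, benign %, malignant count, benign count
    """
    counts = df.target.value_counts()
    neg_count = counts[0]
    pos_count = counts[1]
    total = neg_count + pos_count
    neg_perc = neg_count / total
    pos_perc = pos_count / total

    return pos_perc*100, neg_perc*100, pos_count, neg_count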

### Create folds

In [6]:
folds = create_folds(train_df, 5)

for fold, (train_index, valid_index) in enumerate(folds):
    print(f"====== Fold {fold} ======")
    # Select this fold's training and validation rows
    train_set = train_df.iloc[train_index].reset_index(drop=True)
    valid_set = train_df.iloc[valid_index].reset_index(drop=True)
    print(f"Train set size is: {train_set.shape[0]}")
    print(f"Validation set size is: {valid_set.shape[0]}")
    # Report the class balance of each split
    pos_perc, neg_perc, pos_count, neg_count = target_distribution(train_set)
    print(f"Train set: {pos_perc} % malignant ({pos_count}) and {neg_perc} % benign ({neg_count})")
    pos_perc, neg_perc, pos_count, neg_count = target_distribution(valid_set)
    print(f"Validation set: {pos_perc} % malignant ({pos_count}) and {neg_perc} % benign ({neg_count})")
    print(f"Train set has {train_set.patient_id.nunique()} unique patients")
    print(f"Validation set has {valid_set.patient_id.nunique()} unique patients")
    print("\n")


====== Fold 0 ======
Train set size is: 30118
Validation set size is: 7530
Train set: 13.559997343781127 % malignant (4084) and 86.44000265621887 % benign (26034)
Validation set: 13.572377158034529 % malignant (1022) and 86.42762284196547 % benign (6508)
Train set has 2736 unique patients
Validation set has 659 unique patients


====== Fold 1 ======
Train set size is: 30116
Validation set size is: 7532
Train set: 13.560897861601806 % malignant (4084) and 86.4391021383982 % benign (26032)
Validation set: 13.568773234200743 % malignant (1022) and 86.43122676579925 % benign (6510)
Train set has 2713 unique patients
Validation set has 682 unique patients


====== Fold 2 ======
Train set size is: 30115
Validation set size is: 7533
Train set: 13.564668769716087 % malignant (4085) and 86.43533123028391 % benign (26030)
Validation set: 13.553697066241869 % malignant (1021) and 86.44630293375813 % benign (6512)
Train set has 2712 unique patients
Validation set has 683 unique patients


====== Fold 3 ======
Train set size is: 30120
Validation set size is: 7528
Train set: 13.565

### Baseline accuracy

In [None]:

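# y_holdout is not defined earlier in this notebook; as an assumption, use the last fold's validation labels
y_holdout = valid_set["target"].values
# All-benign baseline: predict 0 for every image
acc_score = accuracy_score(y_holdout, np.zeros_like(y_holdout))
auc_score = roc_auc_score(y_holdout, np.zeros_like(y_holdout))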
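
To see what this baseline yields, the sketch below builds a label vector with the same class counts as the first validation fold listed above (6508 benign, 1022 malignant) and scores an all-benign prediction. The variable names are illustrative. The labels are synthetic and only reproduce the counts, not the actual rows.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with the class counts of the first validation fold
y_true = np.concatenate([np.zeros(6508, dtype=int), np.ones(1022, dtype=int)])
all_benign = np.zeros_like(y_true)

baseline_acc = accuracy_score(y_true, all_benign)
baseline_auc = roc_auc_score(y_true, all_benign)
print(f"accuracy={baseline_acc:.4f}, auc={baseline_auc:.4f}")
```

The accuracy equals the benign share of the fold (about 0.864), while the AUC is 0.5 because a constant score cannot rank malignant cases above benign ones. This is why accuracy alone is a misleading metric on imbalanced data.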