Note: This notebook runs on an AWS 'g4dn.4xlarge' EC2 instance with the Ubuntu Deep Learning AMI.

# Capstone Project: Skin Lesion Classification and Diagnosis
## Notebook 3b: Oversampling by SMOTE (Diagnosis Classifier)

As in notebook 3a, we will apply the Synthetic Minority Oversampling Technique (SMOTE) to our highly imbalanced 3-class image dataset (for diagnosis classification).

### Table of Contents
- [Problem Statement](#Problem-Statement)
- [Background Information on SMOTE](#Background-Information-on-SMOTE)
- [Importing Libraries](#Importing-Libraries)
- [Loading Data](#Loading-Data)
- [Train-Test-Validation Split](#Train-Test-Validation-Split)
- [Executing SMOTE](#Executing-SMOTE)
- [Exporting Data](#Exporting-Data)

### Problem Statement

Skin cancer is the most common cancer globally, with melanoma being the most deadly form. Even though dermoscopy, a skin imaging modality, has been shown to improve the diagnosis of skin cancer compared to unaided visual inspection<sup>[[1]](https://challenge2019.isic-archive.com/)</sup>, numerous benign lesions are still diagnosed as malignant and vice versa<sup>[[2]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6394090/)</sup>. Every year, diagnostic errors add an estimated $673 million to the overall cost of managing the disease<sup>[[3]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543387/)</sup>.

In this project, we aim to improve the diagnostic rate of skin cancer for dermatologists working at hospitals or skin cancer clinics in Singapore, who need considerable experience or expertise before they can accurately identify and diagnose lesions by visual and dermoscopic inspection<sup>[[3]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543387/)</sup>. We will do this by classifying skin lesion dermoscopy images with Convolutional Neural Network models that predict: <br>
1. a specific skin lesion diagnosis, and <br>
2. whether the lesion is malignant, benign, or pre-cancerous. <br>

The model will be evaluated based on its accuracy, followed by its recall rate since we are looking to minimise false negatives. Ultimately, we aim to get as close to a real evaluation of a dermatologist as possible: predicting the type of skin lesion; and whether the lesion is malignant, pre-cancerous or benign from dermoscopy images. With our models, we hope to aid dermatologists in their decision-making process of diagnosing skin lesions, hence allowing them to improve their diagnostic accuracy and come up with appropriate treatments for patients with skin lesions and/or cancers.

### Background Information on SMOTE
SMOTE is an oversampling technique that generates synthetic samples for the minority classes, which helps to reduce the overfitting that can arise from random oversampling, where existing samples are simply duplicated. It works in feature space, creating new instances by interpolating between minority-class instances that lie close together<sup>[[5]](https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/)</sup>. Specifically, for a chosen minority sample, SMOTE finds its k nearest neighbours within the same class, picks one of them at random, and places a synthetic sample at a random point on the line segment joining the two.
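
The interpolation step can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the imbalanced-learn implementation; the function name `smote_sample` and the toy data are assumptions made for this example.

```python
import numpy as np

def smote_sample(X_minority, k=3, rng=None):
    """Generate one synthetic sample from minority-class rows (illustrative only)."""
    rng = np.random.default_rng(rng)
    # pick a random minority sample as the starting point
    i = rng.integers(len(X_minority))
    x_i = X_minority[i]
    # rank all minority samples by Euclidean distance and skip the closest one (the sample itself)
    distances = np.linalg.norm(X_minority - x_i, axis=1)
    neighbours = np.argsort(distances)[1:k + 1]
    # pick one neighbour and interpolate at a random point on the segment between the two
    x_nn = X_minority[rng.choice(neighbours)]
    lam = rng.random()
    return x_i + lam * (x_nn - x_i)

# toy minority class with four 2-D points (hypothetical data)
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sample(X_minority, k=3, rng=42)
print(synthetic)
```

Because the new point is a convex combination of two existing minority samples, it always lies between them rather than duplicating either one.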

### Importing Libraries

In [None]:
#!pip install imbalanced-learn

In [1]:
#import libraries
import keras
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.preprocessing import image
from tensorflow.keras.preprocessing.image import array_to_img
import tensorflow as tf
from imblearn.over_sampling import SMOTE
from tqdm import tqdm_notebook

import warnings
warnings.filterwarnings('ignore')

### Loading Data

In [2]:
#load in labeled_ground_truth
all_labels = pd.read_csv('../datasets/labeled_ground_truth.csv')

In [3]:
all_labels.head()

Unnamed: 0,image,mel,nv,bcc,akiec,bkl,df,vasc,lesion,diagnosis,benign,malignant,precancerous
0,DERM_001,0,0,0,0,0,1,0,df,benign,1,0,0
1,DERM_002,0,0,0,0,0,1,0,df,benign,1,0,0
2,DERM_003,0,0,0,0,0,1,0,df,benign,1,0,0
3,DERM_004,0,0,0,0,0,1,0,df,benign,1,0,0
4,DERM_005,0,0,0,0,0,1,0,df,benign,1,0,0


In [7]:
#preprocessed images download link: https://bit.ly/processed_image

#loading preprocessed image data
all_images = []
#tqdm: displays a progress bar for the image loading process
for i in tqdm_notebook(range(all_labels['image'].shape[0])):
    
    #load in images with size 284 x 284
    img = image.load_img('../datasets/processed_image/' + all_labels['image'][i] + '.jpg', 
                         target_size=(284,284))
    img_array = image.img_to_array(img) 
    img = img_array/255 #divide by 255 for rescaling
    all_images.append(img)

In [8]:
#make image data into an array
X = np.array(all_images)

In [9]:
X.shape

(10276, 284, 284, 3)

In [3]:
#select the 3 one-hot target columns (the diagnosis classes) as y
target_columns = ['benign', 'malignant', 'precancerous']

#create y that contains all 3 target variables/classes
y = all_labels[target_columns].to_numpy()

In [4]:
y.shape

(10276, 3)

### Train-Test-Validation Split

Similarly, we will be doing an 80-20 train-test split, followed by a 75-25 train-validation split on the training portion.
This means that our X and y data will be split into train, validation and test sets with a ratio of 60-20-20.
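
The split sizes can be checked with some quick arithmetic. The sketch below assumes that `train_test_split` rounds the held-out set size up to the next whole sample, which is consistent with the shapes printed further down; the variable names are chosen for this example only.

```python
import math

n_total = 10276  # number of images loaded above

# first split: test_size=0.2 holds out 20% of all images for testing
n_test = math.ceil(0.2 * n_total)
n_train_full = n_total - n_test

# second split: test_size=0.25 holds out 25% of the remaining 80% for validation
n_val = math.ceil(0.25 * n_train_full)
n_train = n_train_full - n_val

# sample counts, then the share of the full dataset in each set
print(n_train, n_val, n_test)
print(round(n_train / n_total, 2), round(n_val / n_total, 2), round(n_test / n_total, 2))
```

Since 25% of 80% is 20%, the validation and test sets each hold about a fifth of the data, leaving about three fifths for training.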

In [13]:
#Train test split for train and test sets
#80/20 split (test_size=0.2) with random state of 42
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size = 0.2, stratify=y)

In [14]:
#check for shape of train and test datasets
print("X_train dataset: ", X_train.shape)
print("y_train dataset: ", y_train.shape)
print("X_test dataset: ", X_test.shape)
print("y_test dataset: ", y_test.shape)

X_train dataset:  (8220, 284, 284, 3)
y_train dataset:  (8220, 3)
X_test dataset:  (2056, 284, 284, 3)
y_test dataset:  (2056, 3)


In [15]:
#Train validation split for train and validation sets
#75/25 split (test_size=0.25) with random state of 42
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, random_state=42, test_size = 0.25, stratify=y_train)

In [16]:
#check for shape of train and validation datasets
print("X_train dataset: ", X_train.shape)
print("y_train dataset: ", y_train.shape)
print("X_val dataset: ", X_val.shape)
print("y_val dataset: ", y_val.shape)

X_train dataset:  (6165, 284, 284, 3)
y_train dataset:  (6165, 3)
X_val dataset:  (2055, 284, 284, 3)
y_val dataset:  (2055, 3)


In [20]:
#SMOTE expects 2-D input, so flatten X_train from (samples, height, width, channels) into (samples, features)
X_train = X_train.reshape(X_train.shape[0], -1)

In [21]:
X_train.shape

(6165, 241968)

### Executing SMOTE

In [22]:
#initialise SMOTE, using the 3 nearest neighbours for interpolation
smote = SMOTE(random_state=42, k_neighbors=3)

#fit and resample on X_train, y_train
X_smote, y_smote = smote.fit_resample(X_train, y_train)

In [23]:
X_smote.shape

(12240, 241968)
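
The row count of 12240 equals 3 x 4080. This is consistent with SMOTE's default behaviour of raising every minority class to the size of the majority class, which would imply 4080 majority-class samples in the training set. A minimal sketch for checking class counts from one-hot labels follows; the helper name `class_counts` and the toy labels are assumptions for this example.

```python
import numpy as np

def class_counts(y_onehot):
    """Return the number of samples per class from one-hot encoded labels."""
    return np.asarray(y_onehot).sum(axis=0).astype(int)

# toy one-hot labels standing in for y_train (hypothetical data): 4, 1 and 2 samples per class
y_toy = np.array([[1, 0, 0]] * 4 + [[0, 1, 0]] * 1 + [[0, 0, 1]] * 2)
counts = class_counts(y_toy)

# expected number of rows once every class is raised to the majority count
expected_rows = counts.max() * len(counts)
print(counts, expected_rows)
```

In this notebook, `class_counts(y_train)` and `class_counts(y_smote)` should show the counts before and after resampling, assuming `y_smote` is returned in the same one-hot layout as `y_train`.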

In [24]:
#reshape back to the image layout (samples, height, width, channels); -1 infers the number of samples
X_smote = X_smote.reshape(-1, 284, 284, 3)
X_smote.shape

(12240, 284, 284, 3)
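
Flattening and restoring only change how the same values are indexed, so no pixel data is lost in the round trip. The sketch below shows this on a small stand-in array; the sizes are hypothetical and chosen to keep the example light.

```python
import numpy as np

# small stand-in for a batch of images: 5 images of 4 x 4 pixels with 3 channels
images = np.random.default_rng(0).random((5, 4, 4, 3))

# flatten each image into one row, giving a 2-D (samples, features) array
flat = images.reshape(images.shape[0], -1)

# restore the image layout; -1 lets NumPy infer the number of samples
restored = flat.reshape(-1, 4, 4, 3)

print(flat.shape, restored.shape)
print(np.array_equal(images, restored))
```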

### Exporting Data

In [25]:
#saving X_smote and y_smote
np.save('../datasets/npy/diagnosis/X_smote_diagnosis_284.npy', X_smote)
np.save('../datasets/npy/diagnosis/y_smote_diagnosis_284.npy', y_smote)

#saving X_test and y_test
np.save('../datasets/npy/diagnosis/X_test_diagnosis_284.npy', X_test)
np.save('../datasets/npy/diagnosis/y_test_diagnosis_284.npy', y_test)

#saving X_val and y_val
np.save('../datasets/npy/diagnosis/X_val_diagnosis_284.npy', X_val)
np.save('../datasets/npy/diagnosis/y_val_diagnosis_284.npy', y_val)
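
The saved arrays can be read back with `np.load` in a later notebook. The sketch below shows the save and load round trip on small stand-in arrays written to a temporary directory; the file names and shapes are hypothetical and not the ones used above.

```python
import os
import tempfile
import numpy as np

# small stand-in arrays in place of X_smote and y_smote
X_demo = np.zeros((4, 8, 8, 3), dtype=np.float32)
y_demo = np.eye(3, dtype=np.int64)[[0, 1, 2, 0]]

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, 'X_demo.npy')
    y_path = os.path.join(tmp_dir, 'y_demo.npy')
    np.save(x_path, X_demo)
    np.save(y_path, y_demo)

    # np.load restores the shape and dtype exactly as saved
    X_loaded = np.load(x_path)
    y_loaded = np.load(y_path)

print(X_loaded.shape, X_loaded.dtype, y_loaded.shape)
```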