# Melanoma Classification

Kaggle Competition Page: www.kaggle.com/c/siim-isic-melanoma-classification/overview


## What is Melanoma?
Melanoma, the most severe type of skin cancer, develops in the cells (melanocytes) that produce melanin — the pigment that gives your skin its color. Melanoma can also form in your eyes and, rarely, inside your body, such as in your nose or throat.

The exact cause of all melanomas isn't clear, but exposure to ultraviolet (UV) radiation from sunlight or tanning lamps and beds increases your risk of developing melanoma.

The risk of melanoma seems to be increasing in people under 40, especially women. Knowing the warning signs of skin cancer can help ensure that cancerous changes are detected and treated before the cancer has spread. We can treat melanoma successfully if it is detected early.

<img src="https://github.com/SaschaMet/melanoma-classification/blob/master/images/melanoma.jpg?raw=1" alt="Drawing" style="width: 600px;"/>

## Symptoms & Diagnosis
Melanomas can develop anywhere on your body. They most often develop in areas with exposure to the sun, such as your back, legs, arms, and face.
Melanomas can also occur in areas that don't receive much sun exposure, such as the soles of your feet, palms of your hands, and fingernail beds. These hidden melanomas are more common in people with darker skin.

To help you identify characteristics of melanomas or other skin cancers, think of the letters ABCDE:
- A is for asymmetrical shape. Look for moles with irregular shapes, such as two very different-looking halves.
- B is for irregular border. Look for moles with rough, notched, or scalloped edges — characteristics of melanomas.
- C is for color changes. Look for growths that have many colors or an uneven distribution of color.
- D is for diameter. Look for new growth in a mole larger than 1/4 inch (about 6 millimeters).
- E is for evolving. Look for changes over time, such as a mole that grows in size or changes color or shape.


![ABCDE Melanoma](https://github.com/SaschaMet/melanoma-classification/blob/master/images/abcde-melanoma.jpg?raw=1)

Source: https://www.health.harvard.edu/cancer/melanoma-overview

The facts about Melanoma:
- Melanoma is the most severe form of skin cancer
- It makes up 2% of skin cancers but is responsible for 75% of skin cancer deaths
- Australia and New Zealand have the highest melanoma rates in the world
- 1 in 17 Australians will be diagnosed with melanoma before the age of 85
- More than 90% of melanoma can be successfully treated with surgery if detected early

Source: https://melanomapatients.org.au/about-melanoma/melanoma-facts/

<img src="https://github.com/SaschaMet/melanoma-classification/blob/master/images/melanoma-impact.jpg?raw=1" alt="Drawing" style="width: 600px;"/>

Source: https://impactmelanoma.org/wp-content/uploads/2018/11/Standard-Infographic_0.jpg

## Setup

In [2]:
import os
import json
import random
import warnings
import itertools
import numpy as np
import pandas as pd
from tqdm import tqdm
import tensorflow as tf
from pathlib import Path
from tensorflow import keras
import matplotlib.pyplot as plt
from tensorflow.keras.optimizers import Adam, RMSprop
from datetime import datetime, date
from tensorflow.keras import layers
import tensorflow.keras.backend as K
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.metrics import roc_curve, auc, precision_recall_curve, confusion_matrix

In [3]:
SEED = 1
EPOCHS = 100
BATCH_SIZE = 32
NUM_CLASSES = 2
VERBOSE_LEVEL = 1
SAVE_OUTPUT = True
IMG_SIZE = (224, 224)
INPUT_SHAPE = (224, 224, 3)

CWD = os.getcwd()
warnings.filterwarnings('ignore')


In [4]:
BASE_PATH = '/kaggle/input/siim-isic-melanoma-classification'
PATH_TO_IMAGES = '/kaggle/input/siim-isic-melanoma-classification/jpeg' 
IMAGE_TYPE = ".jpg"

## Loading the data

In [5]:
""" Helper function to validate the image paths

    Parameters:
        file_path (string): Path to the image 

    Returns:
        The file path if the file exists, otherwise False

"""
def check_image(file_path):
    img_file = Path(file_path)
    if img_file.is_file():
        return file_path
    return False

In [6]:
""" Helper function to get the train dataset
"""
def get_train_data():
    # read the data from the train.csv file
    train = pd.read_csv(os.path.join(BASE_PATH, 'train.csv'))
    # add the image_path to the train set
    train['image_path'] = train['image_name'].apply(lambda x: PATH_TO_IMAGES + "/train/" + x + IMAGE_TYPE)
    # check whether the image file exists
    train['image_path'] = train.apply(lambda row : check_image(row['image_path']), axis = 1)
    # if we do not have an image we will not include the data
    train = train[train['image_path'] != False]
    print("valid rows in train", train.shape[0])
    return train



In [7]:
train = get_train_data()

FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/input/siim-isic-melanoma-classification\\train.csv'

In [None]:
train.dtypes

Train Dataset:
- image_name: the filename for the specific image
- patient_id: unique patient id
- sex: gender of the patient
- age_approx: age of the patient
- anatom_site_general_challenge: location of the scan site
- diagnosis: information about the diagnosis
- benign_malignant: indicates if the scan result is malignant or benign
- target: 0 for benign and 1 for malignant
- image_path: path to the image

### Check for missing values

In [None]:
""" Helper function to check a dataframe for missing values

    Parameters:
        df (dataframe): The dataframe to check

    Returns:
        A dataframe with the number of missing and zero values for each column in percent

"""
def check_for_missing_and_null(df):
    null_df = pd.DataFrame({'columns': df.columns, 
                            'percent_null': df.isnull().sum() * 100 / len(df), 
                            'percent_zero': df.isin([0]).sum() * 100 / len(df),
                            'total_zero': df.isnull().sum() * 100 / len(df) + df.isin([0]).sum() * 100 / len(df),
                           })
    return null_df

check_for_missing_and_null(train)

There is a small portion of missing values for age and sex, as well as for the anatom_site_general_challenge column. 

The target column consists of 98 % zero values. This means we have a highly imbalanced dataset.
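
To see why this imbalance matters, here is a minimal sketch on hypothetical toy labels (not the competition data) that mirror the roughly 98 % / 2 % split. A "classifier" that always predicts benign reaches a high accuracy while finding no malignant case at all, which is why accuracy alone is a poor guide here.

```python
import numpy as np

# Hypothetical labels mirroring the roughly 98 % / 2 % split described above.
y_true = np.array([0] * 98 + [1] * 2)

# A trivial "classifier" that always predicts benign (0).
y_pred = np.zeros_like(y_true)

# Share of correct predictions over all samples.
accuracy = float((y_true == y_pred).mean())

# Recall on the malignant class: share of malignant cases that were found.
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall_malignant = float(true_positives / (y_true == 1).sum())

print(f"accuracy: {accuracy:.2f}, malignant recall: {recall_malignant:.2f}")
```

On this toy data the accuracy is 0.98 while the malignant recall is 0.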

### Removing missing values

To do the EDA, I will remove the dataset's missing values because we will not lose much information. Later, when we prepare the dataset for training, I will add these missing values again.

In [None]:
train = train.dropna()

### Target distribution

In [None]:
plt.figure(figsize = (8,6))
x = plt.bar(["Melanoma","Benign"],[len(train[train.target==1]), len(train[train.target==0])])

In [None]:
benign_cases = train[train.target == 0]
melanoma_cases = train[train.target == 1]

print("Benign Cases", len(benign_cases))
print("Melanoma Cases", len(melanoma_cases))
print(" ")
print("There are only", len(melanoma_cases), "malignant cases in the dataset. This is very important to know, because it has implications for how to prepare the dataset for training the machine learning model.")

### Gender distribution

In [None]:
female = train[train.sex == "female"]
male = train[train.sex == "male"]
plt.figure(figsize = (8,6))
x = plt.bar(
    ["Female","Male"],
    [len(female), len(male)]
)
print('There are', len(female), 'female patients in the dataset and', len(male), 'male patients.')

In [None]:
# combine both conditions in a single boolean mask instead of chaining two filters
benign_cases_female = train[(train.target == 0) & (train.sex == "female")]
malignant_cases_female = train[(train.target == 1) & (train.sex == "female")]

benign_cases_male = train[(train.target == 0) & (train.sex == "male")]
malignant_cases_male = train[(train.target == 1) & (train.sex == "male")]

plt.figure(figsize = (8,6))
x = plt.bar(
    ["Benign & Female","Malignant & Female", "Benign & Male","Malignant & Male"],
    [len(benign_cases_female), len(malignant_cases_female), len(benign_cases_male), len(malignant_cases_male)]
)

In [None]:
grouped_df_by_sex = train.groupby(['target','sex'])['benign_malignant'].count().to_frame().reset_index()
grouped_df_by_sex

In [None]:
f_m = train[(train.target == 1) & (train.sex == "female")]
m_m = train[(train.target == 1) & (train.sex == "male")]

print("There are", len(m_m) ,"malignant male cases in the dataset compared to", len(f_m) ,"female cases.")

### Age distribution

In [None]:
# bin edges 0, 10, ..., 110, so that ages of 90 and above also fall into a bin
age_bins = np.arange(0, 111, 10)

""" Helper function to return the age bin for a specific age

    Parameters:
        age (int)

    Returns:
        age bin (int)
"""
def add_age_bin(age):
    for idx, val in enumerate(age_bins):
        if age < val:
            return idx

# add the age bins to the train df
train['age_bin'] = train.apply(lambda row : add_age_bin(row['age_approx']), axis = 1)

In [None]:
plt.figure(figsize=(8,6))
plt.hist( train.age_bin, bins = 20)
plt.show()

In [None]:
print("The mean age of a patient in the dataset is", round(np.mean(train.age_approx)))

In [None]:
plt.figure(figsize=(8,6))
plt.hist( train[train.target==1].age_bin, bins = 20)
plt.show()

The age distribution is roughly bell-shaped. If we look only at the malignant cases, however, the distribution appears to be wider.

In [None]:
def get_ratio_by_age_bin(age_bin):
    total = train[train['age_bin'] == age_bin]
    malignant = total[total['target'] == 1]
    return round((len(malignant) / len(total)) * 100, 2)
    
for age_bin in [2,3,4,5,6,7,8]:
    print("Ratio malignant / total cases for age_bin", age_bin, "=" , get_ratio_by_age_bin(age_bin))


There are indeed more malignant cases at the ends of the age distribution.
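
The per-bin ratio can also be computed for all bins at once. The following is a minimal sketch on hypothetical toy ages and targets (not the competition data), using `np.digitize` and `np.bincount` instead of a Python loop. The edges are an assumption chosen so that the bin index agrees with what `add_age_bin` returns.

```python
import numpy as np

# Hypothetical toy data: approximate ages and binary targets (1 = malignant).
ages = np.array([25, 25, 35, 45, 45, 45, 85, 85])
target = np.array([0, 1, 0, 0, 0, 1, 1, 1])

# Bin edges 0, 10, ..., 100; np.digitize returns i with edges[i-1] <= age < edges[i].
edges = np.arange(0, 101, 10)
age_bin = np.digitize(ages, edges)

# Count all cases and malignant cases per bin.
n_bins = len(edges) + 1
total = np.bincount(age_bin, minlength=n_bins)
malignant = np.bincount(age_bin, weights=target, minlength=n_bins)

# Percentage of malignant cases per bin; empty bins are reported as 0.
ratio = np.divide(malignant * 100.0, total,
                  out=np.zeros(n_bins), where=total > 0)

for b in np.flatnonzero(total):
    print(f"age_bin {b}: {ratio[b]:.2f} % malignant ({total[b]} cases)")
```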

### Anatom Site General Challenge distribution

In [None]:
anatom_site = list(train.anatom_site_general_challenge.unique())
anatom_site = [x for x in anatom_site if str(x) != 'nan']

anatom_site_value_counts = []
for x in anatom_site:
    y = train[train['anatom_site_general_challenge'] == x]
    anatom_site_value_counts.append(len(y))

y_pos = np.arange(len(anatom_site))
plt.figure(figsize=(8,6))
plt.bar(y_pos, anatom_site_value_counts, align='center')
plt.xticks(y_pos, anatom_site)
plt.ylabel('# of rows')
plt.title('Anatom Site General Challenge')

plt.show()

Lesions were most often found on the torso, followed by the lower and upper extremities.

### Diagnosis distribution

In [None]:
diagnosis = list(train.diagnosis.unique())
diagnosis = [x for x in diagnosis if str(x) != 'unknown']

diagnosis_value_counts = []
for x in diagnosis:
    y = train[train['diagnosis'] == x]
    diagnosis_value_counts.append(len(y))

In [None]:
labels = diagnosis
sizes = diagnosis_value_counts
plt.figure(figsize=(8,6))
patches, texts = plt.pie(sizes, shadow=True, startangle=90)
plt.legend(patches, labels, loc="best")
plt.axis('equal')
plt.show()

The main finding in the dataset is "nevus". Nevus is a nonspecific medical term for a visible, circumscribed, chronic lesion of the skin (e.g. a "birthmark"). The second most common finding was melanoma.


Source: https://en.wikipedia.org/wiki/Nevus

## Images from the dataset

In [None]:
plt.figure(figsize=(16, 16))
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img_path = train.iloc[i].image_path
    img = plt.imread(img_path)
    plt.imshow(img, cmap='gray')
    plt.axis('off')
plt.tight_layout()   

## Data preparation

Because we removed some values from the dataset for the EDA, we load the training set again.

In [None]:
train = get_train_data()

In [None]:
# getting dummy variables for gender
sex_dummies = pd.get_dummies(train['sex'], prefix='sex', dtype="int")
train = pd.concat([train, sex_dummies], axis=1)

# getting dummy variables for anatom_site_general_challenge
anatom_dummies = pd.get_dummies(train['anatom_site_general_challenge'], prefix='anatom', dtype="int")
train = pd.concat([train, anatom_dummies], axis=1)

# getting dummy variables for target column
#target_dummies = pd.get_dummies(train['target'], prefix='target', dtype="int")
#train = pd.concat([train, target_dummies], axis=1)

# dropping not useful columns
train.drop(['sex','diagnosis','benign_malignant','anatom_site_general_challenge'], axis=1, inplace=True)

# replace missing age values with the mean age
train['age_approx'] = train['age_approx'].fillna(int(np.mean(train['age_approx'])))

# convert age to int
train['age_approx'] = train['age_approx'].astype('int')

print("rows in train", train.shape[0])

In [None]:
train.dtypes

### Balance the dataset

Because we have a highly imbalanced dataset we need to balance it.

In [None]:
# 1 means 50 / 50 => equal amount of positive and negative cases in Training
# 4 = 20%; 8 = ~11%; 12 = ~8%
balance = 1
p_inds = train[train.target == 1].index.tolist()
np_inds = train[train.target == 0].index.tolist()

# seed the sampler so the undersampled subset is reproducible
random.seed(SEED)
np_sample = random.sample(np_inds, balance * len(p_inds))
train = train.loc[p_inds + np_sample]
print("Share of positive samples in train", train['target'].sum()/len(train))
print("Remaining rows in train set", len(train))
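
The undersampling step above can be sketched on toy labels. This is a minimal illustration, assuming a hypothetical helper `undersample_indices` and made-up labels; it keeps every positive case and draws `balance` negatives per positive, which is the same idea as the cell above.

```python
import random

def undersample_indices(labels, balance=1, seed=1):
    """Return indices keeping all positives and `balance` negatives per positive."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # never ask for more negatives than exist
    n_neg = min(len(neg), balance * len(pos))
    return pos + rng.sample(neg, n_neg)

# Hypothetical toy labels: 5 positive and 95 negative cases.
labels = [1] * 5 + [0] * 95

kept = undersample_indices(labels, balance=4)
positive_share = sum(labels[i] for i in kept) / len(kept)
print(len(kept), positive_share)
```

With `balance=4` the positives make up one fifth of the kept rows, matching the "4 = 20%" note in the cell above.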

In [None]:
""" Helper function to create a train, a validation and a test dataset

    Parameters:
    df (dataframe): The dataframe to split
    test_size (float): Fraction of df used for the validation set
    classToPredict: The target column

    Returns:
    train_data (dataframe)
    val_data (dataframe)
    test_data (dataframe)
"""
def create_splits(df, test_size, classToPredict):
    train_data, val_data = train_test_split(df,  test_size = test_size, random_state = 1, stratify = df[classToPredict])
    # split the test set off the remaining training rows (not off df) so the three sets do not overlap
    train_data, test_data = train_test_split(train_data,  test_size = 0.16, random_state = 1, stratify = train_data[classToPredict])
    return train_data, val_data, test_data
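
A three-way split only gives honest test scores if the three sets share no rows. The sketch below, on hypothetical toy arrays, shows one way to get disjoint, stratified train, validation and test sets with two calls to `train_test_split`; the split fractions are assumptions chosen for round numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 sample ids with perfectly balanced labels.
X = np.arange(100)
y = np.array([0] * 50 + [1] * 50)

# First cut off the validation set ...
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# ... then cut the test set out of the remainder, not out of the full data.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=1, stratify=y_rest)

# No sample id may appear in more than one set.
overlap = (set(X_train) & set(X_val)) | (set(X_train) & set(X_test)) | (set(X_val) & set(X_test))
print(len(X_train), len(X_val), len(X_test), len(overlap))
```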

In [None]:
""" Helper function to plot the history of a tensorflow model

    Parameters:
        history (history object): The history from a tf model
        timestamp (string): The timestamp of the function execution

    Returns:
        Null
"""
def save_history(history, timestamp):
    f = plt.figure()
    f.set_figwidth(15)

    f.add_subplot(1, 2, 1)
    plt.plot(history['val_loss'], label='val loss')
    plt.plot(history['loss'], label='train loss')
    plt.legend()
    plt.title("Model Loss")

    f.add_subplot(1, 2, 2)
    plt.plot(history['val_accuracy'], label='val accuracy')
    plt.plot(history['accuracy'], label='train accuracy')
    plt.legend()
    plt.title("Model Accuracy")


In [None]:
""" Helper function to plot the auc curve

    Parameters:
        t_y (array): True binary labels
        p_y (array): Target scores

    Returns:
        Null
"""
def plot_auc(t_y, p_y):
    fpr, tpr, thresholds = roc_curve(t_y, p_y, pos_label=1)
    fig, c_ax = plt.subplots(1,1, figsize = (8, 8))
    c_ax.plot(fpr, tpr, label = '%s (AUC:%0.2f)'  % ('Target', auc(fpr, tpr)))
    c_ax.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
    c_ax.legend()
    c_ax.set_xlabel('False Positive Rate')
    c_ax.set_ylabel('True Positive Rate')
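
The AUC value shown in the legend of `plot_auc` can be read as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one. The sketch below computes it directly from that definition with NumPy; `pairwise_auc` is a hypothetical helper for illustration, not the routine scikit-learn uses internally.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # score difference for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy labels and scores (hypothetical): three of the four pairs are ranked correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, y_score)
print(auc_value)
```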

## Data augmentation

In [None]:
""" Factory function to create a training image data generator

Parameters:
    df (dataframe): Training dataframe 

Returns:
    Image Data Generator function
"""
def get_training_gen(df):
    ## prepare images for training
    train_idg = ImageDataGenerator(
        rescale = 1 / 255.0,
        horizontal_flip = True, 
        vertical_flip = True, 
        height_shift_range = 0.15, 
        width_shift_range = 0.15,
        shear_range=0.15,
        rotation_range = 90, 
        zoom_range = 0.20,
        fill_mode='nearest'
    )

    train_gen = train_idg.flow_from_dataframe(
        seed=SEED,
        dataframe=df,
        directory=None,
        x_col='image_path',
        y_col='target',
        class_mode='raw',
        shuffle=True,
        target_size=IMG_SIZE,
        #batch_size=BATCH_SIZE,
        validate_filenames = False
    )

    return train_gen
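
To illustrate what label-preserving augmentation does to a single image, here is a simplified NumPy sketch. It only covers random flips and rotations by multiples of 90 degrees; the generator above additionally applies shifts, shear, zoom and arbitrary rotation angles. The function name and the random sample image are assumptions for illustration.

```python
import numpy as np

def random_flip_rotate(img, rng):
    """Randomly flip a square H x W x C image and rotate it by a multiple of 90 degrees."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]        # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :, :]        # vertical flip
    k = int(rng.integers(0, 4))      # number of 90-degree rotations
    return np.rot90(img, k=k, axes=(0, 1)).copy()

rng = np.random.default_rng(1)

# Hypothetical stand-in for a rescaled 224 x 224 RGB image with values in [0, 1).
sample_image = rng.random((224, 224, 3))
augmented = random_flip_rotate(sample_image, rng)

print(augmented.shape)
```

The pixels are only rearranged, never changed, so a lesion stays a lesion of the same class after the transformation.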

In [None]:
""" Factory function to create a validation image data generator

Parameters:
    df (dataframe): Validation dataframe 

Returns:
    Image Data Generator function
"""
def get_validation_gen(df):
    ## prepare images for validation
    val_idg = ImageDataGenerator(rescale=1. / 255.0)
    val_gen = val_idg.flow_from_dataframe(
        seed=SEED,
        dataframe=df,
        directory=None,
        x_col='image_path',
        y_col='target',
        class_mode='raw',
        shuffle=False,
        target_size=IMG_SIZE,
        #batch_size=BATCH_SIZE,
        validate_filenames = False
    )

    return val_gen

In [None]:
""" Factory function to create a test image data generator

Parameters:
    df (dataframe): Test dataframe 

Returns:
    Image Data Generator function
"""
def get_test_gen(df):
    ## prepare images for testing
    test_idg = ImageDataGenerator(rescale=1. / 255.0)
    test_gen = test_idg.flow_from_dataframe(
        seed=SEED,
        dataframe=df,
        directory=None,
        x_col='image_path',
        y_col='target',
        class_mode='raw',
        shuffle=False,
        target_size=IMG_SIZE,
        #batch_size=BATCH_SIZE,
        validate_filenames = False
    )

    return test_gen

## Transfer Learning

Conventional machine learning and deep learning algorithms have traditionally been designed to work in isolation. They are trained to solve specific tasks, and the models have to be rebuilt from scratch once the feature-space distribution changes. Transfer learning is the idea of overcoming this isolated learning paradigm by using knowledge acquired for one task to solve related ones.


![Transfer Learning](https://github.com/SaschaMet/melanoma-classification/blob/master/images/transfer-learning.png?raw=1)
 

Traditional learning is isolated: separate models are trained on specific tasks and datasets, and no knowledge is retained that could be transferred from one model to another. In transfer learning, you can leverage knowledge (features, weights, etc.) from previously trained models to train newer models and even tackle problems such as having less data for the newer task.

**Fine Tuning Off-the-shelf Pre-trained Models**

This is a more involved technique, where we do not just replace the final layer (for classification/regression), but we also selectively retrain some of the previous layers. 
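
In contrast to fine-tuning, the simpler variant keeps the pretrained part frozen and trains only a new head on its outputs, which is what the feature-extraction cells later in this notebook amount to. The sketch below illustrates that idea with NumPy only. Everything in it is an assumption for illustration: a fixed random projection stands in for the pretrained network, the data is synthetic, and the head is a plain logistic regression trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained network: a fixed ("frozen") projection followed by ReLU.
W_frozen = rng.normal(size=(20, 8)) / np.sqrt(20)
W_before = W_frozen.copy()

def extract(x):
    """Frozen feature extractor; its weights are never updated."""
    return np.maximum(x @ W_frozen, 0.0)

# Synthetic binary problem (not the melanoma images).
X = rng.normal(size=(200, 20))
y = (X[:, 0] > 0).astype(float)
feats = extract(X)

# Trainable head: logistic regression on top of the frozen features.
w = np.zeros(feats.shape[1])
b = 0.0

def log_loss(w, b):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

loss_start = log_loss(w, b)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    # gradients of the mean log loss with respect to the head parameters only
    w -= 0.1 * (feats.T @ (p - y) / len(y))
    b -= 0.1 * float(np.mean(p - y))
loss_end = log_loss(w, b)

print(f"loss before: {loss_start:.3f}, after: {loss_end:.3f}")
```

Fine-tuning would go one step further and also update some of the extractor's own weights, usually with a small learning rate.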


![Transfer Learning](https://miro.medium.com/max/700/1*BBZGHtI_vhDBeqsIbgMj1w.png)
 



Source: https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a

In [None]:
""" Helper function which returns a DenseNet121 model
"""
from tensorflow.keras.applications.densenet import DenseNet121
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.densenet import preprocess_input
import numpy as np

def load_pretrained_model():
    base_model = DenseNet121(
        input_shape=INPUT_SHAPE,
        include_top=False,
        weights='imagenet'
    )

    return base_model


In [None]:
model = load_pretrained_model()
model.summary()

In [None]:
# output shape of the base model, e.g. (None, height, width, channels)
last_layer_shape = model.output_shape
last_layer_shape

In [None]:
# create a training and validation dataset from the train df
train_df, val_df, test_df = create_splits(train, 0.2, 'target')

print("rows in train_df", train_df.shape[0])
print("rows in val_df", val_df.shape[0])
print("rows in test_df", test_df.shape[0])

# call the generator functions
#train_gen = get_training_gen(train_df)
#val_gen = get_validation_gen(val_df)
#test_gen = get_test_gen(test_df)
#valX, valY = val_gen.next()

In [None]:
train_len = train_df.shape[0]
val_len = val_df.shape[0]
test_len = test_df.shape[0]

In [None]:
""" Helper function for feature extraction
"""

def extract_features(df):
    features = []
    labels = []
    # iterate over paths and labels together instead of looking each label up again
    for img_path, label in zip(df['image_path'], df['target']):
        # load_img expects (height, width), so use IMG_SIZE rather than INPUT_SHAPE
        img = image.load_img(img_path, target_size=IMG_SIZE)
        img_data = image.img_to_array(img)
        img_data = np.expand_dims(img_data, axis=0)
        img_data = preprocess_input(img_data)

        # run the image through the pretrained base model and flatten the feature map
        feature = model.predict(img_data)
        features.append(np.array(feature).flatten())
        labels.append(label)

    return np.array(features), np.array(labels)

In [None]:
train_features, train_labels = extract_features(train_df)
val_features, val_labels = extract_features(val_df)
test_features, test_labels = extract_features(test_df)

In [None]:
X_train, y_train = train_features, train_labels
X_val, y_val = val_features, val_labels
X_test, y_test = test_features, test_labels

# SVM

In [None]:
from sklearn.svm import SVC
classifier_SVM = SVC(kernel = 'rbf', random_state = 0)
classifier_SVM.fit(X_train, y_train)

In [None]:
train_acc_SVM = classifier_SVM.score(X_train, y_train)
val_acc_SVM = classifier_SVM.score(X_val, y_val)
test_acc_SVM = classifier_SVM.score(X_test, y_test)

In [None]:
print(train_acc_SVM)
print(val_acc_SVM)
print(test_acc_SVM)

In [None]:
y_pred_SVM = classifier_SVM.predict(X_test)

In [None]:
from sklearn import metrics
def print_performance_metrics(test_labels,predict):
    print('Accuracy:', np.round(metrics.accuracy_score(test_labels, predict),4))
    print('ROC Area:', np.round(metrics.roc_auc_score(test_labels, predict),4))
    print('Precision:', np.round(metrics.precision_score(test_labels, predict,average='weighted'),4))
    print('Recall:', np.round(metrics.recall_score(test_labels, predict,
                                               average='weighted'),4))
    print('F1 Score:', np.round(metrics.f1_score(test_labels, predict,
                                               average='weighted'),4))
    print('Cohen Kappa Score:', np.round(metrics.cohen_kappa_score(test_labels, predict),4))
    print('Matthews Corrcoef:', np.round(metrics.matthews_corrcoef(test_labels, predict),4)) 
    print('\t\tClassification Report:\n', metrics.classification_report(test_labels, predict))

print_performance_metrics(y_test,y_pred_SVM)
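
Several of the numbers printed above follow directly from the four cells of a binary confusion matrix. The sketch below derives them by hand for a hypothetical matrix, using the scikit-learn layout `[[TN, FP], [FN, TP]]`. Note that it reports precision, recall and F1 for the malignant class only, whereas `print_performance_metrics` uses `average='weighted'`, so the values are not directly comparable.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true labels, columns are predictions.
cm = np.array([[50, 10],
               [5, 35]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted malignant cases that are malignant
recall = tp / (tp + fn)      # share of malignant cases that were found
f1 = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} mcc={mcc:.4f}")
```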

In [None]:
# create a confusion matrix
cm_SVM =  confusion_matrix(y_test,y_pred_SVM)
cm_SVM

In [None]:
""" Helper function to plot a confusion matrix

    Parameters:
        cm (confusion matrix)
        labels (list): Class names for the axis ticks

    Returns:
        Null
"""
def plot_confusion_matrix(cm, labels):
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title('Confusion Matrix')
    plt.colorbar()
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=55)
    plt.yticks(tick_marks, labels)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], 'd'), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

cm_plot_label =['benign', 'malignant']
plot_confusion_matrix(cm_SVM, cm_plot_label)

# RANDOM FOREST

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier_RF = RandomForestClassifier(n_estimators = 800, criterion = 'entropy', random_state = 0)
classifier_RF.fit(X_train,y_train)

In [None]:
train_acc_RF = classifier_RF.score(X_train, y_train)
val_acc_RF = classifier_RF.score(X_val, y_val)
test_acc_RF = classifier_RF.score(X_test, y_test)

In [None]:
print(train_acc_RF)
print(val_acc_RF)
print(test_acc_RF)

In [None]:
y_pred_RF = classifier_RF.predict(X_test)

In [None]:
print_performance_metrics(y_test,y_pred_RF)

In [None]:
# create a confusion matrix
cm_RF =  confusion_matrix(y_test,y_pred_RF)
cm_RF

In [None]:
plot_confusion_matrix(cm_RF, cm_plot_label)

# ADABOOST

In [None]:
from sklearn.ensemble import AdaBoostClassifier
classifier_AdaBoost = AdaBoostClassifier(n_estimators = 100)
classifier_AdaBoost.fit(X_train, y_train)

In [None]:
train_acc_AdaBoost = classifier_AdaBoost.score(X_train, y_train)
val_acc_AdaBoost = classifier_AdaBoost.score(X_val, y_val)
test_acc_AdaBoost = classifier_AdaBoost.score(X_test, y_test)

In [None]:
print(train_acc_AdaBoost)
print(val_acc_AdaBoost)
print(test_acc_AdaBoost)

In [None]:
y_pred_AdaBoost = classifier_AdaBoost.predict(X_test)

In [None]:
print_performance_metrics(y_test,y_pred_AdaBoost)

In [None]:
# create a confusion matrix
cm_AdaBoost =  confusion_matrix(y_test,y_pred_AdaBoost)
cm_AdaBoost

In [None]:
plot_confusion_matrix(cm_AdaBoost, cm_plot_label)

# KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier_kNN = KNeighborsClassifier(n_neighbors = 5, algorithm='ball_tree', leaf_size=30)
classifier_kNN.fit(X_train, y_train)

In [None]:
train_acc_kNN = classifier_kNN.score(X_train, y_train)
val_acc_kNN = classifier_kNN.score(X_val, y_val)
test_acc_kNN = classifier_kNN.score(X_test, y_test)

In [None]:
print(train_acc_kNN)
print(val_acc_kNN)
print(test_acc_kNN)

In [None]:
y_pred_kNN = classifier_kNN.predict(X_test)

In [None]:
print_performance_metrics(y_test,y_pred_kNN)

In [None]:
# create a confusion matrix
cm_kNN =  confusion_matrix(y_test,y_pred_kNN)
cm_kNN

In [None]:
plot_confusion_matrix(cm_kNN, cm_plot_label)

# XGBOOST

In [None]:
import xgboost as xgb
classifier_xgb = xgb.XGBClassifier(n_estimators = 300)
classifier_xgb.fit(X_train, y_train)

In [None]:
train_acc_xgb = classifier_xgb.score(X_train, y_train)
val_acc_xgb = classifier_xgb.score(X_val, y_val)
test_acc_xgb = classifier_xgb.score(X_test, y_test)

In [None]:
print(train_acc_xgb)
print(val_acc_xgb)
print(test_acc_xgb)

In [None]:
y_pred_xgb = classifier_xgb.predict(X_test)

In [None]:
print_performance_metrics(y_test,y_pred_xgb)

In [None]:
# create a confusion matrix
cm_xgb =  confusion_matrix(y_test,y_pred_xgb)
cm_xgb

In [None]:
plot_confusion_matrix(cm_xgb, cm_plot_label)

# BAGGING

In [None]:
from sklearn.ensemble import BaggingClassifier
classifier_Bagging = BaggingClassifier(n_estimators=100)
classifier_Bagging.fit(X_train,y_train)

In [None]:
train_acc_Bagging = classifier_Bagging.score(X_train, y_train)
val_acc_Bagging = classifier_Bagging.score(X_val, y_val)
test_acc_Bagging = classifier_Bagging.score(X_test, y_test)

In [None]:
print(train_acc_Bagging)
print(val_acc_Bagging)
print(test_acc_Bagging)

In [None]:
y_pred_Bagging = classifier_Bagging.predict(X_test)

In [None]:
print_performance_metrics(y_test,y_pred_Bagging)

In [None]:
# create a confusion matrix
cm_Bagging =  confusion_matrix(y_test,y_pred_Bagging)
cm_Bagging

In [None]:
plot_confusion_matrix(cm_Bagging, cm_plot_label)

# ANN

In [None]:
model_ANN = tf.keras.models.Sequential([
    # extract_features() flattens each feature map, so every sample is a 1-D vector
    tf.keras.Input(shape=(last_layer_shape[1] * last_layer_shape[2] * last_layer_shape[3],)),
    tf.keras.layers.Flatten(),
    #tf.keras.layers.Dense(64, activation = 'relu'),
    #tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(32, activation = 'relu'),
    tf.keras.layers.Dense(1, activation = 'sigmoid')
])

LEARNING_RATE = 1e-4
OPTIMIZER = RMSprop(learning_rate=LEARNING_RATE,decay=1e-2)
LOSS = 'binary_crossentropy'
METRICS = [
    'accuracy', 
    'AUC'
] 

model_ANN.compile(
    loss=LOSS,
    metrics=METRICS,
    optimizer=OPTIMIZER,
)

print("fit model on gpu")
history_ANN = model_ANN.fit(
    train_features, train_labels, 
    epochs=EPOCHS, 
    verbose=VERBOSE_LEVEL, 
    validation_data=(val_features,val_labels)
)

In [None]:
# get the current timestamp. This timestamp is used to save the model data with a unique name
now = datetime.now()
today = date.today()
current_time = now.strftime("%H:%M:%S")
timestamp = str(today) + "_" + str(current_time)

# plot model history
save_history(history_ANN.history, timestamp)

In [None]:
y_pred_ANN = model_ANN.predict(X_test)

In [None]:
""" Helper function to turn the model predictions into a binary (0,1) format

    Parameters:
        pred (float): Model prediction

    Returns:
        binary prediction (int)
"""

def pred_to_binary(pred):
    if pred < 0.5:
        return 0
    else:
        return 1

y_pred_ANN = [pred_to_binary(x) for x in y_pred_ANN]
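
The same thresholding can be done on the whole prediction array at once. This is a minimal sketch with a hypothetical helper `probs_to_binary` and made-up probabilities; it uses the same rule as `pred_to_binary` (values below 0.5 become 0, everything else becomes 1) and flattens the `(n, 1)` array that a sigmoid output layer typically returns.

```python
import numpy as np

def probs_to_binary(probs, threshold=0.5):
    """Map predicted probabilities to 0/1 labels; values >= threshold become 1."""
    probs = np.asarray(probs, dtype=float).ravel()
    return (probs >= threshold).astype(int)

# Hypothetical model output with one probability per row.
probs = np.array([[0.1], [0.5], [0.9], [0.49]])

binary = probs_to_binary(probs)
print(binary)
```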

In [None]:
print_performance_metrics(y_test,y_pred_ANN)

In [None]:
# create a confusion matrix
cm_ANN =  confusion_matrix(y_test,y_pred_ANN)
cm_ANN

In [None]:
plot_confusion_matrix(cm_ANN, cm_plot_label)

# LSTM

In [None]:
# For LSTMs: reshape each flattened feature vector into a (timesteps, features) sequence

train_features_2d = np.zeros((train_len,last_layer_shape[1],last_layer_shape[2]*last_layer_shape[3]))
for i in range(len(train_labels)):
    train_features_2d[i] = train_features[i].reshape(last_layer_shape[1],
                                                     last_layer_shape[2]*last_layer_shape[3])
    
val_features_2d = np.zeros((val_len,last_layer_shape[1],last_layer_shape[2]*last_layer_shape[3]))
for i in range(len(val_labels)):
    val_features_2d[i] = val_features[i].reshape(last_layer_shape[1],
                                                     last_layer_shape[2]*last_layer_shape[3])
    
test_features_2d = np.zeros((test_len,last_layer_shape[1],last_layer_shape[2]*last_layer_shape[3]))
for i in range(len(test_labels)):
    test_features_2d[i] = test_features[i].reshape(last_layer_shape[1],
                                                     last_layer_shape[2]*last_layer_shape[3])
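
The loops above treat each row of the feature map as one timestep. The sketch below shows the same reshape on small random arrays and checks that a single vectorized `reshape` gives the same result as the per-sample loop. The sizes are hypothetical (a channel count of 16 keeps the example small) and do not claim to match the real base model output.

```python
import numpy as np

# Hypothetical sizes: 4 samples, 7 x 7 feature map, 16 channels.
n, h, w, c = 4, 7, 7, 16
rng = np.random.default_rng(0)
feature_maps = rng.random((n, h, w, c))

# Flatten every sample, as the feature-extraction step does.
flat = feature_maps.reshape(n, -1)

# Vectorized: one sequence of h timesteps with w * c features per sample.
sequences = flat.reshape(n, h, w * c)

# Loop version, mirroring the cell above.
sequences_loop = np.zeros((n, h, w * c))
for i in range(n):
    sequences_loop[i] = flat[i].reshape(h, w * c)

print(sequences.shape, np.array_equal(sequences, sequences_loop))
```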

In [None]:
model_LSTM = tf.keras.models.Sequential([
    tf.keras.Input(shape=(last_layer_shape[1], last_layer_shape[2]*last_layer_shape[3])),
    tf.keras.layers.LSTM(100, return_sequences=True),
    #tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation = 'sigmoid')
])

LEARNING_RATE = 1e-4
OPTIMIZER = Adam(learning_rate=LEARNING_RATE,decay=1e-2)
LOSS = 'binary_crossentropy'
METRICS = [
    'accuracy', 
    'AUC'
] 

model_LSTM.compile(
    loss=LOSS,
    metrics=METRICS,
    optimizer=OPTIMIZER,
)

print("fit model on gpu")
history_LSTM = model_LSTM.fit(
    train_features_2d, train_labels, 
    epochs=EPOCHS, 
    verbose=VERBOSE_LEVEL,
    validation_data=(val_features_2d,val_labels)
)

In [None]:
# plot model history
save_history(history_LSTM.history, timestamp)

In [None]:
y_pred_LSTM = model_LSTM.predict(test_features_2d)

In [None]:
y_pred_LSTM = [pred_to_binary(x) for x in y_pred_LSTM]

In [None]:
print_performance_metrics(y_test,y_pred_LSTM)

In [None]:
# create a confusion matrix
cm_LSTM =  confusion_matrix(y_test,y_pred_LSTM)
cm_LSTM

In [None]:
plot_confusion_matrix(cm_LSTM, cm_plot_label)

# BIDIRECTIONAL LSTM

In [None]:
model_Bi_LSTM = tf.keras.models.Sequential([
    tf.keras.Input(shape=(last_layer_shape[1], last_layer_shape[2]*last_layer_shape[3])),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100)),
    #tf.keras.layers.Dropout(0.3),
    #tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(1, activation = 'sigmoid')
])

LEARNING_RATE = 1e-4
OPTIMIZER = Adam(learning_rate=LEARNING_RATE,decay=1e-2)
LOSS = 'binary_crossentropy'
METRICS = [
    'accuracy', 
    'AUC'
] 

model_Bi_LSTM.compile(
    loss=LOSS,
    metrics=METRICS,
    optimizer=OPTIMIZER,
)

print("fit model on gpu")
history_Bi_LSTM = model_Bi_LSTM.fit(
    train_features_2d, train_labels, 
    epochs=EPOCHS, 
    verbose=VERBOSE_LEVEL, 
    validation_data=(val_features_2d,val_labels)
)

In [None]:
# plot model history
save_history(history_Bi_LSTM.history, timestamp)

In [None]:
y_pred_Bi_LSTM = model_Bi_LSTM.predict(test_features_2d)

In [None]:
y_pred_Bi_LSTM = [pred_to_binary(x) for x in y_pred_Bi_LSTM]

In [None]:
print_performance_metrics(y_test,y_pred_Bi_LSTM)

In [None]:
# create a confusion matrix
cm_Bi_LSTM =  confusion_matrix(y_test,y_pred_Bi_LSTM)
cm_Bi_LSTM

In [None]:
plot_confusion_matrix(cm_Bi_LSTM, cm_plot_label)