# **CLASSIFICATION USING NEURAL NETWORK <br>WITH HYPERPARAMETER TUNING**

#
## Deep learning models

Deep learning is a method in **artificial intelligence** (AI) that enables computers to process data in a manner inspired by the human brain. By using neural networks with **multiple layers**, deep learning models have the ability to recognize and understand complex patterns in various types of data, including images, text, sounds, and more. This enables deep learning models to provide accurate insights and make predictions based on the learned patterns. 

<br>

![alt text](https://datafloq.com/wp-content/uploads/2021/12/blog_pictures2FDeep_Learning_2360-750x420.png "AI")

<sup><sub>Source: [datafloq.com](https://datafloq.com/read/deep-learning-methods-machine-learning/)</sub></sup>

# 
## Neural network

A neural network uses **interconnected nodes or neurons** in a layered structure that resembles the human brain. Each neuron **takes inputs, applies weights** to them, performs a mathematical operation, and passes the result to the next layer. Through a process called training, a neural network adjusts its weights to learn patterns in the data, so that it can make accurate predictions or classifications. Common application areas include image and speech recognition as well as prediction models.

<br>

![alt text](https://i0.wp.com/i.postimg.cc/pLgLsJDt/Architecture.jpg?w=1230&ssl=1 "NN Structure")

<sup><sub>Source: [blog.knoldus.com](https://i0.wp.com/i.postimg.cc/pLgLsJDt/Architecture.jpg?w=1230&ssl=1)</sub></sup>

# 
## Important components of a neural network

#### **Input Layer**

The input layer is the first layer of the neural network which receives the input data. Each node in the input layer represents a feature or an element of the input data.

# 
#### **Output Layer**

The output layer is the final layer of the neural network which provides the predictions or outputs of the network based on the computations and transformations performed in the preceding layers. In multi-class classification, the output layer may have multiple nodes, where each node represents the probability or prediction of each individual class.

# 
#### **Neurons** or **Nodes**

Neurons are the basic units of computation in a neural network. Each neuron (organized into layers) receives inputs, performs computations using weights and biases, and produces an output.

# 
#### **Weights** and **Biases**

Each connection between the neurons has a weight associated with it, which determines the strength or importance of that connection. Biases provide an additional constant term that helps control the activation of neurons. These learnable parameters are adjusted during training to minimize the loss.

# 
#### **Activation Function**

A neural network without an activation function is essentially just a linear regression model, because a stack of linear layers collapses into a single linear transformation. The activation function applies a non-linear transformation to the neuron's weighted sum of inputs plus bias, which makes the network capable of learning and performing more complex tasks. Its output determines whether, and how strongly, the neuron is activated.

Some common activation functions are: **Softmax**, **Sigmoid**, **Tanh** and **ReLU**

<br>

![alt text](https://miro.medium.com/v2/resize:fit:1200/1*hkYlTODpjJgo32DoCOWN5w.png "Neuron")

<sup><sub>Source: [medium.com](https://towardsdatascience.com/the-concept-of-artificial-neurons-perceptrons-in-neural-networks-fab22249cbfc)</sub></sup>
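The computation of a single neuron can be sketched in a few lines of plain Python. The input values, weights, and bias below are made up for illustration, and `neuron` is a hypothetical helper, not part of any library.

```python
import math

def relu(z):
    # ReLU passes positive values through and clamps negative ones to zero
    return max(0.0, z)

def sigmoid(z):
    # sigmoid squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    # weighted sum of the inputs plus the bias (the pre-activation)
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # the activation function is applied to the pre-activation
    return activation(z)

inputs = [1.0, 0.0, 1.0]
weights = [0.5, -0.3, 0.5]
bias = -2.0

# the pre-activation is 0.5 + 0.0 + 0.5 - 2.0 = -1.0
out_relu = neuron(inputs, weights, bias, relu)
out_sigmoid = neuron(inputs, weights, bias, sigmoid)
print(out_relu, out_sigmoid)
```

With a negative pre-activation, ReLU outputs zero (the neuron stays inactive), while the sigmoid still produces a small positive value.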

# 
#### **Hidden Layers**

Hidden layers in a neural network are the layers that exist between the input layer and the output layer. They are responsible for performing computations and transformations to learn representations and extract features from the input data.

# 
#### **Loss Function**

During neural network training, the loss function quantifies the error between predicted and true values, providing feedback for the model's learning. Backpropagation and the loss function work together to train the network by updating weights based on the calculated gradients and minimizing the discrepancy between predicted and true values. 

Some commonly used loss functions are: **Mean Squared Error (MSE) Loss** and **Cross-Entropy Loss**
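As a small worked example, the sketch below computes both losses by hand in plain Python. The logits and targets are invented for illustration; the helper names `softmax`, `cross_entropy`, and `mse` are local to this example.

```python
import math

def softmax(logits):
    # subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    # negative log-probability that the model assigns to the true class
    return -math.log(softmax(logits)[true_class])

def mse(predictions, targets):
    # mean of the squared differences
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

logits = [4.0, 0.0, 0.0]                            # the model strongly favours class 0
loss_if_right = cross_entropy(logits, true_class=0)
loss_if_wrong = cross_entropy(logits, true_class=1)
regression_loss = mse([1.0, 2.0], [1.0, 4.0])
print(loss_if_right, loss_if_wrong, regression_loss)
```

Cross-entropy stays small when the confident prediction is correct and grows sharply when it is wrong, which is the feedback signal that drives training in a classification task.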

# 
#### **Back Propagation**

Backpropagation is a fundamental algorithm in the training of neural networks. It uses the chain rule to compute the gradient of the loss with respect to every weight, which allows the network to learn from the data and adapt its weights to optimize performance for a given task. 

By iteratively performing the forward pass, the backward pass, and the weight update on a training dataset, the network gradually learns to adjust its weights in a way that reduces the error and improves its ability to make accurate predictions.
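The three steps can be sketched for the simplest possible case: a model with a single weight trained with plain gradient descent on toy data. This is only an illustration under those assumptions; in a multi-layer network, backpropagation applies the chain rule layer by layer to obtain the same kind of gradient for every weight.

```python
# toy data that follows y = 2 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0                 # the single learnable weight
learning_rate = 0.05
losses = []

for step in range(100):
    # forward pass: predictions and mean squared error
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    losses.append(loss)

    # backward pass: dL/dw = mean(2 * (prediction - target) * x)
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)

    # weight update: step against the gradient
    w -= learning_rate * grad_w

print(w, losses[0], losses[-1])
```

The weight moves from 0 towards 2, and the loss shrinks with every iteration.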

# 
#### **Optimizer**

Optimizers play a crucial role in training neural networks as they determine how the model learns and converges to the optimal solution. They adjust the parameters based on the computed gradients of the loss function with respect to the network's parameters. 

Some common optimizers are: **Stochastic Gradient Descent (SGD)**, **Adam (Adaptive Moment Estimation)** and **RMSProp (Root Mean Square Propagation)**


<br>

![alt text](./images/backprop.png "Back Propagation")

<sup><sub>Source: [Lecture Notes](https://git-ce.rwth-aachen.de/spotseven-lab/ml-ai-ait-sommersemester-2023/-/blob/main/LectureNotes.d/bart23b-ml-ai-01-public.ipynb)</sub></sup>

# 
## Different types of Neural Network structures

* **Feedforward Neural Networks**: Information flows in one direction, from the input layer through the hidden layers to the output layer. They are used for tasks such as classification and regression. 

* **Convolutional Neural Networks**: They use convolutional layers to extract spatial features from the input images, allowing them to capture patterns and structures effectively. CNNs are designed specifically for image processing tasks. 

* **Recurrent Neural Networks**: RNNs are designed to handle sequential data, such as time series or natural language. They have a feedback loop that allows them to have memory and consider context from previous inputs when making predictions. 

## 0. Importing Libraries and Device Agnostics

In [2]:
import warnings
warnings.filterwarnings('ignore')

import optuna
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import numpy as np
import random
import itertools

from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import StratifiedKFold

from spotPython.utils.metrics import mapk_score

# 
#### Setting `seed` values to increase reproducibility

In [3]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

# 
#### Using `gpu` for tensor based mathematics

* `device` is used to place tensors, the model, and other computations on the GPU when one is available, and on the CPU otherwise

In [4]:
global device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

# 
## 1. Loading Dataset: Classification

#### Vector Borne Disease Prediction Data

The **Vector Borne Disease Dataset** was used in a study to predict medical prognosis. It consisted of hundreds of samples with case-specific features. The dataset included a target variable `prognosis` representing prognostic outcomes divided into **eleven classes**. To prepare the dataset for training a classifier model, preprocessing steps involved encoding prognosis names and performing feature engineering. **The goal was to predict the prognosis for unknown data based on the trained model.** 

In [5]:
train_df = pd.read_csv('./data/Kaggle/train.csv')

# remove the id column
train_df = train_df.drop(columns=['id'])

global target_column
target_column = "prognosis"

train_df

Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Lyme_disease
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Tungiasis
2,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,Lyme_disease
3,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Zika
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,Rift_Valley_fever
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
702,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Plague
703,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Malaria
704,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Zika
705,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,Plague


#
## 2. Data Preprocessing and Feature Engineering

# 
#### `Encoding`

Encoding transforms categorical variables into a numerical format, since the network can only operate on numbers. Here the `prognosis` class names are mapped to integer codes.

In [6]:
enc = OrdinalEncoder()
train_df[target_column] = enc.fit_transform(train_df[[target_column]])

train_df

Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0
2,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,3.0
3,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
702,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
703,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
704,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0
705,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,5.0
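The fitted encoder also keeps the mapping, so the integer codes can be turned back into class names. A minimal sketch, using three class names from the dataset as toy input:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# toy target column with three of the class names
labels = np.array([["Zika"], ["Malaria"], ["Plague"], ["Zika"]])

enc = OrdinalEncoder()

# categories are sorted, so Malaria -> 0, Plague -> 1, Zika -> 2
codes = enc.fit_transform(labels)

# inverse_transform recovers the original names from the codes
decoded = enc.inverse_transform(codes)

print(codes.ravel(), decoded.ravel(), enc.categories_)
```

Because the codes follow the sorted order of the class names, they are arbitrary identifiers rather than a meaningful ranking of the diseases.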


# 
#### `Feature Combination`

This is performed to create new features by combining existing features in a dataset using some logic. This process aims to add additional information that may not be evident in the original features alone. In our case this is achieved by applying the boolean operators `and`, `or` and `xor` on the initial features.
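As a rough illustration of this step, the sketch below applies the three operators to a toy frame with three made-up symptom columns. It collects the new columns in a dictionary and joins them in one go, which is a design choice of this example and not necessarily how the notebook does it.

```python
import itertools
import pandas as pd

# toy data: three binary symptom columns (names are illustrative)
toy = pd.DataFrame({"fever": [1, 0, 1], "rash": [1, 1, 0], "chills": [0, 0, 1]})

new_columns = {}
for a, b in itertools.combinations(toy.columns, 2):
    new_columns[f"{a}_and_{b}"] = toy[a] & toy[b]   # both symptoms present
    new_columns[f"{a}_or_{b}"] = toy[a] | toy[b]    # at least one present
    new_columns[f"{a}_xor_{b}"] = toy[a] ^ toy[b]   # exactly one present

# join all new columns at once instead of inserting them one by one
combined = pd.concat([toy, pd.DataFrame(new_columns)], axis=1)
print(combined.shape)
```

With `n` original columns there are `n * (n - 1) / 2` pairs and three new columns per pair, so the feature count grows quadratically. Three columns already produce nine new ones.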

In [7]:
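def combine_features(df):
    # adds the AND, OR and XOR of every pair of columns present at call time;
    # the columns must hold integer (0/1) values for the bitwise operators to work
    for col1, col2 in itertools.combinations(df.columns, 2):
        df[f'{col1}_and_{col2}'] = df[col1] & df[col2]
        df[f'{col1}_or_{col2}'] = df[col1] | df[col2]
        df[f'{col1}_xor_{col2}'] = df[col1] ^ df[col2]

    return df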

In [8]:
train_df = train_df.astype(int)

col_prognosis = train_df[target_column]
train_x = train_df.drop(columns=['prognosis'])
train_x = combine_features(train_x)
train_x['prognosis'] = col_prognosis
train_df = train_x.copy()

train_df

Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,toenail_loss_and_speech_problem,toenail_loss_or_speech_problem,toenail_loss_xor_speech_problem,toenail_loss_and_bullseye_rash,toenail_loss_or_bullseye_rash,toenail_loss_xor_bullseye_rash,speech_problem_and_bullseye_rash,speech_problem_or_bullseye_rash,speech_problem_xor_bullseye_rash,prognosis
0,1,1,0,1,1,1,1,0,1,1,...,0,0,0,0,0,0,0,0,0,3
1,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,7
2,0,1,1,1,0,1,1,1,1,1,...,1,1,0,1,1,0,1,1,0,3
3,0,0,1,1,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,10
4,0,0,0,0,0,0,0,0,1,0,...,0,1,1,0,1,1,0,0,0,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
702,0,0,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5
703,1,0,1,1,1,1,0,1,1,0,...,0,0,0,0,0,0,0,0,0,4
704,1,0,1,0,1,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,10
705,1,1,0,0,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,5


# 
#### `Clustering`

To gather more information from the data, ***feature clustering*** was applied. This involved summing up the number of features that contained a specific keyword (for example `pain`) in their name and had a value of 1 in a specific sample. The sum was then added to a new column (say, `c_0`). 

This counts the pain-related symptoms that are present in each sample, so the new column represents the overall presence or intensity of pain for that sample. 

The clustering process condenses multiple individual features into a single aggregated feature. 
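A minimal sketch of the keyword count on a toy frame (the three columns and their values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "muscle_pain": [1, 0, 1],
    "joint_pain": [1, 0, 0],
    "nose_bleed": [0, 1, 1],
})

# select every column whose name contains the keyword
pain_columns = toy.columns[toy.columns.str.contains("pain")]

# per-sample count of pain-related symptoms that are present
toy["c_0"] = toy[pain_columns].sum(axis=1)
print(toy)
```

Because the match is a substring test, any column whose name contains the keyword is counted. After the feature combination step this includes the combined columns (for example names ending in `_and_muscle_pain`), which is why the cluster values in the output are in the hundreds.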

In [8]:
def cluster_features(df):    
    c_0 = df.columns[df.columns.str.contains('pain')]
    c_1 = df.columns[df.columns.str.contains('inflammation')]
    c_2 = df.columns[df.columns.str.contains('bleed')]
    c_3 = df.columns[df.columns.str.contains('skin')]
    df["c_0"] = df[c_0].sum(axis=1)
    df["c_1"] = df[c_1].sum(axis=1)
    df["c_2"] = df[c_2].sum(axis=1)
    df["c_3"] = df[c_3].sum(axis=1) 
       
    return df

In [9]:
col_prognosis = train_df[target_column]
train_x = train_df.drop(columns=['prognosis'])
train_x = cluster_features(train_x)
train_x['prognosis'] = col_prognosis
train_df = train_x.copy()

train_df

Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,toenail_loss_or_bullseye_rash,toenail_loss_xor_bullseye_rash,speech_problem_and_bullseye_rash,speech_problem_or_bullseye_rash,speech_problem_xor_bullseye_rash,c_0,c_1,c_2,c_3,prognosis
0,1,1,0,1,1,1,1,0,1,1,...,0,0,0,0,0,707,168,356,112,3
1,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,84,36,157,137,7
2,0,1,1,1,0,1,1,1,1,1,...,1,0,1,1,0,822,375,496,252,3
3,0,0,1,1,1,1,0,1,0,1,...,0,0,0,0,0,640,235,425,252,10
4,0,0,0,0,0,0,0,0,1,0,...,1,1,0,0,0,84,147,48,24,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
702,0,0,1,1,1,0,0,0,0,0,...,0,0,0,0,0,511,203,409,165,5
703,1,0,1,1,1,1,0,1,1,0,...,0,0,0,0,0,503,199,320,163,4
704,1,0,1,0,1,0,0,1,1,1,...,0,0,0,0,0,522,174,295,183,10
705,1,1,0,0,1,0,1,0,1,0,...,0,0,0,0,0,759,330,367,252,5


# 
#### `Affinity Propagation`

Affinity Propagation is a clustering algorithm used in machine learning and data analysis. The algorithm is based on the concept of **message passing** among data points to determine which points should be considered as **exemplars, or representatives**, of clusters. The exemplars are chosen based on their **affinity** or similarity to other data points in the dataset.
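A minimal sketch of the algorithm on toy data with scikit-learn. When `affinity="precomputed"` is used, scikit-learn treats the matrix as similarities, where larger values mean more alike, so this example negates the Manhattan distances. The sample values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import manhattan_distances

# two visibly different groups of binary samples (toy data)
samples = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 1],
])

# similarity = negated distance, so identical samples get the highest value (0)
similarity = -manhattan_distances(samples)

af = AffinityPropagation(affinity="precomputed", random_state=0).fit(similarity)

labels = af.labels_                          # cluster label per sample
exemplar_rows = af.cluster_centers_indices_  # row indices of the chosen exemplars
print(labels, exemplar_rows)
```

The number of clusters is not set in advance; it follows from the similarities and the `preference` parameter, which defaults to the median similarity.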

In [10]:
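def affinity_propagation_features(df):
    from sklearn.cluster import AffinityPropagation
    from sklearn.metrics.pairwise import manhattan_distances

    # pairwise Manhattan distances between all samples
    X = manhattan_distances(df)

    # NOTE: affinity="precomputed" expects a similarity matrix (larger = more similar);
    # the usual choice is the negated distance, i.e. -manhattan_distances(df).
    # The distances are passed unchanged here, matching the outputs shown below.
    af = AffinityPropagation(random_state=0, affinity="precomputed").fit(X)
    cluster_centers_indices = af.cluster_centers_indices_
    n_clusters_ = len(cluster_centers_indices)

    # the cluster label of each sample becomes a new feature
    df['cluster'] = af.labels_

    return df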

In [11]:
col_prognosis = train_df[target_column]
train_x = train_df.drop(columns=['prognosis'])   
train_df = affinity_propagation_features(train_x)
train_df['prognosis'] = col_prognosis

train_df

Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,toenail_loss_xor_bullseye_rash,speech_problem_and_bullseye_rash,speech_problem_or_bullseye_rash,speech_problem_xor_bullseye_rash,c_0,c_1,c_2,c_3,cluster,prognosis
0,1,1,0,1,1,1,1,0,1,1,...,0,0,0,0,707,168,356,112,0,3
1,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,84,36,157,137,1,7
2,0,1,1,1,0,1,1,1,1,1,...,0,1,1,0,822,375,496,252,0,3
3,0,0,1,1,1,1,0,1,0,1,...,0,0,0,0,640,235,425,252,2,10
4,0,0,0,0,0,0,0,0,1,0,...,1,0,0,0,84,147,48,24,1,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
702,0,0,1,1,1,0,0,0,0,0,...,0,0,0,0,511,203,409,165,1,5
703,1,0,1,1,1,1,0,1,1,0,...,0,0,0,0,503,199,320,163,1,4
704,1,0,1,0,1,0,0,1,1,1,...,0,0,0,0,522,174,295,183,2,10
705,1,1,0,0,1,0,1,0,1,0,...,0,0,0,0,759,330,367,252,0,5


# 
#### `Feature Selection`

Since the dataset from the previous step contains `6118` features, it was important to utilize a certain number of **most important features** to reduce the computational cost downstream. This was achieved using feature ranking with **recursive feature elimination (`RFE`)**.

Using an external estimator (`RandomForestClassifier`), the least significant features are iteratively eliminated. It starts by training an estimator on the full feature set and ranks the features based on their importance. Then, it removes the least important feature(s) and repeats the process until a desired number of features (`n_features_to_select`) is reached. This mitigates over-fitting and improves generalization by focusing on the most informative features.

In [12]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

y = train_df[target_column]
X = train_df.drop(columns = [target_column], axis = 1)

# executing RFE to reach 100 features by eliminating 20 features in each iteration
rfe_selector = RFE(estimator = RandomForestClassifier(), n_features_to_select = 100, step = 20)
rfe_selector.fit(X, y)

# get the names of most-important features
rfe_selector.get_feature_names_out()

array(['sudden_fever_xor_muscle_pain', 'sudden_fever_xor_joint_pain',
       'sudden_fever_xor_pleural_effusion',
       'sudden_fever_xor_weight_loss', 'headache_xor_diarrhea',
       'headache_xor_nausea', 'headache_xor_chills',
       'headache_xor_fatigue', 'headache_xor_facial_distortion',
       'mouth_bleed_xor_joint_pain', 'mouth_bleed_xor_pleural_effusion',
       'mouth_bleed_xor_swelling', 'mouth_bleed_xor_nausea',
       'mouth_bleed_xor_fatigue', 'mouth_bleed_or_weight_loss',
       'mouth_bleed_xor_coma', 'mouth_bleed_xor_irritability',
       'nose_bleed_xor_swelling', 'nose_bleed_xor_nausea',
       'nose_bleed_xor_digestion_trouble', 'nose_bleed_xor_weight_loss',
       'nose_bleed_xor_diziness', 'nose_bleed_xor_loss_of_appetite',
       'nose_bleed_xor_microcephaly', 'nose_bleed_or_prostraction',
       'muscle_pain_xor_nausea', 'muscle_pain_xor_chills',
       'muscle_pain_xor_gum_bleed', 'muscle_pain_xor_jaundice',
       'muscle_pain_xor_yellow_skin', 'joint_pain_x

In [13]:
# sampling the dataset based on only the important features
# (.copy() avoids pandas' SettingWithCopyWarning when the target column is added back)
train_df = train_df[rfe_selector.get_feature_names_out()].copy()
train_df['prognosis'] = y

train_df

Unnamed: 0,sudden_fever_xor_muscle_pain,sudden_fever_xor_joint_pain,sudden_fever_xor_pleural_effusion,sudden_fever_xor_weight_loss,headache_xor_diarrhea,headache_xor_nausea,headache_xor_chills,headache_xor_fatigue,headache_xor_facial_distortion,mouth_bleed_xor_joint_pain,...,microcephaly_or_toenail_loss,bitter_tongue_or_toenail_loss,cocacola_urine_or_toenail_loss,hyperpyrexia_or_itchiness,c_0,c_1,c_2,c_3,cluster,prognosis
0,0,0,0,0,0,0,0,0,1,1,...,0,1,0,0,707,168,356,112,0,3
1,0,0,1,0,1,1,0,0,0,0,...,0,0,0,0,84,36,157,137,1,7
2,0,1,1,1,0,0,0,0,0,0,...,1,1,1,1,822,375,496,252,0,3
3,1,1,1,0,0,1,1,1,1,0,...,0,0,0,0,640,235,425,252,2,10
4,0,0,0,0,1,0,0,0,0,0,...,1,1,1,1,84,147,48,24,1,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
702,1,0,0,0,0,0,0,0,1,1,...,1,0,0,0,511,203,409,165,1,5
703,0,0,1,1,1,1,0,0,1,0,...,1,0,0,0,503,199,320,163,1,4
704,0,1,1,0,1,1,1,1,1,1,...,1,1,0,0,522,174,295,183,2,10
705,0,1,0,0,0,0,0,0,1,0,...,1,1,0,0,759,330,367,252,0,5


# 
#### Dimensionality Reduction: `Principal Component Analysis`

To further reduce the load on computational resources and the training time (*curse of dimensionality*), **Dimensionality Reduction** can be utilized.

**Principal Component Analysis (PCA)** is a linear dimensionality reduction technique that takes advantage of correlations between the features in the dataset and combines them into a new, smaller set (`n_components`) of uncorrelated variables. PCA is an unsupervised algorithm, as it does not require labels in the data.

In [14]:
# from sklearn.decomposition import PCA

# X = train_df.drop(columns=[target_column], axis=1)

# pca = PCA(n_components=30, random_state=42)
# X_with_PCA = pca.fit_transform(X)

# print(f'{X_with_PCA.shape}')

# train_df = pd.DataFrame(X_with_PCA)
# train_df['prognosis'] = y

# train_df

# 
#### Dimensionality Reduction: `Linear Discriminant Analysis`

**Linear Discriminant Analysis (LDA)** is another dimensionality reduction method. It reduces the number of features in a dataset while preserving the information that discriminates between the classes. It works by calculating summary statistics of the input features per class label, and is therefore a supervised method. LDA can produce at most one component fewer than the number of classes, so at most ten components for the eleven classes here.

In [15]:
# from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# y = train_df[target_column]
# X = train_df.drop(columns=[target_column], axis=1)

# lda = LinearDiscriminantAnalysis()
# X_with_LDA = lda.fit_transform(X, y)

# print(f'{X_with_LDA.shape}\n')

# train_df = pd.DataFrame(X_with_LDA)

# train_df['prognosis'] = y

# train_df

# 
#### `class_weights` for training data

To reduce the impact of *class imbalance* (some classes being more prevalent than others), **class weights** are introduced. Class weights assign different weights to different classes, affecting the contribution of each class to the overall loss function or optimization process during training. By giving more weight to the minority class and less weight to the majority class, the algorithm pays more attention to the minority class samples and tries to correct the bias.

In [16]:
def apply_weights (target):
    
    class_sample_count = np.unique(target, return_counts=True)[1]
    total_samples = len(target)

    train_class_weight = 1 - (class_sample_count/total_samples)

    train_class_weight = torch.from_numpy(train_class_weight).type(torch.float).to(device)

    return train_class_weight
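The weighting formula used in `apply_weights` can be checked by hand on a small, made-up target vector. The sketch below leaves out the conversion to a tensor and only reproduces the arithmetic.

```python
import numpy as np

# toy target: class 0 appears 6 times, class 1 three times, class 2 once
target = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

# np.unique returns the sorted classes and their counts
class_sample_count = np.unique(target, return_counts=True)[1]

# weight = 1 - relative frequency, so rarer classes get larger weights
weights = 1 - class_sample_count / len(target)
print(class_sample_count, weights)   # counts [6 3 1], weights close to [0.4 0.7 0.9]
```

The position of each weight corresponds to the sorted class label, so the weight vector lines up with the class indices only if every class occurs at least once in the data it is computed from. Stratified folds help to keep that true.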

#
## 3. `optuna` Hyper-Parameter Tuning

Optuna is an open-source black-box hyperparameter optimization framework, designed to automate the process of hyperparameter tuning. Optuna allows easy integration with PyTorch in order to obtain optimized neural network structures.

# 
#### Simple MLP model with tune-able parameters

Defining a basic multi-layer perceptron (MLP) structure for hyperparameter tuning. During every `trial` of the tuning process, a new `custom_model` with a fresh set of parameters is generated and evaluated for the objective function.

In [17]:
def custom_model (trial, feature_in, feature_out=11): 

    # number of hidden layers
    num_layers = trial.suggest_int('Number of layers', 2, 5, step=1)

    layers = []

    num_input = feature_in

    for i in range(num_layers):

        # number of hidden neurons, sampled as a power of two (2**4 = 16 up to 2**9 = 512)
        num_output = trial.suggest_int('num_l{}'.format(i), 4, 9, step=1)
        num_output = 2 ** num_output

        layers.append(nn.Linear(in_features=num_input, out_features=num_output))
        layers.append(nn.BatchNorm1d(num_output))
        layers.append(nn.ReLU())

        # dropout probability for the layer
        # NOTE: the search range includes p=1.0, which zeroes every activation of the layer
        prb = trial.suggest_float('dropout_prb_l{}'.format(i), 0., 1., step=0.2)
        layers.append(nn.Dropout(p=prb))

        num_input = num_output

    layers.append(nn.Linear(in_features=num_input, out_features=feature_out))

    return nn.Sequential(*layers)
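To make the search space concrete, the sketch below builds one architecture of the kind `custom_model` can produce, with hand-picked values standing in for Optuna's suggestions. The values in `layer_exponents` and `dropout_probs` and the input size of 100 features are illustrative assumptions, not tuned results.

```python
import torch
import torch.nn as nn

# hypothetical values standing in for the trial's suggestions
layer_exponents = [6, 5]      # hidden widths 2**6 = 64 and 2**5 = 32
dropout_probs = [0.2, 0.0]
feature_in, feature_out = 100, 11

layers = []
num_input = feature_in
for exponent, prb in zip(layer_exponents, dropout_probs):
    num_output = 2 ** exponent
    layers += [
        nn.Linear(num_input, num_output),
        nn.BatchNorm1d(num_output),
        nn.ReLU(),
        nn.Dropout(p=prb),
    ]
    num_input = num_output

# final layer maps the last hidden width to one logit per class
layers.append(nn.Linear(num_input, feature_out))
model = nn.Sequential(*layers)

# a batch of 8 random samples yields 11 logits per sample
logits = model(torch.randn(8, feature_in))
linear_widths = [m.out_features for m in model if isinstance(m, nn.Linear)]
print(logits.shape, linear_widths)
```

The model ends in a plain linear layer without softmax, because the loss function used for training works on the raw logits.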


# 
#### Helper function to reset model weights in every fold of training

Since the tuning process optimizes the average k-fold `map@k` score, this function resets the model weights before a new fold starts. Without the reset, a fold would continue training a model that has already seen that fold's test data during an earlier fold, which inflates the score.

In [18]:
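def weight_reset(m):
    # reset every layer that defines reset_parameters (Linear, Conv2d, BatchNorm1d, ...),
    # so that BatchNorm parameters and running statistics are not carried over either
    if callable(getattr(m, 'reset_parameters', None)):
        m.reset_parameters()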

# 
#### Training loop

In [19]:
global X, y

X = train_df.drop(columns=[target_column])
y = train_df[target_column]

# 
This `train_mapk` training process is used in every trial of the tuning process.

1. Whenever a new `trial` is initiated, a new `model` (defined in the function `custom_model`) is created. 
2. Optuna then *suggests* a fresh set of *tune-able parameters* for this trial.
3. The complete `train_df` dataset is then split using stratified `5-fold` cross-validation, which produces `train` and `test` data for each fold.
4. For each `fold`, the generated model and the suggested parameters are used for training, and the `map@k` scores of the folds are averaged into the `5-fold map@k` score.

# 
`map@k`, or **Mean Average Precision at K**, is an evaluation metric that can be used when the order of the predictions plays an important role in the objective of the task. Since we want to know whether the correct prognosis is among the top `k = 3` predictions, and how highly it is ranked, `map@k` is a suitable metric to use for optimization.
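A minimal sketch of the metric for the single-label case, written in plain Python. It is an independent illustration; the notebook itself relies on `mapk_score` from `spotPython`, whose implementation is not shown here. The helper names `average_precision_at_k` and `map_at_k` and the probabilities are made up for this example.

```python
def average_precision_at_k(true_label, ranked_predictions, k=3):
    # with a single correct label, the score is 1/rank if the label
    # appears among the top k predictions, and 0 otherwise
    for rank, predicted in enumerate(ranked_predictions[:k], start=1):
        if predicted == true_label:
            return 1.0 / rank
    return 0.0

def map_at_k(true_labels, class_probabilities, k=3):
    scores = []
    for true_label, probs in zip(true_labels, class_probabilities):
        # class indices sorted from most to least probable
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        scores.append(average_precision_at_k(true_label, ranked, k))
    return sum(scores) / len(scores)

y_true = [2, 0, 1]
y_prob = [
    [0.10, 0.20, 0.60, 0.10],   # true class 2 is ranked first  -> 1
    [0.30, 0.40, 0.20, 0.10],   # true class 0 is ranked second -> 1/2
    [0.40, 0.05, 0.30, 0.25],   # true class 1 is ranked fourth -> 0
]

score = map_at_k(y_true, y_prob, k=3)
print(score)   # (1 + 0.5 + 0) / 3 = 0.5
```

A correct class in first place earns full credit, lower ranks earn proportionally less, and anything outside the top `k` earns nothing.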

In [20]:
def train_mapk (model, trial):

    # number of epochs in a trial
    epochs = trial.suggest_int('epochs', 500, 1000, step=100)

    # executing k-fold cross validation optimization
    k = 5
    skf = StratifiedKFold(n_splits=k)


    # device agnostics: move the model before handing its parameters to the optimizer
    model = model.to(device)


    # tuning the optimizer and its learning rate
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD", "AdamW", "Adamax", "NAdam"])
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True)

    optimizer = getattr(optim, optimizer_name)(model.parameters(), lr=learning_rate)


    # empty list to store 'k' map@k scores for each fold
    mapk_list = []


    for i, (train_index, test_index) in enumerate(skf.split(X, y)):
        
        # new train and test data in each fold
        X_train, y_train = X.iloc[train_index], y.iloc[train_index]
        X_test, y_test = X.iloc[test_index], y.iloc[test_index]


        # calculating class_weights for training data in each fold
        train_class_weight = apply_weights(y_train)


        # for multi-class classification problem
        loss_function = nn.CrossEntropyLoss(weight=train_class_weight)

        # device agnostics
        if torch.cuda.is_available(): 
            loss_function = loss_function.cuda()


        # making tensors for train and test data in each fold
        X_train_tensor = torch.from_numpy(X_train.values).type(torch.float).to(device)
        y_train_tensor = torch.from_numpy(y_train.values).type(torch.LongTensor).to(device)

        X_test_tensor = torch.from_numpy(X_test.values).type(torch.float).to(device)
        y_test_tensor = torch.from_numpy(y_test.values).type(torch.LongTensor).to(device)


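        # resetting model weights before starting training for the next fold
        model.apply(weight_reset)

        # rebuilding the optimizer so that its internal state (e.g. momentum estimates)
        # from the previous fold is discarded as well
        optimizer = type(optimizer)(model.parameters(), lr=learning_rate)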


        # training loop for epochs for particular trial
        for epoch in range(epochs):
            
            # enter training mode
            model.train()

            # forward pass: raw, unnormalized scores (logits) from the NN
            y_train_logits = model(X_train_tensor)

            # nn.CrossEntropyLoss applies log-softmax internally, so it takes the logits directly
            loss = loss_function(y_train_logits, y_train_tensor)

            # zero out the gradients accumulated in the previous pass
            optimizer.zero_grad()

            # perform back-prop
            loss.backward()

            # update model parameters using the calculated gradients
            optimizer.step()



        # enter evaluation mode (disables dropout, uses BatchNorm running statistics)
        model.eval()

        # getting raw test predictions without tracking gradients
        with torch.no_grad():
            y_test_logits = model(X_test_tensor)

        # converting them to class probabilities
        y_test_pred_probs = torch.softmax(y_test_logits, dim=1)

        # converting the tensor into a numpy array to calculate the map@k score for this fold
        y_test_pred_probs = y_test_pred_probs.cpu().numpy()


        mapk_list.append(mapk_score(y_test, y_test_pred_probs))


    # getting average map@k score of k-folds
    cv_mapk = np.average(mapk_list)
    print(f'---------------------------------------------\n{k}-fold Validation MAPK: {cv_mapk}\n\n')

    return cv_mapk

# 
#### Objective function to optimize

The `objective_function` is responsible for initializing a new neural network structure (`custom_model`) and calling the training loop (`train_mapk`). It returns the average `map@k` score, which Optuna uses for optimization.

In [21]:
n_features = train_df.shape[1] - 1

def objective_function (trial):

    # tune-able MLP model
    model = custom_model(trial=trial, feature_in=n_features, feature_out=11)

    # tuning on average map@k of k-fold obtained after training loop
    cv_mapk = train_mapk(model, trial)

    return cv_mapk

# 
#### Creating `optuna study` and optimizing

In Optuna, a `study` is an object that stores a collection of `trials` and their results. We optimize the output of the `objective_function` for a fixed number of trials (`n_trials`). 

`direction="maximize"` tells Optuna to maximize the objective function.

In [22]:
study = optuna.create_study(direction="maximize")
study.optimize(objective_function, n_trials=50)

[I 2023-07-14 23:07:47,946] A new study created in memory with name: no-name-611ac5ad-3046-4bb1-9b17-4ec0835fc8ab
[I 2023-07-14 23:07:52,964] Trial 0 finished with value: 0.32602970066260445 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 0.4, 'num_l1': 9, 'dropout_prb_l1': 0.0, 'epochs': 200, 'optimizer': 'Adamax', 'learning_rate': 0.0016208428381543323}. Best is trial 0 with value: 0.32602970066260445.


---------------------------------------------
5-fold Validation MAPK: 0.32602970066260445




[I 2023-07-14 23:08:04,106] Trial 1 finished with value: 0.13724569640062598 and parameters: {'Number of layers': 5, 'num_l0': 7, 'dropout_prb_l0': 1.0, 'num_l1': 4, 'dropout_prb_l1': 1.0, 'num_l2': 5, 'dropout_prb_l2': 1.0, 'num_l3': 8, 'dropout_prb_l3': 0.2, 'num_l4': 4, 'dropout_prb_l4': 0.6000000000000001, 'epochs': 600, 'optimizer': 'AdamW', 'learning_rate': 0.0002756878383274005}. Best is trial 0 with value: 0.32602970066260445.


---------------------------------------------
5-fold Validation MAPK: 0.13724569640062598




[I 2023-07-14 23:08:10,652] Trial 2 finished with value: 0.11643358971797688 and parameters: {'Number of layers': 5, 'num_l0': 5, 'dropout_prb_l0': 0.2, 'num_l1': 7, 'dropout_prb_l1': 1.0, 'num_l2': 5, 'dropout_prb_l2': 0.8, 'num_l3': 6, 'dropout_prb_l3': 0.0, 'num_l4': 5, 'dropout_prb_l4': 0.2, 'epochs': 400, 'optimizer': 'SGD', 'learning_rate': 0.00014543607865440065}. Best is trial 0 with value: 0.32602970066260445.


---------------------------------------------
5-fold Validation MAPK: 0.11643358971797688




[I 2023-07-14 23:08:16,073] Trial 3 finished with value: 0.3179985349448939 and parameters: {'Number of layers': 3, 'num_l0': 4, 'dropout_prb_l0': 0.0, 'num_l1': 8, 'dropout_prb_l1': 0.4, 'num_l2': 7, 'dropout_prb_l2': 0.6000000000000001, 'epochs': 400, 'optimizer': 'NAdam', 'learning_rate': 9.789281692902198e-05}. Best is trial 0 with value: 0.32602970066260445.


---------------------------------------------
5-fold Validation MAPK: 0.3179985349448939




[I 2023-07-14 23:08:23,480] Trial 4 finished with value: 0.27063396930043615 and parameters: {'Number of layers': 3, 'num_l0': 6, 'dropout_prb_l0': 0.0, 'num_l1': 4, 'dropout_prb_l1': 0.6000000000000001, 'num_l2': 4, 'dropout_prb_l2': 0.0, 'epochs': 600, 'optimizer': 'AdamW', 'learning_rate': 0.05734553476145063}. Best is trial 0 with value: 0.32602970066260445.


---------------------------------------------
5-fold Validation MAPK: 0.27063396930043615




[I 2023-07-14 23:08:34,271] Trial 5 finished with value: 0.31048679785569205 and parameters: {'Number of layers': 2, 'num_l0': 4, 'dropout_prb_l0': 0.2, 'num_l1': 9, 'dropout_prb_l1': 0.4, 'epochs': 1000, 'optimizer': 'NAdam', 'learning_rate': 5.204225407295396e-05}. Best is trial 0 with value: 0.32602970066260445.


---------------------------------------------
5-fold Validation MAPK: 0.31048679785569205




[I 2023-07-14 23:08:38,954] Trial 6 finished with value: 0.3069623414244331 and parameters: {'Number of layers': 3, 'num_l0': 6, 'dropout_prb_l0': 0.6000000000000001, 'num_l1': 7, 'dropout_prb_l1': 0.2, 'num_l2': 6, 'dropout_prb_l2': 0.2, 'epochs': 400, 'optimizer': 'RMSprop', 'learning_rate': 0.0001654675864462945}. Best is trial 0 with value: 0.32602970066260445.


---------------------------------------------
5-fold Validation MAPK: 0.3069623414244331




[I 2023-07-14 23:08:45,656] Trial 7 finished with value: 0.14551326873772183 and parameters: {'Number of layers': 4, 'num_l0': 6, 'dropout_prb_l0': 0.6000000000000001, 'num_l1': 8, 'dropout_prb_l1': 0.8, 'num_l2': 9, 'dropout_prb_l2': 0.6000000000000001, 'num_l3': 8, 'dropout_prb_l3': 1.0, 'epochs': 400, 'optimizer': 'NAdam', 'learning_rate': 5.521920813490299e-05}. Best is trial 0 with value: 0.32602970066260445.


---------------------------------------------
5-fold Validation MAPK: 0.14551326873772183




[I 2023-07-14 23:08:48,883] Trial 8 finished with value: 0.13536276762228217 and parameters: {'Number of layers': 4, 'num_l0': 7, 'dropout_prb_l0': 0.2, 'num_l1': 8, 'dropout_prb_l1': 1.0, 'num_l2': 6, 'dropout_prb_l2': 0.8, 'num_l3': 6, 'dropout_prb_l3': 0.2, 'epochs': 200, 'optimizer': 'Adam', 'learning_rate': 0.0002880808334401327}. Best is trial 0 with value: 0.32602970066260445.


---------------------------------------------
5-fold Validation MAPK: 0.13536276762228217




[I 2023-07-14 23:09:01,973] Trial 9 finished with value: 0.23458362467951915 and parameters: {'Number of layers': 3, 'num_l0': 9, 'dropout_prb_l0': 0.8, 'num_l1': 6, 'dropout_prb_l1': 0.6000000000000001, 'num_l2': 5, 'dropout_prb_l2': 0.6000000000000001, 'epochs': 1000, 'optimizer': 'AdamW', 'learning_rate': 1.3546110875192154e-05}. Best is trial 0 with value: 0.32602970066260445.


---------------------------------------------
5-fold Validation MAPK: 0.23458362467951915




[I 2023-07-14 23:09:05,062] Trial 10 finished with value: 0.3266939699663703 and parameters: {'Number of layers': 2, 'num_l0': 9, 'dropout_prb_l0': 0.4, 'num_l1': 9, 'dropout_prb_l1': 0.0, 'epochs': 200, 'optimizer': 'Adamax', 'learning_rate': 0.0030204231520277736}. Best is trial 10 with value: 0.3266939699663703.


---------------------------------------------
5-fold Validation MAPK: 0.3266939699663703




[I 2023-07-14 23:09:08,261] Trial 11 finished with value: 0.3340891685812273 and parameters: {'Number of layers': 2, 'num_l0': 9, 'dropout_prb_l0': 0.4, 'num_l1': 9, 'dropout_prb_l1': 0.0, 'epochs': 200, 'optimizer': 'Adamax', 'learning_rate': 0.003065053133099761}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.3340891685812273




[I 2023-07-14 23:09:11,382] Trial 12 finished with value: 0.3281606899077681 and parameters: {'Number of layers': 2, 'num_l0': 9, 'dropout_prb_l0': 0.4, 'num_l1': 9, 'dropout_prb_l1': 0.0, 'epochs': 200, 'optimizer': 'Adamax', 'learning_rate': 0.0029153340590448505}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.3281606899077681




[I 2023-07-14 23:09:23,475] Trial 13 finished with value: 0.32577997536043685 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 0.4, 'num_l1': 6, 'dropout_prb_l1': 0.2, 'epochs': 800, 'optimizer': 'Adamax', 'learning_rate': 0.007929518396366294}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.32577997536043685




[I 2023-07-14 23:09:26,636] Trial 14 finished with value: 0.33050977258349146 and parameters: {'Number of layers': 2, 'num_l0': 9, 'dropout_prb_l0': 0.8, 'num_l1': 9, 'dropout_prb_l1': 0.0, 'epochs': 200, 'optimizer': 'Adamax', 'learning_rate': 0.01017469980191424}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.33050977258349146




[I 2023-07-14 23:09:29,925] Trial 15 finished with value: 0.17994872307128823 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 1.0, 'num_l1': 5, 'dropout_prb_l1': 0.2, 'epochs': 200, 'optimizer': 'Adamax', 'learning_rate': 0.014805163864751154}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.17994872307128823




[I 2023-07-14 23:09:42,998] Trial 16 finished with value: 0.17052242533213463 and parameters: {'Number of layers': 4, 'num_l0': 8, 'dropout_prb_l0': 0.8, 'num_l1': 8, 'dropout_prb_l1': 0.2, 'num_l2': 9, 'dropout_prb_l2': 0.2, 'num_l3': 4, 'dropout_prb_l3': 1.0, 'epochs': 800, 'optimizer': 'Adam', 'learning_rate': 0.0011338245471515687}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.17052242533213463




[I 2023-07-14 23:09:47,306] Trial 17 finished with value: 0.3005960110545067 and parameters: {'Number of layers': 3, 'num_l0': 9, 'dropout_prb_l0': 0.8, 'num_l1': 7, 'dropout_prb_l1': 0.0, 'num_l2': 8, 'dropout_prb_l2': 0.0, 'epochs': 400, 'optimizer': 'SGD', 'learning_rate': 0.011772499061573925}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.3005960110545067




[I 2023-07-14 23:09:52,980] Trial 18 finished with value: 0.293483834448773 and parameters: {'Number of layers': 2, 'num_l0': 7, 'dropout_prb_l0': 0.6000000000000001, 'num_l1': 9, 'dropout_prb_l1': 0.4, 'epochs': 600, 'optimizer': 'RMSprop', 'learning_rate': 0.0910412542015756}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.293483834448773




[I 2023-07-14 23:09:57,261] Trial 19 finished with value: 0.16258116072320444 and parameters: {'Number of layers': 3, 'num_l0': 9, 'dropout_prb_l0': 0.8, 'num_l1': 8, 'dropout_prb_l1': 0.0, 'num_l2': 7, 'dropout_prb_l2': 1.0, 'epochs': 200, 'optimizer': 'Adamax', 'learning_rate': 0.0005610125240586225}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.16258116072320444




[I 2023-07-14 23:10:06,536] Trial 20 finished with value: 0.1553524456431259 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 1.0, 'num_l1': 5, 'dropout_prb_l1': 0.2, 'epochs': 600, 'optimizer': 'Adamax', 'learning_rate': 0.0052078075483533}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.1553524456431259




[I 2023-07-14 23:10:09,745] Trial 21 finished with value: 0.3250874038557586 and parameters: {'Number of layers': 2, 'num_l0': 9, 'dropout_prb_l0': 0.4, 'num_l1': 9, 'dropout_prb_l1': 0.0, 'epochs': 200, 'optimizer': 'Adamax', 'learning_rate': 0.003071410130217537}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.3250874038557586




[I 2023-07-14 23:10:12,912] Trial 22 finished with value: 0.32360736523157857 and parameters: {'Number of layers': 2, 'num_l0': 9, 'dropout_prb_l0': 0.6000000000000001, 'num_l1': 9, 'dropout_prb_l1': 0.0, 'epochs': 200, 'optimizer': 'Adamax', 'learning_rate': 0.024169094522565607}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.32360736523157857




[I 2023-07-14 23:10:19,223] Trial 23 finished with value: 0.33285386075317147 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 0.2, 'num_l1': 8, 'dropout_prb_l1': 0.2, 'epochs': 400, 'optimizer': 'Adamax', 'learning_rate': 0.003705336219023195}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.33285386075317147




[I 2023-07-14 23:10:27,320] Trial 24 finished with value: 0.3255901841307895 and parameters: {'Number of layers': 3, 'num_l0': 8, 'dropout_prb_l0': 0.2, 'num_l1': 8, 'dropout_prb_l1': 0.2, 'num_l2': 4, 'dropout_prb_l2': 0.4, 'epochs': 400, 'optimizer': 'Adamax', 'learning_rate': 0.020946616412132968}. Best is trial 11 with value: 0.3340891685812273.


---------------------------------------------
5-fold Validation MAPK: 0.3255901841307895




[I 2023-07-14 23:10:33,282] Trial 25 finished with value: 0.3468118403089934 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 0.0, 'num_l1': 8, 'dropout_prb_l1': 0.2, 'epochs': 400, 'optimizer': 'Adamax', 'learning_rate': 0.005014075233600803}. Best is trial 25 with value: 0.3468118403089934.


---------------------------------------------
5-fold Validation MAPK: 0.3468118403089934




[I 2023-07-14 23:10:38,038] Trial 26 finished with value: 0.2840725202277495 and parameters: {'Number of layers': 3, 'num_l0': 7, 'dropout_prb_l0': 0.0, 'num_l1': 7, 'dropout_prb_l1': 0.4, 'num_l2': 8, 'dropout_prb_l2': 0.4, 'epochs': 400, 'optimizer': 'RMSprop', 'learning_rate': 0.0008546247639185668}. Best is trial 25 with value: 0.3468118403089934.


---------------------------------------------
5-fold Validation MAPK: 0.2840725202277495




[I 2023-07-14 23:10:41,503] Trial 27 finished with value: 0.3083291712449639 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 0.0, 'num_l1': 8, 'dropout_prb_l1': 0.2, 'epochs': 400, 'optimizer': 'SGD', 'learning_rate': 0.00218846052948995}. Best is trial 25 with value: 0.3468118403089934.


---------------------------------------------
5-fold Validation MAPK: 0.3083291712449639




[I 2023-07-14 23:10:51,035] Trial 28 finished with value: 0.3017014617254353 and parameters: {'Number of layers': 4, 'num_l0': 7, 'dropout_prb_l0': 0.2, 'num_l1': 7, 'dropout_prb_l1': 0.6000000000000001, 'num_l2': 8, 'dropout_prb_l2': 0.2, 'num_l3': 4, 'dropout_prb_l3': 0.6000000000000001, 'epochs': 600, 'optimizer': 'Adam', 'learning_rate': 0.005668670414999856}. Best is trial 25 with value: 0.3468118403089934.


---------------------------------------------
5-fold Validation MAPK: 0.3017014617254353




[I 2023-07-14 23:10:57,105] Trial 29 finished with value: 0.3613058968468018 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 0.2, 'num_l1': 6, 'dropout_prb_l1': 0.4, 'epochs': 400, 'optimizer': 'Adamax', 'learning_rate': 0.0014724815864737035}. Best is trial 29 with value: 0.3613058968468018.


---------------------------------------------
5-fold Validation MAPK: 0.3613058968468018




[I 2023-07-14 23:11:08,865] Trial 30 finished with value: 0.3375953118236606 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 0.0, 'num_l1': 6, 'dropout_prb_l1': 0.4, 'epochs': 800, 'optimizer': 'Adamax', 'learning_rate': 0.00148688577437784}. Best is trial 29 with value: 0.3613058968468018.


---------------------------------------------
5-fold Validation MAPK: 0.3375953118236606




[I 2023-07-14 23:11:20,615] Trial 31 finished with value: 0.3276629041387807 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 0.0, 'num_l1': 6, 'dropout_prb_l1': 0.4, 'epochs': 800, 'optimizer': 'Adamax', 'learning_rate': 0.0015383889115363394}. Best is trial 29 with value: 0.3613058968468018.


---------------------------------------------
5-fold Validation MAPK: 0.3276629041387807




[I 2023-07-14 23:11:32,556] Trial 32 finished with value: 0.31092464955215926 and parameters: {'Number of layers': 2, 'num_l0': 8, 'dropout_prb_l0': 0.0, 'num_l1': 5, 'dropout_prb_l1': 0.4, 'epochs': 800, 'optimizer': 'Adamax', 'learning_rate': 0.0014205108140507273}. Best is trial 29 with value: 0.3613058968468018.


---------------------------------------------
5-fold Validation MAPK: 0.31092464955215926




[I 2023-07-14 23:11:41,493] Trial 33 finished with value: 0.3399627076882096 and parameters: {'Number of layers': 2, 'num_l0': 7, 'dropout_prb_l0': 0.2, 'num_l1': 6, 'dropout_prb_l1': 0.6000000000000001, 'epochs': 600, 'optimizer': 'Adamax', 'learning_rate': 0.0006198551796928318}. Best is trial 29 with value: 0.3613058968468018.


---------------------------------------------
5-fold Validation MAPK: 0.3399627076882096




[I 2023-07-14 23:11:47,738] Trial 34 finished with value: 0.36755235907168776 and parameters: {'Number of layers': 2, 'num_l0': 7, 'dropout_prb_l0': 0.2, 'num_l1': 6, 'dropout_prb_l1': 0.8, 'epochs': 600, 'optimizer': 'AdamW', 'learning_rate': 0.0005175593986696155}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.36755235907168776




[I 2023-07-14 23:11:59,015] Trial 35 finished with value: 0.18696600406219824 and parameters: {'Number of layers': 5, 'num_l0': 7, 'dropout_prb_l0': 0.2, 'num_l1': 6, 'dropout_prb_l1': 0.8, 'num_l2': 7, 'dropout_prb_l2': 0.8, 'num_l3': 9, 'dropout_prb_l3': 0.6000000000000001, 'num_l4': 9, 'dropout_prb_l4': 1.0, 'epochs': 600, 'optimizer': 'AdamW', 'learning_rate': 0.00043957736509462166}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.18696600406219824




[I 2023-07-14 23:12:06,605] Trial 36 finished with value: 0.3399577131821663 and parameters: {'Number of layers': 3, 'num_l0': 5, 'dropout_prb_l0': 0.2, 'num_l1': 5, 'dropout_prb_l1': 0.8, 'num_l2': 4, 'dropout_prb_l2': 0.0, 'epochs': 600, 'optimizer': 'AdamW', 'learning_rate': 0.0007288256867555272}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.3399577131821663




[I 2023-07-14 23:12:14,760] Trial 37 finished with value: 0.18616688309526186 and parameters: {'Number of layers': 3, 'num_l0': 6, 'dropout_prb_l0': 0.2, 'num_l1': 4, 'dropout_prb_l1': 0.6000000000000001, 'num_l2': 9, 'dropout_prb_l2': 1.0, 'epochs': 600, 'optimizer': 'AdamW', 'learning_rate': 0.00037085927048932334}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.18616688309526186




[I 2023-07-14 23:12:21,018] Trial 38 finished with value: 0.3331635201278594 and parameters: {'Number of layers': 2, 'num_l0': 7, 'dropout_prb_l0': 0.0, 'num_l1': 6, 'dropout_prb_l1': 0.8, 'epochs': 600, 'optimizer': 'NAdam', 'learning_rate': 0.0008153094888704268}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.3331635201278594




[I 2023-07-14 23:12:25,175] Trial 39 finished with value: 0.3142176938700762 and parameters: {'Number of layers': 2, 'num_l0': 6, 'dropout_prb_l0': 0.2, 'num_l1': 5, 'dropout_prb_l1': 0.6000000000000001, 'epochs': 400, 'optimizer': 'AdamW', 'learning_rate': 0.00019824497840078341}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.3142176938700762




[I 2023-07-14 23:12:29,682] Trial 40 finished with value: 0.15781307228715077 and parameters: {'Number of layers': 3, 'num_l0': 5, 'dropout_prb_l0': 0.0, 'num_l1': 7, 'dropout_prb_l1': 1.0, 'num_l2': 6, 'dropout_prb_l2': 0.4, 'epochs': 400, 'optimizer': 'SGD', 'learning_rate': 0.0005411438779951037}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.15781307228715077




[I 2023-07-14 23:12:40,514] Trial 41 finished with value: 0.2972463623347651 and parameters: {'Number of layers': 5, 'num_l0': 5, 'dropout_prb_l0': 0.2, 'num_l1': 5, 'dropout_prb_l1': 0.8, 'num_l2': 4, 'dropout_prb_l2': 0.0, 'num_l3': 5, 'dropout_prb_l3': 0.8, 'num_l4': 8, 'dropout_prb_l4': 0.0, 'epochs': 600, 'optimizer': 'AdamW', 'learning_rate': 0.0007722196098269275}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.2972463623347651




[I 2023-07-14 23:12:48,264] Trial 42 finished with value: 0.31949189225185626 and parameters: {'Number of layers': 3, 'num_l0': 5, 'dropout_prb_l0': 0.2, 'num_l1': 4, 'dropout_prb_l1': 0.8, 'num_l2': 5, 'dropout_prb_l2': 0.2, 'epochs': 600, 'optimizer': 'AdamW', 'learning_rate': 0.0002742073585513442}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.31949189225185626




[I 2023-07-14 23:12:54,548] Trial 43 finished with value: 0.343715246562115 and parameters: {'Number of layers': 2, 'num_l0': 6, 'dropout_prb_l0': 0.4, 'num_l1': 6, 'dropout_prb_l1': 0.8, 'epochs': 600, 'optimizer': 'AdamW', 'learning_rate': 0.0019310026033814848}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.343715246562115




[I 2023-07-14 23:12:58,707] Trial 44 finished with value: 0.34299437285652445 and parameters: {'Number of layers': 2, 'num_l0': 6, 'dropout_prb_l0': 0.4, 'num_l1': 6, 'dropout_prb_l1': 0.6000000000000001, 'epochs': 400, 'optimizer': 'AdamW', 'learning_rate': 0.0018633280832952943}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.34299437285652445




[I 2023-07-14 23:13:03,059] Trial 45 finished with value: 0.16457729830519763 and parameters: {'Number of layers': 2, 'num_l0': 6, 'dropout_prb_l0': 0.4, 'num_l1': 6, 'dropout_prb_l1': 1.0, 'epochs': 400, 'optimizer': 'AdamW', 'learning_rate': 0.0022326037940125854}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.16457729830519763




[I 2023-07-14 23:13:07,321] Trial 46 finished with value: 0.3625178969799887 and parameters: {'Number of layers': 2, 'num_l0': 6, 'dropout_prb_l0': 0.4, 'num_l1': 7, 'dropout_prb_l1': 0.8, 'epochs': 400, 'optimizer': 'AdamW', 'learning_rate': 0.001225559951234991}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.3625178969799887




[I 2023-07-14 23:13:11,675] Trial 47 finished with value: 0.1710951286917724 and parameters: {'Number of layers': 2, 'num_l0': 6, 'dropout_prb_l0': 0.4, 'num_l1': 7, 'dropout_prb_l1': 1.0, 'epochs': 400, 'optimizer': 'AdamW', 'learning_rate': 0.0010063663402173654}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.1710951286917724




[I 2023-07-14 23:13:15,931] Trial 48 finished with value: 0.3450221423101255 and parameters: {'Number of layers': 2, 'num_l0': 7, 'dropout_prb_l0': 0.4, 'num_l1': 7, 'dropout_prb_l1': 0.8, 'epochs': 400, 'optimizer': 'AdamW', 'learning_rate': 0.0012369644023945106}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.3450221423101255




[I 2023-07-14 23:13:20,379] Trial 49 finished with value: 0.33737721839310086 and parameters: {'Number of layers': 2, 'num_l0': 7, 'dropout_prb_l0': 0.6000000000000001, 'num_l1': 7, 'dropout_prb_l1': 0.8, 'epochs': 400, 'optimizer': 'NAdam', 'learning_rate': 0.005006304358310314}. Best is trial 34 with value: 0.36755235907168776.


---------------------------------------------
5-fold Validation MAPK: 0.33737721839310086




# 
#### Getting the best `map@k` and the best model parameters found during tuning

In [23]:
print(f'Best MAPK while optimizing: {study.best_trial.value}')

Best MAPK while optimizing: 0.36755235907168776
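
For reference, `map@k` scores a prediction by the reciprocal rank of the true class within the top `k` guesses, and gives zero when the true class is not among them. The sketch below is a minimal pure-Python version for the single-label case, with `k=3` assumed as the default. `mapk_single_label` is a hypothetical name used only for illustration; it is not the notebook's own `mapk_score`, which may differ in detail.

```python
def mapk_single_label(y_true, y_prob, k=3):
    """Mean average precision at k when each sample has exactly one true class.

    y_true: list of integer class ids.
    y_prob: list of per-class probability lists, one per sample.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Class ids ordered from most to least probable, truncated to k.
        top_k = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]
        if label in top_k:
            # Reciprocal rank: 1 for first place, 1/2 for second, and so on.
            total += 1.0 / (top_k.index(label) + 1)
    return total / len(y_true)


# Toy inputs (illustrative only): the true class is ranked 1st, 2nd and 3rd.
y_true = [0, 1, 2]
y_prob = [
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.40, 0.35, 0.25],
]
score = mapk_single_label(y_true, y_prob, k=3)
```

With these toy inputs the three samples contribute 1, 1/2 and 1/3, so the score is their mean.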


In [24]:
for key, value in study.best_trial.params.items():
    print(f'{key}: {value}')

Number of layers: 2
num_l0: 7
dropout_prb_l0: 0.2
num_l1: 6
dropout_prb_l1: 0.8
epochs: 600
optimizer: AdamW
learning_rate: 0.0005175593986696155


#
## 4. Cross-validation evaluation of the tuned neural network

# 
#### Creating the network

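A new neural network is then built from scratch with the tuned hyperparameters: two hidden layers of `2**7` and `2**6` units, with dropout probabilities of 0.2 and 0.8 respectively.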

In [27]:
class mcClassifier(nn.Module):
    """Feedforward MLP with batch normalization and dropout."""

    def __init__(self, num_input, num_output=11):
        super().__init__()


        # tuned number of layers
        self.linear_layer_nn = nn.Sequential(

            # layer 1 with tuned parameters
            nn.Linear(in_features = num_input, out_features = 2**7),
            nn.BatchNorm1d(2**7),
            nn.ReLU(),
            nn.Dropout(p=0.2),

            # layer 2 with tuned parameters
            nn.Linear(2**7, 2**6),
            nn.BatchNorm1d(2**6),
            nn.ReLU(),
            nn.Dropout(p=0.8),

            # nn.Linear(2**7, 2**7),
            # nn.BatchNorm1d(2**7),
            # nn.ReLU(),
            # nn.Dropout(p=0.8),

            # nn.Linear(2**7, 2**7),
            # nn.BatchNorm1d(2**7),
            # nn.ReLU(),
            # nn.Dropout(p=0.8),

            nn.Linear(2**6, num_output)
        )

    def forward(self, x):
        return self.linear_layer_nn(x)
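
The hidden widths in the class above are hard-coded as `2**7` and `2**6`, which matches `num_l0: 7` and `num_l1: 6` in the best trial. As a hedged sketch, the same layout could be derived from the parameter dictionary instead of being typed by hand. `layer_specs` is a hypothetical helper, and the rule that a layer's width is `2**num_l{i}` is an assumption inferred from the hard-coded values.

```python
def layer_specs(params, num_input, num_output=11):
    """Translate tuned parameters into (in_features, out_features, dropout) tuples.

    Assumes each hidden layer has 2**num_l{i} units (inferred, not confirmed).
    The final tuple describes the output layer and carries no dropout.
    """
    specs = []
    in_features = num_input
    for i in range(params['Number of layers']):
        out_features = 2 ** params[f'num_l{i}']
        specs.append((in_features, out_features, params[f'dropout_prb_l{i}']))
        in_features = out_features
    specs.append((in_features, num_output, None))
    return specs


# Best-trial parameters as printed above; num_input=20 is a placeholder value.
best_params = {
    'Number of layers': 2,
    'num_l0': 7,
    'dropout_prb_l0': 0.2,
    'num_l1': 6,
    'dropout_prb_l1': 0.8,
}
specs = layer_specs(best_params, num_input=20)
```

Each tuple could then be turned into an `nn.Linear`, `nn.BatchNorm1d`, `nn.ReLU`, `nn.Dropout` group, which avoids editing the class by hand after every tuning run.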

# 
#### 10-fold evaluation

The created network is then trained and evaluated to obtain the `10-fold map@k` score.

In [28]:
# tuned number of epochs
epochs = 600

# 10-fold CV
k = 10
skf = StratifiedKFold(n_splits=k)

mapk_list = []

for i, (train_index, test_index) in enumerate(skf.split(X, y)):

    # resetting model weights
    mcModel = mcClassifier(num_input=n_features, num_output=11).to(device)

    loss_function = nn.CrossEntropyLoss()

    # tuned optimizer and learning rate
    optimizer = torch.optim.AdamW(params=mcModel.parameters(), lr=0.0005175593986696155)
    
    print(f"Fold {i}:")
    
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]

    X_train_tensor = torch.from_numpy(X_train.values).type(torch.float).to(device)
    y_train_tensor = torch.from_numpy(y_train.values).type(torch.LongTensor).to(device)

    X_test_tensor = torch.from_numpy(X_test.values).type(torch.float).to(device)
    y_test_tensor = torch.from_numpy(y_test.values).type(torch.LongTensor).to(device)


    for epoch in range(epochs):
        
        # enter training mode
        mcModel.train()

        y_train_logits = mcModel(X_train_tensor)
        loss = loss_function(y_train_logits, y_train_tensor)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # log progress every 100 epochs; the metric is only computed when it is printed
        if epoch % 100 == 0:
            y_train_pred_probs = torch.softmax(y_train_logits, dim=1)
            mapk = mapk_score(y_train, y_train_pred_probs.cpu().detach().numpy())
            print(f'Epoch: {epoch}\nTrain Loss: {loss}\nTrain MAPK: {mapk}\n')


    # enter evaluation mode (dropout off, batch-norm uses running statistics)
    mcModel.eval()

    # gradients are not needed for evaluation
    with torch.no_grad():
        y_test_logits = mcModel(X_test_tensor)
        y_test_pred_probs = torch.softmax(y_test_logits, dim=1).cpu().numpy()

    # compute the fold score once and reuse it
    fold_mapk = mapk_score(y_test, y_test_pred_probs)
    print(f'\n---------------------------------------------\nTest MAPK: {fold_mapk}\n\n')

    mapk_list.append(fold_mapk)


print(f'\n---------------------------------------------\n{k}-fold Validation MAPK: {np.average(mapk_list)}')

Fold 0:
Epoch: 0
Train Loss: 2.7624573707580566
Train MAPK: 0.13967505241090147

Epoch: 100
Train Loss: 2.1396093368530273
Train MAPK: 0.29585953878406707

Epoch: 200
Train Loss: 2.027801990509033
Train MAPK: 0.3341194968553459

Epoch: 300
Train Loss: 1.9301660060882568
Train MAPK: 0.3472222222222222

Epoch: 400
Train Loss: 1.776792287826538
Train MAPK: 0.4166666666666667

Epoch: 500
Train Loss: 1.6231062412261963
Train MAPK: 0.4792976939203354


---------------------------------------------
Test MAPK: 0.3075117370892019


Fold 1:
Epoch: 0
Train Loss: 2.792454719543457
Train MAPK: 0.14622641509433962

Epoch: 100
Train Loss: 2.163517713546753
Train MAPK: 0.28249475890985326

Epoch: 200
Train Loss: 2.0273964405059814
Train MAPK: 0.3270440251572327

Epoch: 300
Train Loss: 1.9333668947219849
Train MAPK: 0.3288784067085954

Epoch: 400
Train Loss: 1.7458288669586182
Train MAPK: 0.4166666666666667

Epoch: 500
Train Loss: 1.6583613157272339
Train MAPK: 0.4761530398322851


--------------------
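
The evaluation loop above relies on stratified splitting, which keeps the class proportions roughly equal in every fold. The sketch below illustrates that idea in pure Python by dealing the members of each class round-robin across the folds. `stratified_folds` is a hypothetical, simplified stand-in and does not reproduce scikit-learn's `StratifiedKFold` index assignment.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign sample indices to folds so that every class is spread evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the members of one class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]


# Toy labels (illustrative only): three classes with four samples each.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
folds = stratified_folds(labels, n_splits=2)

# Each fold serves once as the test set; the remaining indices form the training set.
for test_index in folds:
    train_index = [i for i in range(len(labels)) if i not in test_index]
```

In this toy case each fold receives two samples of every class, so a score computed on any fold reflects all classes.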

# 
## 5. Generating Submissions

In [29]:
test_df = pd.read_csv('./data/Kaggle/test.csv')
col_id = test_df['id']

test_df = test_df.astype(int)
test_df = combine_features(test_df)
test_df = cluster_features(test_df)
test_df = affinity_propagation_features(test_df)
test_df = test_df[rfe_selector.get_feature_names_out()]

# n_features = test_df.shape[1] - 1
# test_df.columns = [f"x{i}" for i in range(1, n_features+2)]

test_df


Unnamed: 0,sudden_fever_xor_muscle_pain,sudden_fever_xor_joint_pain,sudden_fever_xor_pleural_effusion,sudden_fever_xor_weight_loss,headache_xor_diarrhea,headache_xor_nausea,headache_xor_chills,headache_xor_fatigue,headache_xor_facial_distortion,mouth_bleed_xor_joint_pain,...,yellow_eyes_xor_facial_distortion,microcephaly_or_toenail_loss,bitter_tongue_or_toenail_loss,cocacola_urine_or_toenail_loss,hyperpyrexia_or_itchiness,c_0,c_1,c_2,c_3,cluster
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,10109,4290,5720,2860,0
1,1,0,0,1,0,0,0,0,1,1,...,0,0,0,0,0,10185,4326,5865,2884,1
2,0,0,0,1,0,1,0,1,0,1,...,0,1,0,0,1,10542,4398,5937,3009,1
3,0,1,0,0,1,1,1,1,1,1,...,0,0,0,0,1,10225,4441,5887,3096,1
4,1,1,1,1,1,0,1,1,0,0,...,1,0,0,0,0,10637,4477,6020,3096,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,0,0,1,0,0,1,1,1,0,0,...,0,1,0,0,0,14573,6144,8192,4183,0
299,0,0,0,0,1,1,1,0,1,0,...,1,1,1,1,0,14945,6325,8460,4280,1
300,0,1,0,0,1,1,0,0,1,0,...,0,1,1,0,0,14798,6417,8405,4180,0
301,0,1,1,0,0,0,1,1,0,1,...,0,0,0,0,1,14813,6269,8408,4207,0


In [30]:
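test_model = mcClassifier(num_input=n_features, num_output=11).to(device)

# a freshly constructed model has random weights, so fit it on the full
# training set with the tuned settings before predicting
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(params=test_model.parameters(), lr=0.0005175593986696155)
X_tensor = torch.from_numpy(X.values).type(torch.float).to(device)
y_tensor = torch.from_numpy(y.values).type(torch.LongTensor).to(device)

test_model.train()
for epoch in range(epochs):
    loss = loss_function(test_model(X_tensor), y_tensor)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

test_tensor = torch.from_numpy(test_df.values).type(torch.float).to(device)
test_model.eval()
with torch.no_grad():
    y_test_pred_probs = torch.softmax(test_model(test_tensor), dim=1).cpu().numpy()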
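
The `torch.softmax(..., dim=1)` call in the cell above turns each row of logits into probabilities that sum to 1. A minimal pure-Python sketch of the same computation for a single row is shown below; the function name and the example logits are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```

The ordering of the logits is preserved, so ranking classes by probability gives the same result as ranking them by raw logit.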

In [31]:
test_sorted_prediction_ids = np.argsort(-y_test_pred_probs, axis=1)
test_top_3_prediction_ids = test_sorted_prediction_ids[:,:3]
original_shape = test_top_3_prediction_ids.shape
test_top_3_prediction = enc.inverse_transform(test_top_3_prediction_ids.reshape(-1, 1))
test_top_3_prediction = test_top_3_prediction.reshape(original_shape)

# join the three predicted labels of each row into one space-separated string
test_df['prognosis'] = [' '.join(row) for row in test_top_3_prediction]
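
The top-3 selection in this cell can be checked on a small array. The sketch below uses made-up probabilities and placeholder class names, and maps ids to labels with plain NumPy indexing as a stand-in for the notebook's `enc.inverse_transform`.

```python
import numpy as np

# Illustrative probabilities for two samples over four classes.
probs = np.array([
    [0.10, 0.60, 0.05, 0.25],
    [0.40, 0.10, 0.30, 0.20],
])
class_names = np.array(["A", "B", "C", "D"])  # placeholder labels, not the real classes

# Negating the array makes argsort return class ids from most to least probable.
top3_ids = np.argsort(-probs, axis=1)[:, :3]

# Fancy indexing maps each id to its label (stand-in for the encoder).
top3_labels = class_names[top3_ids]

# One space-separated string per sample, as in the submission column.
prognosis = [" ".join(row) for row in top3_labels]
```

For the first row the three largest probabilities belong to classes 1, 3 and 0, so the resulting string is `B D A`.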




In [32]:
submission = pd.DataFrame()

submission['id'] = col_id
submission['prognosis'] = test_df['prognosis']

submission.to_csv('./data/submission.csv', index=False)