<a href="https://colab.research.google.com/github/YahyaEryani/quantum-model/blob/main/notebooks/05_TabNet_model_training_and_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Training, Tuning, and Evaluation

In this notebook, we train a TabNet model on the Higgs boson dataset that was preprocessed in the `01_data_exploration` notebook. We then tune the model to reach the highest accuracy we can.

## Installing and Importing Libraries
In this section, we will install and import the necessary libraries and packages that will be used throughout the notebook.

In [1]:
!pip install torch==1.10.0+cpu torchvision==0.11.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/cpu/torch_stable.html


In [2]:
!pip install pytorch-tabnet

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!pip install torch -U


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch
  Downloading torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl (619.9 MB)
Collecting nvidia-curand-cu11==10.2.10.91
  Downloading nvidia_curand_cu11-10.2.10.91-py3-none-manylinux1_x86_64.whl (54.6 MB)
Collecting nvidia-nccl-cu11==2.14.3
  Downloading nvidia_nccl_cu11-2.14.3-py3-none-manylinux1_x86_64.whl (177.1 MB)
Collecting nvidia-nvtx-cu11==11.7.91
  Downloading nvidia_nvtx_cu11-11.7.91-py3-none-manylinux1_x86_64.whl (98 kB)

In [4]:
import torch
import numpy as np
import pandas as pd
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

## Loading Data

This code cell mounts Google Drive and loads the training, validation, and test datasets, which were saved there in pickle format.

In [5]:
# Mount Google Drive in Colab
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

# Load data from Google Drive
train_path = '/content/drive/MyDrive/Higgs_dataset/processed/training_data.pkl'
val_path   = '/content/drive/MyDrive/Higgs_dataset/processed/validation_data.pkl'
test_path  = '/content/drive/MyDrive/Higgs_dataset/processed/testing_data.pkl'

train_data = pd.read_pickle(train_path)
val_data = pd.read_pickle(val_path)
test_data = pd.read_pickle(test_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Prepare the data for training
This code separates the features and class labels from the train, validation, and test datasets.

In [6]:
# Separate features and labels
y_train = train_data['class_label']
X_train = train_data.drop('class_label', axis=1)
y_val = val_data['class_label']
X_val = val_data.drop('class_label', axis=1)
y_test = test_data['class_label']
X_test = test_data.drop('class_label', axis=1)
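
Because the evaluation later in this notebook reports accuracy, it can help to know how the class labels are distributed in each split: the share of the largest class is the accuracy a trivial majority-class predictor would reach. The helper below is a minimal sketch using only the standard library; the function name `class_balance` is hypothetical and not part of the original notebook.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of samples per class label.

    `labels` can be any iterable of hashable labels, for example a list
    or a pandas Series such as `y_train`.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {label: count / total for label, count in counts.items()}


# Toy example: one sample of class 0, three samples of class 1.
print(class_balance([0, 1, 1, 1]))
```

In the notebook this could be called as `class_balance(y_train)`, `class_balance(y_val)`, and `class_balance(y_test)` to confirm that the three splits have similar label proportions.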

## Train the TabNet model
This code sets the hyperparameters for a TabNet model, including the widths of the decision and attention embeddings (`n_d`, `n_a`), the number of decision steps, and the optimizer and learning-rate scheduler settings. It then trains the model with the Adam optimizer for up to 100 epochs with a batch size of 256. During training, AUC on the validation set is monitored, and training stops early if it does not improve for 10 consecutive epochs.
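
The training cell in this section pairs Adam with a `ReduceLROnPlateau` scheduler (`factor=0.8`, `patience=7`, `min_lr=1e-6`). To make the effect of those three numbers concrete, here is a simplified pure-Python simulation of a reduce-on-plateau rule. It is a sketch, not the PyTorch implementation: it ignores the improvement threshold and cooldown options, and the function name `simulate_reduce_on_plateau` is hypothetical.

```python
def simulate_reduce_on_plateau(metrics, lr, factor=0.8, patience=7,
                               min_lr=1e-6, mode="min"):
    """Return the learning rate in effect after each epoch.

    Simplified rule: any strict improvement of the monitored metric resets
    the counter; once more than `patience` epochs pass without improvement,
    the learning rate is multiplied by `factor` (never going below `min_lr`).
    """
    best = None
    bad_epochs = 0
    history = []
    for value in metrics:
        if mode == "min":
            improved = best is None or value < best
        else:
            improved = best is None or value > best
        if improved:
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history


# A metric that never improves after the first epoch triggers a reduction
# every patience + 1 = 8 epochs.
flat_metric = [0.60] * 20
learning_rates = simulate_reduce_on_plateau(flat_metric, lr=0.005)
print(learning_rates[7], learning_rates[8], learning_rates[16])
```

With `factor=0.8` each reduction is gentle, so many plateaus are needed before the rate approaches `min_lr`. Note that `mode` has to match the monitored quantity: `"min"` for a loss, `"max"` for a score such as AUC.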

In [11]:
tabnet_params = dict(
    n_d=32,  # Width of the decision embedding
    n_a=32,  # Width of the attention embedding
    n_steps=6,  # Number of sequential decision steps
    gamma=1.5,  # Relaxation factor: values above 1 let a feature be reused across steps
    n_independent=2,
    n_shared=2,
    epsilon=1e-4,  # Numerical-stability constant (the sparsity penalty weight is lambda_sparse)
    seed=42,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=0.005, weight_decay=1e-5),  # Small learning rate with light weight decay
    scheduler_params=dict(
        mode="min",  # "min" suits a loss; if the scheduler is stepped on val_0_auc, "max" would be the matching mode
        patience=7,  # Epochs without improvement before the learning rate is reduced
        min_lr=1e-6,  # Floor for the learning rate
        factor=0.8,  # Multiply the learning rate by 0.8 at each reduction (a gentle decay)
    ),
    scheduler_fn=torch.optim.lr_scheduler.ReduceLROnPlateau,
    mask_type="entmax",
    verbose=1
)

# Train the TabNet model
tabnet_model = TabNetClassifier(**tabnet_params)
tabnet_model.fit(
    X_train=X_train.values, y_train=y_train.values,
    eval_set=[(X_val.values, y_val.values)],
    max_epochs=100,
    patience=10,
    batch_size=256,
    virtual_batch_size=128,
    num_workers=0,
    drop_last=False,
)



epoch 0  | loss: 0.671   | val_0_auc: 0.70402 |  0:01:06s
epoch 1  | loss: 0.62438 | val_0_auc: 0.72351 |  0:02:14s
epoch 2  | loss: 0.60851 | val_0_auc: 0.74686 |  0:03:22s
epoch 3  | loss: 0.59422 | val_0_auc: 0.76263 |  0:04:29s
epoch 4  | loss: 0.57679 | val_0_auc: 0.77825 |  0:05:35s
epoch 5  | loss: 0.56328 | val_0_auc: 0.78844 |  0:06:42s
epoch 6  | loss: 0.55399 | val_0_auc: 0.79538 |  0:07:49s
epoch 7  | loss: 0.54699 | val_0_auc: 0.7998  |  0:08:55s
epoch 8  | loss: 0.54241 | val_0_auc: 0.80371 |  0:10:02s
epoch 9  | loss: 0.53607 | val_0_auc: 0.80827 |  0:11:09s
epoch 10 | loss: 0.53249 | val_0_auc: 0.81125 |  0:12:16s
epoch 11 | loss: 0.53065 | val_0_auc: 0.81093 |  0:13:22s
epoch 12 | loss: 0.52888 | val_0_auc: 0.81364 |  0:14:29s
epoch 13 | loss: 0.52539 | val_0_auc: 0.81595 |  0:15:35s
epoch 14 | loss: 0.52355 | val_0_auc: 0.81945 |  0:16:42s
epoch 15 | loss: 0.5217  | val_0_auc: 0.81686 |  0:17:48s
epoch 16 | loss: 0.52196 | val_0_auc: 0.81825 |  0:18:54s
epoch 17 | los
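
The `patience=10` argument passed to `fit` enables early stopping on the validation metric. The sketch below shows one common form of that rule applied to the `val_0_auc` values logged above for epochs 0 to 16. It assumes "stop once the metric has not improved for `patience` consecutive epochs"; the library's exact bookkeeping may differ, and the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(scores, patience=10, maximize=True):
    """Return (best_epoch, stop_epoch) for a non-empty list of per-epoch scores.

    `stop_epoch` is None if the run ends before patience is exhausted.
    """
    best_epoch = 0
    best_score = scores[0]
    for epoch, score in enumerate(scores[1:], start=1):
        better = score > best_score if maximize else score < best_score
        if better:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None


# Validation AUC copied from the training log above (epochs 0 to 16).
logged_auc = [
    0.70402, 0.72351, 0.74686, 0.76263, 0.77825, 0.78844, 0.79538,
    0.7998, 0.80371, 0.80827, 0.81125, 0.81093, 0.81364, 0.81595,
    0.81945, 0.81686, 0.81825,
]
best_epoch, stop_epoch = early_stopping_epoch(logged_auc, patience=10)
print(best_epoch, stop_epoch)
```

Over the logged epochs the best validation AUC is at epoch 14, and only two epochs have passed since then, so this rule would not yet have stopped training.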



In [12]:
# Calculate the accuracy on the training set
y_train_pred = tabnet_model.predict(X_train.values)
train_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {train_accuracy * 100:.2f}%")

Training Accuracy: 76.48%


## Make predictions on the test data and evaluate the model performance
This code uses the trained TabNet model to predict labels for the test data and reports the resulting accuracy.

In [13]:
# Make predictions on the test data
y_test_pred = tabnet_model.predict(X_test.values)

# Calculate the accuracy of the model on the test data
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

Test Accuracy: 75.13%
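
Training was monitored with AUC (`val_0_auc`), while the evaluation above reports accuracy, so the two numbers are not directly comparable. As an illustration of what AUC measures, the sketch below computes it from its pairwise definition: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The function name `roc_auc` is hypothetical, and the quadratic loop is only suitable for small samples.

```python
def roc_auc(y_true, scores):
    """Pairwise ROC AUC for binary labels (1 = positive, 0 = negative).

    Counts, over all positive/negative pairs, how often the positive sample
    receives the higher score; ties count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

For the full test set, a more practical route would be `sklearn.metrics.roc_auc_score(y_test, tabnet_model.predict_proba(X_test.values)[:, 1])`, assuming the positive class is encoded as the second column of the predicted probabilities.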
