<a href="https://colab.research.google.com/github/YahyaEryani/quantum-model/blob/main/notebooks/03_HGB_Training_and_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Training, Tuning, and Evaluation

In this notebook, we train a histogram-based gradient boosting classifier on the Higgs boson dataset that was preprocessed in the `01_data_exploration` notebook. We then tune and evaluate the model, aiming for the highest accuracy we can reach on held-out data.

## Importing Libraries
In this section, we will import the necessary libraries and packages that will be used throughout the notebook.

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# HistGradientBoostingClassifier is stable since scikit-learn 1.0, so the
# experimental enable_hist_gradient_boosting import is no longer needed.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

## Loading Data

This code cell mounts Google Drive in Colab and loads the training, validation, and test datasets, which were saved there in pickle format.

In [4]:
# Mount Google Drive in Colab
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

# Load data from Google Drive
train_path = '/content/drive/MyDrive/Higgs_dataset/processed/training_data.pkl'
val_path   = '/content/drive/MyDrive/Higgs_dataset/processed/validation_data.pkl'
test_path  = '/content/drive/MyDrive/Higgs_dataset/processed/testing_data.pkl'

train_data = pd.read_pickle(train_path)
val_data = pd.read_pickle(val_path)
test_data = pd.read_pickle(test_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
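
As a quick illustration of the pickle format used above, here is a minimal round-trip sketch. The toy DataFrame, its `feature_a` column, and the temporary file name are assumptions made for the example; they are not the notebook's real data or paths. Note that `pd.read_pickle` should only be used on files from a trusted source, because unpickling can execute arbitrary code.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for one of the processed datasets (hypothetical columns).
frame = pd.DataFrame({"feature_a": [0.1, 0.2], "class_label": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_data.pkl"  # hypothetical file name
    frame.to_pickle(demo_path)                   # save in pickle format
    restored = pd.read_pickle(demo_path)         # load it back

# Pickle preserves values, column names, and dtypes.
print(restored.equals(frame))
print(restored.dtypes.to_dict())
```

Unlike CSV, the pickle round trip keeps dtypes intact, which is presumably why the preprocessing notebook chose it.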


## Prepare the data for training
This code separates the features and class labels from the train, validation, and test datasets.

In [2]:
# Separate features and labels
y_train = train_data['class_label']
X_train = train_data.drop('class_label', axis=1)
y_val = val_data['class_label']
X_val = val_data.drop('class_label', axis=1)
y_test = test_data['class_label']
X_test = test_data.drop('class_label', axis=1)
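
The same split can be checked on a small scale. The sketch below uses a toy DataFrame (the feature column names are hypothetical; only `class_label` comes from the notebook) and verifies that the label column does not leak into the feature matrix.

```python
import pandas as pd

# Toy stand-in for train_data; 'feature_a' and 'feature_b' are hypothetical names.
toy = pd.DataFrame({
    "feature_a": [0.1, 0.5, 0.9, 1.3],
    "feature_b": [1.0, 0.0, 1.0, 0.0],
    "class_label": [0, 1, 1, 0],
})

# Same pattern as above: labels are one column, features are everything else.
y_toy = toy["class_label"]
X_toy = toy.drop(columns="class_label")

# The label column must not appear among the features, and row counts must match.
assert "class_label" not in X_toy.columns
assert len(X_toy) == len(y_toy)
print(X_toy.shape)
```

Running the equivalent assertions on `X_train`, `X_val`, and `X_test` is a cheap guard against accidentally training on the target.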


## Train the Histogram Gradient Boosting Classifier model
This code sets the hyperparameters for the Histogram Gradient Boosting Classifier, including the loss function, learning rate, maximum number of boosting iterations, maximum number of leaf nodes per tree, and maximum tree depth. It then trains the model for at most `max_iter` iterations. Early stopping is not set explicitly, so it stays at scikit-learn's default of `'auto'`, which enables it only when the training set has more than 10,000 samples.

In [8]:
# Set the hyperparameters for the Histogram Gradient Boosting Classifier model
hist_gradient_boosting = HistGradientBoostingClassifier(
    loss='log_loss',  # 'binary_crossentropy' was renamed to 'log_loss' in scikit-learn 1.1
    learning_rate=0.1,
    max_iter=100,
    max_leaf_nodes=31,
    max_depth=None,
    min_samples_leaf=20,
    l2_regularization=0.0,
    random_state=42,
)

# Train the model
hist_gradient_boosting.fit(X_train, y_train)

# Evaluate the model on the validation set (optional)
y_val_pred = hist_gradient_boosting.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {val_accuracy * 100:.2f}%")



Validation Accuracy: 73.05%


# Model Evaluation

## Make predictions on the test data and evaluate the model performance
This code uses the previously trained Histogram Gradient Boosting Classifier to make predictions on the test data and reports the resulting accuracy.

In [10]:
# Make predictions on the test set
y_pred = hist_gradient_boosting.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 73.01%
