<a href="https://colab.research.google.com/github/YahyaEryani/quantum-model/blob/main/notebooks/02_XGBoost_model_training_and_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Training and Tuning

 In this notebook, we will train an XGBoost model on the Higgs boson dataset we have preprocessed in the previous notebook `01_data_exploration`. We will perform the model training and tuning process to obtain the best model with the highest accuracy possible.

## Importing Libraries
In this section, we will import the necessary libraries and packages that will be used throughout the notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score,recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV

## Loading Data

This code cell loads the training, validation, and test datasets that were saved in CSV format to the local directory.

In [2]:
# Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

# Load data from Google Drive
train_path = '/content/drive/MyDrive/Higgs_dataset/processed/train.csv'
val_path   = '/content/drive/MyDrive/Higgs_dataset/processed/validation.csv'
test_path  = '/content/drive/MyDrive/Higgs_dataset/processed/test.csv'
train_data = pd.read_csv(train_path)
val_data = pd.read_csv(val_path)
test_data = pd.read_csv(test_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Prepare the data for training
This code separates the features and class labels from the train, validation, and test datasets, and creates DMatrix objects for the XGBoost model to train, validate, and test on.

In [3]:
# Separate features and labels
train_labels = train_data['class_label']
train_features = train_data.drop('class_label', axis=1)
val_labels = val_data['class_label']
val_features = val_data.drop('class_label', axis=1)
test_labels = test_data['class_label']
test_features = test_data.drop('class_label', axis=1)

# Convert the data into XGBoost DMatrix format
dtrain = xgb.DMatrix(train_features, label=train_labels)
dval = xgb.DMatrix(val_features, label=val_labels)
dtest = xgb.DMatrix(test_features, label=test_labels)

## Train the XGBoost model
This code sets the hyperparameters for an XGBoost model, including the maximum depth of trees, learning rate, subsampling rate, and evaluation metric. It then trains the XGBoost model using early stopping and a specified number of rounds.

In [4]:
# Set the XGBoost parameters
params = {
    'max_depth': 6,
    'eta': 0.3,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42
}

# Train the XGBoost model with early stopping
num_rounds = 1000
watchlist = [(dtrain, 'train'), (dval, 'eval')]
xgb_model = xgb.train(params, dtrain, num_rounds, evals=watchlist, early_stopping_rounds=10, verbose_eval=0)

# Model Evaluation


## Make predictions on the test data and evaluate the model performance
This code uses the XGBoost model that was previously trained to make predictions on the test data.

In [6]:
# Use the XGBoost model to make predictions on the test data
test_preds = xgb_model.predict(dtest)

# Convert the predicted probabilities to class labels
test_preds = [1 if x > 0.5 else 0 for x in test_preds]

# Calculate the accuracy of the model on the test data
accuracy = accuracy_score(test_labels, test_preds)
auc = roc_auc_score(test_labels, test_preds)
accuracy = accuracy_score(test_labels, test_preds)
precision = precision_score(test_labels, test_preds)
recall = recall_score(test_labels, test_preds)
f1 = f1_score(test_labels, test_preds)
print("Test AUC score: {:.4f}".format(auc))
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)

Test AUC score: 0.7363
Accuracy: 73.76%
Precision: 0.7475109188383597
Recall: 0.7598257869045615
F1 Score: 0.753618046932877
