# **Model Training and Tuning**

In this notebook, we train a Random Forest (RF) model on the Higgs boson dataset that was preprocessed in the previous notebook, `01_data_exploration`. The goal is to train and tune the model to reach the highest accuracy we can on held-out data.

## **Importing Libraries**

In this section, we import the libraries and packages used throughout the notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
)

## **Loading Data**

This code cell mounts Google Drive in Colab and loads the training, validation, and test datasets, which were saved there as pickle (`.pkl`) files.

In [None]:
# Mount Google Drive in Colab
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

# Load data from Google Drive
train_path = '/content/drive/MyDrive/Higgs_dataset/processed/training_data.pkl'
val_path   = '/content/drive/MyDrive/Higgs_dataset/processed/validation_data.pkl'
test_path  = '/content/drive/MyDrive/Higgs_dataset/processed/testing_data.pkl'

train_data = pd.read_pickle(train_path)
val_data = pd.read_pickle(val_path)
test_data = pd.read_pickle(test_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Prepare the data for training**
This code separates the features and class labels from the train, validation, and test datasets.

In [None]:
# Separate features and labels
y_train = train_data['class_label']
X_train = train_data.drop('class_label', axis=1)
y_val = val_data['class_label']
X_val = val_data.drop('class_label', axis=1)
y_test = test_data['class_label']
X_test = test_data.drop('class_label', axis=1)
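
The three splits above repeat the same two steps, so they could be wrapped in a small helper. The sketch below is one possible way to do that; `split_features_labels` is a hypothetical name, and the toy DataFrame uses made-up values that only stand in for `train_data`.

```python
import pandas as pd


def split_features_labels(df, label_col="class_label"):
    """Return (X, y): every column except label_col, and the label column."""
    if label_col not in df.columns:
        raise KeyError(f"Column {label_col!r} not found in DataFrame")
    return df.drop(columns=[label_col]), df[label_col]


# Toy frame standing in for train_data (made-up values, illustration only)
toy = pd.DataFrame({
    "feature_a": [0.1, 0.5, 0.9, 0.3],
    "feature_b": [1.2, 0.7, 0.4, 2.0],
    "class_label": [1, 0, 1, 0],
})

X_toy, y_toy = split_features_labels(toy)

# Quick sanity checks: matching row counts and the class balance
print(X_toy.shape, y_toy.shape)
print(y_toy.value_counts(normalize=True))
```

Checking the class balance at this point is useful because accuracy can be misleading when one class dominates.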


## **Train the RF model**
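This code sets the hyperparameters for the RF model: the number of trees in the forest (`n_estimators=10`) and the function used to measure the quality of a split (`criterion="entropy"`). The maximum tree depth is not set, so it keeps its default (`max_depth=None`) and each tree grows until its leaves are pure or contain fewer than `min_samples_split` samples. The model is then fitted and its accuracy on the training set is reported.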

In [None]:
# Fit the Random Forest classifier to the training set
RFclassifier = RandomForestClassifier(n_estimators=10, criterion="entropy")
RFclassifier.fit(X_train, y_train)

# Calculate the accuracy on the training set
# (pass the DataFrame, not .values, so feature names match those seen during fit)
y_train_pred = RFclassifier.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {train_accuracy * 100:.2f}%")



Training Accuracy: 98.81%
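
The hyperparameters above are fixed by hand, and training accuracy alone says little about how well the model generalizes. One way to tune them is a grid search with cross-validation, using the `GridSearchCV` class that is already imported. The sketch below is illustrative only: it runs on synthetic data from `make_classification` rather than the Higgs dataset, and the grid values are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Higgs features (assumption: not the real data)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative grid; limiting depth and leaf size constrains tree growth
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, 8, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on all of X_tr with the best parameters
best_rf = search.best_estimator_
print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.3f}")
print(f"Validation accuracy: {best_rf.score(X_va, y_va):.3f}")
```

In this notebook, the same pattern would presumably use `X_train` and `y_train` for the search and `X_val` and `y_val` for the final check, which keeps the test set untouched until the end.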


## **Model Evaluation**
## Make predictions on the test data and evaluate the model performance
This code uses the previously trained RF model to make predictions on the test data and reports the test accuracy.

In [None]:
# Make predictions on the test data
y_test_pred = RFclassifier.predict(X_test)

# Calculate the accuracy of the model on the test data
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")



Test Accuracy: 69.36%
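
The gap between the training accuracy (98.81%) and the test accuracy (69.36%) suggests that the model overfits, and accuracy alone does not show which kinds of errors it makes. The other metrics imported at the top of the notebook can give a fuller picture. The sketch below shows one way to compute them; it assumes binary labels encoded as 0 and 1 and uses synthetic data from `make_classification`, so the numbers it prints say nothing about the Higgs model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the Higgs test split (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
clf.fit(X_tr, y_tr)

y_pred = clf.predict(X_te)
# ROC AUC needs scores rather than hard labels: use P(class == 1)
y_score = clf.predict_proba(X_te)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred),
    "recall": recall_score(y_te, y_pred),
    "f1": f1_score(y_te, y_pred),
    "roc_auc": roc_auc_score(y_te, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)
```

With the notebook's own objects, the equivalent calls would take `y_test`, `y_test_pred`, and `RFclassifier.predict_proba(X_test)[:, 1]`.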
