# **Model Training and Tuning**

In this notebook, we train a Random Forest (RF) model on the Higgs boson dataset that was preprocessed in the previous notebook, `01_data_exploration`. We then tune the model's hyperparameters to find the configuration with the highest accuracy we can reach.

## **Importing Libraries**

In this section, we will import the necessary libraries and packages that will be used throughout the notebook.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score,
)
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

## **Loading Data**

This code cell mounts Google Drive and loads the training, validation, and test datasets, which were saved there in pickle (`.pkl`) format.

In [None]:
# Mount Google Drive in Colab
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

# Load data from Google Drive
train_path = '/content/drive/MyDrive/Colab Notebooks/training_data.pkl'
val_path   = '/content/drive/MyDrive/Colab Notebooks/validation_data.pkl'
test_path  = '/content/drive/MyDrive/Colab Notebooks/testing_data.pkl'

train_data = pd.read_pickle(train_path)
val_data = pd.read_pickle(val_path)
test_data = pd.read_pickle(test_path)

Mounted at /content/drive


## **Prepare the data for training**
This code separates the features and class labels from the train, validation, and test datasets.

In [None]:
# Separate features and labels
y_train = train_data['class_label']
X_train = train_data.drop('class_label', axis=1)
y_val = val_data['class_label']
X_val = val_data.drop('class_label', axis=1)
y_test = test_data['class_label']
X_test = test_data.drop('class_label', axis=1)


In [None]:
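# Standardize the feature columns (zero mean, unit variance).
# The scaler is fit on the training set only, so no statistics from the
# validation or test sets leak into training.
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)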
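
As a side note on the scaling step: a Random Forest splits on feature thresholds, so it is generally insensitive to feature scaling, and the scaler mainly matters if the same data is later fed to scale-sensitive models. One way to keep the "fit on training data only" rule automatic is to wrap the scaler and the model in a `Pipeline`. The sketch below is a minimal illustration on synthetic data; `X_demo`, `y_demo`, and `scaled_rf` are hypothetical names, not part of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Higgs features (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on whatever data is passed to fit(),
# then applies the same transformation at prediction time.
scaled_rf = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
scaled_rf.fit(X_tr, y_tr)

demo_accuracy = scaled_rf.score(X_te, y_te)
train_means = scaled_rf.named_steps["scaler"].mean_
print("Demo accuracy:", demo_accuracy)
```

Passing such a pipeline to a cross-validated search would refit the scaler inside each training fold, which avoids leaking fold statistics.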

## **Train the RF model**
This section sets the hyperparameters of the RF model, such as the number of trees in the forest (`n_estimators`), the function that measures the quality of a split (`criterion`), and the maximum depth of each tree (`max_depth`).
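
To make these three hyperparameters concrete, here is a minimal sketch on synthetic data. The values are arbitrary illustrations rather than the settings used in this notebook, and `X_demo`, `y_demo`, and `rf_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the Higgs dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

rf_demo = RandomForestClassifier(
    n_estimators=25,    # number of trees in the forest
    criterion="gini",   # split-quality function ("entropy" is another option)
    max_depth=4,        # upper bound on the depth of each tree
    random_state=0,
)
rf_demo.fit(X_demo, y_demo)

# Each fitted tree is available in estimators_; none should exceed max_depth
tree_depths = [tree.get_depth() for tree in rf_demo.estimators_]
print("Number of trees:", len(rf_demo.estimators_))
print("Deepest tree:", max(tree_depths))
```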

The following cells create a baseline classifier, run recursive feature elimination (RFE), and tune the hyperparameters with a randomized search. The tuned RF model is then used to make predictions on the validation and test data, where its performance is evaluated.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Baseline RF with 100 trees and a fixed seed for reproducibility
clf = RandomForestClassifier(n_estimators=100, random_state=42)

In [None]:
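from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_selection import RFE

# Hyperparameter search space.
# Note: for classifiers, 'auto' is a deprecated alias of 'sqrt' and was
# removed in scikit-learn 1.3; it is kept here to match the recorded output.
param_grid = {'n_estimators': [50, 100, 200],
              'max_depth': [5, 10, 20],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4],
              'max_features': ['auto', 'sqrt', 'log2']}

rf_model = RandomForestClassifier()

# Create an RFE object that keeps the 7 highest-ranked features
rfe = RFE(rf_model, n_features_to_select=7)

# Fit the RFE model to the data.
# Note: the selection is not applied to X_train afterwards, so the
# search in the next cell still runs on all features.
rfe.fit(X_train, y_train)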
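
A fitted `RFE` object exposes which features it kept through `support_` and `ranking_`, and `transform` reduces a feature matrix to the selected columns. The sketch below shows that usage on synthetic data, as one possible way to carry the selection forward; `X_demo`, `y_demo`, and `selector` are hypothetical names, and the tree count is kept small only for speed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data with 12 features (assumption: not the Higgs dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)

selector = RFE(
    RandomForestClassifier(n_estimators=20, random_state=0),
    n_features_to_select=7,
)
selector.fit(X_demo, y_demo)

# Boolean mask of kept features; selected features have rank 1
selected_mask = selector.support_
print("Selected feature indices:", selected_mask.nonzero()[0])
print("Feature ranking:", selector.ranking_)

# Reduce the matrix to the selected columns before any further training
X_demo_selected = selector.transform(X_demo)
print("Reduced shape:", X_demo_selected.shape)
```

The same fitted selector would have to be applied to the validation and test matrices so that all three share the same columns.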

In [None]:
# Create a RandomizedSearchCV object (10 sampled candidates, 5-fold CV)
random_search = RandomizedSearchCV(
    estimator=rf_model, param_distributions=param_grid,
    n_iter=10, cv=5, verbose=2, random_state=42,
)
# Fit the search on the full training set (all features)
random_search.fit(X_train, y_train)
print()
print("Best hyper parameters", random_search.best_params_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=50; total time=  41.2s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=50; total time=  31.8s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=50; total time=  31.4s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=50; total time=  29.1s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=50; total time=  27.5s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=50; total time=  27.8s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=50; total time=  28.5s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=50; total time=  29.0s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=50; total time=  28.0s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=50; total time=  27.5s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time= 2.9min
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time= 3.0min
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time= 2.9min
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time= 2.9min
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time= 2.9min
[CV] END max_depth=20, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time= 2.7min
[CV] END max_depth=20, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time= 2.7min
[CV] END max_depth=20, max_feature
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 3.3min
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 3.3min
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 3.3min
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 3.3min
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 3.3min
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=200; total time= 6.4min
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=200; total time= 6.4min
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=200; total time= 6.4min
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=200; total time= 6.4min
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=200; total time= 6.5min
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=50; total time=  43.9s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=10, n_estimators=50; total time=  43.8s
[CV] END max_depth=10, max_features
[CV] END max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=  27.6s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=  27.2s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=  27.4s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=  27.3s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=  27.7s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 3.5min
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 3.5min
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 3.6min
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 3.5min
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 3.5min

Best hyper parameters {'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 20}
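
Beyond `best_params_`, a fitted search object keeps the score of every sampled candidate in `cv_results_`, which helps judge how close the runners-up were. The sketch below shows one way to tabulate it, using a small synthetic problem so it runs quickly; `demo_search`, `demo_params`, and the data are hypothetical stand-ins for the search above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem (assumption: not the Higgs dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_params = {
    "n_estimators": [10, 20],
    "max_depth": [3, 5],
    "max_features": ["sqrt", "log2"],
}
demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=demo_params,
    n_iter=4, cv=3, random_state=0,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
```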


## **Validation Step**

In [None]:
# Predict the class labels for the validation set
valid_pred = random_search.predict(X_val)
# Calculate the accuracy score
accuracy = accuracy_score(y_val, valid_pred)
print("my accuracy on validation data is ", accuracy)

my accuracy on validation data is  0.7340644732469259


## **Testing Step**

In [None]:
from sklearn.metrics import accuracy_score
# Predict the class labels for the test set
test_pred = random_search.predict(X_test)
# Calculate the accuracy score
accuracy = accuracy_score(y_test, test_pred)
print("my accuracy on test data is ", accuracy)

my accuracy on test data is  0.7322888888888889
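
Accuracy alone can hide how errors are distributed between the two classes. The sketch below shows how the other metrics imported at the top of the notebook could be computed together; it runs on synthetic binary data, so the numbers it prints say nothing about the Higgs model, and `model`, `metrics`, and the data names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary problem (assumption: labels are 0/1, as f1_score expects by default)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(n_estimators=30, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)               # hard class labels
proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
    "roc_auc": roc_auc_score(y_te, proba),  # uses scores, not hard labels
}
cm = confusion_matrix(y_te, pred)  # rows: true class, columns: predicted class

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(cm)
```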


In [None]:
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

# Define the hyperparameter grid to search over (2**4 = 16 combinations)
param_grid = {
    'n_estimators': [100, 500],
    'max_depth': [5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 3]
}

# Create a Random Forest classifier object
rf_model = RandomForestClassifier(random_state=42)

# Create a RandomizedSearchCV object. The grid holds only 16 combinations,
# so n_iter=16 already covers it exhaustively (a larger n_iter is capped at 16).
random_search = RandomizedSearchCV(
    estimator=rf_model, param_distributions=param_grid,
    n_iter=16, cv=5, scoring='accuracy', verbose=2, random_state=42,
)

# Fit the RandomizedSearchCV model to the training data
random_search.fit(X_train, y_train)

# Print the best hyperparameters found by RandomizedSearchCV
print("Best hyperparameters:", random_search.best_params_)

# Make predictions on the validation set and calculate evaluation metrics.
# ROC AUC is computed from positive-class probabilities, not hard labels.
valid_pred = random_search.predict(X_val)
valid_proba = random_search.predict_proba(X_val)[:, 1]
accuracy = accuracy_score(y_val, valid_pred)
roc_auc = roc_auc_score(y_val, valid_proba)
f1 = f1_score(y_val, valid_pred)
print("Validation accuracy:", accuracy)
print("Validation ROC AUC score:", roc_auc)
print("Validation F1 score:", f1)

# Make predictions on the test set and calculate evaluation metrics
test_pred = random_search.predict(X_test)
test_proba = random_search.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, test_pred)
roc_auc = roc_auc_score(y_test, test_proba)
f1 = f1_score(y_test, test_pred)
print("Test accuracy:", accuracy)
print("Test ROC AUC score:", roc_auc)
print("Test F1 score:", f1)




Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.2min
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.2min
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.1min
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.1min
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.2min
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=500; total time= 5.6min
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=500; total time= 5.5min
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=500; total time= 5.6min
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=500; total time= 5.5min
[CV] END max_depth=5, min_samples_leaf=