### Hyperparameter Tuning

In this notebook, we tune the hyperparameters of the best-performing model (XGBoost) with the aim of improving its ROC-AUC score.




### Importing Necessary Libraries



In [8]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
import pandas as pd

### Load the Feature and Label Datasets


In [10]:
# Load feature and label data
X_train = pd.read_csv('../data/X_train.csv')
y_train = pd.read_csv('../data/y_train.csv')

# Flatten labels to 1D array
y_train = y_train.values.ravel()

### Tuning using GridSearchCV

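`GridSearchCV` uses cross-validation to evaluate every combination of hyperparameters in the grid and keeps the combination with the best mean score (here, ROC-AUC over 3 folds).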

In [6]:
# Define parameter grid for XGBoost
xgb_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1],
    'colsample_bytree': [0.8, 1]
}

# Set up GridSearch
xgb_grid = GridSearchCV(
    estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
    param_grid=xgb_param_grid,
    scoring='roc_auc',
    cv=3,
    n_jobs=-1,
    verbose=1
)
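
Because every combination in the grid is evaluated on every fold, the cost of the search can be estimated before fitting. The following is a minimal sketch that repeats the grid from the cell above and counts the candidates with `itertools.product`; the names `n_folds`, `n_candidates` and `total_fits` are introduced here only for illustration.

```python
from itertools import product

# Same grid as defined in the cell above.
xgb_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1],
    'colsample_bytree': [0.8, 1]
}
n_folds = 3  # matches cv=3 in the GridSearchCV call

# One candidate per element of the Cartesian product of all value lists.
candidates = list(product(*xgb_param_grid.values()))
n_candidates = len(candidates)   # 2 * 3 * 2 * 2 * 2
total_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
# 48 candidates x 3 folds = 144 fits
```

With the default `refit=True`, `GridSearchCV` trains one additional model on the full training set using the best combination, so the search cost grows multiplicatively with every value added to the grid.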



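### Cleaning the 'y_train' Dataset

The labels in y_train should be 0 or 1, but the file contains only the values -0.601 and 1.664 (see the output below). This indicates that 'y_train.csv' was incorrectly saved or modified; one possible cause is that the labels were standardized together with the features. To fix this, we convert the values back to binary labels by thresholding at 0.5: values above 0.5 become 1 and all others become 0.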

In [12]:
import numpy as np

print("Unique labels in y_train:", np.unique(y_train))
print("First few y values:", y_train[:10])
print("Data type:", y_train.dtype)


Unique labels in y_train: [-0.60102348  1.66382851]
First few y values: [-0.60102348 -0.60102348 -0.60102348 -0.60102348 -0.60102348 -0.60102348
 -0.60102348 -0.60102348 -0.60102348 -0.60102348]
Data type: float64


In [14]:
# Convert to binary: everything > 0.5 is 1, else 0
y_train = (y_train > 0.5).astype(int)
print("Unique values after conversion:", np.unique(y_train))


Unique values after conversion: [0 1]
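
The two label values printed earlier multiply to almost exactly -1, which is what one would expect if a 0/1 label had been z-scored with the population standard deviation. The sketch below checks this under that assumption; it is only a plausibility check, since the notebook does not confirm how 'y_train.csv' was produced. The helper `standardize_binary` is hypothetical and not part of the original code.

```python
import math

# The two distinct values printed for y_train earlier in this section.
lo, hi = -0.60102348, 1.66382851

# Assumption: a 0/1 label with positive rate p was z-scored. Then
# 0 maps to -sqrt(p / (1 - p)) and 1 maps to sqrt((1 - p) / p),
# so the product of the two resulting values is -1.
product = lo * hi

# Positive-class rate implied by that assumption.
p = lo ** 2 / (1 + lo ** 2)


def standardize_binary(p):
    """Return the z-scores of label 0 and label 1 for positive rate p."""
    std = math.sqrt(p * (1 - p))  # population std of a 0/1 variable
    return (0 - p) / std, (1 - p) / std


z0, z1 = standardize_binary(p)
print(f"product = {product:.6f}, implied positive rate = {p:.4f}")
print(f"reconstructed values: {z0:.6f}, {z1:.6f}")
```

If the reconstructed values match the ones in the file, thresholding at any value between them (0.5 in this notebook) recovers the original 0/1 labels.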


In [15]:
xgb_grid.fit(X_train, y_train)


Fitting 3 folds for each of 48 candidates, totalling 144 fits


GridSearchCV(cv=3,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False,
                                     eval_metric='logloss', gamma=None,
                                     gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     m...
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, 

### Checking Best XGBoost Results


In [16]:
print("Best XGBoost Parameters:", xgb_grid.best_params_)
print("Best ROC-AUC Score:", xgb_grid.best_score_)


Best XGBoost Parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
Best ROC-AUC Score: 0.8493628955207955
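
Note that `best_score_` is the mean cross-validated score of the best candidate, not a score on a held-out test set. To see how close the other candidates came, the `cv_results_` attribute can be loaded into a DataFrame. The sketch below illustrates this on synthetic data with a `DecisionTreeClassifier` as a stand-in estimator and a deliberately tiny grid, so that it runs in seconds; `X_demo`, `y_demo` and `demo_grid` are hypothetical names, and the same pattern should apply to `xgb_grid` after fitting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook's X_train / y_train are not used here.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Stand-in estimator and a small grid (assumption: illustration only).
demo_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6]},
    scoring="roc_auc",
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate; rank 1 has the best mean score.
results = (
    pd.DataFrame(demo_grid.cv_results_)
    .sort_values("rank_test_score")
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
)
print(results.to_string(index=False))
```

Comparing `mean_test_score` with `std_test_score` helps judge whether the winning combination is clearly better than the runners-up or within fold-to-fold noise.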


### Saving the Model

We can now save the tuned model to the models folder for later deployment.

In [18]:
import joblib

# Save the best model
joblib.dump(xgb_grid.best_estimator_, '../models/best_model_xgb_tuned.pkl')


['../models/best_model_xgb_tuned.pkl']
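
Before deploying a saved model, it can be worth confirming that it reloads and predicts as expected. The following is a minimal round-trip sketch using `joblib.dump` and `joblib.load`; it trains a small `LogisticRegression` on synthetic data as a stand-in for the tuned XGBoost model and writes to a temporary directory, so the file name `demo_model.pkl` and all variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on synthetic data (not the tuned XGBoost model).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a temporary directory.
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Pickled models are generally tied to the library versions used to create them, so the deployment environment should use the same versions of scikit-learn and XGBoost where possible.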