Model Tuning

In this notebook, we apply hyperparameter tuning to our saved model to maximize its performance and prepare it for future deployment.

In [45]:
#import libraries
import pickle
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import classification_report

from pprint import pprint

In [10]:
#load our cleaned data set
data = pd.read_csv('../data/kois_cleaned.csv')

#make a copy of data
kois = data.copy()

In [3]:
#load our model (the context manager closes the file handle after loading)
with open("xgboost_model.sav", 'rb') as f:
    xgb_model = pickle.load(f)

In [4]:
xgb_model

In [38]:
#set up our parameters
params = {
    'eta': [0.01, 0.05, 0.1, 0.15, 0.2],
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.7, 0.8],
    'colsample_bytree': [0.7, 0.8],
    'n_estimators': [100, 200, 500],
    'objective': ['multi:softmax'],
    'eval_metric': ['mlogloss'],
    'num_class': [3]
}   

In [39]:
#set up our grid search object
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=params,
    n_jobs=-1,
    cv=5,
    scoring='f1_macro'
)
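
To get a sense of how much work this search implies, the number of candidate combinations and model fits can be counted directly from the grid. The following is a minimal sketch that repeats the `params` dictionary from the cell above so it runs on its own; the variable names `n_candidates`, `cv_folds`, and `n_fits` are illustrative and not part of the original notebook.

```python
from math import prod

# Same grid as the `params` dictionary defined above, repeated so this runs standalone.
params = {
    'eta': [0.01, 0.05, 0.1, 0.15, 0.2],
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.7, 0.8],
    'colsample_bytree': [0.7, 0.8],
    'n_estimators': [100, 200, 500],
    'objective': ['multi:softmax'],
    'eval_metric': ['mlogloss'],
    'num_class': [3]
}

# A grid search tries every combination, so the candidate count is the
# product of the list lengths (single-value lists contribute a factor of 1).
n_candidates = prod(len(values) for values in params.values())

# Each candidate is fitted once per cross-validation fold (cv=5 above).
cv_folds = 5
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, GridSearchCV also refits the best candidate once more on the full training set after the search finishes.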

In [40]:
#split our data into features and target
#start by dropping error columns
kois = kois.loc[:, ~kois.columns.str.contains('_err')]

#assign our features and target
X = kois.drop(columns=['koi_disposition_encoded'])
y = kois['koi_disposition_encoded']

#split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [41]:
#fit our grid search object
grid_search.fit(X_train, y_train)



In [54]:
# Identify our best parameters
best_params = grid_search.best_params_

# Identify our best model
best_model = grid_search.best_estimator_

# Create a formatted string with line breaks for each parameter
formatted_params = "\n".join([f"{key}: {value}" for key, value in best_params.items()])

# Print our best parameters with line breaks
print(f"Our best parameters are:\n{formatted_params}")

Our best parameters are:
colsample_bytree: 0.8
eta: 0.1
eval_metric: mlogloss
max_depth: 9
n_estimators: 200
num_class: 3
objective: multi:softmax
subsample: 0.8


In [55]:
#complete predictions using our best params and estimator
y_pred = best_model.predict(X_test)

#generate a classification report
clf_report = classification_report(y_test, y_pred)

#display the results
print(f"\n Classification Report: \n{clf_report}")


 Classification Report: 
              precision    recall  f1-score   support

           0       0.85      0.77      0.81       557
           1       0.80      0.87      0.84       573
           2       0.99      0.99      0.99      1122

    accuracy                           0.91      2252
   macro avg       0.88      0.88      0.88      2252
weighted avg       0.91      0.91      0.91      2252





We have seen a slight increase in performance after running a grid search to tune our hyperparameters. We can run GridSearchCV again, this time with the parameter ranges centered tightly around the best parameters from the previous run. This lets us improve performance through iteration without an excessively long single run of GridSearchCV.
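
One way to build such a narrowed grid programmatically is sketched below. The helper `narrow_grid` and its `steps` argument are hypothetical and not part of the original notebook, which writes the second grid by hand; the `best` dictionary copies the best parameters printed above.

```python
def narrow_grid(best_params, steps):
    """Build a parameter grid centered on the best parameters of a previous run.

    `steps` maps a parameter name to a step size. Those parameters get three
    values (best - step, best, best + step); every other parameter is pinned
    to its best value as a single-element list.
    """
    grid = {}
    for name, best in best_params.items():
        if name in steps:
            step = steps[name]
            # Round to suppress floating-point noise such as 0.12000000000000001.
            grid[name] = [round(best - step, 10), best, round(best + step, 10)]
        else:
            grid[name] = [best]
    return grid


# Best parameters reported by the first grid search.
best = {
    'colsample_bytree': 0.8,
    'eta': 0.1,
    'eval_metric': 'mlogloss',
    'max_depth': 9,
    'n_estimators': 200,
    'num_class': 3,
    'objective': 'multi:softmax',
    'subsample': 0.8,
}

# Vary only the learning rate; keep everything else fixed.
grid = narrow_grid(best, steps={'eta': 0.02})
print(grid['eta'])
```

Varying only one or two parameters per round keeps each follow-up search small, at the cost of possibly missing interactions between parameters that are held fixed.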

In [62]:
#create a new dictionary based on our best params. 
params2 = {
    'eta': [0.08, 0.1, 0.12],
    'max_depth': [9],
    'subsample': [0.8],
    'colsample_bytree': [0.8],
    'n_estimators': [200],
    'objective': ['multi:softmax'],
    'eval_metric': ['mlogloss'],
    'num_class': [3]
}

In [63]:
#set up our second grid search object
grid_search2 = GridSearchCV(
    estimator=xgb_model,
    param_grid=params2,
    n_jobs=-1,
    cv=5,
    scoring='f1_macro'
)

We can reuse our existing data splits, so we can move straight on to fitting the model.

In [64]:
#fit our second grid search object
grid_search2.fit(X_train, y_train)



In [65]:
# Identify our best parameters
best_params2 = grid_search2.best_params_

# Identify our best model
best_model2 = grid_search2.best_estimator_

# Create a formatted string with line breaks for each parameter
formatted_params2 = "\n".join([f"{key}: {value}" for key, value in best_params2.items()])

# Print our best parameters with line breaks
print(f"Our best parameters are:\n{formatted_params2}")

Our best parameters are:
colsample_bytree: 0.8
eta: 0.1
eval_metric: mlogloss
max_depth: 9
n_estimators: 200
num_class: 3
objective: multi:softmax
subsample: 0.8


In [66]:
#complete predictions using the best estimator from the second search
y_pred = best_model2.predict(X_test)

#generate a classification report
clf_report = classification_report(y_test, y_pred)

#display the results
print(f"\n Classification Report: \n{clf_report}")


 Classification Report: 
              precision    recall  f1-score   support

           0       0.85      0.77      0.81       557
           1       0.80      0.87      0.84       573
           2       0.99      0.99      0.99      1122

    accuracy                           0.91      2252
   macro avg       0.88      0.88      0.88      2252
weighted avg       0.91      0.91      0.91      2252





The new, narrower range of parameters did not identify a combination that yields better performance: the best parameters and the classification report match those from the first run.
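
Since the goal is to prepare the model for future deployment, the selected estimator could be saved with `pickle`, mirroring how the original model was loaded. The sketch below is an assumption about that step rather than something the notebook does: the filename `xgboost_model_tuned.sav` is hypothetical, and a plain dictionary stands in for the fitted estimator so the example runs without the data or XGBoost installed.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted estimator; in the notebook this would be the
# best estimator returned by the grid search.
stand_in_model = {"eta": 0.1, "max_depth": 9, "n_estimators": 200}

# Hypothetical output path, placed in a temporary directory for this sketch.
model_path = os.path.join(tempfile.mkdtemp(), "xgboost_model_tuned.sav")

# Save the object; the context manager closes the file when done.
with open(model_path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Load it back to confirm the round trip works.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

A pickled estimator should be loaded with the same library versions it was saved with, and pickle files should only be loaded from trusted sources.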