# Model Tuning

In this notebook, we apply hyperparameter tuning to our saved model to maximize its performance and prepare it for future deployment.

In [1]:
#import libraries
import pickle
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import classification_report

from pprint import pprint

import warnings

warnings.filterwarnings('ignore')

### Importing Model and Data

In [2]:
#load our cleaned data set
data = pd.read_csv('../data/kois_cleaned.csv')

#make a copy of data
kois = data.copy()

In [3]:
#load our model, closing the file handle once it has been read
with open("../models/xgboost_model.sav", 'rb') as file:
    xgb_model = pickle.load(file)

In [4]:
xgb_model

### Establishing Parameters

In [5]:
#set up our parameters
params = {
    'eta': [0.01, 0.05, 0.1, 0.15, 0.2],
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.7, 0.8],
    'colsample_bytree': [0.7, 0.8],
    'n_estimators': [100, 200, 500],
    'objective': ['multi:softmax'],
    'eval_metric': ['mlogloss'],
    'num_class': [3]
}   

In [6]:
#set up our grid search object
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=params,
    n_jobs=-1,
    cv=5,
    scoring='f1_macro'
)
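
Before fitting, it can help to estimate how much work the search will do. The sketch below counts the candidates in the grid defined above and multiplies by the number of folds. It copies `params` by hand so that it runs on its own; `cv_folds`, `n_candidates`, and `n_fits` are names introduced here for illustration.

```python
from itertools import product
from math import prod

# Same search space as `params` above.
params = {
    'eta': [0.01, 0.05, 0.1, 0.15, 0.2],
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.7, 0.8],
    'colsample_bytree': [0.7, 0.8],
    'n_estimators': [100, 200, 500],
    'objective': ['multi:softmax'],
    'eval_metric': ['mlogloss'],
    'num_class': [3],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# One candidate per element of the Cartesian product of all value lists.
n_candidates = prod(len(values) for values in params.values())
n_fits = n_candidates * cv_folds

# Cross-check by enumerating the combinations explicitly.
assert n_candidates == len(list(product(*params.values())))

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, GridSearchCV also refits the best candidate once on the full training set, so the total is one fit higher than the figure printed here.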

In [7]:
#split our data into features and target
#start by dropping error columns
kois = kois.loc[:, ~kois.columns.str.contains('_err')]

#assign our features and target
X = kois.drop(columns=['koi_disposition_encoded'])
y = kois['koi_disposition_encoded']

#split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Performing Grid Search

In [8]:
#fit our grid search object
grid_search.fit(X_train, y_train)

In [9]:
# Identify our best parameters
best_params = grid_search.best_params_

# Identify our best model
best_model = grid_search.best_estimator_

# Create a formatted string with line breaks for each parameter
formatted_params = "\n".join([f"{key}: {value}" for key, value in best_params.items()])

# Print our best parameters with line breaks
print(f"Our best parameters are:\n{formatted_params}")

Our best parameters are:
colsample_bytree: 0.8
eta: 0.1
eval_metric: mlogloss
max_depth: 9
n_estimators: 200
num_class: 3
objective: multi:softmax
subsample: 0.8


In [10]:
#complete predictions using our best params and estimator
y_pred = best_model.predict(X_test)

#generate a classification report
clf_report = classification_report(y_test, y_pred)

#display the results
print(f"\n Classification Report: \n{clf_report}")


 Classification Report: 
              precision    recall  f1-score   support

           0       0.85      0.77      0.81       557
           1       0.80      0.87      0.84       573
           2       0.99      0.99      0.99      1122

    accuracy                           0.91      2252
   macro avg       0.88      0.88      0.88      2252
weighted avg       0.91      0.91      0.91      2252
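
The two averages at the bottom of the report can be reproduced from the per-class rows, which also shows why `f1_macro` is a stricter scoring choice here than a weighted average. This is a small sketch using the rounded values printed above, so it can only match the report to two decimals.

```python
# Per-class F1 scores and supports copied from the report above.
f1 = {0: 0.81, 1: 0.84, 2: 0.99}
support = {0: 557, 1: 573, 2: 1122}

total = sum(support.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

Class 2 makes up about half of the test set and is classified almost perfectly, so it pulls the weighted average up. The macro average gives the two harder classes equal weight.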



We have seen a slight increase in performance after running a grid search to tune our hyperparameters. We can run GridSearchCV again, this time with the parameter ranges centered tightly around the best parameters from the previous run. This lets us improve performance through iteration without an excessively long single run of GridSearchCV.
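
As an illustration of the refinement idea, the sketch below builds a narrower grid around a set of best values. `narrow_grid` and `steps` are hypothetical names introduced here, not part of scikit-learn or XGBoost, and the helper is symmetric by design. The grid used in this notebook is written by hand instead and extends `max_depth`, `subsample`, and `colsample_bytree` upward only, because their best values sat at the upper edge of the first grid.

```python
def narrow_grid(best_params, steps):
    """Build a small grid centred on each tuned value.

    `steps` maps a parameter name to the offset tried on either side of its
    best value. Parameters without a step are kept fixed.
    """
    grid = {}
    for name, best in best_params.items():
        if name in steps:
            step = steps[name]
            # round() hides floating-point noise such as 0.1 + 0.02.
            grid[name] = [round(best - step, 10), best, round(best + step, 10)]
        else:
            grid[name] = [best]
    return grid


# Example values taken from the first grid search's best parameters.
best = {'eta': 0.1, 'max_depth': 9, 'n_estimators': 200,
        'objective': 'multi:softmax'}
steps = {'eta': 0.02, 'max_depth': 1, 'n_estimators': 50}

grid = narrow_grid(best, steps)
for name, values in grid.items():
    print(f"{name}: {values}")
```

Keeping each refined list to three values keeps the number of candidates manageable: five tuned parameters with three values each give 243 combinations.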

In [11]:
#create a new dictionary based on our best params. 
params2 = {
    'eta': [0.08, 0.1, 0.12],
    'max_depth': [9, 10, 11],
    'subsample': [0.8, 0.85, 0.9],
    'colsample_bytree': [0.8, 0.85, 0.9],
    'n_estimators': [150, 200, 250],
    'objective': ['multi:softmax'],
    'eval_metric': ['mlogloss'],
    'num_class': [3]
}

In [12]:
#set up our grid search object
grid_search2 = GridSearchCV(
    estimator=xgb_model,
    param_grid=params2,
    n_jobs=-1,
    cv=5,
    scoring='f1_macro'
)

We can reuse our existing data splits, so we can go straight to fitting the model.

In [13]:
#fit our grid search object
grid_search2.fit(X_train, y_train)

In [14]:
# Identify our best parameters
best_params2 = grid_search2.best_params_

# Identify our best model
best_model2 = grid_search2.best_estimator_

# Create a formatted string with line breaks for each parameter
formatted_params2 = "\n".join([f"{key}: {value}" for key, value in best_params2.items()])

# Print our best parameters with line breaks
print(f"Our best parameters are:\n{formatted_params2}")

Our best parameters are:
colsample_bytree: 0.85
eta: 0.12
eval_metric: mlogloss
max_depth: 10
n_estimators: 200
num_class: 3
objective: multi:softmax
subsample: 0.8


The best parameters have shifted slightly. We can use `best_model2` to make predictions.

In [15]:
#complete predictions using our best params and estimator
y_pred = best_model.predict(X_test)
y_pred2 = best_model2.predict(X_test)

#generate a classification report
clf_report = classification_report(y_test, y_pred)
clf_report2 = classification_report(y_test, y_pred2)

#display the results
print(f"\n Classification Report: \n{clf_report}")
print(f"\n Classification Report 2: \n{clf_report2}")


 Classification Report: 
              precision    recall  f1-score   support

           0       0.85      0.77      0.81       557
           1       0.80      0.87      0.84       573
           2       0.99      0.99      0.99      1122

    accuracy                           0.91      2252
   macro avg       0.88      0.88      0.88      2252
weighted avg       0.91      0.91      0.91      2252


 Classification Report 2: 
              precision    recall  f1-score   support

           0       0.84      0.76      0.80       557
           1       0.79      0.86      0.83       573
           2       0.99      0.99      0.99      1122

    accuracy                           0.90      2252
   macro avg       0.88      0.87      0.87      2252
weighted avg       0.90      0.90      0.90      2252



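Although our second grid search selected different parameters based on cross-validated macro F1, the resulting model did not perform better on the test set. The two reports differ by about 0.01 on most metrics, which may simply be noise: the new parameter ranges were quite narrow, and a higher cross-validation score does not guarantee a higher score on held-out data.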
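
One rough way to judge whether a gap of this size is meaningful is to compare it with the sampling error of the accuracy estimate. The sketch below uses the binomial standard error and the rounded accuracies from the two reports. It is only a sanity check: both models are scored on the same test set, so a paired comparison would be more precise, and the reported values are rounded to two decimals.

```python
from math import sqrt

n_test = 2252          # support shown in both reports
acc_first = 0.91       # accuracy of best_model
acc_second = 0.90      # accuracy of best_model2

# Binomial standard error of an accuracy estimated from n_test samples.
std_err = sqrt(acc_first * (1 - acc_first) / n_test)
gap = acc_first - acc_second

print(f"standard error: {std_err:.4f}")
print(f"gap: {gap:.2f} ({gap / std_err:.1f} standard errors)")
```

Under these assumptions the gap is less than two standard errors, so it is hard to tell the two models apart on this test set.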

### Training With Null Values

One final thing to benchmark is our model's performance with a much lighter version of our data cleaning. Here, we leave the null values in. To do this, we reimport our original dataset, drop only some non-contributing columns, and encode our target variable.

In [16]:
#import our data
kois_w_nulls = pd.read_csv('../data/kepler_exoplanet_search_results.csv')

In [17]:
#import module
from sklearn.preprocessing import LabelEncoder

#instantiate labelencoder object
label_enc = LabelEncoder()

In [18]:
#encode our target variable
kois_w_nulls['koi_disposition_encoded'] = label_enc.fit_transform(kois_w_nulls['koi_disposition'])
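
To read the classification reports, it helps to know which integer stands for which disposition. `LabelEncoder` sorts the unique labels and numbers them from 0. The pure-Python sketch below mimics that behaviour; the three label strings are an assumption about the contents of `koi_disposition` and are not shown in this notebook.

```python
# Assumed label values; check kois_w_nulls['koi_disposition'].unique() to confirm.
labels = ['FALSE POSITIVE', 'CONFIRMED', 'CANDIDATE', 'CONFIRMED', 'FALSE POSITIVE']

# LabelEncoder sorts the unique labels and numbers them from 0.
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

In the notebook itself, `label_enc.classes_` lists the fitted labels in encoded order, and `label_enc.inverse_transform` maps integers back to the original strings.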

In [19]:
#inspect the column names to decide which columns to drop
kois_w_nulls.columns

Index(['rowid', 'kepid', 'kepoi_name', 'kepler_name', 'koi_disposition',
       'koi_pdisposition', 'koi_score', 'koi_fpflag_nt', 'koi_fpflag_ss',
       'koi_fpflag_co', 'koi_fpflag_ec', 'koi_period', 'koi_period_err1',
       'koi_period_err2', 'koi_time0bk', 'koi_time0bk_err1',
       'koi_time0bk_err2', 'koi_impact', 'koi_impact_err1', 'koi_impact_err2',
       'koi_duration', 'koi_duration_err1', 'koi_duration_err2', 'koi_depth',
       'koi_depth_err1', 'koi_depth_err2', 'koi_prad', 'koi_prad_err1',
       'koi_prad_err2', 'koi_teq', 'koi_teq_err1', 'koi_teq_err2', 'koi_insol',
       'koi_insol_err1', 'koi_insol_err2', 'koi_model_snr', 'koi_tce_plnt_num',
       'koi_tce_delivname', 'koi_steff', 'koi_steff_err1', 'koi_steff_err2',
       'koi_slogg', 'koi_slogg_err1', 'koi_slogg_err2', 'koi_srad',
       'koi_srad_err1', 'koi_srad_err2', 'ra', 'dec', 'koi_kepmag',
       'koi_disposition_encoded'],
      dtype='object')

We see that our 8th column (index 7), `koi_fpflag_nt`, is the first one we will keep. The seven columns before it will be dropped.

In [20]:
#copy kois_w_nulls without the first 7 columns
kois_w_nulls = kois_w_nulls.iloc[:, 7:]
kois_w_nulls.shape

(9564, 44)
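
The positional slice `iloc[:, 7:]` follows the same rules as ordinary Python slicing, which makes it easy to check exactly which columns it removes. The sketch below applies the slice to the first nine column names from the listing above.

```python
# First nine column names from the listing above.
columns = ['rowid', 'kepid', 'kepoi_name', 'kepler_name', 'koi_disposition',
           'koi_pdisposition', 'koi_score', 'koi_fpflag_nt', 'koi_fpflag_ss']

dropped = columns[:7]   # positions 0-6
kept = columns[7:]      # position 7 onwards, as in iloc[:, 7:]

print("dropped:", dropped)
print("first kept:", kept[0])
```

The listing contains 51 columns, so removing seven leaves the 44 reported by `shape`.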

We can still drop the two columns that have 100% null values, as well as two columns that are simply labels. 

In [21]:
kois_w_nulls.isnull().sum()

koi_fpflag_nt                 0
koi_fpflag_ss                 0
koi_fpflag_co                 0
koi_fpflag_ec                 0
koi_period                    0
koi_period_err1             454
koi_period_err2             454
koi_time0bk                   0
koi_time0bk_err1            454
koi_time0bk_err2            454
koi_impact                  363
koi_impact_err1             454
koi_impact_err2             454
koi_duration                  0
koi_duration_err1           454
koi_duration_err2           454
koi_depth                   363
koi_depth_err1              454
koi_depth_err2              454
koi_prad                    363
koi_prad_err1               363
koi_prad_err2               363
koi_teq                     363
koi_teq_err1               9564
koi_teq_err2               9564
koi_insol                   321
koi_insol_err1              321
koi_insol_err2              321
koi_model_snr               363
koi_tce_plnt_num            346
koi_tce_delivname           346
koi_stef

In [22]:
#drop the two all-null err columns for koi_teq and the two label columns
kois_w_nulls = kois_w_nulls.drop(['koi_teq_err1', 'koi_teq_err2',
                                  'koi_tce_plnt_num',
                                  'koi_tce_delivname'], axis=1)
kois_w_nulls.shape

(9564, 40)

We can then pass this dataset through our model.

In [23]:
#specify X and y variables
X_nulls = kois_w_nulls.drop(['koi_disposition_encoded'], axis=1)
y_nulls = kois_w_nulls['koi_disposition_encoded']

#split into train and test sets
X_train_nulls, X_test_nulls, y_train_nulls, y_test_nulls = train_test_split(X_nulls, y_nulls, random_state=42)
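
Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below works out the resulting split sizes for this dataset, rounding the test count up as scikit-learn does.

```python
from math import ceil

n_rows = 9564      # rows in kois_w_nulls, from the shape printed earlier
test_size = 0.25   # train_test_split's default when neither size is given

n_test = ceil(n_rows * test_size)
n_train = n_rows - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

The test count matches the total support of 2391 in the classification reports for this dataset.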

In [26]:
#import xgboost
import xgboost as xgb

#instantiate our model
xgbc = xgb.XGBClassifier()

#retrain a basic model on the new data. 
xgbc.fit(X_train_nulls, y_train_nulls)

In [27]:
#test base performance (no hyperparameter tuning)
y_pred_nulls = xgbc.predict(X_test_nulls)

#print a classification report
clf_report_nulls = classification_report(y_test_nulls, y_pred_nulls)

#display the results
print(f"\n Classification Report: \n{clf_report_nulls}")


 Classification Report: 
              precision    recall  f1-score   support

           0       0.81      0.77      0.79       567
           1       0.80      0.83      0.82       574
           2       0.98      0.98      0.98      1250

    accuracy                           0.90      2391
   macro avg       0.86      0.86      0.86      2391
weighted avg       0.90      0.90      0.90      2391



Next, we perform a grid search on the data that includes null values.

In [30]:
#set up our parameters

params_nulls = {
    'eta': [0.01, 0.05, 0.1, 0.15, 0.2],
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.7, 0.8],
    'colsample_bytree': [0.7, 0.8],
    'n_estimators': [100, 200, 500],
    'objective': ['multi:softmax'],
    'eval_metric': ['mlogloss'],
    'num_class': [3]
}   

In [33]:
#set up our grid search object
#(GridSearchCV clones the estimator, so only xgb_model's settings are reused, not its fitted state)
grid_search_nulls = GridSearchCV(
    estimator=xgb_model,
    param_grid=params_nulls,
    n_jobs=-1,
    cv=5,
    scoring='f1_macro'
)

In [34]:
#fit our grid search object
grid_search_nulls.fit(X_train_nulls, y_train_nulls)

In [35]:
# Identify our best parameters
best_params_nulls = grid_search_nulls.best_params_

# Identify our best model
best_model_nulls = grid_search_nulls.best_estimator_

# Create a formatted string with line breaks for each parameter
formatted_params_nulls = "\n".join([f"{key}: {value}" for key, value in best_params_nulls.items()])

# Print our best parameters with line breaks
print(f"Our best parameters are:\n{formatted_params_nulls}")

Our best parameters are:
colsample_bytree: 0.8
eta: 0.01
eval_metric: mlogloss
max_depth: 9
n_estimators: 500
num_class: 3
objective: multi:softmax
subsample: 0.7


In [36]:
#complete predictions using our best params and estimator
y_pred_nulls = best_model_nulls.predict(X_test_nulls)

#generate a classification report
clf_report_nulls = classification_report(y_test_nulls, y_pred_nulls)

#display the results
print(f"\n Classification Report: \n{clf_report_nulls}")


 Classification Report: 
              precision    recall  f1-score   support

           0       0.82      0.77      0.79       567
           1       0.81      0.85      0.83       574
           2       0.98      0.98      0.98      1250

    accuracy                           0.90      2391
   macro avg       0.87      0.87      0.87      2391
weighted avg       0.90      0.90      0.90      2391



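As we can see, our best results still come from our first `best_model`, with a macro-average F1 of 0.88 against 0.87 for both `best_model2` and the tuned model trained with nulls. We can conclude here: without further domain knowledge or deep learning techniques, we have likely reached the limit of our performance.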

In [37]:
#dump best_model to a pickle file

#specify file name
filename = 'xgboost_best_model.sav'

#open file in binary write mode and dump the model
with open(filename, 'wb') as file:
    pickle.dump(best_model, file)
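
To confirm that a saved model can be restored, a round trip through `pickle` is a quick check. The sketch below uses a plain dictionary as a stand-in for the fitted estimator and a temporary directory in place of the real output path. Both are assumptions made so that the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted estimator; any picklable object behaves the same way.
stand_in_model = {'max_depth': 9, 'eta': 0.1, 'n_estimators': 200}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'xgboost_best_model.sav')

    # Save, closing the file as soon as the dump finishes.
    with open(path, 'wb') as file:
        pickle.dump(stand_in_model, file)

    # Load it back the same way.
    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored == stand_in_model)
```

Pickle files should only be loaded from trusted sources, and a pickled estimator is best loaded with the same library versions that created it.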

### Conclusion

To summarize, we took the following steps to fine-tune our model:
- Performed a grid search
- Re-established the ranges of our different parameters and re-ran the grid search
- Trained a model on the original data with very minimal cleaning involved
  - i.e. only the two 100% null columns and the other label columns removed
- Performed a grid search on our second model that includes nulls and selected the best parameters
- Compared the evaluation metrics of the two models to determine our "best model"