# Hyperparameter Tuning
In machine learning, we usually tune hyperparameters with grid search or random search.
However, I feel this is a kitchen-sink approach: we throw in a range of parameter values and see whether any of them works.
In this blog post, we will see if we can find a better way of doing it.

Let us begin with the regularisation parameter for XGBoost.
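
To make the role of this parameter concrete, here is a minimal sketch of how the L2 term (`reg_lambda`, written λ) enters the leaf weight and split gain, following the formulas in the XGBoost documentation: the optimal leaf weight is w* = -G / (H + λ), and the split gain is ½ [G_L²/(H_L + λ) + G_R²/(H_R + λ) - (G_L + G_R)²/(H_L + H_R + λ)] - γ, where G and H are the sums of gradients and Hessians in a node. The helper names and the gradient/Hessian sums below are made up for illustration; they do not come from the dataset used in this post.

```python
def leaf_weight(G, H, reg_lambda):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma=0.0):
    """Gain of splitting a node into left/right children, minus the gamma penalty."""
    def score(G, H):
        return G * G / (H + reg_lambda)

    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma


# Hypothetical gradient/Hessian sums for the two children of one candidate split
G_L, H_L = -12.0, 30.0
G_R, H_R = 8.0, 20.0

for lam in (0.0, 1.0, 10.0, 100.0):
    w = leaf_weight(G_L, H_L, lam)
    gain = split_gain(G_L, H_L, G_R, H_R, lam)
    print(f"reg_lambda={lam:>5}: left leaf weight={w:.4f}, split gain={gain:.4f}")
```

In this toy example, a larger `reg_lambda` shrinks the leaf weight towards zero and lowers the gain of the split, which is the sense in which it regularises the trees.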


In [1]:
import pandas as pd
from loguru import logger
from blog.data.data_cleaner_factory import DataCleanerFactory

# Build the wrapper for the 'lnt' dataset, then load the features X and target y
dcf = DataCleanerFactory()
lnt_dataset = dcf.getDataset('lnt')
X, y = lnt_dataset.get_data(path='../data/lnt_dataset.csv')


2020-11-01 23:33:50.197 | INFO     | blog.data.lnt_dataset:_read_data:18 - Reading data from path ../data/lnt_dataset.csv
2020-11-01 23:33:51.205 | INFO     | blog.data.lnt_dataset:_read_data:20 - Read data with shape (233154, 41)
2020-11-01 23:33:51.206 | INFO     | blog.data.lnt_dataset:_process_data:29 - Dropping all id columns
2020-11-01 23:33:51.310 | INFO     | blog.data.lnt_dataset:_process_data:36 - Calculating customer age
2020-11-01 23:33:51.922 | INFO     | blog.data.lnt_dataset:_process_data:39 - Calculating financial age of customer
2020-11-01 23:33:52.623 | INFO     | blog.data.lnt_dataset:get_data:95 - Dropping na rows.
2020-11-01 23:33:52.813 | INFO     | blog.data.lnt_dataset:get_data:101 - Shape of training data X :(225493, 32), y : (225493,).


In [2]:
from sklearn.model_selection import train_test_split

# Stratified split; with no test_size given, scikit-learn holds out 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=65)
print(f"Train shape : {X_train.shape} , {y_train.shape}")
print(f"Test shape : {X_test.shape},{y_test.shape} ")

Train shape : (169119, 32) , (169119,)
Test shape : (56374, 32),(56374,) 


In [3]:
from xgboost import XGBClassifier
xgb_clf = XGBClassifier()
print(f"Created a base classifier : {xgb_clf}")

Created a base classifier : XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)


In [4]:
xgb_clf.fit(X_train,y_train)

ValueError: DataFrame.dtypes for data must be int, float or bool.
                Did not expect the data types in fields Employment.Type, PERFORM_CNS.SCORE.DESCRIPTION
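
The error says that XGBoost received non-numeric columns (`Employment.Type` and `PERFORM_CNS.SCORE.DESCRIPTION`). One common way around this is to one-hot encode such columns before fitting. The sketch below does that with `pd.get_dummies` on a tiny made-up frame; the helper name `encode_object_columns` and the category values are assumptions for illustration, and this is not necessarily the preprocessing behind the results shown later in this post.

```python
import pandas as pd


def encode_object_columns(df):
    """One-hot encode every non-numeric, non-bool column; other columns pass through."""
    cat_cols = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    return pd.get_dummies(df, columns=cat_cols, dtype=int)


# Toy frame with one numeric and one text column (values are made up)
toy = pd.DataFrame({
    "ltv": [73.5, 78.9, 65.0],
    "Employment.Type": ["Salaried", "Self employed", "Salaried"],
})

encoded = encode_object_columns(toy)
print(encoded.dtypes)
```

The same helper could be applied to `X` before the train/test split so that both parts end up with identical columns.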

In [26]:
tree_stats = (
    xgb_clf.get_booster()
    .trees_to_dataframe()
    )
tree_stats.head(30)

| | Tree | Node | ID | Feature | Split | Yes | No | Missing | Gain | Cover |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0-0 | disbursed_amount | 51909.0 | 0-1 | 0-2 | 0-1 | 934.316406 | 43716.25 |
| 1 | 0 | 1 | 0-1 | ltv | 73.475006 | 0-3 | 0-4 | 0-3 | 422.601562 | 18428.75 |
| 2 | 0 | 2 | 0-2 | PERFORM_CNS.SCORE | 629.0 | 0-5 | 0-6 | 0-5 | 634.912109 | 25287.5 |
| 3 | 0 | 3 | 0-3 | PERFORM_CNS.SCORE | 708.0 | 0-7 | 0-8 | 0-7 | 65.041016 | 10530.5 |
| 4 | 0 | 4 | 0-4 | PERFORM_CNS.SCORE | 659.0 | 0-9 | 0-10 | 0-9 | 213.455078 | 7898.25 |
| 5 | 0 | 5 | 0-5 | ltv | 78.925003 | 0-11 | 0-12 | 0-11 | 144.80957 | 17775.25 |
| 6 | 0 | 6 | 0-6 | PRI.SANCTIONED.AMOUNT | 197495.0 | 0-13 | 0-14 | 0-13 | 168.913086 | 7512.25 |
| 7 | 0 | 7 | 0-7 | Leaf | | | | | -0.139301 | 8313.75 |
| 8 | 0 | 8 | 0-8 | Leaf | | | | | -0.158877 | 2216.75 |
| 9 | 0 | 9 | 0-9 | Leaf | | | | | -0.102518 | 5698.5 |
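
In the `trees_to_dataframe()` output, rows whose `Feature` is `Leaf` hold the leaf value in the `Gain` column rather than a split gain, so it helps to separate the two before aggregating anything. Here is a small sketch of that, using a few rows copied from the output above; the variable names and the `LeafValue` column name are illustrative choices.

```python
import pandas as pd

# A handful of rows copied from the tree_stats output above
rows = pd.DataFrame({
    "Tree": [0, 0, 0, 0, 0],
    "ID": ["0-0", "0-1", "0-2", "0-7", "0-8"],
    "Feature": ["disbursed_amount", "ltv", "PERFORM_CNS.SCORE", "Leaf", "Leaf"],
    "Gain": [934.316406, 422.601562, 634.912109, -0.139301, -0.158877],
    "Cover": [43716.25, 18428.75, 25287.5, 8313.75, 2216.75],
})

is_leaf = rows["Feature"] == "Leaf"
splits = rows[~is_leaf]                                       # internal split nodes
leaves = rows[is_leaf].rename(columns={"Gain": "LeafValue"})  # leaf nodes

# Total split gain per feature, largest first
gain_by_feature = splits.groupby("Feature")["Gain"].sum().sort_values(ascending=False)
print(gain_by_feature)
print(leaves[["ID", "LeafValue", "Cover"]])
```

On the full `tree_stats` frame, the same filter would replace the hand-built `rows`.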
