# Hyperparameter Tuning
In machine learning, we usually tune hyperparameters with grid search or random search.
However, I feel this is a kitchen-sink approach: we throw in a range of parameter values and see what works.
In this blog post, we will see whether we can find a better way of doing it.
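
To make the kitchen-sink point concrete, here is a minimal sketch of what the two searches actually enumerate. The search space below is made up for illustration (it is not the one used later in this post), and the helper names `grid_candidates` and `random_candidates` are my own, not part of any library.

```python
import itertools
import random

# Hypothetical search space, chosen only for illustration.
param_grid = {
    "reg_lambda": [0.1, 1.0, 10.0],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}

def grid_candidates(grid):
    """Grid search: every combination of the listed values."""
    keys = list(grid)
    combos = itertools.product(*(grid[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]

def random_candidates(grid, n_iter, seed=0):
    """Random search: n_iter independent draws from the same space."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in grid.items()} for _ in range(n_iter)]

grid = grid_candidates(param_grid)
sampled = random_candidates(param_grid, n_iter=5)
print(f"Grid search would fit {len(grid)} models, random search {len(sampled)}")
```

Neither strategy uses anything the model has learned so far; each candidate is just another fit to try, which is the part we would like to improve on.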

Let us begin with the regularisation parameter for XGBoost.
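
As a reminder of what regularisation does here, this is the regularised objective as written in the XGBoost paper, with an L1 term added to cover the `reg_alpha` argument. I am assuming the usual mapping of symbols to `XGBClassifier` arguments: $\lambda$ is `reg_lambda`, $\alpha$ is `reg_alpha` and $\gamma$ is `gamma`. The library may differ in constant factors (for example the $\tfrac{1}{2}$ in the gain), so treat this as a sketch rather than a description of the exact implementation.

$$
\mathcal{L} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k), \qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
$$

Here $T$ is the number of leaves in a tree and $w_j$ is the weight of leaf $j$. With $\alpha = 0$, and writing $G_j$ and $H_j$ for the sums of first and second derivatives of the loss over the rows in leaf $j$, the optimal leaf weight and the gain of a split into left and right children are

$$
w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma
$$

A larger $\lambda$ shrinks every leaf weight towards zero and lowers the gain of every candidate split, so it matters most for leaves with a small $H_j$.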


In [3]:
import pandas as pd
from loguru import logger
from blog.data.data_cleaner_factory import DataCleanerFactory

# Load the `lnt` dataset through the project's own data-cleaning factory.
dcf = DataCleanerFactory()
lnt_dataset = dcf.getDataset('lnt')
X, y = lnt_dataset.get_data(path='../data/lnt_dataset.csv')

# Drop two columns that are not used as features here.
X.drop(['Employment.Type', 'PERFORM_CNS.SCORE.DESCRIPTION'], axis=1, inplace=True)

2020-11-02 00:14:56.645 | INFO     | blog.data.lnt_dataset:_read_data:18 - Reading data from path ../data/lnt_dataset.csv
2020-11-02 00:14:57.533 | INFO     | blog.data.lnt_dataset:_read_data:20 - Read data with shape (233154, 41)
2020-11-02 00:14:57.533 | INFO     | blog.data.lnt_dataset:_process_data:29 - Dropping all id columns
2020-11-02 00:14:57.587 | INFO     | blog.data.lnt_dataset:_process_data:36 - Calculating customer age
2020-11-02 00:14:58.168 | INFO     | blog.data.lnt_dataset:_process_data:39 - Calculating financial age of customer
2020-11-02 00:14:58.810 | INFO     | blog.data.lnt_dataset:get_data:95 - Dropping na rows.
2020-11-02 00:14:58.944 | INFO     | blog.data.lnt_dataset:get_data:101 - Shape of training data X :(225493, 32), y : (225493,).


In [4]:
from sklearn.model_selection import train_test_split

# Stratified split on y with the default train/test sizes and a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=65)
print(f"Train shape : {X_train.shape} , {y_train.shape}")
print(f"Test shape : {X_test.shape},{y_test.shape} ")

Train shape : (169119, 30) , (169119,)
Test shape : (56374, 30),(56374,) 
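
The shapes above follow from the default split sizes. As far as I know, when neither `test_size` nor `train_size` is passed, scikit-learn holds out 25% of the rows and rounds the test count up. The small check below reproduces the numbers under that assumption, using only the row count from the log output.

```python
import math

n_rows = 225493        # rows left after dropping NA rows (from the log above)
test_fraction = 0.25   # assumed scikit-learn default when no sizes are given

# The test count is rounded up; the training set gets the remainder.
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

print(f"Train rows : {n_train} , Test rows : {n_test}")
```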


In [5]:
from xgboost import XGBClassifier
xgb_clf = XGBClassifier()
print(f"Created a base classifier : {xgb_clf}")

Created a base classifier : XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)


In [6]:
xgb_clf.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [7]:
# Dump every node of every boosted tree into a DataFrame, one row per node.
tree_stats = xgb_clf.get_booster().trees_to_dataframe()
tree_stats.head(30)

| | Tree | Node | ID | Feature | Split | Yes | No | Missing | Gain | Cover |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0-0 | disbursed_amount | 50751.0 | 0-1 | 0-2 | 0-1 | 925.40625 | 42279.75 |
| 1 | 0 | 1 | 0-1 | ltv | 73.494995 | 0-3 | 0-4 | 0-3 | 336.103516 | 16007.5 |
| 2 | 0 | 2 | 0-2 | PERFORM_CNS.SCORE | 644.0 | 0-5 | 0-6 | 0-5 | 645.25 | 26272.25 |
| 3 | 0 | 3 | 0-3 | PERFORM_CNS.SCORE | 708.0 | 0-7 | 0-8 | 0-7 | 68.816406 | 9752.75 |
| 4 | 0 | 4 | 0-4 | PERFORM_CNS.SCORE | 660.0 | 0-9 | 0-10 | 0-9 | 159.201172 | 6254.75 |
| 5 | 0 | 5 | 0-5 | ltv | 79.434998 | 0-11 | 0-12 | 0-11 | 157.087891 | 18695.25 |
| 6 | 0 | 6 | 0-6 | PRI.SANCTIONED.AMOUNT | 185995.0 | 0-13 | 0-14 | 0-13 | 172.448242 | 7577.0 |
| 7 | 0 | 7 | 0-7 | Leaf | | | | | -0.139229 | 7642.5 |
| 8 | 0 | 8 | 0-8 | Leaf | | | | | -0.159929 | 2110.25 |
| 9 | 0 | 9 | 0-9 | Leaf | | | | | -0.104 | 4517.75 |
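
Two columns in this dump are worth reading closely before touching any regularisation setting. For leaf rows, the `Gain` column holds the leaf value rather than a split gain, and `Cover` is the sum of second derivatives (Hessians) of the loss over the rows reaching the node. The sketch below checks both against the numbers above. It assumes the defaults printed earlier (`base_score=0.5`, `learning_rate=0.1`, `reg_lambda=1`), unit sample weights, and that the leaf value is the learning rate times $-G/(H+\lambda)$; XGBoost works in single precision, so expect small rounding differences.

```python
# Values copied from the notebook output above.
n_train = 169119         # rows in X_train
root_cover = 42279.75    # Cover of the root node 0-0
leaf_value = -0.139229   # "Gain" column of leaf 0-7 (the leaf value)
leaf_cover = 7642.5      # Cover of leaf 0-7

# Assumed defaults, as printed for the base classifier.
base_score = 0.5
learning_rate = 0.1
reg_lambda = 1.0

# For logistic loss the Hessian of one row is p * (1 - p); in the first
# tree every row still has p = base_score, so each row contributes 0.25.
hessian_per_row = base_score * (1 - base_score)
print(f"Expected root cover : {n_train * hessian_per_row}, reported : {root_cover}")

# Cover therefore tells us how many training rows reached the leaf.
rows_in_leaf = leaf_cover / hessian_per_row

# Invert leaf_value = -learning_rate * G / (H + lambda) to get the gradient sum G.
grad_sum = -leaf_value * (leaf_cover + reg_lambda) / learning_rate

# The gradient of one row is p - y, so G = base_score * rows - positives.
positives = base_score * rows_in_leaf - grad_sum
print(f"Rows in leaf 0-7 : {rows_in_leaf:.0f}, implied positives : {positives:.1f}")
print(f"Implied positive rate : {positives / rows_in_leaf:.3f}")
```

The implied count of positives comes out very close to a whole number, which suggests the assumed formula matches what the library does for this leaf. It also shows why `reg_lambda=1` barely moves a leaf with a cover in the thousands: adding 1 to a denominator of that size changes the weight by a tiny fraction.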
