# Hyperparameter Tuning on Gradient Boosted Tree Model

Read the following documentation:
- https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#for-better-accuracy 
- https://lightgbm.readthedocs.io/en/latest/Parameters.html


Then, run the code below

In [None]:
import pandas as pd
import numpy as np

import lightgbm as lgbm
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, recall_score

### Data Preparation

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
df = pd.read_csv('/content/drive/MyDrive/dataset/cleaned_data.csv')
df['Attrition'] = [0 if x == 'No' else 1 for x in df['Attrition'].values]

categorical_variable = []

for x in df.columns:
    if df[x].dtype == 'O':
        categorical_variable.append(x)
        
df[categorical_variable] = df[categorical_variable].astype('category')

independent_var = df.drop('Attrition', axis = 'columns').columns.tolist()
target_var = ['Attrition']

X = df[independent_var]
y = df['Attrition']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

#!pip install imblearn

from imblearn.over_sampling import RandomOverSampler
X_train_sampled, y_train_sampled = RandomOverSampler(random_state = 42).fit_resample(X_train, y_train)

### **RandomizedSearch**

In [None]:
from sklearn.model_selection import RandomizedSearchCV

lgb_model = LGBMClassifier()
parameters = {
    'num_leaves':[10,20,40,60,80,100,120,150,170,200],
    'max_depth':[3,4,5,6,7,8,9,10,11,12],
    'min_data_in_leaf':[1,2,3,5,10,15,20,25,30],
    'boosting':['gbdt', 'dart']
}

lgb_search = RandomizedSearchCV(estimator = lgb_model, param_distributions = parameters, cv = 5, n_jobs = -1,
                              scoring = 'recall', n_iter = 100)

lgb_search.fit(X_train_sampled, y_train_sampled)
lgb_search.best_params_

{'num_leaves': 150, 'min_data_in_leaf': 3, 'max_depth': 12, 'boosting': 'dart'}

In [None]:
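# Note: these hard-coded values differ from the best_params_ printed above.
# RandomizedSearchCV was created without random_state, so each run can return a different result.
search_model = LGBMClassifier(num_leaves= 80, min_data_in_leaf = 1, max_depth= 11, boosting= 'gbdt')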

In [None]:
search_model.fit(X_train_sampled, y_train_sampled)

In [None]:
search_pred = search_model.predict(X_test)
search_recall = recall_score(y_test, search_pred)
print(search_recall)

0.8470588235294118


### **Default model**

In [None]:
default_model = LGBMClassifier()
default_model.fit(X_train_sampled, y_train_sampled)

In [None]:
def_pred = default_model.predict(X_test)
def_recall = recall_score(y_test, def_pred)
print(def_recall)

0.8588235294117647
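
Both scores above are recall on the positive (attrition) class, that is, the share of actual positives that the model finds. As a reminder of what `recall_score` computes in the binary case, here is a minimal pure-Python sketch. The helper name `recall_from_labels` and the labels are hypothetical and are not taken from the dataset.

```python
def recall_from_labels(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Return 0.0 when there are no positive samples at all
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical labels, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
print(recall_from_labels(y_true, y_pred))  # 3 of 4 positives found -> 0.75
```

False positives (the last two predictions) do not affect recall, which is why a model can score well on this metric while still over-predicting attrition.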


### **1. Bayesian Optimization**

In [None]:
pip install bayesian-optimization

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bayesian-optimization
  Downloading bayesian_optimization-1.4.2-py3-none-any.whl (17 kB)
Collecting colorama>=0.4.6
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: colorama, bayesian-optimization
Successfully installed bayesian-optimization-1.4.2 colorama-0.4.6


In [None]:
from bayes_opt import BayesianOptimization

#Prepare train set and test set
lgbmdata = lgbm.Dataset(data=X_train_sampled, label=y_train_sampled)

def cvlgbm(num_leaves, max_depth, min_data_in_leaf,
         learning_rate, bagging_freq):
  clf = LGBMClassifier(objective='binary',
                       num_leaves=round(num_leaves),
                       max_depth=round(max_depth),
                       min_data_in_leaf=round(min_data_in_leaf),
                       learning_rate=learning_rate,
                       bagging_freq = round(bagging_freq),
                       # LightGBM has no built-in 'recall' metric, and eval_metric is an
                       # argument of fit(); recall is measured by cross_val_score below
                       verbose=-1)
  scores = cross_val_score(clf, X_train_sampled, y_train_sampled, cv=5, scoring='recall')
  return np.mean(scores)
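
One caveat about this objective: the cross-validation runs on data that was already oversampled, so copies of the same minority row can land in both the training folds and the validation fold. That may explain why the cross-validated recall reported by the optimizer is close to 1.0 while the test recall stays below 0.9. The following pure-Python sketch illustrates the effect on hypothetical toy data (row ids only, no model); it is not the notebook's pipeline.

```python
import random

rng = random.Random(42)

# Hypothetical toy data: each row is represented only by an id
majority = list(range(100))        # 100 distinct majority-class rows
minority = list(range(100, 120))   # 20 distinct minority-class rows

# Random oversampling: duplicate minority rows until the classes are balanced
oversampled_minority = minority + [rng.choice(minority) for _ in range(80)]
rows = majority + oversampled_minority
rng.shuffle(rows)

# Plain 5-fold split performed AFTER oversampling
k = 5
folds = [rows[i::k] for i in range(k)]

leaked = 0
total_minority_val = 0
for i, val in enumerate(folds):
    train_ids = {r for j, fold in enumerate(folds) if j != i for r in fold}
    for r in val:
        if r >= 100:  # minority row
            total_minority_val += 1
            if r in train_ids:  # an identical copy was seen during training
                leaked += 1

print(f"{leaked} of {total_minority_val} minority validation rows also appear in training")
```

A common remedy is to oversample inside each training fold only, for example by wrapping the sampler and the classifier in an `imblearn.pipeline.Pipeline` and passing that pipeline to `cross_val_score`.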

In [None]:
optimizer = BayesianOptimization(cvlgbm,
                                 {'num_leaves':(5,50),
                                  'max_depth':(-1,10),
                                  'min_data_in_leaf':(10,50),
                                  'learning_rate':(0.01,0.5),
                                  'bagging_freq':(0,10)},
                                 random_state=100,
                                 verbose=2)

In [None]:
optimizer.maximize(init_points=2, n_iter=5)

|   iter    |  target   | baggin... | learni... | max_depth | min_da... | num_le... |
-------------------------------------------------------------------------------------
| 1         | 0.9048    | 5.434     | 0.1464    | 3.67      | 43.79     | 5.212     |
| 2         | 1.0       | 1.216     | 0.3387    | 8.084     | 15.47     | 30.88     |
| 3         | 0.9995    | 0.6936    | 0.2388    | 7.821     | 13.97     | 30.35     |
| 4         | 0.9995    | 10.0      | 0.5       | -1.0      | 10.0      | 50.0      |
| 5         | 0.9098    | 0.0       | 0.01      | 10.0      | 50.0      | 50.0      |
| 6         | 0.9482    | 0.0       | 0.01      | 10.0      | 10.0 

In [None]:
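best_params = optimizer.max['params']
# Note: cvlgbm evaluated these values with round(), while np.floor is used here,
# so num_leaves 30.88 becomes 30 although 31 was the value actually scored.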
best_params['min_data_in_leaf'] = int(np.floor(best_params['min_data_in_leaf']))
best_params['bagging_freq'] = int(np.floor(best_params['bagging_freq']))
best_params['num_leaves'] = int(np.floor(best_params['num_leaves']))
best_params['max_depth'] = int(np.floor(best_params['max_depth']))
best_params

{'bagging_freq': 1,
 'learning_rate': 0.33866705151612153,
 'max_depth': 8,
 'min_data_in_leaf': 15,
 'num_leaves': 30}
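
The objective `cvlgbm` converts the continuous suggestions with `round()`, whereas the cell above uses `np.floor`, so the refitted model is not guaranteed to use the configuration that was scored. A small sketch of the discrepancy, using the displayed (rounded) values of the best iteration from the optimizer log:

```python
import math

# Raw values of the best trial (iteration 2) as displayed in the optimizer log
raw = {'bagging_freq': 1.216, 'max_depth': 8.084,
       'min_data_in_leaf': 15.47, 'num_leaves': 30.88}

evaluated = {k: round(v) for k, v in raw.items()}   # what cvlgbm scored
refit = {k: math.floor(v) for k, v in raw.items()}  # what best_params holds

# Keep only the parameters where the two conversions disagree
mismatch = {k: (evaluated[k], refit[k]) for k in raw if evaluated[k] != refit[k]}
print(mismatch)  # {'num_leaves': (31, 30)}
```

Using the same conversion in both places (for example `int(round(...))`) would keep the refit consistent with the scored trial.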

In [None]:
tuned_clf = LGBMClassifier(**best_params)

In [None]:
tuned_clf.fit(X_train_sampled, y_train_sampled)
pred = tuned_clf.predict(X_test)

In [None]:
recall = recall_score(y_test, pred)
print(recall)

0.8764705882352941


### **2. Hyperopt**

In [None]:
from hyperopt import tpe, hp
from hyperopt.fmin import fmin

In [None]:
space = {'num_leaves':hp.quniform('num_leaves',5,50,1),
        'max_depth':hp.quniform('max_depth',-1,10,1),
        'min_data_in_leaf':hp.quniform('min_data_in_leaf',10,50,1),
        'learning_rate':hp.uniform('learning_rate',0.01,0.5),
        'bagging_freq':hp.quniform('bagging_freq',0,10,1)}

def recall(target, pred):
  return -recall_score(target, pred)

def cvlgbm(params):
  clf = LGBMClassifier(objective='binary',
                       num_leaves=int(params['num_leaves']),
                       max_depth=int(params['max_depth']),
                       min_data_in_leaf=int(params['min_data_in_leaf']),
                       learning_rate=params['learning_rate'],
                       bagging_freq = int(params['bagging_freq']),
                       # LightGBM has no built-in 'recall' metric, and eval_metric is an
                       # argument of fit(); recall is measured by cross_val_score below
                       verbose=-1, random_state=100)
  scores = cross_val_score(clf, X_train_sampled, y_train_sampled, cv=5, scoring='recall', error_score='raise')
  return -np.mean(scores)

best = fmin(fn=cvlgbm, space=space, algo=tpe.suggest, max_evals=20)

 90%|█████████ | 18/20 [00:23<00:01,  1.04it/s, best loss: -0.9995012468827931]

In [None]:
print(best)

{'bagging_freq': 8.0, 'learning_rate': 0.16825692602252965, 'max_depth': 10.0, 'min_data_in_leaf': 23.0, 'num_leaves': 43.0}
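
Because `hp.quniform` returns floats, the integer-valued parameters come back as `8.0`, `10.0`, and so on. Rather than retyping them by hand, they can be cast programmatically. This is a minimal sketch; the names `int_params` and `hp_params` are introduced here for illustration.

```python
# Result printed by fmin above
best = {'bagging_freq': 8.0, 'learning_rate': 0.16825692602252965,
        'max_depth': 10.0, 'min_data_in_leaf': 23.0, 'num_leaves': 43.0}

# Parameters that LightGBM expects as integers
int_params = ('bagging_freq', 'max_depth', 'min_data_in_leaf', 'num_leaves')

# Cast only the integer-valued entries, leave learning_rate as a float
hp_params = {k: int(v) if k in int_params else v for k, v in best.items()}
print(hp_params)
```

The resulting dictionary can then be unpacked directly, as in `LGBMClassifier(**hp_params)`.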


In [None]:
hp_clf = LGBMClassifier(bagging_freq=8, learning_rate=0.16825692602252965, max_depth=10, min_data_in_leaf=23, num_leaves=43)

hp_clf.fit(X_train_sampled, y_train_sampled)
pred = hp_clf.predict(X_test)

recall = recall_score(y_test, pred)
print(recall)

0.8647058823529412


### **3. Optuna**

In [None]:
pip install optuna

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-3.1.0-py3-none-any.whl (365 kB)
Collecting alembic>=1.5.0
  Downloading alembic-1.10.1-py3-none-any.whl (212 kB)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting cmaes>=0.9.1
  Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)
Collecting Mako
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, cmaes, alembic, optuna
Successfully installed Mako-1.2.4 alembic-1.10.1 cmaes-0.9.1 colorlog-6.7.0 optuna-3.1.0


In [None]:
import optuna

In [None]:
def objective(trial):
  params = {'objective':'binary',
            'boosting':'gbdt',
            # no 'metric' entry: LightGBM has no built-in 'recall' metric
            'random_state':100,
            'verbosity':-1,
            'num_leaves':trial.suggest_int('num_leaves',5,50),
            'max_depth':trial.suggest_int('max_depth',-1,10),
            'min_data_in_leaf':trial.suggest_int('min_data_in_leaf',10,50),
            'learning_rate':trial.suggest_float('learning_rate',0.01,0.5),
            'bagging_freq':trial.suggest_int('bagging_freq',0,10)}

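  lgb = LGBMClassifier(**params)
  lgb.fit(X_train_sampled, y_train_sampled)
  # Caution: each trial is scored on X_test, so the tuner sees the test set and
  # the final test recall is optimistically biased. Unlike the two tuners above,
  # this objective does not cross-validate on the training data.
  pred = lgb.predict(X_test)
  recall = recall_score(y_test, pred)
  return recall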

study_tuner = optuna.create_study(direction='maximize')

study_tuner.optimize(objective, n_trials=50)

[I 2023-03-07 14:02:54,750] A new study created in memory with name: no-name-8215f339-693c-400b-b2b2-0e90a8765471
[I 2023-03-07 14:02:55,173] Trial 0 finished with value: 0.8411764705882353 and parameters: {'num_leaves': 10, 'max_depth': 6, 'min_data_in_leaf': 14, 'learning_rate': 0.44354235099780437, 'bagging_freq': 6}. Best is trial 0 with value: 0.8411764705882353.
[I 2023-03-07 14:02:55,531] Trial 1 finished with value: 0.6647058823529411 and parameters: {'num_leaves': 10, 'max_depth': 4, 'min_data_in_leaf': 14, 'learning_rate': 0.020754165694198523, 'bagging_freq': 9}. Best is trial 0 with value: 0.8411764705882353.
[I 2023-03-07 14:02:55,670] Trial 2 finished with value: 0.8058823529411765 and parameters: {'num_leaves': 9, 'max_depth': 4, 'min_data_in_leaf': 15, 'learning_rate': 0.3946133016745912, 'bagging_freq': 8}. Best is trial 0 with value: 0.8411764705882353.
[I 2023-03-07 14:02:55,836] Trial 3 finished with value

In [None]:
best = study_tuner.best_params
print(best)

{'num_leaves': 31, 'max_depth': -1, 'min_data_in_leaf': 26, 'learning_rate': 0.4922187372632303, 'bagging_freq': 1}


In [None]:
op_clf = LGBMClassifier(**best)

op_clf.fit(X_train_sampled, y_train_sampled)
pred = op_clf.predict(X_test)

recall = recall_score(y_test, pred)
print(recall)

0.8823529411764706


### **4. Ray Tune**

In [None]:
pip install ray[tune]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from ray import air, tune
from ray.air import session
from ray.tune.schedulers import ASHAScheduler
from ray.tune.integration.lightgbm import TuneReportCheckpointCallback

In [None]:
def train_attrition(config):
  X_train, X_val, y_train, y_val = train_test_split(
    X_train_sampled, y_train_sampled, test_size=0.3, random_state=42)
  train_set = lgbm.Dataset(X_train, label=y_train)
  test_set = lgbm.Dataset(X_val, label=y_val)
  gbm = lgbm.train(config, train_set,
                   valid_sets=[test_set],
                   valid_names=['eval'],
                   verbose_eval=False,
                   callbacks=[TuneReportCheckpointCallback(
                       {'binary_error':'eval-binary_error',
                        'binary_logloss':'eval-binary_logloss'})]
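                   )  # no trailing comma here, otherwise gbm becomes a 1-tuple
  # Booster.predict expects raw features, not a lgbm.Dataset
  pred = gbm.predict(X_val)
  pred_labels = np.rint(pred)
  # Score against the validation labels that match X_val, not y_test
  session.report({'recall':recall_score(y_val, pred_labels),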
                  'done':True})
  
config = {'objective':'binary',
            'boosting':'gbdt',
            'metric':['binary_error','binary_logloss'],
            'random_state':100,
            'verbosity':-1,
            'num_leaves':tune.randint(5,50),
            'max_depth':tune.randint(-1,10),
            'min_data_in_leaf':tune.randint(10,50),
            'learning_rate':tune.loguniform(0.01,0.5),
            'bagging_freq':tune.randint(0,10)}

tuner = tune.Tuner(train_attrition,
                   tune_config=tune.TuneConfig(
                       metric='binary_error',
                       mode='min',
                       scheduler=ASHAScheduler(),
                       num_samples=10),
                   param_space=config
                   )

In [None]:
results = tuner.fit()

2023-03-08 10:22:10,123	INFO worker.py:1553 -- Started a local Ray instance.


- Current time: 2023-03-08 10:22:41
- Running for: 00:00:22.34
- Memory: 1.7/12.7 GiB

| Trial name | status | loc | bagging_freq | learning_rate | max_depth | min_data_in_leaf | num_leaves | iter | total time (s) | binary_error | binary_logloss |
|---|---|---|---|---|---|---|---|---|---|---|---|
| train_attrition_18094_00000 | TERMINATED | 172.28.0.12:31270 | 8 | 0.0273048 | 6 | 15 | 6 | 100 | 14.1078 | 0.2 | 0.49091 |
| train_attrition_18094_00001 | TERMINATED | 172.28.0.12:31340 | 8 | 0.0302323 | 0 | 20 | 41 | 100 | 1.90472 | 0.039834 | 0.205112 |
| train_attrition_18094_00002 | TERMINATED | 172.28.0.12:31270 | 4 | 0.167504 | 6 | 23 | 43 | 1 | 0.0577185 | 0.221577 | 0.634087 |
| train_attrition_18094_00003 | TERMINATED | 172.28.0.12:31270 | 3 | 0.389997 | 7 | 22 | 48 | 1 | 0.0784388 | 0.208299 | 0.557248 |
| train_attrition_18094_00004 | TERMINATED | 172.28.0.12:31270 | 8 | 0.0697215 | 7 | 23 | 44 | 1 | 0.0672958 | 0.212448 | 0.664281 |
| train_attrition_18094_00005 | TERMINATED | 172.28.0.12:31270 | 5 | 0.0370093 | 2 | 49 | 9 | 1 | 0.0710759 | 0.304564 | 0.686552 |
| train_attrition_18094_00006 | TERMINATED | 172.28.0.12:31270 | 5 | 0.468551 | 9 | 39 | 30 | 1 | 0.048347 | 0.217427 | 0.550696 |
| train_attrition_18094_00007 | TERMINATED | 172.28.0.12:31270 | 7 | 0.0266605 | -1 | 38 | 33 | 1 | 0.0638478 | 0.231535 | 0.682256 |
| train_attrition_18094_00008 | TERMINATED | 172.28.0.12:31270 | 3 | 0.0330174 | 2 | 15 | 40 | 1 | 0.0581408 | 0.304564 | 0.687283 |
| train_attrition_18094_00009 | TERMINATED | 172.28.0.12:31340 | 3 | 0.0100409 | 9 | 38 | 10 | 1 | 0.0667381 | 0.512033 | 0.690695 |




Trial name,binary_error,binary_logloss,date,done,episodes_total,experiment_id,hostname,iterations_since_restore,node_ip,pid,time_since_restore,time_this_iter_s,time_total_s,timestamp,timesteps_since_restore,timesteps_total,training_iteration,trial_id,warmup_time
train_attrition_18094_00000,0.2,0.49091,2023-03-08_10-22-40,True,,82abb0aa3b3e4cd595a571f6b7e0fff3,256b117648ba,100,172.28.0.12,31270,14.1078,0.00548291,14.1078,1678270960,0,,100,18094_00000,0.00805187
train_attrition_18094_00001,0.039834,0.205112,2023-03-08_10-22-40,True,,f26159d23cd14fafb2f07da86d2ec749,256b117648ba,100,172.28.0.12,31340,1.90472,0.033309,1.90472,1678270960,0,,100,18094_00001,0.00463963
train_attrition_18094_00002,0.221577,0.634087,2023-03-08_10-22-40,True,,82abb0aa3b3e4cd595a571f6b7e0fff3,256b117648ba,1,172.28.0.12,31270,0.0577185,0.0577185,0.0577185,1678270960,0,,1,18094_00002,0.00805187
train_attrition_18094_00003,0.208299,0.557248,2023-03-08_10-22-40,True,,82abb0aa3b3e4cd595a571f6b7e0fff3,256b117648ba,1,172.28.0.12,31270,0.0784388,0.0784388,0.0784388,1678270960,0,,1,18094_00003,0.00805187
train_attrition_18094_00004,0.212448,0.664281,2023-03-08_10-22-40,True,,82abb0aa3b3e4cd595a571f6b7e0fff3,256b117648ba,1,172.28.0.12,31270,0.0672958,0.0672958,0.0672958,1678270960,0,,1,18094_00004,0.00805187
train_attrition_18094_00005,0.304564,0.686552,2023-03-08_10-22-40,True,,82abb0aa3b3e4cd595a571f6b7e0fff3,256b117648ba,1,172.28.0.12,31270,0.0710759,0.0710759,0.0710759,1678270960,0,,1,18094_00005,0.00805187
train_attrition_18094_00006,0.217427,0.550696,2023-03-08_10-22-40,True,,82abb0aa3b3e4cd595a571f6b7e0fff3,256b117648ba,1,172.28.0.12,31270,0.048347,0.048347,0.048347,1678270960,0,,1,18094_00006,0.00805187
train_attrition_18094_00007,0.231535,0.682256,2023-03-08_10-22-40,True,,82abb0aa3b3e4cd595a571f6b7e0fff3,256b117648ba,1,172.28.0.12,31270,0.0638478,0.0638478,0.0638478,1678270960,0,,1,18094_00007,0.00805187
train_attrition_18094_00008,0.304564,0.687283,2023-03-08_10-22-40,True,,82abb0aa3b3e4cd595a571f6b7e0fff3,256b117648ba,1,172.28.0.12,31270,0.0581408,0.0581408,0.0581408,1678270960,0,,1,18094_00008,0.00805187
train_attrition_18094_00009,0.512033,0.690695,2023-03-08_10-22-40,True,,f26159d23cd14fafb2f07da86d2ec749,256b117648ba,1,172.28.0.12,31340,0.0667381,0.0667381,0.0667381,1678270960,0,,1,18094_00009,0.00463963


2023-03-08 10:22:41,088	INFO tune.py:798 -- Total run time: 27.30 seconds (22.29 seconds for the tuning loop).


In [None]:
best = results.get_best_result().config
print(best)

{'objective': 'binary', 'boosting': 'gbdt', 'metric': ['binary_error', 'binary_logloss'], 'random_state': 100, 'verbosity': -1, 'num_leaves': 41, 'max_depth': 0, 'min_data_in_leaf': 20, 'learning_rate': 0.03023228780231226, 'bagging_freq': 8}


In [None]:
rt_clf = LGBMClassifier(bagging_freq=8, learning_rate=0.03023228780231226, max_depth=0, min_data_in_leaf=20, num_leaves=41)

rt_clf.fit(X_train_sampled, y_train_sampled)
pred = rt_clf.predict(X_test)

recall = recall_score(y_test, pred)
print(recall)

0.8176470588235294


### Answer the Questions Below

#### 1. Which model is better? The tuned model, or the default model? Explain why. 

The default model slightly outperforms the model tuned with RandomizedSearchCV (test recall 0.859 versus 0.847). The tuned model's recall may be lower because the search grid does not bracket the defaults: for each parameter, the candidate values should include the default value as well as values below and above it, otherwise the search cannot fall back to a configuration at least as good as the default.
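
The point about the grid not covering the defaults can be checked directly. The sketch below copies the grid from the RandomizedSearch cell and compares it with the LightGBM defaults as listed in its parameter documentation (`num_leaves=31`, `max_depth=-1`, `min_data_in_leaf=20`, `boosting='gbdt'`).

```python
# Search grid copied from the RandomizedSearch cell
grid = {
    'num_leaves': [10, 20, 40, 60, 80, 100, 120, 150, 170, 200],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'min_data_in_leaf': [1, 2, 3, 5, 10, 15, 20, 25, 30],
    'boosting': ['gbdt', 'dart'],
}

# LightGBM defaults for the same parameters
defaults = {'num_leaves': 31, 'max_depth': -1,
            'min_data_in_leaf': 20, 'boosting': 'gbdt'}

# Parameters whose default value cannot be selected by the search
missing = [name for name, value in defaults.items() if value not in grid[name]]

# Total number of grid combinations, of which n_iter=100 are sampled
n_combinations = 1
for values in grid.values():
    n_combinations *= len(values)

print(missing)         # ['num_leaves', 'max_depth']
print(n_combinations)  # 1800
```

So the default configuration is not reachable from this grid, and with `n_iter = 100` only a small part of the 1800 combinations is tried.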

#### 2. Search the internet: are there automated ways to do hyperparameter tuning that work better than the Randomized Search CV approach used here? If so, list those better approaches.

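Yes. Bayesian Optimization (https://github.com/fmfn/BayesianOptimization), Hyperopt (https://github.com/hyperopt/hyperopt), and Optuna (https://optuna.org/) all gave a better test recall (0.876, 0.865, and 0.882 respectively) than Randomized Search CV (0.847). Ray Tune, which was tuned here on binary error rather than recall, gave 0.818.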

#### 3. If the model underfits or overfits, is hyperparameter tuning the main solution to that problem?

Hyperparameter tuning is one way to deal with overfitting and underfitting, because it lets us reduce or increase the model's complexity, but it is not the only one. Underfitting can also be addressed by adding informative features or training for longer. Overfitting can also be addressed by collecting more training data, removing noisy features, or stopping training earlier.
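
Before choosing a remedy, it helps to tell the two cases apart by comparing the score on the training data with the score on held-out data. The helper below is a rough heuristic sketch; the function name `diagnose_fit`, the thresholds `low` and `gap`, and the example scores are all hypothetical and would need to be adapted to the metric and dataset.

```python
def diagnose_fit(train_score, val_score, low=0.7, gap=0.1):
    """Rough heuristic; the thresholds `low` and `gap` are hypothetical."""
    # Both scores poor: the model is too simple or undertrained
    if train_score < low and val_score < low:
        return 'underfit'
    # Much better on training data than on held-out data
    if train_score - val_score > gap:
        return 'overfit'
    return 'ok'

# Hypothetical scores, for illustration only
print(diagnose_fit(0.99, 0.80))  # overfit
print(diagnose_fit(0.60, 0.58))  # underfit
print(diagnose_fit(0.88, 0.86))  # ok
```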