# Using LGBM with Optuna for hyperparameter tuning

I hope everyone is enjoying this competition; I definitely am, although I couldn't commit as much time as I wanted. I noticed everyone doing their best to help each other by sharing new findings while competing, which also makes this competition a good example for newcomers. Since I didn't see a good sample notebook on hyperparameter tuning, I thought I would share mine. It is aimed at those who are new to Optuna and are thinking about using it in this competition. I would especially welcome feedback from experts, and of course I am open to feedback and suggestions from anyone on how to improve the notebook.


### Optuna: 
Optuna searches for the best parameters within a search space that you specify. It is easy to install into your existing data science stack. Details can be found here: https://optuna.readthedocs.io/en/stable/index.html

### Data Processing: 
1. Started with the denoised data shared by Raddar.
 Data here: https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format
 Discussion and code on this page: https://www.kaggle.com/competitions/amex-default-prediction/discussion/328514
2. Features were generated by aggregation, using the code shared here: https://www.kaggle.com/code/ambrosm/amex-lightgbm-quickstart
3. The resulting features are shared here: https://www.kaggle.com/datasets/kmmohsin/amex-denoised-aggregated-features
4. The actual contribution of this notebook is integrating Optuna and finding parameters that lead to 0.796 on the LB using CPU. On GPU the best I got was 0.795. The parameters are shared in the next section.

### Note: 
1. You can run this code on GPU as well; you just have to set device = 'gpu'.
2. GPU needs a different set of parameters. They are shared along with the CPU parameters as inline comments.
3. You will not get the best result if you just run the notebook as is.
4. For the best result, please tweak the parameters to be close to mine, shared in item 5.
   
   Change these lines to your desired search space:
   
    n_est = trial.suggest_int("n_estimators", 10, 100, step=10)
    
    lr = trial.suggest_float("learning_rate", .01, .03, step=.01)

5. Best parameters I got with these features on CPU, **OOF 0.796 and LB score 0.796.**

       n_estimators= 4800,  
       learning_rate= .01,  
       reg_lambda=50,
       min_child_samples=2400,
       num_leaves = 95,  # in gpu try with 40
       colsample_bytree=0.19,
       max_bins = 511,   #  for gpu try with 255
       random_state = 42,
       n_jobs = 16  # number of physical cpu cores
       # device= 'gpu'
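When widening the search space in item 4, it helps to know how many distinct combinations a stepped space contains, so that the number of trials can be chosen accordingly. The sketch below counts them for the two example lines above. The helper `stepped_values` is hypothetical (not part of Optuna) and assumes that the range is an exact multiple of the step.

```python
def stepped_values(low, high, step):
    """Values a stepped suggest_* call can return: low, low + step, ..., high.

    Hypothetical helper; assumes (high - low) is a multiple of step.
    """
    n = int(round((high - low) / step)) + 1
    return [round(low + i * step, 10) for i in range(n)]

# Same ranges as the two example lines in item 4.
n_estimators_grid = stepped_values(10, 100, 10)
learning_rate_grid = stepped_values(0.01, 0.03, 0.01)

total = len(n_estimators_grid) * len(learning_rate_grid)
print(n_estimators_grid)
print(learning_rate_grid)
print(total)  # 10 * 3 = 30 combinations
```

With 30 possible combinations, a study that runs only a couple of trials samples a small fraction of the space.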

### Future Work: 
1. There may be other parameters worth tuning with Optuna; I leave that for others to work on.
2. Feature engineering, which I leave to the top competitors.


### Disclaimer:
1. Please tweak it to run on GPU with a new set of parameters.
2. I found these parameters using 16 cores on my local machine.
3. The shared code can run in the notebook without memory errors. You only need to tweak it for the right parameters, which may take longer to run.

Good luck!

### Imports

In [1]:
# imports

import numpy as np
import pandas as pd
from cycler import cycler
from IPython.display import display
import datetime
import scipy.stats
import warnings
from colorama import Fore, Back, Style
import gc

from sklearn.model_selection import StratifiedKFold
from sklearn.calibration import CalibrationDisplay
from lightgbm import LGBMClassifier, log_evaluation

import optuna

### Configurations

In [2]:
# config
DATA_PATH = '../input/amex-data-integer-dtypes-parquet-format/'   # denoised data from raddar
LABELS_PATH = '../input/amex-default-prediction/train_labels.csv' # original data sources

TEST_FEAT_PATH = '../input/amex-denoised-aggregated-features/test_feat.parquet' # aggregated features I shared
TRAIN_FEAT_PATH = '../input/amex-denoised-aggregated-features/train_feat.parquet'

### Helper functions

In [3]:
# helper functions

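def get_data(read_from_cache=True):
    """Load the pre-aggregated features; note the return order is (test, train)."""
    # read_from_cache is currently unused; it is kept so existing calls still work
    train = pd.read_parquet(TRAIN_FEAT_PATH)
    test = pd.read_parquet(TEST_FEAT_PATH)
    return test, train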

def amex_metric(y_true: np.array, y_pred: np.array) -> float:

    # count of positives and negatives
    n_pos = y_true.sum()
    n_neg = y_true.shape[0] - n_pos

    # sort by descending prediction values
    indices = np.argsort(y_pred)[::-1]
    preds, target = y_pred[indices], y_true[indices]

    # filter the top 4% by cumulative row weights
    weight = 20.0 - target * 19.0
    cum_norm_weight = (weight / weight.sum()).cumsum()
    four_pct_filter = cum_norm_weight <= 0.04

    # default rate captured at 4%
    d = target[four_pct_filter].sum() / n_pos

    # weighted Gini coefficient
    lorentz = (target / n_pos).cumsum()
    gini = ((lorentz - cum_norm_weight) * weight).sum()

    # max weighted Gini coefficient
    gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))

    # normalized weighted Gini coefficient
    g = gini / gini_max

    return 0.5 * (g + d)


def lgb_amex_metric(y_true, y_pred):
    """The competition metric with lightgbm's calling convention"""
    return ('amex',
            amex_metric(y_true, y_pred),
            True)
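The notebook scores raw LightGBM outputs (`raw_score=True`) rather than probabilities. Because the metric depends only on the ordering of the predictions, this should not change the score. The sketch below checks that on made-up toy arrays. The metric is repeated in compact form so the block runs on its own, and the labels and scores are assumptions for illustration.

```python
import numpy as np

def amex_metric(y_true, y_pred):
    # Same logic as the helper above, repeated so this sketch is standalone.
    n_pos = y_true.sum()
    n_neg = y_true.shape[0] - n_pos
    order = np.argsort(y_pred)[::-1]
    target = y_true[order]
    weight = 20.0 - target * 19.0
    cum_norm_weight = (weight / weight.sum()).cumsum()
    d = target[cum_norm_weight <= 0.04].sum() / n_pos
    lorentz = (target / n_pos).cumsum()
    gini = ((lorentz - cum_norm_weight) * weight).sum()
    gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))
    return 0.5 * (gini / gini_max + d)

# Toy labels (made up): 2 defaults among 10 customers.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])
# Toy raw scores (log-odds); the two defaults get the highest values.
raw = np.array([2.0, -1.5, -0.3, 1.2, -2.2, 0.4, -0.9, -3.0, 0.1, -1.1])
# The sigmoid is monotonic, so it keeps the same ordering.
prob = 1.0 / (1.0 + np.exp(-raw))

score_raw = amex_metric(y_true, raw)
score_prob = amex_metric(y_true, prob)
print(score_raw, score_prob)
```

Both calls give the same value, and since the toy scores rank both defaults first, that value is the maximum of the metric.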

### Optuna objective definition

In [4]:
def objective(trial):
    
    target = pd.read_csv(LABELS_PATH).target.values
    test, train = get_data(read_from_cache=True)
    print(f"target shape: {target.shape}, train shape: {train.shape}, test shape: {test.shape}")
    features = [f for f in train.columns if f != 'customer_ID' and f != 'target']
    
    n_est = trial.suggest_int("n_estimators", 10, 30, step=10)
    lr = trial.suggest_float("learning_rate", .01, .02, step=.01)
   
    def my_booster(n_est, lr):
        return LGBMClassifier(n_estimators= n_est,  # original 1200
                   learning_rate= lr,  # original 0.03
                   reg_lambda=50,
                   min_child_samples=2400,
                   num_leaves = 95,  # with cpu 95
                   colsample_bytree=0.19,
                   max_bins = 511,   # originally for CPU 511, for gpu 255
                   random_state = 42,
                   n_jobs = 16  # number of physical cpu cores
                   # min_data_in_leaf = 1000, 
                   # device= 'gpu'
                )    
    
    cv_folds = 5
    ONLY_FIRST_FOLD = False
    score_list, y_pred_list = [], []
    kf = StratifiedKFold(n_splits=cv_folds)
    
    for fold, (idx_tr, idx_va) in enumerate(kf.split(train, target)):
        X_tr, X_va, y_tr, y_va, model = None, None, None, None, None
        start_time = datetime.datetime.now()
        X_tr = train.iloc[idx_tr][features]
        X_va = train.iloc[idx_va][features]
        y_tr = target[idx_tr]
        y_va = target[idx_va]

        with warnings.catch_warnings():
            warnings.filterwarnings('ignore', category=UserWarning)
            model = my_booster(n_est, lr)
            model.fit(X_tr, y_tr,
                      eval_set = [(X_va, y_va)], 
                      eval_metric=[lgb_amex_metric],
                      callbacks=[log_evaluation(10)])
        X_tr, y_tr = None, None
        y_va_pred = model.predict_proba(X_va, raw_score=True)
        score = amex_metric(y_va, y_va_pred)
        n_trees = model.best_iteration_
        if n_trees is None: n_trees = model.n_estimators
        print(f"{Fore.GREEN}{Style.BRIGHT}Fold {fold} | {str(datetime.datetime.now() - start_time)[-12:-7]} |"
              f" {n_trees:5} trees |"
              f"                Score = {score:.5f}{Style.RESET_ALL}")
        score_list.append(score)

    print(f"{Fore.GREEN}{Style.BRIGHT}OOF Score:                       {np.mean(score_list):.5f}{Style.RESET_ALL}")
    return np.mean(score_list)

### Driver

In [5]:
if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=2)  # change it to cover the search space

    print("Number of finished trials: {}".format(len(study.trials)))

    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))

    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

[I 2022-07-15 01:04:36,657] A new study created in memory with name: no-name-b8cb477e-96c1-4efa-afa0-e574b44421bb


target shape: (458913,), train shape: (458913, 469), test shape: (924621, 469)
[10]	valid_0's binary_logloss: 0.488114	valid_0's amex: 0.729313
[20]	valid_0's binary_logloss: 0.431348	valid_0's amex: 0.736506
[30]	valid_0's binary_logloss: 0.38933	valid_0's amex: 0.73983
[32m[1mFold 0 | 00:30 |    30 trees |                Score = 0.73963[0m
[10]	valid_0's binary_logloss: 0.4877	valid_0's amex: 0.730007
[20]	valid_0's binary_logloss: 0.430641	valid_0's amex: 0.735035
[30]	valid_0's binary_logloss: 0.388466	valid_0's amex: 0.73766
[32m[1mFold 1 | 00:31 |    30 trees |                Score = 0.73746[0m
[10]	valid_0's binary_logloss: 0.48788	valid_0's amex: 0.735276
[20]	valid_0's binary_logloss: 0.430813	valid_0's amex: 0.741658
[30]	valid_0's binary_logloss: 0.388549	valid_0's amex: 0.744718
[32m[1mFold 2 | 00:31 |    30 trees |                Score = 0.74447[0m
[10]	valid_0's binary_logloss: 0.487749	valid_0's amex: 0.736298
[20]	valid_0's binary_logloss: 0.430581	valid_0's am

[I 2022-07-15 01:07:34,430] Trial 0 finished with value: 0.7430070320187431 and parameters: {'n_estimators': 30, 'learning_rate': 0.02}. Best is trial 0 with value: 0.7430070320187431.


Fold 4 | 00:31 |    30 trees |                Score = 0.74645
OOF Score:                       0.74301
target shape: (458913,), train shape: (458913, 469), test shape: (924621, 469)
[10]	valid_0's binary_logloss: 0.488114	valid_0's amex: 0.729313
[32m[1mFold 0 | 00:25 |    10 trees |                Score = 0.72907[0m
[10]	valid_0's binary_logloss: 0.4877	valid_0's amex: 0.730007
[32m[1mFold 1 | 00:25 |    10 trees |                Score = 0.72978[0m
[10]	valid_0's binary_logloss: 0.48788	valid_0's amex: 0.735276
[32m[1mFold 2 | 00:24 |    10 trees |                Score = 0.73503[0m
[10]	valid_0's binary_logloss: 0.487749	valid_0's amex: 0.736298
[32m[1mFold 3 | 00:23 |    10 trees |                Score = 0.73608[0m
[10]	valid_0's binary_logloss: 0.487524	valid_0's amex: 0.734196


[I 2022-07-15 01:09:44,723] Trial 1 finished with value: 0.7327885173921829 and parameters: {'n_estimators': 10, 'learning_rate': 0.02}. Best is trial 0 with value: 0.7430070320187431.


Fold 4 | 00:23 |    10 trees |                Score = 0.73398
OOF Score:                       0.73279
Number of finished trials: 2
Best trial:
  Value: 0.7430070320187431
  Params: 
    n_estimators: 30
    learning_rate: 0.02


Now that you know the best parameters, you can retrain with them for a fresh run. To free up memory first, I call the garbage collector.

In [6]:
gc.collect()

104
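Before retraining, the tuned values can be combined with the settings that stayed fixed. The sketch below does this with plain dictionaries. The tuned values are copied from the "Best trial" printout of the demo run above, and the name `final_params` is just an illustrative choice.

```python
# Values copied from the "Best trial" output above; with a live study
# object, study.best_params returns the same kind of dictionary.
best_params = {"n_estimators": 30, "learning_rate": 0.02}

# Fixed settings taken from the notebook's my_booster function.
fixed_params = {
    "reg_lambda": 50,
    "min_child_samples": 2400,
    "num_leaves": 95,
    "colsample_bytree": 0.19,
    "max_bins": 511,
    "random_state": 42,
    "n_jobs": 16,
}

# Later keys win, so tuned values would override any fixed duplicates.
final_params = {**fixed_params, **best_params}
print(final_params)
# final_params could then be passed on as LGBMClassifier(**final_params).
```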

### Train and Infer with best parameters

In [7]:
target = pd.read_csv(LABELS_PATH).target.values
test, train = get_data(read_from_cache=True)
print(f"target shape: {target.shape}, train shape: {train.shape}, test shape: {test.shape}")
features = [f for f in train.columns if f != 'customer_ID' and f != 'target']

def my_booster(n_est, lr):
    return LGBMClassifier(n_estimators= n_est,  # try 4800
               learning_rate= lr,  # try 0.01
               reg_lambda=50,
               min_child_samples=2400,
               num_leaves = 95,  # with cpu 95
               colsample_bytree=0.19,
               max_bins = 511,   # originally for CPU 511, for gpu 255
               random_state = 42,
               n_jobs = 16  # number of physical cpu cores
               # min_data_in_leaf = 1000, 
               # device= 'gpu'
            )    

cv_folds = 5
ONLY_FIRST_FOLD = False
score_list, y_pred_list = [], []
kf = StratifiedKFold(n_splits=cv_folds)

for fold, (idx_tr, idx_va) in enumerate(kf.split(train, target)):
    X_tr, X_va, y_tr, y_va, model = None, None, None, None, None
    start_time = datetime.datetime.now()
    X_tr = train.iloc[idx_tr][features]
    X_va = train.iloc[idx_va][features]
    y_tr = target[idx_tr]
    y_va = target[idx_va]

    with warnings.catch_warnings():
        warnings.filterwarnings('ignore', category=UserWarning)
        model = my_booster(50, .01)  # passing best params, try with 4800 and 0.01
        model.fit(X_tr, y_tr,
                  eval_set = [(X_va, y_va)], 
                  eval_metric=[lgb_amex_metric],
                  callbacks=[log_evaluation(10)])  # change to 100 for large number of trees
    X_tr, y_tr = None, None
    y_va_pred = model.predict_proba(X_va, raw_score=True)
    score = amex_metric(y_va, y_va_pred)
    n_trees = model.best_iteration_
    if n_trees is None: n_trees = model.n_estimators
    print(f"{Fore.GREEN}{Style.BRIGHT}Fold {fold} | {str(datetime.datetime.now() - start_time)[-12:-7]} |"
          f" {n_trees:5} trees |"
          f"                Score = {score:.5f}{Style.RESET_ALL}")
    score_list.append(score)
    # inference
    y_pred_list.append(model.predict_proba(test[features], raw_score=True))

print(f"{Fore.GREEN}{Style.BRIGHT}OOF Score:                       {np.mean(score_list):.5f}{Style.RESET_ALL}")

target shape: (458913,), train shape: (458913, 469), test shape: (924621, 469)
[10]	valid_0's binary_logloss: 0.525976	valid_0's amex: 0.727389
[20]	valid_0's binary_logloss: 0.48902	valid_0's amex: 0.732419
[30]	valid_0's binary_logloss: 0.457844	valid_0's amex: 0.736497
[40]	valid_0's binary_logloss: 0.432076	valid_0's amex: 0.738707
[50]	valid_0's binary_logloss: 0.409634	valid_0's amex: 0.739711
[32m[1mFold 0 | 00:33 |    50 trees |                Score = 0.73947[0m
[10]	valid_0's binary_logloss: 0.525716	valid_0's amex: 0.727602
[20]	valid_0's binary_logloss: 0.48856	valid_0's amex: 0.732113
[30]	valid_0's binary_logloss: 0.457256	valid_0's amex: 0.733427
[40]	valid_0's binary_logloss: 0.431392	valid_0's amex: 0.735422
[50]	valid_0's binary_logloss: 0.408896	valid_0's amex: 0.737463
[32m[1mFold 1 | 00:33 |    50 trees |                Score = 0.73724[0m
[10]	valid_0's binary_logloss: 0.5259	valid_0's amex: 0.732635
[20]	valid_0's binary_logloss: 0.488782	valid_0's amex: 0.73

### Submission ready file

In [8]:
sub = pd.DataFrame({'customer_ID': test.index,
                        'prediction': np.mean(y_pred_list, axis=0)})
sub.to_csv('submission.csv', index=False)
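The submission averages the per-fold test predictions row by row. A small sketch with made-up numbers (3 folds and 4 test rows, whereas the notebook uses 5 folds) shows what `np.mean(..., axis=0)` does to a list of arrays:

```python
import numpy as np

# Made-up raw scores for 4 test rows from 3 folds.
y_pred_list = [
    np.array([0.2, -1.0, 3.0, 0.5]),
    np.array([0.4, -1.2, 2.0, 0.5]),
    np.array([0.0, -0.8, 4.0, 0.5]),
]

# axis=0 averages across folds and keeps one value per test row.
blended = np.mean(y_pred_list, axis=0)
print(blended.shape)   # (4,)
print(blended)
```

Because the predictions are raw scores, the blended values are not probabilities; that is fine for a rank-based metric, as only their ordering matters.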