<h1>Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines</h1>

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load data

In [2]:
train = pd.read_csv('./data/training_set_features.csv', index_col='respondent_id')

In [3]:
test = pd.read_csv('./data/test_set_features.csv', index_col='respondent_id')

In [4]:
labels = pd.read_csv('./data/training_set_labels.csv', index_col='respondent_id')

### Imputation strategy

In [5]:
num_cols = train.select_dtypes('number').columns

In [6]:
cat_cols = ['race', 'sex', 'marital_status', 'rent_or_own',
            'hhs_geo_region', 'census_msa',
            'employment_industry', 'employment_occupation']

In [7]:
ord_cols = ['age_group', 'education', 'income_poverty',
            'employment_status']

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from category_encoders import OrdinalEncoder as oe
from catboost import CatBoostClassifier
from catboost import Pool, cv
from sklearn.metrics import roc_curve, roc_auc_score
import optuna



#### Impute train

In [9]:
# Fill missing categorical and ordinal values with the string 'None'
for col in (cat_cols+ord_cols):
    train[col] = train[col].fillna(value='None')

In [10]:
# Fill missing numeric values with the sentinel -1
for col in num_cols:
    train[col] = train[col].fillna(value=-1)

#### Impute test

In [11]:
# Fill missing categorical and ordinal values with the string 'None'
for col in (cat_cols+ord_cols):
    test[col] = test[col].fillna(value='None')

In [12]:
# Fill missing numeric values with the sentinel -1
for col in num_cols:
    test[col] = test[col].fillna(value=-1)

### Train test split

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(train, labels, test_size=0.3, random_state=68)

In [15]:
# Positional indices of the non-float columns, which are passed to CatBoost as categorical features
categorical_features_indices = np.where(X_train.dtypes != float)[0]
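The selection above keeps every column whose dtype is not float. As a minimal sketch on a toy frame (the column names `score` and `count` and all values are hypothetical, not taken from the dataset), the following shows what that expression returns and how an integer column would also be picked up:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: a float, a string and an integer column.
toy = pd.DataFrame({
    "score": [1.0, -1.0, 2.0],
    "sex": ["Male", "Female", "None"],
    "count": [0, 1, 2],
})

# Same expression as in the notebook: every non-float column is selected,
# so the integer column "count" is included as well.
non_float_idx = np.where(toy.dtypes != float)[0]

# Alternative: keep only the non-numeric columns, then look up their positions.
non_numeric_cols = toy.select_dtypes(exclude="number").columns
non_numeric_idx = toy.columns.get_indexer(non_numeric_cols)

print(non_float_idx.tolist(), non_numeric_idx.tolist())
```

When every numeric column is stored as float, both approaches select the same columns; they differ only if some numeric column has an integer dtype.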

### Optimize with Optuna and cross validation

Using only the features selected in the previous notebook led to poor performance; using all the columns seems to work better. CatBoost copes well with uninformative features, since its trees tend not to split on them.

In [16]:
train_dataset = Pool(data=X_train,
                     label=y_train.h1n1_vaccine,
                     cat_features = categorical_features_indices)

In [17]:
def objective(trial):
    param = {
        'iterations':trial.suggest_categorical('iterations', [100,200,300,500,1000,1200,1500]), #number of boosting iterations (trees) to train
        'learning_rate':trial.suggest_float("learning_rate", 0.001, 0.3), #step size at each iteration while moving towards a minimum of the loss function
        'random_strength':trial.suggest_int("random_strength", 1,10), # strength of the randomness in tree building
        'bagging_temperature':trial.suggest_int("bagging_temperature", 0,10), #degree of randomness in the selection of training data
        'max_bin':trial.suggest_categorical('max_bin', [4,5,6,8,10,20,30]), #maximum number of bins used for feature discretization
        'grow_policy':trial.suggest_categorical('grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']), #policy used to grow the trees
        'min_data_in_leaf':trial.suggest_int("min_data_in_leaf", 1,10), #Minimum number of samples in a leaf node
        'od_type' : "Iter", #Type of overfitting detector
        'od_wait' : 100, #Number of iterations to wait before stopping the training process if no improvement
        "depth": trial.suggest_int("max_depth", 2,10), #Maximum depth of the trees
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1e-8, 100, log=True), #L2 regularization coefficient for leaf weights (sampled on a log scale)
        'one_hot_max_size':trial.suggest_categorical('one_hot_max_size', [5,10,12,100,500,1024]), #Maximum number of categories for one-hot encoding 
        'custom_metric' : ['AUC'],
        "loss_function": "Logloss",
        'auto_class_weights':trial.suggest_categorical('auto_class_weights', ['Balanced', 'SqrtBalanced']), #Strategy for automatically adjusting class weights
        }

    scores = cv(train_dataset,
            param,
            fold_count=5, 
            early_stopping_rounds=10,         
            plot=False, verbose=False)

    return scores['test-AUC-mean'].max()
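The value returned by `objective` is the highest fold-averaged AUC over the boosting iterations, because `cv` reports one row per iteration. A small sketch with plain lists illustrates the difference from averaging each fold's own best score; the AUC curves below are made-up toy numbers, not results from this study:

```python
# Toy per-iteration AUC curves for three folds (made-up numbers).
fold_auc = [
    [0.80, 0.84, 0.86, 0.85],
    [0.78, 0.83, 0.85, 0.86],
    [0.82, 0.85, 0.84, 0.83],
]

# Fold-averaged AUC at each iteration; this plays the role of 'test-AUC-mean'.
mean_auc = [sum(col) / len(col) for col in zip(*fold_auc)]

# What the objective returns: the best fold-averaged AUC over the iterations.
best_mean = max(mean_auc)
best_iteration = mean_auc.index(best_mean)

# A more optimistic quantity that the objective does NOT use:
# the average of each fold's individual best AUC.
mean_of_best = sum(max(curve) for curve in fold_auc) / len(fold_auc)

print(f"best fold-averaged AUC: {best_mean:.4f} at iteration {best_iteration}")
print(f"mean of per-fold best AUC: {mean_of_best:.4f}")
```

The maximum of the averaged curve can never exceed the average of the per-fold maxima, so the objective used here is the more conservative of the two.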

Even though CatBoost is quite fast and I used early stopping, the study runs 100 trials, each with 5-fold cross-validation, so it takes a long time. To avoid re-running the optimisation, you can copy the parameters of the best trial from the printed report and use them in place of `**trial.params` when calling CatBoost.
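As a rough sketch of that shortcut, the snippet below parses one line of the Optuna report and merges the tuned values with the fixed settings from `objective`, which are not part of the trial parameters. The line is copied from Trial 25 of the log in this notebook, which is only the best trial among those shown in this part of the report; the variable names and the regular expression are illustrative assumptions, not part of the original workflow.

```python
import ast
import re

# One line copied from the Optuna report in this notebook (Trial 25).
log_line = (
    "[I 2024-07-26 11:37:38,577] Trial 25 finished with value: 0.8649879135745673 "
    "and parameters: {'iterations': 1000, 'learning_rate': 0.2156867432301418, "
    "'random_strength': 2, 'bagging_temperature': 8, 'max_bin': 8, "
    "'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 5, "
    "'l2_leaf_reg': 23.486944579260328, 'one_hot_max_size': 500, "
    "'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673."
)

# Extract the trial number, its score and the printed parameter dictionary.
pattern = r"Trial (\d+) finished with value: ([\d.]+) and parameters: (\{.*?\})\."
match = re.search(pattern, log_line)
trial_number = int(match.group(1))
trial_value = float(match.group(2))
tuned_params = ast.literal_eval(match.group(3))  # safe parsing of the dict literal

# Settings that objective() hard-codes and that the report therefore omits.
fixed_params = {
    "od_type": "Iter",
    "od_wait": 100,
    "custom_metric": ["AUC"],
    "loss_function": "Logloss",
}

final_params = {**fixed_params, **tuned_params}
print(trial_number, trial_value, len(final_params))

# Assumed usage (not executed here):
# model = CatBoostClassifier(**final_params)
```

The report lists the tree depth under the key `max_depth`, the name given to `suggest_int`; to my knowledge CatBoost accepts `max_depth` as an alias of `depth`, so the dictionary can be passed as it is.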

In [18]:
sampler = optuna.samplers.TPESampler(seed=68)  # Make the sampler behave in a deterministic way.
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)

[I 2024-07-26 11:15:57,185] A new study created in memory with name: no-name-5ab0c57d-3beb-4395-bb2c-b844e41e6711


Training on fold [0/5]

bestTest = 0.4269336823
bestIteration = 569

Training on fold [1/5]

bestTest = 0.4240454756
bestIteration = 497

Training on fold [2/5]

bestTest = 0.4324963035
bestIteration = 536

Training on fold [3/5]

bestTest = 0.4286695316
bestIteration = 546

Training on fold [4/5]


[I 2024-07-26 11:16:59,200] Trial 0 finished with value: 0.863647892026189 and parameters: {'iterations': 1500, 'learning_rate': 0.029356482739949695, 'random_strength': 8, 'bagging_temperature': 10, 'max_bin': 6, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 1, 'max_depth': 4, 'l2_leaf_reg': 0.001991194871120998, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 0 with value: 0.863647892026189.



bestTest = 0.4289279999
bestIteration = 556

Training on fold [0/5]



bestTest = 0.4282883211
bestIteration = 149

Training on fold [1/5]

bestTest = 0.4184852753
bestIteration = 141

Training on fold [2/5]

bestTest = 0.4316686352
bestIteration = 162

Training on fold [3/5]

bestTest = 0.4278046405
bestIteration = 121

Training on fold [4/5]


[I 2024-07-26 11:17:10,558] Trial 1 finished with value: 0.8641005119276188 and parameters: {'iterations': 200, 'learning_rate': 0.1464067066361795, 'random_strength': 10, 'bagging_temperature': 3, 'max_bin': 10, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 1, 'max_depth': 3, 'l2_leaf_reg': 0.028402775147703313, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4293192688
bestIteration = 150

Training on fold [0/5]



bestTest = 0.440402881
bestIteration = 26

Training on fold [1/5]

bestTest = 0.4278125101
bestIteration = 49

Training on fold [2/5]

bestTest = 0.4381016884
bestIteration = 55

Training on fold [3/5]

bestTest = 0.4288447231
bestIteration = 48

Training on fold [4/5]


[I 2024-07-26 11:17:35,333] Trial 2 finished with value: 0.8583460042272002 and parameters: {'iterations': 200, 'learning_rate': 0.27287829596201946, 'random_strength': 8, 'bagging_temperature': 8, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 1, 'max_depth': 5, 'l2_leaf_reg': 0.027330135035255495, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4317851117
bestIteration = 46

Training on fold [0/5]



bestTest = 0.4280817397
bestIteration = 482

Training on fold [1/5]

bestTest = 0.4236283199
bestIteration = 454

Training on fold [2/5]

bestTest = 0.4307632902
bestIteration = 534

Training on fold [3/5]

bestTest = 0.431009149
bestIteration = 294

Training on fold [4/5]


[I 2024-07-26 11:19:18,461] Trial 3 finished with value: 0.8629184180369525 and parameters: {'iterations': 1200, 'learning_rate': 0.0603209284932487, 'random_strength': 3, 'bagging_temperature': 7, 'max_bin': 4, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 2, 'max_depth': 2, 'l2_leaf_reg': 1.300471404766049e-07, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4326428811
bestIteration = 465

Training on fold [0/5]



bestTest = 0.4316518579
bestIteration = 93

Training on fold [1/5]

bestTest = 0.4264396537
bestIteration = 56

Training on fold [2/5]

bestTest = 0.4358573965
bestIteration = 69

Training on fold [3/5]

bestTest = 0.4316967894
bestIteration = 59

Training on fold [4/5]


[I 2024-07-26 11:19:26,897] Trial 4 finished with value: 0.8595852167876092 and parameters: {'iterations': 300, 'learning_rate': 0.22423670437233847, 'random_strength': 6, 'bagging_temperature': 2, 'max_bin': 30, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 4, 'max_depth': 4, 'l2_leaf_reg': 0.00010293033487726667, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4358042991
bestIteration = 76

Training on fold [0/5]



bestTest = 0.4852991359
bestIteration = 99

Training on fold [1/5]

bestTest = 0.4839102243
bestIteration = 99

Training on fold [2/5]

bestTest = 0.4902656278
bestIteration = 99

Training on fold [3/5]

bestTest = 0.485158478
bestIteration = 99

Training on fold [4/5]


[I 2024-07-26 11:19:33,971] Trial 5 finished with value: 0.8521741475778146 and parameters: {'iterations': 100, 'learning_rate': 0.06628011038512191, 'random_strength': 4, 'bagging_temperature': 4, 'max_bin': 20, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 3, 'max_depth': 2, 'l2_leaf_reg': 13.751833235431702, 'one_hot_max_size': 100, 'auto_class_weights': 'Balanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4932316559
bestIteration = 99

Training on fold [0/5]



bestTest = 0.4307915324
bestIteration = 98

Training on fold [1/5]

bestTest = 0.4226802156
bestIteration = 70

Training on fold [2/5]

bestTest = 0.4334947317
bestIteration = 124

Training on fold [3/5]

bestTest = 0.436377036
bestIteration = 80

Training on fold [4/5]


[I 2024-07-26 11:21:46,428] Trial 6 finished with value: 0.8616877078561375 and parameters: {'iterations': 1200, 'learning_rate': 0.09658215406978513, 'random_strength': 8, 'bagging_temperature': 2, 'max_bin': 30, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 6, 'max_depth': 10, 'l2_leaf_reg': 2.6558249848041764, 'one_hot_max_size': 5, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4293134391
bestIteration = 100

Training on fold [0/5]



bestTest = 0.464009135
bestIteration = 84

Training on fold [1/5]

bestTest = 0.4517555363
bestIteration = 122

Training on fold [2/5]

bestTest = 0.462515409
bestIteration = 113

Training on fold [3/5]

bestTest = 0.4609296958
bestIteration = 80

Training on fold [4/5]


[I 2024-07-26 11:22:16,319] Trial 7 finished with value: 0.862524125827921 and parameters: {'iterations': 500, 'learning_rate': 0.2714096381817127, 'random_strength': 4, 'bagging_temperature': 6, 'max_bin': 8, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 1, 'max_depth': 2, 'l2_leaf_reg': 4.9369231964322795, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4682843267
bestIteration = 83

Training on fold [0/5]



bestTest = 0.4313097636
bestIteration = 65

Training on fold [1/5]

bestTest = 0.4181319598
bestIteration = 109

Training on fold [2/5]

bestTest = 0.4369396156
bestIteration = 93

Training on fold [3/5]

bestTest = 0.4295737898
bestIteration = 73

Training on fold [4/5]


[I 2024-07-26 11:23:10,093] Trial 8 finished with value: 0.8621129363871262 and parameters: {'iterations': 1500, 'learning_rate': 0.2053434310118264, 'random_strength': 8, 'bagging_temperature': 0, 'max_bin': 20, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 10, 'max_depth': 3, 'l2_leaf_reg': 9.501510078266123e-06, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4302185711
bestIteration = 119

Training on fold [0/5]



bestTest = 0.4441066792
bestIteration = 20

Training on fold [1/5]

bestTest = 0.4335742604
bestIteration = 35

Training on fold [2/5]

bestTest = 0.445031423
bestIteration = 45

Training on fold [3/5]

bestTest = 0.4441460909
bestIteration = 33

Training on fold [4/5]


[I 2024-07-26 11:26:18,014] Trial 9 finished with value: 0.8536284008955322 and parameters: {'iterations': 100, 'learning_rate': 0.25900665720714294, 'random_strength': 3, 'bagging_temperature': 0, 'max_bin': 6, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 8, 'max_depth': 7, 'l2_leaf_reg': 1.1694576328936887e-07, 'one_hot_max_size': 12, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4443909157
bestIteration = 39


Training on fold [0/5]

bestTest = 0.4862090765
bestIteration = 50

Training on fold [1/5]

bestTest = 0.4726565482
bestIteration = 58

Training on fold [2/5]

bestTest = 0.4895097855
bestIteration = 34

Training on fold [3/5]

bestTest = 0.4745293001
bestIteration = 62

Training on fold [4/5]


[I 2024-07-26 11:26:34,412] Trial 10 finished with value: 0.8519961198262097 and parameters: {'iterations': 200, 'learning_rate': 0.15034104921866256, 'random_strength': 10, 'bagging_temperature': 4, 'max_bin': 10, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 6, 'max_depth': 7, 'l2_leaf_reg': 0.06866979073979582, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4746860704
bestIteration = 70

Training on fold [0/5]



bestTest = 0.4370721231
bestIteration = 1127

Training on fold [1/5]

bestTest = 0.4243519149
bestIteration = 1499

Training on fold [2/5]

bestTest = 0.4334998546
bestIteration = 1499

Training on fold [3/5]

bestTest = 0.4291885789
bestIteration = 1499

Training on fold [4/5]


[I 2024-07-26 11:31:03,489] Trial 11 finished with value: 0.8626369796529552 and parameters: {'iterations': 1500, 'learning_rate': 0.008172454094813684, 'random_strength': 10, 'bagging_temperature': 9, 'max_bin': 6, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 4, 'max_depth': 5, 'l2_leaf_reg': 0.004159650728547687, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4325173645
bestIteration = 1499


Training on fold [0/5]

bestTest = 0.4309647825
bestIteration = 88

Training on fold [1/5]

bestTest = 0.4262937528
bestIteration = 111

Training on fold [2/5]

bestTest = 0.4398138717
bestIteration = 84

Training on fold [3/5]

bestTest = 0.4317121208
bestIteration = 85

Training on fold [4/5]


[I 2024-07-26 11:31:49,700] Trial 12 finished with value: 0.8602123184914771 and parameters: {'iterations': 1000, 'learning_rate': 0.15199867028113004, 'random_strength': 7, 'bagging_temperature': 10, 'max_bin': 5, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 1, 'max_depth': 4, 'l2_leaf_reg': 0.000130285490979316, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4300054189
bestIteration = 106


Training on fold [0/5]

bestTest = 0.4375473825
bestIteration = 199

Training on fold [1/5]

bestTest = 0.4363755734
bestIteration = 199

Training on fold [2/5]

bestTest = 0.4450582865
bestIteration = 199

Training on fold [3/5]

bestTest = 0.4359606215
bestIteration = 199

Training on fold [4/5]


[I 2024-07-26 11:33:00,795] Trial 13 finished with value: 0.8589146047709896 and parameters: {'iterations': 200, 'learning_rate': 0.01531209232658981, 'random_strength': 1, 'bagging_temperature': 5, 'max_bin': 6, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 3, 'max_depth': 6, 'l2_leaf_reg': 0.20127448967652856, 'one_hot_max_size': 1024, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4441568326
bestIteration = 199


Training on fold [0/5]

bestTest = 0.4737573644
bestIteration = 88

Training on fold [1/5]

bestTest = 0.4616253621
bestIteration = 94

Training on fold [2/5]

bestTest = 0.4783586855
bestIteration = 82

Training on fold [3/5]

bestTest = 0.4751642426
bestIteration = 87

Training on fold [4/5]


[I 2024-07-26 11:36:12,507] Trial 14 finished with value: 0.8548841474157653 and parameters: {'iterations': 1500, 'learning_rate': 0.14690216116501453, 'random_strength': 10, 'bagging_temperature': 2, 'max_bin': 10, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 3, 'max_depth': 9, 'l2_leaf_reg': 0.0012700604558864655, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4809768942
bestIteration = 90

Training on fold [0/5]



bestTest = 0.4317698081
bestIteration = 126

Training on fold [1/5]

bestTest = 0.4202879243
bestIteration = 155

Training on fold [2/5]

bestTest = 0.4326630517
bestIteration = 192

Training on fold [3/5]

bestTest = 0.4283857663
bestIteration = 118

Training on fold [4/5]


[I 2024-07-26 11:36:38,629] Trial 15 finished with value: 0.8619844345622972 and parameters: {'iterations': 500, 'learning_rate': 0.10902072034609424, 'random_strength': 9, 'bagging_temperature': 10, 'max_bin': 4, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 5, 'max_depth': 4, 'l2_leaf_reg': 9.46314667205651e-06, 'one_hot_max_size': 12, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8641005119276188.



bestTest = 0.4310683534
bestIteration = 144

Training on fold [0/5]



bestTest = 0.4248927152
bestIteration = 142

Training on fold [1/5]

bestTest = 0.4210729016
bestIteration = 151

Training on fold [2/5]

bestTest = 0.4293761693
bestIteration = 142

Training on fold [3/5]

bestTest = 0.4296664949
bestIteration = 87

Training on fold [4/5]


[I 2024-07-26 11:36:45,033] Trial 16 finished with value: 0.8642133032213637 and parameters: {'iterations': 300, 'learning_rate': 0.19373131951584033, 'random_strength': 6, 'bagging_temperature': 4, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 3, 'l2_leaf_reg': 0.1793344946668409, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 16 with value: 0.8642133032213637.



bestTest = 0.4280529349
bestIteration = 127

Training on fold [0/5]



bestTest = 0.4262312805
bestIteration = 127

Training on fold [1/5]

bestTest = 0.4207886186
bestIteration = 107

Training on fold [2/5]

bestTest = 0.4296821913
bestIteration = 138

Training on fold [3/5]

bestTest = 0.4268313144
bestIteration = 115

Training on fold [4/5]


[I 2024-07-26 11:36:51,784] Trial 17 finished with value: 0.864632449767259 and parameters: {'iterations': 300, 'learning_rate': 0.18816956560325993, 'random_strength': 6, 'bagging_temperature': 3, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 3, 'l2_leaf_reg': 0.8214169197680212, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 17 with value: 0.864632449767259.



bestTest = 0.4270145064
bestIteration = 113

Training on fold [0/5]



bestTest = 0.462140698
bestIteration = 111

Training on fold [1/5]

bestTest = 0.449476374
bestIteration = 148

Training on fold [2/5]

bestTest = 0.4606841436
bestIteration = 214

Training on fold [3/5]

bestTest = 0.4594622476
bestIteration = 159

Training on fold [4/5]


[I 2024-07-26 11:37:00,217] Trial 18 finished with value: 0.8642606915210294 and parameters: {'iterations': 300, 'learning_rate': 0.1963385646960396, 'random_strength': 6, 'bagging_temperature': 5, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 3, 'l2_leaf_reg': 88.75986414349809, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 17 with value: 0.864632449767259.



bestTest = 0.4642626878
bestIteration = 102

Training on fold [0/5]



bestTest = 0.460548384
bestIteration = 59

Training on fold [1/5]

bestTest = 0.4497520676
bestIteration = 81

Training on fold [2/5]

bestTest = 0.4598481924
bestIteration = 87

Training on fold [3/5]

bestTest = 0.4576226659
bestIteration = 71

Training on fold [4/5]


[I 2024-07-26 11:37:06,115] Trial 19 finished with value: 0.864939107462057 and parameters: {'iterations': 300, 'learning_rate': 0.23569005361166914, 'random_strength': 5, 'bagging_temperature': 7, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 6, 'l2_leaf_reg': 99.11695455168547, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 19 with value: 0.864939107462057.



bestTest = 0.4623382534
bestIteration = 69

Training on fold [0/5]



bestTest = 0.4618127859
bestIteration = 57

Training on fold [1/5]

bestTest = 0.4515811211
bestIteration = 68

Training on fold [2/5]

bestTest = 0.4629880341
bestIteration = 60

Training on fold [3/5]

bestTest = 0.4581180791
bestIteration = 68

Training on fold [4/5]


[I 2024-07-26 11:37:12,102] Trial 20 finished with value: 0.8645877595772367 and parameters: {'iterations': 300, 'learning_rate': 0.2348670283036838, 'random_strength': 5, 'bagging_temperature': 7, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 7, 'l2_leaf_reg': 90.56559174112068, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 19 with value: 0.864939107462057.



bestTest = 0.4615723544
bestIteration = 82

Training on fold [0/5]



bestTest = 0.4641950141
bestIteration = 45

Training on fold [1/5]

bestTest = 0.4504849222
bestIteration = 61

Training on fold [2/5]

bestTest = 0.4651648897
bestIteration = 68

Training on fold [3/5]

bestTest = 0.4615801921
bestIteration = 41

Training on fold [4/5]


[I 2024-07-26 11:37:17,658] Trial 21 finished with value: 0.8624742662806805 and parameters: {'iterations': 300, 'learning_rate': 0.2986378408800963, 'random_strength': 5, 'bagging_temperature': 7, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 8, 'l2_leaf_reg': 98.70852826178593, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 19 with value: 0.864939107462057.



bestTest = 0.4693546903
bestIteration = 52

Training on fold [0/5]



bestTest = 0.4716102351
bestIteration = 49

Training on fold [1/5]

bestTest = 0.4573710779
bestIteration = 56

Training on fold [2/5]

bestTest = 0.4632409291
bestIteration = 62

Training on fold [3/5]

bestTest = 0.4668002345
bestIteration = 51

Training on fold [4/5]


[I 2024-07-26 11:37:22,512] Trial 22 finished with value: 0.8599138388033742 and parameters: {'iterations': 300, 'learning_rate': 0.23478879966857782, 'random_strength': 5, 'bagging_temperature': 7, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 7, 'l2_leaf_reg': 1.4179471935692007, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 19 with value: 0.864939107462057.



bestTest = 0.4703752638
bestIteration = 41

Training on fold [0/5]



bestTest = 0.4663417863
bestIteration = 52

Training on fold [1/5]

bestTest = 0.4435003446
bestIteration = 114

Training on fold [2/5]

bestTest = 0.4629707012
bestIteration = 84

Training on fold [3/5]

bestTest = 0.4602919725
bestIteration = 69

Training on fold [4/5]


[I 2024-07-26 11:37:28,688] Trial 23 finished with value: 0.864040909948705 and parameters: {'iterations': 300, 'learning_rate': 0.1766102688172111, 'random_strength': 4, 'bagging_temperature': 6, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 6, 'l2_leaf_reg': 16.414831944848384, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 19 with value: 0.864939107462057.



bestTest = 0.4638399575
bestIteration = 84

Training on fold [0/5]



bestTest = 0.4739639645
bestIteration = 37

Training on fold [1/5]

bestTest = 0.4644124745
bestIteration = 53

Training on fold [2/5]

bestTest = 0.4795641633
bestIteration = 28

Training on fold [3/5]

bestTest = 0.4661612077
bestIteration = 38

Training on fold [4/5]


[I 2024-07-26 11:37:33,187] Trial 24 finished with value: 0.8564027877520338 and parameters: {'iterations': 300, 'learning_rate': 0.23759349195814128, 'random_strength': 7, 'bagging_temperature': 8, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 8, 'l2_leaf_reg': 1.0635393991514817, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 19 with value: 0.864939107462057.



bestTest = 0.4766251969
bestIteration = 44

Training on fold [0/5]



bestTest = 0.4612028579
bestIteration = 59

Training on fold [1/5]

bestTest = 0.4473615907
bestIteration = 88

Training on fold [2/5]

bestTest = 0.4576892785
bestIteration = 79

Training on fold [3/5]

bestTest = 0.459851678
bestIteration = 55

Training on fold [4/5]


[I 2024-07-26 11:37:38,577] Trial 25 finished with value: 0.8649879135745673 and parameters: {'iterations': 1000, 'learning_rate': 0.2156867432301418, 'random_strength': 2, 'bagging_temperature': 8, 'max_bin': 8, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 5, 'l2_leaf_reg': 23.486944579260328, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673.



bestTest = 0.4612628708
bestIteration = 64

Training on fold [0/5]



bestTest = 0.4634112468
bestIteration = 33

Training on fold [1/5]

bestTest = 0.454246092
bestIteration = 56

Training on fold [2/5]

bestTest = 0.4642904821
bestIteration = 62

Training on fold [3/5]

bestTest = 0.4610832954
bestIteration = 49

Training on fold [4/5]


[I 2024-07-26 11:37:45,499] Trial 26 finished with value: 0.8634735774251003 and parameters: {'iterations': 1000, 'learning_rate': 0.17770173746722442, 'random_strength': 1, 'bagging_temperature': 8, 'max_bin': 8, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 7, 'max_depth': 5, 'l2_leaf_reg': 12.480240405873914, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673.



bestTest = 0.4606995028
bestIteration = 73

Training on fold [0/5]



bestTest = 0.4648200352
bestIteration = 50

Training on fold [1/5]

bestTest = 0.453566199
bestIteration = 78

Training on fold [2/5]

bestTest = 0.4697375146
bestIteration = 55

Training on fold [3/5]

bestTest = 0.4679156741
bestIteration = 38

Training on fold [4/5]


[I 2024-07-26 11:38:02,788] Trial 27 finished with value: 0.8610437147655091 and parameters: {'iterations': 1000, 'learning_rate': 0.20922868518021023, 'random_strength': 2, 'bagging_temperature': 9, 'max_bin': 8, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 6, 'l2_leaf_reg': 0.5606299501794255, 'one_hot_max_size': 12, 'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673.



bestTest = 0.4614888594
bestIteration = 53

Training on fold [0/5]



bestTest = 0.4604547564
bestIteration = 56

Training on fold [1/5]

bestTest = 0.4536677789
bestIteration = 66

Training on fold [2/5]

bestTest = 0.4606998378
bestIteration = 78

Training on fold [3/5]

bestTest = 0.4592846007
bestIteration = 41

Training on fold [4/5]


[I 2024-07-26 11:38:30,518] Trial 28 finished with value: 0.864142043525715 and parameters: {'iterations': 1000, 'learning_rate': 0.29995937910339976, 'random_strength': 2, 'bagging_temperature': 6, 'max_bin': 8, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 5, 'l2_leaf_reg': 19.472391086493886, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673.



bestTest = 0.458827552
bestIteration = 56

Training on fold [0/5]



bestTest = 0.4679639922
bestIteration = 73

Training on fold [1/5]

bestTest = 0.4520689546
bestIteration = 105

Training on fold [2/5]

bestTest = 0.4720845651
bestIteration = 84

Training on fold [3/5]

bestTest = 0.4674097374
bestIteration = 81

Training on fold [4/5]


[I 2024-07-26 11:39:02,045] Trial 29 finished with value: 0.8588974891241679 and parameters: {'iterations': 1000, 'learning_rate': 0.172797708570632, 'random_strength': 7, 'bagging_temperature': 1, 'max_bin': 8, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 5, 'l2_leaf_reg': 0.009172197144442293, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673.



bestTest = 0.4769180044
bestIteration = 46

Training on fold [0/5]



bestTest = 0.4656053966
bestIteration = 65

Training on fold [1/5]

bestTest = 0.4520665149
bestIteration = 78

Training on fold [2/5]

bestTest = 0.4619349169
bestIteration = 66

Training on fold [3/5]

bestTest = 0.4602943464
bestIteration = 55

Training on fold [4/5]


[I 2024-07-26 11:39:07,663] Trial 30 finished with value: 0.8629200945573526 and parameters: {'iterations': 100, 'learning_rate': 0.2518180465506497, 'random_strength': 3, 'bagging_temperature': 9, 'max_bin': 4, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 4, 'l2_leaf_reg': 0.39119673843807323, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673.



bestTest = 0.4657292284
bestIteration = 63

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4622279283
bestIteration = 61

Training on fold [1/5]

bestTest = 0.4532731004
bestIteration = 62

Training on fold [2/5]

bestTest = 0.4608995713
bestIteration = 65

Training on fold [3/5]

bestTest = 0.4618123527
bestIteration = 59

Training on fold [4/5]


[I 2024-07-26 11:39:16,172] Trial 31 finished with value: 0.8642958788055776 and parameters: {'iterations': 300, 'learning_rate': 0.22498426046976158, 'random_strength': 5, 'bagging_temperature': 7, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 8, 'l2_leaf_reg': 81.61915563595787, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673.



bestTest = 0.4616866917
bestIteration = 61

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4600599474
bestIteration = 57

Training on fold [1/5]

bestTest = 0.4534847617
bestIteration = 61

Training on fold [2/5]

bestTest = 0.4660143912
bestIteration = 51

Training on fold [3/5]

bestTest = 0.4592403173
bestIteration = 58

Training on fold [4/5]


[I 2024-07-26 11:39:23,804] Trial 32 finished with value: 0.8634946138976914 and parameters: {'iterations': 300, 'learning_rate': 0.21696965658871908, 'random_strength': 5, 'bagging_temperature': 6, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 6, 'l2_leaf_reg': 4.633912596380363, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673.



bestTest = 0.4628757605
bestIteration = 64

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.459485419
bestIteration = 52

Training on fold [1/5]

bestTest = 0.4501531924
bestIteration = 54

Training on fold [2/5]

bestTest = 0.4598545042
bestIteration = 81

Training on fold [3/5]

bestTest = 0.4575600666
bestIteration = 56

Training on fold [4/5]


[I 2024-07-26 11:39:35,341] Trial 33 finished with value: 0.8644041592239994 and parameters: {'iterations': 300, 'learning_rate': 0.25275169328934294, 'random_strength': 4, 'bagging_temperature': 3, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 6, 'l2_leaf_reg': 23.62548823337259, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673.



bestTest = 0.4658984262
bestIteration = 49

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4703846261
bestIteration = 40

Training on fold [1/5]

bestTest = 0.453794589
bestIteration = 49

Training on fold [2/5]

bestTest = 0.4621746621
bestIteration = 54

Training on fold [3/5]

bestTest = 0.4674793182
bestIteration = 47

Training on fold [4/5]


[I 2024-07-26 11:39:40,844] Trial 34 finished with value: 0.8611125702384641 and parameters: {'iterations': 1200, 'learning_rate': 0.2807248158212123, 'random_strength': 6, 'bagging_temperature': 8, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 7, 'l2_leaf_reg': 4.080506119045869, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 25 with value: 0.8649879135745673.



bestTest = 0.4681509742
bestIteration = 42

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4629127246
bestIteration = 92

Training on fold [1/5]

bestTest = 0.4512582995
bestIteration = 81

Training on fold [2/5]

bestTest = 0.45751723
bestIteration = 100

Training on fold [3/5]

bestTest = 0.4610131355
bestIteration = 88

Training on fold [4/5]


[I 2024-07-26 11:40:21,880] Trial 35 finished with value: 0.8654490694569118 and parameters: {'iterations': 300, 'learning_rate': 0.12276716157236615, 'random_strength': 2, 'bagging_temperature': 7, 'max_bin': 30, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 9, 'l2_leaf_reg': 30.88038992767587, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 35 with value: 0.8654490694569118.



bestTest = 0.4615935627
bestIteration = 73

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4535011776
bestIteration = 21

Training on fold [1/5]

bestTest = 0.4425727262
bestIteration = 39

Training on fold [2/5]

bestTest = 0.4522476326
bestIteration = 23

Training on fold [3/5]

bestTest = 0.4553986097
bestIteration = 22

Training on fold [4/5]


[I 2024-07-26 11:40:42,199] Trial 36 finished with value: 0.8454408957897013 and parameters: {'iterations': 500, 'learning_rate': 0.122271751760048, 'random_strength': 2, 'bagging_temperature': 3, 'max_bin': 30, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 10, 'l2_leaf_reg': 0.03402856058335481, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 35 with value: 0.8654490694569118.



bestTest = 0.4563538393
bestIteration = 22

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4586422325
bestIteration = 177

Training on fold [1/5]

bestTest = 0.4483374684
bestIteration = 181

Training on fold [2/5]

bestTest = 0.4565872156
bestIteration = 219

Training on fold [3/5]

bestTest = 0.4576414174
bestIteration = 166

Training on fold [4/5]


[I 2024-07-26 11:45:30,003] Trial 37 finished with value: 0.8669401824070608 and parameters: {'iterations': 300, 'learning_rate': 0.0725964444647288, 'random_strength': 3, 'bagging_temperature': 5, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 6, 'max_depth': 9, 'l2_leaf_reg': 25.460998532136795, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 37 with value: 0.8669401824070608.



bestTest = 0.4591021283
bestIteration = 179

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4587024856
bestIteration = 230

Training on fold [1/5]

bestTest = 0.4515586051
bestIteration = 219

Training on fold [2/5]

bestTest = 0.4532886141
bestIteration = 270

Training on fold [3/5]

bestTest = 0.458797651
bestIteration = 213

Training on fold [4/5]


[I 2024-07-26 11:51:57,350] Trial 38 finished with value: 0.8667361480313968 and parameters: {'iterations': 1200, 'learning_rate': 0.05889848477633064, 'random_strength': 3, 'bagging_temperature': 5, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 6, 'max_depth': 9, 'l2_leaf_reg': 30.49451253234566, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 37 with value: 0.8669401824070608.



bestTest = 0.4590123658
bestIteration = 248

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4726479385
bestIteration = 159

Training on fold [1/5]

bestTest = 0.4563275303
bestIteration = 218

Training on fold [2/5]

bestTest = 0.4777974388
bestIteration = 143

Training on fold [3/5]

bestTest = 0.4722173506
bestIteration = 124

Training on fold [4/5]


[I 2024-07-26 11:56:06,937] Trial 39 finished with value: 0.8586641741550849 and parameters: {'iterations': 1200, 'learning_rate': 0.04924386870185847, 'random_strength': 3, 'bagging_temperature': 5, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 9, 'l2_leaf_reg': 8.481028134568838e-07, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 37 with value: 0.8669401824070608.



bestTest = 0.4713881921
bestIteration = 160

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4660807847
bestIteration = 123

Training on fold [1/5]

bestTest = 0.4625918519
bestIteration = 108

Training on fold [2/5]

bestTest = 0.4743059968
bestIteration = 111

Training on fold [3/5]

bestTest = 0.4720258658
bestIteration = 75

Training on fold [4/5]


[I 2024-07-26 11:58:43,486] Trial 40 finished with value: 0.859394697009894 and parameters: {'iterations': 1200, 'learning_rate': 0.07839593478533496, 'random_strength': 2, 'bagging_temperature': 5, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 6, 'max_depth': 9, 'l2_leaf_reg': 1.4688305330211798e-08, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 37 with value: 0.8669401824070608.



bestTest = 0.4670377208
bestIteration = 99

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4584125568
bestIteration = 270

Training on fold [1/5]

bestTest = 0.4493229151
bestIteration = 282

Training on fold [2/5]

bestTest = 0.4550509534
bestIteration = 365

Training on fold [3/5]

bestTest = 0.4590799603
bestIteration = 254

Training on fold [4/5]


[I 2024-07-26 12:06:40,525] Trial 41 finished with value: 0.8670406409711484 and parameters: {'iterations': 1200, 'learning_rate': 0.051579119175126475, 'random_strength': 3, 'bagging_temperature': 6, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 6, 'max_depth': 10, 'l2_leaf_reg': 26.182447610783683, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 41 with value: 0.8670406409711484.



bestTest = 0.4573775348
bestIteration = 279

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4597917059
bestIteration = 305

Training on fold [1/5]

bestTest = 0.4482251065
bestIteration = 341

Training on fold [2/5]

bestTest = 0.4551552557
bestIteration = 381

Training on fold [3/5]

bestTest = 0.457938387
bestIteration = 339

Training on fold [4/5]


[I 2024-07-26 12:16:56,634] Trial 42 finished with value: 0.8670574726329783 and parameters: {'iterations': 1200, 'learning_rate': 0.0401132746444437, 'random_strength': 3, 'bagging_temperature': 6, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 10, 'l2_leaf_reg': 8.316797707777736, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 42 with value: 0.8670574726329783.



bestTest = 0.4593226066
bestIteration = 323

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4568253832
bestIteration = 354

Training on fold [1/5]

bestTest = 0.4483537655
bestIteration = 415

Training on fold [2/5]

bestTest = 0.4552692592
bestIteration = 386

Training on fold [3/5]

bestTest = 0.459520419
bestIteration = 343

Training on fold [4/5]


[I 2024-07-26 12:26:38,378] Trial 43 finished with value: 0.8670543378935059 and parameters: {'iterations': 1200, 'learning_rate': 0.03612474452108618, 'random_strength': 3, 'bagging_temperature': 6, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 10, 'l2_leaf_reg': 6.973969672008123, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 42 with value: 0.8670574726329783.



bestTest = 0.4608248451
bestIteration = 342

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.458449164
bestIteration = 294

Training on fold [1/5]

bestTest = 0.4490652017
bestIteration = 321

Training on fold [2/5]

bestTest = 0.4571998577
bestIteration = 341

Training on fold [3/5]

bestTest = 0.4614836771
bestIteration = 228

Training on fold [4/5]


[I 2024-07-26 12:34:06,463] Trial 44 finished with value: 0.8662151202595304 and parameters: {'iterations': 1200, 'learning_rate': 0.04304130497958788, 'random_strength': 3, 'bagging_temperature': 4, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 10, 'l2_leaf_reg': 6.503356107329111, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 42 with value: 0.8670574726329783.



bestTest = 0.461435337
bestIteration = 294

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4594029295
bestIteration = 486

Training on fold [1/5]

bestTest = 0.4482348909
bestIteration = 510

Training on fold [2/5]

bestTest = 0.4574293385
bestIteration = 516

Training on fold [3/5]

bestTest = 0.4659452654
bestIteration = 319

Training on fold [4/5]


[I 2024-07-26 12:41:48,480] Trial 45 finished with value: 0.8661646725527963 and parameters: {'iterations': 1200, 'learning_rate': 0.02677733766798697, 'random_strength': 4, 'bagging_temperature': 6, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 4, 'max_depth': 10, 'l2_leaf_reg': 3.73480484288306, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 42 with value: 0.8670574726329783.



bestTest = 0.4590345715
bestIteration = 486

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4630574413
bestIteration = 182

Training on fold [1/5]

bestTest = 0.4511627623
bestIteration = 177

Training on fold [2/5]

bestTest = 0.4581732709
bestIteration = 191

Training on fold [3/5]

bestTest = 0.4632594099
bestIteration = 153

Training on fold [4/5]


[I 2024-07-26 12:44:36,896] Trial 46 finished with value: 0.8648496822956698 and parameters: {'iterations': 1200, 'learning_rate': 0.07357850005881951, 'random_strength': 3, 'bagging_temperature': 5, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 6, 'max_depth': 10, 'l2_leaf_reg': 2.2538536332408636, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 42 with value: 0.8670574726329783.



bestTest = 0.4616116113
bestIteration = 192

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4660103278
bestIteration = 54

Training on fold [1/5]

bestTest = 0.4567015524
bestIteration = 94

Training on fold [2/5]

bestTest = 0.4706253926
bestIteration = 92

Training on fold [3/5]

bestTest = 0.4695637146
bestIteration = 66

Training on fold [4/5]


[I 2024-07-26 12:45:48,748] Trial 47 finished with value: 0.8606778019285182 and parameters: {'iterations': 1200, 'learning_rate': 0.08564940608004865, 'random_strength': 1, 'bagging_temperature': 4, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 9, 'l2_leaf_reg': 0.00029726225174476226, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 42 with value: 0.8670574726329783.



bestTest = 0.4730740481
bestIteration = 62

Training on fold [0/5]


  "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100), #L2 regularization coefficient for leaf weights



bestTest = 0.4599902051
bestIteration = 209

Training on fold [1/5]

bestTest = 0.4520625107
bestIteration = 247

Training on fold [2/5]

bestTest = 0.4588034204
bestIteration = 258

Training on fold [3/5]


In [None]:
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial
print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}={},".format(key, value))


### Check the model

In [None]:
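# trial.params holds only the hyperparameters sampled by Optuna; the fixed
# settings of the objective (od_type, od_wait, custom_metric, loss_function)
# are not part of it and therefore fall back to CatBoost's defaults here.
final_model = CatBoostClassifier(verbose=False,
                                 cat_features=categorical_features_indices,
                                 **trial.params)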

In [None]:
final_model.fit(X_train, y_train.h1n1_vaccine)

In [None]:
predictions_h1 = final_model.predict_proba(X_test)

In [None]:
predictions_h1 = predictions_h1[:,1].reshape(-1,1)

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

In [None]:
def plot_roc(y_true, y_score, label_name, ax):
    """Plot the ROC curve of one target on `ax` and show its AUC in the title."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    ax.plot(fpr, tpr)
    # Diagonal reference line: the ROC curve of a random classifier
    ax.plot([0, 1], [0, 1], color='grey', linestyle='--')
    ax.set_ylabel('TPR')
    ax.set_xlabel('FPR')
    ax.set_title(
        f"{label_name}: AUC = {roc_auc_score(y_true, y_score):.4f}"
    )
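
To make explicit what `roc_curve` and `roc_auc_score` compute, here is a minimal standard-library sketch. It is not scikit-learn's implementation; the helper names `roc_points` and `trapezoid_auc` and the toy data are made up for illustration.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) pairs obtained by sweeping the decision threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    # Sort by descending score: lowering the threshold admits samples in this order.
    ranked = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        is_last_of_tie = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if is_last_of_tie:
            points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) pairs sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data: two negatives and two positives
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_curve = roc_points(toy_labels, toy_scores)
print(toy_curve)                 # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(trapezoid_auc(toy_curve))  # 0.75
```

The curve always starts at (0, 0) and ends at (1, 1); the closer it bends towards the top-left corner, the larger the area under it.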

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
plot_roc(
    y_test['h1n1_vaccine'], 
    predictions_h1, 
    'h1n1_vaccine',
    ax=ax
)

In [None]:
roc_auc_score(y_test.h1n1_vaccine, predictions_h1)

## Second part

In [None]:
train_dataset_se = Pool(data=X_train,
                     label=y_train.seasonal_vaccine,
                     cat_features = categorical_features_indices)

I defined a second objective function so that the tuning for the seasonal vaccine model can be changed independently of the H1N1 one.

In [None]:
def objective2(trial):
    # Search space for the seasonal vaccine model
    param = {
        'iterations': trial.suggest_categorical('iterations', [100, 200, 300, 500, 1000, 1200, 1500]),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.3),
        'random_strength': trial.suggest_int('random_strength', 1, 10),
        'bagging_temperature': trial.suggest_int('bagging_temperature', 0, 10),
        'max_bin': trial.suggest_categorical('max_bin', [4, 5, 6, 8, 10, 20, 30]),
        'grow_policy': trial.suggest_categorical('grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 10),
        'od_type': 'Iter',
        'od_wait': 100,
        # Sampled under the name "max_depth", which is the key stored in trial.params
        'depth': trial.suggest_int('max_depth', 2, 10),
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-8, 100, log=True),
        'one_hot_max_size': trial.suggest_categorical('one_hot_max_size', [5, 10, 12, 100, 500, 1024]),
        'custom_metric': ['AUC'],
        'loss_function': 'Logloss',
        'auto_class_weights': trial.suggest_categorical('auto_class_weights', ['Balanced', 'SqrtBalanced']),
    }

    # 5-fold cross-validation on the seasonal vaccine Pool
    scores = cv(train_dataset_se,
                param,
                fold_count=5,
                early_stopping_rounds=10,
                plot=False, verbose=False)

    # Best mean test AUC over the boosting iterations
    return scores['test-AUC-mean'].max()
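
The `l2_leaf_reg` range goes from `1e-8` to `100`, ten orders of magnitude, which is why it is sampled on a logarithmic scale. A minimal sketch of what log-uniform sampling means, using only the standard library; `sample_log_uniform` is a hypothetical helper written for this illustration, not Optuna's implementation.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


demo_rng = random.Random(68)  # arbitrary seed, only for reproducibility
demo_draws = [sample_log_uniform(1e-8, 100, demo_rng) for _ in range(10_000)]

# Eight of the ten decades lie below 1.0, so roughly 80% of the draws do too.
share_below_one = sum(d < 1.0 for d in demo_draws) / len(demo_draws)
print(f"share of draws below 1.0: {share_below_one:.2f}")
```

With a plain uniform distribution on the same interval, only about 1% of the draws would fall below 1.0, so the small regularisation values would hardly ever be explored.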

In [None]:
sampler = optuna.samplers.TPESampler(seed=68)  # Make the sampler behave in a deterministic way.
study2 = optuna.create_study(direction="maximize", sampler=sampler)
study2.optimize(objective2, n_trials=100)
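
Each trial launched by `study2.optimize` runs a cross-validation with `early_stopping_rounds=10`, so training can stop well before the `iterations` budget is used up. The sketch below shows a generic patience-based stopping rule for a metric that is maximised. It is an illustration under that assumption, not CatBoost's actual overfitting detector, and `best_iteration_with_patience` and the toy values are hypothetical.

```python
def best_iteration_with_patience(metric_per_iteration, patience):
    """Return (best_index, best_value, stopped_at) for a metric to maximise.

    Training stops at the first iteration that lies `patience` steps after
    the best one seen so far without having improved on it.
    """
    best_index, best_value = 0, metric_per_iteration[0]
    for i, value in enumerate(metric_per_iteration):
        if value > best_value:
            best_index, best_value = i, value
        elif i - best_index >= patience:
            return best_index, best_value, i
    # No early stop: the whole budget was used.
    return best_index, best_value, len(metric_per_iteration) - 1


# Hypothetical validation AUC per boosting iteration
toy_auc = [0.80, 0.84, 0.86, 0.855, 0.859, 0.858, 0.85, 0.90]
stop_result = best_iteration_with_patience(toy_auc, patience=3)
print(stop_result)  # (2, 0.86, 5)
```

Note the trade-off: the last value in the toy series is the highest, but it is never reached because training stops at index 5. A small patience saves time at the risk of stopping too early.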

In [None]:
print("Number of finished trials: {}".format(len(study2.trials)))
print("Best trial:")
trial2 = study2.best_trial
print("  Value: {}".format(trial2.value))
print("  Params: ")
for key, value in trial2.params.items():
    print("    {}={},".format(key, value))


In [None]:
# Output of the cell above, kept for reference:
# Number of finished trials: 100
# Best trial:
#   Value: 0.8679538793376752
#   Params:
#     iterations=500,
#     learning_rate=0.05745075659543725,
#     random_strength=4,
#     bagging_temperature=8,
#     max_bin=5,
#     grow_policy=Lossguide,
#     min_data_in_leaf=7,
#     max_depth=6,
#     l2_leaf_reg=11.323094517862078,
#     one_hot_max_size=10,
#     auto_class_weights=Balanced,

In [None]:
# As for the H1N1 model, only the hyperparameters sampled by Optuna are passed on
final_model_se = CatBoostClassifier(verbose=False,
                                    cat_features=categorical_features_indices,
                                    **trial2.params)

In [None]:
final_model_se.fit(X_train, y_train.seasonal_vaccine)

In [None]:
predictions_se = final_model_se.predict_proba(X_test)

In [None]:
predictions_se = predictions_se[:,1].reshape(-1,1)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
plot_roc(
    y_test['seasonal_vaccine'], 
    predictions_se, 
    'seasonal_vaccine',
    ax=ax
)

In [None]:
roc_auc_score(y_test.seasonal_vaccine, predictions_se)

## Combined score

In [None]:
# Combined score: ROC AUC averaged over the two targets (H1N1 and seasonal)
roc_auc_score(y_test, np.hstack((predictions_h1, predictions_se)))
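
When `roc_auc_score` receives two-column label and score arrays like these, its default `average='macro'` returns the unweighted mean of the per-column AUCs. A minimal standard-library sketch of that averaging; `pairwise_auc`, `macro_auc` and the toy rows are hypothetical names and data used only for this illustration.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(y_true_rows, y_score_rows):
    """Unweighted mean of the per-column AUCs."""
    n_targets = len(y_true_rows[0])
    per_target = [
        pairwise_auc([row[j] for row in y_true_rows],
                     [row[j] for row in y_score_rows])
        for j in range(n_targets)
    ]
    return sum(per_target) / n_targets


# Each row is one respondent: (h1n1_vaccine, seasonal_vaccine)
toy_truth = [[0, 1], [0, 0], [1, 1], [1, 0]]
toy_probs = [[0.1, 0.9], [0.4, 0.2], [0.35, 0.8], [0.8, 0.7]]
toy_macro = macro_auc(toy_truth, toy_probs)
print(toy_macro)  # 0.875: the mean of 0.75 (first column) and 1.0 (second column)
```

Because the average is unweighted, both targets contribute equally to the combined score regardless of how imbalanced each one is.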

## Retrain on full dataset

#### Seasonal

In [None]:
final_model_se.fit(train, labels.seasonal_vaccine)

In [None]:
final_se = final_model_se.predict_proba(test)

In [None]:
final_se = final_se[:,1].reshape(-1,1)

#### H1N1

In [None]:
final_model.fit(train, labels.h1n1_vaccine)

In [None]:
final_h1 = final_model.predict_proba(test)

In [None]:
final_h1 = final_h1[:,1].reshape(-1,1)

## Make submission

In [None]:
submission_df = pd.read_csv("./submission_format.csv", 
                            index_col="respondent_id")

In [None]:
# Make sure we have the rows in the same order
np.testing.assert_array_equal(test.index.values, 
                              submission_df.index.values)

# Save predictions to submission data frame
submission_df["h1n1_vaccine"] = final_h1
submission_df["seasonal_vaccine"] = final_se

submission_df.head()

In [None]:
# Prefix the file name with a timestamp so earlier submissions are not overwritten
date = pd.Timestamp.now().strftime('%Y-%m-%d_%H-%M_')
submission_df.to_csv(f'predictions/{date}submission_catboost_optunacvi.csv', index=True)
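
Writing the file as above only works if the `predictions` directory already exists. A minimal sketch, using only the standard library, of a helper that creates the directory and builds the timestamped file name; `timestamped_path` and its arguments are hypothetical and not part of the original notebook.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def timestamped_path(directory, stem, now=None):
    """Create `directory` if needed and return <directory>/<timestamp>_<stem>.csv."""
    now = now or datetime.now()
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"{now:%Y-%m-%d_%H-%M}_{stem}.csv"


# Demo in a temporary directory so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = timestamped_path(Path(tmp_dir) / "predictions", "submission",
                                 now=datetime(2021, 2, 16, 9, 5))
    demo_dir_exists = demo_path.parent.is_dir()
    print(demo_path.name)  # 2021-02-16_09-05_submission.csv
```

The returned path could then be passed straight to `submission_df.to_csv(...)`.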

The model scored `0.8608`, which is not too bad and much better than the benchmark. There is still a lot of room for improvement (my present score is `0.8638`, ranking **7th** as of 16 February 2021). We could tune more parameters, since CatBoost has a lot of [them](https://catboost.ai/docs/concepts/parameter-tuning.html); try some feature engineering (I tried some basic features, but the score did not improve, so I omitted them from the analysis); change the model (LightGBM or XGBoost, for example); try different imputation strategies; increase the number of `optuna` trials a little; and so on. It is also worth spending some time analysing the dataset and the correlations among the variables using EDA notebooks.

© Andrea Dalseno 2021