# Modelos de ML

**Objetivo**: Criar modelos de ML para a projeção de todas as nossas séries.

**Metodologias**: 

- Regressão Linear
- Árvore de decisão
- Random Forest
- XGBoost
- LightGBM

## 0. Setup

In [1]:
%load_ext autotime

time: 112 µs (started: 2024-01-04 14:04:17 -03:00)


In [2]:
#---- Manipulação de dados:

import pandas as pd
import numpy as np

#---- Modelagem:

from hierarchicalforecast.utils import aggregate
from mlforecast import MLForecast
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

#---- Reconciliação

from hierarchicalforecast.methods import BottomUp, TopDown, ERM, OptimalCombination, MinTrace, MiddleOut
from hierarchicalforecast.core import HierarchicalReconciliation

#---- Visualização

import plotly.express as px

time: 1.44 s (started: 2024-01-04 14:04:17 -03:00)


## 1. Dados: vendas de roupas no varejo

In [3]:
dados = pd.read_csv('https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-hierarchical-forecasting/main/retail-usa-clothing.csv')

dados.head()

Unnamed: 0,date,state,item,quantity,region,country
0,1997-11-25,NewYork,mens_clothing,8,Mid-Alantic,USA
1,1997-11-26,NewYork,mens_clothing,9,Mid-Alantic,USA
2,1997-11-27,NewYork,mens_clothing,11,Mid-Alantic,USA
3,1997-11-28,NewYork,mens_clothing,11,Mid-Alantic,USA
4,1997-11-29,NewYork,mens_clothing,10,Mid-Alantic,USA


time: 958 ms (started: 2024-01-04 14:04:18 -03:00)


In [4]:
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388024 entries, 0 to 388023
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   date      388024 non-null  object
 1   state     388024 non-null  object
 2   item      388024 non-null  object
 3   quantity  388024 non-null  int64 
 4   region    388024 non-null  object
 5   country   388024 non-null  object
dtypes: int64(1), object(5)
memory usage: 17.8+ MB
time: 53 ms (started: 2024-01-04 14:04:19 -03:00)


## 2. Modificação nos dados 

In [5]:
def clean_data_baseline(df: pd.DataFrame):

    #---- 1. Excluindo a variável de country:

    df = df\
        .drop(columns = 'country')

    #---- 2. Mudando o tipo da variável de date para datetime:

    df['date'] = pd.to_datetime(df['date'])

    #---- 3. Renomeando as variáveis de quantidade de vendas e data:
    # date -> ds
    # quantity -> y

    df = df\
        .rename(columns = {'date': 'ds', 
                           'quantity': 'y'})

    return df

time: 746 µs (started: 2024-01-04 14:04:19 -03:00)


In [6]:
df = clean_data_baseline(df = dados)

df.head()

Unnamed: 0,ds,state,item,y,region
0,1997-11-25,NewYork,mens_clothing,8,Mid-Alantic
1,1997-11-26,NewYork,mens_clothing,9,Mid-Alantic
2,1997-11-27,NewYork,mens_clothing,11,Mid-Alantic
3,1997-11-28,NewYork,mens_clothing,11,Mid-Alantic
4,1997-11-29,NewYork,mens_clothing,10,Mid-Alantic


time: 99.1 ms (started: 2024-01-04 14:04:19 -03:00)


In [7]:
def format_hierarchical_df(df: pd.DataFrame, cols_hierarchical: list):

    #---- 1. Cria uma lista de listas: [[col1], [col1, col2], ..., [col1, col2, coln]]

    hier_list = [cols_hierarchical[:i] for i in range(1, len(cols_hierarchical) + 1)]

    #---- 2. Aplica a função aggregate que formata os dados em que a lib hierarchical pede

    Y_df, S_df, tags = aggregate(df = df, spec = hier_list)

    return Y_df, S_df, tags

time: 999 µs (started: 2024-01-04 14:04:19 -03:00)


In [8]:
cols_hierarchical = ['region', 'state', 'item']

Y_df, S_df, tags = format_hierarchical_df(df = df, cols_hierarchical = cols_hierarchical)

time: 564 ms (started: 2024-01-04 14:04:19 -03:00)


In [9]:
display(Y_df.head())
display(Y_df.tail())

Unnamed: 0_level_0,ds,y
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1
EastNorthCentral,1997-11-25,507
EastNorthCentral,1997-11-26,504
EastNorthCentral,1997-11-27,510
EastNorthCentral,1997-11-28,507
EastNorthCentral,1997-11-29,513


Unnamed: 0_level_0,ds,y
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1
SouthCentral/Tennessee/womens_shoes,2009-07-24,31
SouthCentral/Tennessee/womens_shoes,2009-07-25,30
SouthCentral/Tennessee/womens_shoes,2009-07-26,31
SouthCentral/Tennessee/womens_shoes,2009-07-27,29
SouthCentral/Tennessee/womens_shoes,2009-07-28,30


time: 8.65 ms (started: 2024-01-04 14:04:20 -03:00)


- **Dados de treino: 25/11/1997 a 31/12/2008**
- **Dados de validação: 01/01/2009 a 28/07/2009**

In [10]:
def split_train_test(df: pd.DataFrame, dt_start_train: str):

    #---- 1. Dados de treino

    train = df.query(f'ds < "{dt_start_train}"')

    #---- 2. Dados de teste:
    
    valid = df.query(f'ds >= "{dt_start_train}"')

    return train, valid

time: 671 µs (started: 2024-01-04 14:04:20 -03:00)


In [11]:
Y_train_df, Y_valid_df = split_train_test(df = Y_df, dt_start_train = '2009-01-01')

time: 78.9 ms (started: 2024-01-04 14:04:20 -03:00)


In [12]:
display(Y_train_df.head())
display(Y_train_df.tail())

Unnamed: 0_level_0,ds,y
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1
EastNorthCentral,1997-11-25,507
EastNorthCentral,1997-11-26,504
EastNorthCentral,1997-11-27,510
EastNorthCentral,1997-11-28,507
EastNorthCentral,1997-11-29,513


Unnamed: 0_level_0,ds,y
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1
SouthCentral/Tennessee/womens_shoes,2008-12-27,31
SouthCentral/Tennessee/womens_shoes,2008-12-28,29
SouthCentral/Tennessee/womens_shoes,2008-12-29,28
SouthCentral/Tennessee/womens_shoes,2008-12-30,31
SouthCentral/Tennessee/womens_shoes,2008-12-31,31


time: 10.3 ms (started: 2024-01-04 14:04:20 -03:00)


## 3. Modelagem

In [13]:
#---- Features de data:

from numba import njit
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean

@njit
def rolling_mean_7(x):
    return rolling_mean(x, window_size = 7)

@njit
def rolling_mean_14(x):
    return rolling_mean(x, window_size = 14)

@njit
def rolling_mean_21(x):
    return rolling_mean(x, window_size = 21)

@njit
def rolling_mean_28(x):
    return rolling_mean(x, window_size = 28)

time: 6.68 ms (started: 2024-01-04 14:04:20 -03:00)


In [14]:
def rmse(y_true, y_pred):
    
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

time: 754 µs (started: 2024-01-04 14:04:20 -03:00)


In [15]:
n_horizon = Y_valid_df.ds.nunique() # Quantidade de dias para a projeção

time: 6.05 ms (started: 2024-01-04 14:04:20 -03:00)


In [20]:
import optuna
from sklearn.metrics import mean_squared_error

def objective(trial):

    learning_rate = trial.suggest_float('learning_rate', 1e-3, 1e-1)
    num_leaves = trial.suggest_int('num_leaves', 2, 256)
    min_data_in_leaf = trial.suggest_int('min_data_in_leaf', 1, 100)
    bagging_fraction = trial.suggest_float('learning_rate', 1e-3, 1e-1)
    feature_fraction = trial.suggest_float('learning_rate', 1e-3, 1e-1)
    reg_alpha = trial.suggest_float('reg_alpha', 1e-3, 10.0, log = True)
    reg_lambda = trial.suggest_float('reg_alpha', 1e-3, 10.0, log = True)
    min_child_samples = trial.suggest_int('min_child_samples', 5, 100)
    max_depth = trial.suggest_int('max_depth', 3, 15)
    min_child_weight = trial.suggest_float('min_child_weight', 1e-3, 10.0, log = True)
    subsample = trial.suggest_float('subsample', 0.1, 1.0)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.1, 1.0)
    subsample_freq = trial.suggest_int('subsample_freq', 1, 10)
    min_split_gain = trial.suggest_float('min_split_gain', 1e-4, 1.0, log = True)
    max_delta_step = trial.suggest_int('max_delta_step', 0, 10)
    scale_pos_weight = trial.suggest_float('scale_pos_weight', 0.1, 1.0)
    

    lgbm = LGBMRegressor(num_leaves = num_leaves,
                         max_depth = max_depth, 
                         learning_rate = learning_rate,
                         n_estimators = 500,
                         min_data_in_leaf = min_data_in_leaf, 
                         bagging_fraction = bagging_fraction, 
                         feature_fraction = feature_fraction,
                         reg_alpha = reg_alpha,
                         reg_lambda = reg_lambda,
                         min_child_samples = min_child_samples,
                         min_child_weight = min_child_weight,
                         subsample = subsample,
                         colsample_bytree = colsample_bytree,
                         subsample_freq = subsample_freq,
                         min_split_gain = min_split_gain,
                         max_delta_step = max_delta_step,
                         scale_pos_weight = scale_pos_weight,
                         random_state = 19
                         )
    
    models_list = [lgbm]

    model = MLForecast(models = models_list,
                       freq = 'D',
                       num_threads = 6,
                       lags = [1, 7, 14, 21, 28, 30], 
                       date_features = ['dayofweek', 'month', 'year', 'quarter', 'day', 'week'],
                       lag_transforms = {
                           1: [expanding_mean],
                           7: [rolling_mean_7],
                           14: [rolling_mean_14],
                           21: [rolling_mean_21],
                           28: [rolling_mean_28],
                       }
               )

    model.fit(Y_train_df.reset_index(), id_col = 'unique_id', time_col = 'ds', target_col = 'y', fitted = True)

    Y_hat_df = model.predict(h = n_horizon)

    p = Y_hat_df.reset_index().merge(Y_valid_df.reset_index(), on = ['unique_id', 'ds'], how = 'left')

    error = rmse(p['y'], p['LGBMRegressor'])
    
    return error

time: 1.92 ms (started: 2024-01-04 14:05:12 -03:00)


In [21]:
study = optuna.create_study(direction = 'minimize')
study.optimize(objective, n_trials = 20)

[I 2024-01-04 14:05:13,175] A new study created in memory with name: no-name-6f4f7b5f-6406-4046-9dbb-3783c28232de


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000769 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:05:20,191] Trial 0 finished with value: 231.30983868415868 and parameters: {'learning_rate': 0.05805685598993946, 'num_leaves': 101, 'min_data_in_leaf': 55, 'reg_alpha': 0.2177333499358141, 'min_child_samples': 83, 'max_depth': 5, 'min_child_weight': 0.08007425289212589, 'subsample': 0.5875687293630835, 'colsample_bytree': 0.1606547866531502, 'subsample_freq': 7, 'min_split_gain': 0.0027001804021674868, 'max_delta_step': 6, 'scale_pos_weight': 0.35293362358087665}. Best is trial 0 with value: 231.30983868415868.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001059 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:05:27,868] Trial 1 finished with value: 231.08613693157005 and parameters: {'learning_rate': 0.07261204006702895, 'num_leaves': 163, 'min_data_in_leaf': 61, 'reg_alpha': 0.006028638715307247, 'min_child_samples': 20, 'max_depth': 14, 'min_child_weight': 3.03867554259231, 'subsample': 0.685584547587064, 'colsample_bytree': 0.9004943822690571, 'subsample_freq': 9, 'min_split_gain': 0.0012221918428850136, 'max_delta_step': 5, 'scale_pos_weight': 0.9745899484603823}. Best is trial 1 with value: 231.08613693157005.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000866 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:05:36,918] Trial 2 finished with value: 218.12630382244862 and parameters: {'learning_rate': 0.04793646712541075, 'num_leaves': 193, 'min_data_in_leaf': 76, 'reg_alpha': 0.01752004170400618, 'min_child_samples': 71, 'max_depth': 10, 'min_child_weight': 0.002101813043794349, 'subsample': 0.8839903777749064, 'colsample_bytree': 0.7289662025885446, 'subsample_freq': 1, 'min_split_gain': 0.0009066091866419913, 'max_delta_step': 10, 'scale_pos_weight': 0.7220516154559721}. Best is trial 2 with value: 218.12630382244862.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000748 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:05:42,735] Trial 3 finished with value: 272.7209326562359 and parameters: {'learning_rate': 0.015246309154669957, 'num_leaves': 193, 'min_data_in_leaf': 57, 'reg_alpha': 0.021829436410338195, 'min_child_samples': 50, 'max_depth': 3, 'min_child_weight': 1.9756405835587552, 'subsample': 0.4145821653836218, 'colsample_bytree': 0.19898756134503864, 'subsample_freq': 9, 'min_split_gain': 0.3220725483535617, 'max_delta_step': 4, 'scale_pos_weight': 0.3280400901243785}. Best is trial 2 with value: 218.12630382244862.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001103 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:05:52,020] Trial 4 finished with value: 230.39252026865526 and parameters: {'learning_rate': 0.03297227239792847, 'num_leaves': 121, 'min_data_in_leaf': 18, 'reg_alpha': 1.9069914031101722, 'min_child_samples': 77, 'max_depth': 10, 'min_child_weight': 0.005401446143634378, 'subsample': 0.39643071051279677, 'colsample_bytree': 0.3120052726231334, 'subsample_freq': 3, 'min_split_gain': 0.03329291835891981, 'max_delta_step': 10, 'scale_pos_weight': 0.7930498586156657}. Best is trial 2 with value: 218.12630382244862.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001271 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:05:59,610] Trial 5 finished with value: 269.62417828640076 and parameters: {'learning_rate': 0.010742904661647172, 'num_leaves': 121, 'min_data_in_leaf': 90, 'reg_alpha': 0.032066785531133755, 'min_child_samples': 100, 'max_depth': 5, 'min_child_weight': 0.0021772891569909764, 'subsample': 0.3748364470446376, 'colsample_bytree': 0.13893698904479196, 'subsample_freq': 3, 'min_split_gain': 0.05799545887473512, 'max_delta_step': 8, 'scale_pos_weight': 0.9276667901303554}. Best is trial 2 with value: 218.12630382244862.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001323 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:06:06,785] Trial 6 finished with value: 175.26376022427254 and parameters: {'learning_rate': 0.09544604507310606, 'num_leaves': 200, 'min_data_in_leaf': 81, 'reg_alpha': 2.5790973046368277, 'min_child_samples': 61, 'max_depth': 3, 'min_child_weight': 1.368067424129913, 'subsample': 0.5176619674906686, 'colsample_bytree': 0.6567682819196545, 'subsample_freq': 7, 'min_split_gain': 0.0032334954198209053, 'max_delta_step': 9, 'scale_pos_weight': 0.5352391480594676}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001073 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:06:14,378] Trial 7 finished with value: 257.14731705528044 and parameters: {'learning_rate': 0.07361843019171468, 'num_leaves': 67, 'min_data_in_leaf': 63, 'reg_alpha': 0.00412335473728842, 'min_child_samples': 20, 'max_depth': 5, 'min_child_weight': 0.019172621046218488, 'subsample': 0.7060320037184752, 'colsample_bytree': 0.9022028407014343, 'subsample_freq': 1, 'min_split_gain': 0.006647923808312924, 'max_delta_step': 2, 'scale_pos_weight': 0.4624669068653182}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001067 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:06:21,583] Trial 8 finished with value: 252.4978911230433 and parameters: {'learning_rate': 0.058451441957139395, 'num_leaves': 82, 'min_data_in_leaf': 30, 'reg_alpha': 0.08164049975189455, 'min_child_samples': 54, 'max_depth': 4, 'min_child_weight': 1.8292330409403947, 'subsample': 0.8753498879907287, 'colsample_bytree': 0.9874323132097471, 'subsample_freq': 6, 'min_split_gain': 0.025198127141659397, 'max_delta_step': 3, 'scale_pos_weight': 0.10717613386071277}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001243 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:06:27,979] Trial 9 finished with value: 281.13873253987606 and parameters: {'learning_rate': 0.012727897143818962, 'num_leaves': 242, 'min_data_in_leaf': 5, 'reg_alpha': 0.03247173608728822, 'min_child_samples': 96, 'max_depth': 4, 'min_child_weight': 0.07230425966363091, 'subsample': 0.5275787061546114, 'colsample_bytree': 0.7362670735209902, 'subsample_freq': 10, 'min_split_gain': 0.24010024993025925, 'max_delta_step': 1, 'scale_pos_weight': 0.9174687744894415}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001441 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:06:36,867] Trial 10 finished with value: 189.98095903554363 and parameters: {'learning_rate': 0.09963709653048382, 'num_leaves': 13, 'min_data_in_leaf': 93, 'reg_alpha': 7.416237038209158, 'min_child_samples': 43, 'max_depth': 8, 'min_child_weight': 0.4593039248047754, 'subsample': 0.19873650219116884, 'colsample_bytree': 0.4987724154008645, 'subsample_freq': 7, 'min_split_gain': 0.00016174054559046198, 'max_delta_step': 7, 'scale_pos_weight': 0.6142014046003514}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001499 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:06:45,943] Trial 11 finished with value: 192.75140561675542 and parameters: {'learning_rate': 0.09751618343770461, 'num_leaves': 14, 'min_data_in_leaf': 98, 'reg_alpha': 9.963291514185457, 'min_child_samples': 42, 'max_depth': 8, 'min_child_weight': 0.5149761582478813, 'subsample': 0.14831077965854905, 'colsample_bytree': 0.46612980835479834, 'subsample_freq': 7, 'min_split_gain': 0.00010421304441445616, 'max_delta_step': 7, 'scale_pos_weight': 0.6194378289454338}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001693 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:06:54,178] Trial 12 finished with value: 181.1259966234339 and parameters: {'learning_rate': 0.0992170128754679, 'num_leaves': 8, 'min_data_in_leaf': 83, 'reg_alpha': 1.120881123511234, 'min_child_samples': 33, 'max_depth': 8, 'min_child_weight': 0.3677909579289827, 'subsample': 0.11305010035352488, 'colsample_bytree': 0.5410073013426184, 'subsample_freq': 5, 'min_split_gain': 0.00020657014176026476, 'max_delta_step': 8, 'scale_pos_weight': 0.5686407664233605}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001075 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:07:05,685] Trial 13 finished with value: 198.85595192786533 and parameters: {'learning_rate': 0.0832616293371835, 'num_leaves': 250, 'min_data_in_leaf': 76, 'reg_alpha': 0.6130395940487279, 'min_child_samples': 5, 'max_depth': 14, 'min_child_weight': 8.77629139355262, 'subsample': 0.27565931084142375, 'colsample_bytree': 0.6484927314454525, 'subsample_freq': 4, 'min_split_gain': 0.00044310111697715984, 'max_delta_step': 9, 'scale_pos_weight': 0.47828599230843616}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001108 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:07:14,541] Trial 14 finished with value: 202.62539367403812 and parameters: {'learning_rate': 0.08534042184726824, 'num_leaves': 163, 'min_data_in_leaf': 39, 'reg_alpha': 1.501139558382304, 'min_child_samples': 66, 'max_depth': 7, 'min_child_weight': 0.28404119389461846, 'subsample': 0.1148735886313071, 'colsample_bytree': 0.37677500056295954, 'subsample_freq': 5, 'min_split_gain': 0.004042068800779706, 'max_delta_step': 8, 'scale_pos_weight': 0.30701746558538556}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001185 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:07:24,678] Trial 15 finished with value: 195.51505352597687 and parameters: {'learning_rate': 0.08816353899373361, 'num_leaves': 48, 'min_data_in_leaf': 77, 'reg_alpha': 2.455783645518474, 'min_child_samples': 31, 'max_depth': 11, 'min_child_weight': 0.20884318006745245, 'subsample': 0.2929084760296936, 'colsample_bytree': 0.6219757276924248, 'subsample_freq': 5, 'min_split_gain': 0.00035768332993254724, 'max_delta_step': 9, 'scale_pos_weight': 0.7139744194660446}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001063 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:07:33,777] Trial 16 finished with value: 225.2499744272712 and parameters: {'learning_rate': 0.06908184572007725, 'num_leaves': 217, 'min_data_in_leaf': 81, 'reg_alpha': 0.33354982663733, 'min_child_samples': 61, 'max_depth': 12, 'min_child_weight': 0.8618004872808909, 'subsample': 0.9619627978494784, 'colsample_bytree': 0.5773538568528469, 'subsample_freq': 8, 'min_split_gain': 0.011829984719440046, 'max_delta_step': 6, 'scale_pos_weight': 0.5306567634554036}. Best is trial 6 with value: 175.26376022427254.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001158 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:07:45,566] Trial 17 finished with value: 17.6090018124214 and parameters: {'learning_rate': 0.035608626868287335, 'num_leaves': 152, 'min_data_in_leaf': 46, 'reg_alpha': 0.7493647554395146, 'min_child_samples': 33, 'max_depth': 7, 'min_child_weight': 0.027338398098262277, 'subsample': 0.5328363136202824, 'colsample_bytree': 0.7452960968224038, 'subsample_freq': 4, 'min_split_gain': 0.00203245703916804, 'max_delta_step': 0, 'scale_pos_weight': 0.1703693448857999}. Best is trial 17 with value: 17.6090018124214.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.044603 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:07:56,852] Trial 18 finished with value: 18.413736228833073 and parameters: {'learning_rate': 0.0270466113472588, 'num_leaves': 151, 'min_data_in_leaf': 41, 'reg_alpha': 0.0011272911988604133, 'min_child_samples': 29, 'max_depth': 6, 'min_child_weight': 0.021572671389730787, 'subsample': 0.5427646496114741, 'colsample_bytree': 0.8017418465303501, 'subsample_freq': 2, 'min_split_gain': 0.0018065990197147393, 'max_delta_step': 0, 'scale_pos_weight': 0.1494177242906738}. Best is trial 17 with value: 17.6090018124214.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001074 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2928
[LightGBM] [Info] Number of data points in the train set: 456000, number of used features: 17
[LightGBM] [Info] Start training from score 116.233066


[I 2024-01-04 14:08:07,981] Trial 19 finished with value: 17.800950167207095 and parameters: {'learning_rate': 0.030117425057549097, 'num_leaves': 150, 'min_data_in_leaf': 39, 'reg_alpha': 0.0010530492066892427, 'min_child_samples': 13, 'max_depth': 6, 'min_child_weight': 0.027530489742442675, 'subsample': 0.6814073910338947, 'colsample_bytree': 0.8021148774253098, 'subsample_freq': 2, 'min_split_gain': 0.0011226124299370558, 'max_delta_step': 0, 'scale_pos_weight': 0.13434101662640205}. Best is trial 17 with value: 17.6090018124214.


time: 2min 54s (started: 2024-01-04 14:05:13 -03:00)


In [22]:
study.best_params

{'learning_rate': 0.035608626868287335,
 'num_leaves': 152,
 'min_data_in_leaf': 46,
 'reg_alpha': 0.7493647554395146,
 'min_child_samples': 33,
 'max_depth': 7,
 'min_child_weight': 0.027338398098262277,
 'subsample': 0.5328363136202824,
 'colsample_bytree': 0.7452960968224038,
 'subsample_freq': 4,
 'min_split_gain': 0.00203245703916804,
 'max_delta_step': 0,
 'scale_pos_weight': 0.1703693448857999}

time: 5.45 ms (started: 2024-01-04 14:12:02 -03:00)


In [23]:
#---- Salvando os melhores parâmetros em um JSON:

import json

with open('lgbm-best-parameters.json', 'w') as jsn:
    json.dump(dict(study.best_params), jsn)

time: 2.19 ms (started: 2024-01-04 14:12:02 -03:00)


In [24]:
study.best_value

17.6090018124214

time: 4.51 ms (started: 2024-01-04 14:12:09 -03:00)
