# Modelos de ML

**Objetivo**: Criar modelos de ML para a projeção de todas as nossas séries.

**Metodologias**: 

- Regressão Linear
- Árvore de decisão
- Random Forest
- XGBoost
- LightGBM

## 0. Setup

In [1]:
%load_ext autotime

time: 162 µs (started: 2024-01-04 15:21:35 -03:00)


In [2]:
#---- Manipulação de dados:

import pandas as pd
import numpy as np

#---- Modelagem:

from hierarchicalforecast.utils import aggregate
from mlforecast import MLForecast
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

#---- Reconciliação

from hierarchicalforecast.methods import BottomUp, TopDown, ERM, OptimalCombination, MinTrace, MiddleOut
from hierarchicalforecast.core import HierarchicalReconciliation

#---- Visualização

import plotly.express as px

time: 4.35 s (started: 2024-01-04 15:21:35 -03:00)


## 1. Dados: vendas de roupas no varejo

In [3]:
dados = pd.read_csv('https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-hierarchical-forecasting/main/retail-usa-clothing.csv')

dados.head()

Unnamed: 0,date,state,item,quantity,region,country
0,1997-11-25,NewYork,mens_clothing,8,Mid-Alantic,USA
1,1997-11-26,NewYork,mens_clothing,9,Mid-Alantic,USA
2,1997-11-27,NewYork,mens_clothing,11,Mid-Alantic,USA
3,1997-11-28,NewYork,mens_clothing,11,Mid-Alantic,USA
4,1997-11-29,NewYork,mens_clothing,10,Mid-Alantic,USA


time: 1.77 s (started: 2024-01-04 15:21:40 -03:00)


In [4]:
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388024 entries, 0 to 388023
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   date      388024 non-null  object
 1   state     388024 non-null  object
 2   item      388024 non-null  object
 3   quantity  388024 non-null  int64 
 4   region    388024 non-null  object
 5   country   388024 non-null  object
dtypes: int64(1), object(5)
memory usage: 17.8+ MB
time: 66.2 ms (started: 2024-01-04 15:21:41 -03:00)


## 2. Modificação nos dados 

In [5]:
def clean_data_baseline(df: pd.DataFrame):

    #---- 1. Excluindo a variável de country:

    df = df\
        .drop(columns = 'country')

    #---- 2. Mudando o tipo da variável de date para datetime:

    df['date'] = pd.to_datetime(df['date'])

    #---- 3. Renomeando as variáveis de quantidade de vendas e data:
    # date -> ds
    # quantity -> y

    df = df\
        .rename(columns = {'date': 'ds', 
                           'quantity': 'y'})

    return df

time: 597 µs (started: 2024-01-04 15:21:41 -03:00)


In [6]:
df = clean_data_baseline(df = dados)

df.head()

Unnamed: 0,ds,state,item,y,region
0,1997-11-25,NewYork,mens_clothing,8,Mid-Alantic
1,1997-11-26,NewYork,mens_clothing,9,Mid-Alantic
2,1997-11-27,NewYork,mens_clothing,11,Mid-Alantic
3,1997-11-28,NewYork,mens_clothing,11,Mid-Alantic
4,1997-11-29,NewYork,mens_clothing,10,Mid-Alantic


time: 95.6 ms (started: 2024-01-04 15:21:41 -03:00)


In [7]:
def format_hierarchical_df(df: pd.DataFrame, cols_hierarchical: list):

    #---- 1. Cria uma lista de listas: [[col1], [col1, col2], ..., [col1, col2, coln]]

    hier_list = [cols_hierarchical[:i] for i in range(1, len(cols_hierarchical) + 1)]

    #---- 2. Aplica a função aggregate que formata os dados em que a lib hierarchical pede

    Y_df, S_df, tags = aggregate(df = df, spec = hier_list)

    return Y_df, S_df, tags

time: 690 µs (started: 2024-01-04 15:21:42 -03:00)


In [8]:
cols_hierarchical = ['region', 'state', 'item']

Y_df, S_df, tags = format_hierarchical_df(df = df, cols_hierarchical = cols_hierarchical)

time: 536 ms (started: 2024-01-04 15:21:42 -03:00)


In [9]:
display(Y_df.head())
display(Y_df.tail())

Unnamed: 0_level_0,ds,y
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1
EastNorthCentral,1997-11-25,507
EastNorthCentral,1997-11-26,504
EastNorthCentral,1997-11-27,510
EastNorthCentral,1997-11-28,507
EastNorthCentral,1997-11-29,513


Unnamed: 0_level_0,ds,y
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1
SouthCentral/Tennessee/womens_shoes,2009-07-24,31
SouthCentral/Tennessee/womens_shoes,2009-07-25,30
SouthCentral/Tennessee/womens_shoes,2009-07-26,31
SouthCentral/Tennessee/womens_shoes,2009-07-27,29
SouthCentral/Tennessee/womens_shoes,2009-07-28,30


time: 10.4 ms (started: 2024-01-04 15:21:42 -03:00)


- **Dados de treino: 25/11/1997 a 31/12/2008**
- **Dados de validação: 01/01/2009 a 28/07/2009**

In [10]:
def split_train_test(df: pd.DataFrame, dt_start_train: str):

    #---- 1. Dados de treino

    train = df.query(f'ds < "{dt_start_train}"')

    #---- 2. Dados de teste:
    
    valid = df.query(f'ds >= "{dt_start_train}"')

    return train, valid

time: 534 µs (started: 2024-01-04 15:21:42 -03:00)


In [11]:
Y_train_df, Y_valid_df = split_train_test(df = Y_df, dt_start_train = '2009-01-01')

time: 78 ms (started: 2024-01-04 15:21:42 -03:00)


In [12]:
display(Y_train_df.head())
display(Y_train_df.tail())

Unnamed: 0_level_0,ds,y
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1
EastNorthCentral,1997-11-25,507
EastNorthCentral,1997-11-26,504
EastNorthCentral,1997-11-27,510
EastNorthCentral,1997-11-28,507
EastNorthCentral,1997-11-29,513


Unnamed: 0_level_0,ds,y
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1
SouthCentral/Tennessee/womens_shoes,2008-12-27,31
SouthCentral/Tennessee/womens_shoes,2008-12-28,29
SouthCentral/Tennessee/womens_shoes,2008-12-29,28
SouthCentral/Tennessee/womens_shoes,2008-12-30,31
SouthCentral/Tennessee/womens_shoes,2008-12-31,31


time: 8.71 ms (started: 2024-01-04 15:21:42 -03:00)


## 3. Modelagem

In [13]:
#---- Features de data:

from numba import njit
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean

@njit
def rolling_mean_7(x):
    return rolling_mean(x, window_size = 7)

@njit
def rolling_mean_14(x):
    return rolling_mean(x, window_size = 14)

@njit
def rolling_mean_21(x):
    return rolling_mean(x, window_size = 21)

@njit
def rolling_mean_28(x):
    return rolling_mean(x, window_size = 28)

time: 10 ms (started: 2024-01-04 15:21:42 -03:00)


In [14]:
def rmse(y_true, y_pred):
    
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

time: 397 µs (started: 2024-01-04 15:21:42 -03:00)


In [15]:
n_horizon = Y_valid_df.ds.nunique() # Quantidade de dias para a projeção

time: 2.71 ms (started: 2024-01-04 15:21:42 -03:00)


In [18]:
import optuna
from sklearn.metrics import mean_squared_error

def objective(trial):

    learning_rate = trial.suggest_float('learning_rate', 1e-3, 1e-1)
    reg_alpha = trial.suggest_float('reg_alpha', 1e-3, 10.0, log = True)
    reg_lambda = trial.suggest_float('reg_alpha', 1e-3, 10.0, log = True)
    max_depth = trial.suggest_int('max_depth', 3, 15)
    min_child_weight = trial.suggest_float('min_child_weight', 1e-3, 10.0, log = True)
    subsample = trial.suggest_float('subsample', 0.1, 1.0)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.1, 1.0)
    max_delta_step = trial.suggest_int('max_delta_step', 0, 10)
    scale_pos_weight = trial.suggest_float('scale_pos_weight', 0.1, 1.0)    

    xgb = XGBRegressor(max_depth = max_depth, 
                       learning_rate = learning_rate,
                       n_estimators = 500,
                       reg_alpha = reg_alpha,
                       reg_lambda = reg_lambda,
                       min_child_weight = min_child_weight,
                       subsample = subsample,
                       colsample_bytree = colsample_bytree,
                       max_delta_step = max_delta_step,
                       scale_pos_weight = scale_pos_weight,
                       random_state = 19
                       )
    
    models_list = [xgb]

    model = MLForecast(models = models_list,
                       freq = 'D',
                       num_threads = 6,
                       lags = [1, 7, 14, 21, 28, 30], 
                       date_features = ['dayofweek', 'month', 'year', 'quarter', 'day', 'week'],
                       lag_transforms = {
                           1: [expanding_mean],
                           7: [rolling_mean_7],
                           14: [rolling_mean_14],
                           21: [rolling_mean_21],
                           28: [rolling_mean_28],
                       }
               )

    model.fit(Y_train_df.reset_index(), id_col = 'unique_id', time_col = 'ds', target_col = 'y', fitted = True)

    Y_hat_df = model.predict(h = n_horizon)

    p = Y_hat_df.reset_index().merge(Y_valid_df.reset_index(), on = ['unique_id', 'ds'], how = 'left')

    error = rmse(p['y'], p['XGBRegressor'])
    
    return error

time: 3.69 ms (started: 2024-01-04 15:23:33 -03:00)


In [19]:
study = optuna.create_study(direction = 'minimize')
study.optimize(objective, n_trials = 20)

[I 2024-01-04 15:23:34,180] A new study created in memory with name: no-name-a3ffe4e1-892a-4049-a5dd-3efa0b6cee54
[I 2024-01-04 15:23:42,416] Trial 0 finished with value: 248.99696137742419 and parameters: {'learning_rate': 0.019573948483793138, 'reg_alpha': 0.0020840509706208377, 'max_depth': 4, 'min_child_weight': 0.07495249054694259, 'subsample': 0.5798555306491624, 'colsample_bytree': 0.3609085505489348, 'max_delta_step': 8, 'scale_pos_weight': 0.9749333713029478}. Best is trial 0 with value: 248.99696137742419.
[I 2024-01-04 15:23:53,747] Trial 1 finished with value: 238.77005755596343 and parameters: {'learning_rate': 0.07010398867263551, 'reg_alpha': 0.002170705163882954, 'max_depth': 11, 'min_child_weight': 0.2300071475052542, 'subsample': 0.2341838959647129, 'colsample_bytree': 0.33498629762737153, 'max_delta_step': 3, 'scale_pos_weight': 0.7149209282420981}. Best is trial 1 with value: 238.77005755596343.
[I 2024-01-04 15:24:02,029] Trial 2 finished with value: 252.5442090998

time: 3min 34s (started: 2024-01-04 15:23:34 -03:00)


In [20]:
study.best_params

{'learning_rate': 0.09706508183742094,
 'reg_alpha': 2.5856419356137534,
 'max_depth': 12,
 'min_child_weight': 1.7605206073240915,
 'subsample': 0.6489869134403273,
 'colsample_bytree': 0.9926559366238281,
 'max_delta_step': 0,
 'scale_pos_weight': 0.4118214241620429}

time: 3.62 ms (started: 2024-01-04 15:27:08 -03:00)


In [22]:
#---- Salvando os melhores parâmetros em um JSON:

import json

with open('xgboost-best-parameters.json', 'w') as jsn:
    json.dump(dict(study.best_params), jsn)

time: 1.53 ms (started: 2024-01-04 15:27:08 -03:00)


In [21]:
study.best_value

17.664120472195915

time: 3.01 ms (started: 2024-01-04 15:27:08 -03:00)
