# ML Models

**Objective**: Build ML models to forecast every series in the hierarchy.

**Methods**:

- Linear Regression
- Decision Tree
- Random Forest
- XGBoost
- LightGBM

## 0. Setup

In [1]:
%load_ext autotime

time: 181 µs (started: 2024-01-04 12:16:56 -03:00)


In [2]:
#---- Data manipulation:

import pandas as pd
import numpy as np

#---- Modeling:

from hierarchicalforecast.utils import aggregate
from mlforecast import MLForecast
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

#---- Reconciliation

from hierarchicalforecast.methods import BottomUp, TopDown, ERM, OptimalCombination, MinTrace, MiddleOut
from hierarchicalforecast.core import HierarchicalReconciliation

#---- Visualization

import plotly.express as px

time: 1.67 s (started: 2024-01-04 12:16:56 -03:00)


## 1. Data: retail clothing sales

In [3]:
dados = pd.read_csv('https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-hierarchical-forecasting/main/retail-usa-clothing.csv')

dados.head()

Unnamed: 0,date,state,item,quantity,region,country
0,1997-11-25,NewYork,mens_clothing,8,Mid-Alantic,USA
1,1997-11-26,NewYork,mens_clothing,9,Mid-Alantic,USA
2,1997-11-27,NewYork,mens_clothing,11,Mid-Alantic,USA
3,1997-11-28,NewYork,mens_clothing,11,Mid-Alantic,USA
4,1997-11-29,NewYork,mens_clothing,10,Mid-Alantic,USA


time: 1.11 s (started: 2024-01-04 12:16:57 -03:00)


In [4]:
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388024 entries, 0 to 388023
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   date      388024 non-null  object
 1   state     388024 non-null  object
 2   item      388024 non-null  object
 3   quantity  388024 non-null  int64 
 4   region    388024 non-null  object
 5   country   388024 non-null  object
dtypes: int64(1), object(5)
memory usage: 17.8+ MB
time: 59.6 ms (started: 2024-01-04 12:16:58 -03:00)


## 2. Data preparation

In [5]:
def clean_data_baseline(df: pd.DataFrame):

    #---- 1. Drop the country column:

    df = df\
        .drop(columns = 'country')

    #---- 2. Convert the date column to datetime:

    df['date'] = pd.to_datetime(df['date'])

    #---- 3. Rename the date and sales-quantity columns:
    # date -> ds
    # quantity -> y

    df = df\
        .rename(columns = {'date': 'ds', 
                           'quantity': 'y'})

    return df

time: 822 µs (started: 2024-01-04 12:16:58 -03:00)


In [6]:
df = clean_data_baseline(df = dados)

df.head()

Unnamed: 0,ds,state,item,y,region
0,1997-11-25,NewYork,mens_clothing,8,Mid-Alantic
1,1997-11-26,NewYork,mens_clothing,9,Mid-Alantic
2,1997-11-27,NewYork,mens_clothing,11,Mid-Alantic
3,1997-11-28,NewYork,mens_clothing,11,Mid-Alantic
4,1997-11-29,NewYork,mens_clothing,10,Mid-Alantic


time: 109 ms (started: 2024-01-04 12:16:58 -03:00)


In [7]:
def format_hierarchical_df(df: pd.DataFrame, cols_hierarchical: list):

    #---- 1. Build a list of lists: [[col1], [col1, col2], ..., [col1, col2, ..., coln]]

    hier_list = [cols_hierarchical[:i] for i in range(1, len(cols_hierarchical) + 1)]

    #---- 2. Apply aggregate, which puts the data in the format the hierarchicalforecast library expects

    Y_df, S_df, tags = aggregate(df = df, spec = hier_list)

    return Y_df, S_df, tags

time: 740 µs (started: 2024-01-04 12:16:59 -03:00)
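
To make step 1 concrete, here is the spec that the list comprehension produces for the three columns used in this notebook. It is a plain-Python illustration; only the column names come from the notebook.

```python
# Hierarchy columns, ordered from coarsest to finest (same order as the notebook).
cols_hierarchical = ['region', 'state', 'item']

# Each prefix of the list defines one aggregation level.
hier_list = [cols_hierarchical[:i] for i in range(1, len(cols_hierarchical) + 1)]

for level in hier_list:
    print('/'.join(level))
```

Because `country` was dropped during cleaning, the spec has no single top-level total: the coarsest level is `region`.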


In [8]:
cols_hierarchical = ['region', 'state', 'item']

Y_df, S_df, tags = format_hierarchical_df(df = df, cols_hierarchical = cols_hierarchical)

time: 666 ms (started: 2024-01-04 12:16:59 -03:00)


In [9]:
display(Y_df.head())
display(Y_df.tail())

unique_id,ds,y
EastNorthCentral,1997-11-25,507
EastNorthCentral,1997-11-26,504
EastNorthCentral,1997-11-27,510
EastNorthCentral,1997-11-28,507
EastNorthCentral,1997-11-29,513


unique_id,ds,y
SouthCentral/Tennessee/womens_shoes,2009-07-24,31
SouthCentral/Tennessee/womens_shoes,2009-07-25,30
SouthCentral/Tennessee/womens_shoes,2009-07-26,31
SouthCentral/Tennessee/womens_shoes,2009-07-27,29
SouthCentral/Tennessee/womens_shoes,2009-07-28,30


time: 15.3 ms (started: 2024-01-04 12:16:59 -03:00)
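
The `unique_id` values above suggest that each level is a sum of the bottom-level series, labelled by joining the hierarchy columns with `/`. The sketch below reproduces that idea with a plain pandas `groupby`. It is not the library's implementation: the helper `aggregate_level` and the toy values are made up for illustration.

```python
import pandas as pd

# Toy bottom-level data (values are invented for illustration).
toy = pd.DataFrame({
    'region': ['SouthCentral'] * 4,
    'state': ['Tennessee', 'Tennessee', 'Texas', 'Texas'],
    'item': ['womens_shoes', 'mens_shoes', 'womens_shoes', 'mens_shoes'],
    'ds': pd.to_datetime(['2009-07-28'] * 4),
    'y': [30, 12, 45, 20],
})

def aggregate_level(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Hypothetical helper: sum y per date for one level and build a '/'-joined id."""
    out = df.groupby(cols + ['ds'], as_index=False)['y'].sum()
    out['unique_id'] = out[cols].apply(lambda row: '/'.join(row), axis=1)
    return out[['unique_id', 'ds', 'y']]

spec = [['region'], ['region', 'state'], ['region', 'state', 'item']]

# Stack all levels into one long frame, similar in shape to Y_df.
Y_toy = pd.concat([aggregate_level(toy, cols) for cols in spec], ignore_index=True)
print(Y_toy)
```

With one region, two states and four state/item pairs, the stacked frame has seven series for the single date.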


- **Training data: 1997-11-25 to 2008-12-31**
- **Validation data: 2009-01-01 to 2009-07-28**

In [10]:
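def split_train_test(df: pd.DataFrame, dt_start_train: str):

    # Note: dt_start_train is the cutoff date, i.e. the first day of the validation period.

    #---- 1. Training data: everything before the cutoff

    train = df.query(f'ds < "{dt_start_train}"')

    #---- 2. Validation data: the cutoff date onward
    
    valid = df.query(f'ds >= "{dt_start_train}"')

    return train, valid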

time: 574 µs (started: 2024-01-04 12:16:59 -03:00)


In [11]:
Y_train_df, Y_valid_df = split_train_test(df = Y_df, dt_start_train = '2009-01-01')

time: 84 ms (started: 2024-01-04 12:16:59 -03:00)
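
The date passed as `dt_start_train` is the first day of the validation period, so it is worth checking which side of the boundary each row lands on. The sketch below does that on a four-day toy frame using boolean masks, which should select the same rows as the `query` strings above. The values and the name `cutoff` are illustrative.

```python
import pandas as pd

# Four consecutive days straddling the cutoff (y values are invented).
toy = pd.DataFrame({
    'ds': pd.date_range('2008-12-30', periods=4, freq='D'),
    'y': [28, 31, 31, 30],
})

cutoff = pd.Timestamp('2009-01-01')

train = toy[toy['ds'] < cutoff]    # strictly before the cutoff
valid = toy[toy['ds'] >= cutoff]   # the cutoff day and everything after

print(train['ds'].max(), valid['ds'].min())
```

The two parts are disjoint and together cover every row, so no date is dropped or used twice.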


In [12]:
display(Y_train_df.head())
display(Y_train_df.tail())

unique_id,ds,y
EastNorthCentral,1997-11-25,507
EastNorthCentral,1997-11-26,504
EastNorthCentral,1997-11-27,510
EastNorthCentral,1997-11-28,507
EastNorthCentral,1997-11-29,513


unique_id,ds,y
SouthCentral/Tennessee/womens_shoes,2008-12-27,31
SouthCentral/Tennessee/womens_shoes,2008-12-28,29
SouthCentral/Tennessee/womens_shoes,2008-12-29,28
SouthCentral/Tennessee/womens_shoes,2008-12-30,31
SouthCentral/Tennessee/womens_shoes,2008-12-31,31


time: 12.6 ms (started: 2024-01-04 12:16:59 -03:00)


## 3. Modelagem

In [13]:
#---- Lag-based features (rolling means):

from numba import njit
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean

@njit
def rolling_mean_7(x):
    return rolling_mean(x, window_size = 7)

@njit
def rolling_mean_14(x):
    return rolling_mean(x, window_size = 14)

@njit
def rolling_mean_21(x):
    return rolling_mean(x, window_size = 21)

@njit
def rolling_mean_28(x):
    return rolling_mean(x, window_size = 28)

time: 5.29 ms (started: 2024-01-04 12:16:59 -03:00)
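
For readers unfamiliar with these window functions, the NumPy sketch below shows the arithmetic behind a trailing rolling mean and an expanding mean. It is a re-implementation for illustration, not the `window_ops` code; the assumption that the first `window_size - 1` positions are NaN reflects my understanding of the library's default. The input is the first five `quantity` values from the data preview.

```python
import numpy as np

def rolling_mean_sketch(x: np.ndarray, window_size: int) -> np.ndarray:
    """Trailing mean over the last `window_size` values; NaN until a full window exists."""
    out = np.full(len(x), np.nan)
    for i in range(window_size - 1, len(x)):
        out[i] = x[i - window_size + 1 : i + 1].mean()
    return out

def expanding_mean_sketch(x: np.ndarray) -> np.ndarray:
    """Mean of all values seen so far."""
    return np.cumsum(x) / np.arange(1, len(x) + 1)

y = np.array([8.0, 9.0, 11.0, 11.0, 10.0])

print(rolling_mean_sketch(y, window_size=3))
print(expanding_mean_sketch(y))
```

In MLForecast, each transform is applied to the lagged series named by its dictionary key, so a rolling mean registered under lag 7 only sees values that are at least seven days old.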


In [14]:
def rmse(y_true, y_pred):
    
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

time: 1.77 ms (started: 2024-01-04 12:16:59 -03:00)
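
The function computes $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. A small worked example, with made-up numbers, makes the scale of the metric easier to read:

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

y_true = np.array([10.0, 12.0, 14.0])
y_pred = np.array([11.0, 12.0, 12.0])

# Errors are -1, 0 and 2, so the squared errors are 1, 0 and 4 (mean 5/3).
print(rmse(y_true, y_pred))
```

The result is in the same units as `y` (items sold per day), and large errors weigh more than small ones because they are squared before averaging.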


In [15]:
n_horizon = Y_valid_df.ds.nunique() # Number of days to forecast

time: 2.63 ms (started: 2024-01-04 12:16:59 -03:00)


In [16]:
import optuna
from sklearn.metrics import mean_squared_error

def objective(trial):

    #---- 1. Hyperparameters sampled by Optuna (float values are fractions of the training rows)

    max_depth = trial.suggest_int('max_depth', 3, 15)
    min_samples_split = trial.suggest_float('min_samples_split', 0.1, 1.0)
    min_samples_leaf = trial.suggest_float('min_samples_leaf', 0.1, 0.5)
    max_features = trial.suggest_float('max_features', 0.1, 1.0)

    #---- 2. Candidate model

    dec_tree = DecisionTreeRegressor(random_state = 19, 
                                     max_depth = max_depth,
                                     min_samples_split = min_samples_split,
                                     min_samples_leaf = min_samples_leaf,
                                     max_features = max_features) 
    
    models_list = [dec_tree]

    #---- 3. Forecasting pipeline: lags, calendar features, and rolling/expanding means of lagged values

    model = MLForecast(models = models_list,
                       freq = 'D',
                       num_threads = 6,
                       lags = [1, 7, 14, 21, 28, 30], 
                       date_features = ['dayofweek', 'month', 'year', 'quarter', 'day', 'week'],
                       lag_transforms = {
                           1: [expanding_mean],
                           7: [rolling_mean_7],
                           14: [rolling_mean_14],
                           21: [rolling_mean_21],
                           28: [rolling_mean_28],
                       }
               )

    #---- 4. Fit on the training period and forecast the whole validation horizon

    model.fit(Y_train_df.reset_index(), id_col = 'unique_id', time_col = 'ds', target_col = 'y', fitted = True)

    Y_hat_df = model.predict(h = n_horizon)

    #---- 5. Join forecasts with actuals and score with RMSE (the value Optuna minimizes)

    p = Y_hat_df.reset_index().merge(Y_valid_df.reset_index(), on = ['unique_id', 'ds'], how = 'left')

    error = rmse(p['y'], p['DecisionTreeRegressor'])
    
    return error

time: 326 ms (started: 2024-01-04 12:16:59 -03:00)
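
The `date_features` list in the objective asks for calendar attributes of each timestamp. The pandas sketch below shows what those attributes look like for the first and last dates in the data. How MLForecast derives `week` internally is not shown in this notebook, so using the ISO week here is an assumption.

```python
import pandas as pd

# First and last dates of the dataset.
dates = pd.DatetimeIndex(['1997-11-25', '2009-07-28'])

features = pd.DataFrame({
    'dayofweek': dates.dayofweek,   # Monday = 0
    'month': dates.month,
    'year': dates.year,
    'quarter': dates.quarter,
    'day': dates.day,
    # Assumption: ISO week number, taken from isocalendar().
    'week': dates.isocalendar().week.astype(int).to_numpy(),
}, index=dates)

print(features)
```

Tree-based models can split on these integers directly, which is how they pick up weekly and yearly seasonality without explicit seasonal terms.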




In [17]:
study = optuna.create_study(direction = 'minimize')
study.optimize(objective, n_trials = 30)

[I 2024-01-04 12:17:00,255] A new study created in memory with name: no-name-625ff928-4989-40c5-9aa8-79fa52e63621
[I 2024-01-04 12:17:10,122] Trial 0 finished with value: 207.83736313631593 and parameters: {'max_depth': 15, 'min_samples_split': 0.11066409463439794, 'min_samples_leaf': 0.2217124074054182, 'max_features': 0.5039649175254981}. Best is trial 0 with value: 207.83736313631593.
[I 2024-01-04 12:17:13,041] Trial 1 finished with value: 246.60828939522204 and parameters: {'max_depth': 5, 'min_samples_split': 0.3839721352877493, 'min_samples_leaf': 0.46364927165163006, 'max_features': 0.2363977299579}. Best is trial 0 with value: 207.83736313631593.
[I 2024-01-04 12:17:16,306] Trial 2 finished with value: 218.5790662142466 and parameters: {'max_depth': 9, 'min_samples_split': 0.9690925635138773, 'min_samples_leaf': 0.29587172243265947, 'max_features': 0.46029178148881533}. Best is trial 0 with value: 207.83736313631593.
[I 2024-01-04 12:17:19,549] Trial 3 finished with value: 241

time: 1min 53s (started: 2024-01-04 12:17:00 -03:00)


In [18]:
study.best_params

{'max_depth': 7,
 'min_samples_split': 0.40868549783968977,
 'min_samples_leaf': 0.10047873536633527,
 'max_features': 0.9772330502742559}

time: 3.08 ms (started: 2024-01-04 12:18:53 -03:00)
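
Because `min_samples_split` and `min_samples_leaf` were sampled as floats, scikit-learn reads them as fractions of the training rows and rounds up with `ceil`. The sketch below converts the best values above into row counts; `n_samples = 10_000` is a hypothetical size chosen only to make the arithmetic readable.

```python
import math

# Best fractions reported by the study above.
best = {
    'min_samples_split': 0.40868549783968977,
    'min_samples_leaf': 0.10047873536633527,
}

n_samples = 10_000  # hypothetical number of training rows

# scikit-learn uses ceil(fraction * n_samples) for float values.
min_split_rows = math.ceil(best['min_samples_split'] * n_samples)
min_leaf_rows = math.ceil(best['min_samples_leaf'] * n_samples)

# Every leaf must hold at least that fraction of the rows, which caps the leaf count.
max_leaves = math.floor(1 / best['min_samples_leaf'])

print(min_split_rows, min_leaf_rows, max_leaves)
```

A lower bound of 0.1 on `min_samples_leaf` therefore limits every trial to at most ten leaves, so the tree stays very small regardless of the `max_depth` range. Searching smaller fractions, or integer counts, may be worth trying.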


In [19]:
study.best_value

183.7324358084525

time: 2.85 ms (started: 2024-01-04 12:18:53 -03:00)
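
This best value is a single RMSE pooled over every series at every level. Since region totals are an order of magnitude larger than item-level series (about 500 versus about 30 in the previews above), the pooled number is likely dominated by the upper levels. The sketch below shows one way to break the error down by level, counting `/` in `unique_id`. The frame, the `y_hat` column and all values are made up.

```python
import numpy as np
import pandas as pd

# Toy merged frame: actuals and forecasts for one series per level (invented values).
p = pd.DataFrame({
    'unique_id': ['RegionA', 'RegionA',
                  'RegionA/StateB', 'RegionA/StateB',
                  'RegionA/StateB/item_c', 'RegionA/StateB/item_c'],
    'y':     [500.0, 510.0, 60.0, 62.0, 10.0, 12.0],
    'y_hat': [450.0, 560.0, 55.0, 67.0, 9.0, 13.0],
})

# The number of '/' separators identifies the hierarchy level.
level_names = {0: 'region', 1: 'region/state', 2: 'region/state/item'}
p['level'] = p['unique_id'].str.count('/').map(level_names)
p['sq_err'] = (p['y'] - p['y_hat']) ** 2

rmse_by_level = np.sqrt(p.groupby('level')['sq_err'].mean())
rmse_pooled = np.sqrt(p['sq_err'].mean())

print(rmse_by_level)
print(rmse_pooled)
```

In this toy case the per-level errors are 50, 5 and 1, while the pooled value is about 29, which says little about the bottom level.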
