<h1 style="background-color:black; color:white; padding:10px;">
    Notebook 06 - First experiments with PyCaret
</h1>

First experiments with PyCaret, run in 3 different environments:

1. Windows 10 machine - 64-bit, 16 GB RAM, 1 Intel Core i5-8365U processor (1.60 GHz, 4 cores, 8 threads).
2. Google Colab ( https://colab.research.google.com/ ).
3. CentOS Linux 7 server - 64-bit, 128 GB RAM, 32 Intel Xeon E5-2620 v4 processors (2.10 GHz, 8 cores, 16 threads).

<hr style="background-color:transparent;height:4px;border:none;border-top:2px solid #c0c0c0;border-bottom:2px solid #c0c0c0;">

### Initial definitions

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- On Google Colab.

In [None]:
!pip install pycaret

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Enable automatic reloading of edited modules.

In [1]:
%load_ext autoreload
%autoreload 2

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- General-purpose packages.

In [2]:
import sys
from datetime import datetime    # used by `salvar_experi` below
# import pycaret    # installed with: pip install pycaret
import numpy as np
import pandas as pd
from pathlib import Path
from pycaret.regression import *
from IPython.display import Markdown

# Google Colab
from google.colab import drive
from pycaret.utils import enable_colab 
enable_colab()

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Path adjustment for modules in directories outside `sys.path`.

In [3]:
# Windows 10 machine
sys.path.append('d:/py_utils')

# Google Colab
nm_dir_gdrive = '/content/gdrive'
drive.mount(nm_dir_gdrive)
p_gdrive = Path(nm_dir_gdrive)
nm_dir_tcc = 'MyDrive/TCC_PUC_MG'
p_tcc = p_gdrive / nm_dir_tcc
sys.path.append(str(p_tcc))

# Linux server
p_user = Path('~').expanduser()
nm_dir_tcc = 'jup_ws/tcc'
p_tcc = p_user / nm_dir_tcc
sys.path.append(str(p_tcc))
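
The cell above contains the path setup for all three environments, so only the lines matching the current machine should actually run. A minimal sketch of how the branch could be selected automatically, reusing the directory names from the cell above. The helpers `detect_env` and `module_dir` are hypothetical, and the Colab check assumes that `google.colab` is already loaded in a Colab kernel. On Colab the Drive still has to be mounted with `drive.mount` before the path is usable.

```python
import platform
import sys
from pathlib import Path


def detect_env():
    """Return 'colab', 'windows' or 'linux' (hypothetical labels)."""
    # Assumption: a Colab kernel has google.colab preloaded in sys.modules
    if "google.colab" in sys.modules:
        return "colab"
    if platform.system() == "Windows":
        return "windows"
    # Anything else is treated as the Linux server
    return "linux"


def module_dir(env):
    """Map an environment label to the directory holding the project modules."""
    if env == "colab":
        return Path("/content/gdrive") / "MyDrive/TCC_PUC_MG"
    if env == "windows":
        return Path("d:/py_utils")
    return Path("~").expanduser() / "jup_ws/tcc"


env = detect_env()
p_mod = module_dir(env)
if str(p_mod) not in sys.path:
    sys.path.append(str(p_mod))
```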

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Modules in directories outside `sys.path`.

In [4]:
import tcc
from pd_utils import (
    d_pd, exemplo_linha, resumo_tipos, resumo_categ, resumo_serie,
    DisplayPandas, ExemploLinha, ResumoTipos, ResumoCateg, ResumoSerie
)

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Display and summary utilities for `pandas` objects.

In [5]:
# index of the first True element of a boolean series
# prim_true = lambda sr: sr[lambda s: s].index[0]

In [6]:
# index of the last True element of a boolean series
# ulti_true = lambda sr: sr[lambda s: s].index[-1]

In [7]:
# first item, 4 intermediate items, last item and dimensions of a pandas object
# h1s4t1 = DisplayPandas(head=1, sample=4, tail=1)

In [8]:
# first item, 1 intermediate item, last item and dimensions of a pandas object
h1s1t1 = DisplayPandas(head=1, sample=1, tail=1)

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Other definitions.

In [9]:
d = display
pdod = pd.options.display
pdoc = pd.option_context
pdod.precision = 2
pdod.max_columns = None

def print_expr(expr, sep='=', inv=False):
    if inv:
        print(f'{eval(expr)} {sep} {expr}')
    else:
        print(f'{expr} {sep} {eval(expr)}')

def print_rslt(expr, sep='-->', inv=True):
    print_expr(expr, sep, inv)

def teste_features(todas, alvo, cate, nume, igno):
    def p_r(e):
        # `nonlocal` pulls the outer arguments into the closure so `eval` can see them
        nonlocal todas, alvo, cate, nume, igno
        print(f'{eval(e)} --> {e}')
    p_r( 'set(cate).isdisjoint( set(nume) )' )
    p_r( 'set(nume).isdisjoint( set(igno) )' )
    p_r( 'set(igno).isdisjoint( set(cate) )' )
    p_r( 'alvo not in ( set(cate) | set(nume) | set(igno) )' )
    p_r( 'set(todas) == ( {alvo} | set(cate) | set(nume) | set(igno) )' )

def d_dct(dct, tit_chave='chave', tit_valor='valor'):
    txt_md = f'{tit_chave}|{tit_valor}\n:--|:--'
    for chave, valor in dct.items():
        txt_md += f'\n{chave}|{valor!r}'
    display(Markdown(txt_md))

def salvar_experi(best_models):
    nomes = []
    data_hora = f'{datetime.now():%Y%m%d_%H%M%S}'
    nome = f'{data_hora}.pkl'
    save_config(nome)
    nomes.append(nome)
    for ind, model in enumerate(best_models, 1):
        nome = f'{data_hora}_{ind}_{model.__class__.__name__}'
        save_model(model, nome)
        nomes.append(nome)
    return nomes

<hr style="background-color:transparent;height:4px;border:none;border-top:2px solid #c0c0c0;border-bottom:2px solid #c0c0c0;">

### Importing `df_vendas_bricks`

- Consolidated dataframe of sales of the parts of interest, with the modifications from notebook 05 (file `df_vendas_bricks_nb5.parquet`).
- Dataframe columns:
<table>
<tr><th style="text-align:left;">Group</th><th>Column</th><th>Description</th></tr>
<tr style="border-top:1px solid black;"><td rowspan=9 style="text-align:left;">Order line variables (o = order):</td><td><code>o_itid</code></td><td>arbitrary ID of the part order line</td></tr>
<tr><td><code>p_no</code></td><td>part model code</td></tr>
<tr><td><code>c_id</code></td><td>part color code/ID</td></tr>
<tr><td><code>n_u</code></td><td>part condition</td></tr>
<tr><td><code>o_qtty</code></td><td>quantity</td></tr>
<tr><td><code>o_unpr</code></td><td>unit price in USD</td></tr>
<tr><td><code>o_sctr</code></td><td>seller's country</td></tr>
<tr><td><code>o_bctr</code></td><td>buyer's country</td></tr>
<tr><td><code>o_dthr</code></td><td>order date and time</td></tr>
<tr style="border-top:1px solid black;"><td rowspan=5 style="text-align:left;">Physical characteristics of the traded part (p = part)</td><td><code>p_wt</code></td><td>weight/mass in grams</td></tr>
<tr><td><code>p_dx</code></td><td>width in studs</td></tr>
<tr><td><code>p_dy</code></td><td>length in studs</td></tr>
<tr><td><code>p_dz</code></td><td>height in bricks</td></tr>
<tr><td><code>p_dv</code></td><td>external volume in cubic studs</td></tr>
<tr style="border-top:1px solid black;"><td rowspan=3 style="text-align:left;">Other attributes of the traded part (p = part)</td><td><code>p_nm</code></td><td>part model name</td></tr>
<tr><td><code>p_pfx</code></td><td>model code prefix</td></tr>
<tr><td><code>p_sfx</code></td><td>model code suffix</td></tr>
<tr style="border-top:1px solid black;"><td rowspan=3 style="text-align:left;">Color attributes of the traded part (c = color)</td><td><code>c_nm</code></td><td>color name</td></tr>
<tr><td><code>c_cd</code></td><td>color code (24-bit RGB, hexadecimal)</td></tr>
<tr><td><code>c_tp</code></td><td>color type/family</td></tr>
<tr style="border-top:1px solid black;"><td rowspan=3 style="text-align:left;">Computed metrics of the order line (o = order):</td><td><code>o_ttpr</code></td><td>total price in USD <code>o_qtty * o_unpr</code></td></tr>
<tr><td><code>o_ttwt</code></td><td>total weight/mass in grams <code>o_qtty * p_wt</code></td></tr>
<tr><td><code>o_ttdv</code></td><td>total external volume in cubic studs <code>o_qtty * p_dv</code></td></tr>
<tr style="border-top:1px solid black;"><td rowspan=2 style="text-align:left;">Computed categories (grp = group):</td><td><code>grp_dim</code></td><td>group of parts with the same external dimensions</td></tr>
<tr><td><code>grp_cor</code></td><td>color group adapted from Studio 2.0</td></tr>
</table>

In [10]:
nms_cols = [
    'p_no', 'c_id', 'n_u', 'o_qtty', 
    'o_unpr',    # target variable
    'o_sctr', 'o_bctr', 
    'o_dthr',    # temporal aspect will be ignored (discard)
    'p_wt', 'p_dx', 'p_dy', 'p_dz', 'p_dv', 
    'p_nm',    # description of `p_no` (discard)
    'p_pfx', 'p_sfx',    # fragments of `p_no` (discard)
    'c_nm',    # `c_nm` is the description of `c_id` (discard)
    'c_cd',    # `c_cd` as hexadecimal RGB text is not useful (discard)
    'c_tp', 
    'o_ttpr',    # metric computed from the target variable (discard)
    'o_ttwt', 'o_ttdv', 'grp_dim', 
    'grp_cor',    # the groups formed do not reflect prices (discard)
]    # new column order
df_vendas_bricks = (
    pd.read_parquet(tcc.pckl_df_vendas_bricks_nb5.with_suffix('.parquet'))
    [nms_cols]
)
h1s1t1(df_vendas_bricks)    # first row, one random row, last row and dimensions

o_itid,p_no,c_id,n_u,o_qtty,o_unpr,o_sctr,o_bctr,o_dthr,p_wt,p_dx,p_dy,p_dz,p_dv,p_nm,p_pfx,p_sfx,c_nm,c_cd,c_tp,o_ttpr,o_ttwt,o_ttdv,grp_dim,grp_cor
40646,3005,80,N,20,0.1,DE,GB,2021-02-01 00:14:32.880,0.44,1.0,1.0,1.0,1.2,Brick 1 x 1,3005,,80,2e5543,solid,1.94,8.8,24.0,1x1x1,12
576403,14716,11,N,2,0.15,BE,BE,2021-04-04 23:39:49.673,1.24,1.0,1.0,3.0,3.6,Brick 1 x 1 x 3,14716,,11,212121,solid,0.3,2.48,7.2,1x1x3,-1
770553,6213,6,U,2,6.06,DE,IL,2021-04-30 07:33:47.810,10.18,2.0,6.0,3.0,43.2,Brick 2 x 6 x 3,6213,,6,00642e,solid,12.12,20.36,86.4,2x6x3,12


(1291936, 24)

<hr style="background-color:transparent;height:4px;border:none;border-top:2px solid #c0c0c0;border-bottom:2px solid #c0c0c0;">

### A - Experiment on the Windows 10 machine

<hr style="background-color:transparent;height:0px;border:none;border-top:2px solid #c0c0c0;">

#### A.1 - Full *dataset*, *train size* of 20%, all variables not considered dispensable

- 7 categorical variables, 8 numeric variables.
- Original *dataset* with dimensions (1291936, 24).
- Processed *train set* with dimensions (258387, 350).
- Several hours of processing; the exact time was not recorded (estimated at more than 12 hours).
- PyCaret default options for preprocessing and training (*cross validation* with 10 *folds*).
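
The row count of the processed train set follows from the *train size* fraction applied to the original row count. A small worked check, assuming the fractional result is truncated; the helper `train_rows` is hypothetical and only reproduces the arithmetic, it is not PyCaret's own splitting code:

```python
import math

N_ROWS = 1291936    # rows in the original dataset
TRAIN_SIZE = 0.2    # train size used in this experiment


def train_rows(n_rows, train_size):
    """Rows expected in the train set, assuming the fraction is truncated."""
    return math.floor(n_rows * train_size)


n_train = train_rows(N_ROWS, TRAIN_SIZE)
n_holdout = N_ROWS - n_train
print(n_train, n_holdout)    # 258387 1033549
```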

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Selection, classification and verification of the variables.

In [12]:
target = 'o_unpr'
categorical_features = ['p_no', 'c_id', 'n_u', 'o_sctr', 'o_bctr', 'c_tp', 'grp_dim', ]
numeric_features = ['o_qtty', 'p_wt', 'p_dx', 'p_dy', 'p_dz', 'p_dv', 'o_ttwt', 'o_ttdv', ]
ignore_features = ['o_dthr', 'p_nm', 'p_pfx', 'p_sfx', 'c_nm', 'c_cd', 'o_ttpr', 'grp_cor']
teste_features(df_vendas_bricks.columns, target, 
               categorical_features, numeric_features, ignore_features)

True --> set(cate).isdisjoint( set(nume) )
True --> set(nume).isdisjoint( set(igno) )
True --> set(igno).isdisjoint( set(cate) )
True --> alvo not in ( set(cate) | set(nume) | set(igno) )
True --> set(todas) == ( {alvo} | set(cate) | set(nume) | set(igno) )
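
The five checks printed above verify that the categorical, numeric and ignored lists, together with the target, form a partition of the dataframe columns. A sketch of the same checks written with plain set operations instead of `eval`; the function `check_partition` and the toy column names are hypothetical and not part of the notebook:

```python
def check_partition(all_cols, target, cat, num, ign):
    """Return a dict of named boolean checks on the feature groups."""
    cat, num, ign = set(cat), set(num), set(ign)
    return {
        "cat/num disjoint": cat.isdisjoint(num),
        "num/ign disjoint": num.isdisjoint(ign),
        "ign/cat disjoint": ign.isdisjoint(cat),
        "target not in groups": target not in (cat | num | ign),
        "groups cover all columns": set(all_cols) == ({target} | cat | num | ign),
    }


# Toy column names (hypothetical), not the notebook's dataframe
checks = check_partition(
    all_cols=["y", "a", "b", "c"], target="y", cat=["a"], num=["b"], ign=["c"]
)
for name, ok in checks.items():
    print(f"{ok} --> {name}")
```

Returning a dict makes it possible to stop early, for example with `assert all(checks.values())`, before calling `setup`.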


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Regression setup with PyCaret.

In [27]:
cfg_regress = setup(
    session_id=0, data=df_vendas_bricks, target=target, ignore_features=ignore_features, 
    categorical_features=categorical_features, numeric_features=numeric_features, 
    train_size = 0.2, silent=True, verbose=True, 
)

Unnamed: 0,Description,Value
0,session_id,0
1,Target,o_unpr
2,Original Data,"(1291936, 24)"
3,Missing Values,False
4,Numeric Features,8
5,Categorical Features,7
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(258387, 350)"


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Comparison of the *cross validation* results of the models trained with PyCaret.

In [28]:
best_models = compare_models(n_select=3)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,0.0519,0.1347,0.3433,0.5446,0.0768,0.4105,460.174
et,Extra Trees Regressor,0.0538,0.1331,0.3436,0.5355,0.0796,0.425,798.467
dt,Decision Tree Regressor,0.0568,0.149,0.365,0.4688,0.0844,0.4449,8.081
gbr,Gradient Boosting Regressor,0.0766,0.171,0.3951,0.3912,0.1014,0.7813,75.676
lightgbm,Light Gradient Boosting Machine,0.0638,0.1791,0.4058,0.3714,0.0904,0.5548,2.042
knn,K Neighbors Regressor,0.0682,0.1821,0.4087,0.3554,0.1014,0.5481,27.093
ridge,Ridge Regression,0.0796,0.1991,0.4337,0.2718,0.1041,0.7315,0.545
lr,Linear Regression,0.0824,0.2092,0.4463,0.2243,0.1068,0.7653,4.054
omp,Orthogonal Matching Pursuit,0.0833,0.2123,0.4496,0.2127,0.1095,0.761,0.568
br,Bayesian Ridge,0.0638,0.1796,0.3713,0.1781,0.0834,0.5802,8.906
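
In the table above the RMSE column is not the square root of the MSE column: for `rf`, the square root of 0.1347 is about 0.367, while the reported RMSE is 0.3433. This is consistent with each column being an average of per-fold values, which is an assumption about how the summary is built and is not stated in this notebook. A sketch with made-up fold values showing why the two quantities differ:

```python
import math
from statistics import mean

# Hypothetical per-fold MSE values (not taken from the experiment)
fold_mse = [0.05, 0.10, 0.30]
fold_rmse = [math.sqrt(m) for m in fold_mse]

mean_mse = mean(fold_mse)
mean_rmse = mean(fold_rmse)           # average of the per-fold RMSE values
rmse_of_mean = math.sqrt(mean_mse)    # square root of the averaged MSE

print(f"mean RMSE = {mean_rmse:.4f}, sqrt(mean MSE) = {rmse_of_mean:.4f}")
```

Because the square root is concave, the average of per-fold RMSE values never exceeds the square root of the average MSE, and the gap grows when the folds differ a lot from each other.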


<hr style="background-color:transparent;height:4px;border:none;border-top:2px solid #c0c0c0;border-bottom:2px solid #c0c0c0;">

### B - Experiments on Google Colab

<hr style="background-color:transparent;height:0px;border:none;border-top:2px solid #c0c0c0;">

#### B.1 - Full *dataset*, *train size* of 20%, without the variables `o_sctr` and `o_bctr`

- 5 categorical variables, 8 numeric variables.
- Original *dataset* with dimensions (1291936, 24).
- Processed *train set* with dimensions (258387, 205).
- PyCaret default options for preprocessing and training (*cross validation* with 10 *folds*).

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Selection, classification and verification of the variables.

In [13]:
target = 'o_unpr'
categorical_features = ['p_no', 'c_id', 'n_u', 'c_tp', 'grp_dim', ]
numeric_features = ['o_qtty', 'p_wt', 'p_dx', 'p_dy', 'p_dz', 'p_dv', 'o_ttwt', 'o_ttdv', ]
ignore_features = [
    'o_sctr', 'o_bctr',
    'o_dthr', 'p_nm', 'p_pfx', 'p_sfx', 'c_nm', 'c_cd', 'o_ttpr', 'grp_cor', 
]
teste_features(df_vendas_bricks.columns, target, 
               categorical_features, numeric_features, ignore_features)

True --> set(cate).isdisjoint( set(nume) )
True --> set(nume).isdisjoint( set(igno) )
True --> set(igno).isdisjoint( set(cate) )
True --> alvo not in ( set(cate) | set(nume) | set(igno) )
True --> set(todas) == ( {alvo} | set(cate) | set(nume) | set(igno) )


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Regression setup with PyCaret.

In [None]:
%%time
cfg_regress = setup(
    session_id=0, data=df_vendas_bricks, target=target, ignore_features=ignore_features, 
    categorical_features=categorical_features, numeric_features=numeric_features, 
    train_size = 0.2, silent=True, verbose=True, 
)

Unnamed: 0,Description,Value
0,session_id,0
1,Target,o_unpr
2,Original Data,"(1291936, 24)"
3,Missing Values,False
4,Numeric Features,8
5,Categorical Features,5
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(258387, 205)"


CPU times: user 31.1 s, sys: 2.45 s, total: 33.5 s
Wall time: 32.9 s


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Comparison of the *cross validation* results of the models trained with PyCaret.

In [None]:
%%time
best_models = compare_models(n_select=3)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,0.0503,0.129,0.335,0.569,0.07,0.411,539.16
et,Extra Trees Regressor,0.0505,0.132,0.34,0.548,0.08,0.415,681.85
dt,Decision Tree Regressor,0.0511,0.133,0.341,0.54,0.08,0.417,7.87
knn,K Neighbors Regressor,0.0598,0.171,0.394,0.403,0.09,0.473,49.36
gbr,Gradient Boosting Regressor,0.0764,0.169,0.394,0.395,0.1,0.779,104.16
lightgbm,Light Gradient Boosting Machine,0.063,0.182,0.409,0.359,0.09,0.549,2.77
ridge,Ridge Regression,0.0782,0.2,0.434,0.271,0.1,0.706,0.49
br,Bayesian Ridge,0.0783,0.208,0.445,0.231,0.1,0.705,9.43
lr,Linear Regression,0.0836,0.214,0.451,0.209,0.11,0.772,2.88
omp,Orthogonal Matching Pursuit,0.0869,0.221,0.459,0.18,0.11,0.804,0.57


CPU times: user 44min 47s, sys: 18.1 s, total: 45min 5s
Wall time: 4h 36min 52s


<hr style="background-color:transparent;height:0px;border:none;border-top:2px solid #c0c0c0;">

#### B.2 - Only Jul 2021 data, *train size* of 20%, only 2 categorical variables (`c_id` and `n_u`)

- 2 categorical variables, 8 numeric variables.
- Reduced *dataset* with dimensions (168832, 24).
- Processed *train set* with dimensions (33766, 92).
- PyCaret default options for preprocessing and training (*cross validation* with 10 *folds*).

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Selection, classification and verification of the variables.

In [14]:
target = 'o_unpr'
categorical_features = ['c_id', 'n_u', ]
numeric_features = ['o_qtty', 'p_wt', 'p_dx', 'p_dy', 'p_dz', 'p_dv', 'o_ttwt', 'o_ttdv', ]
ignore_features = [
    'p_no', 'grp_dim', 'o_sctr', 'o_bctr', 'c_tp', 
    'o_dthr', 'p_nm', 'p_pfx', 'p_sfx', 'c_nm', 'c_cd', 'o_ttpr', 'grp_cor', 
]
teste_features(df_vendas_bricks.columns, target, 
               categorical_features, numeric_features, ignore_features)

True --> set(cate).isdisjoint( set(nume) )
True --> set(nume).isdisjoint( set(igno) )
True --> set(igno).isdisjoint( set(cate) )
True --> alvo not in ( set(cate) | set(nume) | set(igno) )
True --> set(todas) == ( {alvo} | set(cate) | set(nume) | set(igno) )


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Regression setup with PyCaret.

In [None]:
%%time
cfg_regress = setup(
    session_id=0, 
    data=df_vendas_bricks[lambda df: (df.o_dthr.dt.year==2021) & (df.o_dthr.dt.month==7)], 
            # only order lines from Jul 2021
    target=target, ignore_features=ignore_features, 
    categorical_features=categorical_features, numeric_features=numeric_features, 
    train_size = 0.2, silent=True, verbose=True, 
)

Unnamed: 0,Description,Value
0,session_id,0
1,Target,o_unpr
2,Original Data,"(168832, 24)"
3,Missing Values,False
4,Numeric Features,8
5,Categorical Features,2
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(33766, 92)"


CPU times: user 2.83 s, sys: 258 ms, total: 3.08 s
Wall time: 3.04 s


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Comparison of the *cross validation* results of the models trained with PyCaret.

In [None]:
%%time
best_models = compare_models(n_select=3)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
gbr,Gradient Boosting Regressor,0.08,0.35,0.48,0.167,0.11,0.83,4.92
lightgbm,Light Gradient Boosting Machine,0.08,0.35,0.48,0.153,0.11,0.76,0.35
huber,Huber Regressor,0.08,0.37,0.49,0.108,0.12,0.72,3.34
rf,Random Forest Regressor,0.06,0.36,0.48,0.0864,0.1,0.55,32.28
et,Extra Trees Regressor,0.06,0.36,0.48,0.0771,0.1,0.56,38.38
en,Elastic Net,0.11,0.37,0.5,0.0704,0.14,1.42,0.05
knn,K Neighbors Regressor,0.07,0.37,0.49,0.0623,0.12,0.68,0.76
lasso,Lasso Regression,0.12,0.38,0.51,0.0144,0.15,1.56,0.05
dummy,Dummy Regressor,0.12,0.38,0.51,-0.0013,0.15,1.59,0.03
llar,Lasso Least Angle Regression,0.12,0.38,0.51,-0.0013,0.15,1.59,0.05


CPU times: user 26 s, sys: 5.73 s, total: 31.7 s
Wall time: 14min 12s


<hr style="background-color:transparent;height:4px;border:none;border-top:2px solid #c0c0c0;border-bottom:2px solid #c0c0c0;">

### C - Experiments on the Linux server

<hr style="background-color:transparent;height:0px;border:none;border-top:2px solid #c0c0c0;">

#### C.1 - Only Jul 2021 data, *train size* of 80%, without the variables `grp_dim`, `o_sctr`, `o_bctr`

- 4 categorical variables, 8 numeric variables.
- Reduced *dataset* with dimensions (168832, 24).
- Processed *train set* with dimensions (135065, 177).
- PyCaret default options for preprocessing and training (*cross validation* with 10 *folds*).

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Selection, classification and verification of the variables.

In [16]:
target = 'o_unpr'
categorical_features = ['p_no', 'c_id', 'n_u', 'c_tp', ]
numeric_features = ['o_qtty', 'p_wt', 'p_dx', 'p_dy', 'p_dz', 'p_dv', 'o_ttwt', 'o_ttdv', ]
ignore_features = [
    'grp_dim',  'o_sctr', 'o_bctr',
    'o_dthr', 'p_nm', 'p_pfx', 'p_sfx', 'c_nm', 'c_cd', 'o_ttpr', 'grp_cor', 
]
teste_features(df_vendas_bricks.columns, target, 
               categorical_features, numeric_features, ignore_features)

True --> set(cate).isdisjoint( set(nume) )
True --> set(nume).isdisjoint( set(igno) )
True --> set(igno).isdisjoint( set(cate) )
True --> alvo not in ( set(cate) | set(nume) | set(igno) )
True --> set(todas) == ( {alvo} | set(cate) | set(nume) | set(igno) )


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Regression setup with PyCaret.

In [38]:
%%time
cfg_regress = setup(
    session_id=0, 
    data=df_vendas_bricks[lambda df: (df.o_dthr.dt.year==2021) & (df.o_dthr.dt.month==7)], 
            # only order lines from Jul 2021
    target=target, ignore_features=ignore_features, 
    categorical_features=categorical_features, numeric_features=numeric_features, 
    train_size = 0.8, silent=True, verbose=True, 
)

Unnamed: 0,Description,Value
0,session_id,0
1,Target,o_unpr
2,Original Data,"(168832, 24)"
3,Missing Values,False
4,Numeric Features,8
5,Categorical Features,4
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(135065, 177)"


CPU times: user 5.55 s, sys: 4.06 s, total: 9.62 s
Wall time: 4.64 s


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Comparison of the *cross validation* results of the models trained with PyCaret.

In [39]:
%%time
best_models = compare_models(n_select=3)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,0.0524,0.1821,0.4026,0.3065,0.0807,0.4961,17.159
lightgbm,Light Gradient Boosting Machine,0.0615,0.2218,0.4342,0.279,0.0913,0.6162,0.431
et,Extra Trees Regressor,0.0533,0.1871,0.4123,0.2699,0.0815,0.5005,23.403
gbr,Gradient Boosting Regressor,0.0721,0.2008,0.4248,0.2458,0.0984,0.8015,4.254
dt,Decision Tree Regressor,0.0544,0.1914,0.4191,0.2301,0.0848,0.5076,0.433
ridge,Ridge Regression,0.0771,0.247,0.4653,0.1342,0.1042,0.796,0.219
knn,K Neighbors Regressor,0.0625,0.2476,0.4651,0.1182,0.0975,0.5764,1.284
br,Bayesian Ridge,0.0772,0.2513,0.4715,0.0862,0.1043,0.7946,0.664
lr,Linear Regression,0.079,0.2521,0.4727,0.0795,0.1058,0.8211,0.531
huber,Huber Regressor,0.0763,0.2696,0.4863,0.0719,0.1191,0.6736,6.73


CPU times: user 21min 15s, sys: 21.6 s, total: 21min 37s
Wall time: 10min 51s


<hr style="background-color:transparent;height:0px;border:none;border-top:2px solid #c0c0c0;">

#### C.2 - Full *dataset*, *train size* of 50%, 5 categorical variables, 8 numeric variables, WITHOUT other preprocessing options

- 5 categorical variables, 8 numeric variables.
- Original *dataset* with dimensions (1291936, 24).
- Processed *train set* with dimensions (645968, 348).
- PyCaret default options for preprocessing and training, except:
    - *cross validation* with 5 *folds*;
    - reduced set of models, chosen from the best performers in the previous experiments.
- The main parameters of the experiment are in the summary table below.

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Selection, classification and verification of the variables.

In [45]:
target = 'o_unpr'
categorical_features = ['p_no', 'c_id', 'n_u', 'o_sctr', 'o_bctr', ]
numeric_features = ['p_wt', 'p_dx', 'p_dy', 'p_dz', 'p_dv', 'o_qtty', 'o_ttwt', 'o_ttdv', ]
ignore_features = [
    'grp_cor', 'c_tp', 'grp_dim', # cat
    # num
    'o_dthr', 'p_nm', 'p_pfx', 'p_sfx', 'c_nm', 'c_cd', 'o_ttpr', # always
]
teste_features(df_vendas_bricks.columns, target, 
               categorical_features, numeric_features, ignore_features)

True --> set(cate).isdisjoint( set(nume) )
True --> set(nume).isdisjoint( set(igno) )
True --> set(igno).isdisjoint( set(cate) )
True --> alvo not in ( set(cate) | set(nume) | set(igno) )
True --> set(todas) == ( {alvo} | set(cate) | set(nume) | set(igno) )


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Regression setup with PyCaret.

In [46]:
include = ['et','rf','knn','gbr','dt','lightgbm','ridge']
dct_compare_models = dict(n_select=3, fold=5, include=include)

dct_setup_1 = dict(
    session_id=0, data=df_vendas_bricks, silent=True, verbose=True, 
)

dct_setup_2 = dict(
    target=target, categorical_features=categorical_features, 
    numeric_features=numeric_features, ignore_features=ignore_features, 
)

dct_setup_3 = dict(  
    train_size = 0.5, 
)
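
The three dictionaries above are later combined with `{**a, **b, **c}`, where a key repeated in a later dictionary silently replaces the earlier value. A sketch of a stricter merge that raises on repeated keys; the helper `merge_kwargs` is hypothetical and the dictionaries below are toy stand-ins for `dct_setup_1`, `dct_setup_2` and `dct_setup_3`:

```python
def merge_kwargs(*groups):
    """Merge dicts left to right, raising if a key appears in more than one."""
    merged = {}
    for group in groups:
        overlap = merged.keys() & group.keys()
        if overlap:
            raise ValueError(f"duplicate keys: {sorted(overlap)}")
        merged.update(group)
    return merged


# Toy stand-ins for the setup dictionaries (values are illustrative only)
base = dict(session_id=0, silent=True)
features = dict(target="o_unpr")
split = dict(train_size=0.5)

kwargs = merge_kwargs(base, features, split)
print(kwargs)
```

Keeping the parameter groups separate, as the notebook does, also makes it easy to display only the groups that change between experiments.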

In [122]:
%%time
cfg_regress = setup(**{**dct_setup_1, **dct_setup_2, **dct_setup_3})

Unnamed: 0,Description,Value
0,session_id,0
1,Target,o_unpr
2,Original Data,"(1291936, 24)"
3,Missing Values,False
4,Numeric Features,8
5,Categorical Features,5
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(645968, 348)"


CPU times: user 48.5 s, sys: 20.9 s, total: 1min 9s
Wall time: 52 s


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Summary table of the main parameters of the experiment.

In [47]:
d_dct({**dct_setup_2, **dct_setup_3}, tit_chave='parâmetro setup')
d_dct(dct_compare_models, tit_chave='parâmetro compare_models')

parâmetro setup|valor
:--|:--
target|'o_unpr'
categorical_features|['p_no', 'c_id', 'n_u', 'o_sctr', 'o_bctr']
numeric_features|['p_wt', 'p_dx', 'p_dy', 'p_dz', 'p_dv', 'o_qtty', 'o_ttwt', 'o_ttdv']
ignore_features|['grp_cor', 'c_tp', 'grp_dim', 'o_dthr', 'p_nm', 'p_pfx', 'p_sfx', 'c_nm', 'c_cd', 'o_ttpr']
train_size|0.5

parâmetro compare_models|valor
:--|:--
n_select|3
fold|5
include|['et', 'rf', 'knn', 'gbr', 'dt', 'lightgbm', 'ridge']

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Comparison of the *cross validation* results of the models trained with PyCaret.

In [67]:
%%time
best_models = compare_models(**dct_compare_models)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,0.0483,0.0755,0.2706,0.6984,0.0715,0.398,345.574
et,Extra Trees Regressor,0.0503,0.0883,0.294,0.6443,0.0747,0.4088,573.99
dt,Decision Tree Regressor,0.0527,0.1184,0.3406,0.5088,0.0785,0.4231,13.508
gbr,Gradient Boosting Regressor,0.0757,0.128,0.3553,0.4792,0.1004,0.779,86.716
lightgbm,Light Gradient Boosting Machine,0.0608,0.1481,0.3824,0.3971,0.0865,0.5374,3.098
knn,K Neighbors Regressor,0.0633,0.1503,0.3856,0.3847,0.0949,0.5128,67.39
ridge,Ridge Regression,0.078,0.1746,0.4163,0.2841,0.1025,0.7166,1.234


CPU times: user 10h 31min 53s, sys: 2min 25s, total: 10h 34min 19s
Wall time: 1h 54min 20s
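
The `%%time` output above reports much more CPU time than wall time, which suggests that the work was spread over several cores. A sketch that converts the two durations to seconds and computes their ratio; the helper `to_seconds` is hypothetical and only handles the `h`, `min` and `s` units. Reading the ratio as an average number of busy cores is a rough interpretation, since the CPU figure may not include work done in child processes.

```python
import re

_UNITS = {"h": 3600, "min": 60, "s": 1}


def to_seconds(text):
    """Convert a %%time style duration such as '1h 54min 20s' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"([\d.]+)\s*(h|min|s)\b", text):
        total += float(value) * _UNITS[unit]
    return total


cpu_total = to_seconds("10h 34min 19s")    # 'total' CPU time reported above
wall = to_seconds("1h 54min 20s")          # wall time reported above
ratio = cpu_total / wall
print(f"CPU/wall = {ratio:.2f}")
```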


<hr style="background-color:transparent;height:0px;border:none;border-top:2px solid #c0c0c0;">

#### C.3 - Full *dataset*, *train size* of 60%, 5 categorical variables, 8 numeric variables, WITHOUT other preprocessing options

- 5 categorical variables, 8 numeric variables.
- Original *dataset* with dimensions (1291936, 24).
- Processed *train set* with dimensions (775161, 349).
- PyCaret default options for preprocessing and training, except:
    - *cross validation* with 5 *folds*;
    - reduced set of models, chosen from the best performers in the previous experiments.
- The main parameters of the experiment are in the summary table below.

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Selection, classification and verification of the variables.

In [48]:
target = 'o_unpr'
categorical_features = ['p_no', 'c_id', 'n_u', 'o_sctr', 'o_bctr', ]
numeric_features = ['p_wt', 'p_dx', 'p_dy', 'p_dz', 'p_dv', 'o_qtty', 'o_ttwt', 'o_ttdv', ]
ignore_features = [
    'grp_cor', 'c_tp', 'grp_dim', # cat
    # num
    'o_dthr', 'p_nm', 'p_pfx', 'p_sfx', 'c_nm', 'c_cd', 'o_ttpr', # always
]
teste_features(df_vendas_bricks.columns, target, 
               categorical_features, numeric_features, ignore_features)

True --> set(cate).isdisjoint( set(nume) )
True --> set(nume).isdisjoint( set(igno) )
True --> set(igno).isdisjoint( set(cate) )
True --> alvo not in ( set(cate) | set(nume) | set(igno) )
True --> set(todas) == ( {alvo} | set(cate) | set(nume) | set(igno) )


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Regression setup with PyCaret.

In [49]:
include = ['et','rf','gbr','dt','lightgbm','ridge']
dct_compare_models = dict(n_select=3, include=include)

dct_setup_1 = dict(
    session_id=0, data=df_vendas_bricks, silent=True, verbose=True, use_gpu=True, 
)

dct_setup_2 = dict(
    target=target, categorical_features=categorical_features, 
    numeric_features=numeric_features, ignore_features=ignore_features, 
)

dct_setup_3 = dict(  
    train_size = 0.6, fold=5, 
)

In [136]:
%%time
cfg_regress = setup(**{**dct_setup_1, **dct_setup_2, **dct_setup_3})

Unnamed: 0,Description,Value
0,session_id,0
1,Target,o_unpr
2,Original Data,"(1291936, 24)"
3,Missing Values,False
4,Numeric Features,8
5,Categorical Features,5
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(775161, 349)"


CPU times: user 1min 4s, sys: 19.7 s, total: 1min 23s
Wall time: 48.8 s


<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Summary table of the main parameters of the experiment.

In [50]:
d_dct({**dct_setup_2, **dct_setup_3}, tit_chave='parâmetro setup')
d_dct(dct_compare_models, tit_chave='parâmetro compare_models')

parâmetro setup|valor
:--|:--
target|'o_unpr'
categorical_features|['p_no', 'c_id', 'n_u', 'o_sctr', 'o_bctr']
numeric_features|['p_wt', 'p_dx', 'p_dy', 'p_dz', 'p_dv', 'o_qtty', 'o_ttwt', 'o_ttdv']
ignore_features|['grp_cor', 'c_tp', 'grp_dim', 'o_dthr', 'p_nm', 'p_pfx', 'p_sfx', 'c_nm', 'c_cd', 'o_ttpr']
train_size|0.6
fold|5

parâmetro compare_models|valor
:--|:--
n_select|3
include|['et', 'rf', 'gbr', 'dt', 'lightgbm', 'ridge']

<hr style="height:0px;border:none;border-top:1px solid #c0c0c0;">

- Comparison of the *cross validation* results of the models trained with PyCaret.

In [137]:
%%time
best_models = compare_models(**dct_compare_models)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,0.0479,0.0741,0.2705,0.7208,0.0704,0.3966,518.288
et,Extra Trees Regressor,0.0494,0.0742,0.2722,0.7081,0.0732,0.4067,870.352
dt,Decision Tree Regressor,0.0517,0.0891,0.2981,0.6436,0.0771,0.4206,67.286
gbr,Gradient Boosting Regressor,0.0755,0.1264,0.3534,0.5223,0.0999,0.7795,377.664
lightgbm,Light Gradient Boosting Machine,0.0604,0.1731,0.4091,0.3667,0.0856,0.5323,5.432
ridge,Ridge Regression,0.078,0.2001,0.443,0.2547,0.102,0.7143,2.534


CPU times: user 14h 9min 59s, sys: 6min 24s, total: 14h 16min 23s
Wall time: 3h 2min 56s


In [53]:
salvar_experi(best_models)

Transformation Pipeline and Model Successfully Saved
Transformation Pipeline and Model Successfully Saved
Transformation Pipeline and Model Successfully Saved


['20211217_161835.pkl',
 '20211217_161835_1_RandomForestRegressor',
 '20211217_161835_2_ExtraTreesRegressor',
 '20211217_161835_3_DecisionTreeRegressor']

In [54]:
list(Path().glob('20211217_161835*.pkl'))

[PosixPath('20211217_161835.pkl'),
 PosixPath('20211217_161835_1_RandomForestRegressor.pkl'),
 PosixPath('20211217_161835_2_ExtraTreesRegressor.pkl'),
 PosixPath('20211217_161835_3_DecisionTreeRegressor.pkl')]
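
The file names listed above follow the scheme used in `salvar_experi`: a timestamp for the saved configuration, then the same timestamp plus rank and class name for each model. As the directory listing shows, the model files end up with a `.pkl` extension even though the returned names do not include it. A sketch of the naming logic alone, without PyCaret; the function `experiment_names` and the stand-in classes are hypothetical:

```python
from datetime import datetime


def experiment_names(models, when):
    """Build the config file name and model file stems for one experiment."""
    stamp = f"{when:%Y%m%d_%H%M%S}"
    names = [f"{stamp}.pkl"]
    for idx, model in enumerate(models, 1):
        names.append(f"{stamp}_{idx}_{type(model).__name__}")
    return names


# Stand-in classes so the sketch runs without scikit-learn
class RandomForestRegressor:
    pass


class ExtraTreesRegressor:
    pass


names = experiment_names(
    [RandomForestRegressor(), ExtraTreesRegressor()],
    when=datetime(2021, 12, 17, 16, 18, 35),
)
print(names)
```

Sharing one timestamp across the configuration and its models keeps the files of one experiment together when listed or matched with a glob pattern.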