In [1]:
import numpy as np
import pandas as pd

# Loading the ABT

In [2]:
df_abt = pd.read_csv('C:\\Users\\HP\\Documents\\GitHub\\Case ML\\propensao_revenda_abt.csv')
df_abt.head()

Unnamed: 0,data_ref_safra,seller_id,uf,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia,nao_revendeu_next_6m
0,2018-01-01,0015a82c2db000af6aaaf3ae2ecb0532,SP,3,3,1,2685.0,74,1
1,2018-01-01,001cca7ae9ae17fb1caed9dfb1094831,ES,171,207,9,21275.23,2,0
2,2018-01-01,002100f778ceb8431b7a1020ff7ab48f,SP,38,42,15,781.8,2,0
3,2018-01-01,003554e2dce176b5555353e4f3555ac8,GO,1,1,1,120.0,16,1
4,2018-01-01,004c9cd9d87a3c30c522c48c4fc07416,SP,130,141,75,16228.88,8,0


In [3]:
df_abt.shape

(11627, 9)

In [4]:
df_abt['data_ref_safra'].value_counts()

2018-06-01    2213
2018-05-01    2104
2018-04-01    1941
2018-03-01    1874
2018-02-01    1805
2018-01-01    1690
Name: data_ref_safra, dtype: int64

# Identifying the Modeling Variables

In [5]:
key_vars = ['data_ref_safra', 'seller_id']
num_vars = ['tot_orders_12m', 'tot_items_12m', 'tot_items_dist_12m', 'receita_12m', 'recencia']
cat_vars = ['uf']
target = 'nao_revendeu_next_6m'

features = cat_vars + num_vars

# select the feature columns
X = df_abt[features]
# select the target
y = df_abt[target]

In [6]:
X.head()

Unnamed: 0,uf,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia
0,SP,3,3,1,2685.0,74
1,ES,171,207,9,21275.23,2
2,SP,38,42,15,781.8,2
3,GO,1,1,1,120.0,16
4,SP,130,141,75,16228.88,8


In [7]:
y

0        1
1        0
2        0
3        1
4        0
        ..
11622    0
11623    0
11624    0
11625    1
11626    0
Name: nao_revendeu_next_6m, Length: 11627, dtype: int64

# Train-Test Split

Randomly splitting the dataset into two groups: one to train the **model** and another to **evaluate** its performance.

Here we use the `train_test_split()` function from the `model_selection` submodule of `sklearn`.

* Parameters

 - `train_size`: fraction of the dataset used to train the model.
 - `stratify`: performs stratified random sampling, keeping the same distribution of the target variable in both the training and test sets.
 - `random_state`: controls how the data is shuffled before the split. Passing the same integer to this parameter across calls to `train_test_split` makes the sampling reproducible.

In [8]:
from sklearn.model_selection import train_test_split

# Parameters of train_test_split():
# train_size := fraction of the data kept for the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y, random_state=42)

In [9]:
X_train.shape, X_test.shape

((9301, 6), (2326, 6))

In [10]:
y_train.shape, y_test.shape

((9301,), (2326,))
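
To see what `stratify` does, it can help to compare the positive-class rate in each split. The sketch below uses a small synthetic dataset (the names `X_toy`, `y_toy` and the 30% positive rate are assumptions for illustration, not the notebook's data).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, 30% positive class.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"x": rng.normal(size=1000)})
y_toy = pd.Series([1] * 300 + [0] * 700, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, stratify=y_toy, random_state=42
)

# Share of the positive class in each split; with stratify they should match closely.
rate_train = y_tr.mean()
rate_test = y_te.mean()
print(rate_train, rate_test)
```

The same check on the real data would be `y_train.mean()` versus `y_test.mean()`.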

# Feature Engineering

- Missing value imputation
- Feature scaling: needed for models that are sensitive to feature magnitude (logistic regression, SVM, neural networks). Exception: tree-based models, which do not require it.
- Encoding/treatment of categorical variables

## Missing Value Imputation

In [11]:
X_train.isnull().sum()

uf                    0
tot_orders_12m        0
tot_items_12m         0
tot_items_dist_12m    0
receita_12m           0
recencia              0
dtype: int64

In [12]:
X_train.describe()

Unnamed: 0,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia
count,9301.0,9301.0,9301.0,9301.0,9301.0
mean,29.304698,32.925062,10.692076,3870.99396,74.494786
std,83.755751,93.223141,21.703739,11139.341438,95.127856
min,1.0,1.0,1.0,5.95,0.0
25%,2.0,2.0,2.0,228.4,6.0
50%,7.0,8.0,4.0,878.9,27.0
75%,23.0,26.0,10.0,3158.0,111.0
max,1421.0,1479.0,335.0,192353.24,364.0


Our training set has no missing values. Even so, it is worth defining an imputation strategy, because missing data may show up in production.

Strategy for imputing missing values:

* Categorical variables: replace the missing value with the word `missing`.
* Numeric variables: replace the missing value with the mean of the variable.

In [13]:
X_train[num_vars].mean()

tot_orders_12m          29.304698
tot_items_12m           32.925062
tot_items_dist_12m      10.692076
receita_12m           3870.993960
recencia                74.494786
dtype: float64

In [14]:
X_train[cat_vars].fillna('missing', inplace=True)
X_test[cat_vars].fillna('missing', inplace=True)

for num_var in num_vars:
    media = X_train[num_var].mean()
    X_train[num_var].fillna(media, inplace=True)
    X_test[num_var].fillna(media, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(


ATTENTION!

What does the message above, `A value is trying to be set on a copy of a slice from a DataFrame`, mean?

Note that it is not an error but a **warning** (`SettingWithCopyWarning`).

pandas shows a `SettingWithCopyWarning` because assigning to a copy of a slice of a DataFrame is often unintentional, that is, a mistake. pandas cannot guarantee that the operation behaves as intended, so it warns even when the operation works correctly. As the documentation itself says, there are many false positives: situations where the warning is shown unnecessarily.

One caveat about the cell above: `X_train[cat_vars]` selects columns with a list, which returns a new DataFrame, so calling `fillna(..., inplace=True)` on it fills that temporary copy and leaves `X_train` unchanged. It makes no difference here only because the data has no missing values.

Ways to deal with the situation:

* We can simply ignore the warning in our case
* We can set the `pd.options.mode.chained_assignment` option to one of the following values:

    - `warn`, the default, means the `SettingWithCopyWarning` message is shown on screen.
    - `raise` means pandas raises an error instead of just showing a message, which will break your code.
    - `None` means pandas never shows the message again. You take full responsibility for the operation you are performing.

In [15]:
# with the option below, pandas raises an error instead of just a warning.
pd.options.mode.chained_assignment='raise'

X_train[cat_vars] = X_train[cat_vars].fillna('missing')
X_test[cat_vars]  = X_test[cat_vars].fillna('missing')

SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [16]:
# with the option below, pandas suppresses the message entirely.
pd.options.mode.chained_assignment=None

X_train[cat_vars] = X_train[cat_vars].fillna('missing')
X_test[cat_vars]  = X_test[cat_vars].fillna('missing')

CONCLUSION:

I recommend leaving the default untouched. Just remember that pandas is issuing a warning, not an error, and that there are many false positives, that is, situations where pandas shows `SettingWithCopyWarning` unnecessarily.

In [17]:
pd.options.mode.chained_assignment='warn'

X_train[cat_vars] = X_train[cat_vars].fillna('missing')
X_test[cat_vars]  = X_test[cat_vars].fillna('missing')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
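
Another way to handle the warning, not used in this notebook, is to work on an explicit `.copy()` and assign the result of `fillna` back instead of relying on `inplace=True`. The sketch below is a minimal illustration on toy data; `df_toy`, `train_toy`, `test_toy`, `cat_cols` and `num_cols` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column type.
df_toy = pd.DataFrame({
    "uf": ["SP", None, "MG", "SP"],
    "receita_12m": [100.0, 200.0, np.nan, 300.0],
})

# Take explicit copies so later assignments target independent objects.
train_toy = df_toy.iloc[:3].copy()
test_toy = df_toy.iloc[3:].copy()

cat_cols = ["uf"]
num_cols = ["receita_12m"]

# Assign the result back instead of calling fillna(inplace=True) on a selection.
train_toy[cat_cols] = train_toy[cat_cols].fillna("missing")
test_toy[cat_cols] = test_toy[cat_cols].fillna("missing")

# Means are learned on the training part only, then reused for the test part.
train_means = train_toy[num_cols].mean()
train_toy[num_cols] = train_toy[num_cols].fillna(train_means)
test_toy[num_cols] = test_toy[num_cols].fillna(train_means)

print(train_toy)
```

Because the frames are independent copies, the assignments are unambiguous and pandas has nothing to warn about.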


## Feature Scaling

In [18]:
X_train[num_vars].describe()

Unnamed: 0,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia
count,9301.0,9301.0,9301.0,9301.0,9301.0
mean,29.304698,32.925062,10.692076,3870.99396,74.494786
std,83.755751,93.223141,21.703739,11139.341438,95.127856
min,1.0,1.0,1.0,5.95,0.0
25%,2.0,2.0,2.0,228.4,6.0
50%,7.0,8.0,4.0,878.9,27.0
75%,23.0,26.0,10.0,3158.0,111.0
max,1421.0,1479.0,335.0,192353.24,364.0


In [19]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train[num_vars])

X_train_num_scalonado = pd.DataFrame(scaler.transform(X_train[num_vars]), columns=num_vars)
X_train_num_scalonado.describe()

Unnamed: 0,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia
count,9301.0,9301.0,9301.0,9301.0,9301.0
mean,-1.3750960000000001e-17,4.2780770000000005e-17,1.833462e-17,1.604279e-17,5.977849000000001e-17
std,1.000054,1.000054,1.000054,1.000054,1.000054
min,-0.3379615,-0.342477,-0.4465865,-0.346991,-0.7831438
25%,-0.3260214,-0.3317494,-0.400509,-0.3270202,-0.7200674
50%,-0.2663208,-0.2673843,-0.3083541,-0.2686204,-0.4993
75%,-0.07527886,-0.07428879,-0.03188913,-0.06401027,0.3837696
max,16.61701,15.51281,14.94329,16.92132,3.043491


In [20]:
X_test_num_scalonado = pd.DataFrame(scaler.transform(X_test[num_vars]), columns=num_vars)
X_test_num_scalonado.describe()

Unnamed: 0,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia
count,2326.0,2326.0,2326.0,2326.0,2326.0
mean,0.001798,0.008501,0.000955,0.025825,0.019159
std,1.087326,1.166866,1.026328,1.164259,0.997846
min,-0.337962,-0.342477,-0.446587,-0.346808,-0.783144
25%,-0.326021,-0.331749,-0.400509,-0.326886,-0.709555
50%,-0.278261,-0.278112,-0.308354,-0.269868,-0.446736
75%,-0.099159,-0.098426,-0.031889,-0.070627,0.446846
max,16.246866,18.505788,12.224723,16.921322,3.043491


In [21]:
X_test_num_scalonado.head()

Unnamed: 0,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia
0,0.008302,-0.009924,-0.077967,-0.013035,-0.783144
1,-0.278261,-0.224474,-0.216199,-0.27143,-0.141867
2,-0.075279,-0.095744,-0.170122,-0.131379,-0.593915
3,-0.254381,-0.267384,-0.354432,-0.152717,1.266839
4,-0.242441,-0.256657,-0.170122,-0.24254,0.657101
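
`StandardScaler` applies `(x - mean) / std` using the mean and the population standard deviation (`ddof=0`) learned from the training set. That is why `describe()`, which uses the sample standard deviation (`ddof=1`), reports 1.000054 rather than exactly 1 for the scaled training data: it equals sqrt(9301/9300). A minimal sketch on toy values (the names `train_toy`, `test_toy` and `scaler_toy` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy columns standing in for one numeric feature.
train_toy = pd.DataFrame({"recencia": [0.0, 10.0, 20.0, 30.0]}, index=[7, 3, 9, 1])
test_toy = pd.DataFrame({"recencia": [15.0, 45.0]}, index=[4, 2])

# Fit on the training part only; the test part reuses the same statistics.
scaler_toy = StandardScaler().fit(train_toy)

# Passing index= keeps the original row labels on the scaled frames.
train_scaled = pd.DataFrame(
    scaler_toy.transform(train_toy), columns=train_toy.columns, index=train_toy.index
)
test_scaled = pd.DataFrame(
    scaler_toy.transform(test_toy), columns=test_toy.columns, index=test_toy.index
)

# Manual z-score with the training mean and population std (ddof=0).
mu = train_toy["recencia"].mean()
sigma = train_toy["recencia"].std(ddof=0)
manual = (test_toy["recencia"] - mu) / sigma

print(np.allclose(manual, test_scaled["recencia"]))
```

Passing `index=` when building the scaled frame is one way to keep row labels aligned with the original data from the start.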


## Handling Categorical Variables

In [22]:
X_train.head()

Unnamed: 0,uf,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia
6588,MG,4,5,2,338.6,34
1424,MG,3,3,3,268.7,172
10016,RS,3,5,3,2349.0,14
8299,BA,32,34,12,7788.66,84
4315,SP,22,44,13,2599.96,6


In [23]:
pd.get_dummies(X_train['uf'], columns=['uf'], prefix='uf').head()

Unnamed: 0,uf_AM,uf_BA,uf_CE,uf_DF,uf_ES,uf_GO,uf_MA,uf_MG,uf_MS,uf_MT,...,uf_PE,uf_PI,uf_PR,uf_RJ,uf_RN,uf_RO,uf_RS,uf_SC,uf_SE,uf_SP
6588,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1424,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
10016,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
8299,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4315,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [24]:
pd.get_dummies(X_test['uf'], columns=['uf'], prefix='uf').head()

Unnamed: 0,uf_AM,uf_BA,uf_CE,uf_DF,uf_ES,uf_GO,uf_MA,uf_MG,uf_MS,uf_MT,uf_PB,uf_PE,uf_PR,uf_RJ,uf_RN,uf_RO,uf_RS,uf_SC,uf_SE,uf_SP
3360,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
9697,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
5491,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
9980,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
7675,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


Some values of the `uf` variable in the training set do not appear in the test set, which is why `get_dummies` returns frames with different columns. That makes it impossible to apply the model to the test set. To solve the problem, we will use the `OneHotEncoder` class from the `feature-engine` package.

In [25]:
# installing the feature-engine package
!pip install feature-engine==1.0.2

Collecting feature-engine==1.0.2
  Downloading feature_engine-1.0.2-py2.py3-none-any.whl (152 kB)
Installing collected packages: feature-engine
Successfully installed feature-engine-1.0.2


In [26]:
from feature_engine.encoding import OneHotEncoder

ohe = OneHotEncoder(variables=cat_vars)

In [27]:
X_train_cat_ohe = ohe.fit_transform(X_train[cat_vars])
X_test_cat_ohe  = ohe.transform(X_test[cat_vars])

In [28]:
X_train_cat_ohe

Unnamed: 0,uf_MG,uf_RS,uf_BA,uf_SP,uf_SC,uf_PR,uf_SE,uf_RN,uf_RJ,uf_CE,...,uf_ES,uf_MA,uf_DF,uf_GO,uf_MS,uf_RO,uf_PA,uf_PB,uf_PI,uf_AM
6588,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1424,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10016,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8299,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4315,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2777,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3079,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3025,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7656,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
X_test_cat_ohe

Unnamed: 0,uf_MG,uf_RS,uf_BA,uf_SP,uf_SC,uf_PR,uf_SE,uf_RN,uf_RJ,uf_CE,...,uf_ES,uf_MA,uf_DF,uf_GO,uf_MS,uf_RO,uf_PA,uf_PB,uf_PI,uf_AM
3360,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9697,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5491,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9980,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7675,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3348,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6272,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
6583,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7700,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
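
As a pandas-only alternative to a fitted encoder, the `get_dummies` mismatch can also be worked around by reindexing the test dummies onto the training columns. This is a sketch under assumed toy data (`uf_train`, `uf_test` are hypothetical), not what the notebook does:

```python
import pandas as pd

# Hypothetical toy splits: "PI" appears only in the training part.
uf_train = pd.Series(["SP", "MG", "PI", "SP"], name="uf")
uf_test = pd.Series(["MG", "SP"], name="uf")

dummies_train = pd.get_dummies(uf_train, prefix="uf", dtype=int)
dummies_test = pd.get_dummies(uf_test, prefix="uf", dtype=int)

# Force the test frame onto the training columns: categories absent from test
# become all-zero columns, and categories seen only in test are dropped.
dummies_test = dummies_test.reindex(columns=dummies_train.columns, fill_value=0)

print(list(dummies_test.columns))
```

A fitted encoder is usually the more convenient choice because it stores the training categories itself and can live inside a pipeline.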


## Rebuilding the Training and Test Sets

In [30]:
# realigning the indexes of the tables
X_train_num_scalonado = X_train_num_scalonado.set_index(X_train.index)
X_test_num_scalonado  = X_test_num_scalonado.set_index(X_test.index)

X_train_cat_ohe = X_train_cat_ohe.set_index(X_train.index)
X_test_cat_ohe = X_test_cat_ohe.set_index(X_test.index)
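
The realignment matters because `pd.concat(..., axis='columns')` matches rows by index label, not by position. The toy sketch below (the frames `num_part` and `cat_part` are hypothetical) shows what happens when one piece still carries a default `0, 1, 2, ...` index:

```python
import pandas as pd

# Hypothetical toy pieces: same rows, but one frame lost its original index.
cat_part = pd.DataFrame({"uf_SP": [1, 0, 1]}, index=[6588, 1424, 10016])
num_part = pd.DataFrame({"recencia": [-0.4, 1.0, -0.6]})  # default index 0, 1, 2

# Without realignment, concat matches on labels and pads the gaps with NaN.
misaligned = pd.concat([num_part, cat_part], axis="columns")

# After copying the index over, rows line up one to one.
aligned = pd.concat([num_part.set_index(cat_part.index), cat_part], axis="columns")

print(misaligned.shape, aligned.shape)
```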

In [31]:
X_train_transformado = pd.concat([X_train_num_scalonado, X_train_cat_ohe], axis='columns')
X_train_transformado

Unnamed: 0,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia,uf_MG,uf_RS,uf_BA,uf_SP,uf_SC,...,uf_ES,uf_MA,uf_DF,uf_GO,uf_MS,uf_RO,uf_PA,uf_PB,uf_PI,uf_AM
6588,-0.302141,-0.299567,-0.400509,-0.317127,-0.425711,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1424,-0.314081,-0.321022,-0.354432,-0.323402,1.025046,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10016,-0.314081,-0.299567,-0.354432,-0.136640,-0.635966,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8299,0.032182,0.011531,0.060266,0.351715,0.099926,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4315,-0.087219,0.118807,0.106343,-0.114109,-0.720067,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2777,-0.302141,-0.310294,-0.308354,-0.325799,-0.110329,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3079,-0.326021,-0.331749,-0.400509,-0.230817,-0.194431,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3025,-0.314081,-0.321022,-0.400509,-0.169859,-0.688529,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
7656,-0.087219,-0.085016,0.382808,-0.261151,-0.751606,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
X_test_transformado = pd.concat([X_test_num_scalonado, X_test_cat_ohe], axis='columns')
X_test_transformado

Unnamed: 0,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia,uf_MG,uf_RS,uf_BA,uf_SP,uf_SC,...,uf_ES,uf_MA,uf_DF,uf_GO,uf_MS,uf_RO,uf_PA,uf_PB,uf_PI,uf_AM
3360,0.008302,-0.009924,-0.077967,-0.013035,-0.783144,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9697,-0.278261,-0.224474,-0.216199,-0.271430,-0.141867,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5491,-0.075279,-0.095744,-0.170122,-0.131379,-0.593915,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9980,-0.254381,-0.267384,-0.354432,-0.152717,1.266839,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7675,-0.242441,-0.256657,-0.170122,-0.242540,0.657101,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3348,-0.146920,-0.149381,0.290653,0.154834,-0.656991,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
6272,-0.075279,-0.085016,0.428886,0.014598,-0.604427,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
6583,-0.314081,-0.321022,-0.354432,-0.260711,0.415308,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7700,-0.337962,-0.342477,-0.446587,-0.343126,0.362744,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


# Training a Logistic Regression

In [33]:
from sklearn.linear_model import LogisticRegression

# instantiate a logistic regression model
lr_model = LogisticRegression(random_state=42)

In [34]:
# training the model on the transformed training set
lr_model.fit(X_train_transformado, y_train)

LogisticRegression(random_state=42)

In [35]:
lr_model.predict(X_train_transformado)

array([0, 1, 0, ..., 0, 0, 0], dtype=int64)

In [36]:
from sklearn.metrics import accuracy_score

# evaluating the model on the transformed training set
acc_train = accuracy_score(y_train, lr_model.predict(X_train_transformado))
acc_train

0.8110955811203097

In [37]:
# evaluating the model on the transformed test set
acc_test = accuracy_score(y_test, lr_model.predict(X_test_transformado))
acc_test

0.8172828890799656

The code we wrote above is not easy to read or maintain, especially the missing-value imputation and feature engineering parts, which are scattered throughout the code. It would be better to keep all of these steps together in a single object, since they are applied in sequence. That is why we will use the `Pipeline` concept from `sklearn` together with the `feature-engine` library.

# Building Models with Pipelines

Here we use the `feature-engine` package to build feature engineering pipelines that make it easier to train and apply the models, while also improving the readability and maintainability of the code. Note that the pipelines below impute numeric variables with the median rather than the mean used above; since the data has no missing values, the results are unaffected.

## Logistic Regression

In [38]:
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.imputation import CategoricalImputer
from feature_engine.encoding import OneHotEncoder
from feature_engine.wrappers import SklearnTransformerWrapper
from sklearn.preprocessing import StandardScaler

In [39]:
data_pipe = Pipeline(steps=[
                ('numeric_imputer', MeanMedianImputer(variables=num_vars, imputation_method='median')),
                ('numeric_scaler', SklearnTransformerWrapper(variables=num_vars, transformer=StandardScaler())),
                ('categoric_imputer', CategoricalImputer(variables=cat_vars, fill_value='missing')),
                ('one_hot_encoder', OneHotEncoder(variables=cat_vars))
])

In [40]:
X_train.head()

Unnamed: 0,uf,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia
6588,MG,4,5,2,338.6,34
1424,MG,3,3,3,268.7,172
10016,RS,3,5,3,2349.0,14
8299,BA,32,34,12,7788.66,84
4315,SP,22,44,13,2599.96,6


In [41]:
data_pipe.fit_transform(X_train).head()

Unnamed: 0,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia,uf_MG,uf_RS,uf_BA,uf_SP,uf_SC,...,uf_ES,uf_MA,uf_DF,uf_GO,uf_MS,uf_RO,uf_PA,uf_PB,uf_PI,uf_AM
6588,-0.302141,-0.299567,-0.400509,-0.317127,-0.425711,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1424,-0.314081,-0.321022,-0.354432,-0.323402,1.025046,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10016,-0.314081,-0.299567,-0.354432,-0.13664,-0.635966,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8299,0.032182,0.011531,0.060266,0.351715,0.099926,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4315,-0.087219,0.118807,0.106343,-0.114109,-0.720067,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [42]:
data_pipe.transform(X_test).head()

Unnamed: 0,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia,uf_MG,uf_RS,uf_BA,uf_SP,uf_SC,...,uf_ES,uf_MA,uf_DF,uf_GO,uf_MS,uf_RO,uf_PA,uf_PB,uf_PI,uf_AM
3360,0.008302,-0.009924,-0.077967,-0.013035,-0.783144,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9697,-0.278261,-0.224474,-0.216199,-0.27143,-0.141867,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5491,-0.075279,-0.095744,-0.170122,-0.131379,-0.593915,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9980,-0.254381,-0.267384,-0.354432,-0.152717,1.266839,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7675,-0.242441,-0.256657,-0.170122,-0.24254,0.657101,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


Better still is to build a pipeline with the data preparation steps and the algorithm at the end!

In [43]:
lr_model_pipe = Pipeline(steps=[
                ('numeric_imputer', MeanMedianImputer(variables=num_vars, imputation_method='median')),
                ('numeric_scaler', SklearnTransformerWrapper(variables=num_vars, transformer=StandardScaler())),
                ('categoric_imputer', CategoricalImputer(variables=cat_vars, fill_value='missing')),
                ('one_hot_encoder', OneHotEncoder(variables=cat_vars)),
                ('algoritmo', LogisticRegression(random_state=42))
])

In [44]:
X_train.head(3)

Unnamed: 0,uf,tot_orders_12m,tot_items_12m,tot_items_dist_12m,receita_12m,recencia
6588,MG,4,5,2,338.6,34
1424,MG,3,3,3,268.7,172
10016,RS,3,5,3,2349.0,14


In [45]:
lr_model_pipe.fit(X_train, y_train)

Pipeline(steps=[('numeric_imputer',
                 MeanMedianImputer(variables=['tot_orders_12m', 'tot_items_12m',
                                              'tot_items_dist_12m',
                                              'receita_12m', 'recencia'])),
                ('numeric_scaler',
                 SklearnTransformerWrapper(transformer=StandardScaler(),
                                           variables=['tot_orders_12m',
                                                      'tot_items_12m',
                                                      'tot_items_dist_12m',
                                                      'receita_12m',
                                                      'recencia'])),
                ('categoric_imputer',
                 CategoricalImputer(fill_value='missing', variables=['uf'])),
                ('one_hot_encoder', OneHotEncoder(variables=['uf'])),
                ('algoritmo', LogisticRegression(random_state=42))])

In [46]:
# Evaluating the model on the training and test sets

y_pred_train = lr_model_pipe.predict(X_train)
y_pred_test  = lr_model_pipe.predict(X_test)

acc_train = accuracy_score(y_train, y_pred_train)
acc_test  = accuracy_score(y_test, y_pred_test)

print(f"Train accuracy: {acc_train}")
print(f"Test accuracy: {acc_test}")

Train accuracy: 0.8110955811203097
Test accuracy: 0.8172828890799656
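
For comparison, a similar pipeline can be sketched with scikit-learn alone, using `ColumnTransformer` to route numeric and categorical columns to their own steps. This is an alternative under assumed toy data, not the notebook's `feature-engine` approach; the frame `toy`, the target rule and the names `numeric_steps`, `categorical_steps`, `preprocess` and `sk_pipe` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names mirror two of the notebook's features.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "uf": rng.choice(["SP", "MG", "RJ"], size=n),
    "recencia": rng.integers(0, 365, size=n).astype(float),
})
toy.loc[::25, "recencia"] = np.nan  # inject a few missing values
toy_target = (toy["recencia"].fillna(0) > 180).astype(int)  # made-up target rule

numeric_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own preprocessing steps.
preprocess = ColumnTransformer(transformers=[
    ("num", numeric_steps, ["recencia"]),
    ("cat", categorical_steps, ["uf"]),
])

sk_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("algoritmo", LogisticRegression(random_state=42)),
])
sk_pipe.fit(toy, toy_target)

# A category unseen in training and a missing number are both handled at predict time.
new_rows = pd.DataFrame({"uf": ["AC", "SP"], "recencia": [np.nan, 300.0]})
preds = sk_pipe.predict(new_rows)
print(preds)
```

With `handle_unknown="ignore"`, a category never seen during `fit` is encoded as all zeros instead of raising an error.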


## Decision Tree

In [47]:
from sklearn.tree import DecisionTreeClassifier

tree_model_pipe = Pipeline(steps=[
                ('numeric_imputer', MeanMedianImputer(variables=num_vars, imputation_method='median')),
                ('categoric_imputer', CategoricalImputer(variables=cat_vars, fill_value='missing')),
                ('one_hot_encoder', OneHotEncoder(variables=cat_vars)),
                ('algoritmo', DecisionTreeClassifier(random_state=42))
])

In [48]:
tree_model_pipe.fit(X_train, y_train)

# Evaluating the model on the training and test sets
y_pred_train = tree_model_pipe.predict(X_train)
y_pred_test  = tree_model_pipe.predict(X_test)

acc_train = accuracy_score(y_train, y_pred_train)
acc_test  = accuracy_score(y_test, y_pred_test)

print(f"Train accuracy: {acc_train}")
print(f"Test accuracy: {acc_test}")

Train accuracy: 0.9998924846790668
Test accuracy: 0.7549441100601891


Although the logistic regression scored lower on the training set, it scores higher on the test set and is also more robust, meaning its performance stays stable between train and test. The decision tree, by contrast, gets almost everything right on the training set (accuracy 0.9999) and makes many more mistakes on the test set (0.7549). We say the decision tree overfit the training data. It is like a student who only memorized the solutions to the practice problems the teacher handed out before the exam. When the exam brings new questions, the student does not do as well.
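
One common way to reduce this kind of overfitting is to limit how deep the tree can grow. The sketch below compares an unrestricted tree with a depth-limited one on a synthetic noisy problem. The dataset from `make_classification` and the cap `max_depth=3` are assumptions chosen for illustration, not values tuned for this case.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy toy problem (flip_y randomizes a share of the labels).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, stratify=y_toy, random_state=42
)

gaps = {}
for depth in (None, 3):  # None = grow until leaves are pure; 3 = assumed cap
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    acc_tr = accuracy_score(y_tr, tree.predict(X_tr))
    acc_te = accuracy_score(y_te, tree.predict(X_te))
    gaps[depth] = acc_tr - acc_te  # train/test gap as a rough overfitting signal
    print(f"max_depth={depth}: train={acc_tr:.3f}, test={acc_te:.3f}")
```

In the notebook's pipeline, the equivalent change would be passing `max_depth` to the `DecisionTreeClassifier` in the `'algoritmo'` step. A suitable value would need to be chosen by validation.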