# Lesson 5 - advanced pipelines and other tools

In today's lesson, we will explore the following topics in Python:

- 1) Filling NaNs with sklearn
- 2) Using categorical data with sklearn
- 3) More complete pipelines

____
____
____

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from ml_utils import *

For today's lesson, we will once again use the credit risk dataset:

In [3]:
df = pd.read_csv("../datasets/german_credit_data.csv", index_col=0)

df

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,67,male,2,own,,little,1169,6,radio/TV,good
1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,49,male,1,own,little,,2096,12,education,good
3,45,male,2,free,little,little,7882,42,furniture/equipment,good
4,53,male,2,free,little,little,4870,24,car,bad
...,...,...,...,...,...,...,...,...,...,...
995,31,female,1,own,little,,1736,12,furniture/equipment,good
996,40,male,3,own,little,little,3857,30,car,good
997,38,male,2,own,little,,804,12,radio/TV,good
998,23,male,2,free,little,little,1845,45,radio/TV,bad


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               1000 non-null   int64 
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   int64 
 3   Housing           1000 non-null   object
 4   Saving accounts   817 non-null    object
 5   Checking account  606 non-null    object
 6   Credit amount     1000 non-null   int64 
 7   Duration          1000 non-null   int64 
 8   Purpose           1000 non-null   object
 9   Risk              1000 non-null   object
dtypes: int64(4), object(6)
memory usage: 85.9+ KB


In [5]:
df.isnull().sum()

Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
Risk                  0
dtype: int64

In every model we have built so far, we always made sure that the data received by the estimators had neither of the following two characteristics:

- Missing data (NaN);
- Non-numeric data (str).

By now we understand well why: estimators depend on learning algorithms that, one way or another, perform **mathematical calculations** during the learning process. So it is natural that the data given to the estimator should all be numeric, with no "holes"!
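To make the point concrete, here is a minimal sketch showing that a typical sklearn estimator refuses both kinds of input. The toy arrays and the helper `fit_raises` are made up for illustration; `LogisticRegression` is just one example, and some estimators (certain gradient boosting implementations, for instance) do accept NaNs natively.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

y_toy = np.array([0, 0, 1, 1])

# Toy inputs (hypothetical): one with a missing value, one with strings.
X_nan = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_str = np.array([["male"], ["female"], ["male"], ["female"]], dtype=object)


def fit_raises(X, y):
    """Return True if fitting a LogisticRegression on (X, y) raises ValueError."""
    try:
        LogisticRegression().fit(X, y)
    except ValueError as err:
        print("fit failed:", str(err).splitlines()[0])
        return True
    return False


nan_rejected = fit_raises(X_nan, y_toy)
str_rejected = fit_raises(X_str, y_toy)
```

Both calls fail before any learning happens, which is why the missing values and the string columns have to be handled in a preprocessing step.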

So far we have taken the simplest path: dropping non-numeric columns, as well as columns/rows that contain NaNs.

Although this is a possible approach, it is clear that we are **throwing information away**. There must be a less drastic way to solve the problem, right?

And that is what we will learn to do in today's lesson, using sklearn's tools!

_________________

Before moving on, let's do the train-test split, but this time with all the features, missing data included!

In [6]:
df = pd.read_csv("../datasets/german_credit_data.csv", index_col=0)

X = df.drop(columns="Risk")
y = df["Risk"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

_________________

## 1) Filling NaNs with sklearn

The [sklearn.impute](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute) submodule provides a few classes used for filling in (imputing) NaN data.

I suggest reading the [User Guide](https://scikit-learn.org/stable/modules/impute.html) for more details on the imputers.

We will use the [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer), which fills in a column's missing values using descriptive statistics of that column, or with a given fixed value.

Note: in our case, the columns with NaNs are categorical. Therefore, if we choose to impute the nulls from a descriptive statistic, the NaNs will be filled **with the mode** of each column.

Imputers behave like transformers with respect to the `.fit()` and `.transform()` methods. For this reason, it is also very important here that they **be fitted on the training data only!**

The Pipeline will guarantee this for us later, automatically!
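As a minimal sketch of the "fit on train only" idea, the example below uses two tiny made-up DataFrames (not the credit dataset) and shows that the fill value is learned from the training data and then reused on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, for illustration only.
train = pd.DataFrame({"account": ["little", "little", np.nan, "rich"]}, dtype=object)
test = pd.DataFrame({"account": ["rich", np.nan, "rich"]}, dtype=object)

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train)  # the mode is learned from the training data only

# Fill value learned for each column during fit.
print(imputer.statistics_)

# The test NaN is filled with the *training* mode ("little"),
# even though "rich" is the most frequent value in the test data.
test_filled = imputer.transform(test)
print(test_filled.ravel())
```

If the imputer had been fitted on the test data (or on train and test together), information from the test set would leak into the preprocessing.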

In [9]:
X_train.isnull().sum()

Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     149
Checking account    323
Credit amount         0
Duration              0
Purpose               0
dtype: int64

In [12]:
X_train["Saving accounts"].value_counts(dropna=False)

little        480
NaN           149
moderate       87
quite rich     48
rich           36
Name: Saving accounts, dtype: int64

In [13]:
X_train["Checking account"].value_counts(dropna=False)

NaN         323
little      225
moderate    201
rich         51
Name: Checking account, dtype: int64

Let's consider two different ways of dealing with null data:

- **Fill the nulls with some descriptive statistic**: valid when we believe the missing data resulted from a data collection error, and when the features do not have a large number of observations with missing values;

- **Fill the nulls with some fixed value**: valid when the lack of information is, in itself, relevant information. In this case we drop neither the observations with nulls nor the entire feature: we just make sure that the observations that originally had missing information are all treated alike, and differently from the observations that had everything filled in. In effect, we create a new categorical level for what used to be empty!

Important: as always, **there is no rule set in stone!** Always align with the business area on the appropriate way to handle missing values, and reflect on **what makes sense for your problem!**

In [15]:
cols_missing = ["Saving accounts", "Checking account"]

In [17]:
X_train[cols_missing]

Unnamed: 0,Saving accounts,Checking account
675,little,
703,moderate,moderate
12,little,moderate
845,,moderate
795,moderate,
...,...,...
284,moderate,moderate
169,little,moderate
856,,
655,little,little


In [18]:
from sklearn.impute import SimpleImputer

inputer_moda = SimpleImputer(strategy="most_frequent").fit(X_train[cols_missing])

inputer_constante = SimpleImputer(strategy="constant", fill_value="vazio").fit(X_train[cols_missing])

In [20]:
display(X_train["Saving accounts"].value_counts(dropna=False))

display(X_train["Checking account"].value_counts(dropna=False))

little        480
NaN           149
moderate       87
quite rich     48
rich           36
Name: Saving accounts, dtype: int64

NaN         323
little      225
moderate    201
rich         51
Name: Checking account, dtype: int64

In [26]:
X_train_inputed_moda = pd.DataFrame(inputer_moda.transform(X_train[cols_missing]),
                                     columns=cols_missing,
                                     index=X_train[cols_missing].index)

display(X_train_inputed_moda["Saving accounts"].value_counts(dropna=False))

display(X_train_inputed_moda["Checking account"].value_counts(dropna=False))

little        629
moderate       87
quite rich     48
rich           36
Name: Saving accounts, dtype: int64

little      548
moderate    201
rich         51
Name: Checking account, dtype: int64

In [28]:
X_train_inputed_constante = pd.DataFrame(inputer_constante.transform(X_train[cols_missing]),
                                     columns=cols_missing,
                                     index=X_train[cols_missing].index)

display(X_train_inputed_constante["Saving accounts"].value_counts(dropna=False))

display(X_train_inputed_constante["Checking account"].value_counts(dropna=False))

little        480
vazio         149
moderate       87
quite rich     48
rich           36
Name: Saving accounts, dtype: int64

vazio       323
little      225
moderate    201
rich         51
Name: Checking account, dtype: int64

In practice, we will **add the imputer as part of our Pipeline**, and with that we guarantee that data leakage will not happen! ;)
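A minimal sketch of why the Pipeline helps: when a pipeline is cross-validated, the imputer is refitted on the training portion of every fold, so the validation portion never influences the fill values. The synthetic data and the step names `imputer` and `clf` below are assumptions made for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: the target depends on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Knock out roughly 10% of the entries to simulate missing data.
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                 ("clf", LogisticRegression())])

# In each of the 5 folds, the imputer's means come from that fold's training part only.
scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
print(scores.mean())
```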

In [38]:
X_train.isnull().sum()

Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     149
Checking account    323
Credit amount         0
Duration              0
Purpose               0
dtype: int64

In [34]:
X_train_nan_filled = pd.concat([X_train.drop(columns=cols_missing), X_train_inputed_constante], axis=1)

X_train_nan_filled.isnull().sum()

Age                 0
Sex                 0
Job                 0
Housing             0
Credit amount       0
Duration            0
Purpose             0
Saving accounts     0
Checking account    0
dtype: int64

In [35]:
X_train_nan_filled.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 675 to 695
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               800 non-null    int64 
 1   Sex               800 non-null    object
 2   Job               800 non-null    int64 
 3   Housing           800 non-null    object
 4   Credit amount     800 non-null    int64 
 5   Duration          800 non-null    int64 
 6   Purpose           800 non-null    object
 7   Saving accounts   800 non-null    object
 8   Checking account  800 non-null    object
dtypes: int64(4), object(5)
memory usage: 62.5+ KB


____
_____
____

## 2) Using categorical data with sklearn

When we started learning pandas, we saw how to convert categorical features to numbers using `pd.get_dummies()`, as well as `.astype("category").cat.codes`.

In [43]:
X_train_cats = X_train_nan_filled.select_dtypes(exclude=np.number)

X_train_cats

Unnamed: 0,Sex,Housing,Purpose,Saving accounts,Checking account
675,female,rent,radio/TV,little,vazio
703,male,own,business,moderate,moderate
12,female,own,radio/TV,little,moderate
845,male,own,furniture/equipment,vazio,moderate
795,female,rent,furniture/equipment,moderate,vazio
...,...,...,...,...,...
284,male,own,car,moderate,moderate
169,male,own,business,little,moderate
856,female,own,education,vazio,vazio
655,male,free,car,little,little


In [42]:
pd.get_dummies(X_train_cats, drop_first=True)

Unnamed: 0,Sex_male,Housing_own,Housing_rent,Purpose_car,Purpose_domestic appliances,Purpose_education,Purpose_furniture/equipment,Purpose_radio/TV,Purpose_repairs,Purpose_vacation/others,Saving accounts_moderate,Saving accounts_quite rich,Saving accounts_rich,Saving accounts_vazio,Checking account_moderate,Checking account_rich,Checking account_vazio
675,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1
703,1,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
12,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0
845,1,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0
795,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284,1,1,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0
169,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
856,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
655,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [46]:
X_train_cats["Housing"].astype("category")

675    rent
703     own
12      own
845     own
795    rent
       ... 
284     own
169     own
856     own
655    free
695    rent
Name: Housing, Length: 800, dtype: category
Categories (3, object): ['free', 'own', 'rent']

In [47]:
X_train_cats["Housing"].astype("category").cat.codes

675    2
703    1
12     1
845    1
795    2
      ..
284    1
169    1
856    1
655    0
695    2
Length: 800, dtype: int8

Now, in order to include these preprocessing strategies in the pipeline, it is important that we also use sklearn to do this!

The relevant classes are:

- [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) - performs one-hot encoding;
- [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder) - performs categorical (ordinal) encoding.

Both encoders also work through the `.fit()` and `.transform()` methods, so it is also a good idea to place them **as an initial step of the Pipeline**.

But, to illustrate how they work, let's take a look below.

Result of the encodings:

In [48]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [49]:
encoder_oh = OneHotEncoder().fit(X_train_cats)

encoder_oe = OrdinalEncoder().fit(X_train_cats)

In [52]:
encoder_oh.transform(X_train_cats).toarray()

array([[1., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 1.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 1.]])

In [54]:
pd.DataFrame(encoder_oh.transform(X_train_cats).toarray(),
             index=X_train_cats.index)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
675,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
703,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
12,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
845,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
795,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
169,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
856,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
655,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [55]:
pd.DataFrame(encoder_oe.transform(X_train_cats),
             index=X_train_cats.index)

Unnamed: 0,0,1,2,3,4
675,0.0,2.0,5.0,0.0,3.0
703,1.0,1.0,0.0,1.0,1.0
12,0.0,1.0,5.0,0.0,1.0
845,1.0,1.0,4.0,4.0,1.0
795,0.0,2.0,4.0,1.0,3.0
...,...,...,...,...,...
284,1.0,1.0,1.0,1.0,1.0
169,1.0,1.0,0.0,0.0,1.0
856,0.0,1.0,3.0,4.0,3.0
655,1.0,0.0,1.0,0.0,0.0
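Both encoders learn their categories during `.fit()`, so a category that only shows up in new data needs some thought. The sketch below, on made-up data, contrasts the default behavior of `OneHotEncoder` (raise an error) with `handle_unknown="ignore"`. Whether ignoring unknown categories is appropriate is a modeling decision, not something this notebook prescribes.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "repairs" never appears in the training data.
train_cats = pd.DataFrame({"Purpose": ["car", "education", "car"]})
new_cats = pd.DataFrame({"Purpose": ["car", "repairs"]})

# Default: an unseen category raises an error at transform time.
strict = OneHotEncoder().fit(train_cats)
try:
    strict.transform(new_cats)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown="ignore", the unseen category becomes a row of zeros.
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train_cats)
encoded_new = tolerant.transform(new_cats).toarray()
print(encoded_new)
```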


In practice, we will leave this step to the Pipeline as well! :)

That is what we will do now! Let's build a more complete pipeline, one that will **include all the preprocessing**, handling numeric features separately from categorical features!

___________

## 3) More complete pipelines

We will now see how to build more complete pipelines!

To do this, it will be very important that we also use the [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).

Let's see how it works in practice!!
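Before applying it to the credit data, here is a minimal sketch of the idea on a made-up two-column DataFrame: each entry of the `ColumnTransformer` is a `(name, transformer, columns)` triple, and the outputs of all transformers are concatenated side by side. The names `num` and `cat` are arbitrary labels chosen for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({"Age": [20, 30, 40],
                    "Sex": ["male", "female", "male"]})

ct = ColumnTransformer([("num", StandardScaler(), ["Age"]),   # scale the numeric column
                        ("cat", OneHotEncoder(), ["Sex"])])   # one-hot encode the categorical one

# 1 scaled column + 2 one-hot columns = 3 output columns.
out = ct.fit_transform(toy)
print(out.shape)
```

Columns that are not listed in any transformer are dropped by default (`remainder="drop"`), so every feature we want to keep must be assigned to some transformer.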

In [57]:
from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline

In [58]:
df = pd.read_csv("../datasets/german_credit_data.csv", index_col=0)

X = df.drop(columns="Risk")
y = df["Risk"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [61]:
X_train.select_dtypes(include=np.number).columns.tolist()

['Age', 'Job', 'Credit amount', 'Duration']

In [62]:
X_train.select_dtypes(exclude=np.number).columns.tolist()

['Sex', 'Housing', 'Saving accounts', 'Checking account', 'Purpose']

In [65]:
X_train.select_dtypes(include=np.number).mean()

Age                35.38625
Job                 1.90500
Credit amount    3185.84125
Duration           20.82750
dtype: float64

In [66]:
pre_processador  # the ColumnTransformer defined in cell In [63], shown further below

ColumnTransformer(transformers=[('transf_num',
                                 Pipeline(steps=[('input_num', SimpleImputer()),
                                                 ('std', StandardScaler())]),
                                 ['Age', 'Job', 'Credit amount', 'Duration']),
                                ('transf_cat',
                                 Pipeline(steps=[('input_cat',
                                                  SimpleImputer(fill_value='vazio',
                                                                strategy='constant')),
                                                 ('onehot', OneHotEncoder())]),
                                 ['Sex', 'Housing', 'Saving accounts',
                                  'Checking account', 'Purpose'])])

In [68]:
from sklearn.svm import SVC

__________

Recalling how it was before....

In [70]:
df = pd.read_csv("../datasets/german_credit_data.csv", index_col=0)

X = df.select_dtypes(include=np.number)
y = df["Risk"]

X_train_num, X_test_num, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# =======================================

pipe_svc = Pipeline([("standard_scaler", StandardScaler()),
                     ("svc", SVC(probability=True, random_state=42))])

pipe_svc.fit(X_train_num, y_train)

_ = clf_metrics_train_test(pipe_svc, X_train_num, y_train, X_test_num, y_test, cutoff=0.5, 
                           print_plot=True, plot_conf_matrix=False, print_cr=True, pos_label="bad")

Métricas de avaliação de treino - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.75      0.22      0.34       240
        good       0.74      0.97      0.84       560

    accuracy                           0.74       800
   macro avg       0.75      0.59      0.59       800
weighted avg       0.75      0.74      0.69       800


################################################################################

Métricas de avaliação de teste - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.44      0.13      0.21        60
        good       0.71      0.93      0.81       140

    accuracy                           0.69       200
   macro avg       0.58      0.53      0.51       200
weighted avg       0.63      0.69      0.63       200



________

Once the complete preprocessing object has been assembled with the `ColumnTransformer`, we just pass it as a step of the final pipeline, which will include the estimator to be trained:

In [63]:
# ==========================================================
# what to do with the numeric features

pipe_features_num = Pipeline([("input_num", SimpleImputer(strategy="mean")),
                              ("std", StandardScaler())])

features_num = X_train.select_dtypes(include=np.number).columns.tolist()

# ==========================================================
# what to do with the categorical features

pipe_features_cat = Pipeline([("input_cat", SimpleImputer(strategy="constant", fill_value="vazio")),
                              ("onehot", OneHotEncoder())])

features_cat = X_train.select_dtypes(exclude=np.number).columns.tolist()

# ==========================================================

pre_processador = ColumnTransformer([("transf_num", pipe_features_num, features_num),
                                     ("transf_cat", pipe_features_cat, features_cat)])

In [71]:
pipe_svc = Pipeline([("pre_processador", pre_processador),
                     ("svc", SVC(probability=True, random_state=42))])

pipe_svc.fit(X_train, y_train)

_ = clf_metrics_train_test(pipe_svc, X_train, y_train, X_test, y_test, cutoff=0.5, 
                           print_plot=True, plot_conf_matrix=False, print_cr=True, pos_label="bad")

Métricas de avaliação de treino - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.86      0.49      0.62       240
        good       0.81      0.97      0.88       560

    accuracy                           0.82       800
   macro avg       0.84      0.73      0.75       800
weighted avg       0.83      0.82      0.81       800


################################################################################

Métricas de avaliação de teste - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.64      0.38      0.48        60
        good       0.77      0.91      0.84       140

    accuracy                           0.75       200
   macro avg       0.71      0.65      0.66       200
weighted avg       0.73      0.75      0.73       200



In [74]:
from sklearn.linear_model import LogisticRegression

pipe_lr = Pipeline([("pre_processador", pre_processador),
                    ("lr", LogisticRegression(random_state=42))])

pipe_lr.fit(X_train, y_train)

_ = clf_metrics_train_test(pipe_lr, X_train, y_train, X_test, y_test, cutoff=0.5, 
                           print_plot=True, plot_conf_matrix=False, print_cr=True, pos_label="bad")

Métricas de avaliação de treino - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.62      0.40      0.48       240
        good       0.78      0.89      0.83       560

    accuracy                           0.74       800
   macro avg       0.70      0.65      0.66       800
weighted avg       0.73      0.74      0.73       800


################################################################################

Métricas de avaliação de teste - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.57      0.45      0.50        60
        good       0.78      0.86      0.82       140

    accuracy                           0.73       200
   macro avg       0.68      0.65      0.66       200
weighted avg       0.72      0.73      0.72       200



In [75]:
from sklearn.ensemble import RandomForestClassifier

pipe_rf = Pipeline([("pre_processador", pre_processador),
                    ("rf", RandomForestClassifier(max_depth=8, random_state=42))])

pipe_rf.fit(X_train, y_train)

_ = clf_metrics_train_test(pipe_rf, X_train, y_train, X_test, y_test, cutoff=0.5, 
                           print_plot=True, plot_conf_matrix=False, print_cr=True, pos_label="bad")

Métricas de avaliação de treino - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.97      0.65      0.78       240
        good       0.87      0.99      0.93       560

    accuracy                           0.89       800
   macro avg       0.92      0.82      0.85       800
weighted avg       0.90      0.89      0.88       800


################################################################################

Métricas de avaliação de teste - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.63      0.43      0.51        60
        good       0.79      0.89      0.84       140

    accuracy                           0.76       200
   macro avg       0.71      0.66      0.68       200
weighted avg       0.74      0.76      0.74       200



In [77]:
from lightgbm import LGBMClassifier

pipe_lgbm = Pipeline([("pre_processador", pre_processador),
                      ("lgbm", LGBMClassifier(reg_alpha=0.5, random_state=42))])

pipe_lgbm.fit(X_train, y_train)

_ = clf_metrics_train_test(pipe_lgbm, X_train, y_train, X_test, y_test, cutoff=0.5, 
                           print_plot=True, plot_conf_matrix=False, print_cr=True, pos_label="bad")

Métricas de avaliação de treino - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.99      0.95      0.97       240
        good       0.98      0.99      0.99       560

    accuracy                           0.98       800
   macro avg       0.98      0.97      0.98       800
weighted avg       0.98      0.98      0.98       800


################################################################################

Métricas de avaliação de teste - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.67      0.50      0.57        60
        good       0.81      0.89      0.85       140

    accuracy                           0.78       200
   macro avg       0.74      0.70      0.71       200
weighted avg       0.76      0.78      0.76       200



We can also run a grid search, as usual:

In [78]:
pipe_lgbm = Pipeline([("pre_processador", pre_processador),
                      ("lgbm", LGBMClassifier(reg_alpha=0.5, random_state=42))])

param_grid_lgbm = {"lgbm__num_leaves" : range(20, 60, 2),
                   "lgbm__max_depth" : range(1, 6, 1)}

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_lgbm = GridSearchCV(estimator=pipe_lgbm,
                         param_grid=param_grid_lgbm,
                         scoring="f1_weighted",
                         cv=splitter,
                         verbose=10,
                         n_jobs=-1,
                         return_train_score=True)

grid_lgbm.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('pre_processador',
                                        ColumnTransformer(transformers=[('transf_num',
                                                                         Pipeline(steps=[('input_num',
                                                                                          SimpleImputer()),
                                                                                         ('std',
                                                                                          StandardScaler())]),
                                                                         ['Age',
                                                                          'Job',
                                                                          'Credit '
                                                                          'amount',
                             

In [80]:
grid_lgbm.best_params_

{'lgbm__max_depth': 5, 'lgbm__num_leaves': 22}
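The grid above only tunes the estimator, but the same `step__parameter` naming also reaches preprocessing steps. The sketch below illustrates this on synthetic data with a simpler pipeline; the step names `imputer`, `std` and `lr` and the grid values are assumptions made for this example. For the notebook's pipeline, the analogous key for the numeric imputer would follow the nested pattern `pre_processador__transf_num__input_num__strategy`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% missing entries.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(120, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

pipe = Pipeline([("imputer", SimpleImputer()),
                 ("std", StandardScaler()),
                 ("lr", LogisticRegression())])

# Keys are "<step name>__<parameter>": one preprocessing and one estimator parameter.
param_grid = {"imputer__strategy": ["mean", "median"],
              "lr__C": [0.1, 1.0, 10.0]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="f1_weighted")
grid.fit(X_toy, y_toy)

print(grid.best_params_)
```

Treating preprocessing choices as hyperparameters keeps them inside cross-validation, so they are evaluated without leakage just like the estimator's own parameters.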

In [81]:
_ = clf_metrics_train_test(grid_lgbm, X_train, y_train, X_test, y_test, cutoff=0.5, 
                           print_plot=True, plot_conf_matrix=False, print_cr=True, pos_label="bad")

Métricas de avaliação de treino - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.92      0.77      0.84       240
        good       0.91      0.97      0.94       560

    accuracy                           0.91       800
   macro avg       0.92      0.87      0.89       800
weighted avg       0.91      0.91      0.91       800


################################################################################

Métricas de avaliação de teste - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.68      0.50      0.58        60
        good       0.81      0.90      0.85       140

    accuracy                           0.78       200
   macro avg       0.74      0.70      0.71       200
weighted avg       0.77      0.78      0.77       200



Using our function:

In [82]:
best_params_delta = calc_best_params_delta(grid_lgbm, peso_delta=0.5, print_deltas=True)

Unnamed: 0,mean_train_score,mean_test_score,delta,mean_train_score_norm,mean_test_score_norm,delta_norm,metrica_criterio_final
29,0.795214,0.729948,0.065267,0.443646,0.715897,0.29636,0.709769
39,0.795214,0.729948,0.065267,0.443646,0.715897,0.29636,0.709769
21,0.795214,0.729948,0.065267,0.443646,0.715897,0.29636,0.709769
22,0.795214,0.729948,0.065267,0.443646,0.715897,0.29636,0.709769
23,0.795214,0.729948,0.065267,0.443646,0.715897,0.29636,0.709769
...,...,...,...,...,...,...,...
17,0.712305,0.679456,0.032850,0.100000,0.100000,0.10000,0.500000
18,0.712305,0.679456,0.032850,0.100000,0.100000,0.10000,0.500000
19,0.712305,0.679456,0.032850,0.100000,0.100000,0.10000,0.500000
0,0.712305,0.679456,0.032850,0.100000,0.100000,0.10000,0.500000


In [83]:
best_params_delta

{'lgbm__max_depth': 2, 'lgbm__num_leaves': 38}

In [84]:
pipe_best_delta = Pipeline([("pre_processador", pre_processador),
                            ("lgbm", LGBMClassifier(reg_alpha=0.5, random_state=42))]).set_params(**best_params_delta)

pipe_best_delta.fit(X_train, y_train)

_ = clf_metrics_train_test(pipe_best_delta, X_train, y_train, X_test, y_test, cutoff=0.5, 
                           print_plot=True, plot_conf_matrix=False, print_cr=True, pos_label="bad")

Métricas de avaliação de treino - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.79      0.46      0.58       240
        good       0.80      0.95      0.87       560

    accuracy                           0.80       800
   macro avg       0.80      0.71      0.73       800
weighted avg       0.80      0.80      0.78       800


################################################################################

Métricas de avaliação de teste - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.68      0.42      0.52        60
        good       0.79      0.91      0.84       140

    accuracy                           0.77       200
   macro avg       0.73      0.67      0.68       200
weighted avg       0.75      0.77      0.75       200



_________

One last thing: let's generalize the pipeline even further!!

For the non-numeric features, we already know that there are two different encodings we can use to convert them to numbers (one-hot and ordinal).

So far, we have used one approach or the other. But why not use **both**?

The `ColumnTransformer` lets us do that too!! :D

Remember that whenever we use an ordinal encoding, we are **adding an ordinal bias**, which may or may not be desirable, depending on the feature being encoded!

That is why it is important to analyze the categorical features one by one, and to process the columns separately!
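One detail worth keeping in mind: by default, `OrdinalEncoder` assigns codes following the sorted (alphabetical) order of the categories, which may not match the order we have in mind. The sketch below shows how an explicit order could be passed through the `categories` parameter. The toy column and, in particular, the position given to `"unknown"` are assumptions made for illustration, not a recommendation from this lesson.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column, after imputing missing values with "unknown".
toy = pd.DataFrame({"Saving accounts": ["rich", "little", "unknown",
                                        "moderate", "quite rich"]})

# Default behavior: categories are sorted alphabetically.
default_enc = OrdinalEncoder().fit(toy)
print(default_enc.categories_[0])

# Explicit order (an assumption): "unknown" first, then increasing wealth.
order = ["unknown", "little", "moderate", "quite rich", "rich"]
explicit_enc = OrdinalEncoder(categories=[order]).fit(toy)

codes = explicit_enc.transform(toy).ravel()
print(codes)
```

With the explicit list, the integer codes follow the order we chose instead of whatever order the strings happen to sort into.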

In [85]:
X_train[['Sex', 'Purpose']]

Unnamed: 0,Sex,Purpose
675,female,radio/TV
703,male,business
12,female,radio/TV
845,male,furniture/equipment
795,female,furniture/equipment
...,...,...
284,male,car
169,male,business
856,female,education
655,male,car


In [87]:
X_train["Purpose"].unique()

array(['radio/TV', 'business', 'furniture/equipment', 'car', 'education',
       'domestic appliances', 'repairs', 'vacation/others'], dtype=object)

In [86]:
X_train[['Housing', 'Saving accounts', 'Checking account']]

Unnamed: 0,Housing,Saving accounts,Checking account
675,rent,little,
703,own,moderate,moderate
12,own,little,moderate
845,own,,moderate
795,rent,moderate,
...,...,...,...
284,own,moderate,moderate
169,own,little,moderate
856,own,,
655,free,little,little


In [90]:
pipe_features_num = Pipeline([("input_num", SimpleImputer(strategy="mean")),
                              ("std", StandardScaler())])

features_num = X_train.select_dtypes(include=np.number).columns.tolist()

# ==========================================================

pipe_features_oh = Pipeline([("input_cat_oh", SimpleImputer(strategy="constant", fill_value="unknown")),
                             ("onehot", OneHotEncoder())])

features_oh = ['Sex', 'Purpose']

# ==========================================================

pipe_features_oe = Pipeline([("input_cat_oe", SimpleImputer(strategy="constant", fill_value="unknown")),
                             ("ordinal", OrdinalEncoder())])

features_oe = ['Housing', 'Saving accounts', 'Checking account']

# ==========================================================

pre_processador_oh_oe = ColumnTransformer([("transf_num", pipe_features_num, features_num),
                                           ("transf_cat_oh", pipe_features_oh, features_oh),
                                           ("transf_cat_oe", pipe_features_oe, features_oe)])

In [91]:
pipe_lgbm_oh_oe = Pipeline([("pre_processador", pre_processador_oh_oe),
                            ("lgbm", LGBMClassifier(reg_alpha=0.5, random_state=42))])

pipe_lgbm_oh_oe.fit(X_train, y_train)

_ = clf_metrics_train_test(pipe_lgbm_oh_oe, X_train, y_train, X_test, y_test, cutoff=0.5, 
                           print_plot=True, plot_conf_matrix=False, print_cr=True, pos_label="bad")

Métricas de avaliação de treino - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.99      0.95      0.97       240
        good       0.98      0.99      0.99       560

    accuracy                           0.98       800
   macro avg       0.98      0.97      0.98       800
weighted avg       0.98      0.98      0.98       800


################################################################################

Métricas de avaliação de teste - com cutoff = 0.50
              precision    recall  f1-score   support

         bad       0.60      0.48      0.54        60
        good       0.80      0.86      0.83       140

    accuracy                           0.75       200
   macro avg       0.70      0.67      0.68       200
weighted avg       0.74      0.75      0.74       200



In [None]:
# to-do: run the hyperparameter optimization!

__________

Now we have really gained a lot of power!

With the more general pipeline we have just seen, we are now able to create models that make use of all the available information, and in different ways! Pretty cool, right?

Make the most of these great new tools at your disposal! Explore their applications, and practice a lot!