# Notebook on pipelines

The idea of this notebook is to test several models within a single pipeline and, along the way, to try new types of encoders.

## 1. Importing the initial libraries

In [1]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

## 2. Context and loading the data

A data science and Big Data company offers courses and would like to hire some of the data scientists who complete them. To that end, it created a questionnaire that collects demographic, social, and educational information, among other things, with the goal of reducing hiring costs and streamlining the hiring process, since any candidate it hires still has to be trained and integrated into the team. In short, it is the same situation as a company that sends you a feedback questionnaire at the end of its course and asks whether you would like to hear about its job openings.

1. **True Negative**: candidates the model said are not looking for a new job and who really are not looking.
2. **False Positive**: candidates the model said are looking, but who in reality are not looking for a new job. *This is the most harmful error, because we will contact these people even though they are not looking for a new job.*
3. **False Negative**: candidates the model said are **NOT** looking, but who in reality are looking for a new job.
4. **True Positive**: candidates the model said are looking and who really are looking for a new job.

Since we want to reduce the number of False Positives, we will aim to maximize Precision. Precision is the share of predicted positives that are truly positive, TP / (TP + FP), so fewer False Positives means higher Precision.
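
As a minimal sketch of the metric chosen above, the snippet below computes Precision as TP / (TP + FP) from a handful of made-up labels and checks it against scikit-learn's `precision_score`. The arrays `y_true` and `y_hat` are illustrative assumptions, not data from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels, for illustration only (1 = looking for a new job)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision only involves the predicted positives: TP and FP
manual_precision = tp / (tp + fp)
sklearn_precision = precision_score(y_true, y_hat)

print(tn, fp, fn, tp)
print(manual_precision, sklearn_precision)
```

Note that False Negatives do not enter the formula, which is why a model can reach a high Precision while still missing many candidates who are looking for a job.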

In [2]:
dados_treino = pd.read_csv(filepath_or_buffer = "../data/raw/aug_train.csv")

dados_treino

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


## 3. Informações iniciais dos dados

In [3]:
dados_treino.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

---

For now, we do not need to convert any column: every column already has a suitable dtype. Several columns do contain missing values, though, as the non-null counts above show.

In [4]:
dados_treino["target"].value_counts()

0.0    14381
1.0     4777
Name: target, dtype: int64
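
The counts above show that the classes are imbalanced. A small sketch that turns those counts into proportions (the Series is rebuilt by hand here so that the snippet runs on its own; on the real data, `dados_treino["target"].value_counts(normalize = True)` should give the same shares):

```python
import pandas as pd

# Counts copied from the output above
target_counts = pd.Series({0.0: 14381, 1.0: 4777}, name = "target")

# Share of each class in the labeled data
target_share = target_counts / target_counts.sum()

print(target_share.round(4))
```

Roughly one candidate in four belongs to the positive class, which is worth keeping in mind when reading the Precision values further down.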

## 4. Splitting into training and test sets

Although a separate test set is also available, it has no labels. We therefore have to split our training data (which are labeled) into training and test sets.

Since the observations have no temporal dependence, we can split them at random.

In [5]:
from sklearn.model_selection import train_test_split

X = dados_treino.drop("target", axis = 1)
y = dados_treino[["target"]]

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1234, stratify = y)

In [6]:
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

(14368, 13) (14368, 1) (4790, 13) (4790, 1)


This gives us 14,368 observations in the training data and 4,790 in the test data.
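
The split above passes `stratify = y`, which keeps the class proportions the same in both parts. A minimal sketch of that effect on synthetic data (the frame, the 25% positive rate, and the `toy_` names are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled data, with about 25% positives
rng = np.random.default_rng(1234)
toy_X = pd.DataFrame({"feature": rng.normal(size = 1000)})
toy_y = pd.Series([1.0] * 250 + [0.0] * 750, name = "target")

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size = 0.25, random_state = 1234, stratify = toy_y)

# With stratify, both parts keep (almost exactly) the positive rate of the full data
print(toy_y.mean(), toy_y_train.mean(), toy_y_test.mean())
```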

## 5. Creating Pipelines

In [7]:
#---- Functions

from sklearn.pipeline import make_pipeline # Builds the Pipeline
from sklearn.compose import make_column_transformer # Applies different transformers to different subsets of columns inside the Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer, OneHotEncoder, MinMaxScaler, Normalizer, PolynomialFeatures, RobustScaler # Encoders and scalers
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV # Cross-validation and hyperparameter searches
from sklearn import set_config # Visually nicer pipelines
from sklearn.linear_model import LogisticRegression # A first model
from sklearn.metrics import classification_report

#---- Making the pipelines look nice

set_config(display = "diagram")

### 5.1. **Pipeline I**: Logistic Regression + OHE (qualitative) + StandardScaler (quantitative)

In [8]:
#---- Defining our model

log_reg = LogisticRegression(random_state = 1234, max_iter = 400)

#---- Defining our encoders

ohe = OneHotEncoder()
scaler = StandardScaler()

In [9]:
#---- Listing the numeric features that will go through the scaler

numeric_features = ["city_development_index", "training_hours"]

#---- Listing the categorical features that will go through the OHE

categorical_features = list(dados_treino.select_dtypes("object").columns)

In [10]:
ct = make_column_transformer(
    (ohe, categorical_features),
    (scaler, numeric_features),  
    remainder = "drop")

ct
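
To make the column transformer above more concrete, here is a minimal sketch on a tiny made-up frame. The three rows and the `toy_` names are assumptions for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with one categorical and one numeric column
toy = pd.DataFrame({
    "city": ["city_103", "city_40", "city_21"],
    "training_hours": [36, 47, 83],
    "enrollee_id": [1, 2, 3],  # not listed below, so remainder = "drop" discards it
})

toy_ct = make_column_transformer(
    (OneHotEncoder(), ["city"]),
    (StandardScaler(), ["training_hours"]),
    remainder = "drop")

encoded = toy_ct.fit_transform(toy)
# The result may come back sparse or dense depending on its density
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded

# One column per city category, followed by the scaled numeric column
print(encoded.shape)
print(encoded)
```

The same thing happens in the notebook's `ct`: `enrollee_id` is neither in `categorical_features` nor in `numeric_features`, so `remainder = "drop"` leaves it out of the model.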

In [11]:
final_pipeline = make_pipeline(ct, log_reg)

final_pipeline.fit(x_train, y_train.values.ravel())

In [12]:
cross_val_score(final_pipeline, x_train, y_train, cv = 5, scoring = "precision")

array([       nan, 0.57857143,        nan, 0.61061947, 0.61267606])

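**Despite all of these `warnings`, the message is simple: the encoder had seen no training observation with the category `city_140`, so it could not encode that category when it appeared in a validation fold. With the default `OneHotEncoder()` settings this raises an error, and the score of the affected fold is reported as `nan`.**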
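
A minimal sketch of that failure and of one common way around it, `handle_unknown = "ignore"`, on made-up folds (the two small frames are assumptions for illustration; the results shown in this notebook were produced with the default settings):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up "training fold" without city_140 and a "validation fold" with it
fold_train = pd.DataFrame({"city": ["city_103", "city_40", "city_21"]})
fold_valid = pd.DataFrame({"city": ["city_140", "city_40"]})

# Default behaviour: an unseen category raises a ValueError at transform time
strict_ohe = OneHotEncoder().fit(fold_train)
try:
    strict_ohe.transform(fold_valid)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown = "ignore": the unseen category becomes an all-zero row
tolerant_ohe = OneHotEncoder(handle_unknown = "ignore").fit(fold_train)
encoded_valid = tolerant_ohe.transform(fold_valid).toarray()

print(strict_failed)
print(encoded_valid)
```

In the pipeline above this would amount to defining `ohe = OneHotEncoder(handle_unknown = "ignore")`, at the cost of treating every unseen city as "none of the known cities".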

### 5.2. **Tuning Pipeline I**: testing scalers on the quantitative variables

In [13]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 'lo

In [14]:
params = {}
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), "drop"]
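
The keys listed by `get_params()` follow the pattern `<step>__<parameter>`, and a whole transformer can be swapped through its step name, which is what the grid above relies on. A small self-contained sketch on a made-up pipeline (the column names are placeholders and the `toy_` objects are assumptions for illustration):

```python
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

toy_ct = make_column_transformer(
    (OneHotEncoder(), ["some_categorical_column"]),  # placeholder column name
    (StandardScaler(), ["some_numeric_column"]),     # placeholder column name
    remainder = "drop")
toy_pipeline = make_pipeline(toy_ct, LogisticRegression())

# make_pipeline / make_column_transformer name each step after its lowercased class
scaler_before = toy_pipeline.get_params()["columntransformer__standardscaler"]

# Replacing the whole step keeps the name "standardscaler", whatever the new object is
toy_pipeline.set_params(columntransformer__standardscaler = MinMaxScaler())
scaler_after = toy_pipeline.get_params()["columntransformer__standardscaler"]

print(type(scaler_before).__name__, "->", type(scaler_after).__name__)
```

This is why the result tables keep a column called `param_columntransformer__standardscaler` even in rows where the step actually holds a `MinMaxScaler()` or `"drop"`.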

In [15]:
grid = GridSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

grid.fit(x_train, y_train)

In [16]:
results = pd.DataFrame(grid.cv_results_)

results.sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__standardscaler,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.3185,0.0076,0.0098,0.0066,StandardScaler(),{'columntransformer__standardscaler': Standard...,,0.5852,,0.6243,,,1
1,0.2255,0.0223,0.0092,0.0059,MinMaxScaler(),{'columntransformer__standardscaler': MinMaxSc...,,0.5875,,0.6229,,,2
2,0.2762,0.0312,0.0092,0.0053,drop,{'columntransformer__standardscaler': 'drop'},,0.5775,,0.6167,,,3


### 5.3. **Tuning Pipeline II**: testing encoders on the qualitative variables

In [17]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 'lo

In [18]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import category_encoders as ce


# Candidate encoders for the categorical columns
le = LabelEncoder() # Note: meant for 1-D targets, so it is not expected to work on feature columns
oe = OrdinalEncoder()
encoder = ce.BackwardDifferenceEncoder() # Defined but not included in the search below
encoder1 = ce.BaseNEncoder()
encoder2 = ce.BinaryEncoder()
encoder3 = ce.CatBoostEncoder()
encoder5 = ce.GLMMEncoder()
encoder6 = ce.HashingEncoder()
encoder7 = ce.HelmertEncoder()
encoder8 = ce.JamesSteinEncoder()
encoder9 = ce.LeaveOneOutEncoder()
encoder10 = ce.MEstimateEncoder()
encoder13 = ce.SumEncoder()
encoder15 = ce.TargetEncoder()
encoder16 = ce.WOEEncoder()


params = {}
params["columntransformer__onehotencoder"] = [ohe, le, oe, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]
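
One caveat about the candidate list above: `LabelEncoder` is designed for encoding a target vector, so it takes a single 1-D array and is not expected to work as a step over several feature columns; `OrdinalEncoder` is its feature-side counterpart. A minimal sketch of the difference on made-up data (the frame is an assumption for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

toy_features = pd.DataFrame({
    "education_level": ["Graduate", "Masters", "Graduate"],
    "major_discipline": ["STEM", "STEM", "Humanities"],
})

# OrdinalEncoder accepts a 2-D block of features and encodes each column separately
ordinal_codes = OrdinalEncoder().fit_transform(toy_features)

# LabelEncoder accepts one 1-D array, which is how a target column is usually encoded
label_codes = LabelEncoder().fit_transform(toy_features["education_level"])

print(ordinal_codes)
print(label_codes)
```

Candidates that fail inside the search are scored as `nan` and ranked last, so they would simply not show up among the best rows of the results table.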

In [19]:
grid = GridSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

grid.fit(x_train, y_train)

In [20]:
results = pd.DataFrame(grid.cv_results_)

results.sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
12,1.1909,0.0236,0.0677,0.0034,SumEncoder(),{'columntransformer__onehotencoder': SumEncode...,0.5819,0.5841,0.6021,0.6236,0.5979,0.0168,1
14,0.1926,0.0058,0.0239,0.0011,WOEEncoder(),{'columntransformer__onehotencoder': WOEEncode...,0.6027,0.5787,0.5793,0.6278,0.5971,0.0202,2
15,0.0173,0.0022,0.0058,0.0004,drop,{'columntransformer__onehotencoder': 'drop'},0.5958,0.5743,0.5877,0.6214,0.5948,0.0172,3
8,1.9518,0.2035,0.0704,0.0059,HelmertEncoder(),{'columntransformer__onehotencoder': HelmertEn...,0.5828,0.5893,0.5942,0.612,0.5946,0.0109,4
6,15.5505,1.4782,0.0263,0.0016,GLMMEncoder(),{'columntransformer__onehotencoder': GLMMEncod...,0.586,0.5847,0.5898,0.6171,0.5944,0.0132,5
5,0.301,0.0188,0.0467,0.0015,CatBoostEncoder(),{'columntransformer__onehotencoder': CatBoostE...,0.5831,0.5702,0.584,0.6268,0.591,0.0214,6
10,0.2578,0.0087,0.044,0.0008,LeaveOneOutEncoder(),{'columntransformer__onehotencoder': LeaveOneO...,0.5922,0.5625,0.5857,0.6219,0.5906,0.0212,7
9,0.2475,0.0095,0.0255,0.0015,JamesSteinEncoder(),{'columntransformer__onehotencoder': JamesStei...,0.5957,0.5674,0.5711,0.6195,0.5884,0.021,8
13,0.258,0.0082,0.0244,0.0016,TargetEncoder(),{'columntransformer__onehotencoder': TargetEnc...,0.5916,0.5721,0.569,0.611,0.5859,0.0169,9
3,0.3831,0.0598,0.0416,0.0043,BaseNEncoder(),{'columntransformer__onehotencoder': BaseNEnco...,0.582,0.5702,0.5797,0.6057,0.5844,0.0131,10


### 5.4. **Tuning Pipeline II**: testing encoders and scalers on the qualitative and quantitative variables

In [21]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 'lo

In [22]:
params = {}
params["columntransformer__onehotencoder"] = [ohe, le, oe, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), Normalizer(), RobustScaler(), "drop"]

params

{'columntransformer__onehotencoder': [OneHotEncoder(),
  LabelEncoder(),
  OrdinalEncoder(),
  BaseNEncoder(),
  BinaryEncoder(),
  CatBoostEncoder(),
  GLMMEncoder(),
  HashingEncoder(max_process=4),
  HelmertEncoder(),
  JamesSteinEncoder(),
  LeaveOneOutEncoder(),
  MEstimateEncoder(),
  SumEncoder(),
  TargetEncoder(),
  WOEEncoder(),
  'drop'],
 'columntransformer__standardscaler': [StandardScaler(),
  MinMaxScaler(),
  Normalizer(),
  RobustScaler(),
  'drop']}

In [23]:
randomCV = RandomizedSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

randomCV

In [24]:
randomCV.fit(x_train, y_train)

In [25]:
pd.DataFrame(randomCV.cv_results_).sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__standardscaler,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
9,14.9876,0.7551,0.0328,0.009,RobustScaler(),GLMMEncoder(),{'columntransformer__standardscaler': RobustSc...,0.586,0.5847,0.5898,0.6171,0.5944,0.0132,1
8,14.0927,0.4493,0.0288,0.0045,MinMaxScaler(),GLMMEncoder(),{'columntransformer__standardscaler': MinMaxSc...,0.5844,0.5843,0.5898,0.6171,0.5939,0.0135,2
0,0.3111,0.0136,0.0515,0.0039,RobustScaler(),CatBoostEncoder(),{'columntransformer__standardscaler': RobustSc...,0.5831,0.5702,0.584,0.6268,0.591,0.0214,3
6,2.3224,0.3779,0.0772,0.0181,Normalizer(),HelmertEncoder(),{'columntransformer__standardscaler': Normaliz...,0.5787,0.5735,0.5897,0.6061,0.587,0.0125,4
2,0.3155,0.0631,0.0238,0.0026,drop,JamesSteinEncoder(),"{'columntransformer__standardscaler': 'drop', ...",0.5949,0.5599,0.5749,0.6146,0.5861,0.0206,5
1,0.4611,0.0986,0.0548,0.0185,RobustScaler(),BinaryEncoder(),{'columntransformer__standardscaler': RobustSc...,0.5832,0.5714,0.5797,0.6057,0.585,0.0127,6
5,0.3965,0.0386,0.0483,0.0097,StandardScaler(),BinaryEncoder(),{'columntransformer__standardscaler': Standard...,0.582,0.5702,0.5797,0.6057,0.5844,0.0131,7
3,0.236,0.0156,0.0257,0.0012,MinMaxScaler(),MEstimateEncoder(),{'columntransformer__standardscaler': MinMaxSc...,0.5896,0.5672,0.5726,0.6052,0.5836,0.0149,8
4,0.3158,0.0316,0.0098,0.0063,RobustScaler(),OneHotEncoder(),{'columntransformer__standardscaler': RobustSc...,,0.5852,,0.6243,,,9
7,0.2771,0.0094,0.0095,0.0061,StandardScaler(),OneHotEncoder(),{'columntransformer__standardscaler': Standard...,,0.5852,,0.6243,,,10
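
The ten rows above are consistent with `RandomizedSearchCV` sampling `n_iter = 10` candidates by default, out of the 16 × 5 = 80 combinations in `params`; without a `random_state`, a rerun may sample different ones. A sketch on synthetic data showing both arguments (the data, the small grid, and the seed are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

toy_X, toy_y = make_classification(n_samples = 300, n_features = 5, random_state = 1234)

toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter = 400))

toy_params = {
    "standardscaler": [StandardScaler(), MinMaxScaler(), RobustScaler(), "passthrough"],
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0],
}

# 4 x 4 = 16 combinations exist, but only n_iter of them are sampled and fitted
toy_search = RandomizedSearchCV(
    toy_pipeline, toy_params, n_iter = 6, cv = 4,
    scoring = "precision", random_state = 1234)
toy_search.fit(toy_X, toy_y)

print(len(toy_search.cv_results_["params"]))
```

Since only a sample of the combinations is evaluated, the best row of a randomized search is the best among the sampled candidates, not necessarily the best of the whole grid.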


In [26]:
best_estimator = randomCV.best_estimator_

In [27]:
y_pred = best_estimator.predict(x_test)

In [28]:
pd.crosstab(y_test.values.ravel(), y_pred, rownames = ["Verdadeiro"], colnames = ["Predito pelo modelo"], margins = True)

Verdadeiro \ Predito pelo modelo,0.0,1.0,All
0.0,3345,251,3596
1.0,783,411,1194
All,4128,662,4790


In [29]:
print(classification_report(y_test.values.ravel(), y_pred))

              precision    recall  f1-score   support

         0.0       0.81      0.93      0.87      3596
         1.0       0.62      0.34      0.44      1194

    accuracy                           0.78      4790
   macro avg       0.72      0.64      0.65      4790
weighted avg       0.76      0.78      0.76      4790



## 6. Testing an LGBM

In [30]:
from lightgbm import LGBMClassifier

In [31]:
lgbm = LGBMClassifier(n_jobs = -1, random_state = 1234)

In [32]:
final_pipeline = make_pipeline(ct, lgbm)

final_pipeline.fit(x_train, y_train.values.ravel())

In [33]:
cross_val_score(final_pipeline, x_train, y_train, cv = 5, scoring = "precision")

array([       nan, 0.58166189,        nan, 0.58345221, 0.59684362])

### 6.1. **Tuning Pipeline I**: testing scalers on the quantitative variables

In [34]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'lgbmclassifier', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'lgbmclassifier__boosting_type', 'lgbmclassifier__class_weight', 'lgbmclassifier__colsample_bytree', 'lgbmclassifier__importance_type', 'lgbmclassifier__learning_rate', 'lgbmclassifier__max_depth', 'lgbmclassifier__min_child_samples', 'lgbmclassifier__min_child_weight', 'l

In [35]:
params = {}
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), "drop"]

In [36]:
grid = GridSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

grid.fit(x_train, y_train)

In [37]:
results = pd.DataFrame(grid.cv_results_)

results.sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__standardscaler,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.1751,0.0239,0.0143,0.0109,StandardScaler(),{'columntransformer__standardscaler': Standard...,,0.5856,,0.5917,,,1
1,0.1651,0.0145,0.0124,0.0086,MinMaxScaler(),{'columntransformer__standardscaler': MinMaxSc...,,0.5856,,0.5917,,,2
2,0.1463,0.0428,0.0103,0.0068,drop,{'columntransformer__standardscaler': 'drop'},,0.5955,,0.5872,,,3
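
The identical scores for `StandardScaler()` and `MinMaxScaler()` above are what one would expect from a tree-based model: splits depend on the ordering of a feature's values, which these rescalings preserve. A minimal sketch of that property with a single scikit-learn decision tree on synthetic data (a stand-in chosen to keep the snippet light, not the LGBM used in this notebook):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: four integer-valued features and a label built from two of them
rng = np.random.default_rng(1234)
toy_X = rng.integers(0, 100, size = (300, 4)).astype(float)
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 100).astype(int)

def fit_and_predict(features):
    # Same seed every time so that only the feature scale differs between fits
    tree = DecisionTreeClassifier(max_depth = 3, random_state = 1234)
    return tree.fit(features, toy_y).predict(features)

pred_raw = fit_and_predict(toy_X)
pred_standard = fit_and_predict(StandardScaler().fit_transform(toy_X))
pred_minmax = fit_and_predict(MinMaxScaler().fit_transform(toy_X))

print(np.array_equal(pred_raw, pred_standard), np.array_equal(pred_raw, pred_minmax))
```

The `drop` row differs from the other two for a different reason: it removes the numeric features altogether instead of rescaling them.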


### 6.2. **Tuning Pipeline II**: testing encoders on the qualitative variables

In [38]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'lgbmclassifier', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'lgbmclassifier__boosting_type', 'lgbmclassifier__class_weight', 'lgbmclassifier__colsample_bytree', 'lgbmclassifier__importance_type', 'lgbmclassifier__learning_rate', 'lgbmclassifier__max_depth', 'lgbmclassifier__min_child_samples', 'lgbmclassifier__min_child_weight', 'l

In [39]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import category_encoders as ce


le = LabelEncoder()
oe = OrdinalEncoder()
encoder = ce.BackwardDifferenceEncoder()
encoder1 = ce.BaseNEncoder()
encoder2 = ce.BinaryEncoder()
encoder3 = ce.CatBoostEncoder()
encoder5 = ce.GLMMEncoder()
encoder6 = ce.HashingEncoder()
encoder7 = ce.HelmertEncoder()
encoder8 = ce.JamesSteinEncoder()
encoder9 = ce.LeaveOneOutEncoder()
encoder10 = ce.MEstimateEncoder()
encoder13 = ce.SumEncoder()
encoder15 = ce.TargetEncoder()
encoder16 = ce.WOEEncoder()


params = {}
params["columntransformer__onehotencoder"] = [ohe, le, oe, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]

In [40]:
grid = GridSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

grid.fit(x_train, y_train)

In [41]:
results = pd.DataFrame(grid.cv_results_)

results.sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
8,0.5212,0.0189,0.0689,0.0016,HelmertEncoder(),{'columntransformer__onehotencoder': HelmertEn...,0.582,0.5933,0.5965,0.5879,0.5899,0.0055,1
12,0.4533,0.0118,0.0713,0.0022,SumEncoder(),{'columntransformer__onehotencoder': SumEncode...,0.5805,0.5884,0.5935,0.5897,0.588,0.0047,2
3,0.3046,0.0342,0.0429,0.0033,BaseNEncoder(),{'columntransformer__onehotencoder': BaseNEnco...,0.5877,0.5887,0.5862,0.5811,0.5859,0.0029,3
4,0.3333,0.013,0.0431,0.0013,BinaryEncoder(),{'columntransformer__onehotencoder': BinaryEnc...,0.5877,0.5887,0.5862,0.5811,0.5859,0.0029,3
6,15.7233,1.3361,0.0283,0.002,GLMMEncoder(),{'columntransformer__onehotencoder': GLMMEncod...,0.5816,0.5928,0.5798,0.5851,0.5848,0.005,5
13,0.2861,0.0086,0.031,0.0025,TargetEncoder(),{'columntransformer__onehotencoder': TargetEnc...,0.5887,0.5848,0.5721,0.5905,0.584,0.0072,6
14,0.2946,0.0168,0.0286,0.0006,WOEEncoder(),{'columntransformer__onehotencoder': WOEEncode...,0.5742,0.5874,0.5794,0.589,0.5825,0.006,7
9,0.3009,0.0383,0.0299,0.0031,JamesSteinEncoder(),{'columntransformer__onehotencoder': JamesStei...,0.5793,0.587,0.5788,0.5799,0.5812,0.0033,8
11,0.2917,0.05,0.0299,0.0026,MEstimateEncoder(),{'columntransformer__onehotencoder': MEstimate...,0.5806,0.5874,0.5738,0.5792,0.5802,0.0049,9
5,0.5065,0.123,0.0535,0.0046,CatBoostEncoder(),{'columntransformer__onehotencoder': CatBoostE...,0.57,0.5885,0.5683,0.5905,0.5793,0.0102,10


### 6.3. **Tuning Pipeline II**: testing encoders and scalers on the qualitative and quantitative variables

In [42]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'lgbmclassifier', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'lgbmclassifier__boosting_type', 'lgbmclassifier__class_weight', 'lgbmclassifier__colsample_bytree', 'lgbmclassifier__importance_type', 'lgbmclassifier__learning_rate', 'lgbmclassifier__max_depth', 'lgbmclassifier__min_child_samples', 'lgbmclassifier__min_child_weight', 'l

In [43]:
params = {}
params["columntransformer__onehotencoder"] = [ohe, le, oe, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), Normalizer(), RobustScaler(), "drop"]

params

{'columntransformer__onehotencoder': [OneHotEncoder(),
  LabelEncoder(),
  OrdinalEncoder(),
  BaseNEncoder(),
  BinaryEncoder(),
  CatBoostEncoder(),
  GLMMEncoder(),
  HashingEncoder(max_process=4),
  HelmertEncoder(),
  JamesSteinEncoder(),
  LeaveOneOutEncoder(),
  MEstimateEncoder(),
  SumEncoder(),
  TargetEncoder(),
  WOEEncoder(),
  'drop'],
 'columntransformer__standardscaler': [StandardScaler(),
  MinMaxScaler(),
  Normalizer(),
  RobustScaler(),
  'drop']}

In [44]:
randomCV = RandomizedSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

randomCV

In [45]:
randomCV.fit(x_train, y_train)

In [46]:
pd.DataFrame(randomCV.cv_results_).sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__standardscaler,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
1,0.3138,0.0088,0.0423,0.003,RobustScaler(),BinaryEncoder(),{'columntransformer__standardscaler': RobustSc...,0.5877,0.5887,0.5862,0.5811,0.5859,0.0029,1
5,0.3127,0.0191,0.0422,0.0019,StandardScaler(),BinaryEncoder(),{'columntransformer__standardscaler': Standard...,0.5877,0.5887,0.5862,0.5811,0.5859,0.0029,1
8,15.1137,0.7012,0.0284,0.0019,MinMaxScaler(),GLMMEncoder(),{'columntransformer__standardscaler': MinMaxSc...,0.5816,0.5928,0.5798,0.5851,0.5848,0.005,3
9,15.5238,0.977,0.0307,0.0035,RobustScaler(),GLMMEncoder(),{'columntransformer__standardscaler': RobustSc...,0.5816,0.5928,0.5798,0.5851,0.5848,0.005,3
6,0.5255,0.0249,0.0718,0.0037,Normalizer(),HelmertEncoder(),{'columntransformer__standardscaler': Normaliz...,0.5644,0.5993,0.5929,0.5766,0.5833,0.0137,5
3,0.2906,0.0436,0.0292,0.0006,MinMaxScaler(),MEstimateEncoder(),{'columntransformer__standardscaler': MinMaxSc...,0.5806,0.5874,0.5738,0.5792,0.5802,0.0049,6
0,0.395,0.0526,0.0486,0.0011,RobustScaler(),CatBoostEncoder(),{'columntransformer__standardscaler': RobustSc...,0.57,0.5885,0.5683,0.5905,0.5793,0.0102,7
2,0.2637,0.0142,0.0262,0.0026,drop,JamesSteinEncoder(),"{'columntransformer__standardscaler': 'drop', ...",0.5622,0.5849,0.5735,0.5748,0.5739,0.0081,8
4,0.1764,0.0104,0.0141,0.0107,RobustScaler(),OneHotEncoder(),{'columntransformer__standardscaler': RobustSc...,,0.5856,,0.5917,,,9
7,0.1767,0.0074,0.0131,0.0097,StandardScaler(),OneHotEncoder(),{'columntransformer__standardscaler': Standard...,,0.5856,,0.5917,,,10


In [47]:
best_estimator = randomCV.best_estimator_

In [48]:
y_pred = best_estimator.predict(x_test)

In [49]:
pd.crosstab(y_test.values.ravel(), y_pred, rownames = ["Verdadeiro"], colnames = ["Predito pelo modelo"], margins = True)

Verdadeiro \ Predito pelo modelo,0.0,1.0,All
0.0,3108,488,3596
1.0,486,708,1194
All,3594,1196,4790


In [50]:
print(classification_report(y_test.values.ravel(), y_pred))

              precision    recall  f1-score   support

         0.0       0.86      0.86      0.86      3596
         1.0       0.59      0.59      0.59      1194

    accuracy                           0.80      4790
   macro avg       0.73      0.73      0.73      4790
weighted avg       0.80      0.80      0.80      4790



### 6.4. **Tuning Pipeline II**: testing encoders and scalers on the qualitative and quantitative variables, plus LGBM hyperparameters

In [94]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'lgbmclassifier', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'lgbmclassifier__boosting_type', 'lgbmclassifier__class_weight', 'lgbmclassifier__colsample_bytree', 'lgbmclassifier__importance_type', 'lgbmclassifier__learning_rate', 'lgbmclassifier__max_depth', 'lgbmclassifier__min_child_samples', 'lgbmclassifier__min_child_weight', 'l

In [95]:
params = {}
params["columntransformer__onehotencoder"] = [ohe, le, oe, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), Normalizer(), RobustScaler(), "drop"]
params["lgbmclassifier__n_estimators"] = [50, 100, 250, 500, 1000, 5000]
params["lgbmclassifier__num_leaves"] = [10, 100, 250]
params["lgbmclassifier__max_depth"] = [1, 5, 8, 10]
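# Note: 'auto' is not among the documented class_weight values (dict, 'balanced' or None), so candidates using it are expected to fail
params["lgbmclassifier__class_weight"] = ['balanced', 'auto', { 0:0.67, 1:0.33}, {0:0.75, 1:0.25}, {0:0.8, 1:0.2}]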

params

{'columntransformer__onehotencoder': [OneHotEncoder(),
  LabelEncoder(),
  OrdinalEncoder(),
  BaseNEncoder(),
  BinaryEncoder(),
  CatBoostEncoder(),
  GLMMEncoder(),
  HashingEncoder(max_process=4),
  HelmertEncoder(),
  JamesSteinEncoder(),
  LeaveOneOutEncoder(),
  MEstimateEncoder(),
  SumEncoder(),
  TargetEncoder(),
  WOEEncoder(),
  'drop'],
 'columntransformer__standardscaler': [StandardScaler(),
  MinMaxScaler(),
  Normalizer(),
  RobustScaler(),
  'drop'],
 'lgbmclassifier__n_estimators': [50, 100, 250, 500, 1000, 5000],
 'lgbmclassifier__num_leaves': [10, 100, 250],
 'lgbmclassifier__max_depth': [1, 5, 8, 10],
 'lgbmclassifier__class_weight': ['balanced',
  'auto',
  {0: 0.67, 1: 0.33},
  {0: 0.75, 1: 0.25},
  {0: 0.8, 1: 0.2}]}

In [96]:
randomCV = RandomizedSearchCV(final_pipeline, params, cv = 4, scoring = "precision", n_jobs = -1)

randomCV

In [97]:
randomCV.fit(x_train, y_train)

In [98]:
pd.DataFrame(randomCV.cv_results_).sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_lgbmclassifier__num_leaves,param_lgbmclassifier__n_estimators,param_lgbmclassifier__max_depth,param_lgbmclassifier__class_weight,param_columntransformer__standardscaler,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
3,19.8238,0.4199,0.1792,0.0111,10,1000,5,"{0: 0.67, 1: 0.33}",Normalizer(),GLMMEncoder(),"{'lgbmclassifier__num_leaves': 10, 'lgbmclassi...",0.5817,0.5426,0.5433,0.5765,0.561,0.0182,1
0,19.1555,0.609,0.2209,0.0134,250,500,5,"{0: 0.67, 1: 0.33}",RobustScaler(),GLMMEncoder(),"{'lgbmclassifier__num_leaves': 250, 'lgbmclass...",0.5688,0.5457,0.5337,0.5867,0.5587,0.0205,2
4,0.76,0.0732,0.1103,0.0108,100,100,8,balanced,StandardScaler(),JamesSteinEncoder(),"{'lgbmclassifier__num_leaves': 100, 'lgbmclass...",0.5419,0.5498,0.5441,0.5409,0.5442,0.0035,3
2,2.7803,0.2884,0.2953,0.0499,100,500,5,balanced,Normalizer(),HelmertEncoder(),"{'lgbmclassifier__num_leaves': 100, 'lgbmclass...",0.5272,0.5439,0.5376,0.5226,0.5328,0.0084,4
7,0.965,0.0809,0.1609,0.0193,10,250,5,balanced,RobustScaler(),CatBoostEncoder(),"{'lgbmclassifier__num_leaves': 10, 'lgbmclassi...",0.5315,0.5346,0.5209,0.536,0.5307,0.0059,5
1,0.8478,0.1538,0.1295,0.0097,250,100,5,balanced,Normalizer(),BaseNEncoder(),"{'lgbmclassifier__num_leaves': 250, 'lgbmclass...",0.5031,0.5165,0.5135,0.4985,0.5079,0.0074,6
5,0.42,0.0692,0.0,0.0,100,250,8,auto,drop,LeaveOneOutEncoder(),"{'lgbmclassifier__num_leaves': 100, 'lgbmclass...",,,,,,,7
6,0.3964,0.0654,0.0,0.0,10,5000,8,auto,MinMaxScaler(),MEstimateEncoder(),"{'lgbmclassifier__num_leaves': 10, 'lgbmclassi...",,,,,,,8
8,0.453,0.0342,0.0,0.0,10,100,8,auto,MinMaxScaler(),BinaryEncoder(),"{'lgbmclassifier__num_leaves': 10, 'lgbmclassi...",,,,,,,9
9,0.4039,0.0552,0.0,0.0,10,1000,10,auto,Normalizer(),BaseNEncoder(),"{'lgbmclassifier__num_leaves': 10, 'lgbmclassi...",,,,,,,10


In [99]:
best_estimator = randomCV.best_estimator_

In [100]:
y_pred = best_estimator.predict(x_test)

In [101]:
pd.crosstab(y_test.values.ravel(), y_pred, rownames = ["Verdadeiro"], colnames = ["Predito pelo modelo"], margins = True)

Verdadeiro \ Predito pelo modelo,0.0,1.0,All
0.0,3417,179,3596
1.0,932,262,1194
All,4349,441,4790


In [102]:
print(classification_report(y_test.values.ravel(), y_pred))

              precision    recall  f1-score   support

         0.0       0.79      0.95      0.86      3596
         1.0       0.59      0.22      0.32      1194

    accuracy                           0.77      4790
   macro avg       0.69      0.58      0.59      4790
weighted avg       0.74      0.77      0.73      4790

