# Notebook on pipelines

The goal of this notebook is to test several models within a single pipeline and to try out new kinds of encoders.

## 1. Importing the initial libraries

In [1]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

## 2. Context and data loading

A data science and Big Data company offers courses and would like to hire some of the data scientists who complete them. To that end, it created a questionnaire that collects demographic, social, and educational information. The aim is to reduce hiring costs and streamline the hiring process, since the company knows that each candidate still has to be trained and fitted into the team. In short, picture a company that hands you a feedback form at the end of its course and asks whether you would like to hear about its job openings: this is the same situation.

The model's predictions fall into four cases:

1. **True Negative**: candidates the model said are not looking for a new job and who really are not looking.
2. **False Positive**: candidates the model said are looking for a new job but who in fact are not. *This is the most harmful error, because we would contact these people even though they are not looking for a new job.*
3. **False Negative**: candidates the model said are **NOT** looking for a new job but who in fact are.
4. **True Positive**: candidates the model said are looking for a new job and who really are looking.

Because we want to reduce the number of False Positives, we will aim to maximize Precision, the share of candidates flagged as job seekers who really are looking for a new job.
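
To make the metric concrete, here is a minimal sketch that computes Precision as TP / (TP + FP) and compares it with scikit-learn's `precision_score`. The labels are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels: 1 = looking for a new job, 0 = not looking.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_manual = tp / (tp + fp)
precision_sklearn = precision_score(y_true, y_pred)

print(tn, fp, fn, tp)
print(precision_manual, precision_sklearn)
```

Every False Positive enlarges the denominator without adding to the numerator, which is why fewer False Positives means higher Precision.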

In [2]:
dados_treino = pd.read_csv(filepath_or_buffer = "../data/raw/aug_train.csv")

dados_treino

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


## 3. Initial information about the data

In [3]:
dados_treino.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)
me

---

For now, no type conversion is needed, since every column already has a suitable format and dtype.

In [4]:
dados_treino["target"].value_counts()

0.0    14381
1.0     4777
Name: target, dtype: int64

## 4. Splitting into train and test

A separate test set also exists, but it has no labels. We therefore have to split our labeled training data into train and test sets.

Since the data has no temporal dependence, a random split is appropriate.
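
A random split can still be stratified on the label, so that both parts keep the class balance reported by `value_counts()`. The sketch below illustrates this with synthetic labels that only reproduce those class counts; the single placeholder feature is an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported by value_counts() above.
y = np.array([0.0] * 14381 + [1.0] * 4777)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, illustration only

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

# Sizes of both parts, then the share of class 1 overall, in train, and in test.
print(len(y_tr), len(y_te))
print(round(y.mean(), 4), round(y_tr.mean(), 4), round(y_te.mean(), 4))
```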

In [5]:
def func_categ_encod(df):
    # Replace missing values in the categorical columns with an explicit
    # "Sem informação" ("no information") category.
    cols = ["gender", "enrolled_university", "education_level", "major_discipline",
            "experience", "company_size", "company_type", "last_new_job"]

    df1 = df.copy()
    df1[cols] = df1[cols].fillna("Sem informação")

    return df1

In [6]:
dados_treino1 = dados_treino\
.pipe(func_categ_encod)

dados_treino1.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,Sem informação,Sem informação,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,Sem informação,No relevent experience,Full time course,Graduate,STEM,5,Sem informação,Sem informação,never,83,0.0
3,33241,city_115,0.789,Sem informação,No relevent experience,Sem informação,Graduate,Business Degree,<1,Sem informação,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [7]:
from sklearn.model_selection import train_test_split

X = dados_treino1.drop("target", axis = 1)
y = dados_treino1[["target"]]

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1234, stratify = y)

In [8]:
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

(14368, 13) (14368, 1) (4790, 13) (4790, 1)


This gives us 14,368 observations in the training data and 4,790 in the test data.

## 5. Building pipelines

In [9]:
#---- Imports

from sklearn.pipeline import make_pipeline # Builds the Pipeline
from sklearn.compose import make_column_transformer # Applies different transformers to different groups of columns
from sklearn.preprocessing import StandardScaler, FunctionTransformer, OneHotEncoder, MinMaxScaler, Normalizer, PolynomialFeatures, RobustScaler # Encoders and scalers; FunctionTransformer wraps a custom function as a Pipeline step
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV # Cross-validation and parameter searches
from sklearn import set_config # Visually nicer pipelines
from sklearn.linear_model import LogisticRegression # A first model
from sklearn.metrics import classification_report
from sklearn.impute import SimpleImputer


#---- Render pipelines as diagrams

set_config(display = "diagram")

### 5.1. **Pipeline I**: Logistic Regression + OHE (qualitative) + StandardScaler (quantitative)

In [10]:
#---- Defining our model

log_reg = LogisticRegression(random_state = 1234, max_iter = 400)

#---- Defining our encoders

ohe = OneHotEncoder(handle_unknown = 'ignore')
scaler = StandardScaler()

In [11]:
#---- Numeric features, listed so we can apply the scaler to them

numeric_features = ["city_development_index", "training_hours"]

#---- Categorical features, listed so we can apply the OHE to them

categorical_features = list(dados_treino.select_dtypes("object").columns)

In [12]:
ct = make_column_transformer(
    (ohe, categorical_features),
    (scaler, numeric_features),  
    remainder = "drop")

ct

In [13]:
final_pipeline = make_pipeline(ct, log_reg)

final_pipeline.fit(x_train, y_train.values.ravel())

In [14]:
cross_val_score(final_pipeline, x_train, y_train, cv = 5, scoring = "precision")

array([0.58612975, 0.57857143, 0.57630979, 0.61061947, 0.61267606])

**The many `warnings` raised here mean that the encoder did not find any example observation with the category `city_140` when it was fitted, so it cannot generalize correctly to that category in a future prediction.**
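
The sketch below shows what `handle_unknown = 'ignore'` does in that situation: a category that was absent during fitting is encoded as a row of zeros instead of raising an error. The toy frames and the name `ohe_demo` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: "city_140" never appears in the frame used for fitting.
train = pd.DataFrame({"city": ["city_103", "city_21", "city_103"]})
new = pd.DataFrame({"city": ["city_21", "city_140"]})

ohe_demo = OneHotEncoder(handle_unknown="ignore")
ohe_demo.fit(train)

# The result is sparse by default; toarray() makes it easy to inspect.
encoded = ohe_demo.transform(new).toarray()
print(ohe_demo.categories_)
print(encoded)
```

The second row is all zeros, so the model receives no information about the unseen city.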

### 5.2. **Tuning Pipeline I**: testing encoders on the quantitative variables

In [15]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 'lo

In [16]:
params = {}
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), "drop"]
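
The key `columntransformer__standardscaler` works because `make_pipeline` and `make_column_transformer` name each step after its lowercased class name, and nested names are joined with a double underscore. Here is a minimal sketch on a toy frame; `toy`, `ct_demo`, and `pipe_demo` are hypothetical names used only for this illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy frame with one categorical and one numeric column (illustration only).
toy = pd.DataFrame({"gender": ["Male", "Female", "Male", "Female"],
                    "training_hours": [36, 47, 83, 52]})

ct_demo = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    (StandardScaler(), ["training_hours"]),
    remainder="drop",
)
pipe_demo = make_pipeline(ct_demo, LogisticRegression())

# Step names are the lowercased class names, nested with a double underscore.
key = "columntransformer__standardscaler"
print(key in pipe_demo.get_params())

# Replacing the whole step is what the search does for each candidate value.
pipe_demo.set_params(**{key: MinMaxScaler()})
swapped = pipe_demo.get_params()[key]
print(type(swapped).__name__)
```

Note that the key keeps the name `standardscaler` even after another scaler (or `"drop"`) has been placed in that slot.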

In [17]:
grid = GridSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

grid.fit(x_train, y_train)

In [18]:
results = pd.DataFrame(grid.cv_results_)

results.sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__standardscaler,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
1,0.3164,0.0469,0.0175,0.0025,MinMaxScaler(),{'columntransformer__standardscaler': MinMaxSc...,0.5826,0.5875,0.6004,0.6229,0.5983,0.0156,1
0,0.3294,0.0369,0.0157,0.001,StandardScaler(),{'columntransformer__standardscaler': Standard...,0.5819,0.5849,0.6011,0.6243,0.598,0.0168,2
2,0.3257,0.0089,0.0153,0.0021,drop,{'columntransformer__standardscaler': 'drop'},0.5797,0.5775,0.6011,0.6167,0.5938,0.0161,3


### 5.3. **Tuning Pipeline II**: testing encoders on the qualitative variables

In [19]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 'lo

In [20]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import category_encoders as ce


le = LabelEncoder()
encoder = ce.BackwardDifferenceEncoder()
encoder1 = ce.BaseNEncoder()
encoder2 = ce.BinaryEncoder()
encoder3 = ce.CatBoostEncoder()
encoder5 = ce.GLMMEncoder()
encoder6 = ce.HashingEncoder()
encoder7 = ce.HelmertEncoder()
encoder8 = ce.JamesSteinEncoder()
encoder9 = ce.LeaveOneOutEncoder()
encoder10 = ce.MEstimateEncoder()
encoder13 = ce.SumEncoder()
encoder15 = ce.TargetEncoder()
encoder16 = ce.WOEEncoder()


params = {}
params["columntransformer__onehotencoder"] = [ohe, le, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]

In [21]:
grid = GridSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

grid.fit(x_train, y_train)

In [22]:
results = pd.DataFrame(grid.cv_results_)

results.sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.3766,0.0258,0.0174,0.0016,OneHotEncoder(handle_unknown='ignore'),{'columntransformer__onehotencoder': OneHotEnc...,0.5819,0.5849,0.6011,0.6243,0.598,0.0168,1
11,1.5847,0.1436,0.0814,0.0148,SumEncoder(),{'columntransformer__onehotencoder': SumEncode...,0.5819,0.5841,0.6021,0.6236,0.5979,0.0168,2
13,0.3228,0.0797,0.0403,0.0133,WOEEncoder(),{'columntransformer__onehotencoder': WOEEncode...,0.6027,0.5787,0.5793,0.6278,0.5971,0.0202,3
14,0.0382,0.0116,0.0088,0.0037,drop,{'columntransformer__onehotencoder': 'drop'},0.5958,0.5743,0.5877,0.6214,0.5948,0.0172,4
5,21.1027,0.9649,0.0447,0.012,GLMMEncoder(),{'columntransformer__onehotencoder': GLMMEncod...,0.586,0.5847,0.5898,0.6171,0.5944,0.0132,5
7,2.7162,0.1775,0.0863,0.0158,HelmertEncoder(),{'columntransformer__onehotencoder': HelmertEn...,0.582,0.5893,0.5932,0.612,0.5941,0.0111,6
4,0.56,0.06,0.0867,0.0214,CatBoostEncoder(),{'columntransformer__onehotencoder': CatBoostE...,0.5831,0.5702,0.584,0.6268,0.591,0.0214,7
9,0.3844,0.043,0.0506,0.0039,LeaveOneOutEncoder(),{'columntransformer__onehotencoder': LeaveOneO...,0.5922,0.5625,0.5857,0.6219,0.5906,0.0212,8
8,0.3816,0.0252,0.044,0.0114,JamesSteinEncoder(),{'columntransformer__onehotencoder': JamesStei...,0.5957,0.5674,0.5711,0.6195,0.5884,0.021,9
12,0.3759,0.0476,0.0332,0.01,TargetEncoder(),{'columntransformer__onehotencoder': TargetEnc...,0.5916,0.5721,0.569,0.611,0.5859,0.0169,10


### 5.4. **Tuning Pipeline II**: testing encoders on the qualitative and quantitative variables

In [23]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 'lo

In [24]:
params = {}
params["columntransformer__onehotencoder"] = [ohe, le, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), Normalizer(), RobustScaler(), "drop"]

params

{'columntransformer__onehotencoder': [OneHotEncoder(handle_unknown='ignore'),
  LabelEncoder(),
  BaseNEncoder(),
  BinaryEncoder(),
  CatBoostEncoder(),
  GLMMEncoder(),
  HashingEncoder(max_process=4),
  HelmertEncoder(),
  JamesSteinEncoder(),
  LeaveOneOutEncoder(),
  MEstimateEncoder(),
  SumEncoder(),
  TargetEncoder(),
  WOEEncoder(),
  'drop'],
 'columntransformer__standardscaler': [StandardScaler(),
  MinMaxScaler(),
  Normalizer(),
  RobustScaler(),
  'drop']}

In [25]:
randomCV = RandomizedSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

randomCV

In [26]:
randomCV.fit(x_train, y_train)
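
`RandomizedSearchCV` evaluates only `n_iter` sampled candidates (10 by default), whereas the grid defined above has 15 × 5 = 75 combinations. Without a `random_state`, the sampled candidates can also change from run to run. The sketch below shows both parameters on synthetic data; the search space and the values of `n_iter` and `random_state` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic data standing in for x_train / y_train (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=400))
space = {
    "standardscaler": [StandardScaler(), MinMaxScaler(), RobustScaler(), "passthrough"],
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0],
}
n_total = len(ParameterGrid(space))  # size of the full grid

search = RandomizedSearchCV(
    pipe_demo, space,
    n_iter=6,            # number of sampled candidates (the default is 10)
    cv=4, scoring="precision",
    random_state=1234,   # fixes which candidates are sampled
)
search.fit(X_demo, y_demo)

n_tried = len(search.cv_results_["params"])
print(n_total, n_tried)
```

Only a fraction of the grid is ever scored, so a combination missing from the results table was not necessarily worse; it may simply not have been sampled.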

In [27]:
pd.DataFrame(randomCV.cv_results_).sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__standardscaler,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
9,0.3387,0.023,0.0167,0.0016,RobustScaler(),OneHotEncoder(handle_unknown='ignore'),{'columntransformer__standardscaler': RobustSc...,0.5819,0.5852,0.6,0.6243,0.5979,0.0167,1
1,21.9222,0.849,0.0449,0.0139,MinMaxScaler(),GLMMEncoder(),{'columntransformer__standardscaler': MinMaxSc...,0.5844,0.5843,0.5898,0.6171,0.5939,0.0135,2
4,0.4767,0.042,0.0734,0.0156,StandardScaler(),CatBoostEncoder(),{'columntransformer__standardscaler': Standard...,0.5831,0.5702,0.584,0.6268,0.591,0.0214,3
6,0.4748,0.044,0.0622,0.0101,RobustScaler(),CatBoostEncoder(),{'columntransformer__standardscaler': RobustSc...,0.5831,0.5702,0.584,0.6268,0.591,0.0214,3
3,0.3906,0.0401,0.0543,0.0074,RobustScaler(),LeaveOneOutEncoder(),{'columntransformer__standardscaler': RobustSc...,0.5922,0.5625,0.5847,0.6219,0.5903,0.0212,5
7,0.3386,0.0411,0.0384,0.0152,MinMaxScaler(),JamesSteinEncoder(),{'columntransformer__standardscaler': MinMaxSc...,0.5972,0.5651,0.5711,0.6195,0.5882,0.0217,6
8,2.9373,0.1569,0.0991,0.0098,Normalizer(),HelmertEncoder(),{'columntransformer__standardscaler': Normaliz...,0.5814,0.5727,0.5897,0.6057,0.5874,0.0122,7
5,0.3523,0.0093,0.0599,0.0084,drop,LeaveOneOutEncoder(),"{'columntransformer__standardscaler': 'drop', ...",0.5851,0.5599,0.5697,0.6218,0.5841,0.0235,8
0,0.6265,0.066,0.0641,0.02,drop,BinaryEncoder(),"{'columntransformer__standardscaler': 'drop', ...",0.5901,0.5396,0.5182,0.5645,0.5531,0.0269,9
2,0.0055,0.0014,0.0,0.0,drop,LabelEncoder(),"{'columntransformer__standardscaler': 'drop', ...",,,,,,,10
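
The `LabelEncoder()` row in the table above has no scores. A plausible explanation is that `LabelEncoder` is meant for the target: its `fit_transform` accepts a single 1-D array, while a column transformer passes a 2-D block of features together with `y`. `OrdinalEncoder` is scikit-learn's counterpart for feature columns. The sketch below illustrates the difference on a toy frame; the `handle_unknown` settings are my own choice, not something the notebook uses.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "education_level": ["Graduate", "Masters", "Graduate"],
})
cols = ["gender", "education_level"]

# LabelEncoder inside a column transformer: the call fit_transform(X, y)
# does not match its signature, so fitting is expected to raise an error.
try:
    make_column_transformer((LabelEncoder(), cols)).fit_transform(toy)
    label_encoder_failed = False
except Exception:  # a TypeError in the versions I am aware of
    label_encoder_failed = True

# OrdinalEncoder handles 2-D feature blocks; unseen categories map to -1 here.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoded = make_column_transformer((ordinal, cols)).fit_transform(toy)

print(label_encoder_failed)
print(encoded)
```

By default `OrdinalEncoder` numbers the categories in sorted order, which carries no real meaning for nominal columns such as `city`.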


In [28]:
best_estimator = randomCV.best_estimator_

In [29]:
y_pred = best_estimator.predict(x_test)

In [30]:
pd.crosstab(y_test.values.ravel(), y_pred, rownames = ["Verdadeiro"], colnames = ["Predito pelo modelo"], margins = True)

Predito pelo modelo,0.0,1.0,All
Verdadeiro,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,3322,274,3596
1.0,734,460,1194
All,4056,734,4790


In [31]:
print(classification_report(y_test.values.ravel(), y_pred))

              precision    recall  f1-score   support

         0.0       0.82      0.92      0.87      3596
         1.0       0.63      0.39      0.48      1194

    accuracy                           0.79      4790
   macro avg       0.72      0.65      0.67      4790
weighted avg       0.77      0.79      0.77      4790



## 6. Testing an LGBM

In [32]:
from lightgbm import LGBMClassifier

In [33]:
lgbm = LGBMClassifier(n_jobs = -1, random_state = 1234)

In [34]:
final_pipeline = make_pipeline(ct, lgbm)

final_pipeline.fit(x_train, y_train.values.ravel())

In [35]:
cross_val_score(final_pipeline, x_train, y_train, cv = 5, scoring = "precision")

array([0.58978102, 0.58166189, 0.5994109 , 0.58345221, 0.59684362])

### 6.1. **Tuning Pipeline I**: testing encoders on the quantitative variables

In [36]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'lgbmclassifier', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'lgbmclassifier__boosting_type', 'lgbmclassifier__class_weight', 'lgbmclassifier__colsample_bytree', 'lgbmclassifier__importance_type', 'lgbmclassifier__learning_rate', 'lgbmclassifier__max_depth', 'lgbmclassifier__min_child_samples', 'lgbmclassifier__min_child_weight', 'l

In [37]:
params = {}
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), "drop"]

In [38]:
grid = GridSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

grid.fit(x_train, y_train)

In [39]:
results = pd.DataFrame(grid.cv_results_)

results.sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__standardscaler,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
2,0.2671,0.0427,0.0205,0.0018,drop,{'columntransformer__standardscaler': 'drop'},0.5758,0.5955,0.5974,0.5872,0.589,0.0085,1
0,0.402,0.0802,0.0231,0.002,StandardScaler(),{'columntransformer__standardscaler': Standard...,0.5829,0.5856,0.5935,0.5917,0.5884,0.0043,2
1,0.3195,0.0317,0.0259,0.0028,MinMaxScaler(),{'columntransformer__standardscaler': MinMaxSc...,0.5829,0.5856,0.5935,0.5917,0.5884,0.0043,2
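
The `StandardScaler()` and `MinMaxScaler()` rows above have identical scores in every fold. A likely explanation is that tree-based models split on thresholds, so rescaling a feature without changing the order of its values leaves the resulting partitions unchanged. The sketch below checks this on synthetic data with scikit-learn's `DecisionTreeClassifier`, used here as a simple stand-in for the LGBM; it is an illustration of the idea, not a reproduction of the experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=0)
X_fit, X_new = X_demo[:400], X_demo[400:]
y_fit = y_demo[:400]

def predict_with(scaler):
    # Fit the scaler and the tree on the same rows, then predict unseen rows.
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(scaler.fit_transform(X_fit), y_fit)
    return tree.predict(scaler.transform(X_new))

pred_standard = predict_with(StandardScaler())
pred_minmax = predict_with(MinMaxScaler())

# Share of unseen rows on which both variants make the same prediction.
agreement = float(np.mean(pred_standard == pred_minmax))
print(agreement)
```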


### 6.2. **Tuning Pipeline II**: testing encoders on the qualitative variables

In [40]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'lgbmclassifier', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'lgbmclassifier__boosting_type', 'lgbmclassifier__class_weight', 'lgbmclassifier__colsample_bytree', 'lgbmclassifier__importance_type', 'lgbmclassifier__learning_rate', 'lgbmclassifier__max_depth', 'lgbmclassifier__min_child_samples', 'lgbmclassifier__min_child_weight', 'l

In [41]:
params = {}
params["columntransformer__onehotencoder"] = [ohe, le, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]

In [42]:
grid = GridSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

grid.fit(x_train, y_train)

In [43]:
results = pd.DataFrame(grid.cv_results_)

results.sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
7,1.2152,0.4145,0.0786,0.0056,HelmertEncoder(),{'columntransformer__onehotencoder': HelmertEn...,0.582,0.5933,0.5965,0.5879,0.5899,0.0055,1
0,0.4067,0.0947,0.0231,0.0014,OneHotEncoder(handle_unknown='ignore'),{'columntransformer__onehotencoder': OneHotEnc...,0.5829,0.5856,0.5935,0.5917,0.5884,0.0043,2
11,0.7368,0.0538,0.0822,0.004,SumEncoder(),{'columntransformer__onehotencoder': SumEncode...,0.5805,0.5884,0.5935,0.5897,0.588,0.0047,3
2,0.4692,0.0195,0.0507,0.003,BaseNEncoder(),{'columntransformer__onehotencoder': BaseNEnco...,0.5877,0.5887,0.5862,0.5811,0.5859,0.0029,4
3,0.4657,0.0153,0.0502,0.002,BinaryEncoder(),{'columntransformer__onehotencoder': BinaryEnc...,0.5877,0.5887,0.5862,0.5811,0.5859,0.0029,4
5,21.9058,1.5185,0.0353,0.0025,GLMMEncoder(),{'columntransformer__onehotencoder': GLMMEncod...,0.5816,0.5928,0.5798,0.5851,0.5848,0.005,6
12,0.582,0.0979,0.0341,0.0007,TargetEncoder(),{'columntransformer__onehotencoder': TargetEnc...,0.5887,0.5848,0.5721,0.5905,0.584,0.0072,7
13,0.4921,0.09,0.0323,0.0029,WOEEncoder(),{'columntransformer__onehotencoder': WOEEncode...,0.5742,0.5874,0.5794,0.589,0.5825,0.006,8
8,0.5178,0.0802,0.0353,0.0012,JamesSteinEncoder(),{'columntransformer__onehotencoder': JamesStei...,0.5793,0.587,0.5788,0.5799,0.5812,0.0033,9
10,0.4256,0.0086,0.0342,0.0018,MEstimateEncoder(),{'columntransformer__onehotencoder': MEstimate...,0.5806,0.5874,0.5738,0.5792,0.5802,0.0049,10


### 6.3. **Tuning Pipeline II**: testing encoders on the qualitative and quantitative variables

In [44]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'lgbmclassifier', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'lgbmclassifier__boosting_type', 'lgbmclassifier__class_weight', 'lgbmclassifier__colsample_bytree', 'lgbmclassifier__importance_type', 'lgbmclassifier__learning_rate', 'lgbmclassifier__max_depth', 'lgbmclassifier__min_child_samples', 'lgbmclassifier__min_child_weight', 'l

In [45]:
params = {}
params["columntransformer__onehotencoder"] = [ohe, le, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), Normalizer(), RobustScaler(), "drop"]

params

{'columntransformer__onehotencoder': [OneHotEncoder(handle_unknown='ignore'),
  LabelEncoder(),
  BaseNEncoder(),
  BinaryEncoder(),
  CatBoostEncoder(),
  GLMMEncoder(),
  HashingEncoder(max_process=4),
  HelmertEncoder(),
  JamesSteinEncoder(),
  LeaveOneOutEncoder(),
  MEstimateEncoder(),
  SumEncoder(),
  TargetEncoder(),
  WOEEncoder(),
  'drop'],
 'columntransformer__standardscaler': [StandardScaler(),
  MinMaxScaler(),
  Normalizer(),
  RobustScaler(),
  'drop']}

In [46]:
randomCV = RandomizedSearchCV(final_pipeline, params, cv = 4, scoring = "precision")

randomCV

In [47]:
randomCV.fit(x_train, y_train)

In [48]:
pd.DataFrame(randomCV.cv_results_).sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__standardscaler,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
9,0.2685,0.095,0.0218,0.0023,RobustScaler(),OneHotEncoder(handle_unknown='ignore'),{'columntransformer__standardscaler': RobustSc...,0.5829,0.5856,0.5935,0.5917,0.5884,0.0043,1
1,19.7568,3.7356,0.0309,0.0041,MinMaxScaler(),GLMMEncoder(),{'columntransformer__standardscaler': MinMaxSc...,0.5816,0.5928,0.5798,0.5851,0.5848,0.005,2
8,0.5804,0.0391,0.0719,0.0053,Normalizer(),HelmertEncoder(),{'columntransformer__standardscaler': Normaliz...,0.5644,0.5993,0.5929,0.5766,0.5833,0.0137,3
7,0.4764,0.1144,0.0298,0.0033,MinMaxScaler(),JamesSteinEncoder(),{'columntransformer__standardscaler': MinMaxSc...,0.5793,0.587,0.5788,0.5799,0.5812,0.0033,4
0,0.4747,0.0541,0.0492,0.0021,drop,BinaryEncoder(),"{'columntransformer__standardscaler': 'drop', ...",0.5681,0.577,0.594,0.5808,0.58,0.0093,5
4,0.5209,0.1397,0.0547,0.0041,StandardScaler(),CatBoostEncoder(),{'columntransformer__standardscaler': Standard...,0.57,0.5885,0.5683,0.5905,0.5793,0.0102,6
6,0.4636,0.0888,0.0512,0.0032,RobustScaler(),CatBoostEncoder(),{'columntransformer__standardscaler': RobustSc...,0.57,0.5885,0.5683,0.5905,0.5793,0.0102,6
3,0.4858,0.0883,0.0459,0.0029,RobustScaler(),LeaveOneOutEncoder(),{'columntransformer__standardscaler': RobustSc...,0.0,0.0,0.2162,0.0,0.0541,0.0936,8
5,0.3255,0.0589,0.0451,0.001,drop,LeaveOneOutEncoder(),"{'columntransformer__standardscaler': 'drop', ...",0.0,0.0,0.2162,0.0,0.0541,0.0936,8
2,0.0045,0.0005,0.0,0.0,drop,LabelEncoder(),"{'columntransformer__standardscaler': 'drop', ...",,,,,,,10


In [49]:
best_estimator = randomCV.best_estimator_

In [50]:
y_pred = best_estimator.predict(x_test)

In [51]:
pd.crosstab(y_test.values.ravel(), y_pred, rownames = ["Verdadeiro"], colnames = ["Predito pelo modelo"], margins = True)

Predito pelo modelo,0.0,1.0,All
Verdadeiro,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,3120,476,3596
1.0,496,698,1194
All,3616,1174,4790


In [52]:
print(classification_report(y_test.values.ravel(), y_pred))

              precision    recall  f1-score   support

         0.0       0.86      0.87      0.87      3596
         1.0       0.59      0.58      0.59      1194

    accuracy                           0.80      4790
   macro avg       0.73      0.73      0.73      4790
weighted avg       0.80      0.80      0.80      4790



### 6.4. **Tuning Pipeline II**: testing encoders on the qualitative and quantitative variables, plus LGBM hyperparameters

In [53]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'lgbmclassifier', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'lgbmclassifier__boosting_type', 'lgbmclassifier__class_weight', 'lgbmclassifier__colsample_bytree', 'lgbmclassifier__importance_type', 'lgbmclassifier__learning_rate', 'lgbmclassifier__max_depth', 'lgbmclassifier__min_child_samples', 'lgbmclassifier__min_child_weight', 'l

In [54]:
params = {}
params["columntransformer__onehotencoder"] = [ohe, le, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), Normalizer(), RobustScaler(), "drop"]
params["lgbmclassifier__n_estimators"] = [50, 100, 250, 500, 1000, 5000]
params["lgbmclassifier__num_leaves"] = [10, 100, 250]
params["lgbmclassifier__max_depth"] = [1, 5, 8, 10]
params["lgbmclassifier__class_weight"] = ['balanced', 'auto', { 0:0.67, 1:0.33}, {0:0.75, 1:0.25}, {0:0.8, 1:0.2}]

params.keys()

dict_keys(['columntransformer__onehotencoder', 'columntransformer__standardscaler', 'lgbmclassifier__n_estimators', 'lgbmclassifier__num_leaves', 'lgbmclassifier__max_depth', 'lgbmclassifier__class_weight'])

In [55]:
randomCV = RandomizedSearchCV(final_pipeline, params, cv = 4, scoring = "precision", n_jobs = -1)

randomCV

In [56]:
randomCV.fit(x_train, y_train)

In [57]:
pd.DataFrame(randomCV.cv_results_).sort_values("rank_test_score").round(4)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_lgbmclassifier__num_leaves,param_lgbmclassifier__n_estimators,param_lgbmclassifier__max_depth,param_lgbmclassifier__class_weight,param_columntransformer__standardscaler,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
7,12.0677,0.4473,0.0582,0.0036,10,250,5,balanced,RobustScaler(),GLMMEncoder(),"{'lgbmclassifier__num_leaves': 10, 'lgbmclassi...",0.5367,0.5464,0.5371,0.5301,0.5376,0.0058,1
2,1.2083,0.1116,0.2282,0.0514,100,500,5,balanced,Normalizer(),JamesSteinEncoder(),"{'lgbmclassifier__num_leaves': 100, 'lgbmclass...",0.5231,0.5412,0.5303,0.5282,0.5307,0.0066,2
1,0.6714,0.0455,0.1138,0.0158,250,100,5,balanced,Normalizer(),BinaryEncoder(),"{'lgbmclassifier__num_leaves': 250, 'lgbmclass...",0.5031,0.5165,0.5135,0.4985,0.5079,0.0074,3
4,0.6202,0.0282,0.1182,0.0293,100,100,8,balanced,StandardScaler(),LeaveOneOutEncoder(),"{'lgbmclassifier__num_leaves': 100, 'lgbmclass...",0.0,0.0,0.2162,0.0,0.0541,0.0936,4
0,1.4756,0.043,0.0,0.0,250,500,5,"{0: 0.67, 1: 0.33}",RobustScaler(),HashingEncoder(max_process=4),"{'lgbmclassifier__num_leaves': 250, 'lgbmclass...",,,,,,,5
3,1.3344,0.0264,0.0,0.0,10,1000,5,"{0: 0.67, 1: 0.33}",Normalizer(),HashingEncoder(max_process=4),"{'lgbmclassifier__num_leaves': 10, 'lgbmclassi...",,,,,,,6
5,0.3585,0.0789,0.0,0.0,100,250,8,auto,drop,MEstimateEncoder(),"{'lgbmclassifier__num_leaves': 100, 'lgbmclass...",,,,,,,7
6,0.7927,0.1231,0.0,0.0,10,5000,8,auto,MinMaxScaler(),SumEncoder(),"{'lgbmclassifier__num_leaves': 10, 'lgbmclassi...",,,,,,,8
8,0.5403,0.0602,0.0,0.0,10,100,8,auto,MinMaxScaler(),CatBoostEncoder(),"{'lgbmclassifier__num_leaves': 10, 'lgbmclassi...",,,,,,,9
9,0.3953,0.0569,0.0,0.0,10,1000,10,auto,Normalizer(),BinaryEncoder(),"{'lgbmclassifier__num_leaves': 10, 'lgbmclassi...",,,,,,,10
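
Every candidate with `class_weight = 'auto'` has empty scores in the table above. As far as I know, `'auto'` is not a value that scikit-learn's class-weight utilities accept (the options are `'balanced'`, a dictionary, or `None`), which would explain those failures. The sketch below shows what `'balanced'` computes, using synthetic labels that only reproduce the class counts from `value_counts()`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the class counts reported by value_counts() earlier.
y_demo = np.array([0.0] * 14381 + [1.0] * 4777)
classes = np.array([0.0, 1.0])

# "balanced" weights each class by n_samples / (n_classes * class_count).
balanced = compute_class_weight("balanced", classes=classes, y=y_demo)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo.astype(int)))

print(balanced.round(4), manual.round(4))

# "auto" is expected to be rejected (assumption based on current scikit-learn).
try:
    compute_class_weight("auto", classes=classes, y=y_demo)
    auto_rejected = False
except ValueError:
    auto_rejected = True
print(auto_rejected)
```

Note that `'balanced'` gives the larger weight to the minority class (1), which is the opposite direction from the hand-written dictionaries in the search space, where class 0 receives the larger weight.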


In [58]:
best_estimator = randomCV.best_estimator_

In [59]:
y_pred = best_estimator.predict(x_test)

In [60]:
pd.crosstab(y_test.values.ravel(), y_pred, rownames = ["Verdadeiro"], colnames = ["Predito pelo modelo"], margins = True)

Predito pelo modelo,0.0,1.0,All
Verdadeiro,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,2838,758,3596
1.0,310,884,1194
All,3148,1642,4790


In [61]:
print(classification_report(y_test.values.ravel(), y_pred))

              precision    recall  f1-score   support

         0.0       0.90      0.79      0.84      3596
         1.0       0.54      0.74      0.62      1194

    accuracy                           0.78      4790
   macro avg       0.72      0.76      0.73      4790
weighted avg       0.81      0.78      0.79      4790
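
To compare the three test-set evaluations on the metric this notebook cares about, the arithmetic below recomputes Precision and Recall for class 1 from the confusion matrices printed in sections 5.4, 6.3, and 6.4. The counts are copied from those outputs; the labels given to each model are mine.

```python
# Counts (TN, FP, FN, TP) copied from the three confusion matrices above.
matrices = {
    "Logistic regression (5.4)": (3322, 274, 734, 460),
    "LGBM, encoder search (6.3)": (3120, 476, 496, 698),
    "LGBM, encoder + hyperparameter search (6.4)": (2838, 758, 310, 884),
}

summary = {}
for name, (tn, fp, fn, tp) in matrices.items():
    precision = tp / (tp + fp)  # share of flagged candidates who are job seekers
    recall = tp / (tp + fn)     # share of job seekers the model flags
    summary[name] = (round(precision, 2), round(recall, 2))
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}, false positives={fp}")
```

On this test split the logistic regression has the highest Precision for class 1 and the fewest False Positives, while the LGBM with tuned hyperparameters trades Precision for a much higher Recall.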

