# Notebook with information about pipelines

The idea of this notebook is to test several models within a single pipeline and to try out new types of encoders.

## 1. Importing the initial libraries

In [1]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

## 2. Context and loading the data

A data science and Big Data company offers courses and would like to hire some of the data scientists who completed them. To that end, it created a questionnaire that collects demographic, social, and educational information, among other things, with the goal of reducing hiring costs and streamlining the hiring process, since the company knows that a candidate must be trained and fitted to the team. In short, imagine a company that sends you a feedback questionnaire at the end of its course and asks whether you would like to receive job openings from it: this is the same situation.

Each prediction of the model falls into one of four cases:

1. **True Negative**: candidates the model said are not looking for a new job and who really are not looking.
2. **False Positive**: candidates the model said are looking, but who in reality are not looking for a new job. *This is the most harmful case, because we would contact these candidates even though they are not looking for a new job.*
3. **False Negative**: candidates the model said are **NOT** looking, but who in reality are looking for a new job.
4. **True Positive**: candidates the model said are looking and who really are looking for a new job.

Since we want to reduce the number of False Positives, we will aim to maximize Precision.
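
To make the four cases concrete, here is a minimal sketch that counts them on a small made-up set of labels and computes precision as TP / (TP + FP). The lists `y_true_toy` and `y_pred_toy` are hypothetical and are not taken from the dataset.

```python
# Hypothetical labels: 1 = looking for a new job, 0 = not looking
y_true_toy = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred_toy = [0, 1, 0, 1, 0, 1, 0, 1]

pairs = list(zip(y_true_toy, y_pred_toy))

# Count each of the four outcomes
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives

# Precision: of everyone predicted as "looking", how many really are
precision = tp / (tp + fp)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} precision={precision:.2f}")
```

Fewer false positives push precision toward 1, which is why a higher value is better for this use case.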

In [2]:
dados_treino = pd.read_csv(filepath_or_buffer = "../data/raw/aug_train.csv")

dados_treino

,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


## 3. Initial information about the data

In [3]:
dados_treino.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

---

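Initially, we do not need to convert any column, since all of them already have a suitable format and dtype. Note, however, that several columns have missing values (for example `gender`, `company_size`, and `company_type`).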

In [4]:
dados_treino["target"].value_counts()

0.0    14381
1.0     4777
Name: target, dtype: int64
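
The counts above show that the target is imbalanced. As a quick check, the share of the positive class can be computed directly from the two counts in the output. The variable names below are illustrative.

```python
# Class counts copied from the value_counts() output above
n_not_looking = 14381  # target == 0.0
n_looking = 4777       # target == 1.0

total = n_not_looking + n_looking
positive_rate = n_looking / total

print(f"total = {total}, positive share = {positive_rate:.1%}")
```

Roughly one candidate in four belongs to the positive class, so a model that always predicts 0.0 would already reach about 75% accuracy.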

## 4. Splitting into train and test

Although a test set is also available, it has no labels. We therefore have to split our training data (which is labeled) into train and test sets.

Since the data has no temporal dependence, we can split it at random.

In [5]:
from sklearn.model_selection import train_test_split

X = dados_treino.drop("target", axis = 1)
y = dados_treino[["target"]]

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1234, stratify = y)

In [6]:
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

(14368, 13) (14368, 1) (4790, 13) (4790, 1)


This gives us 14,368 observations in the training data and 4,790 in the test data.
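
The split above passes `stratify = y`, which keeps the class proportions the same in both parts. The sketch below illustrates that effect on a small synthetic target. The arrays `X_toy` and `y_toy` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives and 20 positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1234, stratify=y_toy
)

# With stratification, the positive share is preserved in both parts
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```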

## 5. Creating pipelines

In [7]:
#---- Imports

from sklearn.pipeline import make_pipeline # Builds a Pipeline with automatically named steps
from sklearn.compose import make_column_transformer # Applies different transformers to different groups of columns
from sklearn.preprocessing import StandardScaler, FunctionTransformer, OneHotEncoder, MinMaxScaler, Normalizer, PolynomialFeatures, RobustScaler # Scalers, encoders, and other preprocessors
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV # Cross-validation and hyperparameter searches
from sklearn import set_config # Nicer visual display of pipelines
from sklearn.linear_model import LogisticRegression # A first model

#---- Making the pipelines display nicely

set_config(display = "diagram")

### 5.1. **Pipeline I**: Logistic Regression + OHE (categorical) + StandardScaler (numeric)

In [8]:
#---- Defining our model

log_reg = LogisticRegression(random_state = 1234, max_iter = 400)

#---- Defining our encoders

ohe = OneHotEncoder()
scaler = StandardScaler()

In [9]:
#---- Listing the numeric features that will go through the scaler

numeric_features = ["city_development_index", "training_hours"]

#---- Listing the categorical features that will go through the OHE

categorical_features = list(dados_treino.select_dtypes("object").columns)

In [10]:
ct = make_column_transformer(
    (ohe, categorical_features),
    (scaler, numeric_features),  
    remainder = "drop")

ct

In [11]:
final_pipeline = make_pipeline(ct, log_reg)

final_pipeline.fit(x_train, y_train.values.ravel())

In [12]:
cross_val_score(final_pipeline, x_train, y_train, cv = 5, scoring = "precision")

array([       nan, 0.57857143,        nan, 0.61061947, 0.61267606])

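**Despite all these `warnings`, the `nan` scores mean that, in some folds, the validation data contained a category (such as `city_140`) that never appeared in the training data of that fold. By default, `OneHotEncoder` raises an error for unknown categories, so those folds could not be scored.**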
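
The sketch below reproduces this behavior in isolation and shows one possible way around it, the `handle_unknown = "ignore"` option of `OneHotEncoder`. The tiny arrays are made up for illustration, and this option is not what the notebook's pipeline uses.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the category seen at prediction time was never seen in training
train_cities = np.array([["city_103"], ["city_40"], ["city_21"]])
unseen_city = np.array([["city_140"]])

# Default behavior (handle_unknown="error"): transform raises on unknown categories
strict_encoder = OneHotEncoder()
strict_encoder.fit(train_cities)
try:
    strict_encoder.transform(unseen_city)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown="ignore", the unknown category becomes an all-zero row
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train_cities)
encoded = tolerant_encoder.transform(unseen_city).toarray()

print(strict_failed, encoded)
```

Whether ignoring unknown categories is appropriate here is a modeling decision, and adopting it would change the scores reported in this notebook.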

### 5.2. **Tuning Pipeline I**: Testing encoders on the numeric variables

In [13]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 'lo

In [14]:
params = {}
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), "drop"]
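
The key `columntransformer__standardscaler` follows scikit-learn's naming convention: `make_pipeline` and `make_column_transformer` name each step after its lowercased class name, and nested parameters are addressed with a double underscore. Assigning a new object to a step name replaces that step, which is what the search does for each candidate. A minimal sketch on a simpler, hypothetical pipeline:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# make_pipeline derives step names from the class names
pipe = make_pipeline(StandardScaler(), LogisticRegression())
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Replacing a whole step by assigning to its name
pipe.set_params(standardscaler=MinMaxScaler())
print(type(pipe.named_steps["standardscaler"]).__name__)

# Nested parameters use the "<step>__<parameter>" form
pipe.set_params(logisticregression__C=0.5)
```

Note that the step keeps its original name (`standardscaler`) even after it holds a different transformer.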

In [15]:
grid = GridSearchCV(final_pipeline, params, cv = 4)

grid.fit(x_train, y_train)

In [16]:
results = pd.DataFrame(grid.cv_results_)

results.sort_values("rank_test_score").round(4)

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__standardscaler,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.2296,0.0269,0.0075,0.0043,StandardScaler(),{'columntransformer__standardscaler': Standard...,,0.7756,,0.7876,,,1
1,0.1893,0.0233,0.0076,0.0049,MinMaxScaler(),{'columntransformer__standardscaler': MinMaxSc...,,0.7762,,0.787,,,2
2,0.2044,0.0295,0.006,0.003,drop,{'columntransformer__standardscaler': 'drop'},,0.7728,,0.7848,,,3
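
`GridSearchCV` is called here without a `scoring` argument, in which case it uses the estimator's default `score` method, which is accuracy for classifiers. Since the stated goal is precision, one option is to pass `scoring = "precision"`, as the earlier `cross_val_score` call does. The sketch below uses synthetic data from `make_classification` (an assumption for illustration), not the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=1234)

# Same kind of search, but ranked by precision instead of accuracy
grid_toy = GridSearchCV(
    LogisticRegression(max_iter=400),
    {"C": [0.1, 1.0, 10.0]},
    cv=4,
    scoring="precision",
)
grid_toy.fit(X_toy, y_toy)

print(grid_toy.best_params_, round(grid_toy.best_score_, 4))
```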


### 5.3. **Tuning Pipeline II**: Testing encoders on the categorical variables

In [17]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 'lo

In [18]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import category_encoders as ce

# Candidate encoders for the categorical columns.
# Note: LabelEncoder is meant for encoding the target (a 1-D y), not feature
# columns, so it is expected to fail inside a ColumnTransformer.
le = LabelEncoder()
oe = OrdinalEncoder()
encoder = ce.BackwardDifferenceEncoder() # Defined, but not included in the search below
encoder1 = ce.BaseNEncoder()
encoder2 = ce.BinaryEncoder()
encoder3 = ce.CatBoostEncoder()
encoder5 = ce.GLMMEncoder()
encoder6 = ce.HashingEncoder()
encoder7 = ce.HelmertEncoder()
encoder8 = ce.JamesSteinEncoder()
encoder9 = ce.LeaveOneOutEncoder()
encoder10 = ce.MEstimateEncoder()
encoder13 = ce.SumEncoder()
encoder15 = ce.TargetEncoder()
encoder16 = ce.WOEEncoder()

# Search space: each entry replaces the "onehotencoder" step of the ColumnTransformer
params = {}
params["columntransformer__onehotencoder"] = [ohe, le, oe, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]
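
Several of the encoders listed above (for example `TargetEncoder`, `MEstimateEncoder`, and `LeaveOneOutEncoder`) replace each category with a statistic of the target. The sketch below shows only the basic idea, a plain per-category mean of the target computed with pandas on made-up data. It is a simplification: the `category_encoders` implementations add details, such as smoothing or leaving out the current row, that are not reproduced here.

```python
import pandas as pd

# Hypothetical data: one categorical column and a binary target
toy = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "target": [1, 0, 1, 1, 0, 0],
})

# Mean of the target within each category
category_means = toy.groupby("city")["target"].mean()

# Replace each category by its mean target value
toy["city_encoded"] = toy["city"].map(category_means)

print(toy)
```

Because the encoding uses the target, it has to be learned on training folds only, which is what happens when the encoder lives inside the pipeline.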

In [19]:
grid = GridSearchCV(final_pipeline, params, cv = 4)

grid.fit(x_train, y_train)

In [20]:
results = pd.DataFrame(grid.cv_results_)

results.sort_values("rank_test_score").round(4)

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
12,0.9564,0.034,0.0598,0.0018,SumEncoder(),{'columntransformer__onehotencoder': SumEncode...,0.777,0.7753,0.7829,0.7873,0.7806,0.0048,1
8,1.9526,0.3141,0.0753,0.0096,HelmertEncoder(),{'columntransformer__onehotencoder': HelmertEn...,0.7776,0.7776,0.7812,0.7848,0.7803,0.003,2
14,0.2203,0.0182,0.0231,0.0008,WOEEncoder(),{'columntransformer__onehotencoder': WOEEncode...,0.7801,0.7712,0.7731,0.7859,0.7776,0.0058,3
6,10.9471,0.4529,0.021,0.0013,GLMMEncoder(),{'columntransformer__onehotencoder': GLMMEncod...,0.7762,0.7739,0.7762,0.7834,0.7774,0.0036,4
3,0.36,0.0493,0.0388,0.0027,BaseNEncoder(),{'columntransformer__onehotencoder': BaseNEnco...,0.7731,0.7681,0.772,0.7795,0.7732,0.0041,5
4,0.2975,0.0325,0.0346,0.0016,BinaryEncoder(),{'columntransformer__onehotencoder': BinaryEnc...,0.7731,0.7681,0.772,0.7795,0.7732,0.0041,5
13,0.3095,0.0519,0.0245,0.002,TargetEncoder(),{'columntransformer__onehotencoder': TargetEnc...,0.7751,0.7678,0.7684,0.7787,0.7725,0.0046,7
11,0.2566,0.0457,0.0238,0.0042,MEstimateEncoder(),{'columntransformer__onehotencoder': MEstimate...,0.7748,0.7673,0.7695,0.7778,0.7723,0.0042,8
15,0.024,0.0107,0.0035,0.0001,drop,{'columntransformer__onehotencoder': 'drop'},0.7764,0.7689,0.7712,0.7695,0.7715,0.003,9
10,0.3166,0.0373,0.0441,0.0011,LeaveOneOutEncoder(),{'columntransformer__onehotencoder': LeaveOneO...,0.772,0.7639,0.7706,0.7778,0.7711,0.005,10


### 5.4. **Tuning Pipeline II**: Testing encoders on the categorical and numeric variables

In [21]:
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__onehotencoder', 'columntransformer__standardscaler', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'columntransformer__standardscaler__copy', 'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 'lo

In [22]:
params = {}
params["columntransformer__onehotencoder"] = [ohe, le, oe, encoder1, encoder2, encoder3, encoder5, encoder6, 
                                              encoder7, encoder8, encoder9, encoder10, encoder13, encoder15, encoder16, "drop"]
params["columntransformer__standardscaler"] = [StandardScaler(), MinMaxScaler(), Normalizer(), PolynomialFeatures(), RobustScaler(), "drop"]

params

{'columntransformer__onehotencoder': [OneHotEncoder(),
  LabelEncoder(),
  OrdinalEncoder(),
  BaseNEncoder(),
  BinaryEncoder(),
  CatBoostEncoder(),
  GLMMEncoder(),
  HashingEncoder(max_process=4),
  HelmertEncoder(),
  JamesSteinEncoder(),
  LeaveOneOutEncoder(),
  MEstimateEncoder(),
  SumEncoder(),
  TargetEncoder(),
  WOEEncoder(),
  'drop'],
 'columntransformer__standardscaler': [StandardScaler(),
  MinMaxScaler(),
  Normalizer(),
  PolynomialFeatures(),
  RobustScaler(),
  'drop']}

In [23]:
randomCV = RandomizedSearchCV(final_pipeline, params, cv = 4)

randomCV

In [24]:
randomCV.fit(x_train, y_train)

In [25]:
pd.DataFrame(randomCV.cv_results_).sort_values("rank_test_score").round(4)

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__standardscaler,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
4,1.1116,0.1029,0.072,0.0106,RobustScaler(),SumEncoder(),{'columntransformer__standardscaler': RobustSc...,0.777,0.7751,0.7826,0.7873,0.7805,0.0048,1
6,1.6738,0.0535,0.0676,0.0087,RobustScaler(),HelmertEncoder(),{'columntransformer__standardscaler': RobustSc...,0.7773,0.7773,0.7803,0.7848,0.7799,0.0031,2
1,0.2699,0.0221,0.0241,0.0052,drop,TargetEncoder(),"{'columntransformer__standardscaler': 'drop', ...",0.7753,0.767,0.7689,0.7781,0.7723,0.0045,3
0,0.2073,0.0211,0.0243,0.0044,drop,MEstimateEncoder(),"{'columntransformer__standardscaler': 'drop', ...",0.7734,0.7687,0.7692,0.7753,0.7716,0.0028,4
2,0.0395,0.0132,0.0037,0.0002,MinMaxScaler(),drop,{'columntransformer__standardscaler': MinMaxSc...,0.7751,0.7689,0.7659,0.7639,0.7684,0.0042,5
3,0.5539,0.0669,0.0521,0.0153,PolynomialFeatures(),BinaryEncoder(),{'columntransformer__standardscaler': Polynomi...,0.7689,0.7611,0.7678,0.7661,0.766,0.003,6
9,0.3208,0.0456,0.046,0.003,Normalizer(),CatBoostEncoder(),{'columntransformer__standardscaler': Normaliz...,0.765,0.7603,0.7639,0.7703,0.7649,0.0036,7
5,0.0046,0.0003,0.0,0.0,PolynomialFeatures(),LabelEncoder(),{'columntransformer__standardscaler': Polynomi...,,,,,,,8
7,0.0055,0.0001,0.0,0.0,drop,OrdinalEncoder(),"{'columntransformer__standardscaler': 'drop', ...",,,,,,,9
8,0.2474,0.025,0.0076,0.0045,RobustScaler(),OneHotEncoder(),{'columntransformer__standardscaler': RobustSc...,,0.7756,,0.7876,,,10
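
The search space in `params` has 16 options for the categorical step and 6 for the numeric step, so 96 combinations in total. `RandomizedSearchCV` evaluates only `n_iter` of them (10 by default), which matches the 10 rows in the table above. Because no `random_state` is set, a rerun may sample different combinations. The sketch below counts the grid and draws a reproducible sample, using placeholder strings instead of the actual encoder objects.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Placeholder strings stand in for the 16 categorical and 6 numeric candidates
space = {
    "columntransformer__onehotencoder": [f"categorical_option_{i}" for i in range(16)],
    "columntransformer__standardscaler": [f"numeric_option_{i}" for i in range(6)],
}

# Size of the full grid that an exhaustive search would evaluate
n_combinations = len(ParameterGrid(space))

# A random sample of 10 combinations, reproducible thanks to random_state
sampled = list(ParameterSampler(space, n_iter=10, random_state=1234))

print(n_combinations, len(sampled))
```

Passing `random_state` to `RandomizedSearchCV` in the same way would make the sampled combinations reproducible between runs.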


In [26]:
best_estimator = randomCV.best_estimator_

In [27]:
y_pred = best_estimator.predict(x_test)

In [28]:
pd.crosstab(y_test.values.ravel(), y_pred, rownames = ["Actual"], colnames = ["Predicted by the model"], margins = True)

Actual / Predicted by the model,0.0,1.0,All
0.0,3322,274,3596
1.0,734,460,1194
All,4056,734,4790
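
Reading the confusion matrix above with class 1.0 as the positive class gives TN = 3322, FP = 274, FN = 734, and TP = 460. The short computation below derives precision, recall, and accuracy from those four counts.

```python
# Counts copied from the confusion matrix above (positive class = 1.0)
tn, fp = 3322, 274
fn, tp = 734, 460

precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"accuracy  = {accuracy:.3f}")
```

On the test set, about 63% of the candidates flagged as looking for a new job actually are, while the model finds only about 39% of all candidates who are looking.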
