# Official model

- **Objective:** Identify whether or not a person wants to change jobs, based on their characteristics

- **Metrics**:
    - Recall
    - Precision
    - F1-score
    - Accuracy

- **ML model**: Light Gradient Boosting Machine (LightGBM)

## 0. Setup

In [1]:
import pandas as pd
import numpy as np

## 1. Loading the data

In [2]:
dados = pd.read_csv(filepath_or_buffer = '../data/raw/aug_train.csv')

dados.head(3)

,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0


## 2. Creating the new features

### 2.1. Group the values of company_size into PP, P, M and G

In [3]:
def add_feature_company_size(df):
    
    """
    Group company_size into four size bands:
    PP: fewer than 50 people
    P: 50 to 500 people
    M: 500 to 4999 people
    G: 5000 people or more
    """
    
    df1 = df.copy()
    
    # Map each raw category to its band; values not in the map (including NaN) stay NaN
    size_map = {'<10': 'PP', '10/49': 'PP',
                '50-99': 'P', '100-500': 'P',
                '500-999': 'M', '1000-4999': 'M',
                '5000-9999': 'G', '10000+': 'G'}
    
    df1['company_size_cat'] = df1['company_size'].map(size_map)
    
    return df1

### 2.2. Create a feature that divides the number of training hours by 24 (giving how many days of training the person attended)

In [4]:
def add_feature_training_hours(df):
    
    df1 = df.copy()
    
    df1['days_training_hours'] = df1['training_hours'] / 24
    
    return df1

### 2.3. Create a categorical variable that says whether the person is new to the job market or not: with fewer than 10 years of experience the person is new (0), otherwise they are "old" (1)

In [5]:
def add_feature_experience(df):
    
    df1 = df.copy()
    
    # 0 = fewer than 10 years of experience, 1 = 10 years or more, NaN = missing
    df1['experience_cat'] = np.where(df1['experience'].isin(['<1', '1', '2', '3', '4', '5', '6', '7', '8', '9']), 0, 
                                     np.where(df1['experience'].isin(['10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '>20']), 1, 
                                              np.nan))
    
    return df1

### 2.4. Group the startup-related values of company_type

In [6]:
def add_feature_company_type(df):
    
    df1 = df.copy()
    
    # 1 = startup, 0 = any other company type, NaN = missing
    df1['company_type_cat'] = np.where(df1['company_type'].isin(['Funded Startup', 'Early Stage Startup']), 1, 
                                     np.where(df1['company_type'].isin(['Pvt Ltd', 'Other', 'Public Sector', 'NGO']), 0, 
                                              np.nan))
    
    return df1

### 2.5. Creating a function to flag nulls in any variable (1 if null, otherwise 0)

In [7]:
def add_feature_null_column(df, col):
    
    df1 = df.copy()
    
    df1['check_null_' + col] = np.where(df1[col].isna(), 1, 0)
    
    return df1

### 2.6. Creating a function to fill nulls in qualitative variables (with 'Outras')

In [8]:
def add_feature_null_qualitative(df, col):
    
    df1 = df.copy()
    
    df1[col] = np.where(df1[col].isna(), 'Outras', df1[col])
    
    return df1

### 2.7. Creating a function to fill nulls in quantitative variables (with 99999)

In [9]:
def add_feature_null_quantitative(df, col):
    
    df1 = df.copy()
    
    df1[col] = np.where(df1[col].isna(), 99999, df1[col])
    
    return df1
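
The order in which the null helpers run matters: the flag from 2.5 has to be created before a fill from 2.6 or 2.7, otherwise it is computed on already-filled data and is always 0. A minimal sketch on a made-up one-column frame (the toy data and the compact re-definitions of the two helpers are assumptions for illustration):

```python
import numpy as np
import pandas as pd


def add_feature_null_column(df, col):
    # Same logic as 2.5: 1 where the value is missing, 0 otherwise
    df1 = df.copy()
    df1['check_null_' + col] = np.where(df1[col].isna(), 1, 0)
    return df1


def add_feature_null_qualitative(df, col):
    # Same effect as 2.6, written with fillna
    df1 = df.copy()
    df1[col] = df1[col].fillna('Outras')
    return df1


toy = pd.DataFrame({'gender': ['Male', None, 'Female']})

# Flag first, then fill: the flag records which rows were missing
flag_then_fill = (toy.pipe(add_feature_null_column, 'gender')
                     .pipe(add_feature_null_qualitative, 'gender'))

# Fill first, then flag: nothing is missing any more, so the flag is all zeros
fill_then_flag = (toy.pipe(add_feature_null_qualitative, 'gender')
                     .pipe(add_feature_null_column, 'gender'))

print(flag_then_fill)
print(fill_then_flag)
```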

## 3. Building the model

### 3.0. Setup

In [10]:
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline 
from sklearn.compose import make_column_transformer
from sklearn import set_config 
from sklearn.metrics import classification_report, accuracy_score, recall_score, f1_score, precision_score
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, Normalizer, QuantileTransformer, StandardScaler, OneHotEncoder
import category_encoders as ce
import lightgbm as lgb


set_config(display = "diagram")

### 3.1. Splitting the data into train and test sets

In [11]:
X = dados.drop(columns = 'target')

y = dados.target

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, random_state = 19, stratify = y)

In [13]:
print(f'Number of rows in X_train: {X_train.shape[0]} \n \
Number of rows in X_test: {X_test.shape[0]}\n \
Number of rows in y_train: {y_train.shape[0]}\n \
Number of rows in y_test: {y_test.shape[0]}\
')

Number of rows in X_train: 13410 
 Number of rows in X_test: 5748
 Number of rows in y_train: 13410
 Number of rows in y_test: 5748
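
Passing `stratify = y` asks `train_test_split` to keep the share of each class roughly the same in both splits, which matters here because the positive class is the minority. A small sketch on synthetic labels (the 25% positive rate and the array sizes are assumptions chosen for the example, not taken from the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 250 positives and 750 negatives
y = np.array([1] * 250 + [0] * 750)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, random_state=19, stratify=y
)

# With stratification both splits keep close to the original positive share
print(f"train positive share: {y_tr.mean():.3f}")
print(f"test positive share:  {y_te.mean():.3f}")
```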


### 3.2. Defining the steps of the Feature Engineering Pipeline

In [14]:
encoder1 = ce.BackwardDifferenceEncoder()
encoder2 = ce.BaseNEncoder()
encoder3 = ce.BinaryEncoder()
encoder4 = ce.CatBoostEncoder()
encoder5 = ce.CountEncoder()
encoder6 = ce.GLMMEncoder()
encoder7 = ce.HashingEncoder()
encoder8 = ce.HelmertEncoder()
encoder9 = ce.JamesSteinEncoder()
encoder10 = ce.LeaveOneOutEncoder()
encoder11 = ce.MEstimateEncoder()
encoder12 = OneHotEncoder(handle_unknown = "ignore")
encoder13 = ce.OrdinalEncoder()
encoder14 = ce.SumEncoder()
encoder15 = ce.PolynomialEncoder()
encoder16 = ce.TargetEncoder()
encoder17 = ce.WOEEncoder()
encoder18 = ce.QuantileEncoder()
encoder19 = MaxAbsScaler()
encoder20 = MinMaxScaler()
encoder21 = Normalizer()
encoder22 = QuantileTransformer()
encoder23 = StandardScaler()

model = lgb.LGBMClassifier(random_state = 42)



In [15]:
features_qual = list(dados.select_dtypes(include = ['object']).columns)
features_quant = list(dados.drop(columns = ['enrollee_id', 'target']).select_dtypes(include = 'number').columns)

In [16]:
pipeline_inicial = make_column_transformer(
    (encoder12, features_qual),   # one-hot encode the qualitative features
    (encoder23, features_quant),  # standardize the quantitative features
    remainder = 'drop'
)

pipeline_inicial

In [17]:
pipeline_com_modelo = make_pipeline(pipeline_inicial, model)

pipeline_com_modelo

In [18]:
pipeline_com_modelo.fit(X_train, y_train)

In [19]:
y_pred = pipeline_com_modelo.predict(X_test)

y_pred

array([0., 0., 0., ..., 0., 0., 0.])

In [20]:
pd.crosstab(y_test, y_pred, rownames = ['Actual'], colnames = ['Predicted'], margins = True)

Actual \ Predicted,0.0,1.0,All
0.0,3733,582,4315
1.0,581,852,1433
All,4314,1434,5748
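
The four metrics tracked in this notebook can all be derived from the counts in the confusion matrix above. A short worked sketch using those counts (the variable names are chosen for this example only):

```python
# Counts read from the baseline confusion matrix (rows = actual, columns = predicted)
tn, fp = 3733, 582   # actual 0.0
fn, tp = 581, 852    # actual 1.0

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # share of actual positives that were found
precision = tp / (tp + fp)     # share of predicted positives that were right
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy * 100:.2f}")
print(f"recall:    {recall * 100:.2f}")
print(f"precision: {precision * 100:.2f}")
print(f"f1-score:  {f1 * 100:.2f}")
```

These agree with the values that `accuracy_score`, `recall_score`, `precision_score` and `f1_score` report for the baseline model.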


In [21]:
print(classification_report(y_true = y_test, y_pred = y_pred))

              precision    recall  f1-score   support

         0.0       0.87      0.87      0.87      4315
         1.0       0.59      0.59      0.59      1433

    accuracy                           0.80      5748
   macro avg       0.73      0.73      0.73      5748
weighted avg       0.80      0.80      0.80      5748



In [22]:
accuracy_score(y_true = y_test, y_pred = y_pred) * 100

79.7668754349339

In [23]:
recall_score(y_true = y_test, y_pred = y_pred) * 100

59.45568736915562

In [24]:
precision_score(y_true = y_test, y_pred = y_pred) * 100

59.41422594142259

In [25]:
f1_score(y_true = y_test, y_pred = y_pred) * 100

59.43494942448553

## 4. Tuning the previous model

### 4.1. Testing new encoders/transformations for the qualitative variables

In [26]:
params = {}

params['columntransformer__onehotencoder'] = [encoder1, encoder2, encoder3, encoder4, encoder5, encoder6, encoder7, 
                                              encoder8, encoder9, encoder10, encoder11, encoder12, encoder13, 
                                              encoder14, encoder15, encoder16, encoder17, encoder18]

In [27]:
grid = GridSearchCV(estimator = pipeline_com_modelo, 
                    param_grid = params,
                    scoring = 'recall',
                    n_jobs = -1,
                    cv = 4
                   )
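
The key `columntransformer__onehotencoder` works because `make_pipeline` and `make_column_transformer` name each step after its lowercased class name, and nested parameters are addressed with a double underscore. Assigning a different encoder to that key swaps the whole transformer while the step keeps its original name, which is why the results column is still called `param_columntransformer__onehotencoder` even when it holds another encoder. A sketch using scikit-learn only (the `LogisticRegression` stand-in, the `OrdinalEncoder` replacement and the two column names are assumptions for illustration):

```python
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

pre = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),
    (StandardScaler(), ["training_hours"]),
    remainder="drop",
)
pipe = make_pipeline(pre, LogisticRegression())

# Step-level keys that a search can replace wholesale
step_keys = [
    name for name in pipe.get_params()
    if name.startswith("columntransformer__") and name.count("__") == 1
]
print(step_keys)

# Swap the encoder: the step is still named "onehotencoder"
pipe.set_params(columntransformer__onehotencoder=OrdinalEncoder())
swapped = pipe.named_steps["columntransformer"].transformers[0]
print(swapped)
```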

In [28]:
grid.fit(X_train, y_train)

4 fits failed out of a total of 72.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
  File "/home/rafael/Documentos/Github/hr_analysis/env/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/rafael/Documentos/Github/hr_analysis/env/lib/python3.10/site-packages/sklearn/pipeline.py", line 378, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/rafael/Documentos/Github/hr_analysis/env/lib/python3.10/site-packages/sklearn/pipeline.py", line 336, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/rafael/Documentos/Github/
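
The warning above means that some candidates raised an exception during `fit` and were scored as NaN. One way to see which candidates those were is to filter `cv_results_` for NaN mean scores; passing `error_score='raise'`, as the warning suggests, stops at the first failure instead. The sketch below reproduces the situation with a deliberately invalid candidate (a negative `C` for `LogisticRegression`) on synthetic data; it is a stand-in, since the truncated traceback does not show which encoder failed in this notebook:

```python
import warnings

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [1.0, -1.0]},   # -1.0 is invalid and fails on every fold
    scoring="recall",
    cv=4,
)

# Silence the fit-failure warnings so the output stays readable
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
failed = results[results["mean_test_score"].isna()]
print(failed[["params", "mean_test_score"]])
```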

In [29]:
pd.DataFrame(grid.cv_results_)\
    .sort_values(by = 'rank_test_score', ascending = True)\
    .head(5)

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
8,1.230887,0.115673,0.107265,0.010969,JamesSteinEncoder(),{'columntransformer__onehotencoder': JamesStei...,0.572967,0.576555,0.559809,0.563397,0.568182,0.006819,1
1,1.48472,0.233665,0.229288,0.032926,BaseNEncoder(),{'columntransformer__onehotencoder': BaseNEnco...,0.570574,0.568182,0.564593,0.553828,0.564294,0.006407,2
2,1.000458,0.304285,0.138517,0.032797,BinaryEncoder(),{'columntransformer__onehotencoder': BinaryEnc...,0.570574,0.568182,0.564593,0.553828,0.564294,0.006407,2
4,0.878773,0.196663,0.128947,0.037957,CountEncoder(combine_min_nan_groups=True),{'columntransformer__onehotencoder': CountEnco...,0.570574,0.570574,0.55622,0.558612,0.563995,0.006633,4
5,25.789216,1.140465,0.086314,0.037797,GLMMEncoder(),{'columntransformer__onehotencoder': GLMMEncod...,0.566986,0.570574,0.557416,0.559809,0.563696,0.005307,5


### 4.2. Tuning the transformations for the quantitative and qualitative variables

In [30]:
params = {}

params['columntransformer__onehotencoder'] = [encoder1, encoder2, encoder3, encoder4, encoder5, encoder6, encoder7, 
                                              encoder8, encoder9, encoder10, encoder11, encoder12, encoder13, 
                                              encoder14, encoder15, encoder16, encoder17, encoder18]

params['columntransformer__standardscaler'] = [encoder19, encoder20, encoder21, encoder22, encoder23]

In [31]:
grid = GridSearchCV(estimator = pipeline_com_modelo, 
                    param_grid = params,
                    scoring = 'recall',
                    n_jobs = -1,
                    cv = 4
                   )

In [32]:
grid.fit(X_train, y_train)

20 fits failed out of a total of 360.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/home/rafael/Documentos/Github/hr_analysis/env/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/rafael/Documentos/Github/hr_analysis/env/lib/python3.10/site-packages/sklearn/pipeline.py", line 378, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/rafael/Documentos/Github/hr_analysis/env/lib/python3.10/site-packages/sklearn/pipeline.py", line 336, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/rafael/Documentos/Gith

In [33]:
pd.DataFrame(grid.cv_results_)\
    .sort_values(by = 'rank_test_score', ascending = True)\
    .head(5)

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__onehotencoder,param_columntransformer__standardscaler,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
44,0.873891,0.161397,0.098458,0.017408,JamesSteinEncoder(),StandardScaler(),{'columntransformer__onehotencoder': JamesStei...,0.572967,0.576555,0.559809,0.563397,0.568182,0.006819,1
43,0.951195,0.086415,0.11508,0.022526,JamesSteinEncoder(),QuantileTransformer(),{'columntransformer__onehotencoder': JamesStei...,0.572967,0.576555,0.559809,0.563397,0.568182,0.006819,1
41,0.94404,0.103057,0.11617,0.011693,JamesSteinEncoder(),MinMaxScaler(),{'columntransformer__onehotencoder': JamesStei...,0.572967,0.576555,0.559809,0.563397,0.568182,0.006819,1
40,1.29319,0.393327,0.130598,0.025011,JamesSteinEncoder(),MaxAbsScaler(),{'columntransformer__onehotencoder': JamesStei...,0.572967,0.576555,0.559809,0.563397,0.568182,0.006819,1
14,0.882105,0.098065,0.162199,0.029564,BinaryEncoder(),StandardScaler(),{'columntransformer__onehotencoder': BinaryEnc...,0.570574,0.568182,0.564593,0.553828,0.564294,0.006407,5
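
The four-way tie between `MaxAbsScaler`, `MinMaxScaler`, `QuantileTransformer` and `StandardScaler` in the table above is what one would expect from a tree-based model: splits depend on the ordering of values within a feature, and these scalers preserve that ordering. `Normalizer` rescales each row rather than each column, so it can change the ordering, and it is absent from the tie. A small sketch of the idea, with scikit-learn's `DecisionTreeClassifier` standing in for LightGBM on synthetic data (both are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Three features on very different scales
X = rng.normal(size=(400, 3)) * np.array([1.0, 10.0, 100.0])
y = (X[:, 0] + X[:, 1] / 10.0 > 0).astype(int)

X_train, X_test, y_train = X[:300], X[300:], y[:300]

scaler = StandardScaler().fit(X_train)

raw_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
raw_tree.fit(X_train, y_train)

scaled_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
scaled_tree.fit(scaler.transform(X_train), y_train)

# Compare predictions made with and without per-column scaling
same_predictions = np.array_equal(
    raw_tree.predict(X_test),
    scaled_tree.predict(scaler.transform(X_test)),
)
print(same_predictions)
```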


### 4.3. Tuning LightGBM

- Tree-based
- XGBoost

In [34]:
params = {}

params['columntransformer__onehotencoder'] = [encoder1, encoder2, encoder3, encoder4, encoder5, encoder6, encoder7, 
                                              encoder8, encoder9, encoder10, encoder11, encoder12, encoder13, 
                                              encoder14, encoder15, encoder16, encoder17, encoder18]

params['columntransformer__standardscaler'] = [encoder19, encoder20, encoder21, encoder22, encoder23]

params['lgbmclassifier__n_estimators'] = [100, 200, 500, 5000, 10000]
params['lgbmclassifier__max_depth'] = [3, 4, 5, 6, 7, 8, 9, 10]
params['lgbmclassifier__num_leaves'] = [2, 5, 7, 9, 11, 15, 17, 20]
params['lgbmclassifier__learning_rate'] = [0.1, 0.3, 0.5, 0.7, 0.9]

In [55]:
grid = RandomizedSearchCV(estimator = pipeline_com_modelo, 
                          param_distributions = params,
                          scoring = 'recall',
                          n_jobs = -1,
                          cv = 4
                   )
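
Unlike `GridSearchCV`, `RandomizedSearchCV` fits only `n_iter` sampled combinations (10 by default), which keeps a search space this large tractable. Since no `random_state` is passed above, a rerun may sample different candidates. A sketch of a reproducible search, using a `DecisionTreeClassifier` and a small made-up search space on synthetic data (all assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

space = {
    "max_depth": [3, 4, 5, 6, 7, 8, 9, 10],
    "min_samples_leaf": [1, 2, 5, 10, 20],
}


def run_search(seed):
    # n_iter is spelled out; random_state fixes which candidates are sampled
    search = RandomizedSearchCV(
        estimator=DecisionTreeClassifier(random_state=42),
        param_distributions=space,
        n_iter=10,
        scoring="recall",
        cv=4,
        random_state=seed,
    )
    search.fit(X, y)
    return [str(candidate) for candidate in search.cv_results_["params"]]


first = run_search(19)
second = run_search(19)
print(first == second)
```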

In [56]:
grid.fit(X_train, y_train)



In [57]:
pd.DataFrame(grid.cv_results_)\
    .sort_values(by = 'rank_test_score', ascending = True)\
    .head(5)

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_lgbmclassifier__num_leaves,param_lgbmclassifier__n_estimators,param_lgbmclassifier__max_depth,param_lgbmclassifier__learning_rate,param_columntransformer__standardscaler,param_columntransformer__onehotencoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
9,1.083784,0.159657,0.151801,0.028268,15,200,8,0.1,StandardScaler(),MEstimateEncoder(),"{'lgbmclassifier__num_leaves': 15, 'lgbmclassi...",0.55622,0.564593,0.553828,0.549043,0.555921,0.005634,1
4,1.046996,0.095441,0.161573,0.056969,11,200,3,0.1,QuantileTransformer(),JamesSteinEncoder(),"{'lgbmclassifier__num_leaves': 11, 'lgbmclassi...",0.541866,0.557416,0.543062,0.532297,0.54366,0.008971,2
1,1.060652,0.199782,0.125643,0.051941,5,100,9,0.3,Normalizer(),JamesSteinEncoder(),"{'lgbmclassifier__num_leaves': 5, 'lgbmclassif...",0.537081,0.553828,0.534689,0.545455,0.542763,0.007536,3
7,1.209253,0.21408,0.206602,0.07539,5,100,5,0.3,MinMaxScaler(),SumEncoder(),"{'lgbmclassifier__num_leaves': 5, 'lgbmclassif...",0.521531,0.549043,0.533493,0.529904,0.533493,0.009972,4
2,1.220879,0.116329,0.157596,0.016621,15,200,9,0.5,MinMaxScaler(),WOEEncoder(),"{'lgbmclassifier__num_leaves': 15, 'lgbmclassi...",0.455742,0.460526,0.47488,0.4689,0.465012,0.007392,5


In [58]:
best_estimator = grid.best_estimator_

In [59]:
y_pred = best_estimator.predict(X_test)

y_pred

array([0., 0., 0., ..., 0., 0., 0.])

### 4.4. Evaluating the tuned model

In [60]:
#---- Confusion matrix

pd.crosstab(y_test, y_pred, rownames = ['Actual'], colnames = ['Predicted'], margins = True)

Actual \ Predicted,0.0,1.0,All
0.0,3718,597,4315
1.0,569,864,1433
All,4287,1461,5748


### 4.5. Interpretations:

- Our model correctly identified about **60%** of the people who really wanted to change jobs (864/1433)
- Our model correctly identified about 86% of the people who did not want to change jobs (3718/4315)
- Our model wrongly predicted that about 14% of the people who did not want to change jobs wanted to (597/4315)
- Our model wrongly predicted that about 40% of the people who wanted to change jobs did not want to (569/1433)
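
The per-class rates can be computed directly from the counts in the confusion matrix in 4.4. A short sketch (the dictionary keys are names chosen for this example):

```python
# Counts read from the tuned model's confusion matrix (rows = actual, columns = predicted)
tn, fp = 3718, 597   # people who did not want to change jobs
fn, tp = 569, 864    # people who wanted to change jobs

rates = {
    "true_positive_rate": tp / (tp + fn),    # wanted to change, predicted so
    "true_negative_rate": tn / (tn + fp),    # did not want to change, predicted so
    "false_positive_rate": fp / (tn + fp),   # did not want to change, predicted as wanting to
    "false_negative_rate": fn / (tp + fn),   # wanted to change, predicted as not wanting to
}

for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```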

In [61]:
# Metrics to keep track of: Accuracy, Recall, Precision and F1-score

from sklearn.metrics import classification_report, accuracy_score, recall_score, f1_score, precision_score

print(classification_report(y_true = y_test, y_pred = y_pred))

              precision    recall  f1-score   support

         0.0       0.87      0.86      0.86      4315
         1.0       0.59      0.60      0.60      1433

    accuracy                           0.80      5748
   macro avg       0.73      0.73      0.73      5748
weighted avg       0.80      0.80      0.80      5748



In [62]:
accuracy_score(y_true = y_test, y_pred = y_pred) * 100

79.71468336812805

In [63]:
recall_score(y_true = y_test, y_pred = y_pred) * 100

60.29309141660851

In [64]:
precision_score(y_true = y_test, y_pred = y_pred) * 100

59.13757700205339

In [65]:
f1_score(y_true = y_test, y_pred = y_pred) * 100

59.70974429854873

### 4.6. Manually setting the best parameters

In [70]:
pipeline_final = make_column_transformer(
    (encoder12, features_qual),   # one-hot encode the qualitative features
    (encoder23, features_quant),  # standardize the quantitative features
    remainder = 'drop'
)

pipeline_final

In [76]:
modelo_final = lgb.LGBMClassifier(random_state = 42,
                                 num_leaves = 9,
                                 n_estimators = 100, 
                                 max_depth = 6, 
                                 learning_rate = 0.1
                                )

In [77]:
pipeline_final_com_modelo = make_pipeline(pipeline_final, modelo_final)

pipeline_final_com_modelo

In [79]:
pipeline_final_com_modelo.fit(X_train, y_train)

## 5. Saving the model: pickle (via joblib)

The fitted pipeline, preprocessing plus model, is written to a single file.

In [80]:
from joblib import dump, load

dump(pipeline_final_com_modelo, '../models/hr_tuning_model_lgbm.joblib') # Saving the model under the name hr_tuning_model_lgbm

['../models/hr_tuning_model_lgbm.joblib']

In [81]:
load_best_estimator = load('../models/hr_tuning_model_lgbm.joblib')

load_best_estimator

In [83]:
load_best_estimator.predict(X_test)

array([0., 0., 0., ..., 0., 0., 0.])
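
A quick sanity check after saving is to confirm that the reloaded pipeline predicts exactly what the in-memory one does. A self-contained sketch with a small stand-in pipeline and a temporary file (the `LogisticRegression` model, the synthetic data and the file name are assumptions for illustration):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Write the fitted pipeline to disk and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    dump(pipe, path)
    restored = load(path)

round_trip_ok = np.array_equal(pipe.predict(X), restored.predict(X))
print(round_trip_ok)
```

Note that joblib files are pickle-based, so they should only be loaded from trusted sources and ideally with the same library versions used to save them.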

## Next steps:

1. Evaluate the metrics against the untuned model and the baseline

In [46]:
# PIPELINE IDEA: lay out step by step what the model will do
# - Drop a variable
# - Apply the add_feature_null_column function
# - Apply the add_feature_null_qualitative function
# - Apply the add_feature_null_quantitative function
# - Use the encoders on the qualitative features
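
One possible way to realise this idea is to wrap the feature functions in scikit-learn's `FunctionTransformer`, so they run inside the same pipeline as the encoders. The sketch below is only an illustration: `flag_and_fill_nulls` is a hypothetical helper that folds the drop, flag and fill steps together, and the toy frame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

QUAL_COLS = ["gender", "company_type"]


def flag_and_fill_nulls(df):
    # Hypothetical helper: drop the id, flag nulls, then fill them
    df1 = df.drop(columns=["enrollee_id"])
    for col in QUAL_COLS:
        df1["check_null_" + col] = np.where(df1[col].isna(), 1, 0)
        df1[col] = df1[col].fillna("Outras")
    return df1


toy = pd.DataFrame({
    "enrollee_id": [1, 2, 3],
    "gender": ["Male", None, "Female"],
    "company_type": ["Pvt Ltd", "NGO", None],
})

prep = make_pipeline(
    FunctionTransformer(flag_and_fill_nulls),
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), QUAL_COLS),
        remainder="passthrough",   # keeps the check_null_* flags
    ),
)

encoded = prep.fit_transform(toy)
print(encoded.shape)
```

A classifier can then be appended with `make_pipeline(prep, model)`, so the same steps are applied at fit and predict time.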

In [85]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13410 entries, 14345 to 13671
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             13410 non-null  int64  
 1   city                    13410 non-null  object 
 2   city_development_index  13410 non-null  float64
 3   gender                  10254 non-null  object 
 4   relevent_experience     13410 non-null  object 
 5   enrolled_university     13146 non-null  object 
 6   education_level         13072 non-null  object 
 7   major_discipline        11393 non-null  object 
 8   experience              13364 non-null  object 
 9   company_size            9211 non-null   object 
 10  company_type            9085 non-null   object 
 11  last_new_job            13103 non-null  object 
 12  training_hours          13410 non-null  int64  
dtypes: float64(1), int64(2), object(10)
memory usage: 1.4+ MB
