
## Logistic regression models

---

### Summary

Overall, the models do not perform well. We tested logistic regression models, varying the training method (plain, and with cross-validation: *k-fold* and *stratified k-fold*) and whether the classes were balanced.

- Accuracy is around 72% ($\pm$ 1%) in every case. Because our goal is to correctly identify the students who actually drop out (i.e. to raise the true positive rate), this metric tells us little; *recall* is the one to watch.
- On the imbalanced data, the cross-validated training methods did worse than the plain model, with *recall* falling as low as 34%. After balancing the classes, the plain model's *recall* rose from 46% to 55%.
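To make the gap between accuracy and *recall* concrete, the sketch below recomputes both from the confusion matrix that the plain logistic regression reports later in this notebook. It also computes the accuracy of a trivial baseline that always predicts "did not drop out", which comes out at roughly 68%, so a 72% accuracy is only a small gain over guessing the majority class. The variable names are ours, chosen for illustration.

```python
import numpy as np

# Confusion matrix reported in this notebook for the plain logistic regression
# (rows = actual class, columns = predicted class; class 1 = dropped out).
cfm = np.array([[25804, 4511],
                [7781, 6675]])

tn, fp, fn, tp = cfm.ravel()
total = cfm.sum()

accuracy = (tp + tn) / total            # share of all students classified correctly
recall = tp / (tp + fn)                 # share of actual dropouts that were caught
precision = tp / (tp + fp)              # share of predicted dropouts that were real
baseline_accuracy = (tn + fp) / total   # always predicting the majority class (0)

print(f"accuracy          = {accuracy:.4f}")
print(f"recall            = {recall:.4f}")
print(f"precision         = {precision:.4f}")
print(f"baseline accuracy = {baseline_accuracy:.4f}")
```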

In [1]:
# %load first_cell.py
%reload_ext autoreload
%autoreload 2

from paths import RAW_PATH, TREAT_PATH, OUTPUT_PATH, FIGURES_PATH, MODEL_PATH

import os
from copy import deepcopy
import numpy as np
import pandas as pd
pd.options.display.max_columns = 999
import pandas_profiling

import warnings
warnings.filterwarnings('ignore')

# Plotting
import plotly
import plotly.graph_objs as go
import cufflinks as cf
plotly.offline.init_notebook_mode(connected=True)

# Metrics
from plot_metrics import plot_roc, plot_confusion

cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

colorscale = ['#025951', '#8BD9CA', '#BF7F30', '#F2C124', '#8C470B', '#DFC27D']

### Importing the train and test sets

In [2]:
X_train = pd.read_csv(TREAT_PATH / 'modelo' / 'df_treino.csv', index_col='ID')
y_train = X_train['IN_EVASAO']
X_train = X_train.drop('IN_EVASAO', axis=1)

X_test = pd.read_csv(TREAT_PATH / 'modelo' / 'df_teste.csv', index_col='ID')
y_test = X_test['IN_EVASAO']
X_test = X_test.drop('IN_EVASAO', axis=1)

### Running logistic regression

In [3]:
# from sklearn.linear_model import LogisticRegression

# logit = LogisticRegression(random_state=0)
# logit.fit(X_train, y_train)

# # Save the model
# filename = 'logit_reg_model.sav'

# import pickle
# pickle.dump(logit, open(MODEL_PATH / filename, 'wb'))

# Load the previously trained model
import pickle
filename = 'logit_reg_model.sav'
with open(MODEL_PATH / filename, 'rb') as f:
    logit = pickle.load(f)
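The train-once, load-afterwards pattern in the cell above recurs for every model in this notebook. One possible way to avoid commenting code in and out is a small helper that fits and saves the model only when the pickle file is missing. This is a minimal sketch: `load_or_fit` is a hypothetical helper of ours, and the data and file path are synthetic stand-ins, not the notebook's `X_train` or `MODEL_PATH`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def load_or_fit(path, make_model, X, y):
    """Load a pickled model from `path`, or fit and save one if it is missing."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    model = make_model().fit(X, y)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return model


# Synthetic stand-in for the training data, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical location; the notebook itself stores models under MODEL_PATH.
model_path = Path(tempfile.mkdtemp()) / "logit_demo.sav"

first = load_or_fit(model_path, lambda: LogisticRegression(random_state=0), X_demo, y_demo)
second = load_or_fit(model_path, lambda: LogisticRegression(random_state=0), X_demo, y_demo)
```

The first call trains and writes the file; the second call reads it back. As with any pickle, the file should only be loaded if it comes from a trusted source.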

In [4]:
result = logit.score(X_test, y_test)
result

0.7254472761385719

In [5]:
y_pred = logit.predict(X_test)

In [6]:
df_result = pd.DataFrame({'Real': y_test.values.flatten(), 
                          'Predito': y_pred.flatten()})
df_result.head(2)

   Real  Predito
0     0        0
1     0        0


In [7]:
df_result['Predito'].value_counts()

0    33585
1    11186
Name: Predito, dtype: int64

In [8]:
df_result['Real'].value_counts()

0    30315
1    14456
Name: Real, dtype: int64

In [9]:
print(logit.coef_)

[[-1.14148519e-04 -8.49009686e-02  1.51490597e-01  1.22508896e+00
  -3.41094168e-01 -1.44216954e-01  7.72587711e-02  2.57394021e-03
  -1.18087385e-02  8.36078130e-02 -4.77665006e-02  4.48485696e-02
   7.46465266e-02 -4.34243966e-02 -2.23419082e-02 -2.37661278e-01]]


* Checking the confusion matrix

In [10]:
from sklearn.metrics import confusion_matrix

cfm = confusion_matrix(y_test, y_pred)
cfm

array([[25804,  4511],
       [ 7781,  6675]])

In [11]:
cfm.sum(axis=0)  # Predicted totals (column sums)

array([33585, 11186])

In [12]:
cfm.sum(axis=1)  # Actual totals (row sums)

array([30315, 14456])

In [13]:
# Normalize each row by its actual-class total, so the diagonal holds per-class recall
cfm = cfm.astype(float) / cfm.sum(axis=1)[:, np.newaxis]
cfm

array([[0.85119578, 0.14880422],
       [0.53825401, 0.46174599]])

In [14]:
evasao_label = {0: 'Não evadiu', 1: 'Evadiu'}
title = 'Regressão Logística: Real x Predito'

plot_confusion(y_test, y_pred, evasao_label, title)

In [15]:
from sklearn.metrics import recall_score, precision_score, classification_report

print(classification_report(y_test, y_pred, target_names=evasao_label.values()))

              precision    recall  f1-score   support

  Não evadiu       0.77      0.85      0.81     30315
      Evadiu       0.60      0.46      0.52     14456

    accuracy                           0.73     44771
   macro avg       0.68      0.66      0.66     44771
weighted avg       0.71      0.73      0.71     44771



In [16]:
evasao_label.keys()

dict_keys([0, 1])

In [17]:
precision_score(y_test, y_pred)

0.596728052923297

In [18]:
y_score = logit.decision_function(X_test)
y_score

array([-1.75183137, -1.48781391, -1.03330739, ..., -1.40370221,
       -1.19088753, -0.86026853])

In [19]:
plot_roc(y_test, y_score, title='Curva ROC - Regressão Logística')

### Running logit with CV

In [20]:
# from sklearn.linear_model import LogisticRegressionCV

# logit_cv = LogisticRegressionCV(cv=5, random_state=0)
# logit_cv.fit(X_train, y_train)

# # Save the model
# import pickle
# filename = 'logit_reg_cv_model.sav'

# pickle.dump(logit_cv, open(MODEL_PATH / filename, 'wb'))

# Load the previously trained model
import pickle
filename = 'logit_reg_cv_model.sav'
with open(MODEL_PATH / filename, 'rb') as f:
    logit_cv = pickle.load(f)

In [21]:
result_cv = logit_cv.score(X_test, y_test)
result_cv

0.7298027741171741

In [22]:
y_pred_cv = logit_cv.predict(X_test)

df_result_cv = pd.DataFrame({'Actual': y_test.values.flatten(), 'Predicted': y_pred_cv.flatten()})
df_result_cv.head()

   Actual  Predicted
0       0          0
1       0          0
2       0          0
3       0          0
4       0          0


In [23]:
df_result_cv['Predicted'].value_counts()

0    36330
1     8441
Name: Predicted, dtype: int64

In [24]:
df_result_cv['Actual'].value_counts()

0    30315
1    14456
Name: Actual, dtype: int64

In [25]:
evasao_label = {0: 'Não evadiu', 1: 'Evadiu'}
title = 'Regressão Logística (CV k-Fold=5): Real x Predito'

plot_confusion(y_test, y_pred_cv, evasao_label, title)

In [26]:
# from sklearn.metrics import recall_score, precision_score, classification_report
print(classification_report(y_test, y_pred_cv, target_names=evasao_label.values()))

              precision    recall  f1-score   support

  Não evadiu       0.75      0.90      0.82     30315
      Evadiu       0.64      0.37      0.47     14456

    accuracy                           0.73     44771
   macro avg       0.70      0.64      0.65     44771
weighted avg       0.71      0.73      0.71     44771



In [27]:
y_score_cv = logit_cv.decision_function(X_test)
plot_roc(y_test, y_score_cv, 'Curva ROC - Regressão Logística (CV k-Fold=5)')

In [28]:
# Recursive feature elimination (RFE) draft, kept for reference:
# from sklearn.feature_selection import RFE
# from sklearn.linear_model import LogisticRegression
#
# data_final_vars = data_final.columns.values.tolist()
# y = ['y']
# X = [i for i in data_final_vars if i not in y]
#
# logreg = LogisticRegression()
# rfe = RFE(logreg, n_features_to_select=20)
# rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
# print(rfe.support_)
# print(rfe.ranking_)

### Running logit with stratified k-folds

In [29]:
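# from sklearn.linear_model import LogisticRegressionCV

# Note: with cv left unset, LogisticRegressionCV uses stratified k-folds by default;
# the default number of folds depends on the scikit-learn version.
# logit_strat = LogisticRegressionCV(random_state=0)
# logit_strat.fit(X_train, y_train)

# # Save the model
# import pickle
# filename = 'logit_reg_strat_model.sav'

# pickle.dump(logit_strat, open(MODEL_PATH / filename, 'wb'))

# Load the previously trained model
import pickle
filename = 'logit_reg_strat_model.sav'
with open(MODEL_PATH / filename, 'rb') as f:
    logit_strat = pickle.load(f)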

In [30]:
y_pred_strat = logit_strat.predict(X_test)

In [31]:
evasao_label = {0: 'Não evadiu', 1: 'Evadiu'}
title = 'Regressão Logística (Stratified K-Fold): Real x Predito'

plot_confusion(y_test, y_pred_strat, evasao_label, title)

In [32]:
# from sklearn.metrics import recall_score, precision_score, classification_report
print(classification_report(y_test, y_pred_strat, target_names=evasao_label.values()))

              precision    recall  f1-score   support

  Não evadiu       0.74      0.91      0.82     30315
      Evadiu       0.65      0.34      0.45     14456

    accuracy                           0.73     44771
   macro avg       0.70      0.63      0.63     44771
weighted avg       0.71      0.73      0.70     44771



In [33]:
y_score_strat = logit_strat.decision_function(X_test)
plot_roc(y_test, y_score_strat, 'Curva ROC - Regressão Logística (Stratified k-Fold)')

In [34]:
# from sklearn.model_selection import StratifiedKFold
# skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # random_state requires shuffle=True
# skf.get_n_splits(X_norm, y)
# for train_index, test_index in skf.split(X_norm, y):
#     print("TRAIN:", train_index, "TEST:", test_index)
#     X_train, X_test = X_norm.iloc[train_index], X_norm.iloc[test_index]
#     y_train, y_test = y.iloc[train_index], y.iloc[test_index]
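To make the stratification explicit instead of relying on defaults, a `StratifiedKFold` splitter can be passed straight to `LogisticRegressionCV` through its `cv` parameter. The sketch below does this on synthetic data and also checks that each validation fold keeps roughly the overall class ratio. Selecting the regularization strength by `scoring="recall"` is our assumption, motivated by the focus on recall; it is not what the saved models in this notebook used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in data, for illustration only.
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.68, 0.32], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Positive-class rate in each validation fold; stratification keeps these
# close to the rate in the full data set.
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print("overall positive rate:", round(y.mean(), 3))
print("per-fold positive rate:", [round(r, 3) for r in fold_rates])

# Pass the splitter explicitly and select C by recall (an assumption of this sketch).
logit_strat_demo = LogisticRegressionCV(cv=skf, scoring="recall", max_iter=1000)
logit_strat_demo.fit(X, y)
print("selected C:", logit_strat_demo.C_)
```

According to the scikit-learn documentation, the default splitter of `LogisticRegressionCV` is already stratified k-fold, so the plain "CV" run and the "stratified" run above may differ mainly in the number of folds rather than in the kind of split.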

### Testing with balanced data

The problem with imbalanced data: the model is biased toward predicting the dominant class, since doing so raises accuracy, but it does not necessarily raise *recall*, which is our focus.
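This notebook loads an already balanced training file and does not show how it was built. As an illustration only, the sketch below shows one common approach, random undersampling of the majority class with pandas. The toy frame reuses the `IN_EVASAO` target name, but the feature column and class counts are made up, and we are not claiming this is how `df_treino_bal.csv` was produced.

```python
import numpy as np
import pandas as pd

# Toy frame using the notebook's target column name; the feature is made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=300),
    "IN_EVASAO": [0] * 200 + [1] * 100,
})

minority_size = df["IN_EVASAO"].value_counts().min()

# Draw the same number of rows from each class, then shuffle the result.
parts = [
    group.sample(n=minority_size, random_state=0)
    for _, group in df.groupby("IN_EVASAO")
]
df_bal = pd.concat(parts).sample(frac=1, random_state=0)

print(df["IN_EVASAO"].value_counts().to_dict())
print(df_bal["IN_EVASAO"].value_counts().to_dict())
```

An alternative that keeps every training row is to pass `class_weight='balanced'` to `LogisticRegression`, which reweights the classes during fitting instead of resampling them.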

In [35]:
X_train_bal = pd.read_csv(TREAT_PATH / 'modelo' / 'df_treino_bal.csv', index_col='ID')
y_train_bal = X_train_bal['IN_EVASAO']
X_train_bal = X_train_bal.drop('IN_EVASAO', axis=1)

In [36]:
# from sklearn.linear_model import LogisticRegression

# bal_logit = LogisticRegression(random_state=0)
# bal_logit.fit(X_train_bal, y_train_bal)

# # Save the model
# filename = 'bal_logit_reg_model.sav'
# pickle.dump(bal_logit, open(MODEL_PATH / filename, 'wb'))

# Load the previously trained model
import pickle
filename = 'bal_logit_reg_model.sav'
with open(MODEL_PATH / filename, 'rb') as f:
    bal_logit = pickle.load(f)

In [37]:
y_pred_bal = bal_logit.predict(X_test)

In [38]:
evasao_label = {0: 'Não evadiu', 1: 'Evadiu'}
title = 'Regressão Logística (classes balanceadas): Real x Predito'

plot_confusion(y_test, y_pred_bal, evasao_label, title)

In [39]:
# from sklearn.metrics import recall_score, precision_score, classification_report
print(classification_report(y_test, y_pred_bal, target_names=evasao_label.values()))

              precision    recall  f1-score   support

  Não evadiu       0.79      0.80      0.79     30315
      Evadiu       0.57      0.55      0.56     14456

    accuracy                           0.72     44771
   macro avg       0.68      0.67      0.68     44771
weighted avg       0.72      0.72      0.72     44771



In [40]:
y_score_bal = bal_logit.decision_function(X_test)
plot_roc(y_test, y_score_bal, 'Curva ROC - Regressão Logística (classes balanceadas)')

#### With stratified k-fold:

In [41]:
# from sklearn.linear_model import LogisticRegressionCV

# bal_logit_strat = LogisticRegressionCV(random_state=0)
# bal_logit_strat.fit(X_train_bal, y_train_bal)

# # Save the model
# import pickle
# filename = 'bal_logit_reg_strat_model.sav'
# pickle.dump(bal_logit_strat, open(MODEL_PATH / filename, 'wb'))

# Load the previously trained model
import pickle
filename = 'bal_logit_reg_strat_model.sav'
with open(MODEL_PATH / filename, 'rb') as f:
    bal_logit_strat = pickle.load(f)

In [42]:
y_pred_bal_strat = bal_logit_strat.predict(X_test)

In [43]:
evasao_label = {0: 'Não evadiu', 1: 'Evadiu'}
title = 'Regressão Logística (classes balanceadas - Stratified K-Fold): Real x Predito'

plot_confusion(y_test, y_pred_bal_strat, evasao_label, title)

In [44]:
# from sklearn.metrics import recall_score, precision_score, classification_report
print(classification_report(y_test, y_pred_bal_strat, target_names=evasao_label.values()))

              precision    recall  f1-score   support

  Não evadiu       0.79      0.79      0.79     30315
      Evadiu       0.56      0.56      0.56     14456

    accuracy                           0.71     44771
   macro avg       0.67      0.67      0.67     44771
weighted avg       0.72      0.71      0.71     44771



In [45]:
y_score_bal_strat = bal_logit_strat.decision_function(X_test)
plot_roc(y_test, y_score_bal_strat, 'Curva ROC - Regressão Logística (classes balanceadas - Stratified K-Fold)')

## Random Forest

---

In [46]:
# from sklearn.ensemble import RandomForestClassifier

# bal_rf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0)
# bal_rf.fit(X_train_bal, y_train_bal)

# # Save the model
# import pickle
# filename = 'bal_rf_model.sav'
# pickle.dump(bal_rf, open(MODEL_PATH / filename, 'wb'))

# Load the previously trained model
import pickle
filename = 'bal_rf_model.sav'
with open(MODEL_PATH / filename, 'rb') as f:
    bal_rf = pickle.load(f)

In [47]:
pd.Series(bal_rf.feature_importances_, index=X_train_bal.columns).sort_values(ascending=False)

NU_IDADE                      0.230758
CO_ENTIDADE                   0.221149
IN_DISTORCAO                  0.177605
CO_MUNICIPIO                  0.086010
N_TURMA                       0.063404
TP_SEXO                       0.050978
IN_TRANSPORTE_PUBLICO         0.027574
NIVEL                         0.022754
IN_LOCAL_ESCOLA               0.020597
IN_N_COMP_15                  0.018805
IN_BANHEIRO_FORA_PREDIO       0.017590
IN_LABORATORIO_CIENCIAS       0.016715
IN_LABORATORIO_INFORMATICA    0.015138
IN_AREA_VERDE                 0.013037
IN_QUADRA_ESPORTES            0.009133
IN_BIBLIOTECA                 0.008752
dtype: float64

In [48]:
y_pred_bal_rf = bal_rf.predict(X_test)

In [49]:
# from sklearn.metrics import recall_score, precision_score, classification_report
print(classification_report(y_test, y_pred_bal_rf, target_names=evasao_label.values()))

              precision    recall  f1-score   support

  Não evadiu       0.79      0.71      0.74     30315
      Evadiu       0.49      0.60      0.54     14456

    accuracy                           0.67     44771
   macro avg       0.64      0.65      0.64     44771
weighted avg       0.69      0.67      0.68     44771



In [50]:
evasao_label = {0: 'Não evadiu', 1: 'Evadiu'}
title = 'Random Forest (classes balanceadas): Real x Predito'

plot_confusion(y_test, y_pred_bal_rf, evasao_label, title)

In [51]:
# The forest was trained on the 16 raw features, so it must be scored on X_test directly.
# One-hot encoding the leaf indices from apply() changes the number of features and
# raises a ValueError. RandomForestClassifier has no decision_function, so we use the
# positive-class probability as the score.
y_score_bal_rf = bal_rf.predict_proba(X_test)[:, 1]

In [None]:
plot_roc(y_test, y_score_bal_rf, 'Curva ROC - Random Forest (classes balanceadas)')
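Unlike logistic regression, `RandomForestClassifier` does not provide a `decision_function`; the usual score for a ROC curve is the positive-class column of `predict_proba`, computed on the same features the model was trained on. The sketch below shows this on synthetic data using `roc_curve` and `roc_auc_score` from scikit-learn. The model and data here are stand-ins, not `bal_rf` or our test set, and we make no assumption about what the local `plot_roc` helper does internally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
rf_demo.fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of the positive class.
y_score_demo = rf_demo.predict_proba(X_te)[:, 1]

# Points of the ROC curve and the area under it.
fpr, tpr, thresholds = roc_curve(y_te, y_score_demo)
auc_value = roc_auc_score(y_te, y_score_demo)
print(f"ROC AUC = {auc_value:.3f}")
```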