# First model tests

This notebook tests logistic regression and random forest models on the filtered student and school variables drawn from the Censo, plus the socioeconomic level.

## Results

- As expected, the random forest classifies dropouts better: its *recall* is about 60%, against 56% for the logistic regression with cross-validation (CV), as the classification reports further down show. The gap is still small.
- Student age is highly relevant in both models. In the logistic regression, age-grade distortion has the largest coefficient (0.85 $\pm$ 0.1) and "categorical" age the second largest (0.3 $\pm$ 0.1).
- In the random forest, the school also carries considerable importance relative to the other variables: school, age and distortion each account for about 20.5% $\pm$ 2.

Given how relevant age is, we checked how much better our model performs than a dropout "probability" based only on the age ranking (the older the student, the higher the probability of dropping out). The results were not very satisfactory:
- For the 10% at highest risk (ranked by model probability or by age), the precision and recall of the logistic regression are always below those of the age ranking, by up to 0.03.
- For the same group, the precision and recall of the random forest are above those of the age ranking, but by at most 0.05.
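
The top-10% comparison described above can be sketched as follows. This is only an illustration, not the code used in this notebook: `precision_recall_at_top` is a hypothetical helper, and the labels, probabilities and ages are made-up toy values.

```python
import numpy as np

def precision_recall_at_top(y_true, score, frac=0.10):
    """Flag the `frac` highest-scoring students as at risk and score that choice."""
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=float)
    k = max(1, int(round(frac * len(y_true))))   # number of students flagged
    top = np.argsort(-score, kind="stable")[:k]  # indices of the k highest scores
    hits = y_true[top].sum()                     # actual dropouts among the flagged
    return hits / k, hits / y_true.sum()         # precision, recall

# Toy data (hypothetical): 10 students, 4 of them dropouts
y = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
model_prob = np.array([0.9, 0.2, 0.8, 0.3, 0.1, 0.4, 0.5, 0.2, 0.7, 0.1])
age = np.array([9, 4, 8, 6, 3, 5, 9, 4, 7, 3])

# Same cutoff (top 30% here, because the toy set is tiny), two different rankings
prec_model, rec_model = precision_recall_at_top(y, model_prob, frac=0.3)
prec_age, rec_age = precision_recall_at_top(y, age, frac=0.3)
```

Ranking by age only needs the raw column, so it is a cheap baseline: a model is worth its complexity only if it beats that ranking at the coverage level we care about.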


## Logistic regression

---

### Summary

Overall, the models do not perform very well. We tested logistic regression models with different training setups: "plain" and with cross-validation (*k-fold*, *stratified k-fold*).

- Accuracy is around 71% ($\pm$ 1%) in every case. Since our goal is to correctly identify the students who actually drop out (i.e. to raise the true positive rate), this metric alone tells us little.
- The sensitivity (*recall*) of the models improved by only 1 percentage point with cross-validation, from 55% to 56%. That is still weak when compared with a random classifier (50%).
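
To make the random reference point concrete, here is a small simulation, assuming a "random" classifier means a fair coin flip per student. The class counts are the test-set counts reported in this notebook; everything else is illustrative.

```python
import numpy as np

n_neg, n_pos = 30207, 14177          # test-set class counts reported in this notebook
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

rng = np.random.default_rng(0)
y_rand = rng.integers(0, 2, size=y_true.size)   # coin-flip prediction for each student

tp = np.sum((y_true == 1) & (y_rand == 1))
fp = np.sum((y_true == 0) & (y_rand == 1))

recall_rand = tp / n_pos              # expected to be close to 0.5
precision_rand = tp / (tp + fp)       # expected to be close to the dropout prevalence
prevalence = n_pos / (n_neg + n_pos)

print(f"recall={recall_rand:.3f} precision={precision_rand:.3f} prevalence={prevalence:.3f}")
```

A coin flip already reaches about 50% recall, but its precision stays near the share of dropouts in the data, so recall should be read together with precision.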

In [1]:
# %load first_cell.py
%reload_ext autoreload
%autoreload 2

from paths import RAW_PATH, TREAT_PATH, OUTPUT_PATH, FIGURES_PATH, MODEL_PATH

import os
from copy import deepcopy
import numpy as np
import pandas as pd
pd.options.display.max_columns = 999
import pandas_profiling

import warnings
warnings.filterwarnings('ignore')

# Plotting
import plotly
import plotly.graph_objs as go
import cufflinks as cf
plotly.offline.init_notebook_mode(connected=True)

# Metrics
from plot_metrics import plot_roc, plot_confusion, plot_cover

cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

colorscale = ['#025951', '#8BD9CA', '#BF7F30', '#F2C124', '#8C470B', '#DFC27D']

### Loading the train and test sets

In [2]:
X_train = pd.read_csv(TREAT_PATH / 'modelo' / 'df_treino_bal.csv', index_col='ID')
y_train = X_train['IN_EVASAO']
X_train = X_train.drop('IN_EVASAO', axis=1)

X_test = pd.read_csv(TREAT_PATH / 'modelo' / 'df_teste.csv', index_col='ID')
y_test = X_test['IN_EVASAO']
X_test = X_test.drop('IN_EVASAO', axis=1)

In [3]:
X_train.head()

ID,CO_ENTIDADE,TP_SEXO,NU_IDADE,IN_DISTORCAO,IN_TRANSPORTE_PUBLICO,IN_LOCAL_ESCOLA,N_TURMA,CO_MUNICIPIO,IN_LABORATORIO_INFORMATICA,IN_LABORATORIO_CIENCIAS,IN_QUADRA_ESPORTES,IN_BIBLIOTECA,IN_BANHEIRO_FORA_PREDIO,IN_AREA_VERDE,IN_N_COMP_15,NIVEL
113112940902,463.0,1.0,6.0,1.0,1.0,1.0,1.0,36.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,2.0
114122977290,644.0,1.0,8.0,1.0,1.0,1.0,1.0,69.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0
112735067407,228.0,1.0,6.0,1.0,0.0,2.0,1.0,56.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,2.0
116497790542,550.0,0.0,6.0,1.0,0.0,1.0,1.0,47.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,2.0
116816788370,690.0,0.0,3.0,0.0,0.0,1.0,2.0,71.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,2.0


In [4]:
y_train.value_counts()

1    33158
0    33158
Name: IN_EVASAO, dtype: int64

In [5]:
X_test.head()

ID,CO_ENTIDADE,TP_SEXO,NU_IDADE,IN_DISTORCAO,IN_TRANSPORTE_PUBLICO,IN_LOCAL_ESCOLA,N_TURMA,CO_MUNICIPIO,IN_LABORATORIO_INFORMATICA,IN_LABORATORIO_CIENCIAS,IN_QUADRA_ESPORTES,IN_BIBLIOTECA,IN_BANHEIRO_FORA_PREDIO,IN_AREA_VERDE,IN_N_COMP_15,NIVEL
122688991280,388.0,1.0,4.0,0.0,1.0,1.0,1.0,22.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,2.0
112004177984,79.0,0.0,4.0,0.0,0.0,2.0,0.0,14.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0
111551586796,841.0,0.0,4.0,0.0,0.0,2.0,0.0,76.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,2.0
113447628919,558.0,0.0,4.0,0.0,1.0,1.0,0.0,47.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,2.0
118341995300,786.0,1.0,4.0,0.0,1.0,1.0,0.0,69.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0


In [6]:
y_test.value_counts()

0    30207
1    14177
Name: IN_EVASAO, dtype: int64

### Running logistic regression

In [7]:
# Training code, kept for reference; the fitted model is loaded from disk below.
# from sklearn.linear_model import LogisticRegression
#
# logit = LogisticRegression(random_state=0)
# logit.fit(X_train, y_train)
#
# # Saving the model
# import pickle
# filename = 'logit_reg_model.sav'
# with open(MODEL_PATH / filename, 'wb') as f:
#     pickle.dump(logit, f)

import pickle
filename = 'logit_reg_model.sav'
with open(MODEL_PATH / filename, 'rb') as f:
    logit = pickle.load(f)

* Checking the confusion matrix

In [9]:
y_pred = logit.predict(X_test)

In [73]:
evasao_label = {0: 'Não evadiu', 1: 'Evadiu'}
title = 'Regressão Logística: Real x Predito'

cfm = plot_confusion(y_test, y_pred, evasao_label, title)

              precision    recall  f1-score   support

  Não evadiu       0.79      0.79      0.79     30207
      Evadiu       0.55      0.55      0.55     14177

    accuracy                           0.71     44384
   macro avg       0.67      0.67      0.67     44384
weighted avg       0.71      0.71      0.71     44384



In [74]:
cfm

array([[23849,  6358],
       [ 6317,  7860]])

In [75]:
cfm.sum(axis=1) - cfm.sum(axis=0) # Actual - predicted

array([ 41, -41])
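
The figures in the classification report can be recovered directly from this confusion matrix. The sketch below copies the matrix printed above and assumes rows are actual classes and columns are predicted classes, which is what the `axis=1` minus `axis=0` difference in the previous cell implies.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
cfm = np.array([[23849, 6358],
                [6317, 7860]])
tn, fp, fn, tp = cfm.ravel()

accuracy = (tp + tn) / cfm.sum()
recall = tp / (tp + fn)       # share of actual dropouts that the model caught
precision = tp / (tp + fp)    # share of flagged students who actually dropped out

# Row sums are actual class counts, column sums are predicted class counts
actual_minus_pred = cfm.sum(axis=1) - cfm.sum(axis=0)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

The small difference between actual and predicted counts means the model flags roughly as many students as actually drop out; it says nothing about whether it flags the right ones, which is what recall and precision measure.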

* Variable coefficients

In [95]:
coef = pd.DataFrame(index=X_train.columns, data=logit.coef_[0], columns=['Logit'])
coef.sort_values('Logit', ascending=False)

Unnamed: 0,Logit
IN_DISTORCAO,0.931175
NU_IDADE,0.309979
IN_BANHEIRO_FORA_PREDIO,0.080234
IN_BIBLIOTECA,0.073092
IN_LABORATORIO_CIENCIAS,0.072949
N_TURMA,0.060943
CO_MUNICIPIO,0.002696
CO_ENTIDADE,-0.000149
IN_LABORATORIO_INFORMATICA,-0.000271
IN_N_COMP_15,-0.014047
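
One way to read these coefficients is as odds ratios: in a logistic regression, a one-unit increase in a variable multiplies the odds of dropout by `exp(coefficient)`. The short sketch below applies that to three coefficients copied from the table above.

```python
import math

# Coefficients copied from the table above
coef = {"IN_DISTORCAO": 0.931175, "NU_IDADE": 0.309979, "IN_N_COMP_15": -0.014047}

# Multiplicative change in the odds of dropout per one-unit increase of each variable
odds_ratio = {name: math.exp(value) for name, value in coef.items()}

for name, ratio in odds_ratio.items():
    print(f"{name}: odds x {ratio:.2f}")
```

Keep in mind that each ratio is per unit of its own variable, so variables measured on different scales (a 0/1 flag versus a numeric code) are not directly comparable by coefficient size alone.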


In [61]:
# precision_score(y_test, y_pred)
y_score = logit.decision_function(X_test)
y_score

array([-0.66904013, -0.54832071, -0.50911167, ..., -0.97371452,
        1.07012925, -0.36961899])

In [62]:
plot_roc(y_test, y_score, title='Curva ROC - Regressão Logística')
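
`plot_roc` is a helper from this project and its output is not reproduced here. As a rough companion, the area under the ROC curve can be computed from the same scores: it equals the probability that a randomly chosen dropout receives a higher score than a randomly chosen non-dropout. The function name and the toy values below are illustrative assumptions.

```python
import numpy as np

def auc_from_scores(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]     # every positive score minus every negative score
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Toy example (hypothetical scores)
auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise matrix grows with the product of the class sizes, so this form suits small samples only; for the full test set a rank-based routine such as `sklearn.metrics.roc_auc_score` is the practical choice.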

### Running logit with CV

In [63]:
# Training code, kept for reference; the fitted model is loaded from disk below.
# from sklearn.linear_model import LogisticRegressionCV
#
# logit_cv = LogisticRegressionCV(cv=5, random_state=0)
# logit_cv.fit(X_train, y_train)
#
# # Saving the model
# import pickle
# filename = 'logit_reg_cv_model.sav'
# with open(MODEL_PATH / filename, 'wb') as f:
#     pickle.dump(logit_cv, f)

import pickle
filename = 'logit_reg_cv_model.sav'
with open(MODEL_PATH / filename, 'rb') as f:
    logit_cv = pickle.load(f)

In [64]:
y_pred_cv = logit_cv.predict(X_test)

In [66]:
evasao_label = {0: 'Não evadiu', 1: 'Evadiu'}
title = 'Regressão Logística (CV k-Fold=5): Real x Predito'

cfm = plot_confusion(y_test, y_pred_cv, evasao_label, title)

              precision    recall  f1-score   support

  Não evadiu       0.79      0.78      0.79     30207
      Evadiu       0.55      0.56      0.55     14177

    accuracy                           0.71     44384
   macro avg       0.67      0.67      0.67     44384
weighted avg       0.71      0.71      0.71     44384



In [67]:
cfm

array([[23661,  6546],
       [ 6233,  7944]])

In [70]:
cfm.sum(axis=1) - cfm.sum(axis=0) # Actual - predicted

array([ 313, -313])

* Variable coefficients

In [96]:
coef['LogitCV'] = pd.Series(index=X_train.columns, data=logit_cv.coef_[0]).sort_values()
coef.sort_values('LogitCV', ascending=False)

Unnamed: 0,Logit,LogitCV
IN_DISTORCAO,0.931175,0.783001
NU_IDADE,0.309979,0.363813
IN_BANHEIRO_FORA_PREDIO,0.080234,0.076713
IN_LABORATORIO_CIENCIAS,0.072949,0.067583
IN_BIBLIOTECA,0.073092,0.065928
N_TURMA,0.060943,0.06128
IN_LABORATORIO_INFORMATICA,-0.000271,0.021284
CO_MUNICIPIO,0.002696,0.002925
CO_ENTIDADE,-0.000149,-1.9e-05
IN_N_COMP_15,-0.014047,-0.021444


In [87]:
y_score_cv = logit_cv.decision_function(X_test)
plot_roc(y_test, y_score_cv, 'Curva ROC - Regressão Logística (CV k-Fold=5)')

### Running logit with stratified k-folds

In [88]:
# Training code, kept for reference; the fitted model is loaded from disk below.
# from sklearn.linear_model import LogisticRegressionCV
#
# logit_strat = LogisticRegressionCV(random_state=0)
# logit_strat.fit(X_train, y_train)
#
# # Saving the model
# import pickle
# filename = 'logit_reg_strat_model.sav'
# with open(MODEL_PATH / filename, 'wb') as f:
#     pickle.dump(logit_strat, f)

import pickle
filename = 'logit_reg_strat_model.sav'
with open(MODEL_PATH / filename, 'rb') as f:
    logit_strat = pickle.load(f)

In [89]:
y_pred_strat = logit_strat.predict(X_test)

In [91]:
evasao_label = {0: 'Não evadiu', 1: 'Evadiu'}
title = 'Regressão Logística (Stratified K-Fold): Real x Predito'

cfm = plot_confusion(y_test, y_pred_strat, evasao_label, title)

              precision    recall  f1-score   support

  Não evadiu       0.79      0.78      0.79     30207
      Evadiu       0.55      0.56      0.55     14177

    accuracy                           0.71     44384
   macro avg       0.67      0.67      0.67     44384
weighted avg       0.71      0.71      0.71     44384



In [92]:
cfm

array([[23626,  6581],
       [ 6205,  7972]])

In [93]:
cfm.sum(axis=1) - cfm.sum(axis=0) # Actual - predicted

array([ 376, -376])

* Variable coefficients

In [97]:
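# Use the stratified model's coefficients. The output below was produced with
# logit_cv.coef_, which is why its last column repeats the LogitCV column.
coef['LogitStratified'] = pd.Series(index=X_train.columns, data=logit_strat.coef_[0])
coef.sort_values('LogitStratified', ascending=False)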

Unnamed: 0,Logit,LogitCV,LogitStratified
IN_DISTORCAO,0.931175,0.783001,0.783001
NU_IDADE,0.309979,0.363813,0.363813
IN_BANHEIRO_FORA_PREDIO,0.080234,0.076713,0.076713
IN_LABORATORIO_CIENCIAS,0.072949,0.067583,0.067583
IN_BIBLIOTECA,0.073092,0.065928,0.065928
N_TURMA,0.060943,0.06128,0.06128
IN_LABORATORIO_INFORMATICA,-0.000271,0.021284,0.021284
CO_MUNICIPIO,0.002696,0.002925,0.002925
CO_ENTIDADE,-0.000149,-1.9e-05,-1.9e-05
IN_N_COMP_15,-0.014047,-0.021444,-0.021444


In [98]:
y_score_strat = logit_strat.decision_function(X_test)
plot_roc(y_test, y_score_strat, 'Curva ROC - Regressão Logística (Stratified k-Fold)')

In [34]:
# from sklearn.model_selection import StratifiedKFold
# skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # random_state only takes effect with shuffle=True
# skf.get_n_splits(X_norm, y)
# for train_index, test_index in skf.split(X_norm, y):
#     print("TRAIN:", train_index, "TEST:", test_index)
#     X_train, X_test = X_norm.iloc[train_index], X_norm.iloc[test_index]
#     y_train, y_test = y.iloc[train_index], y.iloc[test_index]
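
For reference, the idea behind stratified folds can be sketched without scikit-learn: the members of each class are shuffled and dealt across the folds, so every fold keeps roughly the overall class ratio. `stratified_fold_ids` is a hypothetical helper written for this illustration and is not how `StratifiedKFold` is implemented internally.

```python
import numpy as np

def stratified_fold_ids(y, n_splits=5, seed=0):
    """Assign each sample a fold id so that classes are spread evenly across folds."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))   # shuffle the members of class c
        fold[idx] = np.arange(len(idx)) % n_splits      # deal them round-robin into folds
    return fold

# Toy labels (hypothetical): 80 non-dropouts and 20 dropouts
y = np.array([0] * 80 + [1] * 20)
fold = stratified_fold_ids(y)

pos_per_fold = [int(y[fold == f].sum()) for f in range(5)]
size_per_fold = [int((fold == f).sum()) for f in range(5)]
```

Each fold then serves once as the validation set while the remaining folds are used for training.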

### Testing with balanced data

The problem with imbalanced data: the model leans toward predicting the dominant class, because that raises accuracy, but it does not necessarily raise *recall*, which is our focus.
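
How `df_treino_bal.csv` was balanced is not shown in this notebook. One common option is random undersampling of the majority class, sketched below as an assumption rather than a description of what was actually done; `undersample_indices` is a hypothetical helper.

```python
import numpy as np

def undersample_indices(y, seed=0):
    """Return sorted row positions that keep an equal number of samples per class."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()                                  # size of the smallest class
    keep = [rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
            for c in classes]
    return np.sort(np.concatenate(keep))

# Toy labels (hypothetical): 70 non-dropouts and 30 dropouts
y = np.array([0] * 70 + [1] * 30)
idx = undersample_indices(y)
balanced_counts = np.bincount(y[idx])
```

With a DataFrame, the returned positions can be passed to `.iloc` to select the balanced training rows. Only the training set should be balanced; the test set keeps its original class distribution, as it does here.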

### Metric: precision-recall curve by number of students covered (sorted by probability)

Here we use the stratified logistic regression model (balanced classes).

In [99]:
logit_strat.classes_

array([0, 1])

In [100]:
y_prob_strat = logit_strat.predict_proba(X_test)
y_prob_strat

array([[0.65838591, 0.34161409],
       [0.63833988, 0.36166012],
       [0.60735879, 0.39264121],
       ...,
       [0.73661684, 0.26338316],
       [0.2759881 , 0.7240119 ],
       [0.59603941, 0.40396059]])

In [101]:
# Probability of the positive class (dropout) for each student
y_prob_strat_pos = y_prob_strat[:, 1]

df_pred = pd.DataFrame(y_test).rename({'IN_EVASAO': 'in_evasao'}, axis=1)
df_pred['prob'] = y_prob_strat_pos

df_pred = df_pred.sort_values('prob', ascending=False)
df_pred

ID,in_evasao,prob
117444679435,1,0.926373
112218351659,0,0.926315
114879594945,0,0.925572
110250458585,1,0.922439
110929741055,1,0.922099
144782900839,1,0.921913
113199723305,1,0.921382
118274943529,0,0.920496
110876059115,1,0.919917
112816568287,1,0.919652


In [102]:
# Baseline: random probability
r = np.random.RandomState(0)

df_pred_rand = pd.DataFrame(y_test).rename({'IN_EVASAO': 'in_evasao'}, axis=1)
df_pred_rand['prob'] = r.rand(len(df_pred_rand))

df_pred_rand = df_pred_rand.sort_values('prob', ascending=False)
df_pred_rand

ID,in_evasao,prob
144795439612,1,0.999978
116705540825,0,0.999964
144785330324,1,0.999962
111382638930,0,0.999957
110180671323,1,0.999951
117586397928,1,0.999949
111306746069,0,0.999931
111483891215,0,0.999919
144785582722,1,0.999856
112294070142,1,0.999855


In [103]:
from sklearn.metrics import precision_score, recall_score
from tqdm.notebook import tqdm

def calculate_prec_recall(df, feature=None, n_limits=100):
    """
    Compute precision and recall while moving the cutoff down the list of students
    sorted by dropout probability (or by another feature).
    
    Parameters
    ----------
    df : pandas.DataFrame
        One row per student, with the probability ('prob') and the dropout indicator ('in_evasao')
        
    feature : str
        Column used to sort the students (default=None, i.e. sort by probability)
        
    n_limits : int
        Number of cutoff points along the sorted list (default=100)
        
    Returns
    -------
    pandas.DataFrame
        Columns 'perc_cover', 'precision' and 'recall', one row per cutoff
    
    """
    # Sort in descending order by the chosen feature, or by probability by default
    df = df.sort_values(feature if feature else 'prob', ascending=False)

    # Position of each row in the sorted list
    df['idx'] = range(len(df))
    
    # Cutoffs, from the top-ranked student to the whole list
    limits = np.linspace(1, len(df), n_limits)

    # Metrics at each cutoff
    precision = []
    recall = []
    
    for i in tqdm(limits):
        # Flag as dropouts the students ranked above the cutoff
        pred = (df['idx'] < i).astype(int)
        precision.append(precision_score(df['in_evasao'], pred))
        recall.append(recall_score(df['in_evasao'], pred))
    
    return pd.DataFrame({'perc_cover': limits / len(df),
                         'precision': precision,
                         'recall': recall})
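
The loop above recomputes both metrics from scratch at every cutoff. The same curve can be obtained in a single pass with a cumulative sum of the sorted labels. The sketch below is an alternative written for illustration, not the function used in this notebook: `cover_curve` is a hypothetical name, and it reports coverage as the number of flagged students over the total, which can differ slightly from `limits / len(df)` when a cutoff is not an integer.

```python
import numpy as np
import pandas as pd

def cover_curve(df, by="prob", n_limits=100):
    """Precision and recall when flagging the top-k students, for a grid of k values."""
    y = df.sort_values(by, ascending=False)["in_evasao"].to_numpy()
    cum_hits = np.cumsum(y)                      # dropouts among the top-k, for every k
    # A cutoff i flags the rows with position < i, i.e. ceil(i) students
    k = np.ceil(np.linspace(1, len(y), n_limits)).astype(int)
    return pd.DataFrame({
        "perc_cover": k / len(y),
        "precision": cum_hits[k - 1] / k,
        "recall": cum_hits[k - 1] / y.sum(),
    })

# Toy data (hypothetical): 5 students, 3 dropouts
toy = pd.DataFrame({"in_evasao": [1, 0, 1, 1, 0],
                    "prob": [0.9, 0.8, 0.6, 0.4, 0.1]})
curve = cover_curve(toy, n_limits=5)
```

Passing `by="idade_norm"` would give the age-ordered curve in the same way.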

In [104]:
df_cover_rand = calculate_prec_recall(df_pred_rand)


In [105]:
df_cover = calculate_prec_recall(df_pred)


In [106]:
df_cover.head()

Unnamed: 0,perc_cover,precision,recall
0,2.3e-05,1.0,7.1e-05
1,0.010123,0.755556,0.023983
2,0.020224,0.73608,0.046625
3,0.030325,0.71471,0.067856
4,0.040426,0.703064,0.089017


In [107]:
df_cover_rand.head()

Unnamed: 0,perc_cover,precision,recall
0,2.3e-05,1.0,7.1e-05
1,0.010123,0.348889,0.011074
2,0.020224,0.306236,0.019398
3,0.030325,0.314264,0.029837
4,0.040426,0.321448,0.0407


In [118]:
plot_cover(df_cover, df_cover_rand, 'random')

#### Ordering by age

In [109]:
df_pred['idade_norm'] = X_test['NU_IDADE']
df_pred.head()

ID,in_evasao,prob,idade_norm
117444679435,1,0.926373,9.0
112218351659,0,0.926315,9.0
114879594945,0,0.925572,9.0
110250458585,1,0.922439,9.0
110929741055,1,0.922099,9.0


In [110]:
df_cover_idade = calculate_prec_recall(df_pred, feature='idade_norm')


In [111]:
df_pred_rand['idade_norm'] = X_test['NU_IDADE']
df_pred_rand.head()

ID,in_evasao,prob,idade_norm
144795439612,1,0.999978,6.0
116705540825,0,0.999964,3.0
144785330324,1,0.999962,8.0
111382638930,0,0.999957,3.0
110180671323,1,0.999951,4.0


In [112]:
df_cover_idade_rand = calculate_prec_recall(df_pred_rand, feature='idade_norm')


In [113]:
df_cover_idade_rand.head()

Unnamed: 0,perc_cover,precision,recall
0,2.3e-05,1.0,7.1e-05
1,0.010123,0.702222,0.02229
2,0.020224,0.701559,0.044438
3,0.030325,0.675334,0.064118
4,0.040426,0.681894,0.086337


In [117]:
plot_cover(df_cover_idade, df_cover_idade_rand, 'idade')

## Random Forest

---

In [120]:
from sklearn.ensemble import RandomForestClassifier

bal_rf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0)
bal_rf.fit(X_train, y_train)

# Saving the model
import pickle
filename = 'bal_rf_model.sav'
with open(MODEL_PATH / filename, 'wb') as f:
    pickle.dump(bal_rf, f)

# To reload the saved model instead of retraining:
# with open(MODEL_PATH / filename, 'rb') as f:
#     bal_rf = pickle.load(f)

In [121]:
pd.Series(bal_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

CO_ENTIDADE                   0.228901
NU_IDADE                      0.212264
IN_DISTORCAO                  0.182355
CO_MUNICIPIO                  0.087048
N_TURMA                       0.065016
TP_SEXO                       0.050300
IN_TRANSPORTE_PUBLICO         0.027733
NIVEL                         0.023587
IN_LOCAL_ESCOLA               0.020825
IN_N_COMP_15                  0.020069
IN_BANHEIRO_FORA_PREDIO       0.017517
IN_LABORATORIO_CIENCIAS       0.016975
IN_LABORATORIO_INFORMATICA    0.015841
IN_AREA_VERDE                 0.012810
IN_BIBLIOTECA                 0.009483
IN_QUADRA_ESPORTES            0.009276
dtype: float64
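
The values above are impurity-based importances, which are known to favour variables with many distinct values, such as identifier codes like `CO_ENTIDADE`. Permutation importance is a common cross-check: shuffle one column at a time and measure how much a metric drops. The sketch below is a simplified, hypothetical version on toy data; scikit-learn provides `sklearn.inspection.permutation_importance` for real use.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in recall when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)

    def recall(y_true, y_pred):
        return np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1)

    base = recall(y, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between column j and y
            drops[j] += base - recall(y, predict(X_perm))
    return drops / n_repeats

# Toy setup (hypothetical): the "model" looks at column 0 only
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)
predict = lambda data: (data[:, 0] > 0).astype(int)

importances = permutation_importance_sketch(predict, X, y)
```

In the notebook this would be evaluated on `X_test` with `bal_rf.predict`, so that the importance reflects performance on unseen students.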

In [141]:
n_nodes = [ind_tree.tree_.node_count for ind_tree in bal_rf.estimators_]
max_depths = [ind_tree.tree_.max_depth for ind_tree in bal_rf.estimators_]
    
print(f'Average number of nodes {int(np.mean(n_nodes))}')
print(f'Average maximum depth {int(np.mean(max_depths))}')

Average number of nodes 17703
Average maximum depth 20


In [152]:
y_pred_rf = bal_rf.predict(X_test)

In [147]:
# y_pred_rf = bal_rf.predict(X_test)
# y_prob_rf = bal_rf.predict_proba(X_test)[:, 1]

# y_train_pred_rf = bal_rf.predict(X_train)
# y_train_prob_rf = bal_rf.predict_proba(X_train)[:, 1]

# evaluate_model(y_pred_rf, y_prob_rf, y_train_pred_rf, y_train_prob_rf)

In [153]:
evasao_label = {0: 'Não evadiu', 1: 'Evadiu'}
title = 'Random Forest (classes balanceadas): Real x Predito'

cfm = plot_confusion(y_test, y_pred_rf, evasao_label, title)

              precision    recall  f1-score   support

  Não evadiu       0.79      0.70      0.74     30207
      Evadiu       0.48      0.60      0.53     14177

    accuracy                           0.67     44384
   macro avg       0.64      0.65      0.64     44384
weighted avg       0.69      0.67      0.67     44384



In [124]:
cfm

array([[21128,  9079],
       [ 5703,  8474]])

In [155]:
y_prob_rf = bal_rf.predict_proba(X_test)
y_prob_rf

array([[0.75119507, 0.24880493],
       [0.51217296, 0.48782704],
       [0.71980315, 0.28019685],
       ...,
       [0.87550271, 0.12449729],
       [0.72118269, 0.27881731],
       [0.55207233, 0.44792767]])

In [156]:
plot_roc(y_test, y_prob_rf[:,1], 'Curva ROC - RandomForest')

In [157]:
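# Probability of the positive class (dropout) for each student
y_prob_rf_pos = y_prob_rf[:, 1]

df_pred = pd.DataFrame(y_test).rename({'IN_EVASAO': 'in_evasao'}, axis=1)
# Use the random forest probabilities. The original cell assigned y_prob_strat[:, 1]
# (the logistic model), which is what the output shown below reflects.
df_pred['prob'] = y_prob_rf_pos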

df_pred = df_pred.sort_values('prob', ascending=False)
df_pred

ID,in_evasao,prob
117444679435,1,0.926373
112218351659,0,0.926315
114879594945,0,0.925572
110250458585,1,0.922439
110929741055,1,0.922099
144782900839,1,0.921913
113199723305,1,0.921382
118274943529,0,0.920496
110876059115,1,0.919917
112816568287,1,0.919652


In [158]:
df_cover = calculate_prec_recall(df_pred)


In [159]:
plot_cover(df_cover, df_cover_rand, 'random', 'Percentual de cobertura - RandomForest')

In [160]:
plot_cover(df_cover, df_cover_idade, 'idade', 'Percentual de cobertura - RandomForest')