# Introduction

This notebook builds a basic model, using the [RandomForest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) classifier, that predicts whether the archiving (arquivamento) of procedures sent to the 1A.CAM of the MPF will be ratified (homologação).

The model uses only the procedures' metadata, with no text processing at all.

Its purpose is to serve as a *baseline* for comparison with future implementations.


**Note**: the data for this model was retrieved from procedures whose deliberations took place after 2 July 2018 (02/07/2018), the date on which the new composition took office at the 1A.CAM.

# Data loading and preprocessing

Let's load the data and apply conventional preprocessing (removing attributes that are not of interest, creating categorical variables, etc.).

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [2]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

from sklearn import metrics

In [3]:
PATH = "../data"
df_original = pd.read_json(f'{PATH}/1A.CAM.homologacao-arquivamento.json')

In [4]:
len(df_original)

5462

In [5]:
df_original.columns

Index(['areaAtuacao', 'classe', 'dataAutuacao', 'dataEntrada', 'homologado',
       'id', 'itemCnmp', 'municipio', 'prioritario', 'procedimento',
       'providenciasExecutadas', 'quantidadeConversoes',
       'quantidadeProvidencias', 'urgente'],
      dtype='object')

In [6]:
df_original.head()

Unnamed: 0,areaAtuacao,classe,dataAutuacao,dataEntrada,homologado,id,itemCnmp,municipio,prioritario,procedimento,providenciasExecutadas,quantidadeConversoes,quantidadeProvidencias,urgente
0,2,3,"May 16, 2016 12:00:00 AM","Aug 3, 2018 5:38:09 PM",1,71564833,1103,60.0,0,1.10.001.000068/2016-52,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1,6,0
1,2,3,"Jul 7, 2016 12:00:00 AM","Jul 25, 2018 7:10:53 PM",1,72574520,1542,1541.0,0,1.11.000.000785/2016-57,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",4,12,0
2,2,3,"Apr 25, 2017 12:00:00 AM","Jul 24, 2018 5:30:13 PM",1,77742213,1543,3113.0,0,1.30.001.001754/2017-39,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,6,0
3,2,3,"Feb 14, 2017 12:00:00 AM","Jul 24, 2018 3:48:22 PM",1,76399468,1726,2650.0,0,1.22.005.000023/2017-16,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,5,0
4,2,3,"Jul 9, 2013 12:00:00 AM","Jul 27, 2018 3:12:08 PM",1,47526845,1503,4249.0,0,1.33.005.000326/2013-13,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",6,7,0


In [7]:
df_work = df_original.copy()

In [8]:
# Convert the date strings to datetime, one vectorized call per column
date_format = '%b %d, %Y %I:%M:%S %p'

for column in ['dataAutuacao', 'dataEntrada']:
    df_work[column] = pd.to_datetime(df_work[column], format=date_format)

In [9]:
for index in range(len(df_work)):
    df_work.loc[index, 'providenciasExecutadas'] = [[i] for i in df_work.loc[index, 'providenciasExecutadas']]

In [10]:
removed_columns = ['id', 'procedimento']
df_work = df_work.drop(columns=removed_columns)
df_work.sample(10)

Unnamed: 0,areaAtuacao,classe,dataAutuacao,dataEntrada,homologado,itemCnmp,municipio,prioritario,providenciasExecutadas,quantidadeConversoes,quantidadeProvidencias,urgente
555,2,3,2018-01-19 00:00:00,2018-06-12 16:59:44,1,3047,170.0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,4,0
4828,5,2,2015-05-12 00:00:00,2016-03-22 00:00:00,1,1523,810.0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1,5,0
3132,2,3,2009-09-02 00:00:00,2017-03-09 13:22:13,1,1580,1466.0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",5,30,0
5324,3,3,2012-11-21 00:00:00,2015-08-27 00:00:00,1,1580,2919.0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,5,0
4737,2,3,2014-06-24 00:00:00,2016-04-18 00:00:00,1,3150,4962.0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,10,0
5113,2,3,2013-07-08 00:00:00,2016-01-26 00:00:00,1,1503,105.0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,6,0
3423,2,3,2015-10-07 00:00:00,2017-01-30 15:21:03,1,3147,810.0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,4,0
1551,2,3,2016-12-02 00:00:00,2017-11-28 17:55:37,1,1515,2211.0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,17,0
1912,2,3,2015-11-03 00:00:00,2017-08-31 15:08:22,1,2359,4878.0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,6,0
2282,2,2,2017-02-14 00:00:00,2017-06-20 14:51:58,1,1811,1619.0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1,2,0


In [16]:
df_work.describe()

Unnamed: 0,areaAtuacao,classe,homologado,itemCnmp,municipio,prioritario,quantidadeConversoes,quantidadeProvidencias,urgente
count,5462.0,5462.0,5462.0,5462.0,5462.0,5462.0,5462.0,5462.0,5462.0
mean,2.266752,2.537532,0.983706,19800.86,2868.144086,0.006957,2.040278,7.393812,0.006957
std,0.83399,0.539216,0.126617,189043.8,1681.901095,0.083127,1.379018,8.934992,0.083127
min,1.0,1.0,0.0,2.0,-1.0,0.0,0.0,0.0,0.0
25%,2.0,2.0,1.0,1521.0,1301.0,0.0,1.0,3.0,0.0
50%,2.0,3.0,1.0,1582.5,3066.0,0.0,2.0,5.0,0.0
75%,2.0,3.0,1.0,1874.0,4313.0,0.0,3.0,9.0,0.0
max,6.0,5.0,1.0,2007548.0,5767.0,1.0,11.0,161.0,1.0


In [11]:
# handle missing values: replace NaN with the sentinel -1
df_work.fillna(-1, inplace=True)

In [12]:
len(df_work[df_work['homologado'] == 1]), len(df_work[df_work['homologado'] == 0])

(5373, 89)

### Imbalanced classes!

As the counts above show, the classes in this problem are highly imbalanced: only 1.63% of the dataset (89 of 5462 procedures) corresponds to procedures that were not ratified.

This will very likely cause problems when training the model. For now, though, we will ignore it and proceed with training.
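To make the imbalance concrete, here is a minimal sketch that uses only the two class counts printed above. It computes the minority share and the accuracy that a trivial rule ("always predict `homologado = 1`") would reach on the full dataset. The variable names are illustrative and are not part of the notebook.

```python
# Class counts taken from the output above: (homologado == 1, homologado == 0)
n_ratified, n_not_ratified = 5373, 89
total = n_ratified + n_not_ratified

# Share of the minority class, in percent
minority_pct = 100 * n_not_ratified / total

# Accuracy of a trivial rule that always predicts the majority class
majority_baseline_acc = n_ratified / total

print(f"minority share: {minority_pct:.2f}%")                     # 1.63%
print(f"always-predict-1 accuracy: {majority_baseline_acc:.4f}")  # 0.9837
```

Any accuracy close to 0.98 should therefore be read against this trivial baseline rather than against 0.5.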

In [17]:
# roughly 10% of the data is set aside for testing
test_size = int(len(df_work) * 0.10)
train_size = len(df_work) - test_size

print((train_size, test_size))

(4916, 546)


In [18]:
df_train  = df_work[0:train_size]
df_test  = df_work[-test_size:]
print((df_train.shape, df_test.shape))

((4916, 12), (546, 12))


In [19]:
# Percentage of non-ratified procedures (homologado == 0) in each set
(len(df_train[df_train.homologado == 0])/len(df_train))*100, (len(df_test[df_test.homologado == 0])/len(df_test))*100

(1.647681041497152, 1.465201465201465)
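The split above is positional (first rows for training, last rows for testing), so the two shares only happen to be close. As a hedged alternative, the sketch below uses `train_test_split` with `stratify`, which keeps the class proportions similar in both parts by construction. It runs on a synthetic stand-in (`df_demo`, a hypothetical frame with the same class counts as the real data), not on `df_work`, and the helper `pct_not_ratified` is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the real data (5373 vs. 89)
df_demo = pd.DataFrame({
    "feature": np.arange(5462),
    "homologado": np.r_[np.ones(5373, dtype=int), np.zeros(89, dtype=int)],
})

# stratify= keeps the share of each class roughly equal in both parts
demo_train, demo_test = train_test_split(
    df_demo,
    test_size=0.10,
    stratify=df_demo["homologado"],
    random_state=42,
)

def pct_not_ratified(frame):
    """Percentage of rows with homologado == 0."""
    return 100 * (frame["homologado"] == 0).mean()

pct_train = pct_not_ratified(demo_train)
pct_test = pct_not_ratified(demo_test)
print(pct_train, pct_test)
```

Note that a stratified split shuffles the rows; if the chronological order of the procedures matters for the evaluation, the positional split used in the notebook may still be the better choice.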

In [20]:
from sklearn.ensemble import RandomForestClassifier

removed_cols = ['homologado', 'providenciasExecutadas', 'dataEntrada', 'dataAutuacao']
features = [c for c in df_train.columns if c not in removed_cols]

model = RandomForestClassifier()
model.fit(df_train[features], df_train['homologado'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
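The repr above shows `class_weight=None`, meaning both classes count equally during training. One option that is not applied in this notebook is `class_weight='balanced'`, which weights samples inversely to their class frequency. The sketch below only illustrates the call on synthetic data from `make_classification`. The names `X_demo`, `y_demo` and `balanced_model` are hypothetical, and no claim is made about how this would perform on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, heavily imbalanced data (about 2% in class 0); not the notebook's data
X_demo, y_demo = make_classification(
    n_samples=2000,
    n_features=8,
    weights=[0.02, 0.98],
    random_state=0,
)

# class_weight='balanced' reweights samples inversely to class frequency
balanced_model = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",
    random_state=0,
)
balanced_model.fit(X_demo, y_demo)

demo_preds = balanced_model.predict(X_demo)
print(balanced_model.classes_)
```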

In [23]:
model.score(df_train[features], df_train['homologado'])

0.9977624084621644

In [24]:
from sklearn.model_selection import train_test_split

train, valid = train_test_split(df_train, random_state=42)

In [26]:
model.fit(train[features], train['homologado'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [27]:
train_preds = model.predict(train[features])
valid_preds = model.predict(valid[features])

In [29]:
from sklearn.metrics import accuracy_score

(accuracy_score(train['homologado'], train_preds), accuracy_score(valid['homologado'], valid_preds))

(0.9981014374830486, 0.983726606997559)

In [30]:
test_preds = model.predict(df_test[features])

In [32]:
accuracy_score(df_test['homologado'], test_preds)

0.9853479853479854

In [33]:
test_preds

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [35]:
np.array(df_test['homologado'])

array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

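### First evaluation

Although training did not overfit and the final *score* was high (98.5%), the last two outputs make it clear that the model is predicting everything as ratified (`homologado = 1`). The test accuracy of 98.5% is exactly the share of ratified procedures in the test set (100% - 1.47%), which is what a classifier that always answers 1 would obtain. In other words, the model strongly favors the ratified class; I believe this is due to the class imbalance.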
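Accuracy hides this failure, so a confusion matrix and per-class recall are more informative here. The sketch below rebuilds the test labels from the numbers reported earlier (546 test rows, 1.465% of them with `homologado == 0`, i.e. 8 rows). It assumes, as the visible output and the accuracy value suggest, that the model predicted 1 for every test row. `balanced_accuracy_score` requires scikit-learn 0.20 or later, which may be newer than the version used in this notebook.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    recall_score,
)

# Reconstructed from the numbers above: 546 test rows, 8 with homologado == 0
y_true = np.r_[np.zeros(8, dtype=int), np.ones(538, dtype=int)]
# Assumption: the model answered 1 (ratified) for every test row
y_pred = np.ones_like(y_true)

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
acc = accuracy_score(y_true, y_pred)
# Recall of the minority class: fraction of non-ratified rows that were caught
recall_not_ratified = recall_score(y_true, y_pred, pos_label=0)
# Mean of the per-class recalls
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(cm)
print(acc, recall_not_ratified, bal_acc)
```

Under that assumption the accuracy matches the 0.985 reported above, while the recall for the non-ratified class is 0 and the balanced accuracy is 0.5, no better than chance.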