# Delinquency Classifier

## Proposal

Build a neural network that classifies the current delinquency status, and the projected delinquency, of a real estate development. To do so, the network looks for correlations in the data from which a delinquency category can be derived.



## Libraries Used

We use TensorFlow to build the predictive model, plus a few other libraries that help split the dataset used to train and validate the model.



In [1]:
!pip install --quiet scikit-learn

In [2]:
# Colab-only magic that selects TensorFlow 2.x; omit it outside Google Colab.
%tensorflow_version 2.x

In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split

## Data Source

Load the dataset and split it into a training set and a validation set. The split used here holds out *25%* of the rows for validation (the `train_test_split` default) and keeps *75%* for training. The attribute the model is meant to predict, here **p_dias_atraso_category**, must be removed from the features so that the model is not trained on its own target.



In [4]:
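DATA_URL = 'https://raw.githubusercontent.com/afonsir/cyrela-etl/main/data/parsed-data.csv'

# test_size=0.25 is the train_test_split default, written out here for clarity.
df_train, df_test = train_test_split(pd.read_csv(DATA_URL), test_size=0.25)

# Separate the target column from the features.
y_train = df_train.pop('p_dias_atraso_category')
y_test = df_test.pop('p_dias_atraso_category')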

## Data Preprocessing

The proposed model accepts only numeric inputs (a requirement of how it is trained), so categorical data must first be converted to numeric values. We suggest doing this preprocessing in an **Apache Spark** application, for its performance and scalability.
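
As a rough illustration of the categorical-to-numeric step, the sketch below uses plain Python rather than Spark. The integer coding (sorted labels, starting at 1) and the divide-by-maximum scaling are assumptions inferred from the `p_` columns in the dataset, not the project's confirmed pipeline. `encode_labels` and `scale_by_max` are hypothetical helper names, and the input values are illustrative.

```python
def encode_labels(values):
    """Map each distinct label to an integer code (sorted order, starting at 1)."""
    codes = {label: i for i, label in enumerate(sorted(set(values)), start=1)}
    return [codes[v] for v in values]


def scale_by_max(values):
    """Divide every value by the column maximum (assumes positive values)."""
    top = max(values)
    return [v / top for v in values]


# Illustrative inputs: a categorical column and a monetary column.
marca = ["LIVING", "CYRELA", "CYRELA"]
vgv = [50000.0, 10250000.0, 20500000.0]

p_marca = encode_labels(marca)  # assumed coding: CYRELA -> 1, LIVING -> 2
p_vgv = scale_by_max(vgv)       # the largest value maps to 1.0

print(p_marca, [round(v, 6) for v in p_vgv])
```

The same two transformations map directly onto column-wise operations in Spark or pandas.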


In [5]:
df_train.head()

Unnamed: 0,empresa,p_empresa,marca,p_marca,obra,p_obra,bloco,p_bloco,unidade,p_unidade,dt_venda,p_dt_venda_day,p_dt_venda_month,p_dt_venda_year,dt_chaves,p_dt_chaves_day,p_dt_chaves_month,p_dt_chaves_year,carteira_sd_gerencial,p_carteira_sd_gerencial,saldo_devedor,p_saldo_devedor,p_data_base_day,p_data_base_month,p_data_base_year,dias_atraso,p_dias_atraso,valor_pago_atualizado,p_valor_pago_atualizado,valor_pago,p_valor_pago,vgv,p_vgv
2805,168,0.053097,LIVING,2,4718,0.475126,1,1.0,203,0.088108,2020-09-29,0.966667,0.75,1.01,2023-06-01,0.033333,0.5,1.0115,313637,0.024126,313637.49,0.024126,1.0,0.333333,1.0105,-823,0.710708,406358.11,0.036531,379533.5,0.034503,693170.99,0.033813
2046,661,0.208913,CYRELA,1,2339,0.235549,1,1.0,1305,0.566406,2020-02-29,0.966667,0.166667,1.01,2022-12-01,0.033333,1.0,1.011,190869,0.014682,190868.57,0.014682,1.0,0.333333,1.0105,-30,0.025907,117418.63,0.010556,108164.92,0.009833,299033.49,0.014587
2123,737,0.232933,CYRELA,1,7761,0.781571,1,1.0,58,0.025174,2019-03-30,1.0,0.25,1.0095,2022-06-01,0.033333,0.5,1.011,851966,0.065536,851965.56,0.065536,1.0,0.333333,1.0105,-29,0.025043,236023.62,0.021218,211029.91,0.019185,1062995.47,0.051853
1853,2067,0.653287,CYRELA,1,8312,0.837059,1,1.0,506,0.219618,2020-09-30,1.0,0.75,1.01,2023-07-01,0.033333,0.583333,1.0115,164842,0.01268,164841.78,0.01268,1.0,0.333333,1.0105,-20,0.017271,211962.24,0.019055,196900.4,0.0179,361742.18,0.017646
1217,1859,0.587547,CYRELA,1,3259,0.328197,1,1.0,92,0.039931,2020-11-24,0.8,0.916667,1.01,2023-10-01,0.033333,0.833333,1.0115,562785,0.043291,562785.12,0.043291,1.0,0.333333,1.0105,-945,0.816062,329573.1,0.029628,309008.0,0.028092,871793.12,0.042526


In [6]:
df_train.describe()

Unnamed: 0,empresa,p_empresa,p_marca,obra,p_obra,bloco,p_bloco,unidade,p_unidade,p_dt_venda_day,p_dt_venda_month,p_dt_venda_year,p_dt_chaves_day,p_dt_chaves_month,p_dt_chaves_year,carteira_sd_gerencial,p_carteira_sd_gerencial,saldo_devedor,p_saldo_devedor,p_data_base_day,p_data_base_month,p_data_base_year,dias_atraso,p_dias_atraso,valor_pago_atualizado,p_valor_pago_atualizado,valor_pago,p_valor_pago,vgv,p_vgv
count,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0,2248.0
mean,1792.344751,0.566481,1.686833,7170.653915,0.72212,1.0,1.0,428.220196,0.185859,0.786877,0.59931,1.009744,0.03333333,0.578218,1.011114,638284.6,0.049099,641116.0,0.049317,1.0,0.3333333,1.0105,-78.161922,0.067497,314054.7,0.028233,281995.6,0.025636,923111.6,0.04503
std,845.660775,0.267276,0.752841,2431.498928,0.244864,0.0,0.0,491.714023,0.213418,0.2526,0.296702,0.000519,7.495673e-16,0.272753,0.000418,783695.7,0.060284,777462.2,0.059805,0.0,6.829391e-15,2.176521e-14,153.090383,0.132202,526606.9,0.047341,453717.1,0.041247,1041685.0,0.050814
min,164.0,0.051833,1.0,2339.0,0.235549,1.0,1.0,1.0,0.000434,0.1,0.083333,1.0045,0.03333333,0.083333,1.006,216.0,1.7e-05,216.08,1.7e-05,1.0,0.3333333,1.0105,-1158.0,0.0,0.0,0.0,0.0,0.0,50000.0,0.002439
25%,1166.0,0.368521,1.0,7470.0,0.752266,1.0,1.0,102.0,0.044271,0.6,0.333333,1.0095,0.03333333,0.416667,1.011,223087.8,0.017161,224386.7,0.017261,1.0,0.3333333,1.0105,-52.0,0.017271,62085.92,0.005581,57970.75,0.00527,368242.9,0.017963
50%,2043.5,0.64586,2.0,7848.0,0.790332,1.0,1.0,191.0,0.082899,0.9,0.666667,1.01,0.03333333,0.583333,1.011,408213.0,0.031401,410062.4,0.031543,1.0,0.3333333,1.0105,-32.0,0.027634,160037.7,0.014387,147499.7,0.013409,596170.5,0.029081
75%,2154.0,0.680784,2.0,8722.0,0.878348,1.0,1.0,618.25,0.268338,1.0,0.916667,1.01,0.03333333,0.833333,1.0115,771016.2,0.059309,776434.8,0.059726,1.0,0.3333333,1.0105,-20.0,0.044905,380057.7,0.034166,342712.8,0.031156,1206142.0,0.058836
max,3164.0,1.0,3.0,9930.0,1.0,1.0,1.0,2221.0,0.963976,1.033333,1.0,1.0105,0.03333333,1.0,1.012,13000000.0,1.0,13000000.0,1.0,1.0,0.3333333,1.0105,0.0,1.0,11000000.0,0.988875,11000000.0,1.0,20500000.0,1.0


## Delinquency Categories

Given the available data, we chose the following categories:

1.   **Up to 30 days late** (*983* training records).
2.   **Over 30 and up to 90 days late** (*884* training records).
3.   **More than 90 days late** (*381* training records).
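
The categories above can be sketched as a simple bucketing function. This is a minimal illustration, not the project's actual labeling code. It assumes that the class ids `0`, `1` and `2` follow the order of the list, and that the sign of `dias_atraso` can be ignored because the dataset stores days late as negative numbers. `delay_category` is a hypothetical name.

```python
def delay_category(dias_atraso):
    """Return 0, 1 or 2 for a (possibly negative) number of days late."""
    days_late = abs(dias_atraso)  # assumption: only the magnitude matters
    if days_late <= 30:
        return 0  # up to 30 days late
    if days_late <= 90:
        return 1  # over 30 and up to 90 days late
    return 2      # more than 90 days late


sample = [-823, -30, -29, -20, -945, -31, -90, 0]
print([delay_category(d) for d in sample])
```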



In [7]:
y_train.value_counts()

0    983
1    884
2    381
Name: p_dias_atraso_category, dtype: int64

In [8]:
def input_fn(features, labels, training=True, batch_size=256):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle and repeat if you are in training mode.
    if training:
        dataset = dataset.shuffle(1000).repeat()
    
    return dataset.batch(batch_size)

## Features Used for Training

*   *p_empresa*
*   *p_marca*
*   *p_obra*
*   *p_unidade*
*   *p_saldo_devedor*
*   *p_valor_pago*
*   *p_vgv*
*   *p_dt_venda_year*
*   *p_dt_chaves_year*

In total, **9 features** are used to train the model.


In [9]:
NUMERIC_COLUMNS = ['p_empresa', 'p_marca', 'p_obra', 'p_unidade',
                   'p_saldo_devedor', 'p_valor_pago', 'p_vgv',
                   'p_dt_venda_year', 'p_dt_chaves_year']

# One numeric feature column per input attribute.
feature_columns = [
    tf.feature_column.numeric_column(feature_name, dtype=tf.float32)
    for feature_name in NUMERIC_COLUMNS
]

print(feature_columns)

## Classifier Definition

After some experimentation, the best-performing model parameters were:

*   Two hidden layers (9 units each).
*   Adam optimizer.
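
To give a sense of how small this architecture is, the sketch below counts its trainable parameters. It assumes plain fully connected layers with biases, 9 numeric inputs (the features listed earlier) and 3 output classes. `dense_param_count` is a hypothetical helper, not part of TensorFlow.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 9 inputs -> 9 hidden units -> 9 hidden units -> 3 classes
layer_sizes = [9, 9, 9, 3]
print(dense_param_count(layer_sizes))
```

Under these assumptions the network has 210 trainable parameters, which is small relative to the 2,248 training records.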



In [10]:
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    optimizer='Adam',
    hidden_units=[9, 9],
    n_classes=3)

## Training

The best trade-off between training time and loss was reached at **10,000** training steps. It is important not to overtrain the model (how much is too much depends on the volume of data available), in order to avoid *overfitting*.

In [11]:
classifier.train(
    input_fn=lambda: input_fn(df_train, y_train, training=True),
    steps=10000)
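
Because `input_fn` repeats the training data indefinitely, the `steps` argument alone decides how many passes the model makes over it. The arithmetic below is a small sketch using the batch size of 256 from `input_fn` and the 2,248 training records reported by `describe()`. `epochs_seen` and `steps_for_epochs` are hypothetical helper names, and the 50-epoch target is an arbitrary example rather than a recommendation.

```python
import math


def epochs_seen(steps, batch_size, n_records):
    """Approximate number of full passes over the training data."""
    return steps * batch_size / n_records


def steps_for_epochs(epochs, batch_size, n_records):
    """Smallest number of steps that covers the requested number of passes."""
    return math.ceil(epochs * n_records / batch_size)


print(round(epochs_seen(10_000, 256, 2248), 1))
print(steps_for_epochs(50, 256, 2248))
```

Under these figures, 10,000 steps amount to roughly 1,139 passes over the training set, which is worth keeping in mind when weighing the overfitting risk.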

## Model Validation

The held-out portion of the data (*25%* under the `train_test_split` default), which was not used in training and was therefore never seen by the model, is used to measure accuracy.

In [12]:
eval_result = classifier.evaluate(
    input_fn=lambda: input_fn(df_test, y_test, training=False))

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2021-06-22T02:07:35
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp8jb98ssg/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.28780s
INFO:tensorflow:Finished evaluation at 2021-06-22-02:07:35
INFO:tensorflow:Saving dict for global step 10000: accuracy = 0.48, average_loss = 0.94754946, global_step = 10000, loss = 0.94734925
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 10000: /tmp/tmp8jb98ssg/model.ckpt-10000

Test set accuracy: 0.480
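
To put the reported accuracy and loss in context, the sketch below computes two naive baselines from the training label counts printed by `y_train.value_counts()`. The label counts of the test split are not shown, so using the training distribution as a stand-in is an assumption, as is reading `average_loss` as a mean cross-entropy in nats.

```python
import math

# Training label counts from y_train.value_counts().
train_counts = {0: 983, 1: 884, 2: 381}
total = sum(train_counts.values())
priors = {label: n / total for label, n in train_counts.items()}

# Accuracy of always predicting the most frequent class.
majority_accuracy = max(priors.values())

# Cross-entropy of always predicting the class priors, and of a uniform guess.
prior_loss = -sum(p * math.log(p) for p in priors.values())
uniform_loss = math.log(len(priors))

print(f"majority-class accuracy: {majority_accuracy:.3f}")
print(f"cross-entropy of class priors: {prior_loss:.3f}")
print(f"cross-entropy of uniform guess: {uniform_loss:.3f}")
```

On this distribution the majority-class baseline is about 0.437 accuracy and the prior-only loss is about 1.03, so the reported 0.48 accuracy and 0.948 loss are only slightly better than predicting without any features.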



## Conclusion

Across several runs, the model's overall accuracy ranged between **40% and 50%**, which we consider very low. We suggest increasing the volume of data (to around 10,000 records) and broadening the scope, that is, cross-referencing data from additional sources. This should make correlations easier to detect and, in turn, improve the model's accuracy.

---

