<img src = "https://images2.imgbox.com/a5/72/7ZbDUHlf_o.jpg" width="200">

# Module - Machine Learning I
---

## Project - Analysis of the "Crédito Imóveis" Dataset

### Instructions

Using the [project dataset](https://drive.google.com/file/d/17fyteuN2MdGdbP5_Xq_sySN_yH91vTup/view?usp=sharing), build models with a Decision Tree and KNN to identify whether a person will repay their loan or default on it, applying the preprocessing each model requires. Use the evaluation methodology of your choice, but note that your model will be evaluated on a separate held-out set. What conclusions can you draw from the model?

- Choose only 5 variables from the 100+ available
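
One way to set up the task described above is sketched below. It is only an illustration on synthetic data: the five features come from `make_classification` and do not correspond to any variables in the project dataset, and the hyperparameters (`max_depth=4`, `n_neighbors=15`) and the ROC AUC metric are arbitrary assumptions. The point it shows is that KNN is distance-based and so is wrapped in a scaling step, while the decision tree is fit on the raw features.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project data: 5 features, roughly 8% positives.
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.92], random_state=42,
)

# Stratify so that both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

models = {
    # Trees split on thresholds, so no scaling step is needed.
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    # KNN compares distances, so features are standardized first.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = roc_auc_score(y_test, proba)
    print(f"{name}: ROC AUC = {scores[name]:.3f}")
```

Putting the scaler inside the pipeline means it is fit only on the training split, which avoids leaking information from the held-out rows.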

---

### Team

- Mariana de Cassia Soares Nunes Cunha 
- Deborah Soares Cardoso
- Luiz Henrique Simioni Machado
- Eden de Oliveira Santana
- Luiz Gabriel de Souza

---

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from IPython.display import HTML, display

## Functions

In [7]:
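## Function to widen the notebook and set the display options

def jupyter_settings():
    # IPython magics: render plots inline (%pylab also loads numpy and matplotlib names)
    %matplotlib inline
    %pylab inline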
    plt.style.use('bmh')
    plt.rcParams['figure.figsize'] = [8,5]
    plt.rcParams['font.size'] = 24
    display(HTML('<style>.container {width:80% !important;}</style>'))
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option('display.expand_frame_repr', False)
    sns.set()

jupyter_settings()

Populating the interactive namespace from numpy and matplotlib


In [None]:
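## Function to replace outliers with the column median

def outlier_detect(df):
    # Applies the 1.5 * IQR rule to every numeric column.
    # NaN values are left untouched; df is modified in place and returned.
    for col in df.select_dtypes(include=np.number).columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)

        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr

        # Values outside the fences are replaced by the column median
        is_outlier = (df[col] < lower) | (df[col] > upper)
        df[col] = df[col].mask(is_outlier, df[col].median())
    return df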
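
To make the 1.5 * IQR rule concrete, here is a small worked sketch on a single toy column. The values are hypothetical and chosen so that the fences are easy to check by hand; it is not taken from the project data.

```python
import pandas as pd

# Toy column (hypothetical values) with one obvious extreme value.
s = pd.Series([1, 2, 3, 4, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)    # 2.0 and 4.0
iqr = q3 - q1                                   # 2.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # -1.0 and 7.0

# Values outside the fences are replaced by the median of the column.
is_outlier = (s < lower) | (s > upper)
cleaned = s.mask(is_outlier, s.median())

print(cleaned.tolist())
```

Only the value 100 falls outside the fences, so it is replaced by the median, 3. Note that the median is computed with the outlier still present, which is the same behaviour as the function above.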

---
## 1. Crédito Imóveis Dataset

### 1.1 Importing the data

In [2]:
credito_imoveis_df = pd.read_csv('dados/application_train.csv')

In [3]:
# The dataset has 246008 observations and 122 variables
credito_imoveis_df.shape

(246008, 122)

### 1.2 Information about the data

In [4]:
# First 5 observations of the dataset
credito_imoveis_df.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,456162,0,Cash loans,F,N,N,0,112500.0,700830.0,22738.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,134978,0,Cash loans,F,N,N,0,90000.0,375322.5,14422.5,...,0,0,0,0,0.0,0.0,0.0,1.0,0.0,3.0
2,318952,0,Cash loans,M,Y,N,0,180000.0,544491.0,16047.0,...,0,0,0,0,0.0,0.0,0.0,1.0,1.0,3.0
3,361264,0,Cash loans,F,N,Y,0,270000.0,814041.0,28971.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
4,260639,0,Cash loans,F,N,Y,0,144000.0,675000.0,21906.0,...,0,0,0,0,0.0,0.0,0.0,10.0,0.0,0.0


In [5]:
# Last 5 observations of the dataset
credito_imoveis_df.tail()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
246003,242114,0,Cash loans,F,N,Y,1,270000.0,1172470.5,34411.5,...,0,0,0,0,0.0,0.0,0.0,1.0,0.0,8.0
246004,452374,0,Cash loans,F,N,Y,0,180000.0,654498.0,27859.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
246005,276545,1,Revolving loans,M,N,N,1,112500.0,270000.0,13500.0,...,0,0,0,0,,,,,,
246006,236776,1,Cash loans,M,Y,N,3,202500.0,204858.0,17653.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
246007,454197,0,Cash loans,F,N,Y,2,81000.0,547344.0,23139.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,5.0


In [9]:
# Percentage of null values per column
(credito_imoveis_df.isnull().mean() * 100).round(2)

SK_ID_CURR                       0.00
TARGET                           0.00
NAME_CONTRACT_TYPE               0.00
CODE_GENDER                      0.00
FLAG_OWN_CAR                     0.00
FLAG_OWN_REALTY                  0.00
CNT_CHILDREN                     0.00
AMT_INCOME_TOTAL                 0.00
AMT_CREDIT                       0.00
AMT_ANNUITY                      0.00
AMT_GOODS_PRICE                  0.09
NAME_TYPE_SUITE                  0.43
NAME_INCOME_TYPE                 0.00
NAME_EDUCATION_TYPE              0.00
NAME_FAMILY_STATUS               0.00
NAME_HOUSING_TYPE                0.00
REGION_POPULATION_RELATIVE       0.00
DAYS_BIRTH                       0.00
DAYS_EMPLOYED                    0.00
DAYS_REGISTRATION                0.00
DAYS_ID_PUBLISH                  0.00
OWN_CAR_AGE                     66.00
FLAG_MOBIL                       0.00
FLAG_EMP_PHONE                   0.00
FLAG_WORK_PHONE                  0.00
FLAG_CONT_MOBILE                 0.00
FLAG_PHONE  
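
The listing above shows `OWN_CAR_AGE` with 66% of its values missing. One possible follow-up, sketched here on a toy frame, is to rank the columns by missing percentage and flag those above a cut-off. The 50% threshold and the toy values are assumptions for illustration, not a choice made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values); only the column names come from the dataset.
df = pd.DataFrame({
    "AMT_CREDIT": [700830.0, 375322.5, 544491.0, 814041.0],
    "OWN_CAR_AGE": [np.nan, np.nan, 12.0, np.nan],
    "AMT_GOODS_PRICE": [585000.0, np.nan, 450000.0, 679500.0],
})

# Percentage of missing values per column, highest first.
null_pct = (df.isnull().mean() * 100).round(2).sort_values(ascending=False)

# Assumed cut-off: flag columns with more than half of their values missing.
threshold = 50.0
high_missing = null_pct[null_pct > threshold].index.tolist()

print(null_pct)
print(high_missing)
```

Columns flagged this way are candidates to leave out when picking the 5 variables, since imputing most of a column adds little real information.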

In [13]:
# Summary statistics of the quantitative variables
credito_imoveis_df.describe(percentiles = [.25, .5, .75, .95, .99]).round(2).T.style.background_gradient(cmap='OrRd')

Unnamed: 0,count,mean,std,min,25%,50%,75%,95%,99%,max
SK_ID_CURR,246008.0,278280.07,102790.91,100002.0,189165.5,278392.5,367272.25,438466.65,452695.86,456255.0
TARGET,246008.0,0.08,0.27,0.0,0.0,0.0,0.0,1.0,1.0,1.0
CNT_CHILDREN,246008.0,0.42,0.72,0.0,0.0,0.0,1.0,2.0,3.0,19.0
AMT_INCOME_TOTAL,246008.0,168912.16,260381.83,25650.0,112500.0,148500.0,202500.0,337500.0,472500.0,117000000.0
AMT_CREDIT,246008.0,599628.31,403067.18,45000.0,270000.0,514777.5,808650.0,1350000.0,1870677.0,4050000.0
AMT_ANNUITY,245998.0,27129.16,14504.97,1615.5,16561.12,24930.0,34599.38,53329.5,69962.17,258025.5
AMT_GOODS_PRICE,245782.0,538928.93,369973.84,40500.0,238500.0,450000.0,679500.0,1305000.0,1800000.0,4050000.0
REGION_POPULATION_RELATIVE,246008.0,0.02,0.01,0.0,0.01,0.02,0.03,0.05,0.07,0.07
DAYS_BIRTH,246008.0,-16042.79,4365.97,-25229.0,-19691.0,-15763.0,-12418.0,-9413.0,-8264.0,-7489.0
DAYS_EMPLOYED,246008.0,63963.76,141400.32,-17912.0,-2758.0,-1215.0,-289.0,365243.0,365243.0,365243.0
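
Two things stand out in the summary above: `DAYS_BIRTH` is always negative, and `DAYS_EMPLOYED` equals 365243 (about 1000 years) from the 95th percentile up to the maximum, which looks like a placeholder rather than a real duration. The sketch below shows one way these columns could be made easier to read. It assumes that the `DAYS_*` columns count days backwards from the application date and that 365243 marks a missing value; the column names `AGE_YEARS` and `YEARS_EMPLOYED` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) using two column names from the dataset.
df = pd.DataFrame({
    "DAYS_BIRTH": [-16042, -9413, -25229],
    "DAYS_EMPLOYED": [-1215, 365243, -289],
})

# Assumption: DAYS_BIRTH counts days backwards from the application date,
# so flipping the sign and dividing by 365.25 gives an approximate age.
df["AGE_YEARS"] = (-df["DAYS_BIRTH"] / 365.25).round(1)

# Assumption: 365243 is a missing-value marker. Replace it with NaN and
# express the remaining values as positive years of employment.
employed = df["DAYS_EMPLOYED"].replace(365243, np.nan)
df["YEARS_EMPLOYED"] = (-employed / 365.25).round(1)

print(df)
```

Leaving the placeholder in would distort distance-based models such as KNN, since a single value of 365243 dwarfs every real duration in the column.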


---
## 2. Data Treatment and Cleaning

### 2.1

In [None]:
# Creating a copy of the dataset
credito_imoveis_df_copy = credito_imoveis_df.copy()

---
## 3. Exploratory Data Analysis

### 3.1 Frequency distribution of the quantitative variables

### 3.2 Bivariate analysis

---
## 4. Applying the Machine Learning Models

### 4.1 Decision Tree

### 4.2 KNN

---
## 5. Conclusions

In [8]:
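# Strong correlations between variables (to be removed later)

thresh = 0.7

# Absolute correlation matrix (numeric columns only), flattened to one row per pair
df_corr = credito_imoveis_df.corr(numeric_only=True).abs().unstack()

# Keep pairs above the threshold (values are already absolute)
df_corr_filt = df_corr[df_corr > thresh].reset_index()

# Drop the self-correlations on the diagonal
df_corr_filt[df_corr_filt.level_0 != df_corr_filt.level_1]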

Unnamed: 0,level_0,level_1,0
3,CNT_CHILDREN,CNT_FAM_MEMBERS,0.87865
6,AMT_CREDIT,AMT_ANNUITY,0.769821
7,AMT_CREDIT,AMT_GOODS_PRICE,0.987024
8,AMT_ANNUITY,AMT_CREDIT,0.769821
10,AMT_ANNUITY,AMT_GOODS_PRICE,0.7749
11,AMT_GOODS_PRICE,AMT_CREDIT,0.987024
12,AMT_GOODS_PRICE,AMT_ANNUITY,0.7749
17,DAYS_EMPLOYED,FLAG_EMP_PHONE,0.99975
22,FLAG_EMP_PHONE,DAYS_EMPLOYED,0.99975
28,CNT_FAM_MEMBERS,CNT_CHILDREN,0.87865
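
In the listing above every pair appears twice, once as (A, B) and once as (B, A). A possible way to list each pair only once is to keep just the upper triangle of the correlation matrix before filtering. The sketch below uses a toy frame with hypothetical columns `a`, `b`, and `c`; it is an alternative illustration, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): "b" is almost a multiple of "a", "c" is unrelated.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

thresh = 0.7
corr = df.corr().abs()

# Keep only the cells strictly above the diagonal, so each pair appears once
# and the self-correlations are dropped (all other cells become NaN).
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to one row per pair and discard the NaN cells.
pairs = upper.unstack().dropna()
strong_pairs = pairs[pairs > thresh]

print(strong_pairs)
```

With this toy data only the pair formed by `a` and `b` passes the threshold. When two variables are this strongly correlated, keeping both among the 5 selected features adds little new information.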


---