## Context

A credit card company has provided a dataset containing demographic data and recent financial data (covering the last 6 months) for a sample of 30,000 account holders.

Each row represents one credit account and is labeled according to whether the account holder defaulted in the month following the six-month historical period.

## Objective

Develop a model that predicts whether an account will default in the next month, based on demographic and historical data.

Data source: UCI Machine Learning Repository, https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients

In [1]:
import pandas as pd

In [6]:
df = pd.read_excel('default_of_credit_card_clients__courseware.xls')

In [7]:
df.shape

(30000, 25)
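
The 25 columns are consistent with the data dictionary that follows: one ID column, the 23 explanatory variables, and the response variable. As a minimal sketch, the expected column names can be rebuilt from that dictionary; the helper name `expected_columns` is hypothetical, and the comparison against `df` is left as a comment because it needs the spreadsheet.

```python
def expected_columns():
    """Column names implied by the data dictionary in this notebook."""
    demographic = ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
    pay_status = [f"PAY_{i}" for i in range(1, 7)]
    bill_amounts = [f"BILL_AMT{i}" for i in range(1, 7)]
    pay_amounts = [f"PAY_AMT{i}" for i in range(1, 7)]
    explanatory = demographic + pay_status + bill_amounts + pay_amounts
    return ["ID"] + explanatory + ["default payment next month"]


columns = expected_columns()
print(len(columns))  # 1 ID + 23 explanatory + 1 response

# With the spreadsheet loaded as `df`, the check would be:
# assert list(df.columns) == columns
```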

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

- LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- SEX: Gender (1 = male; 2 = female).
- EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).
- AGE: Age (year).

In [18]:
df.iloc[0:3, 0:6]

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE
0,798fc410-45c1,20000,2,2,1,24
1,8a8c8f3b-8eb4,120000,2,2,2,26
2,85698822-43f5,90000,2,2,2,34


PAY_1 - PAY_6: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: 
- PAY_1 = the repayment status in September, 2005; 
- PAY_2 = the repayment status in August, 2005;
- . . .;
- PAY_6 = the repayment status in April, 2005.

  The measurement scale for the repayment status is: 
  - -1 = pay duly; 
  - 1 = payment delay for one month; 
  - 2 = payment delay for two months; 
  - . . .; 
  - 8 = payment delay for eight months; 
  - 9 = payment delay for nine months and above.

In [19]:
df.iloc[0:3, 6:12]

Unnamed: 0,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6
0,2,2,-1,-1,-2,-2
1,-1,2,0,0,0,2
2,0,0,0,0,0,0
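
The sample rows above contain the values -2 and 0, which the measurement scale does not list. A minimal sketch of how such undocumented codes could be detected, using only the three rows shown (rebuilt by hand here so the block is self-contained); on the full data the same check would run against `df` instead of `sample`.

```python
import pandas as pd

# The three sample rows displayed above, typed in by hand.
sample = pd.DataFrame(
    {
        "PAY_1": [2, -1, 0],
        "PAY_2": [2, 2, 0],
        "PAY_3": [-1, 0, 0],
        "PAY_4": [-1, 0, 0],
        "PAY_5": [-2, 0, 0],
        "PAY_6": [-2, 2, 0],
    }
)

# Codes listed in the measurement scale: -1 (pay duly) and 1 through 9.
documented = {-1, *range(1, 10)}

pay_cols = [f"PAY_{i}" for i in range(1, 7)]
observed = set(sample[pay_cols].to_numpy().ravel().tolist())
undocumented = sorted(observed - documented)
print(undocumented)
```

What -2 and 0 mean is not stated in this data dictionary, so they would need to be clarified with the data provider before modeling.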


BILL_AMT1-BILL_AMT6: Amount of bill statement (NT dollar). 
  - BILL_AMT1 = amount of bill statement in September, 2005; 
  - BILL_AMT2 = amount of bill statement in August, 2005; 
  - . . .; 
  - BILL_AMT6 = amount of bill statement in April, 2005.

In [17]:
df.iloc[0:3, 12:18]

Unnamed: 0,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6
0,3913,3102,689,0,0,0
1,2682,1725,2682,3272,3455,3261
2,29239,14027,13559,14331,14948,15549


PAY_AMT1-PAY_AMT6: Amount of previous payment (NT dollar). 
- PAY_AMT1 = amount paid in September, 2005; 
- PAY_AMT2 = amount paid in August, 2005;
- . . .;
- PAY_AMT6 = amount paid in April, 2005.
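
The PAY_, BILL_AMT and PAY_AMT columns share one suffix convention: 1 is September 2005 and 6 is April 2005. The sketch below spells out the whole mapping, assuming the elided suffixes 2 to 5 step back one month at a time, which the "April to September" range implies but the dictionary does not state row by row. The name `suffix_to_month` is illustrative.

```python
import calendar

# Suffix 1 is September (month 9) and suffix 6 is April (month 4),
# so suffix i corresponds to calendar month 10 - i.
suffix_to_month = {i: calendar.month_name[10 - i] for i in range(1, 7)}

for suffix, month in suffix_to_month.items():
    print(f"PAY_{suffix}, BILL_AMT{suffix}, PAY_AMT{suffix} -> {month} 2005")
```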

In [21]:
df.iloc[0:3, 18:24]

Unnamed: 0,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,0,689,0,0,0,0
1,0,1000,1000,1000,0,2000
2,1518,1500,1000,1000,1000,5000


- default payment next month: the response variable, indicating whether the account defaulted (Yes = 1, No = 0).

In [22]:
df.iloc[0:3, 24:25]

Unnamed: 0,default payment next month
0,1
1,1
2,0
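
Because the response is coded 0/1, its mean is the share of defaulting accounts, which is worth checking before modeling. A minimal sketch on the three rows shown above (typed in by hand); the resulting figure says nothing about the full 30,000 accounts, where the same two calls would be applied to `df`.

```python
import pandas as pd

target = "default payment next month"

# The three sample labels displayed above.
sample = pd.DataFrame({target: [1, 1, 0]})

counts = sample[target].value_counts().sort_index()  # rows per class
default_rate = sample[target].mean()                 # share of rows labeled 1

print(counts.to_dict())
print(round(default_rate, 3))
```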
