### The Business Problem

Our client is a credit card company. They have brought us a dataset with demographics and recent financial data (the past six months) for a sample of 30,000 of their account holders. The data is at the credit account level: there is one row per account (always clarify what a row represents in a dataset). Each row is labeled by whether the account owner defaulted, that is, failed to make the minimum payment, in the month following the six-month historical period.

### Goal
Your goal is to develop a predictive model for whether an account will default next month, given demographics and historical data.


### Data Exploration Steps 

1. How many columns are there in the data? These may be features, response, or metadata.
2. How many rows (samples)?
3. What kind of features are there? Which are categorical and which are numerical?
   Categorical features take values in discrete classes such as "Yes," "No," or "Maybe." Numerical features are typically on a continuous numerical scale, such as dollar amounts.
4. What does the data look like in these features? To see this, you can examine the range of values in numerical features, or the frequency of the different classes in categorical features.
5. Is there any missing data?
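
The five questions above map onto a handful of pandas calls. The following is a minimal sketch, not the notebook's own code: the `toy` DataFrame and its values are invented for illustration, and only the column names `ID`, `LIMIT_BAL`, and `EDUCATION` are borrowed from the client data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the client data (values invented for illustration)
toy = pd.DataFrame({
    "ID": ["a1", "b2", "c3", "d4"],
    "LIMIT_BAL": [20000, 120000, 90000, None],   # one missing value on purpose
    "EDUCATION": [2, 2, 1, 3],
})

n_rows, n_cols = toy.shape                         # steps 1 and 2: rows and columns
column_types = toy.dtypes                          # step 3: storage type of each column
numeric_summary = toy["LIMIT_BAL"].describe()      # step 4: range of a numerical feature
class_counts = toy["EDUCATION"].value_counts()     # step 4: frequency of each class
missing_per_column = toy.isnull().sum()            # step 5: missing values per column

print(n_rows, n_cols)
print(missing_per_column)
```

Note that a column's dtype alone does not say whether it is categorical: `EDUCATION` is stored as integers here, yet its values are class codes, so the data dictionary still has to be consulted.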

In [3]:
# import libraries
import pandas as pd

In [4]:
df = pd.read_excel('../data/default_of_credit_card_clients.xls')

In [5]:
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,798fc410-45c1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,8a8c8f3b-8eb4,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,85698822-43f5,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,0737c11b-be42,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,3b7f77cc-dbc0,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


### Verifying Basic Data Integrity

In [8]:
# examining column names
df.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

### Metadata
1. ID: The account ID column.
2. LIMIT_BAL: Amount of credit provided (in New Taiwan (NT) dollars), including individual consumer credit and family (supplementary) credit.
3. SEX: Gender (1 = male; 2 = female)
4. EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
5. MARRIAGE: Marital status (1 = married; 2 = single; 3 = others). 
6. AGE: Age (year). 
7. PAY_1–PAY_6: A record of past payments. The monthly repayment status, recorded from April to September, is stored in these columns. 
8. PAY_1 represents the repayment status in September; PAY_2 represents the repayment status in August; and so on up to PAY_6, which represents the repayment status in April. The measurement scale for the repayment status is as follows: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; and so on up to 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
9. BILL_AMT1–BILL_AMT6: Bill statement amount (in NT dollars). 
10. BILL_AMT1 represents the bill statement amount in September; BILL_AMT2 represents the bill statement amount in August; and so on up to BILL_AMT6, which represents the bill statement amount in April. 
11. PAY_AMT1–PAY_AMT6: Amount of previous payment (in NT dollars). PAY_AMT1 represents the amount paid in September; PAY_AMT2 represents the amount paid in August; and so on up to PAY_AMT6, which represents the amount paid in April.
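
A data dictionary like this one can be checked against the data itself. The sketch below is one possible approach, not part of the original notebook: the helper `undocumented_values` is a hypothetical name, and `sample` holds only the `SEX`, `EDUCATION`, `MARRIAGE`, and `PAY_1` values of the five rows displayed by `df.head()` above. It lists any value that the documented codes do not cover.

```python
import pandas as pd

# Documented codes, copied from the metadata list above
documented = {
    "SEX": {1, 2},
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
    "PAY_1": {-1, 1, 2, 3, 4, 5, 6, 7, 8, 9},
}

# The five rows shown by df.head(), restricted to the coded columns
sample = pd.DataFrame({
    "SEX": [2, 2, 2, 2, 1],
    "EDUCATION": [2, 2, 2, 2, 2],
    "MARRIAGE": [1, 2, 2, 1, 1],
    "PAY_1": [2, -1, 0, 0, -1],
})

def undocumented_values(frame, allowed):
    """Return, per column, the sorted values that the data dictionary does not list."""
    return {
        col: sorted(set(frame[col].tolist()) - codes)
        for col, codes in allowed.items()
    }

unexpected = undocumented_values(sample, documented)
print(unexpected)
```

Even on these five rows the check flags `PAY_1`, because the value 0 appears in the data but not in the documented repayment scale. A finding like this is worth raising with the client before modeling.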

In [13]:
# select the ID column and count its unique values

df['ID'].nunique()

29687

In [14]:
# How many rows and columns?
df.shape

# number of unique IDs is less than the number of rows. This implies that the ID is not a unique identifier

(30000, 25)

In [15]:
# Store the value counts in a variable defined as id_counts
id_counts = df['ID'].value_counts()

In [17]:
# display the stored counts (pandas truncates the long Series)
id_counts

d6697da8-74fc    2
bbffebbc-e3c4    2
cb18af1f-3b53    2
f9bcd13e-96bc    2
27e04c06-487f    2
                ..
19606a1f-f7df    1
81541f18-d2aa    1
97087c62-7f9f    1
dbbe09bc-8baa    1
ccf7d4ae-6e83    1
Name: ID, Length: 29687, dtype: int64
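
The counts above show that some IDs occur twice, which accounts for the gap between 30,000 rows and 29,687 unique IDs (313 more rows than IDs). A possible way to take this further is sketched below on a hypothetical `toy` DataFrame with invented values; `occurrences`, `dupe_ids`, and `dupe_rows` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# Hypothetical data, invented for illustration; the ID "a1" appears twice
toy = pd.DataFrame({
    "ID": ["a1", "b2", "a1", "c3"],
    "LIMIT_BAL": [20000, 50000, 20000, 90000],
})

id_counts = toy["ID"].value_counts()

# How many IDs occur once, twice, and so on
occurrences = id_counts.value_counts()

# Boolean mask on the counts selects the IDs that occur more than once
dupe_ids = id_counts[id_counts > 1].index.tolist()

# Pull the full rows for those IDs so they can be inspected side by side
dupe_rows = toy[toy["ID"].isin(dupe_ids)]

print(occurrences)
print(dupe_rows)
```

Inspecting the duplicated rows side by side helps decide whether they are exact copies or conflicting records for the same account.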