# Home Credit Default Risk
In this notebook I explore the datasets provided for the Home Credit Default Risk Kaggle challenge. It covers the following learning objectives:
- Working with structured data
- Encoding categorical variables
- Handling missing values

Some of this notebook follows the helpful Kaggle kernel created by Will Koehrsen, hosted [here](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction).

Our task is to create a model that predicts an applicant's risk of default based on the provided datasets.

In [1]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

%matplotlib inline

In [2]:
import glob 
path_to_data = '/home/dmriser/data/kaggle/home-credit/'

for datafile in glob.glob(path_to_data + '*.csv'):
    print(datafile)

/home/dmriser/data/kaggle/home-credit/installments_payments.csv
/home/dmriser/data/kaggle/home-credit/application_test.csv
/home/dmriser/data/kaggle/home-credit/sample_submission.csv
/home/dmriser/data/kaggle/home-credit/HomeCredit_columns_description.csv
/home/dmriser/data/kaggle/home-credit/credit_card_balance.csv
/home/dmriser/data/kaggle/home-credit/application_train.csv
/home/dmriser/data/kaggle/home-credit/bureau.csv
/home/dmriser/data/kaggle/home-credit/POS_CASH_balance.csv
/home/dmriser/data/kaggle/home-credit/bureau_balance.csv
/home/dmriser/data/kaggle/home-credit/previous_application.csv


### Load Data 
The first thing I want to do is take a look at the structure of each dataset provided.  I will check the size of each dataset, the number of missing values, and the number of categorical features to be encoded or otherwise dealt with.  

In [3]:
def check_structure(data):
    """Print shape, top missing-value percentages, dtype counts, and categorical cardinalities."""
    
    # Check shape
    print('Dataset has shape (samples, features): ', data.shape)
    
    # Check missing values (percentage of total)
    missing_values = data.isnull().sum() / len(data) * 100.0
    missing_values.sort_values(ascending=False, inplace=True)
    
    print('\nDataset missing values: ')
    print(missing_values.head(12))
    
    # Check type of variables 
    print('\nDataset value types: ')
    print(data.dtypes.value_counts())
    
    # List categorical features and their number of unique values
    print('\nDataset categorical summary: ')
    print(data.select_dtypes('object').apply(pd.Series.nunique, axis=0))
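
To make the output of `check_structure` easier to read, the same two calculations can be run by hand on a tiny frame. This is only a sketch: the `toy` frame and its values are made up for illustration and are not part of the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column with gaps, one categorical column.
toy = pd.DataFrame({
    'AMT': [1.0, np.nan, 3.0, np.nan],
    'KIND': pd.Series(['a', 'b', 'a', 'a'], dtype=object),
})

# Percentage of missing entries per column, largest first.
missing_pct = (toy.isnull().sum() / len(toy) * 100.0).sort_values(ascending=False)

# Number of distinct values in each object (string) column.
n_unique = toy.select_dtypes('object').nunique()

print(missing_pct)
print(n_unique)
```

Here half of `AMT` is missing, so it reports 50%, and `KIND` has two distinct levels.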

In [18]:
app_train = pd.read_csv(path_to_data + 'application_train.csv')
installment_df = pd.read_csv(path_to_data + 'installments_payments.csv')
credit_card_df = pd.read_csv(path_to_data + 'credit_card_balance.csv')
bureau_df = pd.read_csv(path_to_data + 'bureau.csv')
cash_df = pd.read_csv(path_to_data + 'POS_CASH_balance.csv')
bureau_balance_df = pd.read_csv(path_to_data + 'bureau_balance.csv')
prev_app_df = pd.read_csv(path_to_data + 'previous_application.csv')
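
Some of these files are large, so holding all seven frames in memory at once can be heavy. One option, not used in this notebook, is to downcast numeric columns after loading. The helper name `downcast_numeric` and the toy frame below are hypothetical, and downcasting floats to `float32` trades precision for memory, so treat this as a sketch rather than a recommendation.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy of df with numeric columns stored in smaller dtypes where possible."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        # Smallest signed integer type that can hold the values.
        out[col] = pd.to_numeric(out[col], downcast='integer')
    for col in out.select_dtypes(include=[np.floating]).columns:
        # float32 at the smallest; this loses precision for large or fine-grained values.
        out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# Hypothetical toy frame standing in for one of the loaded datasets.
toy = pd.DataFrame({
    'a': np.array([1, 2, 3], dtype='int64'),
    'b': np.array([0.5, 1.5, 2.5], dtype='float64'),
})
small = downcast_numeric(toy)
print(small.dtypes)
```

The same call could be applied to any of the frames above, for example `downcast_numeric(bureau_balance_df)`.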

### Application Data 
This is the main dataset, with one row per loan application. It contains 122 columns, one of which is the target (`TARGET`). Many columns have missing values (the worst are about 70% empty), and 16 columns are categorical.

In [19]:
check_structure(app_train)

('Dataset has shape (samples, features): ', (307511, 122))

Dataset missing values: 
COMMONAREA_MEDI             69.872297
COMMONAREA_AVG              69.872297
COMMONAREA_MODE             69.872297
NONLIVINGAPARTMENTS_MODE    69.432963
NONLIVINGAPARTMENTS_MEDI    69.432963
NONLIVINGAPARTMENTS_AVG     69.432963
FONDKAPREMONT_MODE          68.386172
LIVINGAPARTMENTS_MEDI       68.354953
LIVINGAPARTMENTS_MODE       68.354953
LIVINGAPARTMENTS_AVG        68.354953
FLOORSMIN_MEDI              67.848630
FLOORSMIN_MODE              67.848630
dtype: float64

Dataset value types: 
float64    65
int64      41
object     16
dtype: int64

Dataset categorical summary: 
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKD

In [20]:
app_train.head()

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
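
The categorical summary above shows a mix of two-level columns (such as `FLAG_OWN_CAR`) and columns with many levels (such as `OCCUPATION_TYPE`). One common way to encode such a mix is to map two-level columns to integer codes and one-hot encode the rest. The sketch below assumes that approach; the helper name `encode_categoricals` and the toy values are hypothetical.

```python
import pandas as pd

def encode_categoricals(df):
    """Integer-code object columns with at most two levels; one-hot encode the rest."""
    out = df.copy()
    for col in out.select_dtypes('object').columns:
        if out[col].nunique() <= 2:
            # factorize assigns 0, 1, ... in order of first appearance (missing -> -1).
            out[col] = pd.factorize(out[col])[0]
    # Any remaining object columns become one indicator column per level.
    return pd.get_dummies(out)

# Hypothetical toy frame; column names are borrowed from app_train, values are made up.
toy = pd.DataFrame({
    'FLAG_OWN_CAR': ['N', 'Y', 'N'],
    'OCCUPATION_TYPE': ['A', 'B', 'C'],
    'AMT_CREDIT': [1.0, 2.0, 3.0],
}).astype({'FLAG_OWN_CAR': object, 'OCCUPATION_TYPE': object})

encoded = encode_categoricals(toy)
print(encoded.columns.tolist())
```

One-hot encoding adds a column per level, so the feature count grows quickly for high-cardinality columns. Any encoding fitted on the training set would also need to be applied consistently to the test set.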


### Installment Data 
This dataset provides information on previous Home Credit payments. Each payment made or missed is a unique row. The structure is simple: 8 features, no categorical variables, and very few missing values (about 0.02% in two columns).

In [21]:
check_structure(installment_df)

('Dataset has shape (samples, features): ', (13605401, 8))

Dataset missing values: 
AMT_PAYMENT               0.021352
DAYS_ENTRY_PAYMENT        0.021352
AMT_INSTALMENT            0.000000
DAYS_INSTALMENT           0.000000
NUM_INSTALMENT_NUMBER     0.000000
NUM_INSTALMENT_VERSION    0.000000
SK_ID_CURR                0.000000
SK_ID_PREV                0.000000
dtype: float64

Dataset value types: 
float64    5
int64      3
dtype: int64

Dataset categorical summary: 
Series([], dtype: float64)
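
Because each installment is its own row, this table has to be summarised per client before it can be joined to the application data. The sketch below shows one possible aggregation on made-up rows. It assumes `DAYS_INSTALMENT` is the due day and `DAYS_ENTRY_PAYMENT` the day the payment was actually made, and the derived names (`DAYS_LATE`, `AMT_SHORT`, `per_client`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows using column names from installments_payments.csv.
toy_inst = pd.DataFrame({
    'SK_ID_CURR':         [1, 1, 2],
    'DAYS_INSTALMENT':    [-30.0, -60.0, -10.0],
    'DAYS_ENTRY_PAYMENT': [-28.0, -65.0, np.nan],
    'AMT_INSTALMENT':     [100.0, 100.0, 50.0],
    'AMT_PAYMENT':        [100.0, 80.0, np.nan],
})

# Positive values mean the payment arrived after its due day (under the assumption above).
toy_inst['DAYS_LATE'] = toy_inst['DAYS_ENTRY_PAYMENT'] - toy_inst['DAYS_INSTALMENT']
# Positive values mean less was paid than was due.
toy_inst['AMT_SHORT'] = toy_inst['AMT_INSTALMENT'] - toy_inst['AMT_PAYMENT']

# One row per client: number of installments, worst lateness, total shortfall.
per_client = toy_inst.groupby('SK_ID_CURR').agg(
    n_installments=('AMT_INSTALMENT', 'size'),
    max_days_late=('DAYS_LATE', 'max'),
    total_short=('AMT_SHORT', 'sum'),
)
print(per_client)
```

Rows with a missing payment produce `NaN` lateness, so how to treat them (missed payment versus unknown) is a separate decision.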


### Credit Card Data
Information on credit cards previously held with Home Credit. Contains 23 features, one of which is categorical. Nine columns have missing data, ranging from about 8% to 20% of rows.

In [22]:
check_structure(credit_card_df)

('Dataset has shape (samples, features): ', (3840312, 23))

Dataset missing values: 
AMT_PAYMENT_CURRENT           19.998063
AMT_DRAWINGS_OTHER_CURRENT    19.524872
CNT_DRAWINGS_POS_CURRENT      19.524872
CNT_DRAWINGS_OTHER_CURRENT    19.524872
CNT_DRAWINGS_ATM_CURRENT      19.524872
AMT_DRAWINGS_ATM_CURRENT      19.524872
AMT_DRAWINGS_POS_CURRENT      19.524872
CNT_INSTALMENT_MATURE_CUM      7.948208
AMT_INST_MIN_REGULARITY        7.948208
SK_DPD_DEF                     0.000000
SK_ID_CURR                     0.000000
MONTHS_BALANCE                 0.000000
dtype: float64

Dataset value types: 
float64    15
int64       7
object      1
dtype: int64

Dataset categorical summary: 
NAME_CONTRACT_STATUS    7
dtype: int64
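
With up to a fifth of the rows missing in some columns, dropping incomplete rows would discard a lot of data. A simple alternative is to fill gaps with the column median and keep a flag recording where the gap was. This is a sketch of that idea only: the helper name `impute_median_with_flag` and the toy values are hypothetical, and whether a missing value here means "no activity" or "unknown" would need checking first.

```python
import numpy as np
import pandas as pd

def impute_median_with_flag(df, columns):
    """Fill NaNs in the given columns with the column median and add a 0/1 missing flag."""
    out = df.copy()
    for col in columns:
        # Record where the value was missing before it is overwritten.
        out[col + '_MISSING'] = out[col].isnull().astype(int)
        out[col] = out[col].fillna(out[col].median())
    return out

# Hypothetical toy column named after one of the sparse credit card fields.
toy_cc = pd.DataFrame({'AMT_PAYMENT_CURRENT': [10.0, np.nan, 30.0, 50.0]})
filled = impute_median_with_flag(toy_cc, ['AMT_PAYMENT_CURRENT'])
print(filled)
```

In a modelling workflow the median should come from the training data only and then be reused on the test data.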


### Bureau Data 
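Information about the client's credit standing with other financial institutions. This dataset contains 17 features, three of which are categorical. Missing data is concentrated in a handful of fields: four are missing in more than a third of rows (up to about 71% for `AMT_ANNUITY`), and `AMT_CREDIT_SUM_DEBT` is missing in about 15%.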

In [23]:
check_structure(bureau_df)

('Dataset has shape (samples, features): ', (1716428, 17))

Dataset missing values: 
AMT_ANNUITY               71.473490
AMT_CREDIT_MAX_OVERDUE    65.513264
DAYS_ENDDATE_FACT         36.916958
AMT_CREDIT_SUM_LIMIT      34.477415
AMT_CREDIT_SUM_DEBT       15.011932
DAYS_CREDIT_ENDDATE        6.149573
AMT_CREDIT_SUM             0.000757
CREDIT_TYPE                0.000000
AMT_CREDIT_SUM_OVERDUE     0.000000
CNT_CREDIT_PROLONG         0.000000
DAYS_CREDIT_UPDATE         0.000000
CREDIT_DAY_OVERDUE         0.000000
dtype: float64

Dataset value types: 
float64    8
int64      6
object     3
dtype: int64

Dataset categorical summary: 
CREDIT_ACTIVE       4
CREDIT_CURRENCY     4
CREDIT_TYPE        15
dtype: int64
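
For columns that are mostly empty, one option is to drop them above some missing-value threshold. The sketch below assumes a cutoff of 60%, which is an arbitrary choice for illustration; the helper name `drop_sparse_columns` and the toy values are hypothetical. Sparse columns can still carry signal, so this is not always the right call.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_pct=60.0):
    """Keep only columns whose share of missing values is at or below the threshold."""
    missing_pct = df.isnull().mean() * 100.0
    keep = missing_pct[missing_pct <= max_missing_pct].index
    return df[keep]

# Hypothetical toy frame using column names from bureau.csv.
toy_bureau = pd.DataFrame({
    'AMT_ANNUITY':         [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    'AMT_CREDIT_SUM_DEBT': [1.0, np.nan, 2.0, 3.0],         # 25% missing
    'CREDIT_DAY_OVERDUE':  [0, 0, 5, 0],                    # complete
})
trimmed = drop_sparse_columns(toy_bureau, max_missing_pct=60.0)
print(trimmed.columns.tolist())
```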


### Cash Dataset 
Point-of-sale and cash loan information. This dataset contains eight features, one of which is categorical, and almost no missing entries (under 0.3% in two columns).

In [24]:
check_structure(cash_df)

('Dataset has shape (samples, features): ', (10001358, 8))

Dataset missing values: 
CNT_INSTALMENT_FUTURE    0.260835
CNT_INSTALMENT           0.260675
SK_DPD_DEF               0.000000
SK_DPD                   0.000000
NAME_CONTRACT_STATUS     0.000000
MONTHS_BALANCE           0.000000
SK_ID_CURR               0.000000
SK_ID_PREV               0.000000
dtype: float64

Dataset value types: 
int64      5
float64    2
object     1
dtype: int64

Dataset categorical summary: 
NAME_CONTRACT_STATUS    9
dtype: int64
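
The single categorical column here, `NAME_CONTRACT_STATUS`, can be turned into per-client counts with a cross-tabulation, which gives one column per status level. This is a minimal sketch on made-up rows; the status labels in the toy frame are placeholders chosen for illustration.

```python
import pandas as pd

# Hypothetical toy rows using column names from POS_CASH_balance.csv.
toy_cash = pd.DataFrame({
    'SK_ID_CURR':           [1, 1, 1, 2],
    'NAME_CONTRACT_STATUS': ['Active', 'Active', 'Completed', 'Active'],
})

# Rows are clients, columns are status levels, cells are row counts.
status_counts = pd.crosstab(toy_cash['SK_ID_CURR'], toy_cash['NAME_CONTRACT_STATUS'])
print(status_counts)
```

Passing `normalize='index'` to `pd.crosstab` would give shares per client instead of raw counts.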


### Bureau Balance Data 
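Monthly balance history for the credits reported in the bureau data, keyed by `SK_ID_BUREAU`. The table is narrow (three columns, one of them categorical) but it has the most rows of any file here, about 27.3 million, and no missing values.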

In [25]:
check_structure(bureau_balance_df)

('Dataset has shape (samples, features): ', (27299925, 3))

Dataset missing values: 
STATUS            0.0
MONTHS_BALANCE    0.0
SK_ID_BUREAU      0.0
dtype: float64

Dataset value types: 
int64     2
object    1
dtype: int64

Dataset categorical summary: 
STATUS    8
dtype: int64
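
This table has no `SK_ID_CURR` column, so it can only reach the application data by going through the bureau table. The sketch below assumes, as in the competition's column descriptions, that `bureau.csv` carries both `SK_ID_CURR` and `SK_ID_BUREAU`. The toy rows and the derived name `N_MONTHS` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: two bureau credits for client 1, one for client 2.
toy_bureau = pd.DataFrame({
    'SK_ID_CURR':   [1, 1, 2],
    'SK_ID_BUREAU': [10, 11, 12],
})
toy_balance = pd.DataFrame({
    'SK_ID_BUREAU':   [10, 10, 11, 12],
    'MONTHS_BALANCE': [0, -1, 0, 0],
    'STATUS':         ['0', '1', 'C', 'X'],
})

# Step 1: summarise the monthly rows per bureau credit.
months_per_credit = (toy_balance.groupby('SK_ID_BUREAU')
                     .size()
                     .rename('N_MONTHS')
                     .reset_index())

# Step 2: attach the summary to the bureau table, then roll up per client.
linked = toy_bureau.merge(months_per_credit, on='SK_ID_BUREAU', how='left')
months_per_client = linked.groupby('SK_ID_CURR')['N_MONTHS'].sum()
print(months_per_client)
```

Aggregating before merging keeps the intermediate frame at one row per bureau credit instead of one row per month.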


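### Previous Application Data 
Previous applications for Home Credit loans. This dataset contains 37 features, 16 of which are categorical. Missing data is heavy in places: two interest-rate columns are empty in more than 99% of rows, and several others are missing in 40% to 54% of rows.

In [26]: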
check_structure(prev_app_df)

('Dataset has shape (samples, features): ', (1670214, 37))

Dataset missing values: 
RATE_INTEREST_PRIVILEGED     99.643698
RATE_INTEREST_PRIMARY        99.643698
RATE_DOWN_PAYMENT            53.636480
AMT_DOWN_PAYMENT             53.636480
NAME_TYPE_SUITE              49.119754
DAYS_TERMINATION             40.298129
NFLAG_INSURED_ON_APPROVAL    40.298129
DAYS_FIRST_DRAWING           40.298129
DAYS_FIRST_DUE               40.298129
DAYS_LAST_DUE_1ST_VERSION    40.298129
DAYS_LAST_DUE                40.298129
AMT_GOODS_PRICE              23.081773
dtype: float64

Dataset value types: 
object     16
float64    15
int64       6
dtype: int64

Dataset categorical summary: 
NAME_CONTRACT_TYPE              4
WEEKDAY_APPR_PROCESS_START      7
FLAG_LAST_APPL_PER_CONTRACT     2
NAME_CASH_LOAN_PURPOSE         25
NAME_CONTRACT_STATUS            4
NAME_PAYMENT_TYPE               4
CODE_REJECT_REASON              9
NAME_TYPE_SUITE                 7
NAME_CLIENT_TYPE                4
NAME_GOODS_CATEGO