# Home Credit Default Risk
In this notebook I explore the datasets provided for the Home Credit Default Risk Kaggle challenge. I will cover the following learning objectives:
- Working with structured data
- Encoding of categorical variables 
- Handling missing values

Some of this notebook follows the helpful Kaggle kernel created by Will Koehrsen, hosted [here](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction).

Our task is to create a model that predicts an applicant's risk of default based on the provided datasets.

In [1]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

%matplotlib inline

In [2]:
import glob 
path_to_data = '/home/dmriser/data/kaggle/home-credit/'

for datafile in glob.glob(path_to_data + '*.csv'):
    print(datafile)

/home/dmriser/data/kaggle/home-credit/installments_payments.csv
/home/dmriser/data/kaggle/home-credit/application_test.csv
/home/dmriser/data/kaggle/home-credit/sample_submission.csv
/home/dmriser/data/kaggle/home-credit/HomeCredit_columns_description.csv
/home/dmriser/data/kaggle/home-credit/credit_card_balance.csv
/home/dmriser/data/kaggle/home-credit/application_train.csv
/home/dmriser/data/kaggle/home-credit/bureau.csv
/home/dmriser/data/kaggle/home-credit/POS_CASH_balance.csv
/home/dmriser/data/kaggle/home-credit/bureau_balance.csv
/home/dmriser/data/kaggle/home-credit/previous_application.csv


In [3]:
app_df = pd.read_csv(path_to_data + 'application_train.csv')
app_df.head(10)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
5,100008,0,Cash loans,M,N,Y,0,99000.0,490495.5,27517.5,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0
6,100009,0,Cash loans,F,Y,Y,1,171000.0,1560726.0,41301.0,...,0,0,0,0,0.0,0.0,0.0,1.0,1.0,2.0
7,100010,0,Cash loans,M,Y,Y,0,360000.0,1530000.0,42075.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
8,100011,0,Cash loans,F,N,Y,0,112500.0,1019610.0,33826.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
9,100012,0,Revolving loans,M,N,Y,0,135000.0,405000.0,20250.0,...,0,0,0,0,,,,,,


Having loaded the data, I can ask some very basic questions. 
- How much data does the application train file contain?
- How many missing values are present in the file? 
- In the provided training data, what is the probability of default? 

In [4]:
print('Loaded application training data with shape: {}'.format(app_df.shape))

Loaded application training data with shape: (307511, 122)


In [7]:
app_missing_values = app_df.isnull().sum() / len(app_df)
app_missing_values.sort_values(ascending=False, inplace=True)
app_missing_values.head(12)

COMMONAREA_MEDI             0.698723
COMMONAREA_AVG              0.698723
COMMONAREA_MODE             0.698723
NONLIVINGAPARTMENTS_MODE    0.694330
NONLIVINGAPARTMENTS_MEDI    0.694330
NONLIVINGAPARTMENTS_AVG     0.694330
FONDKAPREMONT_MODE          0.683862
LIVINGAPARTMENTS_MEDI       0.683550
LIVINGAPARTMENTS_MODE       0.683550
LIVINGAPARTMENTS_AVG        0.683550
FLOORSMIN_MEDI              0.678486
FLOORSMIN_MODE              0.678486
dtype: float64
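
The fractions above suggest two simple ways of handling missing values: drop columns that are mostly empty, or fill the gaps with a summary statistic. Below is a minimal sketch on a toy frame. The column names, the 0.6 threshold, and the median fill are illustrative assumptions, not choices made elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for app_df; column names and values are made up.
toy_df = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'some_missing': [1.0, np.nan, 3.0, 5.0],
    'complete': [10, 20, 30, 40],
})

# Fraction of missing values per column (same quantity as isnull().sum() / len(df)).
missing_fraction = toy_df.isnull().mean()

# Assumed threshold: keep only columns with at most 60% missing values.
threshold = 0.6
columns_to_keep = missing_fraction[missing_fraction <= threshold].index
reduced_df = toy_df[columns_to_keep]

# Fill the remaining gaps in each numeric column with that column's median.
imputed_df = reduced_df.fillna(reduced_df.median(numeric_only=True))

print(missing_fraction)
print(imputed_df)
```

Whether dropping or imputing is the better choice depends on the model used later; some tree-based libraries accept missing values directly, in which case neither step may be needed.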

In [8]:
app_df.TARGET.value_counts()

0    282686
1     24825
Name: TARGET, dtype: int64
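
The counts above can be turned into the default rate asked about earlier. The sketch below rebuilds a `TARGET` column from the two printed counts so that it runs on its own; with the real data, the same `value_counts(normalize=True)` call on `app_df.TARGET` should give the same fractions.

```python
import numpy as np
import pandas as pd

# Rebuild a TARGET column from the class counts printed above.
target = pd.Series(np.repeat([0, 1], [282686, 24825]), name='TARGET')

# normalize=True returns class fractions instead of raw counts.
class_fractions = target.value_counts(normalize=True)
default_rate = class_fractions.loc[1]

print('Default rate: {:.4f}'.format(default_rate))
```

Roughly 8% of the training applications are labelled as defaults, so the classes are imbalanced and plain accuracy would be a misleading metric for this problem.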

In [11]:
app_df.dtypes.value_counts()

float64    65
int64      41
object     16
dtype: int64

In [13]:
categorical_features = [column for column in app_df.columns if app_df[column].dtype == 'object']
print(categorical_features)

['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']


In [17]:
app_df.select_dtypes('object').apply(pd.Series.nunique, axis=0)

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
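
These counts help decide how to encode each categorical column. One common convention is to integer-encode columns with at most two categories and one-hot encode the rest. The sketch below applies that rule to a toy frame; the rows are illustrative, and the two-category cutoff is an assumption rather than a requirement.

```python
import pandas as pd

# Toy frame standing in for app_df; the values are illustrative only.
toy_df = pd.DataFrame({
    'FLAG_OWN_CAR': ['N', 'Y', 'N', 'Y'],
    'NAME_EDUCATION_TYPE': ['Higher education', 'Secondary',
                            'Secondary', 'Incomplete higher'],
    'AMT_CREDIT': [406597.5, 1293502.5, 135000.0, 312682.5],
})

encoded_df = toy_df.copy()

# Treat every non-numeric column as categorical.
categorical_columns = encoded_df.select_dtypes(exclude='number').columns

# Integer-encode binary columns; codes follow the order of first appearance.
for column in categorical_columns:
    if encoded_df[column].nunique() <= 2:
        codes, _ = pd.factorize(encoded_df[column])
        encoded_df[column] = codes

# One-hot encode the remaining categorical columns.
encoded_df = pd.get_dummies(encoded_df)

print(encoded_df.columns.tolist())
```

Under this rule `CODE_GENDER`, which has three distinct values in the table above, would be one-hot encoded rather than integer-encoded. One-hot encoding `ORGANIZATION_TYPE` alone would add 58 columns, which is worth keeping in mind when the feature count matters.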