# Home Credit Default Risk
In this notebook I explore the datasets provided for the Home Credit Default Risk Kaggle challenge.  I cover the following learning objectives:
- Working with structured data
- Encoding of categorical variables 
- Handling missing values 

Some of this notebook follows the helpful Kaggle kernel created by Will Koehrsen, hosted [here](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction).  The task is to build a model that predicts an applicant's risk of default from the provided datasets.

In [1]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

%matplotlib inline

In [2]:
import glob 
path_to_data = '../data/raw/'

for datafile in glob.glob(path_to_data + '*.csv'):
    print(datafile)

../data/raw/application_test.csv
../data/raw/HomeCredit_columns_description.csv
../data/raw/POS_CASH_balance.csv
../data/raw/credit_card_balance.csv
../data/raw/installments_payments.csv
../data/raw/application_train.csv
../data/raw/bureau.csv
../data/raw/previous_application.csv
../data/raw/bureau_balance.csv
../data/raw/sample_submission.csv


### Load Data 
The first thing I want to do is take a look at the structure of each dataset provided.  I will check the size of each dataset, the number of missing values, and the number of categorical features to be encoded or otherwise dealt with.  

In [None]:
def check_structure(data):
    
    # Check shape
    print('Dataset has shape (samples, features): ', data.shape)
    
    # Check missing values (percentage of total)
    missing_values = data.isnull().sum() / len(data) * 100.0
    missing_values.sort_values(ascending=False, inplace=True)
    
    print('\nDataset missing values: ')
    print(missing_values.head(12))
    
    # Check type of variables 
    print('\nDataset value types: ')
    print(data.dtypes.value_counts())
    
    # List categorical features and their number of unique values
    print('\nDataset categorical summary: ')
    print(data.select_dtypes('object').apply(pd.Series.nunique, axis=0))
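
As a quick sanity check, the same two summaries can be reproduced on a tiny hand-made frame.  This is only a sketch: the `toy` frame and its column names are made up for illustration and are not part of the Home Credit data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column with gaps, one string column.
toy = pd.DataFrame({
    'AMT': [1.0, np.nan, 3.0, np.nan],
    'KIND': pd.Series(['a', 'b', 'a', None], dtype=object),
})

# Percentage of missing entries per column, largest first.
missing_pct = toy.isnull().mean().mul(100.0).sort_values(ascending=False)

# Number of distinct non-missing values in each object (string) column.
n_unique = toy.select_dtypes('object').nunique()

print(missing_pct)
print(n_unique)
```

Taking the mean of the boolean mask is equivalent to dividing the count of missing entries by the number of rows, which is what `check_structure` does.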

In [None]:
app_train = pd.read_csv(path_to_data + 'application_train.csv')
installment_df = pd.read_csv(path_to_data + 'installments_payments.csv')
credit_card_df = pd.read_csv(path_to_data + 'credit_card_balance.csv')
bureau_df = pd.read_csv(path_to_data + 'bureau.csv')
cash_df = pd.read_csv(path_to_data + 'POS_CASH_balance.csv')
bureau_balance_df = pd.read_csv(path_to_data + 'bureau_balance.csv')
prev_app_df = pd.read_csv(path_to_data + 'previous_application.csv')

In [None]:
description = pd.read_csv(path_to_data + 'HomeCredit_columns_description.csv')

### Application Data 
This is the main dataset, with one row per application.  It contains 122 fields, one of which is the target.  Many entries have missing data, and 16 of the features are categorical.

In [None]:
check_structure(app_train)

In [None]:
app_train.head()

### Installment Data 
This dataset provides information on previous Home Credit payments.  Each payment made or missed is a unique row.  The structure of this dataset is simple: it contains 8 features, very few missing values, and no categorical variables.

In [None]:
check_structure(installment_df)

### Credit Card Data
This dataset holds information on credit cards previously held with Home Credit.  It contains 23 features, some missing data, and one categorical variable.

In [None]:
check_structure(credit_card_df)

### Bureau Data 
This dataset holds information about the client's credit standing with other financial institutions.  It contains 17 features, three of which are categorical, and a fair amount of missing data in four or five of the fields.

In [None]:
check_structure(bureau_df)

### Cash Dataset 
Point of sale cash and loans information.  This dataset contains eight features, one of which is categorical, and almost no missing entries. 

In [None]:
check_structure(cash_df)

### Bureau Balance Data 
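This dataset details the monthly status history of the credits listed in the bureau data.  Its rows are keyed by `SK_ID_BUREAU` rather than by the applicant identifier.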

In [None]:
check_structure(bureau_balance_df)

### Previous Application Data 
This dataset contains the applicant's previous applications for Home Credit loans.  It seems to be the second largest, after the application data, and has 16 categorical variables.

In [None]:
check_structure(prev_app_df)

### Aggregate Data 
In the following section, the data are aggregated by the applicant identification number `SK_ID_CURR`.

In [None]:
installment_df.columns

In [None]:
for c in installment_df.columns:
    print(c, description[description.Row == c].Description.values)

In [None]:
# Add features to installment dataset here.
installment_df['PAYMENT_DIFF'] = installment_df.AMT_PAYMENT - installment_df.AMT_INSTALMENT
installment_df['PAYMENT_PERC'] = installment_df.AMT_PAYMENT / installment_df.AMT_INSTALMENT
# Days past due: actual payment day minus scheduled day (positive means late).
installment_df['DPD'] = installment_df.DAYS_ENTRY_PAYMENT - installment_df.DAYS_INSTALMENT

# Define aggregation methods for float/int cols
aggregations = {
    'NUM_INSTALMENT_VERSION': ['mean', 'min', 'max'],
    'NUM_INSTALMENT_NUMBER': ['mean', 'min', 'max'],
    'DAYS_INSTALMENT': ['mean', 'min', 'max'],
    'DAYS_ENTRY_PAYMENT': ['mean', 'min', 'max'],
    'AMT_INSTALMENT': ['mean', 'min', 'max'],
    'AMT_PAYMENT': ['mean', 'min', 'max'],
    'PAYMENT_DIFF': ['min', 'max', 'mean'],
    'PAYMENT_PERC': ['min', 'max', 'mean'],
    'DPD': ['min', 'max', 'mean']
               }

install_agg = installment_df.groupby('SK_ID_CURR').aggregate(aggregations)

colname = lambda x, y: 'INSTALL_' + x + '_' + y.upper() 
new_cols = [colname(c[0],c[1]) for c in list(install_agg.columns)]
install_agg.columns = new_cols
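
Passing a dict of lists to `aggregate` produces two-level column labels of the form (source column, statistic), which is why the columns are renamed afterwards.  A minimal sketch on a made-up frame (the IDs and amounts are hypothetical) shows the flattening step in isolation:

```python
import pandas as pd

# Hypothetical payments for two applicants.
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'AMT_PAYMENT': [100.0, 50.0, 30.0],
})

# A dict of lists yields two-level columns: (source column, statistic).
agg = toy.groupby('SK_ID_CURR').agg({'AMT_PAYMENT': ['min', 'max', 'mean']})

# Join the two levels into a single flat name per column.
agg.columns = ['INSTALL_' + col + '_' + stat.upper() for col, stat in agg.columns]

print(agg)
```

The result has one row per `SK_ID_CURR`, so it can later be joined to the application data on that key.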

In [None]:
install_agg.head()

In [None]:
for c in credit_card_df.columns:
    print(c, description[description.Row == c].Description.values)

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
cat_cols = [credit_card_df.columns[i] for i, c in enumerate(credit_card_df.dtypes) if c == 'object']
print('Categoricals: ', cat_cols)

credit_card_df['BALANCE_PERC'] =  credit_card_df.AMT_BALANCE / credit_card_df.AMT_CREDIT_LIMIT_ACTUAL

# Map the unique strings to unique integers.
for cat in cat_cols: 
    encoder = LabelEncoder()
    credit_card_df[cat] = encoder.fit_transform(credit_card_df[cat])
    
# Add features that make sense here.    
#aggregations = {
#    'AMT_BALANCE': ['min', 'mean', 'max'],
#    'BALANCE_PERC': ['min', 'max', 'mean']
#}

# Do more custom modification.
#credit_card_agg = credit_card_df.groupby('SK_ID_CURR').aggregate(aggregations)

# Do general aggregations.
credit_card_agg = credit_card_df.groupby('SK_ID_CURR').aggregate(['min', 'max', 'sum', 'var', 'mean'])

colname = lambda x, y: 'CC_' + x + '_' + y.upper() 
new_cols = [colname(c[0],c[1]) for c in list(credit_card_agg.columns)]
credit_card_agg.columns = new_cols

In [None]:
credit_card_agg.head()
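
A ratio feature such as `BALANCE_PERC` evaluates to `inf` (or `NaN` for 0/0) wherever the denominator is zero, and an infinite value then distorts `sum`, `mean`, and `var` for that applicant.  The sketch below shows one possible guard, assuming it is acceptable to treat such rows as missing; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical balances; the last two rows have a zero credit limit.
toy = pd.DataFrame({
    'AMT_BALANCE': [50.0, 10.0, 0.0],
    'AMT_CREDIT_LIMIT_ACTUAL': [100.0, 0.0, 0.0],
})

# Plain division gives 0.5, inf, and NaN respectively.
ratio = toy['AMT_BALANCE'] / toy['AMT_CREDIT_LIMIT_ACTUAL']

# Replace infinities with NaN so aggregations skip them like other missing values.
toy['BALANCE_PERC'] = ratio.replace([np.inf, -np.inf], np.nan)

print(toy)
```

The same guard could be applied to `PAYMENT_PERC` in the installment data if `AMT_INSTALMENT` contains zeros.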

In [None]:
for c in bureau_df.columns:
    print(c, description[description.Row == c].Description.values)

In [None]:
cat_cols = [bureau_df.columns[i] for i, c in enumerate(bureau_df.dtypes) if c == 'object']
print('Categoricals: ', cat_cols)

# Change the strings to integers.
for cat in cat_cols: 
    encoder = LabelEncoder()
    bureau_df[cat] = encoder.fit_transform(bureau_df[cat])
    
# Add features that make sense here.    
#aggregations = {
#    'DAYS_CREDIT': ['min', 'mean', 'max']
#}

bureau_agg = bureau_df.groupby('SK_ID_CURR').aggregate(['min', 'max', 'sum', 'mean', 'var'])
colname = lambda x, y: 'BUREAU_' + x + '_' + y.upper() 
new_cols = [colname(c[0],c[1]) for c in list(bureau_agg.columns)]
bureau_agg.columns = new_cols

print('Bureau shape: ', bureau_df.shape)
print('Bureau (agg) shape: ', bureau_agg.shape)
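
Label-encoded integers carry an arbitrary order, so statistics such as `mean` or `var` over them are hard to interpret.  One alternative, which is not what this notebook does, is to one-hot encode before aggregating: the mean of each indicator column is then the share of an applicant's records in that category.  The rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical bureau records for two applicants.
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2],
    'CREDIT_ACTIVE': ['Active', 'Closed', 'Closed', 'Active'],
})

# One indicator column per category, stored as floats so the mean is a fraction.
dummies = pd.get_dummies(toy, columns=['CREDIT_ACTIVE'], dtype=float)

# Share of each applicant's records that fall in each category.
shares = dummies.groupby('SK_ID_CURR').mean()

print(shares)
```

The trade-off is width: every category becomes its own column, which matters for high-cardinality features.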

In [None]:
for c in cash_df.columns:
    print(c, description[description.Row == c].Description.values)

In [None]:
cat_cols = [cash_df.columns[i] for i, c in enumerate(cash_df.dtypes) if c == 'object']
print('Categoricals: ', cat_cols)

# Change the strings to integers.
for cat in cat_cols: 
    encoder = LabelEncoder()
    cash_df[cat] = encoder.fit_transform(cash_df[cat])
    
# Add features that make sense here.    
#aggregations = {
#    'SK_DPD': ['min', 'mean', 'max']
#}

cash_agg = cash_df.groupby('SK_ID_CURR').aggregate(['min', 'max', 'sum', 'var', 'mean'])
colname = lambda x, y: 'CASH_' + x + '_' + y.upper() 
new_cols = [colname(c[0],c[1]) for c in list(cash_agg.columns)]
cash_agg.columns = new_cols

print('Cash shape: ', cash_df.shape)
print('Cash (agg) shape: ', cash_agg.shape)

In [None]:
for c in bureau_balance_df.columns:
    print(c, description[description.Row == c].Description.values)

In [None]:
cat_cols = [bureau_balance_df.columns[i] for i, c in enumerate(bureau_balance_df.dtypes) if c == 'object']
print('Categoricals: ', cat_cols)

# Change the strings to integers.
for cat in cat_cols: 
    encoder = LabelEncoder()
    bureau_balance_df[cat] = encoder.fit_transform(bureau_balance_df[cat])
    
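# bureau_balance has no SK_ID_CURR column, so attach it from bureau via SK_ID_BUREAU.
bb_linked = bureau_balance_df.merge(bureau_df[['SK_ID_BUREAU', 'SK_ID_CURR']], on='SK_ID_BUREAU')

# Add features that make sense here.    
aggregations = {
    'MONTHS_BALANCE': ['min', 'mean', 'max']
}

bureau_balance_agg = bb_linked.groupby('SK_ID_CURR').aggregate(aggregations)
colname = lambda x, y: 'BB_' + x + '_' + y.upper() 
new_cols = [colname(c[0],c[1]) for c in list(bureau_balance_agg.columns)]
bureau_balance_agg.columns = new_cols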

print('Bureau balance shape: ', bureau_balance_df.shape)
print('Bureau balance (agg) shape: ', bureau_balance_agg.shape)
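
Because `bureau_balance.csv` is keyed by `SK_ID_BUREAU`, one way to aggregate it per applicant is to first link each monthly row to its applicant through the bureau table.  The sketch below uses two small made-up frames; only the key column names are taken from the real data.

```python
import pandas as pd

# Hypothetical parent table: one row per bureau credit, linked to an applicant.
bureau = pd.DataFrame({'SK_ID_BUREAU': [10, 11, 12], 'SK_ID_CURR': [1, 1, 2]})

# Hypothetical child table: monthly rows per bureau credit, no applicant ID.
balance = pd.DataFrame({
    'SK_ID_BUREAU': [10, 10, 11, 12],
    'MONTHS_BALANCE': [-1, -2, -1, -3],
})

# Attach the applicant ID to every monthly row, then aggregate per applicant.
linked = balance.merge(bureau, on='SK_ID_BUREAU', how='inner')
per_applicant = linked.groupby('SK_ID_CURR')['MONTHS_BALANCE'].agg(['min', 'max', 'count'])

print(per_applicant)
```

An inner merge drops monthly rows whose `SK_ID_BUREAU` does not appear in the parent table, which may or may not be desirable.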

In [None]:
for c in prev_app_df.columns:
    print(c, description[description.Row == c].Description.values)

In [None]:
cat_cols = [prev_app_df.columns[i] for i, c in enumerate(prev_app_df.dtypes) if c == 'object']
print('Categoricals: ', cat_cols)

# Change the strings to integers.
for cat in cat_cols: 
    encoder = LabelEncoder()
    prev_app_df[cat] = encoder.fit_transform(prev_app_df[cat])
    
# Add features that make sense here.    
aggregations = {
    'CNT_PAYMENT': ['min', 'mean', 'max']
}

prev_app_agg = prev_app_df.groupby('SK_ID_CURR').aggregate(aggregations)
colname = lambda x, y: 'PREV_' + x + '_' + y.upper() 
new_cols = [colname(c[0],c[1]) for c in list(prev_app_agg.columns)]
prev_app_agg.columns = new_cols

print('Previous application shape: ', prev_app_df.shape)
print('Previous application (agg) shape: ', prev_app_agg.shape)
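
Each aggregated table is indexed by `SK_ID_CURR`, so a natural next step is a left join onto the application data.  Applicants with no history in a given table then receive `NaN` in its columns, which ties back to the missing-value objective.  The sketch below is one possible approach on made-up frames, not a step taken from this notebook; the `HAS_PREV_APP` flag is a hypothetical name.

```python
import pandas as pd

# Hypothetical application rows and a hypothetical aggregated table.
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
agg = pd.DataFrame(
    {'PREV_CNT_PAYMENT_MEAN': [12.0, 6.0]},
    index=pd.Index([1, 3], name='SK_ID_CURR'),
)

# Left join on the applicant ID keeps every application row.
merged = app.join(agg, on='SK_ID_CURR')

# Record which applicants had no history before any imputation hides it.
merged['HAS_PREV_APP'] = merged['PREV_CNT_PAYMENT_MEAN'].notna()

print(merged)
```

Keeping an explicit indicator lets a model distinguish "no previous applications" from an imputed value.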