![image](https://www.cncf.io/wp-content/uploads/2018/10/dotscience.svg)

# Home Credit Default Risk

## Data

We are using data provided by [Home Credit](http://www.homecredit.net/about-us.aspx), a service dedicated to providing lines of credit (loans) to the unbanked population.

There are 7 sources of data. We track them all as `ds.input`s:

* application_train/application_test: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row. The training application data comes with the `TARGET` indicating 0: the loan was repaid or 1: the loan was not repaid. 
* bureau: data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
* bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length. 
* previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature `SK_ID_PREV`. 
* POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
* credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
* installments_payments: payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment.

This diagram shows how all of the data is related:

![image](https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png)


## Imports

In [None]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
import os
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# get dotscience
import dotscience as ds




## Start logging with Dotscience

In [None]:
ds.interactive()
ds.start()

## Read in Data 


In [None]:
from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(os.getcwd()) if (isfile(join(os.getcwd(), f)) and f.endswith('csv'))]

In [None]:
# Start tracking files with Dotscience
for file in onlyfiles:
    ds.input(file)

In [None]:
# Training data
app_train = pd.read_csv('application_train.csv')
app_train.rename(columns={'SK_ID_CURR':'Loan_ID'}, inplace=True) 
print('Training data shape: ', app_train.shape)
app_train.head()

# Or, uncomment this to see all the features
# with pd.option_context("display.max_columns", 122):
#    print(app_train.head())

The training data has 307511 observations (each one a separate loan) and 122 features (variables) including the `TARGET` (the label we want to predict).

In [None]:
# Testing data features
app_test = pd.read_csv('application_test.csv')
app_test.rename(columns={'SK_ID_CURR':'Loan_ID'}, inplace=True)
print('Testing data shape: ', app_test.shape)
app_test.head()



# Exploratory Data Analysis

## Examine the Distribution of the Target

0 indicates that the loan was repaid on time. 1 indicates that the client had payment difficulties.

In [None]:
app_train['TARGET'].value_counts()

In [None]:
app_train['TARGET'].astype(int).plot.hist();

## Encode Categorical Variables

For any categorical variable (`dtype == object`) with at most 2 unique categories, we will use label encoding, and for any categorical variable with more than 2 unique categories, we will use one-hot encoding.

For label encoding, we use the Scikit-Learn `LabelEncoder` and for one-hot encoding, the pandas `get_dummies(df)` function.
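
To make the difference between the two encodings concrete, here is a minimal sketch on a toy dataframe. The column names and values are hypothetical and are not taken from the Home Credit files.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, not taken from the Home Credit files)
toy = pd.DataFrame({
    "CONTRACT": ["Cash", "Revolving", "Cash"],       # 2 categories -> label encoding
    "HOUSING": ["Rented", "Own", "With parents"],    # 3 categories -> one-hot encoding
})

# Label encoding: each category is replaced by an integer code
toy["CONTRACT"] = LabelEncoder().fit_transform(toy["CONTRACT"])

# One-hot encoding: each remaining categorical column becomes
# one indicator column per category
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
```

Label encoding keeps a single column, while one-hot encoding adds one column per category, which is why the number of features grows.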

In [None]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2:
            # Train on the training data
            le.fit(app_train[col])
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

In [None]:
# one-hot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

### Aligning Training and Testing Data

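We want to retain the same features (columns) in both the training and testing data. One-hot encoding can produce columns in one dataframe that are missing from the other (for example, a category that appears only in the training data), so we align the two dataframes and keep only the columns they share.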

In [None]:
train_labels = app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

The training and testing datasets now have the same features, which is required for machine learning. The number of features has grown significantly due to one-hot encoding. At some point we will probably want to try [dimensionality reduction (removing features that are not relevant)](https://en.wikipedia.org/wiki/Dimensionality_reduction) to reduce the size of the datasets.
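
The effect of an inner alignment on the columns can be seen on a small example. This is only a sketch: the frames and column names below are hypothetical stand-ins for the encoded training and testing data.

```python
import pandas as pd

# Toy frames (hypothetical column names): each has a column the other lacks
toy_train = pd.DataFrame({"A": [1, 2], "B_x": [1, 0], "B_y": [0, 1], "TARGET": [0, 1]})
toy_test = pd.DataFrame({"A": [3], "B_x": [1], "B_z": [0]})

# Keep the labels aside: TARGET is not in the test frame, so align would drop it
toy_labels = toy_train["TARGET"]

# join="inner" with axis=1 keeps only the columns present in both frames
toy_train, toy_test = toy_train.align(toy_test, join="inner", axis=1)
print(toy_test.columns.tolist())

# Add the labels back to the training frame
toy_train["TARGET"] = toy_labels
print(toy_train.columns.tolist())
```

The rows of each frame are untouched; only the column sets change.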

## Correlations

### Correlation with `TARGET` label

In [None]:
# Find correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))

### Correlation between features
Let's dig into some features that look promising at first glance. We'll see how they are related, and how each one is distributed.

In [None]:
reduced_df = app_train[['OCCUPATION_TYPE_Laborers', 'DAYS_EMPLOYED', 'DAYS_BIRTH', 'DAYS_LAST_PHONE_CHANGE', 'NAME_EDUCATION_TYPE_Secondary / secondary special', 'NAME_EDUCATION_TYPE_Higher education', 'NAME_INCOME_TYPE_Pensioner', 'TARGET',]]

corrs = reduced_df.corr()
# Heatmap of correlations
sns.heatmap(corrs, cmap = plt.cm.PuRd, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap');

In [None]:
fig = plt.figure(figsize = (20,20))
ax = fig.gca()
app_train[['OCCUPATION_TYPE_Laborers', 'DAYS_EMPLOYED', 
           'DAYS_BIRTH', 'DAYS_LAST_PHONE_CHANGE', 
           'NAME_EDUCATION_TYPE_Secondary / secondary special', 
           'NAME_EDUCATION_TYPE_Higher education', 
           'NAME_INCOME_TYPE_Pensioner']].hist(ax = ax, edgecolor = 'k')



### Effect of Age on Repayment

We can dig into one feature in particular...

In [None]:
# Find the correlation of the positive days since birth and target
app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH'])
app_train['DAYS_BIRTH'].corr(app_train['TARGET'])

Age has a negative linear relationship with the target: as clients get older, they tend to repay their loans on time more often.


In [None]:
# plt.style.use('fivethirtyeight')

# Plot the distribution of ages in years
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');

The distribution of age shows that there are no outliers.

In [None]:
plt.figure(figsize = (10, 8))

# KDE plot of loans that were repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'target == 0')

# KDE plot of loans which were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'target == 1')

# Labeling of plot
plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages');

The `TARGET == 1` curve skews towards the younger end of the range. 

In [None]:
# Age information into a separate dataframe
# (copy so that adding columns does not touch app_train)
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10)

# Let's write this out to csv
age_data.to_csv('repayment_by_age_group.csv', index = False)
ds.add_output('repayment_by_age_group.csv')

In [None]:
# Group by the bin and calculate averages
age_groups  = age_data.groupby('YEARS_BINNED').mean()
age_groups

In [None]:
plt.figure(figsize = (8, 8))

# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation = 75); plt.xlabel('Age Group (years)'); plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group');

There is a clear trend: younger applicants are more likely to default on the loan. The rate of failure to repay is above 10% for the youngest three age groups and below 5% for the oldest age group.

This is information that could be directly used by the bank.
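
The binning and averaging used for the age groups can be sketched on a handful of made-up rows. The ages and outcomes below are hypothetical; only the bin edges match the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical ages and outcomes)
toy_age = pd.DataFrame({
    "TARGET": [1, 0, 0, 0, 1, 0],
    "YEARS_BIRTH": [22.0, 24.0, 37.0, 38.0, 41.0, 66.0],
})

# Ten equal-width bins between 20 and 70 years: (20, 25], (25, 30], ...
bin_edges = np.linspace(20, 70, num=11)
toy_age["YEARS_BINNED"] = pd.cut(toy_age["YEARS_BIRTH"], bins=bin_edges)

# The mean of a 0/1 target within a bin is the share of defaults in that bin
default_rate = toy_age.groupby("YEARS_BINNED", observed=True)["TARGET"].mean()
print(default_rate)
```

Because `TARGET` is 0 or 1, multiplying the group mean by 100 gives the percentage plotted in the bar chart.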

# Make the model

We implement a [random forest model with Scikit-Learn](https://scikit-learn.org/stable/modules/ensemble.html#forest). We want to predict probabilities (floats in the range [0, 1]) that unlabeled clients will default on credit repayments.

### Preprocessing
We fill in missing values with the median of their respective columns (imputation) and normalize the feature ranges (scaling).
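
A minimal sketch of both steps on toy arrays follows. The values are hypothetical, and the sketch assumes a scikit-learn release that provides `SimpleImputer` in `sklearn.impute`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy matrices (hypothetical values); np.nan marks a missing entry
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test = np.array([[np.nan, 20.0]])

imputer = SimpleImputer(strategy="median")
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data only, then apply the same transformation to both sets
X_train_prepared = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_prepared = scaler.transform(imputer.transform(X_test))

print(X_train_prepared)
print(X_test_prepared)
```

Fitting on the training data alone means the medians and the min/max ranges are learned without looking at the test set.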

In [None]:
# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Drop the target from the training data
if 'TARGET' in app_train:
    train = app_train.drop(columns = ['TARGET'])
else:
    train = app_train.copy()
    
# Feature names
features = list(train.columns)

# Copy the testing data
test = app_test.copy()

# Median imputation of missing values
imputer = SimpleImputer(strategy = 'median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))

# Fit on the training data
imputer.fit(train)

# Transform both training and testing data
train = imputer.transform(train)
test = imputer.transform(test)

# Repeat with the scaler
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)

## Train the model and get summary metrics
We use a random forest model.

We can parameterise the model with the number of trees to build (`n_estimators`), the `max_features` to use, and [warm_start](https://scikit-learn.org/stable/glossary.html#term-warm-start). Here we record `n_estimators` and `warm_start` with `ds.parameter`, and set `n_jobs` to -1 to use all available cores.
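
To illustrate what `warm_start` does, here is a small sketch on synthetic data. The dataset and the tree counts are assumptions chosen for speed, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the prepared training set
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, warm_start=True,
                                random_state=0, n_jobs=-1)
forest.fit(X, y)
trees_after_first_fit = len(forest.estimators_)

# With warm_start=True, raising n_estimators and refitting adds trees
# to the existing ensemble instead of rebuilding it from scratch
forest.set_params(n_estimators=20)
forest.fit(X, y)
trees_after_second_fit = len(forest.estimators_)

print(trees_after_first_fit, trees_after_second_fit)
```

This can be convenient when re-running a cell with a larger number of trees, since the trees already built are reused.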

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Make the random forest classifier
# NB setting n_jobs to -1 will take advantage of all processors to run jobs in parallel
random_forest = RandomForestClassifier(n_estimators = ds.parameter("n_estimators", 100), random_state = 50, 
                                       verbose = 1, n_jobs = -1, oob_score=True, warm_start = (ds.parameter("warm_start", True)))
#random_forest.fit(train, train_labels)


Use `ds.summary()` to record the model's out-of-bag score (`oob_score_`). For a random forest classifier, this is an accuracy estimate computed for each sample using only the trees that did not see that sample during training. If we had more time to run the model, we could use cross-validation to get a metric value instead.
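
As a rough illustration of how an out-of-bag score relates to the accuracy returned by `.score()` on the training data, here is a sketch on synthetic data. The dataset and hyperparameters are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the prepared training set
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# oob_score=True asks the forest to score each sample with the trees
# whose bootstrap sample did not include it
forest = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
forest.fit(X, y)

oob_accuracy = forest.oob_score_        # estimated on held-out (out-of-bag) samples
train_accuracy = forest.score(X, y)     # mean accuracy on the data used for fitting

print(oob_accuracy, train_accuracy)
```

The training accuracy is usually optimistic because the trees have already seen those rows, so the out-of-bag value tends to be the more realistic of the two.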

In [None]:
# Train on the training data
%time random_forest.fit(train, train_labels)


In [None]:
# Get out of bag score
print(ds.summary("out_of_bag_score", random_forest.oob_score_))

## Make predictions
We want to predict the probability of defaulting, so we use the model's `predict_proba` method. This returns an `m` x 2 array where `m` is the number of test records. The first column is the probability of the target being 0 (indicating no default) and the second column is the probability of the target being 1 (indicating a default), so for a single row, the two columns must sum to 1.

We will output the probability the loan is _not_ repaid, so we select the second column.
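
The shape of the `predict_proba` output and the column selection can be checked on a small synthetic example. The data below is generated for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (labels are 0 and 1)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# One row per record, one column per class, ordered as in forest.classes_
proba = forest.predict_proba(X[:5])

# Second column: probability of class 1
default_probability = proba[:, 1]

print(forest.classes_)
print(proba.sum(axis=1))
```

Checking `classes_` before slicing is a cheap way to confirm which column corresponds to which label.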

In [None]:
# Make predictions on the test data
%time rand_forest_predictions = random_forest.predict_proba(test)[:, 1]

### Output predictions to csv

In [None]:
# Copy so that adding the TARGET column does not touch app_test
predictions_df = app_test[['Loan_ID']].copy()
predictions_df['TARGET'] = rand_forest_predictions

# Save the dataframe as a csv
predictions_df.to_csv('random_forest_predictions.csv', index = False)
ds.add_output('random_forest_predictions.csv')

In [None]:
predictions_df.head()

The predictions show the probability that the loan will not be repaid. The user of the model can choose a threshold beyond which the risk of default is too high, and use this data to make lending decisions.
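
Applying such a threshold might look like the sketch below. Both the probabilities and the cut-off value are hypothetical; a real threshold would depend on the lender's tolerance for risk.

```python
import numpy as np

# Hypothetical predicted default probabilities and a hypothetical cut-off
default_probability = np.array([0.05, 0.40, 0.12, 0.75])
threshold = 0.30

# True where the predicted risk of default exceeds the chosen threshold
too_risky = default_probability > threshold
print(too_risky)
```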

In [None]:
ds.publish()