# Home Credit Default Risk
Home Credit is an international non-bank, consumer finance group. The company operates in 14 countries and focuses on lending primarily to people with little or no credit history.

*Credit* is the trust which allows one party to provide money or resources to another party where that second party does not reimburse the first party immediately (*thereby generating a debt*), but instead promises either to repay or return those resources at a later date. A *credit bureau* is a collection agency that gathers account information from various creditors.

This notebook aims to **predict how capable an applicant is of repaying a loan**, using datasets provided by Home Credit.

[Link](https://www.kaggle.com/c/home-credit-default-risk/data) to the datasets.

## Attributes

A **cash loan** is the lending of money by one or more individuals, organizations, or other entities to other individuals, organizations etc. The recipient (i.e. the borrower) incurs a debt, and is usually liable to pay interest on that debt until it is repaid, and also to repay the principal amount borrowed.  
**Revolving credit/loan** is a type of credit that does not have a fixed number of payments, in contrast to installment credit. Credit cards are an example of revolving credit used by consumers.


In [13]:
# Manipulating data
import numpy as np
import pandas as pd

# Managing files
import os

## The different datasets

* **application_train.csv** and **application_test.csv**
  * The main table in two files. The train set contains __labels/targets__ (0: the loan was *repaid* or 1: the loan was *not* repaid), and the test set does not.
  * Static data for all applications. One row represents one loan.  
  
  
* **bureau.csv**
  * All of the client's previous credits provided by other financial institutions that were reported to the Credit Bureau (loans).
  * For every loan, there are as many rows as number of credits the client had in Credit Bureau before the application date. In other words, each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  
  
* **bureau_balance.csv**
  * Monthly balances of previous credits in Credit Bureau (**bureau.csv**).
  * This table has one row for each month of history of every previous credit reported to Credit Bureau, i.e. the table has (# loans in sample × # of related previous credits × # of months where we have some history observable for the previous credits) rows. In other words, each row is one month of a previous credit, and a single previous credit can have multiple rows, *one for each month* of the credit length.
  
  
* **POS_CASH_balance.csv**
  * Monthly balance snapshots/data of previous **POS (point of sales) and cash loans** that the applicant had with Home Credit.
  * This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample, i.e. the table has (# loans in sample × # of related previous credits × # of months in which we have some history observable for the previous credits) rows.
  
  
* **credit_card_balance.csv**
  * Monthly balance snapshots of previous **credit cards** that the applicant has with Home Credit.
  * This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample, i.e. the table has (# loans in sample × # of related previous credit cards × # of months where we have some history observable for the previous credit card) rows. In other words, each row is one month of a credit card balance, and a single credit card can have many rows.
  
  
* **previous_application.csv**
  * All previous applications for Home Credit loans of *clients who have loans* in our sample.
  * There is one row for each previous application related to loans in our data sample. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV. 
  
  
* **installments_payments.csv** (*A sum of money paid in small parts in a fixed period of time or a single payment within a staged payment plan of a loan*)
  * Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
  * There is a) one row for **every payment that was made** plus b) one row for each **missed payment**.
  * One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
  
  
* **HomeCredit_columns_description.csv**
  * This file contains descriptions for the columns in the various data files.
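
To make the one-to-many relationships above concrete, here is a minimal sketch of how **bureau.csv** rows could be collapsed to one row per current loan and joined onto the main table. It assumes the join key `SK_ID_CURR` (present in the application data) and the bureau key `SK_ID_BUREAU` from the competition's data description. The toy values and the name `previous_credit_count` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for application_train.csv and bureau.csv (values are made up).
applications = pd.DataFrame({"SK_ID_CURR": [100002, 100003, 100004]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100003],
    "SK_ID_BUREAU": [1, 2, 3],  # one row per previous credit
})

# Collapse the one-to-many table to one row per current loan.
previous_credit_counts = (
    bureau.groupby("SK_ID_CURR")["SK_ID_BUREAU"]
    .count()
    .rename("previous_credit_count")  # hypothetical feature name
    .reset_index()
)

# A left join keeps applicants without bureau history; their count becomes 0.
merged = applications.merge(previous_credit_counts, on="SK_ID_CURR", how="left")
merged["previous_credit_count"] = merged["previous_credit_count"].fillna(0).astype(int)
print(merged)
```

The same aggregate-then-merge pattern would likely apply to the monthly tables as well, with one extra aggregation step from months up to previous credits.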

<img src="misc/datasets.png" title="The datasets." />

In [14]:
print(os.listdir("datasets/"))

['credit_card_balance.csv', 'HomeCredit_columns_description.csv', 'installments_payments.csv', 'sample_submission.csv', 'bureau.csv', 'previous_application.csv', 'POS_CASH_balance.csv', 'application_train.csv', 'application_test.csv', 'bureau_balance.csv']


In [15]:
train_set = pd.read_csv('datasets/application_train.csv')
test_set = pd.read_csv('datasets/application_test.csv')

## Exploring the application train and test sets

We explore the data to learn what we have available. We start with the application_train and application_test sets to get a solid grasp of the main data, and hold off on the other tables until we have solid ground to work from.

In [16]:
print("The training set contains the following number of rows and columns", train_set.shape)

The training set contains the following number of rows and columns (307511, 122)


In [17]:
train_set.head()

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
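
Row 3 of the preview above has missing `AMT_REQ_CREDIT_BUREAU_*` values, so it may be worth checking how many values are missing per column. The sketch below does this on a tiny frame holding two columns copied from the five previewed rows. On the real data the same two calls would be applied to `train_set`. The names `preview`, `missing_counts` and `missing_share` are illustrative.

```python
import numpy as np
import pandas as pd

# Two columns copied from the five rows displayed above.
preview = pd.DataFrame({
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5, 513000.0],
    "AMT_REQ_CREDIT_BUREAU_YEAR": [1.0, 0.0, 0.0, np.nan, 0.0],
})

# Count missing values per column, and express them as a share of all rows.
missing_counts = preview.isnull().sum()
missing_share = preview.isnull().mean() * 100

print(pd.DataFrame({"missing": missing_counts, "percent": missing_share}))
```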


In [18]:
list(train_set)

['SK_ID_CURR',
 'TARGET',
 'NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'CNT_CHILDREN',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'AMT_ANNUITY',
 'AMT_GOODS_PRICE',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'REGION_POPULATION_RELATIVE',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_REGISTRATION',
 'DAYS_ID_PUBLISH',
 'OWN_CAR_AGE',
 'FLAG_MOBIL',
 'FLAG_EMP_PHONE',
 'FLAG_WORK_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_PHONE',
 'FLAG_EMAIL',
 'OCCUPATION_TYPE',
 'CNT_FAM_MEMBERS',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'WEEKDAY_APPR_PROCESS_START',
 'HOUR_APPR_PROCESS_START',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY',
 'ORGANIZATION_TYPE',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BEGINEXPLUATATION_A


In [20]:
print("The test set contains the following number of rows and columns", test_set.shape, "- missing the TARGET column")

The test set contains the following number of rows and columns (48744, 121) - missing the TARGET column
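
The one-column difference between the two shapes can be checked directly rather than inferred. This is a minimal sketch on empty toy frames that reuse a few of the real column names. With the actual data, `train_set.columns` and `test_set.columns` would take their place.

```python
import pandas as pd

# Empty toy frames carrying a few of the real column names.
train_preview = pd.DataFrame(columns=["SK_ID_CURR", "TARGET", "AMT_CREDIT"])
test_preview = pd.DataFrame(columns=["SK_ID_CURR", "AMT_CREDIT"])

# Columns present in the training frame but absent from the test frame.
only_in_train = train_preview.columns.difference(test_preview.columns)
print(list(only_in_train))
```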


### Repaid vs not repaid

Comparing repaid loans (0) with loans that were not repaid (1) shows that *there are far more repaid loans than not repaid loans*. This is known as the *class imbalance problem*: most machine learning algorithms work best when each class has roughly the same number of examples. We want to emphasize to the machine learning algorithm that it's important to detect the few cases where a loan is likely not to be repaid, rather than focusing only on high accuracy.

In [21]:
train_set["TARGET"].value_counts()

0    282686
1     24825
Name: TARGET, dtype: int64

In [55]:
num_of_not_repaid = train_set["TARGET"].sum()
not_repaid_percentage = num_of_not_repaid / train_set["TARGET"].count() * 100
print("%.2f" % not_repaid_percentage, "% did not repay their loans")

8.07 % did not repay their loans
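
To see why accuracy alone is misleading here, consider a trivial baseline that always predicts "repaid". The sketch below only uses the class counts printed above. The variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
repaid, not_repaid = 282686, 24825
total = repaid + not_repaid

# A "classifier" that always predicts 0 (repaid) is right on every repaid loan...
baseline_accuracy = repaid / total
# ...but it catches none of the loans that were not repaid.
baseline_recall = 0 / not_repaid

print(f"accuracy: {baseline_accuracy:.2%}, recall on class 1: {baseline_recall:.2%}")
```

Such a baseline reaches roughly 92% accuracy while detecting no defaults at all, which is why a metric that is sensitive to the minority class is worth tracking alongside accuracy. One commonly used option is to reweight the classes, for example via the `class_weight` parameter of scikit-learn estimators.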


#### (On hold for now) Training a temporary binary classifier on income and credit alone

We create a temporary set of the total amount of income and credit and explore what kind of predictions we can get from it.

In [22]:
from sklearn.linear_model import SGDClassifier

income_and_credit = train_set[["AMT_INCOME_TOTAL","AMT_CREDIT"]]
income_and_credit_labeled = train_set["TARGET"]
income_and_credit.head()

| | AMT_INCOME_TOTAL | AMT_CREDIT |
|---|---|---|
| 0 | 202500.0 | 406597.5 |
| 1 | 270000.0 | 1293502.5 |
| 2 | 67500.0 | 135000.0 |
| 3 | 135000.0 | 312682.5 |
| 4 | 121500.0 | 513000.0 |


In [23]:
# using random state makes the results reproducible
sgd_clf = SGDClassifier(random_state=30)
sgd_clf.fit(income_and_credit, income_and_credit_labeled)



SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=30, shuffle=True,
       tol=None, verbose=0, warm_start=False)

The next step is to predict the first and second rows, whose known labels are 1 and 0 respectively.
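
A minimal sketch of that prediction step follows. To stay self-contained it fits on only the five rows shown in the preview above, whereas the notebook fits on the full training set, so the predictions printed here say nothing about the real model. The names `features`, `labels` and `clf` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

# The five previewed rows; the notebook itself fits on the full training set.
features = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0, 135000.0, 121500.0],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5, 513000.0],
})
labels = pd.Series([1, 0, 0, 0, 0], name="TARGET")

# Same estimator and seed as in the cell above.
clf = SGDClassifier(random_state=30)
clf.fit(features, labels)

# Predict the first two rows and compare with their known labels (1 and 0).
predictions = clf.predict(features.iloc[:2])
print("predicted:", predictions.tolist(), "actual:", labels.iloc[:2].tolist())
```

Linear models trained with SGD are sensitive to feature scale, so standardizing the two amounts before fitting would probably be a sensible next step.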

### Crosstab

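Inspecting the relation between target and gender. Note that `notnull()` does not exclude the "XNA" placeholder: it is an ordinary string, so every row counts as non-null and the table below only reproduces the class totals.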

In [61]:
pd.crosstab(train_set["TARGET"], train_set["CODE_GENDER"].notnull(), margins=True)

| TARGET | CODE_GENDER: True | All |
|---|---|---|
| 0 | 282686 | 282686 |
| 1 | 24825 | 24825 |
| All | 307511 | 307511 |


In [46]:
train_set.CODE_GENDER.unique()

array(['M', 'F', 'XNA'], dtype=object)
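
Since "XNA" is a regular string value, one way to leave it out of the crosstab is to filter those rows first. The sketch below uses the gender and target values of the five previewed rows plus one hypothetical "XNA" row added for illustration. On the real data, `sample` would be `train_set`.

```python
import pandas as pd

# Five previewed rows plus one hypothetical "XNA" row (illustration only).
sample = pd.DataFrame({
    "TARGET": [1, 0, 0, 0, 0, 0],
    "CODE_GENDER": ["M", "F", "M", "F", "M", "XNA"],
})

# Drop the placeholder category before cross-tabulating.
known_gender = sample[sample["CODE_GENDER"] != "XNA"]
gender_table = pd.crosstab(
    known_gender["TARGET"], known_gender["CODE_GENDER"], margins=True
)
print(gender_table)
```

Passing `normalize="columns"` to `pd.crosstab` would turn the counts into per-gender shares, which may be easier to compare than raw counts.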