# Udacity Machine Learning Capstone Project 
#### Prepared for: Udacity 
#### Prepared by: Damilola Omifare June 2018.

## Home Credit Default Risk
##### Can you predict how capable each applicant is of repaying a loan?
#### Overview 
This project was inspired by the fact that many people who deserve a loan do not get one and end up in the hands of untrustworthy lenders.
It is based on a Kaggle competition: [Kaggle | Home Credit Default Risk Competition](https://www.kaggle.com/c/home-credit-default-risk)


Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

![about homecredit](https://storage.googleapis.com/kaggle-media/competitions/home-credit/about-us-home-credit.jpg) [Source : Kaggle](https://storage.googleapis.com/kaggle-media/competitions/home-credit/about-us-home-credit.jpg)

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.


## Problem Statement
### Can you predict how capable each applicant is of repaying a loan?
- My analysis will predict how capable each applicant is of repaying a loan.

### Datasets and Inputs
The dataset for this project is provided by Kaggle. <br>
It consists of the following files:
* application_{train|test}.csv
* bureau.csv
* bureau_balance.csv
* POS_CASH_balance.csv 
* credit_card_balance.csv
* previous_application.csv
* installments_payments.csv
* HomeCredit_columns_description.csv

For more information on what each file contains, see the [proposal](/Users/bhetey/version_control/machine-learning/projects/capstone/proposal.pdf) or [Kaggle](https://www.kaggle.com/c/home-credit-default-risk). <br>
- The diagram below shows how the files are connected.
![Data structure](https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png)

In [6]:
from __future__ import division
import pandas as pd # this is to import the pandas module
import numpy as np # importing the numpy module 
import os # file system management 
import zipfile # module to read ZIP archive files.

data = (os.listdir("/Users/bhetey/.kaggle/competitions/home-credit-default-risk/"))
data

['application_test.csv',
 '.DS_Store',
 'HomeCredit_columns_description.csv',
 'POS_CASH_balance.csv',
 'credit_card_balance.csv',
 'installments_payments.csv',
 'application_train.csv',
 'bureau.csv',
 'previous_application.csv',
 'bureau_balance.csv',
 'sample_submission.csv']

In [7]:
data.remove("sample_submission.csv")
data.remove(".DS_Store")
data

['application_test.csv',
 'HomeCredit_columns_description.csv',
 'POS_CASH_balance.csv',
 'credit_card_balance.csv',
 'installments_payments.csv',
 'application_train.csv',
 'bureau.csv',
 'previous_application.csv',
 'bureau_balance.csv']
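
Removing entries by name raises a `ValueError` when the entry is missing, for example when there is no `.DS_Store` file on a non-macOS machine. A minimal sketch of a filter-based alternative follows. The helper name `list_data_files` and its `exclude` parameter are illustrative assumptions, not part of the original notebook.

```python
import os


def list_data_files(directory, exclude=("sample_submission.csv",)):
    """Return the sorted CSV file names in `directory`, skipping `exclude`."""
    return sorted(
        name for name in os.listdir(directory)
        if name.endswith(".csv") and name not in exclude
    )
```

Note that this returns the names in sorted order, which differs from the `os.listdir` order shown above, so any list that is later paired with `data` by position would need to follow the same order.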

In [8]:
# reading the data with pandas 
application_train = pd.read_csv('/Users/bhetey/.kaggle/competitions/home-credit-default-risk/application_train.csv')
print('There are ' + str(application_train.shape[0]) + ' rows and ' + str(application_train.shape[1]) + ' columns')
application_train.head()

There are 307511 rows and 122 columns


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
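def readFile_shape(filePath):
    # read the CSV at filePath into a DataFrame
    dataName = pd.read_csv(filePath)
    # uncomment to report the shape of each file as it is read
    #print('There are ' + str(dataName.shape[0]) + ' rows and ' + str(dataName.shape[1]) + ' columns')
    return dataName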

In [10]:
application_test = readFile_shape('/Users/bhetey/.kaggle/competitions/home-credit-default-risk/application_test.csv')

In [11]:
application_test.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


In [12]:
bureau = readFile_shape('/Users/bhetey/.kaggle/competitions/home-credit-default-risk/bureau.csv')
bureau_balance = readFile_shape('/Users/bhetey/.kaggle/competitions/home-credit-default-risk/bureau_balance.csv')
credit_card_balance = readFile_shape('/Users/bhetey/.kaggle/competitions/home-credit-default-risk/credit_card_balance.csv')
HomeCredit_columns_description = readFile_shape('/Users/bhetey/.kaggle/competitions/home-credit-default-risk/HomeCredit_columns_description.csv')
installments_payments = readFile_shape('/Users/bhetey/.kaggle/competitions/home-credit-default-risk/installments_payments.csv')
POS_CASH_balance = readFile_shape('/Users/bhetey/.kaggle/competitions/home-credit-default-risk/POS_CASH_balance.csv')
previous_application  = readFile_shape('/Users/bhetey/.kaggle/competitions/home-credit-default-risk/previous_application.csv')
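
The cell above repeats the same absolute directory for every file. As a hedged sketch, the reads could instead be driven by a list of file names. The helper `load_tables` is a hypothetical name introduced here for illustration, and it assumes every listed file is a CSV that `pd.read_csv` can parse with default settings.

```python
import os

import pandas as pd


def load_tables(directory, file_names):
    """Read each CSV in `file_names` from `directory`.

    Returns a dict keyed by the file name without its extension,
    for example 'bureau' for 'bureau.csv'.
    """
    tables = {}
    for file_name in file_names:
        key = os.path.splitext(file_name)[0]
        tables[key] = pd.read_csv(os.path.join(directory, file_name))
    return tables
```

With the file list built earlier, a call such as `load_tables(directory, data)` would give access to each table as `tables['bureau']`, `tables['POS_CASH_balance']`, and so on. Several of these files hold millions of rows, so loading them all at once needs enough memory.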

## DATA PREPARATION, CLEANSING AND WRANGLING
### Joining the datasets together
+ I combined the row and column counts of each data set into one summary table.
+ A dictionary holds the counts, with one list for rows and one for columns.
+ I will remove the Home Credit column descriptions, since that file only describes the other files.

In [31]:
joinedDataSet = {
    "rows": [
        # number of rows in each file, via .shape[0], in the same order as `data`
        application_test.shape[0],
        HomeCredit_columns_description.shape[0],
        POS_CASH_balance.shape[0],
        credit_card_balance.shape[0],
        installments_payments.shape[0],
        application_train.shape[0],
        bureau.shape[0],
        previous_application.shape[0],
        bureau_balance.shape[0]],
                 
    'columns' : [
        # number of columns in each file, via .shape[1], in the same order as `data`
        application_test.shape[1],
        HomeCredit_columns_description.shape[1],
        POS_CASH_balance.shape[1],
        credit_card_balance.shape[1],
        installments_payments.shape[1],
        application_train.shape[1],
        bureau.shape[1],
        previous_application.shape[1],
        bureau_balance.shape[1]]
}

In [30]:
wholeData = pd.DataFrame(joinedDataSet, index = data)
wholeData

Unnamed: 0,columns,rows
application_test.csv,121,48744
HomeCredit_columns_description.csv,5,219
POS_CASH_balance.csv,8,10001358
credit_card_balance.csv,23,3840312
installments_payments.csv,8,13605401
application_train.csv,122,307511
bureau.csv,17,1716428
previous_application.csv,37,1670214
bureau_balance.csv,3,27299925
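
The summary above pairs two hand-written lists with the `data` index by position, so a change in the order returned by `os.listdir` would silently mislabel rows. A minimal sketch of one way to avoid that is to derive both the labels and the counts from a single dictionary. The function name `summarize_shapes` and the toy frames are assumptions made for illustration.

```python
import pandas as pd


def summarize_shapes(tables):
    """Build a rows/columns summary from a dict of name -> DataFrame."""
    names = list(tables.keys())
    return pd.DataFrame(
        {
            "rows": [tables[name].shape[0] for name in names],
            "columns": [tables[name].shape[1] for name in names],
        },
        index=names,
        columns=["rows", "columns"],
    )


# Toy frames standing in for the real files.
toy_tables = {
    "a.csv": pd.DataFrame({"x": [1, 2, 3]}),
    "b.csv": pd.DataFrame({"x": [1], "y": [2]}),
}
summary = summarize_shapes(toy_tables)
print(summary)
```

Because each label and its counts come from the same dictionary entry, they cannot drift apart.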


### Understanding how the IDs in the dataset are connected
**NOTE:** The dataset uses three ID columns:
* SK_ID_CURR
* SK_ID_PREV
* SK_ID_BUREAU

* **SK_ID_CURR:** links the main application table to the tables below.
    + `application_train/test.csv` is linked to:
        + `bureau.csv`
        + `previous_application.csv`
        + `credit_card_balance.csv`
        + `installments_payments.csv`
        + `POS_CASH_balance.csv`


* **SK_ID_PREV:** links each previous application to its monthly records.
    + `previous_application.csv` is linked to:
        + `installments_payments.csv`
        + `credit_card_balance.csv`
        + `POS_CASH_balance.csv`


* **SK_ID_BUREAU:** links each credit bureau record to its monthly balances.
    + `bureau.csv` is linked to:
        + `bureau_balance.csv`
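
To illustrate how such a key could be used, the sketch below attaches one aggregated bureau feature to an application table through `SK_ID_CURR`. The data are made-up toy values, and the feature name `bureau_loan_count` is hypothetical. This is one possible approach under those assumptions, not the method used later in this project.

```python
import pandas as pd

# Toy stand-ins for application_train.csv and bureau.csv (values are made up).
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 3],
    "SK_ID_BUREAU": [10, 11, 12],
})

# bureau has many rows per applicant, so collapse it to one row per SK_ID_CURR.
bureau_counts = (
    bureau_toy.groupby("SK_ID_CURR")["SK_ID_BUREAU"]
    .count()
    .rename("bureau_loan_count")
    .reset_index()
)

# A left join keeps every applicant; those without bureau records get NaN.
merged = applications.merge(bureau_counts, on="SK_ID_CURR", how="left")
print(merged)
```

Aggregating before merging keeps one row per application, which is what a model trained on `TARGET` would need.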

#### Samples of each data set

In [15]:
application_train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
application_test.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


In [17]:
HomeCredit_columns_description.head()

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
0,1,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,
1,2,application_{train|test}.csv,TARGET,Target variable (1 - client with payment diffi...,
2,5,application_{train|test}.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application_{train|test}.csv,CODE_GENDER,Gender of the client,
4,7,application_{train|test}.csv,FLAG_OWN_CAR,Flag if the client owns a car,


In [18]:
POS_CASH_balance.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,1803195,182943,-31,48.0,45.0,Active,0,0
1,1715348,367990,-33,36.0,35.0,Active,0,0
2,1784872,397406,-32,12.0,9.0,Active,0,0
3,1903291,269225,-35,48.0,42.0,Active,0,0
4,2341044,334279,-35,36.0,35.0,Active,0,0


In [19]:
credit_card_balance.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,...,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,2562384,378907,-6,56.97,135000,0.0,877.5,0.0,877.5,1700.325,...,0.0,0.0,0.0,1,0.0,1.0,35.0,Active,0,0
1,2582071,363914,-1,63975.555,45000,2250.0,2250.0,0.0,0.0,2250.0,...,64875.555,64875.555,1.0,1,0.0,0.0,69.0,Active,0,0
2,1740877,371185,-7,31815.225,450000,0.0,0.0,0.0,0.0,2250.0,...,31460.085,31460.085,0.0,0,0.0,0.0,30.0,Active,0,0
3,1389973,337855,-4,236572.11,225000,2250.0,2250.0,0.0,0.0,11795.76,...,233048.97,233048.97,1.0,1,0.0,0.0,10.0,Active,0,0
4,1891521,126868,-1,453919.455,450000,0.0,11547.0,0.0,11547.0,22924.89,...,453919.455,453919.455,0.0,1,0.0,1.0,101.0,Active,0,0


In [20]:
installments_payments.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
0,1054186,161674,1.0,6,-1180.0,-1187.0,6948.36,6948.36
1,1330831,151639,0.0,34,-2156.0,-2156.0,1716.525,1716.525
2,2085231,193053,2.0,1,-63.0,-63.0,25425.0,25425.0
3,2452527,199697,1.0,3,-2418.0,-2426.0,24350.13,24350.13
4,2714724,167756,1.0,2,-1383.0,-1366.0,2165.04,2160.585


In [21]:
bureau.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


In [22]:
previous_application.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,2030495,271877,Consumer loans,1730.43,17145.0,17145.0,0.0,17145.0,SATURDAY,15,...,Connectivity,12.0,middle,POS mobile with interest,365243.0,-42.0,300.0,-42.0,-37.0,0.0
1,2802425,108129,Cash loans,25188.615,607500.0,679671.0,,607500.0,THURSDAY,11,...,XNA,36.0,low_action,Cash X-Sell: low,365243.0,-134.0,916.0,365243.0,365243.0,1.0
2,2523466,122040,Cash loans,15060.735,112500.0,136444.5,,112500.0,TUESDAY,11,...,XNA,12.0,high,Cash X-Sell: high,365243.0,-271.0,59.0,365243.0,365243.0,1.0
3,2819243,176158,Cash loans,47041.335,450000.0,470790.0,,450000.0,MONDAY,7,...,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-482.0,-152.0,-182.0,-177.0,1.0
4,1784265,202054,Cash loans,31924.395,337500.0,404055.0,,337500.0,THURSDAY,9,...,XNA,24.0,high,Cash Street: high,,,,,,


In [23]:
bureau_balance.head()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


##### Checking for duplicate IDs between the train and test applications

In [24]:
# check that no SK_ID_CURR appears more than once across train and test
checkingRedundantID_forApp = np.concatenate((application_train["SK_ID_CURR"].values, application_test["SK_ID_CURR"].values)) # concatenated SK_ID_CURR values
print(len(np.unique(checkingRedundantID_forApp)) == len(checkingRedundantID_forApp)) # True means no ID is repeated

True
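
The check above reports only whether any ID repeats. If it ever returned `False`, it would help to see which IDs are involved. A minimal sketch with pandas follows, using toy ID values rather than the real columns.

```python
import pandas as pd

# Toy ID values standing in for the SK_ID_CURR columns.
train_ids = pd.Series([100002, 100003, 100004])
test_ids = pd.Series([100001, 100005])

all_ids = pd.concat([train_ids, test_ids], ignore_index=True)

# True when every ID occurs exactly once across both sets.
print(all_ids.is_unique)

# Every occurrence of any repeated ID; empty when the IDs are unique.
repeated = all_ids[all_ids.duplicated(keep=False)]
print(repeated.tolist())
```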


##### Showing how the data are related in terms of IDs

In [25]:
total_id_app = len(checkingRedundantID_forApp)

def intercept_id(sk_ids, app_total_ids,length_id_app):
    # Count the unique IDs that appear in both input arrays and
    # return that count as a percentage of length_id_app
    return len(np.intersect1d(sk_ids, app_total_ids))/ length_id_app *100
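
A small worked example may make the calculation concrete. The function is repeated here so the block runs on its own, with the arithmetic written as `100.0 * ...` so that it does not depend on the `__future__` import. The ID values are toy numbers.

```python
import numpy as np


def intercept_id(sk_ids, app_total_ids, length_id_app):
    # np.intersect1d returns the sorted unique values common to both arrays
    return 100.0 * len(np.intersect1d(sk_ids, app_total_ids)) / length_id_app


child_ids = np.array([1, 2, 2, 5])        # ID 2 repeats; ID 5 is not an application
application_ids = np.array([1, 2, 3, 4])  # four applications

share = intercept_id(child_ids, application_ids, len(application_ids))
print(share)  # 50.0: IDs 1 and 2 are shared, out of four applications
```

Repeated IDs in the child table count once, and IDs that are missing from the application table are ignored.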

### FIRST LINK
* **SK_ID_CURR:** links the main application table to the tables below.
    + `application_train/test.csv` is linked to:
        + `bureau.csv`
        + `previous_application.csv`
        + `credit_card_balance.csv`
        + `installments_payments.csv`
        + `POS_CASH_balance.csv`

In [26]:
# bureau_interception
bureau_interception = intercept_id(bureau['SK_ID_CURR'].values,
                                   checkingRedundantID_forApp, total_id_app)
print('{} bureau_interception'.format(bureau_interception))

# previous application interception
previous_application_interception = intercept_id(previous_application['SK_ID_CURR'].values,
                                                 checkingRedundantID_forApp, total_id_app)
print('{} previous_application_interception'.format(previous_application_interception))

# credit_card_balance_interception
credit_card_balance_interception = intercept_id(credit_card_balance['SK_ID_CURR'].values,
                                               checkingRedundantID_forApp, total_id_app)
print('{} credit_card_balance_interception'.format(credit_card_balance_interception))

# installment_payments_interception
installment_payments_interception = intercept_id(installments_payments['SK_ID_CURR'].values,
                                               checkingRedundantID_forApp, total_id_app)
print('{} installment_payments_interception'.format(installment_payments_interception))

# POS_CASH_balance_interception
pos_cash_balance_interception = intercept_id(POS_CASH_balance['SK_ID_CURR'].values,
                                            checkingRedundantID_forApp, total_id_app)
print('{} pos_cash_balance_interception'.format(pos_cash_balance_interception))

85.8404794319 bureau_interception
95.1164194187 previous_application_interception
29.0685043017 credit_card_balance_interception
95.3213288235 installment_payments_interception
94.665899426 pos_cash_balance_interception
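
The five calls above differ only in the table they pass in, so the same percentages could be produced in a loop. The sketch below shows one way to do that on toy arrays. `coverage_by_table` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np


def coverage_by_table(id_arrays, reference_ids):
    """Percentage of unique `reference_ids` found in each named ID array."""
    reference = np.unique(reference_ids)
    return {
        name: 100.0 * len(np.intersect1d(ids, reference)) / len(reference)
        for name, ids in id_arrays.items()
    }


# Toy ID arrays standing in for the SK_ID_CURR column of each table.
toy_tables = {
    "bureau": np.array([1, 1, 2]),
    "credit_card_balance": np.array([4]),
}
result = coverage_by_table(toy_tables, np.array([1, 2, 3, 4]))
for name in sorted(result):
    print('{} {}'.format(result[name], name))
```

With the real data, the dictionary would map each table name to its `SK_ID_CURR` values and the reference would be the concatenated train and test IDs.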


### SECOND LINK 
* **SK_ID_PREV:** links each previous application to its monthly records.
    + `previous_application.csv` is linked to:
        + `installments_payments.csv`
        + `credit_card_balance.csv`
        + `POS_CASH_balance.csv`

In [27]:
second_link = previous_application['SK_ID_PREV'].values
length_of_second_link = len(second_link)
# installment_payments_interception
prev_installment_payments_interception = intercept_id(installments_payments['SK_ID_PREV'].values,
                                               second_link, length_of_second_link)
print('{} prev_installment_payments_interception'.format(prev_installment_payments_interception))

# credit_card_balance_interception
prev_credit_card_balance_interception = intercept_id(credit_card_balance['SK_ID_PREV'].values,
                                               second_link, length_of_second_link)
print('{} prev_credit_card_balance_interception'.format(prev_credit_card_balance_interception))

# POS_CASH_balance_interception
prev_pos_cash_balance_interception = intercept_id(POS_CASH_balance['SK_ID_PREV'].values,
                                            second_link, length_of_second_link)
print('{} prev_pos_cash_balance_interception'.format(prev_pos_cash_balance_interception))

57.4121040777 prev_installment_payments_interception
5.56425703533 prev_credit_card_balance_interception
53.8196302989 prev_pos_cash_balance_interception


### THIRD LINK 
* **SK_ID_BUREAU:** links each credit bureau record to its monthly balances.
    + `bureau.csv` is linked to:
        + `bureau_balance.csv`

In [28]:
bureau_link = np.unique(bureau["SK_ID_BUREAU"].values)
length_of_bureau_link = len(bureau_link)

# bureau_balance_interception 
bureau_balance_interception = intercept_id(bureau_balance["SK_ID_BUREAU"].values, 
                                           bureau_link, 
                                           length_of_bureau_link)
bureau_balance_interception

45.114272197843434
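
The figure above measures one direction only: the share of `SK_ID_BUREAU` values in `bureau.csv` that also appear in `bureau_balance.csv`. The reverse share, the fraction of balance IDs that have a parent row in `bureau.csv`, can differ. A hedged sketch that computes both on toy arrays follows. `two_way_coverage` is an illustrative name and is not part of the original analysis.

```python
import numpy as np


def two_way_coverage(parent_ids, child_ids):
    """Return (percent of parent IDs found in child, percent of child IDs found in parent)."""
    parent = np.unique(parent_ids)
    child = np.unique(child_ids)
    shared = len(np.intersect1d(parent, child))
    return 100.0 * shared / len(parent), 100.0 * shared / len(child)


# Toy values: two of four parent IDs have child rows; one child ID has no parent.
parent_cov, child_cov = two_way_coverage(
    np.array([10, 11, 12, 13]),
    np.array([10, 10, 11, 99]),
)
print(parent_cov)
print(child_cov)
```

A low value in the second direction would indicate balance rows that cannot be traced back to any applicant.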

In [32]:
wholeData.describe()

Unnamed: 0,columns,rows
count,9.0,9.0
mean,38.222222,6498901.0
std,48.380724,9157505.0
min,3.0,219.0
25%,8.0,307511.0
50%,17.0,1716428.0
75%,37.0,10001360.0
max,122.0,27299920.0
