# Home Credit Default Risk

## Team Members

* Anuj Mahajan - anujmaha@iu.edu        
* Shubham Jambhale - sjambhal@iu.edu 
* Siddhant Patil - sidpatil@iu.edu 
* Shashwati Diware - sdiware@iu.edu

![image.png](attachment:image.png)

## 1.0 FP_GroupN_ 11 HCDR

### 1.1 Phase Leader Plan

![image.png](attachment:image.png)

### 1.2 Credit Assignment Plan

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)

### 1.3 Abstract

Home Credit offers unsecured lending based on applicants' credit histories and repayment trends, assessed with machine learning models. A credit score is calculated for each user from criteria such as the balance the user has maintained. In this project, we predict whether a customer will default on a loan, using machine learning pipelines and models built on the datasets provided by Kaggle. The data comprises seven tables: the main application table, bureau, bureau balance, credit card balance, installments payments, POS CASH balance, and previous applications, along with a file describing the columns. In phase 3, we provide feature engineering, hyperparameter tuning, and modeling pipelines. We experimented with selected features for Logistic Regression, Decision Tree, Random Forest, Lasso, and Ridge models. The Decision Tree has the highest test accuracy at 92.12%, followed by Logistic Regression and Random Forest with a test accuracy of 91.98%. Our Kaggle submission received a ROC AUC of 0.5.

### 1.4 Data and Task Description

* Data source: 
    * We are planning to use the existing datasets provided by Kaggle. 
        Source: https://www.kaggle.com/c/home-credit-default-risk/data 
   <br>  
* POS_CASH_balance.csv:
    * This dataset gives information about the applicant's previous credits, such as contract status, the number of installments left to pay, and DPD (days past due).
![image.png](attachment:image.png)
<br>

* bureau.csv:
    * This dataset gives information about the applicant's previous credits, such as credit type, debt, limit, overdue amount, maximum overdue amount, annuity, and remaining days of the previous credit.
![image-3.png](attachment:image-3.png)
<br>

* bureau_balance.csv:
    * This dataset gives the status of each Credit Bureau loan during the month, the month of the balance relative to the application date, and the recoded ID of the Credit Bureau credit. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
![image-3.png](attachment:image-3.png)
<br>

* credit_card_balance.csv:
    * This dataset gives aggregated values of financial transactions, such as amount received, drawings, number of transactions on the previous credit, and installments. Each row is one month of a credit card balance, and a single credit card can have many rows.
![image-4.png](attachment:image-4.png)
<br>

* installments_payments.csv:
    * This dataset gives information about payments made, installments that were due, and their details. There is one row for every payment made and one row for every missed payment.
![image-5.png](attachment:image-5.png)
<br>

* previous_application.csv:
    * This dataset contains details of the applicant's previous applications. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
![image-6.png](attachment:image-6.png)
<br>

![image-7.png](attachment:image-7.png)
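Because the secondary tables hold many rows per applicant, they have to be collapsed to one row per `SK_ID_CURR` before they can be joined to the application table. The following is a minimal sketch on made-up values, not the aggregation used in this project; the chosen statistics and the `BUREAU_` column prefix are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for bureau.csv: several previous credits per applicant.
# All values are made up for illustration only.
bureau = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100003],
    "SK_ID_BUREAU": [1, 2, 3],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
    "CREDIT_DAY_OVERDUE": [0, 10, 0],
})

# Toy stand-in for application_train.csv: one row per applicant.
application = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004],
    "TARGET": [1, 0, 0],
})

# Collapse to one row per applicant with a few summary statistics.
# The output column names are hypothetical.
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .agg(
        BUREAU_CREDIT_COUNT=("SK_ID_BUREAU", "count"),
        BUREAU_CREDIT_SUM_MEAN=("AMT_CREDIT_SUM", "mean"),
        BUREAU_DAY_OVERDUE_MAX=("CREDIT_DAY_OVERDUE", "max"),
    )
    .reset_index()
)

# A left join keeps applicants with no bureau history; their new features are NaN.
merged = application.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

The left join is a deliberate choice in this sketch: an inner join would silently drop applicants that have no rows in the secondary table.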

### 1.5 Gantt Chart

![image-3.png](attachment:image-3.png)

### 1.6 Machine Learning Algorithms and Metrics

The goal of this project is to predict whether the customer will repay the loan. This is a binary classification task where the outcome is 0 or 1. To solve it, we will build the following machine-learning models:
<br>
1.	Logistic Regression:<br>
    •	In our case, the number of features is relatively small (i.e., <1000) and the number of examples is large, so logistic regression can be a good fit for the classification.
<br>

2.	Decision Tree:<br>
    •	Decision trees handle categorical features well, and our target is also categorical, so a decision tree classifier is a good fit.
<br>

3.	Random Forest:<br>
    •	Random Forest works well with a mixture of numerical and categorical features.
    •	As our data has a good mix of both feature types, random forest can be a good fit.
<br>

4.	Lasso Regression:<br>
    •	Lasso's advantage over least squares rests on the bias-variance trade-off. When the variance of the least-squares estimates is very large, the lasso solution can reduce that variance at the cost of a slight increase in bias, which can produce more accurate predictions.
<br>

5.	Ridge Regression:<br>
    •	Ridge regression is a model-tuning technique for data that exhibits multicollinearity. It performs L2 regularization. When multicollinearity is present, least-squares estimates are unbiased but their variances are large, so predicted values can be far from the actual values.
<br>
    

#### 1.6.1 Loss Function

* Log loss
    * Log loss indicates how closely the predicted probability matches the true value (0 or 1 in binary classification). The higher the log loss, the more the predicted probability deviates from the actual value.
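To make the behaviour of log loss concrete, here is a small sketch in plain Python on made-up labels and probabilities. The helper name `log_loss_binary` and the clipping constant `eps` are assumptions made for illustration; the notebook itself imports `log_loss` from scikit-learn.

```python
import math

def log_loss_binary(y_true, p_pred, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Made-up labels and two sets of predicted probabilities.
y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

loss_right = log_loss_binary(y_true, confident_right)
loss_wrong = log_loss_binary(y_true, confident_wrong)
print(f"good predictions: {loss_right:.4f}")
print(f"bad predictions : {loss_wrong:.4f}")
```

Confident predictions on the wrong side are penalised far more heavily than the same confidence on the right side, which is why log loss rewards well-calibrated probabilities rather than just correct labels.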


#### 1.6.2 Metrics

In [1]:
!pip install latexify-py==0.2.0
import math
import latexify



1.	Confusion Matrix:<br>
    * A confusion matrix, also called an error matrix, is used in machine learning to evaluate classification models. It shows the counts of predicted versus actual classes. "TN" stands for True Negative and is the number of negative cases that were correctly identified. Similarly, "TP" stands for True Positive and is the number of correctly identified positive cases. "FP" denotes the number of actual negative cases that were mistakenly classified as positive, while "FN" denotes the number of actual positive cases that were mistakenly classified as negative. Accuracy, one of the most commonly used classification metrics, is computed from these four counts.
    <br>
    
![image.png](attachment:image.png)

2.	AUC:<br>
    * AUC stands for "Area Under the ROC Curve." It measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1). It is a widely used evaluation metric for binary classification problems.
    <br>
3.	Accuracy:<br>
    * The accuracy score gauges the model's effectiveness as the ratio of correct predictions (true positives plus true negatives) to all predictions made. Accuracy is commonly used to evaluate binary classification models.
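The three metrics above can be sketched in plain Python on a handful of made-up labels and scores. The helper names are hypothetical, and the pairwise AUC computation is a simple equivalent formulation chosen for readability, not the routine scikit-learn uses.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels 0/1."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp

def accuracy(tn, fp, fn, tp):
    """Share of correct predictions among all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: four non-defaulters, two defaulters.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy(tn, fp, fn, tp))
print("AUC:", roc_auc(y_true, scores))

# A model that gives every applicant the same score cannot rank anyone.
constant_scores = [0.0] * len(y_true)
print("AUC of constant scores:", roc_auc(y_true, constant_scores))
```

Accuracy depends on the chosen threshold (0.5 here), while AUC depends only on how the scores rank positives against negatives.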

In [2]:
@latexify.function(use_math_symbols=True)
def Accuracy():
    return (True_Positives + True_Negatives) / (
        True_Positives + True_Negatives + False_Positives + False_Negatives
    )

Accuracy

<latexify.frontend.LatexifiedFunction at 0x7fc0103eea90>

In [3]:
@latexify.function(use_math_symbols=True)
def logloss():
    return (-1/m*(sum(y*np.log(p)+(1- y)*np.log(1-p))))
logloss

<latexify.frontend.LatexifiedFunction at 0x7fc0103eef10>

### 1.7 Machine Learning Pipeline Steps

![image.png](attachment:image.png)

* Data Preprocessing:<br>
•	First, obtain the raw data from Kaggle.<br>
•	Perform exploratory data analysis on the raw data.<br>
•	Convert the raw dataset into a clean dataset for processing.<br>

* Feature Engineering:<br>
•	Create a suitable input dataset by performing feature engineering and other processing techniques.<br>
•	The pipeline must not only select the features to create from an unlimited pool of possibilities, but also process vast amounts of data to do so. This makes the data appropriate for the model.<br>

* Model Selection:<br>
•	Here, we try different models and compare them.<br>
•	Develop and test several candidate models, such as Random Forest, Decision Tree, and Logistic Regression.<br>
•	Using the evaluation function, pick the model with the best evaluation score.<br>
•	For this selection, use several evaluation metrics, including accuracy and F1 score.<br>

* Prediction Generation:<br>
•	The models are tested on a new set of data that was not used during training, and the top performer is chosen as the winning model.<br>
•	Once the best model has been chosen, use it to predict outcomes on the new data.<br>
•	It is then used to make predictions across all records.<br>
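The four steps above can be sketched end to end with scikit-learn. This is a minimal illustration on synthetic data with random labels, not the project's actual pipeline: the column names are borrowed from `application_train`, the values are random, and `ColumnTransformer` is an assumption of this sketch (the notebook imports `FeatureUnion` instead).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: random values, random labels, no real signal.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "AMT_INCOME_TOTAL": rng.normal(150_000, 50_000, n),
    "AMT_CREDIT": rng.normal(500_000, 150_000, n),
    "NAME_CONTRACT_TYPE": rng.choice(["Cash loans", "Revolving loans"], n),
})
X.loc[::25, "AMT_CREDIT"] = np.nan  # inject a few missing values
y = pd.Series(rng.integers(0, 2, n), name="TARGET")

num_cols = ["AMT_INCOME_TOTAL", "AMT_CREDIT"]
cat_cols = ["NAME_CONTRACT_TYPE"]

# Preprocessing: impute and scale numeric columns, impute and one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

# Model: preprocessing and classifier are fitted together, on training data only.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model.fit(X_train, y_train)

# Prediction: evaluate on data that was not used during training.
proba = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("roc_auc :", roc_auc_score(y_test, proba))
```

Wrapping the preprocessing inside the pipeline means the imputer and scaler statistics are learned from the training split only, which avoids leaking test information into the model. Swapping the `clf` step for another estimator is enough to compare candidate models under the same preprocessing.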

### 1.8 Block Diagram

![image.png](attachment:image.png)

In [127]:
#!pip install opendatasets

In [128]:
#pip install pandas

In [129]:
#pwd

In [130]:
#ls -l .kaggle\kaggle.json

In [131]:
#!mkdir .kaggle

In [132]:
#!mkdir ~\.kaggle

In [133]:
#mkdir \.kaggle

In [134]:
#ls -l .kaggle

In [135]:
#pwd


In [136]:
#!chmod 600 C:\\Users\\jambh\\.kaggle\\kaggle.json

In [137]:
#!kaggle competitions 

In [138]:
#DATA_DIR = './././Data/home-credit-default-risk'

In [139]:
#!mkdir DATA_DIR

In [140]:
#!kaggle competitions download home-credit-default-risk -p .\\Data\\home-credit-default-risk
#! kaggle competitions download home-credit-default-risk -p $DATA_DIR

In [141]:
'''import zipfile
unzippingReq = True #True
if unzippingReq: #please modify this code 
    zip_ref = zipfile.ZipFile('./DATA_DIR/home-credit-default-risk.zip', 'r')
    zip_ref.extractall('./DATA_DIR') 
    zip_ref.close()'''

In [96]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
import os
#import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import seaborn as sns
#from sklearn.linear_model import Lasso,Ridge,LogisticRegression
#from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')

#import warnings
warnings.simplefilter('ignore')
import seaborn as sea
import matplotlib.pyplot as plt
%matplotlib inline

#from sklearn.model_selection import train_test_split
import re
from time import time
from scipy import stats
import json
#from sklearn.model_selection import ShuffleSplit
#from sklearn.linear_model import LogisticRegression

#from sklearn.ensemble import RandomForestClassifier
#from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import roc_auc_score, log_loss, accuracy_score
from sklearn.metrics import confusion_matrix

from IPython.display import display, Math, Latex

# Adapted from the base notebook provided by the professor
# https://github.iu.edu/jshanah/I526_AML_Student/blob/master/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2/HCDR_baseLine_submission_with_numerical_and_cat_features_to_kaggle.ipynb
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())
    display(df.head(5))
    return df

'''datasets={}
ds_name = 'application_train'
DATA_DIR='./DATA_DIR'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)

datasets['application_train'].shape'''

complete_data={}
ds_name = 'application_train'
DATA_DIR='./DATA_DIR'
complete_data[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)

complete_data['application_train'].shape

application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


(307511, 122)

In [6]:
ds_name = 'application_test'
complete_data[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)

application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None


Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


In [201]:
%%time
ds_names = ("application_train", "application_test", "bureau","bureau_balance","credit_card_balance","installments_payments",
            "previous_application","POS_CASH_balance")

for ds_name in ds_names:
    complete_data[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)

application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None


Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


bureau: shape is (1716428, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int64  
 1   SK_ID_BUREAU            int64  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int64  
 5   CREDIT_DAY_OVERDUE      int64  
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64  
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int64  
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None


Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


bureau_balance: shape is (27299925, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SK_ID_BUREAU    int64 
 1   MONTHS_BALANCE  int64 
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None


Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


credit_card_balance: shape is (3840312, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   SK_ID_PREV                  int64  
 1   SK_ID_CURR                  int64  
 2   MONTHS_BALANCE              int64  
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64  
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64  
 17  CNT_DRAWINGS_OTHER_CURRENT  float64
 18  CNT_DRAWINGS_POS_C

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,...,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,2562384,378907,-6,56.97,135000,0.0,877.5,0.0,877.5,1700.325,...,0.0,0.0,0.0,1,0.0,1.0,35.0,Active,0,0
1,2582071,363914,-1,63975.555,45000,2250.0,2250.0,0.0,0.0,2250.0,...,64875.555,64875.555,1.0,1,0.0,0.0,69.0,Active,0,0
2,1740877,371185,-7,31815.225,450000,0.0,0.0,0.0,0.0,2250.0,...,31460.085,31460.085,0.0,0,0.0,0.0,30.0,Active,0,0
3,1389973,337855,-4,236572.11,225000,2250.0,2250.0,0.0,0.0,11795.76,...,233048.97,233048.97,1.0,1,0.0,0.0,10.0,Active,0,0
4,1891521,126868,-1,453919.455,450000,0.0,11547.0,0.0,11547.0,22924.89,...,453919.455,453919.455,0.0,1,0.0,1.0,101.0,Active,0,0


installments_payments: shape is (13605401, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_PREV              int64  
 1   SK_ID_CURR              int64  
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64  
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
None


Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
0,1054186,161674,1.0,6,-1180.0,-1187.0,6948.36,6948.36
1,1330831,151639,0.0,34,-2156.0,-2156.0,1716.525,1716.525
2,2085231,193053,2.0,1,-63.0,-63.0,25425.0,25425.0
3,2452527,199697,1.0,3,-2418.0,-2426.0,24350.13,24350.13
4,2714724,167756,1.0,2,-1383.0,-1366.0,2165.04,2160.585


previous_application: shape is (1670214, 37)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 1

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,2030495,271877,Consumer loans,1730.43,17145.0,17145.0,0.0,17145.0,SATURDAY,15,...,Connectivity,12.0,middle,POS mobile with interest,365243.0,-42.0,300.0,-42.0,-37.0,0.0
1,2802425,108129,Cash loans,25188.615,607500.0,679671.0,,607500.0,THURSDAY,11,...,XNA,36.0,low_action,Cash X-Sell: low,365243.0,-134.0,916.0,365243.0,365243.0,1.0
2,2523466,122040,Cash loans,15060.735,112500.0,136444.5,,112500.0,TUESDAY,11,...,XNA,12.0,high,Cash X-Sell: high,365243.0,-271.0,59.0,365243.0,365243.0,1.0
3,2819243,176158,Cash loans,47041.335,450000.0,470790.0,,450000.0,MONDAY,7,...,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-482.0,-152.0,-182.0,-177.0,1.0
4,1784265,202054,Cash loans,31924.395,337500.0,404055.0,,337500.0,THURSDAY,9,...,XNA,24.0,high,Cash Street: high,,,,,,


POS_CASH_balance: shape is (10001358, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SK_ID_PREV             int64  
 1   SK_ID_CURR             int64  
 2   MONTHS_BALANCE         int64  
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object 
 6   SK_DPD                 int64  
 7   SK_DPD_DEF             int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None


Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,1803195,182943,-31,48.0,45.0,Active,0,0
1,1715348,367990,-33,36.0,35.0,Active,0,0
2,1784872,397406,-32,12.0,9.0,Active,0,0
3,1903291,269225,-35,48.0,42.0,Active,0,0
4,2341044,334279,-35,36.0,35.0,Active,0,0


CPU times: user 24.9 s, sys: 6.22 s, total: 31.1 s
Wall time: 34.9 s


In [202]:
for ds_name in complete_data.keys():
    print(f'dataset {ds_name:24}: [ {complete_data[ds_name].shape[0]:10,}, {complete_data[ds_name].shape[1]}]')

dataset application_train       : [    307,511, 122]
dataset application_test        : [     48,744, 121]
dataset bureau                  : [  1,716,428, 17]
dataset bureau_balance          : [ 27,299,925, 3]
dataset credit_card_balance     : [  3,840,312, 23]
dataset installments_payments   : [ 13,605,401, 8]
dataset previous_application    : [  1,670,214, 37]
dataset POS_CASH_balance        : [ 10,001,358, 8]


In [203]:
data = complete_data['application_train'].copy()
y = data['TARGET']
X = data.drop(['SK_ID_CURR','TARGET'], axis = 1)

# EXPLORATORY DATA ANALYSIS

In [204]:
application_test = complete_data['application_test'].copy()
application_train = complete_data['application_train'].copy()

In [15]:
'''def Exploratory_Data_Analysis(dataframe,dataframe_name):
    print("Test description; data type: {}".format(dataframe_name))
    print(dataframe.dtypes)
    print("\n--------------------------------------------------------------------------\n")
    print(" Dataset size (rows columns): {}".format(dataframe_name))
    print(dataframe.shape)
    print("\n--------------------------------------------------------------------------\n")
    print("Summary statistics: {}".format(dataframe_name))
    print(dataframe.describe())
    print("\n--------------------------------------------------------------------------\n")
    print("Correlation analysis: {}".format(dataframe_name))
    print(dataframe.corr())
    print("\n--------------------------------------------------------------------------\n")
    print("Other Analysis: {}".format(dataframe_name))
    print("1. Checking for Null values: {}".format(dataframe_name))
    print(dataframe.isna().sum())
    print("\n2. Info")
    print(dataframe.info())'''


In [16]:
#Exploratory_Data_Analysis(application_train,'APPLICATION_TRAIN_DATA')

In [17]:
#bureau = complete_data['bureau'].copy()
#Exploratory_Data_Analysis(bureau,'Bureau_Data')

In [18]:
#bureau_balance = complete_data['bureau_balance'].copy()
#Exploratory_Data_Analysis(bureau_balance,'Bureau_balance_Data')

In [19]:
#credit_card_balance = complete_data['credit_card_balance'].copy()
#Exploratory_Data_Analysis(credit_card_balance,'credit_card_balance')

In [20]:
#installments_payments = complete_data['installments_payments'].copy()
#Exploratory_Data_Analysis(installments_payments,'installments_payments')

In [21]:
#previous_application   = complete_data['previous_application'].copy()
#Exploratory_Data_Analysis(previous_application ,'previous_application')

In [22]:
#POS_CASH_balance = complete_data['POS_CASH_balance'].copy()
#Exploratory_Data_Analysis(POS_CASH_balance ,'POS_CASH_balance')

In [23]:
#males = application_train[application_train['CODE_GENDER']=='M']['TARGET'].value_counts().reset_index().rename(columns={'TARGET':'user_count','index':'TARGET'})
#males['count_percent'] = males['user_count']/males['user_count'].sum()*100
#males['CODE_GENDER'] = 'M'
#females = application_train[application_train['CODE_GENDER']=='F']['TARGET'].value_counts().reset_index().rename(columns={'TARGET':'user_count','index':'TARGET'})
#females['count_percent'] = females['user_count']/females['user_count'].sum()*100
#females['CODE_GENDER'] = 'F'
#gender_data = males.append(females, ignore_index=True,sort=False)
#gender_data

In [24]:
#sea.catplot(data=gender_data, kind="bar", x="TARGET", y="user_count", hue="CODE_GENDER")
#sea.catplot(data=gender_data, kind="bar", x="CODE_GENDER", y="count_percent", hue="TARGET")
#plt.xlabel("GENDER")
#plt.ylabel('User count in Percentage')
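The commented-out cells above build the per-gender counts with `DataFrame.append`, which was removed in pandas 2.0. A possible equivalent using `groupby` is sketched below on a toy frame with made-up values; it is an illustration of the pattern, not the code that produced the project's plots.

```python
import pandas as pd

# Toy stand-in for application_train; values are made up.
application_train = pd.DataFrame({
    "CODE_GENDER": ["M", "M", "M", "F", "F", "F", "F", "F"],
    "TARGET": [0, 1, 0, 0, 0, 1, 0, 0],
})

# Count users for every (gender, target) combination in one pass.
gender_data = (
    application_train
    .groupby(["CODE_GENDER", "TARGET"])
    .size()
    .reset_index(name="user_count")
)

# Share of each TARGET value within its gender group, in percent.
gender_data["count_percent"] = (
    gender_data["user_count"]
    / gender_data.groupby("CODE_GENDER")["user_count"].transform("sum")
    * 100
)
print(gender_data)
```

The same pattern applies to the `FLAG_OWN_REALTY` and `FLAG_OWN_CAR` breakdowns by changing the grouping column.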

# GENDER Vs INCOME based on Target

In [25]:
#figure,ax = plt.subplots(figsize = (12,12))
#sea.boxplot(x='CODE_GENDER',hue = 'TARGET',y='AMT_INCOME_TOTAL', data=application_train)
#plt.ylim(0, 500000)
#plt.xlabel("GENDER")
#plt.ylabel('ANNUAL INCOME')

# OWN HOUSE COUNT based on Target

In [159]:
#own_house = application_train[application_train['FLAG_OWN_REALTY']=='Y']['TARGET'].value_counts().reset_index().rename(columns={'TARGET':'user_count','index':'TARGET'})
#own_house['OWN_HOUSE'] = 'Y'
#own_house['count_percent'] = own_house['user_count']/own_house['user_count'].sum()*100
#not_own_house = application_train[application_train['FLAG_OWN_REALTY']=='N']['TARGET'].value_counts().reset_index().rename(columns={'TARGET':'user_count','index':'TARGET'})
#not_own_house['OWN_HOUSE'] = 'N'
#not_own_house['count_percent'] = not_own_house['user_count']/not_own_house['user_count'].sum()*100
#own_house = own_house.append(not_own_house,ignore_index=True,sort=False)
#own_house

Unnamed: 0,TARGET,user_count,OWN_HOUSE,count_percent
0,0,196329,Y,92.038423
1,1,16983,Y,7.961577
2,0,86357,N,91.675071
3,1,7842,N,8.324929
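The four counts in the table above add up to the full training set, so they also give the overall class balance. The sketch below uses only those numbers to compute the default rate and the accuracy of a trivial model that always predicts `TARGET = 0`. It is an illustrative baseline, not a result from the project.

```python
# Counts copied from the table above: (TARGET, OWN_HOUSE) -> user_count
counts = {
    (0, "Y"): 196329,
    (1, "Y"): 16983,
    (0, "N"): 86357,
    (1, "N"): 7842,
}

total = sum(counts.values())
defaults = sum(n for (target, _), n in counts.items() if target == 1)
default_rate = defaults / total

# A model that always predicts "no default" is right for every non-defaulter.
majority_accuracy = 1 - default_rate

print(f"total applicants               : {total:,}")
print(f"default rate                   : {default_rate:.2%}")
print(f"accuracy of always predicting 0: {majority_accuracy:.2%}")
```

With a target this imbalanced, a high accuracy score says little on its own, which is why ROC AUC and log loss are reported alongside it.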


In [26]:
#sea.barplot(x='OWN_HOUSE',y='count_percent',hue = 'TARGET',data=own_house[own_house['TARGET']==1])
#sea.catplot(data=own_house, kind="bar", x="TARGET", y="user_count", hue="OWN_HOUSE")
#sea.catplot(data=own_house, kind="bar", x="OWN_HOUSE", y="count_percent", hue="TARGET")
#plt.xlabel("OWN HOUSE")
#plt.ylabel('USER COUNT IN PERCENTAGE')

# OWN CAR COUNT based on Target

In [28]:
#own_car = application_train[application_train['FLAG_OWN_CAR']=='Y']['TARGET'].value_counts().reset_index().rename(columns={'TARGET':'user_count','index':'TARGET'})
#own_car['FLAG_OWN_CAR'] = 'Y'
#own_car['count_percent'] = own_car['user_count']/own_car['user_count'].sum()*100
#not_own_car = application_train[application_train['FLAG_OWN_CAR']=='N']['TARGET'].value_counts().reset_index().rename(columns={'TARGET':'user_count','index':'TARGET'})
#not_own_car['FLAG_OWN_CAR'] = 'N'
#not_own_car['count_percent'] = not_own_car['user_count']/not_own_car['user_count'].sum()*100
#own_car = own_car.append(not_own_car,ignore_index=True,sort=False)
#own_car

In [29]:
#sea.barplot(x='FLAG_OWN_CAR',y='count_percent',hue = 'TARGET',data=own_car[own_car['TARGET']==1])
#sea.catplot(data=own_car, kind="bar", x="TARGET", y="user_count", hue="FLAG_OWN_CAR")
#sea.catplot(data=own_car, kind="bar", x="FLAG_OWN_CAR", y="count_percent", hue="TARGET")
#plt.xlabel("OWN CAR",fontsize = 18)
#plt.ylabel('USER COUNT IN PERCENTAGE',fontsize = 18)

### Borrowers owning a car are more likely to repay

# -------------------------------------------------------------------------------------------------------

# OCCUPATION TYPE COUNT based on Target

In [30]:
#fig, ax = plt.subplots(figsize=(15,9))
#sea.countplot(x='OCCUPATION_TYPE', hue = 'TARGET',data=application_train)
#plt.xlabel("OCCUPATION TYPE",fontsize = 18)
#plt.ylabel('BORROWERS',fontsize = 18)
#plt.xticks(fontsize=14, rotation=65)

# OCCUPATION TYPE vs INCOME based on Target

In [31]:
#fig, ax = plt.subplots(figsize=(15,7))
#sea.barplot(x='OCCUPATION_TYPE',y='AMT_INCOME_TOTAL',hue = 'TARGET',data=application_train)
#plt.xticks(rotation=65,fontsize = 14)
#plt.xlabel("OCCUPATION TYPE",fontsize = 18)
#plt.ylabel("AVERAGE ANNUAL FAMILY INCOME",fontsize = 18)

In [32]:
#income_credit_ratio_data = application_train[['AMT_INCOME_TOTAL','AMT_CREDIT','TARGET']]
#income_credit_ratio_data['IC_ratio'] = income_credit_ratio_data['AMT_INCOME_TOTAL']/income_credit_ratio_data['AMT_CREDIT']
#income_credit_ratio_data['quantile'] = pd.qcut(income_credit_ratio_data['IC_ratio'],q = 10, labels = False)
#income_credit_ratio_data

In [33]:
#income_credit_ratio_data = income_credit_ratio_data.groupby(['quantile','TARGET'])['AMT_INCOME_TOTAL'].count().reset_index().rename(columns={'AMT_INCOME_TOTAL':'user_count'})
#income_credit_ratio_data['count_percent'] = income_credit_ratio_data.apply(lambda x: x['user_count']/income_credit_ratio_data[income_credit_ratio_data['quantile']==x['quantile']]['user_count'].sum()*100,axis=1)
#income_credit_ratio_data

In [34]:
#fig, ax = plt.subplots(figsize=(15,7))
#sea.barplot(x='quantile',y='count_percent',hue = 'TARGET',data=income_credit_ratio_data)
#plt.xticks(rotation=0,fontsize = 14)
#plt.xlabel("QUANTILE BASED ON INCOME TO CREDIT RATIO",fontsize = 18)
#plt.ylabel("DEFAULTER/NON-DEFAULTER PERCENTAGE",fontsize = 18)

In [35]:
#sns.relplot(
#    data=income_credit_ratio_data, x="quantile", y="count_percent",
#    col="TARGET", hue="TARGET", style="TARGET",
#    kind="scatter"
#)

## Defaulter percentage is lower when IC_ratio is either low or high
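
The decile analysis behind this observation can be sketched more compactly: because `TARGET` is 0 or 1, its mean within each income-to-credit decile is already the defaulter share. The frame below is random toy data (an assumption for illustration, not the Kaggle data), so the printed percentages carry no meaning beyond showing the mechanics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Random toy data standing in for application_train.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": rng.uniform(50_000, 300_000, size=1_000),
    "AMT_CREDIT": rng.uniform(100_000, 1_000_000, size=1_000),
    "TARGET": rng.integers(0, 2, size=1_000),
})

toy["IC_ratio"] = toy["AMT_INCOME_TOTAL"] / toy["AMT_CREDIT"]
# labels=False returns the decile number (0..9) instead of an interval.
toy["quantile"] = pd.qcut(toy["IC_ratio"], q=10, labels=False)

# Mean of a 0/1 column is the share of ones, i.e. the defaulter percentage.
default_pct = toy.groupby("quantile")["TARGET"].mean() * 100
print(default_pct)
```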

# -----------------------------------------------------------------------------------------------------------

# REPAYERS TO APPLICATION RATIO

In [36]:
#occ_data = pd.DataFrame(data=application_train.groupby(['OCCUPATION_TYPE','TARGET']).count()['SK_ID_CURR'])
#occ_data = occ_data.reset_index() 
#value_counts = occ_data['SK_ID_CURR'].values
#def repayers_to_applicants_ratio(values):
#    flag = 1
#    ratios = []
#    for count in range(len(values)):
#        if flag == 1:
#            current_number = values[count]
#            next_number = values[count+1]
#            ratios.append(current_number/(current_number+next_number))
#            ratios.append(current_number/(current_number+next_number))
#        flag=flag*-1
#    return ratios
#occ_data['Ratio R/A'] = repayers_to_applicants_ratio(value_counts)
#occ_ratio = occ_data.groupby(['OCCUPATION_TYPE','Ratio R/A']).count().drop(['TARGET', 'SK_ID_CURR'],axis=1)
#occ_ratio = occ_ratio.reset_index() 
#occ_ratio = occ_ratio.sort_values(['Ratio R/A'],ascending=False)
#occ_ratio
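
The repayers-to-applicants ratio can also be computed with a single `groupby`. The helper above reads the counts in pairs, so it relies on every occupation having both `TARGET` values in alternating order; the sketch below does not. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Toy stand-in for application_train (values are made up).
toy = pd.DataFrame({
    "OCCUPATION_TYPE": ["Drivers", "Drivers", "Drivers", "Drivers",
                        "Managers", "Managers"],
    "TARGET": [0, 0, 0, 1, 0, 0],
})

# TARGET == 0 marks a repayer, so the mean of that boolean per occupation
# is repayers / applicants.
occ_ratio = (
    toy["TARGET"].eq(0)
    .groupby(toy["OCCUPATION_TYPE"])
    .mean()
    .rename("Ratio R/A")
    .sort_values(ascending=False)
    .reset_index()
)
print(occ_ratio)
```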

In [205]:
application_train.corrwith(application_train["TARGET"])

SK_ID_CURR                   -0.002108
TARGET                        1.000000
CNT_CHILDREN                  0.019187
AMT_INCOME_TOTAL             -0.003982
AMT_CREDIT                   -0.030369
                                ...   
AMT_REQ_CREDIT_BUREAU_DAY     0.002704
AMT_REQ_CREDIT_BUREAU_WEEK    0.000788
AMT_REQ_CREDIT_BUREAU_MON    -0.012462
AMT_REQ_CREDIT_BUREAU_QRT    -0.002022
AMT_REQ_CREDIT_BUREAU_YEAR    0.019930
Length: 106, dtype: float64
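
The result above covers only the numeric columns. Depending on the pandas version, non-numeric columns may have to be excluded explicitly before calling `corrwith`; the sketch below does that with `select_dtypes` on a small made-up frame (the values are assumptions for illustration).

```python
import pandas as pd

# Toy stand-in for application_train with one non-numeric column.
toy = pd.DataFrame({
    "TARGET": [0, 1, 0, 1, 0, 1],
    "DAYS_BIRTH": [-20000, -9000, -18000, -10000, -16000, -11000],
    "CODE_GENDER": ["F", "M", "F", "M", "F", "F"],
})

# Keep numeric columns only, then correlate each with TARGET.
numeric = toy.select_dtypes(include="number")
corr_with_target = numeric.corrwith(numeric["TARGET"]).drop("TARGET")
print(corr_with_target)
```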

# CORRELATION OF POSITIVE DAYS SINCE BIRTH AND TARGET

In [206]:

#application_train['DAYS_BIRTH'] = abs(application_train['DAYS_BIRTH'])
#-1*(application_train['DAYS_BIRTH'].corr(application_train['TARGET']))

# CORRELATION OF POSITIVE DAYS SINCE EMPLOYMENT AND TARGET

In [207]:
#application_train['DAYS_EMPLOYED'] = abs(application_train['DAYS_EMPLOYED'])
#-1*(application_train['DAYS_EMPLOYED'].corr(application_train['TARGET']))

# FETCHING IMPORTANT RELEVANT FEATURES

In [208]:
# Fetched some of the important features from the dataset on the basis of their correlation
imp_ftr = ['ELEVATORS_MEDI', 'NAME_TYPE_SUITE', 'AMT_GOODS_PRICE', 'FLAG_DOCUMENT_3', 'YEARS_BEGINEXPLUATATION_MEDI', 'AMT_ANNUITY', 'BASEMENTAREA_MODE', 'FLAG_DOCUMENT_6', 'BASEMENTAREA_MEDI', 'FLAG_DOCUMENT_4', 'FLAG_OWN_CAR', 'FONDKAPREMONT_MODE', 'FLAG_OWN_REALTY', 'ENTRANCES_MEDI', 'NONLIVINGAPARTMENTS_AVG', 'EMERGENCYSTATE_MODE', 'APARTMENTS_MODE', 'DAYS_REGISTRATION', 'FLOORSMAX_MEDI', 'YEARS_BUILD_AVG']


In [209]:
imp_ftr += ['FLAG_OWN_CAR','AMT_CREDIT','AMT_ANNUITY','DAYS_EMPLOYED','OWN_CAR_AGE','CODE_GENDER', 'OCCUPATION_TYPE','CNT_FAM_MEMBERS','FLAG_OWN_REALTY', 'EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3'] + ['TARGET']


In [44]:
#experimentLog = pd.DataFrame(columns=["ExpID", "Train accuracy", "Test Accuracy", "Validation Accuracy", "AUC", "Accuracy", "Loss", "Train Time(s)", "Test Time(s)", "Validation Time(s)","Experiment description"])

In [45]:
#def rounding(x):
#    return round(100*x,1)

#class DataFrameSelector(BaseEstimator, TransformerMixin):
#    def __init__(self, attribute_names):
#        self.attribute_names = attribute_names
#    def fit(self, X, y=None):
#        return self
#    def transform(self, X):
#        return X[self.attribute_names].values
#
#def LossBinaryClassifier(actual, predicted): 
#    return(-1/ len(actual)*(sum(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))))

In [210]:
X

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,...,0,0,0,0,,,,,,
4,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,225000.0,Unaccompanied,...,0,0,0,0,,,,,,
307507,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,225000.0,Unaccompanied,...,0,0,0,0,,,,,,
307508,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,585000.0,Unaccompanied,...,0,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0
307509,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,319500.0,Unaccompanied,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [211]:
X.isna().sum()

NAME_CONTRACT_TYPE                0
CODE_GENDER                       0
FLAG_OWN_CAR                      0
FLAG_OWN_REALTY                   0
CNT_CHILDREN                      0
                              ...  
AMT_REQ_CREDIT_BUREAU_DAY     41519
AMT_REQ_CREDIT_BUREAU_WEEK    41519
AMT_REQ_CREDIT_BUREAU_MON     41519
AMT_REQ_CREDIT_BUREAU_QRT     41519
AMT_REQ_CREDIT_BUREAU_YEAR    41519
Length: 120, dtype: int64

In [212]:
'''null_value = X.isna().sum().reset_index().rename(columns={'index':'column_name',0:'null_value_count'})
numberOfNullValues = (max(null_value['null_value_count']) * 30) / 100
print('Threshold set for number of null values : ',numberOfNullValues)
#print(type(null_value))
nullFreeData = null_value.copy()
nullFreeData[nullFreeData.columns[:]] = ''
print(nullFreeData)
for index, row in null_value.iterrows():
    if row['null_value_count'] >= numberOfNullValues:
        nullFreeData = nullFreeData.append(row)
#nullFreeData[(190-70):]
nullFreeData = nullFreeData.iloc[-70:]
#null_value = null_value[null_value['null_value_count'] <= numberOfNullValues]'''
null_data = X.isna().sum().reset_index().rename(columns={'index':'column_name',0:'null_value_count'})
# Threshold: 30% of the largest per-column null count
numberOfNullValues = (max(null_data['null_value_count']) * 30) / 100
#null_data['count_%'] = (null_data['null_count'] / len(X)) *100
null_data = null_data[null_data['null_value_count'] <= numberOfNullValues]
null_data

Unnamed: 0,column_name,null_value_count
0,NAME_CONTRACT_TYPE,0
1,CODE_GENDER,0
2,FLAG_OWN_CAR,0
3,FLAG_OWN_REALTY,0
4,CNT_CHILDREN,0
...,...,...
115,AMT_REQ_CREDIT_BUREAU_DAY,41519
116,AMT_REQ_CREDIT_BUREAU_WEEK,41519
117,AMT_REQ_CREDIT_BUREAU_MON,41519
118,AMT_REQ_CREDIT_BUREAU_QRT,41519


In [213]:
selectiveFtr = null_data['column_name'].tolist()
selectiveFtr += ['TARGET']
print(selectiveFtr)

['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DO

In [118]:
#null_data['column_type'] = null_data['column_name'].apply(lambda x: X[x].dtype)
#nullFreeData['null_value_count'] > 0
#null_value[null_value['count%'] > 0]

In [214]:
finalFtrs = data[selectiveFtr].copy()  # work on a copy so later fills do not act on a view of data
finalFtrs['NAME_TYPE_SUITE'] = finalFtrs['NAME_TYPE_SUITE'].fillna('dummy')
print('Unique values of NAME_TYPE_SUITE')
finalFtrs.NAME_TYPE_SUITE.unique()

Unique values of NAME_TYPE_SUITE


array(['Unaccompanied', 'Family', 'Spouse, partner', 'Children',
       'Other_A', 'dummy', 'Other_B', 'Group of people'], dtype=object)

In [215]:
finalFtrs.dropna( subset=['DAYS_LAST_PHONE_CHANGE'], inplace=True)

In [216]:
NullValColumns = null_data[null_data['null_value_count'] != 0].reset_index(drop=True)['column_name'].tolist()
NullValColumns

['AMT_ANNUITY',
 'AMT_GOODS_PRICE',
 'NAME_TYPE_SUITE',
 'CNT_FAM_MEMBERS',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'OBS_30_CNT_SOCIAL_CIRCLE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'DAYS_LAST_PHONE_CHANGE',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR']

In [217]:
def fillNullValues(NullValColumns, finalFtrs):
    # Count-style columns where a missing value is treated as zero.
    zeroFillColumns = {
        'AMT_REQ_CREDIT_BUREAU_HOUR',
        'AMT_REQ_CREDIT_BUREAU_DAY',
        'AMT_REQ_CREDIT_BUREAU_WEEK',
        'AMT_REQ_CREDIT_BUREAU_MON',
        'AMT_REQ_CREDIT_BUREAU_QRT',
        'AMT_REQ_CREDIT_BUREAU_YEAR',
        'OBS_30_CNT_SOCIAL_CIRCLE',
        'DEF_30_CNT_SOCIAL_CIRCLE',
        'OBS_60_CNT_SOCIAL_CIRCLE',
        'DEF_60_CNT_SOCIAL_CIRCLE',
        'CNT_FAM_MEMBERS',
    }
    for column in NullValColumns:
        if column in zeroFillColumns:
            finalFtrs[column] = finalFtrs[column].fillna(0)
    return finalFtrs

In [218]:
finalFtrs = fillNullValues(NullValColumns, finalFtrs)
finalFtrs

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,TARGET
0,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,1
1,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,225000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
307507,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,225000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
307508,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,585000.0,Unaccompanied,...,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0,0
307509,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,319500.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [219]:
AMTGoodPrice = finalFtrs[['AMT_GOODS_PRICE','NAME_FAMILY_STATUS']]
AMTGoodPrice = AMTGoodPrice.groupby('NAME_FAMILY_STATUS')
AMTGoodPrice.head()

Unnamed: 0,AMT_GOODS_PRICE,NAME_FAMILY_STATUS
0,351000.0,Single / not married
1,1129500.0,Married
2,135000.0,Single / not married
3,297000.0,Civil marriage
4,513000.0,Single / not married
5,454500.0,Married
6,1395000.0,Married
7,1530000.0,Married
8,913500.0,Married
9,405000.0,Single / not married


In [220]:
AMTGoodPrice = finalFtrs[['AMT_GOODS_PRICE','NAME_FAMILY_STATUS']]
AMTGoodPrice = AMTGoodPrice.groupby('NAME_FAMILY_STATUS')['AMT_GOODS_PRICE'].median().reset_index()
AMTGoodPrice['AMT_GOODS_PRICE'] = AMTGoodPrice['AMT_GOODS_PRICE'].fillna(AMTGoodPrice['AMT_GOODS_PRICE'].median())
AMTGoodPrice.head()

Unnamed: 0,NAME_FAMILY_STATUS,AMT_GOODS_PRICE
0,Civil marriage,450000.0
1,Married,459000.0
2,Separated,450000.0
3,Single / not married,373500.0
4,Unknown,450000.0


In [221]:
#sns.set(rc={'figure.figsize':(11,8)})
#ax = sns.barplot(x="NAME_FAMILY_STATUS", y="AMT_GOODS_PRICE", data=temp_vis)

In [222]:
#sns.set(rc={'figure.figsize':(11,8)})
#ax = sns.boxplot(x="NAME_FAMILY_STATUS", y="AMT_GOODS_PRICE", data=X_feature)

In [223]:
finalFtrs.dropna(subset=['AMT_ANNUITY'], inplace=True)

In [224]:
def correct_cat_val(c):
    if c['AMT_GOODS_PRICE'] == -np.inf:
        return AMTGoodPrice[AMTGoodPrice['NAME_FAMILY_STATUS'] == c['NAME_FAMILY_STATUS']]['AMT_GOODS_PRICE'].values[0]
    else:
        return c['AMT_GOODS_PRICE']

In [225]:
# Mark missing AMT_GOODS_PRICE with a sentinel, then replace the sentinel
# with the median for the borrower's NAME_FAMILY_STATUS group.
finalFtrs['AMT_GOODS_PRICE'] = finalFtrs['AMT_GOODS_PRICE'].fillna(-np.inf)
finalFtrs['AMT_GOODS_PRICE'] = finalFtrs.apply(correct_cat_val, axis=1)
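
The same group-median imputation can be written without a sentinel value or a row-wise `apply` by using `groupby(...).transform`. This is a minimal sketch on made-up numbers; only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for finalFtrs (values are made up).
toy = pd.DataFrame({
    "NAME_FAMILY_STATUS": ["Married", "Married", "Married",
                           "Single", "Single", "Single"],
    "AMT_GOODS_PRICE": [100.0, 300.0, np.nan, 50.0, np.nan, 70.0],
})

# transform("median") returns one value per row: the median of that row's group.
group_median = toy.groupby("NAME_FAMILY_STATUS")["AMT_GOODS_PRICE"].transform("median")

# fillna with a Series aligns on the index, so only missing rows are replaced.
toy["AMT_GOODS_PRICE"] = toy["AMT_GOODS_PRICE"].fillna(group_median)
print(toy)
```

On a frame with several hundred thousand rows this is typically much faster than `apply(..., axis=1)`, since the median lookup is vectorized.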

In [226]:
finalFtrs = finalFtrs.reset_index(drop=True)
print(finalFtrs.columns.tolist())
finalFtrs.shape

['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DO

(307498, 71)

In [227]:
correlation_df = pd.DataFrame()
allColumns = finalFtrs.columns.tolist()

In [230]:
def correlationWithEXTSource3(allColumns):
    rows = []
    for col in allColumns:
        if finalFtrs[col].dtype == 'int':
            rows.append([col, finalFtrs['EXT_SOURCE_3'].corr(finalFtrs[col])])
        else:
            # Non-integer columns are label-encoded before correlating.
            encoded = pd.Series(LabelEncoder().fit_transform(finalFtrs[col]), index=finalFtrs.index)
            rows.append([col, finalFtrs['EXT_SOURCE_3'].corr(encoded)])
    correlation_df = pd.DataFrame(rows, columns=['Column Name', 'Correlation with EXT 3'])
    correlation_df['Correlation with EXT 3'] = correlation_df['Correlation with EXT 3'] * -1
    return correlation_df

In [231]:
correlation_df = correlationWithEXTSource3(allColumns)
correlation_df.sort_values(by='Correlation with EXT 3',ascending = False).head()

Unnamed: 0,Column Name,Correlation with EXT 3
15,DAYS_BIRTH,0.205474
70,TARGET,0.178929
18,DAYS_ID_PUBLISH,0.131598
20,FLAG_EMP_PHONE,0.115284
34,REG_CITY_NOT_WORK_CITY,0.079706


In [None]:
ext3Corrcolumns = correlation_df['Column Name'].tolist()
ext_source_3 = finalFtrs[ext3Corrcolumns + ['EXT_SOURCE_3']]

for col in ext_source_3.columns.tolist():
    if col != 'EXT_SOURCE_3':
        ext_source_3[col] = LabelEncoder().fit_transform(X_feature[[col]])

ext_source_3_train = ext_source_3[ext_source_3['EXT_SOURCE_3'].notnull()]
ext_source_3_test = ext_source_3[ext_source_3['EXT_SOURCE_3'].isnull()]
ext_source_3_train.shape, ext_source_3_test.shape

In [232]:
correlation_df = pd.DataFrame()
allColumns = finalFtrs.columns.tolist()

In [233]:
def correlationWithEXTSource2(allColumns):
    rows = []
    for col in allColumns:
        if finalFtrs[col].dtype == 'int':
            rows.append([col, finalFtrs['EXT_SOURCE_2'].corr(finalFtrs[col])])
        else:
            # Non-integer columns are label-encoded before correlating.
            encoded = pd.Series(LabelEncoder().fit_transform(finalFtrs[col]), index=finalFtrs.index)
            rows.append([col, finalFtrs['EXT_SOURCE_2'].corr(encoded)])
    correlation_df = pd.DataFrame(rows, columns=['Column Name', 'Correlation with EXT 2'])
    correlation_df['Correlation with EXT 2'] = correlation_df['Correlation with EXT 2'] * -1
    return correlation_df

In [234]:
correlation_df = correlationWithEXTSource2(allColumns)
correlation_df.sort_values(by='Correlation with EXT 2',ascending = False).head()

Unnamed: 0,Column Name,Correlation with EXT 2
26,REGION_RATING_CLIENT,0.292903
27,REGION_RATING_CLIENT_W_CITY,0.288306
70,TARGET,0.160471
15,DAYS_BIRTH,0.092
34,REG_CITY_NOT_WORK_CITY,0.07598


In [237]:
region_rating_client = finalFtrs.groupby('REGION_RATING_CLIENT')['EXT_SOURCE_2'].median().reset_index()
finalFtrs['EXT_SOURCE_2'] = finalFtrs['EXT_SOURCE_2'].fillna(-np.inf)

def correct_ext_source_2(e):
    if e['EXT_SOURCE_2'] != -np.inf:
        return e['EXT_SOURCE_2']
    else:
        return region_rating_client[region_rating_client['REGION_RATING_CLIENT'] == e['REGION_RATING_CLIENT']]['EXT_SOURCE_2'].values[0]

finalFtrs['EXT_SOURCE_2'] = finalFtrs.apply(lambda e: correct_ext_source_2(e),axis=1)


Unnamed: 0,column_name,correlation_with_EXT_3
15,DAYS_BIRTH,0.205475
70,TARGET,0.178929
18,DAYS_ID_PUBLISH,0.13161
20,FLAG_EMP_PHONE,0.115284
37,EXT_SOURCE_2,0.109728
17,DAYS_REGISTRATION,0.10757
5,AMT_INCOME_TOTAL,0.088906
36,ORGANIZATION_TYPE,0.087994
34,REG_CITY_NOT_WORK_CITY,0.079706


In [192]:
ext_source_3 = X_feature[correlation_df['column_name'].tolist()+['EXT_SOURCE_3']]

for col in ext_source_3.columns.tolist():
    if col != 'EXT_SOURCE_3':
        ext_source_3[col] = LabelEncoder().fit_transform(X_feature[[col]])

ext_source_3_train = ext_source_3[ext_source_3['EXT_SOURCE_3'].notnull()]
ext_source_3_test = ext_source_3[ext_source_3['EXT_SOURCE_3'].isnull()]
ext_source_3_train.shape, ext_source_3_test.shape

((246535, 10), (60963, 10))

In [193]:
ext_source_3_y_train = ext_source_3_train[['EXT_SOURCE_3']]
ext_source_3_X_train = ext_source_3_train.drop(columns=['EXT_SOURCE_3'])
ext_source_3_X_test = ext_source_3_test.drop(columns=['EXT_SOURCE_3'])

In [194]:
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(ext_source_3_X_train, ext_source_3_y_train) 
ext_source_3_y_pred = model.predict(ext_source_3_X_test)

ext_source_3_output = ext_source_3_X_test
ext_source_3_output['exs3_y'] = ext_source_3_y_pred
ext_source_3_output

Unnamed: 0,DAYS_BIRTH,TARGET,DAYS_ID_PUBLISH,FLAG_EMP_PHONE,EXT_SOURCE_2,DAYS_REGISTRATION,AMT_INCOME_TOTAL,ORGANIZATION_TYPE,REG_CITY_NOT_WORK_CITY,exs3_y
1,8382,0,5876,1,85079,14501,2064,39,0,0.482395
3,6142,0,3730,1,90559,5854,1170,5,0,0.562438
4,5215,0,2709,1,36021,11376,1019,37,1,0.542161
9,10676,0,2175,1,110724,1373,1170,9,0,0.563857
14,10562,0,4111,1,89028,15072,1659,53,0,0.500342
...,...,...,...,...,...,...,...,...,...,...
307471,12298,0,6132,1,109216,13156,2398,5,0,0.448259
307488,12184,0,2387,1,76369,14289,506,14,0,0.528944
307491,8442,0,5908,1,68374,5889,1371,42,0,0.510365
307493,15818,0,4185,1,96856,7231,1407,43,0,0.483491


In [195]:
# Write the predicted values back to the rows where EXT_SOURCE_3 was missing.
ext_source_3_output = ext_source_3_output.reset_index().rename(columns={'index':'index_to_be_updated'})
missing_idx = ext_source_3_output['index_to_be_updated'].to_numpy()
ext3_pos = X_feature.columns.get_loc('EXT_SOURCE_3')
X_feature.iloc[missing_idx, ext3_pos] = ext_source_3_output['exs3_y'].to_numpy()
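
The EXT_SOURCE_3 imputation above follows a fit-on-known, predict-on-missing pattern. The sketch below shows that pattern end to end on made-up data. To keep it dependency-free it uses `np.linalg.lstsq` as a stand-in for `LinearRegression`, and the predictor names `x1` and `x2` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data generated from EXT_SOURCE_3 = 1 + 2*x1 + 1*x2, with two values missing.
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "EXT_SOURCE_3": [3.0, 6.0, 7.0, np.nan, 11.0, np.nan],
})

known = toy["EXT_SOURCE_3"].notna().to_numpy()

# Design matrix with an intercept column.
A = np.column_stack([np.ones(len(toy)), toy[["x1", "x2"]].to_numpy()])

# Fit ordinary least squares on the rows where the target is known.
coef, *_ = np.linalg.lstsq(A[known], toy.loc[known, "EXT_SOURCE_3"].to_numpy(), rcond=None)

# Predict the missing rows and write them back in one vectorized assignment.
toy.loc[~known, "EXT_SOURCE_3"] = A[~known] @ coef
print(toy)
```

One point to keep in mind when adapting this: the predictor table shown above includes `TARGET`, which is not available for the Kaggle test set, so the same imputation model could not be applied there unchanged.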

In [196]:
X_feature.isna().sum()

NAME_CONTRACT_TYPE            0
CODE_GENDER                   0
FLAG_OWN_CAR                  0
FLAG_OWN_REALTY               0
CNT_CHILDREN                  0
                             ..
AMT_REQ_CREDIT_BUREAU_WEEK    0
AMT_REQ_CREDIT_BUREAU_MON     0
AMT_REQ_CREDIT_BUREAU_QRT     0
AMT_REQ_CREDIT_BUREAU_YEAR    0
TARGET                        0
Length: 71, dtype: int64

In [197]:
X_feature['AMT_CREDIT_TO_ANNUITY_RATIO'] = X_feature['AMT_CREDIT'] / X_feature['AMT_ANNUITY']
X_feature['Tot_EXTERNAL_SOURCE'] = X_feature['EXT_SOURCE_2'] + X_feature['EXT_SOURCE_3']
X_feature['Salary_to_credit'] = X_feature['AMT_INCOME_TOTAL']/X_feature['AMT_CREDIT']
X_feature['Annuity_to_salary_ratio'] = X_feature['AMT_ANNUITY']/X_feature['AMT_INCOME_TOTAL']
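
Ratio features like the ones above produce `inf` whenever the denominator is zero, and infinite values can break scalers and models later in the pipeline. Whether the real data contains such rows is not checked here; the sketch below only shows one way to guard against them, on made-up values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_feature with one zero annuity.
toy = pd.DataFrame({
    "AMT_CREDIT": [100_000.0, 200_000.0, 300_000.0],
    "AMT_ANNUITY": [10_000.0, 0.0, 15_000.0],
})

ratio = toy["AMT_CREDIT"] / toy["AMT_ANNUITY"]

# Turn +/-inf from division by zero into NaN so it can be imputed like any gap.
toy["AMT_CREDIT_TO_ANNUITY_RATIO"] = ratio.replace([np.inf, -np.inf], np.nan)
print(toy)
```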

In [198]:
train = X_feature
train.shape

(307498, 75)

In [199]:
features = train.columns.tolist()
features.remove('TARGET')
len(features)

74

In [200]:
le_dict = {}

for col in features:
    if train[col].dtype == 'object':
        le = LabelEncoder()
        train[col] = le.fit_transform(train[col])
        le_dict['le_{}'.format(col)] = le

In [201]:
X = train[features]
y = train['TARGET']

In [202]:
experimentLog = pd.DataFrame(columns=["ExpID", "Cross fold train accuracy", "Test Accuracy", "Valid Accuracy", "AUC", "Loss", "Accuracy", "Train Time(s)", "Test Time(s)", "Experiment description", "Best Hyper Parameters"])

In [203]:
X_feature.to_csv('MLP_input_data.csv', index=False)

In [204]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import re
from time import time
from scipy import stats
import json

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import make_scorer, roc_auc_score, log_loss, accuracy_score
from sklearn.preprocessing import LabelEncoder,MinMaxScaler

from sklearn.metrics import confusion_matrix
from IPython.display import display, Math, Latex

from numpy import vstack
from pandas import read_csv
from torch.utils.data import Dataset, DataLoader, random_split
from torch import Tensor
from torch.nn import Linear, ReLU, Sigmoid, Module, BCELoss
from torch.optim import SGD
from torch.nn.init import kaiming_uniform_, xavier_uniform_

from torch.utils.tensorboard import SummaryWriter

In [205]:
scaler = MinMaxScaler()

In [206]:
writer = SummaryWriter()

# dataset definition
class CSVDataset(Dataset):
    # load the dataset
    def __init__(self, path):
        # load the csv file as a dataframe
        df = read_csv(path).head(10000)

        num_attribs = []
        cat_attribs = []

        for col in df.columns.tolist():
            if df[col].dtype in (['int','float']):
                num_attribs.append(col)
            else:
                cat_attribs.append(col)
                
        le_dict = {}
        for col in df.columns.tolist():
            if df[col].dtype == 'object':
                le = LabelEncoder()
                df[col] = df[col].fillna("NULL")
                df[col] = le.fit_transform(df[col])
                le_dict['le_{}'.format(col)] = le

        # store the inputs and outputs
        self.X = df.drop(columns=['TARGET']).values[:, :]
        self.X = scaler.fit_transform(self.X)
        self.y = df['TARGET'].values[:]

        self.X = self.X.astype('float32')
        # label encode target and ensure the values are floats
        self.y = LabelEncoder().fit_transform(self.y)
        self.y = self.y.astype('float32')
        self.y = self.y.reshape((len(self.y), 1))

    # number of rows in the dataset
    def __len__(self):
        return len(self.X)

    # get a row at an index
    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]

    # get indexes for train and test rows
    def get_splits(self, n_test=0.25):
        # determine sizes
        test_size = round(n_test * len(self.X))
        train_size = len(self.X) - test_size
        # calculate the split
        return random_split(self, [train_size, test_size])

# model definition
class MLP(Module):
    # define model elements
    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        # input to first hidden layer
        self.hidden1 = Linear(n_inputs, 35)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # second hidden layer
        self.hidden2 = Linear(35, 15)
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # third hidden layer
        self.hidden3 = Linear(15, 5)
        kaiming_uniform_(self.hidden3.weight, nonlinearity='relu')
        self.act3 = ReLU()
        # output layer
        self.hidden4 = Linear(5, 1)
        xavier_uniform_(self.hidden4.weight)
        self.act4 = Sigmoid()

    # forward propagate input
    def forward(self, X):
        # input to first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # third hidden layer
        X = self.hidden3(X)
        X = self.act3(X)
        # output layer
        X = self.hidden4(X)
        X = self.act4(X)
        return X

# prepare the dataset
def prepare_data(path):
    # load the dataset
    dataset = CSVDataset(path)
    # calculate split
    train, test = dataset.get_splits()
    # prepare data loaders
    train_dl = DataLoader(train, batch_size=32, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)
    return train_dl, test_dl

# train the model
def train_model(train_dl, model):
    # define the optimization
    criterion = BCELoss()
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
    # enumerate epochs
    start = time()
    for epoch in range(50):
        # enumerate mini batches
        for i, (inputs, targets) in enumerate(train_dl):
            # clear the gradients
            optimizer.zero_grad()
            # compute the model output
            yhat = model(inputs)
            # calculate loss
            loss = criterion(yhat, targets)
            # plotting on tensorboard
            writer.add_scalar("Loss/train", loss, epoch)
            # credit assignment
            loss.backward()
            # update model weights
            optimizer.step()
    train_time = np.round(time() - start,4)
    return train_time
# evaluate the model
def evaluate_model(test_dl, model):
    predictions, actuals = list(), list()
    for i, (inputs, targets) in enumerate(test_dl):
        # evaluate the model on the test set
        yhat = model(inputs)
        # retrieve numpy array
        yhat = yhat.detach().numpy()
        actual = targets.numpy()
        actual = actual.reshape((len(actual), 1))
        # round to class values
        yhat = yhat.round()
        # store
        predictions.append(yhat)
        actuals.append(actual)
    predictions, actuals = vstack(predictions), vstack(actuals)
    # calculate accuracy
    acc = accuracy_score(actuals, predictions)
    return acc

# collect rounded class predictions and actual labels;
# only the first test batch (up to 1024 rows) is returned
def predict_model(test_dl, model):
    temp_df = pd.DataFrame()
    predictions, actuals = list(), list()
    for i, (inputs, targets) in enumerate(test_dl):
        # evaluate the model on the test set
        yhat = model(inputs)
        # retrieve numpy array
        yhat = yhat.detach().numpy()
        actual = targets.numpy()
        actual = actual.reshape((len(actual), 1))
        # round to class values
        yhat = yhat.round()
        # store
        predictions.append(yhat)
        actuals.append(actual)
    predictions = predictions[0].reshape(len(predictions[0])).tolist()
    actuals = actuals[0].reshape(len(actuals[0])).tolist()
    temp_df['pred'] = predictions
    temp_df['actual'] = actuals
    return temp_df

# prepare the data
path = 'MLP_input_data.csv'
train_dl, test_dl = prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))
# define the network
model = MLP(74)
# train the model
train_time = train_model(train_dl, model)
# getting test results
output_df = predict_model(test_dl, model)
# evaluate the model
acc = evaluate_model(test_dl, model)
print('Accuracy: %.3f' % acc)
AUC = roc_auc_score(output_df['actual'],output_df['pred'])
print (AUC)

# Build the experiment log directly; DataFrame.append was removed in pandas 2.0.
expLog = pd.DataFrame(
    [["MLP", acc, train_time, AUC, '']],
    columns=["ExpID", "Accuracy", "Time", "AUC", "Comments (not mandatory)"],
)


#writer.flush()
#writer.close()

7500 2500
Accuracy: 0.923
0.49947089947089945
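
Because `test_dl` yields the test set in batches, the per-batch outputs have to be stacked before a metric covers the full test set. The following is a minimal NumPy sketch of that step. The batch arrays are made up for illustration and stand in for the model outputs; they are not taken from the notebook's `DataLoader`.

```python
import numpy as np

# Hypothetical per-batch predictions, shaped (batch_size, 1) like the model outputs.
batches = [
    np.array([[0.0], [1.0], [0.0]]),
    np.array([[1.0], [0.0]]),
]

# Indexing with [0] keeps only the first batch.
first_only = batches[0].reshape(-1).tolist()

# Stacking concatenates every batch along the row axis.
all_rows = np.vstack(batches).reshape(-1).tolist()

print(len(first_only), len(all_rows))
```

Any metric computed from `first_only` would describe just the first batch, while `all_rows` covers every test row.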


In [207]:
expLog

Unnamed: 0,ExpID,Accuracy,Time,AUC,Comments (not mandatory)
0,MLP,0.9228,28.6339,0.499471,
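
A high accuracy next to an AUC close to 0.5 is the pattern a majority-class predictor produces on imbalanced labels. The sketch below illustrates this with plain Python. The 8% positive rate and the helper functions `accuracy` and `auc` are assumptions made for the example; `auc` uses the rank (Mann-Whitney) formulation of ROC AUC rather than any library call.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative labels: 8 defaulters out of 100 applicants (an assumed ratio).
y_true = [1] * 8 + [0] * 92
# A degenerate "model" that predicts the majority class for everyone.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))
print(auc(y_true, y_pred))
```

The constant predictor is right for every non-defaulter, so its accuracy equals the majority-class share, yet it cannot rank any defaulter above a non-defaulter, so its AUC stays at chance level.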


In [208]:
%load_ext tensorboard
%tensorboard --logdir runs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [209]:
bureau_data = datasets['bureau']

In [210]:
bureau_data

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.00,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.00,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.50,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.00,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.00,,,0.0,Consumer credit,-21,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1716423,259355,5057750,Active,currency 1,-44,0,-30.0,,0.0,0,11250.00,11250.0,0.0,0.0,Microloan,-19,
1716424,100044,5057754,Closed,currency 1,-2648,0,-2433.0,-2493.0,5476.5,0,38130.84,0.0,0.0,0.0,Consumer credit,-2493,
1716425,100044,5057762,Closed,currency 1,-1809,0,-1628.0,-970.0,,0,15570.00,,,0.0,Consumer credit,-967,
1716426,246829,5057770,Closed,currency 1,-1878,0,-1513.0,-1513.0,,0,36000.00,0.0,0.0,0.0,Consumer credit,-1508,


In [211]:
bureau_data['DAYS_CREDIT'].value_counts()

-364    1330
-336    1248
-273    1238
-357    1218
-343    1203
        ... 
-4       113
-3        74
-2        42
 0        25
-1        17
Name: DAYS_CREDIT, Length: 2923, dtype: int64

In [212]:
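# correlate numeric columns only; text columns such as CREDIT_ACTIVE cannot be correlated
corr = bureau_data.select_dtypes(include='number').corr()['DAYS_CREDIT'].sort_values(ascending = False)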
print('Most Positive Correlations:\n', corr.head(10))
print('\nMost Negative Correlations:\n', corr.tail(10))

Most Positive Correlations:
 DAYS_CREDIT             1.000000
DAYS_ENDDATE_FACT       0.875359
DAYS_CREDIT_UPDATE      0.688771
DAYS_CREDIT_ENDDATE     0.225682
AMT_CREDIT_SUM_DEBT     0.135397
AMT_CREDIT_SUM          0.050883
AMT_CREDIT_SUM_LIMIT    0.025140
SK_ID_BUREAU            0.013015
AMT_ANNUITY             0.005676
SK_ID_CURR              0.000266
Name: DAYS_CREDIT, dtype: float64

Most Negative Correlations:
 AMT_CREDIT_SUM_DEBT       0.135397
AMT_CREDIT_SUM            0.050883
AMT_CREDIT_SUM_LIMIT      0.025140
SK_ID_BUREAU              0.013015
AMT_ANNUITY               0.005676
SK_ID_CURR                0.000266
AMT_CREDIT_SUM_OVERDUE   -0.000383
AMT_CREDIT_MAX_OVERDUE   -0.014724
CREDIT_DAY_OVERDUE       -0.027266
CNT_CREDIT_PROLONG       -0.030460
Name: DAYS_CREDIT, dtype: float64


In [213]:
bureau_corr = pd.DataFrame(corr, columns = ['DAYS_CREDIT'])
bureau_corr

Unnamed: 0,DAYS_CREDIT
DAYS_CREDIT,1.0
DAYS_ENDDATE_FACT,0.875359
DAYS_CREDIT_UPDATE,0.688771
DAYS_CREDIT_ENDDATE,0.225682
AMT_CREDIT_SUM_DEBT,0.135397
AMT_CREDIT_SUM,0.050883
AMT_CREDIT_SUM_LIMIT,0.02514
SK_ID_BUREAU,0.013015
AMT_ANNUITY,0.005676
SK_ID_CURR,0.000266


In [214]:
# flag every credit that is not 'Closed' as open (1); closed credits are 0
bureau_data['CREDIT_ACTIVE_CLASSIFY'] = (bureau_data['CREDIT_ACTIVE'] != 'Closed').astype(int)

In [215]:
active_loan = bureau_data.groupby(by = ['SK_ID_CURR'])['CREDIT_ACTIVE_CLASSIFY'].mean().reset_index().rename(index=str, columns={'CREDIT_ACTIVE_CLASSIFY': 'ACTIVE_LOANS_PERCENTAGE'})
active_loan

Unnamed: 0,SK_ID_CURR,ACTIVE_LOANS_PERCENTAGE
0,100001,0.428571
1,100002,0.250000
2,100003,0.250000
3,100004,0.000000
4,100005,0.666667
...,...,...
305806,456249,0.153846
305807,456250,0.666667
305808,456253,0.500000
305809,456254,0.000000
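
The feature above is the per-applicant mean of a 0/1 flag, so it is the share of an applicant's bureau credits that are not closed, expressed as a fraction between 0 and 1 rather than a percentage. A small sketch on made-up rows (the toy frame and the `IS_OPEN` column name are assumptions for illustration):

```python
import pandas as pd

# Toy bureau records: applicant 1 has 1 of 4 credits open, applicant 2 has 2 of 3.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 1, 2, 2, 2],
    "CREDIT_ACTIVE": ["Closed", "Closed", "Active", "Closed",
                      "Active", "Sold", "Closed"],
})

# Every status other than 'Closed' counts as 1.
toy["IS_OPEN"] = (toy["CREDIT_ACTIVE"] != "Closed").astype(int)

# The mean of a 0/1 flag per applicant is the share of non-closed credits.
share = (toy.groupby("SK_ID_CURR")["IS_OPEN"].mean()
            .rename("ACTIVE_LOANS_PERCENTAGE")
            .reset_index())
print(share)
```

Note that statuses such as `Sold` also count as open under this rule, because only `Closed` maps to 0.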


In [216]:
bureau_data = bureau_data.merge(active_loan, on = ['SK_ID_CURR'], how = 'left')
bureau_data

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY,CREDIT_ACTIVE_CLASSIFY,ACTIVE_LOANS_PERCENTAGE
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.00,0.0,,0.0,Consumer credit,-131,,0,0.545455
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.00,171342.0,,0.0,Credit card,-20,,1,0.545455
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.50,,,0.0,Consumer credit,-16,,1,0.545455
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.00,,,0.0,Credit card,-16,,1,0.545455
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.00,,,0.0,Consumer credit,-21,,1,0.545455
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1716423,259355,5057750,Active,currency 1,-44,0,-30.0,,0.0,0,11250.00,11250.0,0.0,0.0,Microloan,-19,,1,1.000000
1716424,100044,5057754,Closed,currency 1,-2648,0,-2433.0,-2493.0,5476.5,0,38130.84,0.0,0.0,0.0,Consumer credit,-2493,,0,0.545455
1716425,100044,5057762,Closed,currency 1,-1809,0,-1628.0,-970.0,,0,15570.00,,,0.0,Consumer credit,-967,,0,0.545455
1716426,246829,5057770,Closed,currency 1,-1878,0,-1513.0,-1513.0,,0,36000.00,0.0,0.0,0.0,Consumer credit,-1508,,0,0.258065


In [217]:
bureau_data.drop(['AMT_ANNUITY','DAYS_ENDDATE_FACT','CREDIT_CURRENCY','CREDIT_ACTIVE_CLASSIFY'], axis = 1, inplace = True)

In [218]:
bureau_data

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,ACTIVE_LOANS_PERCENTAGE
0,215354,5714462,Closed,-497,0,-153.0,,0,91323.00,0.0,,0.0,Consumer credit,-131,0.545455
1,215354,5714463,Active,-208,0,1075.0,,0,225000.00,171342.0,,0.0,Credit card,-20,0.545455
2,215354,5714464,Active,-203,0,528.0,,0,464323.50,,,0.0,Consumer credit,-16,0.545455
3,215354,5714465,Active,-203,0,,,0,90000.00,,,0.0,Credit card,-16,0.545455
4,215354,5714466,Active,-629,0,1197.0,77674.5,0,2700000.00,,,0.0,Consumer credit,-21,0.545455
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1716423,259355,5057750,Active,-44,0,-30.0,0.0,0,11250.00,11250.0,0.0,0.0,Microloan,-19,1.000000
1716424,100044,5057754,Closed,-2648,0,-2433.0,5476.5,0,38130.84,0.0,0.0,0.0,Consumer credit,-2493,0.545455
1716425,100044,5057762,Closed,-1809,0,-1628.0,,0,15570.00,,,0.0,Consumer credit,-967,0.545455
1716426,246829,5057770,Closed,-1878,0,-1513.0,,0,36000.00,0.0,0.0,0.0,Consumer credit,-1508,0.258065


In [219]:
bureau_data['AMT_CREDIT_MAX_OVERDUE'].fillna(0, inplace = True)

In [220]:
bureau_data['AMT_CREDIT_SUM_LIMIT'].isnull().value_counts()

False    1124648
True      591780
Name: AMT_CREDIT_SUM_LIMIT, dtype: int64

In [221]:
bureau_data['AMT_CREDIT_SUM'].fillna(0, inplace = True)
bureau_data['AMT_CREDIT_SUM_LIMIT'].fillna(0, inplace = True)

In [222]:
bureau_data['AMT_CREDIT_SUM_DEBT'].fillna(bureau_data['AMT_CREDIT_SUM'], inplace = True)

In [223]:
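difference = bureau_data['AMT_CREDIT_SUM'] - bureau_data['AMT_CREDIT_SUM_DEBT']
# note: AMT_CREDIT_SUM_LIMIT was already zero-filled two cells above, so no NaN remains for this fill to replace
bureau_data['AMT_CREDIT_SUM_LIMIT'].fillna(difference, inplace = True)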
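
The order of successive `fillna` calls matters, since a later fill only sees the gaps that earlier fills left behind. The sketch below compares two orders on a made-up two-row frame. It does not claim which order the analysis intends; it only shows that the results differ.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "AMT_CREDIT_SUM":       [1000.0, 500.0],
    "AMT_CREDIT_SUM_DEBT":  [400.0, np.nan],
    "AMT_CREDIT_SUM_LIMIT": [np.nan, np.nan],
})

# Missing debt falls back to the full credit amount.
toy["AMT_CREDIT_SUM_DEBT"] = toy["AMT_CREDIT_SUM_DEBT"].fillna(toy["AMT_CREDIT_SUM"])
difference = toy["AMT_CREDIT_SUM"] - toy["AMT_CREDIT_SUM_DEBT"]

# Order A: zero-fill first, then fill with the difference (the second fill finds no NaN).
zero_first = toy["AMT_CREDIT_SUM_LIMIT"].fillna(0).fillna(difference)

# Order B: fill with the difference first, then zero-fill whatever is left.
diff_first = toy["AMT_CREDIT_SUM_LIMIT"].fillna(difference).fillna(0)

print(zero_first.tolist())
print(diff_first.tolist())
```

With order A every missing limit becomes 0; with order B the first row receives the unused credit (1000 - 400) instead.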

In [224]:
bureau_active_loans = pd.DataFrame(bureau_data[['SK_ID_CURR','ACTIVE_LOANS_PERCENTAGE']], columns = ['SK_ID_CURR','ACTIVE_LOANS_PERCENTAGE'])

In [225]:
bureau_active_loans

Unnamed: 0,SK_ID_CURR,ACTIVE_LOANS_PERCENTAGE
0,215354,0.545455
1,215354,0.545455
2,215354,0.545455
3,215354,0.545455
4,215354,0.545455
...,...,...
1716423,259355,1.000000
1716424,100044,0.545455
1716425,100044,0.545455
1716426,246829,0.258065


In [226]:
train = datasets['application_train']
train

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,,,,,,
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,,,,,,
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [227]:
train.drop(['DAYS_LAST_PHONE_CHANGE','OBS_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE'], axis = 1, inplace = True)

In [228]:
train = train.merge(bureau_active_loans, on = 'SK_ID_CURR', how = 'left')
train

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,ACTIVE_LOANS_PERCENTAGE
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.250000
1,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.250000
2,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.250000
3,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.250000
4,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1509340,456255,0,Cash loans,F,N,N,0,157500.0,675000.0,49117.5,...,0,0,0,0.0,0.0,0.0,2.0,0.0,1.0,0.454545
1509341,456255,0,Cash loans,F,N,N,0,157500.0,675000.0,49117.5,...,0,0,0,0.0,0.0,0.0,2.0,0.0,1.0,0.454545
1509342,456255,0,Cash loans,F,N,N,0,157500.0,675000.0,49117.5,...,0,0,0,0.0,0.0,0.0,2.0,0.0,1.0,0.454545
1509343,456255,0,Cash loans,F,N,N,0,157500.0,675000.0,49117.5,...,0,0,0,0.0,0.0,0.0,2.0,0.0,1.0,0.454545
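
The merged table above has about 1.5 million rows, compared with 307,510 rows in `application_train`, because `bureau_active_loans` still holds one row per bureau record and a left merge repeats each applicant once per matching row. A minimal sketch on made-up data of how deduplicating the key first keeps one row per applicant (the toy frames are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the application table and the per-record bureau feature.
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [1, 0, 0]})
bureau_feat = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2],
    "ACTIVE_LOANS_PERCENTAGE": [0.25, 0.25, 0.25, 0.5, 0.5],
})

# Merging against the per-record table repeats each applicant once per bureau row.
fanned_out = apps.merge(bureau_feat, on="SK_ID_CURR", how="left")

# Dropping duplicate keys first keeps one row per applicant;
# validate="one_to_one" makes pandas raise if either key is still duplicated.
per_applicant = bureau_feat.drop_duplicates(subset="SK_ID_CURR")
one_to_one = apps.merge(per_applicant, on="SK_ID_CURR", how="left",
                        validate="one_to_one")

print(len(apps), len(fanned_out), len(one_to_one))
```

Applicants without any bureau record (applicant 3 here) keep their row and receive a missing value for the feature.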


In [229]:
expLog = pd.DataFrame(columns=["ExpID", "Accuracy", "Valid Acc", "AUC", "Valid AUC","Comment (Not Mandatory)"])

application_train = datasets['application_train']

# Subset the application_train dataset to reduce run time.
# Split the rows by TARGET (1 = applicant had payment difficulties, 0 = all other cases)
# so that the sample keeps the original class proportions.
app1 = application_train[application_train['TARGET']==1] 
app0 = application_train[application_train['TARGET']==0]

app1_len = app1.shape[0]
app0_len = app0.shape[0]
n = app1_len + app0_len

app1_len_proportion = app1_len/n
app0__len_proportion = app0_len/n

#subset rows from data
app1_sample = app1.sample(n=int(10000*app1_len_proportion))
app0_sample = app0.sample(n=int(10000*app0__len_proportion))

application_train = pd.concat([app1_sample,app0_sample]) # End Subset 

application_train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
205786,338531,1,Cash loans,M,N,Y,0,103500.0,124438.5,9184.5,...,0,0,0,0,,,,,,
294789,441520,1,Cash loans,M,Y,N,1,540000.0,576072.0,20821.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
106147,223160,1,Cash loans,F,Y,Y,0,81000.0,396000.0,28813.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
240789,378824,1,Cash loans,F,N,Y,0,112500.0,295668.0,14508.0,...,0,0,0,0,,,,,,
17239,120109,1,Cash loans,F,N,Y,0,198000.0,467257.5,17743.5,...,0,0,0,0,0.0,0.0,0.0,0.0,2.0,2.0
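
The proportional subsample above can also be drawn in one step by sampling the same fraction from each `TARGET` group. This is a sketch of an alternative, not the method used in this notebook; the toy table, the 8% positive rate, and the `random_state` value are assumptions for illustration.

```python
import pandas as pd

# Toy table with an 8% positive rate (an assumed ratio for illustration).
toy = pd.DataFrame({"SK_ID_CURR": range(1000),
                    "TARGET": [1] * 80 + [0] * 920})

# Sample the same fraction from each TARGET group; random_state makes it repeatable.
subset = toy.groupby("TARGET", group_keys=False).sample(frac=0.1, random_state=42)

print(len(subset), subset["TARGET"].mean())
```

Fixing `random_state` means the same rows are drawn on every run, which keeps later results comparable between runs.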


In [230]:
X = application_train.loc[:, application_train.columns != 'TARGET']
y = application_train['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#Create full set of categorical features
cat_features = X.select_dtypes(include=['object']).columns.tolist()

#Numeric Features
num_features = X.select_dtypes(include=np.number).columns.tolist()

# Check if all columns selected
if X.shape[1] == len(num_features) + len(cat_features): print("All columns have been selected")
else: print("All columns have not been selected, re-evaluate selection criteria")

All columns have been selected


In [231]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

In [232]:
from sklearn.compose import ColumnTransformer
cat_pipe = Pipeline([
        ('selector', DataFrameSelector(cat_features)),
        ('imputer', SimpleImputer(strategy='most_frequent')),  # fill_value is only used with strategy='constant'
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))]) # ignore categories from validation/test data that do NOT occur in the training set; scikit-learn >= 1.2 renames sparse to sparse_output

num_pipe = Pipeline([
        ('selector', DataFrameSelector(num_features)),
        ('imputer', SimpleImputer(strategy='mean')),
        ('std_scaler', StandardScaler())])

data_prep_pipe = ColumnTransformer(transformers= [
        ("num_pipeline", num_pipe, num_features),
        ("cat_pipeline", cat_pipe, cat_features)],
         remainder='drop',
         n_jobs=-1)
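
`ColumnTransformer` already routes the listed columns to each sub-pipeline, so a preprocessing step of this kind can also be written without a separate selector class. The sketch below shows that variant on a made-up four-row frame. The toy data and the use of `sparse_threshold=0` to request a dense result are choices made for this example, not part of the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, np.nan, 400.0],
    "CODE_GENDER": ["M", "F", "F", np.nan],
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("std_scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# ColumnTransformer selects the named columns for each sub-pipeline itself;
# sparse_threshold=0 asks for a dense array as the combined output.
prep = ColumnTransformer(
    transformers=[
        ("num_pipeline", num_pipe, ["AMT_CREDIT"]),
        ("cat_pipeline", cat_pipe, ["CODE_GENDER"]),
    ],
    remainder="drop",
    sparse_threshold=0,
)

features = prep.fit_transform(toy)
print(features.shape)
```

The result has one scaled numeric column plus one indicator column per category seen during fitting.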

In [233]:
from sklearn.neural_network import MLPClassifier
mlp_pipe = Pipeline([
        ("data_prep", data_prep_pipe),
        ('MLP', MLPClassifier(random_state=42))])

params = {'MLP__hidden_layer_sizes':[1],
          'MLP__alpha':[0.1,0.5],
          'MLP__activation':['identity','tanh','relu']}
slp_gridsearch = GridSearchCV(mlp_pipe, param_grid = params, cv = 3, scoring='accuracy', n_jobs = -1)

In [234]:
t0 = time()
slp_gridsearch.fit(X_train, y_train)
print("Best parameters:")
print(slp_gridsearch.best_params_)
print("Grid scores on development set:")
means = slp_gridsearch.cv_results_['mean_test_score']
stds = slp_gridsearch.cv_results_['std_test_score']
# use a distinct loop variable so the `params` grid dict is not overwritten
for mean, std, p in zip(means, stds, slp_gridsearch.cv_results_['params']): print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, p))
scoring='accuracy'
print("Best %s score: %0.3f" %(scoring, slp_gridsearch.best_score_))
print("Best parameters:")
best_parameters = slp_gridsearch.best_estimator_.get_params()
for param_name in sorted(params.keys()): print("\t%s: %r" % (param_name, best_parameters[param_name]))
         
sortedGridSearchResults = sorted(zip(slp_gridsearch.cv_results_["params"], slp_gridsearch.cv_results_["mean_test_score"]), key=lambda x: x[1], reverse=True)
y_pred_train = slp_gridsearch.predict(X_train)
print(f"Training accuracy by best pipeline is: {accuracy_score(y_pred_train, y_train):.3f}")
y_pred_test = slp_gridsearch.predict(X_test)
print(f"Testing accuracy by best pipeline is: {accuracy_score(y_pred_test, y_test):.3f}")

Best parameters:
{'MLP__activation': 'identity', 'MLP__alpha': 0.1, 'MLP__hidden_layer_sizes': 1}
Grid scores on development set:
0.922 (+/-0.003) for {'MLP__activation': 'identity', 'MLP__alpha': 0.1, 'MLP__hidden_layer_sizes': 1}
0.922 (+/-0.003) for {'MLP__activation': 'identity', 'MLP__alpha': 0.5, 'MLP__hidden_layer_sizes': 1}
0.919 (+/-0.007) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.1, 'MLP__hidden_layer_sizes': 1}
0.921 (+/-0.004) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.5, 'MLP__hidden_layer_sizes': 1}
0.920 (+/-0.003) for {'MLP__activation': 'relu', 'MLP__alpha': 0.1, 'MLP__hidden_layer_sizes': 1}
0.922 (+/-0.003) for {'MLP__activation': 'relu', 'MLP__alpha': 0.5, 'MLP__hidden_layer_sizes': 1}
Best accuracy score: 0.922
Best parameters:
	MLP__activation: 'identity'
	MLP__alpha: 0.1
	MLP__hidden_layer_sizes: 1
Training accuracy by best pipeline is: 0.924
Testing accuracy by best pipeline is: 0.908


In [235]:
y_test_pred = slp_gridsearch.best_estimator_.predict(X_test)
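print("Confusion matrix (valid data)")
# note: called as confusion_matrix(y_pred, y_true), so rows are predicted labels and columns are actual labels
print(confusion_matrix(y_test_pred, y_test))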
print("------------------")
print(f"Overall accuracy (valid data): {np.round(accuracy_score(y_test, y_test_pred), 3)*100}%")
print("------------------")
print(f"AUROC (valid data): {np.round(roc_auc_score(y_test, y_test_pred), 3)*100}%")

Confusion matrix (valid data)
[[2718  265]
 [  11    6]]
------------------
Overall accuracy (valid data): 90.8%
------------------
AUROC (valid data): 50.9%
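
The AUROC printed above is computed from hard 0/1 predictions. In that case the ROC curve has a single interior point, and the area under it equals the mean of the true-positive and true-negative rates (the balanced accuracy). The arithmetic below reproduces the reported figure from the confusion matrix, reading its rows as predicted labels and its columns as actual labels, which is how the call `confusion_matrix(y_test_pred, y_test)` lays it out.

```python
# Counts read from the confusion matrix printed above
# (rows = predicted labels, columns = actual labels).
tn, fn = 2718, 265   # predicted 0: actual 0, actual 1
fp, tp = 11, 6       # predicted 1: actual 0, actual 1

tpr = tp / (tp + fn)          # share of actual defaulters that were flagged
tnr = tn / (tn + fp)          # share of actual non-defaulters that were cleared
accuracy = (tp + tn) / (tp + tn + fp + fn)

# With 0/1 predictions, the area under the ROC curve is the mean of TPR and TNR.
auc_from_labels = (tpr + tnr) / 2

print(round(accuracy, 3), round(tpr, 3), round(auc_from_labels, 3))
```

The model flags only about 2% of the actual defaulters, which is why the label-based AUROC sits near 0.5 while accuracy stays near 91%. Passing `predict_proba(X_test)[:, 1]` to `roc_auc_score` instead would score the ranking across all thresholds.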


In [236]:
expLog.loc[len(expLog)] = ["SLP"] + list(np.round(
               [accuracy_score(y_train, slp_gridsearch.best_estimator_.predict(X_train)),
                accuracy_score(y_test, y_test_pred), 
                roc_auc_score(y_train,  slp_gridsearch.best_estimator_.predict_proba(X_train)[:, 1]),
                roc_auc_score(y_test, y_test_pred)], 3)*100) + [
                "SLP"]
expLog

Unnamed: 0,ExpID,Accuracy,Valid Acc,AUC,Valid AUC,Comment (Not Mandatory)
0,SLP,92.4,90.8,77.7,50.9,SLP


In [237]:
from sklearn.neural_network import MLPClassifier
mlp_pipe = Pipeline([
        ("data_prep", data_prep_pipe),
        ('MLP', MLPClassifier(random_state=42))])

params = {'MLP__hidden_layer_sizes':[5,10],
          'MLP__alpha':[0.1,0.5],
          'MLP__activation':['identity','tanh','relu']}
mlp_gridsearch = GridSearchCV(mlp_pipe, param_grid = params, cv = 3, scoring='accuracy', n_jobs = -1)

In [238]:
t0 = time()
mlp_gridsearch.fit(X_train, y_train)
print("Best parameters:")
print(mlp_gridsearch.best_params_)
print("Grid scores on development set:")
means = mlp_gridsearch.cv_results_['mean_test_score']
stds = mlp_gridsearch.cv_results_['std_test_score']
# use a distinct loop variable so the `params` grid dict is not overwritten
for mean, std, p in zip(means, stds, mlp_gridsearch.cv_results_['params']): print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, p))
scoring='accuracy'
print("Best %s score: %0.3f" %(scoring, mlp_gridsearch.best_score_))
print("Best parameters:")
best_parameters = mlp_gridsearch.best_estimator_.get_params()
for param_name in sorted(params.keys()): print("\t%s: %r" % (param_name, best_parameters[param_name]))
         
sortedGridSearchResults = sorted(zip(mlp_gridsearch.cv_results_["params"], mlp_gridsearch.cv_results_["mean_test_score"]), key=lambda x: x[1], reverse=True)
y_pred_train = mlp_gridsearch.predict(X_train)
print(f"Training accuracy by best pipeline is: {accuracy_score(y_pred_train, y_train):.3f}")
y_pred_test = mlp_gridsearch.predict(X_test)
print(f"Testing accuracy by best pipeline is: {accuracy_score(y_pred_test, y_test):.3f}")

Best parameters:
{'MLP__activation': 'identity', 'MLP__alpha': 0.1, 'MLP__hidden_layer_sizes': 5}
Grid scores on development set:
0.922 (+/-0.005) for {'MLP__activation': 'identity', 'MLP__alpha': 0.1, 'MLP__hidden_layer_sizes': 5}
0.922 (+/-0.004) for {'MLP__activation': 'identity', 'MLP__alpha': 0.1, 'MLP__hidden_layer_sizes': 10}
0.922 (+/-0.003) for {'MLP__activation': 'identity', 'MLP__alpha': 0.5, 'MLP__hidden_layer_sizes': 5}
0.922 (+/-0.004) for {'MLP__activation': 'identity', 'MLP__alpha': 0.5, 'MLP__hidden_layer_sizes': 10}
0.914 (+/-0.002) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.1, 'MLP__hidden_layer_sizes': 5}
0.909 (+/-0.013) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.1, 'MLP__hidden_layer_sizes': 10}
0.919 (+/-0.004) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.5, 'MLP__hidden_layer_sizes': 5}
0.920 (+/-0.004) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.5, 'MLP__hidden_layer_sizes': 10}
0.915 (+/-0.005) for {'MLP__activation': 'relu', 'MLP__alpha': 0.1

In [239]:
y_test_pred = mlp_gridsearch.best_estimator_.predict(X_test)
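print("Confusion matrix (valid data)")
# note: called as confusion_matrix(y_pred, y_true), so rows are predicted labels and columns are actual labels
print(confusion_matrix(y_test_pred, y_test))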
print("------------------")
print(f"Overall accuracy (valid data): {np.round(accuracy_score(y_test, y_test_pred), 3)*100}%")
print("------------------")
print(f"AUROC (valid data): {np.round(roc_auc_score(y_test, y_test_pred), 3)*100}%")


Confusion matrix (valid data)
[[2718  264]
 [  11    7]]
------------------
Overall accuracy (valid data): 90.8%
------------------
AUROC (valid data): 51.1%


In [240]:
expLog.loc[len(expLog)] = ["MLP"] + list(np.round(
               [accuracy_score(y_train, mlp_gridsearch.best_estimator_.predict(X_train)),
                accuracy_score(y_test, y_test_pred), 
                roc_auc_score(y_train,  mlp_gridsearch.best_estimator_.predict_proba(X_train)[:, 1]),
                roc_auc_score(y_test, y_test_pred)], 3)*100) + [
                "MLP"]
expLog

Unnamed: 0,ExpID,Accuracy,Valid Acc,AUC,Valid AUC,Comment (Not Mandatory)
0,SLP,92.4,90.8,77.7,50.9,SLP
1,MLP,92.5,90.8,78.4,51.1,MLP


# Result and Discussion

The experiment log table above reports the accuracy, AUC, and log loss of the hyperparameter-tuned models: Logistic Regression, Decision Tree, Random Forest, Lasso Regression, and Ridge Regression. For the tuned Decision Tree, the train (92.19) and test (92.13) accuracy increased significantly compared with its baseline, which indicates that it performs well on the provided dataset. Its log loss of 0.24 is on the lower side and dropped significantly compared with the baseline, and its AUC is 0.53. The algorithm therefore performs well for the given set of input features, and its overall accuracy rose by a comparatively large margin to about 92%.

Random Forest and Logistic Regression have approximately the same train accuracy, test accuracy, and log loss as their baselines, so hyperparameter tuning brought no significant improvement for them. The tuned Decision Tree remains the best-fit algorithm, beating the others by a very small margin on all criteria. We observed an increase of 6 percent in test accuracy and 6 percent in overall accuracy, and its lower log loss is the reason it beats the other models.

For Lasso and Ridge Regression, the AUC increased to 0.75, so both models appear to predict the target better than all the other models; their accuracy, however, decreased dramatically. Despite the high AUC, Lasso and Ridge are therefore not the models to choose, because they fail to perform adequately on the HCDR dataset.



![image-2.png](attachment:image-2.png)

# Conclusion

The goal of the HCDR project is to predict the repayment capacity of a financially underserved population. The project matters because both the lender and the borrower want reliable estimates. Home Credit's real-time ML pipelines acquire data from the data sources via APIs, run EDA, and fit the data to the model to generate scores, which allows the company to present loan offers to its customers with the greatest amount and APR.

Because NPA (non-performing assets) is expected to stay below 5% for the firm to remain profitable, risk analysis becomes extremely important. Credit history is an indicator of a user's trustworthiness, built from parameters such as the average, minimum, and maximum balances that the user maintains, the reported bureau scores, and salary. Repayment patterns can be analysed using the defaults and timely repayments that the user has made in the past. Alternative data covers other criteria such as location information, social media data, and calling/SMS data. As part of this project, we create machine learning pipelines, perform exploratory data analysis on the datasets provided by Kaggle, and evaluate the models using a variety of evaluation measures before deploying one.

Phase 3 involved estimating several models. Data imputation and feature selection were carried out: the missing values of certain features were filled in, and pertinent features were chosen based on our prior understanding. We trained and assessed several models, including Random Forest, Decision Tree, Logistic Regression, Lasso Regression, and Ridge Regression, to find the best one, and tuned their hyperparameters using GridSearch.

We conclude from phase 3 that the Lasso, Ridge, and Logistic Regression models are unable to outperform the other hyperparameter-tuned models. The Decision Tree model performs best of all the models. In phase 4 we plan to implement an MLP using PyTorch.


# Bibliography 


* Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition 
    -	by Aurélien Géron
* Lab-End_to_end_Machine_Learning_Project 
    -	by James Shahnan
