<a href="https://colab.research.google.com/github/emailtosanj/ADSP_LAIEDM/blob/development/adsp_laiedm/capstone/loan_default_prediction/Capstone_Project_Reference_Notebook_Loan_Default_Prediction_Full_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Loan Default Prediction**

## **Problem Definition**

### **The Context:**

 - Why is this problem important to solve?

    The bank must comply with regulations such as the Equal Credit Opportunity Act, and automating the loan decision-making process supports that compliance.

### **The objective:**

 - What is the intended goal?
  
   *Improve efficiency and profitability by eliminating manual errors in loan application processing, particularly in assessing customer creditworthiness.*

* Reduce manual overhead by automating loan processing.


### **The key questions:**

- What are the key questions that need to be answered?

  * Does the dataset contain complete observations pertaining to loan processing?
  * Does the loan dataset cover all facets of the applications / observations?

### **The problem formulation**:

- What is it that we are trying to solve using data science?

  Identify customers' creditworthiness by having a machine learning model learn patterns from the observations in a given credit / loan dataset.


## **Data Description:**
The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.


* **BAD:** 1 = Client defaulted on loan, 0 = loan repaid

* **LOAN:** Amount of loan approved.

* **MORTDUE:** Amount due on the existing mortgage.

* **VALUE:** Current value of the property.

* **REASON:** Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, which means taking out a new loan to pay off other liabilities and consumer debts).

* **JOB:** The type of job that loan applicant has such as manager, self, etc.

* **YOJ:** Years at present job.

* **DEROG:** Number of major derogatory reports (which indicates a serious delinquency or late payments).

* **DELINQ:** Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).

* **CLAGE:** Age of the oldest credit line in months.

* **NINQ:** Number of recent credit inquiries.

* **CLNO:** Number of existing credit lines.

* **DEBTINC:** Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income). This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow.

In [1]:
#get the path to the dataset
from google.colab import drive
import os
drive.mount('/content/drive')

file_dir = '/content/drive/MyDrive/adsp_laiedm/capstone_project/capstone/loan_default_prediction'
#list the files in the dir
os.listdir(file_dir)



Mounted at /content/drive


['hmeq.csv',
 'Capstone_Project - Reference_Notebook_Loan_Default_Prediction_Low_Code.ipynb',
 'FAQs_Loan_Default_Prediction.docx',
 'Loan Default Prediction Problem Statement.pdf',
 'milestone_submission.docx',
 'live_presentation_loan_default_prediction.docx',
 'final_submission.docx',
 'Capstone_Project_Reference_Notebook_Loan_Default_Prediction_Full_Code.ipynb']

## **Import the necessary libraries and Data**

In [2]:
#libraries related to the data checks and analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [3]:
#display all columns.
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

In [4]:
#load the dataset in pandas dataframe
df = pd.read_csv(file_dir + '/hmeq.csv')
df

,BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,1,1100,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.366667,1.0,9.0,
1,1,1300,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.833333,0.0,14.0,
2,1,1500,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.466667,1.0,10.0,
3,1,1500,,,,,,,,,,,
4,0,1700,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.333333,0.0,14.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5955,0,88900,57264.0,90185.0,DebtCon,Other,16.0,0.0,0.0,221.808718,0.0,16.0,36.112347
5956,0,89000,54576.0,92937.0,DebtCon,Other,16.0,0.0,0.0,208.692070,0.0,15.0,35.859971
5957,0,89200,54045.0,92924.0,DebtCon,Other,15.0,0.0,0.0,212.279697,0.0,15.0,35.556590
5958,0,89800,50370.0,91861.0,DebtCon,Other,14.0,0.0,0.0,213.892709,0.0,16.0,34.340882


## **Data Overview**

- Reading the dataset
- Understanding the shape of the dataset
- Checking the data types
- Checking for missing values
- Checking for duplicated values

TODO write the observations on


*   Record counts
*   null / empty columns in the dataset
*   Highlight important feature variables and its stats
*   determine class imbalance - proportion of the classes





In [6]:
#info on the data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB


In [38]:
#sanity data checks -  on data points for missing data.
display(df.isnull().sum())

,0
BAD,0
LOAN,0
MORTDUE,518
VALUE,112
REASON,252
JOB,279
YOJ,515
DEROG,708
DELINQ,580
CLAGE,308
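
The missing-value counts are easier to compare when expressed as a share of all rows. The sketch below is a minimal illustration that rebuilds the non-null counts reported by `df.info()` above instead of reading the file; the names `non_null` and `missing_pct` are illustrative and not part of the notebook.

```python
import pandas as pd

# Non-null counts copied from the df.info() output above (5960 rows in total)
n_rows = 5960
non_null = pd.Series({
    'BAD': 5960, 'LOAN': 5960, 'MORTDUE': 5442, 'VALUE': 5848,
    'REASON': 5708, 'JOB': 5681, 'YOJ': 5445, 'DEROG': 5252,
    'DELINQ': 5380, 'CLAGE': 5652, 'NINQ': 5450, 'CLNO': 5738,
    'DEBTINC': 4693,
})

# Percentage of missing values per column, largest first
missing_pct = ((n_rows - non_null) / n_rows * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

On the loaded DataFrame, `df.isnull().mean() * 100` should give the same percentages directly.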


In [46]:
#check whether the dataframe has any duplicate rows
if df.duplicated().any():
  display('Loan dataset has duplicate rows')
else:
  display('Loan dataset has no duplicate rows')

'Loan dataset has no duplicate rows'

## Summary Statistics

In [32]:
#descriptive stats of categorical column

display(df.describe(exclude='number').T)

,count,unique,top,freq
REASON,5708,2,DebtCon,3928
JOB,5681,6,Other,2388


- Observations from Summary Statistics

In [31]:
#descriptive stats of numerical columns
display(df.describe(include='number').T)

,count,mean,std,min,25%,50%,75%,max
BAD,5960.0,0.199497,0.399656,0.0,0.0,0.0,0.0,1.0
LOAN,5960.0,18607.969799,11207.480417,1100.0,11100.0,16300.0,23300.0,89900.0
MORTDUE,5442.0,73760.8172,44457.609458,2063.0,46276.0,65019.0,91488.0,399550.0
VALUE,5848.0,101776.048741,57385.775334,8000.0,66075.5,89235.5,119824.25,855909.0
YOJ,5445.0,8.922268,7.573982,0.0,3.0,7.0,13.0,41.0
DEROG,5252.0,0.25457,0.846047,0.0,0.0,0.0,0.0,10.0
DELINQ,5380.0,0.449442,1.127266,0.0,0.0,0.0,0.0,15.0
CLAGE,5652.0,179.766275,85.810092,0.0,115.116702,173.466667,231.562278,1168.233561
NINQ,5450.0,1.186055,1.728675,0.0,0.0,1.0,2.0,17.0
CLNO,5738.0,21.296096,10.138933,0.0,15.0,20.0,26.0,71.0


In [34]:
#value counts of categorical data points in the dataset

for col in df.select_dtypes(include='object').columns: #df.columns gives Index which is iterable
  # display('***'*20)
  display(df[col].value_counts())

REASON,count
DebtCon,3928
HomeImp,1780


JOB,count
Other,2388
ProfExe,1276
Office,948
Mgr,767
Self,193
Sales,109


In [35]:
#value counts of class (BAD) 'client defaulted' - 1 and 'loan repaid' - 0
display(df['BAD'].value_counts())

BAD,count
0,4771
1,1189


#### TODO write the observations of summary statistics

## **Exploratory Data Analysis (EDA) and Visualization**

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Leading Questions**:
1. What is the range of values for the loan amount variable "LOAN"?
2. How does the distribution of years at present job "YOJ" vary across the dataset?
3. How many unique categories are there in the REASON variable?
4. What is the most common category in the JOB variable?
5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?
7. Is there a correlation between the value of the property and the loan default rate?
8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?
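
Several of the leading questions reduce to one-line pandas expressions. The sketch below shows one possible way to phrase them on a small made-up frame that only borrows the column names of the dataset; the values and the name `toy` are hypothetical, and the real answers have to come from `df`.

```python
import pandas as pd

# Toy data with the same column names as hmeq.csv; the values are made up
toy = pd.DataFrame({
    'BAD':    [1, 0, 0, 1, 0, 1],
    'LOAN':   [1100, 1700, 5000, 12000, 20000, 89900],
    'REASON': ['HomeImp', 'HomeImp', 'DebtCon', 'DebtCon', 'DebtCon', 'HomeImp'],
    'JOB':    ['Other', 'Office', 'Other', 'Mgr', 'Other', 'ProfExe'],
})

# Question 1: range of the loan amount
loan_range = (toy['LOAN'].min(), toy['LOAN'].max())
# Question 3: number of unique categories in REASON
n_reasons = toy['REASON'].nunique()
# Question 4: most common category in JOB
top_job = toy['JOB'].mode()[0]
# Question 5: proportion of defaulters per REASON (mean of a 0/1 column)
default_rate_by_reason = toy.groupby('REASON')['BAD'].mean()
# Question 6: median loan amount for repaid (0) vs defaulted (1) loans
loan_by_outcome = toy.groupby('BAD')['LOAN'].median()

print(loan_range, n_reasons, top_job)
print(default_rate_by_reason)
print(loan_by_outcome)
```

Whether a difference between the groups is significant would still need a statistical test or a plot; the group summaries only indicate where to look.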

### **Univariate Analysis**


### **Bivariate Analysis**

### **Multivariate Analysis**

## Treating Outliers
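
The notebook does not yet state how outliers will be treated. One common option, shown here purely as an assumption, is to cap values that fall outside 1.5 times the interquartile range. The helper `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN values stay NaN."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up values with one extreme point and one missing value
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 200.0, None])
capped = cap_outliers_iqr(values)
print(capped)
```

Capping keeps every row, which may matter here because about 20 percent of the observations belong to the default class and dropping rows could remove some of them.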

## Treating Missing Values
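
The treatment of missing values is also still open in this notebook. A simple baseline, assumed here only for illustration, fills numeric columns with the median and categorical columns with the most frequent value. The frame `toy` and its values are made up.

```python
import pandas as pd

# Made-up rows using two of the dataset's column names
toy = pd.DataFrame({
    'YOJ': [10.5, 7.0, None, 3.0],
    'JOB': ['Other', None, 'Other', 'Office'],
})

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # numeric column: fill with the median of the observed values
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        # categorical column: fill with the most frequent observed value
        imputed[col] = imputed[col].fillna(imputed[col].mode()[0])

print(imputed)
```

In a modelling workflow the fill values would usually be computed on the training split only and then reused on the test split.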

## **Important Insights from EDA**

What are the most important observations and insights from the data based on the EDA performed?

## **Model Building - Approach**
- Data preparation
- Partition the data into train and test set
- Build the model
- Fit on the train data
- Tune the model
- Test the model on test set
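
The steps above can be sketched end to end with scikit-learn. The example below runs on synthetic data from `make_classification` as a stand-in for the prepared loan features, and it uses recall only as an example metric, since the notebook has not fixed one yet; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features, with roughly an 80/20 class split
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=1
)

# Partition into train and test sets, keeping the class proportions in both
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Build the model and fit it on the train data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Test the model on the held-out set
test_recall = recall_score(y_test, model.predict(X_test))
print(f'test recall: {test_recall:.3f}')
```

Passing `stratify=y` is a design choice for imbalanced targets: it keeps the share of defaults similar in the train and test sets.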

### Logistic Regression

### Decision Tree

### **Decision Tree - Hyperparameter Tuning**

* Hyperparameter tuning is tricky in the sense that **there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model**, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.
* **Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.**
* **It is an exhaustive search** that is performed on the specific parameter values of a model.
* The parameters of the estimator/model used to apply these methods are **optimized by cross-validated grid-search** over a parameter grid.

**criterion {“gini”, “entropy”}**

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

**max_depth**

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

**min_samples_leaf**

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

You can learn about more hyperparameters at this link and try tuning them.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
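
The three hyperparameters described above can be searched together with `GridSearchCV`. This is a minimal sketch on synthetic data; the grid values, the `recall` scoring choice and the 5-fold setting are assumptions for illustration, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training data
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1
)

# Every combination of these values is tried: 2 * 3 * 3 = 18 candidates
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [1, 5, 20],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring='recall',  # assumed metric
    cv=5,              # 5-fold cross-validation per candidate
)
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

After fitting, `grid.best_estimator_` holds a tree refit on all of the supplied data with the best parameter combination.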


### **Building a Random Forest Classifier**

**Random Forest is a bagging algorithm where the base models are Decision Trees.** Bootstrap samples are drawn from the training data, and a separate decision tree is trained on each sample and makes its own prediction.

**The results from all the decision trees are combined together and the final prediction is made using voting or averaging.**
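
The combination step can be checked directly in scikit-learn, whose `RandomForestClassifier` averages the class probabilities predicted by its trees instead of counting one vote per tree. The sketch below uses synthetic data and illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared training data
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

forest = RandomForestClassifier(n_estimators=25, random_state=1)
forest.fit(X, y)

# Average the class probabilities predicted by the individual trees
tree_probas = np.mean(
    [tree.predict_proba(X) for tree in forest.estimators_], axis=0
)

# The forest's own probabilities should match this average
same = np.allclose(tree_probas, forest.predict_proba(X))
print(same)
```

The predicted class is then the one with the highest averaged probability.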

### **Random Forest Classifier Hyperparameter Tuning**

**1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):**
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

**2. Refined insights:**
- What are the most meaningful insights relevant to the problem?

**3. Proposal for the final solution design:**
- What model do you propose to be adopted? Why is this the best solution to adopt?
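
For the comparison asked for in point 1, one possible setup scores the three model families named in this notebook with the same cross-validation folds and the same metric. The data is synthetic and the choice of recall is an assumption, so the printed numbers say nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1
)

models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(random_state=1),
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=1),
}

# Mean 5-fold cross-validated recall per model
scores = {
    name: cross_val_score(model, X, y, scoring='recall', cv=5).mean()
    for name, model in models.items()
}

# Print the models from best to worst on the chosen metric
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.3f}')
```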