---
**Outline**

1. Review - Initial Characteristic Analysis
2. Binning Numerical Predictors
3. WoE and IV
4. Check Logical Trend
5. Test of Independence

In [1]:
# Load data manipulation package
import numpy as np
import pandas as pd

# Load data visualization package
import matplotlib.pyplot as plt
import seaborn as sns

# <font color='blue'>Review - Initial Characteristic Analysis</font>

## **1. Characteristic Binning**
---

Bin each numerical characteristic into about 20 equal-sized groups, so that each bin holds roughly 5 percent of total accounts (Naeem Siddiqi).
- The greater the number of bins, the greater the complexity and potential for overfitting.
- Too few bins can lose valuable information.
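
The equal-frequency binning described above can be sketched with `pandas.qcut`. This is a minimal illustration on synthetic data (the `income` series is made up for the example and is not the notebook's dataset), not necessarily the binning routine used later.

```python
import numpy as np
import pandas as pd

# Synthetic, continuous data for illustration only (assumption, not the real dataset)
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=11, sigma=0.5, size=10_000), name="income")

# Cut into 20 quantile-based bins; duplicates="drop" merges bins whose edges coincide
income_bin = pd.qcut(income, q=20, duplicates="drop")

# Share of records per bin; each should be close to 5%
bin_share = income_bin.value_counts(normalize=True).sort_index()
print(bin_share.round(3))
```

With heavily tied values (for example, many identical incomes), some quantile edges coincide and fewer than 20 bins may result.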

## **2. WoE and IV**

**2.1 Weight of Evidence**
- Measure the strength or predictive power of each attribute in a characteristic.
- Assess the relative risk of different attributes for a characteristic.

$$
\begin{align*}
W_i &= \ln \left ( \% \text{Good}/\% \text{Bad} \right ) \\
W_i &= \ln \left ( \left ( \frac{N_i}{\sum N } \right )/ \left ( \frac{P_i}{\sum P} \right ) \right ) \\
\end{align*}
$$

where:
- $N$ = non-occurrence/negative/goods
- $P$ = occurrence/positive/bads
- $i$ =  index of the attribute being evaluated

Interpretation:
- $W_i = 0$ if $\% \text{Good} = \% \text{Bad}$
- $W_i < 0$ if $\% \text{Good} < \% \text{Bad}$
- $W_i > 0$ if $\% \text{Good} > \% \text{Bad}$
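
As a worked illustration with hypothetical counts (not taken from the dataset): suppose there are 1,000 goods and 100 bads in total, and attribute $i$ holds 200 goods and 50 bads. Then

$$
\begin{align*}
\% \text{Good}_i &= 200/1000 = 0.2 \\
\% \text{Bad}_i &= 50/100 = 0.5 \\
W_i &= \ln \left ( 0.2/0.5 \right ) = \ln(0.4) \approx -0.916 \\
\end{align*}
$$

The negative sign reflects that this attribute holds a larger share of the bads than of the goods.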

**2.2 Information Value**
- Measure the total strength of a characteristic.

$$
\begin{align*}
IV &= \sum_{i=1}^{n} ( \% \text{Good} - \% \text{Bad} ) \times W_i \\
IV &= \sum_{i=1}^{n}\left [ \left ( \frac{N_i}{\sum N} - \frac{P_i}{\sum P} \right ) \times W_i \right ] \\
\end{align*}
$$

where:
- $N$ = non-occurrence/negative/goods
- $P$ = occurrence/positive/bads
- $i$ = index of the attribute being evaluated
- $n$ = the total number of attributes
- $W$ = WOE of the attribute

The rule of thumb:
- Less than 0.02 : generally unpredictive
- 0.02 to 0.1 : weak
- 0.1 to 0.3 : medium
- 0.3+ : strong
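
The WoE and IV formulas above can be sketched in a few lines of pandas. The helper name `woe_iv`, the toy data, and its column names are assumptions made for illustration; the sketch assumes the target is coded `0` for goods and `1` for bads.

```python
import numpy as np
import pandas as pd

def woe_iv(df, characteristic, target):
    """Return a per-attribute WoE table and the total IV (illustrative helper)."""
    # Count goods (target == 0) and bads (target == 1) for each attribute
    counts = pd.crosstab(df[characteristic], df[target])
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    counts.columns = ["good", "bad"]

    # Distribution of goods and of bads across the attributes
    pct_good = counts["good"] / counts["good"].sum()
    pct_bad = counts["bad"] / counts["bad"].sum()

    table = counts.assign(pct_good=pct_good, pct_bad=pct_bad)
    table["woe"] = np.log(table["pct_good"] / table["pct_bad"])
    table["iv_contrib"] = (table["pct_good"] - table["pct_bad"]) * table["woe"]
    return table, table["iv_contrib"].sum()

# Toy data: attribute "A" has 30 goods / 20 bads, attribute "B" has 45 goods / 5 bads
toy = pd.DataFrame({
    "attribute": ["A"] * 50 + ["B"] * 50,
    "is_bad": [0] * 30 + [1] * 20 + [0] * 45 + [1] * 5,
})
table, iv = woe_iv(toy, "attribute", "is_bad")
print(table)
print("IV:", round(iv, 4))
```

An attribute with zero goods or zero bads yields an infinite WoE in this sketch; in practice such attributes are usually merged with a neighbouring bin or handled with a small adjustment.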


## **3. Check Logical Trend**
---
- Logical trend
- Operational sense
- Business sense

Why do we need the business-based approach?
- To ensure that the final weightings and scores after regression make sense.
- To ensure buy-in from internal end users (risk managers, adjudicators, etc.).
- To confirm business experience, thus going one step further than a purely statistical evaluation.
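
One possible statistical screen, assuming "logical trend" is read as WoE moving in a single direction across ordered bins, is a monotonicity check. The WoE values and the helper `has_monotonic_trend` below are hypothetical; the check complements the business review and does not replace it.

```python
import pandas as pd

# Hypothetical WoE values for the ordered bins of some numerical characteristic
woe_by_bin = pd.Series(
    [-0.62, -0.31, -0.05, 0.22, 0.48],
    index=["bin_1", "bin_2", "bin_3", "bin_4", "bin_5"],
)

def has_monotonic_trend(woe):
    """True when WoE moves in one direction across the ordered bins."""
    return bool(woe.is_monotonic_increasing or woe.is_monotonic_decreasing)

print(has_monotonic_trend(woe_by_bin))
```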

## **4. Test of Independence**
---
Will be discussed in the next notebook.

# **1. Data Preparation**

## **1.1 Load Data**

The sample consists of demographic, bureau, and financial information.

Note that we do not derive the default (bad) status ourselves here; the dataset already provides a binary response variable:

- `loan_status`
  - `loan_status = 0` for a non-default loan.
  - `loan_status = 1` for a default loan.

The potential predictors for predicting the response variable are:

1. `person_age` : age of the debtor.
2. `person_income` : annual income of the debtor.
3. `person_home_ownership`
  - `RENT`
  - `MORTGAGE`
  - `OWN`
  - `OTHER`
4. `person_emp_length` : employment length of debtor (in years).
5. `loan_intent` : purpose of the loan.
  - `EDUCATION`
  - `MEDICAL`
  - `VENTURE`
  - `PERSONAL`
  - `DEBTCONSOLIDATION`
  - `HOMEIMPROVEMENT`
6. `loan_grade`
7. `loan_amnt` : amount of the loan.
8. `loan_int_rate` : interest rate of the loan.
9. `loan_percent_income` : loan amount as a proportion of the debtor's income.
10. `cb_person_default_on_file` : historical default.
  - `N` : the debtor does not have any history of defaults.
  - `Y` : the debtor has a history of defaults on their credit file.
11. `cb_person_cred_hist_length` : length of the credit history.

In [7]:
# Import dataset from csv file
data = pd.read_csv('credit_risk_dataset.csv')

# Table check
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| person_age | 22 | 21 | 25 | 23 | 24 |
| person_income | 59000 | 9600 | 9600 | 65500 | 54400 |
| person_home_ownership | RENT | OWN | MORTGAGE | RENT | RENT |
| person_emp_length | 123.0 | 5.0 | 1.0 | 4.0 | 8.0 |
| loan_intent | PERSONAL | EDUCATION | MEDICAL | MEDICAL | MEDICAL |
| loan_grade | D | B | C | C | C |
| loan_amnt | 35000 | 1000 | 5500 | 35000 | 35000 |
| loan_int_rate | 16.02 | 11.14 | 12.87 | 15.23 | 14.27 |
| loan_status | 1 | 0 | 1 | 1 | 1 |
| loan_percent_income | 0.59 | 0.1 | 0.57 | 0.53 | 0.55 |


In [3]:
# Check the data shape
data.shape

(32581, 12)

Our sample contains 12 variables from 32,581 credit records.
- 1 response variable, `loan_status`,
- and 11 potential predictors/characteristics.

Before modeling, split the data so that a held-out set is available for model validation.

For a classification problem, check the proportion of the response variable first to decide on the splitting strategy.

In [8]:
# Define response variable
response_variable = 'loan_status'

# Check the proportion of response variable
data[response_variable].value_counts(normalize = True)

0    0.781836
1    0.218164
Name: loan_status, dtype: float64

The response variable, `loan_status`, is imbalanced (a ratio of roughly 78:22).

To keep the same ratio in the training and testing sets, use a stratified split based on the response variable, `loan_status`.

## **1.2 Sample Splitting**

First, define the predictors (X) and the response (y).

In [9]:
# Split response and predictors
y = data[response_variable]
y

0        1
1        0
2        1
3        1
4        1
        ..
32576    0
32577    0
32578    1
32579    0
32580    0
Name: loan_status, Length: 32581, dtype: int64

In [10]:
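# Drop the response column to obtain the predictors
# (`columns` already identifies the axis, so `axis = 1` is not needed)
X = data.drop(columns = [response_variable])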
X

| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0.10 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 0.55 | Y | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32576 | 57 | 53000 | MORTGAGE | 1.0 | PERSONAL | C | 5800 | 13.16 | 0.11 | N | 30 |
| 32577 | 54 | 120000 | MORTGAGE | 4.0 | PERSONAL | A | 17625 | 7.49 | 0.15 | N | 19 |
| 32578 | 65 | 76000 | RENT | 3.0 | HOMEIMPROVEMENT | B | 35000 | 10.99 | 0.46 | N | 28 |
| 32579 | 56 | 150000 | MORTGAGE | 5.0 | PERSONAL | B | 15000 | 11.48 | 0.10 | N | 26 |


In [11]:
# Validate the splitting
print('y shape :', y.shape)
print('X shape :', X.shape)

y shape : (32581,)
X shape : (32581, 11)


In [12]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify = y,
                                                    test_size = 0.3,
                                                    random_state = 42)

# Validate splitting
print('X train shape :', X_train.shape)
print('y train shape :', y_train.shape)
print('X test shape  :', X_test.shape)
print('y test shape  :', y_test.shape)

X train shape : (22806, 11)
y train shape : (22806,)
X test shape  : (9775, 11)
y test shape  : (9775,)


Check the proportion of the response `y` in the training and testing sets.

In [13]:
y_train.value_counts(normalize = True)

0    0.781856
1    0.218144
Name: loan_status, dtype: float64

In [14]:
y_test.value_counts(normalize = True)

0    0.78179
1    0.21821
Name: loan_status, dtype: float64

# **2. Data Exploration**

## **2.1 Exploratory Data Analysis (EDA)**

- To make a model that predicts well on unseen data, we must prevent leakage of test set information.
- Thus, we explore only the **training set**.

In [15]:
# Concatenate X_train and y_train as data_train
data_train = pd.concat((X_train, y_train),
                       axis = 1)

# Validate data_train
print('Train data shape:', data_train.shape)
data_train.head()

Train data shape: (22806, 12)


| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11491 | 26 | 62000 | RENT | 1.0 | DEBTCONSOLIDATION | B | 10000 | 11.26 | 0.16 | N | 2 | 0 |
| 3890 | 23 | 39000 | MORTGAGE | 3.0 | EDUCATION | C | 5000 | 12.98 | 0.13 | N | 4 | 0 |
| 17344 | 24 | 35000 | RENT | 1.0 | DEBTCONSOLIDATION | A | 12000 | 6.54 | 0.34 | N | 2 | 1 |
| 13023 | 24 | 86000 | RENT | 1.0 | HOMEIMPROVEMENT | B | 12000 | 10.65 | 0.14 | N | 3 | 0 |
| 29565 | 42 | 38400 | RENT | 4.0 | MEDICAL | B | 13000 | NaN | 0.34 | N | 11 | 1 |


What do we do in EDA?
- Check data integrity.
- Check for any insight in the data: distribution, proportion, outliers, missing values, etc.
- Make a plan for data pre-processing.

### Check for Missing Values

In [16]:
data_train.isna().sum()

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              639
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 2200
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
loan_status                      0
dtype: int64

In [17]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22806 entries, 11491 to 10456
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  22806 non-null  int64  
 1   person_income               22806 non-null  int64  
 2   person_home_ownership       22806 non-null  object 
 3   person_emp_length           22167 non-null  float64
 4   loan_intent                 22806 non-null  object 
 5   loan_grade                  22806 non-null  object 
 6   loan_amnt                   22806 non-null  int64  
 7   loan_int_rate               20606 non-null  float64
 8   loan_percent_income         22806 non-null  float64
 9   cb_person_default_on_file   22806 non-null  object 
 10  cb_person_cred_hist_length  22806 non-null  int64  
 11  loan_status                 22806 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 2.3+ MB


There are 639 missing values in `person_emp_length` and 2,200 missing values in `loan_int_rate`.

Summary
- There are missing values in `person_emp_length` and `loan_int_rate`, both numerical (float) variables.
- We need to explore these variables to decide how to handle their missing values.
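
One way to start that exploration is to compare the bad rate of records where a variable is missing against records where it is observed. The sketch below uses toy data and a hypothetical helper, `bad_rate_by_missingness`. A clear difference in bad rate would suggest treating missing values as their own group, while a similar rate would make imputation easier to justify.

```python
import numpy as np
import pandas as pd

def bad_rate_by_missingness(df, column, target):
    """Compare record count and bad rate for missing vs. observed rows of `column`."""
    is_missing = df[column].isna().map({True: "missing", False: "observed"})
    return df.groupby(is_missing)[target].agg(n="count", bad_rate="mean")

# Toy data for illustration only (assumption, not the real dataset)
toy = pd.DataFrame({
    "emp_length": [1.0, np.nan, 3.0, np.nan, 5.0, 2.0],
    "is_bad": [0, 1, 0, 1, 1, 0],
})
summary = bad_rate_by_missingness(toy, "emp_length", "is_bad")
print(summary)
```

On the notebook's data, the same helper could be called as `bad_rate_by_missingness(data_train, 'loan_int_rate', 'loan_status')`.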