<div style="text-align: center; background-color: #0A6EBD; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
    FIT-HCMUS, VNU-HCM 
    <br>
    PROGRAMMING FOR DATA SCIENCE 
    <br>
    Final project 📌
</div>

<div style="text-align: center; background-color: #5A96E3; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
 Question and modelling📌
</div>


# 0. Question

- Loans matter because they help people buy homes, start businesses, or cover medical emergencies. Despite their importance, loan approval often relies on time-consuming manual assessment, which leads to occasional errors and delays. This becomes critical when borrowers need funds promptly for time-sensitive purposes.

**Question:** How can we predict a person's loan approval status with high accuracy and in little time?

- Fortunately, a computer can produce a decision within seconds, provided we have a model that predicts outcomes accurately.

# 1. Import relevant libraries and modules

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# 2. Preprocessing

In [2]:
raw_data = pd.read_csv("../data/processed_data.csv")
raw_data.head()

Unnamed: 0.1,Unnamed: 0,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,0,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,1,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,2,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,3,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,4,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


In [3]:
raw_data = raw_data.drop(['Unnamed: 0'],axis = 1)

In [4]:
raw_data.columns.values

array(['no_of_dependents', 'education', 'self_employed', 'income_annum',
       'loan_amount', 'loan_term', 'cibil_score',
       'residential_assets_value', 'commercial_assets_value',
       'luxury_assets_value', 'bank_asset_value', 'loan_status'],
      dtype=object)

In [7]:
raw_data.head()

Unnamed: 0,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


## Encoding binary categorical columns

In [6]:
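# Work on a copy so raw_data stays unchanged.
encoded_data = raw_data.copy()
# Categorical columns are the ones stored with the 'object' dtype.
cat_col = raw_data.select_dtypes(include='object').columns.values

# Map each binary category to 0/1. The keys keep the leading space to match
# the stored values exactly; any unmatched value would map to NaN.
map_dict = {'education': {' Graduate': 1, ' Not Graduate': 0},
            'self_employed': {' Yes': 1, ' No': 0},
            'loan_status': {' Approved': 1, ' Rejected': 0}}

for col in cat_col:
    encoded_data[col] = encoded_data[col].map(map_dict[col])
encoded_data.head()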

Unnamed: 0,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,2,1,0,9600000,29900000,12,778,2400000,17600000,22700000,8000000,1
1,0,0,1,4100000,12200000,8,417,2700000,2200000,8800000,3300000,0
2,3,1,0,9100000,29700000,20,506,7100000,4500000,33300000,12800000,0
3,3,1,0,8200000,30700000,8,467,18200000,3300000,23300000,7900000,0
4,5,0,1,9800000,24200000,20,382,12400000,8200000,29400000,5000000,0


# 3. Checking logistic regression assumptions

## Declare inputs and targets

In [8]:
x = encoded_data.drop(['loan_status'], axis = 1)
y = encoded_data['loan_status']

### Assumption 1: Appropriate outcome type

- Logistic regression generally works as a classifier, so the type of logistic regression utilized (binary, multinomial, or ordinal) must match the outcome (dependent) variable in the dataset.
- By default, logistic regression assumes that the outcome variable is binary, where the number of outcomes is two (e.g., Yes/No).
- If the dependent variable has three or more outcomes, then multinomial or ordinal logistic regression should be used.

**How to Check?**

- We can check this assumption by counting the distinct outcomes in the dependent variable. Binary logistic regression requires exactly two unique values in the outcome variable.
- During data exploration, the `loan_status` column showed only two values (Approved and Rejected), so no further investigation is needed.
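The check described above can be sketched in a few lines of pandas. The series `y_example` is a hypothetical stand-in for the encoded target; in this notebook the same calls would presumably be applied to `encoded_data['loan_status']`.

```python
import pandas as pd

# Hypothetical stand-in for the encoded target column (assumption for
# illustration); in the notebook this would be encoded_data['loan_status'].
y_example = pd.Series([1, 0, 0, 0, 1, 1, 0], name="loan_status")

# Count the distinct outcome classes, ignoring missing values.
n_classes = y_example.nunique(dropna=True)

# Frequency of each class, useful for spotting a heavily imbalanced target.
class_counts = y_example.value_counts()

# Binary logistic regression expects exactly two outcome classes.
is_binary = n_classes == 2

print(class_counts)
print("Binary outcome:", is_binary)
```

If `n_classes` came out greater than two, multinomial or ordinal logistic regression would be the appropriate choice instead.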

### Assumption 2: Linearity of independent variables and log-odds

- One of the critical assumptions of logistic regression is that the relationship between the logit (aka log-odds) of the outcome and each continuous independent variable is linear.

- The logit is the logarithm of the odds, where p is the probability of a positive outcome (here, the loan being approved):

$$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)$$
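As a small numerical illustration of the logit definition, the sketch below computes log-odds for a few probabilities and maps them back. The function names `logit` and `inverse_logit` are chosen here for illustration and are not part of the notebook.

```python
import numpy as np

def logit(p):
    """Return the log-odds log(p / (1 - p)) for probabilities strictly between 0 and 1."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def inverse_logit(z):
    """Map log-odds back to a probability (the logistic sigmoid)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# A probability of 0.5 means even odds, so its logit is 0;
# probabilities below 0.5 give negative log-odds, above 0.5 positive.
probs = np.array([0.2, 0.5, 0.8])
log_odds = logit(probs)

print(log_odds)
print(inverse_logit(log_odds))
```

The logit is symmetric around p = 0.5, which is why 0.2 and 0.8 yield log-odds of equal magnitude and opposite sign.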

**How to Check?**

**Visual check**

- We can check logit linearity by visually inspecting the scatter plot of each predictor against the logit values.
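One possible way to obtain points for such a plot, without fitting a model first, is the binned empirical logit: split a continuous predictor into quantile bins and compute the log-odds of a positive outcome in each bin. This is a swapped-in approximation and not necessarily the method used in this project. The helper `empirical_logit_by_bin` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def empirical_logit_by_bin(x, y, n_bins=5):
    """Bin a continuous predictor into quantiles and return, per bin, the
    mean predictor value and the smoothed empirical log-odds of y == 1."""
    frame = pd.DataFrame({"x": np.asarray(x, dtype=float),
                          "y": np.asarray(y, dtype=int)})
    frame["bin"] = pd.qcut(frame["x"], q=n_bins, duplicates="drop")
    summary = frame.groupby("bin", observed=True).agg(
        x_mean=("x", "mean"),
        positives=("y", "sum"),
        n=("y", "count"),
    )
    # Add 0.5 to both counts so bins with 0% or 100% positives stay finite.
    summary["empirical_logit"] = np.log(
        (summary["positives"] + 0.5) / (summary["n"] - summary["positives"] + 0.5)
    )
    return summary.reset_index(drop=True)

# Synthetic data (an assumption, not the loan dataset): the true log-odds
# are linear in x, so the binned points should fall close to a straight line.
rng = np.random.default_rng(0)
x_demo = rng.uniform(300, 900, size=2000)
true_logit = 0.01 * (x_demo - 600)
y_demo = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

logit_table = empirical_logit_by_bin(x_demo, y_demo, n_bins=5)
print(logit_table)
```

Scattering `x_mean` against `empirical_logit` (for example with `plt.scatter`) gives the visual check: a roughly straight trend supports the linearity assumption, while a clear curve suggests transforming the predictor.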


### Assumption 3: No strongly influential outliers