# Credit Modelling
##### We are predicting whether to give out loans or not

## Banking
#### Asset = (Profitable Loan Product)
1. Housing Loan
2. Personal Loan
3. Vehicle Loan
4. Group Loan
5. Education Loan
6. Credit Card


#### Liability = (Product that are non-profit)
###### (CASA)
1. Current Account
2. Savings Account

##### (Term Deposit)
3. Fixed Deposit
4. RD

## NPA
#### (Non-Performing Asset: Loan that is Defaulted (No paid loan))

1. Disbursed Amount = Loan Amount given to a customer
2. OSP = Out-Standing Principle (remaining amount to be paid)
3. DPD = Days past due (total days after due date)
4. PAR = Portfolio at Risk (OSP when DPD > 0 )
5. NPA = Loan account when DPD > 90 days

## Credit Risk Type in Bankings
1. DPD (Zero): NDA (Non-Deliquint Account = Good account)
2. DPD (0 to 30): SMA1 (Standard Monitoring Account 1)
3. DPD (30 to 60): SMA2 (Standard Monitoring Account 2)
4. DPD (60 to 90): SMA3 (Standard Monitoring Account 3)
5. DPD (90 to 180): NPA
6. DPD (>180): Written-off (Loan which is not present)
###### Why bank write-off loan? = NPA will improve = Loan Portfolio quality of the bank will be better = Market sentiment will be good = Stock Price will improve


### Types of NPA
1. GNPA = Gross NPA (3-5 %) = OSP Default (Couldn't able to retrieve 3-5% of the loan back)
2. NNPA = Net NPA (0.01 to 0.06 %) = Provisioning Amount Subtracted (Surplus amount kept to fill the GNPA to keep the Stock prices and Reputation.)
### That's why you should see the GNPA to select a bank before making an account.

In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from scipy.stats import chi2_contingency
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, precision_recall_fscore_support
import warnings
warnings.filterwarnings('ignore')
import os

In [3]:
df1 = pd.read_excel(r'case_study1.xlsx')
df2 = pd.read_excel(r'case_study2.xlsx')

In [6]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51336 entries, 0 to 51335
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   PROSPECTID            51336 non-null  int64  
 1   Total_TL              51336 non-null  int64  
 2   Tot_Closed_TL         51336 non-null  int64  
 3   Tot_Active_TL         51336 non-null  int64  
 4   Total_TL_opened_L6M   51336 non-null  int64  
 5   Tot_TL_closed_L6M     51336 non-null  int64  
 6   pct_tl_open_L6M       51336 non-null  float64
 7   pct_tl_closed_L6M     51336 non-null  float64
 8   pct_active_tl         51336 non-null  float64
 9   pct_closed_tl         51336 non-null  float64
 10  Total_TL_opened_L12M  51336 non-null  int64  
 11  Tot_TL_closed_L12M    51336 non-null  int64  
 12  pct_tl_open_L12M      51336 non-null  float64
 13  pct_tl_closed_L12M    51336 non-null  float64
 14  Tot_Missed_Pmnt       51336 non-null  int64  
 15  Auto_TL            

In [8]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51336 entries, 0 to 51335
Data columns (total 62 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   PROSPECTID                    51336 non-null  int64  
 1   time_since_recent_payment     51336 non-null  int64  
 2   time_since_first_deliquency   51336 non-null  int64  
 3   time_since_recent_deliquency  51336 non-null  int64  
 4   num_times_delinquent          51336 non-null  int64  
 5   max_delinquency_level         51336 non-null  int64  
 6   max_recent_level_of_deliq     51336 non-null  int64  
 7   num_deliq_6mts                51336 non-null  int64  
 8   num_deliq_12mts               51336 non-null  int64  
 9   num_deliq_6_12mts             51336 non-null  int64  
 10  max_deliq_6mts                51336 non-null  int64  
 11  max_deliq_12mts               51336 non-null  int64  
 12  num_times_30p_dpd             51336 non-null  int64  
 13  n

In [10]:
print(df1.shape)
print(df2.shape)

(51336, 26)
(51336, 62)


In [12]:
# Remove Nulls
df1 = df1.loc[df1['Age_Oldest_TL']!= -99999]

In [14]:
columns_to_be_removed = []

for i in df2.columns:
    if df2.loc[df2[i] == -99999].shape[0] > 10000:
        columns_to_be_removed.append(i)

In [16]:
columns_to_be_removed

['time_since_first_deliquency',
 'time_since_recent_deliquency',
 'max_delinquency_level',
 'max_deliq_6mts',
 'max_deliq_12mts',
 'CC_utilization',
 'PL_utilization',
 'max_unsec_exposure_inPct']

In [18]:
df2.drop(columns_to_be_removed, axis = 1, inplace=True)

In [20]:
for i in df2.columns:
    df2 = df2.loc[df2[i] != -99999]

In [22]:
df1.isna().sum()

PROSPECTID              0
Total_TL                0
Tot_Closed_TL           0
Tot_Active_TL           0
Total_TL_opened_L6M     0
Tot_TL_closed_L6M       0
pct_tl_open_L6M         0
pct_tl_closed_L6M       0
pct_active_tl           0
pct_closed_tl           0
Total_TL_opened_L12M    0
Tot_TL_closed_L12M      0
pct_tl_open_L12M        0
pct_tl_closed_L12M      0
Tot_Missed_Pmnt         0
Auto_TL                 0
CC_TL                   0
Consumer_TL             0
Gold_TL                 0
Home_TL                 0
PL_TL                   0
Secured_TL              0
Unsecured_TL            0
Other_TL                0
Age_Oldest_TL           0
Age_Newest_TL           0
dtype: int64

In [24]:
df2.isna().sum()

PROSPECTID                    0
time_since_recent_payment     0
num_times_delinquent          0
max_recent_level_of_deliq     0
num_deliq_6mts                0
num_deliq_12mts               0
num_deliq_6_12mts             0
num_times_30p_dpd             0
num_times_60p_dpd             0
num_std                       0
num_std_6mts                  0
num_std_12mts                 0
num_sub                       0
num_sub_6mts                  0
num_sub_12mts                 0
num_dbt                       0
num_dbt_6mts                  0
num_dbt_12mts                 0
num_lss                       0
num_lss_6mts                  0
num_lss_12mts                 0
recent_level_of_deliq         0
tot_enq                       0
CC_enq                        0
CC_enq_L6m                    0
CC_enq_L12m                   0
PL_enq                        0
PL_enq_L6m                    0
PL_enq_L12m                   0
time_since_recent_enq         0
enq_L12m                      0
enq_L6m 

In [26]:
print(df1.shape)
print(df2.shape)


(51296, 26)
(42066, 54)


In [28]:
# Checking common column names
for i in list(df1.columns):
    if i in list(df2.columns):
        print(i)


PROSPECTID


In [30]:
# Merge the two dataframes, inner join so that no null are present
df = pd.merge(df1, df2, how='inner', left_on=['PROSPECTID'], right_on=['PROSPECTID'])

In [32]:
df.head()

Unnamed: 0,PROSPECTID,Total_TL,Tot_Closed_TL,Tot_Active_TL,Total_TL_opened_L6M,Tot_TL_closed_L6M,pct_tl_open_L6M,pct_tl_closed_L6M,pct_active_tl,pct_closed_tl,...,pct_PL_enq_L6m_of_L12m,pct_CC_enq_L6m_of_L12m,pct_PL_enq_L6m_of_ever,pct_CC_enq_L6m_of_ever,HL_Flag,GL_Flag,last_prod_enq2,first_prod_enq2,Credit_Score,Approved_Flag
0,1,5,4,1,0,0,0.0,0.0,0.2,0.8,...,0.0,0.0,0.0,0.0,1,0,PL,PL,696,P2
1,2,1,0,1,0,0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0,0,ConsumerLoan,ConsumerLoan,685,P2
2,3,8,0,8,1,0,0.125,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1,0,ConsumerLoan,others,693,P2
3,5,3,2,1,0,0,0.0,0.0,0.333,0.667,...,0.0,0.0,0.0,0.0,0,0,AL,AL,753,P1
4,6,6,5,1,0,0,0.0,0.0,0.167,0.833,...,1.0,0.0,0.429,0.0,1,0,ConsumerLoan,PL,668,P3


In [34]:
# We will divide the feature into- Categorical and Numerical

# check how many columns are categorical
cat_col = []

for i in df.columns:
    if df[i].dtype == 'object':
        cat_col.append(i)

cat_col

['MARITALSTATUS',
 'EDUCATION',
 'GENDER',
 'last_prod_enq2',
 'first_prod_enq2',
 'Approved_Flag']

In [36]:
# Hypothesis Testing/ Inferential Statistics

# Are these two associated -- MartialStatus vs Approved_Flag? 

"""
Alpha (Always Assumed)
Significant level 
Strictness level
Margin of error
5% = 0.05 (Less Risky Projects)
0.0001    (High Risky Projects) 

Chisquare = Cat vs Cat
T-test = Cat vs Num (2 categories)
Anova = Cat vs Num (>=3 categories)
"""
# p_value of all categories from chi_square test
for i in cat_col[:-1]:
    chi2, p_value, dof, _  = chi2_contingency(pd.crosstab(df[i], df['Approved_Flag']))
    print(i, '---', p_value)

# Since all the p_value is below 0.05 we will accept all cat columns

MARITALSTATUS --- 3.578180861038862e-233
EDUCATION --- 2.6942265249737532e-30
GENDER --- 1.907936100186563e-05
last_prod_enq2 --- 0.0
first_prod_enq2 --- 7.84997610555419e-287


In [38]:
# Multicollineartiy vs Cor-relation
# Multicollinearity = Predictibility of each features by other features

# Important
# Correaltion is specific to linear realtionships between columns
# In Convex functions, correlation gives misleading values

In [40]:
# Total Numeric Columns
numeric_columns = []
for i in df.columns:
    if df[i].dtype != 'object' and i not in ['PROSPECTID', 'Approved_Flag']:
        numeric_columns.append(i)
len(numeric_columns)

72

In [42]:
# VIF for numerical columns
vif_data = df[numeric_columns]
total_columns = vif_data.shape[1]
columns_to_be_kept = []
column_index = 0

for i in range (0, total_columns):

    vif_value = variance_inflation_factor(vif_data, column_index)
    print(column_index, '---', vif_value)

    if vif_value <= 6:
        columns_to_be_kept.append(numeric_columns[i])
        column_index = column_index+1
    else:
        vif_data = vif_data.drop([numeric_columns[i]], axis=1)

0 --- inf
0 --- inf
0 --- 11.320180023967996
0 --- 8.363698035000327
0 --- 6.520647877790928
0 --- 5.149501618212625
1 --- 2.611111040579735
2 --- inf
2 --- 1788.7926256209232
2 --- 8.601028256477228
2 --- 3.8328007921530785
3 --- 6.0996533816467355
3 --- 5.581352009642762
4 --- 1.985584353098778
5 --- inf
5 --- 4.809538302819343
6 --- 23.270628983464636
6 --- 30.595522588100053
6 --- 4.3843464059655854
7 --- 3.0646584155234238
8 --- 2.898639771299253
9 --- 4.377876915347322
10 --- 2.2078535836958433
11 --- 4.916914200506864
12 --- 5.214702030064725
13 --- 3.3861625024231476
14 --- 7.840583309478997
14 --- 5.255034641721438
15 --- inf
15 --- 7.380634506427232
15 --- 1.4210050015175733
16 --- 8.083255010190316
16 --- 1.624122752404011
17 --- 7.257811920140003
17 --- 15.59624383268298
17 --- 1.825857047132431
18 --- 1.5080839450032664
19 --- 2.172088834824577
20 --- 2.6233975535272283
21 --- 2.2959970812106176
22 --- 7.360578319196439
22 --- 2.1602387773102554
23 --- 2.8686288267891467
2

In [43]:
vif_data.shape

(42064, 39)

In [44]:
# Check Anova for columns to be kept

from scipy.stats import f_oneway

columns_to_be_kept_numerical = []

for i in columns_to_be_kept:
    a = list(df[i])
    b = list(df['Approved_Flag'])

    group_P1 = [value for value, group in zip(a,b) if group == 'P1']
    group_P2 = [value for value, group in zip(a,b) if group == 'P2']
    group_P3 = [value for value, group in zip(a,b) if group == 'P3']
    group_P4 = [value for value, group in zip(a,b) if group == 'P4']

    f_stat, p_value = f_oneway(group_P1, group_P2, group_P3, group_P4)

    if p_value <= 0.05:
        columns_to_be_kept_numerical.append(i)
    

In [45]:
len(columns_to_be_kept_numerical)

37

In [46]:
cat_col

['MARITALSTATUS',
 'EDUCATION',
 'GENDER',
 'last_prod_enq2',
 'first_prod_enq2',
 'Approved_Flag']

In [47]:
# feature selection is done for cat and num features

# listing all the final features
features = columns_to_be_kept_numerical + cat_col
df = df[features]

In [48]:
print(df['MARITALSTATUS'].unique())    
print(df['EDUCATION'].unique())
print(df['GENDER'].unique())
print(df['last_prod_enq2'].unique())
print(df['first_prod_enq2'].unique())

['Married' 'Single']
['12TH' 'GRADUATE' 'SSC' 'POST-GRADUATE' 'UNDER GRADUATE' 'OTHERS'
 'PROFESSIONAL']
['M' 'F']
['PL' 'ConsumerLoan' 'AL' 'CC' 'others' 'HL']
['PL' 'ConsumerLoan' 'others' 'AL' 'HL' 'CC']


In [None]:
"""
# Ordinal feature -- EDUCATION
# SSC            : 1
# 12TH           : 2
# GRADUATE       : 3
# UNDER GRADUATE : 3
# POST-GRADUATE  : 4
# OTHERS         : 1
# PROFESSIONAL   : 3

OTHERS has to be verified by the Business end-user or Your Senior
"""

In [50]:
df.loc[df['EDUCATION'] == 'SSC',['EDUCATION']]              = 1
df.loc[df['EDUCATION'] == '12TH',['EDUCATION']]             = 2
df.loc[df['EDUCATION'] == 'GRADUATE',['EDUCATION']]         = 3
df.loc[df['EDUCATION'] == 'UNDER GRADUATE',['EDUCATION']]   = 3
df.loc[df['EDUCATION'] == 'POST-GRADUATE',['EDUCATION']]    = 4
df.loc[df['EDUCATION'] == 'OTHERS',['EDUCATION']]           = 1
df.loc[df['EDUCATION'] == 'PROFESSIONAL',['EDUCATION']]     = 3

In [57]:
df['EDUCATION'].value_counts()

EDUCATION
3    18931
2    11703
1     9532
4     1898
Name: count, dtype: int64

In [62]:
df['EDUCATION'] = df['EDUCATION'].astype('int')

In [64]:
df_encoded = pd.get_dummies(df, columns=['MARITALSTATUS', 'GENDER', 'last_prod_enq2', 'first_prod_enq2'])

In [66]:
df_encoded.describe()

Unnamed: 0,pct_tl_open_L6M,pct_tl_closed_L6M,Tot_TL_closed_L12M,pct_tl_closed_L12M,Tot_Missed_Pmnt,CC_TL,Home_TL,PL_TL,Secured_TL,Unsecured_TL,...,enq_L3m,NETMONTHLYINCOME,Time_With_Curr_Empr,CC_Flag,PL_Flag,pct_PL_enq_L6m_of_ever,pct_CC_enq_L6m_of_ever,HL_Flag,GL_Flag,EDUCATION
count,42064.0,42064.0,42064.0,42064.0,42064.0,42064.0,42064.0,42064.0,42064.0,42064.0,...,42064.0,42064.0,42064.0,42064.0,42064.0,42064.0,42064.0,42064.0,42064.0,42064.0
mean,0.179032,0.097783,0.825504,0.160365,0.525746,0.145921,0.076241,0.328,2.921334,2.341646,...,1.230458,26929.9,110.345783,0.102962,0.193063,0.195497,0.064186,0.252235,0.05658,2.313689
std,0.278043,0.210957,1.537208,0.258831,1.106442,0.549314,0.358582,0.916368,6.379764,3.405397,...,2.069461,20843.0,75.629967,0.303913,0.394707,0.367414,0.225989,0.4343,0.231042,0.87107
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,18000.0,61.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,24000.0,92.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
75%,0.333,0.1,1.0,0.25,1.0,0.0,0.0,0.0,3.0,3.0,...,2.0,31000.0,131.0,0.0,0.0,0.0,0.0,1.0,0.0,3.0
max,1.0,1.0,33.0,1.0,34.0,27.0,10.0,29.0,235.0,55.0,...,42.0,2500000.0,1020.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0


# Machine Learning Model Fitting

In [69]:
# Data Splitting in Test and Train
y = df_encoded['Approved_Flag']
X = df_encoded.drop(columns=['Approved_Flag'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

In [71]:
# 1. Random Forest
rf_classifier = RandomForestClassifier(n_estimators=200, random_state=42)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy:{accuracy}')

Accuracy:0.7636990372043266


In [72]:
# All metrics for every category in Target Column
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)

for i, v in enumerate(['p1','p2','p3','p4',]):
    print(f'Class {v}: ')
    print(f"Precision: {precision[i]}")
    print(f"Recall: {recall[i]}")
    print(f"F1 Score: {f1_score[i]}")

Class p1: 
Precision: 0.8370457209847597
Recall: 0.7041420118343196
F1 Score: 0.7648634172469202
Class p2: 
Precision: 0.7957519116397621
Recall: 0.9282457879088206
F1 Score: 0.856907593778591
Class p3: 
Precision: 0.4423380726698262
Recall: 0.21132075471698114
F1 Score: 0.28600612870275793
Class p4: 
Precision: 0.7178502879078695
Recall: 0.7269193391642371
F1 Score: 0.7223563495895703


In [73]:
# 2. XGboost
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

xgb_classifier = xgb.XGBClassifier(objective='multi:softmax', num_class=4)

y = df_encoded['Approved_Flag']
X = df_encoded. drop ( ['Approved_Flag'], axis = 1 )


label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

xgb_classifier.fit(X_train, y_train)
y_pred = xgb_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy:{accuracy}')

Accuracy:0.7783192677998336


In [74]:
# All metrics for every category in Target Column
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)

for i, v in enumerate(['p1','p2','p3','p4',]):
    print(f'Class {v}: ')
    print(f"Precision: {precision[i]}")
    print(f"Recall: {recall[i]}")
    print(f"F1 Score: {f1_score[i]}")

Class p1: 
Precision: 0.823906083244397
Recall: 0.7613412228796844
F1 Score: 0.7913890312660175
Class p2: 
Precision: 0.8255418233924413
Recall: 0.913577799801784
F1 Score: 0.8673315769665035
Class p3: 
Precision: 0.4756380510440835
Recall: 0.30943396226415093
F1 Score: 0.37494284407864653
Class p4: 
Precision: 0.7342386032977691
Recall: 0.7356656948493683
F1 Score: 0.7349514563106796


In [75]:
# 3. Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(max_depth=20, min_samples_split=10)
dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy:{accuracy}')

Accuracy:0.7096160703672887


In [76]:
# All metrics for every category in Target Column
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)

for i, v in enumerate(['p1','p2','p3','p4',]):
    print(f'Class {v}: ')
    print(f"Precision: {precision[i]}")
    print(f"Recall: {recall[i]}")
    print(f"F1 Score: {f1_score[i]}")

Class p1: 
Precision: 0.7198824681684622
Recall: 0.7248520710059172
F1 Score: 0.7223587223587223
Class p2: 
Precision: 0.8101635514018691
Recall: 0.824777006937562
F1 Score: 0.8174049700422356
Class p3: 
Precision: 0.34098101265822783
Recall: 0.32528301886792454
F1 Score: 0.33294708381614524
Class p4: 
Precision: 0.6481854838709677
Recall: 0.6248785228377065
F1 Score: 0.636318654131618


In [84]:
# We must also look at precision and recall because of imbalanced data. Accuracy might lead to wrong interpretation. 
# Take the example of terrorists and non-terrorists. Recall can tell the metrics of the individual classes, and precision can tell the metrics of
# the predicted classes. F1_score is generally used when Recall can be manipulated to give out 100% results That is why F1_score is used to ask why 
# precision or recall is low.

## Apply Standard Scaler


In [91]:
from sklearn.preprocessing import StandardScaler

columns_to_be_scaled = ['Age_Oldest_TL','Age_Newest_TL','time_since_recent_payment',
'max_recent_level_of_deliq','recent_level_of_deliq',
'time_since_recent_enq','NETMONTHLYINCOME','Time_With_Curr_Empr']

for i in columns_to_be_scaled:
    column_data = df_encoded[i].values.reshape(-1,1)
    scaler = StandardScaler()
    scaled_column = scaler.fit_transform(column_data)
    df_encoded[i] = scaled_column


In [93]:
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

xgb_classifier = xgb.XGBClassifier(objective='multi:softmax',  num_class=4)

y = df_encoded['Approved_Flag']
x = df_encoded. drop ( ['Approved_Flag'], axis = 1 )

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

x_train, x_test, y_train, y_test = train_test_split(x, y_encoded, test_size=0.2, random_state=42)

xgb_classifier.fit(x_train, y_train)
y_pred = xgb_classifier.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)

for i, v in enumerate(['p1', 'p2', 'p3', 'p4']):
    print(f"Class {v}:")
    print(f"Precision: {precision[i]}")
    print(f"Recall: {recall[i]}")
    print(f"F1 Score: {f1_score[i]}")
    print()

Accuracy: 0.78
Class p1:
Precision: 0.823906083244397
Recall: 0.7613412228796844
F1 Score: 0.7913890312660175

Class p2:
Precision: 0.8255418233924413
Recall: 0.913577799801784
F1 Score: 0.8673315769665035

Class p3:
Precision: 0.4756380510440835
Recall: 0.30943396226415093
F1 Score: 0.37494284407864653

Class p4:
Precision: 0.7342386032977691
Recall: 0.7356656948493683
F1 Score: 0.7349514563106796



##### No improvements by applying scaling

## Hyperparameter Tuning

###### We are selecting XGboost for its immaculate performance against other algorithms

In [96]:
from sklearn.model_selection import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Define the XGBClassifer with the initial set of hyperparameters
xgb_model = xgb.XGBClassifier(objective='multi:softmax', num_classes=4)

# Define the parameter grid for hyperparameter tuning

param_grid = {
    'n_estimator': [50, 100, 200],
    'max_depth': [3,5,7],
    'learning_rate': [0.01, 0.1, 0.2],
}

grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(x_train, y_train)

# Print the best hyperparameters
print('Best Hyperparameters:', grid_search.best_params_)

# Evaluate the model with the best hyperparameters on the test set
best_model = grid_search.best_estimator_
accuracy = best_model.score(X_test, y_test)
print('Test Accuracy:', accuracy)

# Best Hyperparameters: {'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 200}

# Based on risk appetite of the bank, you will suggest P1,P2,P3,P4 to the business end user

Best Hyperparameters: {'learning_rate': 0.2, 'max_depth': 5, 'n_estimator': 50}
Test Accuracy: 0.19874004516819208


In [100]:
# Hyperparameter tuning for xgboost 

# Define the hyperparameter grid
param_grid = {
   'colsample_bytree': [0.1, 0.3, 0.5, 0.7, 0.9],
   'learning_rate'   : [0.001, 0.01, 0.1, 1],
   'max_depth'       : [3, 5, 8, 10],
   'alpha'           : [1, 10, 100],
   'n_estimators'    : [10,50,100]
}

index = 0

answers_grid = {
 'combination'       :[],
 'train_Accuracy'    :[],
 'test_Accuracy'     :[],
 'colsample_bytree'  :[],
 'learning_rate'     :[],
 'max_depth'         :[],
 'alpha'             :[],
 'n_estimators'      :[]
 }

# Loop through each combination of hyperparameters
for colsample_bytree in param_grid['colsample_bytree']:
    for learning_rate in param_grid['learning_rate']:
        for max_depth in param_grid['max_depth']:
            for alpha in param_grid['alpha']:
                for n_estimators in param_grid['n_estimators']:

                    index = index + 1

                    # Define and train the XGBoost model
                    model = xgb.XGBClassifier(objective='multi:softmax',
                                              num_class=4,
                                              colsample_bytree=colsample_bytree,
                                              learning_rate=learning_rate,
                                              max_depth=max_depth,
                                              alpha=alpha, 
                                              n_estimators=n_estimators)

                    
                    y = df_encoded['Approved_Flag']
                    x = df_encoded. drop ( ['Approved_Flag'], axis = 1 )

                    label_encoder = LabelEncoder()
                    y_encoded = label_encoder.fit_transform(y)


                    x_train, x_test, y_train, y_test = train_test_split(x, y_encoded, test_size=0.2, random_state=42)


                    model.fit(x_train, y_train)

                    # Predict on training and testing sets
                    y_pred_train = model.predict(x_train)
                    y_pred_test = model.predict(x_test)

                    # Calculate train and test results
                  
                    train_accuracy =  accuracy_score (y_train, y_pred_train)
                    test_accuracy  =  accuracy_score (y_test , y_pred_test)
                  
                  
           
                    # Include into the lists
                    answers_grid ['combination']   .append(index)
                    answers_grid ['train_Accuracy']    .append(train_accuracy)
                    answers_grid ['test_Accuracy']     .append(test_accuracy)
                    answers_grid ['colsample_bytree']   .append(colsample_bytree)
                    answers_grid ['learning_rate']      .append(learning_rate)
                    answers_grid ['max_depth']          .append(max_depth)
                    answers_grid ['alpha']              .append(alpha)
                    answers_grid ['n_estimators']       .append(n_estimators)
               
               
                    # Print results for this combination
                    print(f"Combination {index}")
                    print(f"colsample_bytree: {colsample_bytree}, learning_rate: {learning_rate}, max_depth: {max_depth}, alpha: {alpha}, n_estimators: {n_estimators}")
                    print(f"Train Accuracy: {train_accuracy:.2f}")
                    print(f"Test Accuracy : {test_accuracy :.2f}")
                    print("-" * 30)


Combination 1
colsample_bytree: 0.1, learning_rate: 0.001, max_depth: 3, alpha: 1, n_estimators: 10
Train Accuracy: 0.61
Test Accuracy : 0.60
------------------------------
Combination 2
colsample_bytree: 0.1, learning_rate: 0.001, max_depth: 3, alpha: 1, n_estimators: 50
Train Accuracy: 0.61
Test Accuracy : 0.60
------------------------------
Combination 3
colsample_bytree: 0.1, learning_rate: 0.001, max_depth: 3, alpha: 1, n_estimators: 100
Train Accuracy: 0.61
Test Accuracy : 0.60
------------------------------
Combination 4
colsample_bytree: 0.1, learning_rate: 0.001, max_depth: 3, alpha: 10, n_estimators: 10
Train Accuracy: 0.61
Test Accuracy : 0.60
------------------------------
Combination 5
colsample_bytree: 0.1, learning_rate: 0.001, max_depth: 3, alpha: 10, n_estimators: 50
Train Accuracy: 0.61
Test Accuracy : 0.60
------------------------------
Combination 6
colsample_bytree: 0.1, learning_rate: 0.001, max_depth: 3, alpha: 10, n_estimators: 100
Train Accuracy: 0.61
Test Accu

In [None]:
answers_grid = {
 'combination'       :[],
 'train_Accuracy'    :[],
 'test_Accuracy'     :[],
 'colsample_bytree'  :[],
 'learning_rate'     :[],
 'max_depth'         :[],
 'alpha'             :[],
 'n_estimators'      :[]
 }


In [148]:
param_df = pd.DataFrame(answers_grid).sort_values(by="test_Accuracy", ascending=False).reset_index()
param_df

Unnamed: 0,index,combination,train_Accuracy,test_Accuracy,colsample_bytree,learning_rate,max_depth,alpha,n_estimators
0,536,537,0.837746,0.780221,0.7,0.100,10,10,100
1,400,401,0.797123,0.779983,0.5,1.000,3,10,50
2,671,672,0.824998,0.779508,0.9,0.100,8,10,100
3,541,542,0.807554,0.779389,0.7,1.000,3,1,50
4,527,528,0.820719,0.778438,0.7,0.100,8,10,100
...,...,...,...,...,...,...,...,...,...
715,34,35,0.606431,0.599667,0.1,0.001,10,100,50
716,33,34,0.606431,0.599667,0.1,0.001,10,100,10
717,62,63,0.606431,0.599667,0.1,0.010,8,100,100
718,31,32,0.606579,0.599667,0.1,0.001,10,10,50


# Best Parameters for XGBoost

In [150]:
param_df.iloc[0,:]

index               536.000000
combination         537.000000
train_Accuracy        0.837746
test_Accuracy         0.780221
colsample_bytree      0.700000
learning_rate         0.100000
max_depth            10.000000
alpha                10.000000
n_estimators        100.000000
Name: 0, dtype: float64

In [None]:
# So for project he just used different data and made a project with it flask. It's pretty simple project.
# He also made a .exe file where by running some command which I'll list below, will create a exe file which when running it will give prediction like "1,0,2" again each row in xlsx data.
# The same test data from which we trained the model 

# commands: 
# cd C:\xxxx\Desktop\New folder
# python -m PyInstaller --onefile exe.py
