# A Comparative Analysis: Machine Learning vs. Deep Learning Algorithms for Predicting Default Risk - Inference

## i. Introduction

By: George Kenji Putra

From: HCK-005

Dataset: [Home Credit Default Risk](https://www.kaggle.com/competitions/home-credit-default-risk/data)

Objective: Run inference with the model from the h8dsft_FP_George_Kenji_Putra.ipynb notebook.

## ii. Importing Libraries

In [15]:
# importing libraries to be used

import pandas as pd
import pickle
import warnings
warnings.filterwarnings(action='ignore')

from sklearn.metrics import classification_report

## iii. Data Loading

We load the sample inference data, the test dataset, and the saved model needed to run inference.

In [16]:
# loading and assigning the dataset as inference

inference = pd.read_csv('Application_inference.csv')

inference

,target,name_income_type,organization_type,occupation_type,ext_source_3,ext_source_2,days_birth,credit_income_percent,annuity_income_percent,credit_term,days_employed_percent
0,0,Commercial associate,Business Entity Type 3,Sales staff,0.000527,0.604894,-13297,2.248587,0.254304,0.113095,0.057306
1,0,Commercial associate,Business Entity Type 3,Managers,0.712155,0.425351,-14778,5.178636,0.197764,0.038188,0.077209
2,0,Working,Self-employed,Sales staff,0.207964,0.531760,-17907,2.444444,0.193133,0.079009,0.035684
3,0,State servant,Business Entity Type 3,High skill tech staff,0.614414,0.693521,-19626,1.029091,0.100764,0.097915,0.355753
4,0,Working,Business Entity Type 3,Laborers,0.636376,0.560690,-20327,2.738960,0.141000,0.051479,0.054361
...,...,...,...,...,...,...,...,...,...,...,...
195,0,Commercial associate,Government,Managers,0.324891,0.700425,-13318,2.352941,0.158088,0.067187,0.205286
196,0,Working,Business Entity Type 3,Others,0.312365,0.702947,-14061,7.222222,0.279833,0.038746,0.020411
197,0,Working,Other,Drivers,0.429424,0.634161,-17344,2.422400,0.113650,0.046916,0.135378
198,0,Working,Kindergarten,Core staff,0.610991,0.533388,-14032,9.224600,0.269700,0.029237,0.146308


In [24]:
# load the test dataset

app_test = pd.read_csv('Application_test.csv')
app_test.columns = app_test.columns.str.lower()

# adding several derived columns based on domain knowledge

app_test['credit_income_percent'] = app_test['amt_credit'] / app_test['amt_income_total']
app_test['annuity_income_percent'] = app_test['amt_annuity'] / app_test['amt_income_total']
app_test['credit_term'] = app_test['amt_annuity'] / app_test['amt_credit']
app_test['days_employed_percent'] = app_test['days_employed'] / app_test['days_birth']

# imputing missing occupation types with 'Others'

app_test['occupation_type'] = app_test['occupation_type'].fillna('Others')

app_test

,sk_id_curr,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,amt_goods_price,...,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year,credit_income_percent,annuity_income_percent,credit_term,days_employed_percent
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.213333,0.152300,0.036147,0.121044
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0.0,0.0,0.0,0.0,0.0,3.0,2.250182,0.175455,0.077973,0.247398
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0.0,0.0,0.0,0.0,1.0,4.0,3.275378,0.344578,0.105202,0.222477
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0.0,0.0,0.0,0.0,0.0,3.0,5.000000,0.155614,0.031123,0.133515
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,,,,,,,3.475000,0.178150,0.051266,0.168021
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48739,456221,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,...,0.0,0.0,0.0,0.0,0.0,1.0,3.395556,0.143815,0.042354,0.258838
48740,456222,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,...,,,,,,,3.951829,0.202600,0.051267,0.102718
48741,456223,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,...,0.0,0.0,0.0,0.0,3.0,1.0,1.555556,0.163978,0.105414,0.190742
48742,456224,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,...,0.0,0.0,0.0,0.0,0.0,2.0,2.000000,0.111680,0.055840,0.195518
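
The four ratio features above could also be wrapped in a reusable helper, so the same transformation is applied to any new application data. The sketch below is one possible way to do this, not the notebook's own code: the function name `add_domain_features` is hypothetical, and mapping zero denominators to `NaN` is an added assumption. The `days_birth` and `days_employed` values in the sample row are illustrative only, since those columns are elided in the output above.

```python
import numpy as np
import pandas as pd


def add_domain_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the four ratio features added (hypothetical helper)."""
    out = df.copy()
    # Assumption: a zero denominator is treated as missing rather than producing inf.
    income = out['amt_income_total'].replace(0, np.nan)
    credit = out['amt_credit'].replace(0, np.nan)
    birth = out['days_birth'].replace(0, np.nan)
    out['credit_income_percent'] = out['amt_credit'] / income
    out['annuity_income_percent'] = out['amt_annuity'] / income
    out['credit_term'] = out['amt_annuity'] / credit
    out['days_employed_percent'] = out['days_employed'] / birth
    return out


# Amounts copied from the first test row shown above; the day counts are illustrative.
sample = pd.DataFrame({
    'amt_income_total': [135000.0],
    'amt_credit': [568800.0],
    'amt_annuity': [20560.5],
    'days_birth': [-19000],
    'days_employed': [-2300],
})

features = add_domain_features(sample)
print(features[['credit_income_percent', 'annuity_income_percent', 'credit_term']])
```

Returning a copy keeps the input frame unchanged, which makes the helper safe to call more than once on the same data.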


In [18]:
# load the models

with open('model.pkl', 'rb') as file:
    model = pickle.load(file)

## x. Model Inference

In this section we run inference with the model saved from the h8dsft_P1M2_George_Kenji_Putra.ipynb notebook.

### x.i Inference Data Processing

In [19]:
# defining y_test for comparison purpose

y_test_inf = inference['target']
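
The notebook passes the full `inference` frame to the model, which works when the saved pipeline selects its own input columns by name (an assumption about `model.pkl` that we cannot confirm here). If that is not the case, the label can be separated explicitly before predicting. A minimal sketch on a two-row stand-in frame, using values from the sample output above; the names `inference_demo`, `X_demo`, and `y_demo` are hypothetical:

```python
import pandas as pd

# Two rows with a subset of the columns shown in the inference sample above.
inference_demo = pd.DataFrame({
    'target': [0, 0],
    'name_income_type': ['Commercial associate', 'Commercial associate'],
    'occupation_type': ['Sales staff', 'Managers'],
    'ext_source_2': [0.604894, 0.425351],
})

# Keep the label aside for evaluation and drop it from the model inputs.
y_demo = inference_demo['target']
X_demo = inference_demo.drop(columns=['target'])

print(list(X_demo.columns))
```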

### x.ii Inference Prediction

In [20]:
# predicting the inference data

result_inf = model.predict(inference)

result_inf

array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       0, 1], dtype=int64)

In [21]:
# converting the result to a dataframe

result_inf = pd.DataFrame(result_inf, columns=['LGBM Prediction'])

# comparing the actual customer status in the sample data with the predicted values

predictions_inf = pd.concat([y_test_inf, result_inf], axis=1)
predictions_inf.columns = ['Actual', 'LGBM Prediction']

predictions_inf

,Actual,LGBM Prediction
0,0,1
1,0,0
2,0,1
3,0,0
4,0,0
...,...,...
195,0,0
196,0,0
197,0,1
198,0,0
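
A side-by-side table like the one above can be summarized with a cross-tabulation and a simple agreement rate. The sketch below uses only the first five rows shown in the output, so the numbers describe that small subset and not the full sample; the names `demo`, `table`, and `agreement` are hypothetical.

```python
import pandas as pd

# First five rows of the comparison table shown above.
demo = pd.DataFrame({
    'Actual': [0, 0, 0, 0, 0],
    'LGBM Prediction': [1, 0, 1, 0, 0],
})

# Counts of each (actual, predicted) pair.
table = pd.crosstab(demo['Actual'], demo['LGBM Prediction'])

# Share of rows where the prediction matches the actual label.
agreement = (demo['Actual'] == demo['LGBM Prediction']).mean()

print(table)
print(agreement)
```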


In [22]:
print('LGBM Improved Report:')
print('')
print(classification_report(y_test_inf, result_inf))

LGBM Improved Report:

              precision    recall  f1-score   support

           0       0.94      0.67      0.78       177
           1       0.20      0.65      0.31        23

    accuracy                           0.67       200
   macro avg       0.57      0.66      0.54       200
weighted avg       0.85      0.67      0.72       200
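
The derived columns of the report follow from the per-class precision, recall, and support. As a rough consistency check, the sketch below recomputes the F1 scores and the macro and weighted precision from the rounded values printed above. Because the inputs are already rounded, the results are only expected to match the report to two decimals; the names `report` and `f1` are hypothetical.

```python
# Rounded per-class values copied from the report above.
report = {
    0: {'precision': 0.94, 'recall': 0.67, 'support': 177},
    1: {'precision': 0.20, 'recall': 0.65, 'support': 23},
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(row['support'] for row in report.values())

# Per-class F1 from precision and recall.
f1_scores = {label: f1(row['precision'], row['recall']) for label, row in report.items()}

# Macro average: unweighted mean over classes.
macro_precision = sum(row['precision'] for row in report.values()) / len(report)

# Weighted average: mean over classes weighted by support.
weighted_precision = sum(
    row['precision'] * row['support'] for row in report.values()
) / total_support

for label, score in f1_scores.items():
    print(label, round(score, 2))
print('macro precision', round(macro_precision, 2))
print('weighted precision', round(weighted_precision, 2))
```

The gap between the macro precision (0.57) and the weighted precision (0.85) reflects the class imbalance: 177 of the 200 samples belong to class 0, so the weighted average is dominated by that class.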



### x.iii Test Prediction

In [25]:
# predicting the test data

result_test = model.predict(app_test)

result_test

array([0, 1, 0, ..., 0, 0, 1], dtype=int64)

In [26]:
# converting the result to a dataframe

result_test = pd.DataFrame(result_test, columns=['target'])

# merging the sk_id_curr column with its predictions

predictions_test = pd.concat([app_test['sk_id_curr'], result_test], axis=1)

predictions_test

,sk_id_curr,target
0,100001,0
1,100005,1
2,100013,0
3,100028,0
4,100038,1
...,...,...
48739,456221,0
48740,456222,0
48741,456223,0
48742,456224,0
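
`pd.concat` with `axis=1` aligns rows by index label, not by position. That works above because `app_test` and `result_test` both carry a default 0-based index, but it would silently produce misaligned rows if the test frame had been filtered or re-indexed first. A small sketch of the difference, using the first three IDs and predictions from the output above with a deliberately shifted index; the names `ids`, `preds`, `misaligned`, and `aligned` are hypothetical:

```python
import numpy as np
import pandas as pd

# IDs with a non-default index, as could happen after filtering rows.
ids = pd.Series([100001, 100005, 100013], index=[10, 11, 12], name='sk_id_curr')
preds = np.array([0, 1, 0])

# Index labels differ (10..12 vs 0..2), so concat pads with NaN instead of pairing rows.
misaligned = pd.concat([ids, pd.DataFrame(preds, columns=['target'])], axis=1)

# Building the frame from raw values pairs rows by position.
aligned = pd.DataFrame({'sk_id_curr': ids.to_numpy(), 'target': preds})

print(misaligned.shape)
print(aligned)
```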
