<h2><b>CREDIT CARD APPROVAL PREDICTION PROJECT</b></h2>
<h3><u>Author: Chirag Ingle</u></h3>
<p>It is an automatic credit card approval predictor that uses machine learning techniques.</p>

<h3>NEED OF CREDIT CARD APPROVAL PREDICTION:</h3>

<p>Commercial banks receive a lot of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming.So, this task can be automated with the power of machine learning and most of the banks use the same technique.</p>

<img src='https://cardinsider.com/wp-content/uploads/2023/08/Improve-Your-Chances-For-Credit-Card-Application-Blog-Post.jpg' alt="Image Not Found">

<p>We will use the <i>Credit Card Approval dataset from the UCI Machine Learning Repository.</i> The structure of this notebook is as follows:

<ul><li>We will start off by loading and viewing the dataset.</li>
<li>We will see that the dataset has a mixture of both numerical and non-numerical features, that it contains values from different ranges, plus that it contains a number of missing entries.</li>
<li>We will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.</li>
<li>After our data is in good shape, we will do some exploratory data analysis to build our intuitions.</li></ul>
Finally, we will build a machine learning model that can predict if an individual's application for a credit card will be accepted.
Before the first step, we are mounting the dataset from the google drive to the google collab.</p>

<h3>Importing Libraries</h3>

In [104]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
import math
import tensorflow as tf
import warnings
import ydata_profiling
from ydata_profiling import ProfileReport
from IPython.display import HTML
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import xgboost

In [77]:
from sklearn.ensemble import  RandomForestClassifier,GradientBoostingClassifier,HistGradientBoostingClassifier
from sklearn.preprocessing import MinMaxScaler,LabelEncoder
from sklearn.model_selection import train_test_split

In [61]:
#Version Check
tf.__version__

'2.13.0'

In [62]:
sklearn.__version__

'1.2.2'

In [63]:
#ignoring all the warnings
warnings.filterwarnings('ignore')

<h3>Importing Dataset</h3>

In [64]:
data=pd.read_csv("/kaggle/input/credit-card-approval/cc_approvals.data",header=None)

In [65]:
data.sample(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
169,b,37.5,1.125,y,p,d,v,1.5,f,f,0,t,g,431,0,+
599,b,20.5,2.415,u,g,c,v,2.0,t,t,11,t,g,200,3000,+
411,b,25.17,3.0,u,g,c,v,1.25,f,t,1,f,g,0,22,-
388,b,26.67,14.585,u,g,i,bb,0.0,f,f,0,t,g,178,0,-
163,b,32.0,1.75,y,p,e,h,0.04,t,f,0,t,g,393,0,+


<h4><b>Inspecting The Applications</b></h4>
<p>The features of this dataset have been anonymized to protect the privacy. The probable features in a typical credit card application are Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally the ApprovalStatus. This gives us a pretty good starting point, and we can map these features with respect to the columns in the output.

The dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.</p>

In [66]:
data.describe()

Unnamed: 0,2,7,10,14
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


In [67]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


<h3>3. Handling the missing values (part i)</h3>
<p>We've uncovered some issues that will affect the performance of our machine learning model(s) if they go unchanged:

Our dataset contains both numeric and non-numeric data (specifically data that are of float64, int64 and object types). Specifically, the features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values. The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000. Apart from these, we can get useful statistical information (like mean, max, and min) about the features that have numerical values. Finally, the dataset has missing values, which we'll take care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output.

Now, let's temporarily replace these missing value question marks with NaN.</p>

In [68]:
print(data.tail(17))

data.replace(to_replace='?',value=np.nan,inplace=False)

print(data.tail(17))

    0      1       2  3  4   5   6      7  8  9   10 11 12     13   14 15
673  ?  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -
674  a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -
675  a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -
677  b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -
678  a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -
679  a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -
680  b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -
681  b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -
682  b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  00140    2  -
683  b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  00240    3  -
684  b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  00400    0  -
685  b  21.08  10.085  y  p   e   h  1

<h3> 4. Handling the missing values (part ii) </h3>

<p>We replaced all the question marks with NaNs. This is going to help us in the next missing value treatment that we are going to perform.

An important question that gets raised here is why are we giving so much importance to missing values? Can't they be just ignored? Ignoring missing values can affect the performance of a machine learning model heavily. While ignoring the missing values our machine learning model may miss out on information about the dataset that may be useful for its training. Then, there are many models which cannot handle missing values implicitly such as LDA.

So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.</p>

In [69]:
# Fill NaN values with mean for numeric columns
numeric_columns = data.select_dtypes(include=np.number).columns
data[numeric_columns] = data[numeric_columns].apply(lambda x: x.fillna(x.mean()))

<h3>5. Handling the missing values (part iii)</h3>
<p>We have successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for columns 0, 1, 3, 4, 5, 6 and 13. All of these columns contain non-numeric data and this why the mean imputation strategy would not work here. This needs a different treatment.

We are going to impute these missing values with the most frequent values as present in the respective columns. This is good practice when it comes to imputing missing values for categorical data in general.</p>

In [70]:
# Iterate over each column
for col in data.columns:
    # Check if the column is of object type
    if data[col].dtype == 'object':
        # Impute with the most frequent value
        data[col] = data[col].fillna(data[col].value_counts().index[0])

# If there are still any remaining NaN values, you can fill them with a constant or additional strategies
data = data.fillna(0)  # Filling remaining NaN values with 0, adjust as needed

# Count the number of NaNs in the dataset and print the counts to verify
print(data.isna().sum())


0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64


In [71]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,f,g,00202,0,+
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,00043,560,+
2,a,24.50,0.500,u,g,q,h,1.50,t,f,0,f,g,00280,824,+
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,00100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,00120,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,00260,0,-
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,t,g,00200,394,-
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,t,g,00200,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,00280,750,-


In [72]:
numeric_data

Unnamed: 0,2,7,10,14
0,0.000,1.25,1,0
1,4.460,3.04,6,560
2,0.500,1.50,0,824
3,1.540,3.75,5,3
4,5.625,1.71,0,0
...,...,...,...,...
685,10.085,1.25,0,0
686,0.750,2.00,2,394
687,13.500,2.00,1,1
688,0.205,0.04,0,750


<h3>Exploratory Data Analysis</h3>

In [73]:
profile=ProfileReport(data,explorative=True)
profile.to_file('output.html',)

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

In [74]:
HTML('output.html')

0,1
Number of variables,16
Number of observations,690
Missing cells,0
Missing cells (%),0.0%
Duplicate rows,0
Duplicate rows (%),0.0%
Total size in memory,496.2 KiB
Average record size in memory,736.4 B

0,1
Categorical,7
Text,2
Numeric,4
Boolean,3

0,1
3 is highly overall correlated with 4 and 2 other fields,High correlation
4 is highly overall correlated with 3 and 2 other fields,High correlation
5 is highly overall correlated with 6 and 1 other fields,High correlation
6 is highly overall correlated with 3 and 3 other fields,High correlation
8 is highly overall correlated with 15,High correlation
12 is highly overall correlated with 3 and 3 other fields,High correlation
15 is highly overall correlated with 8,High correlation
3 is highly imbalanced (55.8%),Imbalance
4 is highly imbalanced (55.8%),Imbalance
12 is highly imbalanced (68.4%),Imbalance

0,1
Analysis started,2023-11-10 21:32:05.520212
Analysis finished,2023-11-10 21:32:09.507853
Duration,3.99 seconds
Software version,ydata-profiling vv4.5.1
Download configuration,config.json

0,1
Distinct,3
Distinct (%),0.4%
Missing,0
Missing (%),0.0%
Memory size,39.2 KiB

0,1
b,468
a,210
?,12

0,1
Max length,1
Median length,1
Mean length,1
Min length,1

0,1
Total characters,690
Distinct characters,3
Distinct categories,2 ?
Distinct scripts,2 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,b
2nd row,a
3rd row,a
4th row,b
5th row,b

Value,Count,Frequency (%)
b,468,67.8%
a,210,30.4%
?,12,1.7%

Value,Count,Frequency (%)
b,468,67.8%
a,210,30.4%
,12,1.7%

Value,Count,Frequency (%)
b,468,67.8%
a,210,30.4%
?,12,1.7%

Value,Count,Frequency (%)
Lowercase Letter,678,98.3%
Other Punctuation,12,1.7%

Value,Count,Frequency (%)
b,468,69.0%
a,210,31.0%

Value,Count,Frequency (%)
?,12,100.0%

Value,Count,Frequency (%)
Latin,678,98.3%
Common,12,1.7%

Value,Count,Frequency (%)
b,468,69.0%
a,210,31.0%

Value,Count,Frequency (%)
?,12,100.0%

Value,Count,Frequency (%)
ASCII,690,100.0%

Value,Count,Frequency (%)
b,468,67.8%
a,210,30.4%
?,12,1.7%

0,1
Distinct,350
Distinct (%),50.7%
Missing,0
Missing (%),0.0%
Memory size,41.9 KiB

0,1
Max length,5.0
Median length,5.0
Mean length,4.9304348
Min length,1.0

0,1
Total characters,3402
Distinct characters,12
Distinct categories,2 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,170 ?
Unique (%),24.6%

0,1
1st row,30.83
2nd row,58.67
3rd row,24.5
4th row,27.83
5th row,20.17

Value,Count,Frequency (%)
,12,1.7%
22.67,9,1.3%
20.42,7,1.0%
18.83,6,0.9%
24.50,6,0.9%
25.00,6,0.9%
19.17,6,0.9%
22.50,6,0.9%
20.67,6,0.9%
23.58,6,0.9%

Value,Count,Frequency (%)
.,678,19.9%
2,518,15.2%
3,381,11.2%
5,333,9.8%
0,287,8.4%
7,271,8.0%
8,250,7.3%
1,227,6.7%
4,197,5.8%
6,136,4.0%

Value,Count,Frequency (%)
Decimal Number,2712,79.7%
Other Punctuation,690,20.3%

Value,Count,Frequency (%)
2,518,19.1%
3,381,14.0%
5,333,12.3%
0,287,10.6%
7,271,10.0%
8,250,9.2%
1,227,8.4%
4,197,7.3%
6,136,5.0%
9,112,4.1%

Value,Count,Frequency (%)
.,678,98.3%
?,12,1.7%

Value,Count,Frequency (%)
Common,3402,100.0%

Value,Count,Frequency (%)
.,678,19.9%
2,518,15.2%
3,381,11.2%
5,333,9.8%
0,287,8.4%
7,271,8.0%
8,250,7.3%
1,227,6.7%
4,197,5.8%
6,136,4.0%

Value,Count,Frequency (%)
ASCII,3402,100.0%

Value,Count,Frequency (%)
.,678,19.9%
2,518,15.2%
3,381,11.2%
5,333,9.8%
0,287,8.4%
7,271,8.0%
8,250,7.3%
1,227,6.7%
4,197,5.8%
6,136,4.0%

0,1
Distinct,215
Distinct (%),31.2%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,4.7587246

0,1
Minimum,0
Maximum,28
Zeros,19
Zeros (%),2.8%
Negative,0
Negative (%),0.0%
Memory size,5.5 KiB

0,1
Minimum,0.0
5-th percentile,0.165
Q1,1.0
median,2.75
Q3,7.2075
95-th percentile,14.0
Maximum,28.0
Range,28.0
Interquartile range (IQR),6.2075

0,1
Standard deviation,4.9781632
Coefficient of variation (CV),1.0461129
Kurtosis,2.2740219
Mean,4.7587246
Median Absolute Deviation (MAD),2.21
Skewness,1.4888131
Sum,3283.52
Variance,24.782109
Monotonicity,Not monotonic

Value,Count,Frequency (%)
1.5,21,3.0%
0,19,2.8%
3,19,2.8%
2.5,19,2.8%
0.75,16,2.3%
1.25,16,2.3%
0.5,15,2.2%
5,14,2.0%
6.5,12,1.7%
1.75,12,1.7%

Value,Count,Frequency (%)
0.0,19,2.8%
0.04,5,0.7%
0.08,1,0.1%
0.085,1,0.1%
0.125,5,0.7%
0.165,8,1.2%
0.17,1,0.1%
0.205,3,0.4%
0.21,3,0.4%
0.25,6,0.9%

Value,Count,Frequency (%)
28.0,1,0.1%
26.335,1,0.1%
25.21,1,0.1%
25.125,1,0.1%
25.085,1,0.1%
22.29,1,0.1%
22.0,1,0.1%
21.5,1,0.1%
21.0,1,0.1%
20.0,1,0.1%

0,1
Distinct,4
Distinct (%),0.6%
Missing,0
Missing (%),0.0%
Memory size,39.2 KiB

0,1
u,519
y,163
?,6
l,2

0,1
Max length,1
Median length,1
Mean length,1
Min length,1

0,1
Total characters,690
Distinct characters,4
Distinct categories,2 ?
Distinct scripts,2 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,u
2nd row,u
3rd row,u
4th row,u
5th row,u

Value,Count,Frequency (%)
u,519,75.2%
y,163,23.6%
?,6,0.9%
l,2,0.3%

Value,Count,Frequency (%)
u,519,75.2%
y,163,23.6%
,6,0.9%
l,2,0.3%

Value,Count,Frequency (%)
u,519,75.2%
y,163,23.6%
?,6,0.9%
l,2,0.3%

Value,Count,Frequency (%)
Lowercase Letter,684,99.1%
Other Punctuation,6,0.9%

Value,Count,Frequency (%)
u,519,75.9%
y,163,23.8%
l,2,0.3%

Value,Count,Frequency (%)
?,6,100.0%

Value,Count,Frequency (%)
Latin,684,99.1%
Common,6,0.9%

Value,Count,Frequency (%)
u,519,75.9%
y,163,23.8%
l,2,0.3%

Value,Count,Frequency (%)
?,6,100.0%

Value,Count,Frequency (%)
ASCII,690,100.0%

Value,Count,Frequency (%)
u,519,75.2%
y,163,23.6%
?,6,0.9%
l,2,0.3%

0,1
Distinct,4
Distinct (%),0.6%
Missing,0
Missing (%),0.0%
Memory size,39.2 KiB

0,1
g,519
p,163
?,6
gg,2

0,1
Max length,2.0
Median length,1.0
Mean length,1.0028986
Min length,1.0

0,1
Total characters,692
Distinct characters,3
Distinct categories,2 ?
Distinct scripts,2 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,g
2nd row,g
3rd row,g
4th row,g
5th row,g

Value,Count,Frequency (%)
g,519,75.2%
p,163,23.6%
?,6,0.9%
gg,2,0.3%

Value,Count,Frequency (%)
g,519,75.2%
p,163,23.6%
,6,0.9%
gg,2,0.3%

Value,Count,Frequency (%)
g,523,75.6%
p,163,23.6%
?,6,0.9%

Value,Count,Frequency (%)
Lowercase Letter,686,99.1%
Other Punctuation,6,0.9%

Value,Count,Frequency (%)
g,523,76.2%
p,163,23.8%

Value,Count,Frequency (%)
?,6,100.0%

Value,Count,Frequency (%)
Latin,686,99.1%
Common,6,0.9%

Value,Count,Frequency (%)
g,523,76.2%
p,163,23.8%

Value,Count,Frequency (%)
?,6,100.0%

Value,Count,Frequency (%)
ASCII,692,100.0%

Value,Count,Frequency (%)
g,523,75.6%
p,163,23.6%
?,6,0.9%

0,1
Distinct,15
Distinct (%),2.2%
Missing,0
Missing (%),0.0%
Memory size,39.4 KiB

0,1
c,137
q,78
w,64
i,59
aa,54
Other values (10),298

0,1
Max length,2.0
Median length,1.0
Mean length,1.2144928
Min length,1.0

0,1
Total characters,838
Distinct characters,14
Distinct categories,2 ?
Distinct scripts,2 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,w
2nd row,q
3rd row,q
4th row,w
5th row,w

Value,Count,Frequency (%)
c,137,19.9%
q,78,11.3%
w,64,9.3%
i,59,8.6%
aa,54,7.8%
ff,53,7.7%
k,51,7.4%
cc,41,5.9%
m,38,5.5%
x,38,5.5%

Value,Count,Frequency (%)
c,137,19.9%
q,78,11.3%
w,64,9.3%
i,59,8.6%
aa,54,7.8%
ff,53,7.7%
k,51,7.4%
cc,41,5.9%
m,38,5.5%
x,38,5.5%

Value,Count,Frequency (%)
c,219,26.1%
a,108,12.9%
f,106,12.6%
q,78,9.3%
w,64,7.6%
i,59,7.0%
k,51,6.1%
m,38,4.5%
x,38,4.5%
d,30,3.6%

Value,Count,Frequency (%)
Lowercase Letter,829,98.9%
Other Punctuation,9,1.1%

Value,Count,Frequency (%)
c,219,26.4%
a,108,13.0%
f,106,12.8%
q,78,9.4%
w,64,7.7%
i,59,7.1%
k,51,6.2%
m,38,4.6%
x,38,4.6%
d,30,3.6%

Value,Count,Frequency (%)
?,9,100.0%

Value,Count,Frequency (%)
Latin,829,98.9%
Common,9,1.1%

Value,Count,Frequency (%)
c,219,26.4%
a,108,13.0%
f,106,12.8%
q,78,9.4%
w,64,7.7%
i,59,7.1%
k,51,6.2%
m,38,4.6%
x,38,4.6%
d,30,3.6%

Value,Count,Frequency (%)
?,9,100.0%

Value,Count,Frequency (%)
ASCII,838,100.0%

Value,Count,Frequency (%)
c,219,26.1%
a,108,12.9%
f,106,12.6%
q,78,9.3%
w,64,7.6%
i,59,7.0%
k,51,6.1%
m,38,4.5%
x,38,4.5%
d,30,3.6%

0,1
Distinct,10
Distinct (%),1.4%
Missing,0
Missing (%),0.0%
Memory size,39.3 KiB

0,1
v,399
h,138
bb,59
ff,57
?,9
Other values (5),28

0,1
Max length,2.0
Median length,1.0
Mean length,1.1768116
Min length,1.0

0,1
Total characters,812
Distinct characters,10
Distinct categories,2 ?
Distinct scripts,2 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,v
2nd row,h
3rd row,h
4th row,v
5th row,v

Value,Count,Frequency (%)
v,399,57.8%
h,138,20.0%
bb,59,8.6%
ff,57,8.3%
?,9,1.3%
j,8,1.2%
z,8,1.2%
dd,6,0.9%
n,4,0.6%
o,2,0.3%

Value,Count,Frequency (%)
v,399,57.8%
h,138,20.0%
bb,59,8.6%
ff,57,8.3%
,9,1.3%
j,8,1.2%
z,8,1.2%
dd,6,0.9%
n,4,0.6%
o,2,0.3%

Value,Count,Frequency (%)
v,399,49.1%
h,138,17.0%
b,118,14.5%
f,114,14.0%
d,12,1.5%
?,9,1.1%
j,8,1.0%
z,8,1.0%
n,4,0.5%
o,2,0.2%

Value,Count,Frequency (%)
Lowercase Letter,803,98.9%
Other Punctuation,9,1.1%

Value,Count,Frequency (%)
v,399,49.7%
h,138,17.2%
b,118,14.7%
f,114,14.2%
d,12,1.5%
j,8,1.0%
z,8,1.0%
n,4,0.5%
o,2,0.2%

Value,Count,Frequency (%)
?,9,100.0%

Value,Count,Frequency (%)
Latin,803,98.9%
Common,9,1.1%

Value,Count,Frequency (%)
v,399,49.7%
h,138,17.2%
b,118,14.7%
f,114,14.2%
d,12,1.5%
j,8,1.0%
z,8,1.0%
n,4,0.5%
o,2,0.2%

Value,Count,Frequency (%)
?,9,100.0%

Value,Count,Frequency (%)
ASCII,812,100.0%

Value,Count,Frequency (%)
v,399,49.1%
h,138,17.0%
b,118,14.5%
f,114,14.0%
d,12,1.5%
?,9,1.1%
j,8,1.0%
z,8,1.0%
n,4,0.5%
o,2,0.2%

0,1
Distinct,132
Distinct (%),19.1%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,2.2234058

0,1
Minimum,0
Maximum,28.5
Zeros,70
Zeros (%),10.1%
Negative,0
Negative (%),0.0%
Memory size,5.5 KiB

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.165
median,1.0
Q3,2.625
95-th percentile,8.56875
Maximum,28.5
Range,28.5
Interquartile range (IQR),2.46

0,1
Standard deviation,3.3465134
Coefficient of variation (CV),1.5051294
Kurtosis,11.200192
Mean,2.2234058
Median Absolute Deviation (MAD),0.915
Skewness,2.8913304
Sum,1534.15
Variance,11.199152
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,70,10.1%
0.25,35,5.1%
0.04,33,4.8%
1,31,4.5%
0.125,30,4.3%
0.5,28,4.1%
0.085,26,3.8%
1.5,25,3.6%
0.165,22,3.2%
2.5,17,2.5%

Value,Count,Frequency (%)
0.0,70,10.1%
0.04,33,4.8%
0.085,26,3.8%
0.125,30,4.3%
0.165,22,3.2%
0.21,6,0.9%
0.25,35,5.1%
0.29,12,1.7%
0.335,5,0.7%
0.375,7,1.0%

Value,Count,Frequency (%)
28.5,1,0.1%
20.0,2,0.3%
18.0,1,0.1%
17.5,1,0.1%
16.0,1,0.1%
15.5,1,0.1%
15.0,3,0.4%
14.415,1,0.1%
14.0,3,0.4%
13.875,2,0.3%

0,1
Distinct,2
Distinct (%),0.3%
Missing,0
Missing (%),0.0%
Memory size,818.0 B

0,1
True,361
False,329

Value,Count,Frequency (%)
True,361,52.3%
False,329,47.7%

0,1
Distinct,2
Distinct (%),0.3%
Missing,0
Missing (%),0.0%
Memory size,818.0 B

0,1
False,395
True,295

Value,Count,Frequency (%)
False,395,57.2%
True,295,42.8%

0,1
Distinct,23
Distinct (%),3.3%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,2.4

0,1
Minimum,0
Maximum,67
Zeros,395
Zeros (%),57.2%
Negative,0
Negative (%),0.0%
Memory size,5.5 KiB

0,1
Minimum,0
5-th percentile,0
Q1,0
median,0
Q3,3
95-th percentile,11
Maximum,67
Range,67
Interquartile range (IQR),3

0,1
Standard deviation,4.86294
Coefficient of variation (CV),2.026225
Kurtosis,50.829431
Mean,2.4
Median Absolute Deviation (MAD),0
Skewness,5.1525199
Sum,1656
Variance,23.648186
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,395,57.2%
1,71,10.3%
2,45,6.5%
3,28,4.1%
6,23,3.3%
11,19,2.8%
5,18,2.6%
7,16,2.3%
4,15,2.2%
9,10,1.4%

Value,Count,Frequency (%)
0,395,57.2%
1,71,10.3%
2,45,6.5%
3,28,4.1%
4,15,2.2%
5,18,2.6%
6,23,3.3%
7,16,2.3%
8,10,1.4%
9,10,1.4%

Value,Count,Frequency (%)
67,1,0.1%
40,1,0.1%
23,1,0.1%
20,2,0.3%
19,1,0.1%
17,2,0.3%
16,3,0.4%
15,4,0.6%
14,8,1.2%
13,1,0.1%

0,1
Distinct,2
Distinct (%),0.3%
Missing,0
Missing (%),0.0%
Memory size,818.0 B

0,1
False,374
True,316

Value,Count,Frequency (%)
False,374,54.2%
True,316,45.8%

0,1
Distinct,3
Distinct (%),0.4%
Missing,0
Missing (%),0.0%
Memory size,39.2 KiB

0,1
g,625
s,57
p,8

0,1
Max length,1
Median length,1
Mean length,1
Min length,1

0,1
Total characters,690
Distinct characters,3
Distinct categories,1 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,g
2nd row,g
3rd row,g
4th row,g
5th row,s

Value,Count,Frequency (%)
g,625,90.6%
s,57,8.3%
p,8,1.2%

Value,Count,Frequency (%)
g,625,90.6%
s,57,8.3%
p,8,1.2%

Value,Count,Frequency (%)
g,625,90.6%
s,57,8.3%
p,8,1.2%

Value,Count,Frequency (%)
Lowercase Letter,690,100.0%

Value,Count,Frequency (%)
g,625,90.6%
s,57,8.3%
p,8,1.2%

Value,Count,Frequency (%)
Latin,690,100.0%

Value,Count,Frequency (%)
g,625,90.6%
s,57,8.3%
p,8,1.2%

Value,Count,Frequency (%)
ASCII,690,100.0%

Value,Count,Frequency (%)
g,625,90.6%
s,57,8.3%
p,8,1.2%

0,1
Distinct,171
Distinct (%),24.8%
Missing,0
Missing (%),0.0%
Memory size,41.9 KiB

0,1
Max length,5.0
Median length,5.0
Mean length,4.9246377
Min length,1.0

0,1
Total characters,3398
Distinct characters,11
Distinct categories,2 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,104 ?
Unique (%),15.1%

0,1
1st row,202
2nd row,43
3rd row,280
4th row,100
5th row,120

Value,Count,Frequency (%)
00000,132,19.1%
00200,35,5.1%
00120,35,5.1%
00160,34,4.9%
00100,30,4.3%
00080,30,4.3%
00280,22,3.2%
00180,18,2.6%
00140,16,2.3%
00240,14,2.0%

Value,Count,Frequency (%)
0,2320,68.3%
2,257,7.6%
1,233,6.9%
8,117,3.4%
4,117,3.4%
3,114,3.4%
6,106,3.1%
5,52,1.5%
9,35,1.0%
7,34,1.0%

Value,Count,Frequency (%)
Decimal Number,3385,99.6%
Other Punctuation,13,0.4%

Value,Count,Frequency (%)
0,2320,68.5%
2,257,7.6%
1,233,6.9%
8,117,3.5%
4,117,3.5%
3,114,3.4%
6,106,3.1%
5,52,1.5%
9,35,1.0%
7,34,1.0%

Value,Count,Frequency (%)
?,13,100.0%

Value,Count,Frequency (%)
Common,3398,100.0%

Value,Count,Frequency (%)
0,2320,68.3%
2,257,7.6%
1,233,6.9%
8,117,3.4%
4,117,3.4%
3,114,3.4%
6,106,3.1%
5,52,1.5%
9,35,1.0%
7,34,1.0%

Value,Count,Frequency (%)
ASCII,3398,100.0%

Value,Count,Frequency (%)
0,2320,68.3%
2,257,7.6%
1,233,6.9%
8,117,3.4%
4,117,3.4%
3,114,3.4%
6,106,3.1%
5,52,1.5%
9,35,1.0%
7,34,1.0%

0,1
Distinct,240
Distinct (%),34.8%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,1017.3855

0,1
Minimum,0
Maximum,100000
Zeros,295
Zeros (%),42.8%
Negative,0
Negative (%),0.0%
Memory size,5.5 KiB

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
median,5.0
Q3,395.5
95-th percentile,4119.4
Maximum,100000.0
Range,100000.0
Interquartile range (IQR),395.5

0,1
Standard deviation,5210.1026
Coefficient of variation (CV),5.1210702
Kurtosis,214.66997
Mean,1017.3855
Median Absolute Deviation (MAD),5
Skewness,13.140655
Sum,701996
Variance,27145169
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,295,42.8%
1,29,4.2%
500,10,1.4%
1000,10,1.4%
2,9,1.3%
6,8,1.2%
5,8,1.2%
300,8,1.2%
200,6,0.9%
100,6,0.9%

Value,Count,Frequency (%)
0,295,42.8%
1,29,4.2%
2,9,1.3%
3,6,0.9%
4,5,0.7%
5,8,1.2%
6,8,1.2%
7,4,0.6%
8,2,0.3%
9,1,0.1%

Value,Count,Frequency (%)
100000,1,0.1%
51100,1,0.1%
50000,1,0.1%
31285,1,0.1%
26726,1,0.1%
18027,1,0.1%
15108,1,0.1%
15000,1,0.1%
13212,1,0.1%
11202,1,0.1%

0,1
Distinct,2
Distinct (%),0.3%
Missing,0
Missing (%),0.0%
Memory size,39.2 KiB

0,1
-,383
+,307

0,1
Max length,1
Median length,1
Mean length,1
Min length,1

0,1
Total characters,690
Distinct characters,2
Distinct categories,2 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,+
2nd row,+
3rd row,+
4th row,+
5th row,+

Value,Count,Frequency (%)
-,383,55.5%
+,307,44.5%

Value,Count,Frequency (%)
,690,100.0%

Value,Count,Frequency (%)
-,383,55.5%
+,307,44.5%

Value,Count,Frequency (%)
Dash Punctuation,383,55.5%
Math Symbol,307,44.5%

Value,Count,Frequency (%)
-,383,100.0%

Value,Count,Frequency (%)
+,307,100.0%

Value,Count,Frequency (%)
Common,690,100.0%

Value,Count,Frequency (%)
-,383,55.5%
+,307,44.5%

Value,Count,Frequency (%)
ASCII,690,100.0%

Value,Count,Frequency (%)
-,383,55.5%
+,307,44.5%

Unnamed: 0,2,7,10,14,0,3,4,5,6,8,9,11,12,15
2,1.0,0.266,0.204,0.106,0.0,0.177,0.177,0.082,0.163,0.237,0.168,0.1,0.1,0.222
7,0.266,1.0,0.315,0.087,0.0,0.092,0.092,0.064,0.192,0.294,0.171,0.119,0.0,0.279
10,0.204,0.315,1.0,0.427,0.0,0.04,0.04,0.0,0.0,0.338,0.455,0.085,0.0,0.37
14,0.106,0.087,0.427,1.0,0.0,0.399,0.399,0.229,0.279,0.087,0.013,0.01,0.233,0.116
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.249,0.168,0.08,0.07,0.003,0.038,0.02
3,0.177,0.092,0.04,0.399,0.0,1.0,1.0,0.468,0.547,0.173,0.188,0.08,0.637,0.188
4,0.177,0.092,0.04,0.399,0.0,1.0,1.0,0.468,0.547,0.173,0.188,0.08,0.637,0.188
5,0.082,0.064,0.0,0.229,0.249,0.468,0.468,1.0,0.637,0.307,0.262,0.141,0.516,0.35
6,0.163,0.192,0.0,0.279,0.168,0.547,0.547,0.637,1.0,0.261,0.113,0.131,0.523,0.229
8,0.237,0.294,0.338,0.087,0.08,0.173,0.173,0.307,0.261,1.0,0.428,0.08,0.139,0.717

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+
5,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360,0,+
6,b,33.17,1.04,u,g,r,h,6.5,t,f,0,t,g,164,31285,+
7,a,22.92,11.585,u,g,cc,v,0.04,t,f,0,f,g,80,1349,+
8,b,54.42,0.5,y,p,k,h,3.96,t,f,0,f,g,180,314,+
9,b,42.5,4.915,y,p,w,v,3.165,t,f,0,t,g,52,1442,+

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-
683,b,36.42,0.75,y,p,d,v,0.585,f,f,0,f,g,240,3,-
684,b,40.58,3.29,u,g,m,v,3.5,f,f,0,t,s,400,0,-
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260,0,-
686,a,22.67,0.75,u,g,c,v,2.0,f,t,2,t,g,200,394,-
687,a,25.25,13.5,y,p,ff,ff,2.0,f,t,1,t,g,200,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280,750,-
689,b,35.0,3.375,u,g,c,h,8.29,f,f,0,t,g,0,0,-


<h3>6. Preprocessing the data (part i)</h3>
<p>The missing values are now successfully handled.

There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. We are going to divide these remaining preprocessing steps into three main tasks:

Convert the non-numeric data into numeric. Split the data into train and test sets. Scale the feature values to a uniform range. First, we will be converting all the non-numeric values into numeric ones. We do this because not only it results in a faster computation but also many machine learning models (like XGBoost) (and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using a technique called label encoding.</p>

In [75]:
le=LabelEncoder()
# Iterate over all the values of each column and extract their dtypes
for col in data.columns.values:
    # Compare if the dtype is object
    if data[col].dtypes =='object':
    # Use LabelEncoder to do the numeric transformation
        data[col]=le.fit_transform(data[col])

<h3>7. Splitting the dataset into train and test sets</h3>
We have successfully converted all the non-numeric values to numeric ones.

Now, we will split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to scale the training data or should be used to direct the training process of a machine learning model. Hence, we first split the data and then apply the scaling.

Also, features like DriversLicense and ZipCode are not as important as the other features in the dataset for predicting credit card approvals. We should drop them to design our machine learning model with the best set of features. In Data Science literature, this is often referred to as feature selection.

In [82]:
# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
data = data.iloc[:,:11].join([data.iloc[:,12],data.iloc[:,14:]])
data = data.values

# Segregate features and labels into separate variables
X,y = data[:,0:12] , data[:,13]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33,random_state=42)     

<h3>8. Preprocessing the data (part ii)</h3>
The data is now split into two separate sets - train and test sets respectively. We are only left with one final preprocessing step of scaling before we can fit a machine learning model to the data.

Now, let's try to understand what these scaled values mean in the real world. Let's use CreditScore as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a CreditScore of 1 is the highest since we're rescaling all the values to the range of 0-1.

In [84]:
# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.fit_transform(X_test)

<h3>9.Fitting a logistic regression model to the train set</h3>
Essentially, predicting if a credit card application will be approved or not is a classification task. According to UCI, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.

This gives us a benchmark. A good machine learning model should be able to accurately predict the status of the applications with respect to these statistics.

Which model should we pick? A question to ask is: are the features that affect the credit card approval decision process correlated with each other? Although we can measure correlation, that is outside the scope of this notebook, so we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).

In [111]:
# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train,y_train)

rfc=RandomForestClassifier()
rfc.fit(rescaledX_train,y_train)

gbc=GradientBoostingClassifier()
gbc.fit(rescaledX_train,y_train)

hgbc=HistGradientBoostingClassifier()
hgbc.fit(rescaledX_train,y_train)

xgb=xgboost.XGBClassifier()
xgb.fit(rescaledX_train,y_train)

<h3>10. Making predictions and evaluating performance</h3>

But how well does our model perform?

We will now evaluate our model on the test set with respect to classification accuracy. But we will also take a look the model's confusion matrix. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict the approval status of the applications as denied that originally got denied. If our model is not performing well in this aspect, then it might end up approving the application that should have been approved. The confusion matrix helps us to view our model's performance from these aspects.

In [113]:
# Use logreg to predict instances from the test set and store it
y_pred_log = logreg.predict(rescaledX_test)
y_pred_rfc = rfc.predict(rescaledX_test)
y_pred_gbc = gbc.predict(rescaledX_test)
y_pred_hgbc = hgbc.predict(rescaledX_test)
y_pred_xgb = xgb.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of Random Forest Classifier: ", sklearn.metrics.accuracy_score(y_test,y_pred_rfc))
print("Accuracy of HistGradient Boosted Classifier: ", sklearn.metrics.accuracy_score(y_test,y_pred_hgbc))
print("Accuracy of Logistic Regression Classifier: ", sklearn.metrics.accuracy_score(y_test,y_pred_log))
print("Accuracy of Gradient Boosted Classifier: ", sklearn.metrics.accuracy_score(y_test,y_pred_gbc))
print("Accuracy of XG Boost Classifier: ", sklearn.metrics.accuracy_score(y_test,y_pred_xgb))

Accuracy of Random Forest Classifier:  0.8640350877192983
Accuracy of HistGradient Boosted Classifier:  0.8552631578947368
Accuracy of Logistic Regression Classifier:  0.8333333333333334
Accuracy of Gradient Boosted Classifier:  0.7017543859649122
Accuracy of XG Boost Classifier:  0.7324561403508771
[[92 11]
 [27 98]]
[[ 91  12]
 [ 19 106]]


In [114]:
#best output accuracy is given by random forest classifier
print(confusion_matrix(y_test,y_pred_rfc))

[[ 91  12]
 [ 19 106]]


Our model was pretty good! It was able to yield an accuracy score of almost 84%.

For the confusion matrix, the first element of the of the first row of the confusion matrix denotes the true negatives meaning the number of negative instances (denied applications) predicted by the model correctly. And the last element of the second row of the confusion matrix denotes the true positives meaning the number of positive instances (approved applications) predicted by the model correctly.

While building this credit card predictor, we tackled some of the most widely-known preprocessing steps such as scaling, label encoding, and missing value imputation. We finished with some machine learning to predict if a person's application for a credit card would get approved or not given some information about that person.