# Credit Card Approval Model

#### Motivation

Commercial banks receive a large number of credit card applications. Many are rejected for reasons such as high loan balances, low income, or too many inquiries on the applicant's credit report. Analyzing these applications by hand is tedious, error-prone, and time-consuming (and time is money!). Luckily, the task can be automated with machine learning, and pretty much every commercial bank does so nowadays.

#### Importing Libraries and Dependencies

In [1]:
import numpy as np
import pandas as pd

#### Data Processing

Data collected from: https://www.kaggle.com/datasets/devzohaib/predicting-credit-card-approvals

Let's import the data and clean it first.

In [2]:
df = pd.read_csv("dat/cc_approvals.data")
print(f"Shape: {df.shape}")
df.head(5)

Shape: (689, 16)


Unnamed: 0,b,30.83,0,u,g,w,v,1.25,t,t.1,01,f,g.1,00202,0.1,+
0,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
1,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
2,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+
4,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360,0,+


*Note: This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols by the author of the data to protect its confidentiality.*
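
The header row in the preview above (`b`, `30.83`, `0`, ...) looks like an ordinary data record, which suggests the file has no header line and `read_csv` used the first record as column names. A minimal sketch of reading such a file without losing that record, assuming the file really is header-less; the three-row `sample` is a hypothetical stand-in for `dat/cc_approvals.data`:

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for dat/cc_approvals.data
sample = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560,+\n"
    "a,24.50,0.5,u,g,q,h,1.5,t,f,0,f,g,00280,824,+\n"
)

# header=None tells pandas that the first line is data, not column names;
# the columns are then labelled 0..15 until we rename them
df_all = pd.read_csv(sample, header=None)
print(f"Shape: {df_all.shape}")
```

With this variant the column labels are the integers 0 to 15, so a later rename would have to map those integers instead of the first record's values.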

###### Changing Column Names
Since the attribute names have been anonymized, the current column labels give us no meaningful information. Let's make them more human-readable and systematic by assigning arbitrary column names.

I will prefix the numeric variables with **N** and the categorical variables with **C**. This is how the df looks now:

In [3]:
df.rename({'b':"C1", 'u':"C2", 'g':"C3",'w':"C4",'v':"C5",'t':"C6",'t.1':"C7",'01':"C8",'f':"C9",'g.1':"C10",
          '30.83':"N1",'0':"N2",'1.25':"N3",'0.1':"N5",'00202':"N4",
          '+':"Approval"}, axis=1, inplace=True)
# Reordering columns
df = df[["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10", "N1", "N2", "N3", "N4", "N5", "Approval"]]

# Correcting the data types of the categorical predictor variables
cat_cols = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10"]
df = df.astype({col: "category" for col in cat_cols})
#df["N1"] = pd.to_numeric(df["N1"])

# Updating the target column to contain 1 when the application was approved, 0 otherwise
df.loc[(df['Approval']=='-'), 'Approval'] = 0
df.loc[(df['Approval']=='+'), 'Approval'] = 1
df = df.astype({'Approval': 'int32'})
df.head(5)

Unnamed: 0,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,N1,N2,N3,N4,N5,Approval
0,a,u,g,q,h,t,t,6,f,g,58.67,4.46,3.04,43,560,1
1,a,u,g,q,h,t,f,0,f,g,24.5,0.5,1.5,280,824,1
2,b,u,g,w,v,t,t,5,t,g,27.83,1.54,3.75,100,3,1
3,b,u,g,w,v,t,f,0,f,s,20.17,5.625,1.71,120,0,1
4,b,u,g,m,v,t,f,0,t,g,32.08,4.0,2.5,360,0,1
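
The target recoding above can also be written as a single mapping step. This is only an alternative sketch, not the notebook's own code; the small `demo` frame is hypothetical and stands in for `df`:

```python
import pandas as pd

# Hypothetical stand-in for the Approval column of df
demo = pd.DataFrame({"Approval": ["+", "-", "+"]})

# Map '+' to 1 and '-' to 0 in one pass, then store the result as int32
demo["Approval"] = demo["Approval"].map({"+": 1, "-": 0}).astype("int32")
print(demo["Approval"].tolist())
```

A mapping keeps both label values in one place. Any label outside the dictionary becomes NaN, so the final `astype("int32")` would raise on unexpected values instead of passing them through silently.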


#### EDA
Let's see what the data tells us at a first glance.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689 entries, 0 to 688
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   C1        689 non-null    category
 1   C2        689 non-null    category
 2   C3        689 non-null    category
 3   C4        689 non-null    category
 4   C5        689 non-null    category
 5   C6        689 non-null    category
 6   C7        689 non-null    category
 7   C8        689 non-null    category
 8   C9        689 non-null    category
 9   C10       689 non-null    category
 10  N1        689 non-null    object  
 11  N2        689 non-null    float64 
 12  N3        689 non-null    float64 
 13  N4        689 non-null    object  
 14  N5        689 non-null    int64   
 15  Approval  689 non-null    int32   
dtypes: category(10), float64(2), int32(1), int64(1), object(2)
memory usage: 39.2+ KB
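
`N1` and `N4` are reported as `object` even though they hold numbers, most likely because some of their entries are the placeholder `?`. A minimal sketch of converting such columns to a numeric dtype, assuming `?` is the only non-numeric value; the `demo` frame is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the N1 and N4 columns, with '?' marking missing values
demo = pd.DataFrame({"N1": ["58.67", "?", "24.5"], "N4": ["00043", "00280", "?"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
for col in ["N1", "N4"]:
    demo[col] = pd.to_numeric(demo[col], errors="coerce")

print(demo.dtypes)
```

Because NaN is a floating-point value, both columns end up as `float64`, including `N4`, whose remaining values are whole numbers.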


###### Some more Data Processing

Let's convert the categorical variables into numeric representations.
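
Two common ways to get a numeric representation of a categorical column are integer codes and one-hot encoding. The sketch below shows both on a hypothetical `demo` frame; which encoding suits this data set is an assumption left open here:

```python
import pandas as pd

# Hypothetical stand-in for a few columns of df
demo = pd.DataFrame({
    "C1": pd.Categorical(["a", "b", "b"]),
    "C6": pd.Categorical(["t", "t", "f"]),
    "N2": [4.46, 0.5, 1.54],
})

# Option 1: integer codes, one integer per category (categories are sorted: a=0, b=1)
c1_codes = demo["C1"].cat.codes

# Option 2: one-hot encoding, one indicator column per category
one_hot = pd.get_dummies(demo, columns=["C1", "C6"])

print(c1_codes.tolist())
print(list(one_hot.columns))
```

Integer codes are compact but impose an artificial order on the categories. One-hot encoding avoids that at the cost of more columns.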

###### Dropping Missing Values
This data set marks missing values with `?`, and they occur in several columns. For simplicity, and since we have a moderately large sample size, let's drop all rows with missing values.

In [5]:
# Missing values are encoded as "?"; convert them to NaN so pandas can detect them
df.replace(to_replace=["?"], value=np.nan, inplace=True)

Let's create a new data frame so that we can still use the old `df` object if we want to explore the missing values in more detail.

In [6]:
# df_d is the DataFrame after dropping missing values
df_d = df.dropna(axis='index', how='any') 
df.shape

(689, 16)

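We ended up dropping 37 rows of data. Note that `df` itself still holds all 689 rows; only `df_d` is restricted to the rows without missing values.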
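
The number of dropped rows can be checked directly by comparing the two frames, and a per-column count shows where the gaps are. A minimal sketch on a hypothetical `demo` frame that follows the same replace-then-drop steps:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, with '?' marking missing values
demo = pd.DataFrame({
    "C1": ["a", "?", "b", "b"],
    "N1": ["58.67", "24.5", "?", "20.17"],
})

demo = demo.replace("?", np.nan)
demo_d = demo.dropna(axis="index", how="any")

# Missing values per column, and rows lost by dropping incomplete records
missing_per_column = demo.isna().sum()
rows_dropped = len(demo) - len(demo_d)

print(missing_per_column)
print(f"Rows dropped: {rows_dropped}")
```

Applied to the notebook's frames, the same comparison would be `len(df) - len(df_d)`.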