# Predicting Credit Card Application Approvals

<br />
Commercial banks receive a large number of credit card applications. Manually analyzing these applications is mundane, error-prone, and time-consuming, because many factors are considered: high loan balances, low income levels, too many inquiries on a credit report, etc. In this project, we will build an automatic credit card approval predictor using machine learning. 

We will use the Credit Card Approval dataset from the UCI ML Repository. Because the data is confidential, the feature names have been anonymized. First, let's load and look at the dataset. 

In [2]:
import pandas as pd

# load dataset
cc_apps = pd.read_csv('datasets/cc_approvals.data', header=None)

# inspect credit card apps
cc_apps.head()

   0      1      2  3  4  5  6     7  8  9  10 11 12   13   14 15
0  b  30.83    0.0  u  g  w  v  1.25  t  t   1  f  g  202    0  +
1  a  58.67   4.46  u  g  q  h  3.04  t  t   6  f  g   43  560  +
2  a   24.5    0.5  u  g  q  h   1.5  t  f   0  f  g  280  824  +
3  b  27.83   1.54  u  g  w  v  3.75  t  t   5  t  g  100    3  +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0  f  s  120    0  +


<br /><br />
At first, this data may seem a little confusing, since the features of this dataset have been anonymized for privacy purposes. But we can make a reasonable guess at the probable features, such as Gender, Age, Debt, Married, Education Level, Ethnicity, Years Employed, Employed, Credit Score, Drivers License, Citizen, Zip Code, Income and finally the Approval Status. Mapping these features to their respective columns gives us a good starting point. 

As we see from our first inspection, our dataset has a combination of numerical and non-numerical features. We can fix this with some preprocessing, but first we will take a closer look at the dataset to see whether there are any other issues. 

In [3]:
# extract summary statistics, print them
cc_apps_description = cc_apps.describe()
print(cc_apps_description)
print('\n')

# print dataframe info (info() prints its report directly and returns None)
cc_apps.info()
print('\n')

# Inspect missing values in the dataset
cc_apps.tail(17)

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

     0      1       2  3  4   5   6      7  8  9  10 11 12   13   14 15
673  ?   29.5     2.0  y  p   e   h    2.0  f  f   0  f  g  256   17  -
674  a  37.33     2.5  u  g   i   h   0.21  f  f   0  f  g  260  246  -
675  a  41.58    1.04  u  g  aa   v  0.665  f  f   0  f  g  240  237  -
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  129    3  -
677  b  19.42    7.25  u  g   m   v   0.04  f  t   1  f  g  100    1  -
678  a  17.92   10.21  u  g  ff  ff    0.0  f  f   0  f  g    0   50  -
679  a  20.08    1.25  u  g   c   v    0.0  f  f   0  f  g    0    0  -
680  b   19.5    0.29  u  g   k   v   0.29  f  f   0  f  g  280  364  -
681  b  27.83     1.0  y  p   d   h    3.0  f  f   0  f  g  176  537  -
682  b  17.08    3.29  u  g   i   v  0.335  f  f   0  f  g  140    2  -


<br /><br />Now, it's time to split our data into a train set and test set to prepare our data for two different phases of machine learning modelling: training and testing. Ideally, no information from the test data should be used to preprocess training data or to direct the training process of a machine learning model. Hence, we split the data <strong>first</strong> and then preprocess it after.

Also, since features such as Drivers Licenses and Zip Codes are not as important as the other features in the dataset for predicting credit card approvals, we will apply feature selection and drop them when designing our machine learning model.

In [4]:
# import train_test_split
from sklearn.model_selection import train_test_split

# drop features 11 and 13
cc_apps = cc_apps.drop([11, 13], axis=1)

# split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)
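
As a quick sanity check on the split sizes, here is a minimal sketch using a placeholder frame with the same number of rows as our dataset. The frame `dummy` and its column `x` are illustrative assumptions, not part of the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with 690 rows, the same row count as cc_apps
dummy = pd.DataFrame({'x': np.arange(690)})

# Same split settings as above
train_part, test_part = train_test_split(dummy, test_size=0.33, random_state=42)

# With test_size=0.33, scikit-learn rounds the test set up to 228 rows
print(len(train_part), len(test_part))  # 462 228

# No row appears in both parts
print(set(train_part.index).isdisjoint(test_part.index))  # True
```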

<br /><br />In our dataset inspection, we ran into some issues. Now that the data is split, we can address them. 

First, our dataset contains a mix of numeric and non-numeric data. Specifically, features 2, 7, 10 and 14 are stored as numeric values, and all other features are stored as non-numeric (object) values. 

In addition, the numeric features span very different ranges. According to the summary statistics above, feature 2 runs from 0 to 28, feature 10 from 0 to 67, and feature 14 from 0 to 100,000. The same summary also gives the mean, standard deviation and quartiles of each numeric feature.

Finally, the dataset has missing values, labelled '?'. Let's temporarily replace these question marks with NaN.

In [6]:
import numpy as np

# replace '?' with NaN in train/test sets
# (np.nan is the spelling supported across NumPy versions; np.NaN was removed in NumPy 2.0)
cc_apps_train = cc_apps_train.replace('?', np.nan)
cc_apps_test = cc_apps_test.replace('?', np.nan)
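
Before replacing the markers, it can help to see where they occur. The following is a minimal sketch on a small made-up frame. The name `toy` and its values are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for cc_apps; '?' marks a missing entry
toy = pd.DataFrame({
    0: ['b', '?', 'a', 'b'],
    1: ['30.83', '58.67', '?', '27.83'],
    2: [0.0, 4.46, 0.5, 1.54],
})

# Count the '?' markers in each column
question_marks = (toy == '?').sum()
print(question_marks)

# Replace the markers with NaN so that pandas treats them as missing
toy_clean = toy.replace('?', np.nan)
print(toy_clean.isnull().sum())
```

Note that column 1 holds numbers stored as strings, which is why pandas reports such a column as `object` until it is converted.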

<br /><br />We've replaced the missing values with a placeholder, NaN. But why are these missing values so important? Ignoring them can heavily affect the performance of our machine learning model, and many models, such as Linear Discriminant Analysis, cannot handle missing values on their own.

To avoid this problem, we will impute the missing values with mean imputation, using the means of the training set for both the training and test sets.

In [8]:
# impute missing values in the numeric columns with the training-set means
train_means = cc_apps_train.mean(numeric_only=True)
cc_apps_train = cc_apps_train.fillna(train_means)
cc_apps_test = cc_apps_test.fillna(train_means)

# count number of NaN in datasets
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())

0     8
1     5
2     0
3     6
4     6
5     7
6     7
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
0     4
1     7
2     0
3     0
4     0
5     2
6     2
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
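
The same idea, learning the means on the training set and reusing them on the test set, can also be sketched with scikit-learn's `SimpleImputer`. This is an alternative to `fillna`, not the method used in this notebook. The frames `train` and `test` and their column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric frames with missing entries
train = pd.DataFrame({'debt': [1.0, np.nan, 3.0], 'income': [10.0, 20.0, np.nan]})
test = pd.DataFrame({'debt': [np.nan, 5.0], 'income': [np.nan, 40.0]})

# Learn the column means from the training set only
imputer = SimpleImputer(strategy='mean')
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training means for the test set; test statistics are never computed
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(train_filled)
print(test_filled)
```

The missing test values are filled with 2.0 and 15.0, the training means of `debt` and `income`, so nothing about the test set influences the imputation.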




<br /><br />We have successfully taken care of the missing values in our numeric columns. There are still missing values to be imputed in the non-numeric columns 0, 1, 3, 4, 5 and 6. Mean imputation does not work for these columns, so they require a different treatment. 

Instead, we will impute these missing values with the most frequent value of the respective column in the training set, namely the mode. 

In [10]:
# iterate over each column of the training set
for col in cc_apps_train.columns:
    # check if col is object type
    if cc_apps_train[col].dtype == 'object':
        # most frequent value of this column in the training set
        most_frequent = cc_apps_train[col].value_counts().index[0]
        # impute this column only, in both the training and test sets
        cc_apps_train[col] = cc_apps_train[col].fillna(most_frequent)
        cc_apps_test[col] = cc_apps_test[col].fillna(most_frequent)

# count number of NaNs in dataset
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
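
Each column should receive its own mode. A minimal sketch on made-up data follows. The frames and the column names `a` and `b` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical frames with missing entries
train = pd.DataFrame({'a': ['u', 'u', np.nan, 'y'], 'b': ['g', np.nan, 'p', 'p']})
test = pd.DataFrame({'a': [np.nan, 'y'], 'b': [np.nan, 'g']})

for col in train.columns:
    # Mode of this column, computed on the training set only
    most_frequent = train[col].mode()[0]
    # Fill this column alone, so one column's mode never leaks into another
    train[col] = train[col].fillna(most_frequent)
    test[col] = test[col].fillna(most_frequent)

print(train)
print(test)
```

Here column `a` is filled with 'u' and column `b` with 'p'. Filling the whole frame with a single value would instead put the mode of one column into every other column.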


<br /><br />Now that the missing data has been handled, we can carry out the remaining preprocessing, which is minor but essential. It consists of two tasks, the first being to convert non-numeric data into numeric data. We do this not only because it results in faster computation, but also because, as stated before, many machine learning models require data to be in a strictly numeric format. 

We will do this by using the get_dummies() method from pandas.

In [11]:
# convert categorical features in train and test sets
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

# reindex the test set columns aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)
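
To see why the reindexing step matters, consider a minimal sketch in which the test set contains a category that never appears in the training set. The frames and the column names `ethnicity` and `debt` are made up for illustration.

```python
import pandas as pd

# Hypothetical frames: 'h' occurs only in train, 'ff' only in test
train = pd.DataFrame({'ethnicity': ['v', 'h', 'v'], 'debt': [1.0, 2.0, 3.0]})
test = pd.DataFrame({'ethnicity': ['v', 'ff'], 'debt': [4.0, 5.0]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Before aligning, the two frames have different indicator columns
print(list(train_enc.columns))
print(list(test_enc.columns))

# Align the test columns with the training columns:
# 'ethnicity_ff' is dropped and the missing 'ethnicity_h' is added as all zeros
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

After alignment, both frames have the same columns in the same order, which is what a fitted model expects at prediction time.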

<br /><br />Our second task is to rescale the feature values to a uniform range of 0-1, and to understand what these scaled values mean in the real world. For example, credit score determines one's creditworthiness based on credit history. The higher the number, the more financially trustworthy a person is considered to be. Therefore, after rescaling, a credit score of 1 corresponds to the highest credit score seen in the training set. 

In [30]:
from sklearn.preprocessing import MinMaxScaler

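# separate features and labels for both the training and test sets
# NOTE: the label is the last column only; after get_dummies() the other
# indicator column derived from feature 15 remains among the features in X
X_train, y_train = cc_apps_train.iloc[:, :-1].values, cc_apps_train.iloc[:, [-1]].values
X_test, y_test = cc_apps_test.iloc[:, :-1].values, cc_apps_test.iloc[:, [-1]].values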

# instantiate MinMaxScaler and use it to rescale X_train and X_test with a scale of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
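
Min-max scaling maps each value x to (x - min) / (max - min), where min and max come from the training set. A minimal sketch with made-up numbers shows the calculation and checks it by hand. The arrays below are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.5], [12.0]])

# Fit on the training set only, then apply the same mapping to the test set
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train = scaler.fit_transform(X_train)
scaled_test = scaler.transform(X_test)

# The same calculation by hand, using the training minimum and maximum
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)
manual_test = (X_test - train_min) / (train_max - train_min)

print(scaled_train.ravel())
print(scaled_test.ravel())
print(np.allclose(scaled_test, manual_test))
```

Because the scaler is fitted on the training set, a test value above the training maximum (12.0 here) is mapped above 1, so scaled test values are not guaranteed to stay inside 0-1.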

<br /><br />Predicting a credit card application's approval is a classification task. According to UCI, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.

A good machine learning model should predict both classes accurately given this class balance. Which model should we pick? Intuitively, the features that affect credit card approvals are correlated with each other, so we assume that a generalized linear model is a reasonable starting point. Let's start our machine learning modelling with a Logistic Regression model.

In [27]:
# import LogisticRegression
from sklearn.linear_model import LogisticRegression

# instantiate a LogisticRegression classifier
logreg = LogisticRegression()

# fit logreg with training set
logreg.fit(rescaledX_train, y_train.ravel())

LogisticRegression()

<br /><br />How well does our model perform? We will now evaluate our model on the test set with respect to classification accuracy. We will also look at the model's confusion matrix. Since we are predicting credit card applications, it is crucial to check whether our model is equally capable of predicting both approved and denied status, in line with the frequency of these labels in our original dataset. 

Our model must perform well in this aspect, because we do not want our model to approve an application that should have been denied. Thus, the confusion matrix helps us view our model's performance from these aspects.

In [19]:
from sklearn.metrics import confusion_matrix

# use logistic regression to predict instances from the test set
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Logistic regression classifier accuracy: ", (logreg.score(rescaledX_test, y_test) * 100), "%")

# print confusion matrix of the logistic regression model
confusion_matrix(y_test, y_pred)

Logistic regression classifier accuracy:  100.0 %


array([[103,   0],
       [  0, 125]], dtype=int64)
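
As a reminder of how to read the output of `confusion_matrix`, rows are the true classes and columns are the predicted classes, both in sorted label order. A minimal sketch with made-up labels follows. The lists `y_true_toy` and `y_pred_toy` are illustrative assumptions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six applications
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# For a binary problem the four cells unpack in this order
tn, fp, fn, tp = cm.ravel()
print("true negatives:", tn, "false positives:", fp)
print("false negatives:", fn, "true positives:", tp)
```

The off-diagonal cells (false positives and false negatives) are the misclassifications, so a matrix with zeros off the diagonal means every test instance was classified correctly.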

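<br /><br />Our model yields a classification accuracy of 100% on the test set, with no misclassifications in the confusion matrix. In the matrix, rows are the true classes and columns are the predicted classes. Because the label is the last column produced by get_dummies(), which is the indicator for '-', class 0 means approved and class 1 means denied. The first element of the first row therefore counts the approved applications predicted as approved (103), while the second element of the last row counts the denied applications predicted as denied (125). 

A perfect score should be treated with caution. get_dummies() was applied to the whole frame, so the approval column 15 was itself split into two complementary indicator columns. Only the last of them was taken as the label, which leaves the other among the features and lets the model read the answer directly. The 100% figure therefore most likely reflects target leakage and not real predictive power.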
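
A perfect score is worth double-checking for target leakage. The minimal sketch below uses a small made-up frame to show how `pd.get_dummies` on a frame that still contains the label produces two complementary label indicators, and one possible way to avoid this by separating the label first. The frame `toy` and the column names `debt`, `employed` and `status` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame; 'status' plays the role of the approval label
toy = pd.DataFrame({
    'debt': [0.0, 4.46, 0.5, 1.54],
    'employed': ['t', 't', 'f', 'f'],
    'status': ['+', '+', '-', '-'],
})

# Encoding the whole frame turns the label into two complementary indicators
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# Slicing off only the last column leaves 'status_+' among the features
leaky_features = encoded.iloc[:, :-1]
print('status_+' in leaky_features.columns)

# One way to avoid this: separate the label before encoding the features
y = (toy['status'] == '+').astype(int)
X = pd.get_dummies(toy.drop(columns='status'))
print(list(X.columns))
```

With the label removed before encoding, no column of `X` carries information derived from `status`.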

To see whether tuning changes the result, we can perform a grid search over two of the model's parameters, 'tol' and 'max_iter'. 

In [20]:
from sklearn.model_selection import GridSearchCV

# define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# create a dictionary where tol and max_iter are keys and the lists of their values are values, respectively
param_grid = dict(tol=tol, max_iter=max_iter)

<br /><br />We have defined the grid of hyperparameter values and converted it into a single dictionary, which GridSearchCV() expects as one of its parameters. We will instantiate GridSearchCV() with our earlier logistic regression model and fit it on the training data. We will also instruct GridSearchCV() to perform five-fold cross-validation. Now, we can begin the grid search to see which values perform best. The output will include the best-achieved score and its 'best parameters'. 



In [29]:
# instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# fit the grid search on the training data
grid_model_result = grid_model.fit(rescaledX_train, y_train.ravel())

# summarize best score, best parameters
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# extract best model and output its results
best_model = grid_model_result.best_estimator_
print("Accuracy of logistic regression classifier: ", best_model.score(rescaledX_test, y_test))

Best: 1.000000 using {'max_iter': 100, 'tol': 0.01}
Accuracy of logistic regression classifier:  1.0
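
One possible refinement, sketched here on synthetic data and not on the credit card dataset, is to put the scaler and the classifier into a single pipeline, so that each cross-validation fold is rescaled using only its own training part. The data from `make_classification` and all variable names with the `_demo` suffix are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# make_pipeline names each step after its lowercased class name
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression())

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid_demo = {
    'logisticregression__tol': [0.01, 0.001, 0.0001],
    'logisticregression__max_iter': [100, 150, 200],
}

grid_demo = GridSearchCV(pipeline, param_grid=param_grid_demo, cv=5)
grid_demo.fit(X_train_demo, y_train_demo)

print("Best: %f using %s" % (grid_demo.best_score_, grid_demo.best_params_))
print("Test accuracy: ", grid_demo.score(X_test_demo, y_test_demo))
```

The grid has 3 x 3 = 9 parameter combinations, each evaluated with five folds, and `grid_demo.cv_results_` holds the score of every combination for closer inspection.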
