# Credit Card Approval using Machine Learning

The credit card approval can be automated with Machine Learning (ML), pretty much all the banks in the world does so nowdays. In this notebook an approximation this this task is made, using the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository. 

In [50]:
import pandas as pd
# Reading the data file 
cc_apps = pd.read_csv('crx.data', header=None)
# Inspecting the content
cc_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


The first thing we can see it's that the autor has made some of the features of the data set confidential. So, we dont really know what are the features we're dealing with. Nevertheless we do know that the last column is the target varibale, wheather or not the client is approved for a credit card. Then the rest of the data can be treated as predictors.

## Inspecting the file
As we can see there is a mixture of numerical and non-numerical features, this will be fixed with some preprocessing libraries since we only need and can work with numerical features

In [51]:
#summary statistics
cc_apps_stats=cc_apps.describe()
print(cc_apps_stats)
print('\n')

#DataFrame Information
cc_apps_info = cc_apps.info()
print(cc_apps_info)

#Inspecting missing files at the end of the file
cc_apps.tail(20)

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     690 non-null object
1     690 non-null object
2     690 non-null float64
3     690 non-null object
4     690 non-null object
5     690 non-null object
6     690 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    690 non-null object
14    690 non-null int64

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-


## Cleaning the file
There is some missing values filled with '?', we're dealing with this characters replacing them with NaN values using the <code>numpy</code> library. Then those NaN values are replaced with the mean values of the respective column.

In [52]:
import numpy as np
#replacing the '?' with NaN through np.nan
cc_apps = cc_apps.replace('?', np.nan)

#cheking again the tail of the DataFrama to check the replaced values
cc_apps.tail(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-
673,,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-


## Filing the NaN values with their respective mean values

In [53]:
#replacing missing values with the correspondent mean value
cc_apps.fillna(cc_apps.mean(), inplace=True)

#counting the NaN missing values to verify the replacement
cc_apps.isnull().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

There is still some missing values at columns 0, 1, 3, 4,..., which correspond to non-numeric data. So the mean numeric imputation does not work. 
Those values are going to be replaced with the most frequent value of respective column.

In [54]:
#iterating over each column in the DF cc_apps
for col in cc_apps:
    #checking if the column type is object 
    if cc_apps[col].dtypes == 'object':
        #input the most frequent value
        cc_apps = cc_apps.fillna(cc_apps[col].value_counts().index[0])
        
#checking again for missing values
cc_apps.isnull().sum()        

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

This conclude the cleaning stage, as we can see there is no more missing data in the data frame.

## Preprocessing the data $i$
Transform all the column dtypes to numeric not only for faster computation but also because most of the ML algorithms required numeric data for the calculations. We're doing this through label encoder method.

In [55]:
#Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
#Initiate the Encoder
le = LabelEncoder()
#iterating over each column in the DataFrame
for col in cc_apps:
    #checking the type of each column
    if cc_apps[col].dtypes == 'object':
        #making the transformation
        cc_apps[col] = le.fit_transform(cc_apps[col])


## Spliting into test and training set (for Machine Learning application)
Once all the data is in numeric format is time to split the data into training set (for training the algorithm) and test set (for test the algorithm). This is done with the <code>train_test_split</code> library.

In [56]:
#importing train_test_split library
from sklearn.model_selection import train_test_split

#droping 11 and 13 features and converting the DF to NP array
cc_apps= cc_apps.drop([11, 13], axis=1)
cc_apps = cc_apps.values

#spliting features and labels (predictors and target variable)
X,y = cc_apps[:,0:13], cc_apps[:,13]

#making the test and training sets (30% of the data to test set with a seed of 42)

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=42)

## Preprocessing the data $ii$
Scaling the data between 0 and 1

In [57]:
#importing MinMaxScaler for Scaling the data
from sklearn.preprocessing import MinMaxScaler

#Initiate the Scaler object
scaler = MinMaxScaler(feature_range=(0,1))
#Scaling the predictors test and train set
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.fit_transform(X_test)

## Using Logistic Regression into the train set
Predicting if a credic card is approved or not is basically a classification task, so a Logistic Regression algorithm is used.

In [63]:
#Importing the Logistic Regression model
from sklearn.linear_model import LogisticRegression

#Initiate the logistic Regression object
logreg = LogisticRegression(solver='liblinear')

#Fitting the training data
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

## Making predictions and evaluating performance
Once the algorithm has been fitted to the training data is time to make some precitions and calculate the accuracy of the model.

In [64]:
#importing confusion_matrix
from sklearn.metrics import confusion_matrix

#making predictions from the test set
y_pred =logreg.predict(rescaledX_test)

#printing the score of the model
print('Logistic Regression accuracy: ', logreg.score(rescaledX_test, y_test))

#printing the confusion matrix
print(confusion_matrix(y_test, y_pred))

Logistic Regression accuracy:  0.8405797101449275
[[89  8]
 [25 85]]


84% of accuracy is obtained from this first implementation. Not bad!!

For the confusion matrix, the first element of the of the first row of the confusion matrix denotes the true negatives meaning the number of negative instances (denied applications) predicted by the model correctly. And the last element of the second row of the confusion matrix denotes the true positives meaning the number of positive instances (approved applications) predicted by the model correctly.

## Optimizing the model through parameter Grid Search
There is space for more improvemente in the model. This is done though parameter optimization via Grid Search. This is done with the <code>GridSearchCV</code> library.

In [65]:
#importing GridSearchCV
from sklearn.model_selection import GridSearchCV
#defining the grid values
tol = [0.01,0.001,0.0001]
max_iter = [100, 150, 200]

#Creating the grid
param_grid = dict(tol=tol, max_iter=max_iter)

## Finding the best model

In [66]:
#Initiate the grid search the defined parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

#Rescaling X
rescaledX = scaler.fit_transform(X)

#Fit the data to the grid_model
grid_model_result= grid_model.fit(rescaledX, y)

#Printing the results 
best_score, best_params= grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))


Best: 0.853623 using {'max_iter': 100, 'tol': 0.01}


Finally we can obtain a little bit more accuracy optimizing the parameters of the algorithm. 85.36% of accuracy with 100 max iterations and tolerance of 0.01.