# Automated Credit Card Applications Evaluator

## 1. Introduction

Manually evaluating credit card applications can be time-consuming and error-prone, motivated by that we will implement an automatic credit card application evaluator which predicts the acceptance or refusal using supervised machine learning algorithms.

<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository. </p>
<ul>
<li>First, we will start off performing Exploratory Data Analysis on the dataset.</li>
<li>Next, we perform data cleaning and make sure to handle any fauly or missing data.</li>
<li>Apply data transformations for preprocessing the data to be suitable for the machine learning models.</li>
<li>Slicing and spliting the dataset into train and test data.</li>
<li>Scaling the data into the appropriate range</li>
<li> Checking correlation between features and target data.</li>
<li> Training, hyperparameter tuning and evaluation of two different classification models.</li>
</ul>
<p>First, loading and viewing the dataset. We find that since this data is confidential, the contributor of the dataset has anonymized the columns names.</p>

In [1]:
import pandas as pd

cc_apps = pd.read_csv('datasets/cc_approvals.data', header = None )

print(cc_apps.head())

  0      1      2  3  4  5  6     7  8  9   10 11 12     13   14 15
0  b  30.83  0.000  u  g  w  v  1.25  t  t   1  f  g  00202    0  +
1  a  58.67  4.460  u  g  q  h  3.04  t  t   6  f  g  00043  560  +
2  a  24.50  0.500  u  g  q  h  1.50  t  f   0  f  g  00280  824  +
3  b  27.83  1.540  u  g  w  v  3.75  t  t   5  t  g  00100    3  +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0  f  s  00120    0  +


## 2. EDA
<p> The features of this dataset have been anonymized privacy reasons, but <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">this blog</a> gives us a pretty good overview of the probable features. The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code>.</p>

Next we explore the dataset more by getting the summery statistics using Pandas [describe( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) and [info( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) methods as well as take a look at some rows at the end.

In [2]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print("\n")

# Print DataFrame information
cc_apps_info = cc_apps.info()
print(cc_apps_info)

print("\n")

# Inspect the last 20 rows of the dataset
print(cc_apps.tail(n = 20))

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no


<p>Some issues that can affect the performance of our machine learning model(s) include:</p>
<ul>
<li>Our dataset contains both numeric and non-numeric data, the columns 2, 7, 10 and 14 contain numeric values and all the other features contain non-numeric values.</li>
<li>The dataset also contains values from several ranges.
    
<li>Finally, the dataset has missing values, which we'll take care of next. The missing values in the dataset are labeled with '?' and can be seen in row 673 and column 0</li>
</ul>

## 3. Finding the missing values 

<p>First, let's temporarily replace these missing value question marks with NaN.</p>

In [3]:
import numpy as np

print(cc_apps.tail(n = 17))


cc_apps.replace('?' , np.nan , inplace = True)
print('\n')

print(cc_apps.tail(n = 17))
print('\n')

print(cc_apps.info())
print('\n')
print('Total number of NaN variavbles',sum(cc_apps.isna().sum()))

    0      1       2  3  4   5   6      7  8  9   10 11 12     13   14 15
673  ?  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -
674  a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -
675  a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -
677  b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -
678  a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -
679  a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -
680  b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -
681  b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -
682  b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  00140    2  -
683  b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  00240    3  -
684  b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  00400    0  -
685  b  21.08  10.085  y  p   e   h  1

- As we can see, we don't have any missing values in the columns with numerical values (2, 7, 10 and 14).

- The total number of missing values is 67.

## 4. Handling the missing values

<p>We replaced all the question marks with NaNs.</p>

There are missing values to be imputed for columns (0, 1, 3, 4, 5, 6 and 13). All of these columns contain non-numeric data.

We are going to impute these missing values with the most frequent values as present in the respective columns. 

In [4]:
for col in cc_apps.columns:
    if cc_apps[col].dtypes == 'object':
        cc_apps[col] = cc_apps[col].fillna(cc_apps[col].value_counts().index[0])

print('Total number of NaN variavbles:',sum(cc_apps.isna().sum()))
print('\n')
print(cc_apps.tail(17))

Total number of NaN variavbles: 0


    0      1       2  3  4   5   6      7  8  9   10 11 12     13   14 15
673  b  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -
674  a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -
675  a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -
677  b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -
678  a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -
679  a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -
680  b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -
681  b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -
682  b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  00140    2  -
683  b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  00240    3  -
684  b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  00400    0  -
68

## 5. Preprocessing the data (converting into numeric data) 

In this step, we will be converting all values into numeric ones which is essential when using SciKit-learn models. 

This can be done using [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [5]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

for col in cc_apps.columns.to_numpy():
    if cc_apps[col].dtypes == 'object':
        cc_apps[col]=encoder.fit_transform(cc_apps[col])

In [6]:
print(cc_apps.head())

   0    1      2   3   4   5   6     7   8   9   10  11  12  13   14  15
0   1  156  0.000   1   0  12   7  1.25   1   1   1   0   0  68    0   0
1   0  328  4.460   1   0  10   3  3.04   1   1   6   0   0  11  560   0
2   0   89  0.500   1   0  10   3  1.50   1   0   0   0   0  96  824   0
3   1  125  1.540   1   0  12   7  3.75   1   1   5   1   0  31    3   0
4   1   43  5.625   1   0  12   7  1.71   1   0   0   0   2  37    0   0


In [7]:
print(cc_apps.tail())

     0    1       2   3   4   5   6     7   8   9   10  11  12  13   14  15
685   1   52  10.085   2   2   4   3  1.25   0   0   0   0   0  90    0   1
686   0   71   0.750   1   0   1   7  2.00   0   1   2   1   0  67  394   1
687   0   97  13.500   2   2   5   2  2.00   0   1   1   1   0  67    1   1
688   1   20   0.205   1   0   0   7  0.04   0   0   0   0   0  96  750   1
689   1  197   3.375   1   0   1   3  8.29   0   0   0   1   0   0    0   1


## 6. Splitting the dataset into train and test sets

Now we need to split our dataset into two different sets as following:

- train set: for training and fine tuning our models.
- test set: for testing and evaluating the performance of the models.

Before that, we need to drop the features that aren't important, those feautres are `DriversLicense` and `ZipCode` 
which are represented in columns number 12 and 13 respectively.

In [8]:
from sklearn.model_selection import train_test_split

cc_apps = cc_apps.drop([11, 13], axis=1)

print(cc_apps.head())
print('\n')

X = cc_apps.loc[ : , 0:14]
Y = cc_apps.loc[ : , 15]

print(X.head())
print('\n')
print(Y.head())

X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size = 0.3, random_state= 11 )

   0    1      2   3   4   5   6     7   8   9   10  12   14  15
0   1  156  0.000   1   0  12   7  1.25   1   1   1   0    0   0
1   0  328  4.460   1   0  10   3  3.04   1   1   6   0  560   0
2   0   89  0.500   1   0  10   3  1.50   1   0   0   0  824   0
3   1  125  1.540   1   0  12   7  3.75   1   1   5   0    3   0
4   1   43  5.625   1   0  12   7  1.71   1   0   0   2    0   0


   0    1      2   3   4   5   6     7   8   9   10  12   14
0   1  156  0.000   1   0  12   7  1.25   1   1   1   0    0
1   0  328  4.460   1   0  10   3  3.04   1   1   6   0  560
2   0   89  0.500   1   0  10   3  1.50   1   0   0   0  824
3   1  125  1.540   1   0  12   7  3.75   1   1   5   0    3
4   1   43  5.625   1   0  12   7  1.71   1   0   0   2    0


0    0
1    0
2    0
3    0
4    0
Name: 15, dtype: int64


## 7. Preprocessing the Data (scaling)

The last preprocessing step we need to perform is scaling all our data to be in the same range.

This is important in order to remove any bias that can happen due to some features having larger values compared to others.

Next we use the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to scale all the input data in 0 to 1 range.

In [9]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=( 0 , 1 ))
scaled_X_train = min_max_scaler.fit_transform(X_train)
scaled_X_test = min_max_scaler.fit_transform(X_test)

## 8. Model Selection

Our goal is to predict if a credit card application will be accepted or not which makes it a classification problem.

Before deciding which model to use for our case of classification, let's check if our features (input data) are correlated to the target (output data).

In [10]:
print(cc_apps.loc[ : , : 14].corrwith(cc_apps[15]))

0     0.028934
1    -0.166966
2    -0.206294
3     0.194306
4     0.185134
5    -0.130434
6    -0.000866
7    -0.322475
8    -0.720407
9    -0.458301
10   -0.406410
12    0.100867
14   -0.175657
dtype: float64


As we can see from the correlation matrix; our data shows a decent amount of correlation.

Possible models to use in this case inclue: [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html) and [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).


## 8. RidgeClassifier

In [11]:
from sklearn.linear_model import RidgeClassifier

ridge_model = RidgeClassifier()

## 9. Hyperparameter Tuning (RidgeClassifier)

Next, we are going to use [GridSearchCV()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to tune our model of choice.

We will also perform cross validation on five folds by passing the __cv__ argument.

Hyperparameter in this case:

- `alpha`: Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization.

In [12]:
from sklearn.model_selection import GridSearchCV

alphas = {'alpha': np.array([ 0.001 , 0.01 , 0.1 , 0.2 , 0.4 , 0.8 , 1])}
ridge_grid = GridSearchCV( ridge_model , alphas , cv = 5)

## 10. Model Training (RidgeClassifier)



In [13]:
ridge_trained_model = ridge_grid.fit(scaled_X_train , Y_train)

## 11. Model Evaluation (RidgeClassifier)

Next step, we will use the test data to make predictions using the trained `RidgeClassifier` model.

Then the performance of the model will be evaluated by calculating the [Classification Report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) and the [Confusion Matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

In [14]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

Y_preds = ridge_trained_model.predict(scaled_X_test)

confMatrixRidge = confusion_matrix(Y_test , Y_preds)
print("Confusion Matrix :")
print(confMatrixRidge)

print('\n')

classificationReportRidge = classification_report(Y_test , Y_preds)
print('Classification report :')
print(classificationReportRidge)

print('\n')

print("Accuracy :")
print(ridge_trained_model.score(scaled_X_test , Y_test))

print('\n')

print("Hyperparameter (alpha) :")
print(ridge_trained_model.best_params_)

print('\n')

print("Cross validation accuracy :")
print(ridge_trained_model.best_score_)

Confusion Matrix :
[[ 80   5]
 [ 21 101]]


Classification report :
              precision    recall  f1-score   support

           0       0.79      0.94      0.86        85
           1       0.95      0.83      0.89       122

    accuracy                           0.87       207
   macro avg       0.87      0.88      0.87       207
weighted avg       0.89      0.87      0.88       207



Accuracy :
0.8743961352657005


Hyperparameter (alpha) :
{'alpha': 0.001}


Cross validation accuracy :
0.8488187285223369


## 12. KNeighborsClassifier

In [15]:
from sklearn.neighbors import KNeighborsClassifier

kn_model = KNeighborsClassifier()

## 13. Hyperparameter Tuning (KNeighborsClassifier)

Hyperparameter in this case:

- `n_neighbors`: the number of neighbors to use.

In [16]:
neighbors = {'n_neighbors': np.arange(1, 50)}
kn_grid = GridSearchCV(kn_model , neighbors)

## 14. Model Training (KNeighborsClassifier)

In [17]:
kn_trained_model = kn_grid.fit(scaled_X_train , Y_train)

## 15. Model Evaluation (KNeighborsClassifier)

Once again, we will use the test data to make predictions using the`KNeighbors` trained model.

Then the performance of the model will be evaluated by calculating the [Classification Report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) and the [Confusion Matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

In [18]:
Y_preds = kn_trained_model.predict(scaled_X_test)

confMatrixRidge = confusion_matrix(Y_test , Y_preds)
print("Confusion Matrix :")
print(confMatrixRidge)

print('\n')

classificationReportRidge = classification_report(Y_test , Y_preds)
print('Classification report :')
print(classificationReportRidge)

print('\n')

print("Accuracy :")
print(kn_trained_model.score(scaled_X_test , Y_test))

print('\n')

print("Hyperparameter :")
print(kn_trained_model.best_params_)

print('\n')

print("Cross validation accuracy :")
print(kn_trained_model.best_score_)

Confusion Matrix :
[[ 76   9]
 [ 14 108]]


Classification report :
              precision    recall  f1-score   support

           0       0.84      0.89      0.87        85
           1       0.92      0.89      0.90       122

    accuracy                           0.89       207
   macro avg       0.88      0.89      0.89       207
weighted avg       0.89      0.89      0.89       207



Accuracy :
0.8888888888888888


Hyperparameter :
{'n_neighbors': 30}


Cross validation accuracy :
0.8633161512027492
