# Credit card approval

Our aim is to build an automatic credit card approval predictor using machine learning techniques. For that, we'll use the [Credit Card Approval dataset](http://archive.ics.uci.edu/ml/datasets/credit+approval) from the UCI Machine Learning Repository.

This dataset mixes numerical and non-numerical features, its values span different ranges, and it contains a number of missing entries, so we will have to preprocess it to ensure the machine learning model we choose can make good predictions. Once our data is in good shape, we will do some exploratory data analysis to build our intuitions, and finally we will build a machine learning model that can predict whether an individual's application for a credit card will be accepted.

# 0. Import libraries

In [40]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# 1. Overview

In [3]:
cc_apps = pd.read_csv("cc_approvals.data", header = None)

In [4]:
cc_apps.head(20)

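|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |
| 5 | b | 32.08 | 4.0 | u | g | m | v | 2.5 | t | f | 0 | t | g | 360 | 0 | + |
| 6 | b | 33.17 | 1.04 | u | g | r | h | 6.5 | t | f | 0 | t | g | 164 | 31285 | + |
| 7 | a | 22.92 | 11.585 | u | g | cc | v | 0.04 | t | f | 0 | f | g | 80 | 1349 | + |
| 8 | b | 54.42 | 0.5 | y | p | k | h | 3.96 | t | f | 0 | f | g | 180 | 314 | + |
| 9 | b | 42.5 | 4.915 | y | p | w | v | 3.165 | t | f | 0 | t | g | 52 | 1442 | + |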


In [5]:
cc_apps.tail(20)

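|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 670 | b | 47.17 | 5.835 | u | g | w | v | 5.5 | f | f | 0 | f | g | 465 | 150 | - |
| 671 | b | 25.83 | 12.835 | u | g | cc | v | 0.5 | f | f | 0 | f | g | 0 | 2 | - |
| 672 | a | 50.25 | 0.835 | u | g | aa | v | 0.5 | f | f | 0 | t | g | 240 | 117 | - |
| 673 | ? | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | f | g | 256 | 17 | - |
| 674 | a | 37.33 | 2.5 | u | g | i | h | 0.21 | f | f | 0 | f | g | 260 | 246 | - |
| 675 | a | 41.58 | 1.04 | u | g | aa | v | 0.665 | f | f | 0 | f | g | 240 | 237 | - |
| 676 | a | 30.58 | 10.665 | u | g | q | h | 0.085 | f | t | 12 | t | g | 129 | 3 | - |
| 677 | b | 19.42 | 7.25 | u | g | m | v | 0.04 | f | t | 1 | f | g | 100 | 1 | - |
| 678 | a | 17.92 | 10.21 | u | g | ff | ff | 0.0 | f | f | 0 | f | g | 0 | 50 | - |
| 679 | a | 20.08 | 1.25 | u | g | c | v | 0.0 | f | f | 0 | f | g | 0 | 0 | - |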


Since this data is confidential, the contributor of the dataset has anonymized the feature names to protect privacy, which can make the output look a bit confusing at first sight.

From [this blog](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) we can get a pretty good overview of the probable features. In a typical credit card application these features are ```Gender```, ```Age```, ```Debt```, ```Married```, ```BankCustomer```, ```EducationLevel```, ```Ethnicity```, ```YearsEmployed```, ```PriorDefault```, ```Employed```, ```CreditScore```, ```DriversLicense```, ```Citizen```, ```ZipCode```, ```Income``` and finally the ```ApprovalStatus```. This gives us a pretty good starting point, and we can map these features, in order, to columns 0 through 15 of the output.
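The mapping from column positions to probable feature names can be sketched as a simple rename. The names come from the blog post and are an assumption, not part of the dataset; the two sample rows are copied from the `head()` output above and only stand in for the full file.

```python
import pandas as pd

# Probable feature names, in column order (an assumption taken from the
# blog post; the dataset itself ships without headers).
feature_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# Two rows copied from the head() output, standing in for the full file
sample = pd.DataFrame([
    ["b", "30.83", 0.0, "u", "g", "w", "v", 1.25, "t", "t", 1, "f", "g", "202", 0, "+"],
    ["a", "58.67", 4.46, "u", "g", "q", "h", 3.04, "t", "t", 6, "f", "g", "43", 560, "+"],
])

# Map the integer column labels 0..15 to the probable names
named = sample.rename(columns=dict(enumerate(feature_names)))
print(named[["Age", "Income", "ApprovalStatus"]])
```

The rest of this notebook keeps the integer column labels, so this rename is only an aid for reading the tables.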

The dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.

In [6]:
cc_apps.describe()

|  | 2 | 7 | 10 | 14 |
|---|---|---|---|---|
| count | 690.0 | 690.0 | 690.0 | 690.0 |
| mean | 4.758725 | 2.223406 | 2.4 | 1017.385507 |
| std | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.165 | 0.0 | 0.0 |
| 50% | 2.75 | 1.0 | 0.0 | 5.0 |
| 75% | 7.2075 | 2.625 | 3.0 | 395.5 |
| max | 28.0 | 28.5 | 67.0 | 100000.0 |


In [7]:
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


# 2. Preparing data

Now, we will split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to preprocess the training data or should be used to direct the training process of a machine learning model. Hence, we first split the data and then preprocess it.

Also, features like ```DriversLicense``` and ```ZipCode``` (columns 11 and 13) are probably not as important as the other features in the dataset for predicting credit card approvals. To get a better sense, we could measure their [statistical correlation](https://realpython.com/numpy-scipy-pandas-correlation-python/) to the labels of the dataset. As a simple first approach, we will drop them and design our machine learning model with the remaining features; if we need to improve the model, we'll reconsider them.
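As a rough illustration of what measuring that correlation could look like, here is a minimal sketch on a hypothetical toy frame. The rows are made up for the example and say nothing about the real dataset; the column names are the probable ones from the blog post. For two-valued features, the Pearson correlation of the 0/1 encodings is one simple measure of association with the label.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from the dataset
toy = pd.DataFrame({
    "DriversLicense": ["f", "f", "t", "t", "f", "t", "f", "t"],
    "PriorDefault":   ["t", "t", "t", "t", "f", "f", "f", "f"],
    "ApprovalStatus": ["+", "+", "+", "-", "-", "-", "-", "+"],
})

# Encode each two-valued column as 0/1 so Pearson correlation applies
binary = pd.DataFrame({
    "DriversLicense": (toy["DriversLicense"] == "t").astype(int),
    "PriorDefault": (toy["PriorDefault"] == "t").astype(int),
    "Approved": (toy["ApprovalStatus"] == "+").astype(int),
})

# Correlation of each feature with the label
corr_with_label = binary.corr()["Approved"].drop("Approved")
print(corr_with_label)
```

A feature with many distinct values, such as ```ZipCode```, would need a different measure; this sketch only covers the two-valued case.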

In [8]:
cc_apps = cc_apps.drop(columns = [11,13])

In [9]:
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size = 0.33, random_state = 42)

## Missing values

Now that we've split our data, we can handle some of the issues we identified when inspecting the DataFrame, including:

- Our dataset contains both numeric and non-numeric data (of ```float64```, ```int64``` and ```object``` types). Specifically, features 2, 7, 10 and 14 contain numeric values (of types ```float64```, ```float64```, ```int64``` and ```int64``` respectively) and all the other features contain non-numeric values.
- The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 0 - 67, and some have a range of 0 - 100000. Apart from these, we can get useful statistical information (like ```mean```, ```max```, and ```min```) about the features that have numerical values.
- Finally, the dataset has missing values, which we'll take care of in this section. The missing values in the dataset are labeled with '?', which can be seen in the tail cell's output (row 673).

Now, let's temporarily replace these missing value question marks with NaN.

In [10]:
cc_apps_train = cc_apps_train.replace('?',np.nan)
cc_apps_test = cc_apps_test.replace('?',np.nan)
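An alternative worth knowing, sketched here under the assumption that '?' is the only missing-value marker in the file, is to let `pd.read_csv` do the conversion at load time through its `na_values` parameter. The two rows below are copied from the tail output above and stand in for the real file.

```python
import io
import pandas as pd

# Two rows copied from the tail() output; the first has '?' in column 0
raw = io.StringIO(
    "?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-\n"
    "a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-\n"
)

# na_values tells the parser to read '?' as NaN while loading
sample = pd.read_csv(raw, header=None, na_values="?")
print(sample.isnull().sum().sum())
print(sample.dtypes[1])
```

Loading the data this way would probably also let columns such as 1 parse as numbers instead of ```object```, which would change the imputation and encoding steps that follow, so it is not a drop-in replacement for the cell above.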

In [11]:
cc_apps_train.isnull().sum()
cc_apps_test.isnull().sum()

0     4
1     7
2     0
3     0
4     0
5     2
6     2
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64

The output above is for the test set, since a notebook cell displays only its last expression. Although there are no missing values in the numeric columns, there are still some missing values to be imputed in columns 0, 1, 5 and 6. All of these columns contain non-numeric data, which needs special treatment.

We are going to impute these missing values with the most frequent value of the respective column, computed on the training set. This is a common practice for imputing missing values in categorical data.
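As an alternative to a hand-written loop, scikit-learn's `SimpleImputer` with `strategy="most_frequent"` does the same kind of imputation. The sketch below uses hypothetical toy frames (`train` and `test` are made-up stand-ins, not the notebook's data) and learns the most frequent values from the training frame only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the train and test sets
train = pd.DataFrame(
    {"0": ["b", "a", "b", np.nan], "5": ["w", "q", np.nan, "q"]}, dtype=object
)
test = pd.DataFrame({"0": [np.nan, "a"], "5": ["w", np.nan]}, dtype=object)

# Learn the most frequent value of each column on the training frame only
imputer = SimpleImputer(strategy="most_frequent")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training-set values to fill the gaps in the test frame
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_filled)
```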

In [22]:
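for col in cc_apps_train:
    # Impute only the non-numeric (object) columns
    if cc_apps_train[col].dtype == 'object':
        # Most frequent value of this column, computed on the training set only
        most_frequent = cc_apps_train[col].value_counts().idxmax()
        cc_apps_train = cc_apps_train.fillna({col: most_frequent})
        cc_apps_test = cc_apps_test.fillna({col: most_frequent})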

In [23]:
cc_apps_train.isnull().sum()
cc_apps_test.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64

# 3. Preprocessing the data

## Convert the non-numeric data into numeric

In [26]:
# One-hot encode every object column (this includes the label, column 15)
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

In [29]:
# Align the test columns with the training columns; missing ones are filled with 0
cc_apps_test = cc_apps_test.reindex(columns = cc_apps_train.columns, fill_value = 0)

## Scale the feature values to a uniform range

In [35]:
# Features: every column but the last; target: the last column (the dummy 15_-)
X_train, y_train = cc_apps_train.iloc[:, :-1].values, cc_apps_train.iloc[:, [-1]].values
X_test, y_test = cc_apps_test.iloc[:, :-1].values, cc_apps_test.iloc[:, [-1]].values

In [36]:
scaler = MinMaxScaler()

rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

# 4. Logistic regression

In [39]:
logreg = LogisticRegression()

# ravel() flattens the (n, 1) target into the 1-D array scikit-learn expects
logreg.fit(rescaledX_train, y_train.ravel())


## Evaluating performance

We will now evaluate our model on the test set with respect to classification accuracy. We will also take a look at the model's confusion matrix. In the case of predicting credit card applications, it is important to see whether our machine learning model is equally capable of predicting approved and denied status, in line with the frequency of these labels in our original dataset.

In [50]:
y_pred = logreg.predict(rescaledX_test)

print("Accuracy of logistic regression classifier:", logreg.score(rescaledX_test, y_test))
print('\n')
print("Confusion_matrix: \n \n", confusion_matrix(y_test,y_pred))

Accuracy of logistic regression classifier: 1.0


Confusion_matrix: 
 
 [[103   0]
 [  0 125]]


The model scored an accuracy of 100% on the test set: all 228 test applications (103 + 125) fall on the diagonal of the confusion matrix.

In the confusion matrix, the first element of the first row counts the true negatives, that is, the instances of class 0 that the model predicted correctly. The last element of the second row counts the true positives, the instances of class 1 that the model predicted correctly. Since the target used here is the last dummy column, ```15_-```, class 1 corresponds to denied applications and class 0 to approved ones.
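To make the layout of the matrix concrete, here is a minimal sketch on hypothetical labels (`y_true` and `y_hat` are made up for the example, not taken from the model above). For binary labels, `confusion_matrix(...).ravel()` returns the four counts in the order true negatives, false positives, false negatives, true positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 1 = positive class, 0 = negative class
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
print(tn, fp, fn, tp, accuracy, precision, recall)
```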

# 5. Conclusion

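Although our analysis was shallow, we got a perfect score. It's a good result, but also an odd one: results this neat are rare, and they call for verification rather than trust. One thing to check before distrusting the dataset is the preprocessing: ```pd.get_dummies``` was applied to the whole DataFrame, label column included, so the feature matrix contains ```15_+```, the exact complement of the target ```15_-```, and the model can read the answer directly from its input. Separating the label from the features before encoding should give a more honest score. Beyond that, we could look for another source of the Credit Card Approval dataset and redo the analysis to check the veracity of the data.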
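One way a perfect score can arise is target leakage during encoding. The sketch below reproduces the pattern on a hypothetical toy frame (the column names `income`, `employed` and `label` and all values are made up): encoding a frame that still contains its label yields two complementary label columns, and slicing off only the last one leaves the other among the features. It also shows one possible leak-free variant, which sets the label aside before encoding.

```python
import pandas as pd

# Hypothetical toy frame: two features and a label column
toy = pd.DataFrame({
    "income": [0, 560, 824, 3],
    "employed": ["t", "t", "f", "f"],
    "label": ["+", "+", "-", "-"],
})

# Encoding the whole frame turns the label into two complementary columns
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# "Everything but the last column" still contains label_+
leaky_features = encoded.iloc[:, :-1]
target = encoded.iloc[:, -1]
is_complement = (
    leaky_features["label_+"].astype(int) + target.astype(int) == 1
).all()
print(is_complement)

# Leak-free variant: set the label aside before encoding the features
features = pd.get_dummies(toy.drop(columns="label"))
labels = (toy["label"] == "+").astype(int)
print(list(features.columns))
```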