# Case Study 03 - Credit Card Approval Prediction
---

## 1. Introduction 

In this case study, we use the **Credit Card Approval Data Set** from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/credit+approval) to build a machine learning classification model that predicts whether an applicant's credit card application will be approved, based on the features available in the data set.

As mentioned on the above website, the data set has been anonymized to protect the confidentiality of the data.

## 2. Basic Insight of Data

#### Loading data and creating a dataframe

In [None]:
# Import pandas
import pandas as pd

# Load dataset
cc_apps = pd.read_csv( "cc_approvals.data", header = None)

# Inspect data
cc_apps.head()

In [None]:
print("The dataframe has {} datapoints and {} columns.".format(*cc_apps.shape))

#### Remark on column names

As mentioned above, the data set has been anonymized due to confidentiality of the data, so its columns carry no meaningful names. To get a better sense of what the data mean, we list the most important features taken into account when processing a credit card application, which we will use as column names:

- Male
- Age
- Debt 
- Married       
- BankCustomer 
- EducationLevel
- Ethnicity 
- YearsEmployed
- PriorDefault  
- Employed      
- CreditScore   
- DriversLicense
- Citizen   
- ZipCode      
- Income      
- Approved

For further information, see [this analysis](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html). Before inspecting the dataframe further, let's rename the columns so that the dataframe is easier to read.

In [None]:
# create headers list
headers = [
"Male",
"Age",
"Debt", 
"Married",       
"BankCustomer", 
"EducationLevel",
"Ethnicity", 
"YearsEmployed",
"PriorDefault",  
"Employed",      
"CreditScore",   
"DriversLicense",
"Citizen",   
"ZipCode",       
"Income",      
"Approved"
]

cc_apps.columns = headers
cc_apps.head()

Remember that the values in the data set have been anonymized by the data set provider, so it is not surprising, for example, to see non-integer values in the `Age` column. Note also that features like `DriversLicense` and `ZipCode` are less important than the other features for predicting credit card approvals, so we drop them and build our machine learning model on the remaining features.

In [None]:
# Drop the features 'DriversLicense' and 'ZipCode'
cc_apps.drop(['DriversLicense', 'ZipCode'], axis = 1, inplace = True)
cc_apps.head()

In [None]:
# a summary of dataframe
cc_apps.info()

In [None]:
# a summary statistics of numerical variables
cc_apps.describe()

## 3. Preprocessing

### 3.1 Identifying missing values

In [None]:
# any missing values?
if cc_apps.isnull().values.any():
    print("YES")
else:
    print("Nothing detected!")

It seems that there are no 'NaN' missing values. However, we need to make sure there is no other type of missing values in the dataframe. Let's examine the dataframe further.

In [None]:
# Inspect missing values
cc_apps.iloc[80:90]

As we can see, some missing values are labeled with "?". We replace them with `NaN` so that we can identify all variables with missing values.

In [None]:
# Import numpy
import numpy as np

# replace "?" with NaN
cc_apps.replace("?", np.nan, inplace = True)

# check the dataframe again
cc_apps.iloc[80:90]
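An alternative to replacing "?" after loading is to let pandas treat it as missing while parsing. The sketch below assumes a hypothetical three-row excerpt held in memory (not the real file) and uses the `na_values` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the same style as the raw file
raw = io.StringIO("b,30.83,0,+\na,?,4.46,-\n?,24.50,0.5,-\n")

# na_values="?" makes pandas parse every "?" cell as NaN
toy = pd.read_csv(raw, header=None, na_values="?")

n_missing = toy.isnull().sum().sum()
print(n_missing)   # 2
print(toy.dtypes)  # column 1 is now parsed as a float column
```

A side effect of this approach is that numeric columns containing "?" are parsed as numbers rather than as strings.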

Now, we can find the total number of missing values in the dataframe.

In [None]:
# Total number of missing values
tot = cc_apps.isnull().sum().sum()
print("There are {} missing values in total in the dataframe.".format(tot))

We can find the missing values by column.

In [None]:
# Missing values by column
missing_condition = cc_apps.isnull().sum()

# number of column with missing values
column_null = missing_condition[missing_condition > 0]

print("There are {} features with missing values.".format(column_null.count()), "\n")

# missing values by column sorted descending
column_null.sort_values(ascending=False)

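Here, the index of the result is the column name and the value is the number of missing entries in that column. Most of the missing values are in the `Age` and `Male` columns. Notice that pandas stores all of these columns with the `object` dtype, so they are treated as categorical in what follows.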
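It can also help to express the missing values as a share of each column. The following is a minimal sketch on a hypothetical toy dataframe (the values are made up for illustration), relying on the fact that the mean of a boolean mask is the fraction of `True` entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with a few missing entries
toy = pd.DataFrame({
    "Male": ["a", np.nan, "b", "b"],
    "Age": [30.5, np.nan, np.nan, 41.0],
    "Debt": [0.0, 1.5, 2.0, 0.5],
})

# isnull() gives a boolean mask; its column-wise mean is the missing fraction
missing_share = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_share)
```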

### 3.2 Imputing missing data

We can use the data description to find out the meaning of these missing values. We now proceed to impute them.

#### Replacing missing values by mean

In [None]:
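# Impute the missing values with mean imputation
# (numeric_only=True restricts the means to numeric columns; recent pandas
# versions raise an error when asked for the mean of object columns)
cc_apps.fillna(cc_apps.mean(numeric_only=True), inplace=True)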

In [None]:
# Count the number of NaNs in the dataset to verify
cc_apps.isnull().sum().sum()

The count is unchanged: as identified above, all missing values are in columns of categorical variables, which mean imputation does not touch.
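To see why mean imputation leaves categorical columns untouched, consider this small sketch on a hypothetical two-column dataframe (column names borrowed from the list above, values made up).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "Debt": [1.0, np.nan, 3.0],
    "Married": ["u", None, "y"],
})

# The Series of means only has an entry for "Debt",
# so fillna only fills the gap in that column
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled)
print(filled.isnull().sum())
```

The gap in `Debt` is filled with 2.0, the mean of 1.0 and 3.0, while the gap in `Married` remains.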

#### Replacing missing values by frequency (mode)

For categorical variables, we replace the missing values by the most frequent value.

In [None]:
# Iterate over each column of cc_apps
for col in cc_apps.columns.values:
    # Check if the column is of object type
    if cc_apps[col].dtypes == 'object':
        # Impute this column with its own most frequent value
        cc_apps[col] = cc_apps[col].fillna(cc_apps[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify
cc_apps.isnull().sum().sum()
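The key point of mode imputation is that each column is filled with its own most frequent value. Here is a minimal sketch on a hypothetical toy dataframe, using `Series.mode()` as an alternative way to obtain that value.

```python
import pandas as pd

# Hypothetical categorical columns, each with one missing entry
toy = pd.DataFrame({
    "Married": ["u", "y", None, "u"],
    "Citizen": ["g", None, "g", "s"],
})

for col in toy.columns:
    # mode() returns the most frequent value(s); take the first one
    most_frequent = toy[col].mode().iloc[0]
    toy[col] = toy[col].fillna(most_frequent)

print(toy)
```

After the loop, the gap in `Married` holds "u" and the gap in `Citizen` holds "g", so each column received its own mode.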

### 3.3 Encoding categorical variables

We need to convert the categorical variables to numerical ones. For this we use the [Label Encoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) technique, which maps each distinct value of a column to an integer code.

In [None]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

# Iterate over the columns and check their dtypes
for col in cc_apps.columns.values:
    # Check if the dtype is object
    if cc_apps[col].dtypes == 'object':
        # Use LabelEncoder to do the numeric transformation
        cc_apps[col] = le.fit_transform(cc_apps[col])
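It is worth knowing which integer each label receives. `LabelEncoder` assigns codes in the sorted order of the distinct values, as this small sketch shows; the "+" and "-" labels here are an assumption about how the raw `Approved` column is coded.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed raw labels for illustration: "+" and "-"
labels = ["+", "-", "-", "+"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

print(le_demo.classes_)                   # distinct labels in sorted order
print(codes)                              # integer code of each input label
print(le_demo.inverse_transform([0, 1]))  # map codes back to labels
```

Because "+" sorts before "-", "+" is encoded as 0 and "-" as 1. Note also that reusing one encoder across several columns, as in the loop above, keeps only the mapping of the last fitted column in `classes_`.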

### 3.4 Splitting the dataset into train and test sets

Now we split the data into a train set and a test set, so that the model can be trained on one part and evaluated on data it has not seen.

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split  

# Convert the dataframe to a numpy array
cc_apps = cc_apps.values

# Segregate features and labels into separate variables
X, y = cc_apps[:,0:-1] , cc_apps[:,-1]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
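When the classes are not perfectly balanced, one option is to pass `stratify=y` so that both splits keep the same class proportions. The sketch below uses made-up data (100 samples with a 60/40 class ratio) and is a variation on the split above, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 60 of class 1 and 40 of class 0
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 60 + [0] * 40)

# stratify=y_demo preserves the class ratio in both splits
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(yd_train), len(yd_test))
print(yd_train.mean(), yd_test.mean())
```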

### 3.5 Scaling

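We use `MinMaxScaler` to rescale each feature so that its values in the train set lie between 0 and 1, inclusive. The scaler is fitted on the train set only and then applied unchanged to the test set, so that no information from the test set leaks into the preprocessing. Test values may therefore fall slightly outside this range.

In [None]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler, fit it on X_train, and reuse the fitted scaler on X_test
scaler = MinMaxScaler(feature_range = (0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)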
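A common convention is to learn the scaling parameters from the train set only and reuse them on the test set. The following sketch, on made-up one-column arrays, contrasts that with fitting a second scaler on the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.0], [12.0]])

# Fit on train only: min=0 and max=10 are learned from the train set
fitted = MinMaxScaler().fit(train)
scaled_test = fitted.transform(test).ravel()

# Fitting again on test uses the test set's own min and max instead
refit_test = MinMaxScaler().fit_transform(test).ravel()

print(scaled_test)  # [0.2 1.2]
print(refit_test)   # [0. 1.]
```

With the train-fitted scaler, the test value 12 maps to 1.2, which is outside [0, 1] but on the same scale as the train data. Refitting on the test set maps the same inputs to different numbers than the model saw during training.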

Now, our data is ready for modelling.

## 4. Modeling 

Predicting whether a credit card application will be approved is a classification task. According to [UCI](http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names), our dataset contains more "denial" instances than "approval" instances: out of 690 applications, 383 (55.5%) were denied and 307 (44.5%) were approved. This gives us a benchmark: a model that always predicted "denied" would already be right about 55.5% of the time, so a good machine learning model should predict the status of the applications clearly more accurately than that.
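The benchmark can be made concrete with a majority-class baseline. This sketch builds a made-up label vector with the class counts quoted above (the 0/1 encoding is an assumption for illustration) and scores scikit-learn's `DummyClassifier` on it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the quoted class counts; 0 = denied, 1 = approved (assumed encoding)
y_counts = np.array([0] * 383 + [1] * 307)
X_dummy = np.zeros((len(y_counts), 1))  # features are ignored by the baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_counts)
baseline_acc = baseline.score(X_dummy, y_counts)

print(round(baseline_acc, 3))  # 0.555
```

Any model worth keeping should beat this figure by a clear margin.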

We fit a logistic regression model to the train set to perform the classification task. The motivation is that generalised linear models (and logistic regression in particular) tend to perform well when the predictor variables are correlated. Here, we assume that the features affecting the credit card approval decision are correlated with each other, which is intuitively plausible and can be checked computationally.

We start the modeling with a logistic regression classifier.

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)

## 5. Prediction and its performance

To see how well our model performs, we evaluate it on the test set, using the model's `accuracy score` and `confusion matrix`.

In [None]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

# The confusion matrix of the logreg model
confusion_matrix(y_test, y_pred)

In the confusion matrix, the first diagonal element, 87, counts the instances of class 0 that the model predicted correctly (the **true negatives**), and the second diagonal element, 87, counts the correctly predicted instances of class 1 (the **true positives**). Which of these corresponds to denied and which to approved applications depends on the label encoding: `LabelEncoder` assigns codes in sorted order, so if the raw labels are `+` and `-`, then `+` (approved) becomes 0 and `-` (denied) becomes 1. The off-diagonal elements, 10 and 23, are the misclassifications; together they give 33 wrong predictions out of 207 test instances.
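The four cells of a binary confusion matrix can be unpacked by name with `ravel()`. The sketch below uses made-up label vectors to show the layout, then recomputes the accuracy from the four counts quoted above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for illustration
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, both in sorted order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
toy_acc = accuracy_score(y_true, y_hat)
print(tn, fp, fn, tp, toy_acc)  # 3 1 1 3 0.75

# Accuracy implied by the counts reported in the text
reported_acc = (87 + 87) / (87 + 87 + 10 + 23)
print(round(reported_acc, 3))   # 0.841
```

Accuracy is the sum of the diagonal divided by the total number of predictions.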

## 6. Grid searching

Our model performed well, with an accuracy score of about 84%. Grid search, available in scikit-learn as `GridSearchCV`, tries every combination of values from a grid of hyperparameters and keeps the combination with the best cross-validated score, which may improve the performance of the model. Here we search over only the following two of the [parameters of logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html):

- tol
- max_iter

Let's first define the grid of hyperparameter values and store it in a dictionary, which is the format that `GridSearchCV()` expects for its `param_grid` argument.

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001 ,0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)
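To see exactly which combinations such a grid contains, scikit-learn's `ParameterGrid` can enumerate them. This sketch rebuilds the same dictionary under a separate name so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, rebuilt here so the snippet is self-contained
demo_grid = dict(tol=[0.01, 0.001, 0.0001], max_iter=[100, 150, 200])

# ParameterGrid expands the dictionary into every combination of values
combos = list(ParameterGrid(demo_grid))
print(len(combos))  # 9
print(combos[0])
```

With k-fold cross-validation, each of these 9 combinations is fitted k times, so the cost of a grid search grows quickly with the size of the grid.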

### 6.1 Selecting best performing model

To run the grid search and find out which values perform best, we create an instance of `GridSearchCV()` from the `logreg` instance created earlier and instruct it to perform 5-fold cross-validation. Note that we pass the full `X` (rescaled) and `y` to `fit` rather than the separate train and test sets, since cross-validation does its own splitting. Finally, we store the best score achieved and the corresponding parameters.

In [None]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator = logreg, param_grid = param_grid, cv = 5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results: store the best-achieved score and parameters 
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))
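One possible refinement is to put the scaler and the classifier in a `Pipeline`, so that the scaler is refitted on the training part of every cross-validation fold. The sketch below runs on synthetic data from `make_classification` (not the credit card data). The step names `scaler` and `logreg` are arbitrary choices, and the grid keys follow scikit-learn's `step__parameter` naming.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, for illustration only
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=42)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("logreg", LogisticRegression()),
])

pipe_grid = {
    "logreg__tol": [0.01, 0.001, 0.0001],
    "logreg__max_iter": [100, 150, 200],
}

search = GridSearchCV(estimator=pipe, param_grid=pipe_grid, cv=5)
search.fit(X_syn, y_syn)

print("Best: %f using %s" % (search.best_score_, search.best_params_))
```

The reported score then reflects scaling parameters learned only from each fold's training data.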

<div class="alert alert-info" role="alert" style="margin-top: 1px">

## This notebook was created by [ALIREZA RAFIYI](www.linkedin.com/in/alireza-rafiyi) and last updated in June 2020.

</div>