## 1. Credit card applications
<p>Put simply, credit is any arrangement in which you buy goods or services now and agree to pay for them later.<br>
Banks receive numerous applications for <em>credit</em> cards. Many of them are <i><b>rejected</b></i> for a variety of reasons, such as high loan balances or low income levels. Analyzing these applications manually is challenging, error-prone, and time-consuming (and time is money!). With machine learning, this task can be automated. <br>Here, I have built an automatic credit card approval predictor using machine learning techniques, much as many banks do.</p>
<p><img src="datasets/1545568181-Credit_Card.jpg" alt="Credit card being held in hand"></p>
<p>I have used the Credit Card Approval dataset from the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">UCI Machine Learning Repository</a>.</p>


In [89]:
import pandas as pd
credit_data = pd.read_csv("datasets/cc_approvals.data", header=None)
credit_data.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


## 2. Inspecting the applications
<p>Let's try to figure out the most important features of a credit card application. Most of the feature names have been anonymized to protect privacy, but <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">this blog</a> gives a pretty good overview of the probable features. The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code>. This gives us a good starting point, and we can map these features to the columns in the output.</p>
<p>As our first glance at the data shows, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before I do that, let's learn a bit more about the dataset to see whether there are other issues that need to be fixed.</p>
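<p>The mapping of feature names to column positions can be sketched as follows. This is only an illustration: the four names marked as assumed (<code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>) are not listed in the text above, and the position of every name is an assumption rather than something the dataset itself confirms.</p>

```python
import pandas as pd

# Probable feature names for columns 0-15. The names marked "assumed" do not
# appear in the text above; they are placeholders taken as an assumption.
probable_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity",      # assumed
    "YearsEmployed",  # assumed
    "PriorDefault",   # assumed
    "Employed",       # assumed
    "CreditScore", "DriversLicense", "Citizen", "ZipCode", "Income",
    "ApprovalStatus",
]

# The first two rows of the output shown earlier, used as stand-in data
sample = pd.DataFrame([
    ["b", "30.83", 0.0, "u", "g", "w", "v", 1.25, "t", "t", 1, "f", "g", "202", 0, "+"],
    ["a", "58.67", 4.46, "u", "g", "q", "h", 3.04, "t", "t", 6, "f", "g", "43", 560, "+"],
])

# Work on a renamed copy so the integer column labels used elsewhere stay valid
named_sample = sample.rename(columns=dict(enumerate(probable_names)))
print(named_sample[["Age", "Debt", "CreditScore", "Income", "ApprovalStatus"]])
```

<p>Renaming a copy keeps the rest of the notebook, which addresses columns by number, unchanged.</p>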

In [90]:
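# Summary statistics for the numeric columns
credit_description = credit_data.describe()
print(credit_description)
print('\n')

# info() prints its report directly and returns None, so it is not wrapped in print()
credit_data.info()
print('\n')

# Missing values recognized by pandas (the "?" placeholders are not counted as missing yet)
credit_data.isna().tail(17)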

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
673,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
674,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
675,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
676,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
677,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
678,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
679,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
680,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
681,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
682,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


## 3. Splitting the dataset into train and test sets
<p>Now, I split the data into a train set and a test set to prepare for the two phases of machine learning modeling: <em>training</em> and <em>testing</em>.
Features like <code>DriversLicense</code> and <code>ZipCode</code> are not as important as the other features in the dataset for predicting approvals, so I drop them before splitting.</p>

In [91]:
from sklearn.model_selection import train_test_split

# Drop the two less informative columns (11 and 13)
credit_data = credit_data.drop([11, 13], axis=1)

# train and test sets
credit_data_train, credit_data_test = train_test_split(credit_data, test_size=0.33, random_state=42)
print("Number of credit card applications = {} \nNumber of applications used to train the model = {} \nNumber of applications used for testing = {}".format(credit_data.shape[0], credit_data_train.shape[0], credit_data_test.shape[0]))

Number of credit card applications = 690 
Number of applications used to train the model = 462 
Number of applications used for testing = 228
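
<p>The split sizes can be checked by hand. The sketch below assumes that a fractional <code>test_size</code> is rounded up to a whole number of rows, which is consistent with the counts printed above.</p>

```python
import math

n_samples = 690
test_size = 0.33

# Assumption: a fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)
```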


## 4. Handling the missing values (part i)

<ul>
<li>Our dataset contains both numeric and non-numeric data (of <code>float64</code>, <code>int64</code> and <code>object</code> types). Specifically, features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.</li>
<li>The numeric features span very different ranges: feature 2 ranges from 0 to 28, feature 7 from 0 to 28.5, feature 10 from 0 to 67, and feature 14 from 0 to 100000. Apart from these, we can get useful statistical information (like <code>mean</code>, <code>max</code>, and <code>min</code>) about the features that have numerical values.</li>
<li>The dataset also contains missing values, marked with "?", which can hurt the performance of our model.</li>
</ul>
<p>Now, let's temporarily replace these missing-value question marks ("?") with NaN.</p>
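<p>A small sketch of why this replacement matters, using three made-up rows rather than the real file: pandas treats "?" as an ordinary string, so it only shows up as missing after it has been converted. The <code>na_values</code> option of <code>read_csv</code> is shown as an alternative, not as the approach used in this notebook.</p>

```python
import io

import numpy as np
import pandas as pd

# Three made-up rows in the same comma-separated layout, with "?" placeholders
raw = "b,30.83,u\n?,24.50,?\na,?,u\n"

as_read = pd.read_csv(io.StringIO(raw), header=None)
print(as_read.isna().sum().sum())   # "?" is an ordinary string, so nothing counts as missing

replaced = as_read.replace("?", np.nan)
print(replaced.isna().sum().sum())  # the placeholders are now visible to isna()

# Alternative: let the reader convert the placeholder while parsing
parsed = pd.read_csv(io.StringIO(raw), header=None, na_values="?")
print(parsed.dtypes[1])             # column 1 parses as a float column in this case
```

<p>Note that replacing after loading leaves a column such as 1 stored as <code>object</code>, because it was read as text.</p>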

In [92]:
import numpy as np

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
credit_data_train = credit_data_train.replace('?', np.nan)
credit_data_test = credit_data_test.replace('?', np.nan)
credit_data.describe()


Unnamed: 0,2,7,10,14
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


## 5. Handling the missing values (part ii)
<p>I have replaced all the question marks with NaNs. This will help in the missing-value treatment that I am going to perform next.</p>
<p>An important question arises here: <em>why are we giving so much importance to missing values</em>? Can't they just be ignored? Ignoring missing values can heavily affect the performance of a machine learning model, because the model may miss out on information in the dataset that would be useful for training. In addition, many models, such as logistic regression, cannot handle missing values implicitly.</p>
<p>So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.</p>
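<p>As a minimal sketch of mean imputation on made-up numbers (the values and the single <code>Debt</code> column are illustrative only): the fill value is computed from the training split and then reused for the test split, so that no statistic from the test data is used.</p>

```python
import numpy as np
import pandas as pd

# Made-up numeric column with one gap in each split
train = pd.DataFrame({"Debt": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"Debt": [np.nan, 10.0]})

# Learn the fill value from the training split only
train_means = train.mean(numeric_only=True)

train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_means["Debt"])
print(test_filled["Debt"].tolist())
```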

In [93]:
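# Impute the missing values in the numeric columns with mean imputation.
# numeric_only=True skips the object columns, for which a mean is undefined.
train_means = credit_data_train.mean(numeric_only=True)
credit_data_train.fillna(train_means, inplace=True)
# Reuse the training-set means so no test-set statistics are used
credit_data_test.fillna(train_means, inplace=True)

# Count the number of NaNs in the datasets and print the counts to verify
display(credit_data_train.isnull().sum())
display(credit_data_test.isnull().sum())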


0     8
1     5
2     0
3     6
4     6
5     7
6     7
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64

0     4
1     7
2     0
3     0
4     0
5     2
6     2
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64

## 6. Handling the missing values (part iii)
<p>The numeric columns now contain no missing values. There are still some missing values to be imputed in columns 0, 1, 3, 4, 5 and 6. All of these columns are stored as non-numeric (<code>object</code>) data, which is why the mean imputation strategy does not work here. They need a different treatment.</p>
<p>We are going to impute these missing values with the most frequent value in the respective column. This is <a href="https://www.datacamp.com/community/tutorials/categorical-data">good practice</a> when it comes to imputing missing values for categorical data in general.</p>
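<p>A short sketch of most-frequent-value imputation on a made-up frame (the values are illustrative; only the column names come from the feature list above). Each column is filled with its own most frequent value, which is the behaviour described here.</p>

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "Married": ["u", "u", "y", np.nan],
    "BankCustomer": ["g", np.nan, "p", "g"],
})

# Column-wise most frequent value; mode() can return ties, so take the first row
most_frequent = train.mode().iloc[0]

# fillna with a Series matches fill values to columns by name
train_filled = train.fillna(most_frequent)
print(train_filled)
```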

In [94]:
for col in credit_data_train:
    # Check if the column is of object type
    if credit_data_train[col].dtype == 'object':
        # Most frequent value of this column in the training set
        most_frequent = credit_data_train[col].value_counts().index[0]
        # Fill only this column; calling fillna on the whole frame would
        # write this value into the gaps of every other column as well
        credit_data_train[col] = credit_data_train[col].fillna(most_frequent)
        credit_data_test[col] = credit_data_test[col].fillna(most_frequent)


print("Number of missing value in train data = {}".format(credit_data_train.isnull().values.sum()))
print("Number of missing value in test data = {}".format(credit_data_test.isnull().values.sum()))


Number of missing value in train data = 0
Number of missing value in test data = 0


## 7. Preprocessing the data (part i)
<p>The missing values are now successfully handled.</p>
<p>There is still some minor but essential data preprocessing needed before I proceed to building a machine learning model. I am going to divide the remaining preprocessing into two main tasks:</p>
<ol>
<li>Convert the non-numeric data into numeric data.</li>
<li>Scale the feature values to a uniform range of 0 to 1.</li>
</ol>
<p>First, I will convert all the non-numeric values into numeric ones. We do this not only because it results in faster computation but also because many machine learning models require the data to be in a strictly numeric format. I will do this by using the <code>get_dummies()</code> function from pandas.</p>

In [95]:
credit_data_train = pd.get_dummies(credit_data_train)
credit_data_test = pd.get_dummies(credit_data_test)

# Reindex the columns of the test set aligning with the train set
credit_data_test = credit_data_test.reindex(columns=credit_data_train.columns, fill_value=0)
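
<p>Why the reindexing step is needed can be seen on a made-up example (the values are illustrative): when a category occurs only in the training rows, encoding the two frames separately gives them different columns, and <code>reindex</code> restores a common layout.</p>

```python
import pandas as pd

train = pd.DataFrame({"Citizen": ["g", "s", "p"], "Income": [0, 560, 824]})
test = pd.DataFrame({"Citizen": ["g", "g"], "Income": [3, 0]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)
print(list(test_encoded.columns))   # only the category present in the test rows

# Give the test frame the training frame's columns, filling absent ones with 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(test_aligned.columns))
```

<p>Categories that occur only in the test rows are dropped by this alignment, since the training frame has no column for them.</p>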

## 8. Preprocessing the data (part ii)
<p>Now, I am left with one final preprocessing step, scaling, before we can fit a machine learning model to the data.</p>
<p>Let's try to understand what these scaled values mean in the real world, using <code>CreditScore</code> as an example. A person's credit score reflects their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. Since I am rescaling all the values to the range 0-1, a <code>CreditScore</code> of 1 is the highest.</p>
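<p>The rescaling itself is simple arithmetic: each value <code>x</code> becomes <code>(x - min) / (max - min)</code>. The sketch below applies it by hand to made-up scores spanning the 0-67 range reported for feature 10, which is assumed here to be <code>CreditScore</code>.</p>

```python
import numpy as np

# Made-up scores spanning the 0-67 range reported for feature 10
credit_score = np.array([0.0, 1.0, 6.0, 67.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1
lo, hi = credit_score.min(), credit_score.max()
scaled = (credit_score - lo) / (hi - lo)
print(scaled.round(3))
```

<p>Because the minimum and maximum are learned from the training data, a test value outside that range would be mapped outside 0-1.</p>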

In [96]:
from sklearn.preprocessing import MinMaxScaler

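# The last dummy column (15_-) is the target; every column before it, including 15_+, goes into X.
# Selecting the target with -1 rather than [-1] yields the 1-D array that scikit-learn expects.
X_train, y_train = credit_data_train.iloc[:, :-1].values, credit_data_train.iloc[:, -1].values
X_test, y_test = credit_data_test.iloc[:, :-1].values, credit_data_test.iloc[:, -1].values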

scaler = MinMaxScaler(feature_range=(0,1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
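
<p>One thing worth checking in a split like this is whether the approval status itself ends up among the features. The sketch below uses a made-up two-column frame, not the real data: after <code>get_dummies()</code>, the status becomes two complementary columns, and slicing off only the last one leaves the other inside <code>X</code>. Separating the status before encoding, shown at the end, is one possible way to avoid this and is offered as an assumption rather than as the method used in this notebook.</p>

```python
import pandas as pd

# Made-up stand-in: one numeric feature plus the '+'/'-' status in column 15
toy = pd.DataFrame({14: [0, 560, 3, 824], 15: ["+", "-", "+", "-"]})

encoded = pd.get_dummies(toy)
print(list(encoded.columns))

X = encoded.iloc[:, :-1]   # every column except the last
y = encoded.iloc[:, -1]    # the last column

# The second-to-last column of the encoded frame ends up inside X
leaked = X.iloc[:, -1]
print((leaked.astype(int) + y.astype(int)).tolist())

# One way to keep the status out of X: separate it before encoding
X_clean = pd.get_dummies(toy.drop(columns=15))
y_clean = (toy[15] == "+").astype(int)
print(list(X_clean.columns), y_clean.tolist())
```

<p>When a feature is the exact complement of the target, a classifier can reach a perfect score without learning anything about the applicants.</p>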

## 9. Fitting a logistic regression model to the train set
<p>Essentially, predicting whether a credit card application will be approved is a <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> task. According to UCI, the dataset contains more instances with "Denied" status than with "Approved" status. Specifically, out of 690 instances, 383 (55.5%) applications were denied and 307 (44.5%) were approved.<br>
Which model should we pick? A question to ask is: <em>are the features that affect the credit card approval decision process correlated with each other?</em> I will not measure the correlation here; instead, let's start with a logistic regression model (a generalized linear model) as a simple baseline.</p>

In [101]:
# LogisticRegression
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(rescaledX_train,y_train)


LogisticRegression()

## 10. Making predictions and evaluating performance
<p>But how well does the model perform?</p>
<p>I will now evaluate the model on the test set with respect to <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy">classification accuracy</a>. I will also take a look at the model's <a href="http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/">confusion matrix</a>. In the case of predicting credit card applications, it is important to see whether the machine learning model is equally capable of predicting approved and denied status.</p>

In [98]:
# confusion_matrix
from sklearn.metrics import confusion_matrix

y_pred = logreg.predict(rescaledX_test)

# Accuracy score
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test,y_test))

confusion_matrix(y_test,y_pred)

Accuracy of logistic regression classifier:  1.0


array([[103,   0],
       [  0, 125]], dtype=int64)
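
<p>As a worked reading of the matrix printed above, the sketch below recomputes the accuracy and the per-class recall from its four entries. It relies on scikit-learn's convention that rows are the true classes and columns the predicted classes.</p>

```python
import numpy as np

# The matrix printed above: rows are true classes, columns are predicted classes
cm = np.array([[103, 0],
               [0, 125]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
recall_class_0 = tn / (tn + fp)   # share of true class-0 rows predicted as 0
recall_class_1 = tp / (tp + fn)   # share of true class-1 rows predicted as 1
print(accuracy, recall_class_0, recall_class_1)
```

<p>Looking at the two recall values separately shows whether one class is predicted worse than the other, which a single accuracy figure can hide.</p>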

## 11. Hyperparameter Tuning 

<p>In the confusion matrix, the first element of the first row denotes the true negatives, meaning the number of negative instances predicted correctly by the model. The last element of the second row denotes the true positives, meaning the number of positive instances predicted correctly. Note that the target used here is the last dummy column, <code>15_-</code>, so the positive class (1) corresponds to denied applications and the negative class (0) to approved ones.</p>
<p>But what if the score had not been perfect? In that case, I can perform a <a href="https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/">grid search</a> of the model parameters to improve the model's ability to predict credit card approvals.</p>
<p><a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">scikit-learn's implementation of logistic regression</a> consists of different hyperparameters but I will grid search over the following two:</p>
<ul>
<li>tol</li>
<li>max_iter</li>
</ul>

In [99]:
# GridSearchCV
from sklearn.model_selection import GridSearchCV

tol = [0.01,0.001,0.0001]
max_iter = [100,150,200]

param_grid = {"tol":tol,"max_iter":max_iter}
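
<p>To get a feel for the size of this search, the sketch below enumerates the combinations in the grid using only the standard library. The five folds are an assumption made for the arithmetic; with them, each of the candidate settings is fitted five times, followed by one final refit of the best setting if refitting is enabled.</p>

```python
from itertools import product

tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
param_grid = {"tol": tol, "max_iter": max_iter}

# Every combination of the listed values is one candidate setting
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 5  # assumption: five-fold cross-validation
print(len(candidates), len(candidates) * cv_folds)
```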

## 12. Credit Card Predictor Model

<p>I will instantiate <code>GridSearchCV()</code> with the earlier <code>logreg</code> model and fit it on the training data. I will also instruct <code>GridSearchCV()</code> to perform a <a href="https://www.dataschool.io/machine-learning-with-scikit-learn/">cross-validation</a> of five folds.<br>
This completes a <strong>machine learning model</strong> that predicts whether a person's application for a credit card would be approved, given some information about that person.</p>

In [100]:

grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

grid_model_result = grid_model.fit(rescaledX_train,y_train)

best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# Accuracy of the refitted best estimator on the held-out test set
test_accuracy = grid_model_result.score(rescaledX_test, y_test)
print("Accuracy of logistic regression classifier: ", test_accuracy)


Best: 1.000000 using {'max_iter': 100, 'tol': 0.01}
Accuracy of logistic regression classifier:  1.0
