Commercial banks receive _a lot_ of credit card applications. Many are rejected for reasons such as high loan balances, low income, or too many inquiries on the applicant's credit report. Analyzing these applications manually is mundane, error-prone, and time-consuming (and time is money!). Luckily, the task can be automated with machine learning, and many commercial banks now do so. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

![Credit card being held in hand](credit_card.jpg)

You have been provided with a small subset of the credit card applications a bank receives. The dataset has been loaded as a pandas DataFrame for you, and you will start from there.

In [174]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Load dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None)
cc_apps.head()

   0      1      2  3  4  5  6     7  8  9  10  11  12   13   14  15
0  b  30.83    0.0  u  g  w  v  1.25  t  t   1   f   g  202    0   +
1  a  58.67   4.46  u  g  q  h  3.04  t  t   6   f   g   43  560   +
2  a   24.5    0.5  u  g  q  h   1.5  t  f   0   f   g  280  824   +
3  b  27.83   1.54  u  g  w  v  3.75  t  t   5   t   g  100    3   +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0   f   s  120    0   +


In [175]:
# Drop features 11 and 13
cc_apps = cc_apps.drop([11, 13], axis=1)

In [176]:
# Split data into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)

In [177]:
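# Check to see if there are nulls.
# Note: .isnull().count() returns the number of rows in each column (228 for
# the test set), not the number of nulls; .isnull().sum() counts the nulls.
cc_apps_train.isnull().count()
cc_apps_test.isnull().count()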

0     228
1     228
2     228
3     228
4     228
5     228
6     228
7     228
8     228
9     228
10    228
12    228
14    228
15    228
dtype: int64
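The difference between counting rows and counting missing values can be seen on a small example. This is a minimal sketch on a made-up frame (the `toy` data and variable names are assumptions for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one real NaN and one "?" placeholder.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "?", "z"]})

# count() on the boolean mask counts every row, null or not.
rows_per_column = toy.isnull().count()

# sum() on the boolean mask counts only the True entries, i.e. the NaNs.
nulls_per_column = toy.isnull().sum()

# "?" is an ordinary string, so isnull() does not see it.
placeholders_per_column = (toy == "?").sum()

print(rows_per_column.to_dict())
print(nulls_per_column.to_dict())
print(placeholders_per_column.to_dict())
```

Here column `b` reports no nulls even though it holds a `"?"`, which is why placeholder strings need to be converted to NaN before imputing.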

In [178]:
# There are no NaN values yet because missing entries are stored as "?" strings;
# replace them with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
cc_apps_train_nans_replaced = cc_apps_train.replace("?", np.nan)
cc_apps_test_nans_replaced = cc_apps_test.replace("?", np.nan)
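Replacing `"?"` with NaN does not by itself make a column numeric. As a hedged illustration, assume a column was read as strings because it contained `"?"` (the `age` column below is hypothetical); a cast with `pd.to_numeric` is then needed before the column can take part in mean imputation or scaling:

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing numeric strings with the "?" placeholder.
toy = pd.DataFrame({"age": ["30.83", "?", "24.50"]})

replaced = toy.replace("?", np.nan)
dtype_before = replaced["age"].dtype  # non-numeric: remaining values are strings

# Cast the strings to floats; NaN entries stay NaN.
replaced["age"] = pd.to_numeric(replaced["age"])
dtype_after = replaced["age"].dtype

print(dtype_before, "->", dtype_after)
print(replaced["age"].isnull().sum(), "missing value(s)")
```

Without such a cast, a column like this would be treated as categorical and expanded into one dummy column per distinct value during encoding.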

In [179]:
# Impute missing values in numeric columns with the training-set mean
train_means = cc_apps_train_nans_replaced.mean(numeric_only=True)
cc_apps_train_imputed = cc_apps_train_nans_replaced.fillna(train_means)
cc_apps_test_imputed = cc_apps_test_nans_replaced.fillna(train_means)

# Impute missing values in object-type columns with that column's most
# frequent training value (fill each column separately, not the whole frame)
for col in cc_apps_train_imputed.columns:
    if cc_apps_train_imputed[col].dtypes == "object":
        most_frequent = cc_apps_train_imputed[col].value_counts().index[0]
        cc_apps_train_imputed[col] = cc_apps_train_imputed[col].fillna(most_frequent)
        cc_apps_test_imputed[col] = cc_apps_test_imputed[col].fillna(most_frequent)
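The imputation idea, learning fill values from the training split and applying them to both splits, can be sketched compactly by passing a dictionary to `fillna`. The frames and the `fill_values` name below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical train/test splits with missing values in both column types.
train = pd.DataFrame({"num": [1.0, 3.0, np.nan], "cat": ["u", "u", None]})
test = pd.DataFrame({"num": [np.nan, 10.0], "cat": [None, "y"]})

# Learn one fill value per column from the training split only.
fill_values = {}
for col in train.columns:
    if pd.api.types.is_numeric_dtype(train[col]):
        fill_values[col] = train[col].mean()          # mean for numeric columns
    else:
        fill_values[col] = train[col].mode().iloc[0]  # most frequent value otherwise

# A dict passed to fillna fills each column with its own value.
train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)

print(test_imputed)
```

Using training statistics for the test split keeps information from the test data out of the preprocessing step.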

In [180]:
# Convert categorical features into numeric
cc_apps_train_cat_encoding = pd.get_dummies(cc_apps_train_imputed)
cc_apps_test_cat_encoding = pd.get_dummies(cc_apps_test_imputed)

In [181]:
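# Checks: row counts per class in each split

# Training rows with 15_- == 0 (class "+"): 204
cc_apps_train_cat_encoding[cc_apps_train_cat_encoding["15_-"] == 0].count()

# Training rows with 15_- == 1 (class "-"): 258
cc_apps_train_cat_encoding[cc_apps_train_cat_encoding["15_-"] == 1].count()

# Test rows with 15_- == 0 (class "+"): 103
cc_apps_test_cat_encoding[cc_apps_test_cat_encoding["15_-"] == 0].count()

# Test rows with 15_- == 1 (class "-"): 125
cc_apps_test_cat_encoding[cc_apps_test_cat_encoding["15_-"] == 1].count()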

2       125
7       125
10      125
14      125
0_a     125
       ... 
12_g    125
12_p    125
12_s    125
15_+    125
15_-    125
Length: 221, dtype: int64

In [182]:
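# Reindex the test columns to match the training columns; dummy columns
# missing from the test set are added and filled with 0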
cc_apps_test_cat_encoding = cc_apps_test_cat_encoding.reindex(
    columns=cc_apps_train_cat_encoding.columns, fill_value=0
)
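Why the reindex is needed can be shown on a tiny example: `pd.get_dummies` only creates columns for categories it sees, so train and test can end up with different columns. A minimal sketch with made-up categories:

```python
import pandas as pd

# Hypothetical splits: "x" appears only in test, "p" and "s" only in train.
train = pd.DataFrame({"c": ["g", "p", "s"]})
test = pd.DataFrame({"c": ["g", "g", "x"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
print(list(train_enc.columns))
print(list(test_enc.columns))

# Align test to the training layout: missing columns are added as 0,
# and columns unknown to the training set are dropped.
aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(aligned.columns))
```

After alignment, a test row whose category never occurred in training is encoded as all zeros, and the model sees the same column order it was trained on.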

In [183]:
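# Put into separate variables.
# Caution: the last two columns, "15_+" and "15_-", are complementary dummies
# of the target. iloc[:, :-1] removes only "15_-", so "15_+" stays in X and
# reveals the label, which accounts for the perfect scores reported below.
# Excluding both target dummies from X avoids this leak.
X_train, y_train = (
    cc_apps_train_cat_encoding.iloc[:, :-1].values,
    cc_apps_train_cat_encoding.iloc[:, -1].values,  # 1-D target, as scikit-learn expects
)
X_test, y_test = (
    cc_apps_test_cat_encoding.iloc[:, :-1].values,
    cc_apps_test_cat_encoding.iloc[:, -1].values,
)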
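When the target column is one-hot encoded together with the features, slicing off only the last column leaves its complement among the features. The sketch below uses a made-up frame (the `f` and `label` names are assumptions) to contrast that slice with a split that removes every target dummy:

```python
import pandas as pd

# Hypothetical data: one numeric feature and a "+"/"-" label.
toy = pd.DataFrame({"f": [0.2, 0.9, 0.4, 0.7], "label": ["+", "-", "-", "+"]})
enc = pd.get_dummies(toy)
print(list(enc.columns))

# Slicing off only the last column keeps "label_+" as a feature.
leaky_X = enc.iloc[:, :-1]

# Leak-free split: drop every dummy derived from the label column.
target_cols = [c for c in enc.columns if c.startswith("label_")]
X = enc.drop(columns=target_cols)
y = enc["label_-"].astype(int).values

# The leaked column is exactly the complement of the target.
leak_is_complement = bool(((1 - leaky_X["label_+"].astype(int)) == y).all())
print(list(leaky_X.columns), list(X.columns), leak_is_complement)
```

A classifier given `leaky_X` can reproduce `y` from a single column, so perfect scores in that setting say nothing about how well the remaining features predict the outcome.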

In [184]:
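# Rescale features to [0, 1]; fit the scaler on the training data only and
# reuse the fitted scaler for the test data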
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
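`MinMaxScaler` maps each feature to `(x - min) / (max - min)`, using the minimum and maximum learned in `fit`. A short sketch on made-up arrays compares the scaler with the formula written out by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.5], [20.0]])

scaler_toy = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns min=0, max=10
test_scaled = scaler_toy.transform(X_test_toy)        # reuses the training min/max

# The same transformation by hand, with training statistics.
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)
manual = (X_test_toy - col_min) / (col_max - col_min)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test data is scaled with training statistics, test values outside the training range fall outside [0, 1], as the second test value does here.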

In [185]:
# LogisticRegression
logreg = LogisticRegression()
logreg.fit(rescaledX_train, y_train)
y_pred = logreg.predict(rescaledX_test)

In [186]:
# Confusion matrix of the logreg model
confusion_matrix(y_test, y_pred)

array([[103,   0],
       [  0, 125]])
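In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, so for a binary problem the flattened matrix reads true negatives, false positives, false negatives, true positives. A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)

# Unpack the 2x2 matrix row by row.
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()

print(cm)
print("tn, fp, fn, tp =", tn, fp, fn, tp)
print("accuracy =", accuracy)
```

The off-diagonal entries are the errors; a matrix with zeros off the diagonal, like the one above for the notebook's model, means every test row was classified correctly.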

In [187]:
# GridSearchCV
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
param_grid = {"tol":tol, 
              "max_iter":max_iter}

grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)
grid_model_result = grid_model.fit(rescaledX_train, y_train)

In [188]:
# Results: best cross-validated score and the parameters that produced it
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print(f"Best: {best_score:f} using {best_params}")

Best: 1.000000 using {'max_iter': 100, 'tol': 0.01}
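`best_params_` shows only the winning combination. To see how every combination scored, the `cv_results_` attribute can be loaded into a DataFrame. The sketch below uses synthetic data from `make_classification` rather than the credit card data, so its scores are illustrative only:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for this sketch).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid_toy = {"tol": [0.01, 0.001, 0.0001], "max_iter": [100, 150, 200]}
grid_toy = GridSearchCV(LogisticRegression(), param_grid=param_grid_toy, cv=5)
grid_toy.fit(X_toy, y_toy)

# One row per parameter combination, best rank first.
results = pd.DataFrame(grid_toy.cv_results_)[
    ["param_tol", "param_max_iter", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results)
```

If many combinations share the top rank, the grid search has not found a meaningful difference between them, and the reported best parameters are simply one of the tied candidates.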


In [189]:
# Extract best model and evaluate it on the test set
best_model = grid_model_result.best_estimator_
print(
    "Accuracy of logistic regression classifier: ",
    best_model.score(rescaledX_test, y_test),
)

Accuracy of logistic regression classifier:  1.0
