![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many of them get rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None) 
cc_apps.head()

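|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |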


This is a binary classification problem: each application is either approved or rejected. Logistic Regression is a natural baseline for this kind of task, so it is the first model used.
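
As a rough illustration of why Logistic Regression suits a two-class target, the sketch below shows the logistic (sigmoid) function turning a real-valued score into a probability, which is then thresholded at 0.5. The scores are made up for illustration and the helper name `sigmoid` is not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up linear scores (w . x + b) for three hypothetical applicants
scores = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(scores)

# Predict class 1 (approved) when the probability reaches 0.5
labels = (probs >= 0.5).astype(int)
print(probs.round(3), labels)
```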

Perform basic EDA: understand data types and spot missing values.

In [3]:
cc_apps.info() # column 1 is of type object even though it's populated with floats

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    int64  
 13  13      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB


In [4]:
# Dig deeper into column 1 to make sure conversion to float is possible
cc_apps[1].value_counts() # there are twelve '?' values which will prevent conversion

# Logistic Regression cannot handle '?' placeholders or null values,
# so convert the placeholders to NaN and drop those rows
cc_apps[1] = cc_apps[1].replace('?', np.nan)
# .copy() avoids a SettingWithCopyWarning on the assignment below;
# reset_index keeps the row labels contiguous after the drop
cc_apps = cc_apps.dropna(subset=[1]).reset_index(drop=True).copy()

# Convert to float
cc_apps[1] = cc_apps[1].astype(float)
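
Column 1 may not be the only one that uses `'?'` as a missing-value placeholder. A minimal sketch of how every column could be checked at once, using a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the '?' placeholder convention
toy = pd.DataFrame({
    0: ["a", "?", "b", "a"],
    1: ["30.83", "?", "24.50", "27.83"],
    2: [0.0, 4.46, 0.5, 1.54],
})

# Count '?' per column; numeric columns simply report zero
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)

# Replace every '?' with NaN, drop incomplete rows, then convert column 1
clean = toy.replace("?", np.nan).dropna().copy()
clean[1] = clean[1].astype(float)
print(clean.dtypes)
```

If the counts for the categorical columns are non-zero, those rows would otherwise end up as their own `'?'` category after one-hot encoding, which may or may not be what is wanted.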

In [5]:
# Summary statistics for the numeric columns
cc_apps.describe() # column 12 is on a much larger scale than the others

|       | 1 | 2 | 7 | 10 | 12 |
|-------|---|---|---|----|----|
| count | 678.0 | 678.0 | 678.0 | 678.0 | 678.0 |
| mean | 31.568171 | 4.777625 | 2.209226 | 2.435103 | 1021.240413 |
| std | 11.957862 | 4.99724 | 3.350755 | 4.896966 | 5251.971453 |
| min | 13.75 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 22.6025 | 1.0 | 0.165 | 0.0 | 0.0 |
| 50% | 28.46 | 2.75 | 1.0 | 0.0 | 5.0 |
| 75% | 38.23 | 7.4375 | 2.57375 | 3.0 | 395.5 |
| max | 80.25 | 28.0 | 28.5 | 67.0 | 100000.0 |
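
The spread of column 12 (a maximum of 100000 against a median of 5) is the reason scaling matters here. A minimal sketch of what `StandardScaler` computes, z = (x - mean) / std per column, on made-up numbers rather than the real data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: one small-range feature and one heavy-tailed feature
X = np.array([
    [1.0, 0.0],
    [2.0, 5.0],
    [3.0, 400.0],
    [4.0, 100000.0],
])

scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract the column mean, divide by the population std
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled.round(3))
print(np.allclose(scaled, manual))
```

Standardization puts both columns on a comparable scale, but it does not remove the skew of the heavy-tailed one; a log transform is one option that could be explored for that.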


Feature engineering

In [6]:
# Inspect target column (last one - 13)
cc_apps[13].value_counts() # assume '+' means approved and '-' means rejected

# Encode the target as binary integers (1 = approved, 0 = rejected)
cc_apps[13] = cc_apps[13].map({'+': 1, '-': 0}).astype(int)

In [7]:
# Two transformations are still pending before training:
#   1. One-hot encode the object (categorical) fields
#   2. Scale the features

# One-hot encode categorical fields
cc_apps = pd.get_dummies(cc_apps, columns=[0, 3, 4, 5, 6, 8, 9, 11])

# Separate the features from the target (column 13)
features = cc_apps.drop([13], axis=1)

# StandardScaler needs column names of a single type, so cast them all to strings
features.columns = features.columns.astype(str)

# Scale into a new DataFrame, keeping the row index so it stays aligned with the target
# Caveat: the scaler is fitted on all rows here, before the train/test split,
# so test-set statistics leak into the training features
scaler = StandardScaler()
features_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns, index=features.index)
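
Fitting the scaler on the full dataset, as done here, lets test-set statistics influence the training features. One common alternative, sketched below on synthetic data, is to wrap preprocessing and the model in a `Pipeline` so that the scaler and encoder are fitted on the training split only. The column names `amount`, `age`, and `kind` and all generated values are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for the application data
df = pd.DataFrame({
    "amount": rng.exponential(1000, n),
    "age": rng.normal(35, 10, n),
    "kind": rng.choice(["u", "y", "l"], n),
})
y = (df["amount"] + rng.normal(0, 300, n) > 800).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)

# Scale numeric columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["amount", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["kind"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling statistics and categories from the training rows only
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(round(test_accuracy, 3))
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the preprocessing is refitted inside every cross-validation fold.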


Now train some models

In [8]:
# Define target data
target = cc_apps[13]

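# Split data for training and testing
# Note: test_size=0.8 holds out 80% of the rows (543 of 678) for testing and trains on only 135;
# a more conventional split would be test_size=0.2. The results below were produced with 0.8.
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.8, random_state=42)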

# Target data is binary - use Logistic Regression
# Initialize
logreg = LogisticRegression(random_state=42)

# Train and predict
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

# Evaluate
print(classification_report(y_test, logreg_pred)) # Accuracy ~.84 - more than satisfactory
print(confusion_matrix(y_test, logreg_pred))

              precision    recall  f1-score   support

           0       0.86      0.85      0.86       305
           1       0.81      0.82      0.82       238

    accuracy                           0.84       543
   macro avg       0.84      0.84      0.84       543
weighted avg       0.84      0.84      0.84       543

[[260  45]
 [ 42 196]]
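
The figures in the classification report can be recovered from the confusion matrix itself. A small sketch that recomputes accuracy and the class-1 precision, recall, and F1 from the four counts printed above (scikit-learn's convention is rows = true class, columns = predicted class):

```python
import numpy as np

# Confusion matrix from the Logistic Regression output above
cm = np.array([[260, 45],
               [42, 196]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```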


Can accuracy still be improved? Let's try a Random Forest Classifier.

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize
rf = RandomForestClassifier(random_state=42)

# Train and predict
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

# Evaluate
print(classification_report(y_test, rf_pred)) # Accuracy ~.87 - even better
print(confusion_matrix(y_test, rf_pred))

              precision    recall  f1-score   support

           0       0.90      0.85      0.88       305
           1       0.82      0.88      0.85       238

    accuracy                           0.87       543
   macro avg       0.86      0.87      0.86       543
weighted avg       0.87      0.87      0.87       543

[[260  45]
 [ 28 210]]


On this split, the Random Forest Classifier scores higher than Logistic Regression (0.87 vs. 0.84 accuracy). However, we've only used vanilla versions of both models, so we can still tune their hyperparameters to see whether better accuracy can be achieved.
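
A single train/test split gives one noisy estimate per model. A hedged sketch of how the two vanilla models could instead be compared with k-fold cross-validation, shown here on a synthetic dataset from `make_classification` (a stand-in, not the credit card data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same number of rows as the cleaned dataset
X, y = make_classification(
    n_samples=678, n_features=20, n_informative=8, random_state=42
)

models = {
    "logreg": LogisticRegression(max_iter=1000, random_state=42),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Collect five accuracy scores per model, one for each held-out fold
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Comparing the mean alongside the standard deviation helps judge whether a gap of a few points between models is larger than the fold-to-fold variation.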

Let's use `GridSearchCV` for that.

In [10]:
# Define parameters to test for both models
logreg_params = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}

rf_params = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Try Logistic Regression first
# Instantiate GridSearch object
grid_search_logreg = GridSearchCV(
    estimator = logreg,
    param_grid = logreg_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1    
)

# Fit grid
grid_search_logreg.fit(X_train, y_train)

# Get best parameters, mean cross-validation score, and refitted model
logreg_best_params = grid_search_logreg.best_params_
logreg_best_score = grid_search_logreg.best_score_
logreg_best_model = grid_search_logreg.best_estimator_

# Now try Random Forest
# Instantiate
grid_search_rf = GridSearchCV(
    estimator = rf,
    param_grid = rf_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1    
)

# Fit
grid_search_rf.fit(X_train, y_train)

# Get best parameters, mean cross-validation score, and refitted model
rf_best_params = grid_search_rf.best_params_
rf_best_score = grid_search_rf.best_score_
rf_best_model = grid_search_rf.best_estimator_

# Assess both tuned models on the held-out test set
# (separate names so the cross-validation scores above are not overwritten)
logreg_pred_best = logreg_best_model.predict(X_test)
logreg_test_score = accuracy_score(y_test, logreg_pred_best)
rf_pred_best = rf_best_model.predict(X_test)
rf_test_score = accuracy_score(y_test, rf_pred_best)

# Print test accuracy: Logistic Regression first, then Random Forest
print(logreg_test_score)
print(rf_test_score)


0.856353591160221
0.8637200736648251


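With GridSearch, the picture is mixed: Logistic Regression improved from about 0.84 to 0.856, while Random Forest slipped marginally from 0.866 to 0.864, which amounts to a single test example out of 543. A change that small is within noise and is not, by itself, evidence of overfitting. With that in mind, and having already achieved an accuracy above .75, the exercise stops here.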
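
Note that `best_score_` is the mean cross-validated accuracy over the validation folds of the training data, which is a different quantity from accuracy on the held-out test set. A minimal sketch on synthetic data that keeps the two apart (the dataset, the variable names, and the small grid are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)

# Mean accuracy across the 5 validation folds for the best parameter setting
cv_accuracy = grid.best_score_
# Accuracy of the refitted best model on data never seen during the search
test_accuracy = grid.score(X_te, y_te)

print(grid.best_params_)
print(f"cv={cv_accuracy:.3f} test={test_accuracy:.3f}")
```

A cross-validation score well above the test score would be a more direct hint of overfitting than a small change in test accuracy between tuned and untuned models.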

Afterthoughts: other binary classification models, such as SVC and XGBoost, could also be considered. Further gains in accuracy may come from more complex feature engineering. Keep in mind, too, that other metrics, like F1 score, have not been studied.
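
For reference, a short sketch of how precision, recall, and F1 could be computed directly with scikit-learn. The labels below are made up for illustration; in the notebook, `y_test` and a prediction array such as `rf_pred` would take their place.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up ground truth and predictions (1 = approved, 0 = rejected)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)
```

The same metric can drive hyperparameter tuning by passing `scoring='f1'` to `GridSearchCV` instead of `scoring='accuracy'`.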

In [11]:
# Best test accuracy overall comes from the untuned Random Forest
best_score = accuracy_score(y_test, rf_pred)
print(best_score)

0.8655616942909761
