# Predicting Credit Card Approvals

Every week, millions of people around the world apply for credit in the form of a credit card. However, determining how much credit to approve, and whether to approve the customer at all, is a challenge that faces every major bank. Today, most banks take advantage of trained models that can predict whether or not someone should be issued a credit card. In this simple example, we will train a model on a sample dataset to predict whether a credit card application should be accepted or rejected.

This example uses the credit card approval dataset from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/credit+approval). It is worth noting that to protect confidentiality, all the attribute names and values have been obfuscated.

In [None]:
# Import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from warnings import filterwarnings
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score

# Set our seed values for reproducible results
np.random.seed(1)

# Load dataset
cc_apps = pd.read_csv("datasets/cc_approvals.data", header=None)

# Let's take a quick look at the head of the data
print(cc_apps.head())

This dataset is pretty small, and inspecting the head we can see what we discussed in the introduction - all the values have been obfuscated for privacy reasons, and even the feature names don't exist. Using this [blog post](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) as a source, we can see the column names. They are: Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally the ApprovalStatus. The ApprovalStatus is what we are looking to predict, and in the head of the data above we can see that it is either a + or a - for approval or rejection respectively.

Out of 690 instances in this data, there are 383 (55.5%) applications that were denied and 307 (44.5%) applications that were approved. As we progress through this notebook, and especially in our visualisation stage, it is important to keep this in mind: more applications are rejected than accepted.
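
The class balance quoted above can be checked with `value_counts`. This is a minimal sketch: `approval_status` is a hypothetical stand-in for column 15, built from the counts quoted in the text rather than loaded from the file.

```python
import pandas as pd

# Hypothetical stand-in for column 15 (ApprovalStatus), built from the quoted counts
approval_status = pd.Series(["-"] * 383 + ["+"] * 307)

# Absolute counts per class
counts = approval_status.value_counts()

# Share of each class as a percentage, rounded to one decimal place
shares = approval_status.value_counts(normalize=True).mul(100).round(1)

print(counts)
print(shares)
```

On the real data, the same two calls on `cc_apps[15]` should give the same picture.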

In [None]:
# Display summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

# Print DataFrame information (info() prints its report directly and returns None)
cc_apps.info()

Something interesting here is that column 1 (Age) has been assigned an object data type. However, we know from looking at the data that it holds floating point values, so later on we will need to change the data type of this column.
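
A likely reason for the object data type is that a non-numeric placeholder sits among the numbers. The sketch below uses a made-up column (`age_raw` is hypothetical, not taken from the dataset) to show how a single `?` keeps a column as object, and how `pd.to_numeric` with `errors="coerce"` turns it into floats.

```python
import pandas as pd

# Made-up column: numeric strings plus a "?" placeholder, stored as objects
age_raw = pd.Series(["30.83", "58.67", "?", "24.50"], dtype=object)
print(age_raw.dtype)

# errors="coerce" converts anything that cannot be parsed as a number to NaN
age = pd.to_numeric(age_raw, errors="coerce")
print(age.dtype)
print(age.isna().sum())
```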

## Data Inspection and Cleaning
Now that we've loaded in our data and taken a quick look at it, let's look for missing values and clean the data accordingly. It's worth noting that I've done this step before visualisation - a case can be made to visualise the data before you clean it, or to visualise the data initially, then clean it, then visualise it again, but in a dataset like this where there are very few missing values we can simply clean it initially and then visualise it later.

In [None]:
# We can take a look at the tail of the dataset and see if we can see any missing values
print(cc_apps.tail(20))

We can see some ? values in the data here (e.g. row 673, column 0), which represent missing values in this dataset. To make the data easier to work with, we can replace each ? with NaN.

In [None]:
# Replace the '?'s in the dataset with NaN
cc_apps = cc_apps.replace("?", np.nan)
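
As an alternative sketch, the placeholder can be handled while loading by passing `na_values="?"` to `pd.read_csv`. The three rows below are invented for illustration and only mimic the layout of the real file.

```python
import io
import pandas as pd

# Invented rows in a similar layout; "?" marks a missing value
raw = io.StringIO("b,30.83,0.0,+\n?,24.50,1.5,-\na,?,2.0,-\n")

# na_values tells pandas to treat "?" as NaN at parse time
sample = pd.read_csv(raw, header=None, na_values="?")

print(sample.isnull().sum())
print(sample.dtypes)
```

With this approach the numeric column is parsed as a float straight away, so no separate conversion step should be needed for it.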

Now that our missing values are represented by NaN, let's see how many missing values are actually in the entirety of this dataset.

In [None]:
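# Count the number of NaNs in the dataset and print the counts to verify
print("NaN values:")
print(cc_apps.isnull().sum())

# Convert column 1 (Age) from object to a numeric (float) data type
cc_apps[1] = pd.to_numeric(cc_apps[1], downcast="float")

# info() prints its report directly and returns None
cc_apps.info()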

We can see missing values in column 0 (Gender), column 1 (Age), column 3 (Married), column 4 (BankCustomer), column 5 (EducationLevel), column 6 (Ethnicity), and column 13 (ZipCode). Now we need to decide what to do with our missing values.

There are a few options here. The first option is simply to drop any rows (applications) that contain missing values. This is a pretty small dataset, so even though there are not that many missing values, dropping these rows is something we would like to avoid. Instead, we can try to fill in the missing values, and that's what we will do here. Comparing this approach against simply dropping the rows with missing values, to see which gives a better model, is left as a possible extension at the end of this notebook.

For our columns that are numeric (i.e. income, age) we can simply use mean imputation.

In [None]:
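# Impute the missing values using mean imputation - this works for numeric values only!
# numeric_only=True skips the object columns, which newer pandas versions refuse to average
cc_apps.fillna(cc_apps.mean(numeric_only=True), inplace=True)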

However, not all columns are numeric! Many are objects and contain strings (e.g. column 0 representing the gender is either "a" or "b"). For this, we will simply take the most frequent value that appears in the respective column and use that to impute the missing value.

In [None]:
# To deal with the remaining NaN values in object columns, we need to use the most frequent value that appears
# Iterate over each column of cc_apps
for col in cc_apps:
    # Check if the column is of object type
    if cc_apps[col].dtypes == 'object':
        # Impute this column's missing values with its own most frequent value
        cc_apps[col] = cc_apps[col].fillna(cc_apps[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify
print("NaN values:")
print(cc_apps.isnull().sum())

Perfect, we can see that there are no more missing values in the data.
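
The same two imputation strategies can also be sketched with scikit-learn's `SimpleImputer`, which is not used elsewhere in this notebook. The `toy` frame below is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented data with one missing value per column
toy = pd.DataFrame({
    "age": [30.0, np.nan, 24.0],
    "gender": pd.Series(["b", np.nan, "b"], dtype=object),
})

# Mean imputation for the numeric column
num_imputer = SimpleImputer(strategy="mean")
age_filled = num_imputer.fit_transform(toy[["age"]]).ravel()

# Most-frequent imputation for the categorical column
cat_imputer = SimpleImputer(strategy="most_frequent")
gender_filled = cat_imputer.fit_transform(toy[["gender"]]).ravel()

print(age_filled)
print(gender_filled)
```

One possible advantage of an imputer object is that it can be fitted on the training split and then reused on the test split, which keeps test data out of the imputed statistics.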

## Visualise our data
Now that we've cleaned our data appropriately, let's visualise the data and get a better understanding of the correlation between some of the features. I personally prefer to view images outside of the notebook environment, so all figures have the option to be saved as images locally; however, I have commented out these lines to save space in the notebook environment. If you want to save and download them yourself, simply uncomment the lines. Some of the larger plots are oriented vertically for easier viewing in a notebook environment.

In [None]:
# Initially we will plot the first column (Gender) vs the final column (ApprovalStatus)
sns.set_theme(style="ticks")

fig = plt.subplots(figsize=(10, 8))
genderPlot = sns.countplot(x=cc_apps[0], hue=cc_apps[15])
genderPlot.set(xlabel="Gender", ylabel="Observations")
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Gender")
plt.show()
# fig = genderPlot.get_figure()
# fig.savefig("images/genderPlot.png", bbox_inches='tight', pad_inches=0.5)

It is interesting to note that in this dataset there were 12 missing values out of a total of 690 observations in the Gender column. Of the remaining 678 observations, 468 applications are by someone of gender b, and 210 by someone of gender a. For gender a, the numbers of approved and rejected applications are very similar; however, gender b has ~25% more rejections than approvals.

Next we will look at the age column.

In [None]:
# We need to sort our data in ascending order of age to plot this
age_cc_apps = cc_apps.sort_values(by=[1], ascending=True)

# fig = plt.subplots(figsize=(96, 12))
# agePlot1 = sns.countplot(x=age_cc_apps[1], hue=age_cc_apps[15])
fig = plt.subplots(figsize=(12, 80))
agePlot1 = sns.countplot(y=age_cc_apps[1], hue=age_cc_apps[15])
agePlot1.set(xlabel="Observations", ylabel="Age")
plt.xticks(rotation=70)
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Age")
plt.show()
# fig = agePlot1.get_figure()
# fig.savefig("images/agePlot.png", bbox_inches='tight', pad_inches=0.5)

Looking at this (rather large) plot we can get some basic information - rejections clearly seem focused towards the top of the plot, where the applicants are younger. We can see one large spike of rejections at age 31.5 - this is the data that we previously imputed using the mean age, so we can see that a lot of these applications were rejected, potentially because they had missing information on their application, or for other unknown reasons.

We can take a closer look at this data by finding the median age across all the applications and then looking at rejections above and below the median. From looking at the above plot (and a bit of a sanity check) we would expect to find significantly more rejections below the median than approvals.

In [None]:
# Find and display the median age
print("The median age in this data is {}.".format(cc_apps[1].median()))

# 28 is used below as an approximate cut-off for the median age
at_or_below = age_cc_apps[1] <= 28
above = age_cc_apps[1] > 28
approved = age_cc_apps[15] == "+"
rejected = age_cc_apps[15] == "-"

approvals_below_median_age = (at_or_below & approved).sum()   # how many approvals at or under 28
approvals_above_median_age = (above & approved).sum()         # how many approvals above 28
rejections_below_median_age = (at_or_below & rejected).sum()  # how many rejections at or under 28
rejections_above_median_age = (above & rejected).sum()        # how many rejections above 28

# Summing these 4 numbers together we get 690, the total amount of applications in the dataset
print("The number of approvals below the median age is {}. ".format(approvals_below_median_age))
print("The number of approvals above the median age is {}. ".format(approvals_above_median_age))
print("The number of rejections below the median age is {}. ".format(rejections_below_median_age))
print("The number of rejections above the median age is {}. ".format(rejections_above_median_age))

# We can take a closer look now with a plot, using the same cut-off (at or under 28) as the counts above
fig = plt.subplots(figsize=(12, 48))
agePlot2 = sns.countplot(y=age_cc_apps[1][(age_cc_apps[1] <= 28)], hue=age_cc_apps[15])
agePlot2.set(xlabel="Observations", ylabel="Age")
plt.xticks(rotation=70)
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Age")
plt.show()
# fig = agePlot2.get_figure()
# fig.savefig("images/agePlotBelowMedian.png", bbox_inches='tight', pad_inches=0.5)

This makes things a little clearer. We can see 132 approvals below the median age, while there are 198 rejections below the median age. Above the median, the values are closer, with 185 approvals and 185 rejections. This supports our hypothesis that younger applicants (those below the median age) are more likely to have their application rejected.
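
Raw counts can be harder to compare than rates. The following sketch computes a rejection rate per age band with `groupby`; the ages, outcomes and the `threshold` name are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented ages and outcomes ("+" approved, "-" rejected)
toy = pd.DataFrame({
    "age": [19.0, 22.5, 27.0, 28.0, 33.0, 45.0, 52.0, 61.0],
    "status": ["-", "-", "+", "-", "+", "-", "+", "+"],
})
threshold = 28  # same approximate cut-off as used for the counts above

# Label each applicant by which side of the cut-off they fall on
toy["age_band"] = (toy["age"] <= threshold).map({True: "at or below", False: "above"})

# The mean of a boolean column is the share of True values, i.e. the rejection rate
rejection_rate = (toy["status"] == "-").groupby(toy["age_band"]).mean()
print(rejection_rate)
```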

Let's look at one final plot for the age - applicants at or below the age of 21. Again, we expect to see significantly more rejections than approvals.

In [None]:
# We can also inspect for ages well below the median - in this case an age of 21 or below
approvals_under_21 = len(age_cc_apps[(age_cc_apps[1] <= 21) & (age_cc_apps[15] == "+")])
rejections_under_21 = len(age_cc_apps[(age_cc_apps[1] <= 21) & (age_cc_apps[15] == "-")])
print("Approvals under the age of 21: {}.".format(approvals_under_21))
print("Rejections under the age of 21: {}.".format(rejections_under_21))

# We can take a closer look now with a plot, using the same cut-off (21 or below) as the counts above
fig = plt.subplots(figsize=(10, 8))
agePlot3 = sns.countplot(x=age_cc_apps[1][(age_cc_apps[1] <= 21)], hue=age_cc_apps[15])
agePlot3.set(xlabel="Age", ylabel="Observations")
plt.xticks(rotation=70)
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Age - 21 and under")
plt.show()
# fig = agePlot3.get_figure()
# fig.savefig("images/agePlotBelow21.png", bbox_inches='tight', pad_inches=0.5)

Again we see the same correlation - more rejections than approvals at or below the age of 21, with a lot of those approvals being between the age of 20 and 21. We see almost all rejections below the age of 18.

We can move on now to the next column, debt.

In [None]:
fig = plt.subplots(figsize=(12, 48))
debtPlot = sns.countplot(y=cc_apps[2], hue=cc_apps[15])
debtPlot.set(xlabel="Observations", ylabel="Debt")
plt.xticks(rotation=70)
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Debt")
plt.show()
# fig = debtPlot.get_figure()
# fig.savefig("images/debtPlot.png", bbox_inches='tight', pad_inches=0.5)

This is quite a large figure, but it is clear that in the bottom third of the figure we see more rejections than approvals, with the majority between a debt of 1 and 3. It is harder to draw clear-cut conclusions from this plot - there are a lot of rejections for applicants with no debt, which seems counterintuitive. One possible explanation is that these applicants have not proven their ability to manage debt (mortgage, loans, prior credit cards), leading to a rejection. There are other columns, such as credit score, later in this visualisation which will help us understand this further.

Marital status is the next column to inspect.

In [None]:
# Now we take a look at the Marital Status and how it relates to the ApprovalStatus
fig = plt.subplots(figsize=(10, 8))
maritalStatus = sns.countplot(x=cc_apps[3], hue=cc_apps[15])
maritalStatus.set(xlabel="Marital Status", ylabel="Observations")
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Marital Status")
plt.show()
# fig = maritalStatus.get_figure()
# fig.savefig("images/maritalStatus.png", bbox_inches='tight', pad_inches=0.5)

With obfuscated columns it's always hard to glean information, but we can assume that "u" and "y" denote a married and single applicant respectively, with the other values potentially referring to de-facto relationships.

Whether or not the applicant is a customer of the bank is the next column.

In [None]:
# Bank Customer
fig = plt.subplots(figsize=(10, 8))
bankCustomer = sns.countplot(x=cc_apps[4], hue=cc_apps[15])
bankCustomer.set(xlabel="Bank Customer", ylabel="Observations")
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Bank Customer")
plt.show()
# fig = bankCustomer.get_figure()
# fig.savefig("images/bankCustomer.png", bbox_inches='tight', pad_inches=0.5)

The value "g" here denotes a positive value - that is, the applicant is a customer of the bank - while "p" denotes that the applicant is not a customer. We can see that there are significantly more rejections than approvals for applicants who are not customers of the bank. However, it does not seem to be a significant factor (more on this in the feature importance section later), as there is also a significant number of rejections for applicants who are customers, and the data contains more rejections than approvals overall.

We will now look at the education level of an applicant.

In [None]:
# Next let's look at the Education Level
fig = plt.subplots(figsize=(10, 8))
educationPlot = sns.countplot(x=cc_apps[5], hue=cc_apps[15])
educationPlot.set(xlabel="Education Level", ylabel="Observations")
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Education Level")
plt.show()
# fig = educationPlot.get_figure()
# fig.savefig("images/educationPlot.png", bbox_inches='tight', pad_inches=0.5)

This is a complicated plot to draw information from, as there are many different education levels and all are obfuscated. Logically, you would expect applicants with a higher education level (undergraduate, postgraduate or postdoctoral study) to have a higher income and thus be less of a risk to issue credit to. However, this isn't always the case, and in the feature importance section we will see that the education level is not one of the most significant factors in this model. Education levels such as "ff" exhibit a significantly higher number of rejections than approvals, whereas "q" and "cc" are the opposite, with significantly more approvals than rejections.
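
When a categorical column has many levels, a normalised cross-tabulation can make comparisons like the one above easier to read than bar heights. This is a sketch on invented rows that reuse a few of the level names; the counts are not from the dataset.

```python
import pandas as pd

# Invented rows reusing some of the obfuscated level names
toy = pd.DataFrame({
    "education": ["ff", "ff", "ff", "q", "q", "cc", "cc", "cc"],
    "status":    ["-",  "-",  "+",  "+", "+", "+",  "+",  "-"],
})

# normalize="index" turns each row into shares that sum to 1 for that level
rates = pd.crosstab(toy["education"], toy["status"], normalize="index")
print(rates.sort_values("+", ascending=False))
```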

The next column is the ethnicity of the applicant.

In [None]:
# Now the Ethnicity
fig = plt.subplots(figsize=(10, 8))
ethnicityPlot = sns.countplot(x=cc_apps[6], hue=cc_apps[15])
ethnicityPlot.set(xlabel="Ethnicity", ylabel="Observations")
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Ethnicity")
plt.show()
# fig = ethnicityPlot.get_figure()
# fig.savefig("images/ethnicityPlot.png", bbox_inches='tight', pad_inches=0.5)

This is a tricky column and brings up ethical issues. Luckily, looking at this plot we don't need to dive into that, as there doesn't seem to be any major correlation between the ethnicity of an applicant and their application status. A case could be made that ethnicity "ff" has a significantly higher number of rejections than approvals, but the feature importance section of our model supports the view that ethnicity is not a major factor.

The amount of years that an applicant has been employed is the next column.

In [None]:
# Years Employed
fig = plt.subplots(figsize=(12, 48))
yearsEmployed = sns.countplot(y=cc_apps[7], hue=cc_apps[15])
yearsEmployed.set(xlabel="Observations", ylabel="Years Employed")
plt.xticks(rotation=70)
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Years Employed")
plt.show()
# fig = yearsEmployed.get_figure()
# fig.savefig("images/yearsEmployedPlot.png", bbox_inches='tight', pad_inches=0.5)

There is a pretty clear correlation here between the number of years someone has been employed and whether or not they are approved. This makes sense as a sanity check - the more years someone has been employed, the more stable they are, and thus the lower the risk of them defaulting. We can see rejections mostly focused on the bottom quarter of our plot, where the years employed is between 0 and 1, with the largest number of rejections for applicants who are currently not employed.

We can now move on to Prior Default.

In [None]:
# Prior Default
fig = plt.subplots(figsize=(10, 8))
priorDefault = sns.countplot(x=cc_apps[8], hue=cc_apps[15])
priorDefault.set(xlabel="Prior Default", ylabel="Observations")
priorDefault.set_xticklabels(["False", "True"])
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Prior Default")
plt.show()
# fig = priorDefault.get_figure()
# fig.savefig("images/priorDefaultPlot.png", bbox_inches='tight', pad_inches=0.5)

Again this is a pretty obvious column and the labels have been corrected to make this clear. If someone has previously defaulted on a loan they are far more of a risk. The vast majority of rejections come in where applicants have a prior default.

The next column is Employment Status.

In [None]:
# Employment Status
fig = plt.subplots(figsize=(10, 8))
employmentStatus = sns.countplot(x=cc_apps[9], hue=cc_apps[15])
employmentStatus.set(xlabel="Employment Status", ylabel="Observations")
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Employment Status")
plt.show()
# fig = employmentStatus.get_figure()
# fig.savefig("images/employmentStatusPlot.png", bbox_inches='tight', pad_inches=0.5)

This column links in to the YearsEmployed column. Recall previously that the most rejections we saw were when an applicant had zero years employed, i.e. they were unemployed. This plot further reinforces that, with far more rejections for applicants who are currently unemployed than those who are employed.

Next is the credit score of an applicant.

In [None]:
# Credit Score
fig = plt.subplots(figsize=(10, 8))
creditScore = sns.countplot(x=cc_apps[10], hue=cc_apps[15])
creditScore.set(xlabel="Credit Score", ylabel="Observations")
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Credit Score")
plt.show()
# fig = creditScore.get_figure()
# fig.savefig("images/creditScorePlot.png", bbox_inches='tight', pad_inches=0.5)

Of the 383 rejected applications in this data, this shows that almost 300 have a credit score of zero. This may vary based on country, but a zero credit score generally means that the applicant has defaulted on a prior loan, rather than simply not having any credit history. This would make a lot of sense in explaining why so many applicants with a zero credit score were denied. The higher the credit score, the more likely an application is to be approved, and in fact we can see that there are almost no rejections for applicants with a credit score of over 5.

The final column we will inspect is the income of an applicant.

In [None]:
# Income
fig = plt.subplots(figsize=(12, 48))
income = sns.countplot(y=cc_apps[14], hue=cc_apps[15])
income.set(xlabel="Observations", ylabel="Income")
plt.xticks(rotation=70)
plt.legend(title="Application Status", labels=["Approved", "Rejected"], loc="upper right")
plt.title("Approved Credit Card Applications by Income")
plt.show()
# fig = income.get_figure()
# fig.savefig("images/incomePlot.png", bbox_inches='tight', pad_inches=0.5)

This plot (once again, quite large) looks at the income of an applicant. As we would expect, there are far more rejections for those on low incomes than those on higher incomes. In fact, from the very first bar we can see that a lot of applicants in this dataset actually have zero income! This is probably a huge contributing factor to why there are so many rejections in this data. The vast majority of rejections are centered around the 0-10 range.

## Model Preprocessing and Training
Now that we've dealt with all the missing values in the dataset and visualised our data to get a better understanding of the features, we can move on to creating and training a model.

The first thing we need to do is encode our categorical features, which pandas has given an object data type. For this we will use scikit-learn's LabelEncoder. Another option here could have been to perform one-hot encoding on the data.

In [None]:
# Initialise the LabelEncoder
# The LabelEncoder will convert our categorical features (with an object data type) into numerical features
le = LabelEncoder()

# Iterate over each column and check its dtype
for col in cc_apps:
    # See if the dtype is object
    if cc_apps[col].dtypes == 'object':
        # Use LabelEncoder to transform the column
        cc_apps[col] = le.fit_transform(cc_apps[col])
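
For comparison, the one-hot encoding option mentioned above could look roughly like this with `pd.get_dummies`. The `toy` frame and its column names are invented for illustration.

```python
import pandas as pd

# Invented frame with one categorical and one numeric column
toy = pd.DataFrame({
    "married": ["u", "y", "l", "u"],
    "debt": [0.0, 4.46, 0.5, 1.54],
})

# Each category becomes its own 0/1 column; numeric columns are left untouched
encoded = pd.get_dummies(toy, columns=["married"], dtype=int)
print(encoded)
```

Unlike label encoding, this does not impose an arbitrary ordering on the categories, which tends to matter more for linear models such as logistic regression than for tree-based models.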

In [None]:
# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
# Features 11 and 13 are DriversLicense and ZipCode, both of which are not relevant to our model and have very low feature importance
cc_apps = cc_apps.drop([11, 13], axis=1)
cc_apps = cc_apps.to_numpy()

# Split features and labels into separate variables, X being inputs and y being our target that we are trying to predict (ApprovalStatus)
X, y = cc_apps[:, 0:13], cc_apps[:, 13]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [None]:
# Create a MinMaxScaler instance and use it to rescale X_train and X_test from 0 to 1
# Fit the scaler on the training data only, then apply the same scaling to the test data
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
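
One way to make sure scaling statistics only ever come from training data is to wrap the scaler and the model in a scikit-learn pipeline. This is a sketch on synthetic data from `make_classification`, not on the credit card dataset; all `*_demo`, `X_tr` and `X_te` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with the same number of features (13)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1)

# fit() learns the min/max from X_tr only; score() reuses them to scale X_te
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

A pipeline like this can also be passed directly to GridSearchCV, in which case the scaler is refitted inside every cross-validation fold.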

In [None]:
# Create both a LogisticRegression and RF Classifier model
logreg = LogisticRegression()
rfmodel = RandomForestClassifier()

# Fit logreg and rf models to the scaled train data
logreg.fit(rescaledX_train, y_train)
rfmodel.fit(rescaledX_train, y_train)

Now that we've split our data, scaled it, and initialised our models, it's time to determine the hyperparameters for each model. We will do this using GridSearchCV, passing in a grid of hyperparameters to iterate through.

In [None]:
# Hyperparameter grid definitions
# We can take a look at the available hyperparameters for the logistic regression model
print(logreg.get_params().keys())
# Also the available hyperparameters for the RF model
print(rfmodel.get_params().keys())

# Now let's define the hyperparameters for our logistic regression model
# It is worth noting that for this model, not all solvers and penalty values are compatible. For example, "elasticnet" is only supported by the "saga" solver.
# This will lead to some NaN values for some combinations. We suppress warnings from sklearn for this.
filterwarnings("ignore")

logregParamGrid = {
    "tol": [0.1, 0.01, 0.001, 0.0001],
    "max_iter": [50, 100, 200, 300, 400, 500, 600, 1000, 1500],
    "penalty": ["none", "l1", "l2", "elasticnet"],
    "solver": ["lbfgs", "liblinear", "sag", "saga"],
    "C": [100, 10, 1.0, 0.1, 0.01]
}

# We also define a hyperparameter grid for our random forest classifier model
rfmodelParamGrid = {
    "max_depth": [4, 6, 8, 10, 12],
    "max_features": [4, 6, 8, 10, 12, "sqrt", "log2", "auto"],
    "n_estimators": [100, 200, 300, 400, 600, 800, 1000],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False]
}
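
An alternative to suppressing warnings is to pass GridSearchCV a list of smaller grids, each containing only solver and penalty pairs that work together. The sketch below uses synthetic data and a deliberately tiny grid; the pairs shown are ones I understand to be supported, but it is worth checking the LogisticRegression documentation for your installed version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the credit card dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=1)

# Each dict is searched separately, so incompatible pairs are never tried
compatible_grid = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1.0, 10]},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1.0, 10]},
]

search = GridSearchCV(LogisticRegression(max_iter=1000), compatible_grid, cv=5)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

Besides avoiding NaN scores, this keeps the number of fits down, since no time is spent on combinations that can only fail.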

In [None]:
# Find the best parameters for our logreg model
# Initialise GridSearchCV with the parameter grid that we previously created, using 5-fold cross validation
logreg_grid_model = GridSearchCV(estimator=logreg, param_grid=logregParamGrid, cv=5, verbose=1, n_jobs=-1)

# Fit the grid search on the scaled training data only, so that the test set
# stays unseen until the final evaluation
logreg_grid_model_result = logreg_grid_model.fit(rescaledX_train, y_train)

# Summarise results for the logreg model
logreg_best_score, logreg_best_params = logreg_grid_model_result.best_score_, logreg_grid_model_result.best_params_
print("Best: %f using %s for logreg model." % (logreg_best_score, logreg_best_params))

In [None]:
# Now we do the same as above but for our RF model
rf_grid_model = GridSearchCV(estimator=rfmodel, param_grid=rfmodelParamGrid, cv=5, verbose=1, n_jobs=-1)

# Fit the grid search on the scaled training data only, so that the test set
# stays unseen until the final evaluation
rf_grid_model_result = rf_grid_model.fit(rescaledX_train, y_train)

# Summarize results
rf_best_score, rf_best_params = rf_grid_model_result.best_score_, rf_grid_model_result.best_params_
print("Best: %f using %s for RF model" % (rf_best_score, rf_best_params))

Now we have the optimal hyperparameters for each model, and each grid search has refitted its model using them. We will use these fitted models to make predictions on our test data.

Before we make predictions, let's take a look at the feature importances as determined by our random forest model and display them as a bar graph. This is one of the most important figures in this entire notebook, showing the importance of each feature in the model.

In [None]:
# Note that the individual importances sum to 1
# Define our feature names and feature importance lists
feature_names = ["Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel", "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore", "Citizen", "Income"]
feature_importance = rfmodel.feature_importances_

# Create a dictionary containing our feature importances and the corresponding feature names
featurePlot = {
    "feature_names": feature_names,
    "feature_importance": feature_importance
}

featurePlotData = pd.DataFrame(featurePlot)

# Sort our feature importances in descending order
featurePlotData.sort_values(by=['feature_importance'], ascending=False, inplace=True)

# Set the size of our figure and the default seaborn blue colour hex code
colour = "#5975A4"
plt.figure(figsize=(10, 8))

# Plot our feature importances as a bar chart, using a single colour for all bars
sns.barplot(x=featurePlotData['feature_importance'], y=featurePlotData['feature_names'], color=colour)

plt.title("Random Forest Model Feature Importance")
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.show()

This is great to see, as it's what you would expect when assessing a credit card application. Whether or not someone has defaulted previously, the time they have been employed, their credit score, and their income/debt are clearly all critical factors in deciding whether or not to approve an application. This backs up some of the data we saw previously in our visualisations, where low credit scores and prior defaults led to rejections.

On the other hand, factors such as gender, citizenship and whether or not an applicant is a bank customer are not as relevant in determining the outcome of an application.

Something to look at in the future here is sklearn's Recursive Feature Elimination (RFE), which will allow us to find the most important features using cross-validation and then discard features that are not as important.
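
A rough sketch of what that could look like, using scikit-learn's cross-validated variant `RFECV` on synthetic data. The data, the estimator settings and the `roc_auc` scoring choice are all assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data with 13 features, 4 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=4, random_state=1
)

# Drop one feature per step and score each feature subset with 5-fold cross-validation
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=1),
    step=1,
    cv=5,
    scoring="roc_auc",
)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the original feature columns
print(selector.n_features_)
print(selector.support_)
```

The boolean mask in `support_` could then be used to index a list such as `feature_names` to see which features were kept.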

We will now make predictions using our fitted models and then look at the AUC (area under the ROC curve) for both. The higher the AUC, the better the model is at distinguishing between an approval and a rejection. A perfect model would have an AUC of 1, meaning it separates the two classes completely, while an AUC of 0.5 is no better than random guessing.
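
To make the AUC more concrete: it can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks this reading against `roc_auc_score` on four invented predictions.

```python
from sklearn.metrics import roc_auc_score

# Four invented examples: true labels and predicted scores for class 1
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, t in zip(y_score, y_true) if t == 1]
negatives = [s for s, t in zip(y_score, y_true) if t == 0]

# Count the pairs where the positive outranks the negative (ties count as half)
pairs = [(p, n) for p in positives for n in negatives]
manual_auc = sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)

sklearn_auc = roc_auc_score(y_true, y_score)
print(manual_auc, sklearn_auc)
```

Here three of the four pairs are ranked correctly, so both calculations give 0.75.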

In [None]:
# Use both models to predict instances from the test set
logreg_y_pred = logreg_grid_model.predict(rescaledX_test)
rf_y_pred = rf_grid_model.predict(rescaledX_test)

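# AUC for our logistic regression model
# Column 1 of predict_proba is the probability of class 1, which LabelEncoder assigned to "-" (rejection)
logreg_y_pred_proba = logreg_grid_model.predict_proba(rescaledX_test)[:, 1]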
logreg_fpr, logreg_tpr, _ = roc_curve(y_test, logreg_y_pred_proba)
logreg_auc = roc_auc_score(y_test, logreg_y_pred_proba)
plt.plot(logreg_fpr, logreg_tpr, label="Logreg Model, AUC = "+str(logreg_auc))
plt.title("Logistic Regression Model")
plt.legend(loc=4)
plt.show()

In [None]:
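# AUC for our RF model
# Column 1 of predict_proba is the probability of class 1, which LabelEncoder assigned to "-" (rejection)
rf_y_pred_proba = rf_grid_model.predict_proba(rescaledX_test)[:, 1]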
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_y_pred_proba)
rf_auc = roc_auc_score(y_test, rf_y_pred_proba)
plt.plot(rf_fpr, rf_tpr, label="RF Model, AUC = "+str(rf_auc))
plt.title("Random Forest Model")
plt.legend(loc=4)
plt.show()

This concludes our notebook. This isn't meant to be an exhaustive analysis, and as such there are a lot of extra things that could be added, including:

* A heatmap to visualise the accuracy of our predictions
* Comparing a model where the missing values are simply dropped to this model where we impute the values
* Other classification models such as XGBoost which could perform better than the two models in this example
* Further feature selection analysis, dropping features with low importance
* Creating a new feature from the existing data
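
As a starting point for the first item in the list above, the numbers behind such a heatmap can be produced with scikit-learn's `confusion_matrix`. The labels and predictions below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Invented true labels and predictions (0 and 1 are the two encoded classes)
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal
accuracy = cm.trace() / cm.sum()
print(cm)
print(accuracy)
```

The resulting array could then be passed to `sns.heatmap(cm, annot=True)` to draw it, with `y_test` and one of the prediction arrays from above in place of the invented lists.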