# Predicting Credit Card Risk 

This project involves building a machine learning model to predict the risk for credit card applicants. The primary challenge addressed in this project is the significant class imbalance, where instances of the positive class are much less frequent than instances of the negative class. The goal is to develop a model that can accurately identify high-risk customers despite this severe imbalance.

## 1. Load, Preview and Clean the Data

In [40]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [41]:
path = "/Users/amanserhan/predicting-credit-card-risk/credit_data.csv"
data = pd.read_csv(path)
data.head()

Unnamed: 0,ID,Gender,Has a car,Has a property,Children count,Income,Employment status,Education level,Marital status,Dwelling,Age,Employment length,Has a mobile phone,Has a work phone,Has a phone,Has an email,Job title,Family member count,Account age,Is high risk
0,5037048,M,Y,Y,0,135000.0,Working,Secondary / secondary special,Married,With parents,-16271,-3111,1,0,0,0,Core staff,2,-17,0
1,5044630,F,Y,N,1,135000.0,Commercial associate,Higher education,Single / not married,House / apartment,-10130,-1651,1,0,0,0,Accountants,2,-1,0
2,5079079,F,N,Y,2,180000.0,Commercial associate,Secondary / secondary special,Married,House / apartment,-12821,-5657,1,0,0,0,Laborers,4,-38,0
3,5112872,F,Y,Y,0,360000.0,Commercial associate,Higher education,Single / not married,House / apartment,-20929,-2046,1,0,0,1,Managers,1,-11,0
4,5105858,F,N,N,0,270000.0,Working,Secondary / secondary special,Separated,House / apartment,-16207,-515,1,0,1,0,,1,-41,0


In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36457 entries, 0 to 36456
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   36457 non-null  int64  
 1   Gender               36457 non-null  object 
 2   Has a car            36457 non-null  object 
 3   Has a property       36457 non-null  object 
 4   Children count       36457 non-null  int64  
 5   Income               36457 non-null  float64
 6   Employment status    36457 non-null  object 
 7   Education level      36457 non-null  object 
 8   Marital status       36457 non-null  object 
 9   Dwelling             36457 non-null  object 
 10  Age                  36457 non-null  int64  
 11  Employment length    36457 non-null  int64  
 12  Has a mobile phone   36457 non-null  int64  
 13  Has a work phone     36457 non-null  int64  
 14  Has a phone          36457 non-null  int64  
 15  Has an email         36457 non-null 

Let's encode the "object" data into integer labels in preparation for using them in the model. 

In [44]:
#Printing the unique values in the columns in question to view all the categories for encoding

employment_statuses = data['Employment status'].unique().tolist()
education_levels = data['Education level'].unique().tolist()
marital_statuses = data['Marital status'].unique().tolist()
dwellings = data['Dwelling'].unique().tolist()
job_titles = data['Job title'].unique().tolist()

print("Employment Statuses: ", employment_statuses)
print("Education Levels: ", education_levels)
print("Marital Statuses: ", marital_statuses)
print("Job Title: ", job_titles)

Employment Statuses:  ['Working', 'Commercial associate', 'Pensioner', 'State servant', 'Student']
Education Levels:  ['Secondary / secondary special', 'Higher education', 'Lower secondary', 'Incomplete higher', 'Academic degree']
Marital Statuses:  ['Married', 'Single / not married', 'Separated', 'Civil marriage', 'Widow']
Job Title:  ['Core staff', 'Accountants', 'Laborers', 'Managers', nan, 'Sales staff', 'Medicine staff', 'High skill tech staff', 'HR staff', 'Low-skill Laborers', 'Drivers', 'Secretaries', 'Cleaning staff', 'Cooking staff', 'Security staff', 'Private service staff', 'IT staff', 'Waiters/barmen staff', 'Realty agents']


In [45]:
#Enconding the binary columns as 0s and 1s
data['Gender'] = data['Gender'].map({'M': 1, 'F': 0})
data['Has a car'] = data['Has a car'].map({'Y': 1, 'N': 0})
data['Has a property'] = data['Has a property'].map({'Y': 1, 'N': 0})

# Changing the "Employment Status" coulmn to "Employed?"
data['Employed'] = data['Employment status'].map({
    'Working': 1, 
    'Commercial associate': 1, 
    'State servant': 1,
    'Pensioner': 0,
    'Student': 0
})
data = data.drop('Employment status', axis = 1)

# Changing the "Education Level" column to "Completed Post-Secondary Education?"
data['Completed Post-Secondary Education'] = data['Education level'].map({
    'Secondary / secondary special': 0,
    'Lower secondary': 0,
    'Incomplete higher': 0,
    'Higher education': 1,
    'Academic degree': 1
})
data = data.drop("Education level", axis = 1)

# Changing the "Marital Status" column to "Married?"
data['Married'] = data['Marital status'].map({
    'Single / not married': 0,
    'Separated': 0,
    'Widow': 0,
    'Married': 1,
    'Civil marriage': 1
})
data = data.drop("Marital status", axis = 1)

# Dropping the "Dwelling" column, since it is similar to the "Has a property" column in the context of this analysis
data = data.drop("Dwelling", axis = 1)

# Dropping the "Job Title" column, since the large number of nominal categories is not very useful for our model
data = data.drop("Job title", axis = 1)

# Dropping the "Account age" column, since the unit of measurement is unclear, which may negatively impact the accuracy of the model
data = data.drop("Account age", axis = 1)

types = {
    'Gender': "int64",
    'Has a car': "int64",
    'Has a property': "int64",
    'Employed': "int64",
    'Completed Post-Secondary Education': "int64",
    'Married': "int64"
}
data.astype(types)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36457 entries, 0 to 36456
Data columns (total 17 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   ID                                  36457 non-null  int64  
 1   Gender                              36457 non-null  int64  
 2   Has a car                           36457 non-null  int64  
 3   Has a property                      36457 non-null  int64  
 4   Children count                      36457 non-null  int64  
 5   Income                              36457 non-null  float64
 6   Age                                 36457 non-null  int64  
 7   Employment length                   36457 non-null  int64  
 8   Has a mobile phone                  36457 non-null  int64  
 9   Has a work phone                    36457 non-null  int64  
 10  Has a phone                         36457 non-null  int64  
 11  Has an email                        36457

According to the dataset documentation, the "Age", "Employment length", and "Account age" columns are counted backwards in days, which is why the numbers are very large and negative. Let's convert that to years, positive. 

In [47]:
data["Age"] = (np.abs(data["Age"])/365).astype("int64")

# 'Employment Length' must be handled differently since positive values exist, indicating unemployment
data['Employment length'] = np.where(
    data["Employment length"] >= 0, 
    0, 
    (np.abs(data["Employment length"]) / 365).astype("int64")
)

data.head()

Unnamed: 0,ID,Gender,Has a car,Has a property,Children count,Income,Age,Employment length,Has a mobile phone,Has a work phone,Has a phone,Has an email,Family member count,Is high risk,Employed,Completed Post-Secondary Education,Married
0,5037048,1,1,1,0,135000.0,44,8,1,0,0,0,2,0,1,0,1
1,5044630,0,1,0,1,135000.0,27,4,1,0,0,0,2,0,1,1,0
2,5079079,0,0,1,2,180000.0,35,15,1,0,0,0,4,0,1,0,1
3,5112872,0,1,1,0,360000.0,57,5,1,0,0,1,1,0,1,1,0
4,5105858,0,0,0,0,270000.0,44,1,1,0,1,0,1,0,1,0,0


## Analysis and Modelling

In [49]:
#Imports 
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

# Split features and target
target  = data['Is high risk']
features = data[['Gender', 'Has a car', 'Has a property', 'Children count', 'Income', 'Age', 
                 'Employment length', 'Has a mobile phone', 'Has a work phone', 'Has a phone', 
                 'Has an email', 'Family member count', 'Employed', 
                 'Completed Post-Secondary Education', 'Married']]

# Train Test Split
training_data, validation_data, training_labels, validation_labels = train_test_split(
    features, target, test_size=0.2, random_state=100
)

# Creating and training the Random Forest Classifier
classifier = RandomForestClassifier()
classifier.fit(training_data,training_labels)

score = classifier.score(validation_data,validation_labels)
print(score)

0.9825836533187055


The accuracy of the model is very high at 0.98. And while this could be a good sign, it could also be a sign of overfitting. One of the best things we could to evaluate this is to test again using a new dataset the model has not yet seen. If the accuracy drops on this new dataset, it's likely a sign of overfitting. But unfortunately, we don't have access to that. So instead, let's try some other steps to validate the performance of the model, starting with its confusion matrix. 

In [51]:
from sklearn.metrics import confusion_matrix

# Predict on the validation set
validation_predictions = classifier.predict(validation_data)

# Calculate the confusion matrix
conf_matrix = confusion_matrix(validation_labels, validation_predictions)
print(conf_matrix)
print(data['Is high risk'].value_counts())

[[7147   20]
 [ 107   18]]
Is high risk
0    35841
1      616
Name: count, dtype: int64


The number of positive cases in the data seems to be orders of magnitude lower than the negatives. Additionally, the number of false positives are more than five times the number of false negatives. This indicates the the model is biased towards identifying non high-risk cases, and for good reason. There are much more negative cases in the data than positive. There is a very pronounced class imbalance in the data. To examine its effects further, let's calculate the precision, recall, and f1-score.

In [53]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Calculate precision
precision = precision_score(validation_labels, validation_predictions)
print("Precision:", precision)

# Calculate recall (sensitivity)
recall = recall_score(validation_labels, validation_predictions)
print("Recall (Sensitivity):", recall)

# Calculate F1-score
f1 = f1_score(validation_labels, validation_predictions)
print("F1-Score:", f1)

Precision: 0.47368421052631576
Recall (Sensitivity): 0.144
F1-Score: 0.22085889570552147


The precision is pretty low, at around 0.4. Of all the instances the model predicted the classification to be positive, it was actually positive only 40% of the time. 

Recall is even lower, at around 0.14. Out of all the instances that are truly positive, the model only predicted 14% of them correctly.

Based on these results, we can conclude that our main issues are class imbalance and poor model performance on the minority class. The model is not effectively identifying the "high-risk" cases, which suggests the model might be over-reliant on predicting the majority class (non-high-risk) correctly.

High overall accuracy combined with low precision, recall, and F1-score for the minority class is a sign of a model that is biased towards the majority class rather than overfitting in the traditional sense.

Some of the solutions we can implement to mitigate this include resampling techniques, either oversampling the minority class or undersampling the majority class. A second option is adjusting the "class_weight" parameter in the model. Let's start with the second option.

In [55]:
# Creating a Random Forest Classifier with balanced class weights
classifier2 = RandomForestClassifier(class_weight='balanced')

# Train the classifier
classifier2.fit(training_data, training_labels)

# Evaluate the classifier
score2 = classifier2.score(validation_data, validation_labels)

# Generating predictions for the classifier
validation_predictions2 = classifier2.predict(validation_data)

# Calculate accuracy
print("Accuracy:", score2)

# Calculate precision
precision2 = precision_score(validation_labels, validation_predictions2)
print("Precision:", precision2)

# Calculate recall (sensitivity)
recall2 = recall_score(validation_labels, validation_predictions2)
print("Recall (Sensitivity):", recall2)

# Calculate F1-score
f12 = f1_score(validation_labels, validation_predictions2)
print("F1-Score:", f12)

Accuracy: 0.9591332967635765
Precision: 0.18315018315018314
Recall (Sensitivity): 0.4
F1-Score: 0.25125628140703515


The precision has dropped to less than half, which means the model now has more false positives. This is expected because the model is now predicting more "high-risk" cases due to the increased weight on the minority class. Recall improved substantially, meaning the model is now capturing more true "high-risk" cases. This is a positive outcome as it indicates the model is better at identifying the minority class. The F1-score improved slightly, indicating a better balance between precision and recall. 

In our case, it's important to improve the model's performance on the minority class because even though the "is high risk" positive outcome is generally less common, the model is not useful if it mislabels these few critical points as negative. This is why we must look at precision, recall and f1 more closely than accuracy, since it's pretty easy to be accurate when the overwhelming majority of points are in one class. There is still room for improvement, so let's try a different method, resampling techniques, and see if that improves our metrics.

In [57]:
from imblearn.over_sampling import ADASYN

# Instantiating ADASYN
adasyn = ADASYN(sampling_strategy="minority", random_state= 42, n_neighbors=15)

# Resampling the training data
training_data_resampled, training_labels_resampled = adasyn.fit_resample(training_data, training_labels)

# Creating a Random Forest Classifier for the resampled data
classifier3 = RandomForestClassifier()

# Train the classifier
classifier3.fit(training_data_resampled, training_labels_resampled)

# Evaluate the classifier
score3 = classifier3.score(validation_data, validation_labels)

# Generating predictions for the classifier
validation_predictions3 = classifier3.predict(validation_data)

# Calculate accuracy
print("Accuracy:", score3)

# Calculate precision
precision3 = precision_score(validation_labels, validation_predictions3)
print("Precision:", precision3)

# Calculate recall (sensitivity)
recall3 = recall_score(validation_labels, validation_predictions3)
print("Recall (Sensitivity):", recall3)

# Calculate F1-score
f13 = f1_score(validation_labels, validation_predictions3)
print("F1-Score:", f13)

Accuracy: 0.9561162918266594
Precision: 0.15789473684210525
Recall (Sensitivity): 0.36
F1-Score: 0.21951219512195122


The recall in this case is a improvement from the first model, but not as good as our results from setting class_weight='balanced' in our second model. Knowing this, let's go back to this strategy, this time balancing the individual bootstrap samples generated by the classifier as opposed to the dataset as a whole.

In [59]:
# Creating a Random Forest Classifier with balanced subsample weights
classifier4 = RandomForestClassifier(class_weight='balanced_subsample')

# Train the classifier
classifier4.fit(training_data, training_labels)

# Evaluate the classifier
score4 = classifier4.score(validation_data, validation_labels)

# Generating predictions for the classifier
validation_predictions4 = classifier4.predict(validation_data)

# Calculate accuracy
print("Accuracy:", score4)

# Calculate precision
precision4 = precision_score(validation_labels, validation_predictions4)
print("Precision:", precision4)

# Calculate recall (sensitivity)
recall4 = recall_score(validation_labels, validation_predictions4)
print("Recall (Sensitivity):", recall4)

# Calculate F1-score
f14 = f1_score(validation_labels, validation_predictions4)
print("F1-Score:", f14)

Accuracy: 0.9585847504114098
Precision: 0.1827956989247312
Recall (Sensitivity): 0.408
F1-Score: 0.2524752475247525


As expected, there's a slight improvement from our second model, which balanced the dataset as a whole before the random subsamples were taken. For an even more dramatic improvement to our recall, we can try a balanced random forest classifier from the imbalanced learn library.

In [61]:
from imblearn.ensemble import BalancedRandomForestClassifier

# Creating a Random Forest Classifier with balanced subsample weights
classifier5 = BalancedRandomForestClassifier(sampling_strategy = 'auto', replacement = False, bootstrap = True)

# Train the classifier
classifier5.fit(training_data, training_labels)

# Evaluate the classifier
score5 = classifier5.score(validation_data, validation_labels)

# Generating predictions for the classifier
validation_predictions5 = classifier5.predict(validation_data)

# Calculate accuracy
print("Accuracy:", score4)

# Calculate precision
precision5 = precision_score(validation_labels, validation_predictions5)
print("Precision:", precision5)

# Calculate recall (sensitivity)
recall5 = recall_score(validation_labels, validation_predictions5)
print("Recall (Sensitivity):", recall5)

# Calculate F1-score
f15 = f1_score(validation_labels, validation_predictions5)
print("F1-Score:", f15)

Accuracy: 0.9585847504114098
Precision: 0.03878406708595388
Recall (Sensitivity): 0.592
F1-Score: 0.07279881947860305


As expected, our recall increased quite a bit, with a substantial blow to our precision and f1 score. The very low precision indicates the model is predicted many false positives. The recall score means that around 61% of our positive cases are predicted correctly. For a more complete picture, let's also perform cross validation to infer how the model would perform on new unseen data. We will use stratified cross validation since the data has a class imbalance.

In [63]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Initialize StratifiedKFold
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
# Perform cross-validation
scores = cross_val_score(classifier4, training_data, y=training_labels, cv = stratified_kfold, scoring='accuracy')
scores.mean()

0.9621807008064451

A cross-validation score of 73% is not terrible, but there's still much room for improvement. Next steps for this project could be exploring other techniques for dealing with class imbalances, such as other resampling methods we have not tried (SMOTE, undersampling the majority class, etc), combining ensemble methods with resampling, and tuning some of the hyperparameters. But for now, we'll call it a wrap.