<a href="https://colab.research.google.com/github/blunte3/ML-AI/blob/main/DecisionTreeCreditCardApproval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I removed the EDA graphs from the previous logistic regression classifier (found in the GitHub repository) for ease of reading and to highlight the implementation of the decision tree and its bagged and boosted optimizations.

Data - https://www.kaggle.com/datasets/rikdifos/credit-card-approval-prediction

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Reading the CSV files and checking the head contents.

In [2]:
df = pd.read_csv("application_record.csv") #data frame for application record history
rf = pd.read_csv("credit_record.csv") #record frame which is labeled rf to represent the credit record data frame

In [3]:
df.head()

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0


In [4]:
rf.head()

Unnamed: 0,ID,MONTHS_BALANCE,STATUS
0,5001711,0,X
1,5001711,-1,0
2,5001711,-2,0
3,5001711,-3,0
4,5001712,0,C


Convert STATUS to numeric, then build one row per ID with an OPENED_TIME column (how many months the account has been open) and an OVERDUE column that records whether the user has ever been 60 or more days overdue (STATUS between 2 and 5) since opening the account.

In [5]:
rf['STATUS'] = pd.to_numeric(rf['STATUS'], errors='coerce')
new_rf = rf.groupby('ID')['MONTHS_BALANCE'].min().abs().reset_index(name='OPENED_TIME')
new_rf['OVERDUE'] = new_rf['ID'].map(rf.groupby('ID')['STATUS'].max().between(2, 5).replace({True: 'Yes', False: 'No'}))
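
The aggregation above is easier to follow on a few rows. This is a minimal sketch on a toy credit record; the IDs and statuses are made up for illustration, and `np.where` stands in for the notebook's `replace` call.

```python
import numpy as np
import pandas as pd

# Toy credit record (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1, 0],
    "STATUS": ["X", "0", "2", "C", "1", "X"],
})

# 'X' and 'C' are not numeric, so errors='coerce' turns them into NaN
toy["STATUS"] = pd.to_numeric(toy["STATUS"], errors="coerce")

# Account age in months: the most negative MONTHS_BALANCE per ID
summary = (
    toy.groupby("ID")["MONTHS_BALANCE"].min().abs().reset_index(name="OPENED_TIME")
)

# Worst status per ID; max() skips NaN, and an all-NaN group stays NaN,
# which between(2, 5) evaluates as False
worst = toy.groupby("ID")["STATUS"].max()
summary["OVERDUE"] = np.where(summary["ID"].map(worst.between(2, 5)), "Yes", "No")
print(summary)
```

Only ID 1 has a status in the 2 to 5 range, so it is the only one flagged as overdue. ID 3 has only 'X' entries and ends up as 'No'.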

Merge the datasets to have one useful dataset

In [6]:
df = pd.merge(df, new_rf[['ID', 'OPENED_TIME', 'OVERDUE']], on='ID', how='inner')
df.head()

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,OPENED_TIME,OVERDUE
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,15,No
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,14,No
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0,29,No
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0,4,No
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0,26,No


Convert the OVERDUE Yes and No to numerical representations

In [7]:
df['OVERDUE'] = df['OVERDUE'].map({'Yes': 1, 'No': 0})

Using summary statistics to inspect the data

In [8]:
df.drop(columns=['ID']).describe()

Unnamed: 0,CNT_CHILDREN,AMT_INCOME_TOTAL,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,OPENED_TIME,OVERDUE
count,36457.0,36457.0,36457.0,36457.0,36457.0,36457.0,36457.0,36457.0,36457.0,36457.0,36457.0
mean,0.430315,186685.7,-15975.173382,59262.935568,1.0,0.225526,0.294813,0.089722,2.198453,26.164193,0.016897
std,0.742367,101789.2,4200.549944,137651.334859,0.0,0.417934,0.455965,0.285787,0.911686,16.501854,0.128886
min,0.0,27000.0,-25152.0,-15713.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
25%,0.0,121500.0,-19438.0,-3153.0,1.0,0.0,0.0,0.0,2.0,12.0,0.0
50%,0.0,157500.0,-15563.0,-1552.0,1.0,0.0,0.0,0.0,2.0,24.0,0.0
75%,1.0,225000.0,-12462.0,-408.0,1.0,0.0,1.0,0.0,3.0,39.0,0.0
max,19.0,1575000.0,-7489.0,365243.0,1.0,1.0,1.0,1.0,20.0,60.0,1.0


In [9]:
# is the data imbalanced?
df.OVERDUE.value_counts(normalize=True) #normalizes the result

0    0.983103
1    0.016897
Name: OVERDUE, dtype: float64

In [10]:
df.head()

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,OPENED_TIME,OVERDUE
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,15,0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,14,0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0,29,0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0,4,0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0,26,0


Dropping unnecessary columns that are not relevant to the task at hand. The main reason is to avoid overfitting with too many features; the secondary reason is that some columns carry no useful information or have missing values (for example, FLAG_MOBIL is constant and OCCUPATION_TYPE has empty entries). I chose this over regularization because the data does not appear to need a high-degree polynomial fit, as a linear decision boundary would be the best for the model.

In [11]:
df = df.drop(columns=['CODE_GENDER', 'CNT_CHILDREN', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS'])
df.head()

Unnamed: 0,ID,FLAG_OWN_CAR,FLAG_OWN_REALTY,AMT_INCOME_TOTAL,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,OPENED_TIME,OVERDUE
0,5008804,Y,Y,427500.0,Rented apartment,-12005,-4542,15,0
1,5008805,Y,Y,427500.0,Rented apartment,-12005,-4542,14,0
2,5008806,Y,Y,112500.0,House / apartment,-21474,-1134,29,0
3,5008808,N,Y,270000.0,House / apartment,-19110,-3051,4,0
4,5008809,N,Y,270000.0,House / apartment,-19110,-3051,26,0


In [12]:
#checking unique values to see what to represent numerically
unique_values = df['FLAG_OWN_CAR'].unique()
print(unique_values)

['Y' 'N']


In [13]:
#Converting the negative values to positive with abs() and changing Y and N values to 1's and 0's
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].abs()
df['DAYS_BIRTH'] = df['DAYS_BIRTH'].abs()
df['FLAG_OWN_REALTY'] = df['FLAG_OWN_REALTY'].str.strip().map({'Y': 1, 'N': 0})
df['FLAG_OWN_CAR'] = df['FLAG_OWN_CAR'].str.strip().map({'Y': 1, 'N': 0})
df.head()

Unnamed: 0,ID,FLAG_OWN_CAR,FLAG_OWN_REALTY,AMT_INCOME_TOTAL,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,OPENED_TIME,OVERDUE
0,5008804,1,1,427500.0,Rented apartment,12005,4542,15,0
1,5008805,1,1,427500.0,Rented apartment,12005,4542,14,0
2,5008806,1,1,112500.0,House / apartment,21474,1134,29,0
3,5008808,0,1,270000.0,House / apartment,19110,3051,4,0
4,5008809,0,1,270000.0,House / apartment,19110,3051,26,0


In [14]:
#finding unique values of housing type to change to numerical representation
unique_values = df['NAME_HOUSING_TYPE'].unique()
print(unique_values)

['Rented apartment' 'House / apartment' 'Municipal apartment'
 'With parents' 'Co-op apartment' 'Office apartment']


In [15]:
#converting housing type info to numerical representation
df['NAME_HOUSING_TYPE'] = df['NAME_HOUSING_TYPE'].str.strip().map({'Rented apartment': 0, 'House / apartment': 1, 'Municipal apartment': 2, 'With parents': 3, 'Co-op apartment': 4, 'Office apartment': 5})
df.head()

Unnamed: 0,ID,FLAG_OWN_CAR,FLAG_OWN_REALTY,AMT_INCOME_TOTAL,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,OPENED_TIME,OVERDUE
0,5008804,1,1,427500.0,0,12005,4542,15,0
1,5008805,1,1,427500.0,0,12005,4542,14,0
2,5008806,1,1,112500.0,1,21474,1134,29,0
3,5008808,0,1,270000.0,1,19110,3051,4,0
4,5008809,0,1,270000.0,1,19110,3051,26,0
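
The integer codes above impose an arbitrary order on the housing types. Tree models can usually work around that with extra splits, but one-hot encoding is a common alternative. This is a minimal sketch on a toy frame using `pd.get_dummies`; the `HOUSING` prefix and the toy rows are illustrative assumptions, and this is not what the notebook itself does.

```python
import pandas as pd

# Toy column with three of the housing categories (hypothetical rows)
toy = pd.DataFrame({
    "NAME_HOUSING_TYPE": [
        "Rented apartment",
        "House / apartment",
        "With parents",
        "House / apartment",
    ]
})

# One indicator column per category instead of a single ordered integer code
encoded = pd.get_dummies(toy, columns=["NAME_HOUSING_TYPE"], prefix="HOUSING", dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so no category is treated as larger or smaller than another.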


In [16]:
#visualizing the main housing types
value_counts = df['NAME_HOUSING_TYPE'].value_counts(normalize=True) * 100
print(value_counts)

1    89.277779
3     4.871492
2     3.094056
0     1.577201
5     0.718655
4     0.460817
Name: NAME_HOUSING_TYPE, dtype: float64


In [17]:
#dropping the outliers that make up < 1% of the total dataset
df = df[~df['NAME_HOUSING_TYPE'].isin([4,5])]
df.head()

Unnamed: 0,ID,FLAG_OWN_CAR,FLAG_OWN_REALTY,AMT_INCOME_TOTAL,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,OPENED_TIME,OVERDUE
0,5008804,1,1,427500.0,0,12005,4542,15,0
1,5008805,1,1,427500.0,0,12005,4542,14,0
2,5008806,1,1,112500.0,1,21474,1134,29,0
3,5008808,0,1,270000.0,1,19110,3051,4,0
4,5008809,0,1,270000.0,1,19110,3051,26,0


In [18]:
df.head()

Unnamed: 0,ID,FLAG_OWN_CAR,FLAG_OWN_REALTY,AMT_INCOME_TOTAL,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,OPENED_TIME,OVERDUE
0,5008804,1,1,427500.0,0,12005,4542,15,0
1,5008805,1,1,427500.0,0,12005,4542,14,0
2,5008806,1,1,112500.0,1,21474,1134,29,0
3,5008808,0,1,270000.0,1,19110,3051,4,0
4,5008809,0,1,270000.0,1,19110,3051,26,0


Creating the approval formula that defines the y vector. The data did not come with a label, so I formulated my own approval rule using online resources on credit approval. It requires a minimum total income, car and realty ownership, one of three specific housing types, being old enough to have experience with credit, a job held long enough to support monthly payments, an account that has been open for more than 3 months, and never having been 60 or more days overdue on a payment.

In [19]:
from sklearn.model_selection import train_test_split
features = ['FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'AMT_INCOME_TOTAL', 'NAME_HOUSING_TYPE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'OPENED_TIME', 'OVERDUE']
def determine_approval(row):
    # Approve only when every rule holds for this applicant
    conditions = (
        (row['AMT_INCOME_TOTAL'] > 50000) and
        (row['FLAG_OWN_CAR'] == 1) and
        (row['NAME_HOUSING_TYPE'] in (0, 1, 3)) and  # rented, house / apartment, or with parents
        (row['FLAG_OWN_REALTY'] == 1) and
        (row['DAYS_BIRTH'] > 12000) and
        (row['DAYS_EMPLOYED'] > 1000) and
        (row['OPENED_TIME'] > 3) and
        (row['OVERDUE'] == 0)
    )
    return 1 if conditions else 0

df['APPROVAL'] = df.apply(determine_approval, axis=1)

df = df[features + ['APPROVAL']]
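
The row-wise rule above can also be written as a vectorized boolean mask, which avoids calling a Python function once per row. This is a sketch on three toy rows with the same thresholds; the rows are illustrative, and the notebook itself uses `df.apply`.

```python
import pandas as pd

# Toy rows with already-encoded features (hypothetical values)
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [427500.0, 270000.0, 40000.0],
    "FLAG_OWN_CAR": [1, 0, 1],
    "FLAG_OWN_REALTY": [1, 1, 1],
    "NAME_HOUSING_TYPE": [0, 1, 1],
    "DAYS_BIRTH": [12005, 19110, 15000],
    "DAYS_EMPLOYED": [4542, 3051, 2000],
    "OPENED_TIME": [15, 4, 10],
    "OVERDUE": [0, 0, 0],
})

# Element-wise AND of all approval conditions, evaluated column by column
approved = (
    (toy["AMT_INCOME_TOTAL"] > 50000)
    & (toy["FLAG_OWN_CAR"] == 1)
    & toy["NAME_HOUSING_TYPE"].isin([0, 1, 3])
    & (toy["FLAG_OWN_REALTY"] == 1)
    & (toy["DAYS_BIRTH"] > 12000)
    & (toy["DAYS_EMPLOYED"] > 1000)
    & (toy["OPENED_TIME"] > 3)
    & (toy["OVERDUE"] == 0)
)
toy["APPROVAL"] = approved.astype(int)
print(toy["APPROVAL"].tolist())
```

The first row passes every rule, the second fails on car ownership, and the third fails on income.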


Standardize and normalize the data to put the features on a comparable scale and ease model computation

In [20]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
df = df.copy()

column_to_standardize = 'AMT_INCOME_TOTAL'

# Standardize income to zero mean and unit variance
scaler_standardize = StandardScaler()
df[column_to_standardize] = scaler_standardize.fit_transform(df[column_to_standardize].values.reshape(-1, 1))

# Min-max scale every feature to [0, 1]; the 'APPROVAL' label is excluded ('ID' was already dropped)
columns_to_normalize = [col for col in df.columns if col not in ('ID', 'APPROVAL')]
scaler_normalize = MinMaxScaler()
df[columns_to_normalize] = scaler_normalize.fit_transform(df[columns_to_normalize])


In [21]:
approval_counts = df['APPROVAL'].value_counts()
print(approval_counts)

0.0    30963
1.0     5064
Name: APPROVAL, dtype: int64


In [22]:
df.head()

Unnamed: 0,FLAG_OWN_CAR,FLAG_OWN_REALTY,AMT_INCOME_TOTAL,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,OPENED_TIME,OVERDUE,APPROVAL
0,1.0,1.0,0.258721,0.0,0.25585,0.01239,0.25,0.0,1.0
1,1.0,1.0,0.258721,0.0,0.25585,0.01239,0.233333,0.0,1.0
2,1.0,1.0,0.055233,0.333333,0.792306,0.003058,0.483333,0.0,1.0
3,0.0,1.0,0.156977,0.333333,0.658376,0.008307,0.066667,0.0,0.0
4,0.0,1.0,0.156977,0.333333,0.658376,0.008307,0.433333,0.0,0.0
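
The OPENED_TIME values in the table above can be checked by hand. Min-max scaling maps x to (x - min) / (max - min), and `describe()` reported a minimum of 0 and a maximum of 60 for OPENED_TIME, so 15 becomes 15 / 60 = 0.25. The sketch below assumes those extremes are still present after the housing-type filter, which is consistent with the printed 0.25.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Column extremes (0 and 60) plus two values from the first rows of the table
opened = np.array([[0.0], [14.0], [15.0], [60.0]])

scaled = MinMaxScaler().fit_transform(opened).ravel()

# Same formula written out by hand: (x - min) / (max - min)
manual = (opened.ravel() - 0.0) / (60.0 - 0.0)
print(scaled)
```

Both routes give 0.25 for 15 months and about 0.2333 for 14 months, matching the first two rows of the table.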


In [23]:
#Splitting the data set into train and test sets with X and y
X_train, X_test, y_train, y_test = train_test_split(df[features], df['APPROVAL'], test_size=0.2, random_state=42)
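
Approvals are the minority class (5064 of 36027 rows, about 14%), so a plain random split can shift the class ratio slightly between train and test. Passing `stratify` to `train_test_split` keeps the ratio equal in both parts. This is a hedged sketch on synthetic labels with the same rough ratio; the notebook's own split above does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with 14% positives (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 140 + [0] * 860)

# stratify=y preserves the positive rate in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())
```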

Task 1: Decision Tree Classifier for Credit Card Approval Prediction and changing input parameter.

In [33]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [34]:
clf = DecisionTreeClassifier(random_state=0, max_depth=None)
clf.fit(X_train, y_train)

In [35]:
y_pred_dt = clf.predict(X_test)

In [36]:
print("Accuracy:", accuracy_score(y_test, y_pred_dt))

Accuracy: 0.9998612267554815


In [37]:
clf = DecisionTreeClassifier(random_state=0, max_depth=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Score the depth-1 predictions (y_pred), not the earlier y_pred_dt
print("Accuracy:", accuracy_score(y_test, y_pred))


In this cell I changed the max_depth input parameter to 1, and one of the effects is decreased accuracy. Limiting the depth to 1 means the tree can split on only one feature before it must make a decision, so it cannot represent the combination of conditions behind the approval label. This makes the model less complex and less prone to overfitting, but it now underfits: it is not complex enough to fit either the training data or the test data well.
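
The underfitting effect described above can be reproduced on synthetic data. In this sketch the label is 1 only when three binary features are all 1, a small stand-in for a rule that combines several conditions. The data and names are illustrative assumptions, not the credit dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: label is the AND of three binary features
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 3))
y = X.all(axis=1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Compare a depth-1 stump against an unrestricted tree on the same split
results = {}
for depth in (1, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = accuracy_score(y_te, model.predict(X_te))
print(results)
```

The stump can test only one feature, so it cannot express the conjunction and scores below the unrestricted tree, which recovers the rule exactly.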

In [31]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Random Forest Bagging Ensemble Method
rf = RandomForestClassifier()
rf_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')
print("RF Bagging Cross-Validation Scores:", rf_scores)
print("RF Bagging Mean Accuracy:", rf_scores.mean())

# XGBoost Boosting Ensemble Method
model = XGBClassifier()
xgb_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print("XGBoost Boosting Cross-Validation Scores:", xgb_scores)
print("XGBoost Boosting Mean Accuracy:", xgb_scores.mean())

RF Bagging Cross-Validation Scores: [1.         0.99965302 1.         1.         1.        ]
RF Bagging Mean Accuracy: 0.9999306037473976
XGBoost Boosting Cross-Validation Scores: [1.         0.99947953 0.99930604 1.         1.        ]
XGBoost Boosting Mean Accuracy: 0.9997571131158918


The cross-validation results show strikingly similar performance for Random Forest (bagging) and XGBoost (boosting). Both models achieved near-perfect accuracy on every fold, with minimal variability, and their mean accuracies are very close, which suggests comparable generalization across subsets of the training data. XGBoost dips slightly in two folds, but overall both models are highly consistent. Within this training data, both ensemble methods are highly effective for the task; the main difference lies in how they improve on the basic tree (bagging versus boosting), not in the results.
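
The fold-to-fold variability mentioned above can be quantified from the printed scores. This is a small sketch using the fold accuracies exactly as printed (rounded to eight decimals). The population standard deviation from NumPy is an assumed choice of spread measure.

```python
import numpy as np

# Fold accuracies copied from the printed cross-validation output
rf_folds = np.array([1.0, 0.99965302, 1.0, 1.0, 1.0])
xgb_folds = np.array([1.0, 0.99947953, 0.99930604, 1.0, 1.0])

for name, folds in (("RF", rf_folds), ("XGBoost", xgb_folds)):
    # Mean accuracy and spread across the five folds
    print(f"{name}: mean={folds.mean():.6f}, std={folds.std():.6f}")

# Difference between the two mean accuracies
gap = rf_folds.mean() - xgb_folds.mean()
print(f"gap={gap:.6f}")
```

The gap between the two means is on the order of 0.0002, so on this label the two methods are hard to separate by accuracy alone.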

To check consistency, I retrain all three models and evaluate them with both accuracy and F1 score.

In [39]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Define models
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
xgb = XGBClassifier()

# Define metrics
scoring = {'accuracy': make_scorer(accuracy_score), 'f1': make_scorer(f1_score)}

# Decision Tree Classifier
dt_scores = cross_validate(dt, X_train, y_train, cv=5, scoring=scoring)
dt_accuracy_mean = dt_scores['test_accuracy'].mean()
dt_f1_mean = dt_scores['test_f1'].mean()
print("Decision Tree - Mean Accuracy:", dt_accuracy_mean)
print("Decision Tree - Mean F1 Score:", dt_f1_mean)

# Random Forest Bagging Ensemble Method
rf_scores = cross_validate(rf, X_train, y_train, cv=5, scoring=scoring)
rf_accuracy_mean = rf_scores['test_accuracy'].mean()
rf_f1_mean = rf_scores['test_f1'].mean()
print("RF Bagging - Mean Accuracy:", rf_accuracy_mean)
print("RF Bagging - Mean F1 Score:", rf_f1_mean)

# XGBoost Boosting Ensemble Method
xgb_scores = cross_validate(xgb, X_train, y_train, cv=5, scoring=scoring)
xgb_accuracy_mean = xgb_scores['test_accuracy'].mean()
xgb_f1_mean = xgb_scores['test_f1'].mean()
print("XGBoost Boosting - Mean Accuracy:", xgb_accuracy_mean)
print("XGBoost Boosting - Mean F1 Score:", xgb_f1_mean)


Decision Tree - Mean Accuracy: 0.9999306097661533
Decision Tree - Mean F1 Score: 0.9997521684302658
RF Bagging - Mean Accuracy: 0.9999653018736989
RF Bagging - Mean F1 Score: 0.9998760074395536
XGBoost Boosting - Mean Accuracy: 0.9997571131158918
XGBoost Boosting - Mean F1 Score: 0.9991317444030281


To compare the effectiveness of the three models (Decision Tree Classifier, Random Forest bagging, and XGBoost boosting), I use accuracy and F1 score. Accuracy measures the overall share of correct approve or deny predictions. F1 score, the harmonic mean of precision and recall, is particularly suited to credit card approval prediction because the approval and denial classes are typically imbalanced. Here precision is the share of predicted approvals that are truly creditworthy, and recall is the share of creditworthy applicants the model actually approves, so high recall means few false denials. When both matter, F1 gives a more nuanced evaluation. Using precision or recall alone could skew the evaluation toward minimizing only false positives or only false negatives without considering the trade-off. Scoring the models on precision and recall separately would expose that trade-off and show why a combined metric like F1 is valuable in credit-related applications.
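
The relationship between these metrics can be seen on a tiny hand-checkable example. This sketch uses made-up labels (4 approvals, 6 denials) and scikit-learn's standard metric functions. It is an illustration only and does not use the notebook's models.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels: 1 = approve, 0 = deny (hypothetical values)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Confusion counts here: TP = 2, FN = 2, FP = 1, TN = 5
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 4 / 7

print(accuracy, precision, recall, f1)
```

Accuracy is 0.7 even though half of the true approvals are missed. F1 is about 0.57, which reflects that weakness more directly.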