<a href="https://colab.research.google.com/github/derewor/Credit-Card-Approval/blob/main/src/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many are rejected for reasons such as high loan balances, low income, or too many inquiries on an applicant's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

In [94]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

In [5]:
# Load the dataset and view the top few rows.
cc_apps = pd.read_csv("/content/cc_approvals.data", header=None)
cc_apps.head()

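|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |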


In [7]:
# Count the observations in each of the two target classes.
cc_apps[13].value_counts().reset_index()

|   | 13 | count |
|---|---|---|
| 0 | - | 383 |
| 1 | + | 307 |
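
The two counts above show a mildly imbalanced target. As a quick worked check (a sketch that uses only those two counts, not a cell from the original notebook), the accuracy of always predicting the majority class gives a floor that any model should beat:

```python
# Counts taken from the value_counts() output above.
counts = {"-": 383, "+": 307}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class ("-") would be
# right this fraction of the time.
majority_baseline = max(shares.values())

print(total)                        # 690
print(round(shares["+"], 3))        # 0.445
print(round(majority_baseline, 3))  # 0.555
```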


In [9]:
df_copy = cc_apps.copy()

In [10]:
# Recode marker symbols in every column (not only the target):
# '+' -> 1, '-' -> 0, and the missing-value marker '?' -> the string '99'.
def bool_to_int(df_copy):
    # Iterate over each column in the DataFrame
    for col in df_copy.columns:
        df_copy[col] = df_copy[col].apply(
            lambda val: 1 if val == '+' else 0 if val == '-' else '99' if val == '?' else val
        )
    return df_copy
df_transformed = bool_to_int(df_copy)
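
The function above turns every `?` into the string `'99'`, so missing numeric entries end up being treated as the value 99 later on. One possible alternative, sketched here on a small toy frame (the values, and the choice of median and most-frequent imputation, are assumptions rather than the notebook's method), is to mark `?` as missing and impute each column explicitly:

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the raw data (hypothetical values).
toy = pd.DataFrame({
    0: ["b", "a", "?", "b"],
    1: ["30.83", "?", "24.50", "27.83"],
    13: ["+", "-", "+", "-"],
})

# Mark '?' as missing rather than as the literal string '99'.
clean = toy.replace("?", np.nan)

# Encode only the target column: '+' -> 1, '-' -> 0.
clean[13] = clean[13].map({"+": 1, "-": 0})

# Numeric column: parse to float, then fill gaps with the median.
clean[1] = pd.to_numeric(clean[1])
clean[1] = clean[1].fillna(clean[1].median())

# Categorical column: fill gaps with the most frequent category.
clean[0] = clean[0].fillna(clean[0].mode()[0])

print(clean)
```

Whether imputation helps on this dataset would need to be checked against the results reported below.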

In [16]:
# view the top few rows of the transformed dataframe
df_transformed.head()

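|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | 1 |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | 1 |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | 1 |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | 1 |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | 1 |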


In [15]:
# Count the unique values in each non-numerical column.
df_transformed[[0,3,4,5,6,8,9,11]].nunique()

| Column | Unique values |
|---|---|
| 0 | 3 |
| 3 | 4 |
| 4 | 4 |
| 5 | 15 |
| 6 | 10 |
| 8 | 2 |
| 9 | 2 |
| 11 | 3 |


In [84]:
# One-hot encode the categorical variables.
encoded_features = pd.get_dummies(df_transformed[[0,3,4,5,6,8,9,11]]).astype(int)
encoded_features.head()

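|   | 0_99 | 0_a | 0_b | 3_99 | 3_l | 3_u | 3_y | 4_99 | 4_g | 4_gg | ... | 6_o | 6_v | 6_z | 8_f | 8_t | 9_f | 9_t | 11_g | 11_p | 11_s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |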


In [85]:
# Drop the categorical and target columns, then concatenate the remaining
# numeric columns with the one-hot encoded features.
num_features = pd.concat([df_transformed.drop(columns=[0, 3, 4, 5, 6, 8, 9, 11, 13]), encoded_features], axis=1)
target = df_transformed[13]


In [86]:
# The column names are a mix of integers and strings; scikit-learn needs them
# all to be strings. Note that the helper below also casts every *value* to str,
# which is why info() reports object dtypes; StandardScaler parses them back to
# floats later.
def conversion(num_features):
    for col in num_features.columns:
        if num_features[col].dtype != 'str':
            num_features[col] = num_features[col].astype('str')
    return num_features
num_feature = conversion(num_features)
num_feature.columns = num_feature.columns.astype(str)
num_feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 48 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       690 non-null    object
 1   2       690 non-null    object
 2   7       690 non-null    object
 3   10      690 non-null    object
 4   12      690 non-null    object
 5   0_99    690 non-null    object
 6   0_a     690 non-null    object
 7   0_b     690 non-null    object
 8   3_99    690 non-null    object
 9   3_l     690 non-null    object
 10  3_u     690 non-null    object
 11  3_y     690 non-null    object
 12  4_99    690 non-null    object
 13  4_g     690 non-null    object
 14  4_gg    690 non-null    object
 15  4_p     690 non-null    object
 16  5_99    690 non-null    object
 17  5_aa    690 non-null    object
 18  5_c     690 non-null    object
 19  5_cc    690 non-null    object
 20  5_d     690 non-null    object
 21  5_e     690 non-null    object
 22  5_ff    690 non-null    ob

In [87]:
# View the top few rows of the concatenated DataFrame.
num_feature.head()

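|   | 1 | 2 | 7 | 10 | 12 | 0_99 | 0_a | 0_b | 3_99 | 3_l | ... | 6_o | 6_v | 6_z | 8_f | 8_t | 9_f | 9_t | 11_g | 11_p | 11_s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.83 | 0.0 | 1.25 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 58.67 | 4.46 | 3.04 | 6 | 560 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 24.5 | 0.5 | 1.5 | 0 | 824 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 3 | 27.83 | 1.54 | 3.75 | 5 | 3 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 4 | 20.17 | 5.625 | 1.71 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |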


In [88]:
# Standardize every feature column to zero mean and unit variance
# (this rescales the one-hot columns as well as the numerical ones).
scaler = StandardScaler()
scaled_data = scaler.fit_transform(num_feature)
scaled_data = pd.DataFrame(scaled_data)
scaled_data.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.129422 | -0.956613 | -0.291083 | -0.288101 | -0.195413 | -0.133038 | -0.661438 | 0.688737 | -0.093659 | -0.053916 | ... | -0.053916 | 0.854004 | -0.108306 | -0.95465 | 0.95465 | -1.157144 | 1.157144 | 0.32249 | -0.108306 | -0.300079 |
| 1 | 1.756139 | -0.060051 | 0.24419 | 0.74083 | -0.087852 | -0.133038 | 1.511858 | -1.451933 | -0.093659 | -0.053916 | ... | -0.053916 | -1.170954 | -0.108306 | -0.95465 | 0.95465 | -1.157144 | 1.157144 | 0.32249 | -0.108306 | -0.300079 |
| 2 | -0.558144 | -0.856102 | -0.216324 | -0.493887 | -0.037144 | -0.133038 | 1.511858 | -1.451933 | -0.093659 | -0.053916 | ... | -0.053916 | -1.170954 | -0.108306 | -0.95465 | 0.95465 | 0.864196 | -0.864196 | 0.32249 | -0.108306 | -0.300079 |
| 3 | -0.332608 | -0.647038 | 0.456505 | 0.535044 | -0.194837 | -0.133038 | -0.661438 | 0.688737 | -0.093659 | -0.053916 | ... | -0.053916 | 0.854004 | -0.108306 | -0.95465 | 0.95465 | -1.157144 | 1.157144 | 0.32249 | -0.108306 | -0.300079 |
| 4 | -0.851408 | 0.174141 | -0.153526 | -0.493887 | -0.195413 | -0.133038 | -0.661438 | 0.688737 | -0.093659 | -0.053916 | ... | -0.053916 | 0.854004 | -0.108306 | -0.95465 | 0.95465 | 0.864196 | -0.864196 | -3.100868 | -0.108306 | 3.332456 |
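
The cell above standardizes the full dataset before it is split, so the test rows contribute to the scaler's mean and variance. One common alternative, sketched here on synthetic data (the arrays and variable names below are hypothetical, not from the notebook), is to split first and let a pipeline fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (hypothetical data).
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y = (X[:, 0] + X[:, 1] > 10.0).astype(int)

# Split first, so the test rows never influence the scaler's statistics.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits StandardScaler on X_tr only, then reuses those
# training means and variances to transform X_te at predict time.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))  # True
print(pipe.score(X_te, y_te))
```

On a dataset of this size the effect on the reported accuracy may be small, but the pipeline keeps the evaluation free of test-set information.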


In [101]:
# split the dataframe into train and test
X_train, X_test, y_train, y_test = train_test_split(scaled_data, target, test_size=0.2, random_state=42)

In [102]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(552, 48)
(138, 48)
(552,)
(138,)


In [103]:
# Fit a logistic regression baseline and evaluate it on the test set.
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
best_score = accuracy_score(y_test, y_pred)
print(best_score)
print(confusion_matrix(y_test,y_pred))

0.8188405797101449
[[56 12]
 [13 57]]
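
The accuracy printed above can be recovered from the confusion matrix by hand. A short worked check, assuming scikit-learn's convention that rows are true labels and columns are predicted labels in the order `[0, 1]`:

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = np.array([[56, 12],
               [13, 57]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of predicted approvals, the share that were correct
recall = tp / (tp + fn)     # of actual approvals, the share that were found

print(round(accuracy, 4))   # 0.8188
print(round(precision, 3))  # 0.826
print(round(recall, 3))     # 0.814
```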


In [104]:
# Import the random forest classifier from scikit-learn's ensemble module.
from sklearn.ensemble import RandomForestClassifier

In [105]:
# Check whether a random forest improves on the logistic regression baseline.
model1 = RandomForestClassifier(max_depth=30, random_state=42, n_estimators=300)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
best_score1 = accuracy_score(y_test, y_pred1)
report1 = classification_report(y_test, y_pred1)  # per-class precision, recall, F1
print(best_score1)
print(report1)

0.8623188405797102
              precision    recall  f1-score   support

           0       0.85      0.88      0.86        68
           1       0.88      0.84      0.86        70

    accuracy                           0.86       138
   macro avg       0.86      0.86      0.86       138
weighted avg       0.86      0.86      0.86       138
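
`GridSearchCV` is imported at the top of the notebook but never used. A minimal sketch of how it could tune the random forest, assuming synthetic stand-in data and an illustrative parameter grid (neither comes from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical); in the notebook, X_train and
# y_train would take the place of X_tr and y_tr.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical grid; the values are illustrative, not tuned recommendations.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)        # mean cross-validated accuracy on the training folds
print(search.score(X_te, y_te))  # accuracy of the refit best model on held-out data
```

Selecting hyperparameters by cross-validation on the training set keeps the test set untouched until the final comparison.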



## Credit Card Approval Using Machine Learning

Evaluating thousands of credit card applications is a complex task that requires careful analysis of each applicant's financial history. Fortunately, machine learning offers a powerful and efficient solution to support decision-making in the banking sector.

I recently completed a real-world project on DataCamp focused on predicting credit card approvals using classification models. The goal was to classify each application as approved or rejected based on various financial indicators.

## Key Insights:

* Removing categorical variables with more than 10 unique values helped improve model performance.

* As expected, the Random Forest Classifier outperformed Logistic Regression on the 138-application test set, with 86.2% accuracy versus 81.9%.
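
The first insight refers to dropping high-cardinality categorical columns before encoding, a step the notebook cells above do not show. A minimal sketch of how that filter could look, using a toy frame and a small threshold (both hypothetical):

```python
import pandas as pd

# Toy categorical frame (hypothetical values); in the notebook this would be
# df_transformed[[0, 3, 4, 5, 6, 8, 9, 11]].
cats = pd.DataFrame({
    "few": ["a", "b", "a", "b", "a", "b"],
    "many": ["u", "v", "w", "x", "y", "z"],
})

MAX_LEVELS = 3  # hypothetical threshold for this toy data; the insight above uses 10

# Keep only the columns whose number of distinct values is within the threshold.
n_levels = cats.nunique()
keep = n_levels[n_levels <= MAX_LEVELS].index
encoded = pd.get_dummies(cats[keep]).astype(int)

print(list(encoded.columns))  # ['few_a', 'few_b']
```

With the unique-value counts shown earlier, a threshold of 10 would drop only column 5 (15 distinct values); column 6 has exactly 10 and would be kept.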

This project reinforced the value of machine learning in automating high-stakes financial decisions with speed, reliability, and scalability. Looking forward to applying these skills to real-world business challenges!

#MachineLearning #CreditScoring #DataScience #LogisticRegression #RandomForest #FinancialTech #BankingInnovation