# **Credit Card Approval Model**

For this project, I decided to analyze data regarding credit card applications. When reviewing credit card applications, many factors are considered, such as gender, years employed, etc., to check whether the applicant is reliable and non-risky for the bank to approve a credit card. Historically, I have relied on credit score as the primary factor I evaluate to decide whether or not an applicant is trustworthy enough for loans, mortgages, and even credit card approvals (Wagner, 2004).

Thus, I aim to compare credit score with debt history, age, and income to analyze the relationship between these quantitative variables and their influence on my decision to approve or reject a credit card application. I created a knn neighbors algorithm to predict how new observations would be classified based on data from these quantitative factors. I will use the dataset called “Credit Approval” from the UC Irvine Machine Learning Repository. The dataset is multivariate and contains a mix of real, integer, and categorical values. I will use the four quantitative listed variables as predictors in my model and approval, a categorical variable, as my predicted value.

Before beginning, we imported all the packages I thought would be relevant to our analysis.  Next, I transferred the data from the web into a Google spreadsheet and, using the read_csv function, loaded the data into our workspace with the name data. 


In [None]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [2]:
# load data from the original source on the web 
data = pd.read_csv('https://docs.google.com/spreadsheets/d/18Uxxd5YAfGLn4qWe5crLplKU19zPXPziaomAZLWpDyI/gviz/tq?tqx=out:csv&sheet=clean_dataset')

# wrangling and cleaning the data from it's original (downloaded) format to the format necessary for our analysis
data = data.drop('DriversLicense', axis=1)
data = data.drop('Gender', axis=1)
data = data.drop('Married', axis=1)
data = data.drop('BankCustomer', axis=1)
data = data.drop('Industry', axis=1)
data = data.drop('Ethnicity', axis=1)
data = data.drop('YearsEmployed', axis=1)
data = data.drop('PriorDefault', axis=1)
data = data.drop('Employed', axis=1)
data = data.drop('Citizen', axis=1)
data = data.drop('ZipCode', axis=1)
data["Approved"] = data["Approved"].replace({
     0 : "No",
     1 : "Yes"
 })

data

Unnamed: 0,Age,Debt,CreditScore,Income,Approved
0,30.83,0.000,1,0,Yes
1,58.67,4.460,6,560,Yes
2,24.50,0.500,0,824,Yes
3,27.83,1.540,5,3,Yes
4,20.17,5.625,0,0,Yes
...,...,...,...,...,...
685,21.08,10.085,0,0,No
686,22.67,0.750,2,394,No
687,25.25,13.500,1,1,No
688,17.92,0.205,0,750,No


Table 1. The dataset has been cleaned and wrangled to have only wanted columns. 

Our data was clean before I imported it. Still, because it contained variables we are not interested in working with, to wrangle our data, we dropped all columns except “Debt,” “Income,” “Age,” and “CreditScore.” These columns were selected because I believed that these columns would be the most crucial components when deciding on an applicant’s acceptance of a credit card. When making approvals, credit card companies are concerned with whether applicants will be able to repay the company or not. I felt that the variables that would best represent this quality were debt, income, age, and credit score. The credit score is also especially important because it examines the customer’s history of making payments on time.

In [3]:
data_train, data_test = train_test_split(data, test_size=0.25, random_state=123)
data_train

Unnamed: 0,Age,Debt,CreditScore,Income,Approved
618,29.58,4.750,1,68,No
121,25.67,12.500,67,258,Yes
352,22.50,11.500,0,4000,No
210,39.33,5.875,14,0,Yes
299,22.17,12.125,2,173,No
...,...,...,...,...,...
98,22.50,11.000,0,0,No
322,33.67,0.375,0,44,Yes
382,24.33,2.500,0,456,No
365,42.83,1.250,1,112,No


Table 2. Training dataset

To conduct preliminary summary statistics on our data, I used the info() function to see the number of columns, column names, non-null count, and Dtype in our dataset, naming this data_summary_info. Next, we used the describe() function to see the count, mean, std, and other summary statistics of the columns in our data, naming this data_summary_description. Lastly, I created a data frame called data_summary_means to store our mean values for the number of “yes” and “no” values in the “approved” column. I stored these mean values in columns named “approved” and “not approved” for the mean of “yes” and “no” values, respectively, in our new data frame. 

In [4]:
data_summary_info = data_train.info()
data_summary_info

<class 'pandas.core.frame.DataFrame'>
Index: 517 entries, 618 to 510
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          517 non-null    float64
 1   Debt         517 non-null    float64
 2   CreditScore  517 non-null    int64  
 3   Income       517 non-null    int64  
 4   Approved     517 non-null    object 
dtypes: float64(2), int64(2), object(1)
memory usage: 24.2+ KB


Table 3. Data summary information.

In [5]:
data_summary_description = data_train.describe()
data_summary_description

Unnamed: 0,Age,Debt,CreditScore,Income
count,517.0,517.0,517.0,517.0
mean,31.53911,4.573714,2.537718,884.789168
std,11.737083,4.889537,5.190651,3395.134308
min,13.75,0.0,0.0,0.0
25%,22.92,0.875,0.0,0.0
50%,28.46,2.71,0.0,4.0
75%,37.33,6.665,3.0,458.0
max,80.25,28.0,67.0,50000.0


Table 4. Data summary description.

In [6]:
data_summary_means = pd.DataFrame()
data_summary_means['Approved'] = (data[data['Approved'] == 'Yes'].mean(numeric_only=True))
data_summary_means['Not Approved'] = (data[data['Approved'] == 'No'].mean(numeric_only=True))
data_summary_means

Unnamed: 0,Approved,Not Approved
Age,33.686221,29.773029
Debt,5.904951,3.839948
CreditScore,4.605863,0.631854
Income,2038.859935,198.605744


Table 5. Data summary means.

I used the altair package to create all of our visualizations, using the mark_point() function for scatter plots and mark_bar() for histograms.

To explore the relationships in our preliminary data, I first created histograms, faceting whether the observation was approved or not. This allowed us to understand the distribution of the data and which values of our variables had a lot of approvals or non-approvals. Looking at these histograms, credit score had the biggest impact on approvals. Keeping this in mind, I focused our other visualizations, mainly scatter plots, on the relationships between Credit Score and our other predictive variables (debt, income, and age). 

In [7]:
# data summarization

data_vis_cs = alt.Chart(data_train).mark_bar().encode(
    x = alt.X('CreditScore').scale(domain=[0,40],clamp= True).bin(maxbins=30),
    y = alt.Y('count()'),
    color = alt.Color('Approved')
).properties(
    height=100
).facet(
    "Approved:N",)
data_vis_cs

Figure 1. Number of records approved and not approved for credit card based on applicant's credit score.

In [8]:
data_vis_i = alt.Chart(data_train).mark_bar().encode(
    x = alt.X('Income').scale(clamp= True).bin(maxbins=20),
    y = alt.Y('count()'),
    color = alt.Color('Approved')
).properties(
    height=100
).facet(
    "Approved:N",)
data_vis_i

Figure 2. Number of records approved and not approved for credit card based on applicant's income.

In [9]:
data_vis_age = alt.Chart(data_train).mark_bar().encode(
    x = alt.X('Age').scale(clamp= True).bin(maxbins=45),
    y = alt.Y('count()'),
    color = alt.Color('Approved')
).properties(
    height=100
).facet(
    "Approved:N",)
data_vis_age

Figure 3. Number of records approved and not approved for credit card based on applicant's age.

In [10]:
data_vis_d = alt.Chart(data_train).mark_bar().encode(
    x = alt.X('Debt').scale(clamp= True).bin(maxbins=30),
    y = alt.Y('count()'),
    color = alt.Color('Approved')
).properties(
    height=100
).facet(
    "Approved:N",)
data_vis_d

Figure 4. Number of records approved and not approved for credit card based on applicant's amount of debt.

In [11]:
# Exploratory data analysis 
# Data Vizualisation (Preliminary)

Therefore, I created scatter plot visualizations using the altair package, creating graphs comparing “Income vs Credit Score,” “Debt vs Credit Score”, and “Age vs Credit Score.” I used colour encoding to visualize approved applications vs unapproved and the clamp function to ensure the scale of our graphs highlighted the data well. This combined with the data summary work (such as examining means and so on), gave us a preliminary sense of the data layout and some basic relationships between our variables of interest. 


In [12]:
#Debt vs Credit Score
scatterplot_debt_creditscore = alt.Chart(data_train, title = "Debt vs Credit Score").mark_point().encode(
    y=alt.Y("Debt").title("Debt").scale(domain=[0,40],clamp=True),
    x=alt.X("CreditScore").scale(domain=[0,20],clamp= True),
    color=alt.Color("Approved")
)
scatterplot_debt_creditscore

Figure 5. A scatter plot that shows a relationship between debt and credit score and its correlation to credit card approcal using training data. 

In [13]:
#Income vs Credit Score
scatterplot_income = alt.Chart(data_train, title = "Income vs Credit Score").mark_point().encode(
    x=alt.X("CreditScore").title("Credit Score").scale(domain=[0,20],clamp=True),
    y=alt.Y("Income").scale(domain=[0,10000],clamp= True),
    color=alt.Color("Approved")
)
scatterplot_income

Figure 6. A scatter plot that shows a relationship between income and credit score and its correlation to credit card approcal using training data. 

In [14]:
#Age vs Credit Score
scatterplot_age = alt.Chart(data_train, title = "Age vs Credit Score").mark_point().encode(
    x=alt.X("CreditScore").title("Credit Score").scale(domain=[0,20],clamp=True),
    y=alt.Y("Age").scale(domain=[0,90],clamp= True),
    color=alt.Color("Approved")
)
scatterplot_age

Figure 7. A scatter plot that shows a relationship between age and credit score and its correlation to credit card approcal using training data. 

I used knn-neighbors data analysis. To choose the number of neighbors we looked at a combination of accuracy and standard error for a range of different values of k. We then tested our model with values of 7 such as 7,11,15, 25, and so on. 

In [15]:
# knn classifier
# Model: knn Neighbours Classification
knn = KNeighborsClassifier(n_neighbors=15)
# we increased k from 7, to 11, to 15, to 25, choosing k=15 with the best 
# the accuracy of the "No" column increased with every increase in k, but the "Yes" column either decreased in accuracy or didnt change

# create the preprocessor
preprocessor = make_column_transformer(
    (StandardScaler(), ["Age", "Debt","Income","CreditScore"]),
    remainder='passthrough'
)
data_fit = preprocessor.fit(data_train)
data_fit

In [16]:
# Make pipeline and fit it to our data
knn_fit = make_pipeline(preprocessor, knn).fit(
    X=data_train.drop(columns=['Approved']), 
    y=data_train["Approved"]
)

knn_fit

In [17]:
data_test_predictions = data_test.assign(
    predicted = knn_fit.predict(data_test[["Age", "Debt","Income", "Approved","CreditScore"]])
)
data_test_predictions[["Age", "Debt","Income","CreditScore", "Approved", 'predicted']]

Unnamed: 0,Age,Debt,Income,CreditScore,Approved,predicted
399,31.00,2.085,0,0,No,No
250,40.25,21.500,1200,11,Yes,Yes
396,29.83,2.040,1,0,No,No
192,41.75,0.960,600,0,Yes,No
602,29.83,1.250,0,0,No,No
...,...,...,...,...,...,...
100,37.50,1.750,400,0,No,No
572,21.92,0.540,59,1,Yes,No
101,35.25,16.500,0,0,No,No
195,28.25,5.040,7,8,Yes,Yes


Table 6. This dataset is a testing dataset with predicted column added.

In [18]:
data_preds = data_test_predictions[
    data_test_predictions['Approved'] == data_test_predictions['predicted']
]

data_preds.shape[0] / data_test_predictions.shape[0]

0.7572254335260116

In [19]:
pd.crosstab(
    data_test_predictions["Approved"],
    data_test_predictions["predicted"]
)

predicted,No,Yes
Approved,Unnamed: 1_level_1,Unnamed: 2_level_1
No,90,5
Yes,37,41


Table 7. Estimated accuracy of the classifier using crosstab function.

In [20]:
approval_count_data = alt.Chart(data_train, title="Number of approvals").mark_bar().encode(
    x="Approved",
    y="count()",
    color="Approved"
)
approval_count_data

Figure 8. A bar graph that shows the number of approved and not approved applicant for credit card.

I found that our model was very good at predicting “No”, but struggled with predicting “Yes” accurately ( a type II error). This may be linked to the fact that our dataset contained more Nos and Yeses, as shown by Figure 8. In future, a focus on reducing type II errors would improve our model. 

Next, I created a preprocessor using the make_column-transformer. Within this function, we used StandardScaler() on our quantitative predictors: “age,” “debt,” income,” and “credit score” to make sure our predictors had a similar scale and, thus, could not unjustly influence our classification model. I used remainder = “passthrough” to allow our “approved” data to remain in the preprocessor unscaled. We used the make_pipeline() function on our preprocessor and knn, which was fit with x equal to all columns except approved and y equal to the approved column from our data, naming the result knn_fit.  The x and y just specify our predictor values (x) and what our predicted value (y) will be for this model. 

In [21]:
data_pipe = make_pipeline(preprocessor, knn)

X=data_train.drop(columns=['Approved'])
y=data_train["Approved"]

cv_5_df = pd.DataFrame(
    cross_validate(
        estimator=data_pipe,
        cv=5,
        X=X,
        y=y
    )
)

cv_5_df

Unnamed: 0,fit_time,score_time,test_score
0,0.002305,0.003632,0.836538
1,0.002209,0.003778,0.701923
2,0.002244,0.003481,0.757282
3,0.002006,0.002959,0.757282
4,0.00189,0.002967,0.699029


Table 8. Cross validation of the classifier using cross_validate function.

I cross-validated our classifier to estimate the accuracy of the classifier, which had a mean test score of 75 percent and a standard error of 2.5 percent. These values indicate that our classifier is considerably reliable.

In [22]:
cv_5_metrics = cv_5_df.agg(['mean', 'sem'])
cv_5_metrics

Unnamed: 0,fit_time,score_time,test_score
mean,0.002131,0.003363,0.750411
sem,7.8e-05,0.00017,0.025004


Table 9. Mean and standard error values from the cross validation.

In [23]:
knn_best_k = KNeighborsClassifier()
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 4),
}
parameter_grid

{'kneighborsclassifier__n_neighbors': range(1, 100, 4)}

In [24]:
data_tune_grid = GridSearchCV(
    estimator=data_pipe,
    param_grid=parameter_grid,
    cv=5
)
data_tune_grid

In [25]:
accuracies_grid = pd.DataFrame(
    data_tune_grid.fit(
        data_train[["Age", "Debt","Income","CreditScore"]],
        data_train["Approved"]
    ).cv_results_
)
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "std_test_score"
    ]]
    .assign(sem_test_score=accuracies_grid["std_test_score"] / 10**(1/2))
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
    .drop(columns=["std_test_score"])
)
accuracies_grid

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,0.680695,0.011028
1,5,0.719436,0.012293
2,9,0.736875,0.015468
3,13,0.744604,0.015378
4,17,0.7427,0.016288
5,21,0.748506,0.015391
6,25,0.752427,0.016217
7,29,0.746583,0.015437
8,33,0.746583,0.01471
9,37,0.750429,0.014847


Table 10. Mean and standard error values for each K-values that increases by 4.

I made a table of Mean and standard error values for each K-values that increases by 4 to make a plot of estimated accurcy of a classifier versus different K-values..

In [26]:
accuracy_vs_k = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(domain=(0.65, 0.80))
        .title("Accuracy estimate")
)

accuracy_vs_k

Figure 9. The graph shows the accuracy estimates as K-value increases. 

Although the K-value of 25 was the most accurate, as shown onf Figure 9, we decided to keep 15 because it had the best combination of a lower standard error and high accuracy. 

I then created a dataframe, named new_observation_1, where: age = 33, debt = 6, creditscore = 5, and income = 2039 and aother observation, names new_observation_2, with: age = 29, debt = 4, creditscore = 1, and income = 198. These numbers are based on the mean values I found for these columns earlier. Finally, I use the predict() function with the knn_fit object to predict new_observation. I assign this step to the name prediction and run “prediction” independently to see how our model classified our new observation. 

In [27]:
# Add 2 new observations based on the data summary means above, 1 where we expect approval, 1 where we don't
# Observation 1: Using Approved data means
new_observation_1 = pd.DataFrame({"Age": [33], "Debt": [6], "CreditScore": [5], "Income":[2039]})

# Prediction 1
prediction_1 = knn_fit.predict(new_observation_1)
prediction_1


array(['Yes'], dtype=object)

In [28]:
# Observation 2: Using Not Approved data means
new_observation_2 = pd.DataFrame({"Age": [29], "Debt": [4], "CreditScore": [1], "Income":[198]})


# Prediction 2
prediction_2 = knn_fit.predict(new_observation_2)
prediction_2

array(['No'], dtype=object)

In [30]:
new_obs_1 = pd.DataFrame(new_observation_1)
new_obs_2 = pd.DataFrame(new_observation_2)
new_obs = pd.concat([new_obs_1, new_obs_2], ignore_index=True)
new_obs


Unnamed: 0,Age,Debt,CreditScore,Income
0,33,6,5,2039
1,29,4,1,198


Table 11. New observation 1 and 2.

I used the model that I created to make predictions for a few observations I created, new_observation_1 and new_observations_2. In our final visualization, I used our preprocessed data to create scatterplots. I then added black circles using the mark_point() graph, using the & operator to combine this new point with the scatterplots. 


In [None]:


debt_pred = scatterplot_debt_creditscore + (
    # Standardize the new data point with transformer fitted on the original data
    alt.Chart(new_obs)
    .mark_point(size=80, color='black', clip=True).encode(
        y=alt.Y("Debt").title("Debt").scale(domain=[0,40],clamp=True),
    x=alt.X("CreditScore").scale(domain=[0,20],clamp= True),
    )
)
age_pred = scatterplot_age + (
    alt.Chart(new_obs)
    .mark_point(size=80, color='black', clip=True).encode(
        x=alt.X("CreditScore").scale(domain=[0,20],clamp=True),
        y=alt.Y("Age").scale(domain=[0,90],clamp= True),
    )
)
income_pred = scatterplot_income + (
    alt.Chart(new_obs)
    .mark_point(size=80, color='black', clip=True).encode(
        x=alt.X("CreditScore").title("Credit Score").scale(domain=[0,20],clamp=True),
        y=alt.Y("Income").scale(domain=[0,10000],clamp= True),
    )
)
final_plot = debt_pred&age_pred&income_pred
final_plot

Figure 10. These scatter plots shows the relationship between credit score and debt, age, and income and the correlation with credit card approval. The two new observations are added to the plot as black circles. 

This allowed me to visualize the nearest neighbors for these observations in two dimensions, and understand why my model was making the predictions it was. The model predicted observation 1 would be approved, while 2 would not. The visualizations help understand this; observation 1 was surrounded by orange (approved) dots on all of our graphs, while 2’s nearest neighbors were typically blue (unapproved). 

From all the visualizations, we have noticed that credit score and income strongly correlate with credit card approval. However, age and debt did not show a strong correlation as expected, as we did not see a big difference in approval rates across ages.

The analysis focused on credit card approval, exploring the relationships between key variables (debt, income, age, credit score) and approval status. I employed a KNN-Neighbors algorithm to predict approvals based on these variables. Noteworthy findings include robust correlations between credit score and income with approval rates. Surprisingly, age exhibited a smaller impact than expected, and the average age difference between approved and non-approved applications was relatively modest.

My expectations were generally met, as the analysis uncovered correlations aligning with anticipated trends. The strong influence of credit score and income on approval aligned with expectations. However, the smaller-than-expected impact of age was a notable deviation. This finding suggests that age might not be as influential in credit card approval decisions as initially hypothesized.

These findings hold significance in several aspects. Understanding the factors influencing approval in a society increasingly reliant on credit cards in the digital economy is crucial. I can leverage these insights to refine approval criteria, potentially leading to more accurate risk assessments. Moreover, recognizing the potential biases and disparities in approval processes highlights the need for ethical considerations in credit decision algorithms. Addressing these issues could contribute to fairer and more inclusive financial systems.

The current findings open avenues for future exploration:
1. Model Generalization: How well does the KNN-Neighbors model generalize to new, unseen data? Further evaluation on external datasets is essential to assess the model's reliability in real-world scenarios.

2. Additional Features: Could the inclusion of additional features beyond the current set (employment history, education, etc.) improve prediction accuracy? Investigating additional factors may enhance the model's overall performance.

In conclusion, while the analysis provided valuable insights into credit card approval dynamics, there is room for further exploration to refine models, address ethical considerations, and enhance my understanding of the complex factors influencing financial decisions.

References:

Quinlan,J. R.. Credit Approval. UCI Machine Learning Repository. https://doi.org/10.24432/C5FS30.
Surekha, M., Umesh, U., & Dhinakaran, D. P. (2022). A study on utilization and convenient of credit card. Journal of Positive School Psychology, 5635-5645.

Wagner, H. (2004). The use of credit scoring in the mortgage industry. Journal of Financial Services Marketing, 9(2), 179–183. https://doi.org/10.1057/palgrave.fsm.4770151