Introduction:
-

Credit card approval depends on an applicant meeting specific requirements. Although employment and income might be considered the most influential factors, other personal factors might influence approval in unexpected ways. This project evaluates the effects of personal factors on the approval of credit card applications. Specifically, our research question is: “What will be the approval status of an individual’s credit card application based on their personal information?” To answer this question, we used the “Credit Card Approvals” dataset, owned by Samuel Cortinhas, which contains the personal information of individuals who submitted credit card applications. The variables used in the dataset are Gender, Age, Debt, Ethnicity, Prior Default, Credit Score, Income, and Employment. This project explores the relationships between these variables and Approval Status.



Expected outcomes and significance:
-

We expect that high debt will decrease the chance of approval and that low or no debt will increase it. This would imply that individuals with high debt are less likely to pay their credit card fees, and thus banks are less likely to approve their applications. We also expect that older individuals will have a greater chance of being approved for a credit card than younger applicants.

These findings could inform credit card applicants of their chances of receiving approval and of the qualities that banks look for. For instance, individuals with high debt would learn that their chances of approval are lower than those of other applicants, which could encourage them to pay off their debts before applying for a credit card.


Data Source:
-

https://www.kaggle.com/datasets/samuelcortinhas/credit-card-approval-clean-data 

First, we installed and loaded the required packages, including `repr`, `tidyverse`, `tidymodels`, and `GGally`, in our Jupyter notebook. The additional package `GGally` provides `ggpairs()`, which displays the pairwise relationships between our chosen variables in a single grid.

In [1]:
# Install kknn and GGally only if they are not already available
for (pkg in c("kknn", "GGally")) {
    if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(kknn)
library(repr)
library(tidyverse)
library(tidymodels)
library(GGally)



We assigned the URL of the raw data file for our dataset, “Credit Card Approvals (Clean Data)”, to the object `url`, then read in the data and assigned it to `credit_data`.

In [None]:
url <- "https://raw.githubusercontent.com/bcruz29/DSCI-100-Group-Project/main/clean_dataset.csv"

credit_data <- read.csv(url)

We selected the variables to be considered for training and visualization, recoded the values of `Approved` to “True” and “False”, and checked for missing data, of which there was none.

In [None]:
# Keep only the variables used for training and visualization
credit_data <- credit_data |>
    select(Gender, Age, Debt, PriorDefault, Employed, CreditScore, Income, Approved)

# Gender: Male = 1, Female = 0
credit_data <- credit_data |>
    mutate(Gender = as.numeric(Gender), PriorDefault = as.numeric(PriorDefault),
           Employed = as.numeric(Employed), Approved = as.factor(Approved))

# Recode the response so that 1 becomes "True" and 0 becomes "False"
credit_data <- credit_data |>
    mutate(Approved = fct_recode(Approved, "True" = "1", "False" = "0"))

# Count missing values across the whole data frame
missing <- sum(is.na(credit_data))
missing

head(credit_data)
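
A single total from `sum(is.na(...))` does not show which columns are affected when values are missing. As a minimal sketch, assuming a small made-up data frame (`toy_data` and its values are hypothetical, not taken from the credit dataset), a per-column count can be obtained with base R:

```r
# Hypothetical toy data; the values are illustrative only.
toy_data <- data.frame(
  Age    = c(23, NA, 41, 35),
  Debt   = c(0, 1.5, NA, NA),
  Income = c(500, 0, 1200, 300)
)

# Total number of missing cells, as in the check above.
total_missing <- sum(is.na(toy_data))

# Number of missing values in each column.
missing_by_column <- colSums(is.na(toy_data))

total_missing
missing_by_column
```

For the project data both results are zero, so the per-column view mainly matters if the same check is reused on a dataset that has not been cleaned.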

Here we used `ggpairs()` to plot every variable against every other variable in an organized grid. We also added a legend and mapped the `Approved` variable to the `color` aesthetic to help us see how each variable relates to approval status.

In [None]:
options(repr.plot.width = 40, repr.plot.height = 15)  # plot size in the notebook

credit_pairs <- credit_data |> 
    select(Gender:Approved) |>
    ggpairs(legend = 1, aes(color = Approved, alpha = 0.05)) +
    labs(fill = "Approved") +
    theme(text = element_text(size = 20)) +
    ggtitle("Figure 1: GGpairs for different variables")
credit_pairs

The resulting `ggpairs()` plot compares the relationships between the variables, with text sizes and labels adjusted for readability.

Here we split the data into two sets: a training set (75% of the original data set) and a testing set (the remaining 25%). We assigned the training set to the object `credit_train` and the testing set to the object `credit_test`.

In [None]:
set.seed(1)  # Don't Change

credit_split <- initial_split(credit_data, prop = 0.75, strata = Approved)  
credit_train <- training(credit_split)   
credit_test <- testing(credit_split)

We computed the means of the numeric variables as a statistical summary, then grouped the training data by the `Approved` variable to check whether the numbers of approved and not-approved observations were roughly balanced.

In [None]:
mean_table <- credit_train |>
    summarize(mean_age = mean(Age),
              mean_debt = mean(Debt),
              mean_income = mean(Income),
              mean_credit_score = mean(CreditScore))

observation_table <- credit_train |>
    group_by(Approved) |>
    count()

mean_table
observation_table

A recipe and a model specification were created as the first steps of the classification. The model was set to K-nearest neighbors classification, and the predictors were standardized so that variables measured on different scales are comparable.

In [None]:
#Preprocessing Data for Recipe and Spec

knn_recipe <- recipe(Approved ~ . , data = credit_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")
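
To illustrate why standardization matters for a distance-based classifier, here is a minimal base R sketch. The data frame `toy` and its values are hypothetical, and `scale()` is used here as a stand-in for the recipe steps above, on the assumption that scaling and centering together amount to standardizing each predictor.

```r
# Hypothetical applicants; values are illustrative only.
toy <- data.frame(
  Age    = c(20, 50, 21),
  Income = c(1000, 1010, 3000)
)

# Euclidean distances on the raw scale: Income dominates because its
# numeric range is much larger than that of Age.
raw_dist <- as.matrix(dist(toy))

# Standardize each column to mean 0 and standard deviation 1,
# then recompute the distances.
toy_scaled <- scale(toy)
scaled_dist <- as.matrix(dist(toy_scaled))

raw_dist[1, 2]     # applicant 1 vs 2: 30 years apart, similar income
raw_dist[1, 3]     # applicant 1 vs 3: similar age, very different income
scaled_dist[1, 2]
scaled_dist[1, 3]
```

On the raw scale, applicant 2 looks far closer to applicant 1 than applicant 3 does, purely because of the units of Income. After standardization the two distances are of similar size, so both variables contribute to the choice of nearest neighbors.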

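A workflow was created to combine the model and the recipe. It is not fitted at this stage because `neighbors` is still marked with `tune()`; the fitting happens during cross-validation.

In [None]:
# Unfitted workflow; tune_grid() fits it on each resample
k_fit <- workflow() |>
      add_recipe(knn_recipe) |>
      add_model(knn_spec)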

We prepared 5-fold cross-validation, stratified by `Approved`.

In [None]:
k_vfold <- vfold_cv(credit_train, v = 5, strata = Approved)

# 5 fold cross validation 
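
The idea behind stratified folds can be sketched in base R. This is only an illustration under stated assumptions: the labels in `approved` are hypothetical, and the loop below is a simplified version of the concept, not the algorithm that `vfold_cv()` uses internally.

```r
set.seed(1)  # for a reproducible fold assignment

# Hypothetical labels: 15 "True" and 10 "False" observations.
approved <- factor(c(rep("True", 15), rep("False", 10)))

# Assign fold numbers 1..5 within each class so that every fold keeps
# the same class balance (the idea behind `strata = Approved`).
fold <- integer(length(approved))
for (level in levels(approved)) {
  idx <- which(approved == level)
  fold[idx] <- sample(rep(1:5, length.out = length(idx)))
}

# Each row is a class, each column a fold.
fold_table <- table(approved, fold)
fold_table
```

Every fold ends up with three "True" and two "False" observations, so each validation fold mirrors the class balance of the full training set.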

The `seq()` function was used to generate candidate numbers of neighbors from 1 to 50 in steps of 3.


In [None]:
k_vals <- tibble(neighbors = seq(from = 1, to = 50, by = 3))

# Checking all K neighbors from 1 to 50 by 3
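
As a quick check of what this grid contains, the same `seq()` call can be inspected on its own (the name `k_candidates` is introduced here only for illustration):

```r
# Same sequence as in the grid above, stored as a plain vector.
k_candidates <- seq(from = 1, to = 50, by = 3)

k_candidates
length(k_candidates)  # 17 candidate values
max(k_candidates)     # 49: the last value that does not exceed 50
```

Because the step is 3, the grid ends at 49 rather than 50, and K = 50 itself is never evaluated.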

We ran the cross-validation over the candidate values and collected the resulting metrics.

In [None]:
knn_results <- k_fit |>
    tune_grid(resamples = k_vfold, grid = k_vals) |>
    collect_metrics()

In [None]:
knn_results

Accuracy estimates are plotted on the vertical axis and the number of neighbors on the horizontal axis to find the number of neighbors with the highest accuracy.

In [None]:
accuracies <- knn_results |>
    filter(.metric == "accuracy")

cross_val_plot <- accuracies |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() + 
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate")

In [None]:
cross_val_plot

# Plotting the accuracy estimate for each K value

We chose K = 49 because it has the highest accuracy, changing K by a small amount does not change the accuracy much, and the cost of training the model is not high.
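
The best K can also be read off programmatically rather than from the plot. The sketch below uses base R on a hypothetical results table: the accuracies in `toy_accuracies` are made up for illustration and are not the values obtained in this project.

```r
# Hypothetical cross-validation results (made-up accuracies).
toy_accuracies <- data.frame(
  neighbors = c(1, 4, 7, 10, 13),
  mean      = c(0.78, 0.82, 0.85, 0.84, 0.85)
)

# Row with the highest mean accuracy. which.max() returns the first
# maximum, so ties resolve to the smaller K.
best_row <- toy_accuracies[which.max(toy_accuracies$mean), ]
best_k <- best_row$neighbors
best_k
```

With the `accuracies` tibble from the notebook, a similar selection could likely be written with `slice_max(mean, n = 1)` from dplyr; how ties should be broken is a design choice, and it remains worth checking the plot for stability around the chosen value.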

In [None]:
knn_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 49) |>
    set_engine("kknn") |>
    set_mode("classification")

# Use the best number of neighbors found above

The classifier is trained again using the best number of neighbors found in the previous steps.


In [None]:
k_best_fit <- workflow() |>
      add_recipe(knn_recipe) |>
      add_model(knn_best_spec) |>
    fit(data = credit_train)

The trained classifier is used to predict the test data set. `k_metrics` holds the accuracy of these predictions.


In [None]:
k_predictions <- predict(k_best_fit, credit_test) |>
    bind_cols(credit_test)

k_metrics <- k_predictions |>
    metrics(truth = Approved, estimate = .pred_class) |>
    filter(.metric == "accuracy") 

k_metrics

# This step uses the test data to evaluate our model.

# Below is our model's accuracy

A confusion matrix was computed to compare the true and predicted values.

In [None]:
k_conf_mat <- k_predictions |>
    conf_mat(truth = Approved, estimate = .pred_class)

k_conf_mat
# Our confusion matrix 
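
A confusion matrix supports more than accuracy. As a minimal sketch, assuming hypothetical counts (the numbers in `toy_conf` are not the project's results) and treating "True" as the positive class, accuracy, precision, and recall can be computed in base R:

```r
# Hypothetical confusion-matrix counts (not the project's results).
# Rows are predictions, columns are the true classes.
toy_conf <- matrix(
  c(60, 10,
    15, 80),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("False", "True"),
                  Truth      = c("False", "True"))
)

true_pos  <- toy_conf["True", "True"]    # predicted approved, was approved
false_pos <- toy_conf["True", "False"]   # predicted approved, was denied
false_neg <- toy_conf["False", "True"]   # predicted denied, was approved

accuracy  <- sum(diag(toy_conf)) / sum(toy_conf)
precision <- true_pos / (true_pos + false_pos)
recall    <- true_pos / (true_pos + false_neg)

c(accuracy = accuracy, precision = precision, recall = recall)
```

Precision and recall separate the two kinds of error, which matters here because approving an applicant who should be denied and denying one who should be approved have different consequences.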

In [None]:
# Gender Age Debt PriorDefault Employed CreditScore Income Approved
male_one <- tibble(Gender = 1, Age = 23, Debt = 0, PriorDefault = 0, Employed = 1, CreditScore = 40, Income = 500)
predict(k_best_fit, male_one)
    
male_two <- tibble(Gender = 1, Age = 19, Debt = 0, PriorDefault = 0, Employed = 1, CreditScore = 40, Income = 1500)
predict(k_best_fit, male_two)

female_one <- tibble(Gender = 0, Age = 21, Debt = 0, PriorDefault = 0, Employed = 1, CreditScore = 40, Income = 1000)
predict(k_best_fit, female_one)

female_two <- tibble(Gender = 0, Age = 19, Debt = 0, PriorDefault = 0, Employed = 0, CreditScore = 40, Income = 0)
predict(k_best_fit, female_two)

In [None]:
male_one <- tibble(Gender = 1, Age = 23, Debt = 0, PriorDefault = 0, Employed = 1, CreditScore = 0, Income = 500)
predict(k_best_fit, male_one)
    
male_two <- tibble(Gender = 1, Age = 19, Debt = 0, PriorDefault = 0, Employed = 1, CreditScore = 0, Income = 1500)
predict(k_best_fit, male_two)

female_one <- tibble(Gender = 0, Age = 21, Debt = 0, PriorDefault = 0, Employed = 1, CreditScore = 0, Income = 1000)
predict(k_best_fit, female_one)

female_two <- tibble(Gender = 0, Age = 19, Debt = 0, PriorDefault = 0, Employed = 0, CreditScore = 0, Income = 0)
predict(k_best_fit, female_two)
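
To make the what-if predictions above easier to interpret, the following base R sketch shows how an unweighted K-nearest neighbors vote produces a class for one new observation. Everything in it is an assumption for illustration: the training points in `train_x`, their labels, and the helper `knn_predict()` are hypothetical, only two predictors are used, and the values are treated as already standardized. It is not the kknn implementation.

```r
# Hypothetical, already standardized training data (illustrative only).
train_x <- matrix(
  c(-1.0, -0.8,
    -0.9, -1.1,
     1.0,  0.9,
     1.2,  1.1,
     0.8,  1.0),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("CreditScore", "Income"))
)
train_y <- c("False", "False", "True", "True", "True")

# Predict one new observation by majority vote among its k nearest
# training points, with every neighbor counted equally (the behavior
# requested by weight_func = "rectangular").
knn_predict <- function(new_x, k) {
  distances <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

knn_predict(c(CreditScore = 1.0, Income = 1.0), k = 3)
knn_predict(c(CreditScore = -1.0, Income = -1.0), k = 3)
```

The prediction depends only on which training points are closest after standardization, so a predictor that separates the classes well in the training data will tend to drive the vote.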

Discussing the Impact of Credit Score Using Our Own Data:
- 

We entered customized personal information for the predictor variables and made predictions from it. The results suggest that credit score influences the predicted approval status more heavily than the other variables. When the same credit score is applied to all four customized individuals, the classifier approves the applications when the score is high and rejects them when the score is low. Changing the other variables does not change the final prediction as much. For instance, an individual with a relatively high income and an individual with a low income are both approved when they share the same high credit score. Therefore, our model suggests that credit score influences the final decision on credit card approval more readily than the other variables.

In [None]:
graph_credit <- credit_data |> 
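    # Color the points by approval status so the plot matches its title
    ggplot(aes(x = Age, y = CreditScore, color = Approved)) + 
        geom_point() + 
        labs(x = "Age", y = "Credit Score", color = "Approved") + 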
        ggtitle("The impact of Credit score and Age on the Approval status") + 
        theme(text = element_text(size = 12))
graph_credit

Evaluation and Discussion of Results:
- 

- Our model for credit card approval is too complex to support conclusions based on only two variables at a time. Moreover, some variables, including employment, gender, and prior default, showed no patterns or trends when plotted. Therefore, we base our discussion on the other variables.

- One of the graphs that we produced showed that younger individuals with lower debt were more likely to be denied approval for their credit card applications, whereas older individuals with higher debt were more likely to be approved. We expected that older applicants would be more likely to receive approval than younger applicants because they may have more life and work experience. However, we did not expect individuals with high debt to be more likely to receive approval. We hypothesize that these individuals might be using their borrowed money to buy a house or invest in assets, which would be considered “beneficial” debt.
- Another graph showed a correlation between credit score and approval status. A higher credit score is associated with approved applications, and a lower credit score with denied ones. This is expected because credit scores help banks estimate the probability that applicants will pay back the money they borrow.
- Likewise, another graph showed that most applicants with higher income were approved, while applicants with lower income were denied. Again, this is expected because individuals with more money are expected to pay back the banks on time.
- Overall, our results indicate that age, debt, income, and credit score play the primary role in determining the approval status of applicants.
- According to an article published by Scotiabank on the credit score an applicant needs to be approved for a credit card, higher credit scores indicate a better chance of approval. However, the article confirms that credit score is not the only factor banks consider, as other personal characteristics are taken into account. This supports our results because debt, age, and income were all found to influence credit card approval status.

Impact of Findings:
- 

- Data from our model indicate that personal factors including age, debt, income, and credit score are the only factors that play a role in credit card approval. This suggests that banks do not show illegal bias or discrimination against applicants, because this personal information is legally acquired and used by banks to determine an applicant's approval status.

- Since other variables such as ethnicity and gender showed no relationship with credit card approval in our model, we found no indication of illegal bias in the banks' methods of issuing credit cards.

- In summary, our findings can have a considerable impact on the public perception of credit-issuing banks, considering the history of bias among banks in North America. For example, “At the Boundaries of Homeownership”, written by Chloe N. Thurston, details a critical moment in American history involving the Women’s Equity Action League and its president, Arvonne Fraser. The aim of the group was “to convince the government to extend the existing ban on mortgage discrimination by race, national origin, and color to cover marital status and sex as well, and for legislation that dealt with sex discrimination in consumer lending generally.” Fraser and the Women’s Equity Action League saw bias in consumer lending by banking institutions in the United States and took legal action to challenge it. Against this history, our project examines the credit card issuing facet of consumer lending in North America as it stands today and finds no indication that illegal bias is present.






Future Questions:
-
The result of this study could lead to a variety of future questions:

- Are credit card approval rates different across different regional banks? (Data trends found through our results indicated that there was no illegal bias present on behalf of credit card issuers. There is no mention of which banks/credit card issuers are being considered in the curation of this dataset, but we can see that they meet the North American legal standards of legal credit card issuing.)

- Are there any other variables, excluding the ones this study touched on, that could possibly suggest signs of biases and discrimination on credit card approvals?  

- Is the banks’ current credit approval system fair? Can this study be considered representative of the credit card approval systems of most banks?


References:
- 
Scotiabank. (2023, July 6). What credit score do you need for a credit card? https://www.scotiabank.com:443/content/scotiabank/ca/en/personal/advice-plus/features/posts.what-credit-score-do-you-need-for-a-credit-card.html
    
Thurston, C. (2018). Bankers in the Bedroom. In At the Boundaries of Homeownership: Credit, 
    Discrimination, and the American State (pp. 142-182). Cambridge University Press. 
    https://doi.org/10.1017/9781108380058.006 
    
Cortinhas, S. (2021). Credit Card Approvals (Clean Data). UC Irvine. 
    https://www.kaggle.com/datasets/samuelcortinhas/credit-card-approval-clean-data 

