## Notes

In [93]:
# Environment Setup
library(tidyverse)
library(tidymodels)
library(readxl)

options(repr.plot.width = 10, repr.plot.height = 8)

# Credit Card Default Prediction
___

## Introduction
___

**TODO: (!!! delete below in final draft)**
* Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal
* Clearly state the question you will try to answer with your project
* Identify and describe the dataset that will be used to answer the question

#### **Background** ####
Credit cards are an essential part of daily life. According to Statista, there were 76 million credit cards in circulation in Canada, implying roughly two for every Canadian. Beyond their impact on everyday citizens, credit card lending is booming, and the securitization of these loans is a major business for banks and asset managers. As a result, large-scale credit card defaults could lead to systemic failures of banks and the broader capital markets, similar to those observed during the Global Financial Crisis (GFC).

#### **Thesis** ####
Using the data and the techniques learned in class, we would like to answer the question: **can we predict the default status of a credit card client?**

#### **Data** ####
The data comes from an unnamed bank in Taiwan that issues debit and credit cards. It covers 30,000 customers as of October 2005, of whom 23,364 (78%) did not default and 6,636 (22%) defaulted. Default status is recorded as a binary variable (1 = Yes, 0 = No). The variables available as direct inputs are listed below:
* **X1**: Amount of Credit Given (NT$)
* **X2**: Sex (1 = Male, 2 = Female)
* **X3**: Education (1 = Graduate School, 2 = University, 3 = High School, 4 = Others)
* **X4**: Marital Status (1 = Married, 2 = Single, 3 = Others)
* **X5**: Age (# of Years)
* **X6-X11**: Repayment Status (-1 = Clear, 1 = Payment Delay of 1 Month, 2 = Payment Delay of 2 Months, ..., 8 = Payment Delay of 8 Months, 9 = Payment Delay of 9 Months or greater). Note that the data also contains the values -2 and 0, which this scale does not define.
    * **X6**  = Repayment status in 2005-09
    * **X7**  = Repayment status in 2005-08
    * **X8**  = Repayment status in 2005-07
    * **X9**  = Repayment status in 2005-06
    * **X10** = Repayment status in 2005-05
    * **X11** = Repayment status in 2005-04
* **X12-X17**: Amount of Bill Statement (NT$)
    * **X12** = Amount of Bill Statement in 2005-09
    * **X13** = Amount of Bill Statement in 2005-08
    * **X14** = Amount of Bill Statement in 2005-07
    * **X15** = Amount of Bill Statement in 2005-06
    * **X16** = Amount of Bill Statement in 2005-05
    * **X17** = Amount of Bill Statement in 2005-04
* **X18-X23**: Amount of Previous Payment (NT$)
    * **X18** = Amount Paid in 2005-09
    * **X19** = Amount Paid in 2005-08
    * **X20** = Amount Paid in 2005-07
    * **X21** = Amount Paid in 2005-06
    * **X22** = Amount Paid in 2005-05
    * **X23** = Amount Paid in 2005-04

Source of Data: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

## Preliminary Exploratory Data Analysis
___

* Demonstrate that the dataset can be read from the web into R 
* Clean and wrangle your data into a tidy format
* Using only **training data**, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
* Using only **training data**, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.


In [99]:
columns <- c('id', 'credit_limit', 'sex', 'education', 'marital_status', 'age',
             'status_09', 'status_08', 'status_07', 'status_06', 'status_05', 'status_04',
             'balance_09', 'balance_08', 'balance_07', 'balance_06', 'balance_05', 'balance_04',
             'payment_09', 'payment_08', 'payment_07', 'payment_06', 'payment_05', 'payment_04',
             'y')

# Skip the sheet's two header rows and use the column names defined above instead
credit_card_data <- read_excel('data/default of credit card clients.xls',
                               sheet = 'Data', skip = 2, col_names = columns) |>
    select(-id)

head(credit_card_data)

credit_limit,sex,education,marital_status,age,status_09,status_08,status_07,status_06,status_05,⋯,balance_06,balance_05,balance_04,payment_09,payment_08,payment_07,payment_06,payment_05,payment_04,y
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
20000,2,2,1,24,2,2,-1,-1,-2,⋯,0,0,0,0,689,0,0,0,0,1
120000,2,2,2,26,-1,2,0,0,0,⋯,3272,3455,3261,0,1000,1000,1000,0,2000,1
90000,2,2,2,34,0,0,0,0,0,⋯,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
50000,2,2,1,37,0,0,0,0,0,⋯,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
50000,1,2,1,57,-1,0,-1,0,0,⋯,20940,19146,19131,2000,36681,10000,9000,689,679,0
50000,1,1,2,37,0,0,0,0,0,⋯,19394,19619,20024,2500,1815,657,1000,1000,800,0


In [101]:
set.seed(154)
credit_card_split <- credit_card_data |> initial_split(prop=0.75, strata = y)
credit_card_train <- training(credit_card_split)
credit_card_test <- testing(credit_card_split)
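
To make the idea behind the stratified split above concrete, here is a minimal base-R sketch of sampling 75% of the rows within each class. It is an illustration under stated assumptions, not how `initial_split()` is implemented internally: the `toy` data frame and the `stratified_split()` helper are hypothetical and exist only for this example.

```r
# Toy data standing in for credit_card_data (assumed: 1,000 rows, 22% defaults)
set.seed(154)
toy <- data.frame(
  credit_limit = round(runif(1000, 10000, 500000), -4),
  y = rep(c(0, 1), times = c(780, 220))
)

# Hypothetical helper: draw `prop` of the row indices separately within each class
stratified_split <- function(data, strata, prop = 0.75) {
  idx_by_class <- split(seq_len(nrow(data)), data[[strata]])
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  }), use.names = FALSE)
  list(train = data[train_idx, , drop = FALSE],
       test = data[-train_idx, , drop = FALSE])
}

toy_split <- stratified_split(toy, strata = "y", prop = 0.75)

# The share of defaults should be the same in both partitions
prop.table(table(toy_split$train$y))
prop.table(table(toy_split$test$y))
```

Because the sampling happens within each class, the training and test sets keep the class balance of the full data, which matters here since defaults make up only about a fifth of the observations.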

In [189]:
# Count missing (NA) values in each column of the training data
na_data_count <- sapply(credit_card_train,function(x) sum(is.na(x))) |>
    t() |>
    as_tibble()
    
na_data_count

credit_limit,sex,education,marital_status,age,status_09,status_08,status_07,status_06,status_05,⋯,balance_06,balance_05,balance_04,payment_09,payment_08,payment_07,payment_06,payment_05,payment_04,y
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
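
The per-column count above can be complemented by a count of rows that contain any missing value. The sketch below shows both checks in base R on a small, made-up data frame (`toy` is an assumption for illustration; the real training data has no missing values, as the output above shows).

```r
# Made-up data with two deliberately missing entries
toy <- data.frame(
  age = c(24, NA, 34, 37),
  credit_limit = c(20000, 120000, NA, 50000),
  y = c(1, 1, 0, 0)
)

# Missing values per column
na_per_column <- colSums(is.na(toy))

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy))

na_per_column
rows_with_na
```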


### Summary of Categorical Columns ###

In [181]:
credit_card_train |>
    group_by(y) |>
    summarize(count = n(),
              average_age = mean(age),
              average_credit_limit = mean(credit_limit),
              average_balance_09 = mean(balance_09))

y,count,average_age,average_credit_limit,average_balance_09
<dbl>,<int>,<dbl>,<dbl>,<dbl>
0,17523,35.40969,178766.3,51866.63
1,4977,35.827,130377.3,47696.3
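
The class counts in the table above imply a useful reference point: a model that always predicts "no default" would already be right for every non-defaulting client. The sketch below computes that majority-class baseline from the two counts reported above. Treating it as the benchmark for a later classifier is our suggestion, not something the analysis has established yet.

```r
# Training-set class counts taken from the summary table above
class_counts <- c(no_default = 17523, default = 4977)

# Share of each class in the training data
class_share <- class_counts / sum(class_counts)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy <- max(class_share)

round(class_share, 4)
baseline_accuracy
```

Any classifier built later should beat this baseline. Since such a baseline never flags a single default, accuracy alone may be a misleading measure for this problem.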


In [182]:
credit_card_train |>
    group_by(sex) |>
    summarize(count = n(),
              average_age = mean(age),
              average_credit_limit = mean(credit_limit),
              average_balance_09 = mean(balance_09))

sex,count,average_age,average_credit_limit,average_balance_09
<dbl>,<int>,<dbl>,<dbl>,<dbl>
1,8919,36.51856,164356.3,53862.09
2,13581,34.8344,170496.7,49027.86


In [183]:
credit_card_train |>
    group_by(education) |>
    summarize(count = n(),
              average_age = mean(age),
              average_credit_limit = mean(credit_limit),
              average_balance_09 = mean(balance_09))

education,count,average_age,average_credit_limit,average_balance_09
<dbl>,<int>,<dbl>,<dbl>,<dbl>
0,11,38.81818,219090.9,12670.82
1,7962,34.22143,213146.2,48448.09
2,10493,34.785,148308.0,53463.19
3,3702,40.22447,126155.5,46874.74
4,84,33.61905,219285.7,58567.31
5,207,35.57971,164521.7,85266.74
6,41,43.85366,151951.2,79780.1


In [184]:
credit_card_train |>
    group_by(marital_status) |>
    summarize(count = n(),
              average_age = mean(age),
              average_credit_limit = mean(credit_limit),
              average_balance_09 = mean(balance_09))

marital_status,count,average_age,average_credit_limit,average_balance_09
<dbl>,<int>,<dbl>,<dbl>,<dbl>
0,39,37.53846,133846.2,20942.15
1,10270,40.02561,181968.1,52493.6
2,11949,31.46715,157597.1,49805.05
3,242,42.42562,100206.6,46268.08
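
The tables above show codes that the variable descriptions do not list: 0, 5 and 6 for education, and 0 for marital status. One possible way to handle them, sketched below in base R, is to fold them into the existing "Others" category while converting the codes to labelled factors. This is an assumption on our part, not a decision documented by the data source, and `toy` is a made-up data frame used only for illustration.

```r
# Made-up rows covering both documented and undocumented codes
toy <- data.frame(
  sex = c(1, 2, 2),
  education = c(1, 5, 0),
  marital_status = c(0, 1, 2)
)

# Assumption: education codes outside 1-3 (including 0, 5, 6) are treated as "Others"
edu_code <- ifelse(toy$education %in% 1:3, toy$education, 4)
toy$education <- factor(edu_code, levels = 1:4,
                        labels = c("Graduate School", "University", "High School", "Others"))

# Assumption: marital status codes outside 1-2 (including 0) are treated as "Others"
mar_code <- ifelse(toy$marital_status %in% 1:2, toy$marital_status, 3)
toy$marital_status <- factor(mar_code, levels = 1:3,
                             labels = c("Married", "Single", "Others"))

toy$sex <- factor(toy$sex, levels = 1:2, labels = c("Male", "Female"))

toy
```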


### Summary of Numeric Columns ###

In [None]:
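# Keep only the numeric columns (plus y) by dropping the coded categorical columns
numeric_cols_train <- credit_card_train |>
    select(-sex, -education, -marital_status, -starts_with('status_'))

# Column-wise summary statistics per class (y == 0: no default, y == 1: default)
min_0 <- apply(filter(numeric_cols_train, y == 0), 2, min, na.rm = TRUE)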
average_0 <- apply(filter(numeric_cols_train, y == 0), 2, mean, na.rm = TRUE)
median_0 <- apply(filter(numeric_cols_train, y == 0), 2, median, na.rm = TRUE)
max_0 <- apply(filter(numeric_cols_train, y == 0), 2, max, na.rm = TRUE)

min_1 <- apply(filter(numeric_cols_train, y == 1), 2, min, na.rm = TRUE)
average_1 <- apply(filter(numeric_cols_train, y == 1), 2, mean, na.rm = TRUE)
median_1 <- apply(filter(numeric_cols_train, y == 1), 2, median, na.rm = TRUE)
max_1 <- apply(filter(numeric_cols_train, y == 1), 2, max, na.rm = TRUE)

In [158]:
numeric_cols_summary <- rbind(min_0, average_0, median_0, max_0, min_1, average_1, median_1, max_1) |>
    as_tibble() |>
    mutate(info = c('Min','Mean','Median','Max','Min','Mean','Median','Max')) |>
    relocate(y,.before=credit_limit) |>
    relocate(info,.before=credit_limit)

numeric_cols_summary

y,info,credit_limit,age,balance_09,balance_08,balance_07,balance_06,balance_05,balance_04,payment_09,payment_08,payment_07,payment_06,payment_05,payment_04
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,Min,10000.0,21.0,-165580.0,-69777.0,-157264.0,-170000.0,-81334.0,-209051.0,0.0,0.0,0.0,0.0,0.0,0.0
0,Mean,178766.3,35.40969,51866.63,49686.6,47550.7,43656.7,40764.53,39372.77,6313.638,6643.116,5831.461,5316.808,5370.884,5778.33
0,Median,150000.0,34.0,22672.0,21146.0,20012.0,18887.0,17850.0,16617.0,2410.0,2249.0,2000.0,1714.0,1774.0,1710.0
0,Max,1000000.0,79.0,964511.0,983931.0,1664089.0,891586.0,927171.0,961664.0,873552.0,1684259.0,896040.0,621000.0,426529.0,528666.0
1,Min,10000.0,21.0,-6676.0,-17710.0,-61506.0,-50616.0,-53007.0,-94625.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Mean,130377.3,35.827,47696.3,46498.54,44411.35,41452.31,38983.0,37697.45,3240.056,3385.959,3288.094,3186.211,3118.103,3229.31
1,Median,90000.0,34.0,19930.0,19974.0,19590.0,18919.0,18160.0,17786.0,1600.0,1500.0,1200.0,1000.0,1000.0,1000.0
1,Max,740000.0,75.0,613860.0,572677.0,578971.0,541019.0,547880.0,514975.0,235728.0,358689.0,319494.0,432130.0,287982.0,228548.0
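
The same per-class summary can be produced more compactly by splitting the data by class and applying one summary function to every column. The following is a minimal base-R sketch on made-up data; `toy` and `summary_stats()` are hypothetical names introduced only for this example.

```r
# Made-up data with the same shape as the numeric training columns
toy <- data.frame(
  y = c(0, 0, 0, 1, 1),
  credit_limit = c(10000, 150000, 200000, 50000, 90000),
  age = c(21, 34, 50, 30, 40)
)

# Hypothetical helper returning the four statistics used above
summary_stats <- function(x) {
  c(Min = min(x), Mean = mean(x), Median = median(x), Max = max(x))
}

# One matrix of statistics per class: rows are statistics, columns are variables
numeric_cols <- setdiff(names(toy), "y")
summary_by_class <- lapply(split(toy[numeric_cols], toy$y),
                           function(group) sapply(group, summary_stats))

summary_by_class[["0"]]
summary_by_class[["1"]]
```

Keeping the statistics in a single helper avoids repeating the `apply()` call for every combination of class and statistic.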


## Methods
___

* Explain how you will conduct either your data analysis and which variables/columns you will use. <u>Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?</u>
* Describe at least one way that you will visualize the results

## Expected Outcomes and Significance
___

* What do you expect to find?
* What impact could such findings have?
* What future questions could this lead to?

Using this data and the techniques learned in this course, we plan to build a machine learning classifier that predicts whether a credit card client will default.

With these predictions, lenders could:

1. Create a less risky credit environment for both banks and cardholders.
2. Offer better rates to lower-risk cardholders.
3. Develop educational programs that help higher-risk cardholders manage their credit cards better.

This could lead to further questions, such as: how can lenders improve efficiency while maintaining a low-risk environment in which everyone can afford to have a credit card?