# 1.1 - Blog Post: Classification Model for Credit Card Default Analysis 💰 

By: Angelique Clara Hanzell, Andres Zepeda

*December 7, 2022*

## Background about the Project
---

In 2005, a Taiwanese bank collected information on its clients, including default payments, demographic factors, credit data, payment history, and bill statements. The data is available as the [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset), taken from UCI's Machine Learning Repository.

Our **goal** is to use the previous 6 months of repayment history in this data set to predict how likely a customer is to default in the following month. This is useful for banks because it helps them reduce losses and increase profit: they can extend credit to accounts that are less likely to default, reducing their risk exposure.

## Components of the Project
---
Our project focused on five areas to reach our goal:
1. Data Wrangling 
2. Exploratory Data Analysis (EDA)
3. Different Models
4. Interpretation + Feature Importance
5. Results

> ### Data Wrangling 

The [original data set](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset) has several columns, including:
- `ID`: ID of clients
- `LIMIT_BAL`: Amount of given credit (NT Dollars)
- `SEX`: Gender 
- `EDUCATION`: Level of education
- `MARRIAGE`: Marital status
- `AGE`: Age (Years)
- `PAY_#`: Current repayment status 
- `BILL_AMT#`: Amount of bill statements (NT Dollars)
- `PAY_AMT#`: Amount of previous payment (NT Dollars)
- `default.payment.next.month`: The target variable, which indicates whether or not the customer defaulted on the payment in the following month (0 = No, 1 = Yes)

We decided to remove some variables from the original data set to reduce bias, such as `SEX`, which could lead to discriminatory predictions. We also removed the `ID` column since it is not relevant to our analysis.
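
The column removal described above can be sketched with pandas. This is a minimal illustration, not our exact notebook code: the three rows are made-up values, and the names `raw`, `features`, and `target` are hypothetical.

```python
import pandas as pd

# Tiny made-up sample with a subset of the data set's columns (values are illustrative).
raw = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "LIMIT_BAL": [20000, 120000, 90000],
        "SEX": [2, 2, 1],
        "EDUCATION": [2, 2, 1],
        "AGE": [24, 26, 34],
        "default.payment.next.month": [1, 1, 0],
    }
)

TARGET = "default.payment.next.month"

# Drop the identifier and the potentially discriminatory column,
# and separate the target from the predictors.
features = raw.drop(columns=["ID", "SEX", TARGET])
target = raw[TARGET]

print(list(features.columns))
```

Keeping the target in its own variable makes it harder to accidentally leak it into the predictors later on.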

> ### EDA

Our EDA surfaced some insights that we had not considered before. As seen in the plots below, both the amount of bill statements and the amount of previous payments are clearly right-skewed: most values are small, with a long tail of large values toward the right.

![bill amount](img/bill_amt.png "Bill Amount")

![pay amount](img/pay_amt.png "Pay Amount")

This makes sense, since people with a lot of money may take out large loans, for example to start a business. Outlier removal may not be necessary: large loan amounts are rare, they do not seem to affect the values to a large degree, and keeping them is unlikely to cause any "class discrimination", where class means the economic class of the individual.
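
One quick way to confirm this kind of skew numerically is to compare the mean with the median and compute a skewness coefficient. The sketch below uses only the standard library; the amounts are hypothetical values chosen to mimic a few very large bills, not values from the data set.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: positive means a longer tail to the right."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical bill amounts: mostly small, with two very large values.
bill_amounts = [1200, 2500, 3100, 4000, 5200, 6100, 7500, 9000, 48000, 150000]

mean = statistics.fmean(bill_amounts)
median = statistics.median(bill_amounts)
skew = sample_skewness(bill_amounts)

print(f"mean={mean:.0f}, median={median:.0f}, skewness={skew:.2f}")
```

A mean far above the median together with a positive skewness is the numeric counterpart of the long right tail in the plots.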

> ### Different Models


We tried several models suited to our classification problem: [Logistic Regression](https://www.ibm.com/topics/logistic-regression), [K-NN](https://www.ibm.com/topics/knn), [Random Forest](https://www.ibm.com/cloud/learn/random-forest), and [CatBoost](https://towardsdatascience.com/catboost-regression-in-6-minutes-3487f3e5b329). In the end, CatBoost performed best, as seen below, with an accuracy of ~82% for the optimized version obtained through [hyperparameter tuning](https://neptune.ai/blog/hyperparameter-tuning-in-python-complete-guide). We also chose CatBoost because it often outperforms other algorithms on this kind of tabular data and because it handles categorical features natively.

![models](img/models.png "models")
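
A model comparison like this one can be sketched with scikit-learn's cross-validation. This is only an illustration under assumptions: it uses synthetic data from `make_classification` instead of the credit card data, default-ish hyperparameters, and leaves out CatBoost so the snippet runs without the extra dependency (a `CatBoostClassifier` could be added to the `models` dictionary in the same way if `catboost` is installed).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Scale features for the models that are sensitive to feature magnitude.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "K-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy per model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into preprocessing.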

> ### Interpretation


To see which features matter most to our model, we ran [eli5](https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html) and [SHAP](https://christophm.github.io/interpretable-ml-book/shap.html) to rank the features from most to least important.

With eli5, we got a table of weights associated with each feature. Each weight tells us how important a feature is to the model. As seen below, when running eli5 on our most accurate model, CatBoost, the feature `EDUCATION` is the most dominant of all, with an importance of about 24%. This suggests that education is related to whether or not a client will default on their credit card.

![eli5](img/eli_5.png "eli5")

It is surprising and interesting that education is the most important feature, especially by such a large margin. On its own, though, it makes sense that it matters: we heuristically know that education is correlated with someone's job, and therefore with their ability to repay credit.
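
To give an intuition for what a permutation-importance weight means, here is a from-scratch sketch of the idea: shuffle one feature at a time and measure how much accuracy drops. It is not eli5's implementation. The toy data, `toy_model`, and the function names are hypothetical, and the toy label depends only on the first feature.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_features, n_repeats=10, seed=0):
    """Average drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: feature 0 fully determines the label, feature 1 is pure noise.
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]


def toy_model(row):
    """Stand-in for a trained classifier; it only looks at feature 0."""
    return int(row[0] > 0.5)


importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

Shuffling the noise feature changes nothing, so its importance is zero, while shuffling the informative feature pushes accuracy toward chance. A large weight therefore says the model relies on the feature, not that the feature causes the outcome.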

> ### Results

After building our model, the most important part comes next: trying it out on the part of the data set it has not seen yet, in other words, unseen data.

![test score](img/test_score.png "test score")

As expected, CatBoost once again has the highest test score, which is in line with what we found before. At ~82% accuracy there is certainly room for improvement, but it seems like a reasonable model to work with. In other words: for a bank with 1000 clients, the model would wrongly predict whether or not a client will default for around 180 of them.
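
The 180 figure is just the error rate applied to the number of clients. As a small sketch (the function name is hypothetical):

```python
def expected_misclassifications(n_clients, accuracy):
    """Number of clients we expect the model to get wrong at a given accuracy."""
    return round(n_clients * (1 - accuracy))


errors = expected_misclassifications(1000, 0.82)
print(errors)  # 180
```

Note that this count does not say which kind of mistake is made: it lumps together defaulters who were missed and reliable clients who were wrongly flagged.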

## What We Learned & Caveats
---
From this project, we realized that there are still many aspects that could be improved to make our overall model stronger, including:
- Our main goal is to build a model that helps banks reduce costs by not giving out loans to people who are likely to default, so the choice of metric is critical when evaluating model performance. In our project we focused on improving the accuracy and precision of the model, but after carefully reconsidering the goal, we realized that **improving recall would have been much more relevant** to the problem we are trying to solve and to this specific data set. Recall is the fraction of actual defaulters that the model correctly identifies, so the higher the recall, the fewer the false negatives (defaulters predicted as non-defaulters). Even though our accuracy was as high as ~82%, our recall was much lower. It could be misleading to use this model without trying to improve the recall score.
- The CatBoost model did not necessarily give us the best accuracy it could. CatBoost is in most cases expensive and slow to train, even though it performs the best for the most part. For hyperparameter tuning, there are things we did not explore, for example a more exhaustive grid search, because it took a very long time when we tried more combinations of parameters. As a result, the small number of hyperparameter combinations we tried may give a misleading picture of how well the model can perform.
- We could have done more rigorous feature selection and engineering (the optional part we didn't do). Aside from the limited hyperparameter search mentioned in the previous point, it is quite likely that our final metrics are not as good as they could be because we did not do feature selection (e.g. forward selection) or feature engineering. Feature selection would leave us with fewer features, which would make hyperparameter optimization a lot faster and could in turn improve our CatBoost model and our metrics. As the eli5 table showed, some features contribute very little importance, and feature selection would be an ideal way to drop them. Transforming our features (adding, deleting, combining, or mutating them) through feature engineering is another way we could improve our metrics.
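
To make the accuracy versus recall point concrete, here is a small worked sketch. The counts are hypothetical and chosen only so that accuracy comes out at 82%; they are not our model's actual confusion matrix.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # defaulters caught
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # defaulters missed
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # reliable clients wrongly flagged
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # reliable clients cleared
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical 1000 clients: 220 defaulters, of which the model catches only 80.
y_true = [1] * 220 + [0] * 780
y_pred = [1] * 80 + [0] * 140 + [1] * 40 + [0] * 740

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up scenario the model is right 82% of the time and still misses 140 of the 220 defaulters, which are exactly the cases that cost a bank the most.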

Overall, we think that we did a great job on the project. Machine learning is a vast field: there are many different ways to tackle a problem, and a lot of improvement can be made as time goes on!

# 1.2 - Effective Communication Technique 🤝

As this is our first time presenting our project in a blog post, we are particularly happy with our interpretation and results sections, since we know that for most people the results are what matters, and overall we are satisfied with them. We applied one of the communication techniques: "Interesting to you != useful to the reader (aka it's not about you)".

There are certainly some things that are interesting to me but not useful for the reader (such as the hyperparameter tuning part, which is harder to follow for someone without a background in machine learning). We decided to leave that part out and elaborate more on how interesting it is that education is considered the most important feature of all.

# 2 - Takeaway from the Course 🌱

I am going to be honest: when I first took this course, I did not know what to expect, since I did not have any experience in machine learning aside from basic algorithms such as K-NN.

After the past 4 months, I can say that this course is absolutely amazing and the most enjoyable course I have taken throughout my degree at UBC. I enjoyed every single part of it: the lectures were engaging and the homework was really fun to play around with. It was really fun seeing how data wrangling could be the most important part before moving on to building the model itself. I believe that mastering machine learning requires a lot of education, practice, and passion.

Thanks to this course, I have gained a lot of understanding of how machine learning works, along with the reasoning behind it. As a future data scientist (I will be doing my very first co-op in this role this upcoming January), I am very excited to apply what I learned in a more practical capacity: from data wrangling and modeling, to writing reports on projects (which is very important for explaining our model to non-technical people), all the way to deployment. The field of machine learning is vast, and it is like solving a puzzle for me.

Thank you very much to Professor Varada and all the teaching assistants who put a lot of effort into delivering this course in the best way possible!