In [7]:
# Importing libraries
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from docopt import docopt
from imblearn.over_sampling import SMOTE
from sklearn import preprocessing
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, plot_confusion_matrix, confusion_matrix,
    classification_report, roc_auc_score, roc_curve,
    recall_score, precision_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split

alt.data_transformers.enable('json')
#alt.renderers.enable('notebook')

In [8]:
# Reading results from the output files
evaluation_matrix = pd.read_csv("../results/accuracies.csv")
#evaluation_matrix_base = pd.read_csv("../results_baseline//accuracies.csv")
#head = pd.read_csv("../results/head.csv")
summary = pd.read_csv("../results/num_describe.csv", index_col=0).applymap(lambda x: '%.2f' % x)

# Rows of evaluation_matrix: 0 = test accuracy, 2 = test recall,
# 3 = test precision, 4 = AUC score.
# Columns: 1 = baseline model, 2 = alternate (7-feature) model.
test_accuracy = round(evaluation_matrix.iloc[0, 2], 2)
test_accuracy_base = round(evaluation_matrix.iloc[0, 1], 2)
recall = round(evaluation_matrix.iloc[2, 2], 2)
recall_base = round(evaluation_matrix.iloc[2, 1], 2)
precision = round(evaluation_matrix.iloc[3, 2], 2)
precision_base = round(evaluation_matrix.iloc[3, 1], 2)
auc = round(evaluation_matrix.iloc[4, 2], 2)
auc_base = round(evaluation_matrix.iloc[4, 1], 2)


# **Table of Contents:**
* Summary
* Introduction
* Methods
* Results
* Conclusions
* References

# 1. Summary <a class="anchor" id="first-bullet"></a>
In this project we use machine learning to identify the features that best predict customer default. Logistic regression achieved acceptable results on the held-out test data: the test accuracy of the model was about {{test_accuracy}}, the test recall was {{recall}}, and the test precision was about {{precision}}. The area under the ROC curve for the final model is {{auc}}.

Because of the risk associated with customers failing to pay, the model was designed to maximize recall, that is, to identify as many defaulting customers as possible, balanced against overall accuracy on the training and test data sets. The model identifies the following 7 features as the most important predictors of customer default.

1. Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
2. EDUCATION
3. MARRIAGE
4. AGE
5. Past monthly repayment status in September 2005 (which is the most recent month before the prediction month)
6. Past monthly repayment status in August 2005 (which is the second most recent month before the prediction month)
7. Amount of previous payment (NT dollar) in September 2005 (which is the most recent month before the prediction month)



# 2. Introduction <a class="anchor" id="second-bullet"></a>
Predicting customer default behaviour is critically important to lenders' risk management. In particular, there has been significant interest in identifying the features with the highest predictive power in order to reduce a lender's overall credit risk. In this study, we perform a data-informed analysis to build a model that captures the features that predict default payment.


# 3. Methods <a class="anchor" id="third-bullet"></a>
## Data
We used credit default data collected from the Taiwanese market in 2005. The data set is available from the [UCI Machine Learning Repository Irvine, CA: University of California, School of Information and Computer Science](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients). The data, which contain 23 features for 30,000 customers, were originally published by Chung Hua University of Taiwan and Tamkang University of Taiwan. Features include:

- `LIMIT_BAL`: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
- `SEX`: Gender (1 = male; 2 = female).
- `EDUCATION`: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
- `MARRIAGE`: Marital status (1 = married; 2 = single; 3 = others).  
- `AGE`: Age (year).  
- `PAY_1`, `PAY_2`, ..., `PAY_6`: Past monthly repayment status in September 2005, August 2005, ..., April 2005 respectively. ( -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.)  
- `BILL_AMT1`, `BILL_AMT2`, ..., `BILL_AMT6`: Amount of bill statement (NT dollar) in September 2005, August 2005, ..., April 2005 respectively.  
- `PAY_AMT1`, `PAY_AMT2`, ..., `PAY_AMT6`: Amount of previous payment (NT dollar) in September 2005, August 2005, ..., April 2005 respectively.  



## Analysis

Immediately after importing the data, we split it into training and test sets. 75% of the data was used to train the models; the remaining 25% was used only to measure the performance of the model on unseen data.

Next, we created lists of the numeric and categorical features. Table 1 below summarizes the numeric features in the training data, showing the mean, standard deviation, minimum, maximum and quartiles. The credit limit, bill amounts and payment amounts span broadly similar ranges, with maxima between roughly 430,000 and 1,230,000 NT dollars (the maximum credit limit is 800,000). Interestingly, the medians of the bill statement amounts are around 20,000, while the medians of the payment amounts are only around 2,000. Age ranges from 21 to 75, which is reasonable.
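The split described above can be sketched as follows. This is a minimal illustration on synthetic data rather than the project's own script: the array contents, the `random_state` value and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the credit data: 1,000 rows and 23 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 23))
y = rng.integers(0, 2, size=1000)  # 1 = default, 0 = no default

# Hold out 25% of the rows as test data before any exploration or fitting,
# so that the test set stays unseen until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123  # random_state is an assumption
)

print(X_train.shape, X_test.shape)
```

Splitting before any exploratory analysis keeps information from the test rows from leaking into feature selection or scaling decisions.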

In [9]:
summary

| | LIMIT_BAL | AGE | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22500.0 | 22500.0 | 22500.0 | 22500.0 | 22500.0 | 22500.0 | 22500.0 | 22500.0 | 22500.0 | 22500.0 | 22500.0 | 22500.0 | 22500.0 | 22500.0 |
| mean | 167229.76 | 35.49 | 50992.9 | 48905.72 | 46629.69 | 42932.42 | 39905.28 | 38385.69 | 5714.38 | 5848.26 | 5132.9 | 4728.45 | 4725.76 | 5282.13 |
| std | 129384.49 | 9.18 | 73064.69 | 70748.07 | 68376.99 | 63802.95 | 60135.85 | 58733.43 | 17078.24 | 21916.9 | 16892.47 | 15430.72 | 15138.46 | 18506.38 |
| min | 10000.0 | 21.0 | -165580.0 | -69777.0 | -157264.0 | -170000.0 | -81334.0 | -339603.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 50000.0 | 28.0 | 3565.75 | 2928.0 | 2577.0 | 2313.0 | 1711.75 | 1190.0 | 990.0 | 800.0 | 390.0 | 285.75 | 238.0 | 119.75 |
| 50% | 140000.0 | 34.0 | 22169.0 | 20859.0 | 19889.0 | 18855.5 | 17875.0 | 16715.0 | 2100.0 | 2001.0 | 1800.0 | 1500.0 | 1500.0 | 1500.0 |
| 75% | 240000.0 | 41.0 | 66732.75 | 63104.25 | 59532.5 | 53339.5 | 49743.0 | 48863.5 | 5006.0 | 5000.0 | 4512.0 | 4000.0 | 4000.0 | 4000.0 |
| max | 800000.0 | 75.0 | 746814.0 | 743970.0 | 855086.0 | 616836.0 | 587067.0 | 568638.0 | 873552.0 | 1227082.0 | 889043.0 | 621000.0 | 426529.0 | 528666.0 |


Table 1. Summary statistics of the numeric features in the training data

To learn how the numeric features are associated with one another, we explored their inter-correlations, shown below.
Some features are strongly collinear, in particular the bill amounts `BILL_AMT1`, `BILL_AMT2`, ..., `BILL_AMT6`.

![](../results/num_corr_chart.png)

Figure 1. Inter-correlation between numeric features
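A correlation matrix makes this kind of collinearity concrete. The sketch below uses synthetic columns (hypothetical values, not the real data) in which one month's bill is the previous month's bill plus noise, which mimics the month-to-month dependence of the `BILL_AMT` columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the second bill is the first bill plus noise, so the
# two columns are strongly correlated; age is generated independently.
bill_amt1 = rng.normal(50_000, 70_000, size=1_000)
bill_amt2 = bill_amt1 + rng.normal(0, 10_000, size=1_000)
age = rng.integers(21, 76, size=1_000).astype(float)

# np.corrcoef treats each row as one variable, so stacking three
# variables yields a 3 x 3 matrix of Pearson correlations.
corr = np.corrcoef(np.vstack([bill_amt1, bill_amt2, age]))
print(np.round(corr, 2))
```

Entries close to 1 off the diagonal, such as the one between the two bill columns here, indicate features that carry largely the same information.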

We can also study the correlation between the features and the response variable. Some features are more strongly correlated with the response variable than others, for example `LIMIT_BAL` and `AGE`.

![](../results/num_res_chart.png)


Figure 2. Correlation between numeric features and response

Figure 2 also shows that many of the features have heavy-tailed distributions. To mitigate this, we used [`RobustScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) to scale the predictors, since it is less sensitive to extreme values. In addition, we applied [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html) (Synthetic Minority Oversampling Technique) to the response variable to create a balanced data set on which to fit the model.

# 4. Results <a class="anchor" id="fourth-bullet"></a>


We selected a logistic regression model (`LogisticRegression`) combined with recursive feature elimination ([`RFE`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE)) because logistic regression does not assume normally distributed predictors, which suits this data set, where many of the features are not normally distributed. An additional advantage of `LogisticRegression` is that it is much more interpretable than more complex models.

We started the analysis by applying a robust scaler to the training data set. We then built a model with the full set of features as our base-case model. Its confusion matrix, evaluation matrix and ROC results were obtained to set a benchmark for comparison. `RFE` was then used to identify the most useful predictors, and we dropped the columns deemed less useful. Eventually, 7 features were used to train the model.

The hyperparameter `C` was tuned over the range from -4 to 20 using 5-fold cross-validation, and the model was then fitted with the best value. The confusion matrices below show the results.
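The feature selection step can be sketched as follows. The data here are synthetic, and `max_iter` and the `make_classification` settings are assumptions chosen only so the example runs quickly; the original scripts may configure `RFE` differently.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler

# Hypothetical training data standing in for the 23 predictors.
X_train, y_train = make_classification(
    n_samples=500, n_features=23, n_informative=7, random_state=0
)
X_train = RobustScaler().fit_transform(X_train)

# RFE repeatedly fits the estimator and drops the feature with the
# smallest coefficient magnitude until 7 features remain.
selector = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=7,
)
selector.fit(X_train, y_train)

# support_ is a boolean mask over the original columns; True means "kept".
X_selected = selector.transform(X_train)
print(selector.support_.sum(), X_selected.shape)
```

With a DataFrame, the same mask can be used to look up the names of the retained columns.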
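The tuning step might look like the sketch below. The text does not say whether "-4 to 20" refers to raw values or to exponents, so the grid here is purely illustrative and is not the authors' grid; scoring by recall is also an assumption, motivated by the stated goal of maximizing recall.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical training data with 7 selected features.
X_train, y_train = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=0
)

# Illustrative grid only: nine values of C spaced evenly on a log scale.
param_grid = {"C": np.logspace(-4, 4, 9)}

# 5-fold cross-validation over the grid; by default GridSearchCV then
# refits the estimator on all training data with the best value of C.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="recall",  # assumption: recall is the metric being optimized
)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
```

Smaller values of `C` mean stronger regularization, so the search trades off coefficient shrinkage against fit quality.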

Best-case Model            |  Base-case Model
:-------------------------:|:-------------------------:
![](../results/confusion_matrix.png)  |  ![](../results_baseline/confusion_matrix.png)





Figure 3. Confusion matrices of the best-case model with 7 features (left) and the base-case model with all features (right)

We can see that the best-case model, which uses 7 features, outperforms the base-case model in several respects. First, it obtains more true negatives on the test data than the base-case model. Furthermore, it produces fewer false positives, making it more precise, as shown in the evaluation matrix below.

In [4]:
evaluation_matrix

| | measurement | baseline | alternate model |
|---|---|---|---|
| 0 | test accuracy | 0.670267 | 0.7004 |
| 1 | train accuracy | 0.670933 | 0.6952 |
| 2 | test recall | 0.664653 | 0.625378 |
| 3 | test precision | 0.36448 | 0.388805 |
| 4 | auc score | 0.720814 | 0.715227 |


Table 2. Comparison of the evaluation matrix between models

The evaluation matrix in Table 2 shows that the accuracy of the alternate model on the test data was about {{test_accuracy}}, compared with only {{test_accuracy_base}} for the baseline model. The recall on the test data dropped slightly to {{recall}}, compared with {{recall_base}} for the baseline. The precision on the test data improved to {{precision}}, compared with only {{precision_base}} for the baseline model. The area under the ROC curve for the final model is {{auc}}, which is comparable to the baseline model.

An ROC curve was plotted to measure the model's discriminative ability. The dashed diagonal line represents a model that labels observations at random. The further the curve lies from the diagonal towards the top left corner, the better the model distinguishes the two classes, default and non-default customers. The blue line is our best model with 7 features; we can see that it performs fairly well.
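For reference, accuracy, recall and precision all follow directly from the four cells of a confusion matrix. The sketch below uses hypothetical counts, not the counts from Figure 3, and the function name is illustrative.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute accuracy, recall and precision from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        # Share of all customers that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of actual defaulters that the model caught.
        "recall": tp / (tp + fn),
        # Share of predicted defaulters that really defaulted.
        "precision": tp / (tp + fp),
    }


# Hypothetical counts for a test set of 1,000 customers.
metrics = classification_metrics(tn=700, fp=100, fn=80, tp=120)
print(metrics)
```

Because precision depends on false positives and recall on false negatives, a model can gain precision while losing some recall, which is the pattern seen in Table 2.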

![](../results/roc.png)

Figure 4. ROC curve for the fitted model with 7 features
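As a small worked example of how such a curve and its area are computed, the sketch below scores four hypothetical customers. The labels and scores are made up for illustration and are unrelated to the model's actual predictions.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps the decision threshold and returns the false positive
# rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The AUC equals the fraction of (defaulter, non-defaulter) pairs in which
# the defaulter receives the higher score: here 3 of 4 pairs.
example_auc = roc_auc_score(y_true, y_score)
print(example_auc)  # 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal, and 1.0 to a curve that passes through the top left corner.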

# 5. Conclusions <a class="anchor" id="fifth-bullet"></a>

We successfully used a `LogisticRegression` model to find the most important features for predicting customer default. The model achieves an acceptable level of accuracy on the test data, and better tuning of the hyperparameters may yield higher accuracy. Overall, we selected the best-case model to extract the most important features because it is more accurate. The precision of the best-case model is {{precision}}, whereas the base-case model scores only {{precision_base}}. Although the recall of the best-case model decreased from {{recall_base}} to {{recall}}, the AUC score dropped only slightly. Since the best-case model is more accurate, we expect the following 7 features to have the highest predictive power among all the features:

1. Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
2. EDUCATION
3. MARRIAGE
4. AGE
5. Past monthly repayment status in September 2005 (which is the most recent month before the prediction month)
6. Past monthly repayment status in August 2005 (which is the second most recent month before the prediction month)
7. Amount of previous payment (NT dollar) in September 2005 (which is the most recent month before the prediction month)



Although the best model does better than the baseline model overall, the result is still not very satisfactory, with a highest test accuracy of {{test_accuracy}}. To improve accuracy in the future, we have some suggestions that were not implemented because of time constraints.

- Use additional preprocessing techniques, such as one-hot encoding the categorical features.
- Use a pipeline so that a grid search can find the best combination of `LogisticRegression` parameter values and number of features to select.
- Use L1 regularization to eliminate features.

Another limitation is how well features 5, 6 and 7 generalize. It makes sense that the most recent months' repayment status and payment amount would be good indicators of whether the customer will default in the next month. However, how well they can predict default half a year or more ahead remains to be explored.
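The pipeline suggestion could be sketched as below. This is one possible arrangement under stated assumptions, not an implemented part of the project: the step names, the grid values, the recall scoring and the synthetic data are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Hypothetical training data.
X_train, y_train = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)

# Chaining the steps means scaling and feature selection are refitted
# inside every cross-validation fold, which avoids leaking information.
pipe = Pipeline(
    steps=[
        ("scale", RobustScaler()),
        ("rfe", RFE(LogisticRegression(max_iter=1000))),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# "<step>__<parameter>" addresses a parameter of a named pipeline step.
param_grid = {
    "rfe__n_features_to_select": [3, 5, 7],
    "clf__C": [0.01, 1.0, 100.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)
print(search.best_params_)
```

Searching the number of features and `C` jointly matters because the best regularization strength can change with the number of retained features.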

# References <a class="anchor" id="fifth-bullet"></a>



[1] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. [UCI Machine Learning Repository Irvine, CA: University of California, School of Information and Computer Science](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)

[2] [Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. CreateSpace, Scotts Valley, CA, 2009](https://dl.acm.org/doi/book/10.5555/1593511)

[3] Wickham, H. 2017. tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

[4] Wickham H (2011). “testthat: Get Started with Testing.” The R Journal, 3, 5–10. https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

[5] McKinney, W. (2012). Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc.

[6] Nielsen, F. Å. (2014). Python programming—Scripting.

[7] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), 2825-2830.

[8] VanderPlas, J., Granger, B., Heer, J., Moritz, D., Wongsuphasawat, K., Satyanarayan, A., ... & Sievert, S. (2018). Altair: Interactive statistical visualizations for python. Journal of open source software, 3(32), 1057.

[9] Percival, H. (2014). Test-driven development with Python: obey the testing goat: using Django, Selenium, and JavaScript. O'Reilly Media, Inc.

[10] Lemaître, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1), 559-563.

[11] Li, Susan. "Building A Logistic Regression in Python, Step by Step." Towards Data Science (2017). https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8

<cite data-cite="Python"></cite>
<cite data-cite="Dua:2019"></cite>

