<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB886_IV_13_CreditCardCaseStudyWLDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Card Default Case Study



In this tutorial, we will use logistic regression, and then linear discriminant analysis (LDA) for comparison, to predict credit card defaults.

We rely on the UCI credit card dataset (loaded below as `GB886_IV_9_UCI_Credit_Card.csv`) from the UCI Machine Learning Repository (Lichman, M., 2013. [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science).  This dataset provides credit card defaults for customers in Taiwan.  We are given some demographic information ($X_1$-$X_5$), the history of past payments ($X_6$-$X_{11}$), the amounts of previous bills ($X_{12}$-$X_{17}$), and the amounts of previous payments ($X_{18}$-$X_{23}$).  Finally, variable 24 is our target: whether there was a default in the following month.


As always, let's start with importing the libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, precision_score, roc_curve, auc

Let's load the dataset:

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
mydata = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_IV_9_UCI_Credit_Card.csv', index_col=0)

### Data Exploration and Preparation

In [None]:
mydata.head()

Let's look at some aggregate statistics.

In [None]:
mydata.describe()

First, several variables are stored as numbers but are really categorical (factors): gender (1 = male; 2 = female), education (1 = graduate school; 2 = university; 3 = high school; 4 = others), marital status (1 = married; 2 = single; 3 = others), and the default indicator. Let's collect their names so that we can convert them to factors.  We will do the same for the history of past payments ($X_6$-$X_{11}$), although these variables are really ordinal.

In [None]:
factor = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'default.payment.next.month']

Also, several of the levels are very sparse: there are 11 levels for each of the `PAY` variables, but only the first six seem to be frequent.  So let's collapse levels 7 through 11 into one (every value above 4 is set to 4):

In [None]:
# Cap every PAY_* value above 4 at 4.
# .loc avoids chained assignment, which pandas warns about and may silently ignore.
pay_cols = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
for col in pay_cols:
    mydata.loc[mydata[col] > 4, col] = 4
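
As a quick sanity check of the collapsing step, here is a minimal sketch on a made-up column. The series `toy` and its values are illustrative assumptions, not taken from the dataset. It uses `Series.clip(upper=4)`, which should have the same effect as the assignment above.

```python
import pandas as pd

# Hypothetical toy column mimicking a PAY_* variable with 11 distinct values
toy = pd.Series([-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8], name="PAY_TOY")

# Cap everything above 4 at 4; values of 4 or less are left untouched
collapsed = toy.clip(upper=4)

print(collapsed.value_counts().sort_index())
print("levels before:", toy.nunique(), "| levels after:", collapsed.nunique())
```

On the real data, comparing `mydata['PAY_0'].value_counts()` before and after the cell above would confirm that only the sparse upper levels were merged.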

Next, generate dummies:

In [None]:
mydata_numcols = mydata.drop(columns = factor)
mydata_faccols = mydata[factor].astype('category')
dummies = pd.get_dummies(mydata_faccols, drop_first=True)
mydata = pd.concat([mydata_numcols, dummies], axis = 1)
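
To see what `drop_first=True` does, here is a small sketch on a hypothetical one-column frame (the frame `toy` is an assumption for illustration). The first level is omitted and serves as the baseline, and each remaining level gets a column named `<variable>_<level>`. That naming is also why the target ends up as `default.payment.next.month_1`.

```python
import pandas as pd

# Hypothetical toy frame with one categorical code column
toy = pd.DataFrame({"EDUCATION": [1, 2, 3, 2, 1]}).astype("category")

# drop_first=True omits the first level (here 1), which becomes the baseline
toy_dummies = pd.get_dummies(toy, drop_first=True)

print(toy_dummies.columns.tolist())  # ['EDUCATION_2', 'EDUCATION_3']
```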

And let's shorten the long name of the dependent variable:

In [None]:
mydata = mydata.rename(columns={"default.payment.next.month_1": "default"})

Let's take a look:

In [None]:
mydata.head()

In [None]:
mydata.describe()

Let's check a correlation plot to make sure none of the variables is extremely correlated:

In [None]:
mask = np.triu(np.ones_like(mydata.corr(), dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(mydata.corr(), mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
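
A heatmap with this many columns can be hard to read, so it may help to also list the most strongly correlated pairs numerically. The sketch below does this on hypothetical toy data (the frame `toy` and its columns `a`, `b`, `c` are assumptions). On the real data, one would replace `toy` with `mydata`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b is nearly a copy of a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

print(pairs.head())
```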

So it looks like the bill amounts are highly correlated.  Let's just keep the most recent one and then the average of all of them:


In [None]:
# Average the six monthly bill amounts, keep the most recent one as BILL_REC, and drop the rest
bill_cols = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
mydata.insert(17, "BILL_AVG", mydata[bill_cols].mean(axis=1), True)
mydata = mydata.rename(columns={"BILL_AMT1": "BILL_REC"})
mydata = mydata.drop(columns=bill_cols[1:])
mydata.describe()

Let's save the dataset so that we can use it in coming tutorials without having to go through this procedure again:

In [None]:
mydata.to_csv('GB886_IV_9_UCI_Credit_Card_prepped.csv')

### Predictive Modeling: Logistic Regression

Define the response `y` and the feature matrix `X`:

In [None]:
y = mydata['default']
X = mydata.drop(columns = ['default'])

In [None]:
logistic_mod = sm.Logit(y, sm.add_constant(X).astype(float))
logistic_mod = logistic_mod.fit()
print(logistic_mod.summary())

So we notice that the limit balance and the payment amounts have a negative association with default, whereas the bill average has a positive association with default. Several of the demographic variables seem to matter, too!
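
For reference, these signs can be read through the standard logistic regression model. The notation here is generic: $\beta_j$ and $x_j$ are not names used in the notebook, and $p(x)$ denotes the probability of default given the features $x$.

$$
\log\frac{p(x)}{1-p(x)} = \beta_0 + \sum_{j} \beta_j x_j
\qquad\Longleftrightarrow\qquad
p(x) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j} \beta_j x_j\right)}}
$$

A one-unit increase in $x_j$, holding the other variables fixed, multiplies the odds $p(x)/(1-p(x))$ by $e^{\beta_j}$. A negative coefficient therefore lowers the odds of default and a positive one raises them.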

(I also ran this regression using sklearn, but the coefficients were quite different. There are a few reasons that I explored, but in the end the results still did not align. Given that there is a numerical procedure involved in solving the model, sometimes small inconsistencies can lead to substantial differences in coefficients.)

Let's check predictions:

In [None]:
p_x = logistic_mod.predict()
y_hat = (p_x > 0.5)

Let's take a look at the **confusion table**:

In [None]:
conf_mat = pd.crosstab(y, y_hat, rownames=['Actual Defaults'], colnames=['Predicted Defaults'])
# Add row and column sums
conf_mat.loc['Column_Total']= conf_mat.sum(numeric_only=True, axis=0)
conf_mat.loc[:,'Row_Total'] = conf_mat.sum(numeric_only=True, axis=1)
print(conf_mat)

And let's calculate some resulting metrics:

In [None]:
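# Read the four cells off the confusion matrix rather than typing them in
# (original run: TN = 22272, FP = 1092, FN = 4264, TP = 2372)
tn, fp, fn, tp = confusion_matrix(y.astype(int), y_hat.astype(int)).ravel()
TPR = tp / (tp + fn) # True-Positive Rate
TNR = tn / (tn + fp) # True-Negative Rate
MCR = (fp + fn) / len(y) # Misclassification Rate
print('TPR =', TPR)
print('TNR =', TNR)
print('MCR =', MCR)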


So at the 0.5 cutoff we catch only about 36% of the actual defaults (2372 of 6636), even though the overall misclassification rate of roughly 18% seems reasonable.
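
These rates depend on the 0.5 cutoff. The following sketch shows how they shift as the cutoff is lowered. The helper `rates_at_threshold` and the toy arrays are hypothetical and only illustrate the mechanics.

```python
import numpy as np

def rates_at_threshold(y_true, p_hat, threshold):
    """Return (TPR, TNR, MCR) when predicting default for p_hat > threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(p_hat) > threshold
    tp = np.sum(y_pred & y_true)    # defaults correctly flagged
    tn = np.sum(~y_pred & ~y_true)  # non-defaults correctly passed
    tpr = tp / np.sum(y_true)
    tnr = tn / np.sum(~y_true)
    mcr = np.mean(y_pred != y_true)
    return tpr, tnr, mcr

# Hypothetical toy labels and predicted probabilities (not from the dataset)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
p_toy = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.55, 0.25, 0.40, 0.60, 0.80])

for t in (0.5, 0.35, 0.2):
    tpr, tnr, mcr = rates_at_threshold(y_toy, p_toy, t)
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  TNR={tnr:.2f}  MCR={mcr:.2f}")
```

In the notebook, the same helper could be called as `rates_at_threshold(y, p_x, 0.3)` to see how many more defaults are caught at a lower cutoff, and at what cost in false alarms.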

Let's consider the **ROC curve**:

In [None]:
fpr, tpr, threshold = roc_curve(y, p_x)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

So the AUC is 77%, which is OK, but it is clear that to reach a high true positive rate (e.g., 80% or more), we have to accept a high false positive rate (more than 40%). Default prediction is not a simple problem!
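
The trade-off can also be read directly from the arrays returned by `roc_curve`. Below is a minimal sketch on hypothetical toy scores (`y_toy` and `score_toy` are simulated, not model output). It finds the first ROC point whose true positive rate reaches a target and reports the false positive rate that comes with it.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical toy scores: defaults tend to score higher, with heavy overlap
rng = np.random.default_rng(1)
y_toy = rng.binomial(1, 0.25, size=5000)
score_toy = rng.normal(loc=1.0 * y_toy, scale=1.0)

fpr, tpr, thresholds = roc_curve(y_toy, score_toy)

# First ROC point whose TPR reaches the target (tpr is non-decreasing)
target_tpr = 0.80
idx = np.argmax(tpr >= target_tpr)

print(f"threshold={thresholds[idx]:.3f}  TPR={tpr[idx]:.3f}  FPR={fpr[idx]:.3f}")
```

With the notebook's objects, the same lookup would be applied to the `fpr`, `tpr`, and `threshold` arrays computed from `y` and `p_x` above.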

### Predictive Modeling: LDA

Let's also check the LDA Classifier, going through similar steps:

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [None]:
y.astype('int')

In [None]:
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X,y.astype('int'))
lda_pred = lda_model.predict(X)

In [None]:
# confusion_matrix was already imported at the top of the notebook
confusion_matrix(y.astype('int'), lda_pred)

And let's look at the ROC curve and AUC:

In [None]:
lda_pred_proba = lda_model.predict_proba(X)
fpr, tpr, threshold = roc_curve(y, lda_pred_proba[:,1])
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

So the LDA results are very similar to those of the logistic regression!