
# Credit Card Default Case Study

Dani Bauer, 2022

In this tutorial, we introduce one of the examples we will use in the coming classes: we will attempt to predict credit card defaults using a public dataset.

We will use logistic regression to predict default scores.

## Credit Card Default Application

We rely on the dataset `pa_data_UCI_Credit_Card.csv` from the UCI Machine Learning Repository (Lichman, M., 2013. [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science).  This dataset provides credit card defaults for customers in Taiwan.  We are given some demographic information ($X_1$-$X_5$), the history of past payments ($X_6$-$X_{11}$), the amounts of previous bills ($X_{12}$-$X_{17}$), and the amounts of previous payments ($X_{18}$-$X_{23}$).  Finally, variable 24 is our target: whether there was a default in the next month.


As always, let's start with importing the libraries:

In [1]:
import numpy as np 
import matplotlib.pyplot as plt  
import pandas as pd 
from sklearn.model_selection import train_test_split
import seaborn as sns

from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import confusion_matrix, classification_report, precision_score, roc_curve, auc
from sklearn import preprocessing
from sklearn.preprocessing import scale

Let's load the dataset:

In [2]:
!git clone https://github.com/danielbauer1979/ML_656.git

Cloning into 'ML_656'...
remote: Enumerating objects: 141, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 141 (delta 12), reused 0 (delta 0), pack-reused 117
Receiving objects: 100% (141/141), 23.32 MiB | 8.83 MiB/s, done.
Resolving deltas: 100% (62/62), done.
Checking out files: 100% (28/28), done.


In [3]:
mydata = pd.read_csv('ML_656/UCI_Credit_Card.csv', index_col=0)

### Data Exploration and Preparation

In [None]:
mydata.head()

Let's look at some aggregate statistics.

In [None]:
mydata.describe()

First, a number of the variables are coded numerically but are really categorical (factors), particularly Gender (1 = male; 2 = female), Education (1 = graduate school; 2 = university; 3 = high school; 4 = others), Marital status (1 = married; 2 = single; 3 = others), and default payment. Let's collect their column names so that we can treat them as factors.  We will do the same for the history of past payments ($X_6$-$X_{11}$), although these variables are really ordinal.

In [6]:
factor = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'default.payment.next.month']

Also, a number of the levels occur very sparsely: there are 11 levels for all the `PAY` variables, but only the first six seem to be frequent.  So let's collapse levels 7 through 11 into one by capping the values at 4:

In [None]:
# Cap each PAY variable at 4; a single assignment avoids pandas' chained-indexing pitfalls
pay_cols = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
mydata[pay_cols] = mydata[pay_cols].clip(upper=4)
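
To see what this collapsing step does, here is a minimal sketch on a small made-up column (the values in `toy` are illustrative assumptions, not taken from the dataset). It counts the levels before and after capping at 4, using `Series.clip` as one way to express the rule.

```python
import pandas as pd

# Hypothetical toy column mimicking a PAY variable (values are made up)
toy = pd.DataFrame({"PAY_0": [-2, -1, 0, 0, 1, 2, 3, 4, 5, 8]})

levels_before = toy["PAY_0"].nunique()

# Cap every value above 4 at 4, i.e. merge the sparse top levels into one
toy["PAY_0"] = toy["PAY_0"].clip(upper=4)

levels_after = toy["PAY_0"].nunique()
print(levels_before, "levels before,", levels_after, "levels after")
print(toy["PAY_0"].value_counts().sort_index())
```

Values at or below 4 are left untouched, so only the rarely observed top levels are merged.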

Next, we rescale the numerical columns, since a number of learners require scaled inputs (and scaling should not matter for the others).  At the same time, we dummy-code the factor columns:

In [None]:
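# Split into numerical columns and factor columns (the target is set aside)
mydata_numcols = mydata.drop(columns = factor)
mydata_faccols = mydata[factor].drop(columns = ['default.payment.next.month']).astype('category')
# Dummy-code the factors, dropping the first level of each as the baseline
dummies = pd.get_dummies(mydata_faccols, drop_first=True)
# Standardize the numerical columns to mean 0 and standard deviation 1
mydata_numcols_sc_0 = scale(mydata_numcols)
mydata_numcols_sc = pd.DataFrame(data=mydata_numcols_sc_0, columns = mydata_numcols.columns, index = dummies.index)
# Reassemble: scaled numerical columns, dummies, and the target
mydata_sc = pd.concat([mydata_numcols_sc, dummies], axis = 1)
mydata_sc = pd.concat([mydata_sc, mydata['default.payment.next.month']], axis = 1)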
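
The two transformations used above can be checked on a tiny example. The sketch below uses made-up data (the frame `toy` and its values are assumptions for illustration only) to show that `scale` produces columns with mean 0 and standard deviation 1, and that `get_dummies` with `drop_first=True` omits the first level as the baseline.

```python
import pandas as pd
from sklearn.preprocessing import scale

# Hypothetical toy data (made up for illustration)
toy = pd.DataFrame({
    "LIMIT_BAL": [10000.0, 20000.0, 30000.0, 40000.0],
    "EDUCATION": [1, 2, 3, 2],
})

# scale() centers each column at 0 and divides by its population standard deviation
num_sc = pd.DataFrame(scale(toy[["LIMIT_BAL"]]), columns=["LIMIT_BAL"], index=toy.index)

# drop_first=True drops the first level (here EDUCATION == 1) as the baseline
dummies = pd.get_dummies(toy[["EDUCATION"]].astype("category"), drop_first=True)

toy_sc = pd.concat([num_sc, dummies], axis=1)
print(toy_sc)
```

A row with all dummy columns equal to zero therefore corresponds to the baseline level.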

And let's shorten the long name of the dependent variable:

In [None]:
mydata = mydata.rename(columns={"default.payment.next.month": "default"})

Let's take a look:

In [None]:
mydata.head()

In [None]:
mydata.describe()

Let's check a correlation plot to make sure none of the variables is extremely correlated:

In [None]:
mask = np.triu(np.ones_like(mydata.corr(), dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(mydata.corr(), mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
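
As a numeric complement to the heatmap, one could also list the pairs of variables whose correlation exceeds some cutoff. The sketch below does this on synthetic data; the column names, the random data, and the cutoff of 0.9 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical toy columns: two near-duplicates and one unrelated column
toy = pd.DataFrame({
    "BILL_A": base,
    "BILL_B": base + 0.1 * rng.normal(size=200),
    "AGE": rng.normal(size=200),
})

corr = toy.corr()
# Keep only the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs.abs() > 0.9]  # 0.9 is an arbitrary illustrative cutoff
print(high)
```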

The heatmap suggests that the bill amounts are highly correlated.  Let's just keep the most recent one and the average of all of them:


In [None]:
bill_cols = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
# Average bill amount over the six months, inserted right after the bill columns
mydata.insert(17, "BILL_AVG", mydata[bill_cols].mean(axis=1))
# Keep the most recent bill under a new name and drop the older ones
mydata = mydata.rename(columns={"BILL_AMT1": "BILL_REC"})
mydata = mydata.drop(columns=bill_cols[1:])
mydata.describe()

Let's save the dataset so that we can use it in coming tutorials without having to go through this procedure again:

In [None]:
mydata.to_csv('pa_data_UCI_Credit_Card_prepped.csv') 
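
When the saved file is read back later, `index_col=0` restores the ID column as the index, as in the loading step at the top. A minimal round-trip sketch, using a made-up two-row frame and a hypothetical file name in a temporary directory:

```python
import os
import tempfile
import pandas as pd

# Hypothetical miniature version of the prepared data (values are made up)
toy = pd.DataFrame(
    {"BILL_REC": [100.0, 250.0], "default": [0, 1]},
    index=pd.Index([1, 2], name="ID"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_prepped.csv")  # hypothetical file name
    toy.to_csv(path)
    # index_col=0 turns the first CSV column back into the index
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded.equals(toy))
```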

### Predictive Modeling

As usual, let's split our dataset into a training and a test set:

In [None]:
Train, Test = train_test_split(mydata, test_size=0.25)
Train_y = Train['default']
Train = Train.drop(columns = ['default'])
Test_y = Test['default']
Test = Test.drop(columns = ['default'])
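
The split above is random, so the results will differ from run to run. As a hedged variation (not what the cell above does), one could fix `random_state` for repeatability and pass `stratify` so that the default rate is about the same in both parts. The sketch uses made-up data; the seed of 0 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 20% defaults (made up for illustration)
toy = pd.DataFrame({
    "x": np.arange(100, dtype=float),
    "default": [1] * 20 + [0] * 80,
})

X = toy.drop(columns=["default"])
y = toy["default"]

# random_state makes the split repeatable; stratify keeps the default rate
# (nearly) identical in the training and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```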

Let's run a logistic regression model:

In [None]:
logistic_model1 = LogisticRegression(fit_intercept=True, max_iter=500).fit(Train,Train_y)
print(logistic_model1.intercept_)
print(logistic_model1.coef_)
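
To connect the printed intercept and coefficients to predicted probabilities: for a binary target, the model applies the logistic function to the linear score. The following sketch verifies this on synthetic data (the data-generating rule and seed are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Made-up binary labels that depend on both features plus noise
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(fit_intercept=True, max_iter=500).fit(X, y)

# Linear score, then the logistic (sigmoid) transform
z = model.intercept_ + X @ model.coef_.ravel()
manual_prob = 1.0 / (1.0 + np.exp(-z))

# The manual computation matches the second column of predict_proba
print(np.allclose(manual_prob, model.predict_proba(X)[:, 1]))
```

Exponentiating a coefficient gives the multiplicative change in the odds of default for a one-unit increase in that input, holding the others fixed.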

Let's check predictions:

In [None]:
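logistic_pred_1 = logistic_model1.predict_proba(Test)
# Number of predicted defaults at different probability cutoffs
print(np.sum(logistic_pred_1[:,1] > 0.5))
print(np.sum(logistic_pred_1[:,1] > 0.38))
# Classify with a cutoff of 0.36 and compare against the true labels
logistic_pred_1_lab = logistic_pred_1[:,1] > 0.36
confusion_matrix(Test_y, logistic_pred_1_lab)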
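
The choice of cutoff trades off missed defaults against false alarms. A small sketch with made-up labels and probabilities (not model output) illustrates how lowering the cutoff from 0.5 to 0.36 changes the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predicted default probabilities (made up)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
p_hat = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.25, 0.37, 0.45, 0.80])

results = {}
for cutoff in (0.5, 0.36):
    labels = (p_hat > cutoff).astype(int)
    # Rows are true classes, columns are predicted classes
    tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, labels).ravel())
    results[cutoff] = (tn, fp, fn, tp)
    print(f"cutoff={cutoff}: caught {tp} of {tp + fn} defaults, {fp} false alarms")
```

In this toy example, the lower cutoff catches more defaults at the cost of more false alarms.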

The confusion matrix shows that we are missing quite a few defaults.  Let's consider the ROC curve and the AUC:

In [None]:
fpr, tpr, threshold = roc_curve(Test_y, logistic_pred_1[:,1])
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
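
One way to read the AUC: it equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter (ties counted as one half). The sketch below checks this on made-up labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and scores (made up for illustration)
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.7, 0.35, 0.6, 0.9])

fpr, tpr, _ = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)

# Fraction of (defaulter, non-defaulter) pairs ranked correctly; ties count half
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(roc_auc, pairwise)
```

Because it depends only on how the scores rank the observations, the AUC does not change with the classification cutoff.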