<a href="https://colab.research.google.com/github/danielbauer1979/ML_656/blob/main/Module6_SVM_and_kNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Card Default Case Study

Dani Bauer, 2022

Setup -- let's load the necessary packages.

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn import preprocessing
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, precision_score, roc_curve, auc
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier

In this tutorial, we introduce and illustrate the *k-nearest-neighbor* (kNN) and *support-vector-machine* (SVM) classifiers. We first introduce the concepts and then go back to predicting defaults on credit cards using a public dataset.

## Review of Concepts and Maths -- SVM

The intellectual starting point for SVMs is that:
$$
f(x) = \beta_0 + \beta_1\,x_1 + \beta_2\,x_2 + \ldots + \beta_p\,x_p = 0
$$
defines a *hyperplane*, and we can define a *binary linear classifier* according to 
$$
f(x^{(0)}) \begin{array}{c} >\\<\end{array} 0,
$$ 
that is, according to which side of the hyperplane the point $x^{(0)}$ falls on.  We can determine the hyperplane that maximizes the distance (the *margin*) to the two classes, allowing for exceptions.  It turns out that only a few points matter for the solution: those that lie exactly on the margin and those that are exceptions.  These points are called *support vectors*.
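
As a minimal sketch of this idea, the snippet below fits a linear SVM to two made-up, well-separated clusters and then classifies by the sign of $f(x)$ computed by hand. The toy data, the seed, and the names `X_toy`, `y_toy`, and `linear_svc` are assumptions for illustration only; they are not part of the case study.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (hypothetical): two well-separated clusters in two dimensions.
rng = np.random.default_rng(0)
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
X_toy = np.vstack([X_neg, X_pos])
y_toy = np.array([-1] * 20 + [1] * 20)

# Fit a linear SVM; C controls how strongly exceptions are penalized.
linear_svc = SVC(kernel='linear', C=1.0).fit(X_toy, y_toy)

# Recover the hyperplane f(x) = beta_0 + beta_1 * x_1 + beta_2 * x_2.
beta_0 = linear_svc.intercept_[0]
beta = linear_svc.coef_[0]

# Classify by the sign of f(x) and compare with the fitted model's predictions.
f_values = beta_0 + X_toy @ beta
manual_pred = np.where(f_values > 0, 1, -1)

print('Matches predict():', np.array_equal(manual_pred, linear_svc.predict(X_toy)))
print('Support vectors:', linear_svc.support_vectors_.shape[0], 'of', len(X_toy))
```

Only a handful of the 40 points should end up as support vectors here, since most points lie far from the boundary.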

This idea generalizes to non-linear functions $f(\cdot)$:
$$
f(x) = 0
$$
defines a boundary, and we can classify according to $$f(x^{(0)}) \begin{array}{c} >\\<\end{array} 0.$$  SVMs use *kernel functions* to define non-linear $f$'s that work particularly well.
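
To make the role of the kernel concrete, here is a small sketch using the radial (RBF) kernel $K(a,b) = \exp(-\gamma\,\|a-b\|^2)$, the one selected by `kernel='rbf'` in scikit-learn. It first computes the kernel by hand and then rebuilds the fitted $f$ from the support vectors. The toy data and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def rbf_by_hand(a, b, gamma):
    """Radial kernel K(a, b) = exp(-gamma * ||a - b||^2) for two vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])
k_manual = rbf_by_hand(a, b, gamma=1.0)
k_sklearn = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=1.0)[0, 0]
print('Kernel by hand:', k_manual, ' sklearn:', k_sklearn)

# Toy data (hypothetical): the class depends on the distance from the origin.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = np.where((X_toy ** 2).sum(axis=1) > 1.5, 1, -1)
rbf_svc = SVC(kernel='rbf', gamma=1.0, C=10).fit(X_toy, y_toy)

# f(x) = intercept + sum_i dual_coef_i * K(support_vector_i, x)
X_new = rng.normal(size=(5, 2))
K_new = rbf_kernel(X_new, rbf_svc.support_vectors_, gamma=1.0)
f_manual = K_new @ rbf_svc.dual_coef_[0] + rbf_svc.intercept_[0]
print('Matches decision_function():',
      np.allclose(f_manual, rbf_svc.decision_function(X_new)))
```

The second part shows that the non-linear $f$ depends on the training data only through the support vectors.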

### SVM Illustration

Let's consider some arbitrary predictors in two-dimensional space, $x_1$ and $x_2$:

In [7]:
np.random.seed(1)
x_1 = 2 * np.random.normal(0, 1, 100)
x_2 = 2 * np.random.normal(0, 1, 100)

As we discussed above, the equation $f(x_1,x_2) = 0$ fixes a boundary in the two-dimensional space, so the "true" classification may be done according to:
$$
f(x_1,x_2) \;\; \begin{array}{c}>\\< \end{array} \;\; 0.
$$
We rely on the example that was used in generating the (non-linear) plot from lecture: 

In [8]:
y_true = np.sign(5 * np.sin(0.5 * (x_1 - 2)) + 2 - x_2)
mydata_true = pd.DataFrame({'y_true':y_true,'x_1':x_1,'x_2':x_2})

To generate more realistic data sets, where the classification is not sharp along the boundary, let's add noise.  We consider two different examples, one with less noise -- labeled 1 -- and one with more noise -- labeled 2.

In [9]:
y_1 = np.sign(5 * np.sin(0.5 * (x_1 - 2)) + 2 - x_2 + 1 * np.random.normal(0,1, 100))
mydata_1 = pd.DataFrame({'y_1':y_1,'x_1':x_1,'x_2':x_2})

In [10]:
y_2 = np.sign(5 * np.sin(0.5 * (x_1 - 2)) + 2 - x_2 + 3.5 * np.random.normal(0,1, 100))
mydata_2 = pd.DataFrame({'y_2':y_2,'x_1':x_1,'x_2':x_2})

We also set up a function for plotting:

In [12]:
def plot_svc(svc, X, y, h=0.02, pad=0.25):
  x_min, x_max = X[:, 0].min()-pad, X[:, 0].max()+pad
  y_min, y_max = X[:, 1].min()-pad, X[:, 1].max()+pad
  xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
  Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
  Z = Z.reshape(xx.shape)
  plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.2)
  plt.scatter(X[:,0], X[:,1], s=70, c=y, cmap=plt.cm.Paired)
  # Support vectors are marked in the plot with vertical bars
  sv = svc.support_vectors_
  plt.scatter(sv[:,0], sv[:,1], c='k', marker='|', s=100, linewidths=1)
  plt.xlim(x_min, x_max)
  plt.ylim(y_min, y_max)
  plt.xlabel('X1')
  plt.ylabel('X2')
  plt.show()
  print('Number of support vectors: ', svc.support_.size)

Let's start out with the first example, where there is little noise such that the classes are well-separated.  Let's run our first SVM classifier:

In [None]:
X = mydata_1.drop(columns = ['y_1']).values
y = mydata_1['y_1'].values
svc = SVC(kernel='rbf', gamma=1, C=10)  # Equivalent to the radial kernel in R.
svc.fit(X, y)
plot_svc(svc, X, y)

We can see that the SVM does well -- consider the confusion matrix:

In [None]:
pred = svc.predict(X)
table = pd.DataFrame({'y_1':y_1,'pred':pred})
table.groupby(['y_1','pred']).size().unstack('y_1').fillna(0)

Not a single misclassification in the training data.

Even relative to the ground truth without noise the performance is good:

In [None]:
table = pd.DataFrame({'y_true':y_true,'pred':pred})
table.groupby(['y_true','pred']).size().unstack('y_true')

Very few misses!

Now let's look at the second data set where the data is less well-separated.  Let's again fit the SVM, using the same choices (after all, the "true" pattern is the same):

In [None]:
X = mydata_2.drop(columns = ['y_2']).values
y = mydata_2['y_2'].values
svc = SVC(kernel='rbf', gamma=1, C=10)  # Equivalent to the radial kernel in R.
svc.fit(X, y)
plot_svc(svc, X, y)

It does not look as good. It seems to overfit. Let's evaluate:

In [None]:
pred = svc.predict(X)
table = pd.DataFrame({'y_2':y_2,'pred':pred})
table.groupby(['y_2','pred']).size().unstack('y_2').fillna(0)

In [None]:
table = pd.DataFrame({'y_true':y_true,'pred':pred})
table.groupby(['y_true','pred']).size().unstack('y_true')

Hence, the error rate is substantially larger, and the number of support vectors is significantly greater.

Overall, it appears that SVM is doing particularly well in the first context where the classes are well separated.
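
To put numbers on the overfitting impression, one could compare the training error with the error against the noise-free labels for several values of `gamma`. The sketch below regenerates data along the lines of the cells above. Note that the noise draws differ from those behind `y_2` (the random stream is consumed differently), and the grid of `gamma` values is an illustrative assumption rather than a recommendation.

```python
import numpy as np
from sklearn.svm import SVC

# Regenerate predictors and the noise-free boundary as in the illustration.
np.random.seed(1)
x_1 = 2 * np.random.normal(0, 1, 100)
x_2 = 2 * np.random.normal(0, 1, 100)
signal = 5 * np.sin(0.5 * (x_1 - 2)) + 2 - x_2
y_true = np.sign(signal)

# Noisy labels with the larger noise level (not identical to y_2 above).
y_noisy = np.sign(signal + 3.5 * np.random.normal(0, 1, 100))
X = np.column_stack([x_1, x_2])

results = []
for gamma in [0.01, 0.1, 1, 10]:
    model = SVC(kernel='rbf', gamma=gamma, C=10).fit(X, y_noisy)
    pred = model.predict(X)
    results.append({
        'gamma': gamma,
        'train_err': np.mean(pred != y_noisy),  # error on the labels used for fitting
        'truth_err': np.mean(pred != y_true),   # error against the noise-free labels
        'n_sv': model.support_.size,            # number of support vectors
    })

for r in results:
    print(r)
```

A training error that keeps falling while the error against the noise-free labels does not would be consistent with the overfitting seen in the plot.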

## Review of Concepts and Maths -- kNN

Other so-called *algorithmic* learners use different structural assumptions. For instance, we illustrate a **k-nearest-neighbor (kNN)** approach, where the predicted class at a point $x_0$ is chosen by majority vote among the $K$ points that are closest:
$$
y(x_0) = \arg\max_j\left\{\frac{1}{K} \sum_{i \in N_K(x_0)} 1_{\{y_i=j\}}\right\},
$$
where $N_K(x_0)$ denotes the index set of the $K$ points in the training sample that are closest to the point $x_0$ (usually in the sense of Euclidean distance).  This is very different from what we have seen before in that we don't have an underlying "probabilistic" approach.
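
The voting rule can be sketched in a few lines. The helper `knn_predict` and the six toy points below are hypothetical and only serve to mirror the formula; the result is compared with scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(X_train, y_train, x_0, k):
    """Majority vote among the k training points closest to x_0 (Euclidean)."""
    distances = np.linalg.norm(X_train - x_0, axis=1)
    nearest = np.argsort(distances)[:k]        # index set N_K(x_0)
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]          # ties go to the label seen first

# Toy training sample (hypothetical): class 0 near the origin, class 1 near (5, 5).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)

# The same query points with scikit-learn's implementation.
knn_check = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
sk_pred = knn_check.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
print(sk_pred)
```

There is no fitting step beyond storing the training sample, so all of the work happens at prediction time.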

## Credit Card Default Application

We go back to the dataset `pa_data_UCI_Credit_Card.csv` from the UCI Machine Learning Repository (Lichman, M., 2013. [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science).  This dataset provides credit card defaults for customers in Taiwan.  We are given some demographic information ($X_1$-$X_5$), the previous history of payments ($X_6$-$X_{11}$), the amounts of previous bills ($X_{12}$-$X_{17}$), and the amounts of previous payments ($X_{18}$-$X_{23}$).  Finally, variable 24 is our target: whether there was a default in the next month.


The libraries are already loaded in the setup cell at the top, so let's go straight to loading the dataset:

In [None]:
!git clone https://github.com/danielbauer1979/ML_656.git

In [20]:
mydata = pd.read_csv('ML_656/UCI_Credit_Card_prepped.csv', index_col=0)

### Predictive Modeling

As usual, let's split our dataset:

In [23]:
Train, Test = train_test_split(mydata, test_size=0.25)
Train_y = Train['default']
Train = Train.drop(columns = ['default'])
Test_y = Test['default']
Test = Test.drop(columns = ['default'])

Let's run a logistic regression model as a comparison:

In [None]:
logistic_model1 = LogisticRegression(fit_intercept=True, max_iter=500).fit(Train,Train_y)
print(logistic_model1.intercept_)
print(logistic_model1.coef_)

Let's check predictions:

In [None]:
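logistic_pred_1 = logistic_model1.predict_proba(Test)
# Number of predicted defaults at two candidate probability thresholds
print(np.sum(logistic_pred_1[:,1] > 0.5))
print(np.sum(logistic_pred_1[:,1] > 0.38))
# Label an observation as a default when its predicted probability exceeds 0.36
logistic_pred_1_lab = logistic_pred_1[:,1] > 0.36
confusion_matrix(Test_y, logistic_pred_1_lab)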

So we are missing quite a few.  Let's consider the AUC:

In [None]:
fpr, tpr, threshold = roc_curve(Test_y, logistic_pred_1[:,1])
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
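
Each point on the ROC curve corresponds to one probability threshold. The sketch below illustrates this on eight made-up observations: lowering the threshold catches more defaults (higher true positive rate) at the cost of more false alarms. The labels and probabilities in `y_demo` and `p_demo` are hypothetical and unrelated to the credit card data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Hypothetical labels (1 = default) and predicted default probabilities.
y_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
p_demo = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

rates = {}
for threshold in [0.5, 0.36]:
    labels = (p_demo > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_demo, labels).ravel()
    rates[threshold] = (tp / (tp + fn), fp / (fp + tn))  # (TPR, FPR)
    print(f'threshold={threshold}: TPR={rates[threshold][0]:.2f}, '
          f'FPR={rates[threshold][1]:.2f}')

# The AUC summarizes the trade-off over all thresholds at once.
fpr_demo, tpr_demo, _ = roc_curve(y_demo, p_demo)
roc_auc_demo = auc(fpr_demo, tpr_demo)
print(f'AUC = {roc_auc_demo:.3f}')
```

Because the AUC does not depend on any single threshold, it is a convenient way to compare classifiers before deciding where to cut.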

Let's check the kNN classifier:

In [None]:
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(Train, Train_y)
Test_y_knn = knn_model.predict(Test)
confusion_matrix(Test_y, Test_y_knn)

So it doesn't look like it works too well here.
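
One possible contributor, which is an assumption and not something verified on this dataset, is that kNN relies on Euclidean distances, so features measured on large scales can dominate the neighbor search. The sketch below shows the effect on synthetic data with one informative small-scale feature and one irrelevant large-scale feature; standardizing inside a pipeline is one common remedy. All data and names here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the class depends only on the small-scale feature.
rng = np.random.default_rng(0)
n = 600
informative = rng.normal(size=n)          # drives the class, scale about 1
noise = 1000.0 * rng.normal(size=n)       # irrelevant, scale about 1000
X_demo = np.column_stack([informative, noise])
y_demo = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# kNN on the raw features versus kNN on standardized features.
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

acc_raw = knn_raw.score(X_te, y_te)
acc_scaled = knn_scaled.score(X_te, y_te)
print(f'Accuracy without scaling: {acc_raw:.2f}')
print(f'Accuracy with scaling:    {acc_scaled:.2f}')
```

Putting the scaler inside the pipeline means its mean and variance are estimated on the training data only, which avoids leaking information from the test set.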

Let's run a naive SVM (note: this takes a while):

In [None]:
svm = SVC(C=1.0, kernel='rbf', gamma=1)
svm.fit(Train, Train_y) 

Let's evaluate:

In [None]:
pred = svm.predict(Test)
table = pd.DataFrame({'y_test':Test_y,'pred':pred})
table.groupby(['y_test','pred']).size().unstack('y_test')

So not great; we have to tune. Using cross-validation, we can try the following (this takes a looong time):

In [None]:
tuned_parameters = [{'C': [0.01, 0.1, 1, 10, 100], 'gamma': [0.5, 1, 2, 3, 4]}]
clf = GridSearchCV(SVC(kernel='rbf'), tuned_parameters, cv=10, scoring='accuracy', return_train_score=True)
clf.fit(Train, Train_y)

In [None]:
clf.best_params_

In [None]:
confusion_matrix(Test_y, clf.best_estimator_.predict(Test))
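
Since the full grid search is slow, it can help to rehearse the workflow on a small problem first. The sketch below runs the same kind of search on a synthetic two-class dataset from `make_moons` and shows how to inspect all grid points through `cv_results_`, not just the winner. The dataset, the reduced grid, and the names `small_grid` and `small_clf` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic two-class problem (hypothetical stand-in for the credit data).
X_small, y_small = make_moons(n_samples=200, noise=0.3, random_state=0)

# Reduced grid and fewer folds so the search finishes in seconds.
small_grid = {'C': [0.1, 1, 10], 'gamma': [0.5, 1, 2]}
small_clf = GridSearchCV(SVC(kernel='rbf'), small_grid, cv=5, scoring='accuracy')
small_clf.fit(X_small, y_small)

# One row per (C, gamma) combination with its cross-validated accuracy.
cv_table = pd.DataFrame(small_clf.cv_results_)[
    ['param_C', 'param_gamma', 'mean_test_score', 'std_test_score']]
print(cv_table.sort_values('mean_test_score', ascending=False).head())
print('Best parameters:', small_clf.best_params_)
```

Looking at the whole table shows whether the best combination sits on the edge of the grid, in which case the grid may need to be extended.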