<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB888_II_2_CreditCardsWithKNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Card Defaults with knn

In this tutorial, we will go back to the credit card default data, but this time we will use knn instead of logistic regression.

## KNN Background

Other so-called *algorithmic* learners use different structural assumptions. For instance, we illustrate a **k-nearest neighbor (knn)** approach, where the predicted class at a point $x_0$ is the class that occurs most often among the $K$ training points that are closest to $x_0$:
$$
y(x_0) = \arg\max_j\left\{\frac{1}{K} \sum_{i \in N_K(x_0)} 1_{\{y_i=j\}}\right\},
$$
where $N_K(x_0)$ denotes the index set of the $K$ points in the training sample that are closest to the point $x_0$ (usually in the sense of Euclidean distance). This is very different from what we have seen before in that we don't have an underlying "probabilistic" approach.
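
To make the voting rule concrete, here is a minimal sketch in plain Python. The function name `knn_predict` and the toy data are made up for illustration and are not part of the original notebook; ties are resolved in favor of the class encountered first among the neighbors, which is a choice of this sketch and not necessarily what scikit-learn does.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)


def knn_predict(train_X, train_y, x0, k):
    """Predict the class at x0 by majority vote among the k closest training points."""
    # Indices of the training points, sorted by Euclidean distance to x0
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x0))
    # N_K(x0): the index set of the k nearest training points
    neighbors = order[:k]
    # Count the class labels among the neighbors and return the most frequent one
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two small clusters with labels 0 and 1
toy_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
toy_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_X, toy_y, (0.15, 0.15), k=3))  # point inside the first cluster
print(knn_predict(toy_X, toy_y, (0.95, 0.95), k=3))  # point inside the second cluster
```

Note that nothing is estimated ahead of time: all the work happens at prediction time, when the distances to the training points are computed.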

As a reminder, the dataset provides credit card defaults for customers in Taiwan. We are given some demographic information and an indicator of whether there was a default in the next months. We prepared the data in GB 886; see [here](https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB886_IV_12_CreditCardCaseStudy.ipynb).


## Predicting Credit Card defaults

As always, let's start with importing the libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import statsmodels.api as sm

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

And let's load the prepared dataset (see the steps from before):

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
mydata = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_IV_12_UCI_Credit_Card_prepped.csv', index_col=0)

In [None]:
mydata.head()

We will also split our dataset to evaluate how knn compares to conventional algorithms:

In [None]:
Train, Test = train_test_split(mydata, test_size=0.25)
Train_y = Train['default']
Train_X = Train.drop(columns = ['default'])
Test_y = Test['default']
Test_X = Test.drop(columns = ['default'])

### Predictive Modeling: Baseline Logistic Regression

Let's again run our baseline logistic regression model:


In [None]:
logistic_mod = sm.Logit(Train_y, sm.add_constant(Train_X).astype(float))
logistic_mod = logistic_mod.fit(maxiter = 10000)
print(logistic_mod.summary())

Let's first check the predictions in the training set:

In [None]:
logistic_pred = logistic_mod.predict()
logistic_pred_lab = logistic_pred > 0.5
conf_matrix = confusion_matrix(Train_y, logistic_pred_lab)
TN, FP, FN, TP = conf_matrix.ravel()
conf_matrix

And some of the prediction rates are:

In [None]:
# Calculate the misclassification rate
MCR = 1 - (TP + TN) / (TP + TN + FP + FN)
print(f"Misclassification Rate: {MCR}")

# Calculate the False Positive Rate (FPR)
FPR = FP / (FP + TN)
print(f"False Positive Rate (FPR): {FPR}")

# Calculate the False Negative Rate (FNR)
FNR = FN / (FN + TP)
print(f"False Negative Rate (FNR): {FNR}")
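
The same three rates are computed several times in this tutorial, so it may help to see them collected in one place. The following is a minimal sketch; the helper name `classification_rates` and the small label vectors are hypothetical and not part of the original notebook. It assumes binary labels where 1 (or `True`) marks a default and that both classes occur in `y_true`.

```python
def classification_rates(y_true, y_pred):
    """Return (MCR, FPR, FNR) for binary labels, where 1/True marks a default."""
    pairs = [(int(t), int(p)) for t, p in zip(y_true, y_pred)]
    TP = sum(1 for t, p in pairs if t == 1 and p == 1)
    TN = sum(1 for t, p in pairs if t == 0 and p == 0)
    FP = sum(1 for t, p in pairs if t == 0 and p == 1)
    FN = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Share of all observations that are misclassified; equals 1 - (TP + TN) / total
    MCR = (FP + FN) / len(pairs)
    # Share of actual non-defaults that are predicted as defaults
    FPR = FP / (FP + TN)
    # Share of actual defaults that are predicted as non-defaults
    FNR = FN / (FN + TP)
    return MCR, FPR, FNR


# Made-up labels for illustration only
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 0, 0]

MCR, FPR, FNR = classification_rates(y_true, y_pred)
print(f"Misclassification Rate: {MCR}")
print(f"False Positive Rate (FPR): {FPR}")
print(f"False Negative Rate (FNR): {FNR}")
```

In a default-prediction setting, the false negative rate is often the number to watch, since it measures the share of actual defaults that the model misses.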

And let's check the predictions via the confusion matrix in the Test set:

In [None]:
logistic_pred = logistic_mod.predict(sm.add_constant(Test_X).astype(float))
logistic_pred_lab = logistic_pred > 0.5
conf_matrix = confusion_matrix(Test_y, logistic_pred_lab)
TN, FP, FN, TP = conf_matrix.ravel()
conf_matrix

And some of the prediction rates are:

In [None]:
# Calculate the misclassification rate
MCR = 1 - (TP + TN) / (TP + TN + FP + FN)
print(f"Misclassification Rate: {MCR}")

# Calculate the False Positive Rate (FPR)
FPR = FP / (FP + TN)
print(f"False Positive Rate (FPR): {FPR}")

# Calculate the False Negative Rate (FNR)
FNR = FN / (FN + TP)
print(f"False Negative Rate (FNR): {FNR}")

### Predictive Modeling: knn

Let's run the knn algorithm. We fit the model on the training set and generate predictions for both the training set and the test set:

In [None]:
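# Classify each point by majority vote among its 3 nearest training points
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(Train_X, Train_y)
# Predicted labels for the training set and for the held-out test set
Train_y_knn = knn_model.predict(Train_X)
Test_y_knn = knn_model.predict(Test_X)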

Let's look at the performance in the training set:

In [None]:
conf_matrix = confusion_matrix(Train_y, Train_y_knn)
TN, FP, FN, TP = conf_matrix.ravel()
conf_matrix

In [None]:
# Calculate the misclassification rate
MCR = 1 - (TP + TN) / (TP + TN + FP + FN)
print(f"Misclassification Rate: {MCR}")

# Calculate the False Positive Rate (FPR)
FPR = FP / (FP + TN)
print(f"False Positive Rate (FPR): {FPR}")

# Calculate the False Negative Rate (FNR)
FNR = FN / (FN + TP)
print(f"False Negative Rate (FNR): {FNR}")

So, we're beating the logistic regression results, but only in the training set.
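
Training-set results for knn tend to be optimistic because every training point is among its own nearest neighbors (at distance zero). With $k=3$ it contributes one of the three votes for itself; with $k=1$ it decides the vote alone. The following minimal sketch illustrates the extreme case on synthetic data (an assumption for illustration, not the credit card data), where the labels are drawn at random and carry no signal at all:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))      # synthetic features, not the credit card data
y = rng.integers(0, 2, size=200)   # labels drawn at random, unrelated to X

# With k=1, each training point is its own nearest neighbor
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
train_error = 1 - knn_1.score(X, y)  # score() returns the accuracy
print(f"Training misclassification rate with k=1: {train_error}")
```

Even though there is nothing to learn here, the training misclassification rate is zero, which is why only the test set gives a fair comparison.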

Let's look at the Test set:

In [None]:
conf_matrix = confusion_matrix(Test_y, Test_y_knn)
TN, FP, FN, TP = conf_matrix.ravel()
conf_matrix

In [None]:
# Calculate the misclassification rate
MCR = 1 - (TP + TN) / (TP + TN + FP + FN)
print(f"Misclassification Rate: {MCR}")

# Calculate the False Positive Rate (FPR)
FPR = FP / (FP + TN)
print(f"False Positive Rate (FPR): {FPR}")

# Calculate the False Negative Rate (FNR)
FNR = FN / (FN + TP)
print(f"False Negative Rate (FNR): {FNR}")

So, in the test set, knn performs quite a bit worse than the logistic regression. Maybe we didn't choose the right k? Let's check the misclassification rate in the test set for different choices of k:

In [None]:
k_values = [2, 5, 10, 15, 20]
misclass_rates = []

for k in k_values:
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(Train_X, Train_y)
    Test_y_knn = knn_model.predict(Test_X)
    conf_matrix = confusion_matrix(Test_y, Test_y_knn)
    TN, FP, FN, TP = conf_matrix.ravel()
    MCR = 1 - (TP + TN) / (TP + TN + FP + FN)
    misclass_rates.append(MCR)

plt.plot(k_values, misclass_rates, marker='o')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Misclassification Rate')
plt.title('Misclassification Rate vs. k for KNN')
plt.grid(True)
plt.show()

So, it appears that we don't get close to the misclassification rate of the logistic regression for any of the choices of k. Hence, knn doesn't seem like a great choice in this setting.