<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB886_VI_12_CreditCardCaseStudyLASSO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Card Defaults -- Revisited

In this tutorial, we will use logistic regression to predict credit card default scores -- but this time we will select an appropriate model rather than using the full set of features.

As a reminder, the dataset provides credit card defaults for customers in Taiwan. We are given some demographic information ($X_1$-$X_5$), the history of previous payments ($X_6$-$X_{11}$), the amounts of previous bills ($X_{12}$-$X_{17}$), and the amounts of previous payments ($X_{18}$-$X_{23}$). Finally, variable 24 is our target: whether there was a default in the following month.


As always, let's start with importing the libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, classification_report, precision_score, roc_curve, auc

And let's load and prepare the dataset (we follow the exact same steps as before):

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
mydata = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_IV_12_UCI_Credit_Card.csv', index_col=0)

In [None]:
factor = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'default.payment.next.month']

# Cap the repayment-status codes at 4; clip() avoids the chained assignment
# mydata[col][mask] = 4, which triggers SettingWithCopyWarning in pandas
pay_cols = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
mydata[pay_cols] = mydata[pay_cols].clip(upper=4)

mydata_numcols = mydata.drop(columns = factor)
mydata_faccols = mydata[factor].astype('category')
dummies = pd.get_dummies(mydata_faccols, drop_first=True)
mydata = pd.concat([mydata_numcols, dummies], axis = 1)

mydata = mydata.rename(columns={"default.payment.next.month_1": "default"})

mydata.insert(17, "BILL_AVG", (mydata['BILL_AMT1']+mydata['BILL_AMT2']+mydata['BILL_AMT3']+mydata['BILL_AMT4']+mydata['BILL_AMT5']+mydata['BILL_AMT6'])/6, True)
mydata = mydata.rename(columns={"BILL_AMT1": "BILL_REC"})
del mydata['BILL_AMT2']
del mydata['BILL_AMT3']
del mydata['BILL_AMT4']
del mydata['BILL_AMT5']
del mydata['BILL_AMT6']
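
To see what the capping and dummy-encoding steps do, here is a minimal sketch on a made-up three-row frame. The values and the `toy` names are hypothetical and not taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data: one repayment-status column and one numeric column
toy = pd.DataFrame({"PAY_0": [-1, 2, 7], "LIMIT_BAL": [20000, 50000, 90000]})

# Cap status codes above 4 at 4, the same capping as in the preparation step
toy["PAY_0"] = toy["PAY_0"].clip(upper=4)

# One-hot encode the categorical column; drop_first removes the first level,
# which then acts as the baseline category in the regression
toy_dummies = pd.get_dummies(toy[["PAY_0"]].astype("category"), drop_first=True)
toy_prepared = pd.concat([toy.drop(columns=["PAY_0"]), toy_dummies], axis=1)

print(toy_prepared.columns.tolist())
```

Dropping the first level matters here because the model includes a constant: keeping all levels would make the dummy columns sum to the constant column.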

### Predictive Modeling: Baseline Logistic Regression

Let's again run our baseline logistic regression model:


In [None]:
y = mydata['default']
X = mydata.drop(columns = ['default'])

In [None]:
logistic_mod = sm.Logit(y, sm.add_constant(X).astype(float))
logistic_mod = logistic_mod.fit(maxiter = 10000)
print(logistic_mod.summary())

And let's check the predictions via the confusion matrix:

In [None]:
p_x = logistic_mod.predict()
y_hat = (p_x > 0.5)

conf_mat = pd.crosstab(y, y_hat, rownames=['Actual Defaults'], colnames=['Predicted Defaults'])
# Add row and column sums
conf_mat.loc['Column_Total']= conf_mat.sum(numeric_only=True, axis=0)
conf_mat.loc[:,'Row_Total'] = conf_mat.sum(numeric_only=True, axis=1)
print(conf_mat)
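
The counts in a confusion matrix translate directly into the usual classification rates. A minimal sketch on hypothetical labels (not the credit card data), using `confusion_matrix` from scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = default)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # share of actual defaults that were flagged
specificity = tn / (tn + fp)  # share of non-defaults that were not flagged

print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

With imbalanced classes such as defaults, accuracy alone can look good even when sensitivity is low, which is why it helps to look at the rates separately.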


And the **ROC curve**:

In [None]:
fpr, tpr, threshold = roc_curve(y, p_x)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

### Predictive Modeling: Forward Selection

Now, instead of using all features, let's run a forward selection procedure using AIC:


In [None]:
remaining_variables = X.columns.tolist()
selected_variables = []
best_aic = float('inf')
best_model = None

while remaining_variables:
    candidate_models = []

    # Fit one candidate model per variable that has not been selected yet
    for var in remaining_variables:
        candidate_vars = selected_variables + [var]
        X_candidate = sm.add_constant(X[candidate_vars]).astype(float)
        model = sm.Logit(y, X_candidate).fit(disp=0, maxiter=10000)
        candidate_models.append((model.aic, var, model))

    current_aic, best_var, candidate = min(candidate_models, key=lambda c: c[0])

    if current_aic < best_aic:
        # Keep the candidate only if it lowers the AIC
        best_aic = current_aic
        best_model = candidate
        selected_variables.append(best_var)
        remaining_variables.remove(best_var)
    else:
        break  # Stop once no remaining variable lowers the AIC

print("Selected Variables:", selected_variables)
print(best_model.summary())


So it turns out that we dropped quite a few variables: the original model had 54 features, whereas the selected model has 35!

Let's see how the predictions compare:

In [None]:
# Predict probabilities using the best model
p_x_best = best_model.predict()

# Classify predictions based on a 0.5 cutoff
y_hat_best = (p_x_best > 0.5)

# Create confusion matrix
conf_mat_best = pd.crosstab(y, y_hat_best, rownames=['Actual Defaults'], colnames=['Predicted Defaults'])

# Add row and column sums
conf_mat_best.loc['Column_Total'] = conf_mat_best.sum(numeric_only=True, axis=0)
conf_mat_best.loc[:, 'Row_Total'] = conf_mat_best.sum(numeric_only=True, axis=1)

print(conf_mat_best)

# Generate ROC curve
fpr_best, tpr_best, threshold_best = roc_curve(y, p_x_best)
roc_auc_best = auc(fpr_best, tpr_best)

plt.title('Receiver Operating Characteristic (Best Model)')
plt.plot(fpr_best, tpr_best, 'b', label='AUC = %0.2f' % roc_auc_best)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


So the reduced model does not perform any better, but we should not expect it to! The confusion matrix and the AUC above were computed on the training sample, so they are arguably overly optimistic. Recall that the in-sample likelihood never gets worse as we add features, so judged in sample we would always choose the most complex model.

In fact, it is surprising how well the reduced model performs given that it has substantially fewer features: the AUC seems unchanged, and the confusion matrix has only slightly more false negatives and false positives.

Out of sample, we should expect the reduced model to perform better!