<a href="https://colab.research.google.com/github/apsc/responsible_ai/blob/main/IMT_589K_Problem_Set_1_%E2%80%93_Confusion_Matrix_and_Model_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## IMT 589K: Problem Set 1 – Confusion Matrix and Model Evaluation

In this problem set, you'll learn about confusion matrices, a fundamental tool for evaluating classification models. We'll guide you through building several models, reading their confusion matrices, and interpreting key performance metrics: accuracy, precision, recall, and F1-score.

### Part I: Conceptual Knowledge [10 points]

1. What is a confusion matrix? [2pt]

A confusion matrix is a performance evaluation tool for classification models that displays the counts of correct and incorrect predictions. It's essentially a table that breaks down predictions into four categories, showing how "confused" a model gets when making predictions. This gives us insights beyond simple accuracy, revealing specific types of errors the model makes.

2. Define and explain the four components of a confusion matrix: [2pt]
*  True Positive (TP)
*  True Negative (TN)
*  False Positive (FP)
*  False Negative (FN)

1. True Positive (TP): Cases where the model correctly predicts the positive class. Example: predicting a loan applicant will default, and they actually do.

2. True Negative (TN): Cases where the model correctly predicts the negative class. Example: predicting a loan applicant won't default, and they don't.

3. False Positive (FP): Cases where the model incorrectly predicts the positive class. Example: predicting a loan applicant will default, but they don't (Type I error).

4. False Negative (FN): Cases where the model incorrectly predicts the negative class. Example: predicting a loan applicant won't default, but they do (Type II error).

3. Choose any two metrics (e.g., accuracy, precision, recall, F1-score) and write their equations. [2pt]

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Proportion of all predictions that are correct


Precision = TP / (TP + FP)

Proportion of positive predictions that are correct


Recall = TP / (TP + FN)

Proportion of actual positives correctly identified


F1-score = 2 × (Precision × Recall) / (Precision + Recall)

Harmonic mean balancing precision and recall
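
As a quick check on the four formulas above, here is a minimal sketch that evaluates them from raw counts. The function name `metrics_from_counts` and the counts in the usage example are made up for illustration and do not come from the credit dataset. Returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when the model never predicts the
    # positive class (precision) or there are no actual positives (recall).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the arithmetic
example = metrics_from_counts(tp=40, tn=45, fp=5, fn=10)
print(example)
```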

4. Confusion Matrix in Credit Prediction [2pt]
*   Define and distinguish between False Positives and False Negatives in the context of credit prediction.
*   Discuss the real-world consequences of each error type.
*   Which error type is more critical in credit prediction scenarios? Based on this analysis, identify which metric (precision or recall) should be prioritized.



1. False Positives: Applicants predicted to default who actually would repay. This denies credit to qualified individuals, resulting in lost revenue, customer dissatisfaction, and potentially discriminatory practices.

2. False Negatives: Applicants predicted to repay who actually default. This results in direct financial losses, increased collection costs, and potential bad debt write-offs.

3. More critical error: False negatives generally cost lenders more directly because they represent actual financial losses. While false positives represent opportunity costs, defaulted loans mean real money lost. Therefore, recall should typically be prioritized, though the exact balance depends on the lender's risk tolerance, market conditions, and profit margins.

5. Consider the following confusion matrix: [2pt]


The dataset has 10,000 rows in total, split by the protected attribute Veteran status: Veteran = 1 has N1 = 1000 rows, and Veteran = 0 has N2 = 9000 rows.

The confusion matrix for Veteran = 1 is as follows (Y = 1 means default; Y = 0 means not defaulted in the outcome variable):

*   True Positives =  10
*   True Negatives = 900
*   False Positives = 80
*   False Negatives = 10

The confusion matrix for Veteran = 0 is as follows:

*   True Positives = 350
*   True Negatives = 8000
*   False Positives = 150
*   False Negatives = 350

Does this satisfy Demographic Parity? Explain and include your calculations.

Do you see anything else that may be strange? Look at the overall sample sizes for the two categories of the protected attributes. Is it feasible to have the same number of applicants based on protected attribute category?
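
One way to set up the demographic parity calculation is to compare the share of each group that the model predicts as positive, (TP + FP) / N. The sketch below uses the counts given above. The helper name `predicted_positive_rate` is an assumption for illustration, and the group size is taken as the sum of the four cells rather than the stated N1 and N2.

```python
def predicted_positive_rate(tp, tn, fp, fn):
    """Share of a group predicted positive (here, predicted to default)."""
    n = tp + tn + fp + fn  # group size implied by the four cells
    return (tp + fp) / n, n


# Counts copied from the two confusion matrices in the question
rate_vet, n_vet = predicted_positive_rate(tp=10, tn=900, fp=80, fn=10)
rate_non, n_non = predicted_positive_rate(tp=350, tn=8000, fp=150, fn=350)

print(f"Veteran = 1: {rate_vet:.4f} of {n_vet} predicted to default")
print(f"Veteran = 0: {rate_non:.4f} of {n_non} predicted to default")
print(f"Difference in rates: {rate_vet - rate_non:.4f}")
```

Demographic parity holds exactly only if the two rates are equal; in practice a tolerance is chosen, and that choice is a judgment call. The implied group sizes printed here are also worth comparing with the stated N1 and N2.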

Now, let's apply what we learned to real-world credit prediction data! You can download the dataset from [here](https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients) and import it into Google Colab by following these [instructions](https://docs.google.com/document/d/1XPWMRAwuSPnuu9Y9NHSk2DJOhGxmxXU6xwZP8DWrBXg/edit?usp=sharing).

**You should duplicate this Google Colab file in your UW Google Drive and submit an individual copy to your Canvas site by the deadline.**

We will build three machine learning models: Logistic Regression, Random Forest, and Neural Network. For this problem set, don't worry if you're not yet familiar with the mathematical details behind these models. We will focus only on their performance metrics to determine the best-performing model.

We will first import the libraries needed to build the models:

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

from sklearn.ensemble import RandomForestClassifier

import tensorflow as tf
from tensorflow import keras

Now, import the dataset linked in Module 3 (update this folder path as needed):

In [None]:
df = pd.read_csv("/content/drive/MyDrive/IMT 589/default of credit card clients.csv", header = 1)

In [None]:
df

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,220000,1,3,1,39,0,0,0,0,...,88004,31237,15980,8500,20000,5003,3047,5000,1000,0
29996,29997,150000,1,3,2,43,-1,-1,-1,-1,...,8979,5190,0,1837,3526,8998,129,0,0,0
29997,29998,30000,1,2,2,37,4,3,2,-1,...,20878,20582,19357,0,0,22000,4200,2000,3100,1
29998,29999,80000,1,3,1,41,1,-1,0,0,...,52774,11855,48944,85900,3409,1178,1926,52964,1804,1


We typically split the dataset into two parts: training data and test data. The models are fit on the training data, and the confusion matrices are computed on the held-out test data.

In [None]:
# Train-test Split
X = df.drop(columns=['default payment next month'])
y = df['default payment next month']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize the training and testing data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### Model 1: Logistic Regression

In [None]:

# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Compute the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
conf_matrix


array([[4550,  137],
       [1004,  309]])
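
To read this output, recall that scikit-learn's `confusion_matrix` puts actual classes on the rows and predicted classes on the columns, with labels in sorted order, so for 0/1 labels the layout is `[[TN, FP], [FN, TP]]`. Here is a minimal sketch that names the four cells, with the array above re-entered by hand so the snippet runs on its own:

```python
import numpy as np

# Logistic regression confusion matrix, copied from the output above
conf_matrix = np.array([[4550, 137],
                        [1004, 309]])

# ravel() flattens row by row, giving TN, FP, FN, TP for binary 0/1 labels
tn, fp, fn, tp = conf_matrix.ravel()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Actual defaulters in the test set: {fn + tp}")
print(f"Share of defaulters the model caught (recall): {tp / (tp + fn):.3f}")
```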

#### Model 2: Random Forest

In [None]:
# X_train and X_test were already standardized in the train-test split cell,
# so they are not re-scaled here.

# Initialize and train a RandomForestClassifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Compute and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix


array([[4411,  276],
       [ 845,  468]])

#### Model 3: Neural Network

In [None]:
# Build the neural network model
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),  # One input per feature column
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')  # Output layer: P(default)
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Make predictions
y_pred_prob = model.predict(X_test)
y_pred = (y_pred_prob > 0.5).astype(int).ravel()  # Convert probabilities to a 1-D array of 0/1 labels

# Compute and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix




Epoch 1/10
675/675 ━━━━━━━━━━━━━━━━━━━━ 5s 5ms/step - accuracy: 0.8031 - loss: 0.4877 - val_accuracy: 0.8008 - val_loss: 0.4677
Epoch 2/10
675/675 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.8172 - loss: 0.4429 - val_accuracy: 0.8054 - val_loss: 0.4607
Epoch 3/10
675/675 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.8219 - loss: 0.4340 - val_accuracy: 0.8079 - val_loss: 0.4546
Epoch 4/10
675/675 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8224 - loss: 0.4272 - val_accuracy: 0.8146 - val_loss: 0.4549
Epoch 5/10
675/675 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8219 - loss: 0.4270 - val_accuracy: 0.8100 - val_loss: 0.4524
Epoch 6/10
675/675 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8223 - loss: 0.4251 - val_accuracy: 0.8096 - val_loss: 0.4544
Epoch 7/10
675/675

array([[4455,  232],
       [ 863,  450]])
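
Since the goal is to pick the best-performing model, it helps to put the three confusion matrices side by side. The sketch below re-enters the matrices printed above by hand and computes accuracy, precision, and recall for each. The dictionary name `matrices` and the helper `summarize` are illustrative only.

```python
# Confusion matrices copied from the outputs above, laid out as
# [[TN, FP], [FN, TP]] (scikit-learn's convention for 0/1 labels)
matrices = {
    "Logistic Regression": [[4550, 137], [1004, 309]],
    "Random Forest": [[4411, 276], [845, 468]],
    "Neural Network": [[4455, 232], [863, 450]],
}


def summarize(matrix):
    """Compute accuracy, precision, and recall from a 2x2 matrix."""
    (tn, fp), (fn, tp) = matrix
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


summary = {name: summarize(m) for name, m in matrices.items()}

print(f"{'Model':<22}{'Accuracy':>10}{'Precision':>11}{'Recall':>9}")
for name, s in summary.items():
    print(f"{name:<22}{s['accuracy']:>10.3f}"
          f"{s['precision']:>11.3f}{s['recall']:>9.3f}")
```

On these recorded runs the accuracies fall within about one percentage point of each other, while recall ranges from roughly 0.24 to 0.36, so the choice of metric matters when ranking the models.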

### Part II: Fairness Analysis [15 points]

II.1. Split the dataset based on marital status. Make sure to split both the training and test sets. [3pt]
Include your code here and run it in a cell.

II.2. Run Model 1 Logistic Regression and determine the confusion matrices for the marital-status subsamples you created in part II.1. [4pt]
* Does the model satisfy demographic parity based on marital status?

* Does the model satisfy accuracy parity based on marital status? Include the accuracy values by marital status.

* Does the model satisfy equality of opportunity based on marital status? Include the formulas to calculate TPR and the values per matrix.
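
The three criteria in this question can each be computed per group from the same ingredients: true labels, predictions, and group membership. Below is a minimal sketch on a tiny made-up example. The function name `group_fairness_report` and the toy arrays are assumptions; in the assignment you would pass your own test labels, predictions, and `MARRIAGE` values instead.

```python
import numpy as np


def group_fairness_report(y_true, y_pred, group):
    """Per-group positive prediction rate, accuracy, and TPR."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    report = {}
    for g in np.unique(group):
        mask = group == g
        yt, yp = y_true[mask], y_pred[mask]
        actual_pos = yt == 1
        report[int(g)] = {
            # Demographic parity compares P(pred = 1) across groups
            "positive_rate": float(np.mean(yp == 1)),
            # Accuracy parity compares the share of correct predictions
            "accuracy": float(np.mean(yp == yt)),
            # Equality of opportunity compares TPR = TP / (TP + FN)
            "tpr": (float(np.mean(yp[actual_pos] == 1))
                    if actual_pos.any() else float("nan")),
        }
    return report


# Toy data (hypothetical): two groups of four applicants each
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [1, 1, 1, 1, 2, 2, 2, 2]

report = group_fairness_report(y_true, y_pred, group)
for g, stats in report.items():
    print(g, stats)
```

In this toy example the two groups have the same accuracy but different positive prediction rates and TPRs, which shows that the three criteria can disagree.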


II.3. Run Model 2 Random Forest on the subsamples determined in part II.1. [4pt]  

* Does the model satisfy demographic parity based on marital status?

* Does the model satisfy accuracy parity based on marital status? Include the accuracy values by marital status.

* Does the model satisfy equality of opportunity based on marital status? Include the formulas to calculate TPR and the values per matrix.

II.4. Run Model 3 Neural Network on the subsamples determined in part II.1. [4pt]
* Does the model satisfy demographic parity based on marital status?

* Does the model satisfy accuracy parity based on marital status? Include the accuracy values by marital status.

* Does the model satisfy equality of opportunity based on marital status? Include the formulas to calculate TPR and the values per matrix.

**[Optional - bonus points stretch question] Part III. Naive Bayes** [up to 5 additional points]

Sometimes models are more computationally expensive, and in certain applications this may be important (see the analysis on computational complexity in the IEEE paper).

Re-run the analysis above using Naive-Bayes as a model. Look up:

```
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
# Use the same X_train / X_test arrays as the models above
gnb_pred = gnb.fit(X_train, y_train).predict(X_test)
```

Compare the accuracy of this model with a simple model such as logistic regression, and with a longer-runtime model such as random forest. What can you infer? [2pt]

If you have time, re-run the demographic parity, accuracy parity, and equality of opportunity fairness criteria for the protected attribute marital status. What do you discover when comparing these results with those for logistic regression or random forest? [3pt]

**General Rules of Conduct Reminder**

You may collaborate in this class, and you are encouraged to do so. You may also use Generative AI tools (including Gemini in the Google suite, Copilot, or others), provided you cite that use here.

If you collaborate on this assignment, please list the names of all collaborators here.

**[Optional - leave blank if not applicable]** Collaborators on this assignment are:

**[Optional - leave blank if not applicable]** Use of Gen AI: [Yes/No] [If Yes, which AI was used and what prompts did you give it?]

In [None]:
# Marital status (1 = married; 2 = single; 3 = others).
print(df['MARRIAGE'].value_counts())

MARRIAGE
2    15964
1    13659
3      323
0       54
Name: count, dtype: int64


In [None]:
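# X_test is a NumPy array after scaling, so it has no column names.
# Look up each test row's original MARRIAGE value via the index kept in y_test.
marriage_test = df.loc[y_test.index, 'MARRIAGE'].to_numpy()
test_married = X_test[marriage_test == 1]
test_single = X_test[marriage_test == 2]
test_others = X_test[marriage_test == 3]  # 3 = others in this dataset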