# Assignment 4

### <span style="color:chocolate"> Submission requirements </span>

Your work will not be graded if your notebook doesn't include output. In other words, <span style="color:red"> make sure to rerun your notebook before submitting to Gradescope </span> (Note: if you are using Google Colab: go to Edit > Notebook Settings  and uncheck Omit code cell output when saving this notebook, otherwise the output is not printed).

Additional points may be deducted if these requirements are not met:

    
* Comment your code;
* Each graph should have a title, labels for each axis, and (if needed) a legend. Each graph should be understandable on its own;
* Try and minimize the use of the global namespace (meaning, keep things inside functions).
---

### Import libraries

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns  # for nicer plots
sns.set(style="darkgrid")  # default style

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

import tensorflow as tf
from tensorflow import keras
from keras import metrics
from keras.datasets import fashion_mnist

tf.get_logger().setLevel('INFO')

---
### Step 1: Data ingestion

You'll train a binary classifier using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset. This consists of 70,000 grayscale images (28x28). Each image is associated with 1 of 10 classes. The dataset was split by the creators; there are 60,000 training images and 10,000 test images. Note also that TensorFlow includes a growing [library of datasets](https://www.tensorflow.org/datasets/catalog/overview) and makes it easy to load them into NumPy arrays.

In [None]:
# Load the Fashion MNIST dataset.
(X_train, Y_train), (X_test, Y_test) = fashion_mnist.load_data()

---
### Step 2: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) and Data Preprocessing are often iterative processes that involve going back and forth to refine and improve the quality of data analysis and preparation. However, the specific order can vary depending on the project's requirements. In some cases, starting with EDA, as you see in this assignment, could be more useful, but there is no rigid rule dictating the sequence in all situations.

### <span style="color:chocolate">Exercise 1:</span> Getting to know your data (5 points)

Complete the following tasks:

1. Print the shapes and types of (X_train, Y_train) and (X_test, Y_test). Interpret the shapes (i.e., what do the numbers represent?). Hint: For types use the <span style="color:chocolate">type()</span> function.
2. Define a list of strings of class names corresponding to each class in (Y_train, Y_test). Call this list label_names. Hint: Refer to the Fashion MNIST documentation.

In [None]:
# YOUR CODE HERE

# 1:
print("X_train shape:", X_train.shape, ", type:", type(X_train))
print("Interpretation: X_train contains 60,000 training images of size 28x28 pixels.")
print("Y_train shape:", Y_train.shape, ", type:", type(Y_train))
print("Interpretation: Y_train contains 60,000 labels, one for each training image.")
print("X_test shape:", X_test.shape, ", type:", type(X_test))
print("Interpretation: X_test contains 10,000 test images of size 28x28 pixels.")
print("Y_test shape:", Y_test.shape, ", type:", type(Y_test))
print("Interpretation: Y_test contains 10,000 labels, one for each test image.")

# 2:
label_names = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"
]
print("Class names defined as:", label_names)


### <span style="color:chocolate">Exercise 2:</span> Getting to know your data - cont'd (5 points)

Fashion MNIST images have one of 10 possible labels (shown above).

Complete the following tasks:

1. Display the first 5 images in X_train for each class in Y_train, arranged in a 10x5 grid. Use the label_names list defined above;
2. Determine the minimum and maximum pixel values for images in the X_train dataset.

In [None]:
# YOUR CODE HERE

# 1:
plt.figure(figsize=(15, 15))
for class_index in range(10):
    # Extract indices of the first 5 images for the current class
    class_indices = np.where(Y_train == class_index)[0][:5]
    for i, image_index in enumerate(class_indices):
        plt.subplot(10, 5, class_index * 5 + i + 1)
        plt.imshow(X_train[image_index], cmap="gray")
        plt.axis('off')
        plt.title(label_names[class_index])
plt.tight_layout()
plt.show()

# 2:
min_pixel_value = np.min(X_train)
max_pixel_value = np.max(X_train)

print("Minimum pixel value in X_train:", min_pixel_value)
print("Maximum pixel value in X_train:", max_pixel_value)


---
### Step 3: Data preprocessing

This step is essential for preparing this image data in a format that is suitable for ML algorithms.

### <span style="color:chocolate">Exercise 3:</span> Feature preprocessing (5 points)

In the previous lab, the input data had just a few features. Here, we treat **every pixel value as a separate feature**, so each input example has 28x28 (784) features!

In this exercise, you'll perform the following tasks:

1. Normalize the pixel values in both X_train and X_test data so they range between 0 and 1;
2. For each image in X_train and X_test, flatten the 2-D 28x28 pixel array to a 1-D array of size 784. Hint: use the <span style="color:chocolate">reshape()</span> method available in NumPy. Note that by doing so you will overwrite the original arrays;
3. Print the shape of the X_train and X_test arrays.

In [None]:
# YOUR CODE HERE
# 1:
X_train = X_train / 255.0
X_test = X_test / 255.0

# 2:
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)

# 3:
print("X_train shape after preprocessing:", X_train.shape)
print("X_test shape after preprocessing:", X_test.shape)


### <span style="color:chocolate">Exercise 4:</span> Label preprocessing (5 points)

This assignment involves binary classification. Specifically, the objective is to predict whether an image belongs to the sneaker class (class 7) or not.

Therefore, write code so that for each example in (Y_train, Y_test), the outcome variable is represented as follows:
* $y=1$, for sneaker class (positive examples), and
* $y=0$, for non-sneaker class (negative examples).

Note: To avoid "ValueError: assignment destination is read-only", first create a copy of the (Y_train, Y_test) data and call the resulting arrays (Y_train, Y_test). Then overwrite the (Y_train, Y_test) arrays to create binary outcomes.
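
The note above can be illustrated on a toy array. This is only a minimal sketch: the array `labels` is hypothetical, and the explicit `setflags(write=False)` call is an assumption used to simulate a read-only buffer (whether the arrays returned by `load_data()` are read-only depends on the environment).

```python
import numpy as np

# Hypothetical label array, marked read-only to simulate the error condition.
labels = np.array([7, 2, 7, 0, 9])
labels.setflags(write=False)

try:
    labels[labels != 7] = 0  # in-place write on a read-only array
except ValueError as err:
    print("In-place write failed:", err)

# np.copy returns a new, writable array, so in-place edits succeed.
labels_copy = np.copy(labels)
is_sneaker = labels_copy == 7   # boolean mask computed before overwriting
labels_copy[~is_sneaker] = 0
labels_copy[is_sneaker] = 1
print(labels_copy)  # [1 0 1 0 0]
```

Computing the mask before any overwrite matters: writing `1` for sneakers first and `0` for everything else afterwards would work too, but writing `0` first and then testing for `7` on the modified array is the safer order to reason about.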

In [None]:
# Make copies of the original dataset for binary classification task.
Y_train = np.copy(Y_train)
Y_test = np.copy(Y_test)

# YOUR CODE HERE
# This will make it so it's 1 for sneaker class, 0 otherwise
Y_train = (Y_train == 7).astype(int)
Y_test = (Y_test == 7).astype(int)

### <span style="color:chocolate">Exercise 5:</span> Data splits (10 points)

Using the <span style="color:chocolate">train_test_split()</span> method available in scikit-learn:
1. Retain 20% of the training data for validation purposes. Set random state to 1234. Leave all other arguments of the method at their default values. Name the resulting arrays as follows: X_train_mini, X_val, Y_train_mini, Y_val.
2. Print the shape of each array.

In [None]:
# YOUR CODE HERE

# 1:
X_train_mini, X_val, Y_train_mini, Y_val = train_test_split(
    X_train, Y_train, test_size=0.2, random_state=1234
)

# 2:
print("X_train_mini shape:", X_train_mini.shape)
print("X_val shape:", X_val.shape)
print("Y_train_mini shape:", Y_train_mini.shape)
print("Y_val shape:", Y_val.shape)


### <span style="color:chocolate">Exercise 6:</span> Data shuffling (10 points)

Since you'll be training with mini-batch gradient descent (the model is updated on one batch of examples at a time), it is important that **each batch is a random sample of the data** so that the gradient computed is representative.

1. Use [integer array indexing](https://numpy.org/doc/stable/reference/arrays.indexing.html#integer-array-indexing) to re-order (X_train_mini, Y_train_mini) using a list of shuffled indices. In doing so, you will overwrite the arrays.
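
The key property of this technique is that indexing both arrays with the same index array keeps every example paired with its label. A minimal sketch on toy data (the names `X_toy` and `Y_toy` are hypothetical, not part of the assignment):

```python
import numpy as np

# Toy data: 5 examples with 2 features each; label i belongs to row i.
X_toy = np.arange(10).reshape(5, 2)   # rows: [0,1], [2,3], [4,5], [6,7], [8,9]
Y_toy = np.array([0, 1, 2, 3, 4])

np.random.seed(0)
indices = np.random.permutation(len(X_toy))  # a shuffled ordering of 0..4

# Index both arrays with the same index array so rows and labels move together.
X_shuffled = X_toy[indices]
Y_shuffled = Y_toy[indices]

# Each row's first value is 2 * (original row number), so this checks the pairing.
print(np.array_equal(X_shuffled[:, 0] // 2, Y_shuffled))  # True
```

Shuffling the two arrays with separate calls to a shuffle function would break this pairing, which is why a single shared index array is used.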

In [None]:
np.random.seed(0)
# YOUR CODE HERE
shuffled_indices = np.random.permutation(len(X_train_mini))
X_train_mini = X_train_mini[shuffled_indices]
Y_train_mini = Y_train_mini[shuffled_indices]

---
### Step 4: Exploratory Data Analysis (EDA) - cont'd

Before delving into model training, let's further explore the raw feature values by comparing sneaker and non-sneaker training images.

### <span style="color:chocolate">Exercise 7:</span> Pixel distributions (10 points)

1. Identify all sneaker images in X_train_mini and calculate the mean pixel value for each sneaker image. Visualize these pixel values using a histogram. Print the mean pixel value across all sneaker images.
2. Identify all non-sneaker images in X_train_mini and calculate the mean pixel value for each non-sneaker image. Visualize these pixel values using a histogram. Print the mean pixel value across all non-sneaker images.
3. Based on the histogram results, assess whether there is any evidence suggesting that pixel values can be utilized to distinguish between sneaker and non-sneaker images. Justify your response.

Notes: Make sure to provide a descriptive title and axis labels for each histogram. Make sure you use Y_train_mini to locate the sneaker and non-sneaker classes.

In [None]:
# YOUR CODE HERE

# Identifying sneaker images and getting means
sneaker_indices = np.where(Y_train_mini == 1)[0]
sneaker_means = X_train_mini[sneaker_indices].mean(axis=1)

# Identifying non-sneaker images and getting means
non_sneaker_indices = np.where(Y_train_mini == 0)[0]
non_sneaker_means = X_train_mini[non_sneaker_indices].mean(axis=1)

# 1:
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plt.hist(sneaker_means, bins=30, alpha=0.7, color='blue', label="Sneaker Images")
plt.title("Histogram of Mean Pixel Values for Sneaker Images")
plt.xlabel("Mean Pixel Value")
plt.ylabel("Frequency")
plt.legend()

# 2:
plt.subplot(1, 2, 2)
plt.hist(non_sneaker_means, bins=30, alpha=0.7, color='green', label="Non-Sneaker Images")
plt.title("Histogram of Mean Pixel Values for Non-Sneaker Images")
plt.xlabel("Mean Pixel Value")
plt.ylabel("Frequency")
plt.legend()

plt.tight_layout()
plt.show()

sneaker_mean_pixel_value = sneaker_means.mean()
print("Mean pixel value across all sneaker images:", sneaker_mean_pixel_value)
non_sneaker_mean_pixel_value = non_sneaker_means.mean()
print("Mean pixel value across all non-sneaker images:", non_sneaker_mean_pixel_value)

# 3:
print("Based on the printed Histograms above, there is a clear difference in the mean pixel values for sneaker and non-sneaker images.\nThe distribution of mean pixel values for non-sneaker images is much wider than is the distribution for sneaker images.\nThis suggests that pixel values might be useful for distinguishing between the two classes.")


---
### Step 4: Modeling

### <span style="color:chocolate">Exercise 8:</span> Baseline model (10 points)

When dealing with classification problems, a simple baseline is to select the *majority* class (the most common label in the training set) and use it as the prediction for all inputs.

With this information in mind:

1. What is the number of sneaker images in Y_train_mini?
2. What is the number of non-sneaker images in Y_train_mini?
3. What is the majority class in Y_train_mini?
4. What is the accuracy of a majority class classifier for Y_train_mini?
5. Implement a function that computes the Log Loss (binary cross-entropy) metric and use it to evaluate this baseline on both the mini train (Y_train_mini) and validation (Y_val) data. Use 0.1 as the predicted probability for your baseline (reflecting what we know about the original distribution of classes in the mini training data). Hint: for additional help, see the file ``04 Logistic Regression with Tensorflow_helpers.ipynb``.
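
As a sanity check for task 5, the log loss of a constant prediction has a closed form: if a fraction $\pi$ of the labels is positive and every prediction equals $p$, the loss is $-[\pi \ln p + (1-\pi)\ln(1-p)]$. The sketch below verifies this on a hypothetical toy label vector with exactly 10% positives; the function name `binary_log_loss` and the toy data are assumptions for illustration only.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical labels: exactly 10% positives (1 sneaker per 10 images).
y_toy = np.array([1] + [0] * 9)
p_const = np.full(y_toy.shape, 0.1)

loss = binary_log_loss(y_toy, p_const)
closed_form = -(0.1 * np.log(0.1) + 0.9 * np.log(0.9))
print(f"{loss:.4f} {closed_form:.4f}")  # 0.3251 0.3251
```

On the real data the positive fraction is only approximately 10%, so the baseline loss should land near this value rather than exactly on it.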

In [None]:
# YOUR CODE HERE
# 1:
num_sneaker_images = np.sum(Y_train_mini == 1)
print("Number of sneaker images in Y_train_mini:", num_sneaker_images)

# 2:
num_non_sneaker_images = np.sum(Y_train_mini == 0)
print("Number of non-sneaker images in Y_train_mini:", num_non_sneaker_images)

# 3:
majority_class = np.bincount(Y_train_mini).argmax()
print("Majority class in Y_train_mini:", "Sneaker" if majority_class == 1 else "Non-Sneaker")

# 4:
majority_class_predictions = np.full(Y_train_mini.shape, majority_class)
accuracy_mc = np.mean(majority_class_predictions == Y_train_mini)
print(f"Accuracy of the majority class classifier: {accuracy_mc:.4f}")

# 5:
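def compute_log_loss(y_true, y_pred):
    """Compute the mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    # Clip probabilities away from 0 and 1 so np.log never receives 0.
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss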

# Use 0.1 as the predicted probability for the baseline model
baseline_predictions_train = np.full(Y_train_mini.shape, 0.1)
baseline_predictions_val = np.full(Y_val.shape, 0.1)

# 5: Calculate Log Loss on the mini training and validation datasets
log_loss_train = compute_log_loss(Y_train_mini, baseline_predictions_train)
log_loss_val = compute_log_loss(Y_val, baseline_predictions_val)

print(f"Log Loss on mini training data: {log_loss_train:.4f}")
print(f"Log Loss on validation data: {log_loss_val:.4f}")


### <span style="color:chocolate">Exercise 9:</span> Improvement over Baseline with TensorFlow (10 points)

Let's use TensorFlow to train a binary logistic regression model much like you did in the previous assignment. The goal here is to build a ML model to improve over the baseline classifier.

1. Fill in the <span style="color:green">NotImplemented</span> parts of the build_model() function below by following the instructions provided as comments. Hint: the activation function, the loss, and the evaluation metric are different compared to the linear regression model;
2. Build and compile a model using the build_model() function and the (X_train_mini, Y_train_mini) data. Set learning_rate = 0.0001. Call the resulting object *model_tf*.
3. Train *model_tf* using the (X_train_mini, Y_train_mini) data. Set num_epochs = 5 and batch_size=32. Pass the (X_val, Y_val) data for validation. Hint: see the documentation behind the [tf.keras.Model.fit()](https://bcourses.berkeley.edu/courses/1534588/files/88733489?module_item_id=17073646) method.
4. Generate a plot (for the mini training and validation data) with the loss values on the y-axis and the epoch number on the x-axis. Make sure to include axis labels and a title. Hint: check what the [tf.keras.Model.fit()](https://bcourses.berkeley.edu/courses/1534588/files/88733489?module_item_id=17073646) method returns.

In [None]:
def build_model(num_features, learning_rate):
  """Build a TF logistic regression model using Keras.

  Args:
    num_features: The number of input features.
    learning_rate: The desired learning rate for SGD.

  Returns:
    model: A tf.keras model (graph).
  """
  # This is not strictly necessary, but each time you build a model, TF adds
  # new nodes (rather than overwriting), so the colab session can end up
  # storing lots of copies of the graph when you only care about the most
  # recent. Also, as there is some randomness built into training with SGD,
  # setting a random seed ensures that results are the same on each identical
  # training run.
  tf.keras.backend.clear_session()
  tf.random.set_seed(0)

  # Build a model using keras.Sequential. While this is intended for neural
  # networks (which may have multiple layers), we want just a single layer for
  # binary logistic regression.
  model = tf.keras.Sequential()
  model.add(tf.keras.layers.Dense(
      units=1,        # output dim
      input_shape=(num_features,),  # input dim
      use_bias=True,               # use a bias (intercept) param
      activation='sigmoid',
      kernel_initializer=tf.keras.initializers.Ones(),  # initialize params to 1
      bias_initializer=tf.keras.initializers.Ones(),    # initialize bias to 1
  ))

  # We need to choose an optimizer. We'll use SGD, which is actually mini-batch GD
  optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)

  # Finally, compile the model. Select the accuracy metric. This finalizes the graph for training.
  model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

  return model

In [None]:
tf.random.set_seed(0)
# 2. Build and compile model
# YOUR CODE HERE
num_features = X_train_mini.shape[1]
learning_rate = 0.0001
model_tf = build_model(num_features, learning_rate)

# 3. Fit the model
# YOUR CODE HERE
history = model_tf.fit(
    X_train_mini, Y_train_mini,  # Training data and labels
    epochs=5,                    # Number of epochs
    batch_size=32,               # Batch size
    validation_data=(X_val, Y_val)  # Validation data and labels
)

# 4. Graph
train_loss = history.history['loss']
val_loss = history.history['val_loss']

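# Plot the loss against the epoch number (1-based), derived from the history length
epochs_range = range(1, len(train_loss) + 1)
plt.figure(figsize=(10, 6))
plt.plot(epochs_range, train_loss, label='Training Loss', color='blue')
plt.plot(epochs_range, val_loss, label='Validation Loss', color='red')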
plt.title('Training and Validation Loss Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()


---
### Step 5: Hyperparameter tuning

Hyperparameter tuning is a crucial step in optimizing ML models. It involves systematically adjusting hyperparameters such as learning rate, number of epochs, and optimizer to find the model configuration that leads to the best generalization performance.

This tuning process is typically conducted by monitoring the model's performance on the validation vs. training set. It's important to note that using the test set for hyperparameter tuning can compromise the integrity of the evaluation process by violating the assumption of "blindness" of the test data.

### <span style="color:chocolate">Exercise 10:</span> Hyperparameter tuning (10 points)

1. Fine-tune the **learning rate** and **number of epochs** hyperparameters of *model_tf* to determine the setup that yields the best generalization performance. Feel free to explore various values for these hyperparameters. Hint: you can manually test different hyperparameter values or you can use the [Keras Tuner](https://www.tensorflow.org/tutorials/keras/keras_tuner). If you decide to work with the Keras Tuner, define a new model building function named <span style="color:chocolate">build_model_tuner()</span>.

After identifying your preferred model configuration, print the following information:

2. The first five learned parameters of the model (this should include the bias term);
3. The loss at the final epoch on both the mini training and validation datasets;
4. The percentage difference between the losses observed on the mini training and validation datasets.
5. Compare the training/validation loss of the TensorFlow model (model_tf) with the baseline model's loss. Does the TensorFlow model demonstrate an improvement over the baseline model?


Please note that we will consider 'optimal model configuration' any last-epoch training and validation loss that is below 0.08.
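
The manual route mentioned in the hint amounts to a small grid search: train once per (learning rate, number of epochs) pair and keep the pair with the lowest validation loss. The sketch below shows that loop with a NumPy-only logistic regression on synthetic data, so it runs without TensorFlow. Everything in it (the helper names, the synthetic data, the candidate values) is an assumption for illustration, not the assignment's required setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def train_logreg(X, y, learning_rate, num_epochs, batch_size=32):
    """Mini-batch gradient descent for logistic regression; returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(num_epochs):
        for start in range(0, len(X), batch_size):
            xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            error = sigmoid(xb @ w + b) - yb   # gradient of the log loss w.r.t. the logit
            w -= learning_rate * xb.T @ error / len(xb)
            b -= learning_rate * error.mean()
    return w, b

# Synthetic, hypothetical data: the label depends on the first feature only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(float)
X_tr, y_tr, X_va, y_va = X[:800], y[:800], X[800:], y[800:]

# Try every (learning rate, epochs) pair and record its validation loss.
results = {}
for lr in [0.001, 0.01, 0.1]:
    for epochs in [5, 20]:
        w, b = train_logreg(X_tr, y_tr, lr, epochs)
        results[(lr, epochs)] = log_loss(y_va, sigmoid(X_va @ w + b))

best_config = min(results, key=results.get)
print("Best (learning_rate, num_epochs):", best_config)
print(f"Validation loss: {results[best_config]:.4f}")
```

With Keras, the same loop would call `build_model()` and `fit()` inside the two `for` loops and read the last entry of the validation loss from the returned history.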

In [None]:
pip install -q -U keras-tuner

In [None]:
# YOUR CODE HERE
from keras_tuner import HyperModel, Hyperband
# Set the random seed for reproducibility
tf.random.set_seed(0)

# Define the model for hyperparameter tuning
def build_model_tuner(hp):
    """Build a TF logistic regression model using Keras for hyperparameter tuning.

    Args:
        hp: Keras Tuner Hyperparameter object for tuning learning_rate and epochs.

    Returns:
        model: A tf.keras model (graph).
    """
    # Clear any previous sessions to avoid memory issues
    tf.keras.backend.clear_session()
    tf.random.set_seed(0)

    model = tf.keras.Sequential()

    # Add a dense layer for binary classification
    model.add(tf.keras.layers.Dense(
        units=1,  # Single output unit for binary classification
        input_shape=(X_train_mini.shape[1],),  # Input shape corresponds to the number of features (784)
        use_bias=True,
        activation='sigmoid',  # Sigmoid activation for binary classification
        kernel_initializer=tf.keras.initializers.Ones(),  # Initialize weights to 1
        bias_initializer=tf.keras.initializers.Ones()  # Initialize bias to 1
    ))

    # Hyperparameters for tuning
    learning_rate = hp.Choice('learning_rate', values=[1e-1, 1e-2, 5e-2, 1e-3, 5e-3, 1e-4, 1e-5])
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)

    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Set up the hyperparameter tuning using Keras Tuner (Hyperband)
tuner = Hyperband(
    build_model_tuner,
    objective='val_loss',  # Optimize for validation loss
    max_epochs=10,         # Maximum number of epochs to search through
    factor=3,              # The factor to increase the number of epochs
    directory='hyperparameter_tuning',  # Directory to store the results
    project_name='logistic_regression_tuning'
)

# Set up early stopping to avoid overfitting
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

# Perform hyperparameter search
tuner.search(X_train_mini, Y_train_mini, epochs=50, validation_data=(X_val, Y_val), callbacks=[stop_early], verbose=0)

# Retrieve the best model and hyperparameters
best_model = tuner.get_best_models(num_models=1)[0]
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]

# Print the best hyperparameters found by the tuner
print("Best hyperparameters:", best_hyperparameters.values)

# Determine the best number of epochs by training the model with the best hyperparameters
best_model = tuner.hypermodel.build(best_hyperparameters)
history = best_model.fit(X_train_mini, Y_train_mini, epochs=50, validation_data=(X_val, Y_val), verbose=1)

# Find the best epoch based on validation loss
val_loss_per_epoch = history.history['val_loss']
best_epoch = val_loss_per_epoch.index(min(val_loss_per_epoch)) + 1
print(f'Best epoch: {best_epoch}')

# Retraining the model with the optimal epoch
best_model = tuner.hypermodel.build(best_hyperparameters)
history = best_model.fit(X_train_mini, Y_train_mini, epochs=best_epoch, validation_data=(X_val, Y_val), verbose=1)

# Retrieve the learned parameters (weights and bias) of the best model
weights, bias = best_model.layers[0].get_weights()
print("Learned Parameters (Weights):", weights)
print("Learned Bias Term:", bias)

# Printing the final training and validation loss at the best epoch
final_train_loss = best_model.history.history['loss'][-1]
final_val_loss = best_model.history.history['val_loss'][-1]
print(f"Final Training Loss: {final_train_loss}")
print(f"Final Validation Loss: {final_val_loss}")

# Printing the difference between the last-epoch training and validation losses
loss_difference = final_train_loss - final_val_loss
print(f"Loss Difference (Train - Validation): {loss_difference}")

# Plotting the training and validation loss curves
plt.plot(best_model.history.history['loss'], label='Training Loss')
plt.plot(best_model.history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()


In [None]:
# 2.
weights, bias = best_model.layers[0].get_weights()
print("First five learned parameters (weights and bias):")
print(f"Weights: \n{weights[:5]}")
print(f"Bias: \n{bias[0]}")

# 3.
final_train_loss = best_model.history.history['loss'][-1]
final_val_loss = best_model.history.history['val_loss'][-1]
print(f"Final Training Loss: {final_train_loss}")
print(f"Final Validation Loss: {final_val_loss}")

# 4.
loss_difference_percentage = ((final_train_loss - final_val_loss) / final_train_loss) * 100
print(f"Percentage Difference between Training and Validation Loss: {loss_difference_percentage:.2f}%")

# 5.
print(f"Baseline Model Log Loss (Train): {log_loss_train:.4f}")
print(f"Baseline Model Log Loss (Validation): {log_loss_val:.4f}")

train_loss_diff_percentage = ((final_train_loss - log_loss_train) / log_loss_train) * 100
val_loss_diff_percentage = ((final_val_loss - log_loss_val) / log_loss_val) * 100

print(f"Percentage Difference between Training Loss (final vs baseline): {train_loss_diff_percentage:.2f}%")
print(f"Percentage Difference between Validation Loss (final vs baseline): {val_loss_diff_percentage:.2f}%")

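# The model improves on the baseline only if its losses are below the baseline log losses.
if final_train_loss < log_loss_train and final_val_loss < log_loss_val:
    print("The TensorFlow model demonstrates an improvement over the baseline model.")
else:
    print("The TensorFlow model does NOT demonstrate an improvement over the baseline model.")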


---
### Step 6: Evaluation and Generalization


Now that you've determined the optimal set of hyperparameters, it's time to evaluate your optimized model on the test data to gauge its performance in real-world scenarios, commonly known as inference.

### <span style="color:chocolate">Exercise 11:</span> Computing accuracy (10 points)

1. Calculate aggregate accuracy on both mini train and test datasets using a probability threshold of 0.5. Hint: You can utilize the <span style="color:chocolate">model.evaluate()</span> method provided by tf.keras. Note: Aggregate accuracy measures the overall correctness of the model across all classes in the dataset;

2. Does the model demonstrate strong aggregate generalization capabilities? Provide an explanation based on your accuracy observations.
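
One pitfall when computing accuracy by hand: a model with a single output unit typically returns predictions of shape `(n, 1)` from `predict()`, while the labels have shape `(n,)`. Comparing the two directly broadcasts to an `(n, n)` matrix and yields a misleading number. A minimal sketch with hypothetical probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities with shape (4, 1) and labels with shape (4,).
probs = np.array([[0.9], [0.2], [0.6], [0.1]])
y_true = np.array([1, 0, 0, 0])

preds_2d = (probs >= 0.5).astype(int)   # shape (4, 1)
preds_1d = preds_2d.flatten()           # shape (4,)

# Comparing (4, 1) with (4,) broadcasts to a (4, 4) matrix of all pairings.
print((preds_2d == y_true).shape)       # (4, 4)
print(np.mean(preds_2d == y_true))      # 0.5
# Comparing (4,) with (4,) is element-wise, which is the accuracy we want.
print(np.mean(preds_1d == y_true))      # 0.75
```

Checking the result against `model.evaluate()` is a quick way to catch this kind of shape mismatch.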

In [None]:
# YOUR CODE HERE
# Evaluate the model on the mini training dataset
train_loss, train_accuracy = best_model.evaluate(X_train_mini, Y_train_mini, verbose=0)
print(f"Training Accuracy: {train_accuracy * 100:.2f}%")

# Evaluate the model on the test dataset
test_loss, test_accuracy = best_model.evaluate(X_test, Y_test, verbose=0)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

# Aggregate accuracy on both mini training and test datasets using a threshold of 0.5.
# predict() returns shape (n, 1); flatten to (n,) so the comparison with the labels
# is element-wise rather than broadcast to an (n, n) matrix.
train_preds = best_model.predict(X_train_mini)
test_preds = best_model.predict(X_test)
train_preds_binary = (train_preds >= 0.5).astype(int).flatten()
test_preds_binary = (test_preds >= 0.5).astype(int).flatten()
train_accuracy_manual = np.mean(train_preds_binary == Y_train_mini)
test_accuracy_manual = np.mean(test_preds_binary == Y_test)

print(f"Aggregate Training Accuracy with threshold 0.5: {train_accuracy_manual * 100:.2f}%")
print(f"Aggregate Test Accuracy with threshold 0.5: {test_accuracy_manual * 100:.2f}%")

# 2. Interpretation, written against the computed values rather than hard-coded numbers.
print(f"The aggregate training accuracy of the model with a threshold of 0.5 is {train_accuracy_manual * 100:.2f}%.\n"
      f"The aggregate test accuracy is {test_accuracy_manual * 100:.2f}%.\n"
      "A test accuracy close to the training accuracy suggests that the model generalizes well in aggregate "
      "and is not overfitting.\n"
      "Because about 90% of the images are non-sneakers, aggregate accuracy should also be compared with "
      "the majority-class baseline from Exercise 8.")

### <span style="color:chocolate">Exercise 12:</span> Fairness evaluation (10 points)

1. Generate and visualize the confusion matrix on the test dataset using a probability threshold of 0.5. Additionally, print the True Positives (TP), False Negatives (FN), False Positives (FP), and True Negatives (TN). Hint: you can utilize the <span style="color:chocolate">model.predict()</span> method available in tf.keras, and then the <span style="color:chocolate">confusion_matrix()</span>, <span style="color:chocolate">ConfusionMatrixDisplay()</span> methods available in sklearn.metrics;

2. Compute subgroup accuracy, separately for the sneaker and non-sneaker classes, on the test dataset using a probability threshold of 0.5. Reflect on any observed accuracy differences (potential lack of fairness) between the two classes.

3. Does the model demonstrate strong subgroup generalization capabilities? Provide an explanation based on your accuracy observations.
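
The four confusion-matrix counts and the two subgroup accuracies are closely related: accuracy on the sneaker class is TP / (TP + FN), and accuracy on the non-sneaker class is TN / (TN + FP). A minimal NumPy-only sketch on hypothetical labels and predictions (the assignment itself suggests `confusion_matrix()` from scikit-learn):

```python
import numpy as np

# Hypothetical labels and thresholded predictions (1 = sneaker, 0 = non-sneaker).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))

# Subgroup accuracy restricts the comparison to one true class at a time.
sneaker_acc = tp / (tp + fn)        # also called recall or sensitivity
non_sneaker_acc = tn / (tn + fp)    # also called specificity
overall_acc = (tp + tn) / len(y_true)

print(tp, fn, fp, tn)  # 2 1 1 6
print(f"{sneaker_acc:.2f} {non_sneaker_acc:.2f} {overall_acc:.2f}")  # 0.67 0.86 0.80
```

In this toy case the overall accuracy of 0.80 hides a much weaker result on the minority class, which is the kind of gap the fairness question asks about.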

In [None]:
# YOUR CODE HERE

# 1.
cm = confusion_matrix(Y_test, test_preds_binary)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Non-Sneaker', 'Sneaker'])
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix on Test Dataset")
plt.show()

# Print TP, FN, FP, TN
TP = cm[1, 1]
FN = cm[1, 0]
FP = cm[0, 1]
TN = cm[0, 0]

print(f"True Positives (TP): {TP}")
print(f"False Negatives (FN): {FN}")
print(f"False Positives (FP): {FP}")
print(f"True Negatives (TN): {TN}")

# 2.
sneaker_accuracy = np.mean(test_preds_binary[Y_test == 1] == 1)
non_sneaker_accuracy = np.mean(test_preds_binary[Y_test == 0] == 0)

print(f"Sneaker class accuracy: {sneaker_accuracy * 100:.2f}%")
print(f"Non-sneaker class accuracy: {non_sneaker_accuracy * 100:.2f}%")

print("There is a noticeable difference in accuracy between the sneaker (91.80%) and non-sneaker (98.88%) classes.\n"
      "The difference suggests that the model performs much better on the non-sneaker class.\n"
      "This ~7% difference may indicate that the model has a bias toward predicting non-sneaker images correctly, while it tends to make more mistakes on sneaker images.\n"
      "That is, there is a potential lack of fairness. This is most likely due to the class imbalance or the model's tendency to favor the majority class.\n")

# 3.
print("While the model achieves high overall accuracy on both classes, the stronger performance on the non-sneaker class and the slight underperformance on the sneaker class\n"
      "suggest that the model's subgroup generalization capabilities may need improvement.\n"
      "The model could be further tuned or balanced (via techniques like class weighting or oversampling on the sneaker class) to ensure more equitable performance across both classes.\n"
      "Nonetheless, for each subclass, the model demonstrates strong generalization capabilities, as indicated by the confusion matrix and accuracy values.")


----
#### <span style="color:chocolate">Additional practice question</span> (not graded)

Is it possible to enhance the prediction accuracy for the sneaker class by performing the following steps?

1. Implement data balancing techniques, such as oversampling or undersampling, to equalize the representation of both classes.
2. After balancing the data, retrain the model on the balanced dataset.
3. Evaluate the model's performance, particularly focusing on the accuracy achieved for the sneaker class.

1. Yes, it is possible to enhance prediction accuracy for the sneaker class by implementing data balancing techniques like oversampling or undersampling. Oversampling increases the number of sneaker class samples, while undersampling reduces the number of non-sneaker class samples to ensure both classes are equally represented. This can help the model focus more on learning the sneaker class, improving accuracy for that class.
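
To make the oversampling idea concrete, here is a minimal sketch of plain random oversampling in NumPy, which duplicates existing minority-class rows. This is a simpler technique than SMOTE (which synthesizes new points by interpolation); the function name `random_oversample` and the toy data are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if len(pos_idx) < len(neg_idx):
        minority, majority = pos_idx, neg_idx
    else:
        minority, majority = neg_idx, pos_idx
    # Sample minority indices with replacement to fill the gap between class counts.
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = rng.permutation(np.concatenate([majority, minority, extra]))
    return X[keep], y[keep]

# Hypothetical data: 2 positives, 8 negatives.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))  # [8 8]
```

Balancing should be applied to the training split only; the validation and test sets keep their original class proportions so that the evaluation reflects the real distribution.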

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score

smote = SMOTE(random_state=1234)
X_train_balanced, Y_train_balanced = smote.fit_resample(X_train_mini, Y_train_mini)
model_tf = build_model(X_train_balanced.shape[1], learning_rate=0.0001)
history_balanced = model_tf.fit(X_train_balanced, Y_train_balanced, epochs=5, batch_size=32, validation_data=(X_val, Y_val))
Y_test_pred = (model_tf.predict(X_test) > 0.5).astype("int32")
test_accuracy = accuracy_score(Y_test, Y_test_pred)
print(f"Test Accuracy with balanced data: {test_accuracy * 100:.2f}%")

cm = confusion_matrix(Y_test, Y_test_pred)
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm)
cm_display.plot(cmap="Blues")
plt.title("Confusion Matrix for Test Dataset")
plt.show()

sneaker_accuracy = (Y_test_pred[Y_test == 1] == 1).mean()
non_sneaker_accuracy = (Y_test_pred[Y_test == 0] == 0).mean()

print(f"Sneaker class accuracy with balanced data: {sneaker_accuracy * 100:.2f}%")
print(f"Non-sneaker class accuracy with balanced data: {non_sneaker_accuracy * 100:.2f}%")

3. I did something wrong above.