# Question 2:
Consider the L2-regularized multiclass logistic regression. That is, add to the logistic regression loss a regularization term that represents the L2-norm of the parameters. More precisely, the regularization term is 

$ (w, b) = \lambda \sigma_i (||w^i||^2 + ||b^i||^2) $

where ${w^i, b^i}$ are all the parameters in the logistic regression, and $\lambda \in R$ is the regularization hyper-parameter. Typically, $\lambda$ is about C/n where n is the number of data points and C is some constant in `[0.01,100]` (need to tune C). Run the regularized multiclass logistic regression on MNIST, using the basic minibatch SGD, and compare its results to those of the basic minibatch SGD with non-regularized loss, in Question #1.

**Import packages and load MNIST dataset**

In [4]:
import tensorflow as tf
import numpy as np
from sklearn.metrics import classification_report

# Load MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()


**Normalize the data and define a function to return a feedforward neural network model**

We also applied momentum optimizer and batch size = 20.

In [5]:
x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)

# Clear any previous models from memory
tf.keras.backend.clear_session()

**Run the regularized multiclass logistic regression using the basic minibatch SGD**

Tuning C: We use a validation split during training to evaluate performance on a portion of training data (20%). The best parameter C will be the one with the best validation performance. 

In [6]:
n = x_train.shape[0]

# Define the range of C values to try
C_values = [0.01, 0.1, 1, 10, 100]

best_validation_acc = 0
best_C = None

for C in C_values:
    lambda_reg = C / n
    regularizer = tf.keras.regularizers.L2(lambda_reg)
    sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=regularizer),
        tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=regularizer),
        tf.keras.layers.Dense(32, activation='relu', kernel_regularizer=regularizer),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    history = model.fit(x_train, y_train, epochs=10, batch_size=20, verbose=0, validation_split=0.2)
    val_acc = history.history['val_accuracy'][-1]  # take the last epoch's validation accuracy
    # Track the best performing C
    if val_acc > best_validation_acc:
        best_validation_acc = val_acc
        best_C = C


print(f"Best performing C based on validation accuracy: {best_C} with validation accuracy: {best_validation_acc}")

Best performing C based on validation accuracy: 0.01 with validation accuracy: 0.9608333110809326


With the selected the best C, we retrained the model using all of the training data (without validation split). Finally, evaluate the model's performance on the independent test set.

**Compare results to those of the basic minibatch SGD with non-regularized loss, in Question #1.**

We observed that with L2 regularization, the accuracy (97%) on the testset is higher than the result of basic minibatch SGD with non-regularized loss, in Question 1 (96%)

In [7]:

# Now, retrain the model with the best C on the entire training set
n = x_train.shape[0]
lambda_reg = best_C / n
regularizer = tf.keras.regularizers.L2(lambda_reg)

best_model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=regularizer),
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=regularizer),
    tf.keras.layers.Dense(32, activation='relu', kernel_regularizer=regularizer),
    tf.keras.layers.Dense(10, activation='softmax')
])

sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
best_model.compile(optimizer=sgd, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the best model on the entire training set
best_model.fit(x_train, y_train, epochs=10, batch_size=20, verbose=1)

# Evaluate the best model on the test set
test_loss, test_acc = best_model.evaluate(x_test, y_test)

print(f"Best model - Test accuracy: {test_acc}, Test loss: {test_loss}")

Epoch 1/10
[1m3000/3000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 498us/step - accuracy: 0.8138 - loss: 0.6057
Epoch 2/10
[1m3000/3000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 502us/step - accuracy: 0.9582 - loss: 0.1336
Epoch 3/10
[1m3000/3000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 538us/step - accuracy: 0.9725 - loss: 0.0889
Epoch 4/10
[1m3000/3000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 520us/step - accuracy: 0.9787 - loss: 0.0695
Epoch 5/10
[1m3000/3000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 499us/step - accuracy: 0.9831 - loss: 0.0512
Epoch 6/10
[1m3000/3000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 534us/step - accuracy: 0.9863 - loss: 0.0429
Epoch 7/10
[1m3000/3000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 495us/step - accuracy: 0.9895 - loss: 0.0325
Epoch 8/10
[1m3000/3000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 515us/step - accuracy: 0.9901 - loss: 0.0270
Epoch 9/

In [8]:
y_pred = np.argmax(best_model.predict(x_test), axis=1)
print(classification_report(y_test, y_pred))

[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 375us/step
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       980
           1       0.98      0.99      0.99      1135
           2       0.97      0.98      0.97      1032
           3       0.95      0.97      0.96      1010
           4       0.96      0.99      0.98       982
           5       0.99      0.94      0.96       892
           6       0.98      0.98      0.98       958
           7       0.97      0.97      0.97      1028
           8       0.96      0.97      0.97       974
           9       0.98      0.94      0.96      1009

    accuracy                           0.97     10000
   macro avg       0.97      0.97      0.97     10000
weighted avg       0.97      0.97      0.97     10000

