# Homework 03: Learning Curves and Training Workflow

## Due: Midnight on September 21 (with 2-hour grace period)

**Points:** 85

In this assignment, you will learn how to design, train, and evaluate neural networks by systematically exploring key design choices. Your focus will be on developing an effective **training workflow** — using learning curves and validation metrics to guide your decisions.

We'll use the **Forest Cover Type (Covertype) dataset,** which has ~581k tabular records with 54 cartographic/topographic features (elevation, aspect, slope, soil and wilderness indicators) used to predict one of seven tree cover types in Colorado’s Roosevelt National Forest. It’s a large, mildly imbalanced multi-class benchmark commonly used to compare classical ML and deep learning on tabular data.

We will start with a **baseline model** (two hidden layers of sizes 64 and 32), and gradually introduce and tune different hyperparameters. Each of the first five problems considers different hyperparameter choices, and the last problem is your chance to use what you have learned to design your best model:

1. **Activation function** – Compare ReLU, sigmoid, and tanh to see which provides the best accuracy.
2. **Learning rate** – Explore a range of learning rates and identify which balances convergence speed and stability.
3. **Dropout** – Investigate how different dropout rates reduce overfitting and where they are most effective.
4. **L2 regularization** – Experiment with weight penalties to encourage simpler models and avoid memorization.
5. **Dropout + L2** – Combine both regularization techniques and study their interaction.
6. **Best model design** – Use all your insights to build and train your strongest model, with the option to try **learning rate scheduling** for further improvement.

Throughout, you will use **early stopping** to select the model at the epoch of **minimum validation loss**, and you will report the **validation accuracy** of that selected model as the primary measure of performance.
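
The selection rule above can be sketched in plain Python. The metric values below are made up for illustration, and `select_by_min_val_loss` is a hypothetical helper, not part of the provided notebook code; it only assumes a dictionary shaped like the `history.history` object that Keras returns.

```python
# Hypothetical per-epoch metrics, shaped like the `history.history` dict Keras returns.
history = {
    "val_loss":     [1.01, 0.84, 0.76, 0.71, 0.73, 0.75],
    "val_accuracy": [0.60, 0.65, 0.68, 0.70, 0.72, 0.71],
}

def select_by_min_val_loss(history):
    """Return (1-based epoch, val_loss, val_accuracy) at the epoch of minimum validation loss."""
    val_losses = history["val_loss"]
    best_idx = val_losses.index(min(val_losses))   # 0-based index of the best epoch
    return best_idx + 1, val_losses[best_idx], history["val_accuracy"][best_idx]

epoch, loss, acc = select_by_min_val_loss(history)
print(f"selected epoch {epoch}: val_loss={loss:.2f}, val_accuracy={acc:.2f}")
```

In this toy history the highest validation accuracy (0.72) occurs one epoch after the loss minimum, but the value to report is the accuracy at the loss minimum (0.70).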

By the end of this homework, you will not only understand how different hyperparameters affect training and generalization, but also gain hands-on practice in building a disciplined workflow for model development.

There are 10 graded problems, worth 8 points each, with 5 points for free if you complete the homework. 

In [10]:
# Useful imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import time
import os

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.utils import class_weight

import tensorflow as tf
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Input,Dropout
from tensorflow.keras.optimizers import Adam

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers.schedules import ExponentialDecay


from tensorflow.keras.datasets import fashion_mnist

# utility code

random_seed = 42

def format_hms(seconds):
    return time.strftime("%H:%M:%S", time.gmtime(seconds))

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # Suppresses INFO and WARNING messages; most reliable when set before TensorFlow is imported


In [11]:
# Utility function to plot learning curves and keep track of all results

# Call `print_results()` to see listing of all results logged so far


def plot_learning_curves(hist, title, verbose=True):
    
    val_losses = hist.history['val_loss']
    min_val_loss = min(val_losses)
    min_val_epoch = val_losses.index(min_val_loss)
    val_acc_at_min_loss = hist.history['val_accuracy'][min_val_epoch]

    epochs = range(1, len(val_losses) + 1)  # epoch numbers starting at 1

    fig, axs = plt.subplots(2, 1, figsize=(8, 8), sharex=True)

    # --- Loss Plot ---
    axs[0].plot(epochs, hist.history['loss'], label='train loss')
    axs[0].plot(epochs, hist.history['val_loss'], label='val loss')
    axs[0].scatter(min_val_epoch + 1, min_val_loss, color='red', marker='x', s=50, label='min val loss')
    axs[0].set_title(f'{title} - Categorical Cross-Entropy Loss')
    axs[0].set_ylabel('Loss')
    axs[0].legend()
    axs[0].grid(True)

    # --- Accuracy Plot ---
    axs[1].plot(epochs, hist.history['accuracy'], label='train acc')
    axs[1].plot(epochs, hist.history['val_accuracy'], label='val acc')
    axs[1].scatter(min_val_epoch + 1, val_acc_at_min_loss, color='red', marker='x', s=50, label='acc @ min val loss')
    axs[1].set_title(f'{title} - Accuracy')
    axs[1].set_xlabel('Epoch')
    axs[1].set_ylabel('Accuracy')
    axs[1].legend()
    axs[1].grid(True)
    axs[1].set_ylim(0, 1.05)

    plt.tight_layout()
    plt.show()

    if verbose:
        print(f"Final Training Loss:            {hist.history['loss'][-1]:.4f}")
        print(f"Final Training Accuracy:        {hist.history['accuracy'][-1]:.4f}")
        print(f"Final Validation Loss:          {hist.history['val_loss'][-1]:.4f}")
        print(f"Final Validation Accuracy:      {hist.history['val_accuracy'][-1]:.4f}")
        print(f"Minimum Validation Loss:        {min_val_loss:.4f} (Epoch {min_val_epoch + 1})")
        print(f"Validation Accuracy @ Min Loss: {val_acc_at_min_loss:.4f}")

    results[title] = (val_acc_at_min_loss, min_val_epoch + 1)   # store the epoch as an int (1-based)

results = {}

**The plotting function will record the validation accuracy for each experiment, using the plot title as key. The next function will print these out (see the last cell in the notebook).**


In order to see all results, you must give a different plot title to each experiment.

In [12]:
def print_results():
    for title, (acc, ep) in sorted(results.items(), 
                                   key=lambda kv: kv[1][0],   # kv[1] is (acc, epoch); [0] is acc
                                   reverse=True
                                  ):
        print(f"{title:<40}\t{acc:.4f}")
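
Why unique titles matter can be seen in a small stand-alone sketch. The dictionary `demo_results` and the helper `log_result` are hypothetical stand-ins for the notebook's `results` bookkeeping, and the accuracies are made up.

```python
# Stand-in for the notebook's `results` dict: title -> (val_accuracy, epoch). Values are made up.
demo_results = {}

def log_result(title, val_acc, epoch):
    """Hypothetical helper mirroring how the plotting function stores a result."""
    demo_results[title] = (val_acc, epoch)

log_result("Baseline relu", 0.75, 30)
log_result("Baseline tanh", 0.73, 41)
log_result("Baseline relu", 0.70, 12)   # same title: silently replaces the first entry

# Same ordering rule as print_results(): highest validation accuracy first.
ranked = sorted(demo_results.items(), key=lambda kv: kv[1][0], reverse=True)
for title, (acc, ep) in ranked:
    print(f"{title:<40}\t{acc:.4f}")
```

Only two rows are printed, because the second "Baseline relu" run overwrote the first one.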

### Wrapper to train, display results, and run test set

We assume multi-class classification, and allow setting various parameters for training. 

In [13]:
# Uses globals X_train,y_train,X_val,y_val,X_test,y_test

def train_and_test(model, 
                   epochs        = 500,                   # Just needs to be bigger than early stop point
                   lr_schedule   = 0.001,                 # Adam's default (0.001) seems to work well for this dataset; may also be a schedule object
                   optimizer     = "Adam",
                   title         = "Learning Curves",
                   batch_size    = 64,                     # experiments confirmed this was optimal with other parameters at default
                   use_early_stopping = True,
                   patience      = 10,                                       
                   min_delta     = 0.0001,                 
                   callbacks     = [],                     # for extra callbacks other than early stopping
                   verbose       = 0,
                   return_history = False
                  ):

    print(f"\n{title}\n")


    if optimizer == "Adam":
        opt = Adam(learning_rate=lr_schedule) 
    else:
        opt = optimizer
    
    #Compiling the model
    model.compile(optimizer=opt, 
                  loss="sparse_categorical_crossentropy", 
                  metrics=["accuracy"]
                 )

    early_stop = EarlyStopping(
        monitor='val_loss',
        patience=patience,
        min_delta=min_delta,
        restore_best_weights=True,               # this will mean that the model which produced the smallest validation loss will be returned
        verbose=verbose
    )
    

    if use_early_stopping:
        cbs=[early_stop] + callbacks
    else:
        cbs=callbacks

    # start timer
    start = time.time()
    
    # Fit the model with early stopping
    history = model.fit(X_train, y_train,
                        epochs=epochs,
                        batch_size=batch_size,
                        validation_data=(X_val, y_val),       # must use stratified validation set
                        callbacks=cbs,
                        verbose=verbose
                       )

    if use_early_stopping:
        best_epoch = early_stop.best_epoch
        best_acc   = history.history['val_accuracy'][best_epoch]
    else:
        best_epoch = np.argmax(history.history['val_accuracy'])
        best_acc   = history.history['val_accuracy'][best_epoch]
    
    # Plot training history
    plot_learning_curves(history, title=title)

    # Evaluate on test data
    test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
    
    print(f"\nTest Loss: {test_loss:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")

    print(f"\nValidation-Test Gap (accuracy): {abs(best_acc - test_accuracy):.6f}")
    
    # Record end time and print execution time
    end = time.time()
    print(f"\nExecution Time: " + format_hms(end-start))

    if return_history:
        return history
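
To build intuition for the `patience` and `min_delta` arguments, here is a simplified pure-Python model of patience-based early stopping. It is a sketch under stated assumptions, not the Keras implementation: the real `EarlyStopping` callback has more options and its exact bookkeeping may differ. The function name and the loss values are hypothetical.

```python
def simulate_early_stopping(val_losses, patience=10, min_delta=0.0001):
    """Simplified model of patience-based early stopping on validation loss.

    Returns (best_epoch, stopped_epoch), both 0-based. An epoch counts as an
    improvement only if it beats the best loss so far by more than min_delta.
    """
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1   # ran out of epochs before patience expired

losses = [0.90, 0.80, 0.75, 0.76, 0.77, 0.78, 0.79]
print(simulate_early_stopping(losses, patience=3))      # stops three epochs after the best one
print(simulate_early_stopping([0.50, 0.49995], patience=5))
```

The second call illustrates a subtle point: an improvement smaller than `min_delta` is not counted, so the epoch tracked as "best" by this rule can differ from the raw argmin of the validation loss curve.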

### Load the dataset and extract a stratified subset

This dataset is rather large (581,012 samples) and unbalanced, but for the purposes of this homework, we use a much smaller subset and select samples so that it is balanced. 

In [14]:
# complete cell: load, balance, split into X_train/y_train/X_val/y_val/X_test/y_test, and standardize
from collections import Counter
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# 1) load
x, y = fetch_covtype(return_X_y=True)  # y in {1..7}
print("full dataset shape:", x.shape)

# 2) build a perfectly balanced subset across 7 classes (no replacement)
classes, counts = np.unique(y, return_counts=True)
# min_count = counts.min()  # size of the rarest class
# You can raise min_count to enlarge the subset, but not above counts.min():
# sampling without replacement would then fail for the rarest class.
min_count = 1000


rng = np.random.default_rng(42)

idx_list = []
for c in classes:
    c_idx = np.where(y == c)[0]
    chosen = rng.choice(c_idx, size=min_count, replace=False)
    idx_list.append(chosen)

idx_bal = np.concatenate(idx_list)
rng.shuffle(idx_bal)

X_sub = x[idx_bal]
y_sub = y[idx_bal] - 1  # relabel to {0..6} for keras
print("balanced subset shape:", X_sub.shape, "class counts:", dict(Counter(y_sub)))

# 3) stratified 60/20/20 split (train/val/test)
test_size = 0.20
val_size = 0.20  # of the whole dataset

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X_sub, y_sub, test_size=test_size, random_state=random_seed, stratify=y_sub
)
val_size_rel = val_size / (1.0 - test_size)  # e.g., 0.20 / 0.80 = 0.25

X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=val_size_rel, random_state=random_seed, stratify=y_trainval
)

# 4) standardize using train-only stats (float32 for tensorflow friendliness)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype(np.float32)
X_val   = scaler.transform(X_val).astype(np.float32)
X_test  = scaler.transform(X_test).astype(np.float32)

# 5) quick sanity checks
def show_counts(name, y_arr):
    c = Counter(y_arr)
    total = sum(c.values())
    print(f"{name}: total={total}, per-class={dict(c)}")

print("shapes:", "X_train", X_train.shape, "X_val", X_val.shape, "X_test", X_test.shape)
show_counts("train", y_train)
show_counts("val  ", y_val)
show_counts("test ", y_test)

# you now have: X_train, y_train, X_val, y_val, X_test, y_test

# Looks like integer encoded multi-class, let's check and define the global n_classes

labels = np.unique(y_train)

n_classes = len(labels)

print("class labels:",labels)


full dataset shape: (581012, 54)
balanced subset shape: (7000, 54) class counts: {np.int32(2): 1000, np.int32(0): 1000, np.int32(4): 1000, np.int32(5): 1000, np.int32(1): 1000, np.int32(3): 1000, np.int32(6): 1000}
shapes: X_train (4200, 54) X_val (1400, 54) X_test (1400, 54)
train: total=4200, per-class={np.int32(2): 600, np.int32(1): 600, np.int32(0): 600, np.int32(3): 600, np.int32(5): 600, np.int32(4): 600, np.int32(6): 600}
val  : total=1400, per-class={np.int32(1): 200, np.int32(6): 200, np.int32(5): 200, np.int32(4): 200, np.int32(3): 200, np.int32(2): 200, np.int32(0): 200}
test : total=1400, per-class={np.int32(0): 200, np.int32(6): 200, np.int32(2): 200, np.int32(3): 200, np.int32(4): 200, np.int32(5): 200, np.int32(1): 200}
class labels: [0 1 2 3 4 5 6]
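
The split sizes printed above follow from simple arithmetic, sketched here without scikit-learn. The rounding below is an approximation of how `train_test_split` sizes its partitions, which is enough to reproduce the counts in this case.

```python
n_total = 7 * 1000            # balanced subset: 7 classes x 1000 samples
test_size = 0.20
val_size = 0.20               # fraction of the whole subset

n_test = round(n_total * test_size)
n_trainval = n_total - n_test
val_size_rel = val_size / (1.0 - test_size)   # fraction of the remaining train+val pool
n_val = round(n_trainval * val_size_rel)
n_train = n_trainval - n_val

print(f"val_size_rel = {val_size_rel:.2f}")
print(f"train={n_train}, val={n_val}, test={n_test}")
print(f"per class: train={n_train // 7}, val={n_val // 7}, test={n_test // 7}")
```

Because the second split is taken from the 80% that remains after removing the test set, a 20% validation share of the whole subset corresponds to 25% of that remainder.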


## Prelude: Defining a model builder

In order to facilitate our experimentation, we'll write a function which builds models according to specifications:

- How many layers
- How wide each layer is
- How much dropout in each layer
- How much L2 Regularization in each layer

This is a fairly standard practice in ML, since the structure of simple models is fairly predictable and can be specified by a few hyperparameters. 

In [15]:
# This function will build a multi-class classifier with dropout and L2 regularization.
# You must specify the number of input features, the number of classes, and a list of layer hyperparameters
# in the form  [ ...., (width, activation function, L2 lambda, dropout rate), .... ]

# Note that when adding dropout, this appears as a separate layer, but it has no parameters to be trained. 

def build_model(n_inputs,layer_list,n_classes):
    # Local list is named layer_stack so it does not shadow the imported `layers` module
    layer_stack = [ Input(shape=(n_inputs,)) ]
    for (width,act,l2_lambda,dropout_rate) in layer_list:
        layer_stack.append( Dense(width, activation=act, kernel_regularizer=regularizers.l2(l2_lambda)) )
        if dropout_rate > 0:
            layer_stack.append( Dropout(dropout_rate) )
    layer_stack.append( Dense(n_classes, activation='softmax') )
    return models.Sequential( layer_stack )


**Example: To build the following model:**

In [16]:
model = models.Sequential(
   [
    Input(shape=(X_train.shape[1],)),                              
    Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.0)),              # 0.0 means no regularization applied; no dropout, so no Dropout layer necessary
    Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    Dropout(0.3),
    Dense(n_classes, activation='softmax')
   ]
)

model.summary()

**We call `build_model` as shown here:**

In [17]:
build_model(X_train.shape[1], [ (64,'relu',0.0,0.0), (32,'relu',0.001,0.3)], n_classes).summary()
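
As a sanity check on the summaries, the trainable parameter count of a fully connected stack can be computed by hand: each Dense layer has `fan_in * fan_out` weights plus `fan_out` biases, and Dropout adds none. The helper `count_dense_params` below is a hypothetical illustration, not part of the provided code.

```python
def count_dense_params(n_inputs, widths, n_classes):
    """Trainable parameters of a fully connected stack: weights plus biases per Dense layer."""
    sizes = [n_inputs] + list(widths) + [n_classes]
    per_layer = [fan_in * fan_out + fan_out
                 for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]
    return per_layer, sum(per_layer)

# 54 input features -> 64 -> 32 -> 7 classes
per_layer, total = count_dense_params(54, [64, 32], 7)
print(per_layer, total)
```

For the 54-feature, seven-class setup used here this gives 3520, 2080, and 231 parameters for the three Dense layers, 5831 in total.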

-------------------



### Baseline Model Architecture

**Problems 1–5 will use the following baseline model structure,** implemented with the provided `build_model` function and trained using `train_and_test`:

```
input → 64 → 32 → output
```

* Two hidden layers of widths 64 and 32.
* Activation function, dropout rate, and L2 regularization term (λ) will vary as specified in each problem.
* **Early stopping** is always applied to select the model at the epoch of **minimum validation loss**.
* We will report the **validation accuracy** of the selected model as the primary metric.


### Problem One: Which Activation Function?

In this problem, you will train the **baseline neural network** and investigate which activation function produces the best performance. The model you create will be the one saved by **early stopping** — that is, the epoch where validation loss is minimized.

**Steps to follow:**

* Use the provided functions `train_and_test` and `build_model` to create a model named **`model_baseline`**.
* Train and evaluate this model using each of the following activation functions in the hidden layers:

  * `relu`
  * `sigmoid`
  * `tanh`
* Identify which activation function produces the **best validation accuracy** at the epoch of **minimum validation loss**.
* Answer the graded questions.


In [None]:
activations = ["relu", "sigmoid", "tanh"]

for act in activations:
    print(f"\nTraining with activation: {act}")
    
    layer_list = [
        (64, act, 0.0, 0.0),
        (32, act, 0.0, 0.0)
    ]
    
    model_baseline = build_model(
        n_inputs=X_train.shape[1],
        layer_list=layer_list,
        n_classes=n_classes
    )
    
    model_baseline.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    early_stop = EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    )
    
    history = model_baseline.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=100,
        batch_size=32,
        callbacks=[early_stop],
        verbose=2
    )
    
    val_losses = history.history['val_loss']
    min_val_epoch = val_losses.index(min(val_losses))
    best_val_acc = history.history['val_accuracy'][min_val_epoch]

    results[act] = best_val_acc


Training with activation: relu
Epoch 1/100


132/132 - 2s - 18ms/step - accuracy: 0.4738 - loss: 1.4273 - val_accuracy: 0.6043 - val_loss: 1.0146
Epoch 2/100
132/132 - 1s - 4ms/step - accuracy: 0.6276 - loss: 0.9148 - val_accuracy: 0.6529 - val_loss: 0.8427
Epoch 3/100
132/132 - 0s - 3ms/step - accuracy: 0.6648 - loss: 0.8142 - val_accuracy: 0.6629 - val_loss: 0.8059
Epoch 4/100
132/132 - 0s - 2ms/step - accuracy: 0.6824 - loss: 0.7657 - val_accuracy: 0.6821 - val_loss: 0.7552
Epoch 5/100
132/132 - 0s - 2ms/step - accuracy: 0.6979 - loss: 0.7292 - val_accuracy: 0.6943 - val_loss: 0.7338
Epoch 6/100
132/132 - 0s - 2ms/step - accuracy: 0.7048 - loss: 0.7003 - val_accuracy: 0.7000 - val_loss: 0.7120
Epoch 7/100
132/132 - 0s - 2ms/step - accuracy: 0.7233 - loss: 0.6764 - val_accuracy: 0.7114 - val_loss: 0.6921
Epoch 8/100
132/132 - 0s - 2ms/step - accuracy: 0.7269 - loss: 0.6607 - val_accuracy: 0.7179 - val_loss: 0.6965
Epoch 9/100
132/132 - 0s - 3ms/step - accuracy: 0.7326 - loss: 0.6432 - val_accuracy: 0.7214 - val_loss: 0.6654
Epo

#### Graded Questions

In [19]:
# Set a1a to the activation function which provided the best validation accuracy at the epoch of minimum validation loss

activation_to_index = {'relu': 0, 'sigmoid': 1, 'tanh': 2}
best_activation = max(results, key=results.get)

a1a = activation_to_index[best_activation]

In [20]:
# Graded Answer
# DO NOT change this cell in any way          

print(f'a1a = {a1a}') 


a1a = 0


In [21]:
# Set a1b to the validation accuracy found by this best activation function

a1b = results[best_activation]

In [22]:
# Graded Answer
# DO NOT change this cell in any way          

print(f'a1b = {a1b:.4f}') 

a1b = 0.7779


### Problem Two: Finding the Right Learning Rate

In this problem, you will continue working with the **baseline model** and determine which learning rate produces the best performance. As before, the model you evaluate should be the one saved by **early stopping** — the epoch where validation loss is minimized.

**Steps to follow:**

* Build and train the **baseline model** using the **activation function identified in Problem One**.

* Train and evaluate this model using each of the following learning rates:

  ```
      [1e-3, 5e-4, 1e-4, 5e-5, 1e-5]
  ```

* Identify which learning rate produces the **best validation accuracy** at the epoch of **minimum validation loss**, within a maximum of **500 epochs**.

* Answer the graded questions.


**Note: Smaller learning rates will generally take more epochs to reach the optimal point, so some of these will not engage early stopping, but run the full 500 epochs.**
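
The effect described in the note can be illustrated with a toy problem. The sketch below uses plain gradient descent on the one-dimensional loss `w**2`, which is an assumption made for simplicity: it is not Adam and not the homework model, and the learning rates are chosen only to make the trend visible within a 500-step budget.

```python
def steps_to_converge(lr, w0=1.0, tol=1e-3, max_steps=500):
    """Count gradient-descent steps on f(w) = w**2 until |w| < tol, capped at max_steps."""
    w = w0
    for step in range(1, max_steps + 1):
        w -= lr * 2.0 * w          # gradient of w**2 is 2w
        if abs(w) < tol:
            return step
    return max_steps               # budget exhausted before reaching tol

steps = {lr: steps_to_converge(lr) for lr in (0.1, 0.01, 0.001)}
for lr, n in steps.items():
    print(f"lr={lr:g}: {n} steps")
```

Each tenfold reduction in the learning rate needs roughly ten times as many steps, and the smallest rate uses up the whole budget without reaching the tolerance, much like a run that never triggers early stopping.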


In [None]:
learning_rates = [1e-3, 5e-4, 1e-4, 5e-5, 1e-5]
results_lr = {}

for lr in learning_rates:
    print(f"\nTraining with learning rate: {lr}")
    
    layer_list = [
        (64, best_activation, 0.0, 0.0),
        (32, best_activation, 0.0, 0.0)
    ]
    
    model_lr = build_model(
        n_inputs=X_train.shape[1],
        layer_list=layer_list,
        n_classes=n_classes
    )
    
    model_lr.compile(
        optimizer=Adam(learning_rate=lr),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    early_stop = EarlyStopping(
        monitor='val_loss',
        patience=20,
        restore_best_weights=True
    )
    
    history = model_lr.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=500,
        batch_size=32,
        callbacks=[early_stop],
        verbose=2
    )
    
    val_losses = history.history['val_loss']
    min_val_epoch = val_losses.index(min(val_losses))
    best_val_acc = history.history['val_accuracy'][min_val_epoch]

    results_lr[lr] = best_val_acc


Training with learning rate: 0.001
Epoch 1/500
132/132 - 1s - 8ms/step - accuracy: 0.4852 - loss: 1.3859 - val_accuracy: 0.5950 - val_loss: 1.0037
Epoch 2/500
132/132 - 0s - 2ms/step - accuracy: 0.6257 - loss: 0.9174 - val_accuracy: 0.6307 - val_loss: 0.8477
Epoch 3/500
132/132 - 0s - 4ms/step - accuracy: 0.6612 - loss: 0.8168 - val_accuracy: 0.6779 - val_loss: 0.7950
Epoch 4/500
132/132 - 0s - 2ms/step - accuracy: 0.6814 - loss: 0.7658 - val_accuracy: 0.6807 - val_loss: 0.7552
Epoch 5/500
132/132 - 0s - 3ms/step - accuracy: 0.7002 - loss: 0.7338 - val_accuracy: 0.7021 - val_loss: 0.7253
Epoch 6/500
132/132 - 0s - 2ms/step - accuracy: 0.7040 - loss: 0.7078 - val_accuracy: 0.7007 - val_loss: 0.7060
Epoch 7/500
132/132 - 0s - 2ms/step - accuracy: 0.7183 - loss: 0.6828 - val_accuracy: 0.6986 - val_loss: 0.7079
Epoch 8/500
132/132 - 0s - 2ms/step - accuracy: 0.7224 - loss: 0.6620 - val_accuracy: 0.7214 - val_loss: 0.6683
Epoch 9/500
132/132 - 0s - 3ms/step - accuracy: 0.7321 - loss: 0.648

#### Graded Questions

In [24]:
# Set a2a to the learning rate which provided the best validation accuracy at the epoch of minimum validation loss

best_lr = max(results_lr, key=results_lr.get)

a2a = best_lr

In [25]:
# Graded Answer
# DO NOT change this cell in any way          

print(f'a2a = {a2a:.6f}') 

a2a = 0.001000


In [26]:
# Set a2b to the validation accuracy found by this best learning rate

a2b = results_lr[best_lr]

In [27]:
# Graded Answer
# DO NOT change this cell in any way          

print(f'a2b = {a2b:.4f}') 

a2b = 0.7779


### Problem Three: Dropout

In this problem, you will explore how **dropout** can help prevent overfitting in neural networks. There are no absolute rules, but some useful heuristics are:

* Dropout typically works best in **later dense layers** (e.g., the second hidden layer of width 32) in the range **0.3–0.5**.
* If applied to **earlier layers** (e.g., the first hidden layer), dropout should be smaller, typically **0.0–0.2** (where 0.0 means no dropout).
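
To make the dropout rate concrete, here is a minimal sketch of inverted dropout at training time, written with the standard library only. The function `dropout` is a hypothetical illustration of the idea (zero a fraction of units, rescale the rest), not the Keras layer itself.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout at training time: zero each unit with probability `rate`,
    and scale the survivors by 1/(1 - rate) so the expected activation is unchanged."""
    if rate <= 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [a * keep_scale if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(42)
acts = [1.0] * 10_000
for rate in (0.0, 0.2, 0.5):
    out = dropout(acts, rate, rng)
    dropped = sum(1 for v in out if v == 0.0) / len(out)
    mean = sum(out) / len(out)
    print(f"rate={rate}: fraction dropped={dropped:.3f}, mean activation={mean:.3f}")
```

A rate of 0.5 on a 32-unit layer leaves about 16 active units per training step, which is why high rates are usually reserved for layers that can spare the capacity. At inference time no units are dropped.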

**Steps to follow:**

* Build and train the **baseline model** using the **activation function from Problem One** and the **learning rate from Problem Two**.
* Investigate dropout in the ranges suggested, using increments of **0.1**.
* Identify which dropout configuration produces the **best validation accuracy** at the epoch of **minimum validation loss**.
* Answer the graded questions.


In [None]:
from itertools import product

dropout_first = [0.0, 0.1, 0.2] 
dropout_second = [0.3, 0.4, 0.5]

results_dropout = {}

for d1, d2 in product(dropout_first, dropout_second):
    print(f"\nTraining with dropout: first={d1}, second={d2}")
    
    layer_list = [
        (64, best_activation, 0.0, d1),
        (32, best_activation, 0.0, d2)
    ]
    
    model_dropout = build_model(
        n_inputs=X_train.shape[1],
        layer_list=layer_list,
        n_classes=n_classes
    )
    
    model_dropout.compile(
        optimizer=Adam(learning_rate=a2a),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    early_stop = EarlyStopping(
        monitor='val_loss',
        patience=20,
        restore_best_weights=True
    )
    
    history = model_dropout.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=500,
        batch_size=32,
        callbacks=[early_stop],
        verbose=2
    )
    
    val_losses = history.history['val_loss']
    min_val_epoch = val_losses.index(min(val_losses))
    best_val_acc = history.history['val_accuracy'][min_val_epoch]

    results_dropout[(d1, d2)] = best_val_acc
    


Training with dropout: first=0.0, second=0.3
Epoch 1/500
132/132 - 1s - 9ms/step - accuracy: 0.3905 - loss: 1.5905 - val_accuracy: 0.5857 - val_loss: 1.1420
Epoch 2/500
132/132 - 0s - 2ms/step - accuracy: 0.5519 - loss: 1.1173 - val_accuracy: 0.6264 - val_loss: 0.9257
Epoch 3/500
132/132 - 0s - 2ms/step - accuracy: 0.5905 - loss: 0.9903 - val_accuracy: 0.6571 - val_loss: 0.8549
Epoch 4/500
132/132 - 0s - 3ms/step - accuracy: 0.6286 - loss: 0.9139 - val_accuracy: 0.6664 - val_loss: 0.8106
Epoch 5/500
132/132 - 0s - 2ms/step - accuracy: 0.6371 - loss: 0.8719 - val_accuracy: 0.6907 - val_loss: 0.7750
Epoch 6/500
132/132 - 0s - 2ms/step - accuracy: 0.6631 - loss: 0.8466 - val_accuracy: 0.7000 - val_loss: 0.7522
Epoch 7/500
132/132 - 0s - 2ms/step - accuracy: 0.6700 - loss: 0.8148 - val_accuracy: 0.7114 - val_loss: 0.7286
Epoch 8/500
132/132 - 0s - 2ms/step - accuracy: 0.6752 - loss: 0.7905 - val_accuracy: 0.7164 - val_loss: 0.7085
Epoch 9/500
132/132 - 0s - 2ms/step - accuracy: 0.6914 - l

In [29]:
# Set a3a to the pair (dropout_rate_64,dropout_rate_32) of dropout rates for the two hidden layers which provided the best 
# validation accuracy at the epoch of minimum validation loss

best_dropout = max(results_dropout, key=results_dropout.get)

a3a = best_dropout

In [30]:
# Graded Answer
# DO NOT change this cell in any way          

print(f'a3a = {a3a}') 

a3a = (0.1, 0.3)


In [31]:
# Set a3b to the validation accuracy found by this best pair of dropout rates

a3b = results_dropout[best_dropout]

In [32]:
# Graded Answer
# DO NOT change this cell in any way          

print(f'a3b = {a3b:.4f}') 

a3b = 0.8014


### Problem Four: L2 Regularization

In this problem, you will explore how **L2 regularization** (also called *weight decay*) can help prevent overfitting in neural networks. There are no absolute rules, but some useful heuristics are:

* Start simple by using the **same λ in both hidden layers**, with values:

  ```
      1e-4, 1e-3, 1e-2
  ```

* If validation results suggest underfitting in the first layer or persistent overfitting in the later one, then try adjusting per layer, for example:

  * First hidden layer: λ = 1e-4
  * Second hidden layer: λ = 1e-3
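
The size of the penalty for a given λ can be checked by hand: `regularizers.l2(λ)` adds λ times the sum of squared kernel weights to the loss for that layer. The sketch below uses made-up weights, and `l2_penalty` is a hypothetical helper.

```python
def l2_penalty(weights, lam):
    """Penalty added to the loss for one layer: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]            # made-up kernel values for a tiny layer
for lam in (1e-4, 1e-3, 1e-2):
    print(f"lambda={lam:g}: penalty={l2_penalty(weights, lam):.6f}")
```

Since the penalty is included in the reported training and validation loss, loss values from runs with different λ are not directly comparable; validation accuracy is the safer basis for comparison.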

**Steps to follow:**

* Build and train the **baseline model** using the **activation function from Problem One** and the **learning rate from Problem Two**, but **without dropout**.
* Investigate at least the four cases suggested (three with the same λ and one with different λ values). You may also consider additional combinations.
* Identify which configuration produces the **best validation accuracy** at the epoch of **minimum validation loss**.
* Answer the graded questions.


In [None]:
l2_cases = [
    (1e-4, 1e-4),
    (1e-3, 1e-3),
    (1e-2, 1e-2),
    (1e-4, 1e-3)
]

results_l2 = {}

for l1, l2 in l2_cases:
    print(f"\nTraining with L2: first={l1}, second={l2}")
    
    layer_list = [
        (64, best_activation, l1, 0.0),
        (32, best_activation, l2, 0.0)
    ]
    
    model_l2 = build_model(
        n_inputs=X_train.shape[1],
        layer_list=layer_list,
        n_classes=n_classes
    )
    
    model_l2.compile(
        optimizer=Adam(learning_rate=a2a),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    early_stop = EarlyStopping(
        monitor='val_loss',
        patience=20,
        restore_best_weights=True
    )
    
    
    history = model_l2.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=500,
        batch_size=32,
        callbacks=[early_stop],
        verbose=2
    )
    
    val_losses = history.history['val_loss']
    min_val_epoch = val_losses.index(min(val_losses))
    best_val_acc = history.history['val_accuracy'][min_val_epoch]

    results_l2[(l1, l2)] = best_val_acc
    


Training with L2: first=0.0001, second=0.0001
Epoch 1/500


132/132 - 1s - 8ms/step - accuracy: 0.4871 - loss: 1.3923 - val_accuracy: 0.6064 - val_loss: 1.0188
Epoch 2/500
132/132 - 0s - 3ms/step - accuracy: 0.6260 - loss: 0.9309 - val_accuracy: 0.6450 - val_loss: 0.8675
Epoch 3/500
132/132 - 1s - 4ms/step - accuracy: 0.6493 - loss: 0.8335 - val_accuracy: 0.6729 - val_loss: 0.8096
Epoch 4/500
132/132 - 1s - 4ms/step - accuracy: 0.6779 - loss: 0.7818 - val_accuracy: 0.6871 - val_loss: 0.7709
Epoch 5/500
132/132 - 0s - 3ms/step - accuracy: 0.6938 - loss: 0.7467 - val_accuracy: 0.7000 - val_loss: 0.7461
Epoch 6/500
132/132 - 0s - 2ms/step - accuracy: 0.7140 - loss: 0.7124 - val_accuracy: 0.7093 - val_loss: 0.7208
Epoch 7/500
132/132 - 0s - 3ms/step - accuracy: 0.7190 - loss: 0.6912 - val_accuracy: 0.7157 - val_loss: 0.7077
Epoch 8/500
132/132 - 0s - 2ms/step - accuracy: 0.7290 - loss: 0.6672 - val_accuracy: 0.7207 - val_loss: 0.6879
Epoch 9/500
132/132 - 0s - 2ms/step - accuracy: 0.7395 - loss: 0.6473 - val_accuracy: 0.7193 - val_loss: 0.6791
Epoc

In [34]:
# Set a4a to the pair (L2_lambda_64,L2_lambda_32) of the L2 lambdas for the two hidden layers which provided the best 
# validation accuracy at the epoch of minimum validation loss

best_l2 = max(results_l2, key=results_l2.get)

a4a = best_l2

In [35]:
# Graded Answer
# DO NOT change this cell in any way          

print(f'a4a = {a4a}') 

a4a = (0.001, 0.001)


In [36]:
# Set a4b to the validation accuracy found by this best pair of lambdas

a4b = results_l2[best_l2]

In [37]:
# Graded Answer
# DO NOT change this cell in any way          

print(f'a4b = {a4b:.4f}') 

a4b = 0.7836


### Problem Five: Combining Dropout with L2 Regularization

In this problem, you will explore how **dropout** and **L2 regularization** can work together to prevent overfitting. These two methods complement each other, but must be balanced carefully. A useful rule of thumb is:

* If dropout is **high**, use a **smaller λ**.
* If dropout is **low**, you can afford a **larger λ**.

**Steps to follow:**

* Build and train the **baseline model** using the **activation function from Problem One** and the **learning rate from Problem Two**.
* Investigate combinations of dropout and L2:

  * First, use the **dropout rate you identified in Problem Three** as a baseline.
  * Then, add L2 to both hidden layers with values:

    ```
        1e-4, 1e-3, 1e-2
    ```

    while keeping dropout fixed.
  * Finally, try **reducing dropout slightly** when L2 is added to see if performance improves.
  * [Optional] You may wish to investigate other combinations not covered here; for example, promising but not optimal dropout rates may give better overall performance when combined with L2 regularization.
* Identify which combination produces the **best validation accuracy** at the epoch of **minimum validation loss**.
* Answer the graded questions.
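
One way to organize the search described in the steps above is to enumerate a short list of candidate settings before training anything. This is a sketch under stated assumptions: `focused_candidates` and `to_layer_list` are hypothetical helpers, the reduction step of 0.1 is an arbitrary choice, and the dropout pair `(0.1, 0.3)` is used only as an example input.

```python
def focused_candidates(best_dropout, l2_values=(1e-4, 1e-3, 1e-2), reduce_by=0.1):
    """List (d1, d2, lam1, lam2) settings following the three steps in the problem text."""
    d1, d2 = best_dropout
    candidates = [(d1, d2, 0.0, 0.0)]                         # dropout-only baseline
    candidates += [(d1, d2, lam, lam) for lam in l2_values]   # add L2, dropout fixed
    r1 = round(max(d1 - reduce_by, 0.0), 2)
    r2 = round(max(d2 - reduce_by, 0.0), 2)
    candidates += [(r1, r2, lam, lam) for lam in l2_values]   # slightly reduced dropout
    return candidates

def to_layer_list(cfg, activation="relu"):
    """Convert one candidate into the layer_list format expected by build_model."""
    d1, d2, lam1, lam2 = cfg
    return [(64, activation, lam1, d1), (32, activation, lam2, d2)]

cands = focused_candidates((0.1, 0.3))
for cfg in cands:
    print(cfg, "->", to_layer_list(cfg))
```

This yields seven training runs, far fewer than an exhaustive grid over every dropout and λ pairing, at the cost of possibly missing a combination outside the suggested steps.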



In [None]:
from itertools import product

l2_values = [1e-4, 1e-3, 1e-2]

results_combined = {}

for d1, d2, l1, l2 in product(dropout_first, dropout_second, l2_values, l2_values):
    print(f"\nTraining with dropout: first={d1}, second={d2} | L2: first={l1}, second={l2}")
    
    layer_list = [
        (64, best_activation, l1, d1),
        (32, best_activation, l2, d2)
    ]
    
    model_combined = build_model(
        n_inputs=X_train.shape[1],
        layer_list=layer_list,
        n_classes=n_classes
    )
    
    model_combined.compile(
        optimizer=Adam(learning_rate=a2a),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    early_stop = EarlyStopping(
        monitor='val_loss',
        patience=20,
        restore_best_weights=True
    )
    
    history = model_combined.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=500,
        batch_size=32,
        callbacks=[early_stop],
        verbose=2
    )
    
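    # Report the validation accuracy at the epoch of minimum validation loss
    # (as in the earlier problems), not the maximum accuracy over all epochs
    val_losses = history.history['val_loss']
    min_val_epoch = val_losses.index(min(val_losses))
    best_val_acc = history.history['val_accuracy'][min_val_epoch]

    results_combined[(d1, d2, l1, l2)] = best_val_acc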
    


Training with dropout: first=0.0, second=0.3 | L2: first=0.0001, second=0.0001
Epoch 1/500


132/132 - 1s - 9ms/step - accuracy: 0.3874 - loss: 1.6471 - val_accuracy: 0.5836 - val_loss: 1.1935
Epoch 2/500
132/132 - 0s - 2ms/step - accuracy: 0.5579 - loss: 1.1585 - val_accuracy: 0.6279 - val_loss: 0.9437
Epoch 3/500
132/132 - 0s - 3ms/step - accuracy: 0.5862 - loss: 1.0138 - val_accuracy: 0.6414 - val_loss: 0.8667
Epoch 4/500
132/132 - 0s - 2ms/step - accuracy: 0.6143 - loss: 0.9390 - val_accuracy: 0.6736 - val_loss: 0.8170
Epoch 5/500
132/132 - 0s - 2ms/step - accuracy: 0.6302 - loss: 0.9073 - val_accuracy: 0.6714 - val_loss: 0.7957
Epoch 6/500
132/132 - 0s - 2ms/step - accuracy: 0.6405 - loss: 0.8580 - val_accuracy: 0.6886 - val_loss: 0.7625
Epoch 7/500
132/132 - 0s - 2ms/step - accuracy: 0.6633 - loss: 0.8334 - val_accuracy: 0.7014 - val_loss: 0.7529
Epoch 8/500
132/132 - 0s - 2ms/step - accuracy: 0.6719 - loss: 0.8156 - val_accuracy: 0.7029 - val_loss: 0.7271
Epoch 9/500
132/132 - 0s - 2ms/step - accuracy: 0.6702 - loss: 0.8008 - val_accuracy: 0.7050 - val_loss: 0.7194
Epoc

In [39]:
# Set a5 to the validation accuracy found by this best combination of dropout and L2 regularization

best_config = max(results_combined, key=results_combined.get)

a5 = results_combined[best_config]

In [40]:
# Graded Answer
# DO NOT change this cell in any way          

print(f'a5 = {a5:.4f}') 

a5 = 0.8129


### Problem Six: Build and Train Your Best Model

In this final problem, you will design and train your **best-performing model** using the techniques explored in the previous problems. You may make your own choices for:

* **Model architecture** (number of layers, widths, etc.)
* **Learning rate**
* **Batch size** (a new hyperparameter not varied in earlier problems)
* **Dropout rates** in both layers
* **L2 λ values** in both layers
* **[Optional but strongly suggested]:** Learning rate scheduling, using either **Exponential Decay** or **Cosine Decay**.

  * For Exponential Decay, typical decay rates are **0.90–0.999**, with **0.95** often a good starting point.
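
To get a feel for how the decay rate interacts with the number of training steps, the schedule can be computed by hand. The sketch below assumes the formula documented for Keras's `ExponentialDecay`, `initial_learning_rate * decay_rate ** (step / decay_steps)`, with integer division in the exponent when `staircase=True`. The function name and the example values (initial rate `1e-3`, 1000 decay steps, 66 steps per epoch) are illustrative assumptions, not required settings.

```python
def exponential_decay_lr(step, initial_lr, decay_steps, decay_rate, staircase=False):
    """Learning rate after `step` optimizer updates under exponential decay."""
    # With staircase=True the exponent only changes once every `decay_steps` updates.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent


steps_per_epoch = 66  # hypothetical: number of batches per epoch
for epoch in (0, 10, 50, 100):
    step = epoch * steps_per_epoch
    lr = exponential_decay_lr(step, initial_lr=1e-3, decay_steps=1000,
                              decay_rate=0.95, staircase=True)
    print(f"epoch {epoch:3d}: step {step:5d}, lr = {lr:.6f}")
```

With these values the rate is multiplied by 0.95 only once every 1000 steps (about every 15 epochs), so after 100 epochs it has fallen by roughly 26%. If early stopping usually triggers well before that, a smaller `decay_steps` or a lower `decay_rate` may be needed for the schedule to have a visible effect.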

**Steps to follow:**

* Build and train the model according to your design choices.
* Use early stopping as before to evaluate performance at the epoch of **minimum validation loss**.
* Answer the graded question.


In [43]:
dropout_first_best = 0.2
dropout_second_best = 0.3
l2_first_best = 0.0001
l2_second_best = 0.01

lr_schedule = ExponentialDecay(
    initial_learning_rate=a2a,
    decay_steps=1000,
    decay_rate=0.95,
    staircase=True
)

layer_list = [
    (64, best_activation, l2_first_best, dropout_first_best),
    (32, best_activation, l2_second_best, dropout_second_best)
]

best_model = build_model(
    n_inputs=X_train.shape[1],
    layer_list=layer_list,
    n_classes=n_classes
)

best_model.compile(
    optimizer=Adam(learning_rate=lr_schedule),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=20,
    restore_best_weights=True
)

history = best_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=500,
    batch_size=64,
    callbacks=[early_stop],
    verbose=2
)

Epoch 1/500


66/66 - 1s - 18ms/step - accuracy: 0.2895 - loss: 2.2281 - val_accuracy: 0.5300 - val_loss: 1.8438
Epoch 2/500
66/66 - 0s - 4ms/step - accuracy: 0.4733 - loss: 1.7611 - val_accuracy: 0.5957 - val_loss: 1.4447
Epoch 3/500
66/66 - 0s - 3ms/step - accuracy: 0.5319 - loss: 1.4986 - val_accuracy: 0.6243 - val_loss: 1.2474
Epoch 4/500
66/66 - 0s - 5ms/step - accuracy: 0.5621 - loss: 1.3578 - val_accuracy: 0.6364 - val_loss: 1.1506
Epoch 5/500
66/66 - 0s - 3ms/step - accuracy: 0.5762 - loss: 1.2619 - val_accuracy: 0.6414 - val_loss: 1.0830
Epoch 6/500
66/66 - 0s - 3ms/step - accuracy: 0.5948 - loss: 1.1725 - val_accuracy: 0.6486 - val_loss: 1.0347
Epoch 7/500
66/66 - 0s - 4ms/step - accuracy: 0.5952 - loss: 1.1367 - val_accuracy: 0.6579 - val_loss: 0.9961
Epoch 8/500
66/66 - 0s - 5ms/step - accuracy: 0.6088 - loss: 1.0887 - val_accuracy: 0.6536 - val_loss: 0.9634
Epoch 9/500
66/66 - 0s - 6ms/step - accuracy: 0.6174 - loss: 1.0567 - val_accuracy: 0.6714 - val_loss: 0.9399
Epoch 10/500
66/66 - 

In [44]:
# Set a6 to the validation accuracy found by this best model

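# Use the validation accuracy at the epoch of minimum validation loss
val_losses = history.history['val_loss']
a6 = history.history['val_accuracy'][val_losses.index(min(val_losses))]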

In [45]:
# Graded Answer
# DO NOT change this cell in any way          

print(f'a6 = {a6:.4f}') 

a6 = 0.7864


### Optional: Print out your results of all experiments

In [None]:
# print_results()

TypeError: 'float' object is not subscriptable

## Reflection Questions (ungraded)

1. Activation Functions:

    - Why do you think one activation function worked better than the others for this task?
    
    - How might this choice differ for deeper or wider networks?

2. Learning Rate:

    - Would a much smaller learning rate (with many more epochs) likely produce better accuracy?
    
    - When is it worth training longer with a smaller step size, and when is it unnecessary?

3. Dropout vs. L2:

    - Which form of regularization — dropout or L2 — gave better results in your experiments?
    
    - Why might one method be more effective in this setting?

4. Combining Dropout and L2:

    - Why might the combination of dropout and L2 sometimes perform worse than using one method alone?
    
    - What does this tell you about the balance between bias and variance in regularization?

5. Best Model:

    - When you designed your best model, what trade-offs did you notice between model complexity, training stability, and generalization?
    
    - Did learning rate scheduling (if you tried it) improve results? Why might it help?