<a href="https://colab.research.google.com/github/aydanmufti/Module-7-Assignments/blob/main/Homework_09.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Homework 9: Text Classification with Fine-Tuned BERT

### Due: Midnight on November 5th (with 2-hour grace period) — Worth 85 points

In this final homework, we’ll explore **fine-tuning a pre-trained Transformer model (BERT)** for text classification using the **IMDB Movie Review** dataset. You’ll begin with a working baseline notebook and then conduct a series of controlled experiments to understand how data size, context length, and model architecture affect performance.

You’ll complete three problems:

* **Problem 1:** Evaluate how **sequence length** and **learning rate** jointly influence validation loss and generalization.
* **Problem 2:** Measure how **training data size** affects both model performance and total training time.
* **Problem 3:** Compare **two additional models** from the BERT family to analyze the trade-offs between model size and accuracy on this dataset.

In each problem, you’ll report your key metrics, summarize what you observed, and reflect on what you learned.

> **Note:** This homework was developed and tested on **Google Colab**, due to version conflicts when running locally. It is **strongly recommended** that you complete your work on Colab as well.

There are 6 graded questions (two per problem), each worth 14 points, and you get one free point if you complete the entire homework.


In [2]:
# Uninstall and reinstall with compatible versions
%pip uninstall -y pyarrow
%pip install pyarrow==14.0.1
# Then restart the runtime

Found existing installation: pyarrow 22.0.0
Uninstalling pyarrow-22.0.0:
  Successfully uninstalled pyarrow-22.0.0
Collecting pyarrow==14.0.1
  Downloading pyarrow-14.0.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Downloading pyarrow-14.0.1-cp312-cp312-manylinux_2_28_x86_64.whl (38.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.0/38.0 MB 22.5 MB/s eta 0:00:00
Installing collected packages: pyarrow
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 4.3.0 requires pyarrow>=21.0.0, but you have pyarrow 14.0.1 which is incompatible.
bigframes 2.27.0 requires pyarrow>=15.0.2, but you have pyarrow 14.0.1 which is incompatible.
Successfully installed pyarrow-14.0.1


In [1]:
# Install once per new Colab runtime
%pip -q install -U keras keras-hub tensorflow tensorflow-text datasets evaluate

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.

In [2]:

import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import time
import random
import numpy as np
import keras
import keras_hub as kh
import evaluate
from datasets import load_dataset, Dataset, Features, Value, ClassLabel

from keras import mixed_precision                    # generally faster
mixed_precision.set_global_policy("mixed_float16")

### Here is where you can set global hyperparameters for this homework

In [3]:
# ---------------- Config ----------------
SEED        = 42
MAX_LEN     = 128
EPOCHS      = 3
BATCH       = 32
EVAL_BATCH  = 64
SUBSET_FRAC = 0.25   # <-- 0.25 to train and test on 25% of whole dataset during development;  set to 1.0 for full dataset

keras.utils.set_random_seed(SEED)

### Load and Preprocess the IMDB Movie Review Dataset

In [4]:
# ---- Load IMDb (raw), join train+test ----
imdb   = load_dataset("imdb")
texts  = list(imdb["train"]["text"]) + list(imdb["test"]["text"])
labels = np.array(list(imdb["train"]["label"]) + list(imdb["test"]["label"]), dtype="int32")

# ---- Build DS with explicit features (label=ClassLabel) ----
features = Features({"text": Value("string"),
                     "label": ClassLabel(num_classes=2, names=["NEG","POS"])})
all_ds = Dataset.from_dict({"text": texts, "label": labels.tolist()}, features=features)

# ---- Optional: take a stratified subset of the FULL dataset ----
if 0.0 < SUBSET_FRAC < 1.0:
    sub = all_ds.train_test_split(train_size=SUBSET_FRAC, seed=SEED, stratify_by_column="label")
    ds_pool = sub["train"]
else:
    ds_pool = all_ds

# ---- Stratified 80/10/10 split on the (possibly smaller) pool ----
# First: 80/20 train+val pool / test
splits = ds_pool.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
train_val_pool, test_ds = splits["train"], splits["test"]
# Then: carve 10% of full (i.e., 0.125 of the 80% pool) as validation
splits2 = train_val_pool.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
train_ds, val_ds = splits2["train"], splits2["test"]

# ---- Numpy arrays for Keras fit/predict ----
X_tr = np.array(train_ds["text"], dtype=object); y_tr = np.array(train_ds["label"], dtype="int32")
X_va = np.array(val_ds["text"],   dtype=object); y_va = np.array(val_ds["label"],   dtype="int32")
X_te = np.array(test_ds["text"],  dtype=object); y_te = np.array(test_ds["label"],  dtype="int32")

# ---- Quick summary ----
def _counts(ds):
    arr = np.array(ds["label"], dtype=int)
    return len(arr), np.bincount(arr, minlength=2).tolist()
print(f"Pool after SUBSET_FRAC={SUBSET_FRAC}: {len(ds_pool)} (of {len(all_ds)})")
print("Train:", _counts(train_ds), " Val:", _counts(val_ds), " Test:", _counts(test_ds))


Pool after SUBSET_FRAC=0.25: 12500 (of 50000)
Train: (8750, [4375, 4375])  Val: (1250, [625, 625])  Test: (2500, [1250, 1250])
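
The split sizes printed above follow directly from the two-stage split (20% test, then 12.5% of the remaining 80% as validation). The sketch below reproduces that arithmetic and the number of batches per epoch. The helper names `split_sizes` and `steps_per_epoch` are made up for illustration, and the rounding here is a simplification: the `datasets` library may round differently by one example when sizes do not divide evenly.

```python
import math

def split_sizes(total, subset_frac, test_frac=0.20, val_frac_of_pool=0.125):
    """Return (train, val, test) sizes for the two-stage split (arithmetic only)."""
    pool = round(total * subset_frac)          # examples kept after subsetting
    test = round(pool * test_frac)             # 20% of the pool
    train_val = pool - test                    # remaining 80%
    val = round(train_val * val_frac_of_pool)  # 12.5% of 80% = 10% of the pool
    train = train_val - val
    return train, val, test

def steps_per_epoch(n_train, batch_size):
    """Batches per epoch when the last batch is allowed to be partial."""
    return math.ceil(n_train / batch_size)

train, val, test = split_sizes(50_000, 0.25)
print(train, val, test)             # 8750 1250 2500
print(steps_per_epoch(train, 32))   # 274
```

The value 274 is the step count Keras shows per epoch for the baseline run with `BATCH = 32`.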


### Build and train a baseline DistilBERT Text Classifier

In [5]:
# ---- Keras Hub preprocessor + classifier ----
preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
    "distil_bert_base_en_uncased", sequence_length=MAX_LEN
)
model = kh.models.DistilBertTextClassifier.from_preset(
    "distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc
)

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)

start = time.time()

# ---- Train with early stopping (restore best val weights) ----
cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]
history = model.fit(
    X_tr, y_tr,
    validation_data=(X_va, y_va),
    epochs=EPOCHS,
    batch_size=BATCH,
    callbacks=cb,
    verbose=1,
)

# ---- Evaluate (accuracy + F1 via `evaluate`) ----
logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0)
y_pred = logits.argmax(axis=-1)

acc_metric = evaluate.load("accuracy")
f1_metric  = evaluate.load("f1")
acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
f1  = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

# Tiny confusion matrix helper (no sklearn needed)
def confusion_matrix_np(y_true, y_pred, num_classes=2):
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

print(f"\nValidation acc (best epoch): {history.history['val_acc'][np.argmin(history.history['val_loss'])]:.3f}")
print(f"\nTest accuracy: {acc:.3f}   Test F1: {f1:.3f}")
print("\nConfusion matrix:\n", confusion_matrix_np(y_te, y_pred))

elapsed = time.time() - start
print("\nElapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed)))

Epoch 1/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 124s 252ms/step - acc: 0.7825 - loss: 0.4529 - val_acc: 0.8376 - val_loss: 0.3449
Epoch 2/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 10s 35ms/step - acc: 0.8785 - loss: 0.2896 - val_acc: 0.8584 - val_loss: 0.3401
Epoch 3/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 10s 34ms/step - acc: 0.9160 - loss: 0.2205 - val_acc: 0.8600 - val_loss: 0.3552


Validation acc (best epoch): 0.858

Test accuracy: 0.855   Test F1: 0.852

Confusion matrix:
 [[1096  154]
 [ 209 1041]]

Elapsed time: 00:02:52
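
The accuracy and F1 reported above can be recovered by hand from the confusion matrix, which is a useful sanity check on the metric calls. The sketch below assumes rows are true classes, columns are predicted classes, and that F1 is computed with class 1 (POS) as the positive class, which is an assumption about the metric's default rather than something the notebook states.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[1096, 154],
      [209, 1041]]

tn, fp = cm[0]   # true NEG: predicted NEG, predicted POS
fn, tp = cm[1]   # true POS: predicted NEG, predicted POS
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the reviews predicted POS, how many are POS
recall = tp / (tp + fn)      # of the true POS reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.855 precision=0.871 recall=0.833 f1=0.852
```

The matrix also shows that the model misses more positive reviews (209) than it mislabels negative ones (154), which is why recall is lower than precision.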


# Problem 1 — Mini sweep: context length × learning rate (6 runs)

In this problem we'll see how much **context length** (`MAX_LEN`) helps, and how sensitive fine-tuning is to **learning rate**—without running a huge grid.

## Setup (keep these fixed)

* `SUBSET_FRAC = 0.25`               # use only this fraction of the whole dataset
* `EPOCHS = 3`
* `BATCH = 32` at `MAX_LEN = 128` (the longer contexts use smaller batches; see the configurations below)
* **EarlyStopping** with `restore_best_weights=True`
* Same random `SEED` for all runs
* Same data split for all runs (don’t reshuffle between runs)
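
To see what `patience=2` means for a three-epoch run, here is a minimal pure-Python sketch of the stopping rule. It assumes any strict decrease in `val_loss` counts as an improvement; the function name `simulate_early_stopping` is made up for illustration and is not part of Keras.

```python
def simulate_early_stopping(val_losses, patience=2):
    """Return (best_epoch, epochs_run), both 1-based, for a list of val losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0          # epochs since the last improvement
    epochs_run = 0
    for epoch, loss in enumerate(val_losses, start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break   # stop training; best_epoch marks the weights to keep
    return best_epoch, epochs_run

# Validation losses from the baseline run above.
print(simulate_early_stopping([0.3449, 0.3401, 0.3552]))   # (2, 3)

# Hypothetical losses (not from any run) where stopping cuts training short.
print(simulate_early_stopping([0.50, 0.52, 0.55, 0.40]))   # (1, 3)
```

With only three epochs and `patience=2`, training can never end before epoch 3, so in this homework the callback mainly matters for selecting the epoch with the lowest validation loss.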

### Run these 6 configurations

**For each** `MAX_LEN ∈ {128, 256, 512}`, try **two** learning rates:

* **MAX_LEN = 128**

  * `(LR = 2e-5, BATCH = 32)` – healthy default for shorter contexts.
  * `(LR = 1e-5, BATCH = 32)` – conservative LR; often slightly more stable.

* **MAX_LEN = 256**

  * `(LR = 1e-5, BATCH = 16)` – longer context → lower batch.
  * `(LR = 7.5e-6, BATCH = 16)` – even steadier if loss is noisy.

* **MAX_LEN = 512**  *(heavier quadratic attention cost)*

  * `(LR = 7.5e-6, BATCH = 8)` – safe starting point.
  * `(LR = 5e-6, BATCH = 8)` – extra caution for stability.

**If you hit an Out Of Memory error:**

* At **256** with `BATCH = 16`, drop to `BATCH = 8`.
* At **512** with `BATCH = 8`, drop to `BATCH = 4`.
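
The batch sizes above halve each time `MAX_LEN` doubles, which keeps the number of tokens per batch constant. The attention score matrix still grows, because it has `MAX_LEN × MAX_LEN` entries per sequence. The sketch below counts those entries per attention head and layer; it is a rough proxy only, since real memory use depends on the model and the attention implementation. The helper name `batch_cost` is made up for illustration.

```python
def batch_cost(max_len, batch):
    """Return (tokens per batch, attention-score entries per head per layer)."""
    tokens = max_len * batch            # grows linearly with sequence length
    attn_entries = batch * max_len ** 2 # grows quadratically with sequence length
    return tokens, attn_entries

# (MAX_LEN, BATCH) pairs from the configurations listed above.
for max_len, batch in [(128, 32), (256, 16), (512, 8)]:
    tokens, attn = batch_cost(max_len, batch)
    print(f"MAX_LEN={max_len:3d} BATCH={batch:2d} tokens={tokens} attn_entries={attn}")
# MAX_LEN=128 BATCH=32 tokens=4096 attn_entries=524288
# MAX_LEN=256 BATCH=16 tokens=4096 attn_entries=1048576
# MAX_LEN=512 BATCH= 8 tokens=4096 attn_entries=2097152
```

Even with the token budget held fixed, the attention entries double at each step, which is why the longest context is the one most likely to run out of memory.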


Then answer the graded questions.


In [None]:
# Your code here; add as many cells as you need
# Import pandas
import pandas as pd

# Store results
results = []

# Fixed parameters
# SEED = 42
EPOCHS = 3
SUBSET_FRAC = 0.25
EVAL_BATCH = 64

keras.utils.set_random_seed(SEED)

# Configuration sweep
configs = [
    {'MAX_LEN': 128, 'LR': 2e-5, 'BATCH': 32, 'name': '128_2e-5'},
    {'MAX_LEN': 128, 'LR': 1e-5, 'BATCH': 32, 'name': '128_1e-5'},
    {'MAX_LEN': 256, 'LR': 1e-5, 'BATCH': 16, 'name': '256_1e-5'},
    {'MAX_LEN': 256, 'LR': 7.5e-6, 'BATCH': 16, 'name': '256_7.5e-6'},
    {'MAX_LEN': 512, 'LR': 7.5e-6, 'BATCH': 8, 'name': '512_7.5e-6'},
    {'MAX_LEN': 512, 'LR': 5e-6, 'BATCH': 8, 'name': '512_5e-6'},
]

for config in configs:
    print(f"Running: MAX_LEN={config['MAX_LEN']}, LR={config['LR']}, BATCH={config['BATCH']}")

    keras.utils.set_random_seed(SEED)

    # Build preprocessor and model
    preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
        "distil_bert_base_en_uncased", sequence_length=config['MAX_LEN']
    )
    model = kh.models.DistilBertTextClassifier.from_preset(
        "distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc
    )

    model.compile(
        optimizer=keras.optimizers.Adam(config['LR']),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
    )

    start = time.time()

    # Train with early stopping
    cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]
    history = model.fit(
        X_tr, y_tr,
        validation_data=(X_va, y_va),
        epochs=EPOCHS,
        batch_size=config['BATCH'],
        callbacks=cb,
        verbose=1,
    )

    # Evaluate on test set
    logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0)
    y_pred = logits.argmax(axis=-1)

    acc_metric = evaluate.load("accuracy")
    f1_metric = evaluate.load("f1")
    acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
    f1 = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

    elapsed = time.time() - start

    # Get best validation metrics
    best_epoch = np.argmin(history.history['val_loss'])
    best_val_loss = history.history['val_loss'][best_epoch]
    best_val_acc = history.history['val_acc'][best_epoch]

    # Store results
    results.append({
        'config': config['name'],
        'MAX_LEN': config['MAX_LEN'],
        'LR': config['LR'],
        'BATCH': config['BATCH'],
        'best_epoch': best_epoch + 1,
        'val_loss': best_val_loss,
        'val_acc': best_val_acc,
        'test_acc': acc,
        'test_f1': f1,
        'elapsed_time': elapsed
    })

    print(f"\nBest epoch: {best_epoch + 1}")
    print(f"Val loss (best): {best_val_loss:.4f}, Val acc (best): {best_val_acc:.3f}")
    print(f"Test accuracy: {acc:.3f}, Test F1: {f1:.3f}")
    print(f"Elapsed time: {time.strftime('%H:%M:%S', time.gmtime(elapsed))}")

# Display results summary
df_results = pd.DataFrame(results)
print("Result Summary:")
print(df_results.to_string(index=False))

# Find best configuration
best_idx = df_results['val_acc'].idxmax()
best_config = df_results.loc[best_idx]  # idxmax returns an index label, so use .loc
print("Best Configuration:")
print(f"Config: {best_config['config']}")
print(f"MAX_LEN: {best_config['MAX_LEN']}, LR: {best_config['LR']}, BATCH: {best_config['BATCH']}")
print(f"Validation accuracy at min val loss: {best_config['val_acc']:.3f}")
print(f"Test accuracy: {best_config['test_acc']:.3f}")
print(f"Test F1: {best_config['test_f1']:.3f}")

Running: MAX_LEN=128, LR=2e-05, BATCH=32
Epoch 1/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 84s 152ms/step - acc: 0.8080 - loss: 0.4107 - val_acc: 0.8400 - val_loss: 0.3525
Epoch 2/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 9s 33ms/step - acc: 0.8960 - loss: 0.2549 - val_acc: 0.8464 - val_loss: 0.3570
Epoch 3/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 9s 33ms/step - acc: 0.9318 - loss: 0.1775 - val_acc: 0.8624 - val_loss: 0.3769

Best epoch: 1
Val loss (best): 0.3525, Val acc (best): 0.840
Test accuracy: 0.841, Test F1: 0.852
Elapsed time: 00:01:54
Running: MAX_LEN=128, LR=1e-05, BATCH=32
Epoch 1/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 86s 155ms/step - acc: 0.7825 - loss: 0.4530 - val_acc: 0.8376 - val_loss: 0.3450
Epoch 2/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 10s 34ms/step - acc: 0.8787 - loss: 0.2897 - val_acc: 0.8584 - val_loss: 0.3401
Epoch 3/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 9s

### Graded Questions

In [None]:
# Set a1a to the validation accuracy at min validation loss for your best configuration found in this problem

a1a = 0.914  # Best config: MAX_LEN=512, LR=5e-6, BATCH=8

In [None]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a1a = {a1a:.4f}')

a1a = 0.9140


#### Question a1b:

* Does **more context** (128 → 256 → 512) consistently help?
* How much effect did the learning rate have on the validation accuracy?


#### Your Answer Here:
Yes, more context consistently helped. Validation accuracy rose from 0.840-0.858 at 128 tokens to 0.902-0.907 at 256 tokens and 0.910-0.914 at 512 tokens, a gain of about 5.6 percentage points between the best 128-token run and the best 512-token run. Most of that gain came from the step from 128 to 256, with diminishing returns at 512. Batch size shrank as `MAX_LEN` grew (32, 16, 8), so the comparison is not purely about context length.

Learning rate had a smaller but consistent effect: within each context length, the lower learning rate came out ahead by about 0.4 to 1.8 percentage points. Overall, context length was the more impactful hyperparameter, with roughly three times the effect of learning-rate tuning. Lower learning rates gave more stable fine-tuning, while longer sequences captured more sentiment-relevant information from the reviews.

## Problem 2 — How much data is enough?

In this problem, you’ll investigate how model performance scales with dataset size.

**Setup.**
Use the best `MAX_LEN` and `LR` values you found in **Problem 1**.

**What to do:**

1. For each value of `SUBSET_FRAC ∈ {0.25, 0.50, 0.75, 1.00}`, train your model once and observe the displayed performance metrics.
2. Answer the discussion question below.




In [None]:
# Your code here; add as many cells as you need
# Reinitializing the hyperparameters to match the best found in p1
BEST_MAX_LEN = 512
BEST_LR = 5e-6
BEST_BATCH = 8
EPOCHS = 3
EVAL_BATCH = 64
SEED = 42

# Store results
data_size_results = []

# Test different data subset fractions
subset_fractions = [0.25, 0.50, 0.75, 1.00]

for frac in subset_fractions:
    print(f"Training with SUBSET_FRAC = {frac}")

    keras.utils.set_random_seed(SEED)

    # Load and preprocess data with current fraction
    imdb = load_dataset("imdb")
    texts = list(imdb["train"]["text"]) + list(imdb["test"]["text"])
    labels = np.array(list(imdb["train"]["label"]) + list(imdb["test"]["label"]), dtype="int32")

    features = Features({"text": Value("string"),
                         "label": ClassLabel(num_classes=2, names=["NEG","POS"])})
    all_ds = Dataset.from_dict({"text": texts, "label": labels.tolist()}, features=features)

    # Take subset if needed
    if 0.0 < frac < 1.0:
        sub = all_ds.train_test_split(train_size=frac, seed=SEED, stratify_by_column="label")
        ds_pool = sub["train"]
    else:
        ds_pool = all_ds

    # 80/10/10 split
    splits = ds_pool.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
    train_val_pool, test_ds = splits["train"], splits["test"]
    splits2 = train_val_pool.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
    train_ds, val_ds = splits2["train"], splits2["test"]

    # Convert to numpy arrays
    X_tr = np.array(train_ds["text"], dtype=object)
    y_tr = np.array(train_ds["label"], dtype="int32")
    X_va = np.array(val_ds["text"], dtype=object)
    y_va = np.array(val_ds["label"], dtype="int32")
    X_te = np.array(test_ds["text"], dtype=object)
    y_te = np.array(test_ds["label"], dtype="int32")

    print(f"Train samples: {len(X_tr)}, Val samples: {len(X_va)}, Test samples: {len(X_te)}")

    # Build model with best hyperparameters
    preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
        "distil_bert_base_en_uncased", sequence_length=BEST_MAX_LEN
    )
    model = kh.models.DistilBertTextClassifier.from_preset(
        "distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc
    )

    model.compile(
        optimizer=keras.optimizers.Adam(BEST_LR),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
    )

    start = time.time()

    # Train with early stopping
    cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]
    history = model.fit(
        X_tr, y_tr,
        validation_data=(X_va, y_va),
        epochs=EPOCHS,
        batch_size=BEST_BATCH,
        callbacks=cb,
        verbose=1,
    )

    # Evaluate on test set
    logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0)
    y_pred = logits.argmax(axis=-1)

    acc_metric = evaluate.load("accuracy")
    f1_metric = evaluate.load("f1")
    acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
    f1 = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

    elapsed = time.time() - start

    # Get best validation metrics
    best_epoch = np.argmin(history.history['val_loss'])
    best_val_loss = history.history['val_loss'][best_epoch]
    best_val_acc = history.history['val_acc'][best_epoch]

    # Store results
    data_size_results.append({
        'subset_frac': frac,
        'train_samples': len(X_tr),
        'val_samples': len(X_va),
        'test_samples': len(X_te),
        'best_epoch': best_epoch + 1,
        'val_loss': best_val_loss,
        'val_acc': best_val_acc,
        'test_acc': acc,
        'test_f1': f1,
        'elapsed_time': elapsed,
        'time_formatted': time.strftime('%H:%M:%S', time.gmtime(elapsed))
    })

    print(f"\nBest epoch: {best_epoch + 1}")
    print(f"Val loss (best): {best_val_loss:.4f}, Val acc (best): {best_val_acc:.3f}")
    print(f"Test accuracy: {acc:.3f}, Test F1: {f1:.3f}")
    print(f"Elapsed time: {time.strftime('%H:%M:%S', time.gmtime(elapsed))}")

# Display results summary
df_data_results = pd.DataFrame(data_size_results)
print("Summary of data size scaling:")
print(df_data_results.to_string(index=False))

# Find best configuration
best_idx = df_data_results['val_acc'].idxmax()
best_result = df_data_results.loc[best_idx]  # idxmax returns an index label, so use .loc
print("Best result ------")
print(f"SUBSET_FRAC: {best_result['subset_frac']}")
print(f"Training samples: {best_result['train_samples']}")
print(f"Validation accuracy at min val loss: {best_result['val_acc']:.3f}")
print(f"Test accuracy: {best_result['test_acc']:.3f}")
print(f"Training time: {best_result['time_formatted']}")

Training with SUBSET_FRAC = 0.25
Train samples: 8750, Val samples: 1250, Test samples: 2500
Epoch 1/3
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 106s 56ms/step - acc: 0.8448 - loss: 0.3428 - val_acc: 0.9136 - val_loss: 0.2251
Epoch 2/3
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 31s 28ms/step - acc: 0.9275 - loss: 0.1929 - val_acc: 0.9096 - val_loss: 0.2288
Epoch 3/3
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 31s 28ms/step - acc: 0.9512 - loss: 0.1367 - val_acc: 0.9104 - val_loss: 0.2396

Best epoch: 1
Val loss (best): 0.2251, Val acc (best): 0.914
Test accuracy: 0.912, Test F1: 0.912
Elapsed time: 00:03:00
Training with SUBSET_FRAC = 0.5
Train samples: 17500, Val samples: 2500, Test samples: 5000
Epoch 1/3
2188/2188 ━━━━━━━━━━━━━━━━━━━━ 148s 47ms/step - acc: 0.8754 - loss: 0.2940 - val_acc: 0.9244 - val_loss: 0.1939
Epoch 2/3
2188/2188 ━━━━━━━━━━━━━━━━━━━━ 6

### Graded Questions

In [None]:
# Set a2a to the validation accuracy at min validation loss for your best configuration found in this problem
# (Yes, it is probably at 1.0!)

a2a = 0.931  # SUBSET_FRAC=1.0, full dataset

In [None]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a2a = {a2a:.4f}')

a2a = 0.9310


#### Question a2b:

Summarize what you observed as dataset size increased. Given that validation metrics are typically reliable to only about two decimal places, do the performance gains justify using the entire dataset? What trade-offs between accuracy and computation time did you notice?

#### Your Answer Here:
As dataset size increased from 25% to 100%, validation accuracy rose from 0.914 to 0.931, a gain of 1.7 percentage points. The gains were uneven: moving from 25% to 50% added 1.0 percentage points, 50% to 75% added only 0.14 points, and 75% to 100% added 0.53 points. Since validation metrics are reliable to only about two decimal places, the two later steps are too small to trust on their own, and only the overall 0.91 to 0.93 difference is clearly meaningful.

Training time grew with data size, but less than proportionally: from 3 minutes at 25% to just over 7 minutes at 100%, partly because every run pays a fixed start-up cost in its first epoch. Whether the full dataset is justified depends on the application. For research or production systems where every percentage point matters, the extra 4 minutes is worthwhile. For rapid prototyping or resource-constrained scenarios, 50% of the data reaches 0.92 accuracy in just under 5 minutes, a good balance of accuracy and computation.
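
One rough way to see why validation accuracy is only trustworthy to about two decimal places is the binomial standard error of an accuracy estimate, `sqrt(p * (1 - p) / n)`. This is a back-of-the-envelope sketch that treats validation examples as independent draws; the helper name `accuracy_std_error` is made up for illustration. It uses the validation accuracies quoted above and the validation set sizes printed by the runs (1250 at 25%, 5000 at 100%).

```python
import math

def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1.0 - acc) / n)

# (validation accuracy, validation set size) at SUBSET_FRAC = 0.25 and 1.00.
for acc, n in [(0.914, 1250), (0.931, 5000)]:
    se = accuracy_std_error(acc, n)
    print(f"acc={acc:.3f} n={n} std_error={se:.4f}")
# acc=0.914 n=1250 std_error=0.0079
# acc=0.931 n=5000 std_error=0.0036
```

Under this approximation, differences of a few tenths of a percentage point between runs fall within one standard error, so they should not be read as real improvements.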

# Problem 3 — Model swap: speed vs. accuracy (why: capacity matters)

In this problem we will compare encoder-only backbones of different sizes.

**Setup.** Keep the best `MAX_LEN`, `LR`, and `SUBSET_FRAC` from Problems 1–2. Only change the model/preset:

* **DistilBERT** (current baseline)
* **MobileBERT** (smaller/faster)
* **BERT-base** (larger/usually stronger)

**How to switch (two lines each).**

* DistilBERT:

  ```python
  preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset("distil_bert_base_en_uncased", sequence_length=MAX_LEN)
  model  = kh.models.DistilBertTextClassifier.from_preset("distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc)
  ```
* MobileBERT:

  ```python
  preproc = kh.models.MobileBertTextClassifierPreprocessor.from_preset("mobile_bert_en_uncased", sequence_length=MAX_LEN)
  model  = kh.models.MobileBertTextClassifier.from_preset("mobile_bert_en_uncased", num_classes=2, preprocessor=preproc)
  ```
* BERT-base:

  ```python
  preproc = kh.models.BertTextClassifierPreprocessor.from_preset("bert_base_en_uncased", sequence_length=MAX_LEN)
  model  = kh.models.BertTextClassifier.from_preset("bert_base_en_uncased", num_classes=2, preprocessor=preproc)
  ```
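
Since the three snippets above differ only in the class-name prefix and the preset string, the swap can be driven by a small lookup table instead of an `if`/`elif` chain. This is one possible sketch, not the required approach. `MODEL_REGISTRY` and `build_classifier` are made-up names, and the prefixes and presets are copied from the snippets above; `getattr` raises `AttributeError` if the installed `keras_hub` does not provide a given class.

```python
# Class-name prefixes and preset names copied from the snippets above.
MODEL_REGISTRY = {
    "DistilBERT": ("DistilBert", "distil_bert_base_en_uncased"),
    "MobileBERT": ("MobileBert", "mobile_bert_en_uncased"),
    "BERT-base": ("Bert", "bert_base_en_uncased"),
}

def build_classifier(models_module, name, max_len, num_classes=2):
    """Build a preprocessor + classifier pair for one registry entry.

    `models_module` is expected to be `keras_hub.models` (passed in so this
    helper does not import keras_hub itself).
    """
    prefix, preset = MODEL_REGISTRY[name]
    preproc_cls = getattr(models_module, f"{prefix}TextClassifierPreprocessor")
    model_cls = getattr(models_module, f"{prefix}TextClassifier")
    preproc = preproc_cls.from_preset(preset, sequence_length=max_len)
    return model_cls.from_preset(
        preset, num_classes=num_classes, preprocessor=preproc
    )

# Usage in the notebook (assumes `import keras_hub as kh` has already run):
# model = build_classifier(kh.models, "BERT-base", max_len=512)
```

Passing the module in as an argument keeps the helper easy to test and makes it obvious that nothing else about the training setup changes between models.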

**What to do.**

1. Train/evaluate each model once with identical settings.
2. Observe the performance metrics for each.
3. Answer the graded questions.



In [7]:
# import pandas
import pandas as pd

# Your code here; add as many cells as you wish
# Reinitializing the hyperparameters to match the best found in p1 and p2
BEST_MAX_LEN = 512
BEST_LR = 5e-6
BEST_BATCH = 8
BEST_SUBSET_FRAC = 1.0  # Use full dataset for fair comparison
EPOCHS = 3
EVAL_BATCH = 64
SEED = 42

# Load and preprocess data once (full dataset)
keras.utils.set_random_seed(SEED)

imdb = load_dataset("imdb")
texts = list(imdb["train"]["text"]) + list(imdb["test"]["text"])
labels = np.array(list(imdb["train"]["label"]) + list(imdb["test"]["label"]), dtype="int32")

features = Features({"text": Value("string"),
                     "label": ClassLabel(num_classes=2, names=["NEG","POS"])})
all_ds = Dataset.from_dict({"text": texts, "label": labels.tolist()}, features=features)

# Use full dataset
ds_pool = all_ds

# 80/10/10 split
splits = ds_pool.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
train_val_pool, test_ds = splits["train"], splits["test"]
splits2 = train_val_pool.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
train_ds, val_ds = splits2["train"], splits2["test"]

# Convert to numpy arrays
X_tr = np.array(train_ds["text"], dtype=object)
y_tr = np.array(train_ds["label"], dtype="int32")
X_va = np.array(val_ds["text"], dtype=object)
y_va = np.array(val_ds["label"], dtype="int32")
X_te = np.array(test_ds["text"], dtype=object)
y_te = np.array(test_ds["label"], dtype="int32")

print(f"Train samples: {len(X_tr)}, Val samples: {len(X_va)}, Test samples: {len(X_te)}")

# Store results
model_results = []


# Define models to test - only DistilBERT and BERT-base
models_to_test = [
    ('DistilBERT', 'distil_bert_base_en_uncased', 'DistilBert'),
    ('BERT-base', 'bert_base_en_uncased', 'Bert'),
]

# for loop to train each model and report stats
for model_name, preset, model_type in models_to_test:
    print(f"Training: {model_name}")

    keras.utils.set_random_seed(SEED)

    # Build preprocessor and model based on type
    if model_type == 'DistilBert':
        preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
            preset, sequence_length=BEST_MAX_LEN
        )
        model = kh.models.DistilBertTextClassifier.from_preset(
            preset, num_classes=2, preprocessor=preproc
        )
    elif model_type == 'Bert':
        preproc = kh.models.BertTextClassifierPreprocessor.from_preset(
            preset, sequence_length=BEST_MAX_LEN
        )
        model = kh.models.BertTextClassifier.from_preset(
            preset, num_classes=2, preprocessor=preproc
        )

    model.compile(
        optimizer=keras.optimizers.Adam(BEST_LR),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
    )

    start = time.time()

    # Train with early stopping
    cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]
    history = model.fit(
        X_tr, y_tr,
        validation_data=(X_va, y_va),
        epochs=EPOCHS,
        batch_size=BEST_BATCH,
        callbacks=cb,
        verbose=1,
    )

    # Evaluate on test set
    logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0)
    y_pred = logits.argmax(axis=-1)

    acc_metric = evaluate.load("accuracy")
    f1_metric = evaluate.load("f1")
    acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
    f1 = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

    elapsed = time.time() - start

    # Get best validation metrics
    best_epoch = np.argmin(history.history['val_loss'])
    best_val_loss = history.history['val_loss'][best_epoch]
    best_val_acc = history.history['val_acc'][best_epoch]

    # Calculate average time per epoch
    epochs_run = len(history.history['val_loss'])
    avg_time_per_epoch = elapsed / epochs_run

    # Store results
    model_results.append({
        'model': model_name,
        'best_epoch': best_epoch + 1,
        'total_epochs': epochs_run,
        'val_loss': best_val_loss,
        'val_acc': best_val_acc,
        'test_acc': acc,
        'test_f1': f1,
        'total_time': elapsed,
        'avg_time_per_epoch': avg_time_per_epoch,
        'time_formatted': time.strftime('%H:%M:%S', time.gmtime(elapsed))
    })

    print(f"\nBest epoch: {best_epoch + 1}")
    print(f"Val loss (best): {best_val_loss:.4f}, Val acc (best): {best_val_acc:.3f}")
    print(f"Test accuracy: {acc:.3f}, Test F1: {f1:.3f}")
    print(f"Total time: {time.strftime('%H:%M:%S', time.gmtime(elapsed))}")
    print(f"Avg time per epoch: {avg_time_per_epoch:.1f}s")


# Display results summary
df_model_results = pd.DataFrame(model_results)
print("Summary of model comparison:")
print(df_model_results.to_string(index=False))

# Find best model by validation accuracy (idxmax returns an index label, so use .loc)
best_idx = df_model_results['val_acc'].idxmax()
best_model = df_model_results.loc[best_idx]

# Find fastest model by average time per epoch
fastest_idx = df_model_results['avg_time_per_epoch'].idxmin()
fastest_model = df_model_results.loc[fastest_idx]

print("Best Accuracy Model -----")
print(f"Model: {best_model['model']}")
print(f"Validation accuracy: {best_model['val_acc']:.3f}")
print(f"Test accuracy: {best_model['test_acc']:.3f}")
print(f"Test F1: {best_model['test_f1']:.3f}")

print("Fastest Model ------")
print(f"Model: {fastest_model['model']}")
print(f"Avg time per epoch: {fastest_model['avg_time_per_epoch']:.1f}s")
print(f"Validation accuracy: {fastest_model['val_acc']:.3f}")


Train samples: 35000, Val samples: 5000, Test samples: 10000
Training: DistilBERT
Epoch 1/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 174s 29ms/step - acc: 0.8936 - loss: 0.2572 - val_acc: 0.9308 - val_loss: 0.1833
Epoch 2/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 122s 28ms/step - acc: 0.9372 - loss: 0.1690 - val_acc: 0.9306 - val_loss: 0.1844
Epoch 3/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 124s 28ms/step - acc: 0.9561 - loss: 0.1249 - val_acc: 0.9344 - val_loss: 0.1890

Best epoch: 1
Val loss (best): 0.1833, Val acc (best): 0.931
Test accuracy: 0.922, Test F1: 0.924
Total time: 00:07:16
Avg time per epoch: 145.5s
Training: BERT-base
Epoch 1/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 333s 57ms/step - acc: 0.9055 - loss: 0.2349 - val_acc: 0.9392 - val_loss: 0.1624
Epoch 2/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 243s 56ms/step - acc: 0.9521 - loss: 0.1361 - val_ac

### Graded Questions

In [1]:
# Set a3a to the validation accuracy at min validation loss for your best model found in this problem

a3a = 0.939  # BERT-base achieved the best validation accuracy

In [2]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a3a = {a3a:.4f}')

a3a = 0.9390


#### Question a3b:

**Answer briefly.**

* Which model gives the best **accuracy/F1**?
* Which is **fastest** per epoch?
* Given limited development time or compute resources, which model is the best **overall choice** and why?

#### Your Answer Here:
BERT-base gives the best accuracy and F1: 0.939 validation accuracy, 0.929 test accuracy, and 0.931 test F1, ahead of DistilBERT by about 0.7 to 0.8 percentage points. DistilBERT is the fastest per epoch, averaging 145.5 seconds versus 276.6 seconds for BERT-base, so it is nearly twice as fast. Given limited development time or compute resources, DistilBERT is the best overall choice: it reaches 0.931 validation accuracy in about half the training time. The accuracy gap of under one percentage point is modest and will not justify doubling training time for most applications. DistilBERT suits rapid iteration during development, while BERT-base is better reserved for production systems where maximum accuracy matters more than computational cost.