## Homework 9: Text Classification with Fine-Tuned BERT

### Due: Midnight on November 5th (with 2-hour grace period) — Worth 85 points

In this final homework, we’ll explore **fine-tuning a pre-trained Transformer model (BERT)** for text classification using the **IMDB Movie Review** dataset. You’ll begin with a working baseline notebook and then conduct a series of controlled experiments to understand how data size, context length, and model architecture affect performance.

You’ll complete three problems:

* **Problem 1:** Evaluate how **sequence length** and **learning rate** jointly influence validation loss and generalization.
* **Problem 2:** Measure how **training data size** affects both model performance and total training time.
* **Problem 3:** Compare **two additional models** from the BERT family to analyze the trade-offs between model size and accuracy on this dataset.

In each problem, you’ll report your key metrics, summarize what you observed, and reflect on what you learned.

> **Note:** This homework was developed and tested on **Google Colab**, due to version conflicts when running locally. It is **strongly recommended** that you complete your work on Colab as well.

There are 6 graded questions across the three problems, each worth 14 points, and you get one free point if you complete the entire homework.


In [1]:
# Install once per new Colab runtime
%pip -q install -U keras keras-hub tensorflow tensorflow-text datasets evaluate

Note: you may need to restart the kernel to use updated packages.


In [2]:

import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import time
import random
import numpy as np
import keras
import keras_hub as kh
import evaluate
from datasets import load_dataset, Dataset, Features, Value, ClassLabel

from keras import mixed_precision                    # generally faster
mixed_precision.set_global_policy("mixed_float16")


### Here is where you can set global hyperparameters for this homework

In [3]:
# ---------------- Config ----------------
SEED        = 42
MAX_LEN     = 128
EPOCHS      = 3
BATCH       = 32
EVAL_BATCH  = 64
SUBSET_FRAC = 0.25   # <-- 0.25 to train and test on 25% of whole dataset during development;  set to 1.0 for full dataset

keras.utils.set_random_seed(SEED)

### Load and Preprocess the IMDB Movie Review Dataset

In [4]:
# ---- Load IMDb (raw), join train+test ----
imdb   = load_dataset("imdb")
texts  = list(imdb["train"]["text"]) + list(imdb["test"]["text"])
labels = np.array(list(imdb["train"]["label"]) + list(imdb["test"]["label"]), dtype="int32")

# ---- Build DS with explicit features (label=ClassLabel) ----
features = Features({"text": Value("string"),
                     "label": ClassLabel(num_classes=2, names=["NEG","POS"])})
all_ds = Dataset.from_dict({"text": texts, "label": labels.tolist()}, features=features)

# ---- Optional: take a stratified subset of the FULL dataset ----
if 0.0 < SUBSET_FRAC < 1.0:
    sub = all_ds.train_test_split(train_size=SUBSET_FRAC, seed=SEED, stratify_by_column="label")
    ds_pool = sub["train"]
else:
    ds_pool = all_ds

# ---- Stratified 70/10/20 (train/val/test) split on the (possibly smaller) pool ----
# First: 80/20 train+val pool / test
splits = ds_pool.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
train_val_pool, test_ds = splits["train"], splits["test"]
# Then: carve 10% of full (i.e., 0.125 of the 80% pool) as validation
splits2 = train_val_pool.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
train_ds, val_ds = splits2["train"], splits2["test"]

# ---- Numpy arrays for Keras fit/predict ----
X_tr = np.array(train_ds["text"], dtype=object); y_tr = np.array(train_ds["label"], dtype="int32")
X_va = np.array(val_ds["text"],   dtype=object); y_va = np.array(val_ds["label"],   dtype="int32")
X_te = np.array(test_ds["text"],  dtype=object); y_te = np.array(test_ds["label"],  dtype="int32")

# ---- Quick summary ----
def _counts(ds):
    arr = np.array(ds["label"], dtype=int)
    return len(arr), np.bincount(arr, minlength=2).tolist()
print(f"Pool after SUBSET_FRAC={SUBSET_FRAC}: {len(ds_pool)} (of {len(all_ds)})")
print("Train:", _counts(train_ds), " Val:", _counts(val_ds), " Test:", _counts(test_ds))


Pool after SUBSET_FRAC=0.25: 12500 (of 50000)
Train: (8750, [4375, 4375])  Val: (1250, [625, 625])  Test: (2500, [1250, 1250])
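
The split sizes printed above follow directly from the three fractions used in the cell: the pool is cut 80/20 into train+val and test, and 0.125 of the 80% becomes validation, which works out to 70/10/20 of the pool. As a quick sanity check, here is a minimal sketch that reproduces the arithmetic. `split_sizes` is a hypothetical helper, and it assumes the sizes round cleanly, which holds for the fractions used in this homework.

```python
def split_sizes(n_total, subset_frac, test_frac=0.20, val_frac_of_rest=0.125):
    """Return (train, val, test) sizes for the subset -> 80/20 -> 87.5/12.5 scheme."""
    pool = round(n_total * subset_frac)      # examples kept by SUBSET_FRAC
    test = round(pool * test_frac)           # 20% of the pool
    rest = pool - test                       # 80% train+val pool
    val = round(rest * val_frac_of_rest)     # 0.125 * 0.80 = 10% of the pool
    train = rest - val                       # remaining 70% of the pool
    return train, val, test


for frac in (0.25, 0.50, 0.75, 1.00):
    print(frac, split_sizes(50_000, frac))
```

For `SUBSET_FRAC = 0.25` this gives 8750 / 1250 / 2500, matching the summary line printed by the cell.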


### Build and train a baseline DistilBERT text classifier

In [5]:
# ---- Keras Hub preprocessor + classifier ----
preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
    "distil_bert_base_en_uncased", sequence_length=MAX_LEN
)
model = kh.models.DistilBertTextClassifier.from_preset(
    "distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc
)

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)

start = time.time()

# ---- Train with early stopping (restore best val weights) ----
cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]
history = model.fit(
    X_tr, y_tr,
    validation_data=(X_va, y_va),
    epochs=EPOCHS,
    batch_size=BATCH,
    callbacks=cb,
    verbose=1,
)

# ---- Evaluate (accuracy + F1 via `evaluate`) ----
logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0)
y_pred = logits.argmax(axis=-1)

acc_metric = evaluate.load("accuracy")
f1_metric  = evaluate.load("f1")
acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
f1  = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

# Tiny confusion matrix helper (no sklearn needed)
def confusion_matrix_np(y_true, y_pred, num_classes=2):
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

print(f"\nValidation acc (best epoch): {history.history['val_acc'][np.argmin(history.history['val_loss'])]:.3f}")
print(f"\nTest accuracy: {acc:.3f}   Test F1: {f1:.3f}")
print("\nConfusion matrix:\n", confusion_matrix_np(y_te, y_pred))

end = time.time() - start
print("\nElapsed time:", time.strftime("%H:%M:%S", time.gmtime(end)))

I0000 00:00:1761804959.255468   42625 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21456 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9


Epoch 1/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 63s 131ms/step - acc: 0.7825 - loss: 0.4529 - val_acc: 0.8384 - val_loss: 0.3449
Epoch 2/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 9s 31ms/step - acc: 0.8787 - loss: 0.2896 - val_acc: 0.8592 - val_loss: 0.3399
Epoch 3/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 8s 30ms/step - acc: 0.9158 - loss: 0.2205 - val_acc: 0.8584 - val_loss: 0.3556

Validation acc (best epoch): 0.859

Test accuracy: 0.853   Test F1: 0.850

Confusion matrix:
 [[1092  158]
 [ 209 1041]]

Elapsed time: 00:01:34
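
The reported accuracy and F1 can be recomputed by hand from the confusion matrix above, which is a useful check that the metrics and the matrix agree. The sketch below assumes rows are true labels and columns are predictions (as in `confusion_matrix_np`) and that F1 is computed for the positive class (label 1), which is the default behaviour for binary F1.

```python
import numpy as np

# Confusion matrix from the baseline run: rows = true label, cols = predicted label
cm = np.array([[1092, 158],
               [209, 1041]])

tn, fp = cm[0]   # true NEG row: correctly rejected, falsely flagged as POS
fn, tp = cm[1]   # true POS row: missed positives, correctly detected

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The model misses more positive reviews (209) than it mislabels negative ones (158), so recall on the positive class comes out lower than precision.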


# Problem 1 — Mini sweep: context length × learning rate (6 runs)

In this problem we'll see how much **context length** (`MAX_LEN`) helps, and how sensitive fine-tuning is to **learning rate**—without running a huge grid.

## Setup (keep these fixed)

* `SUBSET_FRAC = 0.25` (use only this fraction of the whole dataset)
* `EPOCHS = 3`
* `BATCH = 32` at `MAX_LEN = 128` (the longer contexts use smaller batches; see the configurations below)
* **EarlyStopping** with `restore_best_weights=True`
* Same random `SEED` for all runs
* Same data split for all runs (don’t reshuffle between runs)

### Run these 6 configurations

**For each** `MAX_LEN ∈ {128, 256, 512}`, try **two** learning rates:

* **MAX_LEN = 128**

  * `(LR = 2e-5, BATCH = 32)` – healthy default for shorter contexts.
  * `(LR = 1e-5, BATCH = 32)` – conservative LR; often slightly more stable.

* **MAX_LEN = 256**

  * `(LR = 1e-5, BATCH = 16)` – longer context → lower batch.
  * `(LR = 7.5e-6, BATCH = 16)` – even steadier if loss is noisy.

* **MAX_LEN = 512**  *(heavier quadratic attention cost)*

  * `(LR = 7.5e-6, BATCH = 8)` – safe starting point.
  * `(LR = 5e-6, BATCH = 8)` – extra caution for stability.
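
To see why the longer contexts are paired with smaller batches, a rough back-of-the-envelope sketch helps. Self-attention builds a score matrix with one entry per pair of tokens, so that term grows with the square of the sequence length. The helper below is hypothetical and counts only score-matrix entries per head and per layer; it ignores the feed-forward layers and all constant factors.

```python
def attention_score_entries(seq_len, batch_size):
    """Entries in the attention score matrices for one batch (per head, per layer)."""
    return batch_size * seq_len * seq_len


base = attention_score_entries(128, 32)
for seq_len, batch in [(128, 32), (256, 16), (512, 8)]:
    entries = attention_score_entries(seq_len, batch)
    tokens = seq_len * batch
    print(f"MAX_LEN={seq_len:3d} BATCH={batch:2d} tokens/batch={tokens} "
          f"score entries={entries} ({entries / base:.0f}x baseline)")
```

Halving the batch each time the length doubles keeps the number of tokens per batch constant at 4,096, but the score matrices still double in size at each step.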

**If you hit an Out Of Memory error:**

* At **256** with `BATCH = 16`, drop to `BATCH = 8`.
* At **512** with `BATCH = 8`, drop to `BATCH = 4`.
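
One way to automate this fallback is to retry a run with progressively smaller batches. The sketch below is an illustration under stated assumptions, not part of the assignment: `run_with_batch_fallback` and `train_fn` are hypothetical names, and the caller passes in the exception type that signals out-of-memory (with the TensorFlow backend this is typically `tf.errors.ResourceExhaustedError`).

```python
def run_with_batch_fallback(train_fn, batch_sizes, oom_errors=(MemoryError,)):
    """Call train_fn(batch_size) with each batch size in turn until one succeeds.

    train_fn:    hypothetical callable that builds, fits, and returns a result
    batch_sizes: batch sizes to try, largest first, e.g. [16, 8]
    oom_errors:  exception types treated as out-of-memory
    """
    last_error = None
    for batch_size in batch_sizes:
        try:
            return batch_size, train_fn(batch_size)
        except oom_errors as err:
            print(f"Out of memory at BATCH={batch_size}; trying a smaller batch.")
            last_error = err
    raise RuntimeError("All batch sizes ran out of memory") from last_error


# Toy stand-in for a training run: pretend anything above 8 does not fit.
def fake_train(batch_size):
    if batch_size > 8:
        raise MemoryError("simulated OOM")
    return {"batch_size": batch_size}


used, result = run_with_batch_fallback(fake_train, [16, 8])
print(used, result)
```

In a real run, `train_fn` would also need to rebuild the model and clear the session on each attempt, since a failed attempt can leave memory allocated.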


Then answer the graded questions.


In [6]:
# Your code here; add as many cells as you need

import pandas as pd
from tqdm import tqdm  # <--- Use standard console tqdm (NOT .notebook)
import warnings

# Suppress some Keras Hub/TensorFlow warnings (Optional)
warnings.filterwarnings("ignore", category=UserWarning, module="tensorflow")

# --- 1. Problem 1 Experiment Configuration ---
p1_configs = [
    {"name": "P1.1 (128, 2e-5)", "max_len": 128, "lr": 2e-5, "batch_size": 32},
    {"name": "P1.2 (128, 1e-5)", "max_len": 128, "lr": 1e-5, "batch_size": 32},
    {"name": "P1.3 (256, 1e-5)", "max_len": 256, "lr": 1e-5, "batch_size": 16},
    {"name": "P1.4 (256, 7.5e-6)", "max_len": 256, "lr": 7.5e-6, "batch_size": 16},
    {"name": "P1.5 (512, 7.5e-6)", "max_len": 512, "lr": 7.5e-6, "batch_size": 8},
    {"name": "P1.6 (512, 5e-6)", "max_len": 512, "lr": 5e-6, "batch_size": 8},
]

# This list will store the results in memory
p1_results = [] 

# --- 2. Execute Experiment Sweep ---
print(f"--- Starting Problem 1 Sweep (6 runs) ---")
# Make sure X_tr, X_va, SUBSET_FRAC etc. exist from running baseline cells
print(f"Using SUBSET_FRAC={SUBSET_FRAC} (Train={len(X_tr)}, Val={len(X_va)})")
print(f"Fixed EPOCHS={EPOCHS}, SEED={SEED}\n")

for config in tqdm(p1_configs, desc="Running Problem 1 Sweep"):
    
    print(f"\n--- [Running] {config['name']}: MAX_LEN={config['max_len']}, LR={config['lr']}, BATCH={config['batch_size']} ---")
    
    keras.backend.clear_session()
    keras.utils.set_random_seed(SEED) 
    
    max_len = config['max_len']
    lr = config['lr']
    batch_size = config['batch_size']
    
    preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
        "distil_bert_base_en_uncased", sequence_length=max_len
    )
    model = kh.models.DistilBertTextClassifier.from_preset(
        "distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc
    )
    
    model.compile(
        optimizer=keras.optimizers.Adam(lr), 
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
    )
    
    start = time.time() 
    cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]
    
    history = model.fit(
        X_tr, y_tr,                 
        validation_data=(X_va, y_va), 
        epochs=EPOCHS,              
        batch_size=batch_size,      
        callbacks=cb,
        verbose=0, 
    )
    
    end = time.time() - start
    
    # Ensure acc_metric and f1_metric were loaded in baseline cells
    logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0) 
    y_pred = logits.argmax(axis=-1)
    
    test_acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
    test_f1  = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]
    
    best_epoch_idx = np.argmin(history.history['val_loss']) 
    best_val_loss = history.history['val_loss'][best_epoch_idx]
    best_val_acc = history.history['val_acc'][best_epoch_idx]

    result = {
        "name": config['name'],
        "max_len": max_len,
        "lr": lr,
        "batch_size": batch_size,
        "best_val_acc": best_val_acc,
        "best_val_loss": best_val_loss,
        "test_acc": test_acc,
        "test_f1": test_f1,
        "time_sec": end,
        "time_str": time.strftime("%M:%S", time.gmtime(end)) 
    }
    p1_results.append(result)
    
    print(f"--- [Finished] {config['name']}. Best Val Acc: {best_val_acc:.4f}, Test Acc: {test_acc:.4f}, Time: {result['time_str']} ---")

print("\n\n" + "="*50)
print("           Problem 1: All 6 runs COMPLETED.")
print(f"Data is now stored in the 'p1_results' list ({len(p1_results)} items).")
print("Run the next cell to display the summary table.")
print("="*50)

--- Starting Problem 1 Sweep (6 runs) ---
Using SUBSET_FRAC=0.25 (Train=8750, Val=1250)
Fixed EPOCHS=3, SEED=42



Running Problem 1 Sweep:   0%|          | 0/6 [00:00<?, ?it/s]

--- [Running] P1.1 (128, 2e-5): MAX_LEN=128, LR=2e-05, BATCH=32 ---
Running Problem 1 Sweep:  17%|█▋        | 1/6 [01:42<08:31, 102.32s/it]
--- [Finished] P1.1 (128, 2e-5). Best Val Acc: 0.8552, Test Acc: 0.8480, Time: 01:31 ---

--- [Running] P1.2 (128, 1e-5): MAX_LEN=128, LR=1e-05, BATCH=32 ---
Running Problem 1 Sweep:  33%|███▎      | 2/6 [03:13<06:22, 95.58s/it]
--- [Finished] P1.2 (128, 1e-5). Best Val Acc: 0.8568, Test Acc: 0.8444, Time: 01:21 ---

--- [Running] P1.3 (256, 1e-5): MAX_LEN=256, LR=1e-05, BATCH=16 ---
Running Problem 1 Sweep:  50%|█████     | 3/6 [05:24<05:35, 111.75s/it]
--- [Finished] P1.3 (256, 1e-5). Best Val Acc: 0.9064, Test Acc: 0.8928, Time: 01:59 ---

--- [Running] P1.4 (256, 7.5e-6): MAX_LEN=256, LR=7.5e-06, BATCH=16 ---
Running Problem 1 Sweep:  67%|██████▋   | 4/6 [07:19<03:46, 113.07s/it]
--- [Finished] P1.4 (256, 7.5e-6). Best Val Acc: 0.9032, Test Acc: 0.8936, Time: 01:46 ---

--- [Running] P1.5 (512, 7.5e-6): MAX_LEN=512, LR=7.5e-06, BATCH=8 ---
Running Problem 1 Sweep:  83%|████████▎ | 5/6 [11:08<02:35, 155.04s/it]
--- [Finished] P1.5 (512, 7.5e-6). Best Val Acc: 0.9144, Test Acc: 0.9108, Time: 03:35 ---

--- [Running] P1.6 (512, 5e-6): MAX_LEN=512, LR=5e-06, BATCH=8 ---
Running Problem 1 Sweep: 100%|██████████| 6/6 [14:37<00:00, 146.31s/it]
--- [Finished] P1.6 (512, 5e-6). Best Val Acc: 0.9120, Test Acc: 0.9124, Time: 03:18 ---


           Problem 1: All 6 runs COMPLETED.
Data is now stored in the 'p1_results' list (6 items).
Run the next cell to display the summary table.





In [7]:
print("\n\n" + "="*50)
print("           Problem 1: Final Results Summary")
print("="*50)

# Set pandas display options (optional, but nice)
pd.set_option('display.float_format', '{:.6f}'.format)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)

df_p1 = pd.DataFrame(p1_results)

# Formatting for easier analysis
df_p1['lr_str'] = df_p1['lr'].apply(lambda x: f"{x:.1e}") # Format LR in scientific notation
df_p1_display = df_p1[[
    'name', 'max_len', 'lr_str', 'batch_size', 
    'best_val_acc', 'test_acc', 'test_f1', 'time_str'
]].round(4)

print(df_p1_display)

# --- 5. Auto-set a1a Answer ---
# Find the row with the highest 'best_val_acc'
best_run_p1 = df_p1.loc[df_p1['best_val_acc'].idxmax()]
a1a_value = best_run_p1['best_val_acc'] 

print(f"\n--- Graded Answer (a1a) ---")
print(f"Best run detected: {best_run_p1['name']}")
print(f"Best Validation Accuracy (at min val_loss): {a1a_value:.4f}")



           Problem 1: Final Results Summary
                 name  max_len   lr_str  batch_size  best_val_acc  test_acc  test_f1 time_str
0    P1.1 (128, 2e-5)      128  2.0e-05          32      0.855200  0.848000 0.854200    01:31
1    P1.2 (128, 1e-5)      128  1.0e-05          32      0.856800  0.844400 0.844500    01:21
2    P1.3 (256, 1e-5)      256  1.0e-05          16      0.906400  0.892800 0.893600    01:59
3  P1.4 (256, 7.5e-6)      256  7.5e-06          16      0.903200  0.893600 0.895300    01:46
4  P1.5 (512, 7.5e-6)      512  7.5e-06           8      0.914400  0.910800 0.910700    03:35
5    P1.6 (512, 5e-6)      512  5.0e-06           8      0.912000  0.912400 0.911900    03:18

--- Graded Answer (a1a) ---
Best run detected: P1.5 (512, 7.5e-6)
Best Validation Accuracy (at min val_loss): 0.9144


### Graded Questions

In [8]:
# Set a1a to the validation accuracy at min validation loss for your best configuration found in this problem

a1a = float(best_run_p1['best_val_acc'])      # validation accuracy at min val_loss for the best run

In [9]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a1a = {a1a:.4f}')

a1a = 0.9144


#### Question a1b:

* Does **more context** (128 → 256 → 512) consistently help?
* How much effect did the learning rate have on the validation accuracy?


#### Your Answer Here:

Based on the 6 runs, here is the analysis:

* **Does more context (128 $\rightarrow$ 256 $\rightarrow$ 512) consistently help?**
    
    Yes. In this sweep, more context improved validation accuracy at every step, with diminishing returns.
    * Increasing `MAX_LEN` from **128 to 256** (comparing the best runs, P1.2 vs. P1.3) raised `best_val_acc` by **4.96 percentage points** (from `0.8568` to `0.9064`).
    * Increasing `MAX_LEN` again from **256 to 512** (P1.3 vs. P1.5) added a further **0.8 percentage points** (from `0.9064` to `0.9144`), which is 10 of the 1,250 validation examples.
    
    The likely explanation is that `MAX_LEN=128` **truncates** many reviews and cuts off useful information. With a longer window the model sees more of each review, including any verdict stated near the end. Note that batch size and learning rate changed along with `MAX_LEN`, so the gain cannot be attributed to context length alone. The improvement also came at a computational cost: the `MAX_LEN=512` run (P1.5) took **~2.65x as long** as the `MAX_LEN=128` run (P1.2) (`03:35` vs. `01:21`).

* **How much effect did the learning rate have on the validation accuracy?**
    
    Within each `MAX_LEN`, the learning rate had only a **small effect**: the two runs never differed by more than 0.32 percentage points in validation accuracy (at most 4 of 1,250 examples), far less than the effect of context length.
    * At `MAX_LEN=128`, `LR=2e-5` (Val Acc: `0.8552`) was marginally below `LR=1e-5` (Val Acc: `0.8568`), although its test accuracy was higher (`0.8480` vs. `0.8444`).
    * At `MAX_LEN=256`, `LR=1e-5` (Val Acc: `0.9064`) was slightly ahead of `LR=7.5e-6` (Val Acc: `0.9032`), with test accuracy essentially tied (`0.8928` vs. `0.8936`).
    * At `MAX_LEN=512`, `LR=7.5e-6` (Val Acc: `0.9144`) was slightly ahead of `LR=5e-6` (Val Acc: `0.9120`), but the ordering reversed on the test set (`0.9108` vs. `0.9124`).
    * The best-scoring LR was lower in the setting with the smaller batch (`1e-5` at `BATCH=16`, `7.5e-6` at `BATCH=8`), which fits the usual advice to lower the LR as the batch shrinks. However, each `MAX_LEN` was tested with a different pair of LRs, and the differences are small enough to be plausibly noise, so this sweep does not establish that trend.
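
To put the accuracy differences discussed above on a common scale, it can help to convert them into percentage points and into a number of validation examples. The sketch below uses the `best_val_acc` values from the results table and the validation set size of 1,250; the variable and function names are illustrative.

```python
N_VAL = 1250  # validation examples at SUBSET_FRAC = 0.25

# best_val_acc per run, copied from the Problem 1 results table
val_acc = {
    "P1.1": 0.8552, "P1.2": 0.8568,   # MAX_LEN = 128
    "P1.3": 0.9064, "P1.4": 0.9032,   # MAX_LEN = 256
    "P1.5": 0.9144, "P1.6": 0.9120,   # MAX_LEN = 512
}


def compare(run_a, run_b):
    """Return (percentage-point gap, gap in validation examples) of run_b over run_a."""
    diff = val_acc[run_b] - val_acc[run_a]
    return round(diff * 100, 2), round(diff * N_VAL)


print("128 -> 256 (P1.2 vs P1.3):", compare("P1.2", "P1.3"))
print("256 -> 512 (P1.3 vs P1.5):", compare("P1.3", "P1.5"))
print("LR at 512  (P1.6 vs P1.5):", compare("P1.6", "P1.5"))
```

Seen this way, the step from 128 to 256 tokens changes the outcome for dozens of validation examples, while the learning-rate choice at a fixed length changes it for only a handful.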

## Problem 2 — How much data is enough?

In this problem, you’ll investigate how model performance scales with dataset size.

**Setup.**
Use the best `MAX_LEN` and `LR` values you found in **Problem 1**.

**What to do:**

1. For each value of `SUBSET_FRAC ∈ {0.25, 0.50, 0.75, 1.00}`, train your model once and observe the displayed performance metrics.
2. Answer the discussion question below.




In [10]:
# Your code here; add as many cells as you need

# Suppress some Keras Hub/TensorFlow warnings (Optional)
warnings.filterwarnings("ignore", category=UserWarning, module="tensorflow")

# --- 1. Problem 2 Experiment Configuration ---
p2_fractions = [0.25, 0.50, 0.75, 1.00]

# --- Best config from Problem 1 ---
# (From P1.5: MAX_LEN=512, LR=7.5e-6, BATCH=8)
BEST_MAX_LEN = 512
BEST_LR = 7.5e-6
BEST_BATCH = 8

# This list will store the results
p2_results = [] 

print(f"--- Starting Problem 2 Sweep (4 runs) ---")
print(f"Using Best P1 Config: MAX_LEN={BEST_MAX_LEN}, LR={BEST_LR}, BATCH={BEST_BATCH}")

# --- 2. Load Full Dataset (Raw) ---
# Each SUBSET_FRAC needs its own splits, so the splitting happens inside the loop.
# To avoid repeated downloads and parsing, the *full* dataset is loaded once here.
print("Loading full IMDB dataset (once)...")
imdb   = load_dataset("imdb")
texts  = list(imdb["train"]["text"]) + list(imdb["test"]["text"])
labels = np.array(list(imdb["train"]["label"]) + list(imdb["test"]["label"]), dtype="int32")

features = Features({"text": Value("string"),
                     "label": ClassLabel(num_classes=2, names=["NEG","POS"])})
all_ds = Dataset.from_dict({"text": texts, "label": labels.tolist()}, features=features)
print(f"Full dataset loaded: {len(all_ds)} examples.")


# --- 3. Execute Experiment Sweep ---
# We loop over the fractions, *re-splitting* the data each time.
for frac in tqdm(p2_fractions, desc="Running Problem 2 Sweep"):
    
    print(f"\n--- [Running] SUBSET_FRAC = {frac:.2f} ---")
    
    # --- 3.1. Create Data Splits (based on baseline logic) ---
    # This logic is copied from the baseline 'Data Load' cell
    
    # Global SUBSET_FRAC is set for consistency (though we use local 'frac')
    SUBSET_FRAC = frac
    
    if 0.0 < frac < 1.0:
        sub = all_ds.train_test_split(train_size=frac, seed=SEED, stratify_by_column="label")
        ds_pool = sub["train"]
    else:
        ds_pool = all_ds # Use 1.00

    # Stratified 70/10/20 (train/val/test) split on the (possibly smaller) pool
    splits = ds_pool.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
    train_val_pool, test_ds = splits["train"], splits["test"]
    splits2 = train_val_pool.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
    train_ds, val_ds = splits2["train"], splits2["test"]

    # Numpy arrays for Keras fit/predict
    X_tr = np.array(train_ds["text"], dtype=object); y_tr = np.array(train_ds["label"], dtype="int32")
    X_va = np.array(val_ds["text"],   dtype=object); y_va = np.array(val_ds["label"],   dtype="int32")
    X_te = np.array(test_ds["text"],  dtype=object); y_te = np.array(test_ds["label"],  dtype="int32")

    print(f"Data split: Train={len(X_tr)}, Val={len(X_va)}, Test={len(X_te)}")
    
    # --- 3.2. Model Build, Train, Eval (using P1 best params) ---
    keras.backend.clear_session()
    keras.utils.set_random_seed(SEED) 
    
    preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
        "distil_bert_base_en_uncased", sequence_length=BEST_MAX_LEN
    )
    model = kh.models.DistilBertTextClassifier.from_preset(
        "distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc
    )
    
    model.compile(
        optimizer=keras.optimizers.Adam(BEST_LR), # Use best LR
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
    )
    
    start = time.time() 
    cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]
    
    history = model.fit(
        X_tr, y_tr,                 
        validation_data=(X_va, y_va), 
        epochs=EPOCHS,              
        batch_size=BEST_BATCH,      # Use best Batch Size
        callbacks=cb,
        verbose=0, 
    )
    
    end = time.time() - start
    
    logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0) 
    y_pred = logits.argmax(axis=-1)
    
    test_acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
    test_f1  = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]
    
    best_epoch_idx = np.argmin(history.history['val_loss']) 
    best_val_loss = history.history['val_loss'][best_epoch_idx]
    best_val_acc = history.history['val_acc'][best_epoch_idx]

    # --- 3.3. Store Results ---
    result = {
        "frac": frac,
        "train_size": len(X_tr),
        "test_size": len(X_te),
        "best_val_acc": best_val_acc,
        "test_acc": test_acc,
        "test_f1": test_f1,
        "time_sec": end,
        "time_str": time.strftime("%M:%S", time.gmtime(end)) 
    }
    p2_results.append(result)
    
    print(f"--- [Finished] Frac={frac:.2f}. Best Val Acc: {best_val_acc:.4f}, Test Acc: {test_acc:.4f}, Time: {result['time_str']} ---")

print("\n\n" + "="*50)
print("           Problem 2: All 4 runs COMPLETED.")
print(f"Data is now stored in the 'p2_results' list ({len(p2_results)} items).")
print("Run the next cell to display the summary table.")
print("="*50)

--- Starting Problem 2 Sweep (4 runs) ---
Using Best P1 Config: MAX_LEN=512, LR=7.5e-06, BATCH=8
Loading full IMDB dataset (once)...
Full dataset loaded: 50000 examples.


Running Problem 2 Sweep:   0%|          | 0/4 [00:00<?, ?it/s]

--- [Running] SUBSET_FRAC = 0.25 ---
Data split: Train=8750, Val=1250, Test=2500
Running Problem 2 Sweep:  25%|██▌       | 1/4 [03:39<10:59, 219.83s/it]
--- [Finished] Frac=0.25. Best Val Acc: 0.9144, Test Acc: 0.9100, Time: 03:25 ---

--- [Running] SUBSET_FRAC = 0.50 ---
Data split: Train=17500, Val=2500, Test=5000
Running Problem 2 Sweep:  50%|█████     | 2/4 [10:14<10:44, 322.40s/it]
--- [Finished] Frac=0.50. Best Val Acc: 0.9240, Test Acc: 0.9210, Time: 06:15 ---

--- [Running] SUBSET_FRAC = 0.75 ---
Data split: Train=26250, Val=3750, Test=7500
Running Problem 2 Sweep:  75%|███████▌  | 3/4 [19:09<06:59, 419.51s/it]
--- [Finished] Frac=0.75. Best Val Acc: 0.9285, Test Acc: 0.9256, Time: 08:30 ---

--- [Running] SUBSET_FRAC = 1.00 ---
Data split: Train=35000, Val=5000, Test=10000
Running Problem 2 Sweep: 100%|██████████| 4/4 [30:17<00:00, 454.36s/it]
--- [Finished] Frac=1.00. Best Val Acc: 0.9288, Test Acc: 0.9232, Time: 10:42 ---


           Problem 2: All 4 runs COMPLETED.
Data is now stored in the 'p2_results' list (4 items).
Run the next cell to display the summary table.





In [11]:
print("\n\n" + "="*50)
print("           Problem 2: Final Results Summary")
print("="*50)

# Set pandas display options
pd.set_option('display.float_format', '{:.6f}'.format)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)

df_p2 = pd.DataFrame(p2_results)

# Formatting for easier analysis
df_p2_display = df_p2[[
    'frac', 'train_size', 'test_size', 
    'best_val_acc', 'test_acc', 'test_f1', 'time_str'
]].round(4)

# Use standard print instead of .to_markdown() to avoid 'tabulate' error
print(df_p2_display)

# --- 5. (Bonus) Auto-set a2a Answer ---
# Find the row for the frac=1.00 run
# (selected by value rather than by position, so the run order does not matter)
best_run_p2 = df_p2.loc[df_p2['frac'] == 1.00].iloc[0]
a2a_value = best_run_p2['best_val_acc'] 

print(f"\n--- Graded Answer (a2a) ---")
print(f"Run detected: frac=1.00")
print(f"Best Validation Accuracy (at min val_loss): {a2a_value:.4f}")



           Problem 2: Final Results Summary
      frac  train_size  test_size  best_val_acc  test_acc  test_f1 time_str
0 0.250000        8750       2500      0.914400  0.910000 0.909900    03:25
1 0.500000       17500       5000      0.924000  0.921000 0.919300    06:15
2 0.750000       26250       7500      0.928500  0.925600 0.924900    08:30
3 1.000000       35000      10000      0.928800  0.923200 0.925800    10:42

--- Graded Answer (a2a) ---
Run detected: frac=1.00
Best Validation Accuracy (at min val_loss): 0.9288


### Graded Questions

In [12]:
# Set a2a to the validation accuracy at min validation loss for your best configuration found in this problem
# (Yes, it is probably at 1.0!)

a2a = float(best_run_p2['best_val_acc'])      # validation accuracy at min val_loss for the frac=1.00 run

In [13]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a2a = {a2a:.4f}')

a2a = 0.9288


#### Question a2b:

Summarize what you observed as dataset size increased. Given that validation metrics are typically reliable to only about two decimal places, do the performance gains justify using the entire dataset? What trade-offs between accuracy and computation time did you notice?

#### Your Answer Here:

As the dataset size increased, model performance (both validation and test accuracy) consistently improved, but it showed a classic case of **diminishing returns**.

* The most significant performance jump came from increasing the data from **25% to 50%**, which boosted validation accuracy by nearly a full percentage point (`+0.0096`, from `0.9144` to `0.9240`).
* The next jump, from **50% to 75%**, provided a smaller but still meaningful gain (`+0.0045`, from `0.9240` to `0.9285`).
* However, the final jump from **75% to 100%** yielded a negligible gain of only `+0.0003` (from `0.9285` to `0.9288`).

Given that metrics are only reliable to about two decimal places, **the performance gains do not justify using the entire dataset.** The `+0.0003` gain from 75% to 100% is well within the noise. The test accuracy even dipped slightly (from `0.9256` to `0.9232`) while test F1 rose slightly (from `0.9249` to `0.9258`), which suggests that the model had largely saturated by the 75% mark. Note that each fraction is evaluated on its own validation and test split, so comparisons across fractions are approximate.
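
The "two decimal places" rule of thumb can be made concrete with the binomial standard error of an accuracy estimate, $\sqrt{p(1-p)/n}$. This is a rough sketch that treats each validation example as an independent trial; the function name is illustrative, and the validation set sizes are taken from the run logs above.

```python
import math


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


# (fraction, best_val_acc, validation set size) from the Problem 2 runs
runs = [(0.25, 0.9144, 1250), (0.50, 0.9240, 2500),
        (0.75, 0.9285, 3750), (1.00, 0.9288, 5000)]

for frac, acc, n_val in runs:
    se = accuracy_std_error(acc, n_val)
    print(f"frac={frac:.2f}  val_acc={acc:.4f}  std_error={se:.4f}")
```

With standard errors of roughly 0.004 at the two largest fractions, a `+0.0003` difference is far too small to distinguish from noise.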

The **accuracy vs. computation time trade-off** was very clear:
* Computation time grew roughly **linearly** with the amount of training data, on top of what looks like a fixed startup cost. The 100% run (`10:42`) took about **3.13x** as long as the 25% run (`03:25`) for 4x the data.
* Because the accuracy gains plateaued while the time cost kept rising, the **75% dataset (`frac=0.75`) represents the best trade-off**. It reached more than 99.9% of the final validation accuracy (`0.9285` vs. `0.9288`) while saving just over 2 minutes of training time compared to the 100% run (`08:30` vs. `10:42`).
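
The trade-off can be tabulated directly from the Problem 2 results. The sketch below is illustrative: it parses the `time_str` values from the summary table and reports, for each step up in data, the extra training time and the change in validation accuracy.

```python
def to_seconds(mm_ss):
    """Convert an 'MM:SS' string to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


# (fraction, best_val_acc, time_str) copied from the Problem 2 summary table
rows = [(0.25, 0.9144, "03:25"), (0.50, 0.9240, "06:15"),
        (0.75, 0.9285, "08:30"), (1.00, 0.9288, "10:42")]

base_time = to_seconds(rows[0][2])
for (f0, a0, t0), (f1, a1, t1) in zip(rows, rows[1:]):
    extra = to_seconds(t1) - to_seconds(t0)
    print(f"{f0:.2f} -> {f1:.2f}: +{extra}s training time, "
          f"{a1 - a0:+.4f} val acc, "
          f"{to_seconds(t1) / base_time:.2f}x the 25% run")
```

Each additional quarter of the data costs a similar amount of time (between two and three minutes), while the accuracy gained per step shrinks from about one percentage point to almost nothing.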

# Problem 3 — Model swap: speed vs. accuracy (why: capacity matters)

In this problem we will compare encoder-only backbones of different sizes.

**Setup.** Keep the best `MAX_LEN`, `LR`, and `SUBSET_FRAC` from Problems 1–2. Only change the model/preset:

* **DistilBERT** (current baseline)
* **BERT-base** (larger/usually stronger)

**How to switch (two lines each).**

* DistilBERT:

  ```python
  preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset("distil_bert_base_en_uncased", sequence_length=MAX_LEN)
  model  = kh.models.DistilBertTextClassifier.from_preset("distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc)
  ```

* BERT-base:

  ```python
  preproc = kh.models.BertTextClassifierPreprocessor.from_preset("bert_base_en_uncased", sequence_length=MAX_LEN)
  model  = kh.models.BertTextClassifier.from_preset("bert_base_en_uncased", num_classes=2, preprocessor=preproc)
  ```

**What to do.**

1. Train/evaluate each model once with identical settings.
2. Observe the performance metrics for each.
3. Answer the graded questions.
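Since the two snippets above differ only in the preset name and the class pair, the switch can also be driven from a small lookup table. The following is a sketch, not part of the provided notebook: `PRESET_CLASSES`, `class_names_for`, and `build_classifier` are hypothetical helper names, while the `keras_hub` class names and preset strings are the ones shown above. `keras_hub` is imported lazily so the lookup itself can be checked without loading any weights.

```python
# Hypothetical lookup table: preset name -> (preprocessor class name, classifier class name).
PRESET_CLASSES = {
    "distil_bert_base_en_uncased": (
        "DistilBertTextClassifierPreprocessor",
        "DistilBertTextClassifier",
    ),
    "bert_base_en_uncased": (
        "BertTextClassifierPreprocessor",
        "BertTextClassifier",
    ),
}


def class_names_for(preset_name):
    """Return the (preprocessor, classifier) class names for a known preset."""
    try:
        return PRESET_CLASSES[preset_name]
    except KeyError:
        raise ValueError(f"Unknown preset: {preset_name!r}") from None


def build_classifier(preset_name, max_len, num_classes=2):
    """Build a (preprocessor, model) pair; may download weights on first use."""
    import keras_hub as kh  # imported here so the lookup above works without it

    preproc_name, model_name = class_names_for(preset_name)
    preproc = getattr(kh.models, preproc_name).from_preset(
        preset_name, sequence_length=max_len
    )
    model = getattr(kh.models, model_name).from_preset(
        preset_name, num_classes=num_classes, preprocessor=preproc
    )
    return preproc, model
```

With this in place, a call such as `preproc, model = build_classifier("bert_base_en_uncased", MAX_LEN)` would stand in for the two-line switch, and adding a third backbone would only require a new table entry.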



In [8]:
import os
from pathlib import Path

print("--- P3 Cache Verification Step ---")

# 1. Find the default Keras cache path (compatible with Windows/Linux/Mac).
# Keras 3 uses .keras/models/.
try:
    cache_dir = Path(os.environ.get(
        "KERAS_HOME", Path.home() / ".keras"
    )) / "models"
except Exception as e:
    print(f"Could not determine Keras home directory. {e}")
    raise  # cache_dir would be undefined below, so stop here
    
model_cache_path = cache_dir / "bert_base_en_uncased"

print(f"Checking for model in: {model_cache_path}")

# 2. Check for the 4 required files that appeared in the download log.
files_to_check = [
    "config.json",
    "tokenizer.json",
    "assets/tokenizer/vocabulary.txt",
    "model.weights.h5"
]

all_files_found = True
total_size = 0

if not model_cache_path.exists():
    print("FAILURE: Model directory does not exist.")
    all_files_found = False
else:
    for file_name in files_to_check:
        file_path = model_cache_path / file_name
        if file_path.exists():
            file_size_mb = file_path.stat().st_size / (1024 * 1024)
            print(f"  [OK] Found: {file_name} ({file_size_mb:.2f} MB)")
            total_size += file_size_mb
        else:
            print(f"  [!!] MISSING: {file_name}")
            all_files_found = False

print("---" * 10)
if all_files_found:
    print(f"[SUCCESS] All {len(files_to_check)} files found. Total size: {total_size:.2f} MB.")
    print("It is safe to proceed with P3 training (Cell 1).")
    print("(Note: The next step *may still hang* while *loading* the model, but not while downloading.)")
else:
    print("[FAILURE] Some files are missing.")
    print("P3 training (Cell 1) will need to re-download the model.")
print("---" * 10)

--- P3 Cache Verification Step ---
Checking for model in: /home/abcbbong/.keras/models/bert_base_en_uncased
FAILURE: Model directory does not exist.
------------------------------
[FAILURE] Some files are missing.
P3 training (Cell 1) will need to re-download the model.
------------------------------


In [10]:
import pandas as pd
from tqdm import tqdm  # Use standard console tqdm
import warnings

# Suppress some Keras Hub/TensorFlow warnings (Optional)
warnings.filterwarnings("ignore", category=UserWarning, module="tensorflow")

# --- 1. Problem 3 Fixed Settings (from P1 & P2) ---
BEST_MAX_LEN = 512
BEST_LR = 7.5e-6
BEST_BATCH = 8
BEST_FRAC = 1.00 # <--- Using 100% data based on P2's best val_acc

# --- 2. This list will store ALL P3 results ---
# We create it here. The next cell will append to it.
p3_results = [] 

print(f"--- Starting Problem 3 / Run 1 (DistilBERT) ---")
print(f"Fixed Settings: MAX_LEN={BEST_MAX_LEN}, LR={BEST_LR}, BATCH={BEST_BATCH}, FRAC={BEST_FRAC}")

# --- 3. Load and Split Data (using BEST_FRAC=1.00) ---
print(f"\nVerifying data is loaded for SUBSET_FRAC = {BEST_FRAC}...")
try:
    # Check if the 100% data (35,000 train samples) is loaded.
    if 'X_tr' not in locals() or len(X_tr) != 35000: # 35000 is 100% train split
        
        print(f"Data not found or incorrect size. Re-splitting data for {BEST_FRAC*100}%...")
        if 'all_ds' not in locals():
            print("ERROR: 'all_ds' not found. Please re-run ALL baseline data loading cells from the top.")
            raise NameError("'all_ds' not found")
            
        # Handle the 1.00 case explicitly to avoid ValueError
        if BEST_FRAC == 1.00:
            ds_pool = all_ds
        else:
            sub = all_ds.train_test_split(train_size=BEST_FRAC, seed=SEED, stratify_by_column="label")
            ds_pool = sub["train"]
        
        splits = ds_pool.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
        train_val_pool, test_ds = splits["train"], splits["test"]
        splits2 = train_val_pool.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
        train_ds, val_ds = splits2["train"], splits2["test"]

        X_tr = np.array(train_ds["text"], dtype=object); y_tr = np.array(train_ds["label"], dtype="int32")
        X_va = np.array(val_ds["text"],   dtype=object); y_va = np.array(val_ds["label"],   dtype="int32")
        X_te = np.array(test_ds["text"],  dtype=object); y_te = np.array(test_ds["label"],  dtype="int32")
    
    print(f"Data ready: Train={len(X_tr)}, Val={len(X_va)}, Test={len(X_te)}")
except NameError as e:
    print(f"ERROR: A required variable was not found. Did you restart the kernel?")
    print("Please re-run the baseline 'Data Load' cell at the top of the notebook first.")
    raise e

# --- 4. Execute Run 1: DistilBERT ---
preset_name = "distil_bert_base_en_uncased"
print(f"\n--- [Running] Model: {preset_name} ---")

keras.backend.clear_session()
keras.utils.set_random_seed(SEED) 

Preprocessor = kh.models.DistilBertTextClassifierPreprocessor
Model = kh.models.DistilBertTextClassifier

preproc = Preprocessor.from_preset(preset_name, sequence_length=BEST_MAX_LEN)
model = Model.from_preset(preset_name, num_classes=2, preprocessor=preproc)

model.compile(
    optimizer=keras.optimizers.Adam(BEST_LR), 
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)

start = time.time() 
cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]

history = model.fit(
    X_tr, y_tr,                 
    validation_data=(X_va, y_va), 
    epochs=EPOCHS,              
    batch_size=BEST_BATCH,      
    callbacks=cb,
    verbose=0, 
)

end = time.time() - start

logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0) 
y_pred = logits.argmax(axis=-1)

# (y_pod fix is included)
test_acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
test_f1  = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

best_epoch_idx = np.argmin(history.history['val_loss']) 
best_val_loss = history.history['val_loss'][best_epoch_idx]
best_val_acc = history.history['val_acc'][best_epoch_idx]

result = {
    "model_name": preset_name,
    "best_val_acc": best_val_acc,
    "test_acc": test_acc,
    "test_f1": test_f1,
    "time_sec": end,
    "time_str": time.strftime("%M:%S", time.gmtime(end)) 
}
p3_results.append(result) # Add to the list

print(f"--- [Finished] Model: {preset_name}. Best Val Acc: {best_val_acc:.4f}, Test Acc: {test_acc:.4f}, Time: {result['time_str']} ---")
print("\nRun 1 (DistilBERT) is complete. Proceed to the next cell to run BERT-base.")

--- Starting Problem 3 / Run 1 (DistilBERT) ---
Fixed Settings: MAX_LEN=512, LR=7.5e-06, BATCH=8, FRAC=1.0

Verifying data is loaded for SUBSET_FRAC = 1.0...
Data not found or incorrect size. Re-splitting data for 100.0%...
Data ready: Train=35000, Val=5000, Test=10000

--- [Running] Model: distil_bert_base_en_uncased ---

--- [Finished] Model: distil_bert_base_en_uncased. Best Val Acc: 0.9288, Test Acc: 0.9232, Time: 10:42 ---

Run 1 (DistilBERT) is complete. Proceed to the next cell to run BERT-base.


In [11]:
# --- This cell assumes Cell 3.1 has completed ---
# It uses 'p3_results', 'X_tr', 'y_tr' etc. from the previous cell's memory.

# --- 1. Model to test in THIS CELL ---
preset_name = "bert_base_en_uncased"
print(f"--- Starting Problem 3 / Run 2 (BERT-base) ---")

# --- 2. WARNING: This step will re-attempt download (cache was empty) ---
# --- This is the high-risk step that may hang ---
print(f"\n--- [Running] Model: {preset_name} ---")

keras.backend.clear_session()
keras.utils.set_random_seed(SEED) 

Preprocessor = kh.models.BertTextClassifierPreprocessor
Model = kh.models.BertTextClassifier

preproc = Preprocessor.from_preset(preset_name, sequence_length=BEST_MAX_LEN)
model = Model.from_preset(preset_name, num_classes=2, preprocessor=preproc)

model.compile(
    optimizer=keras.optimizers.Adam(BEST_LR), 
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)

start = time.time() 
cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]

history = model.fit(
    X_tr, y_tr,                 
    validation_data=(X_va, y_va), 
    epochs=EPOCHS,              
    batch_size=BEST_BATCH,      
    callbacks=cb,
    verbose=0, 
)

end = time.time() - start

logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0) 
y_pred = logits.argmax(axis=-1)

# (y_pod fix is included)
test_acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
test_f1  = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

best_epoch_idx = np.argmin(history.history['val_loss']) 
best_val_loss = history.history['val_loss'][best_epoch_idx]
best_val_acc = history.history['val_acc'][best_epoch_idx]

result = {
    "model_name": preset_name,
    "best_val_acc": best_val_acc,
    "test_acc": test_acc,
    "test_f1": test_f1,
    "time_sec": end,
    "time_str": time.strftime("%M:%S", time.gmtime(end)) 
}
p3_results.append(result) # Append the *second* result to the list

print(f"--- [Finished] Model: {preset_name}. Best Val Acc: {best_val_acc:.4f}, Test Acc: {test_acc:.4f}, Time: {result['time_str']} ---")
print("\nRun 2 (BERT-base) is complete. Proceed to the next cell to view the final report.")

--- Starting Problem 3 / Run 2 (BERT-base) ---

--- [Running] Model: bert_base_en_uncased ---

--- [Finished] Model: bert_base_en_uncased. Best Val Acc: 0.9446, Test Acc: 0.9400, Time: 20:30 ---

Run 2 (BERT-base) is complete. Proceed to the next cell to view the final report.


In [15]:
# --- This cell assumes 'p3_results' list exists in memory ---
# (It also needs 'pandas as pd' to be imported, which Cell 1 did)

print("\n\n" + "="*50)
print("           Problem 3: Final Results Summary")
print("==================================================")

# Set pandas display options
pd.set_option('display.float_format', '{:.6f}'.format)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)

df_p3 = pd.DataFrame(p3_results)

# Formatting for easier analysis
df_p3_display = df_p3[[
    'model_name', 
    'best_val_acc', 'test_acc', 'test_f1', 'time_str'
]].round(4)

# Use standard print instead of .to_markdown() to avoid 'tabulate' error
print(df_p3_display)

# --- 5. (Bonus) Auto-set a3a Answer ---
# Find the row with the highest 'best_val_acc'
best_run_p3 = df_p3.loc[df_p3['best_val_acc'].idxmax()]
a3a_value = best_run_p3['best_val_acc'] 

print(f"\n--- Graded Answer (a3a) ---")
print(f"Best model detected: {best_run_p3['model_name']}")
print(f"Best Validation Accuracy (at min val_loss): {a3a_value:.4f}")



           Problem 3: Final Results Summary
                    model_name  best_val_acc  test_acc  test_f1 time_str
0  distil_bert_base_en_uncased      0.928800  0.923200 0.925700    10:42
1         bert_base_en_uncased      0.944600  0.940000 0.940500    20:30

--- Graded Answer (a3a) ---
Best model detected: bert_base_en_uncased
Best Validation Accuracy (at min val_loss): 0.9446


### Graded Questions

In [13]:
# Set a3a to the validation accuracy at min validation loss for your best model found in this problem

a3a = best_run_p3['best_val_acc']             # Replace 0.0 with your answer

In [14]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a3a = {a3a:.4f}')

a3a = 0.9446


#### Question a3b:

**Answer briefly.**

* Which model gives the best **accuracy/F1**?
* Which is **fastest** per epoch?
* Given limited development time or compute resources, which model is the best **overall choice** and why?

#### Your Answer Here:

* **Which model gives the best accuracy/F1?**
    * **`bert_base_en_uncased`** was the clear winner. It achieved a test accuracy of `0.9400`, an improvement of **1.68 percentage points** over `distil_bert_base_en_uncased` (`0.9232`), and a higher test F1 as well (`0.9405` vs. `0.9257`).

* **Which is fastest per epoch?**
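    * **`distil_bert_base_en_uncased`** was the fastest. Per-epoch time was not logged separately, so total wall-clock time serves as a proxy: the entire DistilBERT run finished in `10:42`, while the `bert_base` run took `20:30` (about **1.9x** as long) on the same data with the same settings. Because early stopping can end the two runs after different numbers of epochs, this comparison is only approximate on a per-epoch basis.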

* **Given limited development time or compute resources, which model is the best overall choice and why?**
    * **`distil_bert_base_en_uncased` is the best overall choice** for this scenario.
    * **Reasoning (The Trade-Off):** While `bert_base` offered the *highest* accuracy, `distil_bert` delivered **98.2%** of that peak performance (`0.9232` / `0.9400`) in **nearly half the time**.
    * In a real-world project with "limited development time," iterating (running experiments) nearly twice as fast is far more valuable than the final 1.7-percentage-point accuracy boost. `DistilBERT` provides an excellent balance of high performance and fast iteration, making it the most practical and cost-effective choice. `bert_base` should only be used if achieving the absolute maximum performance (e.g., for a competition) is the *only* goal, regardless of cost.
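
The ratios used in this answer follow directly from the Problem 3 results table. The sketch below recomputes them from the reported numbers. It also shows one way to turn a total run time into a per-epoch figure, under the assumption that the number of epochs actually run is taken from `len(history.history['val_loss'])`; that count was not printed in the runs above, so no per-epoch value is claimed here. All function names are illustrative.

```python
def mmss_to_seconds(mmss):
    """Convert an 'MM:SS' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def seconds_per_epoch(total_seconds, epochs_run):
    """Average wall-clock seconds per epoch for a finished run."""
    if epochs_run <= 0:
        raise ValueError("epochs_run must be positive")
    return total_seconds / epochs_run


# Numbers copied from the Problem 3 results table.
results = {
    "distil_bert_base_en_uncased": {"test_acc": 0.9232, "time": "10:42"},
    "bert_base_en_uncased": {"test_acc": 0.9400, "time": "20:30"},
}

distil = results["distil_bert_base_en_uncased"]
bert = results["bert_base_en_uncased"]

# Accuracy gap in percentage points (not relative percent).
acc_gap_pp = (bert["test_acc"] - distil["test_acc"]) * 100

# Share of BERT-base's test accuracy that DistilBERT retains.
acc_share = distil["test_acc"] / bert["test_acc"]

# How many times longer the BERT-base run took in total.
time_ratio = mmss_to_seconds(bert["time"]) / mmss_to_seconds(distil["time"])

print(f"accuracy gap: {acc_gap_pp:.2f} percentage points")
print(f"DistilBERT retains {acc_share:.1%} of BERT-base test accuracy")
print(f"BERT-base total time: {time_ratio:.2f}x DistilBERT")
```

In the notebook, a call along the lines of `seconds_per_epoch(end, len(history.history["val_loss"]))` after each `model.fit` would give a like-for-like per-epoch comparison even when early stopping ends the runs at different epochs.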