# Task 1 — Fine-tuning BERT for Text Classification (AG News) — **Full Run**

This notebook performs a **full training run** (no subset) for Task 1 using:
- Dataset: `sh0416/ag_news` (columns: `title`, `description`, `label`)
- Model: `bert-base-uncased`
- Metrics: Accuracy + Macro-F1

It is designed for **Google Colab** and saves outputs under your Google Drive project folder.

**Repo name:** `finetuning-bert-text-classification`  
**Notebook path:** `notebooks/02_finetune_bert_ag_news_fullrun_drive.ipynb`


## 0) Mount Google Drive (required)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1) Project directory on Drive (required)

In [2]:
# Required path (as requested)
PROJECT_DIR = "/content/drive/MyDrive/finetuning-bert-text-classification"

In [3]:
from pathlib import Path

# Use a Path object for convenience (keep PROJECT_DIR string above unchanged)
PROJECT_PATH = Path(PROJECT_DIR)

REPORTS_DIR = PROJECT_PATH / "reports"
NOTEBOOKS_DIR = PROJECT_PATH / "notebooks"
MODELS_DIR = PROJECT_PATH / "models"        # optional
OUTPUTS_DIR = PROJECT_PATH / "outputs"      # trainer outputs / checkpoints

for d in [REPORTS_DIR, NOTEBOOKS_DIR, MODELS_DIR, OUTPUTS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print("PROJECT_PATH:", PROJECT_PATH)
print("REPORTS_DIR:", REPORTS_DIR)

PROJECT_PATH: /content/drive/MyDrive/finetuning-bert-text-classification
REPORTS_DIR: /content/drive/MyDrive/finetuning-bert-text-classification/reports


## 2) Install dependencies

In [4]:
!pip -q install -U transformers datasets evaluate accelerate scikit-learn

## 3) Imports & reproducibility

In [5]:
import os
import random
import numpy as np
import pandas as pd
import torch

from datasets import load_dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    set_seed,
)
import evaluate
from sklearn.metrics import classification_report, confusion_matrix

SEED = 42
set_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)
print("torch:", torch.__version__)

Device: cuda
torch: 2.9.0+cu126


## 4) Configuration (FULL RUN)

In [6]:
MODEL_CHECKPOINT = "bert-base-uncased"
DATASET_NAME = "sh0416/ag_news"

MAX_LENGTH = 128
NUM_EPOCHS = 3
LEARNING_RATE = 2e-5
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
WEIGHT_DECAY = 0.01

# FULL RUN (no subset selection)
QUICK_RUN = False

OUTPUT_DIR = str(OUTPUTS_DIR / "bert_agnews")

print("MODEL_CHECKPOINT:", MODEL_CHECKPOINT)
print("DATASET_NAME:", DATASET_NAME)
print("QUICK_RUN:", QUICK_RUN)

MODEL_CHECKPOINT: bert-base-uncased
DATASET_NAME: sh0416/ag_news
QUICK_RUN: False


## 5) Load dataset

In [7]:
raw = load_dataset(DATASET_NAME)
raw

DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 7600
    })
})

### 5.1 Inspect columns

In [8]:
print(raw)
print("Train columns:", raw["train"].column_names)
print("Test columns :", raw["test"].column_names)

pd.DataFrame(raw["train"][:3])

DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 7600
    })
})
Train columns: ['label', 'title', 'description']
Test columns : ['label', 'title', 'description']


| | label | title | description |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |


## 6) Create train/validation split

In [9]:
splits = DatasetDict()
splits["test"] = raw["test"]

train_valid = raw["train"].train_test_split(test_size=0.1, seed=SEED)
splits["train"] = train_valid["train"]
splits["validation"] = train_valid["test"]

print({k: len(v) for k, v in splits.items()})

{'test': 7600, 'train': 108000, 'validation': 12000}


## 7) Normalize labels to 0-based (CRITICAL)

In [10]:
# Some AG News variants use labels 1..4, but the classification loss needs
# class indices in 0..num_labels-1 (0..3 for num_labels=4).

def normalize_labels_to_zero_based(ds, label_col="label"):
    labels = ds[label_col]
    mn, mx = int(np.min(labels)), int(np.max(labels))
    print(f"{label_col} min/max before:", mn, mx)

    if mn == 0:
        print("Labels already 0-based. No change.")
        return ds

    if mn == 1:
        def _shift(batch):
            return {label_col: [int(x) - 1 for x in batch[label_col]]}
        ds2 = ds.map(_shift, batched=True)
        labels2 = ds2[label_col]
        mn2, mx2 = int(np.min(labels2)), int(np.max(labels2))
        print(f"{label_col} min/max after :", mn2, mx2)
        assert mn2 == 0 and mx2 == (mx - 1), "Label normalization failed."
        return ds2

    raise ValueError(f"Unexpected label range: min={mn}, max={mx}")

splits["train"] = normalize_labels_to_zero_based(splits["train"], "label")
splits["validation"] = normalize_labels_to_zero_based(splits["validation"], "label")
splits["test"] = normalize_labels_to_zero_based(splits["test"], "label")

label min/max before: 1 4
label min/max after : 0 3
label min/max before: 1 4
label min/max after : 0 3
label min/max before: 1 4
label min/max after : 0 3


## 8) Build `text` column from title + description

In [11]:
def add_text_column(batch):
    text = [(t or "") + " " + (d or "") for t, d in zip(batch["title"], batch["description"])]
    return {"text": text}

splits = splits.map(add_text_column, batched=True)
print("Columns now:", splits["train"].column_names)
pd.DataFrame(splits["train"][:3])

Columns now: ['label', 'title', 'description', 'text']


Unnamed: 0,label,title,description,text
0,0,Despair and Anger in Small Russian Town After ...,"BESLAN, Russia (Reuters) - The killing of mor...",Despair and Anger in Small Russian Town After ...
1,3,"Bob Evans, mainframe pioneer, dies at 77",Evans led a team that developed a new class of...,"Bob Evans, mainframe pioneer, dies at 77 Evans..."
2,1,Agassi Brushes Bjorkman Aside in Stockholm,STOCKHOLM (Reuters) - Andre Agassi brushed pa...,Agassi Brushes Bjorkman Aside in Stockholm ST...


## 9) Tokenization

In [12]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT, use_fast=True)

def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=MAX_LENGTH,
    )

# Remove raw text columns to save memory
tokenized = splits.map(tokenize_batch, batched=True, remove_columns=["text", "title", "description"])
tokenized

DatasetDict({
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7600
    })
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 108000
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 12000
    })
})

### 9.1 Rename label column to `labels` (Trainer expects this)

In [13]:
def rename_label_column(ds):
    if "label" in ds.column_names and "labels" not in ds.column_names:
        ds = ds.rename_column("label", "labels")
    return ds

tokenized = DatasetDict({k: rename_label_column(v) for k, v in tokenized.items()})
print("Tokenized columns:", tokenized["train"].column_names)

print("Train/Val/Test sizes:")
print("Train:", len(tokenized["train"]))
print("Validation:", len(tokenized["validation"]))
print("Test:", len(tokenized["test"]))

print("Label min/max:", min(tokenized["train"]["labels"]), max(tokenized["train"]["labels"]))

Tokenized columns: ['labels', 'input_ids', 'token_type_ids', 'attention_mask']
Train/Val/Test sizes:
Train: 108000
Validation: 12000
Test: 7600
Label min/max: 0 3


## 10) Data collator

In [14]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
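
`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than to `MAX_LENGTH`, which is why a batch can come out shorter than 128 tokens. The pure-Python sketch below illustrates that idea on toy data; `pad_batch` and the token ids are hypothetical stand-ins, not the library's implementation.

```python
# Illustrative only: mimics dynamic (per-batch) padding on toy token ids.
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 2054, 102], [101, 2054, 2003, 2023, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch instead of to a fixed length generally saves compute when most texts are short, as AG News titles and descriptions tend to be.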

## 11) Metrics (Accuracy + Macro-F1)

In [15]:
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    acc = accuracy_metric.compute(predictions=preds, references=labels)["accuracy"]
    f1  = f1_metric.compute(predictions=preds, references=labels, average="macro")["f1"]
    return {"accuracy": acc, "f1_macro": f1}

## 12) Load model

In [16]:
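# Prefer ClassLabel metadata when the dataset provides it; a plain integer
# label column has no `num_classes`, so fall back to AG News's 4 classes.
try:
    num_labels = raw["train"].features["label"].num_classes
except AttributeError:
    num_labels = 4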

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_CHECKPOINT,
    num_labels=num_labels,
).to(device)

print("num_labels:", num_labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


num_labels: 4


## 13) Sanity checks (prevents CUDA device-side assert)

In [17]:
from torch.utils.data import DataLoader

print("Model num_labels:", model.config.num_labels)
for split in ["train", "validation", "test"]:
    labels = tokenized[split]["labels"]
    print(split, "label min/max:", int(min(labels)), int(max(labels)), "unique:", len(set(labels)))
    assert int(min(labels)) >= 0
    assert int(max(labels)) < model.config.num_labels, f"{split}: label out of range"

dl = DataLoader(tokenized["train"], batch_size=8, shuffle=True, collate_fn=data_collator)
batch = next(iter(dl))
max_token_id = int(batch["input_ids"].max())
print("Max token id:", max_token_id, "vocab size:", tokenizer.vocab_size)
assert max_token_id < tokenizer.vocab_size, "Tokenizer/model mismatch"

seq_len = batch["input_ids"].shape[1]
print("Sequence length:", seq_len)
if hasattr(model.config, "max_position_embeddings"):
    print("max_position_embeddings:", model.config.max_position_embeddings)
    assert seq_len <= model.config.max_position_embeddings, "Reduce MAX_LENGTH"

print("Sanity checks passed ✅")

Model num_labels: 4
train label min/max: 0 3 unique: 4
validation label min/max: 0 3 unique: 4
test label min/max: 0 3 unique: 4
Max token id: 27942 vocab size: 30522
Sequence length: 84
max_position_embeddings: 512
Sanity checks passed ✅


## 14) TrainingArguments + Trainer (version-compatible)

In [18]:
import inspect
from transformers import TrainingArguments, Trainer

use_fp16 = torch.cuda.is_available()

ta_kwargs = dict(
    output_dir=OUTPUT_DIR,
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=50,
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=WEIGHT_DECAY,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,
    fp16=use_fp16,
    report_to="none",
    seed=SEED,
)

ta_params = inspect.signature(TrainingArguments.__init__).parameters
if "eval_strategy" in ta_params:
    ta_kwargs["eval_strategy"] = "epoch"
else:
    ta_kwargs["evaluation_strategy"] = "epoch"

training_args = TrainingArguments(**ta_kwargs)

trainer_kwargs = dict(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer_params = inspect.signature(Trainer.__init__).parameters
if "processing_class" in trainer_params:
    trainer_kwargs["processing_class"] = tokenizer
else:
    trainer_kwargs["tokenizer"] = tokenizer

trainer = Trainer(**trainer_kwargs)

training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use_gather_object=False,
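
The version-compatibility trick used in the cell above (checking a constructor's signature before choosing a keyword name) can be shown in isolation. This is a minimal sketch: `pick_kwarg`, `old_api`, and `new_api` are hypothetical names invented for illustration and are not part of `transformers`.

```python
import inspect


def pick_kwarg(func, preferred, fallback):
    """Return whichever keyword name `func` actually accepts."""
    params = inspect.signature(func).parameters
    return preferred if preferred in params else fallback


# Hypothetical stand-ins for an older and a newer release of the same API.
def old_api(evaluation_strategy="no"):
    return evaluation_strategy


def new_api(eval_strategy="no"):
    return eval_strategy


for api in (old_api, new_api):
    name = pick_kwarg(api, "eval_strategy", "evaluation_strategy")
    # Build the kwargs dict dynamically, as the notebook does with ta_kwargs.
    print(api.__name__, name, api(**{name: "epoch"}))
```

Inspecting the signature avoids pinning a specific library version, at the cost of a little indirection.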

## 15) Train

⚠️ If you ever saw `CUDA error: device-side assert triggered` earlier, **restart the runtime** and rerun from the top before training.


In [21]:
train_result = trainer.train()
train_result

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro |
|---|---|---|---|---|
| 1 | 0.1869 | 0.17196 | 0.94575 | 0.945488 |
| 2 | 0.145 | 0.188322 | 0.9485 | 0.948275 |
| 3 | 0.073 | 0.236517 | 0.94725 | 0.946987 |


TrainOutput(global_step=20250, training_loss=0.134192324720783, metrics={'train_runtime': 2146.7462, 'train_samples_per_second': 150.926, 'train_steps_per_second': 9.433, 'total_flos': 1.5316783573832832e+16, 'train_loss': 0.134192324720783, 'epoch': 3.0})
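
The reported `global_step=20250` can be cross-checked against the configuration. The sketch below redoes the arithmetic, assuming a single GPU (consistent with `_n_gpu=1` in the printed arguments) and no gradient accumulation (the `TrainingArguments` default); the variable names are illustrative.

```python
import math

train_examples = 108_000  # training split size after the 90/10 split
batch_size = 16           # TRAIN_BATCH_SIZE
epochs = 3                # NUM_EPOCHS

# One optimizer step per batch when there is no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)

# Throughput implied by the reported train_runtime (in seconds).
train_runtime = 2146.7462
samples_per_second = train_examples * epochs / train_runtime
print(round(samples_per_second, 1))
```

Both values agree with the `TrainOutput` above: 6750 steps per epoch give 20250 steps in total, and the throughput works out to roughly 150.9 samples per second.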

## 16) Evaluate (Validation + Test)

In [22]:
val_metrics = trainer.evaluate(eval_dataset=tokenized["validation"])
test_metrics = trainer.evaluate(eval_dataset=tokenized["test"])

print("Validation metrics:", val_metrics)
print("Test metrics:", test_metrics)

Validation metrics: {'eval_loss': 0.18832184374332428, 'eval_accuracy': 0.9485, 'eval_f1_macro': 0.9482751692113294, 'eval_runtime': 19.2647, 'eval_samples_per_second': 622.901, 'eval_steps_per_second': 19.466, 'epoch': 3.0}
Test metrics: {'eval_loss': 0.19649437069892883, 'eval_accuracy': 0.9455263157894737, 'eval_f1_macro': 0.9455005728413841, 'eval_runtime': 11.874, 'eval_samples_per_second': 640.056, 'eval_steps_per_second': 20.044, 'epoch': 3.0}
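
Because the run sets `load_best_model_at_end=True` with `metric_for_best_model="f1_macro"`, the weights evaluated here should be those of the checkpoint with the highest validation Macro-F1, not necessarily the last epoch. The sketch below re-derives that choice from the per-epoch log; the `history` list is copied by hand from the training table, and `best` is an illustrative name.

```python
# Per-epoch validation results, copied from the training log.
history = [
    {"epoch": 1, "eval_loss": 0.171960, "f1_macro": 0.945488},
    {"epoch": 2, "eval_loss": 0.188322, "f1_macro": 0.948275},
    {"epoch": 3, "eval_loss": 0.236517, "f1_macro": 0.946987},
]

# greater_is_better=True, so the best checkpoint maximizes f1_macro.
best = max(history, key=lambda row: row["f1_macro"])
print(best["epoch"], best["eval_loss"])
```

Epoch 2 is selected, and its validation loss (0.188322) matches the `eval_loss` printed above even though the metrics dict still reports `'epoch': 3.0`. The validation loss rising from epoch 1 to epoch 3 while training loss falls may indicate the onset of overfitting.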


## 17) Detailed analysis (classification report + confusion matrix)

In [23]:
pred_output = trainer.predict(tokenized["test"])
test_logits = pred_output.predictions
test_labels = pred_output.label_ids
test_preds = np.argmax(test_logits, axis=-1)

print(classification_report(test_labels, test_preds, digits=4))

              precision    recall  f1-score   support

           0     0.9604    0.9568    0.9586      1900
           1     0.9832    0.9884    0.9858      1900
           2     0.9119    0.9258    0.9188      1900
           3     0.9267    0.9111    0.9188      1900

    accuracy                         0.9455      7600
   macro avg     0.9455    0.9455    0.9455      7600
weighted avg     0.9455    0.9455    0.9455      7600



In [24]:
cm = confusion_matrix(test_labels, test_preds)
cm_df = pd.DataFrame(cm)
cm_df

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1818 | 12 | 36 | 34 |
| 1 | 12 | 1878 | 8 | 2 |
| 2 | 31 | 9 | 1759 | 101 |
| 3 | 32 | 11 | 126 | 1731 |
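
The classification report can be re-derived from this confusion matrix, which is a useful consistency check. In scikit-learn's convention, rows are true labels and columns are predictions. The sketch below uses plain Python; `cm_rows` is a hand-copied version of the matrix above and the other names are illustrative.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm_rows = [
    [1818, 12, 36, 34],
    [12, 1878, 8, 2],
    [31, 9, 1759, 101],
    [32, 11, 126, 1731],
]
n = len(cm_rows)
total = sum(sum(row) for row in cm_rows)
accuracy = sum(cm_rows[i][i] for i in range(n)) / total

f1_scores = []
for i in range(n):
    true_pos = cm_rows[i][i]
    recall = true_pos / sum(cm_rows[i])                        # row sum = support
    precision = true_pos / sum(cm_rows[r][i] for r in range(n))  # column sum
    f1 = 2 * precision * recall / (precision + recall)
    f1_scores.append(f1)
    print(i, round(precision, 4), round(recall, 4), round(f1, 4))

# Macro-F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_scores) / n
print(round(accuracy, 4), round(macro_f1, 4))
```

The largest off-diagonal counts are between classes 2 and 3 (101 and 126), which is where most of the remaining errors sit.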


## 18) Save model + tokenizer

In [25]:
save_path = MODELS_DIR / "bert_agnews_best"
save_path.mkdir(parents=True, exist_ok=True)

trainer.save_model(str(save_path))
tokenizer.save_pretrained(str(save_path))

print("Saved to:", save_path)

Saved to: /content/drive/MyDrive/finetuning-bert-text-classification/models/bert_agnews_best


## 19) Write report to `reports/summary.md` (tidy)

In [26]:
from datetime import datetime

report_path = REPORTS_DIR / "summary.md"
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Index directly: a missing metric then raises a clear KeyError here instead of
# a TypeError when None is formatted with `:.4f` in the report below.
val_loss = val_metrics["eval_loss"]
val_acc  = val_metrics["eval_accuracy"]
val_f1   = val_metrics["eval_f1_macro"]

test_loss = test_metrics["eval_loss"]
test_acc  = test_metrics["eval_accuracy"]
test_f1   = test_metrics["eval_f1_macro"]

report_text = f"""# Task 1 — Fine-tuning BERT for Text Classification (AG News)

Generated: {now}

## Setup
- Dataset: {DATASET_NAME}
- Model: {MODEL_CHECKPOINT}
- Task: 4-class news topic classification (AG News)
- Max length: {MAX_LENGTH}
- Epochs: {NUM_EPOCHS}
- Learning rate: {LEARNING_RATE}
- Train batch size: {TRAIN_BATCH_SIZE}
- Eval batch size: {EVAL_BATCH_SIZE}
- Weight decay: {WEIGHT_DECAY}
- Run mode: QUICK_RUN = {QUICK_RUN} (full run if False)

## Data sizes used
- Train: {len(tokenized["train"])}
- Validation: {len(tokenized["validation"])}
- Test: {len(tokenized["test"])}

## Results
| Split | Loss | Accuracy | Macro-F1 |
|------|------|----------|----------|
| Validation | {val_loss:.4f} | {val_acc:.4f} | {val_f1:.4f} |
| Test | {test_loss:.4f} | {test_acc:.4f} | {test_f1:.4f} |

## Notes
- Labels were normalized to 0–3 (if needed) to match `num_labels=4`.
- Add: confusion matrix screenshot/table and a short error analysis section (example misclassifications).
"""

report_path.write_text(report_text, encoding="utf-8")
print("Wrote:", report_path)
print(report_text[:800])

Wrote: /content/drive/MyDrive/finetuning-bert-text-classification/reports/summary.md
# Task 1 — Fine-tuning BERT for Text Classification (AG News)

Generated: 2026-01-09 05:09:15

## Setup
- Dataset: sh0416/ag_news
- Model: bert-base-uncased
- Task: 4-class news topic classification (AG News)
- Max length: 128
- Epochs: 3
- Learning rate: 2e-05
- Train batch size: 16
- Eval batch size: 32
- Weight decay: 0.01
- Run mode: QUICK_RUN = False (full run if False)

## Data sizes used
- Train: 108000
- Validation: 12000
- Test: 7600

## Results
| Split | Loss | Accuracy | Macro-F1 |
|------|------|----------|----------|
| Validation | 0.1883 | 0.9485 | 0.9483 |
| Test | 0.1965 | 0.9455 | 0.9455 |

## Notes
- Labels were normalized to 0–3 (if needed) to match `num_labels=4`.
- Add: confusion matrix screenshot/table and a short error analysis section (example misclassifications).



## 20) Quick inference demo

In [27]:
def predict_texts(texts):
    """Return predicted class ids (0..3) for a list of raw strings."""
    model.eval()  # make sure dropout is disabled for inference
    inputs = tokenizer(texts, return_tensors="pt", truncation=True, max_length=MAX_LENGTH, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    preds = torch.argmax(logits, dim=-1).cpu().numpy()
    return preds.tolist()

examples = [
    "Stock markets rally as tech shares lead gains.",
    "The team won the championship after a thrilling final match.",
    "New study reveals advances in quantum computing hardware.",
    "Leaders meet to discuss international peace negotiations.",
]

print(list(zip(examples, predict_texts(examples))))

[('Stock markets rally as tech shares lead gains.', 3), ('The team won the championship after a thrilling final match.', 1), ('New study reveals advances in quantum computing hardware.', 3), ('Leaders meet to discuss international peace negotiations.', 0)]
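
The demo prints raw class ids. To make them readable, the ids can be mapped to class names. The sketch below assumes the usual AG News ordering (World, Sports, Business, Sci/Tech) after the shift to 0-based labels; that ordering is an assumption that should be confirmed against the dataset card, and `ID2LABEL` and `decode_predictions` are illustrative names.

```python
# ASSUMPTION: standard AG News class order after shifting labels from 1..4 to 0..3.
ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def decode_predictions(pred_ids):
    """Translate integer class ids into human-readable names."""
    return [ID2LABEL[i] for i in pred_ids]


demo_ids = [3, 1, 3, 0]  # the ids printed by the inference demo
print(decode_predictions(demo_ids))
```

Under that assumed mapping, the stock-market example is labeled Sci/Tech rather than Business, in line with the confusion between classes 2 and 3 seen in the confusion matrix.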
