# Task 1B — Fine-tuning BERT for **MNLI** (Natural Language Inference) — Full Run

This notebook fine-tunes a BERT encoder for **3-class NLI** with the following setup:
- Dataset: `nyu-mll/glue` (config: `mnli`)
- Model: `bert-base-uncased`

It is designed for **Google Colab** and saves checkpoints, the final model, and the summary report to a project folder on your Google Drive.

**Recommended repo:** `finetuning-bert-nli`  
**Notebook path:** `notebooks/02_finetune_bert_mnli_fullrun_drive.ipynb`


## 0) Mount Google Drive (required)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1) Project directory on Drive (recommended)

In [None]:
# Recommended path for MNLI repo (change if you store it elsewhere)
PROJECT_DIR = "/content/drive/MyDrive/finetuning-bert-nli"

In [None]:
from pathlib import Path

PROJECT_PATH = Path(PROJECT_DIR)
REPORTS_DIR = PROJECT_PATH / "reports"
NOTEBOOKS_DIR = PROJECT_PATH / "notebooks"
MODELS_DIR = PROJECT_PATH / "models"
OUTPUTS_DIR = PROJECT_PATH / "outputs"

for d in [REPORTS_DIR, NOTEBOOKS_DIR, MODELS_DIR, OUTPUTS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print("PROJECT_PATH:", PROJECT_PATH)

PROJECT_PATH: /content/drive/MyDrive/finetuning-bert-nli


## 2) Install dependencies

In [None]:
!pip -q install -U transformers datasets evaluate accelerate scikit-learn

## 3) Imports & reproducibility

In [None]:
import os, random
import numpy as np
import pandas as pd
import torch

from datasets import load_dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    set_seed,
)
import evaluate

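SEED = 42
# set_seed seeds Python's random, NumPy, and PyTorch (CPU and CUDA) in one call;
# the explicit calls below are redundant but harmless.
set_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)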

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

Device: cuda


## 4) Configuration (FULL RUN)

In [None]:
MODEL_CHECKPOINT = "bert-base-uncased"
DATASET_NAME = "nyu-mll/glue"
DATASET_CONFIG = "mnli"

MAX_LENGTH = 256
NUM_EPOCHS = 3
LEARNING_RATE = 2e-5
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
WEIGHT_DECAY = 0.01

OUTPUT_DIR = str(OUTPUTS_DIR / "bert_mnli")

print("MODEL_CHECKPOINT:", MODEL_CHECKPOINT)
print("DATASET:", DATASET_NAME, DATASET_CONFIG)

MODEL_CHECKPOINT: bert-base-uncased
DATASET: nyu-mll/glue mnli


## 5) Load dataset

In [None]:
ds = load_dataset(DATASET_NAME, DATASET_CONFIG)
ds

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9847
    })
})

### 5.1 Inspect columns and splits

In [None]:
print("Train columns:", ds["train"].column_names)
pd.DataFrame(ds["train"][:3])

Train columns: ['premise', 'hypothesis', 'label', 'idx']


| | premise | hypothesis | label | idx |
|------|------|------|------|------|
| 0 | Conceptually cream skimming has two basic dime... | Product and geography are what make cream skim... | 1 | 0 |
| 1 | you know during the season and i guess at at y... | You lose the things to the following level if ... | 0 | 1 |
| 2 | One of our number will carry out your instruct... | A member of my team will execute your orders w... | 0 | 2 |


## 6) Pick validation splits (matched + mismatched)

In [None]:
splits = DatasetDict(
    train=ds["train"],
    validation_matched=ds["validation_matched"],
    validation_mismatched=ds["validation_mismatched"],
)

print({k: len(v) for k, v in splits.items()})

{'train': 392702, 'validation_matched': 9815, 'validation_mismatched': 9832}


## 7) Tokenization (premise + hypothesis)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT, use_fast=True)

def get_fields(batch):
    if "premise" in batch and "hypothesis" in batch:
        return batch["premise"], batch["hypothesis"]
    if "sentence1" in batch and "sentence2" in batch:
        return batch["sentence1"], batch["sentence2"]
    raise KeyError("Could not find premise/hypothesis fields in MNLI batch.")

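def tokenize_batch(batch):
    # Encode each premise/hypothesis pair as a single sequence, truncated to MAX_LENGTH.
    # No padding here: the data collator pads each batch later.
    s1, s2 = get_fields(batch)
    return tokenizer(s1, s2, truncation=True, max_length=MAX_LENGTH)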

remove_candidates = ["premise", "hypothesis", "sentence1", "sentence2", "idx"]
remove_cols = [c for c in remove_candidates if c in splits["train"].column_names]

tokenized = splits.map(tokenize_batch, batched=True, remove_columns=remove_cols)

def rename_label_column(ds_):
    if "label" in ds_.column_names and "labels" not in ds_.column_names:
        ds_ = ds_.rename_column("label", "labels")
    return ds_

tokenized = DatasetDict({k: rename_label_column(v) for k, v in tokenized.items()})

print("Tokenized columns:", tokenized["train"].column_names)

Tokenized columns: ['labels', 'input_ids', 'token_type_ids', 'attention_mask']
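
The `token_type_ids` column above exists because each premise/hypothesis pair is packed into one sequence. The toy sketch below shows that layout. It is not the real tokenizer: whitespace-split words stand in for WordPiece pieces, the helper `encode_pair` and the example sentences are made up for illustration, and the trimming loop only approximates truncating the longer side first.

```python
# Toy illustration only: whitespace "tokens" stand in for WordPiece pieces.
def encode_pair(premise, hypothesis, max_length):
    a = premise.lower().split()
    b = hypothesis.lower().split()
    budget = max_length - 3  # reserve room for [CLS] and two [SEP] markers
    # Trim one token at a time from whichever side is currently longer.
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS], the premise, and the first [SEP];
    # segment 1 covers the hypothesis and the final [SEP].
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, type_ids, mask = encode_pair("A man is sleeping", "Nobody is awake", max_length=16)
print(tokens)
print(type_ids)
```

The segment ids are what let the model tell the premise from the hypothesis inside the single packed sequence.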


## 8) Data collator

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
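
Tokenization did not pad, so the collator pads each batch to the length of its longest example instead of to `MAX_LENGTH`. The sketch below mimics that behaviour in plain Python. The helper `pad_batch` and the token ids are hypothetical stand-ins, and the real collator also handles `token_type_ids`, labels, and tensor conversion.

```python
# Minimal sketch of dynamic padding; not the DataCollatorWithPadding implementation.
def pad_batch(features, pad_id=0):
    # Pad only up to the longest sequence in this particular batch.
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        gap = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * gap)
        # Padding positions get mask 0 so attention ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * gap)
    return batch


# Made-up token ids for two examples of different lengths.
features = [
    {"input_ids": [101, 7, 8, 102, 9, 102], "attention_mask": [1] * 6},
    {"input_ids": [101, 5, 102, 6, 102], "attention_mask": [1] * 5},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Since most MNLI pairs are far shorter than 256 tokens, per-batch padding typically avoids a lot of wasted computation.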

## 9) Metrics (Accuracy + Macro-F1)

In [None]:
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_metric.compute(predictions=preds, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=preds, references=labels, average="macro")["f1"]
    return {"accuracy": acc, "f1_macro": f1}
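
To make the `average="macro"` setting concrete, the sketch below computes macro-F1 by hand: one F1 score per class, then an unweighted mean across the three classes. The function `macro_f1` and the tiny prediction/label lists are made up for illustration; the notebook itself relies on the `evaluate` metric.

```python
# Hand-rolled macro-F1 for illustration; the notebook uses evaluate.load("f1").
def macro_f1(preds, labels, num_classes=3):
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for p, y in zip(preds, labels) if p == c and y == c)
        fp = sum(1 for p, y in zip(preds, labels) if p == c and y != c)
        fn = sum(1 for p, y in zip(preds, labels) if p != c and y == c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treat an empty class as 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro average: every class counts equally, regardless of its frequency.
    return sum(f1_scores) / num_classes


preds = [0, 1, 2, 2, 0, 1]
labels = [0, 1, 2, 1, 0, 0]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(round(accuracy, 4), round(macro_f1(preds, labels), 4))
```

MNLI classes are roughly balanced, so accuracy and macro-F1 tend to track each other closely, as the training log for this run also suggests.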

## 10) Load model

In [None]:
num_labels = 3  # MNLI: entailment / neutral / contradiction

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_CHECKPOINT,
    num_labels=num_labels,
).to(device)

print("num_labels:", model.config.num_labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


num_labels: 3
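
The model above is loaded with only `num_labels`, so its config uses generic names such as `LABEL_0`. An optional refinement, sketched below, is to define explicit mappings that follow the GLUE MNLI label order (0 = entailment, 1 = neutral, 2 = contradiction). `from_pretrained` accepts `id2label` and `label2id` keyword arguments for this purpose; the `decode` helper and the example logits are hypothetical.

```python
# Label order as used by the GLUE MNLI config.
id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}
label2id = {name: idx for idx, name in id2label.items()}

# These could be passed when loading the model, for example:
# AutoModelForSequenceClassification.from_pretrained(
#     MODEL_CHECKPOINT, num_labels=3, id2label=id2label, label2id=label2id
# )


def decode(logits):
    # Pick the index of the largest logit and map it to its class name.
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]


print(decode([0.1, 2.3, -1.0]))  # neutral
```

With the mappings stored in the config, the saved model reports readable class names at inference time.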


## 11) TrainingArguments + Trainer (version-compatible)

In [None]:
import inspect
from transformers import TrainingArguments, Trainer

use_fp16 = torch.cuda.is_available()

ta_kwargs = dict(
    output_dir=OUTPUT_DIR,
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=200,
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=WEIGHT_DECAY,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,
    fp16=use_fp16,
    report_to="none",
    seed=SEED,
)

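# Newer transformers releases renamed `evaluation_strategy` to `eval_strategy`;
# inspect the signature and pass whichever name the installed version accepts.
ta_params = inspect.signature(TrainingArguments.__init__).parameters
if "eval_strategy" in ta_params:
    ta_kwargs["eval_strategy"] = "epoch"
else:
    ta_kwargs["evaluation_strategy"] = "epoch"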

training_args = TrainingArguments(**ta_kwargs)

trainer_kwargs = dict(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation_matched"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

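# Likewise, newer releases take the tokenizer as `processing_class`
# instead of the older `tokenizer` argument.
trainer_params = inspect.signature(Trainer.__init__).parameters
if "processing_class" in trainer_params:
    trainer_kwargs["processing_class"] = tokenizer
else:
    trainer_kwargs["tokenizer"] = tokenizer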

trainer = Trainer(**trainer_kwargs)

training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use_gather_object=False,
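
The version-compatibility trick in the cell above generalises to any renamed keyword argument. A minimal standard-library sketch follows. The helper `pick_kwarg` and the two stand-in classes are hypothetical and exist only to demonstrate the pattern without importing `transformers`.

```python
import inspect


def pick_kwarg(func, candidates):
    """Return the first candidate name that `func` accepts as a parameter."""
    params = inspect.signature(func).parameters
    for name in candidates:
        if name in params:
            return name
    raise TypeError(f"{func.__qualname__} accepts none of {candidates}")


# Stand-ins for an older and a newer API that renamed one argument.
class OldArgs:
    def __init__(self, output_dir, evaluation_strategy="no"):
        self.strategy = evaluation_strategy


class NewArgs:
    def __init__(self, output_dir, eval_strategy="no"):
        self.strategy = eval_strategy


for cls in (OldArgs, NewArgs):
    # Prefer the new name, fall back to the old one.
    name = pick_kwarg(cls.__init__, ("eval_strategy", "evaluation_strategy"))
    args = cls(output_dir="out", **{name: "epoch"})
    print(cls.__name__, name, args.strategy)
```

Listing the newer name first means the code follows the current API where it exists and only falls back to the legacy name on older installs.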

## 12) Train

In [None]:
train_result = trainer.train()
train_result

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro |
|------|------|------|------|------|
| 1 | 0.4631 | 0.429509 | 0.837188 | 0.836571 |
| 2 | 0.3086 | 0.447262 | 0.844829 | 0.844119 |
| 3 | 0.2338 | 0.573717 | 0.844626 | 0.844257 |


TrainOutput(global_step=73632, training_loss=0.3571284709547665, metrics={'train_runtime': 6590.0404, 'train_samples_per_second': 178.771, 'train_steps_per_second': 11.173, 'total_flos': 4.906642717478653e+16, 'train_loss': 0.3571284709547665, 'epoch': 3.0})
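
The reported `global_step` can be sanity-checked from the configuration. The short calculation below assumes a single device (the arguments dump shows `_n_gpu=1`) and no gradient accumulation (none is configured), so one optimizer step consumes one batch of 16 examples.

```python
import math

train_examples = 392702  # size of the MNLI train split
batch_size = 16          # TRAIN_BATCH_SIZE
epochs = 3               # NUM_EPOCHS

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 24544 73632

# Cross-check against the logged runtime of about 6590 seconds.
train_runtime = 6590.0404
print(round(total_steps / train_runtime, 3), "steps/s")
print(round(train_runtime / 3600, 2), "hours")
```

Both figures agree with the `TrainOutput` above: 73,632 steps at roughly 11.17 steps per second.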

## 13) Evaluate (matched + mismatched)

In [None]:
val_matched = trainer.evaluate(eval_dataset=tokenized["validation_matched"])
val_mismatched = trainer.evaluate(eval_dataset=tokenized["validation_mismatched"])

print("Validation (matched):", val_matched)
print("Validation (mismatched):", val_mismatched)

Validation (matched): {'eval_loss': 0.5737167000770569, 'eval_accuracy': 0.8446255731023943, 'eval_f1_macro': 0.8442566054151638, 'eval_runtime': 12.7741, 'eval_samples_per_second': 768.354, 'eval_steps_per_second': 24.033, 'epoch': 3.0}
Validation (mismatched): {'eval_loss': 0.5500074028968811, 'eval_accuracy': 0.8486574450772986, 'eval_f1_macro': 0.8481380405426657, 'eval_runtime': 13.1827, 'eval_samples_per_second': 745.826, 'eval_steps_per_second': 23.364, 'epoch': 3.0}
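
The matched loss of 0.5737 above equals the epoch-3 row of the training log. That is consistent with `load_best_model_at_end=True` and `metric_for_best_model="f1_macro"` restoring the epoch-3 checkpoint. The sketch below replays that selection on the logged values and shows, as a hypothetical alternative, what selecting by validation loss would have picked.

```python
# Per-epoch validation metrics copied from the training log (matched split).
history = [
    {"epoch": 1, "eval_loss": 0.429509, "f1_macro": 0.836571},
    {"epoch": 2, "eval_loss": 0.447262, "f1_macro": 0.844119},
    {"epoch": 3, "eval_loss": 0.573717, "f1_macro": 0.844257},
]

# Criterion used in this notebook: highest macro-F1.
best_by_f1 = max(history, key=lambda row: row["f1_macro"])
# Alternative criterion, for comparison only: lowest validation loss.
best_by_loss = min(history, key=lambda row: row["eval_loss"])

print(best_by_f1["epoch"], best_by_loss["epoch"])  # 3 1
```

Validation loss rises after epoch 1 while macro-F1 barely moves between epochs 2 and 3, so the choice of selection metric matters here. The epoch-3 lead in F1 is about 0.0001, which may well be within run-to-run noise.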


## 14) Save model + tokenizer

In [None]:
save_path = MODELS_DIR / "bert_mnli_best"
save_path.mkdir(parents=True, exist_ok=True)

trainer.save_model(str(save_path))
tokenizer.save_pretrained(str(save_path))

print("Saved to:", save_path)

## 15) Write report to `reports/summary_mnli.md`

In [None]:
from datetime import datetime

report_path = REPORTS_DIR / "summary_mnli.md"
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

report_text = f"""# Task 1B — Fine-tuning BERT for MNLI (NLI)

Generated: {now}

## Setup
- Dataset: {DATASET_NAME} ({DATASET_CONFIG})
- Model: {MODEL_CHECKPOINT}
- Max length: {MAX_LENGTH}
- Epochs: {NUM_EPOCHS}
- Learning rate: {LEARNING_RATE}
- Train batch size: {TRAIN_BATCH_SIZE}
- Eval batch size: {EVAL_BATCH_SIZE}
- Weight decay: {WEIGHT_DECAY}

## Data sizes used
- Train: {len(tokenized["train"])}
- Validation (matched): {len(tokenized["validation_matched"])}
- Validation (mismatched): {len(tokenized["validation_mismatched"])}

## Results
| Split | Loss | Accuracy | Macro-F1 |
|------|------|----------|----------|
| Validation (matched) | {val_matched.get("eval_loss", 0):.4f} | {val_matched.get("eval_accuracy", 0):.4f} | {val_matched.get("eval_f1_macro", 0):.4f} |
| Validation (mismatched) | {val_mismatched.get("eval_loss", 0):.4f} | {val_mismatched.get("eval_accuracy", 0):.4f} | {val_mismatched.get("eval_f1_macro", 0):.4f} |
"""

report_path.write_text(report_text, encoding="utf-8")
print("Wrote:", report_path)

Wrote: /content/drive/MyDrive/finetuning-bert-nli/reports/summary_mnli.md
