# Hugging Face: Supervised Fine-Tuning 

This notebook was created by Natasha Patnaik for MIT 15.S60 - Computing in Optimization and Statistics. 

**Last Updated**: January 2026.

Although this code runs on a CPU, it will run much faster on an NVIDIA GPU using CUDA. In practice, you would typically access a GPU through a High-Performance Computing (HPC) cluster, such as MIT Engaging: https://orcd.mit.edu/resources/mit-campus-wide-resources. Alternatively, you can upload this notebook into Google Colab, adding appropriate `!pip install` lines to get the required libraries.

In [1]:
from __future__ import annotations  # must be the first statement in the cell

import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BertForMaskedLM,
    BertTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)
from transformers.modeling_outputs import ModelOutput
from transformers.tokenization_utils_base import BatchEncoding



In [2]:
# Some helper functions for pedagogical purposes

def describe_tensors(obj, name):
    print(f"\n{name}:")
    print("-" * len(name))

    # Case 1: tokenizer output (BatchEncoding or dict)
    if isinstance(obj, (dict, BatchEncoding)):
        for key, value in obj.items():
            if isinstance(value, torch.Tensor):
                print(f"{key:<15} shape = {tuple(value.shape)}")
            else:
                print(f"{key:<15} type  = {type(value)}")

    # Case 2: Hugging Face model outputs (ModelOutput dataclass)
    elif isinstance(obj, ModelOutput):

        for key, value in obj.__dict__.items():
            if value is None:
                continue

            # Shape of individual tensor 
            if isinstance(value, torch.Tensor):
                print(f"{key:<15} shape = {tuple(value.shape)}")

            else:
                print(f"{key:<15} type  = {type(value)}")

    # Case 3: plain tensor
    elif isinstance(obj, torch.Tensor):
        print(f"Tensor shape = {tuple(obj.shape)}")

    else:
        print(f"Unrecognized type: {type(obj)}")


## Introduction

In [3]:
# Load the pre-trained BERT tokenizer
# The first time from_pretrained() is called, the tokenizer files are downloaded and cached locally.
# On subsequent calls, they are loaded from the local cache unless an update is available.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# BERT uses WordPiece tokenization.
text = "Tokenization is unbelievably important for language models!"
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:")
print(tokens)

print("\nToken IDs:")
print(ids)


Tokens:
['token', '##ization', 'is', 'un', '##bel', '##ie', '##va', '##bly', 'important', 'for', 'language', 'models', '!']

Token IDs:
[19204, 3989, 2003, 4895, 8671, 2666, 3567, 6321, 2590, 2005, 2653, 4275, 999]
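
The `##` prefix marks a piece that continues the previous word. WordPiece splits each word greedily, repeatedly taking the longest vocabulary entry that matches at the current position. Below is a minimal sketch of that matching step, assuming a tiny hypothetical vocabulary `TOY_VOCAB`. The real `bert-base-uncased` vocabulary has 30,522 entries, and the real tokenizer also lowercases the text and splits off punctuation first.

```python
# Toy vocabulary for illustration only; NOT the real BERT vocabulary.
TOY_VOCAB = {"token", "##ization", "is", "un", "##bel", "##ie", "##va", "##bly"}

def wordpiece_split(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_split("tokenization"))
print(wordpiece_split("unbelievably"))
print(wordpiece_split("zebra"))
```

With this toy vocabulary the first two words split the same way as in the tokenizer output above, while `zebra` falls back to `[UNK]` because none of its prefixes are in the vocabulary.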


In [4]:
# The first time from_pretrained() is called, the model is downloaded and cached locally. 
# On subsequent calls, the model is loaded from the local cache unless an update is available.
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Put the model in evaluation mode
model.eval()

# Tokenize a simple masked sentence as input
text = "Boston is the capital of [MASK]."
inputs = tokenizer(text, return_tensors="pt")

# inputs["input_ids"]       = token indices into the WordPiece vocabulary
# inputs["attention_mask"]  = mask indicating which tokens should be attended to
# inputs["token_type_ids"]  = sentence/segment identifiers (used for sentence pairs)
describe_tensors(inputs, "Model Inputs")
print("-" * 30)

# Get the token IDs for the input sentence
input_ids = inputs["input_ids"][0] # first (and only) input sequence in the batch
print("Input token IDs:", input_ids)

# Convert token IDs back to tokens (subwords)
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print("Input tokens:", tokens)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Model Inputs:
------------
input_ids       shape = (1, 9)
token_type_ids  shape = (1, 9)
attention_mask  shape = (1, 9)
------------------------------
Input token IDs: tensor([ 101, 3731, 2003, 1996, 3007, 1997,  103, 1012,  102])
Input tokens: ['[CLS]', 'boston', 'is', 'the', 'capital', 'of', '[MASK]', '.', '[SEP]']


In [5]:
# Run the model
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

# outputs.logits shape: (batch_size, sequence_length, vocab_size)
#   batch_size      = number of input sequences processed at once
#   sequence_length = number of tokens in each input (incl. [CLS], [SEP], [MASK])
#   vocab_size      = size of BERT’s WordPiece vocabulary (approx. 30k tokens)

# outputs.hidden_states: tuple of tensors, one per layer (+ embeddings)
#   each tensor shape: (batch_size, sequence_length, hidden_size)
#   for bert-base, hidden_size=768 means each token is represented by a 768-dimensional vector.

# outputs.attentions: tuple of tensors, one per layer
#   each tensor shape: (batch_size, num_heads, sequence_length, sequence_length)
#   attention weights showing how each token attends to every other token

describe_tensors(outputs, "Model Outputs")
describe_tensors(outputs.hidden_states[0], "Example Hidden State Tensor")
describe_tensors(outputs.attentions[0], "Example Attention Tensor")




Model Outputs:
-------------
logits          shape = (1, 9, 30522)
hidden_states   type  = <class 'tuple'>
attentions      type  = <class 'tuple'>

Example Hidden State Tensor:
---------------------------
Tensor shape = (1, 9, 768)

Example Attention Tensor:
------------------------
Tensor shape = (1, 12, 9, 9)
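
Each `(sequence_length, sequence_length)` slice of an attention tensor holds one row per token, and each row sums to 1: row `i` gives the weights token `i` places on every token in the sequence. The sketch below computes scaled dot-product attention weights for a single head in plain Python, using made-up 2-dimensional query and key vectors. In `bert-base` each of the 12 heads works with vectors of dimension 768 / 12 = 64.

```python
import math

def attention_weights(queries, keys):
    """Return softmax(Q K^T / sqrt(d)) as a list of rows, one row per query token."""
    d = len(queries[0])
    weights = []
    for q in queries:
        # Scaled dot product between this query and every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores
        shift = max(scores)
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up query/key vectors for three tokens (illustration only)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

A = attention_weights(Q, K)
for row in A:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```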


In [6]:
# Get the token ID that represents [MASK]
mask_token_id = tokenizer.mask_token_id
print("\n[MASK] token ID:", mask_token_id)

# Find the position of the [MASK] token in the input sentence
mask_positions = (input_ids == mask_token_id).nonzero(as_tuple=False)

# Extract the token index 
mask_index = mask_positions.item()
print("[MASK] token index in our input sentence:", mask_index)

# Get the model's prediction scores (logits) for the masked position
mask_logits = outputs.logits[0, mask_index]

# Choose the token ID with the highest score
predicted_token_id = mask_logits.argmax(dim=-1)

# Decode and print result
predicted_word = tokenizer.decode(predicted_token_id)
print("----------------------------------")
print("Predicted word:", predicted_word)


[MASK] token ID: 103
[MASK] token index in our input sentence: 6
----------------------------------
Predicted word: massachusetts


In [7]:
# Get the top 5 tokens
topk = torch.topk(mask_logits, k=5) # returns the top k values and their indices in the vocab
topk_scores = topk.values.tolist()
topk_ids = topk.indices.tolist()

# Convert token IDs to strings
topk_tokens = tokenizer.convert_ids_to_tokens(topk_ids)

print("Top 5 predictions for [MASK]:")
for token, score in zip(topk_tokens, topk_scores):
    print(f"{token:>10} -> {score:.4f}")

Top 5 predictions for [MASK]:
massachusetts -> 15.9126
     maine -> 10.9696
   america -> 10.8234
    canada -> 10.4788
   england -> 10.1370
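
The scores above are raw logits, not probabilities. A softmax turns logits into probabilities. The sketch below applies it only to the five printed logits, so the results are probabilities renormalized over these five candidates, not over the full 30,522-token vocabulary. For full-vocabulary probabilities, one would apply `torch.softmax(mask_logits, dim=-1)` before calling `torch.topk`.

```python
import math

# Top-5 logits for [MASK], copied from the output above
top5_logits = {
    "massachusetts": 15.9126,
    "maine": 10.9696,
    "america": 10.8234,
    "canada": 10.4788,
    "england": 10.1370,
}

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

relative_probs = dict(zip(top5_logits, softmax(list(top5_logits.values()))))
for token, p in relative_probs.items():
    print(f"{token:>15} -> {p:.4f}")
```

A logit gap of about 5 between the first and second candidates corresponds to a ratio of roughly e^5 in probability, which is why `massachusetts` dominates.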


## Fine-tuning a Classification Model

Let's try fine-tuning BERT on a *multi-label classification* task using google-research-datasets/go_emotions on Hugging Face. Here's the dataset card: https://huggingface.co/datasets/google-research-datasets/go_emotions

In [8]:
# Set random seed for reproducibility
set_seed(42)

# Global configuration
MODEL_NAME = "bert-base-uncased"
DATASET_NAME = "google-research-datasets/go_emotions"
TEXT_COL = "text"

# For classification, we only need to tokenize the input text.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

MAX_LENGTH =  tokenizer.model_max_length
print("Model context window:", MAX_LENGTH)

# Classification threshold for converting probabilities to label predictions
THRESHOLD = 0.5  

# Sigmoid function to convert logits to probabilities
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Check whether we are using CUDA (GPU) or CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

Model context window: 512
Device: cuda


### Data Pre-processing

In [9]:
# Load the GoEmotions dataset
raw_data = load_dataset(DATASET_NAME)
print(raw_data)
print("-" * 50)

# Look at an example row from the training set
example = raw_data["train"][0]
for key, value in example.items():
    print(f"{key}: {value}")
print("-" * 50)
print()

# Hugging Face datasets carry a typed schema in .features.
# This tells us what each column means and how labels are encoded.
print("=== Dataset features schema ===")
dataset_features_schema = raw_data["train"].features
for key, value in dataset_features_schema.items():
    print(f"{key}: {value}")
print("-" * 50)
print()

# The "labels" column is stored as a Sequence of ClassLabel objects.
# Each example contains a list of label IDs, and each ID maps to a human-readable label name.
labels_column_schema = dataset_features_schema["labels"]
label_id_schema = labels_column_schema.feature  # This is the ClassLabel mapping

print("Labels column schema (list of IDs):", labels_column_schema)
print("Label ID schema (ID to name mapping):", label_id_schema)
print("-" * 50)
print()

# Get number of labels and label names from the schema
label_names = label_id_schema.names
num_labels = label_id_schema.num_classes

print(f"Number of labels: {num_labels}")
print("Label names:")
for i, name in enumerate(label_names):
    print(f"{i:2d} → {name}")

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'id'],
        num_rows: 43410
    })
    validation: Dataset({
        features: ['text', 'labels', 'id'],
        num_rows: 5426
    })
    test: Dataset({
        features: ['text', 'labels', 'id'],
        num_rows: 5427
    })
})
--------------------------------------------------
text: My favourite food is anything I didn't have to cook myself.
labels: [27]
id: eebbqej
--------------------------------------------------

=== Dataset features schema ===
text: Value('string')
labels: List(ClassLabel(names=['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']))
id: Value('string')
--------------------------------------------------


In [10]:
def tokenize_and_build_multihot(batch):
    """
    Converts a batch of examples into:
      - tokenized inputs: input_ids, attention_mask
      - multi-label targets: labels as multi-hot encoding vectors [B, num_labels]

    IMPORTANT: For BCEWithLogitsLoss, labels must be floats (0.0/1.0).
    """

    # Tokenize the input text
    # If a sequence is too long, truncate to max length (context window size)
    # If a sequence is shorter than max length, padding is done later by DataCollatorWithPadding
    tokenized = tokenizer(
        batch[TEXT_COL],
        truncation=True,
        max_length=MAX_LENGTH,
    )

    # Create an empty matrix of shape: (batch_size, num_labels)
    # We will fill this with 0s and 1s to form a multi-hot encoding.
    multi_hot = np.zeros((len(batch["labels"]), num_labels), dtype=np.float32)
    for i, label_ids in enumerate(batch["labels"]):
        for label_id in label_ids:  # avoid shadowing the built-in id()
            multi_hot[i, label_id] = 1.0

    tokenized["labels"] = multi_hot
    return tokenized

# Apply to datasets (batched mapping)
tokenized = raw_data.map(tokenize_and_build_multihot, batched=True)

# We can drop raw columns we no longer need. Keep only model-ready fields.
cols_to_remove = [c for c in raw_data["train"].column_names if c != "labels"]

# The tokenized dataset now contains token fields + labels; remove original "text" etc.
tokenized = tokenized.remove_columns(cols_to_remove)

print("Tokenized columns:", tokenized["train"].column_names)
print("Tokenized label shape:", np.array(tokenized["train"][0]["labels"]).shape)


Map: 100%|██████████| 5427/5427 [00:00<00:00, 57458.83 examples/s]

Tokenized columns: ['labels', 'input_ids', 'token_type_ids', 'attention_mask']
Tokenized label shape: (28,)
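
To make the multi-hot encoding concrete, here is a small pure-Python sketch of the same idea for single examples. The helper name `to_multi_hot` is hypothetical; the label list `[27]` is the first training example shown earlier, and `[0, 15]` is a made-up example carrying two emotions.

```python
NUM_LABELS = 28  # number of GoEmotions labels, as in the dataset schema

def to_multi_hot(label_ids, num_labels=NUM_LABELS):
    """Turn a list of label IDs into a 0.0/1.0 vector of length num_labels."""
    vec = [0.0] * num_labels
    for label_id in label_ids:
        vec[label_id] = 1.0
    return vec

single = to_multi_hot([27])     # first training example: 'neutral'
double = to_multi_hot([0, 15])  # made-up example: 'admiration' + 'gratitude'

print("single:", single)
print("double:", double)
print("active positions in double:", [i for i, v in enumerate(double) if v == 1.0])
```

Unlike one-hot encoding, more than one position can be 1.0, which is what lets a single text carry several emotions at once.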





In [11]:
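# A data collator is the function that turns a list of dataset rows into a single training batch.
# Note: the public names listed below are configuration attributes (dataclass fields), not methods.
print("=== Public methods in DataCollatorWithPadding ===")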
for name in dir(DataCollatorWithPadding):
    if not name.startswith("_"): 
        print(name)
print()

# For pedagogical purposes, we instantiate the default data collator here and see what it does
base_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Grab two examples with different sequence lengths
sample_examples = [
    tokenized["train"][0],
    tokenized["train"][1],
]

print("=== Raw examples ===")
print("Example 0 length:", len(sample_examples[0]["input_ids"]))
print("Example 1 length:", len(sample_examples[1]["input_ids"]))
print()

# Collate them into a batch
batch = base_collator(sample_examples)

print("=== After DataCollatorWithPadding ===")
for k, v in batch.items():
    print(k, v.shape, v.dtype)

# Show padding effect
print("\ninput_ids:")
print(batch["input_ids"])

=== Public methods in DataCollatorWithPadding ===
max_length
pad_to_multiple_of
padding
return_tensors

=== Raw examples ===
Example 0 length: 16
Example 1 length: 24

=== After DataCollatorWithPadding ===
labels torch.Size([2, 28]) torch.int64
input_ids torch.Size([2, 24]) torch.int64
token_type_ids torch.Size([2, 24]) torch.int64
attention_mask torch.Size([2, 24]) torch.int64

input_ids:
tensor([[  101,  2026,  8837,  2833,  2003,  2505,  1045,  2134,  1005,  1056,
          2031,  2000,  5660,  2870,  1012,   102,     0,     0,     0,     0,
             0,     0,     0,     0],
        [  101,  2085,  2065,  2002,  2515,  2125,  2370,  1010,  3071,  2097,
          2228,  2002,  2015,  2383,  1037,  4756, 29082,  2007,  2111,  2612,
          1997,  2941,  2757,   102]])
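
The padding step itself is simple. The sketch below is a plain-Python illustration of dynamic padding, not the library's implementation: every sequence is extended with the pad ID (0, matching the zeros in the output above) up to the longest sequence in the batch, and the attention mask marks real tokens with 1 and padding with 0. The function name `pad_batch` and the short ID lists are made up for illustration.

```python
PAD_ID = 0  # pad token ID, matching the zeros appended in the batch above

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Two made-up token ID sequences of different lengths
batch_ids, batch_mask = pad_batch([[101, 2026, 102], [101, 2085, 2065, 2002, 102]])

print(batch_ids)
print(batch_mask)
```

Padding per batch rather than to the full 512-token context window keeps tensors small when most inputs are short, as they are here.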


In [12]:
# DataCollatorWithPadding pads each batch to the longest example in that batch.
# Note from the output above that the collated labels come out as int64.
# BCEWithLogitsLoss, used for multi-label classification, needs float targets,
# so we create a custom data collator by subclassing DataCollatorWithPadding.
class CustomDataCollatorWithPadding(DataCollatorWithPadding):

    def __call__(self, features):

        # Run Hugging Face's default padding + tensorization
        batch = super().__call__(features)

        # BCEWithLogitsLoss expects float targets
        batch["labels"] = batch["labels"].float()

        return batch

data_collator = CustomDataCollatorWithPadding(tokenizer=tokenizer)
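
To see why the targets enter the loss as floats, it helps to write the loss out. The sketch below is a plain-Python version of binary cross-entropy computed directly from logits, averaged over labels. It is an illustration of the formula under the assumption of mean reduction, not PyTorch's implementation. The logits and targets are made up.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed stably from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Algebraically equal to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))],
        # but written so that exp() never overflows for large |z|.
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Made-up logits and multi-hot targets for one example with three labels
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]

print("loss:", round(bce_with_logits(logits, targets), 4))
```

Each label contributes an independent binary term, and the target `y` is multiplied directly with the logit, so the targets take part in floating-point arithmetic rather than acting as class indices.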

### Model Definition

In [13]:
# AutoModelForSequenceClassification adds a task-specific classification head on top of BERT.
# The head is dropout followed by a single linear layer that maps the pooled [CLS]
# representation (768 dimensions) to num_labels logits. Its weights are randomly initialized.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels,
    problem_type="multi_label_classification", # problem_type="multi_label_classification" tells HF to use BCEWithLogitsLoss.
).to(device)

# Look at the model architecture
print(model)

# View the classification head specifically
print("\n=== Classification head ===")
print(model.classifier)

# We can also access the classification head parameters
W = model.classifier.weight
b = model.classifier.bias

print("\nClassifier weight matrix shape:", W.shape)
print("Classifier bias shape:", b.shape)

# How many parameters does the model have?
total_params = sum(p.numel() for p in model.parameters())
head_params = sum(p.numel() for p in model.classifier.parameters())

print(f"\nTotal parameters: {total_params:,}")
print(f"Classifier head parameters: {head_params:,}")
print(f"Fraction in classifier head: {head_params / total_params:.4%}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e
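
The size of the classification head follows directly from the shapes involved: a linear layer from the 768-dimensional pooled representation to 28 logits has a weight matrix of shape `(28, 768)` plus one bias per label. A quick arithmetic check, using only numbers that appear in this notebook:

```python
hidden_size = 768  # BERT-base hidden dimension
num_labels = 28    # GoEmotions label count

weight_shape = (num_labels, hidden_size)   # nn.Linear stores weights as (out_features, in_features)
weight_params = num_labels * hidden_size
bias_params = num_labels
head_params = weight_params + bias_params

print("Classifier weight shape:", weight_shape)
print("Classifier head parameters:", f"{head_params:,}")
```

The head is therefore tiny compared with the encoder underneath it; nearly all of the parameters being fine-tuned come from the pre-trained BERT layers.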

### Training and Evaluation

In [14]:
def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for a multi-label classification model.

    This function converts model logits into binary label predictions using a
    sigmoid + threshold, then computes three complementary metrics:

    - f1_micro: Treats every (example, label) decision as an independent binary
                prediction and aggregates them globally. 

    - f1_macro: Computes F1 separately for each label and then averages them. 

    - exact_match: The fraction of examples for which the model predicted 
                   *all* labels correctly.
    """
    logits, labels = eval_pred  # logits: (N, num_labels), labels: (N, num_labels)

    probs = 1 / (1 + np.exp(-logits))                 # sigmoid
    preds = (probs >= THRESHOLD).astype(np.int32)     # multi-label predictions

    labels = labels.astype(np.int32)                  # ground truth multi-hot

    micro = f1_score(labels, preds, average="micro", zero_division=0)
    macro = f1_score(labels, preds, average="macro", zero_division=0)
    exact_match = float(np.mean(np.all(preds == labels, axis=1)))

    return {"f1_micro": micro, "f1_macro": macro, "exact_match": exact_match}
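
The difference between micro and macro averaging is easiest to see on a toy case with one frequent and one rare label. The sketch below reimplements both averages in plain Python (the helper names are hypothetical, and the data is made up); it is meant to mirror what `f1_score(..., average="micro")` and `average="macro"` compute with `zero_division=0`.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for multi-hot label matrices given as lists of lists."""
    num_labels = len(y_true[0])
    counts = []
    for j in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts across labels, then compute a single F1.
    micro = f1_from_counts(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / num_labels
    return micro, macro

# Label 0 is frequent and always predicted correctly; label 1 is rare and always missed.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Micro F1 stays high because the frequent label dominates the pooled counts, while macro F1 is pulled down by the rare label. A model that does well on common emotions and poorly on rare ones shows the same pattern: macro F1 below micro F1.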


In [15]:
# Training arguments - super easy with Transformers library!
args = TrainingArguments(
    output_dir="bert-goemotions-multilabel", # Where checkpoints, logs, and trainer outputs are saved
    learning_rate=2e-5,
    per_device_train_batch_size=16, 
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    eval_strategy="epoch",  # When to run evaluation
    save_strategy="epoch",  # When to save model checkpoints
    load_best_model_at_end=True, # After training, reload the checkpoint with the best evaluation metric
    metric_for_best_model="f1_micro",  # Metric used to determine which checkpoint is "best"
    greater_is_better=True,  # Indicates if larger metric values are better (True for F1)
    logging_strategy="steps",
    logging_steps=50,
    report_to="none", # Disable integration with external experiment trackers (e.g., Weights & Biases)
    fp16=torch.cuda.is_available(), # Mixed-precision (FP16) to reduce memory
)

# Trainer: combines model, args, datasets, tokenizer, data collator, metrics
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # Deprecated in recent transformers releases in favor of processing_class=tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Optional but useful sanity check
batch = next(iter(trainer.get_train_dataloader()))
print("Sanity check batch shapes:")
print(" input_ids:", batch["input_ids"].shape)
print(" attention_mask:", batch["attention_mask"].shape)
print(" labels:", batch["labels"].shape, batch["labels"].dtype)
# Expect labels dtype float32 for BCEWithLogitsLoss

Sanity check batch shapes:
 input_ids: torch.Size([16, 36])
 attention_mask: torch.Size([16, 36])
 labels: torch.Size([16, 28]) torch.float32
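
It is useful to know how many optimizer steps these settings imply before launching a long run. The arithmetic below is a rough sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. The numbers come from the dataset summary and the `TrainingArguments` above.

```python
import math

train_examples = 43410  # rows in the training split
batch_size = 16         # per_device_train_batch_size
epochs = 10             # num_train_epochs
logging_steps = 50      # logging_steps

# Assumes one device, gradient_accumulation_steps=1, and no dropped last batch.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps

print("Steps per epoch:", steps_per_epoch)
print("Total optimizer steps:", total_steps)
print("Training-loss log events:", log_events)
```

With several GPUs or gradient accumulation, the effective batch size grows and the step counts shrink accordingly.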




In [16]:
# Train, evaluate, and save model weights + architecture
trainer.train()
metrics = trainer.evaluate()
print("Validation metrics:", metrics)

save_dir = "bert-goemotions-multilabel-finetuned"
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)
print("Saved to:", save_dir)

| Epoch | Training Loss | Validation Loss | F1 Micro | F1 Macro | Exact Match |
|---|---|---|---|---|---|
| 1 | 0.0909 | 0.090875 | 0.535611 | 0.288413 | 0.407114 |
| 2 | 0.0826 | 0.084159 | 0.573554 | 0.411853 | 0.45282 |
| 3 | 0.0725 | 0.084958 | 0.577879 | 0.436225 | 0.461482 |
| 4 | 0.0536 | 0.090412 | 0.579297 | 0.461265 | 0.466826 |
| 5 | 0.0471 | 0.097697 | 0.572647 | 0.460999 | 0.462403 |
| 6 | 0.038 | 0.104581 | 0.571813 | 0.47076 | 0.463878 |
| 7 | 0.0321 | 0.111031 | 0.557675 | 0.466242 | 0.449687 |
| 8 | 0.0262 | 0.116304 | 0.568578 | 0.476703 | 0.455031 |
| 9 | 0.0231 | 0.119289 | 0.564945 | 0.479297 | 0.449134 |
| 10 | 0.0195 | 0.120857 | 0.564286 | 0.474608 | 0.449134 |


Validation metrics: {'eval_loss': 0.09041234850883484, 'eval_f1_micro': 0.579297079909075, 'eval_f1_macro': 0.46126483391374357, 'eval_exact_match': 0.4668263914485809, 'eval_runtime': 7.1503, 'eval_samples_per_second': 758.852, 'eval_steps_per_second': 23.775, 'epoch': 10.0}
Saved to: bert-goemotions-multilabel-finetuned
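
The per-epoch log above shows a typical overfitting pattern: training loss keeps falling, while validation loss bottoms out early and then rises. The sketch below copies the logged values into a list and picks out the best epoch by two criteria, to illustrate what `load_best_model_at_end=True` with `metric_for_best_model="f1_micro"` selects.

```python
# (epoch, training_loss, validation_loss, f1_micro), copied from the training log above
history = [
    (1, 0.0909, 0.090875, 0.535611),
    (2, 0.0826, 0.084159, 0.573554),
    (3, 0.0725, 0.084958, 0.577879),
    (4, 0.0536, 0.090412, 0.579297),
    (5, 0.0471, 0.097697, 0.572647),
    (6, 0.0380, 0.104581, 0.571813),
    (7, 0.0321, 0.111031, 0.557675),
    (8, 0.0262, 0.116304, 0.568578),
    (9, 0.0231, 0.119289, 0.564945),
    (10, 0.0195, 0.120857, 0.564286),
]

best_by_f1 = max(history, key=lambda row: row[3])    # highest micro F1
best_by_loss = min(history, key=lambda row: row[2])  # lowest validation loss

print("Best epoch by f1_micro:", best_by_f1[0], "with F1", best_by_f1[3])
print("Best epoch by validation loss:", best_by_loss[0], "with loss", best_by_loss[2])
```

The final validation metrics printed above match the epoch selected by micro F1 (same loss and same F1), which indicates that the best checkpoint, not the last one, was reloaded before the final `evaluate()` call and saved. Since micro F1 peaks early, fewer epochs would likely have been enough for this configuration.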


### Inference

In [17]:
# Inference: predict emotions for new text
# Set to evaluation mode: disables training-only behavior such as dropout (BERT uses LayerNorm, not batch norm).
model.eval()

def predict_emotions(texts, top_k=5, threshold=THRESHOLD):
    """
    In classification inference:
      text -> tokenize -> logits -> sigmoid -> label probabilities -> choose labels

    Args:
      texts (List[str]): input texts to classify
      top_k (int): number of highest-probability labels to return
      threshold (float): probability cutoff for selecting labels

    Returns:
      List[dict]: one result per input text, containing:
        - original text
        - top-k predicted labels
        - labels whose probability meets or exceeds the threshold
    """
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt",
    ).to(device)

    # Forward pass through the model: input is tokenized texts
    with torch.no_grad():
        logits = model(**enc).logits  # [B, num_labels]
        probs = torch.sigmoid(logits).cpu().numpy()

    # Post-processing results into readable format
    results = []
    for text, p in zip(texts, probs):
        top_idx = np.argsort(-p)[:top_k] # Sort label indices by prob (descending) and keep top_k
        top = [(label_names[i], float(p[i])) for i in top_idx] # Convert indices into (label_name, probability) pairs

        # Choose labels whose probability meets or exceeds the threshold
        chosen_idx = np.where(p >= threshold)[0]
        chosen = [(label_names[i], float(p[i])) for i in chosen_idx]
        chosen = sorted(chosen, key=lambda x: -x[1]) # Sort selected labels by probability (highest first)

        results.append({"text": text, "top_k": top, "above_threshold": chosen})

    return results


demo_texts = [
    "I can't believe you did that. I'm furious.",
    "That was so kind of you, thank you so much!",
    "I'm a bit nervous about tomorrow, but also sort of excited.",
]

for r in predict_emotions(demo_texts, top_k=3, threshold=0.5):
    print("\nTEXT:", r["text"])
    print("Top-K:", r["top_k"])
    print("Above threshold:", r["above_threshold"])



TEXT: I can't believe you did that. I'm furious.
Top-K: [('anger', 0.8766343593597412), ('annoyance', 0.193451926112175), ('neutral', 0.022328782826662064)]
Above threshold: [('anger', 0.8766343593597412)]

TEXT: That was so kind of you, thank you so much!
Top-K: [('gratitude', 0.9874721169471741), ('admiration', 0.32874488830566406), ('approval', 0.012241275049746037)]
Above threshold: [('gratitude', 0.9874721169471741)]

TEXT: I'm a bit nervous about tomorrow, but also sort of excited.
Top-K: [('excitement', 0.442516028881073), ('nervousness', 0.2485341727733612), ('fear', 0.06669065356254578)]
Above threshold: []
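
The last example shows a side effect of a fixed threshold: when the model spreads its probability over several plausible emotions, no label may clear 0.5 and the prediction comes back empty. One possible remedy, sketched below, is to fall back to the single most probable label when nothing passes the threshold. The helper `select_labels` is hypothetical and not part of the notebook's pipeline; the probabilities are rounded from the output above.

```python
def select_labels(scored_labels, threshold=0.5, fallback_to_top1=True):
    """Keep labels at or above the threshold; optionally fall back to the top label."""
    chosen = [(name, p) for name, p in scored_labels if p >= threshold]
    if not chosen and fallback_to_top1:
        # Nothing cleared the threshold: return the single most probable label.
        chosen = [max(scored_labels, key=lambda pair: pair[1])]
    return sorted(chosen, key=lambda pair: -pair[1])

# Rounded top-3 probabilities for the "nervous ... excited" example above
nervous_excited = [("excitement", 0.4425), ("nervousness", 0.2485), ("fear", 0.0667)]

print("threshold 0.5, with fallback:", select_labels(nervous_excited))
print("threshold 0.5, no fallback:  ", select_labels(nervous_excited, fallback_to_top1=False))
print("threshold 0.2:               ", select_labels(nervous_excited, threshold=0.2))
```

Lowering the threshold is the other obvious option, and here it recovers both intended emotions. Either choice trades precision for recall, so the threshold is best tuned on the validation set rather than picked by eye.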


## Compare: Evaluation Metrics for Pre-trained vs Fine-Tuned Model

A pre-trained BERT model with a randomly initialized classification head will usually score near chance on F1, because the head is untrained. Let's check and compare it against our fine-tuned model.

In [18]:
from transformers import AutoModelForSequenceClassification

# Load a fresh base model (pre-trained encoder, randomly initialized head) to compare against our fine-tuned model
base_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels,
    problem_type="multi_label_classification", 
).to(device)

# Corresponding Trainer for the Base Model
base_trainer = Trainer(
    model=base_model, # Pre-trained encoder with an untrained classification head
    args=args,  # Same TrainingArguments as before
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Create a new Trainer object for our Fine-Tuned Model (safer than re-using the one used during training)
ft_trainer = Trainer(
    model=model,  # Recall: this is our model fine-tuned on GoEmotions!
    args=args, # Same TrainingArguments as before
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Use the new Trainers to evaluate each respective model
base_metrics = base_trainer.evaluate()
ft_metrics = ft_trainer.evaluate()

print("\n=== Baseline (pretrained) ===")
for k, v in base_metrics.items():
    if k.startswith("eval_"):
        print(f"{k}: {v}")

print("\n=== Fine-tuned ===")
for k, v in ft_metrics.items():
    if k.startswith("eval_"):
        print(f"{k}: {v}")

print("\n=== Difference (fine-tuned - baseline) ===")
for k in ft_metrics:
    if k.startswith("eval_") and isinstance(ft_metrics[k], (int, float)):
        b = base_metrics.get(k, None)
        if isinstance(b, (int, float)):
            print(f"{k}: {ft_metrics[k] - b:+.4f}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



=== Baseline (pretrained) ===
eval_loss: 0.7067374587059021
eval_model_preparation_time: 0.0073
eval_f1_micro: 0.0923913686077446
eval_f1_macro: 0.06380048823264672
eval_exact_match: 0.0
eval_runtime: 7.2064
eval_samples_per_second: 752.937
eval_steps_per_second: 23.59

=== Fine-tuned ===
eval_loss: 0.09041234850883484
eval_model_preparation_time: 0.0083
eval_f1_micro: 0.579297079909075
eval_f1_macro: 0.46126483391374357
eval_exact_match: 0.4668263914485809
eval_runtime: 7.1416
eval_samples_per_second: 759.774
eval_steps_per_second: 23.804

=== Difference (fine-tuned - baseline) ===
eval_loss: -0.6163
eval_model_preparation_time: +0.0010
eval_f1_micro: +0.4869
eval_f1_macro: +0.3975
eval_exact_match: +0.4668
eval_runtime: -0.0648
eval_samples_per_second: +6.8370
eval_steps_per_second: +0.2140
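
The baseline loss has a simple interpretation. With a freshly initialized head the logits $z$ are close to zero, so every label probability is close to $\sigma(0) = 0.5$. The per-label binary cross-entropy then reduces to a constant:

$$
\ell(z, y) = -\bigl[\, y \log \sigma(z) + (1 - y) \log\bigl(1 - \sigma(z)\bigr) \bigr]
\quad\Longrightarrow\quad
\ell(0, y) = -\log 0.5 = \log 2 \approx 0.693
$$

This holds for either target value $y \in \{0, 1\}$. The measured baseline `eval_loss` of 0.7067 is close to this value; the small gap is consistent with the random head producing logits that are near, but not exactly, zero. Probabilities hovering around the 0.5 threshold also explain the baseline's near-zero F1 and exact-match scores, since the predicted labels are then close to arbitrary.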
