Import the required libraries: PyTorch, Hugging Face `transformers` (for pretrained BERT and tokenizer), Hugging Face `datasets` (for loading datasets), and pandas for data handling.


In [46]:
import torch
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset
import pandas as pd

Set the model checkpoint to `bert-base-uncased` and select the device (GPU if available, otherwise CPU).


In [47]:
model_ckpt = "bert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Load the pretrained BERT model and tokenizer from Hugging Face, then define a `tokenize` function that tokenizes a batch of text samples with padding and truncation. The `tolist()` branch handles batches whose `text` column arrives as a pandas Series instead of a plain list.


In [48]:
model = AutoModel.from_pretrained(model_ckpt).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
    texts = batch["text"].tolist() if hasattr(batch["text"], "tolist") else batch["text"]
    return tokenizer(texts, padding=True, truncation=True)

Define the `extract_hidden_states` function. It moves the model inputs to the selected device, runs them through BERT without gradient tracking, takes the last hidden states, and returns the hidden vector of the `[CLS]` token (commonly used as a sentence embedding) as a NumPy array.


In [49]:
def extract_hidden_states(batch):
    # Move only the inputs the model accepts onto the selected device (GPU or CPU)
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}
    # Run BERT without gradient tracking and take the last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Return the vector at position 0, the [CLS] token, as a NumPy array
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}
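The two indexing steps in `extract_hidden_states` can be illustrated without a model. The following is a minimal pure-Python sketch, not the notebook's code: `MODEL_INPUT_NAMES` stands in for `tokenizer.model_input_names` (the values listed are an assumption for a BERT tokenizer), and the token IDs and hidden states are made-up nested lists rather than real tensors.

```python
# Stand-in for tokenizer.model_input_names (assumed values for a BERT tokenizer)
MODEL_INPUT_NAMES = ["input_ids", "token_type_ids", "attention_mask"]


def select_model_inputs(batch):
    """Keep only the keys that the model's forward pass accepts."""
    return {k: v for k, v in batch.items() if k in MODEL_INPUT_NAMES}


def first_token_vectors(last_hidden_state):
    """Nested-list equivalent of last_hidden_state[:, 0]."""
    return [sequence[0] for sequence in last_hidden_state]


# Arbitrary illustrative batch; "label" is not a model input and is filtered out
batch = {
    "input_ids": [[101, 7592, 102], [101, 2088, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "label": [0, 1],
}
inputs = select_model_inputs(batch)

# Made-up hidden states with shape (batch=2, seq_len=3, hidden=2)
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]
cls_vectors = first_token_vectors(hidden)

print(sorted(inputs))  # ['attention_mask', 'input_ids']
print(cls_vectors)     # [[0.1, 0.2], [0.7, 0.8]]
```

Each sample contributes exactly one vector, the one at sequence position 0, so the result has one row per sample regardless of sequence length.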

Load the Hugging Face `emotion` dataset, which contains text samples labeled with different emotions such as joy, sadness, anger, etc.


In [50]:
emotions = load_dataset("emotion")

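Switch the output format of the `emotion` dataset to pandas, convert the training split into a DataFrame, and display the first few entries. The pandas format stays active for every split until the format is changed again.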

In [51]:
emotions.set_format(type="pandas")
df = emotions["train"][:]
df.head()

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |


Add a human-readable label column to the DataFrame. The helper calls the `int2str` method of the dataset's `label` feature to convert integer class IDs into string labels (e.g., 0 → sadness).


In [52]:
def label_int2str(row):
  return emotions["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head()

| | text | label | label_name |
|---|---|---|---|
| 0 | i didnt feel humiliated | 0 | sadness |
| 1 | i can go from feeling so hopeless to so damned... | 0 | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | love |
| 4 | i am feeling grouchy | 3 | anger |
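Conceptually, `int2str` is a lookup into an ordered list of class names. The sketch below is a hypothetical stand-in, not the `datasets` implementation. The names for IDs 0, 2, and 3 match the output above; the names for IDs 1, 4, and 5 do not appear in this notebook and are an assumption about the dataset's label order.

```python
# Assumed label order; only IDs 0, 2 and 3 are confirmed by the notebook output
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def int2str(label_id):
    """Map an integer class ID to its name, rejecting out-of-range IDs."""
    if not 0 <= label_id < len(LABEL_NAMES):
        raise ValueError(f"Unknown label id: {label_id}")
    return LABEL_NAMES[label_id]


def str2int(name):
    """Inverse lookup: map a class name back to its integer ID."""
    return LABEL_NAMES.index(name)


print([int2str(i) for i in (0, 3, 2)])  # ['sadness', 'anger', 'love']
print(str2int("love"))                  # 2
```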


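Tokenize the first two training samples to inspect the tokenizer output. Because the dataset is still in pandas format, the slice is a DataFrame, which is why `tokenize` calls `tolist()` on the `text` column. The shorter sample is padded with zeros, and its `attention_mask` marks the padded positions with 0.


In [53]: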
print(tokenize(emotions["train"][:2]))

{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
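The padding visible in the output above can be reproduced by hand. This is a minimal sketch of padding to the longest sequence in a batch, not the tokenizer's actual implementation; the helper name `pad_batch` is hypothetical, the token IDs are copied from the output above, and truncation is not covered. It assumes the padding ID is 0, as the output suggests.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token IDs taken from the notebook output (101 = [CLS], 102 = [SEP])
short_ids = [101, 1045, 2134, 2102, 2514, 26608, 102]
long_ids = [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636,
            17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003,
            8300, 102]

padded = pad_batch([short_ids, long_ids])
print(len(padded["input_ids"][0]))       # 23
print(sum(padded["attention_mask"][0]))  # 7
```

The first sample keeps its 7 real tokens and gains 16 padding positions, matching the 23-token length of the second sample.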


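Tokenize every split of the `emotion` dataset with the `map` function. With `batched=True` and `batch_size=None`, each split is passed to `tokenize` as a single batch, so all samples within a split are padded to the same length. The result contains the tokenized inputs `input_ids`, `token_type_ids`, and `attention_mask` for all samples.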


In [54]:
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
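The choice of `batch_size=None` matters because `padding=True` pads to the longest sequence in each batch. The sketch below uses a hypothetical helper, `padded_lengths`, to show how the padded length of each sample depends on how the data is split into batches. The token counts 7 and 23 come from the two samples printed earlier; the other two are made up.

```python
def padded_lengths(token_counts, batch_size=None):
    """Padded length of each sample when padding to the longest per batch.

    batch_size=None treats the whole list as a single batch.
    """
    if not token_counts:
        return []
    if batch_size is None:
        batch_size = len(token_counts)
    result = []
    for start in range(0, len(token_counts), batch_size):
        chunk = token_counts[start:start + batch_size]
        result.extend([max(chunk)] * len(chunk))
    return result


token_counts = [7, 23, 12, 9]
print(padded_lengths(token_counts, batch_size=2))     # [23, 23, 12, 12]
print(padded_lengths(token_counts, batch_size=None))  # [23, 23, 23, 23]
```

With smaller batches, samples from different batches can end up with different lengths, which complicates stacking them into one tensor later. Padding each split as a single batch avoids that at the cost of more padding.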

Display the column names of the tokenized dataset to verify that tokenization was successful.


In [55]:
print(emotions_encoded["train"].column_names)

['input_ids', 'token_type_ids', 'attention_mask']


Reformat the tokenized dataset into PyTorch tensors, keeping only the columns `input_ids` and `attention_mask` for model input.


In [56]:
emotions_encoded.set_format("torch", columns=["input_ids", "attention_mask"])

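Apply the `extract_hidden_states` function to the tokenized dataset. This computes BERT embeddings (from the `[CLS]` token) for all text samples and stores them in a new column, `hidden_state`. Since no `batch_size` is given, `map` processes the data in batches of its default size (1000 samples).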


In [57]:
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)

Display the updated column names of the dataset to confirm that the new `hidden_state` column has been added.


In [58]:
emotions_hidden["train"].column_names

['input_ids', 'token_type_ids', 'attention_mask', 'hidden_state']

Define a helper function `get_embedding` that takes a single word or sentence, tokenizes it, runs it through BERT, and returns the `[CLS]` embedding as a PyTorch tensor (moved back to CPU).


In [59]:
def get_embedding(text):
    # Tokenize a single word or sentence into PyTorch tensors on the selected device
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Return the [CLS] vector (position 0), moved back to the CPU for the similarity computation
    return outputs.last_hidden_state[:, 0, :].squeeze().cpu()
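A common alternative to the `[CLS]` vector is mean pooling over the token vectors. The notebook returns the `[CLS]` vector, so the following is only a sketch of the alternative, written in pure Python on made-up nested lists. The helper names are hypothetical. The attention mask is used so that padded positions do not dilute the average.

```python
def cls_pool(token_vectors):
    """Use the vector at position 0 as the sentence embedding."""
    return list(token_vectors[0])


def masked_mean_pool(token_vectors, attention_mask):
    """Average only the vectors whose attention mask is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Made-up hidden states for one sequence: two real tokens and one padded position
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0]]
mask = [1, 1, 0]

print(cls_pool(tokens))                # [1.0, 0.0]
print(masked_mean_pool(tokens, mask))  # [2.0, 1.0]
```

For a single unpadded input, as in `get_embedding`, every mask entry is 1 and the masked mean equals the plain mean over the token dimension.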


Define a `cosine_similarity` function to compute similarity between two embedding vectors using PyTorch’s cosine similarity.


In [60]:
def cosine_similarity(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
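For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, which gives a value between -1 and 1. The sketch below spells this out in pure Python. The function name `cosine_similarity_py` is hypothetical, and unlike the PyTorch function, which guards against zero-length vectors with a small epsilon, this version raises an error.

```python
import math


def cosine_similarity_py(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity_py([3.0, 4.0], [6.0, 8.0]))   # 1.0  (same direction)
print(cosine_similarity_py([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity_py([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

Because the result depends only on direction, scaling either vector leaves the similarity unchanged.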

Compute embeddings for three example texts (`right`, `wrong`, `earth`) and compare them pairwise with cosine similarity.

In [61]:
right = get_embedding(
    "After carefully checking all the calculations, the professor confirmed that the student's solution to the physics problem was correct and demonstrated a deep understanding of the subject."
)

wrong = get_embedding(
    "Despite putting in a lot of effort, the student submitted a solution to the physics assignment that contained multiple errors, and the professor explained why the reasoning was incorrect."
)

earth = get_embedding(
    "The planet Earth, the third celestial body from the Sun, sustains life through its diverse ecosystems, atmosphere rich in oxygen, and vast resources such as water and fertile soil."
)

print("right vs wrong:", cosine_similarity(right, wrong))
print("right vs earth:", cosine_similarity(right, earth))
print("wrong vs earth:", cosine_similarity(wrong, earth))

right vs wrong: 0.9405587911605835
right vs earth: 0.7458899617195129
wrong vs earth: 0.7377017140388489
