# Introduction to BERT

## Overview

In this tutorial, we'll explore the basics of BERT (Bidirectional Encoder Representations from Transformers), learn how to tokenize text, use BERT for masked language modeling, fine-tune it on a classification task (both using Trainer and a manual loop), and run simple inference. At the end, there is a quiz and a discussion section

---

## 1. Objectives
- Understand BERT's architecture and use cases
- Load and run a pretrained BERT model with Hugging Face's Transformers
- Tokenize input text for BERT
- Perform masked language modeling
- Fine-tune BERT on a downstream classification task using Trainer
- Fine-tune BERT manually with a custom training loop
- Run inference with the fine-tuned model
- Understand how contextual embeddings differ from previous approaches

## 2. Prerequisites
- Python 3.7+
- `pip install transformers torch datasets`
- Basic familiarity with Python and Jupyter notebooks

## 3. Setup

In [1]:
# !pip install transformers torch datasets
# !pip install 'accelerate>=0.26.0'

In [2]:
# Set fixed random seed for reproducibility
import random
import numpy as np
import torch

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)

# Device configuration
if torch.backends.mps.is_available():
    # Check if MPS is available and enabled
    if torch.backends.mps.is_built():
        device = torch.device("mps")
        print("Using MPS device.")
    else:
        device = torch.device("cpu")
        print("MPS is available but not built. Falling back to CPU.")
else:
    device = torch.device("cpu")
    print("MPS is not available. Falling back to CPU.")

if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory allocated: {torch.cuda.memory_allocated(0)/1024**3:.1f} GB")
    print(f"Memory cached: {torch.cuda.memory_reserved(0)/1024**3:.1f} GB")


Using MPS device.



## 4. Loading BERT and Tokenizer

In [3]:
from transformers import BertTokenizer, BertForMaskedLM, BertForSequenceClassification

# Base masked LM model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mlm_model = BertForMaskedLM.from_pretrained('bert-base-uncased')
mlm_model.to(device)  # Move to GPU
mlm_model.eval()  # set to evaluation mode

# Sequence classification model (will fine-tune)
num_labels = 2  # e.g., binary sentiment
clf_model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=num_labels
)
clf_model.to(device)  # Move to GPU


  from .autonotebook import tqdm as notebook_tqdm
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it f

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

## 5. Tokenization Example

In [4]:
text = "Hello, BERT!"
# Convert to token IDs
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
print("Tokens:", tokens)
print("Token IDs:", ids)


Tokens: ['hello', ',', 'bert', '!']
Token IDs: [7592, 1010, 14324, 999]



## 6. Masked Language Modeling Example

In [5]:
import torch
# Prepare text with a mask token
text = "The capital of France is [MASK]."
input_ids = tokenizer.encode(text, return_tensors='pt').to(device)  # Move to GPU
# Run model
with torch.no_grad():
    outputs = mlm_model(input_ids)
    logits = outputs.logits
# Locate mask position
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
# Predict token
mask_logits = logits[0, mask_token_index, :]
predicted_token_id = torch.argmax(mask_logits, dim=-1)
predicted_token = tokenizer.convert_ids_to_tokens(predicted_token_id.cpu())  # Move back to CPU for tokenizer
print(f"Predicted token for [MASK]: {predicted_token}")


Predicted token for [MASK]: ['paris']


  return forward_call(*args, **kwargs)


## 7. Fine-tuning BERT for Text Classification (Trainer API, not Recommended)

In [6]:
from datasets import load_dataset
from transformers import TrainingArguments, Trainer

# Load a sample dataset (IMDb)
dataset = load_dataset('imdb', split={'train':'train[:2000]', 'test':'test[:500]'})

# Tokenization function
def tokenize_fn(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

dataset = dataset.map(tokenize_fn, batched=True)
dataset.set_format(type='torch', columns=['input_ids','attention_mask','label'])

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=8,
    logging_steps=100,
      report_to="none"
)

trainer = Trainer(
    model=clf_model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test']
)

trainer.train()
metrics = trainer.evaluate()   # run a final evaluation pass
print(metrics)


  return forward_call(*args, **kwargs)


Step,Training Loss
100,0.026
200,0.0001


  return forward_call(*args, **kwargs)


{'eval_loss': 7.190158794401214e-05, 'eval_runtime': 42.0804, 'eval_samples_per_second': 11.882, 'eval_steps_per_second': 1.497, 'epoch': 1.0}


## 8. Fine-tuning BERT Manually (Custom Training Loop, Recommended)

In [7]:
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm import tqdm

# Load a sample dataset (IMDb)
dataset = load_dataset('imdb', split={'train':'train[:2000]', 'test':'test[:500]'})

# Tokenization function
def tokenize_fn(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

dataset = dataset.map(tokenize_fn, batched=True)
dataset.set_format(type='torch', columns=['input_ids','attention_mask','label'])
# DataLoaders
dataset.set_format(type='torch', columns=['input_ids','attention_mask','label'])
train_loader = DataLoader(dataset['train'], batch_size=8, shuffle=True)
eval_loader = DataLoader(dataset['test'], batch_size=8)

# Optimizer
optimizer = AdamW(clf_model.parameters(), lr=5e-5)
num_labels = 2  # e.g., binary sentiment
clf_model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=num_labels
)
clf_model.to(device)  # Move to GPU
# Training loop
clf_model.train()
for epoch in range(1):
    for batch in tqdm(train_loader, desc="Training"):
        inputs = {
            'input_ids': batch['input_ids'].to(clf_model.device),
            'attention_mask': batch['attention_mask'].to(clf_model.device)
        }
        labels = batch['label'].to(clf_model.device)
        outputs = clf_model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"Epoch {epoch+1} completed. Loss: {loss.item():.4f}")

# Evaluation with Loss and Accuracy
clf_model.eval()
correct = 0
total = 0
eval_losses = []
with torch.no_grad():
    for batch in tqdm(eval_loader, desc="Evaluating"):
        inputs = {
            'input_ids': batch['input_ids'].to(clf_model.device),
            'attention_mask': batch['attention_mask'].to(clf_model.device)
        }
        labels = batch['label'].to(clf_model.device)
        outputs = clf_model(**inputs, labels=labels)
        loss = outputs.loss
        logits = outputs.logits
        eval_losses.append(loss.item())
        preds = torch.argmax(logits, dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

avg_loss = sum(eval_losses) / len(eval_losses)
accuracy = correct / total
print(f"Evaluation Loss: {avg_loss:.4f}")
print(f"Accuracy: {accuracy:.4f}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training: 100%|██████████| 250/250 [11:41<00:00,  2.80s/it]


Epoch 1 completed. Loss: 0.7621


Evaluating: 100%|██████████| 63/63 [00:43<00:00,  1.46it/s]

Evaluation Loss: 0.7310
Accuracy: 0.1800






## 9. Inference with Fine-tuned Model

In [8]:
# Sample texts
texts = [
    "I absolutely loved this movie!",
    "That was the worst film I've ever seen."
]
# Tokenize and move to GPU
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
inputs = {k: v.to(device) for k, v in inputs.items()}  # Move all tensors to GPU
# Predict
with torch.no_grad():
    logits = clf_model(**inputs).logits
predictions = torch.argmax(logits, dim=-1).cpu()  # Move back to CPU for processing
labels = ['negative', 'positive']
for text, pred in zip(texts, predictions):
    print(f"Text: {text}\nPrediction: {labels[pred]}\n")


Text: I absolutely loved this movie!
Prediction: negative

Text: That was the worst film I've ever seen.
Prediction: negative




# 🔍 Quiz: Contextualized Word Embeddings in Action

## 🧠 Objective
You will explore how BERT generates different vector representations for the same word depending on its surrounding context. This tests your understanding of contextualization, a major advantage of BERT over earlier static embeddings like Word2Vec.

## 🧪 Task
Use the `bert-base-uncased` model to extract the last hidden state vectors of the word "bank" in the following two sentences:

1. "He sat down by the bank to enjoy the view of the river."
2. "She went to the bank to deposit some cash."

Then:
- Extract the token embedding for the word "bank" from both sentences.
- Compute the cosine similarity between them.
- Discuss why the similarity is (likely) low, even though the surface word is the same.

## 📌 Your Code Here


In [None]:
from transformers import BertTokenizer, BertModel
import torch
import torch.nn.functional as F

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.to(device)  # Move to GPU
model.eval()

# Sentences with same word used in different contexts
sent1 = "He sat down by the bank to enjoy the view of the river."
sent2 = "She went to the bank to deposit some cash."

def get_word_embedding(sentence, target_word="bank"):
    # TODO: Implement this function
    # Hints:
    # 1. Tokenize the sentence and move to GPU
    # 2. Get model outputs
    # 3. Find the index of target_word in tokens
    # 4. Extract embedding from last_hidden_state
    # Remember to handle device placement!
    pass

# Get embeddings
emb1, tok1 = get_word_embedding(sent1)
emb2, tok2 = get_word_embedding(sent2)

# Compute similarity
similarity = F.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()

print("Sentence 1 Tokens:", tok1)
print("Sentence 2 Tokens:", tok2)
print("Cosine similarity between 'bank' embeddings:", similarity)


## Discussion
### ✅ Insight
Question: What value of cosine similarity between these two 'bank' do you observe. If they are equal, the cosine similarity should be 1, otherwise please explain your idea behind this observation.

### 💬 Bonus Question
Question: Why is it problematic if a model (like Word2Vec) gives similar embeddings for these two "bank" instances? When might that lead to incorrect model predictions?


## 📚 Complete Implementation (For Reference, Do not see before completing)

Below is the complete implementation of the `get_word_embedding` function:

In [None]:
# def get_word_embedding(sentence, target_word="bank"):
#     """
#     Extract word embedding for a target word from a sentence using BERT.

#     Args:
#         sentence (str): Input sentence containing the target word
#         target_word (str): The word to extract embedding for

#     Returns:
#         tuple: (embedding_tensor, token_list)
#     """
#     # Tokenize and move to GPU
#     tokens = tokenizer(sentence, return_tensors="pt")
#     tokens = {k: v.to(device) for k, v in tokens.items()}

#     with torch.no_grad():
#         outputs = model(**tokens)

#     input_ids = tokens["input_ids"][0].cpu()  # Move back to CPU for tokenizer
#     token_strs = tokenizer.convert_ids_to_tokens(input_ids)

#     # Find index of the word "bank"
#     # BERT may tokenize it as 'bank' or '##bank', so we do an exact match
#     index = token_strs.index(target_word)
#     embedding = outputs.last_hidden_state[0, index].cpu()  # Move back to CPU for similarity computation
#     return embedding, token_strs
