# BERT: Contextual Word Embeddings

BERT (Bidirectional Encoder Representations from Transformers) revolutionized NLP by introducing **contextual embeddings** — the same word gets different vector representations depending on its surrounding context.

**Key differences from Word2Vec:**
- **Word2Vec**: Static embeddings — "bank" always has the same vector
- **BERT**: Contextual embeddings — "river bank" vs "bank account" produce different vectors for "bank"

**Key concepts:**
- **Transformer architecture**: Self-attention mechanism that considers all words simultaneously
- **Bidirectional**: Considers both left and right context (unlike GPT which is left-to-right only)
- **WordPiece tokenization**: Subword units handle unknown words gracefully
- **[CLS] token**: Special token prepended to every input; its final hidden state is used as an aggregate representation of the sequence for classification tasks
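
To make "contextual" concrete, here is a minimal sketch of a single self-attention step in plain Python. It is not BERT: there are no learned query/key/value projections, no multiple heads, and the 2-dimensional vectors in `STATIC` are invented for illustration. It only shows how mixing in neighbouring tokens turns one static vector for "bank" into different vectors in different sentences.

```python
import math

# Hypothetical 2-d static vectors, made up purely for illustration.
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """One unparameterized self-attention step.

    Each output is a softmax-weighted average of ALL input vectors,
    so every position sees both its left and right context.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]
        )
    return outputs


def contextual(tokens, target):
    """Look up static vectors, apply attention, return the target's vector."""
    vectors = [STATIC[t] for t in tokens]
    return self_attention(vectors)[tokens.index(target)]


bank_in_river = contextual(["river", "bank"], "bank")
bank_in_money = contextual(["money", "bank"], "bank")

print("static bank:      ", STATIC["bank"])
print("bank near 'river':", bank_in_river)
print("bank near 'money':", bank_in_money)
```

The static lookup returns the same vector every time, while the attention output for "bank" is pulled toward whichever neighbour it appears with. Real BERT stacks many such layers with learned weights.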

---

## 1. Setup & Installation

In [None]:
!pip install transformers torch sentence-transformers scipy scikit-learn matplotlib -q

In [None]:
import torch
from transformers import BertTokenizer, BertModel
from scipy.spatial.distance import cosine
import numpy as np

In [None]:
# Load pre-trained BERT model and tokenizer
# bert-base-uncased: 12 layers, 768 hidden size, 110M parameters
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # Set to evaluation mode (disables dropout)

print(f"Model config: {model.config.num_hidden_layers} layers, {model.config.hidden_size} hidden size")

## 2. BERT Tokenization (WordPiece)

BERT uses **WordPiece** tokenization:
- Common words stay intact: "the", "cat", "running"
- Rare/unknown words split into subwords: "embeddings" → "em", "##bed", "##ding", "##s"
- `##` prefix indicates a continuation of the previous token
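
The splitting rule can be sketched as a greedy longest-match-first search over a vocabulary. The function below is a simplified illustration, not the Hugging Face implementation, and `TOY_VOCAB` is a hypothetical vocabulary containing just enough pieces for the demo (the real `bert-base-uncased` vocabulary has roughly 30,000 entries).

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word.

    At each position, take the longest substring found in `vocab`.
    Pieces that do not start the word carry a '##' prefix. If no piece
    matches at some position, the whole word becomes `unk_token`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"the", "cat", "em", "##bed", "##ding", "##s", "##ing"}

for w in ["cat", "embeddings", "xyz"]:
    print(f"{w:12} -> {wordpiece_tokenize(w, TOY_VOCAB)}")
```

With this toy vocabulary, "cat" stays intact, "embeddings" is split into four pieces, and "xyz" falls back to `[UNK]` because none of its prefixes are in the vocabulary.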

Special tokens:
- `[CLS]`: Added at the start, its embedding represents the whole sequence
- `[SEP]`: Separates sentences (used in sentence-pair tasks)
- `[PAD]`: Padding for batch processing
- `[MASK]`: Used during pre-training (masked language modeling)
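
As a rough sketch of how these special tokens are arranged, the hypothetical helper below lays out a sentence pair by hand. `build_pair_input` is not a library function; in practice the tokenizer does this for you. It assumes the usual BERT layout: `[CLS] A [SEP] B [SEP]`, segment id 0 for the first sentence and 1 for the second, and an attention mask of 0 over padding.

```python
def build_pair_input(tokens_a, tokens_b, max_len):
    """Arrange two token lists as [CLS] A [SEP] B [SEP] plus padding.

    Returns (tokens, segment_ids, attention_mask), all of length max_len.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len; truncate first")

    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)

    # Pad to a fixed length; padded positions are masked out with 0.
    pad = max_len - len(tokens)
    tokens = tokens + ["[PAD]"] * pad
    segment_ids = segment_ids + [0] * pad
    attention_mask = attention_mask + [0] * pad
    return tokens, segment_ids, attention_mask


tokens, segment_ids, attention_mask = build_pair_input(
    ["how", "are", "you"], ["fine", "thanks"], max_len=10
)
for tok, seg, mask in zip(tokens, segment_ids, attention_mask):
    print(f"{tok:8} segment={seg} mask={mask}")
```

The attention mask is what lets a batch mix sentences of different lengths: the model ignores positions where the mask is 0.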

In [None]:
# Basic tokenization example
text = "The cat sat on the mat."

# Tokenize into subwords
tokens = tokenizer.tokenize(text)
print(f"Original: {text}")
print(f"Tokens: {tokens}")

# Convert to token IDs (what the model actually sees)
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")

# Decode back to text
decoded = tokenizer.decode(token_ids)
print(f"Decoded: {decoded}")

In [None]:
# Subword splitting for complex/rare words
examples = [
    "embeddings",           # Technical term
    "unbelievable",         # Long word with prefixes
    "transformers",         # Domain-specific
    "antidisestablishment", # Very long word
    "ChatGPT",              # Modern term (not in BERT's vocab)
]

for word in examples:
    tokens = tokenizer.tokenize(word)
    print(f"{word:25} → {tokens}")

In [None]:
# Full tokenization with special tokens
text = "Hello, how are you?"

# Using the tokenizer to get model inputs
inputs = tokenizer(text, return_tensors='pt')

print("Input IDs:", inputs['input_ids'])
print("Attention mask:", inputs['attention_mask'])
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))

# Note: [CLS] at start (101), [SEP] at end (102)

## 3. Extracting Word Embeddings

BERT outputs a 768-dimensional vector for each token. The key insight: **the same word gets different embeddings based on context**.

The output shape is `[batch_size, sequence_length, hidden_size]` where:
- `batch_size`: Number of sentences (1 for single sentence)
- `sequence_length`: Number of tokens including [CLS] and [SEP]
- `hidden_size`: 768 for bert-base

In [None]:
def get_embeddings(text):
    """Get BERT embeddings for all tokens in text."""
    inputs = tokenizer(text, return_tensors='pt')
    
    with torch.no_grad():  # No gradient computation needed for inference
        outputs = model(**inputs)
    
    # last_hidden_state: [batch, seq_len, 768]
    embeddings = outputs.last_hidden_state
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    
    return tokens, embeddings[0]  # Return tokens and embeddings for first (only) batch item


def get_word_embedding(text, target_word):
    """Get the embedding for the first occurrence of a word in the text.

    Words that WordPiece splits into several tokens are represented by
    the mean of their subword embeddings.
    """
    tokens, embeddings = get_embeddings(text)

    # Tokenize the target the same way the sentence was tokenized
    # (lowercasing and subword splitting included)
    target_tokens = tokenizer.tokenize(target_word)
    n = len(target_tokens)
    if n == 0:
        return None

    # Find the first position where the full subword sequence matches
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target_tokens:
            return embeddings[i:i + n].mean(dim=0)

    return None

In [None]:
# Get embeddings for a sentence
text = "The quick brown fox jumps over the lazy dog."
tokens, embeddings = get_embeddings(text)

print(f"Tokens: {tokens}")
print(f"Embedding shape: {embeddings.shape}")  # [seq_len, 768]
print(f"\nFirst token '[CLS]' embedding (first 10 dims):")
print(embeddings[0][:10])

### Contextual Embeddings Demo

The magic of BERT: **same word, different contexts → different embeddings**

Let's compare "bank" in:
1. "I deposited money at the bank" (financial institution)
2. "I sat by the river bank" (edge of water)

In [None]:
# Two sentences with "bank" in different contexts
sent1 = "I deposited money at the bank"
sent2 = "I sat by the river bank"

# Get embeddings for "bank" in each context
bank_financial = get_word_embedding(sent1, "bank")
bank_river = get_word_embedding(sent2, "bank")

# Compute cosine similarity between the two "bank" embeddings
# cosine() returns distance (0 = identical), so similarity = 1 - distance
similarity = 1 - cosine(bank_financial.numpy(), bank_river.numpy())

print(f"Sentence 1: '{sent1}'")
print(f"Sentence 2: '{sent2}'")
print(f"\nCosine similarity between 'bank' embeddings: {similarity:.4f}")
print("\n→ Different contexts produce different embeddings for the same word!")
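
The `1 - cosine(...)` conversion used above can be spelled out in plain Python. This is a minimal sketch of the standard cosine formula, not SciPy's implementation, and the example vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance form of the same quantity: 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Similarity depends only on direction, not vector length, which is why it is a common choice for comparing embeddings whose norms vary from token to token.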

In [None]:
# More examples of contextual embeddings
word_contexts = [
    ("apple", "I ate a delicious apple", "Apple released a new iPhone"),
    ("bat", "The bat flew out of the cave", "He swung the bat and hit a home run"),
    ("cell", "The prisoner was locked in a cell", "The cell divides during mitosis"),
    ("python", "A python is a large snake", "I wrote the code in Python"),
]

print("Cosine similarity for same word in different contexts:\n")
for word, sent1, sent2 in word_contexts:
    emb1 = get_word_embedding(sent1, word)
    emb2 = get_word_embedding(sent2, word)
    
    if emb1 is not None and emb2 is not None:
        sim = 1 - cosine(emb1.numpy(), emb2.numpy())
        print(f"'{word}': {sim:.4f}")
        print(f"  • {sent1}")
        print(f"  • {sent2}\n")

## 4. Sentence Embeddings

For many tasks, we need a single vector representing the entire sentence. Two common approaches:

1. **[CLS] token embedding**: The first token is designed to aggregate sequence information
2. **Mean pooling**: Average all token embeddings (often works better than [CLS] when the model has not been fine-tuned)

For production use, **Sentence-Transformers** provides models fine-tuned specifically for sentence similarity.
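
The two pooling strategies can be sketched on toy data without loading a model. The token vectors below are invented, and `mean_pool` and `cls_pool` are hypothetical helper names. The sketch also shows why padded batches need the attention mask: without it, padding vectors would be averaged in.

```python
def cls_pool(token_vectors):
    """Use the vector at position 0 (the [CLS] slot) as the sentence vector."""
    return token_vectors[0]


def mean_pool(token_vectors, attention_mask):
    """Average only the vectors whose attention mask is 1 (non-padding)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask has no active positions")
    dim = len(kept[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]


# Made-up 2-d "token embeddings"; the last position is padding.
token_vectors = [[1.0, 1.0], [2.0, 4.0], [4.0, 2.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

cls_vector = cls_pool(token_vectors)
mean_vector = mean_pool(token_vectors, attention_mask)

print("CLS pooling: ", cls_vector)
print("Mean pooling:", mean_vector)
```

Whether to include the `[CLS]` and `[SEP]` positions in the average is a design choice; this sketch keeps every unmasked position.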

In [None]:
def get_sentence_embedding_cls(text):
    """Get sentence embedding using [CLS] token."""
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] is the first token (index 0)
    return outputs.last_hidden_state[0, 0, :]


def get_sentence_embedding_mean(text):
    """Get sentence embedding using mean pooling."""
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Average token embeddings, skipping [CLS] (first) and [SEP] (last).
    # This slicing assumes a single, unpadded sentence.
    embeddings = outputs.last_hidden_state[0, 1:-1, :]
    return embeddings.mean(dim=0)

In [None]:
# Compare the two methods
sentences = [
    "The weather is beautiful today.",
    "It's a lovely sunny day.",
    "I need to buy groceries.",
]

print("Sentence similarity using [CLS] vs Mean pooling:\n")

# CLS method
cls_embs = [get_sentence_embedding_cls(s) for s in sentences]
print("[CLS] token method:")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        sim = 1 - cosine(cls_embs[i].numpy(), cls_embs[j].numpy())
        print(f"  '{sentences[i][:30]}...' <-> '{sentences[j][:30]}...': {sim:.4f}")

# Mean pooling method
print("\nMean pooling method:")
mean_embs = [get_sentence_embedding_mean(s) for s in sentences]
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        sim = 1 - cosine(mean_embs[i].numpy(), mean_embs[j].numpy())
        print(f"  '{sentences[i][:30]}...' <-> '{sentences[j][:30]}...': {sim:.4f}")

### Sentence-Transformers (Recommended)

For sentence similarity tasks, use **Sentence-Transformers**: Transformer encoders (BERT and related models) fine-tuned specifically for semantic similarity. They typically give much better similarity scores than raw BERT.

In [None]:
from sentence_transformers import SentenceTransformer

# Load a sentence-transformer model (all-MiniLM-L6-v2 is fast and good)
st_model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode sentences - super simple API!
sentences = [
    "The weather is beautiful today.",
    "It's a lovely sunny day.",
    "I need to buy groceries.",
]

embeddings = st_model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")  # [num_sentences, 384]

In [None]:
# Compute similarities using sentence-transformers
from sentence_transformers import util

# Compute cosine similarity matrix
cos_sim = util.cos_sim(embeddings, embeddings)

print("Sentence similarity matrix (Sentence-Transformers):\n")
for i, sent in enumerate(sentences):
    print(f"{i}: {sent[:40]}")

print(f"\n{cos_sim}")

## 5. Semantic Search

A practical application: find the documents most similar to a query. Unlike keyword search (e.g., TF-IDF), which matches surface forms, semantic search compares meaning, so a query for "automobile" can match a document about a "car".
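
To see the limitation of purely lexical matching, here is a small sketch of a keyword-overlap score. It uses Jaccard overlap of word sets as a crude stand-in for keyword search (it is not TF-IDF), and the documents and the `keyword_overlap` helper are made up for illustration.

```python
import re


def words(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_overlap(query, document):
    """Jaccard overlap between the word sets of query and document."""
    q, d = words(query), words(document)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


docs = [
    "The car needs new tires.",
    "Fresh bread is sold at the bakery.",
]

for query in ("car repair", "automobile repair"):
    scores = [round(keyword_overlap(query, doc), 3) for doc in docs]
    print(f"{query!r}: {scores}")
```

The query "car repair" shares a word with the first document, but "automobile repair" shares none, so a lexical score cannot tell the two documents apart. An embedding-based search can still rank the car document higher because "automobile" and "car" map to nearby vectors.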

In [None]:
# Document corpus
documents = [
    "The cat sits on the windowsill watching birds.",
    "Machine learning models require large datasets.",
    "Python is a popular programming language.",
    "The dog runs in the park chasing squirrels.",
    "Deep learning uses neural networks with many layers.",
    "JavaScript is used for web development.",
    "The kitten plays with a ball of yarn.",
    "Natural language processing analyzes text data.",
]

# Encode all documents
doc_embeddings = st_model.encode(documents)

def semantic_search(query, top_k=3):
    """Find most similar documents to query."""
    query_embedding = st_model.encode([query])
    similarities = util.cos_sim(query_embedding, doc_embeddings)[0]
    
    # Get top-k indices as plain Python ints
    top_indices = similarities.argsort(descending=True)[:top_k].tolist()

    print(f"Query: '{query}'\n")
    print("Top matches:")
    for rank, idx in enumerate(top_indices, start=1):
        print(f"  {rank}. [{similarities[idx].item():.4f}] {documents[idx]}")

In [None]:
# Try different queries - notice semantic understanding!
semantic_search("feline animals")  # Should match cat/kitten docs

In [None]:
semantic_search("artificial intelligence")  # Should match ML/DL docs

In [None]:
semantic_search("coding languages")  # Should match Python/JavaScript docs

---

## Summary: Word2Vec vs BERT

| Aspect | Word2Vec | BERT |
|--------|----------|------|
| **Embedding type** | Static (one vector per word) | Contextual (different vectors based on context) |
| **Architecture** | Shallow neural network | Deep Transformer (12+ layers) |
| **Training objective** | Predict context/target words | Masked language modeling + next sentence |
| **Handles polysemy?** | No — "bank" is always same vector | Yes — "river bank" ≠ "bank account" |
| **Unknown words** | Out-of-vocabulary (OOV) problem | WordPiece splits unseen words into known subwords |
| **Dimensions** | Typically 100-300 | 768 (base) or 1024 (large) |
| **Speed** | Very fast | Slower (more computation) |
| **Pre-training data** | Any text corpus (the released vectors used Google News) | BooksCorpus + English Wikipedia |
| **Best for** | Simple similarity, limited compute | Complex NLU, semantic understanding |

**The embedding evolution:**
1. **TF-IDF/BoW**: Sparse, word-independent
2. **Word2Vec**: Dense, but static
3. **BERT**: Dense and contextual — the current standard

**Next steps:**
- Try different BERT variants (RoBERTa, DistilBERT, ALBERT)
- Fine-tune BERT for specific tasks (classification, NER, QA)
- Explore other model families (GPT, T5, LLaMA)

## Bonus: Visualizing Embeddings

Reduce the 384-dimensional Sentence-Transformers embeddings to 2D to see how sentences cluster by meaning.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Sentences grouped by topic
sentences = [
    # Animals
    "The cat sleeps on the couch.",
    "A dog plays in the yard.",
    "The kitten chases a mouse.",
    # Programming
    "Python is great for data science.",
    "JavaScript powers the web.",
    "Rust is a systems language.",
    # Weather
    "It's raining outside today.",
    "The sun is shining brightly.",
    "Snow covers the ground.",
]

labels = ["animal"] * 3 + ["programming"] * 3 + ["weather"] * 3
colors = {"animal": "red", "programming": "blue", "weather": "green"}

# Get embeddings
embeddings = st_model.encode(sentences)

# Reduce to 2D with PCA
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

# Plot
plt.figure(figsize=(10, 6))
for i, (x, y) in enumerate(reduced):
    plt.scatter(x, y, c=colors[labels[i]], s=100)
    plt.annotate(sentences[i][:25] + "...", (x, y), fontsize=8, alpha=0.7)

plt.title("Sentence-Transformers Embeddings (PCA)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# Legend
for label, color in colors.items():
    plt.scatter([], [], c=color, label=label)
plt.legend()
plt.tight_layout()
plt.show()