# Part 5.3: Embeddings

Embeddings are the **bridge between human concepts and machine learning**. Every word you type into ChatGPT, every product recommendation you receive, every song Spotify suggests -- all of these rely on embeddings to represent meaning as numbers. The core idea is deceptively simple: represent things as dense vectors in a space where **similar things are close together**. This single idea has transformed NLP, recommendation systems, search engines, and much of modern AI.

In this notebook, you'll build embeddings from scratch, see how vector arithmetic can capture analogies like "king - man + woman ≈ queen," implement semantic search, and understand why embeddings are the foundation of virtually every modern AI system.

---

## Learning Objectives

By the end of this notebook, you should be able to:

- [ ] Explain why one-hot encoding fails and why dense embeddings are needed
- [ ] Describe the distributional hypothesis and how it motivates embedding methods
- [ ] Implement Word2Vec (skip-gram) from scratch in PyTorch
- [ ] Perform and visualize vector arithmetic on word embeddings
- [ ] Compare Word2Vec and GloVe approaches
- [ ] Explain the difference between static and contextual embeddings
- [ ] Compute cosine similarity, Euclidean distance, and dot product similarity
- [ ] Build a simple semantic search engine from scratch
- [ ] Explain the concept of RAG (Retrieval-Augmented Generation)
- [ ] Describe practical applications of embeddings across domains

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from collections import Counter, defaultdict

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
torch.manual_seed(42)
np.random.seed(42)

---

## 1. Why Embeddings?

### The Problem with One-Hot Encoding

Imagine you have a vocabulary of 50,000 words. The simplest way to represent each word as a number is **one-hot encoding**: give each word a vector of length 50,000 where exactly one position is 1 and the rest are 0.

This seems fine at first, but it has three devastating problems:

1. **Sparse and high-dimensional:** Each vector has 49,999 zeros and a single 1. This wastes enormous memory and computation.
2. **No notion of similarity:** The one-hot vectors for "cat" and "kitten" are just as different as "cat" and "refrigerator." Every word is equally distant from every other word.
3. **No generalization:** If a model learns something about "cat," that knowledge tells it absolutely nothing about "kitten."

**The key insight:** What if we could represent each word as a short, dense vector (say, 300 numbers) where words with similar meanings are close together in space? That is exactly what embeddings do.
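
To make these problems concrete, here is a minimal sketch using a hypothetical five-word vocabulary (the words, the 300-dimension figure, and the 4-byte float size are illustrative assumptions). It checks that one-hot vectors are all equally far apart and compares storage for a 50,000-word vocabulary.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration
toy_vocab = ["cat", "kitten", "dog", "car", "refrigerator"]
one_hot_toy = np.eye(len(toy_vocab))  # row i is the one-hot vector for word i

def one_hot_distance(word_a, word_b):
    """Euclidean distance between the one-hot vectors of two words."""
    a = one_hot_toy[toy_vocab.index(word_a)]
    b = one_hot_toy[toy_vocab.index(word_b)]
    return float(np.linalg.norm(a - b))

d_cat_kitten = one_hot_distance("cat", "kitten")
d_cat_fridge = one_hot_distance("cat", "refrigerator")
print(f"cat <-> kitten:       {d_cat_kitten:.4f}")
print(f"cat <-> refrigerator: {d_cat_fridge:.4f}")

# Storage for a 50,000-word vocabulary, assuming 4-byte floats
V, d = 50_000, 300
one_hot_bytes = V * V * 4   # V vectors of length V
dense_bytes = V * d * 4     # V vectors of length d
print(f"One-hot table: {one_hot_bytes / 1e9:.1f} GB")
print(f"Dense table:   {dense_bytes / 1e6:.1f} MB")
```

Both distances come out to √2 ≈ 1.414, and the one-hot table would need about 10 GB against 60 MB for the dense one.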

### Visualization: One-Hot vs Dense Embeddings

In [None]:
# Compare one-hot encoding vs dense embeddings
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# === Left: One-hot encoding ===
ax = axes[0]
words = ['cat', 'kitten', 'dog', 'puppy', 'car', 'truck']
vocab_size = len(words)

# Create one-hot matrix
one_hot = np.eye(vocab_size)
im = ax.imshow(one_hot, cmap='Blues', aspect='auto', vmin=0, vmax=1)
ax.set_xticks(range(vocab_size))
ax.set_xticklabels([f'dim {i}' for i in range(vocab_size)], fontsize=9)
ax.set_yticks(range(vocab_size))
ax.set_yticklabels(words, fontsize=11, fontweight='bold')
ax.set_title('One-Hot Encoding\n(Sparse, No Similarity)', fontsize=13, fontweight='bold')
ax.set_xlabel('Dimensions')

# Annotate values
for i in range(vocab_size):
    for j in range(vocab_size):
        color = 'white' if one_hot[i, j] > 0.5 else 'black'
        ax.text(j, i, f'{one_hot[i,j]:.0f}', ha='center', va='center', 
                color=color, fontsize=10)

# === Right: Dense embeddings ===
ax = axes[1]
# Simulate meaningful dense embeddings where similar words are close
embeddings = np.array([
    [0.8, 0.2, -0.5],   # cat
    [0.7, 0.3, -0.4],   # kitten (close to cat)
    [0.6, -0.3, -0.6],  # dog
    [0.5, -0.2, -0.5],  # puppy (close to dog)
    [-0.7, 0.1, 0.8],   # car
    [-0.6, -0.1, 0.9],  # truck (close to car)
])

im2 = ax.imshow(embeddings, cmap='RdBu_r', aspect='auto', vmin=-1, vmax=1)
ax.set_xticks(range(3))
ax.set_xticklabels(['dim 0', 'dim 1', 'dim 2'], fontsize=9)
ax.set_yticks(range(vocab_size))
ax.set_yticklabels(words, fontsize=11, fontweight='bold')
ax.set_title('Dense Embeddings\n(Compact, Similar = Close)', fontsize=13, fontweight='bold')
ax.set_xlabel('Dimensions')
plt.colorbar(im2, ax=ax, shrink=0.8)

# Annotate values
for i in range(vocab_size):
    for j in range(3):
        ax.text(j, i, f'{embeddings[i,j]:.1f}', ha='center', va='center', 
                fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

# Show similarity comparison
print("=== Cosine Similarities ===")
print("\nOne-hot encoding:")
for i, j in [(0,1), (0,2), (0,4)]:
    sim = np.dot(one_hot[i], one_hot[j]) / (np.linalg.norm(one_hot[i]) * np.linalg.norm(one_hot[j]))
    print(f"  {words[i]:8s} <-> {words[j]:8s}: {sim:.4f}")

print("\nDense embeddings:")
for i, j in [(0,1), (0,2), (0,4)]:
    sim = np.dot(embeddings[i], embeddings[j]) / (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j]))
    print(f"  {words[i]:8s} <-> {words[j]:8s}: {sim:.4f}")

### The Distributional Hypothesis

> *"You shall know a word by the company it keeps."* -- J.R. Firth (1957)

This is the foundational insight behind all embedding methods. Words that appear in similar contexts tend to have similar meanings:

- "The **cat** sat on the mat" / "The **kitten** sat on the mat"
- "I drove my **car** to work" / "I drove my **truck** to work"

If two words frequently appear in the same contexts (surrounded by the same neighboring words), they probably mean similar things. This is what embedding algorithms learn to capture: they convert co-occurrence patterns into geometric relationships.
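
As a rough illustration of the hypothesis, the sketch below collects the neighbors of each word in the four example sentences and measures how much two words' context sets overlap. The helper names `context_counts` and `jaccard` are made up for this example, and Jaccard overlap is just one simple way to compare contexts.

```python
from collections import Counter, defaultdict

toy_sentences = [
    "the cat sat on the mat",
    "the kitten sat on the mat",
    "i drove my car to work",
    "i drove my truck to work",
]

def context_counts(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts

def jaccard(counter_a, counter_b):
    """Overlap of two context vocabularies: |A & B| / |A | B|."""
    a, b = set(counter_a), set(counter_b)
    return len(a & b) / len(a | b)

ctx = context_counts(toy_sentences)
print("cat vs kitten:", jaccard(ctx["cat"], ctx["kitten"]))
print("car vs truck: ", jaccard(ctx["car"], ctx["truck"]))
print("cat vs car:   ", jaccard(ctx["cat"], ctx["car"]))
```

In this tiny corpus "cat" and "kitten" share every context word, while "cat" and "car" share none. Embedding algorithms turn this kind of overlap into geometric closeness.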

### Deep Dive: Why Embeddings Matter

| Problem with One-Hot | How Embeddings Fix It |
|---|---|
| Vectors are huge (vocab_size dimensions) | Vectors are small (50-1000 dimensions) |
| All words are equidistant | Similar words are nearby |
| No generalization between words | Knowledge transfers between similar words |
| Storing V one-hot vectors densely takes O(V^2) memory | Storing V embeddings takes O(V * d) memory, where d << V |
| Cannot capture relationships | Vector arithmetic captures analogies |

#### Key Insight

Embeddings transform **discrete symbols** (words, products, users) into **continuous vectors** in a learned space. This is what makes gradient-based optimization useful here: a one-hot vector is a fixed code with nothing to learn, whereas a dense embedding is a trainable parameter, so gradients can move the vectors of related words closer together during training.
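
One way to see the link between the two representations: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, which is why embedding layers are implemented as table lookups. The sketch below uses a random 6 x 3 matrix as a stand-in for learned embeddings (the sizes and the name `W_demo` are arbitrary assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
demo_vocab_size, demo_dim = 6, 3
W_demo = rng.normal(size=(demo_vocab_size, demo_dim))  # stand-in embedding table

word_index = 4
one_hot_vec = np.zeros(demo_vocab_size)
one_hot_vec[word_index] = 1.0

via_matmul = one_hot_vec @ W_demo   # full matrix product, O(V * d) work
via_lookup = W_demo[word_index]     # direct row lookup, O(d) work

print("matmul:", via_matmul)
print("lookup:", via_lookup)
print("identical:", np.allclose(via_matmul, via_lookup))
```

Because only row `word_index` contributes to the output, only that row receives a gradient during training, which is how each word's vector is shaped by the contexts it appears in.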

#### Common Misconceptions

| Misconception | Reality |
|---|---|
| Embeddings are hand-designed features | They are **learned** from data |
| Each dimension has a clear meaning | Dimensions are usually not individually interpretable |
| Embeddings are only for words | They work for anything: users, products, genes, molecules |
| Bigger embeddings are always better | There is an optimal size; too large overfits, too small underfits |

---

## 2. Word2Vec

### Intuitive Explanation

Word2Vec (Mikolov et al., 2013) was a breakthrough: a simple neural network that learns word embeddings by predicting context. The core idea has two flavors:

**Skip-gram:** Given a center word, predict the surrounding context words.
- Input: "cat" -> Predict: "the", "sat", "on", "mat"

**CBOW (Continuous Bag of Words):** Given surrounding context words, predict the center word.
- Input: "the", "sat", "on", "mat" -> Predict: "cat"
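
The two framings slice the same context window differently. A minimal sketch, assuming a window of 2 and the sentence "the cat sat on the mat" (the helper names are hypothetical and not part of the original Word2Vec code):

```python
def skipgram_examples(tokens, window=2):
    """One (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """One (context_list, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

demo_tokens = "the cat sat on the mat".split()

sg = skipgram_examples(demo_tokens)
cb = cbow_examples(demo_tokens)

print("Skip-gram pairs with center 'cat':")
print([pair for pair in sg if pair[0] == "cat"])
print("CBOW example with center 'cat':")
print([ex for ex in cb if ex[1] == "cat"])
```

Skip-gram produces several small training examples per position, whereas CBOW produces one example whose input is the whole context. With a window of 2, "mat" lies outside the window around "cat"; a wider window would include it.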

**Why does this work?** If "cat" and "kitten" both predict similar context words ("sat," "purring," "fur"), then the model is forced to give them similar embeddings. The training objective implicitly encodes semantic similarity.

**The surprising result:** The learned embeddings capture not just similarity, but **relational structure**. The vector difference between "king" and "man" is approximately the same as between "queen" and "woman." This means:

$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$

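This was one of the most striking early results in neural NLP. The relation holds only approximately, and in practice the nearest-neighbor search excludes the three input words, because one of them is often the closest vector to the result.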

### Visualization: Skip-gram vs CBOW

In [None]:
# Visualize Skip-gram vs CBOW architectures
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

sentence = ["the", "cat", "sat", "on", "the"]
center_idx = 2  # "sat"
window = 2

# === Left: Skip-gram ===
ax = axes[0]
ax.set_xlim(-1, 5)
ax.set_ylim(-0.5, 3.5)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('Skip-gram\n"Predict context from center"', fontsize=14, fontweight='bold')

# Draw words
for i, word in enumerate(sentence):
    color = 'steelblue' if i == center_idx else 'lightcoral'
    edge = 'navy' if i == center_idx else 'darkred'
    alpha = 1.0 if abs(i - center_idx) <= window else 0.3
    
    rect = plt.Rectangle((i - 0.4, 2.8), 0.8, 0.5, facecolor=color, 
                          edgecolor=edge, linewidth=2, alpha=alpha)
    ax.add_patch(rect)
    ax.text(i, 3.05, word, ha='center', va='center', fontsize=12, 
            fontweight='bold', alpha=alpha)

# Draw arrows from center to context
for i in range(len(sentence)):
    if i != center_idx and abs(i - center_idx) <= window:
        ax.annotate('', xy=(i, 2.8), xytext=(center_idx, 2.8),
                    arrowprops=dict(arrowstyle='->', color='darkred', lw=2))

# Labels
ax.text(center_idx, 2.2, 'INPUT\n(center word)', ha='center', va='center',
        fontsize=10, color='navy', fontweight='bold')
ax.text(2, 0.8, 'Neural\nNetwork', ha='center', va='center', fontsize=12,
        bbox=dict(boxstyle='round,pad=0.5', facecolor='lightyellow', edgecolor='orange', linewidth=2))
ax.text(2, 0.0, '"sat" -> predict "the", "cat", "on", "the"', 
        ha='center', va='center', fontsize=10, style='italic')

# === Right: CBOW ===
ax = axes[1]
ax.set_xlim(-1, 5)
ax.set_ylim(-0.5, 3.5)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('CBOW\n"Predict center from context"', fontsize=14, fontweight='bold')

# Draw words
for i, word in enumerate(sentence):
    color = 'lightcoral' if i != center_idx else 'steelblue'
    edge = 'darkred' if i != center_idx else 'navy'
    alpha = 1.0 if abs(i - center_idx) <= window else 0.3
    
    rect = plt.Rectangle((i - 0.4, 2.8), 0.8, 0.5, facecolor=color, 
                          edgecolor=edge, linewidth=2, alpha=alpha)
    ax.add_patch(rect)
    ax.text(i, 3.05, word, ha='center', va='center', fontsize=12, 
            fontweight='bold', alpha=alpha)

# Draw arrows from context to center
for i in range(len(sentence)):
    if i != center_idx and abs(i - center_idx) <= window:
        ax.annotate('', xy=(center_idx, 2.8), xytext=(i, 2.8),
                    arrowprops=dict(arrowstyle='->', color='navy', lw=2))

# Labels
ax.text(center_idx, 2.2, 'OUTPUT\n(center word)', ha='center', va='center',
        fontsize=10, color='navy', fontweight='bold')
ax.text(2, 0.8, 'Neural\nNetwork', ha='center', va='center', fontsize=12,
        bbox=dict(boxstyle='round,pad=0.5', facecolor='lightyellow', edgecolor='orange', linewidth=2))
ax.text(2, 0.0, '"the", "cat", "on", "the" -> predict "sat"', 
        ha='center', va='center', fontsize=10, style='italic')

plt.tight_layout()
plt.show()

### Implementing Word2Vec (Skip-gram) from Scratch

Let's build a skip-gram model step by step. The architecture is surprisingly simple:

1. **Input:** One-hot encoded center word
2. **Hidden layer:** Embedding lookup (this IS the embedding we want to learn)
3. **Output:** Probability distribution over vocabulary (which words are likely context words)

With negative sampling, we simplify this further: instead of predicting over the entire vocabulary, we just need to distinguish true context words from random "negative" words.
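
To see what that binary objective looks like numerically, here is a small worked example with made-up 3-dimensional vectors (all values are arbitrary assumptions). For a center vector $\vec{v}$, a true context vector $\vec{u}$, and negative vectors $\vec{n}_k$, the loss is $-\log \sigma(\vec{v} \cdot \vec{u}) - \sum_k \log \sigma(-\vec{v} \cdot \vec{n}_k)$. It falls when the center aligns with the true context and points away from the negatives.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(center, context, negatives):
    """Loss for one (center, context) pair and a list of negative vectors."""
    positive_term = -math.log(sigmoid(dot(center, context)))
    negative_term = -sum(math.log(sigmoid(-dot(center, n))) for n in negatives)
    return positive_term + negative_term

# Made-up vectors: the true context points the same way as the center,
# the negatives point roughly the opposite way.
center_vec = [1.0, 0.5, -0.5]
true_context = [0.9, 0.4, -0.6]
negatives = [[-1.0, 0.0, 0.5], [-0.5, -0.5, 0.2]]

good = neg_sampling_loss(center_vec, true_context, negatives)
# Swap roles: treat a negative as the "true" context and vice versa
bad = neg_sampling_loss(center_vec, negatives[0], [true_context, negatives[1]])

print(f"Loss with well-separated vectors: {good:.4f}")
print(f"Loss with roles swapped:          {bad:.4f}")
```

The swapped case has a much higher loss, so gradient descent pushes the vectors toward the well-separated arrangement.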

In [None]:
# Step 1: Prepare a toy corpus with enough structure to learn from
corpus = [
    "the king rules the kingdom with wisdom",
    "the queen rules the kingdom with grace",
    "the prince will become king one day",
    "the princess will become queen one day",
    "the man works in the village",
    "the woman works in the village",
    "the boy plays in the village",
    "the girl plays in the village",
    "a king and queen rule together",
    "a man and woman live together",
    "a boy and girl play together",
    "the king sits on the royal throne",
    "the queen sits on the royal throne",
    "the prince is the son of the king",
    "the princess is the daughter of the queen",
    "the man is the father of the boy",
    "the woman is the mother of the girl",
    "a brave king protects the kingdom",
    "a wise queen protects the kingdom",
    "the young prince trains with the knight",
    "the young princess studies with the scholar",
    "the strong man builds the house",
    "the kind woman tends the garden",
    "the king wears the golden crown",
    "the queen wears the silver crown",
]

# Tokenize
sentences = [s.lower().split() for s in corpus]
all_words = [word for sentence in sentences for word in sentence]

# Build vocabulary
word_counts = Counter(all_words)
vocab = sorted(word_counts.keys())
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for w, i in word2idx.items()}
vocab_size = len(vocab)

print(f"Vocabulary size: {vocab_size}")
print(f"Total tokens: {len(all_words)}")
print(f"Sample words: {vocab[:10]}")
print(f"Word counts (top 10): {word_counts.most_common(10)}")

In [None]:
# Step 2: Generate skip-gram training pairs
def generate_skipgram_pairs(sentences, word2idx, window_size=2):
    """
    Generate (center_word, context_word) pairs for skip-gram training.
    
    Args:
        sentences: List of tokenized sentences
        word2idx: Word to index mapping
        window_size: Number of words on each side to consider as context
    
    Returns:
        List of (center_idx, context_idx) tuples
    """
    pairs = []
    for sentence in sentences:
        indices = [word2idx[w] for w in sentence]
        for i, center in enumerate(indices):
            # Look at words within the window
            for j in range(max(0, i - window_size), min(len(indices), i + window_size + 1)):
                if i != j:
                    pairs.append((center, indices[j]))
    return pairs

pairs = generate_skipgram_pairs(sentences, word2idx, window_size=2)
print(f"Generated {len(pairs)} training pairs")
print(f"\nSample pairs (center -> context):")
for center, context in pairs[:8]:
    print(f"  {idx2word[center]:12s} -> {idx2word[context]}")

In [None]:
# Step 3: Define the Skip-gram model with Negative Sampling
class SkipGramNegSampling(nn.Module):
    """
    Skip-gram Word2Vec with negative sampling.
    
    Instead of computing softmax over the entire vocabulary (expensive!),
    we train a binary classifier: is this a real (center, context) pair
    or a fake one?
    """
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Two embedding matrices:
        # - center_embeddings: for center words (this is what we keep as our word vectors)
        # - context_embeddings: for context words (used during training only)
        self.center_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # Initialize with small random values
        nn.init.uniform_(self.center_embeddings.weight, -0.5/embedding_dim, 0.5/embedding_dim)
        nn.init.uniform_(self.context_embeddings.weight, -0.5/embedding_dim, 0.5/embedding_dim)
    
    def forward(self, center_words, context_words, negative_words):
        """
        Args:
            center_words: (batch_size,) center word indices
            context_words: (batch_size,) true context word indices
            negative_words: (batch_size, num_neg) negative sample indices
        
        Returns:
            loss: negative sampling loss
        """
        # Get embeddings
        center_emb = self.center_embeddings(center_words)      # (batch, emb_dim)
        context_emb = self.context_embeddings(context_words)    # (batch, emb_dim)
        neg_emb = self.context_embeddings(negative_words)       # (batch, num_neg, emb_dim)
        
        # Positive score: dot product of center and true context
        pos_score = torch.sum(center_emb * context_emb, dim=1)  # (batch,)
        pos_loss = F.logsigmoid(pos_score)                       # log(sigmoid(score))
        
        # Negative scores: dot product of center with each negative sample
        # center_emb: (batch, emb_dim) -> (batch, emb_dim, 1)
        neg_score = torch.bmm(neg_emb, center_emb.unsqueeze(2)).squeeze(2)  # (batch, num_neg)
        neg_loss = F.logsigmoid(-neg_score).sum(dim=1)  # log(sigmoid(-score))
        
        # Total loss: maximize positive score, minimize negative scores
        loss = -(pos_loss + neg_loss).mean()
        return loss

# Create model
EMBEDDING_DIM = 20  # Small for our toy corpus
model = SkipGramNegSampling(vocab_size, EMBEDDING_DIM)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Center embeddings shape: {model.center_embeddings.weight.shape}")
print(f"Context embeddings shape: {model.context_embeddings.weight.shape}")

In [None]:
# Step 4: Training loop
def get_negative_samples(batch_size, num_neg, vocab_size, word_counts, idx2word):
    """
    Sample negative words proportional to frequency^(3/4).
    The 3/4 power smooths the distribution, giving rare words more chance.
    """
    # Build sampling distribution: freq^(3/4)
    freqs = np.array([word_counts.get(idx2word[i], 1) for i in range(vocab_size)], dtype=np.float64)
    freqs = freqs ** 0.75
    freqs /= freqs.sum()
    
    neg_samples = np.random.choice(vocab_size, size=(batch_size, num_neg), p=freqs)
    return torch.LongTensor(neg_samples)

# Training
optimizer = optim.Adam(model.parameters(), lr=0.01)
NUM_NEG = 5
EPOCHS = 200
BATCH_SIZE = 64

# Convert pairs to tensors
center_indices = torch.LongTensor([p[0] for p in pairs])
context_indices = torch.LongTensor([p[1] for p in pairs])

losses = []
for epoch in range(EPOCHS):
    # Shuffle
    perm = torch.randperm(len(pairs))
    center_shuffled = center_indices[perm]
    context_shuffled = context_indices[perm]
    
    epoch_loss = 0
    num_batches = 0
    
    for i in range(0, len(pairs), BATCH_SIZE):
        batch_center = center_shuffled[i:i+BATCH_SIZE]
        batch_context = context_shuffled[i:i+BATCH_SIZE]
        batch_neg = get_negative_samples(len(batch_center), NUM_NEG, vocab_size, word_counts, idx2word)
        
        optimizer.zero_grad()
        loss = model(batch_center, batch_context, batch_neg)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        num_batches += 1
    
    avg_loss = epoch_loss / num_batches
    losses.append(avg_loss)
    
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1:3d}/{EPOCHS}: Loss = {avg_loss:.4f}")

# Plot training loss
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(losses, color='steelblue', linewidth=2)
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Skip-gram Training Loss', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Visualization: Learned Word Embeddings in 2D

Let's project our learned embeddings down to 2D using PCA and see if similar words cluster together.

In [None]:
# Extract learned embeddings
embeddings_matrix = model.center_embeddings.weight.detach().numpy()

# PCA for 2D projection
def pca_2d(X):
    """Project data to 2D using PCA."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered.T)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Take top 2 eigenvectors (largest eigenvalues)
    idx = np.argsort(eigenvalues)[::-1][:2]
    top_vectors = eigenvectors[:, idx]
    return X_centered @ top_vectors

# Project to 2D
embeddings_2d = pca_2d(embeddings_matrix)

# Define word groups for coloring
groups = {
    'royalty_male': ['king', 'prince'],
    'royalty_female': ['queen', 'princess'],
    'people_male': ['man', 'boy', 'father', 'son'],
    'people_female': ['woman', 'girl', 'mother', 'daughter'],
}

group_colors = {
    'royalty_male': 'blue',
    'royalty_female': 'red',
    'people_male': 'steelblue',
    'people_female': 'lightcoral',
}

group_labels = {
    'royalty_male': 'Male Royalty',
    'royalty_female': 'Female Royalty',
    'people_male': 'Male Common',
    'people_female': 'Female Common',
}

fig, ax = plt.subplots(figsize=(12, 10))

# Plot all words in gray
all_group_words = set(w for words in groups.values() for w in words)
for i, word in enumerate(vocab):
    if word not in all_group_words:
        ax.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1], c='lightgray', s=30, alpha=0.5)
        ax.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]), fontsize=7, alpha=0.4)

# Plot grouped words with colors
for group_name, words in groups.items():
    for word in words:
        if word in word2idx:
            idx = word2idx[word]
            ax.scatter(embeddings_2d[idx, 0], embeddings_2d[idx, 1], 
                      c=group_colors[group_name], s=150, zorder=5, edgecolors='black', linewidth=1.5,
                      label=group_labels[group_name] if word == words[0] else "")
            ax.annotate(word, (embeddings_2d[idx, 0], embeddings_2d[idx, 1]),
                       fontsize=12, fontweight='bold', 
                       xytext=(8, 8), textcoords='offset points')

ax.set_title('Learned Word2Vec Embeddings (PCA Projection)', fontsize=14, fontweight='bold')
ax.set_xlabel('PC 1', fontsize=12)
ax.set_ylabel('PC 2', fontsize=12)
ax.legend(fontsize=10, loc='best')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### The Magic: Vector Arithmetic

The most surprising property of word embeddings is that **vector arithmetic captures analogies**. The direction from "man" to "woman" encodes the concept of gender. Adding that direction to "king" should land near "queen."

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$
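
Before testing the trained vectors, it can help to see why the arithmetic works in an idealized case. The sketch below hand-builds 2D vectors where the first axis stands for royalty and the second for gender; these coordinates are invented for illustration and are not learned from data.

```python
import numpy as np

# Invented coordinates: [royalty, gender]
ideal = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.8,  1.0]),
    "apple":  np.array([-1.0, 0.0]),
}

# king - man keeps the royalty component; + woman adds the other gender
result = ideal["king"] - ideal["man"] + ideal["woman"]

# Nearest word by Euclidean distance, excluding the three input words
candidates = {w: v for w, v in ideal.items() if w not in {"king", "man", "woman"}}
nearest = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - result))

print("king - man + woman =", result)
print("nearest remaining word:", nearest)
```

Learned embeddings are noisier: the parallelogram only holds approximately, so the result is matched to its nearest neighbor rather than compared for equality.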

In [None]:
# Vector arithmetic on our learned embeddings
def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_nearest(query_vec, embeddings_matrix, idx2word, top_k=5, exclude=None):
    """
    Find the top_k nearest words to a query vector.
    
    Args:
        query_vec: Query embedding vector
        embeddings_matrix: All word embeddings (vocab_size, emb_dim)
        idx2word: Index to word mapping
        top_k: Number of results
        exclude: Set of words to exclude from results
    
    Returns:
        List of (word, similarity) tuples
    """
    if exclude is None:
        exclude = set()
    
    similarities = []
    for i in range(len(idx2word)):
        word = idx2word[i]
        if word not in exclude:
            sim = cosine_similarity(query_vec, embeddings_matrix[i])
            similarities.append((word, sim))
    
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]

def analogy(a, b, c, embeddings_matrix, word2idx, idx2word):
    """
    Solve: a is to b as c is to ?
    Computes: b - a + c and finds nearest word.
    """
    vec_a = embeddings_matrix[word2idx[a]]
    vec_b = embeddings_matrix[word2idx[b]]
    vec_c = embeddings_matrix[word2idx[c]]
    
    query = vec_b - vec_a + vec_c
    results = find_nearest(query, embeddings_matrix, idx2word, top_k=5, exclude={a, b, c})
    return results

# Test analogies
print("=== Vector Arithmetic Analogies ===\n")

analogy_tests = [
    ("man", "king", "woman", "queen"),
    ("man", "boy", "woman", "girl"),
    ("king", "prince", "queen", "princess"),
    # The expected word must differ from a, b, and c, which analogy() excludes
    ("man", "father", "woman", "mother"),
]

for a, b, c, expected in analogy_tests:
    results = analogy(a, b, c, embeddings_matrix, word2idx, idx2word)
    top_word = results[0][0]
    marker = "  <<< correct!" if top_word == expected else f"  (expected: {expected})"
    print(f"  {a:10s} -> {b:10s} :: {c:10s} -> {top_word:10s} (sim={results[0][1]:.4f}){marker}")
    for word, sim in results[1:3]:
        print(f"{'':42s}   {word:10s} (sim={sim:.4f})")

### Visualization: Vector Arithmetic in Embedding Space

In [None]:
# Visualize the king - man + woman = queen analogy in 2D
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# === Left: The analogy in embedding space ===
ax = axes[0]

analogy_words = ['king', 'queen', 'man', 'woman']
analogy_colors = ['blue', 'red', 'steelblue', 'lightcoral']
analogy_indices = [word2idx[w] for w in analogy_words]

# Plot the four words
for i, (word, color) in enumerate(zip(analogy_words, analogy_colors)):
    idx = word2idx[word]
    ax.scatter(embeddings_2d[idx, 0], embeddings_2d[idx, 1], 
              c=color, s=200, zorder=5, edgecolors='black', linewidth=2)
    ax.annotate(word, (embeddings_2d[idx, 0], embeddings_2d[idx, 1]),
               fontsize=14, fontweight='bold', 
               xytext=(10, 10), textcoords='offset points')

# Draw "gender" arrows: man->woman, king->queen
for start_word, end_word, label in [('man', 'woman', 'gender\ndirection'), 
                                      ('king', 'queen', '')]:
    si = word2idx[start_word]
    ei = word2idx[end_word]
    ax.annotate('', xy=(embeddings_2d[ei, 0], embeddings_2d[ei, 1]),
               xytext=(embeddings_2d[si, 0], embeddings_2d[si, 1]),
               arrowprops=dict(arrowstyle='->', color='green', lw=2.5, ls='--'))
    if label:
        mid_x = (embeddings_2d[si, 0] + embeddings_2d[ei, 0]) / 2
        mid_y = (embeddings_2d[si, 1] + embeddings_2d[ei, 1]) / 2
        ax.annotate(label, (mid_x, mid_y), fontsize=10, color='green', 
                   fontweight='bold', ha='center',
                   xytext=(-30, 0), textcoords='offset points')

# Draw "royalty" arrows: man->king, woman->queen
for start_word, end_word, label in [('man', 'king', 'royalty\ndirection'), 
                                      ('woman', 'queen', '')]:
    si = word2idx[start_word]
    ei = word2idx[end_word]
    ax.annotate('', xy=(embeddings_2d[ei, 0], embeddings_2d[ei, 1]),
               xytext=(embeddings_2d[si, 0], embeddings_2d[si, 1]),
               arrowprops=dict(arrowstyle='->', color='purple', lw=2.5, ls='--'))
    if label:
        mid_x = (embeddings_2d[si, 0] + embeddings_2d[ei, 0]) / 2
        mid_y = (embeddings_2d[si, 1] + embeddings_2d[ei, 1]) / 2
        ax.annotate(label, (mid_x, mid_y), fontsize=10, color='purple', 
                   fontweight='bold', ha='center',
                   xytext=(30, 0), textcoords='offset points')

ax.set_title('king - man + woman = queen\n(Parallel Relationship Structure)', 
             fontsize=13, fontweight='bold')
ax.set_xlabel('PC 1', fontsize=11)
ax.set_ylabel('PC 2', fontsize=11)
ax.grid(True, alpha=0.3)

# === Right: Similarity heatmap ===
ax = axes[1]
focus_words = ['king', 'queen', 'prince', 'princess', 'man', 'woman', 'boy', 'girl']
focus_indices = [word2idx[w] for w in focus_words if w in word2idx]
focus_words_filtered = [w for w in focus_words if w in word2idx]

n = len(focus_words_filtered)
sim_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        sim_matrix[i, j] = cosine_similarity(
            embeddings_matrix[word2idx[focus_words_filtered[i]]],
            embeddings_matrix[word2idx[focus_words_filtered[j]]]
        )

im = ax.imshow(sim_matrix, cmap='RdBu_r', vmin=-1, vmax=1)
ax.set_xticks(range(n))
ax.set_xticklabels(focus_words_filtered, rotation=45, ha='right', fontsize=11)
ax.set_yticks(range(n))
ax.set_yticklabels(focus_words_filtered, fontsize=11)
ax.set_title('Cosine Similarity Between Word Embeddings', fontsize=13, fontweight='bold')
plt.colorbar(im, ax=ax, shrink=0.8)

# Annotate
for i in range(n):
    for j in range(n):
        ax.text(j, i, f'{sim_matrix[i,j]:.2f}', ha='center', va='center', fontsize=8)

plt.tight_layout()
plt.show()

### Deep Dive: Why Does Word2Vec Capture Semantics?

The math behind this is elegant. Consider the skip-gram objective: we want to maximize $P(\text{context} | \text{center})$. With negative sampling and $k$ negative samples per positive pair, the optimal embeddings satisfy:

$$\vec{w}_{\text{center}} \cdot \vec{w}_{\text{context}} \approx \text{PMI}(\text{center}, \text{context}) - \log k$$

Levy and Goldberg (2014) showed that skip-gram with negative sampling therefore implicitly factorizes a shifted **pointwise mutual information (PMI) matrix** -- a matrix that captures how much more often two words co-occur than you would expect by chance. Words that co-occur in similar contexts end up with similar vectors because they have similar PMI profiles.
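
For reference, PMI is defined as $\text{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}$. A minimal sketch of computing it from pair counts follows; the counts are invented for illustration and `pmi` is a hypothetical helper.

```python
import math
from collections import Counter

# Invented (word, context) co-occurrence counts
pair_counts = Counter({
    ("king", "throne"): 8,
    ("king", "the"): 20,
    ("garden", "throne"): 1,
    ("garden", "the"): 21,
})

total = sum(pair_counts.values())
word_totals = Counter()
context_totals = Counter()
for (word, context), count in pair_counts.items():
    word_totals[word] += count
    context_totals[context] += count

def pmi(word, context):
    """Log of the observed co-occurrence probability over the chance expectation."""
    p_joint = pair_counts[(word, context)] / total
    p_word = word_totals[word] / total
    p_context = context_totals[context] / total
    return math.log(p_joint / (p_word * p_context))

for pair in pair_counts:
    print(f"PMI{pair} = {pmi(*pair):+.3f}")
```

A positive value means the pair co-occurs more often than chance would predict, a negative value less often. Here "king" and "throne" score clearly positive, while pairs involving "the" stay close to zero because it co-occurs with almost everything.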

#### Key Insight

Word2Vec does not "understand" language. It discovers statistical regularities in word co-occurrence and encodes them as geometric relationships. The fact that this produces semantically meaningful vectors is a profound statement about how meaning relates to usage patterns.

#### The Negative Sampling Trick

| Approach | What it does | Cost per step |
|---|---|---|
| Full softmax | Normalize over all V words | O(V) -- very expensive |
| Negative sampling | Compare 1 positive + k negatives | O(k) -- fast! |
| Hierarchical softmax | Binary tree over vocabulary | O(log V) |

Mikolov et al. (2013) report that k in the range 5-20 works well for small datasets, while k as small as 2-5 is enough for large ones. The negative words are sampled proportional to $f(w)^{3/4}$, where $f(w)$ is the word frequency. The 3/4 exponent was found empirically to work best -- it smooths the distribution so rare words get sampled more than their raw frequency would suggest.
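
The effect of the 3/4 exponent is easy to check by hand. The sketch below uses invented counts for three words and compares raw unigram probabilities with the smoothed ones (`sampling_probs` is a hypothetical helper).

```python
# Invented word counts: one very common word, one medium, one rare
toy_counts = {"the": 900, "king": 90, "scholar": 10}

def sampling_probs(counts, power):
    """Normalize count**power into a probability distribution."""
    weights = {w: c ** power for w, c in counts.items()}
    norm = sum(weights.values())
    return {w: weight / norm for w, weight in weights.items()}

raw = sampling_probs(toy_counts, power=1.0)
smoothed = sampling_probs(toy_counts, power=0.75)

print(f"{'word':10s} {'raw':>8s} {'power 3/4':>10s}")
for w in toy_counts:
    print(f"{w:10s} {raw[w]:8.3f} {smoothed[w]:10.3f}")
```

The most frequent word loses probability mass and the rarest one gains it, so negatives are not drawn almost exclusively from a handful of very common words.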

---

## 3. GloVe (Global Vectors for Word Representation)

### Intuitive Explanation

While Word2Vec learns from **local** context windows (one word at a time), GloVe (Pennington et al., 2014) takes a different approach: it first builds a global **co-occurrence matrix** and then factorizes it.

Think of it this way:
- **Word2Vec** reads through the text word by word, like a human reading a book
- **GloVe** first counts all word pair co-occurrences across the entire corpus, then finds embeddings that best explain those counts

The GloVe objective is:

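$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \vec{w}_i^T \tilde{\vec{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

**What this means:** Here $X_{ij}$ counts how often word $j$ appears in the context of word $i$, $\vec{w}_i$ is a word vector, and $\tilde{\vec{w}}_j$ is a separate context vector. We want their dot product (plus bias terms) to approximate the log of how often the two words co-occur. The weighting function $f(X_{ij})$ caps the influence of very common pairs (like "of, the") so they do not dominate, and it is zero when $X_{ij} = 0$, so pairs that never co-occur drop out of the sum.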
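
In the GloVe paper the weighting function is $f(x) = (x / x_{\max})^{\alpha}$ for $x < x_{\max}$ and $1$ otherwise, with $x_{\max} = 100$ and $\alpha = 3/4$ as the reported defaults. A small sketch of that function and of a single term of the objective follows; the vectors, biases, and count are invented values, and the helper names are hypothetical.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: grows as (x / x_max)**alpha, then saturates at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One term of the objective: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2."""
    prediction = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (prediction - math.log(x_ij)) ** 2

for count in [1, 10, 100, 10_000]:
    print(f"f({count:>6}) = {glove_weight(count):.3f}")

# Invented example: two 2-d vectors whose dot product plus biases is 2.0
loss = glove_pair_loss([1.0, 0.5], [1.0, 1.0], 0.3, 0.2, x_ij=20)
print(f"pair loss: {loss:.4f}")
```

Rare pairs get a small weight, and every pair seen at least `x_max` times gets the same weight of 1, so a pair seen 10,000 times counts no more than one seen 100 times.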

### GloVe: Building the Co-occurrence Matrix

In [None]:
# Build co-occurrence matrix from our corpus
def build_cooccurrence_matrix(sentences, word2idx, window_size=2):
    """
    Build a word-word co-occurrence matrix.
    
    Args:
        sentences: List of tokenized sentences
        word2idx: Word to index mapping
        window_size: Context window size
    
    Returns:
        Co-occurrence matrix of shape (vocab_size, vocab_size)
    """
    V = len(word2idx)
    cooccur = np.zeros((V, V))
    
    for sentence in sentences:
        indices = [word2idx[w] for w in sentence]
        for i, center in enumerate(indices):
            for j in range(max(0, i - window_size), min(len(indices), i + window_size + 1)):
                if i != j:
                    # Weight by distance: closer words get higher weight
                    distance = abs(i - j)
                    cooccur[center, indices[j]] += 1.0 / distance
    
    return cooccur

cooccur_matrix = build_cooccurrence_matrix(sentences, word2idx, window_size=2)

# Visualize a portion of the co-occurrence matrix
fig, ax = plt.subplots(figsize=(10, 8))

focus_words = ['king', 'queen', 'prince', 'princess', 'man', 'woman', 'boy', 'girl',
               'kingdom', 'village', 'throne', 'crown']
focus_words = [w for w in focus_words if w in word2idx]
focus_indices = [word2idx[w] for w in focus_words]

sub_matrix = cooccur_matrix[np.ix_(focus_indices, focus_indices)]
im = ax.imshow(sub_matrix, cmap='YlOrRd', aspect='auto')
ax.set_xticks(range(len(focus_words)))
ax.set_xticklabels(focus_words, rotation=45, ha='right', fontsize=10)
ax.set_yticks(range(len(focus_words)))
ax.set_yticklabels(focus_words, fontsize=10)
ax.set_title('Co-occurrence Matrix (GloVe Input)', fontsize=14, fontweight='bold')
plt.colorbar(im, ax=ax, shrink=0.8, label='Distance-weighted co-occurrence')

# Annotate non-zero values
for i in range(len(focus_words)):
    for j in range(len(focus_words)):
        val = sub_matrix[i, j]
        if val > 0:
            ax.text(j, i, f'{val:.1f}', ha='center', va='center', fontsize=7,
                   color='white' if val > sub_matrix.max() * 0.6 else 'black')

plt.tight_layout()
plt.show()

print("Notice: words that appear in similar contexts (king/queen, man/woman)")
print("have similar co-occurrence patterns across the columns.")

### Comparison: Word2Vec vs GloVe

| Feature | Word2Vec | GloVe |
|---|---|---|
| **Approach** | Predictive (neural network) | Count-based (matrix factorization) |
| **Training data** | Local context windows | Global co-occurrence matrix |
| **Objective** | Predict context words | Reconstruct log co-occurrence |
| **Strengths** | Good with small data, captures syntax | Better with large data, captures semantics |
| **Weaknesses** | Only sees local context | Requires building full matrix |
| **Training** | Online (stochastic) | Batch (needs full corpus first) |
| **Result quality** | Very similar | Very similar |
| **Key paper** | Mikolov et al., 2013 | Pennington et al., 2014 |

**What this means:** In practice, Word2Vec and GloVe produce similarly good embeddings. The choice often comes down to implementation convenience. Both have been largely superseded by contextual embeddings from models like BERT and GPT.

---

## 4. Modern Embeddings: From Static to Contextual

### Intuitive Explanation

Word2Vec and GloVe give each word **one fixed embedding** regardless of context. But consider the word "bank":

- "I sat on the river **bank**" (riverbank)
- "I deposited money at the **bank**" (financial institution)

These are completely different meanings, but Word2Vec assigns the same vector to both! This is a fundamental limitation of **static embeddings**.

**Contextual embeddings** (from models like BERT, GPT, and their descendants) solve this by computing a **different embedding for each occurrence** of a word, based on the surrounding context. The same word "bank" gets a different vector depending on whether it appears near "river" or "money."

### How Contextual Embeddings Work

1. **Input:** A full sentence is fed into a Transformer model
2. **Processing:** Each token attends to other tokens via self-attention (all of them in BERT; only the preceding ones in GPT-style models)
3. **Output:** Each word gets a context-dependent embedding

The key difference: static embeddings are a **lookup table** (one vector per word), while contextual embeddings are a **function** of the entire input sequence.
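
To make the lookup-table-versus-function distinction concrete, here is a toy sketch. It applies a single self-attention step with no learned weights, which is a deliberate simplification of what a Transformer does. The 2-D vectors and the names `static_table`, `static_embed`, and `contextual_embed` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 2-D static embeddings: axis 0 ~ "nature", axis 1 ~ "finance"
static_table = {
    "river": np.array([1.0, 0.0]),
    "money": np.array([0.0, 1.0]),
    "bank":  np.array([0.7, 0.7]),
}

def static_embed(tokens):
    """Lookup table: each token maps to one fixed vector."""
    return np.stack([static_table[t] for t in tokens])

def contextual_embed(tokens):
    """One self-attention step with no learned weights (a simplification)."""
    X = static_embed(tokens)
    scores = X @ X.T / np.sqrt(X.shape[1])            # scaled dot-product scores
    scores -= scores.max(axis=1, keepdims=True)       # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ X                                # each row mixes in its context

bank_river = contextual_embed(["river", "bank"])[1]
bank_money = contextual_embed(["money", "bank"])[1]

print("static 'bank':          ", static_table["bank"])
print("'bank' in river context:", bank_river.round(3))
print("'bank' in money context:", bank_money.round(3))
```

The static lookup returns the same vector for "bank" in both sentences, while the attention step pulls it toward whichever neighbor it appears with.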

### Visualization: Static vs Contextual Embeddings

In [None]:
# Simulate static vs contextual embeddings for the word "bank"
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

np.random.seed(42)

# === Left: Static Embeddings ===
ax = axes[0]

# All instances of "bank" map to the same point
static_words = {
    'river': np.array([2.0, 3.5]),
    'water': np.array([2.5, 3.0]),
    'shore': np.array([1.5, 3.2]),
    'money': np.array([-2.0, -2.5]),
    'account': np.array([-2.5, -2.0]),
    'deposit': np.array([-1.5, -2.8]),
    'bank': np.array([0.0, 0.5]),  # Single point -- ambiguous!
}

for word, pos in static_words.items():
    color = 'green' if word == 'bank' else ('steelblue' if pos[1] > 1 else 'lightcoral')
    size = 200 if word == 'bank' else 100
    marker = '*' if word == 'bank' else 'o'
    ax.scatter(pos[0], pos[1], c=color, s=size, zorder=5, edgecolors='black', 
              linewidth=1.5, marker=marker)
    ax.annotate(word, pos, fontsize=11, fontweight='bold',
               xytext=(8, 8), textcoords='offset points')

ax.set_title('Static Embeddings (Word2Vec/GloVe)\n"bank" has ONE vector', 
             fontsize=13, fontweight='bold')
ax.set_xlabel('Dimension 1', fontsize=11)
ax.set_ylabel('Dimension 2', fontsize=11)
ax.grid(True, alpha=0.3)

# Add region labels
ax.text(2.0, 4.2, 'Nature cluster', fontsize=10, ha='center', color='steelblue', 
        fontweight='bold', style='italic')
ax.text(-2.0, -3.5, 'Finance cluster', fontsize=10, ha='center', color='lightcoral', 
        fontweight='bold', style='italic')
ax.annotate('Stuck in\nthe middle!', xy=(0, 0.5), xytext=(1.5, -1.5),
           fontsize=10, color='green', fontweight='bold',
           arrowprops=dict(arrowstyle='->', color='green', lw=2))

# === Right: Contextual Embeddings ===
ax = axes[1]

contextual_words = {
    'river': np.array([2.0, 3.5]),
    'water': np.array([2.5, 3.0]),
    'shore': np.array([1.5, 3.2]),
    'money': np.array([-2.0, -2.5]),
    'account': np.array([-2.5, -2.0]),
    'deposit': np.array([-1.5, -2.8]),
    'bank\n(river context)': np.array([1.8, 2.8]),     # Near nature words!
    'bank\n(finance context)': np.array([-1.8, -2.2]),  # Near finance words!
}

for word, pos in contextual_words.items():
    if 'bank' in word:
        color = 'green'
        size = 200
        marker = '*'
    elif pos[1] > 1:
        color = 'steelblue'
        size = 100
        marker = 'o'
    else:
        color = 'lightcoral'
        size = 100
        marker = 'o'
    
    ax.scatter(pos[0], pos[1], c=color, s=size, zorder=5, edgecolors='black', 
              linewidth=1.5, marker=marker)
    fontsize = 10 if 'bank' in word else 11
    ax.annotate(word, pos, fontsize=fontsize, fontweight='bold',
               xytext=(8, 8), textcoords='offset points')

# Draw arrow showing same word, different positions
bank1 = contextual_words['bank\n(river context)']
bank2 = contextual_words['bank\n(finance context)']
ax.annotate('', xy=bank2, xytext=bank1,
           arrowprops=dict(arrowstyle='<->', color='green', lw=2, ls='--'))
ax.text((bank1[0]+bank2[0])/2 + 1.2, (bank1[1]+bank2[1])/2, 
        'Same word,\ndifferent vectors!', fontsize=10, color='green', 
        fontweight='bold', ha='center')

ax.set_title('Contextual Embeddings (BERT/GPT)\n"bank" has DIFFERENT vectors per context', 
             fontsize=13, fontweight='bold')
ax.set_xlabel('Dimension 1', fontsize=11)
ax.set_ylabel('Dimension 2', fontsize=11)
ax.grid(True, alpha=0.3)

ax.text(2.0, 4.2, 'Nature cluster', fontsize=10, ha='center', color='steelblue', 
        fontweight='bold', style='italic')
ax.text(-2.0, -3.5, 'Finance cluster', fontsize=10, ha='center', color='lightcoral', 
        fontweight='bold', style='italic')

plt.tight_layout()
plt.show()

### Sentence Embeddings: From Words to Sentences

How do you get a single embedding for an entire sentence? There are several approaches:

| Method | How It Works | Pros | Cons |
|---|---|---|---|
| **Mean pooling** | Average all word/token embeddings | Simple, works well | Treats all tokens equally |
| **[CLS] token** | Use BERT's special classification token | Built into BERT | Not optimized for similarity |
| **Max pooling** | Take element-wise max across tokens | Captures strongest signals | Loses ordering information |
| **Specialized models** | Models trained specifically for sentence similarity (e.g., Sentence-BERT) | Best quality | Requires fine-tuned model |

In practice, **specialized sentence embedding models** (like Sentence-BERT or OpenAI's embedding models) give the best results because they are explicitly trained so that similar sentences have similar embeddings.

In [None]:
# Demonstrate different pooling strategies using our Word2Vec embeddings
def get_sentence_embedding(sentence, embeddings_matrix, word2idx, method='mean'):
    """
    Create a sentence embedding from word embeddings.
    
    Args:
        sentence: String sentence
        embeddings_matrix: Word embedding matrix
        word2idx: Word to index mapping
        method: 'mean', 'max', or 'sum'
    
    Returns:
        Sentence embedding vector
    """
    words = sentence.lower().split()
    word_vecs = []
    for word in words:
        if word in word2idx:
            word_vecs.append(embeddings_matrix[word2idx[word]])
    
    if not word_vecs:
        return np.zeros(embeddings_matrix.shape[1])
    
    word_vecs = np.array(word_vecs)
    
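    if method == 'mean':
        return word_vecs.mean(axis=0)
    elif method == 'max':
        return word_vecs.max(axis=0)
    elif method == 'sum':
        return word_vecs.sum(axis=0)
    else:
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown pooling method: {method!r}")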

# Test with example sentences
test_sentences = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the boy plays in the village",
    "the girl plays in the village",
    "the prince will become king",
]

print("=== Sentence Similarity (Mean Pooling) ===\n")
sent_embeddings = [get_sentence_embedding(s, embeddings_matrix, word2idx, 'mean') 
                   for s in test_sentences]

for i in range(len(test_sentences)):
    for j in range(i+1, len(test_sentences)):
        sim = cosine_similarity(sent_embeddings[i], sent_embeddings[j])
        print(f"  {sim:.4f}  |  '{test_sentences[i]}'")
        print(f"         |  '{test_sentences[j]}'")
        print()

### Deep Dive: The Evolution of Embeddings

| Era | Method | Type | Key Innovation |
|---|---|---|---|
| 2003 | Neural LM (Bengio) | Static | First neural word embeddings |
| 2013 | Word2Vec | Static | Efficient training at scale |
| 2014 | GloVe | Static | Global co-occurrence + local context |
| 2018 | ELMo | Contextual | Bi-directional LSTM embeddings |
| 2018 | BERT | Contextual | Transformer-based, bidirectional |
| 2018-now | GPT family | Contextual | Autoregressive Transformer |
| 2019+ | Sentence-BERT | Sentence-level | Siamese fine-tuning for sentence similarity |
| 2022+ | text-embedding-ada-002 | Sentence-level | Production-grade API embeddings |

#### Key Insight

The trend is clear: embeddings have evolved from one-vector-per-word to context-dependent representations computed by increasingly powerful models. Modern embedding models are trained specifically to produce vectors where semantic similarity corresponds to geometric proximity.

---

## 5. Similarity and Distance

### Intuitive Explanation

Now that we have embeddings, how do we measure "closeness" between vectors? This connects directly back to linear algebra (Part 1.1). There are three main distance/similarity measures, each with different properties.

**Cosine similarity:** How much do two vectors point in the same direction? Ignores magnitude, only cares about angle. This is the most common choice for embeddings.

$$\text{cosine}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$$

**Euclidean distance:** Straight-line distance between two points. Sensitive to magnitude.

$$d(\vec{a}, \vec{b}) = \|\vec{a} - \vec{b}\| = \sqrt{\sum_i (a_i - b_i)^2}$$

**Dot product similarity:** Raw alignment. Scales with magnitude, so longer vectors have higher scores.

$$\text{dot}(\vec{a}, \vec{b}) = \vec{a} \cdot \vec{b} = \sum_i a_i b_i$$

### Visualization: Comparing Similarity Measures

In [None]:
# Demonstrate the difference between similarity measures
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Create example vectors
np.random.seed(42)
a = np.array([3.0, 1.0])
b = np.array([1.0, 3.0])  # Different direction, same magnitude
c = np.array([6.0, 2.0])  # Same direction as a, different magnitude
d = np.array([-3.0, -1.0])  # Opposite direction to a (exactly -a)

vectors = {'a': a, 'b': b, 'c (2*a)': c, 'd (-a)': d}
colors = {'a': 'blue', 'b': 'red', 'c (2*a)': 'green', 'd (-a)': 'orange'}

# === Left: Cosine Similarity ===
ax = axes[0]
for name, vec in vectors.items():
    ax.annotate('', xy=vec, xytext=(0, 0),
               arrowprops=dict(arrowstyle='->', color=colors[name], lw=2.5))
    ax.text(vec[0]*1.1, vec[1]*1.1, name, fontsize=11, fontweight='bold', color=colors[name])

ax.set_xlim(-7, 7)
ax.set_ylim(-3, 7)
ax.set_aspect('equal')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_title('Cosine Similarity\n(angle only, ignores length)', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)

# Show cosine similarities
sims = []
for name, vec in vectors.items():
    if name != 'a':
        sim = np.dot(a, vec) / (np.linalg.norm(a) * np.linalg.norm(vec))
        sims.append(f"cos(a,{name})={sim:.2f}")
ax.text(0.02, 0.98, '\n'.join(sims), transform=ax.transAxes, fontsize=9,
        verticalalignment='top', bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

# === Middle: Euclidean Distance ===
ax = axes[1]
for name, vec in vectors.items():
    ax.scatter(vec[0], vec[1], c=colors[name], s=100, zorder=5, edgecolors='black', linewidth=1.5)
    ax.text(vec[0]+0.3, vec[1]+0.3, name, fontsize=11, fontweight='bold', color=colors[name])

# Draw distance lines from a to others
for name, vec in vectors.items():
    if name != 'a':
        ax.plot([a[0], vec[0]], [a[1], vec[1]], '--', color=colors[name], alpha=0.5, linewidth=1.5)

ax.set_xlim(-4, 7)
ax.set_ylim(-2, 4)
ax.set_aspect('equal')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_title('Euclidean Distance\n(straight-line, sensitive to length)', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)

dists = []
for name, vec in vectors.items():
    if name != 'a':
        dist = np.linalg.norm(a - vec)
        dists.append(f"d(a,{name})={dist:.2f}")
ax.text(0.02, 0.98, '\n'.join(dists), transform=ax.transAxes, fontsize=9,
        verticalalignment='top', bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

# === Right: Dot Product ===
ax = axes[2]
for name, vec in vectors.items():
    ax.annotate('', xy=vec, xytext=(0, 0),
               arrowprops=dict(arrowstyle='->', color=colors[name], lw=2.5))
    ax.text(vec[0]*1.1, vec[1]*1.1, name, fontsize=11, fontweight='bold', color=colors[name])

ax.set_xlim(-7, 7)
ax.set_ylim(-3, 7)
ax.set_aspect('equal')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_title('Dot Product\n(direction + magnitude)', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)

dots = []
for name, vec in vectors.items():
    if name != 'a':
        dot = np.dot(a, vec)
        dots.append(f"a . {name}={dot:.2f}")
ax.text(0.02, 0.98, '\n'.join(dots), transform=ax.transAxes, fontsize=9,
        verticalalignment='top', bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

plt.tight_layout()
plt.show()

print("\nKey observations:")
print("  - Cosine: a and c(2*a) have sim=1.00 (same direction, ignores length)")
print("  - Euclidean: a and c(2*a) are far apart (different magnitudes)")
print("  - Dot product: a . c(2*a) is large (rewards both alignment AND magnitude)")

### When to Use Which?

| Metric | Best For | Range | Properties |
|---|---|---|---|
| **Cosine similarity** | Comparing embeddings regardless of magnitude | [-1, 1] | Invariant to vector length; 1 = identical direction |
| **Euclidean distance** | Clustering, when magnitude matters | [0, inf) | Sensitive to scale; 0 = identical |
| **Dot product** | Attention mechanisms, when magnitude encodes importance | (-inf, inf) | Fast to compute; used in Transformers |

**Rule of thumb:** Use **cosine similarity** for semantic similarity tasks (search, recommendation). Use **dot product** when magnitude matters (attention scores, learned relevance). Use **Euclidean distance** for clustering and when you want a proper distance metric.
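
The choice matters less once vectors are scaled to unit length. In that case the dot product equals the cosine similarity, and the squared Euclidean distance is $\|\hat{a} - \hat{b}\|^2 = 2 - 2\cos(\vec{a}, \vec{b})$, so all three measures rank neighbors identically. A minimal numerical check, using two example vectors and a hypothetical helper `normalize`:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

a = np.array([3.0, 1.0])
b = np.array([1.0, 3.0])

cos_ab = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a_hat, b_hat = normalize(a), normalize(b)

dot_unit = np.dot(a_hat, b_hat)                     # should match cos_ab
sq_dist_unit = np.linalg.norm(a_hat - b_hat) ** 2   # should match 2 - 2 * cos_ab

print(f"cosine(a, b)               = {cos_ab:.4f}")
print(f"dot of unit vectors        = {dot_unit:.4f}")
print(f"squared distance (unit)    = {sq_dist_unit:.4f}")
print(f"2 - 2 * cosine             = {2 - 2 * cos_ab:.4f}")
```

This is why many vector stores normalize embeddings once at insert time and then use a plain dot product at query time.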

### Interactive Exploration: Similarity in Embedding Space

In [None]:
# Interactive: nearest neighbors search with different metrics
def nearest_neighbors(query_word, embeddings_matrix, word2idx, idx2word, k=5):
    """
    Find k nearest neighbors using all three metrics.
    """
    if query_word not in word2idx:
        print(f"'{query_word}' not in vocabulary")
        return
    
    query_vec = embeddings_matrix[word2idx[query_word]]
    
    results = {'cosine': [], 'euclidean': [], 'dot_product': []}
    
    for i in range(len(idx2word)):
        word = idx2word[i]
        if word == query_word:
            continue
        vec = embeddings_matrix[i]
        
        cos_sim = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        euc_dist = np.linalg.norm(query_vec - vec)
        dot_prod = np.dot(query_vec, vec)
        
        results['cosine'].append((word, cos_sim))
        results['euclidean'].append((word, euc_dist))
        results['dot_product'].append((word, dot_prod))
    
    results['cosine'].sort(key=lambda x: x[1], reverse=True)
    results['euclidean'].sort(key=lambda x: x[1])  # Lower = closer
    results['dot_product'].sort(key=lambda x: x[1], reverse=True)
    
    # Honor the k argument (and avoid shadowing it inside the comprehension)
    return {metric: hits[:k] for metric, hits in results.items()}

# Try different query words
for query in ['king', 'woman', 'village']:
    print(f"\n{'='*60}")
    print(f"Nearest neighbors of '{query}'")
    print(f"{'='*60}")
    
    results = nearest_neighbors(query, embeddings_matrix, word2idx, idx2word)
    if results is None:  # query word not in vocabulary
        continue
    
    print(f"\n  {'Cosine Similarity':<25s} {'Euclidean Distance':<25s} {'Dot Product':<25s}")
    print(f"  {'-'*23:<25s} {'-'*23:<25s} {'-'*23:<25s}")
    
    for i in range(len(results['cosine'])):
        cos_word, cos_val = results['cosine'][i]
        euc_word, euc_val = results['euclidean'][i]
        dot_word, dot_val = results['dot_product'][i]
        print(f"  {cos_word:12s} {cos_val:+.4f}    {euc_word:12s} {euc_val:.4f}    {dot_word:12s} {dot_val:+.4f}")

---

## 6. Vector Databases and Retrieval

### Intuitive Explanation

Once you have embeddings, a natural question is: given a query, how do I quickly find the most similar items? With a small collection, you can compare the query to every item (brute force). But with millions or billions of items, this becomes prohibitively slow.

**Vector databases** solve this problem. They store embeddings and provide fast **approximate nearest neighbor (ANN)** search. The key insight is that you rarely need the exact nearest neighbor: an index that returns almost all of the true top results while running orders of magnitude faster than brute force is far more practical.

### Why Not Just Use a Regular Database?

| Traditional Database | Vector Database |
|---|---|
| Exact match: "WHERE name = 'cat'" | Similarity search: "find things like 'cat'" |
| Structured queries (SQL) | Semantic queries (natural language) |
| Indexes on exact values | Indexes on vector similarity |
| Returns exact matches | Returns ranked by similarity |
| Fast for equality/range queries | Fast for nearest-neighbor queries |

### Approximate Nearest Neighbors (ANN)

The main ANN algorithms trade a small amount of accuracy for huge speed gains:

| Algorithm | How It Works | Used By |
|---|---|---|
| **IVF (Inverted File Index)** | Cluster vectors, search only nearby clusters | FAISS |
| **HNSW (Hierarchical Navigable Small World)** | Build a graph of neighbors at multiple scales | Most modern systems |
| **LSH (Locality-Sensitive Hashing)** | Hash similar vectors to same bucket | Early systems |
| **Product Quantization** | Compress vectors by splitting into subspaces | FAISS (with IVF) |
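
To give a feel for how hashing can stand in for distance computations, here is a minimal sketch of one LSH variant, random-hyperplane hashing, which suits cosine similarity. Each vector gets one bit per random hyperplane, and vectors pointing in similar directions tend to share bits. The names `lsh_signature` and `hamming`, the dimensions, and the random data are assumptions made for illustration; production systems use tuned, multi-table variants.

```python
import numpy as np

def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return (hyperplanes @ vec > 0).astype(int)

def hamming(sig_a, sig_b):
    """Number of differing bits between two signatures."""
    return int(np.sum(sig_a != sig_b))

rng = np.random.default_rng(0)
dim, n_bits = 8, 16
hyperplanes = rng.normal(size=(n_bits, dim))    # random hyperplane normals

v = rng.normal(size=dim)
sig_v = lsh_signature(v, hyperplanes)

print("same direction (2*v):", hamming(sig_v, lsh_signature(2 * v, hyperplanes)))
print("opposite (-v):       ", hamming(sig_v, lsh_signature(-v, hyperplanes)))
print("unrelated random vec:", hamming(sig_v, lsh_signature(rng.normal(size=dim), hyperplanes)))
```

Because the signature depends only on direction, scaling a vector leaves it unchanged, while flipping the vector flips every bit. At query time, only items whose signatures match (or nearly match) the query's need an exact similarity computation.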

### Building a Simple Semantic Search Engine from Scratch

Let's build a complete semantic search pipeline. We will:
1. Create a document collection
2. Embed each document using our Word2Vec model
3. Accept a query, embed it, and find the most similar documents

In [None]:
class SimpleVectorStore:
    """
    A minimal vector database for semantic search.
    Uses brute-force cosine similarity (no ANN indexing).
    """
    def __init__(self, embedding_dim):
        self.embedding_dim = embedding_dim
        self.vectors = []       # List of embedding vectors
        self.documents = []     # List of original documents
        self.metadata = []      # Optional metadata
    
    def add(self, document, vector, metadata=None):
        """Add a document and its embedding to the store."""
        norm = np.linalg.norm(vector)
        # Normalize for cosine sim; keep all-zero vectors (no known words) as-is to avoid NaN
        self.vectors.append(vector / norm if norm > 0 else vector)
        self.documents.append(document)
        self.metadata.append(metadata or {})
    
    def search(self, query_vector, top_k=5):
        """
        Find the top_k most similar documents to the query.
        
        Args:
            query_vector: Query embedding
            top_k: Number of results to return
        
        Returns:
            List of (document, similarity, metadata) tuples
        """
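        norm = np.linalg.norm(query_vector)
        # Guard against an all-zero query (e.g., no words in vocabulary)
        query_norm = query_vector / norm if norm > 0 else query_vector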
        
        # Compute cosine similarity with all documents
        similarities = [np.dot(query_norm, vec) for vec in self.vectors]
        
        # Get top-k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            results.append((
                self.documents[idx],
                similarities[idx],
                self.metadata[idx]
            ))
        
        return results
    
    def __len__(self):
        return len(self.documents)

# Create document collection
documents = [
    "the king rules the kingdom with wisdom and power",
    "the queen leads the kingdom with grace and intelligence",
    "the prince trains to become a future king",
    "the princess studies diplomacy and leadership",
    "the man works hard in the village every day",
    "the woman tends the garden with great care",
    "the boy plays with friends in the village square",
    "the girl reads books under the old oak tree",
    "the knight protects the kingdom from invaders",
    "the scholar teaches young students in the academy",
    "a brave warrior defends the castle walls",
    "the royal throne sits in the great hall",
    "golden crown jewels are kept in the vault",
    "village life is simple and peaceful",
    "children play games in the meadow",
]

# Build the vector store
store = SimpleVectorStore(EMBEDDING_DIM)
for doc in documents:
    vec = get_sentence_embedding(doc, embeddings_matrix, word2idx, method='mean')
    store.add(doc, vec, metadata={'length': len(doc.split())})

print(f"Vector store contains {len(store)} documents")
print(f"Embedding dimension: {store.embedding_dim}")

# Search!
queries = [
    "king and queen rule together",
    "boy plays in village",
    "royal crown and throne",
]

for query in queries:
    print(f"\n{'='*60}")
    print(f"Query: '{query}'")
    print(f"{'='*60}")
    
    query_vec = get_sentence_embedding(query, embeddings_matrix, word2idx, method='mean')
    results = store.search(query_vec, top_k=3)
    
    for rank, (doc, sim, meta) in enumerate(results, 1):
        print(f"  #{rank} (sim={sim:.4f}): {doc}")
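
The `search` method above loops over documents in Python, which is fine for 15 documents but slow for larger collections. One common brute-force speedup, sketched below under the assumption that all embeddings fit in a single NumPy matrix, is to compute every similarity with one matrix-vector product and select the top results with `np.argpartition`. The function name `top_k_cosine` and the random data are hypothetical.

```python
import numpy as np

def top_k_cosine(query, matrix, k=3):
    """Return (indices, similarities) of the k rows most similar to the query."""
    # Normalize rows and query; the small epsilon avoids division by zero
    rows = matrix / (np.linalg.norm(matrix, axis=1, keepdims=True) + 1e-12)
    q = query / (np.linalg.norm(query) + 1e-12)
    all_sims = rows @ q                              # all cosine similarities at once
    k = min(k, len(all_sims))
    top = np.argpartition(-all_sims, k - 1)[:k]      # unordered top-k in linear time
    top = top[np.argsort(-all_sims[top])]            # sort only those k
    return top, all_sims[top]

rng = np.random.default_rng(0)
doc_matrix = rng.normal(size=(1000, 16))             # hypothetical document embeddings
demo_query = doc_matrix[42] + 0.01 * rng.normal(size=16)  # near-copy of row 42

top_idx, top_sims = top_k_cosine(demo_query, doc_matrix, k=3)
print("top indices: ", top_idx)
print("similarities:", top_sims.round(4))
```

This is still exact search with cost linear in the number of documents; ANN indexes go further by avoiding most of the comparisons.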

### Visualization: Embedding Space with Query and Retrieved Results

In [None]:
# Visualize retrieval in embedding space
fig, ax = plt.subplots(figsize=(12, 8))

# Embed all documents
doc_embeddings = np.array([get_sentence_embedding(doc, embeddings_matrix, word2idx, 'mean') 
                           for doc in documents])

# Process a query and retrieve results
query = "king and queen rule together"
query_vec = get_sentence_embedding(query, embeddings_matrix, word2idx, method='mean')
results = store.search(query_vec, top_k=3)

# Project documents and query together so they share the same 2D space
all_vecs = np.vstack([doc_embeddings, query_vec.reshape(1, -1)])
all_2d = pca_2d(all_vecs)
query_2d = all_2d[-1]
doc_2d_new = all_2d[:-1]

# Plot all documents
ax.scatter(doc_2d_new[:, 0], doc_2d_new[:, 1], c='lightgray', s=60, alpha=0.6, edgecolors='gray')

for i, doc in enumerate(documents):
    short = ' '.join(doc.split()[:4]) + '...'
    ax.annotate(short, (doc_2d_new[i, 0], doc_2d_new[i, 1]), fontsize=7, alpha=0.5,
               xytext=(5, 5), textcoords='offset points')

# Highlight retrieved documents
retrieved_docs = [r[0] for r in results]
retrieved_sims = [r[1] for r in results]
colors_retrieved = ['green', 'limegreen', 'yellowgreen']

for rank, (doc, sim, _) in enumerate(results):
    doc_idx = documents.index(doc)
    ax.scatter(doc_2d_new[doc_idx, 0], doc_2d_new[doc_idx, 1], 
              c=colors_retrieved[rank], s=200, zorder=5, edgecolors='black', linewidth=2,
              label=f'#{rank+1}: sim={sim:.3f}')
    
    # Draw line from query to result
    ax.plot([query_2d[0], doc_2d_new[doc_idx, 0]], 
            [query_2d[1], doc_2d_new[doc_idx, 1]], 
            '--', color=colors_retrieved[rank], alpha=0.5, linewidth=1.5)

# Plot query
ax.scatter(query_2d[0], query_2d[1], c='red', s=300, zorder=6, 
          edgecolors='black', linewidth=2, marker='*', label=f'Query: "{query}"')

ax.set_title('Semantic Search: Query and Retrieved Documents in Embedding Space', 
             fontsize=13, fontweight='bold')
ax.set_xlabel('PC 1', fontsize=11)
ax.set_ylabel('PC 2', fontsize=11)
ax.legend(fontsize=9, loc='best')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### The Vector Database Ecosystem

| Tool | Type | Key Feature | Best For |
|---|---|---|---|
| **FAISS** (Meta) | Library | Blazing fast, GPU support | Research, large-scale |
| **Pinecone** | Managed service | Fully hosted, easy API | Production, no ops |
| **Chroma** | Open source | Lightweight, embedded | Prototyping, small projects |
| **Weaviate** | Open source | Hybrid search (vector + keyword) | Complex search needs |
| **Qdrant** | Open source | Filtering + vector search | Production, self-hosted |
| **Milvus** | Open source | Distributed, scalable | Very large scale |

### RAG: Retrieval-Augmented Generation

RAG is one of the most important patterns in modern AI. The idea is simple:

1. **Retrieve:** Use embedding similarity to find relevant documents from a knowledge base
2. **Augment:** Add those documents to the prompt as context
3. **Generate:** Have an LLM generate an answer using the retrieved context

**Why RAG matters:**
- LLMs have a knowledge cutoff -- RAG gives them access to current information
- LLMs can hallucinate -- RAG grounds answers in retrieved documents, which reduces (but does not eliminate) hallucination
- You can update the knowledge base without retraining the model
- The user can verify answers by checking the source documents

In [None]:
# Simulate a RAG pipeline
def rag_pipeline(question, vector_store, embeddings_matrix, word2idx, top_k=3):
    """
    Simulate a RAG pipeline:
    1. Embed the question
    2. Retrieve relevant documents
    3. Show what would be sent to an LLM
    
    Args:
        question: User's question
        vector_store: Our SimpleVectorStore
        embeddings_matrix: Word embeddings
        word2idx: Word to index mapping
        top_k: Number of documents to retrieve
    
    Returns:
        The constructed prompt (in a real system, this goes to an LLM)
    """
    # Step 1: Embed the question
    query_vec = get_sentence_embedding(question, embeddings_matrix, word2idx, method='mean')
    
    # Step 2: Retrieve relevant documents
    results = vector_store.search(query_vec, top_k=top_k)
    
    # Step 3: Construct prompt
    context = "\n".join([f"- {doc}" for doc, sim, _ in results])
    
    prompt = f"""Answer the question based on the following context.

Context:
{context}

Question: {question}
Answer:"""
    
    return prompt, results

# Demo the RAG pipeline
question = "who rules the kingdom"
prompt, results = rag_pipeline(question, store, embeddings_matrix, word2idx)

print("=== RAG Pipeline Demo ===\n")
print("Step 1: User asks a question")
print(f"  Question: '{question}'\n")
print("Step 2: Retrieve relevant documents")
for rank, (doc, sim, _) in enumerate(results, 1):
    print(f"  #{rank} (sim={sim:.4f}): {doc}")
print(f"\nStep 3: Construct prompt for LLM")
print("-" * 50)
print(prompt)
print("-" * 50)
print("\nStep 4: Send to LLM (not implemented here -- would call GPT/Claude API)")
print("The LLM would answer based on the retrieved context!")

---

## 7. Practical Applications

### Intuitive Explanation

Embeddings are not just an academic curiosity -- they power some of the most impactful applications in technology. The fundamental pattern is always the same: represent items as vectors, then use similarity to find related items.

### Why This Matters in Machine Learning

| Application | How Embeddings Are Used | Example |
|---|---|---|
| **Semantic search** | Query and documents are embedded; find documents closest to query | Google Search, Bing |
| **Recommendation systems** | Users and items are embedded; recommend items close to user | Netflix, Spotify, Amazon |
| **Clustering / topic modeling** | Embed documents, cluster similar ones | News categorization |
| **Anomaly detection** | Items far from all clusters are anomalies | Fraud detection |
| **Duplicate detection** | Similar embeddings = potential duplicates | Customer deduplication |
| **Classification** | Use embeddings as features for classifiers | Sentiment analysis |
| **Cross-lingual tasks** | Align embeddings across languages | Translation, multilingual search |
| **RAG** | Retrieve relevant documents to augment LLM prompts | ChatGPT with plugins |
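
As a small illustration of the duplicate-detection row, the sketch below flags pairs of records whose embeddings have cosine similarity above a threshold. The record names, the hand-made 3-D vectors, the threshold of 0.99, and the function name `find_duplicates` are all hypothetical; a real system would use learned embeddings and a threshold tuned on labeled examples.

```python
import numpy as np

# Hypothetical record embeddings (hand-made for illustration)
records = {
    "Acme Corp.":        np.array([0.90, 0.10, 0.20]),
    "ACME Corporation":  np.array([0.88, 0.12, 0.21]),
    "Globex Inc.":       np.array([0.10, 0.90, 0.30]),
    "Initech":           np.array([0.20, 0.30, 0.90]),
}

def find_duplicates(embeddings, threshold=0.99):
    """Return pairs whose cosine similarity is at or above the threshold."""
    names = list(embeddings)
    M = np.stack([embeddings[n] for n in names])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit-normalize rows
    pairwise = M @ M.T                                  # pairwise cosine similarities
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if pairwise[i, j] >= threshold:
                pairs.append((names[i], names[j], float(pairwise[i, j])))
    return pairs

for name_a, name_b, score in find_duplicates(records):
    print(f"possible duplicate: {name_a!r} ~ {name_b!r} (cosine={score:.4f})")
```

The same pairwise-similarity matrix also supports anomaly detection: a record whose highest similarity to any other record is low is a candidate outlier.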

In [None]:
# Demo: Simple recommendation system using embeddings
# Simulate a "user liked these items, recommend similar ones"

# Items with hand-crafted embeddings simulating a movie recommendation scenario
# Dimensions roughly represent: [action, romance, comedy, scifi, drama]
movie_embeddings = {
    'The Matrix':       np.array([0.8, 0.1, 0.0, 0.9, 0.3]),
    'Inception':        np.array([0.7, 0.2, 0.1, 0.8, 0.5]),
    'Interstellar':     np.array([0.3, 0.3, 0.0, 0.9, 0.7]),
    'Titanic':          np.array([0.2, 0.9, 0.0, 0.0, 0.8]),
    'The Notebook':     np.array([0.0, 0.9, 0.1, 0.0, 0.7]),
    'Pride & Prejudice':np.array([0.0, 0.8, 0.2, 0.0, 0.6]),
    'Superbad':         np.array([0.1, 0.2, 0.9, 0.0, 0.3]),
    'The Hangover':     np.array([0.2, 0.1, 0.9, 0.0, 0.2]),
    'Step Brothers':    np.array([0.1, 0.0, 0.9, 0.0, 0.1]),
    'John Wick':        np.array([0.9, 0.0, 0.1, 0.2, 0.3]),
    'Mad Max':          np.array([0.9, 0.1, 0.1, 0.5, 0.3]),
    'Blade Runner':     np.array([0.5, 0.2, 0.0, 0.9, 0.6]),
}

def recommend(liked_movies, movie_embeddings, top_k=3):
    """
    Recommend movies based on user's liked movies.
    
    Args:
        liked_movies: List of movie titles the user liked
        movie_embeddings: Dict of movie -> embedding
        top_k: Number of recommendations
    
    Returns:
        List of (movie, similarity) tuples
    """
    # Create user profile: average of liked movie embeddings
    user_vec = np.mean([movie_embeddings[m] for m in liked_movies], axis=0)
    
    # Find most similar movies (excluding already liked)
    scores = []
    for movie, emb in movie_embeddings.items():
        if movie not in liked_movies:
            sim = np.dot(user_vec, emb) / (np.linalg.norm(user_vec) * np.linalg.norm(emb))
            scores.append((movie, sim))
    
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

# Test recommendations
print("=== Movie Recommendation Demo ===\n")

user_profiles = {
    'Sci-fi fan': ['The Matrix', 'Inception'],
    'Romance fan': ['Titanic', 'The Notebook'],
    'Comedy fan': ['Superbad', 'The Hangover'],
}

for profile_name, liked in user_profiles.items():
    recs = recommend(liked, movie_embeddings, top_k=3)
    print(f"{profile_name} (liked: {', '.join(liked)})")
    for movie, sim in recs:
        print(f"  -> {movie:20s} (similarity: {sim:.4f})")
    print()

### Visualization: Clustering in Embedding Space

In [None]:
# Visualize movie embeddings in 2D with genre clusters
movie_names = list(movie_embeddings.keys())
movie_vecs = np.array(list(movie_embeddings.values()))

# PCA to 2D
movie_2d = pca_2d(movie_vecs)

# Assign genres for coloring
genres = {
    'Sci-fi/Action': ['The Matrix', 'Inception', 'Interstellar', 'Blade Runner', 'Mad Max', 'John Wick'],
    'Romance/Drama': ['Titanic', 'The Notebook', 'Pride & Prejudice'],
    'Comedy': ['Superbad', 'The Hangover', 'Step Brothers'],
}

genre_colors = {'Sci-fi/Action': 'steelblue', 'Romance/Drama': 'lightcoral', 'Comedy': 'green'}
movie_genre = {}
for genre, movies in genres.items():
    for movie in movies:
        movie_genre[movie] = genre

fig, ax = plt.subplots(figsize=(10, 8))

for i, movie in enumerate(movie_names):
    genre = movie_genre[movie]
    ax.scatter(movie_2d[i, 0], movie_2d[i, 1], c=genre_colors[genre], s=150, 
              edgecolors='black', linewidth=1.5, zorder=5,
              label=genre if movie == genres[genre][0] else "")
    ax.annotate(movie, (movie_2d[i, 0], movie_2d[i, 1]), fontsize=9, fontweight='bold',
               xytext=(8, 8), textcoords='offset points')

ax.set_title('Movie Embeddings: Genre Clusters Emerge Naturally', fontsize=14, fontweight='bold')
ax.set_xlabel('PC 1', fontsize=11)
ax.set_ylabel('PC 2', fontsize=11)
ax.legend(fontsize=10, loc='best')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Movies cluster by genre -- the genre labels only color the points, they are not used to position them!")
print("This is the power of learning representations from data.")

---

## Exercises

### Exercise 1: Implement Cosine Similarity from Scratch

Implement cosine similarity using only NumPy, without using any library similarity function.

In [None]:
# EXERCISE 1: Implement cosine similarity
def cosine_similarity_manual(a, b):
    """
    Compute cosine similarity between two vectors using only NumPy.
    
    Args:
        a: First vector (1D numpy array)
        b: Second vector (1D numpy array)
    
    Returns:
        Cosine similarity (scalar between -1 and 1)
    """
    # TODO: Implement this!
    # Step 1: Compute dot product of a and b
    # Step 2: Compute L2 norm of a
    # Step 3: Compute L2 norm of b
    # Step 4: Return dot_product / (norm_a * norm_b)
    # Hint: Use np.dot(), np.sqrt(), np.sum()
    
    pass  # Replace with your implementation

# Test
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.0])  # Identical
c = np.array([-1.0, -2.0, -3.0])  # Opposite
d = np.array([0.0, 0.0, 1.0])  # Partially aligned

print("Test results:")
result1 = cosine_similarity_manual(a, b)
print(f"  Identical vectors: {result1}")
print(f"  Expected: 1.0000")
print(f"  Correct: {np.isclose(result1, 1.0) if result1 is not None else 'Not implemented'}")

result2 = cosine_similarity_manual(a, c)
print(f"\n  Opposite vectors: {result2}")
print(f"  Expected: -1.0000")
print(f"  Correct: {np.isclose(result2, -1.0) if result2 is not None else 'Not implemented'}")

result3 = cosine_similarity_manual(a, d)
expected3 = 3.0 / (np.sqrt(14) * 1.0)
print(f"\n  Partial alignment: {result3}")
print(f"  Expected: {expected3:.4f}")
print(f"  Correct: {np.isclose(result3, expected3) if result3 is not None else 'Not implemented'}")

### Exercise 2: Implement CBOW Word2Vec

Modify the skip-gram model to implement CBOW (Continuous Bag of Words). Instead of predicting context words from the center word, predict the center word from the average of the context word embeddings.
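CBOW needs training examples shaped differently from skip-gram pairs: each example is a full context window plus one center word. The helper below is a minimal sketch of one way to build them. `make_cbow_examples` is a hypothetical name, and skipping positions without a full window on both sides is an assumption made to keep every context the same length.

```python
def make_cbow_examples(token_ids, window=2):
    """
    Build CBOW training examples from a list of token indices.

    Only positions with a full window on both sides are used, so every
    context has exactly 2 * window entries (an assumption of this sketch).
    """
    contexts, centers = [], []
    for i in range(window, len(token_ids) - window):
        # Words to the left and right of position i, excluding i itself
        context = token_ids[i - window:i] + token_ids[i + 1:i + window + 1]
        contexts.append(context)
        centers.append(token_ids[i])
    return contexts, centers

# Toy corpus of six token ids
contexts, centers = make_cbow_examples([0, 1, 2, 3, 4, 5], window=2)
for ctx, ctr in zip(contexts, centers):
    print(f"context={ctx} -> center={ctr}")
```

Wrapping `contexts` and `centers` in `torch.tensor(...)` should give the `(batch_size, 2*window)` and `(batch_size,)` shapes that `forward` expects.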

In [None]:
# EXERCISE 2: Implement CBOW
class CBOWNegSampling(nn.Module):
    """
    CBOW Word2Vec with negative sampling.
    
    Given context words, predict the center word.
    """
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output_embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        nn.init.uniform_(self.embeddings.weight, -0.5/embedding_dim, 0.5/embedding_dim)
        nn.init.uniform_(self.output_embeddings.weight, -0.5/embedding_dim, 0.5/embedding_dim)
    
    def forward(self, context_words, center_word, negative_words):
        """
        Args:
            context_words: (batch_size, 2*window) context word indices
            center_word: (batch_size,) center word index
            negative_words: (batch_size, num_neg) negative sample indices
        
        Returns:
            loss: negative sampling loss
        """
        # TODO: Implement CBOW forward pass!
        # Step 1: Look up context word embeddings and average them
        #   context_emb = self.embeddings(context_words)  # (batch, 2*window, emb_dim)
        #   context_avg = context_emb.mean(dim=1)          # (batch, emb_dim)
        #
        # Step 2: Look up center word output embedding
        #   center_emb = self.output_embeddings(center_word)  # (batch, emb_dim)
        #
        # Step 3: Compute positive score (dot product of context_avg and center_emb)
        #   pos_score = (context_avg * center_emb).sum(dim=1)
        #   pos_loss = F.logsigmoid(pos_score)
        #
        # Step 4: Compute negative scores
        #   neg_emb = self.output_embeddings(negative_words)
        #   neg_score = torch.bmm(neg_emb, context_avg.unsqueeze(2)).squeeze(2)
        #   neg_loss = F.logsigmoid(-neg_score).sum(dim=1)
        #
        # Step 5: Return -(pos_loss + neg_loss).mean()
        
        pass  # Replace with your implementation

# Test structure (you would need to generate CBOW training data to fully test)
# cbow_model = CBOWNegSampling(vocab_size, EMBEDDING_DIM)
# print(f"CBOW parameters: {sum(p.numel() for p in cbow_model.parameters()):,}")

### Exercise 3: Build an Anomaly Detector Using Embeddings

Given a collection of mostly "normal" items, detect anomalies by finding items whose average cosine similarity to all other items falls below a threshold.

In [None]:
# EXERCISE 3: Anomaly detection with embeddings
def detect_anomalies(items, item_embeddings, threshold=0.5):
    """
    Detect anomalies by finding items whose average similarity 
    to all other items is below the threshold.
    
    Args:
        items: List of item names
        item_embeddings: Dict of item_name -> embedding vector
        threshold: Similarity threshold (items below this are anomalies)
    
    Returns:
        List of (item_name, avg_similarity) for items below threshold,
        sorted by similarity (most anomalous first)
    """
    # TODO: Implement this!
    # Step 1: For each item, compute its average cosine similarity to all other items
    # Step 2: Flag items with average similarity below threshold
    # Step 3: Return sorted list of anomalies
    # Hint: Use the cosine_similarity function defined earlier
    
    pass  # Replace with your implementation

# Test data: mostly animals, with some outliers
test_embeddings = {
    'cat':        np.array([0.8, 0.2, 0.1, -0.3, 0.5]),
    'dog':        np.array([0.7, 0.3, 0.2, -0.2, 0.4]),
    'hamster':    np.array([0.6, 0.1, 0.0, -0.4, 0.6]),
    'parrot':     np.array([0.5, 0.4, 0.3, -0.1, 0.3]),
    'goldfish':   np.array([0.4, 0.2, 0.1, -0.3, 0.5]),
    'airplane':   np.array([-0.7, -0.5, 0.8, 0.6, -0.2]),   # ANOMALY!
    'submarine':  np.array([-0.8, -0.6, 0.7, 0.5, -0.3]),   # ANOMALY!
    'rabbit':     np.array([0.7, 0.1, 0.0, -0.4, 0.5]),
}

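# Note: the two outliers pull every animal's average similarity down
# (cat and hamster land just below 0.5), so use threshold=0.0 on this data.
# anomalies = detect_anomalies(list(test_embeddings.keys()), test_embeddings, threshold=0.0)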
# if anomalies:
#     print("Detected anomalies:")
#     for item, sim in anomalies:
#         print(f"  {item:12s}: avg similarity = {sim:.4f}")
# else:
#     print("No anomalies detected (check your implementation)")
print("Uncomment the test code above after implementing detect_anomalies()")

---

## Summary

### Key Concepts

**Why Embeddings:**
- One-hot encoding is sparse, high-dimensional, and encodes no similarity
- Dense embeddings represent items as short vectors where similar items are close
- The distributional hypothesis: meaning comes from usage patterns
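
As a quick reminder of the first two points, here is a minimal sketch. The four-word vocabulary and the 2D dense vectors are hand-picked assumptions for illustration, not trained values.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vocabulary, for illustration only
vocab = ['cat', 'dog', 'car', 'truck']
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

# Any two distinct one-hot vectors are orthogonal, so similarity is always 0
onehot_cat_dog = cosine(one_hot[0], one_hot[1])
onehot_cat_car = cosine(one_hot[0], one_hot[2])

# Hand-picked 2D dense vectors (assumed values, not learned)
dense = {
    'cat': np.array([0.9, 0.1]),
    'dog': np.array([0.8, 0.2]),
    'car': np.array([0.1, 0.9]),
}
dense_cat_dog = cosine(dense['cat'], dense['dog'])
dense_cat_car = cosine(dense['cat'], dense['car'])

print(f"one-hot: cat/dog = {onehot_cat_dog:.2f}, cat/car = {onehot_cat_car:.2f}")
print(f"dense:   cat/dog = {dense_cat_dog:.2f}, cat/car = {dense_cat_car:.2f}")
```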

**Word2Vec:**
- Skip-gram: predict context words from center word
- CBOW: predict center word from context words
- Negative sampling makes training efficient
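- Skip-gram with negative sampling implicitly factorizes a shifted PMI (pointwise mutual information) matrix of word co-occurrences
- Captures analogies via vector arithmetic: king - man + woman ≈ queen (the nearest neighbor of the result, not an exact equality)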
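
The analogy recipe can be sketched in a few lines. The vectors below are hand-built toy values (an assumption for illustration, with dimensions loosely standing for royalty, gender, and food); real Word2Vec vectors only satisfy the relation approximately. `analogy` is a hypothetical helper name.

```python
import numpy as np

# Hand-built toy vectors (assumed, not trained): dims = [royalty, gender, food]
toy = {
    'king':  np.array([0.9,  0.8, 0.0]),
    'queen': np.array([0.9, -0.8, 0.0]),
    'man':   np.array([0.1,  0.8, 0.0]),
    'woman': np.array([0.1, -0.8, 0.0]),
    'apple': np.array([0.0,  0.0, 1.0]),
}

def analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are excluded, as is standard
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, float(best_sim)

word, sim = analogy('king', 'man', 'woman', toy)
print(f"king - man + woman is closest to: {word}")
```

Excluding the three input words matters: with real embeddings, the nearest vector to the result is often one of the inputs.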

**GloVe:**
- Explicitly builds a global co-occurrence matrix and fits word vectors to its log counts (a weighted least-squares factorization)
- Global context vs Word2Vec's local context windows
- Produces similarly good embeddings in practice
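
The counting step that GloVe starts from can be sketched as follows. This is a minimal illustration under stated assumptions: a symmetric window, pairs weighted by 1/distance, and a hypothetical helper name `cooccurrence_matrix`. It does not include the weighted least-squares fitting that follows.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=2):
    """Count weighted co-occurrences within a symmetric window."""
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        # Look only to the right and update both directions to keep X symmetric
        for offset in range(1, window + 1):
            ctx = pos + offset
            if ctx >= len(tokens):
                break
            i, j = idx[word], idx[tokens[ctx]]
            X[i, j] += 1.0 / offset  # closer pairs count more
            X[j, i] += 1.0 / offset
    return vocab, X

tokens = "the cat sat on the mat".split()
vocab, X = cooccurrence_matrix(tokens, window=2)
print(vocab)
print(X)
```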

**Modern Embeddings:**
- Static (Word2Vec, GloVe): one vector per word, regardless of context
- Contextual (BERT, GPT): different vector for each word occurrence based on context
- Sentence embeddings: mean pooling, [CLS] token, or specialized models
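
Mean pooling is simple, but padding tokens need to be excluded from the average. A minimal NumPy sketch, assuming token embeddings of shape `(seq_len, dim)` and a 0/1 attention mask (`mean_pool` is a hypothetical helper name):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """
    Average token embeddings into one sentence vector, ignoring padding.

    token_embeddings: (seq_len, dim) array
    attention_mask:   (seq_len,) array, 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, None].astype(float)   # (seq_len, 1) for broadcasting
    summed = (token_embeddings * mask).sum(axis=0)  # padded rows contribute zero
    count = max(mask.sum(), 1.0)                    # avoid division by zero
    return summed / count

# Two real tokens followed by one padding position
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = np.array([1, 1, 0])
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)
```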

**Similarity Measures:**
- Cosine similarity: direction only, range [-1, 1]
- Euclidean distance: straight-line distance, sensitive to magnitude
- Dot product: direction and magnitude, used in attention
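
The differences show up clearly when one vector is a scaled copy of another. The sketch below uses arbitrary example vectors and also checks the identity that links Euclidean distance and cosine similarity for unit vectors.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a  # same direction, twice the magnitude

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclid = np.linalg.norm(a - b)
dot = np.dot(a, b)

print(f"cosine:    {cos:.4f}")     # direction only, so scaling has no effect
print(f"euclidean: {euclid:.4f}")  # grows with the difference in magnitude
print(f"dot:       {dot:.4f}")     # grows with both alignment and magnitude

# For unit vectors the measures agree on ranking: ||u - v||^2 = 2 - 2 * cos(u, v)
c = np.array([3.0, 0.0, 1.0])
u = a / np.linalg.norm(a)
v = c / np.linalg.norm(c)
lhs = np.sum((u - v) ** 2)
rhs = 2.0 - 2.0 * np.dot(u, v)
print(f"identity holds: {np.isclose(lhs, rhs)}")
```

This is why many systems normalize embeddings to unit length first: after that, cosine, dot product, and Euclidean distance all produce the same nearest neighbors.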

**Vector Databases and Retrieval:**
- Store embeddings for fast nearest-neighbor search
- Approximate nearest neighbor (ANN) indexes such as IVF and HNSW trade a small loss in recall for large speed gains
- RAG: Retrieve relevant documents to augment LLM responses
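
The exact search that ANN indexes approximate is a single matrix-vector product over normalized vectors. A minimal brute-force sketch, using random vectors as stand-in document embeddings (an assumption for illustration; `top_k_cosine` is a hypothetical helper name):

```python
import numpy as np

def top_k_cosine(query, matrix, k=3):
    """Exact top-k cosine search: returns a list of (row_index, similarity)."""
    matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    sims = matrix_norm @ query_norm  # one dot product per stored vector
    top = np.argsort(-sims)[:k]      # indices of the k largest similarities
    return [(int(i), float(sims[i])) for i in top]

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))                 # stand-in document embeddings
query = docs[42] + 0.01 * rng.normal(size=64)      # a slightly perturbed copy of document 42

results = top_k_cosine(query, docs, k=3)
for doc_id, sim in results:
    print(f"doc {doc_id:4d}: similarity = {sim:.4f}")
```

The cost grows linearly with the number of stored vectors, which is the motivation for ANN indexes at larger scales.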

### Connection to Deep Learning

| Concept | Where It's Used |
|---|---|
| Word embeddings | First layer of nearly every neural NLP model |
| nn.Embedding | PyTorch lookup table, trained end-to-end |
| Cosine similarity | Contrastive learning, similarity search |
| Dot product similarity | Attention scores (query-key dot products, $QK^\top$) |
| Contextual embeddings | BERT, GPT, all modern language models |
| Sentence embeddings | Semantic search, RAG, classification |
| Vector databases | Production search and retrieval systems |
| Negative sampling | Contrastive learning, SimCLR, CLIP |

### Checklist

- [ ] I can explain why one-hot encoding fails and why dense embeddings are needed
- [ ] I understand the distributional hypothesis and how it motivates embeddings
- [ ] I can implement skip-gram Word2Vec with negative sampling
- [ ] I can perform vector arithmetic and solve analogies with embeddings
- [ ] I can compare Word2Vec and GloVe approaches
- [ ] I understand the difference between static and contextual embeddings
- [ ] I can compute and compare cosine similarity, Euclidean distance, and dot product
- [ ] I can build a simple semantic search engine
- [ ] I can explain RAG and why it matters for modern AI
- [ ] I can describe practical applications of embeddings across domains

---

## Next Steps

You now understand **embeddings** -- the fundamental representation that powers modern AI. Every time you use a search engine, get a recommendation, or chat with an LLM, embeddings are working behind the scenes to represent meaning as geometry.

The key ideas to carry forward:

1. **Representation matters more than algorithms.** A good embedding makes downstream tasks dramatically easier. This is why so much research focuses on learning better representations.

2. **Similarity = proximity in embedding space.** This one idea connects search, recommendation, clustering, anomaly detection, and more. If you can embed it, you can compare it.

3. **From static to contextual.** The evolution from Word2Vec to BERT/GPT shows how richer context produces better representations. Modern models compute embeddings as a function of the entire input.

4. **RAG bridges embeddings and generation.** Retrieval-Augmented Generation is one of the most practical patterns in modern AI, combining the precision of search with the fluency of language models.
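
The retrieval half of the RAG pattern in point 4 can be sketched without any external service. Everything here is a toy assumption: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the three documents are made up, and the final prompt would be passed to whichever LLM you use.

```python
import numpy as np

documents = [
    "Word2Vec learns static word embeddings from context windows.",
    "HNSW is an approximate nearest neighbor index.",
    "BERT produces contextual embeddings for each token.",
]

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return [t.strip(".,?!").lower() for t in text.split()]

vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
index = {tok: i for i, tok in enumerate(vocab)}

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: normalized bag-of-words counts."""
    vec = np.zeros(len(vocab))
    for tok in tokenize(text):
        if tok in index:
            vec[index[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

doc_matrix = np.stack([toy_embed(d) for d in documents])

def build_rag_prompt(question, k=1):
    """Retrieve the k most similar documents and place them ahead of the question."""
    sims = doc_matrix @ toy_embed(question)
    top = np.argsort(-sims)[:k]
    context = "\n".join(documents[i] for i in top)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_rag_prompt("What does HNSW index?")
print(prompt)
```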

**Practical next steps:**
- Try using pre-trained embeddings (e.g., `gensim` for Word2Vec/GloVe, `sentence-transformers` for sentence embeddings)
- Build a RAG pipeline with a real embedding API and vector database
- Experiment with embedding dimensions: how does quality change with 50 vs 300 vs 768 dimensions?
- Explore multimodal embeddings (CLIP) that embed both images and text in the same space

**In the next notebook,** we will explore fine-tuning and parameter-efficient methods (LoRA, adapters), where you learn to adapt pre-trained models -- including their embeddings -- to specific tasks with minimal data and computation.