# Notebook 16: Embedding Models & Cosine Similarity

---

## Inference Engineering Course

Welcome to Notebook 16! In this notebook, we explore **embedding models** -- models that convert text into dense vector representations that capture semantic meaning.

### What You Will Learn

| Topic | Description |
|-------|-------------|
| **Text Embeddings** | Generate dense vectors from text using sentence-transformers |
| **Cosine Similarity** | Implement and understand similarity from scratch |
| **Semantic Search** | Build a mini search engine with embeddings |
| **Visualization** | Project embeddings to 2D using t-SNE |
| **Model Comparison** | Compare different embedding models |
| **Retrieval Precision** | Measure retrieval quality at different thresholds |

### Why Embeddings Matter for Inference

Embedding models are the backbone of:
- **RAG (Retrieval-Augmented Generation)**: Finding relevant documents to augment LLM context
- **Semantic search**: Understanding user intent, not just keywords
- **Clustering & classification**: Organizing text at scale
- **Deduplication**: Finding near-duplicate content

---

## Part 1: Setup & Installations

In [None]:
%%capture
!pip install sentence-transformers matplotlib numpy scikit-learn

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import time
import warnings
warnings.filterwarnings('ignore')

from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine
from typing import List, Dict, Tuple

plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

print("All imports successful!")

## Part 2: Loading an Embedding Model

We will use the `all-MiniLM-L6-v2` model from sentence-transformers. This is a compact but effective model:

| Property | Value |
|----------|-------|
| Parameters | ~22M |
| Embedding dimension | 384 |
| Max sequence length | 256 tokens |
| Speed | Very fast (CPU-friendly) |
| Quality | Good for general-purpose tasks |

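To get a feel for what a 384-dimensional embedding costs in storage, here is a small sizing sketch. It assumes vectors are stored as float32 (4 bytes per value) and counts only the raw matrix, not any index overhead. The helper name `index_size_bytes` is hypothetical and used purely for illustration.

```python
def index_size_bytes(num_vectors: int, dim: int = 384, bytes_per_value: int = 4) -> int:
    """Raw storage for a dense embedding matrix (assumes float32, no index overhead)."""
    return num_vectors * dim * bytes_per_value


per_vector = index_size_bytes(1)
twenty_docs_kb = index_size_bytes(20) / 1024
million_docs_gib = index_size_bytes(1_000_000) / 1024**3

print(f"One vector:   {per_vector} bytes")
print(f"20 documents: {twenty_docs_kb:.1f} KB")
print(f"1M documents: {million_docs_gib:.2f} GiB")
```

The cost grows linearly with both corpus size and embedding dimension, which is one reason smaller embedding dimensions are attractive at scale.
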
In [None]:
# Load the embedding model
print("Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded!")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
print(f"Max sequence length: {model.max_seq_length}")

# Test with a simple sentence
test_embedding = model.encode("Hello, world!")
print(f"\nTest embedding shape: {test_embedding.shape}")
print(f"First 10 values: {test_embedding[:10].round(4)}")
print(f"L2 norm: {np.linalg.norm(test_embedding):.4f}")

## Part 3: Cosine Similarity from Scratch

**Cosine similarity** measures the angle between two vectors, ignoring their magnitude. It is defined as:

$$\text{cosine\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$

Properties:
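- Range: [-1, 1]. Sentence-embedding models rarely produce strongly negative scores for natural text, so most values you will see fall roughly between 0 and 1.
- 1.0 = identical direction (for text, near-identical meaning)
- 0.0 = orthogonal (unrelated)
- -1.0 = opposite direction (rare in practice, and it does not necessarily mean "opposite meaning")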
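
The three reference points above can be checked by hand on tiny vectors. The sketch below uses plain Python so every step of the formula is visible; the vectors are made up for illustration and are not model outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 2.0]                         # length 3
same_direction = cosine(a, [2.0, 4.0, 4.0])  # a scaled by 2: magnitude is ignored
orthogonal = cosine(a, [2.0, -1.0, 0.0])     # dot product is 2 - 2 + 0 = 0
opposite = cosine(a, [-1.0, -2.0, -2.0])     # a negated

print(same_direction, orthogonal, opposite)
```

Scaling a vector leaves the score unchanged, which is what "ignoring magnitude" means in practice.
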

In [None]:
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Compute cosine similarity between two vectors from scratch.
    
    Args:
        a: First vector (1D numpy array)
        b: Second vector (1D numpy array)
    
    Returns:
        Cosine similarity score between -1 and 1
    """
    # Step 1: Compute dot product
    dot_product = np.dot(a, b)
    
    # Step 2: Compute magnitudes (L2 norms)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    
    # Step 3: Divide dot product by product of magnitudes
    if norm_a == 0 or norm_b == 0:
        return 0.0
    
    # Convert the NumPy scalar to a built-in float to match the annotation
    return float(dot_product / (norm_a * norm_b))


def cosine_similarity_batch(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between a query and all corpus vectors.
    Efficient vectorized implementation.
    """
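    # Normalize vectors (the small floor guards against division by zero,
    # so an all-zero vector scores 0.0, matching cosine_similarity above)
    query_norm = query / max(np.linalg.norm(query), 1e-12)
    corpus_norms = corpus / np.maximum(np.linalg.norm(corpus, axis=1, keepdims=True), 1e-12)
    
    # Dot product of normalized vectors = cosine similarity
    return corpus_norms @ query_norm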


# Test our implementation
sentences = [
    "The cat sat on the mat",
    "A kitten was resting on the rug",
    "The stock market crashed today",
    "Financial markets experienced a downturn",
]

embeddings = model.encode(sentences)

print("Pairwise Cosine Similarities (from scratch):")
print("=" * 60)
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        print(f"  [{sim:.4f}] '{sentences[i]}'")
        print(f"           vs '{sentences[j]}'")
        print()

In [None]:
# Visualize the similarity matrix
fig, ax = plt.subplots(figsize=(8, 7))

sim_matrix = np.zeros((len(sentences), len(sentences)))
for i in range(len(sentences)):
    for j in range(len(sentences)):
        sim_matrix[i, j] = cosine_similarity(embeddings[i], embeddings[j])

im = ax.imshow(sim_matrix, cmap='RdYlGn', vmin=0, vmax=1, aspect='auto')

# Labels
short_labels = [s[:30] + '...' if len(s) > 30 else s for s in sentences]
ax.set_xticks(range(len(sentences)))
ax.set_yticks(range(len(sentences)))
ax.set_xticklabels(short_labels, rotation=45, ha='right', fontsize=9)
ax.set_yticklabels(short_labels, fontsize=9)

# Annotate values
for i in range(len(sentences)):
    for j in range(len(sentences)):
        color = 'white' if sim_matrix[i, j] > 0.6 or sim_matrix[i, j] < 0.2 else 'black'
        ax.text(j, i, f'{sim_matrix[i, j]:.3f}', ha='center', va='center',
                fontsize=11, fontweight='bold', color=color)

plt.colorbar(im, ax=ax, shrink=0.8, label='Cosine Similarity')
ax.set_title('Pairwise Cosine Similarity Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Part 4: Building a Mini Semantic Search System

Let's build a simple but functional semantic search engine. The pipeline:

1. **Index**: Encode all documents into embeddings
2. **Query**: Encode the search query
3. **Rank**: Compute cosine similarity between query and all documents
4. **Return**: Top-K most similar documents
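
The rank and return steps can be sketched without any model at all. The example below assumes hand-made 3D vectors standing in for real embeddings; `top_k_by_cosine` is a hypothetical helper name, not part of sentence-transformers.

```python
import numpy as np


def top_k_by_cosine(query_vec, doc_matrix, k=2):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    doc_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = doc_unit @ query_unit           # cosine similarity per document
    order = np.argsort(scores)[::-1][:k]     # highest score first
    return order, scores[order]


# Hand-made stand-ins for document embeddings (illustration only)
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.2, 0.0])

top_idx, top_scores = top_k_by_cosine(toy_query, toy_docs, k=2)
print(top_idx, top_scores.round(3))
```

Document 1 ranks above document 0 because it points in nearly the same direction as the query, even though document 0 has the larger first component.
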

In [None]:
class SemanticSearchEngine:
    """
    A mini semantic search engine using embedding models.
    """
    
    def __init__(self, model: SentenceTransformer):
        self.model = model
        self.documents = []
        self.embeddings = None
        self.index_time = 0
    
    def index(self, documents: List[str]):
        """Encode and store document embeddings."""
        self.documents = documents
        start = time.time()
        self.embeddings = self.model.encode(documents, show_progress_bar=False)
        self.index_time = time.time() - start
        print(f"Indexed {len(documents)} documents in {self.index_time:.3f}s")
        print(f"  Embedding shape: {self.embeddings.shape}")
        print(f"  Memory: {self.embeddings.nbytes / 1024:.1f} KB")
    
    def search(self, query: str, top_k: int = 5) -> Tuple[List[Dict], float]:
        """Return the top_k most similar documents and the search time in seconds."""
        start = time.time()
        query_embedding = self.model.encode(query)
        
        # Compute similarities
        similarities = cosine_similarity_batch(query_embedding, self.embeddings)
        
        # Get top-K indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        search_time = time.time() - start
        
        results = []
        for idx in top_indices:
            results.append({
                'rank': len(results) + 1,
                'document': self.documents[idx],
                'score': float(similarities[idx]),
                'index': int(idx),
            })
        
        return results, search_time

print("SemanticSearchEngine class defined!")

In [None]:
# Create a document corpus
documents = [
    # Technology
    "Python is a popular programming language for data science and machine learning.",
    "JavaScript is the most widely used language for web development.",
    "Docker containers help deploy applications consistently across environments.",
    "Kubernetes orchestrates containerized applications at scale.",
    "React is a JavaScript library for building user interfaces.",
    "TensorFlow and PyTorch are the leading deep learning frameworks.",
    "Git is a distributed version control system for tracking code changes.",
    
    # Science
    "Photosynthesis converts sunlight into chemical energy in plants.",
    "DNA carries the genetic instructions for all living organisms.",
    "The theory of relativity describes gravity as curvature of spacetime.",
    "Quantum mechanics governs the behavior of particles at atomic scales.",
    "Climate change is driven by increasing greenhouse gas concentrations.",
    
    # Food & Cooking
    "Sourdough bread requires a fermented starter culture and long proofing time.",
    "Sushi is a Japanese dish featuring vinegared rice with seafood.",
    "Espresso is made by forcing hot water through finely ground coffee beans.",
    "Thai curry combines coconut milk, curry paste, and fresh herbs.",
    
    # Sports
    "The FIFA World Cup is the most watched sporting event globally.",
    "Basketball was invented by James Naismith in 1891.",
    "Marathon runners cover 26.2 miles or 42.195 kilometers.",
    "Tennis Grand Slams include Wimbledon, US Open, French Open, and Australian Open.",
]

# Build the search engine
engine = SemanticSearchEngine(model)
engine.index(documents)

In [None]:
# Run search queries
queries = [
    "How do I build a website?",
    "What is artificial intelligence?",
    "Tell me about making coffee",
    "Which sport has the biggest global audience?",
    "How does nature produce energy from the sun?",
]

for query in queries:
    results, search_time = engine.search(query, top_k=3)
    print(f"\nQuery: '{query}'  ({search_time*1000:.1f}ms)")
    print("-" * 70)
    for r in results:
        print(f"  [{r['score']:.4f}] {r['document']}")

## Part 5: Visualizing Embeddings in 2D

High-dimensional embeddings (384D) are hard to visualize. We use **t-SNE** (t-distributed Stochastic Neighbor Embedding) to project them to 2D while preserving local structure.

t-SNE properties:
- Preserves **local** neighborhood structure
- Points that are close in high-D remain close in 2D
- Does NOT preserve global distances
- Stochastic: different runs can look different unless you fix `random_state`

In [None]:
# Assign categories to documents for coloring
categories = (
    ['Technology'] * 7 + 
    ['Science'] * 5 + 
    ['Food'] * 4 + 
    ['Sports'] * 4
)

category_colors = {
    'Technology': '#2196F3',
    'Science': '#4CAF50',
    'Food': '#FF9800',
    'Sports': '#F44336',
}

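# Run t-SNE (perplexity must be smaller than the number of documents).
# The iteration count is left at its default of 1000 because the keyword was
# renamed from n_iter to max_iter in newer scikit-learn releases.
tsne = TSNE(n_components=2, random_state=42, perplexity=5)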
embeddings_2d = tsne.fit_transform(engine.embeddings)

# Plot
fig, ax = plt.subplots(figsize=(14, 10))

for cat in category_colors:
    mask = [c == cat for c in categories]
    indices = [i for i, m in enumerate(mask) if m]
    ax.scatter(
        embeddings_2d[indices, 0], 
        embeddings_2d[indices, 1],
        c=category_colors[cat], 
        s=200, 
        alpha=0.8,
        label=cat,
        edgecolors='black',
        linewidths=1,
        zorder=5
    )

# Add labels
for i, doc in enumerate(documents):
    short = doc[:40] + '...' if len(doc) > 40 else doc
    ax.annotate(
        short, 
        (embeddings_2d[i, 0], embeddings_2d[i, 1]),
        fontsize=7,
        xytext=(5, 5),
        textcoords='offset points',
        alpha=0.8,
    )

ax.set_title('Document Embeddings Projected to 2D (t-SNE)',
             fontsize=15, fontweight='bold')
ax.legend(fontsize=12, loc='upper left')
ax.set_xlabel('t-SNE Dimension 1', fontsize=12)
ax.set_ylabel('t-SNE Dimension 2', fontsize=12)

plt.tight_layout()
plt.show()
print("Check the plot: documents from the same category should mostly cluster together.")

## Part 6: Comparing Different Embedding Models

Not all embedding models are equal. Let's compare several models on the same task.

In [None]:
# Models to compare (small enough for free Colab)
model_names = [
    'all-MiniLM-L6-v2',       # 22M params, 384 dim
    'all-MiniLM-L12-v2',      # 33M params, 384 dim
    'paraphrase-MiniLM-L3-v2', # 17M params, 384 dim
]

# Test pairs: (sentence_a, sentence_b, expected_similarity)
# 'high' = should be similar, 'low' = should be dissimilar
test_pairs = [
    ("A dog is playing in the park", "A puppy runs around the garden", 'high'),
    ("The weather is sunny today", "It's a beautiful clear day", 'high'),
    ("I love programming in Python", "The snake slithered through the grass", 'low'),
    ("The bank approved my loan", "I sat by the river bank", 'low'),
    ("She drives a red car", "The automobile is crimson colored", 'high'),
    ("Neural networks learn patterns", "Deep learning models find structure in data", 'high'),
    ("The cat sleeps on the sofa", "Quantum physics is complex", 'low'),
    ("I need to buy groceries", "Shopping for food at the supermarket", 'high'),
]

model_results = {}

for model_name in model_names:
    print(f"\nLoading {model_name}...")
    m = SentenceTransformer(model_name)
    
    # Measure encoding speed (a single cold run, so treat the timing as a rough indication)
    start = time.time()
    all_texts = [p[0] for p in test_pairs] + [p[1] for p in test_pairs]
    # Use a separate name so the `embeddings` array from Part 3 is not overwritten
    pair_embeddings = m.encode(all_texts)
    encode_time = time.time() - start
    
    # Calculate similarities
    similarities = []
    for i, (s1, s2, expected) in enumerate(test_pairs):
        emb1 = pair_embeddings[i]
        emb2 = pair_embeddings[len(test_pairs) + i]
        sim = cosine_similarity(emb1, emb2)
        similarities.append((sim, expected))
    
    # Calculate quality metric: average separation between high/low pairs
    high_sims = [s for s, e in similarities if e == 'high']
    low_sims = [s for s, e in similarities if e == 'low']
    separation = np.mean(high_sims) - np.mean(low_sims)
    
    model_results[model_name] = {
        'similarities': similarities,
        'encode_time': encode_time,
        'dim': m.get_sentence_embedding_dimension(),
        'high_avg': np.mean(high_sims),
        'low_avg': np.mean(low_sims),
        'separation': separation,
    }
    
    print(f"  Dim: {m.get_sentence_embedding_dimension()}, Time: {encode_time:.3f}s")
    print(f"  High pairs avg: {np.mean(high_sims):.4f}, Low pairs avg: {np.mean(low_sims):.4f}")
    print(f"  Separation score: {separation:.4f}")

print("\nAll models loaded and compared!")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Similarity distributions by model
ax = axes[0]
x = np.arange(len(test_pairs))
width = 0.25
colors = ['#2196F3', '#4CAF50', '#FF9800']

for i, model_name in enumerate(model_names):
    sims = [s for s, _ in model_results[model_name]['similarities']]
    # Keep the distinguishing part of the name so each model gets a unique legend label
    short_name = model_name.replace('all-', '').replace('paraphrase-', 'para-')
    bars = ax.bar(x + i * width, sims, width, label=short_name, 
                  color=colors[i], alpha=0.8)

# Color background by expected similarity (one unit-wide band per pair, no overlap)
for i, (_, _, expected) in enumerate(test_pairs):
    band_color = 'green' if expected == 'high' else 'red'
    ax.axvspan(i - 0.25, i + 0.75, alpha=0.05, color=band_color)

ax.set_xlabel('Test Pair Index', fontsize=11)
ax.set_ylabel('Cosine Similarity', fontsize=11)
ax.set_title('Similarity Scores by Model', fontsize=13, fontweight='bold')
ax.legend(fontsize=9)
ax.set_xticks(x + width)
ax.set_xticklabels([f'P{i+1}({e[0].upper()})' for i, (_, _, e) in enumerate(test_pairs)], fontsize=8)

# Plot 2: Separation scores
ax = axes[1]
names = [n.replace('all-', '').replace('paraphrase-', 'para-') for n in model_names]
separations = [model_results[m]['separation'] for m in model_names]
bars = ax.bar(names, separations, color=colors, alpha=0.8, edgecolor='black')
for bar, val in zip(bars, separations):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.01,
           f'{val:.3f}', ha='center', fontsize=11, fontweight='bold')
ax.set_ylabel('Separation Score', fontsize=11)
ax.set_title('Similarity Separation\n(Higher = Better)', fontsize=13, fontweight='bold')

# Plot 3: Speed comparison
ax = axes[2]
times = [model_results[m]['encode_time'] * 1000 for m in model_names]  # Convert to ms
bars = ax.bar(names, times, color=colors, alpha=0.8, edgecolor='black')
for bar, val in zip(bars, times):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 1,
           f'{val:.0f}ms', ha='center', fontsize=11, fontweight='bold')
ax.set_ylabel('Encoding Time (ms)', fontsize=11)
ax.set_title('Encoding Speed\n(Lower = Faster)', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

## Part 7: Retrieval Precision at Different Thresholds

In a real search system, you need to decide: **what similarity threshold counts as a "match"?**

- Too high -> miss relevant results (low recall)
- Too low -> include irrelevant results (low precision)

Let's measure precision and recall at different thresholds.
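
Before sweeping many thresholds, it helps to work through a single query by hand. The sketch below uses made-up similarity scores and relevance labels (not model outputs). The helper name `precision_recall_f1` is hypothetical, and returning a precision of 1.0 when nothing is retrieved is a convention rather than a standard.

```python
def precision_recall_f1(scores, relevant, threshold):
    """Precision, recall and F1 for documents whose score is >= threshold."""
    retrieved = {i for i, s in enumerate(scores) if s >= threshold}
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 1.0  # convention for empty results
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


toy_scores = [0.82, 0.55, 0.40, 0.35, 0.10]  # similarity of documents 0..4 to one query
toy_relevant = {0, 1, 3}                      # ground-truth relevant documents

strict = precision_recall_f1(toy_scores, toy_relevant, threshold=0.5)
loose = precision_recall_f1(toy_scores, toy_relevant, threshold=0.3)

print("threshold 0.5:", [round(v, 3) for v in strict])
print("threshold 0.3:", [round(v, 3) for v in loose])
```

At 0.5 only documents 0 and 1 are retrieved, so precision is perfect but document 3 is missed. Lowering the threshold to 0.3 recovers document 3 at the cost of also retrieving the irrelevant document 2.
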

In [None]:
# Create a labeled dataset for precision/recall analysis
# Each query has ground truth relevant document indices
eval_queries = [
    {
        'query': 'machine learning frameworks',
        'relevant': [0, 5],  # Python for ML, TensorFlow/PyTorch
    },
    {
        'query': 'container orchestration deployment',
        'relevant': [2, 3],  # Docker, Kubernetes
    },
    {
        'query': 'physics and the nature of the universe',
        'relevant': [9, 10],  # Relativity, Quantum mechanics
    },
    {
        'query': 'preparing food and beverages',
        'relevant': [12, 13, 14, 15],  # All food items
    },
    {
        'query': 'competitive athletics and tournaments',
        'relevant': [16, 17, 18, 19],  # All sports items
    },
]

# Calculate precision and recall at different similarity thresholds
thresholds = np.linspace(0.10, 0.85, 16)  # 0.10, 0.15, ..., 0.85

# Encode each query once; the similarities do not depend on the threshold
query_sims = [
    cosine_similarity_batch(model.encode(eq['query']), engine.embeddings)
    for eq in eval_queries
]

precisions_at_threshold = []
recalls_at_threshold = []

for threshold in thresholds:
    all_precisions = []
    all_recalls = []
    
    for eq, sims in zip(eval_queries, query_sims):
        
        # Documents above threshold
        retrieved = set(np.where(sims >= threshold)[0])
        relevant = set(eq['relevant'])
        
        if len(retrieved) > 0:
            precision = len(retrieved & relevant) / len(retrieved)
        else:
            precision = 1.0  # Convention: no results = no false positives (inflates precision at high thresholds)
        
        recall = len(retrieved & relevant) / len(relevant) if len(relevant) > 0 else 0
        
        all_precisions.append(precision)
        all_recalls.append(recall)
    
    precisions_at_threshold.append(np.mean(all_precisions))
    recalls_at_threshold.append(np.mean(all_recalls))

# Plot precision-recall curve
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Precision and Recall vs Threshold
ax = axes[0]
ax.plot(thresholds, precisions_at_threshold, 'o-', color='#2196F3', 
        linewidth=2.5, markersize=5, label='Precision')
ax.plot(thresholds, recalls_at_threshold, 's-', color='#F44336',
        linewidth=2.5, markersize=5, label='Recall')

# F1 score, computed from the query-averaged precision and recall at each threshold
f1_scores = [2*p*r/(p+r) if (p+r) > 0 else 0 
             for p, r in zip(precisions_at_threshold, recalls_at_threshold)]
ax.plot(thresholds, f1_scores, 'D-', color='#4CAF50',
        linewidth=2.5, markersize=5, label='F1 Score')

# Best F1 threshold
best_f1_idx = np.argmax(f1_scores)
ax.axvline(x=thresholds[best_f1_idx], color='gray', linestyle='--', alpha=0.5,
           label=f'Best threshold: {thresholds[best_f1_idx]:.2f}')

ax.set_xlabel('Similarity Threshold', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Precision, Recall, F1 vs Threshold', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
ax.set_ylim(-0.05, 1.05)

# Right: Precision-Recall curve
ax = axes[1]
ax.plot(recalls_at_threshold, precisions_at_threshold, 'o-', color='#9C27B0',
        linewidth=2.5, markersize=6)

# Annotate a few points with their threshold
for i in range(0, len(thresholds), 4):
    ax.annotate(f't={thresholds[i]:.2f}', 
               (recalls_at_threshold[i], precisions_at_threshold[i]),
               textcoords='offset points', xytext=(10, 5), fontsize=9)

ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Curve', fontsize=14, fontweight='bold')
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)

plt.tight_layout()
plt.show()

print(f"Best threshold (highest F1): {thresholds[best_f1_idx]:.2f}")
print(f"  Precision: {precisions_at_threshold[best_f1_idx]:.3f}")
print(f"  Recall: {recalls_at_threshold[best_f1_idx]:.3f}")
print(f"  F1: {f1_scores[best_f1_idx]:.3f}")

## Part 8: Understanding Embedding Space Structure

Let's analyze what the embedding space looks like -- how are vectors distributed? What does "distance" mean in this space?
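
One way to approach these questions is to compare similarities within a group against similarities across groups. The sketch below does this in vectorized form on made-up 2D vectors with hypothetical group labels `A` and `B`; it is an illustration, not the notebook's model output.

```python
import numpy as np

toy_vectors = np.array([
    [1.0, 0.1],   # group A
    [0.9, 0.2],   # group A
    [0.1, 1.0],   # group B
    [0.2, 0.9],   # group B
])
toy_labels = np.array(["A", "A", "B", "B"])

# Normalize rows, then one matrix product gives every pairwise cosine similarity
unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
sim = unit @ unit.T

# Upper triangle (k=1) visits each unordered pair once and skips the diagonal
rows, cols = np.triu_indices(len(toy_vectors), k=1)
pair_sims = sim[rows, cols]
same_group = toy_labels[rows] == toy_labels[cols]

within_mean = pair_sims[same_group].mean()
between_mean = pair_sims[~same_group].mean()
print(f"within: {within_mean:.3f}  between: {between_mean:.3f}")
```

A large gap between the two means suggests the groups occupy distinct regions of the space; a small gap suggests they overlap.
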

In [None]:
# Analyze embedding space properties
all_embeddings = engine.embeddings

# 1. Distribution of pairwise similarities
n = len(all_embeddings)
all_sims = []
for i in range(n):
    for j in range(i + 1, n):
        all_sims.append(cosine_similarity(all_embeddings[i], all_embeddings[j]))

# 2. Within-category vs between-category similarities
within_sims = []
between_sims = []
for i in range(n):
    for j in range(i + 1, n):
        sim = cosine_similarity(all_embeddings[i], all_embeddings[j])
        if categories[i] == categories[j]:
            within_sims.append(sim)
        else:
            between_sims.append(sim)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Similarity distribution
ax = axes[0]
ax.hist(all_sims, bins=30, alpha=0.5, color='gray', label='All pairs', edgecolor='black')
ax.hist(within_sims, bins=20, alpha=0.6, color='#4CAF50', label='Within category', edgecolor='black')
ax.hist(between_sims, bins=20, alpha=0.6, color='#F44336', label='Between categories', edgecolor='black')

ax.axvline(np.mean(within_sims), color='#4CAF50', linestyle='--', linewidth=2)
ax.axvline(np.mean(between_sims), color='#F44336', linestyle='--', linewidth=2)

ax.set_xlabel('Cosine Similarity', fontsize=12)
ax.set_ylabel('Count', fontsize=12)
ax.set_title('Distribution of Pairwise Similarities', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)

# Right: Average similarity by category pair
ax = axes[1]
unique_cats = list(category_colors.keys())
cat_sim_matrix = np.zeros((len(unique_cats), len(unique_cats)))

for ci, cat_i in enumerate(unique_cats):
    for cj, cat_j in enumerate(unique_cats):
        sims_ij = []
        for i in range(n):
            for j in range(n):
                if i != j and categories[i] == cat_i and categories[j] == cat_j:
                    sims_ij.append(cosine_similarity(all_embeddings[i], all_embeddings[j]))
        cat_sim_matrix[ci, cj] = np.mean(sims_ij) if sims_ij else 0

im = ax.imshow(cat_sim_matrix, cmap='YlOrRd', vmin=0, vmax=0.7)
ax.set_xticks(range(len(unique_cats)))
ax.set_yticks(range(len(unique_cats)))
ax.set_xticklabels(unique_cats, rotation=45, ha='right')
ax.set_yticklabels(unique_cats)
ax.set_title('Average Similarity Between Categories', fontsize=14, fontweight='bold')

for i in range(len(unique_cats)):
    for j in range(len(unique_cats)):
        ax.text(j, i, f'{cat_sim_matrix[i,j]:.3f}', ha='center', va='center',
                fontsize=11, fontweight='bold')

plt.colorbar(im, ax=ax, shrink=0.8)
plt.tight_layout()
plt.show()

print(f"Within-category avg similarity: {np.mean(within_sims):.4f}")
print(f"Between-category avg similarity: {np.mean(between_sims):.4f}")
print(f"Separation: {np.mean(within_sims) - np.mean(between_sims):.4f}")

## Part 9: Embedding Inference Performance

For production systems, embedding speed matters. Let's benchmark throughput.
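
A reusable timing helper keeps such measurements consistent. The sketch below is one possible shape for it, under stated assumptions: `benchmark` and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in workload rather than a real model. It uses `time.perf_counter` and reports the median so a single slow run has less influence.

```python
import statistics
import time


def benchmark(fn, batch, repeats=5, warmup=1):
    """Time fn(batch) and return median latency (ms) and throughput (items/s)."""
    for _ in range(warmup):
        fn(batch)  # warmup runs are not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(batch)
        timings.append(time.perf_counter() - start)
    median_s = statistics.median(timings)
    items_per_s = len(batch) / median_s if median_s > 0 else float("inf")
    return {"latency_ms": median_s * 1000, "items_per_s": items_per_s}


def fake_encode(batch):
    """Stand-in workload (NOT a real embedding model)."""
    return [sum(ord(ch) for ch in text) for text in batch]


stats = benchmark(fake_encode, ["hello world"] * 32)
print(stats)
```

To time a real model, pass its encoding function in place of `fake_encode`.
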

In [None]:
# Benchmark embedding throughput
batch_sizes = [1, 4, 8, 16, 32, 64]
test_sentences = [f"This is test sentence number {i} for benchmarking embedding model throughput." 
                  for i in range(64)]

throughputs = []
latencies = []

for bs in batch_sizes:
    batch = test_sentences[:bs]
    
    # Warmup
    _ = model.encode(batch)
    
    # Measure (perf_counter is a monotonic, high-resolution clock suited to short timings)
    times = []
    for _ in range(5):
        start = time.perf_counter()
        _ = model.encode(batch)
        times.append(time.perf_counter() - start)
    
    avg_time = np.mean(times)
    throughput = bs / avg_time
    latency = avg_time * 1000  # ms
    
    throughputs.append(throughput)
    latencies.append(latency)
    print(f"Batch size {bs:>3d}: {throughput:>7.1f} sentences/s | {latency:>7.1f} ms total")

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax = axes[0]
ax.plot(batch_sizes, throughputs, 'o-', color='#2196F3', linewidth=2.5, markersize=10)
ax.set_xlabel('Batch Size', fontsize=12)
ax.set_ylabel('Throughput (sentences/s)', fontsize=12)
ax.set_title('Embedding Throughput vs Batch Size', fontsize=14, fontweight='bold')

ax = axes[1]
ax.plot(batch_sizes, latencies, 's-', color='#F44336', linewidth=2.5, markersize=10)
ax.set_xlabel('Batch Size', fontsize=12)
ax.set_ylabel('Total Latency (ms)', fontsize=12)
ax.set_title('Total Latency vs Batch Size', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## Part 10: Key Takeaways

### Summary

1. **Embedding models** convert text into dense vectors where semantic similarity corresponds to vector proximity.

2. **Cosine similarity** is the standard metric for comparing embeddings -- it measures the angle between vectors, ignoring magnitude.

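3. **Model choice matters**: Different embedding models make different quality-speed tradeoffs. Larger models often separate similar and dissimilar pairs better but encode more slowly. Confirm this on your own data, since eight test pairs are too few to rank models reliably.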

4. **Threshold tuning** is critical for production systems -- use precision-recall analysis to find the optimal threshold for your use case.

5. **Batching improves throughput** significantly -- always batch your encoding when possible.

### For Inference Engineering

- Embedding inference is typically **much faster** than generative LLM inference
- Can run efficiently on CPU for smaller models
- Embedding can still become a bottleneck in RAG pipelines (for example, when indexing a large corpus), so optimize model choice and batching
- Consider **quantized** embedding models for production at scale

---

## Exercises

### Exercise 1: Build a FAQ Bot
Create a FAQ system where you index question-answer pairs and find the most relevant answer for a new question.

In [None]:
# Exercise 1: FAQ Bot
faq_pairs = [
    {"q": "How do I reset my password?", "a": "Go to Settings > Security > Reset Password."},
    {"q": "What are your business hours?", "a": "We are open Monday-Friday, 9am-5pm EST."},
    {"q": "How can I track my order?", "a": "Check the Order Status page with your order number."},
    {"q": "Do you offer refunds?", "a": "Yes, within 30 days of purchase."},
    {"q": "How do I contact support?", "a": "Email support@example.com or call 1-800-123-4567."},
]

# TODO: Index the FAQ questions, then search for:
# "I forgot my login credentials"
# "When is your store open?"
# "Can I get my money back?"

print("Exercise 1: Build the FAQ search system!")

### Exercise 2: Embedding Arithmetic
Explore whether embedding arithmetic works: does king - man + woman land closest to queen?

In [None]:
# Exercise 2: Embedding arithmetic
# Encode: king, man, woman, queen
# Compute: king - man + woman
# Check: is the result closest to queen?

# TODO: Test with other analogies too:
# Paris - France + Germany = Berlin?
# doctor - man + woman = nurse? (or doctor?)

print("Exercise 2: Explore embedding arithmetic!")

### Exercise 3: Deduplication
Use embedding similarity to find near-duplicate documents.

In [None]:
# Exercise 3: Deduplication
duplicate_docs = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over a sleepy dog.",  # Near duplicate
    "Python is great for data science.",
    "For data science, Python is excellent.",  # Near duplicate
    "The weather is nice today.",
    "Today the weather is pleasant.",  # Near duplicate
    "I love playing basketball.",
    "Machine learning transforms industries.",
]

# TODO: Find all pairs with similarity > 0.8 (likely duplicates)

print("Exercise 3: Find the near-duplicates!")

---

**End of Notebook 16: Embedding Models & Cosine Similarity**

Next: [Notebook 17 - Image Generation: Guidance Scale & Steps](./17_image_generation_guidance.ipynb)