# 8. Text Similarity with Embeddings

**Estimated Time**: ~2 hours

**Prerequisites**: Notebooks 1-7 (especially the text representations covered in Notebook 7, Translation)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Understand** how models represent text as numerical vectors (embeddings)
2. **Use** the feature-extraction pipeline to generate embeddings
3. **Calculate** similarity between texts using cosine similarity
4. **Compare** different pooling strategies (mean, CLS, max)
5. **Build** a semantic FAQ matcher for real-world applications

## Setup

Run this cell first. If you completed previous notebooks, you already have the core packages ready.

In [None]:
# Core imports
from transformers import pipeline, AutoTokenizer, AutoModel
import torch
import numpy as np

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("Setup complete!")
print(f"PyTorch version: {torch.__version__}")

---

# Part 1: Conceptual Foundation

## What are Text Embeddings?

**In plain English**: Embeddings are a way to represent text as numbers that capture meaning. Similar texts have similar numbers, letting us measure how alike two pieces of text are.

**Technical definition**: Text embeddings are dense vector representations where semantically similar texts are mapped to nearby points in a high-dimensional space, enabling mathematical operations like similarity computation.

### Why Do We Need Embeddings?

```
THE PROBLEM:
┌────────────────────────────────────────────────────────────────┐
│  Computers can't understand text directly.                    │
│                                                                │
│  "The movie was great!" → ???                                 │
│  "I loved the film!"    → ???                                 │
│                                                                │
│  Are these similar? Computer has no idea!                     │
└────────────────────────────────────────────────────────────────┘

THE SOLUTION - EMBEDDINGS:
┌────────────────────────────────────────────────────────────────┐
│  Convert text to numbers that capture meaning:                │
│                                                                │
│  "The movie was great!" → [0.82, -0.15, 0.43, 0.91, ...]     │
│  "I loved the film!"    → [0.79, -0.18, 0.45, 0.88, ...]     │
│                                                                │
│  Similar meanings → Similar numbers → We can compare!         │
└────────────────────────────────────────────────────────────────┘
```

### How Embeddings Work

```
TEXT TO EMBEDDING PROCESS:

Input: "I love machine learning"
                │
                ▼
┌─────────────────────────────────────────────────────────────┐
│                      TOKENIZER                              │
│  Split into tokens: ["I", "love", "machine", "learning"]   │
│  Convert to IDs:    [101, 1045, 2293, 3698, 4083, 102]     │
│                     ([CLS]  I   love machine learning [SEP])│
└─────────────────────────────────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────────────┐
│                    TRANSFORMER MODEL                        │
│  Each token → Hidden state vector (768 dimensions)         │
│                                                             │
│  [CLS]    → [h₀₁, h₀₂, h₀₃, ..., h₀₇₆₈]                    │
│  I        → [h₁₁, h₁₂, h₁₃, ..., h₁₇₆₈]                    │
│  love     → [h₂₁, h₂₂, h₂₃, ..., h₂₇₆₈]                    │
│  machine  → [h₃₁, h₃₂, h₃₃, ..., h₃₇₆₈]                    │
│  learning → [h₄₁, h₄₂, h₄₃, ..., h₄₇₆₈]                    │
│  [SEP]    → [h₅₁, h₅₂, h₅₃, ..., h₅₇₆₈]                    │
└─────────────────────────────────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────────────┐
│                      POOLING                                │
│  Combine token vectors into one sentence vector             │
│                                                             │
│  Methods:                                                   │
│  • CLS pooling:  Use [CLS] token vector only               │
│  • Mean pooling: Average all token vectors                  │
│  • Max pooling:  Take maximum value per dimension           │
└─────────────────────────────────────────────────────────────┘
                │
                ▼
Output: [0.23, -0.45, 0.67, ..., 0.12]  (768-dim vector)
```
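
The pooling step in the diagram can be tried on a toy example. This is a minimal sketch with made-up numbers (4 tokens, 3 dimensions) rather than real model output; the variable names are illustrative only.

```python
import numpy as np

# Toy "hidden states": 4 tokens x 3 dimensions.
# Real models produce hundreds of dimensions; these values are invented.
token_vectors = np.array([
    [0.1, 0.2, 0.3],    # [CLS]
    [0.5, -0.4, 0.9],   # word 1
    [0.3, 0.8, -0.1],   # word 2
    [0.1, 0.2, 0.1],    # [SEP]
])

cls_pooled = token_vectors[0]              # CLS pooling: first row only
mean_pooled = token_vectors.mean(axis=0)   # mean pooling: average over tokens
max_pooled = token_vectors.max(axis=0)     # max pooling: per-dimension maximum

print("CLS: ", cls_pooled)
print("Mean:", mean_pooled)
print("Max: ", max_pooled)
```

Each strategy turns the (tokens × dimensions) matrix into a single vector with one value per dimension.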

### Visualizing the Embedding Space

```
2D SIMPLIFICATION OF EMBEDDING SPACE:
(Real embeddings have hundreds of dimensions, e.g. 384 or 768!)

        Positive Sentiment ↑
                          │
                    ★ "Amazing movie!"
            ★ "Great film!"     ★ "I loved it!"
                          │
    ─────────────────────┼─────────────────────→ Entertainment
                          │
        ○ "Okay, I guess"│
                          │
            ◆ "Terrible!" │  ◆ "Waste of time"
                    ◆ "Hated it"
                          │
        Negative Sentiment ↓

SIMILAR TEXTS CLUSTER TOGETHER!
```
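
A 2D picture like the one above is usually produced by projecting the high-dimensional vectors down to two dimensions. The sketch below uses PCA computed with NumPy's SVD as one possible way to do that; `project_to_2d` is a hypothetical helper, and the input is random data standing in for real embeddings.

```python
import numpy as np

def project_to_2d(embeddings):
    """Project row vectors onto their top two principal components (PCA via SVD)."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(6, 16))  # random stand-ins, not real embeddings
points = project_to_2d(fake_embeddings)
print(points.shape)
```

With real sentence embeddings, the resulting `(x, y)` pairs could be passed to any plotting library to see which texts land near each other.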

### Connection to Previous Notebooks

| Notebook | What It Does | Relation to Embeddings |
|----------|--------------|------------------------|
| 1 (Fill-Mask) | Predicts missing words | Uses internal embeddings |
| 2 (NER) | Labels entities | Token-level embeddings |
| 6 (Zero-Shot) | Classifies text | Scores text against candidate labels (NLI-based), using embeddings internally |
| 7 (Translation) | Converts languages | Encoder creates meaning vectors |
| **8 (This notebook)** | **Extracts embeddings directly** | **Full control over representations** |

In previous notebooks, embeddings were used internally. Now we'll extract and use them directly!

### Measuring Similarity: Cosine Similarity

Once we have embeddings, we need a way to compare them:

```
COSINE SIMILARITY:

Measures the angle between two vectors (ignores magnitude)

                    A · B
cos(θ) = ─────────────────────
          ||A|| × ||B||

Score interpretation (rough guide; thresholds vary by model):
├── 1.0:   Identical direction (very similar)
├── 0.7+:  High similarity
├── ~0.5:  Moderate similarity
├── <0.3:  Low similarity
├── 0.0:   Orthogonal (unrelated)
└── <0.0:  Pointing in opposing directions (down to -1.0)

Example:
┌─────────────────────────────────────────────────────────────┐
│  "I love dogs"  ←─── cos = 0.92 ───→  "Dogs are great"    │
│                                                             │
│  "I love dogs"  ←─── cos = 0.45 ───→  "The weather is nice" │
│                                                             │
│  "I love dogs"  ←─── cos = 0.23 ───→  "Python programming" │
└─────────────────────────────────────────────────────────────┘
```
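
To make the formula concrete, here is a small worked sketch on hand-picked 3-dimensional vectors. The helper name `cosine` is hypothetical and the vectors are chosen so the arithmetic is easy to follow; they are not real embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors: (A . B) / (||A|| * ||B||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = [1.0, 2.0, 2.0]
same_direction = cosine(a, [2.0, 4.0, 4.0])    # same direction, twice the length
perpendicular = cosine(a, [2.0, -1.0, 0.0])    # dot product is zero
opposite = cosine(a, [-1.0, -2.0, -2.0])       # exactly reversed

print(same_direction, perpendicular, opposite)
```

The first pair shows why magnitude is ignored: doubling a vector's length does not change the score.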

### Real-World Applications

Embeddings power many applications:

- **Semantic Search**: Find documents by meaning, not just keywords
- **Duplicate Detection**: Find similar or duplicate content
- **FAQ Matching**: Match user questions to known answers
- **Recommendation Systems**: Suggest similar content
- **Clustering**: Group similar documents automatically
- **Retrieval-Augmented Generation (RAG)**: Find relevant context for LLMs

### Key Terminology

| Term | Definition |
|------|------------|
| **Embedding** | Dense vector representation of text |
| **Dimension** | Number of values in the vector (e.g., 768) |
| **Pooling** | Combining token vectors into one sentence vector |
| **Cosine similarity** | Cosine of the angle between two vectors (range -1 to 1; text embeddings usually score between 0 and 1) |
| **Semantic similarity** | How close in meaning two texts are |
| **Vector space** | Mathematical space where embeddings live |
| **Dense vector** | Vector with mostly non-zero values |

### Check Your Understanding

Before moving on, try to answer these questions (answers at the end):

1. What is the purpose of text embeddings?
   - A) To compress text files
   - B) To represent text as numbers that capture meaning
   - C) To encrypt text for security

2. What does cosine similarity measure?
   - A) The length of two vectors
   - B) The angle between two vectors
   - C) The sum of two vectors

3. If two texts have cosine similarity of 0.95, what does that mean?
   - A) They are very different
   - B) They are very similar
   - C) They are exactly the same

4. What is pooling used for?
   - A) Combining multiple token vectors into one sentence vector
   - B) Removing water from the model
   - C) Training the model faster

---

# Part 2: Basic Implementation

## Your First Embeddings

Let's generate embeddings using the feature-extraction pipeline:

In [None]:
# Create a feature-extraction pipeline
# This extracts the raw embeddings from a model
extractor = pipeline("feature-extraction", model="distilbert-base-uncased")

# Generate embeddings for a sentence
text = "Machine learning is fascinating."
embeddings = extractor(text)

print("Feature Extraction Result:")
print("="*50)
print(f"Input: \"{text}\"")
print(f"\nOutput type: {type(embeddings)}")
print(f"Output shape: {len(embeddings)} x {len(embeddings[0])} x {len(embeddings[0][0])}")
print(f"\nExplanation:")
print(f"  - 1 batch (1 input text)")
print(f"  - {len(embeddings[0])} tokens (including [CLS] and [SEP])")
print(f"  - {len(embeddings[0][0])} dimensions per token")

### Understanding the Output Shape

The pipeline returns:
- A nested list: `[batch][tokens][dimensions]`
- Each token has its own 768-dimensional vector
- We need to combine these into one sentence vector (pooling)
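
The nested-list layout can be explored without loading a model. The sketch below builds a small stand-in with the same `[batch][tokens][dimensions]` nesting (made-up values, 3 dimensions instead of 768) and shows how the shape changes at each step.

```python
import numpy as np

# Stand-in for a pipeline result: 1 text, 4 tokens, 3 dimensions.
# The numbers are invented; a real model returns far more dimensions per token.
mock_output = [[
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]]

arr = np.array(mock_output)
print(arr.shape)                              # (batch, tokens, dimensions)

token_matrix = arr[0]                         # drop the batch axis
sentence_vector = token_matrix.mean(axis=0)   # one value per dimension
print(token_matrix.shape, sentence_vector.shape)
```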

In [None]:
# Convert to numpy for easier manipulation
embeddings_array = np.array(embeddings)

print("Embeddings as NumPy Array:")
print("="*50)
print(f"Shape: {embeddings_array.shape}")
print(f"  → (batch_size, num_tokens, hidden_size)")

# Look at the first few values of the first token
print(f"\nFirst 10 values of [CLS] token:")
print(embeddings_array[0, 0, :10])

### Implementing Pooling Strategies

We need to combine token embeddings into a single sentence embedding:

In [None]:
def get_embedding(text, extractor, pooling='mean'):
    """
    Generate a sentence embedding with specified pooling.
    
    Args:
        text: Input text
        extractor: Feature extraction pipeline
        pooling: 'mean', 'cls', or 'max'
        
    Returns:
        numpy array of shape (hidden_size,)
    """
    # Get raw embeddings
    embeddings = extractor(text)
    embeddings = np.array(embeddings)[0]  # Remove batch dimension
    
    if pooling == 'cls':
        # Use only the [CLS] token (first token)
        return embeddings[0]
    elif pooling == 'mean':
        # Average all token embeddings
        return np.mean(embeddings, axis=0)
    elif pooling == 'max':
        # Take maximum value across tokens for each dimension
        return np.max(embeddings, axis=0)
    else:
        raise ValueError(f"Unknown pooling: {pooling}")


# Test different pooling methods
text = "I love learning about artificial intelligence."

cls_embedding = get_embedding(text, extractor, pooling='cls')
mean_embedding = get_embedding(text, extractor, pooling='mean')
max_embedding = get_embedding(text, extractor, pooling='max')

print("Pooling Comparison:")
print("="*50)
print(f"Text: \"{text}\"\n")

for name, emb in [('CLS', cls_embedding), ('Mean', mean_embedding), ('Max', max_embedding)]:
    print(f"{name} pooling:")
    print(f"  Shape: {emb.shape}")
    print(f"  First 5 values: {emb[:5].round(3)}")
    print(f"  Min: {emb.min():.3f}, Max: {emb.max():.3f}, Mean: {emb.mean():.3f}")
    print()

### Implementing Cosine Similarity

In [None]:
def cosine_similarity(vec1, vec2):
    """
    Calculate cosine similarity between two vectors.
    
    Returns:
        float between -1 and 1 (usually 0 to 1 for embeddings)
    """
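    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    
    # Guard against zero vectors, which would otherwise produce NaN
    if norm1 == 0 or norm2 == 0:
        return 0.0
    
    return dot_product / (norm1 * norm2)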


# Test with two similar sentences
text1 = "I love dogs."
text2 = "Dogs are wonderful pets."
text3 = "The weather is nice today."

emb1 = get_embedding(text1, extractor, pooling='mean')
emb2 = get_embedding(text2, extractor, pooling='mean')
emb3 = get_embedding(text3, extractor, pooling='mean')

print("Cosine Similarity Comparison:")
print("="*60)
print(f"Text 1: \"{text1}\"")
print(f"Text 2: \"{text2}\"")
print(f"Text 3: \"{text3}\"")
print()
print(f"Similarity (1 vs 2): {cosine_similarity(emb1, emb2):.4f}  ← Both about dogs!")
print(f"Similarity (1 vs 3): {cosine_similarity(emb1, emb3):.4f}  ← Different topics")
print(f"Similarity (2 vs 3): {cosine_similarity(emb2, emb3):.4f}  ← Different topics")

### Finding the Most Similar Text

In [None]:
def find_most_similar(query, candidates, extractor, pooling='mean'):
    """
    Find the most similar text from a list of candidates.
    
    Args:
        query: The query text
        candidates: List of candidate texts to compare against
        extractor: Feature extraction pipeline
        pooling: Pooling strategy
        
    Returns:
        List of (candidate, similarity) tuples, sorted by similarity
    """
    # Get query embedding
    query_emb = get_embedding(query, extractor, pooling)
    
    # Calculate similarity with each candidate
    results = []
    for candidate in candidates:
        cand_emb = get_embedding(candidate, extractor, pooling)
        similarity = cosine_similarity(query_emb, cand_emb)
        results.append((candidate, similarity))
    
    # Sort by similarity (highest first)
    results.sort(key=lambda x: x[1], reverse=True)
    
    return results


# Test: Find the best match for a query
query = "How do I reset my password?"

candidates = [
    "Click 'Forgot Password' on the login page.",
    "Our store is open 9 AM to 5 PM.",
    "You can change your password in account settings.",
    "Contact support for billing questions.",
    "Password recovery is available via email.",
]

results = find_most_similar(query, candidates, extractor)

print(f"Query: \"{query}\"")
print("="*60)
print("\nResults (ranked by similarity):")
for i, (candidate, score) in enumerate(results, 1):
    bar = '*' * int(score * 30)
    print(f"\n{i}. [{score:.3f}] {bar}")
    print(f"   {candidate}")

---

## Exercise 1: Similarity Calculator (Guided)

**Difficulty**: Basic | **Time**: 10-15 minutes

**Your task**: Build a function that compares multiple text pairs and visualizes their similarity.

### Step 1: Create a batch similarity calculator

In [None]:
def calculate_pairwise_similarities(texts, extractor, pooling='mean'):
    """
    Calculate similarity between all pairs of texts.
    
    Args:
        texts: List of texts
        extractor: Feature extraction pipeline
        pooling: Pooling strategy
        
    Returns:
        numpy array of shape (n_texts, n_texts) with similarities
    """
    # Get embeddings for all texts
    embeddings = [get_embedding(text, extractor, pooling) for text in texts]
    
    # Calculate pairwise similarities
    n = len(texts)
    similarity_matrix = np.zeros((n, n))
    
    for i in range(n):
        for j in range(n):
            similarity_matrix[i, j] = cosine_similarity(embeddings[i], embeddings[j])
    
    return similarity_matrix


# Test texts
test_texts = [
    "I love programming in Python.",
    "Python is my favorite programming language.",
    "The weather is beautiful today.",
    "It's a sunny and warm day outside.",
    "Machine learning is transforming technology.",
]

# Calculate similarities
sim_matrix = calculate_pairwise_similarities(test_texts, extractor)

print("Pairwise Similarity Matrix:")
print("="*60)
print(np.round(sim_matrix, 2))

### Step 2: Create a visual display

In [None]:
def display_similarity_matrix(texts, sim_matrix):
    """
    Display similarity matrix in a readable format.
    """
    n = len(texts)
    
    # Print abbreviated texts
    print("Texts:")
    for i, text in enumerate(texts):
        abbrev = text[:40] + '...' if len(text) > 40 else text
        print(f"  [{i}] {abbrev}")
    
    print("\nSimilarity Matrix (rows vs columns):")
    print()
    
    # Header row (6-character indent matches the "[i] │ " row prefix)
    header = " " * 6 + "  ".join(f"[{i}]" for i in range(n))
    print(header)
    print(" " * 6 + "─" * (len(header) - 6))
    
    # Data rows
    for i in range(n):
        row = f"[{i}] │ "
        for j in range(n):
            score = sim_matrix[i, j]
            # Color coding using text
            if i == j:
                row += f"1.00 "
            elif score > 0.8:
                row += f"{score:.2f}★"
            elif score > 0.6:
                row += f"{score:.2f}+"
            else:
                row += f"{score:.2f} "
        print(row)
    
    print("\n★ = High similarity (>0.8)  + = Moderate (>0.6)")


display_similarity_matrix(test_texts, sim_matrix)

### Step 3: Try your own text comparisons

In [None]:
# YOUR CODE HERE
# Write your own texts to compare

my_texts = [
    "Your first text here",
    "Your second text here",
    "Your third text here",
]

# Uncomment to run:
# my_sim_matrix = calculate_pairwise_similarities(my_texts, extractor)
# display_similarity_matrix(my_texts, my_sim_matrix)

---

# Part 3: Intermediate Exploration

## Comparing Pooling Strategies

Different pooling strategies can affect similarity results:

In [None]:
# Compare pooling strategies
text_pairs = [
    ("The cat sat on the mat.", "A cat is sitting on a rug."),  # Similar
    ("I love pizza.", "Pizza is my favorite food."),  # Similar
    ("The sky is blue.", "Programming in Python."),  # Different
    ("Machine learning is exciting.", "AI technology advances."),  # Related
]

print("Pooling Strategy Comparison:")
print("="*70)

for text1, text2 in text_pairs:
    print(f"\nText 1: \"{text1}\"")
    print(f"Text 2: \"{text2}\"")
    print()
    
    for pooling in ['cls', 'mean', 'max']:
        emb1 = get_embedding(text1, extractor, pooling)
        emb2 = get_embedding(text2, extractor, pooling)
        sim = cosine_similarity(emb1, emb2)
        bar = '*' * int(sim * 20)
        print(f"  {pooling.upper():5s}: {sim:.4f} {bar}")
    print("-" * 50)

### Pooling Strategy Guide

| Strategy | How It Works | Best For |
|----------|--------------|----------|
| **CLS** | Uses [CLS] token only | Classification tasks |
| **Mean** | Averages all tokens | General similarity |
| **Max** | Maximum per dimension | Capturing key features |

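**Mean pooling** is usually a sensible default for semantic similarity, particularly with base models such as DistilBERT whose [CLS] vector has not been fine-tuned for sentence-level tasks.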
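
One caveat worth sketching: the single-text calls in this notebook contain no padding, so a plain average works. If you tokenize several texts together and pad them to the same length, the average should skip the padding positions. The example below is a minimal NumPy sketch of that idea, assuming you have an `attention_mask` (1 for real tokens, 0 for padding) like the one Hugging Face tokenizers return; `masked_mean_pool` is a hypothetical helper, and the numbers are made up.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    emb = np.asarray(token_embeddings, dtype=float)             # (batch, tokens, dim)
    mask = np.asarray(attention_mask, dtype=float)[..., None]   # (batch, tokens, 1)
    summed = (emb * mask).sum(axis=1)                           # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)              # avoid division by zero
    return summed / counts

# Two "sentences" padded to 3 tokens; the second has one padding position.
emb = np.array([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],   # last row is padding
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean_pool(emb, mask)
print(pooled)
```

Without the mask, the padding row `[9.0, 9.0]` would pull the second sentence's average away from its real tokens.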

## Using Sentence Transformers (Better for Similarity)

For production-quality embeddings, sentence-transformers models are specifically trained for semantic similarity:

In [None]:
# Load a model trained specifically for similarity
# all-MiniLM-L6-v2 is fast and effective for semantic similarity
print("Loading sentence-transformers model...")
similarity_extractor = pipeline(
    "feature-extraction", 
    model="sentence-transformers/all-MiniLM-L6-v2"
)
print("Model loaded!")

In [None]:
# Compare base model vs sentence-transformers
test_pairs = [
    ("A man is eating food.", "A person is having a meal."),
    ("The dog runs fast.", "A canine is sprinting quickly."),
    ("I love programming.", "The sky is cloudy."),
]

print("Comparison: Base Model vs Sentence-Transformers")
print("="*60)

for text1, text2 in test_pairs:
    # Base model (DistilBERT)
    emb1_base = get_embedding(text1, extractor, 'mean')
    emb2_base = get_embedding(text2, extractor, 'mean')
    sim_base = cosine_similarity(emb1_base, emb2_base)
    
    # Sentence-transformers
    emb1_st = get_embedding(text1, similarity_extractor, 'mean')
    emb2_st = get_embedding(text2, similarity_extractor, 'mean')
    sim_st = cosine_similarity(emb1_st, emb2_st)
    
    print(f"\n\"{text1}\"")
    print(f"\"{text2}\"")
    print(f"  Base DistilBERT:      {sim_base:.4f}")
    print(f"  Sentence-Transformers: {sim_st:.4f}")

### Understanding the Difference

| Model Type | Training | Embedding Quality |
|------------|----------|-------------------|
| Base BERT/DistilBERT | General language modeling | Okay for similarity |
| Sentence-Transformers | Trained on similarity pairs | Great for similarity |

Sentence-transformer models are trained using contrastive learning to make similar texts closer and dissimilar texts farther apart.
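
To give a feel for what "contrastive" means, here is a small NumPy sketch of one common objective from this family: an in-batch-negatives loss, where each anchor should score highest against its own positive. This is an illustration under assumptions, not the exact training recipe of `all-MiniLM-L6-v2`; the function name and the `temperature` value are hypothetical.

```python
import numpy as np

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Cross-entropy over cosine similarities; row i's correct match is column i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                # (n, n) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
positives = np.array([[0.9, 0.1], [0.2, 0.8]])

loss_matched = in_batch_contrastive_loss(anchors, positives)
loss_swapped = in_batch_contrastive_loss(anchors, positives[::-1])
print(loss_matched, loss_swapped)
```

The loss is low when each anchor sits close to its own positive and high when the pairs are swapped, which is the pressure that pulls similar texts together during training.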

---

## Exercise 2: Duplicate Detector (Semi-guided)

**Difficulty**: Intermediate | **Time**: 15-20 minutes

**Your task**: Build a function that finds potential duplicate or near-duplicate content.

**Hints**:
- Compare each text with every other text
- Flag pairs above a similarity threshold
- Avoid comparing a text with itself
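
As an alternative to nested loops, the three hints can also be expressed with matrix operations. The sketch below is one possible approach, assuming the embeddings are already stacked into a 2D array; `pairs_above_threshold` is a hypothetical helper and the toy vectors are made up.

```python
import numpy as np

def pairs_above_threshold(embeddings, threshold):
    """Return (i, j, similarity) for every pair i < j at or above the threshold."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-length rows
    sims = E @ E.T                                     # all pairwise cosine similarities
    rows, cols = np.triu_indices(len(E), k=1)          # i < j: no self-pairs, no mirrored pairs
    keep = sims[rows, cols] >= threshold
    return [(int(i), int(j), float(sims[i, j])) for i, j in zip(rows[keep], cols[keep])]

toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
flagged = pairs_above_threshold(toy, threshold=0.9)
print(flagged)
```

Only the pair (0, 1) is flagged here, because those two vectors point in nearly the same direction.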

In [None]:
# YOUR CODE HERE

def find_duplicates(texts, extractor, threshold=0.85, pooling='mean'):
    """
    Find potential duplicate or near-duplicate texts.
    
    Args:
        texts: List of texts to check
        extractor: Feature extraction pipeline
        threshold: Similarity threshold to flag as duplicate
        pooling: Pooling strategy
        
    Returns:
        List of (idx1, idx2, similarity) tuples for potential duplicates
    """
    # Get all embeddings
    embeddings = [get_embedding(text, extractor, pooling) for text in texts]
    
    duplicates = []
    
    # Compare each pair once (skipping self-comparison and mirrored pairs)
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):  # Start from i+1 so each pair is checked once
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if sim >= threshold:
                duplicates.append((i, j, sim))
    
    # Sort by similarity (highest first)
    duplicates.sort(key=lambda x: x[2], reverse=True)
    
    return duplicates


def format_duplicate_report(texts, duplicates):
    """
    Format duplicate findings as a readable report.
    """
    if not duplicates:
        return "No potential duplicates found above threshold."
    
    lines = []
    lines.append(f"Found {len(duplicates)} potential duplicate pair(s):")
    lines.append("="*60)
    
    for idx1, idx2, sim in duplicates:
        lines.append(f"\nSimilarity: {sim:.1%}")
        lines.append(f"  [{idx1}] {texts[idx1][:60]}..." if len(texts[idx1]) > 60 else f"  [{idx1}] {texts[idx1]}")
        lines.append(f"  [{idx2}] {texts[idx2][:60]}..." if len(texts[idx2]) > 60 else f"  [{idx2}] {texts[idx2]}")
    
    return '\n'.join(lines)


# Test with sample documents
documents = [
    "How do I reset my password for my account?",
    "What are your business hours?",
    "I forgot my password, how can I reset it?",
    "When is your store open?",
    "Can you help me recover my account password?",
    "What time do you close?",
    "I need help with a refund.",
    "How do I return a product for a refund?",
]

print("Duplicate Detection Report:")
print("Threshold: 0.75\n")

duplicates = find_duplicates(documents, similarity_extractor, threshold=0.75)
print(format_duplicate_report(documents, duplicates))

In [None]:
# Group potential duplicates together
def group_duplicates(texts, duplicates):
    """
    Group texts that are duplicates of each other.
    """
    from collections import defaultdict
    
    # Union-find like grouping
    groups = {i: i for i in range(len(texts))}
    
    def find(x):
        if groups[x] != x:
            groups[x] = find(groups[x])
        return groups[x]
    
    def union(x, y):
        px, py = find(x), find(y)
        if px != py:
            groups[px] = py
    
    for idx1, idx2, _ in duplicates:
        union(idx1, idx2)
    
    # Collect groups
    group_members = defaultdict(list)
    for i in range(len(texts)):
        group_members[find(i)].append(i)
    
    # Filter to groups with more than one member
    duplicate_groups = [members for members in group_members.values() if len(members) > 1]
    
    return duplicate_groups


# Show grouped duplicates
groups = group_duplicates(documents, duplicates)

print("\nDuplicate Groups:")
print("="*60)

for i, group in enumerate(groups, 1):
    print(f"\nGroup {i}:")
    for idx in group:
        print(f"  [{idx}] {documents[idx]}")

---

# Part 4: Advanced Topics

## Batch Processing for Efficiency

Processing many texts individually is slow. Let's batch:

In [None]:
def get_embeddings_batch(texts, extractor, pooling='mean'):
    """
    Get embeddings for multiple texts with a single pipeline call.
    
    Note: Passing a list lets the pipeline loop over the inputs for us.
    By default it still runs the model on one text at a time, so this
    mainly saves per-call overhead.
    """
    # Get all embeddings at once: one [1][tokens][dimensions] entry per text
    all_embeddings = extractor(texts)
    
    # Apply pooling to each
    result = []
    for emb in all_embeddings:
        emb_array = np.array(emb)
        if emb_array.ndim == 3:
            emb_array = emb_array[0]  # Remove the per-text batch dimension
        if pooling == 'cls':
            result.append(emb_array[0])
        elif pooling == 'mean':
            result.append(np.mean(emb_array, axis=0))
        elif pooling == 'max':
            result.append(np.max(emb_array, axis=0))
        else:
            raise ValueError(f"Unknown pooling: {pooling}")
    
    return np.array(result)


# Time comparison (simplified)
import time

sample_texts = [
    "This is the first sample text.",
    "Here is another sample for testing.",
    "A third text to process.",
    "Fourth sample text here.",
    "And finally the fifth sample.",
]

# Individual processing
start = time.time()
individual_results = [get_embedding(t, similarity_extractor, 'mean') for t in sample_texts]
individual_time = time.time() - start

# Batch processing
start = time.time()
batch_results = get_embeddings_batch(sample_texts, similarity_extractor, 'mean')
batch_time = time.time() - start

print("Processing Time Comparison:")
print("="*50)
print(f"Individual processing: {individual_time:.3f}s")
print(f"Batch processing:      {batch_time:.3f}s")
print(f"Speedup:               {individual_time/batch_time:.1f}x")

## Similarity Search at Scale

For large collections, pre-compute embeddings and use efficient search:

In [None]:
class EmbeddingIndex:
    """
    A simple embedding index for similarity search.
    """
    
    def __init__(self, extractor, pooling='mean'):
        self.extractor = extractor
        self.pooling = pooling
        self.documents = []
        self.embeddings = None
    
    def add_documents(self, documents):
        """
        Add documents to the index.
        """
        self.documents.extend(documents)
        
        # Compute embeddings
        new_embeddings = get_embeddings_batch(documents, self.extractor, self.pooling)
        
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
    
    def search(self, query, top_k=5):
        """
        Search for most similar documents.
        """
        # Get query embedding
        query_emb = get_embedding(query, self.extractor, self.pooling)
        
        # Calculate similarities with all documents
        similarities = []
        for i, doc_emb in enumerate(self.embeddings):
            sim = cosine_similarity(query_emb, doc_emb)
            similarities.append((i, self.documents[i], sim))
        
        # Sort by similarity
        similarities.sort(key=lambda x: x[2], reverse=True)
        
        return similarities[:top_k]


# Create index with sample FAQs
faqs = [
    "How do I reset my password?",
    "What payment methods do you accept?",
    "How long does shipping take?",
    "What is your return policy?",
    "How do I track my order?",
    "Do you offer international shipping?",
    "How do I contact customer support?",
    "What are your business hours?",
    "How do I cancel my subscription?",
    "Is there a mobile app available?",
]

# Build index
print("Building FAQ index...")
index = EmbeddingIndex(similarity_extractor)
index.add_documents(faqs)
print(f"Index contains {len(index.documents)} documents")

In [None]:
# Test queries
queries = [
    "I forgot my password",
    "Where's my package?",
    "Can I pay with credit card?",
    "I want to return something",
]

print("FAQ Search Results:")
print("="*60)

for query in queries:
    print(f"\nQuery: \"{query}\"")
    print("-" * 40)
    
    results = index.search(query, top_k=3)
    for rank, (idx, doc, score) in enumerate(results, 1):
        print(f"  {rank}. [{score:.3f}] {doc}")
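
The `search` method above compares the query with one document at a time. Once the document embeddings are stored in a single matrix, the same ranking can be computed with one matrix-vector product. The following is a minimal sketch of that idea; `top_k_cosine` is a hypothetical helper and the 2-dimensional vectors are made up.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Rank documents by cosine similarity to the query using one matrix product."""
    q = np.asarray(query_vec, dtype=float)
    D = np.asarray(doc_matrix, dtype=float)
    q = q / np.linalg.norm(q)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)   # normalize each document row
    scores = D @ q                                     # cosine similarity per document
    order = np.argsort(-scores)[:k]                    # indices of the highest scores
    return [(int(i), float(scores[i])) for i in order]

docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ranked = top_k_cosine(np.array([1.0, 0.2]), docs, k=2)
print(ranked)
```

Normalizing the document matrix once, when documents are added, would avoid repeating that work on every query.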

### Limitations and Considerations

| Aspect | Consideration |
|--------|---------------|
| **Scale** | For millions of documents, use vector databases (Pinecone, Weaviate, etc.) |
| **Freshness** | Embeddings must be recomputed when documents change |
| **Domain** | General models may not work well for specialized domains |
| **Length** | Very long texts may lose information - consider chunking |
| **Languages** | Use multilingual models for non-English text |
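
For the **Length** row, one simple way to chunk is to split on words with some overlap, embed each chunk separately, and search over chunks instead of whole documents. The sketch below assumes word-based splitting for simplicity (real systems often chunk by tokens or sentences); `chunk_words` and its parameters are hypothetical.

```python
def chunk_words(text, max_words=50, overlap=10):
    """Split text into overlapping chunks of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, max_words=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.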

---

## Exercise 3: Clustering Similar Content (Independent)

**Difficulty**: Advanced | **Time**: 15-20 minutes

**Your task**: Build a simple clustering system that groups similar texts together based on their embeddings.

**Requirements**:
1. Calculate pairwise similarities
2. Group texts above a threshold
3. Display clusters with their members

In [None]:
# YOUR CODE HERE

class SimpleClustering:
    """
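    Simple greedy clustering: each text joins the earliest unclustered "seed" text
    whose similarity to it meets the threshold (members are not compared with each other).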
    """
    
    def __init__(self, extractor, threshold=0.7, pooling='mean'):
        self.extractor = extractor
        self.threshold = threshold
        self.pooling = pooling
    
    def fit(self, texts):
        """
        Cluster texts based on similarity.
        
        Returns:
            list of clusters, where each cluster is a list of (index, text) tuples
        """
        # Get all embeddings
        embeddings = get_embeddings_batch(texts, self.extractor, self.pooling)
        
        n = len(texts)
        assigned = [False] * n
        clusters = []
        
        for i in range(n):
            if assigned[i]:
                continue
            
            # Start a new cluster with this text
            cluster = [(i, texts[i])]
            assigned[i] = True
            
            # Find all similar texts
            for j in range(i + 1, n):
                if assigned[j]:
                    continue
                
                sim = cosine_similarity(embeddings[i], embeddings[j])
                if sim >= self.threshold:
                    cluster.append((j, texts[j]))
                    assigned[j] = True
            
            clusters.append(cluster)
        
        return clusters
    
    def format_clusters(self, clusters):
        """
        Format clusters for display.
        """
        lines = []
        lines.append(f"Found {len(clusters)} clusters (threshold: {self.threshold})")
        lines.append("="*60)
        
        for i, cluster in enumerate(clusters, 1):
            lines.append(f"\nCluster {i} ({len(cluster)} items):")
            for idx, text in cluster:
                abbrev = text[:55] + '...' if len(text) > 55 else text
                lines.append(f"  [{idx}] {abbrev}")
        
        return '\n'.join(lines)


# Test with diverse texts
diverse_texts = [
    # Technology cluster
    "Python is a popular programming language.",
    "JavaScript is used for web development.",
    "Coding in Python is enjoyable.",
    
    # Food cluster
    "I love eating pizza.",
    "Pizza is my favorite food.",
    "Italian cuisine is delicious.",
    
    # Sports cluster
    "Basketball is an exciting sport.",
    "I enjoy watching football games.",
    
    # Weather (singleton)
    "The weather is beautiful today.",
]

# Cluster the texts
clusterer = SimpleClustering(similarity_extractor, threshold=0.6)
clusters = clusterer.fit(diverse_texts)

print(clusterer.format_clusters(clusters))

In [None]:
# Experiment with different thresholds
print("Threshold Experiment:")
print("="*60)

for threshold in [0.5, 0.6, 0.7, 0.8]:
    clusterer = SimpleClustering(similarity_extractor, threshold=threshold)
    clusters = clusterer.fit(diverse_texts)
    
    multi_member_clusters = [c for c in clusters if len(c) > 1]
    singletons = len([c for c in clusters if len(c) == 1])
    
    print(f"\nThreshold: {threshold}")
    print(f"  Total clusters: {len(clusters)}")
    print(f"  Multi-member clusters: {len(multi_member_clusters)}")
    print(f"  Singletons: {singletons}")

---

# Part 5: Mini-Project

## Project: Semantic FAQ Matcher

**Scenario**: You're building a customer support system that matches user questions to the most relevant FAQ entries, providing instant answers.

**Your goal**: Build a `SemanticFAQMatcher` class that:
1. Stores FAQ entries with their answers
2. Matches user queries to the best FAQ
3. Provides confidence scores
4. Handles the case when no good match is found

In [None]:
# MINI-PROJECT: Semantic FAQ Matcher
# ===================================

class SemanticFAQMatcher:
    """
    Matches user questions to FAQ entries using semantic similarity.
    """
    
    def __init__(self, extractor, confidence_threshold=0.6, pooling='mean'):
        """
        Initialize the FAQ matcher.
        
        Args:
            extractor: Feature extraction pipeline
            confidence_threshold: Minimum similarity to return a match
            pooling: Pooling strategy for embeddings
        """
        self.extractor = extractor
        self.confidence_threshold = confidence_threshold
        self.pooling = pooling
        
        self.faqs = []  # List of {question, answer, category}
        self.embeddings = None
    
    def add_faq(self, question, answer, category="General"):
        """
        Add a single FAQ entry.
        """
        self.faqs.append({
            'question': question,
            'answer': answer,
            'category': category
        })
        self.embeddings = None  # Mark for recalculation
    
    def add_faqs_bulk(self, faq_list):
        """
        Add multiple FAQ entries.
        
        Args:
            faq_list: List of dicts with 'question', 'answer', and optionally 'category'
        """
        for faq in faq_list:
            self.add_faq(
                faq['question'], 
                faq['answer'], 
                faq.get('category', 'General')
            )
    
    def build_index(self):
        """
        Pre-compute embeddings for all FAQ questions.
        """
        if not self.faqs:
            raise ValueError("No FAQs added. Use add_faq() first.")
        
        questions = [faq['question'] for faq in self.faqs]
        self.embeddings = get_embeddings_batch(questions, self.extractor, self.pooling)
        
        print(f"Index built with {len(self.faqs)} FAQ entries.")
    
    def match(self, user_query, top_k=3):
        """
        Find the best matching FAQ(s) for a user query.
        
        Args:
            user_query: The user's question
            top_k: Number of top matches to return
            
        Returns:
            dict with best match info and alternatives
        """
        # Build index if needed
        if self.embeddings is None:
            self.build_index()
        
        # Get query embedding
        query_emb = get_embedding(user_query, self.extractor, self.pooling)
        
        # Calculate similarities
        similarities = []
        for i, faq_emb in enumerate(self.embeddings):
            sim = cosine_similarity(query_emb, faq_emb)
            similarities.append((i, sim))
        
        # Sort by similarity
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Get top matches
        top_matches = []
        for idx, sim in similarities[:top_k]:
            top_matches.append({
                'faq': self.faqs[idx],
                'similarity': sim,
                'confident': sim >= self.confidence_threshold
            })
        
        # Determine overall result
        best_match = top_matches[0] if top_matches else None
        
        return {
            'query': user_query,
            'found_match': best_match['confident'] if best_match else False,
            'best_match': best_match,
            'alternatives': top_matches[1:] if len(top_matches) > 1 else [],
            'all_matches': top_matches,
        }
    
    def get_answer(self, user_query):
        """
        Get the answer to a user's question.
        
        Returns:
            tuple of (answer_text, confidence) or (fallback_message, 0)
        """
        result = self.match(user_query)
        
        if result['found_match']:
            return (
                result['best_match']['faq']['answer'],
                result['best_match']['similarity']
            )
        else:
            return (
                "I'm not sure about that. Would you like to speak with a support agent?",
                0.0
            )
    
    def format_response(self, result):
        """
        Format match result as a user-friendly response.
        """
        width = 66  # Inner text width; the borders add "│ " and " │"

        def row(text=""):
            # Truncate or pad so every row lines up with the 68-character border
            return f"│ {text[:width]:{width}s} │"

        lines = []
        lines.append("┌" + "─"*68 + "┐")
        lines.append(row(f"User Query: {result['query']}"))
        lines.append("├" + "─"*68 + "┤")
        
        if result['found_match']:
            best = result['best_match']
            lines.append(row(f"✓ Match Found (Confidence: {best['similarity']:.1%})"))
            lines.append(row())
            lines.append(row(f"Category: {best['faq']['category']}"))
            
            # Show the question, truncated to fit the box
            lines.append(row(f"Q: {best['faq']['question']}"))
            
            # Show the answer on up to two lines (longer text is cut off)
            a = best['faq']['answer']
            lines.append(row())
            lines.append(row(f"A: {a[:63]}"))
            if len(a) > 63:
                lines.append(row(f"   {a[63:126]}"))
        else:
            lines.append(row("✗ No confident match found"))
            lines.append(row())
            if result['all_matches']:
                closest = result['all_matches'][0]
                lines.append(row(f"Closest match ({closest['similarity']:.1%} similarity):"))
                lines.append(row(f"  {closest['faq']['question']}"))
        
        # Show alternatives
        if result['alternatives']:
            lines.append("├" + "─"*68 + "┤")
            lines.append(row("Other possibilities:"))
            for alt in result['alternatives'][:2]:
                q = alt['faq']['question'][:50]
                lines.append(row(f"  • {q:50s} ({alt['similarity']:.0%})"))
        
        lines.append("└" + "─"*68 + "┘")
        
        return '\n'.join(lines)


# Create FAQ matcher
matcher = SemanticFAQMatcher(similarity_extractor, confidence_threshold=0.65)

print("Semantic FAQ Matcher initialized!")

In [None]:
# Add sample FAQs
sample_faqs = [
    {
        'question': 'How do I reset my password?',
        'answer': 'Go to Settings > Account > Change Password. You can also use the "Forgot Password" link on the login page.',
        'category': 'Account'
    },
    {
        'question': 'What payment methods do you accept?',
        'answer': 'We accept Visa, Mastercard, American Express, PayPal, and Apple Pay.',
        'category': 'Billing'
    },
    {
        'question': 'How long does shipping take?',
        'answer': 'Standard shipping takes 5-7 business days. Express shipping delivers in 2-3 business days.',
        'category': 'Shipping'
    },
    {
        'question': 'What is your return policy?',
        'answer': 'We offer free returns within 30 days of purchase. Items must be in original condition with tags attached.',
        'category': 'Returns'
    },
    {
        'question': 'How do I track my order?',
        'answer': 'Log into your account and go to "My Orders" to see tracking information. You\'ll also receive tracking emails.',
        'category': 'Shipping'
    },
    {
        'question': 'Do you offer international shipping?',
        'answer': 'Yes! We ship to over 100 countries. International shipping typically takes 7-14 business days.',
        'category': 'Shipping'
    },
    {
        'question': 'How do I contact customer support?',
        'answer': 'Email us at support@example.com or call 1-800-123-4567. Live chat is available 9 AM - 5 PM EST.',
        'category': 'Support'
    },
    {
        'question': 'How do I cancel my subscription?',
        'answer': 'Go to Settings > Subscription > Cancel. Your access continues until the end of your billing period.',
        'category': 'Account'
    },
    {
        'question': 'Is there a mobile app?',
        'answer': 'Yes! Download our free app from the App Store (iOS) or Google Play Store (Android).',
        'category': 'General'
    },
    {
        'question': 'How do I update my email address?',
        'answer': 'Go to Settings > Account > Edit Profile. Click on your email address to change it.',
        'category': 'Account'
    },
]

matcher.add_faqs_bulk(sample_faqs)
matcher.build_index()

In [None]:
# Test with user queries
test_queries = [
    "I forgot my password",
    "where's my package?",
    "can I pay with credit card?",
    "I want to send something back",
    "do you have an iPhone app?",
    "what's the meaning of life?",  # Should not match
]

print("FAQ Matcher Test Results:")
print("="*70)

for query in test_queries:
    result = matcher.match(query)
    print("\n" + matcher.format_response(result))

In [None]:
# Simple chatbot interface
def chatbot_response(user_input):
    """
    Generate a chatbot response using the FAQ matcher.
    """
    answer, confidence = matcher.get_answer(user_input)
    
    response = []
    response.append(f"User: {user_input}")
    response.append("")
    
    response.append(f"Bot: {answer}")
    if confidence > 0:
        # Only show a confidence score when a confident match was found
        response.append(f"     (Confidence: {confidence:.0%})")
    
    return '\n'.join(response)


# Simulate a conversation
print("Customer Support Chatbot")
print("="*50)

conversation = [
    "Hi, I need help with my account",
    "How can I change my password?",
    "What if I want a refund?",
    "Thanks for the help!",
]

for msg in conversation:
    print("\n" + chatbot_response(msg))
    print("-" * 50)

In [None]:
# Try your own queries
# Uncomment and modify:

# my_query = "Your question here"
# result = matcher.match(my_query)
# print(matcher.format_response(result))

### Extension Ideas

If you want to extend this project further:

1. **Synonym handling**: Add alternate phrasings for each FAQ
2. **Category filtering**: Let users filter by category
3. **Feedback loop**: Track which answers are helpful to improve matching
4. **Multi-turn context**: Remember conversation history
5. **Hybrid search**: Combine semantic search with keyword matching
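
As one possible starting point for idea 5, the sketch below blends a cosine similarity score with a simple keyword-overlap score. It is a minimal illustration, not part of the `SemanticFAQMatcher` class: the helper names (`cosine`, `keyword_overlap`, `hybrid_score`), the `alpha` weight of 0.7, and the 3-dimensional toy vectors are all assumptions made for this example. In practice you would replace the toy vectors with real embeddings from your extractor.

```python
import re
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keyword_overlap(query, text):
    """Jaccard overlap between the lowercase word sets of two strings."""
    q_words = set(re.findall(r"\w+", query.lower()))
    t_words = set(re.findall(r"\w+", text.lower()))
    if not q_words or not t_words:
        return 0.0
    return len(q_words & t_words) / len(q_words | t_words)

def hybrid_score(semantic_sim, keyword_sim, alpha=0.7):
    """Weighted blend of both scores; alpha is an illustrative, untuned weight."""
    return alpha * semantic_sim + (1 - alpha) * keyword_sim

# Toy 3-d vectors standing in for real embeddings (assumption for this sketch)
query_text = "reset my password"
query_vec = np.array([0.9, 0.1, 0.0])
faq_questions = ["How do I reset my password?", "How long does shipping take?"]
faq_vecs = np.array([[0.8, 0.2, 0.1], [0.1, 0.9, 0.3]])

scores = [
    hybrid_score(cosine(query_vec, vec), keyword_overlap(query_text, question))
    for question, vec in zip(faq_questions, faq_vecs)
]
best = int(np.argmax(scores))
print(f"Best match: {faq_questions[best]} (hybrid score: {scores[best]:.3f})")
```

A higher `alpha` trusts the embeddings more; a lower one favors exact word matches, which can help with product names or codes that embeddings handle poorly.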

---

# Part 6: Wrap-Up

## Key Takeaways

1. **Embeddings represent text as numbers** that capture semantic meaning - similar texts have similar vectors

2. **Pooling combines token vectors** into sentence vectors:
   - Mean pooling (average) is usually best for similarity
   - CLS pooling uses the special [CLS] token
   - Max pooling captures the strongest signals

3. **Cosine similarity** measures how alike two vectors are based on the angle between them. The range is -1 to 1, with values near 1 indicating very similar texts

4. **Sentence-transformer models** are specifically trained for similarity tasks and usually work better for them than base models

5. **Applications are everywhere**: search, duplicate detection, FAQ matching, recommendations, clustering
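
To make takeaway 2 concrete, here is a minimal NumPy sketch of the three pooling strategies applied to a made-up token matrix. The values and the variable names are illustrative assumptions; with a real model you would also need to exclude padding tokens before averaging or taking the maximum.

```python
import numpy as np

# Toy token-level output: 4 tokens x 3 dimensions.
# By convention in this sketch, the first row plays the role of [CLS].
token_vectors = np.array([
    [0.2, 0.1, 0.5],   # [CLS]
    [0.9, 0.3, 0.1],
    [0.4, 0.8, 0.2],
    [0.1, 0.0, 0.4],
])

mean_pooled = token_vectors.mean(axis=0)  # average over tokens
cls_pooled = token_vectors[0]             # first token only
max_pooled = token_vectors.max(axis=0)    # per-dimension maximum

print("Mean:", mean_pooled)
print("CLS: ", cls_pooled)
print("Max: ", max_pooled)
```

Each strategy turns the 4 x 3 token matrix into a single 3-dimensional sentence vector; they differ only in how the token rows are combined.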

## Common Mistakes to Avoid

| Mistake | Why It's a Problem |
|---------|-------------------|
| Using base models for similarity | They're not optimized for this task |
| Ignoring the threshold | Low-confidence matches may be wrong |
| Not pre-computing embeddings | Slow for repeated searches |
| Using CLS for similarity | Mean pooling usually works better |
| Very long texts without chunking | Models have a maximum input length, and pooling over long text dilutes its meaning |
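
The "not pre-computing embeddings" row can be illustrated with a short sketch. The idea is to normalize the FAQ vectors once, so that each later search is a single matrix-vector product. The function names (`normalize_rows`, `top_k_matches`) and the random stand-in vectors are assumptions for this example, not part of the notebook's earlier code.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

def top_k_matches(query_vec, index, k=3):
    """Return (row_index, similarity) pairs for the k most similar rows."""
    q = query_vec / max(np.linalg.norm(query_vec), 1e-12)
    sims = index @ q                   # all cosine similarities at once
    order = np.argsort(-sims)[:k]      # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Random vectors standing in for 10 pre-computed FAQ embeddings (assumption)
rng = np.random.default_rng(0)
faq_vectors = rng.normal(size=(10, 8))
index = normalize_rows(faq_vectors)    # done once, reused for every query

# A scaled copy of row 4: same direction, different magnitude
query = 2.0 * faq_vectors[4]
results = top_k_matches(query, index, k=3)
print(results)
```

Because the query points in the same direction as row 4, that row comes back first with a similarity of 1.0 (up to floating-point error), regardless of the scaling.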

## What's Next?

In **Notebook 9: Sentiment with Different Models**, you'll learn:
- How different models approach sentiment analysis
- Comparing binary vs. multi-class sentiment
- Understanding model training data influence

The embedding concepts from this notebook will help you understand why different sentiment models produce different results!

---

## Solutions

### Check Your Understanding (Quiz Answers)

1. **B) To represent text as numbers that capture meaning** - Embeddings convert text into numerical vectors
2. **B) The angle between two vectors** - Cosine similarity ignores magnitude, focusing on direction
3. **B) They are very similar** - 0.95 indicates high similarity (close to 1.0)
4. **A) Combining multiple token vectors into one sentence vector** - Pooling aggregates token-level representations
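
Answer 2 can be checked numerically. The small sketch below uses two hand-picked 2-D vectors and a helper named `cosine_similarity_np` (a name chosen here to avoid clashing with any function defined earlier in the notebook). It converts the similarity back to an angle and shows that scaling a vector leaves the score unchanged.

```python
import numpy as np

def cosine_similarity_np(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

sim = cosine_similarity_np(a, b)
# Clip guards against tiny floating-point overshoot outside [-1, 1]
angle_deg = float(np.degrees(np.arccos(np.clip(sim, -1.0, 1.0))))

# Making b ten times longer does not change its direction
sim_scaled = cosine_similarity_np(a, 10 * b)

print(f"Similarity: {sim:.4f}, angle: {angle_deg:.1f} degrees")
print(f"Similarity after scaling b: {sim_scaled:.4f}")
```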

### Exercise 2: Key Insights

In [None]:
# Key insights from duplicate detection:

# 1. Threshold selection matters:
#    - Too high (>0.9): Misses near-duplicates
#    - Too low (<0.6): False positives
#    - Common starting range: 0.75-0.85 (tune on your own data and model)

# 2. Semantic vs. lexical duplicates:
#    - "How to reset password" ↔ "Forgot my password" (semantic match)
#    - "Reset password" ↔ "Reset password" (lexical match)
#    - Embeddings catch semantic matches that keywords miss!

# 3. Performance at scale:
#    - Pre-compute embeddings once
#    - Use efficient similarity search (FAISS, Annoy)
#    - Consider approximate nearest neighbors for millions of docs

threshold_guide = {
    'Strict (0.90+)': 'Nearly identical content only',
    'Standard (0.80)': 'Clear duplicates, different wording',
    'Relaxed (0.70)': 'Same topic, related content',
    'Loose (0.60)': 'Broadly similar themes',
}

print("Duplicate Detection Threshold Guide:")
print("="*50)
for level, desc in threshold_guide.items():
    print(f"  {level:20s} → {desc}")
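
The effect of the threshold can be seen in a few lines of NumPy. This is a rough sketch with hand-made 3-D vectors in place of real embeddings; `find_duplicate_pairs` is a hypothetical helper written for this illustration, and its all-pairs comparison only suits small collections.

```python
import numpy as np

def find_duplicate_pairs(vectors, threshold=0.8):
    """Return (i, j, similarity) for every pair i < j at or above the threshold."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T                        # full pairwise cosine matrix
    i_idx, j_idx = np.triu_indices(len(vectors), k=1)  # each pair once, no self-pairs
    return [
        (int(i), int(j), float(sims[i, j]))
        for i, j in zip(i_idx, j_idx)
        if sims[i, j] >= threshold
    ]

# Toy vectors: rows 0 and 1 point in nearly the same direction, as do rows 2 and 3
toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.2],
])

pairs = find_duplicate_pairs(toy_vectors, threshold=0.8)
for i, j, sim in pairs:
    print(f"Rows {i} and {j} look like duplicates (similarity {sim:.3f})")
```

Raising the threshold above the similarity of a pair drops it from the results, which is the trade-off described in insight 1.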

---

## Additional Resources

- [Sentence-Transformers Documentation](https://www.sbert.net/)
- [Hugging Face Embeddings Guide](https://huggingface.co/blog/getting-started-with-embeddings)
- [Understanding Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
- [FAISS Library](https://github.com/facebookresearch/faiss) - Efficient similarity search
- [Vector Databases Comparison](https://www.pinecone.io/learn/vector-database/)
- [Semantic Search Tutorial](https://www.sbert.net/examples/applications/semantic-search/README.html)