# Comparing Embedding Models

In this notebook, we'll compare different embedding models to understand their trade-offs in terms of performance, speed, and resource requirements.

## Learning Objectives

By the end of this notebook, you will be able to:
- Understand key differences between embedding models
- Compare models based on accuracy, speed, and resource requirements
- Select appropriate models for different use cases
- Measure model performance using contrast metrics
- Make informed trade-offs between accuracy and efficiency

## What Makes Embedding Models Different?

Embedding models vary across several dimensions:

1. **Size and computational requirements** - Larger models are often more accurate but slower and need more memory
2. **Language support** - Monolingual (English) vs. multilingual (50+ languages)
3. **Context length** - Maximum input length in tokens; longer inputs are truncated
4. **Embedding dimensionality** - Higher-dimensional vectors (768 vs. 384) can capture more nuance but cost more to store and compare
5. **Task specialization** - General purpose vs. optimized for specific tasks
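
To make the dimensionality trade-off concrete, here is a rough back-of-the-envelope sketch of raw vector storage. It assumes float32 values (4 bytes each) and ignores any index overhead; the helper name `index_size_mib` is made up for this illustration.

```python
def index_size_mib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for dense vectors in MiB (assumes float32 unless told otherwise)."""
    return num_vectors * dim * bytes_per_value / (1024 ** 2)

# Compare the two dimensionalities used by the models in this notebook
for dim in (384, 768):
    print(f"{dim}d: {index_size_mib(1_000_000, dim):,.1f} MiB per million vectors")
# 384d: 1,464.8 MiB per million vectors
# 768d: 2,929.7 MiB per million vectors
```

Doubling the dimensionality doubles the storage, and similarity search cost grows in roughly the same proportion.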

## Models We'll Compare

| Model | Dimensions | Specialization |
|-------|------------|----------------|
| `all-MiniLM-L6-v2` | 384 | Compact, efficient, general purpose |
| `all-MiniLM-L12-v2` | 384 | Medium size, better accuracy |
| `all-mpnet-base-v2` | 768 | Large, highest quality embeddings |
| `paraphrase-multilingual-MiniLM-L12-v2` | 384 | Multilingual (50+ languages) |

In [None]:
import os
os.environ['UV_LINK_MODE'] = 'copy'

# Install the required packages
!uv pip install accelerate==1.6.0 sentence-transformers==4.0.2

print("✓ Required libraries installed successfully!")

In [None]:
# Import libraries
import random
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer, util
import time
import pandas as pd
import numpy as np
import seaborn as sns

print("✓ Libraries imported successfully!")

In [None]:
# Set visualization style
sns.set_theme(style="whitegrid")
plt.rcParams.update({'font.size': 11})

print("✓ Visualization settings configured!")

## Setup: Define Models to Compare

We'll benchmark four different embedding models with varying characteristics:

In [None]:
# Models to evaluate
models = [
    'all-MiniLM-L6-v2',                      # Small, fast (384d)
    'all-MiniLM-L12-v2',                     # Medium size (384d)
    'all-mpnet-base-v2',                     # Large, powerful (768d)
    'paraphrase-multilingual-MiniLM-L12-v2'  # Multilingual (384d)
]

print("✓ Models configured for benchmarking!")
print(f"\nTotal models to evaluate: {len(models)}")
for i, model in enumerate(models, 1):
    print(f"  {i}. {model}")

## Create Test Dataset

We'll create sentence pairs to test model performance:
- **Similar pairs** - Sentences on the same topic (should have high similarity)
- **Dissimilar pairs** - Sentences on different topics (should have low similarity)

In [None]:
# Semantically similar sentence pairs organized by topic
sentence_pairs = [
    # Technology
    ["Machine learning models require significant computational resources.",
     "AI systems need a lot of computing power to train."],
    
    # Programming (same topic, but not paraphrases, so expect a lower score)
    ["What's the best algorithm for text classification?",
     "How can I optimize my neural network training time?"],
    
    # Weather
    ["The weather forecast predicts rain tomorrow.",
     "It's going to be wet outside tomorrow according to meteorologists."],
    
    # Office
    ["I need to get a new computer for my office.",
     "My workplace needs updated computing equipment."],
    
    # Health
    ["Regular exercise improves cardiovascular health.",
     "Working out frequently is good for your heart."],
    
    # Food
    ["The restaurant serves authentic Italian pasta dishes.",
     "You can get genuine Italian noodle recipes at that dining place."]
]

print("✓ Similar sentence pairs created!")
print(f"  Total similar pairs: {len(sentence_pairs)}")
print(f"  Topics: Technology, Programming, Weather, Office, Health, Food")
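
"High" and "low" similarity in this notebook refer to cosine similarity, which the evaluation computes with `util.cos_sim`. As a minimal pure-Python sketch of the same measure (the toy 3-d vectors below are made-up stand-ins for real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not real model output
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(f"same direction: {same_direction:.2f}")  # close to 1
print(f"orthogonal:     {orthogonal:.2f}")      # 0
```

Because the measure depends only on direction, scaling a vector does not change its score.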

In [None]:
# Generate dissimilar pairs by mixing sentences from different topics
dissimilar_pairs = []
for i in range(len(sentence_pairs)):
    for j in range(len(sentence_pairs)):
        if i != j:  # Different topics
            dissimilar_pairs.append([sentence_pairs[i][0], sentence_pairs[j][0]])

# Sample 6 dissimilar pairs to match similar pairs count
random.seed(42)  # For reproducibility
dissimilar_pairs = random.sample(dissimilar_pairs, 6)

# Create list of all unique sentences for encoding
all_sentences = []
for pair in sentence_pairs + dissimilar_pairs:
    all_sentences.extend(pair)
all_sentences = list(dict.fromkeys(all_sentences))  # Remove duplicates, keep a stable order

print("✓ Dissimilar sentence pairs generated!")
print(f"  Total dissimilar pairs: {len(dissimilar_pairs)}")
print(f"  Total unique sentences: {len(all_sentences)}")
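
One thing to keep in mind about the nested loop above: it keeps both orderings, (A, B) and (B, A), of every cross-topic combination, which gives 30 ordered pairs. Cosine similarity is symmetric, so a random sample of 6 can contain the same comparison twice. A possible alternative, sketched here with placeholder sentences rather than the notebook's data, is to build each unordered pair once with `itertools.combinations`:

```python
import random
from itertools import combinations

# Placeholder first sentences, one per topic (hypothetical stand-ins)
first_sentences = [f"sentence about topic {k}" for k in range(6)]

# combinations() yields each unordered pair exactly once: C(6, 2) = 15
unordered_pairs = [list(pair) for pair in combinations(first_sentences, 2)]

# A local RNG keeps the sample reproducible without touching global state
rng = random.Random(42)
sampled = rng.sample(unordered_pairs, 6)

print(len(unordered_pairs))  # 15
print(len(sampled))          # 6
```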

## Define Evaluation Function

This function measures model performance across multiple dimensions:

In [None]:
def evaluate_model(model_name, all_sentences, sentence_pairs, dissimilar_pairs):
    """
    Evaluate an embedding model on similar and dissimilar sentence pairs.
    
    Returns metrics:
    - Model size and dimensions
    - Loading and encoding time
    - Similarity scores for similar/dissimilar pairs
    - Contrast score (difference between similar and dissimilar)
    """
    start_time = time.time()
    
    # Load model (on a first run this also includes download time)
    model_load_time = time.time()
    model = SentenceTransformer(model_name)
    model_load_time = time.time() - model_load_time
    
    # Encode all sentences in a single batched call
    encoding_time = time.time()
    embeddings = model.encode(all_sentences)
    encoding_time = time.time() - encoding_time
    embeddings_dict = dict(zip(all_sentences, embeddings))
    
    # Get model metadata
    dim = embeddings.shape[1]
    # Size estimate assumes 4 bytes per parameter (float32 weights)
    model_size_mb = sum(p.numel() for p in model.parameters()) * 4 / 1024 / 1024
    
    # Calculate similarities for similar pairs
    similar_scores = []
    for s1, s2 in sentence_pairs:
        score = util.cos_sim(
            embeddings_dict[s1].reshape(1, -1),
            embeddings_dict[s2].reshape(1, -1)
        ).item()
        similar_scores.append(score)
    
    # Calculate similarities for dissimilar pairs
    dissimilar_scores = []
    for s1, s2 in dissimilar_pairs:
        score = util.cos_sim(
            embeddings_dict[s1].reshape(1, -1),
            embeddings_dict[s2].reshape(1, -1)
        ).item()
        dissimilar_scores.append(score)
    
    # Calculate metrics
    avg_similar = sum(similar_scores) / len(similar_scores)
    avg_dissimilar = sum(dissimilar_scores) / len(dissimilar_scores)
    contrast = avg_similar - avg_dissimilar  # Key metric!
    
    total_time = time.time() - start_time
    
    return {
        'Model': model_name,
        'Dimensions': dim,
        'Size (MB)': model_size_mb,
        'Load Time (s)': model_load_time,
        'Encoding Time (s)': encoding_time,
        'Total Time (s)': total_time,
        'Avg Similar Score': avg_similar,
        'Avg Different Score': avg_dissimilar,
        'Contrast': contrast,
        'Similar Scores': similar_scores,
        'Dissimilar Scores': dissimilar_scores
    }

print("✓ Evaluation function defined successfully!")
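
The contrast metric is easy to check by hand. The sketch below recomputes it on made-up scores and adds a second, stricter number, the worst-case margin, which is not part of the evaluation function above but shows when the two groups overlap even though the averages look well separated.

```python
from statistics import mean

def contrast_score(similar_scores, dissimilar_scores):
    """Average similar score minus average dissimilar score."""
    return mean(similar_scores) - mean(dissimilar_scores)

def worst_case_margin(similar_scores, dissimilar_scores):
    """Lowest similar score minus highest dissimilar score (negative = overlap)."""
    return min(similar_scores) - max(dissimilar_scores)

# Hypothetical scores for illustration only
similar = [0.80, 0.70, 0.60]
dissimilar = [0.10, 0.20, 0.65]

print(f"contrast: {contrast_score(similar, dissimilar):.4f}")     # contrast: 0.3833
print(f"margin:   {worst_case_margin(similar, dissimilar):.4f}")  # margin:   -0.0500
```

Here the contrast is comfortably positive, yet one dissimilar pair still outscores one similar pair, which is why the per-pair scores are worth inspecting alongside the average.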

## Run Model Evaluation

Now let's benchmark all four models:

In [None]:
# Evaluate all models
results = []
print("=" * 80)
for i, model_name in enumerate(models, 1):
    print(f"\n[{i}/{len(models)}] Evaluating {model_name}...")
    model_results = evaluate_model(model_name, all_sentences, sentence_pairs, dissimilar_pairs)
    results.append(model_results)
    print(f"✓ Completed {model_name}")
    print(f"  Contrast score: {model_results['Contrast']:.4f}")

print("\n" + "=" * 80)
print("✓ All model evaluations complete!")
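
Single-run timings like the ones collected above can be noisy, and the first call often pays one-off costs such as downloads or cache warm-up. A common remedy, sketched here with a hypothetical helper `median_runtime` and a stand-in workload, is to discard a warm-up run and report the median of several repeats:

```python
import time
from statistics import median

def median_runtime(fn, repeats=5, warmup=1):
    """Median wall-clock time of fn() over `repeats` runs, after `warmup` untimed runs."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return median(timings)

# Stand-in workload; in this notebook it could be something like
# lambda: model.encode(all_sentences)
elapsed = median_runtime(lambda: sum(i * i for i in range(10_000)))
print(f"median runtime: {elapsed:.6f} s")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short durations.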

## Model Comparison Summary

Let's view the key metrics for all models:

In [None]:
# Convert to DataFrame
df = pd.DataFrame(results)

# Create display-friendly version
display_df = df[['Model', 'Dimensions', 'Size (MB)', 'Encoding Time (s)',
                'Avg Similar Score', 'Avg Different Score', 'Contrast']]

print("=" * 80)
print("MODEL COMPARISON SUMMARY")
print("=" * 80)
print(display_df.to_string(index=False, float_format=lambda x: f"{x:.4f}"))
print("=" * 80)

print("\n✓ Summary table generated!")
print("\nKey metrics:")
print("  • Contrast = Avg Similar - Avg Different (higher is better)")
print("  • Higher contrast = better at distinguishing relevant from irrelevant")

## Detailed Pair Analysis

Let's examine how each model scored individual sentence pairs:

In [None]:
print("=" * 80)
print("DETAILED PAIR ANALYSIS")
print("=" * 80)

for idx, model_data in enumerate(results):
    print(f"\n{'─' * 80}")
    print(f"Model: {model_data['Model']}")
    print(f"{'─' * 80}")
    
    print("\n✓ SIMILAR PAIRS (should have HIGH similarity):")
    for i, score in enumerate(model_data['Similar Scores']):
        s1, s2 = sentence_pairs[i]
        print(f"  {i+1}. Score: {score:.4f}")
        print(f"     \"{s1[:50]}{'...' if len(s1) > 50 else ''}\"")
        print(f"     \"{s2[:50]}{'...' if len(s2) > 50 else ''}\"")
    
    print("\n✗ DISSIMILAR PAIRS (should have LOW similarity):")
    for i, score in enumerate(model_data['Dissimilar Scores']):
        s1, s2 = dissimilar_pairs[i]
        print(f"  {i+1}. Score: {score:.4f}")
        print(f"     \"{s1[:50]}{'...' if len(s1) > 50 else ''}\"")
        print(f"     \"{s2[:50]}{'...' if len(s2) > 50 else ''}\"")

print("\n" + "=" * 80)

## Visualization 1: Similarity Score Distributions

Compare how each model distributes similarity scores:

In [None]:
plt.figure(figsize=(14, 7))

# Prepare data for boxplot
boxplot_data = []
model_labels = []

for model_data in results:
    boxplot_data.append(model_data['Similar Scores'])
    boxplot_data.append(model_data['Dissimilar Scores'])
    # Shorten model names for display
    short_name = model_data['Model'].replace('paraphrase-', '').replace('-MiniLM-', '-ML')
    model_labels.append(f"{short_name}\nSimilar")
    model_labels.append(f"{short_name}\nDifferent")

# Color pattern: green for similar, red for dissimilar
colors = ['lightgreen', 'lightcoral'] * len(models)

# Create boxplot
bp = plt.boxplot(boxplot_data, patch_artist=True, vert=False)
plt.yticks(range(1, len(model_labels) + 1), model_labels, fontsize=9)
plt.xlabel('Cosine Similarity Score', fontsize=12)
plt.title('Distribution of Similarity Scores by Model', fontsize=14, weight='bold')

# Color the boxes
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

plt.grid(alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("✓ Boxplot visualization complete!")
print("\nInterpretation:")
print("  • Green boxes (similar pairs) should be right-shifted (higher scores)")
print("  • Red boxes (dissimilar pairs) should be left-shifted (lower scores)")
print("  • Greater separation = better model discrimination")
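
"Separation" can also be expressed as a single number. One option, which is not computed anywhere in this notebook, is the fraction of (similar, dissimilar) score combinations in which the similar pair scores higher; this is equivalent to the area under the ROC curve. A minimal sketch on made-up scores:

```python
def pairwise_ranking_accuracy(similar_scores, dissimilar_scores):
    """Fraction of (similar, dissimilar) combinations ranked correctly; ties count as half."""
    wins = 0.0
    for s in similar_scores:
        for d in dissimilar_scores:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(similar_scores) * len(dissimilar_scores))

# Hypothetical scores for illustration only
similar = [0.82, 0.74, 0.55]
dissimilar = [0.10, 0.31, 0.60]

accuracy = pairwise_ranking_accuracy(similar, dissimilar)
print(f"{accuracy:.3f}")  # 0.889 (8 of 9 combinations ranked correctly)
```

A value of 1.0 means the two boxes in the plot do not overlap at all; 0.5 means the model ranks pairs no better than chance.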

## Visualization 2: Model Efficiency

Compare model size vs. encoding speed (bubble size = dimensionality):

In [None]:
plt.figure(figsize=(10, 6))
scatter = plt.scatter(df['Size (MB)'], df['Encoding Time (s)'],
                     s=df['Dimensions'], alpha=0.6, c=range(len(df)), 
                     cmap='viridis', edgecolors='black', linewidth=2)

# Add model labels (strip the common prefix/suffix so every label stays unique)
for i, model in enumerate(df['Model']):
    short_name = model.replace('paraphrase-', '').replace('all-', '').replace('-v2', '')
    plt.annotate(short_name, 
                (df['Size (MB)'].iloc[i], df['Encoding Time (s)'].iloc[i]),
                xytext=(8, 8), textcoords='offset points',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='white', alpha=0.7),
                fontsize=9, weight='bold')

plt.xlabel('Model Size (MB)', fontsize=12)
plt.ylabel('Encoding Time (seconds)', fontsize=12)
plt.title('Model Efficiency Trade-offs\n(bubble size = embedding dimensions)', 
         fontsize=14, weight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("✓ Efficiency visualization complete!")
print("\nInterpretation:")
print("  • Bottom-left = Fast and small (efficient)")
print("  • Top-right = Slow and large (resource-intensive)")
print("  • Larger bubbles = higher dimensionality (more detailed embeddings)")
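
Raw encoding time depends on how many sentences were encoded, so throughput is often easier to compare across experiments. The sketch below converts a time into sentences per second; the model names and timings are invented, and 12 is the number of unique sentences in this notebook's test set.

```python
def sentences_per_second(num_sentences, encoding_time_s):
    """Encoding throughput; rejects non-positive times to avoid division by zero."""
    if encoding_time_s <= 0:
        raise ValueError("encoding_time_s must be positive")
    return num_sentences / encoding_time_s

# Hypothetical timings in seconds, for illustration only
timings = {"model-a": 0.25, "model-b": 1.0}

for name, seconds in timings.items():
    print(f"{name}: {sentences_per_second(12, seconds):.1f} sentences/s")
# model-a: 48.0 sentences/s
# model-b: 12.0 sentences/s
```

With only a dozen short sentences, fixed overhead can dominate the measurement, so throughput figures from a larger batch are likely to be more representative.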

## Visualization 3: Performance vs. Resources

Compare accuracy (contrast) vs. total processing time, which includes model loading (colored by model size):

In [None]:
plt.figure(figsize=(10, 6))
scatter = plt.scatter(df['Total Time (s)'], df['Contrast'], 
                     c=df['Size (MB)'], s=250, cmap='plasma', 
                     alpha=0.7, edgecolors='black', linewidth=2)
plt.colorbar(scatter, label='Model Size (MB)')

# Add model labels (strip the common prefix/suffix so every label stays unique)
for i, model in enumerate(df['Model']):
    short_name = model.replace('paraphrase-', '').replace('all-', '').replace('-v2', '')
    plt.annotate(short_name,
                (df['Total Time (s)'].iloc[i], df['Contrast'].iloc[i]),
                xytext=(8, 8), textcoords='offset points',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='white', alpha=0.7),
                fontsize=9, weight='bold')

plt.xlabel('Total Processing Time (seconds)', fontsize=12)
plt.ylabel('Contrast Score (Similar - Dissimilar)', fontsize=12)
plt.title('Performance vs. Resource Requirements', fontsize=14, weight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("✓ Performance comparison complete!")
print("\nInterpretation:")
print("  • Top-left corner = Best (high accuracy, fast processing)")
print("  • Bottom-right = Worst (low accuracy, slow processing)")
print("  • Color indicates model size (brighter = larger model)")
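
Reading "top-left is best" off a scatter plot gets harder as the number of models grows. One way to make it explicit is to keep only the models that no other model beats on both axes (the Pareto front). This is a sketch with invented numbers and a hypothetical helper name, not output from the benchmark above:

```python
def pareto_front(models):
    """Names of models not dominated on (time_s: lower is better, contrast: higher is better)."""
    front = []
    for m in models:
        dominated = any(
            o["time_s"] <= m["time_s"]
            and o["contrast"] >= m["contrast"]
            and (o["time_s"] < m["time_s"] or o["contrast"] > m["contrast"])
            for o in models
        )
        if not dominated:
            front.append(m["name"])
    return front

# Hypothetical results for illustration only
candidates = [
    {"name": "small", "time_s": 1.0, "contrast": 0.40},
    {"name": "medium", "time_s": 2.0, "contrast": 0.45},
    {"name": "large", "time_s": 4.0, "contrast": 0.44},
]

print(pareto_front(candidates))  # ['small', 'medium']
```

In this made-up example "large" drops out because "medium" is both faster and scores higher; the remaining models represent genuine trade-offs.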

## Summary

We've compared four embedding models across multiple dimensions to understand their trade-offs.

### Key Takeaways

1. **Contrast is the headline metric** - The contrast score (average similarity of similar pairs minus average similarity of dissimilar pairs) shows how well a model separates relevant from irrelevant text

2. **Size vs. speed trade-off** - Larger models tend to capture semantics better but require more resources and run slower

3. **Dimensionality matters** - Higher-dimensional embeddings (768d) can capture more nuance than lower-dimensional ones (384d), at higher storage and compute cost

4. **Specialization helps** - Models fine-tuned for specific tasks (e.g., multilingual, QA) typically perform better on those tasks

5. **Test with your data** - These benchmarks use general examples; always test with domain-specific data for best results

### Model Selection Guidelines

**Choose `all-MiniLM-L6-v2` when:**
- Speed and efficiency are critical
- Deploying on edge devices or mobile
- Resource constraints are tight
- Good-enough accuracy is acceptable for general tasks

**Choose `all-mpnet-base-v2` when:**
- Maximum accuracy is required
- Server-side deployment with ample resources
- Handling complex semantic relationships
- Performance matters more than speed

**Choose `paraphrase-multilingual-MiniLM-L12-v2` when:**
- Working with multiple languages (50+)
- Building international applications
- The input language is not known in advance
- Multilingual support is essential

**Choose `all-MiniLM-L12-v2` when:**
- Balanced performance needed
- Middle ground between speed and accuracy
- Moderate resource availability
- Good general-purpose choice

### Decision Framework

When selecting an embedding model, consider:

1. **Accuracy requirements** - How critical is perfect semantic matching?
2. **Inference speed** - What are your latency constraints?
3. **Resource constraints** - What's your memory and compute budget?
4. **Multilingual needs** - Do you need multiple language support?
5. **Deployment target** - Cloud server, edge device, or mobile?
6. **Domain specificity** - Is there a specialized model for your use case?

The best model is the one that meets your requirements with the least resources!
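
That closing rule can be turned into a simple filter-then-minimize step. The sketch below is one possible reading of it, using model size as the resource to minimize; the helper `pick_model`, the thresholds, and all numbers are invented for illustration.

```python
def pick_model(candidates, min_contrast, max_size_mb):
    """Smallest model meeting both requirements, or None if nothing qualifies."""
    eligible = [
        c for c in candidates
        if c["contrast"] >= min_contrast and c["size_mb"] <= max_size_mb
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["size_mb"])["name"]

# Hypothetical measurements for illustration only
candidates = [
    {"name": "small", "size_mb": 90, "contrast": 0.40},
    {"name": "medium", "size_mb": 130, "contrast": 0.45},
    {"name": "large", "size_mb": 420, "contrast": 0.47},
]

print(pick_model(candidates, min_contrast=0.42, max_size_mb=500))  # medium
print(pick_model(candidates, min_contrast=0.50, max_size_mb=500))  # None
```

Returning `None` when nothing qualifies makes it obvious that the requirements, not the models, need revisiting.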