# Understanding Text Embeddings: From Words to Vectors

This notebook explores text embeddings, a fundamental concept in natural language processing. We'll look at how models capture semantic meaning by converting words and sentences into numerical vectors.

## Learning Objectives

By the end of this notebook, you will be able to:
- Understand what text embeddings are and why they're important
- Generate embeddings using pre-trained models
- Measure semantic similarity using cosine similarity
- Visualize high-dimensional embeddings in 2D space
- Apply embeddings to find semantically similar text

## What Are Embeddings?

**Embeddings** are dense numerical representations of data in a continuous vector space where:
- Similar meanings are positioned close together
- Relative positions capture semantic relationships
- Meaning is spread across many dimensions; a single dimension is rarely interpretable on its own
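
To make these properties concrete, here is a minimal sketch using hand-written 3-dimensional vectors. The words, the numbers, and the `euclidean_distance` helper are invented for illustration and do not come from any model; real embeddings are learned and have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional "embeddings"; the values are made up for illustration.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.dist(a, b)

d_cat_dog = euclidean_distance(toy_vectors["cat"], toy_vectors["dog"])
d_cat_car = euclidean_distance(toy_vectors["cat"], toy_vectors["car"])

# Related concepts were placed close together, so their distance is smaller.
print(f"cat-dog distance: {d_cat_dog:.3f}")
print(f"cat-car distance: {d_cat_car:.3f}")
```

In this toy space, "cat" and "dog" sit near each other while "car" is far from both, which is the behavior a trained embedding model aims to produce for real text.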

## Setup: Install Required Libraries

In [None]:
import os
os.environ['UV_LINK_MODE'] = 'copy'

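# Install the required packages (matplotlib, seaborn, and scikit-learn are
# imported below; they are listed unpinned here in case the environment lacks them)
!uv pip install accelerate==1.6.0 sentence-transformers==4.0.2 matplotlib seaborn scikit-learn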

print("✓ Required libraries installed successfully!")

In [None]:
# Import libraries
from sentence_transformers import SentenceTransformer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Set up matplotlib, falling back to the default style if neither name exists
for style in ('seaborn-v0_8-whitegrid', 'seaborn-whitegrid'):
    try:
        plt.style.use(style)
        break
    except OSError:
        continue  # Style not available in this matplotlib version; try the next

plt.rcParams['figure.figsize'] = (10, 7)
np.random.seed(42)  # For reproducibility

print("✓ Libraries imported and configured successfully!")

## Load Embedding Model

We'll use the `all-MiniLM-L6-v2` model:
- Creates 384-dimensional embeddings
- Optimized for semantic similarity tasks
- Fast and efficient for most applications
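
As a rough back-of-the-envelope sketch, the embedding size translates directly into storage cost. The calculation below assumes 4 bytes per value (float32, which you can check against the data type printed later in this notebook) and a hypothetical corpus of one million sentences.

```python
# Assumptions: 384 dimensions (from the model description), 4 bytes per
# float32 value, and a hypothetical corpus size chosen only for illustration.
EMBEDDING_DIM = 384
BYTES_PER_FLOAT32 = 4
NUM_SENTENCES = 1_000_000

bytes_per_embedding = EMBEDDING_DIM * BYTES_PER_FLOAT32
total_bytes = bytes_per_embedding * NUM_SENTENCES

print(f"Bytes per embedding: {bytes_per_embedding}")
print(f"Total for {NUM_SENTENCES:,} sentences: {total_bytes / 1e9:.2f} GB")
```

A smaller embedding dimension keeps both memory use and similarity computations cheap, which is one reason compact models are popular for search workloads.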

In [None]:
# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

print("✓ Model loaded successfully!")
print("  Model name: all-MiniLM-L6-v2")
print(f"  Embedding dimensions: {model.get_sentence_embedding_dimension()}")

## Create and Examine Embeddings

Let's create embeddings for example sentences grouped by topic to see how the model captures semantic similarity.

In [None]:
# Example sentences grouped by topic
sentences = [
    # AI/ML related sentences
    "I love machine learning and artificial intelligence.",
    "AI and ML are fascinating fields of study.",
    
    # Weather related sentences
    "The weather is beautiful today.",
    "It's a sunny day with clear skies.",
    
    # Python related sentences
    "Python is my favorite programming language.",
    "I enjoy coding in Python for data analysis."
]

# Topic labels for visualization
topics = ['AI/ML', 'AI/ML', 'Weather', 'Weather', 'Python', 'Python']

# Display our sentences with their topics
print("Example sentences grouped by topic:\n")
print("=" * 80)
for i, (sentence, topic) in enumerate(zip(sentences, topics)):
    print(f"  {i+1}. [{topic:7}] {sentence}")

In [None]:
# Create embeddings for our sentences
embeddings = model.encode(sentences)

print("✓ Embeddings created successfully!\n")
print("Embedding information:")
print(f"  Shape of each embedding: {embeddings[0].shape}")
print(f"  Number of embeddings: {len(embeddings)}")
print(f"  Data type: {embeddings[0].dtype}")

# Show a snippet of the first embedding
print("\nFirst 10 dimensions of embedding #1:")
print(f"  {embeddings[0][:10]}")
print("\nEmbedding statistics:")
print(f"  Min value: {embeddings[0].min():>8.4f}")
print(f"  Max value: {embeddings[0].max():>8.4f}")
print(f"  Mean value: {embeddings[0].mean():>8.4f}")

## Measure Similarity with Cosine Similarity

**Cosine Similarity** measures the cosine of the angle between two vectors:
- **Range:** -1 (opposite direction) to 1 (identical direction)
- **Interpretation:** Higher values indicate greater semantic similarity
- **Formula:** similarity = (A · B) / (||A|| × ||B||)
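
The formula can be written out directly in plain Python. This is a minimal sketch for intuition only; the `cosine_sim` helper and the sample vectors are illustrative, and the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: (A · B) / (||A|| × ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Three illustrative cases covering the ends and the middle of the range.
same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])

print(f"Same direction: {same_direction:.4f}")
print(f"Orthogonal:     {orthogonal:.4f}")
print(f"Opposite:       {opposite:.4f}")
```

Because the result depends only on the angle, scaling a vector (as in the first case) does not change its similarity to another vector.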

In [None]:
# Calculate cosine similarity between all pairs of embeddings
similarity_matrix = cosine_similarity(embeddings)

print("✓ Cosine similarity matrix calculated!\n")
print("Similarity matrix (6×6):")
print("=" * 80)

# Display with proper formatting
np.set_printoptions(precision=4, suppress=True)
print(similarity_matrix)

print("\n" + "=" * 80)
print("\nKey observations:")
print("  • Diagonal values = 1.0 (each sentence is identical to itself)")
print("  • Same-topic pairs should score noticeably higher than cross-topic pairs")
print("  • Exact values depend on the model; check them against the matrix above")

In [None]:
# Create labels for our heatmap
labels = [f"S{i+1}: {topic}" for i, topic in enumerate(topics)]

# Create a heatmap of the similarity matrix
plt.figure(figsize=(10, 8))
sns.heatmap(similarity_matrix, 
            annot=True, 
            fmt='.3f',
            cmap='viridis', 
            xticklabels=labels, 
            yticklabels=labels,
            cbar_kws={'label': 'Cosine Similarity'})
plt.title('Cosine Similarity Heatmap', fontsize=14, weight='bold')
plt.tight_layout()
plt.show()

print("\n✓ Heatmap visualization complete!")
print("\nHeatmap interpretation:")
print("  • Diagonal (1.0 values) → Each sentence compared with itself")
print("  • Bright blocks → High similarity between sentences on the same topic")
print("  • Dark areas → Low similarity between sentences on different topics")

## Visualize Embeddings in 2D Space

We'll use **PCA (Principal Component Analysis)** to reduce our 384-dimensional embeddings to 2D for visualization while preserving as much variance as possible.
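
As a rough sketch of what PCA does under the hood, the projection can be written as a centered SVD in NumPy. The `pca_project` helper and the random `toy` data are illustrative assumptions, not part of this notebook's pipeline; scikit-learn's `PCA` may return axes with flipped signs, which does not change the geometry.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    # PCA operates on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T
    # Squared singular values are proportional to the variance along each direction.
    variance = S ** 2
    explained_ratio = variance[:n_components] / variance.sum()
    return projected, explained_ratio

# Random stand-in data: 6 points in 10 dimensions (not real embeddings).
rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 10))

points_2d, ratio = pca_project(toy)
print(f"Projected shape: {points_2d.shape}")
print(f"Variance captured by 2 components: {ratio.sum():.2%}")
```

The captured-variance figure is worth checking on real data: when it is low, distances in the 2D plot can be misleading even if the original vectors are well separated.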

In [None]:
# Reduce embeddings to 2 dimensions using PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Set up colors for topics
topic_colors = {'AI/ML': 'red', 'Weather': 'blue', 'Python': 'green'}
colors = [topic_colors[topic] for topic in topics]

# Plot the 2D embeddings
plt.figure(figsize=(12, 8))
for i, (x, y) in enumerate(embeddings_2d):
    plt.scatter(x, y, c=colors[i], s=150, alpha=0.7, edgecolors='black', linewidth=2)
    plt.annotate(f"S{i+1}", 
                xy=(x, y), 
                xytext=(5, 5), 
                textcoords='offset points',
                fontsize=12,
                weight='bold')

# Add a legend
for topic, color in topic_colors.items():
    plt.scatter([], [], c=color, label=topic, s=150, alpha=0.7, edgecolors='black', linewidth=2)
plt.legend(loc='upper right', fontsize=11)

# Add title and labels
plt.title('2D PCA Projection of Sentence Embeddings', fontsize=15, weight='bold')
plt.xlabel(f'Principal Component 1 (Variance: {pca.explained_variance_ratio_[0]:.2%})', fontsize=12)
plt.ylabel(f'Principal Component 2 (Variance: {pca.explained_variance_ratio_[1]:.2%})', fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("✓ 2D visualization complete!\n")
print("PCA results:")
print(f"  Total variance captured: {sum(pca.explained_variance_ratio_):.2%}")
print(f"  PC1 variance: {pca.explained_variance_ratio_[0]:.2%}")
print(f"  PC2 variance: {pca.explained_variance_ratio_[1]:.2%}")
print("\nNotice how sentences on the same topic cluster together in 2D space!")

## Test with New Sentences

Let's test how the model handles new sentences and finds their semantic matches among our original sentences.
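
Finding the single best match generalizes naturally to returning the top k matches. The sketch below shows one way to do that with NumPy on small hand-written vectors; `top_k_matches`, `corpus`, and `query` are hypothetical names and values used only for illustration.

```python
import numpy as np

def top_k_matches(query, corpus, k=2):
    """Return (index, score) pairs for the k corpus rows most similar to query."""
    # After normalizing to unit length, a dot product equals cosine similarity.
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = corpus_unit @ query_unit
    # argsort is ascending, so reverse it and keep the first k indices.
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]

# Hand-written stand-in vectors (not real embeddings).
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.2, 0.0])

matches = top_k_matches(query, corpus, k=2)
for idx, score in matches:
    print(f"Row {idx}: similarity {score:.4f}")
```

Returning several candidates instead of one is the usual starting point for semantic search, where a later step may re-rank or filter the results.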

In [None]:
# Define new sentences
new_sentences = [
    "Deep learning has revolutionized computer vision.",  # AI/ML related
    "The forecast predicts rain for tomorrow.",           # Weather related
    "NumPy and Pandas are essential Python libraries."    # Python related
]

# Create embeddings for the new sentences
new_embeddings = model.encode(new_sentences)

print("✓ New sentence embeddings created!\n")
print("=" * 80)

# Calculate similarity between new and original sentences
similarity_to_original = cosine_similarity(new_embeddings, embeddings)

# Find the most similar original sentence for each new sentence
for i, new_sent in enumerate(new_sentences):
    most_similar_idx = np.argmax(similarity_to_original[i])
    similarity_score = similarity_to_original[i][most_similar_idx]
    
    print(f"\nNew sentence #{i+1}:")
    print(f"  \"{new_sent}\"")
    print(f"\nMost similar original sentence:")
    print(f"  \"{sentences[most_similar_idx]}\"")
    print(f"\nSimilarity score: {similarity_score:.4f}")
    print(f"Topic match: {topics[most_similar_idx]}")
    print("-" * 80)

## Visualize Original and New Sentences Together

Let's see how the new sentences position themselves relative to the original ones in 2D space.
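
Fitting PCA again on the combined set produces new axes, so point positions are not directly comparable with the earlier plot. One alternative, sketched here with NumPy on random stand-in data, is to keep the mean and components fitted on the original points and reuse them for the new ones. The `fit_pca` and `apply_pca` helpers are hypothetical names for illustration.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Return the mean and top principal directions of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def apply_pca(X, mean, components):
    """Project X using a previously fitted mean and set of components."""
    return (X - mean) @ components.T

# Random stand-in data (not real embeddings): 6 original and 3 new points.
rng = np.random.default_rng(42)
original = rng.normal(size=(6, 8))
new = rng.normal(size=(3, 8))

mean, components = fit_pca(original)
original_2d = apply_pca(original, mean, components)
new_2d = apply_pca(new, mean, components)

print(f"Original projection shape: {original_2d.shape}")
print(f"New projection shape: {new_2d.shape}")
```

With this approach the original points stay exactly where they were, at the cost of axes that ignore any variance introduced by the new points.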

In [None]:
# Combine original and new embeddings
all_embeddings = np.vstack([embeddings, new_embeddings])
all_topics = topics + ['AI/ML', 'Weather', 'Python']

# Project to 2D using PCA
pca = PCA(n_components=2)
all_embeddings_2d = pca.fit_transform(all_embeddings)

# Create visualization
plt.figure(figsize=(12, 8))

# Plot original sentences (circles)
for i in range(len(sentences)):
    x, y = all_embeddings_2d[i]
    plt.scatter(x, y, c=topic_colors[all_topics[i]], s=150, alpha=0.7, 
               edgecolors='black', linewidth=2)
    plt.annotate(f"S{i+1}", xy=(x, y), xytext=(5, 5), 
                textcoords='offset points', fontsize=10)

# Plot new sentences (stars)
for i in range(len(sentences), len(sentences) + len(new_sentences)):
    x, y = all_embeddings_2d[i]
    plt.scatter(x, y, c=topic_colors[all_topics[i]], s=200, alpha=0.9, 
               marker='*', edgecolors='black', linewidth=2)
    plt.annotate(f"N{i-len(sentences)+1}", xy=(x, y), xytext=(5, 5), 
                textcoords='offset points', fontsize=10, weight='bold')

# Add a legend
for topic, color in topic_colors.items():
    plt.scatter([], [], c=color, label=topic, s=100, alpha=0.7)
plt.scatter([], [], c='gray', marker='o', s=100, label='Original', alpha=0.7, 
           edgecolors='black', linewidth=2)
plt.scatter([], [], c='gray', marker='*', s=150, label='New', alpha=0.9, 
           edgecolors='black', linewidth=2)
plt.legend(loc='lower right', fontsize=11)

plt.title('PCA Projection: Original and New Sentences', fontsize=15, weight='bold')
plt.xlabel(f'Principal Component 1 (Variance: {pca.explained_variance_ratio_[0]:.2%})', fontsize=12)
plt.ylabel(f'Principal Component 2 (Variance: {pca.explained_variance_ratio_[1]:.2%})', fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("✓ Combined visualization complete!")
print("\nObservation:")
print("  New sentences (stars) should appear near their semantically related")
print("  original sentences, consistent with the model capturing topical similarity.")

## Summary

We've explored the fundamentals of text embeddings and their practical applications.

### Key Takeaways

1. **Embeddings represent meaning** - Text is converted to numerical vectors that capture semantic relationships
2. **Similar meanings cluster together** - Semantically related sentences have higher cosine similarity than unrelated ones
3. **Dimensionality matters** - Our model uses 384 dimensions to capture nuanced meaning
4. **Visualization aids understanding** - PCA helps us see high-dimensional relationships in 2D
5. **Transferability** - The model embeds sentences it has not seen before, so new text can be matched against existing sentences by similarity

### Real-World Applications

Embeddings power many modern AI applications:

1. **Semantic Search** - Finding documents based on meaning rather than just keywords
2. **Document Clustering** - Automatically grouping similar documents together
3. **Recommendation Systems** - Suggesting similar items based on semantic content
4. **Question Answering** - Finding relevant information to answer queries
5. **Retrieval Augmented Generation (RAG)** - Combining LLMs with knowledge bases using embeddings

### Next Steps

The techniques learned here form the foundation for:
- Building semantic search systems
- Creating document retrieval pipelines
- Implementing RAG systems for enhanced LLM responses
- Developing intelligent recommendation engines

Text embeddings bridge the gap between human language and machine understanding, enabling AI systems to work with meaning rather than just syntax.