# Getting Started With Text Embeddings

## A Complete Guide to Understanding and Using Embeddings in AI

In this tutorial, you'll learn:
- What text embeddings are and why they matter
- How to generate embeddings using SentenceTransformer
- How to measure similarity between texts
- How to build a practical semantic search engine

---

## Setup and Installation

First, let's install the required libraries.

In [None]:
# Install required packages
!pip install sentence-transformers scikit-learn numpy pandas matplotlib seaborn

## Import Libraries

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

# For nice visualizations
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

# Load the embedding model (downloads on first run)
model = SentenceTransformer('all-mpnet-base-v2')

print("✅ All libraries imported successfully!")

---

## 📚 What Are Text Embeddings?

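**Text embeddings** are numerical representations of text that capture semantic meaning: each piece of text is mapped to a fixed-length vector of floating-point numbers.

**Key insight**: Texts with similar meanings get vectors that point in similar directions!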

---

## Helper Function to Get Embeddings

In [None]:
def get_embedding(text, model_instance=model):
    """
    Get embedding vector for a given text using SentenceTransformer.
    
    Args:
        text: String to embed
        model_instance: SentenceTransformer model instance
    
    Returns:
        List of floats (the embedding vector)
    """
    text = text.replace("\n", " ")  # Replace newlines with spaces
    embedding = model_instance.encode(text, convert_to_tensor=False)
    return embedding.tolist()

---

## Example: Your First Embedding - A Single Word

Let's convert the word "rocket" into an embedding!

In [None]:
word = "rocket"
embedding = get_embedding(word)

print(f"Word: '{word}'")
print(f"Embedding dimension: {len(embedding)}")
print(f"\nFirst 15 values of the embedding:")
print(embedding[:15])
print("\n📝 Note: all-mpnet-base-v2 produces 768-dimensional embeddings")

### Visualize the Embedding

Let's look at what these numbers look like!

In [None]:
# Visualize part of the embedding
plt.figure(figsize=(14, 4))
plt.plot(embedding[:100], marker='o', markersize=2, linewidth=0.5)
plt.title(f"First 100 Dimensions of '{word}' Embedding", fontsize=14, fontweight='bold')
plt.xlabel("Dimension")
plt.ylabel("Value")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

## 📝 Example: Embedding Complete Sentences

Embeddings work for entire sentences too! The model encodes the meaning of the whole sentence into a single vector with the same number of dimensions as before.

In [None]:
sentence = "Artificial intelligence is revolutionizing healthcare"
embedding = get_embedding(sentence)

print(f"Sentence: '{sentence}'")
print(f"Embedding dimension: {len(embedding)}")
print(f"\nFirst 15 values:")
print(embedding[:15])

---

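## 🎯 Understanding Cosine Similarity

**Cosine similarity** measures how "close" two embeddings are by comparing the directions of the two vectors and ignoring their lengths. Scores range from -1 (opposite directions) to 1 (same direction), with 0 meaning the vectors are perpendicular.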

### Visual Analogy:
```
Vector A →
          \  Small angle = High similarity (close to 1.0)
           → Vector B

Vector A ↑
         |
         |  Large angle = Low similarity (close to 0.0)
         |________→ Vector C
```

### The Math:
```
similarity = cos(angle) = (A · B) / (||A|| × ||B||)
```
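The formula above can be checked by hand with NumPy. This is a minimal sketch on made-up 2-D vectors (not real model output); `cosine_sim` is a hypothetical helper name, and for non-zero vectors it should agree with scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors: (a . b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors chosen by hand for illustration
same_direction = cosine_sim([1, 2], [2, 4])    # parallel vectors
perpendicular = cosine_sim([1, 0], [0, 1])     # 90 degree angle
opposite = cosine_sim([1, 2], [-1, -2])        # 180 degree angle

print(same_direction, perpendicular, opposite)
```

Because only direction matters, scaling a vector (here `[1, 2]` versus `[2, 4]`) does not change the score.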

---

## Helper Function: Calculate Similarity

In [None]:
def calculate_similarity(text1, text2):
    """Calculate cosine similarity between two texts"""
    emb1 = np.array(get_embedding(text1)).reshape(1, -1)
    emb2 = np.array(get_embedding(text2)).reshape(1, -1)
    return cosine_similarity(emb1, emb2)[0][0]

---

## 🔍 Example: Comparing Similar vs Different Topics

In [None]:
sentence_a = "I adopted a golden retriever puppy yesterday"
sentence_b = "My new dog is very playful and energetic"
sentence_c = "The stock market crashed last Monday"

sim_ab = calculate_similarity(sentence_a, sentence_b)
sim_ac = calculate_similarity(sentence_a, sentence_c)
sim_bc = calculate_similarity(sentence_b, sentence_c)

print("Sentences:")
print(f"A: '{sentence_a}'")
print(f"B: '{sentence_b}'")
print(f"C: '{sentence_c}'")
print("\n" + "="*70)
print(f"Similarity A ↔ B: {sim_ab:.4f}  (both about dogs)")
print(f"Similarity A ↔ C: {sim_ac:.4f}  (dogs vs stocks - unrelated)")
print(f"Similarity B ↔ C: {sim_bc:.4f}  (dogs vs stocks - unrelated)")
print("="*70)

---

## Example: Similarity Heatmap

Let's visualize similarities between multiple sentences at once!

In [None]:
sentences = [
    "Machine learning models need training data",
    "Neural networks require lots of examples",
    "I'm planning a vacation to Hawaii",
    "My trip to the beach is next month",
    "Climate change affects ocean temperatures"
]

# Calculate all pairwise similarities
n = len(sentences)
similarity_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(n):
        if i == j:
            similarity_matrix[i][j] = 1.0
        elif i < j:
            sim = calculate_similarity(sentences[i], sentences[j])
            similarity_matrix[i][j] = sim
            similarity_matrix[j][i] = sim  # Symmetric

# Create heatmap
plt.figure(figsize=(11, 9))
sns.heatmap(
    similarity_matrix,
    annot=True,
    fmt='.3f',
    cmap='YlOrRd',
    xticklabels=[f"S{i+1}" for i in range(n)],
    yticklabels=[f"S{i+1}" for i in range(n)],
    vmin=0,
    vmax=1,
    cbar_kws={'label': 'Similarity Score'},
    square=True
)
plt.title("Sentence Similarity Heatmap\n", fontsize=16, fontweight='bold')
plt.tight_layout()

print("\n📋 Sentence Reference:")
for i, sent in enumerate(sentences):
    print(f"S{i+1}: {sent}")

plt.show()

print("\n💡 Notice:")
print("- S1 & S2 are similar (both about ML)")
print("- S3 & S4 are similar (both about travel)")
print("- S5 is different from all others")

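The nested loop above calls `calculate_similarity` once per pair, which re-embeds each sentence several times. A possible alternative, sketched here on hand-made toy vectors rather than real embeddings, is to embed each sentence once, normalize the rows, and get every pairwise score from a single matrix product. `pairwise_cosine` is a hypothetical helper name.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Return the (n, n) cosine similarity matrix for n row vectors."""
    X = np.asarray(embeddings, dtype=float)
    # Scale each row to unit length so a dot product equals the cosine
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

# Toy stand-ins for sentence embeddings (made up for illustration)
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
]
matrix = pairwise_cosine(toy_embeddings)
print(np.round(matrix, 3))
```

With the notebook's model, the input would be the array returned by `model.encode(sentences)`; scikit-learn's `cosine_similarity(embeddings)` computes the same matrix when given a single argument.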
---

## 🔎 Real-World Application: Semantic Search Engine

Let's build a simple search engine that finds documents by **MEANING**, not just keywords!

In [None]:
# Our document database
documents = [
    "How to train a neural network for image classification",
    "Best beaches to visit in summer for families",
    "Understanding backpropagation in deep learning",
    "Top hiking trails in the Rocky Mountains",
    "Introduction to convolutional neural networks",
    "Planning a road trip across Europe",
    "Gradient descent optimization techniques",
    "Cooking Italian pasta dishes at home",
    "Transfer learning with pretrained models",
    "Growing tomatoes in your backyard garden"
]

print("📚 Document Database:")
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc}")

In [None]:
def semantic_search(query, docs, top_k=3):
    """
    Search documents by semantic similarity
    
    Args:
        query: Search string
        docs: List of documents
        top_k: Number of results to return
        
    Returns:
        List of (doc, score) tuples
    """
    query_emb = np.array(get_embedding(query)).reshape(1, -1)
    
    scores = []
    for doc in docs:
        doc_emb = np.array(get_embedding(doc)).reshape(1, -1)
        score = cosine_similarity(query_emb, doc_emb)[0][0]
        scores.append((doc, score))
    
    # Sort by score (highest first)
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

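`semantic_search` re-embeds every document on each query. For a fixed collection, one common pattern is to embed the documents once and reuse the resulting matrix. The sketch below assumes that approach; `build_index`, `search_index`, and the `encode` parameter are hypothetical names, and `toy_encode` is a crude word-count stand-in used only so the example runs without downloading a model. In this notebook you would pass `model.encode` instead.

```python
import numpy as np

def build_index(docs, encode):
    """Embed all documents once and store unit-length vectors."""
    vectors = np.asarray(encode(docs), dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)  # avoid division by zero

def search_index(query, docs, index, encode, top_k=3):
    """Return the top_k (doc, score) pairs by cosine similarity."""
    q = np.asarray(encode([query]), dtype=float)[0]
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = index @ q                        # one dot product per document
    best = np.argsort(scores)[::-1][:top_k]   # indices, highest score first
    return [(docs[i], float(scores[i])) for i in best]

# Stand-in encoder (assumption): counts occurrences of a tiny fixed vocabulary
VOCAB = ["neural", "network", "beach", "summer", "pasta"]

def toy_encode(texts):
    return np.array([[t.lower().count(w) for w in VOCAB] for t in texts])

toy_docs = [
    "Training a neural network",
    "Beach trips for the summer",
    "Cooking pasta at home",
]
index = build_index(toy_docs, toy_encode)
results = search_index("neural network basics", toy_docs, index, toy_encode, top_k=2)
for doc, score in results:
    print(f"{score:.4f}  {doc}")
```

Since the stored vectors are already unit length, each query costs one encoding call plus a matrix-vector product, regardless of how many times the collection is searched.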
### Test the Semantic Search Engine

In [None]:
queries = [
    "deep learning and AI",
    "outdoor activities and nature",
    "food and recipes"
]

for query in queries:
    print(f"\n{'='*75}")
    print(f"🔍 SEARCH QUERY: '{query}'")
    print('='*75)
    
    results = semantic_search(query, documents, top_k=3)
    
    for rank, (doc, score) in enumerate(results, 1):
        print(f"\n{rank}. [Score: {score:.4f}]")
        print(f"   {doc}")

---

## Example: Question Similarity (FAQ Matching)

Useful for customer support chatbots!

In [None]:
faq_questions = [
    "How do I reset my password?",
    "What are your shipping costs?",
    "Can I return an item?",
    "How long does delivery take?",
    "What payment methods do you accept?"
]

user_questions = [
    "I forgot my login credentials",
    "How much does shipping cost?",
    "I want to send back a product"
]

print("FAQ Database:\n")
for i, q in enumerate(faq_questions, 1):
    print(f"{i}. {q}")

print("\n" + "="*75)

for user_q in user_questions:
    print(f"\n👤 User asks: '{user_q}'")
    
    # Find best matching FAQ
    best_match = None
    best_score = float("-inf")  # cosine similarity can be negative
    
    for faq_q in faq_questions:
        score = calculate_similarity(user_q, faq_q)
        if score > best_score:
            best_score = score
            best_match = faq_q
    
    print(f"🤖 Best match: '{best_match}'")
    print(f"   Confidence: {best_score:.4f}")
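The loop above always returns some FAQ entry, even when nothing is a good match. A chatbot would usually want a fallback in that case. The sketch below assumes the similarity scores have already been computed; `pick_faq` and the `min_score` cutoff of 0.5 are hypothetical and would need tuning on real data.

```python
def pick_faq(scores_by_question, min_score=0.5):
    """
    Pick the best FAQ match, or None if no score reaches min_score.

    scores_by_question: dict mapping FAQ question -> similarity score
    min_score: illustrative cutoff (an assumption, not a recommended value)
    """
    if not scores_by_question:
        return None, None
    best_q = max(scores_by_question, key=scores_by_question.get)
    best_score = scores_by_question[best_q]
    if best_score < min_score:
        return None, best_score  # caller can fall back, e.g. to a human agent
    return best_q, best_score

# Made-up scores for illustration only
confident = pick_faq({"How do I reset my password?": 0.71,
                      "Can I return an item?": 0.12})
unsure = pick_faq({"How do I reset my password?": 0.21,
                   "Can I return an item?": 0.18})
print(confident)
print(unsure)
```

Returning the score alongside `None` lets the caller log near misses and adjust the cutoff later.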