# Introduction to Embeddings

This notebook explores **text embeddings**: numerical vectors that represent text so that texts with similar meaning map to nearby vectors. They underpin applications such as semantic search, retrieval-augmented generation (RAG), and clustering.

In [None]:
%pip install sentence-transformers numpy

## 1. Generating Embeddings

We will use the `sentence-transformers` library, which provides pretrained models for generating sentence embeddings. The model here, `all-MiniLM-L6-v2`, is small and fast, and maps each sentence to a 384-dimensional vector.

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Our corpus of sentences
sentences = [
    "The cat sits outside",
    "A man is playing guitar",
    "The new movie is awesome",
    "The dog is in the garden",
    "I love pasta and pizza"
]

# Generate embeddings
embeddings = model.encode(sentences)

print(f"Shape of embeddings: {embeddings.shape}")
# Output should be (5, 384) because 'all-MiniLM-L6-v2' creates 384-dimensional vectors.

## 2. Understanding Vectors

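Let's look at a single embedding. It is a one-dimensional NumPy array of 384 floating-point numbers; the individual values are not meaningful on their own, only in comparison with other embeddings.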

In [None]:
# Inspect the first embedding
print(f"First sentence: '{sentences[0]}'")
print(f"First 10 dimensions of the vector: {embeddings[0][:10]}")

## 3. Measuring Similarity (Cosine Similarity)

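To determine how similar two sentences are, we compare their vectors. The most common metric is **cosine similarity**, the cosine of the angle between two vectors. It ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction) and depends only on direction, not on vector length.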

Formula:
$$ \text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} $$

where $A \cdot B$ is the dot product and $\|A\|$ is the magnitude (norm).
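
As a quick worked example with made-up 3-dimensional vectors (chosen for illustration, not produced by the model), take $A = (1, 2, 2)$ and $B = (2, 0, 1)$:

$$ A \cdot B = 1 \cdot 2 + 2 \cdot 0 + 2 \cdot 1 = 4, \qquad \|A\| = \sqrt{1 + 4 + 4} = 3, \qquad \|B\| = \sqrt{4 + 0 + 1} = \sqrt{5} $$

$$ \cos(\theta) = \frac{4}{3\sqrt{5}} \approx 0.596 $$

Scaling either vector, say replacing $B$ with $10B$, multiplies the numerator and the denominator by the same factor, so the similarity is unchanged.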

In [None]:
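def cosine_similarity(v1, v2):
    """Return the cosine of the angle between vectors v1 and v2."""
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    # Not defined for an all-zero vector (the denominator would be 0)
    return dot_product / (norm_v1 * norm_v2)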

# Compare "The cat sits outside" with "The dog is in the garden"
sim_cat_dog = cosine_similarity(embeddings[0], embeddings[3])
print(f"Similarity (Cat vs Dog): {sim_cat_dog:.4f}")

# Compare "The cat sits outside" with "A man is playing guitar"
sim_cat_guitar = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity (Cat vs Guitar): {sim_cat_guitar:.4f}")

You should observe that the cat and dog sentences score higher than the cat and guitar sentences: the first pair both describe an animal outdoors, while the guitar sentence shares no topic with the cat sentence.
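
When many pairs need comparing, calling the function once per pair gets tedious. A minimal sketch, assuming the embeddings are stacked row-wise in a 2-D NumPy array: normalize each row to unit length, after which a single matrix product gives every pairwise cosine similarity. The `toy` array and the helper name `pairwise_cosine` are made up for illustration and stand in for real model output.

```python
import numpy as np

def pairwise_cosine(matrix):
    """Return the (n, n) matrix of cosine similarities between the rows of `matrix`."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms  # assumes no all-zero rows
    return unit @ unit.T

# Toy 3-D vectors (hypothetical, not model output)
toy = np.array([
    [1.0, 2.0, 2.0],
    [2.0, 4.0, 4.0],   # same direction as row 0, twice as long
    [2.0, 0.0, 1.0],
])

sim_matrix = pairwise_cosine(toy)
print(np.round(sim_matrix, 3))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 despite the different lengths. In this notebook, `pairwise_cosine(embeddings)` would return a `(5, 5)` matrix whose entry `[0, 3]` should agree with `sim_cat_dog` up to floating-point rounding.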

## 4. Simple Semantic Search

Now let's implement a basic search. We will take a query, embed it with the same model, and rank every sentence in our list by its similarity to the query.

In [None]:
query = "food"
query_embedding = model.encode([query])[0]

# Calculate similarity with all sentences
scores = []
for sentence, sent_emb in zip(sentences, embeddings):
    score = cosine_similarity(query_embedding, sent_emb)
    scores.append((score, sentence))

# Sort by score descending
scores.sort(key=lambda x: x[0], reverse=True)

print(f"Query: '{query}'")
print("Top matches:")
for score, sentence in scores:
    print(f"{score:.4f} | {sentence}")
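
For a larger corpus, one usually wants only the best few matches rather than the full ranking. The sketch below assumes the same row-wise layout and uses `np.argsort` to pick the top `k`. The helper `top_k`, the toy vectors, and their labels are hypothetical and only stand in for `embeddings` and `sentences`.

```python
import numpy as np

def top_k(query_vec, matrix, k=2):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query_vec`."""
    query_vec = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Cosine similarity of the query against every row at once
    sims = (matrix @ query_vec) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    )
    # argsort is ascending, so negate the similarities to get the highest first
    best = np.argsort(-sims)[:k]
    return best, sims[best]

# Toy 2-D "embeddings" with made-up labels
labels = ["east", "north", "north-east", "west"]
vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])

top_idx, top_scores = top_k([2.0, 0.1], vectors, k=2)
for i, s in zip(top_idx, top_scores):
    print(f"{s:.4f} | {labels[i]}")
```

With the notebook's variables, the equivalent call would be `top_k(query_embedding, embeddings, k=3)`, followed by indexing `sentences` with the returned indices.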