# 03 - Embeddings & Vector Search

**Understand semantic search and vector databases.**

## Learning Objectives

By the end of this notebook, you will:
- Understand what embeddings are
- Create embeddings with different models
- Implement similarity search
- Use vector databases

## Table of Contents

1. [What are Embeddings?](#what)
2. [Creating Embeddings](#creating)
3. [Similarity Search](#similarity)
4. [Vector Databases](#vectordb)
5. [Practical Applications](#applications)
6. [Exercises](#exercises)
7. [Checkpoint](#checkpoint)

In [None]:
# GUIDED: Setup
import os
import sys
from pathlib import Path

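# Make the project root importable so that `from src...` imports work
sys.path.append(str(Path.cwd().parent))

from dotenv import load_dotenv
# Load API keys (e.g., OPENAI_API_KEY) from the project's .env file
load_dotenv(Path.cwd().parent / ".env")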

print("Setup complete!")

---
## 1. What are Embeddings? <a id='what'></a>

**Embeddings** convert text into numerical vectors that capture semantic meaning.

```
Text                    → Embedding (vector)
───────────────────────────────────────────────
"I love dogs"           → [0.2, -0.5, 0.8, ...]
"I adore puppies"       → [0.21, -0.48, 0.79, ...]  ← Similar!
"The weather is nice"   → [-0.3, 0.7, 0.1, ...]     ← Different
```

### Key Properties:
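- **Fixed size**: A given model maps every text to a vector of the same length (e.g., 1536 dimensions for `text-embedding-3-small`)
- **Semantic meaning**: Texts with similar meaning map to nearby vectors
- **Measurable**: The distance or similarity between two vectors can be computed (e.g., with cosine similarity)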
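
The "measurable" property is usually made concrete with cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below applies it to the truncated three-value vectors from the diagram above. These are illustrative numbers, not real model output, and `cosine` is a hypothetical helper name, not the function in `src.embedding_utils`.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors taken from the diagram (not real embeddings)
dogs = [0.2, -0.5, 0.8]
puppies = [0.21, -0.48, 0.79]
weather = [-0.3, 0.7, 0.1]

sim_close = cosine(dogs, puppies)
sim_far = cosine(dogs, weather)
print(f"dogs vs puppies: {sim_close:.3f}")
print(f"dogs vs weather: {sim_far:.3f}")
```

The two near-identical vectors score close to 1, while the unrelated one scores much lower.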

---
## 2. Creating Embeddings <a id='creating'></a>

In [None]:
# GUIDED: Create embeddings with OpenAI
from openai import OpenAI

client = OpenAI()

# Single text
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello, how are you?"
)

embedding = response.data[0].embedding
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")

In [None]:
# GUIDED: Batch embeddings
texts = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "Python is a programming language",
    "Machine learning is a subset of AI"
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)

embeddings = [d.embedding for d in response.data]
print(f"Created {len(embeddings)} embeddings")
print(f"Each has {len(embeddings[0])} dimensions")

In [None]:
# GUIDED: Use our embedding utility
from src.embedding_utils import EmbeddingModel

embedder = EmbeddingModel(provider="openai", model="text-embedding-3-small")

# Single embedding
vec = embedder.embed("Hello world")
print(f"Single embedding: {len(vec)} dimensions")

# Batch embedding
vecs = embedder.embed_batch(["First text", "Second text", "Third text"])
print(f"Batch: {len(vecs)} embeddings")
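
When a corpus is larger than one request can comfortably carry, a common pattern is to split it into chunks and embed each chunk in turn. The following is a minimal sketch of that idea. `chunked`, `embed_in_batches`, and the default `batch_size` are hypothetical names and values chosen for illustration, and the stand-in embedder exists only so the example runs offline.

```python
from typing import Callable, Iterator, List

def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Call `embed_batch` once per chunk and concatenate the results in order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Stand-in embedder (assumption): one fake 1-dimensional vector per text
def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]

texts_demo = [f"document {i}" for i in range(250)]
vectors_demo = embed_in_batches(texts_demo, fake_embed_batch, batch_size=100)
print(f"Embedded {len(vectors_demo)} texts")
```

With the utility from the cell above, the same helper could be called as `embed_in_batches(texts, embedder.embed_batch)`.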

---
## 3. Similarity Search <a id='similarity'></a>

In [None]:
# GUIDED: Calculate cosine similarity
from src.embedding_utils import cosine_similarity, EmbeddingModel

embedder = EmbeddingModel(provider="openai", model="text-embedding-3-small")

# Create embeddings for comparison
texts = [
    "I love programming in Python",
    "Python coding is my passion",
    "The weather today is sunny",
    "JavaScript is also a great language"
]

embeddings = embedder.embed_batch(texts)

# Compare first text to all others
query = embeddings[0]
print(f"Query: '{texts[0]}'\n")

for text, emb in zip(texts[1:], embeddings[1:]):
    similarity = cosine_similarity(query, emb)
    print(f"{similarity:.3f} - {text}")

In [None]:
# GUIDED: Build a simple search engine
from src.embedding_utils import similarity_search, EmbeddingModel

embedder = EmbeddingModel(provider="openai", model="text-embedding-3-small")

# Our "documents"
documents = [
    "Python is a versatile programming language used for web development, data science, and AI.",
    "Machine learning algorithms learn patterns from data to make predictions.",
    "Neural networks are inspired by the human brain's structure.",
    "Data visualization helps communicate insights from complex datasets.",
    "APIs allow different software applications to communicate with each other.",
    "Cloud computing provides on-demand access to computing resources.",
    "Cybersecurity protects systems and networks from digital attacks.",
    "Docker containers package applications with their dependencies."
]

# Create embeddings for all documents
doc_embeddings = embedder.embed_batch(documents)

# Search function
def search(query: str, top_k: int = 3):
    query_embedding = embedder.embed(query)
    results = similarity_search(
        query_embedding, 
        doc_embeddings, 
        documents, 
        top_k=top_k
    )
    return results

# Test searches
print("Search: 'How do machines learn?'\n")
for text, score in search("How do machines learn?"):
    print(f"{score:.3f}: {text[:60]}...")
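
To make the ranking step less of a black box, here is a sketch of how a helper like `similarity_search` might work: score every document against the query with cosine similarity, then keep the `top_k` highest. This is an assumption about the approach, not the actual code in `src.embedding_utils`. `top_k_search` and the two-dimensional toy vectors are made up for illustration.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_search(query_vec, doc_vecs, docs, top_k=3):
    """Return the top_k (document, score) pairs, highest cosine score first."""
    scored = ((doc, cosine(query_vec, vec)) for doc, vec in zip(docs, doc_vecs))
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])

# Toy data (not real embeddings)
toy_docs = ["about cats", "about dogs", "about taxes"]
toy_vecs = [[0.9, 0.1], [0.8, 0.3], [-0.2, 0.9]]
toy_query = [1.0, 0.0]

toy_results = top_k_search(toy_query, toy_vecs, toy_docs, top_k=2)
for doc, score in toy_results:
    print(f"{score:.3f}: {doc}")
```

This brute-force scan compares the query against every document, which is fine for a handful of texts. Vector databases exist largely to avoid that full scan on large collections.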

---
## 4. Vector Databases <a id='vectordb'></a>

In [None]:
# GUIDED: Use ChromaDB
import chromadb

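# Create client (in-memory). Named chroma_client so it does not
# overwrite the OpenAI `client` created earlier in the notebook.
chroma_client = chromadb.Client()

# Create a collection (get_or_create avoids an error if this cell is re-run)
collection = chroma_client.get_or_create_collection(
    name="demo_collection",
    metadata={"hnsw:space": "cosine"}  # Use cosine distance
)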

# Add documents (ChromaDB embeds them with its default embedding function,
# not the OpenAI model used in the earlier cells)
collection.add(
    documents=[
        "Python is great for data science",
        "JavaScript runs in the browser",
        "Machine learning predicts outcomes",
        "APIs connect different services"
    ],
    ids=["doc1", "doc2", "doc3", "doc4"],
    metadatas=[
        {"category": "programming"},
        {"category": "programming"},
        {"category": "ai"},
        {"category": "architecture"}
    ]
)

print(f"Collection has {collection.count()} documents")

In [None]:
# GUIDED: Query ChromaDB
# Basic search
results = collection.query(
    query_texts=["What language is good for AI?"],
    n_results=3
)

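print("Top results:")
# With "hnsw:space": "cosine", ChromaDB returns cosine distance,
# so 1 - distance recovers the cosine similarity.
for doc, distance in zip(results["documents"][0], results["distances"][0]):
    print(f"  {1 - distance:.3f}: {doc}")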

In [None]:
# GUIDED: Search with metadata filter
results = collection.query(
    query_texts=["programming concepts"],
    n_results=3,
    where={"category": "programming"}  # Filter by metadata
)

print("Programming results only:")
for doc in results["documents"][0]:
    print(f"  - {doc}")

In [None]:
# GUIDED: Use our SimpleVectorStore
from src.embedding_utils import EmbeddingModel, SimpleVectorStore

embedder = EmbeddingModel(provider="openai", model="text-embedding-3-small")
store = SimpleVectorStore(embedding_model=embedder)

# Add documents
store.add(
    texts=[
        "The quick brown fox jumps over the lazy dog",
        "A fast auburn fox leaps above a sleepy canine",
        "Python is used for machine learning",
        "Data science requires statistical knowledge"
    ],
    metadata=[
        {"type": "sentence"},
        {"type": "sentence"},
        {"type": "tech"},
        {"type": "tech"}
    ]
)

# Search
results = store.search("fox jumping", k=2)
print("Search results:")
for r in results:
    print(f"  {r['score']:.3f}: {r['text']}")
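
For intuition about what an in-memory vector store does, the sketch below stores `(text, vector, metadata)` rows, ranks them by cosine similarity, and optionally filters on metadata before ranking. It is a hypothetical `TinyVectorStore`, not the project's `SimpleVectorStore`. The `where` argument and the word-count `toy_embed` function are assumptions added so the example runs without an API key.

```python
import math

def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

class TinyVectorStore:
    """Hypothetical store: embeds texts on add, ranks by cosine on search."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # callable: str -> list of floats
        self.rows = []            # each row: (text, vector, metadata)

    def add(self, texts, metadata=None):
        metadata = metadata or [{} for _ in texts]
        for text, meta in zip(texts, metadata):
            self.rows.append((text, self.embed_fn(text), meta))

    def search(self, query, k=3, where=None):
        q = self.embed_fn(query)
        hits = []
        for text, vec, meta in self.rows:
            # Skip rows whose metadata does not match every key/value in `where`
            if where and any(meta.get(key) != val for key, val in where.items()):
                continue
            hits.append({"text": text, "score": _cosine(q, vec), "metadata": meta})
        hits.sort(key=lambda h: h["score"], reverse=True)
        return hits[:k]

# Toy embedder (assumption): counts occurrences of a tiny fixed vocabulary
VOCAB = ["fox", "dog", "python", "data"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

tiny = TinyVectorStore(embed_fn=toy_embed)
tiny.add(
    texts=["the quick fox and the lazy dog", "python for data work", "a fox in the snow"],
    metadata=[{"type": "animal"}, {"type": "tech"}, {"type": "animal"}],
)

fox_hits = tiny.search("fox", k=2)
tech_hits = tiny.search("python data", k=3, where={"type": "tech"})
for hit in fox_hits + tech_hits:
    print(f"{hit['score']:.3f}: {hit['text']}")
```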

---
## 5. Practical Applications <a id='applications'></a>

In [None]:
# GUIDED: Semantic FAQ matching
from src.embedding_utils import EmbeddingModel, SimpleVectorStore

embedder = EmbeddingModel(provider="openai", model="text-embedding-3-small")
faq_store = SimpleVectorStore(embedding_model=embedder)

# FAQ database
faqs = [
    {
        "question": "How do I reset my password?",
        "answer": "Go to Settings > Security > Reset Password."
    },
    {
        "question": "What payment methods do you accept?",
        "answer": "We accept Visa, Mastercard, and PayPal."
    },
    {
        "question": "How can I contact support?",
        "answer": "Email support@example.com or call 1-800-SUPPORT."
    },
    {
        "question": "What is your refund policy?",
        "answer": "Full refund within 30 days, no questions asked."
    }
]

# Index FAQs
faq_store.add(
    texts=[f["question"] for f in faqs],
    metadata=[{"answer": f["answer"]} for f in faqs]
)

# Find matching FAQ
def find_faq(user_question: str):
    results = faq_store.search(user_question, k=1)
    # 0.7 is an illustrative cutoff; cosine scores vary by embedding model,
    # so tune it against real user questions.
    if results and results[0]["score"] > 0.7:
        return results[0]["metadata"]["answer"]
    return "Sorry, I couldn't find a matching FAQ."

# Test
queries = [
    "I forgot my password",
    "Can I pay with credit card?",
    "I want my money back"
]

for q in queries:
    print(f"Q: {q}")
    print(f"A: {find_faq(q)}\n")
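
The score cutoff in `find_faq` decides whether a user gets an answer or the fallback message, so it is worth seeing how sensitive the outcome is to that number. The sketch below uses a made-up search result (the 0.62 score is invented, not measured) shaped like the result dictionaries in the cell above. `answer_or_fallback` is a hypothetical helper.

```python
FALLBACK = "Sorry, I couldn't find a matching FAQ."

def answer_or_fallback(results, threshold):
    """Return the best answer if its score exceeds `threshold`, else the fallback."""
    if results and results[0]["score"] > threshold:
        return results[0]["metadata"]["answer"]
    return FALLBACK

# Made-up result for illustration only
fake_results = [
    {"score": 0.62, "metadata": {"answer": "Full refund within 30 days, no questions asked."}}
]

for threshold in (0.5, 0.7):
    print(f"threshold={threshold}: {answer_or_fallback(fake_results, threshold)}")
```

A cutoff that is too high turns good matches into fallbacks, and one that is too low returns confident-looking wrong answers. Printing the raw scores for a few real queries is a reasonable way to pick a starting value.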

---
## 6. Exercises <a id='exercises'></a>

### Exercise 1: Compare Embedding Models

Compare results from different embedding models.

In [None]:
# TODO: Compare text-embedding-3-small vs text-embedding-3-large

# Your code here:


### Exercise 2: Build a Document Search

Create a search system for a set of documents.

In [None]:
# TODO: Index multiple documents and implement search with ranking

# Your code here:


---
## 7. Checkpoint <a id='checkpoint'></a>

Before moving on, verify:

- [ ] You understand what embeddings are
- [ ] You can create embeddings with OpenAI
- [ ] You implemented similarity search
- [ ] You used a vector database

### Next Steps

In the next notebook, we'll build a complete **RAG System** that combines retrieval with LLM generation!