# Vectors and Embeddings - Semantic Search Foundation

## Overview

This notebook demonstrates **vector embeddings** and **semantic search**, the core technology behind RAG systems. Key topics:

1. **Embeddings Fundamentals** - Converting text to numerical vectors
2. **Semantic Similarity** - Measuring meaning, not just keywords
3. **Vector Databases** - Efficient storage and retrieval
4. **Similarity Search** - Finding relevant documents by meaning

## Why This Matters

### The Problem with Keyword Search
- **"dogs" vs "canines"**: Same meaning, different words → keyword search fails
- **Synonyms**: Traditional search can't understand semantic equivalence
- **Context**: Word meaning depends on surrounding text
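
The keyword problem can be made concrete with a few lines of plain Python. This is only a sketch: `keyword_overlap` is a hypothetical helper (a simple Jaccard overlap of word sets), not part of any search library, but it shows why purely lexical matching gives synonyms no credit.

```python
def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

# Synonyms share no characters, so a keyword matcher sees no relation at all
print(keyword_overlap("dogs", "canines"))                # 0.0

# The only overlap comes from the filler words "i" and "like"
print(keyword_overlap("i like dogs", "i like canines"))  # 0.5
```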

### The Solution: Embeddings
- **Semantic vectors**: Represent meaning as numerical coordinates
- **Similar meanings** → similar vectors → high similarity score
- **Scalable search**: Find relevant content in milliseconds from millions of documents

## Real-World Applications

| Use Case | How Embeddings Help |
|----------|---------------------|
| **RAG Systems** | Find relevant context for LLM |
| **Semantic Search** | "Find documents about X" (not just containing "X") |
| **Recommendation** | "More like this" based on content |
| **Clustering** | Group similar documents automatically |
| **Anomaly Detection** | Find unusual/outlier content |

## Key Concepts

### Embedding
A **dense vector** (1536 dimensions for OpenAI's `text-embedding-ada-002`) representing semantic meaning:
```python
"I like dogs" → [0.023, -0.145, 0.892, ..., 0.034]  # 1536 numbers
```

### Cosine Similarity
Measures how closely two vectors point in the same direction (1 = same direction, 0 = unrelated, -1 = opposite):
```python
similarity("dogs", "canines")  # ≈ 0.92 (high: same meaning)
similarity("dogs", "weather")  # ≈ 0.71 (lower: unrelated)
```
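
For reference, cosine similarity can be computed directly from its definition. The sketch below uses only the standard library; the three-dimensional vectors are made-up stand-ins for real 1536-dimensional embeddings, chosen so that the two "animal" vectors point in nearly the same direction.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); the result lies between -1 and 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any model)
dogs = [0.9, 0.1, 0.0]
canines = [0.8, 0.2, 0.0]
weather = [0.0, 0.1, 0.9]

print(cosine_similarity(dogs, canines) > cosine_similarity(dogs, weather))  # True
```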

### Vector Database
Specialized database for:
- Storing embeddings efficiently
- Fast similarity search (ANN - Approximate Nearest Neighbors)
- Metadata filtering

## Technologies Used

- **OpenAI Embeddings**: `text-embedding-ada-002` model (1536 dimensions)
- **ChromaDB**: Open-source vector database
- **NumPy**: Similarity calculations
- **LangChain**: Integration layer

## Environment Setup

Standard imports plus tiktoken for token counting (embeddings are billed per token).

In [1]:
import os
import openai
import tiktoken  # OpenAI's tokenizer for accurate token counting
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env file (contains OPENAI_API_KEY)
_ = load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']

---

## Step 1: Load Source Documents

Loading the same PDF we used in previous notebooks: India's Digital Personal Data Protection Act, 2023 (DPDPA).

**Why This Document?**
- Real-world legal text (complex, multi-page)
- Demonstrates retrieval from long documents
- Shows how embeddings find relevant sections in legislation

In [2]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF - Digital Personal Data Protection Act (India, 2023)
loaders = [
    PyPDFLoader("./99-DPDPA.pdf")  # Real-world legal document
]

# Aggregate all pages into single list
docs = []
for loader in loaders:
    docs.extend(loader.load())  # Each page becomes a Document object

---

## Step 2: Split Documents into Chunks

Before creating embeddings, we need to split the document into appropriately-sized chunks.

**Why Split?**
- **Context window limits**: LLMs have token limits (4K-128K)
- **Relevance**: Smaller chunks = more precise retrieval
- **Cost**: Embedding entire documents is expensive and less effective

**Chunk Size: 1500 characters**
- ~300-400 tokens
- 1-2 paragraphs of text
- Balance: specific enough to be relevant, large enough for context

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configure text splitter for optimal chunk sizes
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,      # ~300-400 tokens, 1-2 paragraphs
    chunk_overlap=150     # 10% overlap preserves context across boundaries
)
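
To see what `chunk_size` and `chunk_overlap` mean in practice, here is a minimal sketch using a plain sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on paragraph and line boundaries first); `split_with_overlap` is a hypothetical helper that only illustrates the overlap idea.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts with the last two characters of the previous one, which is the same idea as the 150-character overlap configured above.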

## Execute Splitting

Splits the PDF into 55 chunks. Each chunk will get its own embedding vector.

In [6]:
# Split documents into chunks
splits = text_splitter.split_documents(docs)

# Total chunks (each will become a vector in the database)
len(splits)  # Output: 55 chunks

55

---

## Step 3: Initialize Embedding Model

OpenAI's embedding model converts text to 1536-dimensional vectors.

### Model: `text-embedding-ada-002`

**Specifications**:
- **Dimensions**: 1536
- **Max Input**: 8,191 tokens per request
- **Cost**: $0.0001 per 1K tokens (~$0.0017 for 55 chunks of ~300 tokens each)
- **Performance**: Optimized for semantic search

**What Happens When You Call `embed_query()`?**
1. Text → Tokens (tokenization)
2. Tokens → API (HTTP request to OpenAI)
3. API → 1536 floating-point numbers
4. Numbers represent semantic meaning in high-dimensional space

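**Note**: A deprecation warning is expected. `OpenAIEmbeddings` has moved to the newer `langchain_openai` package; the `langchain_community` version used here behaves the same for the purposes of this notebook.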

In [7]:
from langchain_community.embeddings.openai import OpenAIEmbeddings

# Initialize OpenAI embedding model (text-embedding-ada-002)
# This will convert text strings into 1536-dimensional vectors
embedding = OpenAIEmbeddings()

# Note: Deprecation warning is expected - using langchain_community version
# Functionality is identical to newer langchain_openai package



---

## Embeddings Experiment: Semantic Similarity

Let's demonstrate how embeddings capture **meaning, not just keywords**.

### Test Sentences

1. **"i like dogs"** - Base sentence
2. **"i like canines"** - Same meaning, different word (synonym)
3. **"the weather is ugly outside"** - Completely unrelated

### Expected Results

**High similarity**: sentences 1 & 2 (dogs = canines)  
**Low similarity**: sentence 1 & 3 (unrelated topics)  
**Low similarity**: sentence 2 & 3 (unrelated topics)

This shows that embeddings capture **semantics**, not just lexical overlap.

In [8]:
# Test sentences for semantic similarity comparison
sentence1 = "i like dogs"                    # Base sentence
sentence2 = "i like canines"                 # Synonym - same meaning
sentence3 = "the weather is ugly outside"    # Unrelated - different topic

## Generate Embeddings

Each sentence becomes a 1536-dimensional vector. This makes 3 API calls to OpenAI.

**Cost**: ~3 API calls × ~10 tokens each = ~$0.000003 (negligible)

In [9]:
# Generate embeddings (each is 1536-dimensional vector)
embedding1 = embedding.embed_query(sentence1)  # [0.023, -0.145, ..., 0.034]
embedding2 = embedding.embed_query(sentence2)  # [0.019, -0.140, ..., 0.029]
embedding3 = embedding.embed_query(sentence3)  # [-0.052, 0.201, ..., -0.078]

# Each embedding is a list of 1536 floating-point numbers
# Similar meanings → similar vectors

## Calculate Cosine Similarity

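Using NumPy's dot product to measure similarity between vectors. OpenAI embeddings are normalized to unit length, so the dot product of two embeddings equals their cosine similarity.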

### Similarity Score Interpretation

- **0.96**: "dogs" vs "canines" → **Very high** (synonyms!)
- **0.77**: "dogs" vs "weather" → **Low** (unrelated)
- **0.76**: "canines" vs "weather" → **Low** (unrelated)

### Key Insight

The embedding model **correctly identifies** that "dogs" and "canines" are semantically similar (0.96), despite being different words. This is the power of semantic embeddings!

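**Note**: Cosine similarity ranges from -1 (opposite) to 1 (identical). With this model even the unrelated pairs above score around 0.76, so read the scores relative to each other rather than against a fixed threshold: here the synonym pair stands out clearly at 0.96.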
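
Why a plain dot product is enough follows from the definition of cosine similarity. A short derivation, assuming both embeddings have unit length (the symbols $\mathbf{a}$, $\mathbf{b}$ and $\theta$ are introduced here for illustration):

```latex
\cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert},
\qquad
\lVert\mathbf{a}\rVert = \lVert\mathbf{b}\rVert = 1
\;\Longrightarrow\;
\cos\theta = \mathbf{a}\cdot\mathbf{b} = \sum_{i=1}^{1536} a_i\, b_i
```

If the vectors were not unit length, each dot product would need to be divided by the product of the two norms.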

In [10]:
import numpy as np

# Calculate cosine similarity using dot product
# (For normalized vectors, dot product = cosine similarity)

print(np.dot(embedding2, embedding3))  # "canines" vs "weather" = 0.76 (LOW - unrelated)
print(np.dot(embedding1, embedding3))  # "dogs" vs "weather" = 0.77 (LOW - unrelated)
print(np.dot(embedding1, embedding2))  # "dogs" vs "canines" = 0.96 (HIGH - synonyms!)

# Key insight: "dogs" and "canines" are semantically similar (0.96)
# despite being different words - this is semantic search!   

0.7590539714454777
0.7702031204123153
0.9631510802407718


---

## Step 4: Vector Database Setup

Now we'll store all document chunks in a **vector database** for efficient similarity search.

### Why ChromaDB?

| Feature | Benefit |
|---------|---------|
| **Open Source** | Free, self-hosted |
| **Embedded** | Runs in-process (no separate server) |
| **Persistent** | Save to disk, reload later |
| **Fast** | Approximate Nearest Neighbors (ANN) |
| **Metadata** | Filter by document properties |

### Alternative Vector DBs

- **Pinecone**: Managed, cloud-hosted (paid)
- **Weaviate**: Open-source, GraphQL API
- **Qdrant**: Rust-based, high performance
- **FAISS**: Facebook's library (no persistence by default)

For this demo, ChromaDB is perfect: simple, local, and persistent.

In [11]:
from langchain_community.vectorstores import Chroma

# Import ChromaDB - embedded vector database
# This will store embeddings and enable fast similarity search

## Clean Up Old Database

Remove any existing database to start fresh. In production, you'd only do this when rebuilding the index.

In [12]:
# Define where to persist the vector database
persist_directory = 'docs/chroma/'

# Clean up: remove old database files (if any)
!rm -rf ./docs/chroma

# In production: only rebuild when documents change or you need to update embeddings

---

## Create Vector Database

This is where the magic happens! This single command:

1. **Embeds all 55 chunks** (LangChain sends the texts to OpenAI in batches, so this typically takes far fewer than 55 API requests)
2. **Stores vectors** in ChromaDB
3. **Builds search index** (ANN data structures)
4. **Saves to disk** (persistent storage)

### What's Happening Behind the Scenes

```python
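# Conceptual sketch only: Chroma.from_documents() handles this internally
# (and may batch the embedding requests instead of sending one per chunk).
import chromadb

client = chromadb.PersistentClient(path="docs/chroma/")
collection = client.get_or_create_collection("demo")  # illustrative name

for i, chunk in enumerate(splits):  # 55 chunks
    # 1. Generate embedding
    response = openai.embeddings.create(
        model="text-embedding-ada-002",
        input=chunk.page_content
    )
    vector = response.data[0].embedding  # list of 1536 floats

    # 2. Store in database
    collection.add(
        ids=[str(i)],                # Chroma requires a unique string ID per entry
        embeddings=[vector],
        metadatas=[chunk.metadata],  # page number, source, etc.
        documents=[chunk.page_content]
    )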
```

### Cost Breakdown

- **55 chunks** × ~300 tokens each = ~16,500 tokens
- **Price**: $0.0001 per 1K tokens
- **Total**: ~$0.0017 (less than a quarter of a cent!)
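
The arithmetic behind this estimate, as a small sketch. The price constant is the figure quoted in this notebook and may not match current pricing; `embedding_cost` is a hypothetical helper.

```python
PRICE_PER_1K_TOKENS = 0.0001  # USD, the ada-002 price quoted in this notebook

def embedding_cost(num_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Cost in USD for embedding num_tokens tokens."""
    return num_tokens / 1000 * price_per_1k

num_chunks, tokens_per_chunk = 55, 300  # rough per-chunk estimate from above
total_tokens = num_chunks * tokens_per_chunk

print(total_tokens)                            # 16500
print(f"${embedding_cost(total_tokens):.5f}")  # $0.00165
```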

### Time

- Approximately 10-30 seconds for 55 embeddings
- Most time spent waiting for API responses

In [14]:
# Create vector database from document chunks
# This will:
# 1. Generate embeddings for all 55 chunks (requests are batched by LangChain)
# 2. Store vectors + metadata in ChromaDB
# 3. Save to disk for future use
vectordb = Chroma.from_documents(
    documents=splits,               # Our 55 document chunks
    embedding=embedding,            # OpenAI embedding function
    persist_directory=persist_directory  # Save to ./docs/chroma/
)

# Cost: ~16,500 tokens = ~$0.0017 (less than 1/4 of a cent!)
# Time: ~10-30 seconds

## Verify Database

Confirm all 55 chunks were successfully embedded and stored.

In [15]:
# Verify: count total vectors in database
print(vectordb._collection.count())  # Should output: 55

# This confirms all chunks were successfully embedded and stored

55


---

## Step 5: Semantic Search in Action!

Now we can search the document **by meaning**, not just keywords.

### Query

**"is there any rules for a data principal?"**

This question uses natural language:
- "rules" (the document uses "rights" and "obligations")
- "data principal" (specific legal term in the document)

### How Similarity Search Works

1. **Embed the query** (1 API call)
   ```python
   query_vector = embedding.embed_query(question)
   ```

2. **Find similar vectors** (fast, local computation)
   ```python
   # Compare query_vector to all 55 stored vectors
   # Return top-k most similar (k=3)
   ```

3. **Return original text** of most similar chunks
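
Step 2 can be sketched as a brute-force scan: score every stored vector against the query and keep the best `k`. This is an illustration only. A vector database such as ChromaDB uses an approximate nearest neighbor index instead of scanning everything, and the texts and three-dimensional vectors below are invented for the example.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector, stored, k=3):
    """Return the k (score, text) pairs most similar to query_vector, best first."""
    scored = ((cosine_similarity(query_vector, vec), text) for text, vec in stored)
    return heapq.nlargest(k, scored)

# Hypothetical (text, vector) pairs standing in for the 55 stored chunks
stored = [
    ("rights of the data principal", [0.9, 0.1, 0.1]),
    ("duties of the data fiduciary", [0.6, 0.6, 0.1]),
    ("penalties", [0.1, 0.9, 0.2]),
    ("short title and commencement", [0.1, 0.1, 0.9]),
]
query_vector = [1.0, 0.0, 0.1]  # stand-in for embedding.embed_query(question)

for score, text in top_k(query_vector, stored, k=3):
    print(f"{score:.2f}  {text}")
```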

### Parameters

- **k=3**: Return top 3 most relevant chunks
- Could adjust based on needs (k=1 for single best match, k=10 for more context)

In [None]:
# Natural language question about the document
question = "is there any rules for a data principal?"

# Note: Question uses "rules" but document uses "rights", "obligations"
# Semantic search will still find relevant sections!

## Execute Similarity Search

The search returns 3 most relevant chunks. Let's examine the top result.

In [18]:
# Perform similarity search
# 1. Embeds the question (1 API call)
# 2. Compares to all 55 stored vectors (fast, local)
# 3. Returns top 3 most similar chunks
docs = vectordb.similarity_search(question, k=3)

# Verify we got 3 results
len(docs)  # Output: 3

# Display the most relevant chunk (highest similarity score)
docs[0].page_content

# This text is the most semantically similar to our question!
# Notice: it contains "Data Principal", "rights", "notice" - all relevant to "rules"

'section 6 and section 13; and\n (iii) the manner in which the Data Principal may make a complaint to the Board,\nin such manner and as may be prescribed.\nIllustration.\nX, an individual, opens a bank account using the mobile app or website of Y , a bank.\nTo complete the Know-Your-Customer requirements under law for opening of bank account,\nX opts for processing of her personal data by Y in a live, video-based customer identification\nprocess. Y shall accompany or precede the request for the personal data with notice to X,\ndescribing the personal data and the purpose of its processing.\n(2) Where a Data Principal has given her consent for the processing of her personal\ndata before the date of commencement of this Act,—\n(a) the Data Fiduciary shall, as soon as it is reasonably practicable, give to the\nData Principal a notice informing her,––\n(i) the personal data and the purpose for which the same has been\nprocessed;\n  (ii) the manner in which she may exercise her rights under

### Analysis of Result

The top result is **highly relevant**:
- Contains "Data Principal" (exact term from query)
- Describes "rights" under sections 6 and 13 (the "rules" we asked about)
- Includes "notice" requirements (procedural rules)
- Shows concrete example (bank account KYC)

**This illustrates semantic search at work.** The query asked about "rules", while the retrieved text speaks of "rights", "consent", and "notice", yet the system still surfaced a relevant section.

---

## Persist Database

Save the database to disk so we can reload it later without re-embedding.

**Note**: ChromaDB 0.4 and later persist automatically when `persist_directory` is set, so this call is optional; recent LangChain versions flag it with a deprecation warning.

In [19]:
# Persist database to disk (save for future use)
vectordb.persist()

# Note: ChromaDB 0.4+ persists automatically when persist_directory is set,
# so this call is optional and may emit a deprecation warning

# Next time: load existing database instead of re-embedding
# vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)



---

## Summary: What We Accomplished

### 1. Embeddings Fundamentals ✅
- Learned how text becomes numerical vectors (1536 dimensions)
- Demonstrated semantic similarity: "dogs" vs "canines" = 0.96
- Proved embeddings understand meaning, not just keywords

### 2. Vector Database ✅
- Created ChromaDB with 55 document chunks
- Each chunk embedded and indexed for fast search
- Persistent storage for future use

### 3. Semantic Search ✅
- Asked natural language question
- Retrieved relevant sections despite vocabulary mismatch
- "rules" → found text containing "rights", "notice", "consent"

---

## Production Patterns

### 1. Loading Existing Database

```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.openai import OpenAIEmbeddings

# Initialize embedding function
embedding = OpenAIEmbeddings()

# Load existing database (no re-embedding needed!)
vectordb = Chroma(
    persist_directory='docs/chroma/',
    embedding_function=embedding
)

# Ready to search immediately
results = vectordb.similarity_search("your question", k=3)
```

### 2. Incremental Updates

```python
# Add new documents without rebuilding entire index
new_docs = load_new_documents()
new_splits = text_splitter.split_documents(new_docs)

# Add to existing database
vectordb.add_documents(new_splits)
```

### 3. Metadata Filtering

```python
# Search with filters on chunk metadata
# (PyPDFLoader stores "source" = file path and "page" = zero-based page index)
results = vectordb.similarity_search(
    "your question",
    k=3,
    filter={"page": 5}  # Only search chunks from page index 5
)
```

### 4. Similarity Scores

```python
# Get scores along with documents
# Note: with Chroma the score is a distance, so lower means more similar
results = vectordb.similarity_search_with_score("your question", k=3)

for doc, score in results:
    print(f"Distance: {score:.3f}")
    print(f"Content: {doc.page_content[:100]}...")
```

---

## Cost Analysis

### Embedding Costs

| Operation | Tokens | Cost | Frequency |
|-----------|--------|------|-----------|
| **Initial indexing** | 16,500 | $0.0017 | Once (or on rebuild) |
| **Query** | ~10 | $0.000001 | Per search |
| **Add 1 document** | ~300 | $0.00003 | As needed |

### Cost Optimization Strategies

1. **Batch Processing**: Embed multiple documents in single API call
2. **Caching**: Reuse existing embeddings, don't re-embed unchanged content
3. **Incremental Updates**: Only embed new/changed documents
4. **Local Embeddings**: Use open-source models (e.g., sentence-transformers) for sensitive data
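
The caching strategy can be sketched with a content hash as the cache key. `CachedEmbedder` and `fake_embed` are hypothetical names used so the example runs offline; in the notebook you would pass something like `embedding.embed_query` as the wrapped function.

```python
import hashlib

class CachedEmbedder:
    """Wraps an embedding function and skips texts it has already embedded."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # any callable mapping text -> vector
        self.cache = {}           # SHA-256 of the text -> vector
        self.api_calls = 0        # how many times embed_fn was actually invoked

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.embed_fn(text)
            self.api_calls += 1
        return self.cache[key]

# Stand-in for a real (billed) embedding call
def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]

embedder = CachedEmbedder(fake_embed)
for chunk in ["consent notice", "rights of principal", "consent notice"]:
    embedder.embed(chunk)

print(embedder.api_calls)  # 2: the repeated chunk was served from the cache
```

A persistent version would store the cache on disk (or rely on the vector database itself) so that unchanged chunks are not re-embedded across runs.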

---

## Advanced Techniques

### 1. Hybrid Search (Keyword + Semantic)

```python
# Combine keyword search (BM25) with semantic search
# Best of both worlds: exact matches + semantic understanding

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword retriever
bm25_retriever = BM25Retriever.from_documents(splits)

# Semantic retriever
vector_retriever = vectordb.as_retriever(search_kwargs={"k": 3})

# Combine both
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7]  # 30% keyword, 70% semantic
)

results = ensemble_retriever.invoke("your question")  # invoke() supersedes the deprecated get_relevant_documents()
```
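
To give a feel for how two ranked lists can be merged, here is a sketch of weighted Reciprocal Rank Fusion, the general technique LangChain documents for `EnsembleRetriever`. The function `weighted_rrf`, the constant `c=60`, and the chunk IDs are assumptions made for illustration; the library's exact formula may differ.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: score(doc) = sum over lists of weight / (c + rank)."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two retrievers, best match first
keyword_ranking = ["chunk_7", "chunk_2", "chunk_9"]
semantic_ranking = ["chunk_2", "chunk_4", "chunk_7"]

fused = weighted_rrf([keyword_ranking, semantic_ranking], weights=[0.3, 0.7])
print(fused)  # ['chunk_2', 'chunk_7', 'chunk_4', 'chunk_9']
```

Documents that appear in both lists rise to the top, and the 0.7 weight lets the semantic ranking break the tie between them.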

### 2. Maximal Marginal Relevance (MMR)

```python
# Avoid returning duplicate/similar results
# MMR balances relevance with diversity

results = vectordb.max_marginal_relevance_search(
    "your question",
    k=3,
    fetch_k=10,  # Fetch 10 candidates, return diverse 3
    lambda_mult=0.5  # Balance relevance (1.0) vs diversity (0.0)
)
```
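
The trade-off controlled by `lambda_mult` is easier to see in a from-scratch sketch of greedy MMR. This is a simplified illustration with invented two-dimensional vectors, not LangChain's implementation; the function name `mmr` is hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, candidates, k=3, lambda_mult=0.5):
    """Greedy MMR: return indices of candidates that are relevant yet mutually diverse."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine_similarity(query_vec, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet)
            redundancy = max(
                (cosine_similarity(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

candidates = [
    [1.0, 0.12],  # 0: most relevant to the query
    [1.0, 0.10],  # 1: near-duplicate of 0
    [0.6, 0.80],  # 2: less relevant, but covers a different direction
]
query = [1.0, 0.3]

print(mmr(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]: skips the near-duplicate
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance ranking
```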

### 3. Self-Query Retriever

```python
# Let LLM extract filters from natural language
# "Find documents from 2023 about privacy" 
# → filter: {year: 2023, topic: "privacy"}

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_fields = [
    AttributeInfo(name="source", description="Document filename", type="string"),
    AttributeInfo(name="page", description="Page number", type="integer")
]

retriever = SelfQueryRetriever.from_llm(
    llm=your_llm,
    vectorstore=vectordb,
    document_contents="Legal documents about data privacy",
    metadata_field_info=metadata_fields
)

# Natural language query with implicit filter
results = retriever.invoke(  # invoke() supersedes the deprecated get_relevant_documents()
    "What does page 5 say about consent?"
    # Auto-extracts: filter={page: 5}
)
```

---

## Common Pitfalls

### 1. Chunk Size Too Large
**Problem**: Large chunks reduce retrieval precision  
**Solution**: Keep chunks 300-800 tokens (1-3 paragraphs)

### 2. No Overlap
**Problem**: Context lost at chunk boundaries  
**Solution**: Use 10-20% overlap

### 3. Forgetting to Persist
**Problem**: Database lost on restart  
**Solution**: Always specify `persist_directory`

### 4. Wrong Embedding Model
**Problem**: Using different embeddings for indexing vs querying  
**Solution**: Always use same embedding function

### 5. Ignoring Metadata
**Problem**: Can't filter search by document properties  
**Solution**: Preserve metadata during splitting

---

## Next Steps

### In This Module
- **Notebook 4**: Question Answering over Documents (RAG)
- **Notebook 5**: Chat with your data (conversational RAG)

### Production Deployment
1. **Scale**: Use managed vector DB (Pinecone, Weaviate)
2. **Monitor**: Track search quality, latency, cost
3. **Optimize**: Experiment with chunk sizes, overlap
4. **Enhance**: Add hybrid search, metadata filtering
5. **Secure**: Implement access controls, audit logging

---

## Key Takeaways

✅ **Embeddings** represent meaning as vectors  
✅ **Semantic similarity** captures synonyms and context  
✅ **Vector databases** enable fast similarity search  
✅ **ChromaDB** provides persistent, embedded storage  
✅ **RAG foundation** is now complete - ready for Q&A!

**You now understand the core technology behind modern RAG systems!**