# Embeddings & Semantic Search Demo

This notebook demonstrates how embeddings work for semantic search - the foundation of RAG systems.

**Workflow:**
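1. **Run cells 1-3 once** to set up and create the document embeddings (costs embedding tokens)
2. **Run cell 4 multiple times** to test different questions (the documents are not re-embedded; each run only embeds the question and makes one chat call)

Embedding the documents once keeps experimentation cheap. 💰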

## Cell 1: Setup & Imports
Run this once at the start

In [1]:
from openai import OpenAI
import numpy as np
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

print("✓ Setup complete!")

✓ Setup complete!


## Cell 2: Helper Functions
Functions for embeddings and similarity calculation

In [2]:
def get_embedding(text, model="text-embedding-3-small"):
    """Convert text to embedding vector"""
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors (range -1 to 1; 1 = same direction)"""
    dot_product = np.dot(vec1, vec2)
    magnitude = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    return dot_product / magnitude

def find_most_relevant(question_embedding, doc_embeddings, documents, top_k=3):
    """Find top_k most relevant documents for a question"""
    similarities = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(question_embedding, doc_emb)
        similarities.append((score, i, documents[i]))
    
    # Sort by similarity (highest first)
    similarities.sort(reverse=True)
    return similarities[:top_k]

print("✓ Helper functions defined!")

✓ Helper functions defined!
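
The loop in `find_most_relevant` scores one document at a time. For larger collections the same ranking can be computed with a single matrix product. The sketch below is one way to do that with NumPy; the function name `top_k_by_cosine` and the toy 2-D vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return (score, index) pairs for the top_k rows of doc_vecs most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)       # shape: (n_docs, n_dims)

    # Normalize so that a plain dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    scores = docs @ query                          # shape: (n_docs,)
    order = np.argsort(scores)[::-1][:top_k]       # indices, highest score first
    return [(float(scores[i]), int(i)) for i in order]

# Toy 2-D vectors standing in for real embeddings
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_results = top_k_by_cosine([1.0, 0.0], toy_docs, top_k=2)
print(toy_results)
```

In the notebook, `top_k_by_cosine(question_embedding, doc_embeddings)` would return indices that can be used to look up entries in `documents`.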


## Cell 3: Create Document Embeddings

**RUN THIS ONCE** - Creates embeddings for all documents.

💡 **Tip:** You can modify the `documents` list to add your own test documents!

In [3]:
# Your knowledge base - modify these to test with your own documents!
documents = [
    "Employees receive 15 days of PTO annually. This includes vacation and sick leave.",
    "Our remote work policy allows employees to work from home 3 days per week.",
    "The database uses PostgreSQL 14 with automatic backups every 6 hours.",
    "Time off requests must be submitted 2 weeks in advance through the HR portal.",
    "All employees must complete cybersecurity training within their first month.",
    "The company matches 401k contributions up to 6% of salary.",
    "Office hours are 9 AM to 6 PM, with flexible start times between 8-10 AM.",
    "Health insurance coverage begins on the first day of employment."
]

print(f"Creating embeddings for {len(documents)} documents...")
print("(This costs tokens, but we only do it once!)\n")

# Create embeddings for all documents
doc_embeddings = []
for i, doc in enumerate(documents):
    embedding = get_embedding(doc)
    doc_embeddings.append(embedding)
    print(f"  [{i+1}/{len(documents)}] Embedded: {doc[:50]}...")

print(f"\n✓ Created {len(doc_embeddings)} embeddings!")
print(f"  Each embedding has {len(doc_embeddings[0])} dimensions")
# Rough upper bound: assumes ~1,000 tokens per document at $0.02 per 1M tokens (check current pricing)
print(f"\n💰 Cost: ~${len(documents) * 0.00002:.6f} (very cheap!)")
print("\n🎯 Now you can ask questions in Cell 4 without re-embedding the documents!")

Creating embeddings for 8 documents...
(This costs tokens, but we only do it once!)

  [1/8] Embedded: Employees receive 15 days of PTO annually. This in...
  [2/8] Embedded: Our remote work policy allows employees to work fr...
  [3/8] Embedded: The database uses PostgreSQL 14 with automatic bac...
  [4/8] Embedded: Time off requests must be submitted 2 weeks in adv...
  [5/8] Embedded: All employees must complete cybersecurity training...
  [6/8] Embedded: The company matches 401k contributions up to 6% of...
  [7/8] Embedded: Office hours are 9 AM to 6 PM, with flexible start...
  [8/8] Embedded: Health insurance coverage begins on the first day ...

✓ Created 8 embeddings!
  Each embedding has 1536 dimensions

💰 Cost: ~$0.000160 (very cheap!)

🎯 Now you can ask questions in Cell 4 without re-embedding the documents!


## Cell 4: Ask Questions (Run Multiple Times!)

**Change the `question` and re-run this cell** as many times as you want.

The document embeddings are reused, so each run only embeds the question and makes one chat completion call. 🎉
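
The document embeddings from Cell 3 live only in the notebook's memory, so restarting the kernel means paying to embed them again. A minimal sketch of caching them on disk with `np.save` and `np.load` follows. The helper names `save_embeddings` and `load_embeddings` and the file name are hypothetical, not part of the original notebook.

```python
import os
import tempfile
import numpy as np

def save_embeddings(path, embeddings):
    """Write a list of embedding vectors to a .npy file as a 2-D float32 array."""
    np.save(path, np.asarray(embeddings, dtype=np.float32))

def load_embeddings(path):
    """Read embeddings back as a list of lists, or return None if no cache file exists."""
    if not os.path.exists(path):
        return None
    return np.load(path).tolist()

# Demo with toy vectors; in the notebook you would pass doc_embeddings instead
cache_path = os.path.join(tempfile.mkdtemp(), "doc_embeddings.npy")  # hypothetical file name
save_embeddings(cache_path, [[0.1, 0.2], [0.3, 0.4]])
restored = load_embeddings(cache_path)
print(len(restored), "vectors restored")
```

A cache like this is only valid while the `documents` list and the embedding model stay the same; if either changes, the file should be deleted and the embeddings recreated.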

In [15]:
# 🔧 CHANGE THIS QUESTION AND RE-RUN!
# question = "How many vacation days do I get?"

# Alternative questions to try:
# question = "Can I work from home for 5 days in a month?"
question = "When does my health insurance end?"
# question = "Tell me about retirement benefits"
# question = "What database do we use and why?"

print(f"❓ Question: {question}")
print("=" * 80)

# Convert question to embedding
question_embedding = get_embedding(question)

# Find most relevant documents
top_results = find_most_relevant(question_embedding, doc_embeddings, documents, top_k=3)

# Display results
print("\n📊 RELEVANCE SCORES:\n")
for rank, (score, idx, doc) in enumerate(top_results, 1):
    print(f"#{rank} - Score: {score:.3f}")
    print(f"    Doc [{idx}]: {doc}")
    print()

# Use the most relevant document for RAG
most_relevant_doc = top_results[0][2]
print("=" * 80)
print("📄 SENDING TO LLM (most relevant doc only):\n")
print(f"  {most_relevant_doc}\n")

# Generate answer using RAG
print("=" * 80)
print("🤖 LLM RESPONSE:\n")
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer the question based only on the provided context. If the context doesn't contain the answer, say so."},
        {"role": "user", "content": f"Context: {most_relevant_doc}\n\nQuestion: {question}"}
    ],
    temperature=0
)

answer = response.choices[0].message.content
print(f"  {answer}\n")

# Show token usage
tokens_used = response.usage.total_tokens
cost = tokens_used * 0.00000015  # rough estimate: applies the gpt-4o-mini input rate to all tokens; output tokens are billed at a higher rate
print("=" * 80)
print(f"💰 COST: {tokens_used} tokens = ${cost:.6f}")
print("\n✨ Try changing the question above and re-running this cell!")

❓ Question: When does my health insurance end?

📊 RELEVANCE SCORES:

#1 - Score: 0.482
    Doc [7]: Health insurance coverage begins on the first day of employment.

#2 - Score: 0.194
    Doc [3]: Time off requests must be submitted 2 weeks in advance through the HR portal.

#3 - Score: 0.185
    Doc [0]: Employees receive 15 days of PTO annually. This includes vacation and sick leave.

📄 SENDING TO LLM (most relevant doc only):

  Health insurance coverage begins on the first day of employment.

🤖 LLM RESPONSE:

  The provided context does not contain information about when your health insurance ends.

💰 COST: 68 tokens = $0.000010

✨ Try changing the question above and re-running this cell!
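
In this run the top document scores 0.482 yet does not answer the question, so a high score signals topical similarity, not that an answer exists. A score cutoff can still skip the LLM call when nothing is even loosely related. The sketch below shows the idea; `filter_by_score` is a hypothetical helper and 0.3 is an illustrative cutoff, not a recommended value.

```python
def filter_by_score(results, min_score):
    """Keep only (score, index, document) tuples whose score is at least min_score."""
    return [r for r in results if r[0] >= min_score]

# Scores copied from the Cell 4 output; the cutoff below is an assumption for illustration
sample_results = [
    (0.482, 7, "Health insurance coverage begins on the first day of employment."),
    (0.194, 3, "Time off requests must be submitted 2 weeks in advance through the HR portal."),
    (0.185, 0, "Employees receive 15 days of PTO annually. This includes vacation and sick leave."),
]
kept = filter_by_score(sample_results, min_score=0.3)
if kept:
    print(f"Sending {len(kept)} document(s) to the LLM")
else:
    print("No document is similar enough; skipping the LLM call")
```

Because the insurance document passes this cutoff, the system prompt's instruction to say when the context lacks the answer is still what prevents a wrong reply here.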


## Cell 5: Compare Multiple Questions (Optional)

Run this to see which document best matches each question (each question costs one embedding call).

In [10]:
# Test multiple questions at once
test_questions = [
    "How many vacation days do I get?",
    "Can I work remotely?",
    "What database technology do we use?",
    "When does insurance coverage begin?"
]

print("🔍 TESTING MULTIPLE QUESTIONS\n")
print("=" * 80)

for q in test_questions:
    q_emb = get_embedding(q)
    top = find_most_relevant(q_emb, doc_embeddings, documents, top_k=1)[0]
    score, idx, doc = top
    
    print(f"\n❓ Q: {q}")
    print(f"✓ Best Match (score: {score:.3f}): {doc[:70]}...")
    print("-" * 80)

🔍 TESTING MULTIPLE QUESTIONS


❓ Q: How many vacation days do I get?
✓ Best Match (score: 0.551): Employees receive 15 days of PTO annually. This includes vacation and ...
--------------------------------------------------------------------------------

❓ Q: Can I work remotely?
✓ Best Match (score: 0.610): Our remote work policy allows employees to work from home 3 days per w...
--------------------------------------------------------------------------------

❓ Q: What database technology do we use?
✓ Best Match (score: 0.441): The database uses PostgreSQL 14 with automatic backups every 6 hours....
--------------------------------------------------------------------------------

❓ Q: When does insurance coverage begin?
✓ Best Match (score: 0.765): Health insurance coverage begins on the first day of employment....
--------------------------------------------------------------------------------
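
The loop in this cell makes one embeddings request per question. The embeddings endpoint also accepts a list of strings, so several texts can be embedded in one request. The sketch below shows that pattern; `get_embeddings_batch` is a hypothetical helper, and the fake client exists only so the example runs without an API key.

```python
from types import SimpleNamespace

def get_embeddings_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts with a single embeddings request; results follow the input order."""
    response = client.embeddings.create(model=model, input=texts)
    # Each item carries the index of the input it belongs to; sort to make the order explicit
    items = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in items]

# Offline stand-in for the OpenAI client (hypothetical): "embeds" a text as its length
def _fake_create(model, input):
    data = [SimpleNamespace(index=i, embedding=[float(len(t))]) for i, t in enumerate(input)]
    return SimpleNamespace(data=data)

fake_client = SimpleNamespace(embeddings=SimpleNamespace(create=_fake_create))
batch = get_embeddings_batch(fake_client, ["Can I work remotely?", "PTO"])
print(len(batch), "embeddings from one request")
```

In the notebook, passing the real `client` and `test_questions` would replace the per-question `get_embedding` calls with one request.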


## 🎓 Key Takeaways

1. **Embeddings capture meaning** - "vacation" matches "PTO" even with different words
2. **One-time cost** - Embed documents once, query many times
3. **Semantic search** - Finds relevant content based on meaning, not keywords
4. **Token efficiency** - Only send relevant docs to LLM, not everything
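
To get a feel for the token-efficiency point, the prompt size of sending one retrieved document can be compared with sending the whole knowledge base. The sketch below uses a crude length-based estimate; `estimate_tokens` and its 4-characters-per-token ratio are rule-of-thumb assumptions, not a real tokenizer.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; chars_per_token=4 is an assumed rule of thumb."""
    return max(1, len(text) // chars_per_token)

# Three documents copied from Cell 3
kb = [
    "Employees receive 15 days of PTO annually. This includes vacation and sick leave.",
    "Our remote work policy allows employees to work from home 3 days per week.",
    "The database uses PostgreSQL 14 with automatic backups every 6 hours.",
]

all_docs_tokens = sum(estimate_tokens(d) for d in kb)
top1_tokens = estimate_tokens(kb[0])
print(f"All documents: ~{all_docs_tokens} tokens, top-1 only: ~{top1_tokens} tokens")
```

The gap grows with the size of the knowledge base, while the retrieved context stays roughly constant.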

## 🚀 This is RAG!

You just experienced the core of RAG:
- **R**etrieval: Find relevant docs using embeddings
- **A**ugmented: Add those docs to the prompt
- **G**eneration: LLM generates answer based on retrieved context
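
The three steps can be sketched as one function. This is a minimal illustration under stated assumptions: `rag_answer`, `embed_fn`, and `generate_fn` are hypothetical names, and the toy embedding and echoing "LLM" stand in for the real API calls so the example runs offline.

```python
import numpy as np

def rag_answer(question, documents, doc_embeddings, embed_fn, generate_fn, top_k=1):
    """Retrieve the top_k documents for a question and pass them to a generator."""
    # Retrieval: score every document against the question embedding
    q = np.asarray(embed_fn(question), dtype=float)
    docs = np.asarray(doc_embeddings, dtype=float)
    scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:top_k]

    # Augmentation: put the retrieved text into the prompt
    context = "\n".join(documents[i] for i in best)
    prompt = f"Context: {context}\n\nQuestion: {question}"

    # Generation: hand the augmented prompt to the model
    return generate_fn(prompt)

# Offline stand-ins (hypothetical): a 2-D "embedding" and an "LLM" that echoes the context line
toy_documents = ["Remote work is allowed 3 days per week.", "The database is PostgreSQL 14."]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0]]
toy_embed = lambda text: [0.0, 1.0] if "database" in text.lower() else [1.0, 0.0]
toy_generate = lambda prompt: prompt.splitlines()[0]

toy_answer = rag_answer("What database do we use?", toy_documents, toy_embeddings, toy_embed, toy_generate)
print(toy_answer)
```

In the notebook, `get_embedding` could serve as `embed_fn`, and a small wrapper around `client.chat.completions.create` could serve as `generate_fn`.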

## 📝 Experiment Ideas

1. Add your own documents to the `documents` list in Cell 3
2. Try questions that don't match any document - see how scores drop
3. Increase `top_k` and pass all retrieved documents to the LLM (Cell 4 currently sends only the top result)
4. Compare costs: RAG vs sending all documents every time
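
For the multi-document experiment, the retrieved results need to be combined into a single context string. One possible sketch is shown below; `build_context` is a hypothetical helper, and the sample tuples mirror the `(score, index, document)` shape returned by `find_most_relevant`.

```python
def build_context(results):
    """Join retrieved (score, index, document) tuples into one numbered context string."""
    return "\n".join(f"[{rank}] {doc}" for rank, (_score, _idx, doc) in enumerate(results, 1))

# Two results shaped like the output of find_most_relevant (scores from the Cell 4 output)
sample_top = [
    (0.482, 7, "Health insurance coverage begins on the first day of employment."),
    (0.194, 3, "Time off requests must be submitted 2 weeks in advance through the HR portal."),
]
context = build_context(sample_top)
print(context)
```

In Cell 4, `build_context(top_results)` could replace `most_relevant_doc` in the user message, at the cost of a longer prompt.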