# Retrieval & Context Systems

## Learning Objectives
- Understand the limitations of LLMs without external context
- Learn how to inject domain-specific knowledge using context
- Explore basic RAG patterns with embeddings and vector search
- Practice chunking strategies and retrieval techniques

## Setup
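Make sure your OpenAI API key is available as the `OPENAI_API_KEY` environment variable. If you keep it in a `.env` file instead, load that file (for example with `python-dotenv`) before creating the client, because `os.getenv` does not read `.env` files on its own.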

## What's up with Retrieval and "RAG"?

**Retrieval** is the fundamental concept of finding and fetching relevant information from external knowledge sources. It's the "R" in **RAG (Retrieval-Augmented Generation)** - a pattern that combines:

- **Retrieval**: Finding relevant context from external sources (documents, databases, APIs)
- **Augmented**: Enhancing the LLM's capabilities with this external knowledge
- **Generation**: Using the LLM to generate responses informed by the retrieved context

RAG solves a critical limitation: LLMs are trained on static datasets and can't access real-time information or private company data. By retrieving relevant context and injecting it into prompts, we can make LLMs knowledgeable about domains they were never trained on.

Today we'll explore various retrieval techniques, from simple context injection to vector-based search systems.
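
To make the three steps concrete before we bring in a real model, here is a minimal offline sketch. It is not how the rest of this notebook retrieves context: the word-overlap scoring is a stand-in for the embedding search introduced later, and the documents, `tokenize`, `retrieve`, and `build_prompt` are hypothetical names made up for illustration.

```python
from typing import List

# Hypothetical toy knowledge base; the facts are invented for illustration.
DOCUMENTS = [
    "The cafeteria serves lunch from 11:30 to 13:30 on weekdays.",
    "Expense reports must be submitted by the fifth day of each month.",
    "The office wifi password is rotated every quarter.",
]

def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question: str, documents: List[str], k: int = 1) -> List[str]:
    """Retrieval: rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str, context_docs: List[str]) -> str:
    """Augmentation: place the retrieved text in front of the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "When are expense reports due?"
top_docs = retrieve(question, DOCUMENTS, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)  # Generation: this string is what would be sent to the LLM
```

Word overlap misses paraphrases ("due" versus "submitted by"), which is the gap that embedding-based search is meant to close.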


In [None]:
import os
from openai import OpenAI
from typing import List, Dict

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

print("Setup complete!")

## Challenge 1: Testing LLM Knowledge Limits

Let's start by asking the model something it won't know - questions about Framna, our company.

In [None]:
def ask_question(question: str) -> str:
    response = client.responses.create(
        model="gpt-4.1",
        input=[
            {"role": "user", "content": question}
        ],
    )
    # output_text joins all text output, so it does not depend on the position of the message item
    return response.output_text

# Try asking about Framna
question = "What services does Framna offer and what technology stack do they use?"
answer = ask_question(question)
print(f"Question: {question}")
print(f"Answer: {answer}")

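**Expected Result:** The model most likely won't know about Framna, since the company is probably not in its training data. It may say so directly, or it may produce a plausible-sounding answer that is made up.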

## Challenge 2: Adding Context - The Naive Approach

Now let's load our company information and inject it as context.

In [None]:
# Load company information
with open('framna_company_info.txt', 'r', encoding='utf-8') as f:
    company_info = f.read()

print(f"Loaded {len(company_info)} characters of company information")
print(f"First 200 characters: {company_info[:200]}...")

In [None]:
def ask_with_context(question: str, context: str) -> str:
    """Ask a question with provided context using string formatting"""
    
    # Define our prompt template with placeholders
    prompt_template = """Based on the following company information, please answer the question.

Company Information:
{context}

Question: {question}

Answer based only on the provided information:"""
    
    formatted_prompt = prompt_template.format(
        context=context,
        question=question
    )
    
    # The Responses API takes `input` (not `messages`) and `max_output_tokens` (not `max_tokens`)
    response = client.responses.create(
        model="gpt-4.1",
        input=[
            {"role": "user", "content": formatted_prompt}
        ],
        max_output_tokens=300
    )
    return response.output_text

# Now ask the same question with context
answer_with_context = ask_with_context(question, company_info)
print(f"Question: {question}")
print(f"Answer with context: {answer_with_context}")

**Expected Result:** Now the model can answer accurately about Framna!

Let's try a few more questions to see how well this works:

In [None]:
# Test more questions
questions = [
    "Where is Framna located and what is their work policy?",
    "What are Framna's company values?",
    "What notable projects has Framna worked on?"
]

for q in questions:
    answer = ask_with_context(q, company_info)
    print(f"\nQ: {q}")
    print(f"A: {answer}")
    print("-" * 50)

## Challenge 3: The Problem with Large Documents

What happens when we have too much information? Let's simulate a larger document and see the limitations.

In [None]:
# Simulate a much larger document by repeating our content
large_document = company_info * 10  # 10x the original size

print(f"Original document: {len(company_info)} characters")
print(f"Large document: {len(large_document)} characters")

# Calculate approximate token count (rough estimate: 1 token ≈ 4 characters)
estimated_tokens = len(large_document) // 4
print(f"Estimated tokens: {estimated_tokens}")

# This would be expensive and inefficient!
print("\n⚠️ Problems with large documents:")
print("- Higher API costs (more tokens)")
print("- Slower response times")
print("- Context window limits (models have max token limits)")
print("- Difficulty finding relevant information (needle in haystack)")
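
One common way to work around these limits is to split a long document into smaller chunks, so that only the relevant ones need to be sent to the model. The sketch below is one simple strategy (fixed-size character windows with overlap), not the only one. The function names and the sizes are assumptions chosen for illustration.

```python
from typing import List

def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return len(text) // chars_per_token

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = "abcdefghij" * 100  # 1,000 characters of placeholder text
chunks = chunk_text(sample, chunk_size=300, overlap=50)
print(f"{len(chunks)} chunks, about {estimate_tokens(chunks[0])} tokens in the first one")
```

The overlap keeps a sentence that straddles a boundary present in both neighbouring chunks. Splitting on paragraphs or headings often gives cleaner chunks than raw character counts, at the cost of uneven chunk sizes.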

## Challenge 4: Introduction to RAG - Q&A Database Approach

Now let's see a more practical example. We have a large Q&A database about Framna. It would be inefficient to send all Q&A pairs for every question. Instead, we'll use RAG to find only the relevant Q&A pairs.

In [None]:
# Install ChromaDB if not already installed
# !pip install chromadb

import chromadb
from chromadb.config import Settings

# Initialize ChromaDB client
chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))

# Create a collection
collection = chroma_client.get_or_create_collection(
    name="framna_knowledge",
    metadata={"description": "Framna company information"}
)

print("ChromaDB collection created successfully!")

In [None]:
# Load our Q&A database instead
with open('framna_qa_database.txt', 'r', encoding='utf-8') as f:
    qa_database = f.read()

print(f"Loaded Q&A database with {len(qa_database)} characters")

# Let's see what we're working with
qa_pairs = []
lines = qa_database.strip().split('\n')
current_q = None
current_a = None

for line in lines:
    line = line.strip()
    if line.startswith('Q: '):
        current_q = line[3:]  # Remove 'Q: '
    elif line.startswith('A: '):
        current_a = line[3:]  # Remove 'A: '
        if current_q and current_a:
            qa_pairs.append(f"Q: {current_q}\nA: {current_a}")
            current_q = None
            current_a = None

print(f"Parsed {len(qa_pairs)} Q&A pairs")
print(f"\nFirst few Q&A pairs:")
for i, qa in enumerate(qa_pairs[:3]):
    print(f"\n--- Q&A {i+1} ---")
    print(qa)

In [None]:
def get_embeddings(texts: List[str]) -> List[List[float]]:
    """Get embeddings for a list of texts using OpenAI"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [embedding.embedding for embedding in response.data]

# Get embeddings for our Q&A pairs (each pair is embedded as a single unit)
embeddings = get_embeddings(qa_pairs)
print(f"Generated embeddings for {len(embeddings)} Q&A pairs")
print(f"Each embedding has {len(embeddings[0])} dimensions")
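
To build some intuition for what these vectors are used for: texts with similar meaning should end up with vectors that point in similar directions. Cosine similarity is one common way to measure that. The sketch below uses made-up 3-dimensional vectors in place of real embeddings, and `cosine_similarity` is a helper written for this example, not part of any library used in this notebook.

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings
office_hours = [0.9, 0.1, 0.0]
opening_times = [0.8, 0.2, 0.1]
tech_stack = [0.0, 0.2, 0.9]

similar = cosine_similarity(office_hours, opening_times)
different = cosine_similarity(office_hours, tech_stack)
print(f"related topics:   {similar:.3f}")
print(f"unrelated topics: {different:.3f}")
```

A vector database performs this kind of comparison (or a related distance measure) between the query vector and every stored vector, then returns the closest matches.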

In [None]:
# Add Q&A pairs to ChromaDB with embeddings
collection.add(
    documents=qa_pairs,
    embeddings=embeddings,
    ids=[f"qa_{i}" for i in range(len(qa_pairs))],
    metadatas=[{"qa_index": i} for i in range(len(qa_pairs))]
)

print(f"Added {len(qa_pairs)} Q&A pairs to ChromaDB collection")

# Demonstrate the concept: we don't need all 20 Q&A pairs to answer one question
print(f"\n💡 Key insight: We have {len(qa_pairs)} Q&A pairs, but we only need 2-3 relevant ones to answer most questions!")
print("This saves tokens, reduces costs, and improves response quality.")

## Challenge 5: Semantic Search Experiments

Now let's experiment with semantic search to find the Q&A pairs most relevant to a query.

In [None]:
def search_knowledge(query: str, n_results: int = 3) -> Dict:
    """Search for relevant Q&A pairs using semantic similarity"""
    # Get embedding for the query
    query_embedding = get_embeddings([query])[0]
    
    # Search in ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    
    return results

# Test semantic search with our Q&A database
test_queries = [
    "What programming languages does Framna use?",
    "Company work environment and culture",  
    "Geographic location and office setup"
]

for query in test_queries:
    print(f"\n🔍 Query: {query}")
    results = search_knowledge(query, n_results=2)
    
    for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):
        print(f"\n📋 Most relevant Q&A {i+1} (distance: {distance:.3f}):")
        print(f"{doc}")
    print("-" * 80)
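
The search always returns `n_results` items, even when none of them are a good match. A lower distance means a closer match, so one option is to drop results above a cutoff. The sketch below assumes the result dict has the same shape as the one used in the loop above (`documents` and `distances`, each holding one list per query). `filter_by_distance` and the threshold value are hypothetical, and a sensible cutoff depends on the embedding model and distance metric, so it needs tuning on real queries.

```python
from typing import Dict, List, Tuple

def filter_by_distance(results: Dict, max_distance: float) -> List[Tuple[str, float]]:
    """Keep (document, distance) pairs from the first query whose distance is within the threshold."""
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [(doc, dist) for doc, dist in zip(documents, distances) if dist <= max_distance]

# Hand-written stand-in for a query result; the documents and distances are invented
fake_results = {
    "documents": [[
        "Q: Where is the office?\nA: Downtown.",
        "Q: What is the dress code?\nA: Casual.",
    ]],
    "distances": [[0.42, 1.37]],
}

kept = filter_by_distance(fake_results, max_distance=1.0)
for doc, dist in kept:
    print(f"{dist:.2f}  {doc!r}")
```

If the filtered list comes back empty, the application can respond that it has no relevant information instead of passing weak context to the model.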

## Challenge 6: RAG System - Putting It All Together

Now let's build a complete RAG system that retrieves the relevant Q&A pairs and uses them as context.

In [None]:
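from typing import Tuple

# Returns both the generated answer and the Q&A pairs that were used as context
def rag_query(question: str, n_results: int = 2) -> Tuple[str, List[str]]: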
    """Complete RAG pipeline: retrieve relevant Q&A pairs and generate answer"""
    
    # Step 1: Retrieve relevant Q&A pairs
    search_results = search_knowledge(question, n_results=n_results)
    relevant_qas = search_results['documents'][0]
    
    # Step 2: Combine Q&A pairs into context
    context = "\n\n".join(relevant_qas)
    
    # Step 3: Generate answer using context
    prompt_template = """Based on the following Q&A pairs about Framna, please answer the question. Use the information from the Q&A pairs to provide a comprehensive answer.

Relevant Q&A pairs:
{context}

Question: {question}

Answer:"""
    
    formatted_prompt = prompt_template.format(
        context=context,
        question=question
    )
    
    # The Responses API uses `max_output_tokens` (not `max_tokens`) to cap the reply length
    response = client.responses.create(
        model="gpt-4.1",
        input=[
            {"role": "user", "content": formatted_prompt}
        ],
        max_output_tokens=300
    )
    
    return response.output_text, relevant_qas

# Test the RAG system with Q&A database
test_questions = [
    "What technology stack does Framna use for mobile development?",
    "How big is the Framna team?",
    "What makes Framna environmentally conscious?",
    "Where can Framna employees work from?"
]

for question in test_questions:
    print(f"\n❓ Question: {question}")
    answer, qas_used = rag_query(question)
    
    print(f"\n💡 Answer: {answer}")
    
    print(f"\n📚 Q&A pairs used:")
    for i, qa in enumerate(qas_used):
        print(f"\n  {i+1}. {qa}")
    
    print("="*80)

## Key Takeaways

1. **Context Injection**: We can provide external knowledge to LLMs through context
2. **Scaling Problems**: Large documents/databases are expensive and inefficient to process entirely
3. **RAG Solution**: Retrieve only relevant portions using semantic search
4. **Q&A Database Pattern**: Perfect example of why RAG is needed - we had 20 Q&A pairs but only needed 2-3 for any given question
5. **Embeddings**: Vector representations enable semantic similarity search beyond keyword matching
6. **Efficiency**: RAG cuts token usage by sending only the retrieved Q&A pairs instead of the whole database, while still giving the model the context it needs to answer

## Next Steps
- Experiment with different retrieval strategies (top-k, similarity thresholds)
- Try different embedding models (text-embedding-3-large vs small)
- Add metadata filtering (categories, dates, etc.)
- Implement re-ranking of retrieved results
- Add evaluation metrics for RAG quality (relevance, completeness)
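
As a starting point for the evaluation item, here is a sketch of one simple retrieval metric: the share of test questions for which the expected Q&A pair shows up in the top-k results. Everything in it is an assumption for illustration: `hit_rate_at_k`, the canned rankings, and the labels are invented, and the ids only mimic the `qa_{i}` format used earlier.

```python
from typing import Callable, Dict, List

def hit_rate_at_k(
    labeled_queries: Dict[str, str],
    retrieve_ids: Callable[[str, int], List[str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose expected document id appears in the top-k retrieved ids."""
    if not labeled_queries:
        return 0.0
    hits = sum(
        1
        for query, expected_id in labeled_queries.items()
        if expected_id in retrieve_ids(query, k)
    )
    return hits / len(labeled_queries)

# Stand-in retriever with canned rankings, so the example runs without a vector store
canned_rankings = {
    "Where is the office?": ["qa_4", "qa_1", "qa_9"],
    "What languages are used?": ["qa_2", "qa_7", "qa_3"],
}

def fake_retrieve(query: str, k: int) -> List[str]:
    return canned_rankings[query][:k]

# Hand-labeled expectations: which Q&A id should be retrieved for each query
labels = {"Where is the office?": "qa_1", "What languages are used?": "qa_5"}
score = hit_rate_at_k(labels, fake_retrieve, k=3)
print(f"hit rate@3: {score:.2f}")
```

In this notebook, the stand-in retriever could be swapped for a small wrapper that calls `search_knowledge(query, n_results=k)` and returns the first list under the `ids` key of the result.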