# Chapter 4.1: RAG: Contextual Grounding via Semantic Search

In Chapters 1-3, we explored the fundamentals of AI-assisted programming: task taxonomy, mental models, prompt engineering, and environment setup. We learned that language models have powerful capabilities, but they also have critical limitations—most notably, they're constrained by their training data and context windows.

Consider this scenario: You're working with a large codebase containing thousands of files. You need help understanding a specific module's functionality. The entire codebase won't fit in the model's context window, and even if it did, the model wasn't trained on your specific code. How do you provide the AI with the right context to assist you effectively?

This is where **Retrieval-Augmented Generation (RAG)** comes in: a technique that retrieves relevant context from external sources at query time and supplies it to the model alongside your prompt.

In [2]:
import ollama

# Configure your server URL here
SERVER_HOST = 'http://ollama.cs.wallawalla.edu:11434'
client = ollama.Client(host=SERVER_HOST)

def call_ollama(prompt, model="cs450", **options):
    """
    Send a prompt to the Ollama API.
    
    Args:
        prompt (str): The prompt to send
        model (str): Model name to use
        **options: Additional model parameters (temperature, top_k, etc.)
    
    Returns:
        str: The model's response
    """
    try:
        response = client.generate(
            model=model,
            prompt=prompt,
            options=options
        )
        return response['response']
    
    except Exception as e:
        return f"Error: {e}"

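def call_ollama_full(prompt, model="cs450", **options):
    """
    Send a prompt to the Ollama API and return the full response.

    Takes the same arguments as call_ollama, but returns the whole
    response object (generated text plus metadata) instead of only
    the text. On failure it returns an error string, so check the
    type of the result before indexing into it.
    """
    try:
        response = client.generate(
            model=model,
            prompt=prompt,
            options=options
        )
        return response
    
    except Exception as e:
        return f"Error: {e}"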



### The Limitations LLMs Face

**1. Knowledge Cutoff**: Models are frozen in time: they are trained on data collected up to a specific date, and private material is generally not part of that data. They don't know about:
- Your proprietary codebase
- Recent library updates or API changes  
- Company-specific conventions and patterns
- Project-specific documentation

**2. Context Window Constraints**: Even with modern large context windows (32K, 128K tokens), you often can't fit:
- Entire codebases (millions of lines)
- Complete documentation sets
- Historical project discussions and decisions
- All relevant code dependencies

**3. Hallucination Risk**: When models don't have relevant information, they may:
- Generate plausible-sounding but incorrect information
- Invent APIs or functions that don't exist
- Misremember details from their training data
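
The context-window constraint above can be made concrete with some rough arithmetic. The sketch below is only an estimate: `TOKENS_PER_LINE = 10` is an assumed average chosen for illustration, not a measured value, and `fits_in_context` is a hypothetical helper name.

```python
# Rough context-budget check. TOKENS_PER_LINE is an assumed average,
# not a measured value; real tokenizers vary by language and code style.
TOKENS_PER_LINE = 10

def fits_in_context(num_lines, context_tokens, tokens_per_line=TOKENS_PER_LINE):
    """Return (estimated_tokens, fits) for a body of text with num_lines lines."""
    estimated = num_lines * tokens_per_line
    return estimated, estimated <= context_tokens

for lines in (2_000, 50_000, 1_000_000):
    for window in (32_000, 128_000):
        estimated, fits = fits_in_context(lines, window)
        print(f"{lines:>9,} lines ~ {estimated:>10,} tokens | "
              f"{window:>7,}-token window: {'fits' if fits else 'too large'}")
```

Under this assumption a single module fits comfortably, while a million-line codebase exceeds even a 128K-token window by roughly two orders of magnitude.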

### The RAG Solution

RAG addresses these limitations through a three-step process:

1. **Index**: Pre-process and store your documents in a searchable format
2. **Retrieve**: Find the most relevant documents for a given query
3. **Generate**: Provide retrieved context to the model along with your prompt
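
The three steps can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not a production pipeline: `toy_embed` is a stand-in that counts words instead of calling a real embedding model, the three documents are invented, and the final call to the model is left as a comment so the example runs without a server.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Index: embed each document once and store it next to its text
documents = [  # invented examples
    "parse_config reads settings from a YAML file",
    "send_email delivers a message through the SMTP server",
    "retry_request repeats a failed HTTP call with backoff",
]
index = [(doc, toy_embed(doc)) for doc in documents]

# 2. Retrieve: rank stored documents by similarity to the query
def retrieve(query, k=1):
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# 3. Generate: place the retrieved context in front of the question
def build_prompt(query, k=1):
    context = "\n".join(retrieve(query, k))
    return (f"Context:\n{context}\n\n"
            f"Question: {query}\nAnswer using only the context.")

prompt = build_prompt("How are settings read from the YAML file?")
print(prompt)
# The prompt would then be sent to the model, e.g. call_ollama(prompt)
```

Swapping `toy_embed` for a real embedding model changes the quality of retrieval but not the shape of the pipeline.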


## Semantic Search: The Foundation of RAG

RAG's power comes from **semantic search**—the ability to find documents based on meaning rather than just keyword matching. This is fundamentally different from traditional search.

### Traditional vs. Semantic Search

**Traditional (Keyword) Search:**
- Query: "python function sorting"
- Matches only documents containing these exact words
- Misses: "def bubble_sort(arr):" (contains none of the query words verbatim)
- Misses: "organizing data in ascending order" (same idea, different terminology)

**Semantic Search:**
- Query: "python function sorting"
- Matches on the *meaning* and *intent* of the query, not its literal words
- Finds: Implementation of sorting functions
- Finds: Explanations using different terminology
- Finds: Related concepts (algorithms, data structures)
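
The keyword-search failure mode is easy to reproduce. The sketch below is a deliberately naive matcher written for illustration (the helper name `keyword_search` and the first document are assumptions); real keyword engines add stemming and ranking, but the vocabulary-mismatch problem remains.

```python
import re

def keyword_search(query, documents):
    """Return documents that contain every query word as a whole token."""
    query_words = set(query.lower().split())
    hits = []
    for doc in documents:
        # Split on anything that is not a letter or digit
        doc_words = set(re.findall(r"[a-z0-9]+", doc.lower()))
        if query_words <= doc_words:
            hits.append(doc)
    return hits

documents = [
    "A python function for sorting a list of numbers",  # invented example
    "def bubble_sort(arr):",
    "organizing data in ascending order",
]

print(keyword_search("python function sorting", documents))
```

Only the first document is returned. The other two are clearly relevant to a human reader, yet they share no tokens with the query, which is the gap semantic search is meant to close.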

### How Semantic Search Works

Semantic search relies on **embeddings**—vector representations that capture meaning:

In [3]:
def explore_embeddings():
    """Understand what embeddings are and how they work."""
    
    # Generate embeddings for similar and different texts
    texts = [
        "The cat sat on the mat",
        "A feline rested on a rug",
        "Python is a programming language",
        "The dog ran in the park"
    ]
    
    print("Generating Embeddings")
    print("=" * 60)
    
    embeddings = []
    for text in texts:
        response = client.embeddings(
            model='nomic-embed-text',
            prompt=text
        )
        embeddings.append(response['embedding'])
        print(f"\nText: {text}")
        print(f"Embedding dimensions: {len(response['embedding'])}")
        print(f"First 5 values: {response['embedding'][:5]}")
    
    return texts, embeddings

if __name__ == "__main__":
    explore_embeddings()

Generating Embeddings

Text: The cat sat on the mat
Embedding dimensions: 768
First 5 values: [1.004204511642456, 1.3216490745544434, -2.537903070449829, -0.5848085880279541, 1.1537489891052246]

Text: A feline rested on a rug
Embedding dimensions: 768
First 5 values: [0.989753246307373, 1.0614591836929321, -3.260462760925293, -1.8742859363555908, 0.5663155317306519]

Text: Python is a programming language
Embedding dimensions: 768
First 5 values: [0.2254875749349594, 1.77924644947052, -2.091967821121216, -1.2990589141845703, 1.6509016752243042]

Text: The dog ran in the park
Embedding dimensions: 768
First 5 values: [0.19359147548675537, 0.8133468627929688, -3.2348642349243164, -0.14081451296806335, 0.2091991901397705]


**Key Insights:**

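1. **High-Dimensional Vectors**: Embeddings typically have 384 to 1536 dimensions; `nomic-embed-text` above produces 768
2. **Semantic Encoding**: Similar meanings → similar vectors
3. **Distance Metrics**: Closer vectors → more similar meanings, with closeness usually measured by cosine similarity or Euclidean distance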
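
One common way to measure how close two vectors are is cosine similarity, which compares direction and ignores length. That matters here because the values printed above exceed 1 in magnitude, so these vectors are not unit-normalized. The following is a minimal pure-Python sketch; the three-dimensional vectors are made up for illustration and are not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, invented for illustration (not real embeddings)
cat = [1.0, 2.0, 0.0]
feline = [2.0, 4.0, 0.1]
python_lang = [0.0, -1.0, 3.0]

print(cosine_similarity(cat, feline))       # nearly parallel: close to 1
print(cosine_similarity(cat, python_lang))  # different direction: much lower
```

To try it on real data, capture the return value of the function above with `texts, embeddings = explore_embeddings()` and compare pairs such as `cosine_similarity(embeddings[0], embeddings[1])`.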