---
## Section 4: Context Engineering (~25 min)
---

### 4.1 Context Engineering Fundamentals

**Context engineering** optimizes *what information reaches the LLM*. Even a perfect prompt cannot compensate for irrelevant or noisy context: garbage context = garbage answers.

**Three levers to optimize:**

1. **Chunk size**: How big each piece of context is
2. **Retrieval count (k)**: How many chunks to retrieve
3. **Re-ranking**: Reorder results by relevance after retrieval

**The "lost in the middle" problem:**
LLMs tend to attend most to the beginning and end of their context window and to under-use information placed in the middle. This means:
- Fewer, more relevant chunks > many loosely relevant chunks
- Re-ranking to put the best chunk first matters
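
One way to act on this is to reorder the retrieved chunks so the strongest ones sit at the edges of the context and the weakest in the middle. The sketch below is a minimal illustration in plain Python; the function name `reorder_for_long_context` and the toy labels are assumptions, not part of the workshop code.

```python
def reorder_for_long_context(docs_best_first):
    """Place the strongest chunks at the edges of the context, weakest in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        # Alternate by rank: 1st -> front, 2nd -> back, 3rd -> front, ...
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-best chunk ends up in the last position.
    return front + back[::-1]


# Toy labels standing in for chunks that are already ranked best-first.
toy_ranked = ["best", "second", "third", "fourth", "fifth"]
toy_reordered = reorder_for_long_context(toy_ranked)
print(toy_reordered)
# → ['best', 'third', 'fifth', 'fourth', 'second']
```

With only three chunks in the final context the effect is small; it should matter more as k grows.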

**Context window budget:**
```
Total context window (e.g., 8K tokens)
  - System prompt:    ~200 tokens
  - Retrieved context: ~2000-4000 tokens (our budget)
  - Generation space:  ~2000 tokens
  - Safety margin:     ~1000 tokens
```
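
The budget above can be turned into a simple packing rule: add chunks in rank order until the next one would overflow the retrieval budget. This is a rough sketch under stated assumptions: the 4-characters-per-token heuristic is only an estimate, the 8K window is taken as 8000 tokens, and `estimate_tokens` and `pack_chunks` are hypothetical helper names.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return len(text) // 4


def pack_chunks(chunks_best_first, budget_tokens):
    """Keep chunks in rank order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks_best_first:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected, used


# Numbers from the budget above (8K treated as 8000 tokens for simplicity).
context_window = 8000
system_prompt_tokens, generation_tokens, safety_tokens = 200, 2000, 1000
available = context_window - system_prompt_tokens - generation_tokens - safety_tokens
retrieval_budget = min(4000, available)  # cap at the upper end of our budget

# Toy chunks of about 1000, 2000, and 2000 tokens.
toy_chunks = ["a" * 4000, "b" * 8000, "c" * 8000]
selected, used = pack_chunks(toy_chunks, retrieval_budget)
print(f"budget={retrieval_budget} | chunks kept={len(selected)} | tokens used={used}")
# → budget=4000 | chunks kept=2 | tokens used=3000
```

For exact counts, replace `estimate_tokens` with the tokenizer of the model you actually call.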

### 4.2 Chunk Size Optimization

In [None]:
import numpy as np

chunk_configs = {
    "small": {"chunk_size": 200, "chunk_overlap": 40},
    "medium": {"chunk_size": 500, "chunk_overlap": 100},
    "large": {"chunk_size": 1000, "chunk_overlap": 200},
    "xlarge": {"chunk_size": 2000, "chunk_overlap": 400},
}

chunk_stores = {}
for name, config in chunk_configs.items():
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=config["chunk_size"],
        chunk_overlap=config["chunk_overlap"],
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    config_chunks = splitter.split_documents(documents)
    store = FAISS.from_documents(config_chunks, embeddings)
    chunk_stores[name] = {"store": store, "chunks": config_chunks}
    
    sizes = [len(c.page_content) for c in config_chunks]
    print(f"{name:>8}: {len(config_chunks):>5} chunks | avg {np.mean(sizes):>6.0f} chars | "
          f"min {min(sizes):>4} | max {max(sizes):>4}")

   small:  1866 chunks | avg    146 chars | min    1 | max  200
  medium:   702 chunks | avg    404 chars | min   40 | max  500
   large:   352 chunks | avg    815 chars | min   58 | max  999
  xlarge:   180 chunks | avg   1594 chars | min   62 | max 1999


In [None]:
test_query = "What PPE is required for handling DESMOPHEN XP 2680?"
print(f"Query: '{test_query}'\n")

for name, data in chunk_stores.items():
    results = data["store"].similarity_search_with_score(test_query, k=3)
    avg_score = np.mean([score for _, score in results])
    total_chars = sum(len(doc.page_content) for doc, _ in results)
    
    print(f"{name:>8}: avg_distance={avg_score:.4f} | total_chars={total_chars:>5} | "
          f"chunks_returned={len(results)}")
    for i, (doc, score) in enumerate(results):
        print(f"          [{i+1}] distance={score:.4f} | {len(doc.page_content)} chars | "
              f"{doc.metadata.get('product_name', '?')}/S{doc.metadata.get('section_number', '?')}")

### 4.3 Retrieval Parameter Tuning (k-value analysis)

In [None]:
# Analyze the impact of k on retrieval

def analyze_k_impact(store, query: str, k_values: list[int]):
    """Test different k values and compare results."""
    print(f"Query: '{query}'\n")
    print(f"{'k':>3} | {'Docs':>4} | {'~Tokens':>7} | {'Unique Sources':>14} | {'Avg Distance':>12}")
    print("-" * 55)
    
    for k in k_values:
        results = store.similarity_search_with_score(query, k=k)
        total_chars = sum(len(doc.page_content) for doc, _ in results)
        approx_tokens = total_chars // 4  # rough estimate
        unique_sources = len(set(
            f"{doc.metadata.get('product_name', '?')}/S{doc.metadata.get('section_number', '?')}"
            for doc, _ in results
        ))
        avg_dist = np.mean([score for _, score in results])
        
        print(f"{k:>3} | {len(results):>4} | {approx_tokens:>7} | {unique_sources:>14} | {avg_dist:>12.4f}")

# Use the large chunk store (our default)
analyze_k_impact(
    chunk_stores["large"]["store"],
    "What PPE is required for handling DESMOPHEN XP 2680?",
    k_values=[1, 3, 5, 10]
)

**Tradeoff analysis:**
- **k=1**: Minimal context, highest precision, might miss information
- **k=3**: Good balance (our default)
- **k=5**: More context, higher recall, more noise
- **k=10**: Maximum recall, but risk of "lost in the middle" and high cost
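
The precision/recall tradeoff can be made concrete with precision@k and recall@k. The sketch below uses made-up relevance labels for a 10-result ranking (1 = relevant, 0 = not relevant) and assumes every relevant chunk appears somewhere in those 10 results; `precision_recall_at_k` is a hypothetical helper.

```python
def precision_recall_at_k(relevance_by_rank, k, total_relevant):
    """Return (precision@k, recall@k) for a ranked list of 0/1 relevance labels."""
    top_k = relevance_by_rank[:k]
    hits = sum(top_k)
    precision = hits / len(top_k)
    recall = hits / total_relevant
    return precision, recall


# Hypothetical labels, not measured on the workshop data.
toy_relevance = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
toy_total_relevant = sum(toy_relevance)

for k in [1, 3, 5, 10]:
    p, r = precision_recall_at_k(toy_relevance, k, toy_total_relevant)
    print(f"k={k:>2} | precision={p:.2f} | recall={r:.2f}")
# → k= 1 | precision=1.00 | recall=0.25
# → k= 3 | precision=0.67 | recall=0.50
# → k= 5 | precision=0.60 | recall=0.75
# → k=10 | precision=0.40 | recall=1.00
```

In this toy ranking, precision falls and recall rises as k grows, which is the pattern the list above describes.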

### 4.4 Metadata Filtering & Re-ranking

In [None]:
# --- Metadata Filtering ---
# Filter retrieval to a specific product

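def retrieve_with_product_filter(store, query: str, product_name: str, k: int = 3):
    """Retrieve docs filtered by product name metadata."""
    # The raw FAISS index has no metadata filtering, so we over-retrieve and filter
    # in Python, which also allows a case-insensitive substring match.
    # (LangChain's FAISS wrapper also accepts a `filter=` argument on its search methods.)
    # Note: if fewer than k of the k * 5 candidates match, fewer than k results come back.
    all_results = store.similarity_search_with_score(query, k=k * 5)
    target = product_name.lower()
    filtered = [
        (doc, score)
        for doc, score in all_results
        if target in (doc.metadata.get("product_name") or "").lower()
    ]
    return filtered[:k]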

query = "What are the hazardous decomposition products?"

print(f"Query: '{query}'\n")
print("--- Unfiltered (all products) ---")
for doc, score in vector_store.similarity_search_with_score(query, k=3):
    print(f"  {doc.metadata.get('product_name', '?')}/S{doc.metadata.get('section_number', '?')} | distance={score:.4f}")

print("\n--- Filtered (BAYBLEND M750 only) ---")
for doc, score in retrieve_with_product_filter(vector_store, query, product_name="BAYBLEND M750", k=3):
    print(f"  {doc.metadata.get('product_name', '?')}/S{doc.metadata.get('section_number', '?')} | distance={score:.4f}")

In [None]:
# LLM-based Re-ranking
# Use the LLM to score and re-order retrieved results
import re

def rerank_results(query: str, docs: list, llm, top_n: int = 3) -> list:
    """Re-rank documents by LLM-judged relevance (1-10)."""
    scored = []
    for doc in docs:
        # Only the first 500 characters are sent, to keep each scoring call cheap
        prompt = (
            "Rate the relevance of this document to the query on a scale of 1-10.\n\n"
            f"Query: {query}\n\n"
            f"Document: {doc.page_content[:500]}\n\n"
            "Respond with ONLY a single number from 1 to 10."
        )
        response = llm.invoke(prompt)
        # Take the first integer in the reply so answers like "8/10" or "Score: 8"
        # still parse, then clamp it to the 1-10 scale
        match = re.search(r"\d+", response.content)
        score = min(max(int(match.group()), 1), 10) if match else 5  # default if parsing fails
        scored.append((doc, score))

    # Python's sort is stable, so tied scores keep their original vector-distance order
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_n]

# Compare basic vs re-ranked retrieval
query = "What are the storage requirements for DESMOPHEN XP 2680?"
basic_results = vector_store.similarity_search(query, k=5)

print(f"Query: '{query}'\n")
print("--- Basic retrieval (top 5 by vector distance) ---")
for i, doc in enumerate(basic_results, 1):
    print(f"  [{i}] {doc.metadata.get('product_name', '?')}/S{doc.metadata.get('section_number', '?')} | {doc.page_content[:80]}...")

reranked = rerank_results(query, basic_results, llm, top_n=3)
print("\n--- Re-ranked (top 3 by LLM relevance) ---")
for i, (doc, score) in enumerate(reranked, 1):
    print(f"  [{i}] score={score}/10 | {doc.metadata.get('product_name', '?')}/S{doc.metadata.get('section_number', '?')} | {doc.page_content[:80]}...")

### 4.5 Hybrid Search (BM25 + Vector)

Vector search excels at **semantic similarity**: finding chunks that *mean* the same thing even with different wording. But it can miss exact keyword matches that a user expects.

**BM25** is a classic keyword-based ranking algorithm (used by Elasticsearch, Lucene, etc.). It excels at:
- Exact term matching ("BERT", "BPE", "tokenizer")
- Queries with rare or specific technical terms
- Cases where the user's wording closely matches the document
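
To see why BM25 rewards exact term matches, here is a minimal sketch of the Okapi BM25 scoring formula in plain Python. It is an illustration, not the implementation behind the workshop's retriever: the whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query terms with Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates (k1) and is normalized by document length (b).
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


# Made-up toy documents, tokenized by whitespace.
toy_docs = [
    "wear gloves and goggles when handling the resin".split(),
    "store the resin in a cool dry place".split(),
    "dispose of the container according to local rules".split(),
]
toy_scores = bm25_scores(["gloves", "goggles"], toy_docs)
print([round(s, 3) for s in toy_scores])
```

Only the first document contains the query terms, so it is the only one with a non-zero score. A paraphrase such as "protective equipment" would score zero everywhere, which is the gap vector search fills.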

**Hybrid search** combines both:

| Strategy | Strengths | Weaknesses |
|----------|-----------|------------|
| **Vector** | Semantic understanding, paraphrases | Misses exact keywords |
| **BM25** | Exact term matching, rare terms | No semantic understanding |
| **Hybrid** | Best of both worlds | Slightly more compute |

We use LangChain's `EnsembleRetriever`, which merges the ranked results of both retrievers using **Reciprocal Rank Fusion (RRF)** with configurable weights.
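
Weighted RRF can be sketched in a few lines: each retriever contributes `weight / (c + rank)` for every document it returns, and documents are sorted by their summed score. This is a simplified illustration rather than the `EnsembleRetriever` source; the constant `c=60` (a commonly used value), the function name `weighted_rrf`, and the document IDs are assumptions.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several best-first rankings into one using weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents (small rank) contribute more to the fused score.
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-3 results from each retriever.
toy_bm25_ranking = ["doc_c", "doc_a", "doc_d"]
toy_vector_ranking = ["doc_a", "doc_b", "doc_c"]

fused = weighted_rrf([toy_bm25_ranking, toy_vector_ranking], weights=[0.3, 0.7])
print(fused)
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

The fused list is the union of both rankings, so it can hold more than k documents when the two retrievers disagree. Documents found by both retrievers (`doc_a`, `doc_c`) rise to the top.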

In [None]:
from rag.vectorstore import create_retriever

# Create all 3 retriever strategies
vector_ret = create_retriever("vector", chunks, vector_store, k=3)
bm25_ret = create_retriever("bm25", chunks, vector_store, k=3)
hybrid_ret = create_retriever("hybrid", chunks, vector_store, k=3)

# Compare results on a query with specific technical terms
query = "What are the GHS classification and CAS numbers for DESMOPHEN XP 2680?"
print(f"Query: '{query}'")

for name, ret in [("Vector", vector_ret), ("BM25", bm25_ret), ("Hybrid", hybrid_ret)]:
    docs = ret.invoke(query)
    print(f"--- {name} ({len(docs)} docs) ---")
    for i, doc in enumerate(docs, 1):
        preview = doc.page_content[:120].replace(chr(10), ' ')
        print(f"  [{i}] {doc.metadata.get('product_name', '?')}/S{doc.metadata.get('section_number', '?')} | {preview}...")
    print()

In [None]:
# Show how adjusting weights changes hybrid results

weight_configs = [
    (0.0, 1.0, "Pure Vector"),
    (0.3, 0.7, "30% BM25 / 70% Vector (default)"),
    (0.5, 0.5, "50% BM25 / 50% Vector"),
    (0.7, 0.3, "70% BM25 / 30% Vector"),
    (1.0, 0.0, "Pure BM25"),
]

query = "What is BPE tokenization and how does BERT use it?"
print(f"Query: '{query}'")

for bm25_w, vec_w, label in weight_configs:
    ret = create_retriever("hybrid", chunks, vector_store, k=3,
                           bm25_weight=bm25_w, vector_weight=vec_w)
    docs = ret.invoke(query)
    sources = [f"ch{d.metadata.get('chapter')}/p{d.metadata.get('page')}" for d in docs]
    print(f"{label:>40}: {sources}")

Query: 'What is BPE tokenization and how does BERT use it?'
                             Pure Vector: ['ch2/p4', 'ch2/p7', 'ch3/p2', 'ch3/p3', 'ch3/p2', 'ch1/p5']
         30% BM25 / 70% Vector (default): ['ch2/p4', 'ch2/p7', 'ch3/p2', 'ch3/p3', 'ch3/p2', 'ch1/p5']
                   50% BM25 / 50% Vector: ['ch3/p3', 'ch2/p4', 'ch3/p2', 'ch2/p7', 'ch1/p5', 'ch3/p2']
                   70% BM25 / 30% Vector: ['ch3/p3', 'ch3/p2', 'ch1/p5', 'ch2/p4', 'ch2/p7', 'ch3/p2']
                               Pure BM25: ['ch3/p3', 'ch3/p2', 'ch1/p5', 'ch2/p4', 'ch2/p7', 'ch3/p2']


### 4.6 Building the Optimized Pipeline

In [None]:
# Combine: large chunks + k=5 + re-ranking into an optimized graph

optimized_store = chunk_stores["large"]["store"]

def optimized_retrieve(state: State) -> dict:
    """Retrieve with re-ranking."""

    docs = optimized_store.similarity_search(state["question"], k=5)

    reranked = rerank_results(state["question"], docs, llm, top_n=3)
    context = "\n\n".join(doc.page_content for doc, _ in reranked)
    return {"context": context}

def optimized_generate(state: State) -> dict:
    messages = selected_prompt.format_messages(
        context=state["context"],
        question=state["question"],
    )
    response = llm.invoke(messages)
    return {"answer": response.content}

opt_builder = StateGraph(State)
opt_builder.add_node("retrieve", optimized_retrieve)
opt_builder.add_node("generate", optimized_generate)
opt_builder.add_edge(START, "retrieve")
opt_builder.add_edge("retrieve", "generate")
opt_builder.add_edge("generate", END)

optimized_graph = opt_builder.compile()
print("Optimized RAG graph compiled (large chunks + k=5 + re-ranking)!")

Optimized RAG graph compiled (large chunks + k=5 + re-ranking)!


In [None]:
# Test the optimized pipeline
result = optimized_graph.invoke({"question": "What PPE is required for DESMOPHEN XP 2680?"})
print("Question:", result["question"])
print("\nAnswer:", result["answer"])