---
## Section 5: Automated Testing with RAGAS (~25 min)
---

### 5.1 Why Automated Evaluation

Manual testing doesn't scale. **RAGAS** (Retrieval-Augmented Generation Assessment) automates evaluation using **LLM-as-judge**.

| Approach | Pros | Cons |
|----------|------|------|
| Manual evaluation | High quality, nuanced | Slow, expensive, inconsistent |
| Automated (RAGAS) | Fast, reproducible, scalable | Depends on judge LLM quality |

**Three metrics we'll use:**

| Metric | What it measures | Score meaning |
|--------|-----------------|---------------|
| **LLMContextRecall** | Does retrieved context contain needed info? | 1.0 = all needed info retrieved |
| **Faithfulness** | Is answer faithful to context (no hallucination)? | 1.0 = fully grounded |
| **FactualCorrectness** | Does answer match ground truth? | 1.0 = perfectly correct |
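
To make the score meanings above concrete, here is a minimal sketch of the arithmetic behind the three metrics. It assumes the judge LLM has already split the texts into claims and labelled each one; the verdict lists and counts below are made-up placeholders, and the helper names `ratio` and `f1_score` are hypothetical, not part of the RAGAS API.

```python
def ratio(verdicts):
    """Share of True verdicts in a list; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def f1_score(tp, fp, fn):
    """F1 from claim counts: true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Placeholder judge verdicts for a single question (illustrative only).
# Context recall: is each ground-truth claim found in the retrieved context?
reference_claims_in_context = [True, True, False, True]
# Faithfulness: is each claim in the answer supported by the retrieved context?
answer_claims_in_context = [True, True]
# Factual correctness: answer claims vs. ground-truth claims.
tp, fp, fn = 2, 0, 2  # 2 correct claims, 0 wrong claims, 2 missing claims

context_recall = ratio(reference_claims_in_context)
faithfulness = ratio(answer_claims_in_context)
factual_correctness = f1_score(tp, fp, fn)

print(f"context_recall      = {context_recall:.2f}")
print(f"faithfulness        = {faithfulness:.2f}")
print(f"factual_correctness = {factual_correctness:.2f}")
```

In this example the answer is fully grounded (faithfulness 1.0) but still incomplete, which is why factual correctness is lower: the two metrics answer different questions.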

### RAGAS Test Set Generation

Instead of writing test questions by hand, RAGAS can **automatically generate a test set** from your documents using a KnowledgeGraph, which helps cover more of your SDS content than a handful of manual questions would. We also include a hand-crafted set of 12 SDS Q&A pairs in `qa_dataset.xlsx` for targeted evaluation.

### 5.2 Building a Knowledge Graph & Generating the Test Set

RAGAS builds a **KnowledgeGraph** from your documents, applies transformations to understand relationships, then synthesizes diverse test questions automatically.
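
The idea of "nodes, relationships, then questions" can be illustrated with a toy sketch in plain Python. This is not the RAGAS implementation: RAGAS derives relationships with LLM- and embedding-based transforms, whereas this sketch substitutes simple keyword overlap. The function names and the three SDS-like chunks are hypothetical, made up for illustration.

```python
import re
from itertools import combinations


def keywords(text, min_len=5):
    """Crude keyword extraction: lowercase words of at least min_len letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_len}


def build_toy_graph(chunks, min_shared=1):
    """One node per chunk; an edge wherever two chunks share keywords."""
    nodes = [{"id": i, "text": t, "keywords": keywords(t)} for i, t in enumerate(chunks)]
    edges = []
    for a, b in combinations(nodes, 2):
        shared = a["keywords"] & b["keywords"]
        if len(shared) >= min_shared:
            edges.append((a["id"], b["id"], sorted(shared)))
    return nodes, edges


def question_seeds(nodes, edges):
    """Single-hop seeds come from one node, multi-hop seeds from a linked pair."""
    single = [("single_hop", (n["id"],)) for n in nodes]
    multi = [("multi_hop", (a, b)) for a, b, _ in edges]
    return single + multi


# Made-up SDS-like chunks (assumption, not taken from the workshop data)
chunks = [
    "Store the product in a cool, ventilated place away from ignition sources.",
    "In case of fire, use foam or dry powder; vapours may reach ignition sources.",
    "Wear protective gloves and safety goggles during handling.",
]

nodes, edges = build_toy_graph(chunks)
seeds = question_seeds(nodes, edges)
print(edges)
print(seeds)
```

The linked pair is what makes a multi-hop question possible: a question about storage and fire response has to draw on both chunks, which a per-chunk generator would never produce.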

In [None]:
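# Run RAGAS evaluation with Ollama as judge
# (evaluator_llm, evaluator_embeddings, and eval_dataset are assumed to be defined in earlier cells)
from ragas import evaluate
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness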

metrics = [
    LLMContextRecall(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    FactualCorrectness(llm=evaluator_llm),
]

ragas_result = evaluate(
    dataset=eval_dataset,
    metrics=metrics,
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)

print("RAGAS Evaluation Results:")
print(f"  LLM Context Recall: {ragas_result['context_recall']:.4f}")
print(f"  Faithfulness:       {ragas_result['faithfulness']:.4f}")
print(f"  Factual Correctness:{ragas_result['factual_correctness']:.4f}")

In [None]:
# View per-question results
results_df = ragas_result.to_pandas()
results_df

**Interpreting the scores:**

- **LLMContextRecall > 0.7**: Our retrieval is finding relevant information
- **Faithfulness > 0.8**: The LLM is staying grounded (thanks to our restrictive prompt)
- **FactualCorrectness**: Depends heavily on question difficulty and context coverage

Low scores show where to improve: low context recall points to retrieval (chunk size, k), while low faithfulness points to generation (prompt engineering).
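
The rules of thumb above can be turned into a small triage helper. This is a sketch under assumptions: the thresholds 0.7 and 0.8 come from the list above, while the function name `triage`, the metric keys, and the example scores are hypothetical placeholders. FactualCorrectness is left out because no threshold is given for it.

```python
# Thresholds taken from the interpretation notes; a score must exceed them.
THRESHOLDS = {"context_recall": 0.7, "faithfulness": 0.8}

# Where to look first when a metric misses its target.
HINTS = {
    "context_recall": "retrieval: revisit chunk size and k",
    "faithfulness": "generation: tighten the prompt",
}


def triage(scores):
    """Return one finding per metric that does not exceed its threshold."""
    findings = []
    for metric, minimum in THRESHOLDS.items():
        value = scores.get(metric)
        if value is not None and value <= minimum:
            findings.append(
                f"{metric}={value:.2f} (target > {minimum}): {HINTS[metric]}"
            )
    return findings


# Made-up scores for illustration only
example_scores = {"context_recall": 0.55, "faithfulness": 0.91}
for finding in triage(example_scores):
    print(finding)
```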

### 5.4 Comparative Evaluation Pipeline

Now let's compare multiple RAG configurations to find the best one.

In [None]:
# Build graphs with different retriever strategies using the factory
from rag.vectorstore import create_retriever
from rag.pipeline import build_basic_graph

# Create retrievers
vector_retriever = create_retriever("vector", chunks, vector_store, k=3)
bm25_retriever = create_retriever("bm25", chunks, vector_store, k=3)
hybrid_retriever = create_retriever("hybrid", chunks, vector_store, k=3)

# Build a graph for each retriever strategy
vector_graph = build_basic_graph(llm, vector_store, prompt_template=selected_prompt, k=3,
                                  retriever=vector_retriever)
bm25_graph = build_basic_graph(llm, vector_store, prompt_template=selected_prompt, k=3,
                                retriever=bm25_retriever)
hybrid_graph = build_basic_graph(llm, vector_store, prompt_template=selected_prompt, k=3,
                                  retriever=hybrid_retriever)

# Evaluate each retriever strategy
print("Evaluating retriever strategies...")
all_results.append(evaluate_config(vector_graph, "Vector Retriever", testset_df))
all_results.append(evaluate_config(bm25_graph, "BM25 Retriever", testset_df))
all_results.append(evaluate_config(hybrid_graph, "Hybrid Retriever", testset_df))

print("Done! All 6 configurations evaluated.")
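
Once every configuration has been scored, the results can be ranked to pick a winner. The sketch below assumes each entry in the results list is a dict with a `config` name and one mean score per metric; the actual structure returned by `evaluate_config` is not shown in this section, so treat that shape as hypothetical. The scores are placeholders, not measured results, and ranking by the unweighted mean of the three metrics is only one possible choice.

```python
from statistics import mean

METRICS = ("context_recall", "faithfulness", "factual_correctness")


def overall(result):
    """Unweighted mean of the three metric scores for one configuration."""
    return mean(result[m] for m in METRICS)


def rank_configs(results):
    """Sort configurations from best to worst by their overall score."""
    return sorted(results, key=overall, reverse=True)


# Placeholder scores (assumed record shape, NOT real evaluation output)
placeholder_results = [
    {"config": "Vector Retriever", "context_recall": 0.5, "faithfulness": 0.5, "factual_correctness": 0.5},
    {"config": "BM25 Retriever", "context_recall": 0.4, "faithfulness": 0.4, "factual_correctness": 0.4},
    {"config": "Hybrid Retriever", "context_recall": 0.6, "faithfulness": 0.6, "factual_correctness": 0.6},
]

ranked = rank_configs(placeholder_results)
for r in ranked:
    print(f"{r['config']:<18} overall={overall(r):.2f}")
```

If one metric matters more for your use case (for example, faithfulness for safety-critical SDS answers), a weighted mean in `overall` would be a reasonable variation.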

### 5.5 Workshop Conclusion

---

## What We Built

In this workshop, we built a **production-ready RAG system** with:

1. **Data Pipeline**: PDF extraction (PyMuPDF) → section splitting → embedding → FAISS indexing
2. **Guardrails**: Input validation, prompt injection detection, topic relevance, output grounding
3. **Prompt Engineering**: 4 strategies compared quantitatively (restrictive, permissive, few-shot, structured)
4. **Context Engineering**: Chunk size optimization, k-value analysis, LLM-based re-ranking
5. **Hybrid Search**: BM25 + vector retrieval with configurable weights via EnsembleRetriever
6. **Evaluation**: Automated RAGAS pipeline with KnowledgeGraph-based test set generation, comparing prompt strategies and retriever strategies side by side

## Next Steps for Production

- **CI/CD with RAGAS**: Run evaluation on every code change to catch regressions
- **Monitoring**: Track latency, error rates, and user satisfaction in production
- **Conversation Memory**: Add multi-turn context for follow-up questions
- **Advanced Re-ranking**: Use cross-encoder models for better re-ranking
- **Semantic Chunking**: Split documents by meaning instead of fixed character count

## Resources

- [Original Article: Building and Evaluating your First RAG](https://medium.com/henkel-data-and-analytics/building-and-evaluating-your-first-rag) by Abdelrhman ElMoghazy
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [RAGAS Documentation](https://docs.ragas.io/)
- [Ollama](https://ollama.com)
- [Covestro Product Safety](https://www.productsafetyfirst.covestro.com)