# Week 15: RAG & Customer Service Evaluation
### BenchRight LLM Evaluation Master Program (18 Weeks)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand how Retrieval-Augmented Generation (RAG) works for customer service
2. Create a tiny FAQ corpus and index it with a simple vector store
3. Implement retrieval and generation components for RAG
4. Benchmark tinyGPT as a generator given retrieved snippets
5. Evaluate answer groundedness using LLM-as-Judge and string matching

---

## 🧠 What is RAG?

### The Challenge

LLMs have knowledge cutoffs and may not know about your company's specific policies:

| Challenge | Without RAG | With RAG |
|-----------|-------------|----------|
| Knowledge | Limited to training data | Access to external knowledge base |
| Accuracy | May hallucinate | Grounded in retrieved documents |
| Updates | Requires retraining | Just update knowledge base |
| Verification | Hard to trace claims | Can cite source documents |

### The RAG Pipeline

```
Question → Embed → Search KB → Retrieve Docs → Generate Answer
```
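
The pipeline above can be read as a composition of two stages: a retriever that turns a question into a ranked list of snippets, and a generator that answers from those snippets. The sketch below is a minimal illustration, not the implementation built later in this notebook. `rag_answer`, `overlap_retrieve`, `template_generate`, `tokenize`, and the two-entry `KB` are hypothetical names, and plain word overlap stands in for the embed-and-search steps.

```python
from typing import List

# Hypothetical two-entry knowledge base, used only for this sketch.
KB = [
    "Standard shipping takes 5-7 business days.",
    "You can return any item within 30 days of purchase.",
]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return set(cleaned.split())


def overlap_retrieve(question: str, top_k: int = 1) -> List[str]:
    """Rank KB entries by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(KB, key=lambda doc: len(q_words & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def template_generate(question: str, snippets: List[str]) -> str:
    """Answer strictly from the retrieved snippets (no model involved)."""
    if not snippets:
        return "I don't have information about that topic."
    return f"Based on our FAQ: {snippets[0]}"


def rag_answer(question, retrieve, generate, top_k=1):
    """Run the two stages in order: retrieve snippets, then generate from them."""
    snippets = retrieve(question, top_k)
    return generate(question, snippets)


print(rag_answer("How long does shipping take?", overlap_retrieve, template_generate))
```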

---

## 🛠️ Step 1: Setup & Dependencies

In [None]:
# Standard library imports
import numpy as np
import sys
import json
from typing import Dict, List, Any, Tuple, Optional, Callable

# Add the current directory to the import path (e.g. when running in Colab)
sys.path.insert(0, '.')

# For data display
try:
    from IPython.display import display, HTML
except ImportError:
    display = print

print("✅ Setup complete!")
print(f"   NumPy version: {np.__version__}")

---

## 📚 Step 2: Define the FAQ Corpus

In [None]:
# Customer service FAQ corpus
FAQ_CORPUS = [
    {
        "id": "faq_001",
        "question": "What is your return policy?",
        "answer": "You can return any item within 30 days of purchase for a full refund. Items must be unused and in original packaging. Return shipping is free for defective items.",
        "category": "returns",
    },
    {
        "id": "faq_002",
        "question": "How long does shipping take?",
        "answer": "Standard shipping takes 5-7 business days. Express shipping takes 2-3 business days. Free shipping is available on orders over $50.",
        "category": "shipping",
    },
    {
        "id": "faq_003",
        "question": "How do I track my order?",
        "answer": "You can track your order by logging into your account and visiting the 'Order History' section. You will also receive tracking updates via email once your order ships.",
        "category": "orders",
    },
    {
        "id": "faq_004",
        "question": "What payment methods do you accept?",
        "answer": "We accept Visa, Mastercard, American Express, PayPal, and Apple Pay. All transactions are secured with SSL encryption.",
        "category": "payment",
    },
    {
        "id": "faq_005",
        "question": "How do I cancel my order?",
        "answer": "You can cancel your order within 1 hour of placing it by contacting customer support. After 1 hour, orders enter processing and cannot be canceled, but you can return the item once received.",
        "category": "orders",
    },
    {
        "id": "faq_006",
        "question": "Do you offer international shipping?",
        "answer": "Yes, we ship to over 50 countries worldwide. International shipping rates vary by destination and typically take 10-14 business days.",
        "category": "shipping",
    },
    {
        "id": "faq_007",
        "question": "What if my item arrives damaged?",
        "answer": "If your item arrives damaged, please contact us within 48 hours with photos of the damage. We will send a replacement at no additional cost and arrange free return shipping for the damaged item.",
        "category": "returns",
    },
    {
        "id": "faq_008",
        "question": "How do I change my shipping address?",
        "answer": "You can update your shipping address in your account settings before placing an order. For orders already placed, contact customer support within 1 hour to request an address change.",
        "category": "shipping",
    },
]

print(f"📚 FAQ Corpus: {len(FAQ_CORPUS)} entries")
print("")
print("Categories:")
categories = {}
for faq in FAQ_CORPUS:
    cat = faq["category"]
    categories[cat] = categories.get(cat, 0) + 1

for cat, count in sorted(categories.items()):
    print(f"  • {cat}: {count} FAQs")
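
Because every later step looks entries up by `id` and embeds the `answer` field, a quick structural check on the corpus can catch typos early. The helper below is a hypothetical sketch (`validate_corpus` and `REQUIRED_FIELDS` are not part of the notebook). It assumes each entry is a dict with the four string fields used above, and it runs on a small made-up sample so that it is self-contained.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("id", "question", "answer", "category")  # assumed schema


def validate_corpus(corpus: List[Dict[str, Any]]) -> List[str]:
    """Return a list of readable problems; an empty list means no issue was found."""
    problems = []
    seen_ids = set()
    for position, entry in enumerate(corpus):
        # Every required field must be a non-empty string
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {position}: missing or empty '{field}'")
        # IDs must be unique because test cases refer to FAQs by id
        entry_id = entry.get("id")
        if entry_id is not None and entry_id in seen_ids:
            problems.append(f"entry {position}: duplicate id '{entry_id}'")
        seen_ids.add(entry_id)
    return problems


sample = [
    {"id": "faq_001", "question": "Q1?", "answer": "A1.", "category": "returns"},
    {"id": "faq_001", "question": "Q2?", "answer": "", "category": "shipping"},
]
for problem in validate_corpus(sample):
    print(problem)
```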

---

## 🔍 Step 3: Implement the Simple Vector Store

In [None]:
class SimpleVectorStore:
    """
    A simple vector store using numpy for demonstration.
    
    In production, use FAISS, Pinecone, Weaviate, or similar.
    This implementation uses cosine similarity for retrieval.
    """
    
    def __init__(self):
        """Initialize the vector store."""
        self.documents: List[Dict[str, Any]] = []
        self.embeddings: Optional[np.ndarray] = None
        self.embed_fn = None
    
    def set_embedding_function(self, embed_fn):
        """
        Set the embedding function.
        
        Args:
            embed_fn: Function that takes text and returns embedding vector
        """
        self.embed_fn = embed_fn
    
    def add_documents(self, documents: List[Dict[str, Any]]) -> None:
        """
        Add documents to the vector store.
        
        Args:
            documents: List of documents with 'answer' or 'text' field
        """
        # Validate before mutating state so a failed call leaves the store unchanged
        if self.embed_fn is None:
            raise ValueError("Embedding function not set. Call set_embedding_function first.")
        
        # Note: this replaces any previously indexed documents
        self.documents = list(documents)
        
        # Generate embeddings for all documents
        texts = [doc.get("answer", doc.get("text", "")) for doc in self.documents]
        embeddings_list = [self.embed_fn(text) for text in texts]
        self.embeddings = np.array(embeddings_list)
    
    def search(
        self,
        query: str,
        top_k: int = 3,
    ) -> List[Tuple[Dict[str, Any], float]]:
        """
        Search for similar documents.
        
        Args:
            query: Search query
            top_k: Number of results to return
            
        Returns:
            List of (document, similarity_score) tuples
        """
        if self.embeddings is None or len(self.documents) == 0:
            return []
        
        # Embed the query
        query_embedding = np.array(self.embed_fn(query))
        
        # Compute cosine similarity
        similarities = self._cosine_similarity(query_embedding, self.embeddings)
        
        # Get top-k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        # Return documents with scores
        results = []
        for idx in top_indices:
            results.append((self.documents[idx], float(similarities[idx])))
        
        return results
    
    def _cosine_similarity(
        self,
        query: np.ndarray,
        documents: np.ndarray,
    ) -> np.ndarray:
        """
        Compute cosine similarity between query and documents.
        
        Args:
            query: Query embedding (1D array)
            documents: Document embeddings (2D array)
            
        Returns:
            Array of similarity scores
        """
        # Normalize vectors
        query_norm = query / (np.linalg.norm(query) + 1e-8)
        doc_norms = documents / (np.linalg.norm(documents, axis=1, keepdims=True) + 1e-8)
        
        # Compute dot product
        similarities = np.dot(doc_norms, query_norm)
        
        return similarities


print("✅ SimpleVectorStore class defined!")
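
To make the similarity score concrete, here is the same calculation written out in plain Python on three-dimensional toy vectors. `cosine` is a hypothetical helper for illustration; unlike `_cosine_similarity`, it handles all-zero vectors with an explicit check instead of a small epsilon. Note that the first document has twice the query's counts yet still scores 1.0, because cosine similarity depends on direction rather than length.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
docs = {
    "same direction": [2.0, 0.0, 2.0],   # parallel to the query
    "partial overlap": [1.0, 1.0, 0.0],  # shares one dimension with the query
    "no overlap": [0.0, 1.0, 0.0],       # orthogonal to the query
}
for name, vec in docs.items():
    print(f"{name}: {cosine(query, vec):.2f}")
```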

---

## 🧮 Step 4: Implement Simple Embedding Function

In [None]:
# Vocabulary for bag-of-words embedding
VOCAB = [
    "return", "refund", "shipping", "ship", "order", "payment",
    "track", "cancel", "day", "days", "hour", "free",
    "international", "damaged", "broken", "address", "account",
    "paypal", "visa", "credit", "policy", "replace", "replacement",
    "express", "standard", "worldwide", "countries", "email",
    "photos", "48", "30", "1", "package", "item", "items"
]


def simple_bow_embedding(text: str) -> np.ndarray:
    """
    Create a simple bag-of-words embedding.
    
    For production, use sentence-transformers or similar.
    
    Args:
        text: Text to embed
        
    Returns:
        Embedding vector as numpy array
    """
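    # Replace punctuation with spaces so tokens like "refund?" still match "refund"
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    words = cleaned.split()
    embedding = np.array([words.count(w) for w in VOCAB], dtype=np.float32)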
    return embedding


# Test the embedding function
test_text = "How do I return an item for a refund?"
test_embedding = simple_bow_embedding(test_text)

print("✅ Embedding function defined!")
print(f"")
print(f"Test text: '{test_text}'")
print(f"Embedding dimension: {len(test_embedding)}")
print(f"Non-zero features: {np.sum(test_embedding > 0)}")
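
One consequence of a fixed vocabulary is that a query containing no vocabulary words embeds to the all-zero vector. Every document then ties at a similarity of zero, and the resulting ranking carries no information. A query such as "How fast is delivery?" is an example, since neither "fast" nor "delivery" is in `VOCAB`. The sketch below shows the effect with a hypothetical four-word vocabulary (`TOY_VOCAB`) and helper (`toy_embed`); it is an illustration, not the notebook's embedding function.

```python
from typing import List

TOY_VOCAB = ["shipping", "refund", "order", "days"]  # hypothetical vocabulary


def toy_embed(text: str) -> List[int]:
    """Count how often each vocabulary word occurs in the text."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    words = cleaned.split()
    return [words.count(term) for term in TOY_VOCAB]


in_vocab = toy_embed("Shipping takes 5-7 days; express shipping takes 2-3 days.")
out_of_vocab = toy_embed("How fast is delivery?")

print(in_vocab)      # counts for shipping, refund, order, days
print(out_of_vocab)  # no vocabulary word appears in the query
print("zero vector:", not any(out_of_vocab))
```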

---

## 🏗️ Step 5: Initialize Vector Store with FAQ Corpus

In [None]:
# Create and populate vector store
vector_store = SimpleVectorStore()
vector_store.set_embedding_function(simple_bow_embedding)
vector_store.add_documents(FAQ_CORPUS)

print("✅ Vector store initialized!")
print(f"   Documents indexed: {len(FAQ_CORPUS)}")
print(f"   Embedding shape: {vector_store.embeddings.shape}")

---

## 🔍 Step 6: Test Retrieval

In [None]:
# Test retrieval with sample questions
test_questions = [
    "Can I get a refund?",
    "How fast is delivery?",
    "Can I pay with PayPal?",
    "My package is broken",
]

print("🔍 Testing Retrieval...")
print("=" * 70)

for question in test_questions:
    print(f"\nQuestion: {question}")
    print("-" * 40)
    
    results = vector_store.search(question, top_k=3)
    
    for i, (doc, score) in enumerate(results, 1):
        print(f"  {i}. [{doc['id']}] Score: {score:.3f}")
        print(f"     Q: {doc['question'][:50]}...")
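
Because `search` always returns `top_k` documents, even a query that matches nothing comes back with results. One common mitigation, sketched here as an assumption rather than as part of the notebook, is to drop results below a minimum similarity so that the generator can fall back to its "no information" reply. `filter_by_score` and the 0.2 cutoff are hypothetical; a suitable threshold would need tuning on real queries.

```python
from typing import Any, Dict, List, Tuple

SearchResult = Tuple[Dict[str, Any], float]


def filter_by_score(results: List[SearchResult], min_score: float = 0.2) -> List[SearchResult]:
    """Keep only (document, score) pairs whose score reaches min_score."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Results shaped like the output of SimpleVectorStore.search (scores are made up).
results = [
    ({"id": "faq_002"}, 0.71),
    ({"id": "faq_006"}, 0.35),
    ({"id": "faq_008"}, 0.00),
]
kept = filter_by_score(results)
print([doc["id"] for doc, _ in kept])

# A query with no match at all leaves nothing for the generator to use.
print(filter_by_score([({"id": "faq_008"}, 0.0)]))
```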

---

## 🤖 Step 7: Implement Mock Generator (tinyGPT)

In [None]:
class MockTinyGPT:
    """
    Mock generator that simulates tinyGPT behavior.
    
    For demonstration, uses template-based generation
    based on retrieved context.
    """
    
    def __init__(self):
        """Initialize the mock generator."""
        self.templates = [
            "Based on our FAQ: {context}",
            "According to our policies: {context}",
            "Here's what you need to know: {context}",
        ]
    
    def generate(
        self,
        question: str,
        context_list: List[str],
    ) -> str:
        """
        Generate an answer given question and context.
        
        Args:
            question: Customer question
            context_list: List of retrieved context snippets
            
        Returns:
            Generated answer
        """
        if not context_list:
            return "I don't have information about that topic. Please contact customer support."
        
        # Use the first (most relevant) context
        main_context = context_list[0]
        
        # Select a template deterministically from the question length
        template_idx = len(question) % len(self.templates)
        template = self.templates[template_idx]
        
        return template.format(context=main_context)


# Create generator
mock_generator = MockTinyGPT()


def generator_fn(question: str, context_list: List[str]) -> str:
    """Wrapper function for the generator."""
    return mock_generator.generate(question, context_list)


print("✅ Mock tinyGPT generator created!")
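
The mock generator only echoes the top snippet. When a real model is swapped in, the retrieved snippets are typically formatted into a prompt that instructs the model to answer from them alone. The function below is a sketch of one such format. `build_rag_prompt`, its wording, and the `max_snippets` parameter are assumptions for illustration, not the prompt used by tinyGPT.

```python
from typing import List


def build_rag_prompt(question: str, context_list: List[str], max_snippets: int = 3) -> str:
    """Format retrieved snippets and the question into a single prompt string."""
    snippets = context_list[:max_snippets]
    if snippets:
        # Number the snippets so an answer could refer back to them
        context_block = "\n".join(f"[{i}] {text}" for i, text in enumerate(snippets, 1))
    else:
        context_block = "(no relevant context found)"
    return (
        "Answer the customer's question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_rag_prompt(
    "Can I pay with PayPal?",
    ["We accept Visa, Mastercard, American Express, PayPal, and Apple Pay."],
))
```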

---

## 🧪 Step 8: Implement RAG Evaluator

In [None]:
class RAGEvaluator:
    """
    Evaluator for Retrieval-Augmented Generation systems.
    
    Evaluates:
    1. Retrieval quality (precision, recall)
    2. Answer groundedness (string match and LLM judge)
    3. Overall answer quality
    """
    
    def __init__(
        self,
        vector_store: SimpleVectorStore,
        generator_fn: Callable[[str, List[str]], str],
    ):
        """
        Initialize the RAGEvaluator.
        
        Args:
            vector_store: Vector store for retrieval
            generator_fn: Function that takes (question, context_list) and returns answer
        """
        self.vector_store = vector_store
        self.generator_fn = generator_fn
    
    def retrieve(
        self,
        question: str,
        top_k: int = 3,
    ) -> List[Dict[str, Any]]:
        """
        Retrieve relevant documents for a question.
        
        Args:
            question: Customer question
            top_k: Number of documents to retrieve
            
        Returns:
            List of retrieved documents with scores
        """
        results = self.vector_store.search(question, top_k=top_k)
        return [{"document": doc, "score": score} for doc, score in results]
    
    def generate_answer(
        self,
        question: str,
        retrieved_docs: List[Dict[str, Any]],
    ) -> str:
        """
        Generate an answer using retrieved context.
        
        Args:
            question: Customer question
            retrieved_docs: Retrieved documents
            
        Returns:
            Generated answer
        """
        context_list = [
            doc["document"].get("answer", doc["document"].get("text", ""))
            for doc in retrieved_docs
        ]
        return self.generator_fn(question, context_list)
    
    def evaluate_retrieval(
        self,
        retrieved_ids: List[str],
        expected_ids: List[str],
    ) -> Dict[str, float]:
        """
        Evaluate retrieval quality.
        
        Args:
            retrieved_ids: IDs of retrieved documents
            expected_ids: IDs of expected relevant documents
            
        Returns:
            Dictionary with precision, recall, and hit_rate
        """
        retrieved_set = set(retrieved_ids)
        expected_set = set(expected_ids)
        
        if len(retrieved_ids) == 0:
            return {"precision": 0.0, "recall": 0.0, "hit_rate": 0.0}
        
        hits = len(retrieved_set & expected_set)
        
        # Divide by unique counts so duplicate IDs cannot skew the ratios
        precision = hits / len(retrieved_set)
        recall = hits / len(expected_set) if expected_set else 0.0
        hit_rate = 1.0 if hits > 0 else 0.0
        
        return {
            "precision": precision,
            "recall": recall,
            "hit_rate": hit_rate,
        }
    
    def evaluate_groundedness_string_match(
        self,
        answer: str,
        expected_phrases: List[str],
    ) -> Dict[str, Any]:
        """
        Evaluate answer groundedness using string matching.
        
        Args:
            answer: Generated answer
            expected_phrases: Phrases that should appear in the answer
            
        Returns:
            Dictionary with groundedness score and matched phrases
        """
        answer_lower = answer.lower()
        matched = []
        unmatched = []
        
        for phrase in expected_phrases:
            if phrase.lower() in answer_lower:
                matched.append(phrase)
            else:
                unmatched.append(phrase)
        
        score = len(matched) / len(expected_phrases) if expected_phrases else 0.0
        
        return {
            "groundedness_score": score,
            "matched_phrases": matched,
            "unmatched_phrases": unmatched,
            "total_expected": len(expected_phrases),
        }
    
    def evaluate_groundedness_llm_judge(
        self,
        question: str,
        answer: str,
        context: List[str],
        judge_fn: Optional[Callable[[str], str]] = None,
    ) -> Dict[str, Any]:
        """
        Evaluate answer groundedness using LLM-as-Judge.
        
        Args:
            question: Original question
            answer: Generated answer
            context: Retrieved context snippets
            judge_fn: Function that takes a prompt and returns judgment
            
        Returns:
            Dictionary with groundedness assessment
        """
        if judge_fn is None:
            return {
                "grounded": None,
                "explanation": "No judge function provided",
                "hallucination_detected": None,
            }
        
        context_text = "\n".join([f"- {c}" for c in context])
        prompt = f"""Evaluate if this answer is grounded in the provided context.

Context:
{context_text}

Question: {question}
Answer: {answer}

Is the answer grounded in the context? (yes/no)
"""
        
        judgment = judge_fn(prompt)
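        # Read only the first word of the reply; substring checks would also
        # match "no" inside words such as "know" or "another"
        words = judgment.strip().lower().split()
        first_word = words[0].strip(".,:;!") if words else ""
        grounded = first_word == "yes"
        hallucination = first_word in ("no", "not")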
        
        return {
            "grounded": grounded,
            "hallucination_detected": hallucination,
            "raw_judgment": judgment,
        }
    
    def run_evaluation(
        self,
        question: str,
        expected_faq_ids: List[str],
        expected_answer_contains: List[str],
        top_k: int = 3,
        judge_fn: Optional[Callable[[str], str]] = None,
    ) -> Dict[str, Any]:
        """
        Run a complete RAG evaluation on a single question.
        
        Args:
            question: Customer question
            expected_faq_ids: IDs of FAQs that should be retrieved
            expected_answer_contains: Phrases that should appear in answer
            top_k: Number of documents to retrieve
            judge_fn: Optional LLM judge function
            
        Returns:
            Complete evaluation results
        """
        # Step 1: Retrieve
        retrieved = self.retrieve(question, top_k=top_k)
        retrieved_ids = [r["document"].get("id", "") for r in retrieved]
        
        # Step 2: Generate answer
        answer = self.generate_answer(question, retrieved)
        
        # Step 3: Evaluate retrieval
        retrieval_metrics = self.evaluate_retrieval(retrieved_ids, expected_faq_ids)
        
        # Step 4: Evaluate groundedness (string match)
        groundedness_string = self.evaluate_groundedness_string_match(
            answer, expected_answer_contains
        )
        
        # Step 5: Evaluate groundedness (LLM judge) if provided,
        # using the same context text the generator received
        context = [
            r["document"].get("answer", r["document"].get("text", ""))
            for r in retrieved
        ]
        groundedness_llm = self.evaluate_groundedness_llm_judge(
            question, answer, context, judge_fn
        )
        
        return {
            "question": question,
            "retrieved_docs": retrieved,
            "retrieved_ids": retrieved_ids,
            "answer": answer,
            "retrieval_metrics": retrieval_metrics,
            "groundedness_string": groundedness_string,
            "groundedness_llm": groundedness_llm,
        }
    
    def compute_aggregate_metrics(
        self,
        results: List[Dict[str, Any]],
    ) -> Dict[str, float]:
        """
        Compute aggregate metrics across multiple evaluations.
        
        Args:
            results: List of evaluation results
            
        Returns:
            Dictionary with aggregate metrics
        """
        if not results:
            return {}
        
        avg_precision = np.mean([r["retrieval_metrics"]["precision"] for r in results])
        avg_recall = np.mean([r["retrieval_metrics"]["recall"] for r in results])
        avg_hit_rate = np.mean([r["retrieval_metrics"]["hit_rate"] for r in results])
        avg_groundedness = np.mean([
            r["groundedness_string"]["groundedness_score"] for r in results
        ])
        
        return {
            "avg_precision": avg_precision,
            "avg_recall": avg_recall,
            "avg_hit_rate": avg_hit_rate,
            "avg_groundedness_score": avg_groundedness,
            "total_evaluated": len(results),
        }


print("✅ RAGEvaluator class defined!")
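
To make the retrieval metrics concrete, here is the arithmetic for a single query, written out in plain Python. The IDs are illustrative. With `top_k=3` and a single expected FAQ, precision can never exceed 1/3, which is worth keeping in mind when reading the aggregate numbers later.

```python
retrieved_ids = ["faq_005", "faq_001", "faq_007"]  # top-3 results (illustrative)
expected_ids = ["faq_001"]                          # the one relevant FAQ

hits = len(set(retrieved_ids) & set(expected_ids))  # relevant documents retrieved

precision = hits / len(retrieved_ids)  # share of retrieved docs that are relevant
recall = hits / len(expected_ids)      # share of relevant docs that were retrieved
hit_rate = 1.0 if hits > 0 else 0.0    # did at least one relevant doc show up?

print(f"precision={precision:.2f} recall={recall:.2f} hit_rate={hit_rate:.0f}")
```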

---

## 📋 Step 9: Define Test Cases

In [None]:
# Test cases for RAG evaluation
TEST_CASES = [
    {
        "question": "Can I get a refund if I don't like the product?",
        "expected_faq_ids": ["faq_001"],
        "expected_answer_contains": ["30 days", "refund"],
        "category": "returns",
    },
    {
        "question": "How fast is delivery?",
        "expected_faq_ids": ["faq_002"],
        "expected_answer_contains": ["5-7 business days", "express"],
        "category": "shipping",
    },
    {
        "question": "Where can I see my order status?",
        "expected_faq_ids": ["faq_003"],
        "expected_answer_contains": ["Order History", "account"],
        "category": "orders",
    },
    {
        "question": "Can I pay with PayPal?",
        "expected_faq_ids": ["faq_004"],
        "expected_answer_contains": ["PayPal"],
        "category": "payment",
    },
    {
        "question": "My package arrived broken, what do I do?",
        "expected_faq_ids": ["faq_007"],
        "expected_answer_contains": ["48 hours", "replacement", "photos"],
        "category": "returns",
    },
    {
        "question": "Do you ship to Canada?",
        "expected_faq_ids": ["faq_006"],
        "expected_answer_contains": ["international", "50 countries"],
        "category": "shipping",
    },
    {
        "question": "I want to cancel my order",
        "expected_faq_ids": ["faq_005"],
        "expected_answer_contains": ["1 hour", "cancel"],
        "category": "orders",
    },
]

print(f"📋 Defined {len(TEST_CASES)} test cases:")
print("")
for tc in TEST_CASES:
    print(f"  [{tc['category']}] {tc['question'][:40]}...")
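
String-match groundedness only makes sense if each expected phrase actually occurs in the FAQ that the test case points to; otherwise even a perfect system would fail. The check below is a hypothetical sanity test (`check_test_cases` is not part of the notebook), shown on a one-entry corpus so that it runs on its own. The same function could be called with `FAQ_CORPUS` and `TEST_CASES`.

```python
from typing import Any, Dict, List


def check_test_cases(corpus: List[Dict[str, Any]], test_cases: List[Dict[str, Any]]) -> List[str]:
    """Report expected IDs missing from the corpus and phrases absent from the expected answers."""
    answers = {entry["id"]: entry["answer"].lower() for entry in corpus}
    problems = []
    for case in test_cases:
        known_ids = []
        for faq_id in case["expected_faq_ids"]:
            if faq_id in answers:
                known_ids.append(faq_id)
            else:
                problems.append(f"{case['question']!r}: unknown id {faq_id}")
        # Phrases are matched case-insensitively, as in the string-match evaluator
        combined = " ".join(answers[faq_id] for faq_id in known_ids)
        for phrase in case["expected_answer_contains"]:
            if phrase.lower() not in combined:
                problems.append(f"{case['question']!r}: phrase {phrase!r} not in expected FAQ")
    return problems


corpus = [{"id": "faq_004", "answer": "We accept Visa, PayPal, and Apple Pay."}]
cases = [
    {"question": "Can I pay with PayPal?", "expected_faq_ids": ["faq_004"],
     "expected_answer_contains": ["PayPal"]},
    {"question": "Do you take checks?", "expected_faq_ids": ["faq_099"],
     "expected_answer_contains": ["check"]},
]
for problem in check_test_cases(corpus, cases):
    print(problem)
```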

---

## 🏃 Step 10: Run Full RAG Evaluation

In [None]:
# Create evaluator
rag_evaluator = RAGEvaluator(
    vector_store=vector_store,
    generator_fn=generator_fn,
)

# Run evaluation
print("🔄 Running Full RAG Evaluation...")
print("=" * 70)

all_results = []
for tc in TEST_CASES:
    result = rag_evaluator.run_evaluation(
        question=tc["question"],
        expected_faq_ids=tc["expected_faq_ids"],
        expected_answer_contains=tc["expected_answer_contains"],
    )
    all_results.append(result)
    
    # Display results
    hit_status = "✅" if result["retrieval_metrics"]["hit_rate"] > 0 else "❌"
    # Use >= 0.5 so the pass mark agrees with the "< 0.5" failure rule in Step 15
    ground_status = "✅" if result["groundedness_string"]["groundedness_score"] >= 0.5 else "❌"
    
    print(f"\n{'='*60}")
    print(f"Question: {tc['question']}")
    print(f"Category: {tc['category']}")
    print(f"{'='*60}")
    print(f"Answer: {result['answer'][:80]}...")
    print(f"")
    print(f"Retrieval: {hit_status}")
    print(f"   Retrieved: {result['retrieved_ids'][:3]}")
    print(f"   Expected: {tc['expected_faq_ids']}")
    print(f"   Hit Rate: {result['retrieval_metrics']['hit_rate']:.0%}")
    print(f"")
    print(f"Groundedness: {ground_status}")
    print(f"   Score: {result['groundedness_string']['groundedness_score']:.0%}")
    print(f"   Matched: {result['groundedness_string']['matched_phrases']}")
    print(f"   Unmatched: {result['groundedness_string']['unmatched_phrases']}")

---

## 📊 Step 11: Compute and Display Aggregate Metrics

In [None]:
# Compute aggregate metrics
metrics = rag_evaluator.compute_aggregate_metrics(all_results)

print("📊 Aggregate RAG Evaluation Metrics")
print("=" * 70)
print(f"")
print(f"Total Test Cases: {metrics['total_evaluated']}")
print(f"")
print(f"Retrieval Metrics:")
print(f"   Average Precision: {metrics['avg_precision']:.0%}")
print(f"   Average Recall: {metrics['avg_recall']:.0%}")
print(f"   Average Hit Rate: {metrics['avg_hit_rate']:.0%}")
print(f"")
print(f"Groundedness Metrics:")
print(f"   Average Groundedness Score: {metrics['avg_groundedness_score']:.0%}")

---

## 📋 Step 12: Generate Summary Table

In [None]:
print("📋 Evaluation Summary Table")
print("=" * 100)
print(f"{'#':<3} {'Question':<35} {'Category':<10} {'Retrieval':<12} {'Groundedness':<12}")
print("-" * 100)

for i, (tc, r) in enumerate(zip(TEST_CASES, all_results), 1):
    retrieval_status = "✅ Hit" if r["retrieval_metrics"]["hit_rate"] > 0 else "❌ Miss"
    groundedness_pct = r["groundedness_string"]["groundedness_score"]
    groundedness_status = f"{groundedness_pct:.0%}"
    
    question_short = tc["question"][:33] + ".." if len(tc["question"]) > 35 else tc["question"]
    
    print(f"{i:<3} {question_short:<35} {tc['category']:<10} {retrieval_status:<12} {groundedness_status:<12}")

print("-" * 100)
print(f"")
print(f"Summary: {metrics['avg_hit_rate']:.0%} hit rate, {metrics['avg_groundedness_score']:.0%} groundedness")

---

## 🔍 Step 13: Analyze Results by Category

In [None]:
# Analyze results by category
print("Results by Category")
print("=" * 70)

category_results = {}
for tc, r in zip(TEST_CASES, all_results):
    cat = tc["category"]
    if cat not in category_results:
        category_results[cat] = []
    category_results[cat].append(r)

for cat, results in sorted(category_results.items()):
    hit_rate = np.mean([r["retrieval_metrics"]["hit_rate"] for r in results])
    groundedness = np.mean([r["groundedness_string"]["groundedness_score"] for r in results])
    
    print(f"\n{cat.upper()} ({len(results)} questions):")
    print(f"   Hit Rate: {hit_rate:.0%}")
    print(f"   Groundedness: {groundedness:.0%}")

---

## 🧪 Step 14: Optional LLM-as-Judge Evaluation

In [None]:
# Mock LLM judge function (for demonstration)
def mock_llm_judge(prompt: str) -> str:
    """
    Mock LLM judge that answers "yes" when the prompt mentions "policies" or "faq".
    
    In production, use an actual LLM (GPT-4, Claude, etc.)
    """
    # Crude heuristic: the check runs over the whole prompt (context, question,
    # and answer), so it mostly detects the generator's answer templates
    if "policies" in prompt.lower() or "faq" in prompt.lower():
        return "yes, the answer appears to be grounded in the provided context"
    return "no, the answer may contain unsupported claims"


print("🧪 LLM-as-Judge Evaluation (Mock)")
print("=" * 70)
print("")
print("Note: Using mock judge for demonstration.")
print("      In production, use GPT-4, Claude, or similar.")
print("")

# Run evaluation with judge on a subset
for tc in TEST_CASES[:3]:
    result = rag_evaluator.run_evaluation(
        question=tc["question"],
        expected_faq_ids=tc["expected_faq_ids"],
        expected_answer_contains=tc["expected_answer_contains"],
        judge_fn=mock_llm_judge,
    )
    
    print(f"Question: {tc['question'][:40]}...")
    print(f"   String Match Score: {result['groundedness_string']['groundedness_score']:.0%}")
    print(f"   LLM Judge Grounded: {result['groundedness_llm']['grounded']}")
    print("")
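
Free-text judge replies are easy to misread with substring checks, since "no" also occurs inside words such as "not" or "know". A more defensive approach, sketched below under the assumption that the judge is asked to begin its reply with "yes" or "no", is to parse only the leading word and treat anything else as undecided. `parse_verdict` is a hypothetical helper, not part of `RAGEvaluator`.

```python
import re
from typing import Optional


def parse_verdict(judgment: str) -> Optional[bool]:
    """Return True for a leading 'yes', False for a leading 'no', None otherwise."""
    match = re.match(r"\s*(yes|no)\b", judgment, flags=re.IGNORECASE)
    if match is None:
        return None  # the judge did not follow the requested format
    return match.group(1).lower() == "yes"


print(parse_verdict("Yes, the answer is grounded in the context."))
print(parse_verdict("no - the refund window is not mentioned"))
print(parse_verdict("I do not know."))
```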

---

## 🔍 Step 15: Failure Analysis

In [None]:
print("🔍 Failure Analysis")
print("=" * 70)

# Identify retrieval failures
retrieval_failures = [
    (tc, r) for tc, r in zip(TEST_CASES, all_results)
    if r["retrieval_metrics"]["hit_rate"] == 0
]

# Identify groundedness failures
groundedness_failures = [
    (tc, r) for tc, r in zip(TEST_CASES, all_results)
    if r["groundedness_string"]["groundedness_score"] < 0.5
]

print(f"")
print(f"Retrieval Failures: {len(retrieval_failures)} / {len(TEST_CASES)}")
if retrieval_failures:
    for tc, r in retrieval_failures:
        print(f"   ❌ {tc['question'][:40]}...")
        print(f"      Expected: {tc['expected_faq_ids']}")
        print(f"      Got: {r['retrieved_ids'][:3]}")

print(f"")
print(f"Groundedness Failures: {len(groundedness_failures)} / {len(TEST_CASES)}")
if groundedness_failures:
    for tc, r in groundedness_failures:
        print(f"   ⚠️ {tc['question'][:40]}...")
        print(f"      Missing phrases: {r['groundedness_string']['unmatched_phrases']}")
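
Retrieval and groundedness failures are related: if the right FAQ never reaches the generator, a low groundedness score is a symptom rather than a separate problem. The sketch below separates the two cases. `classify_failure` is a hypothetical helper, its 0.5 threshold is an assumption chosen to match the cutoff used in this step, and its input is a dict shaped like the output of `run_evaluation`.

```python
from typing import Any, Dict


def classify_failure(result: Dict[str, Any], threshold: float = 0.5) -> str:
    """Label a single evaluation result as ok, a retrieval failure, or a generation failure."""
    hit = result["retrieval_metrics"]["hit_rate"] > 0
    grounded = result["groundedness_string"]["groundedness_score"] >= threshold
    if not hit:
        return "retrieval_failure"   # the expected FAQ never reached the generator
    if not grounded:
        return "generation_failure"  # expected FAQ retrieved, but the answer missed key phrases
    return "ok"


example = {
    "retrieval_metrics": {"hit_rate": 1.0},
    "groundedness_string": {"groundedness_score": 0.0},
}
print(classify_failure(example))
```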

---

## 🧪 Step 16: Test Custom Queries

In [None]:
# Test with custom queries
print("🧪 Test Custom Queries")
print("=" * 70)

custom_queries = [
    "What's your refund policy?",
    "How do I get my money back?",
    "Is shipping free?",
    "What credit cards do you take?",
]

for query in custom_queries:
    # Retrieve
    results = vector_store.search(query, top_k=1)
    top_doc, score = results[0] if results else (None, 0)
    
    # Generate
    context = [top_doc["answer"]] if top_doc else []
    answer = generator_fn(query, context)
    
    print(f"\nQuery: {query}")
    print(f"   Top Match: {top_doc['id'] if top_doc else 'None'} (score: {score:.3f})")
    print(f"   Answer: {answer[:70]}...")

---

## 📚 Summary

In this notebook, you learned how to:

1. **Create a FAQ corpus** for customer service use cases
2. **Implement a simple vector store** for similarity-based retrieval
3. **Build a mock generator** (simulating tinyGPT)
4. **Evaluate retrieval quality** using precision, recall, and hit rate
5. **Evaluate groundedness** using string matching and LLM-as-Judge
6. **Analyze failures** to identify areas for improvement

### Key Takeaways

1. RAG combines retrieval and generation for grounded answers
2. Retrieval quality directly impacts answer quality
3. Groundedness can be evaluated with string matching or LLM judges
4. Simple bag-of-words embeddings work for demonstrations but production systems need better embeddings

### Next Steps

1. **Use FAISS** for efficient similarity search at scale
2. **Use sentence-transformers** for better semantic embeddings
3. **Integrate a real LLM** (tinyGPT ONNX) for generation
4. **Add citation tracking** to show which FAQ the answer is based on
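
As a starting point for the citation-tracking idea, the generator's output can be paired with the `id` of the document it drew on. This is a sketch only: `answer_with_citation` is a hypothetical function, and it assumes, like `MockTinyGPT`, that the answer is based on the top-ranked document.

```python
from typing import Any, Dict, List, Tuple


def answer_with_citation(results: List[Tuple[Dict[str, Any], float]]) -> Dict[str, Any]:
    """Build an answer from the top-ranked (document, score) pair and record its source id."""
    if not results:
        return {"answer": "I don't have information about that topic.", "source_id": None}
    top_doc, score = results[0]
    return {
        "answer": f"{top_doc['answer']} [source: {top_doc['id']}]",
        "source_id": top_doc["id"],
        "score": score,
    }


# Input shaped like the output of SimpleVectorStore.search (the score is made up).
results = [({"id": "faq_002", "answer": "Standard shipping takes 5-7 business days."}, 0.71)]
print(answer_with_citation(results)["answer"])
```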

---

## ✔ Knowledge Mastery Checklist

Before moving to Week 16, ensure you can check all boxes:

- [ ] I understand what RAG is and why it's useful for customer service
- [ ] I can create a FAQ corpus and explain its structure
- [ ] I can implement a simple vector store for similarity-based retrieval
- [ ] I understand how to evaluate retrieval quality (precision, recall, hit rate)
- [ ] I can evaluate answer groundedness using string matching
- [ ] I understand how to use an LLM as a judge for groundedness
- [ ] I can identify common RAG failure modes
- [ ] I understand the trade-off between groundedness and helpfulness

---

**Week 15 Complete!**

*Next: Week 16 (Marketing & Content Use Cases)*