# Testing RAG Applications 📑

## What is RAG (Retrieval-Augmented Generation)?

**RAG** is a method that improves Large Language Models (LLMs) by combining them with external knowledge. It allows the model to access and use up-to-date, specific information from external data sources.

### When to Use RAG:

For example, you can use RAG to load course material so that students can ask questions about it, or to give an e-commerce assistant access to your product information. Typical cases include:

- **Private/Proprietary Data**: When you need to query company documents, internal knowledge bases
- **Up-to-date Information**: When the LLM's training data is outdated
- **Domain-Specific Knowledge**: For specialized fields not well-represented in training data
- **Reducing Hallucinations**: By grounding responses in actual documents
- **Cost Efficiency**: Cheaper than fine-tuning for many use cases


### Key Components of RAG:

1. **Document Store**: A collection of documents containing the knowledge you want to query, for example a set of documents or a code repository
2. **Embeddings**: Vector representations of text chunks that capture semantic meaning
3. **Vector Store**: A database that stores embeddings and enables similarity search
4. **Retriever**: Finds the most relevant documents based on the query
5. **Generator**: The LLM that generates answers using retrieved context

### How RAG Works:

1. The information to be added is broken down into chunks, and each chunk is converted to a vector
2. The vectors are stored in a database
3. The user submits a prompt
4. The system transforms the prompt into a numerical format (a vector)
5. The system uses that vector to search the knowledge base for the most similar chunks
6. The retrieved information is passed to the language model together with the prompt
7. The model generates the answer, which is returned to the user
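
The flow above can be sketched with the standard library only. This is a toy illustration under loud assumptions: `embed` is a hypothetical bag-of-words counter standing in for a real embedding model, the "database" is a plain list, and the final call to the language model is left out.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Chunk the knowledge and store one vector per chunk
chunks = [
    "MCP is an open protocol for connecting LLMs to tools",
    "FAISS is a library for vector similarity search",
    "Bananas are rich in potassium",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# Embed the user's prompt and search the store for the closest chunk
question = "What protocol connects LLMs to tools"
query_vec = embed(question)
ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
best_chunk = ranked[0][0]

# The best chunk(s) would now be passed to the LLM together with the question
print(best_chunk)
```

Real embeddings capture meaning rather than exact word overlap, so a production retriever can match a question to a chunk even when they share no words.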

## RAG Application Example

In this example, the RAG application adds information about MCP (Model Context Protocol) to the model. Afterwards, we can check that it returns the correct answer and context. The application:

1. **Loads data** from the website https://www.descope.com/learn/post/mcp
2. **Chunks the text** into smaller, manageable chunks
3. **Creates embeddings** (numerical vector representations of the text that capture its semantic meaning) and stores them in a vector database
4. **Retrieves relevant chunks** when answering questions
5. **Generates answers** using the retrieved context

This approach allows us to answer questions about MCP even though the base LLM might not have been trained on this specific information.

### Setup: 

We will use:

- **LangChain**: A framework for developing applications powered by language models
- **LangSmith**: LangChain's platform for debugging, testing, and monitoring LLM applications
- **OpenAI**: The model that will receive the new information about MCP
- **DeepEval**: To test the RAG application
- **Ollama**: To run free models on your own computer

### API Keys and Environment Variables

Before we start, we need to set up our API keys. This notebook uses:
- **OpenAI API Key**: Required for embeddings and LLM generation
- **Confident AI API Key**: Required by the code below to log in to DeepEval
- **LangSmith API Key**: Optional, but recommended for debugging and tracing

You can get a free LangSmith API key at [https://smith.langchain.com/](https://smith.langchain.com/)

Add these API keys to a `.env` file or set them as environment variables.

In [12]:
import os
from getpass import getpass
import deepeval

# Load environment variables from .env file
try:
    from dotenv import load_dotenv
    load_dotenv()
    print("✅ Loaded environment variables from .env file")
except ImportError:
    print("⚠️ python-dotenv not installed. Install with: pip install python-dotenv")
    print("Will use manual input for API keys...")

# Set up OpenAI API Key (Required)
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
else:
    print("✅ OpenAI API key loaded from environment")

# Set up Confident API Key (Required for DeepEval)
if "CONFIDENT_API_KEY" not in os.environ:
    os.environ["CONFIDENT_API_KEY"] = getpass("Enter your Confident API key: ")
else:
    print("✅ Confident API key loaded from environment")

# Set up LangSmith API Key (Optional - suppresses warning)
if "LANGSMITH_API_KEY" not in os.environ:
    os.environ["LANGSMITH_API_KEY"] = "not-needed"  # Prevents the warning
else:
    print("✅ LangSmith API key loaded from environment")

# Disable Confident AI verbose trace logging to avoid SSL certificate errors
os.environ["CONFIDENT_TRACE_VERBOSE"] = "NO"

api_key = os.getenv("CONFIDENT_API_KEY")
deepeval.login(api_key)

✅ Loaded environment variables from .env file
✅ OpenAI API key loaded from environment
✅ Confident API key loaded from environment
✅ LangSmith API key loaded from environment


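Install the Python packages imported in this notebook before running it. They include `langchain`, `langchain-community`, `langchain-openai`, `langchain-ollama`, `faiss-cpu`, `beautifulsoup4`, `python-dotenv`, and `deepeval`.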

### Step 1: Load and Process Documents

The first step in building a RAG application is to load and process your documents. This involves:

1. **Loading documents** from various sources (web, PDFs, databases, etc.)
2. **Splitting text** into smaller chunks for better retrieval
3. **Creating embeddings** (vector representations) of each chunk
4. **Storing embeddings** in a vector database for fast similarity search

In [13]:
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load documents from the web
loader = WebBaseLoader("https://www.descope.com/learn/post/mcp")
docs = loader.load()

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(docs)

# Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embeddings)

print(f"Loaded {len(docs)} documents")
print(f"Split into {len(splits)} chunks")
print(f"Created vector store with {vectorstore.index.ntotal} vectors")

Loaded 1 documents
Split into 24 chunks
Created vector store with 24 vectors
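
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter. It is only a sketch: `split_with_overlap` is a hypothetical helper, and the real `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences instead of fixed character offsets.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.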


### Step 2: Create RAG Chain

Now we'll create a RAG chain that combines retrieval with generation. The chain will:
1. Take a user question
2. Retrieve relevant documents from our vector store
3. Format those documents as context
4. Pass both context and question to the LLM
5. Generate an answer based on the retrieved context
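
The five steps above amount to filling a prompt template with retrieved text. The sketch below shows that shape without any framework. Everything in it is an assumption for illustration: `TEMPLATE` is a hypothetical prompt, and `fake_retrieve` and `fake_llm` are stand-ins for the vector store and the model.

```python
from typing import Callable

# Hypothetical template, not the prompt used by this notebook
TEMPLATE = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say you don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def answer(question: str,
           retrieve: Callable[[str], list[str]],
           llm: Callable[[str], str]) -> dict:
    """Retrieve chunks, stuff them into the prompt, and call the model."""
    chunks = retrieve(question)                                   # step 2: retrieve
    context = "\n\n".join(chunks)                                 # step 3: format as context
    prompt = TEMPLATE.format(context=context, question=question)  # step 4: build the prompt
    return {"question": question, "context": chunks, "answer": llm(prompt)}  # step 5

# Stand-ins so the sketch runs without a vector store or a model
fake_retrieve = lambda q: ["MCP is an open protocol released by Anthropic."]
fake_llm = lambda prompt: f"(model reply to a {len(prompt)}-character prompt)"

result = answer("What is MCP?", fake_retrieve, fake_llm)
print(result["answer"])
```

Passing the retriever and the model in as plain callables keeps the sketch independent of any library. LangChain's chains wrap the same idea in reusable components.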

#### Using LangChain Hub Prompts

LangChain Hub provides pre-built, tested prompts for common use cases. The `rlm/rag-prompt` is a popular RAG prompt that instructs the model to:
- Answer based on the provided context
- Say "I don't know" if the answer isn't in the context
- Keep answers concise (3 sentences max)

In the next example we will use a free local model served by Ollama (`qwen3:latest` in the code below).

The temperature controls the randomness and creativity of the model's outputs. With a low temperature (below 0.2), the model will almost always return the same words. With a higher value, for example 0.8, it makes more random choices and returns more varied results.

The maximum length of the model's response is set with `max_tokens` (called `num_predict` in `ChatOllama`).

RAG applications generally use a low temperature (0 to 0.3) for accuracy.
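
The effect of temperature can be illustrated with a softmax over a few candidate-token scores. This is a sketch, not how any particular provider implements sampling: the `logits` values are made up, and the function needs a temperature greater than zero.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.8):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature nearly all of the probability lands on the top-scoring token, which is why low values give more repeatable answers.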


In [14]:
from langchain import hub
from langchain.chains import RetrievalQA
from langchain_ollama import ChatOllama

llm = ChatOllama(
    base_url="http://localhost:11434",
    model="qwen3:latest",
    temperature=0.3,
    num_predict=500  # ChatOllama's response-length limit (max_tokens in OpenAI models)
)

# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")

qa_chain = RetrievalQA.from_llm(
    llm, retriever=vectorstore.as_retriever(), prompt=prompt
)

query = "What is Model Context Protocol?"
result = qa_chain.invoke({"query": query})

### RAG that returns the context

LangChain provides multiple ways to create RAG chains. You can use `create_retrieval_chain` to return both the answer and the context (the documents that were retrieved). This approach:

1. **Automatically handles retrieval**: The chain manages document retrieval internally
2. **Returns structured output**: Includes both the answer and the retrieved context
3. **Uses a different prompt template**: The `langchain-ai/retrieval-qa-chat` prompt is optimized for conversational interactions

This approach is useful when you want:
- More control over the retrieval process
- Access to both the answer and the source documents
- A more conversational interaction style

This example uses an OpenAI model, which is expected to return better results than the local model.

In [15]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4", 
    temperature=0.3,
    max_tokens=500
)

# See full prompt at https://smith.langchain.com/hub/langchain-ai/retrieval-qa-chat
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)
rag_chain = create_retrieval_chain(vectorstore.as_retriever(), combine_docs_chain)

rag_chain.invoke({"input": "What is Model Context Protocol?"})

{'input': 'What is Model Context Protocol?',
 'context': [Document(id='8cf85fec-cffc-4b76-9930-90142262458a', metadata={'source': 'https://www.descope.com/learn/post/mcp', 'title': 'What Is the Model Context Protocol (MCP) and How It Works', 'description': 'Learn more about MCP, the open source protocol developed by Anthropic to provide LLMs and AI agents a standardized way to connect with external data sources and tools.', 'language': 'en'}, page_content='changed how we interact with information and technology. They can write eloquently, perform deep research, and solve increasingly complex problems. But while typical models excel at responding to natural language, they’ve been constrained by their isolation from real-world data and systems.\xa0The Model Context Protocol (MCP) addresses this challenge by providing a standardized way for LLMs to connect with external data sources and tools—essentially a “universal remote” for AI apps. Released by Anthropic as an open-source protocol, M

### Using RetrievalQA Chain

The `RetrievalQA` class, which we used for the first chain in Step 2, is another way to create a RAG chain. This is a more traditional approach that:

1. **Simplifies chain creation**: Combines retrieval and QA in a single class
2. **Uses the same prompt**: We can reuse the `rlm/rag-prompt` from LangChain Hub
3. **Returns structured output**: The response includes both the query and the result

Key differences:
- Uses `query` as the input key instead of `question` or `input`
- Returns a dictionary with `query` and `result` keys
- Handles the retrieval and formatting internally

This approach is ideal when you want a simple, straightforward RAG implementation without complex chain composition.

### Understanding the Context in RAG Chains

In [16]:
# Define the format_docs function
def format_docs(docs):
    """Format a list of documents into a single string for context."""
    return "\n\n".join(doc.page_content for doc in docs)

# When using hub.pull("rlm/rag-prompt"), the prompt expects:
# - context: The retrieved documents formatted as text
# - question: The user's question

# Let's see how the context is passed through the chain
question = "What is Model Context Protocol?"

# Get the retrieved documents
retrieved_docs = vectorstore.as_retriever().invoke(question)
print(f"Number of retrieved documents: {len(retrieved_docs)}")
print("\nRetrieved documents:")
for i, doc in enumerate(retrieved_docs):
    print(f"\nDocument {i+1}:")
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata: {doc.metadata}")

# Format the documents as context
context = format_docs(retrieved_docs)
print(f"\nFormatted context (first 500 chars):\n{context[:500]}...")

# This is what gets passed to the prompt
print(f"\nThe prompt receives:")
print(f"- context: {len(context)} characters of retrieved text")
print(f"- question: {question}")

Number of retrieved documents: 4

Retrieved documents:

Document 1:
Content: changed how we interact with information and technology. They can write eloquently, perform deep research, and solve increasingly complex problems. But while typical models excel at responding to natu...
Metadata: {'source': 'https://www.descope.com/learn/post/mcp', 'title': 'What Is the Model Context Protocol (MCP) and How It Works', 'description': 'Learn more about MCP, the open source protocol developed by Anthropic to provide LLMs and AI agents a standardized way to connect with external data sources and tools.', 'language': 'en'}

Document 2:
Content: What Is the Model Context Protocol (MCP) and How It WorksSkip to main contentArrow RightJoin us for the world's largest MCP hackathon! Let's go >Log InUser CircleProductUse CasesDevelopersCustomersRes...
Metadata: {'source': 'https://www.descope.com/learn/post/mcp', 'title': 'What Is the Model Context Protocol (MCP) and How It Works', 'description': 'Learn m

### Step 3: Testing RAG Application with DeepEval

Now that we have a working RAG application, we need to evaluate its performance. This is crucial because:

1. **RAG Quality Varies**: The quality of answers depends on retrieval accuracy and generation quality
2. **No Ground Truth**: Unlike traditional ML, we often don't have exact "correct" answers
3. **Multiple Dimensions**: We need to evaluate relevance, accuracy, completeness, and more

#### What is DeepEval?

DeepEval is an open-source framework for evaluating LLM applications. It provides:
- Pre-built metrics for common evaluation needs
- Custom metric creation capabilities
- Integration with popular LLM frameworks
- Detailed evaluation reports

#### Creating Test Cases

A test case in DeepEval consists of:
- **Input**: The question asked to the RAG system
- **Actual Output**: What the RAG system returned
- **Expected Output**: What we expect the system to return (optional)
- **Retrieval Context**: The documents retrieved by the RAG system (important for RAG metrics!)
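
As a rough picture of that structure, the dataclass below mirrors the four fields. It is a hypothetical stand-in, not DeepEval's `LLMTestCase`. The `supports_retrieval_metrics` helper is also an assumption, added only to show why the retrieval context matters.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SimpleTestCase:
    """Hypothetical stand-in mirroring the four fields listed above."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None          # optional reference answer
    retrieval_context: Optional[list[str]] = None  # chunks returned by the retriever

    def supports_retrieval_metrics(self) -> bool:
        """Retrieval-based metrics need at least one retrieved chunk."""
        return bool(self.retrieval_context)

case = SimpleTestCase(
    input="What is MCP?",
    actual_output="MCP is an open protocol for connecting LLMs to tools.",
)
print(case.supports_retrieval_metrics())  # False: no retrieval_context was given
```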


In [17]:
from deepeval.test_case import LLMTestCase

question = "What is Model Context Protocol?"
test_case = LLMTestCase(
  input=question,
  actual_output=qa_chain.invoke({"query": question})["result"],
  expected_output="The Model Context Protocol (MCP) addresses this challenge by providing a standardized way for LLMs to connect with external data sources and tools—essentially a “universal remote” for AI apps. Released by Anthropic as an open-source protocol, MCP builds on existing function calling by eliminating the need for custom integration between LLMs and other apps."
)

test_cases = [test_case]

### Defining Evaluation Metrics

DeepEval provides various metrics to evaluate RAG systems:

#### 1. GEval (General Evaluation)
GEval is a flexible metric that uses LLMs to evaluate outputs based on custom criteria. You can define:
- **Name**: A descriptive name for your metric
- **Criteria**: What the LLM should evaluate
- **Evaluation Parameters**: Which parts of the test case to evaluate

#### 2. Built-in Metrics
- **Answer Relevancy**: Measures if the answer addresses the question
- **Faithfulness**: Checks if the answer is grounded in the retrieved context
- **Contextual Precision**: Evaluates if relevant docs are ranked higher
- **Contextual Relevancy**: Measures how relevant the retrieved context is

In [18]:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

# Custom metric to evaluate conciseness
# This uses an LLM to judge if the output is concise while complete
concise_metrics = GEval(
    name = "Concise",
    criteria="Assess if the actual output remains concise while preserving all essential information.",
    
    # Only evaluate the actual output (not comparing with expected)
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT
    ]
)

# Custom metric to evaluate completeness
# This checks if all key information is retained in the output
completeness_metrics = GEval(
    name = "Completeness",
    criteria="Assess whether the actual output retains all the key information from the input",
    
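    # The criteria compares the output against the input, so the judge needs both
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ]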
)

If you want to evaluate with a local model, uncomment and run the following command:

In [19]:
#!deepeval set-ollama deepseek-r1:latest

### Running the Evaluation

Now let's run our evaluation with multiple metrics. DeepEval will:
1. Execute each metric on our test case
2. Provide detailed scores and explanations
3. Show which metrics passed or failed based on thresholds
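
The pass/fail logic can be pictured as a simple comparison. The sketch below assumes a metric passes when its score is at least its threshold and that a test case passes only if every metric does. `MetricResult` is a hypothetical class and the failing score is made up.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    """Hypothetical container for one metric's outcome."""
    name: str
    score: float            # between 0.0 and 1.0
    threshold: float = 0.5

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold

results = [
    MetricResult("Answer Relevancy", 1.0),
    MetricResult("Faithfulness", 0.4),  # hypothetical failing score
]
test_case_passed = all(r.passed for r in results)

for r in results:
    status = "PASS" if r.passed else "FAIL"
    print(f"{status} {r.name}: {r.score} (threshold {r.threshold})")
print("test case passed:", test_case_passed)
```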

Note: Since our test case doesn't include `retrieval_context`, metrics like Faithfulness and Contextual Precision won't work properly. In a real RAG evaluation, you should always include the retrieved documents.

In [20]:
import deepeval.metrics

deepeval.evaluate(test_cases=test_cases, metrics=[
    completeness_metrics,
    deepeval.metrics.AnswerRelevancyMetric(),
    concise_metrics
] )


Metrics Summary

  - ✅ Completeness [GEval] (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4.1, reason: The actual output accurately preserves all key information from the input: it identifies MCP as an open-source protocol by Anthropic, describes its purpose in connecting LLMs to external data sources and tools, notes the standardized approach and elimination of custom code, mentions the client-server architecture inspired by LSP, and highlights security features like least privilege and OAuth scopes. No key information is missing or altered, and the response is concise as requested., error: None)
  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4.1, reason: The score is 1.00 because the answer was fully relevant and addressed the question directly without any irrelevant information. Great job staying focused and concise!, error: None)
  - ✅ Concise [GEval] (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4.

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Completeness [GEval]', threshold=0.5, success=True, score=1.0, reason='The actual output accurately preserves all key information from the input: it identifies MCP as an open-source protocol by Anthropic, describes its purpose in connecting LLMs to external data sources and tools, notes the standardized approach and elimination of custom code, mentions the client-server architecture inspired by LSP, and highlights security features like least privilege and OAuth scopes. No key information is missing or altered, and the response is concise as requested.', strict_mode=False, evaluation_model='gpt-4.1', error=None, evaluation_cost=0.0028699999999999993, verbose_logs='Criteria:\nAssess whether the actual output retains all the key information from the input \n \nEvaluation Steps:\n[\n    "Identify all key information present in the input.",\n    "Compare each actual output to the inpu

### Creating a Test Dataset

For more comprehensive testing, let's create a dataset with multiple test cases. DeepEval allows you to:

1. **Create Golden datasets**: These are test cases with expected outputs that serve as ground truth
2. **Push datasets to the cloud**: Store and version your test datasets
3. **Pull datasets**: Retrieve datasets for consistent testing across teams

#### Golden Test Cases

Golden test cases are particularly useful for:
- Regression testing
- Comparing different model versions
- Establishing baseline performance
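
For regression testing, the idea is to score the same goldens with two versions of the application and flag the questions that got worse. The sketch below assumes you already have one score per golden from each run. `find_regressions`, the `tolerance` value, and all scores are hypothetical.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """Return the golden inputs whose score dropped by more than `tolerance`."""
    return [
        question
        for question, old_score in baseline.items()
        if candidate.get(question, 0.0) < old_score - tolerance
    ]

# Hypothetical per-golden scores from two runs over the same dataset
baseline = {"What is MCP": 0.90, "Core components of MCP": 0.80}
candidate = {"What is MCP": 0.92, "Core components of MCP": 0.60}
print(find_regressions(baseline, candidate))
```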

In [21]:
from deepeval.dataset import Golden, EvaluationDataset

test_data = [
    {
        "input": "What is MCP",
        "reference": "The Model Context Protocol (MCP) addresses this challenge by providing a standardized way for LLMs to connect with external data sources and tools—essentially a “universal remote” for AI apps. Released by Anthropic as an open-source protocol, MCP builds on existing function calling by eliminating the need for custom integration between LLMs and other apps."
    },
    {
        "input": "What is Relationship between function calling & Model Context Protocol",
        "reference": "The Model Context Protocol (MCP) builds on top of function calling, a well-established feature that allows large language models (LLMs) to invoke predetermined functions based on user requests. MCP simplifies and standardizes the development process by connecting AI applications to context while leveraging function calling to make API interactions more consistent across different applications and model vendors."
    },
    {
        "input": "What are the core components of MCP, just give the heading",
        "reference":""" 
                    - MCP Client
                    - MCP Servers
                    - Protocol Handshake
                    - Capability Discovery
                """
    }
]

goldens = []
for data in test_data:
    golden = Golden(
        input=data['input'],
        expected_output=data['reference']
    )

    goldens.append(golden)

dataset = EvaluationDataset(goldens=goldens)

dataset.push('test_rag')


### Running Comprehensive Evaluation on Multiple Test Cases

The code below demonstrates how to evaluate multiple test cases with RAG-specific metrics. Notice how the evaluation includes retrieval context, which is essential for metrics like:

- **Faithfulness**: Ensures the answer is grounded in retrieved documents
- **Contextual Precision**: Checks if relevant documents are ranked higher
- **Contextual Relevancy**: Measures the relevance of retrieved documents

These metrics help identify issues like:
- Hallucinations (answers not supported by context)
- Poor retrieval quality
- Irrelevant document retrieval
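
To build intuition for what these scores measure, here is a simplified sketch with hand-labelled verdicts. It is not DeepEval's implementation: in the real metrics an LLM judge produces the per-claim and per-chunk verdicts, and the exact formulas may differ from the ones assumed here.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0

def contextual_precision(relevant_by_rank: list[bool]) -> float:
    """Rank-weighted precision: relevant chunks near the top score higher."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(relevant_by_rank, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

# Hand-labelled verdicts; in practice an LLM judge would produce them
print(faithfulness([True, True, False]))
print(contextual_precision([True, False, True]))  # relevant chunks at ranks 1 and 3
print(contextual_precision([False, True, True]))  # same number of relevant chunks, ranked lower
```

A low faithfulness score points to hallucination in the generator. A low contextual precision score points to the retriever ranking irrelevant chunks above relevant ones.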

In [1]:
import deepeval
import deepeval.metrics
from deepeval.dataset import Golden, EvaluationDataset
from deepeval.test_case import LLMTestCase
from typing import List
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

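# NOTE: `rag_chain` is created in the "RAG that returns the context" cell.
# Run the Step 1 and Step 2 cells first, otherwise this cell raises a NameError.
def convert_goldens_to_test_cases(dataset: EvaluationDataset) -> List[LLMTestCase]: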
    test_cases = []
    for golden in dataset.goldens:
        response = rag_chain.invoke({"input": golden.input})
        
        test_case = LLMTestCase(
            input=golden.input,
            actual_output=response["answer"],
            expected_output=golden.expected_output,
            retrieval_context=[doc.page_content for doc in response["context"]]
        )
        test_cases.append(test_case)
    return test_cases

dataset = EvaluationDataset()
dataset.pull(alias="test_rag")

data = convert_goldens_to_test_cases(dataset)

concise_metrics = GEval(
    name="Concise",
    model="gpt-4o",  # Specify OpenAI model
    criteria="Assess if the actual output remains concise while preserving all essential information.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

deepeval.evaluate(
    test_cases=data,
    metrics=[
        deepeval.metrics.AnswerRelevancyMetric(),
        deepeval.metrics.FaithfulnessMetric(),
        deepeval.metrics.ContextualPrecisionMetric(),
        deepeval.metrics.ContextualRelevancyMetric(),
        concise_metrics
    ]
)


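NameError: name 'rag_chain' is not defined

This error appears because the cell above was run in a fresh kernel (note the `In [1]` counter), where `rag_chain` had not been created yet. Run the Step 1 and Step 2 cells first so that `vectorstore` and `rag_chain` exist, then run this cell again.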