### Testing RAG Applications 📑

#### RAG Application
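This application loads an article about Model Context Protocol (MCP) servers from the web, splits it into chunks, embeds the chunks, and stores them in a vector store. At inference time, the chunks most relevant to a question are retrieved and passed to the LLM as context, so it can answer questions about MCP.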

<img src="./img/RAG.png" width="500" height="400" style="display: block; margin: auto;">

In [1]:
#!pip install -qU langchain-chroma

In [2]:
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from typing import List
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.document import Document
from langchain_ollama import ChatOllama

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [3]:
llm = ChatOllama(
    base_url="http://localhost:11434",
    model="qwen2.5:latest",
    temperature=0.5,
    num_predict=250,  # ChatOllama's parameter for the maximum number of tokens to generate
)

In [4]:
# Load data from Web
loader = WebBaseLoader("https://www.descope.com/learn/post/mcp")
data = loader.load()

# Split text into documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# Add text to vector db
embedding = OllamaEmbeddings(model="llama3.2:latest")
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

# Create a retriever
retriever = vectordb.as_retriever()

def format_docs(docs: List[Document]) -> str:
    return "\n\n".join([d.page_content for d in docs])


template = """Answer the question based only on the following context:

    {context}
    
    Give a summary not the full detail

    Question: {question}
    """
prompt = ChatPromptTemplate.from_template(template)


def retrieve_and_format(question: str) -> str:
    # retriever.invoke() replaces the deprecated get_relevant_documents()
    docs = retriever.invoke(question)
    return format_docs(docs)

chain = {"context": retrieve_and_format, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()
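
To make the split, embed, retrieve, and format steps of the chain easier to follow, here is a minimal standard-library sketch. It is a toy stand-in, not what Chroma or Ollama do internally: `split_text`, `embed`, `cosine`, and `retrieve` are hypothetical helpers, and the "embedding" is only a bag-of-words count vector.

```python
import math
import re
from collections import Counter
from typing import List


def split_text(text: str, chunk_size: int = 80) -> List[str]:
    """Pack whole words into chunks of at most chunk_size characters.

    A single word longer than chunk_size becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


# Illustrative corpus (made up for this sketch, not the loaded web page)
corpus = (
    "MCP is an open-source protocol released by Anthropic. "
    "It gives LLMs a standardized way to connect with external data sources and tools. "
    "Chroma is a vector store. Ollama serves models locally."
)
chunks = split_text(corpus, chunk_size=80)

# Same formatting as format_docs(): retrieved chunks joined by blank lines
context = "\n\n".join(retrieve("What is MCP", chunks, k=2))
print(context)
```

In the real chain, the embedding model and the vector store take over the `embed` and `retrieve` roles, and the joined string fills the `{context}` slot of the prompt.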


#### Output of the LLM Application

In [5]:
response = chain.invoke("What is MCP")

print(response)



MCP, or Model Context Protocol, is a protocol designed to enable AI assistants to interact with various external services and platforms. It supports actions like retrieving channel history from messaging apps and performing GitHub operations such as creating forks or branches. MCP servers, including reference, official integrations, and community-developed ones, demonstrate its capabilities in integrating diverse systems.


### Testing RAG Application with DeepEval
<img src="./img/RAGTesting.png" width="800" height="400" style="display: block; margin: auto;">

In [6]:
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

test_case = LLMTestCase(
  input="What is MCP",
  actual_output= chain.invoke("What is MCP"),
  expected_output="The Model Context Protocol (MCP) addresses this challenge by providing a standardized way for LLMs to connect with external data sources and tools—essentially a “universal remote” for AI apps. Released by Anthropic as an open-source protocol, MCP builds on existing function calling by eliminating the need for custom integration between LLMs and other apps."
)


dataset = EvaluationDataset(test_cases=[test_case])



In [7]:
test_case

LLMTestCase(input='What is MCP', actual_output='MCP, or Model Context Protocol, is a protocol designed to enable AI assistants to interact with various external APIs and platforms. It supports actions like retrieving channel history from messaging apps, creating forks or branches on GitHub, and more. MCP servers, including reference, official integrations, and community-developed ones, showcase how AI assistants can access and utilize information from different sources.', expected_output='The Model Context Protocol (MCP) addresses this challenge by providing a standardized way for LLMs to connect with external data sources and tools—essentially a “universal remote” for AI apps. Released by Anthropic as an open-source protocol, MCP builds on existing function calling by eliminating the need for custom integration between LLMs and other apps.', context=None, retrieval_context=None, additional_metadata=None, comments=None, tools_called=None, expected_tools=None, reasoning=None, name=None)

In [8]:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

concise_metrics = GEval(
    name = "Concise",
    criteria="Assess if the actual output remains concise while preserving all essential information.",
    
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT
    ]
)

In [9]:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

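completness_metrics = GEval(
    name="Completeness",
    criteria="Assess whether the actual output retains all the key information from the input",
    # The criteria compare the output with the input, so the judge needs both fields
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)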

In [10]:
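import os

import deepeval

# Read the Confident AI API key from the environment instead of hardcoding it in the notebook.
# CONFIDENT_API_KEY is an assumed variable name; set it in your shell before running this cell.
deepeval.login_with_confident_api_key(os.environ["CONFIDENT_API_KEY"])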

In [11]:
!deepeval set-ollama deepseek-r1:8b

🙌 Congratulations! You're now using a local Ollama model for all evals that 
require an LLM.


### Evaluation with GEval 

In [13]:
import deepeval.metrics

deepeval.evaluate(dataset, metrics=[
    completness_metrics,
    deepeval.metrics.AnswerRelevancyMetric(),
    concise_metrics
] )

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:50, 50.50s/test case]




Metrics Summary

  - ✅ Completeness (GEval) (score: 1.0, threshold: 0.5, strict: False, evaluation model: deepseek-r1:8b (Ollama), reason: All key components are present and accurately aligned with their sources. No omissions or distortions; context and meaning are retained., error: None)
  - ✅ Answer Relevancy (score: 0.8333333333333334, threshold: 0.5, strict: False, evaluation model: deepseek-r1:8b (Ollama), reason: The score is 0.83 because the actual output included an irrelevant statement about GitHub branches, which does not address what MCP is., error: None)
  - ✅ Concise (GEval) (score: 1.0, threshold: 0.5, strict: False, evaluation model: deepseek-r1:8b (Ollama), reason: The content is concise and includes all essential information without redundancy. The language is clear and complete, ensuring nothing vital is omitted., error: None)
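
The pass marks above can be read as a comparison of each score with its threshold. The sketch below reproduces that check with the reported numbers. It assumes the non-strict rule is `score >= threshold` and that Answer Relevancy is the share of relevant statements in the answer. The counts of 5 relevant statements out of 6 are an assumption chosen to match the reported 0.83, since the run output does not list them. `MetricScore` and `relevancy_ratio` are hypothetical names, not DeepEval classes.

```python
from dataclasses import dataclass


@dataclass
class MetricScore:
    name: str
    score: float
    threshold: float = 0.5  # threshold reported for every metric in the summary

    @property
    def passed(self) -> bool:
        # Assumed non-strict rule: pass when the score reaches the threshold
        return self.score >= self.threshold


def relevancy_ratio(relevant: int, total: int) -> float:
    """Share of statements in the answer that address the question."""
    if total <= 0:
        raise ValueError("total must be positive")
    return relevant / total


results = [
    MetricScore("Completeness (GEval)", 1.0),
    # Assumed counts: one irrelevant statement (the GitHub branches remark) out of six
    MetricScore("Answer Relevancy", relevancy_ratio(5, 6)),
    MetricScore("Concise (GEval)", 1.0),
]

for r in results:
    status = "pass" if r.passed else "fail"
    print(f"{r.name}: score={r.score:.2f} threshold={r.threshold} -> {status}")
```

Raising the threshold passed to a metric would tighten this check for that metric.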

For test case:

  - input: What is MCP
  - actual output: MCP, or Model Context Protocol, is a protocol designed to enable AI assistants to 

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Completeness (GEval)', threshold=0.5, success=True, score=1.0, reason='All key components are present and accurately aligned with their sources. No omissions or distortions; context and meaning are retained.', strict_mode=False, evaluation_model='deepseek-r1:8b (Ollama)', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nAssess whether the actual output retains all the key information from the input \n \nEvaluation Steps:\n[\n    "Compare the input and output to ensure all key components are present.",\n    "Verify that each piece of information aligns accurately with its source.",\n    "Check for any omissions or distortions in the representation of details.",\n    "Assess whether the retained information maintains its original context and meaning."\n]'), MetricData(name='Answer Relevancy', threshold=0.5, success=True, score=0.8333333333333334, reason='The score is 0.83 b