### Testing RAG Applications 📑

#### RAG Application
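This application loads a web article about the Model Context Protocol (MCP), splits it into chunks, embeds each chunk, and stores the embeddings in a vector store. At inference time, the chunks most relevant to a question are retrieved and passed to the LLM as context for answering questions about MCP.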

<img src="./img/RAG.png" width="500" height="400" style="display: block; margin: auto;">
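
The flow described above (chunk, embed, store, retrieve, build a prompt) can be sketched with the standard library alone. This is only a toy illustration: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and the three sample chunks are made up for the example. None of these helper names come from LangChain or Chroma.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, store: List[Tuple[str, Counter]], k: int = 1) -> List[str]:
    """Return the k stored chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical sample chunks; a real run would use chunks of the loaded web page.
chunks = [
    "MCP is a protocol that connects language models to external tools.",
    "Chroma is a vector store that keeps embeddings for similarity search.",
    "Bananas are rich in potassium.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

question = "What is MCP"
context = "\n\n".join(retrieve(question, store, k=1))
prompt_text = (
    "Answer the question based only on the following context:\n\n"
    f"{context}\n\nQuestion: {question}"
)
print(prompt_text)
```

The notebook cells below follow the same shape, with `OllamaEmbeddings` playing the role of `embed` and `Chroma` playing the role of `store`.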

In [1]:
%pip install -qU langchain-chroma

Note: you may need to restart the kernel to use updated packages.





In [2]:
%pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.





In [3]:
from typing import List

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain.schema.document import Document
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_ollama import ChatOllama, OllamaEmbeddings

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [3]:
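# Local model served by Ollama
llm = ChatOllama(
    base_url="http://localhost:11434",
    model="qwen2.5:latest",
    temperature=0.5,
    num_predict=250,  # Ollama's parameter for capping generated tokens (ChatOllama has no max_tokens field)
)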

In [4]:
import os

from langchain.chat_models import ChatOpenAI

# Initialize the OpenAI model. The API key is read from the environment
# rather than hard-coded in the notebook.
llm = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],  # Set OPENAI_API_KEY before running this cell
    model="gpt-3.5-turbo",  # Specify the OpenAI model you want to use
    temperature=0.5,        # Control randomness
    max_tokens=250          # Limit the response length
)



In [None]:
# Load data from Web
loader = WebBaseLoader("https://www.descope.com/learn/post/mcp")
data = loader.load()

# Split text into documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# Add text to vector db
embedding = OllamaEmbeddings(model="llama3.2:latest")
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

# Create a retriever
retriever = vectordb.as_retriever()

def format_docs(docs: List[Document]) -> str:
    return "\n\n".join([d.page_content for d in docs])


template = """Answer the question based only on the following context:

    {context}
    
    Give a summary not the full detail

    Question: {question}
    """
prompt = ChatPromptTemplate.from_template(template)


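def retrieve_and_format(question: str) -> str:
    # Fetch the chunks most similar to the question and join them into one context string
    docs = retriever.invoke(question)  # replaces the deprecated get_relevant_documents()
    return format_docs(docs)

# The dict supplies the retrieved context and the untouched question to the prompt
chain = {"context": retrieve_and_format, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()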
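
The cell above splits the page with `chunk_size=500` and `chunk_overlap=0`. To make the effect of those two parameters concrete, here is a minimal sketch of a fixed-size character splitter. It is an assumption-laden simplification: `split_fixed` is a hypothetical helper, and it is not the algorithm `RecursiveCharacterTextSplitter` uses (that class also tries to break on separators such as paragraphs and sentences).

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


text = "abcdefghij"
no_overlap = split_fixed(text, chunk_size=4, chunk_overlap=0)
with_overlap = split_fixed(text, chunk_size=4, chunk_overlap=2)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With no overlap, a sentence that straddles a chunk boundary is cut in two and neither half may retrieve well. A small overlap repeats the boundary text in both neighbours at the cost of storing more chunks.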


#### Output of the LLM Application

In [6]:
response = chain.invoke("What is MCP")

print(response)



MCP is a protocol that standardizes how API features work, allowing for easier integration and interaction with external APIs. It helps streamline the process of defining function schemas, implementing handlers, and supporting multiple models.


In [7]:
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

test_case = LLMTestCase(
  input="What is MCP",
  actual_output= chain.invoke("What is MCP"),
  expected_output="The Model Context Protocol (MCP) addresses this challenge by providing a standardized way for LLMs to connect with external data sources and tools—essentially a “universal remote” for AI apps. Released by Anthropic as an open-source protocol, MCP builds on existing function calling by eliminating the need for custom integration between LLMs and other apps."
)


dataset = EvaluationDataset(test_cases=[test_case])

In [7]:
test_case

LLMTestCase(input='What is MCP', actual_output='MCP, or Model Context Protocol, is a protocol designed to enable AI assistants to interact with various external APIs and platforms. It supports actions like retrieving channel history from messaging apps, creating forks or branches on GitHub, and more. MCP servers, including reference, official integrations, and community-developed ones, showcase how AI assistants can access and utilize information from different sources.', expected_output='The Model Context Protocol (MCP) addresses this challenge by providing a standardized way for LLMs to connect with external data sources and tools—essentially a “universal remote” for AI apps. Released by Anthropic as an open-source protocol, MCP builds on existing function calling by eliminating the need for custom integration between LLMs and other apps.', context=None, retrieval_context=None, additional_metadata=None, comments=None, tools_called=None, expected_tools=None, reasoning=None, name=None)

In [8]:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

# GEval scores a test case against a natural-language criterion using an LLM as the judge
concise_metrics = GEval(
    name="Concise",
    criteria="Assess if the actual output remains concise while preserving all essential information.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.INPUT,
    ],
)

In [9]:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

completness_metrics = GEval(
    name="Completeness",
    criteria="Assess whether the actual output retains all the key information from the input",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.INPUT,
    ],
)

In [15]:
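import os

import deepeval

# Read the Confident AI API key from the environment instead of hard-coding it.
# CONFIDENT_API_KEY is an assumed variable name; use whichever name you export.
deepeval.login_with_confident_api_key(os.environ["CONFIDENT_API_KEY"])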

In [11]:
!deepeval set-ollama deepseek-r1:8b

🙌 Congratulations! You're now using a local Ollama model for all evals that 
require an LLM.


### Evaluation with GEval 

In [14]:
import deepeval.metrics

deepeval.evaluate(
    dataset,
    metrics=[
        completness_metrics,
        deepeval.metrics.AnswerRelevancyMetric(),
        concise_metrics,
    ],
)

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:12, 12.91s/test case]




Metrics Summary

  - ✅ Completeness (GEval) (score: 0.7153852170196032, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The output captures the essence of MCP as a protocol for standardizing API features and simplifying interactions with LLM APIs, but lacks specific examples or details about the types of information retrieved or the external APIs involved., error: None)
  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the response perfectly addressed the question about MCP without any irrelevant information. Great job on staying focused and providing a clear answer!, error: None)
  - ✅ Concise (GEval) (score: 0.7798126297766479, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The output is concise and includes essential information about MCP standardizing API features, streamlining function calls, and eliminating model-specific schemas. However, it lacks details on how it i
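
To read the summary above: each metric reports a score and a threshold, and the check mark indicates the metric passed. The sketch below reproduces that decision with plain Python, using the three scores and the 0.5 threshold copied from the summary. The `score >= threshold` rule and the "all metrics must pass" rule are assumptions about how the pass/fail flags are derived, and `passed` is a hypothetical helper, not a deepeval function.

```python
from typing import Dict

# Scores and threshold copied from the Metrics Summary above.
scores: Dict[str, float] = {
    "Completeness (GEval)": 0.7153852170196032,
    "Answer Relevancy": 1.0,
    "Concise (GEval)": 0.7798126297766479,
}
THRESHOLD = 0.5


def passed(score: float, threshold: float) -> bool:
    """Assumed pass rule: a metric succeeds when its score reaches the threshold."""
    return score >= threshold


results = {name: passed(score, THRESHOLD) for name, score in scores.items()}
for name, ok in results.items():
    status = "pass" if ok else "fail"
    print(f"{name}: {status} (score {scores[name]:.2f}, threshold {THRESHOLD})")

# Assumed aggregation: the test case succeeds only if every metric passes.
test_case_success = all(results.values())
print("test case success:", test_case_success)
```

Raising the threshold is a simple way to tighten the test: at 0.75, for example, the Completeness score would no longer pass under this rule.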

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Completeness (GEval)', threshold=0.5, success=True, score=0.7153852170196032, reason='The output captures the essence of MCP as a protocol for standardizing API features and simplifying interactions with LLM APIs, but lacks specific examples or details about the types of information retrieved or the external APIs involved.', strict_mode=False, evaluation_model='gpt-4o', error=None, evaluation_cost=0.0024625000000000003, verbose_logs='Criteria:\nAssess whether the actual output retains all the key information from the input \n \nEvaluation Steps:\n[\n    "Compare the actual output against the input to identify any missing key information.",\n    "Check the actual output for presence of each critical detail found in the input.",\n    "Evaluate the significance of any omitted information from the actual output relative to the input.",\n    "Determine if the retained information in th