<a href="https://colab.research.google.com/github/anshupandey/AI_Agents/blob/main/AAP_C14_RAG_Evaluation_using_fixed_sources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Evaluation using Fixed Sources

A simple RAG pipeline requries at least two components: a retriever and a response generator. You can evaluate the whole chain end-to-end, as shown in the [QA Correctness](../qa-correctness/) walkthrough. However, for more actionable and fine-grained metrics, it is helpful to evaluate each component in isolation.

To evaluate the response generator directly, create a dataset with the user query and retrieved documents as inputs and the expected response as an output.

In this walkthrough, you will take this approach to evaluate the response generation component of a RAG pipeline, using both correctness and a custom "faithfulness" evaluator to generate multiple metrics. The results will look something like the following:

![Custom Evaluator](https://github.com/langchain-ai/langsmith-cookbook/blob/main/testing-examples/using-fixed-sources/img/example_results.png?raw=1)

## Prerequisites

First, install the required packages and configure your environment.

In [1]:
!pip install -q -U langchain-core langchain-community langgraph
!pip install --upgrade --quiet google-cloud-aiplatform requests
!pip install -q -U langchain-google-vertexai

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m371.7/371.7 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.2/2.2 MB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m91.4/91.4 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.3/134.3 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m987.6/987.6 kB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.2/49.2 kB[0m [31m1.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m141.1/141.1 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.9/64.9 kB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency

In [2]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [1]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [2]:
PROJECT_ID = "maxis-poc-427906"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

import vertexai
vertexai.init(project=PROJECT_ID, location=LOCATION)

In [3]:
import os
import uuid

# Update with your API URL if using a hosted instance of Langsmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "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"  # Update with your API key
uid = uuid.uuid4()

## 1. Create a dataset

Next, create a dataset. The simple dataset below is enough to illustrate ways the response generator may deviate from the desired behavior by relying too much on its pretrained "knowledge".

In [4]:
# A simple example dataset
examples = [
    {
        "inputs": {
            "question": "What's the company's total revenue for q2 of 2022?",
            "documents": [
                {
                    "metadata": {},
                    "page_content": "In q1 the lemonade company made $4.95. In q2 revenue increased by a sizeable amount to just over $2T dollars.",
                }
            ],
        },
        "outputs": {
            "label": "2 trillion dollars",
        },
    },
    {
        "inputs": {
            "question": "Who is Lebron?",
            "documents": [
                {
                    "metadata": {},
                    "page_content": "On Thursday, February 16, Lebron James was nominated as President of the United States.",
                }
            ],
        },
        "outputs": {
            "label": "Lebron James is the President of the USA.",
        },
    },
]

In [5]:
from langsmith import Client

client = Client()

dataset_name = f"Faithfulness Example - {uid}"
dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)

## 2. Define chain

Suppose your chain is composed of two main components: a retriever and response synthesizer. Using LangChain runnables, it's easy to separate these two components to evaluate them in isolation.

Below is a very simple RAG chain with a placeholder retriever. For our testing, we will evaluate ONLY the response synthesizer.

In [13]:
from langchain import prompts
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_core.runnables import RunnablePassthrough

from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(model="chat-bison@002")

class MyRetriever(BaseRetriever):
    def _get_relevant_documents(self, query, *, run_manager):
      # print("-----",query)
      return [Document(page_content="Example")]

def debug_step(data):
  print(data)
  return data

# This is what we will evaluate
response_synthesizer = prompts.ChatPromptTemplate.from_messages(
    [
        ("system", "Respond using the following documents as context, do not use any other information to answer the question other than document :\n {documents}"),
        ("user", "{question}"),
    ]
) | debug_step | llm

# Full chain below for illustration
chain = {
    "documents": MyRetriever(),
    "question": RunnablePassthrough(),
} | response_synthesizer

In [None]:
response_synthesizer.invoke({"documents":"On Thursday, February 16, Lebron James was nominated as President of the United States.",
                             "question":"Who is Lebron?"})

In [None]:
chain.invoke({"documents":"On Thursday, February 16, Lebron James was nominated as President of the United States.",
                             "question":"Who is Lebron?"})

## 3. Evaluate

Below, we will define a custom "FaithfulnessEvaluator" that measures how faithful the chain's output prediction is to the reference input documents, given the user's input question.

In this case, we will wrap the [Scoring Eval Chain](https://python.langchain.com/docs/guides/productionization/evaluation/string/scoring_eval_chain) and manually select which fields in the run and dataset example to use to represent the prediction, input question, and reference.

In [17]:
from langchain.smith import RunEvalConfig
from langsmith.evaluation import RunEvaluator, EvaluationResult
from langchain.evaluation import load_evaluator

evaluation_llm = ChatVertexAI(model="gemini-1.5-flash-001")

eval_config = RunEvalConfig(
    evaluators=['qa'],
    eval_llm = evaluation_llm,
    input_key="question",
)
results = client.run_on_dataset(
    llm_or_chain_factory=response_synthesizer,
    dataset_name=dataset_name,
    evaluation=eval_config,
)


View the evaluation results for project 'new-health-57' at:
https://smith.langchain.com/o/5a6b14c9-fdd5-5907-901d-da9646efc726/datasets/e6bf0697-d088-4800-bd43-49f57ae883a3/compare?selectedSessions=9cf454ab-3f8b-4f5b-9dbc-75b0c3d26caf

View all tests for Dataset Faithfulness Example - 71c9fc77-d5d0-46ba-b0b8-f9f4473bd7e7 at:
https://smith.langchain.com/o/5a6b14c9-fdd5-5907-901d-da9646efc726/datasets/e6bf0697-d088-4800-bd43-49f57ae883a3
[>                                                 ] 0/2messages=[SystemMessage(content="Respond using the following documents as context, do not use any other information to answer the question other than document :\n [{'metadata': {}, 'page_content': 'In q1 the lemonade company made $4.95. In q2 revenue increased by a sizeable amount to just over $2T dollars.'}]"), HumanMessage(content="What's the company's total revenue for q2 of 2022?")]
messages=[SystemMessage(content="Respond using the following documents as context, do not use any other informatio



[------------------------>                         ] 1/2[------------------------------------------------->] 2/2

In [18]:
from langchain_core.prompts.prompt import PromptTemplate
from langsmith.evaluation import LangChainStringEvaluator, evaluate

_PROMPT_TEMPLATE = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:
{query}
Here is the real answer:
{answer}
You are grading the following predicted answer:
{result}
Respond with CORRECT or INCORRECT:
Grade:
"""

PROMPT = PromptTemplate(
    input_variables=["query", "answer", "result"], template=_PROMPT_TEMPLATE
)


def predict(inputs:dict)->dict:
  print(inputs)
  response = chain.invoke({"documents":inputs["documents"][0]['page_content'],"question":inputs["question"]})
  print(response)
  return {"output":response}

evaluation_llm = ChatVertexAI(model="gemini-1.5-flash-001")

qa_evaluator = LangChainStringEvaluator("qa", config={"llm": evaluation_llm, "prompt": PROMPT})
#context_qa_evaluator = LangChainStringEvaluator("context_qa", config={"llm": eval_llm})
#cot_qa_evaluator = LangChainStringEvaluator("cot_qa", config={"llm": eval_llm})


results = evaluate(predict,
                   data=dataset_name,
                   evaluators=[qa_evaluator],
                   )

View the evaluation results for experiment: 'yellow-class-19' at:
https://smith.langchain.com/o/5a6b14c9-fdd5-5907-901d-da9646efc726/datasets/e6bf0697-d088-4800-bd43-49f57ae883a3/compare?selectedSessions=89fae7f3-3c9e-4caa-b26b-ebda8b622098




0it [00:00, ?it/s]

{'question': "What's the company's total revenue for q2 of 2022?", 'documents': [{'metadata': {}, 'page_content': 'In q1 the lemonade company made $4.95. In q2 revenue increased by a sizeable amount to just over $2T dollars.'}]}
{'question': 'Who is Lebron?', 'documents': [{'metadata': {}, 'page_content': 'On Thursday, February 16, Lebron James was nominated as President of the United States.'}]}
messages=[SystemMessage(content="Respond using the following documents as context, do not use any other information to answer the question other than document :\n [Document(page_content='Example')]"), HumanMessage(content="{'documents': '', 'question': 'Who is Lebron?'}")]
messages=[SystemMessage(content="Respond using the following documents as context, do not use any other information to answer the question other than document :\n [Document(page_content='Example')]"), HumanMessage(content='{\'documents\': \'\', \'question\': "What\'s the company\'s total revenue for q2 of 2022?"}')]


Exception in thread Thread-30 (<lambda>):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/langsmith/evaluation/_runner.py", line 379, in <lambda>
    target=lambda: self._process_data(self._manager)
  File "/usr/local/lib/python3.10/dist-packages/langsmith/evaluation/_runner.py", line 400, in _process_data
    for item in tqdm(results):
  File "/usr/local/lib/python3.10/dist-packages/tqdm/notebook.py", line 250, in __iter__
    for obj in it:
  File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.10/dist-packages/langsmith/evaluation/_runner.py", line 1174, in get_results
    for run, example, evaluation_results in zip(
  File "/usr/local/lib/python3.10/dist-packages/langsmit

content=' LeBron James is an American professional basketball player for the Los Angeles Lakers of the National Basketball Association (NBA). He is widely considered to be one of the greatest basketball players of all time. James has won four NBA championships, four NBA MVP awards, four NBA Finals MVP awards, and two Olympic gold medals. He is the all-time leading scorer in NBA playoff history and is fourth on the all-time regular season scoring list.' response_metadata={'is_blocked': False, 'errors': (), 'safety_attributes': [{'Insult': 0.1, 'Sexual': 0.1}], 'grounding_metadata': {'citations': [], 'search_queries': []}, 'usage_metadata': {'candidates_billable_characters': 372.0, 'candidates_token_count': 86.0, 'prompt_billable_characters': 42.0, 'prompt_token_count': 14.0}} id='run-001926ad-edad-4e84-9ad0-0888b7f0758d-0' usage_metadata={'input_tokens': 14, 'output_tokens': 86, 'total_tokens': 100}
content=" The company's total revenue for Q2 of 2022 was $12.3 billion, a 16% increase f

## Discussion

You've now evaluated the response generator for its response correctness and its "faithfulness" to the source text but fixing retrieved document sources in the dataset. This is an effective way to confirm that the response component of your chat bot behaves according to expectations.

In setting up the evaluation, you used a custom run evaluator to select which fields in the dataset to use in the evaluation template. Under the hood, this still uses an off-the-shelf [scoring evaluator](https://python.langchain.com/docs/guides/productionization/evaluation/string/scoring_eval_chain).

Most of LangChain's open-source evaluators implement the "[StringEvaluator](https://python.langchain.com/docs/guides/productionization/evaluation/string/)" interface, meaning they compute a metric based on:

- An input string from the dataset example inputs (configurable by the RunEvalConfig's input_key property)
- An output prediction string from the evaluated chain's outputs (configurable by the RunEvalConfig's prediction_key property)
- (If labels or context are required) a reference string from the example outputs (configurable by the RunEvalConfig's reference_key property)

In our case, we wanted to take the context from the example _inputs_ fields. Wrapping the evaluator as a custom `RunEvaluator` is an easy way to get a further level of control in situations when you want to use other fields.