<a href="https://colab.research.google.com/github/anshupandey/Generative-AI-for-Professionals/blob/main/Gemini_RAG_Evaluation_LangSmith.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Evaluation

RAG (Retrieval Augmented Generation) is one of the most popular techniques for building applications with LLMs.

For an in-depth review, see our RAG series of notebooks and videos [here](https://github.com/langchain-ai/rag-from-scratch).

## Types of RAG eval

There are at least 4 types of RAG eval that users are typically interested in (here, `<>` means "compared against"):

1. **Response <> reference answer**: metrics like correctness measure "*how similar/correct is the answer, relative to a ground-truth label*"
2. **Response <> input**: metrics like answer relevance, helpfulness, etc. measure "*how well does the generated response address the initial user input*"
3. **Response <> retrieved docs**: metrics like faithfulness, hallucinations, etc. measure "*to what extent does the generated response agree with the retrieved context*"
4. **Retrieved docs <> input**: metrics like score @ k, mean reciprocal rank, NDCG, etc. measure "*how good are my retrieved results for this query*"
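
To make the retrieval metrics in the last item concrete, here is a minimal sketch of precision @ k and mean reciprocal rank in plain Python. The relevance judgments are hypothetical (1 = relevant, 0 = not relevant, in ranked order), and the function names are illustrative rather than part of LangSmith.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = relevance[:k]
    return sum(top_k) / k if k > 0 else 0.0


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[int]]) -> float:
    """Average reciprocal rank over several queries."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs) if runs else 0.0


# Hypothetical relevance judgments for three queries
judgments = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(precision_at_k(judgments[0], k=2))  # 0.5
print(mean_reciprocal_rank(judgments))    # 0.5
```

These metrics need per-document relevance labels, which is why the LLM-as-judge approach used later in this notebook is often more practical when no such labels exist.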

![](https://github.com/langchain-ai/langsmith-cookbook/blob/1a46c089ede410384f69dc3e808567406c39e009/testing-examples/rag_eval/langsmith_rag_eval.png?raw=1)


Each of these evals has something in common: it will **compare** text against some grounding (e.g., answer vs reference answer, etc).

We can use various built-in `LangChainStringEvaluator` types for this (see [here](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations#overview)), but the same principles apply no matter which type of evaluator you are using.

All `LangChainStringEvaluator` implementations can accept 3 inputs:

```
prediction: The prediction string.
reference: The reference string.
input: The input string.
```

`prediction` is always required.

`input` is required for most evaluators (`criteria`, `score_string`, `labeled_criteria`, `labeled_score_string`, `qa`, `cot_qa`).

`reference` is required for labeled evaluators, which are evaluators that grade against an expected value (`qa`, `cot_qa`, `labeled_criteria`, `labeled_score_string`).
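
The requirements above can be written down as a small lookup. This is only a sketch in plain Python: the helper names `required_fields` and `missing_fields` are hypothetical and not part of LangSmith, and the evaluator lists are copied from the text above.

```python
# Which inputs each evaluator type needs, as listed above.
NEEDS_INPUT = {"criteria", "score_string", "labeled_criteria",
               "labeled_score_string", "qa", "cot_qa"}
NEEDS_REFERENCE = {"qa", "cot_qa", "labeled_criteria", "labeled_score_string"}


def required_fields(evaluator_name: str) -> set:
    """Return the field names an evaluator of this type expects."""
    fields = {"prediction"}  # always required
    if evaluator_name in NEEDS_INPUT:
        fields.add("input")
    if evaluator_name in NEEDS_REFERENCE:
        fields.add("reference")
    return fields


def missing_fields(evaluator_name: str, data: dict) -> set:
    """Return the required fields that are absent from `data`."""
    return required_fields(evaluator_name) - set(data)


print(sorted(required_fields("score_string")))
print(sorted(missing_fields("cot_qa", {"prediction": "a", "input": "q"})))
```

A check like this can be handy before launching an experiment, since a missing field otherwise only surfaces once the evaluator runs.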


Below, we will use this to perform eval.

## RAG pipeline

To start, we build a RAG pipeline. We will be using LangChain strictly for creating the retriever and retrieving the relevant documents. The overall pipeline does not use LangChain. LangSmith works regardless of whether or not your pipeline is built with LangChain.

**Note:** in the example below, we return the retrieved documents as part of the final answer. In a follow-up tutorial, we will showcase how to make use of these RAG evaluation techniques *even when your pipeline returns only the final answer!*

In [1]:
!pip install langsmith langchain-community langchain tiktoken langchain-chroma --quiet
!pip install langchain-google-vertexai --quiet


In [8]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-4.3.1-py3-none-any.whl.metadata (7.4 kB)
Downloading pypdf-4.3.1-py3-none-any.whl (295 kB)
Installing collected packages: pypdf
Successfully installed pypdf-4.3.1


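We build an `index` using two Morningstar PDF documents: the Morningstar Rating for Funds methodology and the Morningstar Sustainable Investing Handbook.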

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, it is recommended to restart the runtime. Run the following cell to restart the current kernel.

The restart process might take a minute or so.


In [2]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

After the restart is complete, continue to the next step.


<div class="alert alert-block alert-warning">
<b>⚠️ Wait for the kernel to finish restarting before you continue. ⚠️</b>
</div>


In [1]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [2]:
# TODO(developer): Update the below lines
PROJECT_ID = "jrproject-402905"
LOCATION = "us-central1"
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

In [3]:
import getpass
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Prompt for the LangSmith API key instead of hardcoding a secret in the notebook
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API key: ")

## Import libraries


In [5]:
### INDEX

# Only the vector store is needed here; the loader and splitter are imported in the next cell
from langchain_chroma import Chroma

In [9]:
doc_paths = ["https://www.morningstar.com/content/dam/marketing/shared/research/methodology/771945_Morningstar_Rating_for_Funds_Methodology.pdf",
             "https://s21.q4cdn.com/198919461/files/doc_downloads/press_kits/2016/Morningstar-Sustainable-Investing-Handbook.pdf"]
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loaders = [PyPDFLoader(pdf, extract_images=False) for pdf in doc_paths]

docs = []

for loader in loaders:
    doc = loader.load()
    docs.extend(doc)
len(docs)

34

In [10]:
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3500, chunk_overlap=500)
splits = text_splitter.split_documents(docs)
print(len(splits))


42
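
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch that cuts fixed-size character windows with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and sentences); the function name and the sample string are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split `text` into fixed-size character windows that overlap.

    Simplified illustration only: it ignores the separator-aware logic
    of RecursiveCharacterTextSplitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij"  # hypothetical 10-character document
print(sliding_window_chunks(sample, chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role the 500-character overlap plays above: content near a chunk boundary still appears whole in at least one chunk.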


In [13]:
from langchain_google_vertexai.embeddings import VertexAIEmbeddings
embedding_fun = VertexAIEmbeddings(model_name="textembedding-gecko@001")

In [26]:
# Embed the splits and index them in Chroma
vectorstore = Chroma.from_documents(documents=splits, embedding=embedding_fun)

# Retriever that returns the 2 most similar chunks
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})



In [30]:
retrieved_docs = retriever.invoke("What is a style profile")
len(retrieved_docs)

2
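
Conceptually, a similarity retriever with `k=2` ranks the stored chunk embeddings against the query embedding and keeps the two closest. The sketch below assumes cosine similarity and uses hypothetical 2-D vectors; the distance metric Chroma actually applies depends on how the collection is configured, so treat this as an illustration of the idea rather than of Chroma's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k docs most similar to the query, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 2-D embeddings for three chunks
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))  # [0, 2]
```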

In [31]:
print(retrieved_docs[1].page_content)

©2021 Morningstar, Inc. All rights reserved. The information in this document is the property of Morningstar, Inc. Reproduction or transcription by any means, in whole or in part, without the prior written 
consent of Morningstar, Inc., is prohibited.
 The Morningstar RatingTM for Funds    August 2021 Page 3 of 21
3
3
3bond funds domiciled in Europe against other European high-yield bond funds. For more information 
about available categories, please contact your local Morningstar office.
Style Profiles
A style profile may be considered a summary of a fund’s risk-factor exposures. Fund categories 
define groups of funds whose members are similar enough in their risk-factor exposures that return 
comparisons between them are useful.
The risk factors on which fund categories are based can relate to value-growth orientation; 
capitalization; industry sector, geographic region, and country weights; duration and credit quality; 
historical return volatility; beta; and many other investment 

Next, we build a `RAG chain` that returns an `answer` and the retrieved documents as `contexts`.

In [16]:
from langchain_google_vertexai import ChatVertexAI
from langchain_core.messages import HumanMessage, SystemMessage
llm = ChatVertexAI(model="gemini-1.5-flash-001")
messages = [
    SystemMessage(content="Translate the following from English into Italian"),
    HumanMessage(content="hi!"),
]

llm.invoke(messages).content


'Ciao! \n'

In [32]:
### RAG


from langsmith import traceable
from langchain_core.prompts import ChatPromptTemplate

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

class RagBot:

    def __init__(self, retriever, model):
        self._retriever = retriever
        # Use the model passed in rather than the global `llm`
        self.model = model

    @traceable()
    def retrieve_docs(self, question):
        return self._retriever.invoke(question)

    @traceable()
    def get_answer(self, question: str):
        context = self.retrieve_docs(question)
        print(context)
        self.chain = prompt | self.model
        response = self.chain.invoke({"context":context,"question":question})

        # Evaluators will expect "answer" and "contexts"
        return {
            "answer": response.content,
            "contexts": [str(doc) for doc in context],
        }


rag_bot = RagBot(retriever,model=llm)

In [41]:
response = rag_bot.get_answer("How should investors use our Sustainability Ratings?") # what is Portfolio ESG Score?
response["answer"][:150]

[Document(metadata={'page': 5, 'source': 'https://s21.q4cdn.com/198919461/files/doc_downloads/press_kits/2016/Morningstar-Sustainable-Investing-Handbook.pdf'}, page_content='The Morningstar Sustainable Investing Handbook5Using the Morningstar Sustainability Rating for funds \nOur Sustainability Rating allows investors to assess how well the companies in a fund’s  \nportfolio are managing their ESG risks and opportunities. The rating will also allow investors to \ncompare funds across categories and relative to benchmarks using specific ESG factors. \nThe ratings can serve as an initial screen for investors interested in sustainability and ESG  \nfactors.  They are also a useful starting point for investors wanting to know more about a manager’s \ninvestment process and how it relates to sustainable investing.\nThe ratings will help investors determine both the level of sustainability in their existing portfolios \nand allow them to set sustainability targets. Some investors, for exampl

'Investors can use Morningstar Sustainability Ratings to:\n\n* **Assess ESG risk and opportunity management:**  The ratings help investors understand how'

In [40]:
response['answer']

"The Portfolio ESG Score is an asset-weighted average of normalized company-level ESG Scores for the covered holdings in the portfolio. These company-level ESG Scores come from Sustainalytics and reflect companies' management systems, practices, policies, and other indicators related to environmental, social, and governance issues. They are scored on a scale of 0-100, with a higher score indicating better performance. A high Portfolio ESG Score means that a fund has more of its assets invested in companies that score well according to the Sustainalytics ESG methodology. \n"

In [56]:
response

{'answer': "Investors can use Morningstar Sustainability Ratings to:\n\n* **Assess ESG risk and opportunity management:**  The ratings help investors understand how well the companies within a fund's portfolio are managing their environmental, social, and governance (ESG) risks and opportunities.\n* **Compare funds:** The ratings allow investors to compare funds across different categories and relative to benchmarks based on specific ESG factors.\n* **Screen for sustainability:** The ratings can serve as an initial screen for investors who are interested in sustainability and ESG factors.\n* **Understand investment process:** The ratings can provide insights into a fund manager's investment process and how it relates to",
 'contexts': ["page_content='The Morningstar Sustainable Investing Handbook5Using the Morningstar Sustainability Rating for funds \nOur Sustainability Rating allows investors to assess how well the companies in a fund’s  \nportfolio are managing their ESG risks and op

## RAG Dataset

Next, we build a dataset of QA pairs based upon the Morningstar documents that we indexed.

In [35]:
import getpass
import os


os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

In [47]:
from langsmith import Client

# QA
inputs = [
    "What is a style profile?",
    "How should investors use our Sustainability Ratings?",
    "what is Portfolio ESG Score?",
]

outputs = [
    "A style profile is a summary of a fund's risk-factor exposures. It outlines the factors that influence a fund's returns, such as value-growth orientation, capitalization, industry sector, geographic region, country weights, duration, credit quality, historical return volatility, beta, and other investment style factors.",
    "Investors can use Morningstar Sustainability Ratings to assess how well companies in a fund manage ESG risks and opportunities, compare funds across categories, screen for sustainability, and understand a fund manager's investment process regarding ESG factors.",
    "The Portfolio ESG Score, provided by Sustainalytics, is an asset-weighted average of normalized company-level ESG scores, ranging from 0-100. A higher score indicates better ESG performance, with funds investing more in companies that excel in managing environmental, social, and governance issues.",
]

qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]

# Create dataset (skipped if it already exists, which avoids a 409 conflict on re-runs)
client = Client()
dataset_name = "RAG_test_Finance"
if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="QA pairs about Finance.",
    )
    client.create_examples(
        inputs=[{"question": q} for q in inputs],
        outputs=[{"answer": a} for a in outputs],
        dataset_id=dataset.id,
    )

## RAG Evaluators

### Type 1: Reference Answer

First, let's consider the case in which we want to compare our RAG chain answer to a reference answer.

This is shown on the far right (blue):

![](https://github.com/langchain-ai/langsmith-cookbook/blob/1a46c089ede410384f69dc3e808567406c39e009/testing-examples/rag_eval/langsmith_rag_eval.png?raw=1)

Here is the eval process we will use:

![](https://github.com/langchain-ai/langsmith-cookbook/blob/1a46c089ede410384f69dc3e808567406c39e009/testing-examples/rag_eval/langsmith_rag_story.png?raw=1)

#### Eval flow

We will use a `LangChainStringEvaluator` to compare RAG chain answers to reference (ground truth) answers.

There are many types of `LangChainStringEvaluator` ([see options](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations)).

For comparing questions and answers, I like to use `LLM-as-judge` evaluators:
* `qa`
* `cot_qa`

For example, `cot_qa` uses the eval prompt defined [here](https://smith.langchain.com/hub/langchain-ai/cot_qa).

All `LangChainStringEvaluator` implementations expose a common interface to pass the chain and dataset inputs:

1. `question` from the dataset -> `input`
2. `answer` from the dataset -> `reference`
3. `answer` from the LLM -> `prediction`
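
The mapping above can be sketched as a standalone function. The `run` and `example` objects here are hypothetical `SimpleNamespace` stand-ins that only mimic the `.inputs` / `.outputs` attributes used in this notebook; they are not real LangSmith objects, and `prepare_qa_data` is an illustrative name.

```python
from types import SimpleNamespace


def prepare_qa_data(run, example) -> dict:
    """Map a run and its dataset example onto the evaluator's three inputs."""
    return {
        "prediction": run.outputs["answer"],     # answer from the LLM
        "reference": example.outputs["answer"],  # ground-truth answer from the dataset
        "input": example.inputs["question"],     # question from the dataset
    }


# Hypothetical stand-ins for the run and example objects
run = SimpleNamespace(outputs={"answer": "A summary of risk-factor exposures."})
example = SimpleNamespace(
    inputs={"question": "What is a style profile?"},
    outputs={"answer": "A style profile summarizes a fund's risk-factor exposures."},
)
print(prepare_qa_data(run, example))
```

Writing the mapping as a named function instead of a lambda makes it easier to test in isolation before passing it as `prepare_data`.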

![](https://github.com/langchain-ai/langsmith-cookbook/blob/1a46c089ede410384f69dc3e808567406c39e009/testing-examples/rag_eval/langsmith_rag_flow.png?raw=1)

In [43]:
# RAG chain
def predict_rag_answer(example: dict):
    """Use this for answer evaluation"""
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}

def predict_rag_answer_with_context(example: dict):
    """Use this for evaluation of retrieved documents and hallucinations"""
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"], "contexts": response["contexts"]}

In [49]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Evaluator
qa_evaluator = [
    LangChainStringEvaluator(
        "cot_qa",
        config={"llm": llm},
        prepare_data=lambda run, example: {
            "prediction": run.outputs["answer"],
            "reference": example.outputs["answer"],
            "input": example.inputs["question"],
        },
    )
]
dataset_name = "RAG_test_Finance"
experiment_results = evaluate(
    predict_rag_answer,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="rag-qa-oai",
    metadata={"variant": "Finance context, Gemini Flash"},
)

View the evaluation results for experiment: 'rag-qa-oai-29165536' at:
https://smith.langchain.com/o/e5ffb10d-d7d5-5a52-9eb0-1e8ca168f073/datasets/c077395e-3ec2-4f33-8e26-85cac805d423/compare?selectedSessions=a2eaa497-6e16-4ca3-9da2-12a41fb4adba




0it [00:00, ?it/s]



[Document(metadata={'page': 5, 'source': 'https://s21.q4cdn.com/198919461/files/doc_downloads/press_kits/2016/Morningstar-Sustainable-Investing-Handbook.pdf'}, page_content='The Morningstar Sustainable Investing Handbook5Using the Morningstar Sustainability Rating for funds \nOur Sustainability Rating allows investors to assess how well the companies in a fund’s  \nportfolio are managing their ESG risks and opportunities. The rating will also allow investors to \ncompare funds across categories and relative to benchmarks using specific ESG factors. \nThe ratings can serve as an initial screen for investors interested in sustainability and ESG  \nfactors.  They are also a useful starting point for investors wanting to know more about a manager’s \ninvestment process and how it relates to sustainable investing.\nThe ratings will help investors determine both the level of sustainability in their existing portfolios \nand allow them to set sustainability targets. Some investors, for exampl



### Type 2: Answer Hallucination

Second, let's consider the case in which we want to compare our RAG chain answer to the retrieved documents.

This is shown in red in the top figure.

#### Eval flow

We will use a `LangChainStringEvaluator`, as mentioned above.

There are many types of `LangChainStringEvaluator`.

For comparing documents and answers, a common built-in `LangChainStringEvaluator` option is `Criteria` ([here](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/#using-reference-labels)), because we want to supply custom criteria.

We will use `labeled_score_string` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://smith.langchain.com/hub/wfh/labeled-score-string).

Here, the grading relies on two inputs of the `LangChainStringEvaluator` interface (the question is still passed as `input`, which `labeled_score_string` requires):

1. `contexts` from the LLM chain -> `reference`
2. `answer` from the LLM chain -> `prediction`

![](https://github.com/langchain-ai/langsmith-cookbook/blob/1a46c089ede410384f69dc3e808567406c39e009/testing-examples/rag_eval/langsmith_rag_flow_hallucination.png?raw=1)

In [52]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

answer_hallucination_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": """Is the Assistant's Answer grounded in the Ground Truth documentation? A score of [[1]] means that the
            Assistant answer is not at all based upon / grounded in the Ground Truth documentation. A score of [[5]] means
            that the Assistant answer contains some information (e.g., a hallucination) that is not captured in the Ground Truth
            documentation. A score of [[10]] means that the Assistant answer is fully based upon the Ground Truth documentation.""",
        },
         "llm": llm,
        # If you want the score to be saved on a scale from 0 to 1
        "normalize_by": 10,
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["contexts"],
        "input": example.inputs["question"],
    },
)
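
The criteria above ask the judge to emit a rating such as `[[8]]`, and `normalize_by: 10` rescales it to the 0 to 1 range. The sketch below shows that arithmetic on a hypothetical judge reply. It is an illustration only, not the parser the library uses, and `parse_and_normalize` is a made-up name.

```python
import re
from typing import Optional


def parse_and_normalize(judge_text: str, normalize_by: Optional[float] = None) -> float:
    """Pull the first [[n]] rating out of a judge reply and optionally rescale it."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_text)
    if match is None:
        raise ValueError("no [[score]] found in judge output")
    score = float(match.group(1))
    return score / normalize_by if normalize_by else score


# Hypothetical judge reply
print(parse_and_normalize("Mostly grounded. Rating: [[8]]", normalize_by=10))  # 0.8
```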

In [53]:

experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators=[answer_hallucination_evaluator],
    experiment_prefix="rag-qa-oai-hallucination",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "Finance context, Gemini Flash",
    },
)

View the evaluation results for experiment: 'rag-qa-oai-hallucination-82d1a8a0' at:
https://smith.langchain.com/o/e5ffb10d-d7d5-5a52-9eb0-1e8ca168f073/datasets/c077395e-3ec2-4f33-8e26-85cac805d423/compare?selectedSessions=427749c9-7017-4d12-85fd-24da246c1369




0it [00:00, ?it/s]



[Document(metadata={'page': 10, 'source': 'https://s21.q4cdn.com/198919461/files/doc_downloads/press_kits/2016/Morningstar-Sustainable-Investing-Handbook.pdf'}, page_content='The Morningstar Sustainable Investing Handbook10Key Definitions\n \nPortfolio ESG Score\nThe Portfolio ESG Score is an asset-weighted average of normalized company-level ESG  \nScores for the covered holdings in the portfolio. Company-level ESG Scores from Sustainalytics \nreflect companies’ management systems, practices, policies, and other indicators related to \nenvironmental, social, and governance issues. Their company-level ESG scores use a 0-100  scale. \nA high Portfolio ESG Score is better than a low score. At the portfolio level, high scores indicate \nthat a fund has more of its assets invested in companies that score well according to the \nSustainalytics ESG methodology.\nPortfolio Controversy Score\nThe Portfolio Controversy Score is the asset-weighted average level of the seriousness of the \ncontro

ERROR:langsmith.evaluation._runner:Error running target function: 429 Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: gemini-1.5-flash. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai.
ERROR:langsmith.evaluation._runner:Error running target function: 429 Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: gemini-1.5-flash. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai.
ERROR:langsmith.evaluation._runner:Error running evaluator <DynamicRunEvaluator evaluate> on run 5a34a967-1501-4991-979d-55220030fe9b: KeyError('answer')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/langsmith/evaluation/_runner.py", line 1258, in _run_evaluators
    evaluator_response = evaluator.evaluate_run(
  File
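
The `429 Quota exceeded` errors above are rate-limit failures from Vertex AI. The `KeyError('answer')` that follows is most likely a knock-on effect: when the target function fails, the run has no `answer` output for the evaluator to read. One common mitigation is to retry the call with exponential backoff. The sketch below is generic Python with hypothetical names (`call_with_backoff`, `flaky`); it catches every exception for brevity, whereas real code should catch only the quota error. `evaluate` also accepts a `max_concurrency` argument, which may help keep the request rate under the quota.

```python
import time


def call_with_backoff(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying with exponentially growing delays when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:  # in practice, catch only the quota / rate-limit error
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical flaky function: fails twice, then succeeds
calls = {"n": 0}


def flaky(question):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Quota exceeded")
    return {"answer": f"answer to {question!r}"}


delays = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, "What is a style profile?", sleep=delays.append)
print(result, delays)
```

In this notebook, the same idea could be applied by wrapping the call to `rag_bot.get_answer` inside the target function.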

### Type 3: Document Relevance to Question

Finally, let's consider the case in which we want to compare our RAG chain document retrieval to the question.

This is shown in green in the top figure.

#### Eval flow

We will use a `LangChainStringEvaluator`, as mentioned above.

For comparing documents and questions, a common built-in `LangChainStringEvaluator` option is `Criteria` ([here](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/#using-reference-labels)), because we want to supply custom criteria.

We will use `score_string` as an LLM-as-judge evaluator [(docs)](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations#criteria-evaluators-no-labels), which uses the eval prompt defined [here](https://smith.langchain.com/hub/wfh/score-string).

Here, we only need to use two inputs of the `LangChainStringEvaluator` interface:

1. `question` from the dataset -> `input`
2. `contexts` from the LLM chain -> `prediction`

![](https://github.com/langchain-ai/langsmith-cookbook/blob/1a46c089ede410384f69dc3e808567406c39e009/testing-examples/rag_eval/langsmith_rag_flow_doc_relevance.png?raw=1)

In [54]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate
import textwrap

docs_relevance_evaluator = LangChainStringEvaluator(
    "score_string",
    config={
        "criteria": {
            "document_relevance": textwrap.dedent(
                """The response is a set of documents retrieved from a vectorstore. The input is a question
            used for retrieval. You will score whether the Assistant's response (retrieved docs) is relevant to the Ground Truth
            question. A score of [[1]] means that none of the Assistant's response documents contain information useful in answering or addressing the user's input.
            A score of [[5]] means that the Assistant answer contains some relevant documents that can at least partially answer the user's question or input.
            A score of [[10]] means that the user input can be fully answered using the content in the first retrieved doc(s)."""
            )
        },
         "llm": llm,
        # If you want the score to be saved on a scale from 0 to 1
        "normalize_by": 10,
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["contexts"],
        "input": example.inputs["question"],
    },
)

In [55]:
experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators=[docs_relevance_evaluator],
    experiment_prefix="rag-qa-oai-doc-relevance",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "Finance context, Gemini Flash",
    },
)

View the evaluation results for experiment: 'rag-qa-oai-doc-relevance-1ce90523' at:
https://smith.langchain.com/o/e5ffb10d-d7d5-5a52-9eb0-1e8ca168f073/datasets/c077395e-3ec2-4f33-8e26-85cac805d423/compare?selectedSessions=227d3e2d-8172-41e0-83b1-bc53ac99fd52




0it [00:00, ?it/s]

[Document(metadata={'page': 5, 'source': 'https://s21.q4cdn.com/198919461/files/doc_downloads/press_kits/2016/Morningstar-Sustainable-Investing-Handbook.pdf'}, page_content='The Morningstar Sustainable Investing Handbook5Using the Morningstar Sustainability Rating for funds \nOur Sustainability Rating allows investors to assess how well the companies in a fund’s  \nportfolio are managing their ESG risks and opportunities. The rating will also allow investors to \ncompare funds across categories and relative to benchmarks using specific ESG factors. \nThe ratings can serve as an initial screen for investors interested in sustainability and ESG  \nfactors.  They are also a useful starting point for investors wanting to know more about a manager’s \ninvestment process and how it relates to sustainable investing.\nThe ratings will help investors determine both the level of sustainability in their existing portfolios \nand allow them to set sustainability targets. Some investors, for exampl



[Document(metadata={'page': 10, 'source': 'https://s21.q4cdn.com/198919461/files/doc_downloads/press_kits/2016/Morningstar-Sustainable-Investing-Handbook.pdf'}, page_content='The Morningstar Sustainable Investing Handbook10Key Definitions\n \nPortfolio ESG Score\nThe Portfolio ESG Score is an asset-weighted average of normalized company-level ESG  \nScores for the covered holdings in the portfolio. Company-level ESG Scores from Sustainalytics \nreflect companies’ management systems, practices, policies, and other indicators related to \nenvironmental, social, and governance issues. Their company-level ESG scores use a 0-100  scale. \nA high Portfolio ESG Score is better than a low score. At the portfolio level, high scores indicate \nthat a fund has more of its assets invested in companies that score well according to the \nSustainalytics ESG methodology.\nPortfolio Controversy Score\nThe Portfolio Controversy Score is the asset-weighted average level of the seriousness of the \ncontro

ERROR:langsmith.evaluation._runner:Error running target function: 429 Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: gemini-1.5-flash. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai.
ERROR:langsmith.evaluation._runner:Error running evaluator <DynamicRunEvaluator evaluate> on run 024e9e4f-24da-4fd2-8ce6-f2519f57f7d6: KeyError('contexts')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/langsmith/evaluation/_runner.py", line 1258, in _run_evaluators
    evaluator_response = evaluator.evaluate_run(
  File "/usr/local/lib/python3.10/dist-packages/langsmith/evaluation/evaluator.py", line 278, in evaluate_run
    result = self.func(
  File "/usr/local/lib/python3.10/dist-packages/langsmith/run_helpers.py", line 582, in wrapper
    raise e
  File "/usr/local/lib/python3.10/dist-packages/langsmith/run_helpers.py", line 579,

## Evaluating intermediate traces

What if we didn't explicitly return documents from our RAG chain?

In this case, we can isolate them as intermediate chain values.
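
The idea is that a traced run forms a tree: the target function is the root, and each `@traceable` call inside it becomes a child run. A minimal sketch of walking such a tree by name is shown below. `FakeRun` is a hypothetical stand-in that only mimics the `name`, `outputs`, and `child_runs` attributes; it is not a LangSmith class.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeRun:
    """Hypothetical stand-in for a traced run with nested child runs."""
    name: str
    outputs: dict = field(default_factory=dict)
    child_runs: List["FakeRun"] = field(default_factory=list)


def find_run(root: FakeRun, name: str) -> Optional[FakeRun]:
    """Depth-first search for the first run called `name` under `root`."""
    for child in root.child_runs:
        if child.name == name:
            return child
        found = find_run(child, name)
        if found is not None:
            return found
    return None


# Mirror the nesting used in this notebook: target -> get_answer -> retrieve_docs
retrieve = FakeRun("retrieve_docs", outputs={"output": ["doc A", "doc B"]})
pipeline = FakeRun("get_answer", outputs={"answer": "..."}, child_runs=[retrieve])
root = FakeRun("predict_rag_answer", child_runs=[pipeline])

print(find_run(root, "retrieve_docs").outputs["output"])  # ['doc A', 'doc B']
```

A recursive lookup like this keeps working if more nesting levels are added later, whereas chained `next(...)` calls assume a fixed depth.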

In [58]:
from langsmith.schemas import Example, Run
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field

def document_relevance_grader(root_run: Run, example: Example) -> dict:
    """
    A simple evaluator that checks to see if retrieved documents are relevant to the question
    """

    # Get documents and question
    rag_pipeline_run = next(run for run in root_run.child_runs if run.name == "get_answer")
    retrieve_run = next(run for run in rag_pipeline_run.child_runs if run.name == "retrieve_docs")
    doc_txt = "\n\n".join(doc.page_content for doc in retrieve_run.outputs["output"])
    question = retrieve_run.inputs["question"]

    # Data model for grade
    class GradeDocuments(BaseModel):
        """Binary score for relevance check on retrieved documents."""
        binary_score: int = Field(description="Documents are relevant to the question, 1 or 0")

    # LLM with function call
    llm = ChatVertexAI(model="gemini-1.5-pro-001")
    structured_llm_grader = llm.with_structured_output(GradeDocuments)

    # Prompt
    system = """You are a grader assessing relevance of a retrieved document to a user question. \n
        If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. \n
        It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
        Give a binary score of 1 or 0, where 1 means that the document is relevant to the question."""

    grade_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system),
            ("human", "Retrieved document: \n\n {document} \n\n User question: {question}"),
        ]
    )

    retrieval_grader = grade_prompt | structured_llm_grader
    score = retrieval_grader.invoke({"question": question, "document": doc_txt})
    return {"key": "document_relevance", "score": int(score.binary_score)}

def answer_hallucination_grader(root_run: Run, example: Example) -> dict:
    """
    A simple evaluator that checks to see if the answer is grounded in the documents
    """

    # Get documents and answer
    rag_pipeline_run = next(run for run in root_run.child_runs if run.name == "get_answer")
    retrieve_run = next(run for run in rag_pipeline_run.child_runs if run.name == "retrieve_docs")
    doc_txt = "\n\n".join(doc.page_content for doc in retrieve_run.outputs["output"])
    generation = rag_pipeline_run.outputs["answer"]

    # Data model
    class GradeHallucinations(BaseModel):
        """Binary score for hallucination present in generation answer."""

        binary_score: int = Field(description="Answer is grounded in the facts, 1 or 0")

    # LLM with function call
    llm = ChatVertexAI(model="gemini-1.5-pro-001")
    structured_llm_grader = llm.with_structured_output(GradeHallucinations)

    # Prompt
    system = """You are a grader assessing whether an LLM generation is grounded in / supported by a set of retrieved facts. \n
         Give a binary score 1 or 0, where 1 means that the answer is grounded in / supported by the set of facts."""
    hallucination_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system),
            ("human", "Set of facts: \n\n {documents} \n\n LLM generation: {generation}"),
        ]
    )

    hallucination_grader = hallucination_prompt | structured_llm_grader
    score = hallucination_grader.invoke({"documents": doc_txt, "generation": generation})
    return {"key": "answer_hallucination", "score": int(score.binary_score)}

from langsmith.evaluation import evaluate
experiment_results = evaluate(
    predict_rag_answer,
    data=dataset_name,
    evaluators=[document_relevance_grader,answer_hallucination_grader],
    experiment_prefix= "LCEL context, gpt-3.5-turbo"
)

View the evaluation results for experiment: 'LCEL context, gpt-3.5-turbo-f0f1db92' at:
https://smith.langchain.com/o/e5ffb10d-d7d5-5a52-9eb0-1e8ca168f073/datasets/c077395e-3ec2-4f33-8e26-85cac805d423/compare?selectedSessions=5d6fe835-2ca6-4f72-a3d3-8d6102dd425b




0it [00:00, ?it/s]



[Document(metadata={'page': 2, 'source': 'https://www.morningstar.com/content/dam/marketing/shared/research/methodology/771945_Morningstar_Rating_for_Funds_Methodology.pdf'}, page_content='©2021 Morningstar, Inc. All rights reserved. The information in this document is the property of Morningstar, Inc. Reproduction or transcription by any means, in whole or in part, without the prior written \nconsent of Morningstar, Inc., is prohibited.\n The Morningstar RatingTM for Funds    August 2021 Page 3 of 21\n3\n3\n3bond funds domiciled in Europe against other European high-yield bond funds. For more information \nabout available categories, please contact your local Morningstar office.\nStyle Profiles\nA style profile may be considered a summary of a fund’s risk-factor exposures. Fund categories \ndefine groups of funds whose members are similar enough in their risk-factor exposures that return \ncomparisons between them are useful.\nThe risk factors on which fund categories are based can re

ERROR:langsmith.evaluation._runner:Error running evaluator <DynamicRunEvaluator document_relevance_grader> on run eab568da-606d-4d0d-a9ba-94744d0989d4: ResourceExhausted('Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: gemini-1.5-pro. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai.')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py", line 76, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcErr
