# RAGAS part 1: Retrieval

[RAGAS](https://github.com/explodinggradients/ragas) is a popular framework for evaluating Retrieval Augmented Generation (RAG) applications. Adding these metrics to
[`autoevals`](https://github.com/braintrustdata/autoevals) has been a longstanding request, so I thought I'd take the opportunity to audit and port them, and
share the process openly. Hopefully it serves as a deeper look at evaluating RAG with RAGAS-style analysis, and also as a guide to writing your own evaluators.

We'll use the [Coda Help Desk](https://www.braintrustdata.com/docs/cookbook/CodaHelpDesk) data to benchmark RAGAS evaluators, observe issues, and then tweak and port them
into autoevals. Although this cookbook is implemented in Python, the evaluators are now available in both Python and TypeScript.


In [None]:
%pip install -U autoevals[scipy] braintrust requests openai lancedb markdownify ragas tqdm

## Generating QA pairs

To start, let's generate a series of questions and expected answers. The next few blocks of code are copied from the [Coda Help Desk](https://www.braintrustdata.com/docs/cookbook/CodaHelpDesk) cookbook, and simply download some articles and then generate QA pairs.


In [1]:
QA_GEN_MODEL = "gpt-4-1106-preview"
QA_ANSWER_MODEL = "gpt-3.5-turbo"
QA_GRADING_MODEL = "gpt-4"
RELEVANCE_MODEL = "gpt-3.5-turbo"
RAGAS_MODEL = "gpt-3.5-turbo-16k"

NUM_SECTIONS = 20
NUM_QA_PAIRS = 20  # Increase this number to test at a larger scale

In [2]:
import asyncio
import os
import re
import time

import braintrust
import markdownify
import openai
import requests

import autoevals

data = requests.get(
    "https://gist.githubusercontent.com/wong-codaio/b8ea0e087f800971ca5ec9eef617273e/raw/39f8bd2ebdecee485021e20f2c1d40fd649a4c77/articles.json"
).json()

markdown_docs = [{"id": row["id"], "markdown": markdownify.markdownify(row["body"])} for row in data]

i = 0
markdown_sections = []
for markdown_doc in markdown_docs:
    sections = re.split(r"(.*\n=+\n)", markdown_doc["markdown"])
    current_section = ""
    for section in sections:
        if not section.strip():
            continue

        if re.match(r".*\n=+\n", section):
            current_section = section
        else:
            section = current_section + section
            markdown_sections.append({"doc_id": markdown_doc["id"], "section_id": i, "markdown": section.strip()})
            current_section = ""
            i += 1

print(f"Downloaded {len(markdown_sections)} Markdown sections. Here are the first 3:")
markdown_sections[:3]


Downloaded 988 Markdown sections. Here are the first 3:


[{'doc_id': '8179780',
  'section_id': 0,
  'markdown': "Not all Coda docs are used in the same way. You'll inevitably have a few that you use every week, and some that you'll only use once. This is where starred docs can help you stay organized.\n\n\n\nStarring docs is a great way to mark docs of personal importance. After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**. All starred docs, even from multiple different workspaces, will live in this section.\n\n\n\nStarring docs only saves them to your personal My Shortcuts. It doesn’t affect the view for others in your workspace. If you’re wanting to shortcut docs not just for yourself but also for others in your team or workspace, you’ll [use pinning](https://help.coda.io/en/articles/2865511-starred-pinned-docs) instead."},
 {'doc_id': '8179780',
  'section_id': 1,
 {'doc_id': '8179780',
  'section_id': 2,
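
The splitting logic above relies on `re.split` with a capture group, which keeps the matched headings in the result so each one can be glued back onto the body that follows it. Here is a minimal, self-contained sketch of the same idea on a toy document. The helper name `split_sections` and the toy text are illustrative assumptions, not part of the original cookbook.

```python
import re

# A title line followed by a line made only of '=' characters (setext-style heading).
HEADING = r".*\n=+\n"


def split_sections(markdown: str) -> list[str]:
    """Split Markdown on setext-style headings, keeping each heading with its body."""
    sections = []
    current_heading = ""
    # The capture group makes re.split return the headings as separate items.
    for part in re.split(f"({HEADING})", markdown):
        if not part.strip():
            continue
        if re.match(HEADING, part):
            # Remember the heading until we see the body that belongs to it.
            current_heading = part
        else:
            sections.append((current_heading + part).strip())
            current_heading = ""
    return sections


# Hypothetical toy document: one untitled intro and two titled sections.
toy_doc = "Intro text\n\nTitle One\n=========\nBody one\n\nTitle Two\n=========\nBody two\n"
sections = split_sections(toy_doc)
for section in sections:
    print(repr(section))
```

Text that appears before the first heading becomes its own section, which is why the toy document yields three sections rather than two.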

In [3]:
client = braintrust.wrap_openai(
    openai.AsyncOpenAI(
        base_url="https://braintrustproxy.com/v1",
        default_headers={"x-bt-use-cache": "always"},
        api_key=os.environ.get("OPENAI_API_KEY", "Your OPENAI_API_KEY here"),
    )
)

In [4]:
import json
from typing import List

from pydantic import BaseModel, Field


class QAPair(BaseModel):
    questions: List[str] = Field(
        ..., description="List of questions, all with the same meaning but worded differently"
    )
    answer: str = Field(..., description="Answer")


class QAPairs(BaseModel):
    pairs: List[QAPair] = Field(..., description="List of question/answer pairs")


async def produce_candidate_questions(row):
    response = await client.chat.completions.create(
        model=QA_GEN_MODEL,
        messages=[
            {
                "role": "user",
                "content": f"""\
Please generate 8 question/answer pairs from the following text. For each question, suggest
2 different ways of phrasing the question, and provide a unique answer.

Content:

{row['markdown']}
""",
            }
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "propose_qa_pairs",
                    "description": "Propose some question/answer pairs for a given document",
                    "parameters": QAPairs.schema(),
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "propose_qa_pairs"}},
    )

    try:
        pairs = QAPairs(**json.loads(response.choices[0].message.tool_calls[0].function.arguments))
    except Exception:
        print("Warning: failed to parse an example. Continuing to others")
        return []
    return pairs.pairs
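
Because `produce_candidate_questions` is a coroutine, several sections can be processed concurrently rather than one at a time. The sketch below shows the general pattern with a stand-in coroutine. `fake_generate` and `generate_all` are hypothetical names, and no API calls are made.

```python
import asyncio


async def fake_generate(section_id: int) -> list[str]:
    """Stand-in for an LLM call: waits briefly, then returns one fake question."""
    await asyncio.sleep(0.01)
    return [f"question about section {section_id}"]


async def generate_all(section_ids: list[int]) -> list[list[str]]:
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_generate(s) for s in section_ids))


# In a script, asyncio.run() starts the event loop.
# In a notebook, where a loop is already running, use `await generate_all([0, 1, 2])` instead.
results = asyncio.run(generate_all([0, 1, 2]))
print(results)
```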

In [5]:
all_candidates_tasks = [asyncio.create_task(produce_candidate_questions(a)) for a in markdown_sections[:NUM_SECTIONS]]
all_candidates = [await f for f in all_candidates_tasks]

data = []
row_id = 0
for row, doc_qa in zip(markdown_sections[:NUM_SECTIONS], all_candidates):
    for i, qa in enumerate(doc_qa):
        for j, q in enumerate(qa.questions):
            data.append(
                {
                    "input": q,
                    "expected": qa.answer,
                    "metadata": {
                        "document_id": row["doc_id"],
                        "section_id": row["section_id"],
                        "question_idx": i,
                        "answer_idx": j,
                        "id": row_id,
                        "split": "test" if j == len(qa.questions) - 1 and j > 0 else "train",
                    },
                }
            )
            row_id += 1

print(f"Generated {len(data)} QA pairs. Here are the first 10...")
for x in data[:10]:
    print(x)

Generated 270 QA pairs. Here are the first 10...
{'input': 'What is the purpose of starring documents in Coda?', 'expected': 'Starring docs in Coda helps to mark documents of personal importance and organizes them in a section called My Shortcuts.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 0, 'answer_idx': 0, 'id': 0, 'split': 'train'}}
{'input': 'How can starring docs in Coda help you?', 'expected': 'Starring docs in Coda helps to mark documents of personal importance and organizes them in a section called My Shortcuts.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 0, 'answer_idx': 1, 'id': 1, 'split': 'test'}}
{'input': 'What happens when you star a doc in Coda?', 'expected': 'After you star a doc in Coda, it will appear in a section on your doc list called My Shortcuts.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 1, 'answer_idx': 0, 'id': 2, 'split': 'train'}}
{'input': 'Where do starred docs go
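
The `split` field above sends the last phrasing of each question to the test set, but only when there is more than one phrasing, so a question with a single phrasing stays in train. A small sketch of that rule in isolation (the helper name `assign_splits` is an assumption for illustration):

```python
def assign_splits(num_phrasings: int) -> list[str]:
    """Label each phrasing of one question as 'train' or 'test'.

    Mirrors the rule used above: the last phrasing is 'test',
    unless it is also the first (i.e. there is only one phrasing).
    """
    return [
        "test" if j == num_phrasings - 1 and j > 0 else "train"
        for j in range(num_phrasings)
    ]


print(assign_splits(1))
print(assign_splits(2))
print(assign_splits(3))
```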

Finally, we'll format the data into a `Dataset`, so that it can be consumed by the `ragas` library.


In [6]:
from datasets import Dataset

ragas_data_list = [
    {
        "question": a["input"],
        "answer": a["expected"],
        "ground_truth": a["expected"],
        "contexts": [markdown_sections[a["metadata"]["section_id"]]["markdown"]],
    }
    for a in data[:NUM_QA_PAIRS]
]

ragas_ds = Dataset.from_list(ragas_data_list)

## Baselining the RAGAS retrieval metrics

RAGAS splits metrics into two buckets: generation and retrieval.

![RAGAS framework](https://docs.ragas.io/en/stable/_static/imgs/component-wise-metrics.png)

We'll start by working through retrieval metrics:

- `context_precision`
- `context_relevancy`
- `context_recall`
- `context_entity_recall`


In [7]:
from ragas import evaluate
from ragas.metrics import (
    context_entity_recall,
    context_precision,
    context_recall,
    context_relevancy,
)

score = evaluate(
    ragas_ds,
    metrics=[
        context_precision,
        context_recall,
        context_entity_recall,
        context_relevancy,
    ],
)
score_df = score.to_pandas()
score_df

Evaluating: 100%|██████████| 80/80 [00:08<00:00,  9.20it/s]


| | question | answer | ground_truth | contexts | context_precision | context_recall | context_entity_recall | context_relevancy |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of starring documents in C... | Starring docs in Coda helps to mark documents ... | Starring docs in Coda helps to mark documents ... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.333333 | 0.555556 |
| 1 | How can starring docs in Coda help you? | Starring docs in Coda helps to mark documents ... | Starring docs in Coda helps to mark documents ... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.333333 | 0.555556 |
| 2 | What happens when you star a doc in Coda? | After you star a doc in Coda, it will appear i... | After you star a doc in Coda, it will appear i... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.5 | 0.555556 |
| 3 | Where do starred docs go after you star them i... | After you star a doc in Coda, it will appear i... | After you star a doc in Coda, it will appear i... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.5 | 0.222222 |
| 4 | Can starred docs from different workspaces be ... | Yes, all starred docs, even from multiple diff... | Yes, all starred docs, even from multiple diff... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.0 | 0.111111 |
| 5 | Is it possible to find starred docs from vario... | Yes, all starred docs, even from multiple diff... | Yes, all starred docs, even from multiple diff... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.0 | 0.111111 |
| 6 | Does starring a doc in Coda affect other users... | No, starring docs only saves them to your pers... | No, starring docs only saves them to your pers... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.0 | 0.222222 |
| 7 | Will other users in your workspace be impacted... | No, starring docs only saves them to your pers... | No, starring docs only saves them to your pers... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.0 | 0.222222 |
| 8 | What should you use if you want to shortcut a ... | If you want to shortcut docs for the whole tea... | If you want to shortcut docs for the whole tea... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.25 | 0.111111 |
| 9 | How do you create shortcuts for docs that are ... | If you want to shortcut docs for the whole tea... | If you want to shortcut docs for the whole tea... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.25 | 0.333333 |


In [8]:
for key in ["context_precision", "context_recall", "context_entity_recall", "context_relevancy"]:
    print(f"{key}: {score[key].mean()}")

context_precision: 0.9999999998999997
context_recall: 0.9625
context_entity_recall: 0.3316666648019445
context_relevancy: 0.3230158730158731


## Improving the implementation

Interesting: `context_precision` and `context_recall` are high, but `context_entity_recall` and `context_relevancy` are low. Let's dig into these with example `4`, which scores low on both.


In [7]:
example = ragas_data_list[4]
example

{'question': 'Can starred docs from different workspaces be accessed in one place?',
 'answer': 'Yes, all starred docs, even from multiple different workspaces, will live in the My Shortcuts section.',
 'ground_truth': 'Yes, all starred docs, even from multiple different workspaces, will live in the My Shortcuts section.',
 'contexts': ["Not all Coda docs are used in the same way. You'll inevitably have a few that you use every week, and some that you'll only use once. This is where starred docs can help you stay organized.\n\n\n\nStarring docs is a great way to mark docs of personal importance. After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**. All starred docs, even from multiple different workspaces, will live in this section.\n\n\n\nStarring docs only saves them to your personal My Shortcuts. It doesn’t affect the view for others in your workspace. If you’re wanting to shortcut docs not just for yourself but also f

### Context Entity Recall

Let's grab the prompt from `context_entity_recall`, which extracts entities from the ground truth answer and the contexts, and see what's going on.


In [8]:
from ragas.metrics._context_entities_recall import TEXT_ENTITY_EXTRACTION as CONTEXT_ENTITIES_RECALL_TEMPLATE

prompt = CONTEXT_ENTITIES_RECALL_TEMPLATE.format(text="\n".join(example["contexts"]))
print(prompt.prompt_str)

Given a text, extract unique entities without repetition. Ensure you consider different forms or mentions of the same entity as a single entity.

The output should be a well-formatted JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output JSON schema:
```
{"type": "object", "properties": {"entities": {"title": "Entities", "type": "array", "items": {"type": "string"}}}, "required": ["entities"]}
```

Do not return any preamble or explanations, return only a pure JSON string surrounded by triple backticks (```).

Examples:

text: "The Eiffel Tower, located in Paris, France, is one of the most iconic landmarks globally.\n            Millions of vi

In [9]:
resp = await client.chat.completions.create(
    model=RAGAS_MODEL,
    messages=[{"role": "user", "content": prompt.prompt_str}],
)
print(resp.choices[0].message.content)

```{"entities": ["Coda docs", "My Shortcuts", "workspaces", "section", "starred docs", "view", "others", "workspace", "team", "pinning"]}```


In [10]:
print(
    (
        await client.chat.completions.create(
            model=RAGAS_MODEL,
            messages=[
                {
                    "role": "user",
                    "content": CONTEXT_ENTITIES_RECALL_TEMPLATE.format(text=example["ground_truth"]).prompt_str,
                }
            ],
        )
    )
    .choices[0]
    .message.content
)

```{"entities": ["starred docs", "multiple different workspaces", "My Shortcuts section"]}```


Interesting, only the string `"starred docs"` is in both sets, but `multiple different workspaces` and `My Shortcuts section` appear to be covered as well. Let's see if the list comparison in `autoevals` returns
a better result.
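
To make the problem concrete, here is a rough sketch of what an exact-match comparison gives for the two responses printed above. It assumes recall is computed as the share of ground-truth entities that also appear verbatim among the context entities. The helper names `parse_entities` and `exact_entity_recall` are illustrative and are not taken from `ragas`.

```python
import json

FENCE = "`" * 3  # the prompt asks for JSON wrapped in triple backticks


def parse_entities(raw: str) -> list[str]:
    """Strip the surrounding backticks and decode the JSON payload."""
    return json.loads(raw.strip().strip("`"))["entities"]


def exact_entity_recall(context_entities: list[str], expected_entities: list[str]) -> float:
    """Fraction of expected entities found verbatim among the context entities."""
    expected = set(expected_entities)
    if not expected:
        return 0.0
    return len(expected & set(context_entities)) / len(expected)


# The two raw model responses shown above.
context_raw = (
    FENCE
    + '{"entities": ["Coda docs", "My Shortcuts", "workspaces", "section", "starred docs", '
    + '"view", "others", "workspace", "team", "pinning"]}'
    + FENCE
)
expected_raw = (
    FENCE
    + '{"entities": ["starred docs", "multiple different workspaces", "My Shortcuts section"]}'
    + FENCE
)

recall = exact_entity_recall(parse_entities(context_raw), parse_entities(expected_raw))
print(recall)  # one of the three expected entities matches exactly
```

Exact matching gives no credit for near misses such as `"My Shortcuts"` versus `"My Shortcuts section"`, which is the motivation for a fuzzier comparison.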

#### Using a better list overlap


In [11]:
from braintrust_core.score import Score, Scorer

from autoevals import Levenshtein


class ListOverlap(Scorer):
    def __init__(self, pairwise_scorer=None, **kwargs):
        self.pairwise_scorer = pairwise_scorer or Levenshtein()

    async def _run_eval_async(self, output, expected=None, **kwargs):
        if expected is None:
            raise ValueError("ListOverlap requires an expected value")

        distances_futures = [
            [self.pairwise_scorer._run_eval_async(output_item, expected_item) for expected_item in expected]
            for output_item in output
        ]

        distances = [
            [(await distance_future).score for distance_future in distance_futures]
            for distance_futures in distances_futures
        ]

        return self._compute_scores(output, expected, distances, **kwargs)

    def _run_eval_sync(self, output, expected=None, **kwargs):
        if expected is None:
            raise ValueError("ListOverlap requires an expected value")

        distances = [
            [self.pairwise_scorer._run_eval_sync(output_item, expected_item).score for expected_item in expected]
            for output_item in output
        ]

        return self._compute_scores(output, expected, distances, **kwargs)

    def _compute_scores(self, rows, columns, distances, **kwargs):
        import numpy as np
        from scipy.optimize import linear_sum_assignment

        # Pairwise scores are similarities (higher is better); convert them to costs for the assignment solver
        distances = 1 - np.array(distances)
        row_ind, col_ind = linear_sum_assignment(distances)

        pairs = [(rows[r], columns[c], 1 - distances[r][c]) for (r, c) in zip(row_ind, col_ind)]
        lowest_distances = distances[row_ind, col_ind]

        # The score is 1 minus the average distance of the matched pairs
        avg_lowest_distance = lowest_distances.mean()

        return Score(
            name=self._name(),
            score=1 - avg_lowest_distance,
            metadata={"pairs": pairs, "lowest_distances": lowest_distances.tolist()},
        )

In [12]:
from pprint import pprint

from autoevals import EmbeddingSimilarity

pprint(
    await ListOverlap(pairwise_scorer=EmbeddingSimilarity(), allow_extra_entities=True).eval_async(
        output=[
            "Coda docs",
            "My Shortcuts",
            "workspaces",
            "section",
            "starred docs",
            "view",
            "others",
            "workspace",
            "team",
            "pinning",
        ],
        expected=["starred docs", "multiple different workspaces", "My Shortcuts section"],
    )
)

Score(name='ListOverlap',
      score=0.8800621780550074,
      metadata={'lowest_distances': [0.15267817386741356,
                                     0.2071352919675643,
                                     0.0],
                'pairs': [('My Shortcuts',
                           'My Shortcuts section',
                           0.8473218261325864),
                          ('workspaces',
                           'multiple different workspaces',
                           0.7928647080324357),
                          ('starred docs', 'starred docs', 1.0)]},
      error=None)
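
`linear_sum_assignment` finds the one-to-one matching between rows and columns with the lowest total cost, which is why each expected entity above is paired with a distinct output entity. The brute-force sketch below does the same thing on a tiny, made-up similarity matrix using only the standard library. The function name `best_assignment` and the numbers are assumptions for illustration; brute force is only practical for very small inputs.

```python
from itertools import permutations


def best_assignment(similarity: list[list[float]]) -> tuple[list[tuple[int, int]], float]:
    """Return (pairs, mean similarity) for the one-to-one matching with the highest total.

    similarity[r][c] compares output item r with expected item c.
    Assumes there are at least as many rows as columns.
    """
    n_rows, n_cols = len(similarity), len(similarity[0])
    best_rows, best_total = None, float("-inf")
    # Try every way of giving each column its own distinct row.
    for rows in permutations(range(n_rows), n_cols):
        total = sum(similarity[r][c] for c, r in enumerate(rows))
        if total > best_total:
            best_rows, best_total = rows, total
    pairs = [(r, c) for c, r in enumerate(best_rows)]
    return pairs, best_total / n_cols


# Made-up similarities: 3 output items (rows) vs. 2 expected items (columns).
similarity = [
    [0.9, 0.2],
    [0.8, 0.7],
    [0.1, 0.6],
]
pairs, score = best_assignment(similarity)
print(pairs, score)
```

Row 2 is left unmatched here, just as the extra context entities are left unmatched in the example above.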


### Context Relevancy

Now let's look at context relevancy. We'll start by examining the prompt used to extract relevant sentences.


In [13]:
from ragas.metrics._context_relevancy import CONTEXT_RELEVANCE, sent_tokenize

prompt = CONTEXT_RELEVANCE.format(question=example["question"], context="\n".join(example["contexts"]))
print(prompt.prompt_str)

Please extract relevant sentences from the provided context that is absolutely required answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information".  While extracting candidate sentences you're not allowed to make any changes to sentences from given context.

Your actual task:

question: Can starred docs from different workspaces be accessed in one place?
context: Not all Coda docs are used in the same way. You'll inevitably have a few that you use every week, and some that you'll only use once. This is where starred docs can help you stay organized.



Starring docs is a great way to mark docs of personal importance. After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**. All starred docs, even from multiple different workspaces, will live in this section.



Starring docs only saves them to your pe

In [14]:
resp = await client.chat.completions.create(
    model=RAGAS_MODEL,
    messages=[{"role": "user", "content": prompt.prompt_str}],
)

print(resp.choices[0].message.content)

All starred docs, even from multiple different workspaces, will live in this section.


In [15]:
# As a refresher, let's remember the answer
print(example["answer"])

Yes, all starred docs, even from multiple different workspaces, will live in the My Shortcuts section.
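
For intuition about the baseline number, the sketch below assumes the score is the number of extracted sentences divided by the number of sentences in the context. That assumption is consistent with the 0.111111 reported for this example, where one sentence was extracted from a context of nine. The regex-based splitter, the helper names, and the toy context are all illustrative rather than taken from `ragas`.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_count_relevancy(extracted: list[str], context: str) -> float:
    """Share of context sentences that were extracted as relevant."""
    total = len(split_sentences(context))
    return len(extracted) / total if total else 0.0


# Toy context with three sentences, one of which is judged relevant.
context = "Starred docs live in My Shortcuts. Pinning shares docs with a team. Folders group docs."
extracted = ["Starred docs live in My Shortcuts."]
score = sentence_count_relevancy(extracted, context)
print(score)
```

Under this kind of scoring, a single missed sentence moves the result a lot when the context is short.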


#### Adding chain of thought + function calling

Interesting, it appears that at least the previous sentence ("After you star a doc...My Shortcuts") is also needed to produce the final answer. Let's see if we can improve
this metric by asking for Chain of Thought.


In [16]:
class RelevantSentence(BaseModel):
    sentence: str = Field(..., description="The selected sentence")
    reasons: List[str] = Field(
        ..., description="Reasons why the sentence is relevant. Explain your thinking step by step."
    )


class RelevantSentences(BaseModel):
    sentences: List[RelevantSentence] = Field(..., description="List of referenced sentences")


response = await client.chat.completions.create(
    model=QA_GEN_MODEL,
    messages=[{"role": "user", "content": prompt.prompt_str}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "extract_sentences",
                "description": "Extract relevant sentences from a given context",
                "parameters": RelevantSentences.schema(),
            },
        }
    ],
    tool_choice={"type": "function", "function": {"name": "extract_sentences"}},
)

try:
    sentences = RelevantSentences(**json.loads(response.choices[0].message.tool_calls[0].function.arguments))
except Exception:
    print("Failed to parse. Skipping:")
    print(response.choices[0].message.tool_calls[0].function.arguments)
else:
    # Only print the parsed result if parsing succeeded
    pprint(sentences.sentences)

[RelevantSentence(sentence='After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**.', reasons=['This sentence directly answers the question by indicating that starred documents are accessible from a designated section, implying that they can be accessed in one place.']),
 RelevantSentence(sentence='All starred docs, even from multiple different workspaces, will live in this section.', reasons=['This sentence explicitly states that starred documents from different workspaces are available in one section, confirming that they can be accessed in one place.'])]


## Porting to Autoevals

Much better. Let's create an auto evaluator for each of these metrics, and then run an `Eval()` in Braintrust.


In [17]:
import chevron

from autoevals.list import ListContains
from autoevals.llm import OpenAIScorer
from autoevals.oai import arun_cached_request, run_cached_request
from autoevals.string import EmbeddingSimilarity

ENTITY_PROMPT = """Given a text, extract unique entities without repetition. Ensure you consider different forms or mentions of the same entity as a single entity.

The output should be a well-formatted JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output JSON schema:
```
{"type": "object", "properties": {"entities": {"title": "Entities", "type": "array", "items": {"type": "string"}}}, "required": ["entities"]}
```

Do not return any preamble or explanations, return only a pure JSON string surrounded by triple backticks (```).

Examples:

text: "The Eiffel Tower, located in Paris, France, is one of the most iconic landmarks globally.\n            Millions of visitors are attracted to it each year for its breathtaking views of the city.\n            Completed in 1889, it was constructed in time for the 1889 World's Fair."
output: ```{"entities": ["Eiffel Tower", "Paris", "France", "1889", "World's Fair"]}```

text: "The Colosseum in Rome, also known as the Flavian Amphitheatre, stands as a monument to Roman architectural and engineering achievement.\n            Construction began under Emperor Vespasian in AD 70 and was completed by his son Titus in AD 80.\n            It could hold between 50,000 and 80,000 spectators who watched gladiatorial contests and public spectacles."
output: ```{"entities": ["Colosseum", "Rome", "Flavian Amphitheatre", "Vespasian", "AD 70", "Titus", "AD 80"]}```

text: "The Great Wall of China, stretching over 21,196 kilometers from east to west, is a marvel of ancient defensive architecture.\n            Built to protect against invasions from the north, its construction started as early as the 7th century BC.\n            Today, it is a UNESCO World Heritage Site and a major tourist attraction."
output: ```{"entities": ["Great Wall of China", "21,196 kilometers", "7th century BC", "UNESCO World Heritage Site"]}```

Your actual task:

text: {{text}}
output: """

# Unfortunately we can't use Pydantic in autoevals because of back-compat issues

ENTITY_SCHEMA = {
    "type": "object",
    "properties": {"entities": {"title": "Entities", "type": "array", "items": {"type": "string"}}},
    "required": ["entities"],
}


def extract_entities_request(text, **extra_args):
    return dict(
        messages=[{"role": "user", "content": chevron.render(ENTITY_PROMPT, {"text": text})}],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "extract_entities",
                    "description": "Extract unique entities from a given text",
                    "parameters": ENTITY_SCHEMA,
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "extract_entities"}},
        **extra_args,
    )


async def aextract_entities(text, **extra_args):
    response = await arun_cached_request(**extract_entities_request(text=text, **extra_args))
    return json.loads(response["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"])


def extract_entities(text, **extra_args):
    response = run_cached_request(**extract_entities_request(text=text, **extra_args))
    return json.loads(response["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"])


class ContextEntityRecall(OpenAIScorer):
    def __init__(self, pairwise_scorer=None, model="gpt-3.5-turbo-16k", **kwargs):
        super().__init__(**kwargs)

        self.extraction_model = model
        self.contains_scorer = ListContains(
            pairwise_scorer=pairwise_scorer or EmbeddingSimilarity(), allow_extra_entities=True
        )

    async def _run_eval_async(self, output, expected=None, context=None, **kwargs):
        if expected is None:
            raise ValueError("ContextEntityRecall requires an expected value")
        if context is None:
            raise ValueError("ContextEntityRecall requires a context value")

        context = "\n".join(context) if isinstance(context, list) else context

        expected_entities = [
            e
            for e in (await aextract_entities(text=expected, model=self.extraction_model, **self.extra_args))[
                "entities"
            ]
        ]
        context_entities = [
            e
            for e in (await aextract_entities(text=context, model=self.extraction_model, **self.extra_args))[
                "entities"
            ]
        ]

        score = await self.contains_scorer.eval_async(output=context_entities, expected=expected_entities)

        return Score(
            name=self._name(),
            score=score.score,
            metadata={"context_entities": context_entities, "expected_entities": expected_entities},
        )

    def _run_eval_sync(self, output, expected=None, context=None, **kwargs):
        if expected is None:
            raise ValueError("ContextEntityRecall requires an expected value")
        if context is None:
            raise ValueError("ContextEntityRecall requires a context value")

        context = "\n".join(context) if isinstance(context, list) else context

        expected_entities = [
            e for e in (extract_entities(text=expected, model=self.extraction_model, **self.extra_args))["entities"]
        ]
        context_entities = [
            e for e in (extract_entities(text=context, model=self.extraction_model, **self.extra_args))["entities"]
        ]

        score = self.contains_scorer.eval(output=context_entities, expected=expected_entities)

        return Score(
            name=self._name(),
            score=score.score,
            metadata={"context_entities": context_entities, "expected_entities": expected_entities},
        )


pprint(
    await ContextEntityRecall().eval_async(
        output=example["answer"], expected=example["ground_truth"], context=example["contexts"]
    )
)

Score(name='ContextEntityRecall',
      score=0.6952517120038902,
      metadata={'context_entities': ['Coda docs',
                                     'My Shortcuts',
                                     'workspaces',
                                     'pinning'],
                'expected_entities': ['starred docs',
                                      'multiple different workspaces',
                                      'My Shortcuts section']},
      error=None)


In [18]:
# Unfortunately we cannot use pydantic in autoevals due to back-compat issues
print(RelevantSentences.schema())

{'$defs': {'RelevantSentence': {'properties': {'sentence': {'description': 'The selected sentence', 'title': 'Sentence', 'type': 'string'}, 'reasons': {'description': 'Reasons why the sentence is relevant. Explain your thinking step by step.', 'items': {'type': 'string'}, 'title': 'Reasons', 'type': 'array'}}, 'required': ['sentence', 'reasons'], 'title': 'RelevantSentence', 'type': 'object'}}, 'properties': {'sentences': {'description': 'List of referenced sentences', 'items': {'$ref': '#/$defs/RelevantSentence'}, 'title': 'Sentences', 'type': 'array'}}, 'required': ['sentences'], 'title': 'RelevantSentences', 'type': 'object'}


In [19]:
# Tweaked to return an empty array instead of "Insufficient information".
SENTENCE_PROMPT = """Please extract relevant sentences from the provided context that is absolutely required answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return an empty array.  While extracting candidate sentences you're not allowed to make any changes to sentences from given context.

Your actual task:

question: {{question}}
context: {{context}}
candidate sentences: """

SENTENCE_SCHEMA = {
    "$defs": {
        "RelevantSentence": {
            "properties": {
                "sentence": {"description": "The selected sentence", "title": "Sentence", "type": "string"},
                "reasons": {
                    "description": "Reasons why the sentence is relevant. Explain your thinking step by step.",
                    "items": {"type": "string"},
                    "title": "Reasons",
                    "type": "array",
                },
            },
            "required": ["sentence", "reasons"],
            "title": "RelevantSentence",
            "type": "object",
        }
    },
    "properties": {
        "sentences": {
            "description": "List of referenced sentences",
            "items": {"$ref": "#/$defs/RelevantSentence"},
            "title": "Sentences",
            "type": "array",
        }
    },
    "required": ["sentences"],
    "title": "RelevantSentences",
    "type": "object",
}


def extract_sentences_request(question, context, **extra_args):
    return dict(
        messages=[
            {"role": "user", "content": chevron.render(SENTENCE_PROMPT, {"question": question, "context": context})}
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "extract_sentences",
                    "description": "Extract relevant sentences from a given context",
                    "parameters": SENTENCE_SCHEMA,
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "extract_sentences"}},
        **extra_args,
    )


class ContextRelevancy(OpenAIScorer):
    """Scores the fraction of the retrieved context (by character count) that is needed to answer the question."""

    def __init__(self, model="gpt-3.5-turbo-16k", **kwargs):
        super().__init__(**kwargs)

        self.model = model

    async def _run_eval_async(self, output, expected=None, input=None, context=None, **kwargs):
        if input is None:
            raise ValueError("ContextRelevancy requires an input value")
        if context is None:
            raise ValueError("ContextRelevancy requires a context value")

        if isinstance(context, list):
            context = "\n".join(context)

        response = await arun_cached_request(
            **extract_sentences_request(question=input, context=context, model=self.model, **self.extra_args)
        )
        sentences = json.loads(response["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"])

        return Score(
            name=self._name(),
            score=len("".join([s["sentence"] for s in sentences["sentences"]])) / len(context),
            metadata={
                "relevant_sentences": sentences["sentences"],
            },
        )

    def _run_eval_sync(self, output, expected=None, input=None, context=None, **kwargs):
        if input is None:
            raise ValueError("ContextRelevancy requires an input value")
        if context is None:
            raise ValueError("ContextRelevancy requires a context value")

        if isinstance(context, list):
            context = "\n".join(context)

        response = run_cached_request(
            **extract_sentences_request(question=input, context=context, model=self.model, **self.extra_args)
        )
        sentences = json.loads(response["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"])

        return Score(
            name=self._name(),
            score=len("".join([s["sentence"] for s in sentences["sentences"]])) / len(context),
            metadata={
                "relevant_sentences": sentences["sentences"],
            },
        )


pprint(
    await ContextRelevancy().eval_async(
        input=example["question"],
        output=example["answer"],
        expected=example["ground_truth"],
        context=example["contexts"],
    )
)

Score(name='ContextRelevancy',
      score=0.7423076923076923,
      metadata={'relevant_sentences': [{'reasons': [],
                                        'sentence': 'Starring docs is a great '
                                                    'way to mark docs of '
                                                    'personal importance.'},
                                       {'reasons': [],
                                        'sentence': 'After you star a doc, it '
                                                    'will live in a section on '
                                                    'your doc list called '
                                                    '**[My '
                                                    'Shortcuts](https://coda.io/shortcuts)**.'},
                                       {'reasons': [],
                                        'sentence': 'All starred docs, even '
                                                    'from multiple 
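
To make the score above concrete: `ContextRelevancy` divides the total character count of the extracted sentences by the character count of the joined context. The following is a minimal standalone sketch of that arithmetic, not the scorer itself. `char_ratio_score` is a hypothetical helper name, the sample context is made up for illustration, and the empty-context guard is an added assumption (the class above would raise a `ZeroDivisionError` in that case).

```python
def char_ratio_score(relevant_sentences, context):
    """Fraction of the context's characters covered by the extracted sentences."""
    # Mirror the scorer: a list of context chunks is joined with newlines first.
    if isinstance(context, list):
        context = "\n".join(context)
    # Assumption (not in the scorer above): an empty context scores 0 instead of
    # dividing by zero.
    if not context:
        return 0.0
    extracted = "".join(s["sentence"] for s in relevant_sentences)
    return len(extracted) / len(context)


# Hypothetical two-chunk context; only the first chunk is needed for the answer.
contexts = [
    "Starring docs is a great way to mark docs of personal importance.",
    "This second chunk is unrelated filler text.",
]
relevant = [{"sentence": contexts[0], "reasons": []}]
print(char_ratio_score(relevant, contexts))
```

Because the ratio is computed over characters rather than sentences, one long irrelevant chunk lowers the score more than several short ones, which is worth keeping in mind when comparing retrieval strategies.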

## Running an Eval

Now that we have `Scorer`s for each metric, we can easily run an `Eval()` in Braintrust.


In [20]:
from braintrust import Eval


async def context_entities(input, output, expected, metadata):
    return await ContextEntityRecall().eval_async(output=output, expected=expected, context=metadata["contexts"])


async def context_relevancy(input, output, expected, metadata):
    # `input` is the dict built in `data` below, so pass only the question text to the scorer
    return await ContextRelevancy().eval_async(
        input=input["question"], output=output, expected=expected, context=metadata["contexts"]
    )


await Eval(
    name="Ragas Retrieval",
    data=[
        {
            "input": {"question": x["question"], "ground_truth": x["answer"]},
            "expected": x["answer"],
            "metadata": {"contexts": x["contexts"]},
        }
        for x in ragas_data_list
    ],
    # The task simply echoes the ground truth, since only the retrieval scorers are being evaluated
    task=lambda input: input["ground_truth"],
    scores=[context_entities, context_relevancy],
)

Experiment ragas-1712519132 is running at http://localhost:3000/app/braintrustdata.com/p/Ragas%20Retrieval/experiments/ragas-1712519132
Ragas Retrieval (data): 20it [00:00, 36986.81it/s]
Ragas Retrieval (tasks): 100%|██████████| 20/20 [00:08<00:00,  2.42it/s]


See results for ragas-1712519132 at http://localhost:3000/app/braintrustdata.com/p/Ragas%20Retrieval/experiments/ragas-1712519132







Excellent, the new scores look much higher, especially `ContextEntityRecall`, which we'd expect to be close to `1` for this use case.

![scores](./assets/new_scores.png)


### Further improvements

We can also dig into individual examples and see exactly which entities were extracted and which of them overlap. Some examples are clearly broken: in the one shown here, the model extracted "Eiffel Tower" as an entity even though that phrase does not appear in the text.

![overlapping entities](./assets/extracted_entities.png)
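
One cheap way to catch this kind of failure programmatically is to check whether each extracted entity actually occurs in the text it was supposedly extracted from. The sketch below is an illustration under stated assumptions, not part of RAGAS or `autoevals`: `split_grounded_entities` is a hypothetical helper, the entity list is made up, and a case-insensitive substring match is a deliberately crude test that will also flag legitimate paraphrased or normalized entities.

```python
def split_grounded_entities(entities, text):
    """Partition entities into those found in `text` and those that are not.

    Matching is a case-insensitive substring test, which is an assumption made
    for simplicity; it does not handle paraphrases or alternate spellings.
    """
    haystack = text.casefold()
    grounded, ungrounded = [], []
    for entity in entities:
        if entity.casefold() in haystack:
            grounded.append(entity)
        else:
            ungrounded.append(entity)
    return grounded, ungrounded


# Hypothetical extraction result for a sentence from the help desk articles.
text = "Starring docs is a great way to mark docs of personal importance."
entities = ["Starring docs", "Eiffel Tower"]

grounded, ungrounded = split_grounded_entities(entities, text)
print("grounded:", grounded)
print("ungrounded:", ungrounded)
```

Entities that land in the second list are good candidates for manual review, or for attaching to the score's `metadata` so they are visible next to each example.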


Braintrust makes it easy to debug these kinds of issues. For example, we can open the exact prompt that ran and check whether GPT-4 returns better results.

![Try fix](./assets/try_fix.gif)
