# RAGAS part 1: Retrieval

[RAGAS](https://github.com/explodinggradients/ragas) is a popular framework for evaluating Retrieval Augmented Generation (RAG) applications. Adding these metrics to
[`autoevals`](https://github.com/braintrustdata/autoevals) has been a longstanding request, so I took the opportunity to audit and port them, and
share the process openly. Hopefully it serves as a deeper window into evaluating RAG with RAGAS-style analysis, and also as a guide to writing your own evaluators.

We'll use the [Coda Help Desk](https://www.braintrustdata.com/docs/cookbook/CodaHelpDesk) data to benchmark RAGAS evaluators, observe issues, and then tweak and port them
into autoevals. Although this cookbook is implemented in Python, the evaluators are now available in both Python and TypeScript.


In [None]:
%pip install -U autoevals[scipy] braintrust requests openai lancedb markdownify ragas tqdm

In [2]:
import json
from pprint import pprint

from datasets import Dataset

with open("data.json", "r") as f:
    ragas_data_list = json.load(f)

ragas_ds = Dataset.from_list(ragas_data_list)
print("Question:", ragas_ds[0]["question"])
print("Answer:", ragas_ds[0]["answer"])

Question: What is the purpose of starring documents in Coda?
Answer: Starring docs in Coda helps to mark documents of personal importance and organizes them in a section called My Shortcuts.




## Baselining the RAGAS retrieval metrics

RAGAS splits metrics into two buckets: generation and retrieval.

![RAGAS framework](https://docs.ragas.io/en/stable/_static/imgs/component-wise-metrics.png)

We'll start by working through retrieval metrics:

- `context_precision`
- `context_relevancy`
- `context_recall`
- `context_entity_recall`


In [4]:
from ragas import evaluate
from ragas.metrics import (
    context_entity_recall,
    context_precision,
    context_recall,
    context_relevancy,
)

score = evaluate(
    ragas_ds,
    metrics=[
        context_precision,
        context_recall,
        context_entity_recall,
        context_relevancy,
    ],
)
score_df = score.to_pandas()
score_df.head(5)

Evaluating: 100%|██████████| 80/80 [00:06<00:00, 11.97it/s]


| | question | answer | ground_truth | contexts | context_precision | context_recall | context_entity_recall | context_relevancy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | What is the purpose of starring documents in C... | Starring docs in Coda helps to mark documents ... | Starring docs in Coda helps to mark documents ... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.333333 | 0.555556 |
| 1 | How can starring docs in Coda help you? | Starring docs in Coda helps to mark documents ... | Starring docs in Coda helps to mark documents ... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.333333 | 0.555556 |
| 2 | What happens when you star a doc in Coda? | After you star a doc in Coda, it will appear i... | After you star a doc in Coda, it will appear i... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.5 | 0.555556 |
| 3 | Where do starred docs go after you star them i... | After you star a doc in Coda, it will appear i... | After you star a doc in Coda, it will appear i... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.5 | 0.222222 |
| 4 | Can starred docs from different workspaces be ... | Yes, all starred docs, even from multiple diff... | Yes, all starred docs, even from multiple diff... | [Not all Coda docs are used in the same way. Y... | 1.0 | 1.0 | 0.0 | 0.111111 |


In [5]:
for key in ["context_entity_recall", "context_relevancy"]:
    print(f"Mean {key}:", score_df[key].mean())

Mean context_entity_recall: 0.3649999980241667
Mean context_relevancy: 0.3230158730158731


## Exploring `context_entity_recall` and `context_relevancy`

`context_precision` and `context_recall` both score `1.0` on the rows above, so in this next section we're going to explore `context_entity_recall` and `context_relevancy`, to understand why their scores are low, and try to improve them. We'll use example `4`, which has a low score for both.


In [6]:
example = ragas_data_list[4]
example

{'question': 'Can starred docs from different workspaces be accessed in one place?',
 'answer': 'Yes, all starred docs, even from multiple different workspaces, will live in the My Shortcuts section.',
 'ground_truth': 'Yes, all starred docs, even from multiple different workspaces, will live in the My Shortcuts section.',
 'contexts': ["Not all Coda docs are used in the same way. You'll inevitably have a few that you use every week, and some that you'll only use once. This is where starred docs can help you stay organized.\n\n\n\nStarring docs is a great way to mark docs of personal importance. After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**. All starred docs, even from multiple different workspaces, will live in this section.\n\n\n\nStarring docs only saves them to your personal My Shortcuts. It doesn’t affect the view for others in your workspace. If you’re wanting to shortcut docs not just for yourself but also f

### Context Entity Recall

Let's grab the prompt from `context_entity_recall`, which extracts entities from the ground truth answer and from the contexts, and take a look at what's happening under the hood:


In [7]:
from ragas.metrics._context_entities_recall import TEXT_ENTITY_EXTRACTION as CONTEXT_ENTITIES_RECALL_TEMPLATE

prompt = CONTEXT_ENTITIES_RECALL_TEMPLATE.format(text="\n".join(example["contexts"]))
print(prompt.prompt_str)

Given a text, extract unique entities without repetition. Ensure you consider different forms or mentions of the same entity as a single entity.

The output should be a well-formatted JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output JSON schema:
```
{"type": "object", "properties": {"entities": {"title": "Entities", "type": "array", "items": {"type": "string"}}}, "required": ["entities"]}
```

Do not return any preamble or explanations, return only a pure JSON string surrounded by triple backticks (```).

Examples:

text: "The Eiffel Tower, located in Paris, France, is one of the most iconic landmarks globally.\n            Millions of vi

In [8]:
import os

import openai

RAGAS_MODEL = "gpt-3.5-turbo-16k"

# We'll use the Braintrust proxy so that everything is cached.
client = openai.AsyncOpenAI(
    base_url="https://braintrustproxy.com/v1",
    default_headers={"x-bt-use-cache": "always"},
    api_key=os.environ.get("OPENAI_API_KEY", "Your OPENAI_API_KEY here"),
)


resp = await client.chat.completions.create(
    model=RAGAS_MODEL,
    messages=[{"role": "user", "content": prompt.prompt_str}],
)
print(resp.choices[0].message.content)

```{"entities": ["Coda docs", "My Shortcuts", "workspaces", "section", "starred docs", "view", "others", "workspace", "team", "pinning"]}```


In [9]:
print(
    (
        await client.chat.completions.create(
            model=RAGAS_MODEL,
            messages=[
                {
                    "role": "user",
                    "content": CONTEXT_ENTITIES_RECALL_TEMPLATE.format(text=example["ground_truth"]).prompt_str,
                }
            ],
        )
    )
    .choices[0]
    .message.content
)

```{"entities": ["starred docs", "multiple different workspaces", "My Shortcuts section"]}```


Interesting, only the string `"starred docs"` is in both sets, but `multiple different workspaces` and `My Shortcuts section` appear to be covered as well (by `workspaces` and `My Shortcuts`). Let's see if the list comparison in `autoevals` returns
a better result.
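
To make the baseline concrete, here is a minimal sketch of an exact-match entity recall, assuming the score is the fraction of ground-truth entities that also appear verbatim among the context entities. `exact_entity_recall` is a hypothetical helper written for illustration, not the RAGAS implementation, and the entity lists are copied from the two model responses above.

```python
def exact_entity_recall(context_entities, expected_entities):
    """Fraction of expected entities that appear verbatim in the context entities.

    Hypothetical helper for illustration only; assumes exact string matching.
    """
    expected = set(expected_entities)
    if not expected:
        return 0.0
    return len(expected & set(context_entities)) / len(expected)


# Entities copied from the two model responses above.
context_entities = [
    "Coda docs", "My Shortcuts", "workspaces", "section", "starred docs",
    "view", "others", "workspace", "team", "pinning",
]
expected_entities = ["starred docs", "multiple different workspaces", "My Shortcuts section"]

score = exact_entity_recall(context_entities, expected_entities)
print(score)
```

With exact matching, only one of the three expected entities counts, even though the other two are near-duplicates of context entities. Because the extraction is done by an LLM, the entity lists (and so the score) can also vary from run to run.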

#### Using a better list overlap


In [10]:
from pprint import pprint

from autoevals.list import ListContains
from autoevals.string import EmbeddingSimilarity

pprint(
    await ListContains(pairwise_scorer=EmbeddingSimilarity(), allow_extra_entities=True).eval_async(
        output=[
            "Coda docs",
            "My Shortcuts",
            "workspaces",
            "section",
            "starred docs",
            "view",
            "others",
            "workspace",
            "team",
            "pinning",
        ],
        expected=["starred docs", "multiple different workspaces", "My Shortcuts section"],
    )
)

Score(name='ListContains',
      score=0.8800621780550074,
      metadata={'lowest_distances': [0.15267817386741356,
                                     0.2071352919675643,
                                     0.0],
                'pairs': [('My Shortcuts',
                           'My Shortcuts section',
                           0.8473218261325864),
                          ('workspaces',
                           'multiple different workspaces',
                           0.7928647080324357),
                          ('starred docs', 'starred docs', 1.0)]},
      error=None)
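
As a sanity check on the output above, the reported score appears to be the mean of the matched pair similarities over the expected entities. That reading is inferred from the numbers in the `metadata`, not from the `autoevals` source, so treat this sketch as an assumption:

```python
# Matched pairs copied from the `pairs` metadata above:
# (context entity, expected entity, similarity).
pairs = [
    ("My Shortcuts", "My Shortcuts section", 0.8473218261325864),
    ("workspaces", "multiple different workspaces", 0.7928647080324357),
    ("starred docs", "starred docs", 1.0),
]
expected = ["starred docs", "multiple different workspaces", "My Shortcuts section"]

# Assumption: with extra context entities allowed, unmatched context
# entities are ignored and the score averages over the expected list.
score = sum(similarity for _, _, similarity in pairs) / len(expected)
print(round(score, 6))  # 0.880062
```

Each expected entity is paired with its closest context entity, so near matches like `workspaces` and `multiple different workspaces` earn partial credit instead of zero.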


### Context Relevancy

Now let's look at context relevancy. We'll start by examining the prompt used to extract relevant sentences.


In [11]:
from ragas.metrics._context_relevancy import CONTEXT_RELEVANCE, sent_tokenize

prompt = CONTEXT_RELEVANCE.format(question=example["question"], context="\n".join(example["contexts"]))
print(prompt.prompt_str)

Please extract relevant sentences from the provided context that is absolutely required answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information".  While extracting candidate sentences you're not allowed to make any changes to sentences from given context.

Your actual task:

question: Can starred docs from different workspaces be accessed in one place?
context: Not all Coda docs are used in the same way. You'll inevitably have a few that you use every week, and some that you'll only use once. This is where starred docs can help you stay organized.



Starring docs is a great way to mark docs of personal importance. After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**. All starred docs, even from multiple different workspaces, will live in this section.



Starring docs only saves them to your pe

In [12]:
resp = await client.chat.completions.create(
    model=RAGAS_MODEL,
    messages=[{"role": "user", "content": prompt.prompt_str}],
)

print(resp.choices[0].message.content)

All starred docs, even from multiple different workspaces, will live in this section.
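
For intuition on how a single extracted sentence turns into a low score, here is a rough sketch that assumes the metric is the number of extracted sentences divided by the number of sentences in the context, capped at 1. The regex splitter is a naive stand-in for the `sent_tokenize` imported above, `sentence_ratio_score` is a hypothetical name, and the context is only the visible excerpt of the example with the markdown link simplified, so the result will not match the table.

```python
import re


def split_sentences(text):
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_ratio_score(extracted, context):
    """Hypothetical sketch: extracted sentence count over context sentence count."""
    context_sentences = split_sentences(context)
    if not context_sentences:
        return 0.0
    if extracted.strip().lower().startswith("insufficient information"):
        return 0.0
    return min(len(split_sentences(extracted)) / len(context_sentences), 1.0)


# Excerpt of the example context (truncated, markdown link simplified).
context_excerpt = (
    "Not all Coda docs are used in the same way. "
    "You'll inevitably have a few that you use every week, and some that you'll only use once. "
    "This is where starred docs can help you stay organized.\n\n"
    "Starring docs is a great way to mark docs of personal importance. "
    "After you star a doc, it will live in a section on your doc list called My Shortcuts. "
    "All starred docs, even from multiple different workspaces, will live in this section."
)
extracted = "All starred docs, even from multiple different workspaces, will live in this section."

score = sentence_ratio_score(extracted, context_excerpt)
print(score)
```

On this six-sentence excerpt, one extracted sentence yields 1/6. Under this kind of ratio, a longer context with the same single extracted sentence scores even lower, regardless of whether that sentence fully answers the question.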


In [13]:
# As a refresher, let's remember the answer
print(example["answer"])

Yes, all starred docs, even from multiple different workspaces, will live in the My Shortcuts section.


#### Adding chain of thought + function calling

Interesting, it appears that at least the previous sentence ("After you star a doc...My Shortcuts") is also needed to produce the final answer. Let's see if we can improve
this metric by asking for chain of thought and using function calling to structure the output.


In [14]:
from typing import List

from pydantic import BaseModel, Field


class RelevantSentence(BaseModel):
    sentence: str = Field(..., description="The selected sentence")
    reasons: List[str] = Field(
        ..., description="Reasons why the sentence is relevant. Explain your thinking step by step."
    )


class RelevantSentences(BaseModel):
    sentences: List[RelevantSentence] = Field(..., description="List of referenced sentences")


response = await client.chat.completions.create(
    model=RAGAS_MODEL,
    messages=[{"role": "user", "content": prompt.prompt_str}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "extract_sentences",
                "description": "Extract relevant sentences from a given context",
                "parameters": RelevantSentences.schema(),
            },
        }
    ],
    tool_choice={"type": "function", "function": {"name": "extract_sentences"}},
)

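try:
    sentences = RelevantSentences(**json.loads(response.choices[0].message.tool_calls[0].function.arguments))
except Exception:
    # Malformed JSON or a schema mismatch: show the raw arguments for debugging.
    print("Failed to parse. Skipping:")
    print(response.choices[0].message.tool_calls[0].function.arguments)
else:
    # Only print when parsing succeeded, so `sentences` is always defined here.
    pprint(sentences.sentences)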

[RelevantSentence(sentence='Starring docs is a great way to mark docs of personal importance.', reasons=[]),
 RelevantSentence(sentence='After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**.', reasons=[]),
 RelevantSentence(sentence='All starred docs, even from multiple different workspaces, will live in this section.', reasons=[])]


Much better!


## Porting to Autoevals

We've bundled these improvements into scoring functions in `autoevals`. If you're curious, you can find the
implementations on [GitHub](https://github.com/braintrustdata/autoevals/blob/main/py/autoevals/ragas.py).


In [15]:
from autoevals.ragas import ContextEntityRecall

pprint(
    await ContextEntityRecall().eval_async(
        output=example["answer"], expected=example["ground_truth"], context=example["contexts"]
    )
)

Score(name='ContextEntityRecall',
      score=0.6952517120038902,
      metadata={'context_entities': ['Coda docs',
                                     'My Shortcuts',
                                     'workspaces',
                                     'pinning'],
                'expected_entities': ['starred docs',
                                      'multiple different workspaces',
                                      'My Shortcuts section']},
      error=None)


In [17]:
from autoevals.ragas import ContextRelevancy

pprint(
    await ContextRelevancy().eval_async(
        input=example["question"],
        output=example["answer"],
        expected=example["ground_truth"],
        context=example["contexts"],
    )
)

Score(name='ContextRelevancy',
      score=0.7423076923076923,
      metadata={'relevant_sentences': [{'reasons': [],
                                        'sentence': 'Starring docs is a great '
                                                    'way to mark docs of '
                                                    'personal importance.'},
                                       {'reasons': [],
                                        'sentence': 'After you star a doc, it '
                                                    'will live in a section on '
                                                    'your doc list called '
                                                    '**[My '
                                                    'Shortcuts](https://coda.io/shortcuts)**.'},
                                       {'reasons': [],
                                        'sentence': 'All starred docs, even '
                                                    'from multiple 

## Running an Eval

Now that we have `Scorer`s for each metric, we can easily run an `Eval()` in Braintrust.


In [21]:
from braintrust import Eval


async def context_entities(input, output, expected, metadata):
    return await ContextEntityRecall().eval_async(output=output, expected=expected, context=metadata["contexts"])


async def context_relevancy(input, output, expected, metadata):
    return await ContextRelevancy().eval_async(
        input=input, output=output, expected=expected, context=metadata["contexts"]
    )


result = await Eval(
    name="Ragas Retrieval",
    data=[
        {
            "input": {"question": x["question"], "ground_truth": x["answer"]},
            "expected": x["answer"],
            "metadata": {"contexts": x["contexts"]},
        }
        for x in ragas_data_list
    ],
    # The task simply echoes the known answer, so only the retrieval scorers are exercised.
    task=lambda input: input["ground_truth"],
    scores=[context_entities, context_relevancy],
)

print(result.summary)

Experiment ragas-1712521002 is running at http://localhost:3000/app/braintrustdata.com/p/Ragas%20Retrieval/experiments/ragas-1712521002
Ragas Retrieval (data): 20it [00:00, 7364.24it/s]
Ragas Retrieval (tasks): 100%|██████████| 20/20 [00:07<00:00,  2.69it/s]



See results for ragas-1712521002 at http://localhost:3000/app/braintrustdata.com/p/Ragas%20Retrieval/experiments/ragas-1712521002



Excellent, the new scores look much higher than the RAGAS baselines above, especially `ContextEntityRecall`, which we'd expect to be close to `1` for this use case.

![scores](./assets/new_scores.png)


### Further improvements

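We can also dig into individual examples, and see exactly the set of extracted/overlapping entities. As you can see, certain examples are clearly broken: "Eiffel Tower"
shows up among the extracted entities even though it is not in the text. It only appears in the few-shot example of the extraction prompt.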

![overlapping entities](./assets/extracted_entities.png)
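
One cheap way to surface this kind of leak outside the UI is to check each extracted entity against the source text. The sketch below uses a hypothetical helper, `find_ungrounded_entities`, and a plain case-insensitive substring test. That is an assumption made for illustration: it would also flag legitimately paraphrased entities, so it is better used to pick rows for inspection than as a hard filter.

```python
def find_ungrounded_entities(entities, source_text):
    """Return the entities that never appear (case-insensitively) in the text.

    Hypothetical debugging helper; not part of autoevals or RAGAS.
    """
    haystack = source_text.casefold()
    return [entity for entity in entities if entity.casefold() not in haystack]


source_text = (
    "All starred docs, even from multiple different workspaces, "
    "will live in the My Shortcuts section."
)
# Illustrative entity list, including one leaked from the prompt's few-shot example.
extracted_entities = ["starred docs", "My Shortcuts", "Eiffel Tower"]

ungrounded = find_ungrounded_entities(extracted_entities, source_text)
print(ungrounded)  # ['Eiffel Tower']
```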


Braintrust makes it easy to debug these kinds of issues. For example, we can open up the exact prompt that ran, and try to see if using GPT-4 returns better results.

![Try fix](./assets/try_fix.gif)
