# Using Ragas metrics in Braintrust to measure and improve RAG performance

Ragas is a popular framework for evaluating Retrieval Augmented Generation (RAG) applications, and its metrics were recently added to Braintrust's `autoevals` package, with improvements. You can read more in our deep dive into Ragas metrics [here](https://braintrust-git-Ragas-braintrustdata.vercel.app/docs/cookbook/Ragas-Retrieval). For this cookbook example, we're going to build a simple RAG application and show how to compute Ragas metrics.

For data, we're going to use the Coda Q&A dataset that we built on in our [previous notebook](https://www.braintrustdata.com/docs/cookbook/CodaHelpDesk), but you don't need to have completed it. We'll build everything from scratch in the first few cells!

First, install dependencies:


In [1]:
%pip install -U autoevals braintrust openai scipy lancedb markdownify --quiet

Note: you may need to restart the kernel to use updated packages.


## Setting up the RAG application

We'll quickly set up a full end-to-end RAG application, based on our earlier [cookbook](https://www.braintrustdata.com/docs/cookbook/CodaHelpDesk). We use the Coda Q&A dataset, LanceDB for our vector database, and OpenAI's `text-embedding-3-small` embedding model.


In [15]:
CODA_QA_FILE_LOC = "https://gist.githubusercontent.com/wong-codaio/b8ea0e087f800971ca5ec9eef617273e/raw/39f8bd2ebdecee485021e20f2c1d40fd649a4c77/articles.json"

In [4]:
import asyncio
import os
import re
import tempfile
import lancedb

import braintrust
import markdownify
import openai
import requests

NUM_SECTIONS = 20

braintrust.login(
    api_key=os.environ.get("BRAINTRUST_API_KEY", "Your BRAINTRUST_API_KEY here")
)

openai_client = braintrust.wrap_openai(
    openai.AsyncOpenAI(
        base_url="https://braintrustproxy.com/v1",
        default_headers={"x-bt-use-cache": "always"},
        api_key=os.environ.get("OPENAI_API_KEY", "Your OPENAI_API_KEY here"),
    )
)

coda_qa_content_data = requests.get(CODA_QA_FILE_LOC).json()

markdown_sections = [
    {"doc_id": row["id"], "markdown": section.strip()}
    for row in coda_qa_content_data
    for section in re.split(r"(.*\n=+\n)", markdownify.markdownify(row["body"]))
    if section.strip() and not re.match(r".*\n=+\n", section)
]


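# mkdtemp() creates a directory that persists for the session; a bare
# TemporaryDirectory() object would be cleaned up as soon as it is garbage collected.
LANCE_DB_PATH = os.path.join(tempfile.mkdtemp(), "docs-lancedb")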


@braintrust.traced
async def embed_text(text: str):
    params = dict(input=text, model="text-embedding-3-small")
    response = await openai_client.embeddings.create(**params)
    embedding = response.data[0].embedding
    return embedding


# Only the first NUM_SECTIONS sections are stored below, so only embed those.
embeddings = await asyncio.gather(
    *(embed_text(section["markdown"]) for section in markdown_sections[:NUM_SECTIONS])
)

db = lancedb.connect(LANCE_DB_PATH)
table = db.create_table(
    "sections",
    data=[
        {
            "doc_id": row["doc_id"],
            "section_id": i,
            "markdown": row["markdown"],
            "vector": embedding,
        }
        for i, (row, embedding) in enumerate(
            zip(markdown_sections[:NUM_SECTIONS], embeddings)
        )
    ],
)

table.count_rows()

20

Done!
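
To make the section-splitting step in the setup cell more concrete, here is a minimal sketch of the same regular expression applied to a small, made-up markdown string (the sample text and the `split_sections` helper name are assumptions for illustration, not part of the Coda dataset). It shows that the text is split at setext-style headings (a title line underlined with `=`) and that the heading lines themselves are dropped:

```python
import re

# Hypothetical sample, standing in for markdownify.markdownify(row["body"]).
sample_markdown = (
    "Intro text\n\n"
    "First heading\n=============\n\n"
    "Body of the first section.\n\n"
    "Second heading\n==============\n\n"
    "Body of the second section.\n"
)

HEADING_PATTERN = r".*\n=+\n"  # a title line followed by a line of '=' characters


def split_sections(markdown: str) -> list:
    """Split on setext headings and keep only the non-heading chunks."""
    return [
        part.strip()
        # The capture group makes re.split return the headings too,
        # so we filter them out again with re.match.
        for part in re.split(f"({HEADING_PATTERN})", markdown)
        if part.strip() and not re.match(HEADING_PATTERN, part)
    ]


sections = split_sections(sample_markdown)
print(sections)
```

One consequence worth knowing: because headings are filtered out, each stored section contains only body text, so a question that matches a heading but not the body may retrieve less well.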

Now we will write some simple, framework-free code to quickly retrieve relevant documents and answer questions:


In [5]:
from typing import List, Iterable
from textwrap import dedent

QA_ANSWER_MODEL = "gpt-3.5-turbo"
TOP_K = 2


@braintrust.traced
async def fetch_top_k_relevant_sections(input: str) -> List[str]:
    embedding = await embed_text(input)
    results = table.search(embedding).limit(TOP_K).to_arrow().to_pylist()
    return [result["markdown"] for result in results]


@braintrust.traced
async def generate_answer_from_docs(question: str, relevant_sections: Iterable[str]):
    context = "\n\n".join(relevant_sections)
    completion = await openai_client.chat.completions.create(
        model=QA_ANSWER_MODEL,
        messages=[
            {
                "role": "user",
                "content": dedent(
                    f"""\
            Given the following context
            {context}
            Please answer the following question:
            Question: {question}
            """
                ),
            }
        ],
    )
    return completion.choices[0].message.content
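
One detail of the prompt construction above is worth a closer look. `dedent` only strips whitespace that is common to every line, and the retrieved sections are interpolated before `dedent` runs, so any multi-line context that starts at column 0 leaves the template's indentation in the final prompt. The following is a minimal sketch of an alternative, assuming you are happy to dedent the template first and format it afterwards (`PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and changing the prompt this way may shift the scores reported later):

```python
from textwrap import dedent

# Dedent the template once, before any document text is inserted.
PROMPT_TEMPLATE = dedent(
    """\
    Given the following context
    {context}
    Please answer the following question:
    Question: {question}
    """
)


def build_prompt(question, relevant_sections):
    """Join the retrieved sections and fill in the already-dedented template."""
    context = "\n\n".join(relevant_sections)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    "What does starring a doc do?",
    ["Doc one, line 1\nDoc one, line 2", "Doc two"],
)
print(prompt)
```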

Great! Now we can answer a question through our RAG pipeline, first by querying our vector DB for the relevant documents:


In [6]:
question = (
    "What impact does starring a document have on other workspace members in Coda?"
)

relevant_sections = await fetch_top_k_relevant_sections(question)

relevant_sections

["Not all Coda docs are used in the same way. You'll inevitably have a few that you use every week, and some that you'll only use once. This is where starred docs can help you stay organized.\n\n\n\nStarring docs is a great way to mark docs of personal importance. After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**. All starred docs, even from multiple different workspaces, will live in this section.\n\n\n\nStarring docs only saves them to your personal My Shortcuts. It doesn’t affect the view for others in your workspace. If you’re wanting to shortcut docs not just for yourself but also for others in your team or workspace, you’ll [use pinning](https://help.coda.io/en/articles/2865511-starred-pinned-docs) instead.",
 'When should I star a doc and when should I pin it?\n--------------------------------------------------\n\n\n\nStarring docs is best for docs of *personal* importance. Starred docs appear in your **My Short
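
Conceptually, the vector search above returns the stored sections whose embeddings are closest to the question's embedding. The sketch below illustrates that idea with a brute-force search in plain Python. It is not LanceDB's implementation: the tiny two-dimensional vectors are made up, and the use of squared L2 distance is an assumption about the default metric.

```python
import heapq


def top_k_by_l2(query, rows, k):
    """Return the markdown of the k rows whose vectors are closest to `query`."""

    def squared_l2(vector):
        return sum((q - v) ** 2 for q, v in zip(query, vector))

    nearest = heapq.nsmallest(k, rows, key=lambda row: squared_l2(row["vector"]))
    return [row["markdown"] for row in nearest]


# Hypothetical rows; real embeddings have many more dimensions.
toy_rows = [
    {"markdown": "starring docs", "vector": [1.0, 0.0]},
    {"markdown": "pinning docs", "vector": [0.8, 0.6]},
    {"markdown": "billing", "vector": [0.0, 1.0]},
]

closest = top_k_by_l2([1.0, 0.1], toy_rows, k=2)
print(closest)
```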

And then passing the relevant documents to the LLM:


In [7]:
answer = await generate_answer_from_docs(question, relevant_sections)

print(answer)

Starring a document in Coda only affects the individual who starred it. It does not impact other workspace members, as the star only saves the document to the individual's personal My Shortcuts section. Other workspace members will not see the document as starred in their own views.


We'll use a quick convenience function to combine these two steps and return both the final answer and the retrieved documents, so we can check whether we picked useful documents. (Returning the documents will also come in handy later for evaluations.)


In [8]:
from braintrust import EvalHooks


@braintrust.traced
async def generate_answer_e2e(question: str):
    retrieved_content = await fetch_top_k_relevant_sections(question)
    answer = await generate_answer_from_docs(question, retrieved_content)
    return dict(answer=answer, retrieved_docs=retrieved_content)


e2e_answer = await generate_answer_e2e(question)
print(e2e_answer.get("answer"))

Starring a document in Coda only affects the individual who starred it. It does not impact other workspace members, as the star only saves the document to the individual's personal My Shortcuts section. Other workspace members will not see the document as starred in their own views.


Perfect! Now that we have the whole system working, we can compute Ragas metrics and try a couple of improvements.

## Baseline Ragas metrics with autoevals

To get a large enough sample size for evaluations, we're going to use the synthetic test questions we generated in [our earlier cookbook](https://www.braintrust.dev/docs/cookbook/CodaHelpDesk). Feel free to check out that cookbook if you're interested in synthetic data generation; otherwise, you can just load them from this file location:


In [16]:
CODA_QA_PAIRS_LOC = "https://gist.githubusercontent.com/nelsonauner/2ef4d38948b78a9ec2cff4aa265cff3f/raw/c47306b4469c68e8e495f4dc050f05aff9f997e1/qa_pairs_coda_data.jsonl"

In [9]:
import json

coda_qa_pairs = requests.get(CODA_QA_PAIRS_LOC)

qa_pairs = [json.loads(line) for line in coda_qa_pairs.text.split("\n") if line]

qa_pairs[0]

{'input': 'What is the purpose of starred docs in Coda?',
 'expected': 'Starring docs in Coda helps to mark documents of personal importance and organizes them in a specific section called My Shortcuts for easy access.',
 'metadata': {'document_id': '8179780',
  'section_id': 0,
  'question_idx': 0,
  'answer_idx': 0,
  'id': 0,
  'split': 'train'}}

Ragas provides a variety of metrics, and you can read about each of them in the [Ragas metrics overview](https://docs.ragas.io/en/stable/concepts/metrics/index.html). For the purposes of this guide, we'll show you how to calculate two scores we've found to be useful: Answer Correctness, which compares your system's answer to a provided golden answer, and Context Recall, which compares the retrieved context to the information in the provided golden answer.
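
To build some intuition for Context Recall, here is a deliberately naive sketch of the idea: break the golden answer into statements and measure what fraction of them is supported by the retrieved context. The real metric relies on an LLM to judge support; the substring check, the `naive_context_recall` name, and the example statements below are all simplifying assumptions for illustration only.

```python
def naive_context_recall(expected_statements, retrieved_docs):
    """Fraction of golden-answer statements found verbatim in the retrieved docs."""
    if not expected_statements:
        return 0.0
    context = " ".join(retrieved_docs).lower()
    supported = sum(
        1 for statement in expected_statements if statement.lower() in context
    )
    return supported / len(expected_statements)


# Hypothetical statements and documents.
statements = [
    "starred docs live in my shortcuts",
    "pinning shares docs with your team",
]
docs = ["After you star a doc, starred docs live in My Shortcuts."]

recall = naive_context_recall(statements, docs)
print(recall)
```

Here only the first statement is covered by the retrieved document, so retrieving a second document about pinning would raise the score.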

Before we calculate metrics, we'll write a short wrapper class that splits the returned output and context into two arguments that our Ragas evaluator classes can easily ingest:


In [14]:
from autoevals import ContextRecall, AnswerCorrectness


class RAGScorerWrapper:
    """This wrapper passes on retrieved_docs to the scorer's eval_async method via the context arg"""

    def __new__(cls, scorer_class):
        async def eval_async(self, output, **kwargs):
            return await super(RAGWrappedScorer, self).eval_async(
                output=output["answer"],
                context=output["retrieved_docs"],
                **kwargs,
            )

        RAGWrappedScorer = type(
            scorer_class.__name__, (scorer_class,), {"eval_async": eval_async}
        )
        return RAGWrappedScorer()
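
To see what the wrapper does without calling a real scorer, the following sketch wraps a hypothetical `EchoScorer` (a stand-in invented here, not an `autoevals` class) that simply returns the arguments it receives. The wrapper definition is repeated so the example is self-contained, and `asyncio.run` is used so it also works as a plain script; inside a notebook you would `await` the call instead.

```python
import asyncio


class RAGScorerWrapper:
    """Same wrapper as above: forwards retrieved_docs via the context arg."""

    def __new__(cls, scorer_class):
        async def eval_async(self, output, **kwargs):
            return await super(RAGWrappedScorer, self).eval_async(
                output=output["answer"],
                context=output["retrieved_docs"],
                **kwargs,
            )

        RAGWrappedScorer = type(
            scorer_class.__name__, (scorer_class,), {"eval_async": eval_async}
        )
        return RAGWrappedScorer()


class EchoScorer:
    """Hypothetical stand-in scorer that echoes back what it was given."""

    async def eval_async(self, output, expected=None, context=None, **kwargs):
        return {"output": output, "expected": expected, "context": context}


wrapped = RAGScorerWrapper(EchoScorer)
echoed = asyncio.run(
    wrapped.eval_async(
        output={"answer": "Only you see it.", "retrieved_docs": ["doc 1", "doc 2"]},
        expected="It only affects you.",
    )
)
print(echoed)
```

Note that the dynamically created subclass reuses the original class name, which is why the scores later show up as `AnswerCorrectness` and `ContextRecall` rather than under a wrapper name.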

And now we can run our evaluation!


In [10]:
from braintrust import Eval

eval_result = await Eval(
    name="Coda RAG with ragas",
    experiment_name=f"Using {QA_ANSWER_MODEL}",
    data=qa_pairs[:20],
    task=generate_answer_e2e,
    scores=[RAGScorerWrapper(AnswerCorrectness), RAGScorerWrapper(ContextRecall)],
)

Experiment Using gpt-3.5-turbo is running at https://www.braintrust.dev/app/braintrustdata.com/p/Coda%20RAG%20with%20ragas/experiments/Using%20gpt-3.5-turbo

90.45% 'ContextRecall'     score
68.23% 'AnswerCorrectness' score

2.22s duration

See results for Using gpt-3.5-turbo at https://www.braintrust.dev/app/braintrustdata.com/p/Coda%20RAG%20with%20ragas/experiments/Using%20gpt-3.5-turbo


One difficulty with Ragas is that it can be hard to understand how a metric was generated for a given datapoint. With Braintrust, we can see each constituent metric and all of its submetrics, and drill down to the actual LLM calls used to generate the metric components.

For example, here we can see that `AnswerCorrectness` is computed from both a factuality score and an answer similarity score, and we can examine the subscore computations, in this case `factuality_score` and `answer_similarity_score`.

![ ragas metric computation ](assets/ragas_metric_computation.png)
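
As a rough illustration of how two subscores can be folded into a single number, the sketch below takes a weighted average of a factuality score and an answer similarity score. The function name and the 0.75 / 0.25 weights are assumptions chosen for the example; check the `autoevals` source or the trace in Braintrust for the values actually used.

```python
def combine_answer_correctness(
    factuality_score,
    answer_similarity_score,
    factuality_weight=0.75,  # assumed weight, for illustration
    similarity_weight=0.25,  # assumed weight, for illustration
):
    """Weighted average of the two subscores, normalized by the total weight."""
    total_weight = factuality_weight + similarity_weight
    weighted_sum = (
        factuality_weight * factuality_score
        + similarity_weight * answer_similarity_score
    )
    return weighted_sum / total_weight


combined = combine_answer_correctness(
    factuality_score=0.5, answer_similarity_score=1.0
)
print(combined)
```

With weights like these, an answer that is worded very similarly to the golden answer but gets half of the facts wrong still scores well below 1.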


### Evaluating changes

Now that we have our end-to-end system, and know how to use Braintrust to examine the results, let's make some small changes and evaluate if they helped.

#### Swapping LLMs

Up to this point, we've used GPT-3.5 to answer questions, but do we observe performance improvements when using GPT-4 instead?
Let's name a new experiment and rerun the evaluations:


In [11]:
QA_ANSWER_MODEL = "gpt-4-turbo"


eval_result = await Eval(
    name="Coda RAG with ragas",
    experiment_name=f"Using {QA_ANSWER_MODEL}",
    data=qa_pairs[:20],
    task=generate_answer_e2e,
    scores=[RAGScorerWrapper(AnswerCorrectness), RAGScorerWrapper(ContextRecall)],
)

Experiment Using gpt-4-turbo is running at https://www.braintrust.dev/app/braintrustdata.com/p/Coda%20RAG%20with%20ragas/experiments/Using%20gpt-4-turbo

Using gpt-4-turbo compared to Using gpt-3.5-turbo:
74.10% (+05.87%) 'AnswerCorrectness' score	(13 improvements, 7 regressions)
90.45% (-) 'ContextRecall'     score	(0 improvements, 0 regressions)

1.71s (-51.83%) 'duration'	(20 improvements, 0 regressions)

See results for Using gpt-4-turbo at https://www.braintrust.dev/app/braintrustdata.com/p/Coda%20RAG%20with%20ragas/experiments/Using%20gpt-4-turbo


Great, it looks like changing our LLM increased AnswerCorrectness while ContextRecall stayed the same. Both results make intuitive sense: we'd expect AnswerCorrectness to increase with a better model, and ContextRecall depends only on the retrieved documents, which come from the embedding search rather than the answer model, so it shouldn't be affected. (Interested in LLM-powered document selection? See our other RAG [notebook](https://www.braintrustdata.com/docs/cookbook/CodaHelpDesk).)
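
The summary above reports an average change alongside counts of improvements and regressions. As a minimal sketch of that kind of per-example comparison (not Braintrust's actual implementation), the code below compares two lists of scores for the same datapoints. The `compare_runs` helper and the score values are made up for illustration.

```python
def compare_runs(baseline_scores, candidate_scores):
    """Count per-example improvements/regressions and the change in mean score."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("Both runs must score the same datapoints")
    pairs = list(zip(baseline_scores, candidate_scores))
    improvements = sum(1 for base, cand in pairs if cand > base)
    regressions = sum(1 for base, cand in pairs if cand < base)
    mean_delta = (sum(candidate_scores) - sum(baseline_scores)) / len(pairs)
    return {
        "improvements": improvements,
        "regressions": regressions,
        "mean_delta": mean_delta,
    }


# Hypothetical per-example AnswerCorrectness scores for two experiments.
baseline = [0.5, 0.7, 0.9, 0.4]
candidate = [0.6, 0.7, 0.8, 0.9]

summary = compare_runs(baseline, candidate)
print(summary)
```

Looking at the counts as well as the mean is useful because a higher average can hide a meaningful number of individual regressions, as in the 13 improvements and 7 regressions reported above.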

#### Reducing document retrieval

Now, let's see if we can get away with only pulling a single document per question, instead of the two we've been fetching up to this point.


In [12]:
TOP_K = 1


eval_result = await Eval(
    name="Coda RAG with ragas",
    experiment_name=f"Using {QA_ANSWER_MODEL}, TOP_K={TOP_K}",
    data=qa_pairs[:20],
    task=generate_answer_e2e,
    scores=[RAGScorerWrapper(AnswerCorrectness), RAGScorerWrapper(ContextRecall)],
)

Experiment Using gpt-4-turbo, TOP_K=1 is running at https://www.braintrust.dev/app/braintrustdata.com/p/Coda%20RAG%20with%20ragas/experiments/Using%20gpt-4-turbo%2C%20TOP_K%3D1

Using gpt-4-turbo, TOP_K=1 compared to Using gpt-4-turbo:
97.29% (+06.84%) 'ContextRecall'     score	(2 improvements, 3 regressions)
59.19% (-14.91%) 'AnswerCorrectness' score	(6 improvements, 14 regressions)

1.70s (-00.95%) 'duration'	(12 improvements, 8 regressions)

See results for Using gpt-4-turbo, TOP_K=1 at https://www.braintrust.dev/app/braintrustdata.com/p/Coda%20RAG%20with%20ragas/experiments/Using%20gpt-4-turbo%2C%20TOP_K%3D1


Reducing our retrieval step from two documents to one per question decreased our overall answer quality: AnswerCorrectness dropped from 74.10% to 59.19%.
Jumping into Braintrust's UI, we can see the comparison here:

![ ragas metric computation ](assets/ragas_triple_comparison.png)

We can also easily drill down into individual examples. Here we identify a question that had a drastic score difference when we used two documents vs. one per answer. Looking at the diff view between runs, we see that the second document provides context that completely changes the answer from a "No" to a "Yes", increasing the AnswerCorrectness metric from 0.15 to 0.58.

![example diff](assets/example_qa_different.png)

And there you have it! We hope you found this cookbook useful for quickly getting Ragas metrics implemented in your LLM app.
