---
title: "How to use synthetic data to bootstrap your RAG system evals"
date: 2025-07-11
date-modified: 2025-07-11
description-meta: "How to use synthetic data to build a RAG system"
categories:
  - llm
  - python
  - rag 
---

I recently took part in Hamel and Shreya's [AI Evals for Engineers & PMs](https://maven.com/parlance-labs/evals). I couldn't go through the lessons live, but I've been catching up on the recordings and the course materials in the last few weeks.

It's a course packed with high-quality knowledge that you won't find elsewhere. If you want to learn how to improve your LLM systems, I don't think you can do better than this course.

I've worked on many AI projects, but I still hadn't had the chance to generate synthetic data to bootstrap my Retrieval Augmented Generation (RAG) system evals. So I thought it'd be fun to try it out.

In this article, I'll walk you through the steps you need to take to generate synthetic data for your RAG system evals.

## How it works

The process goes as follows:

1. Index your documents in a vector database.
2. Sample a few documents from the vector database.
3. Using the sampled documents, extract an unambiguous **fact** and write a **question** about it.
4. Filter the generated questions to remove the ones that don't seem realistic.
5. Use the filtered dataset to calculate the evals of your RAG system.

## Prerequisites

If you plan to follow along, you'll need to:

1. Sign up and generate an API key in [OpenAI](https://platform.openai.com/docs/overview).
2. Set the API key as an environment variable called `OPENAI_API_KEY`.
3. Create a virtual environment in Python (I use [`uv`](https://docs.astral.sh/uv/)) and install the following packages: `langchain`, `langchain-openai`, `langchain-community`, `langsmith`, `jupyter`, `chromadb`, `pandas`, `openpyxl`, `python-dotenv`, `nest_asyncio`, and `sentence-transformers`.
4. Download the People Group's section from GitLab's [handbook](https://gitlab.com/gitlab-com/content-sites/handbook/-/tree/main/content/handbook/people-group).

Also, I'm assuming you're familiar with the basics of RAG systems and how to use vector databases. If you need a refresher, you can check out my [RAG post](https://dylancastillo.co/posts/what-is-rag.html).

Once that's done, you'll be able to run the code from this article. If you'd rather not copy and paste the code, you can download this [notebook](https://github.com/dylanjcastillo/blog/tree/main/posts/synthetic-data-rag.ipynb).

## Setup and loading the data

We'll use `asyncio` in some of the code snippets, so you'll need to enable `nest_asyncio` to run this notebook: 

In [1]:
import nest_asyncio

nest_asyncio.apply()

Then, you can proceed as usual, importing the required packages:

In [25]:
import asyncio
import os
import random
from textwrap import dedent

import chromadb
import pandas as pd
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from dotenv import load_dotenv
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_text_splitters import MarkdownTextSplitter
from langsmith import Client, traceable
from pydantic import BaseModel
from sentence_transformers import CrossEncoder

load_dotenv()

True

These are the main libraries you'll need for this article:

- **chromadb**: Vector database for storing and retrieving document embeddings
- **langchain**: Framework for building LLM applications
- **langchain-openai**: Wrapper for OpenAI's API, providing access to LLMs and embeddings
- **pydantic**: Provides models for generating structured data and validating types
- **sentence-transformers**: In the last section of the article, we'll use this library to rerank the retrieved documents. 

The rest of the libraries are used for typical Python tasks, such as reading files, managing environment variables, etc. 

Next, you can read the data from the People Group's section of GitLab's handbook.

In [None]:
loader = DirectoryLoader(
    "../data/synthetic-data-rag/people-group/", glob="**/*.md", loader_cls=TextLoader
)
docs = loader.load()

len(docs)

99

If you set everything up correctly, you should see the number of documents in the `docs` variable. Depending on when you read this article, the number of documents may change, as the handbook is updated regularly. When I got the data, there were 99 documents in the `docs` variable.

Then, you must add the data to the vector database.

## Index data

A **vector database** is a database designed to efficiently store and query data as vector embeddings (numerical representations). Provided with a user query, it's the engine you use to find the most similar data in your database.
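
To make the idea concrete, here's a minimal sketch of what a similarity query boils down to. The three-dimensional vectors and the `toy_index` dictionary are invented for illustration. Real embeddings have hundreds or thousands of dimensions, and a real vector database typically relies on approximate indexes rather than the full scan shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: list[float], index: dict[str, list[float]], k: int = 2
) -> list[str]:
    """Brute-force search: score every stored vector and keep the k best IDs."""
    ranked = sorted(
        index,
        key=lambda doc_id: cosine_similarity(query, index[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings, invented for this example
toy_index = {
    "vacation-policy": [0.9, 0.1, 0.0],
    "parental-leave": [0.7, 0.3, 0.1],
    "expense-reports": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # pretend this embeds a question about time off

print(top_k_similar(query_vector, toy_index, k=2))
# ['vacation-policy', 'parental-leave']
```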

For this tutorial, we'll use [ChromaDB](https://www.trychroma.com/). So, let's set it up: 

In [4]:
openai_ef = OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
client = chromadb.PersistentClient(
    path="../data/synthetic-data-rag/chroma",
)
collection = client.get_or_create_collection(
    "gitlab-handbook", embedding_function=openai_ef
)

This will do three things:

1. Define an embedding function that uses OpenAI's API to generate embeddings for the documents.
2. Create a ChromaDB client to interact with the vector database.
3. Create a collection in the vector database to store the document embeddings and set the embedding function to use the one we defined earlier.

Then, you must split the documents into smaller chunks. This isn't strictly necessary if you have short documents (depending on your embedding model), but it's a good practice to ensure that the embeddings are more meaningful and that the vector database can retrieve relevant chunks of text.

In [28]:
text_splitter = MarkdownTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4o",
    chunk_size=400,
    chunk_overlap=0,
)
splits = text_splitter.split_documents(docs)

len(splits)

999

Luckily for us, GitLab's handbook is already in Markdown, so we can use the `MarkdownTextSplitter` to split the documents. In addition to the chunk length, it uses the Markdown structure of the files (such as headings) to decide where to split. This generally results in better chunks, as they're more likely to contain complete thoughts or sections of the document.

We'll use the `tiktoken` encoder to do the splits based on the number of tokens (not characters) and set a chunk size of 400 tokens, with no overlap.
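
If you're curious what "split on structure first, then on length" means in practice, here's a rough standalone sketch. It is not how `MarkdownTextSplitter` is implemented: the `split_markdown` function is hypothetical, it counts whitespace-separated words as a stand-in for tokens, and it flattens line breaks for simplicity.

```python
import re


def split_markdown(text: str, max_words: int = 50) -> list[str]:
    """Split on Markdown headings first, then cap each section at max_words."""
    # Zero-width split right before every line that starts with 1-6 '#' characters
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        words = section.split()
        # Sections over the budget are cut into consecutive windows of max_words
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start : start + max_words]))
    return chunks


sample = (
    "# Leave\n"
    "Team members can take time off.\n"
    "## Sick leave\n"
    "Notify your manager as soon as possible.\n"
)
for chunk in split_markdown(sample, max_words=8):
    print(repr(chunk))
```

A chunk never spans two headings here, which is the property that tends to keep each chunk focused on a single topic.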

Then, because you'll likely get a significant number of chunks, let's define a utility function that lets you add the chunks to the vector database in batches. This will help you avoid hitting the API rate limits: 

In [6]:
def create_batches(ids, documents, metadatas, batch_size=100):
    batches = []
    for i in range(0, len(ids), batch_size):
        batch_ids = ids[i : i + batch_size]
        batch_documents = documents[i : i + batch_size]
        batch_metadatas = metadatas[i : i + batch_size]
        batches.append((batch_ids, batch_metadatas, batch_documents))
    return batches

Using this function, you can add the chunks to the vector database in batches:

In [7]:
ids = [str(i) for i in range(len(splits))]
documents = [doc.page_content for doc in splits]
metadatas = [doc.metadata for doc in splits]


if collection.count() > 0:
    print("Collection already exists, skipping creation.")
else:
    print("Adding documents...")
    batches = create_batches(ids=ids, documents=documents, metadatas=metadatas)
    for i, batch in enumerate(batches):
        print(f"Adding batch {i} of size {len(batch[0])}")
        collection.add(ids=batch[0], metadatas=batch[1], documents=batch[2])

Collection already exists, skipping creation.


This should take a few seconds to run. Once it's done, your vector database will be ready to use.

Let's define a couple of functions to query the vector database:

In [8]:
class RetrievedDoc(BaseModel):
    id: str
    path: str
    page_content: str


def get_similar_docs(text: str, top_k: int = 5) -> list[RetrievedDoc]:
    results = collection.query(query_texts=[text], n_results=top_k)
    # Chroma returns one list per query; we sent a single query, so take index 0.
    # Zipping avoids an IndexError when fewer than top_k results come back.
    docs = results["documents"][0]
    metadatas = results["metadatas"][0]
    ids = results["ids"][0]
    return [
        RetrievedDoc(id=id_, path=m["source"], page_content=d)
        for d, m, id_ in zip(docs, metadatas, ids)
    ]


def get_doc_by_id(doc_id: str) -> RetrievedDoc:
    results = collection.get(ids=[doc_id])
    doc = results["documents"][0]
    metadata = results["metadatas"][0]
    return RetrievedDoc(id=doc_id, path=metadata["source"], page_content=doc)

These functions will let you either get the most similar documents to a user query or retrieve a document by its ID.

With this in place, we can now generate a sample of documents we'll use to generate the synthetic data: 

In [10]:
golden_docs_idx = random.sample(range(len(splits)), 200)
golden_docs = [get_doc_by_id(str(i)) for i in golden_docs_idx]

This will result in 200 documents that we'll use to generate the synthetic data. 
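
Because the sample is random, every run picks a different set of documents. If you'd like the same golden set across runs, one option is to seed the generator. This is a small sketch of that idea: the seed value is arbitrary, and `NUM_CHUNKS` simply mirrors the chunk count reported above.

```python
import random

NUM_CHUNKS = 999  # number of chunks reported above; yours may differ
SAMPLE_SIZE = 200

# A dedicated Random instance keeps the seed from affecting other code
rng = random.Random(42)  # the seed value is arbitrary
golden_idx = rng.sample(range(NUM_CHUNKS), SAMPLE_SIZE)

# Re-creating the generator with the same seed yields the same sample
assert golden_idx == random.Random(42).sample(range(NUM_CHUNKS), SAMPLE_SIZE)
print(len(golden_idx), len(set(golden_idx)))  # 200 200
```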

## Generate QA Pairs

Then, using the documents you just sampled, you can generate the synthetic data. For each document, you'll extract a fact and generate a question from it. Hamel and Shreya mention that this might result in questions that are too easy to answer, which might not be ideal for evaluating your RAG system.

To create more challenging synthetic queries, they recommend adding similar chunks to the generation process, so that it can generate questions that are uniquely answered by the target chunk but also include themes or keywords that are present in other chunks.

Here's an example of how this works:

**Target chunk:** "George Orwell's masterpiece, *Nineteen Eighty-Four*, was published in June 1949 and introduced the concept of 'Big Brother' to a global audience."

**Similar chunks:** 

1. "Aldous Huxley's *Brave New World*, another influential work of dystopian fiction, was first released in 1932 and explores themes of social conditioning and control."
2. "Ray Bradbury's *Fahrenheit 451*, published in 1953, depicts a future society where books are banned and 'firemen' burn any that are found."

**Synthetic Question:** "In what year was the dystopian novel that introduced the concept of 'Big Brother' published?"

The target chunk helps the generator come up with a synthetic question. The similar chunks provide distractors that help the generator include themes or keywords that are also present in other chunks (e.g., dystopian fiction), making the question more challenging.

To do this, you can use the following prompts: 

In [None]:
system_prompt_generate = dedent(
    """
    You are a helpful assistant generating synthetic QA pairs for retrieval evaluation.

    Given a target chunk of text and a set of confounding chunks, you must extract a specific, self-contained fact from the target chunk that is not included in the confounding chunks. Then write a question that is directly and unambiguously answered by that fact. The question should only be answered by the fact extracted from the target chunk (and not by any of the confounding chunks) but it should also use themes or terminology that is present in the confounding chunks.

    Always respond with a JSON object with the following keys (in that exact order):
    1. "fact": "<the fact extracted from the target chunk>",
    2. "confounding_terms": "<a list of terms or themes from the confounding chunks that are relevant to the question>",
    3. "question": "<the question that is directly and unambiguously answered by the fact>",
    
    You should write the questions as if you're an employee looking for information in the handbook. The question should be as realistic and natural as possible, reflecting the kind of queries an employee might actually make when searching for information in the handbook.
    """
)

user_prompt_generate = dedent(
    """
    TARGET CHUNK:
    {target_chunk}

    CONFOUNDING CHUNKS:
    {confounding_chunks} 
    """
)

These prompts will be used to generate the synthetic data. The `system_prompt_generate` describes the generation process outlined earlier, and `user_prompt_generate` provides the required context: the target and confounding chunks.

Then, you initialize the LLM, set up the response model, and define a function to format the documents for the LLM: 

In [None]:
class Response(BaseModel):
    fact: str
    confounding_terms: list[str] = []
    question: str


llm = ChatOpenAI(model="gpt-4.1-mini", temperature=1)
llm_with_structured_output = llm.with_structured_output(Response)

messages = ChatPromptTemplate.from_messages(
    [("system", system_prompt_generate), ("user", user_prompt_generate)]
)

def format_docs(chunks: list[RetrievedDoc]) -> str:
    return "\n".join(
        [f"*** Filepath: {chunk.path} ***\n{chunk.page_content}\n" for chunk in chunks]
    )

Finally, you can define a function to generate the synthetic data. This function will take a chunk, retrieve the most similar chunks, and generate a question from the target chunk using the similar chunks as distractors:

In [None]:
async def generate_qa_pair(chunk):
    # A chunk is normally its own nearest neighbor, so drop it by ID to keep
    # the target out of the confounding set
    similar_chunks = [
        doc for doc in get_similar_docs(chunk.page_content) if doc.id != chunk.id
    ]
    compiled_messages = await messages.ainvoke(
        {
            "target_chunk": format_docs([chunk]),
            "confounding_chunks": format_docs(similar_chunks),
        }
    )
    output = await llm_with_structured_output.ainvoke(compiled_messages)
    return output

To speed up question generation, you can run this concurrently using asyncio.

Here's how you can do it:

In [None]:
tasks = [generate_qa_pair(random_split) for random_split in golden_docs]
qa_pairs = await asyncio.gather(*tasks)

df = pd.DataFrame([qa_pair.model_dump() for qa_pair in qa_pairs])
df.to_excel("../data/synthetic-data-rag/files/qa_pairs.xlsx", index=False)

Here are some of the resulting QA pairs:

1. Example 1:
   - **Question:** How soon should managers send the results after the 360 feedback cycle closes to prepare for the feedback meeting?
   - **Answer:** Managers should send the results of 360 feedback within 48 hours of the feedback cycle closing so they can prepare and come to the meeting with questions and discussion points.

2. Example 2:
   - **Question:** At what point in the hiring process must candidates disclose outside employment or side projects for GitLab to assess potential conflicts with their job obligations?
   - **Answer:** Candidates at a certain stage in the recruiting process are asked to disclose outside employment, side projects, or other activities so GitLab can determine if a conflict exists with their ability to fulfill obligations to GitLab.

Even though the questions seem relevant, you might've noticed that they don't really reflect how real users phrase their queries.

To improve that, you can provide few-shot examples of real or adjusted queries. But before that, you can build a filter that removes the questions that don't seem realistic enough.

## Filtering QA pairs

The filter is an LLM judge: given a few manually rated examples, it scores how realistic each generated question is on a scale from 1 to 5 and explains its rating.

In [None]:
system_prompt_rate = dedent(
    """
    You are an AI assistant helping us curate a high-quality dataset of questions for evaluating a company's internal handbook. We have generated synthetic questions and need to filter out those that are unrealistic or not representative of typical user queries.

    Here are examples of realistic and unrealistic user queries we have manually rated:

    ### Realistic Queries (Good Examples)

    * **Query:** "What is the required process for creating a new learning hub for your team in Level Up at GitLab?"
        * **Explanation:** Very realistic user query. It's concise, information-seeking, and process-oriented.
        * **Rating:** 5
    * **Query:** "Where is the People Operations internal handbook hosted, and how can someone gain access to it?"
        * **Explanation:** Realistic query but might be a bit too detailed for a typical user.
        * **Rating:** 4
    * **Query:** "Who controls access to People Data in the data warehouse at GitLab, and what approvals are required for Analytics Engineers and Data Analysts to obtain access?"
        * **Explanation:** Seems reasonable but too lengthy for a typical user query. 
        * **Rating:** 3

    ### Unrealistic Queries (Bad Examples)

    * **Query:** "If a GitLab team member has been with the company for over 3 months and is interested in participating in the Onboarding Buddy Program, what should they do to express their interest?"
        * **Explanation:** Overly specific and unnatural. No real user would ask this.
        * **Rating:** 1
    * **Query:** "On what date did the 'Managing Burnout with Time Off with John Fitch' session occur as part of the FY21 Learning Speaker Series?"
        * **Explanation:** Irrelevant and overly specific. Not a typical user query. 
        * **Rating:** 2

    ### Your Task

    For the following generated question, please:

    1.  Rate its realism as a typical user query for an internal handbook application on a scale of 1 to 5 (1 = Very Unrealistic, 3 = Neutral/Somewhat Realistic, 5 = Very Realistic).
    2.  Provide a brief explanation for your rating, comparing it to the examples above if helpful.

    ### Output Format

    **Explanation:** `[Your brief explanation]`
    **Rating:** `[Your 1–5 rating]`
    """
)

user_prompt_rate = dedent(
    """
    **Generated Question to Evaluate:**
    `{question_to_evaluate}`
    """
)

In [None]:
class ResponseFiltering(BaseModel):
    explanation: str
    rating: int


llm_with_structured_output_filtering = llm.with_structured_output(ResponseFiltering)

messages_filtering = ChatPromptTemplate.from_messages(
    [("system", system_prompt_rate), ("user", user_prompt_rate)]
)

In [None]:
async def rate_qa_pair(qa_pair):
    compiled_messages = await messages_filtering.ainvoke(
        {"question_to_evaluate": qa_pair.question}
    )
    output = await llm_with_structured_output_filtering.ainvoke(compiled_messages)
    return output


tasks = [rate_qa_pair(qa_pair) for qa_pair in qa_pairs]
results = await asyncio.gather(*tasks)

In [None]:
rated_qa_pairs = [
    {
        "rating": result.rating,
        "explanation": result.explanation,
        "question": qa_pair.question,
        "answer": qa_pair.fact,
    }
    for (result, qa_pair) in zip(results, qa_pairs)
]

In [None]:
# Column names must match the dictionary keys; otherwise pandas fills them with NaN
df_rated_qa_pairs = pd.DataFrame(
    rated_qa_pairs, columns=["rating", "explanation", "question", "answer"]
)

df_rated_qa_pairs.to_excel(
    "../data/synthetic-data-rag/files/rated_qa_pairs.xlsx", index=False
)
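
Before settling on a cutoff, it can help to see how many questions survive at each threshold. Here's a small sketch of that check. The `rated` list and its ratings are invented for illustration, and `filter_by_rating` is a hypothetical helper, not part of the code above.

```python
from collections import Counter

# Hypothetical ratings, invented to illustrate the mechanics
rated = [
    {"question": "How do I request parental leave?", "rating": 5},
    {"question": "Who approves access to People Data?", "rating": 4},
    {"question": "On what date did the FY21 speaker session occur?", "rating": 2},
    {"question": "Where is the internal handbook hosted?", "rating": 5},
]


def filter_by_rating(rows: list[dict], min_rating: int) -> list[dict]:
    """Keep only the rows whose rating is at least min_rating."""
    return [row for row in rows if row["rating"] >= min_rating]


# How many questions received each rating
distribution = Counter(row["rating"] for row in rated)
print(dict(distribution))

for threshold in (3, 4, 5):
    kept = filter_by_rating(rated, threshold)
    print(f"min_rating={threshold}: keeps {len(kept)} of {len(rated)}")
```

A strict threshold gives you cleaner questions but a smaller dataset, so the right value depends on how many examples you can afford to lose.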

## Evaluating RAG system 

Now that we have the filtered QA pairs, it's time to evaluate your RAG system.

You'll evaluate two parts of the RAG system: retrieval and generation.

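Start by creating a dataset on LangSmith. This is also where the filtering happens, since only the questions that received the top rating of 5 are kept: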

In [30]:
# | output: false

langsmith_client = Client()
dataset_name = "Gitlab Handbook QA Evaluation 2"

try:
    dataset = langsmith_client.create_dataset(dataset_name=dataset_name)
    examples = [
        {
            "inputs": {
                "question": h["question"],
            },
            "outputs": {
                "answer": h["answer"],
                "doc": {
                    "id": chunk.id,
                    "path": chunk.path,
                },
            },
        }
        for h, chunk in zip(rated_qa_pairs, golden_docs)
        if h["rating"] >= 5
    ]
    langsmith_client.create_examples(dataset_id=dataset.id, examples=examples)
except Exception:
    print("Dataset already exists, skipping creation.")
    dataset = langsmith_client.read_dataset(dataset_name=dataset_name)

Dataset already exists, skipping creation.


### Retrieval Metrics

To evaluate the retrieval part of the RAG system, you can use metrics such as recall@k, precision@k, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG).

For this tutorial, you'll use two metrics: MRR and recall@k.

**MRR** (Mean Reciprocal Rank) is a measure of how well the RAG system retrieves relevant documents. It calculates the average of the reciprocal ranks of the first relevant document for each query. It essentially measures how quickly the system retrieves the first relevant document for a given query.

For example, if a relevant document appears at position 1, the MRR contribution is 1/1 = 1. If it appears at position 3, the contribution is 1/3 ≈ 0.33. The formula is:

$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$

where $|Q|$ is the number of queries and $rank_i$ is the position of the first relevant document for query $i$.
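
As a quick worked example of the formula, here's MRR computed by hand for three queries. The document IDs and rankings are made up, and `reciprocal_rank` is a hypothetical helper written for this illustration.

```python
def reciprocal_rank(retrieved_ids: list[str], relevant_id: str) -> float:
    """1/rank of the relevant document, or 0.0 if it wasn't retrieved."""
    for position, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0


# Hypothetical results: (retrieved IDs in ranked order, relevant ID)
runs = [
    (["12", "7", "3"], "12"),  # relevant doc at rank 1 -> 1.0
    (["5", "9", "41"], "41"),  # relevant doc at rank 3 -> 1/3
    (["8", "2", "6"], "99"),  # relevant doc missing -> 0.0
]

scores = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
mean_rr = sum(scores) / len(scores)
print(round(mean_rr, 3))  # (1 + 1/3 + 0) / 3 = 0.444
```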

**Recall@k** measures the proportion of all relevant documents that appear in the top k results. It tells you how much of the relevant material the RAG system manages to retrieve.

For example, if a query has 5 relevant documents and 3 of them appear in the top 10 results, recall@10 is 3/5 = 0.6. The formula is:

$$Recall@k = \frac{|\text{relevant documents in top k}|}{|\text{total relevant documents}|}$$

where $|\text{relevant documents in top k}|$ is the number of relevant documents retrieved in the top k results, and $|\text{total relevant documents}|$ is the total number of relevant documents for the query.

In our case, given that we only have one relevant document per query, recall@k will be 1 if the relevant document is in the top k results, and 0 otherwise.
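
The following sketch shows both situations: a query with several relevant documents, and the single-relevant-document case used in this tutorial. The rankings and IDs are invented, and `recall_at_k` is a hypothetical helper that assumes the retrieved IDs are unique.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


ranking = ["a", "b", "c", "d", "e", "f"]  # hypothetical ranked results

# Several relevant documents: 2 of the 4 appear in the top 5
print(recall_at_k(ranking, {"a", "d", "x", "y"}, k=5))  # 0.5

# One relevant document, as in this tutorial: the score is either 0.0 or 1.0
print(recall_at_k(ranking, {"f"}, k=5))  # 0.0
print(recall_at_k(ranking, {"f"}, k=6))  # 1.0
```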

You can define two LangSmith evaluators to calculate these metrics:

In [12]:
def mrr(inputs: dict, outputs: dict, reference_outputs: dict) -> float:
    reference_docs = [str(reference_outputs["doc"]["id"])]
    docs = outputs.get("docs", [])
    if not docs:
        return 0.0
    rank = next((i + 1 for i, doc in enumerate(docs) if doc in reference_docs), None)
    return 1.0 / rank if rank else 0.0


def recall(inputs: dict, outputs: dict, reference_outputs: dict) -> float:
    reference_docs = [str(reference_outputs["doc"]["id"])]
    docs = outputs.get("docs", [])
    if not docs:
        return 0.0
    return float(any(doc in reference_docs for doc in docs))

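LangSmith evaluators take `inputs`, `outputs`, and `reference_outputs` as arguments. The `inputs` are the inputs of each dataset example (here, the question). The `outputs` are whatever the function under evaluation returns (here, the generated answer, the IDs of the retrieved documents, and the formatted context). The `reference_outputs` are the expected outputs stored with each example (here, the reference answer and the target chunk).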

Then using those values, you can calculate the MRR and recall@k metrics.

### Generation metrics

To measure the generation quality, Hamel and Shreya recommend using [ARES](https://github.com/stanford-futuredata/ARES) or [RAGAS](https://github.com/explodinggradients/ragas).

ARES requires a human preference validation set of at least 50 examples and the standard RAGAS metrics consume tons of tokens. So for the sake of simplicity, we'll build 3 simple metrics using an LLM judge, similar to RAGAS' [Nvidia metrics](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/nvidia_metrics/).

- **Answer accuracy**: Measures how accurate the generated answer is compared to the reference answer.
- **Context relevance**: Measures if the context provided to the LLM is relevant to the user query.
- **Groundedness**: Measures if the generated answer is grounded in the provided context.

#### Answer accuracy

We'll use an LLM judge that evaluates the generated answer against a reference answer.

In [13]:
system_prompt_answer_accuracy = dedent(
    """
    You are an expert evaluator. Your task is to evaluate the accuracy of a User Answer against a Reference Answer, given a Question.

    Here's the grading scale you must use:

    0 - If User Answer is not contained in Reference Answer or not accurate in all terms, topics, numbers, metrics, dates and units or the User Answer does not answer the question.
    1 - If User Answer is partially contained and almost equivalent to Reference Answer in all terms, topics, numbers, metrics, dates and units.
    2 - If User Answer is fully contained and equivalent to Reference Answer in all terms, topics, numbers, metrics, dates and units.

    Your rating must be only 0, 1 or 2 according to the instructions above.

    Your answer must be a JSON object with the following keys:
    1. "explanation": "<a brief explanation of your rating>",
    2. "rating": "<your rating, which must be one of the following: 0, 1, 2>"
    """
)

user_prompt_answer_accuracy = dedent(
    """
    **Question:** `{question}`
    **User Answer:** `{user_answer}`
    **Reference Answer:** `{reference_answer}`
    """
)

messages_answer_accuracy = ChatPromptTemplate.from_messages(
    [("system", system_prompt_answer_accuracy), ("user", user_prompt_answer_accuracy)]
)


class ResponseAnswerAccuracy(BaseModel):
    explanation: str
    rating: int


llm = ChatOpenAI(model="gpt-4.1-mini")

llm_with_structured_output_answer_accuracy = llm.with_structured_output(
    ResponseAnswerAccuracy
)


async def answer_accuracy(
    inputs: dict, outputs: dict, reference_outputs: dict
) -> float:
    compiled_messages = await messages_answer_accuracy.ainvoke(
        {
            "question": inputs["question"],
            "user_answer": outputs["answer"],
            "reference_answer": reference_outputs["answer"],
        }
    )
    output = await llm_with_structured_output_answer_accuracy.ainvoke(compiled_messages)
    return output.rating / 2.0

#### Context relevance

Then, we'll create another LLM judge that evaluates if the context provided to the LLM is relevant to the user query.

These are the prompts:

In [14]:
system_prompt_context_relevance = dedent(
    """
    You are an expert evaluator. Your task is to evaluate the relevance of a Context in order to answer a Question. 

    Do not rely on your previous knowledge about the Question. Use only what is written in the Context and in the Question.

    Here's the grading scale you must use:

    0 - If the context does not contain any relevant information to answer the question.
    1 - If the context partially contains relevant information to answer the question.
    2 - If the context contains relevant information to answer the question.

    You must always provide the relevance score of 0, 1, or 2, nothing else.

    Your answer must be a JSON object with the following keys:
    1. "explanation": "<a brief explanation of your rating>",
    2. "rating": "<your rating, which must be one of the following: 0, 1, 2>"
    """
)

user_prompt_context_relevance = dedent(
    """
    **Question:** `{question}`
    **Context:** `{context}`
    """
)

messages_context_relevance = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt_context_relevance),
        ("user", user_prompt_context_relevance),
    ]
)


class ResponseContextRelevance(BaseModel):
    explanation: str
    rating: int


llm = ChatOpenAI(model="gpt-4.1-mini")
llm_with_structured_output_context_relevance = llm.with_structured_output(
    ResponseContextRelevance
)


async def context_relevance(
    inputs: dict, outputs: dict, reference_outputs: dict
) -> float:
    compiled_messages = await messages_context_relevance.ainvoke(
        {
            "question": inputs["question"],
            "context": outputs["context"],
        }
    )
    output = await llm_with_structured_output_context_relevance.ainvoke(
        compiled_messages
    )
    return output.rating / 2

#### Groundedness

Finally, we'll create a judge that evaluates if the generated answer is grounded in the provided context.

These are the prompts:

In [15]:
system_prompt_groundedness = dedent(
    """
    You are an expert evaluator. Your task is to evaluate the groundedness of an assertion against a context. 

    Do not rely on your previous knowledge about the assertion or context. Use only what is written in the assertion and in the context.

    Here's the grading scale you must use:

    0 - If the assertion is not supported by the context. Or, if the context or assertion is empty.
    1 - If the context partially contains relevant information to support the assertion.
    2 - If the context fully supports the assertion.

    You must always provide the groundedness score of 0, 1, or 2, nothing else.

    Your answer must be a JSON object with the following keys:

    1. "explanation": "<a brief explanation of your rating>",
    2. "rating": "<your rating, which must be one of the following: 0, 1, 2>"
    """
)

user_prompt_groundedness = dedent(
    """
    **Assertion:** `{answer}`
    **Context:** `{context}`
    """
)

messages_groundedness = ChatPromptTemplate.from_messages(
    [("system", system_prompt_groundedness), ("user", user_prompt_groundedness)]
)


class ResponseGroundedness(BaseModel):
    explanation: str
    rating: int


llm = ChatOpenAI(model="gpt-4.1-mini")
llm_with_structured_output_groundedness = llm.with_structured_output(
    ResponseGroundedness
)


async def groundedness(inputs: dict, outputs: dict, reference_outputs: dict) -> float:
    compiled_messages = await messages_groundedness.ainvoke(
        {
            "answer": outputs["answer"],
            "context": outputs["context"],
        }
    )
    output = await llm_with_structured_output_groundedness.ainvoke(compiled_messages)
    return output.rating / 2

### Run evaluation

Now we can run the full RAG pipeline and evaluate its results using the LangSmith evaluators.

In [16]:
system_prompt_generation = dedent(
    """
    You're a helpful assistant. Provided with a question and the most relevant documents, you must generate a concise and accurate answer based on the information in those documents.
    """
)

user_prompt_generation = dedent(
    """
    QUESTION: {question}

    RELEVANT DOCUMENTS:
    {documents}
    """
)

messages_generation = ChatPromptTemplate.from_messages(
    [("system", system_prompt_generation), ("user", user_prompt_generation)]
)

llm_generation = ChatOpenAI(
    model="gpt-4o-mini",
)

With this setup, you can compare different configurations of the pipeline. For example, you could vary the number of retrieved documents: I'll run this code with `K` set to 3, 5, and 10, and compare the results.

In [None]:
K = 3

@traceable
async def target(inputs: dict) -> dict:
    relevant_docs = get_similar_docs(inputs["question"], top_k=K)
    formatted_docs = format_docs(relevant_docs)
    messages = await messages_generation.ainvoke(
        {
            "question": inputs["question"],
            "documents": formatted_docs,
        }
    )
    response = await llm_generation.ainvoke(messages)
    return {
        "answer": response.content,
        "docs": [doc.id for doc in relevant_docs],
        "context": formatted_docs,
    }

In [None]:
experiment_results = await langsmith_client.aevaluate(
    target,
    data=dataset_name,
    evaluators=[recall, mrr, answer_accuracy, context_relevance, groundedness],
    max_concurrency=50,
    experiment_prefix=f"top-{K}",
)
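
If you'd rather not edit a global `K` between runs, one alternative is a small factory that builds a target function per value. This is only a sketch of the pattern: `fake_retriever` is a stand-in that returns made-up IDs, and `make_target` is a hypothetical name.

```python
import asyncio


def fake_retriever(question: str, top_k: int) -> list[str]:
    """Stand-in for a real retriever: returns top_k made-up document IDs."""
    return [f"doc-{i}" for i in range(top_k)]


def make_target(k: int):
    """Build a target function that always retrieves k documents."""

    async def target(inputs: dict) -> dict:
        docs = fake_retriever(inputs["question"], top_k=k)
        return {"docs": docs}

    return target


async def main() -> dict[int, int]:
    results = {}
    for k in (3, 5, 10):
        target = make_target(k)  # each closure captures its own k
        output = await target({"question": "How do I request time off?"})
        results[k] = len(output["docs"])
    return results


retrieved_counts = asyncio.run(main())
print(retrieved_counts)  # {3: 3, 5: 5, 10: 10}
```

In the real pipeline, the stub would be replaced by the retrieval and generation steps, and each `make_target(k)` could be passed to the evaluation call with a matching experiment prefix. Inside a notebook, `await main()` is the more natural call than `asyncio.run(main())`.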

## Improving retrieval with a reranker

A quick way to improve your RAG system is to rerank the retrieved documents. After the initial retrieval with vector similarity or keyword search, you add a reranking step that uses a slower, more expensive model to reorder the retrieved documents by their relevance to the user query.

You can use the `sentence-transformers` library to do this. 

In [17]:
cross_encoder = CrossEncoder("mixedbread-ai/mxbai-rerank-xsmall-v1")

Here's an example of how to use the `sentence-transformers` library to rerank the retrieved documents:

In [18]:
query = "What is the process for creating a new learning hub for your team in Level Up at GitLab?"
hits = get_similar_docs(query, top_k=50)
cross_inp = [[query, h.page_content] for h in hits]
reranker_scores = cross_encoder.predict(cross_inp)
# Pair each hit with its score, then sort by score (highest first)
sorted_hits = [
    hit
    for hit, _ in sorted(
        zip(hits, reranker_scores), key=lambda pair: pair[1], reverse=True
    )
]
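
To see the retrieve-then-rerank pattern without downloading a model, here's a toy sketch. The `overlap_score` function is a deliberately naive stand-in for the cross-encoder (it only counts shared words), and the candidate texts are invented.

```python
def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: share of query words that also occur in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Score every candidate against the query and keep the top_n best."""
    scored = [(overlap_score(query, text), text) for text in candidates]
    # Sort on the score only, so ties keep their first-stage order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


# Hypothetical first-stage results, in the order the retriever returned them
candidates = [
    "expense reports are due monthly",
    "create a learning hub in level up",
    "learning hub owners review content",
]
print(rerank("create a learning hub", candidates, top_n=2))
# ['create a learning hub in level up', 'learning hub owners review content']
```

The first stage casts a wide net cheaply, and the second stage spends more compute on a short list. That's why the real code retrieves many more candidates than it finally keeps.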

Now you can do the same evaluation as before, but using the reranked documents.

In [21]:
K = 5

def get_reranked_docs(
    query: str, similar_docs: list[RetrievedDoc]
) -> list[RetrievedDoc]:
    cross_inp = [[query, doc.page_content] for doc in similar_docs]
    reranker_scores = cross_encoder.predict(cross_inp)
    # Sort on the score alone; pairing avoids an `index` lookup per document
    scored_docs = sorted(
        zip(similar_docs, reranker_scores), key=lambda pair: pair[1], reverse=True
    )
    return [doc for doc, _ in scored_docs]


@traceable
async def target_with_reranking(inputs: dict) -> dict:
    relevant_docs = get_similar_docs(inputs["question"], top_k=75)
    reranked_docs = get_reranked_docs(inputs["question"], relevant_docs)[:K]
    formatted_docs = format_docs(reranked_docs)
    messages = await messages_generation.ainvoke(
        {
            "question": inputs["question"],
            "documents": formatted_docs,
        }
    )
    # Use the same generation model as the baseline so the runs are comparable
    response = await llm_generation.ainvoke(messages)
    return {
        "answer": response.content,
        "docs": [doc.id for doc in reranked_docs],
        "context": formatted_docs,
    }

In [22]:
experiment_results = await langsmith_client.aevaluate(
    target_with_reranking,
    data=dataset_name,
    evaluators=[recall, mrr, answer_accuracy, context_relevance, groundedness],
    max_concurrency=50,
    experiment_prefix=f"top-{K}-reranked",
)

View the evaluation results for experiment: 'top-5-reranked-9a0c8901' at:
https://smith.langchain.com/o/197ee903-f183-50ac-ae2c-929c6a09833a/datasets/0d1ea85c-d251-4104-ae51-844b6564e7ad/compare?selectedSessions=846548f7-1a2c-48fb-a646-918bab167425




## Conclusion