# RAG Evaluation
_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_

This notebook demonstrates how you can evaluate your RAG (Retrieval Augmented Generation) system by building a synthetic evaluation dataset and using LLM-as-a-judge to compute its accuracy.

For an introduction to RAG, you can check [this other cookbook](rag_zephyr_langchain)!

RAG systems are complex: here is a RAG diagram, where we have marked in blue all the possibilities for system enhancement:

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_workflow.png" height="700">

Implementing any of these improvements can bring a huge performance boost; but changing anything is useless if you cannot monitor the impact of your changes on the system's performance!
So let's see how to evaluate our RAG system.

### Evaluating RAG performance

Since there are so many moving parts to tune with a big impact on performance, benchmarking the RAG system is crucial.

For our evaluation pipeline, we will need:
1. An evaluation dataset with question-answer couples (QA couples)
2. An evaluator to compute the accuracy of our system on the above evaluation dataset.

➡️ It turns out, we can use LLMs to help us all along the way!
1. The evaluation dataset will be synthetically generated by an LLM 🤖, and questions will be filtered out by other LLMs 🤖
2. An [LLM-as-a-judge](https://huggingface.co/papers/2306.05685) agent 🤖 will then perform the evaluation on this synthetic dataset.

__Let's dig into it and start building our evaluation pipeline!__ First, we install the required model dependencies.

In [None]:
# !pip install -q torch transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets langchain-community ragatouille

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
from tqdm.auto import tqdm
import pandas as pd
from typing import Optional, List, Tuple
import json
import datasets

pd.set_option("display.max_colwidth", None)

In [5]:
from huggingface_hub import notebook_login

notebook_login()


### Load your knowledge base

In [6]:
ds = datasets.load_dataset(
    "m-ric/huggingface_doc", split="train"
)

In [7]:
print(ds)

Dataset({
    features: ['text', 'source'],
    num_rows: 2647
})


# 1. Build a synthetic dataset for evaluation
We first build a synthetic dataset of questions and associated contexts. The method is to get elements from our knowledge base, and ask an LLM to generate questions based on these documents.

Then we set up other LLM agents to act as quality filters for the generated QA couples: each of them will act as the filter for a specific flaw.

### 1.1. Prepare source documents

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument

langchain_docs = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in tqdm(ds)
]


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])


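### 1.2. Set up agents for question generation

This recipe names [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) as the model for QA couple generation because it has excellent performance in leaderboards such as [Chatbot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard). The cell below, however, instantiates `gpt-4.1` through LangChain's `ChatOpenAI` client; any capable chat model can be plugged into `call_llm` the same way.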

In [11]:
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

from dotenv import load_dotenv

load_dotenv()

# Create the LLM client
llm = ChatOpenAI(
    model="gpt-4.1",
    temperature=0,
    max_tokens=500,
)


def call_llm(llm_client, prompt: str):
    response = llm_client.invoke(
        [
            SystemMessage(content="You are a helpful assistant."),
            HumanMessage(content=prompt),
        ]
    )
    return response.content


print(call_llm(llm, "This is a test context"))

Thank you for providing the test context! How can I assist you further?


In [12]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

Now let's generate our QA couples.
For this example, we generate only 30 QA couples and will load the rest from the Hub.

But for your specific knowledge base, given that you want at least ~100 test samples, and accounting for the fact that our critique agents will later filter out around half of them, you should generate many more: upwards of 200 samples.
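The sizing rule above is simple arithmetic, and it can be sketched as a small helper. The function name `n_generations_needed` and the `expected_keep_rate` parameter are hypothetical, and the keep rate of one half is only the rough estimate quoted in the text, so measure it on your own data before relying on it.

```python
import math


def n_generations_needed(target_samples: int, expected_keep_rate: float) -> int:
    """Number of QA couples to generate so that about `target_samples` survive filtering."""
    if not 0 < expected_keep_rate <= 1:
        raise ValueError("expected_keep_rate must be in (0, 1]")
    # Round up: generating slightly too many is cheaper than ending up short
    return math.ceil(target_samples / expected_keep_rate)


# ~100 final samples, with roughly half of the questions filtered out
print(n_generations_needed(100, 0.5))  # 200
```

If the critique agents turn out to be stricter on your knowledge base, lower `expected_keep_rate` accordingly and the required number of generations grows in proportion.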

In [14]:
import random
from tqdm import tqdm

N_GENERATIONS = 30  # keep it low for testing

print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
for sampled_context in tqdm(
    random.sample(docs_processed, min(N_GENERATIONS, len(docs_processed)))
):
    output_QA_couple = call_llm(
        llm, QA_generation_prompt.format(context=sampled_context.page_content)
    )
    try:
        question = (
            output_QA_couple.split("Factoid question: ")[-1]
            .split("Answer: ")[0]
            .strip()
        )
        answer = output_QA_couple.split("Answer: ")[-1].strip()
        assert len(answer) < 300, "Answer is too long"
        outputs.append(
            {
                "context": sampled_context.page_content,
                "question": question,
                "answer": answer,
                "source_doc": sampled_context.metadata.get("source"),
            }
        )
    except Exception:
        continue

Generating 30 QA couples...


100%|██████████| 30/30 [00:35<00:00,  1.17s/it]


In [15]:
display(pd.DataFrame(outputs).head(5))

Unnamed: 0,context,question,answer,source_doc
0,"<div class=""flex justify-center"">\n<img class=""block dark:hidden"" src=""https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter6/IOB_versions.svg"" alt=""IOB1 vs IOB2 format""/>\n<img class=""hidden dark:block"" src=""https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter6/IOB_versions-dark.svg"" alt=""IOB1 vs IOB2 format""/>\n</div>\n\nWith this map, we are ready to reproduce (almost entirely) the results of the first pipeline -- we can just grab the score and label of each token that was not classified as `O`:\n\n```py\nresults = []\ntokens = inputs.tokens()\n\nfor idx, pred in enumerate(predictions):\n label = model.config.id2label[pred]\n if label != ""O"":\n results.append(\n {""entity"": label, ""score"": probabilities[idx][pred], ""word"": tokens[idx]}\n )\n\nprint(results)\n```\n\n```python out\n[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S'},\n {'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl'},\n {'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va'},\n {'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in'},\n {'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu'},\n {'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging'},\n {'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face'},\n {'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn'}]\n```\n\nThis is very similar to what we had before, with one exception: the pipeline also gave us information about the `start` and `end` of each entity in the original sentence. This is where our offset mapping will come into play. 
To get the offsets, we just have to set `return_offsets_mapping=True` when we apply the tokenizer to our inputs:\n\n```py\ninputs_with_offsets = tokenizer(example, return_offsets_mapping=True)\ninputs_with_offsets[""offset_mapping""]\n```",What does setting return_offsets_mapping=True do when applying the tokenizer to inputs?,It provides the start and end offsets of each token in the original sentence.,huggingface/course/blob/main/chapters/en/chapter6/3.mdx
1,"Being similar to LoRA, IA3 carries many of the same advantages: \n\n* IA3 makes fine-tuning more efficient by drastically reducing the number of trainable parameters. (For T0, an IA3 model only has about 0.01% trainable parameters, while even LoRA has > 0.1%)\n* The original pre-trained weights are kept frozen, which means you can have multiple lightweight and portable IA3 models for various downstream tasks built on top of them.\n* Performance of models fine-tuned using IA3 is comparable to the performance of fully fine-tuned models.\n* IA3 does not add any inference latency because adapter weights can be merged with the base model.\n\nIn principle, IA3 can be applied to any subset of weight matrices in a neural network to reduce the number of trainable\nparameters. Following the authors' implementation, IA3 weights are added to the key, value and feedforward layers\nof a Transformer model. To be specific, for transformer models, IA3 weights are added to the outputs of key and value layers, and to the input of the second feedforward layer\nin each transformer block.\n\nGiven the target layers for injecting IA3 parameters, the number of trainable parameters\ncan be determined based on the size of the weight matrices.\n\n\n## Common IA3 parameters in PEFT\n\nAs with other methods supported by PEFT, to fine-tune a model using IA3, you need to:\n\n1. Instantiate a base model.\n2. Create a configuration (`IA3Config`) where you define IA3-specific parameters.\n3. Wrap the base model with `get_peft_model()` to get a trainable `PeftModel`.\n4. 
Train the `PeftModel` as you normally would train the base model.\n\n`IA3Config` allows you to control how IA3 is applied to the base model through the following parameters:",To which layers of a Transformer model are IA3 weights added?,"IA3 weights are added to the outputs of the key and value layers, and to the input of the second feedforward layer in each transformer block.",huggingface/peft/blob/main/docs/source/conceptual_guides/ia3.md
2,"Here's our chat function:\n\n```python\nimport random\n\ndef random_response(message, history):\n return random.choice([""Yes"", ""No""])\n```\n\nNow, we can plug this into `gr.ChatInterface()` and call the `.launch()` method to create the web interface:\n\n```python\nimport gradio as gr\n\ngr.ChatInterface(random_response).launch()\n```\n\nThat's it! Here's our running demo, try it out:\n\n$demo_chatinterface_random_response\n\n## Another example using the user's input and history\n\nOf course, the previous example was very simplistic, it didn't even take user input or the previous history into account! Here's another simple example showing how to incorporate a user's input as well as the history.\n\n```python\nimport random\nimport gradio as gr\n\ndef alternatingly_agree(message, history):\n if len(history) % 2 == 0:\n return f""Yes, I do think that '{message}'""\n else:\n return ""I don't think so""\n\ngr.ChatInterface(alternatingly_agree).launch()\n```\n\n## Streaming chatbots\n\nIf in your chat function, you use `yield` to generate a sequence of responses, you'll end up with a streaming chatbot. It's that simple!\n\n```python\nimport time\nimport gradio as gr\n\ndef slow_echo(message, history):\n for i in range(len(message)):\n time.sleep(0.3)\n yield ""You typed: "" + message[: i+1]\n\ngr.ChatInterface(slow_echo).launch()\n```\n\nNotice that we've [enabled queuing](/guides/key-features#queuing), which is required to use generator functions. While the response is streaming, the ""Submit"" button turns into a ""Stop"" button that can be used to stop the generator function. You can customize the appearance and behavior of the ""Stop"" button using the `stop_btn` parameter.\n\n## Customizing your chatbot\n\nIf you're familiar with Gradio's `Interface` class, the `gr.ChatInterface` includes many of the same arguments that you can use to customize the look and feel of your Chatbot. 
For example, you can:",What Python library is used to create the chat interface in the provided examples?,gradio,gradio-app/gradio/blob/main/guides/04_chatbots/01_creating-a-chatbot-fast.md
3,"Xception\n\n**Xception** is a convolutional neural network architecture that relies solely on [depthwise separable convolution layers](https://paperswithcode.com/method/depthwise-separable-convolution).\n\nThe weights from this model were ported from [Tensorflow/Models](https://github.com/tensorflow/models).\n\n## How do I use this model on an image?\nTo load a pretrained model:\n\n```python\nimport timm\nmodel = timm.create_model('xception', pretrained=True)\nmodel.eval()\n```\n\nTo load and preprocess the image:\n```python \nimport urllib\nfrom PIL import Image\nfrom timm.data import resolve_data_config\nfrom timm.data.transforms_factory import create_transform\n\nconfig = resolve_data_config({}, model=model)\ntransform = create_transform(**config)\n\nurl, filename = (""https://github.com/pytorch/hub/raw/master/images/dog.jpg"", ""dog.jpg"")\nurllib.request.urlretrieve(url, filename)\nimg = Image.open(filename).convert('RGB')\ntensor = transform(img).unsqueeze(0) # transform and add batch dimension\n```\n\nTo get the model predictions:\n```python\nimport torch\nwith torch.no_grad():\n out = model(tensor)\nprobabilities = torch.nn.functional.softmax(out[0], dim=0)\nprint(probabilities.shape)\n# prints: torch.Size([1000])\n```\n\nTo get the top-5 predictions class names:\n```python\n# Get imagenet class mappings\nurl, filename = (""https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"", ""imagenet_classes.txt"")\nurllib.request.urlretrieve(url, filename) \nwith open(""imagenet_classes.txt"", ""r"") as f:\n categories = [s.strip() for s in f.readlines()]\n\n# Print top categories per image\ntop5_prob, top5_catid = torch.topk(probabilities, 5)\nfor i in range(top5_prob.size(0)):\n print(categories[top5_catid[i]], top5_prob[i].item())\n# prints class names and probabilities like:\n# [('Samoyed', 0.6425196528434753), ('Pomeranian', 0.04062102362513542), ('keeshond', 0.03186424449086189), ('white wolf', 0.01739676296710968), ('Eskimo dog', 
0.011717947199940681)]\n```",What type of convolutional layers does the Xception neural network architecture use exclusively?,Depthwise separable convolution layers,huggingface/pytorch-image-models/blob/main/docs/models/xception.md
4,". To make sense of those logits, we need to dig into the third and last step of the pipeline: post-processing. To convert logits into probabilities, we need to apply a SoftMax layer to them. As we can see, this transforms them into positive numbers that sum up to 1. The last step is to know which of those corresponds to the positive or the negative label. This is given by the id2label field of the model config. The first probabilities (index 0) correspond to the negative label, and the seconds (index 1) correspond to the positive label. This is how our classifier built with the pipeline function picked those labels and computed those scores. Now that you know how each steps works, you can easily tweak them to your needs.",What function is used to convert logits into probabilities in the described pipeline?,SoftMax,huggingface/course/blob/main/subtitles/en/raw/chapter2/02_inside-pipeline-pt.md
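The generation loop above extracts the question and answer by splitting the model output on the `Factoid question: ` and `Answer: ` markers. As an alternative, here is a minimal sketch of the same extraction with a regular expression, which makes the failure case explicit by returning `None` instead of raising. The names `QA_PATTERN` and `parse_qa_output` are hypothetical, and the 300-character limit simply mirrors the assertion in the loop.

```python
import re
from typing import Optional, Tuple

# Matches the layout requested by QA_generation_prompt:
#   Factoid question: ...
#   Answer: ...
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)",
    re.DOTALL,
)


def parse_qa_output(text: str, max_answer_chars: int = 300) -> Optional[Tuple[str, str]]:
    """Return (question, answer), or None when the output does not follow the format."""
    match = QA_PATTERN.search(text)
    if match is None:
        return None
    question = match.group("question").strip()
    answer = match.group("answer").strip()
    # Same length guard as the generation loop: drop overly long answers
    if not question or not answer or len(answer) >= max_answer_chars:
        return None
    return question, answer


sample = "Output:::\nFactoid question: What is RAG?\nAnswer: Retrieval Augmented Generation."
print(parse_qa_output(sample))
```

With a parser like this, the loop could skip a sample when the result is `None` rather than relying on a broad `except Exception`.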


### 1.3. Set up critique agents

The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.

We thus build critique agents that will rate each question on several criteria, given in [this paper](https://huggingface.co/papers/2312.10003):
- **Groundedness:** can the question be answered from the given context?
- **Relevance:** is the question relevant to users? For instance, `"What is the date when transformers 4.29.1 was released?"` is not relevant for ML practitioners.

One last failure case we've noticed is when a question is tailored for the particular setting where it was generated, but undecipherable by itself, like `"What is the name of the function used in this guide?"`.
We also build a critique agent for this criterion:
- **Stand-alone**: is the question understandable free of any context, for someone with domain knowledge/Internet access? The opposite of this would be `What is the function used in this article?` for a question generated from a specific blog article.

We systematically score questions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.

💡 ___When asking the agents to output a score, we first ask them to produce their rationale. This will help us verify scores, but most importantly, asking a model to first output its rationale gives it more tokens to think and elaborate an answer before summarizing it into a single score token.___

We now build and run these critique agents.

In [16]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """
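All three prompts above request the same `Evaluation:` / `Total rating:` layout, so a single parser can serve every critique agent. The following is a minimal sketch under that assumption. `CRITIQUE_PATTERN` and `parse_critique` are hypothetical names, and the pattern only accepts a single digit from 1 to 5, matching the scale stated in the prompts.

```python
import re
from typing import Optional, Tuple

# Rationale first, then the score, as requested in the critique prompts
CRITIQUE_PATTERN = re.compile(
    r"Evaluation:\s*(?P<rationale>.*?)\s*Total rating:\s*(?P<score>[1-5])\b",
    re.DOTALL,
)


def parse_critique(text: str) -> Optional[Tuple[int, str]]:
    """Return (score, rationale), or None when the output does not follow the format."""
    match = CRITIQUE_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group("score")), match.group("rationale").strip()


sample = "Evaluation: The context states the answer directly.\nTotal rating: 5"
print(parse_critique(sample))
```

Because the pattern stops at the first digit, a reply such as `Total rating: 4/5` still parses, whereas calling `int()` on everything after the marker would raise on it.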

In [17]:
print("Generating critique for each QA couple...")
for output in tqdm(outputs):
    evaluations = {
        "groundedness": call_llm(
            llm,
            question_groundedness_critique_prompt.format(
                context=output["context"], question=output["question"]
            ),
        ),
        "relevance": call_llm(
            llm,
            question_relevance_critique_prompt.format(question=output["question"]),
        ),
        "standalone": call_llm(
            llm,
            question_standalone_critique_prompt.format(question=output["question"]),
        ),
    }
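    try:
        for criterion, evaluation in evaluations.items():
            # Text after the last "Total rating: " marker is the score;
            # text between "Evaluation: " and that marker is the rationale.
            score, rationale = (
                int(evaluation.split("Total rating: ")[-1].strip()),
                evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
            )
            output.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": rationale,
                }
            )
    except Exception:
        # Skip the remaining criteria for this QA couple if parsing fails
        continue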

Generating critique for each QA couple...


100%|██████████| 30/30 [02:30<00:00,  5.01s/it]


Now let us filter out bad questions based on our critique agent scores:

In [18]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(outputs)

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]
print("============================================")
print("Final evaluation dataset:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)

eval_dataset = datasets.Dataset.from_pandas(
    generated_questions, split="train", preserve_index=False
)

Evaluation dataset before filtering:


| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 0 | What does setting return_offsets_mapping=True do when applying the tokenizer to inputs? | It provides the start and end offsets of each token in the original sentence. | 4 | 5 | 5 |
| 1 | To which layers of a Transformer model are IA3 weights added? | IA3 weights are added to the outputs of the key and value layers, and to the input of the second feedforward layer in each transformer block. | 5 | 4 | 5 |
| 2 | What Python library is used to create the chat interface in the provided examples? | gradio | 5 | 3 | 1 |
| 3 | What type of convolutional layers does the Xception neural network architecture use exclusively? | Depthwise separable convolution layers | 5 | 1 | 5 |
| 4 | What function is used to convert logits into probabilities in the described pipeline? | SoftMax | 5 | 4 | 1 |
| 5 | What is the default value of the temperature parameter in the generation payload? | 1.0 | 5 | 4 | 1 |
| 6 | What is the name of the new sentiment analysis task introduced for the e-commerce domain in the benchmark? | Allegro Reviews (AR) | 5 | 3 | 1 |
| 7 | What is the expected validation accuracy after fine-tuning the model using the provided script? | 99% | 5 | 2 | 1 |
| 8 | Under what license are the Hugging Face Transformers Notebooks released? | Apache License, Version 2.0 | 5 | 4 | 5 |
| 9 | What activation function does EfficientNet-Lite use to make it more suitable for mobile devices? | ReLU6 | 5 | 2 | 5 |


Final evaluation dataset:


| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 0 | What does setting return_offsets_mapping=True do when applying the tokenizer to inputs? | It provides the start and end offsets of each token in the original sentence. | 4 | 5 | 5 |
| 1 | To which layers of a Transformer model are IA3 weights added? | IA3 weights are added to the outputs of the key and value layers, and to the input of the second feedforward layer in each transformer block. | 5 | 4 | 5 |
| 8 | Under what license are the Hugging Face Transformers Notebooks released? | Apache License, Version 2.0 | 5 | 4 | 5 |
| 10 | What optimizer should be used with ORTSeq2SeqTrainingArguments for ONNX Runtime training? | adamw_ort_fused | 5 | 4 | 5 |
| 15 | What does the 'multi_class' parameter specify when used with multiclass targets? | The 'multi_class' parameter determines the type of configuration to use for multiclass targets, with options 'ovr' (one-vs-rest) and 'ovo' (one-vs-one). | 5 | 4 | 5 |
| 16 | What command-line argument should you use with torchrun to select the number of GPUs for training? | --nproc_per_node | 5 | 5 | 5 |
| 19 | What is the function used to perform a forward pass in Data2VecVisionForSemanticSegmentation? | forward | 5 | 5 | 5 |
| 24 | Which command-line flag ensures that training runs are tracked on Weights and Biases? | report_to="wandb" | 5 | 4 | 5 |


Now our synthetic evaluation dataset is complete! We can evaluate different RAG systems on this evaluation dataset.
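The pandas filter above keeps a question only when all three critique scores are at least 4. The same rule can be sketched on the plain `outputs` list, which also gives you the rejected questions for inspection. Note that the save cell further down looks for lists named `outputs_filtered` and `outputs_rejected`, which the notebook never defines. The helper `split_by_scores` and the `SCORE_KEYS` constant are hypothetical names, and treating a missing score as a rejection is an assumption on our part.

```python
from typing import Dict, List, Tuple

SCORE_KEYS = ("groundedness_score", "relevance_score", "standalone_score")


def split_by_scores(records: List[Dict], min_score: int = 4) -> Tuple[List[Dict], List[Dict]]:
    """Split QA records into (kept, rejected) using the all-scores-above-threshold rule."""
    kept, rejected = [], []
    for record in records:
        scores = [record.get(key) for key in SCORE_KEYS]
        # A record passes only if every score is present and high enough
        passes = all(isinstance(s, int) and s >= min_score for s in scores)
        (kept if passes else rejected).append(record)
    return kept, rejected


toy = [
    {"question": "q1", "groundedness_score": 5, "relevance_score": 4, "standalone_score": 5},
    {"question": "q2", "groundedness_score": 5, "relevance_score": 3, "standalone_score": 1},
    {"question": "q3", "groundedness_score": 5},  # critique parsing failed part-way
]
outputs_filtered, outputs_rejected = split_by_scores(toy)
print(len(outputs_filtered), len(outputs_rejected))  # 1 2
```

Reading through the rejected list is a cheap way to check whether the threshold of 4 is too strict or too lenient for your own knowledge base.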

We have generated only a few QA couples here to reduce time and cost, so let's kickstart the next part by loading a pre-generated dataset:

In [19]:
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")

Generating train split:   0%|          | 0/65 [00:00<?, ? examples/s]

## Save datasets

In [25]:
# save datasets locally

import os, json, csv, datetime


# helpers
def to_serializable(x):
    try:
        json.dumps(x)
        return x
    except TypeError:
        if hasattr(x, "page_content"):
            return {
                "page_content": getattr(x, "page_content", None),
                "metadata": getattr(x, "metadata", {}),
            }
        if isinstance(x, dict):
            return {k: to_serializable(v) for k, v in x.items()}
        if isinstance(x, (list, tuple, set)):
            return [to_serializable(i) for i in x]
        if hasattr(x, "__dict__"):
            return {k: to_serializable(v) for k, v in vars(x).items()}
        return repr(x)


def write_jsonl(path, records):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(to_serializable(r), ensure_ascii=False) + "\n")


def write_csv(path, records):
    if not records or not isinstance(records[0], dict):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8", newline="") as f:
            pass
        return
    keys = sorted({k for r in records for k in r.keys()})
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=keys)
        w.writeheader()
        for r in records:
            w.writerow({k: r.get(k, "") for k in keys})


# output directory
save_dir = f"datasets_local/{datetime.datetime.now():%Y%m%d_%H%M%S}"
os.makedirs(save_dir, exist_ok=True)

# initial corpus (try common variable names; `langchain_docs` is the one defined in this notebook)
for name in (
    "initial_corpus",
    "raw_corpus",
    "raw_docs",
    "input_docs",
    "docs_raw",
    "langchain_docs",
):
    if name in globals():
        write_jsonl(os.path.join(save_dir, "initial_corpus.jsonl"), globals()[name])
        break

# processed docs
if "docs_processed" in globals():
    write_jsonl(os.path.join(save_dir, "processed_docs.jsonl"), docs_processed)

# qa datasets
if "outputs" in globals():
    write_jsonl(os.path.join(save_dir, "qa_all.jsonl"), outputs)
    if outputs and isinstance(outputs[0], dict):
        write_csv(os.path.join(save_dir, "qa_all.csv"), outputs)

if "outputs_filtered" in globals():
    write_jsonl(os.path.join(save_dir, "qa_filtered.jsonl"), outputs_filtered)
    if outputs_filtered and isinstance(outputs_filtered[0], dict):
        write_csv(os.path.join(save_dir, "qa_filtered.csv"), outputs_filtered)

if "outputs_rejected" in globals():
    write_jsonl(os.path.join(save_dir, "qa_rejected.jsonl"), outputs_rejected)

print(save_dir)

datasets_local/20250913_120022


In [26]:
# load datasets from a saved folder

import os, json


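def read_jsonl(path):
    # Return an empty list when the file was never written by the save cell
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        # Skip blank lines so a stray empty line does not break json.loads
        return [json.loads(line) for line in f if line.strip()]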


# set folder produced by the save cell
load_dir = "datasets_local/20250913_120022"

initial_corpus = read_jsonl(os.path.join(load_dir, "initial_corpus.jsonl"))
docs_processed = read_jsonl(os.path.join(load_dir, "processed_docs.jsonl"))
outputs = read_jsonl(os.path.join(load_dir, "qa_all.jsonl"))
outputs_filtered = read_jsonl(os.path.join(load_dir, "qa_filtered.jsonl"))
outputs_rejected = read_jsonl(os.path.join(load_dir, "qa_rejected.jsonl"))
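One caveat with the load cell above: `read_jsonl` returns plain dictionaries, so the reloaded `docs_processed` no longer holds LangChain documents, and code such as `sampled_context.page_content` would fail on it. Here is a minimal sketch of converting the records back, assuming each record has the `page_content` and `metadata` keys written by `to_serializable`. `restore_documents` and its `doc_cls` parameter are hypothetical names, and `SimpleNamespace` is only a stand-in so the example runs without LangChain.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


def restore_documents(
    records: List[Dict[str, Any]],
    doc_cls: Callable[..., Any] = SimpleNamespace,
) -> List[Any]:
    """Rebuild document objects from the dicts stored in processed_docs.jsonl."""
    return [
        # doc_cls must accept `page_content` and `metadata` keyword arguments
        doc_cls(page_content=record["page_content"], metadata=record.get("metadata", {}))
        for record in records
    ]


records = [{"page_content": "hello", "metadata": {"source": "a.md"}}]
docs = restore_documents(records)
print(docs[0].page_content, docs[0].metadata["source"])
```

In the notebook itself you would pass the class imported in section 1.1, for example `restore_documents(docs_processed, doc_cls=LangchainDocument)`, since it is constructed there with the same two keyword arguments.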