# Generating Synthetic Data for Retrieval-Augmented Generation

In this notebook, we'll generate a synthetic dataset for RAG from the SciQ dataset. The source documents are the `support` passages of the first 50 training rows that have one.

In [1]:
!pip install llama-index datasets tqdm python-dotenv spacy


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Setup Environment Variables

In [35]:
from dotenv import load_dotenv

load_dotenv()

True
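
The `True` above only indicates that `load_dotenv()` loaded something from a `.env` file; it does not confirm that the key needed by later cells is present. A minimal sketch of a fail-fast check, assuming the OpenAI client reads its key from `OPENAI_API_KEY` (the helper name `require_env` is hypothetical and not part of the original notebook):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Example usage (assumption: the OpenAI client reads OPENAI_API_KEY):
# api_key = require_env("OPENAI_API_KEY")
```

Failing here is cheaper than discovering a missing key halfway through dataset generation.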

## Download the SciQ Dataset

Note: Parts of this repo are inspired by [this blog post](https://srk.ai/blog/004-ai-llm-retrieval-eval-llamaindex) on evaluating the performance of RAG systems.

In [36]:
from datasets import load_dataset

dataset = load_dataset("sciq")
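
Each SciQ row carries a question, answer options, and a `support` passage, and the notebook only keeps rows whose `support` is non-empty. A small sketch of that filter on plain dictionaries, so it runs without downloading anything (the rows in `toy_rows` are made up for illustration, and `rows_with_support` is a hypothetical helper):

```python
from typing import Iterable, Iterator


def rows_with_support(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield only the rows that have a non-empty `support` passage."""
    for row in rows:
        if row.get("support", "").strip():
            yield row


# Illustrative rows that mimic the shape of SciQ records (not real data).
toy_rows = [
    {"question": "What do birds lay?", "correct_answer": "eggs",
     "support": "Birds reproduce by laying eggs."},
    {"question": "What is H2O?", "correct_answer": "water", "support": ""},
]

kept = list(rows_with_support(toy_rows))
print(len(kept))  # 1
```

The same function should accept `dataset["train"]` directly, since iterating a split yields one dictionary per row.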

In [37]:
import hashlib
import os

# Create a directory to store the content
documents_folder = os.path.join(os.path.abspath(""), ".content/docs/")
os.makedirs(documents_folder, exist_ok=True)


NUM_DOCUMENTS = 50


# Save a passage to a text file named after the MD5 checksum of its content
def save_article_content(text, folder):
    # The checksum gives each distinct passage a stable file name
    checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
    file_path = os.path.join(folder, checksum + ".txt")
    # Let I/O errors propagate so that an error message is never
    # mistaken for a saved file path by the caller
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text)
    return file_path


# Save the support passage of each row until NUM_DOCUMENTS passages have been saved
saved_files = []
for row in dataset["train"]:
    if row["support"]:
        saved_files.append(save_article_content(row["support"], documents_folder))
    if len(saved_files) >= NUM_DOCUMENTS:
        break
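
Because file names are derived from an MD5 checksum of the text, two identical passages map to the same file. The loop above counts entries in `saved_files`, so a repeated passage would count twice toward `NUM_DOCUMENTS` while producing a single file on disk. A minimal sketch of deduplicating by content before saving (the helper names and the sample passages are hypothetical):

```python
import hashlib


def content_filename(text: str) -> str:
    """Build the same checksum-based file name the notebook uses."""
    return hashlib.md5(text.encode("utf-8")).hexdigest() + ".txt"


def unique_by_content(texts):
    """Keep the first occurrence of each distinct passage, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        name = content_filename(text)
        if name not in seen:
            seen.add(name)
            unique.append(text)
    return unique


# Illustrative passages; the first and last are identical.
passages = [
    "Birds are endothermic.",
    "Mitosis divides the nucleus.",
    "Birds are endothermic.",
]
print(len(unique_by_content(passages)))  # 2
```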

## Generate Synthetic RAG Data From the Documents

In [38]:
# Load the saved SciQ passages and build a vector index over them.
from glob import glob

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.readers import SimpleDirectoryReader

input_files = glob(os.path.join(documents_folder, "*.txt"))
documents = SimpleDirectoryReader(input_files=input_files).load_data()
service_context = ServiceContext.from_defaults(chunk_size=512, chunk_overlap=50)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
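
The `chunk_size=512` and `chunk_overlap=50` settings control how documents are split into nodes: each chunk holds up to 512 tokens, and consecutive chunks share 50 of them. The sketch below illustrates the sliding-window idea on a plain token list. It is not LlamaIndex's splitter, which works with a real tokenizer and tries to respect sentence boundaries; `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into fixed-size chunks that overlap by a set amount."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


print(chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.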

### Generate RAG Dataset

In [39]:
# Generate questions (and reference answers) against document chunks
from llama_index import ServiceContext
from llama_index.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms import OpenAI

# Configure the LLM used to generate the questions and answers
gpt_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3))

# Instantiate a RagDatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    service_context=gpt_context,
    num_questions_per_chunk=1,  # number of questions to generate per node (chunk)
    show_progress=True,
)

Parsing nodes: 100%|██████████| 50/50 [00:00<00:00, 1261.16it/s]


In [40]:
# Allow nested event loops so LlamaIndex's async generation can run inside the notebook
import nest_asyncio

nest_asyncio.apply()

rag_dataset = dataset_generator.generate_dataset_from_nodes()
df = rag_dataset.to_pandas()

100%|██████████| 50/50 [00:25<00:00,  1.95it/s]
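
The `nest_asyncio.apply()` call matters because a notebook kernel already runs an event loop, and stock `asyncio` refuses to start another one inside it. A minimal standard-library sketch of that failure mode follows. The coroutine names are hypothetical, and whether `generate_dataset_from_nodes` hits this path depends on the LlamaIndex version, so treat it as an illustration rather than a description of the library's internals.

```python
import asyncio


async def generate_one() -> str:
    """Stand-in for an async LLM call (hypothetical)."""
    return "question"


async def running_loop_context() -> str:
    """Simulate code that is already inside an event loop, as in a notebook."""
    coro = generate_one()
    try:
        # asyncio.run() tries to start a new loop, which is not allowed here.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(running_loop_context())
print(message)
```

Patching the loop with `nest_asyncio` lets such nested calls proceed instead of raising.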

In [43]:
df.head()

| | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | What are some characteristics of birds that di... | [Birds are endothermic tetrapod vertebrates. T... | Some characteristics of birds that distinguish... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 1 | How do birds reproduce, and what is unique abo... | [Birds are endothermic tetrapod vertebrates. T... | Birds reproduce by laying amniotic eggs. These... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 2 | Why are birds considered the most numerous ver... | [Birds are endothermic tetrapod vertebrates. T... | Birds are considered the most numerous vertebr... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 3 | What factors have contributed to the rapid inc... | [Birds are endothermic tetrapod vertebrates. T... | The traits that have contributed to the rapid ... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 4 | How do birds move, and what is the significanc... | [Birds are endothermic tetrapod vertebrates. T... | Birds move by walking or hopping on two legs, ... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
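
Since both the questions and the reference answers are generated by an LLM, it can help to check for empty fields before saving. A minimal sketch using the column names shown above, operating on plain dictionaries (`find_incomplete_rows` is a hypothetical helper, and the sample rows are made up):

```python
REQUIRED_FIELDS = ("query", "reference_contexts", "reference_answer")


def is_blank(value) -> bool:
    """Treat None, whitespace-only strings, and empty lists as blank."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    return len(value) == 0


def find_incomplete_rows(rows):
    """Return (row_index, field_name) pairs for every blank required field."""
    problems = []
    for index, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            if is_blank(row.get(field)):
                problems.append((index, field))
    return problems


# Illustrative rows; the second one is missing a query and its contexts.
sample_rows = [
    {"query": "How do birds move?", "reference_contexts": ["Birds walk."],
     "reference_answer": "By walking."},
    {"query": " ", "reference_contexts": [], "reference_answer": "Unknown."},
]
print(find_incomplete_rows(sample_rows))
# [(1, 'query'), (1, 'reference_contexts')]
```

With the DataFrame from this notebook, something like `find_incomplete_rows(df.to_dict("records"))` should work, assuming `reference_contexts` holds Python lists.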


### Save the Dataset in Hugging Face Format

In [44]:
from datasets import Dataset

# Note: this reuses the names `documents_folder` and `dataset` from earlier cells
documents_folder = os.path.join(os.path.abspath(""), ".content")
dataset = Dataset.from_pandas(df)
print(f"Dataset contains {len(dataset)} rows")
dataset.save_to_disk(os.path.join(documents_folder, "rag_sciq_data.hf"))

Dataset contains 393 rows


Saving the dataset (1/1 shards): 100%|██████████| 393/393 [00:00<00:00, 45612.97 examples/s]
