# Generating Synthetic Data for Retrieval-Augmented Generation

In this notebook, we'll generate a synthetic dataset for RAG from Wikipedia articles about three Bay Area cities: San Francisco, Oakland, and San Jose.

In [9]:
!pip install llama-index wikipedia tqdm python-dotenv spacy datasets

## Set Up Environment Variables

In [2]:
from dotenv import load_dotenv

load_dotenv()

True
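
The generation step later in the notebook calls an OpenAI model, so the `.env` file presumably holds an API key. Below is a minimal sketch for failing early when the key is missing. It assumes the key is stored under `OPENAI_API_KEY`, the variable the OpenAI client reads by default. `require_env` is a hypothetical helper, not part of python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Hypothetical error message; adjust to wherever your secrets live
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` surfaces a missing key before any API request is made.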

## Save Articles from Wikipedia

First, we'll download the Wikipedia page for each city and save it as a text file in a local `.content` folder.

In [4]:
popular_cities = ["San Francisco", "Oakland, California", "San Jose, California"]

In [6]:
import os

import wikipedia

# Create a directory to store the content
documents_folder = os.path.join(os.path.abspath(""), ".content")
os.makedirs(documents_folder, exist_ok=True)


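# Save a city's Wikipedia article to a text file; return the path, or None on failure
def save_article_content(city, folder):
    try:
        # Fetch the content of the city's Wikipedia page
        content = wikipedia.page(city).content
        file_path = os.path.join(folder, city.replace(" ", "_") + ".txt")
        # Write as UTF-8 so non-ASCII characters don't depend on the platform default
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(content)
        return file_path
    except Exception as e:
        # Report the failure instead of returning the error text in place of a path
        print(f"Could not save article for {city!r}: {e}")
        return None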


# Saving the content of each city to the folder
saved_files = [save_article_content(city, documents_folder) for city in popular_cities]
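
Because the list comprehension collects whatever the function returns, it can help to confirm which entries are real files before moving on. Below is a minimal sketch. `split_saved` is a hypothetical helper that is not part of the notebook.

```python
import os


def split_saved(results):
    """Split results into (paths that exist on disk, everything else)."""
    saved, failed = [], []
    for result in results:
        # Only strings that point at an existing file count as saved
        if isinstance(result, str) and os.path.isfile(result):
            saved.append(result)
        else:
            failed.append(result)
    return saved, failed
```

For example, `saved, failed = split_saved(saved_files)` leaves any fetch errors in `failed` for inspection.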

## Generate Synthetic RAG Data From the Documents

In [11]:
# We'll load data about the histories of some cities from Wikipedia.
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.readers import WikipediaReader

documents = WikipediaReader().load_data(pages=[f"History of {x}" for x in popular_cities])
service_context = ServiceContext.from_defaults(chunk_size=512, chunk_overlap=50)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
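
The `chunk_size=512` and `chunk_overlap=50` settings control how each document is split into nodes before indexing. The sketch below illustrates the general idea of overlapping windows, using whitespace-separated words as stand-in tokens. It is a simplification and not LlamaIndex's actual splitter, which counts tokenizer tokens and tries to respect sentence boundaries. `chunk_words` is a hypothetical name.

```python
def chunk_words(text, chunk_size=512, chunk_overlap=50):
    """Split text into word windows where consecutive windows share words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks
```

With a larger overlap, text near a chunk boundary appears in both neighbouring chunks, at the cost of producing more chunks overall.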

### Generate RAG Dataset

In [12]:
# generate questions against chunks
from llama_index import ServiceContext
from llama_index.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms import OpenAI

# set context for llm provider
gpt_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4", temperature=0.3))

# instantiate a RagDatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    service_context=gpt_context,
    num_questions_per_chunk=1,  # number of questions to generate per node
    show_progress=True,
)

In [13]:
import nest_asyncio

nest_asyncio.apply()
rag_dataset = dataset_generator.generate_dataset_from_nodes()
df = rag_dataset.to_pandas()

In [16]:
df.head()

| | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | Discuss the early history of San Francisco, in... | [The history of the city of San Francisco, Cal... | The earliest evidence of human habitation in S... | ai (gpt-4) | ai (gpt-4) |
| 1 | Discuss the early European exploration and set... | [== Arrival of Europeans and early settlement ... | The first documented European sighting of San ... | ai (gpt-4) | ai (gpt-4) |
| 2 | Discuss the impact of the 1848 California gold... | [== 1848 gold rush ==\nThe California gold rus... | The 1848 California gold rush had a significan... | ai (gpt-4) | ai (gpt-4) |
| 3 | Discuss the transformation of San Francisco in... | [== Paris of the West ==\n\nIt was during the ... | San Francisco began to transform into a major ... | ai (gpt-4) | ai (gpt-4) |
| 4 | Discuss the corruption and graft trials involv... | [== Corruption and graft trials ==\n\nMayor Eu... | Mayor Eugene Schmitz, president of the Musicia... | ai (gpt-4) | ai (gpt-4) |
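
Generated datasets can contain empty or repeated questions, so a quick check before saving may be worthwhile. Below is a minimal sketch over plain dictionaries, assuming the column names `query` and `reference_answer` shown in the output. `clean_records` is a hypothetical helper, and the filtering rules are an illustration rather than something the notebook itself does.

```python
def clean_records(records):
    """Drop rows with an empty query or answer, and rows whose query repeats."""
    seen = set()
    kept = []
    for row in records:
        query = (row.get("query") or "").strip()
        answer = (row.get("reference_answer") or "").strip()
        # Skip incomplete rows and queries that were already kept
        if not query or not answer or query in seen:
            continue
        seen.add(query)
        kept.append(row)
    return kept
```

One way to apply it is `clean_records(df.to_dict("records"))`, which passes each DataFrame row as a dictionary.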


### Save the Dataset in Hugging Face Format

In [17]:
from datasets import Dataset

dataset = Dataset.from_pandas(df)
print(f"Dataset contains {len(dataset)} rows")
dataset.save_to_disk("rag_synth_data.hf")

Dataset contains 47 rows

