# Generating Data for Retrieval Augmented Generation

In this notebook, we'll generate a dataset for RAG using the `sciq` dataset from Hugging Face.

In [2]:
!pip install llama-index==0.10.7 datasets tqdm python-dotenv spacy

Collecting llama-index==0.10.7
  Downloading llama_index-0.10.7-py3-none-any.whl.metadata (8.6 kB)
Collecting llama-index-agent-openai<0.2.0,>=0.1.0 (from llama-index==0.10.7)
  Downloading llama_index_agent_openai-0.1.1-py3-none-any.whl.metadata (645 bytes)
Collecting llama-index-core<0.11.0,>=0.10.0 (from llama-index==0.10.7)
  Downloading llama_index_core-0.10.7-py3-none-any.whl.metadata (3.6 kB)
Collecting llama-index-embeddings-openai<0.2.0,>=0.1.0 (from llama-index==0.10.7)
  Downloading llama_index_embeddings_openai-0.1.3-py3-none-any.whl.metadata (604 bytes)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index==0.10.7)
  Downloading llama_index_legacy-0.9.48-py3-none-any.whl.metadata (8.5 kB)
Collecting llama-index-llms-openai<0.2.0,>=0.1.0 (from llama-index==0.10.7)
  Downloading llama_index_llms_openai-0.1.2-py3-none-any.whl.metadata (586 bytes)
Collecting llama-index-multi-modal-llms-openai<0.2.0,>=0.1.0 (from llama-index==0.10.7)
  Downloading llama_index_multi_m

## Setup Environment Variables

In [3]:
from dotenv import load_dotenv

load_dotenv()

True
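
A `True` return value from `load_dotenv()` does not by itself confirm that the specific keys later cells rely on are present. The sketch below is one way to check for them explicitly. The helper `missing_env_vars` and the key name `OPENAI_API_KEY` are assumptions made for illustration; the notebook does not state which variables it needs.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Hypothetical key name: replace with whatever your .env file is expected to define.
REQUIRED_VARS = ["OPENAI_API_KEY"]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Running a check like this right after `load_dotenv()` surfaces a missing key early, rather than as an authentication error several cells later.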

## Download sciq dataset

Note: Parts of this repo are inspired by [this blog post](https://srk.ai/blog/004-ai-llm-retrieval-eval-llamaindex) on evaluating the performance of RAG systems.

In [4]:
from datasets import load_dataset

dataset = load_dataset("sciq")



## Generate RAG Dataset From the Documents

In [5]:
# Let's see what's in the dataset.

dataset["train"].to_pandas().head()

| | question | distractor3 | distractor1 | distractor2 | correct_answer | support |
|---|---|---|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | viruses | protozoa | gymnosperms | mesophilic organisms | Mesophiles grow best in moderate temperature, ... |
| 1 | What phenomenon makes global winds blow northe... | tropical effect | muon effect | centrifugal effect | coriolis effect | Without Coriolis Effect the global winds would... |
| 2 | Changes from a less-ordered state to a more-or... | endothermic | unbalanced | reactive | exothermic | Summary Changes of state are examples of phase... |
| 3 | What is the least dangerous radioactive decay? | zeta decay | beta decay | gamma decay | alpha decay | All radioactive decay is dangerous to living t... |
| 4 | Kilauea in hawaii is the world’s most continuo... | magma | greenhouse gases | carbon and smog | smoke and ash | Example 3.5 Calculating Projectile Motion: Hot... |


The `question` and `support` fields are the ones most relevant for RAG. The answer is in the `correct_answer` field, but it is only a single word or a short phrase.

In [6]:
def cleanup_cols(row):
    # Keep the question and support text; add a title-cased copy of the answer.
    return {"question": row["question"], "answer": row["correct_answer"].title(), "support": row["support"]}


updated_dataset = dataset.map(cleanup_cols)

In [7]:
updated_dataset = updated_dataset.remove_columns(["distractor3", "distractor1", "distractor2", "correct_answer"])
updated_dataset["train"].to_pandas().head()

| | question | support | answer |
|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | Mesophiles grow best in moderate temperature, ... | Mesophilic Organisms |
| 1 | What phenomenon makes global winds blow northe... | Without Coriolis Effect the global winds would... | Coriolis Effect |
| 2 | Changes from a less-ordered state to a more-or... | Summary Changes of state are examples of phase... | Exothermic |
| 3 | What is the least dangerous radioactive decay? | All radioactive decay is dangerous to living t... | Alpha Decay |
| 4 | Kilauea in hawaii is the world’s most continuo... | Example 3.5 Calculating Projectile Motion: Hot... | Smoke And Ash |
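
To make the two steps above concrete (mapping, then dropping columns), here is a minimal pure-Python sketch that applies them to a single row as a plain dict, without the `datasets` library. The row is copied from the preview above, so the `support` string keeps the same truncation shown there. Merging the returned keys into the row is an approximation of what `dataset.map` does per example, not a reproduction of its internals.

```python
def cleanup_cols(row):
    # Same mapping function as in the notebook cell above.
    return {"question": row["question"], "answer": row["correct_answer"].title(), "support": row["support"]}


DROPPED_COLUMNS = ["distractor3", "distractor1", "distractor2", "correct_answer"]

# One row from the preview; the support text is truncated exactly as displayed there.
row = {
    "question": "What is the least dangerous radioactive decay?",
    "distractor3": "zeta decay",
    "distractor1": "beta decay",
    "distractor2": "gamma decay",
    "correct_answer": "alpha decay",
    "support": "All radioactive decay is dangerous to living t...",
}

# Step 1: merge the returned keys into the existing row (roughly what map does).
mapped = {**row, **cleanup_cols(row)}

# Step 2: drop the columns that are no longer needed.
cleaned = {key: value for key, value in mapped.items() if key not in DROPPED_COLUMNS}

print(cleaned)
```

Note that `str.title()` capitalizes every word, which is why the last row of the table reads "Smoke And Ash" rather than "Smoke and Ash".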


### Save the Dataset in Hugging Face Format

In [8]:
import os

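content_folder = os.path.join(os.path.abspath(""), ".content")
# len() on a DatasetDict counts splits, so sum the rows of each split instead.
total_rows = sum(split.num_rows for split in updated_dataset.values())
print(f"Dataset contains {total_rows} rows across {len(updated_dataset)} splits")
updated_dataset.save_to_disk(os.path.join(content_folder, "rag_sciq_data.hf"))

Dataset contains 13679 rows across 3 splits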


Saving the dataset (1/1 shards): 100%|██████████| 11679/11679 [00:00<00:00, 783563.83 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1000/1000 [00:00<00:00, 123184.35 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1000/1000 [00:00<00:00, 114890.40 examples/s]


Great! Now we're ready to build our RAG.

In [9]:
updated_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'support', 'answer'],
        num_rows: 11679
    })
    validation: Dataset({
        features: ['question', 'support', 'answer'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['question', 'support', 'answer'],
        num_rows: 1000
    })
})