<a href="https://colab.research.google.com/github/daisysong76/AI--Machine--learning/blob/main/Data_for_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating Data for Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) combines a retriever with a generative model so that responses are grounded in retrieved documents, which tends to make them more accurate, contextually relevant, and information-rich. Generating good data for RAG matters for several reasons:

- Better retrieval: with suitable data, a RAG model can retrieve more relevant documents or information snippets, which enriches the context available to the generator.
- Training on diverse data sources: comprehensive and diverse datasets let these models refine their retrieval mechanisms, so that the most relevant and useful information is retrieved to aid the generation process.
- Addressing data scarcity in specific domains: generated data can fill gaps where little domain-specific data is available.

In this notebook, we'll generate a dataset for RAG using the `sciq` dataset from Hugging Face.

In [3]:
!pip install llama-index==0.9.48 datasets tqdm python-dotenv spacy


## Set Up Environment Variables

In [None]:
from dotenv import load_dotenv

load_dotenv()

True

## Download sciq dataset

Note: Parts of this repo are inspired by [this blog post](https://srk.ai/blog/004-ai-llm-retrieval-eval-llamaindex) on evaluating the performance of RAG systems.

In [4]:
from datasets import load_dataset

dataset = load_dataset("sciq")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Generate RAG Dataset From the Documents

In [5]:
# Let's see what's in the dataset.

dataset["train"].to_pandas().head()

| | question | distractor3 | distractor1 | distractor2 | correct_answer | support |
|---|---|---|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | viruses | protozoa | gymnosperms | mesophilic organisms | Mesophiles grow best in moderate temperature, ... |
| 1 | What phenomenon makes global winds blow northe... | tropical effect | muon effect | centrifugal effect | coriolis effect | Without Coriolis Effect the global winds would... |
| 2 | Changes from a less-ordered state to a more-or... | endothermic | unbalanced | reactive | exothermic | Summary Changes of state are examples of phase... |
| 3 | What is the least dangerous radioactive decay? | zeta decay | beta decay | gamma decay | alpha decay | All radioactive decay is dangerous to living t... |
| 4 | Kilauea in hawaii is the world’s most continuo... | magma | greenhouse gases | carbon and smog | smoke and ash | Example 3.5 Calculating Projectile Motion: Hot... |


The `question` and `support` fields are the most relevant for our purposes. The answer is also present in the `correct_answer` field, but it is only a word or a short phrase.

The `cleanup_cols` function takes a single argument, `row`, which is expected to be a dictionary-like object containing the keys `question`, `correct_answer`, and `support`. It returns a new dictionary with three key-value pairs: `question` and `support` are copied unchanged, and `answer` holds the title-cased `correct_answer`.

In [6]:
def cleanup_cols(row):
    return {"question": row["question"], "answer": row["correct_answer"].title(), "support": row["support"]}

updated_dataset = dataset.map(cleanup_cols)
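
To make the behavior of `cleanup_cols` concrete, here is a minimal sketch that applies the same logic to a single plain dictionary. The sample row is invented for illustration and is not taken from `sciq`. It also shows a quirk of `str.title()` worth keeping in mind: the character after an apostrophe is capitalized too.

```python
def cleanup_cols(row):
    # Same logic as the notebook cell: keep three fields, title-case the answer.
    return {
        "question": row["question"],
        "answer": row["correct_answer"].title(),
        "support": row["support"],
    }

# Hypothetical row shaped like a sciq record (values invented for illustration).
sample_row = {
    "question": "What is the outermost layer of the planet called?",
    "distractor1": "mantle",
    "distractor2": "outer core",
    "distractor3": "inner core",
    "correct_answer": "earth's crust",
    "support": "The crust is the thin outermost layer.",
}

cleaned = cleanup_cols(sample_row)
print(cleaned["answer"])  # Earth'S Crust  (note the capital S after the apostrophe)
print(sorted(cleaned))    # ['answer', 'question', 'support']
```

When the function is passed to `dataset.map`, the returned keys are merged into the existing columns rather than replacing them, which is why the distractor columns still have to be dropped in a separate step.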


In [7]:
updated_dataset = updated_dataset.remove_columns(["distractor3", "distractor1", "distractor2", "correct_answer"])
updated_dataset["train"].to_pandas().head()

| | question | support | answer |
|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | Mesophiles grow best in moderate temperature, ... | Mesophilic Organisms |
| 1 | What phenomenon makes global winds blow northe... | Without Coriolis Effect the global winds would... | Coriolis Effect |
| 2 | Changes from a less-ordered state to a more-or... | Summary Changes of state are examples of phase... | Exothermic |
| 3 | What is the least dangerous radioactive decay? | All radioactive decay is dangerous to living t... | Alpha Decay |
| 4 | Kilauea in hawaii is the world’s most continuo... | Example 3.5 Calculating Projectile Motion: Hot... | Smoke And Ash |
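
Since retrieval depends on the `support` passage, it may be worth checking that every row actually has one. The sketch below assumes that some rows could carry an empty or whitespace-only `support` string, which should be verified on the real data. `has_support` is a hypothetical helper name, and the sample rows are invented.

```python
def has_support(row):
    """Return True when the row carries a non-empty supporting passage."""
    support = row["support"] or ""
    return bool(support.strip())

# Hypothetical rows in the cleaned three-column layout (values invented).
rows = [
    {"question": "Q1?", "answer": "A1", "support": "A passage that backs the answer."},
    {"question": "Q2?", "answer": "A2", "support": ""},
    {"question": "Q3?", "answer": "A3", "support": "   "},
]

kept = [row for row in rows if has_support(row)]
print(f"kept {len(kept)} of {len(rows)} rows")  # kept 1 of 3 rows
```

The same predicate could be passed to the `filter` method of the `datasets` library, for example `updated_dataset.filter(has_support)`, if dropping such rows turns out to be appropriate.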


### Save the Dataset in Hugging Face Format

In [8]:
import os

content_folder = os.path.join(os.path.abspath(""), ".content")
# len() on a DatasetDict counts its splits (train, validation, test), not rows.
print(f"Dataset contains {len(updated_dataset)} splits")
updated_dataset.save_to_disk(os.path.join(content_folder, "rag_sciq_data.hf"))

Dataset contains 3 splits



Great! The dataset is now ready to be used for building our RAG pipeline.

In [9]:
updated_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'support', 'answer'],
        num_rows: 11679
    })
    validation: Dataset({
        features: ['question', 'support', 'answer'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['question', 'support', 'answer'],
        num_rows: 1000
    })
})
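
As a rough sketch of how the three columns could feed a retrieval setup, the snippet below turns rows into a de-duplicated passage corpus plus a mapping from each question to the passage that supports it. This is one possible layout, not the method used in this notebook. `build_retrieval_pairs`, the id formats, and the sample rows are all hypothetical.

```python
def build_retrieval_pairs(rows):
    """Split rows into queries, a de-duplicated corpus, and query -> doc links."""
    queries = {}        # query_id -> question text
    corpus = {}         # doc_id -> passage text
    relevant_docs = {}  # query_id -> list of doc_ids that answer it
    passage_to_id = {}  # passage text -> doc_id, used to de-duplicate passages

    for i, row in enumerate(rows):
        passage = row["support"].strip()
        if not passage:
            continue  # a question without a passage cannot be used for retrieval
        doc_id = passage_to_id.get(passage)
        if doc_id is None:
            doc_id = f"doc-{len(corpus)}"
            passage_to_id[passage] = doc_id
            corpus[doc_id] = passage
        query_id = f"q-{i}"
        queries[query_id] = row["question"]
        relevant_docs[query_id] = [doc_id]
    return queries, corpus, relevant_docs

# Hypothetical rows (values invented); the first two share one passage.
rows = [
    {"question": "Q1?", "answer": "A1", "support": "Shared passage."},
    {"question": "Q2?", "answer": "A2", "support": "Shared passage."},
    {"question": "Q3?", "answer": "A3", "support": "Another passage."},
]

queries, corpus, relevant_docs = build_retrieval_pairs(rows)
print(len(queries), len(corpus))  # 3 2
print(relevant_docs["q-1"])       # ['doc-0']
```

Keeping the corpus separate from the questions means a retriever can be scored on whether it returns the linked passage for each question, independently of the generated answer.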