# Generating Data for Fine-Tuning on a Retrieval-Augmented Generation Task

In this notebook, we'll generate a dataset for RAG using the `sciq` dataset from Hugging Face.

In [1]:
!pip install llama-index datasets tqdm python-dotenv spacy




## Setup Environment Variables

In [2]:
from dotenv import load_dotenv

load_dotenv()

True
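
Before moving on, it can help to confirm that the keys used by later cells are actually set. The following is a minimal sketch, not part of the original notebook. The variable names are assumptions: `OPENAI_API_KEY` is the name the OpenAI client reads by default, and `HF_TOKEN` is one of the names `huggingface_hub` accepts for an upload token. `missing_env_vars` is a hypothetical helper.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Assumed variable names; not taken from the original notebook.
REQUIRED = ["OPENAI_API_KEY", "HF_TOKEN"]

missing = missing_env_vars(REQUIRED)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.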

## Download sciq dataset

Note: Parts of this repo are inspired by [this blog post](https://srk.ai/blog/004-ai-llm-retrieval-eval-llamaindex) on evaluating the performance of RAG systems.

In [3]:
from datasets import load_dataset

dataset = load_dataset("sciq")



## Generate RAG Dataset From the Documents

In [4]:
# Let's see what's in the dataset.

dataset["train"].to_pandas().head()

| | question | distractor3 | distractor1 | distractor2 | correct_answer | support |
|---|---|---|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | viruses | protozoa | gymnosperms | mesophilic organisms | Mesophiles grow best in moderate temperature, ... |
| 1 | What phenomenon makes global winds blow northe... | tropical effect | muon effect | centrifugal effect | coriolis effect | Without Coriolis Effect the global winds would... |
| 2 | Changes from a less-ordered state to a more-or... | endothermic | unbalanced | reactive | exothermic | Summary Changes of state are examples of phase... |
| 3 | What is the least dangerous radioactive decay? | zeta decay | beta decay | gamma decay | alpha decay | All radioactive decay is dangerous to living t... |
| 4 | Kilauea in hawaii is the world’s most continuo... | magma | greenhouse gases | carbon and smog | smoke and ash | Example 3.5 Calculating Projectile Motion: Hot... |


The `question` and `support` fields are the ones we need. The answer is also available in the `correct_answer` field, but it is only a single word or short phrase.

In [5]:
def cleanup_cols(row):
    return {"question": row["question"], "answer": row["correct_answer"].title(), "support": row["support"]}


updated_dataset = dataset.map(cleanup_cols)

In [6]:
updated_dataset = updated_dataset.remove_columns(["distractor3", "distractor1", "distractor2", "correct_answer"])
updated_dataset["train"].to_pandas().head()

| | question | support | answer |
|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | Mesophiles grow best in moderate temperature, ... | Mesophilic Organisms |
| 1 | What phenomenon makes global winds blow northe... | Without Coriolis Effect the global winds would... | Coriolis Effect |
| 2 | Changes from a less-ordered state to a more-or... | Summary Changes of state are examples of phase... | Exothermic |
| 3 | What is the least dangerous radioactive decay? | All radioactive decay is dangerous to living t... | Alpha Decay |
| 4 | Kilauea in hawaii is the world’s most continuo... | Example 3.5 Calculating Projectile Motion: Hot... | Smoke And Ash |


## Generate RAG Dataset using LlamaIndex

In [18]:
import hashlib
import os
from glob import glob

from datasets import load_dataset

dataset = load_dataset("sciq")


def cleanup_cols(row):
    return {"question": row["question"], "answer": row["correct_answer"].title(), "support": row["support"]}


updated_dataset = dataset.map(cleanup_cols)
updated_dataset = updated_dataset.remove_columns(["distractor3", "distractor1", "distractor2", "correct_answer"])

# Create a directory to store the content
content_folder = os.path.join(os.path.abspath(""), ".content/")
documents_folder = os.path.join(os.path.abspath(""), ".content/docs/")
os.makedirs(documents_folder, exist_ok=True)

NUM_DOCUMENTS = None


# Save a passage to a file named after the MD5 checksum of its text
def save_article_content(text, folder):
    try:
        # Identical passages share a checksum, so duplicates map to one file
        checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
        file_path = os.path.join(folder, checksum + ".txt")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text)
        return file_path
    except Exception as e:
        print(e)
        return str(e)


dataset = updated_dataset

# Save the support passage of each row, across all splits, to a file
saved_files = []
for split in ["train", "validation", "test"]:
    print(f"Dataset split {split} contains {len(dataset[split])} rows")
    for row in dataset[split]:
        if row["support"]:
            saved_files.append(save_article_content(row["support"], documents_folder))
        if NUM_DOCUMENTS and len(saved_files) >= NUM_DOCUMENTS:
            break

# Collect the paths of the passage files saved above
data_dir = os.path.join(os.path.abspath(""), ".content/docs")
input_files = glob(os.path.join(data_dir, "*.txt"))
print(f"{len(input_files)} files in folder: {input_files[0]}, ...")

Dataset split train contains 11679 rows
Dataset split validation contains 1000 rows
Dataset split test contains 1000 rows
12241 files in folder: /Users/rahulparundekar/workspaces/course-openai-api/rag_and_fine_tuning/.content/docs/e5698722d474cd3d91b5879885556afd.txt, ...
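
The file count (12241) is lower than the total row count across the splits (13679). The code above points to two contributing factors: rows with an empty `support` are skipped, and identical passages hash to the same MD5 filename, so they are written to the same file. A minimal sketch of the second effect, using a temporary directory and made-up passages (`save_text` is a hypothetical stand-in for `save_article_content`):

```python
import hashlib
import os
import tempfile


def save_text(text, folder):
    """Write `text` to a file named after its MD5 checksum and return the path."""
    checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
    path = os.path.join(folder, checksum + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path


# Made-up passages for illustration; the second duplicates the first.
passages = ["Water boils at 100 C.", "Water boils at 100 C.", "Ice is solid water."]

with tempfile.TemporaryDirectory() as folder:
    paths = [save_text(p, folder) for p in passages]
    files_on_disk = sorted(os.listdir(folder))

print(len(paths), "writes,", len(files_on_disk), "files on disk")
```

Three writes produce two files, because the duplicate passage overwrites the file created for the first one.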


In [20]:
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import OpenAI

service_context = ServiceContext.from_defaults(llm=OpenAI())
# The first argument of load_data is the show_progress flag, so pass it explicitly
documents = SimpleDirectoryReader(input_files=input_files).load_data(show_progress=True)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

Loading files: 100%|██████████| 12241/12241 [00:05<00:00, 2041.75file/s]


In [21]:
query_engine = index.as_query_engine()

In [31]:
from datasets import Dataset, DatasetDict
from tqdm import tqdm

# Query the index with each question and record the generated answer and retrieved contexts.
splits = {}
for split in ["train", "validation", "test"]:
    new_rows = []
    for row in tqdm(dataset[split]):
        question = row["question"]
        original_answer = row["answer"]
        response = query_engine.query(question)
        # Collect the text of the retrieved source nodes
        context = []
        for node in response.source_nodes:
            context.append(str(node.text))
        # Uncomment to inspect individual examples:
        # print("Question:")
        # print(question)
        # print("\nAnswer:")
        # print(str(response))
        # print("\nExpected Answer:")
        # print(original_answer)
        # print()
        new_rows.append(
            {"question": question, "contexts": context, "answer": str(response), "original_answer": original_answer}
        )
        if split == "train":
            if len(new_rows) >= 500:
                break
        else:
            if len(new_rows) >= 100:
                break
    splits[split] = Dataset.from_list(new_rows)

new_dataset = DatasetDict(splits)

  4%|▍         | 499/11679 [15:36<5:49:42,  1.88s/it] 
 10%|▉         | 99/1000 [03:02<27:39,  1.84s/it]  
 10%|▉         | 99/1000 [02:51<26:01,  1.73s/it]
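
The progress bars above stop short of the full split sizes because the loop breaks once each split reaches its cap. One alternative, sketched here with a plain list standing in for a dataset split and a hypothetical `SPLIT_LIMITS` mapping, is to slice the rows before iterating over them:

```python
from itertools import islice

# Hypothetical per-split caps mirroring the 500/100/100 limits used above.
SPLIT_LIMITS = {"train": 500, "validation": 100, "test": 100}


def capped_rows(rows, split, limits=SPLIT_LIMITS):
    """Return an iterator over at most `limits[split]` rows from `rows`."""
    return islice(rows, limits[split])


# A plain list of dicts stands in for a dataset split here.
fake_split = [{"question": f"q{i}"} for i in range(1000)]
selected = list(capped_rows(fake_split, "validation"))
print(len(selected))
```

When wrapping such an iterator in `tqdm`, passing `total=SPLIT_LIMITS[split]` lets the bar reflect the cap rather than the full split length.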


### Save the Dataset in Hugging Face Format

In [32]:
from dotenv import load_dotenv

load_dotenv()
new_dataset.push_to_hub("rparundekar/rag_fine_tuning_500")

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 147.43ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  1.11it/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 340.61ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  1.57it/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 373.86ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  1.61it/s]
README.md: 100%|██████████| 605/605 [00:00<00:00, 2.69MB/s]


CommitInfo(commit_url='https://huggingface.co/datasets/rparundekar/rag_fine_tuning_500/commit/2c4961d8ec7fa7238a7a5bd190ac9c43e832d14f', commit_message='Upload dataset', commit_description='', oid='2c4961d8ec7fa7238a7a5bd190ac9c43e832d14f', pr_url=None, pr_revision=None, pr_num=None)

Great! Now we're ready to build our RAG.

In [33]:
new_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'contexts', 'answer', 'original_answer'],
        num_rows: 500
    })
    validation: Dataset({
        features: ['question', 'contexts', 'answer', 'original_answer'],
        num_rows: 100
    })
    test: Dataset({
        features: ['question', 'contexts', 'answer', 'original_answer'],
        num_rows: 100
    })
})

In [34]:
new_dataset["train"].to_pandas().head()

| | question | contexts | answer | original_answer |
|---|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | [Bacteria can be used to make cheese from milk... | Bacteria is commonly used in the preparation o... | Mesophilic Organisms |
| 1 | What phenomenon makes global winds blow northe... | [Without Coriolis Effect the global winds woul... | The phenomenon that makes global winds blow no... | Coriolis Effect |
| 2 | Changes from a less-ordered state to a more-or... | [Summary Changes of state are examples of phas... | exothermic | Exothermic |
| 3 | What is the least dangerous radioactive decay? | [All radioactive decay is dangerous to living ... | Alpha decay is the least dangerous radioactive... | Alpha Decay |
| 4 | Kilauea in hawaii is the world’s most continuo... | [Example 3.5 Calculating Projectile Motion: Ho... | smoke and ash | Smoke And Ash |
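
The preview shows that the generated `answer` does not always agree with `original_answer` (row 0 answers with bacteria rather than mesophilic organisms). A rough way to flag such rows is a case-insensitive substring check. This is a sketch under stated assumptions: `contains_original` is a hypothetical helper, the rows below are made up for illustration rather than copied from the dataset, and substring matching is only a crude proxy for correctness.

```python
def contains_original(answer, original_answer):
    """Return True if `original_answer` occurs in `answer`, ignoring case."""
    return original_answer.strip().lower() in answer.strip().lower()


# Made-up rows with the same column names as the generated dataset.
rows = [
    {"answer": "The Coriolis effect deflects global winds.", "original_answer": "Coriolis Effect"},
    {"answer": "Bacteria are used to make cheese.", "original_answer": "Mesophilic Organisms"},
    {"answer": "exothermic", "original_answer": "Exothermic"},
]

flags = [contains_original(r["answer"], r["original_answer"]) for r in rows]
print(flags)  # [True, False, True]
```

Rows flagged `False` are candidates for manual review before the dataset is used for fine-tuning.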
