## Synthetic RAG Dataset Creation by AWS

### Step 1: Loading the Data

This section creates chunks from your chosen document using the parameters defined in `AWS_data_creation.py`, then randomly samples `n` chunks to generate the synthetic RAG dataset. To keep the document chunks constant across all methods, the sampled chunks are exported as a pickle file.

In [None]:
# adjustable parameters
folder_name = "policy_docs"  # folder containing original source
filename = "CFR-2025-title5-vol1.pdf"  # target policy
doc_tag = 't5'  # abbreviated policy tag
n = 100  # number of document chunks to sample

In [2]:
import pickle
import random

import pandas as pd

from AWS_data_creation import chunk_doc, generate_dataset

# randomly sample document chunks (fixed seed so the sample is reproducible)
random.seed(42)
docs = chunk_doc(folder_name, filename)
sampled_docs = random.sample(docs, n)  # sample without replacement

# store the sampled chunks in the folder defined above
# the same chunks are reused by the naive and RAGAS dataset generation methods
with open(f"{folder_name}/{doc_tag}.pk", 'wb') as fi:
    pickle.dump(sampled_docs, fi)

2025-12-05 17:50:11,653 - botocore.credentials - INFO - Found credentials from IAM Role: OCHCO-Analytics-SSM-CloudWatch


Average length among 965 pages loaded is 4041 characters.
After the split you have 3196
Average length among 3196 chunks is 1274 characters.
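
The naive and RAGAS methods can reload the same sampled chunks from the pickle written above. The following is a minimal sketch, not part of `AWS_data_creation.py`: `load_chunks` is a hypothetical helper, and the demo uses placeholder strings in a temporary folder rather than real document chunks. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path


def load_chunks(folder_name: str, doc_tag: str) -> list:
    """Reload the sampled chunks saved in Step 1 (hypothetical helper)."""
    path = Path(folder_name) / f"{doc_tag}.pk"
    with open(path, "rb") as fi:
        return pickle.load(fi)


# Round-trip demo: placeholder strings stand in for real document chunks.
with tempfile.TemporaryDirectory() as tmp:
    demo_chunks = ["chunk one", "chunk two"]
    with open(Path(tmp) / "t5.pk", "wb") as fi:
        pickle.dump(demo_chunks, fi)
    reloaded = load_chunks(tmp, "t5")

print(reloaded == demo_chunks)
```

In the other notebooks, the call would presumably look like `sampled_docs = load_chunks(folder_name, doc_tag)`, using the same `folder_name` and `doc_tag` values set in the parameters cell.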


### Step 2: Generating the Dataset

This process takes about 12 minutes for a sample size of n=100. When it finishes, you should have `n` questions: one question/answer pair per chunk.

In [None]:
# empty frame that defines the output schema
dataset = pd.DataFrame(
    columns=[
        "question",
        "question_compressed",
        "reference_answer",
        "source_sentence",
        "source_raw",
        "source_document",
    ]
)
dataset_df = generate_dataset(sampled_docs, dataset)
num_questions_generated = dataset_df.shape[0]
print(f"Generated a total of {num_questions_generated} questions.")
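
Because the expected result is one question/answer pair per chunk, it can be worth checking the output before saving it. The sketch below is an illustration only: `check_dataset` is a hypothetical helper, it assumes the generated frame keeps the six columns defined above, and the toy frame uses placeholder values in place of `dataset_df`.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "question",
    "question_compressed",
    "reference_answer",
    "source_sentence",
    "source_raw",
    "source_document",
]


def check_dataset(df: pd.DataFrame, n: int) -> dict:
    """Summarize how a generated dataset compares with the expected shape."""
    present = [c for c in EXPECTED_COLUMNS if c in df.columns]
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    return {
        "rows": len(df),
        "rows_short_of_n": n - len(df),
        "missing_columns": missing,
        # rows where at least one expected field is empty
        "rows_with_nulls": int(df[present].isna().any(axis=1).sum()),
    }


# Toy frame with placeholder values; the second row lacks a reference answer.
complete_row = {c: "x" for c in EXPECTED_COLUMNS}
incomplete_row = {**complete_row, "reference_answer": None}
toy = pd.DataFrame([complete_row, incomplete_row])

report = check_dataset(toy, n=3)
print(report)
```

On the real data the call would be `check_dataset(dataset_df, n)`; a non-zero `rows_short_of_n` would suggest that some chunks did not yield a question.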

### Step 3: Save Your Synthetic Dataset

We will evaluate this dataset alongside the RAGAS and naive datasets generated from the same policy document chunks.

In [None]:
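import os

# move up one level so the relative "datasets/" path resolves from the parent directory
# note: this changes the working directory for every cell run afterwards
os.chdir('../')
dataset_df.to_csv(f"datasets/test-set-aws-synth-{n}-{doc_tag}.csv", index=False)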
dataset_df.head()
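
Calling `os.chdir` changes the working directory for the rest of the session, so rerunning the cell moves up another level each time. One possible alternative is to build the output path explicitly. This is a sketch under assumptions: `dataset_path` is a hypothetical helper, and the default `base_dir` assumes the notebook sits one level below the folder that contains `datasets/`.

```python
from pathlib import Path


def dataset_path(n: int, doc_tag: str, base_dir: Path = Path("..")) -> Path:
    """Build the Step 3 CSV path without changing the working directory."""
    return base_dir / "datasets" / f"test-set-aws-synth-{n}-{doc_tag}.csv"


out_path = dataset_path(100, "t5")
print(out_path.as_posix())  # ../datasets/test-set-aws-synth-100-t5.csv
```

The frame could then be written with `out_path.parent.mkdir(parents=True, exist_ok=True)` followed by `dataset_df.to_csv(out_path, index=False)`.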