# gen_synthetic_data.ipynb

This notebook uses RAGAS to generate synthetic data to evaluate retrieval for the CareCompanion app.

**Step 0: Imports, constants, and API Keys!**

In [4]:
!pip install -q langchain==0.2.16 langchain_core langchain-openai openai ragas==0.1.14 pandas

In [5]:
import os
import openai
from getpass import getpass

# collect OpenAI key
openai.api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

**Step 1: Generate synthetic data**

In [7]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-4o-mini")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

# Initialize data generator and set up distributions
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.3,
    reasoning: 0.2
}
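The weights above should sum to 1. With the test size of 30 used later in this notebook, they imply roughly 15 simple, 9 multi-context, and 6 reasoning questions. A minimal sketch of that arithmetic follows, using plain string keys as stand-ins for the RAGAS evolution objects. `expected_counts` is a hypothetical helper, and how RAGAS itself rounds the split is not confirmed here.

```python
import math

# String keys stand in for the ragas evolution objects (simple, multi_context, reasoning).
distribution_weights = {"simple": 0.5, "multi_context": 0.3, "reasoning": 0.2}


def expected_counts(weights, test_size):
    """Return the approximate number of questions per evolution type (hypothetical helper)."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return {name: round(weight * test_size) for name, weight in weights.items()}


counts = expected_counts(distribution_weights, 30)
print(counts)
```

Running a check like this before calling the generator catches a mistyped weight before any API usage is incurred.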

In [8]:
# Load the document corpus that the synthetic test set will be generated from
import json
from langchain.schema import Document

myfile = "source_documents.json"

# Load JSON data
with open(myfile, 'r') as file:
    data = json.load(file)

# Convert JSON data into a list of LangChain Document objects
docs = [
    Document(page_content=item["page_content"], metadata=item.get("metadata",{}))
    for item in data
]

print(f"loaded {len(docs)} docs")


loaded 216 docs
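The loading cell assumes that `source_documents.json` is a list of objects, each with a `"page_content"` string and an optional `"metadata"` dict. A minimal sketch of a check for that shape is below. `validate_records` is a hypothetical helper, and the sample records are made up for illustration, not taken from the corpus.

```python
def validate_records(records):
    """Return the indices of records without a non-empty string "page_content"."""
    bad = []
    for index, item in enumerate(records):
        content = item.get("page_content") if isinstance(item, dict) else None
        if not isinstance(content, str) or not content.strip():
            bad.append(index)
    return bad


# Made-up records for illustration only
sample = [
    {"page_content": "dementia basics", "metadata": {"url": "https://example.org"}},
    {"metadata": {}},          # missing page_content
    {"page_content": "   "},   # whitespace-only page_content
]
bad_indices = validate_records(sample)
print(bad_indices)
```

Calling something like `validate_records(data)` before building the `Document` list would surface malformed entries as indices rather than as a `KeyError` partway through the list comprehension.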


In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,  # use a different chunk size from the app
    chunk_overlap=100,
)

split_docs = []

for doc in docs:
    splits = text_splitter.split_text(doc.page_content)
    for i, split in enumerate(splits):
        # Copy the parent document's metadata and record the chunk's position within it
        metadata_with_chunk = {**doc.metadata, "chunk_id": i}
        split_doc = Document(page_content=split, metadata=metadata_with_chunk)
        split_docs.append(split_doc)

print(f"len(docs): {len(docs)}, len(split_docs):{len(split_docs)}")

len(docs): 216, len(split_docs):687
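To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-width chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences), and `chunk_with_overlap` is a hypothetical name. The tiny sizes are chosen only to keep the output readable.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)

# Mirror the notebook's metadata convention: chunk_id is the position within the parent document
records = [{"chunk_id": i, "text": chunk} for i, chunk in enumerate(chunks)]
print(records)
```

Each chunk repeats the last character of the previous one, which is the role the 100-character overlap plays in the cell above.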


In [10]:
split_docs[0]

Document(metadata={'url': 'https://www.cdc.gov/alzheimers-dementia', 'title': "Alzheimer's Disease and Dementia | Alzheimer's Disease and Dementia | CDC", 'chunk_id': 0}, page_content="alzheimer's disease and dementia | alzheimer's disease and dementia | cdc     alzheimer's disease and dementia alzheimer's basics learn about signs and symptoms of alzheimer's disease and who is affected. aug. 15, 2024 dementia basics learn about common types of dementia, signs and symptoms, and risk factors. aug. 17, 2024 signs and symptoms of alzheimer's learn how to recognize the early signs of alzheimer's disease. signs and symptoms of dementia learn what early signs and symptoms of dementia to look out for. tools and resources find a variety of resources about alzheimer’s disease and healthy aging. reducing risk learn what lifestyle behaviors can reduce the risk of developing dementia. additional topics healthy aging at any age information to help you stay healthy and strong throughout your life. se

Use the distributions defined above: 50% simple, 30% multi-context, 20% reasoning. Generation can incur substantial OpenAI usage, so keep the test set small for now.
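One further way to limit usage, which this notebook does not do, is to generate from a reproducible random subset of the chunks instead of all of them. The sketch below assumes that is acceptable for the evaluation. `sample_chunks` is a hypothetical helper, the cap of 100 is arbitrary, and the string list stands in for `split_docs`.

```python
import random


def sample_chunks(chunks, max_chunks, seed=0):
    """Return at most max_chunks items, sampled without replacement using a fixed seed."""
    if len(chunks) <= max_chunks:
        return list(chunks)
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(list(chunks), max_chunks)


# Placeholder strings standing in for the 687 split documents
all_chunks = [f"chunk-{i}" for i in range(687)]
subset = sample_chunks(all_chunks, 100)
print(len(subset))
```

Fixing the seed means the same subset is drawn on every run, so a regenerated test set is built from the same source chunks.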

In [15]:
import pandas as pd

testset = generator.generate_with_langchain_docs(
    split_docs,
    test_size=30,
    distributions=distributions,
    with_debugging_logs=True,
)

# Generating the test data costs money, time, and compute, so save it for later re-use
if testset:
    # to_csv returns None when given a path, so keep the DataFrame in its own variable
    test_df = testset.to_pandas()
    test_df.to_csv("ragas_test_data.csv", index=False)
    print("wrote test data to file")

embedding nodes:   1%|          | 9/1378 [00:00<01:47, 12.68it/s][ragas.testset.extractor.DEBUG] topics: {'keyphrases': ['Apathy', 'Depression', 'Anxiety', 'Dementia support organizations', 'Cognitive Behavioral Therapy (CBT)']}
embedding nodes:   1%|          | 11/1378 [00:01<01:53, 12.02it/s][ragas.testset.extractor.DEBUG] topics: {'keyphrases': ['Aggressive behaviour and dementia', "Alzheimer's Society", 'Dementia support', 'Carers Trust', 'Samaritans helpline']}
embedding nodes:   1%|          | 13/1378 [00:01<02:15, 10.04it/s][ragas.testset.extractor.DEBUG] topics: {'keyphrases': ['Dementia support forum', 'Caregiving psychology', 'Managing aggressive behaviour', 'Wellbeing advice for carers', 'Challenging care home notices']}
[ragas.testset.extractor.DEBUG] topics: {'keyphrases': ['Dementia communication', 'Body language', 'Support for carers', 'Managing challenging behavior', 'Wellbeing of caregivers']}
[ragas.testset.extractor.DEBUG] topics: {'keyphrases': ['Aggressive behaviou

wrote test data to file
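Because the file is saved for re-use, a later run can check for it before regenerating. The sketch below uses only the standard library so that it runs without the notebook's dependencies; `pd.read_csv` would be the pandas equivalent. `load_cached_rows` is a hypothetical helper, and the row written in the demonstration is a placeholder, not real generated data.

```python
import csv
import os
import tempfile


def load_cached_rows(path):
    """Return the saved rows as a list of dicts, or None if the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Demonstration against a temporary directory, so no real file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "ragas_test_data.csv")
    missing = load_cached_rows(demo_path)  # nothing saved yet

    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["question", "ground_truth"])
        writer.writeheader()
        writer.writerow({"question": "placeholder question", "ground_truth": "placeholder answer"})

    cached = load_cached_rows(demo_path)

print(missing, len(cached))
```

Note that every value read back from a CSV is a string, so any list-valued column in the saved test set would need to be parsed again after loading.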


In [14]:
print(testset.to_pandas())

                                             question  ... episode_done
0   What is the relationship between air pollution...  ...         True
1   What is the relationship between sleep and dem...  ...         True
2   What is the purpose of reminiscing with pictur...  ...         True
3   What is the role of music therapy in reducing ...  ...         True
4   What is the significance of the APOE4 gene in ...  ...         True
5   What is mild cognitive impairment (MCI) and ho...  ...         True
6   What is the significance of the APOE E4 gene i...  ...         True
7   What is the importance of end of life planning...  ...         True
8   What challenges can arise from misidentificati...  ...         True
9   What es el propósito de la evaluación de condu...  ...         True
10  What factors contribute to an increased risk o...  ...         True
11  What is the main difference between depression...  ...         True
12  How does the apoe4 gene affect Alzheimer's, es...  ...      

