# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In this notebook we'll explore a use case for Ragas' synthetic test set generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install several dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's Endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent OS-dependent errors, we'll import NLTK and download the packages needed for correct data handling.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jamesdamante/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jamesdamante/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a baseline for measuring how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps Ragas build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using the `TextLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/MentalHealthGuide.txt', 'data/HealthWellnessGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [8]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [9]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

The default transformations depend on the corpus length. In our case, they include:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.

In [10]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

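# Use a separate name so we don't shadow the imported `default_transforms` function
# (shadowing it would break this cell if it is run a second time).
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)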
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 10, relationships: 20)
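
To make the relationship-building step more concrete, here is a minimal sketch of linking nodes whose embeddings are similar enough. The toy vectors, the `build_edges` helper, and the 0.9 threshold are illustrative assumptions and are not Ragas internals.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_edges(embeddings, threshold=0.9):
    """Return (node_a, node_b, score) for every pair at or above the threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


# Hypothetical 3-dimensional embeddings for three nodes.
toy_embeddings = {
    "sleep_hygiene": [0.9, 0.1, 0.0],
    "sleep_and_stress": [0.8, 0.2, 0.1],
    "nutrition": [0.0, 0.1, 0.9],
}

edges = build_edges(toy_embeddings)
for node_a, node_b, score in edges:
    print(f"{node_a} <-> {node_b}: {score:.3f}")
```

Only the two sleep-related nodes clear the threshold here, which is the intuition behind similarity-based relationships: nodes about related topics end up connected, and multi-hop questions can then be drawn from connected nodes.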

We can save and load our knowledge graphs as follows.

In [11]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 10, relationships: 20)
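
Because building the graph costs LLM and embedding calls, a common pattern is to build once and reload afterwards. The sketch below shows that pattern with plain JSON and the standard library; `load_or_build` and `expensive_build` are hypothetical names, and with Ragas the load and save steps would be `KnowledgeGraph.load` and `kg.save` as shown above.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(path, build):
    """Load a JSON artifact from `path` if present; otherwise build and save it."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    artifact = build()
    path.write_text(json.dumps(artifact))
    return artifact


build_calls = []


def expensive_build():
    """Stand-in for the costly transform step; records each invocation."""
    build_calls.append(1)
    return {"nodes": 10, "relationships": 20}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "kg_cache.json"
    first = load_or_build(cache_path, expensive_build)   # builds and saves
    second = load_or_build(cache_path, expensive_build)  # loads from disk

print(f"build ran {len(build_calls)} time(s)")
```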

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [12]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [13]:
from ragas.testset.synthesizers import (
    default_query_distribution,
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

# Each entry pairs a synthesizer with the share of the test set it should produce.
query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
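
As a rough way to reason about what a distribution implies for a given `testset_size`, the sketch below splits a sample budget across weights using largest-remainder rounding. This is an assumption for illustration only: `expected_counts` is a hypothetical helper, and Ragas may allocate samples differently.

```python
def expected_counts(weights, testset_size):
    """Split `testset_size` across named weights using largest-remainder rounding."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    raw = {name: weight * testset_size for name, weight in weights.items()}
    counts = {name: int(value) for name, value in raw.items()}
    leftover = testset_size - sum(counts.values())
    # Hand out the remaining samples to the largest fractional parts first.
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


counts = expected_counts(
    {
        "single_hop_specific": 0.5,
        "multi_hop_abstract": 0.25,
        "multi_hop_specific": 0.25,
    },
    testset_size=10,
)
print(counts)
```

Comparing an estimate like this against `synthesizer_name` counts in the generated test set is a quick sanity check that the distribution was applied as intended.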

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:

single hop specific  
multi hop abstract  
multi hop specific  

single hop specific:  
    generates straightforward questions that can be answered using information from one location in the knowledge graph  
    these ask for specific facts or details  

multi hop abstract:  
    creates questions that require combining information from multiple parts of the knowledge graph  
    these questions are broader and more conceptual in nature  

multi hop specific:  
    also requires information from multiple locations in the knowledge graph, but asks for specific connections or detailed explanations rather than general concepts  

single hop vs multi hop:  
    does the question need one piece of information or multiple connected pieces?  

abstract vs specific:  
    does the question ask for general concepts or specific details/connections?  


Finally, we can use our `TestsetGenerator` to generate our testset!

In [14]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What does the World Health Organization define... | `[The Mental Health and Psychology Handbook A P...` | According to the World Health Organization, me... | single_hop_specifc_query_synthesizer |
| 1 | What is DBT and how does it incorporate mindfu... | `[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...` | Dialectical Behavior Therapy (DBT) was origina... | single_hop_specifc_query_synthesizer |
| 2 | What role does Omega-3 play in mental health? | `[Write letters to or from your future self Jou...` | Omega-3 fatty acids, found in fatty fish, waln... | single_hop_specifc_query_synthesizer |
| 3 | What are some effective strategies to support ... | `[social interactions How to set and maintain b...` | Strategies for digital mental health include s... | single_hop_specifc_query_synthesizer |
| 4 | What is covered in Chapter 6 of the Personal W... | `[The Personal Wellness Guide A Comprehensive R...` | Chapter 6 discusses hydration, emphasizing the... | single_hop_specifc_query_synthesizer |
| 5 | How can stress reduction techniques like deep ... | `[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...` | Stress reduction techniques such as deep breat... | multi_hop_abstract_query_synthesizer |
| 6 | Considering the challenges of digital mental h... | `[<1-hop>\n\nsocial interactions How to set and...` | Managing digital mental health involves strate... | multi_hop_abstract_query_synthesizer |
| 7 | How can setting and maintaining healthy bounda... | `[<1-hop>\n\nsocial interactions How to set and...` | Setting and maintaining healthy boundaries, in... | multi_hop_abstract_query_synthesizer |
| 8 | Chapter 2 and Chapter 15 how they help build h... | `[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...` | Chapter 2 explains how habits form through cue... | multi_hop_specific_query_synthesizer |
| 9 | How can incorporating CBT and mindfulness-base... | `[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...` | Cognitive Behavioral Therapy (CBT) focuses on ... | multi_hop_specific_query_synthesizer |


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [15]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [16]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What mental health means? | `[The Mental Health and Psychology Handbook A P...` | Mental health encompasses our emotional, psych... | single_hop_specifc_query_synthesizer |
| 1 | How does DBT incorporate mindfulness as a key ... | `[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...` | Dialectical Behavior Therapy (DBT) combines CB... | single_hop_specifc_query_synthesizer |
| 2 | How does mental health relate to sleep and wha... | `[Write letters to or from your future self Jou...` | Sleep and mental health have a bidirectional r... | single_hop_specifc_query_synthesizer |
| 3 | Who are Licensed Clinical Social Workers and w... | `[social interactions How to set and maintain b...` | Licensed Clinical Social Workers are mental he... | single_hop_specifc_query_synthesizer |
| 4 | How do sleep hygiene practices and sleep quali... | `[<1-hop>\n\nWrite letters to or from your futu...` | Sleep hygiene practices, such as maintaining a... | multi_hop_abstract_query_synthesizer |
| 5 | How can practicing self-awareness through jour... | `[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...` | Practicing self-awareness through journaling p... | multi_hop_abstract_query_synthesizer |
| 6 | H0w can stres and its impakt on physikal and m... | `[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...` | The provided context explains that stress can ... | multi_hop_abstract_query_synthesizer |
| 7 | How do factors influencing mental health, such... | `[<1-hop>\n\nThe Mental Health and Psychology H...` | Factors influencing mental health include biol... | multi_hop_abstract_query_synthesizer |
| 8 | How do the sleep and recovery practices outlin... | `[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...` | The sleep and recovery practices detailed in P... | multi_hop_specific_query_synthesizer |
| 9 | How do the sleep and recovery practices outlin... | `[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...` | The sleep and recovery practices detailed in P... | multi_hop_specific_query_synthesizer |


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:

unrolled approach:  
    manually create the knowledge graph step by step  
    apply transformations explicitly using default_transforms()  
    can save and reuse the knowledge graph  
    see exactly what's happening at each step  
    advantages:  
        full control over the process  
        can inspect and debug the knowledge graph  
        can reuse the graph without regenerating it (saves api costs)  
        can customize which transformations to apply  
    disadvantages:  
        requires more code and steps  
        need to understand the knowledge graph structure  

abstracted approach:  
    use generate_with_langchain_docs() in one function call  
    everything happens under the hood  
    advantages:  
        quick and simple - single line of code  
        good for rapid prototyping  
        don't need to understand internals  
    disadvantages:  
        no control over the generation process  
        can't inspect or modify the knowledge graph  
        can't reuse the graph - regenerates every time  
        limited customization options  

when to use unrolled:  
    need to debug or inspect the knowledge graph  
    want to reuse the graph multiple times  
    working with large documents where regeneration is expensive  
    need custom transformations  

when to use abstracted:  
    quick experiments or getting started  
    working with small documents  
    simplicity is more important than control  

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [17]:
### YOUR CODE HERE ###

# step 1: look at the default distribution for comparison
# default was: SingleHop (0.5), MultiHopAbstract (0.25), MultiHopSpecific (0.25)

# step 2: create a custom query distribution
# focusing more on multi-hop questions since they test more complex reasoning
custom_query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.2),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.4),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.4),
]

# step 3: generate test set with custom distribution
custom_testset = generator.generate(testset_size=10, query_distribution=custom_query_distribution)
custom_df = custom_testset.to_pandas()
custom_df

# step 4: compare the distributions
# count question types in default vs custom
print("default distribution (from earlier):")
print("single hop: 50%")
print("multi hop abstract: 25%")
print("multi hop specific: 25%")
print()

print("custom distribution:")
print("single hop: 20%")
print("multi hop abstract: 40%")
print("multi hop specific: 40%")
print()

# check actual counts in generated data
print("actual counts in custom testset:")
print(custom_df['synthesizer_name'].value_counts())

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

default distribution (from earlier):
single hop: 50%
multi hop abstract: 25%
multi hop specific: 25%

custom distribution:
single hop: 20%
multi hop abstract: 40%
multi hop specific: 40%

actual counts in custom testset:
synthesizer_name
multi_hop_abstract_query_synthesizer    4
multi_hop_specific_query_synthesizer    4
single_hop_specifc_query_synthesizer    2
Name: count, dtype: int64


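LangSmith tracing (`LANGCHAIN_TRACING_V2`) and the LangSmith API key (`LANGCHAIN_API_KEY`) were already set in Task 1, so the `Client` used in the next section can authenticate without further setup.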

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [18]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the Ragas-generated dataframe and add each example to the dataset we just created!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [19]:
# iterrows() yields (index, row) pairs; we only need the row.
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
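
The field mapping used in the loop above can be isolated into a small, testable function. This is a minimal sketch: `to_langsmith_example` is a hypothetical helper, and the sample row uses placeholder text rather than real testset content.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row (dict-like) to LangSmith example fields."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Placeholder row; real rows come from `dataset.to_pandas()`.
sample_row = {
    "user_input": "What role does Omega-3 play in mental health?",
    "reference": "(reference answer text)",
    "reference_contexts": ["(context chunk text)"],
}

example = to_langsmith_example(sample_row)
print(example["inputs"])
```

Keeping the mapping in one place makes it easier to verify that the keys (`question`, `answer`) match what the chain and the evaluators expect.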

## Basic RAG Chain

Time for some RAG!


In [20]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [21]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
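
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window chunker. It is an illustrative assumption, not how `RecursiveCharacterTextSplitter` works internally (the real splitter also tries to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Stand-in for a 2,000-character document.
text = "".join(str(i % 10) for i in range(2000))

small = chunk_text(text, chunk_size=500, chunk_overlap=50)
large = chunk_text(text, chunk_size=1000, chunk_overlap=50)
print(len(small), len(large))
```

Smaller windows produce more chunks that each cover less text, while the overlap repeats the tail of one chunk at the head of the next so that content near a boundary is not lost.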

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [22]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [23]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [24]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [25]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [26]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [27]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
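
For readers new to LCEL, the chain above can be read as three steps: retrieve context for the question, fill the prompt, and generate an answer. The sketch below expresses the same flow in plain Python with stand-in components; `run_rag`, `fake_retrieve`, and `fake_generate` are hypothetical names and do not call any model.

```python
def run_rag(question, retrieve, generate, prompt_template):
    """Retrieve context, fill the prompt, and return the generated answer."""
    context = retrieve(question)
    prompt = prompt_template.format(context=context, question=question)
    return generate(prompt)


seen_prompts = []


def fake_retrieve(question):
    """Stand-in retriever that always returns one fixed chunk."""
    return ["Cat-Cow Stretch helps with lower back pain."]


def fake_generate(prompt):
    """Stand-in LLM that records the prompt and returns a canned string."""
    seen_prompts.append(prompt)
    return "(model answer)"


TEMPLATE = "Context: {context}\nQuestion: {question}\n"

answer = run_rag("What helps lower back pain?", fake_retrieve, fake_generate, TEMPLATE)
print(seen_prompts[0])
```

Swapping in stand-ins like these is also a cheap way to unit test prompt construction without spending API calls.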

In [28]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

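We'll instantiate OpenAI's GPT-4.1 as an evaluation LLM. Note that the evaluators defined in the next cell select their judge model through their own `model` argument (`"openai:gpt-4o"`), so they do not use `eval_llm` unless you wire it in.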

In [29]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [30]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # judge model, given as a "provider:model" string
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o",
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o",
)

> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`:
> - `labeled_helpfulness_evaluator`:
> - `dopeness_evaluator`:
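
Alongside LLM-as-judge evaluators, a custom evaluator can be an ordinary function. The sketch below flags responses where the chain declined to answer. It assumes the evaluator is called with `inputs`, `outputs`, and `reference_outputs` and may return a dict with `key` and `score`; `says_i_dont_know` and the `declined_to_answer` key are hypothetical, so check the behavior against your installed LangSmith version.

```python
def says_i_dont_know(inputs, outputs, reference_outputs=None):
    """Score 1 when the response contains the fallback phrase, else 0."""
    # The chain returns a plain string, which may arrive wrapped as {"output": ...}.
    text = outputs["output"] if isinstance(outputs, dict) else str(outputs)
    declined = "i don't know" in text.lower()
    return {"key": "declined_to_answer", "score": int(declined)}


declined = says_i_dont_know(
    {"question": "What is the capital of Mars?"},
    {"output": "I don't know."},
)
answered = says_i_dont_know(
    {"question": "How can I sleep better?"},
    {"output": "Keep a consistent sleep schedule."},
)
print(declined, answered)
```

A deterministic check like this costs nothing to run and complements judge-based scores by showing how often the prompt's "I don't know" instruction is triggered.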

## LangSmith Evaluation

In [31]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'flowery-level-98' at:
https://smith.langchain.com/o/c93f4590-e6e3-4356-8242-c69be9f3c1c8/datasets/9d91f67e-06d8-45d9-a9cb-34fcd59d4a0d/compare?selectedSessions=20578cac-bc59-40fe-85da-43c7266cd1c0




0it [00:00, ?it/s]

| | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | how can CBT and CBT-I help improve mental heal... | Based on the context provided:\n\nCBT (Cogniti... | | Cognitive Behavioral Therapy (CBT) is an effec... | True | True | True | 4.164035 | 328c5956-2b89-4888-98c0-b9a3e5b8cb2a | 019c691c-612e-74e3-b56f-90a0c7481991 |
| 1 | How can incorporating cognitive behavioral the... | Incorporating Cognitive Behavioral Therapy (CB... | | In a holistic wellness plan, integrating cogni... | True | True | True | 4.198654 | cbccb4f8-ed75-4935-ad92-3feb977cfd6f | 019c691c-a841-7583-9048-eab35304d0a9 |
| 2 | How do the sleep and recovery practices outlin... | The sleep and recovery practices outlined in P... | | The sleep and recovery practices detailed in P... | True | True | True | 2.904664 | 6ba9453f-ca7a-4b4f-b98b-4f838beb042b | 019c691c-da85-7142-be9c-659b98d2d6d3 |
| 3 | How do the sleep and recovery practices outlin... | Based on the provided context, sleep and recov... | | The sleep and recovery practices detailed in P... | True | True | True | 6.035135 | 3508a15b-abf3-421d-9c17-eae34328db70 | 019c691d-0fef-7b52-928f-96aed3bcc087 |
| 4 | How do factors influencing mental health, such... | Based on the context, factors influencing ment... | | Factors influencing mental health include biol... | True | True | True | 3.929021 | 3f1f6a74-53a8-4cba-b2be-fa39ff3c11d6 | 019c691d-5a17-7071-b0df-68b6abc7125c |
| 5 | H0w can stres and its impakt on physikal and m... | Long-term stress management can be supported b... | | The provided context explains that stress can ... | True | True | True | 2.752247 | f8e28b14-4967-4e7c-aeb7-0d0afddab228 | 019c691d-8b8f-7191-b26d-7f68b7e4d1df |
| 6 | How can practicing self-awareness through jour... | Practicing self-awareness through journaling a... | | Practicing self-awareness through journaling p... | True | True | True | 2.773193 | 7401b8ca-d262-4801-88ad-c00498643487 | 019c691d-c58f-7d90-891e-f0ee70f70004 |
| 7 | How do sleep hygiene practices and sleep quali... | Sleep hygiene practices and sleep quality play... | | Sleep hygiene practices, such as maintaining a... | True | True | True | 5.221375 | d0da3207-d85c-45b8-bfea-3a76daa2fdf2 | 019c691d-f171-7bd2-b9a4-3c585b16ed92 |
| 8 | Who are Licensed Clinical Social Workers and w... | Licensed Clinical Social Workers are mental he... | | Licensed Clinical Social Workers are mental he... | True | True | False | 1.017363 | 1b8de446-700c-4c9e-9969-b924f3696c21 | 019c691e-4b41-7332-8d78-675f02787597 |
| 9 | How does mental health relate to sleep and wha... | Mental health and sleep have a bidirectional r... | | Sleep and mental health have a bidirectional r... | True | True | True | 4.537084 | 98be11f4-0a5f-42de-b888-02b4893de46e | 019c691e-6bea-78e2-8fde-ba0869d2d1a2 |
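
To compare experiments at a glance, the per-example feedback can be reduced to a pass rate per evaluator. The sketch below hard-codes the boolean feedback values from the results table above; in practice you would read them from the experiment results, and the `results` layout here is an assumption for illustration.

```python
# Feedback values copied from the baseline experiment table (rows 0-9).
results = {
    "qa": [True] * 10,
    "helpfulness": [True] * 10,
    "dopeness": [True] * 8 + [False] + [True],  # row 8 failed the dopeness check
}

# True counts as 1, so sum(values) is the number of passing examples.
pass_rates = {name: sum(values) / len(values) for name, values in results.items()}

for name, rate in pass_rates.items():
    print(f"{name}: {rate:.0%}")
```

With only ten examples, a single flipped judgment moves a pass rate by ten percentage points, so treat these numbers as directional rather than precise.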


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [32]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [33]:
rag_documents = docs

In [34]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:

smaller chunks (500):  
    more granular and specific  
    embeddings represent narrower concepts  
    retrieval can be more precise for exact matches  
    but may lack surrounding context needed to answer questions  
    might miss information that spans multiple chunks  

larger chunks (1000):  
    provide more context per retrieved piece  
    better chance of including complete thoughts or explanations  
    embeddings capture broader semantic meaning  
    less likely to split related information  
    but may include irrelevant information that dilutes relevance  

trade-offs:  
    too small: you retrieve precise snippets but miss context the llm needs to answer properly  
    too large: you get more context but retrieval might be less accurate because embeddings represent too many different concepts  

why it matters for this rag application:  
    the wellness documents have connected concepts (sleep affects stress, nutrition affects energy)  
    larger chunks (1000) help capture these relationships in single retrievals  
    gives the llm more complete information to generate accurate answers  
    improves performance on multi-hop questions that need context from broader sections  

In [35]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:

embedding models affect performance because they determine how text gets converted into vectors for similarity search  

what embedding models do:  
    turn text into numerical vectors  
    map similar concepts to nearby vectors  
    retrieval then works by finding the chunk vectors closest to the query vector  

text-embedding-3-small:  
    smaller vectors (1536 dimensions by default)  
    faster to compute and search  
    uses less memory  
    captures general semantic meaning  
    but may miss subtle nuances or relationships  

text-embedding-3-large:  
    larger vectors (3072 dimensions by default)  
    higher computational cost  
    more memory usage  
    can capture richer semantic relationships  
    better at representing subtle distinctions and context  
    can differentiate between closely related concepts more accurately  

why it matters for retrieval:  
    better embeddings = better similarity matching  
    the large model should distinguish between "stress management techniques" and "stress symptoms" more accurately  
    more dimensions = more capacity to encode complex semantic relationships  
    this can improve retrieval precision, so the most relevant chunks are found more reliably  

impact on this RAG application:  
    wellness documents have overlapping concepts (sleep, stress, exercise all relate to wellness)  
    the large model is better placed to separate these closely related topics  
    it should retrieve more contextually relevant chunks for complex questions  
    this leads to better answers because the LLM gets more relevant context  
    in short, moving from small to large is expected to improve the quality of what gets retrieved, which directly affects how well the system can answer questions  
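To make "finding vectors closest to the query vector" concrete, here is a small sketch of nearest-neighbor retrieval. It assumes cosine similarity as the distance metric, and the three-dimensional `toy_vectors` are invented for illustration; real embeddings come from the model and have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Made-up toy vectors, not real model output.
toy_vectors = {
    "stress management techniques": [0.9, 0.1, 0.2],
    "stress symptoms": [0.7, 0.6, 0.1],
    "meal planning": [0.1, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for "how do I manage stress?"

ranked = top_k(query, toy_vectors, k=2)
print(ranked)
```

The closer two related topics sit in vector space, the easier it is for retrieval to confuse them; an embedding model that separates them more cleanly makes this ranking more reliable.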

In [36]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [37]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [38]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)
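As a rough mental model of the chain above, the data flow can be sketched in plain Python. Every function here (`fake_retriever`, `fake_prompt`, `fake_llm`, `parse_output`, `run_chain`) is a hypothetical stand-in written for this illustration; none of them are LangChain APIs, and the real chain calls the Qdrant retriever and the OpenAI model instead.

```python
def fake_retriever(question: str) -> list[str]:
    # Stand-in for the vector store retriever: returns canned context.
    return ["Keep a consistent sleep schedule.", "Avoid screens before bed."]


def fake_prompt(inputs: dict) -> str:
    # Stand-in for the prompt template: fills context and question into text.
    context = "\n".join(inputs["context"])
    return f"Context:\n{context}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt: str) -> dict:
    # Stand-in for the chat model: returns a message-like dict.
    return {"content": f"[answer based on {len(prompt)} prompt characters]"}


def parse_output(message: dict) -> str:
    # Stand-in for the string output parser: extracts the text content.
    return message["content"]


def run_chain(inputs: dict) -> str:
    # The dict stage builds both keys from the same input question.
    stage_one = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # The remaining stages run left to right, like the | operators.
    return parse_output(fake_llm(fake_prompt(stage_one)))


answer = run_chain({"question": "How can I improve my sleep quality?"})
print(answer)
```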

Let's test it on the same question that we used before.

In [39]:
dopeness_rag_chain.invoke({"question": "How can I improve my sleep quality?"})

'Alright, let’s crank your sleep game all the way up to legendary status! Here’s the ultimate playbook to elevate your sleep quality straight from the hive mind of sleep science and wellness wisdom:\n\n**Master Your Sleep Hygiene:**\n- Lock in a consistent sleep schedule like clockwork — yes, even on wild weekends. Your body *loves* rhythm.\n- Craft a pre-bedtime ritual that whispers relaxation to your nervous system: think gentle stretching, diving into a chill book, or soaking in a warm bath.\n- Keep your bedroom a sleep fortress — cool (ideally 65-68°F/18-20°C), dark (blackout curtains or mask level stealth), and silent (white noise machine or earplugs to block out reality).\n- Ditch screens 1-2 hours before hitting the hay. Blue light is a sleep assassin.\n- Say ‘no thanks’ to caffeine post 2 PM — it can sneakily sabotage your night.\n- Get your body moving regularly, but avoid brain & body firing on all cylinders too close to bedtime.\n- Ease up on booze and heavy 

Finally, we can evaluate the new chain on the same test set!

In [40]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'bold-candy-61' at:
https://smith.langchain.com/o/c93f4590-e6e3-4356-8242-c69be9f3c1c8/datasets/9d91f67e-06d8-45d9-a9cb-34fcd59d4a0d/compare?selectedSessions=e877a7b6-d61b-453b-b3e2-670c8e046162





| | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | how can CBT and CBT-I help improve mental heal... | Yo, let’s break it down with some next-level m... | | Cognitive Behavioral Therapy (CBT) is an effec... | True | True | True | 5.329417 | 328c5956-2b89-4888-98c0-b9a3e5b8cb2a | 019c691f-0ca7-7491-ac58-6b8f65670d87 |
| 1 | How can incorporating cognitive behavioral the... | Alright, strap in for this brainwave boost: bl... | | In a holistic wellness plan, integrating cogni... | True | True | True | 4.34147 | cbccb4f8-ed75-4935-ad92-3feb977cfd6f | 019c691f-58e3-7910-ab3a-2cc703659ba8 |
| 2 | How do the sleep and recovery practices outlin... | Alright, here’s the ultimate brainwave merging... | | The sleep and recovery practices detailed in P... | True | True | True | 5.536114 | 6ba9453f-ca7a-4b4f-b98b-4f838beb042b | 019c691f-8631-79b2-a866-ec6cd8b5d2a3 |
| 3 | How do the sleep and recovery practices outlin... | Yo, let’s crank this up to max dopeness — here... | | The sleep and recovery practices detailed in P... | True | True | True | 6.461412 | 3508a15b-abf3-421d-9c17-eae34328db70 | 019c691f-c29a-72f2-a9ae-288b27c0fbb6 |
| 4 | How do factors influencing mental health, such... | Alright, let’s drop some knowledge bombs on ho... | | Factors influencing mental health include biol... | True | True | True | 5.86793 | 3f1f6a74-53a8-4cba-b2be-fa39ff3c11d6 | 019c6920-0cab-7392-9482-0c31b8b39b1d |
| 5 | H0w can stres and its impakt on physikal and m... | Alright, let’s crank this up to max dopeness. ... | | The provided context explains that stress can ... | True | True | True | 5.619766 | f8e28b14-4967-4e7c-aeb7-0d0afddab228 | 019c6920-470f-7f41-81eb-234618e5f37b |
| 6 | How can practicing self-awareness through jour... | Alright, here’s the rad breakdown straight fro... | | Practicing self-awareness through journaling p... | True | True | True | 3.90377 | 7401b8ca-d262-4801-88ad-c00498643487 | 019c6920-81b7-76f3-98cd-fd1720396ad7 |
| 7 | How do sleep hygiene practices and sleep quali... | Oh, you’re diving deep into the vortex where s... | | Sleep hygiene practices, such as maintaining a... | True | True | True | 5.545604 | d0da3207-d85c-45b8-bfea-3a76daa2fdf2 | 019c6920-add2-7ab3-8ca7-503e3055efc9 |
| 8 | Who are Licensed Clinical Social Workers and w... | Oh yeah, Licensed Clinical Social Workers (LCS... | | Licensed Clinical Social Workers are mental he... | True | True | True | 2.146024 | 1b8de446-700c-4c9e-9969-b924f3696c21 | 019c6920-e442-7531-805d-ffd6fe00969f |
| 9 | How does mental health relate to sleep and wha... | Yo, here’s the lowdown straight from the menta... | | Sleep and mental health have a bidirectional r... | True | True | True | 3.786113 | 98be11f4-0a5f-42de-b888-02b4893de46e | 019c6921-08ab-7230-aa3a-54af9585bde6 |
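The per-example feedback in the results above can be rolled up into one score per metric. The sketch below assumes the summary score is simply the mean of the boolean feedback values, which may differ from how LangSmith aggregates internally; `mean_scores` is a hypothetical helper, and `rows` mirrors the ten all-`True` rows shown in the output.

```python
def mean_scores(rows: list[dict[str, bool]]) -> dict[str, float]:
    """Average each boolean feedback column, treating True as 1 and False as 0."""
    if not rows:
        return {}
    metrics = rows[0].keys()
    return {m: sum(1 for r in rows if r[m]) / len(rows) for m in metrics}


# Ten examples, all True on every metric, as in the results for this experiment.
rows = [{"qa": True, "helpfulness": True, "dopeness": True} for _ in range(10)]

scores = mean_scores(rows)
print(scores)
```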


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:

screenshot available here: data/evaluation_comparison.png

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores

what changed between the two chains (analysis for Activity #2):

experiment #1 (flowery-level-98), default chain:
- dopeness: 0.75
- helpfulness: 1.00
- qa: 1.00

experiment #2 (bold-candy-61), dopeness chain:
- dopeness: 1.00
- helpfulness: 1.00
- qa: 1.00

metric changes explained:

dopeness: increased from 0.75 to 1.00
- the only metric that changed
- the new prompt explicitly instructed the model to "make your answer rad, ensure high levels of dopeness. do not be generic"
- the original prompt had no style instructions, so it produced plain factual responses
- this was the intended target of the prompt change

helpfulness: stayed at 1.00
- no change, both chains scored perfectly
- both chunk sizes (500 and 1000) provided enough context for these questions
- the switch to text-embedding-3-large did not lower quality
- the "dopeness" style didn't hurt helpfulness

qa correctness: stayed at 1.00
- no change, both chains answered correctly
- larger chunks and text-embedding-3-large may have retrieved more complete and more relevant context, but the score was already at its ceiling in experiment #1, so this test set cannot show that effect
- accuracy held even with the stylistic changes

why these results make sense:
- the modifications (larger chunks, better embeddings, dope prompt) were meant to improve style without sacrificing accuracy
- dopeness was the main target and the only metric that improved
- qa and helpfulness remained perfect, showing the changes didn't introduce errors
- the prompt change is the most likely cause of the dopeness gain, since it directly addressed response style
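The comparison above can also be checked with a few lines of Python. This is only a sketch using the scores reported for the two experiments; the variable names `baseline`, `candidate`, `deltas`, and `changed` are chosen for illustration.

```python
# Scores reported for experiment #1 (flowery-level-98) and #2 (bold-candy-61).
baseline = {"dopeness": 0.75, "helpfulness": 1.00, "qa": 1.00}
candidate = {"dopeness": 1.00, "helpfulness": 1.00, "qa": 1.00}

# Per-metric change from the default chain to the dopeness chain.
deltas = {metric: round(candidate[metric] - baseline[metric], 2) for metric in baseline}

# Metrics whose score actually moved between the two runs.
changed = [metric for metric, delta in deltas.items() if delta != 0]

for metric, delta in deltas.items():
    print(f"{metric}: {baseline[metric]:.2f} -> {candidate[metric]:.2f} ({delta:+.2f})")
print("changed metrics:", changed)
```

Metrics that start at 1.00 cannot go higher, so a delta of zero on helpfulness and qa says only that nothing regressed, not that retrieval stayed the same.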