# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to set a few API keys and install a number of dependencies, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the synthetic data generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vector store
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dznidaric\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dznidaric\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
from uuid import uuid4

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"
os.environ["LANGSMITH_ENDPOINT"] = "https://eu.api.smith.langchain.com"

# Environment values must be strings, so test for the key rather than re-assigning a possible None.
if not os.environ.get("LANGCHAIN_API_KEY"):
    os.environ["LANGSMITH_TRACING"] = "false"
    print("LangSmith tracing disabled")
else:
    print(f"LangSmith tracing enabled. Project: {os.environ['LANGCHAIN_PROJECT']}")

LangSmith tracing enabled. Project: AIM - SDG - 297ba2d2


Note that the cell above also sets a project name, which makes things easier for ourselves later.

OpenAI's API Key!

In [3]:
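# Fail early with a clear message; assigning a missing (None) value to os.environ raises a TypeError.
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running this notebook"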
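
Environment values must be strings, so a notebook benefits from handling a missing key explicitly. The following is a minimal sketch of one way to do that. The helper name `ensure_env` and its `prompt` parameter are hypothetical (they are not part of this notebook), and `getpass.getpass` is just one reasonable way to ask for the value interactively.

```python
import getpass
import os


def ensure_env(name, prompt=getpass.getpass):
    """Return the value of environment variable `name`, asking for it only if unset or empty."""
    value = os.environ.get(name)
    if not value:
        # `prompt` is injectable so the helper can be exercised without a terminal.
        value = prompt(f"{name}: ")
        os.environ[name] = value
    return value
```

Calling `ensure_env("OPENAI_API_KEY")` would leave an existing key untouched and only prompt when the key is absent.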

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set for finding out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps Ragas build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader` together with the `TextLoader`.

In [4]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data\\HealthWellnessGuide.txt', 'data\\MentalHealthGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [5]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N nodes connected by M relationships. These nodes and relationships (AKA "edges") define our knowledge graph and will be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)
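
To make the "nodes plus relationships" idea concrete, here is a toy graph in plain Python. It is only an illustration of the data structure: the class names `ToyNode`, `ToyRelationship`, and `ToyGraph`, the `shared_theme` edge type, and the sample contents are all made up and do not reflect how Ragas stores its `KnowledgeGraph` internally.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """A graph node holding arbitrary properties, e.g. page content or a summary."""
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class ToyRelationship:
    """A typed edge between two nodes."""
    source: str
    target: str
    kind: str


@dataclass
class ToyGraph:
    nodes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def neighbors(self, node_id):
        """Return ids of nodes connected to `node_id`, ignoring edge direction."""
        found = []
        for rel in self.relationships:
            if rel.source == node_id:
                found.append(rel.target)
            elif rel.target == node_id:
                found.append(rel.source)
        return found


# Two document-like nodes joined by one (hypothetical) relationship.
graph = ToyGraph()
graph.nodes.append(ToyNode("wellness", {"page_content": "Sleep hygiene basics"}))
graph.nodes.append(ToyNode("mental_health", {"page_content": "CBT-I for insomnia"}))
graph.relationships.append(ToyRelationship("wellness", "mental_health", "shared_theme"))
print(len(graph.nodes), len(graph.relationships), graph.neighbors("wellness"))
```

Multi-hop questions are possible precisely because edges like this let a generator walk from one node to a related one.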

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations depend on the corpus length. In our case they include:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finds the headlines in each document, which are then used to split it into smaller nodes
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine similarity and overlap heuristics over the outputs of the above transformations (such as their embeddings) to construct relationships between the nodes.

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a new name so the imported `default_transforms` function is not overwritten if the cell is re-run.
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 9, relationships: 17)
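
The relationship-building step can be sketched in a few lines: compare the embedding of every node pair and keep an edge when the cosine similarity is high enough. This is a simplified illustration, not Ragas' implementation. The function names, the three-dimensional toy vectors, and the `0.9` threshold are all assumptions chosen for readability.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_similarity_edges(embeddings, threshold):
    """Return (id_a, id_b, score) for every node pair whose similarity reaches the threshold."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges


# Toy embeddings: the two sleep-related nodes point in nearly the same direction.
toy_embeddings = {
    "sleep_summary": [1.0, 0.0, 1.0],
    "insomnia_summary": [1.0, 0.1, 0.9],
    "nutrition_summary": [0.0, 1.0, 0.0],
}
edges = build_similarity_edges(toy_embeddings, threshold=0.9)
for id_a, id_b, score in edges:
    print(f"{id_a} <-> {id_b}: {score:.3f}")
```

Only the sleep and insomnia nodes end up connected here; the nutrition node is nearly orthogonal to both and gets no edge.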

We can save and load our knowledge graphs as follows.

In [10]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 9, relationships: 17)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [10]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating, which Ragas makes simple by providing a number of pre-built `QuerySynthesizer`s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [None]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
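
A distribution such as 0.5 / 0.25 / 0.25 has to be turned into whole numbers of questions, and 10 samples cannot be split into 5 / 2.5 / 2.5. The sketch below uses largest-remainder rounding to do this. That rounding rule is an assumption for illustration (Ragas' own allocation logic may differ), and the helper name `allocate_counts` is hypothetical, although the results do match the per-synthesizer counts that appear in this notebook's generated test sets.

```python
import math
from fractions import Fraction


def allocate_counts(weights, testset_size):
    """Split `testset_size` across `weights` using largest-remainder rounding."""
    # Fractions built from the decimal text avoid floating-point surprises such as 0.1 * 3.
    exact = [Fraction(str(w)) * testset_size for w in weights]
    counts = [math.floor(x) for x in exact]
    leftover = testset_size - sum(counts)
    # Give leftover samples to the largest fractional parts; ties go to the earlier entry
    # because Python's sort is stable.
    order = sorted(range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


print(allocate_counts([0.5, 0.25, 0.25], 10))
print(allocate_counts([0.7, 0.2, 0.1], 10))
```

The weights are assumed to sum to 1; the function does not normalize them.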

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
The three query synthesizers generate different types of test questions from documents, each with varying complexity.

**SingleHopSpecificQuerySynthesizer** creates simple, direct questions answerable from a single piece of information in the documents. It generates straightforward, fact-based questions that require looking at only one specific piece of information. Example: "What is X?" or "When does Y happen?"

**MultiHopAbstractQuerySynthesizer** generates complex questions connecting multiple concepts or themes across different document sections. It focuses on broader conceptual relationships rather than specific facts.

**MultiHopSpecificQuerySynthesizer** creates complex questions that need information from multiple sources but target specific entities or facts that appear in different places. It is like the abstract version, but it focuses on concrete details rather than concepts.

Finally, we can use our `TestsetGenerator` to generate our testset!

In [13]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How do carbohydrates function as a primary ene... | [PART 2: NUTRITION AND DIET Chapter 4: Fundame... | Carbohydrates are described as the primary ene... | single_hop_specifc_query_synthesizer |
| 1 | Is Chapter 20 about the importance of Chapter ... | [13: The Science of Habit Formation Habits are... | Yes, Chapter 20 discusses the importance of so... | single_hop_specifc_query_synthesizer |
| 2 | What does the personal wellness guide say abou... | [The Personal Wellness Guide A Comprehensive R... | The guide recommends starting the week with a ... | single_hop_specifc_query_synthesizer |
| 3 | What does the World Health Organization say ab... | [The Mental Health and Psychology Handbook A P... | According to the World Health Organization, me... | single_hop_specifc_query_synthesizer |
| 4 | Is MBCT a therapie that combines mindfulness a... | [PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog... | Mindfulness-Based Cognitive Therapy (MBCT) com... | single_hop_specifc_query_synthesizer |
| 5 | how grow mindset help mental health and like it | [<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha... | the context talks about growth mindset being t... | multi_hop_abstract_query_synthesizer |
| 6 | how sleep hygiene practices and routines help ... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | Sleep hygiene practices and routines improve s... | multi_hop_abstract_query_synthesizer |
| 7 | How can enforcing boundaries and handling push... | [<1-hop>\n\nsocial interactions How to set and... | Enforcing boundaries and handling pushback are... | multi_hop_abstract_query_synthesizer |
| 8 | How do B vitamns and vitamns help in holstic w... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | B vitamins are essential for neurotransmitter ... | multi_hop_specific_query_synthesizer |
| 9 | how CBT and CBT-I help with mental health and ... | [<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha... | The context explains that CBT (Cognitive Behav... | multi_hop_specific_query_synthesizer |


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [11]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [15]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Can you tell me what Chapter 4 is about in tha... | [PART 2: NUTRITION AND DIET Chapter 4: Fundame... | Chapter 4 covers the fundamentals of healthy e... | single_hop_specifc_query_synthesizer |
| 1 | What is Chapter 21 about, and how does it rela... | [13: The Science of Habit Formation Habits are... | Chapter 21 discusses digital wellness, highlig... | single_hop_specifc_query_synthesizer |
| 2 | Why is exercise important for mental well-being? | [The Personal Wellness Guide A Comprehensive R... | Exercise is one of the most important things y... | single_hop_specifc_query_synthesizer |
| 3 | How does the United States address mental heal... | [The Mental Health and Psychology Handbook A P... | The provided context does not include specific... | single_hop_specifc_query_synthesizer |
| 4 | how stress management and mental wellness help... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | Stress management techniques like deep breathi... | multi_hop_abstract_query_synthesizer |
| 5 | How can practicing self-compassion when settin... | [<1-hop>\n\nsocial interactions How to set and... | Practicing self-compassion when setting bounda... | multi_hop_abstract_query_synthesizer |
| 6 | H0w can buiLDing a workout routine and usin be... | [<1-hop>\n\nThe Personal Wellness Guide A Comp... | Building a workout routine that includes regul... | multi_hop_abstract_query_synthesizer |
| 7 | How can establishing evening wind-down routine... | [<1-hop>\n\n13: The Science of Habit Formation... | Establishing evening wind-down routines such a... | multi_hop_abstract_query_synthesizer |
| 8 | How can understanding the habit formation proc... | [<1-hop>\n\n13: The Science of Habit Formation... | Understanding the habit formation process outl... | multi_hop_specific_query_synthesizer |
| 9 | Can you tell me how Chapter 4 and Chapter 19 a... | [<1-hop>\n\n13: The Science of Habit Formation... | Chapter 4 discusses the science of habit forma... | multi_hop_specific_query_synthesizer |


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
The "unrolled" manual approach gives us control over query synthesizers and distributions, while the "abstracted" automatic approach uses Ragas defaults.

**Manual Approach:**
- Full control over synthesizer types and proportions
- Customize question types for our specific RAG system
- Optimize costs by excluding expensive multi-hop synthesizers
- Requires understanding of each synthesizer type

**Automatic Approach:**
- Simple one-line setup
- Uses balanced default distribution (typically 33% per type)
- Less control and harder to optimize costs
- Best for quick prototyping

In short, choose:
- Manual: when you need specific question distributions, want to test particular retrieval patterns, optimize costs, or customize prompts
- Automatic: for quick prototyping, balanced test coverage, or building initial baseline datasets


---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [17]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

# Define a custom query distribution with different weights
query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.7),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.2),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.1),
]

# Generate a new test set and compare with the default
custom_testset = generator.generate(testset_size=10, query_distribution=query_distribution)
custom_testset.to_pandas()


Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

Exception in callback Task.__step()
handle: <Handle Task.__step()>
Traceback (most recent call last):
  File "C:\Users\dznidaric\AppData\Local\Programs\Python\Python313\Lib\asyncio\events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: cannot enter context: <_contextvars.Context object at 0x000001722724D100> is already entered

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What is the significance of PART 2 in the cont... | [PART 2: NUTRITION AND DIET Chapter 4: Fundame... | PART 2: NUTRITION AND DIET covers the fundamen... | single_hop_specifc_query_synthesizer |
| 1 | What information can I find about Chapter 14 r... | [13: The Science of Habit Formation Habits are... | Chapter 14 discusses morning routines for well... | single_hop_specifc_query_synthesizer |
| 2 | What is the significance of Tuesdai in the con... | [The Personal Wellness Guide A Comprehensive R... | The provided context does not mention or expla... | single_hop_specifc_query_synthesizer |
| 3 | Wht is the role of the World Health Organizato... | [The Mental Health and Psychology Handbook A P... | According to the context, the World Health Org... | single_hop_specifc_query_synthesizer |
| 4 | What is Mindfulness-Based Cognitive Therapy? | [PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog... | Mindfulness-Based Cognitive Therapy (MBCT) com... | single_hop_specifc_query_synthesizer |
| 5 | How does the neurotransmitter influence mental... | [Write letters to or from your future self Jou... | The context discusses how exercise affects the... | single_hop_specifc_query_synthesizer |
| 6 | Who are Licensed Clinical Social Workers? | [social interactions How to set and maintain b... | Licensed Clinical Social Workers are mental he... | single_hop_specifc_query_synthesizer |
| 7 | How can stress reduction techniques, such as m... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | Stress reduction techniques like mindfulness a... | multi_hop_abstract_query_synthesizer |
| 8 | Based on the strategies for digital mental hea... | [<1-hop>\n\nsocial interactions How to set and... | Individuals can effectively promote their ment... | multi_hop_abstract_query_synthesizer |
| 9 | Can you tell me how Chapter 4 and Chapter 14 a... | [<1-hop>\n\n13: The Science of Habit Formation... | Chapter 4 talks about the science of habit for... | multi_hop_specific_query_synthesizer |



In [24]:
print("Default Distribution:")
print(dataset.to_pandas()['synthesizer_name'].value_counts())
print("\nCustom Distribution (70/20/10):")
print(custom_testset.to_pandas()['synthesizer_name'].value_counts())

Default Distribution:
synthesizer_name
single_hop_specifc_query_synthesizer    4
multi_hop_abstract_query_synthesizer    4
multi_hop_specific_query_synthesizer    4
Name: count, dtype: int64

Custom Distribution (70/20/10):
synthesizer_name
single_hop_specifc_query_synthesizer    7
multi_hop_abstract_query_synthesizer    2
multi_hop_specific_query_synthesizer    1
Name: count, dtype: int64


Why I chose a 70/20/10 distribution:
- Early development: test basic retrieval before moving on to complex queries
- Cost optimization: multi-hop synthesizers are slower and more expensive

For the next section, LangSmith needs our API key and tracing set to "true", both of which we configured in Task 1.

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [6]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [12]:
# Each Ragas row becomes one LangSmith example with `question` / `answer` fields.
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
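
The field renaming done in the loop above can also be isolated into a small pure function, which makes the expected `question` / `answer` shape easy to check before anything is uploaded. This is only a sketch: the helper name `to_langsmith_example` is hypothetical, and the sample row contains placeholder strings rather than real generated data.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row (a dict-like object) to LangSmith example fields."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        # list() yields a plain list even if the source is a tuple or another sequence.
        "metadata": {"context": list(row["reference_contexts"])},
    }


# Placeholder row with the same column names as the Ragas dataframe.
sample_row = {
    "user_input": "example question",
    "reference": "example reference answer",
    "reference_contexts": ("example context chunk",),
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}
example = to_langsmith_example(sample_row)
print(example)
```

The resulting dictionary uses the same keyword names (`inputs`, `outputs`, `metadata`) that the notebook passes to `client.create_example`.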

## Basic RAG Chain

Time for some RAG!


In [13]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)

rag_documents = text_splitter.split_documents(rag_documents)
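
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately naive fixed-window splitter. It is not what `RecursiveCharacterTextSplitter` does internally: the real splitter first tries to break on separators such as paragraphs and lines so that chunks end at natural boundaries. The function name `chunk_with_overlap` and the synthetic digit text are assumptions for illustration.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 1,200 characters of synthetic text, split with the notebook's settings.
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_with_overlap(text, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
```

The last 50 characters of each chunk reappear at the start of the next one, which helps keep sentences that straddle a boundary retrievable from at least one chunk.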

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [15]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [16]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [17]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
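
With `k` set to 10 and chunks of at most 500 characters, each question can pull in up to roughly 5,000 characters of context. Conceptually, the retriever ranks stored chunks by similarity to the query embedding and keeps the best `k`. The sketch below shows that ranking step with toy two-dimensional vectors; the function names and data are made up, and a real vector store such as Qdrant uses indexed search rather than this brute-force scan.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, indexed_chunks, k):
    """Return the k chunk texts whose vectors are most similar to the query vector."""
    scored = ((cosine_similarity(query_vector, vec), text) for text, vec in indexed_chunks)
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [text for _, text in best]


# Toy index of (chunk text, embedding) pairs.
indexed_chunks = [
    ("stretching for back pain", [0.9, 0.1]),
    ("macronutrient basics", [0.1, 0.9]),
    ("core strengthening", [0.8, 0.3]),
]
print(top_k([1.0, 0.0], indexed_chunks, k=2))
```

A larger `k` raises the chance that the relevant chunk is included, at the cost of a longer and noisier prompt.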

To get the "A" in RAG, we'll provide a prompt.

In [18]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As usual, we'll be using `gpt-4.1-mini` for our RAG!

In [20]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set up our RAG LCEL chain!

In [21]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
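
The chain above reads as a pipeline: the input dict is fanned out into `context` and `question`, the prompt template is filled, the model is called, and the result is reduced to a string. The plain-Python sketch below mirrors that data flow with stand-ins. `stub_retriever`, `stub_llm`, `run_rag`, and the shortened `TEMPLATE` are all hypothetical; no LangChain objects or model calls are involved.

```python
TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def stub_retriever(question):
    """Stand-in for a vector-store retriever: returns canned context chunks."""
    return [f"(chunk relevant to: {question})"]


def stub_llm(prompt):
    """Stand-in for a chat model: echoes the prompt so the data flow is visible."""
    return f"LLM received:\n{prompt}"


def run_rag(inputs, retriever=stub_retriever, llm=stub_llm):
    question = inputs["question"]
    # Step 1: build both prompt variables from the single "question" input.
    prompt_vars = {"context": retriever(question), "question": question}
    # Step 2: fill the template, as the prompt step of the chain would.
    prompt = TEMPLATE.format(**prompt_vars)
    # Step 3: call the model and return a plain string.
    return llm(prompt)


answer = run_rag({"question": "What helps with sleep?"})
print(answer)
```

Swapping `stub_retriever` or `stub_llm` for other callables changes one stage without touching the rest, which is the same property the piped chain relies on.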

In [22]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

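We'll define `eval_llm` with OpenAI's GPT-4.1 as our evaluation LLM for our base evaluators. Note that the evaluators created below choose their judge model through their own `model` argument.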

In [23]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [None]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # judge model, given as a "provider:model" string
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o",
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o",
)

> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`: judges whether the model's response is correct, given the user's input and the reference answer.
> - `labeled_helpfulness_evaluator`: compares the model's response with the correct (reference) answer and asks: "Is this answer actually helpful to the user, considering what the correct answer should look like?"
> - `dopeness_evaluator`: works like the helpfulness evaluator, but instead of checking whether an answer is helpful, it checks whether the answer is "dope" (lit or cool rather than generic). Its prompt does not include the reference answer.
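
All three evaluators follow the same LLM-as-judge pattern: fill a prompt with the inputs, the model's outputs, and (optionally) the reference outputs, ask a judge model for a verdict, and return it as keyed feedback. The sketch below reproduces that pattern in plain Python with a stub judge. It is not how `openevals` implements `create_llm_as_judge`: `make_judge`, `stub_judge`, and the string-matching rule inside the stub are hypothetical stand-ins for a real model call.

```python
JUDGE_PROMPT = (
    "Input: {inputs}\n"
    "Prediction: {outputs}\n"
    "Reference answer: {reference_outputs}\n"
    "Is the prediction correct? Return 1 if correct, 0 if incorrect."
)


def make_judge(prompt_template, feedback_key, judge_fn):
    """Build an evaluator that formats the prompt, asks `judge_fn`, and returns keyed feedback."""
    def evaluator(inputs, outputs, reference_outputs=None):
        prompt = prompt_template.format(
            inputs=inputs, outputs=outputs, reference_outputs=reference_outputs
        )
        verdict = judge_fn(prompt).strip()
        return {"key": feedback_key, "score": verdict == "1"}
    return evaluator


def stub_judge(prompt):
    """Stand-in for an LLM call: rejects any prediction that says "I don't know"."""
    return "0" if "I don't know" in prompt else "1"


qa = make_judge(JUDGE_PROMPT, "qa", stub_judge)
print(qa(inputs={"question": "q"}, outputs={"output": "I don't know."}, reference_outputs={"answer": "a"}))
print(qa(inputs={"question": "q"}, outputs={"output": "a"}, reference_outputs={"answer": "a"}))
```

Changing only the prompt text and the feedback key is enough to turn the same factory into a helpfulness or "dopeness" judge, which is why the three definitions above look so similar.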

## LangSmith Evaluation

In [25]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'earnest-trousers-25' at:
https://eu.smith.langchain.com/o/3dbfbe1b-7743-4229-aa0c-fa2b7410d3fd/datasets/a153e5e5-64cf-4cdc-98f1-0e15a5d5b633/compare?selectedSessions=2ce9903d-78eb-4ab6-8ee3-7f21314b236a





| row | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How can CBT and CBT-I be used together to impr... | CBT and CBT-I can be used together to improve ... | | Cognitive Behavioral Therapy (CBT) focuses on ... | True | True | True | 4.794093 | 0f112e7e-4df7-4496-8373-ad30d3487d5a | 019c521b-0a31-7b03-9cae-c62ec8bc4704 |
| 1 | H0w can CBT and CBT-I be used togeth3r to impr... | CBT (Cognitive Behavioral Therapy) and CBT-I (... | | The context highlights that Cognitive Behavior... | True | True | True | 3.944595 | 5e1db07a-668a-4eba-aea4-acd7a176dcaf | 019c521b-3be8-7ce2-98e5-08f1d0dc8acc |
| 2 | How do chapters 4 and 16 relate to managing st... | I don't know. | | Chapter 4 discusses the fundamentals of health... | False | False | False | 2.115384 | bcba3d45-86de-4b09-b18d-937dbf5ca6f8 | 019c521b-748f-7cb3-aca4-cc7bfe112676 |
| 3 | How do Chapters 4 and 21 together support a me... | I don't know. | | Chapters 4 and 21 collectively emphasize the i... | False | False | False | 0.725123 | 5df0e63b-caac-4dcd-889c-8c476cd8038b | 019c521b-a17c-7c01-9396-547e3ea89283 |
| 4 | How does the gut-brain axis and sleep influenc... | The gut-brain axis influences mental health th... | | The gut-brain axis plays a crucial role in moo... | True | True | True | 3.533596 | 5ef48457-24ef-4406-8824-9a7c64ecec9a | 019c521b-d1c3-7d90-bb04-a7c8bafd2e38 |
| 5 | How can understanding the components of the ha... | Understanding the components of the habit loop... | | Understanding the components of the habit loop... | True | True | False | 4.314035 | 9ecd3d16-f8d7-41ef-ac86-afc5ee099824 | 019c521c-0cfc-7de0-b16d-c26892c54861 |
| 6 | How does social media impact mental health, es... | Social media impacts mental health in several ... | | Social media can negatively impact mental heal... | True | True | True | 3.451959 | f7e3137c-cbea-4e6c-a9b4-ccfa0cb00f1b | 019c521c-4aee-7e72-83aa-234168332396 |
| 7 | How can establishing an effective evening wind... | Establishing an effective evening wind-down ro... | | An effective evening wind-down routine can sig... | True | True | True | 4.466266 | 08af1f7f-815b-4a33-ae15-d92ebc4c53cc | 019c521c-81ae-7f23-bec9-319c38c5b20d |
| 8 | What does the World Health Organization say ab... | According to the World Health Organization, me... | | According to the World Health Organization, me... | True | True | False | 1.62493 | 03b95f29-9495-409c-bc60-c599c099c31c | 019c521c-d187-7092-ae64-35210e1adfa6 |
| 9 | What is a Bird Dog in the context of exercise ... | I don't know. | | The Bird Dog is an exercise that involves exte... | False | False | False | 1.023378 | e1f0cf87-f541-4ce0-99aa-fc2674564e3d | 019c521c-fd1d-7fd2-8198-a00978b2565b |
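A quick way to read the baseline results is to aggregate the feedback columns. The sketch below hard-codes the True/False values from the results above (rows 0-9) rather than calling LangSmith; the variable and function names are illustrative.

```python
# Feedback columns copied by hand from the baseline results above (rows 0-9).
baseline = {
    "qa":          [True, True, False, False, True, True, True, True, True, False],
    "helpfulness": [True, True, False, False, True, True, True, True, True, False],
    "dopeness":    [True, True, False, False, True, False, True, True, False, False],
}

# Rows where the chain answered "I don't know." in the results above.
abstained_rows = [2, 3, 9]


def pass_rate(flags):
    """Fraction of examples the judge marked True."""
    return sum(flags) / len(flags)


baseline_rates = {metric: pass_rate(flags) for metric, flags in baseline.items()}
qa_failures = [row for row, ok in enumerate(baseline["qa"]) if not ok]

print(baseline_rates)                 # {'qa': 0.7, 'helpfulness': 0.7, 'dopeness': 0.5}
print(qa_failures == abstained_rows)  # True
```

Every `qa` failure in the baseline run is an "I don't know." answer, which suggests the chain is failing because the retrieved context does not contain the answer, not because it is producing wrong answers.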


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model behind the retriever to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [26]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [27]:
rag_documents = docs

In [28]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
Changing the chunk size changes how much text each retrieved chunk carries, and therefore how much information we send to the model at once. In our case we doubled `chunk_size`, so each retrieved chunk brings more surrounding context, which can help when an answer spans several sentences. The trade-off is that larger chunks may also pull in irrelevant text: the model has to process roughly twice as much context, which increases latency and cost and can dilute the important details.
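The effect can be sketched with a naive fixed-width chunker. This is a simplified stand-in for `RecursiveCharacterTextSplitter`, not its actual algorithm. The 10,000-character document, the earlier chunk size of 500 (half of 1000, as the answer states), the overlap of 50 for both runs, and a top-k of 4 are all assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-width character chunker (illustrative stand-in only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


document = "x" * 10_000  # hypothetical 10,000-character document
TOP_K = 4                # assumed number of chunks the retriever returns

for size in (500, 1000):
    chunks = chunk_text(document, chunk_size=size, chunk_overlap=50)
    context_chars = sum(len(chunk) for chunk in chunks[:TOP_K])
    print(f"chunk_size={size}: {len(chunks)} chunks, "
          f"up to {context_chars} characters of context")
```

With the same top-k, doubling the chunk size roughly halves the number of chunks in the index and doubles the amount of text placed in the prompt.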

In [29]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
Changing the embedding model changes how meaning is represented, and therefore which chunks the retriever considers closest to a question. We switched to `text-embedding-3-large`, which produces higher-dimensional vectors and generally captures meaning better, so it should retrieve more relevant chunks and make answers more accurate and useful. This comes at a slightly higher cost, but if the larger model gets the correct answer on the first try where a smaller model would need several attempts, it can end up cheaper overall.
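A toy sketch of why the embedding model matters for retrieval: the retriever returns whichever chunks lie closest to the query in embedding space, so a different embedding can change the ranking. The vectors below are made-up numbers, not real OpenAI embeddings, and `model_a`, `model_b`, and the chunk names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=1):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors for the same query and chunks under two hypothetical models.
model_a = {  # lower-dimensional toy embedding
    "query": [1.0, 0.0],
    "chunks": {"sleep_chunk": [0.6, 0.8], "caffeine_chunk": [0.7, 0.7]},
}
model_b = {  # higher-dimensional toy embedding
    "query": [1.0, 0.0, 1.0],
    "chunks": {"sleep_chunk": [0.9, 0.1, 0.9], "caffeine_chunk": [0.9, 0.1, 0.0]},
}

print(top_k(model_a["query"], model_a["chunks"]))
print(top_k(model_b["query"], model_b["chunks"]))
```

In this toy setup the two models put different chunks first for the same query. Whether a larger real model ranks the right chunk higher on this dataset is something only the evaluation can show.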

In [30]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [31]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [32]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we saw before.

In [33]:
dopeness_rag_chain.invoke({"question" : "How can I improve my sleep quality?"})

'Alright, let’s crank your sleep game up to legendary status! Based on the rad context, here’s the ultimate playbook to turbocharge your sleep quality:\n\n1. **Lock in a consistent sleep schedule** — wake and crash at the same time daily, even on wild weekends. Your body LOVES rhythm.\n\n2. **Craft a zen bedtime ritual** — think chill vibes like reading, gentle stretching, or a soul-soothing warm bath. Cue the relaxation!\n\n3. **Optimize your sleep dojo** — keep your bedroom cool (65-68°F / 18-20°C), blackout those rays, and drown distractions with white noise or earplugs. Invest in a mattress and pillows that feel like clouds.\n\n4. **Ditch screens 1-2 hours before bed** — those blue lights are sneaky sleep saboteurs.\n\n5. **Cut caffeine after 2 PM** — no late-day jitters creeping into your dreamland.\n\n6. **Move your body regularly**, but skip heavy workouts near bedtime — get those good endorphins without revving your engine too close to lights out.\n\n7. **Avoi

Finally, we can evaluate the new chain on the same test set!

In [34]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'drab-root-72' at:
https://eu.smith.langchain.com/o/3dbfbe1b-7743-4229-aa0c-fa2b7410d3fd/datasets/a153e5e5-64cf-4cdc-98f1-0e15a5d5b633/compare?selectedSessions=a374d856-7e66-479c-84b2-fba043c43a1b





| row | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How can CBT and CBT-I be used together to impr... | Alright, buckle up, because merging CBT and CB... | | Cognitive Behavioral Therapy (CBT) focuses on ... | True | True | True | 6.662688 | 0f112e7e-4df7-4496-8373-ad30d3487d5a | 019c5229-d301-7ee3-88ca-fccb07e2536c |
| 1 | H0w can CBT and CBT-I be used togeth3r to impr... | Alright, here’s the straight fire on how CBT a... | | The context highlights that Cognitive Behavior... | True | True | True | 6.633139 | 5e1db07a-668a-4eba-aea4-acd7a176dcaf | 019c522a-217e-7313-aa0a-75c84f6497d3 |
| 2 | How do chapters 4 and 16 relate to managing st... | Yo, here’s the deal: Chapter 4 isn’t in the pr... | | Chapter 4 discusses the fundamentals of health... | False | False | True | 3.082079 | bcba3d45-86de-4b09-b18d-937dbf5ca6f8 | 019c522a-7c50-7f40-898f-377d5d7c9131 |
| 3 | How do Chapters 4 and 21 together support a me... | Alright, let’s crank this up to max rad mode! ... | | Chapters 4 and 21 collectively emphasize the i... | False | True | True | 8.979395 | 5df0e63b-caac-4dcd-889c-8c476cd8038b | 019c522a-d2f4-77a3-bbdb-61420ae99465 |
| 4 | How does the gut-brain axis and sleep influenc... | Yo, buckle up for this mind-body vibe check! T... | | The gut-brain axis plays a crucial role in moo... | True | True | True | 5.592519 | 5ef48457-24ef-4406-8824-9a7c64ecec9a | 019c522b-3688-7992-9822-579a9d31efee |
| 5 | How can understanding the components of the ha... | Alright, let’s crank up the dopeness on habit-... | | Understanding the components of the habit loop... | True | True | True | 7.341893 | 9ecd3d16-f8d7-41ef-ac86-afc5ee099824 | 019c522b-7c62-7262-a8f8-2adba277091a |
| 6 | How does social media impact mental health, es... | Oh, you wanna know how social media shakes up ... | | Social media can negatively impact mental heal... | True | True | True | 6.277405 | f7e3137c-cbea-4e6c-a9b4-ccfa0cb00f1b | 019c522b-c988-7720-aa76-15b4d22beacb |
| 7 | How can establishing an effective evening wind... | Yo, let’s light up the path to mental health m... | | An effective evening wind-down routine can sig... | True | True | True | 6.564516 | 08af1f7f-815b-4a33-ae15-d92ebc4c53cc | 019c522c-08cb-7e82-81ac-6888160f6d5d |
| 8 | What does the World Health Organization say ab... | Alright, let’s drop some serious knowledge on ... | | According to the World Health Organization, me... | True | True | True | 4.34342 | 03b95f29-9495-409c-bc60-c599c099c31c | 019c522c-574f-7e40-aa9d-4b24fe3173f5 |
| 9 | What is a Bird Dog in the context of exercise ... | I don't know. | | The Bird Dog is an exercise that involves exte... | False | False | False | 0.900763 | e1f0cf87-f541-4ce0-99aa-fc2674564e3d | 019c522c-a7a0-7711-87c4-a26d52bb889b |


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:
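The dopeness metric changed the most, as expected: we changed the prompt so that the LLM responds as a "cool guy". In the results above, `dopeness` rose from 5/10 to 9/10, `helpfulness` from 7/10 to 8/10, and `qa` stayed at 7/10. When we instruct the LLM to "act" with a certain personality, the metric that measures that personality moves accordingly. In the same way, we could check whether the LLM answers in formal language. Because the prompt, chunk size, and embedding model were all changed at once, the smaller movements cannot be attributed to any single change.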

![alt text](<Snimka zaslona 2026-02-12 151239.png>)
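Alongside the screenshot, the two runs can be compared row by row. The sketch below hard-codes the feedback columns from the two result tables in this notebook (rows 0-9) rather than querying LangSmith; the names `baseline`, `dopeness_run`, and `flipped_rows` are illustrative.

```python
# Feedback columns copied by hand from the two result tables above (rows 0-9).
baseline = {
    "qa":          [True, True, False, False, True, True, True, True, True, False],
    "helpfulness": [True, True, False, False, True, True, True, True, True, False],
    "dopeness":    [True, True, False, False, True, False, True, True, False, False],
}
dopeness_run = {
    "qa":          [True, True, False, False, True, True, True, True, True, False],
    "helpfulness": [True, True, False, True, True, True, True, True, True, False],
    "dopeness":    [True, True, True, True, True, True, True, True, True, False],
}


def flipped_rows(before, after):
    """Return the row indices whose verdict changed between the two runs."""
    return [row for row, (b, a) in enumerate(zip(before, after)) if b != a]


for metric in baseline:
    before, after = baseline[metric], dopeness_run[metric]
    print(
        f"{metric}: {sum(before)}/{len(before)} -> {sum(after)}/{len(after)}, "
        f"changed rows: {flipped_rows(before, after)}"
    )
# qa: 7/10 -> 7/10, changed rows: []
# helpfulness: 7/10 -> 8/10, changed rows: [3]
# dopeness: 5/10 -> 9/10, changed rows: [2, 3, 5, 8]
```

Rows 2 and 3 are worth a closer look: the baseline chain answered "I don't know." for both, while the new chain produces a stylized answer that the dopeness judge accepts even though `qa` still marks it incorrect. The dopeness score therefore reflects style, not correctness.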

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter** ‚Äî chunk size, embedding model, and prompt modifications can significantly affect evaluation scores