# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install a few dependencies and provide a few API keys, since this pipeline leverages a number of great technologies!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's Endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [17]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alber\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\alber\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [18]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [19]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps Ragas build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`, with `TextLoader` handling each `.txt` file.

In [20]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data\\HealthWellnessGuide.txt', 'data\\MentalHealthGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [21]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [22]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [23]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
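To make the relationship-building idea concrete, here is a minimal sketch of linking nodes whose embeddings are close under cosine similarity. The toy vectors, the node names, and the `threshold` value are illustrative assumptions; Ragas' own builders (such as the `CosineSimilarityBuilder` that appears in the progress output) apply their own logic and settings.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Return (node_a, node_b, score) for every pair at or above the threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges


# Hypothetical 3-dimensional "embeddings" (real ones have many more dimensions)
toy_embeddings = {
    "sleep_hygiene": [0.9, 0.1, 0.0],
    "insomnia": [0.8, 0.2, 0.1],
    "nutrition": [0.0, 0.1, 0.9],
}

edges = build_relationships(toy_embeddings)
print(edges)
```

Only the two sleep-related nodes end up connected here, which is the kind of edge a multi-hop query can later traverse.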

In [24]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct variable name so the imported `default_transforms` function is not shadowed
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 9, relationships: 18)

We can save and load our knowledge graphs as follows.

In [25]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 9, relationships: 18)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [26]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthesizers is going to tackle a separate kind of query which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [30]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
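As a rough illustration of what the weights mean, the sketch below turns a weight per synthesizer into a whole number of samples for a given `testset_size`, using largest-remainder rounding. The helper name `allocate_counts` and the rounding rule are assumptions made for illustration; Ragas may allocate samples differently internally.

```python
import math


def allocate_counts(testset_size, weights):
    """Split testset_size across synthesizers in proportion to their weights."""
    total = sum(weights.values())
    exact = {name: testset_size * w / total for name, w in weights.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}

    # Hand out any remaining samples to the largest fractional remainders
    leftover = testset_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Same weights as the query_distribution above, keyed by a readable label
weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}

counts = allocate_counts(10, weights)
print(counts)
```

With ten samples, half go to single-hop queries and the remaining five are shared between the two multi-hop synthesizers.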

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
The three types of Query Synthesizers serve complementary roles in evaluating a system's retrieval and reasoning capabilities. The SingleHopSpecificQuerySynthesizer generates direct, factual questions that can be answered using information from a single node or chunk. These queries test basic retrieval accuracy by focusing on localized content, such as identifying the topic of a specific section or extracting a concrete fact.

The MultiHopAbstractQuerySynthesizer produces questions that require integrating information from multiple parts of the graph to form higher-level conceptual insights. These queries assess the system's ability to synthesize dispersed evidence and reason abstractly across documents. Examples include exploring how different factors interact to influence an outcome.

The MultiHopSpecificQuerySynthesizer also requires cross-document reasoning but focuses on retrieving precise, concrete information that must be combined from multiple sources. These queries test whether the system can locate and merge specific details scattered across the graph.

Collectively, single-hop queries evaluate foundational retrieval performance, while multi-hop queries probe the system's ability to connect distributed information, a capability where many RAG systems struggle in real-world settings.

Finally, we can use our `TestsetGenerator` to generate our testset!

In [31]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is Chapter 9 about?,[PART 2: NUTRITION AND DIET Chapter 4: Fundame...,Chapter 9: Understanding and Managing Insomnia...,single_hop_specifc_query_synthesizer
1,What information can I find about Chapter 21 r...,[13: The Science of Habit Formation Habits are...,"Chapter 21 discusses digital wellness, highlig...",single_hop_specifc_query_synthesizer
2,What is Chapter 3 in the book?,[The Personal Wellness Guide A Comprehensive R...,Chapter 3: Building a Workout Routine. Startin...,single_hop_specifc_query_synthesizer
3,WHO mental health,[The Mental Health and Psychology Handbook A P...,"According to the context, the World Health Org...",single_hop_specifc_query_synthesizer
4,What is the main purpose of MBSR in stress red...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Mindfulness-Based Stress Reduction (MBSR) is a...,single_hop_specifc_query_synthesizer
5,How can I use strategies for healthy digital h...,[<1-hop>\n\nsocial interactions How to set and...,To improve digital mental health and manage te...,multi_hop_abstract_query_synthesizer
6,how sleep and mental health like sleep hygiene...,[<1-hop>\n\nWrite letters to or from your futu...,Sleep and mental health have a bidirectional r...,multi_hop_abstract_query_synthesizer
7,How can understanding the foundational aspects...,[<1-hop>\n\nThe Mental Health and Psychology H...,The Mental Health and Psychology Handbook expl...,multi_hop_abstract_query_synthesizer
8,How do chapters 12 and 14 together help improv...,[<1-hop>\n\n13: The Science of Habit Formation...,Chapter 12 discusses mindfulness and meditatio...,multi_hop_specific_query_synthesizer
9,How can CBT and CBT-I help improve mental heal...,[<1-hop>\n\nWrite letters to or from your futu...,CBT helps improve mental health by identifying...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [36]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [15]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are the recommended exercises and strateg...,[The Personal Wellness Guide A Comprehensive R...,The provided context does not include specific...,single_hop_specifc_query_synthesizer
1,What does Stage 2 of sleep involve in the slee...,[PART 3: SLEEP AND RECOVERY Chapter 7: The Sci...,Stage 2 involves a drop in body temperature an...,single_hop_specifc_query_synthesizer
2,What information does Chapter 18 cover regardi...,[PART 5: BUILDING HEALTHY HABITS Chapter 13: T...,Chapter 18 discusses strategies to boost immun...,single_hop_specifc_query_synthesizer
3,How does the World Health Organization define ...,[The Mental Health and Psychology Handbook A P...,"According to the World Health Organization, me...",single_hop_specifc_query_synthesizer
4,how can exercise for common problems like lowe...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The wellness guide explains that gentle exerci...,multi_hop_abstract_query_synthesizer
5,How can incorporating mindfulness and social c...,[<1-hop>\n\nhour before bed - No caffeine afte...,Incorporating mindfulness and social connectio...,multi_hop_abstract_query_synthesizer
6,How can improving face-to-face interactions an...,[<1-hop>\n\nhour before bed - No caffeine afte...,Improving face-to-face interactions by engagin...,multi_hop_abstract_query_synthesizer
7,How can I improve my emotional intelligence an...,[<1-hop>\n\nhour before bed - No caffeine afte...,To improve emotional intelligence and manage c...,multi_hop_abstract_query_synthesizer
8,how chapter 7 and 17 connect about sleep and h...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,"chapter 7 talks about sleep and recovery, expl...",multi_hop_specific_query_synthesizer
9,H0w c4n I bUild a he4lthy m0rn1ng r0utine (cha...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,To build a healthy morning routine that improv...,multi_hop_specific_query_synthesizer


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
The unrolled approach offers full control over every stage of the knowledge-graph and test-set generation pipeline. It allows direct inspection of intermediate artifacts, persistent storage of the KG for reuse, fine-grained debugging of node and relation quality, and the ability to replace default transformations with domain-specific components. It also improves reproducibility: a KG saved as JSON can be reloaded so that every run starts from the same graph, although the LLM-generated questions can still vary between runs. However, this level of transparency comes at the cost of increased manual effort and slower iteration.

The abstracted approach prioritizes speed and convenience by automating the entire pipeline with reasonable defaults, making it ideal for rapid prototyping, baseline creation, and scenarios where documents are generic enough that custom transformations are unnecessary. Its main limitations stem from reduced visibility into intermediate steps, lack of persistent KG reuse, and lower reproducibility due to variability in LLM-generated structures. In practice, the abstracted workflow is suitable for early experimentation, while the unrolled workflow is preferable in specialized domains, when reproducibility is essential, or when iterative refinement of test-set quality is required.

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [27]:
# ============================================
# Activity 1: Custom Query Distribution
# ============================================
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

# --- Default distribution ---
default_query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

# --- Custom distribution: heavier on multi-hop ---
custom_query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.2),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.4),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.4),
]

# Generate testset with custom distribution
from ragas.testset import TestsetGenerator
generator_custom = TestsetGenerator(
    llm=generator_llm, 
    embedding_model=generator_embeddings, 
    knowledge_graph=usecase_data_kg
)

custom_testset = generator_custom.generate(
    testset_size=10, 
    query_distribution=custom_query_distribution
)

custom_df = custom_testset.to_pandas()
custom_df

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What information does Chapter 8 cover regardin...,[PART 2: NUTRITION AND DIET Chapter 4: Fundame...,Chapter 8 discusses improving sleep quality th...,single_hop_specifc_query_synthesizer
1,What is an APPENDIX in the context of health i...,[13: The Science of Habit Formation Habits are...,The APPENDIX is a section that provides quick ...,single_hop_specifc_query_synthesizer
2,How does sleep impact mental health and how nu...,[<1-hop>\n\nWrite letters to or from your futu...,Sleep and mental health have a bidirectional r...,multi_hop_abstract_query_synthesizer
3,How can building healthy habits through daily ...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,Building healthy habits by engaging in daily w...,multi_hop_abstract_query_synthesizer
4,How can managing digital mental health through...,[<1-hop>\n\nsocial interactions How to set and...,Managing digital mental health by setting inte...,multi_hop_abstract_query_synthesizer
5,How can combining strategies for building new ...,[<1-hop>\n\n13: The Science of Habit Formation...,Building new habits through understanding the ...,multi_hop_abstract_query_synthesizer
6,"H0w cAn chApTer 20 bE rElAtEd tO chApTer 1, aS...",[<1-hop>\n\nThe Personal Wellness Guide A Comp...,ChApTer 1 pRoViDeS a cOmPrEhEnSiVe oVeRvIeW oF...,multi_hop_specific_query_synthesizer
7,How does CBT-I relate to CBT in improving ment...,[<1-hop>\n\nWrite letters to or from your futu...,"CBT-I, or Cognitive Behavioral Therapy for Ins...",multi_hop_specific_query_synthesizer
8,How do Chapters 9 and 15 together help a menta...,[<1-hop>\n\n13: The Science of Habit Formation...,Chapters 9 and 15 provide insights into improv...,multi_hop_specific_query_synthesizer
9,How do vitamins like B vitamins and other nutr...,[<1-hop>\n\nWrite letters to or from your futu...,"The context explains that B vitamins, found in...",multi_hop_specific_query_synthesizer


In [32]:
# ============================================
# Compare default vs custom distributions
# ============================================
import pandas as pd

# Original testset 
default_df = testset.to_pandas()

# Count by synthesizer type
default_counts = default_df['synthesizer_name'].value_counts()
custom_counts = custom_df['synthesizer_name'].value_counts()

comparison = pd.DataFrame({
    'Default Distribution': default_counts,
    'Custom Distribution': custom_counts
}).fillna(0).astype(int)

comparison.index = comparison.index.str.replace('_query_synthesizer', '')
print("=== Distribution Comparison ===")
print(comparison)
print()

# Compare question complexity qualitatively
print("=== Sample Questions Comparison ===")
print("\n--- Default (SingleHop heavy) ---")
for _, row in default_df[default_df['synthesizer_name'].str.contains('single')].head(2).iterrows():
    print(f"  Q: {row['user_input'][:100]}")

print("\n--- Custom (MultiHop heavy) ---")
for _, row in custom_df[custom_df['synthesizer_name'].str.contains('multi')].head(4).iterrows():
    # split('_') yields ['multi', 'hop', 'abstract', ...]; index 2 is the query style
    print(f"  [{row['synthesizer_name'].split('_')[2]}] Q: {row['user_input'][:100]}")

=== Distribution Comparison ===
                    Default Distribution  Custom Distribution
synthesizer_name                                             
multi_hop_abstract                     3                    4
multi_hop_specific                     3                    4
single_hop_specifc                     5                    2

=== Sample Questions Comparison ===

--- Default (SingleHop heavy) ---
  Q: What is Chapter 9 about?
  Q: What information can I find about Chapter 21 related to digital wellness and managing technology use

--- Custom (MultiHop heavy) ---
  [abstract] Q: How does sleep impact mental health and how nutrition and diet influence emotional well-being, espec
  [abstract] Q: How can building healthy habits through daily wellness practices such as hydration, nutrition, movem
  [abstract] Q: How can managing digital mental health through strategies like setting time limits and digital detox
  [abstract] Q: How can combining strategies for building new habits and establishing 

### Observations

The comparison suggests that **query distribution strongly shapes what an evaluation actually measures**. The default single-hop questions ("What is Chapter 9 about?") test document lookup, not understanding. The custom multi-hop questions force the system to retrieve across multiple chunks and synthesize coherent answers, which is where RAG pipelines tend to fail in production.

Key insight: **High scores on single-hop-heavy testsets can create false confidence.** As a rule of thumb (a judgment call, not a measured result), weighting multi-hop queries at roughly 60-80% helps surface weaknesses in both retrieval and generation.

### Why this distribution?

**Default:** 50% SingleHop / 25% MultiHopAbstract / 25% MultiHopSpecific  
**Custom:** 20% SingleHop / 40% MultiHopAbstract / 40% MultiHopSpecific

I shifted the weight heavily toward multi-hop queries for three reasons:

1. **Single-hop questions are too easy for modern RAG systems.** With decent 
   embeddings and chunk sizes, basic factual retrieval rarely fails. These 
   questions don't stress-test the pipeline where it matters.

2. **Real-world queries are often multi-hop.** In domains like healthcare,
   education, or consulting, users rarely ask "what does Chapter 3 say?"; they
   ask questions that require synthesizing information across multiple sources.
   A testset should reflect this reality.

3. **The split between Abstract and Specific multi-hop (40/40) tests two 
   different failure modes:**
   - Abstract: Can the system reason conceptually across documents? 
     (tests generation quality)
   - Specific: Can the system find and combine concrete facts from 
     different chunks? (tests retrieval precision)

**Observed difference:** The custom distribution produces questions that are noticeably harder and more realistic. The default distribution's single-hop questions like "What is Chapter 9 about?" could be answered even by a poorly configured RAG system, giving a false sense of quality.



Our LangSmith API key and the tracing flag (`LANGCHAIN_TRACING_V2` set to "true") were already provided in Task 1, so no further setup is needed for the next section.

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [34]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [37]:
# Unpack (index, row) directly instead of indexing into the tuple
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
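The column-to-field mapping used above can be isolated into a small helper, which makes it easy to check without calling LangSmith. This is a sketch only: the helper name `to_langsmith_example` and the sample row are hypothetical, and the field names simply mirror the keyword arguments passed to `client.create_example` in the cell above.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row onto the inputs/outputs/metadata layout."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row, shaped like one record of dataset.to_pandas()
sample_row = {
    "user_input": "What does Stage 2 of sleep involve?",
    "reference": "Placeholder reference answer.",
    "reference_contexts": ["Placeholder context passage."],
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example)
```

Each resulting dict could then be unpacked into the same call as above, for instance `client.create_example(**example, dataset_id=langsmith_dataset.id)`.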

## Basic RAG Chain

Time for some RAG!


In [38]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [39]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
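To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window chunker. It is an assumption-laden sketch, not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, whereas this version cuts purely by character position.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Hypothetical 1200-character document made of repeating digits
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(text)
print([len(c) for c in chunks])
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still retrievable in full from at least one chunk.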

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [40]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [41]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [42]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [43]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As usual, we'll be using `gpt-4.1-mini` for our RAG!

In [44]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [45]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
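The data flow through the chain can be traced with plain functions. In this sketch the retriever and the model are hypothetical stubs (`stub_retriever`, `stub_llm`) and the prompt is a shortened stand-in for `RAG_PROMPT`, so it runs without Qdrant or OpenAI; it only illustrates the order of the steps, not LCEL itself.

```python
# Shortened stand-in for the notebook's RAG_PROMPT
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def stub_retriever(question):
    """Hypothetical retriever: returns a fixed passage instead of querying Qdrant."""
    return ["Cat-Cow Stretch: alternate between arching and sagging your back."]


def stub_llm(prompt):
    """Hypothetical model: reports the prompt size instead of calling OpenAI."""
    return f"stub answer for a prompt of {len(prompt)} characters"


def build_prompt(inputs):
    # Mirrors the dict stage: both keys are derived from inputs["question"]
    question = inputs["question"]
    mapped = {"context": stub_retriever(question), "question": question}
    return PROMPT_TEMPLATE.format(**mapped)


def run_chain(inputs):
    # Mirrors `rag_prompt | llm | StrOutputParser()`: format, generate, return text
    prompt = build_prompt(inputs)
    return str(stub_llm(prompt))


prompt = build_prompt({"question": "What helps lower back pain?"})
answer = run_chain({"question": "What helps lower back pain?"})
print(prompt)
print(answer)
```

Note that the question is used twice: once to fetch context and once verbatim in the prompt, which is why the real chain calls `itemgetter("question")` in both branches.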

In [46]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

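We define `eval_llm` with OpenAI's GPT-4.1 below. Note that the evaluators in the next cell are configured with the `openai:gpt-4o` model string instead, so `eval_llm` is not actually used by them.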

In [47]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [48]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # judge model, given as a "provider:model" string
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o",
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o",
)

> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`: Does the answer match the reference factually? This is the most objective metric: it compares the content directly against the ground truth.
> - `labeled_helpfulness_evaluator`: Is the answer helpful to the user, given the reference? This is more subjective: an answer can be correct but still unhelpful if it is too vague or poorly structured.
> - `dopeness_evaluator`: Is the answer interesting, original, and non-generic? This is the "style" metric, deliberately subjective to demonstrate that any criterion can be evaluated using LLM-as-judge.

All three evaluators are built with `openevals` in LLM-as-judge mode, with GPT-4o as the judge.
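For contrast with the LLM-as-judge evaluators, here is a sketch of a deterministic evaluator that scores word overlap with the reference. It rests on two assumptions: that the installed LangSmith SDK accepts plain functions with `inputs`, `outputs`, and `reference_outputs` parameters as evaluators, and that the chain's string result is stored under the `"output"` key. The name `reference_overlap_evaluator` is hypothetical.

```python
def reference_overlap_evaluator(inputs, outputs, reference_outputs):
    """Score the fraction of reference words that also appear in the prediction."""
    predicted = set(outputs["output"].lower().split())
    expected = set(reference_outputs["answer"].lower().split())
    score = len(predicted & expected) / len(expected) if expected else 0.0
    return {"key": "reference_overlap", "score": score}


# Hypothetical example payloads, shaped like one dataset example and one run
result = reference_overlap_evaluator(
    inputs={"question": "How do B vitamins help?"},
    outputs={"output": "B vitamins support mood"},
    reference_outputs={"answer": "B vitamins support energy"},
)
print(result)
```

A cheap lexical check like this cannot judge correctness or helpfulness, but it costs nothing to run and gives a stable signal to compare against the judge-based scores.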

## LangSmith Evaluation

In [49]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'advanced-rest-24' at:
https://smith.langchain.com/o/4223b872-71bc-49e4-a691-1784d5029f58/datasets/1575b0a4-29ff-40a3-94b0-f5b03daf6a47/compare?selectedSessions=30dc8e8f-537c-43dd-ad43-da72dc2cbdf7





| | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How can Cognitive Behavioral Therapy (CBT) and... | Based on the context, Cognitive Behavioral The... | | The context highlights that Cognitive Behavior... | True | True | False | 4.526667 | 2415ba56-bfa4-49bb-8506-653f3c5bfffd | 019c6251-44fb-7d61-b66b-4c1bf7d6975d |
| 1 | How can CBT and CBT-I be used together to impr... | CBT (Cognitive Behavioral Therapy) and CBT-I (... | | Cognitive Behavioral Therapy (CBT) is effectiv... | True | True | True | 4.724505 | 86d34606-192d-4808-a3e3-bcc6228130ca | 019c6251-877f-7942-b29d-42f2660abf57 |
| 2 | How does vitamin D relate to mental health and... | Vitamin D deficiency is associated with depres... | | Vitamin D, obtained from sunlight and fortifie... | True | True | False | 1.392466 | d45c8c9d-3344-41b4-9d41-3bc6386e487b | 019c6251-cdec-7131-bab8-93db6e52cb1a |
| 3 | How do B vitamns and vitamns help with mental ... | Based on the provided context, B vitamins are ... | | B vitamins, found in whole grains, eggs, and l... | True | False | False | 2.050366 | f705a368-78ed-426f-866f-1420f6cbfe22 | 019c6251-ef19-78a0-ad38-88cc5c890e12 |
| 4 | How can healthy habbits and daily wellness che... | Healthy habits and daily wellness checklists h... | | Maintaining healthy habits and following a dai... | True | True | True | 3.741195 | d8d62d78-5294-4243-9072-26553fc0cae3 | 019c6252-3252-7803-8b1d-3494c0d2fe1d |
| 5 | How social connections help stress and mental ... | Based on the provided context:\n\n**How social... | | Social connections are linked to better mental... | True | True | True | 4.060099 | edfc972a-8c25-4b5c-b9e2-a2e54a09d365 | 019c6252-724b-7ac3-bd43-40dacb8a8ad6 |
| 6 | how mindfulness and social connection help men... | Based on the provided context:\n\n**Mindfulnes... | | The context shows that mindfulness, as a pract... | True | True | False | 3.447317 | c9e0d382-7385-4b3c-8eae-b9f53c633647 | 019c6252-9f63-7f32-bdad-11528e296b80 |
| 7 | How can incorporating healthy habits and a dai... | Incorporating healthy habits and a daily welln... | | Incorporating healthy habits like staying well... | True | True | False | 2.851302 | 4b89eef8-a928-4f5d-bec2-9dae0ee021b2 | 019c6252-e693-7221-9d73-8f85b856b40b |
| 8 | Whaat is the impact of the United States on me... | I don't know. | | The context provided does not include specific... | True | False | False | 0.718634 | 3e9cb604-6646-4c34-8c2c-ccbec62e99b8 | 019c6253-22dd-7893-9c9c-c2b3733d63ce |
| 9 | How does exercize help with mental well-being? | Exercise helps with mental well-being by:\n\n-... | | The context emphasizes that exercise is import... | True | True | True | 2.609015 | 66f5f23d-1e84-4d5c-98f9-d3d851af437e | 019c6253-5411-7a60-86b6-0b11bce32853 |
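
To read the feedback columns as aggregate scores, it can help to reduce them to pass rates. The sketch below does this in plain Python, using only the ten rows displayed above (transcribed by hand). The full experiment in LangSmith may contain more examples, so these figures describe the displayed rows only; `pass_rate` is an illustrative helper, not a LangSmith function.

```python
# Feedback values transcribed from the ten rows displayed above (row order 0-9).
baseline_feedback = {
    "qa":          [True, True, True, True, True, True, True, True, True, True],
    "helpfulness": [True, True, True, False, True, True, True, True, False, True],
    "dopeness":    [False, True, False, False, True, True, False, False, False, True],
}


def pass_rate(flags):
    """Fraction of examples the judge marked True."""
    return sum(flags) / len(flags)


baseline_rates = {metric: pass_rate(flags) for metric, flags in baseline_feedback.items()}

for metric, rate in baseline_rates.items():
    print(f"{metric}: {rate:.0%}")
```

For the displayed rows this gives 100% for `qa`, 80% for `helpfulness`, and 40% for `dopeness`.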


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used by the retriever to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [50]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [51]:
rag_documents = docs

In [52]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
Chunk size determines how much context each retrieved piece carries.
Smaller chunks (500 characters) fragment ideas across multiple pieces, which
risks incomplete retrieval, as seen with the "I don't know" response on the
cross-chapter question. Larger chunks (1000 characters) are more likely to keep
an idea self-contained, which can improve both retrieval relevance and
generation quality. However, excessively large chunks introduce noise and
reduce topic diversity within the fixed top_k window. The optimal size depends
on the document structure and the query complexity of the specific domain.
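
The effect of chunk size on fragmentation can be seen with a simplified fixed-window chunker. This is a stand-in for illustration only: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph breaks, so its chunk boundaries will differ. The helper `chunk_text`, the 2,000-character sample text, and the 50-character overlap for the 500-character baseline are all assumptions.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical 2,000-character document standing in for one loaded page.
sample_text = "x" * 2000

small_chunks = chunk_text(sample_text, chunk_size=500, chunk_overlap=50)
large_chunks = chunk_text(sample_text, chunk_size=1000, chunk_overlap=50)

print(len(small_chunks), len(large_chunks))
```

With a retriever that returns a fixed number of chunks, fewer and larger chunks mean each retrieved piece covers more of the source text, at the cost of more unrelated material per piece.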

In [53]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
The embedding model defines the semantic space where similarity is
computed. text-embedding-3-large (3072 dims) can capture finer semantic
distinctions than text-embedding-3-small (1536 dims), particularly for
abstract or complex queries where the question's phrasing diverges from
the document's vocabulary. This tends to improve retrieval precision on
multi-hop and abstract questions while having minimal impact on simple
factual lookups. The trade-off is increased computational cost and storage
requirements for indexing and search.
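
Two parts of this answer can be made concrete: how similarity is computed in the embedding space, and how the dimension count drives storage. The sketch below uses toy 3-dimensional vectors rather than real embeddings, and assumes vectors are stored as 4-byte floats in a hypothetical index of 1,000 chunks. `cosine_similarity` and `index_size_bytes` are illustrative helpers, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, assuming one float32 (4 bytes) per dimension."""
    return num_vectors * dims * bytes_per_value


# Toy vectors (not real embeddings): the related document points in nearly
# the same direction as the query, the unrelated one is orthogonal to it.
query = [1.0, 0.0, 1.0]
doc_related = [0.9, 0.1, 0.8]
doc_unrelated = [0.0, 1.0, 0.0]

related_score = cosine_similarity(query, doc_related)
unrelated_score = cosine_similarity(query, doc_unrelated)

small_index = index_size_bytes(1000, 1536)  # text-embedding-3-small dimensions
large_index = index_size_bytes(1000, 3072)  # text-embedding-3-large dimensions

print(round(related_score, 3), unrelated_score, small_index, large_index)
```

Under these assumptions, doubling the dimension count doubles the raw vector storage, which is the storage side of the trade-off described above.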

In [54]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [55]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [56]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)
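
The dictionary at the head of the chain fans the incoming payload out into the two variables the prompt expects. A plain-Python sketch of that first step is shown below. `fake_retriever` and `build_prompt_inputs` are hypothetical stand-ins so the data flow can be run without a vector store or an API key; they are not how LangChain implements the step.

```python
from operator import itemgetter


def fake_retriever(question: str) -> list:
    # Hypothetical stand-in for the Qdrant retriever: returns one canned passage.
    return [f"(retrieved passage about: {question})"]


def build_prompt_inputs(payload: dict) -> dict:
    """Mimic the dict step: the question feeds the retriever and is also passed through."""
    question = itemgetter("question")(payload)
    return {"context": fake_retriever(question), "question": question}


prompt_inputs = build_prompt_inputs({"question": "How can I improve my sleep quality?"})
print(prompt_inputs)
```

The resulting `context` and `question` keys match the placeholders in `DOPENESS_RAG_PROMPT`.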

Let's test it on the same input that we saw before.

In [57]:
dopeness_rag_chain.invoke({"question": "How can I improve my sleep quality?"})

"Yo, if you wanna level up your sleep game and crash like a legend, here’s the ultimate cheat code straight from the sleep sages:\n\n1. **Lock in that consistent sleep schedule** - Don't let weekends turn you into a nocturnal beast. Same bedtime, same wake time, every day.\n\n2. **Craft a chill bedtime ritual** - Think reading a killer book, gentle stretching, or a warm bath to ease you into dreamland.\n\n3. **Make your sleep cave perfect AF** - Keep the room cool (65-68°F), pitch dark with blackout curtains or a mask, and drown out noise with white noise or earplugs.\n\n4. **Shut down screens 1-2 hours before lights out** - Blue light messes with your brain’s sleep signals, don't let that tech beast keep you wired.\n\n5. **Ditch caffeine post 2 PM** - That late buzz is a stealthy saboteur of deep sleep.\n\n6. **Exercise regularly but steer clear of bedtime workouts** - Move that body, but not when you wanna snooze.\n\n7. **Keep alcohol and heavy meals off your late-ni

Finally, we can evaluate the new chain on the same test set!

In [58]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'passionate-dirt-76' at:
https://smith.langchain.com/o/4223b872-71bc-49e4-a691-1784d5029f58/datasets/1575b0a4-29ff-40a3-94b0-f5b03daf6a47/compare?selectedSessions=cb3278c0-190d-4efd-bc64-9ce5c2f1b004





| | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How can Cognitive Behavioral Therapy (CBT) and... | Yo, let’s get real about combining the powerho... | | The context highlights that Cognitive Behavior... | True | True | True | 6.632463 | 2415ba56-bfa4-49bb-8506-653f3c5bfffd | 019c6261-9a25-7cd0-873a-269f84259802 |
| 1 | How can CBT and CBT-I be used together to impr... | Alright, here’s the ultra-dopeness on merging ... | | Cognitive Behavioral Therapy (CBT) is effectiv... | True | True | True | 5.215215 | 86d34606-192d-4808-a3e3-bcc6228130ca | 019c6261-ea01-7de3-8f1e-4ed7bacc3598 |
| 2 | How does vitamin D relate to mental health and... | Vitamin D is like the sun-powered superhero fo... | | Vitamin D, obtained from sunlight and fortifie... | True | True | True | 1.933393 | d45c8c9d-3344-41b4-9d41-3bc6386e487b | 019c6262-2aed-7dc0-bb57-76e4ff0fae74 |
| 3 | How do B vitamns and vitamns help with mental ... | Oh, you're diving into the powerhouse world of... | | B vitamins, found in whole grains, eggs, and l... | True | True | True | 3.225165 | f705a368-78ed-426f-866f-1420f6cbfe22 | 019c6262-55d1-7432-bd7a-9eaeb8acad5b |
| 4 | How can healthy habbits and daily wellness che... | Yo, buckle up for a mental health and sleep up... | | Maintaining healthy habits and following a dai... | True | True | True | 5.67407 | d8d62d78-5294-4243-9072-26553fc0cae3 | 019c6262-881a-7a70-b1a6-bb15ebc247c6 |
| 5 | How social connections help stress and mental ... | Yo, let’s break down the ultimate combo to cru... | | Social connections are linked to better mental... | True | True | True | 5.163294 | edfc972a-8c25-4b5c-b9e2-a2e54a09d365 | 019c6262-bd26-7120-b9ce-0e1605e00859 |
| 6 | how mindfulness and social connection help men... | Alright, let’s crank this up to max dopeness w... | | The context shows that mindfulness, as a pract... | True | True | True | 4.771847 | c9e0d382-7385-4b3c-8eae-b9f53c633647 | 019c6262-ff6c-7f83-84c1-39447d1c86ea |
| 7 | How can incorporating healthy habits and a dai... | Alright, buckle up for a mind-blowing dive int... | | Incorporating healthy habits like staying well... | True | True | True | 5.097175 | 4b89eef8-a928-4f5d-bec2-9dae0ee021b2 | 019c6263-35ef-7382-83de-902be4ca5a36 |
| 8 | Whaat is the impact of the United States on me... | I got you, but here’s the deal: the context yo... | | The context provided does not include specific... | True | True | True | 1.631611 | 3e9cb604-6646-4c34-8c2c-ccbec62e99b8 | 019c6263-8b41-7db3-8823-8b9aaa16e6e5 |
| 9 | How does exercize help with mental well-being? | Alright, let’s crank up the cool factor and di... | | The context emphasizes that exercise is import... | True | True | True | 5.433297 | 66f5f23d-1e84-4d5c-98f9-d3d851af437e | 019c6263-c3b1-7960-b399-ba702a2d2bd3 |
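
Since both experiments ran on the same examples, the two result tables can be compared row by row to see which examples changed verdict. The sketch below uses the feedback values transcribed by hand from the ten rows displayed in each table. `flipped_to_pass` is an illustrative helper, and rows not displayed in the notebook output are not covered.

```python
# Feedback transcribed from the displayed rows of each experiment (row order 0-9).
baseline = {
    "helpfulness": [True, True, True, False, True, True, True, True, False, True],
    "dopeness":    [False, True, False, False, True, True, False, False, False, True],
}
improved = {
    "helpfulness": [True] * 10,
    "dopeness":    [True] * 10,
}


def flipped_to_pass(before, after):
    """Row indices that were judged False in the baseline and True after the change."""
    return [i for i, (b, a) in enumerate(zip(before, after)) if not b and a]


flips = {metric: flipped_to_pass(baseline[metric], improved[metric]) for metric in baseline}
print(flips)
```

Looking up the flipped rows in the tables (for example, row 8, where the baseline answered "I don't know.") shows which questions benefited from the change.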


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:
**Metric comparison (12 questions):**

| Metric | Baseline | Improved | Delta (percentage points) |
|--------|----------|----------|---------------------------|
| QA | 91.67% | 91.67% | 0.00 |
| Helpfulness | 75.00% | 91.67% | +16.67 |
| Dopeness | 41.67% | 100.00% | +58.33 |

**Key findings:**

1. **QA correctness was unchanged.** Factual accuracy depends on retrieval
   quality, not prompt style. The one persistent failure (Chapter 5 query)
   indicates a structural retrieval gap that requires architectural changes
   (semantic splitting, chapter metadata) rather than parameter tuning.

2. **Helpfulness improved significantly.** Larger chunks provided more
   complete context, eliminating two "I don't know" responses. The improved
   prompt also encouraged more thorough answers, increasing perceived
   usefulness.

3. **Dopeness is easily gameable.** Going from 41.67% to 100% with a single
   prompt change suggests that style-based LLM-as-judge metrics are fragile.
   In production, correctness and faithfulness metrics should take priority
   over subjective quality metrics.

4. **Trade-off: quality vs. cost.** Latency increased by about 50%
   (2.73 s → 4.10 s) due to larger chunks and longer responses. For
   production deployment, this trade-off must be evaluated against user
   experience requirements.
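
The reported figures can be cross-checked with a few lines of arithmetic. The pass counts below (11, 9, and 5 out of 12) are inferred from the percentages in the comparison table, not read from LangSmith, so treat them as an assumption. The latency values are the ones quoted in finding 4.

```python
NUM_QUESTIONS = 12  # stated in the metric comparison heading

# Assumption: (baseline, improved) pass counts inferred from the reported percentages.
passes = {
    "QA": (11, 11),
    "Helpfulness": (9, 11),
    "Dopeness": (5, 12),
}


def pct(count):
    """Pass rate as a percentage of all questions, rounded to two decimals."""
    return round(100 * count / NUM_QUESTIONS, 2)


# Differences between two percentages are percentage points, not percent.
deltas = {metric: round(pct(after) - pct(before), 2) for metric, (before, after) in passes.items()}

# Relative latency increase from 2.73 s to 4.10 s.
latency_increase_pct = round(100 * (4.10 - 2.73) / 2.73, 1)

print(deltas, latency_increase_pct)
```

Under these assumptions the deltas come out to 0.0, 16.67, and 58.33 percentage points, and the latency increase to about 50.2%, which matches the figures reported above.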

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores