# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install a few dependencies and provide some API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/adrianbarcan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/adrianbarcan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass
# Load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()

os.environ["LANGCHAIN_TRACING_V2"] = "true"


We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This gives us a test set we can use to measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps Ragas build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using a `DirectoryLoader` that reads each `.txt` file with the `TextLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/MentalHealthGuide.txt', 'data/HealthWellnessGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

The exact set of default transformations depends on the corpus length. In our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finds the headlines in each document, which the `HeadlineSplitter` step then uses to split the documents into smaller nodes
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine similarity and overlap heuristics on the embeddings and extracted properties above to construct relationships between the nodes.
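To make the relationship-building step concrete, here is a minimal pure-Python sketch of linking nodes by cosine similarity. It is not Ragas' `CosineSimilarityBuilder` implementation: the node names, the 3-dimensional toy vectors, and the `0.9` threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Link every pair of nodes whose similarity meets the (assumed) threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges


# Hypothetical toy "embeddings"; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "sleep_hygiene": [0.9, 0.1, 0.0],
    "insomnia_cbt": [0.8, 0.2, 0.1],
    "strength_training": [0.0, 0.1, 0.9],
}

edges = build_relationships(toy_embeddings, threshold=0.9)
print(edges)
```

Only the two sleep-related nodes end up connected, which is the kind of edge a multi-hop query can later traverse.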

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

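# Use a distinct name so the imported `default_transforms` function is not overwritten
# (otherwise re-running this cell fails because the name now points at a list).
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)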
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 10, relationships: 20)

We can save and load our knowledge graphs as follows.

In [10]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 10, relationships: 20)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [11]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthesizers is going to tackle a separate kind of query, which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [12]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
Ragas gives us three types of query synthesizers to generate different kinds of test questions, and each one tests our RAG system in a different way:

- SingleHopSpecificQuerySynthesizer: Creates straightforward, focused questions that can be answered from a single chunk of text. Think of it like a simple lookup, such as "What does Chapter 20 cover?" or "What is DBT?" It only needs to find one relevant piece of context to answer. This is the easiest type of query for a RAG system to handle.
- MultiHopAbstractQuerySynthesizer: This one is more complex. It generates questions that require information from multiple chunks, but the question itself is more high-level and abstract. For example, "How does sleep hygiene influence overall mental well-being?" To answer this, the system needs to pull context from different parts of the documents and synthesize a broader, more conceptual answer.
- MultiHopSpecificQuerySynthesizer: This is similar to multi-hop abstract in that it also needs multiple pieces of context, but the question is more specific and detailed. Something like "How do Chapters 9 and 21 together inform strategies for stress management?" asks for precise, concrete connections between specific sections. This is probably the hardest type for a RAG pipeline to get right because it needs both accurate retrieval across documents and specific, detailed reasoning.

In the notebook, they're distributed as 50% single-hop specific, 25% multi-hop abstract, and 25% multi-hop specific, which makes sense because we want the bulk of tests to cover the basics, but also to stress-test with harder, multi-context questions.
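As a rough illustration of what those weights mean for a 10-question test set, the sketch below turns a weight mapping into per-type sample counts. The `allocate_samples` helper and its largest-remainder rounding are assumptions made for this example; Ragas' internal allocation may round differently, and it can generate a different number of samples than requested.

```python
import math


def allocate_samples(weights, testset_size):
    """Split testset_size across query types in proportion to their weights.

    Largest-remainder rounding keeps the counts summing to testset_size.
    """
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    exact = {name: w * testset_size for name, w in weights.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining samples to the largest fractional parts first.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
counts = allocate_samples(weights, testset_size=10)
print(counts)
```

Because 25% of 10 is not a whole number, one of the two multi-hop types necessarily gets an extra question; small test sets can only approximate the configured ratios.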

Finally, we can use our `TestsetGenerator` to generate our testset!

In [13]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,In United States mental health stuff how do pe...,[The Mental Health and Psychology Handbook A P...,The context explains that mental health in the...,single_hop_specifc_query_synthesizer
1,What is depression according to the context?,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,The provided context does not include a specif...,single_hop_specifc_query_synthesizer
2,How does Vitamin D influence mental health?,[Write letters to or from your future self Jou...,Vitamin D is obtained from sunlight and fortif...,single_hop_specifc_query_synthesizer
3,How can I improve my mental health despite the...,[social interactions How to set and maintain b...,The digital age presents unique challenges for...,single_hop_specifc_query_synthesizer
4,What role do vitamins play in maintaining ment...,[The Personal Wellness Guide A Comprehensive R...,The provided context does not include specific...,single_hop_specifc_query_synthesizer
5,How does exercise influence the impact of ment...,[<1-hop>\n\nThe Mental Health and Psychology H...,"The context explains that physical activity, s...",multi_hop_abstract_query_synthesizer
6,How does sleep hygeine and sleep quality impac...,[<1-hop>\n\nWrite letters to or from your futu...,The context explains that sleep and mental hea...,multi_hop_abstract_query_synthesizer
7,How do stress reduction techniques like mindfu...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,Stress reduction techniques such as mindfulnes...,multi_hop_abstract_query_synthesizer
8,"How does sleep influence mental health, and wh...",[<1-hop>\n\nWrite letters to or from your futu...,Sleep has a significant impact on mental healt...,multi_hop_specific_query_synthesizer
9,how can CBT and CBT-I help with mental health ...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,Cognitive Behavioral Therapy (CBT) is an effec...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [15]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [15]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are the recommended exercises and strateg...,[The Personal Wellness Guide A Comprehensive R...,The provided context does not include specific...,single_hop_specifc_query_synthesizer
1,What does Stage 2 of sleep involve in the slee...,[PART 3: SLEEP AND RECOVERY Chapter 7: The Sci...,Stage 2 involves a drop in body temperature an...,single_hop_specifc_query_synthesizer
2,What information does Chapter 18 cover regardi...,[PART 5: BUILDING HEALTHY HABITS Chapter 13: T...,Chapter 18 discusses strategies to boost immun...,single_hop_specifc_query_synthesizer
3,How does the World Health Organization define ...,[The Mental Health and Psychology Handbook A P...,"According to the World Health Organization, me...",single_hop_specifc_query_synthesizer
4,how can exercise for common problems like lowe...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The wellness guide explains that gentle exerci...,multi_hop_abstract_query_synthesizer
5,How can incorporating mindfulness and social c...,[<1-hop>\n\nhour before bed - No caffeine afte...,Incorporating mindfulness and social connectio...,multi_hop_abstract_query_synthesizer
6,How can improving face-to-face interactions an...,[<1-hop>\n\nhour before bed - No caffeine afte...,Improving face-to-face interactions by engagin...,multi_hop_abstract_query_synthesizer
7,How can I improve my emotional intelligence an...,[<1-hop>\n\nhour before bed - No caffeine afte...,To improve emotional intelligence and manage c...,multi_hop_abstract_query_synthesizer
8,how chapter 7 and 17 connect about sleep and h...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,"chapter 7 talks about sleep and recovery, expl...",multi_hop_specific_query_synthesizer
9,H0w c4n I bUild a he4lthy m0rn1ng r0utine (cha...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,To build a healthy morning routine that improv...,multi_hop_specific_query_synthesizer


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
From working through the notebook, here's what I noticed about the two approaches:

1. Unrolled (Manual) Approach:
- We build everything step by step: create the knowledge graph, apply transformations manually, define our own query distribution with specific synthesizers and weights, and then generate the test set.
- The big advantage is control. We get to decide exactly how the knowledge graph is built, we can save/load it separately, inspect it, and fine-tune the mix of query types (like we did with 50% single-hop, 25% multi-hop abstract, 25% multi-hop specific).
- The downside is that it's more code and more complexity, and we need to understand how each piece fits together.

2. Abstracted (Automatic) Approach:

- It's basically a one-liner: `generator.generate_with_langchain_docs(docs, testset_size=10)`. Ragas handles the knowledge graph creation, transformations, and query distribution under the hood.
- The advantage is simplicity and speed. We can get a test set generated really quickly without worrying about the internals.
- The trade-off is that, as called here, we accept the defaults for the query type distribution and the transformations, and we don't end up with a saved knowledge graph to reuse.

When I'd choose each:

I'd go with the abstracted approach for quick prototyping or when I just need a basic evaluation dataset fast, like early in development when I'm iterating quickly and just want a sanity check on my RAG pipeline.
I'd switch to the unrolled approach when I need more rigorous, repeatable evaluation, for example when I want to control the difficulty mix of questions, reuse the same knowledge graph across experiments for consistency, or when I'm doing a final evaluation before deploying to production. Being able to save and reload the knowledge graph is really useful for reproducibility.
Basically, the abstracted version is great for getting started, but as the evaluation needs mature, the unrolled version gives the flexibility that we eventually need.



---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [None]:

# I'm shifting the weights towards multi-hop queries (75% multi-hop vs 25% single-hop).
custom_query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.40),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.35),
]

# Generate test set with custom distribution
custom_testset = generator.generate(testset_size=10, query_distribution=custom_query_distribution)
custom_df = custom_testset.to_pandas()

# Compare with the default distribution
default_df = testset.to_pandas()

print("=== DEFAULT Distribution Results ===")
print(default_df["synthesizer_name"].value_counts())
print()
print("=== CUSTOM Distribution Results ===")
print(custom_df["synthesizer_name"].value_counts())
print()

# Side-by-side comparison
print("=== Sample Questions Comparison ===")
for synth_name in custom_df["synthesizer_name"].unique():
    print(f"\n--- {synth_name} ---")
    samples = custom_df[custom_df["synthesizer_name"] == synth_name]["user_input"].head(2)
    for q in samples:
        print(f"  • {q[:100]}...")


Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

=== DEFAULT Distribution Results ===
synthesizer_name
single_hop_specifc_query_synthesizer    5
multi_hop_abstract_query_synthesizer    3
multi_hop_specific_query_synthesizer    3
Name: count, dtype: int64

=== CUSTOM Distribution Results ===
synthesizer_name
multi_hop_abstract_query_synthesizer    4
multi_hop_specific_query_synthesizer    4
single_hop_specifc_query_synthesizer    3
Name: count, dtype: int64

=== Sample Questions Comparison ===

--- single_hop_specifc_query_synthesizer ---
  • What does the World Health Organization define as mental health?...
  • What is MBSR and how does it contribute to stress reduction and well-being?...

--- multi_hop_abstract_query_synthesizer ---
  • mental health and well being how social emotional health connect...
  • How does face-to-face interactions role in social support and mental health?...

--- multi_hop_specific_query_synthesizer ---
  • how exercise helps mental health and why doing exercise is good for your mental health l

3. Here's a comparison of the two distributions based on the results:

Quality Differences in the Generated Questions:

- Single-hop questions are clean and direct, for example "What does the WHO define as mental health?". This is basically factoid retrieval from one chunk, which is easy for a RAG pipeline.
- Multi-hop abstract questions are noticeably fuzzier and more conversational, for example "mental health and well being how social emotional health connect". They read almost like how a real user would type a search query. These test whether our retrieval can handle vague, broad queries that span multiple sections.
- Multi-hop specific questions are the most complex, for example "How does regular exercise, as discussed in both the first and second parts of the context, contribute to...". They explicitly reference multiple sections and demand precise cross-document reasoning. These are the hardest for a RAG system.

Key Takeaway: 

By shifting from 50/25/25 to 25/40/35, we went from a test set that is half single-hop lookups to one that is three-quarters multi-hop questions. This makes our custom test set a tougher evaluation benchmark: if the RAG pipeline scores well on this distribution, we can be more confident it handles real-world complex queries, not just simple fact retrieval. The trade-off is that absolute scores will likely be lower, but that's fine because Ragas is best used for directional comparison anyway (as the notebook itself mentioned).

4. Why these weights? 

I went heavier on multi-hop queries (40% abstract + 35% specific = 75% multi-hop) because half of the original distribution was simple single-hop lookups. In a real health & wellness app, users ask questions that cross topics, like connecting sleep habits to stress management or nutrition to mental health. By stress-testing with more multi-hop queries, we can catch retrieval failures where the system struggles to pull relevant context from multiple documents. The 40/35 split between abstract and specific multi-hop ensures we test both high-level synthesis and precise cross-reference capabilities.
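As a quick check on how closely the generated sets follow the configured weights, the sketch below computes the multi-hop share from the `value_counts()` output printed above. The counts are copied from that output; the `multi_hop_share` helper is a hypothetical name introduced for this example.

```python
# Counts copied from the value_counts() output above
# (the synthesizer names keep Ragas' own spelling).
default_counts = {
    "single_hop_specifc_query_synthesizer": 5,
    "multi_hop_abstract_query_synthesizer": 3,
    "multi_hop_specific_query_synthesizer": 3,
}
custom_counts = {
    "multi_hop_abstract_query_synthesizer": 4,
    "multi_hop_specific_query_synthesizer": 4,
    "single_hop_specifc_query_synthesizer": 3,
}


def multi_hop_share(counts):
    """Fraction of generated questions that came from a multi-hop synthesizer."""
    total = sum(counts.values())
    multi = sum(n for name, n in counts.items() if name.startswith("multi_hop"))
    return multi / total


print(f"default: {multi_hop_share(default_counts):.0%} multi-hop")
print(f"custom:  {multi_hop_share(custom_counts):.0%} multi-hop")
```

With only 11 samples per run, the observed shares (6 of 11 and 8 of 11) land near, but not exactly on, the configured 50% and 75%.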

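For the next part, LangSmith needs an API key and tracing set to "true". Tracing was already switched on in Task 1 via `LANGCHAIN_TRACING_V2`; the API key should be available in the environment (for example, through the `.env` file loaded there).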

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [16]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the Ragas-created dataframe and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [17]:
for data_row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row[1]["user_input"]
      },
      outputs={
          "answer": data_row[1]["reference"]
      },
      metadata={
          "context": data_row[1]["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )
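The column-to-field mapping used in the loop above can be pulled into a small helper, which makes the expected `question`/`answer` format easy to check before anything is uploaded. This is a sketch only: `to_langsmith_example` and the placeholder row are hypothetical, and no LangSmith call is made here.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row to the inputs/outputs/metadata layout used above."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical placeholder row with the same column names as the Ragas dataframe.
sample_row = {
    "user_input": "<synthetic question>",
    "reference_contexts": ["<context chunk 1>", "<context chunk 2>"],
    "reference": "<reference answer>",
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example["inputs"], example["outputs"])
```

Each resulting dictionary carries the same three keyword arguments that `client.create_example` receives in the loop, with `dataset_id` supplied separately.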

## Basic RAG Chain

Time for some RAG!


In [18]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [19]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [20]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [21]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [22]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [23]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As usual, we'll be using `gpt-4.1-mini` for our RAG!

In [24]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [25]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [26]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'
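To see what the LCEL chain does step by step, here is a plain-Python sketch of the same data flow with stubbed components. `stub_retriever` and `stub_llm` are hypothetical stand-ins (no vector store or model is called), and the shortened template is an assumption; only the order of the steps mirrors the chain above.

```python
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def stub_retriever(question):
    # Hypothetical stand-in for `retriever`: always returns the same two chunks.
    return [
        "Cat-Cow Stretch: Do 10-15 repetitions.",
        "Bird Dog: Do 10 repetitions per side.",
    ]


def stub_llm(prompt):
    # Hypothetical stand-in for `llm`: wraps a canned reply in a message-like dict.
    return {"content": f"(model reply to a {len(prompt)}-character prompt)"}


def run_chain(inputs):
    # Step 1: the leading dict fans out. "context" runs the question through
    # the retriever, while "question" is passed along unchanged.
    prompt_vars = {
        "context": stub_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: fill the prompt template with both values.
    prompt = PROMPT_TEMPLATE.format(**prompt_vars)
    # Step 3: call the model. Step 4: keep only the text,
    # which is the role StrOutputParser plays in the real chain.
    return stub_llm(prompt)["content"]


answer = run_chain({"question": "What are some recommended exercises for lower back pain?"})
print(answer)
```

Swapping any one stage (a different retriever, prompt, or model) leaves the others untouched, which is what makes the later pipeline modifications easy to compare.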

## LangSmith Evaluation Set-up

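We'll define OpenAI's GPT-4.1 as an evaluation LLM. Note that the evaluators defined below are configured with the model string `"openai:gpt-4o"`, so as written they do not use `eval_llm`.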

In [27]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [28]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # "provider:model" string; the eval_llm defined above is not used here
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o",
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o",
)

> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`:
> - `labeled_helpfulness_evaluator`:
> - `dopeness_evaluator`:

## LangSmith Evaluation

In [29]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'essential-night-19' at:
https://eu.smith.langchain.com/o/2fe1c7ad-0fb5-4a33-9b4b-3dd820425a13/datasets/22b30e93-13c7-4274-b32b-b0b61b98db44/compare?selectedSessions=347291a5-9638-48d9-88f7-dce9b71c7050




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How can exercise improve mental health and wha...,Exercise improves mental health through multip...,,Exercise improves mental health by releasing e...,True,True,True,3.906751,d3b4d1ec-a391-49d8-9e7b-8331a41e4d24,019c5322-81ce-7021-a7dc-14287b99b9e0
1,"How can incorporating regular exercise, as rec...",Incorporating regular exercise improves mental...,,The provided context highlights that regular e...,True,True,False,3.818964,713e87c6-5e61-4f8c-a50e-51d6a1cdcce2,019c5322-be90-7873-b48c-2ae0d888172b
2,How does CBT-I incorporate principles from CBT...,CBT-I (Cognitive Behavioral Therapy for Insomn...,,"CBT-I, or Cognitive Behavioral Therapy for Ins...",True,True,True,3.926847,254649d3-3039-41ab-89e6-948f73e81106,019c5322-f4af-7851-9ead-5ce43d04c31c
3,How does CBT-I relate to the use of CBT for me...,Based on the provided context:\n\nCBT-I (Cogni...,,"CBT-I, or Cognitive Behavioral Therapy for Ins...",True,True,True,4.711107,23a088e0-f471-4ecb-88eb-a743389ee12f,019c5323-26b0-7ad3-918d-b736d0533dd8
4,Hw does gut-brain axis and probiotics affect m...,The gut-brain axis influences mental health th...,,The context explains that the gut-brain axis i...,True,True,True,2.986962,6366b82b-0df3-4a7b-9258-50a9f76d03ee,019c5323-625d-7662-aa60-3b6fae6d6b16
5,How can I buid healthy habbits and improve my ...,To build healthy habits and improve your sleep...,,To build healthy habits and enhance your sleep...,True,True,False,4.532756,a65d9dbc-a740-4c03-bbdc-be6789af558c,019c5323-ae2d-7e91-9bcc-04a14ac1bf44
6,How can managing common health concerns like h...,"According to the provided context, managing co...",,Managing common health concerns such as headac...,True,True,True,3.685956,80b9d9f9-5fc0-49d2-ac61-1e8cb89246d7,019c5323-f84a-7c92-8641-4762fe987f70
7,How can incorporating mindfulness and meditati...,Incorporating mindfulness and meditation pract...,,Incorporating mindfulness and meditation pract...,True,True,True,4.827632,8273ab46-8c47-4929-a204-ccee3b124bff,019c5324-336c-7432-9b4b-20b9a3d09a09
8,How do Psychologists help with setting and mai...,I don't know.,,Psychologists are mental health professionals ...,False,False,False,1.426339,a1352ca2-fa66-4908-8dfc-0a937fa33b77,019c5324-7f1a-7022-b68f-3a726de011f2
9,How does exercise help improve mental health a...,Exercise helps improve mental health in severa...,,Exercise affects the brain in multiple benefic...,True,True,False,3.7261,5d814f04-4c9c-4a35-b39b-5ceb1e141a4d,019c5324-a46a-7013-ad07-8bc9fecfd199
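A simple way to summarize the table above is a pass rate per evaluator. The sketch below uses the `feedback.*` columns copied row by row from that table; `pass_rate` is a hypothetical helper name, and in a live notebook the same numbers could be read from the experiment results instead of being typed in.

```python
# Feedback columns copied row by row (rows 0-9) from the results table above.
feedback = {
    "qa":          [True, True, True, True, True, True, True, True, False, True],
    "helpfulness": [True, True, True, True, True, True, True, True, False, True],
    "dopeness":    [True, False, True, True, True, False, True, True, False, False],
}


def pass_rate(flags):
    """Share of examples the evaluator marked as passing."""
    return sum(flags) / len(flags)


summary = {key: pass_rate(flags) for key, flags in feedback.items()}
for key, rate in summary.items():
    print(f"{key:<12} {rate:.0%}")
```

QA and helpfulness each miss on a single example (row 8, where the chain answered "I don't know"), while dopeness fails on four of ten. These rates are the baseline to compare against after the pipeline changes, since the scores are most meaningful as directional comparisons.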


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the retriever's embedding model to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [30]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [31]:
rag_documents = docs

In [32]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # maximum characters per chunk
    chunk_overlap=50,  # characters shared between neighbouring chunks
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
Chunk size is how much text we pack into each piece before embedding it and storing it in the vector database. Changing it directly affects retrieval quality, which cascades through the whole RAG pipeline:

- Chunks that are too small (e.g., 200 characters): each chunk captures very little context, so its embedding is narrow and specific. Retrieval might find a relevant sentence but miss the surrounding explanation needed to actually answer the question. The LLM then gets fragmented context and may produce incomplete or incoherent answers.
- Chunks that are too large (e.g., 5000 characters): each chunk contains a lot of information, but the embedding has to represent all of it in a single vector. This "dilutes" the semantic meaning, and the embedding becomes a blurry average of many topics. When we search for something specific, a large chunk might not rank as highly because its embedding doesn't closely match the query. We would also feed a lot of irrelevant text to the LLM alongside the relevant part, which wastes tokens and can confuse the model.
- A middle setting (like the 1000 characters with an overlap of 50 used in this notebook) tries to balance both: enough context for a meaningful answer, but focused enough that the embedding accurately represents the content. The overlap of 50 characters reduces the chance of cutting a key idea right at a chunk boundary.

In short, chunk size affects what gets retrieved and how much useful context the LLM sees. It is one of the most impactful parameters in a RAG system, which is why evaluation tools like RAGAS and LangSmith matter: they let us measure the effect of changing it rather than guess.
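
To make the trade-off concrete, here is a minimal sketch of how `chunk_size` and `chunk_overlap` change the number and length of chunks. It uses a simplified fixed-width character chunker, not `RecursiveCharacterTextSplitter` (which also tries to split on separators such as paragraph and sentence breaks). The `chunk_text` helper and the sample text are hypothetical and exist only for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample document: 200 repetitions of a 23-character sentence.
sample = "Sleep hygiene matters. " * 200

small = chunk_text(sample, chunk_size=200, chunk_overlap=50)
medium = chunk_text(sample, chunk_size=1000, chunk_overlap=50)
large = chunk_text(sample, chunk_size=5000, chunk_overlap=50)

for name, chunks in [("small", small), ("medium", medium), ("large", large)]:
    print(name, "->", len(chunks), "chunks, first chunk has", len(chunks[0]), "characters")
```

Smaller chunks mean many more vectors to store and search, each covering less text; a single very large chunk forces one vector to stand in for the whole document.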

In [33]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
The embedding model translates text (both our document chunks and the user's query) into numerical vectors. Swapping it out changes how meaning is represented, which directly affects retrieval:

- Different models capture semantics differently. The notebook uses `text-embedding-3-large`, one of OpenAI's most capable embedding models. A smaller or older model (like `text-embedding-ada-002`) might not capture nuanced relationships between concepts as well. For example, a stronger model might recognize that "insomnia" and "sleep hygiene" are related even though the words are different, while a weaker model might miss that connection.
- Embedding dimensionality matters. Larger models typically produce higher-dimensional vectors (3072 dimensions by default for `text-embedding-3-large`, versus 1536 for `text-embedding-ada-002`), which can represent more subtle distinctions between concepts. Retrieval can then be more precise, with the right chunks ranking higher for a given query, at the cost of more storage and slightly slower similarity search.
- Domain fit plays a role. Some embedding models are fine-tuned for specific domains (medical, legal, etc.). If our health and wellness documents use specialized terminology, a general-purpose embedding model might not represent those terms as effectively as one trained on similar content.
- Query-document alignment is required. The same model must embed both the query and the documents. If we switch models, the entire vector store needs to be re-embedded, because embeddings from different models live in different vector spaces and cannot be compared.

In short, the embedding model determines how well the system understands what is similar to what. A better model means more relevant chunks get retrieved, which means the LLM gets better context, which means better final answers. Together with chunk size from Question #3, it is among the most impactful knobs to tune in a RAG pipeline, which is why we use Ragas-generated test data and LangSmith evaluators to measure the impact of changing it.
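
As a rough illustration of why the embedding model drives retrieval, the sketch below ranks chunks by cosine similarity to a query vector. The three-dimensional vectors are made-up stand-ins (real `text-embedding-3-large` vectors have thousands of dimensions), and `cosine_similarity` is a hypothetical helper rather than anything Qdrant or LangChain exposes under that name.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors from the same embedding space."""
    if len(a) != len(b):
        # Vectors of different sizes come from different models and cannot be compared.
        raise ValueError("vectors must come from the same embedding model")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings for three chunks (illustrative values only).
chunk_vectors = {
    "sleep hygiene tips": [0.9, 0.1, 0.0],
    "exercise and mood": [0.1, 0.9, 0.1],
    "gut-brain axis": [0.0, 0.2, 0.9],
}

# Stand-in for the embedded query "How do I deal with insomnia?"
query_vector = [0.8, 0.2, 0.1]

ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```

A different embedding model would assign different vectors to the same texts, so the ranking, and therefore the context handed to the LLM, can change even when the documents and the query stay the same.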

In [34]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [35]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [37]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)
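
For readers new to this chain syntax, here is a plain-Python sketch of the data flow the chain above expresses: the question is sent to the retriever, the retrieved context and the original question fill the prompt, and the prompt goes to the model. `fake_retriever`, `fake_llm`, and the shortened `PROMPT` are hypothetical stand-ins, not the objects used in this notebook.

```python
from operator import itemgetter

# Shortened stand-in for the prompt template used in the notebook.
PROMPT = "Context: {context}\nQuestion: {question}\n"


def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever: returns a placeholder chunk instead of searching a vector store."""
    return [f"(retrieved chunk for: {question})"]


def fake_llm(prompt: str) -> str:
    """Hypothetical model: reports the prompt size instead of generating text."""
    return f"ANSWER based on a prompt of {len(prompt)} characters"


def run_chain(inputs: dict) -> str:
    # Step 1: build the prompt variables from the incoming question.
    question = itemgetter("question")(inputs)
    variables = {"context": fake_retriever(question), "question": question}
    # Step 2: fill the prompt template.
    prompt = PROMPT.format(**variables)
    # Step 3: call the model and return its text output.
    return fake_llm(prompt)


print(run_chain({"question": "How can I improve my sleep quality?"}))
```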

Let's test it on the same question that we tried before.

In [38]:
dopeness_rag_chain.invoke({"question" : "How can I improve my sleep quality?"})

'Alright, let’s crank your sleep game to legendary status! Based on the epic wisdom packed in the context, here’s your dope roadmap to unlocking next-level sleep quality:\n\n1. **Consistency is KING** — Hit the sack and rise up at the same time every. single. day. Weekends don’t get a free pass. Your circadian rhythm thrives on routine, so stick to it like a boss.\n\n2. **Craft a Chill Bedtime Ritual** — Think reading your fave book, gentle stretching, or a warm bath that melts stress away. This signals your brain it’s chillax time.\n\n3. **Zen out your sleep zone** — Keep your bedroom cool (65-68°F/18-20°C), dark (blackout curtains or sleep mask, no exceptions), and silent (white noise machines or earplugs will be your sleep ninjas). Invest in killer pillows and a mattress that cradles you like royalty.\n\n4. **Ditch the Screens Early** — Power down devices 1-2 hours before lights out. The blue light beast messes with your melatonin, your natural sleep hormone.\n\n5.

Finally, we can evaluate the new chain on the same test set!

In [39]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'pertinent-society-64' at:
https://eu.smith.langchain.com/o/2fe1c7ad-0fb5-4a33-9b4b-3dd820425a13/datasets/22b30e93-13c7-4274-b32b-b0b61b98db44/compare?selectedSessions=00ceb112-18a7-45b0-9196-9ac6cf4877c4




Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How can exercise improve mental health and wha...,"Yo, strap in — exercise isn’t just flexing mus...",,Exercise improves mental health by releasing e...,True,True,True,6.495559,d3b4d1ec-a391-49d8-9e7b-8331a41e4d24,019c5325-d182-77f0-abbe-2f05b86a4c9f
1,"How can incorporating regular exercise, as rec...","Yo, here’s the lowdown straight from the menta...",,The provided context highlights that regular e...,True,True,True,4.92292,713e87c6-5e61-4f8c-a50e-51d6a1cdcce2,019c5326-30db-7591-a5b5-636e0807e94e
2,How does CBT-I incorporate principles from CBT...,"Yo, let’s break down how CBT-I legit channels ...",,"CBT-I, or Cognitive Behavioral Therapy for Ins...",True,True,True,8.737433,254649d3-3039-41ab-89e6-948f73e81106,019c5326-780a-7b53-8c27-13f49984a653
3,How does CBT-I relate to the use of CBT for me...,"Yo, here’s the slick lowdown straight from the...",,"CBT-I, or Cognitive Behavioral Therapy for Ins...",True,True,True,4.687644,23a088e0-f471-4ecb-88eb-a743389ee12f,019c5326-cd88-70d1-9b33-405fde4625a9
4,Hw does gut-brain axis and probiotics affect m...,"Yo, here’s the straight-up cosmic connection b...",,The context explains that the gut-brain axis i...,True,True,True,4.453433,6366b82b-0df3-4a7b-9258-50a9f76d03ee,019c5327-1b3e-76e0-94f5-c580ea19eeaa
5,How can I buid healthy habbits and improve my ...,"Alright, let’s crank your wellness game to leg...",,To build healthy habits and enhance your sleep...,True,True,True,5.581946,a65d9dbc-a740-4c03-bbdc-be6789af558c,019c5327-5181-7243-965e-671261455551
6,How can managing common health concerns like h...,"Yo, listen up—rocking your wellness game means...",,Managing common health concerns such as headac...,True,True,True,5.814851,80b9d9f9-5fc0-49d2-ac61-1e8cb89246d7,019c5327-94c9-7d43-9634-90b237aee85b
7,How can incorporating mindfulness and meditati...,"Oh heck yes, let’s dive deep into the rad worl...",,Incorporating mindfulness and meditation pract...,True,True,True,6.554741,8273ab46-8c47-4929-a204-ccee3b124bff,019c5327-d97b-7481-ac8d-a9cc130be5c4
8,How do Psychologists help with setting and mai...,"Alright, let’s crank up the mental health mojo...",,Psychologists are mental health professionals ...,True,True,True,4.781982,a1352ca2-fa66-4908-8dfc-0a937fa33b77,019c5328-2242-76e2-8c6a-35c388c37048
9,How does exercise help improve mental health a...,"Alright, let’s crank up the mental health mojo...",,Exercise affects the brain in multiple benefic...,True,True,True,4.407238,5d814f04-4c9c-4a35-b39b-5ceb1e141a4d,019c5328-5f48-7883-a94b-db4e565f6d13
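
To compare the two runs numerically without opening LangSmith, the boolean feedback columns can be turned into pass rates. The sketch below is a minimal illustration: the values are copied by hand from rows 1 through 9 of the baseline table and the same rows of the table above, and the `pass_rate` helper and dictionary names are hypothetical.

```python
def pass_rate(flags: list[bool]) -> float:
    """Fraction of examples the evaluator marked True."""
    return sum(flags) / len(flags)


# Feedback for rows 1-9 of the baseline run, copied from its result table.
baseline = {
    "qa":          [True, True, True, True, True, True, True, False, True],
    "helpfulness": [True, True, True, True, True, True, True, False, True],
    "dopeness":    [False, True, True, True, False, True, True, False, False],
}

# Feedback for rows 1-9 of the dopeness run: every evaluator returned True.
dopeness_run = {
    "qa":          [True] * 9,
    "helpfulness": [True] * 9,
    "dopeness":    [True] * 9,
}

for metric in baseline:
    print(
        f"{metric:12s} baseline={pass_rate(baseline[metric]):.2f} "
        f"dopeness_chain={pass_rate(dopeness_run[metric]):.2f}"
    )
```

The single baseline failure on `qa` and `helpfulness` is the question about psychologists, which the baseline chain answered with "I don't know."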


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:
![LangSmith comparison of the two chains](image.png)

Looking at the LangSmith comparison between the two experiments, I can see clear patterns in how the metrics changed when switching from the baseline rag_chain to the dopeness_rag_chain. One caveat applies throughout: the second chain changes three things at once (the prompt, the chunk size, and the embedding model), so the differences cannot be attributed to the prompt alone.

1. QA (Correctness): High for both chains, with the dopeness chain marked correct on all ten questions. In the result tables, the baseline answered "I don't know." to the question about psychologists and was marked incorrect, while the dopeness chain answered the same question correctly. Since the prompt only changes the tone, this improvement most likely comes from retrieval: the larger chunks and `text-embedding-3-large` surfaced context that the baseline retriever missed.

2. Helpfulness: Same pattern. The baseline's "I don't know." was also marked unhelpful, while every dopeness chain answer was marked helpful. Where both chains answered, the underlying advice is the same. Whether the answer says "maintain a consistent sleep schedule" or "hit the sack at the same time every day like a boss," it covers the same ground as the reference answer.

3. Dopeness: Rose to 1.00 for the dopeness chain, which is the largest change and entirely expected. The baseline uses a standard, professional prompt that was never designed to be "dope", and its results were mixed: the evaluator marked some baseline answers True and others False. The dopeness chain explicitly instructs the LLM to avoid generic responses, which is exactly what the custom dopeness evaluator looks for. This shows that custom evaluators can measure specific behavioral changes in the output.

4. Latency: Higher for the dopeness chain (the `execution_time` column reaches about 8.7 seconds). The casual, energetic style tends to generate longer, more expressive responses (e.g., adding intros like "Alright, let's crank up the mental health mojo!"), and more output tokens mean more generation time. The larger chunks also add input tokens for the model to process.

5. Total Tokens: Higher for the dopeness chain. The dopeness prompt encourages verbose, enthusiastic responses, so the output token count increases. Input tokens likely increase as well, because each retrieved chunk is larger (assuming the retriever returns the same number of chunks).

Key takeaway: Prompt engineering can dramatically change style-based metrics, while retrieval changes (chunk size and embedding model) are the more likely driver of changes in accuracy-based metrics. Because all three were changed together here, isolating each effect would require running them as separate experiments. The comparison also shows the value of custom evaluators: without the dopeness evaluator, we wouldn't have been able to quantify the stylistic difference between the two chains at all.

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter** ‚Äî chunk size, embedding model, and prompt modifications can significantly affect evaluation scores