# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install a few dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a repeatable way to measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps RAGAS build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using the `TextLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/MentalHealthGuide.txt', 'data/HealthWellnessGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

In [9]:
type(docs[0])

langchain_core.documents.base.Document

In [10]:
dir(docs[0])

['__abstractmethods__',
 '__annotations__',
 '__class__',
 '__class_getitem__',
 '__class_vars__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__fields__',
 '__fields_set__',
 '__firstlineno__',
 '__format__',
 '__ge__',
 '__get_pydantic_core_schema__',
 '__get_pydantic_json_schema__',
 '__getattr__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__pretty__',
 '__private_attributes__',
 '__pydantic_complete__',
 '__pydantic_computed_fields__',
 '__pydantic_core_schema__',
 '__pydantic_custom_init__',
 '__pydantic_decorators__',
 '__pydantic_extra__',
 '__pydantic_fields__',
 '__pydantic_fields_set__',
 '__pydantic_generic_metadata__',
 '__pydantic_init_subclass__',
 '__pydantic_on_complete__',
 '__pydantic_parent_namespace__',
 '__pydantic_post_init__',
 '__pydantic_private__',
 '__pydantic_root_model__',
 '__p

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
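To make the relationship-building step concrete, here is a minimal sketch of linking nodes by cosine similarity. It is not Ragas' implementation: the toy 3-dimensional vectors, the node names, the `build_relationships` helper, and the `0.8` threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.8):
    """Link every pair of nodes whose similarity meets the (assumed) threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges


# Toy "summary embeddings" (made up; real embeddings have many more dimensions).
toy_embeddings = {
    "sleep_chapter": [0.9, 0.1, 0.0],
    "insomnia_chapter": [0.8, 0.2, 0.1],
    "nutrition_chapter": [0.0, 0.1, 0.9],
}

edges = build_relationships(toy_embeddings)
print(edges)
```

Only the two sleep-related nodes end up connected in this toy example, which is the kind of edge a multi-hop synthesizer could later walk to build a cross-node question.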

In [11]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so the imported `default_transforms` function is not shadowed on re-run
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

KnowledgeGraph(nodes: 9, relationships: 16)

We can save and load our knowledge graphs as follows.

In [12]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 9, relationships: 16)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [13]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [14]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
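As a rough intuition for what the weights in a query distribution mean, the sketch below splits a test set size across weighted query types using largest-remainder rounding. This is an assumption made for illustration, not a description of how Ragas allocates samples internally; `allocate_counts` and the dictionary keys are hypothetical names.

```python
import math


def allocate_counts(testset_size, weights):
    """Split testset_size across named weights using largest-remainder rounding."""
    total = sum(weights.values())
    exact = {name: testset_size * w / total for name, w in weights.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining slots to the largest fractional parts (ties keep dict order).
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
counts = allocate_counts(10, weights)
print(counts)
```

With ten samples, a 0.25 weight cannot be honored exactly, so one of the two multi-hop types has to receive an extra question. Small test sets will therefore only approximate the requested ratios.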

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
- **SingleHopSpecificQuerySynthesizer** generates questions that can be answered using information in a single node (a chunk of data or a document)
  - these are usually straightforward, fact-based questions that can be answered from a single document/section/node in the knowledge graph
- **MultiHopSpecificQuerySynthesizer** generates questions whose answers require reasoning across multiple nodes (chunks of data or documents)
  - like SingleHopSpecificQuerySynthesizer, these are fact-based questions asking for precise info, but they need multiple documents/chunks/nodes to be answered
- **MultiHopAbstractQuerySynthesizer** generates complex or abstract queries that require a deeper level of understanding across multiple documents/chunks/nodes to answer
  - these questions are usually not simple fact-based questions but require a higher level of conceptual understanding
  - e.g. "when did Albert Einstein come up with the theory of relativity?" (specific) vs. "how have scientific theories evolved since Einstein's original publication?" (abstract)

Finally, we can use our `TestsetGenerator` to generate our testset!

In [15]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,what is mental health,[The Mental Health and Psychology Handbook A P...,"Mental health encompasses our emotional, psych...",single_hop_specifc_query_synthesizer
1,What is CBT and how does it help in managing m...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Cognitive Behavioral Therapy (CBT) is one of t...,single_hop_specifc_query_synthesizer
2,"How does exercise influence mental health, and...",[Write letters to or from your future self Jou...,Exercise affects mental health by releasing en...,single_hop_specifc_query_synthesizer
3,What is the purpose of the APPENDIX in the con...,[social interactions How to set and maintain b...,"The APPENDIX provides mental health resources,...",single_hop_specifc_query_synthesizer
4,"As a mental health advocate, how do minerals c...",[PART 2: NUTRITION AND DIET Chapter 4: Fundame...,The context mentions minerals as inorganic ele...,single_hop_specifc_query_synthesizer
5,How can practcing mindfulness and meditation h...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,Practicing mindfulness and meditation can sign...,multi_hop_abstract_query_synthesizer
6,How do physical activity guidelines outlined i...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The Personal Wellness Guide emphasizes that re...,multi_hop_abstract_query_synthesizer
7,How does supporting digestive health through d...,[<1-hop>\n\n13: The Science of Habit Formation...,Supporting digestive health through diet and l...,multi_hop_abstract_query_synthesizer
8,How exercise helps mental health and also like...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,Exercise helps mental health by releasing endo...,multi_hop_specific_query_synthesizer
9,"How can practicing CBT and CBT-I, as described...",[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,Practicing Cognitive Behavioral Therapy (CBT) ...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [16]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [15]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are the recommended exercises and strateg...,[The Personal Wellness Guide A Comprehensive R...,The provided context does not include specific...,single_hop_specifc_query_synthesizer
1,What does Stage 2 of sleep involve in the slee...,[PART 3: SLEEP AND RECOVERY Chapter 7: The Sci...,Stage 2 involves a drop in body temperature an...,single_hop_specifc_query_synthesizer
2,What information does Chapter 18 cover regardi...,[PART 5: BUILDING HEALTHY HABITS Chapter 13: T...,Chapter 18 discusses strategies to boost immun...,single_hop_specifc_query_synthesizer
3,How does the World Health Organization define ...,[The Mental Health and Psychology Handbook A P...,"According to the World Health Organization, me...",single_hop_specifc_query_synthesizer
4,how can exercise for common problems like lowe...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The wellness guide explains that gentle exerci...,multi_hop_abstract_query_synthesizer
5,How can incorporating mindfulness and social c...,[<1-hop>\n\nhour before bed - No caffeine afte...,Incorporating mindfulness and social connectio...,multi_hop_abstract_query_synthesizer
6,How can improving face-to-face interactions an...,[<1-hop>\n\nhour before bed - No caffeine afte...,Improving face-to-face interactions by engagin...,multi_hop_abstract_query_synthesizer
7,How can I improve my emotional intelligence an...,[<1-hop>\n\nhour before bed - No caffeine afte...,To improve emotional intelligence and manage c...,multi_hop_abstract_query_synthesizer
8,how chapter 7 and 17 connect about sleep and h...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,"chapter 7 talks about sleep and recovery, expl...",multi_hop_specific_query_synthesizer
9,H0w c4n I bUild a he4lthy m0rn1ng r0utine (cha...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,To build a healthy morning routine that improv...,multi_hop_specific_query_synthesizer


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
- if you want more control over the distribution of the types of queries that are being generated for the test set, then you'll want to utilize the unrolled (manual) process
- if you need a quick test set and don't care as much about controlling specifics of how the data is being generated (especially the distribution of questions), then going with the abstracted approach will be faster (and less code to type out)

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [26]:
### YOUR CODE HERE ###

# Define a custom query distribution with different weights
# Generate a new test set and compare with the default


### 1
# note: I'm skewing Multi-hop abstract queries and multi-hop specific queries way more than single-hop for the generated test set data
query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.10),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.50),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.40),
]

In [27]:
### 2
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How does the United States address mental heal...,[The Mental Health and Psychology Handbook A P...,The provided context does not include specific...,single_hop_specifc_query_synthesizer
1,How can incorporating specific exercises like ...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,Incorporating specific exercises such as the C...,multi_hop_abstract_query_synthesizer
2,How can setting boundaries improve mental heal...,[<1-hop>\n\nsocial interactions How to set and...,Setting and maintaining boundaries helps prote...,multi_hop_abstract_query_synthesizer
3,How can establishing good sleep hygiene and bu...,[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...,The context emphasizes that understanding habi...,multi_hop_abstract_query_synthesizer
4,How do journaling practices and self-reflectio...,[<1-hop>\n\nWrite letters to or from your futu...,"Journaling practices, such as writing regularl...",multi_hop_abstract_query_synthesizer
5,How can incorporating specific exercises like ...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,Incorporating specific exercises such as the C...,multi_hop_abstract_query_synthesizer
6,how can CBT and CBT-I help improve mental heal...,[<1-hop>\n\nWrite letters to or from your futu...,CBT (Cognitive Behavioral Therapy) helps impro...,multi_hop_specific_query_synthesizer
7,Chapter 7 and 18 how they help mental health?,[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...,Chapter 7 talks about sleep and how it helps m...,multi_hop_specific_query_synthesizer
8,Based on the information from Chapters 4 and 1...,[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...,Understanding the fundamentals of healthy eati...,multi_hop_specific_query_synthesizer
9,H0w do Ch4 and Ch8 in the context of sleep and...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,"Ch4, which covers fundamentals of healthy eati...",multi_hop_specific_query_synthesizer


In [28]:
# note default_query_distribution was imported earlier in the notebook
query_distribution = default_query_distribution(llm=generator_llm, kg=kg)
testset_default = generator.generate(testset_size=10, query_distribution=query_distribution)
testset_default.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,what mental health do,[The Mental Health and Psychology Handbook A P...,"Mental health includes our emotional, psycholo...",single_hop_specifc_query_synthesizer
1,What is MBSR?,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Mindfulness-Based Stress Reduction (MBSR) is a...,single_hop_specifc_query_synthesizer
2,How does serotonin relate to mental health?,[Write letters to or from your future self Jou...,The context mentions that exercise increases s...,single_hop_specifc_query_synthesizer
3,How does FOMO affect mental health and what ca...,[social interactions How to set and maintain b...,Fear of missing out (FOMO) drives compulsive c...,single_hop_specifc_query_synthesizer
4,how mindfulness and meditation help with sleep...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,the context shows that mindfulness and meditat...,multi_hop_abstract_query_synthesizer
5,how trauma and its psychological effects relat...,[<1-hop>\n\nThe Mental Health and Psychology H...,The context explains that trauma is a signific...,multi_hop_abstract_query_synthesizer
6,How can understanding social interactions and ...,[<1-hop>\n\nsocial interactions How to set and...,Understanding social interactions and boundary...,multi_hop_abstract_query_synthesizer
7,How sleep and exercise both affect mental heal...,[<1-hop>\n\nWrite letters to or from your futu...,The context shows that sleep and exercise both...,multi_hop_abstract_query_synthesizer
8,Based on the information from Chapter 1 of the...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Understanding the basics of exercise from Chap...,multi_hop_specific_query_synthesizer
9,How can CBT and CBT-I be used together to impr...,[<1-hop>\n\nWrite letters to or from your futu...,Cognitive Behavioral Therapy (CBT) helps ident...,multi_hop_specific_query_synthesizer


#### 3
- when the **reference context** and the synthesizer type are the same, RAGAS seems to generate similar questions across both test sets
  - even so, each test set ended up with slightly different questions for the same reference context and synthesizer type
  - the generated questions are probably similar or related because the reference context contains the answer to the question (so if both test sets generate the same type of question from the same reference context, the questions will be similar or at least related)
- the only major difference between the test sets is the distribution of questions (single-hop vs multi-hop abstract vs multi-hop specific)
  - since I skewed the distribution heavily towards abstract multi-hop questions, there were way more of those in the custom test set than in the default test set
  - the default test set came out more balanced in this run (4 single-hop, 4 multi-hop abstract, 2 multi-hop specific), whereas I skewed multi-hop abstract to 50%, multi-hop specific to 40%, and single-hop to only 10%
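The distribution comparison can be checked by counting the `synthesizer_name` column rather than eyeballing it. The sketch below hard-codes the counts visible in the two result tables of this activity (including the `specifc` spelling that appears in the output); `share_by_type` is a hypothetical helper name.

```python
from collections import Counter

# synthesizer_name values as they appear in the custom and default result tables.
custom_run = (
    ["single_hop_specifc_query_synthesizer"] * 1
    + ["multi_hop_abstract_query_synthesizer"] * 5
    + ["multi_hop_specific_query_synthesizer"] * 4
)
default_run = (
    ["single_hop_specifc_query_synthesizer"] * 4
    + ["multi_hop_abstract_query_synthesizer"] * 4
    + ["multi_hop_specific_query_synthesizer"] * 2
)


def share_by_type(names):
    """Return the fraction of questions produced by each synthesizer."""
    counts = Counter(names)
    return {name: counts[name] / len(names) for name in counts}


custom_share = share_by_type(custom_run)
default_share = share_by_type(default_run)

for name in custom_share:
    print(f"{name}: custom={custom_share[name]:.0%} default={default_share[name]:.0%}")
```

On a live test set, the same numbers should be obtainable with `testset.to_pandas()["synthesizer_name"].value_counts(normalize=True)`.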

#### 4
I chose 10% for single-hop, 50% for abstract multi-hop, and 40% for specific multi-hop question generation for the test set. My thinking was that I wanted to really test the multi-hop ability of the LLM/agent. I think LLMs are usually pretty good at answering single-hop questions, but multi-hop questions, especially ones that require interpreting or reconciling different sources of data (which may contain slightly different information), are where LLMs need a lot of fine-tuning and testing; hence I skewed the test set heavily towards these sorts of questions.

We'll need to provide our LangSmith API key, and set tracing to "true".

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [29]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [30]:
# iterrows() yields (index, row) pairs; unpack them instead of indexing with [1]
for _, row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": row["user_input"]
      },
      outputs={
          "answer": row["reference"]
      },
      metadata={
          "context": row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )
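The row-to-example mapping used above can also be isolated into a small pure function, which makes it easy to inspect before anything is uploaded. This is a sketch: `to_langsmith_example` is a hypothetical helper, and the sample row contains made-up values that only mirror the column names of the Ragas dataframe.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row (as a dict) to inputs/outputs/metadata dicts."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": list(row["reference_contexts"])},
    }


# Hypothetical row, shaped like one record of dataset.to_pandas().
sample_row = {
    "user_input": "What does Stage 2 of sleep involve?",
    "reference": "Stage 2 involves a drop in body temperature.",
    "reference_contexts": ["PART 3: SLEEP AND RECOVERY ..."],
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example)
```

Each resulting dict could then be passed along as `client.create_example(**example, dataset_id=langsmith_dataset.id)`, using the same keyword arguments as the loop above.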

## Basic RAG Chain

Time for some RAG!


In [31]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [32]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
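To see what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraphs and newlines); `split_with_overlap` and the synthetic text are assumptions for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


# Synthetic 1,200-character "document" (digits only, so overlaps are easy to verify).
text = "".join(str(i % 10) for i in range(1200))

small_chunks = split_with_overlap(text, chunk_size=500, chunk_overlap=50)
large_chunks = split_with_overlap(text, chunk_size=1000, chunk_overlap=50)

print(len(small_chunks), [len(c) for c in small_chunks])
print(len(large_chunks), [len(c) for c in large_chunks])
```

The last 50 characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.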

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [33]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [34]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [35]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [36]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [37]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [38]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
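The data flow of the chain above can be sketched in plain Python with stand-ins for the retriever and the model. Nothing here calls LangChain: `fake_retriever`, `fake_llm`, `PROMPT_TEMPLATE`, and `run_chain` are hypothetical names, and the stubbed context strings are invented.

```python
def fake_retriever(question):
    """Stand-in for `retriever`: returns a fixed list of context strings."""
    return [
        "Cat-Cow Stretch: alternate between arching and sagging the back.",
        "Bird Dog: extend the opposite arm and leg from hands and knees.",
    ]


def fake_llm(prompt):
    """Stand-in for the chat model: reports the size of the prompt it received."""
    return f"(model reply to a {len(prompt)}-character prompt)"


PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def run_chain(inputs):
    # Step 1: the dict stage fans the single input out into two keys.
    stage = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: the prompt template is filled from those keys.
    prompt = PROMPT_TEMPLATE.format(**stage)
    # Step 3: the model replies, and the final stage reduces the reply to a plain string.
    return str(fake_llm(prompt))


result = run_chain({"question": "What are some recommended exercises for lower back pain?"})
print(result)
```

The point of the sketch is the shape of the pipeline: the question is used twice (once to retrieve, once in the prompt), and each `|` in the LCEL version corresponds to one of the steps here.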

In [39]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off the floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

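We'll define `eval_llm` with OpenAI's GPT-4.1 as an evaluation LLM. Note that the evaluators defined below name their own judge model through the `model` argument (`openai:gpt-4o`), so as written they do not use `eval_llm`.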

In [40]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [41]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # "provider:model" string naming the judge model
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o" ,
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o" ,
)

> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`: assesses whether the actual output from the LLM is correct or not given the reference answer.
> - `labeled_helpfulness_evaluator`: assesses whether the actual output from the LLM is helpful to the user while also taking into account the correct reference answer
> - `dopeness_evaluator`: assesses whether the output is "dope" or not (dopeness being defined as lit or cool)

## LangSmith Evaluation

In [42]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How can Cognitive Behavioral Therapy (CBT) and...,"Based on the provided context, Cognitive Behav...",,Integrating Cognitive Behavioral Therapy (CBT)...,True,True,True,5.029598,c7011e9b-f78d-4e9d-b004-c5e524145ae7,019c5a74-8489-7683-827c-59bd08f861a6
1,Whi chapters cover how to buid healthy habbits...,"Based on the provided context, Chapter 13 in t...",,Chapter 13 discusses the science of habit form...,True,False,False,1.860211,1ba8e793-f26f-46e9-8ee5-94f942bcea00,019c5a74-b850-73a0-8f58-3aba802ad6d6
2,Based on the information from Chapter 6 and Ch...,I don't know.,,"Understanding sleep cycles, including the stag...",False,False,False,0.929294,64a19268-df25-4169-9476-57f26146e528,019c5a74-ed8b-7392-b411-7eaeeaa6329a
3,What chapters cover the sciene of habit format...,The Science of Habit Formation is covered in C...,,Chapter 13 discusses the science of habit form...,True,True,True,15.317279,975ac54e-57d5-4584-8b09-ad40c7f348f3,019c5a75-0619-7b63-ac13-385d06e14b72
4,"How does engaging in physical activity, as par...",Engaging in physical activity as part of a com...,,Engaging in physical activity is a key compone...,True,True,True,4.496198,766776c3-44c8-4f8e-8719-61e3232e11d8,019c5a75-6648-7102-8048-0fbb2a12ad0e
5,How does sleep and its bidirectional relations...,Sleep and its bidirectional relationship with ...,,The context explains that sleep and mental hea...,True,True,True,2.780071,f44304ad-e6de-44b3-ad46-d99be5492ca1,019c5a75-9456-7e81-90ea-424818c2d599
6,How do benefits of outdoor and group exercize ...,"Based on the provided context, outdoor exercis...",,The benefits of outdoor and group exercize enh...,False,True,False,5.965331,01e8cd5b-e810-4a34-ae9b-35a06ea2a21b,019c5a75-cf56-7a90-8a31-8bae1e556041
7,Sleep and mental health impact sleep on emotio...,Based on the provided context:\n\nSleep signif...,,Sleep and mental health have a bidirectional r...,True,True,False,1.878822,e8943be5-1ff7-4660-b49b-9043078f0f8f,019c5a76-0d62-7a41-a262-2ed3f120cfdc
8,Who are the Licensed Professional Counselors i...,Licensed Professional Counselors are mental he...,,Licensed Professional Counselors are mental he...,True,True,False,0.988996,3d1ed888-992d-42ff-80f5-7e6ab39d787f,019c5a76-328c-78b1-8602-28d681b170e8
9,"As a mental health counselor, how does the ter...","As a mental health counselor, the term ""Exerci...",,Physical activity is one of the most effective...,True,True,True,3.705856,13951c11-e8d8-4a8c-a561-fc9ddbd8c646,019c5a76-4e58-7cc1-903d-6d6009ccde20
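To get a baseline to compare later runs against, the boolean feedback columns can be reduced to pass rates. The sketch below hard-codes the `feedback.qa`, `feedback.helpfulness`, and `feedback.dopeness` values from the ten rows above; `pass_rate` is a hypothetical helper.

```python
# Feedback values copied row by row (0 through 9) from the results table above.
feedback = {
    "qa": [True, True, False, True, True, True, False, True, True, True],
    "helpfulness": [True, False, False, True, True, True, True, True, True, True],
    "dopeness": [True, False, False, True, True, True, False, False, False, True],
}


def pass_rate(flags):
    """Fraction of examples the evaluator marked as passing."""
    return sum(flags) / len(flags)


rates = {key: pass_rate(flags) for key, flags in feedback.items()}

for key, rate in rates.items():
    print(f"{key}: {rate:.0%}")
```

With only ten examples, one flipped judgment moves a rate by ten percentage points, so these numbers are best read as directional.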


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used for retrieval to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [43]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [44]:
rag_documents = docs

In [45]:
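from langchain_text_splitters import RecursiveCharacterTextSplitter

# Larger chunks: up to 1,000 characters each, with 50 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
)

# Re-split the source documents with the new settings
rag_documents = text_splitter.split_documents(rag_documents)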

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
- chunk size impacts your RAG system, especially its retrieval quality and answer generation
- chunk size is a trade-off between precision and recall
- if your chunk size is small, all the chunks will be more focused (higher precision) and more granular
  - retrieval can pinpoint exact information w/smaller chunks
  - but you may lose context since the chunks are so small (lose the bigger picture in each chunk)
  - smaller chunks will work well for single-hop questions b/c there is less noise in the chunk (answer typically lives in a focused passage)
- larger chunks have more context per chunk
  - good for preserving relationships between ideas (good recall)
  - better at preserving causal relationships
  - fewer chunks need to be retrieved to cover the same material
  - but there's lower precision and possibly more noise in each chunk
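
To make the trade-off concrete, here is a minimal sketch in plain Python. It is an assumption-laden stand-in, not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators such as paragraph breaks). The names `chunk_text` and `sample_text` are hypothetical and exist only for this illustration. It shows how the same text yields many small, focused chunks or a few large, context-rich ones.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical stand-in for a document: 3,000 characters of filler text.
sample_text = "word " * 600

small_chunks = chunk_text(sample_text, chunk_size=250, chunk_overlap=25)
large_chunks = chunk_text(sample_text, chunk_size=1000, chunk_overlap=50)

print(f"small chunks: {len(small_chunks)}, large chunks: {len(large_chunks)}")
```

With a fixed number of retrieved chunks, the small setting passes less text to the model per query, while the large setting passes more text that may include unrelated material.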

In [45]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
- if you have better embeddings (better compression of text and its associated meaning into vectors/numeric representation), then the retrieval will give better context (capture the semantic meaning better)
- some embedding models may capture domain-specific text better (healthcare, finance, etc.)
- overall better context/captured semantic meaning from embeddings should mean better answers from the RAG system and improved performance overall
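
As a rough illustration of why embedding quality matters, the sketch below ranks chunks by cosine similarity to a query vector, which is one common way vector stores score matches. The three-dimensional vectors and chunk labels are made up for this example; real embedding models produce much longer vectors, and `rank_chunks` is a hypothetical helper, not part of any library used in this notebook.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]]):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up toy embeddings: each axis loosely stands for one topic.
query = [0.9, 0.1, 0.0]  # a question about sleep
chunks = {
    "sleep hygiene tips": [0.8, 0.2, 0.1],
    "exercise and mood": [0.1, 0.9, 0.2],
    "habit formation": [0.0, 0.2, 0.9],
}

ranking = rank_chunks(query, chunks)
for label, score in ranking:
    print(f"{score:.3f}  {label}")
```

If an embedding model places a relevant chunk far from the query in vector space, that chunk scores low and never reaches the prompt, no matter how good the LLM is.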

In [46]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [47]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [48]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)
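
For readers new to the pipe syntax, the data flow of the chain above can be sketched in plain Python. Everything here is a simplified stand-in: `fake_retriever`, `fake_llm`, `build_prompt`, and `run_chain` are hypothetical names, the retriever returns canned strings rather than document objects, and no model is called.

```python
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for `retriever`: returns canned chunks."""
    return ["Keep a consistent sleep schedule.", "Avoid caffeine after 2 PM."]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for `llm`: reports the prompt size instead of calling a model."""
    return f"(model answer based on a {len(prompt)}-character prompt)"


def build_prompt(question: str) -> str:
    # Step 1: the input question feeds both the retriever and the prompt.
    context = fake_retriever(question)
    # Step 2: fill the {context} and {question} slots of the template.
    return PROMPT_TEMPLATE.format(context="\n".join(context), question=question)


def run_chain(inputs: dict) -> str:
    # Step 3: send the filled prompt to the model and return plain text.
    return fake_llm(build_prompt(inputs["question"]))


answer = run_chain({"question": "How can I improve my sleep quality?"})
print(answer)
```

The real chain performs the same three steps, with the retriever, prompt template, model, and output parser each taking one stage of the pipe.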

Let's test it on the same input that we used before.

In [49]:
dopeness_rag_chain.invoke({"question" : "How can I improve my sleep quality?"})

'Alright, here’s how you can level up your sleep game and hit those sweet, restorative zzz’s like a pro:\n\n1. **Own a Consistent Sleep Schedule:** Slide into bed and rise at the same time every day—yes, even on weekends. Your body loves consistency, it craves routine.\n\n2. **Craft a Chill Bedtime Ritual:** Think gentle stretching, flipping through a book, or soaking in a warm bath. This is your pre-sleep wind-down to tell your brain it’s time to unplug.\n\n3. **Optimize Your Sleep Den:** Keep your bedroom cool (ideally between 65-68°F/18-20°C), pitch black with blackout curtains or a sleep mask, and whisper-quiet—white noise machines or earplugs can be your secret weapon.\n\n4. **Ditch Screens Early:** Shut down screens 1-2 hours before lights out to escape that blue light saboteur messing with your melatonin mojo.\n\n5. **Ban the PM Caffeine:** No caffeine after 2 PM. Those sneaky stimulants will keep your mind buzzing when you want it peacefully blank.\n\n6. **Time Your

Finally, we can evaluate the new chain on the same test set!

In [50]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How can Cognitive Behavioral Therapy (CBT) and...,"Oh heck yeah, let’s dive into this cosmic fusi...",,Integrating Cognitive Behavioral Therapy (CBT)...,True,True,True,7.068275,c7011e9b-f78d-4e9d-b004-c5e524145ae7,019c5ab3-6527-7963-aab6-1b2750ea743b
1,Whi chapters cover how to buid healthy habbits...,"Oh, we're diving into the slick world of menta...",,Chapter 13 discusses the science of habit form...,False,False,True,2.335757,1ba8e793-f26f-46e9-8ee5-94f942bcea00,019c5ab3-9d99-7691-9750-95b06ee74ef1
2,Based on the information from Chapter 6 and Ch...,"Yo, here’s the cosmic scoop straight from the ...",,"Understanding sleep cycles, including the stag...",False,False,True,4.260564,64a19268-df25-4169-9476-57f26146e528,019c5ab3-cb27-7513-a5e3-49ddbe05758f
3,What chapters cover the sciene of habit format...,"Yo, get ready to vibe with the ultimate self-c...",,Chapter 13 discusses the science of habit form...,False,True,True,4.003019,975ac54e-57d5-4584-8b09-ad40c7f348f3,019c5ab3-ff8f-77d3-896f-96d12b6c0e7a
4,"How does engaging in physical activity, as par...","Alright, let’s dive deep into the rad synergy ...",,Engaging in physical activity is a key compone...,True,True,True,5.058504,766776c3-44c8-4f8e-8719-61e3232e11d8,019c5ab4-2b8a-7830-a67e-a17f6a717c1a
5,How does sleep and its bidirectional relations...,"Alright, buckle up for the ultimate mental wel...",,The context explains that sleep and mental hea...,True,True,True,3.306402,f44304ad-e6de-44b3-ad46-d99be5492ca1,019c5ab4-6112-7e73-a3b4-0aee0be584b6
6,How do benefits of outdoor and group exercize ...,"Yo, here’s the slick breakdown based on the me...",,The benefits of outdoor and group exercize enh...,True,True,True,3.001956,01e8cd5b-e810-4a34-ae9b-35a06ea2a21b,019c5ab4-8f04-7892-b61a-aafa8790e4ff
7,Sleep and mental health impact sleep on emotio...,"Oh, absolutely—sleep isn’t just a pit stop, it...",,Sleep and mental health have a bidirectional r...,True,True,True,2.865802,e8943be5-1ff7-4660-b49b-9043078f0f8f,019c5ab4-b927-7c41-a03d-341506ad7d42
8,Who are the Licensed Professional Counselors i...,"Yo, Licensed Professional Counselors (LPCs) ar...",,Licensed Professional Counselors are mental he...,True,True,True,1.342514,3d1ed888-992d-42ff-80f5-7e6ab39d787f,019c5ab4-e7e7-7ed0-8020-435e994bb3fd
9,"As a mental health counselor, how does the ter...","Alright, brace yourself for some next-level me...",,Physical activity is one of the most effective...,True,True,True,6.703793,13951c11-e8d8-4a8c-a561-fc9ddbd8c646,019c5ab5-014e-7ff1-8b20-5ec150910336
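
The per-metric averages for this run can be recomputed directly from the three feedback columns in the table above. The snippet below is a small sketch with the boolean values transcribed by hand from rows 0 to 9; the `feedback` dictionary and `average_score` helper are names chosen for this illustration, not LangSmith objects.

```python
# Feedback values transcribed from rows 0-9 of the table above.
feedback = {
    "qa":          [True, False, False, False, True, True, True, True, True, True],
    "helpfulness": [True, False, False, True, True, True, True, True, True, True],
    "dopeness":    [True, True, True, True, True, True, True, True, True, True],
}


def average_score(values: list[bool]) -> float:
    """Fraction of examples the evaluator marked True."""
    return sum(values) / len(values)


averages = {metric: average_score(values) for metric, values in feedback.items()}
for metric, score in averages.items():
    print(f"{metric}: {score:.1f}")
```

The same calculation on the first experiment's feedback columns gives the numbers to compare against.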


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

![Chart 1](langsmith_chart1.png)

![Chart 2](langsmith_chart2.png)

##### Answer:
- `excellent-father-70` is the 2nd experiment, where we updated the RAG prompt, used larger chunks, and switched to a bigger embedding model to "dope-ify" our RAG application, and it clearly worked
- `excellent-father-70` has a much higher dopeness rating (the average is 1), whereas it is 0.4 for the first experiment (`fresh-dress-94`)
- the `helpfulness` and `qa` average scores stayed roughly the same between both experiments, which shows that we didn't actually lose "correctness" or "helpfulness" by increasing our average "dopeness" rating in the 2nd experiment with the change in the RAG prompt and larger embedding model (which is pretty cool!)

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides a high-quality signal without requiring manually created test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores