# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In this notebook we'll explore a use case for Ragas' synthetic test set generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging several great technologies for this pipeline!

1. OpenAI's endpoints to handle the synthetic data generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vector store
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!


### NLTK Import

To prevent OS-dependent errors, we'll import NLTK and download the packages it needs to handle our data correctly.

In [18]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\akatalinic1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\akatalinic1\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [19]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [20]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [21]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps Ragas build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using a `DirectoryLoader` that reads each file with the `TextLoader`.

In [22]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data\\HealthWellnessGuide.txt', 'data\\MentalHealthGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [23]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [24]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [25]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

The default transformations depend on the corpus length. In our case they include:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.

In [26]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

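# Use a separate name so the imported `default_transforms` function is not overwritten
# (otherwise re-running this cell fails because the name no longer refers to the function).
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)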
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 9, relationships: 18)
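The relationship-building idea described above (compare embeddings, link nodes that are similar enough) can be sketched in plain Python. This is only an illustration, not Ragas' implementation: the three-dimensional toy vectors, the node names, and the `0.9` threshold are all assumptions made up for the example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_edges(embeddings, threshold=0.9):
    """Link every pair of nodes whose similarity meets the threshold."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, round(score, 3)))
    return edges


# Hypothetical toy "embeddings"; real embedding vectors have far more dimensions.
toy_embeddings = {
    "sleep_hygiene": [0.9, 0.1, 0.0],
    "insomnia_cbt": [0.8, 0.2, 0.1],
    "macronutrients": [0.0, 0.1, 0.9],
}

edges = build_edges(toy_embeddings, threshold=0.9)
print(edges)
```

Only the two sleep-related nodes point in nearly the same direction, so they are the only pair that gets connected. Lowering the threshold would produce a denser graph and therefore more candidate paths for multi-hop questions.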

We can save and load our knowledge graphs as follows.

In [27]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 9, relationships: 18)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [28]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [29]:
from ragas.testset.synthesizers import (
    default_query_distribution,
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
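To get a feel for what the weights in a query distribution mean, here is a minimal sketch of turning weights into per-synthesizer sample counts. The largest-remainder rounding used here is an assumption for illustration; Ragas may allocate samples differently, and `allocate_samples` is a hypothetical helper, not part of the library.

```python
import math


def allocate_samples(weights, testset_size):
    """Split testset_size across synthesizers in proportion to their weights."""
    total = sum(weights.values())
    exact = {name: testset_size * w / total for name, w in weights.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining samples to the entries with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
counts = allocate_samples(weights, testset_size=10)
print(counts)
```

The point of the sketch is that the weights are proportions, so a small `testset_size` forces some rounding and the generated mix will only approximate the requested ratios.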

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
There are three types of synthesizers:
• Single Hop Specific Query Synthesizer - It builds a query that can be answered from a single document/node, with no hops between nodes, and asks for one specific fact.

ex. Who's the CEO of OpenAI?

• MultiHop Abstract Query Synthesizer - It hops across multiple documents/nodes and asks for their combined meaning rather than one specific fact.

ex. Compare the tradeoffs between GPT-4o mini and GPT-4o across cost/latency/quality

• MultiHop Specific Query Synthesizer - Like the single-hop specific synthesizer, it looks for one exact answer with precise details, but it needs multiple hops to connect the documents/nodes that contain it.

ex. Which CEO was in charge when OpenAI had its highest profit margin, and what year was it?

Finally, we can use our `TestsetGenerator` to generate our testset!

In [30]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What topics are covered in Chapter 4 of the pr...,[PART 2: NUTRITION AND DIET Chapter 4: Fundame...,Chapter 4: Fundamentals of Healthy Eating disc...,single_hop_specifc_query_synthesizer
1,What is Digtal Wellness and how can it be impr...,[13: The Science of Habit Formation Habits are...,"Digital wellness, as described in the context,...",single_hop_specifc_query_synthesizer
2,What is Chin?,[The Personal Wellness Guide A Comprehensive R...,The provided context does not include informat...,single_hop_specifc_query_synthesizer
3,Is the Psyhology Handbook helpful?,[The Mental Health and Psychology Handbook A P...,The Mental Health and Psychology Handbook is a...,single_hop_specifc_query_synthesizer
4,What is Cognitive Behavioral Therapy and how d...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Cognitive Behavioral Therapy (CBT) is one of t...,single_hop_specifc_query_synthesizer
5,How can understanding habit formation and crea...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,"Understanding habit formation, including the h...",multi_hop_abstract_query_synthesizer
6,How can journaling and self-care practices hel...,[<1-hop>\n\nWrite letters to or from your futu...,"Journaling practices, such as writing regularl...",multi_hop_abstract_query_synthesizer
7,how exercise and movement help lower back pain...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The context explains that gentle stretching an...,multi_hop_abstract_query_synthesizer
8,How does mental health influence physical well...,[<1-hop>\n\nThe Mental Health and Psychology H...,Mental health significantly impacts physical w...,multi_hop_specific_query_synthesizer
9,how to set boundaries and manage digital menta...,[<1-hop>\n\nsocial interactions How to set and...,from chapter 4 social interactions how to set ...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [32]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [33]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are micronutrients in diet?,[PART 2: NUTRITION AND DIET Chapter 4: Fundame...,Micronutrients are inorganic elements like cal...,single_hop_specifc_query_synthesizer
1,How does supporting digestive health relate to...,[13: The Science of Habit Formation Habits are...,Supporting digestive health is important becau...,single_hop_specifc_query_synthesizer
2,What does the term pelvic refer to?,[The Personal Wellness Guide A Comprehensive R...,The context provided does not specify the exac...,single_hop_specifc_query_synthesizer
3,"As a Healthy Lifestyle Coach, how does mental ...",[The Mental Health and Psychology Handbook A P...,"Mental health encompasses our emotional, psych...",single_hop_specifc_query_synthesizer
4,How does developing psychological resilience t...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,Developing psychological resilience through mi...,multi_hop_abstract_query_synthesizer
5,How build workout routine with progressive ove...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The guide explains that building a workout rou...,multi_hop_abstract_query_synthesizer
6,how mental health and CBT help with psychologi...,[<1-hop>\n\nThe Mental Health and Psychology H...,The context explains that mental health includ...,multi_hop_abstract_query_synthesizer
7,How can understanding the science of habit for...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,"Understanding the science of habit formation, ...",multi_hop_abstract_query_synthesizer
8,How do vitamins like B vitamins and other nutr...,[<1-hop>\n\nWrite letters to or from your futu...,"The context explains that B vitamins, found in...",multi_hop_specific_query_synthesizer
9,How does vitamn D help with mental health and ...,[<1-hop>\n\nWrite letters to or from your futu...,Vitamin D is obtained from sunlight and fortif...,multi_hop_specific_query_synthesizer


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
In the manual (unrolled) approach we build the pipeline ourselves. It's more time-consuming, but we get more control over prompts and edge cases, we can debug each stage, and the result is tailored to our needs. The abstracted approach is a plug-and-play solution that works well for a PoC, but because so much is abstracted away we have less control, it's harder to debug, and the output is more generic.

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [None]:
### YOUR CODE HERE ###

# Define a custom query distribution with different weights
new_query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.6),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.1),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.3),
]
# Generate a new test set and compare with the default
new_testset = generator.generate(testset_size=12, query_distribution=new_query_distribution)
new_testset.to_pandas()

# With the default distribution the questions were balanced (4/4/4); my distribution (8/2/4)
# favours single-hop over the other two synthesizers. I chose it because most real user
# questions are direct and need no multi-hop or abstract synthesis. This data is
# straightforward (health/mental health guides), so answers should come from one specific
# doc/node. I set 0.3 for multi-hop specific to still test cases where evidence has to be
# stitched together from multiple docs, and lowered the abstract query synthesizer because
# those kinds of questions are less common. In the end, we should keep experimenting with
# the weights to get the best possible results.

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/14 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are Proteins and why are they important f...,[PART 2: NUTRITION AND DIET Chapter 4: Fundame...,Proteins are essential for muscle repair and i...,single_hop_specifc_query_synthesizer
1,What minerals help body health,[PART 2: NUTRITION AND DIET Chapter 4: Fundame...,"Minerals are inorganic elements like calcium, ...",single_hop_specifc_query_synthesizer
2,What is work-life balance according to the con...,[13: The Science of Habit Formation Habits are...,Work-life balance involves maintaining a balan...,single_hop_specifc_query_synthesizer
3,What is digital wellness?,[13: The Science of Habit Formation Habits are...,Digital wellness refers to practices that help...,single_hop_specifc_query_synthesizer
4,What is Chin?,[The Personal Wellness Guide A Comprehensive R...,The provided context does not include informat...,single_hop_specifc_query_synthesizer
5,What are some exercises recommended for lower ...,[The Personal Wellness Guide A Comprehensive R...,Recommended exercises for lower back pain incl...,single_hop_specifc_query_synthesizer
6,What does the World Health Organization define...,[The Mental Health and Psychology Handbook A P...,"According to the World Health Organization, me...",single_hop_specifc_query_synthesizer
7,How does mental health influence physical health?,[The Mental Health and Psychology Handbook A P...,Mental health affects physical health through ...,single_hop_specifc_query_synthesizer
8,How do mindfulness-based therapies like MBSR a...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,Mindfulness-based therapies such as Mindfulnes...,multi_hop_abstract_query_synthesizer
9,H0w can p3rsonal wellness and self-care pract1...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The context emphasizes that regular physical a...,multi_hop_abstract_query_synthesizer


LangSmith relies on the API key and the tracing flag (`LANGCHAIN_TRACING_V2` set to "true") that we provided in Task 1.

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [34]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the Ragas-created DataFrame and add each example to the dataset we just created!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [35]:
# Each row becomes one LangSmith example: question in, reference answer out,
# with the reference contexts kept as metadata.
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
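The mapping from a Ragas row to a LangSmith example can be checked offline before uploading anything. The sketch below assumes only the column names shown in the DataFrame above; `to_langsmith_example` is a hypothetical helper and the row values are placeholders, not real generated data.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row to inputs/outputs/metadata dictionaries."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Placeholder row with the same columns as the Ragas DataFrame.
sample_row = {
    "user_input": "placeholder question",
    "reference_contexts": ["placeholder context chunk"],
    "reference": "placeholder reference answer",
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example)
```

Keeping the keys `question` and `answer` consistent matters later on: the RAG chain reads `question` from the inputs, and the evaluators compare the chain's output against the stored `answer`.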

## Basic RAG Chain

Time for some RAG!


In [36]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [37]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
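To see what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper that only illustrates the two parameters.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 1200-character text so the window boundaries are easy to inspect.
sample = "".join(str(i % 10) for i in range(1200))

small = chunk_text(sample, chunk_size=500, chunk_overlap=50)
large = chunk_text(sample, chunk_size=1000, chunk_overlap=50)
print(len(small), len(large))
```

Doubling the chunk size roughly halves the number of chunks the retriever has to choose from, while the overlap repeats the tail of one chunk at the head of the next so that sentences cut at a boundary are not lost entirely.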

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [38]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [39]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [40]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [41]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As usual, we'll be using `gpt-4.1-mini` for our RAG!

In [42]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [43]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
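The data flow through the chain (question, then retrieval, then prompt, then model, then string output) can be sketched with plain functions. Everything here is a stand-in: `retrieve` ranks by shared words instead of embeddings, `fake_llm` does not call any model, and the three documents are made-up placeholders.

```python
PROMPT_TEMPLATE = """\
Given a provided context and question, you must answer the question based only on context.

Context: {context}
Question: {question}
"""


def retrieve(question, documents, k=2):
    """Stand-in retriever: rank documents by words shared with the question."""
    question_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(context, question):
    """Fill the template, mirroring what the prompt step of the chain does."""
    return PROMPT_TEMPLATE.format(context="\n".join(context), question=question)


def fake_llm(prompt):
    """Stand-in model: returns a message-like dict instead of calling an API."""
    return {"content": f"(model reply to a {len(prompt)}-character prompt)"}


def parse_output(message):
    """Mirror of the output parser step: keep only the text content."""
    return message["content"]


def run_chain(inputs, documents):
    question = inputs["question"]
    context = retrieve(question, documents)
    return parse_output(fake_llm(build_prompt(context, question)))


documents = [
    "Gentle stretching can ease lower back pain.",
    "Protein supports muscle repair.",
    "A regular sleep schedule improves rest.",
]
answer = run_chain({"question": "What helps lower back pain?"}, documents)
print(answer)
```

The dictionary at the head of the LCEL chain plays the role of the first two lines of `run_chain`: it takes the `question` key twice, once to feed the retriever and once to pass the question through unchanged to the prompt.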

In [44]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

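We'll define OpenAI's GPT-4.1 as an evaluation LLM. Note that the evaluators defined below are configured through their own `model="openai:gpt-4o"` argument rather than this object.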

In [45]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of LLM-as-judge evaluators built with `openevals`: two that replace LangChain's string evaluators, and one custom criterion!

In [46]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # model identifier string in "provider:model" form
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o",
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o",
)
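Not every evaluator needs an LLM. As a rough sketch, a plain function that receives `inputs`, `outputs`, and `reference_outputs` and returns a key/score dictionary can flag abstentions deterministically. Two assumptions to check against your LangSmith version: that plain functions with this argument naming are accepted as evaluators, and that a chain returning a bare string has it wrapped under the `output` key (as the `outputs.output` column in the results suggests). `abstention_evaluator` is a hypothetical name.

```python
def abstention_evaluator(inputs, outputs, reference_outputs):
    """Score 1 when the chain declined to answer, 0 otherwise."""
    # Assumption: string results are wrapped as {"output": "..."}.
    answer = outputs.get("output", "") if isinstance(outputs, dict) else str(outputs)
    normalized = answer.strip().lower().rstrip(".")
    return {"key": "abstained", "score": int(normalized == "i don't know")}


# Offline check with hand-written placeholder values.
declined = abstention_evaluator(
    {"question": "placeholder question"},
    {"output": "I don't know."},
    {"answer": "placeholder reference"},
)
answered = abstention_evaluator(
    {"question": "placeholder question"},
    {"output": "Cat-Cow Stretch and Bird Dog."},
    {"answer": "placeholder reference"},
)
print(declined, answered)
```

A cheap check like this pairs well with the LLM judges: it separates "the chain refused" from "the chain answered incorrectly", which the binary `qa` score alone cannot distinguish.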

> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`:
> - `labeled_helpfulness_evaluator`:
> - `dopeness_evaluator`:

## LangSmith Evaluation

In [47]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'only-potato-57' at:
https://smith.langchain.com/o/4ac4b17e-92ac-4421-9ffa-fa0dc18c9d28/datasets/5d31b527-7f8d-47e5-ab83-4dedb3e897a9/compare?selectedSessions=009c2245-dc24-4579-a510-f6c7c22f1e56




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How can CBT and CBT-I help improve mental heal...,CBT (Cognitive Behavioral Therapy) helps impro...,,CBT helps improve mental health by identifying...,True,True,True,3.709402,ac65cd41-a2fb-4795-ad84-4990325ff53e,019c672c-4987-7742-a178-7a7bbd4ef096
1,How does CBT help with mental health and how c...,Based on the provided context:\n\nCBT (Cogniti...,,"CBT, or Cognitive Behavioral Therapy, helps wi...",True,True,True,3.98175,017ec7c7-3d54-4569-9922-fe7a323af1c0,019c672c-8a95-7a81-acbd-3c6f09d199b9
2,How does vitamn D help with mental health and ...,Vitamin D is associated with mental health bec...,,Vitamin D is obtained from sunlight and fortif...,True,False,False,1.215774,1885c006-46a1-4650-8f7f-7b8285edc2c0,019c672c-ffe7-7eb0-9500-643cbc4ed76b
3,How do vitamins like B vitamins and other nutr...,Vitamins like B vitamins are essential for men...,,"The context explains that B vitamins, found in...",True,True,True,1.746193,7b19efbd-bbaa-4e22-ad2a-50f06edfd946,019c672d-393e-7983-aede-34a5dfbce0f6
4,How can understanding the science of habit for...,Understanding the science of habit formation h...,,"Understanding the science of habit formation, ...",True,True,True,3.404036,d46d9ee8-527e-4838-a99f-75c50f6f62ce,019c672d-6b0b-7882-b4c1-3b05c420f2af
5,how mental health and CBT help with psychologi...,"Based on the provided context, mental health a...",,The context explains that mental health includ...,True,True,True,5.434603,6963713a-6c6f-498a-a5f7-b6f148df45ad,019c672d-d198-7b51-8a13-5a5979fb0cff
6,How build workout routine with progressive ove...,To build a workout routine with progressive ov...,,The guide explains that building a workout rou...,True,True,False,1.788928,4e3684bd-6820-4f27-810d-bc42a5b717a3,019c672e-2329-7032-864a-56a7d9f4a396
7,How does developing psychological resilience t...,Developing psychological resilience through pr...,,Developing psychological resilience through mi...,True,True,True,3.854209,b5462f08-b5c0-40db-bcd9-8a18c5f35d14,019c672e-66e2-7d82-9ce3-b24ce1d7b1e9
8,"As a Healthy Lifestyle Coach, how does mental ...",Mental health influences overall well-being by...,,"Mental health encompasses our emotional, psych...",True,True,False,2.499762,e22cb779-4b36-43fe-9edc-a7088cd9a168,019c672e-bc7e-7d31-a98b-f32dac0c232e
9,What does the term pelvic refer to?,I don't know.,,The context provided does not specify the exac...,False,False,False,0.924296,f267112c-721f-4875-ba48-045bb81d27c0,019c672f-0b15-7642-b46b-f728866c25e6
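Reading ten rows of True/False by eye gets tedious, so it helps to collapse each feedback column into a pass rate. The sketch below copies the three feedback columns from the table above into a plain list; `pass_rates` is a hypothetical helper, and in practice you would compute the same thing from the experiment's DataFrame.

```python
# Feedback columns (qa, helpfulness, dopeness) copied from the ten rows above.
feedback_rows = [
    (True, True, True),
    (True, True, True),
    (True, False, False),
    (True, True, True),
    (True, True, True),
    (True, True, True),
    (True, True, False),
    (True, True, True),
    (True, True, False),
    (False, False, False),
]
metric_names = ("qa", "helpfulness", "dopeness")


def pass_rates(rows, names):
    """Fraction of rows marked True for each metric."""
    return {
        name: sum(row[i] for row in rows) / len(rows)
        for i, name in enumerate(names)
    }


rates = pass_rates(feedback_rows, metric_names)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

For this run the rates come out to 9/10 for `qa`, 8/10 for `helpfulness`, and 6/10 for `dopeness`. These serve as the baseline for later pipeline changes; as noted earlier, the direction of change matters more than the absolute numbers.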


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used for retrieval to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [48]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [49]:
rag_documents = docs

In [50]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
Larger chunks give more context around each retrieved passage. However, they can add unnecessary noise if the data isn't structured consistently (different headings, paragraph sizes, etc.). Embedding bigger chunks averages content across multiple topics, so the embedding captures the semantic meaning less precisely. Cost is also higher with bigger chunks.

Smaller chunks take less time to process, but you can miss data that's needed even with overlap. With smaller chunks, semantic retrieval is more precise, but answers can be scattered across several chunks; the embeddings are also "cleaner" compared to bigger chunks.

A perfect chunk size doesn't exist, but you can find a good one by tuning it to your data and use case.

In [51]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
A larger model gives better semantic understanding, so our top-k retrieval should be more accurate. On the other hand, it's slower and more costly than a smaller model.

Smaller models are cheaper and faster, so they're good for prototyping, but the drawback is weaker semantic understanding, so retrieval quality drops, especially for multi-hop or multiple-doc questions.

Also, if we change the embedding model we should always re-embed the data, because vectors from different models aren't comparable and similarity search will be wrong.

In [52]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [53]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [54]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)

Let's test it on a sample question, as we did with the first chain.

In [55]:
dopeness_rag_chain.invoke({"question" : "How can I improve my sleep quality?"})

'Alright, listen up - leveling up your sleep game isn’t just about hitting the sack; it’s a full-on lifestyle flex. Here’s the ultimate blueprint, straight from the sleep sages:\n\n**1. Lock in that schedule:** Be a sleep ninja - go to bed and wake up at the same times every single day, even weekends. Consistency = sleep magic.  \n**2. Craft a chill bedtime ritual:** Think reading, gentle stretches, or a warm bath. Signal your brain it’s time to power down like a pro.  \n**3. Make your bedroom a sleep fortress:** Keep it cool (65-68°F / 18-20°C), pitch-black with blackout curtains or a sleep mask, and dead quiet (white noise machines or earplugs if needed).  \n**4. Ditch screens early:** Shut off your digital devices at least 1-2 hours before bed. Blue light is the enemy of your snooze.  \n**5. Caffeine cutoff:** No caffeine after 2 PM unless you wanna be tossing and turning like a rap song on repeat.  \n**6. Move your body, smartly:** Exercise regularly but avoid workout

Finally, we can evaluate the new chain on the same test set!

In [56]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'memorable-work-43' at:
https://smith.langchain.com/o/4ac4b17e-92ac-4421-9ffa-fa0dc18c9d28/datasets/5d31b527-7f8d-47e5-ab83-4dedb3e897a9/compare?selectedSessions=4acd50d3-20e8-4f37-a698-f5472a665534

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How can CBT and CBT-I help improve mental heal...,"Alright, buckle up--here’s the ultra-dopeness o...",,CBT helps improve mental health by identifying...,True,True,True,7.021313,ac65cd41-a2fb-4795-ad84-4990325ff53e,019c6730-0885-7972-897b-620b718519f7
1,How does CBT help with mental health and how c...,"Alright, let’s crank this up to eleven and bre...",,"CBT, or Cognitive Behavioral Therapy, helps wi...",True,True,True,6.544804,017ec7c7-3d54-4569-9922-fe7a323af1c0,019c6730-615d-72f2-8e42-7f0479c5a8b1
2,How does vitamn D help with mental health and ...,"Alright, let’s crank up the cool factor and di...",,Vitamin D is obtained from sunlight and fortif...,True,True,True,3.830987,1885c006-46a1-4650-8f7f-7b8285edc2c0,019c6730-c121-7582-b744-1ab8cda522ca
3,How do vitamins like B vitamins and other nutr...,"Yo, let's drop some knowledge bombs on how tho...",,"The context explains that B vitamins, found in...",True,True,True,5.457349,7b19efbd-bbaa-4e22-ad2a-50f06edfd946,019c6731-14ab-7361-87c3-04a2cfa63e7d
4,How can understanding the science of habit for...,"Alright, buckle up, because we’re about to roc...",,"Understanding the science of habit formation, ...",True,True,True,6.174451,d46d9ee8-527e-4838-a99f-75c50f6f62ce,019c6731-5ce1-7570-af7f-fb1e354c27da
5,how mental health and CBT help with psychologi...,"Alright, let’s crank this up to eleven and div...",,The context explains that mental health includ...,True,True,True,8.66437,6963713a-6c6f-498a-a5f7-b6f148df45ad,019c6731-bc17-7f23-84e6-50113dc10693
6,How build workout routine with progressive ove...,"Oh heck yeah, let’s dive into building that wo...",,The guide explains that building a workout rou...,True,True,True,36.096108,4e3684bd-6820-4f27-810d-bc42a5b717a3,019c6732-2abc-75d0-8295-0ce11c246fc2
7,How does developing psychological resilience t...,"Alright, buckle up -- here’s the rad rundown st...",,Developing psychological resilience through mi...,True,True,True,5.950091,b5462f08-b5c0-40db-bcd9-8a18c5f35d14,019c6733-08b8-7e91-a47b-8b2f5ff27628
8,"As a Healthy Lifestyle Coach, how does mental ...","Yo, as a Healthy Lifestyle Coach, here’s the s...",,"Mental health encompasses our emotional, psych...",True,True,True,5.27905,e22cb779-4b36-43fe-9edc-a7088cd9a168,019c6733-53c5-7702-9f95-8b6be0793666
9,What does the term pelvic refer to?,I don’t know. The provided context drops some ...,,The context provided does not specify the exac...,False,False,True,1.680589,f267112c-721f-4875-ba48-045bb81d27c0,019c6733-a618-7052-b3dc-70b3a7085fda
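A quick way to read a results table like this is to reduce each feedback column to a pass rate and to summarize latency. The sketch below hard-codes the feedback flags and `execution_time` values from the ten rows shown above; the variable and helper names are illustrative assumptions, and in the notebook you would more likely pull these columns from the experiment results directly.

```python
from statistics import mean, median

# (qa, helpfulness, dopeness, execution_time) copied from the ten rows shown above.
rows = [
    (True, True, True, 7.021313),
    (True, True, True, 6.544804),
    (True, True, True, 3.830987),
    (True, True, True, 5.457349),
    (True, True, True, 6.174451),
    (True, True, True, 8.66437),
    (True, True, True, 36.096108),
    (True, True, True, 5.950091),
    (True, True, True, 5.27905),
    (False, False, True, 1.680589),
]


def pass_rate(flags):
    """Fraction of examples where the evaluator returned True."""
    return sum(flags) / len(flags)


qa_rate = pass_rate([row[0] for row in rows])
helpfulness_rate = pass_rate([row[1] for row in rows])
dopeness_rate = pass_rate([row[2] for row in rows])

times = [row[3] for row in rows]
mean_time = mean(times)
median_time = median(times)

print(f"qa={qa_rate:.2f} helpfulness={helpfulness_rate:.2f} dopeness={dopeness_rate:.2f}")
print(f"mean={mean_time:.2f}s median={median_time:.2f}s")
```

The gap between mean and median latency comes from the single 36-second run (row 6), so the median is likely the fairer number to compare across experiments.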


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:
![LangSmith chain diff screenshot](data/image.png)

Making our prompt more dopeness-oriented (rad, dope, non-generic) raised the dopeness metric through basic prompt engineering. A more detailed, specific prompt could boost style-based metrics even further. Because we constrained the model to answer only from the provided context, with no external information, the QA score stayed the same.

Latency increased because dope answers are longer by default (more output tokens). We also used larger chunks and a larger embedding model, which increased retrieval and processing time and marginally increased cost for this use case.

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores