# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install several dependencies and provide a few API keys, since this pipeline leans on a number of great technologies:

1. OpenAI's endpoints to handle the synthetic data generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vector store
4. LangSmith as our evaluation coordinator

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [15]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Error loading punkt: <urlopen error [Errno 8] nodename nor
[nltk_data]     servname provided, or not known>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [Errno 8] nodename nor servname provided, or not
[nltk_data]     known>


False

In [16]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [17]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [18]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps Ragas build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader` together with `TextLoader`.

In [19]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/MentalHealthGuide.txt', 'data/HealthWellnessGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [20]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [21]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [22]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
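
To make the relationship-building step concrete, here is a minimal sketch in plain Python. It assumes each node already has an embedding; the node names, the toy 3-dimensional vectors, and the 0.9 threshold are all made up for illustration. Ragas' own relationship builders are more involved, so treat this only as the underlying idea: link two nodes when their embeddings point in nearly the same direction.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Return (node_a, node_b, score) for every pair at or above the threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges


# Hypothetical summary embeddings for three nodes (illustrative values only).
toy_embeddings = {
    "sleep_hygiene": [0.9, 0.1, 0.0],
    "insomnia_cbt": [0.8, 0.2, 0.1],
    "meal_planning": [0.0, 0.1, 0.9],
}

edges = build_relationships(toy_embeddings)
print(edges)
```

Only the two sleep-related nodes are linked here; the nutrition node is too far from both to clear the threshold.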

In [23]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so the imported `default_transforms` function is not shadowed on re-run.
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

unable to apply transformation: Connection error.
unable to apply transformation: Connection error.


Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

unable to apply transformation: Connection error.
unable to apply transformation: Connection error.


Applying CustomNodeFilter: 0it [00:00, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/2 [00:00<?, ?it/s]

unable to apply transformation: node.property('summary') must be a string, found '<class 'NoneType'>'
unable to apply transformation: node.property('summary') must be a string, found '<class 'NoneType'>'


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

unable to apply transformation: Node f9c142f1-3444-4be6-8bab-50a645f8846e has no summary_embedding


KnowledgeGraph(nodes: 2, relationships: 0)

We can save and load our knowledge graphs as follows.

In [24]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 2, relationships: 0)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [25]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)



However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [27]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
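
To see what the weights mean in practice, here is a small sketch of how a distribution could be turned into per-synthesizer sample counts. The `allocate_samples` helper and its largest-remainder rounding are assumptions for illustration, not Ragas' internal allocation logic; the synthesizer names are plain string labels standing in for the objects above.

```python
def allocate_samples(weights, testset_size):
    """Split testset_size across synthesizers using largest-remainder rounding."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    exact = {name: w * testset_size for name, w in weights.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining samples to the largest fractional parts first.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Same weights as the query_distribution above, keyed by illustrative labels.
weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
counts = allocate_samples(weights, testset_size=10)
print(counts)
```

With an odd split such as 2.5 samples, some rounding rule has to decide who gets the extra question, which is one reason the generated set may not match the weights exactly.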

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
`SingleHopSpecificQuerySynthesizer` generates questions that can each be answered from a single piece of the knowledge graph (one node or chunk). This is akin to asking a RAG system about a fact that lives in one place. In the real world, however, people often ask questions whose answers aren't contained in a single source.

`MultiHopAbstractQuerySynthesizer` addresses this by generating questions and answers that draw on several pieces of content, "hopping" between nodes that share common themes. Its questions tend to be open-ended.

`MultiHopSpecificQuerySynthesizer` also hops across multiple nodes, but its questions target specific information that must be extracted from those sources, whereas `MultiHopAbstractQuerySynthesizer` produces broader questions answered by synthesizing the information in the sources.

Finally, we can use our `TestsetGenerator` to generate our testset!

In [33]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is the role of the Psychology Handbook in...,[The Mental Health and Psychology Handbook A P...,The Mental Health and Psychology Handbook is a...,single_hop_specifc_query_synthesizer
1,What is Chapter 11 about in the context of bui...,[PART 3: BUILDING RESILIENCE Chapter 7: What I...,Chapter 11 discusses the role of exercise in m...,single_hop_specifc_query_synthesizer
2,"As a Holistic Wellness Coach, could you explai...",[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,"Mindfulness-Based Stress Reduction (MBSR), dev...",single_hop_specifc_query_synthesizer
3,What is sleep hygiene education?,[- Sleep restriction: Limiting time in bed to ...,Sleep hygiene education involves optimizing th...,single_hop_specifc_query_synthesizer
4,What information is covered in Chapter 11 rega...,[PART 2: NUTRITION AND DIET Chapter 4: Fundame...,Chapter 11 discusses effective stress manageme...,single_hop_specifc_query_synthesizer
5,How can supporting digestive health through di...,[<1-hop>\n\n13: The Science of Habit Formation...,Supporting digestive health through diet and l...,multi_hop_abstract_query_synthesizer
6,How can practicing self-compassion and reframi...,[<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte...,Practicing self-compassion and reframing setba...,multi_hop_abstract_query_synthesizer
7,How can incorporating lower back pain relief e...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The wellness guide recommends gentle stretchin...,multi_hop_abstract_query_synthesizer
8,How can understanding the habit loop from Chap...,[<1-hop>\n\n13: The Science of Habit Formation...,"Understanding the habit loop, which involves c...",multi_hop_specific_query_synthesizer
9,How does cognitive therapy relate to cognitive...,[<1-hop>\n\n- Sleep restriction: Limiting time...,"Cognitive therapy, as mentioned in the context...",multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [32]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [34]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"As a Holistic Wellness Coach, how does the con...",[The Mental Health and Psychology Handbook A P...,"Mental health encompasses our emotional, psych...",single_hop_specifc_query_synthesizer
1,Is Stanford related to mental health?,[PART 3: BUILDING RESILIENCE Chapter 7: What I...,The provided context does not mention any conn...,single_hop_specifc_query_synthesizer
2,What CBT mean for mental health?,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Cognitive Behavioral Therapy (CBT) is one of t...,single_hop_specifc_query_synthesizer
3,How does mental health relate to social connec...,[- Sleep restriction: Limiting time in bed to ...,The context emphasizes that strong social conn...,single_hop_specifc_query_synthesizer
4,What are the different types of exercise inclu...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,A balanced exercise routine for overall well-b...,multi_hop_abstract_query_synthesizer
5,how practicing social connection and growth mi...,[<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte...,Practicing social connection helps build resil...,multi_hop_abstract_query_synthesizer
6,How can developing emotional intelligence thro...,[<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte...,Developing emotional intelligence involves rec...,multi_hop_abstract_query_synthesizer
7,How does the gut-brain axis and gut health inf...,[<1-hop>\n\nThe Mental Health and Psychology H...,The gut-brain axis plays a significant role in...,multi_hop_abstract_query_synthesizer
8,"How do cognitive therapy techniques, such as a...",[<1-hop>\n\n- Sleep restriction: Limiting time...,"Cognitive therapy techniques, like addressing ...",multi_hop_specific_query_synthesizer
9,How can understanding the habit loop from Chap...,[<1-hop>\n\n13: The Science of Habit Formation...,"Understanding the habit loop, which includes c...",multi_hop_specific_query_synthesizer


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
The unrolled method allows us to specify the ways the documents are transformed before they are organized as nodes in the knowledge graph. This is useful because it allows more control over shaping the relationships between documents that are more suited to our use cases. It also allows us to specify the types and distribution of query synthesizers that make up our query distribution for our test case. Overall, I would use the unrolled approach if I want maximum control over the types of synthetic data being generated and how my documents are being used to generate the data. 

On the other hand, the automatic approach offers a simpler and faster way to generate data. The example above demonstrates that I only need to pass in my documents and the testset size as parameters to generate my test dataset. I would use this approach if I didn't need precise control over my testset generation and could work with the default document transformations and synthesizer distribution.

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [None]:
### YOUR CODE HERE ###

# Define a custom query distribution with different weights
query_distribution = [ 
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.4),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.1),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
]
# Generate a new test set and compare with the default
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

# Comparison and explanation
# I chose the weights to simulate a use case where the user needs to get specific, technical information 
# contained in the documents. This use case is common in Commercial Real Estate, where brokers 
# often draft and need to review the terms of leases. 

# The testset generated from this new query distribution mainly consists of questions and answers that reference
# facts directly found in the documents. The default distribution generated questions and answers
# that require more reasoning to produce an answer.


Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is the role of the Psychology Handbook in...,[The Mental Health and Psychology Handbook A P...,The Mental Health and Psychology Handbook is a...,single_hop_specifc_query_synthesizer
1,What is Chapter 7 about in the context of psyc...,[PART 3: BUILDING RESILIENCE Chapter 7: What I...,Chapter 7 discusses what psychological resilie...,single_hop_specifc_query_synthesizer
2,What is the University of Massachusetts Medica...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,The University of Massachusetts Medical Center...,single_hop_specifc_query_synthesizer
3,How can stimulus control help improve sleep ha...,[- Sleep restriction: Limiting time in bed to ...,Stimulus control involves using the bed only f...,single_hop_specifc_query_synthesizer
4,How does incorporating regular exercise and mo...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The Personal Wellness Guide emphasizes that re...,multi_hop_abstract_query_synthesizer
5,How can understanding the habit loop from Chap...,[<1-hop>\n\n13: The Science of Habit Formation...,"Understanding the habit loop, which includes c...",multi_hop_specific_query_synthesizer
6,"How does cognitive therapy, as discussed in th...",[<1-hop>\n\n- Sleep restriction: Limiting time...,"Cognitive therapy, which addresses beliefs abo...",multi_hop_specific_query_synthesizer
7,"How do B vitamins, found in foods like leafy g...",[<1-hop>\n\n- Sleep restriction: Limiting time...,"B vitamins, which are found in foods such as l...",multi_hop_specific_query_synthesizer
8,How do Chapters 8 and 12 together support deve...,[<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte...,Chapter 8 discusses developing emotional intel...,multi_hop_specific_query_synthesizer
9,How do Chapters 11 and 21 collectively support...,[<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte...,Chapter 11 emphasizes the vital role of exerci...,multi_hop_specific_query_synthesizer


We already provided our LangSmith API key and set tracing to "true" in Task 1, so we're ready to move on.

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [75]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [76]:
# Each row becomes one LangSmith example: question in, reference answer out.
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
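
The mapping from a Ragas row to the LangSmith format can also be written as a small pure function, which makes it easy to check without touching the API. This is a sketch: `to_langsmith_example` is a hypothetical helper, and the sample row uses placeholder strings for the reference and context rather than real generated data.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row onto LangSmith's inputs/outputs/metadata layout."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row, shaped like one record of `dataset.to_pandas()`.
sample_row = {
    "user_input": "What is sleep hygiene education?",
    "reference_contexts": ["<reference context text>"],
    "reference": "<reference answer text>",
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example["inputs"], example["outputs"])
```

With pandas, `dataset.to_pandas().to_dict(orient="records")` yields rows in this shape.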

## Basic RAG Chain

Time for some RAG!


In [77]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [78]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [79]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [80]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [81]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [82]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [83]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [84]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
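
If the pipe syntax is unfamiliar, the data flow of this chain can be sketched step by step in plain Python. Everything here is a stand-in: `fake_retriever` and `fake_llm` are hypothetical stubs that make no network calls, and the template is a shortened version of the prompt above. The point is only the order of operations.

```python
def fake_retriever(question):
    """Stand-in for `retriever`: returns hard-coded context chunks."""
    return ["Chunk about sleep.", "Chunk about exercise."]


def fake_llm(prompt):
    """Stand-in for `llm`: reports the prompt length instead of calling a model."""
    return {"content": f"(model reply to a {len(prompt)}-character prompt)"}


PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def run_chain(payload):
    # Step 1: the leading dict fans the input out into "context" and "question".
    context = fake_retriever(payload["question"])
    question = payload["question"]
    # Step 2: the prompt template fills in both keys.
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    # Step 3: the model produces a message.
    message = fake_llm(prompt)
    # Step 4: the output parser extracts the text content.
    return message["content"]


answer = run_chain({"question": "How can I sleep better?"})
print(answer)
```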

In [85]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

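We'll define OpenAI's GPT-4.1 as `eval_llm`, our evaluation LLM. Note that the evaluators defined below are configured with `openai:gpt-4o` through their `model` argument rather than with `eval_llm`.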

In [86]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [87]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # model identifier string in "provider:model" form
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o" ,
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o" ,
)

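> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`: whether the prediction is correct, judged against the reference answer
> - `labeled_helpfulness_evaluator`: whether the prediction is helpful to the user, taking the reference answer into account
> - `dopeness_evaluator`: whether the prediction is "dope, lit, cool" rather than generic (no reference answer is used)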
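
Not every evaluator needs an LLM judge. As a sketch of a purely programmatic one, the function below scores word overlap with the reference answer. The `(inputs, outputs, reference_outputs)` signature and the `{"key", "score"}` return shape reflect my understanding of LangSmith's custom evaluator convention and should be checked against the current docs; `reference_overlap_evaluator` itself is a hypothetical name. The `"output"` and `"answer"` keys mirror the `outputs.output` and `reference.answer` columns in this notebook's results.

```python
def reference_overlap_evaluator(inputs, outputs, reference_outputs):
    """Score the share of reference-answer words that also appear in the prediction."""
    predicted = set(outputs["output"].lower().split())
    reference = set(reference_outputs["answer"].lower().split())
    if not reference:
        return {"key": "reference_overlap", "score": 0.0}
    score = len(predicted & reference) / len(reference)
    return {"key": "reference_overlap", "score": score}


# Hand-written example values, used only to exercise the function.
result = reference_overlap_evaluator(
    inputs={"question": "What does CBT stand for?"},
    outputs={"output": "CBT stands for cognitive behavioral therapy"},
    reference_outputs={"answer": "cognitive behavioral therapy"},
)
print(result)
```

A cheap metric like this is crude, but it is deterministic, which makes it a useful sanity check alongside LLM-judged scores.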

## LangSmith Evaluation

In [88]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'passionate-shame-35' at:
https://smith.langchain.com/o/357fd74b-2136-4b17-80a6-2b5585a08d4a/datasets/e09d15dc-c292-48f4-86c4-4ba9cf6c9127/compare?selectedSessions=29d06368-19c4-440f-a71b-460d5badf803




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How do PART 4 and PART 2 together inform a hol...,Based on the provided context:\n\nPART 4 (Dail...,,PART 3 of the context explains that building r...,True,True,True,6.659394,af147a73-d587-4f9f-9924-a716d543ee6f,019c73a8-a4c8-76a2-a92d-57e1592664a6
1,"How do nutrition and diet, as highlighted in t...","According to the context, nutrition and diet i...",,The context emphasizes that nutrition plays a ...,True,True,True,2.916059,f11116bf-b783-444a-8c1e-21c42436cf95,019c73a8-eb32-7b63-bb6d-ba6580b5a03a
2,How can understanding the habit loop from Chap...,"Understanding the habit loop from Chapter 13, ...",,"Understanding the habit loop, which includes c...",True,True,True,2.725661,8d895930-78ed-4fe3-a83e-ee631c902bb3,019c73a9-236c-78a3-a1b9-6beca567afc8
3,"How do cognitive therapy techniques, such as a...",Cognitive therapy techniques like addressing b...,,"Cognitive therapy techniques, like addressing ...",True,True,False,1.920108,98a3a680-de25-42f5-9ab9-18e04e31c87b,019c73a9-5dd2-7bd2-9bd9-7cb29f3a7cb7
4,How does the gut-brain axis and gut health inf...,The gut-brain axis influences mental health th...,,The gut-brain axis plays a significant role in...,True,True,False,1.974439,58c77876-bda2-4192-b269-44c765a5c720,019c73a9-8ab6-77e3-9274-0dd79ca3554c
5,How can developing emotional intelligence thro...,Developing emotional intelligence through stra...,,Developing emotional intelligence involves rec...,True,True,True,4.520219,649c8d68-8a6f-454f-b5e0-3882f9a43b2c,019c73a9-b904-77c0-9616-242aebb57081
6,how practicing social connection and growth mi...,Based on the provided context:\n\n**Practicing...,,Practicing social connection helps build resil...,True,True,False,4.918136,9015fa08-246d-4fae-a546-f3eaaccd2f61,019c73a9-ef5f-7442-8fc9-5aaf277b79d9
7,What are the different types of exercise inclu...,The different types of exercise included in a ...,,A balanced exercise routine for overall well-b...,True,True,False,1.120696,6b5047c3-ef24-4d3f-94e7-6336ec14a2a5,019c73aa-37d0-70f3-9635-5735c4130c9d
8,How does mental health relate to social connec...,"Based on the provided context, mental health i...",,The context emphasizes that strong social conn...,True,True,True,7.395767,75f531ec-5a24-488e-8aaa-112bffcf60cc,019c73aa-5cdf-7c60-99b3-2a4b614bcc13
9,What CBT mean for mental health?,"Based on the provided context, CBT (Cognitive ...",,Cognitive Behavioral Therapy (CBT) is one of t...,True,True,False,1.532439,7f3946f6-234e-4ba4-8298-ceb595372985,019c73aa-9e31-7813-aed5-aaca90c3fabd
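
To compare runs, it helps to reduce the per-example feedback to one pass rate per evaluator. The sketch below does that for the values in the results table above, copied in by hand; `pass_rates` is a hypothetical helper, and in practice the same numbers could be computed from the experiment's results instead of a hard-coded dict.

```python
# Per-example feedback copied from the results table above (True = pass).
feedback = {
    "qa": [True] * 10,
    "helpfulness": [True] * 10,
    "dopeness": [True, True, True, False, False, True, False, False, True, False],
}


def pass_rates(feedback_by_metric):
    """Fraction of examples that passed each evaluator."""
    return {
        metric: sum(values) / len(values)
        for metric, values in feedback_by_metric.items()
    }


baseline = pass_rates(feedback)
print(baseline)
```

The baseline chain passes every correctness and helpfulness check but only half of the dopeness checks, so dopeness is the metric with room to move.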


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [89]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [90]:
rag_documents = docs

In [91]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
Our application is a RAG application: under the hood, it retrieves context by comparing the embedded user query against the embedded chunks of the documents, typically by cosine similarity. The bigger the chunks, the fewer vectors there are to compare against, and each vector has to represent more text. The closest chunk may therefore contain more information than is relevant to the user query, which risks reducing the accuracy of the answer. Larger chunks can also help, since related information is more likely to stay together in one chunk, so the net effect has to be measured.
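
The effect of chunk size on the number of vectors is easy to see with a simplified splitter. This sketch uses fixed character windows and a filler document, both assumptions for illustration; `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraphs and sentences, so its real counts will differ.

```python
def fixed_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Hypothetical 10,000-character document (filler text, for counting only).
document = "x" * 10_000

small = fixed_window_chunks(document, chunk_size=500, chunk_overlap=50)
large = fixed_window_chunks(document, chunk_size=1000, chunk_overlap=50)
print(len(small), len(large))  # 23 chunks vs. 11 chunks
```

Doubling the chunk size roughly halves the number of vectors in the store, and each retrieved chunk carries about twice as much text into the prompt.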

In [92]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
Modifying our embedding model changes how the application determines which information from the documents is most relevant to the user query. Every embedding model converts natural language text into vectors, but models differ in how they do this. As a result, the distances between vectors differ, and the embedded query can land in a different position relative to the embedded document chunks. This changes which chunk is 'closest', and therefore which information is passed to the LLM as context.
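The effect described above can be sketched with toy numbers. The two dictionaries below are hand-made 2-dimensional vectors standing in for two different embedding models (they are not real `text-embedding-3-*` outputs), and `cosine_similarity` and `top_chunk` are hypothetical helpers written for this illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_chunk(query_vec: list[float], chunk_vecs: dict[str, list[float]]) -> str:
    """Return the name of the chunk whose vector is most similar to the query."""
    return max(chunk_vecs, key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]))


# Hypothetical embeddings of the SAME query and the SAME two chunks under two models.
model_a = {
    "query": [1.0, 0.0],
    "chunks": {"sleep_tips": [0.9, 0.1], "cbt_overview": [0.2, 0.8]},
}
model_b = {
    "query": [0.3, 0.7],
    "chunks": {"sleep_tips": [0.8, 0.2], "cbt_overview": [0.1, 0.9]},
}

best_a = top_chunk(model_a["query"], model_a["chunks"])
best_b = top_chunk(model_b["query"], model_b["chunks"])
print("model A retrieves:", best_a)
print("model B retrieves:", best_b)
```

With these made-up vectors the two "models" retrieve different chunks for the same query, which is exactly how an embedding swap can change the context the LLM sees.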

In [93]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [94]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [95]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)
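For readers new to the pipe syntax, the data flow of the chain above can be sketched in plain Python. This is an assumption-laden illustration, not LangChain's implementation: `fake_retriever` and `fake_llm` are hypothetical stand-ins for `retriever` and `llm`, and `PROMPT_TEMPLATE` is a shortened placeholder for the real prompt.

```python
from operator import itemgetter

# Shortened placeholder template (hypothetical, not the notebook's full prompt).
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for `retriever`: returns canned chunks."""
    return ["Keep a consistent sleep schedule.", "Avoid screens before bed."]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for `llm`: reports the prompt size instead of calling a model."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def run_chain(inputs: dict) -> str:
    # Step 1: the dict at the head of the chain builds both prompt variables
    # from the same input; the question is also sent through the retriever.
    question = itemgetter("question")(inputs)
    prompt_inputs = {"context": fake_retriever(question), "question": question}
    # Step 2: the prompt template fills in its placeholders.
    prompt = PROMPT_TEMPLATE.format(**prompt_inputs)
    # Step 3: the model replies; returning a plain string mirrors StrOutputParser.
    return fake_llm(prompt)


reply = run_chain({"question": "How can I improve my sleep quality?"})
print(reply)
```

The real chain does the same three things in order: fan the input out into `context` and `question`, render the prompt, then call the model and parse its reply to a string.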

Let's test it on the same question that we used before.

In [96]:
dopeness_rag_chain.invoke({"question": "How can I improve my sleep quality?"})

"Alright, buckle up for the ultimate sleep upgrade—here’s how you can turbocharge your snooze sessions like a legit sleep ninja:\n\n1. **Lock in a consistent sleep schedule**: Wake up and hit the sack at the same times every day, even weekends. Your body thrives on rhythm, like a well-produced track.\n\n2. **Craft a chill bedtime routine**: Think reading your favorite book, gentle stretches, or a warm bath—anything that vibes you into Zen mode.\n\n3. **Optimize your sleep lair**: Keep your bedroom cool (65-68°F / 18-20°C), pitch-dark with blackout curtains or a sleek sleep mask, and drown out noise with white noise machines or earplugs. Bonus points for a mattress and pillows that feel like clouds.\n\n4. **Ditch the screens before bed**: Cut screen time 1-2 hours ahead—blue light messes with your brain’s sleep signals like a party crasher.\n\n5. **Cut caffeine after 2 PM**: No more afternoon coffee bombs. Let your body wind down, not rev up.\n\n6. **Exercise—but not too l

Finally, we can evaluate the new chain on the same test set!

In [None]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'pertinent-leather-69' at:
https://smith.langchain.com/o/357fd74b-2136-4b17-80a6-2b5585a08d4a/datasets/e09d15dc-c292-48f4-86c4-4ba9cf6c9127/compare?selectedSessions=11bd53f4-9c21-4841-854c-d18ada41481d





---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:

##### SCREENSHOT:
I had trouble attaching the screenshot in this cell. Please find it in the root directory, named `langsmith.png`.

The dopeness metric was DEFINITELY impacted between the two runs. In my latest run, I specifically prompted my application to 'ensure high levels of dopeness' in its answers, and it did not disappoint! This demonstrates the utility of prompt engineering in refining the output of a RAG application, which in turn enhances the user experience according to the metrics that we, as developers, care about, including dopeness.
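One way to back up this kind of analysis with numbers is to compare per-metric pass rates between the two experiments. The sketch below is only an illustration: the metric names (`correctness`, `helpfulness`, `dopeness`) and every `True`/`False` value are made up for the example and are not the results of the runs above. `pass_rates` and `rate_deltas` are hypothetical helpers.

```python
def pass_rates(rows: list[dict]) -> dict[str, float]:
    """Fraction of examples that passed each metric."""
    metrics = rows[0].keys()
    return {m: sum(bool(r[m]) for r in rows) / len(rows) for m in metrics}


def rate_deltas(baseline_rows: list[dict], candidate_rows: list[dict]) -> dict[str, float]:
    """Candidate pass rate minus baseline pass rate, per metric."""
    base = pass_rates(baseline_rows)
    cand = pass_rates(candidate_rows)
    return {m: cand[m] - base[m] for m in base}


# MADE-UP feedback rows (hypothetical), one dict per evaluated example.
baseline = [
    {"correctness": True, "helpfulness": True, "dopeness": False},
    {"correctness": True, "helpfulness": False, "dopeness": False},
]
candidate = [
    {"correctness": True, "helpfulness": True, "dopeness": True},
    {"correctness": True, "helpfulness": False, "dopeness": True},
]

deltas = rate_deltas(baseline, candidate)
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
```

In practice the rows would come from the exported experiment results rather than hand-written dictionaries, and a large delta on one metric alongside flat values on the others points to the change that caused it.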

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores