# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic data generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, getting a high-quality early signal on our application's performance would be much harder.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's Endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alexandercoenegrachts/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/alexandercoenegrachts/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
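When a notebook is re-run often, being prompted for every key each time gets tedious. A minimal sketch of one way around that is shown below; `set_env_if_missing` is a hypothetical helper (not part of any library used here) that only prompts when the variable is not already set.

```python
import os
from getpass import getpass
from typing import Callable


def set_env_if_missing(name: str, prompt: Callable[[str], str] = getpass) -> str:
    """Prompt for `name` only when it is not already set in the environment."""
    value = os.environ.get(name)
    if not value:
        # `prompt` defaults to getpass so the key is not echoed to the screen.
        value = prompt(f"{name}: ")
        os.environ[name] = value
    return value
```

Calling `set_env_if_missing("OPENAI_API_KEY")` would then behave like the cell above on the first run and be a no-op on later runs in the same session.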

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefully be familiar at this point since it's our Use-Case Data!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

### NOTE: Every loaded document becomes a node here; slice `docs` to use a subset if you want to keep costs/time down.
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 64, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

The default transformations depend on the corpus length. In our case, they include:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

Ragas then uses cosine similarity and overlap heuristics over the outputs of the above transformations (and their embeddings) to construct relationships between the nodes.

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so the imported `default_transforms` function is not shadowed on re-runs.
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/64 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to ap

Applying SummaryExtractor:   0%|          | 0/38 [00:00<?, ?it/s]

Property 'summary' already exists in node '6f2da0'. Skipping!
Property 'summary' already exists in node '20f266'. Skipping!
Property 'summary' already exists in node '6b998b'. Skipping!
Property 'summary' already exists in node '8007c6'. Skipping!
Property 'summary' already exists in node '4e975b'. Skipping!
Property 'summary' already exists in node '1836cc'. Skipping!
Property 'summary' already exists in node '764e0e'. Skipping!
Property 'summary' already exists in node '5bf8fe'. Skipping!
Property 'summary' already exists in node 'f9f6e3'. Skipping!
Property 'summary' already exists in node 'ccc558'. Skipping!
Property 'summary' already exists in node '6581cc'. Skipping!
Property 'summary' already exists in node 'ea8c73'. Skipping!
Property 'summary' already exists in node 'eca906'. Skipping!
Property 'summary' already exists in node 'bae29e'. Skipping!
Property 'summary' already exists in node 'aa6144'. Skipping!
Property 'summary' already exists in node '566854'. Skipping!
Property

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/48 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'f9f6e3'. Skipping!
Property 'summary_embedding' already exists in node '1836cc'. Skipping!
Property 'summary_embedding' already exists in node '5bf8fe'. Skipping!
Property 'summary_embedding' already exists in node '6f2da0'. Skipping!
Property 'summary_embedding' already exists in node '8007c6'. Skipping!
Property 'summary_embedding' already exists in node '20f266'. Skipping!
Property 'summary_embedding' already exists in node '6b998b'. Skipping!
Property 'summary_embedding' already exists in node '4e975b'. Skipping!
Property 'summary_embedding' already exists in node '764e0e'. Skipping!
Property 'summary_embedding' already exists in node 'ccc558'. Skipping!
Property 'summary_embedding' already exists in node 'ea8c73'. Skipping!
Property 'summary_embedding' already exists in node '6581cc'. Skipping!
Property 'summary_embedding' already exists in node 'bae29e'. Skipping!
Property 'summary_embedding' already exists in node 'eca906'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 86, relationships: 712)
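The relationship-building step described above can be sketched in plain Python. This is an illustration only, not Ragas' implementation: the toy embeddings, the `build_relationships` helper, and the `0.9` threshold are all assumptions made for the example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Return (node_a, node_b, score) for every node pair at or above `threshold`."""
    relationships = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            relationships.append((id_a, id_b, score))
    return relationships


# Hypothetical 2-dimensional embeddings; real embeddings have far more dimensions.
toy_embeddings = {
    "node_a": [1.0, 0.0],
    "node_b": [0.9, 0.1],
    "node_c": [0.0, 1.0],
}
edges = build_relationships(toy_embeddings, threshold=0.9)
```

Only `node_a` and `node_b` point in nearly the same direction, so they are the only pair that ends up connected in this toy graph.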

We can save and load our knowledge graphs as follows.

In [10]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 86, relationships: 712)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [11]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthesizers is going to tackle a separate kind of query which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [12]:
from ragas.testset.synthesizers import (
    default_query_distribution,
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

#### Answer

SingleHopSpecificQuerySynthesizer (50% weight):
- Generates questions that can be answered with information from a single document or source
- Creates specific, factual questions that require direct information retrieval
- Example: "What is the main topic discussed in the AI usage report?"

MultiHopAbstractQuerySynthesizer (25% weight):
- Creates questions that require combining information from multiple sources
- Generates abstract, conceptual questions that need reasoning across documents
- Example: "How do the different AI use cases relate to each other in terms of business impact?"


MultiHopSpecificQuerySynthesizer (25% weight):
- Similar to MultiHopAbstract but focuses on specific, factual questions
- Requires connecting specific details from multiple sources to answer
- Example: "What specific AI tools are mentioned across different sections and how do they differ?"

The weight distribution (50%, 25%, 25%) means the system will generate more single-hop questions (easier) than multi-hop questions (harder), creating a balanced test set that evaluates both simple retrieval and complex reasoning capabilities.
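To make the weights concrete, here is a rough sketch of how a target test set size could be split according to such a distribution. This is an assumption-laden illustration, not Ragas' actual allocation logic: `allocate_counts` is a hypothetical helper using a largest-remainder rule, and the generated test set in this notebook ends up with a slightly different split, so treat the weights as approximate proportions rather than exact quotas.

```python
def allocate_counts(testset_size, weights):
    """Split `testset_size` across named weights using a largest-remainder rule."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    exact = {name: testset_size * weight for name, weight in weights.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining slots to the largest fractional parts (ties keep dict order).
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Labels are illustrative; the weights mirror the query distribution defined above.
weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
counts = allocate_counts(10, weights)
```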


Finally, we can use our `TestsetGenerator` to generate our testset!

In [13]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"According to Eloundou et al., 2025, what are t...",[Introduction ChatGPT launched in November 202...,"Eloundou et al., 2025, study consumer usage of...",single_hop_specifc_query_synthesizer
1,"How is ChatGPT used in daily work activities, ...",[Table 1: ChatGPT daily message counts (millio...,The context indicates that ChatGPT is widely u...,single_hop_specifc_query_synthesizer
2,Wha is the role of Management and business in ...,[Variation by Occupation Figure 23 presents va...,Variation by Occupation Figure 23 shows that u...,single_hop_specifc_query_synthesizer
3,How does the paper describe the role of comput...,[Conclusion This paper studies the rapid growt...,The paper states that computer programming acc...,single_hop_specifc_query_synthesizer
4,How does the growht of ChatGPT in low- and mid...,[<1-hop>\n\nConclusion This paper studies the ...,The context indicates that ChatGPT's usage has...,multi_hop_abstract_query_synthesizer
5,how work related message sharing and usage pat...,[<1-hop>\n\nVariation by Occupation Figure 23 ...,the context shows that variation in ChatGPT us...,multi_hop_abstract_query_synthesizer
6,How does the rapid growth and widespread adopt...,[<1-hop>\n\nConclusion This paper studies the ...,The rapid growth and widespread adoption of Ch...,multi_hop_abstract_query_synthesizer
7,how Handa et al 2025 show ChatGPT use diff fro...,[<1-hop>\n\nIntroduction ChatGPT launched in N...,Handa et al. (2025) show that ChatGPT usage is...,multi_hop_specific_query_synthesizer
8,How does the usage of ChatGPT in the US reflec...,[<1-hop>\n\nTable 1: ChatGPT daily message cou...,"In the US, ChatGPT usage demonstrates a signif...",multi_hop_specific_query_synthesizer
9,how many messages like 18 billion messages and...,[<1-hop>\n\nIntroduction ChatGPT launched in N...,"According to the context, by July 2025, 18 bil...",multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [14]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/64 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to ap

Applying SummaryExtractor:   0%|          | 0/39 [00:00<?, ?it/s]

Property 'summary' already exists in node '68756c'. Skipping!
Property 'summary' already exists in node 'f61978'. Skipping!
Property 'summary' already exists in node 'd04da0'. Skipping!
Property 'summary' already exists in node '9e2e3f'. Skipping!
Property 'summary' already exists in node '5b4922'. Skipping!
Property 'summary' already exists in node 'a9de11'. Skipping!
Property 'summary' already exists in node 'aef45b'. Skipping!
Property 'summary' already exists in node '94291f'. Skipping!
Property 'summary' already exists in node '538e40'. Skipping!
Property 'summary' already exists in node 'c65926'. Skipping!
Property 'summary' already exists in node 'fd7550'. Skipping!
Property 'summary' already exists in node 'a82ece'. Skipping!
Property 'summary' already exists in node 'c6b9e9'. Skipping!
Property 'summary' already exists in node 'd07214'. Skipping!
Property 'summary' already exists in node '4e222c'. Skipping!
Property 'summary' already exists in node 'dee40d'. Skipping!
Property

Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/47 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'c65926'. Skipping!
Property 'summary_embedding' already exists in node '9e2e3f'. Skipping!
Property 'summary_embedding' already exists in node '68756c'. Skipping!
Property 'summary_embedding' already exists in node 'a9de11'. Skipping!
Property 'summary_embedding' already exists in node 'f61978'. Skipping!
Property 'summary_embedding' already exists in node 'fd7550'. Skipping!
Property 'summary_embedding' already exists in node 'd04da0'. Skipping!
Property 'summary_embedding' already exists in node '94291f'. Skipping!
Property 'summary_embedding' already exists in node 'a82ece'. Skipping!
Property 'summary_embedding' already exists in node 'aef45b'. Skipping!
Property 'summary_embedding' already exists in node 'c6b9e9'. Skipping!
Property 'summary_embedding' already exists in node '5b4922'. Skipping!
Property 'summary_embedding' already exists in node '538e40'. Skipping!
Property 'summary_embedding' already exists in node 'd07214'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

In [15]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Ling and Imas 2025 what they say about ChatGPT...,[Introduction ChatGPT launched in November 202...,The context discusses Ling and Imas (2025) in ...,single_hop_specifc_query_synthesizer
1,"What does Bick et al., 2024 say about ChatGPT'...",[Introduction ChatGPT launched in November 202...,"Bick et al., 2024 reports that by July 2025, 1...",single_hop_specifc_query_synthesizer
2,How does ChatGPT contribute to productivity in...,[Table 1: ChatGPT daily message counts (millio...,ChatGPT is used for various purposes that enha...,single_hop_specifc_query_synthesizer
3,Can you tell me what Claude is and how it is u...,[Table 1: ChatGPT daily message counts (millio...,The provided context does not include specific...,single_hop_specifc_query_synthesizer
4,What is SOC in the context of ChatGPT usage?,[Variation by Occupation Figure 23 presents va...,Variation by Occupation Figure 23 presents var...,single_hop_specifc_query_synthesizer
5,How does the variation in ChatGPT usage across...,[Variation by Occupation Figure 23 presents va...,"According to the context, variation in ChatGPT...",single_hop_specifc_query_synthesizer
6,How does the percentage distribution of messag...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,The context shows that non-work messages have ...,multi_hop_abstract_query_synthesizer
7,how variation in ChatGPT use by job like in fi...,[<1-hop>\n\nVariation by Occupation Figure 23 ...,The context shows that users in highly paid pr...,multi_hop_abstract_query_synthesizer
8,How does the increasing percentage of non-work...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,"The data shows that in June 2024, non-work mes...",multi_hop_abstract_query_synthesizer
9,"ChatGPT use vary by job, like science, managem...",[<1-hop>\n\nVariation by Occupation Figure 23 ...,Because users in highly paid professional jobs...,multi_hop_abstract_query_synthesizer


We already provided our LangSmith API key and set tracing to "true" in the setup cells above, so we can move straight on to building the dataset.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [16]:
from langsmith import Client

client = Client()

dataset_name = "Use Case Synthetic Data - AIE8"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the Ragas-generated dataframe and add each example to the dataset we just created!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [17]:
# Each Ragas row becomes one LangSmith example: question in, reference answer out.
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
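The field mapping used in the loop above can be isolated into a small pure function, which makes it easy to check without touching LangSmith. This is a sketch: `to_langsmith_example` and the sample row are hypothetical, and only the column names (`user_input`, `reference`, `reference_contexts`) come from the dataframe above.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row onto question/answer-style example fields."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row, shaped like the Ragas dataframe columns used above.
sample_row = {
    "user_input": "(question text)",
    "reference": "(reference answer text)",
    "reference_contexts": ["(context chunk text)"],
}
example_fields = to_langsmith_example(sample_row)
```

Keeping the mapping separate like this means a renamed Ragas column only needs to be fixed in one place.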

## Basic RAG Chain

Time for some RAG!


In [18]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [20]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [21]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG"
)

In [22]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
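Conceptually, the `k` setting asks the retriever for the `k` chunks whose embeddings are most similar to the query embedding. The sketch below illustrates that idea with toy vectors; it is not how Qdrant is implemented, and `top_k` plus the document vectors are assumptions made for the example.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, doc_vecs, k=10):
    """Return the ids of the `k` documents most similar to the query, best first."""
    scored = ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Hypothetical 2-dimensional chunk embeddings.
doc_vecs = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.7, 0.7],
    "chunk_c": [0.0, 1.0],
    "chunk_d": [-1.0, 0.0],
}
best_two = top_k([1.0, 0.0], doc_vecs, k=2)
```

A larger `k` gives the model more context to work with, at the cost of a longer prompt and a higher chance of including irrelevant chunks.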

To get the "A" in RAG, we'll provide a prompt.

In [23]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [24]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [25]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
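To see what the chain does step by step, the same data flow can be written out with plain functions. This is a sketch under stated assumptions: `fake_retriever` and `fake_llm` are hypothetical stand-ins for the Qdrant retriever and the chat model, and the template is a shortened version of the prompt above.

```python
from operator import itemgetter

PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def fake_retriever(question):
    """Stand-in for the vector store retriever: returns canned context strings."""
    return ["(retrieved chunk 1)", "(retrieved chunk 2)"]


def fake_llm(prompt):
    """Stand-in for the chat model: reports the prompt size instead of answering."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def run_chain(payload):
    question = itemgetter("question")(payload)
    # Step 1: build the {"context", "question"} mapping, like the dict at the head of the chain.
    mapped = {"context": fake_retriever(question), "question": question}
    # Step 2: fill in the prompt template.
    prompt = PROMPT_TEMPLATE.format(**mapped)
    # Step 3: call the model and return a plain string (the output parser's role).
    return str(fake_llm(prompt))


answer = run_chain({"question": "(question text)"})
```

The LCEL version expresses the same steps declaratively, with `|` piping the output of each step into the next.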

In [26]:
rag_chain.invoke({"question" : "What are people doing with AI these days?"})

'Based on the provided context, people are using AI, particularly generative AI like ChatGPT, in a variety of ways including performing workplace tasks by augmenting or automating human labor, producing writing, software code, spreadsheets, and other digital products. Users also seek information and advice, but generative AI is distinguished by its ability to produce creative and digital outputs that go beyond traditional web search engines. AI is used both at work and outside work, with intents categorized broadly as Asking (seeking information/advice), Doing (producing output), and Expressing (self-expression). Additionally, AI can serve as co-workers producing output or as co-pilots providing advice to improve human productivity. Overall, people employ AI flexibly for work tasks, creative production, problem-solving, and personal expression.'

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [27]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [28]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dopeness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this response dope, lit, cool, or is it just a generic response?",
        },
        "llm" : eval_llm
    }
)

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`: Evaluates correctness - whether the generated answer is factually correct based on the reference answer
- `labeled_helpfulness_evaluator`: Evaluates helpfulness - whether the response is helpful to the user, taking into account the correct reference answer
- `dopeness_evaluator`: Evaluates dopeness - whether the response is "dope, lit, cool" or just a generic response (measures creativity/engagement)
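The `prepare_data` lambda is what tells the labeled evaluator which fields to compare. A minimal sketch of the same mapping as a named function is shown below; the `SimpleNamespace` objects are hypothetical stand-ins for the run and example objects that LangSmith passes in.

```python
from types import SimpleNamespace


def prepare_data(run, example):
    """Same field mapping as the lambda given to the labeled helpfulness evaluator."""
    return {
        "prediction": run.outputs["output"],      # what our chain produced
        "reference": example.outputs["answer"],   # the synthetic reference answer
        "input": example.inputs["question"],      # the synthetic question
    }


# Hypothetical stand-ins with the same attribute layout used by the lambda above.
run = SimpleNamespace(outputs={"output": "(chain output text)"})
example = SimpleNamespace(
    inputs={"question": "(question text)"},
    outputs={"answer": "(reference answer text)"},
)
fields = prepare_data(run, example)
```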

## LangSmith Evaluation

In [29]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'worthwhile-map-40' at:
https://smith.langchain.com/o/0ac46901-bbd8-4e04-a1a2-f0202cf105d3/datasets/bc981e33-2bea-4618-b42b-9572c8f572f2/compare?selectedSessions=2a98b92a-d08e-4a5f-895f-503156a330e5




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,Considering the rapid growth of ChatGPT's usag...,"The rapid growth of ChatGPT's usage, with over...",,The context indicates that since ChatGPT's lau...,1,1,0,2.616546,dc951d33-6af7-4efd-b17a-5d397b3a904e,fa185a75-85fa-427c-9189-de85d7372e21
1,"ChatGPT use vary by job, like science, managem...",ChatGPT use varies by job because users in hig...,,Because users in highly paid professional jobs...,1,1,0,2.119988,6caf0c50-b2fd-45f2-a2b2-42114ab48688,e233ced7-940b-4f14-9381-762cfc6a1008
2,How does the increasing percentage of non-work...,"Based on the provided context, the increasing ...",,"The data shows that in June 2024, non-work mes...",1,0,0,5.978542,c91b517b-6907-42db-ace3-affb1ce2f296,b58203be-badc-4366-8d8b-cc39e7821b1a
3,how variation in ChatGPT use by job like in fi...,"Based on the provided context, Figure 23 prese...",,The context shows that users in highly paid pr...,1,0,0,5.120039,08f5e8ac-5e94-448d-a681-ca53609d3451,aaafacc2-3f50-49d7-b4e9-39fb18c71a11
4,How does the percentage distribution of messag...,"Based on the provided context, the percentage ...",,The context shows that non-work messages have ...,1,1,0,4.5324,551a12bc-44b2-41a1-bc51-e8214ab2c80d,6ae68b08-10c8-4195-a8e4-3f0ef6adbc64
5,How does the variation in ChatGPT usage across...,The context indicates that ChatGPT usage varie...,,"According to the context, variation in ChatGPT...",1,1,0,4.289069,2adc656d-f09b-447c-816a-acc27a7d81eb,3b5851d2-1518-4270-a987-a9fb670a4e0f
6,What is SOC in the context of ChatGPT usage?,SOC in the context of ChatGPT usage refers to ...,,Variation by Occupation Figure 23 presents var...,1,1,0,2.09313,8a5ae398-5cd8-436f-94e0-0aedc2e086e0,d0d860ef-4286-4e7d-9a41-e6c21afc5f30
7,Can you tell me what Claude is and how it is u...,I don't know.,,The provided context does not include specific...,1,0,0,1.132764,98d018dc-646e-48a6-854f-49bc8f2106ab,05bb0347-0e7b-4871-aaff-f17d2da5cdf9
8,How does ChatGPT contribute to productivity in...,"According to the context, ChatGPT contributes ...",,ChatGPT is used for various purposes that enha...,1,0,0,1.658413,48e59ac5-18bc-4332-ad94-5a2058b5cd7c,6cd5dd56-c507-4745-8b10-37af12af2604
9,"What does Bick et al., 2024 say about ChatGPT'...",Bick et al. (2024) report that 28% of US adult...,,"Bick et al., 2024 reports that by July 2025, 1...",0,0,0,1.214173,adac47df-1d7f-4764-b654-c0744ea7bcb7,7957f777-fc21-497d-9e6e-72b16bd4d4a3
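Since these scores are most useful for directional comparisons, it helps to reduce each feedback column to a single number. The sketch below averages the three feedback columns; the values are copied by hand from the ten rows of the results table above, and the variable names are only illustrative.

```python
# Feedback columns copied from the ten rows of the results table above.
baseline_feedback = {
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "helpfulness": [1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
    "dopeness": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

# Mean score per metric: the fraction of examples the evaluator scored as 1.
baseline_means = {
    metric: sum(scores) / len(scores) for metric, scores in baseline_feedback.items()
}
```

These per-metric means give a baseline to compare against after changing the pipeline.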


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used by the retriever to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [30]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [31]:
rag_documents = docs

In [32]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

#### Answer
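Chunk size controls how much text each retrieved chunk carries, so it changes what the model sees at answer time:
- Larger chunks give the model more surrounding context per retrieved chunk.
- Answers are less siloed: with small chunks, the relevant passage may be split across two chunks, so no single chunk contains the full answer and accuracy drops.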
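The fragmentation effect can be shown with a toy example. This is a sketch only: `chunk_text` is a naive fixed-width character chunker written for illustration (not LangChain's recursive splitter), and the document, fact, and sizes are made up.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Naive fixed-width character chunker with overlap between neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def chunks_containing(text, fact, chunk_size, chunk_overlap):
    """Return the chunks that contain the whole fact, unbroken."""
    return [chunk for chunk in chunk_text(text, chunk_size, chunk_overlap) if fact in chunk]


filler = "x" * 40
fact = "usage doubled within one year"
document = filler + fact + filler  # the fact sits in the middle of the document

# With small chunks the fact straddles a chunk boundary; with larger chunks it fits in one.
small = chunks_containing(document, fact, chunk_size=50, chunk_overlap=5)
large = chunks_containing(document, fact, chunk_size=100, chunk_overlap=5)
```

Here `small` is empty because no 50-character chunk holds the whole fact, while `large` contains one chunk with the fact intact. A larger overlap can also reduce this kind of splitting.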

In [33]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

#### Answer
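Here we switch to `text-embedding-3-large`, a more capable (but more expensive) model than `text-embedding-3-small`. It produces higher-dimensional embeddings (3072 dimensions versus 1536 by default), which tends to capture semantic differences more precisely, so the retriever is more likely to surface the relevant chunks.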

In [34]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [35]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [36]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)

Let's test it on a question similar to the one we tried before.

In [37]:
dopeness_rag_chain.invoke({"question" : "How are people using AI to make money?"})

'Alright, buckle up for the rad rundown on how people are cashing in with AI, straight from the data vault! According to the context, folks aren’t just using AI as a basic tool to grind through tasks—they’re leveraging ChatGPT as a *decision-support wizard* and research assistant in their workflows. This means AI is turbocharging their brainpower by boosting the *quality* of their decision-making, especially in knowledge-heavy gigs where smarter choices equal bigger wins.\n\nThe true money move? Using AI not only to automate or augment labor but to *advise* and *empower* humans to work smarter, not just harder. Since decision quality spikes output, users are unlocking serious productivity boosts that translate into higher earnings and valuable economic surplus. In fact, estimates show US users would demand a hefty $98 payoff to skip AI for a month, reflecting a staggering minimum of $97 billion annual surplus sparked by generative AI’s mojo.\n\nSo, AI isn’t just an assistant; it’s like

Finally, we can evaluate the new chain on the same test set!

In [38]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'complicated-powder-24' at:
https://smith.langchain.com/o/0ac46901-bbd8-4e04-a1a2-f0202cf105d3/datasets/bc981e33-2bea-4618-b42b-9572c8f572f2/compare?selectedSessions=22ebce5b-e849-48e7-9c3a-5aa3588f8f7b





| | inputs.question | outputs.output | error | reference.answer | feedback.correctness | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Considering the rapid growth of ChatGPT's usag... | Alright, strap in because we're diving deep in... | | The context indicates that since ChatGPT's lau... | 1 | 1 | 1 | 5.835028 | dc951d33-6af7-4efd-b17a-5d397b3a904e | 39a9f06c-193e-42aa-abac-b9265ba4f9b9 |
| 1 | ChatGPT use vary by job, like science, managem... | Oh, buckle up — here’s the scoop with some ser... | | Because users in highly paid professional jobs... | 1 | 1 | 1 | 5.60003 | 6caf0c50-b2fd-45f2-a2b2-42114ab48688 | b455431f-d72e-4482-9c4a-2dc753686c53 |
| 2 | How does the increasing percentage of non-work... | Yo, check this out—between June 2024 and June ... | | The data shows that in June 2024, non-work mes... | 1 | 0 | 1 | 5.443783 | c91b517b-6907-42db-ace3-affb1ce2f296 | bf875bca-3094-4a96-9c1a-e4ea307b88ce |
| 3 | how variation in ChatGPT use by job like in fi... | Yo, buckle up—let’s dive into the juicy tapest... | | The context shows that users in highly paid pr... | 1 | 1 | 1 | 8.735711 | 08f5e8ac-5e94-448d-a681-ca53609d3451 | d591c0c0-0010-411e-9141-94e1dce2577b |
| 4 | How does the percentage distribution of messag... | Alright, let's crank this up to eleven and div... | | The context shows that non-work messages have ... | 1 | 1 | 1 | 6.03723 | 551a12bc-44b2-41a1-bc51-e8214ab2c80d | 6cfe19c8-511a-4ebe-a942-5142f1ece608 |
| 5 | How does the variation in ChatGPT usage across... | Oh snap, buckle up—let’s decode the glory behi... | | According to the context, variation in ChatGPT... | 1 | 0 | 1 | 4.880572 | 2adc656d-f09b-447c-816a-acc27a7d81eb | f12ccaff-b411-4992-8469-3c4ac52d3f8b |
| 6 | What is SOC in the context of ChatGPT usage? | Yo, strap in for this SOC drop straight from t... | | Variation by Occupation Figure 23 presents var... | 1 | 1 | 1 | 3.316501 | 8a5ae398-5cd8-436f-94e0-0aedc2e086e0 | 2a6cdbc9-72bb-45da-acd7-bf3d7bc3f320 |
| 7 | Can you tell me what Claude is and how it is u... | Yo, here’s the lowdown straight from the data ... | | The provided context does not include specific... | 1 | 1 | 1 | 2.99366 | 98d018dc-646e-48a6-854f-49bc8f2106ab | 764cf72e-13bf-4418-81f8-a4405f7d7419 |
| 8 | How does ChatGPT contribute to productivity in... | Alright, strap in for some next-level AI insig... | | ChatGPT is used for various purposes that enha... | 1 | 0 | 1 | 7.900164 | 48e59ac5-18bc-4332-ad94-5a2058b5cd7c | c6095d5e-349a-4726-a851-4759fb4b83c4 |
| 9 | What does Bick et al., 2024 say about ChatGPT'... | Alright, strap in for some next-level knowledg... | | Bick et al., 2024 reports that by July 2025, 1... | 1 | 1 | 1 | 4.025852 | adac47df-1d7f-4764-b654-c0744ea7bcb7 | fcd6a9aa-25cb-4096-9f48-488960f21373 |
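The per-example feedback scores above are easier to compare across experiments once they are averaged. The sketch below copies the score and timing columns from the ten rows of this run into plain Python and computes the mean of each; `summarize` is a hypothetical helper written for this illustration, not a LangSmith function.

```python
from statistics import mean

# (correctness, helpfulness, dopeness, execution_time) for rows 0-9,
# copied from the results of the dopeness_rag_chain run above.
rows = [
    (1, 1, 1, 5.835028),
    (1, 1, 1, 5.60003),
    (1, 0, 1, 5.443783),
    (1, 1, 1, 8.735711),
    (1, 1, 1, 6.03723),
    (1, 0, 1, 4.880572),
    (1, 1, 1, 3.316501),
    (1, 1, 1, 2.99366),
    (1, 0, 1, 7.900164),
    (1, 1, 1, 4.025852),
]

COLUMNS = ("correctness", "helpfulness", "dopeness", "execution_time")


def summarize(score_rows):
    """Return the mean of each column as a dict keyed by column name."""
    return {
        name: mean(row[index] for row in score_rows)
        for index, name in enumerate(COLUMNS)
    }


summary = summarize(rows)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

For this run, all ten examples pass the correctness and dopeness checks, while seven of ten pass helpfulness (a mean of 0.7). Running the same aggregation on the baseline chain's results gives the numbers needed for a side-by-side comparison.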


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

![Screenshot of Chain Performance Comparison](Screenshot%202025-10-07%20at%2023.49.59.png)

**Analysis of Chain Performance Differences:**

**1. Correctness (QA Evaluator):**
- **Likely improved** due to larger chunk size providing more complete context
- The 1000-character chunks vs 500-character chunks give the LLM more information to generate accurate answers
- Better embedding model (`text-embedding-3-large`) retrieves more relevant chunks

**2. Helpfulness:**
- **Likely improved** because more complete context leads to more comprehensive and useful responses
- Larger chunks reduce the need to piece together fragmented information
- Better retrieval quality means more relevant information reaches the LLM

**3. Dopeness:**
- **Significantly improved** due to the explicit "dopeness" prompt modification
- The new prompt specifically instructs: "Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses."
- This directly addresses the dopeness evaluator's criteria for creativity and engagement

**Key Changes Impact:**
- **Chunk size increase**: More context = better answers = higher correctness/helpfulness
- **Embedding model upgrade**: Better retrieval = more relevant information = improvement across all metrics
- **Prompt modification**: Direct instruction for creativity = higher dopeness scores
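To illustrate the chunk size point, here is a minimal sketch using a naive fixed-size character splitter. This is a simplification written for this example, not the text splitter used earlier in the notebook, and the sample text is made up. It only shows the trade-off: doubling the chunk size yields fewer chunks, each carrying more surrounding context.

```python
def split_text(text, chunk_size, chunk_overlap=0):
    """Naively split text into fixed-size character chunks (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Made-up sample text standing in for a loaded document.
sample_text = "ChatGPT usage varies by occupation. " * 100

small_chunks = split_text(sample_text, chunk_size=500)
large_chunks = split_text(sample_text, chunk_size=1000)

print(f"500-char chunks:  {len(small_chunks)}")
print(f"1000-char chunks: {len(large_chunks)}")
```

With a fixed number of retrieved chunks, the larger chunk size hands the LLM roughly twice as much text per query, which is one plausible reason correctness and helpfulness could rise. The cost is more tokens per call and a higher chance of including irrelevant text.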



