# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, it would be much harder to get a high-quality early signal on our application's performance.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a few dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's Endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

> NOTE: DO NOT RUN THESE CELLS IF YOU ARE RUNNING THIS NOTEBOOK LOCALLY

In [None]:
#!pip install -qU ragas==0.2.10

In [None]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/deman/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/deman/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

LangChain API Key: ········


We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

In [20]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM - SDG - f0529bf9


OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key: ········


## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data by downloading the webpages we'll be working with today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [5]:
!mkdir data

mkdir: data: File exists


In [6]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31524    0 31524    0     0  44504      0 --:--:-- --:--:-- --:--:-- 44462


In [7]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70549    0 70549    0     0  59934      0 --:--:--  0:00:01 --:--:-- 59990


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [8]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [10]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [11]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [12]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
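To make the relationship-building idea concrete, here is a minimal standard-library sketch of linking nodes whose embeddings are similar. It is not Ragas' implementation: the helper names, the toy 3-dimensional vectors, and the `0.9` threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_edges(embeddings, threshold):
    """Return (i, j, score) for every node pair whose similarity meets the threshold."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(embeddings), 2):
        score = cosine_similarity(a, b)
        if score >= threshold:
            edges.append((i, j, round(score, 3)))
    return edges


# Toy "embeddings" (hypothetical values, not real model output).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]

# The threshold is an assumed value for this sketch.
edges = build_edges(toy_embeddings, threshold=0.9)
print(edges)
```

Only the first two toy nodes point in nearly the same direction, so only that pair gets an edge. The real transforms also apply other heuristics, such as overlap scores, which this sketch leaves out.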

In [13]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so the imported `default_transforms` function is not shadowed on re-run
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 10, relationships: 27)

We can save and load our knowledge graphs as follows.

In [14]:
kg.save("ai_across_years_kg.json")
ai_across_years_kg = KnowledgeGraph.load("ai_across_years_kg.json")
ai_across_years_kg

KnowledgeGraph(nodes: 10, relationships: 27)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [15]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=ai_across_years_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [16]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
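As a rough sanity check on the distribution above, the sketch below treats the weights as probabilities and computes how many samples of each kind we would expect in a testset of 10 (the size used in this notebook). It assumes simple proportional allocation. Ragas' actual allocation may round or resample differently, so treat these as expectations rather than guarantees.

```python
import math

# Synthesizer names and weights mirror the query_distribution above.
weights = {
    "SingleHopSpecificQuerySynthesizer": 0.5,
    "MultiHopAbstractQuerySynthesizer": 0.25,
    "MultiHopSpecificQuerySynthesizer": 0.25,
}
testset_size = 10

# The weights act as probabilities, so they should sum to 1.
assert math.isclose(sum(weights.values()), 1.0)

# Proportional allocation (an assumption of this sketch).
expected_counts = {name: weight * testset_size for name, weight in weights.items()}
for name, count in expected_counts.items():
    print(f"{name}: ~{count:g} of {testset_size} samples")
```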

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

These `QuerySynthesizers` are designed to generate queries using our knowledge-graph in increasing orders of complexity. More concretely:

* `SingleHopSpecificQuerySynthesizer` - This is the simplest query synthesizer. It creates query-answer pairs that can be answered by retrieving information from a single, most relevant node in our graph. It can be used to test the ability of the system under evaluation to identify the most pertinent piece of information (node) in our knowledge graph. The questions and answers are *specific*, or factual, in nature.
* `MultiHopAbstractQuerySynthesizer` - This query synthesizer generates query-answer pairs that need multiple pieces of information (nodes in our graph) to be answered. It first identifies clusters of nodes with similar themes, then summarizes each theme and generates question-answer pairs that require abstract reasoning over the common themes within the cluster. Answering them requires multiple hops over connected nodes in the graph. This synthesizer is designed to produce a dataset that tests the abstract reasoning ability of an LLM across many information sources, so the questions are more interpretive in nature. An example (from the Ragas [docs](https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/#query-types-in-rag)):
     - “Which scientist influenced Einstein’s work on relativity, and what theory did they propose?” **is a specific query** which requires multiple hops but is factual,
     - “How did Einstein’s theory change our understanding of time and space?” **is an abstract query which requires multiple hops**.

  `MultiHopAbstractQuerySynthesizer` focuses on these kinds of **interpretive/abstract question-answer pairs over multiple nodes**.
* `MultiHopSpecificQuerySynthesizer` - This synthesizer also generates question-answer pairs using multiple hops over nodes in our knowledge graph. However, unlike `MultiHopAbstractQuerySynthesizer`, the questions are meant to be direct and fact-based. An example (from the [docs](https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/#query-types-in-rag)) is the query “Which scientist influenced Einstein’s work on relativity, and what theory did they propose?”. The question is fact-based, but requires multiple hops to answer.


Finally, we can use our `TestsetGenerator` to generate our testset!

In [17]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Baidu is what in LLMs?,[The ethics of this space remain diabolically ...,The context mentions Baidu as one of the organ...,single_hop_specifc_query_synthesizer
1,Who is Simon Willison?,[Simon Willison’s Weblog Subscribe Stuff we fi...,Simon Willison’s Weblog Subscribe Stuff we fig...,single_hop_specifc_query_synthesizer
2,In the context of AI development and the ongoi...,[the document includes some of the clearest ex...,"According to the provided context, 'llamafile ...",single_hop_specifc_query_synthesizer
3,What is DeepSeek and how does it relate to rec...,[The environmental impact got better The envir...,DeepSeek is mentioned among the models on the ...,single_hop_specifc_query_synthesizer
4,What is Amazon Nova's voice mode expected to i...,"[also accepts audio input, and the Google Gemi...",Amazon also pre-announced voice mode for Amazo...,single_hop_specifc_query_synthesizer
5,How do model performance and efficiency improv...,[<1-hop>\n\nThe environmental impact got bette...,"In 2024, significant advancements in model per...",multi_hop_abstract_query_synthesizer
6,"How does multimodal vision, including audio an...",[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,"Multimodal vision, which includes audio, video...",multi_hop_abstract_query_synthesizer
7,How do advancements in AI hardware and infrast...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,Simon Willison’s 2024 weblog highlights signif...,multi_hop_abstract_query_synthesizer
8,Considering the complex ethical landscape of l...,[<1-hop>\n\nThe ethics of this space remain di...,The first segment emphasizes the diabolically ...,multi_hop_specific_query_synthesizer
9,How do Anthropic's models compare to GPT-4 in ...,[<1-hop>\n\nThe ethics of this space remain di...,"According to the provided context, Anthropic h...",multi_hop_specific_query_synthesizer
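A quick way to check the mix of query types that was actually generated is to count the `synthesizer_name` column. The sketch below hard-codes the ten values shown in the table above, including the `specifc` spelling exactly as it appears in the output, so that it runs without Ragas installed.

```python
from collections import Counter

# Values copied from the synthesizer_name column of the table above.
synthesizer_names = (
    ["single_hop_specifc_query_synthesizer"] * 5
    + ["multi_hop_abstract_query_synthesizer"] * 3
    + ["multi_hop_specific_query_synthesizer"] * 2
)

counts = Counter(synthesizer_names)
total = sum(counts.values())
for name, n in counts.most_common():
    print(f"{name}: {n} ({n / total:.0%})")
```

On the live dataframe, `testset.to_pandas()["synthesizer_name"].value_counts()` should give the same breakdown.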


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [18]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'StringIO' object has no attribute 'output'


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [19]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How does the opnAI company relate to the devel...,[The ethics of this space remain diabolically ...,"The context mentions that in the past, OpenAI ...",single_hop_specifc_query_synthesizer
1,Whaat was the breackthroug year for AI in 2023...,[Simon Willison’s Weblog Subscribe Stuff we fi...,2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
2,"As an AI research engineer, how does the term ...",[the document includes some of the clearest ex...,"In the document, 'ChatGPT' is one of the promi...",single_hop_specifc_query_synthesizer
3,What is Tencent?,[The year of slop Synthetic training data work...,Tencent is mentioned among the organizations w...,single_hop_specifc_query_synthesizer
4,H0w do the trainng data desgn and its critcal ...,[<1-hop>\n\nPhi series of models has consisten...,The context explains that careful design of th...,multi_hop_abstract_query_synthesizer
5,how much energy cost less AI environmental imp...,[<1-hop>\n\nNothing yet from Anthropic or Meta...,the energy cost of running AI prompts has drop...,multi_hop_abstract_query_synthesizer
6,How do inference-scaling models like OpenAI’s ...,[<1-hop>\n\nrun into the same roadblock: how g...,Inference-scaling models such as OpenAI’s o1 a...,multi_hop_abstract_query_synthesizer
7,how large language models (LLMs) and synthetic...,[<1-hop>\n\nNothing yet from Anthropic or Meta...,The context explains that large language model...,multi_hop_abstract_query_synthesizer
8,"Considering the developments in 2023 and 2024,...",[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,The year 2023 marked a significant breakthroug...,multi_hop_specific_query_synthesizer
9,How does the importance of synthetic data in t...,[<1-hop>\n\nthe document includes some of the ...,The context highlights that synthetic data pla...,multi_hop_specific_query_synthesizer


We already provided our LangSmith API key and set tracing to "true" during setup, so we're ready to start working with LangSmith.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [21]:
from langsmith import Client

client = Client()

dataset_name = "State of AI Across the Years!"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="State of AI Across the Years!"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [22]:
for data_row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row[1]["user_input"]
      },
      outputs={
          "answer": data_row[1]["reference"]
      },
      metadata={
          "context": data_row[1]["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )
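The mapping performed in the loop above can be isolated into a small pure function, which makes it easy to check before sending anything to LangSmith. This is only a sketch: `row_to_example` is a hypothetical helper, and the sample row uses placeholder values rather than real generated data.

```python
def row_to_example(row):
    """Map one Ragas testset row onto LangSmith-style inputs, outputs, and metadata."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row: the column names match the dataframe, the values are placeholders.
sample_row = {
    "user_input": "What is Tencent?",
    "reference_contexts": ["<context chunk>"],
    "reference": "<reference answer>",
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = row_to_example(sample_row)
print(example)
```

The resulting dictionary has the same three parts that the loop passes to `client.create_example` as `inputs`, `outputs`, and `metadata`.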

## Basic RAG Chain

Time for some RAG!


In [23]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [25]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [26]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI"
)



In [27]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [28]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
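To see what the template does at run time, the sketch below fills the same two placeholders with plain `str.format`. This is a stand-in for illustration only: the real `ChatPromptTemplate` produces chat messages rather than a bare string, and the context value here is a placeholder.

```python
TEMPLATE = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

# Placeholder context; in the real chain this is the list of retrieved documents.
filled = TEMPLATE.format(
    context="<retrieved chunks go here>",
    question="What are Agents?",
)
print(filled)
```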

For our LLM, we will be using OpenAI's endpoints as well!

We're going to be using `gpt-4.1-mini` - a fast and cost-effective model which should still get us strong results!

In [29]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [30]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
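The dictionary at the head of the chain runs each of its values on the same input, so the question is both sent to the retriever and passed through unchanged. The sketch below mimics that fan-out in plain Python. `fake_retriever` and `fan_out` are hypothetical stand-ins, not LangChain APIs.

```python
from operator import itemgetter


def fake_retriever(query):
    """Hypothetical stand-in for the retriever: returns canned 'documents'."""
    return [f"<chunk relevant to: {query}>"]


def fan_out(payload):
    """Mimic the dict step: build 'context' and 'question' from the same input."""
    get_question = itemgetter("question")
    return {
        "context": fake_retriever(get_question(payload)),
        "question": get_question(payload),
    }


prompt_inputs = fan_out({"question": "What are Agents?"})
print(prompt_inputs)
```

The output has exactly the two keys the prompt template expects, which is why the dictionary can be piped straight into `rag_prompt`.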

In [31]:
rag_chain.invoke({"question" : "What are Agents?"})

'Based on the context provided, "agents" is an extremely vague and ambiguous term in AI. It generally refers to AI systems that can "go away and act on your behalf," similar to a travel agent model or large language models (LLMs) given access to tools they can use in a loop to solve problems. However, there is no single clear or widely accepted definition, and many different meanings exist. Despite excitement about AI agents, few real production examples currently exist, and their utility is limited by challenges such as gullibility—LLMs tend to believe anything they are told, which undermines their ability to make reliable decisions autonomously. Overall, agents remain more of a "coming soon" concept rather than a realized technology.'

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [32]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [33]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm" : eval_llm
    }
)
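The `prepare_data` hook above decides which fields the labeled evaluator sees. The sketch below runs the same mapping on stand-in objects so the resulting dictionary can be inspected. The `SimpleNamespace` objects and placeholder strings are assumptions for illustration, not real LangSmith `Run` or `Example` instances.

```python
from types import SimpleNamespace


def prepare_data(run, example):
    """Same mapping as the lambda above: prediction, reference, and input."""
    return {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Stand-ins that expose only the attributes the mapping reads.
run = SimpleNamespace(outputs={"output": "<chain answer>"})
example = SimpleNamespace(
    inputs={"question": "What is Tencent?"},
    outputs={"answer": "<reference answer>"},
)

prepared = prepare_data(run, example)
print(prepared)
```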

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`:
- `labeled_helpfulness_evaluator`:
- `dope_or_nope_evaluator`:

* `qa_evaluator` - This evaluator uses LangSmith's built-in `qa` evaluator. Given a question and a reference answer, it has an LLM judge whether the candidate answer is correct by comparing it against the reference.
* `labeled_helpfulness_evaluator` - This is a custom evaluator. The criterion we have set asks whether, given the question and the reference answer, the candidate answer is "helpful" to the user. An answer based only on the provided context can be *correct* without being helpful. An example of this (generated below) is the question "What is Tencent?" with the answer "Tencent is one of the 18 organizations that have higher scoring models than GPT-4-0314 on the Chatbot Arena leaderboard.". This answer is *correct* based on the provided context and reference answer, but it is not really helpful if we consider the user's intent in asking the question.
* `dope_or_nope_evaluator` - This is another custom evaluator, which checks whether the candidate answer is "dope, lit, or cool". It is somewhat subjective and relies on the LLM understanding these terms (which a model like `gpt-4.1` most likely does). Because it uses the unlabeled `criteria` evaluator, it does not consult the reference answer.

## LangSmith Evaluation

In [34]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'back-government-49' at:
https://smith.langchain.com/o/f7ca0545-d8d4-5a10-81e4-f60466d7b363/datasets/53dab390-629d-4e94-8a2b-4b95e0da2258/compare?selectedSessions=3b67ee64-4160-42c2-ad4c-339b07ca6da9




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How does Google's development of models like G...,Google's development of models like Gemini 1.5...,,Google's development of models like Gemini 1.5...,1,1,0,5.98835,99257267-017a-4051-be98-26f422e5d0b2,f5f3a567-cc4d-4aba-b5ce-c949fa6e7ad9
1,How does the emphasis on synthetic data in Dee...,"Based on the provided context, DeepSeek v3 ext...",,"DeepSeek v3, a large 685B parameter model, lev...",1,1,0,6.07184,d8a266bf-bca0-4c76-a3f4-b17666853aff,ba74ed05-1d89-4f2c-a7f8-d5a6d6bbfbcf
2,How does the importance of synthetic data in t...,The importance of synthetic data in training l...,,The context highlights that synthetic data pla...,1,0,0,2.859351,5f6f51a6-60db-418c-8ad1-55e5af0f4ac8,ffa58f66-882b-4ee9-9e0b-cd061f93633a
3,"Considering the developments in 2023 and 2024,...",Advancements in Large Language Models (LLMs) d...,,The year 2023 marked a significant breakthroug...,1,1,0,5.266457,9b3fd180-a100-470d-820f-01c60fe04441,71c151fa-e6ae-44cd-8a64-8a991bd1299f
4,how large language models (LLMs) and synthetic...,Large language models (LLMs) increasingly use ...,,The context explains that large language model...,1,1,0,3.286085,01931b34-5f9b-4731-90ee-e0680aa186c9,ded9f063-05d7-4c46-ad21-617024e58ba4
5,How do inference-scaling models like OpenAI’s ...,Inference-scaling models like OpenAI’s o1 and ...,,Inference-scaling models such as OpenAI’s o1 a...,1,1,0,6.21819,df192445-f0c6-41a7-b5d5-2c9e4b4e51f1,8d60ffc5-44d5-48ec-8cb6-c6b9e4c7fa04
6,how much energy cost less AI environmental imp...,Based on the provided context:\n\nThe energy c...,,the energy cost of running AI prompts has drop...,1,1,0,5.532544,734b95d8-c917-4a03-afec-bf35387924df,f1c4d41b-ce28-4b86-a9de-cde48abac50e
7,H0w do the trainng data desgn and its critcal ...,Careful design of the training data is crucial...,,The context explains that careful design of th...,1,1,0,3.450312,657a9119-c377-4a29-893a-53e2ab0e87aa,a6c82b89-3176-4255-8ab5-94e0a42a19e7
8,What is Tencent?,Tencent is one of the 18 organizations that ha...,,Tencent is mentioned among the organizations w...,1,0,0,1.183365,420fb05d-4cfb-441f-9d76-67100ec15812,1af97e6c-8a8a-4995-8f41-5ef3f92bbd98
9,"As an AI research engineer, how does the term ...","Based on the provided context, ""ChatGPT"" is an...",,"In the document, 'ChatGPT' is one of the promi...",1,1,0,4.62142,78caa9cb-031f-41b2-9051-889410211533,4eb002e0-8d94-41bf-95fa-bb8d4422479b
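To summarize the table above, we can average each feedback column. The sketch below hard-codes the ten scores per metric exactly as they appear in the results, so it runs without LangSmith access.

```python
from statistics import mean

# Scores copied row by row from the results table above.
baseline_feedback = {
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "helpfulness": [1, 1, 0, 1, 1, 1, 1, 1, 0, 1],
    "dopeness": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

averages = {metric: mean(scores) for metric, scores in baseline_feedback.items()}
for metric, value in averages.items():
    print(f"{metric}: {value:.2f}")
```

With only ten examples, each example moves an average by 0.1, so these numbers are best read as a directional signal.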


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used for retrieval to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [35]:
DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)

In [36]:
rag_documents = docs

In [37]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

It may be the case that the answer to the user's question can only be reasoned about by looking at a larger segment of the document. A smaller chunk size may then be detrimental, because the query can match one chunk while the answer sits in the next one. A larger chunk size can avoid this problem by keeping the matching text and the answer together in a single chunk.

It can be hard to tune this `chunk_size` to arbitrary use cases, therefore it is also important to have the `chunk_overlap` parameter which can use overlapping parts of the document for different chunks. 
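The effect of `chunk_size` and `chunk_overlap` can be seen with a simplified fixed-window chunker. This is not the algorithm used by `RecursiveCharacterTextSplitter`, which splits on separators. The `chunk_text` helper, the toy text, and the sizes below are all assumptions chosen to make the boundary effect visible.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def found_whole(needle, chunks):
    """True if at least one chunk contains the needle without it being cut."""
    return any(needle in chunk for chunk in chunks)


# Toy document and the phrase we hope to retrieve intact.
text = (
    "Section A introduces the topic. "
    "Section B contains the key answer sentence. "
    "Section C wraps up."
)
needle = "key answer sentence"

small_no_overlap = chunk_text(text, chunk_size=30, chunk_overlap=0)
small_with_overlap = chunk_text(text, chunk_size=30, chunk_overlap=20)
large_no_overlap = chunk_text(text, chunk_size=80, chunk_overlap=0)

print(found_whole(needle, small_no_overlap))    # phrase is cut at a chunk boundary
print(found_whole(needle, small_with_overlap))  # overlap keeps the phrase intact
print(found_whole(needle, large_no_overlap))    # larger chunk keeps the phrase intact
```

In this toy case, small chunks without overlap cut the phrase in two, while either adding overlap or enlarging the chunk keeps it whole.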

In [38]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

Unlike `text-embedding-3-small`, which is geared toward low latency and a small memory footprint (it uses 1536-dimensional embeddings by default), the `text-embedding-3-large` model is geared toward retrieval quality (it uses 3072-dimensional embeddings by default). The larger embedding size gives the model more room to capture information about the underlying text, which can translate into more relevant retrieved chunks.

By using an embedding model better suited for semantic understanding, we hope to be able to do better at questions which involve more reasoning and understanding of underlying document themes.

In [39]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Across Years (Augmented)"
)

In [40]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [41]:
dope_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dope_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we asked before.

In [42]:
dope_rag_chain.invoke({"question" : "what are Agents?"})

"Alright, here’s the scoop on “Agents” — the term’s kinda like that mysterious package you get in the mail: looks cool but you have zero clue what’s inside. It’s a super vague word in AI circles, and folks toss it around without pinning down one solid meaning.\n\nBasically, there are two main flavors:  \n1. The OG travel agent vibe — AI that actually *acts* for you in the real world, handling tasks on your behalf.  \n2. Or, the newer school — LLMs (large language models) hooked up with tools, running loops, trying to solve problems like a boss.\n\nBut here’s the kicker — these so-called “agents” haven’t really *arrived* yet. They’re more like that “coming soon” flick trailer you keep rewatching. One big roadblock? Gullibility. These AI systems believe whatever they’re told, which is a nightmare if you want them to make legit decisions for you. Without AGI-level smarts, beating gullibility? Yeah, that might still be outta reach.\n\nSo, in short: Agents are AI systems that supposedly go 

Finally, we can evaluate the new chain on the same test set!

In [43]:
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)

View the evaluation results for experiment: 'puzzled-detail-11' at:
https://smith.langchain.com/o/f7ca0545-d8d4-5a10-81e4-f60466d7b363/datasets/53dab390-629d-4e94-8a2b-4b95e0da2258/compare?selectedSessions=f12330af-cc29-4cea-b51a-a63838d21acc





| | inputs.question | outputs.output | error | reference.answer | feedback.correctness | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How does Google's development of models like G... | Alright, let me break it down cool and clear f... | | Google's development of models like Gemini 1.5... | 1 | 1 | 1 | 7.210575 | 99257267-017a-4051-be98-26f422e5d0b2 | 78311fc0-1b31-4a24-8aff-1d71876c2210 |
| 1 | How does the emphasis on synthetic data in Dee... | Yo, based on the context you dropped, here’s t... | | DeepSeek v3, a large 685B parameter model, lev... | 0 | 0 | 1 | 8.716907 | d8a266bf-bca0-4c76-a3f4-b17666853aff | efcde1f8-3f42-4e1c-95f9-c6e1ce6df409 |
| 2 | How does the importance of synthetic data in t... | Alright, check this out. Synthetic data isn't ... | | The context highlights that synthetic data pla... | 1 | 1 | 1 | 4.11682 | 5f6f51a6-60db-418c-8ad1-55e5af0f4ac8 | 715983ff-6e5f-42f7-b600-0b42b16d77cb |
| 3 | Considering the developments in 2023 and 2024,... | Yo, so here’s the lowdown on LLMs through 2023... | | The year 2023 marked a significant breakthroug... | 1 | 1 | 1 | 9.743914 | 9b3fd180-a100-470d-820f-01c60fe04441 | f91c4ef7-7abb-4213-8dfc-6c7d68211e65 |
| 4 | how large language models (LLMs) and synthetic... | Alright, check it—LLMs and synthetic data are ... | | The context explains that large language model... | 1 | 1 | 1 | 4.581459 | 01931b34-5f9b-4731-90ee-e0680aa186c9 | 620454db-f477-4495-8534-63ddcfb9a183 |
| 5 | How do inference-scaling models like OpenAI’s ... | Aight, here’s the lowdown: inference-scaling m... | | Inference-scaling models such as OpenAI’s o1 a... | 1 | 1 | 1 | 5.628795 | df192445-f0c6-41a7-b5d5-2c9e4b4e51f1 | b0da26ff-d409-471f-8a4f-e955091f8586 |
| 6 | how much energy cost less AI environmental imp... | Alright, here’s the lowdown, fresh and fly:\n\... | | the energy cost of running AI prompts has drop... | 1 | 0 | 1 | 5.909314 | 734b95d8-c917-4a03-afec-bf35387924df | 2995644e-c0dc-47e9-8dbd-4b4f938dfecb |
| 7 | H0w do the trainng data desgn and its critcal ... | Yo, let me break it down for you smooth and cl... | | The context explains that careful design of th... | 1 | 1 | 1 | 4.075159 | 657a9119-c377-4a29-893a-53e2ab0e87aa | 80398dbd-c25e-4bed-846f-8564646bff73 |
| 8 | What is Tencent? | Yo, based on the vibe from that context, Tence... | | Tencent is mentioned among the organizations w... | 1 | 1 | 1 | 2.589954 | 420fb05d-4cfb-441f-9d76-67100ec15812 | 5e65a2b2-7273-4b99-b335-8dff15916581 |
| 9 | As an AI research engineer, how does the term ... | Alright, listen up! ChatGPT is basically a sup... | | In the document, 'ChatGPT' is one of the promi... | 1 | 1 | 1 | 4.371074 | 78caa9cb-031f-41b2-9051-889410211533 | 6e4dddcd-78d3-480c-ab7b-7ff9584feeef |
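
As a quick sanity check, the per-metric averages for the rows displayed above can be computed by hand. The sketch below copies the three feedback columns from those ten rows into plain tuples; the helper `mean_scores` is a hypothetical name introduced here, not part of LangSmith. It covers only the rows shown, so the figures need not match the aggregates in the LangSmith UI if the experiment contains more examples.

```python
# Feedback scores copied from the ten result rows shown above, in row order:
# (correctness, helpfulness, dopeness)
rows = [
    (1, 1, 1),
    (0, 0, 1),
    (1, 1, 1),
    (1, 1, 1),
    (1, 1, 1),
    (1, 1, 1),
    (1, 0, 1),
    (1, 1, 1),
    (1, 1, 1),
    (1, 1, 1),
]
METRICS = ("correctness", "helpfulness", "dopeness")


def mean_scores(score_rows):
    """Average each feedback column over the given rows."""
    n = len(score_rows)
    return {
        name: sum(row[i] for row in score_rows) / n
        for i, name in enumerate(METRICS)
    }


averages = mean_scores(rows)
for name, value in averages.items():
    print(f"{name}: {value:.0%}")
```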


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

#### Baseline RAG Pipeline Evaluation
Please find a screenshot of the baseline RAG pipeline evaluation below:

![baseline_rag_pipeline](baseline_rag_pipeline.png)

#### RAG Pipeline with Dopeness

Please find a screenshot of RAG pipeline evaluation enhanced for "dopeness" below:

![pipeline with dopeness](rag_pipeline_with_dopeness.png)

#### Analysis

* Impact on Dopeness: The pipeline tuned for dopeness clearly outperforms the baseline on the dopeness criterion, going from 0% to 100%. The reason is straightforward: **we instructed our LLM to "answer the questions in a dope way, be cool!".**
* Impact on Helpfulness: On average, helpfulness is ~75% in both pipelines, although which answers were marked helpful changes slightly. This is somewhat unexpected: I had hoped that, with a better embedding model and larger document chunks, the model would do better overall on this metric. That may have been true in some cases (some answers marked unhelpful in the baseline were marked helpful in the second pipeline), but not on average.
* Impact on Correctness: Both pipelines are reasonably correct (100% for the baseline vs 92% for the dopeness pipeline), but interestingly the **"dopeness pipeline" has one answer marked incorrect**. Let's quickly examine this case.

  The question is: `How does the emphasis on synthetic data in DeepSeek v3 relate to the themes of DeepSeek and DeepSeek-R1, and what are the ethical considerations of using such models in AI development?`

  The **baseline pipeline answers**: `"Based on the provided context, DeepSeek v3 extensively uses synthetic data in its training process, continuing a trend seen in DeepSeek-R1 where larger models help generate reasoning data or synthetic examples for training smaller or fine-tuned models. This approach reflects a broader theme in DeepSeek’s development strategy: leveraging synthetic data not just as a cheap substitute but as a beneficial component that helps steer model training effectively.\n\nEthically, the use of such models raises complex considerations. The context notes ongoing legal debates about training models on people's content without permission, which is related to broader questions about the impact of AI on human work and society. While synthetic data can improve models, it also prompts issues around consent, fairness, and competitive harm, especially as AI-generated content increasingly resembles or replaces human-created work. Although the context does not provide specific ethical guidelines related to DeepSeek v3, it highlights that these legal and ethical challenges are profound and likely to influence AI’s future development.\n\nIn summary, DeepSeek v3’s emphasis on synthetic data builds on DeepSeek and DeepSeek-R1’s theme of using generated data to enhance training efficiency and model quality, while the ethical considerations involve complex legal and societal questions about the use of copyrighted or human-generated content in AI training and its broader impacts."`

  This is a well-reasoned answer based on the context.

  The **dopeness pipeline answers**: `"Yo, based on the context you dropped, here’s the lowdown:\n\nDeepSeek v3 is this massive 685B parameter boss model, trained super efficiently compared to its peers. While DeepSeek’s story highlights cutting-edge training cost reductions and environmental impact improvements, the text doesn’t explicitly link DeepSeek v3’s development to a heavy emphasis on synthetic data like the Phi series does.\n\nThe Phi models are all about synthetic data because it’s structured, gradual, and way more learnable for language models—like a smooth jazz solo versus chaotic noise. Synthetic data helps models connect the dots token by token without the messy complexity of real-world data. It’s a direct, clean feed that leads to better reasoning.\n\nBut for DeepSeek v3 and DeepSeek-R1, the context mostly talks about scale, cost, and release info, not much about synthetic data vibes or its role in training. So, how DeepSeek’s synthetic data use ties to Phi’s themes or DeepSeek-R1’s approach isn’t spelled out.\n\nAs for ethical stuff—while synthetic data dodges some issues like privacy (no real humans harmed in data-making), there’s that “model collapse” fear floating around: training models on AI-generated data endlessly could make them forget or degrade. Thankfully, context says that’s a myth, and labs are *actually* using synthetic data smartly to guide models right.\n\nEthically, using synthetic data could mean less biased and more controlled training sets, but also risks if synthetic data inherits or amplifies subtle biases or errors. The context doesn’t deep dive into these ethical waters for DeepSeek, so gotta keep it chill and say I don’t see solid info here.\n\nIn a nutshell:  \n- Phi series vibes big on synthetic data for better learning.  \n- DeepSeek talks big on training scale and cost, but not much on synthetic data emphasis.  \n- Ethics around synthetic data? The collapse thing is busted, but nuanced pros and cons remain.  \n\nSo, summing up: I don’t know how DeepSeek’s synthetic data emphasis directly weaves into its own themes or ethics based on what we got here. Keep it cool, keep it curious!"`

  In this case, the emphasis on dopeness seems to drown out the information about the ethics of training on copyrighted and human-generated content without permission. **That information is much easier to parse (at least for my human brain) in the baseline answer.**
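
The helpfulness observation above (same average, different distribution) is easier to verify with a per-example diff than from two aggregate numbers. Below is a minimal sketch of that comparison. The function name `flipped_examples`, the example ids, and all scores are made-up placeholders for illustration and are not results from this notebook.

```python
# Compare per-example feedback between two experiment runs.
# All ids and scores below are made-up placeholders, not results from this notebook.
def flipped_examples(baseline: dict, modified: dict) -> dict:
    """Return {example_id: (baseline_score, modified_score)} for scores that changed."""
    shared = baseline.keys() & modified.keys()
    return {
        ex_id: (baseline[ex_id], modified[ex_id])
        for ex_id in sorted(shared)
        if baseline[ex_id] != modified[ex_id]
    }


baseline_helpfulness = {"ex-1": 1, "ex-2": 0, "ex-3": 1, "ex-4": 1}
modified_helpfulness = {"ex-1": 1, "ex-2": 1, "ex-3": 0, "ex-4": 1}

changes = flipped_examples(baseline_helpfulness, modified_helpfulness)
print(changes)
```

In this toy data both runs average 75% helpfulness, yet two examples flipped in opposite directions, which is the kind of shift an average alone hides.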