# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load the synthetic data into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

SDG is a critical piece of the puzzle, especially for early iteration! Without it, it would not be nearly as easy to get high quality early signal for our application's performance.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

> NOTE: DO NOT RUN THESE CELLS IF YOU ARE RUNNING THIS NOTEBOOK LOCALLY

In [None]:
#!pip install -qU ragas==0.2.10


In [None]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/ashurveer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ashurveer/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [22]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set we can use to measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data by downloading the webpages we'll be using today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [5]:
!mkdir data

mkdir: data: File exists


In [6]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31524    0 31524    0     0  46184      0 --:--:-- --:--:-- --:--:-- 46222


In [7]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70549    0 70549    0     0   359k      0 --:--:-- --:--:-- --:--:--  360k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [8]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [9]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [10]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [11]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.

In [12]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a separate name so the imported `default_transforms` function is not shadowed on re-run.
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

unable to apply transformation: 'StringIO' object has no attribute 'output'


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 10, relationships: 28)
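
To make the relationship-building step a little more concrete, here is a minimal sketch of how cosine similarity between node embeddings could be thresholded into edges. This is not Ragas' implementation: the helper names, the toy two-dimensional vectors, and the `0.9` threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Link every pair of nodes whose similarity meets the (hypothetical) threshold."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(embeddings), 2):
        score = cosine_similarity(a, b)
        if score >= threshold:
            edges.append((i, j, score))
    return edges


# Toy embeddings standing in for the real node embeddings.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
relationships = build_relationships(toy_embeddings)
print(relationships)
```

Only the first two toy vectors point in nearly the same direction, so only that pair ends up linked.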

We can save and load our knowledge graphs as follows.

In [13]:
kg.save("ai_across_years_kg.json")
ai_across_years_kg = KnowledgeGraph.load("ai_across_years_kg.json")
ai_across_years_kg

KnowledgeGraph(nodes: 10, relationships: 28)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [14]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=ai_across_years_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers is going to tackle a separate kind of query, which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [15]:
from ragas.testset.synthesizers import (
    default_query_distribution,
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
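
One way to read the weights in `query_distribution`: they describe roughly what share of the generated questions each synthesizer should contribute. The sketch below splits a `testset_size` across the weights with a largest-remainder rule. The `allocate_samples` helper and its rounding rule are assumptions for illustration only; Ragas' internal allocation may differ.

```python
import math


def allocate_samples(weights, testset_size):
    """Split testset_size across weights, handing leftovers to the largest remainders."""
    raw = [w * testset_size for w in weights]
    counts = [math.floor(r) for r in raw]
    leftover = testset_size - sum(counts)
    # Stable sort: ties in the fractional part go to the earlier synthesizer.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Same weights as the query_distribution above, for a 10-question testset.
counts = allocate_samples([0.5, 0.25, 0.25], testset_size=10)
print(counts)  # [5, 3, 2]
```

Under this hypothetical rule, half of the questions are single-hop and the rest are split between the two multi-hop synthesizers.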

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

1. `SingleHopSpecificQuerySynthesizer`: Generates straightforward, direct questions that can be answered using a single piece of information from the data. It's like asking "What is the color of the sky?": the answer comes from a direct lookup.
2. `MultiHopAbstractQuerySynthesizer`: Creates broader, more open-ended questions that require connecting ideas or themes from different parts of the data. It's like asking "How has technology changed our lives?": you need to consider several things and summarize them.
3. `MultiHopSpecificQuerySynthesizer`: Generates detailed, fact-based questions that require gathering and combining specific pieces of information from different parts of the data. It's like asking "Which two players scored in both the first and last games of the season?": you need to check multiple records and put the answer together.








Finally, we can use our `TestsetGenerator` to generate our testset!

In [16]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Stability AI what is it,[The ethics of this space remain diabolically ...,Stability AI is mentioned as one of the organi...,single_hop_specifc_query_synthesizer
1,What is AI according to Simon Willison’s weblog?,[Simon Willison’s Weblog Subscribe Stuff we fi...,"According to Simon Willison’s weblog, AI refer...",single_hop_specifc_query_synthesizer
2,How does the document describe the significanc...,[the document includes some of the clearest ex...,The document mentions ChatGPT as one of the to...,single_hop_specifc_query_synthesizer
3,Large Language Models are what?,[Things we learned about LLMs in 2024 31st Dec...,Large Language Models are advanced AI systems ...,single_hop_specifc_query_synthesizer
4,What is GPT-3.5 Turbo and how does it compare ...,[punch massively above their weight. I run Lla...,GPT-3.5 Turbo is mentioned as a model that is ...,single_hop_specifc_query_synthesizer
5,Whas the advancments in LLMs in 2024 and break...,[<1-hop>\n\nThings we learned about LLMs in 20...,"In 2024, the field of Large Language Models sa...",multi_hop_abstract_query_synthesizer
6,how long input capacity increase help with com...,[<1-hop>\n\nThings we learned about LLMs in 20...,"In 2024, LLMs like Google’s Gemini 1.5 Pro and...",multi_hop_abstract_query_synthesizer
7,How does the integration of voice and live cam...,[<1-hop>\n\npunch massively above their weight...,The integration of voice and live camera modes...,multi_hop_abstract_query_synthesizer
8,How does the recent development and capabiliti...,[<1-hop>\n\npunch massively above their weight...,"The context highlights that in 2024, GPT-4 has...",multi_hop_specific_query_synthesizer
9,hOw cAn I uSe ChatGPT mOre effIcIentLy whiLe a...,[<1-hop>\n\nthe document includes some of the ...,The context highlights that ChatGPT is a promi...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [17]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [18]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Is Baidu involved in developing large language...,[The ethics of this space remain diabolically ...,"Yes, according to the context, Baidu is one of...",single_hop_specifc_query_synthesizer
1,What is the role of Artifical Intelligance in ...,[Simon Willison’s Weblog Subscribe Stuff we fi...,Artificial Intelligence is the academic field ...,single_hop_specifc_query_synthesizer
2,GTP is what?,[the document includes some of the clearest ex...,"The context does not explicitly define GTP, bu...",single_hop_specifc_query_synthesizer
3,"What is xAI, and how does it relate to the rec...",[The rise of inference-scaling “reasoning” mod...,The context discusses various advancements in ...,single_hop_specifc_query_synthesizer
4,How do structured and gradual learning process...,[<1-hop>\n\nof the best descriptions I’ve seen...,Structured and gradual learning processes impr...,multi_hop_abstract_query_synthesizer
5,What are the limitations and challenges of LLM...,[<1-hop>\n\nof the best descriptions I’ve seen...,The technical report emphasizes that despite a...,multi_hop_abstract_query_synthesizer
6,"How do AI technologies like GPT, ChatGPT, and ...",[<1-hop>\n\nyou talk to me exclusively in Span...,"AI technologies such as GPT, ChatGPT, and DALL...",multi_hop_abstract_query_synthesizer
7,How do the high costs and accessibility challe...,[<1-hop>\n\nThe ethics of this space remain di...,The high costs and accessibility challenges of...,multi_hop_abstract_query_synthesizer
8,how LLMs are built and their impact on AI deve...,[<1-hop>\n\nthe document includes some of the ...,The document explains that LLMs are quite easy...,multi_hop_specific_query_synthesizer
9,ChatGPT and Gemini both talk in Spanish and us...,[<1-hop>\n\nthe document includes some of the ...,"Yes, according to the context, both ChatGPT an...",multi_hop_specific_query_synthesizer


We already provided our LangSmith API key and set tracing to "true" in the setup cells above, so we're ready to work with LangSmith.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [25]:
from langsmith import Client

client = Client()

dataset_name = "State of AI Across the Years updated!"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="State of AI Across the Years!"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [26]:
# Each Ragas row becomes one LangSmith example: question in, reference answer out.
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={
            "question": row["user_input"]
        },
        outputs={
            "answer": row["reference"]
        },
        metadata={
            "context": row["reference_contexts"]
        },
        dataset_id=langsmith_dataset.id
    )
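
Before uploading, it can help to look at the field mapping on its own. The sketch below applies the same renaming to a plain dictionary, with no LangSmith calls involved. The `to_langsmith_example` helper and the sample row values are hypothetical; only the column names come from the Ragas dataframe above.

```python
def to_langsmith_example(row):
    """Rename Ragas columns to the `question` / `answer` keys used in this notebook."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row with the same columns as the Ragas dataframe.
sample_row = {
    "user_input": "GTP is what?",
    "reference": "<reference answer>",
    "reference_contexts": ["<reference context>"],
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example)
```

Keeping the mapping in one small function makes it easy to check that every example has the keys the evaluators will look up later.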

## Basic RAG Chain

Time for some RAG!


In [27]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
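
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window chunker in plain Python. It is not the recursive splitter used above (which also tries to break on separators such as paragraphs and sentences); the `chunk_text` helper is an assumption for illustration only.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


# Toy 1200-character document made of repeating digits.
toy_text = "".join(str(i % 10) for i in range(1200))

small_chunks = chunk_text(toy_text, chunk_size=500, chunk_overlap=50)
large_chunks = chunk_text(toy_text, chunk_size=1000, chunk_overlap=50)
print(len(small_chunks), len(large_chunks))
```

Smaller windows produce more, narrower chunks; larger windows produce fewer chunks that each carry more surrounding text.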

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [29]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [30]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI"
)

In [31]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [32]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

For our LLM, we will be using OpenAI's endpoints as well!

We're going to be using `gpt-4.1-mini` to generate our answers!

In [33]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [34]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
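
The data flow through the chain can be sketched with plain functions: pull the question out of the input, retrieve context for it, fill the prompt, and pass the prompt to the model. Everything below is a stand-in for illustration: `stub_retriever`, `stub_llm`, `run_rag`, and the shortened template are hypothetical and do not call LangChain, Qdrant, or OpenAI.

```python
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def stub_retriever(question):
    """Hypothetical stand-in for the vector store retriever."""
    return ["chunk about agents", "chunk about LLMs"]


def stub_llm(prompt):
    """Hypothetical stand-in for the chat model: echoes the prompt it received."""
    return f"[model saw]\n{prompt}"


def run_rag(inputs, retriever=stub_retriever, llm=stub_llm):
    """Mirror the chain's steps: question -> context -> prompt -> model -> string."""
    question = inputs["question"]
    context = retriever(question)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return llm(prompt)


answer = run_rag({"question": "What are Agents?"})
print(answer)
```

Note that the question is used twice: once to retrieve context and once to fill the prompt, which is why the chain's first step maps `"question"` to both keys.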

In [35]:
rag_chain.invoke({"question" : "What are Agents?"})

'Based on the context provided, "agents" is an infuriatingly vague term with no single, clear, and widely understood meaning. Generally, it refers to AI systems that can act on your behalf, often thought of in two main ways: either as AI that goes and acts for you like a travel agent or as large language models (LLMs) given access to tools which they can use iteratively to solve problems. However, despite much discussion and excitement, true AI agents have not yet been realized in widespread production, partly because of challenges such as gullibility—i.e., AI\'s difficulty distinguishing truth from fiction—and the need for robust autonomy. The concept of agents still feels perpetually "coming soon" and may be dependent on achieving more advanced AI like AGI (Artificial General Intelligence).'

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [36]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [37]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm" : eval_llm
    }
)

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`:  Checks if the answer given by your RAG system is correct compared to the reference answer.
- `labeled_helpfulness_evaluator`: Judges how helpful the model’s answer is, taking into account the correct answer.
- `dope_or_nope_evaluator`: Evaluates if the answer is “dope, lit, or cool”—in other words, if it’s engaging, interesting, or has a fun/appealing style.
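
The `prepare_data` argument on `labeled_helpfulness_evaluator` controls which fields the evaluator sees. As a rough sketch, the same mapping can be exercised on stand-in objects. The `SimpleNamespace` objects below are hypothetical substitutes for a LangSmith run and example, and the reference text is a placeholder.

```python
from types import SimpleNamespace


def prepare_data(run, example):
    """Same field mapping as the lambda passed to the labeled helpfulness evaluator."""
    return {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Hypothetical stand-ins exposing only the attributes the mapping reads.
run = SimpleNamespace(outputs={"output": "I don't know."})
example = SimpleNamespace(
    inputs={"question": "GTP is what?"},
    outputs={"answer": "<reference answer>"},
)

payload = prepare_data(run, example)
print(payload)
```

The evaluator then judges `prediction` against `reference` for the given `input`.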

## LangSmith Evaluation

In [38]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'back-gold-8' at:
https://smith.langchain.com/o/8b16d96d-7566-4580-b923-510497302c71/datasets/49b58d75-6369-45c2-8bfa-627379055de2/compare?selectedSessions=03dc0f8b-702d-4c9d-8719-34166664efcd




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How does Google's development of large languag...,I don't know.,,Google has played a significant role in advanc...,0,0,0,1.221445,32141e68-656a-482f-952c-cc2968d4beb9,16e1ca23-5bd8-4b70-9893-62b547962f07
1,how are LLMs like in 2024 and what they do wit...,"Based on the provided context, here is what ca...",,"In 2024, LLMs have advanced significantly, wit...",1,1,0,11.288642,59acdfd0-54fc-46bb-a7b2-263f30d87ba9,da61cdd1-405a-43ee-a49f-1f5b31eea053
2,ChatGPT and Gemini both talk in Spanish and us...,Yes. Both ChatGPT and Google’s Gemini support ...,,"Yes, according to the context, both ChatGPT an...",1,1,0,1.961078,8349563a-15be-42f2-8ff7-f6732154e88f,786d966e-f37e-4382-a9cd-0028c4030b6b
3,how LLMs are built and their impact on AI deve...,According to Simon Willison's writings in the ...,,The document explains that LLMs are quite easy...,1,1,0,5.934514,8cb41562-201d-454f-9fe1-887ac3b42704,313a689e-84f9-46a4-9532-75802c4a3047
4,How do the high costs and accessibility challe...,The high costs and accessibility challenges of...,,The high costs and accessibility challenges of...,1,1,0,3.296162,e442e52f-0301-4864-b90f-1a88342c2cd9,d227e303-f409-4ccc-a476-bf8b9592e485
5,"How do AI technologies like GPT, ChatGPT, and ...",I don't know.,,"AI technologies such as GPT, ChatGPT, and DALL...",0,0,0,0.798896,d7eea0ee-2499-4012-8947-a7c8b97843cc,675025b3-44be-4eba-a66f-fb448a4471b0
6,What are the limitations and challenges of LLM...,The limitations and challenges of LLMs highlig...,,The technical report emphasizes that despite a...,1,1,0,8.070611,9ea2b0c3-ed40-44eb-97a5-94eb815befd8,b5c8e303-1e0c-4c3a-96ef-396319075bef
7,How do structured and gradual learning process...,Structured and gradual learning processes enha...,,Structured and gradual learning processes impr...,1,1,0,3.893588,b39ffe0d-e0a5-4f1c-bdf9-dbdcbc9cc629,9de0b991-a5c0-4209-ba7d-3bd8f2222028
8,"What is xAI, and how does it relate to the rec...",I don't know.,,The context discusses various advancements in ...,0,0,0,1.066449,47ec23b2-0b9e-4a69-be16-a94e4c476639,4ea9e503-7348-4fca-b89d-4f561b0414b0
9,GTP is what?,I don't know.,,"The context does not explicitly define GTP, bu...",0,0,0,0.661724,e0567ab7-0cf5-4f8f-9d7d-ad797c69d474,c3f915b3-58ff-4281-99ae-f5bfcee2f2dc


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used by the retriever to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [57]:
DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)

In [58]:
rag_documents = docs

In [59]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

Chunk size changes what the retriever can hand to the model. Smaller chunks retrieve more precise information, but there is a risk of missing information that is spread across multiple chunks. Larger chunks retrieve more information, sometimes more than is required and from different topics, so the model may get confused by unrelated content.

In [60]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

Different embedding models have different quality, capabilities, and specializations. Embeddings that capture the context better will result in better retrievals. This can improve the accuracy of the results, reduce irrelevant information, and help the system handle complex queries.

In [61]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Across Years (Augmented)"
)

In [62]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [63]:
dope_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dope_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we used before.

In [64]:
dope_rag_chain.invoke({"question" : "what are Agents?"})

'Alright, let me break it down cool and clear for you. Agents? Man, that term is all over the place with no single, solid meaning. Some folks think of agents as AI systems that roll out and handle tasks for you—like your personal travel agent but digital. Others see agents as LLMs (those big language models) plugging into tools and running loops to solve problems. Toss in “autonomy,” and it gets even fuzzier.\n\nBut straight up? Agents are kinda like that “coming soon” hype—promised but not really delivered yet. The big blocker? Gullibility. These AI agents believe just about anything, and that’s a huge deal if you want them to make smart decisions for you. Until they get past that, we’re stuck dreaming about agents that *actually* act on our behalf without getting played.\n\nSo yea, agents = AI systems intended to act for you, but right now they’re mostly prototypes struggling with trust issues, and we’re waiting on that next breakthrough to see them truly in action. Keep it real!'

Finally, we can evaluate the new chain on the same test set!

In [56]:
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)

View the evaluation results for experiment: 'artistic-appointment-70' at:
https://smith.langchain.com/o/8b16d96d-7566-4580-b923-510497302c71/datasets/49b58d75-6369-45c2-8bfa-627379055de2/compare?selectedSessions=95b27fcd-55c2-4ea0-970e-ef12716cb1a2




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How does Google's development of large languag...,"Yo, here’s the lowdown straight from the conte...",,Google has played a significant role in advanc...,1,1,1,5.902171,32141e68-656a-482f-952c-cc2968d4beb9,29979fa3-23e0-4fb1-9ec7-687109d2d06e
1,how are LLMs like in 2024 and what they do wit...,"Yo, that’s a hefty vibe you’re throwing down! ...",,"In 2024, LLMs have advanced significantly, wit...",1,1,1,8.499475,59acdfd0-54fc-46bb-a7b2-263f30d87ba9,469607e3-ec17-40ca-b9f7-6d5d1726de4c
2,ChatGPT and Gemini both talk in Spanish and us...,"Yo, you’re spot on with the vibe! Both ChatGPT...",,"Yes, according to the context, both ChatGPT an...",1,1,1,3.172315,8349563a-15be-42f2-8ff7-f6732154e88f,b3e38e31-bbf1-4058-bd85-422168fbe2c6
3,how LLMs are built and their impact on AI deve...,"Alright, let’s break it down Simon Willison st...",,The document explains that LLMs are quite easy...,1,1,1,5.093954,8cb41562-201d-454f-9fe1-887ac3b42704,8dd51cdf-2598-407c-9719-93442f4eeebe
4,How do the high costs and accessibility challe...,"Alright, here’s the lowdown in a chill vibe: T...",,The high costs and accessibility challenges of...,1,1,1,5.384981,e442e52f-0301-4864-b90f-1a88342c2cd9,11cdf1c9-5a6c-4426-9318-4d857943d6fb
5,"How do AI technologies like GPT, ChatGPT, and ...","Yo, here’s the lowdown on how these AI legends...",,"AI technologies such as GPT, ChatGPT, and DALL...",1,1,1,6.782168,d7eea0ee-2499-4012-8947-a7c8b97843cc,fe66889e-285d-4e78-a02a-4a5e49cd758e
6,What are the limitations and challenges of LLM...,"Yo, here’s the lowdown straight from Simon Wil...",,The technical report emphasizes that despite a...,1,0,1,6.771521,9ea2b0c3-ed40-44eb-97a5-94eb815befd8,2ae123f7-204e-49db-b89d-ab1bcd4dfb82
7,How do structured and gradual learning process...,"Alright, here's the lowdown: Structured and gr...",,Structured and gradual learning processes impr...,1,1,1,3.331964,b39ffe0d-e0a5-4f1c-bdf9-dbdcbc9cc629,f02fb242-c60d-4046-b513-d8a17dfabafe
8,"What is xAI, and how does it relate to the rec...","Yo, gotta keep it real — the context you dropp...",,The context discusses various advancements in ...,1,0,1,2.515757,47ec23b2-0b9e-4a69-be16-a94e4c476639,25cab3e8-05b2-4e18-bc0c-2af30a0f0104
9,GTP is what?,"Yo, looks like you meant **GPT**, not GTP, rig...",,"The context does not explicitly define GTP, bu...",1,1,1,1.967316,e0567ab7-0cf5-4f8f-9d7d-ad797c69d474,7f8c26ea-e7fb-4d33-88f1-5b3b2739b0e0
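
To compare the two experiments at a glance, the per-example feedback scores printed in the two result tables can be averaged. This sketch copies the 0/1 scores from those tables by hand; the `mean_scores` helper is hypothetical, and in practice the same numbers are visible in the LangSmith comparison view.

```python
# Feedback scores copied from the two result tables, in row order 0-9.
baseline = {
    "correctness": [0, 1, 1, 1, 1, 0, 1, 1, 0, 0],
    "helpfulness": [0, 1, 1, 1, 1, 0, 1, 1, 0, 0],
    "dopeness": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
dope = {
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "helpfulness": [1, 1, 1, 1, 1, 1, 0, 1, 0, 1],
    "dopeness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}


def mean_scores(feedback):
    """Average each metric's per-example scores."""
    return {metric: sum(scores) / len(scores) for metric, scores in feedback.items()}


baseline_means = mean_scores(baseline)
dope_means = mean_scores(dope)

for metric in baseline_means:
    print(f"{metric}: {baseline_means[metric]:.1f} -> {dope_means[metric]:.1f}")
```

With only ten examples these averages are coarse, so they are best read as directional signal rather than absolute scores.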


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

Refer to the screenshot `Diff Rag Chains.png`.

There is a difference in the correctness metric for the first two questions. This is because the chunk size was increased, so the model has more information available when generating its output. With the introduction of the "dope" prompt, the dopeness metric also changed to "y", reflecting a cooler, more stylized output. Latency increased as well, which could be because larger chunks are used.

