# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic data generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, getting a high-quality early signal on our application's performance would be much harder.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

> NOTE: DO NOT RUN THESE CELLS IF YOU ARE RUNNING THIS NOTEBOOK LOCALLY

In [26]:
#!pip install -qU ragas==0.2.10

In [27]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [28]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bsmith53\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\bsmith53\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [29]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [30]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [31]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
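When running locally, re-running the cells above prompts for every key again. A minimal sketch of a guard that only prompts when a variable is missing; `ensure_env` is a hypothetical helper (not part of the original notebook), and its `prompt` argument exists only so it can be exercised without a terminal:

```python
import getpass
import os


def ensure_env(name, prompt=getpass.getpass):
    """Return os.environ[name], asking for it only if it is unset or empty.

    `ensure_env` is a hypothetical helper for illustration; `prompt` is
    injectable so the function can be called without an interactive terminal.
    """
    if not os.environ.get(name):
        os.environ[name] = prompt(f"{name}: ")
    return os.environ[name]


# Usage in the notebook (prompts only for keys that are not already set):
# ensure_env("LANGCHAIN_API_KEY")
# ensure_env("OPENAI_API_KEY")
```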

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data by downloading the webpages we'll be working with today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [32]:
!mkdir data

A subdirectory or file data already exists.
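`!mkdir data` reports an error when the folder already exists, as in the output above. A small sketch of an idempotent alternative using only the standard library; the helper name `ensure_dir` is made up for this example:

```python
from pathlib import Path


def ensure_dir(path):
    """Create `path` (and any missing parents) if needed; do nothing if it exists."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Usage in the notebook:
# ensure_dir("data")
```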


In [33]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 31524    0 31524    0     0   322k      0 --:--:-- --:--:-- --:--:--  331k


In [34]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 70549    0 70549    0     0   421k      0 --:--:-- --:--:-- --:--:--  425k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [35]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [36]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [37]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [38]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
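To make the relationship-building step concrete, here is a minimal sketch of the idea behind a cosine-similarity edge builder. It is not Ragas' implementation: the toy vectors, the `0.9` threshold, and the function names are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_edges(embeddings, threshold=0.9):
    """Link every pair of nodes whose embeddings are at least `threshold` similar."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


# Toy 3-dimensional "embeddings" (real ones have hundreds or thousands of dimensions).
toy_embeddings = {
    "chunk_a": [1.0, 0.0, 1.0],
    "chunk_b": [0.9, 0.1, 1.0],
    "chunk_c": [0.0, 1.0, 0.0],
}

edges = build_edges(toy_embeddings)
for source, target, score in edges:
    print(f"{source} <-> {target}: {score:.3f}")
```

Only `chunk_a` and `chunk_b` point in nearly the same direction, so they are the only pair that gets a relationship in this toy graph.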

In [39]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so the imported `default_transforms` function is not overwritten if this cell is re-run
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg


Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/22 [00:00<?, ?it/s]

unable to apply transformation: 'StringIO' object has no attribute 'output'


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 12, relationships: 43)

We can save and load our knowledge graphs as follows.

In [40]:
import locale
print(locale.getpreferredencoding())


cp65001


In [41]:
import sys

print("Default encoding:", sys.getdefaultencoding())         # usually 'utf-8'
print("Filesystem encoding:", sys.getfilesystemencoding())   # often 'utf-8' or 'mbcs' (Windows)




Default encoding: utf-8
Filesystem encoding: utf-8


In [42]:
kg.save("ai_across_years_kg.json")
ai_across_years_kg = KnowledgeGraph.load("ai_across_years_kg.json")
ai_across_years_kg

KnowledgeGraph(nodes: 12, relationships: 43)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [43]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=ai_across_years_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [44]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
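The weights in `query_distribution` are proportions of the requested test set. As a rough illustration, here is how 0.5 / 0.25 / 0.25 could map onto ten samples. This is a sketch of proportional allocation with largest-remainder rounding, not Ragas' internal logic, and `allocate_samples` is a hypothetical helper:

```python
from math import floor


def allocate_samples(weights, total):
    """Split `total` samples across categories in proportion to `weights`.

    Each category first gets the floor of its exact share; any samples left
    over go to the categories with the largest fractional remainders
    (ties are resolved in insertion order).
    """
    raw = {name: weight * total for name, weight in weights.items()}
    counts = {name: floor(value) for name, value in raw.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}

allocation = allocate_samples(weights, total=10)
print(allocation)
```

Because ten does not divide evenly by the two 0.25 weights, one of the multi-hop categories ends up with one more sample than the other.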

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

#### Answer

SingleHopSpecificQuerySynthesizer: 'Single hop' means the context comes from only a single node, and 'specific' refers to the type of query being generated, which in this case is a direct (non-conceptual) query. An example would be "What is the capital of Nebraska?"

MultiHopAbstractQuerySynthesizer: 'Multi hop' means the context from multiple nodes is combined, and 'abstract' refers to the type of query being generated. In this case, it is a conceptual question that demands a more complex answer, as opposed to a simple factual response. An example would be "What are the key differences between Frequentist and Bayesian statistical approaches?"

MultiHopSpecificQuerySynthesizer: 'Multi hop' means the context from multiple nodes is combined, and 'specific' refers to the type of query being generated, which in this case is a direct (factual and non-conceptual) query. An example would be "Who had the most home runs on the team that won the World Series in 2024?"


Finally, we can use our `TestsetGenerator` to generate our testset!

In [46]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Is Anthropic a company or an organization?,[The ethics of this space remain diabolically ...,The context mentions Anthropic as one of the o...,single_hop_specifc_query_synthesizer
1,Who is Simon Willison and what has he contribu...,[Simon Willison’s Weblog Subscribe Stuff we fi...,Simon Willison is mentioned in the context of ...,single_hop_specifc_query_synthesizer
2,How is the term 'Llama' relevant in the contex...,[the document includes some of the clearest ex...,"In the provided context, 'Llama' is mentioned ...",single_hop_specifc_query_synthesizer
3,What is Anthropic's role in the development of...,[Things we learned about LLMs in 2024 31st Dec...,Anthropic launched the Claude 3 series in Marc...,single_hop_specifc_query_synthesizer
4,What is GPT-4 Turbo?,[punch massively above their weight. I run Lla...,The provided context does not include a defini...,single_hop_specifc_query_synthesizer
5,How do recent AI evaluation methods and benchm...,"[<1-hop>\n\non inference. The sequel to o1, o3...",Recent AI evaluation methods and benchmarking ...,multi_hop_abstract_query_synthesizer
6,How do recent developments in large language m...,"[<1-hop>\n\non inference. The sequel to o1, o3...",Recent developments in large language models (...,multi_hop_abstract_query_synthesizer
7,How do LLMs r accessible for hobbyists and run...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,"The context shows that in 2023, Simon Willison...",multi_hop_abstract_query_synthesizer
8,WhY is Apple’s MLX librarY better than Apple I...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,Apple’s MLX library is considered excellent be...,multi_hop_specific_query_synthesizer
9,how much did the large language models get bet...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,"In 2024, large language models (LLMs) signific...",multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [47]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/20 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

AttributeError: 'StringIO' object has no attribute 'mapping'

In [48]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are the key factors that have contributed...,[The ethics of this space remain diabolically ...,Despite the significant advancements in large ...,single_hop_specifc_query_synthesizer
1,Who is Simon Willison and what are his contrib...,[Simon Willison’s Weblog Subscribe Stuff we fi...,Simon Willison’s Weblog discusses his insights...,single_hop_specifc_query_synthesizer
2,From the perspective of an AI researcher and e...,[the document includes some of the clearest ex...,The context discusses ethical considerations s...,single_hop_specifc_query_synthesizer
3,How is Google contributing to advancements in ...,[Things we learned about LLMs in 2024 31st Dec...,Google has contributed to advancements in Larg...,single_hop_specifc_query_synthesizer
4,How do the recent emergence of voice and live ...,[<1-hop>\n\npunch massively above their weight...,The recent emergence of voice and live camera ...,multi_hop_abstract_query_synthesizer
5,How do the advancements in multimodal capabili...,[<1-hop>\n\nThings we learned about LLMs in 20...,"In 2024, significant progress has been made in...",multi_hop_abstract_query_synthesizer
6,How do the themes of model efishency and cost ...,[<1-hop>\n\npunch massively above their weight...,The context explains that LLM prices have cras...,multi_hop_abstract_query_synthesizer
7,How do the knowledge gaps and societal impacts...,[<1-hop>\n\nways we should not be using genera...,The context highlights that the rapid developm...,multi_hop_abstract_query_synthesizer
8,How does DeepSeek's use of synthetic data rela...,[<1-hop>\n\nways we should not be using genera...,DeepSeek's emphasis on training with synthetic...,multi_hop_specific_query_synthesizer
9,How do synthetic data and AI's power-user feat...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,The context highlights that synthetic data is ...,multi_hop_specific_query_synthesizer


Our LangSmith API key and the tracing flag were already set in Task 1, so we're ready to move on to evaluation.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [49]:
from langsmith import Client

client = Client()

dataset_name = "State of AI Across the Years!"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="State of AI Across the Years!"
)

LangSmithConflictError: Conflict for /datasets. HTTPError('409 Client Error: Conflict for url: https://api.smith.langchain.com/datasets', '{"detail":"Dataset with this name already exists."}')
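The `LangSmithConflictError` above means a dataset with this name already exists from an earlier run, and because `create_dataset` raised, `langsmith_dataset` was never assigned. A hedged sketch of a get-or-create pattern follows. It assumes the client exposes `has_dataset`, `read_dataset`, and `create_dataset` with a `dataset_name` keyword, as recent versions of the LangSmith `Client` do; check your installed version. `get_or_create_dataset` is a hypothetical helper.

```python
def get_or_create_dataset(client, dataset_name, description=""):
    """Return the dataset called `dataset_name`, creating it only if it is missing.

    `client` is expected to behave like `langsmith.Client` (assumption:
    `has_dataset`, `read_dataset`, and `create_dataset` accept `dataset_name`).
    """
    if client.has_dataset(dataset_name=dataset_name):
        return client.read_dataset(dataset_name=dataset_name)
    return client.create_dataset(dataset_name=dataset_name, description=description)


# Usage with the notebook's objects:
# langsmith_dataset = get_or_create_dataset(
#     client, dataset_name, description="State of AI Across the Years!"
# )
```

Reusing an existing dataset means any examples created afterwards are appended to whatever is already there, so delete the old dataset or pick a fresh name if you want a clean test set.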

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [50]:
for _, row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": row["user_input"]
      },
      outputs={
          "answer": row["reference"]
      },
      metadata={
          "context": row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )

NameError: name 'langsmith_dataset' is not defined

## Basic RAG Chain

Time for some RAG!


In [51]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [52]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [53]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [54]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI"
)

In [55]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [56]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

For our LLM, we will be using OpenAI's endpoints as well!

We're going to be using `gpt-4.1-mini` - a small, fast model which should be more than capable for this task!

In [57]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [58]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
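To see what the LCEL pipeline does step by step, here is a plain-Python sketch of the same data flow with stand-ins. `fake_retriever` and `fake_llm` are hypothetical stubs, and the real chain passes `Document` and message objects rather than plain strings and dicts.

```python
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def fake_retriever(question):
    """Stand-in for the vectorstore retriever: returns a list of 'chunks'."""
    return [f"(chunk relevant to: {question})"]


def fake_llm(prompt_text):
    """Stand-in for the chat model: returns a message-like dict."""
    return {"content": f"ANSWER BASED ON <<{prompt_text}>>"}


def run_chain(inputs, retriever=fake_retriever, llm=fake_llm):
    # Step 1: the dict at the head of the chain fans the input out into two keys,
    # sending the question to the retriever and also passing it through unchanged.
    prompt_inputs = {
        "context": retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: the prompt template fills in both placeholders.
    prompt_text = PROMPT_TEMPLATE.format(**prompt_inputs)
    # Step 3: the model produces a message.
    message = llm(prompt_text)
    # Step 4: the output parser reduces the message to a plain string.
    return message["content"]


answer = run_chain({"question": "What are Agents?"})
print(answer)
```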

In [59]:
rag_chain.invoke({"question" : "What are Agents?"})

'Based on the provided context, "Agents" is an infuriatingly vague term with no single, clear, and widely understood meaning. Generally, it seems to converge around the idea of AI systems that can go away and act on your behalf. There are two main interpretations: one, like a travel agent model, where AI acts on your behalf; and two, large language models (LLMs) given access to tools they can use iteratively to solve problems. However, the term is used without consistent definition, and the concept of agents is often tied to autonomy, which also lacks a clear explanation. Despite much discussion and excitement, genuine AI agents running effectively in production are rare or non-existent, partly due to challenges like gullibility (LLMs believing anything told to them), making it hard for agents to make meaningful decisions on a user\'s behalf. In summary, agents refer broadly to AI systems designed to autonomously act for users, but the term remains ambiguous and the technology is still

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [60]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [61]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm" : eval_llm
    }
)

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`:
- `labeled_helpfulness_evaluator`:
- `dope_or_nope_evaluator`:


#### Answer

`qa_evaluator`: This evaluates the response based solely on factual correctness and semantic similarity to the reference answer. This is a standard LangChain evaluator.

`labeled_helpfulness_evaluator`: This evaluates the helpfulness of the response against the criterion "Is this submission helpful to the user, taking into account the correct reference answer?" The LLM is tasked with writing out its reasoning step by step and then making a "yes" or "no" determination on helpfulness.

`dope_or_nope_evaluator`: This is a different kind of evaluator that doesn't judge against the reference answer. It simply determines whether the tone of the response is "dope", without regard for the correctness or helpfulness of the response.
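The `prepare_data` hook on `labeled_helpfulness_evaluator` decides which fields the judge LLM sees. The sketch below replays that same mapping on stand-in objects; the `SimpleNamespace` values are invented for illustration and only mimic the `outputs` and `inputs` attributes used in the lambda.

```python
from types import SimpleNamespace


def prepare_labeled_data(run, example):
    """Same mapping as the `prepare_data` lambda in the evaluator config."""
    return {
        "prediction": run.outputs["output"],       # what our chain answered
        "reference": example.outputs["answer"],    # the synthetic reference answer
        "input": example.inputs["question"],       # the synthetic question
    }


# Stand-ins for a LangSmith run and dataset example (illustrative values only).
run = SimpleNamespace(outputs={"output": "I don't know"})
example = SimpleNamespace(
    inputs={"question": "what is stanford alpaca"},
    outputs={"answer": "(reference answer from the dataset)"},
)

fields = prepare_labeled_data(run, example)
print(fields)
```

A labeled evaluator receives all three fields, while an unlabeled `criteria` evaluator such as `dope_or_nope_evaluator` judges without a reference.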



## LangSmith Evaluation

In [62]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'left-person-4' at:
https://smith.langchain.com/o/bec07d7d-df09-4fc3-917e-46e6c637d955/datasets/ca714c29-70f8-4c4f-8791-4e4d6bb51c72/compare?selectedSessions=dd8497da-a824-47a2-8c2f-900e47e7fc32




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How do recent developments in LLM training dat...,Recent developments in LLM training data and l...,,Recent developments highlight that the most su...,1,0,0,7.604801,16e322d0-8f3a-4e7b-bd98-0e0229a7c0ee,d8baf804-0f01-4ef5-af07-1373de31a5a1
1,How do synthetic data and ChatGPT exemplify re...,"Based on the provided context, synthetic train...",,The context highlights that synthetic data is ...,1,0,0,5.613343,71c2e20d-5813-4254-83cb-85779350a2c7,d00a4df8-f17d-4822-8302-6e65c49f76b9
2,How do recent developments in multi-modal audi...,Recent developments in multi-modal audio and v...,,The context highlights that ChatGPT has recent...,1,1,0,7.045951,c5463e7d-560d-4c0f-bf60-e01f058c9dc7,670fc97f-c490-4969-974f-c6407b9d53c0
3,china why did the inference scaling models in ...,Based on the provided context:\n\nInference-sc...,,"The context indicates that in 2024, models tra...",1,1,0,8.784912,c3cc7bc8-d8bd-4d39-b26e-8227f4f7a2da,79924e5d-7e9a-4282-8ce7-b33330b9b784
4,"How do the advancements in LLM capabilities, s...","Advancements in LLM capabilities, especially i...",,The context segments highlight significant adv...,1,1,0,5.092882,3d4f81e9-178b-4391-a88c-3858e8ec4d7f,4949c242-931c-4152-944e-b3c382eeea64
5,Considering the various organizations developi...,"Based on the provided context, the development...",,The context highlights that many organizations...,1,1,0,11.56603,87b81c9c-5d19-4217-9a27-7e17fa06974d,fa7e7762-db16-4fd3-a179-da71ae6dcf17
6,LLMs smart or dumb how about Simon Willison say?,"Simon Willison says that LLMs are ""really smar...",,Simon Willison’s weblog says 2023 was a big ye...,1,1,0,1.611913,9819fe6e-5dc6-46bb-8ee1-2c44d5a5676a,0081355f-5c5d-452c-bfe9-6140ba6e2725
7,How do recent advancements in multimodal capab...,Recent advancements in multimodal capabilities...,,Recent advancements in multimodal capabilities...,1,1,0,10.335745,ebd6eeea-5d49-4c2a-970f-36702cbd97aa,e6b49697-4a08-4702-af48-ce4033f0f154
8,what is stanford alpaca,I don't know,,The context mentions Stanford Alpaca in relati...,0,0,0,1.496921,9203d35b-ada4-4f52-b74f-0452edc48ed7,e5833e4e-8e9a-4279-af96-0acd99db9627
9,Considering the developments in AI during 2023...,"Based on Simon Willison’s weblog from 2023, th...",,"In 2023, it was considered the breakthrough ye...",1,1,0,5.141081,8e2f0f79-f184-48ef-af11-73af92d82d22,151abed5-f40b-4aa2-891b-98390d3ae975


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model behind our retriever to: `text-embedding-3-large`

Let's see how this changes our evaluation!

In [63]:
DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)

In [64]:
rag_documents = docs

In [65]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

#### Answer

When choosing chunk size and overlap, there are several trade-offs to consider:

1. Small chunk sizes: increase the likelihood that relevant information is split across chunks, but reduce the chance that extraneous information within a chunk confuses the model. They also make it less likely that the inputs are too large for the context window, so all of the information contained within a chunk can be considered.

2. Large chunk sizes: increase the amount of information and surrounding context available to the model. They also increase the likelihood of inputs being too large for the context window (rendering the portion that gets cut off useless) and may increase latency.
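To get a feel for the trade-off, here is a simplified sliding-window splitter. It is not the separator-aware algorithm that LangChain's `RecursiveCharacterTextSplitter` uses; `chunk_text` and the 2,400-character sample text are assumptions for illustration only.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample_text = "x" * 2400  # stand-in for a document

small_chunks = chunk_text(sample_text, chunk_size=500, chunk_overlap=50)
large_chunks = chunk_text(sample_text, chunk_size=1000, chunk_overlap=50)

print(len(small_chunks), "chunks of up to 500 characters")
print(len(large_chunks), "chunks of up to 1000 characters")
```

Doubling the chunk size roughly halves the number of chunks, so each retrieved chunk carries more context but the retriever has fewer, coarser units to choose from.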



In [66]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

#### Answer 

The embedding model is crucial for the performance of RAG applications:

1. Vector dimensions and maximum input length (the number of values an embedding contains and the size of chunk the model can process) affect the model's ability to capture nuances and patterns within the data and to represent semantic relationships. However, more dimensions do not guarantee better retrieval on a given dataset. Additionally, if the model's input limit is smaller than a chunk, meaningful data is excluded and both embedding and retrieval performance will suffer.

2. Better embeddings improve retrieval, which improves the quality of the prompt being passed to the LLM. This improves the responses of the model, and is therefore a critical step in improving app performance.

3. Larger and more complex embedding models come with additional cost and latency, although how much this matters depends on the situation.





In [67]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Across Years (Augmented)"
)

In [68]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [69]:
dope_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dope_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we asked before.

In [70]:
dope_rag_chain.invoke({"question" : "what are Agents?"})

'Alright, here’s the deal on "agents" straight from the context — the term is hella vague and kinda frustrating. People throw it around like everyone’s on the same page, but nah, there’s no one-size-fits-all meaning. Some peeps think of agents like your personal travel agent—they go off and act on your behalf. Others see agents as LLMs (large language models) hooked up with tools that they loop through to solve problems. \n\nBut here’s the kicker: no one really nailed a solid definition, and the whole thing is wrapped up in this tricky concept called "autonomy," which itself is blurry. Plus, agents still feel like they’re “coming soon” forever because of big hurdles, mainly gullibility—LLMs believe anything you feed them, so trusting these agents to make solid decisions is a tall order.\n\nSo, TL;DR: Agents are supposed to be AI systems that act for you, but what that looks like exactly is still a moving target in the AI game. Keep those hype trains parked for now.'

Finally, we can evaluate the new chain on the same test set!

In [71]:
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)

View the evaluation results for experiment: 'kind-winter-28' at:
https://smith.langchain.com/o/bec07d7d-df09-4fc3-917e-46e6c637d955/datasets/ca714c29-70f8-4c4f-8791-4e4d6bb51c72/compare?selectedSessions=d1e35d4c-5a9f-4464-a473-5cf7ecf04b71




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How do recent developments in LLM training dat...,"Alright, buckle up! The 2024 glow-up in LLMs i...",,Recent developments highlight that the most su...,1,0,1,4.496657,16e322d0-8f3a-4e7b-bd98-0e0229a7c0ee,b5fc19d4-6d15-4b38-a82f-1af212ff052d
1,How do synthetic data and ChatGPT exemplify re...,"Alright, here’s the lowdown with some swagger:...",,The context highlights that synthetic data is ...,1,0,1,5.051027,71c2e20d-5813-4254-83cb-85779350a2c7,db5309ee-e70a-4208-a5fe-1c9dcf9fc266
2,How do recent developments in multi-modal audi...,"Yo, check this out—the latest multi-modal move...",,The context highlights that ChatGPT has recent...,1,1,1,8.82682,c5463e7d-560d-4c0f-bf60-e01f058c9dc7,b8b308fb-5374-4f58-afd3-d5c1f2877990
3,china why did the inference scaling models in ...,"Alright, here’s the lowdown straight from the ...",,"The context indicates that in 2024, models tra...",1,1,1,5.199483,c3cc7bc8-d8bd-4d39-b26e-8227f4f7a2da,1cc6cc77-9a6f-41e5-a794-2b20b10c21a5
4,"How do the advancements in LLM capabilities, s...","Yo, based on the vibe from the context, here’s...",,The context segments highlight significant adv...,1,1,1,8.266931,3d4f81e9-178b-4391-a88c-3858e8ec4d7f,8a84148d-674b-466c-a8ef-047d318387b9
5,Considering the various organizations developi...,"Alright, check this out—based on the vibe from...",,The context highlights that many organizations...,1,1,1,7.558551,87b81c9c-5d19-4217-9a27-7e17fa06974d,e10a095e-e171-4bbc-9aaf-0d94059952d7
6,LLMs smart or dumb how about Simon Willison say?,"Alright, here’s the vibe Simon Willison throws...",,Simon Willison’s weblog says 2023 was a big ye...,1,0,1,2.134861,9819fe6e-5dc6-46bb-8ee1-2c44d5a5676a,8e261134-3c31-4908-91f8-7964a3bf2f67
7,How do recent advancements in multimodal capab...,"Yo, check this out! The 2024 LLM game went mul...",,Recent advancements in multimodal capabilities...,1,1,1,5.33093,ebd6eeea-5d49-4c2a-970f-36702cbd97aa,c9a46cb5-429a-4eaf-afdf-50017b84b6ee
8,what is stanford alpaca,"Stanford Alpaca? Alright, here’s the lowdown s...",,The context mentions Stanford Alpaca in relati...,1,1,1,2.509914,9203d35b-ada4-4f52-b74f-0452edc48ed7,040e3953-c92b-44e3-8d7c-3d9f4c83d53a
9,Considering the developments in AI during 2023...,"Alright, here’s the lowdown straight from Simo...",,"In 2023, it was considered the breakthrough ye...",1,1,1,6.840608,8e2f0f79-f184-48ef-af11-73af92d82d22,3de6a046-7f9a-4288-9d7d-6c76c068d2d4


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

The following screenshot highlights the differences between the two runs. Counting the `feedback.*` columns in the two result tables above, the metrics changed as follows:

- correctness: increased from 9/10 to 10/10 with the second chain. This could be a result of the larger chunk size (going from 500 to 1000), the different embedding model (going from `text-embedding-3-small` to `text-embedding-3-large`), or the retriever's `k`, which changed from 10 to the default.

- dopeness: increased from 0/10 to 10/10. This is an expected increase, given that we explicitly stated in our prompt "You must answer the questions in a dope way, be cool!"

- helpfulness: both runs registered 7/10 on the helpfulness metric. However, the failing rows differ. In the first run, the "I don't know" response to the Stanford Alpaca question was marked 'no' for both correctness and helpfulness, while in the second run the response to the "LLMs smart or dumb" question was correct but unhelpful. The reason for this likely lies in the prompt change, where the addition of 'dopeness' may have impacted the helpfulness of the response.

<img src="Comparison Screenshot.png">
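As a cross-check on the comparison, the pass rates can be recomputed from the `feedback.*` columns of the two result tables above. The lists below are transcribed by hand from those tables, and the `pass_rate` helper is just for illustration.

```python
# Feedback scores per example (rows 0-9), transcribed from the two result tables.
baseline = {
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    "helpfulness": [0, 0, 1, 1, 1, 1, 1, 1, 0, 1],
    "dopeness":    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
dope = {
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "helpfulness": [0, 0, 1, 1, 1, 1, 0, 1, 1, 1],
    "dopeness":    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}


def pass_rate(scores):
    """Fraction of examples the evaluator scored as 1."""
    return sum(scores) / len(scores)


summary = {
    metric: (pass_rate(baseline[metric]), pass_rate(dope[metric]))
    for metric in baseline
}

for metric, (before, after) in summary.items():
    print(f"{metric}: {before:.0%} -> {after:.0%}")
```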