# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, getting a high-quality early signal on our application's performance would be much harder.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

> NOTE: DO NOT RUN THESE CELLS IF YOU ARE RUNNING THIS NOTEBOOK LOCALLY

In [28]:
!pip install -qU ragas==0.2.10


[notice] A new release of pip is available: 24.3.1 -> 25.0
[notice] To update, run: C:\Python313\python.exe -m pip install --upgrade pip


In [2]:
!pip install -qU langchain-community==0.3.14




In [3]:
!pip install -qU  langchain-openai==0.2.14 




In [7]:
!pip install unstructured

Defaulting to user installation because normal site-packages is not writeable





Collecting unstructured
  Downloading unstructured-0.11.8-py3-none-any.whl.metadata (26 kB)
Collecting chardet (from unstructured)
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting filetype (from unstructured)
  Using cached filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured)
  Using cached python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting lxml (from unstructured)
  Downloading lxml-5.3.0-cp313-cp313-win_amd64.whl.metadata (3.9 kB)
Collecting nltk (from unstructured)
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting tabulate (from unstructured)
  Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Collecting beautifulsoup4 (from unstructured)
  Downloading beautifulsoup4-4.13.3-py3-none-any.whl.metadata (3.8 kB)
Collecting emoji (from unstructured)
  Using cached emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting python-iso639 (from unstructured)
  Using cach

In [8]:
!pip install -qU langgraph==0.2.61 langchain-qdrant==0.2.0




In [9]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [10]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [11]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll start by downloading the webpages that will serve as our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [12]:
!mkdir data

A subdirectory or file data already exists.


In [47]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (35) schannel: next InitializeSecurityContext failed: CRYPT_E_NO_REVOCATION_CHECK (0x80092012) - The revocation function was unable to check revocation for the certificate.


In [48]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (35) schannel: next InitializeSecurityContext failed: CRYPT_E_NO_REVOCATION_CHECK (0x80092012) - The revocation function was unable to check revocation for the certificate.
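Both `curl` runs above stopped with a certificate revocation error, so they did not fetch the pages. If `curl` keeps failing in your environment, a minimal sketch of an alternative using only Python's standard library might look like the following. The `download` helper and its User-Agent string are illustrative names that are not part of the original notebook, and the sketch assumes the machine can reach the site over HTTPS.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def download(url: str, dest: Path, timeout: float = 30.0) -> int:
    """Fetch `url`, write the body to `dest`, and return the byte count."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical User-Agent value; some servers reject requests without one.
    request = Request(url, headers={"User-Agent": "sdg-notebook-example"})
    with urlopen(request, timeout=timeout) as response:
        body = response.read()
    if not body:
        # Fail loudly so an empty file never reaches the document loader.
        raise ValueError(f"Empty response from {url}")
    dest.write_bytes(body)
    return len(body)


# Usage (requires network access):
# download("https://simonwillison.net/2023/Dec/31/ai-in-2023/", Path("data/2023_llms.html"))
# download("https://simonwillison.net/2024/Dec/31/llms-in-2024/", Path("data/2024_llms.html"))
```

Returning the byte count makes it easy to confirm that each file actually contains content before moving on.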


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [16]:
!pip install unstructured

Defaulting to user installation because normal site-packages is not writeable





In [18]:
!pip install --upgrade unstructured

Defaulting to user installation because normal site-packages is not writeable





In [20]:
!pip install "unstructured<0.16.0"

Defaulting to user installation because normal site-packages is not writeable





In [23]:
from langchain_community.document_loaders import DirectoryLoader

In [51]:
import sys
print(sys.executable)

c:\Users\dabra\AppData\Local\Programs\Python\Python39\python.exe


In [54]:
!{sys.executable} -m pip install unstructured

Collecting unstructured
  Downloading unstructured-0.16.17-py3-none-any.whl (1.8 MB)
Collecting unstructured-client
  Using cached unstructured_client-0.29.0-py3-none-any.whl (63 kB)
Collecting html5lib
  Downloading html5lib-1.1-py2.py3-none-any.whl (112 kB)
Collecting pypdf>=4.0
  Using cached pypdf-5.2.0-py3-none-any.whl (298 kB)
Collecting eval-type-backport<0.3.0,>=0.2.0
  Using cached eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)
Collecting jsonpath-python<2.0.0,>=1.0.6
  Using cached jsonpath_python-1.0.6-py3-none-any.whl (7.6 kB)
Collecting pydantic<2.11.0,>=2.10.3
  Using cached pydantic-2.10.6-py3-none-any.whl (431 kB)
Installing collected packages: pypdf, eval-type-backport, jsonpath-python, pydantic, unstructured-client, html5lib, unstructured
  Attempting uninstall: pydantic
    Found existing installation: pydantic 2.10.1
    Uninstalling pydantic-2.10.1:
      Successfully uninstalled pydantic-2.10.1
Successfully installed eval-type-backport-0.2.2 html5lib-1.1 jsonp

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

chainlit 0.7.700 requires aiofiles<24.0.0,>=23.1.0, but you'll have aiofiles 24.1.0 which is incompatible.
chainlit 0.7.700 requires httpx<0.25.0,>=0.23.0, but you'll have httpx 0.28.1 which is incompatible.
You should consider upgrading via the 'c:\Users\dabra\AppData\Local\Programs\Python\Python39\python.exe -m pip install --upgrade pip' command.


In [55]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [34]:
!pip show ragas

Name: ragas
Version: 0.2.10
Summary: 
Home-page: 
Author: 
Author-email: 
License: 
Location: C:\Users\dabra\AppData\Roaming\Python\Python313\site-packages
Requires: appdirs, datasets, diskcache, langchain, langchain-community, langchain-core, langchain_openai, nest-asyncio, numpy, openai, pydantic, pysbd, tiktoken
Required-by: 


In [36]:
import sys
print(sys.executable)

c:\Users\dabra\AppData\Local\Programs\Python\Python39\python.exe


In [37]:
!{sys.executable} -m pip install ragas

Collecting ragas
  Downloading ragas-0.2.13-py3-none-any.whl (178 kB)
Collecting appdirs
  Downloading appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Collecting diskcache>=5.6.3
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Installing collected packages: appdirs, diskcache, ragas
Successfully installed appdirs-1.4.4 diskcache-5.6.3 ragas-0.2.13


You should consider upgrading via the 'c:\Users\dabra\AppData\Local\Programs\Python\Python39\python.exe -m pip install --upgrade pip' command.


In [38]:
import ragas
print(ragas.__version__)

  from .autonotebook import tqdm as notebook_tqdm


0.2.13


In [40]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [41]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [42]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 0, relationships: 0)
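The output above still reports zero nodes, which is what the loop would produce if `docs` were empty when the cell ran (for example, after a failed download). A small guard before building the graph can surface that early. This is a sketch only: `Doc` is a hypothetical stand-in for a LangChain `Document`, and `require_documents` is an illustrative helper, not a Ragas or LangChain function.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain `Document` (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def require_documents(docs: list) -> list:
    """Return `docs` unchanged, or raise if the list is empty or holds blank pages."""
    if not docs:
        raise ValueError("No documents loaded; check that data/ contains the HTML files.")
    blank = [d.metadata.get("source", "<unknown>") for d in docs if not d.page_content.strip()]
    if blank:
        raise ValueError(f"Documents with no text: {blank}")
    return docs


# Made-up example document; real documents would come from the loader.
checked = require_documents([Doc("Stuff we figured out about AI in 2023", {"source": "data/2023_llms.html"})])
print(len(checked))
```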

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
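To make the cosine-similarity step concrete, here is a toy sketch in plain Python. It is not Ragas' implementation: the three-dimensional vectors are made-up numbers rather than real embeddings, the `0.9` threshold is an assumed value, and `build_relationships` is an illustrative name.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Link every pair of nodes whose embeddings are at least `threshold` similar."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges


# Toy "summary embeddings" (made-up numbers, not real model output).
toy_embeddings = {
    "doc_2023": [1.0, 0.0, 0.0],
    "doc_2024": [0.9, 0.1, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}
edges = build_relationships(toy_embeddings, threshold=0.9)
print(edges)
```

Only the two similar vectors end up connected; the orthogonal one gets no edge.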

In [46]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

In [56]:
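# Note: this rebinds the name `default_transforms` from the imported function to its result,
# so re-running this cell requires re-running the import cell above first.
default_transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)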

In [59]:
docs

[Document(metadata={'source': 'data\\2023_llms.html'}, page_content="Simon Willison’s Weblog\n\nSubscribe\n\nStuff we figured out about AI in 2023\n\n31st December 2023\n\n2023 was the breakthrough year for Large Language Models (LLMs). I think it’s OK to call these AI—they’re the latest and (currently) most interesting development in the academic field of Artificial Intelligence that dates back to the 1950s.\n\nHere’s my attempt to round up the highlights in one place!\n\nLarge Language Models\n\nThey’re actually quite easy to build\n\nYou can run LLMs on your own devices\n\nHobbyists can build their own fine-tuned models\n\nWe don’t yet know how to build GPT-4\n\nVibes Based Development\n\nLLMs are really smart, and also really, really dumb\n\nGullibility is the biggest unsolved problem\n\nCode may be the best application\n\nThe ethics of this space remain diabolically complex\n\nMy blog in 2023\n\nHere’s the sequel to this post: Things we learned about LLMs in 2024.\n\nLarge Languag


In [60]:
apply_transforms(kg, default_transforms)
kg

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]unable to apply transformation: axis 1 is out of bounds for array of dimension 1
                                                                                              

KnowledgeGraph(nodes: 0, relationships: 0)

We can save and load our knowledge graphs as follows.

In [61]:
kg.save("ai_across_years_kg.json")
ai_across_years_kg = KnowledgeGraph.load("ai_across_years_kg.json")
ai_across_years_kg

KnowledgeGraph(nodes: 0, relationships: 0)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [62]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=ai_across_years_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthesizers is going to tackle a separate kind of query which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [110]:
from ragas.testset.synthesizers import (
    default_query_distribution,
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

# Each tuple pairs a synthesizer with its share of the generated queries (weights sum to 1.0).
query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
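As a rough illustration of what the weights in `query_distribution` mean, the sketch below splits a target test set size across named synthesizers using largest-remainder rounding. This is an assumption made for illustration; Ragas' own rounding may differ, and `allocate_samples` is a hypothetical helper rather than part of the library.

```python
import math


def allocate_samples(distribution, testset_size):
    """Split `testset_size` across (name, weight) pairs using largest-remainder rounding."""
    total_weight = sum(weight for _, weight in distribution)
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive number.")
    exact = [(name, testset_size * weight / total_weight) for name, weight in distribution]
    counts = {name: math.floor(share) for name, share in exact}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining samples to the largest fractional parts (ties keep list order).
    by_remainder = sorted(exact, key=lambda item: item[1] - math.floor(item[1]), reverse=True)
    for name, _ in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Same weights as the cell above, with plain strings standing in for the synthesizers.
toy_distribution = [
    ("single_hop_specific", 0.5),
    ("multi_hop_abstract", 0.25),
    ("multi_hop_specific", 0.25),
]
allocation = allocate_samples(toy_distribution, testset_size=10)
print(allocation)
```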

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.


<div style="background-color: #E6E6FA; padding: 10px; border-radius: 5px;">
<span style="color: black;">
<b> ANSWER:  </b> 

As part of the Ragas library, ragas.testset.synthesizers supports the evaluation of RAG systems. Its synthesizers generate synthetic queries from the provided documents, and those queries are then used to test a RAG pipeline.

The three query synthesizers mentioned above serve three different purposes when generating synthetic queries.

1. <b> SingleHopSpecificQuerySynthesizer: </b> This generates queries that correspond directly to a single document or passage. The queries explicitly reference the given information and do not require additional reasoning or combining multiple sources. This is relevant for use cases that involve direct fact extraction.

2. <b> MultiHopAbstractQuerySynthesizer: </b> This creates complex queries that require reasoning, inference, or generalization across multiple documents or sources, but in an abstract way. This is most relevant for testing RAG systems designed to aggregate multiple pieces of information to answer a query.

3. <b> MultiHopSpecificQuerySynthesizer: </b> This generates complex queries that require retrieving multiple documents and that mention specific details. It creates queries requiring fact linkage across multiple sources (inputs to RAG systems).

Therefore, these synthesizers enable evaluation of RAG systems by generating different types of synthetic queries, from simple fact-based ones to complex multi-hop reasoning queries.

</span>
</div>

Finally, we can use our `TestsetGenerator` to generate our testset!

In [111]:
print(generator.knowledge_graph.nodes)

[Node(id: 70906c, type: document, properties: ['page_content', 'document_metadata', 'headlines', 'summary', 'summary_embedding']), Node(id: f32cbe, type: document, properties: ['page_content', 'document_metadata', 'headlines', 'summary', 'summary_embedding']), Node(id: e8e442, type: chunk, properties: ['page_content', 'entities', 'themes']), Node(id: 236fe6, type: chunk, properties: ['page_content', 'entities', 'themes']), Node(id: ae437a, type: chunk, properties: ['page_content', 'entities', 'themes']), Node(id: 9e232b, type: chunk, properties: ['page_content', 'entities', 'themes']), Node(id: 5410f2, type: chunk, properties: ['page_content', 'themes', 'entities']), Node(id: f4a556, type: chunk, properties: ['page_content', 'themes', 'entities']), Node(id: 6d8b51, type: chunk, properties: ['page_content', 'themes', 'entities']), Node(id: f9a3ac, type: chunk, properties: ['page_content', 'themes', 'entities']), Node(id: aeaf4c, type: chunk, properties: ['page_content', 'themes', 'entit

In [112]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios: 100%|██████████| 3/3 [00:22<00:00,  7.63s/it]
Generating Samples: 100%|██████████| 11/11 [00:09<00:00,  1.20it/s]


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is the significance of Mistral in the con...,[Code may be the best application The ethics o...,Mistral is one of the organizations that have ...,single_hop_specifc_query_synthesizer
1,In what ways might the behavior of Large Langu...,[Based Development As a computer scientist and...,The behavior of Large Language Models might ch...,single_hop_specifc_query_synthesizer
2,"So like, what be the big deal with LLMs in 202...",[Simon Willison’s Weblog Subscribe Stuff we fi...,2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
3,What LLMs do and how they work?,[easy to follow. The rest of the document incl...,The document includes some of the clearest exp...,single_hop_specifc_query_synthesizer
4,What role does Meta play in the advancements o...,[Prompt driven app generation is a commodity a...,Meta is one of the 18 organizations with model...,single_hop_specifc_query_synthesizer
5,How have advancements in Large Language Models...,[<1-hop>\n\nCode may be the best application T...,"In 2023, Large Language Models (LLMs) were rec...",multi_hop_abstract_query_synthesizer
6,How does OpenAI's approach to AI ethics and le...,[<1-hop>\n\nCode may be the best application T...,OpenAI's approach to AI ethics and legality si...,multi_hop_abstract_query_synthesizer
7,How have advancements in model training costs ...,[<1-hop>\n\nCode may be the best application T...,Advancements in model training costs and the e...,multi_hop_abstract_query_synthesizer
8,What are some of the key insights about LLMs f...,[<1-hop>\n\neasy to follow. The rest of the do...,"In 2023, Large Language Models (LLMs) were rec...",multi_hop_specific_query_synthesizer
9,How do GPT-4 and GPT-4o exemplify advancements...,[<1-hop>\n\nfeed with the model and talk about...,GPT-4 and GPT-4o exemplify advancements in lar...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [83]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Generating personas: 100%|██████████| 2/2 [00:10<00:00,  5.07s/it]                                           
Generating Scenarios: 100%|██████████| 3/3 [00:11<00:00,  3.92s/it]
Generating Samples: 100%|██████████| 12/12 [00:07<00:00,  1.63it/s]


In [84]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Meta do what with LLMs?,[Code may be the best application The ethics o...,"Meta released Llama, which allowed running a u...",single_hop_specifc_query_synthesizer
1,What happen in September with prompt injection?,[Based Development As a computer scientist and...,"In September last year, the term 'prompt injec...",single_hop_specifc_query_synthesizer
2,What we learn about AI in 2023?,[Simon Willison’s Weblog Subscribe Stuff we fi...,2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
3,how use homebrew for run llama 2?,[easy to follow. The rest of the document incl...,You can run Llama 2 on your own Mac using LLM ...,single_hop_specifc_query_synthesizer
4,How have advancements in large language models...,[<1-hop>\n\nCode may be the best application T...,Advancements in large language models have sig...,multi_hop_abstract_query_synthesizer
5,How OpenAI and AI ethics and legality connect ...,[<1-hop>\n\nCode may be the best application T...,OpenAI was one of the first organizations to r...,multi_hop_abstract_query_synthesizer
6,How do the challenges of understanding and con...,[<1-hop>\n\nCode may be the best application T...,The challenges of understanding and controllin...,multi_hop_abstract_query_synthesizer
7,How has OpenAI contributed to the development ...,[<1-hop>\n\nCode may be the best application T...,OpenAI has played a significant role in the de...,multi_hop_abstract_query_synthesizer
8,What were the key advancements in AI and Large...,[<1-hop>\n\neasy to follow. The rest of the do...,"In 2023, significant advancements were made in...",multi_hop_specific_query_synthesizer
9,How has Claude contributed to advancements in ...,[<1-hop>\n\nThose of us who understand this st...,Claude has been a significant contributor to a...,multi_hop_specific_query_synthesizer


Our LangSmith API key and the tracing flag were already set at the start of the notebook, so we can move straight on to LangSmith.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [85]:
from langsmith import Client

client = Client()

dataset_name = "State of AI Across the Years!"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="State of AI Across the Years!"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [86]:
# Map each Ragas testset row onto a LangSmith example (question in, reference answer out).
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
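The field mapping used in the loop above can also be written as a small pure function, which makes it easy to check without calling LangSmith. This is a sketch: `to_langsmith_example` is a hypothetical helper and the sample row contains made-up values.

```python
REQUIRED_COLUMNS = ("user_input", "reference", "reference_contexts")


def to_langsmith_example(row: dict) -> dict:
    """Convert one Ragas testset row into the inputs/outputs/metadata layout used above."""
    missing = [column for column in REQUIRED_COLUMNS if column not in row]
    if missing:
        raise KeyError(f"Row is missing columns: {missing}")
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Made-up row for illustration; real rows come from `dataset.to_pandas()`.
sample_row = {
    "user_input": "What is an example question?",
    "reference": "An example reference answer.",
    "reference_contexts": ["An example context passage."],
    "synthesizer_name": "example_synthesizer",
}
example = to_langsmith_example(sample_row)
print(example)
```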

## Basic RAG Chain

Time for some RAG!


In [87]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [88]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
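To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and sentences); `split_fixed` is an illustrative name.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive.")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size).")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The window already reached the end of the text.
    return chunks


print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1))

# Synthetic text: larger chunks mean fewer (but longer) pieces to embed and retrieve.
synthetic_text = "x" * 2300
print(len(split_fixed(synthetic_text, 500, 50)), len(split_fixed(synthetic_text, 1000, 50)))
```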

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [89]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [90]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI"
)

In [91]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [92]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
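As a plain-Python sketch of how the two placeholders get filled, the snippet below formats the same template text with `str.format`. Joining the retrieved chunk texts with blank lines is an assumption made for readability, and `build_prompt` is an illustrative helper rather than a LangChain function.

```python
RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""


def build_prompt(question: str, chunks: list) -> str:
    """Fill the template with the question and the retrieved chunk texts."""
    context = "\n\n".join(chunks)
    return RAG_PROMPT.format(context=context, question=question)


# Made-up chunk texts standing in for retrieved passages.
prompt_text = build_prompt("What are Agents?", ["chunk one", "chunk two"])
print(prompt_text)
```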

For our LLM, we will be using OpenAI's endpoints as well!

We're going to be using OpenAI's `gpt-4o-mini` as the model that generates our answers!

In [93]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Finally, we can set-up our RAG LCEL chain!

In [94]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
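The data flow through the chain can be mimicked with ordinary functions: a fan-out step builds the `context` and `question` keys, then each following step consumes the previous step's output. The sketch below uses stub functions in place of the retriever, prompt, and LLM, so every `fake_*` name is hypothetical and nothing here calls LangChain or OpenAI.

```python
from operator import itemgetter


def compose(*steps):
    """Chain callables left to right, similar in spirit to the `|` operator above."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def fake_retriever(question):
    return [f"(stub chunk about: {question})"]


def fan_out(inputs):
    # Mirrors the dict step: both keys are derived from the incoming "question".
    question = itemgetter("question")(inputs)
    return {"context": fake_retriever(question), "question": question}


def fake_prompt(inputs):
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}"


def fake_llm(prompt):
    return {"content": f"ANSWER based on -> {prompt}"}


def parse_output(message):
    # Plays the role of the output parser: keep only the text content.
    return message["content"]


toy_chain = compose(fan_out, fake_prompt, fake_llm, parse_output)
result = toy_chain({"question": "What are Agents?"})
print(result)
```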

In [95]:
rag_chain.invoke({"question" : "What are Agents?"})

'Agents refer to AI systems that can act on your behalf, but the term lacks a clear, widely accepted definition. There are different interpretations, including AI that behaves like a travel agent and LLMs that utilize tools to solve problems. Overall, "agents" are depicted as a somewhat vague concept that is still in the process of development and has not yet been fully realized in practical applications.'

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4o as our evaluation LLM for our base Evaluators.

In [96]:
eval_llm = ChatOpenAI(model="gpt-4o")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [97]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm" : eval_llm
    }
)
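The `prepare_data` lambda above only reshapes two objects into the three fields the labeled evaluator expects. The sketch below replays that mapping with `SimpleNamespace` stand-ins for the LangSmith run and example objects. The reference text is a made-up placeholder, and the `"output"` key assumes the chain's plain string result is stored under that name, as the lambda above implies.

```python
from types import SimpleNamespace


def prepare_data(run, example):
    """Same mapping as the lambda above, written as a named function."""
    return {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Stand-in objects; real ones are supplied by LangSmith during evaluation.
toy_run = SimpleNamespace(outputs={"output": "I don't know."})
toy_example = SimpleNamespace(
    inputs={"question": "how use homebrew for run llama 2?"},
    outputs={"answer": "<placeholder reference answer>"},
)
prepared = prepare_data(toy_run, toy_example)
print(prepared)
```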

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`:
- `labeled_helpfulness_evaluator`:
- `dope_or_nope_evaluator`:

<div style="background-color: #E6E6FA; padding: 10px; border-radius: 5px;">
<span style="color: black;">
<b> ANSWER:  </b> 


Each evaluator in the code above uses LangSmith's LangChainStringEvaluator to assess the quality of generated responses against different criteria.

(1) <b> qa_evaluator: </b> This QA (Question-Answer) evaluator checks how the generated response (prediction) aligns with an expected or reference answer. 
- The evaluator argument is set to "qa" suggesting this evaluator will leverage a built-in evaluator meant for general QA tasks. 
- The LLM (eval_llm) assesses the quality of the response.

(2) <b> labeled_helpfulness_evaluator: </b> This evaluator checks how helpful the response is.
- It uses a custom labeled criteria ("labeled_criteria") with a specific focus on helpfulness.
- The prepare_data function structures the input for evaluation: prediction (the model's generated response), reference (the expected answer), and input (the query).
- The LLM (eval_llm) determines how helpful the generated response is, benchmarked against the reference.

(3) <b> dope_or_nope_evaluator: </b> This evaluator is assessing "dopeness" (coolness).
- It uses a custom "criteria" evaluator with the criterion "dopeness", checking if the response is dope / lit / cool.
- The LLM (eval_llm) determines whether the generated response meets this informal, subjective standard.
</span>
</div>

## LangSmith Evaluation

In [98]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'monthly-wound-64' at:
https://smith.langchain.com/o/a880a293-74b1-4b82-a94d-e9a3daf7aa11/datasets/f4fa746b-ddac-42dc-b684-812378b0c61b/compare?selectedSessions=5c26707e-1d01-4a24-a629-ad8c76b8dda2




12it [05:17, 26.48s/it]


Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How do the challenges of evaluating LLMs and t...,The challenges of evaluating LLMs and their gu...,,The challenges of evaluating LLMs stem from th...,1,0,0,5.533696,f3ce003d-5d87-47ae-a43d-ba57a66a8441,476f216f-617e-478a-9887-0730271d2647
1,How have advancements in GPT-4 and GPT-4o infl...,The advancements in GPT-4 and the release of G...,,Advancements in GPT-4 and GPT-4o have signific...,0,0,0,5.740549,0ed56e13-9a66-4278-9b54-8bc06a996358,60aa3775-94a9-405a-81e3-ac2a00f0ff0a
2,How has Claude contributed to advancements in ...,Claude has contributed to advancements in larg...,,Claude has been a significant contributor to a...,0,0,0,6.845584,1ec814da-b35c-4704-a529-ee7d6358d458,c59502bf-cdf0-45c3-9cf5-c05825600c61
3,What were the key advancements in AI and Large...,"In 2023, key advancements in AI and Large Lang...",,"In 2023, significant advancements were made in...",1,0,1,3.648828,b5da1594-d1c3-4cb7-9079-de28360d7f4d,489aaacd-d166-439d-853d-6fc2b0d9ca1c
4,How has OpenAI contributed to the development ...,OpenAI has contributed to the development of l...,,OpenAI has played a significant role in the de...,1,0,0,5.816194,1ba383fe-b824-41d0-8bb4-ae2d225507e4,a78f81be-6603-41d5-8d7e-584621d61c1a
5,How do the challenges of understanding and con...,The challenges of understanding and controllin...,,The challenges of understanding and controllin...,1,0,0,5.239366,eb2ff29c-e9fb-4441-b9fe-fb006fa9aa51,d54107a4-e1a7-440a-8b03-db60875b49a7
6,How OpenAI and AI ethics and legality connect ...,The context discusses the legal and ethical im...,,OpenAI was one of the first organizations to r...,1,1,0,4.152866,12b050a9-0821-4bf7-b281-28fa692a8a08,421469e4-ade9-48d3-8733-f3fa40193be3
7,How have advancements in large language models...,Advancements in large language models (LLMs) h...,,Advancements in large language models have sig...,1,1,0,3.373136,8c687021-339d-4880-a5ba-d45233fbb033,3600c50b-ce7c-4ba3-98a6-1d2cf6255ed9
8,how use homebrew for run llama 2?,I don't know.,,You can run Llama 2 on your own Mac using LLM ...,0,0,0,1.742109,47984f6f-a5a6-4582-a028-8aa08b123843,c216929d-4c7d-4849-bb52-c6bbfa50cbde
9,What we learn about AI in 2023?,"In 2023, it was noted that it was a breakthrou...",,2023 was the breakthrough year for Large Langu...,1,0,0,3.541928,137ed418-b9ca-4933-afdf-7c9792b1bec4,b656e90b-e261-4be7-ad76-8f72d489e06d


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used by the retriever to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [99]:
DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)

In [100]:
rag_documents = docs

In [101]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

<div style="background-color: #E6E6FA; padding: 10px; border-radius: 5px;">
<span style="color: black;">
<b> ANSWER:  </b> 

Modifying the chunk size in RecursiveCharacterTextSplitter can significantly affect the RAG system's performance in the following ways.

1. Retrieval precision depends heavily on chunk size. Smaller chunks enable more precise retrieval but risk losing context because of frequent splitting. Larger chunks give better contextual understanding but carry more text that is unrelated to the question, which adds redundancy and cost.

2. Latency and computation cost are also affected by chunk size. Smaller chunks produce more vectors to index and search, and more of them may need to be retrieved to cover the same content, which slows execution. Larger chunks increase the amount of text passed to the LLM, which increases inference time and token costs.

3. Context retention and redundancy depend on chunk size as well. It should not be so large that retrieved chunks are mostly redundant text, and it should not be so small that the system loses vital context.

Therefore, it is critical to find the right chunk size for each use case to balance retrieval precision, latency, computation cost, and context retention.

</span>
</div>
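To make the trade-off concrete, here is a minimal sketch using a naive fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers separator boundaries such as paragraphs and sentences). The `split_fixed` helper, the 10,000-character document, the baseline `chunk_size` of 500, and the top-k of 4 are all assumptions chosen for illustration.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-width character splitter with overlap (illustration only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Hypothetical 10,000-character document.
document = "x" * 10_000

# Assumed baseline (500) versus the chunk size used in this notebook (1000).
small_chunks = split_fixed(document, chunk_size=500, chunk_overlap=50)
large_chunks = split_fixed(document, chunk_size=1000, chunk_overlap=50)

TOP_K = 4  # assumed number of chunks retrieved per question

for name, chunks, size in [("small", small_chunks, 500), ("large", large_chunks, 1000)]:
    # Upper bound on the characters placed into the prompt context.
    max_context_chars = TOP_K * size
    print(f"{name}: {len(chunks)} chunks to embed, up to {max_context_chars} context chars per question")
```

Larger chunks mean fewer vectors to embed and store, but more text sent to the LLM on every call, which is the tension the answer above describes.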

In [102]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

<div style="background-color: #E6E6FA; padding: 10px; border-radius: 5px;">
<span style="color: black;">
<b> ANSWER:  </b> 

The embedding model of the RAG system has a significant impact on retrieval accuracy, storage requirements, speed, costs, and overall performance.

1. <b> Retrieval Accuracy / Precision:  </b> Different embedding models encode text with varying semantic richness. More advanced models can capture deeper contextual meaning, which improves retrieval of relevant documents / chunks and strengthens the Retrieval component of a RAG pipeline. Less advanced models may miss these nuances and retrieve less relevant results.

2. <b> Storage requirements:  </b> Each embedding model produces vectors of a particular dimension. Higher-dimensional embeddings can improve semantic understanding but also increase vector database storage costs and latency (due to higher computation volume).

3. <b> Speed / Latency:  </b> Larger models require more compute power to generate embeddings and are therefore slower. Smaller models generate embeddings faster, which can benefit real-time applications.

4. <b> Costs:  </b> Each API call is billed per token. Larger models are more expensive than smaller ones but generally produce better quality embeddings.

5. <b> Pipeline performance:  </b> Switching between embedding models is not seamless. Vectors produced by different models are not comparable with each other, so all stored embeddings must be recomputed with the new model.

</span>
</div>
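The storage point can be sketched with some back-of-the-envelope arithmetic. The dimensions below are the default output sizes OpenAI documents for these two models. The comparison against `text-embedding-3-small`, the `raw_vector_bytes` helper, the float32 storage format, and the chunk count of 1,000 are assumptions for illustration, and real vector stores add index and payload overhead on top.

```python
BYTES_PER_FLOAT32 = 4

# Default output dimensions for the two OpenAI embedding models.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def raw_vector_bytes(num_chunks: int, dims: int) -> int:
    """Raw size of the stored vectors, ignoring index and payload overhead."""
    return num_chunks * dims * BYTES_PER_FLOAT32


num_chunks = 1_000  # hypothetical corpus size after splitting

for model, dims in EMBEDDING_DIMS.items():
    mib = raw_vector_bytes(num_chunks, dims) / (1024 ** 2)
    print(f"{model}: {dims} dims, about {mib:.1f} MiB for {num_chunks} chunks")
```

Doubling the dimension doubles the raw vector storage and the work per similarity comparison, which is why the larger model trades cost and latency for retrieval quality.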

In [103]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Across Years (Augmented)"
)

In [104]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [105]:
dope_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dope_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same input that we used before.

In [106]:
dope_rag_chain.invoke({"question" : "what are Agents?"})

'Agents, man, are these elusive AI systems that are supposed to act on your behalf, like a digital travel agent or a super-efficient assistant. But here\'s the kicker: the term "agents" is all over the place—no one seems to agree on what it really means. You’ve got folks thinking about them as tools with autonomy, running loops to solve problems, but then there’s the whole issue of gullibility—LLMs easily believe anything! So while there\'s a lot of talk and hype around agents, the reality is that they’re still just “coming soon” in a universe of prototypes with no solid production examples. They\'re a bit of a mirage in the AI desert, bro.'

Finally, we can evaluate the new chain on the same test set!

In [107]:
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)

View the evaluation results for experiment: 'timely-stomach-49' at:
https://smith.langchain.com/o/a880a293-74b1-4b82-a94d-e9a3daf7aa11/datasets/f4fa746b-ddac-42dc-b684-812378b0c61b/compare?selectedSessions=a489d665-fca9-4d75-896c-e588be121bf1




12it [04:21, 21.80s/it]


Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How do the challenges of evaluating LLMs and t...,"Yo, the struggles with evaluating LLMs and the...",,The challenges of evaluating LLMs stem from th...,1,0,1,4.648778,f3ce003d-5d87-47ae-a43d-ba57a66a8441,e7014d7a-9b93-43e4-abcb-e84735dade9f
1,How have advancements in GPT-4 and GPT-4o infl...,"Yo, check this out! The advancements in GPT-4 ...",,Advancements in GPT-4 and GPT-4o have signific...,1,0,1,4.276076,0ed56e13-9a66-4278-9b54-8bc06a996358,63c6e462-9444-4d9d-b638-de552eb2143a
2,How has Claude contributed to advancements in ...,I don't know.,,Claude has been a significant contributor to a...,0,0,0,1.358806,1ec814da-b35c-4704-a529-ee7d6358d458,438efdd3-1b14-43b1-a9af-649b27bb135b
3,What were the key advancements in AI and Large...,"In 2023, the landscape of Large Language Model...",,"In 2023, significant advancements were made in...",1,0,1,5.85419,b5da1594-d1c3-4cb7-9079-de28360d7f4d,aaedbe54-4ac6-47b9-b904-7e455fce97bf
4,How has OpenAI contributed to the development ...,OpenAI has made some serious waves in the worl...,,OpenAI has played a significant role in the de...,1,0,1,5.084471,1ba383fe-b824-41d0-8bb4-ae2d225507e4,fab6b8b3-53aa-4809-947b-a05e5bc1901d
5,How do the challenges of understanding and con...,"Yo, when it comes to LLMs being these mysterio...",,The challenges of understanding and controllin...,1,0,1,5.911435,eb2ff29c-e9fb-4441-b9fe-fb006fa9aa51,3e82dd5b-0c7a-404f-b4ac-2693c297fa7c
6,How OpenAI and AI ethics and legality connect ...,"Yo, it's a wild ride when you talk about OpenA...",,OpenAI was one of the first organizations to r...,1,0,1,4.311288,12b050a9-0821-4bf7-b281-28fa692a8a08,c7440775-5840-4b47-899a-8bf058258066
7,How have advancements in large language models...,"Yo, the advancements in large language models ...",,Advancements in large language models have sig...,1,0,1,5.910872,8c687021-339d-4880-a5ba-d45233fbb033,a37ce655-7fa1-4ff2-95a8-8c4b74c28ff8
8,how use homebrew for run llama 2?,I don't know.,,You can run Llama 2 on your own Mac using LLM ...,0,0,0,1.784549,47984f6f-a5a6-4582-a028-8aa08b123843,eae096b2-451e-429a-81f2-64ade8ced1b0
9,What we learn about AI in 2023?,"Yo, 2023 was a wild ride in the world of AI! H...",,2023 was the breakthrough year for Large Langu...,1,1,1,6.348835,137ed418-b9ca-4933-afdf-7c9792b1bec4,8bd17aeb-fa38-4227-86bc-77cbac9002b5


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

<div style="background-color: #E6E6FA; padding: 10px; border-radius: 5px;">
<span style="color: black;">
<b> ANSWER:  </b> 

The two RAG chains, rag_chain and dope_rag_chain, differ in three ways: dope_rag_chain uses a prompt that additionally instructs the model to be dope, larger chunks (chunk_size = 1000), and the text-embedding-3-large embedding model.



We will compare the performance of the two RAG chains created above (monthly-wound-64 using rag_chain with revision_id = default_chain_init, and timely-stomach-49 using dope_rag_chain with revision_id = dope_chain) across three key categories:

(1) <b> Evaluation scores (to assess quality of responses): </b>
- (a) Correctness: how accurate the model's response is compared to a reference answer
- (b) Dopeness:  how "cool," "engaging," or "creative" a response is (likely a subjective measure)
- (c) Helpfulness: how useful and relevant the response is to the user's question

(2) <b> Performance metrics (to assess speed and reliability of RAGs): </b> 
- P50 Latency: median response time (50th percentile)
- P99 Latency: 99th percentile response time; useful for gauging near-worst-case behavior
- Error Rate: % of requests that fail or produce an error; lower value means more reliability

(3) <b> Cost metrics (to track resource usage and cost associated with LLM usage / API calls):  </b> 
- Total Cost: total amount spent on API calls
- Prompt Tokens: # of tokens used in prompt sent to the LLM
- Completion Tokens: # of tokens generated by LLM as output
- Total Tokens: sum of prompt tokens and completion tokens; reflects total API usage and drives overall cost

</span>
</div>
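As a rough cross-check on these metrics, the feedback columns and `execution_time` values from the ten rows displayed in the two result tables above can be aggregated by hand. This is only a sketch: the dope-chain run reports 12 examples while only ten rows are shown, so the LangSmith dashboard figures may differ, and `execution_time` is assumed here to stand in for the latency that LangSmith reports. The `summarize` helper and its key names are illustrative.

```python
from statistics import mean, median

# (correctness, helpfulness, dopeness, execution_time) copied from the rows shown above.
baseline_rows = [
    (1, 0, 0, 5.533696), (0, 0, 0, 5.740549), (0, 0, 0, 6.845584),
    (1, 0, 1, 3.648828), (1, 0, 0, 5.816194), (1, 0, 0, 5.239366),
    (1, 1, 0, 4.152866), (1, 1, 0, 3.373136), (0, 0, 0, 1.742109),
    (1, 0, 0, 3.541928),
]
dope_rows = [
    (1, 0, 1, 4.648778), (1, 0, 1, 4.276076), (0, 0, 0, 1.358806),
    (1, 0, 1, 5.85419), (1, 0, 1, 5.084471), (1, 0, 1, 5.911435),
    (1, 0, 1, 4.311288), (1, 0, 1, 5.910872), (0, 0, 0, 1.784549),
    (1, 1, 1, 6.348835),
]


def summarize(rows):
    """Average each feedback score and take the median execution time."""
    correctness, helpfulness, dopeness, seconds = zip(*rows)
    return {
        "correctness": mean(correctness),
        "helpfulness": mean(helpfulness),
        "dopeness": mean(dopeness),
        "p50_latency_s": median(seconds),
    }


baseline_summary = summarize(baseline_rows)
dope_summary = summarize(dope_rows)

for metric in baseline_summary:
    print(f"{metric}: baseline={baseline_summary[metric]:.2f}, dope={dope_summary[metric]:.2f}")
```

On the displayed rows, dopeness shows by far the largest change between the two chains, while correctness and helpfulness each differ by a single example.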

![1a](images/correctness.png)

![1b](images/dopeness.png)

![1c](images/helpfulness.png)

![2a](images/P50_latency.png)

![2b](images/P99_latency.png)

![2c](images/error_rate.png)

![3b](images/prompt_tokens.png)

![3c](images/completion_tokens.png)

![3d](images/total_tokens.png)