# Evaluation of RAG Using Ragas

In the following notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Creating a Simple RAG Pipeline with LangChain v0.2.0
  4. Task 4: Synthetic Dataset Generation for Evaluation using Ragas (Optional)

- 🤝 Breakout Room #2
  1. Task 1: Evaluating our Pipeline with Ragas
  2. Task 2: Testing OpenAI's Claim
  3. Task 3: Selecting an Advanced Retriever and Evaluating

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: Using this notebook as presented will incur a charge of ~$3USD from OpenAI usage.

> NOTE: This Notebook *does* contain a bonus challenge, outlined at the bottom of the notebook, which you can complete instead of the notebook for full marks on the assignment.

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://python.langchain.com/v0.2/docs/versions/v0_2/) of LangChain v0.2.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai langchain-qdrant

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-experimental 0.0.65 requires langchain-community<0.3.0,>=0.2.16, but you have langchain-community 0.3.0 which is incompatible.
langchain-experimental 0.0.65 requires langchain-core<0.3.0,>=0.2.38, but you have langchain-core 0.3.5 which is incompatible.
langgraph 0.2.16 requires langchain-core<0.3,>=0.2.27, but you have langchain-core 0.3.5 which is incompatible.
ragas 0.1.20 requires langchain-core<0.3, but you have langchain-core 0.3.5 which is incompatible.
langgraph-checkpoint 1.0.6 requires langchain-core<0.3,>=0.2.22, but you have langchain-core 0.3.5 which is incompatible.

We'll also get the "star of the show" today, which is Ragas!

In [2]:
!pip install -qU ragas

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-huggingface 0.1.0 requires langchain-core<0.4,>=0.3.0, but you have langchain-core 0.2.41 which is incompatible.

We'll be leveraging [Qdrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [3]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [4]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key
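
If the key is already exported in your shell, you may not want to be prompted on every run. The following is a minimal sketch of that idea, not part of the original notebook: `resolve_api_key` is a hypothetical helper name, and the `env` and `prompt` parameters exist only so the logic can be exercised without a real key.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the key from `env` if present; otherwise prompt once and store it."""
    key = env.get(name)
    if not key:
        # Only ask the user when nothing is set yet.
        key = prompt(f"Please provide your {name}: ")
        env[name] = key
    return key
```

In a notebook cell you would simply call `resolve_api_key()`; the defaults read from and write to the real environment.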

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.2.0

Building on what we've been learning, we'll be leveraging LangChain v0.2.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use a LLM to generate responses based on the retrieved context

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.
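
To make the three steps concrete before we bring in real components, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the "embedding" is just a bag-of-words count, the index is a list, and `llm` is any callable you pass in. It only mirrors the shape of the pipeline we build below.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    # Step 1: store each chunk alongside its vector.
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    # Step 2: rank chunks by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def generate(query, context, llm):
    # Step 3: hand the question and the retrieved context to a language model.
    prompt = f"Question: {query}\nContext: {' '.join(context)}"
    return llm(prompt)


chunks = [
    "AI systems can produce biased outcomes",
    "Qdrant stores vectors for similarity search",
    "Bananas are rich in potassium",
]
index = build_index(chunks)
top = retrieve(index, "biased AI outcomes", k=1)
```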

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

- [`PyMuPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html)

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.2.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [5]:
from langchain_community.document_loaders import PyMuPDFLoader

# Load both PDFs page by page into a single list of documents.
doc1 = "Blueprint-for-an-AI-Bill-of-Rights.pdf"
loader = PyMuPDFLoader(doc1)
documents = loader.load()

doc2 = "NIST.AI.600-1.pdf"
loader = PyMuPDFLoader(doc2)
documents.extend(loader.load())

In [6]:
documents[0].metadata

{'source': 'Blueprint-for-an-AI-Bill-of-Rights.pdf',
 'file_path': 'Blueprint-for-an-AI-Bill-of-Rights.pdf',
 'page': 0,
 'total_pages': 73,
 'format': 'PDF 1.6',
 'title': 'Blueprint for an AI Bill of Rights',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'Adobe Illustrator 26.3 (Macintosh)',
 'producer': 'iLovePDF',
 'creationDate': "D:20220920133035-04'00'",
 'modDate': "D:20221003104118-04'00'",
 'trapped': ''}

#### Transforming Data

Now that we've loaded our two documents - let's split them into smaller pieces so we can more effectively leverage them with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

- [`RecursiveCharacterTextSplitter`](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#langchain-text-splitters-character-recursivecharactertextsplitter)

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZE = 300
CHUNK_OVERLAP = 50

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CHUNK_SIZE,
    chunk_overlap = CHUNK_OVERLAP
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our documents.

In [8]:
len(documents)

1639
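
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified sketch using fixed-width character windows. This is an assumption-laden stand-in: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, so its chunk boundaries will differ. `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-width windows; each one starts `chunk_size - chunk_overlap` characters after the last."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
# Consecutive chunks share their last/first 3 characters.
```

A larger overlap keeps more shared text between neighbouring chunks, at the cost of producing more chunks to embed and store.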

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

- [`OpenAIEmbeddings`](https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain-openai-embeddings-base-openaiembeddings)

> NOTE: We are purposefully using an older embedding model to try and answer the guiding question: Is TE3 better than Ada-002?

In [9]:
from langchain_openai import OpenAIEmbeddings

EMBEDDING_MODEL = "text-embedding-ada-002"

embeddings = OpenAIEmbeddings(
    model = EMBEDDING_MODEL
)

#### Creating a Qdrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

- [`Qdrant`](https://api.python.langchain.com/en/latest/qdrant/langchain_qdrant.qdrant.QdrantVectorStore.html#langchain_qdrant.qdrant.QdrantVectorStore)

> NOTE: You'll need to provide the embedding dimension for Ada-002!

In [10]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

LOCATION = ":memory:"
COLLECTION_NAME = "PMarca Blogs"
VECTOR_SIZE = 1536
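
The vector size matters because similarity is computed coordinate by coordinate, so stored vectors and query vectors must have the same length. Below is a small pure-Python sketch of cosine similarity, the measure we request with `Distance.COSINE`. It illustrates the math only and is not how Qdrant implements it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # vectors point the same way
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # vectors share nothing
```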

In [11]:
qdrant_client = QdrantClient(LOCATION)

qdrant_client.create_collection(
    collection_name = COLLECTION_NAME,
    vectors_config = VectorParams(size = VECTOR_SIZE, distance = Distance.COSINE)
)

qdrant_vector_store = QdrantVectorStore(
    client = qdrant_client,
    collection_name = COLLECTION_NAME,
    embedding = embeddings
)

qdrant_vector_store.add_documents(documents)

['d752f65bc1e84545b8b6e512dbb89dad',
 'eb10d464c3c34e43a04106e8ff6511bb',
 '8cd7e1d886c34cc99ddc0aa3a6743707',
 'c9e0aeeaa8974cf5802531357b1ee059',
 '44226808715d48b59fa1c709934df21a',
 '9bc8baf9f3b24cb2b05e6b101ef6780c',
 '7bed9a10986748fcb4055891ad750fce',
 '4f5509b20d7b4c6991e9e7d5efd0ae87',
 '0550df77f71543d89153246881650512',
 '5fcc5ecbfa0d47d5b83a3b3bc32ad696',
 '9110bc29f16a48049d1a2ef47c368fea',
 '76ce08dbfa4d425aaf9c6abdda9a853c',
 '012828de985b45e29803736cefc0ac25',
 '768fb729edf6419baa40b181dd62b1f8',
 '3cfd3a858bf74b64a97a064031f92b89',
 'be34aaa5aea84c22a2653be9d361781e',
 'a1b3939bbb584475b08c1326a3249c3b',
 'adca66fce6b24bcc8ca06536d5e9903c',
 'e007dae7e7064a908fc83b6eb89d75fe',
 '6f5bc2bd84254bb79cafb5659b21f7bc',
 '03aaab0a7e434815ab0f2cbbc41afd08',
 '67fbbf4c95244ca0b0061a34157b0237',
 '91e729e1e709435ca36430b4153f7d57',
 '47d9b86442214790ae6344b8eb803ebe',
 '785a7132f8b848b8b26ddc56bbf41a7d',
 '9d743045c5a94390a7df9fa833cd9869',
 '43fddf6fdacd4fed8a2e4abc1406b252',
 

#### ❓ Question #1:

List out a few of the techniques that Qdrant uses that make it performant.

> NOTE: Check the [documentation](https://qdrant.tech/documentation/overview/) for more information about Qdrant!

* By using specialized data structures and indexing techniques such as Hierarchical Navigable Small World (HNSW)
* By using Product Quantization
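
As a rough illustration of the second point, product quantization compresses a vector by splitting it into sub-vectors and storing, for each one, only the id of its nearest centroid. The sketch below is a toy version under stated assumptions: the codebooks are written by hand (real ones are learned from data), and none of this reflects Qdrant's actual implementation.

```python
def nearest(centroids, sub):
    """Index of the centroid closest (squared Euclidean distance) to the sub-vector."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((c - s) ** 2 for c, s in zip(centroids[i], sub)),
    )


def pq_encode(vector, codebooks):
    """Split `vector` into len(codebooks) equal parts and keep one centroid id per part."""
    width = len(vector) // len(codebooks)
    return [
        nearest(book, vector[i * width:(i + 1) * width])
        for i, book in enumerate(codebooks)
    ]


def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    return [x for code, book in zip(codes, codebooks) for x in book[code]]


# Hypothetical hand-written codebooks, one per half of a 4-dimensional vector.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.1, 0.8], codebooks)
approx = pq_decode(codes, codebooks)
```

Four floats are reduced to two small integers here, which is the memory saving that makes quantization attractive for large collections.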

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [12]:
retriever = qdrant_vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [13]:
retrieved_documents = retriever.invoke("What is a rule of thumb for selecting an industry to invest in?")

In [14]:
for doc in retrieved_documents:
  print(doc)

page_content='Property. We also note that some risks are cross-cutting between these categories.' metadata={'source': 'NIST.AI.600-1.pdf', 'file_path': 'NIST.AI.600-1.pdf', 'page': 6, 'total_pages': 64, 'format': 'PDF 1.6', 'title': 'Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile', 'author': 'National Institute of Standards and Technology', 'subject': '', 'keywords': '', 'creator': 'Acrobat PDFMaker 24 for Word', 'producer': 'Adobe PDF Library 24.2.159', 'creationDate': "D:20240805141702-04'00'", 'modDate': "D:20240805143048-04'00'", 'trapped': '', '_id': '60f5fe728a7745d08176ab4f14efdf32', '_collection_name': 'PMarca Blogs'}
page_content='sectors, while data is helping to revolutionize global industries. Fueled by the power of American innovation, 
these tools hold the potential to redefine every part of our society and make life better for everyone.' metadata={'source': 'Blueprint-for-an-AI-Bill-of-Rights.pdf', 'file_path': 'Blueprint-fo

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [15]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")



In [16]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [17]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the following context. If you can't find the answer within the context, respond with 'I don't know'.

Question:
{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_template(template)
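
Conceptually, the template has two placeholders that get filled in at invocation time. The sketch below shows that substitution with plain `str.format`; it is a stand-in for intuition only, since `ChatPromptTemplate` also wraps the result in chat messages. `render` is a hypothetical helper.

```python
template = """
Answer the question based on the following context. If you can't find the answer within the context, respond with 'I don't know'.

Question:
{question}

Context:
{context}
"""


def render(template, question, docs):
    """Fill both placeholders; `docs` is a list of context strings."""
    return template.format(question=question, context="\n\n".join(docs))


rendered = render(
    template,
    "What is HNSW?",
    ["HNSW is a graph index.", "It supports fast search."],
)
```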

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through to the output - which is critical for Ragas.

In [18]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : re-assigned to its own value from the previous step, so this step leaves
    #              both "context" and "question" unchanged before they reach the prompt
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object, which is
    #              piped into the LLM; the model's message is stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
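
If the dictionary-piping syntax is hard to follow, it can help to trace the same data flow with ordinary functions. The sketch below is a plain-Python approximation of the three stages, not LCEL itself; `run_chain` and the `fake_*` callables are hypothetical stand-ins for the real retriever, prompt, and model.

```python
def run_chain(inputs, retriever, prompt, llm):
    # Stage 1: build {"context", "question"} from the incoming question.
    state = {
        "context": retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Stage 2: the assign step sets "context" to its own value, so nothing changes.
    state = {**state, "context": state["context"]}
    # Stage 3: format the prompt, call the model, and pass "context" through.
    return {
        "response": llm(prompt(state)),
        "context": state["context"],
    }


# Hypothetical stand-ins so the flow can be run without any API calls.
fake_retriever = lambda q: ["chunk about " + q]
fake_prompt = lambda s: f"Q: {s['question']} | C: {s['context']}"
fake_llm = lambda text: text.upper()

result = run_chain({"question": "ai risks"}, fake_retriever, fake_prompt, fake_llm)
```

The important property is that the output carries both the model's response and the retrieved context, since the evaluation later needs both.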

Let's test it out!

In [19]:
question = "What are some risks in AI systems?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Some risks in AI systems include:

1. Technical / Model risks, which involve the potential for malfunction.
2. Biased data and discriminatory outcomes, particularly in high-stakes settings.
3. Opaque decision-making processes that can undermine trust.
4. Abuse, misuse, and unsafe repurposing by humans.
5. Risks of confabulation, bias, or homogenization due to automation bias.
6. Emotional entanglement between humans and AI systems, which could lead to negative psychological impacts.


In [20]:
question = "What considerations are there in AI safety?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

Considerations in AI safety include:

1. **Technical / Model Risks**: Risks arising from malfunctions of the AI system.
2. **Organizational Policies**: Implementing policies and practices that promote a safety-first mindset in the design, development, deployment, and use of AI systems to minimize potential negative impacts.
3. **Regular Evaluation**: The AI system should be regularly evaluated for safety risks, ensuring that it is demonstrated to be safe and that its residual negative risk does not exceed the risk tolerance.
4. **Fail-Safe Mechanisms**: The AI system should be capable of failing safely, particularly in scenarios where it may encounter issues.

These considerations aim to manage and mitigate risks associated with AI technologies.
[Document(metadata={'source': 'NIST.AI.600-1.pdf', 'file_path': 'NIST.AI.600-1.pdf', 'page': 6, 'total_pages': 64, 'format': 'PDF 1.6', 'title': 'Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile', 'a

We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Task 4: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question-context (QC) pairs - as well as a synthetic ground truth - quite easily!

In [21]:
# loader = PyMuPDFLoader(
#     "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
# )
# 
# eval_documents = loader.load()
doc1 = "Blueprint-for-an-AI-Bill-of-Rights.pdf"
loader = PyMuPDFLoader(
    doc1
)
eval_documents = loader.load()

doc2 = "NIST.AI.600-1.pdf"
loader = PyMuPDFLoader(
    doc2
)
eval_documents.extend(loader.load())

text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50
)

eval_documents = text_splitter_eval.split_documents(eval_documents)

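#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

We are generating the test set from the same documents that are in our index, so we split them with different parameters. If the synthetic questions were built from exactly the same chunks that the retriever stores, the evaluation would be biased in the pipeline's favour (a form of data leakage).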

In [22]:
len(eval_documents)

761

> NOTE: 🛑 Running this cell as presented will incur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step. **YOU CAN SKIP THIS STEP BY LOADING THE `.csv` DIRECTLY FROM OUR REPOSITORY.** 🛑

#### Optional: SDG for Evaluation

In [24]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

num_qa_pairs = 20 # You can reduce the number of QA pairs to 5 if you're experiencing rate-limiting issues

testset = generator.generate_with_langchain_docs(eval_documents, num_qa_pairs, distributions)
testset.to_pandas()

embedding nodes:   0%|          | 0/1522 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/20 [00:00<?, ?it/s]



| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What are some strategies that aim to mitigate ... | [assessments, auditing mechanisms, assessment ... | Assessments, auditing mechanisms, assessment o... | simple | [{'source': 'Blueprint-for-an-AI-Bill-of-Right... | True |
| 1 | How are risks associated with transparency and... | [34 \nMS-2.7-009 Regularly assess and verify t... | Risks associated with transparency and account... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 2 | How can risks from confabulations impact real-... | [contextual and/or domain expertise. \nRisks ... | Risks from confabulations may impact real-worl... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 3 | What expectations should be set for automated ... | [SAFE AND EFFECTIVE \nSYSTEMS \nWHAT SHOULD BE... | The expectations for automated systems should ... | simple | [{'source': 'Blueprint-for-an-AI-Bill-of-Right... | True |
| 4 | What are the potential risks associated with d... | [tracked, e.g., via a specialized type in a da... | The potential risks associated with data reuse... | simple | [{'source': 'Blueprint-for-an-AI-Bill-of-Right... | True |
| 5 | How should machine learning models be monitore... | [based on changing real-world conditions or de... | This ongoing monitoring should include continu... | simple | [{'source': 'Blueprint-for-an-AI-Bill-of-Right... | True |
| 6 | What is the purpose of red-teaming in identify... | [environment and in collaboration with AI deve... | The purpose of red-teaming in identifying pote... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 7 | How can the AI model be explained, validated, ... | [35 \nMEASURE 2.9: The AI model is explained, ... | The AI model can be explained, validated, and ... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 8 | What risks do GAI systems pose to data privacy? | [2.4. Data Privacy \nGAI systems raise several... | GAI systems pose risks to data privacy by requ... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 9 | How do school audio surveillance systems monit... | [teenage girl was pregnant, and sent maternity... | School audio surveillance systems monitor stud... | simple | [{'source': 'Blueprint-for-an-AI-Bill-of-Right... | True |
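
The `distributions` dictionary describes what share of the generated questions should come from each evolution type. As a worked example of the arithmetic, the sketch below converts weights into whole-number counts. `allocate` is a hypothetical helper written for illustration; Ragas' own rounding may differ.

```python
def allocate(total, weights):
    """Turn fractional weights into whole-number counts that add up to `total`."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Round each share down (the tiny epsilon guards against float noise).
    counts = {name: int(total * w + 1e-9) for name, w in weights.items()}
    # Give whatever is left over to the category with the largest weight.
    largest = max(weights, key=weights.get)
    counts[largest] += total - sum(counts.values())
    return counts


counts = allocate(20, {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1})
```

With 20 questions this works out to 10 simple, 8 multi-context, and 2 reasoning questions.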


Let's look at the output and see what we can learn about it!

In [25]:
testset.test_data[0]

DataRow(question="What are some strategies that aim to mitigate risks posed by the use of AI to companies' reputation, legal responsibilities, and product safety concerns, including documentation procedures specific to model assessments?", contexts=['assessments, auditing mechanisms, assessment of organizational procedures, dashboards to allow for ongoing \nmonitoring, documentation procedures specific to model assessments, and many other strategies that aim to \nmitigate risks posed by the use of AI to companies’ reputation, legal responsibilities, and other product safety \nand effectiveness concerns. \nThe Office of Management and Budget (OMB) has called for an expansion of opportunities \nfor meaningful stakeholder engagement in the design of programs and services. OMB also'], ground_truth="Assessments, auditing mechanisms, assessment of organizational procedures, dashboards to allow for ongoing monitoring, documentation procedures specific to model assessments, and many other stra

In [26]:
testset_df = testset.to_pandas()
testset_df.to_csv("testset.csv")

### Generating Responses with RAG Pipeline

Now that we have some question-context (QC) pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from our created test set.

We can start by loading our saved test set from `testset.csv` into a Pandas DataFrame.

In [28]:
import pandas as pd

test_df = pd.read_csv("testset.csv")

In [29]:
test_df

| | Unnamed: 0 | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|---|
| 0 | 0 | What are some strategies that aim to mitigate ... | ['assessments, auditing mechanisms, assessment... | Assessments, auditing mechanisms, assessment o... | simple | [{'source': 'Blueprint-for-an-AI-Bill-of-Right... | True |
| 1 | 1 | How are risks associated with transparency and... | ['34 \nMS-2.7-009 Regularly assess and verify ... | Risks associated with transparency and account... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 2 | 2 | How can risks from confabulations impact real-... | ['contextual and/or domain expertise. \nRisks... | Risks from confabulations may impact real-worl... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 3 | 3 | What expectations should be set for automated ... | ['SAFE AND EFFECTIVE \nSYSTEMS \nWHAT SHOULD B... | The expectations for automated systems should ... | simple | [{'source': 'Blueprint-for-an-AI-Bill-of-Right... | True |
| 4 | 4 | What are the potential risks associated with d... | ['tracked, e.g., via a specialized type in a d... | The potential risks associated with data reuse... | simple | [{'source': 'Blueprint-for-an-AI-Bill-of-Right... | True |
| 5 | 5 | How should machine learning models be monitore... | ['based on changing real-world conditions or d... | This ongoing monitoring should include continu... | simple | [{'source': 'Blueprint-for-an-AI-Bill-of-Right... | True |
| 6 | 6 | What is the purpose of red-teaming in identify... | ['environment and in collaboration with AI dev... | The purpose of red-teaming in identifying pote... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 7 | 7 | How can the AI model be explained, validated, ... | ['35 \nMEASURE 2.9: The AI model is explained,... | The AI model can be explained, validated, and ... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 8 | 8 | What risks do GAI systems pose to data privacy? | ['2.4. Data Privacy \nGAI systems raise severa... | GAI systems pose risks to data privacy by requ... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 9 | 9 | How do school audio surveillance systems monit... | ['teenage girl was pregnant, and sent maternit... | School audio surveillance systems monitor stud... | simple | [{'source': 'Blueprint-for-an-AI-Bill-of-Right... | True |


In [30]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()
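
One thing to watch after a CSV round trip: list-valued columns such as `contexts` come back as strings that merely look like lists. We only use `question` and `ground_truth` here, so it does not affect this notebook, but if you need the original lists, something like the sketch below can recover them. `parse_list_cell` is a hypothetical helper built on the standard library's `ast.literal_eval`.

```python
import ast


def parse_list_cell(cell):
    """Turn a stringified Python list (as stored in a CSV cell) back into a list."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError("expected a list literal")
    return value


cell = "['first context chunk', 'second context chunk']"
parsed = parse_list_cell(cell)
```

With pandas this could be applied column-wise via `test_df["contexts"].apply(parse_list_cell)`. Relatedly, the extra `Unnamed: 0` column in `test_df` comes from the index written by `to_csv`; passing `index=False` when saving avoids it.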

Now we'll run each generated question through our RAG pipeline to produce a response - we'll also need to collect the retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [31]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [32]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's take a peek and see what that looks like!

In [33]:
response_dataset[0]

{'question': "What are some strategies that aim to mitigate risks posed by the use of AI to companies' reputation, legal responsibilities, and product safety concerns, including documentation procedures specific to model assessments?",
 'answer': "Some strategies to mitigate risks posed by the use of AI to companies' reputation, legal responsibilities, and product safety concerns include:\n\n1. **Mapping AI Technology and Legal Risks**: Implementing approaches to map the technology and legal risks associated with AI components, including the use of third-party data or software. This involves documenting risks related to infringement of third-party intellectual property or other rights.\n\n2. **Privacy Risk Examination**: Conducting examinations and documentation of privacy risks associated with the AI system, as identified in the risk management framework.\n\n3. **Documentation Procedures for Model Assessments**: Establishing thorough documentation procedures specific to model assessme

# 🤝 Breakout Room Part #2

## Task 1: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
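
To give a feel for what a few of these numbers mean, here is a simplified sketch of the final arithmetic behind three of the metrics, based on the definitions in the Ragas documentation. In Ragas the per-claim and per-chunk verdicts come from an LLM judge; here they are assumed to be given as lists of 0/1 values, and the function names are hypothetical.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims judged to be supported by the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(gt_sentence_attributed):
    """Fraction of ground-truth sentences that can be attributed to the retrieved context."""
    if not gt_sentence_attributed:
        return 0.0
    return sum(gt_sentence_attributed) / len(gt_sentence_attributed)


def context_precision_score(relevant_at_rank):
    """Mean of precision@k, taken over the ranks k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(relevant_at_rank, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0
```

For example, if four chunks are retrieved and only the first is judged irrelevant, `context_precision_score([0, 1, 1, 1])` gives roughly 0.639, which shows how the metric penalises relevant chunks that are ranked lower.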

In [34]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [35]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/95 [00:00<?, ?it/s]

In [36]:
results

{'faithfulness': 0.7625, 'answer_relevancy': 0.8706, 'context_recall': 0.7193, 'context_precision': 0.7632, 'answer_correctness': 0.6913}

In [37]:
results_df = results.to_pandas()
results_df

| | question | contexts | answer | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What are some strategies that aim to mitigate ... | [and management. One possible way to further c... | Some strategies to mitigate risks posed by the... | Assessments, auditing mechanisms, assessment o... | 0.636364 | 0.922181 | 1.0 | 0.638889 | 0.687208 |
| 1 | How are risks associated with transparency and... | [Action ID \nSuggested Action \nGAI Risks \nGV... | Risks associated with transparency and account... | Risks associated with transparency and account... | 0.769231 | 0.970279 | 0.5 | 0.333333 | 0.942304 |
| 2 | How can risks from confabulations impact real-... | [many real-world applications, such as in heal... | Risks from confabulations can significantly im... | Risks from confabulations may impact real-worl... | 1.0 | 0.951429 | 1.0 | 1.0 | 0.993654 |
| 3 | What expectations should be set for automated ... | [SAFE AND EFFECTIVE \nSYSTEMS \nWHAT SHOULD BE... | The expectations for automated systems should ... | The expectations for automated systems should ... | 1.0 | 0.923088 | 0.5 | 1.0 | 0.538034 |
| 4 | What are the potential risks associated with d... | [Data reuse limits in sensitive domains. Data ... | The potential risks associated with data reuse... | The potential risks associated with data reuse... | 1.0 | 0.983402 | 0.666667 | 1.0 | 0.841873 |
| 5 | How should machine learning models be monitore... | [based on changing real-world conditions or de... | Machine learning models should be monitored an... | This ongoing monitoring should include continu... | 1.0 | 0.964525 | 1.0 | 1.0 | 0.706525 |
| 6 | What is the purpose of red-teaming in identify... | [behavior or outcomes of a GAI model or system... | The purpose of red-teaming in identifying pote... | The purpose of red-teaming in identifying pote... | 1.0 | 1.0 | 1.0 | 1.0 | 0.829018 |
| 7 | How can the AI model be explained, validated, ... | [35 \nMEASURE 2.9: The AI model is explained, ... | The AI model can be explained, validated, and ... | The AI model can be explained, validated, and ... | 1.0 | 0.996122 | 0.0 | 0.916667 | 0.661169 |
| 8 | What risks do GAI systems pose to data privacy? | [Another cybersecurity risk to GAI is data poi... | I don't know. | GAI systems pose risks to data privacy by requ... | 0.0 | 0.0 | 0.333333 | 0.805556 | 0.183138 |
| 9 | How do school audio surveillance systems monit... | [Conversation. Jan. 21, 2022.\nhttps://theconv... | I don't know. | School audio surveillance systems monitor stud... | 0.0 | 0.0 | 0.666667 | 0.5 | 0.184911 |


## Task 2: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #1:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

In [38]:
# Uses the TE3 embedding model (instead of ada) 
te3_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [39]:
# Creates a new collection for use with the te3 embedding model
qdrant_client.create_collection(
    collection_name=COLLECTION_NAME+"TE3",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Creates the vector store/index
qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME+"TE3",
    embedding=te3_embeddings,
)

# Adds documents to the store/index
qdrant_vector_store.add_documents(documents)

['6933c78cb0ed486cba7662ce2a419743',
 'cbcb01a317864a68b672703f089e543e',
 'a8723403f7404878b30a0f7e9cd4019f',
 '2946ce2af26442d783be725dfac0882e',
 '65cc905819a5474287772b7859dcdc1b',
 'b0607c3aa66d4f8b8fcaa919e0d12658',
 'e4e310f068654628b069e29af1243a0c',
 '9e4485a0cd45448cb72404b7800c9ab5',
 '05a1ea9f615944e5bc4afa4fca50b2bc',
 '2ec320b22a704f4b89099e0564598a17',
 'f83eccb6f2eb48209b89f12dfd6e2bef',
 '5f303d335cbc447f98a0c5e4976d6a5a',
 'e5a4db66ef844ca18b0f669e09295251',
 '7ec4d0a489314d1e862a9feab47d00bc',
 '6773b3876f874f5db16f09f87b736e8e',
 'f9b427cf26e74f5ab340a33268ac0661',
 '4b59bdf9b8a24aa2bb4b3a48b984ddad',
 'e1079fff255d4d3bac1600d79c9a6852',
 '601df9ee75a64b71aa03fb60d0eebb4f',
 '541dc76fee3c4e378d2af6702626f084',
 '9d502bead06545c18a930528ea25cb99',
 '80b467a2c0a147a58ef85259e8b55129',
 '4cdb296ae0b641498da8ffeeda840307',
 'ad46ae04e6484ce0ae5d0a9d67f23795',
 'b9e78c6f76a448c68a3bfaeb8fec405a',
 '1dbb0d4725144086879b619445cb2fe6',
 '8c31b0623bf448febacbeedd9c653692',
 

In [40]:
# Adapts the vector store/index as a retriever/callable
te3_retriever = qdrant_vector_store.as_retriever()
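
The `size=1536` passed to `VectorParams` above has to match the length of the vectors the embedding model returns, otherwise the vector store will reject the inserts. A minimal sketch of how that value could be looked up rather than hard-coded is shown below. The names `EMBEDDING_DIMENSIONS` and `vector_size_for` are hypothetical helpers, not part of LangChain or Qdrant, and the table only covers the default output sizes of three OpenAI models.

```python
# Default output dimensions of OpenAI's embedding models.
# NOTE: EMBEDDING_DIMENSIONS and vector_size_for are illustrative helpers,
# not part of any library used in this notebook.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def vector_size_for(model_name: str) -> int:
    """Return the default vector size for a known OpenAI embedding model."""
    try:
        return EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model_name!r}") from None


# Both ada-002 and text-embedding-3-small default to the same size,
# which is why the TE3 collection can reuse size=1536.
print(vector_size_for("text-embedding-3-small"))
```

Swapping in `text-embedding-3-large` would require a collection created with a different `size`, so a lookup like this keeps the collection definition and the embedding model in sync.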

In [41]:
from langchain.chains.combine_documents import create_stuff_documents_chain

# Creates chain that can be used for passing list of documents to a model
document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

In [42]:
from langchain.chains import create_retrieval_chain

# Creates a retrieval chain that retrieves documents then passes them to a document chain (from previous step)
te3_retrieval_chain = create_retrieval_chain(te3_retriever, document_chain)

In [43]:
# Invoke chain for each test question and collect answer and context
answers = []
contexts = []

for question in test_questions:
    response = te3_retrieval_chain.invoke({"input" : question})
    answers.append(response["answer"])
    contexts.append([context.page_content for context in response["context"]])
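
Since the same collection loop is repeated for every retriever we evaluate, it could be wrapped in a small helper. The sketch below is one possible shape, assuming only that the chain exposes `invoke` and returns a dict with `"answer"` and `"context"` keys, as the retrieval chain above does. `collect_responses`, `FakeDocument`, and `FakeChain` are hypothetical names; the fake classes exist only so the example runs without calling OpenAI.

```python
from dataclasses import dataclass


def collect_responses(chain, questions):
    """Invoke the chain once per question and gather answers and contexts."""
    answers, contexts = [], []
    for question in questions:
        response = chain.invoke({"input": question})
        answers.append(response["answer"])
        # Keep only the text of each retrieved document.
        contexts.append([doc.page_content for doc in response["context"]])
    return answers, contexts


# --- Stand-ins so the sketch runs offline (hypothetical, illustration only) ---
@dataclass
class FakeDocument:
    page_content: str


class FakeChain:
    """Mimics the output shape of a retrieval chain without any API calls."""

    def invoke(self, inputs):
        question = inputs["input"]
        return {
            "input": question,
            "answer": f"answer to {question}",
            "context": [FakeDocument(f"context for {question}")],
        }


answers, contexts = collect_responses(FakeChain(), ["q1", "q2"])
print(answers)
print(contexts)
```

In the notebook itself, the call would presumably look like `collect_responses(te3_retrieval_chain, test_questions)`.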

In [44]:
# Creates a dataset from the results from the previous step
te3_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [45]:
# Use ragas to evaluate the dataset against our chosen metrics
te3_advanced_retrieval_results = evaluate(te3_response_dataset_advanced_retrieval, metrics)


In [46]:
# Print the evaluation results
te3_advanced_retrieval_results

{'faithfulness': 0.6881, 'answer_relevancy': 0.9720, 'context_recall': 0.7281, 'context_precision': 0.8246, 'answer_correctness': 0.6887}

In [47]:
# Compare the baseline results to our te3 retrieval results
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(te3_advanced_retrieval_results.items()), columns=['Metric', 'TE3'])

df_merged = pd.merge(df_baseline, df_comparison, on='Metric')

df_merged['Baseline -> TE3'] = df_merged['TE3'] - df_merged['ADA']

df_merged

| | Metric | ADA | TE3 | Baseline -> TE3 |
|---|---|---|---|---|
| 0 | faithfulness | 0.762458 | 0.688092 | -0.074366 |
| 1 | answer_relevancy | 0.870611 | 0.972037 | 0.101427 |
| 2 | context_recall | 0.719298 | 0.72807 | 0.008772 |
| 3 | context_precision | 0.763158 | 0.824561 | 0.061404 |
| 4 | answer_correctness | 0.691255 | 0.68875 | -0.002505 |
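
Absolute differences can be hard to judge on their own, so it may help to look at the relative change per metric as well. The sketch below recomputes the deltas from the scores printed above and adds a percentage change. The variable names are illustrative, and the numbers are copied from the table rather than read from the `results` objects.

```python
# Aggregate scores copied from the comparison table above.
ada_scores = {
    "faithfulness": 0.762458,
    "answer_relevancy": 0.870611,
    "context_recall": 0.719298,
    "context_precision": 0.763158,
    "answer_correctness": 0.691255,
}
te3_scores = {
    "faithfulness": 0.688092,
    "answer_relevancy": 0.972037,
    "context_recall": 0.728070,
    "context_precision": 0.824561,
    "answer_correctness": 0.688750,
}

# Absolute change (TE3 minus ADA) and change relative to the ADA baseline.
deltas = {m: te3_scores[m] - ada_scores[m] for m in ada_scores}
relative_pct = {m: 100 * deltas[m] / ada_scores[m] for m in ada_scores}

for metric in ada_scores:
    print(f"{metric:<20} {deltas[metric]:+.3f} ({relative_pct[metric]:+.1f}%)")

# Metrics where TE3 scored higher than the ADA baseline.
improved = [m for m, d in deltas.items() if d > 0]
print("Improved:", improved)
```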


#### ❓ Question #3:

In your opinion, is `text-embedding-3-small` significantly better than `ada`?

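Yes, on most of the metrics we care about. TE3 improved answer relevancy substantially (0.871 to 0.972) and context precision noticeably (0.763 to 0.825), and context recall rose slightly. The exception is faithfulness, which dropped from 0.762 to 0.688, while answer correctness was essentially unchanged.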
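
Whether a difference is "significant" depends on how much the per-question scores vary, which the aggregate numbers alone cannot tell us. One possible check, sketched below, is a paired bootstrap over per-question score differences: if the resulting confidence interval excludes zero, the change is less likely to be noise. `paired_bootstrap_ci` is a hypothetical helper, and the score lists are made-up placeholders, not results from this notebook.

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question score difference."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must have the same length")
    # Pair the scores by question, then resample the differences with replacement.
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# PLACEHOLDER per-question scores for one metric (illustration only).
ada_per_question = [0.80, 0.95, 0.70, 1.00, 0.60, 0.90, 0.85, 0.75]
te3_per_question = [0.90, 0.97, 0.85, 1.00, 0.80, 0.92, 0.95, 0.88]

lower, upper = paired_bootstrap_ci(ada_per_question, te3_per_question)
print(f"95% CI for mean improvement: [{lower:.3f}, {upper:.3f}]")
```

With the real data, the per-question scores would presumably come from each Ragas result's pandas export, one column per metric.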