# Advanced RAG Pipeline for Mistral Models with Q&A Automation and Model Evaluation using LlamaIndex and Ragas

> *This notebook should work well in the `Data Science 3.0` kernel on Amazon SageMaker Studio. It requires Python v3.10+*

## Introduction

This Jupyter Notebook is designed to evaluate the performance of the Retrieval-Augmented Generation (RAG) pipeline. The RAG pipeline leverages a retriever component to identify relevant context from a knowledge base and a generator component to produce fluent and informative responses based on the retrieved context.

In this notebook, we will explore the RAG pipeline using the [LlamaIndex Evaluation library](https://docs.llamaindex.ai/en/stable/optimizing/evaluation/evaluation), which provides a comprehensive set of tools for building and evaluating question-answering systems. Additionally, we will utilize the [Ragas](https://docs.Ragas.io/en/stable/) (RAG Assessment) framework, designed specifically for assessing the performance of RAG models.


---

## Why We Need Evaluators like LlamaIndex Evaluators and Ragas

We need evaluators like LlamaIndex evaluators and Ragas in a RAG pipeline primarily because language models, including those used in the RAG pipeline, can suffer from issues like hallucination, factual inconsistencies, and biases. Evaluators help us assess the performance and reliability of the RAG pipeline, ensuring that it provides accurate, relevant, and trustworthy responses.

#### Benefits

1. **Mitigating Hallucination**: Language models, especially large language models used in RAG pipelines, can sometimes generate plausible-sounding but factually incorrect or made-up information, a phenomenon known as hallucination. Evaluators help identify instances of hallucination by comparing the generated responses against the ground truth or the source knowledge base.

2. **Ensuring Faithfulness and Factual Correctness**: Evaluators assess the faithfulness and factual correctness of the generated responses by comparing them with the source knowledge base or reference data. This is crucial in domains where accurate and reliable information is essential, such as healthcare, finance, or legal contexts.

3. **Measuring Relevance and Context Understanding**: Evaluators can measure how relevant and contextually appropriate the generated responses are, given the input query and the retrieved context from the knowledge base. This helps identify cases where the RAG pipeline fails to understand the query or retrieves irrelevant information.

4. **Quantifying Performance**: Evaluators provide quantitative metrics, such as accuracy, precision, recall, and F1-score, which allow for objective comparisons of different RAG pipeline configurations, retriever-generator model combinations, or training strategies.

5. **Identifying Biases and Inconsistencies**: Evaluators can help identify biases and inconsistencies in the generated responses, which may arise due to biases in the training data or the language model itself. This is important for ensuring fairness and avoiding potentially harmful biases in the RAG pipeline's outputs.

6. **Tailored Evaluation**: Frameworks like Ragas provide a structured approach to creating tailored test sets and evaluating the RAG pipeline's performance on specific types of queries or domains, allowing for more targeted assessments.


## Objectives

The primary objectives of this notebook are:

1. **Implement the RAG pipeline**: We will set up the RAG pipeline using LlamaIndex, configuring the retriever and generator components according to best practices. Mistral 7B Instruct LLM will be utilized as the generator component in this example.

2. **Create and evaluate the LlamaIndex Query Engine**: We will create a LlamaIndex Query Engine to facilitate efficient retrieval and generation of answers from the knowledge base.

3. **Generate test sets with LlamaIndex RagDatasetGenerator and Ragas TestsetGenerator**: To evaluate the RAG pipeline's performance, we will generate synthetic test sets using LlamaIndex [RagDatasetGenerator module](https://docs.llamaindex.ai/en/stable/examples/llama_dataset/labelled-rag-datasets/). In this example, **Mistral 7B Instruct** model will be used to generate Q&A pairs (including "ground truth") for the synthetic dataset. Additionally, we will leverage the Ragas [TestsetGenerator module](https://docs.Ragas.io/en/latest/getstarted/testset_generation.html) to create tailored test sets aligned with our specific needs. As for the Ragas section, only Mistral 7B Instruct will be utilized.

4. **Evaluate pipeline performance using LlamaIndex evaluators and Ragas**: We will leverage both the LlamaIndex evaluators and the Ragas framework to comprehensively assess the performance of the RAG pipeline on a range of question-answering tasks. Using the generated test sets, we will analyze metrics such as faithfulness, relevancy, correctness, and other relevant measures to gain insights into the pipeline's strengths and weaknesses.

## Expected Outcomes

By the end of this notebook, we expect to achieve the following outcomes:

1. A functional RAG pipeline implemented using LlamaIndex, capable of answering questions based on a knowledge base.

2. A LlamaIndex Query Engine for efficient retrieval and generation of answers.

3. Synthetic test sets generated using LlamaIndex RagDatasetGenerator and tailored test sets created with the Ragas TestsetGenerator.

4. Comprehensive performance evaluation of the RAG pipeline using both LlamaIndex evaluators and the Ragas framework, including quantitative metrics such as faithfulness, relevancy, correctness, semantic similarity, and other relevant measures, as well as qualitative analysis.

5. Insights into the impact of different pipeline configurations on performance and identification of limitations and potential areas for improvement in the RAG pipeline.


---

## Setup and Requirements

To start exploring RAG patterns with a practical example, we'll first install some libraries that might not be present in the default notebook kernel image:

- [Amazon Bedrock](https://docs.aws.amazon.com/pythonsdk/) AWS Python SDKs `boto3` and `botocore` to be able to call the service
- [LlamaIndex](https://docs.llamaindex.ai/en/stable/getting_started/installation/) is an open-source framework to help integrate LLMs with trusted data sources, and measure the performance of data-connected LLM use-cases
- [Ragas](https://docs.Ragas.io/en/stable/) is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines
- [LangChain](https://python.langchain.com/docs/get_started/introduction) is an open-source framework for orchestrating common LLM patterns. In this example, it's only used with Ragas as an optional step to generate test dataset via langchain docs. The entire RAG pipeline in this example is implemented with LlamaIndex though.



In [None]:
# Install required packages
%pip install --upgrade --quiet --no-cache-dir --force-reinstall \
    boto3 \
    botocore \
    langchain \
    langchain-aws \
    llama-index \
    llama-index-embeddings-langchain \
    llama-index-llms-bedrock \
    llama-index-embeddings-bedrock \
    llama-index-llms-langchain \
    ragas \
    spacy \
    datasets


Now that we have everything installed, let's import the required libraries and do some initial setup. This will come in handy later on:

In [2]:
# Python Built-Ins:
import os  # For dealing with folder paths
import sys
import textwrap
from io import StringIO

def print_ww(*args, width: int = 100, **kwargs):
    """Like print(), but wraps output to `width` characters (default 100)"""
    buffer = StringIO()
    try:
        _stdout = sys.stdout
        sys.stdout = buffer
        print(*args, **kwargs)
        output = buffer.getvalue()
    finally:
        sys.stdout = _stdout
    for line in output.splitlines():
        print("\n".join(textwrap.wrap(line, width=width)))

# External Dependencies:
import nest_asyncio  # Needed for some asyncio-based libs to work in Jupyter notebooks
nest_asyncio.apply()  # Enable asyncio-based libs to work properly in this notebook

In this example, **Mistral 7B Instruct** is our default model, but feel free to pick any other available Mistral model to experiment with this RAG pipeline. You just need to change the `DEFAULT_MODEL` variable. 

You can also choose which Titan Embeddings model to be used throughtout this notebook. Just chage the `DEFAULT_EMBEDDINGS` if needed. By default, **Amazon Titan Text Embeddings V2** is utilized.

Additionally, you may want to change the AWS region as well. If so, just change the `AWS_REGION` variable below:

In [3]:
instruct_mistral7b_id="mistral.mistral-7b-instruct-v0:2"
instruct_mixtral8x7b_id="mistral.mixtral-8x7b-instruct-v0:1"
mistral_large_2_id="mistral.mistral-large-2407-v1:0"
titan_embeddings_g1="amazon.titan-embed-text-v1"
titan_text_embeddings_v2="amazon.titan-embed-text-v2:0"

DEFAULT_MODEL=instruct_mistral7b_id
DEFAULT_EMBEDDINGS=titan_text_embeddings_v2
AWS_REGION="us-west-2"

---

## Download and pre-process documents with Titan Text Embeddings and LlamaIndex

In this example, we'll create an in-memory semantic search index using:

- [Amazon Titan Embeddings v2](https://aws.amazon.com/about-aws/whats-new/2024/04/amazon-titan-text-embeddings-v2-amazon-bedrock/) on Amazon Bedrock, as a model to convert text of documents and user queries into numerical "embedding" vectors.
- LlamaIndex [VectorStoreIndex](https://docs.llamaindex.ai/en/stable/community/integrations/vector_stores/), to index the generated document vectors in-memory and retrieve the most similar documents for incoming queries/questions.

### Download the sample document: Amazon's 2023 shareholder letter.

In this example, we'll just use a single document for our RAG corpus: Amazon's 2023 annual letter to shareholders. Since the document itself is long, it'll be split into multiple separate entries in the search index.

First, run the cell below to download the file locally. It'll also create the /data folder if it doesn't exist yet.

In [4]:
from urllib.request import urlretrieve
from tqdm import tqdm  # For progress bar

DATA_ROOT = "./data"
URL_FILENAME_MAP = {
    "https://s2.q4cdn.com/299287126/files/doc_financials/2024/ar/Amazon-com-Inc-2023-Shareholder-Letter.pdf": "Amazon-com-Inc-2023-Shareholder-Letter.pdf"
}

# Create the local folder if it doesn't exist
os.makedirs(DATA_ROOT, exist_ok=True)

# Download files with progress bar
for url, filename in tqdm(URL_FILENAME_MAP.items(), unit="file"):
    urlretrieve(url, os.path.join(DATA_ROOT, filename))

100%|██████████| 1/1 [00:00<00:00,  3.87file/s]


Then, we can initially read the PDF files using LlamaIndex:

In [5]:
from llama_index.core import SimpleDirectoryReader
docs = SimpleDirectoryReader(input_files=["data/Amazon-com-Inc-2023-Shareholder-Letter.pdf"]).load_data()

### Split and vectorize the documents

Text vectorization models are machine learning models that convert text data into numerical vector representations. This process, known as vectorization, allows the text to be processed and analyzed using mathematical operations and algorithms. They typically place an upper limit on the length of text they can process as a single item. Additionally, we want each search result to be reasonably short for embedding results in the answer generation LLM prompt later.

To address this, we need to **split** the source document into shorter passages for indexing. LlamaIndex's [TokenTextSplitter](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#tokentextsplitter) offers a utility for this purpose.

Choosing **chunk_size** and **chunk_overlap** values for the TokenTextSplitter involves a trade-off between computational efficiency and preserving context.

Larger chunk sizes can help capture more context and meaning within each chunk, which is beneficial for tasks that require understanding the broader context of the text. However, larger chunk sizes also increase the computational requirements for processing each chunk, as more tokens need to be processed at once.

The chunk overlap helps maintain continuity and context between adjacent chunks by including some overlapping text. This overlap can be particularly useful when processing text sequentially, as it provides some context from the previous chunk, which can aid in understanding the current chunk. A larger overlap can help preserve more context, but it also increases redundancy and computational overhead, as the same tokens are processed multiple times across overlapping chunks.

Therefore, the chunk size and overlap must be optimized for the specific NLP task at hand(e.g., text summarization and Q&A), where preserving context is crucial,  and strike a balance between preserving context and keeping the computational requirements manageable.

In [6]:
from llama_index.core.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=102
)

To convert each document chunk into a single vector, we'll use the **Amazon Titan Text Embeddings V2 model**. It's a lightweight, efficient model ideal for high accuracy retrieval tasks at different dimensions. The model supports flexible embeddings sizes (256, 512, 1,024) and prioritizes accuracy maintenance at smaller dimension sizes, helping to reduce storage costs without compromising on accuracy. When reducing from 1,024 to 512 dimensions, Titan Text Embeddings V2 retains approximately 99% retrieval accuracy, and when reducing from 1,024 to 256 dimensions, the model maintains 97% accuracy. Additionally, Titan Text Embeddings V2 includes multilingual support for 100+ languages in pre-training as well as unit vector normalization for improving accuracy of measuring vector similarity.  


In [7]:
from llama_index.embeddings.bedrock import BedrockEmbedding
embed_model = BedrockEmbedding(model=DEFAULT_EMBEDDINGS,
                               region_name=AWS_REGION)

After configuring the splitting and vectorization parameters, we can proceed to set up and execute LlamaIndex's [IngestionPipeline](https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/) to load and process the data.

In [8]:
from llama_index.core.ingestion import IngestionPipeline

# Create an ingestion pipeline
pipeline = IngestionPipeline(
    transformations=[text_splitter, embed_model])

# save
pipeline.persist("./pipeline_storage")

# Run the ingestion pipeline
doc_nodes = pipeline.run(documents=docs)

print(f"Ingested {len(doc_nodes)} chunks from {len(docs)} source docs")
doc_nodes[0].metadata

Ingested 28 chunks from 11 source docs


{'page_label': '1',
 'file_name': 'Amazon-com-Inc-2023-Shareholder-Letter.pdf',
 'file_path': 'data/Amazon-com-Inc-2023-Shareholder-Letter.pdf',
 'file_type': 'application/pdf',
 'file_size': 101160,
 'creation_date': '2024-05-15',
 'last_modified_date': '2024-05-15'}

---

## Creating and Evaluating the LlamaIndex Query Engine

After completing the chunking and vectorization processes, we can proceed to index the data into a queryable storage system.
As the end-to-end querying process involves not only retrieving relevant documents but also generating textual answers from those documents, we need to define the configuration for Mistral at this stage. In this example, we will use **Mistral 7B Instruct**:

In [9]:
from llama_index.llms.bedrock import Bedrock
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex

import boto3  # AWS SDK for Python
boto3_bedrock = boto3.client("bedrock-runtime")

model_kwargs_mistral = {
    "temperature": 0.5,
    "top_p": 0.9,
    "top_k": 200,
    "max_tokens": 8192  # Max response length
}

# Initialize the Mistral model to formulate final answer from search results
llm = Bedrock(
    model=DEFAULT_MODEL,
    streaming=True,
    client=boto3_bedrock,
    model_kwargs=model_kwargs_mistral,
    region_name=AWS_REGION
)

# Set LlamaIndex settings
Settings.llm = llm
Settings.embed_model = embed_model 
Settings.chunk_size=512

# Create a vector index from documents
vector_index = VectorStoreIndex.from_documents(documents=docs, 
                                               doc_nodes=doc_nodes)
print("Number of nodes:", len(vector_index.docstore.docs))

# Create a query engine
query_engine = vector_index.as_query_engine(
    similarity_top_k=5,  # The top k=5 search results will be fed through to the LLM prompt
)

# store the created index to the local file system in case you need to re-load it into memory
os.makedirs("./indices", exist_ok=True)
vector_index.storage_context.persist("./indices/amazon-shareholder-letters-2023-mistral")

Number of nodes: 30


Now, let's execute some example questions against the vector index and the Mistral model. 
The query function takes the user's query as input and generates a response based on the relevant context and prompts. The answer should be present in the [source document](data/2023-Shareholder-Letter.pdf):

In [10]:
# define prompt viewing function
from IPython.display import Markdown, display

def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}" f"**Text:** "
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown(""))

In [11]:
# Defines the query or question that we want to ask the model.
query="What is the importance of building primitives for innovation and experimentation in AWS? Why is this approach so important for the overall AWS generative AI strategy?"

# Retrieves the prompts that will be used to generate a response to the query.
prompts_dict = query_engine.get_prompts()

# Displays the prompts that were generated for the given query.
display_prompt_dict(prompts_dict)

# Executes the query against the vector index and the Mistral model.
# The query function takes the user's query as input and generates a response based on the relevant context and prompts.
response = query_engine.query(query)
print_ww(response)

**Prompt Key**: response_synthesizer:text_qa_template**Text:** 

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 




**Prompt Key**: response_synthesizer:refine_template**Text:** 

The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 




 Primitives are foundational building blocks that enable rapid innovation and experimentation in
AWS. They are discrete, indivisible units that do one thing really well and are meant to be used
together. By building primitives, AWS provides developers with maximum freedom and flexibility to
create new applications and services. This approach is crucial for the overall AWS generative AI
strategy because it allows for the democratization of AI technology and empowers both internal and
external builders to transform customer experiences and invent new ones. The use of primitives also
enables the composition of building blocks across businesses and in new combinations, leading to new
possibilities for customers. Additionally, this approach requires patience as the benefits of the
first few primitive services may not be immediately apparent to customers before the full potential
of these building blocks is realized.


Here are some additional examples of in-context questions, that is, questions to which answers will be found in the source document loaded into this RAG pipeline.

In [12]:
# In-context questions:
print_ww(query_engine.query("Which premium brands started listing on Amazon in 2023? (List at least 5 brands)"))
print("---")
print_ww(query_engine.query("Which countries does Amazon see meaningful progress in as emerging geographies?"))
print("---")
print_ww(query_engine.query("What are some of the GenAI applications that Amazon is building for customer and seller service productivity?"))


 In the context provided, several premium brands started listing on Amazon in 2023. Some of these
brands include Coach, Victoria's Secret, Pit Viper, Martha Stewart, Clinique, Lancôme, and Urban
Decay.
---
 The context information mentions several countries as being part of Amazon's emerging geographies,
specifically India, Brazil, Australia, Mexico, Middle East, Africa, and Thailand.
---
 Based on the context, Amazon is building several GenAI applications for customer and seller service
productivity. Some of these applications include those that generate, customize, and edit high-
quality images, advertising copy, and videos, as well as customer and seller service productivity
apps. Additionally, Amazon Q, an expert on AWS, is mentioned as a capable work assistant that
answers questions, summarizes data, carries on coherent conversation, and takes action. It is
optimistic that much of this world-changing AI will be built on top of AWS.


Some examples of questions non-related to the source document

In [13]:
# Out-of-context questions:
print_ww(query_engine.query("Tell me more about Amazon's international expansion in the Netherlands"))
print("---")
print_ww(query_engine.query("What's the name of the new Amazon's LLM?"))
print("---")
print_ww(query_engine.query("What was the total cash and investment balances at Amazon.com at the end of 1927?"))


 Amazon's international expansion in the Netherlands is not explicitly mentioned in the provided
context information. However, the text does mention that Amazon sees meaningful progress in their
emerging geographies, including India, Brazil, Australia, Mexico, Middle East, Africa, and other
countries. The company aims to reduce delivery times and better tailor the customer experience in
these markets. While the Netherlands is not specifically named, it can be inferred that Amazon is
expanding its reach and operations in various international markets.
---
 The context does not provide the name of the new Amazon large language model (LLM) mentioned in the
text.
---
 The context information provided does not mention the cash and investment balances at Amazon.com at
the end of the year 1927. The information given pertains to the years 1996, 1997, and 2023, with the
cash and investment balances at the end of 1997 being $125 million.


In [14]:
# Cross-context question:
print_ww(query_engine.query("How many times is the revenue growth rate in 2023 bigger than the one in 1997?"))

 The revenue growth rate in 2023 was bigger than the one in 1997 in all segments: North America (12%
vs. 838%), International (11% vs. 738%), and AWS (13% vs. 838%). Therefore, the number of times the
revenue growth rate in 2023 is bigger than the one in 1997 is three for North America and
International, and infinite for AWS since the growth rate in 1997 was much higher. However, it's
important to note that the absolute growth figures are significantly larger in 2023 due to the much
larger base revenue in that year.


---

## RAG Automated Pipeline evaluation with LlamaIndex evaluators

In the sections below, we'll show 4 automated evaluations available throught LlamaIndex. However, there are some additional metrics out-of-the-box that can be found [here](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/):

1. **Faithfulness**: This metric verifies whether the final response is in agreement with (doesn't contradict) the retrieved document snippets.
2. **Relevancy**: This metrics checks whether the response and retrieved content were relevant to the query.
3. **Correctness**: This metric evaluates whether the generated answer is relevant and agreeing with a reference answer.
4. **Semantic Similarity**: Evaluates the quality of a question answering system by comparing the similarity between embeddings of the generated answer and the reference answer

---

### 1. Faithfulness to source documents

The **Faithfulness** metric evaluates the coherence between the generated response and the source document snippets retrieved during the search process. This assessment is essential for identifying any discrepancies or hallucinations introduced by the LLM


In [15]:
from llama_index.core.evaluation import FaithfulnessEvaluator

query="What technology transformation is Andy Jassy comparing the potential impact of Generative AI to?"

response = query_engine.query(query)

print("Question: ----------------")
print_ww(query)
print("\nAnswer: ----------------")
print_ww(response)
print("\n----------------")

faithfulness_evaluator = FaithfulnessEvaluator(llm=llm)
eval_result = faithfulness_evaluator.evaluate_response(response=response)

print_ww("Evaluation Result:", eval_result.passing)
print_ww(f"Reasoning:\n{eval_result.feedback}")

Question: ----------------
What technology transformation is Andy Jassy comparing the potential impact of Generative AI to?

Answer: ----------------
 Andy Jassy is comparing the potential impact of Generative AI to that of the cloud technology
transformation.

----------------
Evaluation Result: True
Reasoning:
 YES
The context mentions the development and use of foundation models (FMs) and generative AI (GenAI)
applications, as well as the importance of having access to powerful compute resources and software
tools for building and deploying these models. The information provided aligns with the context.


---

### 2. Relevancy of response + source nodes to the query

The **Relevancy** metric verifies the correspondence between the response and the retrieved source documents with the user's query. This evaluation is crucial for assessing whether the response properly addresses the user's question.

In [16]:
from llama_index.core.evaluation import RelevancyEvaluator

relevancy_evaluator = RelevancyEvaluator(llm=llm)
eval_result = relevancy_evaluator.evaluate_response(query=query, response=response)

print_ww("Evaluation Result:", eval_result.passing)
print_ww(f"Reasoning:\n{eval_result.feedback}")

Evaluation Result: True
Reasoning:
 YES, the response is in line with the context information provided. Andy Jassy is comparing the
potential impact of Generative AI to that of the cloud technology transformation in the context of
the text.


---
#### Exploring differences between Relevancy and Faithfulness Evaluators

To illustrate the contrast between **relevancy** and **faithfulness**, let's examine the following question which isn't in the source data:

In [17]:
# Out-of-context question:
ooc_query = "When Amazon 'Bedrock Studio' will be launched?"

ooc_response = query_engine.query(ooc_query)
print_ww(ooc_query)
print_ww(ooc_response)

When Amazon 'Bedrock Studio' will be launched?
 The context information provided does not mention a specific launch date for Amazon's Bedrock
Studio. The text describes Bedrock as a service that is off to a strong start with tens of thousands
of active customers and that Amazon continues to iterate on, adding new models and features. It also
mentions that the majority of GenAI applications will ultimately be built by other companies using
the primitives that Amazon is building in AWS.


In [18]:
def evaluate_and_print_response(evaluator, ooc_response, ooc_query=None):
    """
    Evaluates the relevancy or faithfulness of a response to a given query using
    a provided evaluator, and prints the evaluation result, reasoning, and contexts.

    Args:
        evaluator (Union[RelevancyEvaluator, FaithfulnessEvaluator]): The evaluator
            to use for evaluation.
        ooc_response (str): The response to evaluate.
        ooc_query (str, optional): The original out-of-context query. If not provided,
            the response will be evaluated without a specific query context.

    Returns:
        Union[RelevancyEvaluationResult, FaithfulnessEvaluationResult]: The result
            of the evaluation.
    """
    evaluation_result = evaluator.evaluate_response(query=ooc_query, response=ooc_response)

    print_ww("Evaluation Result:", evaluation_result.passing)
    print_ww(f"Reasoning:\n{evaluation_result.feedback}")
    #print_ww("Contexts:\n", evaluation_result.contexts)
    #print_ww("Source:\n", ooc_response.source_nodes)

    return evaluation_result


The **Relevancy Evaluator** module is useful to measure if the response + source nodes match the query. Therefore, it helps measuring if the query was actually answered by the response. In this example, as the context information does not provide any details about the launch date of Amazon Bedrock Studio, then the evaluation result is **FALSE**. 


In [19]:
# Evaluate relevance of result to the original question:

evaluate_and_print_response(RelevancyEvaluator(llm=llm),
                            ooc_query=ooc_query, 
                            ooc_response=ooc_response)

Evaluation Result: False
Reasoning:
 NO. The context information does not mention a launch date for Amazon Bedrock Studio.


EvaluationResult(query="When Amazon 'Bedrock Studio' will be launched?", contexts=['service. Amazon Bedrock invented this layer and provides customers with the easiest way to build and scale\nGenAI applications with the broadest selection of first- and third-party FMs, as well as leading ease-of-usecapabilities that allow GenAI builders to get higher quality model outputs more quickly. Bedrock is off to avery strong start with tens of thousands of active customers after just a few months. The team continuesto iterate rapidly on Bedrock, recently delivering Guardrails (to safeguard what questions applications will\nanswer), Knowledge Bases (to expand models’ knowledge base with Retrieval Augmented Generation—or\nRAG—and real-time queries), Agents (to complete multi-step tasks), and Fine-Tuning (to keep teaching\nand refining models), all of which improve customers’ application quality. We also just added new modelsfrom Anthropic (their newly-released Claude 3 is the best performing larg

The **Faithfulness Evaluator** module is helpful to measure if the response from a query engine matches any source nodes. The context does not provide any information regarding the launch date of Amazon Bedrock Studio. Therefore, it can't answer that question.

This helps to measure if the response has been **HALLUCINATED**, which hasn't happened in this example.

In [20]:
# Evaluate faithfulness of response to retrieved content:

evaluate_and_print_response(FaithfulnessEvaluator(llm=llm), 
                            ooc_response=ooc_response)

Evaluation Result: False
Reasoning:
 NO. The context does not mention any specific launch date for Amazon's Bedrock Studio.


EvaluationResult(query=None, contexts=['service. Amazon Bedrock invented this layer and provides customers with the easiest way to build and scale\nGenAI applications with the broadest selection of first- and third-party FMs, as well as leading ease-of-usecapabilities that allow GenAI builders to get higher quality model outputs more quickly. Bedrock is off to avery strong start with tens of thousands of active customers after just a few months. The team continuesto iterate rapidly on Bedrock, recently delivering Guardrails (to safeguard what questions applications will\nanswer), Knowledge Bases (to expand models’ knowledge base with Retrieval Augmented Generation—or\nRAG—and real-time queries), Agents (to complete multi-step tasks), and Fine-Tuning (to keep teaching\nand refining models), all of which improve customers’ application quality. We also just added new modelsfrom Anthropic (their newly-released Claude 3 is the best performing large language model in the world),Meta (with Ll

---
## Automating Q&A Generation with LllamaIndex

LllamaInex provides tools designed to automatically generate datasets when provided with a set of documents to query. In the example below, we use the **RagDatasetGenerator** class to generate evaluation questions and reference answers(ground truth) from the source documents and the specified number of questions per node.

### Generate a Synthetic Test Set with LlamaIndex RagDatasetGenerator

In [21]:
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

dataset_generator = RagDatasetGenerator.from_documents(
    documents=docs,
    llm=llm,
    num_questions_per_chunk=2, # set the number of questions per nodes
    show_progress=True,
)

print(f"Number of nodes created: {len(dataset_generator.nodes)}")


Parsing nodes:   0%|          | 0/11 [00:00<?, ?it/s]

Number of nodes created: 30


> ⏰ **Note:** The code block below is used to generate a dataset of questions and reference answers("ground truth") from the source  document and may take some time to be completed.

In [22]:
%%time
# Since there are 30 nodes generated from the source document, there should be a total of 60 questions in the generated dataset
eval_questions = dataset_generator.generate_dataset_from_nodes()
eval_questions.to_pandas()

100%|██████████| 30/30 [00:30<00:00,  1.03s/it]
100%|██████████| 2/2 [00:02<00:00,  1.26s/it]
100%|██████████| 2/2 [00:03<00:00,  1.67s/it]
100%|██████████| 2/2 [00:03<00:00,  1.90s/it]
100%|██████████| 2/2 [00:03<00:00,  1.64s/it]
100%|██████████| 2/2 [00:03<00:00,  1.86s/it]
100%|██████████| 2/2 [00:02<00:00,  1.14s/it]
100%|██████████| 2/2 [00:03<00:00,  1.52s/it]
100%|██████████| 2/2 [00:02<00:00,  1.09s/it]
100%|██████████| 2/2 [00:02<00:00,  1.50s/it]
100%|██████████| 2/2 [00:05<00:00,  2.86s/it]
100%|██████████| 2/2 [00:04<00:00,  2.01s/it]
100%|██████████| 2/2 [00:02<00:00,  1.15s/it]
100%|██████████| 2/2 [00:02<00:00,  1.07s/it]
100%|██████████| 2/2 [00:04<00:00,  2.10s/it]
100%|██████████| 2/2 [00:07<00:00,  3.57s/it]
100%|██████████| 2/2 [00:04<00:00,  2.37s/it]
100%|██████████| 2/2 [00:04<00:00,  2.32s/it]
100%|██████████| 2/2 [00:02<00:00,  1.08s/it]
100%|██████████| 2/2 [00:05<00:00,  2.72s/it]
100%|██████████| 2/2 [00:03<00:00,  1.60s/it]
100%|██████████| 2/2 [00:02<00:0

CPU times: user 1.1 s, sys: 29.7 ms, total: 1.13 s
Wall time: 2min 15s





Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,Based on the information provided in the share...,"[Dear Shareholders:\nLast year at this time, I...",The total revenue for Amazon in 2023 grew 12%...,ai (mistral.mistral-7b-instruct-v0:2),ai (mistral.mistral-7b-instruct-v0:2)
1,Which segment of Amazon's business had the hig...,"[Dear Shareholders:\nLast year at this time, I...",The segment of Amazon's business that had the...,ai (mistral.mistral-7b-instruct-v0:2),ai (mistral.mistral-7b-instruct-v0:2)
2,"Based on the information provided, which event...","[In our Stores business, customers have enthus...","In Q4 2023, Amazon held an exclusive event fo...",ai (mistral.mistral-7b-instruct-v0:2),ai (mistral.mistral-7b-instruct-v0:2)
3,In what ways did Amazon improve delivery speed...,"[In our Stores business, customers have enthus...","In 2023, Amazon improved delivery speeds by r...",ai (mistral.mistral-7b-instruct-v0:2),ai (mistral.mistral-7b-instruct-v0:2)
4,Based on the information provided in the conte...,"[One is the benefit of regionalization, where ...",Amazon's regionalization efforts contributed ...,ai (mistral.mistral-7b-instruct-v0:2),ai (mistral.mistral-7b-instruct-v0:2)
5,In what ways did Amazon manage to reduce their...,"[One is the benefit of regionalization, where ...","In 2023, Amazon managed to reduce their cost ...",ai (mistral.mistral-7b-instruct-v0:2),ai (mistral.mistral-7b-instruct-v0:2)
6,What geographical locations did Amazon's inter...,"[expand selection and features, and move towar...","Based on the context provided, Amazon expande...",ai (mistral.mistral-7b-instruct-v0:2),ai (mistral.mistral-7b-instruct-v0:2)
7,Which new offering did Amazon's Advertising bu...,"[expand selection and features, and move towar...",Amazon's Advertising business introduced a ne...,ai (mistral.mistral-7b-instruct-v0:2),ai (mistral.mistral-7b-instruct-v0:2)
8,What significant advancements were made in AWS...,"[This work diminished short-term revenue, but ...","During the past year, AWS announced several s...",ai (mistral.mistral-7b-instruct-v0:2),ai (mistral.mistral-7b-instruct-v0:2)
9,In what ways has Amazon seen growth and succes...,"[This work diminished short-term revenue, but ...","According to the shareholder letter, Amazon h...",ai (mistral.mistral-7b-instruct-v0:2),ai (mistral.mistral-7b-instruct-v0:2)


Then, we can run the evaluation on the dataset and visualize the results in a dataframe:

In [23]:
from llama_index.core import Response
import pandas as pd

# define jupyter display function
def display_eval_df(query: str, response: Response, eval_result: str) -> None:

    eval_df = pd.DataFrame(columns=['Query', 'Response', 'Source', 'Evaluation Result'])
        
    new_record = {
                    "Query": query,
                    "Response": str(response),
                    "Source": (
                        response.source_nodes[0].node.get_content()[:100] + "..."
                    ),
                    "Evaluation Result": eval_result,
                }
    eval_df = eval_df._append(new_record, ignore_index=True)


    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "600px",
            "overflow-wrap": "break-word",
        },
        subset=["Response", "Source"]
    )
    display(eval_df)

Testing the first generated evaluation question with the **RelevancyEvaluator** class.

In [24]:
evaluator_mistral = RelevancyEvaluator(llm=llm)

# pick the first question from the generated evaluation dataset
eval_questions_df = eval_questions.to_pandas()
eval_question = eval_questions_df.iloc[0,0] 
response_vector = query_engine.query(eval_question)

eval_result = evaluator_mistral.evaluate_response(
    query=eval_question, response=response_vector
)

# print results
print("\n--------- Question ---------")
print_ww(eval_question)
print("\n--------- Response ---------")
print_ww(str(response_vector))
print("\n--------- Passed ---------")
print_ww(str(eval_result.passing))
print("\n--------- Feedback ---------")
print_ww(str(eval_result.feedback))
print("\n--------- Source ---------")
print_ww(response.source_nodes[0].node.get_content())

# show a DataFrame
display_eval_df(eval_question, response_vector, eval_result.passing)


--------- Question ---------
Based on the information provided in the shareholder letter, what was the percentage increase in
total revenue for Amazon in 2023 compared to the previous year?

--------- Response ---------
 The total revenue for Amazon in 2023 grew by 12% compared to the previous year.

--------- Passed ---------
True

--------- Feedback ---------
 YES, the response is in line with the context information provided as the shareholder letter states
that Amazon's total revenue grew 12% year-over-year in 2023.

--------- Source ---------
on-premises. These businesses will keep shifting online and into the cloud. In Media and
Advertising,
content will continue to migrate from linear formats to streaming. Globally, hundreds of millions of
peoplewho don’t have adequate broadband access will gain that connectivity in the next few years.
Last butcertainly not least, Generative AI may be the largest technology transformation since the
cloud (whichitself, is still in the early stag

Unnamed: 0,Query,Response,Source,Evaluation Result
0,"Based on the information provided in the shareholder letter, what was the percentage increase in total revenue for Amazon in 2023 compared to the previous year?",The total revenue for Amazon in 2023 grew by 12% compared to the previous year.,"Dear Shareholders: Last year at this time, I shared my enthusiasm and optimism for Amazon’s future. ...",True


---
### 3. Run LlamaIndex Evaluations for Faithfulness, Relevancy, and Correctness metrics

The **Correctness** metric checks the correctness of a question answering system, relying on a provided reference answer("ground truth"), query, and response. It assigns a score from 1 to 5 (with higher values indicating better quality) alongside an explanation for the rating. Conversely, both the Relevancy and Faithfulness evaluators return a score between 0 and 1, with higher values indicating better results.

In [25]:
from llama_index.core.evaluation import CorrectnessEvaluator

def run_evaluations(evaluation_dataset: pd.DataFrame, query_engine, language_model):
    """Run a batch evaluation on a list of questions and reference answers using a provided query engine.

    Args:
        evaluation_dataset (DataFrame): A list of questions and reference_answers(ground truth) to evaluate.
        query_engine (BaseQueryEngine): The query engine to use for answering the questions.
        language_model (LLM): The language model to use for evaluation.

    Returns:
        pd.DataFrame: A DataFrame containing the evaluation results, including the query,
            generated answer, faithfulness evaluation, and relevancy evaluation.
    """

    results_list = []
    faithfulness_evaluator = FaithfulnessEvaluator(llm=language_model)
    relevancy_evaluator = RelevancyEvaluator(llm=language_model)
    correctness_evaluator = CorrectnessEvaluator(llm=language_model)

    #for question, ground_truth in zip(evaluation_questions, evaluation_ground_truth):
    for index, row in evaluation_dataset.iterrows():

        question = row['query']  
        ground_truth = row['reference_answer']  
        
        response = query_engine.query(question)
        generated_answer = str(response)

        # Faithfulness evaluator
        faithfulness_results = faithfulness_evaluator.evaluate_response(response=response)
        
        # RelevancyEvaluator evaluator
        relevancy_results = relevancy_evaluator.evaluate_response(query=question, response=response)
        
        # CorrectnessEvaluator evaluator
        correctness_results = correctness_evaluator.evaluate(
            query=question,
            response=generated_answer,
            reference=ground_truth
        )

        current_evaluation = {
            "query": question,
            "generated_answer": generated_answer,
            "ground_truth": ground_truth,
            "faithfulness": faithfulness_results.passing,
            "faithfulness_feedback": faithfulness_results.feedback,
            "faithfulness_score": faithfulness_results.score,
            "relevancy": relevancy_results.passing,
            "relevancy_feedback": relevancy_results.feedback,
            "relevancy_score": relevancy_results.score,
            "correctness": correctness_results.passing,
            "correctness_feedback": correctness_results.feedback,
            "correctness_score": correctness_results.score,
        }
        results_list.append(current_evaluation)

    evaluations_df = pd.DataFrame(results_list)
    return evaluations_df


In [26]:
%%time
# Run evaluations for the first 5 rows only
evaluation_results_df = run_evaluations(eval_questions_df.head(5), query_engine, llm)
evaluation_results_df

CPU times: user 344 ms, sys: 6.61 ms, total: 350 ms
Wall time: 25.4 s


Unnamed: 0,query,generated_answer,ground_truth,faithfulness,faithfulness_feedback,faithfulness_score,relevancy,relevancy_feedback,relevancy_score,correctness,correctness_feedback,correctness_score
0,Based on the information provided in the share...,The total revenue for Amazon in 2023 grew by ...,The total revenue for Amazon in 2023 grew 12%...,True,YES\nThe context mentions that Amazon's total...,1.0,True,"YES, the response is in line with the context...",1.0,True,Both the generated and reference answers are i...,5.0
1,Which segment of Amazon's business had the hig...,The segment of Amazon's business that experie...,The segment of Amazon's business that had the...,False,The context does not directly support or cont...,0.0,True,"YES, the response is in line with the context...",1.0,True,The generated answer correctly identifies the ...,4.5
2,"Based on the information provided, which event...","In Q4 2023, Amazon held an exclusive event fo...","In Q4 2023, Amazon held an exclusive event fo...",True,YES (The context mentions the growth and expa...,1.0,True,"YES, the response is in line with the context...",1.0,True,The generated answer is fully relevant and cor...,5.0
3,In what ways did Amazon improve delivery speed...,"In 2023, Amazon made significant strides in i...","In 2023, Amazon improved delivery speeds by r...",True,"YES, the context supports the information tha...",1.0,True,YES. The response is in line with the context...,1.0,True,The generated answer is fully relevant to the ...,5.0
4,Based on the information provided in the conte...,Amazon's regionalization efforts led to a sig...,Amazon's regionalization efforts contributed ...,True,"YES, the context supports the information abo...",1.0,True,YES. The response is in line with the context...,1.0,True,The generated answer is fully relevant and cor...,5.0


In [27]:
# Show a single row
row=1
print("\n--- Query")
print_ww(evaluation_results_df.iloc[row,0])
print("\n--- Generated Answer")
print_ww(evaluation_results_df.iloc[row,1])
print("\n--- Ground Truth")
print_ww(evaluation_results_df.iloc[row,2])
print("\n")


--- Query
Which segment of Amazon's business had the highest percentage increase in revenue in 2023 compared
to the previous year, and what was the exact dollar amount of revenue generated in each year?
Additionally, what major events contributed to the revenue growth in this segment during the holiday
season?

--- Generated Answer
 The segment of Amazon's business that experienced the highest percentage increase in revenue year-
over-year (YoY) in 2023 was the AWS segment. The revenue for AWS grew by 13% YoY from $80B in 2022
to $91B in 2023.

In the Stores business, which includes Amazon's retail operations, revenue also grew significantly.
North America revenue increased by 12% YoY, International revenue grew by 11%, and the total retail
revenue grew from $514B in 2022 to $575B in 2023.

During the holiday season in Q4 2023, Amazon held Prime Big Deal Days, an exclusive event for Prime
members, followed by an extended Black Friday and Cyber Monday shopping event open to all custome

---

### 4. Semantic Similarity Evaluation 

The **SemanticSimilarityEvaluator** evaluates the quality of a question answering system by comparing the similarity between embeddings of the generated answer and the reference answer. Since we have reference answers (ground truth) within the generated dataset created by calling **RagDataSetGenerator**, we can use this evaluator to check the quality of this Q&A dataset via semantic similarity. Behind the scenes, it calculates the similarity score between embeddings of the generated answer and the reference answer (ground truth).



In [28]:
from llama_index.core.evaluation import SemanticSimilarityEvaluator

similiratity_evaluator = SemanticSimilarityEvaluator()

# Picks generated ansnwer and ground truth for the 3rd row of DataFrame. (feel free to use any other)
response = evaluation_results_df.loc[3,'generated_answer'] # generated_answer column
reference = evaluation_results_df.loc[3,'ground_truth'] # ground_truth column

print_ww("------ Generated Answer")
print_ww(response)
print_ww("------ Ground Truth")
print_ww(reference)
print_ww("------")

similarity_result = await similiratity_evaluator.aevaluate(
    response=response,
    reference=reference,
)

print("\nScore: ", similarity_result.score)
print("\nPassing: ", similarity_result.passing)  # default similarity threshold is 0.8

------ Generated Answer
 In 2023, Amazon made significant strides in improving delivery speeds by re-architecting its
network to store items closer to customers and expanding same-day facilities. This resulted in a
nearly 70% year-over-year increase in the number of items delivered same day or overnight. The
faster delivery times led to more frequent shopping on Amazon, as customers found it more convenient
to fulfill their needs with the platform. This trend was particularly noticeable in the growth of
Amazon's everyday essentials business, which experienced over 20% year-over-year growth in Q4 2023.
The regionalization efforts also helped trim transportation distances, leading to a reduction in
cost to serve on a per unit basis for the first time since 2018. This cost savings allowed Amazon to
invest further in speed improvements and add more selection at lower prices, making it a more
attractive option for customers.
------ Ground Truth
 In 2023, Amazon improved delivery speeds by r

---

## Automating Q&A Generation and Evaluation with Ragas


Ragas (RAG Assessment) offers a framework designed to assess RAG pipelines. The evaluation metrics used by Ragas, similarly to the ones we used above with LlamaIndex evaluators, are a set of metrics designed to assess the performance and safety of AI applications, particularly in the context of grounded conversational AI systems. Below you will find the metric we will be using with Ragas in this notebook.

1. **Faithfulness**
Faithfulness measures the extent to which an AI assistant's response is faithful to the provided context or information. It evaluates whether the assistant's response is consistent with the given facts and does not contradict or deviate from the provided context.
2. **Answer Relevancy**
Answer relevancy assesses how relevant the AI assistant's response is to the user's query or question. It evaluates whether the response addresses the core intent of the query and provides information that is directly relevant to the user's needs.
3. **Answer Similarity**
Answer similarity measures the semantic similarity between the AI assistant's response and the expected or ideal answer. It evaluates how closely the generated response matches the desired or ground truth answer in terms of meaning and content.
4. **Answer Correctness**
Answer correctness evaluates the factual accuracy and correctness of the AI assistant's response. It assesses whether the information provided in the response is true, accurate, and free from factual errors or inconsistencies.
5. **Context Precision**
Context precision measures the precision of the AI assistant's response concerning the provided context. It evaluates how well the assistant's response incorporates and utilizes the relevant information from the given context, without including irrelevant or extraneous information.
6. **Context Recall**
Context recall measures the recall of the AI assistant's response concerning the provided context. It evaluates how much of the relevant information from the given context is included and covered in the assistant's response.
7. **Context Entity Recall**
Context entity recall specifically evaluates the AI assistant's ability to identify and include relevant entities (e.g., names, places, organizations) from the provided context in its response.
8. **Harmfulness**
Harmfulness assesses the potential for the AI assistant's response to cause harm or promote harmful or unethical content. It evaluates whether the response contains offensive, biased, or potentially harmful language or information.

In [29]:
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_similarity,
    answer_correctness,    
    context_precision,
    context_recall,
    context_entity_recall
)
from ragas.metrics.critique import harmfulness

metrics = [
    faithfulness,
    answer_relevancy,
    answer_similarity,
    answer_correctness,
    context_precision,
    context_recall,
    context_entity_recall,
    harmfulness,
]

--- 
### Generate a Synthetic Test Set with Ragas TestsetGenerator

By leveraging the **TestsetGenerator**, you can create test sets tailored to specific domains, topics, or use cases, ensuring that your AI assistant is thoroughly evaluated across a wide range of scenarios. The generated test sets can include various types of queries, contexts, and expected responses, allowing you to assess the assistant's performance metrics such as faithfulness, relevance, similarity, correctness, precision, recall, and potential harmfulness.


In [30]:
from langchain_aws import ChatBedrock
from langchain_community.embeddings import BedrockEmbeddings

# init the Embeddings model
bedrock_embeddings = BedrockEmbeddings(
    region_name=AWS_REGION,
    model_id=DEFAULT_EMBEDDINGS
)

bedrock_model_mistral = ChatBedrock(
    model_id=DEFAULT_MODEL,
    model_kwargs=model_kwargs_mistral,
    region_name=AWS_REGION,
)


**TestsetGenerator:** This module is responsible for generating test sets for evaluating RAG pipelines. It provides a variety of test generation strategies, including simple, reasoning, and multi-context strategies.

In [31]:
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.from_langchain(
    generator_llm=bedrock_model_mistral,
    critic_llm=bedrock_model_mistral,
    embeddings=bedrock_embeddings,
)


#### Option 1 - Generate with LlamaIndex documents

The following step will involve loading a set of documents and text chunks, allowing the Mistral model to generate potential questions based on these documents. Additionally, it will create reference answers (referred to as 'ground truth') for these questions, all based on the provided documents.

The `distributions` parameter in the `generate_with_llamaindex_docs` function is used to specify the probability distribution for generating different types of questions or prompts during testing or evaluation. It's a dictionary where the keys represent the types of questions or prompts, and the values represent the corresponding probabilities. In the given example:

```python
distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}
```

This means that the generator will generate questions or prompts based on the following probabilities:

#### simple
- Probability: 0.5 (50%)
- Likely straightforward questions requiring simple answers.

#### reasoning 
- Probability: 0.25 (25%)
- Questions may require some reasoning or logical thinking.

#### multi_context
- Probability: 0.25 (25%) 
- Questions may involve multiple contexts or require information from multiple sources.

The sum of the probabilities equals 1.0 (100%). This allows controlling the mix of different question types generated for testing/evaluation purposes.By specifying the `distributions`, you can test the system's performance on different types of questions, ranging from simple to more complex multi-context scenarios. This can help identify areas for improvement and ensure robust performance across various question difficulties.

In [38]:
%%time
%%capture

print("Start dataset generation...")

testset = generator.generate_with_llamaindex_docs(documents=docs, 
                                                  test_size=4,
                                                  raise_exceptions=False,
                                                  with_debugging_logs=False, 
                                                  distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
print("Test dataset generated!")

CPU times: user 683 ms, sys: 84.7 ms, total: 767 ms
Wall time: 56.8 s


In [39]:
df = testset.to_pandas()
df 

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,"""How did Amazon's Free Cash Flow improve in 2...","[Dear Shareholders:\nLast year at this time, I...",Amazon's Free Cash Flow improved from -$12.8B ...,simple,"[{'page_label': '1', 'file_name': 'Amazon-com-...",True
1,"""How did Amazon's revenue grow in 2023 compar...","[Dear Shareholders:\nLast year at this time, I...",Amazon's total revenue grew 12% year-over-year...,simple,"[{'page_label': '1', 'file_name': 'Amazon-com-...",True
2,Which Amazon financial metric and business sec...,"[Dear Shareholders:\nLast year at this time, I...","In 2023, Amazon's AWS revenue experienced the ...",reasoning,"[{'page_label': '1', 'file_name': 'Amazon-com-...",True
3,How did enhancements in customer experience an...,"[Dear Shareholders:\nLast year at this time, I...","In 2023, Amazon's Free Cash Flow (FCF) improve...",multi_context,"[{'page_label': '1', 'file_name': 'Amazon-com-...",True


#### Option 2 - Generate with Langchain documents (Optional)

This step is optional since documents have been loaded via LlamaIndex above using _**generator.generate_with_llamaindex_docs()**_ method. This is only for those who want to use Ragas with Langchain instead.

In [None]:
%%time
%%capture

from langchain_community.document_loaders import PyPDFLoader
data = PyPDFLoader("data/Amazon-com-Inc-2023-Shareholder-Letter.pdf").load_and_split()

print("Start dataset generation...")

# generate testset
testset_langchain = generator.generate_with_langchain_docs(documents=data, 
                                                           test_size=4,
                                                           raise_exceptions=False,
                                                           with_debugging_logs=False, 
                                                           distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

print("Test dataset generated!")

In [None]:
df_langchain = testset_langchain.to_pandas()
df_langchain 

In [40]:
# add a new column to the DataFrame as some of the metrics 
# being evaluated require the answer column in the dataset
df["answer"]=''

# iterate through df variable and populate answer column 
for i, row in df.iterrows():
    # add a new column to the DataFrame 'df' created by the generator.generate_with_llamaindex_docs method
    print("\n--------------- Question ---------------")
    print(df.question[i])
    response = query_engine.query(df.question[i])
    df.loc[i, "answer"]=response.response
    print_ww("Answer: " + df.answer[i])


--------------- Question ---------------
 "How did Amazon's Free Cash Flow improve in 2023 compared to the previous year?
Answer:  The Free Cash Flow (FCF) of Amazon improved significantly in 2023 compared to the previous
year. The FCF increased from a negative value of $12.8 billion in 2022 to a positive value of $35.5
billion in 2023, representing an improvement of $48.3 billion. This improvement can be attributed to
the company's focus on cost optimization, operational efficiency, and revenue growth across its
various business segments.

--------------- Question ---------------
 "How did Amazon's revenue grow in 2023 compared to the previous year, and in which segments did this growth occur?
Answer:  In 2023, Amazon's total revenue grew by 12% year-over-year from $514B to $575B. This growth
was observed across all segments. North America revenue increased by 12% YoY from $316B to $353B,
International revenue grew by 11% YoY from $118B to $131B, and AWS revenue grew by 13% YoY from 

In [41]:
from datasets import Dataset 

synthetic_dataset = Dataset.from_pandas(df)
synthetic_dataset.to_pandas()

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done,answer
0,"""How did Amazon's Free Cash Flow improve in 2...","[Dear Shareholders:\nLast year at this time, I...",Amazon's Free Cash Flow improved from -$12.8B ...,simple,"[{'creation_date': '2024-05-15', 'file_name': ...",True,The Free Cash Flow (FCF) of Amazon improved s...
1,"""How did Amazon's revenue grow in 2023 compar...","[Dear Shareholders:\nLast year at this time, I...",Amazon's total revenue grew 12% year-over-year...,simple,"[{'creation_date': '2024-05-15', 'file_name': ...",True,"In 2023, Amazon's total revenue grew by 12% y..."
2,Which Amazon financial metric and business sec...,"[Dear Shareholders:\nLast year at this time, I...","In 2023, Amazon's AWS revenue experienced the ...",reasoning,"[{'creation_date': '2024-05-15', 'file_name': ...",True,"In 2023, Amazon's AWS revenue experienced the..."
3,How did enhancements in customer experience an...,"[Dear Shareholders:\nLast year at this time, I...","In 2023, Amazon's Free Cash Flow (FCF) improve...",multi_context,"[{'creation_date': '2024-05-15', 'file_name': ...",True,"In 2023, Amazon experienced significant impro..."


### Ragas Evaluation module

The evaluation step leverages the questions from the generated test set to assess the performance of the RAG pipeline. In our example scenario, Mistral is utilized to validate the answers produced by our RAG pipeline against the questions provided in the previously created test set.

Then, the LLM is tasked with evaluating how well the retrieved contexts align with the given questions. This step ensures that the contextual information provided to the LLM is relevant and appropriate for answering the queries.

Finally, the answers generated by Mistral are compared against the ground truth answers included in the test set. This comparison allows for a comprehensive evaluation of the LLM's performance in generating accurate and relevant responses.

In [42]:
%%time
from ragas import evaluate
import nest_asyncio  

# Only used when running on a jupyter notebook, otherwise you may want to remove this function
nest_asyncio.apply()

result = evaluate(
    synthetic_dataset,
    metrics=metrics,
    llm=bedrock_model_mistral,
    embeddings=bedrock_embeddings,
)

Evaluating:   0%|          | 0/32 [00:00<?, ?it/s]

CPU times: user 525 ms, sys: 50.5 ms, total: 576 ms
Wall time: 21.1 s


In [43]:
result.to_pandas()

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done,answer,faithfulness,answer_relevancy,answer_similarity,answer_correctness,context_precision,context_recall,context_entity_recall,harmfulness
0,"""How did Amazon's Free Cash Flow improve in 2...","[Dear Shareholders:\nLast year at this time, I...",Amazon's Free Cash Flow improved from -$12.8B ...,simple,"[{'creation_date': '2024-05-15', 'file_name': ...",True,The Free Cash Flow (FCF) of Amazon improved s...,1.0,0.892206,0.887614,0.721903,1.0,1.0,0.5,0
1,"""How did Amazon's revenue grow in 2023 compar...","[Dear Shareholders:\nLast year at this time, I...",Amazon's total revenue grew 12% year-over-year...,simple,"[{'creation_date': '2024-05-15', 'file_name': ...",True,"In 2023, Amazon's total revenue grew by 12% y...",1.0,0.823959,0.978041,0.99451,1.0,1.0,0.210526,0
2,Which Amazon financial metric and business sec...,"[Dear Shareholders:\nLast year at this time, I...","In 2023, Amazon's AWS revenue experienced the ...",reasoning,"[{'creation_date': '2024-05-15', 'file_name': ...",True,"In 2023, Amazon's AWS revenue experienced the...",0.727273,0.545551,0.868412,0.563257,1.0,1.0,0.214286,1
3,How did enhancements in customer experience an...,"[Dear Shareholders:\nLast year at this time, I...","In 2023, Amazon's Free Cash Flow (FCF) improve...",multi_context,"[{'creation_date': '2024-05-15', 'file_name': ...",True,"In 2023, Amazon experienced significant impro...",1.0,0.78615,0.906229,0.414057,1.0,1.0,0.090909,0


---
## Conclusion

### Benefits of Using LLM Evaluators

Utilizing LLM evaluators like LlamaIndex's evaluator and Ragas can provide valuable insights into the performance and reliability of your RAG pipeline as you have explored along this example, particularly when evaluating the outputs of language models such as Mistral. Here are some key benefits:

### Assessing Response Quality

LLM evaluators can help assess the quality of the responses generated by your RAG pipeline by comparing them against various criteria, such as:

- **Correctness**: Evaluating if the generated answer matches the reference or ground truth answer, if available
- **Semantic Similarity**: Measuring the semantic similarity between the generated answer and the reference answer, even if they differ in wording.
- **Faithfulness**: Determining if the generated answer is faithful to the retrieved context, avoiding hallucinations or irrelevant information.

### Evaluating Retrieval Relevance
These evaluators can also assess the relevance of the retrieved context to the input query, ensuring that the RAG pipeline is providing appropriate information to the language model.

### Guideline Adherence
Evaluators like LlamaIndex and Ragas, can evaluate if the generated responses adhere to specific guidelines or constraints, which is crucial for maintaining control over the language model's outputs.

### Automated Question Generation
Tools like LlamaIndex and Ragas can automatically generate questions based on your data, allowing you to test your RAG pipeline's performance on a diverse set of queries without manual effort.

By leveraging these evaluation capabilities, you can systematically identify areas for improvement in your RAG pipeline, retrieval components, and ultimately enhance the accuracy and reliability of your Generative AI-powered applications.