# AISA Capstone 3 Assignment

## Overview
This notebook provides an environment for you to build your intuition on the steps to take when developing a high-quality Retrieval-Augmented Generation (RAG) solution. 
RAG solutions retrieve data before calling the large language model (LLM) to generate an answer. 
The retrieved data is used to augment the prompt to the LLM by adding the relevant retrieved data in context. 
Any RAG solution is only as good as the quality of the data retrieval process. 
The AISA Capstone 2 assignment focused on retrieval accuracy for RAG.
This notebook follows on directly from that assignment and focuses on generating high-quality answers to questions, 
and systematically assessing the quality of generated output.
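
The "retrieve, then augment the prompt" step described above can be sketched in a few lines. This is a simplified illustration only: the function name `build_augmented_prompt`, the prompt wording, and the sample chunks are hypothetical and are not taken from LlamaIndex.

```python
def build_augmented_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Place the retrieved text in the prompt ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, for illustration only
chunks = ["Chunk one text.", "Chunk two text."]
prompt = build_augmented_prompt("What does chunk one say?", chunks)
print(prompt)
```

The LLM only sees what is placed in the context, which is why retrieval quality bounds answer quality.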

The RAG solution developed here is built on the LlamaIndex framework, a popular framework in the industry for developing RAG and agent-based solutions. In addition to providing a core set of tools for orchestrating RAG and agent workflows, it offers broad integration with a variety of platforms for model inference (LLM, embedding, ...) and, importantly, tooling for solution evaluation.

## Prerequisites for running the notebook
- You have been granted access to the Bedrock models that you are going to use, in the region (**us-west-2**) where you are going to use Bedrock - 
[reference](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access-modify.html)
- Your SageMakerExecutionRole has permissions to invoke Bedrock models - 
[reference](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-prereq.html)
- This notebook has been tested on a SageMaker Notebook Instance running a `conda_python3` kernel
- The AWS region set for Amazon Bedrock must be one where the models being used are 1/ available, and 2/ enabled for use. This notebook was tested with Bedrock region `us-west-2`

## Implementation
This notebook uses LlamaIndex to define and execute the RAG solution. We will be using the following tools:

- **LLM (Large Language Model)**: e.g. Anthropic Claude Haiku available through Amazon Bedrock

  LLMs are used in the notebook for 1/ RAG response generation, to show the overall RAG workflow in action, and 2/ generating test questions on the indexed content (LlamaIndex nodes) for retrieval evaluation.
  
- **Text Embeddings Model**: e.g. Amazon Titan Embeddings available through Amazon Bedrock

  This embedding model is used to generate semantic vector representations of both the content to be stored (LlamaIndex nodes) and the questions input to the RAG solution.
  
- **Document Loader**: SimpleDirectoryReader (LlamaIndex)

  Before your chosen LLM can act on your data, you need to load it. LlamaIndex does this via data connectors, also called Readers. Data connectors ingest data from different data sources and format the data into Document objects. A Document is a collection of data (currently text, and in future, images and audio) and metadata about that data.
  
  This implementation uses SimpleDirectoryReader, which creates documents out of every file in a given directory. It can read a variety of formats including Markdown, PDFs, Word documents, and PowerPoint decks.

- **Vector Store**: VectorStoreIndex (LlamaIndex)

  In this notebook we are using this in-memory vector store to store both the embeddings and the documents. In an enterprise context this could be replaced with a persistent store such as Amazon OpenSearch Service, Amazon RDS for PostgreSQL with pgvector, ChromaDB, Pinecone or Weaviate.
  
  LlamaIndex abstracts the underlying vector database storage implementation with the VectorStoreIndex class. This wraps the Index, which is a data structure composed of Document objects, designed to enable querying by an LLM. The Index is designed to be complementary to your querying strategy.
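
To build intuition for what the embeddings model and the vector store do together, here is a toy sketch of similarity search over an in-memory store. It is not how LlamaIndex is implemented: the three-dimensional vectors, the `toy_store` dictionary, and the function names are made up for illustration, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the ids of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), node_id) for node_id, vec in store.items()]
    scored.sort(reverse=True)
    return [node_id for _, node_id in scored[:k]]


# Hypothetical node embeddings, for illustration only
toy_store = {
    "node_0": [1.0, 0.0, 0.0],
    "node_1": [0.9, 0.1, 0.0],
    "node_2": [0.0, 1.0, 0.0],
}
result = top_k([1.0, 0.05, 0.0], toy_store, k=2)
print(result)
```

The question is embedded with the same model as the content, so that "close in vector space" approximates "semantically related".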

----

Install required Python modules for constructing the RAG solution.
You only need to run this once. 

Don't stress if you see an error in the output of the `pip install`. Although an error looks concerning, it will likely not affect the functioning of the notebook.

In [None]:
%pip install \
    llama-index \
    llama-index-llms-bedrock \
    llama-index-embeddings-bedrock

## Section 1: Setting up the baseline configuration with some sample content

Download the default RAG test source data to our target source_docs directory. 
You only need to run this once.

In [None]:
source_docs_dir = './source_docs/'

The following creates the source_docs directory and downloads a document to that directory. The contents of this directory, 
initially the document that is downloaded here, will be used in the steps that follow.

After running this notebook in its entirety and reviewing its operation, delete this content and add your own content to the directory.

In [None]:
# Download and load data
!mkdir -p {source_docs_dir}
!wget --no-check-certificate 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O {source_docs_dir}'/paul_graham_essay.txt'

Import auxiliary modules

In [None]:
import logging
import sys
import os
import pandas as pd
import boto3  # AWS SDK for Python

In [None]:
# This is a helper function that wraps printed output to make it easier to read.

import textwrap
from io import StringIO

def print_ww(*args, width: int = 100, **kwargs):
    """Like print(), but wraps output to `width` characters (default 100)"""
    buffer = StringIO()
    try:
        _stdout = sys.stdout
        sys.stdout = buffer
        print(*args, **kwargs)
        output = buffer.getvalue()
    finally:
        sys.stdout = _stdout
    for line in output.splitlines():
        print("\n".join(textwrap.wrap(line, width=width)))

In [None]:
# This is required when running within a jupyter notebook, otherwise you will get errors when llamaindex modules run
import nest_asyncio

nest_asyncio.apply()

Import required Python modules for constructing and evaluating the RAG solution

In [None]:
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.llms.bedrock import Bedrock

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.text_splitter import TokenTextSplitter


from llama_index.core.evaluation import (
    DatasetGenerator,
    RetrieverEvaluator,
    generate_question_context_pairs,
)

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Response,
)

## Configure the models that will be used for the RAG pipeline

**Note**: By default this notebook will use the `us-west-2` region. This region has support for the models used in this notebook. You should not need to change this setting.

In [None]:
AWS_REGION = "us-west-2"
# AWS_REGION = "us-east-1"  # this is an alternative setting to use if desired 

Define the set of Bedrock model IDs that we'll use when developing and testing our solution 

Establish a connection to the Amazon Bedrock service

In [None]:
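# Pass the region explicitly so the client matches the AWS_REGION configured above
boto3_bedrock = boto3.client("bedrock-runtime", region_name=AWS_REGION)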

### Configure the target embeddings models for use with LlamaIndex

In [None]:
titan_text_embeddings_multilingual_v1_id = "amazon.titan-embed-text-v1"
titan_text_embeddings_multilingual_v2_id = "amazon.titan-embed-text-v2:0"
cohere_text_embeddings_english_id = "cohere.embed-english-v3"
cohere_text_embeddings_multilingual_id = "cohere.embed-multilingual-v3"

Configure our chosen embeddings model for use with llama_index

In [None]:
titan_text_embeddings_v2 = BedrockEmbedding(model=titan_text_embeddings_multilingual_v2_id,region_name=AWS_REGION)
titan_text_embeddings_v1 = BedrockEmbedding(model=titan_text_embeddings_multilingual_v1_id,region_name=AWS_REGION)
cohere_text_embeddings_english = BedrockEmbedding(model=cohere_text_embeddings_english_id,region_name=AWS_REGION)
cohere_text_embeddings_multilingual = BedrockEmbedding(model=cohere_text_embeddings_multilingual_id,region_name=AWS_REGION)

### Configure the target LLMs for use with LlamaIndex

The following Mistral models can be used to produce questions for evaluation. 
The Titan model produces questions of lesser quality and sometimes not in the format needed by the tools. 

**Note** Most Bedrock LLMs do not *produce test questions* in a format that can be directly used for evaluation with the tooling as it is configured in this notebook.

In [None]:
instruct_mistral7b_id = "mistral.mistral-7b-instruct-v0:2"
instruct_mixtral8x7b_id = "mistral.mixtral-8x7b-instruct-v0:1"
titan_text_express_id = "amazon.titan-text-express-v1"
claude_haiku_3_id = "anthropic.claude-3-haiku-20240307-v1:0"
claude_sonnet_35_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"

In [None]:
# set the parameters to be applied when invoking the model
model_kwargs_llm = {
    "temperature": 0.1,
    "top_k": 200,
    "max_tokens": 4096
}

### NOTE: This notebook uses two additional LLMs !!
You will need to enable access to the following models in the Bedrock console:
- Anthropic Claude Haiku 3
- Anthropic Claude Sonnet 3.5

This is in addition to the Mistral and Titan models used in Capstone 2.

If these are not enabled you will encounter errors later.

In [None]:
llm_mistral7b = Bedrock(model=instruct_mistral7b_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)
llm_mixtral8x7b = Bedrock(model=instruct_mixtral8x7b_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)
llm_titan_express = Bedrock(model=titan_text_express_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)
llm_haiku_3 = Bedrock(model=claude_haiku_3_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)
llm_sonnet_35 = Bedrock(model=claude_sonnet_35_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)

### Use the following cell to configure the embeddings model to use for the cells that follow

The embeddings model is a critical choice for the accuracy of your RAG solution.
Experiment with the options here to see which is best for your content.
If you want more, test with further alternatives. There are many that are readily supported by llama_index.

In [None]:
# After the first run, set this to match your intended configuration based on your learning from Capstone II

# embed_model = titan_text_embeddings_v1
embed_model = titan_text_embeddings_v2
# embed_model = cohere_text_embeddings_english
# embed_model = cohere_text_embeddings_multilingual

### Use the following cell to configure the LLM to use for the cells that follow
The LLM will be used for question generation and RAG answer generation in this notebook as it is currently configured.
The default value llm_mistral7b works well with the code and should be used if possible. 

In [None]:
llm_model = llm_mistral7b
# llm_model = llm_mixtral8x7b
# llm_model = llm_titan_express
# llm_model = llm_haiku_3

In [None]:
# Set LlamaIndex default model settings to what was set in the cells above
Settings.embed_model = embed_model
Settings.llm = llm_model

## Read in the documents for adding to our data store

Read the documents in the `source_docs_dir` directory (`./source_docs/`) into a structure ready for use by llama_index

In [None]:
reader = SimpleDirectoryReader(source_docs_dir)
documents = reader.load_data()

Quick check here to see that all of your documents were read. The count should match the number of pages in the documents in source_docs

In [None]:
len(documents)

## Create and run the document ingestion pipeline

The following cell defines two different document ingestion pipelines. 
If you have time, test using both of these, then create your own and test with that as well.

In [None]:
# Define two sets of transformations for the ingestion pipelines for initial experimentation

transformations_00=[
        TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=100),
        embed_model,
    ]

transformations_01=[
        SentenceSplitter(chunk_size=512, chunk_overlap=100),
        TitleExtractor(),
        embed_model,
    ]
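
The `chunk_size` and `chunk_overlap` settings in the transformations above control how each document is cut into nodes. The sketch below illustrates the idea with a plain sliding window over words. It is a simplification under stated assumptions: the real splitters count model tokens (and `SentenceSplitter` also tries to respect sentence boundaries), and `split_with_overlap` is a hypothetical helper, not a LlamaIndex function.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Cut a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Ten stand-in "tokens"; small sizes are used so the overlap is easy to see
words = [f"w{i}" for i in range(10)]
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Overlap repeats a little text at each boundary so that a sentence split across two chunks can still be retrieved with some of its surrounding context, at the cost of producing more nodes to embed and store.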


### Use the following cell to configure the data ingestion pipeline for processing the source data

In [None]:
# After the first run, set this to match your intended configuration based on your learning from Capstone II

pipeline = IngestionPipeline(transformations=transformations_00)
# pipeline = IngestionPipeline(transformations=transformations_01)


### Run the configured ingestion pipeline 

In [None]:
# run the pipeline
nodes = pipeline.run(documents=documents)
print(f"number of nodes: {len(nodes)}")

The following cell assigns stable, sequential node IDs. This may make test analysis easier, but it is not essential.

In [None]:
# By default, the node ids are set to random uuids. 
# To ensure same id's per run, we manually set them to consistent sequential numbers.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

In [None]:
# validate that node has an embedding associated with it
for idx, node in enumerate(nodes):
    if node.id_ == "node_0":
        print(node.embedding)

## Create the VectorStoreIndex 
This creates our vector database, in memory in this case, using the nodes that were created in the previous step

In [None]:
vector_index = VectorStoreIndex(nodes=nodes)

## Test that we have a valid starting point for our evaluation
We run a quick system test of the default llama_index RAG workflow with a question that is relevant to our dataset

Instantiate a query engine object

In [None]:
query_engine = vector_index.as_query_engine(llm=llm_mistral7b)

Specify a question that can be answered from the document(s) that have been ingested. For the default document, the following is a valid question.

In [None]:
example_query="""Based on Paul Graham's experience, why did he initially lose interest in studying philosophy
and switch to AI in college?"""

Run the default RAG pipeline with the example query. This should give a meaningful result. Don't worry if the answer is overly verbose, etc. We'll fix that later.

In [None]:
response = query_engine.query(example_query)
print(response)

----

# Evaluate the retrieval accuracy of the VectorIndex

## Create a set of question and node (context) pairs to drive the tests that follow
This uses the LLM given to the method, together with the document data stored in the nodes (created during document ingestion)

This will make many calls to the specified LLM (num_questions_per_chunk * number of nodes). This will likely be throttled by Bedrock. The llama_index API will work through the throttling except in extreme cases.
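
As a general illustration of how a client can "work through" throttling, the sketch below retries a call with exponential backoff. It is not the LlamaIndex or boto3 implementation: `ThrottledError`, `call_with_backoff`, and `flaky_call` are hypothetical names, and the throttling is simulated.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error returned by a model service."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn, waiting longer after each throttled attempt before retrying."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff (1s, 2s, 4s, ...) plus a little random jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))


# Simulated call that is throttled twice before it succeeds
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("simulated throttling")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, attempts["count"], len(delays))
```

With many nodes, the waits add up, which is why question generation can take a while even with one question per chunk.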

In [None]:
%%time
qa_dataset = generate_question_context_pairs(nodes, num_questions_per_chunk=1)

Take a look at the sample queries generated. This should show meaningful questions related to your document content.

In [None]:
for item in list(qa_dataset.queries.items()):
    print(item[1])

## Instantiate a retriever against the index for testing



### Set the number of items to return from the Retriever
This is a trade-off: more returned content is not always better. Consider how this may impact your pipeline and evaluation results, and experiment with it.

In [None]:
# After the first run, set this to match your intended configuration based on your learning from Capstone II
# If you did not complete Capstone II then leave this as is

# number_of_items_to_return = 2
# number_of_items_to_return = 3
number_of_items_to_return = 4


In [None]:
retriever = vector_index.as_retriever(similarity_top_k=number_of_items_to_return)

Run a quick system test on the retriever and check that the output nodes look reasonable

In [None]:
example_query="""Based on Paul Graham's experience, why did he initially lose interest in studying philosophy
and switch to AI in college?"""

In [None]:
retrieved_nodes = retriever.retrieve(example_query)
print(retrieved_nodes)

## Evaluate the Quality of Retrieval from the VectorIndex

In [None]:
# This is a helper function to output the results of the evaluation

def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)
    mrr = full_df["mrr"].mean()
    precision = full_df["precision"].mean()
    recall = full_df["recall"].mean()

    metric_df = pd.DataFrame({"retrievers": [name], "mrr": [mrr],
                              "precision": [precision], "recall": [recall],
                             })
    return metric_df, full_df


Instantiate a RetrieverEvaluator with the metrics that we want to review

In [None]:
metrics = ["mrr", "precision", "recall"]

retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)
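
The three metrics requested above can be understood with a small hand-rolled calculation for a single query. The sketch below uses the usual definitions (reciprocal rank of the first relevant hit, and precision and recall over the retrieved set); it is an illustration with hypothetical node ids, not the LlamaIndex implementation.

```python
def retrieval_metrics(expected_ids, retrieved_ids):
    """Compute reciprocal rank, precision and recall for one query."""
    expected = set(expected_ids)

    # Reciprocal rank: 1 / position of the first relevant node (0 if none found)
    reciprocal_rank = 0.0
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected:
            reciprocal_rank = 1.0 / rank
            break

    hits = len(expected & set(retrieved_ids))
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(expected) if expected else 0.0
    return {"mrr": reciprocal_rank, "precision": precision, "recall": recall}


# One expected node, found in second place out of four retrieved nodes
metrics_example = retrieval_metrics(["node_3"], ["node_7", "node_3", "node_1", "node_9"])
print(metrics_example)
```

If each generated question has a single expected node, as appears to be the case for the dataset generated here, then precision for top-4 retrieval cannot exceed 0.25. Keep this in mind when comparing results across different values of `number_of_items_to_return`.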

In [None]:
# Evaluate on a single query 
# The output is verbose, but may be useful for looking at specific results

query_id = 1  # change this to match the query id of interest

sample_id, sample_query = list(qa_dataset.queries.items())[query_id]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

### Run evaluation on the entire test dataset (autogenerated above)

In [None]:
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

### Top-level Evaluation Results

In [None]:
summary, detail = display_results(f"top-{number_of_items_to_return} eval", eval_results)
summary

----
### This completes setting up the retriever and validating its level of accuracy

The following sections are the focus of this notebook.

----

## Automating Q&A Generation with LlamaIndex

LlamaIndex provides tools designed to automatically generate datasets when provided with a set of documents to query. In the example below, we use the **RagDatasetGenerator** class to generate evaluation questions and reference answers (ground truth) from the source documents, with the specified number of questions per node.

In [None]:
%pip install spacy

In [None]:
from llama_index.core.llama_dataset.generator import RagDatasetGenerator, LabelledRagDataset

In [None]:
from llama_index.core.llama_dataset import (
    LabelledRagDataset,
    CreatedBy,
    CreatedByType,
    LabelledRagDataExample,
    BaseLlamaDataset
)

In [None]:

dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm_mixtral8x7b,
    num_questions_per_chunk=1,  # set the number of questions per node
    show_progress=True,
)

print(f"Number of nodes created: {len(dataset_generator.nodes)}")


In [None]:
%%time
eval_questions = dataset_generator.generate_dataset_from_nodes()
eval_questions.to_pandas()

**Note**: The following cell saves the generated questions and answers to a JSON file so that we do not need to run
the question generation process above multiple times. 

In [None]:
eval_questions.save_json('eval_questions.json')
print(f"Saving {len(eval_questions.examples)} test cases")

Use the questions saved in the JSON file.

In [None]:
checkpointed_eval_questions = LabelledRagDataset.from_json('eval_questions.json')
print(f"Restoring {len(checkpointed_eval_questions.examples)} test cases")

In [None]:
# Convert the question set into a Pandas dataframe for ease of use for the cells that follow
eval_questions_df = checkpointed_eval_questions.to_pandas()

---

## RAG Automated Pipeline evaluation with LlamaIndex evaluators

In the sections below, we'll show three automated evaluations available through LlamaIndex. Additional out-of-the-box metrics can be found [here](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/):

1. **Faithfulness**: This metric verifies whether the final response is in agreement with (doesn't contradict) the retrieved document snippets.
2. **Relevancy**: This metric checks whether the response and retrieved content are relevant to the query.
3. **Correctness**: This metric evaluates whether the generated answer is relevant to, and agrees with, a reference answer.

In [None]:
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator

**Note**: The next cell configures the LLM to use as the evaluator (aka judge) of the output content from the RAG pipeline. 

For this, it is typical to use an LLM that has higher benchmark ratings than the LLM used for content generation.

In [None]:
# evaluator_llm = llm_mixtral8x7b
evaluator_llm = llm_sonnet_35

### Set up our default query engine for showing the baseline evaluation

**Note**: As the number of chunks (aka items) returned increases, the size of the prompt increases, and the smaller models may fail.

In [None]:
# KEY CELL #1

llm_model = llm_mixtral8x7b
# llm_model = llm_haiku_3
# llm_model = llm_sonnet_35

number_of_items_to_return = 3

query_engine = vector_index.as_query_engine(llm=llm_model, similarity_top_k=number_of_items_to_return)

In [None]:
faithfulness_evaluator = FaithfulnessEvaluator(llm=evaluator_llm)
relevancy_evaluator = RelevancyEvaluator(llm=evaluator_llm)
correctness_evaluator = CorrectnessEvaluator(llm=evaluator_llm)

---

### Faithfulness to source documents

The **Faithfulness** metric evaluates the consistency between the generated response and the source document snippets retrieved during the search process. This assessment is essential for identifying any discrepancies or hallucinations introduced by the LLM.


In [None]:
# Helper function for evaluating the faithfulness of the output of a specific test case

def evaluate_faithfulness_for_question(rag_engine, questions_df, question_number):

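    # Select the requested test case (column 0 holds the query text)
    eval_question = questions_df.iloc[question_number, 0]
    response_vector = rag_engine.query(eval_question)

    eval_result = faithfulness_evaluator.evaluate_response(response=response_vector)

    print("Question: ----------------")
    print_ww(eval_question)
    print("\nAnswer: ----------------")
    # Print the answer generated for this question, not the global `response` from earlier cells
    print_ww(str(response_vector))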
    print("\n----------------")

    print_ww("Evaluation Result:", eval_result.passing)
    print_ww(f"Reasoning:\n{eval_result.feedback}")

Take a look at this evaluation in action by seeing the content inputs and outputs for the evaluation

In [None]:
question_number = 0
evaluate_faithfulness_for_question(query_engine, eval_questions_df, question_number)

---

### Relevancy of response + source nodes to the query

The **Relevancy** metric verifies the correspondence between the response and the retrieved source documents with the user's query. This evaluation is crucial for assessing whether the response properly addresses the user's question.

The **Relevancy Evaluator** module is useful for measuring whether the response + source nodes match the query, and therefore whether the query was actually answered by the response. For example, if a question asks for the launch date of Amazon Bedrock Studio and the context information does not provide any details about it, the evaluation result is **FALSE**. 


In [None]:
# Helper function for evaluating the relevancy of the output of a specific test case

def evaluate_relevancy_for_question(rag_engine, questions_df, question_number):

    eval_question = questions_df.iloc[question_number,0] 
    response_vector = rag_engine.query(eval_question)

    eval_result = relevancy_evaluator.evaluate_response(
        query=eval_question, response=response_vector
    )

    # print results
    print("\n--------- Question ---------")
    print_ww(eval_question)
    print("\n--------- Response ---------")
    print_ww(str(response_vector))
    print("\n--------- Passed ---------")
    print_ww(str(eval_result.passing))
    print("\n--------- Feedback ---------")
    print_ww(str(eval_result.feedback))
    print("\n--------- Source ---------")
    print_ww(response_vector.source_nodes[0].node.get_content())

Testing the first generated evaluation question with the **RelevancyEvaluator** class.

In [None]:
question_number = 0
evaluate_relevancy_for_question(query_engine, eval_questions_df, question_number)

### Correctness of response for the query

The **Correctness** metric checks the correctness of a question answering system, relying on a provided reference answer("ground truth"), query, and response. It assigns a score from 1 to 5 (with higher values indicating better quality) alongside an explanation for the rating. 
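
The pass/fail flag reported alongside the score is derived from the numeric rating. A minimal sketch of that rule follows, assuming a pass threshold of 4.0. This is believed to be the default `score_threshold` of `CorrectnessEvaluator`, but treat it as an assumption and check the LlamaIndex documentation for your installed version. The `is_passing` helper is hypothetical.

```python
def is_passing(score, score_threshold=4.0):
    """Assumed rule: a response passes when its score meets the threshold."""
    return score >= score_threshold


# Scores are on the 1 to 5 scale described above
for score in (2.5, 4.0, 5.0):
    print(score, is_passing(score))
```

Because the mean score and the pass rate can tell different stories, it is worth looking at both when comparing configurations.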

In [None]:
# Helper function for evaluating the correctness of the output of a specific test case

def evaluate_correctness_for_question(rag_engine, questions_df, question_number):

    eval_question = questions_df.iloc[question_number, 0]
    ground_truth = questions_df.iloc[question_number, 2]

    response_vector = rag_engine.query(eval_question)
    generated_answer = str(response_vector)

    correctness_results = correctness_evaluator.evaluate(
                query=eval_question,
                response=generated_answer,
                reference=ground_truth
            )

    # print results
    print("\n--------- Question ---------")
    print_ww(eval_question)
    print("\n--------- Response ---------")
    print_ww(generated_answer)
    print("\n--------- Passed ---------")
    print_ww(str(correctness_results.passing))
    print("\n--------- Feedback ---------")
    print_ww(str(correctness_results.feedback))
    print("\n--------- Ground Truth ---------")
    print_ww(ground_truth)
    print("\n--------- Source ---------")
    print_ww(response_vector.source_nodes[0].node.get_content())

The following cell shows an example of the correctness_evaluator being applied to a specific question, by way of the `evaluate_correctness_for_question` function created above.

This function will be useful when you want to dive deeper into understanding why a test is not passing.

In [None]:
question_number= 0
evaluate_correctness_for_question(query_engine, eval_questions_df, question_number)

----

### Setup for run of the full test set

The following function presents the results in a dataframe

In [None]:
from llama_index.core import Response
import pandas as pd

# define jupyter display function
def display_eval_df(query: str, response: Response, eval_result: str) -> None:

    # Build the one-row dataframe directly (avoids the private DataFrame._append method)
    eval_df = pd.DataFrame(
        [
            {
                "Query": query,
                "Response": str(response),
                "Source": (
                    response.source_nodes[0].node.get_content()[:250] + "..."
                ),
                "Evaluation Result": eval_result,
            }
        ]
    )


    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "600px",
            "overflow-wrap": "break-word",
        },
        subset=["Response", "Source"]
    )
    display(eval_df)

In [None]:
# This helper function will run the full set of tests and return the results

from time import sleep
sleep_number = 30

def sleep_and_note_location(sec, loc):
    print(f"location: {loc}")
    sleep(sec)

def run_evaluations(evaluation_dataset: pd.DataFrame, query_engine, evaluator_model):
    """Run a batch evaluation on a set of questions and reference answers using a provided query engine.

    Args:
        evaluation_dataset (DataFrame): Questions ('query') and reference answers
            ('reference_answer', the ground truth) to evaluate.
        query_engine (BaseQueryEngine): The query engine to use for answering the questions.
        evaluator_model (LLM): The language model to use for evaluation.

    Returns:
        tuple: (evaluations_df, aggregate_results), where evaluations_df is a pd.DataFrame
            with the query, generated answer, ground truth, and the faithfulness, relevancy
            and correctness results per test case, and aggregate_results is a dict with the
            number of test cases and the mean score for each metric.
    """

    results_list = []
    faithfulness_evaluator = FaithfulnessEvaluator(llm=evaluator_model)
    relevancy_evaluator = RelevancyEvaluator(llm=evaluator_model)
    correctness_evaluator = CorrectnessEvaluator(llm=evaluator_model)

    #for question, ground_truth in zip(evaluation_questions, evaluation_ground_truth):
    for index, row in evaluation_dataset.iterrows():

        print(f"processing test case: {index + 1} / {len(evaluation_dataset)}")

        question = row['query']  
        ground_truth = row['reference_answer']
        
        response = query_engine.query(question)
        generated_answer = str(response)
        sleep_and_note_location(sleep_number, "faithfulness_evaluator")

        # Faithfulness evaluator
        faithfulness_results = faithfulness_evaluator.evaluate_response(response=response)
        sleep_and_note_location(sleep_number, "relevancy_evaluator")

        # RelevancyEvaluator evaluator
        relevancy_results = relevancy_evaluator.evaluate_response(query=question, response=response)
        sleep_and_note_location(sleep_number, "correctness_evaluator")
        
        # CorrectnessEvaluator evaluator
        correctness_results = correctness_evaluator.evaluate(
            query=question,
            response=generated_answer,
            reference=ground_truth
        )
        sleep_and_note_location(sleep_number, "end of iteration")

        current_evaluation = {
            "query": question,
            "generated_answer": generated_answer,
            "ground_truth": ground_truth,
            "faithfulness": faithfulness_results.passing,
            "faithfulness_feedback": faithfulness_results.feedback,
            "faithfulness_score": faithfulness_results.score,
            "relevancy": relevancy_results.passing,
            "relevancy_feedback": relevancy_results.feedback,
            "relevancy_score": relevancy_results.score,
            "correctness": correctness_results.passing,
            "correctness_feedback": correctness_results.feedback,
            "correctness_score": correctness_results.score,
        }
        results_list.append(current_evaluation)
        print(f"processed test case: {index + 1} / {len(evaluation_dataset)}")

    evaluations_df = pd.DataFrame(results_list)
    
    aggregate_results = {
        'number_of_test_cases': len(evaluations_df),
        'mean_faithfulness_score': round(evaluations_df['faithfulness_score'].mean(), 3),
        'mean_relevancy_score': round(evaluations_df['relevancy_score'].mean(), 3),
        'mean_correctness_score': round(evaluations_df['correctness_score'].mean(), 3)
    }

    return evaluations_df, aggregate_results
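
The aggregate means returned above summarize a run, but when investigating results it also helps to see which test cases failed each metric. The sketch below shows one way to do that on plain records shaped like the rows built in `run_evaluations`. The `failed_cases` helper and the sample records are hypothetical and for illustration only.

```python
def failed_cases(records, metric):
    """Return the queries whose result for `metric` is not passing."""
    return [r["query"] for r in records if not r[metric]]


# Hypothetical records with the same boolean fields as the evaluation rows
sample_records = [
    {"query": "Q1", "faithfulness": True, "relevancy": True, "correctness": False},
    {"query": "Q2", "faithfulness": True, "relevancy": False, "correctness": False},
    {"query": "Q3", "faithfulness": True, "relevancy": True, "correctness": True},
]

for metric in ("faithfulness", "relevancy", "correctness"):
    failures = failed_cases(sample_records, metric)
    pass_rate = 1 - len(failures) / len(sample_records)
    print(f"{metric}: pass rate {pass_rate:.2f}, failed: {failures}")
```

With a real results dataframe, the same records can be obtained with `evaluation_results_df.to_dict("records")`.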


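**Note**: The throttling delays implemented by Bedrock, together with the pauses added in `run_evaluations`, significantly slow the test process.

It may take 4 minutes or more for a single test case. With the default configuration, the pauses alone (four pauses of `sleep_number` = 30 seconds) account for at least 2 minutes per test case.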

Have a break while this is running and limit your runs to 10-30 test cases, except for final runs.
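
To plan your runs, the minimum wall-clock time can be estimated from the pauses in `run_evaluations` (four calls to `sleep_and_note_location` per test case, each of `sleep_number` = 30 seconds). The helper below is a hypothetical back-of-the-envelope calculation; it ignores the model calls themselves, so actual runs will take longer.

```python
def estimate_min_runtime_seconds(num_test_cases, pauses_per_case=4, pause_seconds=30):
    """Lower bound on run time: only the fixed pauses are counted."""
    return num_test_cases * pauses_per_case * pause_seconds


for n_cases in (3, 10, 30):
    seconds = estimate_min_runtime_seconds(n_cases)
    print(f"{n_cases} test cases: at least {seconds / 60:.0f} minutes")
```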

In [None]:
%%time

# KEY CELL: Running the configured evaluations with the generated test set

# Run evaluations for the first n rows of the generated test set only
n = 3
evaluation_results_df, aggregate_results = run_evaluations(eval_questions_df.head(n), query_engine, evaluator_llm)
evaluation_results_df

In [None]:
aggregate_results

----
## Pause

----



# Assignment Task #1: Baseline: Using your configuration and documents

Update the notebook to match your configuration from Capstone 2:

- Use your document set (rather than the canned/biographic dataset provided with this example)
- Use the embeddings model that was best for your document set
- Use the ingestion pipeline that was best for your document set

If for some reason you did not complete Capstone 2 but are completing Capstone 3, note this in your answer and use the content provided here.

Once you have updated your configuration, rerun the notebook to this point.

Answer the following questions in this cell:

1. What are the aggregate evaluation scores for your configuration?
2. Of the three evaluation measures, which one needs to be improved the most from your point of view and why?
3. Look at two of the failed test cases, using the evaluation functions that show the detailed outputs, and determine why each test case failed. For each of the two, note both the test case query and your reasoning as to its failure.


----
# Assignment Task #2: Experiment with the LLM for answer generation

Change the LLM model configuration for the query engine in the next two cells.
The notebook up to this point will have been using the `Claude Haiku 3` model. 

Change the configuration of the query engine to use the **one** or **two** other models that have been configured for use already.

Then rerun the evaluations with the set of generated test cases.
For each model review the difference in the aggregate score and in the quality of the output text.

Answer the following questions in this cell:

1. Which model was best for your content and what were its scores?
2. Which model was worst for your content and what were its scores?
3. Summarize the difference in output quality that you observed between the best and the worst performing LLM.
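
To keep the per-model scores side by side, you could collect them in one table. The sketch below is an illustration, not part of the notebook: `compare_models` is a hypothetical helper, and `evaluate` stands for any function you write that takes a model and returns a dictionary of aggregate scores.

```python
def compare_models(models, evaluate, sort_by=None):
    """Run `evaluate` for each labelled model and collect the scores.

    models:   dict mapping a label to a model object
    evaluate: callable taking a model and returning a dict of scores
    sort_by:  optional score name; when given, the best model comes first
    """
    rows = [{"model": label, **evaluate(model)} for label, model in models.items()]
    if sort_by is not None:
        rows.sort(key=lambda row: row[sort_by], reverse=True)
    return rows


# Possible use in this notebook (names taken from the cells above and below):
#
# def evaluate(llm):
#     engine = vector_index.as_query_engine(llm=llm, similarity_top_k=3)
#     _, aggregate = run_evaluations(eval_questions_df.head(n), engine, evaluator_llm)
#     return aggregate
#
# compare_models({"haiku_3": llm_haiku_3, "mixtral8x7b": llm_mixtral8x7b},
#                evaluate, sort_by="mean_correctness_score")
```

Sorting on a single named metric avoids averaging scores that may be on different scales.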


In [None]:
# KEY CELL 

query_engine_llm = llm_mixtral8x7b        # The `default` model for this notebook - used until you change this setting
# query_engine_llm = llm_mistral7b        
# query_engine_llm = llm_haiku_3         
# query_engine_llm = llm_sonnet_35

In [None]:
# KEY CELL

# After you update the query_engine configuration, then go back and re-run the test cases

query_engine = vector_index.as_query_engine(llm=query_engine_llm, similarity_top_k=3) 

-----
# Assignment Task #3: Experimenting with changing the prompt

The default prompt provided by Llamaindex works quite well, but you can almost certainly do better. 

1. The cells below will change the default prompt to one that works better with the default content. Read that updated prompt and suggest two reasons why it might perform better than the default prompt (refer back to the prompt engineering assignment).
2. Update the alternative prompt to better match the topic and goals of your RAG solution, and update the query_engine with the cells that follow. Then rerun the tests and experiment further to improve your prompt. It will help to look at the output for specific queries, to get a deeper sense of the changes driven by your prompt.
3. What are the final test metrics that you are getting for your configuration?
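
When you write your own prompt for step 2, it is easy to drop a template variable or to lose a space where two string pieces join. The sketch below is a small sanity check, not part of Llamaindex: `missing_placeholders` is a hypothetical helper, and it assumes your template needs the same two variables as the example prompt, `{context_str}` and `{query_str}`.

```python
from string import Formatter


def missing_placeholders(template, required=("context_str", "query_str")):
    """Return the required placeholder names that `template` does not contain."""
    found = {field for _, field, _, _ in Formatter().parse(template) if field}
    return [name for name in required if name not in found]


# Adjacent string literals are joined as-is, so each piece must end with
# the space or newline that should separate it from the next piece.
draft_prompt = (
    "Answer using only the context below.\n"
    "<context>\n{context_str}\n</context>\n"
    "Query: {query_str}\n"
    "Answer: "
)
print(missing_placeholders(draft_prompt))
print(repr(draft_prompt))  # inspect the joined string for missing separators
```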

### Take a look at the default prompt

In [None]:
from llama_index.core import PromptTemplate

In [None]:
# define prompt viewing function for the prompt we care about
prompt_template_key = "response_synthesizer:text_qa_template"

def get_response_synthesizer_text_qa_prompt(prompts_dict):
    """Return the text of the response synthesizer's QA prompt, or None if absent."""
    prompt = prompts_dict.get(prompt_template_key)
    return prompt.get_template() if prompt is not None else None

In [None]:
default_prompt = get_response_synthesizer_text_qa_prompt(query_engine.get_prompts())
print(default_prompt)

In [None]:
example_query="""Based on Paul Graham's experience, why did he initially lose interest in studying philosophy
and switch to AI in college?"""

In [None]:
response = query_engine.query(example_query)
print(response)

In [None]:
# Example alternate prompt

# The objective being, for this example, to get a more concise and clear answer
# Adjacent string literals are joined as-is, so each piece ends with a space or newline
new_text_qa_prompt_str = (
    "You are an expert book editor.\n"
    "Your task is to answer readers' questions using the given context, "
    "in a clear, concise, and friendly manner, in two or three sentences.\n"
    "If the answer to their question is not available from the context, "
    "reply that the question cannot be answered given the information that you have.\n"
    "Output the answer directly without a preamble "
    "(e.g. without saying `Based on the context,` or similar).\n"
    "<context>\n"
    "{context_str}\n"
    "</context>\n"
    "Given the context and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


In [None]:
# update the qa_prompt to the new prompt

query_engine.update_prompts(
    {prompt_template_key: PromptTemplate(new_text_qa_prompt_str)}
)

In [None]:
response = query_engine.query(example_query)
print(response)

**Note**: Once you have updated the query_engine with your prompt, re-run the tests.


-----
# Assignment Task #4: Wrapping Up

1. Do you think your customer would be satisfied with the results? If they were not, what would you offer to do?
2. We have been testing using 3 of the available end-to-end RAG evaluation methods supported by Llamaindex. Which one or two others might you also include for your customer, and why?
3. In two or three sentences, note how you might further experiment and improve your RAG pipeline, if your customer gave you more money to make it better.

## The following assignment tasks are completely optional 
The following tasks are intended for students who want to dive deeper. They are more open-ended and require changing and augmenting the code shared above.

**Optional Task 1** Configure the query_engine to use an LLM from another LLM service provider and re-run the tests.

Depending on the LLM service, you may be able to reduce the `sleep_number` used for throttling to speed up your testing.


**Optional Task 2** Add a reranking capability to the query engine

Adding an LLM Reranker to the query engine only requires a few lines of code. It can increase your solution's accuracy, and it may reduce inference costs when fewer retrieved chunks are passed to the answer-generation LLM.
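
The idea behind reranking can be sketched in plain Python: retrieve a broad set of candidates, rescore each one against the query with a stronger relevance judge, and keep only the best few. Everything below is illustrative: `rerank` and `word_overlap` are hypothetical names, and the word-overlap scorer is a toy stand-in for an LLM relevance judgment. In Llamaindex itself, the `LLMRerank` postprocessor in `llama_index.core.postprocessor` plays this role and can be passed to the query engine through `node_postprocessors`; check the documentation for your installed version.

```python
def rerank(query, candidates, score_fn, top_n=2):
    """Rescore retrieved candidates and keep the top_n most relevant."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Stable sort: ties keep their original retrieval order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def word_overlap(query, text):
    """Toy stand-in for an LLM relevance score: count of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Illustrative chunks only; a real run would rerank the retrieved nodes.
retrieved = [
    "He painted still lifes in Florence.",
    "He studied philosophy in college before switching to AI.",
    "The startup was funded in 2005.",
]
top = rerank("why switch from philosophy to AI", retrieved, word_overlap, top_n=1)
print(top)
```

Retrieving more candidates than you finally keep (a larger `similarity_top_k` than `top_n`) is what gives the reranker room to improve on the initial ordering.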