# AISA Capstone 3 Assignment

## Overview
This notebook provides an environment for you to build your intuition on the steps to take when developing a high quality Retrieval Augmented Generation (RAG) solution. 
RAG solutions retrieve data before calling the large language model (LLM) to generate an answer. 
The retrieved data is used to augment the prompt to the LLM by adding the relevant retrieved data in context. 
Any RAG solution is only as good as the quality of the data retrieval process. 
The AISA Capstone 2 assignment focused on retrieval accuracy for RAG.
This notebook, follows on directly from that assignment, to focus on generating high-quality answers to question, 
and systematically assessing the quality of generated output.

The RAG solution developed here is enabled by the Llamaindex framework. This is a popular framework in the industry for developing RAG and Agent based solutions. In addition to providing a core set of tools for orchestration of RAG and Agent workflows, there is broad integration with a variety of platforms for model inference (LLM, embedding, ...), and, importantly, tooling for solution evaluation.

## Prerequisites for running the notebook
- That you have granted access to the Bedrock models that you are going to use, in the region (**us-west-2**) where you are going to use Bedrock - 
[reference](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access-modify.html)
- Your SageMakerExecutionRole has permissions to invoke Bedrock models - 
[reference](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-prereq.html)
- This notebook has been tested with SageMaker Notebook Instance running a `conda_python3` kernel
- The AWS region set for Amazon Bedrock use, needs to be in a region where the models being used are 1/ available, and 2/ enabled for use. This notebook was tested with Bedrock region `us-west-2`

## Implementation
This notebook uses llamaindex to define and execute the RAG solution. We will be using the following tools:

- **LLM (Large Language Model)**: e.g. Anthropic Claude Haiku available through Amazon Bedrock

  LLMs are used in the notebook for 1/ RAG response generation, to show the overall RAG workflow in actions, and 2/ for generating test questions on the indexed content (llamaindex nodes) for retrieval evaluation.
  
- **Text Embeddings Model**: e.g. Amazon Titan Embeddings available through Amazon Bedrock

  This embedding model is used to generate semantic vector representations of the content (llamaindex nodes) to be stored and the questions input to the RAG solution.
  
- **Document Loader**: SimpleDirectoryReader (Llamaindex)

  Before your chosen LLM can act on your data you need to load it. The way LlamaIndex does this is via data connectors, also called 'Reader'. Data connectors ingest data from different data sources and format the data into Document objects. A Document is a collection of data (currently text, and in future, images and audio) and metadata about that data.
  
  This implementation use SimpleDirectoryReader, which creates documents out of every file in a given directory. It can read a variety of formats including Markdown, PDFs, Word documents, and PowerPoint decks.

- **Vector Store**: VectorIndex (Llamaindex)

  In this notebook we are using this in-memory vector-store to store both the embeddings and the documents. In an enterprise context this could be replaced with a persistent store such as AWS OpenSearch, RDS Postgres with pgVector, ChromaDB, Pinecone or Weaviate.
  
  LlamaIndex abstracts the underlying vector database storage implementation with a VectorIndex class. This warps the Index, which is a data structure composed of Document objects, designed to enable querying by an LLM. The Index is designed to be complementary to your querying strategy.

----

Install required Python modules for constructing the RAG solution.
You only need to run this once. 

Don't stress if you see an error in the output of the `pip install`. While this is concerning, it will likely not effect the functioning of the notebook.

In [1]:
%pip install \
    llama-index \
    llama-index-llms-bedrock \
    llama-index-embeddings-bedrock

Note: you may need to restart the kernel to use updated packages.


## Section 1: Setting up the baseline configuration with some sample content

Download the default RAG test source data to our target source_docs directory. 
You only need to run this once.

In [2]:
source_docs_dir = './source_docs/'

The following creates the source_docs directory and downloads a document to that directory. The contents of this directory, 
initially the document that is downloaded here, will be used in the steps that follow.

After running this notebook in its entirity and reviewing its operation, delete this content and add your own content to the directory.

In [3]:
# Download and load data
!mkdir -p {source_docs_dir}
!wget --no-check-certificate 'https://www.buhurtinternational.com/_files/ugd/d219c5_5623c98ecdd142c8a7221c2d00cb621a.pdf' -O {source_docs_dir}'/buhurt_armor_requirement.pdf'
!wget --no-check-certificate 'https://www.buhurtinternational.com/_files/ugd/d219c5_afd49fe587524088b3136c910bb2ca51.pdf' -O {source_docs_dir}'/buhurt_weapon_requirement.pdf'

--2025-02-25 10:54:56--  https://www.buhurtinternational.com/_files/ugd/d219c5_5623c98ecdd142c8a7221c2d00cb621a.pdf
Resolving www.buhurtinternational.com (www.buhurtinternational.com)... 34.149.87.45
Connecting to www.buhurtinternational.com (www.buhurtinternational.com)|34.149.87.45|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12555242 (12M) [application/pdf]
Saving to: ‘./source_docs//buhurt_armor_requirement.pdf’


2025-02-25 10:55:00 (3.39 MB/s) - ‘./source_docs//buhurt_armor_requirement.pdf’ saved [12555242/12555242]

--2025-02-25 10:55:00--  https://www.buhurtinternational.com/_files/ugd/d219c5_afd49fe587524088b3136c910bb2ca51.pdf
Resolving www.buhurtinternational.com (www.buhurtinternational.com)... 34.149.87.45
Connecting to www.buhurtinternational.com (www.buhurtinternational.com)|34.149.87.45|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 711599 (695K) [application/pdf]
Saving to: ‘./source_docs//buhurt_weapon_requiremen

Import auxilliarty modules

In [4]:
import logging
import sys
import os
import pandas as pd
import boto3  # AWS SDK for Python

In [5]:
# This is an output to screen helper method for making some output more easy to read.

import textwrap
from io import StringIO

def print_ww(*args, width: int = 100, **kwargs):
    """Like print(), but wraps output to `width` characters (default 100)"""
    buffer = StringIO()
    try:
        _stdout = sys.stdout
        sys.stdout = buffer
        print(*args, **kwargs)
        output = buffer.getvalue()
    finally:
        sys.stdout = _stdout
    for line in output.splitlines():
        print("\n".join(textwrap.wrap(line, width=width)))

In [6]:
# This is required when running within a jupyter notebook, otherwise you will get errors when llamaindex modules run
import nest_asyncio

nest_asyncio.apply()

Import required Python modules for constructing and evaluating the RAG solution

In [7]:
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.llms.bedrock import Bedrock

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.text_splitter import TokenTextSplitter


from llama_index.core.evaluation import (
    DatasetGenerator,
    RetrieverEvaluator,
    generate_question_context_pairs,
)

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Response,
)

## Configure the models that will be used for the RAG pipeline

**Note**: By default this notebook with use the `us-west-2` region. This region has support for the models used in this notebook. You should not need to change this setting.

In [8]:
# AWS_REGION = "us-west-2"
AWS_REGION = "us-east-1"  # this is an alternative setting to use if desired 

Define the set of Bedrock model IDs that we that we'll use when developing and testing our solution 

Establish a connection to the Amazon Bedrock service

In [9]:
boto3_bedrock = boto3.client("bedrock-runtime")

### Configure the target embeddings models for use with Llamaindex

In [10]:
titan_text_embeddings_multilingual_v1_id = "amazon.titan-embed-text-v1"
titan_text_embeddings_multilingual_v2_id = "amazon.titan-embed-text-v2:0"
cohere_text_embeddings_english_id = "cohere.embed-english-v3"
cohere_text_embeddings_multilingual_id = "cohere.embed-multilingual-v3"

Configure our chosen embeddings model for use with llama_index

In [11]:
titan_text_embeddings_v2 = BedrockEmbedding(model=titan_text_embeddings_multilingual_v2_id,region_name=AWS_REGION)
titan_text_embeddings_v1 = BedrockEmbedding(model=titan_text_embeddings_multilingual_v1_id,region_name=AWS_REGION)
cohere_text_embeddings_english = BedrockEmbedding(model=cohere_text_embeddings_english_id,region_name=AWS_REGION)
cohere_text_embeddings_multilingual= BedrockEmbedding(model=cohere_text_embeddings_english_id,region_name=AWS_REGION)

### Configure the target LLMs for use with Llamaindex

The following Mistral models can be used to produce questions for evaluation. 
The Titan model produces questions of lesser quality and sometimes not in the format needed by the tools. 

**Note** Most Bedrock LLMs do not *produce test questions* in a format that can be directly used for evaluation with the tooling as it is configured in this notebook.

In [12]:
instruct_mistral7b_id = "mistral.mistral-7b-instruct-v0:2"
instruct_mixtral8x7b_id = "mistral.mixtral-8x7b-instruct-v0:1"
titan_text_express_id = "amazon.titan-text-express-v1"
claude_haiku_3_id = "anthropic.claude-3-haiku-20240307-v1:0"
claude_sonnet_35_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"

In [13]:
# set the parameters to be applied when invoking the model
model_kwargs_llm = {
    "temperature": 0.1,
    "top_k": 200,
    "max_tokens": 4096
}

### NOTE: This notebook uses two additional LLMs !!
You will need to enable use to the following models in the Bedrock console
- Anthropic Claude Haiku 3
- Anthropic Clause Sonnet 3.5

This is in addition to the Mistral and Titan models used in Capstone 2.

If these are no enabled you will encounter errors later.

In [14]:
llm_mistral7b = Bedrock(model=instruct_mistral7b_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)
llm_mixtral8x7b = Bedrock(model=instruct_mixtral8x7b_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)
llm_titan_express = Bedrock(model=titan_text_express_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)
llm_haiku_3 = Bedrock(model=claude_haiku_3_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)
llm_sonnet_35 = Bedrock(model=claude_sonnet_35_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)

### Use the following cell to configure the embeddings model to use for the cells that follow

The embeddings model is a critical choice for the accuracy of your RAG solution.
Experiment with the options here to see which is best for your content.
If you want more, test with further alternatives. There are many that are readily supported by llama_index.

In [15]:
# After the first run, set this to match you intended configuration based on your learning from Capstone II

# embed_model = titan_text_embeddings_v1
# embed_model = titan_text_embeddings_v2
embed_model = cohere_text_embeddings_english
# embed_model = cohere_text_embeddings_multilingual

### Use the following cell to configure the LLM to use for the cells that follow
The LLM will be used for question generation and RAG answer generation in this notebook as it is currently configured.
The default value llm_mistral7b works well with the code and should be used if possible. 

In [16]:
llm_model = llm_mistral7b
# llm_model = llm_mixtral8x7b
# llm_model = llm_titan_express
# llm_model = llm_haiku_3

In [17]:
# Set LlamaIndex default model settings to what was set in the cells above
Settings.embed_model = embed_model
Settings.llm = llm_model

## Read in the documents for adding to our data store

Read in the documents in the 'data/source_docs' directory into a structure ready for use by llama_index

In [18]:
reader = SimpleDirectoryReader(source_docs_dir)
documents = reader.load_data()

Quick check here to see that all of your documents were read. The count should match the number of pages in the documents in source_docs

In [19]:
len(documents)

25

## Create and run the document ingestion pipeline

The following cell defines two different document ingestion pipelines. 
If you have time, test using both of these, and create you own and test with that also.

In [20]:
# Define two transformation for the ingestion pipelines for initial experimentation

transformations_00=[
        TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=100),
        embed_model,
    ]

transformations_01=[
        SentenceSplitter(chunk_size=512, chunk_overlap=100),
        TitleExtractor(),
        embed_model,
    ]

transformations_02=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
        embed_model,
    ]


### Use the following cell to configure the data ingestion pipeline for processing the source data

In [21]:
# After the first run, set this to match you intended configuration based on your learning from Capstone II

# pipeline = IngestionPipeline(transformations=transformations_00)
# pipeline = IngestionPipeline(transformations=transformations_01)
pipeline = IngestionPipeline(transformations=transformations_02)



### Run the configured ingestion pipeline 

In [22]:
# run the pipeline
nodes = pipeline.run(documents=documents)
print(f"number of nodes: {len(nodes)}")

100%|██████████| 1/1 [00:00<00:00,  1.51it/s]
100%|██████████| 1/1 [00:00<00:00,  3.17it/s]
100%|██████████| 1/1 [00:00<00:00,  2.22it/s]
100%|██████████| 1/1 [00:00<00:00,  1.71it/s]
100%|██████████| 1/1 [00:00<00:00,  2.73it/s]
100%|██████████| 1/1 [00:00<00:00,  3.07it/s]
100%|██████████| 1/1 [00:00<00:00,  3.78it/s]
100%|██████████| 1/1 [00:00<00:00,  2.93it/s]
100%|██████████| 1/1 [00:00<00:00,  2.05it/s]
100%|██████████| 1/1 [00:00<00:00,  2.70it/s]
100%|██████████| 1/1 [00:00<00:00,  2.89it/s]
100%|██████████| 1/1 [00:00<00:00,  1.57it/s]
100%|██████████| 1/1 [00:00<00:00,  1.72it/s]
100%|██████████| 1/1 [00:00<00:00,  2.14it/s]
100%|██████████| 1/1 [00:01<00:00,  1.15s/it]
100%|██████████| 1/1 [00:00<00:00,  1.30it/s]
100%|██████████| 1/1 [00:00<00:00,  4.46it/s]
100%|██████████| 1/1 [00:01<00:00,  1.28s/it]
100%|██████████| 1/1 [00:00<00:00,  2.29it/s]
100%|██████████| 1/1 [00:00<00:00,  2.68it/s]
100%|██████████| 1/1 [00:01<00:00,  1.85s/it]
100%|██████████| 1/1 [00:00<00:00,

number of nodes: 25


This may make test analysis easier. It is non-essential

In [23]:
# By default, the node ids are set to random uuids. 
# To ensure same id's per run, we manually set them to consistent sequential numbers.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

In [24]:
# validate that node has an embedding associated with it
for idx, node in enumerate(nodes):
    if node.id_ == "node_0":
        print(node.embedding)

[0.314453125, -0.0272216796875, 0.3125, -0.341796875, 0.0303955078125, 0.01043701171875, -0.035400390625, 0.0005035400390625, 0.00592041015625, -0.005889892578125, -0.251953125, 0.421875, -0.0233154296875, 0.2158203125, -0.044677734375, 0.064453125, 0.126953125, 0.8203125, -0.056396484375, 0.1201171875, 0.1328125, 0.142578125, -0.169921875, 0.326171875, 0.21875, -0.40234375, 0.01556396484375, 0.1826171875, -0.1962890625, 0.0458984375, -0.7890625, -0.4375, -0.0225830078125, -0.7109375, 0.185546875, 0.01275634765625, -0.0189208984375, 0.234375, 0.2421875, -0.00677490234375, 0.302734375, -0.16796875, 0.050048828125, -0.431640625, -0.26171875, -0.1923828125, 0.275390625, 0.1845703125, 0.3515625, 0.06787109375, 0.310546875, 0.201171875, 0.27734375, 0.1494140625, -0.103515625, 0.07421875, 0.064453125, 0.1845703125, -0.70703125, -0.2041015625, -0.01470947265625, 0.068359375, -0.05615234375, 0.3515625, -0.08837890625, 0.064453125, 0.375, -0.3046875, 0.28125, 0.224609375, 0.095703125, -0.277343

## Create the VectorIndex 
This creates our vector database, in memory in this case,  using the nodes that were created in the previous step

In [25]:
vector_index = VectorStoreIndex(nodes=nodes)

## Test that we have a valid starting point for our evaluation
We run a quick system test with the defaul llama_index RAG workflow with a question that is relevant to our dataset

Instantiate a query engine object

In [26]:
query_engine = vector_index.as_query_engine(llm=llm_mistral7b)

Specify a question that has can be answered by the document(s) that have been ingested.

In [72]:
example_query="What are the historical consistency requirements of an armor?"

Run the default RAG pipeline with the example query. This should give a meaningful result. Don't worry if the answer is overly verbose, etc. We'll fix that later.

In [28]:
response = query_engine.query(example_query)
print(response)

 An armor must align with historical sources and adhere to specific styles and time periods. It should not contain any indications of modern materials or manufacturing techniques. The armor's pieces, including shields and weapons, must consist of pieces from the same style. Styles are defined by their historical context and geographical location, such as Western European, Slavic, or Eastern influence. For Western European styles, there are distinct periods like the 14th, 15th, and transitional centuries. Prohibited features include neon colors, obvious nylon cords, plastic ties, visible welded seams, heat-induced discoloration, modern footwear, and other visible modern equipment.


----

# Evaluate the retrieval accuracy of the VectorIndex

## Create a set of question and node (context) pairs to drive the tests that follow
This uses the llm that give the methods and the document data stored in the nodes (created during document ingestion)

This will make many calls to the specified LLM (num_questions_per_chunk * number of nodes). This will likely be throttled by Bedrock. The llama_index API will work through the throttling except in extreme cases.

In [29]:
%%time
qa_dataset = generate_question_context_pairs(nodes, num_questions_per_chunk=1)

100%|██████████| 25/25 [00:39<00:00,  1.59s/it]

CPU times: user 124 ms, sys: 11.1 ms, total: 135 ms
Wall time: 39.8 s





Take a look at the sample queries generated. This should show a meaningful questions related to your document content.

In [30]:
for item in list(qa_dataset.queries.items()):
    print(item[1])

Based on the context information given, here is a question that could be asked for an upcoming quiz or examination:
What are the historical consistency requirements for armor equipment according to the given document?
In what categories does the document outline the use of armors, shields, and weapons in combat?
Which historical period does the 14th century Western European style of armor belong to?
If a competitor intends to use armor that deviates from the specified styles, what step must they take to ensure its validity according to the rules of Buhurt International?
What are the two main responsibilities of every competitor regarding their equipment in full contact fights according to the given context?
What are the minimum requirements for a helmet as stated in the context document, regarding protection and authenticity?
What material is recommended for the dome and visor of a helmet according to the given document?
What is the minimum recommended thickness for mild steel plates u

## Instantiate a retriever against the index for testing



### Set the number of items to return from the Retriever
This is a trade-off item, more returned content is not always better. Consider how this may impact your pipeline and evaluation results and experiment with it.

In [31]:
# After the first run, set this to match you intended configuration based on your learning from Capstone II
# If you did not complete Capstone II then leave this as is


number_of_items_to_return = 2
# number_of_items_to_return = 3
# number_of_items_to_return = 4


In [32]:
retriever = vector_index.as_retriever(similarity_top_k=number_of_items_to_return)

Run a quick system test on the retriever and check that the output nodes look reasonable

In [33]:
retrieved_nodes = retriever.retrieve(example_query)
print(retrieved_nodes)

[NodeWithScore(node=TextNode(id_='node_3', embedding=None, metadata={'page_label': '4', 'file_name': 'buhurt_armor_requirement.pdf', 'file_path': '/home/ec2-user/SageMaker/elvtr-ai-solution-architect/class-14/source_docs/buhurt_armor_requirement.pdf', 'file_type': 'application/pdf', 'file_size': 12555242, 'creation_date': '2025-02-25', 'last_modified_date': '2024-12-11', 'document_title': ' "Historical Guidelines for Authenticity and Technical Requirements of Armor and Equipment in Western Europe (14th-17th Centuries) with Slavic and Eastern Influences (14th-17th Centuries)"'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='47874cfe-b2ef-47ff-a9e0-662a808695be', node_type='4', metadata={'page_label': '4', 

## Evaluate the Quality of Retrieval from the VectorIndex

In [34]:
# This is a helper function to output the results of the evaluation

def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)
    mrr = full_df["mrr"].mean()
    precision = full_df["precision"].mean()
    recall = full_df["recall"].mean()

    metric_df = pd.DataFrame({"retrievers": [name], "mrr": [mrr],
                              "precision": [precision], "recall": [recall],
                             })
    return metric_df, full_df


Instantiate a RetrieverEvaluator with the metrics that we want to review

In [35]:
metrics = ["mrr", "precision", "recall"]

retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)

In [36]:
# Evaluate on a single query
# The output is verbose, but may be useful for looking at specific results

query_id = 3  # change this to math the query id of interest

sample_id, sample_query = list(qa_dataset.queries.items())[query_id]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

Query: Which historical period does the 14th century Western European style of armor belong to?
Metrics: {'mrr': 1.0, 'precision': 0.5, 'recall': 1.0}



### Run evaulation on the entire test dataset (autogenerated above)

In [37]:
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

### Top-level Evaluation Results

In [38]:
summary, detail = display_results(f"top-{number_of_items_to_return} eval", eval_results)
summary

Unnamed: 0,retrievers,mrr,precision,recall
0,top-2 eval,0.78,0.44,0.88


----
### This completes setting up the retriever and validating it's level of accuracy

The following sections are the focus of this notebook.

----

## Automating Q&A Generation with LllamaIndex

LllamaInex provides tools designed to automatically generate datasets when provided with a set of documents to query. In the example below, we use the **RagDatasetGenerator** class to generate evaluation questions and reference answers(ground truth) from the source documents and the specified number of questions per node.

In [39]:
%pip install spacy

Note: you may need to restart the kernel to use updated packages.


In [40]:
from llama_index.core.llama_dataset.generator import RagDatasetGenerator, LabelledRagDataset

In [41]:
from llama_index.core.llama_dataset import (
    LabelledRagDataset,
    CreatedBy,
    CreatedByType,
    LabelledRagDataExample,
    BaseLlamaDataset
)

In [42]:

dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm_mixtral8x7b,
    num_questions_per_chunk=1, # set the number of questions per nodes
    show_progress=True,
)

print(f"Number of nodes created: {len(dataset_generator.nodes)}")


Parsing nodes:   0%|          | 0/25 [00:00<?, ?it/s]

Number of nodes created: 25


In [43]:
%%time
eval_questions = dataset_generator.generate_dataset_from_nodes()
eval_questions.to_pandas()

100%|██████████| 25/25 [00:39<00:00,  1.59s/it]
100%|██████████| 1/1 [00:03<00:00,  3.06s/it]
100%|██████████| 1/1 [00:10<00:00, 10.47s/it]
100%|██████████| 1/1 [00:03<00:00,  3.34s/it]
100%|██████████| 1/1 [00:04<00:00,  4.26s/it]
100%|██████████| 1/1 [00:01<00:00,  1.84s/it]
100%|██████████| 1/1 [00:02<00:00,  2.63s/it]
100%|██████████| 1/1 [00:03<00:00,  3.15s/it]
100%|██████████| 1/1 [00:02<00:00,  2.97s/it]
100%|██████████| 1/1 [00:02<00:00,  2.42s/it]
100%|██████████| 1/1 [00:03<00:00,  3.09s/it]
100%|██████████| 1/1 [00:01<00:00,  1.02s/it]
100%|██████████| 1/1 [00:02<00:00,  2.47s/it]
100%|██████████| 1/1 [00:05<00:00,  5.03s/it]
100%|██████████| 1/1 [00:02<00:00,  2.39s/it]
100%|██████████| 1/1 [00:02<00:00,  2.74s/it]
100%|██████████| 1/1 [00:03<00:00,  3.97s/it]
100%|██████████| 1/1 [00:03<00:00,  3.76s/it]
100%|██████████| 1/1 [00:04<00:00,  4.22s/it]
100%|██████████| 1/1 [00:01<00:00,  1.04s/it]
100%|██████████| 1/1 [00:01<00:00,  1.33s/it]
100%|██████████| 1/1 [00:05<00:0

CPU times: user 584 ms, sys: 69.9 ms, total: 654 ms
Wall time: 2min 3s





Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,"Question: According to the ""buhurt_armor_requi...",[],"To provide an accurate answer, I would need a...",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
1,"Question: Based on the ""Technical Requirements...",[Technical RequirementsforArmor\nTableofConten...,"The ""Technical Requirements for Armor"" docume...",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
2,"""Can you explain the role and responsibility o...",[1. IntroductionandDefinitions\nThisdocument w...,Based on the introduction of the document rel...,ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
3,"""Explain the prohibited features for armors an...",[2. Historical ConsistencyRequirements\n2.1.Eq...,"According to the given document, prohibited f...",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
4,"""What is the process for competitors to deviat...",[2.4.4.Authorizeddeviations● Competitors may d...,Competitors can deviate from the specified st...,ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
5,"For this quiz/examination, your task is to ans...",[3. General Requirements\n3.1.Competitorrespon...,Which body parts must be protected by the arm...,ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
6,"Question: According to the ""BuHurt Armor Requi...",[4. Helmetsrequirements\n4.1.Protection● Ahelm...,"According to the ""BuHurt Armor Requirement"" d...",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
7,"For the upcoming quiz/examination, one questio...",[4.2.Thicknessandcomposition● Domeandvisormust...,"Sure, here's an example question that could b...",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
8,"For the upcoming quiz, one question based on t...",[5. Armorrequirements\n5.1.Metalthicknessrecom...,"Sure, here's an example of a potential quiz q...",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
9,"For the upcoming quiz/examination, one questio...","[5.6.Joints● All joints (shoulders, elbows, kn...","Sure, here's an example question that could b...",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)


**Note**: The following cell saves the generated question and answers to a JSON file and so that we do not need to run
the question generation process above multiple times. 

In [44]:
eval_questions.save_json('eval_questions.json')
print(f"Saving {len(eval_questions.examples)} test cases")

Saving 25 test cases


Use the questions saved in the JSON file.

In [45]:
checkpointed_eval_questions = LabelledRagDataset.from_json('eval_questions.json')
print(f"Restoring {len(checkpointed_eval_questions.examples)} test cases")

Restoring 25 test cases


In [46]:
# Convert the question set into a Pandas dataframe for ease of use for the cells that follow
eval_questions_df = checkpointed_eval_questions.to_pandas()

---

## RAG Automated Pipeline evaluation with LlamaIndex evaluators

In the sections below, we'll show 4 automated evaluations available throught LlamaIndex. However, there are some additional metrics out-of-the-box that can be found [here](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/):

1. **Faithfulness**: This metric verifies whether the final response is in agreement with (doesn't contradict) the retrieved document snippets.
2. **Relevancy**: This metrics checks whether the response and retrieved content were relevant to the query.
3. **Correctness**: This metric evaluates whether the generated answer is relevant and agreeing with a reference answer.

In [47]:
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator

**Note**: Configuring the LLM to use as the evaluator (aka Judge) of the output content from the RAG pipeline. 

For this, it is typical to use an LLM that has higher benchmark ratings than the LLM used for content generation.

In [48]:
evaluator_llm = llm_mixtral8x7b
# evaluator_llm = llm_sonnet_35

### Set up our default query engine for showing the baseline evaluation

**Note** as the number of chunks (aka items) returned increased the size of the prompt increases and the smaller models may fail.

In [49]:
# KEY CELL #1

llm_model = llm_mixtral8x7b
# llm_model = llm_haiku_3
# llm_model = llm_sonnet_35

number_of_items_to_return = 3

query_engine = vector_index.as_query_engine(llm=llm_model, similarity_top_k=number_of_items_to_return)

In [50]:
faithfulness_evaluator = FaithfulnessEvaluator(llm=evaluator_llm)
relevancy_evaluator = RelevancyEvaluator(llm=evaluator_llm)
correctness_evaluator = CorrectnessEvaluator(llm=evaluator_llm)

---

### Faithfulness to source documents

The **Faithfulness** metric evaluates the coherence between the generated response and the source document snippets retrieved during the search process. This assessment is useful for identifying any discrepancies or hallucinations introduced by the LLM


In [51]:
# Helper function for evaluating the faithfulness of the output of a specific test case

def evaluate_faithfulness_for_question(rag_engine, questions_df, question_number):

    eval_question = questions_df.iloc[0,0]
    response_vector = rag_engine.query(eval_question)

    eval_result = faithfulness_evaluator.evaluate_response(response=response_vector)

    print("Question: ----------------")
    print_ww(eval_question)
    print("\nAnswer: ----------------")
    print_ww(response)
    print("\n----------------")

    print_ww("Evaluation Result:", eval_result.passing)
    print_ww(f"Reasoning:\n{eval_result.feedback}")

Take a look at this evaluation in action by seeing the content inputs and outputs for the evaluation

In [52]:
question_number = 0
evaluate_faithfulness_for_question(query_engine, eval_questions_df, question_number)

Question: ----------------
Question: According to the "buhurt_armor_requirement.pdf" document, what is the minimum thickness
requirement for the lower leg protection armor, as specified on page 1?

Answer: ----------------
 An armor must align with historical sources and adhere to specific styles and time periods. It
should not contain any indications of modern materials or manufacturing techniques. The armor's
pieces, including shields and weapons, must consist of pieces from the same style. Styles are
defined by their historical context and geographical location, such as Western European, Slavic, or
Eastern influence. For Western European styles, there are distinct periods like the 14th, 15th, and
transitional centuries. Prohibited features include neon colors, obvious nylon cords, plastic ties,
visible welded seams, heat-induced discoloration, modern footwear, and other visible modern
equipment.

----------------
Evaluation Result: True
Reasoning:
 For the first piece of information

---

### Relevancy of response + source nodes to the query

The **Relevancy** metric verifies the correspondence between the response and the retrieved source documents with the user's query. This evaluation is useful for assessing whether the response properly addresses the user's question.

The **Relevancy Evaluator** module is useful to measure if the response + source nodes match the query. Therefore, it helps measuring if the query was actually answered by the response. In this example, as the context information does not provide any details about the launch date of Amazon Bedrock Studio, then the evaluation result is **FALSE**. 


In [53]:
# Helper function for evaluating the relevancy of the output of a specific test case

def evaluate_relevancy_for_question(rag_engine, questions_df, question_number):

    eval_question = questions_df.iloc[question_number,0] 
    response_vector = rag_engine.query(eval_question)

    eval_result = relevancy_evaluator.evaluate_response(
        query=eval_question, response=response_vector
    )

    # print results
    print("\n--------- Question ---------")
    print_ww(eval_question)
    print("\n--------- Response ---------")
    print_ww(str(response_vector))
    print("\n--------- Passed ---------")
    print_ww(str(eval_result.passing))
    print("\n--------- Feedback ---------")
    print_ww(str(eval_result.feedback))
    print("\n--------- Source ---------")
    print_ww(response_vector.source_nodes[0].node.get_content())

Testing the first generated evaluation question with the **RelevancyEvaluator** class.

In [54]:
question_number = 0
evaluate_relevancy_for_question(query_engine, eval_questions_df, question_number)


--------- Question ---------
Question: According to the "buhurt_armor_requirement.pdf" document, what is the minimum thickness
requirement for the lower leg protection armor, as specified on page 1?

--------- Response ---------
 The context information provided does not include page 1 of the "buhurt_armor_requirement.pdf"
document, and therefore, I cannot provide the minimum thickness requirement for the lower leg
protection armor as specified on page 1. However, the context information does provide the minimum
thickness recommendations for various armor materials in section 5.1, which I have included in my
answer for reference. The minimum recommended thickness for steel plates of an armor is 1.5mm (mild
steel) or 0.8mm (tempered steel, hardened steel, stainless steel). The minimum recommended thickness
for titanium plates of an armor is 0.8mm.

--------- Passed ---------
False

--------- Feedback ---------
 NO, the response for the query is not in line with the context information 

### Correctness of response for the query

The **Correctness** metric checks the correctness of a question answering system, relying on a provided reference answer("ground truth"), query, and response. It assigns a score from 1 to 5 (with higher values indicating better quality) alongside an explanation for the rating. 

In [55]:
# Helper function for evaluating the relevancy of the output of a specific test case

def evaluate_correctness_for_question(rag_engine, questions_df, question_number):

    eval_question = questions_df.iloc[question_number, 0]
    ground_truth = questions_df.iloc[question_number, 2]

    response_vector = rag_engine.query(eval_question)
    generated_answer = str(response_vector)

    correctness_results = correctness_evaluator.evaluate(
                query=eval_question,
                response=generated_answer,
                reference=ground_truth
            )

    # print results
    print("\n--------- Question ---------")
    print_ww(eval_question)
    print("\n--------- Response ---------")
    print_ww(generated_answer)
    print("\n--------- Passed ---------")
    print_ww(str(correctness_results.passing))
    print("\n--------- Feedback ---------")
    print_ww(str(correctness_results.feedback))
    print("\n--------- Ground Truth ---------")
    print_ww(ground_truth)
    print("\n--------- Source ---------")
    print_ww(response_vector.source_nodes[0].node.get_content())

The following cell shows an example of the correctness_evaluator being applied to a specific question. this is by way of the `evaluate_correctness_for_question` function created above.

This function will be useful when you want to dive deeper into understanding why a test is not passing.

In [56]:
question_number= 0
evaluate_correctness_for_question(query_engine, eval_questions_df, question_number)


--------- Question ---------
Question: According to the "buhurt_armor_requirement.pdf" document, what is the minimum thickness
requirement for the lower leg protection armor, as specified on page 1?

--------- Response ---------
 The context information provided does not include page 1 of the "buhurt_armor_requirement.pdf"
document, and therefore, I cannot provide the minimum thickness requirement for the lower leg
protection armor as specified on page 1. However, based on the information provided in the context,
the minimum thickness recommendations for steel plates of an armor are 1.5mm for mild steel and
0.8mm for tempered steel, hardened steel, or stainless steel.

--------- Passed ---------
False

--------- Feedback ---------
The generated answer is somewhat relevant to the user query as it provides minimum thickness
recommendations for certain types of steel plates. However, it does not answer the user's question
about the minimum thickness requirement for the lower leg protecti

----

### Setup for run of the full test set

The following function presents the results in a dataframe

In [57]:
from llama_index.core import Response
import pandas as pd

# define jupyter display function
def display_eval_df(query: str, response: Response, eval_result: str) -> None:

    eval_df = pd.DataFrame(columns=['Query', 'Response', 'Source', 'Evaluation Result'])
        
    new_record = {
                    "Query": query,
                    "Response": str(response),
                    "Source": (
                        response.source_nodes[0].node.get_content()[:250] + "..."
                    ),
                    "Evaluation Result": eval_result,
                }
    eval_df = eval_df._append(new_record, ignore_index=True)


    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "600px",
            "overflow-wrap": "break-word",
        },
        subset=["Response", "Source"]
    )
    display(eval_df)

In [58]:
# This helper function will run the full set of tests and return the results

from time import sleep
sleep_number = 30

def sleep_and_note_location(sec, loc):
    print(f"location: {loc}")
    sleep(sec)

def run_evaluations(evaluation_dataset: pd.DataFrame, query_engine, evaluator_model):
    """Run a batch evaluation on a list of questions and reference answers using a provided query engine.

    Args:
        evaluation_dataset (DataFrame): A list of questions and reference_answers(ground truth) to evaluate.
        query_engine (BaseQueryEngine): The query engine to use for answering the questions.
        evaluator_model (LLM): The language model to use for evaluation.

    Returns:
        pd.DataFrame: A DataFrame containing the evaluation results, including the query,
            generated answer, faithfulness evaluation, and relevancy evaluation.
    """

    results_list = []
    faithfulness_evaluator = FaithfulnessEvaluator(llm=evaluator_model)
    relevancy_evaluator = RelevancyEvaluator(llm=evaluator_model)
    correctness_evaluator = CorrectnessEvaluator(llm=evaluator_model)

    #for question, ground_truth in zip(evaluation_questions, evaluation_ground_truth):
    for index, row in evaluation_dataset.iterrows():

        print(f"processing test case: {index + 1} / {len(evaluation_dataset)}")

        question = row['query']
        ground_truth = row['reference_answer']
        
        response = query_engine.query(question)
        generated_answer = str(response)
        sleep_and_note_location(sleep_number, "faithfulness_evaluator")

        # Faithfulness evaluator
        faithfulness_results = faithfulness_evaluator.evaluate_response(response=response)
        sleep_and_note_location(sleep_number, "relevancy_evaluator")

        # RelevancyEvaluator evaluator
        relevancy_results = relevancy_evaluator.evaluate_response(query=question, response=response)
        sleep_and_note_location(sleep_number, "correctness_evaluator")
        
        # CorrectnessEvaluator evaluator
        correctness_results = correctness_evaluator.evaluate(
            query=question,
            response=generated_answer,
            reference=ground_truth
        )
        sleep_and_note_location(sleep_number, "end of iteration")

        current_evaluation = {
            "query": question,
            "generated_answer": generated_answer,
            "ground_truth": ground_truth,
            "faithfulness": faithfulness_results.passing,
            "faithfulness_feedback": faithfulness_results.feedback,
            "faithfulness_score": faithfulness_results.score,
            "relevancy": relevancy_results.passing,
            "relevancy_feedback": relevancy_results.feedback,
            "relevancy_score": relevancy_results.score,
            "correctness": correctness_results.passing,
            "correctness_feedback": correctness_results.feedback,
            "correctness_score": correctness_results.score,
        }
        results_list.append(current_evaluation)
        print(f"processed test case: {index + 1} / {len(evaluation_dataset)}")

    evaluations_df = pd.DataFrame(results_list)
    
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', None)
    
    aggregate_results = {
        'number_of_test_cases': len(evaluations_df),
        'mean_faithfulness_score': round(evaluations_df['faithfulness_score'].mean(), 3),
        'mean_relevancy_score': round(evaluations_df['relevancy_score'].mean(), 3),
        'mean_correctness_score': round(evaluations_df['correctness_score'].mean(), 3)
    }

    return evaluations_df, aggregate_results


**Note**: The throttling delays implemented by Bedrock significantly slow the test process

It make take 4 minutes or more for a single test run. The default configuration will take at least 2 minutes.

Have a break while this is running and limit your runs to 10-30 test cases, except for final runs.

In [95]:
%%time

# KEY CELL: Running the configured evaluations with the generated test set

# Run evaluations for the first n rows of the generated test set only
n = 15
evaluation_results_df, aggregate_results = run_evaluations(eval_questions_df.head(n), query_engine, evaluator_llm)
evaluation_results_df

processing test case: 1 / 15
location: faithfulness_evaluator
location: relevancy_evaluator
location: correctness_evaluator
location: end of iteration
processed test case: 1 / 15
processing test case: 2 / 15
location: faithfulness_evaluator
location: relevancy_evaluator
location: correctness_evaluator
location: end of iteration
processed test case: 2 / 15
processing test case: 3 / 15
location: faithfulness_evaluator
location: relevancy_evaluator
location: correctness_evaluator
location: end of iteration
processed test case: 3 / 15
processing test case: 4 / 15
location: faithfulness_evaluator
location: relevancy_evaluator
location: correctness_evaluator
location: end of iteration
processed test case: 4 / 15
processing test case: 5 / 15
location: faithfulness_evaluator
location: relevancy_evaluator
location: correctness_evaluator
location: end of iteration
processed test case: 5 / 15
processing test case: 7 / 15
location: faithfulness_evaluator
location: relevancy_evaluator
location: cor

Unnamed: 0,query,generated_answer,ground_truth,faithfulness,faithfulness_feedback,faithfulness_score,relevancy,relevancy_feedback,relevancy_score,correctness,correctness_feedback,correctness_score
0,"Question: According to the ""buhurt_armor_requirement.pdf"" document, what is the minimum thickness requirement for the lower leg protection armor, as specified on page 1?","I apologize, but I cannot provide an answer to this specific question based on the given context information. The excerpts provided do not contain any information about the minimum thickness requirement for lower leg protection armor, nor do they include content from page 1 of the document. The provided excerpts cover helmet specifications, general armor requirements, and specifications for torso, chest, back, hips, groin, hands, and arms protection, but do not mention specific requirements for lower leg armor thickness.","To provide an accurate answer, I would need access to the specific contents of the ""buhurt\_armor\_requirement.pdf"" document located at /home/ec2-user/SageMaker/elvtr-ai-solution-architect/class-14/source\_docs/. However, based on the provided context, I can tell you that the minimum thickness requirement for the lower leg protection armor is specified on page 1 of the document. To get the exact thickness requirement, please refer to the contents of the document itself.",False,"NO, the given context does not support the statement ""The minimum thickness requirement for lower leg protection armor is 2.5mm."" The context provided only includes information about helmet specifications, general armor requirements, and specifications for torso, chest, back, hips, groin, hands, and arms protection. It does not mention any requirements for lower leg armor thickness.",0.0,False,"NO, the response is not in line with the context information provided. The context information does include the minimum thickness recommendations for various types of armor, but it does not specifically mention the minimum thickness requirement for lower leg protection armor.",0.0,False,"The generated answer is relevant as it indicates that the required information is not available in the provided context. However, it could have been more helpful by stating that the minimum thickness requirement for lower leg protection armor is specified on page 1 of the document, similar to the reference answer.",2.0
1,"Question: Based on the ""Technical Requirements for Armor"" document, explain the historical consistency requirements for armor, including the considerations for equipment from historical sources, dates of sources, prohibited features, and consistency in equipment with regards to Western European style, Slavic influence, Eastern influence, and authorized deviations.","Based on the ""Technical Requirements for Armor"" document, the historical consistency requirements for armor are as follows:\n\n1. Equipment from historical sources:\n - Only armors derived from historical sources are permitted for use.\n - Armors must align with Authenticity Rules documents.\n\n2. Dates of sources:\n - Armors must align with sources dating between the 14th (1300) and 17th (1600) centuries.\n - Reproductions of armors predating the 13th century are prohibited for safety reasons.\n\n3. Prohibited features:\n - Features that indicate modern materials or manufacturing techniques are prohibited, including:\n - Neon colors\n - Obvious nylon cords\n - Plastic ties\n - Visible welded seams\n - Heat-induced discoloration\n - Modern footwear\n - Other visible modern equipment\n\n4. Consistency in Equipment:\n Armors, shields, and weapons must consist of pieces from the same style. The styles are defined as:\n\n a. Western European style:\n - 14th century: 1300 to 1380\n - Transitional: 1380 to 1420\n - 15th century: 1420 to 1500 (requires approval from the Authenticity Committee)\n - Includes countries like Great Britain, France, Germany, Italy, Scandinavian countries, etc.\n\n b. Slavic Influence:\n - Central Europe 14th: 1300 to 1400\n - Russian late armors: 1500 to 1700\n - Includes countries like Czech Republic, Poland, Russia, Ukraine, etc.\n\n c. Eastern influence:\n - Chinese style: 1300 to 1600\n - Japanese samurai style: 1400 to 1700\n - Middle-East style: 1300 to 1700\n - Includes countries like China, Japan, India, Korea, Iran, Iraq, Turkey, Egypt\n\nThe document does not provide specific information about authorized deviations in the given context.","The ""Technical Requirements for Armor"" document outlines the historical consistency requirements for armor under section 2.2. These requirements ensure that the armor used in competitions is consistent with historical sources and practices.\n\nFirstly, the equipment must come from historical sources. This means that the design, construction, and materials of the armor should be based on existing examples from historical records, artifacts, or reputable reconstructions.\n\nSecondly, the dates of the sources must be considered. The armor should be consistent with the time period it represents, and should not mix elements from different epochs. This helps maintain the authenticity and immersion of the historical reenactment or competition.\n\nThirdly, there are prohibited features. These are elements of armor that were not used in the specified historical period or culture, or that are considered anachronistic. For example, the use of rivets in certain periods or regions where they were not historically employed would be a prohibited feature.\n\nFourthly, consistency in equipment is emphasized. This includes consistency in equipment with regards to Western European style, Slavic influence, Eastern influence, and authorized deviations. \n\n- Western European style: The armor should reflect the typical design, construction, and materials used in Western Europe during the specified historical period. This includes considerations for the type of armor, the materials used, and the methods of construction.\n\n- Slavic influence: The armor should reflect the influence of Slavic cultures, if applicable. This could include specific design elements, materials, or construction techniques that are characteristic of Slavic armor.\n\n- Eastern influence: The armor should reflect the influence of Eastern cultures, if applicable. This could include specific design elements, materials, or construction techniques that are characteristic of Eastern armor.\n\n- Authorized deviations: There may be certain deviations from historical practices that are allowed, provided they do not violate the other historical consistency requirements. These deviations should be authorized and documented, and should not detract from the historical authenticity of the armor.",False,"For the question about authorized deviations in the given context, the answer is NO because the provided context does not include specific information about authorized deviations.",0.0,True,"Yes, the response for the query is in line with the context information provided. The response accurately summarizes the key points from the ""Technical Requirements for Armor"" document related to historical consistency requirements, including equipment from historical sources, dates of sources, prohibited features, and consistency in equipment with regards to Western European style, Slavic influence, and Eastern influence. The response also correctly mentions that the document does not provide specific information about authorized deviations.",1.0,True,"The generated answer is highly relevant and provides a detailed explanation of the historical consistency requirements for armor. It correctly identifies the need for equipment from historical sources, dates of sources, and prohibited features. The answer also expands on the requirement for consistency in equipment by providing specific styles and time periods for Western European, Slavic, and Eastern influences. However, it does not mention anything about prohibited features with regards to these styles, or any information about authorized deviations. The answer is mostly correct, but lacks some detail found in the reference answer.",4.5
2,"""Can you explain the role and responsibility of Authenticity Officers as mentioned in the introduction of the document related to buhurt armor requirements?""","I apologize, but I cannot provide information about the role and responsibilities of Authenticity Officers as mentioned in the introduction of the document. The given context information does not contain any specific details about Authenticity Officers or their roles in the introduction. The excerpts provided focus on helmet requirements, armor guidelines for various body parts, and procedures for validating equipment deviations. Without access to the introduction of the document or specific information about Authenticity Officers, I cannot accurately answer this question.","Based on the introduction of the document related to buhurt armor requirements, Authenticity Officers are responsible for the correct application of the document. Their role is to ensure that all equipment, including armors, clothes, shields, and weapons used by competitors, meet the historical source requirements outlined in the document. Authenticity Officers analyze historical sources, which can be primary (such as artifacts and museum originals) or secondary (such as images, paintings, statues, descriptions, and writings), using criticism and logic to determine the plausibility of an item's existence. They play a crucial role in maintaining the historical accuracy and authenticity of the competition.",False,"For the information provided, the answer would be NO, as the context does not contain any information about the taste of apple pie.",0.0,False,"NO, because the response is based on the lack of specific information about Authenticity Officers in the provided context, rather than evaluating whether the response is in line with the context information. The response does accurately reflect the provided context, however, as there is no information about Authenticity Officers in the context.",0.0,False,"The generated answer is not relevant to the user query as it does not provide any information about the role and responsibilities of Authenticity Officers in the context of the given document. Instead, it explains the lack of available information and the focus of the provided document excerpts.",1.0
3,"""Explain the prohibited features for armors and give an example of a modern material or manufacturing technique that would be not allowed in historical reenactment events, according to the given document.""","According to the document, prohibited features for armors include evident indications of modern materials or manufacturing techniques. Specifically, the document states:\n\n""Prohibited features encompass evident indications of modern materials or manufacturing techniques including: neon colors, obvious nylon cords, plastic ties, visible welded seams, heat-induced discoloration, modern footwear and other visible modern equipment.""\n\nAn example of a modern material or manufacturing technique that would not be allowed in historical reenactment events, based on this information, would be visible welded seams. Welding is a modern joining technique that would not have been used in the production of medieval armor, so any armor with visible welded seams would be considered inauthentic and prohibited in these events.\n\nOther examples mentioned that would not be allowed include neon colors, obvious nylon cords, and plastic ties, as these are all clearly modern materials that would not have been available during the historical periods being reenacted (14th to 17th centuries).","According to the given document, prohibited features for armors include any evident indications of modern materials or manufacturing techniques. Specifically, the use of neon colors, obvious nylon cords, plastic ties, visible welded seams, heat-induced discoloration, modern footwear, and other visible modern equipment are not permitted. \n\nAn example of a modern material or manufacturing technique that would not be allowed in historical reenactment events, according to the given document, is the use of heat-induced discoloration. This technique is often used in modern manufacturing processes to create a specific color or finish on a material, but it is not consistent with the historical methods of creating armor. Therefore, any armor with heat-induced discoloration would be considered a prohibited feature and not allowed in historical reenactment events.",True,"Yes, the information is supported by the context. According to the context, prohibited features for armors include evident indications of modern materials or manufacturing techniques, including visible welded seams.",1.0,True,"Yes, the response for the query is in line with the context information provided. The response accurately identifies the prohibited features for armors, including visible welded seams, and explains why welding is a modern manufacturing technique that would not have been used in the production of historical armor. The response also provides additional examples of modern materials and manufacturing techniques that would not be allowed in historical reenactment events. This is consistent with the information provided in the context, which states that prohibited features include evident indications of modern materials or manufacturing techniques, including neon colors, obvious nylon cords, plastic ties, visible welded seams, heat-induced discoloration, modern footwear, and other visible modern equipment.",1.0,True,"The generated answer is relevant and fully correct. It accurately identifies the prohibited features for armors as including modern materials or manufacturing techniques, and provides an example of a modern manufacturing technique (welded seams) that would not be allowed in historical reenactment events. The answer is clear, concise, and directly addresses the user's query.\n\nThe reason for not giving a full score of 5 is that the reference answer provides a slightly more detailed explanation of the concept and includes additional examples of prohibited features. However, the generated answer is still highly accurate and informative.",4.5
4,"""What is the process for competitors to deviate from the specified styles of armor, weapons, and shields, and who do they need to notify in order to obtain validation of their equipment choices?""","Based on the provided context, the process for competitors to deviate from the specified styles of armor, weapons, and shields is as follows:\n\n1. Competitors may deviate from the specified styles if they can provide sources supporting their equipment choices.\n\n2. To obtain validation of their armor, competitors who wish to deviate from the specified styles must send an email to AC@buhurtinternational.com.\n\nThis information is specifically mentioned in section 2.4.4 of the document titled ""Procedure for Validating Unique Competitor Equipment Deviations in Buhurt International Tournaments: Authorization.""","Competitors can deviate from the specified styles of armor, weapons, and shields by providing sources that support their equipment choices. Once they have done this, they must send an email to AC@buhurtinternational.com to obtain validation of their armor. This information is based on the provided context, and no prior knowledge was used in formulating the answer.",True,"YES\n\nThe context contains the same information as the given piece of information, stating that competitors may deviate from the specified styles if they provide sources supporting their equipment choices and must send an email to AC@buhurtinternational.com to obtain validation of their armor. The context is from section 2.4.4 of the document titled ""Procedure for Validating Unique Competitor Equipment Deviations in Buhurt International Tournaments: Authorization.""",1.0,True,"Yes, the response for the query is in line with the context information provided. The response accurately summarizes the process and the email address for obtaining validation of equipment choices from section 2.4.4 of the document.",1.0,True,"The generated answer is relevant and fully correct. It provides the exact same information as the reference answer, but it is more detailed and includes the specific section and title of the document where the information can be found. This makes the generated answer more informative and useful for the user. \n\nHowever, the generated answer is slightly longer and less concise than the reference answer. This is why I gave it a score of 4.5 instead of 5. \n\nOverall, the generated answer demonstrates a high level of accuracy and relevance to the user query.",4.5
5,"Question: According to the ""BuHurt Armor Requirement"" document, what are the specific protection requirements for a helmet, and what should be the characteristics of locking systems if they are visible?","According to the ""BuHurt Armor Requirement"" document, the specific protection requirements for a helmet are:\n\n1. The helmet must protect the entire head and neck from all sides.\n2. It must protect the whole head; open-face helmets are not allowed in any category.\n3. If the helmet includes a removable or pivoting visor, it should be secured to prevent any opening during a fight.\n4. A helmet must have a quilted or padded coif inside, or a leather suspension system with a quilted coif.\n5. The recommended minimum thickness of the coif is 5mm.\n6. A chinstrap must secure the helmet to the head, preventing any loss of it during a fight.\n7. Neck protection is mandatory and must cover the neck in any position.\n\nRegarding the characteristics of locking systems if they are visible:\n\n1. Locking systems must be based on historical sources.\n2. They must use authentic materials (leather, rope, etc.).\n3. Approved locking systems are displayed in Authenticity Rules documents.","According to the ""BuHurt Armor Requirement"" document, the specific protection requirements for a helmet include protecting the entire head and neck from all sides. Open-face helmets are not allowed in any category. Additionally, the helmet must conform to the Authenticity Rules documents. If the helmet includes a removable or pivoting visor, it should be secured to prevent any opening during a fight.\n\nRegarding locking systems, if they are visible, they must be based on historical sources and use authentic materials such as leather or rope. Approved locking systems are displayed in the Authenticity Rules documents.",True,"For the context provided, the answers to the following questions are:\n\nInformation: A helmet must protect the entire head and neck from all sides.\nAnswer: YES\n\nInformation: A helmet must be made of metal.\nAnswer: PARTIALLY (the dome and visor must be made of metal, but other parts of the helmet may not be)\n\nInformation: Open-face helmets are not allowed in any category.\nAnswer: YES\n\nInformation: A helmet must include a removable or pivoting visor.\nAnswer: NO\n\nInformation: A helmet must have a quilted or padded coif inside, or a leather suspension system with a quilted coif and a recommended minimum thickness of the coif is 5mm.\nAnswer: YES\n\nInformation: A chinstrap must secure the helmet to the head, preventing any loss of it during a fight.\nAnswer: YES\n\nInformation: Neck protection is mandatory and must cover the neck in any position.\nAnswer: YES\n\nInformation: Locking systems must be based on historical sources and use authentic materials (leather, rope, etc.).\nAnswer: PARTIALLY (only if visible)\n\nInformation: The whole armor must protect the following body parts: head, neck, chest, arms, legs, feet, and hands.\nAnswer: YES\n\nInformation: Poor quality equipment is not allowed in any category.\nAnswer: YES\n\nInformation: Marshals are allowed to deny any competitor with clearly poor quality equipment to enter the list.\nAnswer: YES",1.0,True,"Yes, the response for the query is in line with the context information provided. The response accurately summarizes the protection requirements for a helmet and the characteristics of locking systems as stated in the ""BuHurt Armor Requirement"" document, which is also reflected in the provided context.",1.0,True,"The generated answer is highly relevant and provides more detailed information than the reference answer, including specific helmet requirements and neck protection. It also accurately describes the characteristics of locking systems if they are visible. However, it slightly exceeds the scope of the question by including details about the chinstrap and coif, which were not explicitly mentioned in the reference answer.\n\n<details>\n <summary>Reasoning</summary>\n\n The generated answer is highly relevant and provides more detailed information than the reference answer, including specific helmet requirements and neck protection. It also accurately describes the characteristics of locking systems if they are visible. However, it slightly exceeds the scope of the question by including details about the chinstrap and coif, which were not explicitly mentioned in the reference answer.\n\n The generated answer fully covers the protection requirements for a helmet, including the need to protect the entire head and neck, the prohibition of open-face helmets, and the requirement for removable or pivoting visors to be secured. It also adds valuable information about the quilted or padded coif, the recommended minimum thickness of 5mm, and the mandatory neck protection.\n\n Furthermore, the generated answer accurately describes the characteristics of locking systems if they are visible, requiring them to be based on historical sources, use authentic materials, and be approved by the Authenticity Rules documents.\n\n Although the generated answer is highly accurate and relevant, it slightly exceeds the scope of the question by including details about the chinstrap and coif, which were not explicitly mentioned in the reference answer. This minor deviation from the reference answer results in a slightly lower score.\n</details>",4.5
6,"Question: Based on the provided document changelog, what is the most recent update as of December 2024 regarding the technical requirements for armors, weapons, and shields?","Based on the provided changelog information, the most recent update as of December 2024 regarding the technical requirements for armors, weapons, and shields is:\n\nThe document on ""Technical Requirements for armors, weapons and shields"" was created in December 2024. The changelog specifically states:\n\n""Date update: December 2024 - add: Document created""\n\nThis appears to be the initial creation of the document outlining the technical requirements for armors, weapons, and shields.\n\nAdditionally, for weapons specifically, there was an update in December 2024 that included:\n\n1. Changing the upwards weight error limit to 100 grams for all weapons.\n2. Adding subgenres of weapons for various categories (Duels, Group fights, etc.).\n3. Removing and moving certain specific requirements and weight/measurement details to a separate document.\n\nThese updates provide the most recent changes to the technical requirements as of December 2024 based on the given context.","Based on the provided document changelog, the most recent update as of December 2024 regarding the technical requirements for armors, weapons, and shields is the creation of the document.",True,"YES\n\nThe context includes a changelog for the technical requirements of armors, weapons, and shields, which includes an entry for December 2024 stating that the document was created and that there were updates regarding the technical requirements for weapons. Therefore, the information provided is supported by the context.",1.0,True,"Yes, the response for the query is in line with the context information provided. The changelog shows that the document outlining the technical requirements for armors, weapons, and shields was created in December 2024, and it includes specific updates regarding weapons at that time.",1.0,True,"The generated answer is highly relevant and almost entirely correct. It provides the exact details of the most recent update regarding the technical requirements for armors, weapons, and shields as stated in the changelog. The answer is only slightly less concise than the reference answer, hence the score of 4.5.",4.5
7,"Question: Based on the ""Historical Consistency Requirements"" section of the document, explain what is meant by ""Prohibited features"" and provide an example of such a feature as it relates to weapon equipment.","According to the ""Historical Consistency Requirements"" section, ""Prohibited features"" refer to evident indications of modern materials or manufacturing techniques that are not consistent with historical authenticity for weapons from the 14th to 17th centuries.\n\nAn example of a prohibited feature related to weapon equipment would be visible welded seams. This is explicitly mentioned in the document as one of the prohibited features, along with other examples such as neon colors, obvious nylon cords, plastic ties, and heat-induced discoloration.\n\nThese features are prohibited because they clearly indicate modern manufacturing methods or materials that would not have been present in historical weapons from the specified time period (1300-1600 CE). The goal is to maintain historical accuracy and authenticity in the weapons used in these events.","Based on the ""Historical Consistency Requirements"" section of the document, ""Prohibited features"" refer to elements in weapon equipment that do not align with historical sources or are not consistent with the time period or region being represented. An example of a prohibited feature could be the presence of synthetic materials in a weapon that is supposed to be made entirely of historical materials, such as wood, iron, or steel. Another example could be the use of modern mechanical components in a weapon that would not have been present in the historical equivalent. These prohibited features are not in line with the historical authenticity requirements set forth in the document.",True,"YES\n\nThe context includes a section called ""Historical Consistency Requirements"" which outlines the prohibition of certain features in weapon equipment that indicate modern materials or manufacturing techniques. Visible welded seams are explicitly mentioned as an example of a prohibited feature.",1.0,True,"Yes, the response is in line with the context information provided. The response correctly explains that prohibited features refer to modern materials or manufacturing techniques not consistent with historical authenticity for weapons from the 14th to 17th centuries, and provides an example of a prohibited feature (visible welded seams) as it relates to weapon equipment. This is consistent with the information provided in the ""Historical Consistency Requirements"" section of the document.",1.0,True,"The generated answer is relevant and mostly correct. It correctly identifies the concept of ""Prohibited features"" as elements that are not consistent with historical sources or time periods. The example provided, visible welded seams, is also accurate and directly taken from the document. The answer could have been improved by specifying the time period and type of weapon more clearly, as in the reference answer. However, the generated answer is still quite informative and precise.",4.5
8,"Question: Based on the document, what is the role of Authenticity Officers in buhurt competitions? Authenticity Officers are responsible for the correct application of the document in buhurt competitions, which includes group fights in various formats such as 5v5, 12v12, and 30v30. Additionally, they are responsible for ensuring that all equipment, including armors, clothes, shields, and weapons, meet the requirements set forth in the document.","I apologize, but I cannot find any specific information about Authenticity Officers or their role in the given context. The excerpts provided do not mention Authenticity Officers or describe their responsibilities in buhurt competitions.\n\nThe context does mention some authenticity requirements for weapons and armor, and there are references to an ""Authenticity Rules document"" for various equipment components. However, there is no explicit information about who enforces these rules or the role of Authenticity Officers.\n\nThe closest relevant information I can find is in the first excerpt, which mentions that competitors who deviate from specified styles must send an email to AC@buhurtinternational.com to obtain validation of their armor. This suggests that there is some form of authentication process, but it doesn't specify who carries out this process or their title.\n\nWithout more specific information in the given context, I cannot confirm or describe the role of Authenticity Officers in buhurt competitions.","Authenticity Officers are responsible for ensuring that all equipment used in buhurt competitions, which include group fights in various formats such as 5v5, 12v12, and 30v30, adhere to the requirements set forth in the document. This includes armors, clothes, shields, and weapons. They are also responsible for the correct application of the document in buhurt competitions.",True,"For the first piece of information: ""Apple pie is generally double-crusted,"" the answer is YES, as this is explicitly stated in the context.\n\nFor the second piece of information: ""I apologize, but I cannot find any specific information about Authenticity Officers or their role in the given context,"" the answer is YES, as this is a direct quote from my response indicating that the context does not mention Authenticity Officers or their role.\n\nFor the third piece of information: ""Competitors who deviate from specified styles must send an email to AC@buhurtinternational.com to obtain validation of their armor,"" the answer is YES, as this is stated in the context.\n\nFor the fourth piece of information: ""All weapons must be based on historical sources and repeat the form of the original in all three dimensions,"" the answer is YES, as this is stated in the context.\n\nFor the fifth piece of information: ""The edge of both striking and non-striking edges must be rebated and have a thickness of 2mm or more,"" the answer is YES, as this is stated in the context.\n\nFor the sixth piece of information: ""Every weapon to be used at a tournament must be first approved for safety by the representatives of the Marshals during weapons check,"" the answer is YES, as this is stated in the context.\n\nFor the seventh piece of information: ""Flat rubber soles with low-profile texture are permitted to replace historical soles,"" the answer is YES, as this is stated in the context.\n\nFor the eighth piece of information: ""It is prohibited to decorate weapons with images that are clearly modern and/or offensive,"" the answer is YES, as this is stated in the context.\n\nFor the ninth piece of information: ""The blade is the main striking edge of the weapon,"" the answer is YES, as this is stated in the context.\n\nFor the tenth piece of information: ""The hilt is composed of the pommel, handle, and guard,"" the answer is YES, as this is stated in the context.\n\nFor the eleventh piece of information: ""When holding the bladed weapon firmly, the guard is not allowed to extend the gauntlet of the bearer from more than 3.5cm,"" the answer is YES,",1.0,False,"NO, because the response acknowledges the existence of an ""Authenticity Rules document"" and mentions some authenticity requirements, but it also clarifies that there is no explicit information about who enforces these rules or the role of Authenticity Officers in the given context.",0.0,False,"The generated answer is not relevant to the user query as it does not provide any information about the role of Authenticity Officers in buhurt competitions. Instead, it explains the lack of specific information in the document about the role of Authenticity Officers.",1.0
9,"What modern countries are included in the Western Europe style definition for armors, shields, and weapons that must be consistent in equipment style, according to the Authenticity Rules documents?","According to the Authenticity Rules documents, the Western Europe style definition for armors, shields, and weapons includes the following modern countries:\n\nGreat Britain, Ireland, France, Portugal, Spain, Germany, Italy, Norway, Denmark, Sweden, Finland, Austria, Switzerland, Belgium, and the Netherlands.","The Western Europe style definition for armors, shields, and weapons that must be consistent in equipment style, according to the Authenticity Rules documents, includes the following modern countries: Great Britain, Ireland, France, Portugal, Spain, Germany, Italy, Norway, Denmark, Sweden, Finland, Austria, Switzerland, Belgium, and the Netherlands.",False,"The given piece of information is not supported by the context. The context only provides information about the historical consistency requirements for armors, weapons, and shields, and does not mention anything about the countries associated with the Western Europe style definition for armors, shields, and weapons.",0.0,True,"Yes, the response for the query is in line with the context information provided. The context indicates that the Western Europe style definition for armors, shields, and weapons includes armors from the modern countries listed in the response.",1.0,True,"The generated answer is identical to the reference answer, both in content and form. It is relevant and fully correct.",5.0


In [96]:
aggregate_results

{'number_of_test_cases': 15,
 'mean_faithfulness_score': 0.667,
 'mean_relevancy_score': 0.733,
 'mean_correctness_score': 4.0}

----
## Pause

----



# Assignment Task #1: Baseline: Using your configuration and documents

Update the notebook to match your configuration for Capstone 2

- Use your document set (rather than the canned/biographic dataset provided with this example)
- Use the embeddings model that was best for your document set
- Use the ingestion pipeline that was best for your document set

If for some reason you did not complete Capstone 2, but you are completing Capstone 3, then note that use the content provided here.

Once you have updated your configuration, rerun the notebook to this point.

Answer the following questions in this cell:

## 1. What are the aggregate evaluation scores for your configuration?

| name | score |
|----|----|
| mean_faithfulness_score | 1.0 |
| mean_relevancy_score | 0.95 |
| mean_correctness_score | 3.75 |

## 2. Of the three evaluation measures, which one needs to be improved the most, from your point of view, and why?

As the answers are cropped, I updated the `run_evaluations` function to not truncate the displayed answers.

Looking at the collected scores, `correctnessness` and `faithfulness` could be improved. Looking at some of the questions and evaluations, there are some false positive that could most probably be avoided. See below answer. I will focus on the `faithfulness` in the next steps of the exercise.

## 3. Look at a couple of the failed test cases, using the evaluation functions that show the detail outputs, and see why the test case failed. For each of the two queries, note both the test case query, and your reasoning as to its failure.

Some the failed test cases were quite interesting.

Question:

> According to the "buhurt_armor_requirement.pdf" document, what is the minimum thickness requirement for the chest protector in the buhurt armor setup?

Generated Answer:

> The minimum recommended thickness for the chest protector in the buhurt armor setup is 1.5mm for mild steel or 0.8mm for tempered steel, hardened steel, or stainless steel. This information can be found in section 5.2 of the document, which covers the armor requirements.

Faithfulness:

> The context provided is about the technical requirements for armors, weapons, and shields, and it does not contain any information about the taste of **apple pies**.

----
# Assignment Task #2: Experiment with the LLM for answer generation

Change the LLM model configuration for the query engine in the next two notebook cells. 

Change the configuration of the query engine to use the **one** or **two** other models that have been configured for use already.

Then rerun the evaluations with the set of generated test cases.
For each model review the difference in the aggregate score and in the quality of the output text.

Answer the following questions in this cell:

## 1. Which model was best for your content and what were its scores?


While running the tests I did some research about the models, it seems that Sonnet 3.5 will be providing the highest consistency and reliability in our scneario. The evaluation should be more consistent and reliable, better at providing detailed and strctured feedback.

| name | llm_mixtral8x7b | llm_sonnet_35 | llm_haiku_3 |
|----|----|----|----|
| mean_faithfulness_score | 1.0 | 0.65 | 0.5 |
| mean_relevancy_score | 0.95 | 0.65 | 0.8 |
| mean_correctness_score | 3.75 | 3.575 | 3.625 |

## 2. Which model was worst for your content and what were its scores?

Using the evaluated question that was a false positive on the first task as a reference. We can see that the faithfulness of Sonnet 3.5 gives an accurate feedback, and does not talk about **apple pies** like the others.

**Sonnet 3.5:**

Even though the test fails for this question, the faithfulness is accurate and is relevant to the question/answer received. We can consider this evaluation valid.

Question:

> According to the "buhurt_armor_requirement.pdf" document, what is the minimum thickness requirement for the chest protector in the buhurt armor setup?

Generated Answer:

> According to the document, the minimum recommended thickness for steel plates of armor is 1.5mm for mild steel or 0.8mm for tempered steel, hardened steel, or stainless steel. For titanium plates, the minimum recommended thickness is 0.8mm. \n\nThe document specifies that the torso and back must be protected with metal plates. Additionally, an under-armor layer called a gambeson is mandatory for protecting the torso and back. This gambeson should consist of at least a layer of felted fabric quilted with fabric.\n\nIt's important to note that these are recommendations, and the document emphasizes the use of metal plates for chest protection without specifying a strict minimum thickness specifically for the chest area. The guidelines provided are general for all armor plates.

Faithfulness:

> NO, the given context does not support the statement "According to the document, the minimum recommended thickness for steel plates of armor is 1.5mm for mild steel or 0.8mm for tempered steel, hardened steel, or stainless steel. For titanium plates, the minimum recommended thickness is 0.8mm." The context provided does not mention any specific minimum recommended thickness for steel or titanium plates of armor.

**Haiku 3:**

Question:

> According to the "buhurt_armor_requirement.pdf" document, what is the minimum thickness requirement for the chest protector in the buhurt armor setup?

Generated Answer:

> The provided context information does not mention any specific minimum thickness requirements for lower leg protection armor. The document focuses on providing technical specifications for other armor components such as the helmet, coif, chinstrap, neck protection, chest, hips, groin, hands, and arms, but does not discuss requirements for lower leg protection.

Faithfulness:

> NO, the given context does not mention if apple pies taste bad or not. It only provides information about what an apple pie is and how it is typically prepared.

## 3. Summarize the difference that in output quality that you observed between the best and the worst performing LLM

Even if Sonnet 3.5 has lower results than the others, it provided the best accurate evaluations and feedback. No false positive were observed in its evaluation.


In [63]:
# KEY CELL

# query_engine_llm = llm_mixtral8x7b        # The `default` model for this notebook - used until you change this setting
# query_engine_llm = llm_mistral7b
# query_engine_llm = llm_haiku_3
query_engine_llm = llm_sonnet_35

In [64]:
# KEY CELL

# After you update the query_engine configuration, then go back and re-run the test cases

query_engine = vector_index.as_query_engine(llm=query_engine_llm, similarity_top_k=3)

-----
# Assignment Task #3: Experimenting with changing the prompt

The default prompt provided by Llamaindex works quite well, but you can almost certainly do better. 

## 1. The cells below will change the default prompt to one that works better with the default content. Read that updated prompt and suggest two reasons why it might perform better the default prompt (refer back to the prompt engineering assignment).

The updated prompt follows all the prompt engineering best practices:
- Role and context setting
- Structure and format answer
- Error prevention and guideline

## 2. Update the alternative prompt to better match the topic and goals of your RAG solution, and update the query_engine with the cells that follow. Then rerun the tests and experiment further to improve your prompt. It will help to look at the output for specific queries, to get a deeper sense of the changes driven by your prompt.

Using `llm_sonnet_35`:

| name | default prompt | custom prompt |
|----|----|----|
| mean_faithfulness_score | 0.65 | 0.7 |
| mean_relevancy_score | 0.65 | 0.6 |
| mean_correctness_score | 3.575 | 3.125 |

## 3. What are the final test metrics that you are getting for you configuration?

Results are quite similar to the default prompt, even though I think my custom prompt could help provide better answers. However, reviewing the generated questions, I realized that some questions does not provide anything meaningful to the evaluation:
> For this quiz/examination, your task is to answer the following question based on the provided context information:

To make things simple, I simply removed the irrelevant question from the `eval_questions_df` with the below script and re-run the 2 evaluations with `n = 15 questions`.

With `15 clean questions` and `llm_sonnet_35`

| name | default prompt | custom prompt |
|----|----|----|
| mean_faithfulness_score | 0.667 | 0.733 |
| mean_relevancy_score | 0.733 | 0.8 |
| mean_correctness_score | 4.0 | 3.9 |

We can see that after removing the irrelevant questions, results improved for both prompts. Looking at the questions/answers, I can see a few more trick questions that could have been "graded" differently by the LLM. But I think, for the goal of this assignment, it is already a success.

In [83]:
eval_questions_df

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,"Question: According to the ""buhurt_armor_requirement.pdf"" document, what is the minimum thickness requirement for the lower leg protection armor, as specified on page 1?",[],"To provide an accurate answer, I would need access to the specific contents of the ""buhurt\_armor\_requirement.pdf"" document located at /home/ec2-user/SageMaker/elvtr-ai-solution-architect/class-14/source\_docs/. However, based on the provided context, I can tell you that the minimum thickness requirement for the lower leg protection armor is specified on page 1 of the document. To get the exact thickness requirement, please refer to the contents of the document itself.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
1,"Question: Based on the ""Technical Requirements for Armor"" document, explain the historical consistency requirements for armor, including the considerations for equipment from historical sources, dates of sources, prohibited features, and consistency in equipment with regards to Western European style, Slavic influence, Eastern influence, and authorized deviations.","[Technical RequirementsforArmor\nTableofContents\nTechnicalRequirementsforArmor 11.IntroductionandDefinitions 21.1.Equipment 21.2.List 21.3.Categories 21.4.Historicalsources 22.HistoricalConsistencyRequirements 32.1.Equipmentfromhistoricalsources 32.2.Datesofsources 32.3.Prohibitedfeatures 32.4.ConsistencyinEquipment 32.4.1.WesternEuropestyle 32.4.2.SlavicInfluence 32.4.3.Easterninfluence 32.4.4.Authorizeddeviations 43.GeneralRequirements 53.1.Competitorresponsibility 53.2.Qualityequipment 53.3.Safeequipment 53.4.Accesstothelist 54.Helmetsrequirements 64.1.Protection 64.2.Thicknessandcomposition 74.3.Coif 74.4.Chinstrap 74.5.Neckprotection 75.Armorrequirements 85.1.Metalthicknessrecommendations 85.2.Chest 85.3.Groin 85.4.Hands 85.5.Arms 85.6.Joints 95.7.Legs 95.8.Feet 95.9.Shoes 96.Changelog 10\nTechnicalRequirementsforarmors,weaponsandshields 1]","The ""Technical Requirements for Armor"" document outlines the historical consistency requirements for armor under section 2.2. These requirements ensure that the armor used in competitions is consistent with historical sources and practices.\n\nFirstly, the equipment must come from historical sources. This means that the design, construction, and materials of the armor should be based on existing examples from historical records, artifacts, or reputable reconstructions.\n\nSecondly, the dates of the sources must be considered. The armor should be consistent with the time period it represents, and should not mix elements from different epochs. This helps maintain the authenticity and immersion of the historical reenactment or competition.\n\nThirdly, there are prohibited features. These are elements of armor that were not used in the specified historical period or culture, or that are considered anachronistic. For example, the use of rivets in certain periods or regions where they were not historically employed would be a prohibited feature.\n\nFourthly, consistency in equipment is emphasized. This includes consistency in equipment with regards to Western European style, Slavic influence, Eastern influence, and authorized deviations. \n\n- Western European style: The armor should reflect the typical design, construction, and materials used in Western Europe during the specified historical period. This includes considerations for the type of armor, the materials used, and the methods of construction.\n\n- Slavic influence: The armor should reflect the influence of Slavic cultures, if applicable. This could include specific design elements, materials, or construction techniques that are characteristic of Slavic armor.\n\n- Eastern influence: The armor should reflect the influence of Eastern cultures, if applicable. This could include specific design elements, materials, or construction techniques that are characteristic of Eastern armor.\n\n- Authorized deviations: There may be certain deviations from historical practices that are allowed, provided they do not violate the other historical consistency requirements. These deviations should be authorized and documented, and should not detract from the historical authenticity of the armor.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
2,"""Can you explain the role and responsibility of Authenticity Officers as mentioned in the introduction of the document related to buhurt armor requirements?""","[1. IntroductionandDefinitions\nThisdocument will displayall requirementsforarmors, shields, andweaponsusedinanycategory. Everycompetitorissolelyresponsibleforthequalityof his/herequipment. MarshalsandAuthenticityOfficersareresponsibleforthecorrect applicationof thisdocument.\n1.1.Equipment● Armors, clothes, shieldsandweaponsusedbyacompetitor.\n1.2.List● Placewherethefightsoccur.\n1.3.Categories● Duel: 1v1swordandshield, swordandbuckler, longsword, polearm.● Buhurt: Groupfight, 5v5, 12v12, 30v30.● Outrance\n1.4.Historicalsources● It is theresult of researchtoconcludetheplausibilityof theitemhadexisted. Historicalsources can be primary (artifacts, museum originals…) or secondary (images,paintings, statues, description, writings…). Any historical source must be analyzedthroughcriticismandlogic.\nTechnicalRequirementsforarmors,weaponsandshields 2]","Based on the introduction of the document related to buhurt armor requirements, Authenticity Officers are responsible for the correct application of the document. Their role is to ensure that all equipment, including armors, clothes, shields, and weapons used by competitors, meet the historical source requirements outlined in the document. Authenticity Officers analyze historical sources, which can be primary (such as artifacts and museum originals) or secondary (such as images, paintings, statues, descriptions, and writings), using criticism and logic to determine the plausibility of an item's existence. They play a crucial role in maintaining the historical accuracy and authenticity of the competition.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
3,"""Explain the prohibited features for armors and give an example of a modern material or manufacturing technique that would be not allowed in historical reenactment events, according to the given document.""","[2. Historical ConsistencyRequirements\n2.1.Equipmentfromhistoricalsources● Onlyarmorsderivedfromhistorical sourcesarepermittedforuse.● Armorsmust alignwithAuthenticityRulesdocuments.\n2.2.Datesofsources● Armorsmust alignwithsourcesdatingbetweenthe14th(1300)and17th(1600)centuries.● For safety considerations, reproductions of armors predating the 13th century areprohibited.\n2.3.Prohibitedfeatures● Prohibited features encompass evident indications of modern materials or manufacturingtechniques including: neoncolors, obvious nyloncords, plastic ties, visibleweldedseams,heat-induceddiscoloration, modernfootwearandothervisiblemodernequipment.\n2.4.ConsistencyinEquipmentArmors, shields and weapons must consist of pieces fromthesamestyle. Distinct styles aredefinedinmodern-daytermsas:\n2.4.1.WesternEuropestyle● 14thcentury:from1300to1380● Transitional:from1380to1420● 15th century: from1420 to 1500. XVth stylearmor must beapprovedby theAuthenticity Committee. We recommended seeking approval before buyingsucharmor.● Western Europe includes the following modern countries: Great Britain,Ireland, France, Portugal, Spain, Germany, Italy, Norway, Denmark, Sweden,Finland, Austria, Switzerland, Belgium, andtheNetherlands2.4.2.SlavicInfluence● Central Europe14th:from1300to1400● Russianlatearmors:from1500to1700● SlavicInfluenceincludesthefollowingmoderncountries:CzechRepublic,Romania, Hungary, Poland, Slovakia, Slovenia, Croatia, Latvia, Estonia,Moldova, Serbia, Ukraine, Russia, andBelarus.2.4.3.Easterninfluence● Chinesestyle: from1300to1600● Japanesesamurai style: from1400to1700● Middle-East style: from1300to1700● Easterninfluenceincludes thefollowingmoderncountries: China, Japan, India,Korea, Iran, Iraq, Turkey, Egypt\nTechnicalRequirementsforarmors,weaponsandshields 3]","According to the given document, prohibited features for armors include any evident indications of modern materials or manufacturing techniques. Specifically, the use of neon colors, obvious nylon cords, plastic ties, visible welded seams, heat-induced discoloration, modern footwear, and other visible modern equipment are not permitted. \n\nAn example of a modern material or manufacturing technique that would not be allowed in historical reenactment events, according to the given document, is the use of heat-induced discoloration. This technique is often used in modern manufacturing processes to create a specific color or finish on a material, but it is not consistent with the historical methods of creating armor. Therefore, any armor with heat-induced discoloration would be considered a prohibited feature and not allowed in historical reenactment events.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
4,"""What is the process for competitors to deviate from the specified styles of armor, weapons, and shields, and who do they need to notify in order to obtain validation of their equipment choices?""","[2.4.4.Authorizeddeviations● Competitors may deviate from these specified styles if they provide sourcessupportingtheirequipment choices.● Competitors who deviate from these specified styles must send an email toAC@buhurtinternational.comtoobtainvalidationof theirarmor.\nTechnicalRequirementsforarmors,weaponsandshields 4]","Competitors can deviate from the specified styles of armor, weapons, and shields by providing sources that support their equipment choices. Once they have done this, they must send an email to AC@buhurtinternational.com to obtain validation of their armor. This information is based on the provided context, and no prior knowledge was used in formulating the answer.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
6,"Question: According to the ""BuHurt Armor Requirement"" document, what are the specific protection requirements for a helmet, and what should be the characteristics of locking systems if they are visible?","[4. Helmetsrequirements\n4.1.Protection● Ahelmet istheheadprotection. It must protect theentireheadandneckfromall sides.Ahelmetiscomposedbythefollowingparts\n● A helmet must protect the whole head. Open-face helmets are not allowed in anycategory.● Helmet must conformtotheAuthenticityRulesdocuments.● If thehelmet includesaremovableorpivotingvisor, it shouldbesecuredtoprevent anyopeningduringafight.● Locking systems, if visible, must be based on historical sources and use authenticmaterials(leather, rope, etc.).● ApprovedlockingsystemsaredisplayedinAuthenticityRulesdocuments.\nTechnicalRequirementsforarmors,weaponsandshields 6]","According to the ""BuHurt Armor Requirement"" document, the specific protection requirements for a helmet include protecting the entire head and neck from all sides. Open-face helmets are not allowed in any category. Additionally, the helmet must conform to the Authenticity Rules documents. If the helmet includes a removable or pivoting visor, it should be secured to prevent any opening during a fight.\n\nRegarding locking systems, if they are visible, they must be based on historical sources and use authentic materials such as leather or rope. Approved locking systems are displayed in the Authenticity Rules documents.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
10,"Question: Based on the provided document changelog, what is the most recent update as of December 2024 regarding the technical requirements for armors, weapons, and shields?","[6. Changelog\nThiswill bethechangelogof thisdocument. It will beupdatedregularlyandwewilldisplaywhat hasbeenremoved/added/reworded/formatted. Thiswaywekeeptransparent communicationandclarityabout therules\nDateupdate:December2024-add: Document created\nTechnicalRequirementsforarmors,weaponsandshields 10]","Based on the provided document changelog, the most recent update as of December 2024 regarding the technical requirements for armors, weapons, and shields is the creation of the document.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
12,"Question: Based on the ""Historical Consistency Requirements"" section of the document, explain what is meant by ""Prohibited features"" and provide an example of such a feature as it relates to weapon equipment.",[Technical RequirementsforWeaponsTableofContents1. IntroductionandDefinitions............................................................................................................................11.1. Equipment...........................................................................................................................................21.2. List.......................................................................................................................................................21.3. Categories...........................................................................................................................................21.4. StrikingEdge/ Non-strikingEdge(of theWeapon).............................................................................21.5. Historical sources................................................................................................................................22. Historical ConsistencyRequirements............................................................................................................32.1. Equipment fromhistorical sources......................................................................................................32.2. Datesof sources.................................................................................................................................32.3. Prohibitedfeatures..............................................................................................................................32.4. ConsistencyinEquipment...................................................................................................................32.4.1. WesternEuropestyle.......................................................................................................................32.4.2. SlavicInfluence................................................................................................................................32.4.3. Easterninfluence............................................................................................................................. 32.4.4. Authorizeddeviations.......................................................................................................................43. General Requirements......................................................................................................................................53.1. Competitorresponsibility.....................................................................................................................53.2. Qualityequipment...............................................................................................................................53.3. Accesstothelist.................................................................................................................................54. Weaponsrequirements.....................................................................................................................................64.1. Typesof weapons...............................................................................................................................64.1.1. Bladedweapons...............................................................................................................................64.1.2. Haftedweapons...............................................................................................................................65. General requirements.......................................................................................................................................75.1. Safetyrequirements............................................................................................................................75.2. Authenticityrequirements....................................................................................................................75.3. Specificrequirementsforbladedweapons.........................................................................................75.4. Specificrequirementsforhaftedweapons..........................................................................................85.4.1. Components.....................................................................................................................................85.4.2. Haft Requirements...........................................................................................................................96. Outranceweapons..........................................................................................................................................106.1. Point of balance................................................................................................................................ 107. DuelsWeapons................................................................................................................................................117.1. Type1Bladed....................................................................................................................................117.2. type2bladed.....................................................................................................................................118. Buhurt/GroupWeapons..................................................................................................................................128.1. Type2&3Bladed..............................................................................................................128.2. Type1&2Hafted.............................................................................................................. 129. Changelog.......................................................................................................................................................13],"Based on the ""Historical Consistency Requirements"" section of the document, ""Prohibited features"" refer to elements in weapon equipment that do not align with historical sources or are not consistent with the time period or region being represented. An example of a prohibited feature could be the presence of synthetic materials in a weapon that is supposed to be made entirely of historical materials, such as wood, iron, or steel. Another example could be the use of modern mechanical components in a weapon that would not have been present in the historical equivalent. These prohibited features are not in line with the historical authenticity requirements set forth in the document.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
13,"Question: Based on the document, what is the role of Authenticity Officers in buhurt competitions? Authenticity Officers are responsible for the correct application of the document in buhurt competitions, which includes group fights in various formats such as 5v5, 12v12, and 30v30. Additionally, they are responsible for ensuring that all equipment, including armors, clothes, shields, and weapons, meet the requirements set forth in the document.","[IntroductionandDefinitions\nThisdocument will displayall requirementsforarmors, shields, andweaponsusedinanycategory.Everycompetitorissolelyresponsibleforthequalityof his/herequipment. MarshalsandAuthenticityOfficersareresponsibleforthecorrect applicationof thisdocument.\n1.1.Equipment● Armors, clothes, shieldsandweaponsusedbyacompetitor.\n1.2.List● Placewherethefightsoccur.\n1.3.Categories● Duel: 1v1swordandshield, swordandbuckler, longsword, polearm.\n● Buhurt: Groupfight, 5v5, 12v12, 30v30.\n● Outrance\n1.4.StrikingEdge/ Non-strikingEdge(oftheWeapon)● StrikingEdgeisapart of theweapondesignedtohit anopponent.\n● Non-strikingedgeisapart of theweaponnot designedtohit anopponent.\nHerearesomeexampleswiththestrikingedgebeingred\n1.5.Historical sourcesIt istheresult of researchtoconcludetheplausibilityof theitemhadexisted. Historical sourcescanbe primary (artifacts, museumoriginals…) or secondary (images, paintings, statues, description,writings…). Anyhistorical sourcemust beanalyzedthroughcriticismandlogic.]","Authenticity Officers are responsible for ensuring that all equipment used in buhurt competitions, which include group fights in various formats such as 5v5, 12v12, and 30v30, adhere to the requirements set forth in the document. This includes armors, clothes, shields, and weapons. They are also responsible for the correct application of the document in buhurt competitions.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
14,"What modern countries are included in the Western Europe style definition for armors, shields, and weapons that must be consistent in equipment style, according to the Authenticity Rules documents?","[2.Historical ConsistencyRequirements\n2.1.Equipmentfromhistorical sources● Onlyweaponsderivedfromhistorical sourcesarepermittedforuse.\n● Weaponsmust alignwithAuthenticityRulesdocuments.\n2.2.Datesofsources● Weapons must align with sources dating between the 14th (1300) and 17th (1600)centuries.\n● For safety considerations, reproductions of armors predating the 13th century areprohibited, asareweaponsof thesameperiod.\n2.3.Prohibitedfeatures● Prohibited features encompass evident indications of modern materials or manufacturingtechniques including: neoncolors, obvious nyloncords, plastic ties, visibleweldedseams,heat-induceddiscolorationandothervisiblemodernequipment.\n2.4.ConsistencyinEquipmentArmors, shields and weapons must consist of pieces from the same style. Distinct styles aredefinedinmodern-daytermsas:\n2.4.1. WesternEuropestyle● 14thcentury: from1300to1380\n● Transitional: from1380to1420\n● 15th century: from1420 to 1500. XVth style armor must be approvedby theAuthenticityCommittee. Werecommendedseekingapproval beforebuyingsucharmor.\n● Western Europe includes the following modern countries: Great Britain, Ireland, France,Portugal, Spain, Germany, Italy, Norway, Denmark, Sweden, Finland, Austria, Switzerland,Belgium, andtheNetherlands\n2.4.2. SlavicInfluence● Central Europe14th: from1300to1400\n● Russianlatearmors: from1500to1700\n● Slavic Influence includes the following modern countries: Czech Republic, Romania,Hungary, Poland, Slovakia, Slovenia, Croatia, Latvia, Estonia, Moldova, Serbia, Ukraine,Russia, andBelarus.]","The Western Europe style definition for armors, shields, and weapons that must be consistent in equipment style, according to the Authenticity Rules documents, includes the following modern countries: Great Britain, Ireland, France, Portugal, Spain, Germany, Italy, Norway, Denmark, Sweden, Finland, Austria, Switzerland, Belgium, and the Netherlands.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)


In [84]:
# eval_questions_df = eval_questions_df.drop(index=[5, 7, 8, 9, 11, 16, 20, 21, 22, 23])
eval_questions_df

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,"Question: According to the ""buhurt_armor_requirement.pdf"" document, what is the minimum thickness requirement for the lower leg protection armor, as specified on page 1?",[],"To provide an accurate answer, I would need access to the specific contents of the ""buhurt\_armor\_requirement.pdf"" document located at /home/ec2-user/SageMaker/elvtr-ai-solution-architect/class-14/source\_docs/. However, based on the provided context, I can tell you that the minimum thickness requirement for the lower leg protection armor is specified on page 1 of the document. To get the exact thickness requirement, please refer to the contents of the document itself.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
1,"Question: Based on the ""Technical Requirements for Armor"" document, explain the historical consistency requirements for armor, including the considerations for equipment from historical sources, dates of sources, prohibited features, and consistency in equipment with regards to Western European style, Slavic influence, Eastern influence, and authorized deviations.","[Technical RequirementsforArmor\nTableofContents\nTechnicalRequirementsforArmor 11.IntroductionandDefinitions 21.1.Equipment 21.2.List 21.3.Categories 21.4.Historicalsources 22.HistoricalConsistencyRequirements 32.1.Equipmentfromhistoricalsources 32.2.Datesofsources 32.3.Prohibitedfeatures 32.4.ConsistencyinEquipment 32.4.1.WesternEuropestyle 32.4.2.SlavicInfluence 32.4.3.Easterninfluence 32.4.4.Authorizeddeviations 43.GeneralRequirements 53.1.Competitorresponsibility 53.2.Qualityequipment 53.3.Safeequipment 53.4.Accesstothelist 54.Helmetsrequirements 64.1.Protection 64.2.Thicknessandcomposition 74.3.Coif 74.4.Chinstrap 74.5.Neckprotection 75.Armorrequirements 85.1.Metalthicknessrecommendations 85.2.Chest 85.3.Groin 85.4.Hands 85.5.Arms 85.6.Joints 95.7.Legs 95.8.Feet 95.9.Shoes 96.Changelog 10\nTechnicalRequirementsforarmors,weaponsandshields 1]","The ""Technical Requirements for Armor"" document outlines the historical consistency requirements for armor under section 2.2. These requirements ensure that the armor used in competitions is consistent with historical sources and practices.\n\nFirstly, the equipment must come from historical sources. This means that the design, construction, and materials of the armor should be based on existing examples from historical records, artifacts, or reputable reconstructions.\n\nSecondly, the dates of the sources must be considered. The armor should be consistent with the time period it represents, and should not mix elements from different epochs. This helps maintain the authenticity and immersion of the historical reenactment or competition.\n\nThirdly, there are prohibited features. These are elements of armor that were not used in the specified historical period or culture, or that are considered anachronistic. For example, the use of rivets in certain periods or regions where they were not historically employed would be a prohibited feature.\n\nFourthly, consistency in equipment is emphasized. This includes consistency in equipment with regards to Western European style, Slavic influence, Eastern influence, and authorized deviations. \n\n- Western European style: The armor should reflect the typical design, construction, and materials used in Western Europe during the specified historical period. This includes considerations for the type of armor, the materials used, and the methods of construction.\n\n- Slavic influence: The armor should reflect the influence of Slavic cultures, if applicable. This could include specific design elements, materials, or construction techniques that are characteristic of Slavic armor.\n\n- Eastern influence: The armor should reflect the influence of Eastern cultures, if applicable. This could include specific design elements, materials, or construction techniques that are characteristic of Eastern armor.\n\n- Authorized deviations: There may be certain deviations from historical practices that are allowed, provided they do not violate the other historical consistency requirements. These deviations should be authorized and documented, and should not detract from the historical authenticity of the armor.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
2,"""Can you explain the role and responsibility of Authenticity Officers as mentioned in the introduction of the document related to buhurt armor requirements?""","[1. IntroductionandDefinitions\nThisdocument will displayall requirementsforarmors, shields, andweaponsusedinanycategory. Everycompetitorissolelyresponsibleforthequalityof his/herequipment. MarshalsandAuthenticityOfficersareresponsibleforthecorrect applicationof thisdocument.\n1.1.Equipment● Armors, clothes, shieldsandweaponsusedbyacompetitor.\n1.2.List● Placewherethefightsoccur.\n1.3.Categories● Duel: 1v1swordandshield, swordandbuckler, longsword, polearm.● Buhurt: Groupfight, 5v5, 12v12, 30v30.● Outrance\n1.4.Historicalsources● It is theresult of researchtoconcludetheplausibilityof theitemhadexisted. Historicalsources can be primary (artifacts, museum originals…) or secondary (images,paintings, statues, description, writings…). Any historical source must be analyzedthroughcriticismandlogic.\nTechnicalRequirementsforarmors,weaponsandshields 2]","Based on the introduction of the document related to buhurt armor requirements, Authenticity Officers are responsible for the correct application of the document. Their role is to ensure that all equipment, including armors, clothes, shields, and weapons used by competitors, meet the historical source requirements outlined in the document. Authenticity Officers analyze historical sources, which can be primary (such as artifacts and museum originals) or secondary (such as images, paintings, statues, descriptions, and writings), using criticism and logic to determine the plausibility of an item's existence. They play a crucial role in maintaining the historical accuracy and authenticity of the competition.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
3,"""Explain the prohibited features for armors and give an example of a modern material or manufacturing technique that would be not allowed in historical reenactment events, according to the given document.""","[2. Historical ConsistencyRequirements\n2.1.Equipmentfromhistoricalsources● Onlyarmorsderivedfromhistorical sourcesarepermittedforuse.● Armorsmust alignwithAuthenticityRulesdocuments.\n2.2.Datesofsources● Armorsmust alignwithsourcesdatingbetweenthe14th(1300)and17th(1600)centuries.● For safety considerations, reproductions of armors predating the 13th century areprohibited.\n2.3.Prohibitedfeatures● Prohibited features encompass evident indications of modern materials or manufacturingtechniques including: neoncolors, obvious nyloncords, plastic ties, visibleweldedseams,heat-induceddiscoloration, modernfootwearandothervisiblemodernequipment.\n2.4.ConsistencyinEquipmentArmors, shields and weapons must consist of pieces fromthesamestyle. Distinct styles aredefinedinmodern-daytermsas:\n2.4.1.WesternEuropestyle● 14thcentury:from1300to1380● Transitional:from1380to1420● 15th century: from1420 to 1500. XVth stylearmor must beapprovedby theAuthenticity Committee. We recommended seeking approval before buyingsucharmor.● Western Europe includes the following modern countries: Great Britain,Ireland, France, Portugal, Spain, Germany, Italy, Norway, Denmark, Sweden,Finland, Austria, Switzerland, Belgium, andtheNetherlands2.4.2.SlavicInfluence● Central Europe14th:from1300to1400● Russianlatearmors:from1500to1700● SlavicInfluenceincludesthefollowingmoderncountries:CzechRepublic,Romania, Hungary, Poland, Slovakia, Slovenia, Croatia, Latvia, Estonia,Moldova, Serbia, Ukraine, Russia, andBelarus.2.4.3.Easterninfluence● Chinesestyle: from1300to1600● Japanesesamurai style: from1400to1700● Middle-East style: from1300to1700● Easterninfluenceincludes thefollowingmoderncountries: China, Japan, India,Korea, Iran, Iraq, Turkey, Egypt\nTechnicalRequirementsforarmors,weaponsandshields 3]","According to the given document, prohibited features for armors include any evident indications of modern materials or manufacturing techniques. Specifically, the use of neon colors, obvious nylon cords, plastic ties, visible welded seams, heat-induced discoloration, modern footwear, and other visible modern equipment are not permitted. \n\nAn example of a modern material or manufacturing technique that would not be allowed in historical reenactment events, according to the given document, is the use of heat-induced discoloration. This technique is often used in modern manufacturing processes to create a specific color or finish on a material, but it is not consistent with the historical methods of creating armor. Therefore, any armor with heat-induced discoloration would be considered a prohibited feature and not allowed in historical reenactment events.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
4,"""What is the process for competitors to deviate from the specified styles of armor, weapons, and shields, and who do they need to notify in order to obtain validation of their equipment choices?""","[2.4.4.Authorizeddeviations● Competitors may deviate from these specified styles if they provide sourcessupportingtheirequipment choices.● Competitors who deviate from these specified styles must send an email toAC@buhurtinternational.comtoobtainvalidationof theirarmor.\nTechnicalRequirementsforarmors,weaponsandshields 4]","Competitors can deviate from the specified styles of armor, weapons, and shields by providing sources that support their equipment choices. Once they have done this, they must send an email to AC@buhurtinternational.com to obtain validation of their armor. This information is based on the provided context, and no prior knowledge was used in formulating the answer.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
6,"Question: According to the ""BuHurt Armor Requirement"" document, what are the specific protection requirements for a helmet, and what should be the characteristics of locking systems if they are visible?","[4. Helmetsrequirements\n4.1.Protection● Ahelmet istheheadprotection. It must protect theentireheadandneckfromall sides.Ahelmetiscomposedbythefollowingparts\n● A helmet must protect the whole head. Open-face helmets are not allowed in anycategory.● Helmet must conformtotheAuthenticityRulesdocuments.● If thehelmet includesaremovableorpivotingvisor, it shouldbesecuredtoprevent anyopeningduringafight.● Locking systems, if visible, must be based on historical sources and use authenticmaterials(leather, rope, etc.).● ApprovedlockingsystemsaredisplayedinAuthenticityRulesdocuments.\nTechnicalRequirementsforarmors,weaponsandshields 6]","According to the ""BuHurt Armor Requirement"" document, the specific protection requirements for a helmet include protecting the entire head and neck from all sides. Open-face helmets are not allowed in any category. Additionally, the helmet must conform to the Authenticity Rules documents. If the helmet includes a removable or pivoting visor, it should be secured to prevent any opening during a fight.\n\nRegarding locking systems, if they are visible, they must be based on historical sources and use authentic materials such as leather or rope. Approved locking systems are displayed in the Authenticity Rules documents.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
10,"Question: Based on the provided document changelog, what is the most recent update as of December 2024 regarding the technical requirements for armors, weapons, and shields?","[6. Changelog\nThiswill bethechangelogof thisdocument. It will beupdatedregularlyandwewilldisplaywhat hasbeenremoved/added/reworded/formatted. Thiswaywekeeptransparent communicationandclarityabout therules\nDateupdate:December2024-add: Document created\nTechnicalRequirementsforarmors,weaponsandshields 10]","Based on the provided document changelog, the most recent update as of December 2024 regarding the technical requirements for armors, weapons, and shields is the creation of the document.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
12,"Question: Based on the ""Historical Consistency Requirements"" section of the document, explain what is meant by ""Prohibited features"" and provide an example of such a feature as it relates to weapon equipment.",[Technical RequirementsforWeaponsTableofContents1. IntroductionandDefinitions............................................................................................................................11.1. Equipment...........................................................................................................................................21.2. List.......................................................................................................................................................21.3. Categories...........................................................................................................................................21.4. StrikingEdge/ Non-strikingEdge(of theWeapon).............................................................................21.5. Historical sources................................................................................................................................22. Historical ConsistencyRequirements............................................................................................................32.1. Equipment fromhistorical sources......................................................................................................32.2. Datesof sources.................................................................................................................................32.3. Prohibitedfeatures..............................................................................................................................32.4. ConsistencyinEquipment...................................................................................................................32.4.1. WesternEuropestyle.......................................................................................................................32.4.2. SlavicInfluence................................................................................................................................32.4.3. Easterninfluence............................................................................................................................. 32.4.4. Authorizeddeviations.......................................................................................................................43. General Requirements......................................................................................................................................53.1. Competitorresponsibility.....................................................................................................................53.2. Qualityequipment...............................................................................................................................53.3. Accesstothelist.................................................................................................................................54. Weaponsrequirements.....................................................................................................................................64.1. Typesof weapons...............................................................................................................................64.1.1. Bladedweapons...............................................................................................................................64.1.2. Haftedweapons...............................................................................................................................65. General requirements.......................................................................................................................................75.1. Safetyrequirements............................................................................................................................75.2. Authenticityrequirements....................................................................................................................75.3. Specificrequirementsforbladedweapons.........................................................................................75.4. Specificrequirementsforhaftedweapons..........................................................................................85.4.1. Components.....................................................................................................................................85.4.2. Haft Requirements...........................................................................................................................96. Outranceweapons..........................................................................................................................................106.1. Point of balance................................................................................................................................ 107. DuelsWeapons................................................................................................................................................117.1. Type1Bladed....................................................................................................................................117.2. type2bladed.....................................................................................................................................118. Buhurt/GroupWeapons..................................................................................................................................128.1. Type2&3Bladed..............................................................................................................128.2. Type1&2Hafted.............................................................................................................. 129. Changelog.......................................................................................................................................................13],"Based on the ""Historical Consistency Requirements"" section of the document, ""Prohibited features"" refer to elements in weapon equipment that do not align with historical sources or are not consistent with the time period or region being represented. An example of a prohibited feature could be the presence of synthetic materials in a weapon that is supposed to be made entirely of historical materials, such as wood, iron, or steel. Another example could be the use of modern mechanical components in a weapon that would not have been present in the historical equivalent. These prohibited features are not in line with the historical authenticity requirements set forth in the document.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
13,"Question: Based on the document, what is the role of Authenticity Officers in buhurt competitions? Authenticity Officers are responsible for the correct application of the document in buhurt competitions, which includes group fights in various formats such as 5v5, 12v12, and 30v30. Additionally, they are responsible for ensuring that all equipment, including armors, clothes, shields, and weapons, meet the requirements set forth in the document.","[IntroductionandDefinitions\nThisdocument will displayall requirementsforarmors, shields, andweaponsusedinanycategory.Everycompetitorissolelyresponsibleforthequalityof his/herequipment. MarshalsandAuthenticityOfficersareresponsibleforthecorrect applicationof thisdocument.\n1.1.Equipment● Armors, clothes, shieldsandweaponsusedbyacompetitor.\n1.2.List● Placewherethefightsoccur.\n1.3.Categories● Duel: 1v1swordandshield, swordandbuckler, longsword, polearm.\n● Buhurt: Groupfight, 5v5, 12v12, 30v30.\n● Outrance\n1.4.StrikingEdge/ Non-strikingEdge(oftheWeapon)● StrikingEdgeisapart of theweapondesignedtohit anopponent.\n● Non-strikingedgeisapart of theweaponnot designedtohit anopponent.\nHerearesomeexampleswiththestrikingedgebeingred\n1.5.Historical sourcesIt istheresult of researchtoconcludetheplausibilityof theitemhadexisted. Historical sourcescanbe primary (artifacts, museumoriginals…) or secondary (images, paintings, statues, description,writings…). Anyhistorical sourcemust beanalyzedthroughcriticismandlogic.]","Authenticity Officers are responsible for ensuring that all equipment used in buhurt competitions, which include group fights in various formats such as 5v5, 12v12, and 30v30, adhere to the requirements set forth in the document. This includes armors, clothes, shields, and weapons. They are also responsible for the correct application of the document in buhurt competitions.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)
14,"What modern countries are included in the Western Europe style definition for armors, shields, and weapons that must be consistent in equipment style, according to the Authenticity Rules documents?","[2.Historical ConsistencyRequirements\n2.1.Equipmentfromhistorical sources● Onlyweaponsderivedfromhistorical sourcesarepermittedforuse.\n● Weaponsmust alignwithAuthenticityRulesdocuments.\n2.2.Datesofsources● Weapons must align with sources dating between the 14th (1300) and 17th (1600)centuries.\n● For safety considerations, reproductions of armors predating the 13th century areprohibited, asareweaponsof thesameperiod.\n2.3.Prohibitedfeatures● Prohibited features encompass evident indications of modern materials or manufacturingtechniques including: neoncolors, obvious nyloncords, plastic ties, visibleweldedseams,heat-induceddiscolorationandothervisiblemodernequipment.\n2.4.ConsistencyinEquipmentArmors, shields and weapons must consist of pieces from the same style. Distinct styles aredefinedinmodern-daytermsas:\n2.4.1. WesternEuropestyle● 14thcentury: from1300to1380\n● Transitional: from1380to1420\n● 15th century: from1420 to 1500. XVth style armor must be approvedby theAuthenticityCommittee. Werecommendedseekingapproval beforebuyingsucharmor.\n● Western Europe includes the following modern countries: Great Britain, Ireland, France,Portugal, Spain, Germany, Italy, Norway, Denmark, Sweden, Finland, Austria, Switzerland,Belgium, andtheNetherlands\n2.4.2. SlavicInfluence● Central Europe14th: from1300to1400\n● Russianlatearmors: from1500to1700\n● Slavic Influence includes the following modern countries: Czech Republic, Romania,Hungary, Poland, Slovakia, Slovenia, Croatia, Latvia, Estonia, Moldova, Serbia, Ukraine,Russia, andBelarus.]","The Western Europe style definition for armors, shields, and weapons that must be consistent in equipment style, according to the Authenticity Rules documents, includes the following modern countries: Great Britain, Ireland, France, Portugal, Spain, Germany, Italy, Norway, Denmark, Sweden, Finland, Austria, Switzerland, Belgium, and the Netherlands.",ai (mistral.mixtral-8x7b-instruct-v0:1),ai (mistral.mixtral-8x7b-instruct-v0:1)


### Take a look at the default prompt

In [65]:
from llama_index.core import PromptTemplate

In [66]:
# define prompt viewing function for the prompt we care about
prompt_template_key = "response_synthesizer:text_qa_template"

def get_response_synthesizer_text_qa_prompt(prompts_dict):
    for k, p in prompts_dict.items():
        if k == "response_synthesizer:text_qa_template":
            return p.get_template()

In [67]:
default_prompt = get_response_synthesizer_text_qa_prompt(query_engine.get_prompts())
print(default_prompt)

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 


In [73]:
response = query_engine.query(example_query)
print(response)

Armors must be derived from historical sources dating between the 14th (1300) and 17th (1600) centuries and align with Authenticity Rules documents. They must consist of pieces from the same style (Western Europe, Slavic Influence, or Eastern influence) and cannot have evident indications of modern materials or manufacturing techniques. Reproductions of armors predating the 13th century are prohibited for safety reasons.


In [88]:
# Example alternate prompt

# The objective being, for this example, to get a more concise and clear answer
new_text_qa_prompt_str = (
    "You are an expert in medieval history, and more specifically in armoury.\n"
    "You are part of the Buhurt International committee, an organization running medieval fights events.\n"
    "Your role in the committee is to review armour authenticity based on certain outlined criterias.\n"
    "Your task is to answer readers questions on the given information context,"
    "in a clear, consise and friendly, manner, in two or three sentences.\n"
    "If the answer of their question is not outlined in the documents (e.g. minimum thickness requirements),"
    "reply that Buhurt International does not provide any guideline for this matter.\n"
    "If the answer to their question is not available from the context,"
    "reply that the question cannot be answered given the information that you have."
    "Output the answer directly without a preamble,"
    "(e.g. without saying `Based on the context,` or similar)."
    "<context>\n"
    "{context_str}\n"
    "</context>\n"
    "Given the context and not prior knowledge, "
    "Query: {query_str}\n"
    "Answer: "
)


In [89]:
# update the qa_prompt to the new prompt

query_engine.update_prompts(
    {prompt_template_key: PromptTemplate(new_text_qa_prompt_str)}
)

In [90]:
response = query_engine.query(example_query)
print(response)

Armor must be derived from historical sources dating between the 14th (1300) and 17th (1600) centuries. It must align with Authenticity Rules documents and consist of pieces from the same style (Western Europe, Slavic Influence, or Eastern influence). Prohibited features include evident indications of modern materials or manufacturing techniques such as neon colors, obvious nylon cords, plastic ties, visible welded seams, and heat-induced discoloration.


**Note** Once you have updated the query_engine with your prompt then re-run the tests

In [94]:
# reset to the default prompt (if desired) to see the difference once more

query_engine.update_prompts(
    {prompt_template_key: PromptTemplate(default_prompt)}
)

----

-----
# Assignment Task #4: Wrapping Up

## 1. Do you think your customer would be satisified with the results? If there were not, what would you offer to do?

The results of assignment 3 are quite good. I was expecting to reach some better results, but it seems that some questions are still a bit tricky, strangely formulated.

I think for my use case and data, we would need to elaborate our our grading questions, or fine-tune the LLM generating them.

## 2. You have been testing using 3 of the available end to end RAG evaluation methods supported by Llamaindex. Breifly review the other evaluation methods on Llamaindex website. Which one or two might you also include for you customer, and why?

Even though it was a little more complicated than using the retrieval evaluation, the question generation came up with tricky questions that really help build a robust solution and review some questions that some domain expert wouldn't think of.

In my context, as this RAG would be used to outline regulations, it is very important to check for hallucination. We wouldn't want our solution to give answers that were not part of the RAG documents.

I could not find hallucination evaluation out of the box (I might not have dig enough in the documation), but I can see some great integration like **deepeval**, **uptrain**, **RAGChecker** that would provide this kind of metric and nice looking dashboard to talk through with the customer.

The Guideline evaluator would be a good tool to add in the evaluation too as we need quite strict answers in this medieval fight regulations topic.

## 3. In two or three sentences, note how you might further experiment and improve your RAG pipeline, if your customer gave you more money to make it better.

- Clean up the PDFs before using them in the RAG: build a sanitation step in this notebook or in a lambda that will convert the PDF in text and remove some the unecessary data that could be confusing for the LLM (e.g. header, images, etc...)
- Import more relevant data into RAG
- Find a model that would be a bit more optimized on medieval topics
- Improve the prompt by review answers, false positive and evaluation results
- Implement a monitor workflow, where we collect data from user requests, review the answers and notify developers to improve prompt/model/flag bugs.

## The following assignment tasks are completely optional 
The follow tasks are intended for students who want to dive deeper. They are more open ended and require changing and augmenting the code share above.

**Optional Task 1** Configure the query_engine to use an LLM from another LLM service provider and re-run the tests.

You may be able to reduce the sleep_number for throttling to speed up your testing, depending on the LLM service


**Optional Task 2** Add a reranking capability to the query engine

Adding an Reranker to the query engine pipeline only requires a few lines of code and will increase you solution accuracy. If you use a light-model, specialized reranking model, you can typically reduce overall inference costs. 

For this exercise, using an LLM Reranker is fine. You should see an improvement in the output quality.

Overall, if you use a reranker, consider increasing the value of k (number of chunks) from the retriever and into the reranker, and have a smaller number of chunks (e.g. 2 or 3) selected after reranking for the LLM synthesis.