# LlamaIndex Evaluation

## Introduction
---
[**LlamaIndex**](https://www.llamaindex.ai/) is an open-source framework that connects data sources to large language models (LLMs). Developers can use LlamaIndex to build generative AI applications powered by LLMs, particularly retrieval-augmented generation (RAG) applications.

LlamaIndex offers several evaluation approaches, and the right one depends on your requirements and objectives. If you want an overall picture of how your [RAG](https://aws.amazon.com/what-is/retrieval-augmented-generation/) system is doing, you can start with **end-to-end (E2E) evaluation** as a sanity check.

Otherwise, if you already know which components you want to iterate on, you may want to start with a **component-wise evaluation**. However, you run the risk of <u>premature optimization</u>: making model selection or parameter choices without assessing the overall application needs.

There is no one-size-fits-all approach. Evaluation is something you iterate on and experiment with, based on your application objectives.


## Set up

In [None]:
%pip install -qU --quiet -r requirements.txt

In [1]:
import boto3
import nest_asyncio
import llama_index
import pandas as pd
import ast
from typing import Tuple, List
import json
from IPython.display import display, Markdown

nest_asyncio.apply()
boto_session = boto3.session.Session()

bedrock_client = boto_session.client(
    service_name='bedrock',
    region_name=boto_session.region_name
)

## Bring our own Bedrock model to LlamaIndex

### Set up embedding model
---
We will set up Amazon Bedrock embeddings with the **amazon.titan-embed-text-v2:0** model. To explore other embedding models, use the `list_supported_models()` method of the `BedrockEmbedding` class to see which ones are supported.

In [2]:
from llama_index.embeddings.bedrock import BedrockEmbedding

supported_models = BedrockEmbedding.list_supported_models()
print(json.dumps(supported_models, indent=2))

titan_embedding = BedrockEmbedding(
    model_name='amazon.titan-embed-text-v2:0',
    region_name=boto_session.region_name,
)
print(titan_embedding.get_text_embedding("hello world")[:5])

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /opt/conda/lib/python3.11/site-
[nltk_data]     packages/llama_index/core/_static/nltk_cache...
[nltk_data]   Package punkt_tab is already up-to-date!


{
  "amazon": [
    "amazon.titan-embed-text-v1",
    "amazon.titan-embed-text-v2:0",
    "amazon.titan-embed-g1-text-02"
  ],
  "cohere": [
    "cohere.embed-english-v3",
    "cohere.embed-multilingual-v3"
  ]
}
[-0.020442573353648186, 0.056570641696453094, 0.006846333388239145, -0.009064160287380219, 0.038828033953905106]


### Set up LLM
---

<div class="alert alert-block alert-warning">
    At the time of writing, the Bedrock class from llama_index does not yet support Meta Llama 3 models. We have adjusted the code to support them. Please refer to the <b>utils</b> folder for the code.
</div>

We will use **Llama 3 70B** as the candidate LLM and **Llama 3.1 70B** as the evaluator LLM.

In [3]:
from utils.bedrock import Bedrock

llama3_70b_model_id = 'meta.llama3-70b-instruct-v1:0'
llama3_1_70b_model_id = 'meta.llama3-1-70b-instruct-v1:0'

llama3_70b_bedrock_llm = Bedrock(
    model=llama3_70b_model_id,
    temperature=.2,
    max_tokens=2048,
)

llama3_1_70b_bedrock_llm = Bedrock(
    model=llama3_1_70b_model_id,
    temperature=.01,
    max_tokens=2048,
)

print(
    llama3_1_70b_bedrock_llm.complete('What is L in LLM?',).text
)

L in LLM stands for "Large".


### Connect to existing vector database
----

Next, we connect to the existing ChromaDB from our prerequisite step.

In [4]:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

chroma_db_dir = './../vector_db'
chroma_collection_name = 'amazon-shareholder-letters'

_db = chromadb.PersistentClient(path=chroma_db_dir)
chroma_collection = _db.get_or_create_collection(chroma_collection_name)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)


In [5]:
def get_query_engine(
    vector_db: llama_index.vector_stores,
    storage_context: llama_index.core.storage,
    embedding_model: llama_index.embeddings,
    llm: llama_index.llms,
    top_k: int = 3,
    response_mode: str = 'refine',
    verbose: bool = False
) -> Tuple[llama_index.core.indices, llama_index.core.query_engine]:
    index = VectorStoreIndex.from_vector_store(
        vector_store=vector_db,
        embed_model=embedding_model,
        storage_context=storage_context,
    )
    query_engine = index.as_query_engine(
        llm=llm,
        similarity_top_k=top_k,
        response_mode=response_mode,
        verbose=verbose,
    )
    return index, query_engine

In [6]:
vector_index, query_engine = get_query_engine(
    vector_db=vector_store,
    storage_context=storage_context,
    embedding_model=titan_embedding,
    llm=llama3_70b_bedrock_llm,
    top_k=5,
)

## Component-wise Evaluation
---

For an in-depth evaluation of your RAG application, it is useful to break it down into its individual components.

For instance, you may want to investigate whether the retriever returned the right documents, and whether the LLM used that context to produce the right response. Being able to isolate and deal with these issues separately can help reduce complexity and guide you in a step-by-step manner to a more satisfactory overall result.

### Prepare Question-context pairs
---

LlamaIndex also provides evaluation dataset generation. We will use `generate_question_context_pairs` to generate questions from specific chunks (nodes) of the vector store.

In this example, we sample only 5 nodes from the vector store and use **Llama 3 70B** for question generation.

In [7]:
from llama_index.core.evaluation import (
    generate_question_context_pairs,
)
import random

_qa_prompt_tmpl = """
<Instructions>
Here is the context:
<context>
{context_str}
</context>

Your task is to generate {num_questions_per_chunk} questions 
that can be answered using the provided context, following these rules:

<rules>
1. The question should make sense to humans even when read without the given context.
2. The question should be fully answered from the given context.
3. The question should be framed from a part of context that contains important information. It can also be from tables, code, etc.
4. The answer to the question should not contain any links.
5. The question should be of moderate difficulty up to difficult.
6. The question must be reasonable and must be understood and responded by humans.
7. Do not use phrases like 'provided context', etc. in the question.
8. Your question should be able to be referenced in full sentence from the context.
9. Never create a question that refers back to the context.
</rules>

To generate the question, first identify the most important or relevant part of the context. 
Then frame a question around that part that satisfies all the rules above.
Think step-by-step and follow the <rules>.

Output only the generated question with a "?" at the end, no other text or characters.
</Instructions>
"""

nodes = vector_store._get(limit=30, where={}).nodes
selected_nodes = random.sample(nodes, 5)
qa_dataset = generate_question_context_pairs(
    nodes=selected_nodes,
    llm=llama3_70b_bedrock_llm,
    num_questions_per_chunk=1,
    qa_generate_prompt_tmpl=_qa_prompt_tmpl
)

100%|██████████| 5/5 [00:03<00:00,  1.57it/s]


Let's look at the example questions.

In [8]:
queries = qa_dataset.queries.values()
print(list(queries))

['What benefits do companies gain by moving to AWS?', 'How much faster can models be trained in Amazon SageMaker compared to other platforms?', 'What metrics does the company use to measure its market leadership?', 'What is unique about the GenAI revolution compared to the mass modernization of on-premises infrastructure to the cloud?', 'In what year did the company expand its international consumer segment?']
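To make the shape of the generated dataset concrete, here is a minimal plain-Python sketch. The `queries` and `relevant_docs` mappings mirror the attributes used in this notebook; the `corpus` mapping, the IDs, and the placeholder chunk text are assumptions for illustration only.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical IDs and placeholder text, for illustration only.
queries: Dict[str, str] = {
    "q1": "What benefits do companies gain by moving to AWS?",
}
corpus: Dict[str, str] = {
    "node-1": "<text of the chunk the question was generated from>",
}
relevant_docs: Dict[str, List[str]] = {"q1": ["node-1"]}


def iter_question_context_pairs(
    queries: Dict[str, str],
    corpus: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (question, [context texts]) for every generated question."""
    for query_id, question in queries.items():
        contexts = [corpus[node_id] for node_id in relevant_docs[query_id]]
        yield question, contexts


pairs = list(iter_question_context_pairs(queries, corpus, relevant_docs))
print(pairs)
```

Each question is linked by ID to the node it was generated from, which is what allows the retrieval evaluation later on to check whether that node is returned.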


### Retrieval Evaluation
---

For **retrieval evaluation**, each query/question is run through the retriever, and the retrieved nodes are compared against the ground-truth document/node.
Please refer to the documentation on the [usage pattern](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/usage_pattern_retrieval/) and the [example usage](https://docs.llamaindex.ai/en/stable/examples/evaluation/retrieval/retriever_eval/).

You can choose from various built-in metrics to fit your use case. Here are a few examples:

- **Hit rate**: the fraction of queries for which at least one relevant document appears among the top-k retrieved results.
- **MRR (Mean Reciprocal Rank)**: the average, across queries, of the reciprocal of the rank at which the first relevant item appears.
- **Precision**: the ratio of relevant retrieved documents to the total number of retrieved documents.
- **Recall**: the ratio of relevant retrieved documents to the total number of relevant documents available.
- **AP (Average Precision)**: summarizes how well the relevant items are ranked within the retrieved list, by averaging the precision at each position where a relevant item appears.
- **NDCG (Normalized Discounted Cumulative Gain)**: evaluates the ability of the retriever (or of the search algorithm in the vector store) to order items by relevance.
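The first five metrics can be sketched for a single query in a few lines of plain Python. This is an illustrative sketch under textbook definitions, not LlamaIndex's implementation; the function name and the node IDs are hypothetical, and retrieved IDs are assumed to be unique.

```python
from typing import Dict, List


def retrieval_metrics(
    retrieved_ids: List[str], expected_ids: List[str]
) -> Dict[str, float]:
    """Compute per-query ranking metrics from retrieved and expected node IDs."""
    expected = set(expected_ids)
    # 1-based ranks of the retrieved items that are relevant
    hit_ranks = [
        rank
        for rank, node_id in enumerate(retrieved_ids, start=1)
        if node_id in expected
    ]
    n_hits = len(hit_ranks)
    return {
        # Did at least one relevant item show up?
        "hit_rate": 1.0 if hit_ranks else 0.0,
        # Reciprocal rank of the first relevant item (averaged over queries -> MRR)
        "mrr": 1.0 / hit_ranks[0] if hit_ranks else 0.0,
        "precision": n_hits / len(retrieved_ids) if retrieved_ids else 0.0,
        "recall": n_hits / len(expected) if expected else 0.0,
        # Mean of precision@rank taken at each relevant position
        "ap": (
            sum((i + 1) / rank for i, rank in enumerate(hit_ranks)) / len(expected)
            if expected
            else 0.0
        ),
    }


# One relevant node, ranked first among three retrieved nodes
print(retrieval_metrics(["n1", "n7", "n9"], ["n1"]))
# Same node, ranked second
print(retrieval_metrics(["n7", "n1", "n9"], ["n1"]))
```

With a single relevant node per question, precision can never exceed `1/k`, so a lower precision for a larger `top_k` does not by itself indicate a worse retriever.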

Let's define 2 retrievers: one with `top_k=3` and another one with `top_k=5`. 

<div class="alert alert-block alert-info">
    <b>In a production environment</b>, you may want to try <u>different retriever types or techniques</u> to find out which parameters work best for your use case.
</div>

In [9]:
from llama_index.core.evaluation import RetrieverEvaluator

_top_3_retriever = vector_index.as_retriever(
    similarity_top_k=3,
)
_top_5_retriever = vector_index.as_retriever(
    similarity_top_k=5,
)
retrieved_metric = ['hit_rate', 'mrr', 'precision', 'recall', 'ap', 'ndcg']

# Build each evaluator once; both are reused for every query
top3_retriever_evaluator = RetrieverEvaluator.from_metric_names(
    retrieved_metric, retriever=_top_3_retriever
)
top5_retriever_evaluator = RetrieverEvaluator.from_metric_names(
    retrieved_metric, retriever=_top_5_retriever
)

for _idx, (_id, _question) in enumerate(qa_dataset.queries.items()):
    _expected = qa_dataset.relevant_docs[_id]
    top3_eval_result = top3_retriever_evaluator.evaluate(_question, _expected)
    top5_eval_result = top5_retriever_evaluator.evaluate(_question, _expected)
    display(Markdown(
        "<font color='blue'>sample {} Top 3 result:\n{}</font>".format(
            _idx+1, top3_eval_result
        )
    ))
    display(Markdown(
        "<font color='green'>sample {} Top 5 result:\n{}</font>".format(
            _idx+1, top5_eval_result
        )
    ))
    display(Markdown(
        " ---- "
    ))

<font color='blue'>sample 1 Top 3 result:
Query: What benefits do companies gain by moving to AWS?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.3333333333333333, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.46927872602275644}
</font>

<font color='green'>sample 1 Top 5 result:
Query: What benefits do companies gain by moving to AWS?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.2, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.3391602052736161}
</font>

 ---- 

<font color='blue'>sample 2 Top 3 result:
Query: How much faster can models be trained in Amazon SageMaker compared to other platforms?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.3333333333333333, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.46927872602275644}
</font>

<font color='green'>sample 2 Top 5 result:
Query: How much faster can models be trained in Amazon SageMaker compared to other platforms?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.2, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.3391602052736161}
</font>

 ---- 

<font color='blue'>sample 3 Top 3 result:
Query: What metrics does the company use to measure its market leadership?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.3333333333333333, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.46927872602275644}
</font>

<font color='green'>sample 3 Top 5 result:
Query: What metrics does the company use to measure its market leadership?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.2, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.3391602052736161}
</font>

 ---- 

<font color='blue'>sample 4 Top 3 result:
Query: What is unique about the GenAI revolution compared to the mass modernization of on-premises infrastructure to the cloud?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.3333333333333333, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.46927872602275644}
</font>

<font color='green'>sample 4 Top 5 result:
Query: What is unique about the GenAI revolution compared to the mass modernization of on-premises infrastructure to the cloud?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.2, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.3391602052736161}
</font>

 ---- 

<font color='blue'>sample 5 Top 3 result:
Query: In what year did the company expand its international consumer segment?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.3333333333333333, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.46927872602275644}
</font>

<font color='green'>sample 5 Top 5 result:
Query: In what year did the company expand its international consumer segment?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.2, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.3391602052736161}
</font>

 ---- 

We can also run the evaluation against the entire dataset instead of one query at a time.

In [10]:
eval_results = await top3_retriever_evaluator.aevaluate_dataset(qa_dataset)

In [11]:
def get_results(
    eval_name: str,
    eval_results: list,
    metrics: list = retrieved_metric
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)
    full_df = pd.DataFrame(metric_dicts)
    columns = {
        'retrievers_name': [eval_name],
        **{k: [full_df[k].mean()] for k in metrics},
    }
    summary_df = pd.DataFrame(columns)
    return full_df, summary_df

In [12]:
full_df, summary_df = get_results("top-k-3 retrieval", eval_results)
summary_df

|   | retrievers_name | hit_rate | mrr | precision | recall | ap | ndcg |
|---|---|---|---|---|---|---|---|
| 0 | top-k-3 retrieval | 1.0 | 1.0 | 0.333333 | 1.0 | 1.0 | 0.469279 |
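An `ndcg` below 1.0 may look surprising when the only relevant node is ranked first. The reported values are numerically consistent with a normalization in which the ideal DCG is summed over all `k` retrieved positions rather than over the number of relevant nodes. The sketch below reproduces that arithmetic; it is an inference from the numbers reported in this notebook, not a statement about the library's source code, and the behavior may differ between versions. The function name is hypothetical.

```python
import math
from typing import List


def ndcg_with_ideal_over_k(hit_positions: List[int], k: int) -> float:
    """NDCG with binary relevance, where the ideal DCG assumes k relevant items.

    hit_positions are the 1-based ranks at which relevant nodes were retrieved.
    """
    dcg = sum(1.0 / math.log2(pos + 1) for pos in hit_positions)
    # Assumption: the ideal ranking fills all k positions with relevant items
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, k + 1))
    return dcg / idcg


# A single relevant node retrieved at rank 1, for top_k = 3 and top_k = 5
print(ndcg_with_ideal_over_k([1], k=3))
print(ndcg_with_ideal_over_k([1], k=5))
```

Under this normalization, a question with one relevant node cannot reach an NDCG of 1.0 for `k > 1`, so the values are best compared between retrievers that use the same `top_k`.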


### Response Evaluation
---

For **response evaluation**, we mostly focus on whether the generated response matches the retrieved context, and whether it also answers the question.

We will focus on two metrics in this example:
- **Faithfulness**: measures whether the response is supported by the retrieved contexts, or whether the LLM hallucinated content that they do not contain.
- **Answer relevancy**: measures whether the retrieved context and the response are actually relevant to, and consistent with, the given question.


Please refer to the [usage pattern documentation](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/usage_pattern/) for more information.

For both metrics, you can evaluate the overall response against all contexts, or validate the response against each source node. Below is an example of evaluating the whole response and its contexts.

```{py3}
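from llama_index.core.evaluation import FaithfulnessEvaluator

# Judge faithfulness with the evaluator LLM
ff_evaluator = FaithfulnessEvaluator(llm=llama3_1_70b_bedrock_llm)
# Answer the question with the candidate LLM
query_engine = vector_index.as_query_engine(llm=llama3_70b_bedrock_llm)
response = query_engine.query(sample_question)
# Evaluate the response object as a whole, including its source nodes
eval_result = ff_evaluator.evaluate_response(response=response)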
```

In [13]:
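def get_rag_resp_n_source(
    query_engine: llama_index.core.query_engine,
    input_prompt: str
) -> Tuple[str, list]:
    """Query the RAG engine and return the response text and its source nodes."""
    _rag_resp = query_engine.query(input_prompt)
    llm_resp = _rag_resp.response
    # source_nodes are node objects (text plus similarity score), not plain strings
    return llm_resp, _rag_resp.source_nodes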

In [14]:
queries = qa_dataset.queries.values()
sample_question = list(queries)[0]

Here we run the **faithfulness** and **answer relevancy** evaluations against each context node.

In [15]:
from llama_index.core.evaluation import (
    RelevancyEvaluator,
    FaithfulnessEvaluator,
)

faithfulness_eval = FaithfulnessEvaluator(llm=llama3_1_70b_bedrock_llm)
relevancy_eval = RelevancyEvaluator(llm=llama3_1_70b_bedrock_llm)
resp_str, source_nodes = get_rag_resp_n_source(query_engine, sample_question)

display(Markdown('Question: {}'.format(sample_question)))
display(Markdown('LLM response: {}'.format(resp_str)))
for _idx, source_node in enumerate(source_nodes):
    ff_eval_result = faithfulness_eval.evaluate(
        query=sample_question,
        response=resp_str,
        contexts=[source_node.get_content()]
    )
    rel_eval_result = relevancy_eval.evaluate(
        query=sample_question,
        response=resp_str,
        contexts=[source_node.get_content()]
    )
    display(Markdown(
        "<font color='brown'>Source {_idx} chunk: {context}".format(
            _idx=_idx+1, context=source_node.text
        )
    ))
    display(Markdown(
        "<font color='brown'>Source {_idx} score: {score}".format(
            _idx=_idx+1, score=source_node.score
        )
    ))
    display(Markdown(
        "<font color='blue'>Faithfulness passing {}, score {}".format(
            ff_eval_result.passing,
            ff_eval_result.score
        )
    ))
    display(Markdown(
        "<font color='green'>Relevancy passing {}, score {}".format(
            rel_eval_result.passing,
            rel_eval_result.score
        )
    ))
    display(Markdown(' ---- '))

Question: What benefits do companies gain by moving to AWS?

LLM response: You are a highly advanced Q&A system that strictly operates in two modes when refining existing answers:
1. **Rewrite** an original answer using the new context.
2. **Repeat** the original answer if the new context isn't useful.
Never reference the original answer or context directly in your answer.
When in doubt, just repeat the original answer.
New Context: page: 0
source: _raw_data/AMZN-2021-Shareholder-Letter.pdf

<font color='brown'>Source 1 chunk: in AWS. Our new customer pipeline is robust, as are our active migrations. Many companies use
discontinuous periods like this to step back and determine what they strategically want to change, and we
find an increasing number of enterprises opting out of managing their own infrastructure, and preferring to
move to AWS to enjoy the agility, innovation, cost-efficiency, and security benefits. And most importantly

<font color='brown'>Source 1 score: 0.6370195051812849

<font color='blue'>Faithfulness passing False, score 0.0

<font color='green'>Relevancy passing True, score 1.0

 ---- 

<font color='brown'>Source 2 chunk: times, it’s neither what customers want nor best for customers in the long term, so we’re taking a different
tack. One of the many advantages of AWS and cloud computing is that when your business grows, you can
seamlessly scale up; and conversely, if your business contracts, you can choose to give us back that capacity
and cease paying for it. This elasticity is unique to the cloud, and doesn’t exist when you’ve already made

<font color='brown'>Source 2 score: 0.5151359279955711

<font color='blue'>Faithfulness passing False, score 0.0

<font color='green'>Relevancy passing True, score 1.0

 ---- 

<font color='brown'>Source 3 chunk: Paramount), and even critical government agencies switched to AWS (e.g. CIA, along with several other
U.S. Intelligence agencies). But, one of the lesser-recognized beneficiaries was Amazon’s own consumer
businesses, which innovated at dramatic speed across retail, advertising, devices (e.g. Alexa and FireTV),
Prime Video and Music, Amazon Go, Drones, and many other endeavors by leveraging the speed with which
AWS let them build.Primitives, done well, rapidly accelerate builders’ ability to innovate.

<font color='brown'>Source 3 score: 0.4735027421261401

<font color='blue'>Faithfulness passing False, score 0.0

<font color='green'>Relevancy passing False, score 0.0

 ---- 

<font color='brown'>Source 4 chunk: part to our helping companies optimize their AWS footprint to save money. Concurrently, companies were
stepping back and determining what they wanted to change coming out of the pandemic. Many concluded
that they didn’t want to continue managing their technology infrastructure themselves, and made the
decision to accelerate their move to the cloud. This shift by so many companies (along with the economy
recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.

<font color='brown'>Source 4 score: 0.4462324726660458

<font color='blue'>Faithfulness passing False, score 0.0

<font color='green'>Relevancy passing True, score 1.0

 ---- 

<font color='brown'>Source 5 chunk: overnight, from working with colleagues and technology on-premises to working remotely. AWS played a
major role in enabling this business continuity. Whether companies saw extraordinary demand spikes, or
demand diminish quickly with reduced external consumption, the cloud’s elasticity to scale capacity up and
down quickly, as well as AWS’s unusually broad functionality helped millions of companies adjust to these
difficult circumstances.

<font color='brown'>Source 5 score: 0.4051489517706361

<font color='blue'>Faithfulness passing False, score 0.0

<font color='green'>Relevancy passing False, score 0.0

 ---- 

Note that the LLM response shown above echoes the refine prompt instead of answering the question, which likely explains why faithfulness fails for every source node.

## End-to-End Evaluation
---

Normally, **end-to-end evaluation** will be your guiding signal for the RAG application, i.e., whether the application generates the right responses given the data sources and a set of queries. In general, we use this evaluation to gain an intuition for which components to dive deeper into.

Several evaluation metrics are available:
- **Correctness**: evaluates the relevance and correctness of a generated answer against a reference answer.
- **Faithfulness**: evaluates whether the response matches the source nodes (please refer to the **response evaluation** section for a code example).
- **Guideline**: evaluates whether the response of the RAG or generative AI application follows the given guidelines.
- **Pairwise**: evaluates which of two responses, from different LLMs or even different query engines, the evaluator LLM prefers.
- **Relevancy**: evaluates whether the response and the contexts match the query.
- **Semantic Similarity**: evaluates the quality of a question-answering system by computing the similarity between the embedding of the generated answer and that of the reference answer. This does not guarantee the correctness of the response; it mainly captures <u>relevancy</u>.

<div class="alert alert-block alert-info">
    We will pick one example question from the evaluation dataset generated during the prerequisite step for illustration purposes.
</div>

In [16]:
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    GuidelineEvaluator,
    PairwiseComparisonEvaluator,
    RelevancyEvaluator,
    SemanticSimilarityEvaluator,
)

In [17]:
import pandas as pd
import ast

eval_df = pd.read_csv('./../_eval_data/eval_dataframe.csv')
eval_df['source_chunk'] = eval_df.source_chunk.apply(lambda s: list(ast.literal_eval(s)))
sample_question = eval_df.question[1]
expected_output = eval_df.ref_answer[1]
reference_context = eval_df.source_chunk[1]
llm_resp, source_nodes = get_rag_resp_n_source(query_engine, sample_question)

In [18]:
display(Markdown('''
**Question**: {question}

---

**Golden response**: {expected_out}

---

**LLM response**: {llm_resp}
'''.format(
    question=sample_question,
    expected_out=expected_output,
    llm_resp=llm_resp
)))


**Question**: What are the company's priorities in terms of spending and culture in a business incurring net losses?

---

**Golden response**: The company's priorities in terms of spending and culture in a business incurring net losses are to spend wisely and maintain a lean culture, continually reinforcing a cost-conscious culture.

---

**LLM response**: The company's priorities are to work hard to spend wisely and maintain a lean culture, understanding the importance of continually reinforcing a cost-conscious culture, particularly in a business incurring net losses.


### Correctness evaluation

In [19]:
correctness_eval = CorrectnessEvaluator(llm=llama3_1_70b_bedrock_llm)
result = correctness_eval.evaluate(
    query=sample_question,
    response=llm_resp,
    reference=expected_output,
    contexts=[doc.text for doc in source_nodes],
)
print(result.feedback)

The generated answer is very similar to the reference answer, conveying the same priorities of spending wisely and maintaining a lean culture, especially in a business incurring net losses. The slight rewording does not affect the overall correctness of the answer, but it is not as concise as the reference answer.


### Guideline evaluation
---

For this evaluation, you need to provide guidelines to the `GuidelineEvaluator` class. Your guidelines will depend on the objective of your application. This evaluation can help you improve the RAG application, especially its prompt engineering component.

In [20]:
GUIDELINES = [
    'The response should fully answer the query.',
    'The response should avoid being vague or ambiguous.',
]

guideline_evaluators = [
    GuidelineEvaluator(llm=llama3_1_70b_bedrock_llm, guidelines=guideline)
    for guideline in GUIDELINES
]
for guideline, evaluator in zip(GUIDELINES, guideline_evaluators):
    eval_result = evaluator.evaluate(
        query=sample_question,
        response=llm_resp,
        reference=expected_output,
        contexts=[doc.text for doc in source_nodes],
    )
    print("=====")
    print('Guideline: {}'.format(guideline))
    print('Pass: {}'.format(eval_result.passing))
    print('Feedback: {}'.format(eval_result.feedback))

=====
Guideline: The response should fully answer the query.
Pass: False
Feedback: The response does not fully answer the query as it does not provide specific details about the company's priorities in terms of spending and culture. It only provides general statements about being cost-conscious and maintaining a lean culture. To fully answer the query, the response should provide more concrete information about the company's priorities, such as specific areas where they are cutting costs or investing in order to achieve their goals.
=====
Guideline: The response should avoid being vague or ambiguous.
Pass: False
Feedback: The response is too vague and does not provide specific details about the company's priorities in terms of spending and culture. It would be more helpful to provide concrete examples or metrics that illustrate the company's approach to cost management and cultural values during a period of net losses.


### Pairwise evaluation
---

You can use `PairwiseComparisonEvaluator` when you are unsure which candidate LLM or query engine configuration to use. Use it to iterate and pick the best parameter choices.

For comparison with **Llama 3 70B**, let's define an additional candidate LLM, **Llama 3.1 405B**.

In [21]:
llama3_1_405b_model_id = 'meta.llama3-1-405b-instruct-v1:0'

llama3_1_405b_bedrock_llm = Bedrock(
    model=llama3_1_405b_model_id,
    temperature=.1,
    max_tokens=2048,
)

In [22]:
_, query_engine_v2 = get_query_engine(
    vector_db=vector_store,
    storage_context=storage_context,
    embedding_model=titan_embedding,
    llm=llama3_1_405b_bedrock_llm,
    top_k=5,
)

In [23]:
resp1 = query_engine.query(sample_question)
resp2 = query_engine_v2.query(sample_question)

In [24]:
pairwise_eval = PairwiseComparisonEvaluator(llm=llama3_1_70b_bedrock_llm)
eval_result = await pairwise_eval.aevaluate(
    query=sample_question,
    response=resp1,
    second_response=resp2,
    reference=expected_output,
)

In [25]:
display(Markdown(
    "<b>Question</b>: {}".format(sample_question)
))
display(Markdown(
    "<b><font color='brown'>Expected output</b>: {}</font>".format(
        expected_output
    )
))
display(Markdown(
    "<b><font color='blue'>Llama 3 70B response</b>: {}</font>".format(resp1)
))
display(Markdown(
    "<b><font color='darkblue'>Llama 3.1 405B response</b>: {}</font>".format(resp2)
))
display(Markdown(
    "<b><font color='green'>Evaluation result</b>: {}</font>".format(
        eval_result.feedback
    )
))
display(Markdown(' ---- '))

<b>Question</b>: What are the company's priorities in terms of spending and culture in a business incurring net losses?

<b><font color='brown'>Expected output</b>: The company's priorities in terms of spending and culture in a business incurring net losses are to spend wisely and maintain a lean culture, continually reinforcing a cost-conscious culture.</font>

<b><font color='blue'>Llama 3 70B response</b>: We will work hard to spend wisely and maintain our lean culture.</font>

<b><font color='darkblue'>Llama 3.1 405B response</b>: The company prioritizes maintaining a lean culture and continually reinforcing a cost-conscious culture, particularly in a business incurring net losses, by working hard to spend wisely.</font>

<b><font color='green'>Evaluation result</b>: After evaluating the responses from Assistant A and Assistant B, I found that both assistants provided relevant answers that align with the user's question and the provided reference. However, Assistant B's response is more comprehensive and accurately reflects the company's priorities in terms of spending and culture as stated in the reference.

Assistant B's response explicitly mentions the importance of maintaining a lean culture and continually reinforcing a cost-conscious culture, which is a direct quote from the reference. Additionally, Assistant B's response provides more context by specifying that this priority is particularly relevant in a business incurring net losses.

Assistant A's response, while concise and relevant, lacks the depth and detail provided by Assistant B. It only mentions spending wisely and maintaining a lean culture without providing additional context or emphasizing the importance of a cost-conscious culture.

Therefore, based on the factors of helpfulness, relevance, accuracy, depth, and level of detail, I conclude that Assistant B's response is better.

Final Verdict: [[B]]</font>

 ---- 
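When only the judge's feedback text has been logged, the verdict can be recovered from the `[[B]]`-style marker that closes the feedback above. The following is a minimal sketch; the helper name `parse_verdict` is hypothetical, and accepting `[[C]]` as a third marker (for example, for a tie) is an assumption rather than something shown in this notebook.

```python
import re
from typing import Optional

# Matches verdict markers such as [[A]] or [[B]]; [[C]] is an assumed third option
_VERDICT_RE = re.compile(r"\[\[([ABC])\]\]")


def parse_verdict(feedback: str) -> Optional[str]:
    """Return the last verdict letter found in the feedback, or None if absent."""
    matches = _VERDICT_RE.findall(feedback)
    # Use the last marker so that earlier mentions in the reasoning are ignored
    return matches[-1] if matches else None


sample_feedback = (
    "Therefore, I conclude that Assistant B's response is better.\n\n"
    "Final Verdict: [[B]]"
)
print(parse_verdict(sample_feedback))
```

In this notebook, Assistant A corresponds to `resp1` (Llama 3 70B) and Assistant B to `resp2` (Llama 3.1 405B), following the order in which the responses were passed to the evaluator.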

### Relevancy evaluation

In [26]:
rel_eval = RelevancyEvaluator(llm=llama3_1_70b_bedrock_llm)
rel_eval_result = rel_eval.evaluate(
    query=sample_question,
    response=llm_resp,
    reference=expected_output,
    contexts=[doc.text for doc in source_nodes]
)
print(rel_eval_result.passing)
print(rel_eval_result.score)

True
1.0


### Semantic evaluation
---

By default, the evaluator uses cosine similarity. You can change `similarity_mode` to `DOT_PRODUCT` or `EUCLIDEAN`.

In [27]:
from llama_index.core.base.embeddings.base import SimilarityMode

evaluator = SemanticSimilarityEvaluator(
    embed_model=titan_embedding,
    similarity_mode=SimilarityMode.DEFAULT,
    similarity_threshold=0.6,
)

result = await evaluator.aevaluate(
    response=llm_resp,
    reference=expected_output,
)
print(result.feedback)

Similarity score: 0.9111703519420223
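To illustrate what a cosine similarity score and a threshold mean, here is a small self-contained sketch. It uses toy two-dimensional vectors instead of real Titan embeddings, and the function names are hypothetical; it does not reproduce the evaluator's internals.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors, chosen for this sketch
    return dot / (norm_a * norm_b)


def is_similar(
    a: Sequence[float], b: Sequence[float], threshold: float = 0.6
) -> bool:
    """Pass when the similarity reaches the threshold (0.6 as in the cell above)."""
    return cosine_similarity(a, b) >= threshold


# Toy vectors standing in for the response and reference embeddings
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
print(is_similar([1.0, 1.0], [1.0, 0.0]))
print(is_similar([0.0, 1.0], [1.0, 0.0]))
```

A high score only says that the two texts are close in embedding space, so it is best read together with the correctness evaluation rather than as a substitute for it.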


## Summary
---

In this notebook, we implemented RAG evaluation using the `LlamaIndex` evaluation framework. `LlamaIndex` is an open-source library that helps AI developers build generative AI applications. It also offers a comprehensive set of evaluation tools, from **end-to-end** to **component-wise** evaluation.

By integrating LlamaIndex evaluation into your existing RAG workflow, you can assess the workflow output as a whole, use it to find better LLM or parameter choices, and drill down to pinpoint the components or areas that need improvement. Overall, the **LlamaIndex evaluation** framework streamlines the process of assessing your application's performance, allowing you to iterate and improve your index and query strategies more efficiently.

However, this evaluation relies heavily on large language models (LLMs) for assessment. This can introduce potential drawbacks, including bias, increased cost, and inconsistent results.