# Prerequisite Step: Set up vector database and evaluation dataset
---

This notebook is a prerequisite for all the RAG evaluation tasks in this repository. It covers the following steps:
- Download the sample dataset. We will use Amazon shareholder letters as our data source.
- Set up a **Chroma** database as our vector database.
- Generate a **synthetic dataset** for a QnA-RAG application using Meta Llama foundation models via the Bedrock API, Python, and LangChain.


You will need access to an **Amazon Bedrock** foundation model for embedding the documents. Please refer to the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html) for more details.

<div class="alert alert-block alert-info">
    <b>Note</b>: We will be using the <b>Amazon Titan Text Embeddings V2 model</b> (<i>amazon.titan-embed-text-v2:0</i>). See its capabilities <a href='https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html'>here</a>.
</div>

## Set up
---

Install the libraries this notebook depends on.

In [1]:
%pip install -qU -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Set up Vector database

### Download the dataset
---

We will be using the Amazon shareholder letters from 2021 to 2023 as our data sources.

In [2]:
import os
from urllib.request import urlretrieve
from typing import List, Optional, Dict, Tuple
import pandas as pd
import re
import time
import random

url_file_map = {
    'https://s2.q4cdn.com/299287126/files/doc_financials/2024/ar/Amazon-com-Inc-2023-Shareholder-Letter.pdf': 'AMZN-2023-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf': 'AMZN-2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf': 'AMZN-2021-Shareholder-Letter.pdf',
}
data_dir = './_raw_data/'
os.makedirs(data_dir, exist_ok=True)
for _url, _filename in url_file_map.items():
    urlretrieve(_url, os.path.join(data_dir, _filename))
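
The download loop above fetches every file each time the cell runs. As a minimal sketch, assuming you would rather skip files that are already present, the helper below lists only the downloads that are still missing. The function name `pending_downloads` and the placeholder URLs are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def pending_downloads(url_file_map, data_dir):
    """Return (url, local_path) pairs whose target file is missing or empty."""
    pending = []
    for url, filename in url_file_map.items():
        local_path = os.path.join(data_dir, filename)
        if not os.path.isfile(local_path) or os.path.getsize(local_path) == 0:
            pending.append((url, local_path))
    return pending


# Demo with a temporary directory and placeholder URLs (no network access needed).
demo_map = {
    'https://example.com/a.pdf': 'a.pdf',
    'https://example.com/b.pdf': 'b.pdf',
}
with tempfile.TemporaryDirectory() as tmp_dir:
    # Pretend a.pdf was downloaded in an earlier run.
    with open(os.path.join(tmp_dir, 'a.pdf'), 'wb') as f:
        f.write(b'%PDF-')
    todo = pending_downloads(demo_map, tmp_dir)
    todo_names = [os.path.basename(path) for _, path in todo]

print(todo_names)
```

Each returned pair could then be passed to `urlretrieve(url, local_path)`.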

### Chunking strategy
---

**Chunking data** before loading it into a vector database is usually necessary because each chunk is converted into a single embedding vector, and embedding models accept only a limited number of input tokens. Vector databases are optimized for **efficient similarity search and retrieval operations on high-dimensional vector data**, so retrieval works best when each vector represents a focused piece of text rather than an entire document.

By chunking or splitting large datasets into smaller, manageable chunks, it becomes easier to load and index the data within the vector database's constraints. Chunking also facilitates parallel processing, allowing multiple chunks to be loaded concurrently, improving overall performance and scalability. Additionally, it provides a way to manage and update the data incrementally, as new chunks can be added or existing ones can be modified without requiring a complete reload of the entire dataset.

There are multiple chunking strategies; for simplicity, we will use [`RecursiveCharacterTextSplitter()` as our chunking strategy](https://python.langchain.com/docs/how_to/recursive_text_splitter/).


There are a few parameters we can configure for our `RecursiveCharacterTextSplitter`:

- `chunk_size`: The maximum size of a chunk, where size is determined by the `length_function`.
- `chunk_overlap`: Target overlap between chunks. Overlapping chunks help to mitigate loss of information when context is divided between chunks.
- `length_function`: Function determining the chunk size.
- `is_separator_regex`: Whether the separator list (defaulting to `["\n\n", "\n", " ", ""]`) should be interpreted as regex.

<div class="alert alert-block alert-warning">
    <b>Note</b>: Take <b>chunk_size</b> into account, because each embedding model limits the number of input tokens it accepts.
</div>
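
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-width sliding window in plain Python. This is only an illustration: it is not LangChain's algorithm, since `RecursiveCharacterTextSplitter` first tries to split on the separators before falling back to smaller pieces. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be in [0, chunk_size)')
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = 'abcdefghijklmnopqrstuvwxyz'
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is how overlap keeps context that would otherwise be cut at a chunk boundary.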

In [3]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


loader = PyPDFDirectoryLoader(data_dir)
pages = loader.load_and_split()
print('Total document pages: {}'.format(len(pages)))
print('Sample data load: {}'.format(pages[0].page_content[:100]))

rec_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=102
)
rec_docs_splitted = rec_splitter.split_documents(pages)
print(' ----- ')
print('Total chunks: {}'.format(len(rec_docs_splitted)))
print('Sample chunk:\n{}'.format(rec_docs_splitted[3].page_content))

Total document pages: 48
Sample data load: Dear Shareholders:
Last year at this time, I shared my enthusiasm and optimism for Amazon’s future. 
 ----- 
Total chunks: 322
Sample chunk:
available, tens of millions added last year alone, and several premium brands starting to list on Amazon
(e.g. Coach, Victoria’s Secret, Pit Viper, Martha Stewart, Clinique, Lancôme, and Urban Decay).
Being sharp on price is always important, but particularly in an uncertain economy, where customers are
careful about how much they’re spending. As a result, in Q4 2023, we kicked off the holiday season with Prime


### Prepare Vector Database
---

Once we have our document chunks, the next step is to prepare the vector database. In this example, we will use **Chroma**. [**ChromaDB**](https://www.trychroma.com/) is an open-source vector database designed for building applications that require efficient semantic search and retrieval capabilities.

But before creating ChromaDB, we will need to initialize the embedding model function. We can use [`BedrockEmbeddings` class](https://api.python.langchain.com/en/latest/embeddings/langchain_aws.embeddings.bedrock.BedrockEmbeddings.html) from **langchain_aws** to do this.


In [4]:
import langchain_aws
from langchain_aws import BedrockEmbeddings
import boto3

boto_session = boto3.session.Session()
titan_model_id = 'amazon.titan-embed-text-v2:0'
titan_embedding_fn = BedrockEmbeddings(
    model_id=titan_model_id,
    region_name=boto_session.region_name
)
titan_embedding_fn.embed_query('Hello')[: 2]

[-0.0635838583111763, 0.05780351161956787]
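
The two numbers printed above are the first components of an embedding vector. Semantic search compares such vectors, for example with cosine similarity. The sketch below uses tiny hand-made 2-dimensional vectors as stand-ins for real embeddings; the helper names are hypothetical, and the metric that a given Chroma collection actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (name, score) pairs with the highest cosine similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Toy 2-D "embeddings" for illustration only.
toy_docs = {'a': [1.0, 0.0], 'b': [0.0, 1.0], 'c': [1.0, 1.0]}
results = top_k([1.0, 0.0], toy_docs, k=2)
print(results)
```

Here document `a` points in the same direction as the query and ranks first, followed by `c`, which is partially aligned with it.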

Now, with the embedding function ready, we can initialize `Chroma`.

In [5]:
from langchain_chroma import Chroma

chroma_db_dir = './vector_db'
chroma_collection_name = 'amazon-shareholder-letters'

# Init from Chroma client
vector_store = Chroma(
    collection_name=chroma_collection_name,
    embedding_function=titan_embedding_fn,
    persist_directory=chroma_db_dir,
)

Load the documents into ChromaDB.

In [6]:
if len(vector_store.get().get('ids')) == 0:
    vector_store = Chroma.from_documents(
        collection_name=chroma_collection_name,
        documents=rec_docs_splitted,
        persist_directory=chroma_db_dir,
        embedding=titan_embedding_fn,
    )

### Test query our vector store
---
Now that the vector database contains our data, let's test it out.

In [7]:
sample_question = '''
Amazon discusses its investments and progress in various areas, such as Generative AI, logistics, and healthcare. 
How do these initiatives relate to the company's strategy of building "primitives" or foundational building blocks, 
and what potential customer experiences or business opportunities do they enable?'''

search_result = vector_store.similarity_search_with_relevance_scores(
    query=sample_question.strip(),
    k=3,
)

In [8]:
search_result

[(Document(metadata={'page': 5, 'source': '_raw_data/AMZN-2022-Shareholder-Letter.pdf'}, page_content='the investment hypothesis to go after it.\nOne final investment area that I’ll mention, that’s core to setting Amazon up to invent in every area of our\nbusiness for many decades to come, and where we’re investing heavily isLarge Language Models (“LLMs”)\nand Generative AI. Machine learning has been a technology with high promise for several decades, but it’s\nonly been the last five to ten years that it’s started to be used more pervasively by companies. This shift was'),
  0.3977445048117627),
 (Document(metadata={'page': 5, 'source': '_raw_data/AMZN-2023-Shareholder-Letter.pdf'}, page_content='optimize the movement of our growing robotic fleet, and better manage the bottlenecks in our facilities.\nSometimes, people ask us “what’s your next pillar? Y ou have Marketplace, Prime, and AWS, what’s next?”\nThis, of course, is a thought-provoking question. However, a question people never

We have now established ChromaDB as our vector database, which will be used in the subsequent notebooks for RAG evaluation.

## Synthetic Evaluation dataset
---

In this section, we will generate the **evaluation dataset** for the QnA task. The output of this step will be reused in other RAG evaluation notebooks within this repository.

**Synthetic dataset generation** provides a practical solution for generating datasets that mimic real human interactions, enabling efficient and scalable evaluation of RAG systems. By leveraging large language models and knowledge retrieval context, the proposed approach ensures that the synthetic datasets are diverse, realistic, and representative of real-world scenarios. 

This solution is relevant for developers and researchers working on RAG systems, as it streamlines the evaluation process and accelerates the iterative development cycle, ultimately leading to better-performing AI systems.

### Set up langchain
---

You can experiment with various Llama models. In this section, we use **Llama 3.1 70B** as the question-generator LLM and **Llama 3.1 405B** to evaluate the quality of the questions.

<div class="alert alert-block alert-info">
    <b>Note</b>: Feel free to adjust the foundation model here!
</div>

In [9]:
import langchain_aws
from langchain_aws import ChatBedrock


llama3_1_70b_model_id = 'meta.llama3-1-70b-instruct-v1:0'
llama3_1_405b_model_id = 'meta.llama3-1-405b-instruct-v1:0'

llama3_1_70b_langchain = ChatBedrock(
    model_id=llama3_1_70b_model_id,
    region_name=boto_session.region_name,
    model_kwargs={
        'temperature': 0.1,
        'max_gen_len': 4096,
    }
)
llama3_1_405b_langchain = ChatBedrock(
    model_id=llama3_1_405b_model_id,
    region_name=boto_session.region_name,
    model_kwargs={
        'temperature': 0.1,
        'max_gen_len': 4096,
    }
)

### Initial question generation
---

As a first step, we use each of the generated chunks to produce synthetic questions that a real chatbot user might ask. We prompt the LLM to analyze a chunk of a shareholder letter and generate a relevant question.

In [10]:
import langchain_core

def llm_invoke(
    llm: ChatBedrock,
    _my_prompt_template: str,
    max_retries: int = 3,
    wait_in_seconds: int = 90
) -> Optional[langchain_core.messages.ai.AIMessage]:
    '''Invoke a langchain_aws chat model, retrying with a fixed wait between attempts.

    Returns None if every attempt fails.
    '''
    attempt = 0
    generated_content = None
    assert max_retries > 0, 'Max retries needs to be more than 0'
    while attempt < max_retries:
        try:
            generated_content = llm.invoke(_my_prompt_template)
            break
        except Exception as e:
            print(e)
            attempt += 1
            # Only wait if another attempt will follow
            if attempt < max_retries:
                time.sleep(wait_in_seconds)

    if attempt >= max_retries and generated_content is None:
        print('-- Exceeded max retries, no output was generated!')
    return generated_content
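
The helper above waits a fixed interval between attempts. A common alternative for throttled APIs is exponential backoff with jitter, where the wait grows after each failure and is randomized so that concurrent clients do not retry in lockstep. The sketch below is a generic illustration, not part of the original notebook: `retry_with_backoff` and its parameters are hypothetical names, and it catches every `Exception` for brevity, whereas real code would usually catch only the throttling error.

```python
import random
import time


def retry_with_backoff(fn, max_retries=3, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn() until it succeeds, sleeping a random time up to an exponentially growing cap."""
    if max_retries <= 0:
        raise ValueError('max_retries must be positive')
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # "full jitter": anywhere between 0 and the cap


# Demo: a function that fails twice before succeeding.
calls = {'count': 0}


def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('simulated throttling')
    return 'ok'


recorded_delays = []  # capture the delays instead of really sleeping
result = retry_with_backoff(flaky, max_retries=5, base_delay=1.0, sleep=recorded_delays.append)
print(result, calls['count'], len(recorded_delays))
```

Passing `sleep` as a parameter keeps the demo instant; in a notebook you would leave it at its default of `time.sleep`.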

In [11]:
from langchain.prompts import PromptTemplate

initial_question_prompt_template = PromptTemplate(
    input_variables=['context'],
    template='''
    <Instructions>
    Here is the context:
    <context>
    {context}
    </context>

    Your task is to generate 1 question that can be answered using the provided context, following these rules:

    <rules>
    1. The question should make sense to humans even when read without the given context.
    2. The question should be fully answered from the given context.
    3. The question should be framed from a part of context that contains important information. It can also be from tables, code, etc.
    4. The answer to the question should not contain any links.
    5. The question should be of moderate difficulty up to difficult.
    6. The question must be reasonable and must be understood and responded by humans.
    7. Do not use phrases like 'provided context', etc. in the question.
    8. You can frame the questions using the word "and", "or" that can be decomposed into more than one question.
    9. Your question should be able to be referenced in full sentence from the context.
    10. Never create question that will refer back to the context.
    </rules>

    To generate the question, first identify the most important or relevant part of the context. 
    Then frame a question around that part that satisfies all the rules above.
    Think step-by-step and follow the <rules>.

    Output only the generated question with a "?" at the end, no other text or characters.
    </Instructions>
    ''')




In [12]:
question = llm_invoke(
    llm=llama3_1_70b_langchain,
    _my_prompt_template=initial_question_prompt_template.format(context=rec_docs_splitted[2].page_content)
)
print(question.content)

What is the broadest retail selection offered by the company's Stores business?


### Answer Generation
---
Next, we generate a reference answer for each generated question.

In [13]:
answer_prompt_template = PromptTemplate(
    input_variables=['context', 'question'],
    template='''
    <Instructions>
    <role>You are an experienced QA Engineer for building large language model applications.</role>
    <task>It is your task to generate an answer to the following question <question>{question}</question> only based on the <context>{context}</context></task>
    The output should be only the answer generated from the context.

    <rules>
    1. Only use the given context as a source for generating the answer.
    2. Be as precise as possible with answering the question.
    3. Be concise in answering the question and only answer the question at hand rather than adding extra information.
    </rules>

    Only output the generated answer as a sentence. No extra characters.
    </Instructions>
    ''')

answer = llm_invoke(
    llm=llama3_1_70b_langchain,
    _my_prompt_template=answer_prompt_template.format(
        context=rec_docs_splitted[2].page_content,
        question=question.content
    )
)
print(answer.content)

The company's Stores business offers the broadest retail selection, with hundreds of millions of products.


### Extracting Relevant Context
---

To make the dataset verifiable, you can use the following prompt to extract the sentences from the given context that are relevant to answering the generated question. Knowing the relevant sentences, you can easily check whether the question and answer are correct.

In [14]:
source_prompt_template = PromptTemplate(
    input_variables=['context', 'question'],
    template='''
    <Instructions>
    Here is the context:
    <context>
    {context}
    </context>

    Your task is to extract the relevant sentences from the given context that can potentially help answer the following question.
    You are not allowed to make any changes to the sentences from the context.

    <question>
    {question}
    </question>

    Output only the relevant sentences you found, one sentence per line, without any extra characters or explanations.
    </Instructions>
    ''')

source_sentence = llm_invoke(
    llm=llama3_1_70b_langchain,
    _my_prompt_template=source_prompt_template.format(
        context=rec_docs_splitted[2].page_content,
        question=question.content
    )
)
print(source_sentence.content)
print(' ---- ')
print(rec_docs_splitted[2].page_content)

In our Stores business, customers have enthusiastically responded to our relentless focus on selection, price, and convenience.
We continue to have the broadest retail selection, with hundreds of millions of products
 ---- 
in 2022 to $35.5B (up $48.3B).
While we’ve made meaningful progress on our financial measures, what we’re most pleased about is the
continued customer experience improvements across our businesses.
In our Stores business, customers have enthusiastically responded to our relentless focus on selection, price,
and convenience. We continue to have the broadest retail selection, with hundreds of millions of products
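
Because the prompt forbids changing the sentences, one simple check is whether each extracted sentence appears verbatim in the source chunk. In the output above, the first sentence spans a line break in the chunk, so the sketch below normalizes whitespace before comparing. The helper names are hypothetical, and the chunk shown is only the last two lines of the printed chunk.

```python
import re


def normalize_ws(text):
    """Collapse every run of whitespace (including line breaks) into a single space."""
    return re.sub(r'\s+', ' ', text).strip()


def find_unverified_sentences(extracted, chunk):
    """Return the extracted lines that do not occur verbatim in the chunk."""
    normalized_chunk = normalize_ws(chunk)
    missing = []
    for line in extracted.splitlines():
        sentence = normalize_ws(line)
        if sentence and sentence not in normalized_chunk:
            missing.append(sentence)
    return missing


chunk_excerpt = (
    'In our Stores business, customers have enthusiastically responded to our '
    'relentless focus on selection, price,\n'
    'and convenience. We continue to have the broadest retail selection, with '
    'hundreds of millions of products'
)
extracted = (
    'In our Stores business, customers have enthusiastically responded to our '
    'relentless focus on selection, price, and convenience.\n'
    'We continue to have the broadest retail selection, with hundreds of millions of products'
)

missing_ok = find_unverified_sentences(extracted, chunk_excerpt)
# A paraphrase (made up for this demo) is flagged because it is not a verbatim quote.
missing_bad = find_unverified_sentences('Amazon has the widest selection.', chunk_excerpt)
print(missing_ok, missing_bad)
```

An empty list means every extracted sentence can be traced back to the chunk; anything returned deserves a manual look.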


### Evolving Questions to fit end-users behaviour
---


When question-and-answer pairs for the whole dataset are generated from the same prompt, the questions can turn out repetitive and similar in form, and therefore fail to mimic real end-user behavior. In this section, you evolve the generated question, for example by making it shorter and more precise. The right prompt depends heavily on your use case, so it must reflect your end users, for instance by setting the rules accordingly or by providing examples.


In [15]:
question_compress_ = PromptTemplate(
    input_variables=['question'],
    template='''
    <Instructions>
    <role>You are an experienced linguistics expert for building testsets for large language model applications.</role>

    <task>It is your task to rewrite the following question in a more indirect and compressed form, following these rules:

    <rules>
    1. Make the question more indirect
    2. Use abbreviations if applicable, but remain professional, and represent the same context.
    </rules>

    <question>
    {question}
    </question>

    Your output should only be the rewritten question with a question mark "?" at the end. 
    Do not provide any other explanation or text.
    </task>
    </Instructions>
    ''')


compressed_question = llm_invoke(
    llm=llama3_1_70b_langchain,
    _my_prompt_template=question_compress_.format(
        question=question.content
    )
)
print(compressed_question.content)

What is the Stores biz's most extensive retail assortment?
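
One rough way to spot repetitive question forms is to count how questions begin. The sketch below tallies the first word of each question. The first two sample questions come from the outputs above; the last two are made-up examples, and the helper name `opening_word_counts` is hypothetical.

```python
from collections import Counter


def opening_word_counts(questions):
    """Count the lower-cased first word of each non-empty question."""
    counts = Counter()
    for question in questions:
        words = question.strip().split()
        if words:
            counts[words[0].lower()] += 1
    return counts


sample_questions = [
    "What is the broadest retail selection offered by the company's Stores business?",
    "What is the Stores biz's most extensive retail assortment?",
    'How did the revenue run rate change?',  # made-up example
    'Which chips were announced?',           # made-up example
]

counts = opening_word_counts(sample_questions)
most_common_word, most_common_count = counts.most_common(1)[0]
share = most_common_count / len(sample_questions)
print(most_common_word, share)
```

A large share for a single opening word suggests adding rules or examples to the prompt to diversify the questions.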


### Putting it all together to automate dataset generation
---

We now put everything together and write the output to a pandas DataFrame.

In [16]:
def generate_qa_from_doc(
    doc: langchain_core.documents.base.Document,
    llm: ChatBedrock,
    verbose: bool = False
) -> list:
    '''Generate a question, compressed question, answer, and source sentences for one chunk.'''
    _reference_chunk = doc.page_content
    _metadata = doc.metadata
    _question = llm_invoke(
        llm=llm,
        _my_prompt_template=initial_question_prompt_template.format(context=_reference_chunk)
    )
    question = _question.content
    _answer = llm_invoke(
        llm=llm,
        _my_prompt_template=answer_prompt_template.format(
            context=_reference_chunk,
            # Pass the question text, not the AIMessage object
            question=question
        )
    )
    answer = _answer.content

    _source_sentence = llm_invoke(
        llm=llm,
        _my_prompt_template=source_prompt_template.format(
            context=_reference_chunk,
            question=question
        )
    )
    source_sentence = _source_sentence.content

    _compressed_question = llm_invoke(
        llm=llm,
        _my_prompt_template=question_compress_.format(
            question=question
        )
    )
    compressed_question = _compressed_question.content
    if verbose:
        print('Here is the generated question from {}'.format(llm.model_id))
        print(' ------ ')
        print('Question: {}'.format(question))
        print('Compressed Question: {}'.format(compressed_question))
        print('Answer: {}'.format(answer))
        print('Source sentence: {}'.format(source_sentence))
        print('Source chunk: {}'.format(_reference_chunk))
        print('Metadata: {}'.format(_metadata))
    return [question, compressed_question, answer, source_sentence, [_reference_chunk], _metadata]

In [17]:
def generate_qna_dataframe(
    docs: List[langchain_core.documents.base.Document], 
    llm: ChatBedrock,
    verbose: bool = False
) -> pd.DataFrame:
    print('Generating {} questions ...'.format(len(docs)))
    qna_df = pd.DataFrame(
        columns=[
            "question", "compressed_question", "ref_answer", "source_sentence",
            'source_chunk', 'source_document'
        ]
    )
    for idx, doc in enumerate(docs):
        _out_row = generate_qa_from_doc(
            doc=doc,
            llm=llm,
            verbose=verbose
        )
        qna_df.loc[len(qna_df)] = _out_row
        if (idx+1) % 10 == 0:
            print(' -- Generating {} QnA pairs...'.format(idx+1))

    print('Generate completed!')
    return qna_df

In [18]:
qna_df = generate_qna_dataframe(
    docs=random.sample(rec_docs_splitted, 20),
    llm=llama3_1_70b_langchain,
)

Generating 20 questions ...
 -- Generating 10 QnA pairs...
 -- Generating 20 QnA pairs...
Generate completed!


In [19]:
print(qna_df.shape)
qna_df.head()

(20, 6)


Unnamed: 0,question,compressed_question,ref_answer,source_sentence,source_chunk,source_document
0,What are the names of the chips that were anno...,Which 2nd-gen chipsets are being utilized by A...,Trainium and Inferentia.,announced second versions of our Trainium and ...,[announced second versions of our Trainium and...,{'source': '_raw_data/AMZN-2023-Shareholder-Le...
1,What are the company's priorities in terms of ...,What are key spend & cultural priorities for a...,The company's priorities in terms of spending ...,We will work hard to spend wisely and maintain...,"[the present value of future cash flows, we’ll...",{'source': '_raw_data/AMZN-2021-Shareholder-Le...
2,What was the annual revenue run rate of AWS fi...,What was AWS' ARR 15 yrs post-investment decis...,AWS had an annual revenue run rate of $85B fif...,"Fifteen years later, AWS is now an $85B annual...","[be investing so much in cloud computing. But,...",{'source': '_raw_data/AMZN-2022-Shareholder-Le...
3,What is the year in which Amazon's original Sh...,When was Amazon's inaugural Shareholder Letter...,The year in which Amazon's original Shareholde...,"P .S. As we have always done, our original 199...","[Amazon, all of which are still in their early...",{'source': '_raw_data/AMZN-2022-Shareholder-Le...
4,What were the changes in Amazon.com's employee...,What shifts occurred in Amazon's workforce & D...,Amazon.com's employee base grew from 158 to 61...,• Amazon.com’s employee base grew from 158 to ...,"[Infrastructure\nDuring 1997, we worked hard t...",{'source': '_raw_data/AMZN-2023-Shareholder-Le...


## Assessing question quality using Critique Agents
---

Critique agents are LLM-based evaluators used to assess the quality and suitability of questions in a dataset for a particular task or application. In this case, the critique agents assess whether the questions in the dataset are valid for a Retrieval-Augmented Generation (RAG) system, which combines information retrieval with text generation.

The two main metrics evaluated by the critique agents are relevance and groundedness.

### Relevance

Relevance measures <u>how useful and applicable a question</u> is for a specific domain or context. In the context of business analysis, the relevance prompt evaluates questions based on the following criteria:

- Is the question directly relevant to the work of financial and business analysts on Wall Street?
- Does the question address a practical problem or use case that analysts might encounter?
- Is the question clear and well-defined, avoiding ambiguity or vagueness?
- Does the question require a substantive answer that demonstrates understanding of financial topics?
- Would answering the question provide insights or knowledge that could be applied to real-world company evaluation tasks?

The relevance score ranges from 1 to 5, with a higher score indicating greater relevance and usefulness for business analysts.


### Groundedness

Groundedness measures <u>how well a question can be answered based on the provided context</u> or information. The groundedness prompt evaluates questions based on the following criteria:

- Can the question be answered using only the information provided in the given context?
- Does the context provide little, some, substantial, or all the information needed to answer the question?

The groundedness score also ranges from 1 to 5, with the following interpretations:
- **1**: The question cannot be answered at all based on the given context.
- **2**: The context provides very little relevant information to answer the question.
- **3**: The context provides some relevant information to partially answer the question.
- **4**: The context provides substantial information to answer most aspects of the question.
- **5**: The context provides all the information needed to fully and unambiguously answer the question.


By evaluating both relevance and groundedness, the critique agents can help identify questions in the dataset that are well-suited for the RAG system, as well as those that may need to be revised, removed, or supplemented with additional context or information.
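
As a minimal sketch of how the two scores could be used together, the snippet below splits records into kept and flagged groups based on minimum ratings. The thresholds (`min_groundedness=4`, `min_relevance=3`), the function name, and the sample records are assumptions for illustration, not values taken from this notebook.

```python
def split_by_quality(records, min_groundedness=4, min_relevance=3):
    """Separate records that meet both thresholds from those needing review."""
    kept, flagged = [], []
    for record in records:
        groundedness = record.get('groundedness_rating')
        relevance = record.get('relevance_rating')
        passes = (
            groundedness is not None
            and relevance is not None
            and groundedness >= min_groundedness
            and relevance >= min_relevance
        )
        (kept if passes else flagged).append(record)
    return kept, flagged


# Hypothetical ratings for illustration only.
sample_records = [
    {'question': 'Q1', 'groundedness_rating': 5, 'relevance_rating': 1},
    {'question': 'Q2', 'groundedness_rating': 5, 'relevance_rating': 5},
    {'question': 'Q3', 'groundedness_rating': None, 'relevance_rating': 4},
    {'question': 'Q4', 'groundedness_rating': 4, 'relevance_rating': 3},
]

kept, flagged = split_by_quality(sample_records)
kept_questions = [r['question'] for r in kept]
flagged_questions = [r['question'] for r in flagged]
print(kept_questions, flagged_questions)
```

Records with a missing rating are flagged rather than silently kept, so they can be re-scored or reviewed by hand.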

In [20]:
groundedness_check_prompt_template = PromptTemplate(
    input_variables=['context', 'question'],
    template='''
    <Instructions>
    You will be given a context and a question related to that context.
    Your task is to provide an evaluation of how well the given question can be answered using only the information provided in the context. Rate this on a scale from 1 to 5, where:

    1 = The question cannot be answered at all based on the given context
    2 = The context provides very little relevant information to answer the question
    3 = The context provides some relevant information to partially answer the question 
    4 = The context provides substantial information to answer most aspects of the question
    5 = The context provides all the information needed to fully and unambiguously answer the question

    First, read through the provided context carefully:

    <context>
    {context}
    </context>

    Then read the question:

    <question>
    {question}
    </question>

    Evaluate how well you think the question can be answered using only the context information. Provide your reasoning first in an <evaluation> section, explaining what relevant or missing information from the context led you to your evaluation score in only one sentence.

    Provide your evaluation in the following format:

    <rating>(Your rating from 1 to 5)</rating>
    <evaluation>(Your evaluation and reasoning for the rating)</evaluation>
    </Instructions>
    ''')

relevance_check_prompt_template = PromptTemplate(
    input_variables=['question'],
    template='''
    <Instructions>
    You will be given a question related to Amazon shareholder letters. Your task is to evaluate how useful this question would be for a business analyst working on Wall Street.

    To evaluate the usefulness of the question, consider the following criteria:

    1. Relevance: Is the question directly relevant to your work? Questions that are too broad or unrelated to this domain should receive a lower rating.
    2. Practicality: Does the question address a practical problem or use case that analysts might encounter? Theoretical or overly academic questions may be less useful.
    3. Clarity: Is the question clear and well-defined? Ambiguous or vague questions are less useful.
    4. Depth: Does the question require a substantive answer that demonstrates understanding of financial topics? Surface-level questions may be less useful.
    5. Applicability: Would answering this question provide insights or knowledge that could be applied to real-world company evaluation tasks? Questions with limited applicability should receive a lower rating.

    Provide your evaluation in the following format:

    <rating>(Your rating from 1 to 5)</rating>
    <evaluation>(Your evaluation and reasoning for the rating)</evaluation>

    Here is an example:
    <rating>5</rating>
    <evaluation>The question is very relevant to the persona because it asks about financial information of a company</evaluation>

    Here is the question:
    {question}
    
    </Instructions>
    ''')

In [21]:
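def get_rating(eval_str: str) -> Optional[str]:
    '''Return the text inside <rating>...</rating>, or None if the tag is missing.'''
    # re.DOTALL lets the match span line breaks inside the tag
    _match = re.search(r'<rating>(.*?)</rating>', eval_str, re.DOTALL)
    return _match.group(1).strip() if _match else None


def get_evaluation_reasoning(eval_str: str) -> Optional[str]:
    '''Return the text inside <evaluation>...</evaluation>, or None if the tag is missing.'''
    # Reasoning often spans several lines, so re.DOTALL is needed here as well
    _match = re.search(r'<evaluation>(.*?)</evaluation>', eval_str, re.DOTALL)
    return _match.group(1).strip() if _match else None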
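
The helpers above return the tag content as a string. Since an LLM may answer with something like `5/5` or a word instead of a number, it can help to validate the value before converting it. The sketch below is one possible approach; `parse_rating` is a hypothetical helper, not part of the original notebook.

```python
import re


def parse_rating(response_text, low=1, high=5):
    """Return the first integer after <rating> if it lies in [low, high], else None."""
    match = re.search(r'<rating>\s*(\d+)', response_text)
    if match is None:
        return None
    value = int(match.group(1))
    return value if low <= value <= high else None


parsed = [
    parse_rating('<rating>4</rating>'),
    parse_rating('<rating> 5/5 </rating>'),
    parse_rating('<rating>high</rating>'),
    parse_rating('<rating>9</rating>'),
    parse_rating('no tags at all'),
]
print(parsed)
# [4, 5, None, None, None]
```

Returning `None` for anything unexpected lets the caller decide whether to retry the prompt or leave the rating empty.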

In [22]:
def append_rating_n_reasoning(
    df: pd.DataFrame,
    llm: ChatBedrock,
) -> pd.DataFrame:
    _df = df.copy()
    for idx, row in _df.iterrows():
        _question = row['question']
        _context_chunk = row['source_chunk'][0]
        _groundedness = llm_invoke(
            llm=llm,
            _my_prompt_template=groundedness_check_prompt_template.format(
                question=_question,
                context=_context_chunk
            )
        )
        _relevancy = llm_invoke(
            llm=llm,
            _my_prompt_template=relevance_check_prompt_template.format(
                question=_question,
            )
        )
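        # llm_invoke returns None when every retry fails, so guard before reading .content
        _groundedness_text = _groundedness.content if _groundedness is not None else ''
        _relevancy_text = _relevancy.content if _relevancy is not None else ''
        groundedness_rating = get_rating(_groundedness_text)
        groundedness_reason = get_evaluation_reasoning(_groundedness_text)
        relevance_rating = get_rating(_relevancy_text)
        relevance_reason = get_evaluation_reasoning(_relevancy_text)
        # Store NaN instead of crashing on int(None) when a rating tag is missing
        _df.at[idx, 'groundedness_rating'] = int(groundedness_rating) if groundedness_rating is not None else float('nan')
        _df.at[idx, 'groundedness_reason'] = groundedness_reason
        _df.at[idx, 'relevance_rating'] = int(relevance_rating) if relevance_rating is not None else float('nan')
        _df.at[idx, 'relevance_reason'] = relevance_reason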
        if (idx+1) % 2 == 0:
            print(' --- processing up to {} rows'.format(idx+1))

    return _df



In [23]:
qna_with_assessment_df = append_rating_n_reasoning(
    df=qna_df,
    llm=llama3_1_405b_langchain,
)

 --- processing up to 2 rows
 --- processing up to 4 rows
 --- processing up to 6 rows
 --- processing up to 8 rows


ERROR:root:Error raised by bedrock service: An error occurred (ThrottlingException) when calling the InvokeModel operation (reached max retries: 4): Too many requests, please wait before trying again. You have sent too many requests.  Wait before trying again.


An error occurred (ThrottlingException) when calling the InvokeModel operation (reached max retries: 4): Too many requests, please wait before trying again. You have sent too many requests.  Wait before trying again.
 --- processing up to 10 rows
 --- processing up to 12 rows
 --- processing up to 14 rows
 --- processing up to 16 rows


ERROR:root:Error raised by bedrock service: An error occurred (ThrottlingException) when calling the InvokeModel operation (reached max retries: 4): Too many requests, please wait before trying again. You have sent too many requests.  Wait before trying again.


An error occurred (ThrottlingException) when calling the InvokeModel operation (reached max retries: 4): Too many requests, please wait before trying again. You have sent too many requests.  Wait before trying again.
 --- processing up to 18 rows
 --- processing up to 20 rows


In [24]:
qna_with_assessment_df

Unnamed: 0,question,compressed_question,ref_answer,source_sentence,source_chunk,source_document,groundedness_rating,groundedness_reason,relevance_rating,relevance_reason
0,What are the names of the chips that were anno...,Which 2nd-gen chipsets are being utilized by A...,Trainium and Inferentia.,announced second versions of our Trainium and ...,[announced second versions of our Trainium and...,{'source': '_raw_data/AMZN-2023-Shareholder-Le...,5.0,The context provides all the necessary informa...,1.0,The question is not relevant to a business ana...
1,What are the company's priorities in terms of ...,What are key spend & cultural priorities for a...,The company's priorities in terms of spending ...,We will work hard to spend wisely and maintain...,"[the present value of future cash flows, we’ll...",{'source': '_raw_data/AMZN-2021-Shareholder-Le...,5.0,The context provides all the necessary informa...,5.0,The question is very relevant to a business an...
2,What was the annual revenue run rate of AWS fi...,What was AWS' ARR 15 yrs post-investment decis...,AWS had an annual revenue run rate of $85B fif...,"Fifteen years later, AWS is now an $85B annual...","[be investing so much in cloud computing. But,...",{'source': '_raw_data/AMZN-2022-Shareholder-Le...,5.0,The context provides all the necessary informa...,5.0,This question is highly relevant to a business...
3,What is the year in which Amazon's original Sh...,When was Amazon's inaugural Shareholder Letter...,The year in which Amazon's original Shareholde...,"P .S. As we have always done, our original 199...","[Amazon, all of which are still in their early...",{'source': '_raw_data/AMZN-2022-Shareholder-Le...,5.0,The context explicitly mentions the year 1997 ...,2.0,The question is somewhat relevant to a busines...
4,What were the changes in Amazon.com's employee...,What shifts occurred in Amazon's workforce & D...,Amazon.com's employee base grew from 158 to 61...,• Amazon.com’s employee base grew from 158 to ...,"[Infrastructure\nDuring 1997, we worked hard t...",{'source': '_raw_data/AMZN-2023-Shareholder-Le...,5.0,The context provides all the necessary informa...,4.0,The question is relevant to a business analyst...
5,What are some examples of companies and indust...,"Which entities (cos., sectors) have seen subst...",Examples of companies and industries significa...,whole companies sprang up quickly on top of AW...,[blocks over time (we now have over 240 at bui...,{'source': '_raw_data/AMZN-2023-Shareholder-Le...,5.0,The context provides all the information neede...,4.0,The question is relevant to a business analyst...
6,What opportunity does Amazon.com have as large...,What prospects emerge for Amazon as major play...,Amazon.com has a window of opportunity as larg...,We have a window of opportunity as larger play...,"[customers money and precious time. Tomorrow, ...",{'source': '_raw_data/AMZN-2023-Shareholder-Le...,4.0,The context provides substantial information t...,5.0,
7,What are some ways in which serendipitous inte...,How do chance encounters facilitate innovation?,Serendipitous interactions can help the proces...,moments from people staying behind after a mee...,[moments from people staying behind after a me...,{'source': '_raw_data/AMZN-2022-Shareholder-Le...,4.0,The context provides substantial information t...,2.0,The question is somewhat relevant to a busines...
8,What enhancements were made to the store in 1997?,What '97 upgrades did the store undergo?,"In 1997, the store was substantially enhanced.",We maintained a dogged focus on improving the ...,"[could not get any other way, and began servin...",{'source': '_raw_data/AMZN-2022-Shareholder-Le...,2.0,The context provides very little relevant info...,2.0,The question is somewhat relevant to a busines...
9,What is the company's goal in the rapidly evol...,What strategic objectives is the co. pursuing ...,The company's goal is to solidify and extend i...,Our goal is to move quickly to solidify and ex...,[landscape has continued to evolve at a fast p...,{'source': '_raw_data/AMZN-2021-Shareholder-Le...,4.0,The context provides substantial information t...,5.0,This question is highly relevant to a business...


In [25]:
# Persist the assessed QnA pairs so the evaluation notebooks can load them
eval_data_dir = './_eval_data'
os.makedirs(eval_data_dir, exist_ok=True)
qna_with_assessment_df.to_csv(os.path.join(eval_data_dir, 'eval_dataframe.csv'), index=False, header=True)
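
The saved dataframe keeps every generated pair, including rows with low `groundedness_rating` or `relevance_rating`. If only the stronger pairs are wanted for evaluation, a filter along the following lines could be applied. This is a sketch under assumptions: `select_high_quality` and the thresholds of 4.0 are hypothetical choices, not something this notebook prescribes, and the sample data is made up for the demonstration.

```python
import pandas as pd


def select_high_quality(df, min_groundedness=4.0, min_relevance=4.0):
    """Keep rows whose ratings are present and meet both thresholds.

    Hypothetical helper; column names follow the dataframe shown above.
    """
    # Drop rows where either rating is missing.
    rated = df.dropna(subset=["groundedness_rating", "relevance_rating"])
    mask = (
        (rated["groundedness_rating"] >= min_groundedness)
        & (rated["relevance_rating"] >= min_relevance)
    )
    return rated[mask].reset_index(drop=True)


# Tiny illustrative dataframe (not real notebook output).
sample = pd.DataFrame({
    "question": ["q0", "q1", "q2", "q3"],
    "groundedness_rating": [5.0, 5.0, 2.0, None],
    "relevance_rating": [1.0, 5.0, 2.0, 4.0],
})
filtered = select_high_quality(sample)
print(filtered["question"].tolist())
```

In this sample only `q1` passes both thresholds; the right cut-off depends on how many pairs were generated and how strict the evaluation needs to be.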

## Conclusion
---

This concludes the prerequisite notebook. We downloaded the Amazon shareholder letters and ingested them into a vector datastore (ChromaDB). We also used **Amazon Bedrock** to generate a synthetic dataset. Synthetic data generation can be a powerful technique for evaluating retrieval augmented generation (RAG) workflows, especially in early development when real-world data is difficult to obtain. By leveraging large language models, this approach enables the creation of diverse, realistic, and representative datasets that mimic real human interactions.