# Prerequisite Step: Set up vector database and evaluation dataset
---

This notebook is a prerequisite for all the RAG evaluation tasks in this repository. It covers the following steps:
- Download the sample dataset. We will use Amazon shareholder letters as our data source.
- Set up **Chroma** as our vector database.
- Use the **DeepEval** library to generate the **golden (evaluation)** dataset for our RAG application.


You will need access to an **Amazon Bedrock** foundation model for embedding the documents. Please refer to the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html) for details on requesting model access.

<div class="alert alert-block alert-info">
    <b>Note</b>: We will be using the <b>Amazon Titan Text Embeddings v2 model</b> (<i>amazon.titan-embed-text-v2:0</i>). Its capabilities are described <a href='https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html'>here</a>.
</div>

## Set up
---

Install the dependency libraries for the notebook

In [4]:
%pip install -qU --quiet -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Set up Vector database

### Download the dataset
---

We will use the Amazon shareholder letters from 2021 to 2023 as our data sources.

In [5]:
import os
from urllib.request import urlretrieve

url_file_map = {
    'https://s2.q4cdn.com/299287126/files/doc_financials/2024/ar/Amazon-com-Inc-2023-Shareholder-Letter.pdf': 'AMZN-2023-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf': 'AMZN-2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf': 'AMZN-2021-Shareholder-Letter.pdf',
}
data_dir = './_raw_data/'
os.makedirs(data_dir, exist_ok=True)
# Download each letter into the local data directory
for url, filename in url_file_map.items():
    urlretrieve(url, os.path.join(data_dir, filename))
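Re-running the download cell fetches every file again. As a small sketch, the helper below filters a URL-to-filename map down to the files that are not on disk yet. The name `missing_files` and the `example.com` URLs are hypothetical and used only for illustration; the demo runs against a temporary directory and makes no network calls.

```python
import os
import tempfile


def missing_files(url_file_map: dict, data_dir: str) -> dict:
    """Return only the url -> filename pairs whose file is not yet in data_dir."""
    return {
        url: name
        for url, name in url_file_map.items()
        if not os.path.exists(os.path.join(data_dir, name))
    }


# Demo with a throwaway directory and placeholder URLs (assumptions, not real files)
with tempfile.TemporaryDirectory() as tmp_dir:
    # Simulate one file that was downloaded earlier
    open(os.path.join(tmp_dir, "already-here.pdf"), "wb").close()
    demo_map = {
        "https://example.com/a.pdf": "already-here.pdf",
        "https://example.com/b.pdf": "not-yet.pdf",
    }
    remaining = missing_files(demo_map, tmp_dir)

print(remaining)
```

In the notebook, the result of such a filter could be passed to the download loop in place of the full `url_file_map`.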

### Chunking strategy
---

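**Chunking data** before loading it into a vector database is often necessary because vector databases are optimized for **efficient similarity search and retrieval operations on high-dimensional vector data**. These databases typically have limitations on the maximum size or dimensionality of vectors they can store and process efficiently. In practice, the tighter constraint usually comes from the embedding model, which accepts only a limited number of input tokens per call, so long documents must be split before they can be embedded.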

By chunking or splitting large datasets into smaller, manageable chunks, it becomes easier to load and index the data within the vector database's constraints. Chunking also facilitates parallel processing, allowing multiple chunks to be loaded concurrently, improving overall performance and scalability. Additionally, it provides a way to manage and update the data incrementally, as new chunks can be added or existing ones can be modified without requiring a complete reload of the entire dataset.

There are multiple chunking strategies; for simplicity, we will use [`RecursiveCharacterTextSplitter()` as our chunking strategy](https://python.langchain.com/docs/how_to/recursive_text_splitter/).


There are a few parameters we can configure for our `RecursiveCharacterTextSplitter`:

- `chunk_size`: The maximum size of a chunk, where size is determined by `length_function`.
- `chunk_overlap`: Target overlap between consecutive chunks. Overlapping chunks help to mitigate the loss of information when context is divided between chunks.
- `length_function`: Function that measures the size of a chunk. It defaults to `len`, so sizes are counted in characters unless you supply a token counter.
- `is_separator_regex`: Whether the separator list (defaulting to `["\n\n", "\n", " ", ""]`) should be interpreted as regex.
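To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and spaces); the function name `sliding_window_chunks` is hypothetical and only illustrates how the two parameters interact.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the same idea as the overlap configured for the splitter in this notebook, only at a much smaller scale.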

<div class="alert alert-block alert-warning">
    <b>Note</b>: You must take <b>chunk_size</b> into account, as each embedding model limits the number of input tokens it accepts.
</div>
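Because `chunk_size` is counted in characters by default while model limits are stated in tokens, a rough sanity check can help. The sketch below assumes an average of about 4 characters per token, which is only a rule of thumb for English text, and uses a made-up `max_tokens` value for the demo; look up the real limit in the documentation of the embedding model you use. Both function names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; chars_per_token is an assumed heuristic, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def oversized_chunks(chunks: list[str], max_tokens: int, chars_per_token: float = 4.0) -> list[int]:
    """Return the indices of chunks whose estimated token count exceeds max_tokens."""
    return [
        index
        for index, chunk in enumerate(chunks)
        if estimate_tokens(chunk, chars_per_token) > max_tokens
    ]


demo_chunks = ["a" * 20, "b" * 40]  # estimated at 5 and 10 tokens
too_long = oversized_chunks(demo_chunks, max_tokens=8)  # 8 is an illustrative limit only
print(too_long)
# [1]
```

For an exact count, a tokenizer matching the embedding model would be needed; this estimate is only meant to flag chunks that are clearly too large.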

In [14]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


loader = PyPDFDirectoryLoader(data_dir)
pages = loader.load_and_split()
print('Total document pages: {}'.format(len(pages)))
print('Sample data load: {}'.format(pages[0].page_content[:100]))

rec_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=102
)
rec_docs_splitted = rec_splitter.split_documents(pages)
print(' ----- ')
print('Total chunks: {}'.format(len(rec_docs_splitted)))
print('Sample chunk:\n{}'.format(rec_docs_splitted[3].page_content))

Total document pages: 48
Sample data load: Dear Shareholders:
Last year at this time, I shared my enthusiasm and optimism for Amazon’s future. 
 ----- 
Total chunks: 380
Sample chunk:
available, tens of millions added last year alone, and several premium brands starting to list on Amazon(e.g. Coach, Victoria’s Secret, Pit Viper, Martha Stewart, Clinique, Lancôme, and Urban Decay).
Being sharp on price is always important, but particularly in an uncertain economy, where customers are
careful about how much they’re spending. As a result, in Q4 2023, we kicked off the holiday season with Prime


### Prepare Vector Database
---

With the chunked documents ready, the next step is to prepare the vector database. In this example, we will use **Chroma**. [**ChromaDB**](https://www.trychroma.com/) is an open-source vector database designed for building applications that require efficient semantic search and retrieval capabilities.

Before creating the ChromaDB collection, we need to initialize the embedding function. We can use the [`BedrockEmbeddings` class](https://api.python.langchain.com/en/latest/embeddings/langchain_aws.embeddings.bedrock.BedrockEmbeddings.html) from **langchain_aws** to do this.


In [15]:
import langchain_aws
from langchain_aws import BedrockEmbeddings
import boto3

boto_session = boto3.session.Session()
titan_model_id = 'amazon.titan-embed-text-v2:0'
titan_embedding_fn = BedrockEmbeddings(
    model_id=titan_model_id,
    region_name=boto_session.region_name
)
titan_embedding_fn.embed_query('Hello')[: 2]

[-0.0635838583111763, 0.05780351161956787]
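Vectors like the one above are what a vector store compares at query time. The toy sketch below ranks three made-up 3-dimensional vectors against a query using cosine similarity, which is one common choice; the metric Chroma actually applies depends on how the collection is configured. The names `cosine_similarity`, `top_k`, and the `chunk-*` ids are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict, k: int) -> list[tuple]:
    """Return the k (doc_id, score) pairs with the highest similarity to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for real embeddings
toy_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
ranked = top_k([1.0, 0.1, 0.0], toy_vectors, k=2)
print([doc_id for doc_id, _ in ranked])
# ['chunk-a', 'chunk-b']
```

A real vector database does the same kind of comparison over many high-dimensional vectors, typically with an index that avoids scoring every stored vector.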

Now that we have the embedding function, we can initialize `Chroma`.

In [16]:
from langchain_chroma import Chroma

chroma_db_dir = './_vector_db'
chroma_collection_name = 'amazon-shareholder-letters'

# Open (or create) the persistent Chroma collection
vector_store = Chroma(
    collection_name=chroma_collection_name,
    embedding_function=titan_embedding_fn,
    persist_directory=chroma_db_dir,
)

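Load the documents into ChromaDB. The check below only loads them when the collection is still empty, so re-running the cell does not insert the same chunks twice.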

In [18]:
if len(vector_store.get().get('ids')) == 0:
    vector_store = Chroma.from_documents(
        collection_name=chroma_collection_name,
        documents=rec_docs_splitted,
        persist_directory=chroma_db_dir,
        embedding=titan_embedding_fn,
    )

### Test query our vector store
---
Now that we have a vector database with data in it, let's test it out.

In [19]:
sample_question = '''
Amazon discusses its investments and progress in various areas, such as Generative AI, logistics, and healthcare. 
How do these initiatives relate to the company's strategy of building "primitives" or foundational building blocks, 
and what potential customer experiences or business opportunities do they enable?'''

search_result = vector_store.similarity_search_with_relevance_scores(
    query=sample_question.strip(),
    k=3,
)

In [20]:
search_result

[(Document(metadata={'page': 5, 'source': '_raw_data/AMZN-2022-Shareholder-Letter.pdf'}, page_content='One final investment area that I’ll mention, that’s core to setting Amazon up to invent in every area of our\nbusiness for many decades to come, and where we’re investing heavily is Large Language Models (“LLMs”)\nand Generative AI . Machine learning has been a technology with high promise for several decades, but it’s'),
  0.43749576860014605),
 (Document(metadata={'page': 3, 'source': '_raw_data/AMZN-2023-Shareholder-Letter.pdf'}, page_content='otherU.S. Intelligence agencies). But, one of the lesser-recognized beneficiaries was Amazon’s own consumerbusinesses, which innovated at dramatic speed across retail, advertising, devices (e.g. Alexa and FireTV),Prime Video and Music, Amazon Go, Drones, and many other endeavors by leveraging the speed with whichAWS let them build. Primitives, done well, rapidly accelerate builders’ ability to innovate .'),
  0.419795440924723),
 (Document(me

We have now set up ChromaDB as our vector database, which will be used in the subsequent notebooks for RAG evaluation.

## Synthetic Evaluation dataset

In this section, we will generate the **golden** (or **evaluation**) dataset used to evaluate the RAG application. We will use the `DeepEval` library for this purpose.


### Synthesizer
---

`DeepEval`'s **Synthesizer** offers a fast and easy way to get started with testing your LLM by automatically generating high-quality evaluation datasets (inputs, expected outputs, and contexts) from scratch. The **Synthesizer** class uses `OpenAI` by default, so we need to create two custom model handlers to use our **Amazon Bedrock** models: one for the embedding model and one for the language model.

<div class="alert alert-block alert-info">
    <b>Remark</b>: We will pass the models to DeepEval as LangChain objects.
</div>

Please refer to [DeepEval's source code](https://github.com/confident-ai/deepeval/blob/main/deepeval/synthesizer/synthesizer.py) and [documentation](https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data) if you need to adjust these handlers.

#### Custom Bedrock Embedding Model

In [21]:
import deepeval
from deepeval.synthesizer import Synthesizer
from deepeval.models import DeepEvalBaseEmbeddingModel
from typing import List


class BedrockEmbeddingDeepEval(DeepEvalBaseEmbeddingModel):
    # Annotate with the class itself; langchain_aws.embeddings is a module, not a type
    def __init__(self, model: BedrockEmbeddings):
        self.model = model

    def load_model(self):
        return self.model

    def embed_text(self, text: str) -> List[float]:
        embedding_model = self.load_model()
        return embedding_model.embed_query(text)

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        embedding_model = self.load_model()
        return embedding_model.embed_documents(texts)

    async def a_embed_text(self, text: str) -> List[float]:
        embedding_model = self.load_model()
        return await embedding_model.aembed_query(text)

    async def a_embed_texts(self, texts: List[str]) -> List[List[float]]:
        embedding_model = self.load_model()
        return await embedding_model.aembed_documents(texts)

    def get_model_name(self) -> str:
        embedding_model = self.load_model()
        return embedding_model.model_id

    def get_provider(self) -> str:
        model_id = self.get_model_name()
        return model_id.split('.')[0]

In [22]:
titan_embedding_deepeval = BedrockEmbeddingDeepEval(model=titan_embedding_fn)
titan_embedding_deepeval.embed_text('Hello')[:2]

[-0.0635838583111763, 0.05780351161956787]

#### Custom Bedrock LLM
---
For the text generation model, we will use **Anthropic Claude 3 Sonnet on Amazon Bedrock**. Feel free to switch to another LLM, such as Llama 3.1 70B or 405B.

In [23]:
import langchain_aws
from langchain_aws import ChatBedrock

claude3_sonnet_model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'
llama3_1_70b_model_id = 'meta.llama3-1-70b-instruct-v1:0'

claude_sonnet_langchain = ChatBedrock(
    model_id=claude3_sonnet_model_id,
    region_name=boto_session.region_name
)
claude_sonnet_langchain.invoke('What is L in LLM?')

AIMessage(content='In the context of LLM, the \'L\' stands for "Large":\n\nLLM = Large Language Model\n\nA large language model (LLM) is a type of artificial intelligence system that is trained on vast amounts of text data to understand and generate human-like language. These models have a very large number of parameters (the values that encode the model\'s knowledge), often in the billions or trillions.\n\nSome key characteristics of LLMs:\n\n- Trained on massive text corpora crawled from the internet or other sources, allowing them to develop broad knowledge.\n- Can understand and generate text on a wide range of topics in a contextual, semantically coherent way.\n- Support natural language tasks like text generation, question answering, summarization, translation, etc.\n- Models like GPT-3, PaLM, Jurassic-1, LaMDA fall under the LLM category.\n\nThe "large" component refers to these models\' immense scale in terms of the amount of data used for training and the huge number of parame

In [24]:
from deepeval.models import DeepEvalBaseLLM


class BedrockTextGenDeepEval(DeepEvalBaseLLM):
    # Annotate with the class itself; langchain_aws.chat_models is a module, not a type
    def __init__(self, model: ChatBedrock):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        llm_model = self.load_model()
        return llm_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        llm_model = self.load_model()
        res = await llm_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        llm_model = self.load_model()
        return llm_model.model_id

    def get_provider(self):
        model_id = self.get_model_name()
        return model_id.split('.')[0]

In [25]:
claude_sonnet_deepeval = BedrockTextGenDeepEval(model=claude_sonnet_langchain)

#### Initialize Synthesizer with custom LLM

In [26]:
custom_synthesizer = Synthesizer(
    model=claude_sonnet_deepeval,
    critic_model=claude_sonnet_deepeval,
    embedder=titan_embedding_deepeval,
    context_quality_threshold=.8,
    context_similarity_threshold=.8,
)

In [31]:
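import time

# Wait 60 seconds before generating (presumably to ease Bedrock rate limits after the earlier calls)
time.sleep(60)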
_out = custom_synthesizer.generate_goldens_from_docs(
    document_paths=[os.path.join(data_dir, 'AMZN-2023-Shareholder-Letter.pdf')],
    include_expected_output=True,
    max_contexts_per_document=4,
    max_goldens_per_context=2,
    chunk_size=512,
    chunk_overlap=102,
    _send_data=False,
    num_evolutions=1,
)

Event loop is already running. Applying nest_asyncio patch to allow async execution...


✨ 🚀 ✨ Loading Documents: 100%|██████████| 1/1 [00:00<00:00, 219.60it/s]
✨ 🧩 ✨ Generating Contexts: 100%|██████████| 12/12 [00:04<00:00,  2.64it/s]


✨ Generating up to 8 goldens using DeepEval (using anthropic.claude-3-sonnet-20240229-v1:0 and amazon.titan-embed-text-v2:0, use case=QA, method=docs): 100%|██████████| 8/8 [00:47<00:00,  5.99s/it]
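The "up to 8 goldens" in the log is consistent with multiplying the limits passed to `generate_goldens_from_docs`. The sketch below spells out that arithmetic; the helper name `max_goldens` is hypothetical, and the formula is inferred from the parameter names and the log line rather than taken from DeepEval's code.

```python
def max_goldens(num_documents: int, max_contexts_per_document: int, max_goldens_per_context: int) -> int:
    """Upper bound on generated goldens, assuming the limits multiply per document."""
    return num_documents * max_contexts_per_document * max_goldens_per_context


# Values used in the cell above: 1 document, 4 contexts per document, 2 goldens per context
upper_bound = max_goldens(1, 4, 2)
print(upper_bound)
# 8
```

The number actually produced can be lower, since the wording is "up to".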


### Save evaluation to dataframe and file

In [32]:
eval_df = custom_synthesizer.to_pandas()
eval_df.head(2)

| | input | actual_output | expected_output | context | retrieval_context | n_chunks_per_context | context_length | evolutions | context_quality | synthetic_input_quality | source_file |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rewritten Input: Explain Amazon's core mission... | | Amazon's core mission is to make customers' li... | [across Amazon. Y et, I think every one of us ... | | 1 | 2361 | [Reasoning] | 0.8 | 1.0 | ./_raw_data/AMZN-2023-Shareholder-Letter.pdf |
| 1 | Compare Amazon's approach to empowering builde... | | Amazon's approach to empowering builders and i... | [across Amazon. Y et, I think every one of us ... | | 1 | 2361 | [Comparative] | 0.8 | 0.6 | ./_raw_data/AMZN-2023-Shareholder-Letter.pdf |


In [33]:
eval_data_dir = './_eval_data'
os.makedirs(eval_data_dir, exist_ok=True)
eval_df.to_csv(os.path.join(eval_data_dir, 'eval_dataframe.csv'), index=False, header=True)
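One thing to keep in mind when reading this file back: CSV stores list-valued cells such as `context` as their string representation, so they come back as plain strings. A minimal sketch of restoring them with `ast.literal_eval` follows. It uses the standard `csv` module and an inline sample row so that it runs on its own; the helper name `parse_list_cell` and the sample data are hypothetical.

```python
import ast
import csv
import io


def parse_list_cell(cell: str) -> list:
    """Turn a stringified Python list (as written to CSV) back into a list."""
    if not cell:
        return []  # empty cell: treat as no items
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return value


# Inline stand-in for the saved file; the real file has more columns
csv_text = '''input,context
What is X?,"['first chunk', 'second chunk']"
'''
rows = list(csv.DictReader(io.StringIO(csv_text)))
contexts = parse_list_cell(rows[0]["context"])
print(contexts)
# ['first chunk', 'second chunk']
```

With pandas, the same function could be applied to the column after `read_csv`, for example through `Series.apply`.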