# Retrieval Augmented Question & Answering with LangChain


### Context
Previously we saw that the model could tell us how to change a tire, but we had to manually supply the relevant data and provide the context ourselves. We explored how to use a model available under Bedrock to answer questions based on the knowledge it learned during training as well as on manually provided context. While that approach works with short documents or singleton applications, it fails to scale to enterprise-level question answering, where large enterprise documents cannot all fit into the prompt sent to the model.

### Pattern
We can improve upon this process by implementing an architecture called Retrieval Augmented Generation (RAG). RAG retrieves data from outside the language model (non-parametric) and augments the prompt by adding the relevant retrieved data as context.

In this notebook we explain how to approach the question answering pattern: finding the relevant documents and using them to answer user questions.

### Challenges
- How to manage large document(s) that exceed the token limit
- How to find the document(s) relevant to the question being asked

### Proposal
To address the above challenges, this notebook proposes the following strategy:
#### Prepare documents
![Embeddings](./images/Embeddings_lang.png)

Before we can answer questions, the documents must be processed and stored in a document store index:
- Load the documents
- Process and split them into smaller chunks
- Create a numerical vector representation of each chunk using the Amazon Titan Embeddings model on Amazon Bedrock
- Create an index using the chunks and the corresponding embeddings
#### Ask question
![Question](./images/Chatbot_lang.png)

Once the document index is prepared, you are ready to ask questions; relevant documents are fetched based on the question being asked. The following steps are executed:
- Create an embedding of the input question
- Compare the question embedding with the embeddings in the index
- Fetch the (top N) relevant document chunks
- Add those chunks as part of the context in the prompt
- Send the prompt to the model under Amazon Bedrock
- Get the contextual answer based on the documents retrieved
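
The retrieval steps above can be illustrated with a minimal, self-contained sketch. It assumes a toy word-count `embed` function and three made-up chunks in place of Titan Embeddings and a real index; only the flow of embedding the question, comparing it with the stored chunks and fetching the top N is the point.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embeddings model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_n(question, chunks, n=2):
    """Return the n chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:n]

# Hypothetical chunks, not taken from the IRS documents.
chunks = [
    "employers must withhold federal income tax from wages",
    "cash payments over 10000 dollars must be reported",
    "original issue discount is a form of interest",
]
best = top_n("when must cash payments be reported", chunks, n=1)
print(best)
```

A real embeddings model captures meaning rather than exact word overlap, but the ranking logic stays the same.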

## Use Case
#### Dataset
To explain this architecture pattern we are using documents from the IRS. These documents explain topics such as:
- Original Issue Discount (OID) Instruments
- Reporting Cash Payments of Over $10,000 to IRS
- Employer's Tax Guide

#### Persona
Let's assume the persona of a layperson who doesn't understand how the IRS works or whether certain actions have implications.

The model will try to answer from the documents in plain language.


## Implementation
To follow the RAG approach, this notebook uses the LangChain framework, which has integrations with different services and tools that allow patterns such as RAG to be built efficiently. We will be using the following tools:

- **LLM (Large Language Model)**: Amazon Titan Text Express available through Amazon Bedrock

  This model will be used to understand the document chunks and provide an answer in a human-friendly manner.
- **Embeddings Model**: Amazon Titan Embeddings available through Amazon Bedrock

  This model will be used to generate a numerical representation of the textual documents.
- **Document Loader**: PDF Loader available through LangChain

  This is the loader that can load the documents from a source. For the sake of this notebook we load the sample files from a local path; this could easily be replaced with a loader that reads documents from enterprise internal systems.

- **Vector Store**: FAISS available through LangChain

  In this notebook we are using this in-memory vector store to hold both the embeddings and the documents. In an enterprise context this could be replaced with a persistent store such as Amazon OpenSearch, Amazon RDS for PostgreSQL with pgvector, ChromaDB, Pinecone or Weaviate.
- **Index**: VectorIndex

  The index helps to compare the input embedding with the document embeddings to find relevant documents.
- **Wrapper**: wraps index, vector store, embeddings model and the LLM to abstract away the logic from the user.

## Setup


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%pip install "langchain>=0.1.11"
%pip install pypdf==4.1.0
%pip install langchain-community faiss-cpu==1.8.0 tiktoken==0.6.0 sqlalchemy==2.0.28
%pip install boto3



In [8]:
import json
import os
import sys

import boto3
import botocore

boto3_bedrock = boto3.client('bedrock-runtime', region_name='eu-west-2') 

In [9]:
import warnings

from io import StringIO
import sys
import textwrap
import os
from typing import Optional

# External Dependencies:
import boto3
from botocore.config import Config

warnings.filterwarnings('ignore')

def print_ww(*args, width: int = 100, **kwargs):
    """Like print(), but wraps output to `width` characters (default 100)"""
    buffer = StringIO()
    try:
        _stdout = sys.stdout
        sys.stdout = buffer
        print(*args, **kwargs)
        output = buffer.getvalue()
    finally:
        sys.stdout = _stdout
    for line in output.splitlines():
        print("\n".join(textwrap.wrap(line, width=width)))
        

## Configure langchain

We begin by instantiating the LLM and the embeddings model. Here we are using Amazon Titan Text Express for text generation and Amazon Titan Text Embeddings V2 for text embedding.

Note: It is possible to choose other models available with Bedrock. You can replace the `model_id` as follows to change the model.

`llm = Bedrock(model_id="amazon.titan-text-express-v1")`

Check the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids-arns.html) for the available text generation and embedding model IDs under Amazon Bedrock.

In [16]:
# We will be using the Titan Embeddings Model to generate our Embeddings.
from langchain.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock

# - create the text generation model (Amazon Titan Text Express)
#llm = Bedrock(model_id="amazon.titan-text-lite-v1", client=boto3_bedrock, model_kwargs={})
llm = Bedrock(model_id="amazon.titan-text-express-v1", client=boto3_bedrock, model_kwargs={})
#bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=boto3_bedrock)
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0", client=boto3_bedrock)

`Note: As an exercise, if you have time, update the cell above to use the current, non-deprecated style of code so that the deprecation warning is resolved.`

## Data Preparation
Let's first download some of the files to build our document store. For this example we will be using public IRS documents from [here](https://www.irs.gov/publications).

In [11]:
from urllib.request import urlretrieve

os.makedirs("data", exist_ok=True)
files = [
    "https://www.irs.gov/pub/irs-pdf/p1544.pdf",
    "https://www.irs.gov/pub/irs-pdf/p15.pdf",
    "https://www.irs.gov/pub/irs-pdf/p1212.pdf",
]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)

After downloading, we can load the documents with the help of [PyPDFDirectoryLoader available under LangChain](https://python.langchain.com/en/latest/reference/modules/document_loaders.html) and split them into smaller chunks.

Note: The retrieved document/text should be large enough to contain enough information to answer a question, but small enough to fit into the LLM prompt. The embeddings model also limits its input to 8192 tokens, which roughly translates to ~32,000 characters. For this use case we create chunks of roughly 1000 characters with an overlap of 100 characters using [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html).
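
To make the chunk size and overlap concrete, here is a minimal sketch of fixed-window chunking. It is a simplified stand-in, not the logic of `RecursiveCharacterTextSplitter` (which also prefers to break on separators such as paragraphs and sentences), and `chunk_text` is a hypothetical helper. The last lines reproduce the rough 4-characters-per-token arithmetic, which is only a rule of thumb.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "x" * 2500
chunks = chunk_text(text)
print([len(c) for c in chunks])

# Rough rule of thumb (an assumption): about 4 characters per token.
max_tokens = 8192
approx_max_chars = max_tokens * 4
print(approx_max_chars)
```

With these defaults, each chunk repeats the last 100 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.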

In [18]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
#from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain_community.document_loaders.pdf import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()
# - in our testing Character split works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of roughly 1000 characters with a 100-character overlap.
    chunk_size = 1000,
    chunk_overlap  = 100,
)
docs = text_splitter.split_documents(documents)

In [19]:
avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)
avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(docs)
print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
print(f'After the split we have {len(docs)} documents, up from the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.')

Average length among 81 documents loaded is 5889 characters.
After the split we have 560 documents, up from the original 81.
Average length among 560 documents (after split) is 912 characters.


We had 3 PDF documents (loaded as 81 page-level documents), which have been split into 560 smaller chunks.

Now we can see what a sample embedding looks like for one of those chunks.

In [20]:
try:
    sample_embedding = np.array(bedrock_embeddings.embed_query(docs[0].page_content))
    print("Sample embedding of a document chunk: ", sample_embedding)
    print("Size of the embedding: ", sample_embedding.shape)

except ValueError as error:
    if "AccessDeniedException" in str(error):
        print(f"\x1b[41m{error}\
        \nTo troubleshoot this issue please refer to the following resources.\
         \nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html\
         \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\x1b[0m\n")      
        class StopExecution(ValueError):
            def _render_traceback_(self):
                pass
        raise StopExecution        
    else:
        raise error

ValueError: Error raised by inference endpoint: Unable to locate credentials

Following a similar pattern, embeddings can be generated for the entire corpus and stored in a vector store.

This can be easily done using the [FAISS](https://github.com/facebookresearch/faiss) implementation inside [LangChain](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html), which takes the embeddings model and the documents as input to create the entire vector store. Using the Index Wrapper we can abstract away most of the heavy lifting such as creating the prompt, getting embeddings of the query, sampling the relevant documents and calling the LLM. [VectorStoreIndexWrapper](https://python.langchain.com/en/latest/modules/indexes/getting_started.html#one-line-index-creation) helps us with that.

**⚠️⚠️⚠️ NOTE: it might take a few minutes to run the following cell ⚠️⚠️⚠️**

In [None]:
from langchain.chains.question_answering import load_qa_chain
from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

vectorstore_faiss = FAISS.from_documents(
    docs,
    bedrock_embeddings,
)

wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)
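
Conceptually, a flat in-memory vector store keeps each embedding next to its text and scans all of them at query time. The sketch below illustrates that idea under simplifying assumptions (brute-force Euclidean distance, hand-written 2-D vectors); `TinyVectorStore` is a hypothetical class and does not reflect how FAISS is implemented internally.

```python
import math

class TinyVectorStore:
    """Brute-force vector store: keeps (vector, text) pairs in a list."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query_vector, k=3):
        """Return the k texts whose vectors are closest (L2) to the query."""
        scored = [
            (math.dist(query_vector, vec), text) for vec, text in self._items
        ]
        scored.sort(key=lambda pair: pair[0])
        return [text for _, text in scored[:k]]

store = TinyVectorStore()
# Hand-written 2-D vectors standing in for real embeddings.
store.add([0.0, 1.0], "chunk about penalties")
store.add([1.0, 0.0], "chunk about wages")
store.add([0.1, 0.9], "chunk about filing deadlines")
nearest = store.search([0.0, 0.8], k=2)
print(nearest)
```

Scanning every vector is fine for a few hundred chunks; dedicated libraries exist because this linear scan becomes the bottleneck on large corpora.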

## Question Answering

Now that we have our vector store in place, we can start asking questions.

In [None]:
query = """Is it possible that I get sentenced to jail due to failure in filings?"""

The first step would be to create an embedding of the query such that it could be compared with the documents

In [None]:
query_embedding = vectorstore_faiss.embedding_function.embed_query(query)
np.array(query_embedding)

We can use this embedding of the query to fetch relevant documents.
Now that our query is represented as an embedding, we can run a similarity search against our vector store to find the most relevant information.

In [None]:
relevant_documents = vectorstore_faiss.similarity_search_by_vector(query_embedding)
print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
    print_ww(f'## Document {i+1}: {rel_doc.page_content}.......')
    print('---')

Now that we have the relevant documents, it's time to use the LLM to generate an answer based on them.

We take our initial prompt together with the relevant documents that were retrieved by the similarity search, and combine them into a prompt that we feed to the model to get our result. At this point the model should give us a well-informed answer grounded in the IRS documents.

LangChain provides an abstraction of how this can be done easily.

### Example #1: Using LangChain with RetrievalQA
You can use the wrapper provided by LangChain, which wraps around the vector store and takes the LLM as input.
This wrapper performs the following steps behind the scenes:
- Take the question as input
- Create question embedding
- Fetch relevant documents
- Stuff the documents and the question into a prompt
- Invoke the model with the prompt and generate the answer in a human readable manner.
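
The "stuff" step in the list above simply concatenates the retrieved chunks and places them in the prompt next to the question. Here is a minimal sketch using plain string formatting; the template text and the two chunks are illustrative, and `build_stuffed_prompt` is a hypothetical helper rather than a LangChain function.

```python
TEMPLATE = (
    "Use the following pieces of context to answer the question.\n"
    "<context>\n{context}\n</context>\n\n"
    "Question: {question}\n\nAnswer:"
)

def build_stuffed_prompt(chunks, question, separator="\n\n"):
    """Join all chunks into one context string and fill in the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)

prompt = build_stuffed_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What does the first chunk say?",
)
print(prompt)
```

Because every chunk is pasted in verbatim, the prompt length grows with the number and size of the retrieved chunks.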

In [None]:
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
prompt_template = """

Human: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

Question: {question}

Assistant:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [None]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
answer = qa({"query": query})
print_ww(answer)

`Note: As an exercise, if you have time, update the cell above to use the current, non-deprecated style of code so that the deprecation warning is resolved.`

That answer shows the full response, which includes a lot of noise. Let's zero in on the main part: the natural language message that we might return to the end user.

In [1]:
print_ww(answer['result'])


Let's ask a different question:

In [None]:
query_2 = "What is the difference between market discount and qualified stated interest"

In [None]:
answer_2 = qa({"query": query_2})
# show the full response
print_ww(answer_2)

That answer shows the full response, which includes a lot of noise. Let's zero in on the main part: the natural language message that we might return to the end user.

In [None]:
print_ww(answer_2['result'])

### Example #2
Now let's have another look at using [RetrievalQA](https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html), where you can customize how the fetched documents are added to the prompt using the `chain_type` parameter. If you want to control how many relevant documents are retrieved, change the `k` parameter in the cell below to see different outputs. In many scenarios you might want to know which source documents the LLM used to generate the answer; you can get those documents in the output using `return_source_documents`, which returns the documents that are added to the context of the LLM prompt. `RetrievalQA` also allows you to provide a custom [prompt template](https://python.langchain.com/en/latest/modules/prompts/prompt_templates/getting_started.html), which can be specific to the model.

In the cell below you see an example of how to control the prompt such that the LLM stays grounded and doesn't answer outside the context.

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """

Human: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

Question: {question}

Assistant:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
query = "Is it possible that I get sentenced to jail due to failure in filings?"
result = qa({"query": query})
print_ww(result['result'])

In [None]:
result['source_documents']
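
The returned source documents can be condensed into short citations. The sketch below assumes each document exposes `page_content` and a `metadata` dict with `source` and `page` keys, as PDF loaders typically provide; `FakeDocument` is a hypothetical stand-in so that the example runs without LangChain, and the sample text is made up.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def summarize_sources(documents, preview_chars=40):
    """Return one 'source (page N): preview...' line per document."""
    lines = []
    for doc in documents:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{source} (page {page}): {preview}...")
    return lines

docs_demo = [
    FakeDocument(
        "Penalties may apply if you fail to file.",
        {"source": "data/p15.pdf", "page": 3},
    ),
]
for line in summarize_sources(docs_demo):
    print(line)
```

In the notebook, the same function could be called on `result['source_documents']`, provided the documents carry that metadata.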

# Capstone Assignment Part 2

Don't panic! This is not as difficult as it might first seem.

Using the notebook above as a guide and/or starting point, consider and explore the following questions and tasks. As with most things in life, you'll get the most out of this exercise by putting in a reasonable amount of effort. That effort may be in research and/or in writing code.

Put all your answers into the notebook that you are going to submit as your completed assignment.
For each task, you are asked to consider the results and note your findings; please place these notes in the notebook.

This is a great opportunity to use or develop your markdown skills.


#### **Task 1**


**Task 1** (for everyone) 

Change the base content used here (the PDFs in the data folder) to be content that is meaningful to you in some way. It might be that it relates to the business domain that you are interested in, or on a topic that interests you. 

For this task, your content should be approximately 15 pages of text. That could be in a single document or in multiple documents. Related to that content create 10 questions or so. These questions will be your initial test set that you will use to determine the quality of your RAG solution.

Upload your content and update the notebook to create a FAISS index of your documents, and update the question examples to use a subset of your questions.

Take a look at the outputs and consider the solution's accuracy. The notebook at this point is your baseline. You might want to make a snapshot of it before you progress further.

**Task 1 answer**
Two documents were uploaded to the folder "./newdata". The documents, _'WiredScore-Factsheet-NorthRowTowers.pdf'_ and _'SmartScore-Factsheet-NorthRowTowers.pdf'_, contain a summary of the results achieved for the WiredScore and SmartScore certifications.

The questions are:
1. What WiredScore level did the North Row Towers building achieve?
2. What is the overview Set-up experience in North Row Towers?
3. What is the overview Future-ready experience in North Row Towers?
4. What are the connected building highlights at North Row Towers about Mobile and Internet?
5. What are the connected building highlights at North Row Towers about Resiliency?
6. What is WiredScore?
7. What SmartScore level did the North Row Towers building achieve?
8. What is the overview Sustainability experience in North Row Towers?
9. What is the overview Access and Navigation experience in North Row Towers?
10. What are the smart building highlights at North Row Towers about Health and Wellbeing?
11. What are the smart building highlights at North Row Towers about Safety and Security?
12. What is SmartScore?

#### **Question 1** Baseline Evaluation 

**Question 1** Baseline Evaluation 

We want to improve the quality of the output of the solution. We know that we have not done any tuning for the content so far, so our intuition is that we can do better.  Before we start experimenting to improve the quality, we really need to objectively measure the quality that we have.

You have a set of test questions (from task 1) to help drive your evaluation.

You decide that you will first test the retrieval aspect of the solution. You want a metric for whether the right chunks of content are being returned for your test queries.

A. Describe the ground-truth data you would create so that you can measure this. (hint: test query, document chunk)

B. Describe how you might determine success or failure of retrieval for a test query

C. Is your determination of success/failure binary (True/False) or graded 0.0 - 1.0 or both? Describe the reasoning for your choice.

D. Describe your approach for calculating an aggregate metric (or metrics) for measuring performance of your test set with this base configuration. And, outline the reasoning for your choice.

**Question 1 answer**

**A. Measuring ground-truth**.
The ground truth would be the chunk that answers each question. For instance, for the question _"What is the overview Future-ready experience in North Row Towers?"_ we expect the retrieval to return the chunk: _"To meet your needs far into the future, the building is equipped with spare capacity in the risers, point of entry and telecommunications room. This ensures the building can accommodate new and emerging technologies, providing you with the flexibility and adaptability you require."_
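
One possible way to encode such ground truth is a list of test queries, each paired with a short key passage from its expected chunk, counting a retrieval as a hit when any retrieved chunk contains that passage. This is a simple non-LLM check offered as a sketch under those assumptions, not the DeepEval approach; `retrieval_hit` and the sample retrieval result are hypothetical.

```python
ground_truth = [
    {
        "query": "What is the overview Future-ready experience in North Row Towers?",
        "expected_passage": "spare capacity in the risers",
    },
]

def retrieval_hit(expected_passage, retrieved_chunks):
    """True if any retrieved chunk contains the expected passage (case-insensitive)."""
    needle = expected_passage.lower()
    return any(needle in chunk.lower() for chunk in retrieved_chunks)

# Hypothetical retrieval result for the first test query.
retrieved = [
    "Placeholder text of an unrelated chunk.",
    "The building is equipped with spare capacity in the risers, "
    "point of entry and telecommunications room.",
]
hit = retrieval_hit(ground_truth[0]["expected_passage"], retrieved)
print(hit)
```

Matching on a short passage rather than the whole chunk keeps the ground truth usable when the chunk boundaries change between experiments.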

**B. Determining success or failure for a test query**.
DeepEval is a popular open-source framework for evaluating LLM performance. For RAG it offers [the RAGAS metric](https://docs.confident-ai.com/docs/metrics-ragas). RAGAS is formed as the average of four different metrics:

- **Answer relevancy**: "If someone asks you who the first man to walk on the moon was, and you answer Columbus was the first to find America, your answer is entirely irrelevant". The metric runs an LLM and looks for the percentage (%) of retrieved chunks that are relevant to the query. [see more here](https://docs.confident-ai.com/docs/metrics-answer-relevancy)
- **Contextual precision**: "Suppose you ask someone about who the president of the US was during the Apollo 11 mission. In that case, that person says, “Obama was president when Bin Laden was eliminated, and John F Kennedy was president when Armstrong stepped on the Moon.” The person has the answer but puts forward an irrelevant one first". This metric checks whether your retrieval system ranks the relevant documents higher than the others. [see more here](https://docs.confident-ai.com/docs/metrics-contextual-precision)
- **Contextual recall**: Contextual recall measures whether the retrieval context is sufficient to answer the problem, independent of the rank. [see more here](https://docs.confident-ai.com/docs/metrics-contextual-recall)
- **Faithfulness**: This metric purely evaluates the ability of the final LLM to produce the output with the context provided. A high score is achieved if there is no contradiction between the output and the retrieved context. [see more here](https://docs.confident-ai.com/docs/metrics-faithfulness)

**C. Metric form**.
All these metrics can easily be converted to pass/fail by checking whether they are above a given threshold, and when using the deepeval library the LLM's reasoning can be retrieved too. Nevertheless, since we do not know what threshold is acceptable for this case, we will use scores from 0.0 to 1.0 and develop our intuition for what a good threshold is.

**D. Aggregated Metric**.
The RAGAS framework uses the average of the four previous metrics.
Nevertheless, each of these metrics measures a different part of the RAG solution:
- **Answer relevancy**: Relevance of the data retrieved
- **Contextual precision**: Correct ranking of the data retrieved
- **Contextual recall**: Completeness of the data retrieved
- **Faithfulness**: Overall test of the solution, including retrieval and the final LLM

Given that in this part we focus on tuning the retrieval part of the solution, **we will only use the metrics _Answer relevancy_ and _Contextual precision_, and aggregate them as an average as our final metric**. By doing so we tackle the main goal, retrieval, while developing faster and testing more cheaply, as each metric needs to call an LLM.
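
The aggregation described above can be sketched as follows: average the two chosen metrics per test case, then average across test cases, optionally converting each case to pass/fail with a threshold. All scores and the 0.7 threshold are made-up placeholders, not measured results.

```python
from statistics import mean

# Placeholder scores in [0, 1]; real values would come from the evaluation run.
scores = [
    {"answer_relevancy": 0.75, "contextual_precision": 0.25},
    {"answer_relevancy": 1.0, "contextual_precision": 1.0},
]

def case_score(case):
    """Average of the two retrieval-focused metrics for one test case."""
    return mean([case["answer_relevancy"], case["contextual_precision"]])

per_case = [case_score(c) for c in scores]
overall = mean(per_case)

THRESHOLD = 0.7  # arbitrary example value
passed = [s >= THRESHOLD for s in per_case]
print(per_case, overall, passed)
```

Keeping the per-case scores alongside the overall mean makes it easier to see which test questions drag the aggregate down.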

#### Task 2

**Task 2** (strongly encouraged, but optional) 

Implement, and perhaps refine, your evaluation methods and capture your baseline metrics, by running your set of test cases.

Briefly summarize your results (the baseline metrics) and any intuitions that you have about the results.

#### **Question 2** Chunking

**Question 2** Chunking

You know from your research that a critical factor in RAG solutions is how the document corpus is chunked (prior to embedding being created for the chunk, etc). Consider your content and the chunking options that are commonly used in RAG solutions. 

Choose 2 or 3 chunking options/variants that you believe might be better than the baseline option that is configured in this notebook, and which are therefore worth experimenting with.

For each option that you choose, briefly outline why you think that might provide better results for your solution and particular document set.

For consideration: 

https://www.pinecone.io/learn/chunking-strategies/

https://python.langchain.com/docs/how_to/semantic-chunker/

https://blog.langchain.dev/a-chunk-by-any-other-name/

**Question 2 answer**

Playing with the original data, we identified that the main difficulty for retrieval is that the PDFs are summaries of the certification and, to make them easier for readers, they use tables and containers to present the information. In the current form, fixed-length chunks lose the semantics contained in that structure. The chunks can therefore be enriched using this structure, with elements such as: _Titles, Column Headers, Row Headers_.

Some interesting chunking options for this are:
1. Document Specific Splitting.
    - [Unstructured](https://app.unstructured.io/) has a model for tables that we can try.
    - Use Amazon Textract to identify tables and other semantic elements in the text. There is a tutorial created by [AWS](https://github.com/aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation).
2. Structured Chunking. The idea is to use the structure of the document to split it. The PDFs we are using can also be retrieved in HTML format directly from the web, so we can use the HTML to retrieve the titles, headers, etc.
3. Contextual Chunking. [Saad-Falcon et al. (2023)](https://arxiv.org/pdf/2309.08872v1) proposed converting PDFs into HTML to identify the structure and then adding that structure information to the chunk metadata. There is a LangChain-based example of this in [this code](https://github.com/rajib76/langchain_examples/blob/main/examples/how_to_parse_pdf_with_complex_tables.py).
4. Semantic Chunking. The idea is to use an embeddings model to measure the embedding differences between sentences and decide whether they belong in the same chunk or in different chunks. LangChain has a [tutorial](https://python.langchain.com/docs/how_to/semantic-chunker/) that can be used as reference.
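
The semantic chunking idea can be sketched independently of any library: compare the embeddings of consecutive sentences and start a new chunk when their cosine distance exceeds a threshold. The sentences, the 2-D vectors and the 0.5 threshold below are hand-picked assumptions standing in for real text and a real embeddings model, and `semantic_chunks` is a hypothetical helper.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return 1.0 - dot / norm

def semantic_chunks(sentences, vectors, threshold=0.5):
    """Group consecutive sentences; break where neighbouring embeddings diverge."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append([])  # start a new chunk at a topic shift
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "Wi-Fi covers all floors.",
    "Mobile signal is strong.",
    "Air quality is monitored.",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy embeddings
grouped = semantic_chunks(sentences, vectors)
print(grouped)
```

The threshold controls granularity: a lower value splits more aggressively, a higher value produces fewer, longer chunks.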

Additionally, the data should be pre-processed. For instance, '\n' (newline) appears more often than normal because the text was forced to fit a fixed space, so this can be simplified.
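
A minimal sketch of such pre-processing, assuming that single line breaks are layout artifacts while blank lines mark real paragraph boundaries; `normalize_newlines` is a hypothetical helper.

```python
import re

def normalize_newlines(text):
    """Replace single line breaks with spaces; keep blank-line paragraph breaks."""
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # lone newline -> space
    text = re.sub(r"[ \t]+", " ", text)           # collapse repeated spaces
    return text.strip()

raw = "The building is\nequipped with spare\ncapacity.\n\nNext section"
cleaned = normalize_newlines(raw)
print(repr(cleaned))
```

Running this before splitting means the splitter sees paragraph breaks only where the document actually has them.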

#### Task 3

**Task 3** (strongly encouraged, but optional) 

Update the implementation to use one or more of your chunking options.

For the evaluation process, you may need to create a new ground-truth dataset for each chunking option. The good news is that we're only using 10 test cases or so.

Run evaluation methods and capture the revised metrics. 

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.

For consideration: 

https://github.com/aws-samples/amazon-bedrock-claude-2-and-3-with-langchain-popular-use-cases/blob/main/Amazon%20Bedrock%20%26%20Langchain%20Sample%20Solutions.ipynb

#### Question 3: Embeddings

**Question 3** Embeddings

You know that one of the major factors that will impact the accuracy of the retrieval solution is the embeddings model that is used to encode the document corpus and the questions that get asked. 

Your customer has asked you to select two alternative models to the one in the baseline solution, from the set of embeddings models available to you in Amazon Bedrock and from Voyage AI. Your goal is to suggest two that will provide better retrieval performance than the baseline.

Choose two models and describe your reasoning on why each of those models might provide better results. 

As with much of generative AI, the best option will need testing with the content and likely cannot be fully determined from research/experience alone, but your goal is to provide a brief rationale for your recommendation.


**Question 3 Answer**
One great way to benchmark different retrieval models is the [Hugging Face leaderboard](https://huggingface.co/spaces/mteb/leaderboard), where the best embedding model for a given task can be found. The "Retrieval" tab lists these models; the task most similar to ours is _QuoraRetrieval_.
Nevertheless, not all of them are available in Bedrock (eu-west-2). So, from the Cohere models and Voyage AI, we suggest trying **embed-multilingual-v3.0**, which scores 88.92 on the _QuoraRetrieval_ task and, given that the use case may involve different languages, can also cover that. From Voyage AI we suggest **voyage-multilingual-2** due to its multilingual capabilities.

It is worth mentioning that, in addition to embeddings, there is another type of method for transforming the data for later retrieval: _keyword-frequency based methods_ such as TF-IDF and BM25. These methods shine when we look for chunks that match particular words. For our use case we prefer an embeddings model, as we do not need keyword matching apart from ensuring that the chunks come from the particular building report, which we can handle with _metadata filtering_.
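
For comparison, keyword-frequency scoring can be sketched in a few lines. The example below uses one common TF-IDF variant (raw term frequency times a smoothed inverse document frequency); the smoothing formula, the whitespace tokenizer and the sample chunks are assumptions for illustration. BM25 builds on the same idea and adds document-length normalization and term-frequency saturation.

```python
import math
from collections import Counter

def tfidf_scores(query, documents):
    """Score each document by summed TF-IDF of the query terms it contains."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            score += tf[term] * idf
        scores.append(score)
    return scores

# Made-up chunks for illustration.
documents = [
    "wiredscore rates digital connectivity",
    "smartscore rates smart building technology",
    "the building has a rooftop garden",
]
scores = tfidf_scores("smartscore building", documents)
print(scores)
```

The rare term "smartscore" contributes more than the common term "building", which is the behaviour that makes these methods good at exact keyword matches.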

#### Task 4

**Task 4** (strongly encouraged, but optional)

Using your best (perhaps only) chunking strategy, experiment with applying the embeddings models that you recommended.

Run evaluation methods and capture the revised metrics. 

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.

For consideration:

https://python.langchain.com/docs/integrations/text_embedding/voyageai/

https://docs.voyageai.com/docs/embeddings

#### **Question 4** K


The number of matching results, K, fed into your LLM (in the augmented prompt) will have an impact on the cost, latency and quality of the output generated by the solution. In an ideal world, there would be only one chunk, K = 1, and that chunk would contain all the information needed for the user's question and would be right-sized, with little data/text that is not relevant to the question.

A. K is set to 3 for the baseline solution. If you set this to 1, 2, 4, or 5, how will this change your evaluation metrics?

B. If your retrieval metrics are well-designed, there is likely very little change if K is set to 3, 4 or 5. Why is this the case?

C. It was noted (above) that we do not want K to be large. How might we test what the best size for K is? Hint: it goes beyond just testing retrieval metrics.


**Question 4 Answer**

A. In the baseline case, changing K does not change the _Faithfulness_ metric, because the model gives the same answer regardless of K.

B. The _Faithfulness_ metric changes little because the model gives priority to the first retrieved chunks. That is why _Contextual Precision_ is such an important metric: it captures whether the relevant chunks are ranked first.

C. Setting K is a trade-off between giving the LLM more context and feeding it irrelevant chunks. _Faithfulness_ is therefore the most important metric for choosing K, since we need to test whether the final response is correct or not, whatever the value of K.
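
The behaviour described in A and B can be illustrated with a small sweep over K. This is a sketch on a hypothetical evaluation set (the chunk ids and relevance labels are made up); it computes plain recall@K and precision@K rather than the evaluation metrics used in the baseline.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for chunk_id in top_k if chunk_id in relevant_ids) / len(top_k)


# Hypothetical evaluation set: ranked chunk ids returned by the retriever
# and the chunk ids marked as relevant for each question.
eval_set = [
    {"ranked": ["c7", "c2", "c9", "c4", "c1"], "relevant": {"c7", "c2"}},
    {"ranked": ["c3", "c8", "c5", "c6", "c0"], "relevant": {"c8"}},
]

results = {}
for k in range(1, 6):
    recall = sum(recall_at_k(q["ranked"], q["relevant"], k) for q in eval_set) / len(eval_set)
    precision = sum(precision_at_k(q["ranked"], q["relevant"], k) for q in eval_set) / len(eval_set)
    results[k] = (recall, precision)
    print(f"K={k}  recall={recall:.2f}  precision={precision:.2f}")
```

On this toy data, recall reaches 1.0 at K=2 and stays flat for larger K, while precision falls from K=3 onward: the extra chunks add cost and noise without adding relevant context. Retrieval metrics alone therefore cannot pick K; the end-to-end answer quality has to be measured as well.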

#### Task 5

**Task 5** (strongly encouraged, but optional)

Experiment with different values of K (1, 2, 3, 4). 

Evaluate and capture the retrieval metrics for each value of K.

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.


#### **Question 5** Re-ranking


Adding a re-ranking model to our RAG pipeline 1/ helps us provide a better set of document chunks for RAG, and 2/ may allow us to reduce the number of chunks that are used for augmenting the prompt.

Briefly explain how re-ranking helps with points 1 and 2 above.

Given the characteristics of your document content and the RAG pipeline we have here, can you recommend one or two re-ranking models to experiment with? Briefly outline the rationale for your recommendation.

For consideration:

https://blog.voyageai.com/2024/03/15/boosting-your-search-and-rag-with-voyages-rerankers/

https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/

**Question 5 answer**

Rerankers work by processing the raw retrieved text in response to the actual query, using transformer models to analyze the document in the context of the specific query. This method minimizes information loss and enhances relevance, as the evaluation is tailored to the individual query rather than a precomputed, generalized interpretation.

Rerankers are precise, but this precision comes at the cost of speed. Unlike embeddings, where the computational heavy lifting is done ahead of time, rerankers require running a full transformer model computation for each query-document pair at query time. This results in slower response times but significantly higher accuracy and relevance.

[Wang et al. (2024)](https://arxiv.org/pdf/2407.01219) tested different parameters for the RAG solution. For reranking, they recommend
_"monoT5 as a comprehensive method balancing performance and efficiency. RankLLaMA is suitable for achieving the best performance, while TILDEv2 is ideal for the quickest experience on a fixed collection"_. 
Given our case, I would suggest using monoT5 or RankLLaMA, as latency is not a big problem for our client. 
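
The two-stage flow (retrieve K candidates, then score each query-chunk pair and keep the best few) can be sketched as follows. The `overlap_score` function is only a placeholder for a real reranking model such as monoT5; it uses token overlap so that the example runs without any model, and the chunks are made up.

```python
def overlap_score(query, document):
    """Placeholder for a reranking model: Jaccard overlap of query and document tokens."""
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens | d_tokens)


def rerank(query, candidates, score_fn, top_n):
    """Score every (query, candidate) pair at query time and keep the best top_n."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


# Hypothetical first-stage output: K=4 chunks in the order the vector store returned them.
retrieved = [
    "general maintenance schedule for all towers",
    "lift inspection report for the east tower",
    "roof condition report for the north tower",
    "north tower roof repair cost estimate",
]
query = "north tower roof condition"
top_chunks = rerank(query, retrieved, overlap_score, top_n=2)
print(top_chunks)
```

Because the scoring function sees the query and the chunk together, it can promote chunks that the first stage ranked third or fourth, and only the top `top_n` need to be passed to the LLM.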


#### Task 6

**Task 6** (encouraged, but optional)

Add re-ranking to the RAG solution. Using the Voyage AI models will likely be easiest. Experiment with a couple of re-ranking models.

After adding the re-ranker you will likely find that you get the best evaluation results by retrieving 3 or 4 chunks (K) from the FAISS vector database and then taking 2 or 3 of the re-ranked chunks.

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.

For consideration:

https://python.langchain.com/docs/integrations/document_transformers/voyageai-reranker/

#### Task 7

**Task 7** (encouraged, but optional)

The evaluation metrics that you have used thus far are likely not sensitive to the rank at which the positive chunk hits appear in the set of retrieved documents. The order of the chunks, aka rank, is important because 1/ it will likely help the LLM produce a better output, and 2/ if the retrieval system consistently returns chunk results in an optimal order, it allows us to more aggressively prune the retrieval results before giving the chunks to the LLM.

Add to the set of evaluation metrics a metric that rewards correct order. The new metric will produce the highest value when the correct chunk(s) are in the top rank(s), and lower values when they are in non-top ranks. This metric better reflects our retrieval system objectives.

Once you have this, redo Task 6, and see what the optimal configuration is for K and the re-ranking configuration.

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.

For consideration:

https://www.evidentlyai.com/ranking-metrics/evaluating-recommender-systems

https://towardsdatascience.com/ranking-evaluation-metrics-for-recommender-systems-263d0a66ef54
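
Two standard rank-aware metrics that match the description above are reciprocal rank (averaged over questions, this gives MRR) and NDCG. The sketch below assumes binary relevance labels and uses made-up chunk ids; it is one possible starting point, not the required solution.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: discounted gain divided by the gain of the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(ranked_ids[:k], start=1)
        if chunk_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


relevant = {"c2"}
good_order = ["c2", "c5", "c9"]  # relevant chunk ranked first
poor_order = ["c5", "c9", "c2"]  # same chunks, relevant one ranked last

print(reciprocal_rank(good_order, relevant), reciprocal_rank(poor_order, relevant))
print(ndcg_at_k(good_order, relevant, 3), ndcg_at_k(poor_order, relevant, 3))
```

Both orderings contain the same chunks, so a hit-rate style metric would score them identically; the rank-aware metrics score the first ordering higher.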

#### **Question 6** Distance


An aspect of RAG tuning that is easy to experiment with, but is often overlooked, is the distance measure that is used to compare the embedding of the query with the embeddings of the documents.

What is the distance measure/method that is used in the baseline implementation?

Your customer wants to experiment with one or two other distance measures for this pipeline. Which measures are you going to recommend? Provide a brief rationale for your recommendation.

Reference:

https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.faiss.FAISS.html

https://python.langchain.com/docs/integrations/vectorstores/faiss/

https://github.com/facebookresearch/faiss/wiki/MetricType-and-distances

https://www.pinecone.io/learn/series/faiss/faiss-tutorial/

**Question 6 Answer**

Distance metrics are essential for measuring the proximity between data points, enabling the effective identification of similar objects during the retrieval stage. By employing advanced distance metrics, we can enhance the precision of our search results, thereby improving the overall accuracy and reliability of the RAG system.

Unless it is overridden, LangChain's FAISS wrapper defaults to **Euclidean (L2) distance** (`DistanceStrategy.EUCLIDEAN_DISTANCE`), not cosine similarity. The metric can be changed at the initialization stage by setting the **_distance_strategy_** parameter.

The following are the [possible `DistanceStrategy` values](https://api.python.langchain.com/en/latest/utilities/langchain_aws.utilities.utils.DistanceStrategy.html#langchain-aws-utilities-utils-distancestrategy), along with their meanings (as described by [Saikia (2024)](https://medium.com/@parikshitsaikia1619/unlock-rags-potential-with-distance-metrics-and-rerankers-42df4f171f5a)):

1. **COSINE**: Cosine similarity evaluates the cosine of the angle between two vectors originating from the same point. It ranges from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonality. This metric is particularly beneficial for measuring the direction of vectors rather than their magnitude, making it ideal for comparing texts or documents where content relevance is more significant than length.

2. **EUCLIDEAN_DISTANCE**: This measurement calculates the straight-line distance between two points (vectors) in space, determining the square root of the sum of squared differences across corresponding vector elements. It is intuitive and particularly effective for scenarios where the physical distance matters, such as retrieving visually similar images. For example, queries like, _“Give me 10 locations for holidays with activities similar to those in Marveya, Spain,”_ benefit from this metric.

3. **DOT_PRODUCT**: The dot product assesses how much one vector aligns with another, considering both their magnitude and direction. A higher dot product indicates a stronger similarity in both features and alignment. This metric is especially useful for retrieving specific information, such as drug compositions or barcodes. It suits queries like, _“Give me 10 locations for holidays with temperatures between 25 and 30 degrees Celsius in summer.”_

4. **MAX_INNER_PRODUCT**: This metric retrieves the vector with the maximum inner product, highlighting the most similar vectors in terms of both content and context. This method is useful in contexts where we want to retrieve items that are not only similar but also relevant in a more nuanced way, representing a stronger association between the query and the results.

5. **JACCARD**: Jaccard Similarity measures the size of the intersection divided by the size of the union of two sets. It assesses the similarity between two sets (or documents) based solely on the overlap of their elements—typically words or terms. It doesn't consider the context or meaning of the terms, making it best suited for keyword-based searches, where the presence or absence of specific words is more critical than their overall semantic content.
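
The measures listed above can be compared on a tiny example. This is a sketch in plain Python with made-up two-dimensional vectors, chosen so that the measures disagree about which vector is closest to the query.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line (L2) distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine of the angle between the vectors; larger means more similar."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def jaccard(a_tokens, b_tokens):
    """Size of the token intersection divided by the size of the union."""
    a_set, b_set = set(a_tokens), set(b_tokens)
    return len(a_set & b_set) / len(a_set | b_set)


query = [1.0, 0.0]
same_direction_longer = [3.0, 0.0]   # same direction as the query, larger magnitude
nearby_other_direction = [1.0, 1.0]  # closer in space, but 45 degrees away

for name, vec in [("same_direction_longer", same_direction_longer),
                  ("nearby_other_direction", nearby_other_direction)]:
    print(name, cosine(query, vec), dot(query, vec), euclidean(query, vec))

print(jaccard("roof repair".split(), "roof inspection".split()))
```

Cosine and dot product both prefer the vector that points in the same direction as the query, while Euclidean distance prefers the vector that is nearer in space. When all embeddings are normalized to unit length, the three measures produce the same ranking, so switching between them only changes results for unnormalized embeddings.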

**Recommendation**: For our use case, where we aim to retrieve descriptions of buildings—such as, _“How is the setup in Two Towers?”_—I recommend utilizing **Cosine similarity** combined with **Dot Product**. This approach will help us retrieve relevant information specific to the _Two Towers_, while we can further refine our results using metadata filters to exclude unrelated content, such as _Black Tower_.

#### Task 8

**Task 8** (encouraged, but optional)

Experiment with setting the distance method for the vector database comparison to the alternatives that you suggested above.

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.


#### **Question 7** - Wrapping up the Retrieval System


At this point we have completed a full iteration of tuning of the data retrieval aspect of our RAG solution. 

A. Briefly outline two or three insights from this exercise

B. Do you have further suggestions for how the accuracy of the retrieval system could be further tuned?

**Question 7 answer**

Throughout this exercise, we have recognized the critical interplay between various components of our RAG solution. Each aspect, from the initial data collection to the retrieval methods, affects the overall effectiveness of the system. Understanding how changes in one area can ripple through the system has been illuminating.

One of the most significant insights has been the impact of the chunking strategy on our retrieval results. The choice between larger, context-preserving chunks and smaller, precise chunks fundamentally alters the nature of the data being processed. For example, while larger chunks maintain contextual integrity, enabling the model to capture broader meanings, smaller chunks offer granularity which may improve specific queries. This has emphasized the need to carefully consider how we segment our data before presenting it to the foundational model.


As for ways to further tune the accuracy of the retrieval system, I suggest:

1. **Implementing Filters**: As previously mentioned, incorporating filters into the retrieval process is vital. This will ensure that we do not retrieve or display data from other clients or irrelevant buildings. By configuring these filters effectively, we can refine our results to show only the most pertinent information, thereby enhancing user experience and data security.

2. **Source Attribution**: Including the source of the data in the retrieval results would also be beneficial. By providing users with the origin of the information, they can access the original deliverables if they wish to delve deeper into the context. This not only enhances transparency but also builds trust in our solution, as users can verify data and its relevance to their specific needs.

3. **User Feedback Loop**: Establishing a user feedback mechanism can provide ongoing insights into the effectiveness of the retrieval system. By allowing users to rate the relevance and accuracy of retrieved results, we can continuously fine-tune our algorithms and strategies based on real-world interactions. This iterative process can lead to significant long-term improvements in the accuracy and relevance of our searches.
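
The first two suggestions can be sketched together: restrict the search to chunks whose metadata matches the requested building, and return each chunk with its source. The field names (`building`, `source`), file names, and vectors below are hypothetical and only serve the illustration.

```python
def filtered_search(query_vector, chunks, building, k):
    """Keep only chunks whose metadata matches `building`, then rank by dot product."""
    allowed = [c for c in chunks if c["metadata"]["building"] == building]
    ranked = sorted(
        allowed,
        key=lambda c: sum(q * v for q, v in zip(query_vector, c["vector"])),
        reverse=True,
    )
    # Return the text together with its source so the answer can cite it.
    return [(c["text"], c["metadata"]["source"]) for c in ranked[:k]]


# Hypothetical index entries; vectors, file names, and field names are made up.
index = [
    {"text": "Two Towers lobby layout", "vector": [0.9, 0.1],
     "metadata": {"building": "Two Towers", "source": "two_towers_report.pdf"}},
    {"text": "Black Tower lobby layout", "vector": [1.0, 0.0],
     "metadata": {"building": "Black Tower", "source": "black_tower_report.pdf"}},
    {"text": "Two Towers parking capacity", "vector": [0.1, 0.9],
     "metadata": {"building": "Two Towers", "source": "two_towers_report.pdf"}},
]

hits = filtered_search([1.0, 0.0], index, building="Two Towers", k=2)
print(hits)
```

Without the filter, the _Black Tower_ chunk would rank first for this query vector; with it, only _Two Towers_ chunks are returned. In the actual pipeline, the `filter` argument of the LangChain FAISS `similarity_search` method would likely play the same role, provided each chunk is stored with the corresponding metadata.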

By implementing these suggestions, we can further enhance the accuracy and reliability of our RAG solution, ensuring that it meets user expectations and provides valuable insights efficiently.