### Lab2: Data Prep for Retrieval Augmented Generation (RAG)

In this notebook, we will ingest all the travel document PDFs into a vector store for a RAG workflow. Retrieval Augmented Generation (RAG) requires indexing relevant unstructured documents into a vector database. Then, given an end user query, the relevant chunks are retrieved and passed as context to the model, which generates an answer. This can best be described by the following flow.

<img src="./images/rag-workflow.png" />

### Data Ingestion Workflow

In a Retrieval-Augmented Generation (RAG) system, data ingestion is a critical workflow where raw data is prepared and loaded into a vector database (VectorDB) so that it can be efficiently retrieved and used by a language model during query time. The ingestion process involves several key steps, including sourcing the data, chunking it into manageable pieces, embedding it into vector representations, and loading these embeddings into the vector database. Here's an expanded explanation of each step:

1. Data Source
Description: The data source is the origin from which the information is collected. This could be a variety of formats such as documents (PDFs, Word files), databases, APIs, websites (using web scraping), or even multimedia content like images or videos.

2. Chunking
Description: Chunking involves breaking down the data into smaller, manageable pieces or “chunks.” This is essential because embedding entire documents may not be efficient or effective, and smaller chunks allow the system to provide more precise responses based on specific segments of information.

3. Embedding
Description: Embedding is the process of converting text chunks into numerical vector representations using a pre-trained embedding model. These vectors capture the semantic meaning of the chunks, allowing the system to compare and retrieve chunks based on their content rather than exact keyword matches.

4. Loading to VectorDB
Description: The final step in the ingestion workflow is loading the embedded vectors and their metadata into a vector database (VectorDB). This allows for efficient storage and retrieval of the chunks based on their semantic similarity to user queries.
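
The four steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: `chunk_text`, `toy_embed`, and `ingest` are hypothetical helpers, the hashed bag-of-words vector stands in for a real embedding model, a plain list stands in for the vector database, and the sample text is made up.

```python
import hashlib
import math


def chunk_text(text, chunk_size=80, overlap=20):
    """Step 2: split text into fixed-size character windows that overlap."""
    step = chunk_size - overlap
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def toy_embed(text, dim=16):
    """Step 3 (stand-in): hashed bag-of-words vector, L2-normalized.

    A real pipeline would call an embedding model here instead.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def ingest(documents, vector_db):
    """Steps 1-4: read each source, chunk it, embed it, and store it with metadata."""
    for source, text in documents.items():
        for position, chunk in enumerate(chunk_text(text)):
            vector_db.append(
                {
                    "vector": toy_embed(chunk),
                    "text": chunk,
                    "metadata": {"file": source, "chunk": position},
                }
            )
    return vector_db


# Step 1: a made-up in-memory "data source" keyed by file name.
documents = {
    "nyc.txt": (
        "Central Park is a large public park in Manhattan. "
        "The Brooklyn Bridge links Manhattan and Brooklyn across the East River."
    )
}
vector_db = ingest(documents, [])
print(len(vector_db), "chunks stored")
```

Each stored record keeps the vector, the original chunk text, and metadata together, which is what lets a retrieval step map a matching vector back to readable content.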

#### Imports and Set up

In [None]:
from langchain_aws.embeddings import BedrockEmbeddings
from langchain_aws.chat_models import ChatBedrock
from pathlib import Path
from rich import print as rprint
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
)
from langchain.output_parsers import PydanticToolsParser
from io import BytesIO
import pickle
import time
import json

import faiss
import datetime as dt

#### Amazon Bedrock Models

The Amazon Titan Text Embeddings models are part of the Amazon Titan family of foundation models available through Amazon Bedrock, optimized for creating high-quality vector embeddings from textual data. Embedding models convert text into numerical representations (vectors) that capture the semantic meaning of the content. These vectors can then be used for tasks such as information retrieval, clustering, and similarity search, which are critical in applications like search engines, recommendation systems, and Retrieval-Augmented Generation (RAG) workflows. The cell below configures `amazon.titan-embed-text-v1` as the embedding model and Claude 3 Sonnet as the chat model.

In [None]:
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

llm = ChatBedrock(
    model_id=model_id,
    model_kwargs={"max_tokens": 500}
)

bedrock_embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v1" 
)

#### Load the Travel Documents Dataset 

- `PyPDFLoader`: This is a utility in LangChain that reads and extracts text from PDF documents. It can handle complex PDF structures, making it well suited to preprocessing documents for information retrieval, RAG systems, or other document-based AI applications.

In [None]:
docs_path = Path("./data/us/")
doc_files = list(docs_path.glob("*.pdf"))

section_chunks = []

for doc_path in doc_files:
    loader = PyPDFLoader(file_path=doc_path.as_posix())
    #loader.parser = PyPDFOutlineParser()
    sections = loader.load()
    for sec in sections:
        sec.metadata.update({"file": doc_path.name})
    
    section_chunks += sections

The loop above creates a `PyPDFLoader` instance for each PDF file. Calling `loader.load()` returns a list of `Document` objects (one per page by default), and each one is tagged with its source file name in the `file` metadata field before being added to `section_chunks`.

In [None]:
rprint(section_chunks[0])

#### Explore the embedding

In [None]:
try:
    sample_embedding = bedrock_embeddings.embed_query(section_chunks[0].page_content)
    modelId = bedrock_embeddings.model_id
    rprint("Embedding model Id :", modelId)
    rprint("Size of the embedding: ", len(sample_embedding))
    print("Sample embedding of a document chunk: ", sample_embedding[:30])

except ValueError as error:
    if  "AccessDeniedException" in str(error):
        print(f"\x1b[41m{error}\
        \nTo troubleshoot this issue please refer to the following resources.\
        \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/setting-up.html\
        \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\
        \nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html\
              \x1b[0m")
        class StopExecution(ValueError):
            def _render_traceback_(self):
                pass
        raise StopExecution        
    else:
        raise error

#### Ingest Data into FAISS Vector DB

We will utilize a few different techniques when loading the documents that will help improve the retrieval quality.

#### 1. Outline-based splitting
By default LangChain's `PyPDFLoader` will break each document up into pages. We could then potentially use a chunking strategy such as `RecursiveCharacterTextSplitter` to further break down the pages into smaller chunks. 
However, this could lead to suboptimal results if the most relevant information we are looking for is split across multiple pages. 

Now we are ready to ingest the documents into the vector store. This can be done easily using the [LangChain FAISS integration](https://python.langchain.com/docs/integrations/vectorstores/faiss/) which takes in the embeddings model and the documents to create the entire vector store.


In [None]:
vec_store_time_stamp = dt.datetime.now().strftime("%Y%m%d%H%M%S")

docstore = InMemoryDocstore()
index = faiss.IndexFlatL2(len(sample_embedding))
vector_db = FAISS(embedding_function=bedrock_embeddings, 
                  index=index, 
                  index_to_docstore_id={},
                  docstore=docstore, 
                  distance_strategy=DistanceStrategy.COSINE)

#### 2. Parent Document Retriever
After we've loaded the documents as individual sections, we will further split these sections by paragraphs using the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter/). These are the chunks that will be used for embeddings. During retrieval, however, we'll use the [ParentDocumentRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever/) to retrieve the entire section that the chunk belongs to. This is done to ensure that the context provided to the model is as complete as possible.

Next we build the `ParentDocumentRetriever`, combining a FAISS-based vector store and a key-value based `InMemoryStore`. The vector store will be used to find section segments that were generated by splitting with the `RecursiveCharacterTextSplitter`. Each section segment contains a key reference to the full section document, and that key is used to retrieve the entire section text. Note that the `InMemoryStore` is essentially a Python dictionary. In production you would want to use a persistent store such as [DynamoDB](https://aws.amazon.com/dynamodb/).
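
The parent-document idea can be sketched without any library. The following is a simplified illustration, not LangChain's implementation: `build_stores`, `word_overlap`, and `retrieve_parents` are hypothetical names, word overlap stands in for embedding similarity, and the two sample sections are made up.

```python
def build_stores(sections):
    """Index small child chunks, each carrying the key of its full parent section."""
    docstore = {}      # parent_id -> full section text (the key-value store)
    child_index = []   # (child_text, parent_id) pairs (the "vector store")
    for parent_id, section in enumerate(sections):
        docstore[parent_id] = section
        for sentence in section.split(". "):
            child_index.append((sentence, parent_id))
    return docstore, child_index


def word_overlap(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, docstore, child_index, k=1):
    """Search the small chunks, then return the full parent sections they point to."""
    ranked = sorted(child_index, key=lambda item: word_overlap(query, item[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]


sections = [
    "Central Park has a zoo and a boating lake. Walking paths cross the whole park. "
    "It sits in the middle of Manhattan",
    "The Golden Gate Bridge spans the bay. Fog often covers the towers in summer",
]
docstore, child_index = build_stores(sections)
print(retrieve_parents("walking paths", docstore, child_index))
```

The query matches only one short sentence, yet the caller receives the whole section, which is the trade-off this retriever is designed around: precise matching on small chunks, complete context for the model.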

In [None]:
child_splitter = RecursiveCharacterTextSplitter(
    # try paragraph breaks first, then fall back to single newlines
    separators=["\n\n", "\n"], chunk_size=2000, chunk_overlap=250
)

in_memory_store_file = f"section_doc_store_{vec_store_time_stamp}.pkl"
vector_store_file = f"section_vector_store_{vec_store_time_stamp}.pkl"
local_vector_config = "local_config.json"

# if we previously ingested the docs we can reuse the existing index
if Path(local_vector_config).exists():
    with open(local_vector_config) as f:
        config = json.load(f)
    in_memory_store_file = config["in_memory_store_file"]
    vector_store_file = config["vector_store_file"]

    # only unpickle files that you created yourself; pickle can execute arbitrary code
    with open(in_memory_store_file, "rb") as f:
        store = pickle.load(f)
    with open(vector_store_file, "rb") as f:
        vector_db_bytes = pickle.load(f)
    vector_db = FAISS.deserialize_from_bytes(serialized=vector_db_bytes, embeddings=bedrock_embeddings, allow_dangerous_deserialization=True)
    
    retriever = ParentDocumentRetriever(
        vectorstore=vector_db,
        docstore=store,
        child_splitter=child_splitter,
    )

# ingest the document into the index
else:
    store = InMemoryStore()
    
    retriever = ParentDocumentRetriever(
        vectorstore=vector_db,
        docstore=store,
        child_splitter=child_splitter,
    )
    
    retriever.add_documents(section_chunks, ids=None)
    pickle.dump(store, open(in_memory_store_file, "wb"))
    pickle.dump(vector_db.serialize_to_bytes(), open(vector_store_file, "wb"))
    
    with open(local_vector_config, "w") as f:
        json.dump({"in_memory_store_file": in_memory_store_file, "vector_store_file": vector_store_file}, f)

### Content Generation Workflow
Let's first explore the various ways that we can query the vector store on its own. [Semantic search](https://www.elastic.co/what-is/semantic-search) considers the context and intent of a query. Unlike traditional keyword-based search, semantic search uses embeddings that capture the meaning of the text. This allows more relevant results to be returned.
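
As a rough illustration of what a similarity search does under the hood, the sketch below ranks a handful of hand-written vectors by cosine similarity to a query vector. The function names, the three-dimensional vectors, and the sample texts are all made up for illustration; real embeddings have hundreds or thousands of dimensions and come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_similarity_search(query_vec, index, k=3):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up (text, vector) pairs standing in for embedded document chunks.
toy_index = [
    ("Statue of Liberty ferry tickets", [0.9, 0.1, 0.0]),
    ("Best pizza in Brooklyn", [0.2, 0.9, 0.1]),
    ("Museums near Central Park", [0.8, 0.2, 0.1]),
    ("Airport parking rates", [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of an "attractions" query

top = toy_similarity_search(query_vec, toy_index, k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

Note that the ranking depends only on vector direction, not on shared keywords, which is why two differently worded texts can still score as close matches.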

In [None]:
from langchain_core.load import dumps

# Search query
query = "Top attractions in New York City?"

# Search for the 3 most relevant documents
results = vector_db.similarity_search(query, k=3)

rprint(dumps(results, pretty=True))

#### Maximal marginal relevance search (MMR)
If you'd like to look up similar documents but also want the results to be diverse, MMR is a method you should consider. Maximal marginal relevance optimizes for similarity to the query AND diversity among the selected documents. It does this by fetching the candidates whose embeddings have the greatest cosine similarity with the query, and then iteratively selecting from them while penalizing closeness to already selected documents.
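
The selection loop described above can be sketched as follows. This is a minimal illustration of the general MMR idea, not a copy of any library's code: the names `mmr` and `lambda_mult` and the three candidate vectors are assumptions chosen for the example, with `lambda_mult` weighting relevance against redundancy.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=3, lambda_mult=0.5):
    """Return indices of k candidates, balancing relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, float("-inf")
        for i in remaining:
            relevance = cosine_similarity(query_vec, candidates[i])
            # Penalty: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine_similarity(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query_vec = [1.0, 0.0, 0.0]
candidates = [
    [0.9, 0.3, 0.0],   # most relevant
    [0.9, 0.35, 0.0],  # near-duplicate of the first
    [0.7, 0.0, 0.7],   # less relevant but different
]
diverse = mmr(query_vec, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr(query_vec, candidates, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the penalty vanishes and the near-duplicate is picked second. With `lambda_mult=0.5` the near-duplicate is skipped in favor of the less similar third candidate.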

In [None]:
results = vector_db.max_marginal_relevance_search(query, k=3, fetch_k=10)

rprint(dumps(results, pretty=True))

In [None]:
# Search query
query = "Summarize top attractions in athens and barcelona"

# Search for the 3 most relevant documents
results = vector_db.similarity_search(query, k=3)

rprint(dumps(results, pretty=True))

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

template = """Answer the question based only on the following context. 
If the context does not provide sufficient information to answer the question, politely indicate that you are unable to assist. 
Only answer questions related to travel.

<context>
{context}
</context>

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
output_parser = StrOutputParser()

# in the first step we retrieve the context and pass through the input question
setup_and_retrieval = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
    
)

# In the subsequent steps pass the context and question to the prompt, send the prompt to the llm and parse the output as a string
chain = setup_and_retrieval | prompt | llm | output_parser

In [None]:
query = "Summarize top attractions in athens and barcelona"

response = chain.invoke(query)
rprint(response)

### Improve RAG Performance with Advanced methods of retrieval

The main driver of performance for RAG pipelines is the retrieval mechanism. This step involves identifying a subset of documents that are most relevant to the original query. The common baseline is to embed the query in its original form and pull the top-K nearest documents. However, this baseline falls short when queries address multiple topics or, more generally, are phrased in a way that is dissimilar to the documents that should be retrieved. This section looks at ways to improve retrieval for these types of queries.

Given the increased complexity of the tasks in this section, we choose to leverage Claude 3 Sonnet in this part of the pipeline.

#### 1. Decomposition

For more complex queries, it may be helpful to break down the original question into sub-problems, each with its own retrieval step. We perform query decomposition to return either the original question or an equivalent set of questions, each with a single target.

This process is driven by the underlying model. We define the system prompt describing the intended task and supply static few-shot examples to enable the model to better generalize. Removing these examples yields results that are less robust.
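
Once a query has been decomposed, each sub-query gets its own retrieval step and the results are combined into one context. A minimal sketch of that combination step is shown here. The helper `retrieve_for_sub_queries`, the `toy_retrieve` function, and the tiny corpus are hypothetical stand-ins; in the notebook the retrieval call would go to the actual retriever.

```python
def retrieve_for_sub_queries(sub_queries, retrieve):
    """Run one retrieval per sub-query and merge results, dropping duplicates."""
    merged, seen = [], set()
    for sub_query in sub_queries:
        for doc in retrieve(sub_query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Made-up corpus: documents keyed by the city they mention.
toy_corpus = {
    "athens": ["Acropolis guide", "Europe rail passes"],
    "barcelona": ["Sagrada Familia guide", "Europe rail passes"],
}


def toy_retrieve(query):
    """Stand-in retriever: return documents for every city named in the query."""
    results = []
    for city, docs in toy_corpus.items():
        if city in query.lower():
            results.extend(docs)
    return results


sub_queries = [
    "What are the top attractions in Athens?",
    "What are the top attractions in Barcelona?",
]
context = retrieve_for_sub_queries(sub_queries, toy_retrieve)
print(context)
```

Deduplicating while preserving order keeps shared documents from appearing twice in the prompt and leaves the results for the first sub-query at the front of the context.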

In [None]:
decomp_system_prompt = """You are an expert travel assistant that prepares queries to be sent to a search component.
These queries may be very complex, involving multiple destinations, activities, or conditions. Your job is to simplify complex travel-related queries into multiple queries that can be answered in isolation from each other.

If the query is simple, keep it as it is.

If there are acronyms or words you are not familiar with, do not try to rephrase them.

Here are some examples of how to respond in a standard interaction:

<example>
- Query: Compare the best beach destinations in Hawaii and the Bahamas for a family vacation.
Decomposed Questions: [SubQuery(sub_query='What are the best beach destinations in Hawaii for a family vacation?'), SubQuery(sub_query='What are the best beach destinations in the Bahamas for a family vacation?')]
</example>

<example>
- Query: Find the best time to visit Paris and what activities are recommended during that time.
Decomposed Questions: [SubQuery(sub_query='What is the best time to visit Paris?'), SubQuery(sub_query='What activities are recommended in Paris during the best time to visit?')]
</example>

<example>
- Query: Suggest hiking spots in Colorado and nearby accommodations.
Decomposed Questions: [SubQuery(sub_query='What are the best hiking spots in Colorado?'), SubQuery(sub_query='What are the nearby accommodations available for hiking spots in Colorado?')]
</example>

<example>
- Query: What is the capital of Australia?
Decomposed Questions: [SubQuery(sub_query='What is the capital of Australia?')]
</example>"""

To ensure a consistent format is returned for subsequent steps, we use Pydantic, a data-validation library. We rely on a Pydantic-based helper function for doing the tool config translation for us in a way that ensures we avoid potential mistakes when defining our tool config schema in a JSON dictionary.

We define `SubQuery` to be a query corresponding to a subset of the points of a larger parent query. 

In [None]:
class SubQuery(BaseModel):
    """You have performed query decomposition to generate a subquery of a question"""

    sub_query: str = Field(description="A unique subquery of the original question.")

We define the prompt template leveraging the previously defined system prompt. We then expose `SubQuery` as a tool the model can leverage. This enables the model to format one or more requests to this tool.

In [None]:
query_decomposition_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", decomp_system_prompt),
        ("human", "Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)

llm_with_tools = llm.bind_tools([SubQuery])
decomp_query_analyzer = query_decomposition_prompt | llm_with_tools | PydanticToolsParser(tools=[SubQuery])

In [None]:
queries = decomp_query_analyzer.invoke({"question": "Summarize top attractions in athens and barcelona"})
queries

#### 2. Expansion

Query expansion is similar to decomposition in that it produces multiple queries as a strategy to improve the odds of hitting a relevant result. However, expansion returns multiple different wordings of the original query.  

We define the system prompt to consistently return at least 3 versions of the original query.

In [None]:
paraphrase_system_prompt = """You are an expert at converting user questions into database queries. 
You have access to a database of travel destinations and a list of recent destinations for travelers. 

Perform query expansion. If there are multiple common ways of phrasing a user question 
or common synonyms for key words in the question, make sure to return multiple versions 
of the query with the different phrasings.

If there are acronyms or words you are not familiar with, do not try to rephrase them.

Always return at least 3 versions of the question."""

We define `ParaphrasedQuery` to be a single paraphrasing of the original question.

In [None]:
class ParaphrasedQuery(BaseModel):
    """You have performed query expansion to generate a paraphrasing of a question."""

    paraphrased_query: str = Field(description="A unique paraphrasing of the original question.")

We define the prompt template leveraging the previously defined system prompt. We then expose `ParaphrasedQuery` as a tool the model can leverage. This enables the model to format one or more requests to this tool.

In [None]:
query_expansion_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", paraphrase_system_prompt),
        ("human", "Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)
llm_with_tools = llm.bind_tools([ParaphrasedQuery])
query_expansion = query_expansion_prompt | llm_with_tools | PydanticToolsParser(tools=[ParaphrasedQuery])

Now no matter the nature of the query, the model generates alternatives that can be sent for retrieval in parallel.
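
One way to send the alternatives for retrieval in parallel is a thread pool, sketched below with only the standard library. This is an illustration under assumptions rather than the notebook's method: `retrieve_in_parallel`, `toy_retrieve`, and the keyword index are hypothetical, and the paraphrases are made-up examples of what the expansion step might return.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve_in_parallel(queries, retrieve, max_workers=4):
    """Run one retrieval per query concurrently, then merge without duplicates."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input queries.
        result_lists = list(pool.map(retrieve, queries))
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in merged:
                merged.append(doc)
    return merged


# Made-up keyword index standing in for the vector store.
toy_keyword_index = {
    "attractions": ["Acropolis guide"],
    "sights": ["Acropolis guide", "Plaka walking tour"],
    "things to do": ["Athens food markets"],
}


def toy_retrieve(query):
    """Stand-in retriever: return documents whose keyword appears in the query."""
    results = []
    for keyword, docs in toy_keyword_index.items():
        if keyword in query.lower():
            results.extend(docs)
    return results


paraphrases = [
    "Top attractions in Athens",
    "Must-see sights in Athens",
    "Things to do in Athens",
]
docs = retrieve_in_parallel(paraphrases, toy_retrieve)
print(docs)
```

Threads suit this step because each retrieval call mostly waits on network I/O. Each wording can surface documents the others miss, so the merged list is broader than any single query's results.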

In [None]:
query_expansion.invoke({"question": "Summarize top attractions in athens and barcelona"})

### Key Takeaways

In Retrieval-Augmented Generation (RAG), query analysis through decomposition and expansion enhances performance by refining search accuracy and retrieval quality:

- Query Decomposition: Breaks complex queries into manageable sub-queries, helping retrieve more precise, relevant information for each part of the question.
- Query Expansion: Enriches the query with synonyms or related terms, increasing the likelihood of retrieving contextually relevant documents.

Using these query analysis techniques improves RAG's relevance, context accuracy, and overall response quality.

### Conclusion

We will reuse the FAISS vector DB built in this lab when integrating with Agents in the upcoming labs. You have successfully completed the RAG lab; please proceed to the next lab.