# Complete Document Retrieval and RAG

## Overview
- **Retrieval Pipeline** Because customers can enter virtually any query into the solution, it is helpful to detect intent and normalize the query. Few-shot examples are a useful tool to tailor the normalization to the nature of the query in-line.
- **Advanced methods** For more complex cases, it can be beneficial to decompose the query into sub-queries, or to generate alternative queries and hypothetical documents, in order to improve semantic similarity with the indexed content.
- **Model answer generation** Once the model is shown a set of documents, it must generate an answer while staying as closely aligned to the contents of the documents as possible. We cover self-verification and citation as methods giving greater flexibility to the model for a given query and set of retrieved documents.

## Context

Retrieval Augmented Generation (RAG) requires indexing relevant unstructured documents into a vector database. Then, given a customer query, the relevant documents are retrieved and passed as context to the model, which generates an answer. This can best be described by the following flow.

<img src="./assets/rag-architecture.png" />

Once our documents (PDFs, CSV, Tables, JSON, ...) have been indexed into our knowledge base, we start working towards retrieval of a relevant subset of documents based on a given query. For many applications, the success of the retrieval is a strong indicator for the performance of the overall response. This notebook assumes you are familiar with the basics of RAG, embedding models and vector databases.

In this notebook, we seek to go beyond RAG to generate the model answer by applying other relevant steps in the answer pipeline.

<h2>Prerequisites</h2>

Before you can use Amazon Bedrock, you must carry out the following steps:

- Sign up for an AWS account (if you don't already have one) and create an IAM role with the necessary permissions for Amazon Bedrock, see [AWS Account and IAM Role](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html#new-to-aws){:target="_blank"}.
- Request access to the foundation models (FM) that you want to use, see [Request access to FMs](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html#getting-started-model-access){:target="_blank"}. 
    
<h2>Setup</h2>

We import the relevant objects used in this notebook.

In [7]:
import boto3
import faiss
import re
from operator import itemgetter
from typing import List
from langchain_aws.chat_models.bedrock import ChatBedrock
from langchain_aws import BedrockEmbeddings
from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
)
from IPython.display import display_markdown, Markdown
from langchain_community.docstore import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from pydantic import BaseModel, Field
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain.output_parsers import PydanticToolsParser
from langchain_community.retrievers import WikipediaRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import AIMessage
from langchain_core.documents import Document
from langchain_core.runnables import (
    RunnableLambda,
    RunnableParallel,
    RunnablePassthrough,
    RunnableBranch,
)

Although this example leverages Nova Pro & Nova Lite, Bedrock supports many other models. The full list of models and supported features can be found [here](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference.html). The models are invoked via `bedrock-runtime`.

In [4]:
region = 'us-west-2'
bedrock = boto3.client(
    service_name = 'bedrock-runtime',
    region_name = region,
)

We use `ChatBedrock` and `BedrockEmbeddings` to interact with the Bedrock API. We enable `beta_use_converse_api` to use the Converse API.

In [91]:
modelId = "us.amazon.nova-lite-v1:0"
nova = ChatBedrock(
    model_id=modelId,
    client=bedrock,
    beta_use_converse_api=True
)
embeddingId = "amazon.titan-embed-text-v1"
embeddings = BedrockEmbeddings(
    model_id=embeddingId,
    client=bedrock)

We correctly get a generic answer message from the model.

In [92]:
display_markdown(Markdown(nova.invoke("Help me with my travel needs today.").content))

Of course! I'd be happy to help you with your travel needs today. Here are some common areas where you might need assistance:

1. **Itinerary Planning**:
   - **Destination**: Where are you traveling to?
   - **Duration**: How long will you be there?
   - **Activities**: What do you want to see and do?

2. **Transportation**:
   - **Getting to the Airport/Station**: Do you need information on public transport, ride-sharing services, or a taxi?
   - **Flights/Trains**: Have you booked your tickets? If not, I can help you find options.
   - **Local Transport**: Do you need advice on how to get around the city (subway, buses, taxis, bike rentals)?

3. **Accommodation**:
   - **Hotels/Airbnb**: Have you booked a place to stay? If not, I can help you find options based on your budget and preferences.
   - **Check-in/Check-out Times**: Do you need information on the procedures?

4. **Packing**:
   - **Weather**: What’s the weather like at your destination? Any special clothing or gear needed?
   - **Essentials**: List of items you shouldn’t forget (passport, tickets, chargers, etc.).

5. **Local Information**:
   - **Language**: Common phrases in the local language.
   - **Currency**: Information on the local currency and exchange rates.
   - **Emergency Numbers**: Local emergency contact numbers.

6. **Health and Safety**:
   - **Vaccinations**: Are there any required or recommended vaccinations for your destination?
   - **Travel Insurance**: Do you have travel insurance? If not, I can suggest some providers.
   - **Safety Tips**: General safety tips for your destination.

7. **Dining**:
   - **Restaurants**: Recommendations for local cuisine or specific types of food you enjoy.
   - **Dietary Restrictions**: Any dietary restrictions or preferences?

8. **Cultural Etiquette**:
   - **Customs**: Any local customs or etiquette you should be aware of?

Feel free to provide more details about your travel plans, and I can give you more specific advice!

## Reformatting the initial query

### Intent Detection

To limit the scope of answers handled by the solution with RAG, a common first step in the answer pipeline is **Intent Detection or Classification**. This step ensures the question is relevant to the indexed content, which limits the model's tendency to answer questions that the application developers have not accounted for or tested.

When requesting some information that is irrelevant to the previously stated purpose, we quickly see the model attempting to provide an answer.

In [93]:
display_markdown(Markdown(nova.invoke("I want to learn more about my mom's pie recipe").content))

Certainly! Pie recipes can vary widely depending on the type of pie your mom makes. Here are some common types of pies and their general ingredients and methods. If you provide more specifics about your mom's recipe, I can give more tailored advice.

### 1. **Apple Pie**
#### Ingredients:
- **Crust:**
  - 2 1/2 cups all-purpose flour
  - 1 teaspoon salt
  - 1 teaspoon sugar
  - 1 cup (2 sticks) unsalted butter, chilled and cut into small pieces
  - 6-8 tablespoons ice water

- **Filling:**
  - 6-8 medium apples (Granny Smith, Honeycrisp, or a mix)
  - 3/4 cup granulated sugar
  - 1/4 cup brown sugar
  - 1 teaspoon ground cinnamon
  - 1/4 teaspoon ground nutmeg
  - 1/4 teaspoon salt
  - 2 tablespoons all-purpose flour
  - 1 tablespoon lemon juice
  - 1 teaspoon vanilla extract
  - 1 egg, beaten (for egg wash)

#### Method:
1. **Crust:**
   - Mix flour, salt, and sugar in a bowl.
   - Cut in the butter until it resembles coarse crumbs.
   - Gradually add ice water until the dough comes together.
   - Divide dough in half, shape into disks, wrap in plastic, and chill for at least 1 hour.

2. **Filling:**
   - Peel, core, and slice apples.
   - In a large bowl, combine apples, sugars, spices, salt, flour, lemon juice, and vanilla.
   - Toss until apples are well-coated.

3. **Assembly:**
   - Roll out one disk of dough to fit a 9-inch pie dish.
   - Place in the dish and trim edges.
   - Add apple filling and dot with butter.
   - Roll out the second disk of dough, place over the filling, and crimp edges.
   - Brush with beaten egg and cut slits for steam.

4. **Baking:**
   - Bake at 425°F (220°C) for 20 minutes, then reduce heat to 375°F (190°C) and bake for 40-50 minutes until crust is golden and filling is bubbly.

### 2. **Cherry Pie**
#### Ingredients:
- **Crust:**
  - Same as apple pie crust

- **Filling:**
  - 3 cups fresh or frozen pitted cherries
  - 3/4 cup granulated sugar
  - 1/4 cup brown sugar
  - 1/4 cup all-purpose flour
  - 1 tablespoon lemon juice
  - 1 teaspoon vanilla extract
  - 1/2 teaspoon salt
  - 1/2 teaspoon ground cinnamon
  - 1/4 teaspoon ground nutmeg

#### Method:
1. **Crust:**
   - Same as apple pie crust

2. **Filling:**
   - In a large bowl, combine cherries, sugars, flour, lemon juice, vanilla, salt, cinnamon, and nutmeg.
   - Toss until cherries are well-coated.

3. **Assembly:**
   - Roll out one disk of dough to fit a 9-inch pie dish.
   - Place in the dish and trim edges.
   - Add cherry filling.

4. **Top Crust and Baking:**
   - Roll out the second disk of dough, place over the filling, and crimp edges.
   - Cut slits for steam and brush with beaten egg.
   - Bake at 425°F (220°C) for 20 minutes, then reduce heat to 375°F (190°C) and bake for 40-50 minutes until crust is golden and filling is bubbly.

### 3. **Pumpkin Pie**
#### Ingredients:
- **Crust:**
  - Same as apple pie crust

- **Filling:**
  - 2 cups pumpkin puree (fresh or canned)
  - 3/4 cup granulated sugar
  - 1 teaspoon salt
  - 1 teaspoon ground cinnamon
  - 1/2 teaspoon ground ginger
  - 1/4 teaspoon ground cloves
  - 2 large eggs
  - 1 can (12 oz) evaporated milk

#### Method:
1. **Crust:**
   - Same as apple pie crust

2. **Filling:**
   - In a large bowl, combine pumpkin, sugars, salt, cinnamon, ginger, and cloves.
   - Whisk in eggs, then stir in evaporated milk.

3. **Assembly:**
   - Roll out one disk of dough to fit a 9-inch pie dish.
   - Place in the dish and trim edges.
   - Pour filling into the crust.

4. **Baking:**
   - Bake at 425°F (220°C) for 15 minutes, then reduce heat to 350°F (175°C) and bake for 45-55 minutes until a knife inserted near the center comes out clean.

### General Tips:
- **Chilling the Dough:** Always chill the pie dough before rolling it out to make it easier to handle and prevent shrinking.
- **Blind Baking:** For a really crisp bottom crust, you can blind bake the crust before adding the filling. This means baking the crust alone for a few minutes before adding the filling.
- **Flavor Variations:** Feel free to experiment with different spices, fruits, or even add a bit of bourbon or rum to your pie for unique flavors.

If you can provide more details about your mom’s specific recipe, I can give you more precise guidance!

Hence, we provide an initial system prompt defining the model's role as an intent classifier. We supply the classes and few-shots to improve performance and ensure the model is aligned to the desired intended output, which needs to include `<intention></intention>` tags.

In [94]:
intent_system_prompt = """You are a precise classifier. Your task is to assess customer intent and categorize customer inquiry into one of the intentions. 

Intentions with their description:
vacation: Information on vacations, various travel destinations and my recent travels.
contact: Expressing the desire to talk to support.
irrelevant: Not related to vacations and travel.

Here is an example of how to respond in a standard interaction:
<example>
    Human: I am seeking a place that is sunny and family friendly.
    AI: <intention>vacation</intention>
</example>
<example>
    Human: I want to learn more about my mom's pie recipe
    AI: <intention>irrelevant</intention>
</example>
<example>
    Human: I want to talk to someone.
    AI: <intention>contact</intention>
</example>

Think about your answer first before you respond. Think step-by-step and insert the classification in <intention></intention> tags and do not include anything after."""

We supply the prompt as part of `ChatPromptTemplate` and use the pipe operator to define a chain connecting the model to the resulting prompt.

In [95]:
intent_detection_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", intent_system_prompt),
        ("human", "Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)
intent_detection_chain = intent_detection_prompt | nova

We invoke the model with the same query and notice the classification result. We invite you to try additional questions.

In [96]:
display_markdown(Markdown(intent_detection_chain.invoke("Tell me about my mother's pie recipe").content))

<intention>irrelevant</intention>

Since we expect the answer to always contain these tags, we can parse it and branch off depending on the model's classification. 
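As a minimal sketch of that parsing and branching step, the snippet below extracts the tagged intent with a regular expression and maps it to a downstream action. The handler names (`run_rag_pipeline`, `hand_off_to_support`, `send_polite_refusal`) and the choice to fall back to `irrelevant` when no valid tag is found are assumptions made for illustration, not part of the notebook.

```python
import re

INTENT_PATTERN = re.compile(r"<intention>\s*(\w+)\s*</intention>")
KNOWN_INTENTS = {"vacation", "contact", "irrelevant"}


def parse_intent(model_output: str, default: str = "irrelevant") -> str:
    """Return the last tagged intent in the model output, or `default`."""
    matches = INTENT_PATTERN.findall(model_output)
    if not matches:
        return default
    # The prompt asks the model to reason first, so the final tag is the answer.
    intent = matches[-1].lower()
    return intent if intent in KNOWN_INTENTS else default


def route(model_output: str) -> str:
    """Map the parsed intent to a (hypothetical) downstream step name."""
    handlers = {
        "vacation": "run_rag_pipeline",
        "contact": "hand_off_to_support",
        "irrelevant": "send_polite_refusal",
    }
    return handlers[parse_intent(model_output)]


print(route("<intention>irrelevant</intention>"))
```

Defaulting to the most restrictive class keeps unparseable outputs away from the retrieval step; a different fallback may suit other applications.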

### Dynamic few-shots

Although static few-shots are helpful, they have two major limitations. On the one hand, they do not cover the breadth of necessary examples, and on the other, given that any submitted query is rarely relevant to all supplied examples, they often introduce unnecessary tokens and noise to the prompt. In contrast, supplying dynamic few-shots from a larger corpus of examples enables us to select a number of the most relevant examples prior to inference. These are determined by the nature of the query. Although we apply it to intent classification, dynamic few-shots can be applied anywhere in the RAG pipeline and generally yield stronger results compared to static examples.

We bootstrap `few_shot_library` using examples distilled by **Amazon Nova Pro**. It is important to continuously iterate on the library after the initial deployment. During this phase, it is a general best practice to collect and label real interactions where the model made mistakes and append those to the set of examples.

In [97]:
few_shot_library = [
    {
        "question": "Can you recommend some tropical beach destinations?",
        "class": "vacation"
    },
    {
        "question": "I need to speak with a customer service representative.",
        "class": "contact"
    },
    {
        "question": "What's the best way to cook spaghetti?",
        "class": "irrelevant"
    },
    {
        "question": "Are there any family-friendly resorts in Florida?",
        "class": "vacation"
    },
    {
        "question": "How do I file a complaint about my recent stay?",
        "class": "contact"
    },
    {
        "question": "What's the weather like in Paris in June?",
        "class": "vacation"
    },
    {
        "question": "Can you help me with my car insurance claim?",
        "class": "irrelevant"
    },
    {
        "question": "I'd like to book an all-inclusive Caribbean cruise.",
        "class": "vacation"
    },
    {
        "question": "Is there a phone number for your reservations team?",
        "class": "contact"
    },
    {
        "question": "What's the best way to learn a new language?",
        "class": "irrelevant"
    },
    {
        "question": "Are there any good hiking trails in Yellowstone?",
        "class": "vacation"
    },
    {
        "question": "I need to update my billing information.",
        "class": "contact"
    },
    {
        "question": "How do I make homemade bread?",
        "class": "irrelevant"
    },
    {
        "question": "What are some popular tourist attractions in Rome?",
        "class": "vacation"
    },
    {
        "question": "Can I speak with a manager about my recent experience?",
        "class": "contact"
    },
    {
        "question": "What's the best time to visit Japan?",
        "class": "vacation"
    },
    {
        "question": "How do I reset my Netflix password?",
        "class": "irrelevant"
    },
    {
        "question": "Are there any good ski resorts in Colorado?",
        "class": "vacation"
    },
    {
        "question": "I need help with my online booking.",
        "class": "contact"
    },
    {
        "question": "What's the plot of the latest Marvel movie?",
        "class": "irrelevant"
    },
    {
        "question": "Can you suggest some budget-friendly European cities?",
        "class": "vacation"
    },
    {
        "question": "How do I request a refund for my canceled trip?",
        "class": "contact"
    },
    {
        "question": "What's the best way to train a puppy?",
        "class": "irrelevant"
    },
    {
        "question": "Are there any good wildlife safaris in Africa?",
        "class": "vacation"
    },
    {
        "question": "I need to change my flight reservation.",
        "class": "contact"
    },
    {
        "question": "What are some must-see landmarks in New York City?",
        "class": "vacation"
    },
    {
        "question": "How do I fix a leaky faucet?",
        "class": "irrelevant"
    },
    {
        "question": "Can you recommend some romantic getaways for couples?",
        "class": "vacation"
    },
    {
        "question": "I have a question about my loyalty points balance.",
        "class": "contact"
    },
    {
        "question": "What's the best way to prepare for a job interview?",
        "class": "irrelevant"
    },
    {
        "question": "Tell me about my travel history",
        "class": "vacation"
    },
    
]

In this notebook, we use FAISS (Facebook AI Similarity Search) [(github)](https://github.com/facebookresearch/faiss), which is an open-source library developed by Facebook AI Research for efficient similarity search and clustering of dense vector embeddings. We use LangChain's `FAISS` object to interact with the in-memory vector store.

We embed the examples using the Titan Embedding model.

In [98]:
embedding_size = 1536  # output dimension of amazon.titan-embed-text-v1
index = faiss.IndexFlatL2(embedding_size)  # exact L2 (Euclidean) index
# Pass the Embeddings object itself; passing a bare function such as
# embeddings.embed_query triggers a deprecation warning.
vectorstore = FAISS(embeddings, index, InMemoryDocstore({}), {})


We use `SemanticSimilarityExampleSelector` to dynamically select the `k` most relevant examples based on our query. When instantiated, this object embeds the set of examples into our vector store of choice. `FewShotChatMessagePromptTemplate` defines the formatting of the selected examples into a given prompt. We define the template to be consistent with what will be generated by the model during intent classification.

In [99]:
example_selector = SemanticSimilarityExampleSelector.from_examples(
    few_shot_library,
    embeddings,
    vectorstore,
    k=5,
)

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_selector=example_selector,
    example_prompt=(
        HumanMessagePromptTemplate.from_template("{question}")
        + AIMessagePromptTemplate.from_template("<intention>{class}</intention>")
    ),
    input_variables=["question"],
)

We print the relevant examples for a given query. Notice that the distribution of labels will change based on the nature of the query. This helps further align the model with our expectations.

In [100]:
display_markdown(Markdown(few_shot_prompt.format(question="tell me about my travels")))

Human: Tell me about my travel history
AI: <intention>vacation</intention>
Human: I'd like to book an all-inclusive Caribbean cruise.
AI: <intention>vacation</intention>
Human: Can you suggest some budget-friendly European cities?
AI: <intention>vacation</intention>
Human: Can I speak with a manager about my recent experience?
AI: <intention>contact</intention>
Human: How do I request a refund for my canceled trip?
AI: <intention>contact</intention>
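To make the selection step more concrete, here is a rough, dependency-free sketch of choosing the `k` most similar examples for a query. The `toy_embed` function is a hypothetical stand-in (plain word counts) for a real embedding model such as Titan, and cosine similarity is used for simplicity, whereas the index above uses L2 distance.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query: str, library: list, k: int = 2) -> list:
    """Return the k library examples most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(
        library,
        key=lambda ex: cosine(q, toy_embed(ex["question"])),
        reverse=True,
    )
    return ranked[:k]


library = [
    {"question": "Tell me about my travel history", "class": "vacation"},
    {"question": "How do I make homemade bread?", "class": "irrelevant"},
    {"question": "I need to change my flight reservation.", "class": "contact"},
]

for ex in select_examples("tell me about my travels", library, k=2):
    print(ex["question"], "->", ex["class"])
```

A real embedding model captures meaning rather than word overlap, so the ranking it produces will generally differ from this toy version.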

We redefine the system prompt to accommodate the dynamic few-shots.

In [101]:
few_shot_intent_system_prompt = """You are a precise classifier. Your task is to assess customer intent and categorize customer inquiry into one of the intentions. 

Intentions with their description:
vacation: Information on vacations, various travel destinations and my recent travels.
contact: Expressing the desire to talk to support.
irrelevant: Not related to vacations and travel.

Here is an example of how to respond in a standard interaction:
"""

We redefine the prompt template to accommodate the dynamic few-shots. As expected, the final string created from `few_shot_intent_detection_prompt` will change based on message similarity to previous examples.

In [102]:
few_shot_intent_detection_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", few_shot_intent_system_prompt),
        few_shot_prompt,
        ("human", "Think step-by-step and always ensure you insert the classification in <intention></intention> tags and do not include anything after.\
        Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)
few_shot_intent_chain = few_shot_intent_detection_prompt | nova  # use the few-shot prompt, not the static one

We test the newly created chain.

In [103]:
display_markdown(Markdown(few_shot_intent_chain.invoke({"question": "tell me about my travel history"}).content))

<intention>vacation</intention>

### Normalizing the user message

We may want to restrict the queries that are sent to downstream inference without restricting the user experience. Normalizing messages enables us to do exactly this. It can often be used to set a certain tone, reduce length and extract the specific purpose of the message while reducing unnecessary noise. Notice the role the rule book plays in determining the nature of the returned message.

Alternatively, it is common to supply few-shot examples as we have done in the previous step. We again return the resulting message in between tags.

In [104]:
norm_system_prompt = """You are a precise message synthesizer. Your task is to write a condensed message encompassing the latest original message's intent and main keywords. 
The condensed message must follow the rule book.

Rule book:
- Must be a complete sentence formulated as a request from the perspective of the original requester.
- No longer than 2 short sentences with no concatenation.
- Never include names.
- It is safe to reformulate questions with only keywords as looking for information on the place they mention.

Think about your answer first before you respond. Think step-by-step and insert the condensed message in <condensed_message></condensed_message> tags and do not include anything after."""

We define the prompt template incorporating the system prompt with the user defined message. 

In [105]:
norm_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", norm_system_prompt),
        ("human", "Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)
norm_chain = norm_prompt | nova

When executing the chain on a longer query, the returned message pulls out only the information necessary to the task at hand.

In [26]:
display_markdown(Markdown(norm_chain.invoke({"question": """I have been all around the world seing a bunch of stuff. 
I met a bunch of people like Bernard and Tamy. Tell me about my travel history"""}).content))

<condensed_message>Requesting information on travel history and people met.</condensed_message>

When executing the chain on a query that only has keywords, the model fills in the gap to provide additional context. Although the initial queries are quite different, notice that their resulting output is quite similar.

In [29]:
display_markdown(Markdown(norm_chain.invoke({"question": """New York"""}).content))

<condensed_message>Looking for information on New York.</condensed_message>

Once we have detected the message's intent and normalized it to some extent, we are able to have much greater assurance as to the nature of the messages sent to subsequent steps, namely the retrieval.
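Before the normalized message is handed to retrieval, the tags need to be stripped. The helper below is a small sketch of that step; falling back to the original query when no usable tag is found, and tolerating a closing tag written with a space instead of an underscore, are assumptions about how one might harden the step rather than behavior shown in the notebook.

```python
import re

# Accept "</condensed_message>" as well as a mangled "</condensed message>".
CONDENSED_PATTERN = re.compile(
    r"<condensed_message>(.*?)</condensed[_ ]message>", re.DOTALL
)


def extract_condensed(model_output: str, original_query: str) -> str:
    """Return the condensed message, or the original query if none is found."""
    match = CONDENSED_PATTERN.search(model_output)
    if match and match.group(1).strip():
        return match.group(1).strip()
    return original_query.strip()


output = "<condensed_message>Looking for information on New York.</condensed_message>"
print(extract_condensed(output, "New York"))
```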

## Advanced methods of retrieval

The main driver of performance for RAG pipelines is the retrieval mechanism. This step involves identifying a subset of documents that are most relevant to the original query. The common baseline is generally to embed the query in its original form and pull the top-K nearest documents. However, for some datasets this begins to fall short in cases where queries address multiple topics or, more generally, are phrased in a way that is incompatible or is dissimilar to the documents that should be retrieved. We look at how it is possible to improve on these types of queries. 

Given the increased complexity of the tasks in this section, we choose to leverage Amazon Nova Pro in this part of the pipeline.

In [44]:
modelId = "us.amazon.nova-pro-v1:0"
nova_pro = ChatBedrock(
    model_id=modelId,
    client=bedrock,
    beta_use_converse_api=True
)

### Decomposition

For more complex queries, it may be helpful to break down the original question into sub-problems, each having its own retrieval step. We perform query decomposition to return the original question or an equivalent set of questions, each with a single target.

This process is driven by the underlying model. We define the system prompt describing the intended task and supply static few-shot examples to enable the model to better generalize. Removing these examples yields results that are less robust.

In [45]:
decomp_system_prompt = """You are an expert assistant that prepares queries that will be sent to a search component. 
These queries may be very complex. Your job is to simplify complex queries into multiple queries that can be answered in isolation from each other.

If the query is simple, then keep it as it is.

If there are acronyms or words you are not familiar with, do not try to rephrase them.
Here is an example of how to respond in a standard interaction:
<example>
- Query: Did Meta or Nvidia make more money last year?
Decomposed Questions: [SubQuery(sub_query='How much profit did Meta make last year?'), SubQuery(sub_query='How much profit did Nvidia make last year?')]
</example>
<example>
- Query: What is the capital of France?
Decomposed Questions: [SubQuery(sub_query='What is the capital of France?')]
</example>"""

To ensure a consistent format is returned for subsequent steps, we use Pydantic, a data-validation library. Rather than hand-writing the tool config schema as a JSON dictionary, we rely on a Pydantic-based helper that generates it from a class definition, which helps avoid mistakes in the schema.

We define `SubQuery` to be a query corresponding to a subset of the points of a larger parent query. 

In [46]:
class SubQuery(BaseModel):
    """You have performed query decomposition to generate a subquery of a question"""

    sub_query: str = Field(description="A unique subquery of the original question.")

We define the prompt template leveraging the previously defined system prompt. We then expose `SubQuery` as a tool the model can leverage. This enables the model to format one or more requests to this tool.

In [47]:
query_decomposition_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", decomp_system_prompt),
        ("human", "Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)

llm_with_tools = nova_pro.bind_tools([SubQuery])
decomp_query_analyzer = query_decomposition_prompt | llm_with_tools | PydanticToolsParser(tools=[SubQuery])

When asking a broad question about multiple destinations, the model chooses to return multiple calls to `SubQuery`. Each can be sent for document retrieval in parallel, ensuring we do not incur additional latency beyond that of the model inference.

In [50]:
queries = decomp_query_analyzer.invoke({"question": "How do go on vacation in thailand and in California?"})
queries

[SubQuery(sub_query='How do you go on vacation in Thailand?'),
 SubQuery(sub_query='How do you go on vacation in California?')]
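The parallel retrieval mentioned above could look roughly like the sketch below, which runs one retrieval per sub-query in a thread pool and merges the results without duplicates. `CORPUS` and `retrieve` are hypothetical stand-ins for a vector store and its retriever; only the fan-out and merge logic is the point here.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory corpus standing in for a vector store.
CORPUS = {
    "thailand": ["Thailand travel guide", "Visa rules for Thailand"],
    "california": ["California road trips", "California national parks"],
}


def retrieve(sub_query: str) -> list:
    """Hypothetical retriever: return documents whose key appears in the query."""
    q = sub_query.lower()
    return [doc for key, docs in CORPUS.items() if key in q for doc in docs]


def retrieve_all(sub_queries: list) -> list:
    """Run one retrieval per sub-query concurrently and merge without duplicates."""
    with ThreadPoolExecutor(max_workers=max(1, len(sub_queries))) as pool:
        results = list(pool.map(retrieve, sub_queries))  # keeps input order
    merged = []
    for docs in results:
        for doc in docs:
            if doc not in merged:
                merged.append(doc)
    return merged


docs = retrieve_all([
    "How do you go on vacation in Thailand?",
    "How do you go on vacation in California?",
])
print(docs)
```

Threads suit this step because retrieval calls are typically I/O-bound, so the total wait is close to that of the slowest single call.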

### Expansion

Query expansion is similar to decomposition in that it produces multiple queries as a strategy to improve the odds of hitting a relevant result. However, expansion returns multiple different wordings of the original query.  

We define the system prompt to consistently return at least 3 versions of the original query.

In [51]:
paraphrase_system_prompt = """You are an expert at converting user questions into database queries. 
You have access to a database of travel destinations and a list of recent destinations for travelers. 

Perform query expansion. If there are multiple common ways of phrasing a user question 
or common synonyms for key words in the question, make sure to return multiple versions 
of the query with the different phrasings.

If there are acronyms or words you are not familiar with, do not try to rephrase them.

Always return at least 3 versions of the question."""

We define `ParaphrasedQuery` to be an alternative wording of the original query.

In [52]:
class ParaphrasedQuery(BaseModel):
    """You have performed query expansion to generate a paraphrasing of a question."""

    paraphrased_query: str = Field(description="A unique paraphrasing of the original question.")

We define the prompt template leveraging the previously defined system prompt. We then expose `ParaphrasedQuery` as a tool the model can leverage. This enables the model to format one or more requests to this tool.

In [54]:
query_expansion_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", paraphrase_system_prompt),
        ("human", "Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)
llm_with_tools = nova_pro.bind_tools([ParaphrasedQuery])
query_expansion = query_expansion_prompt | llm_with_tools | PydanticToolsParser(tools=[ParaphrasedQuery])

Now no matter the nature of the query, the model generates alternatives that can be sent for retrieval in parallel.

In [55]:
query_expansion.invoke({"question": "how to use travel to Canada and to Mexico?"})

[ParaphrasedQuery(paraphrased_query='What are the steps to travel to Canada and Mexico?'),
 ParaphrasedQuery(paraphrased_query='How can I plan a trip to both Canada and Mexico?'),
 ParaphrasedQuery(paraphrased_query='What do I need to know to travel to Canada and Mexico?')]
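Once each paraphrase has been sent for retrieval, the ranked result lists have to be combined. The notebook does not prescribe a method; one common option is Reciprocal Rank Fusion (RRF), sketched below with made-up document IDs. The constant `k=60` is the value conventionally used with RRF and is an assumption here, not something tuned for this pipeline.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list, k: int = 60) -> list:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); documents ranked highly by several queries rise to the top.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for three paraphrased queries.
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a"],
    ["doc_b", "doc_d"],
]
print(reciprocal_rank_fusion(rankings))
```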

### Hypothetical Document Embeddings (HyDE)

Given that models have been trained on large volumes of data, we can generate a relevant hypothetical document to answer the user question. Then for retrieval, this new (or *hypothetical*) document can be embedded with the original query. This approach has been shown in [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/abs/2212.10496) to improve recall. We define the system prompt relevant to this task.

In [56]:
hyde_system_prompt = """You are an expert on travel destinations all over the world. Your task is to provide your best response based on the question.
You need to produce a high-quality and complete sentence hyper-focused on answering the question. 
Do not answer in bullet points.

Think about your answer first before you respond. Think step-by-step and insert the answer in <hyde></hyde> tags and do not include anything after."""

We define the prompt template leveraging the previously defined system prompt.

In [57]:
hyde_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", hyde_system_prompt),
        ("human", "Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)
hyde_chain = hyde_prompt | nova_pro | StrOutputParser()

We produce a document for the query in between tags that is appended to the original query at retrieval time.

In [58]:
queries = hyde_chain.invoke({"question": "How do go on vacation in thailand and in California?"})
display_markdown(Markdown(queries))

<hyde>To answer the customer's question about vacationing in Thailand and California, I need to provide a comprehensive and detailed response that covers essential aspects of planning a trip to both destinations. This includes information on the best times to visit, must-see attractions, travel tips, and any specific cultural or logistical considerations for each location. Thailand and California offer vastly different experiences, so the response should highlight unique features and activities available in each place. Additionally, practical advice on transportation, accommodation, and local customs will enhance the usefulness of the response.</hyde>

To go on vacation in Thailand, plan your trip around the cool and dry season from November to February, explore iconic sites like the Grand Palace and Wat Pho in Bangkok, and immerse yourself in local culture by visiting vibrant night markets and participating in traditional festivals, while in California, choose between the sunny beaches of Southern California, the tech-savvy atmosphere of Silicon Valley, or the wine country of Napa and Sonoma, and enjoy diverse activities such as surfing in Malibu, hiking in Yosemite National Park, and experiencing the vibrant city life in Los Angeles or San Francisco.

In this section we demonstrated the possibility of augmenting the original message to produce stronger results. Naturally, this LLM-driven approach requires an additional inference call, which introduces some additional latency.
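
To make the "appended at retrieval time" step concrete, here is a minimal sketch of how the hypothetical document could be pulled out of the model output and combined with the original question before embedding. The helper names `extract_hypothetical_document` and `build_retrieval_text` are hypothetical, and simple concatenation is just one assumed way of combining the two texts. As the sample output above shows, the model does not always place its answer inside the tags, so the helper falls back to the full output when the tags are missing or empty, and outputs are worth inspecting before relying on it.

```python
import re

# Matches the content of the first <hyde>...</hyde> block, including line breaks.
HYDE_PATTERN = re.compile(r"<hyde>(.*?)</hyde>", flags=re.DOTALL)


def extract_hypothetical_document(model_output: str) -> str:
    """Return the text inside <hyde> tags, or the whole output if no usable tags exist."""
    match = HYDE_PATTERN.search(model_output)
    if match and match.group(1).strip():
        return match.group(1).strip()
    return model_output.strip()


def build_retrieval_text(question: str, model_output: str) -> str:
    """Concatenate the original question with the hypothetical document (assumed strategy)."""
    hypothetical = extract_hypothetical_document(model_output)
    return f"{question.strip()}\n\n{hypothetical}"
```

The string returned by `build_retrieval_text` is what would be passed to the embedding model in place of the raw question.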

## Model answer generation

In most RAG pipelines, the number of documents shown to the model is driven by the retrieval mechanism. This generally returns up to some static number of documents, provided they meet the necessary similarity threshold. Often, this results in irrelevant documents being sent to the model for inference. Although we can easily instruct the model to ignore irrelevant documents, it is often useful for the model to explicitly call out the documents it did use. Furthermore, many lines of research have demonstrated the effectiveness of enabling the model to correct itself. In both cases, we make an additional call to the model once an initial answer is generated in order to improve the output for the end user.
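
The "up to a static number of documents above a similarity threshold" behaviour can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the retriever used in this notebook: `cosine_similarity` and `top_k_above_threshold` are hypothetical helpers, and the default values for `k` and `threshold` are arbitrary.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_above_threshold(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return at most k (doc_id, score) pairs whose score meets the threshold, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs]
    kept = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]
```

Any document that clears the threshold is returned, even when it is only loosely related to the query, which is why irrelevant documents can still reach the model.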

### Citation

We generate an output with `answer` and `docs` keys. `docs` contains a list of LangChain `Document` objects. These are the documents the model has picked as being relevant to answering the original query. Although the documents are currently returned with a title and summary, these keys are part of a `metadata` attribute, which lets you define any number of fields relevant to your application, such as author or source URL.

We define the system prompt to generate the model answer. Note that this is a simple template that can be further augmented with additional sections better describing our task and intended output.

In [59]:
citation_system_prompt = """You're a helpful AI assistant. Given a user question and some article snippets, answer the user question. 
If none of the articles answer the question, just say you don't know.

Here are the articles: {context}
"""

This prompt is passed as part of the broader chat template.

In [60]:
citation_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", citation_system_prompt),
        ("human", "Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)

answer_generator = citation_prompt | nova_pro | StrOutputParser()

Let's use the `WikipediaRetriever`, which allows us to interact with the Wikipedia API.

In [62]:
wiki = WikipediaRetriever(top_k_results=6, doc_content_chars_max=2000)

The `format_docs` helper function formats the documents returned by the retriever to make them friendlier to the model. We supply each document's title and summary snippet. At the end, we wrap the function in `RunnableLambda`, a child of LangChain's `Runnable` class. This simply enables us to call the function with a standard API (invoke, batch, stream, transform and compose). Many objects in LangChain implement this interface, including chat models, prompt templates and retrievers.

To demonstrate the power of citations, we also append an additional obviously irrelevant document to the formatted documents.

In [63]:
def format_docs(docs: List[Document]) -> str:
    """Convert Documents to a single string."""
    formatted = [
        f"Article Title: {doc.metadata['title']}\nArticle Snippet: {doc.page_content}"
        for doc in docs
    ]
    # Append an obviously irrelevant document, using the same two-line layout as the real ones.
    formatted.append(
        "Article Title: This is an irrelevant document\n"
        "Article Snippet: The document is most irrelevant."
    )
    return "\n\n" + "\n\n".join(formatted)


format = itemgetter("docs") | RunnableLambda(format_docs)

We define the chain as a `RunnableParallel` object, an extension of `Runnable` that runs a mapping of Runnables in parallel and returns a mapping of their outputs. We set the `question` key using `RunnablePassthrough`, which passes the input through unchanged. Then, we assign the values the prompt template expects to their keys.

In [64]:
citation_chain = (
    RunnableParallel(question=RunnablePassthrough(), docs=wiki)
    .assign(context=format)
    .assign(answer=answer_generator)
    .pick(["answer", "docs"])
)

When invoked, the chain returns the answer together with the documents that were retrieved for generation. Notice that some documents are relevant to the final answer and some are not. We can address this challenge with further LLM-based or metadata-based document filtering.

In [65]:
citation_chain.invoke("How do go on vacation in thailand and in California?")

{'answer': "Here's how you can go on vacation in Thailand and California:\n\n### Vacation in Thailand\n\n1. **Plan Your Itinerary**: Decide which cities or regions you want to visit. Popular destinations include Bangkok, Phuket, Chiang Mai, and Krabi.\n2. **Book Flights**: Find flights to major airports like Suvarnabhumi Airport (BKK) in Bangkok or Phuket International Airport (HKT).\n3. **Accommodation**: Book hotels, resorts, or Airbnbs in your chosen locations. Consider the type of experience you want (luxury, budget, etc.).\n4. **Visa**: Check if you need a visa. Many countries can enter Thailand visa-free for a limited period.\n5. **Transportation**: Arrange for local transportation. Options include taxis, tuk-tuks, Grab (ride-hailing service), and public transport.\n6. **Experience Local Culture**: Visit temples, markets, and try local cuisine. Participate in cultural activities or tours.\n\n### Vacation in California\n\n1. **Plan Your Itinerary**: Decide which cities or regions 
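
One possible way to have the model call out the documents it actually used is to number the articles in the context and ask for bracketed citations such as `[1]`. The sketch below assumes such an instruction has been added to the system prompt, which the prompt above does not do. `Article`, `format_numbered` and `cited_articles` are hypothetical names; `Article` stands in for a LangChain `Document` to keep the example self-contained.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Article:
    """Minimal stand-in for a retrieved document (assumption)."""
    title: str
    snippet: str


def format_numbered(articles: List[Article]) -> str:
    """Format the articles with a 1-based index the model can cite."""
    return "\n\n".join(
        f"[{i}] Article Title: {article.title}\nArticle Snippet: {article.snippet}"
        for i, article in enumerate(articles, start=1)
    )


def cited_articles(answer: str, articles: List[Article]) -> List[Article]:
    """Return the articles referenced as [n] in the answer, in order of first citation."""
    cited: List[Article] = []
    seen = set()
    for match in re.finditer(r"\[(\d+)\]", answer):
        index = int(match.group(1))
        # Ignore out-of-range indices and repeated citations.
        if 1 <= index <= len(articles) and index not in seen:
            seen.add(index)
            cited.append(articles[index - 1])
    return cited
```

Only the cited articles would then be returned under the `docs` key, so an appended irrelevant document is dropped whenever the model does not cite it.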

### Self-validation

Giving the model an opportunity to correct itself has been shown to increase performance on a number of tasks. We perform self-validation and define a set of formatting rules that align with the conversational tone we expect to have from our application. We define a system prompt with this task and set of rules.

In [66]:
valid_system_prompt = """You are a validator and message synthesizer.
Your task is to create one coherent answer and double-check the original responses to the question {question} for common mistakes, including:
- Answer in bullet points. It should be a complete paragraph instead.
- Inaccuracies or things that seem impossible

If there are any of the above mistakes, rewrite the response. If there are no mistakes, just reproduce the original response.
Think about your answer first before you respond.
If any issues exist, list all of them first, then put your final response in <validation></validation> tags and do not include anything after.
"""

We define the prompt template with the system prompt and original model answer.

In [67]:
validation_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", valid_system_prompt),
        ("human", "Here is the original message produced: <original_message>{original}</original_message> How do you answer to the instructions?"),
    ]
)
validation_chain = validation_prompt | nova_pro | StrOutputParser()

We invoke the model, which points out obvious issues in the original answer and responds with a more consistent alternative.

In [69]:
validation = validation_chain.invoke({
    "question" : "how to go to thailand from Montreal?",
    "original": "1- by plane 2-by car.",
})
display_markdown(Markdown(validation))

The original message contains inaccuracies and is overly simplistic. Traveling from Montreal to Thailand by car is not feasible due to geographical constraints and the need to cross multiple countries and bodies of water. Additionally, the response is presented in bullet points rather than a complete paragraph. 

Here is the revised response:

<validation>
Traveling from Montreal to Thailand typically involves flying, as it is the most practical and direct method. You would need to book a flight from Montreal's Pierre Elliott Trudeau International Airport to a major airport in Thailand, such as Suvarnabhumi Airport in Bangkok. This journey usually requires one or more layovers, depending on the airline and route chosen. It is important to check visa requirements and ensure all travel documents are in order before departure.
</validation>

## Putting it all together 

The previous components offer important primitives to build a performant RAG solution. They act as building blocks of a broader solution. We provide an example showcasing how they can be brought together in a single chain to improve response accuracy. To minimize latency and improve accuracy, we use Amazon Nova Lite for simpler tasks and Nova Pro where we need more performance. The pipeline is described by the following diagram.

<img src="./assets/rag-pipeline.png" width="400" height="400" />

First, we create helper functions that parse the returned messages for the relevant section found between tags.

In [70]:
def parse_intent(ai_message: AIMessage) -> str:
    """Return the text inside the first <intention> tag of the model response."""
    intent_pattern = r"<intention>(.*?)</intention>"
    # re.DOTALL lets the tag content span multiple lines.
    intent_match = re.findall(intent_pattern, ai_message.content, flags=re.DOTALL)
    if intent_match:
        return intent_match[0].strip()
    return "No intention found."

def parse_norm_message(payload: dict) -> str:
    """Return the text inside the first <condensed_message> tag.

    The input is the chain's payload dict, whose "question" key holds the AIMessage.
    """
    norm_pattern = r"<condensed_message>(.*?)</condensed_message>"
    norm_match = re.findall(norm_pattern, payload["question"].content, flags=re.DOTALL)
    if norm_match:
        return norm_match[0].strip()
    return "Message could not be successfully normalized."

We define an end-to-end RAG chain primarily using LangChain Expression Language (LCEL), which allows us to define `Runnable` objects in succession. The resulting chain reuses many of the components we previously defined, including intent detection with **dynamic few-shots, message normalization and citation**.

In [76]:
rag_chain = RunnableParallel(
    question=RunnablePassthrough(),
    intent=few_shot_intent_detection_prompt | nova | parse_intent
) | RunnableBranch(
    (lambda payload: "vacation" == payload["intent"].lower(), lambda x: (
        RunnablePassthrough().pick(["question"])
        .assign(question=norm_chain)
        .assign(question=parse_norm_message)
        .assign(context=lambda inputs: wiki.invoke(inputs["question"]))
        .assign(answer=answer_generator)
        .pick(["answer", "context"])
    )),
    (lambda payload: "irrelevant" == payload["intent"].lower(), lambda x: AIMessage(content="I am only able to answer questions about travel and vacations.")),
    (lambda payload: "contact" == payload["intent"].lower(), lambda x: AIMessage(content="I am transferring you to an agent now...")),
    lambda payload: AIMessage(content="I am only able to answer questions about travel and vacations." )
)

display_markdown(Markdown(rag_chain.invoke("I want to know more about how to plan a vacation?")['answer']))

Based on the provided articles, here are some key points to consider when planning a vacation:

1. **Vacation Duration**:
   - The length of a summer vacation can vary significantly by country. For example, in countries like Spain, Portugal, and Italy, the summer break is typically three months, whereas in countries like Australia, the United Kingdom, and Germany, it may be two to six weeks.

2. **Vacation Rentals**:
   - HomeAway (now Vrbo) is a marketplace for vacation rentals, offering a variety of properties such as cabins, condos, villas, and more. It can be a useful resource for finding a place to stay during your vacation.

3. **Schedules and Time Management**:
   - Creating a schedule or timetable can help you manage your vacation time effectively. This can include planning daily activities, knowing the hours of operation for attractions, and setting times for specific events.

4. **Entertainment**:
   - If you’re looking for vacation-themed entertainment, the film "RV" (2006) is a comedy about a family road trip that might provide some inspiration or simply entertainment during your planning phase.

If you need more specific information or have particular questions about your vacation plans, feel free to ask!

Latency increases with the number of calls made in succession. Hence, it is best to make calls in parallel where possible to reduce the overall time needed to execute the entire pipeline. Notice that in our example, intent detection could run in parallel with message normalization and citation (model inference).
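
The effect of running independent calls concurrently can be illustrated with `asyncio`. This is a toy sketch: `fake_model_call` is a hypothetical stand-in that sleeps instead of invoking a model, and the delays are arbitrary. `RunnableParallel`, used earlier in this notebook, applies the same idea to LCEL chains.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple


async def fake_model_call(name: str, delay: float) -> str:
    """Stand-in for a model invocation; sleeps instead of calling an endpoint (assumption)."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_sequential() -> List[str]:
    """Run intent detection, then normalization, one after the other."""
    intent = await fake_model_call("intent", 0.2)
    normalized = await fake_model_call("normalization", 0.2)
    return [intent, normalized]


async def run_parallel() -> List[str]:
    """Run both calls concurrently; gather returns results in argument order."""
    results = await asyncio.gather(
        fake_model_call("intent", 0.2),
        fake_model_call("normalization", 0.2),
    )
    return list(results)


def timed(coro_factory: Callable[[], Awaitable[List[str]]]) -> Tuple[List[str], float]:
    """Run a coroutine to completion and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = asyncio.run(coro_factory())
    return result, time.perf_counter() - start
```

Calling `timed(run_sequential)` and `timed(run_parallel)` from a script returns the same results, with the parallel version taking roughly the duration of the slowest call rather than the sum of both. Inside a notebook, where an event loop is already running, `await run_parallel()` can be used directly instead of `asyncio.run`.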

Additionally, it may be beneficial to modify the pipeline to include a query augmentation step for the reasons described earlier in the notebook.

## Next steps

Whereas RAG enables single-turn conversations in which users and the assistant alternate sending each other messages, agents give the application developer the ability to build more complexity into the conversation flow. These applications are characterized by increased **autonomy, reactivity, proactiveness, adaptability and situatedness**. They typically have some form of validation, along with the ability to loop back and call external functions to improve outputs. You can dive deeper into agents in the next lab of this workshop.

## Clean up

No clean up is required for this notebook.