# Chapter 3 - RAG Part II: Chatting with Your Data

## Introducing Retrieval-Augmented Generation (RAG)

### Retrieving Relevant Documents

A RAG system for an AI app typically follows three core stages:
* **Indexing**: This stage involves preprocessing the external data source and storing embeddings that represent the data in a vector store where they can be easily retrieved.
* **Retrieval**: This stage involves retrieving the relevant embeddings and data stored in the vector store based on a user’s query.
* **Generation**: This stage involves synthesizing the original prompt with the retrieved relevant documents as one final prompt sent to the model for a prediction.

Let’s run through an example from scratch again, starting with the indexing stage:

In [1]:
from langchain_community.document_loaders import TextLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres.vectorstores import PGVector
from dotenv import load_dotenv
import os

load_dotenv()

# load the document, split it into chunks
raw_documents = TextLoader("./rime.txt").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(raw_documents)

# define embedding model
hf_embedding = HuggingFaceEmbeddings(
    model="sentence-transformers/all-mpnet-base-v2", # use this model to perform the embedding
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": False},
)

# vector store credentials
connection_credentials = f"postgresql+psycopg://{os.getenv('POSTGRES_USER')}:{os.getenv('POSTGRES_PASSWORD')}@localhost:8888/{os.getenv('POSTGRES_DB')}"

# embed each chunk and insert it into the vector store
db = PGVector.from_documents(documents=documents, embedding=hf_embedding, connection=connection_credentials)


The indexing stage is now complete. In order to execute the retrieval stage, we need to perform similarity search calculations—such as cosine similarity—between the user’s query and our stored embeddings, so relevant chunks of our indexed document are retrieved.
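To make the similarity calculation concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings, and the function name ```cosine_similarity``` is ours; in the example above, the vector store performs this kind of comparison for us.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings", for illustration only
query_vec = [1.0, 0.0, 1.0]
chunk_a = [2.0, 0.0, 2.0]  # points in the same direction as the query
chunk_b = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, chunk_a))  # very close to 1
print(cosine_similarity(query_vec, chunk_b))  # 0.0
```

A score near 1 means the two vectors point in nearly the same direction, regardless of their lengths, while a score of 0 means they share no direction at all.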

The retrieval process consists of three steps:
1. Convert the user's query into an embedding.
2. Find the embeddings in the vector store that are most similar to the query embedding.
3. Retrieve the matching document embeddings and their corresponding text chunks.

We can represent these steps programmatically using LangChain as follows:

In [3]:
# create a retriever
retriever = db.as_retriever(search_kwargs={"k": 2})

# fetch query's relevant documents
docs = retriever.invoke(input="Who's the ancyent marinere?")
docs

[Document(id='2a80a61a-87a4-4795-8aa3-56150da2c80d', metadata={'source': './rime.txt'}, page_content='THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.\n\nARGUMENT.\n\nHow a Ship having passed the Line was driven by Storms to the cold Country towards the South Pole; and how from thence she made her course to the tropical Latitude of the Great Pacific Ocean; and of the strange things that befell; and in what manner the Ancyent Marinere came back to his own Country.\n\nI.\n\n     It is an ancyent Marinere,\n       And he stoppeth one of three:\n     "By thy long grey beard and thy glittering eye\n       "Now wherefore stoppest me?\n\n     "The Bridegroom\'s doors are open\'d wide\n       "And I am next of kin;\n     "The Guests are met, the Feast is set,--\n       "May\'st hear the merry din.--\n\n     But still he holds the wedding-guest--\n       There was a Ship, quoth he--\n     "Nay, if thou\'st got a laughsome tale,\n       "Marinere! come with me."'),
 Document(id='2301595d-ab31-4

Note that we are using a vector store method you haven’t seen before: ```as_retriever```. This method abstracts the logic of embedding the user’s query and running the underlying similarity search in the vector store to retrieve the relevant documents.

There is also an argument ```k```, which determines the number of relevant documents to fetch from the vector store. In this example, the argument ```k``` is specified as 2. This tells the vector store to return the two most relevant documents based on the user’s query.
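The effect of ```k``` can be pictured with a small top-k selection in plain Python. This is only a sketch: the chunk labels and two-dimensional vectors are invented, the vectors are assumed to be unit length (so the dot product equals the cosine similarity), and ```top_k``` is a hypothetical helper, not what the vector store runs internally.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, text) pairs with the highest similarity score."""
    scored = [(dot(query_vec, vec), text) for text, vec in store]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Made-up store of (text, unit-length vector) pairs
toy_store = [
    ("chunk about the wedding guest", [0.0, 1.0]),
    ("chunk about the mariner", [1.0, 0.0]),
    ("chunk about the ship", [0.6, 0.8]),
]
query_vec = [1.0, 0.0]

results = top_k(query_vec, toy_store, k=2)
for score, text in results:
    print(f"{score:.1f}  {text}")
```

With ```k=2``` the lowest-scoring chunk is dropped; raising ```k``` gives the model more context at the cost of a longer prompt.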

### Generating LLM Predictions Using Relevant Documents

Once we’ve retrieved the relevant documents based on the user’s query, the final step is to add them to the original prompt as context and then invoke the model to generate a final output.

Here’s a code example continuing on from our previous example:

In [2]:
from langchain_deepseek import ChatDeepSeek
from langchain_core.prompts import ChatPromptTemplate

load_dotenv()

retriever = db.as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_template(
    template=
    """
    Answer the question based only on the following context:
    {context}

    Question: {question}
    """
)
llm = ChatDeepSeek(model="deepseek-chat", temperature=0.0)


In [3]:
chain = prompt | llm

# fetch relevant documents
question = """Who's the ancyent marinere?"""
docs = retriever.invoke(input=question) # get_relevant_documents is deprecated; use invoke instead

# run the workflow
answer = chain.invoke(input={"context": docs, "question": question})
print(f"answer: {answer.content}\n\ndocs: {docs}")

answer: Based solely on the provided context, the "ancyent Marinere" is an old sailor who stops a wedding guest and begins to tell him a story about a ship. He is described as having a "long grey beard," a "glittering eye," and a "skinny hand." He uses his compelling gaze to force the wedding guest to listen to his tale.

docs: [Document(id='6f392f3d-d539-47d7-9347-3d15ff35c80d', metadata={'source': './rime.txt'}, page_content='THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.\n\nARGUMENT.\n\nHow a Ship having passed the Line was driven by Storms to the cold Country towards the South Pole; and how from thence she made her course to the tropical Latitude of the Great Pacific Ocean; and of the strange things that befell; and in what manner the Ancyent Marinere came back to his own Country.\n\nI.\n\n     It is an ancyent Marinere,\n       And he stoppeth one of three:\n     "By thy long grey beard and thy glittering eye\n       "Now wherefore stoppest me?\n\n     "The Bridegroom\'s doors 

Note the following changes:
* We define a ```ChatPromptTemplate``` with dynamic ```context``` and ```question``` variables, so the same prompt can be filled in with different retrieved documents and user questions before it is sent to the model.
* We define a ```ChatDeepSeek``` interface to act as our LLM. ```temperature``` is set to ```0.0``` to make the model’s outputs as deterministic as possible, minimizing creative variation.
* We create a chain to compose the prompt and LLM. A reminder: the ```|``` (pipe) operator takes the output of prompt and uses it as the input to llm.
* We invoke the chain passing in the context variable (our retrieved relevant docs) and the user’s question to generate a final output.
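The pipe composition described in the list above can be pictured with a toy class. This is a sketch of the idea only, not LangChain's implementation: ```Step```, ```toy_prompt```, and ```toy_llm``` are hypothetical names, and the "model" here just reports how much text it received.

```python
from typing import Any, Callable


class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for the prompt template and the model
toy_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
toy_llm = Step(lambda text: f"(model received {len(text)} characters)")

toy_chain = toy_prompt | toy_llm
inputs = {"context": "It is an ancyent Marinere", "question": "Who is he?"}
print(toy_chain.invoke(inputs))
```

Invoking the composed chain gives the same result as calling the two steps by hand, one after the other.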

We can encapsulate this retrieval logic in a single function:

In [2]:
from typing import Any
from langchain_core.runnables import chain

@chain
def qa(question: str) -> dict[str, Any]:

    # fetch relevant documents
    docs = retriever.invoke(input=question)
    # prepare prompt
    formatted_prompt =  prompt.invoke(input={"context": docs, "question": question})

    answer = llm.invoke(input=formatted_prompt) # return llm's answer

    return {"answer": answer, "docs": docs} # return answer and relevant docs

In [6]:
response = qa.invoke(input="From where to where was the ship sailing?")

print(f"answer: {response['answer'].content}\n\nrelevant docs: {response['docs']}")

answer: Based solely on the provided context, the ship's journey is described in the "Argument" section as follows:

From: The cold Country towards the South Pole
To: The tropical Latitude of the Great Pacific Ocean

relevant docs: [Document(id='6f392f3d-d539-47d7-9347-3d15ff35c80d', metadata={'source': './rime.txt'}, page_content='THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.\n\nARGUMENT.\n\nHow a Ship having passed the Line was driven by Storms to the cold Country towards the South Pole; and how from thence she made her course to the tropical Latitude of the Great Pacific Ocean; and of the strange things that befell; and in what manner the Ancyent Marinere came back to his own Country.\n\nI.\n\n     It is an ancyent Marinere,\n       And he stoppeth one of three:\n     "By thy long grey beard and thy glittering eye\n       "Now wherefore stoppest me?\n\n     "The Bridegroom\'s doors are open\'d wide\n       "And I am next of kin;\n     "The Guests are met, the Feast is set,--\n  

Notice how we now have a new runnable, ```qa```, that can be called with just a question: it first fetches the relevant docs for context, then formats them into the prompt, and finally generates the answer. The ```@chain``` decorator turns the function into a runnable chain. This notion of encapsulating multiple steps into a single function will be key to building interesting apps with LLMs.
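One more step that could be folded into such a function is turning the retrieved documents into plain text before they reach the prompt. In the chain above, the list of documents is passed to ```{context}``` as is, so the prompt most likely carries their ids and metadata along with the text. The sketch below joins only the text of each chunk. ```Doc``` is a hypothetical stand-in so the example runs without LangChain, and ```format_docs``` is our own helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (not LangChain's class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: list[Doc]) -> str:
    """Join the text of each chunk, separated by a blank line."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Doc("It is an ancyent Marinere,", {"source": "./rime.txt"}),
    Doc("And he stoppeth one of three:", {"source": "./rime.txt"}),
]
print(format_docs(docs))
```

Inside a function like ```qa```, the formatted string would take the place of the raw document list in the ```context``` variable, while the original documents can still be returned alongside the answer.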

## Query Transformation

One of the major problems with a basic RAG system is that it relies too heavily on the quality of a user’s query to generate an accurate output. In a production setting, a user is likely to construct their query in an incomplete, ambiguous, or poorly worded manner, which can lead to model hallucination.

_Query transformation_ is a set of strategies designed to modify the user’s input to answer the first RAG problem question: _How do we handle the variability in the quality of a user’s input?_

### Rewrite-Retrieve-Read (RRR)

The Rewrite-Retrieve-Read strategy proposed by a Microsoft Research team simply prompts the LLM to rewrite the user’s query before performing retrieval. To illustrate, let’s return to the chain we built in the previous section, this time invoked with a poorly worded user query:

In [7]:
response = qa.invoke(input="Today I woke up and brushed my teeth, then I sat down to read the news. But then I forgot the food on the cooker and my house burned. From where to where was the ship sailing?")

response["answer"].content

"I cannot answer that question based on the provided context. The documents describe a ship caught in a storm and sinking, but they do not contain any information about the ship's departure point or destination."

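The model failed to answer the question because the irrelevant information in the user’s query got in the way: judging by the reply, the chunks retrieved as context describe a storm and a sinking ship rather than the passage about the ship’s route. Now let’s implement the RRR prompt: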

In [13]:
from langchain_core.messages import BaseMessage

rewrite_prompt = ChatPromptTemplate.from_template("""Extract the question from the following text, rephrasing it if badly constructed. End the question with '**'. Text: {question} answer:""")

def parse_rewriter_output(message: BaseMessage) -> str:
    return message.content.strip('"').strip('**')

rewriter = rewrite_prompt | llm | parse_rewriter_output

@chain
def qa_rrr(question: str) -> dict[str, Any]:
    # rewrite the query
    new_question = rewriter.invoke(input=question)
    # fetch relevant documents
    docs = retriever.invoke(input=new_question)
    # format prompt
    formatted_prompt = prompt.invoke(input={"context": docs, "question": new_question})

    answer = llm.invoke(input=formatted_prompt) # return llm's answer

    return {"answer": answer, "docs": docs, "question": question, "new_question": new_question} # return answer and relevant docs

In [16]:
response = qa_rrr.invoke(input="Today I woke up and brushed my teeth, then I sat down to read the news. But then I forgot the food on the cooker and my house burned down. From where to where was the ship sailing? Sadly, I became homeless.")

print(f"original question: {response['question']}\n\nnew question: {response['new_question']}\n\nanswer: {response['answer'].content}")


original question: Today I woke up and brushed my teeth, then I sat down to read the news. But then I forgot the food on the cooker and my house burned down. From where to where was the ship sailing? Sadly, I became homeless.

new question: From where to where was the ship sailing?

answer: Based solely on the provided context, the ship's journey is described in the "ARGUMENT" section of the first document:

The ship was sailing from "the cold Country towards the South Pole" to "the tropical Latitude of the Great Pacific Ocean."
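A note on the parsing step used in ```parse_rewriter_output```: Python's ```str.strip``` treats its argument as a set of characters to remove from both ends, not as a literal suffix, so ```.strip('**')``` removes any run of asterisks at either end. The sketch below applies the same string handling to plain text. The raw replies are invented examples, and ```parse_rewriter_text``` is a hypothetical helper name.

```python
def parse_rewriter_text(text: str) -> str:
    """Same string handling as parse_rewriter_output, applied to plain text."""
    return text.strip('"').strip('**')


# Invented reply: the question wrapped in quotes, with the '**' end marker inside
raw = '"From where to where was the ship sailing?**"'
print(parse_rewriter_text(raw))

# Order matters: a quote that sits inside the asterisks survives,
# because the quotes are stripped before the asterisks
print(parse_rewriter_text('From where to where was the ship sailing?"**'))
```

If the model's formatting varies more than this, a stricter parser may be worth the extra code.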


### Multi-query Retrieval

A user’s single query can be insufficient to capture the full scope of information required to answer the query comprehensively. The multi-query retrieval strategy addresses this problem in three steps: it instructs an LLM to generate multiple queries based on the user’s initial query, retrieves documents for each query from the data source in parallel, and then inserts the retrieved results as prompt context to generate a final model output.

This strategy is particularly useful for use cases where a single question may rely on multiple perspectives to provide a comprehensive answer. Here’s a code example of multi-query retrieval in action:

In [4]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import BaseMessage

perspectives_prompt = ChatPromptTemplate.from_template("""You are an AI language model assistant. Your task is to generate five different versions of the given user question, to retrieve relevant documents for a vector database. By generating multiple perspectives on the user question, your goal is to help the user to overcome some of the limitations of the distance-based similarity search. Provide these alternative questions separated by newlines. Original question: {question}""")

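def parse_queries_output(message: BaseMessage) -> list[str]:
    # one query per line; skip blank lines the model may insert between queries
    return [line.strip() for line in message.content.split("\n") if line.strip()]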

query_gen = perspectives_prompt | llm | parse_queries_output

Next we take the list of generated queries, retrieve the most relevant docs for each of them in parallel, and then combine to get the unique union of all the retrieved relevant documents:

In [10]:
from langchain_core.documents import Document

def get_unique_union(document_list: list[list[Document]]) -> list[Document]:
    # flatten list of lists, and dedupe them
    deduped_docs = {
        doc.page_content: doc for sublist in document_list for doc in sublist
    }
    # return a flat list of unique docs
    return list(deduped_docs.values())

retrieval_chain = query_gen | retriever.batch | get_unique_union

Because we’re retrieving documents from the same retriever with multiple (related) queries, it’s likely that at least some of the retrieved documents are repeated. Before using them as context to answer the question, we need to deduplicate _(dedupe)_ them so we end up with a single instance of each.
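The deduplication can be tried on toy data without a vector store. In this sketch, ```Doc``` is a hypothetical stand-in for a retrieved document, and the chunk labels are invented. The function body follows the same dictionary-based approach as ```get_unique_union```.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a retrieved document."""

    page_content: str


def get_unique_union(document_lists: list[list[Doc]]) -> list[Doc]:
    # keys are the chunk texts, so a repeated chunk overwrites its earlier entry
    deduped = {doc.page_content: doc for sublist in document_lists for doc in sublist}
    # dicts keep insertion order, so chunks come back in first-seen order
    return list(deduped.values())


# Made-up results for three related queries, with overlapping chunks
results_per_query = [
    [Doc("chunk A"), Doc("chunk B")],
    [Doc("chunk B"), Doc("chunk C")],
    [Doc("chunk A"), Doc("chunk C")],
]

unique = get_unique_union(results_per_query)
print([doc.page_content for doc in unique])
```

Six retrieved chunks collapse to three. Keying on ```page_content``` means two chunks with identical text count as one even if their ids or metadata differ.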

Notice as well our use of ```.batch```, which runs all generated queries in parallel and returns a list of the results. In this case, that is a list of lists of ```Document``` objects, which we then flatten and dedupe as described earlier.
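As a rough picture of what running the queries in parallel means, the sketch below uses a thread pool from the standard library. It is not LangChain's code: ```toy_retrieve``` is a hypothetical retriever returning made-up labels, and ```batch``` is our own function that mimics the shape of the result, one list per query in the same order as the inputs.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_retrieve(query: str) -> list[str]:
    """Hypothetical retriever: returns made-up chunk labels for a query."""
    return [f"{query} -> chunk 1", f"{query} -> chunk 2"]


def batch(queries: list[str]) -> list[list[str]]:
    """Run toy_retrieve on every query in worker threads, keeping input order."""
    if not queries:
        return []  # a thread pool cannot be created with zero workers
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        # map() yields results in the order of the inputs, not completion order
        return list(pool.map(toy_retrieve, queries))


results = batch(["who is the mariner", "where did the ship sail"])
for query_results in results:
    print(query_results)
```

Because retrieval is mostly waiting on the database, threads are a reasonable fit here: the calls overlap instead of running one after another.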

The final step is to construct a prompt, including the user’s question and combined retrieved relevant documents, and a model interface to generate the prediction:

In [11]:
from langchain_core.runnables import chain
from typing import Any

@chain
def multi_query_qa(input: str) -> dict[str, Any]:
    # fetch relevant documents
    docs = retrieval_chain.invoke(input=input)
    formatted_prompt = prompt.invoke(input={"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(input=formatted_prompt)

    return {"answer": answer, "docs": docs, "question": input}

In [15]:
# run the model
response = multi_query_qa.invoke(input="what are the main events described in the story of the ancyent marinere?")

print(f"question: {response['question']}\n\nanswer: {response['answer'].content}\n\ndocs: {response['docs']}")

question: what are the main events described in the story of the ancyent marinere?

answer: Based solely on the provided context, the main events described are:

1.  **The Mariner Detains a Wedding Guest:** An ancient mariner stops a guest on his way to a wedding feast. Despite the guest's protests, the mariner holds him with his "glittering eye" and compels him to listen to a story.

2.  **The Ship's Journey Begins:** The mariner begins his tale by describing the ship's departure from the harbor, sailing cheerfully past the church, hill, and lighthouse. The sun rises from the sea on the left and sets into the sea on the right.

3.  **A Supernatural Event with the Dead Crew:** A strong wind roars and then suddenly drops. Under the lightning and moon, the dead crew members groan, rise, and begin to work the ship's ropes again as if alive, though they are silent and move like "lifeless tools." The mariner is terrified, especially working next to the body of his brother's son.

4.  **The 

Notice how this isn’t that different from our previous QA chains, as all the new logic for multi-query retrieval is contained in ```retrieval_chain```. This is key to making good use of these techniques—implementing each technique as a standalone chain (in this case, ```retrieval_chain```), which makes it easy to adopt them and even to combine them.
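To illustrate how standalone steps can be combined, here is a sketch that chains a query rewriter in front of multi-query retrieval. Everything in it is a stub written for illustration: ```rewrite```, ```generate_queries```, and ```retrieve_all``` use simple string rules where the real chains would call an LLM and a vector store, and ```compose``` is a hypothetical helper playing the role of the pipe operator.

```python
from typing import Callable

# Made-up chunks standing in for an indexed document
CHUNKS = [
    "the ship sailed towards the South Pole",
    "the mariner stops a wedding guest",
    "the ship reached the Pacific Ocean",
]


def rewrite(question: str) -> str:
    """Stub rewriter: keep the last sentence that ends with '?'."""
    sentences = [s.strip() for s in question.replace("?", "?.").split(".")]
    questions = [s for s in sentences if s.endswith("?")]
    return questions[-1] if questions else question


def generate_queries(question: str) -> list[str]:
    """Stub query generator: the question plus one made-up variant."""
    return [question, f"Rephrased: {question}"]


def retrieve_all(queries: list[str]) -> list[list[str]]:
    """Stub retriever: keyword overlap on words longer than three letters."""
    results = []
    for query in queries:
        keywords = {w for w in query.lower().rstrip("?").split() if len(w) > 3}
        results.append([c for c in CHUNKS if keywords & set(c.lower().split())])
    return results


def unique_union(chunk_lists: list[list[str]]) -> list[str]:
    """Flatten and dedupe, keeping first-seen order."""
    return list(dict.fromkeys(c for sublist in chunk_lists for c in sublist))


def compose(*steps: Callable) -> Callable:
    """Chain steps left to right, feeding each output into the next step."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


combined_retrieval = compose(rewrite, generate_queries, retrieve_all, unique_union)
noisy = "My house burned down. From where to where was the ship sailing? Sadly, I became homeless."
print(combined_retrieval(noisy))
```

Because each technique is its own step, adding the rewriter is a one-word change to the composition rather than a rewrite of the retrieval logic.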