# Basic RAG

![basic rag](../images/images-basic%20rag.png)

A basic **Retrieval-Augmented Generation (RAG)** flow can be broken down into three main stages: **Indexing**, **Retrieval**, and **Generation**. Each stage plays a distinct role in ensuring that the user query is processed effectively and results in a relevant, contextually enriched response.

### **1. Indexing**
   - **Purpose**: Prepare the external knowledge base for efficient retrieval.
   - **Process**:
     1. **Document Preprocessing**:
        - Raw data (e.g., PDFs, web pages, structured data) is cleaned, tokenized, and converted into a machine-readable format.
     2. **Representation**:
        - Documents are encoded into a searchable form:
          - **Sparse Indexing**: Traditional techniques like TF-IDF or BM25 create keyword-based indexes.
          - **Dense Indexing**: Neural models (e.g., Sentence Transformers) create vector embeddings for semantic similarity search.
     3. **Storage**:
        - The indexed representations are stored in a retrieval system like Chroma, Elasticsearch, FAISS, Pinecone, or Vespa.
   - **Output**: A structured, searchable repository of documents ready for retrieval.
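
As a rough illustration of the sparse option above, the following sketch builds a tiny TF-IDF inverted index in plain Python. The function names (`tokenize`, `build_tfidf_index`) and the three sample documents are assumptions made for this example; real search engines use richer tokenization and scoring such as BM25.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_tfidf_index(docs: list[str]) -> dict[str, dict[int, float]]:
    """Return an inverted index mapping term -> {doc_id: tf-idf weight}."""
    n_docs = len(docs)
    term_counts = [Counter(tokenize(d)) for d in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for counts in term_counts for term in counts)

    index: dict[str, dict[int, float]] = defaultdict(dict)
    for doc_id, counts in enumerate(term_counts):
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term])
            index[term][doc_id] = tf * idf
    return dict(index)


# Hypothetical sample documents, not taken from the notebook's data.
docs = [
    "rotate api keys regularly",
    "validate api inputs",
    "enforce https everywhere",
]
index = build_tfidf_index(docs)
print(sorted(index["api"]))  # ids of the documents that contain "api"
```

Terms that appear in fewer documents receive a higher weight, which is why a rare keyword match counts for more than a common one.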

### **2. Retrieval**
   - **Purpose**: Identify and fetch relevant documents or passages from the knowledge base.
   - **Process**:
     1. **Query Encoding**:
        - The user query is encoded into a vector representation (using the same embedding model as the indexing step for dense retrieval).
     2. **Search**:
        - The encoded query is matched against the indexed documents:
          - **Dense Retrieval**: Measures similarity (e.g., cosine similarity) between the query embedding and document embeddings.
          - **Sparse Retrieval**: Uses keyword-based scoring algorithms like BM25.
     3. **Ranking**:
        - Retrieved documents are ranked based on relevance scores.
     4. **Selection**:
        - A fixed number (e.g., top 5) of the most relevant documents or passages are selected.
   - **Output**: A set of top-ranked documents or snippets that are most relevant to the query.
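
The search, ranking, and selection steps for dense retrieval can be sketched as below. This is a minimal sketch that assumes the embeddings already exist; the three-dimensional vectors and document ids are made up for illustration, and a real vector store would use an approximate nearest-neighbour index rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 5
) -> list[tuple[str, float]]:
    """Score every document against the query, rank by score, keep the best k."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
doc_vecs = {
    "api-security": [0.9, 0.1, 0.0],
    "oauth-guide": [0.7, 0.7, 0.0],
    "cooking-tips": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.0, 0.0]
results = top_k(query_vec, doc_vecs, k=2)
print([doc_id for doc_id, _ in results])
```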

### **3. Generation**
   - **Purpose**: Use retrieved information to generate a coherent and contextually accurate response.
   - **Process**:
     1. **Context Preparation**:
        - The user query and the retrieved documents are combined into a prompt for the LLM.
     2. **Language Model Processing**:
        - The LLM processes the input, paying attention to the query and retrieved context to craft a grounded response.
     3. **Response Optimization**:
        - The output may be refined to ensure clarity, coherence, and relevance (e.g., using post-processing techniques).
   - **Output**: A natural language response tailored to the user's query, enriched with the retrieved contextual information.
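
The context preparation step can be illustrated with plain string formatting. The helper name `build_prompt`, the numbering of passages, and the sample passages are assumptions for this sketch; the prompt wording mirrors the template used later in this notebook.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Combine the user question and retrieved passages into a single prompt."""
    # Number each passage so the model (and a reader) can tell them apart.
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
    )


# Hypothetical retrieved passages.
passages = [
    "Use OAuth 2.0 for authentication.",
    "Enforce HTTPS for all connections.",
]
prompt_text = build_prompt("How do I secure an API?", passages)
print(prompt_text)
```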

### **Example Flow**
#### User Query:
*"What are the best practices for securing an API?"*

1. **Indexing**:
   - Security guidelines, blog posts, and API documentation are preprocessed and indexed using dense embeddings and stored in a vector database.

2. **Retrieval**:
   - The query is encoded into a vector and matched against the database.
   - Relevant documents such as "API Security Best Practices (2023)" and "OAuth Implementation Guide" are retrieved.

3. **Generation**:
   - The query and retrieved documents are passed as input to an LLM.
   - The model generates a response like:
     - "To secure an API, implement OAuth 2.0 for authentication, validate all inputs to prevent injection attacks, and ensure HTTPS is enforced for all connections."

### **Key Points**
- **Indexing** ensures efficient retrieval by pre-processing and storing documents in a searchable format.
- **Retrieval** narrows down the knowledge base to the most relevant context.
- **Generation** synthesizes this context with the user query to produce an accurate, grounded response.

This modular flow allows RAG systems to dynamically incorporate external information, making them versatile and scalable for diverse use cases.

## Setup


In [33]:
%run "../Z - Common/setup.ipynb"

## Imports

In [5]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.globals import set_debug
from langchain.prompts import ChatPromptTemplate
from pprint import pprint


## Indexing

[DocumentLoaders](https://python.langchain.com/docs/integrations/document_loaders/) load data into the standard LangChain Document format. Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the `.load()` method. 


In [6]:
%psource load_sample_data

def load_sample_data() -> Iterator[Document]:
    """Loads data from a blog, intended to be later stored in a vectorstore."""

    loader = WebBaseLoader(
        web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
        bs_kwargs=dict(
            parse_only=bs4.SoupStrainer(
                class_=("post-content", "post-title", "post-header")
            )

In [7]:
docs = load_sample_data()

Before the documents are loaded into the vector store, they need to be [split](https://python.langchain.com/docs/how_to/recursive_text_splitter/). Here we start with basic splitting by length (chunks of `300` tokens with `50` tokens of overlap, because the splitter is created with `from_tiktoken_encoder`); later we will explore other splitting techniques.


In [8]:
%psource split_sample_data

def split_sample_data(docs:Iterator[Document], chunk_size=300, chunk_overlap=50) -> List[Document]:
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap)

    # Make splits
    splits = text_splitter.split_documents(docs)
    return splits


In [9]:
splits = split_sample_data(docs)
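
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is only a sketch: it counts characters and cuts at fixed offsets, whereas the splitter used above measures length in tokens and prefers to break on natural separators. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(
    text: str, chunk_size: int = 300, chunk_overlap: int = 50
) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in the neighbouring chunk, which reduces the chance of losing context at retrieval time.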

Finally, we can embed the splits and load them into the [vector store](https://python.langchain.com/docs/integrations/vectorstores/). For simplicity, we use [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma/) as the vector store.

In [10]:
%psource seed_sample_data

def seed_sample_data(documents:List[Document], k=1) -> VectorStoreRetriever:
    vector_store = Chroma(
        collection_name="rag_techniques",
        embedding_function=embeddings,
        persist_directory="./chroma_db",
    )

    uuids = [str(uuid4()) for _ in range(len(documents))]

In [11]:
retriever = seed_sample_data(splits)

## Retrieval

Using the `retriever` we can query the vector store.

In [19]:
docs = retriever.invoke("What is Task Decomposition?")

print("No. of results: ", len(docs))
print(docs[0].metadata)
print(docs[0].page_content)

No. of results:  1
{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}
Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-firs

## Generation

Define the prompt we will pass to the LLM.

In [20]:
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n'), additional_kwargs={})])

Build a basic chain. The `|` operator [chains runnable objects](https://python.langchain.com/docs/how_to/sequence/) (objects that have an `invoke()` method) together, so that each object's output, including streamed output, becomes the input of the next object in the chain.

In [21]:
chain = prompt | llm
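
The idea behind the `|` operator can be sketched with a small class that overloads `__or__`. This is not LangChain's implementation; `Step`, `fill_prompt`, and `fake_llm` are hypothetical stand-ins that only show how piping composes `invoke()` calls.

```python
from typing import Any, Callable


class Step:
    """Hypothetical stand-in for a runnable: wraps a function, supports `|`."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # The combined step feeds this step's output into the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-in for a prompt template: turns a dict of variables into text.
fill_prompt = Step(
    lambda inputs: f"Context: {inputs['context']}\nQuestion: {inputs['question']}"
)
# Stand-in for an LLM: returns a canned string instead of calling a model.
fake_llm = Step(lambda prompt_text: f"(answer based on: {prompt_text!r})")

toy_chain = fill_prompt | fake_llm
output = toy_chain.invoke(
    {"context": "CoT splits tasks into steps.", "question": "What is CoT?"}
)
print(output)
```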

Execute the chain by calling its `invoke()` method. The `dict` passed to `invoke()` supplies the values for the variables declared in the prompt template (here `context` and `question`).

In [22]:
chain.invoke({"context":docs,"question":"What is Task Decomposition?"})

AIMessage(content='Based on the provided context, Task Decomposition is a process where complex tasks are broken down into smaller, more manageable steps. It can be accomplished through several methods:\n\n1. Chain of thought (CoT) - a prompting technique where the model is instructed to "think step by step" to break down complex tasks into simpler steps.\n\n2. Tree of Thoughts - an extension of CoT that explores multiple reasoning possibilities at each step, creating a tree structure that can be searched using BFS or DFS.\n\nTask decomposition can be implemented in three ways:\n1. Using LLM with simple prompts (e.g., "Steps for XYZ.\\n1." or "What are the subgoals for achieving XYZ?")\n2. Using task-specific instructions (e.g., "Write a story outline" for writing a novel)\n3. With human inputs\n\nThe purpose of task decomposition is to make complicated tasks more manageable and provide insight into the model\'s thinking process.', additional_kwargs={'usage': {'prompt_tokens': 371, 'co

Instead of defining our own prompts, we can use prompt templates published in the [LangChain Hub](https://smith.langchain.com/hub). Let's replace our previous prompt with one from the hub and rebuild the chain.

In [23]:
prompt = hub.pull("rlm/rag-prompt")
pprint(prompt)

# rebuild the chain
chain = prompt | llm

chain.invoke({"context":docs, "question":"What is Task Decomposition?"})


ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])


AIMessage(content="Task Decomposition is a technique where complex tasks are broken down into smaller, more manageable steps, often implemented through methods like Chain of Thought (CoT) prompting. It can be accomplished through LLM prompting, task-specific instructions, or human inputs, and helps make complicated tasks more approachable while providing insight into the model's thinking process. Advanced versions like Tree of Thoughts extend this concept by exploring multiple reasoning possibilities at each step.", additional_kwargs={'usage': {'prompt_tokens': 418, 'completion_tokens': 98, 'total_tokens': 516}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, response_metadata={'usage': {'prompt_tokens': 418, 'completion_tokens': 98, 'total_tokens': 516}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, id='run-e33b773e-fb2c-4c6b-b0aa-371ef4eccb21-0', usage_metadata={'input_tokens': 418, 'output_tokens': 98, 'tota

We can now build a basic RAG chain: instead of explicitly passing `docs` as the context, we provide the `retriever`, which queries the vector store directly.

In [24]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

result = chain.invoke("What is Task Decomposition?")
result

'Task Decomposition is a technique where complex tasks are broken down into smaller, more manageable steps, often implemented through methods like Chain of Thought (CoT) prompting. It can be accomplished through LLM prompting, task-specific instructions, or human inputs, and helps models tackle complicated problems more effectively. Advanced versions like Tree of Thoughts extend this concept by exploring multiple reasoning possibilities at each step.'
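
The first step of this chain is a `dict`, which LangChain runs as a parallel step: every value receives the same input and the results are collected under the matching keys. The sketch below imitates that behaviour in plain Python; `run_parallel`, `fake_retriever`, and `passthrough` are hypothetical names, and the canned passage is invented for the example.

```python
from typing import Any, Callable


def run_parallel(
    branches: dict[str, Callable[[Any], Any]], value: Any
) -> dict[str, Any]:
    """Call every branch with the same input and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}


def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever: returns a canned passage for any question."""
    return ["Task decomposition breaks a complex task into smaller steps."]


def passthrough(value: Any) -> Any:
    """Return the input unchanged, like the `question` branch of the chain."""
    return value


prompt_inputs = run_parallel(
    {"context": fake_retriever, "question": passthrough},
    "What is Task Decomposition?",
)
print(prompt_inputs)
```

The resulting dictionary has exactly the keys the prompt template expects, which is why the chain can be invoked with a plain question string.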

If you are interested in understanding more details about the chain, we can run it in debug mode:

In [34]:
set_debug(True)

result = chain.invoke("What are the main components of an LLM-powered autonomous agent system?")
result

set_debug(False)

[chain/start] [chain:RunnableSequence] Entering Chain run with input:
{
  "input": "What are the main components of an LLM-powered autonomous agent system?"
}
[chain/start] [chain:RunnableSequence > chain:RunnableParallel<context,question>] Entering Chain run with input:
{
  "input": "What are the main components of an LLM-powered autonomous agent system?"
}
[chain/start] [chain:RunnableSequence > chain:RunnableParallel<context,question> > chain:RunnablePassthrough] Entering Chain run with input:
{
  "input": "What are the main components of an LLM-powered autonomous agent system?"
}
[chain/end] [chain:RunnableSequence > chain:RunnableParallel<context,question> > chain:RunnablePassthrough] [1ms] Exiting Chain run with output:
{
  "output": "What are the main components of an LLM-powered autonomous agent system?"
}
[chain/end] [chain:RunnableSequence > chain:RunnableP

[LangSmith](https://smith.langchain.com/) also lets you inspect the flow. To view a trace:

- Visit [LangSmith](https://smith.langchain.com/).
- Select your project. It is named `default` if you have not changed it.
- All executed LLM chains are listed. Click one to view a breakdown of its steps.

> **WARNING** Do not enable LangSmith if the data or chain is confidential!

![example](../images/langsmith.png)




Save the results so we can compare later.

In [35]:
write_results("basic.txt", result)