# **Building Retrieval-Augmented Generation**

## **What's Covered?**
1. Introduction to RAG
    - Why RAG?
    - Benefits of RAG
    - RAG Building Blocks
    - Application Areas
2. How to Build a RAG System?
    - The Process
    - What is an Index?
    - What is a Retriever?
3. Building an End-to-End RAG Chain
    - Step 1: Initialize the Chroma DB Connection
    - Step 2: Create a Retriever Object
    - Step 3: Initialize a Chat Prompt Template
    - Step 4: Initialize a Generator (i.e. Chat Model)
    - Step 5: Initialize an Output Parser
    - Step 6: Define a RAG Chain
    - Step 7: Invoke the Chain




## **Introduction to RAG**

### **Why RAG?**
We know that LLMs can generate text on their own. But these tools aren't perfect.

Even though they're very capable, they sometimes get things wrong, especially when a task needs precise facts or the latest information. To address this, researchers at Meta AI introduced a technique called retrieval-augmented generation, or RAG for short, in 2020.

Think of it as giving the language model an assistant. This assistant searches a large, up-to-date collection of documents and feeds the most relevant and recent pieces to the LLM.

### **Benefits of RAG**  
1. **Enhanced Factual Accuracy and Domain-Specific Expertise:** Imagine a customer service chatbot trained on general conversation data. It might struggle with technical, domain-specific questions. RAG lets you integrate domain-specific knowledge bases, enabling the chatbot to handle these inquiries with expertise.
2. **Reduced Hallucination:** LLMs can generate false information, a phenomenon known as hallucination. Grounding the answer in the provided knowledge base helps support the claims of the generative model.

### **RAG Building Blocks**  
1. **Retrieval:** When a user asks a question or provides a prompt, the retriever first fetches relevant passages from a large knowledge base. This knowledge base could be the company's internal documents or any other source of text data.
2. **Augmentation:** The retrieved passages are then used to "augment" the LLM's knowledge. In the simplest case they are inserted into the prompt as context; other techniques, such as summarizing the passages or encoding their key information, can also be used.
3. **Generation:** Finally, the LLM uses its understanding of language together with the augmented information to generate a response. This response can be an answer to a question, creative text based on a prompt, and so on.

### **Application Areas**  
1. Question Answering: A RAG-powered customer care chatbot can answer customer queries by retrieving product information, FAQs, and guides to provide a well-rounded response.
2. Document Summarization: A research paper summarization tool can use RAG to retrieve relevant sections and then generate a summary highlighting the main points.
3. Creative Text Generation: A story writing assistant can use RAG to retrieve information about historical periods or fictional settings, helping the LLM generate more engaging stories.
4. Code Generation: A code completion tool can use RAG to retrieve relevant code examples and API documentation, helping developers write code more efficiently.

## **How to Build a RAG System?**

### **The Process**
**Step 1: Create an Index on the available Knowledge Base (i.e. the preprocessing strategy)**
- This step is about preprocessing and organizing the raw data so it can be searched efficiently later.
- Data from formats like PDF, HTML, etc. is cleaned and converted into plain text. This text is then divided into smaller parts (i.e. chunks), and each chunk is passed through the embedding model to turn it into a vector representation that is easy to search later.
- **Key Outcome:** You now have a vector store that contains all chunks from your knowledge base in vector format.
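
The indexing step can be sketched in plain Python. This is only a rough illustration under stated assumptions, not what Chroma or a real embedding model does internally: `chunk_text`, `toy_embed`, and `build_index` are hypothetical helpers, the word-count "embedding" is a stand-in for a real embedding model, and the sample document is made up.

```python
import re
from collections import Counter

def chunk_text(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` words that overlap by `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def toy_embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(documents):
    """Return a list of (chunk, vector) pairs, i.e. a toy vector store."""
    index = []
    for doc in documents:
        for chunk in chunk_text(doc):
            index.append((chunk, toy_embed(chunk)))
    return index

# Made-up sample document (assumption, not from the knowledge base used later)
docs = [
    "Retrieval augmented generation pairs a retriever with a language model. "
    "The retriever finds relevant chunks and the model writes the answer."
]
index = build_index(docs)
for chunk, vector in index:
    print(repr(chunk), dict(vector))
```

The overlap between neighbouring chunks is a common design choice: it reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.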

**Step 2: Create a Retriever (i.e. the search strategy)**
- This step defines how you want to search inside the index at query time. It does not store embeddings or build new indexes; it only queries the existing index.
- When someone asks a question, the retriever turns that question into a vector embedding using the same embedding model that was used for indexing. Then, it compares this vector to the vectors of the indexed chunks to find the `k` most similar chunks. These `k` most similar chunks are used in the next step as context.
- You must specify:
    - How many results to return (`k`)
    - What similarity metric to use (Euclidean, cosine, etc.)
    - Whether to use MMR, hybrid search, metadata filters, reranking, etc.
- **Key Outcome:** This stage runs in real time and is repeated for every query. When the user asks a question, the retriever:
    - embeds the question
    - runs similarity search on the existing index
    - returns top relevant chunks
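
The three retriever actions listed above can be sketched as a brute-force top-`k` search with cosine similarity. This is a minimal illustration: the helper names and the hand-written 3-dimensional vectors are assumptions, and a real vector store would use an optimized index rather than scoring every chunk.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vector, index, k=3):
    """Score every (chunk, vector) pair against the query and keep the best k."""
    scored = [(cosine_similarity(query_vector, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

# Hypothetical embeddings, written by hand for illustration only
toy_index = [
    ("chunk about pricing", [0.9, 0.1, 0.0]),
    ("chunk about shipping", [0.1, 0.9, 0.0]),
    ("chunk about refunds", [0.7, 0.3, 0.1]),
    ("chunk about careers", [0.0, 0.1, 0.9]),
]
query_vector = [1.0, 0.2, 0.0]  # pretend this is the embedded question

top_chunks = retrieve_top_k(query_vector, toy_index, k=2)
print(top_chunks)
```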

**Step 3: Generation**
- The system combines the retrieved text parts (i.e. context) with the original question to create a prompt. The LLM uses this prompt to answer the question.
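
As a rough sketch of this step, the prompt is just a string that places the retrieved chunks ahead of the question. `build_prompt` and the sample chunks below are hypothetical; the wording of the instruction line is one possible choice.

```python
def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical retrieved chunks, written by hand for illustration
retrieved_chunks = [
    "The store accepts returns within 30 days.",
    "Refunds are issued to the original payment method.",
]
prompt = build_prompt("How long do I have to return an item?", retrieved_chunks)
print(prompt)  # this string is what would be sent to the LLM
```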


### **What is a Retriever?**
Retrievers specialize in navigating large amounts of data to find information that is relevant to a specific query or context.

Retrievers focus on precisely matching the query against the data they have access to, so they rely heavily on the quality and structure of that data. Their performance depends on the relevance and accuracy of the information stored in the databases they query.

In simple terms, a retriever searches a large corpus and identifies the data relevant to a given query.

## **Building an End-to-End RAG Chain**

**Step 1: Initialize the Chroma DB Connection**  
**Step 2: Create a Retriever Object**   
**Step 3: Initialize a Chat Prompt Template**  
**Step 4: Initialize a Generator (i.e. Chat Model)**  
**Step 5: Initialize an Output Parser**   
**Step 6: Define a RAG Chain**  
**Step 7: Invoke the Chain**

In [9]:
# Step 1: Initialize the Chroma DB Connection

from langchain_chroma import Chroma

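# Initialize the database connection
# If the database exists, this connects to the given collection_name in persist_directory
# Otherwise a new collection is created
# Note: embedding_model is assumed to be defined in an earlier cell
# (it should be the same model that was used to build the index)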
db = Chroma(collection_name="vector_database", 
            embedding_function=embedding_model, 
            persist_directory="./chroma_db_")

# Count the entries already stored in the collection
print(len(db.get()["ids"]))

1004


In [10]:
# Step 2: Create a Retriever Object 

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})
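
With `search_type="similarity"`, the retriever returns the `k` nearest chunks, which may be near-duplicates of each other. The MMR (maximal marginal relevance) option listed among the search strategies trades some relevance for diversity. Below is a small pure-Python sketch of the idea; it is not the Chroma or LangChain implementation, and `mmr_select`, its `lambda_mult` parameter, and the hand-written vectors are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedy MMR over (label, vector) pairs.

    lambda_mult=1.0 ranks by relevance only; lower values penalize
    candidates that resemble something already selected.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            relevance = cosine(query_vec, item[1])
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [label for label, _ in selected]

# Hypothetical 2-dimensional embeddings; the first two are near-duplicates
candidates = [
    ("refund policy", [1.0, 0.12]),
    ("refund policy (duplicate)", [1.0, 0.0]),
    ("shipping times", [0.6, 0.8]),
]
query_vec = [1.0, 0.1]

relevance_only = mmr_select(query_vec, candidates, k=2, lambda_mult=1.0)
diverse = mmr_select(query_vec, candidates, k=2, lambda_mult=0.3)
print(relevance_only)
print(diverse)
```

In this toy data, relevance-only ranking returns both copies of the refund chunk, while the diversity-weighted run replaces the duplicate with the shipping chunk.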

In [11]:
# Step 3: Initialize a Chat Prompt Template

from langchain_core.prompts import ChatPromptTemplate

PROMPT_TEMPLATE = """
Answer the question based only on the following context:
{context}
Answer the question based on the above context: {question}.
Provide a detailed answer.
Don’t justify your answers.
Don’t give information not mentioned in the CONTEXT INFORMATION.
Do not say "according to the context" or "mentioned in the context" or similar.
"""

prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

In [12]:
# Step 4: Initialize a Generator (i.e. Chat Model)

from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(openai_api_key=OPENAI_API_KEY)

In [13]:
# Step 5: Initialize an Output Parser

from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

In [15]:
# Step 6: Define a RAG Chain

from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
    
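# Retrieve and format the context, pass the question through unchanged,
# then fill the prompt, call the chat model, and parse the output to a string
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | chat_model
    | parser
)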
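
To see what the `|` composition is doing conceptually, here is a toy re-creation in plain Python. It is a sketch only and not how LangChain implements its runnables: `Step`, `parallel`, and all the `fake_*` objects are hypothetical stand-ins. The `parallel` helper plays the role of the leading dict, which runs each branch on the same input and collects the results under the given keys.

```python
class Step:
    """Toy composable step (an assumption for illustration, not a LangChain class)."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

def parallel(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})

# Hypothetical stand-ins for the retriever, formatter, prompt, and model
fake_retriever = Step(lambda question: ["chunk one", "chunk two"])
format_chunks = Step(lambda chunks: "\n\n".join(chunks))
passthrough = Step(lambda value: value)
fake_prompt = Step(lambda d: f"Context:\n{d['context']}\n\nQuestion: {d['question']}")
fake_model = Step(lambda prompt: f"(reply to a prompt of {len(prompt)} characters)")

toy_chain = (
    parallel(context=fake_retriever | format_chunks, question=passthrough)
    | fake_prompt
    | fake_model
)
print(toy_chain.invoke("What is RAG?"))
```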

In [17]:
# Step 7: Invoke the Chain

query = 'Who is Rachem?'

rag_chain.invoke(query)

'Rachem is not a person, it is a misinterpretation of the name Rachel. The speaker is questioning what the term "Rachem" means and if it is a term related to paleontology. They clarify that they wouldn\'t know because they are just a waitress.'

In [18]:
# Step 7: Invoke the Chain (second example query)

query = 'What is there on the List comparing Rachel and Julie?'

rag_chain.invoke(query)

'Rachel is described as being a waitress, while Julie is mentioned to have a lot in common with the speaker as they are both paleontologists.'