### Retrieval-Augmented Generation (RAG) Model for Question Answering Bot

#### Objective
- The goal of this project is to build a QA bot using a Retrieval-Augmented Generation (RAG) approach: information retrieval (Pinecone as the vector database) is combined with text generation (OpenAI GPT) to produce coherent, contextually relevant answers.

#### Architecture and Flow
1. **Data Loading and Preprocessing**: Load documents (PDFs are used here; the loader can easily be swapped for other data sources) and split them into chunks of text.
2. **Embeddings**: Each chunk of text is converted into an embedding using an OpenAI embedding model (`text-embedding-3-small`; `text-embedding-ada-002` is left commented out in the code as an alternative).
3. **Vector Storage**: These embeddings are stored in Pinecone, a vector database optimized for similarity search.
4. **Retrieval**: When a user submits a query, relevant document chunks are retrieved from Pinecone based on similarity.
5. **Generative Model**: A generative model (GPT-3.5-turbo) uses the retrieved document chunks to generate a coherent response to the query.
6. **Question Answering**: The system responds with an answer that is grounded in the retrieved information.
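
The retrieval part of this flow (steps 2 to 4) can be illustrated in miniature without any API keys. The sketch below is only a toy: word counts stand in for the OpenAI embeddings, an in-memory list stands in for Pinecone, and the helper names (`embed`, `cosine`, `retrieve`) and the sample sentences are invented for illustration, not taken from the notebook or the PDF.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Invented sample chunks, used only to exercise the functions above.
toy_chunks = [
    "The fiscal deficit is estimated as a share of GDP.",
    "Customs duty on selected imports is revised.",
    "Income tax slabs are revised under the new regime.",
]
top = retrieve("income tax slabs", toy_chunks, k=1)
print(top)
```

In the real pipeline the retrieved chunks are then pasted into the prompt for the generative model (steps 5 and 6).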

#### Tools Used
- **LangChain**: For managing prompts, chains, and interaction between components.
- **Pinecone**: As a vector database for efficient storage and retrieval of document embeddings.
- **OpenAI**: For embedding and generating responses.


In [176]:
# Importing Required Libraries

from pinecone import Pinecone as PC
from pinecone import ServerlessSpec
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.schema import SystemMessage, HumanMessage


In [41]:
from dotenv import load_dotenv
import os
load_dotenv()

True

In [159]:
## Let's read the documents
""" Preprocessing such as removing extra whitespace, special characters, and stop words,
    lemmatisation, tokenisation, or dedicated extraction tools for tables and images
    may improve answer quality for a specific use case.

    - For text files, TextLoader() from the LangChain document loaders can be used.
    - For CSV files, create_csv_agent from the LangChain agent toolkits can be used.
    """


def read_doc(doc):
    loader = PyPDFLoader(doc)
    docs = loader.load()
    return docs
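
As a minimal sketch of the whitespace cleanup mentioned in the preprocessing note above, the hypothetical helper below (`clean_text` is not part of the notebook) collapses runs of whitespace and re-joins words split by a hyphen at a line break, which PDF extraction often produces. It is an assumption about what helps for this kind of document; note that it would also join genuinely hyphenated compounds that happen to break across lines.

```python
import re


def clean_text(text: str) -> str:
    """Normalise whitespace and re-join words hyphenated across line breaks."""
    text = re.sub(r"-\s*\n\s*", "", text)  # "alloca-\ntion" -> "allocation"
    text = re.sub(r"\s+", " ", text)       # collapse spaces, tabs, and newlines
    return text.strip()


sample = "Total   alloca-\ntion for\n\n the  scheme "
cleaned = clean_text(sample)
print(cleaned)
```

One way to apply it would be to loop over the loaded documents and reassign `d.page_content = clean_text(d.page_content)` before chunking.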

In [30]:
doc = read_doc("budget_speech.pdf")

In [33]:
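## Chunking the data read from the PDF

def chunk_data(docs, chunk_size=800, chunk_overlap=50):
    # Split into chunks of at most chunk_size characters; neighbouring chunks may share up to chunk_overlap characters
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = text_splitter.split_documents(docs)
    return chunks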

In [34]:
chunks = chunk_data(docs=doc)
len(chunks)

141
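
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks), so real chunk boundaries will differ; `sliding_chunks` is a hypothetical helper used only to show how the two parameters interact.

```python
def sliding_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 50) -> list[str]:
    """Fixed-size character windows; each window starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is never made up of overlap only.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


pieces = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)
```

Each piece repeats the last character of the previous one, which is the overlap that helps keep a sentence cut at a boundary visible in both chunks.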

In [164]:
# configuring the embedding model

# embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")


In [96]:
# Connecting to Pinecone and creating an index

import time

pc = PC(api_key=os.getenv("PINECONE_API_KEY"))
index_name = "docs-rag"

# check if index already exists (it shouldn't if this is first time)
if not pc.has_index(index_name):
    # if does not exist, create index
    pc.create_index(
        name=index_name,
        dimension=1536,  # must match the embedding model's output size (1536 for text-embedding-3-small)
        metric="cosine",
        spec=ServerlessSpec(
            cloud='aws', 
            region='us-east-1'
        ) 
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)
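
The readiness loop above polls forever if the index never becomes ready. A possible variant, sketched here with a hypothetical generic helper (`wait_until` is not a Pinecone function), adds a timeout. The readiness check is passed in as a callable, and a fake check is used below so the snippet runs without a Pinecone account.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll is_ready() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    return is_ready()  # one last check at the deadline


# Stand-in readiness check that succeeds on the third call.
calls = {"n": 0}


def fake_ready() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


ok = wait_until(fake_ready, timeout=5.0, interval=0.01)
print(ok, calls["n"])
```

With the client from this notebook, the check could be `lambda: pc.describe_index(index_name).status['ready']`, and a `False` return value can then be turned into an error message instead of an endless wait.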

In [97]:
# Embedding the chunks and Storing in Pinecone
index=Pinecone.from_documents(chunks,embedding_model,index_name=index_name)

In [154]:
# Configuring a Chat model
chat = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.2
)

## Using Groq API
# from langchain_groq import ChatGroq

# chat=ChatGroq(groq_api_key=os.getenv("GROQ_API_KEY"),
#              model_name="llama-3.1-70b-versatile")

In [155]:
# Using PromptTemplate for augmenting the prompt with retrieved knowledge

def augment_prompt_with_template(query, k=5):
    # Retrieve the top-k relevant chunks from the vector store (k defaults to 5)
    results = index.similarity_search(query, k=k)
    
    # Extract the text content from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # Metadata of the top-ranked chunk (empty if nothing was retrieved)
    source = results[0].metadata if results else {}
    
    # Define the prompt template with placeholders
    template = """
                Your task is to create an answer to the user's query using the information
                from the context provided. Follow these steps to generate the response:

                Step 1: Analyze the user-provided query: {query}
                Step 2: Review the relevant context provided: {contexts}
                Step 3: Generate a concise, clear, and informative response based on the context, 
                        ensuring it addresses the query and maintains accuracy.
    """
    
    # Initialize the LangChain PromptTemplate
    prompt_template = PromptTemplate(
        input_variables=["contexts", "query"],
        template=template
    )
    
    # Create the final augmented prompt by filling the template
    augmented_prompt = prompt_template.format(contexts=source_knowledge, query=query)
    
    return augmented_prompt, source
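
The function above reports only the metadata of the top-ranked chunk. If it is useful to show where every retrieved chunk came from, one option is to tag each chunk with its page and collect the unique sources. The sketch below uses plain dicts in place of LangChain `Document` objects so that it runs on its own; `format_context` and the sample data are hypothetical.

```python
def format_context(results: list[dict]) -> tuple[str, list[dict]]:
    """Join retrieved chunks, tagging each with its page, and collect unique sources."""
    if not results:
        return "", []
    parts, sources = [], []
    for r in results:
        meta = r["metadata"]
        parts.append(f"[page {meta.get('page', '?')}] {r['page_content']}")
        if meta not in sources:  # keep each (page, source) pair once
            sources.append(meta)
    return "\n".join(parts), sources


# Plain dicts stand in for LangChain Document objects here.
fake_results = [
    {"page_content": "first chunk", "metadata": {"page": 3, "source": "example.pdf"}},
    {"page_content": "second chunk", "metadata": {"page": 3, "source": "example.pdf"}},
    {"page_content": "third chunk", "metadata": {"page": 7, "source": "example.pdf"}},
]
context, sources = format_context(fake_results)
print(context)
print(sources)
```

The page tags also give the model something concrete to cite if the prompt asks it to mention page numbers.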

In [156]:
# Augment prompt with vectorstore results using PromptTemplate
def run_augmented_prompt(query):
    
    # Create initial message history
    messages = [
        SystemMessage(content="You are a helpful assistant.")
    ]
    
    # Augment the user query with context from vectorstore using PromptTemplate
    augmented_prompt_text, source = augment_prompt_with_template(query)
    augmented_prompt = HumanMessage(content=augmented_prompt_text)
    
    # Append augmented prompt to messages
    messages.append(augmented_prompt)
    
    # Interact with the chat model (invoke() is the current call style; calling chat(messages) directly is deprecated)
    res = chat.invoke(messages)
    return res.content, source

### Testing RAG with queries

In [157]:
query1 = "What is the total budget for 2023?"
answer, source = run_augmented_prompt(query1)
print(f'Answer: {answer} \nTop Source: {source}')

Answer: The total budget for 2023-24 is estimated at ` 45 lakh crore, with total receipts other than borrowings at ` 27.2 lakh crore. The net tax receipts are estimated at ` 23.3 lakh crore. The fiscal deficit is projected to be 5.9 per cent of GDP, with net market borrowings from dated securities estimated at ` 11.8 lakh crore. 
Top Source: {'page': 28.0, 'source': 'budget_speech.pdf'}


In [158]:
query2 = "What is the income tax rate for 9-12 lakh?"
answer, source = run_augmented_prompt(query2)
print(f'Answer: {answer} \nTop Source: {source}')

Answer: The income tax rate for the income range of 9-12 lakh is 15%. 
Top Source: {'page': 37.0, 'source': 'budget_speech.pdf'}


In [160]:
query3 = "What is the rate of excise duty on Cigarettes of tobacco substitutes?"
answer, source = run_augmented_prompt(query3)
print(f'Answer: {answer} \nTop Source: {source}')

Answer: The rate of excise duty on Cigarettes of tobacco substitutes is `600 per 1000 sticks to `690 per 1000 sticks, effective from 02.02.2023. 
Top Source: {'page': 57.0, 'source': 'budget_speech.pdf'}


- If advanced RAG is needed for a specific use case, optimization techniques such as self-querying, query expansion, hybrid search, re-ranking, and filtering can be added.
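
As one illustration of hybrid search, the results of a vector search and a keyword search can be merged with reciprocal rank fusion (RRF), where each document scores the sum of `1 / (k + rank)` over the lists it appears in. This is a sketch of a common fusion method, not something the notebook implements; the chunk ids and both rankings below are made up, and `k=60` is the conventional default for RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids; score = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["c2", "c1", "c3"]    # hypothetical vector-search order
keyword_hits = ["c1", "c4", "c2"]  # hypothetical keyword-search order
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

Chunks that rank well in both lists rise to the top, and the fused list can then be truncated to the top `k` chunks before building the prompt.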

#### Thank You !