<p style = "font-size : 42px; color : #393e46 ; font-family : 'Comic Sans MS'; text-align : center; background-color : #00adb5; border-radius: 5px 5px;"><strong>MultiDocs Q&A With RAG</strong></p>

<p style = "font-size : 25px; color : #34656d ; font-family : 'Comic Sans MS';"><strong>Objective :</strong></p>
<p style = "font-size : 17px; color : #810000 ; font-family : 'Comic Sans MS'; ">The objective of this Kaggle notebook is to design, implement, and demonstrate a question-answering system built on Retrieval-Augmented Generation (RAG). The system ingests multiple documents (a PDF, a PowerPoint deck, and a Word document) as its knowledge base, retrieves the passages most relevant to a user query, and passes them to a language model, which generates an answer grounded in those passages.</p>

<p style = "font-size : 25px; color : #34656d ; font-family : 'Comic Sans MS';"><strong>What is RAG ?</strong></p>

<ul>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; ">RAG is a technique for augmenting LLM knowledge with additional data.</li>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; ">LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on. If you want to build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the knowledge of the model with the specific information it needs. The process of bringing the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).</li>
</ul>

<p style = "font-size : 25px; color : #34656d ; font-family : 'Comic Sans MS';"><strong>RAG Architecture</strong></p>
<p style = "font-size : 15px; color : #810000; font-family : 'Comic Sans MS';"><strong>A typical RAG application has two main components: </strong></p>

<ol>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>Indexing:</strong> a pipeline for ingesting data from a source and indexing it. This usually happens offline.</li>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>Retrieval and generation:</strong> The actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.
</li>
</ol>

<p style = "font-size : 25px; color : #34656d ; font-family : 'Comic Sans MS';"><strong>Indexing:</strong></p>
<ol>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>Load:</strong> First we need to load our data. This is done with DocumentLoaders.</li>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>Split:</strong> Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.</li>
<li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>Store:</strong> We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.</li>
</ol>

<p style = "font-size : 25px; color : #34656d ; font-family : 'Comic Sans MS';"><strong>Retrieval and Generation</strong></p>
<ol>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>Retrieve:</strong> Given a user input, relevant splits are retrieved from storage using a Retriever.</li>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>Generate:</strong> A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.</li>
</ol>

<a id = '0'></a>
<p style = "font-size : 35px; color : #34656d ; font-family : 'Comic Sans MS'; text-align : center; background-color : #f9b208; border-radius: 5px 5px;"><strong>Table of Contents</strong></p>

* [Indexing](#1.0)
    * [Data Loading](#1.1)
    * [Data Extraction](#1.2)
    * [Chunking](#1.3)
    * [Embeddings Creation](#1.4)
    * [Indexing](#1.5)
    
    
* [Retrieval and Generation](#2.0)
    * [Retriever](#2.1)
    * [LLM Model](#2.2)


* [Results](#3.0)
* [Conclusion](#4.0)

In [None]:
# Installing Required Libraries
%pip install python-docx
%pip install python-pptx
%pip install PyPDF2
%pip install langchain
%pip install langchain_community
%pip install langchain_google_genai
%pip install langchain_text_splitters
%pip install sentence-transformers
%pip install faiss-cpu
%pip install cohere

In [None]:
# necessary Imports
from docx import Document
from PyPDF2 import PdfReader
from pptx import Presentation
from langchain_community.llms import Cohere
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import AIMessage, HumanMessage
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts  import PromptTemplate, ChatPromptTemplate, MessagesPlaceholder

<a id = '1.1'></a>
<p style = "font-size : 20px; color : #34656d ; font-family : 'Comic Sans MS'; "><strong>Data Loading</strong></p>
<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">For this notebook, I use three different types of documents.</p>
<ul>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>PDF :-</strong> NCERT Class 10 History textbook.</li>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>PPT :-</strong> Project presentation.</li>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>DOCS :- </strong>Project synopsis report.</li>
</ul>

In [None]:
pdf_file = open('/kaggle/input/ncert-class-10-history/NCERT-Class-10-History.pdf','rb')
ppt_file = Presentation("/kaggle/input/mid-report-ppt/Nitesh_PPT.pptx")
doc_file = Document('/kaggle/input/final-report-synopsis/final_project synopsis.docx')

<a id = '1.2'></a>
<p style = "font-size : 20px; color : #34656d ; font-family : 'Comic Sans MS'; "><strong>Data Extraction</strong></p>
<ul>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>PDF :-</strong> PDF text is extracted page by page using PyPDF2 and stored in a single string.</li>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>PPT :-</strong> PPT text is extracted from every shape on every slide using the python-pptx module and stored in a single string.</li>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; "><strong>DOCS :- </strong>DOCX text is extracted paragraph by paragraph using the python-docx module and stored in a single string.</li>
</ul>
<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">After extracting the text from each source separately, I combine everything into a single string for further text processing.</p>

In [None]:
# extracting pdf data
pdf_text = ""
pdf_reader = PdfReader(pdf_file)
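for page in pdf_reader.pages:
    # extract_text() can come back empty for image-only pages; keep a newline between pages
    pdf_text += (page.extract_text() or "") + '\n'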

# extracting ppt data
ppt_text = ""
for slide in ppt_file.slides:
    for shape in slide.shapes:
        if hasattr(shape, "text"):
            ppt_text += shape.text + '\n'

# extracting doc data
doc_text = ""
for paragraph in doc_file.paragraphs:
    doc_text += paragraph.text + '\n'

In [None]:
# merging all the text

all_text = pdf_text + '\n' + ppt_text + '\n' + doc_text
len(all_text)

<a id = '1.3'></a>
<p style = "font-size : 20px; color : #34656d ; font-family : 'Comic Sans MS'; "><strong>Chunking</strong></p>
<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">In this step I split the combined text into chunks using the Recursive Character Text Splitter, which breaks large documents into smaller pieces. This is useful both for indexing the data and for passing it to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.</p>
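<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">To see how a chunk size and an overlap interact, here is a minimal sketch of a fixed-size sliding-window chunker. It is a simplified stand-in and not the algorithm the Recursive Character Text Splitter actually uses (that one also tries to cut at separators); the function name <code>chunk_with_overlap</code> and the sample string are made up for illustration.</p>

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size sliding-window chunker (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">Each chunk starts with the last four characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.</p>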

In [None]:
# splitting the text into chunks for embeddings creation

text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200, # Overlap keeps context that would otherwise be cut off at chunk boundaries.
        length_function = len,
        separators=['\n\n', '\n', ' ', ''] # Tried in order: paragraph breaks first, then lines, words, characters.
    )

chunks = text_splitter.split_text(text = all_text)

In [None]:
len(chunks)

In [None]:
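import os

# Replace each placeholder with your own key; avoid committing real keys to a public notebook.
os.environ['HuggingFaceHub_API_Token'] = "<HuggingFaceHub_API_Token>"
os.environ['GOOGLE_API_KEY'] = "<GOOGLE_API_KEY>"
os.environ['cohere_api_key'] = "<cohere_api_key>"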

<a id = '1.4'></a>
<p style = "font-size : 20px; color : #34656d ; font-family : 'Comic Sans MS'; "><strong>Embeddings Creation</strong></p>

<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">Embeddings creation is a crucial preprocessing step in the development of document-based Question and Answering (Q&A) systems. This process involves converting textual data from documents and questions into dense, high-dimensional vectors known as embeddings. These embeddings are designed to capture the semantic meaning of words, sentences, or even entire documents, enabling the Q&A system to understand and process natural language more effectively.</p>
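<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">What "capturing semantic meaning" means in practice is that texts with similar meaning get vectors pointing in similar directions, which is commonly measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors; the values are hypothetical and chosen only for illustration, while a real embeddings model produces much longer vectors.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings": the first two texts are about the same topic.
toy_embeddings = {
    "swadeshi movement": [0.9, 0.1, 0.0],
    "boycott of foreign cloth": [0.8, 0.3, 0.1],
    "model deployment": [0.0, 0.2, 0.9],
}

query = toy_embeddings["swadeshi movement"]
for text, vector in toy_embeddings.items():
    print(f"{text!r}: {cosine_similarity(query, vector):.3f}")
```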

In [None]:
# Initializing embeddings model

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

<a id = '1.5'></a>
<p style = "font-size : 20px; color : #34656d ; font-family : 'Comic Sans MS'; "><strong>Indexing</strong></p>
<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">Indexing data using Facebook AI Similarity Search (FAISS) is a pivotal step in developing efficient and scalable document-based Question and Answering (Q&A) systems. FAISS is a library that facilitates the efficient search for similarities in large datasets, especially useful for tasks involving high-dimensional vectors like text embeddings. When applied to document-based Q&A, FAISS indexes the embeddings of document chunks (e.g., paragraphs, sentences) to optimize the retrieval process.</p>
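<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">Conceptually, the simplest kind of vector index compares the query vector with every stored vector and keeps the closest ones. As far as I know, this exhaustive L2 search is also what LangChain's FAISS wrapper sets up by default, but treat that as an assumption. The sketch below does the same thing in plain Python; the names <code>l2_squared</code> and <code>search</code> and the 2-dimensional vectors are made up for illustration, and FAISS itself is far faster.</p>

```python
import heapq


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(index, query, k):
    """Return the k closest (distance, position) pairs by scanning every vector."""
    scored = ((l2_squared(vector, query), pos) for pos, vector in enumerate(index))
    return heapq.nsmallest(k, scored)


# Hypothetical 2-dimensional "chunk embeddings"; a list position plays the role of a chunk id.
index = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
hits = search(index, query=[1.0, 1.0], k=2)
print(hits)
```

<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">The positions in <code>hits</code> are what a retriever would map back to the original text chunks.</p>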

In [None]:
# Indexing the data using FAISS
vectorstore = FAISS.from_texts(chunks, embedding = embeddings)

<a id = '2.1'></a>
<p style = "font-size : 20px; color : #34656d ; font-family : 'Comic Sans MS'; "><strong>Retriever</strong></p>
<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">In the development of document-based Question and Answering (Q&A) systems, creating a retriever is a crucial step that directly impacts the system's ability to find relevant information efficiently. The retriever utilizes the pre-indexed embeddings of document chunks, searching through them to find the most relevant pieces of content in response to a user query. This process involves setting up a retrieval mechanism that leverages similarity search to identify the best matches for the query embeddings within the indexed data.</p>

In [None]:
# creating retriever
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [None]:
retrieved_docs = retriever.invoke("How did the Swadeshi Movement influence Indian industries in the early 20th century?")

In [None]:
len(retrieved_docs)

In [None]:
print(retrieved_docs[0].page_content)

<a id = '2.2'></a>
<p style = "font-size : 20px; color : #34656d ; font-family : 'Comic Sans MS'; "><strong>LLM Models</strong></p>

<ul>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; ">Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand, generate, and interact with human language in a way that mimics human-like understanding. These models are trained on vast amounts of text data, allowing them to grasp the nuances of language, including grammar, context, and even cultural references. The capabilities of LLMs extend beyond simple text generation; they can perform a variety of tasks such as translation, summarization, question answering, and even code generation.</li>
    <li style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS'; ">One of the key technologies behind LLMs is the Transformer architecture, which enables the model to pay attention to different parts of the input text differently, thereby understanding the context and relationships between words and phrases more effectively. This architecture has led to significant improvements in natural language processing tasks and is the foundation of many state-of-the-art LLMs.</li>
</ul>

<p style = "font-size : 20px; color : #34656d ; font-family : 'Comic Sans MS'; ">Cohere LLM</p>

In [None]:
prompt_template = """Answer the question as precisely as possible using the provided context. If the answer is
not contained in the context, say "answer not available in context".

Context:
{context}

Question:
{question}

Answer:"""

prompt = PromptTemplate.from_template(template=prompt_template)

In [None]:
# Join the page content of the documents returned by the FAISS retriever into a single string.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [None]:
# RAG Chain

def generate_answer(question):
    cohere_llm = Cohere(model="command", temperature=0.1, cohere_api_key = os.getenv('cohere_api_key'))

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | cohere_llm
        | StrOutputParser()
    )

    return rag_chain.invoke(question)
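
<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">The piped chain above can be hard to read at first, so here is a rough plain-Python equivalent of its data flow. Everything in it is a stand-in: <code>Doc</code>, <code>fake_retriever</code>, <code>fake_llm</code>, <code>TEMPLATE</code>, and <code>plain_rag_chain</code> are hypothetical names, and no real retriever or model is called. It only shows which value goes where.</p>

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def fake_retriever(question):
    # Stand-in for the retriever: always returns the same two chunks.
    return [Doc("Chunk one."), Doc("Chunk two.")]


def fake_llm(prompt_text):
    # Stand-in for the model: reports the prompt size instead of answering.
    return f"(model received {len(prompt_text)} characters)"


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"


def plain_rag_chain(question):
    # Step 1: the dict sends the same question down two branches.
    inputs = {
        "context": format_docs(fake_retriever(question)),  # retriever | format_docs
        "question": question,                               # RunnablePassthrough()
    }
    # Step 2: the prompt template is filled with both values.
    prompt_text = TEMPLATE.format(**inputs)
    # Step 3: the model sees the filled prompt; the output parser would
    # then hand its text back as a plain string.
    return prompt_text, fake_llm(prompt_text)


prompt_text, answer = plain_rag_chain("What is the primary goal of the project?")
print(prompt_text)
print(answer)
```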

<a id = '3.0'></a>
<p style = "font-size : 20px; color : #34656d ; font-family : 'Comic Sans MS'; "><strong>Results</strong></p>

In [None]:
ans = generate_answer("How did the Swadeshi Movement influence Indian industries in the early 20th century?")
print(ans)

In [None]:
ans = generate_answer("Who is Virat Kohli?")
print(ans)

In [None]:
ans = generate_answer("How did the East India Company contribute to the opium trade with China in the 19th century?")
print(ans)

In [None]:
ans = generate_answer("What was the impact of British manufactured goods on the Indian market during the 19th century?")
print(ans)

In [None]:
ans = generate_answer("What is the primary goal of the project?")
print(ans)

In [None]:
ans = generate_answer("Which machine learning algorithms are utilized in the project?")
print(ans)

In [None]:
ans = generate_answer("What preprocessing techniques are used in the project?")
print(ans)

In [None]:
ans = generate_answer("How was the project deployed?")
print(ans)

<a id = '4.0'></a>
<p style = "font-size : 20px; color : #34656d ; font-family : 'Comic Sans MS'; "><strong>Conclusion</strong></p>
<p style = "font-size : 15px; color : #810000 ; font-family : 'Comic Sans MS' ">In conclusion, this Kaggle notebook demonstrated Retrieval-Augmented Generation (RAG) for multi-document question answering. Text from a PDF, a PowerPoint deck, and a Word document was chunked, embedded, and indexed with FAISS, and a Cohere model answered questions using the retrieved chunks. The example queries cover the history textbook and the project documents, plus one out-of-scope question to exercise the prompt's fallback response.</p>


<p style = "font-size : 13px; color : #810000 ; font-family : 'Comic Sans MS' ">
If you found this helpful, an upvote would be very much appreciated :-)</p>