<a href="https://colab.research.google.com/github/darthgera123/RAG-Agents/blob/main/RAG_PDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG PDF
This notebook takes a PDF as input and uses a retrieval-augmented generation (RAG) agent to answer user queries about it in natural language.

In [1]:
!pip install langchain pypdf openai chromadb tiktoken

Collecting langchain
  Downloading langchain-0.2.12-py3-none-any.whl.metadata (7.1 kB)
Collecting pypdf
  Downloading pypdf-4.3.1-py3-none-any.whl.metadata (7.4 kB)
Collecting openai
  Downloading openai-1.38.0-py3-none-any.whl.metadata (22 kB)
Collecting chromadb
  Downloading chromadb-0.5.5-py3-none-any.whl.metadata (6.8 kB)
Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting langchain-core<0.3.0,>=0.2.27 (from langchain)
  Downloading langchain_core-0.2.28-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_text_splitters-0.2.2-py3-none-any.whl.metadata (2.1 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.96-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting httpx<1,>=0.23.0 

In [3]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.2.11-py3-none-any.whl.metadata (2.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.21.3-py3-none-any.whl.metadata (7.1 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading mypy_extensions-1.0.0-py3-none-any.whl.metadata (1.1 kB)
Downloading langchain_community-0.2.11-py3-none-any.whl (2.3 MB)
Downloading dataclasses_json-0.6.7-py3-none-any.whl (

In [5]:
!pip install pymupdf

Collecting pymupdf
  Downloading PyMuPDF-1.24.9-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.9 (from pymupdf)
  Downloading PyMuPDFb-1.24.9-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.9-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
Downloading PyMuPDFb-1.24.9-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.9 MB)
Installing collected packages: PyMuPDFb, pymupdf
Successfully installed PyMuPDFb-1.24.9 pymupdf-1.24.9


## Import Libraries

In [4]:
import numpy as np
import json
import openai
import os
from google.colab import userdata
from sklearn.metrics.pairwise import cosine_similarity
import pypdf
import langchain
# Third-party integrations live in langchain_community as of LangChain 0.2;
# the old langchain.* paths still resolve but emit deprecation warnings.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions
import fitz  # PyMuPDF

openai.api_key = userdata.get('OPEN_AI_API_KEY')
os.environ['OPENAI_API_KEY'] = userdata.get('OPEN_AI_API_KEY')
client = openai.Client()

## Extract Text

## Load files

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Create Database
Define the steps that read the input PDFs, extract their text, split it into chunks, embed each chunk with the chosen embedding model, and load the results into a vector database.

### Extract Contents

In [31]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file using PyMuPDF.

    Args:
        pdf_path (str): The path to the PDF file.

    Returns:
        str: The extracted text from the PDF.
    """
    # The context manager closes the document even if extraction fails.
    with fitz.open(pdf_path) as pdf:
        # Concatenate the plain text of every page in document order.
        return "".join(page.get_text() for page in pdf)

### Tokenize contents
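Split the extracted text into chunks of at most 1,000 characters, with a 200-character overlap between neighbouring chunks. Lengths are measured in characters (`len`), not tokens.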

In [26]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
chunks = text_splitter.create_documents([pdf_text])
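To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sample: 2,500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
demo_chunks = sliding_window_chunks(sample)
print([len(c) for c in demo_chunks])
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That shared tail is what keeps a sentence that straddles a boundary retrievable from at least one chunk.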



### Embed Chunks
Embed the chunks with OpenAI embeddings and load them into a Chroma vector database persisted to the `chromadb_store` directory.

In [35]:
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    chunks,
    embedding=embeddings,
    persist_directory="chromadb_store",
)
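Once the chunks are embedded, retrieval amounts to finding the stored vectors closest to the query vector. The sketch below illustrates that idea with cosine similarity over hand-written toy vectors. It is an assumption-laden illustration, not Chroma's implementation: the index structure and distance metric Chroma actually uses depend on its configuration, and the names `cosine`, `top_k`, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 3-dimensional "embeddings"; real OpenAI embeddings have far more dimensions.
toy_chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vector = [1.0, 0.0, 0.0]
nearest = top_k(toy_query_vector, toy_chunk_vectors, k=2)
print(nearest)
```

In the notebook this ranking is delegated to the vector store, so none of it has to be written by hand.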

## Prompt Engineering

In [41]:
instruction_prompt = (
    "**Analyze:** Carefully examine the provided images and text context.\n"
    "**Synthesize:** Integrate information from both the visual and textual elements.\n"
    "**Reason:** Deduce logical connections and inferences to address the question.\n"
    "**Respond:** Provide a concise, accurate answer in the following format:\n\n"
    "* **Question:** {question}\n"
    "* **Answer:** [Direct response to the question]\n"
    "* **Explanation:** [Bullet-point reasoning steps if applicable]\n"
    "* **Source:** [name of the file, page, image from where the information is cited]\n\n"
    "**Ambiguity:** If the context is insufficient to answer, respond 'Not enough context to answer.'"
)
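# A retrieval chain injects the retrieved chunks through {context}, so the
# template needs that placeholder in addition to {question}.
custom_prompt_template = f"Context: {{context}}\n\n{instruction_prompt}\n\nQuestion: {{question}}\n\nAnswer:"
prompt = PromptTemplate(template=custom_prompt_template, input_variables=["context", "question"])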
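The cell above builds its template with an f-string, where doubled braces (`{{question}}`) survive as single-brace placeholders for later substitution. The following sketch shows that two-stage filling with plain `str.format`, which behaves similarly to LangChain's default f-string template format for simple placeholders. The short instruction text, the `{context}` slot, and the sample values are assumptions made for illustration.

```python
# Stage 1: the f-string inserts `instruction` now and leaves {context} and
# {question} behind as literal placeholders (doubled braces escape them).
instruction = "Answer using only the context."  # hypothetical short instruction
template = f"{instruction}\n\nContext: {{context}}\n\nQuestion: {{question}}\n\nAnswer:"

# Stage 2: the placeholders are filled at query time.
filled = template.format(
    context="The Light Stage captures multi-view images.",  # made-up chunk
    question="What does the Light Stage capture?",
)
print(filled)
```

Keeping the two stages separate lets the fixed instructions be written once while the context and question change on every call.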

In [45]:
from langchain.prompts import PromptTemplate

custom_prompt_template = PromptTemplate(
    # The "stuff" chain fills {context} with the retrieved chunks.
    input_variables=["context", "question"],
    template=(
        "**Analyze:** Carefully examine the provided images and text context.\n"
        "**Synthesize:** Integrate information from both the visual and textual elements.\n"
        "**Reason:** Deduce logical connections and inferences to address the question.\n"
        "**Respond:** Provide a concise, accurate answer in the following format:\n\n"
        "* **Question:** {question}\n"
        "* **Answer:** [Direct response to the question]\n"
        "* **Explanation:** [Bullet-point reasoning steps if applicable]\n"
        "* **Source:** [name of the file, page, image from where the information is cited]\n\n"
        "**Ambiguity:** If the context is insufficient to answer, respond 'Not enough context to answer.'\n\n"
        "Context:\n{context}"
    )
)

## RAG Agent
Define the RAG chain and pass it a retriever built from the vector database. `RetrievalQA.from_chain_type` does not accept a `prompt` argument directly (doing so raises `ValidationError: extra fields not permitted`), so the custom prompt is passed through `chain_type_kwargs`.

In [47]:
llm = ChatOpenAI(model_name="gpt-4", temperature=0.2)

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    # Forward the custom prompt to the underlying "stuff" documents chain.
    chain_type_kwargs={"prompt": custom_prompt_template},
)

## Parse queries

In [None]:


def generate_answer(query, retrieved_docs, openai_llm):
    """
    Builds a prompt from the retrieved text and asks the LLM to answer.

    Args:
        query (str): The user's question.
        retrieved_docs (list[str]): Text of the retrieved chunks.
        openai_llm (callable): Takes a prompt string and returns the answer.
    """
    # Stuff all retrieved chunks into one context block.
    context = "\n".join(retrieved_docs)
    instruction_prompt = (
        "**Analyze:** Carefully examine the provided images and text context.\n"
        "**Synthesize:** Integrate information from both the visual and textual elements.\n"
        "**Reason:** Deduce logical connections and inferences to address the question.\n"
        "**Respond:** Provide a concise, accurate answer in the following format:\n\n"
        f"* **Question:** {query}\n"
        "* **Answer:** [Direct response to the question]\n"
        "* **Explanation:** [Bullet-point reasoning steps if applicable]\n"
        "* **Source:** [name of the file, page, image from where the information is cited]\n\n"
        "**Ambiguity:** If the context is insufficient to answer, respond 'Not enough context to answer.'"
    )
    prompt = f"Context: {context}\n\n{instruction_prompt}\n\nQuestion: {query}\n\nAnswer:"
    response = openai_llm(prompt)
    return response
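`generate_answer` joins `retrieved_docs` with `"\n".join(...)`, which only works when every item is a string. LangChain retrievers typically return document objects whose text lives in a `page_content` attribute, so a small normalising step may be needed first. The sketch below is one way to do that; `docs_to_context` and the stand-in `FakeDocument` class are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a retriever result that exposes its text as `page_content`."""
    page_content: str


def docs_to_context(retrieved_docs, separator="\n"):
    """Join retrieved items into one context string, accepting str or document objects."""
    texts = []
    for doc in retrieved_docs:
        # Plain strings pass through; anything else is assumed to carry page_content.
        texts.append(doc if isinstance(doc, str) else doc.page_content)
    return separator.join(texts)


mixed = ["plain string chunk", FakeDocument("chunk wrapped in a document object")]
context = docs_to_context(mixed)
print(context)
```

The resulting list of strings could then be handed to `generate_answer` unchanged.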

In [25]:
pdf_dir = '/content/drive/MyDrive/Papers'
pdf_files = [os.path.join(pdf_dir, file) for file in os.listdir(pdf_dir) if file.endswith('.pdf')]
pdf_text = extract_text_from_pdf(pdf_files[0])
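`os.listdir` returns entries in arbitrary order and `endswith('.pdf')` is case-sensitive, so `pdf_files[0]` may not be the same file on every run and a file named `paper.PDF` would be skipped. A minimal sketch of a more predictable listing follows; `list_pdf_files` is a hypothetical helper, and the temporary directory stands in for the Drive folder.

```python
import tempfile
from pathlib import Path


def list_pdf_files(pdf_dir):
    """Return the paths of all PDFs directly inside `pdf_dir`, sorted for a stable order."""
    return sorted(
        str(path)
        for path in Path(pdf_dir).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"  # matches .pdf and .PDF
    )


# Demo on a throwaway directory instead of the real Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [Path(p).name for p in list_pdf_files(tmp)]
print(found)
```

With a sorted list, indexing the first element always selects the same paper as long as the folder contents do not change.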

In [34]:
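def query_pdf(query):
    # RetrievalQA takes its input under "query" and returns the answer under "result".
    response = rag_chain.invoke({"query": query})
    return response["result"]

query = "What is the summary of the paper?"
response = query_pdf(query)
print(response)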

The paper discusses the topic of performance capture of humans, which is a highly active area of research. The authors present various methods for performance capture, including image-based methods, model-based approaches, volumetric capture systems, and machine learning-based algorithms. They also introduce a system called "The Relightables", which uses a Light Stage combined with multi-view stereo depth sensors for capture, reconstruction, and rendering. The system is complex and requires high computational power, but the authors argue that it is a significant improvement over the state of the art in terms of speed. The paper also includes an evaluation of the system's main components to justify the design choices.
Meshes per frame are mapped to each other through a process called mesh alignment. For each mesh, a sequential forward alignment and backward alignment through time is performed. This means that the n-th mesh is aligned to all its proceeding meshes when moving forward in t