<a href="https://colab.research.google.com/github/echutch/LongDocumentQA/blob/main/dbqa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document-Based Question Answering

### Installation

In [None]:
!pip install -U langchain-community langchain-chroma pypdf chromadb hf_xet langchain-huggingface huggingface_hub "langchain-google-genai>=0.0.6"

### Hugging Face Login

In [None]:
from huggingface_hub import login
from google.colab import userdata

hf_token = userdata.get('HF_TOKEN')
login(token=hf_token)

## Process Document

### Load the document (PDF)

In [None]:
from langchain_community.document_loaders import PyPDFLoader

def load(file):
  loader = PyPDFLoader(file)
  pages = loader.load()
  return pages

# TEST
# pages = load('drive/MyDrive/dbqa/PaperQA.pdf')

# print(f"Number of pages: {len(pages)}\n")
# print(f"First 500 characters:\n{pages[0].page_content[0:500]}")
# print(f"\nMeta Data: {pages[0].metadata}")

### Split into chunks
Experiment with the chunk size and overlap.
- The paper uses a chunk size of 4000, which may be too large for a prototype.
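
To get a feel for how chunk size and overlap trade off, the sketch below counts chunks with a plain fixed-width sliding window. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers paragraph and sentence boundaries). The `chunk_text` helper and the 12,000-character sample are hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample: 12,000 characters standing in for a loaded document.
sample = "".join(str(i % 10) for i in range(12000))
for size, overlap in [(1500, 150), (4000, 400)]:
    pieces = chunk_text(sample, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: {len(pieces)} chunks")
```

Smaller chunks give the retriever more, narrower pieces to choose from, while larger chunks pack more context into each retrieved piece.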

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split(pages, chunk_size, chunk_overlap):
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size=chunk_size,
      chunk_overlap=chunk_overlap
  )

  splits = text_splitter.split_documents(pages)
  return splits

# TEST
# splits = split(pages, 1500, 150)
# print(len(splits))
# print(splits[1])

### Embed chunks and create the vector database
- Consider trying multiple embedding models to find the best result
- Currently using all-MiniLM-L6-v2
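
One possible way to compare embedding models is to label a few questions with the chunk that should answer them and measure how often that chunk lands in the top `k` results. The sketch below is a minimal version of that idea. `hit_rate`, `cosine`, and the letter-count `toy_embed` are hypothetical helpers, and the toy embedding only exists so the example runs without downloading a model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hit_rate(embed, chunks, labeled_questions, k=3):
    """Fraction of questions whose expected chunk index ranks in the top k.

    embed: callable mapping a list of strings to a list of vectors.
    labeled_questions: list of (question, expected_chunk_index) pairs.
    """
    chunk_vecs = embed(chunks)
    hits = 0
    for question, expected_idx in labeled_questions:
        q_vec = embed([question])[0]
        ranked = sorted(
            range(len(chunks)),
            key=lambda i: cosine(q_vec, chunk_vecs[i]),
            reverse=True,
        )
        hits += expected_idx in ranked[:k]
    return hits / len(labeled_questions)


def toy_embed(texts):
    """Toy stand-in for a real model: 26-dimensional letter counts."""
    vectors = []
    for text in texts:
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        vectors.append(counts)
    return vectors


chunks = ["zebra zoo zone", "kiwi kick kite", "mammal moon mime"]
labeled = [("zzz", 0), ("kkk", 1), ("mmm", 2)]
print(hit_rate(toy_embed, chunks, labeled, k=1))
```

With a real model, the `embed_documents` method of a `HuggingFaceEmbeddings` instance could be passed as `embed`, so each candidate model gets a score on the same labeled questions.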

In [None]:
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

def create_db(splits, persist_directory):
  embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
  vectordb = Chroma.from_documents(
      documents=splits,
      embedding=embedding,
      persist_directory=persist_directory
  )
  return vectordb

### Embed the document

In [None]:
file_name = 'drive/MyDrive/dbqa/PaperQA.pdf'
persist_directory = 'drive/MyDrive/dbqa/paperqa_db'

In [None]:
pages = load(file_name)
splits = split(pages, 1500, 150)
vectordb = create_db(splits, persist_directory)

## Propose Answer
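
The `answer` function in the next cell retrieves with `search_type="mmr"` (maximal marginal relevance). As a rough illustration of what that option does, here is a minimal pure-Python sketch: it takes the `fetch_k` chunks most similar to the query, then greedily picks `k` of them while penalizing similarity to chunks already picked. The `mmr` and `cosine` helpers and the `lambda_mult=0.5` weighting are assumptions for illustration, not the vector store's actual code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k=5, fetch_k=10, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    sims = [cosine(query_vec, d) for d in doc_vecs]
    # Step 1: shortlist the fetch_k documents most similar to the query.
    candidates = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)[:fetch_k]
    selected = []
    # Step 2: greedily add the candidate with the best relevance/redundancy balance.
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * sims[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors: documents 0 and 1 are near-duplicates, document 2 is different.
query = [1.0, 0.2]
docs = [[1.0, 0.0], [1.0, 0.05], [0.6, 0.8]]
print(mmr(query, docs, k=2, fetch_k=3))                   # balances relevance and diversity
print(mmr(query, docs, k=2, fetch_k=3, lambda_mult=1.0))  # pure similarity ranking
```

In this toy case the balanced setting skips one of the two near-duplicates in favor of the different document, which is the behavior that motivates MMR over plain similarity search.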

In [None]:
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

def answer(question, vectordb, llm):
  template = """Answer in a direct and concise tone, I am in a hurry. Your audience is an expert, so be
  highly specific. If there are ambiguous terms or acronyms, first define them.
  Write an answer with five sentences maximum for the question below based on the provided context.
  If the context provides insufficient information, reply ''I cannot answer''. Answer in an unbiased, comprehensive,
  and scholarly tone. If the question is subjective, provide an opinionated answer in the concluding 1-2 sentences.

  {context}

  Question: {question}

  Answer:"""

  QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

  retriever = vectordb.as_retriever(
      search_type="mmr",
      search_kwargs={"k": 5, "fetch_k": 10}
  )

  # retriever = vectordb.as_retriever(
  #       search_type="similarity",
  #       search_kwargs={"k": 10} # "fetch_k" is only used for mmr
  #   )



  qa_chain = RetrievalQA.from_chain_type(
      llm=llm,
      retriever=retriever,
      return_source_documents=True,
      chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
  )

  result = qa_chain.invoke({"query": question})

  # Print the retrieved source documents
  print("----- RETRIEVED CHUNKS -----")
  if result.get("source_documents"):
      for i, doc in enumerate(result["source_documents"]):
          # score = doc.metadata.get('distance', 'N/A')
          # print(f"Score: {score}")
          print(f"Chunk {i + 1}:")
          print(doc.page_content)
          print("------------------------------")
  return result

def answer_long_context(question, pages, llm):
  template = """Answer in a direct and concise tone, I am in a hurry. Your audience is an expert, so be
  highly specific. If there are ambiguous terms or acronyms, first define them.
  Write an answer with five sentences maximum for the question below based on the provided context.
  If the context provides insufficient information, reply ''I cannot answer''. Answer in an unbiased, comprehensive,
  and scholarly tone. If the question is subjective, provide an opinionated answer in the concluding 1-2 sentences.

  Context: {context}

  Question: {question}

  Answer:"""

  QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

  chain = QA_CHAIN_PROMPT | llm

  context = "\n\n".join([page.page_content for page in pages])

  response = chain.invoke({
      "question": question,
      "context": context
  })

  return response

## Put it all together

### Setup

In [None]:
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

persist_directory = 'drive/MyDrive/dbqa/paperqa_db'
file_name = 'drive/MyDrive/dbqa/PaperQA.pdf'
vectordb = Chroma(persist_directory=persist_directory, embedding_function=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"))
pages = load(file_name)
question = "Explain how the search tool works."

### Host on Colab

In [None]:
from transformers import pipeline
from langchain_huggingface import HuggingFacePipeline
from IPython.display import display

llm_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-large",
    max_new_tokens=512,
)

llm = HuggingFacePipeline(pipeline=llm_pipeline)
result = answer(question, vectordb, llm)

In [None]:
display(result['result'])

### Gemini API (RAG)

In [None]:
from google.colab import userdata
from langchain_google_genai import ChatGoogleGenerativeAI

gemini_api_key = userdata.get('GOOGLE_API_KEY')

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", google_api_key=gemini_api_key)

question = "Highlight each tool of the PaperQA workflow and how they interact together."

result = answer(question, vectordb, llm)
display(result['result'])

### Long Context
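
The long-context path joins every page into a single prompt, so it can help to estimate the prompt size before calling the model. The sketch below uses a rough heuristic of about four characters per token. That ratio, the `estimate_tokens` and `fits_context` helpers, and the limits in the demo are all assumptions; the model's own documentation or tokenizer gives the real figures.

```python
import math


def estimate_tokens(texts, chars_per_token=4):
    """Rough token estimate for texts joined with a blank line between them."""
    separators = 2 * max(len(texts) - 1, 0)  # length of the "\n\n" joins
    total_chars = sum(len(t) for t in texts) + separators
    return math.ceil(total_chars / chars_per_token)


def fits_context(texts, context_window, reserve_for_answer=1024):
    """True if the estimated prompt plus room for the answer fits in the window."""
    return estimate_tokens(texts, chars_per_token=4) + reserve_for_answer <= context_window


# Hypothetical pages; in the notebook this would be [p.page_content for p in pages].
page_texts = ["lorem ipsum " * 300, "dolor sit amet " * 300]
print(estimate_tokens(page_texts))
print(fits_context(page_texts, context_window=8000))  # hypothetical limit
```

If the estimate is close to the limit, the retrieval-based `answer` function is the safer route, since it sends only a handful of chunks.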

In [None]:
file_name = 'drive/MyDrive/dbqa/IPCC_AR6_WGI_Chapter06.pdf'
pages = load(file_name)

In [None]:
from google.colab import userdata
from langchain_google_genai import ChatGoogleGenerativeAI

gemini_api_key = userdata.get('GOOGLE_API_KEY')

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", google_api_key=gemini_api_key)

question = "How have Earth System Models (ESMs) evolved from the CMIP5 to the CMIP6 generation regarding aerosol processes?"

result = answer_long_context(question, pages, llm)
display(result.content)



### Successful Questions
These are questions that the prototype answered correctly about the PaperQA paper.

- "How does the PaperQA model work?"
- "Explain how the search tool works."
- "How did the authors test the performance of the PaperQA model?"
- "How much better did PaperQA perform compared to competing models?" (pulled data from a table)
- "How did the researchers counteract hallucination?"
- "What are some sources cited in this paper that I can learn more about retrieval augmented generation?" (pulled several citations from the paper)
- "What would the cost per hour of PaperQA be?"
- "What experiments were run on PaperQA?"
- "What are some examples of questions that were in the LitQA dataset?" (attempted a citation: correctly identified the source table but could not name the authors)

### Failed Questions
These are questions that the prototype was unable to answer. Some were expected to fail (noted in parentheses); for the others, the answer is contained in the paper.

- "Who are the authors of this paper?"
- "What is the title of this paper?"
- "How do I generate a new Google AI API key?" (expected to fail)
- "If I wanted to build PaperQA myself, how could I do it?" (not expected to work; reasoning over the paper's findings could be added later)
- "What question did I ask you two questions ago?"
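
The last question fails because each call to `answer` is stateless: nothing from earlier questions reaches the model. One possible direction, sketched below under that assumption, is a small buffer that keeps recent turns and prepends them to the next question. `ConversationBuffer` is a hypothetical helper, not part of LangChain.

```python
from collections import deque


class ConversationBuffer:
    """Hypothetical helper that keeps the last max_turns question/answer pairs."""

    def __init__(self, max_turns=5):
        # deque with maxlen drops the oldest turn automatically.
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append((question, answer))

    def render(self):
        """Format the stored turns as plain text."""
        return "\n".join(f"Q: {q}\nA: {a}" for q, a in self.turns)

    def wrap(self, question):
        """Prefix the new question with the stored history, if any."""
        history = self.render()
        if not history:
            return question
        return f"Previous conversation:\n{history}\n\nCurrent question: {question}"


buffer = ConversationBuffer(max_turns=3)
buffer.add("How does the PaperQA model work?", "(model answer)")
print(buffer.wrap("What question did I ask you one question ago?"))
```

In the notebook, `buffer.wrap(question)` could be passed to `answer` and the result stored with `buffer.add(question, result['result'])`. One caveat: the wrapped text would also serve as the retrieval query, which may skew which chunks are returned.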