<a href="https://colab.research.google.com/github/echutch/LongDocumentQA/blob/main/dbqa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document-Based Question Answering

### Installation

In [None]:
!pip install -U langchain-community langchain-chroma pypdf chromadb hf_xet langchain-huggingface huggingface_hub "langchain-google-genai>=0.0.6"


### Hugging Face Login

In [None]:
from huggingface_hub import login
from google.colab import userdata

hf_token = userdata.get('HF_TOKEN')
login(token=hf_token)
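
`google.colab.userdata` is only available inside Colab. As a minimal sketch for running the notebook elsewhere, the helper below (a hypothetical `get_secret`, not part of the original notebook) falls back to an environment variable of the same name when the Colab module cannot be imported.

```python
import os

def get_secret(name):
    """Return a secret from Colab's userdata store, or from the environment elsewhere."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        # Assumed fallback: the secret is exported as an environment variable.
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"Secret {name!r} not found in the environment")
        return value
    return userdata.get(name)
```

With this in place, `login(token=get_secret('HF_TOKEN'))` would behave the same in Colab and read `HF_TOKEN` from the shell environment otherwise.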

## Process Document

### Load the document (PDF)

In [None]:
from langchain_community.document_loaders import PyPDFLoader

def load(file):
  """Load a PDF and return one Document per page."""
  loader = PyPDFLoader(file)
  pages = loader.load()
  return pages

# TEST
# pages = load('drive/MyDrive/dbqa/PaperQA.pdf')

# print(f"Number of pages: {len(pages)}\n")
# print(f"First 500 characters:\n{pages[0].page_content[0:500]}")
# print(f"\nMeta Data: {pages[0].metadata}")

### Split into chunks
Experiment with chunk sizes:
- The paper uses a chunk size of 4000, but that may be too large for this prototype.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split(pages, chunk_size, chunk_overlap):
  """Split the loaded pages into overlapping chunks of at most chunk_size characters."""
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size=chunk_size,
      chunk_overlap=chunk_overlap
  )

  splits = text_splitter.split_documents(pages)
  return splits

# TEST
# splits = split(pages, 1500, 150)
# print(len(splits))
# print(splits[1])
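
To get a feel for how `chunk_size` and `chunk_overlap` affect the number of chunks, here is a simplified sketch. It uses a plain fixed-size sliding window rather than `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs and sentences, so the real counts will differ somewhat. The function name `sliding_chunks` and the 10,000-character sample are illustrative assumptions.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows whose starts are chunk_size - chunk_overlap apart."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "x" * 10_000  # stand-in for the extracted text of a document
for size, overlap in [(1500, 150), (4000, 400)]:
    count = len(sliding_chunks(sample, size, overlap))
    print(f"chunk_size={size}, chunk_overlap={overlap}: {count} chunks")
```

Smaller chunks give the retriever more, finer-grained candidates, while larger chunks keep more surrounding context together in each retrieved piece.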

### Embed chunks and create the vector database
- Potentially try multiple embedding models to find the best result
- Currently using all-MiniLM-L6-v2

In [None]:
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

def create_db(splits, persist_directory):
  embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
  vectordb = Chroma.from_documents(
      documents=splits,
      embedding=embedding,
      persist_directory=persist_directory
  )
  return vectordb
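
One way to compare embedding models is to measure how often the expected passage shows up in the top-k retrieved chunks for a handful of hand-labelled questions. The sketch below is an assumption about how such a check could look, not something the notebook already does: `hit_rate`, `toy_retrieve`, and the three-sentence corpus are all made up for illustration, and the toy retriever ranks by word overlap instead of embeddings.

```python
def hit_rate(retrieve, labelled_questions, k=5):
    """Fraction of questions whose expected snippet appears in the top-k retrieved texts."""
    hits = 0
    for question, expected_snippet in labelled_questions:
        top_texts = retrieve(question, k)
        if any(expected_snippet in text for text in top_texts):
            hits += 1
    return hits / len(labelled_questions) if labelled_questions else 0.0

def toy_retrieve(question, k):
    """Stand-in retriever: rank a tiny made-up corpus by shared words with the question."""
    corpus = [
        "The search tool queries a literature engine.",
        "Gather evidence summarizes relevant chunks.",
        "LitQA is a benchmark dataset.",
    ]
    words = set(question.lower().split())
    ranked = sorted(corpus, key=lambda t: -len(words & set(t.lower().split())))
    return ranked[:k]

labelled = [
    ("how does the search tool work", "search tool"),
    ("what is litqa", "LitQA"),
]
print(hit_rate(toy_retrieve, labelled, k=1))
```

Against a real database, `retrieve` could be something like `lambda q, k: [d.page_content for d in vectordb.similarity_search(q, k=k)]`, with one `vectordb` built per candidate embedding model.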

### Embed document

In [None]:
file_name = 'drive/MyDrive/dbqa/PaperQA.pdf'
persist_directory = 'drive/MyDrive/dbqa/paperqa_db'

In [None]:
pages = load(file_name)
splits = split(pages, 1500, 150)
vectordb = create_db(splits, persist_directory)


## Propose Answer

In [None]:
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

def answer(question, vectordb, llm):
  template = """Answer in a direct and concise tone, I am in a hurry. Your audience is an expert, so be
  highly specific. If there are ambiguous terms or acronyms, first define them.
  Write an answer with five sentences maximum for the question below based on the provided context.
  If the context provides insufficient information, reply "I cannot answer". Answer in an unbiased, comprehensive,
  and scholarly tone. If the question is subjective, provide an opinionated answer in the concluding 1-2 sentences.

  {context}

  Question: {question}

  Answer:"""

  QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

  retriever = vectordb.as_retriever(
      search_type="mmr",
      search_kwargs={"k": 5, "fetch_k": 10}
  )

  # Alternative: plain similarity search ("fetch_k" is only used for MMR)
  # retriever = vectordb.as_retriever(
  #     search_type="similarity",
  #     search_kwargs={"k": 10}
  # )

  qa_chain = RetrievalQA.from_chain_type(
      llm=llm,
      retriever=retriever,
      return_source_documents=True,
      chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
  )

  result = qa_chain.invoke({"query": question})

  # Print the retrieved source documents
  print("----- RETRIEVED CHUNKS -----")
  if result.get("source_documents"):
      for i, doc in enumerate(result["source_documents"]):
          # score = doc.metadata.get('distance', 'N/A')
          # print(f"Score: {score}")
          print(f"Chunk {i + 1}:")
          print(doc.page_content)
          print("------------------------------")
  return result

def answer_long_context(question, pages, llm):
  template = """Answer in a direct and concise tone, I am in a hurry. Your audience is an expert, so be
  highly specific. If there are ambiguous terms or acronyms, first define them.
  Write an answer with five sentences maximum for the question below based on the provided context.
  If the context provides insufficient information, reply "I cannot answer". Answer in an unbiased, comprehensive,
  and scholarly tone. If the question is subjective, provide an opinionated answer in the concluding 1-2 sentences.

  Context: {context}

  Question: {question}

  Answer:"""

  QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

  chain = QA_CHAIN_PROMPT | llm

  context = "\n\n".join([page.page_content for page in pages])

  response = chain.invoke({
      "question": question,
      "context": context
  })

  return response
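
The retriever in `answer` uses `search_type="mmr"` (maximal marginal relevance) with `k=5` and `fetch_k=10`: a pool of `fetch_k` candidates is fetched by similarity, and `k` of them are then picked so that each new pick is relevant to the query but not redundant with earlier picks. The sketch below illustrates that selection rule in plain Python on toy 2-D vectors. It is not LangChain's implementation; `mmr_select`, the vectors, and the `lambda_mult=0.5` weighting are assumptions chosen for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # most similar to the query
    [1.0, 0.11],  # near-duplicate of candidate 0
    [1.0, -1.0],  # less similar, but points in a different direction
]
print(mmr_select(query, candidates, k=2))                   # MMR weighting
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # relevance only
```

With the balanced weighting, the near-duplicate is skipped in favour of the more distinct candidate; with `lambda_mult=1.0` the selection reduces to plain similarity ranking, which is what the commented-out `search_type="similarity"` retriever does.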

## Put it all together

### Setup

In [None]:
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

persist_directory = 'drive/MyDrive/dbqa/paperqa_db'
file_name = 'drive/MyDrive/dbqa/PaperQA.pdf'
vectordb = Chroma(persist_directory=persist_directory, embedding_function=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"))
pages = load(file_name)
question = "Explain how the search tool works."


### Host on Colab

In [None]:
from transformers import pipeline
from langchain_huggingface import HuggingFacePipeline
from IPython.display import display

llm_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-large",
    max_new_tokens=512,
)

llm = HuggingFacePipeline(pipeline=llm_pipeline)
result = answer(question, vectordb, llm)


Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (1727 > 512). Running this sequence through the model will result in indexing errors


KeyboardInterrupt: 
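
The warning above reports a 1727-token input against a 512-token maximum, so the prompt plus five retrieved chunks does not fit this model. One possible mitigation, sketched here under assumptions, is to keep retrieved chunks in ranked order only while an estimated token total stays under a budget. `pack_chunks` is a hypothetical helper, and the 4-characters-per-token ratio is a rough heuristic; an accurate count would need the model's own tokenizer.

```python
def pack_chunks(chunks, max_tokens, chars_per_token=4):
    """Keep chunks in ranked order while the estimated token total stays within max_tokens."""
    kept, used = [], 0
    for chunk in chunks:
        cost = -(-len(chunk) // chars_per_token)  # ceiling division: estimated tokens
        if used + cost > max_tokens:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += cost
    return kept, used

ranked_chunks = ["a" * 400, "b" * 800, "c" * 1200]  # stand-ins for retrieved chunk texts
kept, used = pack_chunks(ranked_chunks, max_tokens=350)
print(len(kept), used)
```

The budget passed as `max_tokens` would also need to leave room for the instructions and the question, not just the chunks.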

In [None]:
display(result['result'])

### Gemini API (RAG)

In [None]:
from google.colab import userdata
from langchain_google_genai import ChatGoogleGenerativeAI

gemini_api_key = userdata.get('GOOGLE_API_KEY')

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", google_api_key=gemini_api_key)

question = "Highlight each tool of the PaperQA workflow and how they interact together."

result = answer(question, vectordb, llm)
display(result['result'])

----- RETRIEVED CHUNKS -----
Chunk 1:
Figure 1: PaperQA Workflow Diagram. PaperQA is an agent that transforms a scientific question
into an answer with cited sources. The agent utilizes three tools – search, gather evidence, and
answer question. The tools enable it to find and parse relevant full-text research papers, identify
specific sections in the paper that help answer the question, summarize those section with the context
of the question (called evidence), and then generate an answer based on the evidence. It is an agent,
so that the LLM orchestrating the tools can adjust the input to paper searches, gather evidence with
different phrases, and assess if an answer is complete.
Evaluating LLM Scientists Assessing the scientific capabilities of LLMs often relies on QA
benchmarks, such as general science benchmarks [50, 51], or those specializing in medicine [21],
biomedical science [52] or chemistry [53, 54]. In contrast, open-ended tasks, such as conducting
chemical synthesis plann

'PaperQA utilizes three core tools: `search`, `gather evidence`, and `answer question`. The `search` tool queries a scientific literature engine with keywords to retrieve relevant papers, which are then chunked, embedded, and added to a vector database. Subsequently, `gather evidence` identifies and summarizes specific sections from these retrieved papers, forming context-specific "evidence" for the question. Finally, the `answer question` tool synthesizes a response based on this accumulated evidence. An orchestrating Large Language Model (LLM) agent manages this workflow iteratively, adjusting tool inputs and re-executing steps if an answer is incomplete or requires more evidence.'

### Long Context

In [None]:
file_name = 'drive/MyDrive/dbqa/IPCC_AR6_WGI_Chapter06.pdf'
pages = load(file_name)

In [None]:
from google.colab import userdata
from langchain_google_genai import ChatGoogleGenerativeAI

gemini_api_key = userdata.get('GOOGLE_API_KEY')

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", google_api_key=gemini_api_key)

question = "How have Earth System Models (ESMs) evolved from the CMIP5 to the CMIP6 generation regarding aerosol processes?"

result = answer_long_context(question, pages, llm)
display(result.content)



"ESMs in CMIP6 generally feature more comprehensive aerosol process representations than their CMIP5 counterparts. Many CMIP6 models now simulate aerosol number size distribution, a key factor for accurately simulating CCN concentrations. Some CMIP6 models also prescribe aerosol optical properties to constrain aerosol forcing. Despite these advancements, the range of complexity in aerosol modeling persists within the CMIP6 ensemble. Limited global coverage of CCN measurements restricts comprehensive model evaluations, impacting confidence in aerosol-cloud interaction simulations. While CMIP6 models represent more aerosol-cloud interaction processes, it's uncertain if this improves radiative forcing simulations due to unresolved small-scale processes."

### Successful Questions
These are questions that the prototype answered correctly about the PaperQA paper.

- "How does the PaperQA model work?"
- "Explain how the search tool works."
- "How did the authors test the performance of the PaperQA model?"
- "How much better did PaperQA perform compared to competing models?" (pulled data from a table)
- "How did the researchers counteract hallucination?"
- "What are some sources cited in this paper that I can learn more about retrieval augmented generation?" (pulled several citations from the paper)
- "What would the cost per hour of PaperQA be?"
- "What experiments were run on PaperQA?"
- "What are some examples of questions that were in the LitQA dataset?" (attempted a citation: it correctly identified the source table but did not know who the authors were)

### Failed Questions
These are questions that the prototype was unable to answer. Some were expected to fail (marked below); for the others, the answer is contained in the paper but the prototype still could not produce it.

- "Who are the authors of this paper?"
- "What is the title of this paper?"
- "How do I generate a new Google AI API key?" (expected failure: the paper does not cover this)
- "If I wanted to build PaperQA myself, how could I do it?" (not expected to work, but more reasoning based on the paper's findings could be built in later)
- "What question did I ask you two questions ago?"
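
A plausible reason the last question fails is that each call to `answer` sees only the retrieved chunks and the current question, with no record of earlier turns. That explanation is an assumption, and the sketch below shows just one way a short history could be kept and rendered as extra prompt text. `ChatHistory` is a hypothetical helper, not part of the notebook or of LangChain.

```python
from collections import deque

class ChatHistory:
    """Keep the most recent question/answer pairs and render them as prompt text."""

    def __init__(self, max_turns=5):
        # A bounded deque drops the oldest turn automatically once full.
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append((question, answer))

    def as_context(self):
        lines = []
        for i, (question, answer) in enumerate(self.turns, start=1):
            lines.append(f"Previous question {i}: {question}")
            lines.append(f"Previous answer {i}: {answer}")
        return "\n".join(lines)

history = ChatHistory(max_turns=2)
history.add("How does the PaperQA model work?", "(answer text)")
history.add("Explain how the search tool works.", "(answer text)")
print(history.as_context())
```

The rendered text could be placed in the prompt template next to `{context}`, for example through an added `{history}` variable, so the model can refer back to earlier questions.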