
# Building a PDF Question Answering (QA) System using RAG
> In this notebook, you will create a system that answers questions based on the content of a PDF file. 
> This involves building a Retrieval-Augmented Generation (RAG) pipeline, where relevant document sections 
> are retrieved based on the question and used to generate precise answers.

## Key Components:
1. **Document Loading** - Load and preprocess PDF content for analysis.
2. **Indexing** - Split and store text in a vectorized format for fast retrieval.
3. **Retrieval** - Retrieve relevant document sections for a given query.
4. **Generation** - Use a language model to generate an answer based on the retrieved context.

Let's dive in!



# 📥 Library Installation
We install the necessary libraries to process PDFs, handle embeddings, and work with language models.


In [24]:
%pip install -qU pypdf langchain_community
%pip install faiss-cpu
%pip install langchain-openai


# 📤 Import Libraries

In [25]:
import os
import config  # Local config.py that defines OPENAI_API_KEY
import faiss
import numpy as np
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.vectorstores import FAISS # A library for efficient similarity search and clustering of dense vectors.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Data Preparation
## Loading Documents
> First, you'll need to choose a PDF to load. Feel free to use a PDF of your choosing.
> Once you've chosen your PDF, the next step is to load it into a format that an LLM can more easily handle, since LLMs generally require text inputs. 
> LangChain has a few different built-in PDF document loaders for this purpose which you can experiment with. 

> Below, we'll use one powered by the <a href="https://pypi.org/project/pypdf/">pypdf</a> package that reads from a filepath.


In [26]:
# Loading Documents

# Example document
file_path = "./example_data/journal.pone.0264429.pdf"

# The loader reads the PDF at the specified path into memory.
loader = PyPDFLoader(file_path)

# Extract text data using the pypdf package.
docs = loader.load()

### Explore loaded data

In [27]:
# Type and length of docs
print(type(docs))
print(len(docs))

<class 'list'>
15


**Question:** What does each element of the variable `docs` represent?

In [28]:
# Type of docs elements
print(f"Type of docs elements: \n{type(docs[0])}")
# Display first element of docs
print(f"\nFirst element of loaded data docs: \n{docs[0]}")

Type of docs elements: 
<class 'langchain_core.documents.base.Document'>

First element of loaded data docs: 
page_content='RESEA RCH ARTICL E
Prediction of HIV status based on socio-
behavioural characteristics in East and
Southern Africa
Erol Orel
ID
1
*, Rachel Esra
1
, Janne Estill
1,2
, Amaury Thiabaud
ID
1
, Ste ´ phane Marchand-
Maillet
ID
3
, Aziza Merzouki
1☯
, Olivia Keiser
1☯
1 Institute of Global Health, University of Geneva, Geneva , Switzerlan d, 2 Institute of Mathematic al Statistics
and Actuari al Science, Univers ity of Bern, Bern, Switzerlan d, 3 Department of Computer Science, Viper
Group, University of Geneva, Geneva , Switzerlan d
☯ These authors contribu ted equally to this work.
* Erol.Ore l@unige.ch
Abstract
Introduction
High yield HIV testing strategies are critical to reach epidemic control in high prevalence and
low-resource settings such as East and Southern Africa. In this study, we aimed to predict
the HIV status of individuals living in Angola, Burundi, 

**So what just happened?**

- The loader reads the PDF at the specified path into memory.
- It then extracts text data using the pypdf package.
- Finally, it creates a LangChain Document for each page of the PDF with the page's content and some metadata about where in the document the text came from.
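
The page-by-page structure described above can be pictured with a small stand-in. This is only a sketch: `PageDocument` and `pages_to_documents` are hypothetical names, not LangChain classes, and the `source` and `page` metadata keys are an assumption about what the loader typically records.

```python
from dataclasses import dataclass, field


@dataclass
class PageDocument:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts: list[str], source: str) -> list[PageDocument]:
    """Wrap each page's text in a document that remembers where it came from."""
    return [
        PageDocument(page_content=text, metadata={"source": source, "page": i})
        for i, text in enumerate(page_texts)
    ]


# Toy page texts, not taken from the example PDF.
toy_pages = ["Title and abstract ...", "Methods ...", "Results ..."]
toy_docs = pages_to_documents(toy_pages, source="example.pdf")

print(len(toy_docs))
print(toy_docs[1].metadata)
```

In the notebook itself, `docs[0].metadata` shows the metadata the real loader attached to the first page.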

## Text Splitting
We split the document into smaller (more focused) chunks that can more easily fit into an LLM's context window. 

In [29]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Splitting
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(docs)

In [30]:
# Explore results of text splitting into chunks
print(f"Number of chunks: {len(documents)}\n")
print(f"Type of chunks: {type(documents[0])}\n")
print(f"Content of first text chunk:\n{documents[0].page_content}")

Number of chunks: 60

Type of chunks: <class 'langchain_core.documents.base.Document'>

Content of first text chunk:
RESEA RCH ARTICL E
Prediction of HIV status based on socio-
behavioural characteristics in East and
Southern Africa
Erol Orel
ID
1
*, Rachel Esra
1
, Janne Estill
1,2
, Amaury Thiabaud
ID
1
, Ste ´ phane Marchand-
Maillet
ID
3
, Aziza Merzouki
1☯
, Olivia Keiser
1☯
1 Institute of Global Health, University of Geneva, Geneva , Switzerlan d, 2 Institute of Mathematic al Statistics
and Actuari al Science, Univers ity of Bern, Bern, Switzerlan d, 3 Department of Computer Science, Viper
Group, University of Geneva, Geneva , Switzerlan d
☯ These authors contribu ted equally to this work.
* Erol.Ore l@unige.ch
Abstract
Introduction
High yield HIV testing strategies are critical to reach epidemic control in high prevalence and
low-resource settings such as East and Southern Africa. In this study, we aimed to predict
the HIV status of individuals living in Angola, Burundi, Ethi


**Question**: Try changing the chunk sizes! 
- How does it affect the text splitting result?
- How does it affect the generated answers below?
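
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits recursively on separators such as paragraphs and newlines); `split_with_overlap` is a hypothetical helper that only illustrates the two parameters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one. A larger overlap preserves more context across chunk boundaries but produces more chunks to embed and store.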

# Indexing
## Embedding and Vector Storage
Once the text is split into chunks, we embed each chunk and store the resulting vectors in a vector store.
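
As a rough illustration of what indexing does, the sketch below pairs each text with a vector and keeps the pairs in a list. Everything here is a toy assumption: `toy_embed` counts words from a tiny fixed vocabulary, whereas `OpenAIEmbeddings` returns dense vectors from a trained model, and `ToyVectorStore` is a hypothetical class, not LangChain's vector store.

```python
from collections import Counter

# Tiny made-up vocabulary; real embeddings are not based on word counts.
TOY_VOCAB = ["hiv", "testing", "study", "africa"]


def toy_embed(text: str) -> list[float]:
    """Map text to a vector of word counts over TOY_VOCAB."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in TOY_VOCAB]


class ToyVectorStore:
    """Hypothetical in-memory store holding (text, vector) pairs."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, list[float]]] = []

    def add_texts(self, texts: list[str]) -> None:
        for text in texts:
            self.entries.append((text, toy_embed(text)))


store = ToyVectorStore()
store.add_texts(["HIV testing study", "study in Africa"])
for text, vector in store.entries:
    print(text, vector)
```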

In [31]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = config.OPENAI_API_KEY

vectorstore = InMemoryVectorStore.from_documents(
    documents=documents, embedding=OpenAIEmbeddings()
)

# Retrieval
We set up a retriever from the vector store to identify relevant document chunks based on a user's query. 

In [32]:
retriever = vectorstore.as_retriever()
#retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

The retriever performs a similarity search: it embeds the query, compares that embedding with the stored chunk embeddings, and returns the most similar chunks.
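
The comparison step can be sketched in plain Python, assuming cosine similarity as the metric. The three-dimensional vectors and labels below are invented for illustration, and `top_k` is a hypothetical helper rather than part of the retriever API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy vectors; real embeddings have hundreds or thousands of dimensions.
toy_store = [
    ("study objective", [0.9, 0.1, 0.0]),
    ("author affiliations", [0.0, 1.0, 0.1]),
    ("methods overview", [0.6, 0.2, 0.5]),
]
toy_query = [1.0, 0.0, 0.1]

best = top_k(toy_query, toy_store, k=2)
print(best)
```

The `k` argument plays the same role as `search_kwargs={"k": ...}` on the retriever: it caps how many chunks are passed on as context.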

In [33]:
# Query examples
query = "What is the goal of the study?"
#query = "What is 95-95-95?" #Check retrieved text chunks

In [34]:
retrieved_docs = retriever.invoke(query)

print(f"Number of retrieved text chunks: {len(retrieved_docs)}")

Number of retrieved text chunks: 4


**Question:** What is the default number of retrieved text chunks?
- Change the number of retrieved chunks, e.g. 1. See [Documentation](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/vectorstore/#specifying-top-k)
- See effect on LLM answer below.

# Answer Generation with OpenAI
Here, we use the retrieved document chunks to answer the user query through an OpenAI language model. 

In [35]:
from langchain_openai import ChatOpenAI

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = config.OPENAI_API_KEY

llm = ChatOpenAI(model="gpt-4o-mini")

## Augmentation (Context Integration) Process

The retrieved chunks are inserted into the prompt as context, so the model can ground its answer in the document rather than rely only on what it learned during training.
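
Conceptually, this augmentation is string formatting: the retrieved chunks are joined and substituted into a `{context}` placeholder in the system message. The sketch below assumes a blank-line separator between chunks; `build_augmented_prompt` is a hypothetical helper, and the template and chunk texts are placeholders.

```python
def build_augmented_prompt(
    system_template: str, chunks: list[str], question: str
) -> list[tuple[str, str]]:
    """Fill the {context} placeholder with the joined chunks and pair it with the question."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


toy_template = "Use the following context to answer the question.\n\n{context}"
toy_chunks = ["Chunk A text.", "Chunk B text."]

messages = build_augmented_prompt(toy_template, toy_chunks, "What is the goal of the study?")
for role, content in messages:
    print(f"[{role}]\n{content}\n")
```

Because every retrieved chunk ends up in the prompt, retrieving more chunks gives the model more evidence but also a longer and more expensive prompt.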

In [36]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Keep the "
    "answer precise and concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [None]:
# Create a chain for passing a list of Documents to a model.
question_answer_chain = create_stuff_documents_chain(llm, prompt)
# Create retrieval chain that retrieves documents and then passes them on.
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": query})

print(f"Query: {query}\n")
print(f"Answer: {results['answer']}\n")
print("Sources:")
for document in results["context"]:
    print(f"\n {document}")
    print()

Query: What is the goal of the study?

Answer: The goal of the study is to use machine learning methods to predict HIV positivity in East and Southern African countries by identifying common risk factors among populations with high HIV prevalence, ultimately improving testing strategies and targeting individuals for prevention services.

Sources:

 page_content='the desired testing coverage. In our second scenario, we identified about 5% of the population
at high risk of being HIV positive using a probability cut-off of 90%. This allowed us to identify
more than 50% of all PLHIV with most of the tested population being HIV-positive; the
remaining HIV-negative tested individuals are choice candidates for preventative services
such as pre-exposure prophylaxis (PrEP). We were consequently able to maximise the efficacy
of the testing. We believe that our method would, therefore, be a valuable addition to current
targeted strategies.
To our knowledge, this study is the first to use machine 

**Question:** Change the number of retrieved text chunks used as context and play with the augmented prompt.

# 📝 Summary
In this notebook, you built a PDF ingestion and question-answering system using Retrieval-Augmented Generation (RAG).

You covered the following aspects:
> - Loading and processing a PDF file into manageable text chunks
> - Vectorizing and storing the document chunks in a vector store for similarity search
> - Setting up a retriever to find relevant sections based on a user’s query
> - Integrating with OpenAI to generate contextually relevant responses based on the retrieved information



# 🔗 References:
- [LangChain - PDF Q/A](https://python.langchain.com/docs/tutorials/pdf_qa/)
- [LangChain Tutorials](https://python.langchain.com/docs/tutorials/)