# Build a PDF ingestion and Question/Answering system
Adapted from a tutorial on [python.langchain.com](https://python.langchain.com/v0.2/docs/tutorials/pdf_qa/)

PDF files often hold crucial unstructured data unavailable from other sources. They can be quite lengthy, and unlike plain text files, cannot generally be fed directly into the prompt of a language model.

In this tutorial, you'll create a system that can answer questions about PDF files. More specifically, you'll use a [Document Loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders) to load text in a format usable by an LLM, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.

This tutorial glosses over some concepts that are covered in more depth in the [RAG tutorial](https://python.langchain.com/v0.2/docs/tutorials/rag/), so you may want to work through that one first if you haven't already.

Let's dive in!

## Loading documents
First, choose a PDF to load. This tutorial uses [Nike's annual public SEC report](https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf). It is over 100 pages long and mixes some crucial data with longer explanatory text. Feel free to use a PDF of your own choosing instead.

Once you've chosen your PDF, the next step is to load it into a format that an LLM can handle more easily, since LLMs generally require text inputs. LangChain has several [built-in document loaders](https://python.langchain.com/v0.2/docs/how_to/document_loader_pdf/) for this purpose, which you can experiment with. This tutorial uses one powered by the [pypdf](https://pypi.org/project/pypdf/) package that reads from a file path.

Here is a partial list of the packages used in this notebook, with their pinned versions:

```text
langchain==0.2.6
langchain-chroma==0.1.2
langchain-community==0.2.6
langchain-core==0.2.10
langchain-openai==0.1.13
langchain-text-splitters==0.2.0
langchainhub==0.1.20
pypdf==4.2.0
```
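
If you want to confirm that your environment matches these pins, one option is a small check with the standard-library `importlib.metadata` module. The sketch below is not part of the original notebook; `PINNED` and `check_versions` are names made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the package list above.
PINNED = {
    "langchain": "0.2.6",
    "langchain-chroma": "0.1.2",
    "langchain-community": "0.2.6",
    "langchain-core": "0.2.10",
    "langchain-openai": "0.1.13",
    "langchain-text-splitters": "0.2.0",
    "langchainhub": "0.1.20",
    "pypdf": "4.2.0",
}


def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Report every package whose installed version differs from the pin.
for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

Newer versions may work as well; a mismatch is only a hint about where to look if the outputs below differ from yours.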

In [1]:
%pip install -qU pypdf langchain_community

Note: you may need to restart the kernel to use updated packages.


First, download the Nike annual report using `urlretrieve` from Python's standard-library `urllib.request` module. The target directory is created beforehand, because `urlretrieve` does not create missing directories:

In [5]:
import os
from urllib.request import urlretrieve

url = "https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf"
os.makedirs("./example-data", exist_ok=True)  # make sure the target directory exists

# Download the file once and print the HTTP response headers.
path, headers = urlretrieve(url, "./example-data/nke-10k-2023.pdf")
for name, value in headers.items():
    print(name, value)

Server keycdn
Date Tue, 02 Jul 2024 15:19:39 GMT
Content-Type application/pdf
Content-Length 3463188
Connection close
x-amz-id-2 95AUgM+RlgNMm73LSi/DJXE2dkdEmI8rwPHfvD7EVZhxjciwHkegojmUYlD8jZ/K7N1kSqyHZ/8=
x-amz-request-id BCE7948DRTH03FWM
Last-Modified Sat, 29 Jul 2023 00:53:25 GMT
ETag "897f2d6c858dcc4de8d43910d931b81a"
Expires Tue, 02 Jul 2024 16:19:39 GMT
Cache-Control max-age=3600
X-Cache HIT
X-Edge-Location sgsg
Access-Control-Allow-Origin *
Accept-Ranges bytes


Check that the file was downloaded successfully:

In [6]:
import os

os.listdir("./example-data")

['nke-10k-2023.pdf']
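
A directory listing only shows that a file exists. As an optional extra check, the sketch below tests whether the file starts with the `%PDF-` header that PDF files normally begin with. The helper name `looks_like_pdf` is made up for this example, and the check is a heuristic, not a full validation of the file.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file exists and begins with the PDF header bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        # PDF files normally start with "%PDF-" followed by a version number.
        return f.read(5) == b"%PDF-"


print(looks_like_pdf("./example-data/nke-10k-2023.pdf"))
```

If this prints `False`, the download may have failed or returned an HTML error page instead of the PDF.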

In [7]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./example-data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

106


__API Reference__: [PyPDFLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html)

In [8]:
print(docs[0].page_content[0:100])
print(docs[0].metadata)

FORM 10-K FORM 10-K
{'source': './example-data/nke-10k-2023.pdf', 'page': 0}


So what just happened?

* The loader reads the PDF at the specified path into memory.
* It then extracts text data using the pypdf package.
* Finally, it creates a [LangChain Document](https://python.langchain.com/v0.2/docs/concepts/#documents) for each page of the PDF with the page's content and some metadata about where in the document the text came from.
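
The last step can be pictured with a few lines of plain Python. This is only a sketch of the idea, not the loader's actual implementation: `PageDocument` is a hypothetical stand-in for LangChain's `Document` class, and the page texts are placeholder strings rather than real pypdf output.

```python
from dataclasses import dataclass, field


@dataclass
class PageDocument:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts, source):
    """Wrap each page's text with its source path and zero-based page index."""
    return [
        PageDocument(page_content=text, metadata={"source": source, "page": i})
        for i, text in enumerate(page_texts)
    ]


# Placeholder page texts; a real loader would extract these from the PDF.
docs_sketch = pages_to_documents(
    ["FORM 10-K FORM 10-K", "placeholder text for the second page"],
    "./example-data/nke-10k-2023.pdf",
)
print(len(docs_sketch))
print(docs_sketch[0].metadata)
```

The metadata mirrors the shape printed above: one record per page, with a `source` path and a `page` index that starts at 0.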

LangChain has [many other document loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/) for other data sources, or you can create a [custom document loader](https://python.langchain.com/v0.2/docs/how_to/document_loader_custom/).

## Question answering with RAG
Next, you'll prepare the loaded document for later retrieval. Using a [text splitter](https://python.langchain.com/v0.2/docs/concepts/#text-splitters), you'll split your loaded documents into smaller documents that fit more easily into an LLM's context window, then load them into a [vector store](https://python.langchain.com/v0.2/docs/concepts/#vector-stores). You can then create a [retriever from the vector store](https://python.langchain.com/v0.2/docs/concepts/#retrievers) for use in the RAG chain.
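
For intuition, retrieval from a vector store can be pictured as a nearest-neighbour search over embedding vectors. The toy sketch below ranks three hand-written vectors by cosine similarity to a query vector. The vectors, labels, and function names are invented for illustration; a real vector store uses model-generated embeddings and its own distance metric and index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec.

    indexed is a list of (text, vector) pairs.
    """
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up three-dimensional "embeddings" for three chunks.
toy_index = [
    ("revenue table", [0.9, 0.1, 0.0]),
    ("legal boilerplate", [0.0, 0.2, 0.9]),
    ("segment results", [0.7, 0.6, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], toy_index, k=2))
# → ['revenue table', 'segment results']
```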

### Setup
Load your OpenAI credentials from a `.env` file using the [python-dotenv package](https://pypi.org/project/python-dotenv/).

This example uses the OpenAI `gpt-4o` model.

In [11]:
import os
from dotenv import load_dotenv

load_dotenv()
# Fail early with a clear message if the key is missing or empty.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

Now use `RecursiveCharacterTextSplitter` to split the text into chunks of up to 1000 characters, with an overlap of 200 characters between consecutive chunks.

Then load the splits into the in-memory `Chroma` vector store.

Lastly, expose the vector store as a retriever:

In [13]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
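
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch that cuts a string into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to break on separators such as paragraph and line breaks first. The function name `chunk_text` is made up for this example.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(chunk_text(sample, chunk_size=10, chunk_overlap=3))
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one. The overlap makes it less likely that a sentence or table row is cut off at a chunk boundary with no surrounding context.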

Finally, you'll use some built-in helpers to construct the final `rag_chain`:

In [14]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "What was Nike's revenue in 2023?"})

results

{'input': "What was Nike's revenue in 2023?",
 'context': [Document(page_content='FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\nThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and \nmajor product line:\nFISCAL 2023 COMPARED TO FISCAL 2022\n•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported \nand currency-neutral basis, respectively. The increase was due to higher revenues in North America, Europe, Middle East & \nAfrica ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. \nRevenues, respectively. \n•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues,  increased  10% and 16% on a reported and \ncurrency-neutral basis, respectively. This increase was primarily due to higher revenues in Men\'s, the Jordan Brand, \nWomen\'s and Kids\' which grew 17%, 35%,11% and 10%, respectiv
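
For intuition about the "stuff" step, the plain-Python sketch below joins the retrieved chunks into one string and substitutes it for `{context}` in a system prompt. It is a rough picture, not the library's implementation: the shortened template, the `"\n\n"` separator, and the function names are assumptions made for this example.

```python
# Shortened version of the system prompt used above.
SYSTEM_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question."
    "\n\n"
    "{context}"
)


def stuff_context(chunks, separator="\n\n"):
    """Concatenate all retrieved chunks into a single context string."""
    return separator.join(chunks)


def build_messages(question, chunks):
    """Return (role, text) pairs resembling the filled-in chat prompt."""
    context = stuff_context(chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


# Placeholder chunk texts standing in for retrieved page content.
messages = build_messages(
    "What was Nike's revenue in 2023?",
    ["placeholder chunk A", "placeholder chunk B"],
)
for role, text in messages:
    print(f"[{role}] {text}")
```

Because every retrieved chunk goes into a single prompt, the number and size of the chunks must fit within the model's context window.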

In [15]:
results = rag_chain.invoke({"input": "Which geography contributed the most to Nike's revenue in 2023?"})

results["answer"]

"North America contributed the most to Nike's revenue in 2023, with total revenues of $21,608 million."

The results dict contains both the final answer, under the `answer` key, and the documents the LLM used to generate it, under the `context` key.

Examining the values under `context` further, you can see that they are documents that each contain a chunk of the ingested page content. Usefully, these documents also preserve the original metadata from when you first loaded them:

In [16]:
print(results["context"][0].page_content)

NORTH AMERICA
(Dollars in millions) FISCAL 2023 FISCAL 2022 % CHANGE% CHANGE 
EXCLUDING 
CURRENCY 
CHANGES FISCAL 2021 % CHANGE% CHANGE 
EXCLUDING 
CURRENCY 
CHANGES
Revenues by:
Footwear $ 14,897 $ 12,228  22 %  22 % $ 11,644  5 %  5 %
Apparel  5,947  5,492  8 %  9 %  5,028  9 %  9 %
Equipment  764  633  21 %  21 %  507  25 %  25 %
TOTAL REVENUES $ 21,608 $ 18,353  18 %  18 % $ 17,179  7 %  7 %
Revenues by:    
Sales to Wholesale Customers $ 11,273 $ 9,621  17 %  18 % $ 10,186  -6 %  -6 %
Sales through NIKE Direct  10,335  8,732  18 %  18 %  6,993  25 %  25 %
TOTAL REVENUES $ 21,608 $ 18,353  18 %  18 % $ 17,179  7 %  7 %
EARNINGS BEFORE INTEREST 
AND TAXES $ 5,454 $ 5,114  7 % $ 5,089  0 %
FISCAL 2023 COMPARED TO FISCAL 2022
•North America revenues increased 18% on a currency-neutral basis, primarily due to higher revenues in Men's and the 
Jordan Brand. NIKE Direct revenues increased 18%, driven by strong digital sales growth of 23%, comparable store sales


In [17]:
print(results["context"][0].metadata)

{'page': 40, 'source': './example-data/nke-10k-2023.pdf'}
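
This metadata is what makes citations possible. The sketch below is one possible way to turn the `context` documents into a short, de-duplicated source list. It relies only on each document having a `metadata` dict, so `SimpleNamespace` objects stand in for real documents here. The helper name `format_citations` and the second page index (35) are made up for illustration.

```python
from types import SimpleNamespace


def format_citations(context_docs):
    """Return de-duplicated 'source, page N' strings in retrieval order."""
    seen = set()
    citations = []
    for doc in context_docs:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        # The loader's page index starts at 0 (the first page printed
        # 'page': 0 earlier), so add 1 for a human-readable page number.
        label = f"{source}, page {page + 1}" if page is not None else source
        if label not in seen:
            seen.add(label)
            citations.append(label)
    return citations


# Stand-ins for results["context"]; only the metadata matters here.
fake_context = [
    SimpleNamespace(metadata={"page": 40, "source": "./example-data/nke-10k-2023.pdf"}),
    SimpleNamespace(metadata={"page": 40, "source": "./example-data/nke-10k-2023.pdf"}),
    SimpleNamespace(metadata={"page": 35, "source": "./example-data/nke-10k-2023.pdf"}),
]
print(format_citations(fake_context))
# → ['./example-data/nke-10k-2023.pdf, page 41', './example-data/nke-10k-2023.pdf, page 36']
```

With the real chain, you would pass `results["context"]` instead of `fake_context`.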
