# Building a PDF Parsing and Question-Answering System

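PDF files often hold important unstructured data that is not available from any other source. They can be quite long and, unlike plain text files, usually cannot be fed directly into a language model's prompt.

In this tutorial, you'll create a system that can answer questions about PDF files. More specifically, you'll use a document loader to load the text into a format a language model can use, then build a retrieval-augmented generation (RAG) pipeline that answers questions and cites the source material.

This tutorial glosses over some concepts that are covered in more depth in our RAG tutorial, so you may want to read that first if you haven't already.

Let's get started!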

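### Loading documents

First, you'll need to choose a PDF to load. We'll use a document from Nike's annual public SEC report. It's over 100 pages long and contains some important data mixed with longer explanatory text. However, feel free to use a PDF of your own.

Once you've chosen your PDF, the next step is to load it into a format that an LLM can handle more easily, since LLMs generally require text input. LangChain has several built-in document loaders you can experiment with. Below, we'll use one backed by the `pypdf` package that reads from a file path: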

In [4]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF document
loader = PyPDFLoader("../example_data/nke-10k-2023.pdf")
docs = loader.load()


In [5]:
print(len(docs))

106


In [13]:
print(docs[0].page_content[0:100])
print(docs[0].metadata)

FORM 10-K FORM 10-K
{'source': '../example_data/nke-10k-2023.pdf', 'page': 0}


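So what just happened?

- The loader reads the PDF at the specified path into memory.
- It then extracts the text data using the `pypdf` package.
- Finally, it creates a LangChain Document for each page of the PDF, containing the page's content and some metadata about where in the document the text came from.

LangChain provides many other document loaders for other data sources, or you can create a custom document loader.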
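To make the last step above more concrete, here is a minimal sketch of wrapping per-page text in document records that carry `source` and zero-based `page` metadata, mirroring the output shown earlier. `PageDocument` and `pages_to_documents` are hypothetical names introduced for illustration, not LangChain APIs, and the page texts are assumed to have been extracted already.

```python
from dataclasses import dataclass, field


@dataclass
class PageDocument:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts, source):
    """Wrap each already-extracted page text in a record with source/page metadata."""
    return [
        PageDocument(page_content=text, metadata={"source": source, "page": index})
        for index, text in enumerate(page_texts)
    ]


# Made-up page texts, for illustration only.
example_docs = pages_to_documents(["FORM 10-K", "Table of contents"], "example.pdf")
print(len(example_docs))
print(example_docs[0].metadata)
```

Keeping the page index next to the text is what later lets an answer be traced back to a specific page.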

In [18]:
import os
from langchain_openai import ChatOpenAI
from langchain_cohere import CohereEmbeddings

API_SECRET_KEY = ""  # your OpenAI API key
BASE_URL = ""  # if you use a proxy base URL, remember to append /v1

os.environ["OPENAI_API_KEY"] = API_SECRET_KEY
os.environ["OPENAI_API_BASE"] = BASE_URL
llm = ChatOpenAI(temperature=0)

# Read the Cohere key from the environment rather than hard-coding a secret in the notebook.
embeddings_model = CohereEmbeddings(
    cohere_api_key=os.environ["COHERE_API_KEY"],
    model="embed-english-v3.0",
)

In [19]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings_model)

retriever = vectorstore.as_retriever()
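The `chunk_size=1000` and `chunk_overlap=200` settings above can be hard to picture, so here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical helper that only illustrates what the two parameters mean when sizes are measured in characters.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 250  # 2,500 characters of filler text
chunks = split_with_overlap(sample)
print([len(chunk) for chunk in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.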

Finally, you'll use some built-in helpers to construct the final `rag_chain`:

In [20]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "What was Nike's revenue in 2023?"})

results

{'input': "What was Nike's revenue in 2023?",
 'context': [Document(metadata={'page': 36, 'source': '../example_data/nke-10k-2023.pdf'}, page_content='FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\nThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and \nmajor product line:\nFISCAL 2023 COMPARED TO FISCAL 2022\n•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported \nand currency-neutral basis, respectively. The increase was due to higher revenues in North America, Europe, Middle East & \nAfrica ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. \nRevenues, respectively. \n•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues,  increased  10% and 16% on a reported and \ncurrency-neutral basis, respectively. This increase was primarily due to higher revenues in Men\'s, the Jordan Br

You can see that the results dictionary contains a final answer under the `answer` key, along with the `context` the LLM used to generate it.
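To illustrate how retrieved context reaches the model, the sketch below mimics the "stuff" approach in plain Python: the text of every retrieved chunk is joined with a separator and substituted into the `{context}` slot of the system prompt. `stuff_context` and the shortened `SYSTEM_TEMPLATE` are hypothetical, and the real `create_stuff_documents_chain` helper may differ in its details; treat this as a conceptual sketch only.

```python
# Shortened, hypothetical version of the system prompt used above.
SYSTEM_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def stuff_context(chunk_texts, separator="\n\n"):
    """Join all retrieved chunk texts and place them in the prompt's context slot."""
    return SYSTEM_TEMPLATE.format(context=separator.join(chunk_texts))


# Made-up chunk texts, for illustration only.
stuffed = stuff_context(["First retrieved chunk.", "Second retrieved chunk."])
print(stuffed)
```

Because every retrieved chunk is placed in the prompt, the number and size of chunks directly determine how much of the model's context window is used.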

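Looking more closely at the values under `context`, you can see that they are documents, each containing a chunk of the ingested page content. Usefully, these documents also preserve the original metadata from when you first loaded them: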

In [21]:
print(results["context"][0].page_content)

FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and 
major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported 
and currency-neutral basis, respectively. The increase was due to higher revenues in North America, Europe, Middle East & 
Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. 
Revenues, respectively. 
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues,  increased  10% and 16% on a reported and 
currency-neutral basis, respectively. This increase was primarily due to higher revenues in Men's, the Jordan Brand, 
Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.


In [22]:
print(results["context"][0].metadata)

{'page': 36, 'source': '../example_data/nke-10k-2023.pdf'}


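This particular chunk has a `page` value of 36 in its metadata. Because the loader numbers pages from zero (the first page loaded above has `'page': 0`), that corresponds to the 37th page of the original PDF. You can use this data to show which page of the PDF an answer came from, allowing users to quickly verify that answers are grounded in the source material.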
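One possible way to surface this to users is to turn the `context` metadata into a short list of citations. The sketch below assumes that each context item exposes a `metadata` dictionary with `source` and `page` keys, and that `page` is zero-based (as the `'page': 0` on the first loaded page suggests). `format_citations` is a hypothetical helper, and `SimpleNamespace` objects stand in for real documents so the example runs on its own.

```python
from types import SimpleNamespace


def format_citations(context_docs):
    """Return sorted, de-duplicated citations with human-friendly (one-based) page numbers."""
    seen = set()
    for doc in context_docs:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        if page is not None:
            seen.add((source, page))
    # Assumption: the stored page index is zero-based, so add 1 for display.
    return [f"{source}, page {page + 1}" for source, page in sorted(seen)]


# Stand-ins for retrieved documents; only the metadata matters here.
fake_context = [
    SimpleNamespace(metadata={"page": 36, "source": "../example_data/nke-10k-2023.pdf"}),
    SimpleNamespace(metadata={"page": 36, "source": "../example_data/nke-10k-2023.pdf"}),
    SimpleNamespace(metadata={"page": 3, "source": "../example_data/nke-10k-2023.pdf"}),
]
citations = format_citations(fake_context)
print(citations)
```

With the chain from this tutorial, the same helper could be called as `format_citations(results["context"])`.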