PDF Q&A Bot

Load environment variables

In [2]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_classic.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_classic.chains import RetrievalQA
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

True

Step 1: Load the PDF document

In [3]:
print("Loading the PDF document...")
loader = PyPDFLoader("files/ddia.pdf")
documents = loader.load()
print(f"✓ Loaded {len(documents)} pages")

Loading the PDF document...
✓ Loaded 613 pages


Step 2: Split the document

In [4]:
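print("Splitting the document...")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum chunk length, in characters
    chunk_overlap=200   # characters shared between neighboring chunks
)
chunks = text_splitter.split_documents(documents)
print(f"✓ Split into {len(chunks)} text chunks")

Splitting the document...
✓ Split into 1976 text chunks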
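
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification and not the algorithm RecursiveCharacterTextSplitter uses: the real splitter first tries to break on separators such as paragraph and line breaks, so its chunks are often shorter than chunk_size and end at more natural boundaries. The helper name split_with_overlap is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Simplified fixed-window splitter (hypothetical helper, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# A 2500-character toy string stands in for the text of one page.
sample = "".join(str(i % 10) for i in range(2500))
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])             # → [1000, 1000, 900]
print(pieces[0][-200:] == pieces[1][:200])  # → True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.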


Step 3: Embed and store

In [6]:
print("Building the vector database...")
# embeddings = OpenAIEmbeddings()
# Alternative: another provider that exposes an OpenAI-compatible API
import os
embeddings = OpenAIEmbeddings(
    base_url=os.getenv("OPENAI_API_BASE"),
    api_key=os.getenv("OPENAI_API_KEY"),
    model="Qwen/Qwen3-Embedding-4B"
)
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./files/pdf_vectordb"
)
print("✓ Vector database built")

Building the vector database...
✓ Vector database built


Step 4: Create the retriever

In [7]:
retriever = vectordb.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # return the 4 most relevant chunks per query
)
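
As a rough picture of what a similarity search with k=4 does, the sketch below ranks a few toy vectors against a query vector by cosine similarity and keeps the top four. This is a brute-force illustration under assumptions: the distance metric Chroma actually applies depends on how the collection is configured, and it uses an index rather than a full scan. The names cosine_similarity, top_k, and toy_docs are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Two-dimensional toy "embeddings"; real embeddings have far more dimensions.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], toy_docs, k=4))  # → [0, 1, 4, 2]
```

Note that the search always returns k results, even when none of them is a good match for the query.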

Step 5: Build the RetrievalQA chain

In [9]:
# llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# Alternative: another provider that exposes an OpenAI-compatible API
llm = ChatOpenAI(
    base_url=os.getenv("OPENAI_API_BASE"),
    api_key=os.getenv("OPENAI_API_KEY"),
    model="deepseek-ai/DeepSeek-V3",
    temperature=0.7
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" all retrieved chunks into a single prompt
    retriever=retriever,
    return_source_documents=True  # return source documents so answers can be traced back
)
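
The "stuff" strategy can be pictured as plain string concatenation: every retrieved chunk is pasted into one prompt ahead of the question. The sketch below shows that idea only. The prompt wording is invented for illustration and is not the template RetrievalQA uses, and build_stuff_prompt is a hypothetical helper.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into one prompt (illustrative wording only)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What is replication?",
    ["Chunk about replication.", "Chunk about partitioning."],
)
print(prompt)
```

Because everything goes into a single prompt, the combined length of the k retrieved chunks plus the question has to fit within the model's context window.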

Step 6: Start asking questions

In [10]:
print("\n=== PDF Q&A bot is ready ===")
print("Type 'quit' to end the conversation\n")

while True:
    question = input("Your question: ")
    if question.lower() in ['退出', 'quit', 'exit']:
        print("Goodbye!")
        break

    result = qa_chain.invoke({"query": question})
    print(f"\nAnswer: {result['result']}\n")
    print(f"Sources: {len(result['source_documents'])} document chunks\n")


=== PDF Q&A bot is ready ===
Type 'quit' to end the conversation


Answer: Based on the provided context excerpts from "Designing Data-Intensive Applications" (DDIA), I can see snippets discussing:

1. Change streams (page 456)
2. Distributed transactions (page 361)
3. Graph processing (page 425)
4. Services architecture (pages 131-136)
5. Asynchronous networks and replication
6. Atomic operations and transactions
7. Log file analysis examples

However, without knowing your specific question, I cannot provide a targeted answer. Could you please clarify what you'd like to know about these topics from DDIA?

Some areas I could potentially help explain based on the context:
- How change streams work
- Distributed transaction concepts
- Asynchronous vs synchronous replication
- Atomicity guarantees in databases
- Log processing patterns

But I'd need your specific question to provide a proper answer. The book covers many concepts about designing robust, scalable data systems.

Sources: 4 document chunks

Goodbye!
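
The loop above reports only how many source chunks came back. Since return_source_documents=True makes the chunks themselves available, a small helper could list where each one came from. The sketch below assumes each returned document exposes page_content and a metadata dict with "source" and "page" keys, and that the page value is a zero-based index. FakeDocument and format_sources are hypothetical names, with FakeDocument standing in for the real document objects so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document (assumed shape: page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(docs, preview_chars: int = 60) -> list[str]:
    """Build one summary line per source chunk: origin, page index, and a text preview."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        # Collapse whitespace so the preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{i}] {source}, page index {page}: {preview}")
    return lines


docs = [FakeDocument("Change  streams\nare useful.", {"source": "files/ddia.pdf", "page": 455})]
for line in format_sources(docs):
    print(line)  # → [1] files/ddia.pdf, page index 455: Change streams are useful.
```

Inside the loop, the same helper could be called as format_sources(result['source_documents']).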
