- pip install langchain-community pypdf
- **Build a semantic search engine over a PDF with document loaders, embedding models, and vector stores.**
# Load environment variables

In [1]:
from dotenv import load_dotenv
import os

try:
    print(__file__)  # Check whether __file__ is defined
    dotenv_path = os.path.join(os.path.dirname(__file__), '../.env')
except NameError:
    print("Running in an interactive environment, using current directory instead.")
    dotenv_path = os.path.join(os.getcwd(), '../.env')

load_dotenv(dotenv_path, override=True)

Running in an interactive environment, using current directory instead.


True

# Documents and Document Loaders

In [39]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
documents

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.')]

# Loading documents

In [40]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

107


In [41]:
# The page's string content, plus metadata containing the file name and page number
print(f"{docs[0].page_content[:200]}\n ======")
print(docs[0].metadata)

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F
{'source': './example_data/nke-10k-2023.pdf', 'page': 0, 'page_label': '1'}


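# Splitting
We can use a text splitter to break the loaded pages into smaller chunks. Here we use a simple character-based splitter: documents are split into chunks of 1000 characters with 200 characters of overlap between consecutive chunks. The overlap reduces the chance of separating a statement from important context related to it. We use RecursiveCharacterTextSplitter, which recursively splits the document on common separators (such as newlines) until each chunk is an appropriate size. This is the recommended text splitter for generic text use cases.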

In [42]:
# add_start_index=True records each chunk's character offset in its metadata
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

516
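
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window in plain Python. It is a simplification for illustration only, not how RecursiveCharacterTextSplitter works internally (that splitter prefers natural separators such as newlines). The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs using a fixed-size sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for start, chunk in chunks:
    print(start, chunk)
```

Each chunk begins with the last four characters of the previous one, which is the same idea as the 200-character overlap used above, and the recorded start offset plays the role of the `start_index` metadata field.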

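# Embeddings
- Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use a vector similarity metric (such as cosine similarity) to identify related text.
- LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector.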

In [43]:
from langchain_ollama import OllamaEmbeddings
# nomic-embed-text
# shunyue/llama3-chinese-shunyue:latest
embeddings = OllamaEmbeddings(model="nomic-embed-text")

In [44]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 768

[-0.02281179, 0.071803115, -0.19200394, -0.06733745, 0.027059343, -0.024033826, 0.05242733, 0.010079667, 0.07523421, 0.0086905705]
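
As a rough illustration of the cosine similarity mentioned above, the following sketch computes it in plain Python on small hand-written vectors. The function name and the toy vectors are assumptions made for the example; real embeddings such as the 768-dimensional vectors above would be compared the same way.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions
print(round(same_direction, 6), orthogonal, opposite)
```

In the notebook, calling `cosine_similarity(vector_1, vector_2)` would compare the first two chunks in the same way.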


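# Vector stores
- LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics. They are often initialized with an embedding model, which determines how text data is translated into numeric vectors.
- There are many types of vector store databases, for example Chroma.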

In [46]:
# pip install -qU langchain-chroma

In [47]:
from langchain_chroma import Chroma

vector_store = Chroma(embedding_function=embeddings)

In [48]:
# Having instantiated our vector store, we can now index the documents
ids = vector_store.add_documents(documents=all_splits)
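
To show what "add, then query by similarity" amounts to conceptually, here is a toy in-memory store in plain Python. Everything in it is hypothetical: `ToyVectorStore`, `toy_embed` (word counts standing in for a real embedding model), and the method names only loosely mirror the interface described above. It is a sketch of the idea, not of Chroma or any LangChain class.

```python
import math
import uuid
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self, embed):
        self._embed = embed
        self._rows = {}  # id -> (vector, text)

    def add_texts(self, texts):
        """Embed and store each text; return the generated ids."""
        ids = []
        for text in texts:
            doc_id = str(uuid.uuid4())
            self._rows[doc_id] = (self._embed(text), text)
            ids.append(doc_id)
        return ids

    def similarity_search(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        q = self._embed(query)
        ranked = sorted(self._rows.values(), key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
ids = store.add_texts([
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
])
top = store.similarity_search("independent pets", k=1)
print(len(ids), top)
```

A real vector store replaces the word counts with dense embeddings and the linear scan with an index, but the add-then-rank flow is the same.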

## Query methods provided by VectorStore
- Once we have instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

- Synchronously and asynchronously;
- By string query and by vector;
- With and without returning similarity scores;
- By similarity and by maximum marginal relevance (to balance similarity to the query against diversity in the retrieved results).
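
The trade-off behind maximum marginal relevance can be sketched in a few lines of plain Python. This is a minimal greedy version under stated assumptions: cosine similarity as the metric, a hypothetical `lambda_mult` weight where 1.0 means pure similarity and lower values favor diversity, and hand-written 2-D vectors instead of real embeddings. It is not taken from any vector store's implementation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
doc_vectors = [
    [1.0, 0.1],   # 0: very close to the query
    [1.0, 0.11],  # 1: near-duplicate of document 0
    [0.6, 0.8],   # 2: less similar, but different from 0 and 1
]
by_similarity = mmr(query, doc_vectors, k=2, lambda_mult=1.0)
diverse = mmr(query, doc_vectors, k=2, lambda_mult=0.3)
print(by_similarity, diverse)
```

With `lambda_mult=1.0` the near-duplicate is returned as the second hit; with a lower weight the second slot goes to the document that adds new information.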

### Return documents based on similarity to a string query

In [50]:
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])

page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213 
NIKE Brand in-line stores (including employee-only stores) 74 
Converse stores (including factory stores) 82 
TOTAL 369 
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'page': 4, 'page_label': '5', 'source': './example_data/nke-10k-2023.pdf', 'start_index': 3125}


### Return scores

In [53]:
# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 0.3203344941139221

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTSThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
• NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
• NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metad

### Async query

In [51]:
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])

page_content='transition of NIKE Brand businesses in certain countries within APLA to third-party distributors.
The Company's NIKE Direct operations are managed within each NIKE Brand geographic operating segment. Converse is also a reportable segment for the Company
and operates in one industry: the design, marketing, licensing and selling of athletic lifestyle sneakers, apparel and accessories.
Global Brand Divisions is included within the NIKE Brand for presentation purposes to align with the way management views the Company. Global Brand Divisions
revenues include NIKE Brand licensing and other miscellaneous revenues that are not part of a geographic operating segment. Global Brand Divisions costs represent
demand creation and operating overhead expense that include product creation and design expenses centrally managed for the NIKE Brand, as well as costs associated
with NIKE Direct global digital operations and enterprise technology.
(1)
2023 FORM 10-K 84' metadata={'page': 86, '
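
The bare `await` above works because notebooks run inside an event loop. In a plain script, one option is to wrap the calls in a coroutine and start it with `asyncio.run`. The sketch below assumes a hypothetical `fake_async_search` coroutine standing in for an awaitable search method, so it runs without a vector store; it also shows `asyncio.gather` issuing several queries concurrently.

```python
import asyncio


async def fake_async_search(query: str) -> str:
    """Hypothetical stand-in for an awaitable search call."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"result for: {query}"


async def run_queries(queries: list[str]) -> list[str]:
    # gather runs the searches concurrently and returns results in input order
    return await asyncio.gather(*(fake_async_search(q) for q in queries))


answers = asyncio.run(
    run_queries(["When was Nike incorporated?", "What was Nike's revenue in 2023?"])
)
print(answers)
```

Note that `asyncio.run` cannot be called from inside an already running event loop, so in a notebook cell the equivalent would be `await run_queries(...)`.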

### Return documents based on similarity to an embedded query

In [54]:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and other areas to create an end-to-end technology foundation, which we
believe will further accelerate our digital transformation. We believe this unified approach will accelerate growth and unlock more efficiency for our business, while driving
speed and responsiveness as we serve consumers globally.
FINANCIAL HIGHLIGHTS
• In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively
• NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for
fiscal 2023
• Gross margin for the fiscal year decreased 250 basis points to 43.5% primarily driven by higher product costs, higher markdowns and unfavorable changes in foreign
currency exchange rates, partially offset by strategic pricing action

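# Retrievers
- LangChain VectorStore objects do not subclass Runnable. LangChain retrievers are Runnables, so they implement a standard set of methods (for example, synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can also interface with non-vector-store sources of data, such as external APIs.

- We can create a simple version of this ourselves, without subclassing Retriever. Once we choose which method we want to use to retrieve documents, we can easily create a runnable. Below we build one around the similarity_search method: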

In [57]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(id='1445be55-d806-49d4-884f-88c84ea89a8f', metadata={'page': 4, 'page_label': '5', 'source': './example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
 [Document(id='9b30c172-50c7-4303-bc46-751fe383874b', metadata={'page': 86, 'page_label': '87', 'source': './example_data/nke-10k-2023.pdf', 'start_index': 3033}, page_content="transition of NIKE Brand businesses in certain countries within APLA to third-party distributors.\nThe Company's NIKE Direct operations are managed within each NIKE Brand geographic operating segment. Conve
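
To illustrate what a "standard set of methods" buys us, here is a toy wrapper in plain Python that exposes `invoke` and `batch` around any search function. `ToyRunnable`, `keyword_search`, and `CORPUS` are hypothetical names for this sketch; it mimics the shape of the interface only and is not LangChain's Runnable implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


class ToyRunnable:
    """Hypothetical wrapper exposing invoke and batch around a plain function."""

    def __init__(self, func: Callable[[str], List[str]]):
        self._func = func

    def invoke(self, query: str) -> List[str]:
        return self._func(query)

    def batch(self, queries: List[str]) -> List[List[str]]:
        # Run invoke for each input on a thread pool; map preserves input order
        with ThreadPoolExecutor() as pool:
            return list(pool.map(self.invoke, queries))


CORPUS = [
    "NIKE has eight significant distribution centers.",
    "Converse is a reportable segment.",
]


def keyword_search(query: str) -> List[str]:
    """Return corpus entries sharing at least one lowercase word with the query."""
    words = set(query.lower().split())
    return [text for text in CORPUS if words & set(text.lower().split())]


toy_retriever = ToyRunnable(keyword_search)
single = toy_retriever.invoke("distribution centers")
batched = toy_retriever.batch(["distribution centers", "converse"])
print(single, batched)
```

Because callers only depend on `invoke` and `batch`, the keyword search could be swapped for a vector store lookup or an external API without changing the calling code.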

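- Vector stores implement an as_retriever method that generates a retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify which methods of the underlying vector store to call and how to parameterize them. For example, we can replicate the `@chain`-based retriever with the following: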

In [58]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(id='1445be55-d806-49d4-884f-88c84ea89a8f', metadata={'page': 4, 'page_label': '5', 'source': './example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
 [Document(id='9b30c172-50c7-4303-bc46-751fe383874b', metadata={'page': 86, 'page_label': '87', 'source': './example_data/nke-10k-2023.pdf', 'start_index': 3033}, page_content="transition of NIKE Brand businesses in certain countries within APLA to third-party distributors.\nThe Company's NIKE Direct operations are managed within each NIKE Brand geographic operating segment. Conve

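- **VectorStoreRetriever supports the search types "similarity" (the default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to filter the documents returned by the retriever by similarity score.**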
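
The idea of thresholding by score can be sketched independently of any library. The example below assumes scores are similarities where higher means more relevant, which is the opposite convention from the distance printed by `similarity_search_with_score` earlier. The function name, the chunk labels, and the score values are made up for illustration.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep (doc, similarity) pairs at or above the threshold, best first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical (document, similarity) pairs
scored = [("chunk A", 0.82), ("chunk B", 0.41), ("chunk C", 0.67)]
kept = filter_by_threshold(scored, score_threshold=0.6)
print(kept)
```

Unlike a fixed `k`, a threshold can return fewer results, or none at all, when nothing in the store is close enough to the query.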