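# 2. Overview of Embedding-Based Retrieval
In this first lesson, we review the main components of an embedding-based retrieval system and how they work together with a large language model (LLM) in a retrieval-augmented generation (RAG) loop.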


<div class="toc">
    <ul class="toc-item">
        <li><span><a href="#一课程notebook注意事项" data-toc-modified-id="一、课程notebook注意事项">I. Notes on the Course Notebook</a></span></li>
        <li>
        <span><a href="#二课程内容" data-toc-modified-id="二、课程内容">II. Course Content</a></span></li><li>
        <ul class="toc-item">
            <li><span><a href="#21-系统运作原理" data-toc-modified-id="2.1 系统运作原理">2.1 How the System Works</a></span></li>
            <li><span><a href="#22-系统具体实现" data-toc-modified-id="2.2 系统具体实现">2.2 Implementing the System</a></span></li>
        </ul>
        </li>
    </ul>
</div>

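## I. Notes on the Course Notebook

- While the notebook runs, it may emit a large number of warnings. This is expected and does not affect the results, so the warnings can be ignored.
- Some operations (such as calling the LLM or using generated datasets) can return non-deterministic results, so your output may differ from what is shown in the video.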

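## II. Course Content
### 2.1 How the System Works
In the Chroma example, retrieval augmentation works as follows. Before any user query arrives, the documents have already been embedded and stored in the retrieval system.
When a query comes in, it is run through the same embedding model that was used for the documents, which produces a query embedding.
The retrieval system then uses that embedding to find the most relevant documents via nearest-neighbor search.
Finally, the query and the retrieved documents are passed together to the LLM, which synthesizes the information in those documents to generate an answer.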
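
The nearest-neighbor step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how Chroma is implemented: the three-dimensional vectors are hand-written stand-ins for real embeddings, cosine similarity is just one common choice of similarity measure (Chroma's own index and distance metric may differ), and `cosine_similarity` and `nearest_neighbors` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 3-dimensional embeddings; a real model produces hundreds of dimensions.
documents = ["revenue table", "sustainability report", "letter to shareholders"]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.6], [0.1, 0.2, 0.9]]
query_vec = [1.0, 0.0, 0.1]  # stands in for the embedded user query

top = nearest_neighbors(query_vec, doc_vecs, k=2)

# The retrieved documents are joined into the context that is handed to the LLM.
context = "\n\n".join(documents[i] for i in top)
print(context)
```

A brute-force scan like this is fine for a handful of vectors; vector databases exist largely to make the same lookup fast over many documents.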



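### 2.2 Implementing the System

First, we import a helper function from our utilities. The `helper_utils.py` file can be found in the current directory.
The function, `word_wrap`, is a basic word-wrapping helper that lets us view documents in a clean, readable layout.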

In [2]:
from helper_utils import word_wrap
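
The contents of `helper_utils.py` are not shown here, so the following is only a guess at what such a helper could look like, built on the standard library's `textwrap` module. The name `word_wrap_sketch` and its `n_chars` parameter are hypothetical, and the real `word_wrap` may behave differently.

```python
import textwrap

def word_wrap_sketch(text, n_chars=72):
    """Wrap each existing line to at most n_chars columns, breaking at spaces."""
    wrapped_lines = []
    for line in text.split("\n"):
        # textwrap.wrap returns [] for an empty line, so keep the blank line explicitly.
        wrapped_lines.extend(textwrap.wrap(line, width=n_chars) or [""])
    return "\n".join(wrapped_lines)

sample = "We are living through a period of historic economic, societal, and geopolitical change."
print(word_wrap_sketch(sample, n_chars=40))
```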

In [3]:
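# Import the PDF reader
from pypdf import PdfReader
# Use microsoft_annual_report_2022 as the example file
reader = PdfReader("./data/microsoft_annual_report_2022.pdf")
# Extract the text of each page and strip leading/trailing whitespace
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Drop empty strings, because the retrieval system cannot accept empty documents
pdf_texts = [text for text in pdf_texts if text]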

print(word_wrap(pdf_texts[0]))

1 Dear shareholders, colleagues, customers, and partners:  
We are
living through a period of historic economic, societal, and
geopolitical change. The world in 2022 looks nothing like 
the world in
2019. As I write this, inflation is at a 40 -year high, supply chains
are stretched, and the war in Ukraine is 
ongoing. At the same time, we
are entering a technological era with the potential to power awesome
advancements 
across every sector of our economy and society. As the
world’s largest software company, this places us at a historic

intersection of opportunity and responsibility to the world around us.
 
Our mission to empower every person and every organization on the
planet to achieve more has never been more 
urgent or more necessary.
For all the uncertainty in the world, one thing is clear: People and
organizations in every 
industry are increasingly looking to digital
technology to overcome today’s challenges and emerge stronger. And no

company is better positioned to help th

If you want to look at the PDF file itself, you can find it in the data directory.
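
The cleaning step in the extraction cell above (strip every page, then drop the empty ones) can be tried without a PDF at hand. In this small sketch, `raw_pages` is a made-up list standing in for the strings that `extract_text()` might return; pages with no extractable text typically come back as empty or whitespace-only strings.

```python
# Hypothetical page strings standing in for the output of extract_text().
raw_pages = ["  Dear shareholders,\n", "", "   \n\t", "Revenue increased.  "]

# Strip leading/trailing whitespace from every page.
stripped = [page.strip() for page in raw_pages]

# Drop pages that are empty after stripping.
pages = [page for page in stripped if page]

print(pages)
```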

In [4]:
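# From LangChain's text-splitting utilities, use the recursive character text splitter and the Sentence Transformers token text splitter.
# The character splitter recursively divides text on a list of separators: it looks for each separator in turn and splits the text into smaller pieces at those points.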
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [5]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(word_wrap(character_split_texts[10]))
print(f"\nTotal chunks: {len(character_split_texts)}")

increased, due in large part to significant global datacenter
expansions and the growth in Xbox sales and usage. Despite 
these
increases, we remain dedicated to achieving a net -zero future. We
recognize that progress won’t always be linear, 
and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time.  
On the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate 
over 1.3  million cubic meters of volumetric benefits in nine
water basins around the world. Progress toward our zero waste

commitment included diverting more than 15,200 metric tons of solid
waste otherwise headed to landfills and incinerators, 
as well as
launching new Circular Centers to increase reuse and reduce e -waste at
our datacenters.  
We contracted to protect over 17,000 acres of land
(50% more than the land we use to operate), thus achieving our

Total chunks: 347
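
To make the idea of recursive splitting concrete, here is a deliberately simplified sketch. It is not LangChain's implementation (which handles separators and merging differently); it only illustrates the principle of trying the coarsest separator first and falling back to finer ones when a piece is still too long. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters,
    trying the coarsest separator first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate              # keep merging small pieces
            continue
        if current:
            chunks.append(current)           # flush what has been merged so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks

text = ("First paragraph. It has two sentences.\n\n"
        "Second paragraph is a bit longer and needs splitting.")
chunks = recursive_split(text, ["\n\n", "\n", ". ", " ", ""], chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

In this example the first paragraph fits in one chunk, while the second has to be broken at spaces because no coarser separator occurs inside it.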


In [6]:
# The embedding model we use, a Sentence Transformers model, has a limited context window: at most 256 tokens.
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(word_wrap(token_split_texts[10]))
print(f"\nTotal chunks: {len(token_split_texts)}")



increased, due in large part to significant global datacenter
expansions and the growth in xbox sales and usage. despite these
increases, we remain dedicated to achieving a net - zero future. we
recognize that progress won ’ t always be linear, and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time. on the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate over 1. 3 million cubic meters of volumetric benefits in nine
water basins around the world. progress toward our zero waste
commitment included diverting more than 15, 200 metric tons of solid
waste otherwise headed to landfills and incinerators, as well as
launching new circular centers to increase reuse and reduce e - waste
at our datacenters. we contracted to protect over 17, 000 acres of land
( 50 % more than the land we use to operate ), thus achieving our

Total chunks: 349


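There is a small pitfall here. If you are not used to working with embeddings, you may not think about the embedding model's own context window,
but it matters a great deal: an embedding model typically has a fixed context window size, meaning it can only consider a limited number of tokens at a time.
This window limits how much text the model can "see" and therefore process.
If a text exceeds the window, the model may fail to capture the information beyond it, which can degrade the quality of the embedding and, in turn, the accuracy of retrieval and generation.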
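
The effect of a fixed context window can be illustrated with a toy example. The sketch below assumes a hypothetical whitespace tokenizer (real embedding models use subword tokenizers, so actual token counts differ) and a tiny window of 4 tokens instead of 256. `tokenize`, `truncate_to_window`, and `split_by_tokens` are illustrative names, not library functions.

```python
def tokenize(text):
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()

def truncate_to_window(text, max_tokens):
    """Keep only the tokens that fit in the window; the tail is lost."""
    return " ".join(tokenize(text)[:max_tokens])

def split_by_tokens(text, tokens_per_chunk):
    """Split text into consecutive chunks that each fit inside the window."""
    tokens = tokenize(text)
    return [
        " ".join(tokens[i:i + tokens_per_chunk])
        for i in range(0, len(tokens), tokens_per_chunk)
    ]

text = "progress toward our zero waste commitment included diverting solid waste"
window = 4  # tiny window for illustration; the notebook uses 256 tokens

# Embedding the raw text would only "see" the first 4 tokens.
print(truncate_to_window(text, window))

# Splitting first means every token ends up in some chunk that fits the window.
print(split_by_tokens(text, window))
```

This is why the notebook runs a token-based splitter after the character-based one: no chunk should exceed what the embedding model can actually read.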

In [7]:
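# Use a BERT-based Sentence Transformers model as the embedding model
# Sentence Transformers are excellent embedding models; they are built into Chroma, open source, and all weights are available online.
# The code below creates a Sentence Transformers embedding function that can be used with Chroma.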
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[10]]))

[[0.04256267845630646, 0.0332118459045887, 0.030340120196342468, -0.03486662358045578, 0.0684165358543396, -0.08090914040803909, -0.015474344603717327, -0.001450925599783659, -0.016744449734687805, 0.06770767271518707, -0.05054139718413353, -0.04919539764523506, 0.051399920135736465, 0.09192727506160736, -0.07177843153476715, 0.03951965644955635, -0.012833518907427788, -0.024947475641965866, -0.04622865840792656, -0.024357473477721214, 0.033949702978134155, 0.025502461940050125, 0.027317170053720474, -0.004126226995140314, -0.03633830323815346, 0.0036909349728375673, -0.027430439367890358, 0.004796764813363552, -0.02889624796807766, -0.018870700150728226, 0.03666626289486885, 0.025695810094475746, 0.031312838196754456, -0.06393440067768097, 0.05394405499100685, 0.08225345611572266, -0.04175683856010437, -0.006995807401835918, -0.023486042395234108, -0.03074793517589569, -0.0029792150016874075, -0.07790940999984741, 0.0093531534075737, 0.0031628983560949564, -0.022257069125771523, -0.01

In [8]:
# Next, set up Chroma
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

349

In [9]:
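# Now that everything is loaded into Chroma, let's connect an LLM and build a complete retrieval-augmented generation (RAG) system
# The following shows how the query, retrieval, and the LLM work together
query = "What was the total revenue?"

# Query Chroma and request 5 results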
results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(word_wrap(document))
    print('\n')

revenue, classified by significant product and service offerings, was
as follows : ( in millions ) year ended june 30, 2022 2021 2020 server
products and cloud services $ 67, 321 $ 52, 589 $ 41, 379 office
products and cloud services 44, 862 39, 872 35, 316 windows 24, 761 22,
488 21, 510 gaming 16, 230 15, 370 11, 575 linkedin 13, 816 10, 289 8,
077 search and news advertising 11, 591 9, 267 8, 524 enterprise
services 7, 407 6, 943 6, 409 devices 6, 991 6, 791 6, 457 other 5, 291
4, 479 3, 768 total $ 198, 270 $ 168, 088 $ 143, 015 we have recast
certain previously reported amounts in the table above to conform to
the way we internally manage and monitor our business.


74 note 13 — unearned revenue unearned revenue by segment was as
follows : ( in millions ) june 30, 2022 2021 productivity and business
processes $ 24, 558 $ 22, 120 intelligent cloud 19, 371 17, 710 more
personal computing 4, 479 4, 311 total $ 48, 408 $ 44, 141 changes in
unearned revenue were as follows : ( in milli
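
The indexing `results['documents'][0]` in the query cell is easy to misread. `query` accepts a list of query texts, and the returned fields are nested lists with one inner list per query text. The sketch below uses a hand-written dictionary to mimic that shape; the values are invented for illustration, and the exact set of keys may vary between Chroma versions.

```python
# Hand-written mock of a query result for ONE query text with n_results=2.
# The values are illustrative, not real output from the notebook.
results = {
    "ids": [["42", "7"]],
    "documents": [["chunk about revenue", "chunk about unearned revenue"]],
    "distances": [[0.31, 0.52]],
}

# Index 0 selects the results that belong to the first (and only) query text.
retrieved_ids = results["ids"][0]
retrieved_documents = results["documents"][0]
retrieved_distances = results["distances"][0]

for doc_id, distance, document in zip(retrieved_ids, retrieved_distances, retrieved_documents):
    print(f"id={doc_id} distance={distance:.2f} text={document}")
```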

In [10]:
# Next, use these results together with an LLM to answer the query
# We will use GPT, so we need an OpenAI client
import os
import openai
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

In [11]:
# Use GPT-3.5 Turbo for the steps below
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
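
When debugging a RAG pipeline it can help to look at the exact prompt before any API call is made. The sketch below separates message assembly from the request, mirroring how the `rag` function above joins the retrieved chunks. `build_messages` is a hypothetical helper, and the shortened system prompt is a placeholder rather than the one used in the notebook.

```python
def build_messages(query, retrieved_documents):
    """Assemble chat messages from a query and retrieved chunks, without calling any API."""
    # Chunks are separated by blank lines so the model can tell them apart.
    information = "\n\n".join(retrieved_documents)
    system_prompt = (
        "You are a helpful research assistant. "  # placeholder prompt for illustration
        "Answer the user's question using only the provided information."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"},
    ]

messages = build_messages("What was the total revenue?", ["chunk A", "chunk B"])
print(messages[1]["content"])
```

Printing the user message this way makes it easy to check that the right chunks were retrieved and that the prompt text is joined as intended.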

In [12]:
output = rag(query=query, retrieved_documents=retrieved_documents)

print(word_wrap(output))

The total revenue for the year ended June 30, 2022, was $198,270
million.
