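A best practice is to supply the vector database with high-quality question-answer pairs.

# Preparing the data

When implementing retrieval-augmented generation (RAG), it pays to focus on how high-quality question-answer pairs are obtained and used. RAG combines two steps, retrieval and generation, to improve the quality of the generated answers. Some best practices: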

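1. **High-quality question-answer pairs**: The quality of the pairs is critical. Good pairs supply more accurate and more relevant information, which helps the generation model produce better answers. They should cover a broad range of topics, and the answers should be accurate and informative.

2. **Question encoding and matching**: When the knowledge base is stored as question-answer pairs, the user's question is usually compared with the stored questions rather than directly with the answers, because a user question and a stored question are easier to match semantically. Once the best-matching question is found, its answer is passed to the generation model as supporting context.

3. **Vector database**: Use efficient vector search to store and retrieve the vector representations of the questions. This usually means a library such as FAISS to speed up similarity search. The vectors should capture semantic information so that the retrieval stage can find the most relevant question-answer pairs.

4. **Context encoding**: Taking the context of a question into account when encoding it can improve retrieval accuracy. That means encoding not only the question itself but also related context, such as the surrounding text or additional background information.

5. **Continual learning**: The question-answer base should be updated and extended over time to include new information and data. In addition, the retrieval and generation models can be fine-tuned through continual learning to maintain their performance.

6. **Multimodal data**: Where possible, enrich the question-answer pairs with multimodal data (text, images, tables, and so on). This provides more complete information and helps produce more accurate answers.

7. **User feedback**: Use user feedback to evaluate and improve the system. User satisfaction with the generated answers is an important metric for guiding iteration and optimization.

8. **Evaluation and testing**: Evaluate and test the system regularly to make sure its performance meets expectations. Standardized metrics and test sets help track progress.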

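In short, best practice for RAG covers obtaining and using high-quality question-answer pairs, effective question encoding and matching, continuous model optimization and updates, and regular evaluation together with user feedback. Together these help the system return high-quality, relevant answers.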
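
The question-matching idea in item 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not a production retriever: the bag-of-words `embed` function is only a stand-in for a real embedding model, and `qa_pairs`, `embed`, `cosine`, and `answer_for` are names made up for this example.

```python
import math
from collections import Counter

# Hypothetical question-answer pairs; in practice these come from your corpus.
qa_pairs = [
    ("How do I reset my password?", "Open Settings and choose Reset password."),
    ("How do I delete my account?", "Contact support to delete the account."),
    ("What payment methods are supported?", "Credit card and bank transfer."),
]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Index the stored questions, not the answers.
question_vectors = [embed(question) for question, _ in qa_pairs]

def answer_for(user_question: str) -> str:
    """Return the answer attached to the stored question closest to the user question."""
    query = embed(user_question)
    scores = [cosine(query, vector) for vector in question_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return qa_pairs[best][1]

print(answer_for("how can I reset the password"))
```

The retrieved answer would then be placed in the prompt of the generation model rather than shown to the user directly.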

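# Tuning strategy

1. Read the documents (HTML, PDF, DOCX, Excel, CSV, ...)
2. Clean the content blocks (remove page chrome outside the main content, generated output of sample code, and other noise; keep only the useful text)
3. Organize the content blocks
   - Split into chunks (by paragraph or line break, by fixed size in a loop, by token count, by chapter hierarchy, or a mix of these)
   - Translate content blocks (English to Chinese; add or replace paired names such as 通义千问 and Tongyi, and keep terms such as LLM from being mistranslated as "Master of Laws" (法学硕士))
   - Multimodal content blocks (image to text, table to text, code to text, attachment to text, link to text, ...)
   - Build hierarchical summaries (summarize each content block, then roll up and summarize again; the prompt can carry the matched text plus the summaries of every level along its path)
4. Build question-answer pairs and embed the questions so that answers can be retrieved
   - Extract question-answer pairs from the content blocks
   - Add questions manually, then retrieve to confirm the associated answer
   - Use an LLM to generate questions backwards from the answers
5. Improvement strategies
   - Hybrid retrieval
   - Rerank
   - Multi-level retrieval
   - Multi-topic retrieval
   - Intent routing (based on full-text search or vector similarity of the query, or on intent classification with BERT or an LLM)
6. Build a test set (hold out a share of the question-answer pairs, or pick a fixed set of them)
7. Evaluate the hit rate (measure the hit rate of the retrieval strategy from the retrieval results)
8. Go back to step **3**, adjust, and continue
9. User evaluation
   - Add user feedback to a dedicated test set
   - Evaluate the hit rate
   - Go back to step **3**, adjust, and continue
10. Feedback enhancement
    - Take notes and annotations on document content and retrieval results, and use them as supplementary content for the content blocks
    - Allow notes to be added in real time
    - Allow individual users to add them in real time while giving feedback (so that feedback takes effect immediately)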
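
Steps 6 and 7 can be sketched with a small hit-rate function. This is a minimal sketch under assumptions: the test set maps each question to the id of the chunk that should be retrieved, and `hit_rate_at_k`, `test_set`, `fake_results`, and `fake_retrieve` are hypothetical names, with canned results standing in for a real retriever.

```python
from typing import Callable, Dict, List

def hit_rate_at_k(
    test_set: Dict[str, str],
    retrieve: Callable[[str], List[str]],
    k: int = 3,
) -> float:
    """Fraction of test questions whose expected chunk id appears in the top-k results."""
    if not test_set:
        return 0.0
    hits = sum(
        1 for question, expected in test_set.items() if expected in retrieve(question)[:k]
    )
    return hits / len(test_set)

# Hypothetical held-out test set: question -> id of the chunk that should be retrieved.
test_set = {"q1": "chunk-a", "q2": "chunk-b", "q3": "chunk-c", "q4": "chunk-d"}

# Stand-in for a real retriever: canned ranked results per question.
fake_results = {
    "q1": ["chunk-a", "chunk-x", "chunk-y"],
    "q2": ["chunk-x", "chunk-b", "chunk-y"],
    "q3": ["chunk-x", "chunk-y", "chunk-z"],
    "q4": ["chunk-x", "chunk-y", "chunk-d"],
}

def fake_retrieve(question: str) -> List[str]:
    return fake_results.get(question, [])

print(hit_rate_at_k(test_set, fake_retrieve, k=3))  # 0.75
print(hit_rate_at_k(test_set, fake_retrieve, k=1))  # 0.25
```

Re-running the same function after each change to chunking or retrieval (step 8) makes the effect of the change directly comparable.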

# Useful tools

## Vector store-backed retriever

```python
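# Default: similarity search (`db` is a vector store, such as the Chroma store built later in this notebook)
retriever = db.as_retriever()
# Maximal marginal relevance: trades relevance against diversity
retriever = db.as_retriever(search_type="mmr")
# Only return documents whose relevance score reaches the threshold
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}
)
# Return only the single best match
retriever = db.as_retriever(search_kwargs={"k": 1})

retriever.get_relevant_documents("what did he say about ketanji brown jackson")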
```

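## MultiQueryRetriever: expand the question first, then run the vector query

- generate_queries: uses the internal LLM to generate a list of new queries
- get_relevant_documents: returns the list of relevant documents for a given query
- invoke: runs the query and merges the results of the expanded queries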
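
The expand-then-merge flow can be sketched in plain Python. This is an illustration of the idea, not LangChain's implementation: `multi_query_retrieve`, `fake_generate_queries`, `fake_retrieve`, and `index` are hypothetical names, the query generator is a stub where a real system would call an LLM, and searching with the original question as well is a choice made for this sketch.

```python
from typing import Callable, Dict, List

def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Retrieve for the question and each generated variant, then merge without duplicates."""
    merged: List[str] = []
    seen = set()
    for query in [question, *generate_queries(question)]:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def fake_generate_queries(question: str) -> List[str]:
    """Stub for the LLM call that rewrites the question from different angles."""
    return [f"{question} (rephrased)", f"background of: {question}"]

# Stand-in for a vector store: canned results per query string.
index: Dict[str, List[str]] = {
    "what is RAG": ["doc-1", "doc-2"],
    "what is RAG (rephrased)": ["doc-2", "doc-3"],
    "background of: what is RAG": ["doc-4"],
}

def fake_retrieve(query: str) -> List[str]:
    return index.get(query, [])

print(multi_query_retrieve("what is RAG", fake_generate_queries, fake_retrieve))
# ['doc-1', 'doc-2', 'doc-3', 'doc-4']
```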

## EnsembleRetriever: combining keyword and vector retrieval

```python
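from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # needs the rank_bm25 package
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Placeholder texts; replace them with your own documents
doc_list = ["document 1", "document 2"]

# Initialize the BM25 (keyword) retriever
bm25_retriever = BM25Retriever.from_texts(doc_list)

# Initialize the vector store; any embedding model can be used here
embedding = OpenAIEmbeddings()
faiss_vectorstore = FAISS.from_texts(doc_list, embedding)

# Turn the vector store into a retriever
faiss_retriever = faiss_vectorstore.as_retriever()

# Create the EnsembleRetriever with equal weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever],
    weights=[0.5, 0.5]
)

# Retrieve with the EnsembleRetriever
docs = ensemble_retriever.get_relevant_documents("query text")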
```
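
EnsembleRetriever is documented as merging the ranked lists with Reciprocal Rank Fusion. The sketch below shows a weighted form of that idea in plain Python; it is not LangChain's code, `weighted_rrf` and the sample rankings are hypothetical, and the constant `c = 60` is the conventional value, assumed here.

```python
from collections import defaultdict
from typing import Dict, List

def weighted_rrf(rankings: List[List[str]], weights: List[float], c: int = 60) -> List[str]:
    """Fuse several ranked lists: each document scores weight / (rank + c) per list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)

keyword_ranking = ["doc-a", "doc-b", "doc-c"]  # e.g. from BM25
vector_ranking = ["doc-b", "doc-d", "doc-a"]   # e.g. from the vector store

print(weighted_rrf([keyword_ranking, vector_ranking], weights=[0.5, 0.5]))
# ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

Documents that appear near the top of both lists end up first, which is why the combination tends to be more robust than either retriever alone.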

## LongContextReorder: reordering long result lists

Reference paper: [https://arxiv.org/abs/2307.03172](https://arxiv.org/abs/2307.03172)

<div class="alert alert-info">
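    <b>Paper abstract</b><br>
    <p>We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval.</p>
    <p>We find that performance can degrade significantly when the position of relevant information changes, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models.</p>
    <p>Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.</p>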
</div>

```python
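from langchain_community.document_transformers import LongContextReorder

# Fetch 50 results from the vector store, ordered by relevance
# (`db` is a vector store and `query` is the user question)
retriever = db.as_retriever(search_kwargs={"k": 50})
docs = retriever.get_relevant_documents(query)

# Create the Long-Context Reorder transformer (it takes no model argument)
reordering = LongContextReorder()

# Keep the 5 most relevant results, then reorder them so that the most
# relevant ones sit at the beginning and end of the context
final_docs = reordering.transform_documents(docs[:5])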
```
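
The reordering idea suggested by the paper can be sketched in a few lines of plain Python. This is an illustration written for this note rather than the library's source; `reorder_for_long_context` is a hypothetical name, and the input is assumed to be sorted from most to least relevant.

```python
from typing import List, Sequence

def reorder_for_long_context(docs: Sequence[str]) -> List[str]:
    """Place the strongest items at both ends and the weakest in the middle.

    `docs` must be sorted with the most relevant item first.
    """
    reordered: List[str] = []
    # Walk from least to most relevant, alternating between front and back,
    # so the most relevant items are placed last and end up at the edges.
    for i, doc in enumerate(reversed(docs)):
        if i % 2 == 1:
            reordered.append(doc)
        else:
            reordered.insert(0, doc)
    return reordered

ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(reorder_for_long_context(ranked))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

Because the function only changes the order, any trimming to a top-k list should happen before the reorder, not after it.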

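# Comparing several ways to optimize retrieval results

ContextualCompressionRetriever, LLMChainFilter, and LongContextReorder all post-process retrieval results. Their main similarities and differences are as follows.

## Similarities

- All aim to improve the quality of the retrieval results
- All usually run after the results are retrieved and before they are passed to the model

## Differences

**How they work**

- ContextualCompressionRetriever: compresses each document and extracts the relevant parts
- LLMChainFilter: drops irrelevant documents entirely
- LongContextReorder: changes the order of the relevant documents


**What they depend on**

- ContextualCompressionRetriever: needs a document compressor
- LLMChainFilter: needs an LLM chain to judge relevance
- LongContextReorder: needs the documents to arrive ranked by relevance


**What they focus on**

- ContextualCompressionRetriever: raising the density of relevant information
- LLMChainFilter: reducing irrelevant noise
- LongContextReorder: making the information easier for the model to access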

# Loading documents

## Directory

In [None]:
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader('./', glob="**/*.md")
docs = loader.load()

## Loading CSV

## Loading HTML

In [5]:
!poetry add parser


Could not find a matching version of package parser


In [14]:
!poetry add nest_asyncio

Using version ^1.6.0 for nest-asyncio

Updating dependencies
Resolving dependencies...

No dependencies to install or update

Writing lock file


In [16]:
# Only needed inside Jupyter
import nest_asyncio
nest_asyncio.apply()

In [37]:
from bs4 import BeautifulSoup, SoupStrainer
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from langchain_community.document_loaders.sitemap import SitemapLoader
from langchain_core.utils.html import PREFIXES_TO_IGNORE_REGEX, SUFFIXES_TO_IGNORE_REGEX
# Assumption: `parser` is a local module (parser.py) next to this notebook, not a PyPI package
from parser import langchain_docs_extractor
import re

In [38]:
# Extract the LangChain docs
def metadata_extractor(meta: dict, soup: BeautifulSoup) -> dict:
    title = soup.find("title")
    description = soup.find("meta", attrs={"name": "description"})
    html = soup.find("html")
    return {
        "source": meta["loc"],
        "title": title.get_text() if title else "",
        "description": description.get("content", "") if description else "",
        "language": html.get("lang", "") if html else "",
        **meta,
    }

def load_langchain_docs():
    return SitemapLoader(
        "https://python.langchain.com/sitemap.xml",
        filter_urls=["https://python.langchain.com/"],
        parsing_function=langchain_docs_extractor,
        default_parser="lxml",
        bs_kwargs={
            "parse_only": SoupStrainer(
                name=("article", "title", "html", "lang", "content")
            ),
        },
        meta_function=metadata_extractor,
    ).load()

In [39]:
# Extract the LangChain API reference
def simple_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()

def load_api_docs():
    return RecursiveUrlLoader(
        url="https://api.python.langchain.com/en/latest/",
        max_depth=8,
        extractor=simple_extractor,
        prevent_outside=True,
        use_async=True,
        timeout=600,
        # Drop trailing / to avoid duplicate pages.
        link_regex=(
            f"href=[\"']{PREFIXES_TO_IGNORE_REGEX}((?:{SUFFIXES_TO_IGNORE_REGEX}.)*?)"
            r"(?:[\#'\"]|\/[\#'\"])"
        ),
        check_response_status=True,
        exclude_dirs=(
            "https://api.python.langchain.com/en/latest/_sources",
            "https://api.python.langchain.com/en/latest/_modules",
        ),
    ).load()

In [40]:
# The LangSmith docs
def load_langsmith_docs():
    return RecursiveUrlLoader(
        url="https://docs.smith.langchain.com/",
        max_depth=8,
        extractor=simple_extractor,
        prevent_outside=True,
        use_async=True,
        timeout=600,
        # Drop trailing / to avoid duplicate pages.
        link_regex=(
            f"href=[\"']{PREFIXES_TO_IGNORE_REGEX}((?:{SUFFIXES_TO_IGNORE_REGEX}.)*?)"
            r"(?:[\#'\"]|\/[\#'\"])"
        ),
        check_response_status=True,
    ).load()

In [None]:
# load_langchain_docs() is a regular function that returns a list, so no await is needed
langchain_docs = load_langchain_docs()

In [None]:
# load_api_docs() is a regular function that returns a list, so no await is needed
langchain_api = load_api_docs()

In [None]:
langchain_api

## Loading JSON

## Loading Markdown

In [152]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./index.md")
loader.load()

[Document(page_content='ok', metadata={'source': './index.md'})]

## Loading PDF

# Text splitting

## HTMLHeaderTextSplitter

In [2]:
from langchain.text_splitter import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(page_content='Foo'),
 Document(page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),
 Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),
 Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),
 Document(page_content='Baz', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),
 Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

url = "http://www.hongmeng-info.com/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# for local file use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits[:10]

[Document(page_content='Toggle navigation  \n首页 互联网应用 信息化服务 电子税务 招聘 联系我们  \nPrevious Next  \n稳健、高效 人性化的电子竞价系统, 千亿级股权交易平台实践检验'),
 Document(page_content='鸿蒙在线竞价系统，微信，APP，智能终端，多媒体控制.  \n了解更多>>', metadata={'Header 2': '稳健、高效 人性化的电子竞价系统, 千亿级股权交易平台实践检验'}),
 Document(page_content='会员平台及CRM管理'),
 Document(page_content='社群运营的基础架构系统，支持复杂权益，积分管理，“会员卡”系统，权益/积分商城应用，社群用户关系管理，多种智能行为数据模型，面向客户群/社群运营者提供有效的解决方案.  \n详情 »', metadata={'Header 2': '会员平台及CRM管理'}),
 Document(page_content='互联网运营平台'),
 Document(page_content='核心组件系统支撑O2O类运营体系，订单系统，合作商/供应商/渠道商管理及结算系统，客服系统，营销支撑与分析，活动及传播系统，为运营提供有效灵活的支撑.  \n详情 »', metadata={'Header 2': '互联网运营平台'}),
 Document(page_content='电子商城'),
 Document(page_content='服务/产品的线上交易平台。根据你的需要，实现你想要的电子商城系统。模块化组合实现满足不同运营方多层次的系统需要.  \n详情 »', metadata={'Header 2': '电子商城'}),
 Document(page_content='数据技术&服务'),
 Document(page_content='清洗，分析，挖掘，分析。经验和自有的工具务实有效的解决深层次运营问题。我们擅长解决各种类型的数据接口.  \n详情 »', metadata={'Header 2': '数据技术&服务'})]

# Embeddings

In [5]:
from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings()

In [6]:
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
len(embeddings), len(embeddings[0])

(5, 1536)

In [7]:
# The query text means "What is the name mentioned in the conversation?"
embedded_query = embeddings_model.embed_query("对话中提及的名字是什么?")
embedded_query[:5]

[-0.0037282993122298882,
 -0.01327033122820681,
 0.03234768953778434,
 0.0035204264887496776,
 -0.017729538998712827]

**CacheBackedEmbeddings**: adds caching, so that texts embedded before are not embedded again
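
The caching idea can be sketched without any library. This is a minimal illustration of the concept and not the CacheBackedEmbeddings class itself: `CachedEmbedder` and `fake_embed` are hypothetical names, the cache is an in-memory dict keyed by a hash of the text, and the fake embedding function only stands in for a real (and usually paid) embedding call.

```python
import hashlib
from typing import Callable, Dict, List

class CachedEmbedder:
    """Wraps an embedding function and reuses vectors for texts seen before."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_fn = embed_fn
        self._cache: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Hash the text so that long documents still give short cache keys
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Only send texts to the underlying model that are not cached yet
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self._cache]
        if missing:
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._cache[self._key(text)] = vector
        return [self._cache[self._key(t)] for t in texts]

calls: List[List[str]] = []

def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in for a real embedding model; records every batch it receives."""
    calls.append(list(texts))
    return [[float(len(t)), float(t.count(" "))] for t in texts]

embedder = CachedEmbedder(fake_embed)
first = embedder.embed_documents(["Hi there!", "Hello World!"])
second = embedder.embed_documents(["Hello World!", "Oh, hello!"])
print(calls)  # [['Hi there!', 'Hello World!'], ['Oh, hello!']]
```

The second call only embeds the one text that was not seen before, which is where the saving in time and API cost comes from.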

# Vector stores

## Preparing the data

In [17]:
!poetry add importlib

Using version ^1.0.4 for importlib

Updating dependencies
Resolving dependencies...

Package operations: 1 install, 0 updates, 0 removals

  • Installing importlib (1.0.4): Downloading... 0%

In [30]:
import os
import importlib.util
spec = importlib.util.find_spec('langchain')
langchain_files_path = os.path.join(os.path.dirname(spec.origin), "docs/docs/modules")
print(langchain_files_path)

/Users/xuehongwei/Library/Caches/pypoetry/virtualenvs/md-8WLN4Vov-py3.10/lib/python3.10/site-packages/langchain/docs/docs/modules


## Chroma

In [8]:
!poetry add chromadb

Using version ^0.4.22 for chromadb

Updating dependencies
Resolving dependencies...

Package operations: 41 installs, 1 update, 0 removals

  • Installing zipp (3.17.0)
  • Installing importlib-metadata (6.11.0): Pending...

In [37]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
file = 'state_of_the_union.txt'
print(file)
raw_documents = TextLoader(file).load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = Chroma.from_documents(documents, OpenAIEmbeddings())

state_of_the_union.txt


Similarity search with a **string argument**

In [39]:
# The query means "What the president said about Ketanji Brown Jackson"
query = "总统关于 Ketanji Brown Jackson 的发言"
docs = db.similarity_search(query)
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


Similarity search with a **vector argument** (this allows further optimization, for example reducing the time spent on embedding)

In [40]:
embedding_vector = OpenAIEmbeddings().embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
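
Conceptually, a search by vector ranks the stored vectors by their similarity to the query vector. The brute-force sketch below shows that idea with cosine similarity; real stores such as Chroma or FAISS use optimized indexes instead, and `cosine_similarity`, `search_by_vector`, and the tiny three-dimensional `store` are hypothetical names and data made up for this example.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_by_vector(
    query_vector: Sequence[float],
    store: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k stored texts whose vectors are most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vector, vector)) for text, vector in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical store of (text, embedding) pairs
store = [
    ("chunk about courts", [1.0, 0.0, 0.0]),
    ("chunk about energy", [0.0, 1.0, 0.0]),
    ("chunk about judges", [0.8, 0.0, 0.6]),
]

# The query embedding is computed once and can be reused across several searches
query_vector = [1.0, 0.0, 0.0]
results = search_by_vector(query_vector, store, k=2)
print([text for text, _ in results])  # ['chunk about courts', 'chunk about judges']
```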


## FAISS

In [72]:
!poetry add faiss-cpu

Using version ^1.7.4 for faiss-cpu

Updating dependencies
Resolving dependencies...

Package operations: 1 install, 0 updates, 0 removals

  • Installing faiss-cpu (1.7.4): Downloading... 30%

## LanceDB

In [None]:
!poetry add lancedb

In [44]:
import lancedb
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import LanceDB
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

db = lancedb.connect("/tmp/lancedb")

table = db.create_table(
    "my_table",
    data=[
        {
            "vector": embeddings.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = LanceDB.from_documents(documents, OpenAIEmbeddings(), connection=table)

[2024-01-24T08:00:55Z WARN  lance::dataset] No existing dataset at /tmp/lancedb/my_table.lance, it will be created


Similarity search with a **string**

In [48]:
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


Wrapping the store as a retriever for **vector** similarity search

In [50]:
retriever = db.as_retriever()

# Vector retrieval: retriever

## RAG

In [52]:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI()


def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

# The question means "What technology-related topics did the president's speech mention?"
chain.invoke("总统发言提到了什么技术方面的内容?")


'总统发言提到了新兴技术和美国制造业方面的内容。'

## Custom retriever

In [53]:
from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from typing import List

class CustomRetriever(BaseRetriever):    
    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        return [Document(page_content=query)]

retriever = CustomRetriever()

retriever.get_relevant_documents("bar")

[Document(page_content='bar')]