## Vector stores and retrievers

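This tutorial will familiarize you with LangChain's vector store and retriever abstractions. These abstractions are designed to support retrieval of data from (vector) databases and other sources for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in retrieval-augmented generation, or RAG (see the RAG tutorial).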

In [13]:
import getpass
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

## [Documents](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html)



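LangChain implements a `Document` abstraction, which is intended to represent a unit of text and its associated metadata. It has two attributes:

- `page_content`: a string representing the content;
- `metadata`: a dict containing arbitrary metadata.

The `metadata` attribute can capture information about the source of the document, its relationship to other documents, and other details. Note that an individual `Document` object often represents a chunk of a larger document.

Let's generate some sample documents: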

In [14]:
from langchain_core.documents import Document

docs = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"source": "fish-pets-doc"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"source": "bird-pets-doc"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

## Vector stores

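Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use a vector similarity metric to identify related data in the store.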
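To make the idea concrete, here is a minimal plain-Python sketch of similarity search. It is not LangChain code: the three-dimensional vectors, `toy_store`, and `toy_search` are hypothetical stand-ins for what an embedding model and a vector store would provide.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings; a real model produces far more dimensions.
toy_store = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.7, 0.3, 0.1],
    "goldfish": [0.1, 0.9, 0.2],
}


def toy_search(query_vector, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        toy_store,
        key=lambda text: cosine_similarity(query_vector, toy_store[text]),
        reverse=True,
    )
    return ranked[:k]


top = toy_search([1.0, 0.0, 0.0])
print(top)
```

A real vector store does the same ranking, but typically with an index structure that avoids comparing the query against every stored vector.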

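LangChain `VectorStore` objects contain methods for adding text and `Document` objects to the store, and for querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.

LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third party; others can run in memory for lightweight workloads. Here we demonstrate LangChain VectorStores using FAISS, which runs in memory.

To instantiate a vector store, we usually need to provide an embedding model that specifies how text should be converted into a numeric vector. Hosted models such as [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html) are one option; here we use `HuggingFaceEmbeddings` with a local copy of the `m3e-base` model.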

In [15]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Prerequisite for the FAISS vector store (CPU build): pip install faiss-cpu

EMBEDDING_DEVICE = "cpu"
# Raw string so the backslashes in the Windows path are not read as escapes.
embeddings = HuggingFaceEmbeddings(
    model_name=r"..\models\m3e-base",
    model_kwargs={"device": EMBEDDING_DEVICE},
)
print(embeddings)

# Step 1: split the loaded documents into chunks.
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(documents=docs)
print(documents)

# Step 2: embed each chunk and build the FAISS index from the vectors.
vector = FAISS.from_documents(documents=documents, embedding=embeddings)
print(vector)

client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
) model_name='..\\models\\m3e-base' cache_folder=None model_kwargs={'device': 'cpu'} encode_kwargs={} multi_process=False show_progress=False
[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'), Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'), Document(metadata={'source': 'fish-pets-doc'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.'), Document(metadata={'source': 'b

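Calling `.from_documents` here adds the documents to the vector store. `VectorStore` implements methods for adding documents that can also be called after the object is instantiated. Most implementations let you connect to an existing vector store, e.g., by providing a client, index name, or other information. See the documentation for a specific integration for more detail.

Once we have instantiated a `VectorStore` that contains documents, we can query it. `VectorStore` includes methods for querying: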

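- Synchronously and asynchronously;
- By string query and by vector;
- With and without returning similarity scores;
- By similarity and by maximum marginal relevance (to balance similarity to the query with diversity in the retrieved results).

The methods will generally include a list of `Document` objects in their outputs.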
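Maximum marginal relevance is the least obvious item in the list above, so here is a minimal plain-Python sketch of the greedy selection it describes. This is an illustration under stated assumptions, not LangChain's implementation: the `mmr` function, its `lambda_mult` weighting, and the two-dimensional `toy_vectors` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k names, trading relevance to the query against redundancy."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:

        def score(name):
            relevance = cosine(query, remaining[name])
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(remaining[name], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical embeddings: cat_a and cat_b are near-duplicates of each other.
toy_vectors = {
    "cat_a": [1.0, 0.0],
    "cat_b": [0.99, 0.05],
    "rabbit": [0.6, 0.8],
}
query = [1.0, 0.1]

by_similarity = sorted(toy_vectors, key=lambda n: cosine(query, toy_vectors[n]), reverse=True)[:2]
picked = mmr(query, toy_vectors, k=2)
print(by_similarity, picked)
```

Plain similarity returns both near-duplicate cat entries, while the MMR pass keeps one of them and then prefers the less redundant rabbit entry.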

## Examples

Return documents based on similarity to a string query:

In [16]:
vector.similarity_search("cat")

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
 Document(metadata={'source': 'fish-pets-doc'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social animals that need plenty of space to hop around.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.')]

Async query:

In [17]:
await vector.asimilarity_search("cat")

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
 Document(metadata={'source': 'fish-pets-doc'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social animals that need plenty of space to hop around.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.')]

Return scores:

In [18]:
# Note that providers implement different scores; FAISS here
# returns a distance metric that should vary inversely with
# similarity.

vector.similarity_search_with_score("cat")

[(Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
  153.70816),
 (Document(metadata={'source': 'fish-pets-doc'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.'),
  217.72157),
 (Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social animals that need plenty of space to hop around.'),
  246.1747),
 (Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'),
  251.75621)]
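Because these scores are distances, the best match has the smallest value and the list is sorted in ascending order. A minimal sketch of that ordering, assuming a squared Euclidean (L2) distance, which is what a flat L2 FAISS index is generally documented to return; the two-dimensional vectors are hypothetical:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical embeddings and query vector.
toy_vectors = {"cats": [0.9, 0.1], "goldfish": [0.1, 0.9], "dogs": [0.6, 0.4]}
query = [1.0, 0.0]

# Sort ascending: a smaller distance means a closer (more similar) document.
scored = sorted(
    ((name, squared_l2(query, vec)) for name, vec in toy_vectors.items()),
    key=lambda pair: pair[1],
)
print(scored)
```

This differs from a similarity score such as cosine similarity, where higher is better, so it is worth checking which convention a given store uses before applying a cutoff.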

Return documents based on similarity to an embedded query:

In [19]:
# Embed the query with the same model that built the index; a different
# model would place the query in an unrelated vector space.
embedding = embeddings.embed_query("cat")

vector.similarity_search_by_vector(embedding)

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
 Document(metadata={'source': 'fish-pets-doc'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social animals that need plenty of space to hop around.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.')]

## Retrievers

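LangChain `VectorStore` objects do not subclass Runnable, and so cannot immediately be integrated into LangChain Expression Language (LCEL) chains.

LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous `invoke` and `batch` operations) and are designed to be incorporated into LCEL chains.

We can create a simple version of this ourselves, without subclassing `Retriever`. Once we choose which method we want to use to retrieve documents, we can easily create a runnable. Below we build one around the `similarity_search` method using [RunnableLambda](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.base.RunnableLambda.html):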

In [20]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

retriever = RunnableLambda(vector.similarity_search).bind(k=1)  # select top result

retriever.batch(["cat", "shark"])

[[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.')],
 [Document(metadata={'source': 'fish-pets-doc'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.')]]
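To show what wrapping a function this way provides, here is a toy plain-Python stand-in with `invoke`, `batch`, and `bind`. It is a sketch only and not how LangChain implements `RunnableLambda`; `ToyRunnable`, `TOY_INDEX`, and `toy_search` are hypothetical names.

```python
class ToyRunnable:
    """Toy stand-in for a runnable wrapper; not LangChain's implementation."""

    def __init__(self, func, **bound_kwargs):
        self.func = func
        self.bound_kwargs = bound_kwargs

    def bind(self, **kwargs):
        # Return a new wrapper with extra keyword arguments fixed in advance.
        return ToyRunnable(self.func, **{**self.bound_kwargs, **kwargs})

    def invoke(self, value):
        # Call the wrapped function on a single input.
        return self.func(value, **self.bound_kwargs)

    def batch(self, values):
        # Call the wrapped function once per input, preserving order.
        return [self.invoke(value) for value in values]


# Hypothetical canned rankings; a real store would embed the query and search.
TOY_INDEX = {
    "cat": ["cats doc", "rabbits doc"],
    "shark": ["goldfish doc", "cats doc"],
}


def toy_search(query, k=4):
    """Return the first k canned results for a query."""
    return TOY_INDEX.get(query, [])[:k]


toy_retriever = ToyRunnable(toy_search).bind(k=1)
results = toy_retriever.batch(["cat", "shark"])
print(results)
```

The point of the shared interface is that any component exposing these methods can be composed with prompts and models in a chain without special-casing.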

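A vector store can also generate a `VectorStoreRetriever` through its `as_retriever` method. The search types supported by `VectorStoreRetriever` include "similarity" (the default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to threshold the documents output by the retriever by similarity score.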
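The thresholding idea can be sketched in plain Python. This is an illustration, not the retriever's implementation: `threshold_retrieve` and the scores in `toy_scored` are hypothetical, and the scores are assumed to be relevance values in [0, 1] where higher means more relevant.

```python
def threshold_retrieve(scored_docs, score_threshold, k=4):
    """Keep at most k (text, score) pairs whose score meets the threshold."""
    kept = [(text, score) for text, score in scored_docs if score >= score_threshold]
    # Highest relevance first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Hypothetical relevance scores for the query "cat".
toy_scored = [("cats", 0.82), ("goldfish", 0.41), ("rabbits", 0.35), ("dogs", 0.33)]

hits = threshold_retrieve(toy_scored, score_threshold=0.4)
print(hits)
```

Unlike a fixed top-k search, a threshold can return fewer results than requested, or none at all, when nothing in the store is close enough to the query.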

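Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. Below we show a minimal example.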

In [24]:

# sparkllm
# Prompt for the credentials instead of hardcoding them in the notebook.
os.environ["IFLYTEK_SPARK_APP_ID"] = getpass.getpass("Spark app ID: ")
os.environ["IFLYTEK_SPARK_API_KEY"] = getpass.getpass("Spark API key: ")
os.environ["IFLYTEK_SPARK_API_SECRET"] = getpass.getpass("Spark API secret: ")
# Reference: https://www.xfyun.cn/doc/spark/Web.html
os.environ["IFLYTEK_SPARK_API_URL"] = "wss://spark-api.xf-yun.com/v3.1/chat"
os.environ["IFLYTEK_SPARK_llm_DOMAIN"] = "generalv3"

from langchain_community.chat_models import ChatSparkLLM

llm = ChatSparkLLM()

In [25]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

In [26]:
response = rag_chain.invoke("tell me about cats")

print(response.content)

Cats are independent pets that often enjoy their own space.
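To see what the chain hands to the model, the following plain-Python sketch mirrors the first two steps: the same input is sent to both the retriever and the passthrough, and the results fill the template. `toy_retriever` and `build_prompt` are hypothetical stand-ins, and the retrieved context is reduced to a list of strings.

```python
TEMPLATE = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""


def toy_retriever(question):
    """Hypothetical stand-in for the retriever: returns a list of context strings."""
    return ["Cats are independent pets that often enjoy their own space."]


def build_prompt(question):
    # Mirror the dict step: one input feeds both the retriever and the passthrough.
    inputs = {"context": toy_retriever(question), "question": question}
    return TEMPLATE.format(**inputs)


prompt_text = build_prompt("tell me about cats")
print(prompt_text)
```

Because the template instructs the model to answer from the context only, the quality of the answer depends directly on what the retriever returns.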
