<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/low_level/vector_store.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


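# Building a (Very Simple) Vector Store from Scratch

In this tutorial, we show you how to build a simple in-memory vector store that can store documents along with their metadata. It also exposes a query interface that supports several kinds of queries:
- semantic search (using embedding similarity)
- metadata filtering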

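**NOTE**: Obviously, this is not meant to replace any actual vector store (e.g. Pinecone, Weaviate, Chroma, Qdrant, Milvus, or the others in our wide range of vector store integrations). It is meant to teach a few key retrieval concepts, such as top-k embedding search plus metadata filtering.

We will not cover advanced query/retrieval concepts such as approximate nearest neighbors or sparse/hybrid search, nor any of the systems concepts required to build an actual database.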


## Setup

We load some documents and parse them into Node objects - chunks that are ready to be inserted into a vector store.


#### Load Documents


In [None]:
%pip install llama-index-readers-file pymupdf
%pip install llama-index-embeddings-openai

In [None]:
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

In [None]:
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader

In [None]:
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")

### Parse into Nodes


In [None]:
from llama_index.core.node_parser import SentenceSplitter

node_parser = SentenceSplitter(chunk_size=256)
nodes = node_parser.get_nodes_from_documents(documents)

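#### Generate Embeddings for Each Node

Here we generate an embedding for each node with the OpenAI embedding model. The text that gets embedded is the node's content together with its metadata, and the resulting vector is stored on the node as `node.embedding` so that the vector store can use it for similarity search later.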


In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

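## Build a Simple In-Memory Vector Store

Now we build our in-memory vector store. We store nodes in a simple Python dictionary. We start by implementing embedding search, then add metadata filters.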


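### 1. Defining the Interface

We first define the interface for the vector store. It contains the following methods:

- `get`
- `add`
- `delete`
- `query`
- `persist` (which we will not implement)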


In [None]:
from llama_index.core.vector_stores.types import BasePydanticVectorStore
from llama_index.core.vector_stores import (
    VectorStoreQuery,
    VectorStoreQueryResult,
)
from typing import List, Any, Optional, Dict
from llama_index.core.schema import TextNode, BaseNode
import os

class BaseVectorStore(BasePydanticVectorStore):
    """Simple custom vector store.

    Stores documents in a simple in-memory dict.

    """

    stores_text: bool = True

    def get(self, text_id: str) -> List[float]:
        """Get embedding."""
        pass

    def add(
        self,
        nodes: List[BaseNode],
    ) -> List[str]:
        """Add nodes to index."""
        pass

    def delete(self, ref_doc_id: str, **delete_kwargs: Any) -> None:
        """
        Delete nodes using ref_doc_id.

        Args:
            ref_doc_id (str): The doc_id of the document to delete.

        """
        pass

    def query(
        self,
        query: VectorStoreQuery,
        **kwargs: Any,
    ) -> VectorStoreQueryResult:
        """Get nodes for response."""
        pass

    def persist(self, persist_path, fs=None) -> None:
        """Persist the SimpleVectorStore to a directory.

        NOTE: we are not implementing this for now.

        """
        pass

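At a high level, we subclass our base `VectorStore` abstraction. There is no inherent reason to do this if you are just building a vector store from scratch. We do it because it makes the store easy to plug into our downstream abstractions later on.

Let's look at some of the classes used here.
- `BaseNode` is simply the parent class of our core Node modules. Each node represents a text chunk plus its associated metadata.
- We also use some lower-level constructs, such as `VectorStoreQuery` and `VectorStoreQueryResult`. These are lightweight dataclass containers that represent queries and results. The dataclass fields are shown below.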


In [None]:
from dataclasses import fields

{f.name: f.type for f in fields(VectorStoreQuery)}

{'query_embedding': typing.Optional[typing.List[float]],
 'similarity_top_k': int,
 'doc_ids': typing.Optional[typing.List[str]],
 'node_ids': typing.Optional[typing.List[str]],
 'query_str': typing.Optional[str],
 'output_fields': typing.Optional[typing.List[str]],
 'embedding_field': typing.Optional[str],
 'mode': <enum 'VectorStoreQueryMode'>,
 'alpha': typing.Optional[float],
 'filters': typing.Optional[llama_index.vector_stores.types.MetadataFilters],
 'mmr_threshold': typing.Optional[float],
 'sparse_top_k': typing.Optional[int]}

In [None]:
{f.name: f.type for f in fields(VectorStoreQueryResult)}

{'nodes': typing.Optional[typing.Sequence[llama_index.schema.BaseNode]],
 'similarities': typing.Optional[typing.List[float]],
 'ids': typing.Optional[typing.List[str]]}
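
To make the "lightweight dataclass container" idea concrete, here is a minimal sketch built only with the standard library. `ToyQuery` and `ToyQueryResult` are hypothetical names used for illustration. They mirror a few of the fields printed above and are not the LlamaIndex classes.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ToyQuery:
    """Hypothetical stand-in mirroring two VectorStoreQuery fields."""

    query_embedding: Optional[List[float]] = None
    similarity_top_k: int = 1


@dataclass
class ToyQueryResult:
    """Hypothetical stand-in mirroring two VectorStoreQueryResult fields."""

    similarities: Optional[List[float]] = None
    ids: Optional[List[str]] = None


# A query is just a bag of parameters; the store decides how to use them.
toy_query = ToyQuery(query_embedding=[0.1, 0.2], similarity_top_k=2)

# The same introspection used above works on any dataclass.
toy_field_names = [f.name for f in fields(ToyQuery)]
print(toy_field_names)
```

Because the containers carry no behavior, the same query object can be handed to any store implementation that understands its fields.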

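### 2. Defining `add`, `get`, and `delete`

We add some basic capabilities for adding elements to, getting elements from, and deleting elements from the vector store.

The implementation is very simple (everything is just stored in a Python dictionary).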


In [None]:

from llama_index.core.bridge.pydantic import Field


class VectorStore2(BaseVectorStore):
    """VectorStore2 (add/get/delete implemented)."""

    stores_text: bool = True
    node_dict: Dict[str, BaseNode] = Field(default_factory=dict)

    def get(self, text_id: str) -> BaseNode:
        """Get the stored node for the given id."""
        return self.node_dict[text_id]

    def add(
        self,
        nodes: List[BaseNode],
    ) -> List[str]:
        """Add nodes to index and return their ids."""
        for node in nodes:
            self.node_dict[node.node_id] = node
        return [node.node_id for node in nodes]

    def delete(self, node_id: str, **delete_kwargs: Any) -> None:
        """
        Delete a node using node_id.

        Args:
            node_id: str

        """
        del self.node_dict[node_id]

We run some basic tests just to show that it works.


In [None]:
test_node = TextNode(id_="id1", text="hello world")
test_node2 = TextNode(id_="id2", text="foo bar")
test_nodes = [test_node, test_node2]

In [None]:
vector_store = VectorStore2()

In [None]:
vector_store.add(test_nodes)

In [None]:
node = vector_store.get("id1")
print(str(node))

Node ID: id1
Text: hello world
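
The cells above exercise `add` and `get` but not `delete`. The sketch below walks through the full add/delete round trip on a plain dictionary. It is a hypothetical stand-in (`toy_store`, `toy_add`, and `toy_delete` are illustrative names, and strings play the role of nodes), not the `VectorStore2` class itself. One assumption worth flagging: it uses `dict.pop` with a default so that deleting a missing id is silently ignored, whereas `del` on a dictionary raises `KeyError` in that case.

```python
from typing import Dict, List

# Hypothetical stand-in store: plain strings play the role of nodes.
toy_store: Dict[str, str] = {}


def toy_add(items: Dict[str, str]) -> List[str]:
    """Insert items keyed by id and return the ids that were added."""
    toy_store.update(items)
    return list(items.keys())


def toy_delete(item_id: str) -> None:
    """Remove an item; a missing id is ignored instead of raising KeyError."""
    toy_store.pop(item_id, None)


added_ids = toy_add({"id1": "hello world", "id2": "foo bar"})
toy_delete("id1")
toy_delete("does-not-exist")  # no error, nothing to remove
remaining_ids = sorted(toy_store)
print(added_ids, remaining_ids)
```

Whether a missing id should raise or be ignored is a design choice. Raising surfaces bugs early, while ignoring makes deletes idempotent.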


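### 3.a Defining `query` (Semantic Search)

We implement a basic version of top-k semantic search. It simply iterates over all document embeddings and computes the cosine similarity of each with the query embedding. The k documents with the highest cosine similarity are returned.

Cosine similarity: $\dfrac{\vec{d} \cdot \vec{q}}{|\vec{d}||\vec{q}|}$ for every document/query embedding pair $\vec{d}$, $\vec{q}$.

**NOTE**: The top-k value is contained in the `VectorStoreQuery` container.

**NOTE**: As above, we define another subclass so that we do not have to reimplement the functions above (not because this is good coding practice).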
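
As a small worked example of the cosine similarity formula, the sketch below evaluates it on three hand-picked 2-D vector pairs using only the standard library. The helper name `cosine_similarity` is an illustrative choice and is not part of LlamaIndex.

```python
import math
from typing import List


def cosine_similarity(d: List[float], q: List[float]) -> float:
    """Dot product of d and q divided by the product of their norms."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q)


# Same direction, different length: similarity is (approximately) 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite directions: similarity is -1.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Note that the score depends only on direction, not on vector length, which is why the first pair scores as a near-perfect match even though the vectors differ in magnitude.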


In [None]:
from typing import Tuple
import numpy as np


def get_top_k_embeddings(
    query_embedding: List[float],
    doc_embeddings: List[List[float]],
    doc_ids: List[str],
    similarity_top_k: int = 5,
) -> Tuple[List[float], List]:
    """Get top nodes by similarity to the query."""
    # dimensions: D
    qembed_np = np.array(query_embedding)
    # dimensions: N x D
    dembed_np = np.array(doc_embeddings)
    # dimensions: N
    dproduct_arr = np.dot(dembed_np, qembed_np)
    # dimensions: N
    norm_arr = np.linalg.norm(qembed_np) * np.linalg.norm(
        dembed_np, axis=1, keepdims=False
    )
    # dimensions: N
    cos_sim_arr = dproduct_arr / norm_arr

    # now we have the N cosine similarities for each document
    # sort by top k cosine similarity, and return ids
    tups = [(cos_sim_arr[i], doc_ids[i]) for i in range(len(doc_ids))]
    sorted_tups = sorted(tups, key=lambda t: t[0], reverse=True)

    sorted_tups = sorted_tups[:similarity_top_k]

    result_similarities = [s for s, _ in sorted_tups]
    result_ids = [n for _, n in sorted_tups]
    return result_similarities, result_ids
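
`get_top_k_embeddings` sorts all N scores before truncating to k. As a possible alternative for the selection step only, the sketch below uses `heapq.nlargest` from the standard library, which avoids a full sort when k is much smaller than N. The function name `top_k_by_score` and the toy scores are assumptions made for illustration.

```python
import heapq
from typing import List, Tuple


def top_k_by_score(
    scores: List[float], ids: List[str], k: int
) -> Tuple[List[float], List[str]]:
    """Return the k highest scores and their ids, best first."""
    pairs = heapq.nlargest(k, zip(scores, ids), key=lambda t: t[0])
    return [s for s, _ in pairs], [i for _, i in pairs]


# Toy similarity scores standing in for cosine similarities.
top_scores, top_ids = top_k_by_score(
    [0.2, 0.9, 0.5, 0.7], ["a", "b", "c", "d"], k=2
)
print(top_scores, top_ids)
```

For the small in-memory store in this tutorial the difference is negligible, so the full sort is a perfectly reasonable choice.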

In [None]:

from typing import cast


class VectorStore3A(VectorStore2):
    """Implements semantic/dense search."""

    def query(
        self,
        query: VectorStoreQuery,
        **kwargs: Any,
    ) -> VectorStoreQueryResult:
        """Get nodes for response."""

        query_embedding = cast(List[float], query.query_embedding)
        doc_embeddings = [n.embedding for n in self.node_dict.values()]
        doc_ids = [n.node_id for n in self.node_dict.values()]

        similarities, node_ids = get_top_k_embeddings(
            query_embedding,
            doc_embeddings,
            doc_ids,
            similarity_top_k=query.similarity_top_k,
        )
        result_nodes = [self.node_dict[node_id] for node_id in node_ids]

        return VectorStoreQueryResult(
            nodes=result_nodes, similarities=similarities, ids=node_ids
        )

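### 3.b. Supporting Metadata Filtering

The next extension is adding metadata filter support. This means that we first filter the candidate set with the metadata filters, and then run the semantic query on what remains.

For simplicity, the metadata filters use exact matching and are combined with an AND condition.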


In [None]:
from llama_index.core.vector_stores import MetadataFilters
from llama_index.core.schema import BaseNode
from typing import cast


def filter_nodes(nodes: List[BaseNode], filters: MetadataFilters):
    filtered_nodes = []
    for node in nodes:
        matches = True
        for f in filters.filters:
            if f.key not in node.metadata:
                matches = False
                continue
            if f.value != node.metadata[f.key]:
                matches = False
                continue
        if matches:
            filtered_nodes.append(node)
    return filtered_nodes

We apply `filter_nodes` as a first pass over the nodes before running semantic search.


In [None]:
def dense_search(query: VectorStoreQuery, nodes: List[BaseNode]):
    """Dense search."""
    query_embedding = cast(List[float], query.query_embedding)
    doc_embeddings = [n.embedding for n in nodes]
    doc_ids = [n.node_id for n in nodes]
    return get_top_k_embeddings(
        query_embedding,
        doc_embeddings,
        doc_ids,
        similarity_top_k=query.similarity_top_k,
    )


class VectorStore3B(VectorStore2):
    """Implements metadata filtering."""

    def query(
        self,
        query: VectorStoreQuery,
        **kwargs: Any,
    ) -> VectorStoreQueryResult:
        """Get nodes for response."""
        # 1. First filter by metadata
        nodes = list(self.node_dict.values())
        if query.filters is not None:
            nodes = filter_nodes(nodes, query.filters)
        if len(nodes) == 0:
            result_nodes = []
            similarities = []
            node_ids = []
        else:
            # 2. Then perform semantic search
            similarities, node_ids = dense_search(query, nodes)
            result_nodes = [self.node_dict[node_id] for node_id in node_ids]
        return VectorStoreQueryResult(
            nodes=result_nodes, similarities=similarities, ids=node_ids
        )
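
The filter-then-rank flow can also be traced on toy data without LlamaIndex. The sketch below is a hypothetical stand-in that uses plain dictionaries for nodes (`toy_nodes`, `toy_matches`, and `toy_filtered_query` are illustrative names). It shows that a node failing the metadata filter is excluded even when it is the closest match to the query.

```python
import math
from typing import Any, Dict, List, Tuple

# Hypothetical nodes: id, metadata, and a 2-D embedding each.
toy_nodes: List[Dict[str, Any]] = [
    {"id": "n1", "metadata": {"source": "23"}, "embedding": [1.0, 0.0]},
    {"id": "n2", "metadata": {"source": "24"}, "embedding": [0.0, 1.0]},
    {"id": "n3", "metadata": {"source": "24"}, "embedding": [1.0, 1.0]},
]


def toy_cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def toy_matches(metadata: Dict[str, Any], required: Dict[str, Any]) -> bool:
    """Exact match on every required key (AND condition)."""
    return all(k in metadata and metadata[k] == v for k, v in required.items())


def toy_filtered_query(
    query_embedding: List[float], required: Dict[str, Any], top_k: int
) -> List[Tuple[float, str]]:
    # 1. Filter candidates by metadata.
    candidates = [n for n in toy_nodes if toy_matches(n["metadata"], required)]
    # 2. Rank the survivors by cosine similarity, best first.
    scored = [(toy_cosine(n["embedding"], query_embedding), n["id"]) for n in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]


toy_hits = toy_filtered_query([1.0, 0.0], {"source": "24"}, top_k=2)
toy_hit_ids = [node_id for _, node_id in toy_hits]
print(toy_hit_ids)
```

Here `n1` points in exactly the same direction as the query but is dropped by the filter, so the ranking only considers `n2` and `n3`.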

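### 4. Load Data into Our Vector Store

Let's load our text chunks into the vector store and run different types of queries against it: dense search, dense search with metadata filters, and so on.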


In [None]:
vector_store = VectorStore3B()
# load data into the vector store
vector_store.add(nodes)

Define an example question and embed it.


In [None]:
query_str = "Can you tell me about the key concepts for safety finetuning"
query_embedding = embed_model.get_query_embedding(query_str)

#### Query the vector store with dense search.


In [None]:
query_obj = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2
)

query_result = vector_store.query(query_obj)
for similarity, node in zip(query_result.similarities, query_result.nodes):
    print(
        "\n----------------\n"
        f"[Node ID {node.node_id}] Similarity: {similarity}\n\n"
        f"{node.get_content(metadata_mode='all')}"
        "\n----------------\n\n"
    )


----------------
[Node ID 3f74fdf4-0e2e-473e-9b07-10c51eb62794] Similarity: 0.835677131511819

total_pages: 77
file_path: ./data/llama2.pdf
source: 23

Specifically, we use the following techniques in safety fine-tuning:
1. Supervised Safety Fine-Tuning: We initialize by gathering adversarial prompts and safe demonstra-
tions that are then included in the general supervised fine-tuning process (Section 3.1). This teaches
the model to align with our safety guidelines even before RLHF, and thus lays the foundation for
high-quality human preference data annotation.
2. Safety RLHF: Subsequently, we integrate safety in the general RLHF pipeline described in Sec-
tion 3.2.2. This includes training a safety-specific reward model and gathering more challenging
adversarial prompts for rejection sampling style fine-tuning and PPO optimization.
3. Safety Context Distillation: Finally, we refine our RLHF pipeline with context distillation (Askell
et al., 2021b).
----------------



--------------

#### Query the vector store with dense search + metadata filters


In [None]:
# filters = MetadataFilters(
#     filters=[
#         ExactMatchFilter(key="page", value=3)
#     ]
# )
filters = MetadataFilters.from_dict({"source": "24"})

query_obj = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, filters=filters
)

query_result = vector_store.query(query_obj)
for similarity, node in zip(query_result.similarities, query_result.nodes):
    print(
        "\n----------------\n"
        f"[Node ID {node.node_id}] Similarity: {similarity}\n\n"
        f"{node.get_content(metadata_mode='all')}"
        "\n----------------\n\n"
    )



----------------
[Node ID efe54bc0-4f9f-49ad-9dd5-900395a092fa] Similarity: 0.8190195580569283

total_pages: 77
file_path: ./data/llama2.pdf
source: 24

4.2.2
Safety Supervised Fine-Tuning
In accordance with the established guidelines from Section 4.2.1, we gather prompts and demonstrations
of safe model responses from trained annotators, and use the data for supervised fine-tuning in the same
manner as described in Section 3.1. An example can be found in Table 5.
The annotators are instructed to initially come up with prompts that they think could potentially induce
the model to exhibit unsafe behavior, i.e., perform red teaming, as defined by the guidelines. Subsequently,
annotators are tasked with crafting a safe and helpful response that the model should produce.
4.2.3
Safety RLHF
We observe early in the development of Llama 2-Chat that it is able to generalize from the safe demonstrations
in supervised fine-tuning. The model quickly learns to write detailed safe responses, addres

## Build a RAG System with the Vector Store

Now that we have built the vector store, it is time to plug it into our downstream system!


In [None]:
from llama_index.core import VectorStoreIndex

In [None]:
index = VectorStoreIndex.from_vector_store(vector_store)

In [None]:
query_engine = index.as_query_engine()

In [None]:
query_str = "Can you tell me about the key concepts for safety finetuning"

In [None]:
response = query_engine.query(query_str)

In [None]:
print(str(response))

The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to align the model with safety guidelines before RLHF. Safety RLHF integrates safety into the RLHF pipeline by training a safety-specific reward model and gathering more challenging adversarial prompts for fine-tuning and optimization. Finally, safety context distillation is used to refine the RLHF pipeline. These techniques aim to mitigate safety risks and ensure that the model aligns with safety guidelines.


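## Conclusion

That's it! We have built a simple in-memory vector store that supports very simple inserts, gets, and deletes, along with dense search and metadata filtering. It can then be plugged into the rest of the LlamaIndex abstractions.

It does not yet support sparse search and is obviously not meant to be used in any real application. But it should expose some of what is going on under the hood!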
