<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/VertexAIVectorSearchDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


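# Google Vertex AI Vector Search

This notebook shows how to use functionality related to the `Google Cloud Vertex AI Vector Search` vector database.

> [Google Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview), formerly known as Vertex AI Matching Engine, provides an industry-leading, high-scale, low-latency vector database. These vector databases are commonly referred to as vector similarity-matching or approximate nearest neighbor (ANN) services.

**Note**: LlamaIndex expects the Vertex AI Vector Search endpoint and the deployed index to exist already. Creating an empty index can take up to a minute, and deploying the index to the endpoint can take up to 30 minutes.

> To see how to create an index, refer to [Create Index and deploy it to an Endpoint](#create-index-and-deploy-it-to-an-endpoint)  
If you already have an index deployed, skip to [Create Vector Store from texts](#create-vector-store-from-texts)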


## Installation

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.


In [None]:
! pip install llama-index llama-index-vector-stores-vertexaivectorsearch llama-index-llms-vertex llama-index-embeddings-vertex

## Create Index and deploy it to an Endpoint

- This section demonstrates how to create a new index and deploy it to an endpoint.


In [None]:
# TODO: set these values to match your environment

# Project and storage constants
PROJECT_ID = "[your_project_id]"
REGION = "[your_region]"
GCS_BUCKET_NAME = "[your_gcs_bucket]"
GCS_BUCKET_URI = f"gs://{GCS_BUCKET_NAME}"

# textembedding-gecko@003 produces 768-dimensional vectors.
# If you use a different embedder, the dimensions may need to change.
VS_DIMENSIONS = 768

# Vertex AI Vector Search index configuration
# The parameters are described here:
# https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex#google_cloud_aiplatform_MatchingEngineIndex_create_tree_ah_index
VS_INDEX_NAME = "llamaindex-doc-index"  # @param {type:"string"}
VS_INDEX_ENDPOINT_NAME = "llamaindex-doc-endpoint"  # @param {type:"string"}
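
`VS_DIMENSIONS` has to match the length of the vectors produced by the embedding model, otherwise inserts into the index are likely to fail. A minimal sketch of a pre-flight check follows. `check_embedding_dimensions` is a hypothetical helper (it is not part of LlamaIndex or the Vertex AI SDK), and the vector used here is a stand-in for a real embedding.

```python
from typing import Sequence


def check_embedding_dimensions(
    embedding: Sequence[float], expected_dimensions: int
) -> None:
    """Raise ValueError if the embedding length differs from the index dimensions."""
    actual = len(embedding)
    if actual != expected_dimensions:
        raise ValueError(
            f"Embedding has {actual} dimensions, "
            f"but the index expects {expected_dimensions}."
        )


# Stand-in for a real embedding; in practice this would come from the
# embedding model, for example embed_model.get_text_embedding("some text").
fake_embedding = [0.0] * 768
check_embedding_dimensions(fake_embedding, 768)  # passes silently
```

Running such a check once before adding nodes surfaces a mismatch early, instead of during the upload to the index.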

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

### Create a Cloud Storage bucket


In [None]:
# Create a bucket.
! gsutil mb -l $REGION -p $PROJECT_ID $GCS_BUCKET_URI

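### Create an empty index

**Note:** When creating an index, you should specify an `index_update_method`: either `BATCH_UPDATE` or `STREAM_UPDATE`.

> A batch index is for when you want to update your index in batches, using data that has been stored over a set amount of time, such as systems that are processed weekly or monthly.
>
> A streaming index is for when you want index data to be updated as new data is added to your datastore, for instance, if you have a bookstore and want to show new inventory online as soon as possible.
>
> Which type you choose is important, since the setup and requirements are different.

Refer to the [official documentation](https://cloud.google.com/vertex-ai/docs/vector-search/create-manage-index) and the [API reference](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex#google_cloud_aiplatform_MatchingEngineIndex_create_tree_ah_index) for more details on configuring indexes.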


In [None]:
# NOTE: this operation can take up to 30 seconds

# Check whether the index already exists
index_names = [
    index.resource_name
    for index in aiplatform.MatchingEngineIndex.list(
        filter=f"display_name={VS_INDEX_NAME}"
    )
]

if len(index_names) == 0:
    print(f"Creating Vector Search index {VS_INDEX_NAME} ...")
    vs_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
        display_name=VS_INDEX_NAME,
        dimensions=VS_DIMENSIONS,
        distance_measure_type="DOT_PRODUCT_DISTANCE",
        shard_size="SHARD_SIZE_SMALL",
        index_update_method="STREAM_UPDATE",  # allowed values: BATCH_UPDATE, STREAM_UPDATE
    )
    print(
        f"Vector Search index {vs_index.display_name} created with resource name {vs_index.resource_name}"
    )
else:
    vs_index = aiplatform.MatchingEngineIndex(index_name=index_names[0])
    print(
        f"Vector Search index {vs_index.display_name} exists with resource name {vs_index.resource_name}"
    )
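
The index above is created with `DOT_PRODUCT_DISTANCE`. As a reminder of what that measure computes, here is a small standalone sketch with made-up 3-dimensional vectors (not real embeddings): for vectors scaled to unit length, the dot product equals the cosine similarity, so both give the same ranking. Whether your embedding model returns unit-length vectors is an assumption you should verify; the helper names are illustrative.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for a query embedding and a document embedding.
query = normalize([1.0, 2.0, 2.0])
doc = normalize([2.0, 1.0, 2.0])

print(f"dot product:       {dot(query, doc):.4f}")
print(f"cosine similarity: {cosine_similarity(query, doc):.4f}")
```

If the vectors are not normalized, the dot product also rewards longer vectors, so the two measures can rank results differently.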

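### Create an endpoint

To use the index, you need to create an index endpoint. It works as a server instance that accepts query requests for your index. An endpoint can be a [public endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public) or a [private endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-vpc).

Let's create a public endpoint.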


In [None]:
endpoint_names = [
    endpoint.resource_name
    for endpoint in aiplatform.MatchingEngineIndexEndpoint.list(
        filter=f"display_name={VS_INDEX_ENDPOINT_NAME}"
    )
]

if len(endpoint_names) == 0:
    print(
        f"Creating Vector Search index endpoint {VS_INDEX_ENDPOINT_NAME} ..."
    )
    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
        display_name=VS_INDEX_ENDPOINT_NAME, public_endpoint_enabled=True
    )
    print(
        f"Vector Search index endpoint {vs_endpoint.display_name} created with resource name {vs_endpoint.resource_name}"
    )
else:
    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=endpoint_names[0]
    )
    print(
        f"Vector Search index endpoint {vs_endpoint.display_name} exists with resource name {vs_endpoint.resource_name}"
    )

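### Deploy the index to the endpoint

Using the index endpoint, deploy the index by specifying a unique deployed index ID.

**NOTE: This operation can take up to 30 minutes.**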


In [None]:
# Check whether the index is already deployed to an endpoint
index_endpoints = [
    (deployed_index.index_endpoint, deployed_index.deployed_index_id)
    for deployed_index in vs_index.deployed_indexes
]

if len(index_endpoints) == 0:
    print(
        f"Deploying Vector Search index {vs_index.display_name} at endpoint {vs_endpoint.display_name} ..."
    )
    vs_deployed_index = vs_endpoint.deploy_index(
        index=vs_index,
        # Deployed index IDs may only contain letters, digits, and underscores.
        deployed_index_id=VS_INDEX_NAME.replace("-", "_"),
        display_name=VS_INDEX_NAME,
        machine_type="e2-standard-16",
        min_replica_count=1,
        max_replica_count=1,
    )
    print(
        f"Vector Search index {vs_index.display_name} is deployed at endpoint {vs_deployed_index.display_name}"
    )
else:
    vs_deployed_index = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=index_endpoints[0][0]
    )
    print(
        f"Vector Search index {vs_index.display_name} is already deployed at endpoint {vs_deployed_index.display_name}"
    )
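
The check above only tests whether the index is deployed anywhere. If the same index may be deployed to several endpoints, it can help to look for one specific endpoint. The sketch below assumes that each entry of `vs_index.deployed_indexes` exposes `index_endpoint` and `deployed_index_id`, as used in the cell above. `find_deployed_index_id` is a hypothetical helper, and `SimpleNamespace` objects with made-up resource names stand in for the SDK objects so the snippet runs without a cloud project.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def find_deployed_index_id(
    deployed_indexes: Iterable, endpoint_resource_name: str
) -> Optional[str]:
    """Return the deployed index ID on the given endpoint, or None if absent."""
    for deployed in deployed_indexes:
        if deployed.index_endpoint == endpoint_resource_name:
            return deployed.deployed_index_id
    return None


# Stand-in data; real entries would come from vs_index.deployed_indexes.
deployments = [
    SimpleNamespace(
        index_endpoint="projects/123/locations/us-central1/indexEndpoints/111",
        deployed_index_id="index_a",
    ),
    SimpleNamespace(
        index_endpoint="projects/123/locations/us-central1/indexEndpoints/222",
        deployed_index_id="index_b",
    ),
]

print(
    find_deployed_index_id(
        deployments, "projects/123/locations/us-central1/indexEndpoints/222"
    )
)
```

The comparison assumes both resource names use the same form (for example, both with the project number); if they differ, compare only the trailing endpoint ID.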

## Create Vector Store from texts

NOTE: If you already have a Vertex AI Vector Search index and endpoint, you can assign them using the following code:


In [None]:
# TODO: replace 1234567890123456789 with your actual index ID
vs_index = aiplatform.MatchingEngineIndex(index_name="1234567890123456789")

# TODO: replace 1234567890123456789 with your actual endpoint ID
vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="1234567890123456789"
)
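
Besides a bare numeric ID, Vertex AI resources can be addressed by a fully qualified name of the form `projects/{project}/locations/{region}/indexes/{id}` or `projects/{project}/locations/{region}/indexEndpoints/{id}`. The sketch below builds such names from their parts; the two helper functions are hypothetical and the project, region, and ID values are placeholders.

```python
def index_resource_name(project: str, region: str, index_id: str) -> str:
    """Build the fully qualified resource name of a Vector Search index."""
    return f"projects/{project}/locations/{region}/indexes/{index_id}"


def index_endpoint_resource_name(
    project: str, region: str, endpoint_id: str
) -> str:
    """Build the fully qualified resource name of an index endpoint."""
    return f"projects/{project}/locations/{region}/indexEndpoints/{endpoint_id}"


# Placeholder values for illustration only.
print(index_resource_name("my-project", "us-central1", "1234567890123456789"))
print(
    index_endpoint_resource_name(
        "my-project", "us-central1", "1234567890123456789"
    )
)
```

Using the full name makes the project and region explicit, which can avoid surprises when `aiplatform.init` was called with different defaults.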

In [None]:
# Import the required modules
from llama_index.core import (
    StorageContext,
    Settings,
    VectorStoreIndex,
    SimpleDirectoryReader,
)
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores.types import (
    MetadataFilters,
    MetadataFilter,
    FilterOperator,
)
from llama_index.llms.vertex import Vertex
from llama_index.embeddings.vertex import VertexTextEmbedding
from llama_index.vector_stores.vertexaivectorsearch import VertexAIVectorStore

### Create a simple vector store from plain text without metadata filters


In [None]:
# Set up storage
vector_store = VertexAIVectorStore(
    project_id=PROJECT_ID,
    region=REGION,
    index_id=vs_index.resource_name,
    endpoint_id=vs_endpoint.resource_name,
    gcs_bucket_name=GCS_BUCKET_NAME,
)

# Set up the storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

### Use [Vertex AI Embeddings](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/embeddings/llama-index-embeddings-vertex) as the embeddings model


In [None]:
# Configure the embedding model
embed_model = VertexTextEmbedding(
    model_name="textembedding-gecko@003",
    project=PROJECT_ID,
    location=REGION,
)

# Set up the index/query process, i.e. the embedding model (and completion if used)
Settings.embed_model = embed_model

### Add vectors and mapped text chunks to your vector store


In [None]:
# Input texts
texts = [
    "The cat sat on",
    "the mat.",
    "I like to",
    "eat pizza for",
    "dinner.",
    "The sun sets",
    "in the west.",
]
nodes = [
    TextNode(text=text, embedding=embed_model.get_text_embedding(text))
    for text in texts
]

vector_store.add(nodes)

### Run a similarity search


In [None]:
# Define the index from the vector store
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, embed_model=embed_model
)
retriever = index.as_retriever()

In [None]:
response = retriever.retrieve("pizza")
for row in response:
    print(f"Score: {row.get_score():.3f} Text: {row.get_text()}")

Score: 0.703 Text: eat pizza for
Score: 0.626 Text: dinner.
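
Only two rows come back because the retriever keeps a limited number of top-scoring nodes. In LlamaIndex the `similarity_top_k` argument of `as_retriever` controls this, and its default is assumed here to be 2. Conceptually, retrieval scores stored vectors against the query vector and keeps the best `k`. The sketch below does this by brute force over made-up 2-dimensional vectors; Vector Search itself uses approximate nearest-neighbor search rather than a full scan, and the `top_k` helper is illustrative.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k(
    query: Sequence[float],
    items: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs with the highest dot-product score."""
    scored = [
        (sum(q * x for q, x in zip(query, vector)), text)
        for text, vector in items
    ]
    return heapq.nlargest(k, scored)


# Made-up vectors; real ones would be 768-dimensional embeddings.
items = [
    ("eat pizza for", [0.9, 0.1]),
    ("dinner.", [0.7, 0.3]),
    ("The sun sets", [0.1, 0.9]),
]

for score, text in top_k([1.0, 0.0], items, k=2):
    print(f"Score: {score:.3f} Text: {text}")
```

Passing a larger `similarity_top_k` to `index.as_retriever` returns more rows, as the filtered searches later in this notebook do.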


## Add documents with metadata attributes and use filters


In [None]:
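# Input texts with metadata
records = [
    {
        "description": "A versatile pair of dark-wash denim jeans. Made from durable cotton with a classic straight-leg cut, these jeans transition easily from casual days to dressier occasions.",
        "price": 65.00,
        "color": "blue",
        "season": ["fall", "winter", "spring"],
    },
    {
        "description": "A crisp white linen shirt. The breathable fabric and relaxed fit are perfect for keeping cool.",
        "price": 34.99,
        "color": "white",
        "season": ["summer", "spring"],
    },
    {
        "description": "A soft, chunky knit sweater in deep green. The oversized fit and cozy wool blend make it ideal for staying warm when the temperature drops.",
        "price": 89.99,
        "color": "green",
        "season": ["fall", "winter"],
    },
    {
        "description": "A crewneck T-shirt in a soft, heathered blue. Made from comfortable cotton jersey, this T-shirt is a basic that works for every season.",
        "price": 19.99,
        "color": "blue",
        "season": ["fall", "winter", "summer", "spring"],
    },
    {
        "description": "A flowing midi skirt in a delicate floral print. Lightweight and airy, this skirt adds a touch of feminine style to warmer days.",
        "price": 45.00,
        "color": "white",
        "season": ["spring", "summer"],
    },
    {
        "description": "A pair of well-tailored dress pants in a neutral grey. Made from a wrinkle-resistant blend, these pants look sharp and professional for workwear or formal occasions.",
        "price": 69.99,
        "color": "grey",
        "season": ["fall", "winter", "summer", "spring"],
    },
    {
        "description": "A pair of tailored black trousers in a comfortable stretch fabric. Perfect for work or dressier events, these trousers provide a sleek, polished look.",
        "price": 59.99,
        "color": "black",
        "season": ["fall", "winter", "spring"],
    },
    {
        "description": "A denim jacket with a faded wash and distressed details. This wardrobe staple adds a touch of effortless cool to any outfit.",
        "price": 79.99,
        "color": "blue",
        "season": ["fall", "spring", "summer"],
    },
]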

nodes = []
for record in records:
    text = record.pop("description")
    embedding = embed_model.get_text_embedding(text)
    metadata = {**record}
    nodes.append(TextNode(text=text, embedding=embedding, metadata=metadata))

vector_store.add(nodes)
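
Note that `record.pop("description")` removes the key from each dictionary in `records`, so running the cell a second time raises a `KeyError`. A minimal alternative that leaves the input untouched is sketched below; `split_record` is a hypothetical helper and the sample record is shortened for illustration.

```python
from typing import Any, Dict, Tuple


def split_record(
    record: Dict[str, Any], text_key: str = "description"
) -> Tuple[str, Dict[str, Any]]:
    """Return (text, metadata) without modifying the input record."""
    metadata = {
        key: value for key, value in record.items() if key != text_key
    }
    return record[text_key], metadata


# Shortened sample record for illustration.
sample = {"description": "A denim jacket.", "price": 79.99, "color": "blue"}
text, metadata = split_record(sample)

print(text)
print(metadata)
print("description" in sample)  # the original record is unchanged
```

The returned `text` and `metadata` can be passed to `TextNode` exactly as in the loop above.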

### Run a similarity search with filters


In [None]:
# Define the index from the vector store
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, embed_model=embed_model
)

In [None]:
# Simple similarity search without a filter
retriever = index.as_retriever()
response = retriever.retrieve("pants")

for row in response:
    print(f"Text: {row.get_text()}")
    print(f"   Score: {row.get_score():.3f}")
    print(f"   Metadata: {row.metadata}")

Text: A pair of well-tailored dress pants in a neutral grey. Made from a wrinkle-resistant blend, these pants look sharp and professional for workwear or formal occasions.
   Score: 0.669
   Metadata: {'price': 69.99, 'color': 'grey', 'season': ['fall', 'winter', 'summer', 'spring']}
Text: A pair of tailored black trousers in a comfortable stretch fabric. Perfect for work or dressier events, these trousers provide a sleek, polished look.
   Score: 0.642
   Metadata: {'price': 59.99, 'color': 'black', 'season': ['fall', 'winter', 'spring']}


In [None]:
# Similarity search with a text filter
filters = MetadataFilters(filters=[MetadataFilter(key="color", value="blue")])
retriever = index.as_retriever(filters=filters, similarity_top_k=3)
response = retriever.retrieve("denims")

for row in response:
    print(f"Text: {row.get_text()}")
    print(f"   Score: {row.get_score():.3f}")
    print(f"   Metadata: {row.metadata}")

Text: A versatile pair of dark-wash denim jeans. Made from durable cotton with a classic straight-leg cut, these jeans transition easily from casual days to dressier occasions.
   Score: 0.704
   Metadata: {'price': 65.0, 'color': 'blue', 'season': ['fall', 'winter', 'spring']}
Text: A denim jacket with a faded wash and distressed details. This wardrobe staple adds a touch of effortless cool to any outfit.
   Score: 0.667
   Metadata: {'price': 79.99, 'color': 'blue', 'season': ['fall', 'spring', 'summer']}


In [None]:
# Similarity search with text and numeric filters
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="color", value="blue"),
        MetadataFilter(key="price", operator=FilterOperator.GT, value=70.0),
    ]
)
retriever = index.as_retriever(filters=filters, similarity_top_k=3)
response = retriever.retrieve("denims")

for row in response:
    print(f"Text: {row.get_text()}")
    print(f"   Score: {row.get_score():.3f}")
    print(f"   Metadata: {row.metadata}")

Text: A denim jacket with a faded wash and distressed details. This wardrobe staple adds a touch of effortless cool to any outfit.
   Score: 0.667
   Metadata: {'price': 79.99, 'color': 'blue', 'season': ['fall', 'spring', 'summer']}
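
The two filters above are combined with a logical AND (assumed here to be the default condition of `MetadataFilters`): a node must have `color == "blue"` and `price > 70.0`. The plain-Python sketch below mimics that behavior on dictionaries to show why only the denim jacket qualifies. It does not use the LlamaIndex filter classes; the `matches` helper and the trimmed-down items are illustrative.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

# A filter is (metadata key, comparison function, value to compare against).
Filter = Tuple[str, Callable[[Any, Any], bool], Any]


def matches(metadata: Dict[str, Any], filters: List[Filter]) -> bool:
    """Return True only if the metadata satisfies every filter (logical AND)."""
    return all(
        key in metadata and compare(metadata[key], value)
        for key, compare, value in filters
    )


# Trimmed-down items based on the records in this notebook.
items = [
    {"name": "denim jeans", "price": 65.0, "color": "blue"},
    {"name": "denim jacket", "price": 79.99, "color": "blue"},
    {"name": "knit sweater", "price": 89.99, "color": "green"},
]

filters: List[Filter] = [
    ("color", operator.eq, "blue"),
    ("price", operator.gt, 70.0),
]

selected = [item["name"] for item in items if matches(item, filters)]
print(selected)
```

In the real retriever the filter restricts the candidate set and the similarity score then orders whatever remains.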


## Parse, index, and query PDFs using Vertex AI Vector Search and Gemini Pro


In [None]:
! mkdir -p ./data/arxiv/
! wget 'https://arxiv.org/pdf/1706.03762.pdf' -O ./data/arxiv/test.pdf

--2024-05-01 00:56:52--  https://arxiv.org/pdf/1706.03762.pdf
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.195.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/1706.03762 [following]
--2024-05-01 00:56:52--  http://arxiv.org/pdf/1706.03762
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2215244 (2.1M) [application/pdf]
Saving to: ‘./data/arxiv/test.pdf’


2024-05-01 00:56:52 (31.5 MB/s) - ‘./data/arxiv/test.pdf’ saved [2215244/2215244]


In [None]:
# Load documents
documents = SimpleDirectoryReader("./data/arxiv/").load_data()
print(f"# of documents = {len(documents)}")

# of documents = 15


In [None]:
# Set up storage
vector_store = VertexAIVectorStore(
    project_id=PROJECT_ID,
    region=REGION,
    index_id=vs_index.resource_name,
    endpoint_id=vs_endpoint.resource_name,
    gcs_bucket_name=GCS_BUCKET_NAME,
)

# Set up the storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Configure the embedding model
embed_model = VertexTextEmbedding(
    model_name="textembedding-gecko@003",
    project=PROJECT_ID,
    location=REGION,
)

vertex_gemini = Vertex(model="gemini-pro", temperature=0, additional_kwargs={})

# Set up the index/query process, i.e. the LLM and the embedding model
Settings.llm = vertex_gemini
Settings.embed_model = embed_model

In [None]:
# Build the index from the documents, storing vectors in the vector store
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

In [None]:
query_engine = index.as_query_engine()

In [None]:
response = query_engine.query(
    "Who are the authors of the paper Attention Is All You Need?"
)

print("Response:")
print("-" * 80)
print(response.response)
print("-" * 80)
print("Source Documents:")
print("-" * 80)
for source in response.source_nodes:
    print(f"Sample Text: {source.text[:50]}")
    print(f"Relevance score: {source.get_score():.3f}")
    print(f"File Name: {source.metadata.get('file_name')}")
    print(f"Page #: {source.metadata.get('page_label')}")
    print(f"File Path: {source.metadata.get('file_path')}")
    print("-" * 80)

Response:
--------------------------------------------------------------------------------
The authors of the paper "Attention Is All You Need" are:

* Ashish Vaswani
* Noam Shazeer
* Niki Parmar
* Jakob Uszkoreit
* Llion Jones
* Aidan N. Gomez
* Łukasz Kaiser
* Illia Polosukhin
--------------------------------------------------------------------------------
Source Documents:
--------------------------------------------------------------------------------
Sample Text: Provided proper attribution is provided, Google he
Relevance score: 0.720
File Name: test.pdf
Page #: 1
File Path: /home/jupyter/llama_index/docs/docs/examples/vector_stores/data/arxiv/test.pdf
--------------------------------------------------------------------------------
Sample Text: length nis smaller than the representation dimensi
Relevance score: 0.678
File Name: test.pdf
Page #: 7
File Path: /home/jupyter/llama_index/docs/docs/examples/vector_stores/data/arxiv/test.pdf
--------------------------------------------------------------------------------


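## Clean Up

Please delete the Vertex AI Vector Search Index and Index Endpoint after running your experiments to avoid incurring additional charges. Note that you are charged for as long as the endpoint is running.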

<div class="alert alert-block alert-warning">
    <b>⚠️ NOTE: Enabling the `CLEANUP_RESOURCES` flag deletes the Vector Search Index, the Index Endpoint, and the Cloud Storage bucket. Please run it with caution.</b>
</div>


In [None]:
CLEANUP_RESOURCES = False

- Undeploy indexes and delete the index endpoint


In [None]:
if CLEANUP_RESOURCES:
    print(
        f"Undeploying all indexes and deleting the index endpoint {vs_endpoint.display_name}"
    )
    vs_endpoint.undeploy_all()
    vs_endpoint.delete()

- Delete the index


In [None]:
if CLEANUP_RESOURCES:
    print(f"Deleting the index {vs_index.display_name}")
    vs_index.delete()

- Delete contents from the Cloud Storage bucket


In [None]:
if CLEANUP_RESOURCES and "GCS_BUCKET_NAME" in globals():
    print(f"Deleting contents from the Cloud Storage bucket {GCS_BUCKET_NAME}")

    shell_output = ! gsutil du -ash gs://$GCS_BUCKET_NAME
    print(shell_output)
    print(
        f"Size of the bucket {GCS_BUCKET_NAME} before deleting = {' '.join(shell_output[0].split()[:2])}"
    )

    # Uncomment the line below to delete the contents of the bucket
    # ! gsutil -m rm -r gs://$GCS_BUCKET_NAME
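
The expression `shell_output[0]` in the cell above raises an `IndexError` if the command produces no output. A more defensive way to extract the size is sketched below. `parse_du_summary` is a hypothetical helper, and the sample line is made up to resemble `gsutil du -ash` output (size, unit, bucket URI), which is an assumption about the format.

```python
from typing import List, Optional


def parse_du_summary(lines: List[str]) -> Optional[str]:
    """Return the size (for example '2.1 MiB') from the first line, or None."""
    if not lines:
        return None
    parts = lines[0].split()
    # Expect at least: <number> <unit> <bucket URI>
    if len(parts) < 3:
        return None
    return " ".join(parts[:2])


# Made-up sample lines for illustration.
print(parse_du_summary(["2.1 MiB     gs://my-bucket"]))
print(parse_du_summary([]))
```

Checking for `None` before printing the size keeps the cleanup cell from failing on an empty or unexpected result.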