# Google Vertex AI Vector Search

This notebook shows how to use functionality related to the `Google Cloud Vertex AI Vector Search` vector database.

> [Google Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview), formerly known as Vertex AI Matching Engine, provides the industry's leading high-scale, low-latency vector database. Such vector databases are commonly referred to as vector similarity-matching or approximate nearest neighbor (ANN) services.

**Note**: LlamaIndex expects the Vertex AI Vector Search endpoint and the deployed index to already exist. Creating an empty index can take up to a minute, and deploying an index to an endpoint can take up to 30 minutes.

> To see how to create an index, refer to the section [Create Index and deploy it to an Endpoint](#create-index-and-deploy-it-to-an-endpoint).  
If you already have an index deployed, skip to [Create Vector Store from texts](#create-vector-store-from-texts).

## Installation

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

```sh
%pip install llama-index llama-index-vector-stores-vertexaivectorsearch llama-index-llms-vertex llama-index-embeddings-vertex llama-index-embeddings-huggingface

%pip install google-cloud-aiplatform
```

## Create Index and deploy it to an Endpoint

- This section demonstrates creating a new index and deploying it to an endpoint.

In [4]:
# TODO : Set values as per your requirements

# Project and Storage Constants
PROJECT_ID = "gen-lang-client-0974620078"
REGION = "asia-southeast1"
GCS_BUCKET_NAME = "text-retrieval"
GCS_BUCKET_URI = f"gs://{GCS_BUCKET_NAME}"

# The index dimension must match the output size of the embedding model.
# Qwen/Qwen3-Embedding-0.6B (used for querying later in this notebook) produces
# 1024-dimensional vectors; textembedding-gecko@003 would need 768 instead.
VS_DIMENSIONS = 1024

# Vertex AI Vector Search Index configuration
# parameter description here
# https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex#google_cloud_aiplatform_MatchingEngineIndex_create_tree_ah_index
VS_INDEX_NAME = "text-retrieval-index"  # @param {type:"string"}
VS_INDEX_ENDPOINT_NAME = "text-retrieval-endpoint"  # @param {type:"string"}
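
The index dimension is set when the index is created, so it is worth confirming that every precomputed embedding has that length before uploading. A minimal sketch of such a pre-flight check, assuming embeddings are plain Python lists; `check_embedding_dimensions` is a hypothetical helper, not part of any library:

```python
from typing import List, Sequence

VS_DIMENSIONS = 1024  # same value as in the configuration cell above


def check_embedding_dimensions(
    embeddings: Sequence[Sequence[float]], expected: int = VS_DIMENSIONS
) -> List[int]:
    """Return the positions of embeddings whose length differs from `expected`."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected]


# Hypothetical sample: two valid vectors and one with the wrong size (768)
sample = [[0.0] * 1024, [0.1] * 1024, [0.2] * 768]
bad_positions = check_embedding_dimensions(sample)
print(bad_positions)
```

An empty result means every vector matches the index configuration.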

In [5]:
from google.cloud import aiplatform
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "C:/Users/mt200/OneDrive/Desktop/AI/AI_challenge/software/back-end/service-account.json"
)

In [6]:
aiplatform.init(project=PROJECT_ID, location=REGION, credentials=credentials)

### Create Cloud Storage bucket

In [23]:
! gsutil mb -l $REGION -p $PROJECT_ID $GCS_BUCKET_URI

Creating gs://text-retrieval/...


### Create an empty Index

**Note:** When creating an index, you should specify an `index_update_method`: `BATCH_UPDATE` or `STREAM_UPDATE`.

> A batch index is for when you want to update your index in a batch, with data which has been stored over a set amount of time, like systems which are processed weekly or monthly.
>
> A streaming index is when you want index data to be updated as new data is added to your datastore, for instance, if you have a bookstore and want to show new inventory online as soon as possible.
>
> Which type you choose is important, since setup and requirements are different.

Refer to the [official documentation](https://cloud.google.com/vertex-ai/docs/vector-search/create-manage-index) and the [API reference](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex#google_cloud_aiplatform_MatchingEngineIndex_create_tree_ah_index) for more details on configuring indexes.
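
The choice between the two update methods can be made explicit in code. The sketch below is illustrative only: `choose_index_update_method` and its `max_staleness_hours` threshold are hypothetical and not part of the Vertex AI SDK; only the two returned strings are values taken from this notebook.

```python
def choose_index_update_method(max_staleness_hours: float) -> str:
    """Pick an update method from how stale search results are allowed to be.

    The one-hour cutoff is an arbitrary assumption for illustration.
    """
    if max_staleness_hours < 0:
        raise ValueError("max_staleness_hours must be non-negative")
    # New data must be searchable almost immediately: use streaming updates
    if max_staleness_hours < 1:
        return "STREAM_UPDATE"
    # Data is refreshed on a schedule (for example weekly): use batch updates
    return "BATCH_UPDATE"


print(choose_index_update_method(0.1))     # bookstore showing new inventory quickly
print(choose_index_update_method(24 * 7))  # weekly reprocessing
```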

In [7]:
# NOTE: This operation can take up to 30 seconds

# check if index exists
index_names = [
    index.resource_name
    for index in aiplatform.MatchingEngineIndex.list(
        filter=f"display_name={VS_INDEX_NAME}"
    )
]

if len(index_names) == 0:
    print(f"Creating Vector Search index {VS_INDEX_NAME} ...")
    vs_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
        display_name=VS_INDEX_NAME,
        dimensions=VS_DIMENSIONS,
        distance_measure_type="DOT_PRODUCT_DISTANCE",
        shard_size="SHARD_SIZE_SMALL",
        index_update_method="STREAM_UPDATE",  
        approximate_neighbors_count=200,
    )
    print(
        f"Vector Search index {vs_index.display_name} created with resource name {vs_index.resource_name}"
    )
else:
    vs_index = aiplatform.MatchingEngineIndex(index_name=index_names[0])
    print(
        f"Vector Search index {vs_index.display_name} exists with resource name {vs_index.resource_name}"
    )

Vector Search index text-retrieval-index exists with resource name projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
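
The index above uses `DOT_PRODUCT_DISTANCE`, which ranks by raw dot product, so vector length affects the score. If cosine-like behaviour is wanted, one common approach (an assumption here, not something this notebook does) is to L2-normalize embeddings before upserting and querying. A minimal pure-Python sketch; `l2_normalize` and `dot` are hypothetical helpers:

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale `vector` to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([6.0, 8.0])  # same direction, twice the length
print(dot(a, b))  # close to 1.0 for vectors pointing the same way
```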


### Create an Endpoint

To use the index, you need to create an index endpoint. It works as a server instance accepting query requests for your index. An endpoint can be a [public endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public) or a [private endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-vpc).

Let's create a public endpoint.

In [8]:
endpoint_names = [
    endpoint.resource_name
    for endpoint in aiplatform.MatchingEngineIndexEndpoint.list(
        filter=f"display_name={VS_INDEX_ENDPOINT_NAME}"
    )
]

if len(endpoint_names) == 0:
    print(
        f"Creating Vector Search index endpoint {VS_INDEX_ENDPOINT_NAME} ..."
    )
    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
        display_name=VS_INDEX_ENDPOINT_NAME, public_endpoint_enabled=True
    )
    print(
        f"Vector Search index endpoint {vs_endpoint.display_name} created with resource name {vs_endpoint.resource_name}"
    )
else:
    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=endpoint_names[0]
    )
    print(
        f"Vector Search index endpoint {vs_endpoint.display_name} exists with resource name {vs_endpoint.resource_name}"
    )

Vector Search index endpoint text-retrieval-endpoint exists with resource name projects/284454080854/locations/asia-southeast1/indexEndpoints/953056678956236800


### Deploy Index to the Endpoint

With the index endpoint, deploy the index by specifying a unique deployed index ID.

**NOTE: This operation can take up to 30 minutes.**
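
As far as I recall from the Vertex AI documentation, a deployed index ID must start with a letter and contain only letters, digits, and underscores (up to 128 characters), so a hyphenated name like the index display name would be rejected. A small sketch of a local check before calling `deploy_index`; `is_valid_deployed_index_id` is a hypothetical helper, and the pattern encodes that assumed rule:

```python
import re

# Assumed rule: starts with a letter; letters, digits, underscores; max 128 chars
_DEPLOYED_INDEX_ID_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_deployed_index_id(candidate: str) -> bool:
    """Return True if `candidate` matches the assumed deployed index ID rule."""
    return _DEPLOYED_INDEX_ID_RE.fullmatch(candidate) is not None


print(is_valid_deployed_index_id("text_retrieval_deploy_index"))  # ID used in this notebook
print(is_valid_deployed_index_id("text-retrieval-index"))         # contains hyphens
```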

In [26]:
deployed_index_id="text_retrieval_deploy_index"

In [None]:
# check whether the index is already deployed to an endpoint
index_endpoints = [
    (deployed_index.index_endpoint, deployed_index.deployed_index_id)
    for deployed_index in vs_index.deployed_indexes
]

if len(index_endpoints) == 0:
    print(
        f"Deploying Vector Search index {vs_index.display_name} at endpoint {vs_endpoint.display_name} ..."
    )
    vs_deployed_index = vs_endpoint.deploy_index(
        index=vs_index,
        deployed_index_id=deployed_index_id,  # letters, digits, and underscores only
        display_name=VS_INDEX_NAME,
        machine_type="e2-standard-16",
        min_replica_count=1,
        max_replica_count=1,
    )
    print(
        f"Vector Search index {vs_index.display_name} is deployed at endpoint {vs_deployed_index.display_name}"
    )
else:
    vs_deployed_index = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=index_endpoints[0][0]
    )
    print(
        f"Vector Search index {vs_index.display_name} is already deployed at endpoint {vs_deployed_index.display_name}"
    )

Deploying Vector Search index text-retrieval-index at endpoint text-retrieval-endpoint ...
Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/284454080854/locations/asia-southeast1/indexEndpoints/953056678956236800


Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/284454080854/locations/asia-southeast1/indexEndpoints/953056678956236800/operations/6352638985952559104


## Create Vector Store from texts

NOTE: If you already have a Vertex AI Vector Search index and endpoint, you can load them with the following code:

In [9]:
aiplatform.init(project=PROJECT_ID, location=REGION, credentials=credentials)

In [12]:
# Get index by display name
indexes = aiplatform.MatchingEngineIndex.list(filter='display_name="text-retrieval-index"')
if not indexes:
    raise ValueError("Index with display_name='text-retrieval-index' not found.")

# Use resource_name, not display_name
vs_index = aiplatform.MatchingEngineIndex(index_name=indexes[0].resource_name)
print(f"✅ Index loaded: {vs_index.resource_name}")

# Same for endpoint
endpoints = aiplatform.MatchingEngineIndexEndpoint.list(filter='display_name="text-retrieval-endpoint"')
if not endpoints:
    raise ValueError("Endpoint with display_name='text-retrieval-endpoint' not found.")

vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(index_endpoint_name=endpoints[0].resource_name)
print(f"✅ Endpoint loaded: {vs_endpoint.resource_name}")

✅ Index loaded: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
✅ Endpoint loaded: projects/284454080854/locations/asia-southeast1/indexEndpoints/953056678956236800


### Create a simple vector store from plain text without metadata filters

In [13]:
# import modules needed
from llama_index.core import (
    StorageContext,
    Settings,
    VectorStoreIndex,
    SimpleDirectoryReader,
)
from llama_index.core.schema import TextNode, ImageNode
from llama_index.core.vector_stores.types import (
    MetadataFilters,
    MetadataFilter,
    FilterOperator,
)
from llama_index.llms.vertex import Vertex
from llama_index.embeddings.vertex import VertexTextEmbedding
from llama_index.vector_stores.vertexaivectorsearch import VertexAIVectorStore

### Create Text Node

#### Build the text from metadata

In [17]:
# Convert an ordinal number (1-99) into Vietnamese words
def number_to_vietnamese_order(n):
    units = ["", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]
    tens = ["", "mười", "hai mươi", "ba mươi", "bốn mươi", "năm mươi",
            "sáu mươi", "bảy mươi", "tám mươi", "chín mươi"]

    if n <= 10:
        first10 = ["thứ nhất", "thứ hai", "thứ ba", "thứ tư", "thứ năm",
                   "thứ sáu", "thứ bảy", "thứ tám", "thứ chín", "thứ mười"]
        return first10[n-1]

    ten = n // 10
    unit = n % 10

    if ten == 1:
        text = "mười"
    else:
        text = tens[ten]

    if unit == 0:
        order_text = f"thứ {text}"
    elif unit == 1 and ten > 1:
        order_text = f"thứ {text} mốt"
    elif unit == 5:
        order_text = f"thứ {text} lăm"
    else:
        order_text = f"thứ {text} {units[unit]}"

    return order_text


In [18]:
OBJ_NAME_MAP = {
    # People & gender
    "Person": "Người",
    "Man": "Đàn ông",
    "Woman": "Phụ nữ",
    "Boy": "Cậu bé",
    "Girl": "Cô bé",
    "Baby": "Em bé",

    # Animals
    "Bird": "Chim",
    "Cat": "Mèo",
    "Dog": "Chó",
    "Horse": "Ngựa",
    "Sheep": "Cừu",
    "Cow": "Bò",
    "Elephant": "Voi",
    "Bear": "Gấu",
    "Zebra": "Ngựa vằn",
    "Giraffe": "Hươu cao cổ",
    "Rabbit": "Thỏ",
    "Fox": "Cáo",
    "Panda": "Gấu trúc",
    "Tiger": "Hổ",
    "Lion": "Sư tử",
    "Deer": "Hươu",
    "Monkey": "Khỉ",
    "Snake": "Rắn",
    "Frog": "Ếch",
    "Turtle": "Rùa",
    "Whale": "Cá voi",
    "Dolphin": "Cá heo",

    # Vehicles & transport
    "Bicycle": "Xe đạp",
    "Bicycle wheel": "Bánh xe đạp",
    "Bicycle helmet": "Mũ bảo hiểm xe đạp",
    "Car": "Ô tô",
    "Motorcycle": "Xe máy",
    "Airplane": "Máy bay",
    "Bus": "Xe buýt",
    "Train": "Tàu hỏa",
    "Truck": "Xe tải",
    "Boat": "Thuyền",
    "Ship": "Tàu lớn",
    "Helicopter": "Trực thăng",
    "Scooter": "Xe tay ga",
    "Subway": "Tàu điện ngầm",

    # Buildings & architecture
    "Skyscraper": "Tòa nhà chọc trời",
    "Building": "Tòa nhà",
    "House": "Ngôi nhà",
    "Apartment": "Căn hộ",
    "Castle": "Lâu đài",
    "Bridge": "Cầu",
    "Tower": "Tháp",
    "Road": "Đường",
    "Street": "Phố",
    "Parking lot": "Bãi đỗ xe",
    "Playground": "Sân chơi",

    # Natural scenery
    "Beach": "Bãi biển",
    "Mountain": "Núi",
    "River": "Sông",
    "Lake": "Hồ",
    "Forest": "Rừng",
    "Desert": "Sa mạc",
    "Waterfall": "Thác nước",
    "Volcano": "Núi lửa",
    "Cave": "Hang động",
    "Island": "Đảo",
    "Glacier": "Sông băng",

    # Plants, fruits, and vegetables
    "Tree": "Cây",
    "Palm tree": "Cây cọ",
    "Pine tree": "Cây thông",
    "Flower": "Hoa",
    "Rose": "Hoa hồng",
    "Sunflower": "Hoa hướng dương",
    "Tulip": "Hoa tulip",
    "Cactus": "Xương rồng",
    "Banana": "Chuối",
    "Apple": "Táo",
    "Orange": "Cam",
    "Lemon": "Chanh",
    "Strawberry": "Dâu tây",
    "Watermelon": "Dưa hấu",
    "Tomato": "Cà chua",
    "Carrot": "Cà rốt",
    "Broccoli": "Bông cải xanh",
    "Potato": "Khoai tây",
    "Onion": "Hành tây",

    # Toys
    "Teddy bear": "Gấu bông",
    "Ball": "Bóng",
    "Puzzle": "Trò chơi ghép hình",
    "Doll": "Búp bê",
    "Toy car": "Xe đồ chơi",
    "Lego": "Lego",
    "Yo-yo": "Con quay",
    "Frisbee": "Đĩa bay",

    # Household items & electronics
    "Chair": "Ghế",
    "Couch": "Sofa",
    "Table": "Bàn",
    "Lamp": "Đèn",
    "Fan": "Quạt",
    "Refrigerator": "Tủ lạnh",
    "Microwave": "Lò vi sóng",
    "Oven": "Lò nướng",
    "Toaster": "Máy nướng bánh",
    "Sink": "Bồn rửa",
    "Washing machine": "Máy giặt",
    "Vacuum cleaner": "Máy hút bụi",
    "Laptop": "Máy tính xách tay",
    "Desktop computer": "Máy tính để bàn",
    "Printer": "Máy in",
    "Camera": "Máy ảnh",
    "Phone": "Điện thoại",
    "TV": "Tivi",
    "Remote": "Điều khiển",

    # Clothing & accessories
    "Backpack": "Ba lô",
    "Handbag": "Túi xách",
    "Tie": "Cà vạt",
    "Suitcase": "Vali",
    "Hat": "Mũ",
    "Glasses": "Kính",
    "Sunglasses": "Kính mát",
    "Shoes": "Giày",
    "Coat": "Áo khoác",
    "Dress": "Váy",
    "Skirt": "Chân váy",
    "Shorts": "Quần short",
    "Pants": "Quần dài",
    "Shirt": "Áo sơ mi",
    "T-shirt": "Áo phông",
    "Jacket": "Áo khoác",

    # Sports & food
    "Baseball bat": "Gậy bóng chày",
    "Baseball glove": "Găng tay bóng chày",
    "Skateboard": "Ván trượt",
    "Surfboard": "Ván lướt sóng",
    "Tennis racket": "Vợt tennis",
    "Bowl": "Bát",
    "Cup": "Cốc",
    "Fork": "Nĩa",
    "Knife": "Dao",
    "Spoon": "Thìa",
    "Pizza": "Pizza",
    "Cake": "Bánh ngọt",
    "Donut": "Bánh vòng",
    "Sandwich": "Bánh mì kẹp",
    "Bottle": "Chai",
    "Wine glass": "Ly rượu",

    # Equipment & tools
    "Skis": "Ván trượt tuyết",
    "Snowboard": "Ván trượt",
    "Umbrella": "Dù",

    # Lamps & lighting ("Lamp" itself is already defined under household items)
    "Ceiling lamp": "Đèn trần",
    "Street lamp": "Đèn đường",
    "Lantern": "Đèn lồng",
    "Flashlight": "Đèn pin",
    "Neon sign": "Bảng đèn neon",
    "Chandelier": "Đèn chùm",
    "Candle": "Nến"

}

COLOR_MAP = {
    "Gray": "Xám",
    "Black": "Đen",
    "White": "Trắng",
    "Red": "Đỏ",
    "Blue": "Xanh dương",
    "Green": "Xanh lá",
    "Yellow": "Vàng",
    "Brown": "Nâu",
    "Orange": "Cam",
    "Pink": "Hồng",
    "Purple": "Tím"
}

In [20]:
from collections import Counter
import json

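def build_text_for_embedding(metadata, img_w=1.0, img_h=1.0):
    """Build the text to embed from one frame's metadata.

    `objects` may be a dict of {name: count} or a list of detection dicts
    with `name`, `id`, `color`, and `bbox` as [x_min, y_min, x_max, y_max].
    """
    # 'objects' is either a dictionary of counts or a list of detections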
    objects_data = metadata.get("objects", {})

    if isinstance(objects_data, dict):
        # If it's a dictionary of counts, format it directly
        if not objects_data:
            objects_text = ""
        else:
            objects_text = ", ".join([f"{OBJ_NAME_MAP.get(name, name)}: {count}" for name, count in objects_data.items()])

    elif isinstance(objects_data, list):
        # If it's a list (for backward compatibility or other data), process as before
        objs = [o for o in objects_data if isinstance(o, dict)]
        if not objs:
            objects_text = ""
            print("Warning: 'objects' list contains no valid dictionaries.")
        else:
            names_mapped = []
            for o in objs:
                if isinstance(o, dict):
                    names_mapped.append(OBJ_NAME_MAP.get(o.get("name", "Unknown"), o.get("name", "Unknown")))
                else:
                    print(f"Warning: Skipping non-dictionary element in 'objects' list: {o}")

            counts = Counter(names_mapped)

            objs_sorted = sorted(objs, key=lambda x: x.get("id", 0))

            type_counter = {}
            desc_list = []

            for o in objs_sorted:
                if isinstance(o, dict):
                    name_orig = o.get("name", "Unknown")
                    name = OBJ_NAME_MAP.get(name_orig, name_orig)
                    color_orig = o.get("color", "")
                    color = COLOR_MAP.get(color_orig, color_orig)
                    bbox = o.get("bbox", None)

                    pos_text = ""
                    size_text = ""

                    if bbox:
                        x_min, y_min, x_max, y_max = bbox
                        w = x_max - x_min
                        h = y_max - y_min
                        area_ratio = (w * h) / (img_w * img_h)

                        cx = (x_min + x_max) / 2
                        cy = (y_min + y_max) / 2
                        horiz = "trái" if cx < img_w/3 else "phải" if cx > 2*img_w/3 else "giữa"
                        vert = "trên" if cy < img_h/3 else "dưới" if cy > 2*img_h/3 else "giữa"
                        pos_text = f"ở {horiz}-{vert}"

                        if area_ratio > 0.2: size_text = "rất to"
                        elif area_ratio > 0.05: size_text = "to"
                        elif area_ratio > 0.01: size_text = "vừa"
                        else: size_text = "nhỏ"

                    idx = type_counter.get(name, 0) + 1
                    type_counter[name] = idx

                    order_text = number_to_vietnamese_order(idx)

                    desc = f"{name} {order_text} ({counts[name]})"
                    if color: desc += f" màu {color}"
                    if size_text: desc += f" ({size_text})"
                    if pos_text: desc += f" {pos_text}"
                    desc_list.append(desc)
                else:
                     print(f"Warning: Skipping non-dictionary element in 'objects_sorted' list: {o}")


            objects_text = ", ".join(desc_list)
    else:
        # Handle cases where 'objects' is neither a dict nor a list
        objects_text = ""
        print(f"Warning: 'objects' is not a dictionary or list: {objects_data}")


    combined = f"""
Tiêu đề: {metadata.get("title", "")}
Tiếng nói: {metadata.get("asr_text", "")}
Phân loại tiếng: {metadata.get("sound", "")}
Các vật thể: {objects_text}
""".strip()

    # print(combined) # Comment out print to avoid verbose output
    return combined
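
To make the position and size rules in `build_text_for_embedding` concrete, the sketch below isolates them into a standalone helper (`describe_bbox` is a hypothetical name) and applies it to a bounding box taken from the search output shown later in this notebook. It assumes `bbox` is `[x_min, y_min, x_max, y_max]` in the same units as `img_w` and `img_h`.

```python
from typing import Sequence, Tuple


def describe_bbox(
    bbox: Sequence[float], img_w: float = 1.0, img_h: float = 1.0
) -> Tuple[str, str]:
    """Return (position, size) labels using the same thresholds as above."""
    x_min, y_min, x_max, y_max = bbox
    area_ratio = ((x_max - x_min) * (y_max - y_min)) / (img_w * img_h)

    # Split the image into a 3x3 grid and locate the box centre
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    horiz = "trái" if cx < img_w / 3 else "phải" if cx > 2 * img_w / 3 else "giữa"
    vert = "trên" if cy < img_h / 3 else "dưới" if cy > 2 * img_h / 3 else "giữa"

    # Bucket the box by the fraction of the image it covers
    if area_ratio > 0.2:
        size = "rất to"  # very large
    elif area_ratio > 0.05:
        size = "to"  # large
    elif area_ratio > 0.01:
        size = "vừa"  # medium
    else:
        size = "nhỏ"  # small
    return f"ở {horiz}-{vert}", size


# "Clothing" box from the sample result: centre (0.34, 0.75), area about 0.028
print(describe_bbox([0.27, 0.65, 0.41, 0.85]))  # → ('ở giữa-dưới', 'vừa')
```

The labels match the description generated for that object in the search output ("(vừa) ở giữa-dưới", that is, medium size at centre-bottom).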

#### Create text node

In [21]:
import os
import json
from typing import List

# Build one TextNode per item in the JSON files
base_dir = "C:/Users/mt200/OneDrive/Desktop/AI/AI_challenge/feature_extraction/embedding-text"

text_nodes: List[TextNode] = []

# Iterate over every JSON file in the directory
for filename in os.listdir(base_dir):
    if not filename.endswith(".json"):
        continue
    file_path = os.path.join(base_dir, filename)
    print(filename)
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError as e:
        print(f"❌ Error reading file {filename}: {e}")
        continue

    # A file may contain either a list of items or a single object
    if isinstance(data, list):
        items = data
    elif isinstance(data, dict):
        items = [data]
    else:
        items = []

    for item in items:
        metadata = item.get("metadata", {})
        text_metadata = build_text_for_embedding(
            metadata,
            img_w=item.get("width", 1.0),
            img_h=item.get("height", 1.0),
        )
        text_nodes.append(
            TextNode(
                text=text_metadata,  # the text field is required
                id_=item.get("id_") or item.get("id"),
                embedding=item.get("embedding"),
                metadata=metadata,
            )
        )

print(f"✅ Total TextNodes: {len(text_nodes)}")


L21_V001.json
L21_V002.json
L21_V003.json
L21_V005.json
L21_V006.json
L21_V007.json
L21_V008.json
L21_V009.json
L21_V010.json
L21_V011.json
L21_V012.json
L21_V013.json
L21_V014.json
L21_V015.json
L21_V016.json
L21_V017.json
L21_V018.json
L21_V019.json
L21_V021.json
L21_V022.json
L21_V023.json
L21_V024.json
L21_V025.json
L21_V026.json
L21_V027.json
L21_V028.json
L21_V029.json
L21_V030.json
L21_V031.json
L22_V001.json
L22_V002.json
L22_V003.json
L22_V004.json
L22_V005.json
L22_V006.json
L22_V007.json
L22_V008.json
L22_V009.json
L22_V010.json
L22_V011.json
L22_V012.json
L22_V013.json
L22_V014.json
L22_V015.json
L22_V016.json
L22_V017.json
L22_V018.json
L22_V019.json
L22_V020.json
L22_V021.json
L22_V022.json
L22_V023.json
L22_V024.json
L22_V025.json
L22_V026.json
L22_V027.json
L22_V028.json
L22_V029.json
L22_V030.json
L22_V031.json
L23_V001.json
L23_V002.json
L23_V003.json
L23_V004.json
L23_V005.json
L23_V006.json
L23_V007.json
L23_V008.json
L23_V009.json
L23_V010.json
L23_V011.json
L23_V0

In [None]:
print(f"✅ Total TextNodes: {len(text_nodes)}")

✅ Total TextNodes: 85193


In [25]:
print(text_nodes[0])

Node ID: L21_V001_F001
Text: Tiêu đề: 60 Giây Sáng - Ngày 01082024 - HTV Tin Tức Mới Nhất
2024 Tiếng nói:  Phân loại tiếng: Music:0.84;Choir:0.02;Singing:0.01
Các vật thể: Đèn lồng: 2, Tòa nhà chọc trời: 5


### Add TextNodes to Vertex AI Vector Search

In [28]:
# setup storage
vector_store = VertexAIVectorStore(
    project_id=PROJECT_ID,
    region=REGION,
    index_id=vs_index.resource_name,
    endpoint_id=vs_endpoint.resource_name,
    gcs_bucket_name=GCS_BUCKET_NAME,
    credentials_path="C:/Users/mt200/OneDrive/Desktop/AI/AI_challenge/software/back-end/service-account.json"
)

# set storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [31]:
import time

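BATCH_SIZE = 500  # reduce the batch size if needed
DELAY_SEC = 60    # delay between batches (depends on your quota)

# This run starts at offset 18000; use 0 to upload every node from the beginning
for i in range(18000, len(text_nodes), BATCH_SIZE):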
    batch = text_nodes[i:i+BATCH_SIZE]
    vector_store.add(batch)
    print(f"Added batch {i} -> {i+len(batch)}")
    time.sleep(DELAY_SEC)

Upserting datapoints MatchingEngineIndex index: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
MatchingEngineIndex index Upserted datapoints. Resource name: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
Added batch 18000 -> 18500
Upserting datapoints MatchingEngineIndex index: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
MatchingEngineIndex index Upserted datapoints. Resource name: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
Added batch 18500 -> 19000
Upserting datapoints MatchingEngineIndex index: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
MatchingEngineIndex index Upserted datapoints. Resource name: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
Added batch 19000 -> 19500
Upserting datapoints MatchingEngineIndex index: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
MatchingEngineIndex ind
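
The upload loop above can be wrapped in a small helper so that the resume offset and the pause are explicit, and a failed batch reports where to restart. This is a sketch under assumptions: `upload_in_batches` is a hypothetical helper, and `add_fn` stands in for `vector_store.add`.

```python
import time
from typing import Callable, List, Sequence


def upload_in_batches(
    nodes: Sequence,
    add_fn: Callable[[Sequence], object],
    batch_size: int = 500,
    delay_sec: float = 60.0,
    start: int = 0,
) -> int:
    """Send `nodes[start:]` to `add_fn` in batches; return the next offset to upload.

    On failure the exception is re-raised after printing the offset to resume from.
    """
    offset = start
    while offset < len(nodes):
        batch = nodes[offset : offset + batch_size]
        try:
            add_fn(batch)
        except Exception:
            print(f"Batch failed; resume with start={offset}")
            raise
        print(f"Added batch {offset} -> {offset + len(batch)}")
        offset += len(batch)
        if offset < len(nodes):
            time.sleep(delay_sec)  # pause between batches to respect quota
    return offset


# Dry run with a fake add function instead of vector_store.add
received: List[int] = []
next_offset = upload_in_batches(
    list(range(12)),
    lambda batch: received.append(len(batch)),
    batch_size=5,
    delay_sec=0,
    start=2,
)
print(received, next_offset)
```

In this notebook the equivalent call would be `upload_in_batches(text_nodes, vector_store.add, start=18000)`.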

## Indexing text from Vector Search in Google Cloud

In [43]:
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Initialize the embedding model
embed_model = HuggingFaceEmbedding(
    model_name="Qwen/Qwen3-Embedding-0.6B",
    device="cpu",  
)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


In [44]:
# Build the index on top of the existing vector store
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=embed_model
)

In [None]:
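# Load the raw tokenizer and model used by the manual embedding helper below
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
model.eval()


def get_dense_embedding(text: str):
    """Embed `text` with masked mean pooling over the last hidden state.

    Note: the Qwen3-Embedding model card uses last-token pooling, so these
    vectors may differ from the ones produced by `embed_model` above.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
        last_hidden_state = outputs.last_hidden_state  # (batch, seq_len, hidden)

        # Use the attention mask to ignore padding tokens
        mask = inputs["attention_mask"].unsqueeze(-1).expand(last_hidden_state.size())
        masked_embeddings = last_hidden_state * mask

        # Mean pooling
        sum_embeddings = masked_embeddings.sum(dim=1)
        sum_mask = mask.sum(dim=1)
        embeddings = sum_embeddings / sum_mask

        emb = embeddings[0].cpu().numpy().tolist()
    return emb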
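
The masked mean pooling in `get_dense_embedding` averages only the hidden states of real tokens and ignores padding. The same arithmetic on a tiny hand-made example, written in plain Python so it runs without PyTorch (`masked_mean` is a hypothetical helper):

```python
from typing import List


def masked_mean(
    token_vectors: List[List[float]], attention_mask: List[int]
) -> List[float]:
    """Average the vectors whose mask entry is 1, ignoring padded positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding token whose values must not leak in
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(hidden, mask))  # → [2.0, 3.0]
```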

### Search

In [46]:
# simple similarity search without filter
# Query (Vietnamese): "A man is playing football on a grass field"
retriever = index.as_retriever(similarity_top_k=10)
response = retriever.retrieve("Một người đàn ông đang chơi bóng đá trên sân cỏ")

for row in response:
    print(f"Text: {row.get_text()}")
    print(f"   Score: {row.get_score():.3f}")
    print(f"   Metadata: {row.metadata}")

Text: Tiêu đề: Té ngã thì mình đứng lên mình chơi tiếp...#htvsports #htv #htvthethao #lansurong #cupcholonhtv2024
Tiếng nói: 
Phân loại tiếng: Music:0.98;Musical instrument:0.07;Keyboard (musical):0.07
Các vật thể: Clothing thứ nhất (1) màu Xám (vừa) ở giữa-dưới, Người thứ nhất (1) màu Xám (vừa) ở giữa-dưới
   Score: 24.779
   Metadata: {'id': 'L24_V044_F033', 'timestamp': '00:00:24', 'frame_idx': 639, 'image_url': 'https://storage.googleapis.com/test-video-retrieval/dataset/unzips/Keyframes_L24/keyframes/L24_V044/033.jpg', 'asr_text': '', 'sound': 'Music:0.98;Musical instrument:0.07;Keyboard (musical):0.07', 'objects': [{'name': 'Clothing', 'id': 1, 'score': 0.61, 'bbox': [0.27, 0.65, 0.41, 0.85], 'center': [0.34, 0.75], 'color': 'Gray'}, {'name': 'Person', 'id': 1, 'score': 0.55, 'bbox': [0.27, 0.62, 0.41, 0.88], 'center': [0.34, 0.75], 'color': 'Gray'}], 'title': 'Té ngã thì mình đứng lên mình chơi tiếp...#htvsports #htv #htvthethao #lansurong #cupcholonhtv2024', 'author': 'HTV Spor

In [None]:
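# similarity search with text filter
# NOTE: the `color` key and the "denims" query look like placeholder values;
# the metadata printed above has no top-level `color` field, so adapt them to your data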
filters = MetadataFilters(filters=[MetadataFilter(key="color", value="blue")])
retriever = index.as_retriever(filters=filters, similarity_top_k=100)
response = retriever.retrieve("denims")

for row in response:
    print(f"Text: {row.get_text()}")
    print(f"   Score: {row.get_score():.3f}")
    print(f"   Metadata: {row.metadata}")

In [None]:
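# similarity search with text and numeric filter
# NOTE: `color` and `price` look like placeholder keys; adapt them to fields present in your metadata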
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="color", value="blue"),
        MetadataFilter(key="price", operator=FilterOperator.GT, value=70.0),
    ]
)
retriever = index.as_retriever(filters=filters, similarity_top_k=3)
response = retriever.retrieve("denims")

for row in response:
    print(f"Text: {row.get_text()}")
    print(f"   Score: {row.get_score():.3f}")
    print(f"   Metadata: {row.metadata}")
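
The frame metadata printed in the search output above nests detections under `objects`, so one simple option (an assumption, and not the vector store's own filtering) is to post-filter retrieved rows on the client. The sketch below works on plain metadata dictionaries; `has_object` is a hypothetical helper and the sample rows are illustrative, loosely modeled on the shapes shown in this notebook.

```python
from typing import Any, Dict, List, Optional


def has_object(
    metadata: Dict[str, Any], name: str, color: Optional[str] = None
) -> bool:
    """Return True if `metadata["objects"]` lists an object with this name (and color)."""
    objects = metadata.get("objects", [])
    if isinstance(objects, dict):
        # Count-style metadata ({name: count}) carries no color information
        return color is None and objects.get(name, 0) > 0
    return any(
        isinstance(o, dict)
        and o.get("name") == name
        and (color is None or o.get("color") == color)
        for o in objects
    )


# Illustrative rows: one list-style and one count-style `objects` field
rows: List[Dict[str, Any]] = [
    {"id": "L24_V044_F033", "objects": [{"name": "Person", "color": "Gray"}]},
    {"id": "L21_V001_F001", "objects": {"Lantern": 2, "Skyscraper": 5}},
]
matches = [r["id"] for r in rows if has_object(r, "Person", color="Gray")]
print(matches)
```

With retriever results, the same helper could be applied as `[row for row in response if has_object(row.metadata, "Person")]`.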