# Google Vertex AI Vector Search

This notebook shows how to use functionality related to the `Google Cloud Vertex AI Vector Search` vector database.

> [Google Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview), formerly known as Vertex AI Matching Engine, provides the industry's leading high-scale low latency vector database. These vector databases are commonly referred to as vector similarity-matching or an approximate nearest neighbor (ANN) service.

**Note**: LlamaIndex expects the Vertex AI Vector Search endpoint and the deployed index to already exist. Creating an empty index can take up to a minute, and deploying an index to the endpoint can take up to 30 minutes.

> To see how to create an index, refer to the section [Create Index and deploy it to an Endpoint](#create-index-and-deploy-it-to-an-endpoint).  
If you already have an index deployed, skip to [Create Vector Store from texts](#create-vector-store-from-texts).

## Installation

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

```sh
%pip install llama-index llama-index-vector-stores-vertexaivectorsearch llama-index-llms-vertex 

%pip install google-cloud-aiplatform
```

## Create Index and deploy it to an Endpoint

- This section demonstrates creating a new index and deploying it to an endpoint.

In [4]:
# TODO : Set values as per your requirements

# Project and Storage Constants
PROJECT_ID = "gen-lang-client-0974620078"
REGION = "asia-southeast1"
GCS_BUCKET_NAME = "text-retrieval"
GCS_BUCKET_URI = f"gs://{GCS_BUCKET_NAME}"

# The index dimension must match the embedding model's output size.
# Qwen/Qwen3-Embedding-0.6B (used later in this notebook) produces 1024-dimensional vectors;
# textembedding-gecko@003 would need 768 instead.
VS_DIMENSIONS = 1024

# Vertex AI Vector Search Index configuration
# Parameter descriptions:
# https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex#google_cloud_aiplatform_MatchingEngineIndex_create_tree_ah_index
VS_INDEX_NAME = "text-retrieval-index"  # @param {type:"string"}
VS_INDEX_ENDPOINT_NAME = "text-retrieval-endpoint"  # @param {type:"string"}
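
Because the index dimension is fixed at creation time, it can help to check vector lengths before upserting anything. The following is a minimal sketch, assuming a hypothetical helper `check_embedding_dims` (not part of any library) and embeddings held as plain Python lists:

```python
from typing import Sequence


def check_embedding_dims(embedding: Sequence[float], expected_dims: int) -> None:
    """Raise ValueError if the vector length differs from the index dimension."""
    if len(embedding) != expected_dims:
        raise ValueError(
            f"Embedding has {len(embedding)} dimensions, "
            f"but the index expects {expected_dims}"
        )


# Hypothetical usage with the VS_DIMENSIONS value chosen above
check_embedding_dims([0.0] * 1024, expected_dims=1024)  # passes silently
```

Running a check like this on the first node of each batch is usually enough to catch a mismatch between the embedder and the index early.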

In [5]:
from google.cloud import aiplatform
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "C:/Users/mt200/OneDrive/Desktop/AI/AI_challenge/software/back-end/service-account.json"
)

In [6]:
aiplatform.init(project=PROJECT_ID, location=REGION, credentials=credentials)

### Create Cloud Storage bucket

In [23]:
! gsutil mb -l $REGION -p $PROJECT_ID $GCS_BUCKET_URI

Creating gs://text-retrieval/...


### Create an empty Index

**Note:** When creating an index, you should specify an `index_update_method`: either `BATCH_UPDATE` or `STREAM_UPDATE`.

> A batch index is for when you want to update your index in a batch, with data which has been stored over a set amount of time, like systems which are processed weekly or monthly.
>
> A streaming index is when you want index data to be updated as new data is added to your datastore, for instance, if you have a bookstore and want to show new inventory online as soon as possible.
>
> Which type you choose is important, since setup and requirements are different.

Refer to the [official documentation](https://cloud.google.com/vertex-ai/docs/vector-search/create-manage-index) and the [API reference](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex#google_cloud_aiplatform_MatchingEngineIndex_create_tree_ah_index) for more details on configuring indexes.

In [7]:
# NOTE: This operation can take up to 30 seconds

# check if index exists
index_names = [
    index.resource_name
    for index in aiplatform.MatchingEngineIndex.list(
        filter=f"display_name={VS_INDEX_NAME}"
    )
]

if len(index_names) == 0:
    print(f"Creating Vector Search index {VS_INDEX_NAME} ...")
    vs_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
        display_name=VS_INDEX_NAME,
        dimensions=VS_DIMENSIONS,
        distance_measure_type="DOT_PRODUCT_DISTANCE",
        shard_size="SHARD_SIZE_SMALL",
        index_update_method="STREAM_UPDATE",  
        approximate_neighbors_count=200,
    )
    print(
        f"Vector Search index {vs_index.display_name} created with resource name {vs_index.resource_name}"
    )
else:
    vs_index = aiplatform.MatchingEngineIndex(index_name=index_names[0])
    print(
        f"Vector Search index {vs_index.display_name} exists with resource name {vs_index.resource_name}"
    )

Vector Search index text-retrieval-index exists with resource name projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
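
The index above ranks neighbors with `DOT_PRODUCT_DISTANCE`. As a small illustration of why embeddings are often L2-normalized before being stored in such an index, the sketch below (pure Python, with hypothetical helper names `l2_normalize` and `dot`) shows that the dot product of unit-length vectors behaves like cosine similarity: close to 1 for the same direction and close to 0 for orthogonal vectors.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


query = l2_normalize([3.0, 4.0])
doc_same_direction = l2_normalize([6.0, 8.0])  # same direction, different length
doc_orthogonal = l2_normalize([-4.0, 3.0])     # perpendicular to the query

print(dot(query, doc_same_direction))  # approximately 1.0
print(dot(query, doc_orthogonal))      # approximately 0.0
```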


### Create an Endpoint

To use the index, you need to create an index endpoint. It works as a server instance accepting query requests for your index. An endpoint can be a [public endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public) or a [private endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-vpc).

Let's create a public endpoint.

In [8]:
endpoint_names = [
    endpoint.resource_name
    for endpoint in aiplatform.MatchingEngineIndexEndpoint.list(
        filter=f"display_name={VS_INDEX_ENDPOINT_NAME}"
    )
]

if len(endpoint_names) == 0:
    print(
        f"Creating Vector Search index endpoint {VS_INDEX_ENDPOINT_NAME} ..."
    )
    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
        display_name=VS_INDEX_ENDPOINT_NAME, public_endpoint_enabled=True
    )
    print(
        f"Vector Search index endpoint {vs_endpoint.display_name} created with resource name {vs_endpoint.resource_name}"
    )
else:
    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=endpoint_names[0]
    )
    print(
        f"Vector Search index endpoint {vs_endpoint.display_name} exists with resource name {vs_endpoint.resource_name}"
    )

Vector Search index endpoint text-retrieval-endpoint exists with resource name projects/284454080854/locations/asia-southeast1/indexEndpoints/953056678956236800


### Deploy Index to the Endpoint

With the index endpoint, deploy the index by specifying a unique deployed index ID.

**NOTE: This operation can take up to 30 minutes.**

In [26]:
deployed_index_id="text_retrieval_deploy_index"
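
A deployed index ID has naming restrictions. As far as I understand the API reference, it must start with a letter, contain only letters, digits, and underscores, and be at most 128 characters long; treat that rule as an assumption and verify it against the current documentation. A minimal sketch of a local check, using a hypothetical helper `is_valid_deployed_index_id`:

```python
import re

# Assumed rule: a leading letter, then letters, digits, or underscores (max 128 chars)
_DEPLOYED_INDEX_ID_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_deployed_index_id(candidate: str) -> bool:
    """Return True if the candidate satisfies the assumed naming rule."""
    return _DEPLOYED_INDEX_ID_RE.fullmatch(candidate) is not None


print(is_valid_deployed_index_id("text_retrieval_deploy_index"))  # True
print(is_valid_deployed_index_id("text-retrieval-deploy-index"))  # False (hyphens)
```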

In [None]:
# check whether the index is already deployed to an endpoint
index_endpoints = [
    (deployed_index.index_endpoint, deployed_index.deployed_index_id)
    for deployed_index in vs_index.deployed_indexes
]

if len(index_endpoints) == 0:
    print(
        f"Deploying Vector Search index {vs_index.display_name} at endpoint {vs_endpoint.display_name} ..."
    )
    vs_deployed_index = vs_endpoint.deploy_index(
        index=vs_index,
        deployed_index_id=deployed_index_id,  # ✅ valid ID
        display_name=VS_INDEX_NAME,
        machine_type="e2-standard-16",
        min_replica_count=1,
        max_replica_count=1,
    )
    print(
        f"Vector Search index {vs_index.display_name} is deployed at endpoint {vs_deployed_index.display_name}"
    )
else:
    vs_deployed_index = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=index_endpoints[0][0]
    )
    print(
        f"Vector Search index {vs_index.display_name} is already deployed at endpoint {vs_deployed_index.display_name}"
    )

Deploying Vector Search index text-retrieval-index at endpoint text-retrieval-endpoint ...
Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/284454080854/locations/asia-southeast1/indexEndpoints/953056678956236800


Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/284454080854/locations/asia-southeast1/indexEndpoints/953056678956236800/operations/6352638985952559104


## Create Vector Store from texts

NOTE: If you already have a Vertex AI Vector Search index and endpoint, you can load them with the following code:

In [9]:
aiplatform.init(project=PROJECT_ID, location=REGION, credentials=credentials)

In [12]:
# Get index by display name
indexes = aiplatform.MatchingEngineIndex.list(filter='display_name="text-retrieval-index"')
if not indexes:
    raise ValueError("Index with display_name='text-retrieval-index' not found.")

# Use resource_name, not display_name
vs_index = aiplatform.MatchingEngineIndex(index_name=indexes[0].resource_name)
print(f"✅ Index loaded: {vs_index.resource_name}")

# Same for endpoint
endpoints = aiplatform.MatchingEngineIndexEndpoint.list(filter='display_name="text-retrieval-endpoint"')
if not endpoints:
    raise ValueError("Endpoint with display_name='text-retrieval-endpoint' not found.")

vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(index_endpoint_name=endpoints[0].resource_name)
print(f"✅ Endpoint loaded: {vs_endpoint.resource_name}")

✅ Index loaded: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
✅ Endpoint loaded: projects/284454080854/locations/asia-southeast1/indexEndpoints/953056678956236800


### Create a simple vector store from plain text without metadata filters

In [13]:
# import modules needed
from llama_index.core import (
    StorageContext,
    Settings,
    VectorStoreIndex,
    SimpleDirectoryReader,
)
from llama_index.core.schema import TextNode, ImageNode
from llama_index.core.vector_stores.types import (
    MetadataFilters,
    MetadataFilter,
    FilterOperator,
)
from llama_index.llms.vertex import Vertex
from llama_index.embeddings.vertex import VertexTextEmbedding
from llama_index.vector_stores.vertexaivectorsearch import VertexAIVectorStore

### Create Text Node

#### Take the text from metadata

In [17]:
# Convert a 1-based position into Vietnamese ordinal words (valid for 1 to 99)
def number_to_vietnamese_order(n):
    units = ["", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]
    tens = ["", "mười", "hai mươi", "ba mươi", "bốn mươi", "năm mươi",
            "sáu mươi", "bảy mươi", "tám mươi", "chín mươi"]

    # The first ten ordinals have fixed forms
    if n <= 10:
        first10 = ["thứ nhất", "thứ hai", "thứ ba", "thứ tư", "thứ năm",
                   "thứ sáu", "thứ bảy", "thứ tám", "thứ chín", "thứ mười"]
        return first10[n-1]

    ten = n // 10
    unit = n % 10

    if ten == 1:
        text = "mười"
    else:
        text = tens[ten]

    # Units 1 and 5 change form after a tens word
    if unit == 0:
        order_text = f"thứ {text}"
    elif unit == 1 and ten > 1:
        order_text = f"thứ {text} mốt"
    elif unit == 5:
        order_text = f"thứ {text} lăm"
    else:
        order_text = f"thứ {text} {units[unit]}"

    return order_text


In [18]:
OBJ_NAME_MAP = {
    # People & gender
    "Person": "Người",
    "Man": "Đàn ông",
    "Woman": "Phụ nữ",
    "Boy": "Cậu bé",
    "Girl": "Cô bé",
    "Baby": "Em bé",

    # Animals
    "Bird": "Chim",
    "Cat": "Mèo",
    "Dog": "Chó",
    "Horse": "Ngựa",
    "Sheep": "Cừu",
    "Cow": "Bò",
    "Elephant": "Voi",
    "Bear": "Gấu",
    "Zebra": "Ngựa vằn",
    "Giraffe": "Hươu cao cổ",
    "Rabbit": "Thỏ",
    "Fox": "Cáo",
    "Panda": "Gấu trúc",
    "Tiger": "Hổ",
    "Lion": "Sư tử",
    "Deer": "Hươu",
    "Monkey": "Khỉ",
    "Snake": "Rắn",
    "Frog": "Ếch",
    "Turtle": "Rùa",
    "Whale": "Cá voi",
    "Dolphin": "Cá heo",

    # Vehicles & transportation
    "Bicycle": "Xe đạp",
    "Bicycle wheel": "Bánh xe đạp",
    "Bicycle helmet": "Mũ bảo hiểm xe đạp",
    "Car": "Ô tô",
    "Motorcycle": "Xe máy",
    "Airplane": "Máy bay",
    "Bus": "Xe buýt",
    "Train": "Tàu hỏa",
    "Truck": "Xe tải",
    "Boat": "Thuyền",
    "Ship": "Tàu lớn",
    "Helicopter": "Trực thăng",
    "Scooter": "Xe tay ga",
    "Subway": "Tàu điện ngầm",

    # Buildings & architecture
    "Skyscraper": "Tòa nhà chọc trời",
    "Building": "Tòa nhà",
    "House": "Ngôi nhà",
    "Apartment": "Căn hộ",
    "Castle": "Lâu đài",
    "Bridge": "Cầu",
    "Tower": "Tháp",
    "Road": "Đường",
    "Street": "Phố",
    "Parking lot": "Bãi đỗ xe",
    "Playground": "Sân chơi",

    # Natural scenery
    "Beach": "Bãi biển",
    "Mountain": "Núi",
    "River": "Sông",
    "Lake": "Hồ",
    "Forest": "Rừng",
    "Desert": "Sa mạc",
    "Waterfall": "Thác nước",
    "Volcano": "Núi lửa",
    "Cave": "Hang động",
    "Island": "Đảo",
    "Glacier": "Sông băng",

    # Plants, fruit, and vegetables
    "Tree": "Cây",
    "Palm tree": "Cây cọ",
    "Pine tree": "Cây thông",
    "Flower": "Hoa",
    "Rose": "Hoa hồng",
    "Sunflower": "Hoa hướng dương",
    "Tulip": "Hoa tulip",
    "Cactus": "Xương rồng",
    "Banana": "Chuối",
    "Apple": "Táo",
    "Orange": "Cam",
    "Lemon": "Chanh",
    "Strawberry": "Dâu tây",
    "Watermelon": "Dưa hấu",
    "Tomato": "Cà chua",
    "Carrot": "Cà rốt",
    "Broccoli": "Bông cải xanh",
    "Potato": "Khoai tây",
    "Onion": "Hành tây",

    # Toys
    "Teddy bear": "Gấu bông",
    "Ball": "Bóng",
    "Puzzle": "Trò chơi ghép hình",
    "Doll": "Búp bê",
    "Toy car": "Xe đồ chơi",
    "Lego": "Lego",
    "Yo-yo": "Con quay",
    "Frisbee": "Đĩa bay",

    # Household items & electronics
    "Chair": "Ghế",
    "Couch": "Sofa",
    "Table": "Bàn",
    "Lamp": "Đèn",
    "Fan": "Quạt",
    "Refrigerator": "Tủ lạnh",
    "Microwave": "Lò vi sóng",
    "Oven": "Lò nướng",
    "Toaster": "Máy nướng bánh",
    "Sink": "Bồn rửa",
    "Washing machine": "Máy giặt",
    "Vacuum cleaner": "Máy hút bụi",
    "Laptop": "Máy tính xách tay",
    "Desktop computer": "Máy tính để bàn",
    "Printer": "Máy in",
    "Camera": "Máy ảnh",
    "Phone": "Điện thoại",
    "TV": "Tivi",
    "Remote": "Điều khiển",

    # Clothing & accessories
    "Backpack": "Ba lô",
    "Handbag": "Túi xách",
    "Tie": "Cà vạt",
    "Suitcase": "Vali",
    "Hat": "Mũ",
    "Glasses": "Kính",
    "Sunglasses": "Kính mát",
    "Shoes": "Giày",
    "Coat": "Áo khoác",
    "Dress": "Váy",
    "Skirt": "Chân váy",
    "Shorts": "Quần short",
    "Pants": "Quần dài",
    "Shirt": "Áo sơ mi",
    "T-shirt": "Áo phông",
    "Jacket": "Áo khoác",

    # Sports & food
    "Baseball bat": "Gậy bóng chày",
    "Baseball glove": "Găng tay bóng chày",
    "Skateboard": "Ván trượt",
    "Surfboard": "Ván lướt sóng",
    "Tennis racket": "Vợt tennis",
    "Bowl": "Bát",
    "Cup": "Cốc",
    "Fork": "Nĩa",
    "Knife": "Dao",
    "Spoon": "Thìa",
    "Pizza": "Pizza",
    "Cake": "Bánh ngọt",
    "Donut": "Bánh vòng",
    "Sandwich": "Bánh mì kẹp",
    "Bottle": "Chai",
    "Wine glass": "Ly rượu",

    # Gear & tools
    "Skis": "Ván trượt tuyết",
    "Snowboard": "Ván trượt",
    "Umbrella": "Dù",

    # Lamps & lighting ("Lamp" is already defined above)
    "Ceiling lamp": "Đèn trần",
    "Street lamp": "Đèn đường",
    "Lantern": "Đèn lồng",
    "Flashlight": "Đèn pin",
    "Neon sign": "Bảng đèn neon",
    "Chandelier": "Đèn chùm",
    "Candle": "Nến",
}

COLOR_MAP = {
    "Gray": "Xám",
    "Black": "Đen",
    "White": "Trắng",
    "Red": "Đỏ",
    "Blue": "Xanh dương",
    "Green": "Xanh lá",
    "Yellow": "Vàng",
    "Brown": "Nâu",
    "Orange": "Cam",
    "Pink": "Hồng",
    "Purple": "Tím"
}

In [20]:
from collections import Counter
import json

def build_text_for_embedding(metadata, img_w=1.0, img_h=1.0):
    # Check if 'objects' is a dictionary of counts
    objects_data = metadata.get("objects", {})

    if isinstance(objects_data, dict):
        # If it's a dictionary of counts, format it directly
        if not objects_data:
            objects_text = ""
        else:
            objects_text = ", ".join([f"{OBJ_NAME_MAP.get(name, name)}: {count}" for name, count in objects_data.items()])

    elif isinstance(objects_data, list):
        # If it's a list (for backward compatibility or other data), process as before
        objs = [o for o in objects_data if isinstance(o, dict)]
        if not objs:
            objects_text = ""
            print("Warning: 'objects' list contains no valid dictionaries.")
        else:
            names_mapped = []
            for o in objs:
                if isinstance(o, dict):
                    names_mapped.append(OBJ_NAME_MAP.get(o.get("name", "Unknown"), o.get("name", "Unknown")))
                else:
                    print(f"Warning: Skipping non-dictionary element in 'objects' list: {o}")

            counts = Counter(names_mapped)

            objs_sorted = sorted(objs, key=lambda x: x.get("id", 0))

            type_counter = {}
            desc_list = []

            for o in objs_sorted:
                if isinstance(o, dict):
                    name_orig = o.get("name", "Unknown")
                    name = OBJ_NAME_MAP.get(name_orig, name_orig)
                    color_orig = o.get("color", "")
                    color = COLOR_MAP.get(color_orig, color_orig)
                    bbox = o.get("bbox", None)

                    pos_text = ""
                    size_text = ""

                    if bbox:
                        x_min, y_min, x_max, y_max = bbox
                        w = x_max - x_min
                        h = y_max - y_min
                        area_ratio = (w * h) / (img_w * img_h)

                        cx = (x_min + x_max) / 2
                        cy = (y_min + y_max) / 2
                        horiz = "trái" if cx < img_w/3 else "phải" if cx > 2*img_w/3 else "giữa"
                        vert = "trên" if cy < img_h/3 else "dưới" if cy > 2*img_h/3 else "giữa"
                        pos_text = f"ở {horiz}-{vert}"

                        if area_ratio > 0.2: size_text = "rất to"
                        elif area_ratio > 0.05: size_text = "to"
                        elif area_ratio > 0.01: size_text = "vừa"
                        else: size_text = "nhỏ"

                    idx = type_counter.get(name, 0) + 1
                    type_counter[name] = idx

                    order_text = number_to_vietnamese_order(idx)

                    desc = f"{name} {order_text} ({counts[name]})"
                    if color: desc += f" màu {color}"
                    if size_text: desc += f" ({size_text})"
                    if pos_text: desc += f" {pos_text}"
                    desc_list.append(desc)
                else:
                     print(f"Warning: Skipping non-dictionary element in 'objects_sorted' list: {o}")


            objects_text = ", ".join(desc_list)
    else:
        # Handle cases where 'objects' is neither a dict nor a list
        objects_text = ""
        print(f"Warning: 'objects' is not a dictionary or list: {objects_data}")


    combined = f"""
Tiêu đề: {metadata.get("title", "")}
Tiếng nói: {metadata.get("asr_text", "")}
Phân loại tiếng: {metadata.get("sound", "")}
Các vật thể: {objects_text}
""".strip()

    # print(combined) # Comment out print to avoid verbose output
    return combined
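
The position and size wording in `build_text_for_embedding` comes from splitting the image into a 3x3 grid and bucketing the bounding-box area ratio at 0.2, 0.05, and 0.01. The sketch below isolates that logic in a hypothetical helper `describe_bbox`. The English labels are illustrative substitutes for the Vietnamese words used above, and the bbox is assumed to be `[x_min, y_min, x_max, y_max]` in the same units as the image size.

```python
from typing import Sequence, Tuple


def describe_bbox(
    bbox: Sequence[float], img_w: float = 1.0, img_h: float = 1.0
) -> Tuple[str, str, str]:
    """Return (horizontal, vertical, size) labels for a bounding box."""
    x_min, y_min, x_max, y_max = bbox
    area_ratio = ((x_max - x_min) * (y_max - y_min)) / (img_w * img_h)

    # The box centre decides which third of the image it falls in
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    horiz = "left" if cx < img_w / 3 else "right" if cx > 2 * img_w / 3 else "center"
    vert = "top" if cy < img_h / 3 else "bottom" if cy > 2 * img_h / 3 else "middle"

    # Same thresholds as the notebook code above
    if area_ratio > 0.2:
        size = "very large"
    elif area_ratio > 0.05:
        size = "large"
    elif area_ratio > 0.01:
        size = "medium"
    else:
        size = "small"
    return horiz, vert, size


print(describe_bbox([0.25, 0.25, 0.75, 0.75]))  # ('center', 'middle', 'very large')
print(describe_bbox([0.0, 0.0, 0.2, 0.2]))      # ('left', 'top', 'medium')
```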

#### Create text node

In [21]:
import os
import json
from typing import List

# Build one TextNode per item found in the JSON files
base_dir = "C:/Users/mt200/OneDrive/Desktop/AI/AI_challenge/feature_extraction/embedding-text"

text_nodes: List[TextNode] = []

# Iterate over all files in the directory
for filename in os.listdir(base_dir):
    if filename.endswith(".json"):
        file_path = os.path.join(base_dir, filename)
        with open(file_path, "r", encoding="utf-8") as f:
            print(filename)
            try:
                data = json.load(f)
            except json.JSONDecodeError as e:
                print(f"❌ Error reading file {filename}: {e}")
                continue

        # A file may contain either an array of items or a single object
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            metadata = item.get("metadata", {})
            text_metadata = build_text_for_embedding(
                metadata,
                img_w=item.get("width", 1.0),
                img_h=item.get("height", 1.0),
            )
            text_nodes.append(
                TextNode(
                    text=text_metadata,  # the text field is required
                    id_=item.get("id_") or item.get("id"),
                    embedding=item.get("embedding"),
                    metadata=metadata,
                )
            )

print(f"✅ Total TextNodes: {len(text_nodes)}")


L21_V001.json
L21_V002.json
L21_V003.json
L21_V005.json
L21_V006.json
L21_V007.json
L21_V008.json
L21_V009.json
L21_V010.json
L21_V011.json
L21_V012.json
L21_V013.json
L21_V014.json
L21_V015.json
L21_V016.json
L21_V017.json
L21_V018.json
L21_V019.json
L21_V021.json
L21_V022.json
L21_V023.json
L21_V024.json
L21_V025.json
L21_V026.json
L21_V027.json
L21_V028.json
L21_V029.json
L21_V030.json
L21_V031.json
L22_V001.json
L22_V002.json
L22_V003.json
L22_V004.json
L22_V005.json
L22_V006.json
L22_V007.json
L22_V008.json
L22_V009.json
L22_V010.json
L22_V011.json
L22_V012.json
L22_V013.json
L22_V014.json
L22_V015.json
L22_V016.json
L22_V017.json
L22_V018.json
L22_V019.json
L22_V020.json
L22_V021.json
L22_V022.json
L22_V023.json
L22_V024.json
L22_V025.json
L22_V026.json
L22_V027.json
L22_V028.json
L22_V029.json
L22_V030.json
L22_V031.json
L23_V001.json
L23_V002.json
L23_V003.json
L23_V004.json
L23_V005.json
L23_V006.json
L23_V007.json
L23_V008.json
L23_V009.json
L23_V010.json
L23_V011.json
L23_V0

In [None]:
print(f"✅ Total TextNodes: {len(text_nodes)}")

✅ Total TextNodes: 85193


In [25]:
print(text_nodes[0])

Node ID: L21_V001_F001
Text: Tiêu đề: 60 Giây Sáng - Ngày 01082024 - HTV Tin Tức Mới Nhất
2024 Tiếng nói:  Phân loại tiếng: Music:0.84;Choir:0.02;Singing:0.01
Các vật thể: Đèn lồng: 2, Tòa nhà chọc trời: 5


### Add TextNodes to Vertex AI Vector Search

In [28]:
# setup storage
vector_store = VertexAIVectorStore(
    project_id=PROJECT_ID,
    region=REGION,
    index_id=vs_index.resource_name,
    endpoint_id=vs_endpoint.resource_name,
    gcs_bucket_name=GCS_BUCKET_NAME,
    credentials_path="C:/Users/mt200/OneDrive/Desktop/AI/AI_challenge/software/back-end/service-account.json"
)

# set storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [31]:
import time

BATCH_SIZE = 500     # reduce the batch size if needed
DELAY_SEC = 60       # delay between batches (depends on your quota)
START_INDEX = 18000  # offset of the first node to upload (use 0 to upload everything)

for i in range(START_INDEX, len(text_nodes), BATCH_SIZE):
    batch = text_nodes[i:i+BATCH_SIZE]
    vector_store.add(batch)
    print(f"Added batch {i} -> {i+len(batch)}")
    time.sleep(DELAY_SEC)

Upserting datapoints MatchingEngineIndex index: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
MatchingEngineIndex index Upserted datapoints. Resource name: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
Added batch 18000 -> 18500
Upserting datapoints MatchingEngineIndex index: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
MatchingEngineIndex index Upserted datapoints. Resource name: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
Added batch 18500 -> 19000
Upserting datapoints MatchingEngineIndex index: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
MatchingEngineIndex index Upserted datapoints. Resource name: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
Added batch 19000 -> 19500
Upserting datapoints MatchingEngineIndex index: projects/284454080854/locations/asia-southeast1/indexes/39221778786025472
MatchingEngineIndex ind
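
The upload loop above slices the node list into fixed-size batches starting from a chosen offset. The same pattern can be factored into a small generator, sketched here with a hypothetical helper `iter_batches` and plain integers standing in for `TextNode` objects:

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iter_batches(
    items: Sequence[T], batch_size: int, start: int = 0
) -> Iterator[Tuple[int, Sequence[T]]]:
    """Yield (offset, batch) pairs, beginning at index `start`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for offset in range(start, len(items), batch_size):
        yield offset, items[offset : offset + batch_size]


nodes = list(range(12))  # stand-in for text_nodes
for offset, batch in iter_batches(nodes, batch_size=5, start=5):
    print(f"batch {offset} -> {offset + len(batch)}")
# prints:
# batch 5 -> 10
# batch 10 -> 12
```

Passing a non-zero `start` makes it easy to resume after an interrupted run without re-uploading earlier batches.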

## Build an index from the Vertex AI Vector Search store

In [32]:
import torch

# Load the embedding model directly from Hugging Face.
# AutoModel (not AutoModelForCausalLM) is the class used below, so it must be imported.
from transformers import AutoTokenizer, AutoModel

model_name = "Qwen/Qwen3-Embedding-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

In [None]:
def get_dense_embedding(text: str):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
        last_hidden_state = outputs.last_hidden_state  # (batch, seq_len, hidden)

        # Attention mask used to exclude padding tokens
        mask = inputs["attention_mask"].unsqueeze(-1).expand(last_hidden_state.size())
        masked_embeddings = last_hidden_state * mask

        # Mean pooling
        sum_embeddings = masked_embeddings.sum(dim=1)
        sum_mask = mask.sum(dim=1)
        embeddings = sum_embeddings / sum_mask

        emb = embeddings[0].cpu().numpy().tolist()
    return emb
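
To make the pooling step concrete, here is the same masked mean computed on plain Python lists. This is only an illustration of the arithmetic (the hypothetical helper `masked_mean_pool` is not part of any library): padded positions are dropped, and the remaining token vectors are averaged per dimension.

```python
from typing import List


def masked_mean_pool(
    token_vectors: List[List[float]], attention_mask: List[int]
) -> List[float]:
    """Average the token vectors whose mask value is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dims = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

Without the mask, the padding row would pull the average far away from the real tokens, which is why the tensor version multiplies by the attention mask before summing.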

In [33]:
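# Define the index from the vector store.
# NOTE: `embed_model` must be a LlamaIndex embedding object (a `BaseEmbedding`
# subclass). The raw transformers `model` loaded above is not one, so it has to
# be wrapped in such a class before this call can work.
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, embed_model=model
)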

### Search

In [None]:
# simple similarity search without filter
retriever = index.as_retriever(similarity_top_k=10)
response = retriever.retrieve("pants")

for row in response:
    print(f"Text: {row.get_text()}")
    print(f"   Score: {row.get_score():.3f}")
    print(f"   Metadata: {row.metadata}")

In [None]:
# similarity search with text filter
filters = MetadataFilters(filters=[MetadataFilter(key="color", value="blue")])
retriever = index.as_retriever(filters=filters, similarity_top_k=100)
response = retriever.retrieve("denims")

for row in response:
    print(f"Text: {row.get_text()}")
    print(f"   Score: {row.get_score():.3f}")
    print(f"   Metadata: {row.metadata}")

In [None]:
# similarity search with text and numeric filter
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="color", value="blue"),
        MetadataFilter(key="price", operator=FilterOperator.GT, value=70.0),
    ]
)
retriever = index.as_retriever(filters=filters, similarity_top_k=3)
response = retriever.retrieve("denims")

for row in response:
    print(f"Text: {row.get_text()}")
    print(f"   Score: {row.get_score():.3f}")
    print(f"   Metadata: {row.metadata}")
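
When several `MetadataFilter` entries are passed together as above, a record has to satisfy all of them. The sketch below mimics that behavior locally on plain dictionaries. It is an illustration only, not how LlamaIndex or Vertex AI evaluate filters (filtering runs on the service side), and the helper `matches` plus the tuple-based condition format are hypothetical.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

# Supported comparison operators for this toy example
OPS: Dict[str, Callable[[Any, Any], bool]] = {"==": operator.eq, ">": operator.gt}


def matches(metadata: Dict[str, Any], conditions: List[Tuple[str, str, Any]]) -> bool:
    """Return True only if every (key, op, value) condition holds."""
    for key, op, value in conditions:
        if key not in metadata or not OPS[op](metadata[key], value):
            return False
    return True


records = [
    {"color": "blue", "price": 80.0},
    {"color": "blue", "price": 50.0},
    {"color": "red", "price": 90.0},
]
conditions = [("color", "==", "blue"), ("price", ">", 70.0)]
print([r for r in records if matches(r, conditions)])
# [{'color': 'blue', 'price': 80.0}]
```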

In [None]:
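query_text = "four women"


# NOTE: this helper expects a CLIP-style `processor` and a `model` that exposes
# `get_text_features`; neither is defined in this notebook.
def embedding_text(texts):
    with torch.no_grad():
        inputs = processor(text=texts, return_tensors="pt", padding=True)
        embeds = model.get_text_features(**inputs)
        # L2-normalize so that dot product behaves like cosine similarity
        embeds = embeds / embeds.norm(dim=-1, keepdim=True)
    return embeds.cpu().numpy().tolist()


embed_text = embedding_text(query_text)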

In [None]:
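# Test query against the deployed index.
# `find_neighbors` is a method of the index endpoint, not of the LlamaIndex `index` object.
response = vs_endpoint.find_neighbors(
    deployed_index_id=deployed_index_id,
    queries=embed_text,
    num_neighbors=20,
)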

In [None]:
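import numpy as np

# NOTE: `df` is assumed to be a pandas DataFrame of the source records whose row
# positions match integer datapoint IDs; it is not defined in this notebook.
for neighbor in response[0]:
    row_idx = np.int64(neighbor.id)
    # Use iloc to access the row by position
    similar = df.iloc[row_idx]
    print(similar)
    print(similar.metadata)
    print(f"{neighbor.distance:.4f}")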

## Run a Query

In [None]:

from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import (
    HybridQuery,
)
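# Create a HybridQuery. The query text means "there is one woman".
# NOTE: `get_sparse_embedding` is not defined in this notebook and must be supplied.
query_text = "có 1 người phụ nữ"
query_dense_emb = get_dense_embedding(query_text)
query_sparse_emb = get_sparse_embedding(query_text)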
query = HybridQuery(
    dense_embedding=query_dense_emb,
    sparse_embedding_dimensions=query_sparse_emb["dimensions"],
    sparse_embedding_values=query_sparse_emb["values"],
    rrf_ranking_alpha=0.5,
)
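
The `rrf_ranking_alpha` parameter balances the dense and sparse rankings through reciprocal rank fusion (RRF). The exact server-side formula is not shown in this notebook, so the sketch below is one common weighted-RRF formulation written under stated assumptions: the helper `rrf_fuse` is hypothetical, the smoothing constant `k = 60` is a conventional choice, and `alpha` is assumed to weight the dense ranking (1.0 = dense only, 0.0 = sparse only).

```python
from typing import Dict, List


def rrf_fuse(
    dense_ranking: List[str],
    sparse_ranking: List[str],
    alpha: float = 0.5,
    k: int = 60,
) -> List[str]:
    """Merge two ranked ID lists with alpha-weighted reciprocal rank fusion."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scores: Dict[str, float] = {}
    for rank, doc_id in enumerate(dense_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    for rank, doc_id in enumerate(sparse_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1.0 - alpha) / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]
sparse = ["c", "b", "d"]
print(rrf_fuse(dense, sparse, alpha=0.5))  # ['c', 'b', 'a', 'd']
print(rrf_fuse(dense, sparse, alpha=1.0))  # ['a', 'b', 'c', 'd']
```

With `alpha=0.5`, documents that appear in both rankings rise to the top; pushing `alpha` toward 1.0 makes the result follow the dense ranking.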

In [None]:
# Run a hybrid query against the endpoint loaded earlier
response = vs_endpoint.find_neighbors(
    deployed_index_id=deployed_index_id,
    queries=[query],
    num_neighbors=10,
)

# Print results.
# NOTE: `df` is assumed to be a pandas DataFrame of the source records indexed by
# integer datapoint ID; it is not defined in this notebook.
if response and response[0]:
    for neighbor in response[0]:
        # Look up the title and image URL in the metadata of the matching row
        row_metadata = df.loc[int(neighbor.id)]["metadata"]
        title = row_metadata["title"]
        url = row_metadata["image_url"]
        dense_dist = neighbor.distance if neighbor.distance else 0.0
        sparse_dist = neighbor.sparse_distance if neighbor.sparse_distance else 0.0
        print(f"{title:<40}: dense_dist: {dense_dist:.3f}, sparse_dist: {sparse_dist:.3f}, image_url: {url}")
else:
    print("No neighbors found for the given query.")