# Product Query & Similarity Search

This notebook demonstrates the end-to-end workflow of:
1. **Extracting Product Data**: Using Gemini 2.5 Flash to interpret a shelf image and extract structured product information (simulated here for reproducibility).
2. **Generating Embeddings**: Converting the extracted text into vector embeddings using the `gemini-embedding-001` text embedding model.
3. **Vector Search**: Querying a deployed Vertex AI Vector Search index to find similar products in the catalog.

## Setup and Initialization

Import necessary libraries and initialize the Vertex AI and Gemini clients.

In [1]:
import os
import json
import pandas as pd
from google import genai
from google.genai.types import (
  Part,
  EmbedContentConfig
)
from google.cloud import aiplatform

In [2]:
PROJECT_ID = "sandbox-401718"  # @param {type:"string"}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    # Avoid str(None), which would yield the literal string "None" when the variable is unset
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

# Initialize Gemini Client
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# Initialize Vertex AI SDK
aiplatform.init(project=PROJECT_ID, location=LOCATION)

## Helper Functions

Define a helper function that parses the model's JSON output and formats each product into a query string compatible with our search index.

In [3]:
def generate_embedding_queries(llm_output_str):
    """
    Parses LLM JSON output and formats it into query strings 
    matching the structure of the existing search index.
    Handles nulls, NoneTypes, and missing fields robustly.
    """
    try:
        products = json.loads(llm_output_str)
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return []

    queries = []

    for product in products:
        def clean_field(val):
            """Safely handles None, 'null' strings, and whitespace."""
            if val is None:
                return ""
            s = str(val).strip()
            if s.lower() in ["null", "none", "n/a"]:
                return ""
            return s

        brand = clean_field(product.get("Brand"))
        description = clean_field(product.get("Description"))
        unit = clean_field(product.get("UnitOfMeasure"))
        
        # Reuse clean_field so Size gets the same null/whitespace handling as the other fields
        size_str = clean_field(product.get("Size"))

        if not brand and not description:
            print("Skipping item: Missing both Brand and Description.")
            continue

        text_parts = []
        if brand:
            text_parts.append(brand)
        if description:
            text_parts.append(description)
        
        main_text = " ".join(text_parts)
        
        combined_size_text = ""
        if size_str:
            if unit:
                combined_size_text = f"{size_str} {unit}"
            else:
                combined_size_text = f"{size_str}"
        query_string = f"Product: {main_text}"
        
        if combined_size_text:
            query_string += f", {combined_size_text}"

        query_object = {
            "query_text": query_string,
            "original_data": product
        }
        
        queries.append(query_object)

    return queries

## Product Data Extraction Prompt

Here we define the system instructions and prompt used to guide Gemini in extracting structured data from shelf images.

In [4]:
system_instructions = "You are a highly reliable JSON extraction engine. Your task is to analyze the provided image of a product shelf display and extract structured product data for all **identifiable products**."
prompt = f"""
### Extraction Rules:

1.  **Product Identification:** Identify all distinct products on the shelf. A product is typically defined by its unique packaging and corresponding electronic shelf label (ESL).
2.  **Extractable Fields:** For each product, focus on the following properties: `UPC`, `Brand`, `Description`, `Price`, `Size`, and `UnitOfMeasure`.
3.  **Field-Specific Instructions:**
    *   **`UPC`**: Locate the UPC on the electronic shelf label (ESL) positioned directly beneath the product. If the UPC is not clearly visible or is in a non-standard format, try to find it using the google_search tool with the product's brand, description and size if appropriate. Do not guess the UPC; if you cannot find it with high confidence, leave the field blank.

        **UPC Format Rules:**

        UPC codes have a standard format, most commonly the 12-digit UPC-A version used in North America. A UPC (Universal Product Code) is the number, while the barcode is the machine-readable visual representation of that number. Both are governed by the global standards organization GS1.

        *   **The standard 12-digit format (UPC-A)**
            The 12 digits of a UPC code are broken into the following components:
            *   **Number system character (1 digit):** The first digit is a number system character that indicates how the code should be classified.
                *   `0`, `6`, `7`, `8`, `9`: Standard UPC for most consumer products.
                *   `2`: For variable-weight items like produce or meat.
                *   `3`: For pharmaceuticals and health products.
                *   `4`: For in-store use or special applications.
                *   `5`: For coupons.
            *   **Company prefix (6–10 digits):** The first six to ten digits, counting the number system character, form the company prefix, which is assigned by GS1 and uniquely identifies the product's brand owner.
            *   **Item reference number (1–5 digits):** The item number is assigned by the brand owner to identify a specific product. The number of digits varies depending on the length of the company prefix. For example, a 16-gigabyte phone would have a different item number than a 32-gigabyte model.
            *   **Check digit (1 digit):** The final digit is a check digit, calculated using a formula involving all the other numbers in the sequence. This digit helps ensure the integrity of the code during scanning.
        *   **Other UPC formats**
            While UPC-A is the most common format, other variants exist:
            *   **UPC-E:** A shorter, 6-digit version used for smaller retail items, like cosmetics, that don't have enough space for the standard UPC-A barcode.
    *   **`Price`**: Locate the price on the electronic shelf label (ESL) positioned directly beneath the product.
    *   **`Description`**: Capture the main product title and any key variations or features mentioned on the packaging.
    *   **`Size`**: Extract this from the packaging (e.g., '12 OZ' or '1.0 kg'). The `Size` should be the numerical value only.
    *   **`UnitOfMeasure`**: Must be exactly ONE of the following unit strings: ['OZ', 'LB', 'CT', 'EA', 'GAL', 'QT', 'PK', 'ML']. If multiple units are present, prefer the one most commonly used for that product type.
4.  **Strictness:** Do not infer, guess, or make up any values. If you cannot clearly and confidently determine the value for a field, omit that field from the final JSON object.
5.  **ProcessingStatus Field (MANDATORY):** This field is required in every product object.
    *   **Success:** If you were able to confidently extract the `Brand` and `Price`, set the value to: `"SUCCESS"`.
    *   **Error/Clarification:** If you encounter significant issues—such as an unreadable image, ambiguous text, or the inability to confidently extract the key information—set the value to an informative error message. Use one of the following recommended structures, or provide a custom message specifying the issue:
        *   `"ERROR: IMAGE UNCLEAR - Cannot read price."`
        *   `"ERROR: DATA AMBIGUOUS - Multiple products visible, unclear which is primary."`
        *   `"CLARIFICATION REQUIRED - Please provide a clearer photo or specify the exact product name."`

---
**GOAL:** Produce a single, valid JSON array of objects that adheres to the schema and extraction rules based on the provided image.

You **MUST** adhere strictly to the following JSON Schema for your output. Do not include any text, explanations, or dialogue outside of the JSON object.

**JSON Schema:**
```json

{{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/products.schema.json",
  "title": "Products",
  "description": "A collection of products.",
  "type": "array",
  "items": {{
    "title": "Product (Tolerant Schema)",
    "description": "A product object where all fields are optional for robustness, and includes a status field for LLM processing feedback.",
    "type": "object",
    "properties": {{
      "UPC": {{
        "type": "string",
        "description": "The UPC of the product."
      }},
      "Brand": {{
        "type": "string",
        "description": "The brand name of the product."
      }},
      "Description": {{
        "type": "string",
        "description": "A brief description of the product."
      }},
      "Price": {{
        "type": "number",
        "format": "float",
        "description": "The price of the product.",
        "minimum": 0
      }},
      "Size": {{
        "type": "number",
        "description": "The size of the product."
      }},
      "UnitOfMeasure": {{
        "type": "string",
        "description": "The unit of measure for the quantity (e.g., L, oz, ct)."
      }},
      "ProcessingStatus": {{
        "type": "string",
        "description": "Status of the extraction process. Should be 'SUCCESS' if all clear, or an error message/instruction for clarification if a problem (e.g., blurry image, ambiguous text) was encountered."
      }}
    }},
    "required": [
      "ProcessingStatus"
    ],
    "additionalProperties": false
  }}
}}
```

"""
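
The UPC format rules in the prompt mention a check digit without giving the formula. As a minimal sketch, the standard UPC-A check can be applied to extracted values before they are used downstream. The helper names `upc_a_check_digit` and `is_valid_upc_a` are hypothetical and not part of this notebook, and the sample code `036000291452` is a commonly cited valid UPC-A used here only for illustration.

```python
def upc_a_check_digit(first_eleven: str) -> int:
    """Compute the UPC-A check digit for the first 11 digits (hypothetical helper)."""
    if len(first_eleven) != 11 or not (first_eleven.isascii() and first_eleven.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])   # 1st, 3rd, ..., 11th digits are weighted by 3
    even_sum = sum(digits[1::2])  # 2nd, 4th, ..., 10th digits are weighted by 1
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def is_valid_upc_a(code: str) -> bool:
    """Return True when `code` is 12 digits and its last digit matches the check digit."""
    code = code.strip()
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_a_check_digit(code[:11]) == int(code[-1])


print(is_valid_upc_a("036000291452"))  # sample UPC-A with a consistent check digit
print(is_valid_upc_a("036000291453"))  # same code with the last digit altered
```

A check like this only confirms that a 12-digit code is internally consistent; it cannot confirm that the code belongs to the product in the image.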

## Execute Extraction (Simulation)

The following cell can call Gemini with an image file. For this demonstration, we simulate the response so that the rest of the query workflow can proceed.

In [None]:
# Uncomment to run an example call on a single in-store image stored in GCS:

# IMAGE_FILENAME = "gs://sandbox-401718-compscan-phase1/instore-selected/253097005421_Aldi_1.09.jpg"
#
# response = client.models.generate_content(
#   model="gemini-2.5-flash",
#   contents=[
#     prompt,
#     Part.from_uri(
#       file_uri=IMAGE_FILENAME,
#       mime_type="image/jpeg",
#     ),
#   ],
#   config={
#     "response_mime_type": "application/json",
#     "system_instruction": system_instructions,
#   },
# )
# response_text = response.text

In [5]:
# Simulating the output from Gemini data extraction
response_text = """
[
 {
  "Description": "Fresh Family Pack Chicken Drumsticks",
  "Price": 1.09,
  "UnitOfMeasure": "LB",
  "ProcessingStatus": "ERROR: BRAND NOT VISIBLE"
 }
 ]
"""

# Parse the response into query format
generated_queries = generate_embedding_queries(response_text)

print(f"Generated {len(generated_queries)} valid queries:\n")
for q in generated_queries:
    print(f"Query String: '{q['query_text']}'")

Generated 1 valid queries:

Query String: 'Product: Fresh Family Pack Chicken Drumsticks'
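
The simulated record has no `Brand` or `Size`, so the resulting query contains only the description. To show the full query format, the sketch below restates the formatting rule in a condensed, standalone form. The name `format_query` and the sample coffee product are hypothetical and exist only for illustration; the notebook itself uses `generate_embedding_queries`.

```python
def format_query(product: dict):
    """Condensed restatement of the query format (illustrative, hypothetical helper)."""

    def clean(val) -> str:
        # Treat None and null-like strings as empty, and trim whitespace
        s = "" if val is None else str(val).strip()
        return "" if s.lower() in {"null", "none", "n/a"} else s

    brand = clean(product.get("Brand"))
    description = clean(product.get("Description"))
    size = clean(product.get("Size"))
    unit = clean(product.get("UnitOfMeasure"))

    main_text = " ".join(part for part in (brand, description) if part)
    if not main_text:
        return None  # nothing to search on without a brand or description

    # The unit is only included when a size is present
    size_text = f"{size} {unit}".strip() if size else ""
    return f"Product: {main_text}" + (f", {size_text}" if size_text else "")


# Hypothetical complete record
print(format_query({"Brand": "ExampleBrand", "Description": "Whole Bean Coffee",
                    "Size": 12, "UnitOfMeasure": "OZ"}))
```

Under this rule a `UnitOfMeasure` without a `Size` is dropped, which is why the `LB` unit from the simulated record does not appear in its query string.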


## Vector Search Results

Use the generated query string to search the Vector Search index for similar items. This allows us to match the observed product (even with incomplete data) to the ground-truth catalog.

In [6]:
# Get the Project Number programmatically
PROJECT_NUMBER = !gcloud projects list --filter="PROJECT_ID:'{PROJECT_ID}'" --format='value(PROJECT_NUMBER)'
PROJECT_NUMBER = PROJECT_NUMBER[0]

index_endpoint = "3966107766178709504" # @param

# Connect to the Index Endpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name=f"projects/{PROJECT_NUMBER}/locations/{LOCATION}/indexEndpoints/{index_endpoint}"
)

In [7]:
if generated_queries:
    # Use the first generated query for this test
    QUERY = generated_queries[0]['query_text']
    print(f"Example Query: {QUERY}") 

    # Generate embedding for the query text
    response = client.models.embed_content(
        model="gemini-embedding-001",
        contents=[QUERY],
        config=EmbedContentConfig(
            output_dimensionality=3072,
        ),
    )
    embedding_vector = response.embeddings[0].values

    # Search the index
    response = my_index_endpoint.match(
        deployed_index_id="product_index",
        queries=[embedding_vector],
        num_neighbors=3,
    )

    # Display raw results
    results = []
    for query_result in response:
        for neighbor in query_result:
            # Access attributes directly
            results.append({'id': neighbor.id, 'distance': neighbor.distance})
            print(f"GTIN: {neighbor.id}, Distance: {neighbor.distance}")

Example Query: Product: Fresh Family Pack Chicken Drumsticks
GTIN: 461321268911, Distance: 0.6403461694717407
GTIN: 469106382920, Distance: 0.6367120146751404
GTIN: 1799639891286, Distance: 0.6329072713851929
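
To make the neighbor scores easier to interpret, here is a minimal brute-force sketch of what a nearest-neighbor lookup does conceptually. It assumes a similarity-style measure (cosine similarity), which is consistent with the scores above decreasing from the first to the third result, but the actual distance measure configured on the deployed index is not shown in this notebook. The function names, the toy three-dimensional vectors, and the item IDs are all hypothetical; the deployed index uses approximate search over much larger vectors.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, catalog, k=3):
    """Score every catalog vector against the query and keep the k best matches."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy catalog with made-up IDs and vectors
toy_catalog = {
    "item-a": [1.0, 0.0, 0.0],
    "item-b": [0.0, 1.0, 0.0],
    "item-c": [1.0, 1.0, 0.0],
}

for item_id, score in top_k([1.0, 0.0, 0.0], toy_catalog, k=2):
    print(f"ID: {item_id}, Similarity: {score:.4f}")
```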


## Retrieve Product Details

Map the returned Vector Search IDs (GTINs) back to readable product details using the original dataset (CSV).

In [9]:
df = pd.read_csv("product_embeddings_output.csv") # @param from previous notebook

In [10]:
df["id"] = df["id"].astype(str)

df_lookup = df.set_index("id")

# Iterate through the response and map IDs back to content

matched_products = [] 
if response and response[0]:
    for neighbor in response[0]:
        try:
            # IDs were cast to str above, so neighbor.id matches the string index of df_lookup
            matched_content = df_lookup.loc[neighbor.id, "content"]

            matched_products.append(
                {
                    "ID": neighbor.id,
                    "Content": matched_content,
                    "Similarity": neighbor.distance,
                }
            )
        except KeyError:
            print(
                f"Warning: ID '{neighbor.id}' from search result not found in DataFrame."
            )

if matched_products:
    matched_df = pd.DataFrame(matched_products)
    titles = matched_df["Content"].str.split("Product:", expand=True)[1].str.strip()

    for title in titles:
        print(title)
else:
    print("\nSomething went wrong, no matches were mapped.")

Boston Spice Just Wing It Handmade Gourmet Seasoning Blend Poultry Chicken Legs Wings Thighs Breasts Vegetables for Grilling Smoker Barbecue BBQ Oven Grill Saute 1/2 Cup Spice wt. 2.6oz/74g, Level 0: FOOD, Level 1: GROCERY, Level 2: DRY MIXES AND VINEGAR
Boston Spice Just Wing It Handmade Gourmet Seasoning Blend Poultry Chicken Legs Wings Thighs Breasts Vegetables for Grilling Smoker Barbecue BBQ Oven Grill Saute 1 Cup Spice wt. 5.2oz/149g, Level 0: FOOD, Level 1: GROCERY, Level 2: DRY MIXES AND VINEGAR
Louisiana Fish Fry Products Crispy Seasoned Chicken Fry Coating Powder Mix, 9 oz Bag pack of 3, Level 0: FOOD, Level 1: GROCERY, Level 2: DRY MIXES AND VINEGAR
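
The printed titles still carry the category levels as trailing text. As a rough sketch, the content string can be split into a title and a category mapping. This assumes every `content` value follows the `Product: <title>, Level 0: ..., Level 1: ...` shape seen in the output above; the helper name `parse_catalog_content` is hypothetical.

```python
import re

# Matches ", Level <n>: <name>" segments; assumes category names contain no commas
LEVEL_PATTERN = re.compile(r",\s*Level (\d+):\s*([^,]+)")


def parse_catalog_content(content: str) -> dict:
    """Split a catalog content string into a title and a {level: category} mapping."""
    head, sep, body = content.partition("Product:")
    if not sep:
        body = head  # no "Product:" prefix, so treat the whole string as the body

    first_level = re.search(r",\s*Level \d+:", body)
    title = body[: first_level.start()] if first_level else body
    levels = {int(num): name.strip() for num, name in LEVEL_PATTERN.findall(body)}
    return {"title": title.strip(), "levels": levels}


sample = (
    "Product: Louisiana Fish Fry Products Crispy Seasoned Chicken Fry Coating Powder Mix, "
    "9 oz Bag pack of 3, Level 0: FOOD, Level 1: GROCERY, Level 2: DRY MIXES AND VINEGAR"
)
print(parse_catalog_content(sample))
```

Having the levels as structured fields makes it easier to check whether a match falls in the expected category for the queried item.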
