In [None]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Vector Embedding Ingestion with Apache Beam and Cloud Spanner

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/spanner_product_catalog_embeddings.ipynb"><img src="https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/spanner_product_catalog_embeddings.ipynb"><img src="https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png" />View source on GitHub</a>
  </td>
</table>


# Introduction

This Colab demonstrates how to generate embeddings from data and ingest them into [Cloud Spanner](https://cloud.google.com/spanner). We'll use Apache Beam and Dataflow for scalable data processing.

## Example: Furniture Product Catalog

We'll work with a sample e-commerce dataset representing a furniture product catalog. Each product has:

*   **Structured fields:** `id`, `name`, `category`, `price`
*   **Detailed text descriptions:** Longer text describing the product's features.
*   **Additional metadata:** `material`, `dimensions`

## Pipeline Overview
We will build a pipeline to:
1. Read product data
2. Convert unstructured product data to the embeddable `Chunk`<sup>[1]</sup> type
3. Generate embeddings: use a pre-trained Hugging Face model (via `MLTransform`) to create vector embeddings
4. Write embeddings and metadata to a Spanner table

Here's a visualization of the data flow:

| Stage                     | Data Representation                                      | Notes                                                                                                                   |
| :------------------------ | :------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------- |
| **1. Ingest Data**      | `{`<br> `  "id": "desk-001",`<br> `  "name": "Modern Desk",`<br> `  "description": "Sleek...",`<br> `  "category": "Desks",`<br> `  ...`<br> `}` | Supports:<br>- Reading from batch (e.g., files, databases)<br>- Streaming sources (e.g., Pub/Sub).                 |
| **2. Convert to Chunks** | `Chunk(` <br>  &nbsp;&nbsp;`id="desk-001",` <br>  &nbsp;&nbsp;`content=Content(` <br>   &nbsp;&nbsp;&nbsp;&nbsp;`text="Modern Desk"` <br> &nbsp;&nbsp; `),` <br>  &nbsp;&nbsp;`metadata={...}` <br> `)`       | - `Chunk` is the structured input for generating and ingesting embeddings.<br>- `chunk.content.text` is the field that is embedded.<br> - Converting to `Chunk` does not mean breaking data into smaller pieces,<br>&nbsp;&nbsp; it's simply organizing your data in a standard format for the embedding pipeline.<br> - `Chunk` allows data to flow seamlessly throughout embedding pipelines. |
| **3. Generate Embeddings**| `Chunk(` <br>  &nbsp;&nbsp;`id="desk-001",`<br>  &nbsp;&nbsp;`embedding=[-0.1, 0.6, ...],`<br>  `...)`  | Supports:<br>- Local Hugging Face models<br>- Remote Vertex AI models<br>- Custom embedding implementations.          |
| **4. Write to Spanner** | **Spanner Table (Example Row):**<br>`id: desk-001`<br>`embedding: [-0.1, 0.6, ...]`<br> `name = "Modern Desk"`,<br>`Other fields ...` | Supports:<br>- Custom schemas<br>- Conflict resolution strategies for handling updates                               |


[1]: `Chunk` represents an embeddable unit of input. It specifies which fields should be embedded and which fields should be treated as metadata. Converting to `Chunk` does not necessarily mean breaking your text into smaller pieces; it is primarily about structuring your data for the embedding pipeline. For very long texts that exceed the embedding model's maximum input size, you can optionally [use LangChain TextSplitters](https://beam.apache.org/releases/pydoc/2.63.0/apache_beam.ml.rag.chunking.langchain.html) to break the text into smaller `Chunk` objects.
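
As a rough illustration of what splitting long text into smaller pieces involves, the following sketch breaks a string into overlapping character windows. It is a plain-Python stand-in, not the LangChain or Beam API; `split_text`, `max_chars`, and `overlap` are hypothetical names chosen for this example.

```python
from typing import List


def split_text(text: str, max_chars: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so that context at the
    boundaries is not lost. Hypothetical helper, for illustration only.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + max_chars])
        # Stop once the current window reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return pieces


print(split_text("abcdefghij", max_chars=4, overlap=1))
```

Each returned piece could then become the text of its own `Chunk`. Real splitters usually cut on sentence or token boundaries rather than raw character counts.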


## Execution Environments

This notebook demonstrates two execution environments:

1. **DirectRunner (Local Execution)**: All examples in this notebook run on DirectRunner by default, which executes the pipeline locally. This is ideal for development, testing, and processing small datasets.

2. **DataflowRunner (Distributed Execution)**: The [Run on Dataflow](#scrollTo=Quick_Start_Run_on_Dataflow) section demonstrates how to execute the same pipeline on Google Cloud Dataflow for scalable, distributed processing. This is recommended for production workloads and large datasets.

All examples in this notebook can be adapted to run on Dataflow by following the pattern shown in the "Run on Dataflow" section.

# Setup and Prerequisites

This example requires:
1. A Cloud Spanner instance
2. Apache Beam 2.70.0 or later

## Install Packages and Dependencies

First, let's install the Python packages required for the embedding and ingestion pipeline:


In [None]:
# Apache Beam with GCP support
!pip install "apache_beam[interactive,gcp]>=2.70.0" --quiet
# Huggingface sentence-transformers for embedding models
!pip install sentence-transformers --quiet
!pip show apache-beam

## Database Setup

To connect to Cloud Spanner, you'll need:
1. GCP project ID where the Spanner instance is located
2. Spanner instance ID
3. Database ID (Database will be created if it doesn't exist)

Replace these placeholder values with your actual Cloud Spanner details:

In [None]:
PROJECT_ID = "" # @param {type:'string'}
INSTANCE_ID = "" # @param {type:'string'}
DATABASE_ID = "" # @param {type:'string'}

## Authenticate to Google Cloud

To connect to the Cloud Spanner instance, we need to set up authentication. 

**Why multiple authentication steps?**

The Spanner I/O connector uses a cross-language Java transform under the hood. This means:
1. `auth.authenticate_user()` authenticates the Python environment
2. `gcloud auth application-default login` writes credentials to disk where the Java runtime can access them

**Recommended: Use a Service Account**

For production workloads or to avoid interactive login prompts, we recommend using a service account with appropriate Spanner permissions:

1. Create a service account with the `Cloud Spanner Database User` role (or `Cloud Spanner Database Admin` if creating tables)
2. Download the JSON key file
3. Set the environment variable: `export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"`

When using a service account, both Python and Java runtimes will automatically pick up the credentials, and you can skip the interactive authentication below.
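
Before skipping the interactive login, it can help to confirm that the environment variable is actually set and points to a file. The check below is a minimal sketch using only the standard library; `service_account_key_configured` is a hypothetical helper name, and it only tests that the file exists, not that the key is valid or has the right roles.

```python
import os


def service_account_key_configured(env=None) -> bool:
    """Return True if GOOGLE_APPLICATION_CREDENTIALS points to an existing file."""
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(key_path) and os.path.isfile(key_path)


if service_account_key_configured():
    print("Service account key found; interactive login can be skipped.")
else:
    print("No service account key configured; use the interactive login.")
```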

In [None]:
import sys
if 'google.colab' in sys.modules:
    from google.colab import auth
    # Authenticates Python SDK
    auth.authenticate_user(project_id=PROJECT_ID)

# Writes application default credentials to disk for Java cross-language transforms
!gcloud auth application-default login

!gcloud config set project {PROJECT_ID}

In [None]:
# @title Spanner Helper Functions for Creating Tables and Verifying Data

from google.cloud import spanner
from google.api_core.exceptions import NotFound, AlreadyExists
import time

def get_spanner_client(project_id: str) -> spanner.Client:
    """Creates a Spanner client."""
    return spanner.Client(project=project_id)


def ensure_instance_exists(
    client: spanner.Client,
    instance_id: str
):
    """Ensure Spanner instance exists, raise an error if it doesn't.

    Args:
        client: Spanner client
        instance_id: Instance ID to check

    Returns:
        The Spanner Instance object.

    Raises:
        NotFound: If the instance does not exist.
    """
    instance = client.instance(instance_id)

    try:
        # Attempt to load the instance metadata
        instance.reload()
        print(f"✓ Spanner Instance '{instance_id}' exists")
        return instance
    except NotFound:
        # Instance does not exist
        raise NotFound(
            f"Error: Spanner Instance '{instance_id}' not found. "
            "Please create the instance before running this script."
        )

def ensure_database_exists(
    client: spanner.Client,
    instance_id: str,
    database_id: str,
    ddl_statements: list = None
):
    """Ensure database exists, create if it doesn't.

    Args:
        client: Spanner client
        instance_id: Instance ID to get
        database_id: Database ID to create or get
        ddl_statements: Optional DDL statements for table creation

    Returns:
        Database instance
    """
    instance = ensure_instance_exists(client, instance_id)
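    # Pass the optional DDL so it is applied if the database has to be created
    database = instance.database(database_id, ddl_statements=ddl_statements or [])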

    try:
        # Try to get existing database
        database.reload()
        print(f"✓ Database '{database_id}' already exists")
        return database
    except NotFound:
        # Create new database
        print(f"Creating database '{database_id}'...")
        operation = database.create()
        operation.result(timeout=120)
        print(f"✓ Database '{database_id}' created successfully")
        return database

def create_or_replace_table(
    client: spanner.Client,
    instance_id: str,
    database_id: str,
    table_name: str,
    table_ddl: str
):
    """Create or replace a table in Spanner.

    Args:
        client: Spanner client
        instance_id: Instance ID to get
        database_id: Database ID
        table_name: Table name to create
        table_ddl: Complete CREATE TABLE DDL statement
    """
    instance = ensure_instance_exists(client, instance_id)
    database = instance.database(database_id)

    # Drop table if exists
    try:
        print(f"Dropping table '{table_name}' if it exists...")
        operation = database.update_ddl([f"DROP TABLE {table_name}"])
        operation.result(timeout=120)
        print(f"✓ Dropped existing table '{table_name}'")
        time.sleep(2)  # Wait for drop to complete
    except Exception as e:
        if "NOT_FOUND" not in str(e):
            print(f"Note: Table may not exist (this is normal): {e}")

    # Create table
    print(f"Creating table '{table_name}'...")
    operation = database.update_ddl([table_ddl])
    operation.result(timeout=120)
    print(f"✓ Table '{table_name}' created successfully")

def verify_embeddings_spanner(
    client: spanner.Client,
    instance_id: str,
    database_id: str,
    table_name: str
):
    """Query and display all rows from a Spanner table.

    Args:
        client: Spanner client
        instance_id: Instance ID to get
        database_id: Database ID
        table_name: Table name to query
    """
    instance = ensure_instance_exists(client, instance_id)
    database = instance.database(database_id)

    with database.snapshot() as snapshot:
        results = snapshot.execute_sql(f"SELECT * FROM {table_name}")
        rows = list(results)

        print(f"\nFound {len(rows)} products in '{table_name}':")
        print("-" * 80)

        if not rows:
            print("Table is empty.")
            print("-" * 80)
        else:
            # Print each row
            for row in rows:
                for i, value in enumerate(row):
                    # Abbreviate long lists (such as embeddings) to the first two and last values
                    if isinstance(value, list) and len(value) > 5:
                        print(f"Column {i}: [{value[0]:.4f}, {value[1]:.4f}, ..., {value[-1]:.4f}] (length: {len(value)})")
                    else:
                        print(f"Column {i}: {value}")
                print("-" * 80)

## Create Sample Product Catalog Data

We'll create a typical e-commerce catalog where you want to:
- Generate embeddings for product text
- Store vectors alongside product data
- Enable vector similarity features

Example product:
```python
{
    "id": "desk-001",
    "name": "Modern Minimalist Desk",
    "description": "Sleek minimalist desk with clean lines and a spacious work surface. "
                  "Features cable management system and sturdy steel frame. "
                  "Perfect for contemporary home offices and workspaces.",
    "category": "Desks",
    "price": 399.99,
    "material": "Engineered Wood, Steel",
    "dimensions": "60W x 30D x 29H inches"
}
```

In [None]:
#@title Create sample data
PRODUCTS_DATA = [
    {
        "id": "desk-001",
        "name": "Modern Minimalist Desk",
        "description": "Sleek minimalist desk with clean lines and a spacious work surface. "
                      "Features cable management system and sturdy steel frame. "
                      "Perfect for contemporary home offices and workspaces.",
        "category": "Desks",
        "price": 399.99,
        "material": "Engineered Wood, Steel",
        "dimensions": "60W x 30D x 29H inches"
    },
    {
        "id": "chair-001",
        "name": "Ergonomic Mesh Office Chair",
        "description": "Premium ergonomic office chair with breathable mesh back, "
                      "adjustable lumbar support, and 4D armrests. Features synchronized "
                      "tilt mechanism and memory foam seat cushion. Ideal for long work hours.",
        "category": "Office Chairs",
        "price": 299.99,
        "material": "Mesh, Metal, Premium Foam",
        "dimensions": "26W x 26D x 48H inches"
    },
    {
        "id": "sofa-001",
        "name": "Contemporary Sectional Sofa",
        "description": "Modern L-shaped sectional with chaise lounge. Upholstered in premium "
                      "performance fabric. Features deep seats, plush cushions, and solid "
                      "wood legs. Perfect for modern living rooms.",
        "category": "Sofas",
        "price": 1299.99,
        "material": "Performance Fabric, Solid Wood",
        "dimensions": "112W x 65D x 34H inches"
    },
    {
        "id": "table-001",
        "name": "Rustic Dining Table",
        "description": "Farmhouse-style dining table with solid wood construction. "
                      "Features distressed finish and trestle base. Seats 6-8 people "
                      "comfortably. Perfect for family gatherings.",
        "category": "Dining Tables",
        "price": 899.99,
        "material": "Solid Pine Wood",
        "dimensions": "72W x 42D x 30H inches"
    },
    {
        "id": "bed-001",
        "name": "Platform Storage Bed",
        "description": "Modern queen platform bed with integrated storage drawers. "
                      "Features upholstered headboard and durable wood slat support. "
                      "No box spring needed. Perfect for maximizing bedroom space.",
        "category": "Beds",
        "price": 799.99,
        "material": "Engineered Wood, Linen Fabric",
        "dimensions": "65W x 86D x 48H inches"
    }
]
print(f"""✓ Created PRODUCTS_DATA with {len(PRODUCTS_DATA)} records""")

## Importing Pipeline Components

We import the following components to configure our embedding ingestion pipeline:
- `apache_beam.ml.rag.types.Chunk` and `Content`, the structured input for generating and ingesting embeddings
- `apache_beam.ml.rag.embeddings.huggingface.HuggingfaceTextEmbeddings` for generating embeddings with a Hugging Face model
- `apache_beam.ml.transforms.base.MLTransform` to apply the embedding step in the pipeline
- `apache_beam.ml.rag.ingestion.spanner.SpannerVectorWriterConfig` for configuring write behavior
- `apache_beam.ml.rag.ingestion.spanner.SpannerColumnSpecsBuilder` for custom schema mapping
- `apache_beam.ml.rag.ingestion.base.VectorDatabaseWriteTransform` to perform the write step

In [None]:
from apache_beam.ml.rag.ingestion.spanner import SpannerVectorWriterConfig
from apache_beam.ml.rag.ingestion.spanner import SpannerColumnSpecsBuilder
from apache_beam.ml.rag.ingestion.base import VectorDatabaseWriteTransform
from apache_beam.ml.rag.types import Chunk, Content
from apache_beam.ml.rag.embeddings.huggingface import HuggingfaceTextEmbeddings

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.ml.transforms.base import MLTransform

# What's next?

This colab covers several use cases that you can explore based on your needs after completing the Setup and Prerequisites:

🔰 **New to vector embeddings?**
- [Start with Quick Start](#scrollTo=Quick_Start_Basic_Vector_Ingestion)
- Uses simple out-of-box schema
- Perfect for initial testing

🚀 **Need to scale to large datasets?**
- [Go to Run on Dataflow](#scrollTo=Quick_Start_Run_on_Dataflow)
- Learn how to execute the same pipeline at scale
- Fully managed
- Process large datasets efficiently

🎯 **Have a specific schema?**
- [Go to Custom Schema](#scrollTo=Custom_Schema_with_Column_Mapping)
- Learn to use different column names
- Map metadata to individual columns

🔄 **Need to update embeddings?**
- [Check out Updating Embeddings](#scrollTo=Update_Embeddings_and_Metadata_with_Write_Mode)
- Handle conflicts
- Selective field updates

🔗 **Need to generate and store embeddings for an existing Spanner table?**
- [See Database Integration](#scrollTo=Adding_Embeddings_to_Existing_Database_Records)
- Read data from your Spanner table.
- Generate embeddings for the relevant fields.
- Update your table (or a related table) with the generated embeddings.

🤖 **Want to use Google's AI models?**
- [Try Vertex AI Embeddings](#scrollTo=Generate_Embeddings_with_VertexAI_Text_Embeddings)
- Use Google's powerful embedding models
- Seamlessly integrate with other Google Cloud services

🔄 **Need real-time embedding updates?**

- [Try Streaming Embeddings from PubSub](#scrollTo=Streaming_Embeddings_Updates_from_PubSub)
- Process continuous data streams
- Update embeddings in real-time as information changes

# Quick Start: Basic Vector Ingestion

This section shows the simplest way to generate embeddings and store them in Cloud Spanner.

## Create table with default schema

Before running the pipeline, we need a table to store our embeddings:

In [None]:
table_name = "default_product_embeddings"
table_ddl = f"""
CREATE TABLE {table_name} (
    id STRING(1024) NOT NULL,
    embedding ARRAY<FLOAT32>(vector_length=>384),
    content STRING(MAX),
    metadata JSON
) PRIMARY KEY (id)
"""

In [None]:
client = get_spanner_client(PROJECT_ID)
ensure_database_exists(client, INSTANCE_ID, DATABASE_ID)
create_or_replace_table(client, INSTANCE_ID, DATABASE_ID, table_name, table_ddl)
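
The `vector_length=>384` in the table definition has to match the number of dimensions the embedding model produces; 384 is the output size of `sentence-transformers/all-MiniLM-L6-v2`, the model used in this notebook. If you swap in a different model, one way to keep the two in sync is to generate the DDL from a single dimension value. The helper below is a sketch; `build_default_ddl` and `embedding_dim` are hypothetical names, not part of Beam or the Spanner client.

```python
def build_default_ddl(table_name: str, embedding_dim: int) -> str:
    """Build the default-schema CREATE TABLE statement for a given vector size."""
    if embedding_dim <= 0:
        raise ValueError("embedding_dim must be a positive integer")
    return f"""
CREATE TABLE {table_name} (
    id STRING(1024) NOT NULL,
    embedding ARRAY<FLOAT32>(vector_length=>{embedding_dim}),
    content STRING(MAX),
    metadata JSON
) PRIMARY KEY (id)
"""


# 384 is the output dimension of sentence-transformers/all-MiniLM-L6-v2.
example_ddl = build_default_ddl("default_product_embeddings", 384)
print(example_ddl)
```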

## Configure Pipeline Components

Now define the components that control the pipeline behavior:

### Convert ingested product data to embeddable Chunks
- Our data is ingested as product dictionaries
- Embedding generation and ingestion operate on `Chunk` objects
- We convert each product dictionary to a `Chunk` to configure what text to embed and what to treat as metadata

In [None]:
from typing import Dict, Any

def create_chunk(product: Dict[str, Any]) -> Chunk:
    """Convert a product dictionary into an embeddable object."""
    return Chunk(
        content=Content(
            text=f"{product['name']}: {product['description']}"
        ),
        id=product['id'],
        metadata=product,
    )
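
To see which text ends up being embedded without running the pipeline, the same formatting rule can be previewed in plain Python. This is a sketch only: `preview_chunk_fields` is a hypothetical helper, the sample description is shortened, and dropping `description` from the metadata is an optional variation (`create_chunk` keeps the full dictionary).

```python
from typing import Any, Dict, Tuple


def preview_chunk_fields(product: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (text_to_embed, metadata) using the same text rule as create_chunk."""
    text = f"{product['name']}: {product['description']}"
    # Optional variation: leave the description out of the metadata because it
    # is already part of the embedded text.
    metadata = {k: v for k, v in product.items() if k != "description"}
    return text, metadata


sample_product = {
    "id": "desk-001",
    "name": "Modern Minimalist Desk",
    "description": "Sleek minimalist desk with clean lines.",  # shortened sample
    "category": "Desks",
    "price": 399.99,
}
preview_text, preview_metadata = preview_chunk_fields(sample_product)
print(preview_text)
print(sorted(preview_metadata))
```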


### Generate embeddings with HuggingFace SentenceTransformer

We use a local pre-trained Hugging Face model to create vector embeddings from the product descriptions.

In [None]:
huggingface_embedder = HuggingfaceTextEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

### Write to Cloud Spanner

The default SpannerVectorWriterConfig maps Chunk fields to database columns as:

| Database Column | Chunk Field | Description |
|----------------|-------------|-------------|
| id             | chunk.id    | Unique identifier |
| embedding      | chunk.embedding.dense_embedding | Vector as ARRAY<FLOAT32> |
| content        | chunk.content.text | Text that was embedded |
| metadata       | chunk.metadata | Additional data as JSON |

In [None]:
spanner_writer_config = SpannerVectorWriterConfig(
    project_id=PROJECT_ID,
    instance_id=INSTANCE_ID,
    database_id=DATABASE_ID,
    table_name=table_name
)

## Assemble and Run Pipeline

Now we can create our pipeline that:
1. Takes our product data
2. Converts each product to a Chunk
3. Generates embeddings for each Chunk
4. Stores everything in Cloud Spanner



In [None]:
import tempfile

# Executing on DirectRunner (local execution)
with beam.Pipeline() as p:
    _ = (
        p
        | 'Create Products' >> beam.Create(PRODUCTS_DATA)
        | 'Convert to Chunks' >> beam.Map(create_chunk)
        | 'Generate Embeddings' >> MLTransform(write_artifact_location=tempfile.mkdtemp())
          .with_transform(huggingface_embedder)
        | 'Write to Spanner' >> VectorDatabaseWriteTransform(
            spanner_writer_config
        )
    )

## Verify Embeddings
Let's check what was written to our Cloud Spanner table:

In [None]:
verify_embeddings_spanner(client,INSTANCE_ID, DATABASE_ID, table_name)
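
The stored vectors are what make similarity features possible. As a minimal sketch of the underlying comparison, the function below computes cosine similarity in plain Python. It uses toy 3-dimensional vectors rather than real 384-dimensional model output, and `cosine_similarity` is a helper written for this example, not a Beam or Spanner API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings read back from the table.
desk_vector = [0.9, 0.1, 0.0]
chair_vector = [0.8, 0.3, 0.1]
sofa_vector = [0.0, 0.2, 0.9]

print("desk vs chair:", round(cosine_similarity(desk_vector, chair_vector), 3))
print("desk vs sofa:", round(cosine_similarity(desk_vector, sofa_vector), 3))
```

Values close to 1 indicate vectors pointing in nearly the same direction; values near 0 indicate unrelated directions.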

## Quick Start Summary

In this section, you learned how to:
- Convert product data to the Chunk format expected by embedding pipelines
- Generate embeddings using a HuggingFace model
- Configure and run a basic embedding ingestion pipeline
- Store embeddings and metadata in Cloud Spanner

This basic pattern forms the foundation for all the advanced use cases covered in the following sections.

# Quick Start: Run on Dataflow

This section demonstrates how to launch the Quick Start embedding pipeline on Google Cloud Dataflow from the colab. While previous examples used DirectRunner for local execution, Dataflow provides a fully managed, distributed execution environment that is:
- Scalable: Automatically scales to handle large datasets
- Fault-tolerant: Handles worker failures and ensures exactly-once processing
- Fully managed: No need to provision or manage infrastructure

For more in-depth documentation on packaging your pipeline into a Python file and launching a Dataflow job from the command line, see [Create Dataflow pipeline using Python](https://cloud.google.com/dataflow/docs/quickstarts/create-pipeline-python).

## Create the Cloud Spanner table with default schema

Before running the pipeline, we need a table to store our embeddings:

In [None]:
table_name = "default_dataflow_product_embeddings"
table_ddl = f"""
CREATE TABLE {table_name} (
    id STRING(1024) NOT NULL,
    embedding ARRAY<FLOAT32>(vector_length=>384),
    content STRING(MAX),
    metadata JSON
) PRIMARY KEY (id)
"""

In [None]:
client = get_spanner_client(PROJECT_ID)
ensure_database_exists(client, INSTANCE_ID, DATABASE_ID)
create_or_replace_table(client, INSTANCE_ID, DATABASE_ID, table_name, table_ddl)

## Save our Pipeline to a Python file

To launch our pipeline job on Dataflow, we:
1. Add command line arguments for passing pipeline options
2. Save our pipeline code to a local file `basic_ingestion_pipeline.py`

In [None]:
file_content = """
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import argparse
import tempfile

from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.rag.types import Chunk, Content
from apache_beam.ml.rag.ingestion.base import VectorDatabaseWriteTransform
from apache_beam.ml.rag.ingestion.spanner import SpannerVectorWriterConfig
from apache_beam.ml.rag.embeddings.huggingface import HuggingfaceTextEmbeddings
from apache_beam.options.pipeline_options import SetupOptions

PRODUCTS_DATA = [
    {
        "id": "desk-001",
        "name": "Modern Minimalist Desk",
        "description": "Sleek minimalist desk with clean lines and a spacious work surface. "
                      "Features cable management system and sturdy steel frame. "
                      "Perfect for contemporary home offices and workspaces.",
        "category": "Desks",
        "price": 399.99,
        "material": "Engineered Wood, Steel",
        "dimensions": "60W x 30D x 29H inches"
    },
    {
        "id": "chair-001",
        "name": "Ergonomic Mesh Office Chair",
        "description": "Premium ergonomic office chair with breathable mesh back, "
                      "adjustable lumbar support, and 4D armrests. Features synchronized "
                      "tilt mechanism and memory foam seat cushion. Ideal for long work hours.",
        "category": "Office Chairs",
        "price": 299.99,
        "material": "Mesh, Metal, Premium Foam",
        "dimensions": "26W x 26D x 48H inches"
    }
]

def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--instance_id', required=True, help='Spanner instance ID')
    parser.add_argument('--database_id', required=True, help='Spanner database ID')
    parser.add_argument('--table_name', required=True, help='Spanner table name')

    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    project_id = pipeline_options.get_all_options()['project']

    with beam.Pipeline(options=pipeline_options) as p:
        _ = (
            p
            | 'Create Products' >> beam.Create(PRODUCTS_DATA)
            | 'Convert to Chunks' >> beam.Map(lambda product: Chunk(
                content=Content(
                    text=f"{product['name']}: {product['description']}"
                ),
                id=product['id'],
                metadata=product,
            ))
            | 'Generate Embeddings' >> MLTransform(write_artifact_location=tempfile.mkdtemp())
              .with_transform(HuggingfaceTextEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"))
            | 'Write to Spanner' >> VectorDatabaseWriteTransform(
                SpannerVectorWriterConfig(
                    project_id=project_id,
                    instance_id=known_args.instance_id,
                    database_id=known_args.database_id,
                    table_name=known_args.table_name
                )
            )
        )

if __name__ == '__main__':
    run()
"""

with open("basic_ingestion_pipeline.py", "w") as f:
    f.write(file_content)


## Configure the Pipeline options
To run the pipeline on Dataflow, we need:
- A Cloud Storage bucket for staging Dataflow files. Set `BUCKET_NAME` to the name of a valid Google Cloud Storage bucket.
- Optionally, the Google Cloud region to run Dataflow in. Set `REGION` to the desired location.
- Optionally, a `NETWORK` and `SUBNETWORK` for the Dataflow workers to run on.


In [None]:
BUCKET_NAME = '' # @param {type:'string'}
REGION = '' # @param {type:'string'}
NETWORK = '' # @param {type:'string'}
SUBNETWORK = '' # @param {type:'string'}
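
An empty bucket name would produce invalid `gs://` paths when the job is launched, so it can be worth validating the value first. The helper below is a sketch with a hypothetical name (`dataflow_gcs_locations`); the `temp` and `staging` folder names mirror the ones used when launching the job in this notebook.

```python
def dataflow_gcs_locations(bucket_name: str) -> dict:
    """Return temp and staging paths for a bucket, rejecting empty names."""
    name = bucket_name.strip()
    # Accept either "my-bucket" or "gs://my-bucket/".
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.strip("/")
    if not name:
        raise ValueError("BUCKET_NAME is empty; set it before launching the job")
    return {
        "temp_location": f"gs://{name}/temp",
        "staging_location": f"gs://{name}/staging",
    }


# "my-bucket" is a placeholder, not a real bucket.
print(dataflow_gcs_locations("my-bucket"))
```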

## Provide additional Python dependencies to be installed on worker VMs

We use the Hugging Face `sentence-transformers` package to generate embeddings. Since this package is not installed on worker VMs by default, we create a `requirements.txt` file listing the additional dependencies to install on them.

See [Managing Python Pipeline Dependencies](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/) for more details.



In [None]:
!echo "sentence-transformers" > ./requirements.txt
!cat ./requirements.txt

## Run Pipeline on Dataflow

We launch the pipeline via the command line, passing
- Cloud Spanner pipeline arguments defined in `basic_ingestion_pipeline.py`
- GCP Project ID
- Job Region
- The runner (DataflowRunner)
- Temp and Staging GCS locations for Pipeline artifacts
- Requirement file location for additional dependencies
- (Optional) The VPC network and Subnetwork that has access to the Cloud Spanner instance

Once the job is launched, you can monitor its progress in the Google Cloud Console:
1. Go to https://console.cloud.google.com/dataflow/jobs
2. Select your project
3. Click on the job named "spanner-dataflow-basic-embedding-ingest"
4. View detailed execution graphs, logs, and metrics

In [None]:
command_parts = [
    "python ./basic_ingestion_pipeline.py",
    f"--project={PROJECT_ID}",
    f"--instance_id={INSTANCE_ID}",
    f"--database_id={DATABASE_ID}",
    f"--table_name={table_name}",
    "--job_name=spanner-dataflow-basic-embedding-ingest",
    f"--region={REGION}",
    "--runner=DataflowRunner",
    f"--temp_location=gs://{BUCKET_NAME}/temp",
    f"--staging_location=gs://{BUCKET_NAME}/staging",
    "--disk_size_gb=50",
    "--requirements_file=requirements.txt"
]

if NETWORK:
    command_parts.append(f"--network={NETWORK}")

if SUBNETWORK:
    command_parts.append(f"--subnetwork=regions/{REGION}/subnetworks/{SUBNETWORK}")

final_command = " ".join(command_parts)
import logging
logging.getLogger().setLevel(logging.INFO)
print("Generated command:\n", final_command)
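
Joining the parts with a plain space works as long as no value contains spaces or shell metacharacters. A more defensive variant, sketched below with the standard library's `shlex.join` (Python 3.8+), keeps each argument as its own list item and quotes it where needed. The argument values here are placeholders, and the job name contains a space purely to show the quoting.

```python
import shlex

example_args = [
    "python",
    "./basic_ingestion_pipeline.py",
    "--project=my-project",            # placeholder project ID
    "--job_name=embedding ingest",     # contains a space on purpose
    "--runner=DataflowRunner",
]

# shlex.join quotes only the arguments that need it.
quoted_command = shlex.join(example_args)
print(quoted_command)

# Splitting the quoted command recovers the original argument list.
print(shlex.split(quoted_command) == example_args)
```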

In [None]:
# Launch pipeline with generated command
!{final_command}

## Verify the Written Embeddings

Once the Dataflow job is complete, we check what was written to our Cloud Spanner table:

In [None]:
verify_embeddings_spanner(client,INSTANCE_ID, DATABASE_ID, table_name)

# Advanced Use Cases

This section demonstrates more complex scenarios for using Spanner with Apache Beam for vector embeddings.

🎯 **Have a specific schema?**
- [Go to Custom Schema](#scrollTo=Custom_Schema_with_Column_Mapping)
- Learn to use different column names and transform values
- Map metadata to individual columns

🔄 **Need to update embeddings?**
- [Check out Updating Embeddings](#scrollTo=Update_Embeddings_and_Metadata_with_Write_Mode)
- Handle conflicts
- Selective field updates

🔗 **Need to generate and store embeddings for existing Cloud Spanner data?**
- [See Database Integration](#scrollTo=Adding_Embeddings_to_Existing_Database_Records)
- Read data from your Cloud Spanner table.
- Generate embeddings for the relevant fields.
- Update your table (or a related table) with the generated embeddings.

🤖 **Want to use Google's AI models?**
- [Try Vertex AI Embeddings](#scrollTo=Generate_Embeddings_with_VertexAI_Text_Embeddings)
- Use Google's powerful embedding models
- Seamlessly integrate with other Google Cloud services

🔄 **Need real-time embedding updates?**

- [Try Streaming Embeddings from PubSub](#scrollTo=Streaming_Embeddings_Updates_from_PubSub)
- Process continuous data streams
- Update embeddings in real-time as information changes


## Custom Schema with Column Mapping

In this example, we'll create a custom schema that:
- Uses different column names
- Maps metadata to individual columns
- Uses functions to transform values

### ColumnSpec and SpannerColumnSpecsBuilder


ColumnSpec specifies how to map data to a database column. For example:
```python
from apache_beam.ml.rag.ingestion.spanner import ColumnSpec

ColumnSpec(
    column_name="price",          # Database column
    python_type=float,            # Python Type for the value
    value_fn=lambda c: c.metadata['price'],  # Extract price from Chunk
)
```
In this example, `value_fn` extracts the price from the chunk's metadata, `python_type` declares the extracted value as a float, and `column_name` names the Spanner column (`price`) that receives it.
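
To make the role of the three fields concrete, here is a plain-Python sketch of how a list of such specs could turn one chunk into one row. It does not use Beam: `ColumnSpecSketch` and `build_row` are hypothetical stand-ins, the chunk is a `SimpleNamespace`, and the type check is this sketch's own choice rather than a description of what the Spanner writer does internally.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


@dataclass
class ColumnSpecSketch:
    """Hypothetical stand-in with a column name, a type, and an extractor."""
    column_name: str
    python_type: type
    value_fn: Callable[[Any], Any]


def build_row(chunk: Any, specs: List[ColumnSpecSketch]) -> Dict[str, Any]:
    """Apply each spec's value_fn to the chunk and check the declared type."""
    row = {}
    for spec in specs:
        value = spec.value_fn(chunk)
        if not isinstance(value, spec.python_type):
            raise TypeError(
                f"column '{spec.column_name}' expected "
                f"{spec.python_type.__name__}, got {type(value).__name__}"
            )
        row[spec.column_name] = value
    return row


example_chunk = SimpleNamespace(
    id="desk-001",
    metadata={"name": "Modern Desk", "price": 399.99},
)
example_specs = [
    ColumnSpecSketch("product_id", str, lambda c: c.id),
    ColumnSpecSketch("price", float, lambda c: c.metadata["price"]),
]
example_row = build_row(example_chunk, example_specs)
print(example_row)
```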

`SpannerColumnSpecsBuilder` offers a fluent API for adding column specs:
```python
specs = (
  SpannerColumnSpecsBuilder()
    .with_id_spec() # Default id spec maps Chunk.id to Spanner column "id" as a string
    .with_embedding_spec() # Default embedding spec maps Chunk.embedding.dense_embedding to Spanner column "embedding" of type list<float>
    .with_content_spec() # Default content spec maps Chunk.content.text to Spanner column "content"
    .add_metadata_field(field="source", python_type=str) # Extracts the "source" field from Chunk.metadata and inserts into Spanner column "source" as string type.
    .with_metadata_spec() # Default metadata spec inserts entire Chunk.metadata to spanner as JSON.
    .build()
)

```
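
To make the mapping concrete, here is a minimal, library-free sketch of the idea behind a column spec. `FakeChunk`, `FakeColumnSpec`, and `chunk_to_row` are hypothetical stand-ins written for illustration only. They are not the Beam classes, and the real writer may build rows differently.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a Chunk: id, text, and metadata only."""
    id: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FakeColumnSpec:
    """Hypothetical stand-in for a column spec."""
    column_name: str
    python_type: type
    value_fn: Callable[[FakeChunk], Any]


def chunk_to_row(chunk: FakeChunk, specs: List[FakeColumnSpec]) -> Dict[str, Any]:
    """Apply each spec's value_fn and coerce the result to python_type."""
    return {s.column_name: s.python_type(s.value_fn(chunk)) for s in specs}


demo_specs = [
    FakeColumnSpec("product_id", str, lambda c: c.id),
    FakeColumnSpec("description", str, lambda c: c.text),
    FakeColumnSpec("price", float, lambda c: c.metadata["price"]),
]
demo_chunk = FakeChunk(
    id="desk-001", text="Modern Minimalist Desk", metadata={"price": 399.99})
row = chunk_to_row(demo_chunk, demo_specs)
print(row)
```

Each spec contributes exactly one key to the row, which is why a builder that accumulates specs one call at a time fits this model well.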

### Create Custom Schema Table

In [None]:
table_name = "custom_product_embeddings"
table_ddl = f"""
CREATE TABLE {table_name} (
    product_id STRING(1024) NOT NULL,
    vector_embedding ARRAY<FLOAT32>(vector_length=>384),
    product_name STRING(MAX),
    description STRING(MAX),
    price FLOAT64,
    category STRING(MAX),
    display_text STRING(MAX),
    model_name STRING(MAX),
    created_at TIMESTAMP
) PRIMARY KEY (product_id)
"""
client = get_spanner_client(PROJECT_ID)
ensure_database_exists(client, INSTANCE_ID, DATABASE_ID)
create_or_replace_table(client, INSTANCE_ID, DATABASE_ID, table_name, table_ddl)

### Configure Column Specs

We extract fields from our `Chunk` and map them to our database schema.
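
Two of the computed columns in the next cell are plain Python functions of the chunk. As a rough, standalone sketch (using a plain dict in place of `chunk.metadata`, with hypothetical helper names `display_text` and `utc_timestamp`), the values could be produced like this. The timestamp helper is one possible timezone-aware alternative to `datetime.now().isoformat()+'Z'`, which labels local time as UTC when the machine's clock is not set to UTC.

```python
from datetime import datetime, timezone


def display_text(metadata: dict) -> str:
    """Format 'name - $price' with two decimal places."""
    return f"{metadata['name']} - ${metadata['price']:.2f}"


def utc_timestamp() -> str:
    """Return the current UTC time as an ISO 8601 string ending in 'Z'."""
    now = datetime.now(timezone.utc)
    # Drop the '+00:00' offset that isoformat() would add, then append 'Z'.
    return now.replace(tzinfo=None).isoformat() + "Z"


sample = {"name": "Modern Minimalist Desk", "price": 399.99}
print(display_text(sample))
print(utc_timestamp())
```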

In [None]:
from datetime import datetime

column_specs = (
    SpannerColumnSpecsBuilder()
    .with_id_spec(column_name='product_id')
    .with_embedding_spec(column_name='vector_embedding')
    .with_content_spec(column_name='description')
    .add_metadata_field('name', str, column_name='product_name')
    .add_metadata_field('price', float, column_name='price')
    .add_metadata_field('category', str, column_name='category')
    .add_column(
        column_name='display_text',
        python_type=str,
        value_fn=lambda chunk: f"{chunk.metadata['name']} - ${chunk.metadata['price']:.2f}"
    )
    .add_column(
        column_name='model_name',
        python_type=str,
        value_fn=lambda _: "all-MiniLM-L6-v2"
    )
    .add_column(
        column_name='created_at',
        python_type=str,
        value_fn=lambda _: datetime.now().isoformat()+'Z'
    )
    .build()
)


### Run Pipeline

In [None]:
import tempfile

# Executing on DirectRunner (local execution)
with beam.Pipeline() as p:
    _ = (
        p
        | 'Create Products' >> beam.Create(PRODUCTS_DATA)
        | 'Convert to Chunks' >> beam.Map(lambda product_dict: Chunk(Content(text=f"{product_dict['name']}: {product_dict['description']}"), id=product_dict["id"], metadata=product_dict))
        | 'Generate Embeddings' >> MLTransform(write_artifact_location=tempfile.mkdtemp())
          .with_transform(HuggingfaceTextEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"))
        | 'Write to Spanner' >> VectorDatabaseWriteTransform(
            SpannerVectorWriterConfig(
                project_id=PROJECT_ID,
                instance_id=INSTANCE_ID,
                database_id=DATABASE_ID,
                table_name=table_name,
                column_specs=column_specs
            )
        )
    )

## Verify Embeddings
Let's check what was written to our Cloud Spanner table:

In [None]:
verify_embeddings_spanner(client,INSTANCE_ID, DATABASE_ID, table_name)

# Update Embeddings and Metadata with Write Mode <a name="updating"></a>

This section demonstrates how to handle periodic updates to product descriptions and their embeddings using the default schema. We'll show how embeddings and metadata get updated when product descriptions change.

Spanner supports different write modes for handling updates:
- `INSERT`: Fail if row exists
- `UPDATE`: Fail if row doesn't exist
- `REPLACE`: Delete then insert
- `INSERT_OR_UPDATE`: Insert or update if exists (default)

Select any of these with the `write_mode` argument of `SpannerVectorWriterConfig`.
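
The difference between the modes is easiest to see on a toy in-memory table. The sketch below is only a model of the semantics listed above. It does not call Spanner, `write` is a hypothetical helper, and a missing key in a row stands in for a `NULL` column.

```python
from typing import Any, Dict

Table = Dict[str, Dict[str, Any]]  # primary key -> row


def write(table: Table, mode: str, key: str, columns: Dict[str, Any]) -> None:
    """Toy model of the four write modes on an in-memory table."""
    exists = key in table
    if mode == "INSERT":
        if exists:
            raise KeyError(f"row {key!r} already exists")
        table[key] = dict(columns)
    elif mode == "UPDATE":
        if not exists:
            raise KeyError(f"row {key!r} not found")
        table[key].update(columns)  # columns not written keep their values
    elif mode == "REPLACE":
        table[key] = dict(columns)  # the old row is dropped entirely
    elif mode == "INSERT_OR_UPDATE":
        table.setdefault(key, {}).update(columns)
    else:
        raise ValueError(f"unknown write mode: {mode}")


table: Table = {}
write(table, "INSERT", "desk-001", {"content": "v1", "created_at": "day1"})
write(table, "UPDATE", "desk-001", {"content": "v2"})
print(table)  # created_at survives the UPDATE
write(table, "REPLACE", "desk-001", {"content": "v3"})
print(table)  # created_at is no longer set after REPLACE
```

The practical consequence is that `UPDATE` and `INSERT_OR_UPDATE` leave unwritten columns alone, while `REPLACE` clears them.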


### Create table with desired schema

Let's use the same default schema as in Quick Start, with two additional timestamp columns, `created_at` and `last_updated`:

In [None]:
table_name = "mutable_product_embeddings"
table_ddl = f"""
CREATE TABLE {table_name} (
    id STRING(1024) NOT NULL,
    embedding ARRAY<FLOAT32>(vector_length=>384),
    content STRING(MAX),
    metadata JSON,
    created_at TIMESTAMP,
    last_updated TIMESTAMP
) PRIMARY KEY (id)
"""
client = get_spanner_client(PROJECT_ID)
ensure_database_exists(client, INSTANCE_ID, DATABASE_ID)
create_or_replace_table(client, INSTANCE_ID, DATABASE_ID, table_name, table_ddl)

### Sample Data: Day 1 vs Day 2

In [None]:
PRODUCTS_DATA_DAY1 = [
    {
        "id": "desk-001",
        "name": "Modern Minimalist Desk",
        "description": "Sleek minimalist desk with clean lines and a spacious work surface. "
                      "Features cable management system and sturdy steel frame.",
        "category": "Desks",
        "price": 399.99,
        "update_timestamp": "2024-02-18"
    }
]

PRODUCTS_DATA_DAY2 = [
    {
        "id": "desk-001",  # Same ID as Day 1
        "name": "Modern Minimalist Desk",
        "description": "Updated: Sleek minimalist desk with built-in wireless charging. "
                      "Features cable management system, sturdy steel frame, and Qi charging pad. "
                      "Perfect for modern tech-enabled workspaces.",
        "category": "Smart Desks",  # Category changed
        "price": 449.99,  # Price increased
        "update_timestamp": "2024-02-19"
    }
]

### Configure Pipeline Components
#### Writer with `write_mode` specified

In [None]:
# Day 1 data
config_day1 = SpannerVectorWriterConfig(
    project_id=PROJECT_ID,
    instance_id=INSTANCE_ID,
    database_id=DATABASE_ID,
    table_name=table_name,
    write_mode='INSERT',
    column_specs=SpannerColumnSpecsBuilder()
      .with_defaults()
      .add_column(
        column_name='created_at',
        python_type=str,
        value_fn=lambda _: datetime.now().isoformat()+'Z'
      )
      .add_column(
        column_name='last_updated',
        python_type=str,
        value_fn=lambda _: datetime.now().isoformat()+'Z'
    ).build()
)

# Day 2 update
config_day2 = SpannerVectorWriterConfig(
    project_id=PROJECT_ID,
    instance_id=INSTANCE_ID,
    database_id=DATABASE_ID,
    table_name=table_name,
    write_mode='UPDATE',  # 'UPDATE' to fail if doesn't exist
    column_specs=SpannerColumnSpecsBuilder()
      .with_defaults()
      .add_column(
        column_name='last_updated',
        python_type=str,
        value_fn=lambda _: datetime.now().isoformat()+'Z'
    ).build()
)

### Run Day 1 Pipeline

First, let's ingest our initial product data:

In [None]:
import tempfile

# Executing on DirectRunner (local execution)
with beam.Pipeline() as p:
    _ = (
        p
        | 'Create Products' >> beam.Create(PRODUCTS_DATA_DAY1)
        | 'Convert to Chunks' >> beam.Map(lambda product_dict: Chunk(Content(text=f"{product_dict['name']}: {product_dict['description']}"), id=product_dict["id"], metadata=product_dict))
        | 'Generate Embeddings' >> MLTransform(write_artifact_location=tempfile.mkdtemp())
          .with_transform(HuggingfaceTextEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"))
        | 'Write to Spanner' >> VectorDatabaseWriteTransform(
            config_day1
        )
    )

In [None]:
print("\nAfter Day 1 ingestion:")
verify_embeddings_spanner(client,INSTANCE_ID, DATABASE_ID, table_name)

### Run Day 2 Pipeline

Now let's process our updated product data:

In [None]:
import tempfile

# Executing on DirectRunner (local execution)
with beam.Pipeline() as p:
    _ = (
        p
        | 'Create Products' >> beam.Create(PRODUCTS_DATA_DAY2)
        | 'Convert to Chunks' >> beam.Map(lambda product_dict: Chunk(Content(text=f"{product_dict['name']}: {product_dict['description']}"), id=product_dict["id"], metadata=product_dict))
        | 'Generate Embeddings' >> MLTransform(write_artifact_location=tempfile.mkdtemp())
          .with_transform(HuggingfaceTextEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"))
        | 'Write to Spanner' >> VectorDatabaseWriteTransform(
            config_day2
        )
    )

In [None]:
print("\nAfter Day 2 ingestion:")
verify_embeddings_spanner(client,INSTANCE_ID, DATABASE_ID, table_name)

### What Changed?

Key points to notice:

1. The embedding vector changed because the product description was updated
2. The metadata JSON field contains the updated category, price, and timestamp
3. The content field reflects the new description
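4. The original ID remained the same
5. `created_at` keeps its Day 1 value because the Day 2 config does not write that column, while `last_updated` is refreshed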
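
To see which input fields differ between the two days, a small dictionary diff can help. This is a minimal sketch. `changed_fields` is a hypothetical helper, and the two records are abbreviated copies of the Day 1 and Day 2 samples with the descriptions left out.

```python
def changed_fields(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every field that differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Abbreviated copies of the Day 1 and Day 2 records (descriptions omitted).
day1 = {"id": "desk-001", "name": "Modern Minimalist Desk",
        "category": "Desks", "price": 399.99, "update_timestamp": "2024-02-18"}
day2 = {"id": "desk-001", "name": "Modern Minimalist Desk",
        "category": "Smart Desks", "price": 449.99, "update_timestamp": "2024-02-19"}

for name, (before, after) in changed_fields(day1, day2).items():
    print(f"{name}: {before!r} -> {after!r}")
```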


## Adding Embeddings to Existing Database Records <a name="integration" target="_blank"></a>

This section demonstrates how to:
1. Read existing product data from a database
2. Generate embeddings for that data
3. Write the embeddings back to the database

In [None]:
table_name = "existing_product_embeddings"
table_ddl = f"""
CREATE TABLE {table_name} (
    id STRING(1024) NOT NULL,
    embedding ARRAY<FLOAT32>(vector_length=>384),
    content STRING(MAX),
    description STRING(MAX),
    created_at TIMESTAMP,
    last_updated TIMESTAMP
) PRIMARY KEY (id)
"""
client = get_spanner_client(PROJECT_ID)
ensure_database_exists(client, INSTANCE_ID, DATABASE_ID)
create_or_replace_table(client, INSTANCE_ID, DATABASE_ID, table_name, table_ddl)

Let's first ingest some unembedded data into our table.

Note that this simply reuses `SpannerVectorWriterConfig`, without an embedding column spec, to ingest the unembedded data.

In [None]:
import tempfile

data = PRODUCTS_DATA.copy()

# Executing on DirectRunner (local execution)
with beam.Pipeline() as p:
    _ = (
        p
        | 'Create Products' >> beam.Create(PRODUCTS_DATA)
        | 'Convert to Chunks' >> beam.Map(lambda product_dict: Chunk(Content(text=f"{product_dict['name']}: {product_dict['description']}"), id=product_dict["id"], metadata=product_dict))
        | 'Write to Spanner' >> VectorDatabaseWriteTransform(
            SpannerVectorWriterConfig(
                PROJECT_ID,
                INSTANCE_ID,
                DATABASE_ID,
                table_name,
                column_specs=(
                    SpannerColumnSpecsBuilder()
                      .with_id_spec()
                      .with_content_spec()
                      .add_metadata_field("description", str)
                      .add_column(
                          column_name='created_at',
                          python_type=str,
                          value_fn=lambda _: datetime.now().isoformat()+'Z'
                        )
                        .add_column(
                          column_name='last_updated',
                          python_type=str,
                          value_fn=lambda _: datetime.now().isoformat()+'Z'
                      ).build())
            )
        )
    )

Let's look at the current state of our table. Notice that the embedding column (column 1) is empty.

In [None]:
verify_embeddings_spanner(client,INSTANCE_ID, DATABASE_ID, table_name)

In [None]:
from apache_beam.io.gcp import spanner

from typing import NamedTuple
from apache_beam import coders

class SpannerRow(NamedTuple):
  id: str
  content: str

def spanner_row_to_chunk(spanner_row):
  return Chunk(
      content= Content(spanner_row.content),
      id=spanner_row.id
  )

coders.registry.register_coder(SpannerRow, coders.RowCoder)

with beam.Pipeline() as p:
    _ = (
        p
        | "Read Unembedded data" >> spanner.ReadFromSpanner(PROJECT_ID, INSTANCE_ID, DATABASE_ID, row_type=SpannerRow, sql=f"select id, content from {table_name}")
        | "Spanner Row to Chunk" >> beam.Map(spanner_row_to_chunk)
        | "Generate Embeddings" >> MLTransform(write_artifact_location=tempfile.mkdtemp())
          .with_transform(HuggingfaceTextEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"))
        | "Update Spanner with embeddings" >> VectorDatabaseWriteTransform(
            SpannerVectorWriterConfig(
            PROJECT_ID,
            INSTANCE_ID,
            DATABASE_ID,
            table_name,
            column_specs=SpannerColumnSpecsBuilder().with_id_spec().with_embedding_spec().build()
          )
        )
    )


Now we confirm that our Spanner table was updated with embeddings:

In [None]:
verify_embeddings_spanner(client,INSTANCE_ID, DATABASE_ID, table_name)

### What Happened?
1. We started with a table containing product data but no embeddings
2. We read the `id` and `content` of existing records using `ReadFromSpanner`
3. We converted Spanner rows to Chunks, using the Spanner `id` column as the Chunk id and the Spanner `content` column as the Chunk content to be embedded
4. We generated embeddings using our model
5. We wrote back to the same table, updating only the `embedding` field and preserving all other fields (`description`, `created_at`, etc.)

This pattern is useful when:

- You have an existing product database
- You want to add embeddings without disrupting current data
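
If only some rows lack embeddings, the read query could be narrowed so that already-embedded rows are skipped. The sketch below only builds the SQL string. `unembedded_rows_sql` is a hypothetical helper, and it assumes the embedding column stays `NULL` until it is first written.

```python
def unembedded_rows_sql(table: str, id_col: str = "id",
                        content_col: str = "content",
                        embedding_col: str = "embedding") -> str:
    """Build a query selecting only rows whose embedding is still NULL."""
    for ident in (table, id_col, content_col, embedding_col):
        # Accept plain identifiers only, so nothing else can be spliced in.
        if not ident.isidentifier():
            raise ValueError(f"invalid identifier: {ident!r}")
    return (f"SELECT {id_col}, {content_col} FROM {table} "
            f"WHERE {embedding_col} IS NULL")


print(unembedded_rows_sql("existing_product_embeddings"))
```

The resulting string could be passed as the `sql` argument of `ReadFromSpanner` in place of the unfiltered query used above.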


## Generate Embeddings with VertexAI Text Embeddings

This section demonstrates how to use the Vertex AI text-embeddings API to generate text embeddings with Google's large generative artificial intelligence (AI) models.

Vertex AI models are subject to [Rate Limits and Quotas](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas#view-the-quotas-by-region-and-by-model). Dataflow automatically retries throttled requests with exponential backoff.
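
For readers unfamiliar with the term, the following is a generic sketch of retrying with exponential backoff and jitter. It illustrates the general technique only and is not the retry logic that Beam or Dataflow actually uses. `Throttled` and `call_with_backoff` are hypothetical names.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical error standing in for a quota or rate-limit response."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on Throttled with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time up to the current cap.
            sleep(random.uniform(0, cap))


# Demo: fail twice, then succeed. Delays are recorded instead of waited.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise Throttled()
    return "embedding"


waits = []
result = call_with_backoff(flaky, sleep=waits.append)
print(result, len(waits))
```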


For more information, see [Get text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) in the Vertex AI documentation.

### Authenticate with Google Cloud
To use the Vertex AI API, we authenticate with Google Cloud.


In [None]:
import sys
if 'google.colab' in sys.modules:
  from google.colab import auth
  auth.authenticate_user(project_id=PROJECT_ID)

In [None]:
table_name = "vertex_product_embeddings"
table_ddl = f"""
CREATE TABLE {table_name} (
    id STRING(1024) NOT NULL,
    embedding ARRAY<FLOAT32>(vector_length=>768),
    content STRING(MAX),
    metadata JSON
) PRIMARY KEY (id)
"""
client = get_spanner_client(PROJECT_ID)
ensure_database_exists(client, INSTANCE_ID, DATABASE_ID)
create_or_replace_table(client, INSTANCE_ID, DATABASE_ID, table_name, table_ddl)

### Configure Embedding Handler

Import the `VertexAITextEmbeddings` handler, and specify the desired embedding model (here, `text-embedding-005`).

In [None]:
from apache_beam.ml.rag.embeddings.vertex_ai import VertexAITextEmbeddings

vertexai_embedder = VertexAITextEmbeddings(model_name="text-embedding-005")

In [None]:
import tempfile

# Executing on DirectRunner (local execution)
with beam.Pipeline() as p:
    _ = (
        p
        | 'Create Products' >> beam.Create(PRODUCTS_DATA)
        | 'Convert to Chunks' >> beam.Map(create_chunk)
        | 'Generate Embeddings' >> MLTransform(write_artifact_location=tempfile.mkdtemp())
          .with_transform(vertexai_embedder)
        | 'Write to Spanner' >> VectorDatabaseWriteTransform(
            SpannerVectorWriterConfig(
                project_id=PROJECT_ID,
                instance_id=INSTANCE_ID,
                database_id=DATABASE_ID,
                table_name=table_name
            )
        )
    )

In [None]:
verify_embeddings_spanner(client,INSTANCE_ID, DATABASE_ID, table_name)

## Streaming Embeddings Updates from PubSub

This section demonstrates how to build a real-time embedding pipeline that continuously processes product updates and maintains fresh embeddings in Spanner. This approach is ideal for data that changes frequently.

### Authenticate with Google Cloud
To use PubSub, we authenticate with Google Cloud.


In [None]:
import sys
if 'google.colab' in sys.modules:
  from google.colab import auth
  auth.authenticate_user(project_id=PROJECT_ID)

### Setting Up PubSub Resources

First, let's set up the necessary PubSub topic:

In [None]:
from google.cloud import pubsub_v1
from google.api_core.exceptions import AlreadyExists
import json

# Define pubsub topic
TOPIC = "" # @param {type:'string'}

# Create publisher client and topic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC)
try:
    topic = publisher.create_topic(request={"name": topic_path})
    print(f"Created topic: {topic.name}")
except AlreadyExists:
    print(f"Topic {topic_path} already exists.")

### Create Spanner Table for Streaming Updates

Next, create a table to store the embedded data.

In [None]:
table_name = "streaming_product_embeddings"
table_ddl = f"""
CREATE TABLE {table_name} (
    id STRING(1024) NOT NULL,
    embedding ARRAY<FLOAT32>(vector_length=>384),
    content STRING(MAX),
    metadata JSON
) PRIMARY KEY (id)
"""
client = get_spanner_client(PROJECT_ID)
ensure_database_exists(client, INSTANCE_ID, DATABASE_ID)
create_or_replace_table(client, INSTANCE_ID, DATABASE_ID, table_name, table_ddl)

### Configure the Pipeline options
To run the pipeline on Dataflow, we need:
- A GCS bucket for staging Dataflow files. Replace `<BUCKET_NAME>` with the name of a valid Google Cloud Storage bucket. Don't include a `gs://` prefix or trailing slashes.
- Optionally, the Google Cloud region that you want to run Dataflow in. Replace `<REGION>` with the desired location.
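
Because a stray `gs://` prefix or trailing slash would produce a malformed path, it can be worth normalizing the bucket name before building the staging and temp locations. This is a small sketch under that assumption. `dataflow_locations` is a hypothetical helper, and `my-bucket` is a placeholder name.

```python
def dataflow_locations(bucket_name: str) -> dict:
    """Derive staging and temp paths from a bare bucket name."""
    name = bucket_name.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.strip("/")
    if not name or "/" in name:
        raise ValueError(f"expected a bare bucket name, got {bucket_name!r}")
    root = f"gs://{name}/dataflow"
    return {
        "staging_location": f"{root}/staging",
        "temp_location": f"{root}/temp",
    }


print(dataflow_locations("gs://my-bucket/"))
```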


In [None]:
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions, SetupOptions, GoogleCloudOptions, WorkerOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

# Provide required pipeline options for the Dataflow Runner.
options.view_as(StandardOptions).runner = "DataflowRunner"

# Set the Google Cloud region that you want to run Dataflow in.
REGION = '' # @param {type:'string'}
options.view_as(GoogleCloudOptions).region = REGION

NETWORK = '' # @param {type:'string'}
if NETWORK:
  options.view_as(WorkerOptions).network = NETWORK

SUBNETWORK = '' # @param {type:'string'}
if SUBNETWORK:
  options.view_as(WorkerOptions).subnetwork = f"regions/{REGION}/subnetworks/{SUBNETWORK}"

options.view_as(GoogleCloudOptions).project = PROJECT_ID

BUCKET_NAME = '' # @param {type:'string'}
dataflow_gcs_location = "gs://%s/dataflow" % BUCKET_NAME

# The Dataflow staging location. This location is used to stage the Dataflow pipeline and the SDK binary.
options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location

# The Dataflow temp location. This location is used to store temporary files or intermediate results before outputting to the sink.
options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location

import random
options.view_as(GoogleCloudOptions).job_name = f"spanner-streaming-embedding-ingest{random.randint(0,1000)}"

# options.view_as(SetupOptions).save_main_session = True
options.view_as(SetupOptions).requirements_file = "./requirements.txt"


In [None]:
!echo "sentence-transformers" > ./requirements.txt
!cat ./requirements.txt


### Provide additional Python dependencies to be installed on worker VMs

We use the HuggingFace `sentence-transformers` package to generate embeddings. Since this package is not installed on worker VMs by default, we create a `requirements.txt` file listing the additional dependencies to install on them.

See [Managing Python Pipeline Dependencies](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/) for more details.


### Configure and Run Pipeline

Our pipeline contains these key components:

1. **Source**: Continuously reads messages from PubSub
2. **Transformation**: Converts JSON messages to Chunk objects for embedding
3. **ML Processing**: Generates embeddings using HuggingFace models
4. **Sink**: Writes results to Spanner (INSERT_OR_UPDATE)
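
The transformation step receives each message as UTF-8 encoded JSON bytes. As a hedged sketch of what that payload looks like, and of one way to guard against malformed input, the hypothetical helper `decode_product` below returns `None` instead of raising. The pipeline in this notebook does not include this check.

```python
import json
from typing import Optional


def decode_product(message: bytes) -> Optional[dict]:
    """Decode a UTF-8 JSON payload; return None if it is unusable."""
    try:
        product = json.loads(message.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    # A row cannot be keyed without an id, so treat that as unusable too.
    if not isinstance(product, dict) or not product.get("id"):
        return None
    return product


good = json.dumps({"id": "desk-001", "name": "Modern Minimalist Desk"}).encode("utf-8")
print(decode_product(good))
print(decode_product(b"not json"))
```

In a Beam pipeline, results that come back as `None` could be dropped before the embedding step, for example with `beam.Filter`.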

In [None]:
def parse_message(message):
  """Parse a PubSub message containing product data into a Chunk."""
  product_json = json.loads(message.decode('utf-8'))
  return Chunk(
      content=Content(
          text=f"{product_json.get('name', '')}: {product_json.get('description', '')}"
      ),
      id=product_json.get('id', ''),
      metadata=product_json
  )

pipeline = beam.Pipeline(options=options)
# Streaming pipeline
_ = (
    pipeline
    | "Read from PubSub" >> beam.io.ReadFromPubSub(
        topic=f"projects/{PROJECT_ID}/topics/{TOPIC}"
    )
    | "Parse Messages" >> beam.Map(parse_message)
    | "Generate Embeddings" >> MLTransform(write_artifact_location=tempfile.mkdtemp())
        .with_transform(HuggingfaceTextEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"))
    | "Write to Spanner" >> VectorDatabaseWriteTransform(
        SpannerVectorWriterConfig(
          PROJECT_ID,
          INSTANCE_ID,
          DATABASE_ID,
          table_name
        )
    )
)

### Create Publisher Thread
The publisher simulates real-time product updates by:
- Publishing sample product data to the PubSub topic every 5 seconds
- Modifying prices and descriptions to represent changes
- Adding timestamps to track update times
- Waiting 5 minutes before the first message, then running for 25 minutes in the background while our pipeline processes the data
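
The timing and price changes follow directly from the constants in the publisher cell: 300 messages, a 5 second pause between them, and a price multiplier that cycles every 20 messages. The sketch below reproduces just that arithmetic. `dynamic_factor`, `NUM_MESSAGES`, and `INTERVAL_SECONDS` are names chosen here for illustration.

```python
def dynamic_factor(i: int) -> float:
    """Price multiplier for the i-th message; repeats every 20 messages."""
    return 1.05 + (0.1 * ((i % 20) / 20))


NUM_MESSAGES = 300
INTERVAL_SECONDS = 5

factors = [dynamic_factor(i) for i in range(NUM_MESSAGES)]
print(f"min factor: {min(factors):.3f}, max factor: {max(factors):.3f}")
print(f"publishing time: {NUM_MESSAGES * INTERVAL_SECONDS / 60:.0f} minutes")
```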

In [None]:
#@title Define PubSub publisher function
import threading
import time
import json
import logging
from google.cloud import pubsub_v1
import datetime
import os
import sys
# Each publisher thread writes its own log file, publisher_<thread_id>.log, in this directory.
log_dir = os.getcwd()

print(f"Publisher log files will be created in: {log_dir}")

def publisher_function(project_id, topic):
    """Function that publishes sample product updates to a PubSub topic.

    This function runs in a separate thread and continuously publishes
    messages to simulate real-time product updates.
    """
    time.sleep(300)
    thread_id = threading.current_thread().ident

    process_log_file = os.path.join(os.getcwd(), f"publisher_{thread_id}.log")

    file_handler = logging.FileHandler(process_log_file)
    file_handler.setFormatter(logging.Formatter('%(asctime)s - ThreadID:%(thread)d - %(levelname)s - %(message)s'))

    logger = logging.getLogger(f"worker.{thread_id}")
    logger.setLevel(logging.INFO)
    logger.addHandler(file_handler)

    logger.info(f"Publisher thread started with ID: {thread_id}")
    file_handler.flush()

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic)

    logger.info("Starting to publish messages...")
    file_handler.flush()
    for i in range(300):
        message_index = i % len(PRODUCTS_DATA)
        message = PRODUCTS_DATA[message_index].copy()


        dynamic_factor = 1.05 + (0.1 * ((i % 20) / 20))
        message["price"] = round(message["price"] * dynamic_factor, 2)
        message["description"] = f"PRICE UPDATE (factor: {dynamic_factor:.3f}): " + message["description"]

        message["published_at"] = datetime.datetime.now().isoformat()

        data = json.dumps(message).encode('utf-8')
        publish_future = publisher.publish(topic_path, data)

        try:
            logger.info(f"Publishing message {message}")
            file_handler.flush()
            message_id = publish_future.result()
            logger.info(f"Published message {i+1}: {message['id']} (Message ID: {message_id})")
            file_handler.flush()
        except Exception as e:
            logger.error(f"Error publishing message: {e}")
            file_handler.flush()

        time.sleep(5)

    logger.info("Finished publishing all messages.")
    file_handler.flush()

#### Start publishing to PubSub in background

In [None]:
# Launch publisher in a separate thread
print("Starting publisher thread; it will begin publishing in 5 minutes...")
publisher_thread = threading.Thread(
    target=publisher_function,
    args=(PROJECT_ID, TOPIC),
    daemon=True
)
publisher_thread.start()
print(f"Publisher thread started with ID: {publisher_thread.ident}")
print(f"Publisher thread logging to file: publisher_{publisher_thread.ident}.log")

### Run Pipeline on Dataflow

We launch the pipeline to run remotely on Dataflow. Once the job is launched, you can monitor its progress in the Google Cloud Console:
1. Go to https://console.cloud.google.com/dataflow/jobs
2. Select your project
3. Click on the job whose name starts with "spanner-streaming-embedding-ingest"
4. View detailed execution graphs, logs, and metrics

**Note**: This streaming pipeline runs indefinitely until manually stopped. Be sure to monitor usage and terminate the job in the [Dataflow jobs console](https://console.cloud.google.com/dataflow/jobs) when you have finished testing to avoid unnecessary costs.

### What to Expect
After running this pipeline, you should see:
- Continuous updates to product embeddings in the Spanner table
- Price and description changes reflected in the metadata
- New embeddings generated for updated product descriptions
- `published_at` timestamps in the metadata showing when each update was published

In [None]:
# Run pipeline
pipeline_result = pipeline.run_async()

In [None]:
pipeline_result

### Verify data
Monitor your job at https://console.cloud.google.com/dataflow/jobs. Once the workers have started processing messages, verify that data has been written:

In [None]:
verify_embeddings_spanner(client,INSTANCE_ID, DATABASE_ID, table_name)

Finally, stop your streaming job to tear down the resources.
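
Besides the console, the job can usually be stopped from the notebook. The sketch below assumes the object returned when the pipeline was launched exposes Beam's `PipelineResult.cancel()` method. `stop_streaming_job` is a hypothetical wrapper, and it is still worth confirming in the console that the job has stopped.

```python
def stop_streaming_job(result) -> None:
    """Request cancellation of a running pipeline, if a result handle exists."""
    if result is None:
        print("No pipeline result to cancel.")
        return
    result.cancel()
    print("Cancellation requested.")


# Usage in this notebook (assumes `pipeline_result` from the run cell above):
# stop_streaming_job(pipeline_result)
```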