# Neptune Analytics Instance Management With S3 Table Embedding Projections


This notebook demonstrates how embedding data stored in a data lake can be imported into Amazon Neptune Analytics and queried with the TopK algorithm package.

The goal is to ingest embedding vectors as graph data and enable similarity search, allowing the system to identify products with similar characteristics based on their embedding representations.

This notebook covers the following steps:
1. Download the Kaggle fashion dataset, enrich it with an embedding column generated using Amazon Bedrock, and store the result in Amazon S3.
2. Create an Athena projection from an S3 Tables bucket.
3. Import the projection into Neptune Analytics.
4. Run `topK.byNode` to search for similar products and return similarity scores.


## Setup

Import the necessary libraries and set up logging.

In [None]:
import asyncio
import os
import pandas as pd
import boto3
import dotenv

from nx_neptune import empty_s3_bucket, instance_management, NeptuneGraph, set_config_graph_id
from nx_neptune.instance_management import execute_athena_query, _clean_s3_path
from nx_neptune.utils.utils import get_stdout_logger, check_env_vars, _get_bedrock_embedding, read_csv, write_csv, \
    push_to_s3, to_embedding_entries, push_to_s3_vector, generate_create_table_ddl, generate_projection_stmt
from nx_neptune.utils.task_future import wait_until_all_complete, TaskType

dotenv.load_dotenv()

from nx_neptune.session_manager import SessionManager

# Configure logging to see detailed information about the Athena query and import process
logger = get_stdout_logger(__name__, [
    'nx_neptune.instance_management',
    'nx_neptune.utils.task_future',
    'nx_neptune.session_manager',
    'nx_neptune.interface',
    __name__
])

## Configuration

Check for environment variables necessary for the notebook.

In [None]:
# Check for the environment variables used by this notebook
env_vars = check_env_vars([
    'NETWORKX_S3_DATA_LAKE_BUCKET_PATH',
    'NETWORKX_S3_LOG_BUCKET_PATH',
    'NETWORKX_S3_IMPORT_BUCKET_PATH',
    'NETWORKX_S3_TABLES_DATABASE',
    'NETWORKX_S3_TABLES_TABLENAME',
    'S3_VECTOR_CONNECTOR',
    'S3_VECTOR_BUCKET',
    'S3_VECTOR_INDEX',
    'NETWORKX_GRAPH_ID'
])

(s3_location_data_lake, s3_location_log, s3_location_import, 
 s3_tables_database, s3_tables_tablename, s3_vector_connector,
 s3_vector_bucket, s3_vector_index, graph_id) = env_vars.values()
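
Unpacking `env_vars.values()` positionally only works while the tuple on the left stays in the same order as the list passed to `check_env_vars`. The following is a minimal sketch of reading the same settings by key instead. It assumes that `check_env_vars` returns a dict keyed by variable name, which the `.values()` call suggests but does not confirm. `require_keys` and the sample values are hypothetical and not part of `nx_neptune`.

```python
from typing import Dict, List, Mapping, Optional


def require_keys(values: Mapping[str, Optional[str]], keys: List[str]) -> Dict[str, str]:
    """Return the requested entries, raising if any is missing or empty."""
    missing = [k for k in keys if not values.get(k)]
    if missing:
        raise KeyError(f"Missing configuration values: {', '.join(missing)}")
    return {k: values[k] for k in keys}


# Stand-in for the mapping returned by check_env_vars (illustrative values only).
example_env = {
    "NETWORKX_S3_IMPORT_BUCKET_PATH": "s3://example-bucket/import/",
    "NETWORKX_GRAPH_ID": "g-example",
}
settings = require_keys(
    example_env, ["NETWORKX_S3_IMPORT_BUCKET_PATH", "NETWORKX_GRAPH_ID"]
)
print(settings["NETWORKX_GRAPH_ID"])
```

Looking values up by name means that adding or reordering a variable in the list cannot silently shift the other assignments.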


## Data Setup

Fashion product data is sourced from the Kaggle dataset [Fashion Product Images (Small)](https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small).

For this demo, only the styles.csv file is required.

In this section, the dataset is modified to append an additional embedding column generated using Amazon Bedrock. The enriched CSV file is then uploaded to Amazon S3 for downstream processing as part of the data lake projection workflow.


In [None]:
# Download styles.csv from the Kaggle fashion dataset (only styles.csv is needed).
# https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small
data_path = "../example/resources/styles.csv"

# Read data from data path
headers, rows = read_csv(data_path)

# Print out the data file content
df = pd.DataFrame(rows)
df
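
The later cells use `rows` both as input to `pd.DataFrame` and for column lookups by name, which suggests a list of per-row dicts. The following is a sketch of an equivalent reader built on the standard library, assuming that shape. `read_csv_text` is a hypothetical stand-in for illustration, not the `nx_neptune` implementation, and the sample rows are made up.

```python
import csv
import io
from typing import Dict, List, Tuple


def read_csv_text(text: str) -> Tuple[List[str], List[Dict[str, str]]]:
    """Parse CSV text into (headers, rows), with each row a dict keyed by header."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return list(reader.fieldnames or []), rows


# Made-up sample in the spirit of styles.csv (not real dataset rows).
sample = "id,masterCategory,baseColour\n1,Apparel,Blue\n2,Footwear,Black\n"
sample_headers, sample_rows = read_csv_text(sample)
print(sample_headers)
print(sample_rows[1]["baseColour"])
```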

### Data Enrichment – Embeddings

Next, each row is transformed: an embedding is calculated from a selected set of attributes, converted into an S3 Vectors-compatible format, and uploaded to the S3 Vectors service.


In [None]:
# Attributes that are combined to produce each row's embedding
columns_to_embed = ["masterCategory", "subCategory", "articleType",
                    "baseColour", "season", "year", "usage", "productDisplayName"]

# Build embedding entries for the first 10 rows only
items = to_embedding_entries(rows[:10], columns_to_embed)

# Write the embeddings to the S3 Vectors index
push_to_s3_vector(items, s3_vector_bucket, s3_vector_index)
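
An embedding model takes a single piece of text per item, so the selected attributes have to be combined into one string first. How `to_embedding_entries` does this is not shown in this notebook. The following sketch shows one plausible approach under that assumption. `row_to_text` and the sample row are hypothetical.

```python
from typing import Dict, List


def row_to_text(row: Dict[str, str], columns: List[str]) -> str:
    """Join the selected attributes into one 'name: value' string, skipping blanks."""
    parts = [f"{col}: {row[col]}" for col in columns if row.get(col)]
    return "; ".join(parts)


# Made-up row; the empty season is skipped rather than embedded as noise.
sample_row = {"id": "1", "masterCategory": "Apparel", "baseColour": "Blue", "season": ""}
text = row_to_text(sample_row, ["masterCategory", "baseColour", "season"])
print(text)  # masterCategory: Apparel; baseColour: Blue
```

Keeping the attribute names in the text gives the model some context for short values such as colours or years. Whether that improves similarity results depends on the model.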


### Upload Dataset and Register in Athena

After the embedding column is added, the enriched dataset is uploaded to Amazon S3.

An external table is then created in Amazon Athena over the uploaded CSV, exposing both the original attributes and the embedding array for SQL-based access.

In [None]:
# Clear the data lake location, then upload the CSV to S3
empty_s3_bucket(s3_location_data_lake)
push_to_s3(data_path, _clean_s3_path(s3_location_data_lake), "styles.csv")

# Define the schema of the external table
columns = [
    ("id", "string"),
    ("gender", "string"),
    ("masterCategory", "string"),
    ("subCategory", "string"),
    ("articleType", "string"),
    ("baseColour", "string"),
    ("season", "string"),
    ("year", "int"),
    ("usage", "string"),
    ("productDisplayName", "string"),
    ("embedding", "array<float>")
]
stmt_s3_table = generate_create_table_ddl(s3_tables_tablename, s3_location_data_lake, columns)

await execute_athena_query(stmt_s3_table, s3_location_log, database=s3_tables_database, polling_interval=5)


print("Data lake preparation completed.")
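
For reference, this sketch shows the kind of statement a helper such as `generate_create_table_ddl` might produce from the `(name, type)` pairs above. It is an assumption about the general shape of an Athena external-table DDL over CSV, not the output of the library. `sketch_create_table_ddl` and the bucket path are hypothetical.

```python
from typing import List, Tuple


def sketch_create_table_ddl(table: str, location: str, columns: List[Tuple[str, str]]) -> str:
    """Build a CREATE EXTERNAL TABLE statement for comma-delimited files with a header row."""
    cols = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n{cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


ddl = sketch_create_table_ddl(
    "example_table",
    "s3://example-bucket/data-lake/",  # placeholder path
    [("id", "string"), ("year", "int")],
)
print(ddl)
```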

## Import Data into Neptune Analytics and Perform Similarity Search

A projection query is executed in Athena to select the required columns, map Neptune-compatible headers, and flatten the embedding array into a vector format.

The resulting CSV is compatible with Amazon Neptune Analytics import requirements and can be ingested directly to enable vector similarity search on the graph.


In [None]:
# Clear import directory
empty_s3_bucket(s3_location_import)

# Fully qualified Athena reference to the S3 Vectors index: "connector"."bucket"."index"
s3_vector_table_ref = f'"{s3_vector_connector}"."{s3_vector_bucket}"."{s3_vector_index}"'

stmt_projection = generate_projection_stmt(
    col_id="t.id",
    col_label="t.masterCategory",
    col_embedding="v.embedding",
    columns=["t.baseColour", "t.season"],
    base_table=f"{s3_tables_tablename} as t",
    joins=[
        (f"{s3_vector_table_ref} v", "t.id = v.vector_id")])

await execute_athena_query(stmt_projection, s3_location_log, database=s3_tables_database, polling_interval=5)


# Remove the .csv.metadata file that Athena writes next to the results; it is not part of the import.
empty_s3_bucket(s3_location_import, file_extension=".csv.metadata")

# Reset the graph and import the projected CSV from S3
task_id = await instance_management.import_csv_from_s3(
    NeptuneGraph.from_config(set_config_graph_id(graph_id)),
    s3_location_import,
    reset_graph_ahead=True,
    skip_snapshot=True,
)
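
To make the target of the projection concrete, this sketch writes a small file in the node CSV layout that Neptune Analytics is understood to accept: `~id` and `~label` system columns, typed property columns, and an `embedding:vector` column holding semicolon-separated floats. Treat the header names as an assumption to verify against the Neptune Analytics import documentation. `to_neptune_csv` and the sample node are hypothetical.

```python
import csv
import io
from typing import Dict, List


def to_neptune_csv(nodes: List[Dict]) -> str:
    """Serialise nodes to CSV with the embedding flattened to 'x;y;z'."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["~id", "~label", "embedding:vector", "baseColour:String", "season:String"])
    for node in nodes:
        vector = ";".join(str(x) for x in node["embedding"])
        writer.writerow([node["id"], node["label"], vector, node["baseColour"], node["season"]])
    return buffer.getvalue()


# Made-up node with a three-dimensional embedding for readability.
csv_text = to_neptune_csv([
    {"id": "1", "label": "Apparel", "embedding": [0.1, 0.2, 0.3],
     "baseColour": "Blue", "season": "Summer"},
])
print(csv_text)
```

Flattening matters because a comma-separated vector would be split across columns by the CSV parser.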


### Inspect Embedding

A simple query inspects the imported embeddings by printing the first five floating-point values of the embedding vector for a sample of three nodes.

This provides a quick sanity check to verify that the embedding data has been ingested and stored correctly before running similarity queries.

In [None]:
config = set_config_graph_id(graph_id)
na_graph = NeptuneGraph.from_config(config)

SHOW_EMBEDDING_QUERY = """
    MATCH (n)
    CALL neptune.algo.vectors.get(n)
    YIELD embedding
    RETURN n, embedding[0..5] AS embedding_first_five
    LIMIT 3
"""

all_nodes = na_graph.execute_call(SHOW_EMBEDDING_QUERY)
for n in all_nodes:
    print(n["n"]["~id"] + ": " + str(n["embedding_first_five"]))
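
Beyond eyeballing a few values, a programmatic check can confirm that every vector has the same dimension and contains only finite numbers. This sketch assumes the full vectors have already been collected into a dict keyed by node id, for example from a variant of the query above without the slice. `check_embeddings` and the sample vectors are hypothetical.

```python
import math
from typing import Dict, List


def check_embeddings(vectors: Dict[str, List[float]]) -> int:
    """Return the shared dimension, raising if vectors are empty, ragged or non-finite."""
    dims = {len(v) for v in vectors.values()}
    if len(dims) != 1 or 0 in dims:
        raise ValueError(f"Inconsistent or empty embedding dimensions: {sorted(dims)}")
    for node_id, vec in vectors.items():
        if not all(math.isfinite(x) for x in vec):
            raise ValueError(f"Non-finite value in embedding for node {node_id}")
    return dims.pop()


# Made-up two-dimensional vectors for illustration.
dimension = check_embeddings({"a": [0.1, 0.2], "b": [0.3, 0.4]})
print(dimension)  # 2
```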

### Similarity Search

You can now run `neptune.algo.vectors.topK.byNode` to perform similarity search using the imported embedding vectors.

This query returns the top-K most similar nodes along with their similarity scores, confirming that the embeddings are correctly integrated and usable for semantic similarity search in Amazon Neptune Analytics.

In [None]:
TOPK_QUERY = """
    MATCH (n) WHERE id(n) = '30805'
    CALL neptune.algo.vectors.topK.byNode(
      n, {topK: 5})
    YIELD node, score
    RETURN node, score
"""

all_nodes = na_graph.execute_call(TOPK_QUERY)
for n in all_nodes:
    print(n["node"]["~id"] + ", score:" + str(n["score"]))
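
To build intuition for what a top-K search returns, this sketch ranks a toy set of vectors against a query node in plain Python. It uses cosine similarity purely as an illustration. The metric and score semantics that Neptune Analytics applies may differ, so the numbers are not comparable with the `score` values above. `cosine`, `top_k` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_id: str, vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k nodes most similar to query_id, excluding the query node itself."""
    query = vectors[query_id]
    scored = [(nid, cosine(query, vec)) for nid, vec in vectors.items() if nid != query_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
for node_id, score in top_k("a", toy, 2):
    print(f"{node_id}, score: {score:.4f}")
```

This brute-force scan compares the query against every vector, which is fine for a handful of rows but is the kind of work a vector index is meant to avoid at scale.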

## Conclusion

This notebook demonstrated the complete lifecycle of embedding vectors, from ingestion and transformation in the data lake to availability within Amazon Neptune Analytics for similarity search.

By integrating embedding data directly into the graph, this approach enables scalable and explainable similarity queries using native graph algorithms such as TopK. This is particularly important for recommendation, product discovery, and enrichment workflows, where vector similarity needs to be combined with structured graph relationships and properties, rather than treated as an isolated retrieval step.

In practice, this pattern provides a flexible foundation for building hybrid graph-and-vector analytics pipelines that can evolve alongside existing data lake architectures.