# Neptune Analytics Instance Management With S3 Table Embedding Projections


This notebook demonstrates how embedding data stored in a data lake can be imported into Amazon Neptune Analytics and queried with the TopK vector similarity algorithm.

The goal is to ingest embedding vectors as graph data and enable similarity search, allowing the system to identify products with similar characteristics based on their embedding representations.

The content of this notebook includes:
1. Download the Kaggle fashion dataset, enrich it with an embedding column generated using Amazon Bedrock, and store the result in Amazon S3.
2. Create an Athena projection from the S3 Tables bucket.
3. Import the projection into Neptune Analytics.
4. Run `topK.byNode` to search for similar products and return similarity scores.

## Setup

Import the necessary libraries and load environment variables from a `.env` file.

In [None]:
import asyncio
import os

import boto3
import dotenv

from nx_neptune import empty_s3_bucket, instance_management, NeptuneGraph, set_config_graph_id
from nx_neptune.instance_management import _execute_athena_query, _clean_s3_path
from nx_neptune.utils.utils import (
    get_stdout_logger,
    check_env_vars,
    _get_bedrock_embedding,
    read_csv,
    write_csv,
    push_to_s3,
)

dotenv.load_dotenv()

from nx_neptune.session_manager import SessionManager

## Configuration

Configure logging and check for the environment variables used by the notebook.

In [None]:
# Configure logging to see detailed information about the import process
logger = get_stdout_logger(__name__, [
    'nx_neptune.instance_management',
    'nx_neptune.utils.task_future',
    'nx_neptune.session_manager',
    'nx_neptune.interface',
    __name__
])

# Check for optional environment variables
env_vars = check_env_vars([
    'NETWORKX_S3_IMPORT_BUCKET_PATH',
    'NETWORKX_S3_EXPORT_BUCKET_PATH',
    'NETWORKX_S3_TABLES_CATALOG',
    'NETWORKX_S3_TABLES_DATABASE',
    'NETWORKX_S3_TABLES_TABLENAME',
])

# Get environment variables
s3_location_import = os.getenv('NETWORKX_S3_IMPORT_BUCKET_PATH')
s3_location_export = os.getenv('NETWORKX_S3_EXPORT_BUCKET_PATH')
s3_tables_database = os.getenv('NETWORKX_S3_TABLES_DATABASE')
s3_tables_tablename = os.getenv('NETWORKX_S3_TABLES_TABLENAME')
graph_id = os.getenv('NETWORKX_GRAPH_ID')
session_name = "nx-athena-test-full"
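
The configuration cell reads `NETWORKX_GRAPH_ID` with `os.getenv` but does not include it in the list passed to `check_env_vars`, so a missing graph id would only surface later. The behaviour of `check_env_vars` is not shown in this notebook; the following is a minimal sketch of a stricter check using only the standard library. The helper name `require_env` and the sample values are hypothetical.

```python
import os


def require_env(names, environ=os.environ):
    """Return the requested variables as a dict, raising if any is unset or empty.

    Hypothetical helper for illustration; it is not part of nx_neptune.
    """
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise EnvironmentError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Example with an explicit mapping instead of the real process environment.
example_env = {
    "NETWORKX_GRAPH_ID": "g-example",             # placeholder value
    "NETWORKX_S3_TABLES_DATABASE": "example_db",  # placeholder value
}
settings = require_env(
    ["NETWORKX_GRAPH_ID", "NETWORKX_S3_TABLES_DATABASE"], example_env
)
print(settings["NETWORKX_GRAPH_ID"])
```

Failing fast here avoids starting the Bedrock and Athena steps only to discover at import time that the graph id was never set.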

## Data Setup

The fashion product dataset is available from [Kaggle](https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small). Only `styles.csv` is needed.

The cells below add an embedding column to the product rows, upload the result to an S3 bucket, and create an Athena table over that bucket.

Each embedding is generated with Amazon Bedrock from the product's `masterCategory`, `subCategory`, `articleType`, and `baseColour` fields, so that products with similar attributes can be found through vector similarity search.

In [None]:

# Download styles.csv from the Kaggle fashion dataset (only styles.csv is needed):
# https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small

def append_embedding(headers, rows):
    """Add an `embedding` column to each row, generated with Amazon Bedrock."""
    fieldnames = headers + ["embedding"]
    bedrock = boto3.client("bedrock-runtime")

    for row in rows:
        # Embed the descriptive attributes; join with spaces so the words stay separated.
        text = " ".join([
            row["masterCategory"],
            row["subCategory"],
            row["articleType"],
            row["baseColour"],
        ])
        # For offline testing, a dummy vector such as [1.1] * 384 can be substituted here.
        embedding = _get_bedrock_embedding(bedrock, text)[0]
        # Serialize with ';' so the vector does not collide with the CSV field delimiter.
        row["embedding"] = ";".join(map(str, embedding))

    return fieldnames, rows


data_path = "../example/resources/styles.csv"
data_w_embedding_path = "../example/resources/styles_embedding.csv"

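# Replace these with buckets in your own account.
output_bucket = "s3://ak-athena-result/"  # projection results, later imported into Neptune Analytics
data_bucket = "s3://ak-athena-import/"    # source CSV with the embedding column
log_bucket = "s3://ak-athena-log/"        # output location for the CREATE TABLE query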


athena_client = boto3.client('athena')

# Read data from data path
headers, rows = read_csv(data_path, 10)
# Add the embedding
headers, rows = append_embedding(headers, rows)
# Write to new csv
write_csv(data_w_embedding_path, headers, rows)

# Clear the data bucket, then upload the enriched CSV
empty_s3_bucket(data_bucket)
push_to_s3(data_w_embedding_path, _clean_s3_path(data_bucket), "styles_embedding.csv")


print("Completed data preparation.")
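
The data preparation cell stores each embedding as a single CSV field, with the vector components joined by `;` so they do not clash with the `,` field delimiter. The sketch below shows that round trip with the standard `csv` module on a made-up three-component vector. It is a stand-in for illustration only; the internals of `read_csv` and `write_csv` from `nx_neptune` are not shown here, and the helper names are hypothetical.

```python
import csv
import io

EMBEDDING_DELIMITER = ";"


def serialize_embedding(vector):
    """Join vector components into one CSV-safe field."""
    return EMBEDDING_DELIMITER.join(map(str, vector))


def parse_embedding(text):
    """Split a serialized field back into a list of floats."""
    return [float(part) for part in text.split(EMBEDDING_DELIMITER)]


# Made-up sample row; real embeddings are much longer.
sample_rows = [
    {"id": "1", "baseColour": "Blue", "embedding": serialize_embedding([0.25, -1.5, 3.0])}
]

# Write the rows to an in-memory CSV, then read them back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "baseColour", "embedding"])
writer.writeheader()
writer.writerows(sample_rows)

buffer.seek(0)
restored = [parse_embedding(row["embedding"]) for row in csv.DictReader(buffer)]
print(restored)
```

The same `;` delimiter appears later as `collection.delim` in the Athena table definition, which is what lets Athena read the field as `array<float>`.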


### Athena related work
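
Athena runs queries asynchronously: a query is submitted, and its state is polled until it reaches a terminal state. This notebook delegates that to `_execute_athena_query`, whose internals are not shown here. The following is a minimal sketch of what such a helper might do with the boto3 Athena client (`start_query_execution` and `get_query_execution`). The function name `run_athena_query` and its parameters are hypothetical.

```python
import time


def run_athena_query(client, query, output_location, database, poll_seconds=1.0):
    """Submit a query and block until Athena reports a terminal state.

    `client` is expected to be a boto3 Athena client, e.g. boto3.client("athena").
    Hypothetical helper; not the nx_neptune implementation.
    """
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = response["QueryExecutionId"]

    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "no reason given")
            raise RuntimeError(f"Athena query {query_id} ended in state {state}: {reason}")
        # Still QUEUED or RUNNING: wait before polling again.
        time.sleep(poll_seconds)
```

Passing the client in as an argument keeps the function easy to exercise with a stub object, without needing AWS credentials.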

In [None]:

# Create an external table over the CSV uploaded to the data bucket
create_csv_table_stmt = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS {s3_tables_tablename} (
    `id` int,
    `gender` string,
    `masterCategory` string,
    `subCategory` string,
    `articleType` string,
    `baseColour` string,
    `season` string,
    `year` int,
    `usage` string,
    `productDisplayname` string,
    `embedding` array<float>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',', 'collection.delim' = ';')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '{data_bucket}'
TBLPROPERTIES ('classification' = 'csv', 'skip.header.line.count'='1');
"""

_execute_athena_query(athena_client, create_csv_table_stmt, log_bucket, database=s3_tables_database)

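# Clear previous results so the output bucket only holds the new projection
empty_s3_bucket(output_bucket)

# Projection: rename columns to the Neptune CSV format (~id, ~label, property:type)
projection_stmt = f"""
    SELECT
        "id" AS "~id",
        "masterCategory" AS "~label",
        "baseColour" AS "baseColour",
        array_join(
            transform(embedding, x -> cast(x AS varchar)), ';'
        ) AS "embedding:vector"
    FROM {s3_tables_tablename};
"""

_execute_athena_query(athena_client, projection_stmt, output_bucket, database=s3_tables_database)

# Remove the .csv.metadata files that Athena writes alongside query results,
# leaving only the CSV output for the import
empty_s3_bucket(output_bucket, file_extension=".csv.metadata")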
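
The projection query renames the source columns to the headers used for the import: `~id` for the node id, `~label` for the node label, a plain property column, and `embedding:vector` for the vector. The sketch below mirrors that mapping in Python for a single made-up row, to make the shape of the output explicit. The function name `project_row` and the sample values are hypothetical, and Python's float formatting may differ from what Athena produces with `cast(x AS varchar)`.

```python
def project_row(row):
    """Map one source row to the column names used by the projection query."""
    return {
        "~id": str(row["id"]),
        "~label": row["masterCategory"],
        "baseColour": row["baseColour"],
        # Same ';' delimiter as array_join(..., ';') in the SQL projection.
        "embedding:vector": ";".join(str(x) for x in row["embedding"]),
    }


# Made-up sample row for illustration.
source_row = {
    "id": 1,
    "masterCategory": "Apparel",
    "baseColour": "Blue",
    "embedding": [0.5, 1.25],
}
projected = project_row(source_row)
print(projected)
```

Columns that are not selected in the projection, such as `gender` or `season`, are left out of the graph.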

## Import Data from S3

Import the projection from S3 into the Neptune Analytics graph and wait for the operation to complete. <br>
IAM permissions required for import: <br>
 - s3:GetObject, kms:Decrypt, kms:GenerateDataKey, kms:DescribeKey

In [None]:
task_id = await instance_management.import_csv_from_s3(
    NeptuneGraph.from_config(set_config_graph_id(graph_id)),
    output_bucket,
    reset_graph_ahead=True,
    skip_snapshot=True,
)
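
The IAM permissions listed for the import can be written as a policy document. The sketch below builds one as a Python dictionary, assuming one statement for the S3 objects and one for the KMS key. The bucket and key ARNs are placeholders, the function name `build_import_policy` is hypothetical, and a real deployment may need additional statements depending on how the buckets and keys are configured.

```python
import json


def build_import_policy(bucket_arn, kms_key_arn):
    """Build an IAM policy document covering the permissions listed for the import."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey"],
                "Resource": [kms_key_arn],
            },
        ],
    }


# Placeholder ARNs; substitute the bucket and key used in your account.
policy = build_import_policy(
    "arn:aws:s3:::example-import-bucket",
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
)
print(json.dumps(policy, indent=2))
```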


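## Similarity Search

Run `topK.byNode` for a single product node and print the most similar products with their scores.

In [None]: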
TOPK_QUERY = """
    MATCH (n) WHERE id(n) = '30805'
    CALL neptune.algo.vectors.topK.byNode(
      n, {topK: 3})
    YIELD node, score
    RETURN node, score
"""

config = set_config_graph_id(graph_id)
na_graph = NeptuneGraph.from_config(config)
# Each result holds a neighbouring node and its similarity score
results = na_graph.execute_call(TOPK_QUERY)
for result in results:
    print(f'{result["node"]["~id"]}, score: {result["score"]}')
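
To make the idea behind the query concrete, the sketch below ranks a handful of made-up vectors by their distance to a source vector and keeps the `k` closest. It assumes Euclidean distance and excludes the source node from its own results. Both choices are assumptions for illustration: the metric, the score semantics, and whether the source node is returned are defined by Neptune Analytics and should be checked in its documentation. The function name `top_k_by_node` is hypothetical.

```python
import heapq
import math


def top_k_by_node(vectors, source_id, k):
    """Return the k nodes closest to `source_id` as (node_id, distance) pairs.

    Conceptual illustration only; smaller distance means more similar.
    """
    source = vectors[source_id]
    distances = (
        (math.dist(source, vector), node_id)
        for node_id, vector in vectors.items()
        if node_id != source_id
    )
    return [(node_id, distance) for distance, node_id in heapq.nsmallest(k, distances)]


# Made-up two-dimensional vectors; real embeddings have many more dimensions.
example_vectors = {
    "a": [0.0, 0.0],
    "b": [3.0, 4.0],
    "c": [1.0, 0.0],
    "d": [0.0, 2.0],
}
nearest = top_k_by_node(example_vectors, "a", 2)
print(nearest)
```

A brute-force scan like this is linear in the number of nodes, which is fine for a sketch but is not how a vector index would be expected to answer the query at scale.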

## Conclusion

This notebook demonstrated how to run vector similarity search in Neptune Analytics on embedding data projected from a data lake:

1. **Data preparation**: We enriched the Kaggle fashion dataset with Amazon Bedrock embeddings and uploaded it to S3
2. **Projection**: We created an Athena table over the data and projected it into the Neptune CSV format
3. **Import**: We imported the projection into a Neptune Analytics graph
4. **Usage**: We ran `topK.byNode` to find similar products and return their similarity scores

The session manager (`SessionManager`) provides an easy mechanism to execute general datalake functionality.