# How to sample a graph neighborhood from Neptune Database and invoke the GraphStorm SageMaker endpoint

Now that you have loaded the graph data into Neptune and deployed the SageMaker endpoint into the same VPC,
you can use the SageMaker Notebook instance as the intermediary between the two to gather a graph sample, convert it
to a JSON payload formatted as GraphStorm expects, and receive a prediction for it.

In this notebook we illustrate how to retrieve a k-hop graph neighborhood sample from Neptune Database using the Gremlin
graph query language, convert the JSON response into the format GraphStorm inference expects, and post a prediction request
to the SageMaker endpoint.

## Prerequisites

**This notebook is designed to run from within the SageMaker Notebook instance that you created as part of the CDK deployment.**

This is necessary, as we set up the VPC and security groups to allow the SageMaker Notebook to post queries to the Neptune Database instance the CDK created.

**You also need to ensure you are using the `Python 3` kernel for this notebook**, as
that provides access to the graph notebook magics like `%graph_notebook_config` that we
use to retrieve information about the Neptune Database instance.

## Environment setup

To start, set up your environment with the necessary imports and configuration.

We need to know which Neptune DB host we will use during graph sampling, which we get from the graph notebook's configuration, and the name of the SageMaker endpoint we will send inference requests to, which you will need to provide.

In [None]:
import json
import time
import yaml

import boto3
import requests
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
import urllib3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

# Configure retry strategy
retry_strategy = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])

# Set up AWS credentials for request signing
session = boto3.Session()
credentials = session.get_credentials()
region = session.region_name

# Set up session with retry
urllib3.disable_warnings()
http_session = requests.Session()
adapter = HTTPAdapter(max_retries=retry_strategy)
http_session.mount("https://", adapter)
http_session.verify = False

# Neptune endpoint configuration
config_obj = %graph_notebook_config
NEPTUNE_HOST = config_obj.host
GREMLIN_PORT = 8182

## ENTER ENDPOINT NAME

In notebook `2-GraphStorm-Endpoint-Deployment.ipynb` you deployed a SageMaker realtime endpoint. During deployment the script printed the endpoint name which should be of the form `ieee-fraud-detection-Endpoint-<timestamp>`. Provide that name here so we can use it to post requests.

The endpoint name can also be found in the [Amazon SageMaker AI Web console](https://us-east-1.console.aws.amazon.com/sagemaker/home?region=us-east-1#/endpoints) under the ``Inference -> Endpoints`` menu.

In [None]:
ENDPOINT_NAME = "<YOUR-ENDPOINT-NAME-HERE>"

### Executing Gremlin queries

Next, we will set up a Python function that allows us to send authenticated Gremlin queries to the Neptune Database. It is necessary to sign your requests because we chose IAM authentication when creating the Neptune DB instance in the CDK.

In [None]:
def execute_gremlin_query(query):
    """Execute a Gremlin query using HTTPS POST request.

    Args:
        query: Gremlin query string

    Returns:
        dict: Query response
    """
    endpoint_url = f"https://{NEPTUNE_HOST}:{GREMLIN_PORT}/gremlin"

    # Serialize the body once so the signed bytes match the bytes sent
    body = json.dumps({"gremlin": query})

    # Create request for signing
    request = AWSRequest(
        method="POST",
        url=endpoint_url,
        headers={"Host": NEPTUNE_HOST},
        data=body,
    )
    SigV4Auth(credentials, "neptune-db", region).add_auth(request)

    # Execute request
    headers = dict(request.headers)
    headers["Host"] = NEPTUNE_HOST
    headers["Content-Type"] = "application/json"

    response = http_session.post(endpoint_url, headers=headers, data=body)
    response.raise_for_status()

    return response.json()

We can use this function to post simple queries to the Neptune Database, for example to retrieve the IDs of 100 `Transaction` nodes. We can use these node IDs later to measure inference latency.

In [None]:
# Get test transaction IDs
query = "g.V().hasLabel('Transaction').limit(100).id()"
response = execute_gremlin_query(query)
test_tx_ids = response["result"]["data"]["@value"]

print(f"Found {len(test_tx_ids)} transaction IDs for testing: {test_tx_ids[:5]} ...")

## Graph Sampling in Neptune DB

Next we define a function that performs the graph neighborhood sampling using Gremlin.

This function will gradually build a Gremlin query string based on the sampling criteria we provide as arguments.

Remember that GNN sampling for node classification happens from the outside in: After selecting the node you want to predict for, called the "target node", we take a sample of nodes that point to it, its "incoming neighbors". Then for each of these selected neighbors we repeat the process, up to a maximum number of hops `k`, which is determined by the number of layers used when training your GNN.

To make the process scalable, we limit the maximum number of neighbors retrieved for each node by a value called the "fanout".

For example, a 2-hop, 10,10 fanout indicates that we will try to sample 10 nodes at the first level (direct neighbors of the target), then for every sampled node we will again try to sample another 10 nodes.
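
The sampling procedure described above can be sketched on a small in-memory graph. This is only an illustration of the idea, not the Gremlin query this notebook sends to Neptune: `incoming` is a made-up adjacency dict mapping each node to the nodes that point to it, and `sample_k_hop` and `max_sampled_nodes` are hypothetical helper names introduced here.

```python
import random


def sample_k_hop(incoming, target, fanouts, seed=0):
    """Sample up to fanouts[i] incoming neighbors per node at hop i."""
    rng = random.Random(seed)
    sampled = {target}
    frontier = [target]
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            neighbors = incoming.get(node, [])
            # Take at most `fanout` neighbors of this node
            picked = rng.sample(neighbors, min(fanout, len(neighbors)))
            next_frontier.extend(picked)
        sampled.update(next_frontier)
        frontier = next_frontier
    return sampled


def max_sampled_nodes(fanouts):
    """Upper bound on the sample size: the target plus every fully sampled hop."""
    total, level = 1, 1
    for fanout in fanouts:
        level *= fanout
        total += level
    return total


# Hypothetical toy graph: node -> list of nodes with an edge pointing to it
incoming = {
    "tx1": ["card1", "addr1", "email1"],
    "card1": ["tx2", "tx3"],
    "addr1": ["tx4"],
}

sample = sample_k_hop(incoming, "tx1", fanouts=[2, 2])
bound = max_sampled_nodes([10, 10])  # 1 + 10 + 10 * 10
```

Under these assumptions a 2-hop, 10,10 fanout can never return more than 111 nodes, including the target, which is what keeps the request size bounded regardless of how well connected the target is.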

To retrieve the edges as well as the nodes from Neptune DB we will post two separate queries, one to get the nodes and one to get the edges between them.

The `sample_graph` function will return two Python dicts, the first being the response containing all nodes in the sample, and the second being the edges.

In [None]:
def sample_graph(tx_id, hop_count, fanout, add_reverse=True):
    """Sample local neighborhood of a transaction from Neptune instance

    Parameters
    ----------
        tx_id:
            Target node ID
        hop_count:
            Number of hops
        fanout:
            Max number of neighbors to visit per node at each step
        add_reverse:
            Whether to consider reverse edges during traversal.
            When true, edge directionality is ignored during sampling,
            otherwise we use inE() to follow incoming edges, starting
            from the target outwards.


    Returns
    -------
        tuple[dict, dict]: (node_response, edges_response)
            Tuple with Gremlin responses for nodes in the first element
            and edges response in the second element.
    """
    # Query 1: Get nodes
    node_query = f"g.V('{tx_id}').store('subGraph')."
    if add_reverse:
        # Ignore edge directionality
        node_query += f"repeat(bothE().limit({fanout}).otherV().store('subGraph'))."
    else:
        # Only get incoming neighbors
        node_query += f"repeat(inE().limit({fanout}).otherV().store('subGraph'))."
    node_query += f"times({hop_count})."
    node_query += "cap('subGraph').unfold().dedup()."
    node_query += "project('nodeId', 'nodeType', 'properties')."
    node_query += "by(id()).by(label()).by(valueMap())"

    # Query 2: Get edges between sampled nodes
    # TODO: Support edge property extraction
    edge_query = f"g.V('{tx_id}').store('subGraph')."
    if add_reverse:
        edge_query += f"repeat(bothE().limit({fanout}).store('edges').otherV().store('subGraph'))."
    else:
        edge_query += (
            f"repeat(inE().limit({fanout}).store('edges').otherV().store('subGraph'))."
        )
    edge_query += f"times({hop_count})."
    edge_query += "cap('edges').unfold().dedup()."
    edge_query += (
        "project('edgeId', 'edgeType', 'fromId', 'fromType', 'toId', 'toType')."
    )
    edge_query += "by(id()).by(label()).by(outV().id()).by(outV().label()).by(inV().id()).by(inV().label())"

    # Execute queries
    node_response = execute_gremlin_query(node_query)
    edge_response = execute_gremlin_query(edge_query)

    return node_response, edge_response

Now that you have a way to sample the graph, try retrieving a sample for a single transaction:

In [None]:
tx_id = test_tx_ids[0]

node_response, edge_response = sample_graph(
    tx_id, hop_count=2, fanout=10, add_reverse=True
)

The `node_response` and `edge_response` contain responses in the [GraphSON](https://tinkerpop.apache.org/docs/3.7.4/dev/io/#graphson) specification, Gremlin's JSON specification that serializes graph objects.

For example, to see the data retrieved for the first node in the node response you can use

In [None]:
node_response["result"]["data"]["@value"][0]

Similarly, you can view the data for the first edge using

In [None]:
edge_response["result"]["data"]["@value"][0]
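
GraphSON 3.0 wraps typed values in `{"@type": ..., "@value": ...}` objects, and encodes a `g:Map` as a flat list of alternating keys and values. The following is a minimal sketch of how such a response could be unwrapped into plain Python objects for inspection. `unwrap_graphson` is a hypothetical helper that only handles a few types, and `sample_response` is hand-written for illustration (including the `amount` property name); it is not actual Neptune output.

```python
def unwrap_graphson(obj):
    """Recursively convert a GraphSON 3.0 value into plain Python objects."""
    if isinstance(obj, dict) and "@type" in obj and "@value" in obj:
        type_name, value = obj["@type"], obj["@value"]
        if type_name in ("g:List", "g:Set"):
            return [unwrap_graphson(item) for item in value]
        if type_name == "g:Map":
            # g:Map stores entries as a flat [key1, value1, key2, value2, ...] list
            keys, values = value[0::2], value[1::2]
            return {unwrap_graphson(k): unwrap_graphson(v) for k, v in zip(keys, values)}
        # Scalars such as g:Int32, g:Int64 and g:Double carry the raw value directly
        return unwrap_graphson(value)
    if isinstance(obj, dict):
        return {key: unwrap_graphson(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [unwrap_graphson(item) for item in obj]
    return obj


# Hand-written example shaped like the node query's project(...) output
sample_response = {
    "result": {
        "data": {
            "@type": "g:List",
            "@value": [
                {
                    "@type": "g:Map",
                    "@value": [
                        "nodeId", "tx-1",
                        "nodeType", "Transaction",
                        "properties", {
                            "@type": "g:Map",
                            "@value": [
                                "amount",
                                {"@type": "g:List", "@value": [{"@type": "g:Double", "@value": 42.5}]},
                            ],
                        },
                    ],
                }
            ],
        }
    }
}

nodes = unwrap_graphson(sample_response["result"]["data"])
first_node = nodes[0]
```

Property values come back as lists because `valueMap()` returns every value stored under a property key.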

## Converting Neptune response to GraphStorm format

[GraphStorm uses its own JSON specification](https://graphstorm.readthedocs.io/en/latest/cli/model-training-inference/real-time-inference-spec.html) for inference requests that's designed with GNN inference in mind. To make it possible to use the data retrieved from Neptune to make a prediction using the GraphStorm endpoint, you will need to convert the retrieved graph from the GraphSON format to the format that GraphStorm expects. 

To ease this process we provide the conversion code for the response of the specific Gremlin query defined in `sample_graph` in `convert_neptune_gs_sample.py`.

The function of interest is `prepare_payload` which takes as input the nodes and edges responses from the Gremlin query, uses the GraphStorm graph construction and training configurations to validate and transform data to the appropriate formats, and returns a dictionary that matches the GraphStorm real-time inference specification and can be used directly to make inference requests to the endpoint. 


First, you'll need to retrieve the graph construction and training configuration files, to ensure the requests you make to the endpoint match the expected schema.

In [None]:
import os

with open("task_config.json", "r", encoding="utf-8") as f:
    task_config = json.load(f)

MODEL_PATH = task_config["MODEL_PATH"]

with open(
    os.path.join(MODEL_PATH, "GRAPHSTORM_RUNTIME_UPDATED_TRAINING_CONFIG.yaml"),
    "r",
    encoding="utf-8",
) as f:
    train_config_dict = yaml.safe_load(f)

with open(
    os.path.join(MODEL_PATH, "data_transform_new.json"),
    "r",
    encoding="utf-8",
) as f:
    gconstruct_config_dict = json.load(f)

Now you can try converting the payload of the test transaction from the GraphSON format to the GraphStorm inference format

In [None]:
from convert_neptune_gs_sample import prepare_payload

test_payload = prepare_payload(
    node_response,
    edge_response,
    tx_id,
    gconstruct_config_dict,
    train_config_dict,
    target_node_type="Transaction",
    add_reverse=True,
)

As before you can try inspecting the first node and first edge in the payload

In [None]:
test_payload["graph"]["nodes"][0]

In [None]:
test_payload["graph"]["edges"][0]

## Making an inference request using the GraphStorm payload

Now that you have prepared the data in the format that GraphStorm expects, you are able to make a prediction for the target node using the SageMaker endpoint!

We will write a small function that allows us to invoke the endpoint using `boto3`

In [None]:
def invoke_endpoint(payload):
    """Invoke SageMaker endpoint for inference.

    Args:
        payload: GraphStorm inference request payload

    Returns:
        dict: Inference response
    """
    sagemaker_runtime = boto3.client("sagemaker-runtime")
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )

    result = json.loads(response["Body"].read().decode())

    # Check response status code
    if result.get("status_code", 200) != 200:
        raise Exception(f"Inference failed: {result.get('error', 'Unknown error')}")

    return result

Note that in order to make SageMaker runtime invocations from within the VPC, we created a VPC endpoint for SageMaker runtime during the CDK deployment. This is necessary to be able to send the requests from the Notebook instance that lies within the VPC.

Try making a prediction with the test payload you just created

In [None]:
invoke_endpoint(test_payload)

You should be able to see a response in the format (results will vary depending on which transaction ID was used)

```python
{'status_code': 200,
 'request_uid': '877042dbc361fc33',
 'message': 'Request processed successfully.',
 'error': '',
 'data': {
    'results': [
            {
                'node_type': 'Transaction',
                'node_id': '2991260',
                'prediction': [0.995966911315918, 0.004033133387565613]
            }
        ]
    }
}
```

The data of interest for the single transaction we made a prediction for are in the `prediction` key and the corresponding `node_id`. The prediction gives us the raw scores the model produces for class 0 (legitimate) and class 1 (fraudulent), at indexes 0 and 1 of the `prediction` list. In this example, the model marks the transaction as most likely legitimate.

You can find the full GraphStorm response specification in the [GraphStorm documentation](https://graphstorm.readthedocs.io/en/latest/cli/model-training-inference/real-time-inference-spec.html#specification-of-response-body-contents).
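
As a minimal sketch of how a client might consume this response, the snippet below picks the highest-scoring class for every entry in `results`. `predicted_classes` is a hypothetical helper, the class names follow the index convention described above, and `example_response` copies the relevant fields of the sample response shown earlier.

```python
def predicted_classes(response, class_names=("legitimate", "fraudulent")):
    """Map each node_id to its highest-scoring class and that class's score."""
    predictions = {}
    for result in response["data"]["results"]:
        scores = result["prediction"]
        # Index of the largest score, i.e. the predicted class
        best_index = max(range(len(scores)), key=scores.__getitem__)
        predictions[result["node_id"]] = (class_names[best_index], scores[best_index])
    return predictions


example_response = {
    "status_code": 200,
    "data": {
        "results": [
            {
                "node_type": "Transaction",
                "node_id": "2991260",
                "prediction": [0.995966911315918, 0.004033133387565613],
            }
        ]
    },
}

labels = predicted_classes(example_response)
```

In a fraud detection setting you may prefer to compare the class 1 score against a threshold of your choosing instead of taking the maximum; that decision is outside the scope of this sketch.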

## Latency measurements


Now that you have learned how to perform a single inference call, you can go ahead and perform multiple end-to-end inference requests to measure response latency.

As demonstrated, an end-to-end inference call for a node that already exists in the graph will have 3 distinct stages:

1. Graph sampling from Neptune DB.
    * For a given target node that already exists in the graph, retrieve its k-hop neighborhood with a fanout limit, i.e. limiting the number of neighbors retrieved at each hop by a threshold.
2. Payload preparation for inference
    * Neptune DB returns graphs using [GraphSON](https://tinkerpop.apache.org/docs/3.7.4/dev/io/#graphson), a specialized JSON-like data format used to describe graph data. At this step we need to convert the returned GraphSON to GraphStorm’s own JSON spec. This step is performed on the inference client, in this case a SageMaker notebook instance.
3. Model inference via SageMaker endpoint
    * Once the payload is prepared, we send an inference request to a SageMaker endpoint that has loaded a previously trained model snapshot. The endpoint receives the request, performs any feature transformations needed (e.g. converting categorical features to one-hot encoding), creates the binary graph representation in memory and makes a prediction for the target node using the graph neighborhood and trained model weights. The response is encoded to JSON and sent back to the client.


To conclude this investigation you will perform a series of inference calls for a random set of target nodes, measuring the latency of each stage and the total end-to-end latency. This should give you an indication of what sort of latencies to expect for within-VPC inference.

First, define a function that performs end-to-end inference and measures the latency of each stage for a single inference call:

In [None]:
def measure_e2e_latency(
    tx_id,
    hop_count,
    fanout,
    gconstruct_config_dict,
    train_config_dict,
    target_node_type="Transaction",
    add_reverse=True,
):
    """Measure end-to-end latency for a single transaction.

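    Args:
        tx_id: Transaction ID
        hop_count: Number of hops
        fanout: Max number of edges to sample per node at each hop
        gconstruct_config_dict: GraphStorm graph construction configuration
        train_config_dict: GraphStorm training configuration
        target_node_type: Node type of the target node
        add_reverse: Whether to add reverse edges in the payload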

    Returns:
        dict: Latency measurements
    """
    # 1. Graph sampling
    sampling_start = time.time()
    node_response, edge_response = sample_graph(tx_id, hop_count, fanout, add_reverse)
    sampling_end = time.time()

    # 2. Payload preparation
    prep_start = time.time()
    payload = prepare_payload(
        node_response,
        edge_response,
        tx_id,
        gconstruct_config_dict,
        train_config_dict,
        target_node_type=target_node_type,
        add_reverse=add_reverse,
    )
    prep_end = time.time()

    # 3. Inference request
    inference_start = time.time()
    _ = invoke_endpoint(payload)
    inference_end = time.time()

    return {
        "sampling_latency": (sampling_end - sampling_start) * 1000,
        "prep_latency": (prep_end - prep_start) * 1000,
        "inference_latency": (inference_end - inference_start) * 1000,
        "total_latency": (inference_end - sampling_start) * 1000,
        "num_nodes": len(payload["graph"]["nodes"]),
        "num_edges": len(payload["graph"]["edges"]),
        "payload_size": len(json.dumps(payload)),
    }

With this function you can measure the latency of a single end-to-end inference call:

In [None]:
# Test with first transaction
tx_id = test_tx_ids[0]
result = measure_e2e_latency(
    tx_id,
    hop_count=2,
    fanout=10,
    gconstruct_config_dict=gconstruct_config_dict,
    train_config_dict=train_config_dict,
)

print("Latency breakdown:")
print(f"- Graph sampling: {result['sampling_latency']:.2f} ms")
print(f"- Payload preparation: {result['prep_latency']:.2f} ms")
print(f"- Inference: {result['inference_latency']:.2f} ms")
print(f"- Total: {result['total_latency']:.2f} ms")

print("\nGraph size:")
print(f"- Nodes: {result['num_nodes']}")
print(f"- Edges: {result['num_edges']}")
print(f"- Payload size: {result['payload_size']} bytes")

You'll see we are capturing several statistics: the latency of each stage, as well as the size of the graph in terms of number of nodes, number of edges, and payload size in bytes.
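
If you want to time additional stages without repeating the start/end bookkeeping, one option is a small context manager. This is a sketch rather than what `measure_e2e_latency` does: `stage_timer` is a hypothetical helper, it uses `time.perf_counter()` (a monotonic clock suited to short durations) instead of `time.time()`, and the bodies of the `with` blocks are stand-ins for the real calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timings, name):
    """Record the wall-clock duration of a code block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000


timings = {}
with stage_timer(timings, "sampling_latency"):
    time.sleep(0.01)  # stand-in for sample_graph(...)
with stage_timer(timings, "prep_latency"):
    sum(range(1000))  # stand-in for prepare_payload(...)

# Simplification: total is the sum of the stages, ignoring time spent between them
timings["total_latency"] = sum(timings.values())
```

Because the duration is recorded in a `finally` clause, a stage that raises an exception still gets a timing entry, which can help when diagnosing slow failures.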

Using the `measure_e2e_latency` function, you can now define a comprehensive experiment that will perform multiple calls and collect statistics as a Pandas dataframe

In [None]:
def run_experiment(
    tx_ids,
    gconstruct_config_dict,
    train_config_dict,
    hop_counts,
    fanouts,
    num_trials,
    add_reverse=True,
):
    """Run latency measurement experiment.

    Args:
        tx_ids: List of transaction IDs
        hop_counts: List of hop counts to test
        fanouts: List of fanout values to test
        num_trials: Number of trials per configuration
        add_reverse: Whether to add reverse edges in the payload

    Returns:
        pd.DataFrame: Experiment results
    """
    results = []

    for hop in hop_counts:
        for fanout in fanouts:
            print(f"Testing hop_count={hop}, fanout={fanout}")
            for tx_id in tx_ids[:num_trials]:
                try:
                    result = measure_e2e_latency(
                        tx_id,
                        hop,
                        fanout,
                        gconstruct_config_dict=gconstruct_config_dict,
                        train_config_dict=train_config_dict,
                        add_reverse=add_reverse,
                    )
                    results.append(
                        {
                            "tx_id": tx_id,
                            "hop_count": hop,
                            "fanout": fanout,
                            "sampling_latency": result["sampling_latency"],
                            "prep_latency": result["prep_latency"],
                            "inference_latency": result["inference_latency"],
                            "total_latency": result["total_latency"],
                            "num_nodes": result["num_nodes"],
                            "num_edges": result["num_edges"],
                            "payload_size": result["payload_size"],
                        }
                    )

                    # Small sleep to avoid overwhelming the service
                    time.sleep(0.1)

                except Exception as e:
                    print(f"Error processing {tx_id}: {e}")
                    continue

    return pd.DataFrame(results)

Since you already sampled 100 transaction IDs you can use those to run the full experiment.

The `run_experiment` function can test multiple configurations, varying the number of hops and the fanout.

The model we're testing against was trained with a 2-hop, 10,10 fanout so we'll test that configuration, and a couple more
to measure scaling characteristics, namely 1-hop sampling and 100,100 fanout for both k-hop configs. 

With the given configuration, the experiment should take 3-4 minutes to complete.
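
To get a feel for the size of the experiment, you can count the calls it makes. The sketch below assumes the configuration values used in this notebook (two hop counts, two fanouts, 100 trials), that `test_tx_ids` holds at least 100 IDs, and that every call succeeds, so the 0.1 second pause in `run_experiment` runs after each one.

```python
from itertools import product

hop_counts = [1, 2]
fanouts = [10, 100]
num_trials = 100
sleep_seconds = 0.1  # pause after each successful call in run_experiment

# Every (hop_count, fanout) pair is tested with num_trials transactions
configurations = list(product(hop_counts, fanouts))
total_calls = len(configurations) * num_trials
total_sleep_seconds = total_calls * sleep_seconds

print(f"{len(configurations)} configurations, {total_calls} end-to-end calls")
print(f"Time spent sleeping between calls: {total_sleep_seconds:.0f} s")
```

The remainder of the run time is spent in sampling, payload preparation and inference, so larger fanouts account for most of it.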

In [None]:
# Experiment configuration
experiment_config = {"hop_counts": [1, 2], "fanouts": [10, 100], "num_trials": 100}

# Run experiment
print("Running experiment...")
results_df = run_experiment(
    test_tx_ids,
    gconstruct_config_dict,
    train_config_dict,
    experiment_config["hop_counts"],
    experiment_config["fanouts"],
    experiment_config["num_trials"],
)

The `results_df` pandas DataFrame holds the metrics for each individual inference call, so you can aggregate the statistics from it to get an overall idea of latency characteristics.

To begin, take a look at the total latency statistics, which measure the end-to-end latency of each inference call:

In [None]:
# Calculate statistics per configuration
import numpy as np


stats = (
    results_df.groupby(["hop_count", "fanout"])
    .agg(
        {
            "total_latency": [
                "mean",
                "median",
                "std",
                "min",
                "max",
                ("p90", lambda x: np.percentile(x, 90)),
                ("p99", lambda x: np.percentile(x, 99)),
            ],
        }
    )
    .round(2)
)

print("Latency Statistics (ms):")
print(stats)

**NOTE**: *The reference numbers we provide were tested on a ml.m5.4xlarge notebook instance, an ml.c6i.xlarge endpoint, and r8g.4xlarge NeptuneDB cluster. Latency measurements will vary depending on factors like network latency.*

With the default configuration, you will see that the median latency for a 2-hop, 10,10 inference call will be around 140ms, while p99 is ~290ms, meaning that 99/100 inference calls took less than 290ms to complete end-to-end.
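
To make the percentile figures concrete, here is a minimal pure-Python sketch of a percentile computed with linear interpolation between the closest ranks, which is the default method of `np.percentile`. The `percentile` helper is introduced here for illustration, and `latencies_ms` contains made-up values, not measurements from this experiment.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted data
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical latencies in milliseconds, not measured values
latencies_ms = [120, 135, 140, 142, 150, 155, 160, 180, 210, 290]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
```

With only 100 trials per configuration, p99 is determined by the one or two slowest calls, so expect it to fluctuate between runs more than the median does.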

We can dig more into each aspect of the inference call to gain more insights into the latency of each component

In [None]:
# Plot latency distributions
plt.figure(figsize=(15, 5))

# Sampling latency
plt.subplot(131)
sns.boxplot(data=results_df, x="fanout", y="sampling_latency", hue="hop_count")
plt.title("Graph Sampling Latency")
plt.xlabel("Fanout")
plt.ylabel("Latency (ms)")

# Payload preparation latency
plt.subplot(132)
sns.boxplot(data=results_df, x="fanout", y="prep_latency", hue="hop_count")
plt.title("Payload Preparation Latency")
plt.xlabel("Fanout")
plt.ylabel("Latency (ms)")

# Model inference latency
plt.subplot(133)
sns.boxplot(data=results_df, x="fanout", y="inference_latency", hue="hop_count")
plt.title("Model Inference Latency")
plt.xlabel("Fanout")
plt.ylabel("Latency (ms)")

plt.tight_layout()
plt.show()

In the figure above you can focus on the `hop_count` 2 entries, the right box plot for each `Fanout` configuration on the X-axis (colors can be hard to make out for tight box plots).

You'll see that for hop count == 2, fanout == "10,10", both the graph sampling latency and payload preparation latencies are tight distributions ~50ms, while model inference can be a bit more varied, with most calls around 70-100ms.

One interesting observation is that when scaling up the fanout to "100,100", model inference latency increases sub-linearly. We can look into how graph size affects each aspect of inference with another figure:

In [None]:
# Plot size vs latency relationships
plt.figure(figsize=(15, 5))

# Node count vs sampling latency
plt.subplot(131)
sns.scatterplot(
    data=results_df,
    x="num_nodes",
    y="sampling_latency",
    hue="hop_count",
    style="fanout",
)
plt.title("Node Count vs Sampling Latency")
plt.xlabel("Number of Nodes")
plt.ylabel("Latency (ms)")


# Graph size vs payload preparation
plt.subplot(132)
sns.scatterplot(
    data=results_df, x="num_nodes", y="prep_latency", hue="hop_count", style="fanout"
)
plt.title("Node Count vs Payload Preparation Latency")
plt.xlabel("Number of Nodes")
plt.ylabel("Latency (ms)")

# Graph size vs inference latency
plt.subplot(133)
sns.scatterplot(
    data=results_df,
    x="num_nodes",
    y="inference_latency",
    hue="hop_count",
    style="fanout",
)
plt.title("Node Count vs Model Inference Latency")
plt.xlabel("Number of Nodes")
plt.ylabel("Latency (ms)")


plt.tight_layout()
plt.show()

From the figure above you can see that sampling latency is generally correlated with the number of nodes being retrieved, while payload preparation latency is perfectly correlated with node count, which is expected since we sequentially process every node in the GraphSON response. This tells us that there are definite optimization opportunities there, should we wish to reduce end-to-end latency.

Finally, we can see that model inference scales sub-linearly with node count: making a prediction for a 1000-node subgraph can take as little time as making one for a 100-node subgraph, which is the result of the efficient parallel implementation backing GraphStorm.

To conclude let's take a look at the mean latency broken down by component and number of hops/fanout:

In [None]:
# Create heatmaps for each component
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
components = ["sampling_latency", "prep_latency", "inference_latency"]
titles = ["Sampling", "Preparation", "Inference"]

pivot_total = results_df.pivot_table(
    values=["sampling_latency", "prep_latency", "inference_latency"],
    index="hop_count",
    columns="fanout",
    aggfunc="mean",
)

for ax, component, title in zip(axes, components, titles):
    sns.heatmap(pivot_total[component], ax=ax, annot=True, fmt=".1f", cmap="YlOrRd")
    ax.set_title(f"{title} Latency by Hop/Fanout")
    ax.set_xlabel("Fanout")
    ax.set_ylabel("Number of Hops")

plt.tight_layout()
plt.show()

This is another view into the per-component latency. For smaller graphs (2-hop, "10,10" fanout), the model inference latency (~90ms) is roughly equal to the combined latency of sampling and preparation. For larger graphs, the sampling and preparation stages dominate the end-to-end latency (\~600ms sampling+preparation vs. \~130ms model inference).

Thankfully sampling and preparation latency can be reduced by more efficient Gremlin queries and payload preparation implementation, which can be a topic for another blog post.

## Conclusion

In this notebook we demonstrated how you can perform end-to-end online inference of graph neural networks, using a GraphStorm-trained model deployed as a SageMaker Inference Endpoint, using Neptune Database as the source of your graph data. In summary you learned the following:

1. How to use the Gremlin query language to perform k-hop ego-network sampling that is consistent with how GNN models are trained.
2. What the responses coming from a Gremlin query look like, and how to transform them to the specification that GraphStorm expects for online inference.
3. How to use boto3 to send an inference request to a SageMaker real-time endpoint.
4. How to measure latency for end-to-end GNN inference calls.

GraphStorm and Neptune provide a unique combination that allows you to perform GNN inference online while maintaining full control of your model and graph source in-house.

### Next steps

In this solution we demonstrated how to post inference calls from a client (in this case a SageMaker Notebook instance) that resides inside the same VPC as the Neptune DB cluster. 

If your inference calls will come from external entities, e.g. a customer hitting a REST API on a different VPC, you will need to determine what's the best way for the inference client to access the Neptune cluster, e.g. using a [VPC peering connection](https://docs.aws.amazon.com/neptune/latest/userguide/get-started-connect-ec2-other-vpc.html).

## Cleanup

To clean up all the resources created for this project you will need to:

#### 1. Delete the SageMaker endpoint once you are done testing

First, to delete the endpoint you can run

```python
import boto3

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=config_obj.aws_region)
# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=ENDPOINT_NAME)
```

See the [SageMaker documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-delete-resources.html) for more options for deleting an endpoint.

It's important to delete the endpoint **before** running `cdk destroy`, otherwise
`cdk destroy` might fail. The reason is that the endpoint is using a network interface
within the VPC, so we can't destroy the VPC before all network interfaces attached
to it are deleted.

#### 2. Destroy the CloudFormation stack deployed through CDK.


**After deleting all deployed endpoints**, to destroy the CDK project you can run

```bash
# Navigate to the CDK project where you initially deployed it
cd neptune-db-cdk
# Make sure to provide any context variables if used
cdk destroy # [--context prefix=dev-]
```

See the [CDK docs](https://docs.aws.amazon.com/cdk/v2/guide/ref-cli-cmd-destroy.html) on how to use `cdk destroy`, or the CloudFormation docs on how to [delete a stack from the console UI](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-delete-stack.html).

#### 3. [Optional] Empty and destroy the S3 bucket that holds the data model artifacts


Finally, you can [follow the AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) to empty and delete the S3 bucket the CDK has created.

You can find the bucket name in the file `cdk_outputs.json`:

In [None]:
!cat cdk_outputs.json