# Use and explore embeddings

This notebook shows how you can use the generated embeddings outside of Neptune Analytics, for example to create a t-SNE visualization of embedding quality or to compute offline metrics that measure classification quality.

In [None]:
import logging

# Force logging for jupyter
logging.basicConfig(level=logging.INFO, force=True)
logging.getLogger("boto3").setLevel(logging.WARNING)
logging.getLogger("botocore").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
logging.getLogger("s3transfer").setLevel(logging.WARNING)

In [None]:
import json
import os

# Read information about the graph from the JSON file you created in notebook 0
with open("task-info.json", "r") as f:
    task_info = json.load(f)

EXPORTED_GRAPH_S3 = task_info["EXPORTED_GRAPH_S3"]
ENRICHED_S3 = os.path.join(EXPORTED_GRAPH_S3, "enriched")

## Load enriched Transaction node data

You can use NeptuneGS to read the `Transaction` node data directly from S3.

In [None]:
from neptune_gs.fs_handler import FileSystemHandler
import pyarrow as pa


enriched_handler = FileSystemHandler(ENRICHED_S3)
transaction_files = enriched_handler.list_files(pattern="Vertex_Transaction*")
enriched_fields = [
    pa.field("~id", pa.string()),
    pa.field("embedding:Vector", pa.string()),
    pa.field("pred:Float[]", pa.float32()),
]
enriched_schema = pa.schema(enriched_fields)
transaction_emb_pred_df = enriched_handler.create_df_from_files(
    transaction_files, schema=enriched_schema
).drop("~label", axis=1)

In [None]:
# Convert embeddings from string to list of floats
transaction_emb_pred_df["embedding:Vector"] = (
    transaction_emb_pred_df["embedding:Vector"].astype("string").str.split(";")
)
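The conversion above can be illustrated on a small scale. The following is a minimal sketch, assuming the embeddings are stored as semicolon-separated strings as in the cell above; `toy_df` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "embedding:Vector" column: semicolon-separated floats
toy_df = pd.DataFrame(
    {
        "~id": ["t1", "t2"],
        "embedding:Vector": ["0.1;0.2;0.3", "0.4;0.5;0.6"],
    }
)

# Split each string on ";" to get one list of string tokens per row
tokens = toy_df["embedding:Vector"].str.split(";")

# Stack the token lists into a 2-D matrix and cast the tokens to float32
# (rows = nodes, columns = embedding dimensions)
toy_matrix = np.array(tokens.tolist()).astype("float32")

print(toy_matrix.shape)  # (2, 3)
```

After the split, each cell still holds strings; the cast to `float32` only happens when the lists are stacked into a NumPy array, which is what the t-SNE cell later in this notebook does.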

Let's take a look at the dataframe of embeddings and predictions for the `Transaction` node type. Every `Transaction` node has a risk score (`pred:Float[]`) and an embedding vector attached.

In [None]:
transaction_emb_pred_df

Next, load the original data and extract the true `isFraud` value from it. You will use it to measure the accuracy of the model.

In [None]:
# Gather the actual fraud values from the input data
original_handler = FileSystemHandler(EXPORTED_GRAPH_S3)

orig_fields = [
    pa.field("~id", pa.string()),
    pa.field("isFraud:Int", pa.int8()),
]
orig_schema = pa.schema(orig_fields)
# Skip the enriched files when loading the original graph data
orig_transaction_files = [
    f
    for f in original_handler.list_files(pattern="Vertex_Transaction*")
    if "enriched" not in f
]
orig_transactions_df = original_handler.create_df_from_files(
    orig_transaction_files, schema=orig_schema
)

You can take a quick look at the statistics for the `isFraud` column. The dataset contains about 3.5% fraudulent transactions.

In [None]:
orig_transactions_df.describe()
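For a binary 0/1 label, the fraud rate is the mean of the column, which is the `mean` row that `describe()` reports. A minimal sketch with a synthetic label series (the counts below are made up to give a 3.5% rate and are not taken from the dataset):

```python
import pandas as pd

# Synthetic labels: 7 fraudulent transactions out of 200
toy_labels = pd.Series([0] * 193 + [1] * 7, name="isFraud:Int")

# The mean of a 0/1 column is the fraction of positives
fraud_rate = toy_labels.mean()
print(f"Fraud rate: {fraud_rate:.1%}")
```

On the real data, the same idea would apply to `orig_transactions_df["isFraud:Int"]`.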

Finally, join the enriched embeddings with the original data to have the true fraud value attached to each prediction/embedding.

In [None]:
transactions_with_fraud = transaction_emb_pred_df.set_index("~id").join(
    orig_transactions_df.set_index("~id"),
)
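`DataFrame.join` performs a left join by default, so any enriched transaction without a matching row in the original data keeps its embedding but gets a missing label. A minimal sketch of how you might check for that, using made-up toy frames:

```python
import pandas as pd

# Toy stand-ins for the enriched and original dataframes
emb_df = pd.DataFrame(
    {"~id": ["t1", "t2", "t3"], "pred:Float[]": [0.9, 0.1, 0.4]}
)
label_df = pd.DataFrame({"~id": ["t1", "t2"], "isFraud:Int": [1, 0]})

# Left join on the "~id" index, mirroring the cell above
joined = emb_df.set_index("~id").join(label_df.set_index("~id"))

# Rows present on the left but absent on the right end up with NaN labels
missing_labels = int(joined["isFraud:Int"].isna().sum())
print(f"{missing_labels} of {len(joined)} rows have no label")
```

If the count is non-zero on real data, the IDs in the two sources do not fully line up, which is worth investigating before computing metrics.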

Let's take a look at the enriched transactions. The dataframe is indexed by `~id` and contains the `isFraud:Int` label, the embedding vector, and the risk score in the `pred:Float[]` column.

In [None]:
transactions_with_fraud

## Create a t-SNE visualization of the embeddings

Now that we have embeddings for all of the nodes, we can select a subset and create a t-SNE visualization to inspect embedding quality.

[t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) is a nonlinear dimensionality-reduction method designed for visualizing high-dimensional data in two or three dimensions.

We will use this method along with the real fraud labels to demonstrate the quality of the learned embeddings, which should be able to separate the two classes even when projected down to two dimensions.

In [None]:
import pandas as pd
import numpy as np

# First, separate fraud and non-fraud cases
fraud_samples = transactions_with_fraud[transactions_with_fraud["isFraud:Int"] == 1]
non_fraud_samples = transactions_with_fraud[transactions_with_fraud["isFraud:Int"] == 0]

# Randomly sample 1000 transactions from each class
n_samples = 1000
seed = 42
fraud_balanced = fraud_samples.sample(n=n_samples, random_state=seed)
non_fraud_balanced = non_fraud_samples.sample(n=n_samples, random_state=seed)

# Combine the balanced datasets
balanced_df = pd.concat([fraud_balanced, non_fraud_balanced])

# Shuffle the combined dataset
balanced_df = balanced_df.sample(frac=1, random_state=seed)

# Extract embeddings and labels from the subset
embeddings_balanced = np.array(balanced_df["embedding:Vector"].values.tolist()).astype(
    "float32"
)
labels_balanced = balanced_df["isFraud:Int"]
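Note that `DataFrame.sample(n=...)` raises an error when a class has fewer than `n` rows. As a hedged sketch, the sampling steps above could be wrapped in a helper that caps the sample size at the smallest class; `sample_balanced` is a hypothetical name and not part of NeptuneGS.

```python
import pandas as pd


def sample_balanced(df, label_col, n_per_class, seed=42):
    """Hypothetical helper: draw the same number of rows from every class."""
    # Cap the sample size at the smallest class so .sample() never raises
    n = min(n_per_class, int(df[label_col].value_counts().min()))
    parts = [
        group.sample(n=n, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Concatenate the per-class samples and shuffle the rows
    return pd.concat(parts).sample(frac=1, random_state=seed)


# Toy data: 3 fraud rows and 10 non-fraud rows
toy = pd.DataFrame({"isFraud:Int": [1] * 3 + [0] * 10, "x": range(13)})
balanced = sample_balanced(toy, "isFraud:Int", n_per_class=5)
print(balanced["isFraud:Int"].value_counts().to_dict())
```

Here the request for 5 rows per class is reduced to 3, because only 3 fraud rows exist in the toy data.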

Now let's fit the t-SNE estimator to project the embeddings down to two dimensions.

In [None]:
import matplotlib.pyplot as plt
from sklearn import manifold

# Apply t-SNE
tsne = manifold.TSNE(
    n_components=2, perplexity=50, init="pca", random_state=seed, n_jobs=-1
)
embeddings_2d = tsne.fit_transform(embeddings_balanced)

Next, plot the projected data points, colored by their true fraud label.

In [None]:
# Visualize the results
plt.figure(figsize=(10, 8))

# Create scatter plot with two distinct colors
plt.scatter(
    embeddings_2d[labels_balanced == 0, 0],
    embeddings_2d[labels_balanced == 0, 1],
    c="c",
    label="Non-Fraud",
    alpha=0.6,
)
plt.scatter(
    embeddings_2d[labels_balanced == 1, 0],
    embeddings_2d[labels_balanced == 1, 1],
    c="m",
    label="Fraud",
    alpha=0.6,
)

# Add legend
plt.legend()

plt.title("t-SNE visualization of transaction embeddings")
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.show()

We can see that our model has achieved a good separation between the two classes.
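The visual impression can be complemented with a number. As a minimal sketch that is not part of the original workflow, the silhouette score from scikit-learn measures how tightly points group by label, with values near 1 indicating well-separated classes. The synthetic blobs below are assumptions that stand in for `embeddings_balanced` and `labels_balanced`.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic stand-in for embeddings: two well-separated 16-dimensional blobs
toy_embeddings = np.vstack(
    [
        rng.normal(loc=0.0, scale=1.0, size=(100, 16)),
        rng.normal(loc=5.0, scale=1.0, size=(100, 16)),
    ]
).astype("float32")
toy_labels = np.array([0] * 100 + [1] * 100)

# Silhouette score computed against the known class labels
score = silhouette_score(toy_embeddings, toy_labels)
print(f"silhouette score: {score:.2f}")
```

Computing the score on the original embeddings rather than on the t-SNE output avoids measuring distortions introduced by the projection.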

To better quantify the model's accuracy, you can use the original train/test splits to measure the model's performance on unseen data.

## Offline evaluation

With access to the predictions for all the transactions in the graph, we can perform additional offline evaluation of the model's performance.

First, you will load the transactions that were selected for your held-out test set to check the model's performance on unseen data.

In [None]:
# Gather test ids from input data
test_ids = enriched_handler.create_df_from_files(
    [f"{EXPORTED_GRAPH_S3}/data_splits/test_ids.parquet"]
)
test_ids.set_index("nid", inplace=True)
test_ids.index = test_ids.index.astype(str)
test_preds = test_ids.join(
    transactions_with_fraud,
    how="left",
)
# Convert test predictions from string to float
test_preds["pred:Float[]"] = (
    test_preds["pred:Float[]"]
    .astype("string")
    .str.split(";", expand=True)
    .astype("float")
)

Next you can plot the Receiver Operating Characteristic and Precision-Recall curves to visualize the performance of the GNN model.

In [None]:
import matplotlib.pyplot as plt

from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

fig, [ax_roc, ax_pr] = plt.subplots(1, 2, figsize=(11, 5))
PrecisionRecallDisplay.from_predictions(
    y_true=test_preds["isFraud:Int"],
    y_pred=test_preds["pred:Float[]"],
    name="GNN",
    plot_chance_level=True,
    despine=True,
    ax=ax_pr,
)

RocCurveDisplay.from_predictions(
    y_true=test_preds["isFraud:Int"],
    y_pred=test_preds["pred:Float[]"],
    name="GNN",
    plot_chance_level=True,
    despine=True,
    ax=ax_roc,
)

ax_roc.set_title("Receiver Operating Characteristic (ROC) curves")
ax_pr.set_title("Precision-Recall curves")

ax_roc.grid(linestyle="--")
ax_pr.grid(linestyle="--")

plt.legend()
plt.show()

We can see that the model achieves a ROC AUC of about 0.90 on the held-out test set while maintaining high precision and recall.
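The numbers shown in the plot legends can also be computed directly. The sketch below uses scikit-learn's `roc_auc_score` and `average_precision_score` on a tiny made-up example; on the real data you would pass `test_preds["isFraud:Int"]` and `test_preds["pred:Float[]"]` instead, assuming neither column contains missing values.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Made-up labels and scores, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Area under the ROC curve: probability that a random positive
# is scored higher than a random negative
auc = roc_auc_score(y_true, y_score)

# Average precision: summary of the precision-recall curve
ap = average_precision_score(y_true, y_score)

print(f"ROC AUC: {auc:.2f}, average precision: {ap:.2f}")
```

For a dataset with few positives, such as fraud data, average precision is often the more informative of the two because it is sensitive to false positives among the top-scored transactions.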

In the next notebook you will import this graph back into Neptune Analytics to run some example queries and assign a _risk score_ to transactions using information from the graph.