# Prepare graph data

## Required Dependencies
Before running this notebook, ensure you have the following packages installed:
```bash
pip install -r requirements.txt
```

Required Python version: >= 3.9

In this notebook you will create a Neptune Analytics graph and prepare graph data in the CSV format of a Neptune Analytics export. You will then use the NeptuneGS library to analyze that data and create a GraphStorm graph processing configuration file, which kickstarts the GraphStorm learning pipeline.

## Import example data into a new Neptune Analytics graph

If you already have a Neptune Analytics graph, you can export it directly and start from the optional 'Export graph example' step below.

In this case, in the interest of time, our data processing script prepares data that emulates the Neptune Analytics export schema.

Note: The dataset contains approximately 3.5% fraudulent transactions, making it an imbalanced classification problem.

Start by cloning the GraphStorm repository, which you will use later to launch SageMaker jobs.

In [None]:
import os
GS_HOME = os.path.expanduser("~/graphstorm")  # absolute path, also saved to task-info.json later

In [None]:
!git clone https://github.com/awslabs/graphstorm.git $GS_HOME

In [None]:
import logging

logging.basicConfig(level=logging.INFO, force=True)
logging.getLogger("boto3").setLevel(logging.WARNING)
logging.getLogger("botocore").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
logging.getLogger("s3transfer").setLevel(logging.WARNING)

In [None]:
import boto3

# Use a single attempt (no automatic retries) and no read timeout, so that
# long-running Neptune Analytics requests are neither cut off nor resubmitted
from botocore.config import Config
config = Config(
    retries={"total_max_attempts": 1, "mode": "standard"},
    read_timeout=None
)
neptune_graph = boto3.client("neptune-graph", config=config)

In [None]:
# Set variables that will be used across all notebooks.
# Replace the BUCKET value with the name of your S3 bucket.
BUCKET="<YOUR_BUCKET_HERE>"

GRAPH_NAME="ieee-cis-fraud-detection"
AWS_REGION="us-east-1"
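
Before moving on, it can help to confirm that the `BUCKET` placeholder was actually replaced. The following is a minimal sketch, not part of the original notebook: `check_bucket_name` is a hypothetical helper, and its regular expression covers only a simplified subset of the S3 bucket naming rules (3 to 63 characters, lowercase letters, digits, dots and hyphens, starting and ending with a letter or digit).

```python
import re

# Simplified naming check (an assumption, not the full S3 rule set):
# 3-63 characters, lowercase letters, digits, dots and hyphens,
# starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def check_bucket_name(bucket: str) -> str:
    """Return the bucket name unchanged, or raise ValueError if it looks wrong."""
    if "<" in bucket or ">" in bucket:
        raise ValueError("BUCKET still contains the placeholder value")
    if bucket.startswith("s3://"):
        raise ValueError("BUCKET should be a bare name, without the s3:// scheme")
    if not _BUCKET_RE.fullmatch(bucket):
        raise ValueError(f"{bucket!r} does not look like a valid S3 bucket name")
    return bucket


# Example with a made-up bucket name
print(check_bucket_name("my-graphstorm-bucket"))
```

In the notebook you could call `check_bucket_name(BUCKET)` right after setting the variable, so that a forgotten placeholder fails early instead of during the S3 upload.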

#### Create Neptune Graph

Start by creating a Neptune Graph in the background while you work through the rest of the notebooks. You will use this graph to import the graph data after you have enriched it with GNN embeddings and predictions.

### Required IAM Permissions
Your IAM role needs the following permissions:
- AWSNeptuneAnalyticsFullAccess
- AmazonS3FullAccess (or more restricted S3 access)
- AWS KMS permissions if using encrypted S3 buckets

For detailed permissions and trust policy requirements, see:
- [Neptune Analytics IAM Roles](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/security-iam.html)
- [Import/Export Permissions](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/import-export-permissions.html)

In [None]:
# The embedding size needs to be set at graph creation time
EMBEDDING_SIZE = 128

try:
    # Create a Neptune Analytics graph
    create_response = neptune_graph.create_graph(
        graphName=GRAPH_NAME,
        deletionProtection=False,
        publicConnectivity=True,
        vectorSearchConfiguration={"dimension": EMBEDDING_SIZE},
        replicaCount=0,
        provisionedMemory=16,
    )
    # Make a note of the graph ID for later use
    GRAPH_ID = create_response["id"]
except Exception as e:
    print(f"Error creating graph: {str(e)}")
    raise
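
Graph creation runs in the background, so a later step that needs the graph has to confirm it is ready. The sketch below shows one way to poll for that. `wait_for_graph` is a hypothetical helper that is not part of the notebook; it assumes only that the client's `get_graph` call returns a `status` field, and the polling interval and limit are arbitrary choices.

```python
import time


def wait_for_graph(client, graph_id, poll_seconds=30, max_polls=60):
    """Poll `get_graph` until the graph is AVAILABLE; raise on failure or timeout."""
    for _ in range(max_polls):
        status = client.get_graph(graphIdentifier=graph_id)["status"]
        if status == "AVAILABLE":
            return status
        if status == "FAILED":
            raise RuntimeError(f"Graph {graph_id} creation failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Graph {graph_id} not available after {max_polls} polls")


# Usage in the notebook (commented out because it calls AWS):
# wait_for_graph(neptune_graph, GRAPH_ID)
```

The boto3 `graph_available` waiter should serve the same purpose; the explicit loop only makes the status handling visible.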

#### Convert raw data to Neptune Analytics export format

In this step you will convert the raw graph data into a format that matches the export format of Neptune Analytics. This way you can proceed with the rest of the GNN pipeline while the Neptune Analytics graph is being created. The data conversion takes around 2 minutes to run, after which you will analyze the produced files to extract the graph schema.

In [None]:
# Copy the raw data from the SageMaker examples S3 bucket
!aws s3 sync "s3://sagemaker-solutions-us-west-2/Fraud-detection-in-financial-networks/data" ./input-data

In [None]:
from graph_data_preprocessor_neptune import create_neptune_data

In [None]:
PROCESSED_PREFIX = f"s3://{BUCKET}/neptune-input/{GRAPH_NAME}"

ID_COLS = "card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain"
CAT_COLS = "M1,M2,M3,M4,M5,M6,M7,M8,M9"

create_neptune_data(
    data_prefix="./input-data/",
    output_prefix=PROCESSED_PREFIX,
    id_cols=ID_COLS,
    cat_cols=CAT_COLS,
)

The script you just ran creates the graph data under `<PROCESSED_PREFIX>/graph_data` and separate train/validation/test splits of the data under `<PROCESSED_PREFIX>/graph_data/data_splits`.

## [Optional] How to import and export a Neptune Analytics graph

In this section you will learn how to import and export a Neptune Analytics graph using boto3. You don't need to run this code for the example, because the data was already created in a compatible format, but you can follow the steps below if you want to try the import/export functionality yourself.

#### Import graph example

You can create a new graph and import the data you created above using the following boto3 call. You will need to set up a Neptune import role with a trust relationship that allows the Neptune Analytics service to assume it, and the role will need S3 read access to your data source.

For details see https://docs.aws.amazon.com/neptune-analytics/latest/userguide/bulk-import-create-from-s3.html

```python

NEPTUNE_IMPORT_ROLE = "arn:aws:iam::012345678912:role/NeptuneAnalyticsImportRole"

# You need to provide the embedding size in advance if you plan to import GraphStorm embeddings later
EMBEDDING_SIZE = 128
GRAPH_IMPORT_S3 = f"{PROCESSED_PREFIX}/graph_data"
# Create a neptune analytics import task
create_response = neptune_graph.create_graph_using_import_task(
    source=GRAPH_IMPORT_S3,
    graphName=GRAPH_NAME,
    format="CSV",
    roleArn=NEPTUNE_IMPORT_ROLE,
    deletionProtection=False,
    publicConnectivity=True,
    vectorSearchConfiguration={"dimension": EMBEDDING_SIZE},
)
```

This import task takes around 20 minutes to complete.

```python

# Wait for import task to finish
GRAPH_ID = create_response["graphId"]
IMPORT_TASK_ID = create_response["taskId"]
import_waiter = neptune_graph.get_waiter("import_task_successful")
import_waiter.wait(taskIdentifier=IMPORT_TASK_ID)
```

#### Export graph example

To use the graph with GraphStorm, you first need to export it to a tabular representation using the Neptune Analytics `StartExportTask` API. This is your normal starting point if you already have a graph on Neptune Analytics.

You will need an export role with a trust relationship for the Neptune Analytics service and the ability to write to your intended output S3 location. See https://docs.aws.amazon.com/neptune-analytics/latest/userguide/exporting-data.html for details on the export capabilities of Neptune Analytics.

You will also need a KMS key available that Neptune Analytics will use to encrypt the data during export. The export role will need to be added as one of the users of the KMS key to allow the role to encrypt the data using the key. For a walkthrough on how to do this see
https://docs.aws.amazon.com/neptune-analytics/latest/userguide/import-export-permissions.html#create-iam-and-kms

```python

EXPORT_PREFIX = f"s3://{BUCKET}/neptune-export/{GRAPH_NAME}/"
KMS_KEY_ARN = (
    "arn:aws:kms:us-east-1:012345678912:key/xxxxxxx-kms-key-arn"
)
# Export role needs to be able to use the KMS key to encrypt data
NEPTUNE_EXPORT_ROLE = "arn:aws:iam::012345678912:role/NeptuneAnalyticsExportRole"

export_response = neptune_graph.start_export_task(
    destination=EXPORT_PREFIX,
    graphIdentifier=GRAPH_ID,
    roleArn=NEPTUNE_EXPORT_ROLE,
    kmsKeyIdentifier=KMS_KEY_ARN,
    format="CSV",
)
# Assign the export task id to a variable
EXPORT_TASK_ID = export_response["taskId"]
print(EXPORT_TASK_ID)
```

The export process should take 10-20 minutes.

```python
# Wait for export task to complete
export_waiter = neptune_graph.get_waiter("export_task_successful")
export_waiter.wait(taskIdentifier=EXPORT_TASK_ID)
```

## Create GConstruct configuration file from exported data

GraphStorm training requires the original graph data to be converted into a binary, partitioned graph representation to support efficient distributed training. GraphStorm provides the GConstruct module and the GSProcessing library, which accomplish this on a single instance or in a distributed fashion, respectively.

To create the training data for GraphStorm we need a JSON file that describes the tabular graph data from the Neptune Analytics export task. An example file can be:

```json

{
    "nodes": [
        {
            "node_id_col":  "nid",
            "node_type":    "paper",
            "format":       {"name": "parquet"},
            "files":        ["paper_nodes.parquet"],
            "features":     [
                {
                    "feature_col":  "embedding"
                }
            ],
            "labels": [   
                {
                    "label_col":    "paper_field",
                    "task_type":    "classification"
                }
            ]
        }
    ],
    "edges": [
        {
            "source_id_col":    "src",
            "dest_id_col":      "dst",
            "relation":         ["paper", "cites", "paper"],
            "format":           {"name": "parquet"},
            "files":            ["paper_cites_paper_edges.parquet"]
        }
    ]
}
```
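
To make the structure of the example concrete, the sketch below checks a configuration dictionary for the keys that appear in the example above. It is a hypothetical helper, not GraphStorm's own validation, and it assumes only what the example shows: each node entry carries `node_id_col`, `node_type`, `format` and `files`, and each edge entry carries `source_id_col`, `dest_id_col`, `relation`, `format` and `files`, with `relation` being a three-element list.

```python
NODE_KEYS = ("node_id_col", "node_type", "format", "files")
EDGE_KEYS = ("source_id_col", "dest_id_col", "relation", "format", "files")


def check_gconstruct_config(config: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for i, node in enumerate(config.get("nodes", [])):
        for key in NODE_KEYS:
            if key not in node:
                problems.append(f"nodes[{i}] is missing '{key}'")
    for i, edge in enumerate(config.get("edges", [])):
        for key in EDGE_KEYS:
            if key not in edge:
                problems.append(f"edges[{i}] is missing '{key}'")
        # A relation is [source node type, relation name, destination node type]
        if "relation" in edge and len(edge["relation"]) != 3:
            problems.append(f"edges[{i}] 'relation' should have 3 elements")
    return problems


example = {
    "nodes": [
        {
            "node_id_col": "nid",
            "node_type": "paper",
            "format": {"name": "parquet"},
            "files": ["paper_nodes.parquet"],
            "features": [{"feature_col": "embedding"}],
        }
    ],
    "edges": [
        {
            "source_id_col": "src",
            "dest_id_col": "dst",
            "relation": ["paper", "cites", "paper"],
            "format": {"name": "parquet"},
            "files": ["paper_cites_paper_edges.parquet"],
        }
    ],
}
print(check_gconstruct_config(example))
```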


To create such a configuration for the data exported from Neptune Analytics you can use the `neptune_gs` package from the repository. The package includes a script that analyzes the output, asks for user input to clarify relations and features when needed, and creates the GConstruct configuration JSON file.

The package is available at the top level of the repository under the `neptune-gs` directory and you should have installed it during the pre-requisites phase.

### Create GConstruct config

The `create_graphstorm_config` function analyzes the Neptune Export data, optionally asks a series of questions to determine the desired graph schema, and creates a JSON file GConstruct will use as input. The default filename is `gconstruct_config.json`.

The program iterates through all columns of all vertex and relation files and provides a default transformation for each column, depending on whether it is a feature or a label.

If it cannot automatically determine a feature type or an edge triple, it will ask the user for input. **For existing Neptune Analytics graphs, providing the graph identifier can help with automatically extracting this information.**

> NOTE: Because this example does not have graph data already imported in Neptune Analytics, the edge files are named so that each file name corresponds to an edge triple, e.g. `Transaction,identified_by,Card1` fully determines an edge triple. The script relies on this setup to detect edge triples automatically, without user input. Otherwise, providing the `graph_id` to the `create_graphstorm_config` function lets you extract all edge types from the Neptune graph itself, without user input.

In [None]:
import os.path as osp
from neptune_gs.create_gconstruct import create_graphstorm_config


EXPORTED_GRAPH_S3 = osp.join(PROCESSED_PREFIX, "graph_data")

In [None]:
EXPORTED_GRAPH_S3
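
The note above says that the edge files in this example are named after their edge triple. The following sketch illustrates how such a name could be turned into a triple. It is not the `neptune_gs` implementation: `edge_triple_from_filename` is a hypothetical helper, and it assumes the file stem is exactly `source,relation,destination`; the real files may carry prefixes or suffixes that this sketch does not handle.

```python
import posixpath


def edge_triple_from_filename(path: str):
    """Parse 'Src,relation,Dst' from a file name; return None when it does not match."""
    name = posixpath.basename(path)
    stem = name.split(".", 1)[0]  # drop any extension such as .csv
    parts = stem.split(",")
    if len(parts) != 3 or not all(parts):
        return None
    return tuple(parts)


# Hypothetical path, used only for illustration
print(edge_triple_from_filename("graph_data/Transaction,identified_by,Card1.csv"))
```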

In [None]:
gs_config = create_graphstorm_config(
    EXPORTED_GRAPH_S3,
    # graph_id=GRAPH_ID, # For this example we don't use the graph ID
    learning_task="classification", # The task is node classification
    target_type="Transaction", # The target node type is Transaction
    label_column="isFraud:Int",  # The property of the Transaction nodes we want to predict. Column type (Int) is appended by Neptune during CSV export
    masks_prefix=osp.join(EXPORTED_GRAPH_S3, "data_splits"), # The location of the train/validation/test masks
    cols_to_keep_dict={ # Select a subset of the Transaction properties to include
        "Transaction":
            # Required columns
            [
                "~id",
                "~label",
                "isFraud:Int",
            ]
            +
            # Numerical features without missing values
            [f"C{idx}:Float" for idx in range(1, 15)] + ["TransactionAmt:Float"]
            +
            # Categorical features
            [f"{CAT_COL}:String" for CAT_COL in CAT_COLS.split(",")]
    },
    verbose=True,
)
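
The column names passed in `cols_to_keep_dict` follow the header convention of the exported CSV files: system columns start with `~`, and property columns carry a `:Type` suffix. The sketch below splits such a header into its property name and type. `split_neptune_column` is a hypothetical helper for illustration, and it handles only the header forms that appear in this notebook.

```python
def split_neptune_column(header: str):
    """Split a header such as 'isFraud:Int' into ('isFraud', 'Int')."""
    if header.startswith("~"):
        # System columns such as ~id and ~label carry no type suffix
        return header, None
    name, sep, col_type = header.rpartition(":")
    if not sep:
        # No ':' in the header, so there is no type suffix to split off
        return header, None
    return name, col_type


for column in ["~id", "isFraud:Int", "TransactionAmt:Float", "M1:String"]:
    print(column, "->", split_neptune_column(column))
```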

Once you provide all values the function returns a GraphStorm graph construction configuration dict which you can save to a JSON file. For more information see the documentation about [how GraphStorm performs graph construction](https://graphstorm.readthedocs.io/en/latest/cli/graph-construction/index.html).

In [None]:
# Print the config dict in a readable format
from pprint import pp

pp(gs_config)

#### Extracting feature list from GConstruct to use during training


To include node/edge features during training and inference, GraphStorm needs a list of all node/edge types with features and the feature names. For this example we provide a YAML file that already contains all the configuration needed to run a node classification task.

If you were writing the YAML file yourself, however, you could use the following convenience function to extract the lists of features for every node type to include in your YAML file.

In [None]:
from neptune_gs.create_gconstruct import extract_features_from_gconstruct_config

node_feature_lists, edge_feature_lists = extract_features_from_gconstruct_config(
    gs_config
)
node_feature_lists

For this example we aggregated all the individual numerical features in one feature vector per feature type. This helps with processing the graph data faster downstream.
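
To illustrate what such an extraction involves, here is a minimal sketch that collects feature names per node type directly from a GConstruct-style dictionary. It is not the `neptune_gs` implementation, and the return shape of `extract_features_from_gconstruct_config` may differ. The sketch assumes that each feature entry has a string `feature_col` and an optional `feature_name`, and that the training YAML expects entries in the `ntype:feat1,feat2` form; both are assumptions to verify against the GraphStorm documentation.

```python
def node_features_by_type(gconstruct_config: dict) -> dict:
    """Map each node type to the names of its features."""
    features = {}
    for node in gconstruct_config.get("nodes", []):
        names = [
            # Assumption: the feature name falls back to the column name
            feat.get("feature_name", feat["feature_col"])
            for feat in node.get("features", [])
        ]
        if names:
            features.setdefault(node["node_type"], []).extend(names)
    return features


def as_feat_name_entries(features: dict) -> list:
    """Format the mapping as 'ntype:feat1,feat2' strings (assumed YAML form)."""
    return [f"{ntype}:{','.join(names)}" for ntype, names in features.items()]


# Small made-up configuration for illustration
config = {
    "nodes": [
        {
            "node_type": "paper",
            "features": [
                {"feature_col": "embedding"},
                {"feature_col": "year", "feature_name": "pub_year"},
            ],
        },
        {"node_type": "author"},
    ]
}
print(as_feat_name_entries(node_features_by_type(config)))
```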

### Write graph construction configuration file to S3

To be able to build a partitioned graph from the graph data on S3 you need to save the configuration dictionary as a JSON file, under the same path as the exported data. 

`neptune_gs` provides the `FileSystemHandler` class for easier reading and writing from and to S3.

In [None]:
import json
import os.path as osp

from neptune_gs.fs_handler import FileSystemHandler

fs_handler = FileSystemHandler(EXPORTED_GRAPH_S3)
CONFIG_FILENAME = "ieee-cis-gconstruct-node-classification.json"
input_no_protocol: str = EXPORTED_GRAPH_S3.replace("s3://", "")

In [None]:
# Write the config locally and at the input location
with open(CONFIG_FILENAME, "w") as f:
    json.dump(gs_config, f, indent=2)

with fs_handler.pa_fs.open_output_stream(
    f"{osp.join(input_no_protocol, CONFIG_FILENAME)}"
) as f:
    f.write(json.dumps(gs_config, indent=2).encode("utf-8"))

print(
    f"GRAPH_NAME: {GRAPH_NAME}"
    f"\nConfiguration written to ./{CONFIG_FILENAME}"
    f"\nand to s3://{osp.join(input_no_protocol, CONFIG_FILENAME)}"
)

# Let's also write information about the graph and exports locally
with open("task-info.json", "w") as f:
    export_info = {
        "AWS_REGION": AWS_REGION,
        "BUCKET": BUCKET,
        "EXPORTED_GRAPH_S3": EXPORTED_GRAPH_S3,
        "GCONSTRUCT_CONFIG": CONFIG_FILENAME,
        "NODE_FEATURE_LISTS": node_feature_lists,
        "EDGE_FEATURE_LISTS": edge_feature_lists,
        "GRAPH_ID": GRAPH_ID,
        "GRAPH_NAME": GRAPH_NAME,
        "GS_HOME": GS_HOME,
    }
    json.dump(export_info, f, indent=2)
print("Task info written to ./task-info.json")
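
The `task-info.json` file written above is meant to carry these values into later notebooks. The following sketch shows how it could be read back. `load_task_info` is a hypothetical helper, and the set of required keys is an assumption chosen for illustration.

```python
import json
from pathlib import Path


def load_task_info(
    path="task-info.json",
    required=("BUCKET", "GRAPH_NAME", "EXPORTED_GRAPH_S3"),
):
    """Load the task info file and fail early if expected keys are missing."""
    info = json.loads(Path(path).read_text())
    missing = [key for key in required if key not in info]
    if missing:
        raise KeyError(f"{path} is missing keys: {missing}")
    return info


# Usage in a later notebook (commented out because the file must exist first):
# task_info = load_task_info()
# BUCKET = task_info["BUCKET"]
```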

In the next notebook you will prepare your SageMaker environment for training with GraphStorm, building and pushing the necessary Docker images.