# Prepare graph data

### Use 'gsf' kernel

To run this notebook you need to use the 'gsf' kernel, which comes pre-installed with all the dependencies. This kernel is created as part of the notebook setup during the CDK deployment.

In this notebook you will create the graph data you will use throughout this example. First you will convert the raw IEEE CIS data into the format that Neptune DB and GraphStorm expect for loading, then prepare a configuration file that describes the graph data so that GraphStorm can process and ingest it.


In the next notebook you will load this dataset into Neptune DB.

-----

#### Set log level

In [None]:
import logging

logging.basicConfig(level=logging.INFO, force=True)
logging.getLogger("boto3").setLevel(logging.WARNING)
logging.getLogger("botocore").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
logging.getLogger("s3transfer").setLevel(logging.WARNING)
logging.getLogger("sentence_transformers").setLevel(logging.WARNING)

## Preprocessing Data for GraphStorm Model Training

This script processes the IEEE CIS dataset into a graph, which will be used to train a GNN model for a node classification task. The same processed data will be imported into Neptune Database for online inference. The data conversion takes around 2 minutes to run, after which you will analyze the produced files to extract the graph schema.

Note: The dataset contains approximately 3.5% fraudulent transactions, making it an imbalanced classification problem.
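
As a rough illustration of what this imbalance means in practice, the sketch below computes the fraud rate and inverse-frequency class weights for a toy label list built to match the 3.5% figure. The labels are synthetic and `class_weights` is a hypothetical helper; this notebook does not apply such weights itself.

```python
from collections import Counter


def class_weights(labels):
    """Return inverse-frequency weights so every class has the same total weight."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


# Synthetic labels: 35 fraudulent (1) and 965 legitimate (0) transactions
labels = [1] * 35 + [0] * 965
fraud_rate = sum(labels) / len(labels)
weights = class_weights(labels)

print(f"fraud rate: {fraud_rate:.1%}")
print(f"class weights: {weights}")
```

With these weights the rare fraud class counts for as much in total as the majority class, which is one common way to keep a classifier from ignoring it.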

In [None]:
# Copy the raw data from the SageMaker examples S3 bucket
!aws s3 sync "s3://sagemaker-solutions-us-west-2/Fraud-detection-in-financial-networks/data" ./input-data

In [None]:
# Import the local utility that converts the raw data into a format that's accepted by both NeptuneDB and GraphStorm
from graph_data_preprocessor_neptune_db import create_neptune_db_data

In [None]:
GRAPH_NAME = "ieee-cis-fraud-detection"
AWS_REGION = "us-east-1"

PROCESSED_PREFIX = f"./{GRAPH_NAME}"

ID_COLS = "card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain"
CAT_COLS = "M1,M2,M3,M4,M5,M6,M7,M8,M9"
# Lists of columns to keep from each file
COLS_TO_KEEP = {
    "transaction.csv": (
        ID_COLS.split(",")
        + CAT_COLS.split(",")
        +
        # Numerical features without missing values
        [f"C{idx}" for idx in range(1, 15)]
        + ["TransactionID", "TransactionAmt", "TransactionDT", "isFraud"]
    ),
    "identity.csv": ["TransactionID", "DeviceType"],
}

In [None]:
create_neptune_db_data(
    data_prefix="./input-data/",
    output_prefix=PROCESSED_PREFIX,
    id_cols=ID_COLS,
    cat_cols=CAT_COLS,
    cols_to_keep=COLS_TO_KEEP,
    num_chunks=1,
)

The script you just ran creates the graph data under `<PROCESSED_PREFIX>` and separate train/validation/test splits of the data under `<PROCESSED_PREFIX>/data_splits`.

### Create GConstruct configuration file from preprocessed data

GraphStorm training requires the original graph data to be converted into a binary, partitioned graph representation to support efficient distributed training. GraphStorm provides two tools for this: the GConstruct module, which runs on a single instance, and the GSProcessing library, which runs distributed.

To create the training data for GraphStorm we need to create a JSON file that describes the tabular graph data. An example file can be:

```json
{
    "nodes": [
        {
            "node_id_col":  "nid",
            "node_type":    "paper",
            "format":       {"name": "parquet"},
            "files":        ["paper_nodes.parquet"],
            "features":     [
                {
                    "feature_col":  "embedding"
                }
            ],
            "labels": [   
                {
                    "label_col":    "paper_field",
                    "task_type":    "classification"
                }
            ]
        }
    ],
    "edges": [
        {
            "source_id_col":    "src",
            "dest_id_col":      "dst",
            "relation":         ["paper", "cites", "paper"],
            "format":           {"name": "parquet"},
            "files":            ["paper_cites_paper_edges.parquet"]
        }
    ]
}
```
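
Before handing such a file to GConstruct, a quick structural check can catch typos early. The sketch below only checks the keys that appear in the example above and that every edge endpoint names a declared node type. `check_config` is a hypothetical helper, not part of GraphStorm or `neptune_gs`, and GConstruct's real requirements may be looser or stricter than this.

```python
# Keys taken from the example configuration above (an assumption, not the official schema)
NODE_KEYS = ("node_id_col", "node_type", "format", "files")
EDGE_KEYS = ("source_id_col", "dest_id_col", "relation", "format", "files")


def check_config(config):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    node_types = set()
    for i, node in enumerate(config.get("nodes", [])):
        missing = [key for key in NODE_KEYS if key not in node]
        if missing:
            problems.append(f"nodes[{i}] is missing {missing}")
        if "node_type" in node:
            node_types.add(node["node_type"])
    for i, edge in enumerate(config.get("edges", [])):
        missing = [key for key in EDGE_KEYS if key not in edge]
        if missing:
            problems.append(f"edges[{i}] is missing {missing}")
        if "relation" not in edge:
            continue
        relation = edge["relation"]
        if len(relation) != 3:
            problems.append(f"edges[{i}] relation must have 3 elements")
            continue
        src_type, _, dst_type = relation
        for endpoint in (src_type, dst_type):
            if endpoint not in node_types:
                problems.append(f"edges[{i}] refers to unknown node type {endpoint!r}")
    return problems


example = {
    "nodes": [{
        "node_id_col": "nid", "node_type": "paper",
        "format": {"name": "parquet"}, "files": ["paper_nodes.parquet"],
    }],
    "edges": [{
        "source_id_col": "src", "dest_id_col": "dst",
        "relation": ["paper", "cites", "paper"],
        "format": {"name": "parquet"}, "files": ["paper_cites_paper_edges.parquet"],
    }],
}
broken = {"nodes": example["nodes"], "edges": [{"relation": ["paper", "cites", "author"]}]}

print(check_config(example))
print(check_config(broken))
```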


To create such a configuration for the preprocessed data you can use the `neptune_gs` package from the repository. The package includes a script that analyzes the output, gets user input to clarify relations and features when needed, and creates the GConstruct configuration JSON file. 

The package is available at the top level of the repository under the `neptune-gs` directory, and you should have installed it during the prerequisites phase.

### Create GConstruct config

The `create_graphstorm_config` function analyzes the preprocessed data, optionally asks a series of questions to determine the desired graph schema, and creates a JSON file GConstruct will use as input. The default filename is `gconstruct_config.json`.

The program iterates through all columns of all vertex and relation files and provides a default transformation for each column, depending on whether it is a feature or a label.

If it cannot automatically determine a feature type or an edge triple, it asks the user for input.

 > NOTE: We have named the edge files so that each file name corresponds to an edge triple, e.g. `Transaction,identified_by,Card1` fully determines an edge triple. The script relies on this setup to automatically detect edge triples without user input.
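
The naming convention in the note can be illustrated with a short sketch. It assumes a file name whose stem is exactly the comma-separated triple, such as the hypothetical `Transaction,identified_by,Card1.csv`; the actual file names and the detection logic inside `neptune_gs` may differ.

```python
from pathlib import PurePath


def parse_edge_triple(filename):
    """Return (source_type, relation, dest_type) if the file stem encodes a triple, else None."""
    stem = PurePath(filename).stem
    parts = [part.strip() for part in stem.split(",")]
    if len(parts) != 3 or not all(parts):
        return None
    return tuple(parts)


triple = parse_edge_triple("Transaction,identified_by,Card1.csv")
not_a_triple = parse_edge_triple("Transaction.csv")
print(triple)
print(not_a_triple)
```

A file whose name does not split into three non-empty parts returns `None`, which corresponds to the case where the script would have to ask for input.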

In [None]:
import os.path as osp
from create_gconstruct import create_graphstorm_config

GRAPH_DATA_PATH = osp.abspath(PROCESSED_PREFIX)

In [None]:
GRAPH_DATA_PATH

In [None]:
gs_config = create_graphstorm_config(
    GRAPH_DATA_PATH,
    learning_task="classification",  # The task is node classification
    target_type="Transaction",  # The target node type is Transaction
    label_column="isFraud:Int",  # The property of the Transaction nodes we want to predict. Column type (Int) is appended by Neptune during CSV export
    masks_prefix=osp.join(
        GRAPH_DATA_PATH, "data_splits"
    ),  # The location of the train/validation/test masks
    cols_to_keep_dict={  # Select a subset of the Transaction properties to include
        "Transaction":
        # Required columns
        [
            "~id",
            "~label",
            "isFraud:Int",
        ]
        +
        # Numerical features without missing values
        [f"C{idx}:Float" for idx in range(1, 15)]
        + ["TransactionAmt:Float"]
        +
        # Categorical features
        [f"{CAT_COL}:String" for CAT_COL in CAT_COLS.split(",")]
    },
    add_reversed_edges=True,  # Add a reverse edge for every edge type
    verbose=True,
    aggregate_features=False,
)

Once you have provided all values, the function returns a GraphStorm graph construction configuration dict, which you can save to a JSON file. For more information see the documentation about [how GraphStorm performs graph construction](https://graphstorm.readthedocs.io/en/latest/cli/graph-construction/index.html).

In [None]:
# Print the config dict in a readable format
from pprint import pp

pp(gs_config)

#### Extracting feature list from GConstruct to use during training


To include node/edge features during training and inference, GraphStorm needs a list of all node/edge types with features and the feature names. 
For this example we are providing a YAML file that already contains all the configuration needed to run a node classification task.

If you were writing the YAML file yourself, however, you could use
the following convenience function to extract the lists of features
for every node type to include in your YAML file.

In [None]:
from create_gconstruct import extract_features_from_gconstruct_config

node_feature_lists, edge_feature_lists = extract_features_from_gconstruct_config(
    gs_config
)
for feature in sorted(node_feature_lists):
    print(f"- {feature}")
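
To make the idea concrete, here is a minimal sketch of how such feature lists could be derived from a GConstruct-style config dict. It is not the implementation of `extract_features_from_gconstruct_config`. It assumes that each feature entry has a single string `feature_col`, that an optional `feature_name` overrides it, and that the training YAML expects entries of the form `node_type:feat1,feat2`; check the GraphStorm documentation before relying on any of these. The node types and column names are made up.

```python
def node_feature_entries(config):
    """Build one 'node_type:feat1,feat2' entry per node type that has features."""
    entries = []
    for node in config.get("nodes", []):
        names = []
        for feat in node.get("features", []):
            # Assumption: the feature name falls back to the column name
            name = feat.get("feature_name", feat["feature_col"])
            if name not in names:
                names.append(name)
        if names:
            entries.append(f"{node['node_type']}:{','.join(names)}")
    return sorted(entries)


# Toy config with hypothetical node types and columns
toy_config = {
    "nodes": [
        {
            "node_type": "Transaction",
            "features": [
                {"feature_col": "amount"},
                {"feature_col": "category", "feature_name": "category_onehot"},
            ],
        },
        {"node_type": "Card"},  # no features, so no entry is produced
    ]
}

entries = node_feature_entries(toy_config)
for entry in entries:
    print(f"- {entry}")
```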

### Write graph construction configuration file

To be able to build a partitioned graph from the graph data you need to save the configuration dictionary as a JSON file, under the same path as the processed data. 

`neptune_gs` provides the `FileSystemHandler` class to make reading and writing files easier.

In [None]:
import json
import os.path as osp

from fs_handler import FileSystemHandler

fs_handler = FileSystemHandler(GRAPH_DATA_PATH)
CONFIG_FILENAME = "ieee-cis-gconstruct-node-classification.json"

In [None]:
# Write the config locally and at the input location
with open(CONFIG_FILENAME, "w") as f:
    json.dump(gs_config, f, indent=2)

with fs_handler.pa_fs.open_output_stream(
    osp.join(GRAPH_DATA_PATH, CONFIG_FILENAME)
) as f:
    f.write(json.dumps(gs_config, indent=2).encode("utf-8"))

print(
    f"GRAPH_NAME: {GRAPH_NAME}"
    f"\nConfiguration written to ./{CONFIG_FILENAME}"
    f"\nand to {osp.join(GRAPH_DATA_PATH, CONFIG_FILENAME)}"
)
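
If you want to confirm that a written file matches the in-memory dict, a round-trip check is a cheap safeguard. The sketch below uses only the standard library and a temporary directory. `write_and_verify` is a hypothetical helper, and it assumes the config contains only JSON-native types (for example lists rather than tuples), since other types would not compare equal after reloading.

```python
import json
import tempfile
from pathlib import Path


def write_and_verify(config, path):
    """Write config as JSON, read it back, and fail loudly if the two differ."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    reloaded = json.loads(path.read_text(encoding="utf-8"))
    if reloaded != config:
        raise ValueError(f"{path} does not match the in-memory configuration")
    return reloaded


# Toy config standing in for the real configuration dict
toy_config = {"version": "example", "nodes": [{"node_type": "Transaction"}], "edges": []}

with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = write_and_verify(toy_config, Path(tmp_dir) / "config.json")

print(reloaded == toy_config)
```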

## Run GConstruct to prepare data for training

Before being able to use the graph data to train a GraphStorm model, you need to convert the data into a binary, distributed graph format that's compatible with GraphStorm.

GraphStorm provides the GConstruct module, which takes your input data in CSV/Parquet format, applies feature transformations, converts string IDs to numerical node IDs, and saves a partitioned binary representation of the graph that's ready to be used for training.

This process also saves metadata that can be used during inference, e.g. information about how to re-apply at inference time the transformations that were applied during training, like one-hot encoding categorical data, or min-max normalization of numerical features.
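
The reason this metadata matters can be shown with a small, library-free sketch: statistics are fitted once on the training data and then reused unchanged on new values. The functions below are hypothetical and do not reflect how GraphStorm stores or applies its transformations. In particular, clipping out-of-range values and mapping unseen categories to an all-zero vector are choices made here for illustration.

```python
def fit_min_max(values):
    """Record the training-time range of a numerical feature."""
    return {"min": min(values), "max": max(values)}


def apply_min_max(value, stats):
    """Scale a value with the saved range, clipping to [0, 1]."""
    span = stats["max"] - stats["min"]
    if span == 0:
        return 0.0
    scaled = (value - stats["min"]) / span
    return min(1.0, max(0.0, scaled))


def fit_one_hot(values):
    """Record a stable category-to-index mapping from the training data."""
    return {category: idx for idx, category in enumerate(sorted(set(values)))}


def apply_one_hot(value, mapping):
    """Encode a value with the saved mapping; unseen categories become all zeros."""
    vector = [0] * len(mapping)
    if value in mapping:
        vector[mapping[value]] = 1
    return vector


# "Training time": fit the statistics once and keep them as metadata
amount_stats = fit_min_max([10.0, 20.0, 30.0])
flag_mapping = fit_one_hot(["T", "F", "T"])

# "Inference time": reuse the saved statistics on new values
print(apply_min_max(25.0, amount_stats))
print(apply_min_max(50.0, amount_stats))
print(apply_one_hot("T", flag_mapping))
print(apply_one_hot("unknown", flag_mapping))
```

Recomputing the range or the category mapping on inference data would change the meaning of each feature dimension, which is why the training-time values have to be saved and reapplied.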

In [None]:
import sys

# We need to use the python executable of the gsf kernel
PYTHON = sys.executable

In [None]:
%cd {GRAPH_DATA_PATH}
!{PYTHON} -m graphstorm.gconstruct.construct_graph \
          --conf-file  ieee-cis-gconstruct-node-classification.json \
          --output-dir ../ieee_gs \
          --num-parts 1 \
          --graph-name ieee-cis

## Next steps

In the next notebook, `1-Load-Data-Into-Neptune-DB.ipynb`, you will load the pre-processed CSV data into Neptune DB.