# Graph Fraud Detection using Graph Neural Networks with DGL (Deep Graph Library) on Amazon SageMaker


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

---

**Note. This notebook should be used with the Python 3 (Data Science) kernel.**

In this notebook, we provide the following highlights:

* An end-to-end pipeline to train a fraud detection model using graph neural networks and a baseline model using XGBoost.

* Hyper-Parameter Optimization (HPO) for both the graph neural network and XGBoost.

* For the training and HPO process, we first process the raw dataset to prepare the features and extract the interactions in the dataset that are used to construct the graph.

* Then we launch a training job using the SageMaker framework estimator to train a graph neural network model with DGL.

In [None]:
!pip install -U boto3 sagemaker

In [None]:
import sys

sys.path.append("./sagemaker_graph_fraud_detection/")

import json
import sagemaker
from sagemaker_graph_fraud_detection import config

role = config.role
sess = sagemaker.Session()

## Dataset and Problem Statement

### Upload raw data to S3
We go over the specific data schema in subsequent cells, but for now let's move the raw data to a convenient location in the S3 bucket for this project, where it is picked up by the preprocessing job and the training job.

If you would like to use your own dataset for this demonstration, replace `raw_data_location` with the S3 path or local path of your dataset, and modify the data preprocessing step as needed.

In [None]:
# Replace with an S3 location or local path to point to your own dataset
raw_data_location = "s3://{}/{}/artifacts/data".format(
    config.solution_upstream_bucket, config.solution_name
)

session_prefix = "dgl-fraud-detection"
input_data = "s3://{}/{}/{}".format(config.solution_bucket, session_prefix, config.s3_data_prefix)

!aws s3 cp --recursive $raw_data_location $input_data

!mkdir -p input_raw_data # for data visualization, we also download the datasets into a local directory
!aws s3 cp --recursive $raw_data_location input_raw_data

# Set S3 locations to store processed data for training and post-training results and artifacts respectively
train_data = "s3://{}/{}/{}".format(
    config.solution_bucket, session_prefix, config.s3_processing_output
)
train_output = "s3://{}/{}/{}".format(
    config.solution_bucket, session_prefix, config.s3_train_output
)

### Data Description
The dataset we use is a synthetic dataset created to mimic the kind of financial transaction data that many companies have. The dataset consists of two tables:

* **Transactions** table: Records transactions and metadata about transactions between two users. Examples of columns include the product code for the transaction, features of the card used for the transaction, and a column indicating whether the corresponding transaction is fraudulent or not.
* **Identity** table: Contains information about the identity of the users performing transactions. Examples of columns here include the device type and device IDs used.

The two tables can be joined together using the unique key column **TransactionID**.

### Data Visualization
Read the tables in transaction.csv and identity.csv and merge them on the TransactionID column for better visualization.

Besides the unique identifier column (**TransactionID**) that identifies each transaction, there are two types of predictor columns and one target column.

* **Identity columns** that contain identity information related to a transaction. The corresponding columns include **card_no**, **card_type**, **email_domain**, **IpAddress**, **PhoneNo**, **DeviceID**.

* **Categorical or numerical columns** that describe the features of each transaction. The corresponding columns include **ProductCD** and **TransactionAmt**.

* Target column **isFraud**.

The **goal** is to fully utilize the information in the predictor columns to classify each transaction (each row in the table) as either fraud or not fraud.

In [None]:
import os
import pandas as pd

raw_data_dir = "input_raw_data"
transactions_df = pd.read_csv(os.path.join(raw_data_dir, "transaction.csv"))
identity_df = pd.read_csv(os.path.join(raw_data_dir, "identity.csv"))

The first 5 observations in the transaction dataset.

In [None]:
transactions_df.head(5)

The first 5 observations in the identity dataset.

In [None]:
identity_df.head(5)

Join the two datasets using the **TransactionID** column.

In [None]:
full_identity_df = transactions_df.merge(identity_df, on="TransactionID", how="left")

# Drop the transaction time column as it is not useful for constructing the graph.
full_identity_df.drop(["TransactionDT"], axis=1, inplace=True)

# Rearrange the order of the columns for better visualization
full_identity_df = full_identity_df[
    [
        "TransactionID",
        "card_no",
        "card_type",
        "email_domain",
        "IpAddress",
        "PhoneNo",
        "DeviceID",
        "ProductCD",
        "TransactionAmt",
        "isFraud",
    ]
]
full_identity_df.head(5)

## Problem Statement

The dataset shown above contains not only the features of each transaction, such as the purchased product type and the transaction amount, but also several pieces of identity information that can be used to identify relations between transactions.

This information can be used to construct a heterogeneous graph for a graph neural network. A heterogeneous graph contains different types of nodes and edges, and each node or edge type tends to have its own attributes designed to capture its characteristics.

In our case, different node types correspond to the categorical columns such as **card_type**, **card_no**, **email_domain**, **IpAddress**, **PhoneNo**, and **DeviceID**.

The graph neural networks utilize all the constructed information above to learn a hidden representation (embedding) for each transaction such that the hidden representation is used as input for a linear classification layer to determine whether the transaction is fraud or not fraud.
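As a rough illustration of how identity columns turn into graph structure, the sketch below builds one bipartite edge list per identity column from a small hand-made table. The toy values and the helper name `build_edgelists` are assumptions made for illustration; the preprocessing script used later in this notebook may construct its edge lists differently.

```python
import pandas as pd


def build_edgelists(df, id_cols, key="TransactionID"):
    """Return {identity column: DataFrame of (transaction, entity) edges}.

    Hypothetical helper for illustration only.
    """
    edgelists = {}
    for col in id_cols:
        # One edge per (transaction, identity value) pair; skip missing values.
        edges = df[[key, col]].dropna().drop_duplicates()
        edgelists[col] = edges.reset_index(drop=True)
    return edgelists


# Made-up data: transactions 1 and 3 share a card, 2 and 3 share a device.
toy = pd.DataFrame(
    {
        "TransactionID": [1, 2, 3],
        "card_no": ["c1", "c2", "c1"],
        "DeviceID": ["d1", "d2", "d2"],
    }
)

edgelists = build_edgelists(toy, ["card_no", "DeviceID"])

# Transactions attached to the same card node are two hops apart in the graph.
shared_card = edgelists["card_no"].groupby("card_no")["TransactionID"].apply(list)
print(shared_card.to_dict())
```

Each identity column yields its own edge type, which is what makes the resulting graph heterogeneous.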

Below is an example heterogeneous graph based on the datasets mentioned above.

In [None]:
from IPython.display import Image
from IPython.core.display import HTML

Image(filename="illustration-dgl.png", width=500, height=500)

## Data Preprocessing and Feature Engineering

### Run Preprocessing job with Amazon SageMaker Processing

The script we have defined at `data-preprocessing/graph_data_preprocessor.py` performs data preprocessing and feature engineering transformations on the raw data. We provide a general processing framework to convert a relational table to heterogeneous graph edgelists based on the column types of the relational table. Some of the data transformation and feature engineering techniques include:

* Performing numerical encoding for categorical variables and logarithmic transformation for transaction amount
* Constructing graph edgelists between transactions and other entities for the various relation types

The inputs to the data preprocessing script are passed in as Python command line arguments. All the columns in the relational table are classified into one of 3 types for the purposes of data transformation:

* **Identity columns** `--id-cols`: columns that contain identity information related to a user or transaction, for example IP address, phone number, or device identifiers. These column types become node types in the heterogeneous graph, and the entries in these columns become the nodes. The column names for these column types need to be passed in to the script.

* **Categorical columns** `--cat-cols`: columns that correspond to categorical features, such as a user's age group or whether a provided address matches an address on file. The entries in these columns undergo numerical feature transformation and are used as node attributes in the heterogeneous graph. The column names for these column types also need to be passed in to the script.

* **Numerical columns**: columns that correspond to numerical features, such as how many times a user has tried a transaction. The entries here are also used as node attributes in the heterogeneous graph. The script assumes that all columns in the tables that are not identity columns or categorical columns are numerical columns.
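The column classification described above can be sketched with `argparse`. This is not the code in `graph_data_preprocessor.py`; the helper names are hypothetical, and excluding the key and label columns from the numerical set is an assumption made for this example.

```python
import argparse


def split_columns(value):
    """Turn 'a,b,c' into ['a', 'b', 'c'], ignoring empty entries."""
    return [name.strip() for name in value.split(",") if name.strip()]


def parse_column_args(argv):
    """Parse the comma-separated column arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--id-cols", type=split_columns, default=[])
    parser.add_argument("--cat-cols", type=split_columns, default=[])
    return parser.parse_args(argv)


def numerical_columns(all_columns, id_cols, cat_cols, excluded=("TransactionID", "isFraud")):
    """Everything that is not an identity, categorical, key, or label column."""
    known = set(id_cols) | set(cat_cols) | set(excluded)
    return [col for col in all_columns if col not in known]


args = parse_column_args(["--id-cols", "card_no,email_domain", "--cat-cols", "ProductCD"])

# Toy column list for illustration.
columns = ["TransactionID", "card_no", "email_domain", "ProductCD", "TransactionAmt", "isFraud"]
num_cols = numerical_columns(columns, args.id_cols, args.cat_cols)
print(args.id_cols, args.cat_cols, num_cols)
```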

The data is divided into training (70% of the entire data), validation (20%), and test (10%) datasets. The validation dataset is used for hyper-parameter optimization to select the optimal set of hyper-parameters, and the test dataset is used for the final evaluation to compare the various models.
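A 70/20/10 split can be sketched as below. The text does not say whether the preprocessing script splits randomly or by time, so the random shuffle here, the seed, and the helper name `split_indices` are assumptions.

```python
import numpy as np


def split_indices(n_rows, train_frac=0.7, valid_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_train = int(round(n_rows * train_frac))
    n_valid = int(round(n_rows * valid_frac))
    train = shuffled[:n_train]
    valid = shuffled[n_train : n_train + n_valid]
    test = shuffled[n_train + n_valid :]  # the remaining rows
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The three index sets are disjoint, so no transaction is used both for tuning and for the final comparison.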

To adapt the preprocessing script to your own data in the same format, change the Python arguments in the cell below to comma-separated strings of the column names in your dataset. If your dataset is in a different format, you also have to modify the preprocessing script at `data-preprocessing/graph_data_preprocessor.py`.

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from time import strftime, gmtime

processing_job_name = "{}-processing-job-{}".format(
    config.solution_prefix, strftime("%Y-%m-%d-%H-%M-%S", gmtime())
)

sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    base_job_name=processing_job_name,
)

sklearn_processor.run(
    code="data-preprocessing/graph_data_preprocessor.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(destination=train_data, source="/opt/ml/processing/output")],
    arguments=[
        "--id-cols",
        "card_no,card_type,email_domain",
        "--cat-cols",
        "ProductCD",
        "--cat-cols-xgboost",
        "card_type,ProductCD",
    ],
)

### View Results of Data Preprocessing

Once the preprocessing job is complete, we can take a look at the contents of the S3 bucket to see the transformed data. We have a set of bipartite edge lists between transactions and different device id types as well as the features, labels and a set of transactions to validate our graph model performance.

In [None]:
from os import path
from sagemaker.s3 import S3Downloader

processed_files = S3Downloader.list(train_data)
print("===== Processed Files =====")
print("\n".join(processed_files))

# download processed data into local directory preprocessed-data
S3Downloader.download(train_data, train_data.split("/")[-1])

## Train an XGBoost Baseline
Before diving into training a graph neural network with DGL, let us first train an XGBoost model with HPO as the baseline on the transaction table data.

In [None]:
from baselines.utils import get_data
import numpy as np

### Read data and upload to S3
The features used for training XGBoost come from **features_xgboost.csv**, which is produced by the processing job above.
The features include the categorical columns **ProductCD** and **card_type** and the numerical column **TransactionAmt**. The categorical features are one-hot encoded. Other categorical features such as **IpAddress** and **PhoneNo** contain too many categories (~40,000) and are therefore not suitable as training features for XGBoost.

In [None]:
train_data_df, valid_data_df, test_data_df = get_data()

Let's check the first 5 observations of the train data frame.

In [None]:
train_data_df.head(5)

Save the training and validation data to a local directory and then upload them to the S3 bucket for training.

In [None]:
!mkdir -p xgboost_input
train_data_df.to_csv("xgboost_input/train_xgb.csv", header=False, index=False)
valid_data_df.to_csv("xgboost_input/validation_xgb.csv", header=False, index=False)

In [None]:
import os
import sagemaker
from sagemaker.s3 import S3Uploader

from sagemaker_graph_fraud_detection import config

role = config.role

session = sagemaker.Session()
bucket = config.solution_bucket
prefix = "xgboost-fraud-detection"

s3_train_data = S3Uploader.upload(
    "xgboost_input/train_xgb.csv", "s3://{}/{}/{}".format(bucket, prefix, "train")
)
print("Uploaded training data location: {}".format(s3_train_data))

s3_validation_data = S3Uploader.upload(
    "xgboost_input/validation_xgb.csv", "s3://{}/{}/{}".format(bucket, prefix, "validation")
)
print("Uploaded validation data location: {}".format(s3_validation_data))

output_location = "s3://{}/{}/output".format(bucket, prefix)
print("Training artifacts are uploaded to: {}".format(output_location))

### Train SageMaker XGBoost Estimator with HPO

Moving on to training, we first need to specify the location of the XGBoost algorithm container.

In [None]:
import boto3
from sagemaker.inputs import TrainingInput

container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "latest")
display(container)

Then, because we're training with the CSV file format, we create TrainingInputs that our training function can use as a pointer to the files in S3.

In [None]:
s3_input_train = TrainingInput(
    s3_data="s3://{}/{}/train".format(bucket, prefix), content_type="csv"
)
s3_input_validation = TrainingInput(
    s3_data="s3://{}/{}/validation/".format(bucket, prefix), content_type="csv"
)


In [None]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

train_y = train_data_df.values[:, 0]
# The dataset is imbalanced, so we give more weight to the minority (fraud) class.
scale_pos_weight = (len(train_y) - sum(train_y)) / sum(train_y)

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m5.xlarge",  #'ml.g4dn.xlarge',
    output_path=output_location,
    sagemaker_session=session,
)

xgb.set_hyperparameters(
    eval_metric="auc",
    objective="binary:logistic",
    num_round=1000,
    early_stopping_rounds=10,
    silent=0,
    scale_pos_weight=scale_pos_weight,
)
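The `scale_pos_weight` value computed in the cell above is the ratio of negative to positive training examples. A small worked example with made-up labels shows the arithmetic:

```python
import numpy as np

# Made-up labels: 95 legitimate (0) and 5 fraudulent (1) transactions.
labels = np.array([0] * 95 + [1] * 5)

n_pos = labels.sum()
n_neg = len(labels) - n_pos

# Each positive example is weighted as if it appeared n_neg / n_pos times.
example_scale_pos_weight = n_neg / n_pos
print(example_scale_pos_weight)
```

With 95 negatives and 5 positives the ratio is 19, so errors on a fraudulent example count roughly 19 times as much as errors on a legitimate one.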

In [None]:
# Define the hyper-parameters search ranges.
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "gamma": ContinuousParameter(0, 0.6),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
    "subsample": ContinuousParameter(0.2, 1),
}

In [None]:
objective_metric_name = "validation:auc"
objective_type = "Maximize"

In [None]:
import uuid

unique_hash = str(uuid.uuid4())[:6]
tuning_job_name = f"{config.solution_prefix}-{unique_hash}-tuning-job"
print(
    f"You can go to SageMaker -> Training -> Hyperparameter tuning jobs -> a job whose name starts with {tuning_job_name} to monitor HPO tuning status and details.\n"
    f"Note. You cannot run the following cells successfully until the tuning job completes. This step may take around 15 min."
)

tuner = HyperparameterTuner(
    xgb,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=30,
    max_parallel_jobs=3,
    objective_type=objective_type,
    base_tuning_job_name=tuning_job_name,
)

tuner.fit({"train": s3_input_train, "validation": s3_input_validation})

Check the status of the HPO tuning job

In [None]:
boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)["HyperParameterTuningJobStatus"]

Retrieve the tuning job name

In [None]:
import boto3

sm_client = boto3.Session().client("sagemaker")

tuning_job_name = tuner.latest_tuning_job.name
tuning_job_name

In [None]:
tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
    print("Reminder: the tuning job has not been completed.")

job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)

is_maximize = (
    tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]["Type"]
    == "Maximize"
)
objective_name = tuning_job_result["HyperParameterTuningJobConfig"][
    "HyperParameterTuningJobObjective"
]["MetricName"]

In [None]:
import pandas as pd

tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner_analytics.dataframe()

if len(full_df) > 0:
    df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]
    if len(df) > 0:
        df = df.sort_values("FinalObjectiveValue", ascending=False)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest": min(df["FinalObjectiveValue"]), "highest": max(df["FinalObjectiveValue"])})
        pd.set_option("display.max_colwidth", None)  # Don't truncate TrainingJobName
    else:
        print("No training jobs have reported valid results yet.")

df

### Deploy endpoint of the best tuning job

In [None]:
from sagemaker.serializers import CSVSerializer

print(
    f"You can go to SageMaker -> Inference -> Endpoints -> an endpoint whose name starts with {tuning_job_name} to monitor the deployment status."
)

predictor_hpo = tuner.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    serializer=CSVSerializer(),
)

In [None]:
def predict(current_predictor, data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ""
    for array in split_array:
        predictions = ",".join([predictions, current_predictor.predict(array).decode("utf-8")])
    return np.fromstring(predictions[1:], sep=",")


hpo_raw_preds = predict(
    predictor_hpo, test_data_df.values[:, 1:]
)  # estimated probability for positive class
hpo_preds = np.where(hpo_raw_preds > 0.5, 1, 0)  # generate prediction label

In [None]:
from sklearn.metrics import confusion_matrix, roc_curve, auc
from matplotlib import pyplot as plt


def print_metrics(y_true, y_predicted):
    cm = confusion_matrix(y_true, y_predicted)
    true_neg, false_pos, false_neg, true_pos = cm.ravel()
    cm = pd.DataFrame(
        np.array([[true_pos, false_pos], [false_neg, true_neg]]),
        columns=["labels positive", "labels negative"],
        index=["predicted positive", "predicted negative"],
    )

    acc = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) > 0 else 0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return [f1, precision, recall, acc]
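To see why F1, precision, and recall are reported alongside accuracy, the following self-contained sketch computes the same four metrics on made-up labels. The function name `binary_metrics` and the data are illustrative only.

```python
def binary_metrics(y_true, y_pred):
    """Compute F1, precision, recall, and accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return {"f1": f1, "precision": precision, "recall": recall, "accuracy": accuracy}


# Made-up labels: 3 fraudulent and 7 legitimate transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# A classifier that never predicts fraud still reaches 70% accuracy ...
always_negative = binary_metrics(y_true, [0] * 10)
# ... while a classifier that catches 2 of 3 frauds with 1 false alarm
# has only slightly higher accuracy but far better recall and F1.
reasonable = binary_metrics(y_true, [1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

print(always_negative)
print(reasonable)
```

On imbalanced data, accuracy alone hides the difference between these two classifiers, which is why the comparison tables in this notebook also list F1, precision, recall, and ROC AUC.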

In [None]:
result_xgboost_with_hpo = print_metrics(test_data_df.values[:, 0], hpo_preds)
fpr, tpr, _ = roc_curve(test_data_df.values[:, 0], hpo_raw_preds)
roc_auc = auc(fpr, tpr)
result_xgboost_with_hpo.append(roc_auc)
result_xgboost_with_hpo = pd.DataFrame(
    result_xgboost_with_hpo,
    index=["F1", "Precision", "Recall", "Accuracy", "ROC_AUC"],
    columns=["XGBoost_With_HPO"],
)

print(result_xgboost_with_hpo)

## Train Graph Neural Network using DGL

Graph Neural Networks work by learning representation for nodes or edges of a graph that are well suited for some downstream task. We can model the fraud detection problem as a node classification task, and the goal of the graph neural network would be to learn how to use information from the topology of the sub-graph for each transaction node to transform the node's features to a representation space where the node can be easily classified as fraud or not.

Specifically, we use a relational graph convolutional neural network model (R-GCN) on a heterogeneous graph since we have nodes and edges of different types.
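For intuition, one R-GCN style propagation step can be written in a few lines of NumPy: each node combines a transformed copy of its own features with the mean of its neighbours' features, using a separate weight matrix per relation type. This is a simplified sketch under assumptions (dense adjacency matrices, mean normalization, ReLU), not the DGL model trained in this notebook; the function name `rgcn_layer` is hypothetical.

```python
import numpy as np


def rgcn_layer(h, adj_by_relation, w_rel, w_self):
    """One simplified R-GCN propagation step.

    h:               (n_nodes, d_in) node features
    adj_by_relation: list of (n_nodes, n_nodes) 0/1 matrices, where
                     adj[i, j] = 1 means node j is a neighbour of node i
    w_rel:           list of (d_in, d_out) weights, one per relation
    w_self:          (d_in, d_out) weight for the self-connection
    """
    out = h @ w_self
    for adj, w in zip(adj_by_relation, w_rel):
        degree = adj.sum(axis=1, keepdims=True)
        # Mean over neighbours; rows without neighbours stay all zero.
        norm_adj = adj / np.maximum(degree, 1.0)
        out = out + norm_adj @ h @ w
    return np.maximum(out, 0.0)  # ReLU


# Toy graph with 3 nodes and 2 relation types (made-up values).
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adj_r0 = np.array([[0.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
adj_r1 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
w_self = np.eye(2)
w_rel = [np.eye(2), 2.0 * np.eye(2)]

h_next = rgcn_layer(h, [adj_r0, adj_r1], w_rel, w_self)
print(h_next)
```

Stacking several such layers lets information flow across multiple hops, for example from one transaction to a shared card node and on to another transaction.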

### Hyperparameters

To train the graph neural network, we need to define a few hyperparameters that determine properties such as the class of graph neural network models, the network architecture and the optimizer and optimization parameters. 

Here we set only a few of the hyperparameters. To see all the hyperparameters and their default values, see `dgl-fraud-detection/estimator_fns.py`. The parameters set below are:

* **`nodes`** is the name of the file that contains the `node_id`s of the target nodes and the node features.
* **`edges`** is a regular expression that when expanded lists all the filenames for the edgelists.
* **`labels`** is the name of the file that contains the target `node_id`s and their labels.
* **`model`** specifies which graph neural network to use; this should be set to `rgcn`.

The following hyperparameters can be tuned and adjusted to improve model performance:
* **batch-size** is the number of nodes that are used to compute a single forward pass of the GNN

* **embedding-size** is the size of the embedding dimension for non target nodes
* **n-neighbors** is the number of neighbours to sample for each target node during graph sampling for mini-batch training
* **n-layers** is the number of GNN layers in the model
* **n-epochs** is the number of training epochs for the model training job
* **optimizer** is the optimization algorithm used for gradient based parameter updates
* **lr** is the learning rate for parameter updates
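The role of **n-neighbors** can be illustrated with a plain Python sketch of neighbour sampling: for each target node in a mini-batch, at most `n_neighbors` of its neighbours are kept. This is not DGL's sampler; the function name, the toy adjacency, and the sampling-without-replacement choice are assumptions.

```python
import random


def sample_neighbors(adjacency, targets, n_neighbors, seed=0):
    """Pick at most n_neighbors neighbours per target node, without replacement."""
    rng = random.Random(seed)
    sampled = {}
    for node in targets:
        neighbors = adjacency.get(node, [])
        k = min(n_neighbors, len(neighbors))
        sampled[node] = rng.sample(neighbors, k)
    return sampled


# Made-up adjacency: transaction nodes pointing to identity nodes.
adjacency = {
    "t1": ["card_1", "device_1", "email_1", "ip_1"],
    "t2": ["card_1"],
    "t3": [],
}

batch = sample_neighbors(adjacency, ["t1", "t2", "t3"], n_neighbors=2)
print(batch)
```

Capping the neighbourhood size keeps the cost of each mini-batch bounded even when some nodes, such as a popular email domain, have very many neighbours.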


In [None]:
edges = ",".join(
    map(lambda x: x.split("/")[-1], [file for file in processed_files if "relation" in file])
)
params = {
    "nodes": "features.csv",
    "edges": "relation*",
    "labels": "tags.csv",
    "model": "rgcn",
    "num-gpus": 1,
    "batch-size": 1000,
    "embedding-size": 1024,
    "n-neighbors": 100,
    "n-layers": 2,
    "n-epochs": 10,
    "optimizer": "adam",
    "lr": 1e-2,
}

print("Graph is constructed using the following edgelists:\n{}".format("\n".join(edges.split(","))))

### Create and Fit SageMaker Estimator

With the hyperparameters defined, we can kick off the training job. We use the Deep Graph Library (DGL), with MXNet as the backend deep learning framework, to define and train the graph neural network. Amazon SageMaker makes this easy with the Framework estimators, which have the deep learning frameworks already set up. Here, we create a SageMaker MXNet estimator and pass in our model training script, hyperparameters, as well as the number and type of training instances we want.

We can then `fit` the estimator on the training data location in S3.

In [None]:
from sagemaker.mxnet import MXNet
from time import strftime, gmtime

estimator = MXNet(
    entry_point="train_dgl_mxnet_entry_point.py",
    source_dir="sagemaker_graph_fraud_detection/dgl_fraud_detection",
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    framework_version="1.6.0",
    py_version="py3",
    hyperparameters=params,
    output_path=train_output,
    code_location=train_output,
    sagemaker_session=sess,
)

training_job_name = "{}-{}".format(config.solution_prefix, strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
print(
    f"You can go to SageMaker -> Training -> Training jobs -> a job whose name starts with {training_job_name} to monitor training job status and details."
)
estimator.fit({"train": train_data}, job_name=training_job_name)

Once the training is completed, the training instances are automatically stopped and SageMaker stores the trained model and evaluation results (on the test data) to a location in S3.

### Read the prediction output for the test data
The current training process uses a transductive setting: the predictor columns of the test dataset (not including the target column) are used to construct the graph, so the test transactions are part of the graph during training. At the end of training, the predictions on the test dataset are generated and saved to **train_output** in the S3 bucket.
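The transductive setup can be sketched with boolean masks: every node produces a prediction because every node is in the graph, but only the training nodes contribute to the loss. The numbers below are made up and the helper `masked_log_loss` is hypothetical; the training script may organize this differently.

```python
import numpy as np

# Made-up labels and model outputs for 6 transaction nodes in one graph.
labels = np.array([0, 1, 0, 0, 1, 0])
probs = np.array([0.1, 0.8, 0.2, 0.3, 0.6, 0.4])  # predicted fraud probability

# Test nodes are part of the graph, but their labels are hidden from the loss.
train_mask = np.array([True, True, True, True, False, False])
test_mask = ~train_mask


def masked_log_loss(probs, labels, mask):
    """Binary cross-entropy averaged over the masked nodes only."""
    p = probs[mask]
    y = labels[mask]
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


train_loss = masked_log_loss(probs, labels, train_mask)
test_predictions = (probs[test_mask] > 0.5).astype(int)
print(train_loss, test_predictions)
```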

In [None]:
test_output_path = os.path.join(train_output, estimator.latest_training_job.job_name, "output")
!mkdir -p output_dgl_job
!aws s3 cp --recursive $test_output_path output_dgl_job

In [None]:
import tarfile

# open file
tar = tarfile.open(os.path.join("output_dgl_job", "output.tar.gz"), "r:gz")
tar.extractall("output_dgl_job")
tar.close()

In [None]:
dgl_output = pd.read_csv(os.path.join("output_dgl_job", "preds.csv"))
dgl_raw_preds, dgl_preds = dgl_output["pred_proba"], dgl_output["pred"]

result_dgl_no_hpo = print_metrics(test_data_df.iloc[:, 0], dgl_preds)
fpr, tpr, _ = roc_curve(test_data_df.iloc[:, 0], dgl_raw_preds)
roc_auc = auc(fpr, tpr)
result_dgl_no_hpo.append(roc_auc)
result_dgl_no_hpo = pd.DataFrame(
    result_dgl_no_hpo,
    index=["F1", "Precision", "Recall", "Accuracy", "ROC_AUC"],
    columns=["DGL_No_HPO"],
)

Combine the results with XGBoost

In [None]:
result_xgboost_dgl = result_xgboost_with_hpo.join(result_dgl_no_hpo)
print(result_xgboost_dgl)

### Create and Fit SageMaker Estimator with HPO
In this section we fit the SageMaker Estimator using DGL with HPO.

In [None]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

# Static hyperparameters we do not tune
hyperparameters = {
    "nodes": "features.csv",
    "edges": "relation*",
    "labels": "tags.csv",
    "model": "rgcn",
    "num-gpus": 1,
    "n-layers": 2,
    "optimizer": "adam",
}

# Dynamic hyperparameters we want to tune and their search ranges. For demonstration purposes, we skip
# architecture search and keep architecture hyperparameters such as 'n-layers' fixed.
hyperparameter_ranges = {
    "batch-size": CategoricalParameter([512, 1024, 2048, 10000]),
    "embedding-size": CategoricalParameter([16, 32, 64, 128, 256, 512]),
    "n-neighbors": IntegerParameter(800, 1200),
    "n-epochs": IntegerParameter(10, 17),
    "lr": ContinuousParameter(0.002, 0.1),
}

In [None]:
objective_metric_name = "Validation F1"
metric_definitions = [{"Name": "Validation F1", "Regex": "Validation F1 (\\S+)"}]
objective_type = "Maximize"
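The tuner reads the objective value from the training logs by applying the `Regex` in the metric definition. The sketch below applies the same pattern to a made-up log line with Python's `re` module; the exact log format written by the training script is an assumption.

```python
import re

pattern = "Validation F1 (\\S+)"

# Hypothetical log line; the real training script may format its logs differently.
log_line = "Epoch 00003, Validation F1 0.8421"

match = re.search(pattern, log_line)
# The first capture group holds the metric value as text.
validation_f1 = float(match.group(1))
print(validation_f1)
```

Because `\S+` captures everything up to the next whitespace, a log line that puts punctuation directly after the number (for example `0.8421,`) would capture the punctuation as well, so the pattern and the log format need to agree.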

In [None]:
from sagemaker.mxnet import MXNet

estimator_tuning = MXNet(
    entry_point="train_dgl_mxnet_entry_point.py",
    source_dir="sagemaker_graph_fraud_detection/dgl_fraud_detection",
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    framework_version="1.6.0",
    py_version="py3",
    hyperparameters=hyperparameters,
    output_path=train_output,
    code_location=train_output,
    sagemaker_session=sess,
)

In [None]:
import time

tuning_job_name = config.solution_prefix + "-tuning-job"
print(
    f"You can go to SageMaker -> Training -> Hyperparameter tuning jobs -> a job whose name starts with {tuning_job_name} to monitor HPO tuning status and details.\n"
    f"Note. You cannot run the following cells successfully until the tuning job completes. This step may take around 2 hours."
)

tuner = HyperparameterTuner(
    estimator_tuning,  # using the estimator defined in previous section
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=30,
    max_parallel_jobs=3,
    objective_type=objective_type,
    base_tuning_job_name=tuning_job_name,
)

start_time = time.time()

tuner.fit({"train": train_data})

hpo_training_job_time_duration = time.time() - start_time

In [None]:
import boto3

sm_client = boto3.Session().client("sagemaker")

tuning_job_name = tuner.latest_tuning_job.name
tuning_job_name

In [None]:
tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
    print("Reminder: the tuning job has not been completed.")

job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)

is_minimize = (
    tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]["Type"]
    == "Minimize"
)
objective_name = tuning_job_result["HyperParameterTuningJobConfig"][
    "HyperParameterTuningJobObjective"
]["MetricName"]

In [None]:
tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner_analytics.dataframe()

if len(full_df) > 0:
    df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]
    if len(df) > 0:
        df = df.sort_values("FinalObjectiveValue", ascending=False)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest": min(df["FinalObjectiveValue"]), "highest": max(df["FinalObjectiveValue"])})
        pd.set_option("display.max_colwidth", None)  # Don't truncate TrainingJobName
    else:
        print("No training jobs have reported valid results yet.")

df

### Read the prediction output for the test dataset from the best tuning job

In [None]:
import os

df = df[df["TrainingJobStatus"] == "Completed"]  # filter out the failed jobs
output_path_best_tuning_job = os.path.join(train_output, df["TrainingJobName"].iloc[0], "output")
print(output_path_best_tuning_job)

In [None]:
!mkdir -p output_dgl_best_tuning_job
!aws s3 cp --recursive $output_path_best_tuning_job output_dgl_best_tuning_job

In [None]:
import tarfile

# open file
tar = tarfile.open(os.path.join("output_dgl_best_tuning_job", "output.tar.gz"), "r:gz")
tar.extractall("output_dgl_best_tuning_job")
tar.close()

In [None]:
dgl_output_hpo = pd.read_csv(os.path.join("output_dgl_best_tuning_job", "preds.csv"))
dgl_hpo_raw_preds, dgl_hpo_preds = dgl_output_hpo["pred_proba"], dgl_output_hpo["pred"]

result_dgl_with_hpo = print_metrics(test_data_df.values[:, 0], dgl_hpo_preds)
fpr, tpr, _ = roc_curve(test_data_df.values[:, 0], dgl_hpo_raw_preds)
roc_auc = auc(fpr, tpr)
result_dgl_with_hpo.append(roc_auc)
result_dgl_with_hpo = pd.DataFrame(
    result_dgl_with_hpo,
    index=["F1", "Precision", "Recall", "Accuracy", "ROC_AUC"],
    columns=["DGL_With_HPO"],
)


result_xgboost_dgl_full = result_xgboost_dgl.join(result_dgl_with_hpo)
print(result_xgboost_dgl_full)

## Clean Up


After you are done using this notebook, delete the model artifacts and other resources to avoid incurring charges.

**Caution**: You need to manually delete resources that you may have created while running the notebook, such as Amazon S3 buckets for model artifacts, training datasets, processing artifacts, and Amazon CloudWatch log groups.


### Delete the Endpoint Created from XGBoost

In [None]:
predictor_hpo.delete_model()
predictor_hpo.delete_endpoint()

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/introduction_to_applying_machine_learning|fraud_detection_using_graph_neural_networks|fraud_detection_using_deep_graph_neural_networks.ipynb)
