# Graph Fraud Detection with DGL on Amazon SageMaker

This notebook shows an end-to-end pipeline to train a fraud detection model using graph neural networks.

First, we process the raw dataset to prepare the features and extract the interactions in the dataset that will be used to construct the graph. 

Then, we create and launch a training job using the SageMaker framework estimator to train a graph neural network model with DGL.

In [1]:
!bash setup.sh

import sagemaker
from sagemaker_graph_fraud_detection import config, container_build

role = config.role
sess = sagemaker.Session()

Obtaining file:///root/fraud-detection-workshop/4.%20Advanced/sagemaker_graph_fraud_detection
Installing collected packages: sagemaker-graph-fraud-detection
  Running setup.py develop for sagemaker-graph-fraud-detection
Successfully installed sagemaker-graph-fraud-detection-1.0


## Data Preprocessing and Feature Engineering

### Upload raw data to S3

The dataset we use is the [IEEE-CIS Fraud Detection dataset](https://www.kaggle.com/c/ieee-fraud-detection/data?select=train_transaction.csv), a typical example of the financial transaction datasets that many companies have. The dataset consists of two tables:

* **Transactions**: Records transactions and metadata about transactions between two users. Examples of columns include the product code for the transaction and features on the card used for the transaction. 
* **Identity**: Contains information about the identity of users performing transactions. Examples of columns here include the device type and device IDs used.
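
As a rough illustration of how the two tables relate, the sketch below joins a toy transactions table with a toy identity table using pandas. It assumes both tables share a `TransactionID` key, as in the Kaggle dataset. The rows and the small selection of columns are invented for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are invented for illustration.
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "ProductCD": ["W", "H", "W"],
    "card1": [1001, 1002, 1001],
    "TransactionAmt": [50.0, 20.0, 75.5],
})
identity = pd.DataFrame({
    "TransactionID": [1, 3],
    "DeviceType": ["mobile", "desktop"],
})

# A left join keeps every transaction, including those with no identity record.
merged = transactions.merge(identity, on="TransactionID", how="left")
print(merged)
```

Transactions without a matching identity row end up with missing values in the identity columns, which the preprocessing step then has to handle.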

We will go over the specific data schema in subsequent cells, but for now let's move the raw data to a convenient location in the S3 bucket for this project, where it will be picked up by the preprocessing job and the training job.

If you would like to use your own dataset for this demonstration, replace `raw_data_location` with the S3 path or local path of your dataset, and modify the data preprocessing step as needed.

In [2]:
# Replace with an S3 location or local path to point to your own dataset
raw_data_location = 's3://sagemaker-solutions-us-west-2/Fraud-detection-in-financial-networks/data'

session_prefix = 'dgl-fraud-detection'
input_data = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_data_prefix)

!aws s3 cp --recursive $raw_data_location $input_data

# Set S3 locations to store processed data for training and post-training results and artifacts respectively
train_data = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_processing_output)
train_output = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_train_output)

copy: s3://sagemaker-solutions-us-west-2/Fraud-detection-in-financial-networks/data/identity.csv to s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/raw-data/identity.csv
copy: s3://sagemaker-solutions-us-west-2/Fraud-detection-in-financial-networks/data/transaction.csv to s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/raw-data/transaction.csv


### Build container for Preprocessing and Feature Engineering

Data preprocessing and feature engineering are important components of the ML lifecycle, and Amazon SageMaker Processing lets you run them on managed infrastructure. First, we'll create a lightweight container that will serve as the environment for our data preprocessing.

The Dockerfile that defines the container is shown below. It contains only the pandas package as a dependency, but it can easily be customized to add more dependencies if your data preprocessing job requires them.

In [3]:
!pygmentize data-preprocessing/container/Dockerfile

FROM python:3.7-slim-buster

RUN pip3 install pandas==0.24.2
ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]


In [5]:
!pip install sagemaker-studio-image-build

Collecting sagemaker-studio-image-build
  Downloading sagemaker_studio_image_build-0.6.0.tar.gz (13 kB)
Building wheels for collected packages: sagemaker-studio-image-build
  Building wheel for sagemaker-studio-image-build (setup.py) ... done
  Created wheel for sagemaker-studio-image-build: filename=sagemaker_studio_image_build-0.6.0-py3-none-any.whl size=13469 sha256=bc9c1dedf926a774e6691c03370ca1a7c24ed08ed2407c7bf8a73a8fd029ddb2
  Stored in directory: /root/.cache/pip/wheels/c1/9c/e8/cbf0266d9d9b1b6161f7ba9ddf572d02aacd411e8a5b4d186b
Successfully built sagemaker-studio-image-build
Installing collected packages: sagemaker-studio-image-build
Successfully installed sagemaker-studio-image-build-0.6.0
You should consider upgrading via the '/opt/conda/bin/python -m pip install --upgrade pip' command.


Now we'll run a simple script to build a container image from the Dockerfile and push the image to Amazon ECR. The container image will have a unique URI, which the SageMaker Processing job uses to pull and run the container.

In [10]:
%%sh

cd datapreprocessing/container

sm-docker build .  --repository sagemaker-dgl:latest

Created ECR repository sagemaker-dgl
...[Container] 2021/08/23 19:03:15 Waiting for agent ping

[Container] 2021/08/23 19:03:18 Waiting for DOWNLOAD_SOURCE
[Container] 2021/08/23 19:03:18 Phase is DOWNLOAD_SOURCE
[Container] 2021/08/23 19:03:18 CODEBUILD_SRC_DIR=/codebuild/output/src262387543/src
[Container] 2021/08/23 19:03:18 YAML location is /codebuild/output/src262387543/src/buildspec.yml
[Container] 2021/08/23 19:03:18 Processing environment variables
[Container] 2021/08/23 19:03:18 No runtime version selected in buildspec.
[Container] 2021/08/23 19:03:18 Moving to directory /codebuild/output/src262387543/src
[Container] 2021/08/23 19:03:19 Registering with agent
[Container] 2021/08/23 19:03:19 Phases found in YAML: 3
[Container] 2021/08/23 19:03:19  PRE_BUILD: 9 commands
[Container] 2021/08/23 19:03:19  BUILD: 4 commands
[Container] 2021/08/23 19:03:19  POST_BUILD: 3 commands
[Container] 2021/08/23 19:03:19 Phase complete: DOWNLOAD_SOURCE State: SUCCEEDED
[Container] 2021/08/23 1

In [11]:
# Copy the image URI printed by the build command above and assign it to the variable below
ecr_repository_uri = '365792799466.dkr.ecr.us-east-1.amazonaws.com/sagemaker-dgl:latest'
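
The URI above follows the usual ECR naming pattern of account ID, region, repository, and tag. As a minimal sketch, the hypothetical helper `ecr_image_uri` below assembles such a URI from its parts instead of copying it by hand. It assumes a standard commercial AWS region, and the account ID shown is a placeholder.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build an ECR image URI of the form <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account ID and region; substitute your own values.
example_uri = ecr_image_uri("123456789012", "us-east-1", "sagemaker-dgl")
print(example_uri)
```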

### Run Preprocessing job with Amazon SageMaker Processing

The script we have defined at `data-preprocessing/graph_data_preprocessor.py` performs data preprocessing and feature engineering transformations on the raw data. We provide a general processing framework to convert a relational table to heterogeneous graph edgelists based on the column types of the relational table. Some of the data transformation and feature engineering techniques include:

* Performing numerical encoding for categorical variables and logarithmic transformation for transaction amount
* Constructing graph edgelists between transactions and other entities for the various relation types
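
The two transformations listed above can be sketched on a toy table as follows. This is an illustration only, not the preprocessing script itself: it assumes one-hot encoding as the numerical encoding and a base-10 logarithm for the amount, and the rows are invented.

```python
import numpy as np
import pandas as pd

# Toy relational table; values are invented for illustration.
df = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [10.0, 100.0, 1000.0],
    "M1": ["T", "F", "T"],
    "card1": [1001, 1002, 1001],
})

# Encode the categorical column numerically (one-hot is assumed here)
# and log-transform the transaction amount.
features = pd.get_dummies(df[["TransactionID", "TransactionAmt", "M1"]], columns=["M1"])
features["TransactionAmt"] = np.log10(features["TransactionAmt"])

# An edgelist for one relation type: (transaction, entity) pairs.
card1_edges = df[["TransactionID", "card1"]].dropna()

print(features)
print(card1_edges)
```

Each identity column yields one such edgelist, so transactions that share a card, address, or device become connected through a common entity node.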

The inputs to the data preprocessing script are passed in as Python command-line arguments. All the columns in the relational table are classified into one of three types for the purposes of data transformation:

* **Identity columns** `--id-cols`: columns that contain identity information related to a user or transaction, for example IP address, phone number, or device identifiers. These column types become node types in the heterogeneous graph, and the entries in these columns become the nodes. The column names for these column types need to be passed in to the script.

* **Categorical columns** `--cat-cols`: columns that correspond to categorical features, such as a user's age group or whether a provided address matches an address on file. The entries in these columns undergo numerical feature transformation and are used as node attributes in the heterogeneous graph. The column names for these column types also need to be passed in to the script.

* **Numerical columns**: columns that correspond to numerical features, such as how many times a user has tried a transaction. The entries here are also used as node attributes in the heterogeneous graph. The script assumes that all columns in the tables that are not identity columns or categorical columns are numerical columns.

To adapt the preprocessing script to your own data in the same format, change the arguments in the cell below to comma-separated strings of the column names in your dataset. If your dataset is in a different format, you will also have to modify the preprocessing script at `data-preprocessing/graph_data_preprocessor.py`.
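
To make the column classification concrete, here is a minimal sketch of how comma-separated `--id-cols` and `--cat-cols` arguments could be parsed, with every remaining column treated as numerical. The helper name `parse_columns` is hypothetical, and the actual script may organize this differently.

```python
import argparse


def parse_columns(argv, all_columns):
    """Split table columns into identity, categorical, and numerical groups."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--id-cols", type=str, default="")
    parser.add_argument("--cat-cols", type=str, default="")
    args = parser.parse_args(argv)

    # Each argument is a single comma-separated string of column names.
    id_cols = [c for c in args.id_cols.split(",") if c]
    cat_cols = [c for c in args.cat_cols.split(",") if c]
    # Anything not declared as identity or categorical is assumed numerical.
    num_cols = [c for c in all_columns if c not in id_cols and c not in cat_cols]
    return id_cols, cat_cols, num_cols


id_cols, cat_cols, num_cols = parse_columns(
    ["--id-cols", "card1,addr1", "--cat-cols", "M1,M2"],
    ["card1", "addr1", "M1", "M2", "TransactionAmt"],
)
print(id_cols, cat_cols, num_cols)
```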

In [13]:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri=ecr_repository_uri,
                                   role=role,
                                   instance_count=1,
                                   instance_type='ml.m4.xlarge')

script_processor.run(code='datapreprocessing/graph_data_preprocessor.py',
                     inputs=[ProcessingInput(source=input_data,
                                             destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(destination=train_data,
                                               source='/opt/ml/processing/output')],
                     arguments=['--id-cols', 'card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain',
                                '--cat-cols','M1,M2,M3,M4,M5,M6,M7,M8,M9'])

Parameter 'session' will be renamed to 'sagemaker_session' in SageMaker Python SDK v2.



Job Name:  sagemaker-dgl-2021-08-23-19-05-50-832
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/raw-data', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-365792799466/sagemaker-dgl-2021-08-23-19-05-50-832/input/code/graph_data_preprocessor.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/processed-data', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
..................................
2021-08-23 19:11:19,611 INFO __main__: Shape of transaction data is (590540, 394

### View Results of Data Preprocessing

Once the preprocessing job is complete, we can take a look at the contents of the S3 bucket to see the transformed data. We have a set of bipartite edge lists between transactions and different device id types as well as the features, labels and a set of transactions to validate our graph model performance.

In [14]:
from os import path
from sagemaker.s3 import S3Downloader
processed_files = S3Downloader.list(train_data)
print("===== Processed Files =====")
print('\n'.join(processed_files))

# optionally download processed data
# S3Downloader.download(train_data, train_data.split("/")[-1])

===== Processed Files =====
s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/processed-data/features.csv
s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/processed-data/relation_DeviceInfo_edgelist.csv
s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/processed-data/relation_DeviceType_edgelist.csv
s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/processed-data/relation_P_emaildomain_edgelist.csv
s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/processed-data/relation_ProductCD_edgelist.csv
s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/processed-data/relation_R_emaildomain_edgelist.csv
s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/processed-data/relation_TransactionID_edgelist.csv
s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/processed-data/relation_addr1_edgelist.csv
s3://sagemaker-us-east-1-365792799466/dgl-fraud-detection/processed-data/relation_addr2_edgelist.csv
s3://sagemaker-us-east-1-365792799466/dg
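
The edgelist file names in this listing follow a `relation_<column>_edgelist.csv` pattern, so the relation types can be recovered from the names alone. The sketch below shows one way to do that. The helper `relation_name` and the bucket name are hypothetical.

```python
import posixpath
import re


def relation_name(s3_uri):
    """Return the relation type encoded in an edgelist file name, or None."""
    match = re.fullmatch(r"relation_(.+)_edgelist\.csv", posixpath.basename(s3_uri))
    return match.group(1) if match else None


# Hypothetical bucket name, used only for illustration.
example_files = [
    "s3://example-bucket/processed-data/features.csv",
    "s3://example-bucket/processed-data/relation_card1_edgelist.csv",
    "s3://example-bucket/processed-data/relation_P_emaildomain_edgelist.csv",
]
relations = [name for name in map(relation_name, example_files) if name]
print(relations)  # ['card1', 'P_emaildomain']
```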

## Train Graph Neural Network with DGL

Graph neural networks work by learning representations for the nodes or edges of a graph that are well suited to some downstream task. We can model the fraud detection problem as a node classification task. The goal of the graph neural network is then to learn how to use information from the topology of the sub-graph around each transaction node to transform the node's features into a representation space where the node can be easily classified as fraud or not.

Specifically, we will be using a relational graph convolutional neural network model (R-GCN) on a heterogeneous graph since we have nodes and edges of different types.
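
To give some intuition for what a relational graph convolution computes, here is a minimal NumPy sketch of a single layer: each node combines a transformed copy of its own features with the mean of its neighbors' features, using a separate weight matrix per relation type. This is a simplified illustration, not the DGL model trained in this notebook. It assumes all nodes share one feature dimension, and the function name, toy graph, and random weights are made up.

```python
import numpy as np


def rgcn_layer(h, edges_by_relation, rel_weights, self_weight):
    """One relational graph convolution step.

    h: (num_nodes, in_dim) float array of node features.
    edges_by_relation: {relation: [(src, dst), ...]}, messages flow src -> dst.
    rel_weights: {relation: (in_dim, out_dim) weight matrix}.
    self_weight: (in_dim, out_dim) weight matrix for the self-connection.
    """
    out = h @ self_weight
    for rel, edges in edges_by_relation.items():
        msg_sum = np.zeros_like(out)
        degree = np.zeros(h.shape[0])
        for src, dst in edges:
            msg_sum[dst] += h[src] @ rel_weights[rel]
            degree[dst] += 1
        # Mean over the neighbors of this relation; isolated nodes receive zero.
        out += msg_sum / np.maximum(degree, 1)[:, None]
    return np.maximum(out, 0.0)  # ReLU activation


rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))  # 4 nodes with 3 features each
edges = {"card1": [(2, 0), (2, 1)], "addr1": [(3, 0)]}
weights = {rel: rng.normal(size=(3, 2)) for rel in edges}
h_next = rgcn_layer(h, edges, weights, rng.normal(size=(3, 2)))
print(h_next.shape)  # (4, 2)
```

Stacking several such layers lets information travel further: with two layers, a transaction node can be influenced by other transactions that share one of its entity nodes.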

### Hyperparameters

To train the graph neural network, we need to define a few hyperparameters that determine properties such as the class of graph neural network models we will be using, the network architecture and the optimizer and optimization parameters. 

Here we set only a few of the hyperparameters. To see all the hyperparameters and their default values, see `dgl-fraud-detection/estimator_fns.py`. The parameters set below are:

* **`nodes`** is the name of the file that contains the `node_id`s of the target nodes and the node features.
* **`edges`** is a regular expression that, when expanded, lists all the filenames for the edgelists.
* **`labels`** is the name of the file that contains the target `node_id`s and their labels.
* **`model`** specifies which graph neural network to use; this should be set to `rgcn`.

The following hyperparameters can be tuned and adjusted to improve model performance:

* **`batch-size`** is the number of nodes that are used to compute a single forward pass of the GNN.
* **`embedding-size`** is the size of the embedding dimension for non-target nodes.
* **`n-neighbors`** is the number of neighbors to sample for each target node during graph sampling for mini-batch training.
* **`n-layers`** is the number of GNN layers in the model.
* **`n-epochs`** is the number of training epochs for the model training job.
* **`optimizer`** is the optimization algorithm used for gradient-based parameter updates.
* **`lr`** is the learning rate for parameter updates.
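
To illustrate the role of neighbor sampling in mini-batch training, the sketch below caps the number of neighbors kept for each target node in a batch. It is a conceptual, pure-Python illustration under simplified assumptions and is not how DGL or the training script implements sampling. The function name and the toy adjacency are hypothetical.

```python
import random


def sample_neighbors(adjacency, target_nodes, n_neighbors, seed=0):
    """Keep at most n_neighbors randomly chosen neighbors per target node."""
    rng = random.Random(seed)
    sampled = {}
    for node in target_nodes:
        neighbors = adjacency.get(node, [])
        if len(neighbors) > n_neighbors:
            neighbors = rng.sample(neighbors, n_neighbors)
        sampled[node] = list(neighbors)
    return sampled


# Toy adjacency: transaction nodes mapped to the entity nodes they touch.
adjacency = {"t1": ["card_a", "addr_x", "dev_1"], "t2": ["card_a"]}
batch = sample_neighbors(adjacency, ["t1", "t2"], n_neighbors=2)
print(batch)
```

A smaller neighbor cap reduces the memory and compute needed per batch, at the cost of a noisier view of each node's neighborhood.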


In [15]:
edges = ",".join(map(lambda x: x.split("/")[-1], [file for file in processed_files if "relation" in file]))
params = {'nodes' : 'features.csv',
          'edges': 'relation*',
          'labels': 'tags.csv',
          'model': 'rgcn',
          'num-gpus': 1,
          'batch-size': 10000,
          'embedding-size': 64,
          'n-neighbors': 1000,
          'n-layers': 2,
          'n-epochs': 10,
          'optimizer': 'adam',
          'lr': 1e-2
        }

print("Graph will be constructed using the following edgelists:\n{}" .format('\n'.join(edges.split(","))))

Graph will be constructed using the following edgelists:
relation_DeviceInfo_edgelist.csv
relation_DeviceType_edgelist.csv
relation_P_emaildomain_edgelist.csv
relation_ProductCD_edgelist.csv
relation_R_emaildomain_edgelist.csv
relation_TransactionID_edgelist.csv
relation_addr1_edgelist.csv
relation_addr2_edgelist.csv
relation_card1_edgelist.csv
relation_card2_edgelist.csv
relation_card3_edgelist.csv
relation_card4_edgelist.csv
relation_card5_edgelist.csv
relation_card6_edgelist.csv
relation_id_01_edgelist.csv
relation_id_02_edgelist.csv
relation_id_03_edgelist.csv
relation_id_04_edgelist.csv
relation_id_05_edgelist.csv
relation_id_06_edgelist.csv
relation_id_07_edgelist.csv
relation_id_08_edgelist.csv
relation_id_09_edgelist.csv
relation_id_10_edgelist.csv
relation_id_11_edgelist.csv
relation_id_12_edgelist.csv
relation_id_13_edgelist.csv
relation_id_14_edgelist.csv
relation_id_15_edgelist.csv
relation_id_16_edgelist.csv
relation_id_17_edgelist.csv
relation_id_18_edgelist.csv
relation_

### Create and Fit SageMaker Estimator

With the hyperparameters defined, we can kick off the training job. We will be using the Deep Graph Library (DGL), with MXNet as the backend deep learning framework, to define and train the graph neural network. Amazon SageMaker makes this easy with the Framework estimators, which have the deep learning frameworks already set up. Here, we create a SageMaker MXNet estimator and pass in our model training script, the hyperparameters, and the number and type of training instances we want.

We can then `fit` the estimator on the training data location in S3.

In [16]:
from sagemaker.mxnet import MXNet
from time import strftime, gmtime

estimator = MXNet(entry_point='train_dgl_mxnet_entry_point.py',
                  source_dir='sagemaker_graph_fraud_detection/dgl_fraud_detection',
                  role=role, 
                  train_instance_count=1, 
                  train_instance_type='ml.p3.2xlarge',
                  framework_version="1.6.0",
                  py_version='py3',
                  hyperparameters=params,
                  output_path=train_output,
                  code_location=train_output,
                  sagemaker_session=sess)

training_job_name = "{}-{}".format(config.solution_prefix, strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
estimator.fit({'train': train_data}, job_name=training_job_name)

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2021-08-23 19:16:32 Starting - Starting the training job...
2021-08-23 19:16:34 Starting - Launching requested ML instances......
2021-08-23 19:17:49 Starting - Preparing the instances for training.........
2021-08-23 19:19:17 Downloading - Downloading input data...
2021-08-23 19:19:56 Training - Downloading the training image......
2021-08-23 19:20:49 Training - Training image download completed. Training in progress.
2021-08-23 19:20:48,849 sagemaker-training-toolkit INFO     Imported framework sagemaker_mxnet_container.training
2021-08-23 19:20:48,876 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{"batch-size":10000,"edges":"relation*","embedding-size":64,"labels":"tags.csv","lr":0.01,"model":"rgcn","n-epochs":10,"n-layers":2,"n-neighbors":1000,"nodes":"features.csv","num-gpus":1,"optimizer":"adam"}', 'SM_USER_ENTRY_POINT': 'train_dgl_mxnet_entry_point.py', 'SM_FRAMEWORK

Once training is completed, the training instances are automatically terminated, and SageMaker stores the trained model and evaluation results in a location in S3.
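
As a small sketch of where to look for the trained model, the hypothetical helper below builds the S3 URI under which SageMaker training jobs place the model archive by default, namely `<output_path>/<job_name>/output/model.tar.gz`. The bucket and job name are placeholders. With the SageMaker Python SDK, the same value should also be available as `estimator.model_data` after `fit` returns.

```python
def model_artifact_uri(output_path: str, job_name: str) -> str:
    """Return the default S3 location of a training job's model archive."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# Placeholder output path and job name, used only for illustration.
example_model_uri = model_artifact_uri(
    "s3://example-bucket/dgl-fraud-detection/training-output/",
    "example-job-2021-08-23",
)
print(example_model_uri)
```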