# Graph Fraud Detection with DGL on Amazon SageMaker

In [None]:
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
sess = sagemaker.Session()

## Data Preprocessing and Feature Engineering

### Upload raw data to S3

The dataset we use is a typical example of the financial transaction data that many companies have. It consists of the following tables:

* **Relations/Transactions**: Records of links or actions between two users.
* **User Features**: Demographic features of each user.

Now let's upload the raw data to a convenient location in S3 where it will be picked up by the preprocessing job and training job. 

In [None]:
!wget -P data/ 'https://linqs-data.soe.ucsc.edu/public/social_spammer/usersdata.csv.gz'
!wget -P data/ 'https://linqs-data.soe.ucsc.edu/public/social_spammer/relations.csv.gz'

from sagemaker.s3 import S3Uploader
bucket = 'SAGEMAKER_S3_BUCKET'
prefix = 'dgl'

input_data = 's3://{}/{}/raw-data'.format(bucket, prefix)
S3Uploader.upload('data/usersdata.csv.gz', input_data)
S3Uploader.upload('data/relations.csv.gz', input_data)

train_data = 's3://{}/{}/processed-data'.format(bucket, prefix)
train_output = 's3://{}/{}/output'.format(bucket, prefix)

### Build container for Preprocessing and Feature Engineering

Data preprocessing and feature engineering are important components of the ML lifecycle, and Amazon SageMaker Processing lets you run both on managed infrastructure. Here, we create a lightweight container that serves as the environment for our data preprocessing. The container can be customized to add more dependencies if the preprocessing job requires them.

In [None]:
import boto3 

region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-preprocessing-container'
ecr_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, ecr_repository)

!bash data-preprocessing/container/build_and_push.sh $ecr_repository docker

### Run Preprocessing job with Amazon SageMaker Processing

The script we have defined at `data-preprocessing/graph_data_preprocessor.py` performs data preprocessing and feature engineering transformations on the raw data. Some of the data transformation and feature engineering techniques include:

* Aggregating and encoding user activity into an hour-indexed feature vector
* Constructing graph edge lists between user accounts for the various relation types
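
The two transformations listed above can be illustrated with a minimal sketch. This is not the logic of `graph_data_preprocessor.py`; the record layout `(hour, src_user, dst_user, relation_type)` and the helper names `hourly_activity` and `edge_lists_by_relation` are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical toy records: (hour_of_event, src_user, dst_user, relation_type).
# The real column layout of relations.csv.gz may differ.
relations = [
    (0, "u1", "u2", 1),
    (0, "u1", "u3", 1),
    (5, "u2", "u1", 2),
    (23, "u1", "u2", 2),
]


def hourly_activity(records, n_hours=24):
    """Count each source user's actions per hour, giving an hour-indexed vector."""
    features = defaultdict(lambda: [0] * n_hours)
    for hour, src, _dst, _rel in records:
        features[src][hour % n_hours] += 1
    return dict(features)


def edge_lists_by_relation(records):
    """Group (src, dst) pairs into one edge list per relation type."""
    edge_lists = defaultdict(list)
    for _hour, src, dst, rel in records:
        edge_lists[rel].append((src, dst))
    return dict(edge_lists)


activity = hourly_activity(relations)
edges_by_type = edge_lists_by_relation(relations)
print(activity["u1"][:6])
print(edges_by_type)
```

Each user ends up with a fixed-length feature vector regardless of how many actions they performed, and each relation type gets its own edge list that can later become one edge type in the graph.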

In [None]:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri=ecr_repository_uri,
                                   role=role,
                                   instance_count=1,
                                   instance_type='SAGEMAKER_PROCESSING_INSTANCE_TYPE')

script_processor.run(code='data-preprocessing/graph_data_preprocessor.py',
                     inputs=[ProcessingInput(source=input_data,
                                             destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(destination=train_data,
                                               source='/opt/ml/processing/output')],
                     arguments=['--train-days', '5'])

### View Results of Data Preprocessing

Once the preprocessing job is complete, we can take a look at the contents of the S3 bucket to see the transformed data. We have a set of bipartite edge lists between users and different device id types as well as the user features, labels and a set of users to validate our graph model performance.

In [None]:
from os import path
from sagemaker.s3 import S3Downloader
processed_files = S3Downloader.list(train_data)
print("===== Processed Files =====")
print('\n'.join(processed_files))

# optionally download processed data
# S3Downloader.download(train_data, train_data.split("/")[-1])

## Train Graph Neural Network with DGL

Graph Neural Networks work by learning representations for the nodes or edges of a graph that are well suited to some downstream task. We can model the fraud detection problem as a node classification task. The goal of the graph neural network is then to learn how to use information from the topology of the sub-graph around each user node to transform the user node features into a representation space where the node can be easily classified as fraud or not.

Specifically, we will be using a relational graph convolutional neural network model on a heterogeneous graph since we have nodes and edges of different types.
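
To make the idea of a relational graph convolution concrete, here is a minimal plain-Python sketch of one layer. It does not use DGL or MXNet and is not the model defined in the training script. It assumes a single node type with several relation types, mean normalization over incoming edges per relation, and a ReLU activation; the function names `matvec` and `rgcn_layer` and the toy graph are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(features, edges_by_relation, relation_weights, self_weight):
    """Apply one relational graph convolution step.

    features:          {node: feature vector}
    edges_by_relation: {relation: [(src, dst), ...]}, messages flow src -> dst
    relation_weights:  {relation: weight matrix}
    self_weight:       weight matrix applied to each node's own features
    """
    # Self-connection term: every node keeps a transformed copy of itself.
    updated = {node: matvec(self_weight, h) for node, h in features.items()}

    for relation, edges in edges_by_relation.items():
        # In-degree per destination under this relation, used to average messages.
        in_degree = {}
        for _src, dst in edges:
            in_degree[dst] = in_degree.get(dst, 0) + 1

        for src, dst in edges:
            message = matvec(relation_weights[relation], features[src])
            norm = 1.0 / in_degree[dst]
            updated[dst] = [u + norm * m for u, m in zip(updated[dst], message)]

    # ReLU non-linearity.
    return {node: [max(0.0, v) for v in h] for node, h in updated.items()}


# Toy graph with two relation types (hypothetical names).
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
edges = {"follows": [("b", "a"), ("c", "a")], "messages": [("a", "b")]}
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {"follows": identity, "messages": [[2.0, 0.0], [0.0, 2.0]]}

out = rgcn_layer(features, edges, weights, identity)
print(out)
```

The key point is that each relation type has its own weight matrix, so a neighbor reached through one relation contributes differently from a neighbor reached through another. Stacking several such layers lets information from multi-hop neighborhoods reach each user node.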

### Hyperparameters

To train the graph neural network, we need to define a few hyperparameters that determine:

* The kind of graph we're constructing
* The class of graph neural network models we will be using 
* The network architecture
* The optimizer and optimization parameters


In [None]:
# Collect the file names of all relation edge lists produced by preprocessing
edges = ",".join(file.split("/")[-1] for file in processed_files if "relation" in file)
params = {'nodes' : 'user_features.csv',
          'edges': edges,
          'labels': 'tags.csv',
          'model': 'rgcn',
          'num-gpus': 1,
          'embedding-size': 64,
          'n-layers': 2,
          'n-epochs': 100,
          'optimizer': 'adam',
          'lr': 1e-2
        }

print("Graph will be constructed using the following edgelists:\n{}".format('\n'.join(edges.split(","))))
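
SageMaker framework estimators pass each hyperparameter to the entry point script as a `--key value` command-line argument. The sketch below shows one way a training script could read the dictionary above with `argparse`. It is an illustration only, not the contents of `train_dgl_entry_point.py`; the default values and the example file names `relation_1.csv` and `relation_2.csv` are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters that arrive as --key value command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--nodes", type=str, default="user_features.csv")
    parser.add_argument("--edges", type=str, default="")
    parser.add_argument("--labels", type=str, default="tags.csv")
    parser.add_argument("--model", type=str, default="rgcn")
    parser.add_argument("--num-gpus", type=int, default=0)
    parser.add_argument("--embedding-size", type=int, default=64)
    parser.add_argument("--n-layers", type=int, default=2)
    parser.add_argument("--n-epochs", type=int, default=100)
    parser.add_argument("--optimizer", type=str, default="adam")
    parser.add_argument("--lr", type=float, default=1e-2)
    args = parser.parse_args(argv)
    # The edge lists arrive as one comma-separated string; split it into a list.
    args.edges = [name for name in args.edges.split(",") if name]
    return args


# Example invocation with hypothetical edge list file names.
args = parse_args(["--edges", "relation_1.csv,relation_2.csv",
                   "--num-gpus", "1", "--lr", "0.01"])
print(args.edges, args.num_gpus, args.lr, args.n_layers)
```

Note that `argparse` converts dashes in option names to underscores, so `--num-gpus` is read back as `args.num_gpus`.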

### Create and Fit SageMaker Estimator

With the hyperparameters defined, we can kick off the training job. We will use the Deep Graph Library (DGL), with MXNet as the backend deep learning framework, to define and train the graph neural network. Amazon SageMaker makes this easy with its Framework estimators, which come with the deep learning frameworks already set up. Here, we create a SageMaker MXNet estimator and pass in our model training script, the hyperparameters, and the number and type of training instances we want.

We can then `fit` the estimator on the training data location in S3.

In [None]:
from sagemaker.mxnet import MXNet

estimator = MXNet(entry_point='train_dgl_entry_point.py',
                  source_dir='dgl-fraud-detection',
                  role=role, 
                  train_instance_count=1, 
                  train_instance_type='SAGEMAKER_TRAINING_INSTANCE_TYPE',
                  framework_version="1.4.1",
                  py_version='py3',
                  hyperparameters=params,
                  output_path=train_output,
                  code_location=train_output,
                  sagemaker_session=sess)

estimator.fit({'train': train_data})

Once training is complete, the training instances are automatically terminated and SageMaker stores the trained model and evaluation results in S3.
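
As a rough guide to where the trained model ends up, SageMaker writes the model archive under the estimator's `output_path`, in a folder named after the training job. The helper below sketches that layout. The function name `model_artifact_uri`, the bucket, and the job name are hypothetical; in a live session the same location is available from `estimator.model_data` after `fit` returns.

```python
def model_artifact_uri(output_path, job_name):
    """Build the S3 URI where SageMaker places model.tar.gz for a training job."""
    return "{}/{}/output/model.tar.gz".format(output_path.rstrip("/"), job_name)


# Hypothetical bucket and job name, for illustration only.
example_uri = model_artifact_uri("s3://my-bucket/dgl/output", "example-training-job")
print(example_uri)
```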