# Fraud Detection Solution Training Notebook

This notebook walks through the first-time training of a fraud detection model that uses graph neural networks (GNNs) in the Solution.

It assumes that the transaction data has already been exported from the Neptune database and copied to an S3 bucket, so the input data is already in S3.

We then launch a training job with the SageMaker framework estimator to train a graph neural network model with DGL.

In [1]:
!bash setup.sh

import sagemaker
from sagemaker_graph_fraud_detection import config, container_build

role = config.role
sess = sagemaker.Session()

Obtaining file:///home/ec2-user/SageMaker/sagemaker_graph_fraud_detection
Installing collected packages: sagemaker-graph-fraud-detection
  Attempting uninstall: sagemaker-graph-fraud-detection
    Found existing installation: sagemaker-graph-fraud-detection 1.0
    Uninstalling sagemaker-graph-fraud-detection-1.0:
      Successfully uninstalled sagemaker-graph-fraud-detection-1.0
  Running setup.py develop for sagemaker-graph-fraud-detection
Successfully installed sagemaker-graph-fraud-detection


## Define Data Location

### Loading Pre-processed data from S3

The dataset used in this Solution is the [IEEE-CIS Fraud Detection dataset](https://www.kaggle.com/c/ieee-fraud-detection/data?select=train_transaction.csv), a typical example of the financial transaction datasets that many companies hold. The dataset consists of two tables:

* **Transactions**: Records transactions between two users, along with metadata about each transaction. Example columns include the product code for the transaction and features of the card used for the transaction.
* **Identity**: Contains information about the identity of the users performing transactions. Example columns include the device type and the device IDs used.

This notebook assumes that the two data tables have already been pre-processed, mimicking the first-time data preparation.

**The current version uses the pre-processed data in a nearly raw format, including all relation files, a feature file, a tag file, and a test index file.**

In [2]:
# Replace with the name of the S3 bucket that holds your own dataset
data_location = 'fraud-detection-solution'
raw_data_folder = 'raw_data'
processed_data_folder = 'processed_data'
model_output_folder = 'model_output'

processed_data = 's3://{}/{}'.format(data_location, processed_data_folder)
output_path = 's3://{}/{}'.format(data_location, model_output_folder)

print(processed_data)
print(output_path)

s3://fraud-detection-solution/processed_data
s3://fraud-detection-solution/model_output


### View the pre-processed data

The pre-processed files are listed below.

## Need to modify this to use Neptune dump files!

In [3]:
from os import path
from sagemaker.s3 import S3Downloader
processed_files = S3Downloader.list(processed_data)
print("===== Processed Files =====")
print('\n'.join(processed_files))

===== Processed Files =====
s3://fraud-detection-solution/processed_data/
s3://fraud-detection-solution/processed_data/features.csv
s3://fraud-detection-solution/processed_data/relation_DeviceInfo_edgelist.csv
s3://fraud-detection-solution/processed_data/relation_DeviceType_edgelist.csv
s3://fraud-detection-solution/processed_data/relation_P_emaildomain_edgelist.csv
s3://fraud-detection-solution/processed_data/relation_ProductCD_edgelist.csv
s3://fraud-detection-solution/processed_data/relation_R_emaildomain_edgelist.csv
s3://fraud-detection-solution/processed_data/relation_TransactionID_edgelist.csv
s3://fraud-detection-solution/processed_data/relation_addr1_edgelist.csv
s3://fraud-detection-solution/processed_data/relation_addr2_edgelist.csv
s3://fraud-detection-solution/processed_data/relation_card1_edgelist.csv
s3://fraud-detection-solution/processed_data/relation_card2_edgelist.csv
s3://fraud-detection-solution/processed_data/relation_card3_edgelist.csv
s3://fraud-detection-soluti

## Train Graph Neural Network with DGL

Graph neural networks work by learning representations for the nodes or edges of a graph that are well suited to a downstream task. We can model fraud detection as a node classification task: the goal of the graph neural network is to learn how to use information from the topology of the sub-graph around each transaction node to transform the node's features into a representation space where the node can be easily classified as fraudulent or not.

Specifically, we will be using a relational graph convolutional neural network model (R-GCN) on a heterogeneous graph since we have nodes and edges of different types.
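
To make the message-passing idea concrete, here is a minimal pure-Python sketch of one R-GCN-style layer on a toy heterogeneous graph. It is not the Solution's model code, which uses DGL with learned weights. The function names, the toy node ids (`t1`, `t2`, `card_A`), the relation names, and the fixed weight matrices are all illustrative assumptions.

```python
def matvec(weight, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def rgcn_layer(features, edges_by_relation, rel_weights, self_weight):
    """One R-GCN-style update for every node i:

    h_i' = ReLU(W_0 h_i + sum over relations r of mean_{j in N_r(i)} W_r h_j)
    """
    updated = {}
    for node, h in features.items():
        acc = matvec(self_weight, h)  # self-connection term W_0 h_i
        for rel, edges in edges_by_relation.items():
            # Incoming neighbours of `node` under this relation type.
            neighbours = [src for src, dst in edges if dst == node]
            for src in neighbours:
                msg = matvec(rel_weights[rel], features[src])
                # Normalise by the number of neighbours for this relation.
                acc = [a + m / len(neighbours) for a, m in zip(acc, msg)]
        updated[node] = [max(0.0, a) for a in acc]  # ReLU
    return updated


# Toy graph: two transactions that share one card (hypothetical data).
features = {"t1": [1.0, 0.0], "t2": [0.0, 1.0], "card_A": [0.5, 0.5]}
edges_by_relation = {
    "card_to_txn": [("card_A", "t1"), ("card_A", "t2")],
    "txn_to_card": [("t1", "card_A"), ("t2", "card_A")],
}
identity = [[1.0, 0.0], [0.0, 1.0]]
rel_weights = {"card_to_txn": identity, "txn_to_card": [[2.0, 0.0], [0.0, 2.0]]}

updated = rgcn_layer(features, edges_by_relation, rel_weights, identity)
```

Each relation type has its own weight matrix, which is what lets the model treat, say, a shared card differently from a shared email domain.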

### Hyperparameters

To train the graph neural network, we need to define a few hyperparameters that determine properties such as the class of graph neural network models we will be using, the network architecture and the optimizer and optimization parameters. 

Here we set only a few of the hyperparameters. To see all the hyperparameters and their default values, see `dgl-fraud-detection/estimator_fns.py`. The parameters set below are:

* **`nodes`** is the name of the file that contains the `node_id`s of the target nodes and the node features.
* **`edges`** is a regular expression that, when expanded, lists all the filenames for the edgelists.
* **`labels`** is the name of the file that contains the target `node_id`s and their labels.

The following hyperparameters can be tuned to improve model performance:
* **`embedding-size`** is the size of the embedding dimension for non-target nodes.
* **`n-layers`** is the number of GNN layers in the model.
* **`n-epochs`** is the number of training epochs for the model training job.
* **`optimizer`** is the optimization algorithm used for gradient-based parameter updates.
* **`lr`** is the learning rate for parameter updates.
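
As a rough illustration of how an `edges` expression might be expanded into filenames, the sketch below matches a regular expression against the start of each filename with `re.match`. The training script's actual matching logic is not shown in this notebook, so the helper `expand_edge_expression` and the prefix-matching behaviour are assumptions.

```python
import re


def expand_edge_expression(expression, filenames):
    """Return the filenames whose beginning matches the regular expression."""
    compiled = re.compile(expression)
    return sorted(name for name in filenames if compiled.match(name))


files = [
    "features.csv",
    "tags.csv",
    "relation_card1_edgelist.csv",
    "relation_addr1_edgelist.csv",
]

# As a regex, 'relation*' means "relatio" followed by zero or more "n",
# so with prefix matching it selects every file starting with "relatio".
edge_files = expand_edge_expression("relation*", files)

# A stricter expression that spells out the full naming convention.
strict_edge_files = expand_edge_expression(r"relation_.*_edgelist\.csv$", files)
```

Under these assumptions both expressions select the same two edge-list files and skip the feature and tag files.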


In [4]:
edges = ",".join(map(lambda x: x.split("/")[-1], [file for file in processed_files if "relation" in file]))
params = {'nodes' : 'features.csv',
          'edges': 'relation*',
          'labels': 'tags.csv',
          'embedding-size': 64,
          'n-layers': 2,
          'n-epochs': 10,
          'optimizer': 'adam',
          'lr': 1e-2
        }

# print("Graph will be constructed using the following edgelists:\n{}" .format('\n'.join(edges.split(","))))
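
In SageMaker script mode, the entries of the hyperparameter dictionary reach the training script as `--key value` command-line arguments. The following is a minimal sketch of how an entry point could read them with `argparse`. The function `parse_hyperparameters` is hypothetical and is not the Solution's actual parser.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse the hyperparameters that arrive as command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--nodes", type=str, required=True)
    parser.add_argument("--edges", type=str, required=True)
    parser.add_argument("--labels", type=str, required=True)
    parser.add_argument("--embedding-size", type=int, required=True)
    parser.add_argument("--n-layers", type=int, required=True)
    parser.add_argument("--n-epochs", type=int, required=True)
    parser.add_argument("--optimizer", type=str, required=True)
    parser.add_argument("--lr", type=float, required=True)
    return parser.parse_args(argv)


params = {
    "nodes": "features.csv",
    "edges": "relation*",
    "labels": "tags.csv",
    "embedding-size": 64,
    "n-layers": 2,
    "n-epochs": 10,
    "optimizer": "adam",
    "lr": 1e-2,
}

# Emulate the argument list that the training container would receive.
argv = []
for key, value in params.items():
    argv += ["--" + key, str(value)]

args = parse_hyperparameters(argv)
```

Note that `argparse` exposes hyphenated names with underscores, so `--embedding-size` becomes `args.embedding_size`.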

### Create and Fit SageMaker PyTorch Estimator

With the hyperparameters defined, we can kick off the training job. We use the Deep Graph Library (DGL), with PyTorch as the backend deep learning framework, to define and train the graph neural network. Amazon SageMaker makes this easy with its framework estimators, which come with the deep learning frameworks already set up. Here, we create a SageMaker PyTorch estimator and pass in our model training script, the hyperparameters, and the number and type of training instances.

We then `fit` the estimator on the training data location in S3.

In [5]:
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role
from time import strftime, gmtime

estimator = PyTorch(entry_point='fd_sl_train_entry_point.py',
                    source_dir='FD_SL_DGL/gnn_fraud_detection_dgl',
                    role=role, 
                    train_instance_count=1, 
                    train_instance_type='ml.c5.4xlarge',
                    framework_version="1.4.0",
                    py_version='py3',
                    hyperparameters=params,
                    output_path=output_path,
                    sagemaker_session=sess)

# SageMaker job names may contain only alphanumeric characters and hyphens, so avoid underscores.
training_job_name = "{}-{}".format('GNN-FD-SL-DGL-Train', strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
print(training_job_name)

estimator.fit({'train': processed_data}, job_name=training_job_name)

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


JamesTrain01-2021-01-21-02-30-04
2021-01-21 02:30:04 Starting - Starting the training job...
2021-01-21 02:30:09 Starting - Launching requested ML instances.........
2021-01-21 02:31:36 Starting - Preparing the instances for training...
2021-01-21 02:32:34 Downloading - Downloading input data......
2021-01-21 02:33:30 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-01-21 02:33:31,309 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2021-01-21 02:33:31,312 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-01-21 02:33:31,320 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-01-21 02:33:31,323 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-01-21 02:33:

Read edges for relation2 from edgelist: /opt/ml/input/data/train/relation_card3_edgelist.csv
Read edges for relation3 from edgelist: /opt/ml/input/data/train/relation_card1_edgelist.csv
Read edges for relation4 from edgelist: /opt/ml/input/data/train/relation_id_36_edgelist.csv
Read edges for relation5 from edgelist: /opt/ml/input/data/train/relation_card5_edgelist.csv
Read edges for relation6 from edgelist: /opt/ml/input/data/train/relation_id_23_edgelist.csv
Read edges for relation7 from edgelist: /opt/ml/input/data/train/relation_card2_edgelist.csv
Read edges for relation8 from edgelist: /opt/ml/input/data/train/relation_id_08_edgelist.csv
Read edges for relation9 from edgelist: /opt/ml/input/data/train/relation_id_05_edgelist.csv
Read edges for relation10 from edgelist: /opt/ml/input/data/train/relation_card4_edgelist.csv
Read edges for relation11 from edgelist: /opt/ml/input/data/train/relation_i

Epoch 00000 | Time(s) 38.9127 | Loss 0.4637 | f1 0.0000
Epoch 00001 | Time(s) 38.0535 | Loss 0.9163 | f1 0.0097
Epoch 00002 | Time(s) 37.5218 | Loss 0.5926 | f1 0.0330
Epoch 00003 | Time(s) 37.2649 | Loss 0.3099 | f1 0.1682
Epoch 00004 | Time(s) 37.1678 | Loss 0.8407 | f1 0.1806
Epoch 00005 | Time(s) 37.0568 | Loss 0.2275 | f1 0.0203
Epoch 00006 | Time(s) 37.0048 | Loss 0.2538 | f1 0.0023
Epoch 00007 | Time(s) 36.9389 | Loss 0.3028 | f1 0.0000
Epoch 00008 | Time(s) 36.9076 | Loss 0.3156 | f1 0.0000
Epoch 00009 | Time(s) 36.8019 | Loss 0.2967 | f1 0.0011
Metrics
Confusion Matrix:
                    labels positive  labels negative
predicted positive               11                1
predicted negative            17826           454594
f1: 0.0012, precision: 0.9167, recall: 0.0006, acc: 0.9623,
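
The reported metrics follow directly from the confusion matrix. The sketch below recomputes them with the standard definitions; the helper `classification_metrics` is illustrative and is not part of the training script.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"f1": f1, "precision": precision, "recall": recall, "acc": accuracy}


# Counts read from the confusion matrix in the training log:
# rows are predictions, columns are true labels.
metrics = classification_metrics(tp=11, fp=1, fn=17826, tn=454594)
rounded = {name: round(value, 4) for name, value in metrics.items()}
```

The high accuracy here mostly reflects class imbalance: the model labels almost every transaction as non-fraud, so recall and F1 stay close to zero even though precision is high.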

Once training is complete, the training instances are shut down automatically, and SageMaker stores the trained model and evaluation results in the S3 output location.
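
As a rough illustration of where the model ends up, the sketch below builds the artifact URI under the default SageMaker layout, `<output_path>/<job_name>/output/model.tar.gz`. The helper `model_artifact_uri` is hypothetical; in a live session the estimator's `model_data` attribute reports the actual location.

```python
def model_artifact_uri(output_path, job_name):
    """Build the S3 URI of the model archive, assuming the default output layout."""
    return "{}/{}/output/model.tar.gz".format(output_path.rstrip("/"), job_name)


# Values taken from the output path and the job name shown in the log above.
uri = model_artifact_uri(
    "s3://fraud-detection-solution/model_output",
    "JamesTrain01-2021-01-21-02-30-04",
)
```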