# Fraud Detection Solution Training Notebook with My Own Docker Image

This notebook shows the first-time training process for a fraud detection model that uses graph neural networks in the Solution.

It assumes the transaction data has already been exported from the Neptune database and copied to S3 buckets, so the input data is already available in S3.

We then launch a training job with a SageMaker estimator to train a graph neural network model with DGL.

### The major difference from the Solution Training Notebook is the use of my own Docker image

In [None]:
! pip uninstall sagemaker -y
! pip install sagemaker

In [None]:
import sagemaker
from sagemaker_graph_fraud_detection import config, container_build

role = config.role
sess = sagemaker.Session()

## Define Data Location

### Loading Pre-processed data from S3

The dataset used in this Solution is the [IEEE-CIS Fraud Detection dataset](https://www.kaggle.com/c/ieee-fraud-detection/data?select=train_transaction.csv), a typical example of the financial transaction datasets that many companies have. The dataset consists of two tables:

* **Transactions**: Records transactions and metadata about transactions between two users. Examples of columns include the product code for the transaction and features on the card used for the transaction. 
* **Identity**: Contains information about the identity of the users performing transactions. Examples of columns here include the device type and device IDs used.

This notebook assumes that the two data tables have been pre-processed, mimicking the first-time data preparation.

**The current version uses the pre-processed data in nearly raw format, including all relation files, a feature file, a tag file, and a test index file.**

In [None]:
# Replace with the name of the S3 bucket that holds your own dataset
data_location = 'fraud-detection-solution'
raw_data_folder = 'raw_data'
processed_data_folder = 'processed_data'
model_output_folder = 'model_output'

processed_data = 's3://{}/{}'.format(data_location, processed_data_folder)
output_path = 's3://{}/{}'.format(data_location, model_output_folder)

print(processed_data)
print(output_path)

### View the pre-processed data

The pre-processed files are listed below.

## TODO: Modify this to use the Neptune dump files

In [None]:
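from os import path
from sagemaker.s3 import S3Downloader

# List the pre-processed files under the S3 prefix and print them for inspection
processed_files = S3Downloader.list(processed_data)
print('\n'.join(processed_files))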
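
As a rough illustration of what the listing is expected to contain, the sketch below groups file names into relation files, the feature file, the tag file, and everything else. The helper `group_processed_files` and the example listing are hypothetical. Only `features.csv`, `tags.csv`, and the `relation` naming come from this notebook; the other file names are made up for the example.

```python
from collections import defaultdict
from posixpath import basename


def group_processed_files(s3_uris):
    """Group S3 URIs into relation, feature, tag, and other files by file name."""
    groups = defaultdict(list)
    for uri in s3_uris:
        name = basename(uri)
        if "relation" in name:
            groups["relations"].append(name)
        elif name == "features.csv":
            groups["features"].append(name)
        elif name == "tags.csv":
            groups["tags"].append(name)
        else:
            groups["other"].append(name)
    return dict(groups)


# Hypothetical listing, shaped like a list of S3 URIs under processed_data.
example_files = [
    "s3://fraud-detection-solution/processed_data/features.csv",
    "s3://fraud-detection-solution/processed_data/relation_card1_edgelist.csv",
    "s3://fraud-detection-solution/processed_data/relation_DeviceType_edgelist.csv",
    "s3://fraud-detection-solution/processed_data/tags.csv",
    "s3://fraud-detection-solution/processed_data/test.csv",
]

for kind, names in group_processed_files(example_files).items():
    print(kind, names)
```

With a real listing, an empty `relations` group or a missing `features` or `tags` entry would be an early hint that the data export is incomplete.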


## Train Graph Neural Network with DGL

Graph Neural Networks work by learning representations for the nodes or edges of a graph that are well suited for some downstream task. We can model the fraud detection problem as a node classification task. The goal of the graph neural network is then to learn how to use information from the topology of the sub-graph around each transaction node to transform the node's features into a representation space where the node can be easily classified as fraud or not.

Specifically, we will be using a relational graph convolutional neural network model (R-GCN) on a heterogeneous graph since we have nodes and edges of different types.
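
To make the idea of relation-specific message passing more concrete, here is a toy, pure-Python sketch of a single R-GCN-style update. It is not the model trained in this notebook: node features are scalars, each relation has one scalar weight instead of a weight matrix, and the node names, relation names, and weights are all hypothetical. Each node adds its own weighted feature to the mean of its neighbors' features per relation, then applies a ReLU.

```python
def rgcn_step(features, edges_by_relation, relation_weights, self_weight):
    """One simplified R-GCN update on scalar node features.

    features: dict node -> float
    edges_by_relation: dict relation -> list of (source, destination) pairs
    relation_weights: dict relation -> float (stands in for a weight matrix)
    """
    updated = {}
    for node, own_feature in features.items():
        total = self_weight * own_feature
        for relation, edges in edges_by_relation.items():
            incoming = [features[src] for src, dst in edges if dst == node]
            if incoming:
                # Normalize by the number of neighbors under this relation.
                total += relation_weights[relation] * sum(incoming) / len(incoming)
        updated[node] = max(0.0, total)  # ReLU non-linearity
    return updated


# Hypothetical heterogeneous graph: two transactions, one card, one device.
features = {"t1": 1.0, "t2": 2.0, "card_a": 0.5, "dev_x": -1.0}
edges_by_relation = {
    "card_to_txn": [("card_a", "t1"), ("card_a", "t2")],
    "device_to_txn": [("dev_x", "t1")],
}
relation_weights = {"card_to_txn": 2.0, "device_to_txn": 1.0}

new_features = rgcn_step(features, edges_by_relation, relation_weights, self_weight=1.0)
print(new_features)
```

Because each relation has its own weight, a shared card and a shared device can influence a transaction's representation differently, which is the reason for using a relational model on a heterogeneous graph.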

### Hyperparameters

To train the graph neural network, we need to define a few hyperparameters that determine properties such as the class of graph neural network models we will be using, the network architecture and the optimizer and optimization parameters. 

Here we set only a few of the hyperparameters. To see all the hyperparameters and their default values, see `dgl-fraud-detection/estimator_fns.py`. The parameters set below are:

* **`nodes`** is the name of the file that contains the `node_id`s of the target nodes and the node features.
* **`edges`** is a regular expression that, when expanded, lists all the file names of the edgelists.
* **`labels`** is the name of the file that contains the target `node_id`s and their labels.

The following hyperparameters can be tuned and adjusted to improve model performance:

* **`embedding-size`** is the size of the embedding dimension for non-target nodes.
* **`n-layers`** is the number of GNN layers in the model.
* **`n-epochs`** is the number of training epochs for the model training job.
* **`optimizer`** is the optimization algorithm used for gradient-based parameter updates.
* **`lr`** is the learning rate for parameter updates.
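
The sketch below shows one way an `edges` value could be expanded into file names. It assumes, without having checked the training script, that a comma-separated value is taken as an explicit list and that anything else is compiled as a regular expression and matched against the start of each file name. The helper `expand_edges` and the file names are hypothetical.

```python
import re


def expand_edges(expression, filenames):
    """Return the file names selected by an `edges` expression (assumed behaviour)."""
    if "," in expression:
        # An explicit, comma-separated list of edgelist files.
        return expression.split(",")
    pattern = re.compile(expression)
    # re.match anchors at the start of the name only.
    return [name for name in filenames if pattern.match(name)]


# Hypothetical directory content.
filenames = [
    "features.csv",
    "relation_card1_edgelist.csv",
    "relation_DeviceType_edgelist.csv",
    "tags.csv",
]

print(expand_edges("relation*", filenames))
print(expand_edges("a_edgelist.csv,b_edgelist.csv", filenames))
```

Note that, read as a regular expression, `relation*` means "relatio" followed by zero or more "n" characters, so under this assumption it selects every file whose name starts with "relatio".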


In [None]:
# Explicit, comma-separated list of edgelist file names (only needed for the optional print below)
edges = ",".join(file.split("/")[-1] for file in processed_files if "relation" in file)

# Hyperparameters passed to the training job
params = {'nodes': 'features.csv',
          'edges': 'relation*',
          'labels': 'tags.csv',
          'embedding-size': 64,
          'n-layers': 2,
          'n-epochs': 10,
          'optimizer': 'adam',
          'lr': 1e-2
          }

# print("Graph will be constructed using the following edgelists:\n{}" .format('\n'.join(edges.split(","))))
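
For orientation, here is a minimal sketch of how a training script could read these hyperparameters. It assumes the container hands them to the script as `--name value` command-line arguments with string values, which is how the SageMaker framework containers behave as far as we know; a custom image may do this differently. The function names and the defaults below are hypothetical and simply mirror the values set in this notebook, not the real defaults in `estimator_fns.py`.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse hyperparameters given as ['--name', 'value', ...] into typed values."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--nodes", type=str, default="features.csv")
    parser.add_argument("--edges", type=str, default="relation*")
    parser.add_argument("--labels", type=str, default="tags.csv")
    parser.add_argument("--embedding-size", type=int, default=64)
    parser.add_argument("--n-layers", type=int, default=2)
    parser.add_argument("--n-epochs", type=int, default=10)
    parser.add_argument("--optimizer", type=str, default="adam")
    parser.add_argument("--lr", type=float, default=1e-2)
    return parser.parse_args(argv)


def to_argv(params):
    """Flatten a hyperparameter dict into command-line style arguments."""
    argv = []
    for key, value in params.items():
        argv += ["--{}".format(key), str(value)]
    return argv


example_params = {"edges": "relation*", "embedding-size": 64, "n-layers": 2, "lr": 1e-2}
args = parse_hyperparameters(to_argv(example_params))
# argparse exposes '--embedding-size' as the attribute 'embedding_size'.
print(args.embedding_size, args.n_layers, args.lr, args.edges)
```

The `type=` arguments matter because the values arrive as strings; without them, `lr` would be the string `'0.01'` rather than a float.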

### Create and Fit the SageMaker Estimator

With the hyperparameters defined, we can kick off the training job. We use the Deep Graph Library (DGL), with PyTorch as the backend deep learning framework, to define and train the graph neural network. Amazon SageMaker makes this easy with the Framework estimators, which come with the deep learning frameworks already set up. In this notebook, however, we create a generic SageMaker `Estimator` that points to our own PyTorch-based Docker image, and pass in the hyperparameters as well as the number and type of training instances.

Then we `fit` the estimator on the training data location in S3.

In [None]:
from time import strftime, gmtime
from sagemaker.estimator import Estimator

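# Replace image_uri with the URI of your own training image in Amazon ECR
estimator = Estimator(image_uri='510768346845.dkr.ecr.cn-north-1.amazonaws.com.cn/pytorch-extending-our-containers-gnn-fraud-detection-solution:latest',
                      role=role,
                      instance_count=1,  # called train_instance_count before SageMaker SDK v2
                      instance_type='ml.c5.4xlarge',  # called train_instance_type before SageMaker SDK v2
                      hyperparameters=params,
                      output_path=output_path,
                      disable_profiler=True,
                      sagemaker_session=sess)

# Training job names may only contain alphanumeric characters and hyphens, so no underscores
training_job_name = "{}-{}".format('GNN-FD-SL-DGL-Train', strftime("%Y-%m-%d-%H-%M-%S", gmtime()))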
print(training_job_name)

estimator.fit({'train': processed_data}, job_name=training_job_name)
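
SageMaker restricts training job names to alphanumeric characters and hyphens, with a maximum length of 63 characters, so a prefix that contains underscores or other characters needs to be sanitized first. The helper below is a hypothetical sketch of how a timestamped name could be built and checked before calling `fit`. The pattern is the one we understand the `CreateTrainingJob` API to document for `TrainingJobName`.

```python
import re
from time import gmtime, strftime

# Pattern as documented (to our knowledge) for TrainingJobName.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63


def make_job_name(prefix, now=None):
    """Build '<prefix>-<UTC timestamp>' and validate it as a training job name."""
    # Replace every character that is not a letter, digit, or hyphen.
    safe_prefix = re.sub(r"[^a-zA-Z0-9-]", "-", prefix).strip("-")
    timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime() if now is None else now)
    name = "{}-{}".format(safe_prefix, timestamp)
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError("Invalid training job name: {!r}".format(name))
    return name


print(make_job_name("GNN_FD_SL_DGL_Train"))
```

Validating the name locally gives a clear error message before any request is sent to SageMaker.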

Once the training is completed, the training instances are automatically shut down, and SageMaker stores the trained model and evaluation results in the S3 location given by `output_path`.
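
When `output_path` is set, SageMaker normally writes the model artifact to `<output_path>/<training job name>/output/model.tar.gz`. The small helper below is a hypothetical sketch that builds this URI by hand; after a successful `fit`, the same location should also be available as `estimator.model_data`. The job name in the example is made up.

```python
def model_artifact_uri(output_path, training_job_name):
    """Return the expected S3 URI of the model artifact for a training job."""
    return "{}/{}/output/model.tar.gz".format(output_path.rstrip("/"), training_job_name)


# Hypothetical job name used only for illustration.
print(model_artifact_uri("s3://fraud-detection-solution/model_output",
                         "GNN-FD-SL-DGL-Train-2021-01-01-00-00-00"))
```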