# Notebook to build and run the Docker training image

This notebook walks through the first-time training of a fraud detection model that uses graph neural networks (GNNs) in the Solution.

It assumes that the transaction data has already been exported from the graph database (for example, Neptune DB) and copied to S3 buckets, so the input data is already in S3.

We then launch a training job with a SageMaker estimator to train a graph neural network model with DGL.


## Step 1: build our own docker image

### Prerequisites

- An AWS account
- Configure AWS CLI credentials (the credentials need SageMaker and ECR permissions)
- Install Docker Engine

In [None]:
! aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

# Run the line below instead if you are using an AWS China region
#! aws ecr get-login-password --region cn-north-1 | docker login --username AWS --password-stdin 727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn
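The two login commands differ only in the region and the registry host. As a minimal sketch, the helper below (a hypothetical function, not part of the Solution) assembles the same command string for the two regions that appear in this notebook. Other regions are left out on purpose, because their registry account IDs are not given here.

```python
# Registry hosts copied from the two login commands in this notebook.
# Other regions may use different account IDs, so they are not guessed here.
KNOWN_REGISTRIES = {
    "us-east-1": "763104351884.dkr.ecr.us-east-1.amazonaws.com",
    "cn-north-1": "727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn",
}


def ecr_login_command(region: str) -> str:
    """Return the shell pipeline that logs Docker in to the base-image registry."""
    try:
        registry = KNOWN_REGISTRIES[region]
    except KeyError:
        raise ValueError(f"No registry recorded for region {region!r}") from None
    return (
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}"
    )


print(ecr_login_command("cn-north-1"))
```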

In [None]:
image_name = 'fraud-detection-with-gnn-on-dgl/training'
! docker build -t $image_name ./FD_SL_DGL/gnn_fraud_detection_dgl

# Run the line below instead if you are using an AWS China region
# ! docker build --build-arg=IMAGE_REPO=727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn -t $image_name ./FD_SL_DGL/gnn_fraud_detection_dgl

## Step 2: Test this docker image

### Prerequisites

- Complete the steps in notebook [01.FD_SL_Process_IEEE-CIS_Dataset](./01.FD_SL_Process_IEEE-CIS_Dataset.ipynb)
- Install **[docker-compose](https://docs.docker.com/compose/install/)**

**IMPORTANT**: Restore the variables stored by the previous notebook.

In [None]:
%store -r
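If the previous notebook was not run to completion, later cells fail with a `NameError`. A minimal sketch of an early check follows; the helper name is hypothetical, and the list of required names is taken from the variables this notebook reads later (`default_bucket`, `processed_data`, `current_region`).

```python
def missing_variables(required, namespace):
    """Return the names in `required` that are absent from `namespace`."""
    return [name for name in required if name not in namespace]


# Names this notebook reads in later cells.
required = ["default_bucket", "processed_data", "current_region"]

# Hypothetical namespace for illustration; in a live notebook pass globals().
example_namespace = {"default_bucket": "example-bucket", "current_region": "us-east-1"}

print(missing_variables(required, example_namespace))  # ['processed_data']
```

In the notebook itself, calling `missing_variables(required, globals())` right after `%store -r` should return an empty list.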

In [None]:
from sagemaker import get_execution_role
import boto3

def resolve_sm_role():
    """Return the ARN of the first IAM role whose name has the SageMaker execution-role prefix."""
    region = boto3.session.Session().region_name
    client = boto3.client('iam', region_name=region)
    # Paginate so that accounts with many roles are searched completely.
    paginator = client.get_paginator('list_roles')
    for page in paginator.paginate(PathPrefix='/'):
        for role in page['Roles']:
            if role['RoleName'].startswith('AmazonSageMaker-ExecutionRole-'):
                print('Resolved SageMaker IAM Role to: ' + role['Arn'])
                return role['Arn']
    raise Exception('Could not resolve what should be the SageMaker role to be used')

try:
    # Works when the notebook runs with a SageMaker execution role.
    role = get_execution_role()
except ValueError:
    # Otherwise, search IAM for a role created by SageMaker.
    role = resolve_sm_role()

print(role)
sagemaker_exec_role = role

**NOTE**: If you encounter an error when running the step above, refer to [this doc to create a SageMaker execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).
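The fallback above selects a role purely by its name prefix. The sketch below isolates that matching rule as a function over plain dictionaries so it can be tried without AWS access. The sample entries are hypothetical and only mimic the `RoleName` and `Arn` fields of an IAM role listing.

```python
SAGEMAKER_ROLE_PREFIX = "AmazonSageMaker-ExecutionRole-"


def pick_sagemaker_role_arn(roles):
    """Return the ARN of the first role whose name starts with the SageMaker prefix."""
    for role in roles:
        if role["RoleName"].startswith(SAGEMAKER_ROLE_PREFIX):
            return role["Arn"]
    raise LookupError("No role named 'AmazonSageMaker-ExecutionRole-*' was found")


# Hypothetical entries shaped like the items in an IAM role listing.
sample_roles = [
    {"RoleName": "SomeOtherRole",
     "Arn": "arn:aws:iam::111122223333:role/SomeOtherRole"},
    {"RoleName": "AmazonSageMaker-ExecutionRole-20200101T000000",
     "Arn": "arn:aws:iam::111122223333:role/service-role/"
            "AmazonSageMaker-ExecutionRole-20200101T000000"},
]

print(pick_sagemaker_role_arn(sample_roles))
```

Because the first match wins, an account with several such roles may resolve to one that lacks the permissions you need; in that case, set `sagemaker_exec_role` explicitly.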

### Loading Pre-processed data from S3

The dataset used in this Solution is the [IEEE-CIS Fraud Detection dataset](https://www.kaggle.com/c/ieee-fraud-detection/data?select=train_transaction.csv), a typical example of the financial transaction datasets that many companies hold. The dataset consists of two tables:

* **Transactions**: Records transactions and metadata about transactions between two users. Examples of columns include the product code for the transaction and features on the card used for the transaction. 
* **Identity**: Contains information about the identity of the users performing transactions. Examples of columns here include the device type and device IDs used.

This notebook assumes that the two data tables have already been pre-processed, mimicking the first-time data preparation.

**The current version uses the pre-processed data in a nearly raw format, which includes all relation files, a feature file, a tag file, and a test index file.**

In [None]:
model_output_folder = 'model_output'

output_path = f's3://{default_bucket}/{model_output_folder}'

print(processed_data)
print(output_path)

from os import path
from sagemaker.s3 import S3Downloader
import sagemaker

sagemakerSession = sagemaker.session.Session(boto3.session.Session(region_name=current_region))
processed_files = S3Downloader.list(processed_data, sagemaker_session=sagemakerSession)
print("===== Processed Files =====")
print('\n'.join(processed_files))
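To confirm that the listing contains every kind of file described above, the file names can be grouped by role. This is a sketch under assumptions: the names `features.csv` and `tags.csv` and the `relation` substring come from the hyperparameter cell later in this notebook, while the example URIs are hypothetical.

```python
from collections import defaultdict
from posixpath import basename


def group_processed_files(s3_uris):
    """Bucket S3 object URIs by the role their file name suggests."""
    groups = defaultdict(list)
    for uri in s3_uris:
        name = basename(uri)
        if "relation" in name:
            groups["relations"].append(name)
        elif name == "features.csv":
            groups["features"].append(name)
        elif name == "tags.csv":
            groups["tags"].append(name)
        else:
            groups["other"].append(name)  # for example, the test index file
    return dict(groups)


# Hypothetical listing for illustration only.
example_listing = [
    "s3://example-bucket/processed/features.csv",
    "s3://example-bucket/processed/tags.csv",
    "s3://example-bucket/processed/relation_a_edgelist.csv",
    "s3://example-bucket/processed/relation_b_edgelist.csv",
]
print(group_processed_files(example_listing))
```

In the notebook, `group_processed_files(processed_files)` gives a quick view of whether any group is missing before training starts.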

## Train Graph Neural Network with DGL

Graph neural networks work by learning representations for the nodes or edges of a graph that are well suited to some downstream task. We can model the fraud detection problem as a node classification task. The goal of the graph neural network is then to learn how to use information from the topology of the sub-graph around each transaction node to transform the node's features into a representation space where the node can be easily classified as fraud or not.

Specifically, we will be using a relational graph convolutional neural network model (R-GCN) on a heterogeneous graph since we have nodes and edges of different types.
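For reference, the R-GCN layer as it is usually formulated updates the hidden state of node $i$ by combining a self-connection with neighbor messages that are transformed per relation type. The formula below is the standard textbook form; the model shipped with the Solution may differ in details such as normalization or regularization.

```latex
h_i^{(l+1)} = \sigma\!\left( W_0^{(l)} h_i^{(l)}
  + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^{r}}
    \frac{1}{c_{i,r}} \, W_r^{(l)} h_j^{(l)} \right)
```

Here $\mathcal{R}$ is the set of relation (edge) types, $\mathcal{N}_i^{r}$ is the set of neighbors of node $i$ under relation $r$, $W_r^{(l)}$ is the weight matrix for relation $r$ at layer $l$, $W_0^{(l)}$ is the self-connection weight, $c_{i,r}$ is a normalization constant such as $|\mathcal{N}_i^{r}|$, and $\sigma$ is a nonlinearity.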

### Hyperparameters

To train the graph neural network, we need to define a few hyperparameters that determine properties such as the class of graph neural network models we will be using, the network architecture and the optimizer and optimization parameters. 

Here we set only a few of the hyperparameters. To see all the hyperparameters and their default values, see `FD_SL_DGL/gnn_fraud_detection_dgl/estimator_fns.py`. The parameters set below are:

* **`nodes`** is the name of the file that contains the `node_id`s of the target nodes and the node features.
* **`edges`** is a regular expression that, when expanded, lists all the filenames for the edge lists.
* **`labels`** is the name of the file that contains the target `node_id`s and their labels.
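How the training script expands `edges` is not shown in this notebook. The sketch below assumes the value is either a comma-separated list of file names or a regular expression matched against the start of each file name; both the function and the example file names are hypothetical.

```python
import re


def expand_edges(expression, filenames):
    """Return the file names selected by an `edges` hyperparameter value."""
    if "," in expression:
        # Assumption: an explicit comma-separated list is taken literally.
        return expression.split(",")
    pattern = re.compile(expression)
    # re.match anchors only at the start of the name, so a prefix is enough.
    return sorted(name for name in filenames if pattern.match(name))


# Hypothetical file names for illustration.
files = ["features.csv", "tags.csv", "relation_a_edgelist.csv", "relation_b_edgelist.csv"]
print(expand_edges("relation*", files))
# ['relation_a_edgelist.csv', 'relation_b_edgelist.csv']
```

Read as a regular expression, `relation*` means "relatio" followed by zero or more "n" characters, not a shell-style wildcard. It still selects every file whose name starts with `relation` because the match only needs to succeed at the beginning of the name.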

The following hyperparameters can be tuned and adjusted to improve model performance:
* **embedding-size** is the size of the embedding dimension for non-target nodes
* **n-layers** is the number of GNN layers in the model
* **n-epochs** is the number of training epochs for the model training job
* **optimizer** is the optimization algorithm used for gradient based parameter updates
* **lr** is the learning rate for parameter updates
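SageMaker training containers commonly hand hyperparameters to the training script as `--name value` command-line arguments. Assuming that is the case here, a parser for the names listed above could look like the sketch below. The defaults are placeholders that mirror the values this notebook sets; the real defaults live in `estimator_fns.py` and are not reproduced here.

```python
import argparse


def build_parser():
    """Parser for the hyperparameters described above (placeholder defaults)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--nodes", type=str, default="features.csv")
    parser.add_argument("--edges", type=str, default="relation*")
    parser.add_argument("--labels", type=str, default="tags.csv")
    parser.add_argument("--embedding-size", type=int, default=64)
    parser.add_argument("--n-layers", type=int, default=2)
    parser.add_argument("--n-epochs", type=int, default=100)
    parser.add_argument("--optimizer", type=str, default="adam")
    parser.add_argument("--lr", type=float, default=4e-3)
    return parser


# argparse converts dashes in option names to underscores in attribute names.
args = build_parser().parse_args(["--n-epochs", "10", "--lr", "0.01"])
print(args.n_epochs, args.lr, args.embedding_size)  # 10 0.01 64
```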

In [None]:
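# Explicit comma-separated list of the relation (edge list) file names.
# It is kept for reference; `params` below uses the 'relation*' pattern instead.
edges = ",".join(map(lambda x: x.split("/")[-1], [file for file in processed_files if "relation" in file]))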
params = {'nodes' : 'features.csv',
          'edges': 'relation*',
          'labels': 'tags.csv',
          'embedding-size': 64,
          'n-layers': 2,
          'n-epochs': 100,
          'optimizer': 'adam',
          'lr': 4e-3
        }

### Create and Fit SageMaker Pytorch Estimator

With the hyperparameters defined, we can kick off the training job. We use the Deep Graph Library (DGL), with PyTorch as the backend deep learning framework, to define and train the graph neural network. Amazon SageMaker makes this easy through its estimators. Here, we create a SageMaker `Estimator` that runs the Docker image built in Step 1 in local mode, and pass in the hyperparameters as well as the number and type of training instances.

Then `fit` the estimator on the training data location in S3.

In [None]:
from sagemaker.estimator import Estimator
from time import strftime, gmtime
from sagemaker.local import LocalSession

localSageMakerSession = LocalSession(boto_session=boto3.session.Session(region_name=current_region))
estimator = Estimator(image_uri=image_name,
                      role=sagemaker_exec_role,
                      instance_count=1,
                      instance_type='local',
                      hyperparameters=params,
                      output_path=output_path,
                      sagemaker_session=localSageMakerSession)

training_job_name = "{}-{}".format('GNN-FD-SL-DGL-Train', strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
print(training_job_name)

estimator.fit({'train': processed_data}, job_name=training_job_name)

Once training completes, the training container is shut down automatically, and SageMaker stores the trained model and evaluation results in S3 under `output_path`.
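The job name and the model location follow a simple pattern. The sketch below rebuilds both with two hypothetical helpers. The `<output_path>/<job name>/model.tar.gz` layout is the one the repackaging cell in this notebook relies on for local mode; jobs run on managed instances typically add an `output/` segment, so verify the path before reusing this elsewhere.

```python
from time import gmtime, strftime


def make_job_name(prefix, when=None):
    """Append a UTC timestamp so that each job name is unique."""
    when = gmtime() if when is None else when
    return "{}-{}".format(prefix, strftime("%Y-%m-%d-%H-%M-%S", when))


def model_artifact_uri(output_path, job_name):
    """S3 URI where this notebook expects the trained model archive."""
    return f"{output_path.rstrip('/')}/{job_name}/model.tar.gz"


# gmtime(0) is the Unix epoch, used here to get a reproducible name.
job = make_job_name("GNN-FD-SL-DGL-Train", gmtime(0))
print(job)  # GNN-FD-SL-DGL-Train-1970-01-01-00-00-00
print(model_artifact_uri("s3://example-bucket/model_output", job))
```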

### Repackage the model with custom code

We use custom code as the entry point of the model. To do so, we download the trained model and repackage it into the structure that our custom entry program expects.

In [None]:
model_path = f'{output_path}/{training_job_name}/model.tar.gz'
repackged_model_path = f'{output_path}/{training_job_name}/repackage-model.tar.gz'

import tempfile

temp_dir = tempfile.mkdtemp()

code_path = f'{output_path}/{training_job_name}/code/'

! export AWS_DEFAULT_REGION=$current_region && aws s3 sync ./FD_SL_DGL/code $code_path
! export AWS_DEFAULT_REGION=$current_region && ../lambda.d/repackage-model/repackage.sh $model_path $repackged_model_path $code_path $temp_dir
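The contents of `repackage.sh` are not shown in this notebook. As a rough sketch of what repackaging usually involves, the hypothetical function below unpacks a model archive, adds a directory of custom code under `code/`, and writes a new archive. It works on local paths only; the S3 download and upload steps, and the exact layout the Solution expects, are assumptions left out here.

```python
import os
import tarfile
import tempfile


def repackage_model(model_tar, code_dir, output_tar):
    """Copy the contents of `model_tar` into `output_tar` and add `code_dir` as code/."""
    with tempfile.TemporaryDirectory() as work_dir:
        with tarfile.open(model_tar, "r:gz") as src:
            src.extractall(work_dir)  # only use with archives you trust
        with tarfile.open(output_tar, "w:gz") as dst:
            # Re-add the original model files at the archive root.
            for entry in sorted(os.listdir(work_dir)):
                dst.add(os.path.join(work_dir, entry), arcname=entry)
            # Add the custom entry code under a top-level code/ directory.
            dst.add(code_dir, arcname="code")
    return output_tar
```

For example, `repackage_model("model.tar.gz", "./FD_SL_DGL/code", "repackage-model.tar.gz")` would produce an archive with the original model files at the root and the custom code under `code/`.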

In [None]:
%store repackged_model_path
%store sagemaker_exec_role