# Install DGL dependency

Use a notebook kernel that already has PyTorch installed, such as `conda_pytorch_p38` or `conda_pytorch_p36`.
Install the DGL dependency after selecting the kernel.
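
Before installing DGL, it can help to confirm that the selected kernel really provides PyTorch. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and not part of this notebook or of `fgnn`.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "torch" is the import name of PyTorch.
missing = missing_packages(["torch"])
if missing:
    print("Missing from this kernel:", ", ".join(missing))
else:
    print("PyTorch is available in this kernel.")
```

The same helper can be called with `["torch", "dgl"]` after the install cell has run.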

In [26]:
%pip install -q dgl

You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p38/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

from fgnn.fraud_detector import FraudRGCN

# Load train and test splits

In [2]:
df_train = pd.read_parquet('./data/train.parquet')

In [3]:
df_test = pd.read_parquet('./data/test.parquet')

# Set model parameters, overriding the defaults

We set the parameters to match those used in this [blog post](https://aws.amazon.com/blogs/machine-learning/build-a-gnn-based-real-time-fraud-detection-solution-using-amazon-sagemaker-amazon-neptune-and-the-deep-graph-library/).

In [22]:
params = {
    'embedding_size': 64,
    'n_layers': 2,
    'n_epochs': 150,
    'n_hidden': 16,
    'dropout': 0.2,
    'weight_decay': 5e-05,
    'lr': 0.01,
}

In [11]:
### print default model parameters
FraudRGCN()._default_params

{'num_gpus': 0,
 'embedding_size': 128,
 'n_layers': 2,
 'n_epochs': 50,
 'n_hidden': 16,
 'dropout': 0.2,
 'weight_decay': 5e-06,
 'lr': 0.01,
 'target_col': 'TransactionID',
 'node_cols': 'card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain',
 'label_col': 'isFraud',
 'cat_cols': 'M1,M2,M3,M4,M5,M6,M7,M8,M9,DeviceType,DeviceInfo,id_12,id_13,id_14,id_15,id_16,id_17,id_18,id_19,id_20,id_21,id_22,id_23,id_24,id_25,id_26,id_27,id_28,id_29,id_30,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38',
 'num_cols': 'TransactionAmt,dist1,dist2,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,id_10,id_11,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14,D15,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,V29,V30,V31,V32,V33,V34,V35,V36,V37,V38,V39,V40,V41,V42,V43,V44,V45,V46,V47,V48,V49,V50,V51,V52,V53,V54,V55,V56,V57,V58,V59,V60,V61,V62,V63,V64,V65,V66,V67,V68,V
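
To see which values actually differ from the defaults, the two dictionaries can be compared key by key. The sketch below copies the hyperparameter defaults from the output above into a plain dictionary, so it runs without `fgnn`; the variable names `defaults` and `changed` are illustrative only, and how `FraudRGCN` merges `params` with its defaults internally is not shown in this notebook.

```python
# Hyperparameter defaults copied from the FraudRGCN()._default_params output above.
defaults = {
    "embedding_size": 128,
    "n_layers": 2,
    "n_epochs": 50,
    "n_hidden": 16,
    "dropout": 0.2,
    "weight_decay": 5e-06,
    "lr": 0.01,
}

# Overrides used in this notebook.
params = {
    "embedding_size": 64,
    "n_layers": 2,
    "n_epochs": 150,
    "n_hidden": 16,
    "dropout": 0.2,
    "weight_decay": 5e-05,
    "lr": 0.01,
}

# Keep only the keys whose value differs, as (default, override) pairs.
changed = {
    key: (defaults[key], value)
    for key, value in params.items()
    if defaults.get(key) != value
}

for key, (old, new) in changed.items():
    print(f"{key}: {old} -> {new}")
```

Only `embedding_size`, `n_epochs`, and `weight_decay` differ; the remaining entries in `params` restate the defaults.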

# Train model in transductive mode
We train the model five times and save the trained models in the `model/` directory. To train in transductive mode, we combine the train and test transactions and pass the combined DataFrame to the `train_fg` method. We also pass a `test_mask` that marks test transactions with `True` values. In transductive mode, the test labels are masked out during training, but the test samples are still used to construct nodes and features.

Training five models takes over half an hour.
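
The mask only works if its positions line up with the rows of the combined DataFrame. The sketch below illustrates that alignment on toy data; `toy_train` and `toy_test` are hypothetical stand-ins for `df_train` and `df_test`, and the snippet does not call `fgnn` at all.

```python
import pandas as pd

# Toy stand-ins for df_train and df_test (hypothetical values, not the real data).
toy_train = pd.DataFrame({"TransactionID": [1, 2, 3], "isFraud": [0, 1, 0]})
toy_test = pd.DataFrame({"TransactionID": [4, 5], "isFraud": [1, 0]})

# ignore_index=True gives the combined frame a fresh 0..n-1 index, so position i
# in the mask refers to row i of the combined frame.
combined = pd.concat([toy_train, toy_test], ignore_index=True)
test_mask = [False] * len(toy_train) + [True] * len(toy_test)

# The mask must have exactly one entry per combined row.
assert len(test_mask) == len(combined)

# Rows flagged True are exactly the test transactions.
masked_ids = combined.loc[test_mask, "TransactionID"].tolist()
print(masked_ids)  # [4, 5]
```

Because the mask is built from the lengths of the two splits, the concatenation order must stay train first, test second.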

In [15]:
import warnings
### suppress all UserWarning messages, including CUDA-related warnings from the torch library
warnings.filterwarnings("ignore", category=UserWarning)

In [25]:
test_mask = [False]*len(df_train) + [True]*len(df_test)
for ii in range(1,6):
    fd = FraudRGCN()
    fd.train_fg(pd.concat([df_train, df_test], ignore_index=True), params=params, test_mask=test_mask)
    fd.save_fg(f"model/transductive_{ii}")

Constructed heterograph with the following metagraph structure: Node types ['P_emaildomain', 'ProductCD', 'R_emaildomain', 'addr1', 'addr2', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'target'], Edge types[('P_emaildomain', 'P_emaildomain<>target', 'target'), ('ProductCD', 'ProductCD<>target', 'target'), ('R_emaildomain', 'R_emaildomain<>target', 'target'), ('addr1', 'addr1<>target', 'target'), ('addr2', 'addr2<>target', 'target'), ('card1', 'card1<>target', 'target'), ('card2', 'card2<>target', 'target'), ('card3', 'card3<>target', 'target'), ('card4', 'card4<>target', 'target'), ('card5', 'card5<>target', 'target'), ('card6', 'card6<>target', 'target'), ('target', 'self_relation', 'target'), ('target', 'target<>P_emaildomain', 'P_emaildomain'), ('target', 'target<>ProductCD', 'ProductCD'), ('target', 'target<>R_emaildomain', 'R_emaildomain'), ('target', 'target<>addr1', 'addr1'), ('target', 'target<>addr2', 'addr2'), ('target', 'target<>card1', 'card1'), ('target', 'target

# Train model in inductive mode

We train the model five times and save the trained models in the `model/` directory. Only the training transactions are passed to the model for training.

Training five models takes over half an hour.
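
Because the inductive graph is built from training transactions only, entity values that first appear in the test split have no node in that graph. The sketch below counts such values for one entity column on toy data; `toy_train` and `toy_test` are hypothetical, and how `FraudRGCN` treats unseen entities at prediction time is not covered here.

```python
import pandas as pd

# Toy stand-ins for df_train and df_test (hypothetical values, not the real data).
toy_train = pd.DataFrame({"TransactionID": [1, 2, 3], "card1": ["a", "b", "a"]})
toy_test = pd.DataFrame({"TransactionID": [4, 5], "card1": ["b", "c"]})

# Entity values that would become nodes when only the training split is used.
seen_in_train = set(toy_train["card1"])

# Entity values that occur in the test split but never in the training split.
unseen_in_test = sorted(set(toy_test["card1"]) - seen_in_train)
print(unseen_in_test)  # ['c']
```

Running the same comparison on each column listed in `node_cols` gives a rough picture of how much of the test split connects to entities the inductive models never saw.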

In [24]:
for ii in range(1,6):
    fd = FraudRGCN()
    fd.train_fg(df_train, params=params)
    fd.save_fg(f"model/inductive_{ii}")

Constructed heterograph with the following metagraph structure: Node types ['P_emaildomain', 'ProductCD', 'R_emaildomain', 'addr1', 'addr2', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'target'], Edge types[('P_emaildomain', 'P_emaildomain<>target', 'target'), ('ProductCD', 'ProductCD<>target', 'target'), ('R_emaildomain', 'R_emaildomain<>target', 'target'), ('addr1', 'addr1<>target', 'target'), ('addr2', 'addr2<>target', 'target'), ('card1', 'card1<>target', 'target'), ('card2', 'card2<>target', 'target'), ('card3', 'card3<>target', 'target'), ('card4', 'card4<>target', 'target'), ('card5', 'card5<>target', 'target'), ('card6', 'card6<>target', 'target'), ('target', 'self_relation', 'target'), ('target', 'target<>P_emaildomain', 'P_emaildomain'), ('target', 'target<>ProductCD', 'ProductCD'), ('target', 'target<>R_emaildomain', 'R_emaildomain'), ('target', 'target<>addr1', 'addr1'), ('target', 'target<>addr2', 'addr2'), ('target', 'target<>card1', 'card1'), ('target', 'target