# Notebook
### This notenook includes GNN model training and dataset implementation with PyG library. In this example, we used HI-Small_Trans.csv as our dataset for training and testing.  

For those who are interested in running this notebook – follow this workflow

1. I used Google Colab with T4 GPU support (CUDA version 12.5.82).
2. Put your data (`HI-Small_Trans.csv`) in `./data/raw/` folder, or else it's not going to work.
3. Run all preliminary installation code to ensure code is working properly.

## Preliminary

In [None]:
# !pip uninstall -y torch-geometric torch-sparse torch-scatter torch-cluster pyg-lib

Found existing installation: torch-geometric 2.6.1
Uninstalling torch-geometric-2.6.1:
  Successfully uninstalled torch-geometric-2.6.1
Found existing installation: torch_sparse 0.6.18+pt25cu124
Uninstalling torch_sparse-0.6.18+pt25cu124:
  Successfully uninstalled torch_sparse-0.6.18+pt25cu124
Found existing installation: torch_scatter 2.1.2+pt25cu124
Uninstalling torch_scatter-2.1.2+pt25cu124:
  Successfully uninstalled torch_scatter-2.1.2+pt25cu124
Found existing installation: torch_cluster 1.6.3+pt25cu124
Uninstalling torch_cluster-1.6.3+pt25cu124:
  Successfully uninstalled torch_cluster-1.6.3+pt25cu124
Found existing installation: pyg-lib 0.4.0+pt25cu124
Uninstalling pyg-lib-0.4.0+pt25cu124:
  Successfully uninstalled pyg-lib-0.4.0+pt25cu124


In [1]:
!pip show torch

Name: torch
Version: 2.5.1+cu124
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /usr/local/lib/python3.11/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, fastai, peft, sentence-transformers, timm, torchaudio, torchvision


In [None]:
!cat /etc/os-release
print('-----')
!nvcc --version

PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
-----
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [1]:
!pip install torch-geometric \
  torch-sparse \
  torch-scatter \
  torch-cluster \
  pyg-lib \
  -f https://data.pyg.org/whl/torch-2.5.1+cu124.html

Looking in links: https://data.pyg.org/whl/torch-2.5.1+cu124.html


In [18]:
import datetime
import os
from typing import Callable, Optional
import pandas as pd
from sklearn import preprocessing
import numpy as np
import torch

from torch_geometric.data import (
    Data,
    InMemoryDataset
)

pd.set_option('display.max_columns', None)
path = 'data/raw/HI-Small_Trans.csv'
df = pd.read_csv(path)

## Data Analysis and Visualization
Let's look into the dataset

In [None]:
df.head()

Unnamed: 0,Timestamp,From Bank,Account,To Bank,Account.1,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
0,2022/09/01 00:20,10,8000EBD30,10,8000EBD30,3697.34,US Dollar,3697.34,US Dollar,Reinvestment,0
1,2022/09/01 00:20,3208,8000F4580,1,8000F5340,0.01,US Dollar,0.01,US Dollar,Cheque,0
2,2022/09/01 00:00,3209,8000F4670,3209,8000F4670,14675.57,US Dollar,14675.57,US Dollar,Reinvestment,0
3,2022/09/01 00:02,12,8000F5030,12,8000F5030,2806.97,US Dollar,2806.97,US Dollar,Reinvestment,0
4,2022/09/01 00:06,10,8000F5200,10,8000F5200,36682.97,US Dollar,36682.97,US Dollar,Reinvestment,0


After the viewing the dataframe, we suggest that we can extract all accounts from receiver and payer among all transcation for sorting the suspicious accounts. We can transform the whole dataset into node classification problem by considering accounts as nodes while transcation as edges.

The object columns should be encoded into classes with sklearn LabelEncoder.

In [None]:
print(df.dtypes)

Timestamp              object
From Bank               int64
Account                object
To Bank                 int64
Account.1              object
Amount Received       float64
Receiving Currency     object
Amount Paid           float64
Payment Currency       object
Payment Format         object
Is Laundering           int64
dtype: object


Check if there are any null values

In [None]:
print(df.isnull().sum())

Timestamp             0
From Bank             0
Account               0
To Bank               0
Account.1             0
Amount Received       0
Receiving Currency    0
Amount Paid           0
Payment Currency      0
Payment Format        0
Is Laundering         0
dtype: int64


There are two columns representing paid and received amount of each transcation, wondering if it is necessary to split the amount into two columns when they shared the same value, unless there are transcation fee/transcation between different currency. Let's find out

In [None]:
print('Amount Received equals to Amount Paid:')
print(df['Amount Received'].equals(df['Amount Paid']))
print('Receiving Currency equals to Payment Currency:')
print(df['Receiving Currency'].equals(df['Payment Currency']))

Amount Received equals to Amount Paid:
False
Receiving Currency equals to Payment Currency:
False


It seens involved the transcations between different currency, let's print it out

In [None]:
not_equal1 = df.loc[~(df['Amount Received'] == df['Amount Paid'])]
not_equal2 = df.loc[~(df['Receiving Currency'] == df['Payment Currency'])]
print(not_equal1.head())
print('---------------------------------------------------------------------------')
print(not_equal2.head())

              Timestamp  From Bank    Account  To Bank  Account.1  \
1173   2022/09/01 00:22       1362  80030A870     1362  80030A870   
7156   2022/09/01 00:28      11318  800C51010    11318  800C51010   
7925   2022/09/01 00:12        795  800D98770      795  800D98770   
8467   2022/09/01 00:01       1047  800E92CF0     1047  800E92CF0   
11529  2022/09/01 00:22      11157  80135FFC0    11157  80135FFC0   

       Amount Received Receiving Currency  Amount Paid Payment Currency  \
1173             52.11               Euro        61.06        US Dollar   
7156             76.06               Euro        89.12        US Dollar   
7925             17.69  Australian Dollar        12.52        US Dollar   
8467             19.43               Euro        22.77        US Dollar   
11529            98.34               Euro       115.24        US Dollar   

      Payment Format  Is Laundering  
1173             ACH              0  
7156             ACH              0  
7925             ACH

The size of two df shows that there are transcation fee and transcation between different currency, we cannot combine/drop the amount columns.

As we are going to encode the columns, we have to make sure that the classes of same attribute are aligned.
Let's check if the list of Receiving Currency and Payment Currency are the same

In [None]:
print(sorted(df['Receiving Currency'].unique()))
print(sorted(df['Payment Currency'].unique()))

['Australian Dollar', 'Bitcoin', 'Brazil Real', 'Canadian Dollar', 'Euro', 'Mexican Peso', 'Ruble', 'Rupee', 'Saudi Riyal', 'Shekel', 'Swiss Franc', 'UK Pound', 'US Dollar', 'Yen', 'Yuan']
['Australian Dollar', 'Bitcoin', 'Brazil Real', 'Canadian Dollar', 'Euro', 'Mexican Peso', 'Ruble', 'Rupee', 'Saudi Riyal', 'Shekel', 'Swiss Franc', 'UK Pound', 'US Dollar', 'Yen', 'Yuan']


## Data Preprocessing
### We will show the functions used in the PyG dataset first, dataset and model training will be provided in bottom section

In the data preprocessing, we perform below transformation:  
1. Transform the Timestamp with min max normalization.  
2. Create unique ID for each account by adding bank code with account number.  
3. Create receiving_df with the information of receiving accounts, received amount and currency
4. Create paying_df with the information of payer accounts, paid amount and currency
5. Create a list of currency used among all transactions
6. Label the 'Payment Format', 'Payment Currency', 'Receiving Currency' by classes with sklearn LabelEncoder


In [19]:
def df_label_encoder(df, columns):
    le = preprocessing.LabelEncoder()
    for col in columns:
        df[col] = le.fit_transform(df[col].astype(str))
    return df

def preprocess(df):
    df = df_label_encoder(df,['Payment Format', 'Payment Currency', 'Receiving Currency'])
    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
    df['Timestamp'] = (df['Timestamp'] - df['Timestamp'].min()) / (df['Timestamp'].max() - df['Timestamp'].min())

    df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
    df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']

    df = df.sort_values(by=['Account'])
    receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
    paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
    receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)
    currency_ls = sorted(df['Receiving Currency'].unique())

    return df, receiving_df, paying_df, currency_ls

Let's have a look of processed df

In [20]:
df, receiving_df, paying_df, currency_ls = preprocess(df = df)
print(df.head())

        Timestamp  From Bank          Account  To Bank         Account.1  \
213094   0.002274      10057  10057_803A115E0    10057   10057_803A115E0   
673518   0.076352      10057  10057_803AA8E90    10099   10099_804672160   
673517   0.077908      10057  10057_803AA8E90    10099   10099_804672160   
218076   0.000000      10057  10057_803AAB430   210972  210972_8045E8310   
606523   0.062470      10057  10057_803AAB430    10057   10057_803AAB430   

        Amount Received  Receiving Currency  Amount Paid  Payment Currency  \
213094        146954.27                  13    146954.27                13   
673518        667157.92                  13    667157.92                13   
673517         93486.14                  13     93486.14                13   
218076       9194870.98                  13   9194870.98                13   
606523        131419.05                  13    131419.05                13   

        Payment Format  Is Laundering  
213094               5            

paying df and receiving df:

In [21]:
print(receiving_df.head())
print(paying_df.head())

                 Account  Amount Received  Receiving Currency
213094   10057_803A115E0        146954.27                  13
673518   10099_804672160        667157.92                  13
673517   10099_804672160         93486.14                  13
218076  210972_8045E8310       9194870.98                  13
606523   10057_803AAB430        131419.05                  13
                Account  Amount Paid  Payment Currency
213094  10057_803A115E0    146954.27                13
673518  10057_803AA8E90    667157.92                13
673517  10057_803AA8E90     93486.14                13
218076  10057_803AAB430   9194870.98                13
606523  10057_803AAB430    131419.05                13


currency_ls:

In [None]:
print(currency_ls)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]


We would like to extract all unique accounts from payer and receiver as node of our graph. It includes the unique account ID, Bank code and the label of 'Is Laundering'.  
In this section, we consider both payer and receiver involved in a illicit transaction as suspicious accounts, we will label both accounts with 'Is Laundering' == 1.

In [22]:
def get_all_account(df):
    ldf = df[['Account', 'From Bank']]
    rdf = df[['Account.1', 'To Bank']]

    # All unique accounts involved in illicit transactions
    suspicious = df[df['Is Laundering']==1]
    s1 = suspicious[['Account', 'Is Laundering']]
    s2 = suspicious[['Account.1', 'Is Laundering']]
    s2 = s2.rename({'Account.1': 'Account'}, axis=1)

    suspicious = pd.concat([s1, s2], join='outer')
    suspicious = suspicious.drop_duplicates()

    ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
    rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
    df = pd.concat([ldf, rdf], join='outer')
    df = df.drop_duplicates()

    df['Is Laundering'] = 0
    df.set_index('Account', inplace=True)
    df.update(suspicious.set_index('Account'))
    df = df.reset_index()
    return df

Take a look of the account list:

In [23]:
accounts = get_all_account(df)
accounts[accounts["Is Laundering"] == 1]

Unnamed: 0,Account,Bank,Is Laundering
82,10057_803DE1580,10057,1
1216,1024_8009FD760,1024,1
2636,1047_8031B5390,1047,1
3928,10656_8044072A0,10656,1
3942,10656_8044B4190,10656,1
...,...,...,...
369418,9571_8056D2690,9571,1
369474,24_803CD14B0,24,1
369501,14381_805954BD0,14381,1
369771,24779_80985A590,24779,1


## Node features
For node features, we would like to aggregate the mean of paid and received amount with different types of currency as the new features of each node.

In [24]:
def paid_currency_aggregate(currency_ls, paying_df, accounts):
    for i in currency_ls:
        temp = paying_df[paying_df['Payment Currency'] == i]
        accounts['avg paid '+ str(i)] = temp['Amount Paid'].groupby(temp['Account']).transform('mean')
    return accounts

def received_currency_aggregate(currency_ls, receiving_df, accounts):
    for i in currency_ls:
        temp = receiving_df[receiving_df['Receiving Currency'] == i]
        accounts['avg received '+ str(i)] = temp['Amount Received'].groupby(temp['Account']).transform('mean')
    accounts = accounts.fillna(0)
    return accounts

Now we can define the node attributes by the bank code and the mean of paid and received amount with different types of currency.

In [102]:
def get_node_attr(currency_ls, paying_df,receiving_df, accounts):
    node_df = paid_currency_aggregate(currency_ls, paying_df, accounts)
    node_df = received_currency_aggregate(currency_ls, receiving_df, node_df)

    node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)

    node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
    node_df = df_label_encoder(node_df, ['Bank'])
    # node_df = torch.from_numpy(node_df.values).to(torch.float)  # comment for visualization

    return node_df, node_label

Take a look of node_df:

In [104]:
node_df, node_label = get_node_attr(currency_ls, paying_df, receiving_df, accounts)
print(node_df.sample(10))

         Bank  avg paid 0  avg paid 1  avg paid 2  avg paid 3    avg paid 4  \
276493   9550         0.0         0.0         0.0         0.0      0.000000   
51118     212         0.0         0.0         0.0         0.0      0.000000   
298946  16313         0.0         0.0         0.0         0.0      0.000000   
167884    852         0.0         0.0         0.0         0.0   6116.930000   
217742   1251         0.0         0.0         0.0         0.0      0.000000   
105072    526         0.0         0.0         0.0         0.0      0.000000   
92979     525         0.0         0.0         0.0         0.0      0.000000   
156910    790         0.0         0.0         0.0         0.0  20355.236667   
36068      19         0.0         0.0         0.0         0.0      0.000000   
43734     179         0.0         0.0         0.0         0.0      0.000000   

        avg paid 5  avg paid 6  avg paid 7  avg paid 8  avg paid 9  \
276493    40627.16         0.0         0.0         0.0      

## Edge features
In terms of edge features, we would like to conside each transcation as edges.  
For edge index, we replace all account with index and stack into a list with size of [2, num of transcation]  
For edge attributes, we used 'Timestamp', 'Amount Received', 'Receiving Currency', 'Amount Paid', 'Payment Currency' and 'Payment Format'


In [105]:
def get_edge_df(accounts, df):
    accounts = accounts.reset_index(drop=True)
    accounts['ID'] = accounts.index

    mapping_dict = dict(zip(accounts['Account'], accounts['ID']))

    df['From'] = df['Account'].map(mapping_dict)
    df['To'] = df['Account.1'].map(mapping_dict)
    df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)

    edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)

    df = df.drop(['Is Laundering', 'From', 'To'], axis=1)

    edge_attr = torch.from_numpy(df.values).to(torch.float)

    edge_attr = df  # comment for visualization
    return edge_attr, edge_index

edge_attr:

In [106]:
edge_attr, edge_index = get_edge_df(accounts, df)
print(edge_attr.head())

        Timestamp  Amount Received  Receiving Currency  Amount Paid  \
213094   0.002274        146954.27                  13    146954.27   
673518   0.076352        667157.92                  13    667157.92   
673517   0.077908         93486.14                  13     93486.14   
218076   0.000000       9194870.98                  13   9194870.98   
606523   0.062470        131419.05                  13    131419.05   

        Payment Currency  Payment Format  
213094                13               5  
673518                13               4  
673517                13               3  
218076                13               2  
606523                13               5  


edge_index:

In [88]:
print(edge_index)

tensor([[     0,      1,      1,  ..., 341631, 341631, 341632],
        [     0, 341633, 341633,  ..., 341631, 341631, 341632]])


## PyG Dataset Visualization

In [101]:
data = Data(x=node_df,
            edge_index=edge_index,
            y=node_label,
            edge_attr=edge_attr
)
print(data.x)
print(data.edge_index)
print(data.y)
print(data.edge_attr)

tensor([[2.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 3.6973e+03, 0.0000e+00,
         0.0000e+00],
        [2.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 1.2824e+02, 0.0000e+00,
         0.0000e+00],
        [2.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 1.4676e+04, 0.0000e+00,
         0.0000e+00],
        ...,
        [1.2000e+02, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         1.7073e+06],
        [1.0100e+02, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         1.7073e+06],
        [8.2200e+02, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         3.2693e+04]])
tensor([[     0,      1,      1,  ..., 341631, 341631, 341632],
        [     0, 341633, 341633,  ..., 341631, 341631, 341632]])
tensor([0., 0., 0.,  ..., 0., 0., 0.])
tensor([[2.2738e-03, 1.4695e+05, 1.3000e+01, 1.4695e+05, 1.3000e+01, 5.0000e+00],
        [7.6352e-02, 6.6716e+05, 1.3000e+01, 6.6716e+05, 1.3000e+01, 4.0000e+00],
        [7.7908e-02, 9.3486e+04, 1.3000e+01, 9.3486e+04, 1.3000e+01, 3.0

In [91]:
print(data.x.shape)
print(data.edge_index.shape)
print(data.y.shape)
print(data.edge_attr.shape)

torch.Size([370958, 31])
torch.Size([2, 675651])
torch.Size([370958])
torch.Size([675651, 6])


# Combined Pipeline
### Below we will show the final code for model.py, train.py and dataset.py

## Model Architecture
In this section, we used Graph Attention Networks as our backbone model.  
The model built with two GATConv layers followed by a linear layer with sigmoid outout for classification

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.nn import GATConv, Linear, GCNConv

class GCN_GAT(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, heads=4):
        super(GCN_GAT, self).__init__()

        self.gcn1 = GCNConv(in_channels, hidden_channels)

        self.gat1 = GATConv(hidden_channels, hidden_channels, heads=heads, concat=False, dropout=0.6)

        self.fc1 = nn.Linear(hidden_channels, hidden_channels // 2)
        self.fc2 = nn.Linear(hidden_channels // 2, out_channels)

        self.dropout = nn.Dropout(0.6)

        self.sigmoid = nn.Sigmoid()

    def forward(self, x, edge_index, edge_attr):
        x = F.dropout(x, p=0.6, training=self.training)

        x = self.gcn1(x, edge_index)
        x = F.relu(x)

        x = self.gat1(x, edge_index, edge_attr)
        x = F.relu(x)

        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout(x)

        x = self.fc2(x)

        x = self.sigmoid(x)

        return x

class GAT(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, heads):
        super().__init__()
        self.conv1 = GATConv(in_channels, hidden_channels, heads, dropout=0.6)
        self.conv2 = GATConv(hidden_channels * heads, int(hidden_channels/4), heads=1, concat=False, dropout=0.6)
        self.lin = Linear(int(hidden_channels/4), out_channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, edge_index, edge_attr):
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv1(x, edge_index, edge_attr))
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv2(x, edge_index, edge_attr))
        x = self.lin(x)
        x = self.sigmoid(x)

        return x

## PyG In-Memory Dataset
Finally we can build the dataset with above functions

In [4]:
class AMLtoGraph(InMemoryDataset):

    def __init__(self, root: str, edge_window_size: int = 10,
                 transform: Optional[Callable] = None,
                 pre_transform: Optional[Callable] = None,
                 test = False):
        self.edge_window_size = edge_window_size
        super().__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self) -> str:
        if not self.test:
            return 'HI-Small_Trans.csv'
        else:
            return 'LI-Small_Trans.csv'

    @property
    def processed_file_names(self) -> str:
        return 'data.pt'

    @property
    def num_nodes(self) -> int:
        return self._data.edge_index.max().item() + 1

    def df_label_encoder(self, df, columns):
        le = preprocessing.LabelEncoder()
        for i in columns:
            df[i] = le.fit_transform(df[i].astype(str))
        return df

    def preprocess(self, df):
        # Label Encoding
        df = self.df_label_encoder(df,['Payment Format', 'Payment Currency', 'Receiving Currency'])

        # Timestamp Normalization
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
        df['Timestamp'] = (df['Timestamp']-df['Timestamp'].min())/(df['Timestamp'].max()-df['Timestamp'].min())

        # Account Modification
        df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
        df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']
        df = df.sort_values(by=['Account'])

        # Separate Paying and Receiving Accounts for External Usage
        receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
        paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
        receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)

        # Extract Unique Currencies
        currency_ls = sorted(df['Receiving Currency'].unique())

        return df, receiving_df, paying_df, currency_ls

    def get_all_account(self, df):
        # Extract Unique Accounts Involved with Illicit Transactions
        ldf = df[['Account', 'From Bank']]
        rdf = df[['Account.1', 'To Bank']]
        suspicious = df[df['Is Laundering']==1]
        s1 = suspicious[['Account', 'Is Laundering']]
        s2 = suspicious[['Account.1', 'Is Laundering']]
        s2 = s2.rename({'Account.1': 'Account'}, axis=1)
        suspicious = pd.concat([s1, s2], join='outer')
        suspicious = suspicious.drop_duplicates()

        ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
        rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
        df = pd.concat([ldf, rdf], join='outer')
        df = df.drop_duplicates()

        df['Is Laundering'] = 0
        df.set_index('Account', inplace=True)
        df.update(suspicious.set_index('Account'))
        df = df.reset_index()
        return df

    def paid_currency_aggregate(self, currency_ls, paying_df, accounts):
        for i in currency_ls:
            temp = paying_df[paying_df['Payment Currency'] == i]
            accounts['avg paid '+ str(i)] = temp['Amount Paid'].groupby(temp['Account']).transform('mean')
        return accounts

    def received_currency_aggregate(self, currency_ls, receiving_df, accounts):
        for i in currency_ls:
            temp = receiving_df[receiving_df['Receiving Currency'] == i]
            accounts['avg received '+ str(i)] = temp['Amount Received'].groupby(temp['Account']).transform('mean')
        accounts = accounts.fillna(0)
        return accounts

    def get_node_attr(self, currency_ls, paying_df,receiving_df, accounts):
        # Aggreagte Node Level Features
        node_df = self.paid_currency_aggregate(currency_ls, paying_df, accounts)
        node_df = self.received_currency_aggregate(currency_ls, receiving_df, node_df)

        node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)
        node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
        node_df = self.df_label_encoder(node_df,['Bank'])
        node_df = torch.from_numpy(node_df.values).to(torch.float)
        return node_df, node_label

    def get_edge_df(self, accounts, df):
        accounts = accounts.reset_index(drop=True)
        accounts['ID'] = accounts.index
        mapping_dict = dict(zip(accounts['Account'], accounts['ID']))
        df['From'] = df['Account'].map(mapping_dict)
        df['To'] = df['Account.1'].map(mapping_dict)
        df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)

        edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)

        df = df.drop(['Is Laundering', 'From', 'To'], axis=1)

        edge_attr = torch.from_numpy(df.values).to(torch.float)
        return edge_attr, edge_index

    def process(self):
        df = pd.read_csv(self.raw_paths[0])

        df, receiving_df, paying_df, currency_ls = self.preprocess(df)
        accounts = self.get_all_account(df)
        node_attr, node_label = self.get_node_attr(currency_ls, paying_df,receiving_df, accounts)
        edge_attr, edge_index = self.get_edge_df(accounts, df)

        data = Data(x=node_attr,
                    edge_index=edge_index,
                    y=node_label,
                    edge_attr=edge_attr
        )

        data_list = [data]
        if self.pre_filter is not None:
            data_list = [d for d in data_list if self.pre_filter(d)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(d) for d in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])

## Model Training Workflow
As we cannot create folder in kaggle, please follow the instructions in https://github.com/issacchan26/AntiMoneyLaunderingDetectionWithGNN before you start training

In [5]:
import torch
import torch_geometric.transforms as T
from torch_geometric.loader import NeighborLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dataset = AMLtoGraph('./data', test = False)
data = dataset[0]
epoch = 50

model = GAT(in_channels=data.num_features, hidden_channels=16, out_channels=1, heads=8)
model = model.to(device)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)

  self.data, self.slices = torch.load(self.processed_paths[0])


In [6]:
split = T.RandomNodeSplit(split='train_rest', num_val=0.1, num_test=0)
data = split(data)

train_loader = loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=256,
    input_nodes=data.train_mask,
)

test_loader = loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=256,
    input_nodes=data.val_mask,
)

for i in range(epoch):
    total_loss = 0
    model.train()

    for data in train_loader:
        optimizer.zero_grad()
        data.to(device)
        pred = model(data.x, data.edge_index, data.edge_attr)
        ground_truth = data.y
        loss = criterion(pred, ground_truth.unsqueeze(1))
        loss.backward()
        optimizer.step()
        total_loss += float(loss)

    if epoch % 10 == 0:
        print(f"Epoch: {i:03d}, Loss: {total_loss:.4f}\n")
        model.eval()
        acc = 0
        total = 0

        with torch.no_grad():
            for test_data in test_loader:
                test_data.to(device)
                pred = model(test_data.x, test_data.edge_index, test_data.edge_attr)
                ground_truth = test_data.y
                correct = (pred == ground_truth.unsqueeze(1)).sum().item()
                total += len(ground_truth)
                acc += correct
            acc = acc/total
            print('accuracy:', acc)

torch.save(model.state_dict(), 'model.pth')

Epoch: 000, Loss: 934.9058

accuracy: 0.5978040377370641
Epoch: 001, Loss: 632.4964

accuracy: 0.6029042436731277
Epoch: 002, Loss: 624.8950

accuracy: 0.6068001803078112
Epoch: 003, Loss: 614.3198

accuracy: 0.6436344903084551
Epoch: 004, Loss: 598.2132

accuracy: 0.6542164407379979
Epoch: 005, Loss: 574.0586

accuracy: 0.6652982371407873
Epoch: 006, Loss: 576.7824

accuracy: 0.6630658767467319
Epoch: 007, Loss: 566.0702

accuracy: 0.6550697105322472
Epoch: 008, Loss: 555.5421

accuracy: 0.6742924300479763
Epoch: 009, Loss: 550.1645

accuracy: 0.6852340781763153
Epoch: 010, Loss: 533.0678

accuracy: 0.6904934395878612
Epoch: 011, Loss: 532.0083

accuracy: 0.6965950253561941
Epoch: 012, Loss: 525.3322

accuracy: 0.6975078082235889
Epoch: 013, Loss: 522.5151

accuracy: 0.7012428760021895
Epoch: 014, Loss: 506.9450

accuracy: 0.7038217585884928
Epoch: 015, Loss: 492.7056

accuracy: 0.7211874366115556
Epoch: 016, Loss: 487.5030

accuracy: 0.7086050068421477
Epoch: 017, Loss: 477.9472

acc

# Future Work
In this notebook, we performed the node classification with GAT and the result accuracy looks satisfied.  
However, it may due to highly imbalance data of the dataset. It is suggested that balance the class of 1 and 0 in the data preprocessing. It is expected that the accuracy will dropped a little bit after balancing the data.  We will keep exploring to see if there are any other models give better performance, such as other traditional regression/classifier model.

## Reference
Some of the feature engineering of this repo are referenced to below papers, highly recommend to read:
1. [Weber, M., Domeniconi, G., Chen, J., Weidele, D. K. I., Bellei, C., Robinson, T., & Leiserson, C. E. (2019). Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics. arXiv preprint arXiv:1908.02591.](https://arxiv.org/pdf/1908.02591.pdf)
2. [Johannessen, F., & Jullum, M. (2023). Finding Money Launderers Using Heterogeneous Graph Neural Networks. arXiv preprint arXiv:2307.13499.](https://arxiv.org/pdf/2307.13499.pdf)