<a href="https://colab.research.google.com/github/fathurrahmanyahyasatrio/AntiMoneyLaundering/blob/main/Anti%20Money%20Laundering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Business Understanding**

Money laundering is a significant problem involving billions of dollars. Detecting money laundering is challenging because automated algorithms often produce high false positive rates, flagging legitimate transactions as suspicious. Conversely, false negatives, where laundering transactions go undetected, are also a major issue. Criminals make efforts to conceal their illegal activities.

Access to actual financial transaction data is highly restricted due to proprietary and privacy concerns. Even when access is possible, accurately labeling each transaction as laundering or legitimate is problematic. The synthetic transaction data provided by IBM addresses these challenges.

This data is generated in a virtual world inhabited by individuals, companies, and banks. They engage in various financial interactions, such as purchasing goods and services, placing orders, paying salaries, repaying loans, and more, mostly conducted through banks.

A fraction of individuals and companies in this model engage in criminal activities like smuggling, illegal gambling, and extortion. Criminals obtain funds from these activities and attempt to disguise the source of these illicit funds through a series of financial transactions, constituting money laundering. Therefore, this data is labeled and can be used to train and test Anti Money Laundering (AML) models and other applications.

The data generator not only models illicit activities but also tracks funds derived from illegal activities through multiple transactions, allowing for labeling of laundering transactions even if they are several steps removed from the illicit source.

This IBM generator models the entire money laundering process, including placement (introducing illicit funds), layering (mixing funds in the financial system), and integration (spending illicit funds).

Unlike real financial institutions, which only see their own transactions, these synthetic transactions create an entire financial ecosystem. This allows for the development of laundering detection models that understand transactions across institutions but can apply their findings to transactions at a specific bank.

IBM has improved this data generator since its initial release, making it more realistic and resolving bugs.

Six datasets are provided, divided into two groups based on the level of illicit activity. Each group contains small, medium, and large datasets to accommodate various modeling and computational resources. These datasets are independent, and each can be subdivided chronologically for training, validation, and testing, with a common division being 60% for training, 20% for validation, and 20% for testing.

In [1]:
pip install torch_geometric


Collecting torch_geometric
  Downloading torch_geometric-2.4.0-py3-none-any.whl (1.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m10.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.4.0


In [2]:
import torch

In [3]:
print(torch.__version__)

2.1.0+cu118


In [4]:
print(torch.version.cuda)

11.8


In [5]:
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html

Looking in links: https://data.pyg.org/whl/torch-2.1.0+cu118.html
Collecting pyg_lib
  Downloading https://data.pyg.org/whl/torch-2.1.0%2Bcu118/pyg_lib-0.3.1%2Bpt21cu118-cp310-cp310-linux_x86_64.whl (2.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m32.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting torch_scatter
  Downloading https://data.pyg.org/whl/torch-2.1.0%2Bcu118/torch_scatter-2.1.2%2Bpt21cu118-cp310-cp310-linux_x86_64.whl (10.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.2/10.2 MB[0m [31m115.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting torch_sparse
  Downloading https://data.pyg.org/whl/torch-2.1.0%2Bcu118/torch_sparse-0.6.18%2Bpt21cu118-cp310-cp310-linux_x86_64.whl (4.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.9/4.9 MB[0m [31m112.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting torch_cluster
  Downloading https://data.pyg.org/whl/torch-2.1.0%2Bcu118/torch_cluster-1.6.3%2

In [6]:
import torch; print(torch.cuda.is_available())

True


In [7]:
import datetime
import os
from typing import Callable, Optional
import pandas as pd
from sklearn import preprocessing
import numpy as np
import torch

from torch_geometric.data import (
    Data,
    InMemoryDataset
)

pd.set_option('display.max_columns', None)
path = '/content/drive/MyDrive/HI-Small_Trans.csv'
df = pd.read_csv(path)

**Data Visualization and Feature Engineering**

In [8]:
print(df.head())

          Timestamp  From Bank    Account  To Bank  Account.1  \
0  2022/09/01 00:20         10  8000EBD30       10  8000EBD30   
1  2022/09/01 00:20       3208  8000F4580        1  8000F5340   
2  2022/09/01 00:00       3209  8000F4670     3209  8000F4670   
3  2022/09/01 00:02         12  8000F5030       12  8000F5030   
4  2022/09/01 00:06         10  8000F5200       10  8000F5200   

   Amount Received Receiving Currency  Amount Paid Payment Currency  \
0          3697.34          US Dollar      3697.34        US Dollar   
1             0.01          US Dollar         0.01        US Dollar   
2         14675.57          US Dollar     14675.57        US Dollar   
3          2806.97          US Dollar      2806.97        US Dollar   
4         36682.97          US Dollar     36682.97        US Dollar   

  Payment Format  Is Laundering  
0   Reinvestment              0  
1         Cheque              0  
2   Reinvestment              0  
3   Reinvestment              0  
4   Reinvest

Upon inspecting the dataframe, we propose the idea of extracting all the accounts involved in transactions, both as receivers and payers. This will enable us to sort and identify potentially suspicious accounts. We can then convert the entire dataset into a node classification problem, where the accounts are treated as nodes and the transactions as edges.

To facilitate this, we recommend encoding the object columns into classes using the sklearn LabelEncoder.

In [9]:
print(df.dtypes)

Timestamp              object
From Bank               int64
Account                object
To Bank                 int64
Account.1              object
Amount Received       float64
Receiving Currency     object
Amount Paid           float64
Payment Currency       object
Payment Format         object
Is Laundering           int64
dtype: object


Check if theres any null values

In [10]:
print(df.isnull().sum())

Timestamp             0
From Bank             0
Account               0
To Bank               0
Account.1             0
Amount Received       0
Receiving Currency    0
Amount Paid           0
Payment Currency      0
Payment Format        0
Is Laundering         0
dtype: int64


There are 2 columns representing paid and received amount of each transaction, wondering if there's any necessary to split the amount into two columns, where they have shared the same value, unless there are transaction fee/ transaction between different currency.

In [11]:
print('Amount Received equals to Amount Paid:')
print(df['Amount Received'].equals(df['Amount Paid']))
print('Receiving Currency equals to Payment Currency:')
print(df['Receiving Currency'].equals(df['Payment Currency']))

Amount Received equals to Amount Paid:
False
Receiving Currency equals to Payment Currency:
False


As a result, there seems to be an involvement between different currency.

In [12]:
not_equal1 = df.loc[~(df['Amount Received'] == df['Amount Paid'])]
not_equal2 = df.loc[~(df['Receiving Currency'] == df['Payment Currency'])]
print(not_equal1)
print('---------------------------------------------------------------------------')
print(not_equal2)

                Timestamp  From Bank    Account  To Bank  Account.1  \
1173     2022/09/01 00:22       1362  80030A870     1362  80030A870   
7156     2022/09/01 00:28      11318  800C51010    11318  800C51010   
7925     2022/09/01 00:12        795  800D98770      795  800D98770   
8467     2022/09/01 00:01       1047  800E92CF0     1047  800E92CF0   
11529    2022/09/01 00:22      11157  80135FFC0    11157  80135FFC0   
...                   ...        ...        ...      ...        ...   
5078167  2022/09/10 23:30      23537  803949A90    23537  803949A90   
5078234  2022/09/10 23:59      16163  803638A90    16163  803638A90   
5078236  2022/09/10 23:55      16163  803638A90    16163  803638A90   
5078316  2022/09/10 23:44     215064  808F06E11   215064  808F06E10   
5078318  2022/09/10 23:45     215064  808F06E11   215064  808F06E10   

         Amount Received Receiving Currency  Amount Paid Payment Currency  \
1173           52.110000               Euro        61.06        US Dol

The dimensions of both dataframes indicate the presence of transaction fees and transactions involving different currencies. Therefore, we cannot merge or discard the amount columns.

Since we intend to encode the columns, it's crucial to ensure that the classes for the same attributes align properly. To do this, let's verify whether the lists of Receiving Currency and Payment Currency are identical.

In [13]:
print(sorted(df['Receiving Currency'].unique()))
print(sorted(df['Payment Currency'].unique()))

['Australian Dollar', 'Bitcoin', 'Brazil Real', 'Canadian Dollar', 'Euro', 'Mexican Peso', 'Ruble', 'Rupee', 'Saudi Riyal', 'Shekel', 'Swiss Franc', 'UK Pound', 'US Dollar', 'Yen', 'Yuan']
['Australian Dollar', 'Bitcoin', 'Brazil Real', 'Canadian Dollar', 'Euro', 'Mexican Peso', 'Ruble', 'Rupee', 'Saudi Riyal', 'Shekel', 'Swiss Franc', 'UK Pound', 'US Dollar', 'Yen', 'Yuan']


**Data Preprocessing**

During the data preprocessing phase, we carry out the following transformations:

1. Normalize the Timestamp using min-max normalization.
2. Generate a unique ID for each account by combining the bank code with the account number.
3. Create a receiving_df containing information about receiving accounts, received amounts, and currencies.
4. Create a paying_df containing information about payer accounts, paid amounts, and currencies.
5. Compile a list of currencies utilized in all transactions.
6. Apply label encoding using sklearn's LabelEncoder to the 'Payment Format,' 'Payment Currency,' and 'Receiving Currency' columns.

In [14]:
def df_label_encoder(df, columns):
        le = preprocessing.LabelEncoder()
        for i in columns:
            df[i] = le.fit_transform(df[i].astype(str))
        return df

def preprocess(df):
        df = df_label_encoder(df,['Payment Format', 'Payment Currency', 'Receiving Currency'])
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
        df['Timestamp'] = (df['Timestamp']-df['Timestamp'].min())/(df['Timestamp'].max()-df['Timestamp'].min())

        df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
        df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']
        df = df.sort_values(by=['Account'])
        receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
        paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
        receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)
        currency_ls = sorted(df['Receiving Currency'].unique())

        return df, receiving_df, paying_df, currency_ls

In [15]:
df, receiving_df, paying_df, currency_ls = preprocess(df = df)
print(df.head())

         Timestamp  From Bank          Account  To Bank        Account.1  \
4278714   0.456320      10057  10057_803A115E0    29467  29467_803E020C0   
2798190   0.285018      10057  10057_803A115E0    29467  29467_803E020C0   
2798191   0.284233      10057  10057_803A115E0    29467  29467_803E020C0   
3918769   0.417079      10057  10057_803A115E0    29467  29467_803E020C0   
213094    0.000746      10057  10057_803A115E0    10057  10057_803A115E0   

         Amount Received  Receiving Currency  Amount Paid  Payment Currency  \
4278714        787197.11                  13    787197.11                13   
2798190        787197.11                  13    787197.11                13   
2798191        681262.19                  13    681262.19                13   
3918769        681262.19                  13    681262.19                13   
213094         146954.27                  13    146954.27                13   

         Payment Format  Is Laundering  
4278714               3    

In [16]:
print(receiving_df.head())
print(paying_df.head())

                 Account  Amount Received  Receiving Currency
4278714  29467_803E020C0        787197.11                  13
2798190  29467_803E020C0        787197.11                  13
2798191  29467_803E020C0        681262.19                  13
3918769  29467_803E020C0        681262.19                  13
213094   10057_803A115E0        146954.27                  13
                 Account  Amount Paid  Payment Currency
4278714  10057_803A115E0    787197.11                13
2798190  10057_803A115E0    787197.11                13
2798191  10057_803A115E0    681262.19                13
3918769  10057_803A115E0    681262.19                13
213094   10057_803A115E0    146954.27                13


In [17]:
print(currency_ls)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]



Our goal is to extract all the unique accounts from both payers and receivers to serve as nodes in our graph. This information comprises the unique account ID, bank code, and the 'Is Laundering' label.

In this context, we identify both payers and receivers involved in illicit transactions as suspicious accounts, and we assign a 'Is Laundering' label value of 1 to both of these accounts.

In [18]:
def get_all_account(df):
        ldf = df[['Account', 'From Bank']]
        rdf = df[['Account.1', 'To Bank']]
        suspicious = df[df['Is Laundering']==1]
        s1 = suspicious[['Account', 'Is Laundering']]
        s2 = suspicious[['Account.1', 'Is Laundering']]
        s2 = s2.rename({'Account.1': 'Account'}, axis=1)
        suspicious = pd.concat([s1, s2], join='outer')
        suspicious = suspicious.drop_duplicates()

        ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
        rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
        df = pd.concat([ldf, rdf], join='outer')
        df = df.drop_duplicates()

        df['Is Laundering'] = 0
        df.set_index('Account', inplace=True)
        df.update(suspicious.set_index('Account'))
        df = df.reset_index()
        return df

Let's have a look on the account' list

In [19]:
accounts = get_all_account(df)
print(accounts.head())

           Account   Bank  Is Laundering
0  10057_803A115E0  10057            0.0
1  10057_803AA8E90  10057            0.0
2  10057_803AAB430  10057            0.0
3  10057_803AACE20  10057            0.0
4  10057_803AB4F70  10057            0.0


**Node Features**

For node features, we would like to aggregate the mean of paid and received amount with different types of currency as the new features of each node.

In [20]:
def paid_currency_aggregate(currency_ls, paying_df, accounts):
        for i in currency_ls:
            temp = paying_df[paying_df['Payment Currency'] == i]
            accounts['avg paid '+str(i)] = temp['Amount Paid'].groupby(temp['Account']).transform('mean')
        return accounts

def received_currency_aggregate(currency_ls, receiving_df, accounts):
    for i in currency_ls:
        temp = receiving_df[receiving_df['Receiving Currency'] == i]
        accounts['avg received '+str(i)] = temp['Amount Received'].groupby(temp['Account']).transform('mean')
    accounts = accounts.fillna(0)
    return accounts

Now, we can establish the node attributes based on the bank code and the average of the amounts that have been paid and received in various currency types.

In [21]:
def get_node_attr(currency_ls, paying_df,receiving_df, accounts):
        node_df = paid_currency_aggregate(currency_ls, paying_df, accounts)
        node_df = received_currency_aggregate(currency_ls, receiving_df, node_df)
        node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)
        node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
        node_df = df_label_encoder(node_df,['Bank'])
#         node_df = torch.from_numpy(node_df.values).to(torch.float)  # comment for visualization
        return node_df, node_label

In [22]:
node_df, node_label = get_node_attr(currency_ls, paying_df,receiving_df, accounts)
print(node_df.head())

   Bank  avg paid 0  avg paid 1  avg paid 2  avg paid 3  avg paid 4  \
0     2         0.0         0.0         0.0         0.0         0.0   
1     2         0.0         0.0         0.0         0.0         0.0   
2     2         0.0         0.0         0.0         0.0         0.0   
3     2         0.0         0.0         0.0         0.0         0.0   
4     2         0.0         0.0         0.0         0.0         0.0   

   avg paid 5  avg paid 6  avg paid 7  avg paid 8  avg paid 9  avg paid 10  \
0         0.0         0.0         0.0         0.0         0.0          0.0   
1         0.0         0.0         0.0         0.0         0.0          0.0   
2         0.0         0.0         0.0         0.0         0.0          0.0   
3         0.0         0.0         0.0         0.0         0.0          0.0   
4         0.0         0.0         0.0         0.0         0.0          0.0   

   avg paid 11   avg paid 12  avg paid 13  avg paid 14  avg received 0  \
0          0.0   1922.000000  

**Edge Features**

When it cmes to edge characteristics, we want to treat each transaction as an edge.
Regarding the edge index, we'll have to replace all the accounts with indices and compile them into a list with a size of [2, number of transactions]
For edge attributes, we will include 'Timestamp', 'Amount Received', 'Receiving Currency', 'Amount Paid', 'Payment Currency' and 'Payment Format'

In [23]:
def get_edge_df(accounts, df):
        accounts = accounts.reset_index(drop=True)
        accounts['ID'] = accounts.index
        mapping_dict = dict(zip(accounts['Account'], accounts['ID']))
        df['From'] = df['Account'].map(mapping_dict)
        df['To'] = df['Account.1'].map(mapping_dict)
        df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)

        edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)

        df = df.drop(['Is Laundering', 'From', 'To'], axis=1)

#         edge_attr = torch.from_numpy(df.values).to(torch.float)  # comment for visualization

        edge_attr = df  # for visualization
        return edge_attr, edge_index

In [24]:
edge_attr, edge_index = get_edge_df(accounts, df)
print(edge_attr.head())

         Timestamp  Amount Received  Receiving Currency  Amount Paid  \
4278714   0.456320        787197.11                  13    787197.11   
2798190   0.285018        787197.11                  13    787197.11   
2798191   0.284233        681262.19                  13    681262.19   
3918769   0.417079        681262.19                  13    681262.19   
213094    0.000746        146954.27                  13    146954.27   

         Payment Currency  Payment Format  
4278714                13               3  
2798190                13               3  
2798191                13               4  
3918769                13               4  
213094                 13               5  


In [25]:
print(edge_index)

tensor([[     0,      0,      0,  ..., 496997, 496997, 496998],
        [299458, 299458, 299458,  ..., 496997, 496997, 496998]])


**Model Architecture**

We will employ Graph Attention Networks as the foundational model for the work. This model consists of two GATConv layers, followed by a linear layer that produces a sigmoid output for the classification

In [26]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.nn import GATConv, Linear

class GAT(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, heads):
        super().__init__()
        self.conv1 = GATConv(in_channels, hidden_channels, heads, dropout=0.6)
        self.conv2 = GATConv(hidden_channels * heads, int(hidden_channels/4), heads=1, concat=False, dropout=0.6)
        self.lin = Linear(int(hidden_channels/4), out_channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, edge_index, edge_attr):
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv1(x, edge_index, edge_attr))
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv2(x, edge_index, edge_attr))
        x = self.lin(x)
        x = self.sigmoid(x)

        return x

Build dataset using the functions above

In [27]:
class AMLtoGraph(InMemoryDataset):

    def __init__(self, root: str, edge_window_size: int = 10,
                 transform: Optional[Callable] = None,
                 pre_transform: Optional[Callable] = None):
        self.edge_window_size = edge_window_size
        super().__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self) -> str:
        return 'HI-Small_Trans.csv'

    @property
    def processed_file_names(self) -> str:
        return 'data.pt'

    @property
    def num_nodes(self) -> int:
        return self._data.edge_index.max().item() + 1

    def df_label_encoder(self, df, columns):
        le = preprocessing.LabelEncoder()
        for i in columns:
            df[i] = le.fit_transform(df[i].astype(str))
        return df


    def preprocess(self, df):
        df = self.df_label_encoder(df,['Payment Format', 'Payment Currency', 'Receiving Currency'])
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
        df['Timestamp'] = (df['Timestamp']-df['Timestamp'].min())/(df['Timestamp'].max()-df['Timestamp'].min())

        df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
        df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']
        df = df.sort_values(by=['Account'])
        receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
        paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
        receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)
        currency_ls = sorted(df['Receiving Currency'].unique())

        return df, receiving_df, paying_df, currency_ls

    def get_all_account(self, df):
        ldf = df[['Account', 'From Bank']]
        rdf = df[['Account.1', 'To Bank']]
        suspicious = df[df['Is Laundering']==1]
        s1 = suspicious[['Account', 'Is Laundering']]
        s2 = suspicious[['Account.1', 'Is Laundering']]
        s2 = s2.rename({'Account.1': 'Account'}, axis=1)
        suspicious = pd.concat([s1, s2], join='outer')
        suspicious = suspicious.drop_duplicates()

        ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
        rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
        df = pd.concat([ldf, rdf], join='outer')
        df = df.drop_duplicates()

        df['Is Laundering'] = 0
        df.set_index('Account', inplace=True)
        df.update(suspicious.set_index('Account'))
        df = df.reset_index()
        return df

    def paid_currency_aggregate(self, currency_ls, paying_df, accounts):
        for i in currency_ls:
            temp = paying_df[paying_df['Payment Currency'] == i]
            accounts['avg paid '+str(i)] = temp['Amount Paid'].groupby(temp['Account']).transform('mean')
        return accounts

    def received_currency_aggregate(self, currency_ls, receiving_df, accounts):
        for i in currency_ls:
            temp = receiving_df[receiving_df['Receiving Currency'] == i]
            accounts['avg received '+str(i)] = temp['Amount Received'].groupby(temp['Account']).transform('mean')
        accounts = accounts.fillna(0)
        return accounts

    def get_edge_df(self, accounts, df):
        accounts = accounts.reset_index(drop=True)
        accounts['ID'] = accounts.index
        mapping_dict = dict(zip(accounts['Account'], accounts['ID']))
        df['From'] = df['Account'].map(mapping_dict)
        df['To'] = df['Account.1'].map(mapping_dict)
        df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)

        edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)

        df = df.drop(['Is Laundering', 'From', 'To'], axis=1)

        edge_attr = torch.from_numpy(df.values).to(torch.float)
        return edge_attr, edge_index

    def get_node_attr(self, currency_ls, paying_df,receiving_df, accounts):
        node_df = self.paid_currency_aggregate(currency_ls, paying_df, accounts)
        node_df = self.received_currency_aggregate(currency_ls, receiving_df, node_df)
        node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)
        node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
        node_df = self.df_label_encoder(node_df,['Bank'])
        node_df = torch.from_numpy(node_df.values).to(torch.float)
        return node_df, node_label

    def process(self):
        df = pd.read_csv(self.raw_paths[0])
        df, receiving_df, paying_df, currency_ls = self.preprocess(df)
        accounts = self.get_all_account(df)
        node_attr, node_label = self.get_node_attr(currency_ls, paying_df,receiving_df, accounts)
        edge_attr, edge_index = self.get_edge_df(accounts, df)

        data = Data(x=node_attr,
                    edge_index=edge_index,
                    y=node_label,
                    edge_attr=edge_attr
                    )

        data_list = [data]
        if self.pre_filter is not None:
            data_list = [d for d in data_list if self.pre_filter(d)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(d) for d in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])

**Model Training**

In [28]:
import torch
import torch_geometric.transforms as T
from torch_geometric.loader import NeighborLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dataset = AMLtoGraph('/content/drive/MyDrive/AntiMoneyLaundering')
data = dataset[0]
epoch = 100

model = GAT(in_channels=data.num_features, hidden_channels=16, out_channels=1, heads=8)
model = model.to(device)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)

split = T.RandomNodeSplit(split='train_rest', num_val=0.1, num_test=0)
data = split(data)

train_loader = loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=256,
    input_nodes=data.train_mask,
)

test_loader = loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=256,
    input_nodes=data.val_mask,
)

for i in range(epoch):
    total_loss = 0
    model.train()
    for data in train_loader:
        optimizer.zero_grad()
        data.to(device)
        pred = model(data.x, data.edge_index, data.edge_attr)
        ground_truth = data.y
        loss = criterion(pred, ground_truth.unsqueeze(1))
        loss.backward()
        optimizer.step()
        total_loss += float(loss)
    if epoch%10 == 0:
        print(f"Epoch: {i:03d}, Loss: {total_loss:.4f}")
        model.eval()
        acc = 0
        total = 0
        for test_data in test_loader:
            test_data.to(device)
            pred = model(test_data.x, test_data.edge_index, test_data.edge_attr)
            ground_truth = test_data.y
            correct = (pred == ground_truth.unsqueeze(1)).sum().item()
            total += len(ground_truth)
            acc += correct
        acc = acc/total
        print('accuracy:', acc)

Epoch: 000, Loss: 6008.9402
accuracy: 0.9234248659034505
Epoch: 001, Loss: 2416.2613
accuracy: 0.9298559080694484
Epoch: 002, Loss: 1825.3569
accuracy: 0.9303681534272902
Epoch: 003, Loss: 1826.4374
accuracy: 0.9335804076726599
Epoch: 004, Loss: 1786.2608
accuracy: 0.9401137068844714
Epoch: 005, Loss: 1735.3178
accuracy: 0.9433595900351313
Epoch: 006, Loss: 1722.6374
accuracy: 0.9462822321616186
Epoch: 007, Loss: 1706.3515
accuracy: 0.9479942126790389
Epoch: 008, Loss: 1678.3283
accuracy: 0.9502129299257873
Epoch: 009, Loss: 1653.5942
accuracy: 0.9519212473837588
Epoch: 010, Loss: 1614.1431
accuracy: 0.9590698279441857
Epoch: 011, Loss: 1605.9702
accuracy: 0.9592291327743712
Epoch: 012, Loss: 1577.1260
accuracy: 0.9592315816344407
Epoch: 013, Loss: 1581.4886
accuracy: 0.9640655028827675
Epoch: 014, Loss: 1576.8501
accuracy: 0.964171759504686
Epoch: 015, Loss: 1550.5925
accuracy: 0.9644488977955912
Epoch: 016, Loss: 1534.5545
accuracy: 0.9649216755918122
Epoch: 017, Loss: 1531.9577
accu

In conclusion, we have used Graph Neural Networks (GNN), where it's a class of methods that overcome the drawback by applying the machine learning task directly on the network data through a neural network. GNN were able to solve various tasks as well as unsupervised embedding. This is crucial for Anti Money Laundering (AML) Application where new transactions appear continuously.

Customers labeled as "regular" may involved in a suspicious activity in the form of money laundering, which haven't been controlled in the current AML system. Implementing a predictive model approach for suspicious transactions with real AML system where key decisions need to be made. In order to increase the performance of the money laundering approach along with the dataset is rich in terms of number, that cntains network data from financial transactions.

This is the first attempt to leverage Graph ne within AML which showcased promising results. Ultimately, it leads to the contribution that will aid in te continuous combat of money laundering.