# 1. Introduction
---

Money laundering is a multi-billion dollar issue. Detection of laundering is very difficult. Most automated algorithms have a high false positive rate: legitimate transactions incorrectly flagged as laundering. The converse is also a major problem -- false negatives, i.e. undetected laundering transactions. Naturally, criminals work hard to cover their tracks.

Access to real financial transaction data is highly restricted, for both proprietary and privacy reasons. Even when access is possible, it is problematic to provide a correct tag (laundering or legitimate) to each transaction, as noted above. 

In this project we use a synthetic transaction dataset from IBM that avoids these problems (Altman et al., 2023).


**To check the paper that originated this synthetic dataset, [click here!](https://arxiv.org/abs/2306.16424)**

The data provided here is based on a virtual world inhabited by individuals, companies, and banks. Individuals interact with other individuals and companies. Likewise, companies interact with other companies and with individuals. These interactions can take many forms, e.g. purchase of consumer goods and services, purchase orders for industrial supplies, payment of salaries, repayment of loans, and more. These financial transactions are generally conducted via banks, i.e. the payer and receiver both have accounts, with accounts taking multiple forms from checking to credit cards to bitcoin.

Some (small) fraction of the individuals and companies in the generator model engage in criminal behavior -- such as smuggling, illegal gambling, extortion, and more. Criminals obtain funds from these illicit activities, and then try to hide the source of these illicit funds via a series of financial transactions. Such financial transactions to hide illicit funds constitute laundering. Thus, the data available here is labelled and can be used for training and testing AML (Anti Money Laundering) models and for other purposes.

The data generator that created the data here not only models illicit activity, but also tracks funds derived from illicit activity through arbitrarily many transactions -- thus creating the ability to label laundering transactions many steps removed from their illicit source. With this foundation, it is straightforward for the generator to label individual transactions as laundering or legitimate.

Note that this IBM generator models the entire money laundering cycle:

*   **Placement**: Sources like smuggling of illicit funds.
*   **Layering**: Mixing the illicit funds into the financial system.
*   **Integration**: Spending the illicit funds.


As another capability possible only with synthetic data, note that a real bank or other institution typically has access to only a portion of the transactions involved in laundering: the transactions involving that bank. Transactions happening at other banks or between other banks are not seen. Thus, models built on real transactions from one institution can have only a limited view of the world.

By contrast, these synthetic transactions contain an entire financial ecosystem. Thus it may be possible to create laundering detection models that understand the broad sweep of transactions across institutions, but apply those models to make inferences only about transactions at a particular bank.
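As a rough illustration of that idea, the sketch below restricts a transaction table to the rows a single bank could observe. The toy values and the helper name `bank_view` are assumptions made for this example; only the column names `From Bank` and `To Bank` come from the dataset.

```python
import pandas as pd

# Toy transactions (made-up values); column names follow the dataset.
toy = pd.DataFrame({
    "From Bank": [10, 10, 20, 30],
    "To Bank":   [10, 20, 30, 30],
    "Amount Paid": [100.0, 250.0, 75.0, 40.0],
})

def bank_view(transactions, bank_id):
    """Return the rows in which `bank_id` is the sending or receiving bank."""
    mask = (transactions["From Bank"] == bank_id) | (transactions["To Bank"] == bank_id)
    return transactions[mask]

# Bank 20 only sees the transactions it takes part in.
visible = bank_view(toy, 20)
print(visible)
```

A model could be trained on the full table and then scored only on such a per-bank slice.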


## 1.1. Importing Libraries
---

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import pathlib
import zipfile


from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, roc_auc_score, roc_curve
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from collections import Counter

import warnings
warnings.filterwarnings("ignore")

## 1.2. Verify if Data is Present
---

In [14]:
pathlib.Path("data").mkdir(parents=True, exist_ok=True)
PATH = str(pathlib.Path.cwd())
file_path = pathlib.Path("data/HI-Large_Trans.csv")

if not file_path.is_file():
    with zipfile.ZipFile("./data.zip", 'r') as zf:
        zf.extractall("./data/")

# 2. Exploratory Data Analysis (EDA)
---

## 2.1. Reading the HI-Small_Trans file

In [15]:
import pandas as pd

full_df = pd.read_csv("./data/HI-Small_Trans.csv")

full_df.shape

(5078345, 11)

### 2.1.1. Sampling a Portion of the Original DataFrame
---

In [16]:
df = full_df.sample(n=500000, random_state=42)

df.shape

(500000, 11)

In [17]:
df.head()

Unnamed: 0,Timestamp,From Bank,Account,To Bank,Account.1,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
298872,2022/09/01 00:29,117,80E50C3C0,40653,80FA8F490,4981.6,Swiss Franc,4981.6,Swiss Franc,Cheque,0
746726,2022/09/01 13:28,10,8001C6CC0,22828,8010A7DF0,297.72,US Dollar,297.72,US Dollar,Cheque,0
405190,2022/09/01 02:46,29191,80CAF3CE0,29191,80CAF3CE0,32.9,Yuan,32.9,Yuan,Reinvestment,0
1388703,2022/09/02 08:02,10,804DC2C20,14381,80597A020,194634.45,Rupee,194634.45,Rupee,Cheque,0
4713645,2022/09/09 18:01,16136,80A5EC8A0,16031,80C038E30,698940.91,US Dollar,698940.91,US Dollar,ACH,0


### 2.1.2. About the Features
---

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500000 entries, 298872 to 3845689
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Timestamp           500000 non-null  object 
 1   From Bank           500000 non-null  int64  
 2   Account             500000 non-null  object 
 3   To Bank             500000 non-null  int64  
 4   Account.1           500000 non-null  object 
 5   Amount Received     500000 non-null  float64
 6   Receiving Currency  500000 non-null  object 
 7   Amount Paid         500000 non-null  float64
 8   Payment Currency    500000 non-null  object 
 9   Payment Format      500000 non-null  object 
 10  Is Laundering       500000 non-null  int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 45.8+ MB


## 2.2. Basic Statistics of the Numerical Features
---

In [19]:
df.select_dtypes(exclude='object').describe()

Unnamed: 0,From Bank,To Bank,Amount Received,Amount Paid,Is Laundering
count,500000.0,500000.0,500000.0,500000.0,500000.0
mean,45818.82191,65842.322062,8728360.0,4127212.0,0.001038
std,81937.835213,84214.998216,1554638000.0,418905400.0,0.032201
min,1.0,1.0,1e-06,1e-06,0.0
25%,119.0,4403.0,182.37,183.5675,0.0
50%,9679.0,21575.0,1418.135,1422.62,0.0
75%,28663.0,122332.0,12324.78,12286.81,0.0
max,356302.0,356266.0,626035500000.0,140212400000.0,1.0
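The mean of `Is Laundering` in the table above is 0.001038, so roughly 0.1% of the sampled rows are labelled as laundering. The sketch below uses a made-up label column with a similar imbalance to show why accuracy alone says little on such data; the toy numbers are assumptions, not values from the dataset.

```python
import pandas as pd

# Toy label column: 1 laundering row out of 1,000 (made-up, close to the ~0.1% rate above).
labels = pd.Series([0] * 999 + [1], name="Is Laundering")

# Share of each class.
class_share = labels.value_counts(normalize=True)
print(class_share)

# A model that always predicts "legitimate" is right on every 0 row...
baseline_accuracy = (labels == 0).mean()
# ...but finds none of the laundering rows.
baseline_recall = 0.0

print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.1f}")
```

This is why metrics such as precision, recall, and F1 are more informative than accuracy for this task.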


In [20]:
# def feature_values_changer(col, zero, one):
#     for i in range(col.shape[0]):
#         if col.values[i] == zero:
#             col.values[i] = 0
#         elif col.values[i] == one:
#             col.values[i] = 1
#         else:
#             col.values[i] = 2
    
#     return col

Reading HI-Large_Trans.csv in chunks of 1,000,000 rows and keeping only the rows with 'Is Laundering' == 1.

In [21]:
# dfs = []
# count = 1
# for df in pd.read_csv('./data/HI-Large_Trans.csv', chunksize=1000000):
#     df = df[df['Is Laundering'] == 1]
    
#     del df['Timestamp']
#     dfs.append(df)
    
#     if count % 10 == 0:
#         print(f"{(count / 180)*100:.2f}% complete")
#     count += 1

5.56% complete
11.11% complete
16.67% complete
22.22% complete
27.78% complete
33.33% complete
38.89% complete
44.44% complete
50.00% complete
55.56% complete
61.11% complete
66.67% complete
72.22% complete
77.78% complete
83.33% complete
88.89% complete
94.44% complete
100.00% complete


In [None]:
# df_full_1 = pd.concat(dfs)
# del dfs

# ones_count = df_full_1.shape[0]
# print("Number of rows with 'Is Laundering' == 1:", ones_count)

Reading HI-Large_Trans.csv in chunks of 15,000 rows and keeping only the rows with 'Is Laundering' == 0, until they reach a 1:1 ratio with the 'Is Laundering' == 1 rows.

In [37]:
# dfs = []
# current = 0
# for df in pd.read_csv('./data/HI-Large_Trans.csv', chunksize=15000):
#     df = df[df['Is Laundering'] == 0]
#     current += df.shape[0]
    
#     del df['Timestamp']
#     dfs.append(df)

#     if current >= ones_count:
#         break

In [39]:
# df_full_0 = pd.concat(dfs)
# df_full = pd.concat([df_full_0, df_full_1])
# del dfs

# zeros_count = df_full_0.shape[0]
# print("Number of rows with 'Is Laundering' == 0:", zeros_count)

Number of rows with 'Is Laundering' == 0: 239971
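The two chunked passes described above can be sketched on a tiny in-memory CSV. The data here is made up, and trimming the last chunk with `head(needed)` is an addition of this sketch so that the ratio comes out exactly 1:1; the commented cells instead stop after the first chunk that reaches the target, which can overshoot slightly.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the large transaction file (made-up rows).
csv_text = "Amount Paid,Is Laundering\n" + "\n".join(
    f"{i},{1 if i % 5 == 0 else 0}" for i in range(20)
)

# Pass 1: collect every laundering row, chunk by chunk.
positives = [
    chunk[chunk["Is Laundering"] == 1]
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)
]
df_pos = pd.concat(positives)

# Pass 2: collect legitimate rows until there are as many as laundering rows.
negatives, collected = [], 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    needed = len(df_pos) - collected
    if needed <= 0:
        break
    part = chunk[chunk["Is Laundering"] == 0].head(needed)
    negatives.append(part)
    collected += len(part)
df_neg = pd.concat(negatives)

balanced = pd.concat([df_neg, df_pos])
print(balanced["Is Laundering"].value_counts())
```

Note that this takes the first legitimate rows in file order rather than a random sample of them.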


In [42]:
# df_full.head()

Unnamed: 0,From Bank,Account,To Bank,Account.1,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
0,20,800104D70,20,800104D70,6794.63,US Dollar,6794.63,US Dollar,Reinvestment,0
1,3196,800107150,3196,800107150,7739.29,US Dollar,7739.29,US Dollar,Reinvestment,0
2,1208,80010E430,1208,80010E430,1880.23,US Dollar,1880.23,US Dollar,Reinvestment,0
3,1208,80010E650,20,80010E6F0,73966883.0,US Dollar,73966883.0,US Dollar,Cheque,0
4,1208,80010E650,20,80010EA30,45868454.0,US Dollar,45868454.0,US Dollar,Cheque,0


In [43]:
# df_full.info()

<class 'pandas.core.frame.DataFrame'>
Index: 465517 entries, 0 to 179701890
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   From Bank           465517 non-null  int64  
 1   Account             465517 non-null  object 
 2   To Bank             465517 non-null  int64  
 3   Account.1           465517 non-null  object 
 4   Amount Received     465517 non-null  float64
 5   Receiving Currency  465517 non-null  object 
 6   Amount Paid         465517 non-null  float64
 7   Payment Currency    465517 non-null  object 
 8   Payment Format      465517 non-null  object 
 9   Is Laundering       465517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 39.1+ MB


In [53]:
# print("Unique values for feature:")
# {feature:len(df_full[feature].unique()) for feature in df_full.columns}

Unique values for feature:


{'From Bank': 15282,
 'Account': 284350,
 'To Bank': 13678,
 'Account.1': 302143,
 'Amount Received': 352968,
 'Receiving Currency': 15,
 'Amount Paid': 353500,
 'Payment Currency': 15,
 'Payment Format': 7,
 'Is Laundering': 2}

In [55]:
# df_full.describe()

Unnamed: 0,From Bank,To Bank,Amount Received,Amount Paid,Is Laundering
count,465517.0,465517.0,465517.0,465517.0,465517.0
mean,128427.1,156214.2,32947810.0,32946660.0,0.484506
std,289953.1,337152.8,8072156000.0,8072156000.0,0.49976
min,0.0,0.0,1e-06,1e-06,0.0
25%,7770.0,12893.0,563.15,563.63,0.0
50%,28219.0,42935.0,5364.38,5368.86,0.0
75%,169755.0,201998.0,18961.05,18966.46,1.0
max,3206865.0,3104029.0,5257959000000.0,5257959000000.0,1.0


In [1]:
import pandas as pd

full_df = pd.read_csv("./data/HI-Small_Trans.csv")

full_df.shape

(5078345, 11)

In [2]:
full_df.head()

Unnamed: 0,Timestamp,From Bank,Account,To Bank,Account.1,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
0,2022/09/01 00:20,10,8000EBD30,10,8000EBD30,3697.34,US Dollar,3697.34,US Dollar,Reinvestment,0
1,2022/09/01 00:20,3208,8000F4580,1,8000F5340,0.01,US Dollar,0.01,US Dollar,Cheque,0
2,2022/09/01 00:00,3209,8000F4670,3209,8000F4670,14675.57,US Dollar,14675.57,US Dollar,Reinvestment,0
3,2022/09/01 00:02,12,8000F5030,12,8000F5030,2806.97,US Dollar,2806.97,US Dollar,Reinvestment,0
4,2022/09/01 00:06,10,8000F5200,10,8000F5200,36682.97,US Dollar,36682.97,US Dollar,Reinvestment,0


There are two columns for the paid and received amount of each transaction. If they always held the same value, there would be no need to keep both; they should differ only when there is a transaction fee or a conversion between currencies. Let's find out.

In [3]:
print('Amount Received equals to Amount Paid:')
print(full_df['Amount Received'].equals(full_df['Amount Paid']))
print('Receiving Currency equals to Payment Currency:')
print(full_df['Receiving Currency'].equals(full_df['Payment Currency']))

Amount Received equals to Amount Paid:
False
Receiving Currency equals to Payment Currency:
False


In [4]:
not_equal1 = full_df.loc[~(full_df['Amount Received'] == full_df['Amount Paid'])]
not_equal2 = full_df.loc[~(full_df['Receiving Currency'] == full_df['Payment Currency'])]
print("Transactions with different amount received and paid")
display(not_equal1.head())
print('---------------------------------------------------------------------------')
print("Transactions with different currency received and paid")
display(not_equal2.head())

Transactions with different amount received and paid


Unnamed: 0,Timestamp,From Bank,Account,To Bank,Account.1,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
1173,2022/09/01 00:22,1362,80030A870,1362,80030A870,52.11,Euro,61.06,US Dollar,ACH,0
7156,2022/09/01 00:28,11318,800C51010,11318,800C51010,76.06,Euro,89.12,US Dollar,ACH,0
7925,2022/09/01 00:12,795,800D98770,795,800D98770,17.69,Australian Dollar,12.52,US Dollar,ACH,0
8467,2022/09/01 00:01,1047,800E92CF0,1047,800E92CF0,19.43,Euro,22.77,US Dollar,ACH,0
11529,2022/09/01 00:22,11157,80135FFC0,11157,80135FFC0,98.34,Euro,115.24,US Dollar,ACH,0


---------------------------------------------------------------------------
Transactions with different currency received and paid


Unnamed: 0,Timestamp,From Bank,Account,To Bank,Account.1,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
1173,2022/09/01 00:22,1362,80030A870,1362,80030A870,52.11,Euro,61.06,US Dollar,ACH,0
7156,2022/09/01 00:28,11318,800C51010,11318,800C51010,76.06,Euro,89.12,US Dollar,ACH,0
7925,2022/09/01 00:12,795,800D98770,795,800D98770,17.69,Australian Dollar,12.52,US Dollar,ACH,0
8467,2022/09/01 00:01,1047,800E92CF0,1047,800E92CF0,19.43,Euro,22.77,US Dollar,ACH,0
11529,2022/09/01 00:22,11157,80135FFC0,11157,80135FFC0,98.34,Euro,115.24,US Dollar,ACH,0
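Both previews list the same rows, which suggests that amounts differ when currencies differ. A minimal sketch of how to check whether the two conditions always coincide, shown here on made-up rows rather than on `full_df`:

```python
import pandas as pd

# Toy rows (made-up values) using the dataset's column names.
toy = pd.DataFrame({
    "Amount Received":    [52.11, 100.0, 17.69, 5.0],
    "Receiving Currency": ["Euro", "US Dollar", "Australian Dollar", "Yen"],
    "Amount Paid":        [61.06, 100.0, 12.52, 5.0],
    "Payment Currency":   ["US Dollar", "US Dollar", "US Dollar", "Yen"],
})

amount_differs = toy["Amount Received"] != toy["Amount Paid"]
currency_differs = toy["Receiving Currency"] != toy["Payment Currency"]

# Cross-tabulate the two conditions; off-diagonal cells would be rows where
# only one of the two differs.
print(pd.crosstab(amount_differs, currency_differs,
                  rownames=["amount differs"], colnames=["currency differs"]))

# True when every amount mismatch is also a currency mismatch, and vice versa.
same_rows = bool((amount_differs == currency_differs).all())
print(same_rows)
```

Running the same two masks on `full_df` would show whether any transaction changes amount without changing currency.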


Checking if the values of `Receiving Currency` and `Payment Currency` match

In [5]:
print(sorted(full_df['Receiving Currency'].unique()))
print(sorted(full_df['Payment Currency'].unique()))

['Australian Dollar', 'Bitcoin', 'Brazil Real', 'Canadian Dollar', 'Euro', 'Mexican Peso', 'Ruble', 'Rupee', 'Saudi Riyal', 'Shekel', 'Swiss Franc', 'UK Pound', 'US Dollar', 'Yen', 'Yuan']
['Australian Dollar', 'Bitcoin', 'Brazil Real', 'Canadian Dollar', 'Euro', 'Mexican Peso', 'Ruble', 'Rupee', 'Saudi Riyal', 'Shekel', 'Swiss Franc', 'UK Pound', 'US Dollar', 'Yen', 'Yuan']


During preprocessing, we apply the following transformations:  
1. Normalize `Timestamp` with min-max scaling.
2. Create a unique ID for each account by prefixing the account number with its bank code.
3. Create `receiving_df` with the receiving accounts, received amounts, and currencies.
4. Create `paying_df` with the paying accounts, paid amounts, and currencies.
5. Create a list of the currencies used across all transactions.
6. Encode 'Payment Format', 'Payment Currency', and 'Receiving Currency' as ordinal classes with scikit-learn's `OrdinalEncoder`.

In [6]:
from sklearn.preprocessing import OrdinalEncoder

def df_ord_encoder(df, cat_columns):
    """Ordinal-encode each column in `cat_columns`; return df and the fitted encoders."""
    encoders = []
    for col in cat_columns:
        # Fit a separate encoder per column so each one keeps its own categories.
        ord_enc = OrdinalEncoder()
        df[col] = ord_enc.fit_transform(df[[col]]).ravel()
        encoders.append(ord_enc)
    return df, encoders

def preprocess(df):
    # Encode the categorical columns as ordinal codes.
    df, _ = df_ord_encoder(df, ['Payment Format', 'Payment Currency', 'Receiving Currency'])

    # Convert timestamps to integer nanoseconds, then min-max normalize to [0, 1].
    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
    df['Timestamp'] = (df['Timestamp'] - df['Timestamp'].min()) / (df['Timestamp'].max() - df['Timestamp'].min())

    # Build unique account IDs by prefixing the account number with its bank code.
    df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
    df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']
    df = df.sort_values(by=['Account'])

    # Receiver-side and payer-side views; both use 'Account' as the key column.
    receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
    paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
    receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)

    # Encoded currency codes present in the data.
    currency_ls = sorted(df['Receiving Currency'].unique())

    return df, receiving_df, paying_df, currency_ls

In [9]:
df, receiving_df, paying_df, currency_ls = preprocess(df = full_df)
display(df.head())

Unnamed: 0,Timestamp,From Bank,Account,To Bank,Account.1,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
4278714,0.45632,10057,10057_803A115E0,29467,29467_803E020C0,787197.11,13.0,787197.11,13.0,3.0,0
2798190,0.285018,10057,10057_803A115E0,29467,29467_803E020C0,787197.11,13.0,787197.11,13.0,3.0,0
2798191,0.284233,10057,10057_803A115E0,29467,29467_803E020C0,681262.19,13.0,681262.19,13.0,4.0,0
3918769,0.417079,10057,10057_803A115E0,29467,29467_803E020C0,681262.19,13.0,681262.19,13.0,4.0,0
213094,0.000746,10057,10057_803A115E0,10057,10057_803A115E0,146954.27,13.0,146954.27,13.0,5.0,0
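To see what the timestamp normalization and the account ID step do, here is a small sketch on three made-up rows. Min-max scaling maps the earliest timestamp to 0 and the latest to 1, with everything else in between. The toy values are assumptions; the column names follow the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Timestamp": ["2022/09/01 00:00", "2022/09/01 12:00", "2022/09/02 00:00"],
    "From Bank": [10, 3208, 10],
    "Account":   ["8000EBD30", "8000F4580", "8000EBD30"],
})

# Step 1: parse, convert to integer epoch values, then scale to [0, 1].
ts = pd.to_datetime(toy["Timestamp"]).astype("int64")
toy["Timestamp"] = (ts - ts.min()) / (ts.max() - ts.min())

# Step 2: prefix the account number with its bank code.
toy["Account"] = toy["From Bank"].astype(str) + "_" + toy["Account"]

print(toy)
```

The middle row falls exactly halfway through the 24-hour window, so its normalized timestamp is 0.5.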


We extract all unique accounts from payers and receivers as the nodes of our graph. Each node carries the unique account ID, the bank code, and the 'Is Laundering' label.  
In this section, we treat both the payer and the receiver of an illicit transaction as suspicious accounts, so both are labelled with 'Is Laundering' == 1.

In [10]:
def get_all_account(df):
    # Payer-side and receiver-side (account, bank) pairs.
    ldf = df[['Account', 'From Bank']]
    rdf = df[['Account.1', 'To Bank']]

    # Accounts on either side of a laundering transaction are marked as suspicious.
    suspicious = df[df['Is Laundering']==1]
    s1 = suspicious[['Account', 'Is Laundering']]
    s2 = suspicious[['Account.1', 'Is Laundering']]
    s2 = s2.rename({'Account.1': 'Account'}, axis=1)
    suspicious = pd.concat([s1, s2], join='outer')
    suspicious = suspicious.drop_duplicates()

    # Stack payers and receivers into one table of unique accounts.
    ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
    rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
    df = pd.concat([ldf, rdf], join='outer')
    df = df.drop_duplicates()

    # Default label is 0; overwrite with 1 for the suspicious accounts.
    df['Is Laundering'] = 0
    df.set_index('Account', inplace=True)
    df.update(suspicious.set_index('Account'))
    df = df.reset_index()
    return df

In [11]:
accounts = get_all_account(df)
display(accounts.head())

Unnamed: 0,Account,Bank,Is Laundering
0,10057_803A115E0,10057,0
1,10057_803AA8E90,10057,0
2,10057_803AAB430,10057,0
3,10057_803AACE20,10057,0
4,10057_803AB4F70,10057,0
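The labelling rule can be seen on a toy example with four accounts and one laundering transaction. This is a simplified sketch that uses a set lookup instead of the `update` call in `get_all_account`; the data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Account":       ["A", "B", "C", "A"],
    "Account.1":     ["B", "C", "D", "D"],
    "Is Laundering": [0, 1, 0, 0],
})

# Accounts on either side of a laundering transaction.
flagged_rows = toy[toy["Is Laundering"] == 1]
suspicious = set(flagged_rows["Account"]) | set(flagged_rows["Account.1"])

# One row per unique account, labelled 1 if it appears in `suspicious`.
all_accounts = pd.unique(pd.concat([toy["Account"], toy["Account.1"]]))
nodes = pd.DataFrame({"Account": all_accounts})
nodes["Is Laundering"] = nodes["Account"].isin(suspicious).astype(int)

print(nodes)
```

Only the transaction from B to C is flagged, so B and C are labelled 1 even though each also takes part in legitimate transactions.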


# Node features
For node features, we aggregate the mean paid and received amounts per currency and use them as the features of each node.

In [13]:
def paid_currency_aggregate(currency_ls, paying_df, accounts):
        for i in currency_ls:
            temp = paying_df[paying_df['Payment Currency'] == i]
            accounts['avg paid '+str(i)] = temp['Amount Paid'].groupby(temp['Account']).transform('mean')
        return accounts

def received_currency_aggregate(currency_ls, receiving_df, accounts):
    for i in currency_ls:
        temp = receiving_df[receiving_df['Receiving Currency'] == i]
        accounts['avg received '+str(i)] = temp['Amount Received'].groupby(temp['Account']).transform('mean')
    accounts = accounts.fillna(0)
    return accounts
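One point to keep in mind with this approach: `transform('mean')` returns a Series indexed like the transaction rows it was computed from, and assigning it to a column of `accounts` aligns on those row labels rather than on account IDs. A minimal alternative sketch, under the assumption that one mean per account and currency is wanted, aggregates first and then looks values up by account ID. The helper name `mean_amount_by_currency` and the toy data are hypothetical.

```python
import pandas as pd

# Toy payer-side transactions (made-up values); currency codes are already encoded.
paying = pd.DataFrame({
    "Account":          ["A", "A", "B", "B"],
    "Amount Paid":      [10.0, 30.0, 5.0, 7.0],
    "Payment Currency": [0.0, 0.0, 0.0, 1.0],
})
accounts = pd.DataFrame({"Account": ["B", "A", "C"]})

def mean_amount_by_currency(tx, accounts, amount_col, currency_col, prefix):
    """Add one '<prefix> <currency>' column per currency, looked up by account ID."""
    out = accounts.copy()
    # One row per account, one column per currency, cell = mean amount.
    means = tx.pivot_table(index="Account", columns=currency_col,
                           values=amount_col, aggfunc="mean")
    for currency in means.columns:
        out[f"{prefix} {currency}"] = out["Account"].map(means[currency])
    # Accounts with no transactions in a currency get 0.
    return out.fillna(0)

node_features = mean_amount_by_currency(
    paying, accounts, "Amount Paid", "Payment Currency", "avg paid"
)
print(node_features)
```

Because the lookup goes through the `Account` column, the result does not depend on how either table happens to be indexed.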

Now we can define the node attributes: the bank code and the mean paid and received amounts per currency.

In [15]:
import torch

from torch_geometric.data import Data, InMemoryDataset

def get_node_attr(currency_ls, paying_df, receiving_df, accounts):
    # Add per-currency mean paid and received amounts to each account.
    node_df = paid_currency_aggregate(currency_ls, paying_df, accounts)
    node_df = received_currency_aggregate(currency_ls, receiving_df, node_df)
    # Node labels as a float tensor.
    node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)
    node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
    # Encode the bank code; df_ord_encoder returns (DataFrame, encoders).
    node_df, _ = df_ord_encoder(node_df, ['Bank'])
    return node_df, node_label

In [18]:
node_df, node_label = get_node_attr(currency_ls, paying_df, receiving_df, accounts)
node_df

Unnamed: 0,Bank,avg paid 0.0,avg paid 1.0,avg paid 2.0,avg paid 3.0,avg paid 4.0,avg paid 5.0,avg paid 6.0,avg paid 7.0,avg paid 8.0,...,avg received 5.0,avg received 6.0,avg received 7.0,avg received 8.0,avg received 9.0,avg received 10.0,avg received 11.0,avg received 12.0,avg received 13.0,avg received 14.0
0,598.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,330.166429,0.0,0.0
1,598.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,119.992000,0.0,0.0
2,598.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14675.570000,0.0,0.0
3,598.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,756.486190,0.0,0.0
4,598.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3120.573333,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
515083,746.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6960.194583,0.0,0.0
515084,762.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3657.100000,0.0,0.0
515085,6502.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,111.120000,0.0,0.0
515086,6506.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,334.675000,0.0,0.0


# Edge features
Each transaction is treated as an edge.  
For the edge index, we replace every account with its node index and stack the source and destination indices into a tensor of shape [2, number of transactions].  
For edge attributes, we use 'Timestamp', 'Amount Received', 'Receiving Currency', 'Amount Paid', 'Payment Currency' and 'Payment Format'.


In [None]:
def get_edge_df(accounts, df):
    # Assign each account a consecutive integer node ID.
    accounts = accounts.reset_index(drop=True)
    accounts['ID'] = accounts.index
    mapping_dict = dict(zip(accounts['Account'], accounts['ID']))

    # Replace account IDs with node IDs for the source and destination of each transaction.
    df['From'] = df['Account'].map(mapping_dict)
    df['To'] = df['Account.1'].map(mapping_dict)
    df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)

    # Edge index of shape [2, number of transactions]: row 0 = source, row 1 = destination.
    edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)

    # The remaining columns are the edge attributes.
    df = df.drop(['Is Laundering', 'From', 'To'], axis=1)

    # edge_attr = torch.from_numpy(df.values).to(torch.float)  # tensor version, disabled for visualization

    edge_attr = df  # kept as a DataFrame for visualization
    return edge_attr, edge_index

In [21]:
edge_attr, edge_index = get_edge_df(accounts, df)
display(edge_attr.head())

Unnamed: 0,Timestamp,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format
4278714,0.45632,787197.11,13.0,787197.11,13.0,3.0
2798190,0.285018,787197.11,13.0,787197.11,13.0,3.0
2798191,0.284233,681262.19,13.0,681262.19,13.0,4.0
3918769,0.417079,681262.19,13.0,681262.19,13.0,4.0
213094,0.000746,146954.27,13.0,146954.27,13.0,5.0
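The layout of the edge index is easier to see on a toy graph with three accounts and three transactions. The sketch below uses NumPy only to stay dependency-light; the notebook's function builds the same structure as a torch tensor with `torch.stack`. All values are made up.

```python
import numpy as np
import pandas as pd

accounts = pd.DataFrame({"Account": ["A", "B", "C"]})
tx = pd.DataFrame({
    "Account":     ["A", "B", "A"],
    "Account.1":   ["B", "C", "C"],
    "Amount Paid": [10.0, 20.0, 30.0],
})

# Map each account ID to its row position in `accounts` (the node index).
node_id = {account: i for i, account in enumerate(accounts["Account"])}

# Row 0 holds source nodes, row 1 holds destination nodes.
edge_index = np.stack([
    tx["Account"].map(node_id).to_numpy(),
    tx["Account.1"].map(node_id).to_numpy(),
])

print(edge_index.shape)
print(edge_index)
```

Column `k` of the result describes transaction `k`: the first column, for example, is the edge from node 0 (A) to node 1 (B).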
