This notebook transforms the complete graph into a bipartite one.

Key features of the transformation, which turns public graph anomaly detection datasets into a more realistic money laundering setting:

1. There are two types of accounts: internal and external. Internal accounts have rich node features and node labels (fraud or not) recorded in the bank's system. For external accounts, we know no node features, no node labels, and no connections (i.e. transactions) between external accounts. The only information available about an external account is its unique ID, defined by its routing number and account ID.
2. The edges, i.e. the transactions among accounts, fall into three types: internal transactions (internal <-> internal, where the features of both origin and destination are known), deposits into internal accounts from external accounts, and withdrawals from internal accounts to external accounts.

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from utils import *
# Set the random seed and the fraction of accounts treated as external
# The fraction must lie between 0 and 1; its effect should be checked with a sensitivity analysis
seed = 1
np.random.seed(seed=seed)
portion_external = 0.1 # np.round(np.random.rand(),1)
print(f"portion_external: {portion_external}")

portion_external: 0.1
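
The comment above notes that the external fraction calls for a sensitivity analysis. A minimal sketch of one such check follows, run on a synthetic random edge list rather than the Elliptic data. The helper `edge_retention` is hypothetical and not part of this notebook, and it uses `np.random.default_rng` instead of the legacy `np.random.seed` used in the cells. When accounts are assigned independently, an edge has both endpoints external with probability of about `p**2`, so roughly `1 - p**2` of the edges should survive once external <-> external edges are dropped.

```python
import numpy as np

def edge_retention(n_nodes, edges, portion_external, seed=1):
    """Share of edges kept after dropping external <-> external edges.

    Hypothetical helper for illustration; `edges` is an (m, 2) integer array
    of node indices in [0, n_nodes).
    """
    rng = np.random.default_rng(seed)
    # Same rule as the notebook: a node is internal when its draw exceeds the portion
    internal = rng.random(n_nodes) > portion_external
    both_external = ~internal[edges[:, 0]] & ~internal[edges[:, 1]]
    return 1.0 - both_external.mean()

# Synthetic graph (made-up data, uniformly random endpoints)
rng = np.random.default_rng(0)
n_nodes = 10_000
edges = rng.integers(0, n_nodes, size=(50_000, 2))

for p in (0.1, 0.3, 0.5):
    kept = edge_retention(n_nodes, edges, p)
    print(f"portion_external={p}: kept {kept:.3f}, expected about {1 - p**2:.3f}")
```

On a real dataset the retained share can deviate from `1 - p**2`, because edges are not uniformly random; that deviation is part of what a sensitivity analysis would reveal.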


# Elliptic Data Processing

In [39]:
# Load and process the Elliptic data into heterogeneous data 
elliptic_accounts = pd.read_csv('./data/elliptic/elliptic_bitcoin_dataset/elliptic_txs_classes.csv')
elliptic_edges    = pd.read_csv('./data/elliptic/elliptic_bitcoin_dataset/elliptic_txs_edgelist.csv')
elliptic_features = pd.read_csv('./data/elliptic/elliptic_bitcoin_dataset/elliptic_txs_features.csv',header=None)
# Drop column 0 (txId) and column 1 (time step), keeping the 165 feature columns
elliptic_features = elliptic_features.drop(columns=[0,1])
# Check the data shape
elliptic_accounts.shape, elliptic_edges.shape, elliptic_features.shape

((203769, 2), (234355, 2), (203769, 165))

In [40]:
# Randomly split the data into internal and external data
np.random.seed(seed=seed)
elliptic_accounts['internal'] = np.random.rand(len(elliptic_accounts)) > portion_external

In [49]:
# Split the accounts into internal and external subsets
elliptic_accounts_internal = elliptic_accounts[elliptic_accounts['internal']]
elliptic_accounts_external = elliptic_accounts[~elliptic_accounts['internal']]
# Check the class distribution (as proportions) in each subset
elliptic_accounts_internal['class'].value_counts(normalize=True), elliptic_accounts_external['class'].value_counts(normalize=True)

(class
 unknown    0.771088
 2          0.206389
 1          0.022523
 Name: proportion, dtype: float64,
 class
 unknown    0.772084
 2          0.205939
 1          0.021977
 Name: proportion, dtype: float64)

The label proportions are nearly identical in the internal and external accounts.
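
The same comparison can be expressed as a single table with `pd.crosstab`. The sketch below uses made-up labels drawn with proportions close to those printed above, not the Elliptic data; the column names `class` and `internal` follow the notebook, while `group` and `max_gap` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic accounts (made-up data with roughly the class mix shown above)
rng = np.random.default_rng(1)
n = 100_000
accounts = pd.DataFrame({
    "class": rng.choice(["unknown", "2", "1"], size=n, p=[0.77, 0.21, 0.02]),
    "internal": rng.random(n) > 0.1,
})
accounts["group"] = np.where(accounts["internal"], "internal", "external")

# One row per group, one column per class; each row sums to 1
dist = pd.crosstab(accounts["group"], accounts["class"], normalize="index")
print(dist)

# Largest absolute difference in class share between the two groups
max_gap = (dist.loc["internal"] - dist.loc["external"]).abs().max()
print(f"largest gap between the two groups: {max_gap:.4f}")
```

A small `max_gap` is what a purely random split should produce; a large one would suggest the split is correlated with the label.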

In [41]:
# Create hetero data and features
elliptic_accounts_hetero = pd.DataFrame(columns=['account_id','internal','label'])
elliptic_features_hetero = elliptic_features.copy()

# Hetero accounts
elliptic_accounts_hetero['account_id'] = elliptic_accounts['txId']
elliptic_accounts_hetero['internal'] = elliptic_accounts['internal']
# Only record the illicit accounts (class '1'); everything else is labeled 0
elliptic_accounts_hetero['label'] = elliptic_accounts['class'].apply(lambda x: 1 if x == '1' else 0).astype(float)
# Set the external accounts' labels to NaN (a single .loc call avoids chained assignment)
elliptic_accounts_hetero.loc[~elliptic_accounts_hetero['internal'], 'label'] = np.nan

# Keep only the features of internal accounts:
# fill missing values with 0, then zero out the rows of external accounts
elliptic_features_hetero = elliptic_features_hetero.fillna(0)
elliptic_features_hetero.loc[~elliptic_accounts_hetero['internal']] = 0


In [42]:
# Record each transaction from account 1 (`sender`) to account 2 (`receiver`)
# with its transaction type `txn_type`:
#   0 (internal -> internal), 1 (internal -> external), 2 (external -> internal)
elliptic_transactions_hetero = pd.DataFrame(columns=['sender','receiver','txn_type'])

elliptic_transactions_hetero['sender'] = elliptic_edges['txId1']
elliptic_transactions_hetero['receiver'] = elliptic_edges['txId2']

elliptic_accounts_internal = elliptic_accounts_hetero['account_id'][elliptic_accounts_hetero['internal']]

# Check the transaction type
elliptic_transactions_senderflag = elliptic_transactions_hetero['sender'].isin(elliptic_accounts_internal)
elliptic_transactions_receiverflag = elliptic_transactions_hetero['receiver'].isin(elliptic_accounts_internal)
elliptic_transactions_hetero['txn_type'] = elliptic_transactions_senderflag.combine(elliptic_transactions_receiverflag,check_edge_type)

# Drop the external -> external transactions
elliptic_transactions_hetero = elliptic_transactions_hetero[elliptic_transactions_hetero['txn_type'] != -1]
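
`check_edge_type` is imported from the `utils` module, which is not shown in this notebook. The sketch below is one possible shape for such a helper, assuming it receives two booleans (whether the sender is internal, whether the receiver is internal) and returns the codes described in the comment above, with -1 for external -> external. The actual implementation in `utils` may differ, and the edge list here is made up.

```python
import pandas as pd

def check_edge_type(sender_internal: bool, receiver_internal: bool) -> int:
    """Hypothetical stand-in for utils.check_edge_type (an assumption, not the original)."""
    if sender_internal and receiver_internal:
        return 0   # internal -> internal
    if sender_internal:
        return 1   # internal -> external
    if receiver_internal:
        return 2   # external -> internal
    return -1      # external -> external, dropped afterwards

# Toy data: accounts 1, 2, 3 are internal; 8 and 9 are external
internal_ids = [1, 2, 3]
edges = pd.DataFrame({"sender": [1, 1, 9, 8], "receiver": [2, 9, 3, 9]})

sender_flag = edges["sender"].isin(internal_ids)
receiver_flag = edges["receiver"].isin(internal_ids)
# Series.combine applies the function element-wise to the paired flags
edges["txn_type"] = sender_flag.combine(receiver_flag, check_edge_type)

# Drop the external -> external transactions
edges = edges[edges["txn_type"] != -1]
print(edges)
```

For large edge lists, a vectorized alternative such as `np.select` over the two boolean flags would avoid the per-row Python call.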

In [43]:
# Save the data
import os
elliptic_data_path = f'./hetero_data/elliptic/ext_{portion_external}/'
# Create the output directory if it does not exist yet
os.makedirs(elliptic_data_path, exist_ok=True)
# Save to csv
elliptic_accounts_hetero.to_csv(elliptic_data_path + 'accounts.csv',index=False)
elliptic_transactions_hetero.to_csv(elliptic_data_path + 'transactions.csv',index=False)
elliptic_features_hetero.to_csv(elliptic_data_path + 'features.csv',index=False)
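
Because the `txId` column was dropped from the features, row i of `features.csv` matches row i of `accounts.csv` by position only. The sketch below shows one way to reload the three files and check that assumption. The function `load_hetero_dataset` is hypothetical; the file names follow the cell above, and the demo runs on a tiny made-up dataset in a temporary directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def load_hetero_dataset(path):
    """Hypothetical loader: read the three CSVs and run basic consistency checks."""
    accounts = pd.read_csv(os.path.join(path, "accounts.csv"))
    transactions = pd.read_csv(os.path.join(path, "transactions.csv"))
    features = pd.read_csv(os.path.join(path, "features.csv"))
    # Features are aligned with accounts by row position, so the lengths must match
    if len(accounts) != len(features):
        raise ValueError("accounts.csv and features.csv must have the same number of rows")
    # Every transaction endpoint should be a known account id
    endpoints = pd.concat([transactions["sender"], transactions["receiver"]])
    if not endpoints.isin(set(accounts["account_id"])).all():
        raise ValueError("transactions reference unknown account ids")
    return accounts, transactions, features

# Demo on a tiny made-up dataset
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"account_id": [1, 2, 3],
                  "internal": [True, True, False],
                  "label": [0.0, 1.0, np.nan]}).to_csv(os.path.join(tmp, "accounts.csv"), index=False)
    pd.DataFrame({"sender": [1, 2],
                  "receiver": [2, 3],
                  "txn_type": [0, 1]}).to_csv(os.path.join(tmp, "transactions.csv"), index=False)
    pd.DataFrame({"2": [0.1, 0.2, 0.0],
                  "3": [1.0, 2.0, 0.0]}).to_csv(os.path.join(tmp, "features.csv"), index=False)
    accounts, transactions, features = load_hetero_dataset(tmp)

print(len(accounts), len(transactions), features.shape)
```

Storing `account_id` alongside the features would make the alignment explicit, at the cost of one extra column.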

In [37]:
elliptic_accounts_hetero

        account_id  internal  label
0        230425980      True    0.0
1          5530458      True    0.0
2        232022460     False    NaN
3        232438397     False    NaN
4        230460314     False    NaN
...            ...       ...    ...
203764   173077460      True    0.0
203765   158577750      True    0.0
203766   158375402      True    1.0
203767   158654197     False    NaN
