This notebook transforms the complete graph into a bipartite one.

Key features used to turn the public graph anomaly detection datasets into more realistic money laundering cases:

1. There are two types of accounts: internal and external. Internal accounts have rich node features as well as node labels (fraud or not) recorded in the bank system. For external accounts, we know neither the node features, nor the node labels, nor the connections (i.e. transactions) between external accounts. The only information available about an external account is the unique ID defined by its routing number and account ID.
2. Beyond the two account types, the edges (the transactions between accounts) fall into three types: internal transactions (internal <-> internal, where we know the features of both origin and destination), deposits from external accounts into internal accounts, and withdrawals from internal accounts to external accounts.
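
The three edge types can be encoded as a small function of two flags. The sketch below is only an assumption about what the `check_edge_type` helper imported from `utils` does, since its source is not shown in this notebook. The name `edge_type_sketch` and the constants are hypothetical; the codes follow the convention used in the transaction cells (0, 1, 2, with -1 for external -> external edges, which are dropped).

```python
# Hypothetical stand-in for `utils.check_edge_type`; the real helper may differ.
INTERNAL, WITHDRAWAL, DEPOSIT, UNOBSERVED = 0, 1, 2, -1


def edge_type_sketch(sender_internal: bool, receiver_internal: bool) -> int:
    """Map the (sender, receiver) internal flags to a transaction type code."""
    if sender_internal and receiver_internal:
        return INTERNAL      # internal <-> internal
    if sender_internal:
        return WITHDRAWAL    # internal -> external
    if receiver_internal:
        return DEPOSIT       # external -> internal
    return UNOBSERVED        # external -> external, invisible to the bank


# Enumerate all four flag combinations.
codes = [edge_type_sketch(s, r) for s in (True, False) for r in (True, False)]
print(codes)
```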

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from utils import *
# Set the fraction of accounts treated as external
# The portion must lie between 0 and 1 and calls for a sensitivity analysis
seed = 1
np.random.seed(seed=seed)
portion_external = 0.1 #np.round(np.random.rand(),1)
print(f"portion_external: {portion_external}")

portion_external: 0.1


# Elliptic Data Processing

In [10]:
# Load and process the Elliptic data into heterogeneous data 
elliptic_accounts = pd.read_csv('./data/elliptic/elliptic_bitcoin_dataset/elliptic_txs_classes.csv')
elliptic_edges    = pd.read_csv('./data/elliptic/elliptic_bitcoin_dataset/elliptic_txs_edgelist.csv')
elliptic_features = pd.read_csv('./data/elliptic/elliptic_bitcoin_dataset/elliptic_txs_features.csv',header=None)

# Check the data shape
elliptic_accounts.shape, elliptic_edges.shape, elliptic_features.shape

((203769, 2), (234355, 2), (203769, 167))

In [None]:
# # Create homogeneous data
# # Create hetero data and features
# elliptic_accounts_homo = pd.DataFrame(columns=['account_id','internal','label'])
# elliptic_features_homo = elliptic_features.copy()

# # Hetero accounts
# elliptic_accounts_homo['account_id'] = elliptic_accounts['txId']
# elliptic_accounts_homo['internal'] = elliptic_accounts['internal']
# # Drop unknown label accounts
# # elliptic_accounts_homo = elliptic_accounts_homo.loc[elliptic_accounts_homo['class']!='unknown']

# # Only record the illicit accounts
# elliptic_accounts_homo['label'] = elliptic_accounts['class'].apply(lambda x: 1 if x == '1' else 
#                                                                    (2 if x == 'unknown' else 0))

# # Only keeps the features of internal accounts
# # elliptic_features_hetero.loc[elliptic_accounts_hetero['internal']==0] = 0
# elliptic_features_homo = elliptic_features_homo.drop(columns=[0,1])
# # elliptic_features_homo = elliptic_features_homo[list(range(2,94))]
# # recording the transaction from account 1 (`sender`) to account 2 (`receiver`), with transaction type `txn_type`: 0 (internal transactions), 1 (internal->external), and 2(external->internal)
# elliptic_transactions_homo= pd.DataFrame(columns=['sender','receiver','txn_type'])

# elliptic_transactions_homo['sender'] = elliptic_edges['txId1']
# elliptic_transactions_homo['receiver'] = elliptic_edges['txId2']

# # Save the homogeneous data
# elliptic_data_path = f'./homo_data/elliptic/'
# import os
# if not os.path.exists(elliptic_data_path):
#     os.makedirs(elliptic_data_path)
# # Save to csv
# elliptic_accounts_homo.to_csv(elliptic_data_path + 'accounts.csv',index=False)
# elliptic_transactions_homo.to_csv(elliptic_data_path + 'transactions.csv',index=False)
# elliptic_features_homo.to_csv(elliptic_data_path + 'features.csv',index=False)

In [5]:
# Randomly split the data into internal and external data
np.random.seed(seed=seed)
elliptic_accounts['internal'] = np.random.rand(len(elliptic_accounts)) > portion_external
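
Because the threshold is applied to the same seeded random draws, the splits obtained for different values of `portion_external` are nested: an account that is internal at a larger portion is also internal at any smaller one. A minimal sketch of a sensitivity sweep, using only the account count reported above (203769); the `masks` dictionary and the chosen portions are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

n_accounts = 203_769  # number of rows in the Elliptic account table
masks = {}
for portion in (0.1, 0.3, 0.5):
    np.random.seed(1)  # reuse the same seed for every portion
    masks[portion] = np.random.rand(n_accounts) > portion  # True = internal
    print(f"portion_external={portion}: internal share={masks[portion].mean():.3f}")

# Nestedness check: no account is internal at 0.3 but external at 0.1.
print((masks[0.3] & ~masks[0.1]).any())
```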

In [49]:
# # Check the illicit and licit accounts in internal data
# elliptic_accounts_internal = elliptic_accounts[elliptic_accounts['internal']]
# elliptic_accounts_external = elliptic_accounts[~elliptic_accounts['internal']]
# # Check the class distribution percentage
# elliptic_accounts_internal['class'].value_counts(normalize=True), elliptic_accounts_external['class'].value_counts(normalize=True)

(class
 unknown    0.771088
 2          0.206389
 1          0.022523
 Name: proportion, dtype: float64,
 class
 unknown    0.772084
 2          0.205939
 1          0.021977
 Name: proportion, dtype: float64)

The label proportions are nearly identical in the internal and external accounts.
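
One way to make this comparison in a single table is a row-normalised crosstab. The sketch below runs on a made-up toy frame rather than the real `elliptic_accounts`; the names `toy_accounts`, `group`, `dist` and `max_gap`, and the sampling probabilities, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `elliptic_accounts`; the values are randomly generated.
rng = np.random.default_rng(0)
toy_accounts = pd.DataFrame({
    "class": rng.choice(["unknown", "2", "1"], size=10_000, p=[0.77, 0.21, 0.02]),
    "group": np.where(rng.random(10_000) > 0.1, "internal", "external"),
})

# Row-normalised crosstab: each row is the class distribution of one group.
dist = pd.crosstab(toy_accounts["group"], toy_accounts["class"], normalize="index")
# Largest absolute difference between the two groups' class shares.
max_gap = (dist.loc["internal"] - dist.loc["external"]).abs().max()

print(dist.round(3))
print(f"max gap: {max_gap:.3f}")
```

A small `max_gap` indicates that the random split did not skew the class balance of either group.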

In [6]:
# Create hetero data and features
elliptic_accounts_hetero = pd.DataFrame(columns=['account_id','internal','label'])
elliptic_features_hetero = elliptic_features.copy()

# Hetero accounts
elliptic_accounts_hetero['account_id'] = elliptic_accounts['txId']
elliptic_accounts_hetero['internal'] = elliptic_accounts['internal']
# Encode the labels: 1 = illicit, 0 = licit, 2 = unknown
elliptic_accounts_hetero['label'] = elliptic_accounts['class'].apply(lambda x: 1 if x == '1' else (2 if x == 'unknown' else 0))
# Set the external accounts' labels to NaN (a single .loc call avoids chained assignment)
elliptic_accounts_hetero.loc[elliptic_accounts_hetero['internal']==0, 'label'] = np.nan

# Only keep the features of internal accounts
# elliptic_features_hetero.loc[elliptic_accounts_hetero['internal']==0] = 0
# Drop the txId (column 0) and time step (column 1) columns
elliptic_features_hetero = elliptic_features_hetero.drop(columns=[0,1])


In [25]:
elliptic_accounts_hetero['label'].value_counts()

2.0    94284
0.0    25236
1.0     2754
Name: label, dtype: int64

In [16]:
# a = (elliptic_features_hetero - elliptic_features_hetero.mean(axis=0))/elliptic_features_hetero.std(axis=0)
elliptic_features_hetero.describe()

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,...,157,158,159,160,161,162,163,164,165,166
count,203769.0,203769.0,203769.0,203769.0,203769.0,203769.0,203769.0,203769.0,203769.0,203769.0,...,203769.0,203769.0,203769.0,203769.0,203769.0,203769.0,203769.0,203769.0,203769.0,203769.0
mean,7.513050000000001e-17,2.179304e-14,1.609496e-13,-1.96617e-15,-4.103764e-15,6.953823e-15,4.018657e-14,-9.926576e-16,-4.875034e-16,3.45271e-15,...,-8.626965e-15,1.966023e-15,5.926123e-15,2.0526e-14,6.97548e-15,2.942974e-14,-1.748088e-14,8.62616e-14,-6.53372e-14,-1.622924e-14
std,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,...,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002
min,-0.1729826,-0.2105526,-1.756361,-0.1219696,-0.06372457,-0.113002,-0.06158379,-0.1636459,-0.1694603,-0.04970696,...,-0.5770994,-0.6262286,-0.9790738,-0.978556,-0.2160569,-0.1259391,-0.1311553,-0.2698175,-1.760926,-1.760984
25%,-0.1725317,-0.1803266,-1.201369,-0.1219696,-0.04387455,-0.113002,-0.06158379,-0.1635168,-0.1690701,-0.04970696,...,-0.5696264,-0.5946915,-0.9790738,-0.978556,-0.09888874,-0.08749016,-0.1311553,-0.1405971,-0.1206134,-0.1197925
50%,-0.1692045,-0.1328975,0.4636092,-0.1219696,-0.04387455,-0.113002,-0.06158379,-0.162044,-0.1662255,-0.04970696,...,-0.4799511,-0.4559278,0.2411283,0.2414064,0.0182794,-0.08749016,-0.1311553,-0.09752359,-0.1206134,-0.1197925
75%,-0.1318553,-0.05524241,1.018602,-0.1219696,-0.04387455,-0.113002,-0.06158379,-0.1355932,-0.1323665,-0.04970696,...,0.1552495,0.1212026,1.305594,1.398764,0.0182794,-0.08749016,-0.08467423,-0.09752359,0.1520067,0.119971
max,71.68197,73.59505,2.68358,49.0276,260.0907,54.56518,113.4409,73.35457,72.3184,189.1869,...,7.862953,7.914041,1.46133,1.461369,117.0692,251.849,238.7835,105.734,1.5197,1.521399


In [7]:
# Record each transaction from `sender` to `receiver` with a transaction type `txn_type`: 0 (internal -> internal), 1 (internal -> external), and 2 (external -> internal)
elliptic_transactions_hetero = pd.DataFrame(columns=['sender','receiver','txn_type'])

elliptic_transactions_hetero['sender'] = elliptic_edges['txId1']
elliptic_transactions_hetero['receiver'] = elliptic_edges['txId2']

elliptic_accounts_internal = elliptic_accounts_hetero['account_id'][elliptic_accounts_hetero['internal']]

# Check the transaction type
elliptic_transactions_senderflag = elliptic_transactions_hetero['sender'].isin(elliptic_accounts_internal)
elliptic_transactions_receiverflag = elliptic_transactions_hetero['receiver'].isin(elliptic_accounts_internal)
elliptic_transactions_hetero['txn_type'] = elliptic_transactions_senderflag.combine(elliptic_transactions_receiverflag,check_edge_type)

# Drop the external -> external transactions
elliptic_transactions_hetero = elliptic_transactions_hetero[elliptic_transactions_hetero['txn_type'] != -1]
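
`Series.combine` calls the helper once per edge, which can be slow on large edge lists. A vectorised alternative is sketched below with `np.select`, assuming `check_edge_type` implements exactly the mapping described in the comment above (with -1 for external -> external). The toy edge list and the names `toy_edges` and `toy_internal_ids` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy edge list and internal-account set, made up for illustration.
toy_edges = pd.DataFrame({"sender": [1, 1, 3, 3], "receiver": [2, 3, 1, 4]})
toy_internal_ids = [1, 2]

sender_internal = toy_edges["sender"].isin(toy_internal_ids).to_numpy()
receiver_internal = toy_edges["receiver"].isin(toy_internal_ids).to_numpy()

# np.select picks the first matching condition per row; -1 is the fallback.
toy_edges["txn_type"] = np.select(
    [
        sender_internal & receiver_internal,    # internal -> internal
        sender_internal & ~receiver_internal,   # internal -> external
        ~sender_internal & receiver_internal,   # external -> internal
    ],
    [0, 1, 2],
    default=-1,                                 # external -> external
)

# Drop the external -> external edges, as in the cell above.
toy_edges = toy_edges[toy_edges["txn_type"] != -1]
print(toy_edges)
```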

In [8]:
elliptic_features_hetero.head()

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,...,157,158,159,160,161,162,163,164,165,166
0,-0.171469,-0.184668,-1.201369,-0.12197,-0.043875,-0.113002,-0.061584,-0.162097,-0.167933,-0.049707,...,-0.562153,-0.600999,1.46133,1.461369,0.018279,-0.08749,-0.131155,-0.097524,-0.120613,-0.119792
1,-0.171484,-0.184668,-1.201369,-0.12197,-0.043875,-0.113002,-0.061584,-0.162112,-0.167948,-0.049707,...,0.947382,0.673103,-0.979074,-0.978556,0.018279,-0.08749,-0.131155,-0.097524,-0.120613,-0.119792
2,-0.172107,-0.184668,-1.201369,-0.12197,-0.043875,-0.113002,-0.061584,-0.162749,-0.168576,-0.049707,...,0.670883,0.439728,-0.979074,-0.978556,-0.098889,-0.106715,-0.131155,-0.183671,-0.120613,-0.119792
3,0.163054,1.96379,-0.646376,12.409294,-0.063725,9.782742,12.414558,-0.163645,-0.115831,0.043598,...,-0.577099,-0.613614,0.241128,0.241406,1.072793,0.08553,-0.131155,0.677799,-0.120613,-0.119792
4,1.011523,-0.081127,-1.201369,1.153668,0.333276,1.312656,-0.061584,-0.163523,0.041399,0.935886,...,-0.511871,-0.400422,0.517257,0.579382,0.018279,0.277775,0.326394,1.29375,0.178136,0.179117


In [9]:
elliptic_accounts_hetero.head()

Unnamed: 0,account_id,internal,label
0,230425980,True,2.0
1,5530458,True,2.0
2,232022460,False,
3,232438397,True,0.0
4,230460314,True,2.0


In [10]:
# Save the data
import os
elliptic_data_path = f'./hetero_data/elliptic/ext_{portion_external}/'
os.makedirs(elliptic_data_path, exist_ok=True)
# Save to csv
elliptic_accounts_hetero.to_csv(elliptic_data_path + 'accounts.csv',index=False)
elliptic_transactions_hetero.to_csv(elliptic_data_path + 'transactions.csv',index=False)
elliptic_features_hetero.to_csv(elliptic_data_path + 'features.csv',index=False)

In [24]:
elliptic_accounts_hetero

Unnamed: 0,account_id,internal,label
0,230425980,True,2.0
1,5530458,True,2.0
2,232022460,False,
3,232438397,False,
4,230460314,False,
...,...,...,...
203764,173077460,True,2.0
203765,158577750,True,2.0
203766,158375402,True,1.0
203767,158654197,False,


In [60]:
elliptic_accounts_hetero

Unnamed: 0,account_id,internal,label
0,230425980,True,0.0
1,5530458,True,0.0
2,232022460,False,
3,232438397,False,
4,230460314,False,
...,...,...,...
203764,173077460,True,0.0
203765,158577750,True,0.0
203766,158375402,True,1.0
203767,158654197,False,


# DGraph_Fin Data Processing

The file **dgraphfin.npz** contains the following keys:  

- **x**: 17-dimensional node features.
- **y**: node label.  
    There are four classes. The node counts of each class are:     
    0: 1210092    
    1: 15509    
    2: 1620851    
    3: 854098    
    Nodes of Class 1 are fraud users and nodes of Class 0 are normal users; these are the two classes to be predicted.    
    Nodes of Class 2 and Class 3 are background users.    
    
- **edge_index**: shape (4300999, 2).   
    Each edge is in the form (id_a, id_b), where ids are the indices in x.        

- **edge_type**: 11 types of edges. 
    
- **edge_timestamp**: the desensitized timestamp of each edge.
    
- **train_mask, valid_mask, test_mask**:  
    Nodes of Class 0 and Class 1 are randomly split 70/15/15.  

Naturally, Classes 0 and 1 are the internal accounts, since they carry labels. The background nodes (Classes 2 and 3) are the external accounts and serve as background information. We treat the features of the background nodes as unavailable.

Note that we do not follow their train/val/test split; we use the one in model/utils.py instead.
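
The class-to-account mapping described above can be sketched on a toy label vector. The array `toy_y` is made up and only mirrors the class codes of `y`; `toy_internal` and `toy_label` are hypothetical names.

```python
import numpy as np

# Toy label vector using the same class codes as `y`; values are made up.
toy_y = np.array([0, 1, 2, 3, 0, 2])

# Foreground classes (0 = normal, 1 = fraud) become internal accounts.
toy_internal = np.isin(toy_y, [0, 1])
# Background nodes keep no label: NaN marks "unknown to the bank".
toy_label = np.where(toy_internal, toy_y, np.nan)

print(toy_internal)
print(toy_label)
```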


In [3]:
dgraphfin = np.load('data/dgraph_fin/dgraphfin.npz')
dgraphfin.files

['x',
 'y',
 'edge_index',
 'edge_type',
 'edge_timestamp',
 'train_mask',
 'valid_mask',
 'test_mask']

In [7]:
# pd.value_counts(dgraphfin['y'],normalize=True), sum(dgraphfin['y']==1)/(sum(dgraphfin['y']==1)+sum(dgraphfin['y']==0))

(2    0.438003
 0    0.327003
 3    0.230803
 1    0.004191
 dtype: float64,
 0.012654199857865651)

In [4]:
dgraphfin['x'].shape, dgraphfin['y'].shape

((3700550, 17), (3700550,))

In [5]:
dgraphfin['edge_index'].min(axis=0),dgraphfin['edge_index'].max(axis=0)

(array([3, 0]), array([3699087, 3700549]))

In [6]:
# Create hetero data and features
dgraphfin_accounts_hetero = pd.DataFrame(columns=['account_id','internal','label'])
dgraphfin_features_hetero = dgraphfin['x'].copy()
# Normalize by the largest value
dgraphfin_features_hetero = dgraphfin_features_hetero/dgraphfin_features_hetero.max(axis=0)

# Hetero accounts
dgraphfin_accounts_hetero['account_id'] = np.arange(dgraphfin['x'].shape[0])
dgraphfin_accounts_hetero['internal']   = np.zeros(dgraphfin['x'].shape[0],dtype=bool)
# Classes 0 and 1 carry labels, so they become the internal accounts
dgraphfin_foreground = (dgraphfin['y']==0)|(dgraphfin['y']==1)
# Assign through .loc to avoid chained assignment
dgraphfin_accounts_hetero.loc[dgraphfin_foreground, 'internal'] = True
dgraphfin_accounts_hetero.loc[dgraphfin_foreground, 'label'] = dgraphfin['y'][dgraphfin_foreground]

# Randomly relabel 10/12 of the label-0 accounts as 2
np.random.seed(seed=seed)
label_0_index = dgraphfin_accounts_hetero['label']==0
label_0_index = np.where(label_0_index)[0]
np.random.shuffle(label_0_index)
label_0_index = label_0_index[:int(len(label_0_index)/12*10)]
dgraphfin_accounts_hetero.iloc[label_0_index, dgraphfin_accounts_hetero.columns.get_loc('label')] = 2

# Only keep the features of internal accounts
# dgraphfin_features_hetero[dgraphfin_accounts_hetero['internal']==0] = 0


In [8]:
dgraphfin_accounts_hetero['label'].value_counts(normalize=True)

2    0.822788
0    0.164558
1    0.012654
Name: label, dtype: float64

In [23]:
# dgraphfin_accounts_hetero.internal.value_counts(), 2474949/(2474949+1225601)

(False    2474949
 True     1225601
 Name: internal, dtype: int64,
 0.6688057180689357)

In [7]:
# Record each transaction from `sender` to `receiver` with a transaction type `txn_type`: 0 (internal -> internal), 1 (internal -> external), and 2 (external -> internal)
dgraphfin_transactions_hetero = pd.DataFrame(columns=['sender','receiver','txn_type'])
dgraphfin_transactions_hetero['sender'] = dgraphfin['edge_index'][:,0]
dgraphfin_transactions_hetero['receiver'] = dgraphfin['edge_index'][:,1]

dgraphfin_accounts_internal = dgraphfin_accounts_hetero['account_id'][dgraphfin_accounts_hetero['internal']]

# Check the transaction type
dgraphfin_transactions_senderflag = dgraphfin_transactions_hetero['sender'].isin(dgraphfin_accounts_internal)
dgraphfin_transactions_receiverflag = dgraphfin_transactions_hetero['receiver'].isin(dgraphfin_accounts_internal)
dgraphfin_transactions_hetero['txn_type'] = dgraphfin_transactions_senderflag.combine(dgraphfin_transactions_receiverflag,check_edge_type)

# Drop the external -> external transactions
dgraphfin_transactions_hetero = dgraphfin_transactions_hetero[dgraphfin_transactions_hetero['txn_type'] != -1]

In [8]:
# Save the data
import os
dgraphfin_data_path = './hetero_data/dgraph_fin/ext_0.6/'
os.makedirs(dgraphfin_data_path, exist_ok=True)
# Save to csv
dgraphfin_accounts_hetero.to_csv(dgraphfin_data_path + 'accounts.csv',index=False)
dgraphfin_transactions_hetero.to_csv(dgraphfin_data_path + 'transactions.csv',index=False)
# Save the features to csv
pd.DataFrame(dgraphfin_features_hetero).to_csv(dgraphfin_data_path + 'features.csv',index=False)