<a href="https://colab.research.google.com/github/boersmamarcel/notebooks/blob/main/ellipticpp_dataset_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unveiling Bitcoin Network Secrets with the Elliptic++ dataset: From Transactions to Graph Neural Networks
Due to the network's inherent pseudonymity and decentralized nature, analyzing Bitcoin transactions can be complex. Understanding the relationships between wallets and transactions is crucial for uncovering potential illicit activities.

This tutorial uses the Elliptic++ dataset, a rich source of labeled Bitcoin transaction data. We focus on preparing this data for a graph neural network: we transform it into a HeteroData structure so that the connections within the Bitcoin network can be explored and analyzed with graph neural networks in PyTorch Geometric.

This tutorial will guide you through the entire data preparation process. By the end, you'll be equipped with the skills to prepare your datasets for evaluation using graph neural network models, enabling you to gain valuable insights into complex networks like the Bitcoin ecosystem. Let's embark on this journey and shed light on the hidden patterns within the Bitcoin network!

# Preparation: download the dataset
Please download the files from Google Drive and put them in the folder called *elliptic_bitcoin_dataset*.
[Google Drive link](https://drive.google.com/drive/folders/1MRPXz79Lu_JGLlJ21MDfML44dKN9R08l).

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
!cp -r '/content/gdrive/My Drive/elliptic_bitcoin_dataset/' './'

# Wallet data
At the heart of the Bitcoin network lie wallets. A wallet holds the keys needed to authorize transactions on the blockchain, ensuring that only the rightful owner can spend its holdings. In the dataset, each wallet is identified by an address and comes with a set of features. The Elliptic++ dataset describes how these wallets connect to each other and to transactions; by studying those interactions, we can potentially identify illicit activity on the blockchain.

We first read the file that contains the wallets' features and classes.

In [None]:
import polars as pol
import torch

features_wallets = pol.read_csv("./elliptic_bitcoin_dataset/wallets_features_classes_combined.csv")
features_wallets

address,Time step,class,num_txs_as_sender,num_txs_as receiver,first_block_appeared_in,last_block_appeared_in,lifetime_in_blocks,total_txs,first_sent_block,first_received_block,num_timesteps_appeared_in,btc_transacted_total,btc_transacted_min,btc_transacted_max,btc_transacted_mean,btc_transacted_median,btc_sent_total,btc_sent_min,btc_sent_max,btc_sent_mean,btc_sent_median,btc_received_total,btc_received_min,btc_received_max,btc_received_mean,btc_received_median,fees_total,fees_min,fees_max,fees_mean,fees_median,fees_as_share_total,fees_as_share_min,fees_as_share_max,fees_as_share_mean,fees_as_share_median,blocks_btwn_txs_total,blocks_btwn_txs_min,blocks_btwn_txs_max,blocks_btwn_txs_mean,blocks_btwn_txs_median,blocks_btwn_input_txs_total,blocks_btwn_input_txs_min,blocks_btwn_input_txs_max,blocks_btwn_input_txs_mean,blocks_btwn_input_txs_median,blocks_btwn_output_txs_total,blocks_btwn_output_txs_min,blocks_btwn_output_txs_max,blocks_btwn_output_txs_mean,blocks_btwn_output_txs_median,num_addr_transacted_multiple,transacted_w_address_total,transacted_w_address_min,transacted_w_address_max,transacted_w_address_mean,transacted_w_address_median
str,i64,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""111112TykSw72z…",25,2,0.0,1.0,439586.0,439586.0,0.0,1.0,0.0,439586.0,1.0,0.0106281,0.0106281,0.0106281,0.0106281,0.0106281,0.0,0.0,0.0,0.0,0.0,0.0106281,0.0106281,0.0106281,0.0106281,0.0106281,0.007042,0.007042,0.007042,0.007042,0.007042,0.000012,0.000012,0.000012,0.000012,0.000012,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,1.0,1.0,1.0,1.0
"""1111DAYXhoxZx2…",25,3,0.0,8.0,439589.0,485959.0,46370.0,8.0,0.0,439589.0,6.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.0,0.0,0.0,0.0,0.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.002371,0.000122,0.00058,0.000296,0.000242,0.002217,0.000121,0.000523,0.000277,0.000237,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,0.0,0.0,0.0,0.0,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,8.0,1.0,1.0,1.0,1.0
"""1111DAYXhoxZx2…",29,3,0.0,8.0,439589.0,485959.0,46370.0,8.0,0.0,439589.0,6.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.0,0.0,0.0,0.0,0.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.002371,0.000122,0.00058,0.000296,0.000242,0.002217,0.000121,0.000523,0.000277,0.000237,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,0.0,0.0,0.0,0.0,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,8.0,1.0,1.0,1.0,1.0
"""1111DAYXhoxZx2…",39,3,0.0,8.0,439589.0,485959.0,46370.0,8.0,0.0,439589.0,6.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.0,0.0,0.0,0.0,0.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.002371,0.000122,0.00058,0.000296,0.000242,0.002217,0.000121,0.000523,0.000277,0.000237,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,0.0,0.0,0.0,0.0,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,8.0,1.0,1.0,1.0,1.0
"""1111DAYXhoxZx2…",39,3,0.0,8.0,439589.0,485959.0,46370.0,8.0,0.0,439589.0,6.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.0,0.0,0.0,0.0,0.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.002371,0.000122,0.00058,0.000296,0.000242,0.002217,0.000121,0.000523,0.000277,0.000237,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,0.0,0.0,0.0,0.0,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,8.0,1.0,1.0,1.0,1.0
"""1111DAYXhoxZx2…",43,3,0.0,8.0,439589.0,485959.0,46370.0,8.0,0.0,439589.0,6.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.0,0.0,0.0,0.0,0.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.002371,0.000122,0.00058,0.000296,0.000242,0.002217,0.000121,0.000523,0.000277,0.000237,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,0.0,0.0,0.0,0.0,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,8.0,1.0,1.0,1.0,1.0
"""1111DAYXhoxZx2…",43,3,0.0,8.0,439589.0,485959.0,46370.0,8.0,0.0,439589.0,6.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.0,0.0,0.0,0.0,0.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.002371,0.000122,0.00058,0.000296,0.000242,0.002217,0.000121,0.000523,0.000277,0.000237,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,0.0,0.0,0.0,0.0,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,8.0,1.0,1.0,1.0,1.0
"""1111DAYXhoxZx2…",47,3,0.0,8.0,439589.0,485959.0,46370.0,8.0,0.0,439589.0,6.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.0,0.0,0.0,0.0,0.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.002371,0.000122,0.00058,0.000296,0.000242,0.002217,0.000121,0.000523,0.000277,0.000237,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,0.0,0.0,0.0,0.0,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,8.0,1.0,1.0,1.0,1.0
"""1111DAYXhoxZx2…",48,3,0.0,8.0,439589.0,485959.0,46370.0,8.0,0.0,439589.0,6.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.0,0.0,0.0,0.0,0.0,0.273046,0.0039,0.133777,0.034131,0.014352,0.002371,0.000122,0.00058,0.000296,0.000242,0.002217,0.000121,0.000523,0.000277,0.000237,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,0.0,0.0,0.0,0.0,46370.0,0.0,20164.0,6624.285714,8060.0,0.0,8.0,1.0,1.0,1.0,1.0
"""1111VHuXEzHaRC…",21,2,0.0,1.0,431522.0,431522.0,0.0,1.0,0.0,431522.0,1.0,0.000104,0.000104,0.000104,0.000104,0.000104,0.0,0.0,0.0,0.0,0.0,0.000104,0.000104,0.000104,0.000104,0.000104,0.0276,0.0276,0.0276,0.0276,0.0276,0.000001,0.000001,0.000001,0.000001,0.000001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0


The code reads a CSV file named "elliptic_bitcoin_dataset/wallets_features_classes_combined.csv" into a Polars DataFrame named features_wallets. Printing the shape of this DataFrame reveals that it contains 1,268,260 rows and 58 columns. The columns in the DataFrame represent the following features:

| Feature | Description |
|---|---|
| **Transaction related (5 values each: total, min, max, mean, median)** | |
| BTC transacted | Total BTC transacted (sent + received) |
| BTC sent | Total BTC sent |
| BTC received | Total BTC received |
| Fees | Total fees in BTC |
| Fees share | Total fees as share of BTC transacted |
| **Time related (5 values each: total, min, max, mean, median)** | |
| Blocks txs | Number of blocks between transactions |
| Blocks input | Number of blocks between being an input address |
| Blocks output | Number of blocks between being an output address |
| **Address interactions (5 values: total, min, max, mean, median)** | |
| Addr interactions | Number of interactions among addresses |
| **Transaction related (single value)** | |
| Txs total | Total number of blockchain transactions |
| Txs input | Total number of dataset transactions as input address |
| Txs output | Total number of dataset transactions as output address |
| **Time related (single value)** | |
| Timesteps | Number of time steps transacting in |
| Lifetime | Lifetime in blocks |
| Block first | Block height first transacted in |
| Block last | Block height last transacted in |
| Block first sent | Block height first sent in |
| Block first receive | Block height first received in |
| **Address interactions (single value)** | |
| Repeat interactions | Number of addresses transacted with multiple times |
| **Label** | |
| Class | Class label: {illicit, licit, unknown} |

Next, we prepare the labels as separate tensors, which are used for supervised training tasks.

In [None]:
wallet_labels = torch.tensor(
    features_wallets["class"].to_numpy()
).long()
wallet_labels

tensor([2, 3, 3,  ..., 3, 3, 3])

We convert the wallet class labels into a PyTorch tensor. features_wallets["class"].to_numpy() extracts the class column as a NumPy array, torch.tensor(…) turns that array into a tensor, and .long() casts it to 64-bit integers (torch.long), the dtype PyTorch expects for class indices.

Next, we clean up the feature vector and convert the Polars DataFrame to a PyTorch tensor. We drop the class, time step, and address columns and store the remaining 55 columns as 32-bit floats:

In [None]:
tensor_features_wallets = torch.tensor(
    features_wallets.drop(["class", "Time step", "address"]).to_numpy(),
    dtype=torch.float32,
)
tensor_features_wallets

tensor([[0.0000e+00, 1.0000e+00, 4.3959e+05,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [0.0000e+00, 8.0000e+00, 4.3959e+05,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [0.0000e+00, 8.0000e+00, 4.3959e+05,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        ...,
        [0.0000e+00, 1.0000e+00, 4.0734e+05,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [0.0000e+00, 1.0000e+00, 3.9524e+05,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 0.0000e+00, 4.0733e+05,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00]])
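
The raw class codes in wallet_labels are not yet in the form most PyTorch loss functions expect, which is class indices starting at 0. Below is a minimal sketch of one way to prepare them, assuming the Elliptic++ convention that 1 means illicit, 2 licit, and 3 unknown (an assumption worth checking against the dataset's documentation). The helper `remap_labels` is hypothetical and not part of the notebook.

```python
import torch

def remap_labels(raw_labels, unknown_code=3):
    """Shift class codes to start at 0 and flag the labeled nodes.

    Assumes codes 1 (illicit), 2 (licit), 3 (unknown); this encoding is an
    assumption, not something the notebook verifies.
    """
    labeled_mask = raw_labels != unknown_code
    # Codes 1 and 2 become 0 and 1; unknown nodes get -1 as a placeholder.
    remapped = torch.where(
        labeled_mask, raw_labels - 1, torch.full_like(raw_labels, -1)
    )
    return remapped, labeled_mask

raw = torch.tensor([2, 3, 3, 1, 2])
labels, labeled_mask = remap_labels(raw)
print(labels)        # tensor([ 1, -1, -1,  0,  1])
print(labeled_mask)  # tensor([ True, False, False,  True,  True])
```

Combining `labeled_mask` with a training mask (for example `train_mask & labeled_mask`) would restrict the loss to nodes with a known class.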

# Transaction data

The Bitcoin network thrives on a constant flow of transactions, which form the second node type in our dataset. Each transaction is a record on the blockchain documenting the transfer of bitcoins between wallets. The edges in our dataset represent these flows: they connect wallets to transactions and transactions to wallets, and they also link transactions to each other and wallets to each other. This network lets us analyze patterns and identify potentially illicit activity hidden within the flow. We load the transaction features and join the class labels on txId.

In [None]:
transaction_classes = pol.read_csv("elliptic_bitcoin_dataset/txs_classes.csv")
features_transaction = pol.read_csv("elliptic_bitcoin_dataset/txs_features.csv")

features_transaction = features_transaction.join(
    transaction_classes, how="left", left_on="txId", right_on="txId"
)

features_transaction

txId,Time step,Local_feature_1,Local_feature_2,Local_feature_3,Local_feature_4,Local_feature_5,Local_feature_6,Local_feature_7,Local_feature_8,Local_feature_9,Local_feature_10,Local_feature_11,Local_feature_12,Local_feature_13,Local_feature_14,Local_feature_15,Local_feature_16,Local_feature_17,Local_feature_18,Local_feature_19,Local_feature_20,Local_feature_21,Local_feature_22,Local_feature_23,Local_feature_24,Local_feature_25,Local_feature_26,Local_feature_27,Local_feature_28,Local_feature_29,Local_feature_30,Local_feature_31,Local_feature_32,Local_feature_33,Local_feature_34,Local_feature_35,…,Aggregate_feature_54,Aggregate_feature_55,Aggregate_feature_56,Aggregate_feature_57,Aggregate_feature_58,Aggregate_feature_59,Aggregate_feature_60,Aggregate_feature_61,Aggregate_feature_62,Aggregate_feature_63,Aggregate_feature_64,Aggregate_feature_65,Aggregate_feature_66,Aggregate_feature_67,Aggregate_feature_68,Aggregate_feature_69,Aggregate_feature_70,Aggregate_feature_71,Aggregate_feature_72,in_txs_degree,out_txs_degree,total_BTC,fees,size,num_input_addresses,num_output_addresses,in_BTC_min,in_BTC_max,in_BTC_mean,in_BTC_median,in_BTC_total,out_BTC_min,out_BTC_max,out_BTC_mean,out_BTC_median,out_BTC_total,class
i64,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i64
3321,1,-0.169615,-0.184668,-1.201369,-0.12197,-0.043875,-0.113002,-0.061584,-0.160199,-0.166062,-0.049707,-0.162507,-0.028741,-0.035391,-0.042955,-0.013282,-0.042183,-0.16877,-0.171416,-0.172277,-1.373657,-1.37146,-0.139663,-0.148869,-0.080147,-0.155604,-0.010763,-0.012107,-0.139665,-0.148864,-0.080147,-0.155604,-0.010669,-0.012005,-0.024668,-0.031272,…,-1.01623,-0.968903,-1.94385,-1.059868,-1.678997,0.185597,0.185492,-0.003773,-0.562664,-0.577099,-0.50008,0.241128,0.241406,-0.098889,-0.08749,-0.084674,-0.140597,1.5197,1.521399,1.0,0.0,0.533972,0.0001,225.0,1.0,2.0,0.534072,0.534072,0.534072,0.534072,0.534072,0.166899,0.367074,0.266986,0.266986,0.533972,3
11108,1,-0.137586,-0.184668,-1.201369,-0.12197,-0.043875,-0.113002,-0.061584,-0.127429,-0.133751,-0.049707,-0.129773,-0.028741,-0.035391,-0.042955,-0.013282,-0.003952,-0.13856,-0.139821,-0.134358,0.887058,0.884557,-0.139564,-0.148805,-0.080147,-0.15552,-0.010763,-0.012107,-0.139566,-0.1488,-0.080147,-0.15552,-0.010669,-0.012005,-0.02453,-0.031142,…,-1.01623,-0.968903,-1.94385,-1.059868,-1.678997,0.185597,0.185492,-0.216814,-0.605631,-0.562153,-0.600999,-0.979074,-0.978556,0.018279,-0.08749,-0.131155,-0.097524,-0.120613,-0.119792,1.0,1.0,5.611778,0.0001,225.0,1.0,2.0,5.611878,5.611878,5.611878,5.611878,5.611878,0.586194,5.0255839,2.805889,2.805889,5.611778,3
51816,1,-0.170103,-0.184668,-1.201369,-0.12197,-0.043875,-0.113002,-0.061584,-0.160699,-0.166555,-0.049707,-0.163006,-0.028741,-0.035391,-0.042955,-0.013282,-0.036613,-0.169668,-0.172904,-0.172855,-1.373657,-1.37146,-0.139731,-0.148912,-0.080147,-0.155661,-0.010763,-0.012107,-0.139732,-0.148907,-0.080147,-0.155661,-0.010669,-0.012005,-0.024669,-0.031272,…,0.142525,-0.968903,-1.94385,-1.059868,-1.678997,0.185597,0.185492,-0.216814,-0.617907,-0.577099,-0.613614,0.241128,0.241406,0.018279,-0.08749,-0.131155,-0.097524,-0.120613,-0.119792,1.0,1.0,0.456508,0.0001,226.0,1.0,2.0,0.456608,0.456608,0.456608,0.456608,0.456608,0.22799,0.2285182,0.228254,0.228254,0.456508,3
68869,1,-0.114267,-0.184668,-1.201369,0.028105,-0.043875,-0.113002,0.547008,-0.161652,-0.118555,0.300047,-0.145947,2.017758,1.189967,-0.042955,-0.013282,0.054659,-0.118754,-0.121849,-0.106751,-1.373657,-1.37146,-0.139302,-0.148638,-0.080147,-0.155297,-0.010763,-0.012107,-0.139303,-0.148633,-0.080147,-0.155297,-0.010669,-0.012005,-0.024667,-0.03127,…,0.142525,-0.968903,-1.94385,-1.059868,-1.678997,0.185597,0.185492,-0.216814,-0.611769,-0.569626,-0.607306,-0.979074,-0.978556,0.018279,-0.08749,-0.131155,-0.097524,-0.120613,-0.119792,0.0,1.0,9.3088,0.0001,853.0,3.0,2.0,0.3089,8.0,3.102967,1.0,9.3089,1.229,8.0798,4.6544,4.6544,9.3088,2
89273,1,5.202107,-0.210553,-1.756361,-0.12197,260.090707,-0.113002,-0.061584,5.335864,5.252974,-0.049707,5.327423,-0.028741,-0.035391,265.263236,-0.013282,-0.057401,0.096439,-0.167593,-0.175293,-0.474922,-1.37146,1.828567,1.107041,-0.080147,1.512162,-0.010763,-0.012107,1.828864,1.10713,-0.080147,1.51234,-0.010669,-0.012005,-0.024669,-0.031272,…,0.030362,-1.51191,1.192421,1.085161,-0.222904,0.094796,0.084615,-0.216814,4.010246,1.25863,0.982479,0.118347,0.091066,-0.098889,0.854508,-0.066727,-0.150067,-0.08076,-0.070977,1.0,288.0,852.16468,0.0,445268.0,1.0,13107.0,852.16468,852.16468,852.16468,852.16468,852.16468,1.3000e-7,41.264036,0.065016,0.000441,852.16468,2
195142,1,-0.170284,-0.184668,-1.201369,-0.12197,-0.043875,-0.113002,-0.061584,-0.160884,-0.166737,-0.049707,-0.163191,-0.028741,-0.035391,-0.042955,-0.013282,-0.056954,-0.168408,-0.169792,-0.173069,0.887058,0.884557,-0.13973,-0.148911,-0.080147,-0.15566,-0.010763,-0.012107,-0.139732,-0.148906,-0.080147,-0.15566,-0.010669,-0.012005,-0.024669,-0.031272,…,0.142525,-0.968903,-1.94385,-1.059868,-1.678997,0.185597,0.185492,-0.216814,-0.482868,-0.412695,-0.47485,-0.979074,-0.978556,-0.098889,-0.08749,-0.084674,-0.140597,1.5197,1.521399,0.0,1.0,0.427855,0.0001,225.0,1.0,2.0,0.427955,0.427955,0.427955,0.427955,0.427955,0.0049,0.422955,0.213928,0.213928,0.427855,3
223263,1,-0.138618,-0.184668,-1.201369,-0.12197,-0.043875,-0.113002,-0.061584,-0.128486,-0.134793,-0.049707,-0.130828,-0.028741,-0.035391,-0.042955,-0.013282,-0.048949,-0.136422,-0.133685,-0.135581,0.887058,0.884557,-0.139583,-0.148817,-0.080147,-0.155536,-0.010763,-0.012107,-0.139585,-0.148813,-0.080147,-0.155536,-0.010669,-0.012005,-0.024669,-0.031272,…,-1.01623,-0.968903,-0.898426,0.153209,-1.071885,-1.116918,-1.116948,-0.216814,-0.587217,-0.539735,-0.582077,-0.979074,-0.978556,0.018279,-0.08749,-0.131155,-0.097524,-0.120613,-0.119792,1.0,1.0,5.448093,0.0001,225.0,1.0,2.0,5.448193,5.448193,5.448193,5.448193,5.448193,0.092693,5.355399,2.724046,2.724046,5.448093,3
223265,1,0.229845,-0.171725,-0.646376,-0.12197,0.234026,-0.113002,-0.061584,0.248507,0.236916,-0.049707,0.245743,-0.028741,-0.035391,0.24047,-0.013282,-0.05529,0.202928,-0.021889,-0.116651,-0.680107,-0.906986,-0.139557,-0.148801,-0.080147,-0.155514,-0.010763,-0.012107,-0.139559,-0.148796,-0.080147,-0.155514,-0.010669,-0.012005,-0.024669,-0.031272,…,0.583786,-0.968903,0.669709,0.845346,-0.464773,0.727563,0.691176,-0.216814,3.028144,2.055449,1.803711,0.862509,0.907171,-0.098889,0.066306,0.011857,-0.075987,-0.891306,-1.010039,0.0,3.0,63.864107,0.00015,702.0,1.0,16.0,63.864257,63.864257,63.864257,63.864257,63.864257,0.023154,57.685279,3.991507,0.231161,63.864107,3
245992,1,-0.123707,-0.09203,1.018602,-0.12197,0.035526,-0.113002,-0.061584,-0.113227,-0.119748,-0.049707,-0.115587,-0.028741,-0.035391,0.038023,-0.013282,-0.05719,-0.130887,-0.147136,-0.156819,0.551244,0.562268,-0.13971,-0.148899,-0.080147,-0.155644,-0.010763,-0.012107,-0.139712,-0.148894,-0.080147,-0.155644,-0.010669,-0.012005,-0.024669,-0.031272,…,0.109418,-0.968903,0.146997,0.691058,-0.565959,0.492568,0.426657,-0.216814,1.82507,1.012619,0.734075,-0.526769,-0.359798,-0.098889,0.008632,0.010847,-0.040092,1.097003,1.267268,1.0,1.0,7.81214,0.000458,361.0,1.0,6.0,7.812598,7.812598,7.812598,7.812598,7.812598,0.002312,6.208918,1.302023,0.168599,7.81214,3
267260,1,-0.171722,-0.184668,-1.201369,-0.046932,-0.043875,-0.02914,-0.061584,-0.163485,-0.168347,-0.040994,-0.165305,2.459222,2.415325,-0.042955,-0.013282,-0.055122,-0.170017,-0.171792,-0.174772,0.887058,0.884557,-0.139732,-0.148912,-0.080147,-0.155662,2.679386,2.665377,-0.139734,-0.148907,-0.080146,-0.155662,2.679448,2.665308,-0.024669,-0.031272,…,-1.01623,-0.968903,-1.94385,-1.059868,-1.678997,0.185597,0.185492,-0.193143,-0.57494,-0.532262,-0.563154,-0.979074,-0.978556,-0.098889,-0.08749,-0.084674,-0.140597,1.5197,1.521399,2.0,1.0,0.1998,0.0001,374.0,2.0,2.0,0.025,0.1749,0.09995,0.09995,0.1999,0.025,0.1748,0.0999,0.0999,0.1998,3



Each transaction node has 183 features once txId and class are set aside, too many to describe individually. The first is the time step (1–49, spaced roughly two weeks apart), which indicates when a transaction was broadcast on the Bitcoin network; each time step captures transactions that occurred within a window of about three hours. Together with the time step, the 93 Local_feature columns describe the transaction itself (inputs/outputs, fees, etc.), and the 72 Aggregate_feature columns offer aggregated statistics about neighboring transactions. These 166 features come from the original Elliptic dataset. The last 17 columns (in_txs_degree through out_BTC_total) are the additional transaction statistics added in Elliptic++.

In [None]:
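# Keep every column except the identifier and the label.
# "Time step" stays in as the first feature column (see the 1.0 / 49.0 values
# in the output); add it to the list below to exclude it.
tensor_features_transaction = torch.tensor(
    features_transaction.drop(["txId", "class"]).to_numpy(),
    dtype=torch.float32,
)

tensor_features_transaction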

tensor([[ 1.0000, -0.1696, -0.1847,  ...,  0.2670,  0.2670,  0.5340],
        [ 1.0000, -0.1376, -0.1847,  ...,  2.8059,  2.8059,  5.6118],
        [ 1.0000, -0.1701, -0.1847,  ...,  0.2283,  0.2283,  0.4565],
        ...,
        [49.0000, -0.1670, -0.1396,  ...,     nan,     nan,     nan],
        [49.0000, -0.1722, -0.1396,  ...,     nan,     nan,     nan],
        [49.0000, -0.1722, -0.1396,  ...,     nan,     nan,     nan]])
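
The `nan` entries in the last rows above show that some transactions lack values for a few columns. Most neural network layers propagate `nan` through every downstream computation, so it is worth counting and handling them before training. The sketch below uses a toy tensor `x` as a stand-in for tensor_features_transaction. Zero-filling is only one possible choice and is an assumption here, not the notebook's method.

```python
import torch

# Toy stand-in for tensor_features_transaction with a few missing values.
x = torch.tensor([[1.0, 0.5, float("nan")],
                  [2.0, float("nan"), float("nan")],
                  [3.0, 0.1, 0.2]])

# Count missing values per feature column.
nan_per_column = torch.isnan(x).sum(dim=0)
print(nan_per_column)  # tensor([0, 1, 2])

# Replace every nan with 0.0 (assumed imputation strategy).
x_filled = torch.nan_to_num(x, nan=0.0)
print(torch.isnan(x_filled).any())  # tensor(False)
```

Other strategies, such as filling with a column mean or adding an indicator feature for missingness, may suit a given model better.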

In [None]:
transaction_labels = torch.tensor(
    features_transaction["class"].to_numpy()
).long()
transaction_labels

tensor([3, 3, 3,  ..., 3, 3, 3])

# Edges


Having defined the nodes in our graph structure, we now focus on processing the edges. We first load the edge lists.

In [None]:
edgelist_addr_addr = pol.read_csv("elliptic_bitcoin_dataset/AddrAddr_edgelist.csv")
edgelist_addr_tx = pol.read_csv("elliptic_bitcoin_dataset/AddrTx_edgelist.csv")
edgelist_tx_addr = pol.read_csv("elliptic_bitcoin_dataset/TxAddr_edgelist.csv")
edgelist_tx_tx = pol.read_csv("elliptic_bitcoin_dataset/txs_edgelist.csv")

print("Shape of edgelist_addr_addr:", edgelist_addr_addr.shape)
print("Shape of edgelist_addr_tx:", edgelist_addr_tx.shape)
print("Shape of edgelist_tx_addr:", edgelist_tx_addr.shape)
print("Shape of edgelist_tx_tx:", edgelist_tx_tx.shape)


Shape of edgelist_addr_addr: (2868964, 2)
Shape of edgelist_addr_tx: (477117, 2)
Shape of edgelist_tx_addr: (837124, 2)
Shape of edgelist_tx_tx: (234355, 2)


We're analyzing a Bitcoin transaction dataset to understand the relationships between wallets (addresses) and transactions. We define edge lists (edgelist_addr_addr, edgelist_addr_tx, etc.) loaded from CSV files to represent these connections. These edge lists capture various types of relationships, such as wallet-to-wallet connections, wallet-to-transaction connections, and transaction-to-transaction links. We'll prepare this data as tensors to analyze these relationships using graph neural network techniques.

First, we build several mapping dictionaries. These dictionaries define the relationships between entities in our graph (such as transactions and wallet addresses) and specify how we want to represent them numerically. Each dictionary links original categorical labels to their unique numerical replacements.

In [None]:
# Map nodes to indices
features_transaction = features_transaction.with_columns(
    mapped_id=pol.arange(0, features_transaction.shape[0])
)

features_wallets = features_wallets.with_columns(
    mapped_id=pol.arange(0, features_wallets.shape[0])
)

wallets_mapping = dict(
    zip(features_wallets["address"], features_wallets["mapped_id"])
)
transaction_mapping = dict(
    zip(features_transaction["txId"], features_transaction["mapped_id"])
)
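
A plain dictionary keeps only one value per key. That matters here because the wallet table preview shows the same address on several rows, once per time step. The toy sketch below, with made-up addresses, shows what `dict(zip(...))` does in that case and one possible way to assign a single index per unique address. Whether to deduplicate wallets is a modeling decision the notebook does not take, so the second mapping is illustrative only.

```python
# Toy stand-ins for the "address" column and its row indices.
addresses = ["addr_a", "addr_b", "addr_b", "addr_c"]
row_ids = range(len(addresses))

# Same construction as wallets_mapping: later duplicates overwrite earlier ones.
mapping = dict(zip(addresses, row_ids))
print(mapping)  # {'addr_a': 0, 'addr_b': 2, 'addr_c': 3}

# One possible alternative: one index per unique address,
# numbered in order of first appearance.
unique_mapping = {addr: i for i, addr in enumerate(dict.fromkeys(addresses))}
print(unique_mapping)  # {'addr_a': 0, 'addr_b': 1, 'addr_c': 2}
```

With the first mapping, every edge that touches `addr_b` points at row 2, and row 1 never receives an edge.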


Next, we use the prepare_edge_index function repeatedly on different edge lists. For each edge list, the function iterates through it and substitutes the existing categorical labels with their corresponding numbers from the relevant mapping dictionary. Finally, it transforms the processed data into a PyTorch tensor, ensuring it's in the correct format for our graph neural network model.

In [None]:
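def prepare_edge_index(edgelist, mapping_dict):
    """Map raw IDs to node indices and return a [2, num_edges] long tensor."""
    for column, mapping in mapping_dict.items():
        # mapping.get(x, x) leaves an ID unchanged if it has no entry
        # in the mapping.
        edgelist = edgelist.with_columns(
            edgelist[column]
            .map_elements(lambda x: mapping.get(x, x), return_dtype=pol.Int64)
            .alias(column)
        )

    # PyTorch Geometric expects edge_index with shape [2, num_edges].
    return torch.tensor(edgelist.to_numpy().T, dtype=torch.long).contiguous()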

tx_tx_dict = {"txId1": transaction_mapping, "txId2": transaction_mapping}
addr_addr_dict = {
    "input_address": wallets_mapping,
    "output_address": wallets_mapping,
}
addr_tx_dict = {"input_address": wallets_mapping, "txId": transaction_mapping}
tx_addr_dict = {"txId": transaction_mapping, "output_address": wallets_mapping}

addr_tx_edge_index = prepare_edge_index(edgelist_addr_tx, addr_tx_dict)
tx_addr_edge_index = prepare_edge_index(edgelist_tx_addr, tx_addr_dict)
addr_addr_edge_index = prepare_edge_index(
    edgelist_addr_addr, addr_addr_dict
)
tx_tx_edge_index = prepare_edge_index(edgelist_tx_tx, tx_tx_dict)

print("Shape of addr_tx_edge_index:", addr_tx_edge_index.shape)
print("Shape of tx_addr_edge_index:", tx_addr_edge_index.shape)
print("Shape of addr_addr_edge_index:", addr_addr_edge_index.shape)
print("Shape of tx_tx_edge_index:", tx_tx_edge_index.shape)




Shape of addr_tx_edge_index: torch.Size([2, 477117])
Shape of tx_addr_edge_index: torch.Size([2, 837124])
Shape of addr_addr_edge_index: torch.Size([2, 2868964])
Shape of tx_tx_edge_index: torch.Size([2, 234355])
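
Because prepare_edge_index falls back to the original value when an ID is missing from the mapping, a quick range check on each edge index can catch unmapped nodes early. The sketch below uses a toy edge index with made-up node counts. The helper `check_edge_index` is hypothetical.

```python
import torch

def check_edge_index(edge_index, num_src_nodes, num_dst_nodes):
    """Return True if the tensor is [2, num_edges] and all indices are in range."""
    if edge_index.dim() != 2 or edge_index.size(0) != 2:
        return False
    src, dst = edge_index  # first row: source nodes, second row: destinations
    return bool(
        (src >= 0).all() and (src < num_src_nodes).all()
        and (dst >= 0).all() and (dst < num_dst_nodes).all()
    )

edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]], dtype=torch.long)
print(check_edge_index(edge_index, num_src_nodes=3, num_dst_nodes=3))  # True
print(check_edge_index(edge_index, num_src_nodes=2, num_dst_nodes=3))  # False
```

For the real data, the node counts would be the number of rows in the wallet and transaction feature tensors, for example `check_edge_index(addr_tx_edge_index, tensor_features_wallets.shape[0], tensor_features_transaction.shape[0])`.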


# Masks


Creating masks for datasets is crucial because it facilitates the division of our Bitcoin transaction data into training, validation, and testing sets. This division is essential for building graph neural network models that generalize well to unseen data. The training mask allows the model to learn from a representative subset of transactions. In contrast, the validation mask helps us prevent overfitting by monitoring model performance on data it hasn't trained on. Finally, the testing mask provides an independent set of transactions to evaluate our model's final performance. It gives us a realistic idea of its ability to make accurate predictions in real-world scenarios. In the following sections, we create a train, validation, and test split for our dataset.

In [None]:
def split_data(num_data, splits=(0.8, 0.1)):
    assert len(splits) == 2, "The length of splits should be 2"
    assert sum(splits) < 1, "The sum of splits should be less than 1"

    # Generate numbers
    num_train = int(splits[0] * num_data)
    num_val = int(splits[1] * num_data)
    num_test = num_data - num_train - num_val

    # Generate ranges
    train_index = torch.arange(num_train, dtype=torch.long)
    val_index = torch.arange(num_train, num_train + num_val, dtype=torch.long)
    test_index = torch.arange(
        num_train + num_val, num_train + num_val + num_test, dtype=torch.long
    )

    # Create masks
    train_mask = torch.zeros(num_data, dtype=torch.bool)
    val_mask = torch.zeros(num_data, dtype=torch.bool)
    test_mask = torch.zeros(num_data, dtype=torch.bool)
    train_mask[train_index] = True
    val_mask[val_index] = True
    test_mask[test_index] = True

    return train_mask, val_mask, test_mask

wallet_train_mask, wallet_val_mask, wallet_test_mask = split_data(features_wallets.shape[0])
print("Shape of wallet_train_mask:", wallet_train_mask.shape)
print("Shape of wallet_val_mask:", wallet_val_mask.shape)
print("Shape of wallet_test_mask:", wallet_test_mask.shape)


total_records = features_wallets.shape[0]
print("Sum of wallet_train_mask as a fraction of total records:", wallet_train_mask.sum().item() / total_records)
print("Sum of wallet_val_mask as a fraction of total records:", wallet_val_mask.sum().item() / total_records)
print("Sum of wallet_test_mask as a fraction of total records:", wallet_test_mask.sum().item() / total_records)


Shape of wallet_train_mask: torch.Size([1268260])
Shape of wallet_val_mask: torch.Size([1268260])
Shape of wallet_test_mask: torch.Size([1268260])
Sum of wallet_train_mask as a fraction of total records: 0.8
Sum of wallet_val_mask as a fraction of total records: 0.1
Sum of wallet_test_mask as a fraction of total records: 0.1


In [None]:
# Mask data
transaction_train_mask, transaction_val_mask, transaction_test_mask = split_data(features_transaction.shape[0])

print("Shape of transaction_train_mask:", transaction_train_mask.shape)
print("Shape of transaction_val_mask:", transaction_val_mask.shape)
print("Shape of transaction_test_mask:", transaction_test_mask.shape)

total_records = features_transaction.shape[0]
print("Sum of transaction_train_mask as a fraction of total records:", transaction_train_mask.sum().item() / total_records)
print("Sum of transaction_val_mask as a fraction of total records:", transaction_val_mask.sum().item() / total_records)
print("Sum of transaction_test_mask as a fraction of total records:", transaction_test_mask.sum().item() / total_records)



Shape of transaction_train_mask: torch.Size([203769])
Shape of transaction_val_mask: torch.Size([203769])
Shape of transaction_test_mask: torch.Size([203769])
Sum of transaction_train_mask as a fraction of total records: 0.7999990184964347
Sum of transaction_val_mask as a fraction of total records: 0.0999955832339561
Sum of transaction_test_mask as a fraction of total records: 0.10000539826960922


The split_data function divides a dataset into training, validation, and testing subsets. It takes num_data, the total number of data points, and splits, an optional pair giving the training and validation fractions; whatever remains goes to the test set. From these fractions it computes the size of each subset and assigns contiguous index ranges in row order: the first rows go to training, the next to validation, and the last to testing. No shuffling takes place. It then creates one boolean mask per subset, each a tensor of length num_data initialized to False, and sets the entries for the subset's indices to True. The function returns the three masks train_mask, val_mask, and test_mask, which select the nodes used for training the model, monitoring it during training, and measuring its final performance.
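
split_data assigns rows to subsets by position, so the result depends on the row order of the input file. Where a shuffled split is preferred, one option is to permute the indices first. The sketch below is an alternative, not the notebook's method. The function `random_split_masks` and the seed value are hypothetical.

```python
import torch

def random_split_masks(num_data, splits=(0.8, 0.1), seed=0):
    """Create shuffled train/val/test boolean masks; the remainder goes to test."""
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_data, generator=generator)

    num_train = int(splits[0] * num_data)
    num_val = int(splits[1] * num_data)

    train_mask = torch.zeros(num_data, dtype=torch.bool)
    val_mask = torch.zeros(num_data, dtype=torch.bool)
    test_mask = torch.zeros(num_data, dtype=torch.bool)
    # Slice the shuffled indices into three disjoint groups.
    train_mask[perm[:num_train]] = True
    val_mask[perm[num_train:num_train + num_val]] = True
    test_mask[perm[num_train + num_val:]] = True
    return train_mask, val_mask, test_mask

train_mask, val_mask, test_mask = random_split_masks(10)
print(train_mask.sum().item(), val_mask.sum().item(), test_mask.sum().item())  # 8 1 1
```

Fixing the seed keeps the split reproducible across runs.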

# HeteroData
HeteroData in PyTorch Geometric is a data structure that represents graphs with multiple node and edge types. This is important for our Bitcoin graph network analysis because it allows us to model the complex relationships inherent in Bitcoin transactions: we can have different node types (e.g., wallets and transactions), each with their own distinct sets of features, and we can represent different kinds of relationships between them (e.g., a wallet sending Bitcoin to another wallet, a wallet being involved in a transaction). HeteroData enables us to capture this richness, making our graph neural network models more expressive and potentially leading to better insights into Bitcoin transaction patterns.

We construct a heterogeneous graph representation of our Bitcoin dataset using PyTorch Geometric's HeteroData structure. We define two distinct node types: "wallets" and "transactions," and associate features and labels (potentially related to illicit activities) with each type. We establish different edge types to represent connections between transactions, between transactions and addresses, and between addresses themselves. Finally, we include training, validation, and testing masks on our nodes to prepare the graph for training and evaluating a graph neural network model to uncover insights within the Bitcoin network.

In [None]:
!pip install torch_geometric



In [None]:
from torch_geometric.data import HeteroData

df = HeteroData(
    wallets={
        "x": tensor_features_wallets,
        "y": wallet_labels,
        "train_mask": wallet_train_mask,
        "val_mask": wallet_val_mask,
        "test_mask": wallet_test_mask,
    },
    transactions={
        "x": tensor_features_transaction,
        "y": transaction_labels,
        "train_mask": transaction_train_mask,
        "val_mask": transaction_val_mask,
        "test_mask": transaction_test_mask,
    },
)

# Edge type keys must reuse the node type names defined above
# ("wallets" and "transactions") so that edges attach to those nodes.
df["transactions", "transactions"].edge_index = tx_tx_edge_index
df["transactions", "wallets"].edge_index = tx_addr_edge_index
df["wallets", "transactions"].edge_index = addr_tx_edge_index
df["wallets", "wallets"].edge_index = addr_addr_edge_index

print(df)

HeteroData(
  wallets={
    x=[1268260, 55],
    y=[1268260],
    train_mask=[1268260],
    val_mask=[1268260],
    test_mask=[1268260],
  },
  transactions={
    x=[203769, 183],
    y=[203769],
    train_mask=[203769],
    val_mask=[203769],
    test_mask=[203769],
  },
  (transactions, to, transactions)={ edge_index=[2, 234355] },
  (transactions, to, wallets)={ edge_index=[2, 837124] },
  (wallets, to, transactions)={ edge_index=[2, 477117] },
  (wallets, to, wallets)={ edge_index=[2, 2868964] }
)


The printed summary confirms the structure of the dataset. It shows two node types, "wallets" (1,268,260 nodes with 55 features each) and "transactions" (203,769 nodes with 183 features each). Each node type carries labels (potentially indicating illicit activity) and training, validation, and testing masks with one entry per node. There are four edge types: transaction-to-transaction, transaction-to-address, address-to-transaction, and address-to-address. The heterogeneous graph is now ready for training and evaluating a graph neural network model.