# **Elliptic++ Transactions Dataset**


---
---


Released by: Youssef Elmougy, Ling Liu



School of Computer Science, Georgia Institute of Technology

Contact: yelmougy3@gatech.edu


---

Github Repository: [https://www.github.com/git-disl/EllipticPlusPlus](https://www.github.com/git-disl/EllipticPlusPlus)


If you use our dataset in your work, please cite our paper:





>> Youssef Elmougy and Ling Liu. 2023. Demystifying Fraudulent Transactions and Illicit Nodes in the Bitcoin Network for Financial Forensics.

---



## [SETUP] Import libraries and csv files 

Download the dataset from: [https://www.github.com/git-disl/EllipticPlusPlus](https://www.github.com/git-disl/EllipticPlusPlus)

In [None]:
from google.colab import drive
drive.mount('/content/drive')
!cp drive/My\ Drive/Elliptic++\ Dataset/txs_features.csv ./
!cp drive/My\ Drive/Elliptic++\ Dataset/txs_classes.csv ./
!cp drive/My\ Drive/Elliptic++\ Dataset/txs_edgelist.csv ./

Mounted at /content/drive


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import plotly.graph_objs as go 
import plotly.offline as py 
import math

!pip install -U ipython 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import VotingClassifier
from sklearn.base import clone 

import xgboost as xgb

In [None]:
!pip install eli5
import eli5
from eli5.sklearn import PermutationImportance

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
  Preparing metadata (setup.py) ... done
Collecting jinja2>=3.0.0
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107748 sha256=138c7b7afc731dc3a39e6bd4e82b7a9fa2f965be699cbe7c19820bc09cf8bfa4
  Stored in directory: /root/.cache/pip/wheels/85/ac/25/ffcd87ef8f9b1eec324fdf339359be71f22612459d8c75d89c
Successfully built eli5
Installing collected packages: jinja2, eli5
  Attempting uninstall: jinja2
   

## Transactions Dataset Overview


---

This section loads the three CSV files (txs_features, txs_classes, txs_edgelist) and gives a quick overview of the dataset's structure and features.

Load the saved transactions dataset CSV files:

In [None]:
print("\nTransaction features: \n")
df_txs_features = pd.read_csv("txs_features.csv")
df_txs_features

print("\nTransaction classes: \n")
df_txs_classes = pd.read_csv("txs_classes.csv")
df_txs_classes

print("\nTransaction-Transaction edgelist: \n")
df_txs_edgelist = pd.read_csv("txs_edgelist.csv")
df_txs_edgelist


Transaction features: 



Unnamed: 0,txId,Time step,Local_feature_1,Local_feature_2,Local_feature_3,Local_feature_4,Local_feature_5,Local_feature_6,Local_feature_7,Local_feature_8,...,in_BTC_min,in_BTC_max,in_BTC_mean,in_BTC_median,in_BTC_total,out_BTC_min,out_BTC_max,out_BTC_mean,out_BTC_median,out_BTC_total
0,3321,1,-0.169615,-0.184668,-1.201369,-0.121970,-0.043875,-0.113002,-0.061584,-0.160199,...,0.534072,0.534072,0.534072,0.534072,0.534072,1.668990e-01,0.367074,0.266986,0.266986,0.533972
1,11108,1,-0.137586,-0.184668,-1.201369,-0.121970,-0.043875,-0.113002,-0.061584,-0.127429,...,5.611878,5.611878,5.611878,5.611878,5.611878,5.861940e-01,5.025584,2.805889,2.805889,5.611778
2,51816,1,-0.170103,-0.184668,-1.201369,-0.121970,-0.043875,-0.113002,-0.061584,-0.160699,...,0.456608,0.456608,0.456608,0.456608,0.456608,2.279902e-01,0.228518,0.228254,0.228254,0.456508
3,68869,1,-0.114267,-0.184668,-1.201369,0.028105,-0.043875,-0.113002,0.547008,-0.161652,...,0.308900,8.000000,3.102967,1.000000,9.308900,1.229000e+00,8.079800,4.654400,4.654400,9.308800
4,89273,1,5.202107,-0.210553,-1.756361,-0.121970,260.090707,-0.113002,-0.061584,5.335864,...,852.164680,852.164680,852.164680,852.164680,852.164680,1.300000e-07,41.264036,0.065016,0.000441,852.164680
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203764,158304003,49,-0.165622,-0.139563,1.018602,-0.121970,-0.043875,-0.113002,-0.061584,-0.156113,...,,,,,,,,,,
203765,158303998,49,-0.167040,-0.139563,1.018602,-0.121970,-0.043875,-0.113002,-0.061584,-0.157564,...,,,,,,,,,,
203766,158303966,49,-0.167040,-0.139563,1.018602,-0.121970,-0.043875,-0.113002,-0.061584,-0.157564,...,,,,,,,,,,
203767,161526077,49,-0.172212,-0.139573,1.018602,-0.121970,-0.043875,-0.113002,-0.061584,-0.162856,...,,,,,,,,,,



Transaction classes: 



Unnamed: 0,txId,class
0,3321,3
1,11108,3
2,51816,3
3,68869,2
4,89273,2
...,...,...
203764,158304003,3
203765,158303998,3
203766,158303966,3
203767,161526077,3



Transaction-Transaction edgelist: 



Unnamed: 0,txId1,txId2
0,230425980,5530458
1,232022460,232438397
2,230460314,230459870
3,230333930,230595899
4,232013274,232029206
...,...,...
234350,158365409,157930723
234351,188708874,188708879
234352,157659064,157659046
234353,87414554,106877725


Data structure for an example transaction (txId = 272145560):

In [None]:
print("\ntxs_features.csv for txId = 272145560\n")
df_txs_features[df_txs_features['txId']==272145560]

print("\ntxs_classes.csv for txId = 272145560\n")
df_txs_classes[df_txs_classes['txId']==272145560]

print("\ntxs_edgelist.csv for txId = 272145560\n")
df_txs_edgelist[(df_txs_edgelist['txId1']==272145560) | (df_txs_edgelist['txId2']==272145560)]


txs_features.csv for txId=272145560



Unnamed: 0,txId,Time step,Local_feature_1,Local_feature_2,Local_feature_3,Local_feature_4,Local_feature_5,Local_feature_6,Local_feature_7,Local_feature_8,...,in_BTC_min,in_BTC_max,in_BTC_mean,in_BTC_median,in_BTC_total,out_BTC_min,out_BTC_max,out_BTC_mean,out_BTC_median,out_BTC_total
105573,272145560,24,-0.155493,-0.107012,-1.201369,-0.12197,-0.043875,-0.113002,-0.061584,-0.145749,...,2.7732,2.7732,2.7732,2.7732,2.7732,0.001917,2.770883,1.3864,1.3864,2.7728



txs_classes.csv for txId=272145560



Unnamed: 0,txId,class
105573,272145560,1



txs_edgelist.csv for txId=272145560



Unnamed: 0,txId1,txId2
123072,272145560,296926618
123272,272145560,272145556
125873,299475624,272145560
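
The edgelist filter above returns every edge that touches the transaction, regardless of direction. Assuming each row is a directed edge from `txId1` to `txId2` (an interpretation this notebook does not spell out), the rows can be split into outgoing and incoming neighbors. A minimal sketch using the three edges shown above; the helper `neighbors` is hypothetical and uses only the standard library:

```python
from collections import defaultdict

# The three edges listed above for txId = 272145560, as (txId1, txId2) pairs.
edges = [
    (272145560, 296926618),
    (272145560, 272145556),
    (299475624, 272145560),
]

def neighbors(edge_list, tx_id):
    """Return (successors, predecessors) of tx_id, assuming txId1 -> txId2 edges."""
    successors, predecessors = defaultdict(list), defaultdict(list)
    for src, dst in edge_list:
        successors[src].append(dst)
        predecessors[dst].append(src)
    return successors[tx_id], predecessors[tx_id]

out_txs, in_txs = neighbors(edges, 272145560)
print("outgoing:", out_txs)
print("incoming:", in_txs)
```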



Transaction features (94 local features, 72 aggregate features, 17 augmented features):


In [None]:
list(df_txs_features.columns)

['txId',
 'Time step',
 'class',
 'Local_feature_1',
 'Local_feature_2',
 'Local_feature_3',
 'Local_feature_4',
 'Local_feature_5',
 'Local_feature_6',
 'Local_feature_7',
 'Local_feature_8',
 'Local_feature_9',
 'Local_feature_10',
 'Local_feature_11',
 'Local_feature_12',
 'Local_feature_13',
 'Local_feature_14',
 'Local_feature_15',
 'Local_feature_16',
 'Local_feature_17',
 'Local_feature_18',
 'Local_feature_19',
 'Local_feature_20',
 'Local_feature_21',
 'Local_feature_22',
 'Local_feature_23',
 'Local_feature_24',
 'Local_feature_25',
 'Local_feature_26',
 'Local_feature_27',
 'Local_feature_28',
 'Local_feature_29',
 'Local_feature_30',
 'Local_feature_31',
 'Local_feature_32',
 'Local_feature_33',
 'Local_feature_34',
 'Local_feature_35',
 'Local_feature_36',
 'Local_feature_37',
 'Local_feature_38',
 'Local_feature_39',
 'Local_feature_40',
 'Local_feature_41',
 'Local_feature_42',
 'Local_feature_43',
 'Local_feature_44',
 'Local_feature_45',
 'Local_feature_46',
 'Local_fe

## EASY, HARD, and AVERAGE cases Analysis


---

This section analyzes EASY, HARD, and AVERAGE cases. The categories apply only to illicit transactions:


1.   **EASY** cases: all models classify an illicit transaction correctly
2.   **HARD** cases: all models classify an illicit transaction incorrectly
3.   **AVERAGE** cases: at least one model classifies an illicit transaction correctly and at least one classifies it incorrectly
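
The three definitions above reduce to counting, for each illicit transaction, how many models predicted it correctly. The following is a minimal sketch on made-up data, not the notebook's own implementation: `y_true`, `preds`, and the three toy models are hypothetical (the notebook compares five models).

```python
import numpy as np

# Hypothetical ground truth and predictions from three models (1 = illicit, 0 = licit).
y_true = np.array([1, 1, 1, 0, 1, 0])
preds = np.array([
    [1, 0, 1, 0, 0, 0],  # model A
    [1, 0, 0, 0, 1, 1],  # model B
    [1, 0, 1, 1, 0, 0],  # model C
])

illicit = y_true == 1
n_models = preds.shape[0]
# Number of models that classified each sample correctly (broadcast over rows).
n_correct = (preds == y_true).sum(axis=0)

easy = np.flatnonzero(illicit & (n_correct == n_models))   # every model correct
hard = np.flatnonzero(illicit & (n_correct == 0))          # every model wrong
average = np.flatnonzero(illicit & (n_correct > 0) & (n_correct < n_models))

print("EASY:", easy.tolist())
print("HARD:", hard.tolist())
print("AVERAGE:", average.tolist())
```

Under these definitions the three groups partition the illicit samples, so their sizes add up to the number of illicit transactions.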


Ground-truth class counts in the test set ('0': licit, '1': illicit):

In [None]:
y_test.value_counts()

0    15263
1     1083
Name: class, dtype: int64
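
These counts show a strongly imbalanced test set. A short worked calculation using only the two numbers above gives the illicit share and the accuracy that a hypothetical baseline labeling every transaction as licit would reach, which is one reason the analysis focuses on illicit cases rather than overall accuracy:

```python
# Class counts taken from the output above.
n_licit, n_illicit = 15263, 1083
n_total = n_licit + n_illicit

# Fraction of the test set that is illicit.
illicit_share = n_illicit / n_total
# Accuracy of a hypothetical baseline that labels every transaction as licit.
all_licit_accuracy = n_licit / n_total

print(f"test set size:      {n_total}")
print(f"illicit share:      {illicit_share:.2%}")
print(f"all-licit accuracy: {all_licit_accuracy:.2%}")
```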

In [None]:
# EASY CASES: illicit transactions that all 5 models identified correctly
y_true = y_test.values
all_preds = [y_preds_LR, y_preds_RF, y_preds_MLP, y_preds_LSTM, y_preds_XGB]
indices_easy_fraud = [
    i for i in range(len(y_true))
    if y_true[i] == 1 and all(preds[i] == y_true[i] for preds in all_preds)
]
print("Number of EASY CASES: ")
print(len(indices_easy_fraud))
# The 'wrong_predictions_*' names are kept for later cells; these rows are correct predictions.
wrong_predictions_easy = X_testing_timesteps.iloc[indices_easy_fraud,:]
wrong_predictions_easy2 = wrong_predictions_easy.drop(columns=wrong_predictions_easy.columns.difference(['txId', 'class', 'Time step']))

Number of EASY CASES: 
49


In [None]:
# EASY CASES: number in each time step
print("Number of EASY CASES in each time step:")
wrong_predictions_easy2['Time step'].value_counts().sort_index()

Number of EASY CASES in each time step:


35    32
37     2
38     5
39     5
40     1
41     1
42     3
Name: Time step, dtype: int64

In [None]:
# HARD CASES: illicit transactions that all 5 models failed to identify
y_true = y_test.values
all_preds = [y_preds_LR, y_preds_RF, y_preds_MLP, y_preds_LSTM, y_preds_XGB]
indices_hard_fraud = [
    i for i in range(len(y_true))
    if y_true[i] == 1 and all(preds[i] != y_true[i] for preds in all_preds)
]
print("Number of HARD CASES: ")
print(len(indices_hard_fraud))
# Keep only the identifying columns (txId, class, Time step) of the missed transactions.
wrong_predictions_hard = X_testing_timesteps.iloc[indices_hard_fraud,:]
wrong_predictions_hard2 = wrong_predictions_hard.drop(columns=wrong_predictions_hard.columns.difference(['txId', 'class', 'Time step']))

Number of HARD CASES: 
243


In [None]:
# HARD CASES: number in each time step
print("Number of HARD CASES in each time step:")
wrong_predictions_hard2['Time step'].value_counts().sort_index()

Number of HARD CASES in each time step:


35     4
37    10
38     7
39     4
40    28
41     6
42    36
43    22
44    20
45     4
46     1
47    21
48    27
49    53
Name: Time step, dtype: int64

In [None]:
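# AVERAGE CASES: illicit transactions that some models missed but at least 1 of the 5 models identified
# NOTE: only one pattern is active below (LR, RF, and XGB correct; MLP and LSTM incorrect),
# so the count covers that single combination. The commented-out lines are alternative patterns.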
indices_avg_fraud = [i for i in range(len(y_test.values)) if ( 
    #((y_test.values[i] == 1) and (y_test.values[i] == y_preds_LR[i]) and (y_test.values[i] != y_preds_RF[i]) and (y_test.values[i] != y_preds_MLP[i]) and (y_test.values[i] != y_preds_LSTM[i]) and (y_test.values[i] != y_preds_XGB[i]))# or
    ((y_test.values[i] == 1) and (y_test.values[i] == y_preds_LR[i]) and (y_test.values[i] == y_preds_RF[i]) and (y_test.values[i] != y_preds_MLP[i]) and (y_test.values[i] != y_preds_LSTM[i]) and (y_test.values[i] == y_preds_XGB[i]))# or
    #((y_test.values[i] == 1) and (y_test.values[i] == y_preds_LR[i]) and (y_test.values[i] == y_preds_RF[i]) and (y_test.values[i] != y_preds_MLP[i])) or
    #((y_test.values[i] == 1) and (y_test.values[i] == y_preds_LR[i]) and (y_test.values[i] != y_preds_RF[i]) and (y_test.values[i] == y_preds_MLP[i])) or
    #((y_test.values[i] == 1) and (y_test.values[i] != y_preds_LR[i]) and (y_test.values[i] == y_preds_RF[i]) and (y_test.values[i] != y_preds_MLP[i])) or
    #((y_test.values[i] == 1) and (y_test.values[i] != y_preds_LR[i]) and (y_test.values[i] == y_preds_RF[i]) and (y_test.values[i] == y_preds_MLP[i])) or
    #((y_test.values[i] == 1) and (y_test.values[i] != y_preds_LR[i]) and (y_test.values[i] != y_preds_RF[i]) and (y_test.values[i] == y_preds_MLP[i]))
    )]
print("Number of AVERAGE CASES: ")
print(len(indices_avg_fraud))
wrong_predictions_avg = X_testing_timesteps.iloc[indices_avg_fraud,:]
wrong_predictions_avg2 = wrong_predictions_avg.drop(columns=wrong_predictions_avg.columns.difference(['txId', 'class', 'Time step']))

Number of AVERAGE CASES: 
98


In [None]:
# AVERAGE CASES: number in each time step
print("Number of AVERAGE CASES in each time step:")
wrong_predictions_avg2['Time step'].value_counts().sort_index()

Number of AVERAGE CASES in each time step:


35     6
36     1
37    10
38    27
39    18
40    10
41     5
42    21
Name: Time step, dtype: int64
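
The three per-time-step outputs are easier to compare side by side. The sketch below copies the counts printed above into plain dictionaries and tabulates them. Note that the AVERAGE counts are those produced by the cell above, which selects a single prediction pattern; `late_hard_share` is an illustrative name for the fraction of HARD cases in time steps 43 to 49.

```python
# Per-time-step counts copied from the three outputs above.
easy = {35: 32, 37: 2, 38: 5, 39: 5, 40: 1, 41: 1, 42: 3}
hard = {35: 4, 37: 10, 38: 7, 39: 4, 40: 28, 41: 6, 42: 36, 43: 22,
        44: 20, 45: 4, 46: 1, 47: 21, 48: 27, 49: 53}
average = {35: 6, 36: 1, 37: 10, 38: 27, 39: 18, 40: 10, 41: 5, 42: 21}

# One row per time step that appears in any of the three groups.
steps = sorted(set(easy) | set(hard) | set(average))
print(f"{'step':>4} {'easy':>5} {'hard':>5} {'avg':>5}")
for step in steps:
    print(f"{step:>4} {easy.get(step, 0):>5} {hard.get(step, 0):>5} {average.get(step, 0):>5}")

# Totals should match the counts printed by the earlier cells.
totals = {name: sum(counts.values())
          for name, counts in [("easy", easy), ("hard", hard), ("average", average)]}
print(totals)

# Share of HARD cases that fall in time steps 43 and later.
late_hard_share = sum(n for step, n in hard.items() if step >= 43) / totals["hard"]
print(f"HARD cases in time steps >= 43: {late_hard_share:.1%}")
```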
