# TDAMapper Visualization of the BitcoinHeist Dataset
This notebook demonstrates how to use KeplerMapper, a Python implementation of the Mapper algorithm from topological data analysis (TDA), to explore the BitcoinHeist dataset for ransomware detection on the Bitcoin blockchain. We preprocess the data, project it with a dimensionality reduction technique (PCA), and visualize the resulting Mapper graph.


In [19]:
# Ensure that the necessary packages are installed
#!pip install kmapper scikit-learn pandas matplotlib

In [20]:
# Import necessary libraries
import pandas as pd
import numpy as np
import kmapper as km
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA




## Overview of the BitcoinHeist Dataset

The BitcoinHeist dataset is extracted from the Bitcoin blockchain and designed for the analysis of ransomware payments. It describes Bitcoin addresses through features of the transactions they take part in, and labels the addresses known to have received ransom payments.

URL: https://www.ijcai.org/proceedings/2020/0612.pdf

### Dataset Description

- **Data Points:** Each row describes a Bitcoin address on a given day, sampled over several years of blockchain history.
- **Features:** Each row is characterized by the following features (the graph features are paraphrased from the dataset's UCI description):
  - `address`: The Bitcoin address.
  - `year`: The year of the observation.
  - `day`: The day of the year (1 to 365).
  - `length`: Quantifies mixing rounds, in which coins are repeatedly received and redistributed through newly created addresses to hide their origin.
  - `weight`: Quantifies merge behavior, where coins from multiple addresses are accumulated into a final address, measured by amount.
  - `count`: Also quantifies merging, but by the number of transactions rather than by amount.
  - `looped`: Counts transactions that split their coins, move them along different paths, and merge them again in a single address.
  - `neighbors`: The number of neighboring nodes in the transaction graph.
  - `income`: The amount of Bitcoin received, in satoshis (1 BTC = 100,000,000 satoshis).
  - `label`: The ransomware family name prefixed by its data source (e.g., `montrealCryptoLocker`), or `white` for addresses not known to be associated with ransomware.
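
The `year`, `day`, and `income` columns are easier to read once converted. The sketch below assumes that `day` is a 1-based day of the year and that `income` is denominated in satoshis, as the UCI dataset description states. The helper names are illustrative and not part of any library.

```python
from datetime import date, timedelta

SATOSHIS_PER_BTC = 100_000_000  # 1 BTC = 10^8 satoshis


def to_calendar_date(year: int, day: int) -> date:
    """Convert a (year, 1-based day-of-year) pair into a calendar date."""
    return date(year, 1, 1) + timedelta(days=day - 1)


def satoshis_to_btc(income: float) -> float:
    """Convert an amount in satoshis to BTC (assumes income is in satoshis)."""
    return income / SATOSHIS_PER_BTC


# Values taken from the first sampled row displayed later in this notebook
print(to_calendar_date(2015, 365))
print(satoshis_to_btc(125920600.0))
```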

### Use Cases

This dataset is widely used in cybersecurity research to develop models that can identify and classify ransomware transactions based on blockchain analysis. By applying machine learning and data mining techniques, researchers can:
- Detect unusual patterns indicating fraudulent activities.
- Develop systems that automatically flag transactions related to ransomware.
- Analyze trends in ransomware evolution over time based on transaction data.

### Dataset Accessibility

The BitcoinHeist dataset is typically available through academic data repositories and can be used under specific terms and conditions for educational and research purposes.

- Kaggle: https://www.kaggle.com/datasets/sapere0/bitcoinheist-ransomware-dataset
- UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/526/bitcoinheistransomwareaddressdataset

### Visualization and Analysis

Using data visualization and topological data analysis tools like KeplerMapper, we can explore complex patterns and relationships within the data, providing insights that are critical for enhancing blockchain security measures.

---

**Note:** The original article uses the R TDAmapper package (applied to temporal snapshots), but that library code is now outdated. Below we give only a Python overview of the data, without a rigorous backtesting approach.


In [28]:
# Load the dataset directly from GitHub (requires network access).
# Alternatively, download BitcoinHeistData.csv and pass its local path instead.
full_df = pd.read_csv('https://github.com/jihwankimqd/Bitcoin_Heist_Classification/raw/master/BitcoinHeistData.csv', delimiter=',')

# Display unique values in the 'label' column to understand its composition
print(full_df['label'].unique())

# 'white' marks addresses not known to be ransomware; every other label is a ransomware family
white_transactions = full_df[full_df['label'] == 'white']
non_white_transactions = full_df[full_df['label'] != 'white']

# Sample 500 white and 500 non-white transactions
white_sample = white_transactions.sample(n=500, random_state=42)  # random_state for reproducibility
non_white_sample = non_white_transactions.sample(n=500, random_state=42)

# Combine the two samples into a single DataFrame
df = pd.concat([white_sample, non_white_sample], ignore_index=True)


['princetonCerber' 'princetonLocky' 'montrealCryptoLocker'
 'montrealCryptXXX' 'paduaCryptoWall' 'montrealWannaCry'
 'montrealDMALockerv3' 'montrealCryptoTorLocker2015' 'montrealSamSam'
 'montrealFlyper' 'montrealNoobCrypt' 'montrealDMALocker' 'montrealGlobe'
 'montrealEDA2' 'paduaKeRanger' 'montrealVenusLocker' 'montrealXTPLocker'
 'paduaJigsaw' 'montrealGlobev3' 'montrealJigSaw' 'montrealXLockerv5.0'
 'montrealXLocker' 'montrealRazy' 'montrealCryptConsole'
 'montrealGlobeImposter' 'montrealSam' 'montrealComradeCircle'
 'montrealAPT' 'white']
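
Each ransomware label above combines a lowercase prefix with a family name. As a minimal sketch, assuming the prefixes `princeton`, `montreal`, and `padua` identify the source of the label, the two parts can be separated as follows. The helper name `split_label` is hypothetical.

```python
from collections import Counter
from typing import Tuple

# A few labels copied from the output above
labels = ["princetonCerber", "montrealCryptoLocker", "paduaCryptoWall",
          "montrealWannaCry", "white"]

# Assumption: the lowercase prefix names the source of the label
SOURCE_PREFIXES = ("princeton", "montreal", "padua")


def split_label(label: str) -> Tuple[str, str]:
    """Return (source, family); labels without a known prefix get an empty source."""
    for prefix in SOURCE_PREFIXES:
        if label.startswith(prefix):
            return prefix, label[len(prefix):]
    return "", label


parsed = [split_label(label) for label in labels]
source_counts = Counter(source for source, _ in parsed if source)
print(parsed)
print(source_counts)
```

Grouping by family rather than by full label can be useful when the same family appears under more than one source prefix.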


In [29]:
# Display the first few rows of the dataset to understand its structure
df.head()


| | address | year | day | length | weight | count | looped | neighbors | income | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1Keuc65zA62DjCh4aRXAnVfNCdtzr9htpW | 2015 | 365 | 4 | 1.0 | 5 | 5 | 2 | 125920600.0 | white |
| 1 | 1CkaHrXSqAJdmXo6zaeJ8UbhQaDSYJF2nE | 2013 | 96 | 0 | 0.5 | 1 | 0 | 2 | 2134359000.0 | white |
| 2 | 16RDLhXDT4L4kfmWHd68TntiyP3zZtX3Xv | 2012 | 338 | 16 | 0.03125 | 1 | 0 | 2 | 1056004000.0 | white |
| 3 | 1CAmcT4S2XnMwjViEbgB9r5qhXZCZHGFUd | 2013 | 249 | 6 | 0.25 | 1 | 0 | 2 | 39905920.0 | white |
| 4 | 1514euPsGdYsZPFfJJE4ruFb7P5Ey2cvRr | 2013 | 114 | 0 | 0.5 | 1 | 0 | 1 | 253750000.0 | white |


In [30]:
# Preprocess the data
# Select numerical features for simplicity and scale them
features = df.select_dtypes(include=[np.number])
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
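
Note that `select_dtypes` keeps all eight numeric columns, including `year` and `day`. Standard scaling then turns each column into z-scores with zero mean and unit variance, using the population standard deviation. The following standard-library sketch reproduces that computation on a single column; it illustrates the formula and is not scikit-learn's implementation.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Z-score a list of numbers using the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


lengths = [4, 0, 16, 6, 0]  # `length` values from the five rows shown above
z = standardize(lengths)
print([round(v, 3) for v in z])
```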


In [31]:
# Initialize KeplerMapper object
mapper = km.KeplerMapper(verbose=1)

# Define the lens with PCA, reducing to 2 components for visualization.
# PCA is faster, but we suggest tSNE for production code.
lens = mapper.fit_transform(features_scaled, projection=PCA(n_components=2))


KeplerMapper(verbose=1)
..Composing projection pipeline of length 1:
	Projections: PCA(n_components=2)
	Distance matrices: False
	Scalers: MinMaxScaler()
..Projecting on data shaped (1000, 8)

..Projecting data using: 
	PCA(n_components=2)


..Scaling with: MinMaxScaler()


In [32]:
from sklearn.cluster import KMeans

# Consider the binary case: all non-white (i.e., ransomware) addresses share one color.
df['is_white'] = (df['label'] == 'white').astype(int)

# Prepare the color values that will be used in the visualization
color_function = df['is_white'].values

# Build the Mapper graph: cover the lens with overlapping hypercubes,
# then cluster the points of each hypercube in the scaled feature space
graph = mapper.map(lens,
                   features_scaled,
                   clusterer=KMeans(n_clusters=5, n_init=10),  # explicit n_init avoids the FutureWarning
                   cover=km.Cover(n_cubes=10, perc_overlap=0.2))



Mapping on data shaped (1000, 8) using lens shaped (1000, 2)

Creating 100 hypercubes.





Created 96 edges and 60 nodes in 0:00:00.232390.
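
The 100 hypercubes reported above come from `n_cubes=10` intervals along each of the two lens dimensions (10 × 10). The following is a simplified, one-dimensional sketch of how an overlapping interval cover can be built. It is not KeplerMapper's exact implementation; the function names and the overlap convention are assumptions made for illustration.

```python
def interval_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with n_intervals equally spaced intervals.

    Each interval is widened so that `overlap` is the fraction of its
    length shared with each neighbour (illustrative convention).
    """
    step = (hi - lo) / n_intervals       # spacing between interval centres
    width = step / (1.0 - overlap)       # widened interval length
    half = width / 2.0
    centres = [lo + step * (i + 0.5) for i in range(n_intervals)]
    return [(c - half, c + half) for c in centres]


def members(points, interval):
    """Indices of the points that fall inside the closed interval."""
    a, b = interval
    return [i for i, x in enumerate(points) if a <= x <= b]


cover = interval_cover(0.0, 1.0, n_intervals=10, overlap=0.2)
points = [0.05, 0.10, 0.55]  # toy lens values

for idx, interval in enumerate(cover[:2]):
    print(idx, interval, members(points, interval))

# In two lens dimensions the cover is the product of two 1-D covers
print("2-D hypercubes:", len(cover) ** 2)
```

A point that lies in the overlap of two intervals (here the toy value 0.10) can end up in clusters from both, and shared members are what create edges between Mapper nodes.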




In [34]:
import os

from IPython.display import HTML

# Define the output directory and create it if it doesn't exist
output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "bitcoin_heist_mapper_output.html")

# Visualize the graph with color coding; visualize() also writes the HTML file to path_html,
# so no separate write step is needed.
# Note (assumption, check your installed version): newer kmapper releases (2.x)
# appear to name the color argument color_values instead of color_function.
html = mapper.visualize(graph,
                        path_html=output_path,
                        title="Bitcoin Heist TDAMapper Visualization",
                        color_function=color_function,
                        color_function_name="Transaction Type",  # Indicates what the color represents
                        custom_tooltips=np.array(df['label']))  # Optional: add tooltips for more interaction

# Display the HTML in the Jupyter Notebook
HTML(html)


Wrote visualization to: output/bitcoin_heist_mapper_output.html


[Interactive KeplerMapper visualization rendered here]

Keyboard shortcuts in the visualization:

| Key | Action |
|---|---|
| s | Nodes glow :D |
| c | Remove glow |
| p | Print mode - white backgrounds |
| d | Display mode - black backgrounds |
| z | Turn off gravity |
| m | Spacious layout |
| e | Tight layout |
| f | Freeze layout |
| x | Unfreeze all nodes |


## Interacting with the KeplerMapper Visualization

Once you have generated and opened the output HTML file in your browser, you can interact with the visualization to explore data clusters more deeply:

1. **Expand Cluster Details**
   - Click on individual nodes within the Mapper graph to view details about the data points contained within each cluster. This action reveals specific characteristics and metrics related to each cluster.

2. **View Mapper Summary**
   - Navigate to the summary section of the visualization to gain insights into the overall distribution and linkage of clusters. This part provides a high-level overview of the topological structure created by the Mapper algorithm.

**Note:** Each node in the Mapper graph represents a cluster of data points, and two nodes are linked when their clusters share at least one data point. By inspecting the visualization, you can identify which nodes are linked and determine the composition of data points within each node. This analysis is crucial for understanding how different addresses are grouped together based on the selected lens, cover, and clustering parameters.
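
The same composition check can be done programmatically. As a sketch: the dictionary returned by `mapper.map` stores node membership under `graph["nodes"]` (node ID to list of row indices) and adjacency under `graph["links"]` (node ID to list of linked node IDs). The toy graph and label list below are made up for illustration and stand in for the `graph` and `df['is_white'].values` objects built earlier; the helper name is hypothetical.

```python
def ransomware_fraction_per_node(graph, is_white):
    """For each Mapper node, return the share of members that are not white."""
    fractions = {}
    for node_id, member_idx in graph["nodes"].items():
        n_ransom = sum(1 for i in member_idx if is_white[i] == 0)
        fractions[node_id] = n_ransom / len(member_idx)
    return fractions


# Toy stand-ins (hypothetical data, same layout as KeplerMapper's graph dict)
toy_graph = {
    "nodes": {
        "cube0_cluster0": [0, 1, 2],
        "cube1_cluster0": [2, 3],
        "cube2_cluster0": [4, 5],
    },
    "links": {"cube0_cluster0": ["cube1_cluster0"]},
}
toy_is_white = [1, 1, 0, 0, 1, 1]  # 1 = white, 0 = ransomware

fractions = ransomware_fraction_per_node(toy_graph, toy_is_white)
for node_id, frac in sorted(fractions.items()):
    linked = toy_graph["links"].get(node_id, [])
    print(f"{node_id}: {frac:.2f} ransomware, linked to {linked}")
```

Nodes with a high ransomware fraction, and the nodes linked to them, are natural starting points for a closer look in the interactive view.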
