Using Spark Scala GraphX to analyze the Bitcoin transaction graph

Jörn Franke edited this page Feb 26, 2017 · 13 revisions

This is a Spark/GraphX application written in Scala that demonstrates some of the capabilities of the hadoopcryptoledger library. It takes as input a set of files on HDFS containing Bitcoin blockchain data. From this data it generates a Bitcoin transaction graph, i.e. a graph in which the vertices represent Bitcoin addresses and the edges represent transactions between these addresses.

From this graph it returns the top 5 Bitcoin addresses by number of transfers to them. Most of the time these addresses belong to so-called mixing services, which try to obfuscate transactions in the Bitcoin blockchain.
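The ranking idea can be sketched without Spark as counting the incoming edges of each address and keeping the largest counts. The following is only an illustration with plain Scala collections, not the code of the example application: the names `Transfer` and `topReceivers` and the addresses are hypothetical.

```scala
// Minimal sketch with plain Scala collections (no Spark or GraphX needed).
// The addresses and transfers below are made up for illustration.
object TopReceivers {
  // One directed edge: a transfer from one address to another.
  final case class Transfer(from: String, to: String)

  // Count incoming edges per destination address and return the n largest,
  // breaking ties by address so the result is deterministic.
  def topReceivers(edges: Seq[Transfer], n: Int): Seq[(String, Int)] =
    edges
      .groupBy(_.to)
      .map { case (address, incoming) => (address, incoming.size) }
      .toSeq
      .sortBy { case (address, count) => (-count, address) }
      .take(n)

  def main(args: Array[String]): Unit = {
    val edges = Seq(
      Transfer("addrA", "addrC"),
      Transfer("addrB", "addrC"),
      Transfer("addrD", "addrC"),
      Transfer("addrA", "addrB"),
      Transfer("addrC", "addrB"),
      Transfer("addrB", "addrD")
    )
    // Prints the two addresses with the most incoming transfers.
    topReceivers(edges, 2).foreach(println)
  }
}
```

On a real blockchain the edge list is far too large for a single machine, which is why the example application builds the graph with Spark instead.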

It has successfully been tested with the Cloudera Quickstart VM 5.5, but other Hadoop distributions should work equally well. Spark 1.5 was used for testing.

Getting blockchain data

You need to download and verify the blockchain data with a suitable tool. Bitcoin Core (use the most recent version!) was used for this project. Once you have installed it, start it and let it download the whole blockchain, which can take several hours. You need at least 70-100 GB of free disk space, and more as the blockchain grows. Once the download has finished, you will find all the Bitcoin data in the .bitcoin/blocks/*.dat files in your user directory.

You can put it on your HDFS cluster by executing the following commands:

hadoop fs -mkdir -p /user/cloudera/bitcoin/input

hadoop fs -put ~/.bitcoin/blocks/blk*.dat /user/cloudera/bitcoin/input

After the data has been copied, you are ready to use the example.

Building the example

Execute

git clone https://github.com/ZuInnoTe/hadoopcryptoledger.git hadoopcryptoledger

You can build the application by changing to the directory hadoopcryptoledger/examples/scala-spark-graphx-bitcointransaction and using the following command:

sbt clean assembly test it:test

This will also execute the integration tests.

You will find the jar "example-hcl-spark-scala-graphx-bitcointransaction.jar" in ./target/scala-2.10

Running the example

Make sure that the output directory is clean:

hadoop fs -rm -R /user/cloudera/bitcoin/output

Execute the following command to run the example with a local master:

spark-submit --class org.zuinnote.spark.bitcoin.example.SparkScalaBitcoinTransactionGraph --master local[8] ./target/scala-2.10/example-hcl-spark-scala-graphx-bitcointransaction.jar /user/cloudera/bitcoin/input /user/cloudera/bitcoin/output

After the Spark job has completed, you will find the result in /user/cloudera/bitcoin/output. You can display it using the following command:

hadoop fs -cat /user/cloudera/bitcoin/output/part-00000

An example of such output is shown below. Each line has the format (vertexId,(receiving Bitcoin address, number of inputs)):

(26038,(bitcoinaddress_F4B004C3CA2E7F96F9FC5BCA767708967AF67A44,119539))
(138462,(bitcoinaddress_8ED0DCFF2D3D8F6F6EDC244885EEAE895469D7BB,7938))
(29231,(bitcoinaddress_2D6FE761506A0FBDCED503FDA7FB85C26ED1E76E,7784))
(260013,(bitcoinaddress_77E84EAC6A7132B1E9267BA926EEA90F7FED9C74,5565))
(385502,(bitcoinaddress_A92BA6F9C91233B6A4EC84DC04CDB7AF08246184,5136))

Hint: You can put the addresses (without the prefix bitcoinaddress_) in popular Bitcoin explorers (e.g. https://blockchain.info) to get more details about the address.
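If you want to post-process the result, the following sketch parses one output line and strips the bitcoinaddress_ prefix. It assumes that every line has exactly the shape of the sample output above and that the part after the prefix is a hexadecimal string; the names `OutputLine` and `parse` are hypothetical.

```scala
// Minimal sketch: parse one line of the job output with a regular expression.
// Assumption: lines look exactly like the sample output shown above.
object OutputLine {
  // Captures the vertex id, the address without its prefix, and the count.
  private val Line =
    """\((\d+),\(bitcoinaddress_([0-9A-Fa-f]+),(\d+)\)\)""".r

  // Returns (vertexId, address without prefix, count), or None if the
  // line does not have the expected shape.
  def parse(line: String): Option[(Long, String, Long)] =
    line.trim match {
      case Line(id, address, count) => Some((id.toLong, address, count.toLong))
      case _                        => None
    }

  def main(args: Array[String]): Unit = {
    val sample =
      "(26038,(bitcoinaddress_F4B004C3CA2E7F96F9FC5BCA767708967AF67A44,119539))"
    // Prints the parsed fields of the sample line, if it matches.
    parse(sample).foreach { case (id, address, count) =>
      println(s"vertex=$id address=$address count=$count")
    }
  }
}
```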

Of course, you can combine other data with the blockchain data, or attach the amount of each transaction to the corresponding edge in the graph.
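As a rough illustration of edges that carry an amount, the sketch below sums the amounts on all incoming edges of each address, again with plain Scala collections rather than GraphX. The names `WeightedTransfer` and `receivedPerAddress`, the choice of satoshis as the unit, and all values are assumptions made for this example.

```scala
// Minimal sketch: edges that carry an amount, aggregated per receiving address.
// All names and values are hypothetical and only illustrate the idea.
object WeightedTransfers {
  // A directed edge that also carries the transferred amount in satoshis.
  final case class WeightedTransfer(from: String, to: String, satoshis: Long)

  // Sum the amounts on all incoming edges of each destination address.
  def receivedPerAddress(edges: Seq[WeightedTransfer]): Map[String, Long] =
    edges.groupBy(_.to).map { case (address, incoming) =>
      (address, incoming.map(_.satoshis).sum)
    }

  def main(args: Array[String]): Unit = {
    val edges = Seq(
      WeightedTransfer("addrA", "addrC", 1500L),
      WeightedTransfer("addrB", "addrC", 2500L),
      WeightedTransfer("addrC", "addrB", 700L)
    )
    // Prints the total amount received by each address.
    receivedPerAddress(edges).foreach(println)
  }
}
```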

More Information

Blog: https://snippetessay.wordpress.com/2016/09/20/sparkscalagraphx-analyzing-the-bitcoin-transaction-graph/

Understanding the structure of Bitcoin data:

Understanding the destination of Bitcoin addresses and why they cannot be compared with transactions occurring in the current financial system: https://en.bitcoin.it/wiki/From_address

Blocks: https://en.bitcoin.it/wiki/Block

Transactions: https://en.bitcoin.it/wiki/Transactions