Hive Serde

Jörn Franke edited this page Jan 21, 2017 · 20 revisions

This Hive Serde will be used to represent blocks and transactions from crypto ledgers, such as the Bitcoin Blockchain, as tables in Hive. These tables can then be joined with other tables containing other relevant information, such as stock market movements or weather patterns. The following Hive Serde is available

  • Bitcoin
    • BitcoinBlockSerde: Represents information about BitcoinBlock(s) containing transactions transparently as a table in Hive

Build

Note that the Hadoop File Format and Hive Serde are available on Maven Central and you do not need to build and publish it to a local Maven anymore to use it. Furthermore, they are available from the releases page.

Execute

git clone https://github.com/ZuInnoTe/hadoopcryptoledger.git hadoopcryptoledger

You can build the application by changing to the directory hadoopcryptoledger/hiveserde and using the following command:

../gradlew clean build

Deploy

After the build you will find the Serde in ./build/libs/hadoopcryptoledger-hiveserde-1.0.3.jar It is assumed that you have build hadoopcryptoledger-fileformat-1.0.3.jar Alternatively you can download both from the release page.

You need to put it into a local directory (e.g. /tmp/hadoopcryptoledger-hiveserde-1.0.3.jar) and execute hive. Enter the following line every time you start hive:

add jar /tmp/hadoopcryptoledger-fileformat-1.0.3.jar 
add jar /tmp/hadoopcryptoledger-hiveserde-1.0.3.jar 

You can create any database in hive:

create database blockchains;

You can then create a table containing Bitcoin Blockchain data as follows:

use blockchains;
Create table BitcoinBlockchain ROW FORMAT SERDE 'org.zuinnote.hadoop.bitcoin.hive.serde.BitcoinBlockSerde' STORED AS INPUTFORMAT 'org.zuinnote.hadoop.bitcoin.format.mapred.BitcoinBlockFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapred.lib.NullOutputFormat' LOCATION '/path/to/bitcoin/blockchain/files';

The table has the following structure:

describe BitcoinBlockchain;

Output:

magicno                 binary                  from deserializer   
blocksize               int                     from deserializer   
version                 int                     from deserializer   
time                    int                     from deserializer   
bits                    binary                  from deserializer   
nonce                   int                     from deserializer   
transactioncounter      bigint                  from deserializer   
hashprevblock           binary                  from deserializer   
hashmerkleroot          binary                  from deserializer   
transactions            array<struct<version:int,incounter:binary,outcounter:binary,listofinputs:array<struct<prevtransactionhash:binary,previoustxoutindex:bigint,txinscriptlength:binary,txinscript:binary,seqno:bigint>>,listofoutputs:array<struct<value:bigint,txoutscriptlength:binary,txoutscript:binary>>,locktime:int>>    from deserializer

Find more examples on how to configure the InputFileFormat via TBLPROPERTIES here: https://github.com/ZuInnoTe/hadoopcryptoledger/tree/master/examples/hive-bitcoin

Use

Hint: It is recommended to select the relevant transactions and blocks from the provided tables and do CTAS to store them in another table using a format suitable for analytics, such as ORC or Parquet.

More Information

Understanding the structure of Bitcoin data:

Blocks: https://en.bitcoin.it/wiki/Block

Transactions: https://en.bitcoin.it/wiki/Transactions