# Intro to Bitcoin Transaction Parsing

This notebook is designed to be viewed through NBviewer. Please navigate your browser to https://nbviewer.jupyter.org/github/destrys/bitcoin_transaction_parsing/blob/master/notebooks/1_Intro_to_parsing.ipynb

And we start with a **warning**: Vitalik's pybitcointools and Peter Todd's python-bitcoinlib both install as 'bitcoin'. Beware. We're using Peter Todd's python-bitcoinlib for these notebooks.


### Contents
1. Import a Transaction
2. Deserialize the Tx into metadata, inputs, and outputs
3. Investigate the metadata
4. Input and output parsing will be covered in the following notebooks.

### 1. Import a Transaction

We'll cover how to grab raw transactions in another notebook. For now, the easiest way to get a raw transaction is to use blockchain.info and append ```?format=hex``` to the URL.

Here is the output of https://blockchain.info/tx/9021b49d445c719106c95d561b9c3fac7bcb3650db67684a9226cd7fa1e1c1a0?format=hex:

In [1]:
rawtx = "0100000002d8c8df6a6fdd2addaf589a83d860f18b44872d13ee6ec3526b2b470d42a96d4d000000008b483045022100b31557e47191936cb14e013fb421b1860b5e4fd5d2bc5ec1938f4ffb1651dc8902202661c2920771fd29dd91cd4100cefb971269836da4914d970d333861819265ba014104c54f8ea9507f31a05ae325616e3024bd9878cb0a5dff780444002d731577be4e2e69c663ff2da922902a4454841aa1754c1b6292ad7d317150308d8cce0ad7abffffffff2ab3fa4f68a512266134085d3260b94d3b6cfd351450cff021c045a69ba120b2000000008b4830450220230110bc99ef311f1f8bda9d0d968bfe5dfa4af171adbef9ef71678d658823bf022100f956d4fcfa0995a578d84e7e913f9bb1cf5b5be1440bcede07bce9cd5b38115d014104c6ec27cffce0823c3fecb162dbd576c88dd7cda0b7b32b0961188a392b488c94ca174d833ee6a9b71c0996620ae71e799fc7c77901db147fa7d97732e49c8226ffffffff02c0175302000000001976a914a3d89c53bb956f08917b44d113c6b2bcbe0c29b788acc01c3d09000000001976a91408338e1d5e26db3fce21b011795b1c3c8a5a5d0788ac00000000"
print("Raw Transaction: ", rawtx)

Raw Transaction:  0100000002d8c8df6a6fdd2addaf589a83d860f18b44872d13ee6ec3526b2b470d42a96d4d000000008b483045022100b31557e47191936cb14e013fb421b1860b5e4fd5d2bc5ec1938f4ffb1651dc8902202661c2920771fd29dd91cd4100cefb971269836da4914d970d333861819265ba014104c54f8ea9507f31a05ae325616e3024bd9878cb0a5dff780444002d731577be4e2e69c663ff2da922902a4454841aa1754c1b6292ad7d317150308d8cce0ad7abffffffff2ab3fa4f68a512266134085d3260b94d3b6cfd351450cff021c045a69ba120b2000000008b4830450220230110bc99ef311f1f8bda9d0d968bfe5dfa4af171adbef9ef71678d658823bf022100f956d4fcfa0995a578d84e7e913f9bb1cf5b5be1440bcede07bce9cd5b38115d014104c6ec27cffce0823c3fecb162dbd576c88dd7cda0b7b32b0961188a392b488c94ca174d833ee6a9b71c0996620ae71e799fc7c77901db147fa7d97732e49c8226ffffffff02c0175302000000001976a914a3d89c53bb956f08917b44d113c6b2bcbe0c29b788acc01c3d09000000001976a91408338e1d5e26db3fce21b011795b1c3c8a5a5d0788ac00000000


### 2. Deserialize the Tx into Metadata, Inputs, and Outputs

The bitcoin blockchain has a rigid transaction serialization structure, as follows:

https://bitcoin.org/en/developer-reference#raw-transaction-format


* Version: 4 Bytes
* Number of Inputs: CompactSize Bytes
* Serialized Inputs
* Number of Outputs: CompactSize Bytes
* Serialized Outputs
* Lock Time: 4 Bytes

We're going to slowly walk through deserializing a raw transaction.

In [2]:
# Extract Version Bytes and remove from leading edge of the raw transaction.
version = rawtx[0:8]
rawtx = rawtx[8:]

#### Version
The first 4 bytes are the version bytes. 

When representing bytes in hex, each byte is two characters.  
From Wikipedia: "As each hexadecimal digit represents four binary digits (bits), it allows a more human-friendly representation of binary-coded values. One hexadecimal digit represents a nibble (4 bits), which is half of an octet or byte (8 bits). For example, a single byte can have values ranging from 00000000 to 11111111 in binary form, but this may be more conveniently represented as 00 to FF in hexadecimal."

So to extract the version bytes we grab the first 8 characters.

In [3]:
print('Version Bytes: ',version)

Version Bytes:  01000000


And remove those bytes from the front of the raw transaction.

In [4]:
print("Remaining Raw Transaction: ", rawtx)

Remaining Raw Transaction:  02d8c8df6a6fdd2addaf589a83d860f18b44872d13ee6ec3526b2b470d42a96d4d000000008b483045022100b31557e47191936cb14e013fb421b1860b5e4fd5d2bc5ec1938f4ffb1651dc8902202661c2920771fd29dd91cd4100cefb971269836da4914d970d333861819265ba014104c54f8ea9507f31a05ae325616e3024bd9878cb0a5dff780444002d731577be4e2e69c663ff2da922902a4454841aa1754c1b6292ad7d317150308d8cce0ad7abffffffff2ab3fa4f68a512266134085d3260b94d3b6cfd351450cff021c045a69ba120b2000000008b4830450220230110bc99ef311f1f8bda9d0d968bfe5dfa4af171adbef9ef71678d658823bf022100f956d4fcfa0995a578d84e7e913f9bb1cf5b5be1440bcede07bce9cd5b38115d014104c6ec27cffce0823c3fecb162dbd576c88dd7cda0b7b32b0961188a392b488c94ca174d833ee6a9b71c0996620ae71e799fc7c77901db147fa7d97732e49c8226ffffffff02c0175302000000001976a914a3d89c53bb956f08917b44d113c6b2bcbe0c29b788acc01c3d09000000001976a91408338e1d5e26db3fce21b011795b1c3c8a5a5d0788ac00000000
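
The version bytes above are still a hex string. As a small sketch of turning them into a number, the snippet below assumes the field is a 4-byte little-endian integer (least significant byte first), which is how Bitcoin serializes its fixed-width integers. The names `version_hex` and `version_number` are illustrative and not part of the original notebook.

```python
# The 8 hex characters extracted above represent 4 bytes.
version_hex = "01000000"

# Convert hex -> bytes, then interpret the bytes as a little-endian integer.
version_bytes = bytes.fromhex(version_hex)
version_number = int.from_bytes(version_bytes, byteorder="little")

print("Number of bytes: ", len(version_bytes))
print("Version number: ", version_number)
```

Reading the same bytes as big-endian would give 16777216 instead of 1, which is why byte order matters for every multi-byte integer in the transaction.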


#### Number of Inputs

The next field is the number of inputs. This is the first instance of a CompactSize integer: https://bitcoin.org/en/developer-reference#compactsize-unsigned-integers . In practice, the number of inputs is usually less than 253 (0xfd), in which case the field is a single byte, but it's good practice to treat this field as variable-sized.

In [5]:
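def extract_compact_sized(raw_hex):
    # Return the hex characters of the CompactSize field at the start of raw_hex.
    # A first byte of 0xff, 0xfe, or 0xfd is a marker followed by 8, 4, or 2 bytes;
    # any smaller first byte is the value itself.
    if raw_hex[0:2] == "ff":
        return raw_hex[0:18]
    if raw_hex[0:2] == "fe":
        return raw_hex[0:10]
    if raw_hex[0:2] == "fd":
        return raw_hex[0:6]
    else:
        return raw_hex[0:2]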
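
The function above only slices out the field; it does not decode it. For the multi-byte forms, the first byte is a marker and the value that follows is little-endian, so passing the whole field to `int(..., 16)` would give the wrong number. Here is a hedged sketch of a decoder; `read_compact_size` is a hypothetical helper name, not something defined elsewhere in these notebooks.

```python
def read_compact_size(raw_hex):
    """Return (value, hex_chars_consumed) for the CompactSize at the start of raw_hex."""
    prefix = int(raw_hex[0:2], 16)
    if prefix < 0xfd:
        # Single-byte form: the byte is the value.
        return prefix, 2
    # Marker byte tells us how many payload bytes follow.
    payload_bytes = {0xfd: 2, 0xfe: 4, 0xff: 8}[prefix]
    end = 2 + payload_bytes * 2
    value = int.from_bytes(bytes.fromhex(raw_hex[2:end]), byteorder="little")
    return value, end

print(read_compact_size("02"))          # single-byte form
print(read_compact_size("fd0302"))      # marker 0xfd followed by 2 bytes
print(read_compact_size("fe00000100"))  # marker 0xfe followed by 4 bytes
```

Returning the number of characters consumed lets the caller trim the raw transaction the same way the cells in this notebook do.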

In [6]:
number_of_inputs_hex = extract_compact_sized(rawtx)
print("Number of Inputs in Hex: ",number_of_inputs_hex)
# int(hex, 16) is only valid here because the count fits in a single byte (below 0xfd)
number_of_inputs = int(number_of_inputs_hex, 16)
print("Number of Inputs: ", number_of_inputs)
rawtx = rawtx[len(number_of_inputs_hex):]

Number of Inputs in Hex:  02
Number of Inputs:  2


In [7]:
print("Remaining Raw Transaction: ", rawtx)

Remaining Raw Transaction:  d8c8df6a6fdd2addaf589a83d860f18b44872d13ee6ec3526b2b470d42a96d4d000000008b483045022100b31557e47191936cb14e013fb421b1860b5e4fd5d2bc5ec1938f4ffb1651dc8902202661c2920771fd29dd91cd4100cefb971269836da4914d970d333861819265ba014104c54f8ea9507f31a05ae325616e3024bd9878cb0a5dff780444002d731577be4e2e69c663ff2da922902a4454841aa1754c1b6292ad7d317150308d8cce0ad7abffffffff2ab3fa4f68a512266134085d3260b94d3b6cfd351450cff021c045a69ba120b2000000008b4830450220230110bc99ef311f1f8bda9d0d968bfe5dfa4af171adbef9ef71678d658823bf022100f956d4fcfa0995a578d84e7e913f9bb1cf5b5be1440bcede07bce9cd5b38115d014104c6ec27cffce0823c3fecb162dbd576c88dd7cda0b7b32b0961188a392b488c94ca174d833ee6a9b71c0996620ae71e799fc7c77901db147fa7d97732e49c8226ffffffff02c0175302000000001976a914a3d89c53bb956f08917b44d113c6b2bcbe0c29b788acc01c3d09000000001976a91408338e1d5e26db3fce21b011795b1c3c8a5a5d0788ac00000000


#### Inputs

The size of an input in the transaction isn't fixed, because the signature that authorizes a spend varies between different types of transaction scripts (P2PKH, P2SH, etc.). So to extract each input, we have to extract each of its fields in turn.

The transaction input serialization is as follows:
https://bitcoin.org/en/developer-reference#txin

* Outpoint : 36 bytes (32 Bytes for the reference tx hash, 4 bytes for the reference tx index)
* Script Size: CompactSize Bytes
* ScriptSig: ScriptSize Bytes
* Sequence: 4 Bytes (usually 0xffffffff which is easy to spot by eye)


In [8]:
input_reftx_1_raw = rawtx[0:32*2]
print("Input 1 Reference TxHash: ",input_reftx_1_raw)
rawtx = rawtx[32*2:]

Input 1 Reference TxHash:  d8c8df6a6fdd2addaf589a83d860f18b44872d13ee6ec3526b2b470d42a96d4d
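
One thing to watch for: the 32-byte hash is serialized in Bitcoin's internal byte order, while block explorers conventionally display transaction IDs with the bytes reversed. A minimal sketch of the conversion follows; the variable names are illustrative only.

```python
# Hash exactly as it appears in the serialized input (internal byte order).
outpoint_hash_hex = "d8c8df6a6fdd2addaf589a83d860f18b44872d13ee6ec3526b2b470d42a96d4d"

# Reverse the order of the bytes (not the hex characters) to get the display form.
displayed_txid = bytes.fromhex(outpoint_hash_hex)[::-1].hex()

print("Serialized hash:        ", outpoint_hash_hex)
print("Explorer-style txid:    ", displayed_txid)
```

The reversed form is the one to paste into a block explorer when looking up the transaction this input spends from.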


In [9]:
input_output_index_1_hex = rawtx[0:8]
print("Input 1 Output Index (hex): ", input_output_index_1_hex)
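# The output index is a 4-byte little-endian integer, so decode it byte-wise
input_output_index_1 = int.from_bytes(bytes.fromhex(input_output_index_1_hex), "little")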
print("Input 1 Output Index: ", input_output_index_1)
rawtx = rawtx[8:]

Input 1 Output Index (hex):  00000000
Input 1 Output Index:  0


In [10]:
input_scriptsig_length_1_hex = extract_compact_sized(rawtx)
print("Input 1 ScriptSig Length (hex): ", input_scriptsig_length_1_hex)
input_scriptsig_length_1 = int(input_scriptsig_length_1_hex,16)
print("Input 1 ScriptSig Length", input_scriptsig_length_1)
rawtx = rawtx[len(input_scriptsig_length_1_hex):]

Input 1 ScriptSig Length (hex):  8b
Input 1 ScriptSig Length 139


In [11]:
input_scriptsig_1 = rawtx[0:input_scriptsig_length_1*2]
print("Input 1 ScriptSig: ",input_scriptsig_1)
rawtx = rawtx[input_scriptsig_length_1*2:]

Input 1 ScriptSig:  483045022100b31557e47191936cb14e013fb421b1860b5e4fd5d2bc5ec1938f4ffb1651dc8902202661c2920771fd29dd91cd4100cefb971269836da4914d970d333861819265ba014104c54f8ea9507f31a05ae325616e3024bd9878cb0a5dff780444002d731577be4e2e69c663ff2da922902a4454841aa1754c1b6292ad7d317150308d8cce0ad7ab


In [12]:
input_sequence_1 = rawtx[0:8]
print("Input 1 Sequence: ", input_sequence_1)
rawtx = rawtx[8:]

Input 1 Sequence:  ffffffff


In [13]:
print("RawTx after extracting first input: ", rawtx)

RawTx after extracting first input:  2ab3fa4f68a512266134085d3260b94d3b6cfd351450cff021c045a69ba120b2000000008b4830450220230110bc99ef311f1f8bda9d0d968bfe5dfa4af171adbef9ef71678d658823bf022100f956d4fcfa0995a578d84e7e913f9bb1cf5b5be1440bcede07bce9cd5b38115d014104c6ec27cffce0823c3fecb162dbd576c88dd7cda0b7b32b0961188a392b488c94ca174d833ee6a9b71c0996620ae71e799fc7c77901db147fa7d97732e49c8226ffffffff02c0175302000000001976a914a3d89c53bb956f08917b44d113c6b2bcbe0c29b788acc01c3d09000000001976a91408338e1d5e26db3fce21b011795b1c3c8a5a5d0788ac00000000


In [14]:
print("Input 1 : ", "".join([input_reftx_1_raw, input_output_index_1_hex, input_scriptsig_length_1_hex, input_scriptsig_1, input_sequence_1]))

Input 1 :  d8c8df6a6fdd2addaf589a83d860f18b44872d13ee6ec3526b2b470d42a96d4d000000008b483045022100b31557e47191936cb14e013fb421b1860b5e4fd5d2bc5ec1938f4ffb1651dc8902202661c2920771fd29dd91cd4100cefb971269836da4914d970d333861819265ba014104c54f8ea9507f31a05ae325616e3024bd9878cb0a5dff780444002d731577be4e2e69c663ff2da922902a4454841aa1754c1b6292ad7d317150308d8cce0ad7abffffffff


And now for the second input (remember from above that the transaction specifies 2 inputs).

In [15]:
input_reftx_2_raw = rawtx[0:32*2]
print("Input 2 Reference TxHash: ",input_reftx_2_raw)
rawtx = rawtx[32*2:]

Input 2 Reference TxHash:  2ab3fa4f68a512266134085d3260b94d3b6cfd351450cff021c045a69ba120b2


In [16]:
input_output_index_2_hex = rawtx[0:8]
print("Input 2 Output Index (hex): ", input_output_index_2_hex)
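# The output index is a 4-byte little-endian integer, so decode it byte-wise
input_output_index_2 = int.from_bytes(bytes.fromhex(input_output_index_2_hex), "little")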
print("Input 2 Output Index: ", input_output_index_2)
rawtx = rawtx[8:]

Input 2 Output Index (hex):  00000000
Input 2 Output Index:  0


In [17]:
input_scriptsig_length_2_hex = extract_compact_sized(rawtx)
print("Input 2 ScriptSig Length (hex): ", input_scriptsig_length_2_hex)
input_scriptsig_length_2 = int(input_scriptsig_length_2_hex,16)
print("Input 2 ScriptSig Length", input_scriptsig_length_2)
rawtx = rawtx[len(input_scriptsig_length_2_hex):]

Input 2 ScriptSig Length (hex):  8b
Input 2 ScriptSig Length 139


In [18]:
input_scriptsig_2 = rawtx[0:input_scriptsig_length_2*2]
print("Input 2 ScriptSig: ",input_scriptsig_2)
rawtx = rawtx[input_scriptsig_length_2*2:]

Input 2 ScriptSig:  4830450220230110bc99ef311f1f8bda9d0d968bfe5dfa4af171adbef9ef71678d658823bf022100f956d4fcfa0995a578d84e7e913f9bb1cf5b5be1440bcede07bce9cd5b38115d014104c6ec27cffce0823c3fecb162dbd576c88dd7cda0b7b32b0961188a392b488c94ca174d833ee6a9b71c0996620ae71e799fc7c77901db147fa7d97732e49c8226


In [19]:
input_sequence_2 = rawtx[0:8]
print("Input 2 Sequence: ", input_sequence_2)
rawtx = rawtx[8:]

Input 2 Sequence:  ffffffff


In [20]:
print("RawTx after extracting second input: ", rawtx)

RawTx after extracting second input:  02c0175302000000001976a914a3d89c53bb956f08917b44d113c6b2bcbe0c29b788acc01c3d09000000001976a91408338e1d5e26db3fce21b011795b1c3c8a5a5d0788ac00000000


In [21]:
print("Input 2 : ", "".join([input_reftx_2_raw, input_output_index_2_hex ,input_scriptsig_length_2_hex, input_scriptsig_2, input_sequence_2]))

Input 2 :  2ab3fa4f68a512266134085d3260b94d3b6cfd351450cff021c045a69ba120b2000000008b4830450220230110bc99ef311f1f8bda9d0d968bfe5dfa4af171adbef9ef71678d658823bf022100f956d4fcfa0995a578d84e7e913f9bb1cf5b5be1440bcede07bce9cd5b38115d014104c6ec27cffce0823c3fecb162dbd576c88dd7cda0b7b32b0961188a392b488c94ca174d833ee6a9b71c0996620ae71e799fc7c77901db147fa7d97732e49c8226ffffffff


A better way to extract these inputs would be to have a little function that pops them off and stores them in a list of dictionaries, but we're doing this the slow, verbose way so you can see each step clearly.
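
For reference, such a function might look roughly like the sketch below. The helper names `pop_hex` and `parse_inputs`, the dictionary keys, and the sample data are all made up for illustration. To stay short, it assumes every ScriptSig length fits in a single CompactSize byte and raises otherwise.

```python
def pop_hex(raw_hex, n_bytes):
    """Split off the first n_bytes (2 hex characters each) from raw_hex."""
    cut = n_bytes * 2
    return raw_hex[:cut], raw_hex[cut:]

def parse_inputs(raw_hex, count):
    """Pop `count` inputs off raw_hex; return (list of dicts, remaining hex)."""
    inputs = []
    for _ in range(count):
        tx_hash, raw_hex = pop_hex(raw_hex, 32)
        index_hex, raw_hex = pop_hex(raw_hex, 4)
        length_hex, raw_hex = pop_hex(raw_hex, 1)
        script_length = int(length_hex, 16)
        if script_length >= 0xfd:
            # Assumption of this sketch: only the single-byte CompactSize form.
            raise NotImplementedError("multi-byte CompactSize not handled here")
        script_sig, raw_hex = pop_hex(raw_hex, script_length)
        sequence, raw_hex = pop_hex(raw_hex, 4)
        inputs.append({
            "ref_tx_hash": tx_hash,
            "output_index": int.from_bytes(bytes.fromhex(index_hex), "little"),
            "script_sig": script_sig,
            "sequence": sequence,
        })
    return inputs, raw_hex

# Made-up input for illustration only (not a real transaction):
# 32-byte hash, index 1, 2-byte script, default sequence, then trailing data.
sample = "11" * 32 + "01000000" + "02" + "abcd" + "ffffffff" + "deadbeef"
parsed, rest = parse_inputs(sample, 1)
print(parsed)
print("Remaining: ", rest)
```

Returning the unparsed remainder mirrors the `rawtx = rawtx[...]` trimming used throughout this notebook.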

#### Number of Outputs

This field has the same format as the number of inputs, but specifies how many outputs follow. Again, realistically it will rarely reach 253, but we still treat it as CompactSize bytes.

In [22]:
number_of_outputs_hex = extract_compact_sized(rawtx)
print("Number of Outputs in Hex: ",number_of_outputs_hex)
number_of_outputs = int(number_of_outputs_hex, 16)
print("Number of Outputs: ", number_of_outputs)
rawtx = rawtx[len(number_of_outputs_hex):]

Number of Outputs in Hex:  02
Number of Outputs:  2


#### Outputs

Output serialization is much easier than input serialization because you don't need to specify any previous transactions and you don't need a signature. 

TODO: Link to output serialization

* Value: 8 Bytes - the amount to send to this output in satoshis
* Length of Script - CompactSize Bytes
* Script - The script that 'encumbers' the output. See the notebook on outputs and script.

#### Value

The value sent to this output in satoshis.

One annoyance here is that the value is serialized in 'little-endian' format, which puts the least significant byte first instead of the most significant byte first. In decimal, this is the equivalent of representing 'one thousand twenty' as 0201. To convert from little-endian hex, we have to reverse the bytes first. (We got away without this when extracting the number of inputs and outputs only because those values fit in a single byte.)

In [23]:
output_value_1_hex = rawtx[0:16]
print("Value of Output 1 (little-endian hex): ", output_value_1_hex)

from binascii import unhexlify
# convert to bytes
value_bytes = unhexlify(output_value_1_hex)
# reverse byte order
swapped_bytes = value_bytes[::-1]
# convert back to hex
swapped_hex = swapped_bytes.hex()
print("Value of Output 1 (big-endian hex): ", swapped_hex)
# convert to int
output_value_1 = int(swapped_hex, 16)
print("Value of Output 1 (sat): ", output_value_1)
print("Value of Output 1 (btc): ", output_value_1 / 1e8)
rawtx = rawtx[16:]

Value of Output 1 (little-endian hex):  c017530200000000
Value of Output 1 (big-endian hex):  00000000025317c0
Value of Output 1 (sat):  39000000
Value of Output 1 (btc):  0.39
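
The manual byte swap above shows what is going on, but Python can do the same conversion in one step. As a sketch, `int.from_bytes` with `byteorder="little"` reads the serialized bytes directly, least significant byte first. The helper name `little_endian_hex_to_int` is hypothetical.

```python
def little_endian_hex_to_int(hex_string):
    """Interpret a hex string as a little-endian unsigned integer."""
    return int.from_bytes(bytes.fromhex(hex_string), byteorder="little")

# The 8 value bytes of the first output, exactly as serialized.
value_sat = little_endian_hex_to_int("c017530200000000")
print("Value (sat): ", value_sat)
print("Value (btc): ", value_sat / 1e8)
```

The same helper works for any of the fixed-width integer fields in the transaction, since it does not depend on the field length.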


#### Length of Script

This is the same format as the length of the ScriptSig in an input.

In [24]:
output_script_length_1_hex = extract_compact_sized(rawtx)
print("Output 1 Script Length (hex): ", output_script_length_1_hex)
output_script_length_1 = int(output_script_length_1_hex,16)
print("Output 1 Script Length", output_script_length_1)
rawtx = rawtx[len(output_script_length_1_hex):]

Output 1 Script Length (hex):  19
Output 1 Script Length 25


#### Output Script

We now have the length of the output script, so we can extract it.

In [25]:
output_script_1 = rawtx[0:output_script_length_1*2]
print("Output 1 Script: ",output_script_1)
rawtx = rawtx[output_script_length_1*2:]

Output 1 Script:  76a914a3d89c53bb956f08917b44d113c6b2bcbe0c29b788ac


Note that this is **not** a bitcoin address. TODO: convert to base58Check to confirm this.
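
As a hedged sketch of that conversion: assuming the script is a standard P2PKH script (`OP_DUP OP_HASH160 <20-byte hash> OP_EQUALVERIFY OP_CHECKSIG`), the address is the Base58Check encoding of a mainnet version byte `0x00` followed by the 20-byte public key hash. The function names below are illustrative, and the code uses only the standard library rather than python-bitcoinlib.

```python
import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check_encode(version, payload):
    """Base58Check: version + payload + first 4 bytes of double-SHA256."""
    data = version + payload
    checksum = hashlib.sha256(hashlib.sha256(data).digest()).digest()[:4]
    raw = data + checksum
    num = int.from_bytes(raw, byteorder="big")
    encoded = ""
    while num > 0:
        num, remainder = divmod(num, 58)
        encoded = BASE58_ALPHABET[remainder] + encoded
    # Each leading zero byte is represented by a leading '1'.
    leading_zeros = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded

def p2pkh_script_to_address(script_hex):
    """Extract the pubkey hash from a P2PKH script and encode it as an address."""
    script = bytes.fromhex(script_hex)
    # 0x76 OP_DUP, 0xa9 OP_HASH160, 0x14 push 20 bytes, ..., 0x88 OP_EQUALVERIFY, 0xac OP_CHECKSIG
    if len(script) != 25 or script[:3] != b"\x76\xa9\x14" or script[23:] != b"\x88\xac":
        raise ValueError("not a standard P2PKH script")
    return base58check_encode(b"\x00", script[3:23])

address = p2pkh_script_to_address("76a914a3d89c53bb956f08917b44d113c6b2bcbe0c29b788ac")
print("Output 1 address: ", address)
```

The printed address can be compared against what a block explorer shows for this transaction's first output.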

In [26]:
print("Remaining RawTx: ", rawtx)

Remaining RawTx:  c01c3d09000000001976a91408338e1d5e26db3fce21b011795b1c3c8a5a5d0788ac00000000


#### Output 2

Same process, second output.

In [27]:
output_value_2_hex = rawtx[0:16]
print("Value of Output 2 (little-endian hex): ", output_value_2_hex)

from binascii import unhexlify
# convert to bytes
value_bytes = unhexlify(output_value_2_hex)
# reverse byte order
swapped_bytes = value_bytes[::-1]
# convert back to hex
swapped_hex = swapped_bytes.hex()
print("Value of Output 2 (big-endian hex): ", swapped_hex)
# convert to int
output_value_2 = int(swapped_hex, 16)
print("Value of Output 2 (sat): ", output_value_2)
print("Value of Output 2 (btc): ", output_value_2 / 1e8)
rawtx = rawtx[16:]

Value of Output 2 (little-endian hex):  c01c3d0900000000
Value of Output 2 (big-endian hex):  00000000093d1cc0
Value of Output 2 (sat):  155000000
Value of Output 2 (btc):  1.55


In [28]:
output_script_length_2_hex = extract_compact_sized(rawtx)
print("Output 2 Script Length (hex): ", output_script_length_2_hex)
output_script_length_2 = int(output_script_length_2_hex,16)
print("Output 2 Script Length", output_script_length_2)
rawtx = rawtx[len(output_script_length_2_hex):]

Output 2 Script Length (hex):  19
Output 2 Script Length 25


In [29]:
output_script_2= rawtx[0:output_script_length_2*2]
print("Output 2 Script: ",output_script_2)
rawtx = rawtx[output_script_length_2*2:]

Output 2 Script:  76a91408338e1d5e26db3fce21b011795b1c3c8a5a5d0788ac


In [30]:
print("Remaining RawTx: ", rawtx)

Remaining RawTx:  00000000
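
To recap the output steps in one place, here is a sketch that parses the output count and both outputs in a loop. `parse_outputs` and its dictionary keys are hypothetical names, and the sketch assumes the count and script lengths each fit in a single CompactSize byte.

```python
def parse_outputs(raw_hex):
    """Parse the output count and all outputs; return (list of dicts, remaining hex)."""
    count = int(raw_hex[0:2], 16)
    if count >= 0xfd:
        # Assumption of this sketch: only the single-byte CompactSize form.
        raise NotImplementedError("multi-byte CompactSize not handled here")
    pos = 2
    outputs = []
    for _ in range(count):
        # 8-byte value in satoshis, least significant byte first.
        value_sat = int.from_bytes(bytes.fromhex(raw_hex[pos:pos + 16]), "little")
        pos += 16
        script_length = int(raw_hex[pos:pos + 2], 16)
        if script_length >= 0xfd:
            raise NotImplementedError("multi-byte CompactSize not handled here")
        pos += 2
        script = raw_hex[pos:pos + script_length * 2]
        pos += script_length * 2
        outputs.append({"value_sat": value_sat, "script": script})
    return outputs, raw_hex[pos:]

# The tail of this notebook's transaction, starting at the output count.
outputs_hex = (
    "02"
    "c0175302000000001976a914a3d89c53bb956f08917b44d113c6b2bcbe0c29b788ac"
    "c01c3d09000000001976a91408338e1d5e26db3fce21b011795b1c3c8a5a5d0788ac"
    "00000000"
)
outputs, remaining = parse_outputs(outputs_hex)
for number, output in enumerate(outputs, start=1):
    print("Output", number, ":", output["value_sat"], "sat,", output["script"])
print("Remaining: ", remaining)
```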


#### Lock_time

The last 4 bytes of the serialized transaction are the lock_time, which can be parsed in two ways:

https://bitcoin.org/en/developer-guide#locktime-and-sequence-number

* If less than 500 million, locktime is parsed as a block height. The transaction can be added to any block which has this height or higher.

* If greater than or equal to 500 million, locktime is parsed using the Unix epoch time format (the number of seconds elapsed since 1970-01-01T00:00 UTC—currently over 1.395 billion). The transaction can be added to any block whose block time is greater than the locktime.

In [31]:
print("There are ",int(len(rawtx)/2), " bytes left in the raw transaction.")
lock_time = rawtx
print("Lock Time: ", lock_time)

There are  4  bytes left in the raw transaction.
Lock Time:  00000000
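
The two interpretations described above can be sketched as a small function. It assumes lock_time is a 4-byte little-endian integer like the other fixed-width fields; `interpret_lock_time` and `LOCKTIME_THRESHOLD` are illustrative names.

```python
from datetime import datetime, timezone

# Values below this are block heights; values at or above it are Unix timestamps.
LOCKTIME_THRESHOLD = 500_000_000

def interpret_lock_time(lock_time_hex):
    """Return (kind, value) where kind is 'block height' or 'unix time'."""
    value = int.from_bytes(bytes.fromhex(lock_time_hex), byteorder="little")
    if value < LOCKTIME_THRESHOLD:
        return "block height", value
    return "unix time", datetime.fromtimestamp(value, tz=timezone.utc)

print(interpret_lock_time("00000000"))  # this notebook's transaction
print(interpret_lock_time("0065cd1d"))  # made-up example: exactly 500 million
```

For the transaction in this notebook the result is block height 0, meaning the lock_time places no restriction on when it can be included in a block.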
