# Analyze Live Bitcoin Blockchain Data
### Using BigQuery

[DATA SOURCE]
* https://www.kaggle.com/bigquery/bitcoin-blockchain

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [1]:
# import our bq_helper package
import bq_helper
# create a helper object for our bigquery dataset
bitcoin_blockchain = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                              dataset_name="bitcoin_blockchain")

In [2]:
# print a list of all the tables in the bitcoin_blockchain dataset
bitcoin_blockchain.list_tables()

['transactions']

In [3]:
# print information on all the columns in the "transactions" table
# in the bitcoin_blockchain dataset
bitcoin_blockchain.table_schema("transactions")

[SchemaField('timestamp', 'integer', 'NULLABLE', None, ()),
 SchemaField('transaction_id', 'string', 'NULLABLE', None, ()),
 SchemaField('inputs', 'record', 'REPEATED', None, (SchemaField('input_script_bytes', 'bytes', 'NULLABLE', None, ()), SchemaField('input_script_string', 'string', 'NULLABLE', None, ()), SchemaField('input_script_string_error', 'string', 'NULLABLE', None, ()), SchemaField('input_sequence_number', 'integer', 'NULLABLE', None, ()), SchemaField('input_pubkey_base58', 'string', 'NULLABLE', None, ()), SchemaField('input_pubkey_base58_error', 'string', 'NULLABLE', None, ()))),
 SchemaField('outputs', 'record', 'REPEATED', None, (SchemaField('output_satoshis', 'integer', 'NULLABLE', None, ()), SchemaField('output_script_bytes', 'bytes', 'NULLABLE', None, ()), SchemaField('output_script_string', 'string', 'NULLABLE', None, ()), SchemaField('output_script_string_error', 'string', 'NULLABLE', None, ()), SchemaField('output_pubkey_base58', 'string', 'NULLABLE', None, ()), Sch

Each SchemaField tells us about a specific column. In order, the information is:

* The name of the column
* The datatype in the column
* [The mode of the column](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#schema.fields.mode) (NULLABLE means that a column allows NULL values, and is the default)
* A description of the data in that column
* The nested fields, if the column is a record (empty otherwise)

So, for the first column, we have the following schema field:

`SchemaField('timestamp', 'integer', 'NULLABLE', None, ())`

This tells us that the column is called "timestamp", that it holds integers, and that NULL values are allowed.
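
To make the field order concrete, here is a minimal sketch that unpacks schema entries into readable lines. The `Field` tuple below is a hypothetical stand-in that mirrors the five positions printed above (name, type, mode, description, nested fields); it is not the BigQuery client's own `SchemaField` class, and `describe` is an illustrative helper.

```python
from typing import NamedTuple, Optional, Tuple


class Field(NamedTuple):
    """Hypothetical stand-in mirroring the printed SchemaField positions."""
    name: str
    field_type: str
    mode: str
    description: Optional[str]
    fields: Tuple["Field", ...] = ()


def describe(field, indent=0):
    """Return one readable line per column, recursing into nested records."""
    pad = "  " * indent
    lines = [f"{pad}{field.name}: {field.field_type} ({field.mode})"]
    for sub in field.fields:
        lines.extend(describe(sub, indent + 1))
    return lines


# Two entries copied from the schema output above (outputs is abbreviated).
timestamp = Field("timestamp", "integer", "NULLABLE", None, ())
outputs = Field("outputs", "record", "REPEATED", None,
                (Field("output_satoshis", "integer", "NULLABLE", None, ()),))

for line in describe(timestamp) + describe(outputs):
    print(line)
```

Nested columns such as `inputs` and `outputs` show up indented under their parent, which is easier to scan than the raw one-line representation.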

We can use the `BigQueryHelper.head()` method to check just the first couple of lines of the "transactions" table to make sure this is right. (Sometimes you'll run into databases out there where the schema isn't an accurate description of the data anymore, so it's good to check. This shouldn't be a problem with any of the BigQuery databases on Kaggle, though!)

In [4]:
# preview the first couple lines of the "transactions" table
bitcoin_blockchain.head("transactions")

Unnamed: 0,timestamp,transaction_id,inputs,outputs,block_id,previous_block,merkle_root,nonce,version,work_terahash,work_error
0,1459899321000,1b477f4cb18ff8bf8ec9db41db055d4964a7bb5a717233...,[{'input_script_bytes': b'H0E\x02!\x00\xb8\x1e...,"[{'output_satoshis': 92391000, 'output_script_...",000000000000000000d6e1daae1afafb15fafc6a10e362...,000000000000000003a394df13826b7338234398c7e95d...,1b71a1985b1c2b1411978ecede0b879820eeb850b01683...,1060179002,4,7166327277,
1,1505105531000,2b1ebc7dd319a055cb641a8be486060a35967c0f759162...,[{'input_script_bytes': b'H0E\x02!\x00\xe7\xde...,"[{'output_satoshis': 118073619, 'output_script...",000000000000000000468aaa06a632b001642dc0f1d5ce...,00000000000000000118f7e8698b2d12256477ff4d18aa...,41e98d8bdb4c4c694b189f3738f175a7e44fdd1401b466...,410256991,536870912,39631328811,
2,1397223235000,23af21c63a3a4fd210c13f180be2a11a41ef3921e64b17...,[{'input_script_bytes': b'I0F\x02!\x00\xd6J\xd...,"[{'output_satoshis': 124296849, 'output_script...",000000000000000052129807a19524bc641c94fcaf5576...,00000000000000000f1f9fc26857c7ada11eb7aa71e7ab...,a21854e45d97f4263c06bf5757610bc5a46ee9932e1d45...,2267577094,2,262844244,
3,1450714161000,613ca2e33f091d6cc65e895f9df9fdf920f51768c31581...,[{'input_script_bytes': b'G0D\x02 o_\xaa\xf1BU...,"[{'output_satoshis': 8802607568, 'output_scrip...",0000000000000000031ca198948af2739f44886e08fb0b...,0000000000000000040ebffc0bb174145b3c70b2b8534e...,a482702f28a391d9aa3997a898bea6b676c40c9e09a8e1...,1105219,4,4013651092,
4,1507245877000,1fc8bcb803b7de957362396473b739bb4d554de39df7d0...,"[{'input_script_bytes': b""H0E\x02!\x00\xady\xf...","[{'output_satoshis': 644000, 'output_script_by...",000000000000000000211de5eb2802860cfe49f1d074e1...,0000000000000000004add76174511d81312c7697b0546...,15c893389c12936297ff9d2aa8a2bdb10e5b80f1ad673b...,1101865512,536870912,48270297094,


In [6]:
# preview the first ten entries in the timestamp column of the transactions table
bitcoin_blockchain.head("transactions", selected_columns="timestamp", num_rows=10)

Unnamed: 0,timestamp
0,1459899321000
1,1505105531000
2,1397223235000
3,1450714161000
4,1507245877000
5,1508357096000
6,1481216035000
7,1514845427000
8,1492157752000
9,1473243858000
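
The timestamps above look like Unix epoch values in milliseconds; that is an assumption based on their magnitude, not something the schema states. Under that assumption, here is a minimal sketch for turning them into readable UTC datetimes (`ms_to_utc` is a hypothetical helper):

```python
from datetime import datetime, timezone


def ms_to_utc(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# First three values from the preview above.
for ms in (1459899321000, 1505105531000, 1397223235000):
    print(ms, "->", ms_to_utc(ms).isoformat())
```

For a whole dataframe column, `pd.to_datetime(df["timestamp"], unit="ms")` performs the same conversion in one call.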


## Check the size of your query before you run it
____

BigQuery datasets are, true to their name, BIG. The [biggest dataset we've got on Kaggle so far](https://www.kaggle.com/github/github-repos) is 3 terabytes. Since the monthly quota for BigQuery queries is 5 terabytes, you can easily go past your 30-day quota by running just a couple of queries!

> **What's a query?** A query is a small piece of SQL code that specifies what data you would like to scan from a database, and how much of that data you would like returned. (Note that your quota is on data *scanned*, not the amount of data returned.)

One way to help avoid this is to estimate how big your query will be before you actually execute it. You can do this with the `BigQueryHelper.estimate_query_size()` method. For the rest of this notebook, I'll be using an example query that finds the timestamp of every transaction whose merkle_root matches a given value. Let's see how much data it would scan if we actually ran it.

In [9]:
# this query looks in the transactions table in the bitcoin_blockchain
# dataset, then gets the timestamp column from every row where
# the merkle_root column equals "1b71a1985b1c2b1411978ecede0b879820eeb850b01683".
query = """SELECT timestamp
            FROM `bigquery-public-data.bitcoin_blockchain.transactions`
            WHERE merkle_root = "1b71a1985b1c2b1411978ecede0b879820eeb850b01683" 
        """

# check how big this query will be
bitcoin_blockchain.estimate_query_size(query)

20.584799559786916

Running this query would scan about 20.6 GB. (The query size is returned in gigabytes.)
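
For context, an estimate like this can be obtained from BigQuery itself with a dry run, which reports the bytes a query would process without executing it. The sketch below uses the `google-cloud-bigquery` client directly and assumes that library is installed and credentials are configured. Whether `bq_helper` does exactly this internally is an assumption, and `estimate_gb` and `bytes_to_gb` are hypothetical helper names.

```python
def bytes_to_gb(num_bytes):
    """Convert a byte count to gigabytes (1 GB taken as 2**30 bytes here)."""
    return num_bytes / 2**30


def estimate_gb(sql):
    """Dry-run a query and return the amount of data it would scan, in GB."""
    # Imported lazily so this module loads even where the client is not installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig()
    config.dry_run = True          # validate and estimate only, do not execute
    config.use_query_cache = False  # report the full scan size, not a cached result
    job = client.query(sql, job_config=config)
    return bytes_to_gb(job.total_bytes_processed)


# The conversion helper can be checked without a BigQuery connection.
print(bytes_to_gb(5 * 2**30))
```

Calling `estimate_gb(query)` requires network access and credentials, so it is left out of the runnable part of the sketch.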

In [11]:
query = """SELECT timestamp
            FROM `bigquery-public-data.bitcoin_blockchain.transactions`
            WHERE merkle_root = "1b71a1985b1c2b1411978ecede0b879820eeb850b01683"
            ORDER BY 1 DESC
            LIMIT 100
        """

# check how big this query will be
bitcoin_blockchain.estimate_query_size(query)

20.584828574210405
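
Both estimates come out at about 20.58 GB: here `ORDER BY` and `LIMIT` change what is returned, not how much data is scanned. To put that number in context, here is a rough sketch of how many such queries fit into the 5 TB monthly quota mentioned earlier. It assumes 1 TB = 1024 GB and that nothing else is charged against the quota; `queries_within_quota` is a hypothetical helper.

```python
def queries_within_quota(query_gb, quota_tb=5, gb_per_tb=1024):
    """Return how many queries of a given scan size fit into the quota."""
    if query_gb <= 0:
        raise ValueError("query_gb must be positive")
    return int((quota_tb * gb_per_tb) // query_gb)


estimate = 20.584828574210405  # GB, from the estimate above
print(queries_within_quota(estimate))

# A single full scan of a 3 TB dataset leaves no room for a second one.
print(queries_within_quota(3 * 1024))
```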

## Actually run a query
___

Now that we know how to check the size of the query (and make sure we're not scanning several terabytes of data!) we're ready to actually run our first query. You have two methods available to help you do this:

* *`BigQueryHelper.query_to_pandas(query)`*: This method takes a query and returns a Pandas dataframe.
* *`BigQueryHelper.query_to_pandas_safe(query, max_gb_scanned=1)`*: This method takes a query and returns a Pandas dataframe only if the estimated size of the query is less than `max_gb_scanned` (1 gigabyte by default).

Here's an example of a query that is larger than the specified upper limit.

In [12]:
# only run this query if it's less than 1 GB
bitcoin_blockchain.query_to_pandas_safe(query, max_gb_scanned=1)

Query cancelled; estimated size of 20.584828574210405 exceeds limit of 1 GB
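
The check behind that message can be sketched as a small guard: estimate first, and only run the query when the estimate is within the limit. `run_if_small` and its `estimate` and `run` callables are hypothetical stand-ins, not part of `bq_helper`, and the exact comparison the library uses is an assumption.

```python
def run_if_small(query, estimate, run, max_gb_scanned=1):
    """Run `query` only if its estimated scan size is within the limit."""
    size_gb = estimate(query)
    if size_gb > max_gb_scanned:
        print(f"Query cancelled; estimated size of {size_gb} exceeds "
              f"limit of {max_gb_scanned} GB")
        return None
    return run(query)


# Stand-in callables so the sketch runs without a BigQuery connection.
def fake_estimate(query):
    return 20.58


def fake_run(query):
    return "dataframe placeholder"


print(run_if_small("SELECT ...", fake_estimate, fake_run))
print(run_if_small("SELECT ...", fake_estimate, fake_run, max_gb_scanned=21))
```

The first call is cancelled and yields `None`; the second runs because the limit is above the estimate.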


And here's an example where the same query returns a dataframe, because we raise the limit above its estimated size.

In [None]:
# run the timestamp query, allowing it to scan up to 21 GB
# (its estimated size is about 20.6 GB)
timestamps = bitcoin_blockchain.query_to_pandas_safe(query, max_gb_scanned=21)