<a href="https://colab.research.google.com/github/antonum/Timescale-Workshops/blob/main/Tutorials/query-bitcoin-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TimescaleDB Blockchain Query Tutorial

Source: Synthesized from Timescale Documentation snippets found via Google Search
Based on: https://docs.timescale.com/tutorials/latest/blockchain-query/

This notebook guides you through loading a Bitcoin blockchain dataset into TimescaleDB,
querying it, and optionally compressing it.

# Setup Timescale Connection

By default, this notebook installs TimescaleDB inside the Colab runtime and connects to it at `"postgres://postgres:password@localhost/postgres"`. You can optionally point it at your own Timescale Cloud instance instead.

Try Timescale Cloud for free at: https://console.cloud.timescale.com/signup

In [1]:
import os
### Default connection for in-notebook Timescale ###
TS_CONNECTION="postgres://postgres:password@localhost/postgres"

### Use environment variable ###
#TS_CONNECTION = os.getenv("TS_CONNECTION", "postgres://postgres:password@localhost/postgres")

### Use your own Timescale Cloud instance ###
#TS_CONNECTION="postgres://tsdbadmin:PASSWORD@xxxxxxx.yyyyy.tsdb.cloud.timescale.com:39966/tsdb?sslmode=require"

### Use colab secret ###
#from google.colab import userdata
#TS_CONNECTION=userdata.get('TS_CONNECTION')

### Set environment variable to be used in psql CLI ###
os.environ["TS_CONNECTION"]=TS_CONNECTION
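
Before connecting, it can help to check that the connection string has the expected parts. The following is a minimal sketch using only the standard library; `describe_connection` is a hypothetical helper, not part of psycopg2 or any Timescale tooling, and it masks the password so the result is safe to print.

```python
from urllib.parse import urlsplit

def describe_connection(uri):
    """Split a postgres:// URI into its parts, masking the password."""
    parts = urlsplit(uri)
    return {
        "user": parts.username,
        "password": "***" if parts.password else None,
        "host": parts.hostname,
        "port": parts.port or 5432,  # 5432 is the PostgreSQL default port
        "database": parts.path.lstrip("/"),
    }

print(describe_connection("postgres://postgres:password@localhost/postgres"))
```

A missing `@` between the password and the host shows up here as an unexpected `user` or `host` value, which is easier to spot than a failed connection attempt.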

In [2]:
#@title Install Timescale
%%bash
set -e # Exit immediately if a command exits with a non-zero status.

# --- Configuration ---
PG_VERSION="17"
PGVECTORSCALE_VERSION="0.7.0"
PG_PASSWORD="password" # Consider using a more secure password

echo "--- 1. Installing Prerequisites & Adding Repositories ---"
# Install essential packages quietly
apt-get -qq -y install gnupg postgresql-common apt-transport-https lsb-release wget > /dev/null 2>&1

# Add the official PostgreSQL repository
# The 'yes |' answers confirmation prompts automatically. Output redirected.
yes | /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh > /dev/null 2>&1

# Add the TimescaleDB repository
echo "deb https://packagecloud.io/timescale/timescaledb/ubuntu/ $(lsb_release -c -s) main" | sudo tee /etc/apt/sources.list.d/timescaledb.list > /dev/null
# Add the TimescaleDB GPG key using the recommended method (avoids apt-key add)
wget --quiet -O - https://packagecloud.io/timescale/timescaledb/gpgkey | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/timescaledb.gpg

echo "--- 2. Updating Package List & Installing PostgreSQL + Extensions ---"
# Update package list quietly (should suppress apt-key warnings too)
apt-get -qq update > /dev/null 2>&1

# Install PostgreSQL, TimescaleDB, pgvector, toolkit, and client
apt-get -qq -y install \
  "timescaledb-2-postgresql-${PG_VERSION}" \
  "postgresql-client-${PG_VERSION}" \
  "postgresql-${PG_VERSION}-pgvector" \
  "timescaledb-toolkit-postgresql-${PG_VERSION}" > /dev/null 2>&1

echo "--- 3. Installing pgvectorscale ---"
# Download and install pgvectorscale
wget --quiet "https://github.com/timescale/pgvectorscale/releases/download/${PGVECTORSCALE_VERSION}/pgvectorscale-${PGVECTORSCALE_VERSION}-pg${PG_VERSION}-amd64.zip" -O pgvectorscale.zip
unzip -q pgvectorscale.zip # Use -q for quiet unzip
# Install the .deb package quietly
apt-get -qq -y install "./pgvectorscale-postgresql-${PG_VERSION}_${PGVECTORSCALE_VERSION}-Linux_amd64.deb" > /dev/null 2>&1

# Clean up downloaded files
rm pgvectorscale.zip "./pgvectorscale-postgresql-${PG_VERSION}_${PGVECTORSCALE_VERSION}-Linux_amd64.deb"

echo "--- 4. Configuring PostgreSQL & TimescaleDB ---"
# Tune PostgreSQL for TimescaleDB
timescaledb-tune --quiet --yes  > /dev/null 2>&1

# Restart PostgreSQL service to apply changes
service postgresql restart
sleep 2 # Give the service a moment to restart fully

echo "--- 5. Setting Up Database User and Extensions ---"
# Set the password for the default postgres user
sudo -u postgres psql -c "ALTER USER postgres PASSWORD '${PG_PASSWORD}'" > /dev/null

# Connect as the postgres user and create extensions quietly
psql -d "postgres://postgres:${PG_PASSWORD}@localhost/postgres" > /dev/null <<EOF
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
CREATE EXTENSION IF NOT EXISTS timescaledb_toolkit CASCADE;
CREATE EXTENSION IF NOT EXISTS vector CASCADE;
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
EOF

echo "--- Installation and Setup Complete ---"



--- 1. Installing Prerequisites & Adding Repositories ---
--- 2. Updating Package List & Installing PostgreSQL + Extensions ---
--- 3. Installing pgvectorscale ---
--- 4. Configuring PostgreSQL & TimescaleDB ---
 * Restarting PostgreSQL 17 database server
   ...done.
--- 5. Setting Up Database User and Extensions ---
--- Installation and Setup Complete ---


In [3]:
# Optional: Verify extensions are installed
#!psql -d $TS_CONNECTION -c '\dx'

In [4]:
#@title Init psycopg2 connection to Timescale
import pandas as pd
import psycopg2

# establish connection to Timescale
conn = psycopg2.connect(TS_CONNECTION)
cursor = conn.cursor()

# helper function that runs a query and returns any result set as a DataFrame
def execute_sql(query, cursor=cursor):
    try:
        cursor.execute(query)
        conn.commit()
        # Check if query returns data (SELECT)
        if cursor.description:  # If description is not None, query returned data
            columns = [desc[0] for desc in cursor.description]
            data = cursor.fetchall()
            df = pd.DataFrame(data, columns=columns)
            return df
        else:
            # No result set (INSERT, UPDATE, DELETE, DDL). psycopg2 reports
            # rowcount -1 when a count does not apply, e.g. for CREATE TABLE.
            return f"Rows affected: {cursor.rowcount}"

    except psycopg2.Error as e:
        print(f"Error executing SQL query: {e}")
        conn.rollback()  # Rollback changes in case of error
        return None  # Or raise the exception if you prefer

## Setting up your dataset

Prerequisites:
* A running TimescaleDB instance (Timescale Cloud or self-hosted).
* psql command-line utility installed and connected to your database.

Create a standard PostgreSQL table to store the Bitcoin blockchain data.
The sample dataset covers five days of Bitcoin transactions
(about 2.7 million rows, per the COPY output later in this notebook).
Each row records one transaction: its value in satoshi,
whether it is a coinbase transaction, and miner rewards.

In [5]:
query = """
CREATE TABLE transactions (
    time TIMESTAMPTZ,
    block_id INT,
    hash TEXT,
    size INT,
    weight INT,
    is_coinbase BOOLEAN,
    output_total BIGINT,
    output_total_usd DOUBLE PRECISION,
    fee BIGINT,
    fee_usd DOUBLE PRECISION,
    details JSONB
);
"""
execute_sql(query)

'Rows affected: -1'

Convert the standard table into a hypertable partitioned on the time column.
Hypertables are the core of TimescaleDB, enabling efficient handling of time-series data
by automatically partitioning data by time.

In [6]:
query = """
SELECT create_hypertable('transactions', by_range('time'));
"""
execute_sql(query)

Unnamed: 0,create_hypertable
0,"(1,t)"


Note: The by_range dimension builder is available in TimescaleDB 2.13+.
For older versions, you might use: SELECT create_hypertable('transactions', 'time');
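
To make time partitioning concrete, here is a conceptual sketch (not TimescaleDB's actual implementation) of how rows could be assigned to fixed-width time ranges, called chunks. It assumes a 7-day chunk width, which is the documented default `chunk_time_interval`, and aligns ranges to the Unix epoch purely for illustration; `chunk_range` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def chunk_range(ts, width=timedelta(days=7)):
    """Return the [start, end) time range of the chunk that would hold ts."""
    index = (ts - EPOCH) // width  # whole chunk widths since the epoch
    start = EPOCH + index * width
    return start, start + width

a = datetime(2023, 11, 17, 12, 0, tzinfo=timezone.utc)
b = datetime(2023, 11, 21, 23, 57, 55, tzinfo=timezone.utc)
print(chunk_range(a))
print(chunk_range(a) == chunk_range(b))  # do both timestamps share a chunk?
```

Under these assumptions, a query that filters on a narrow time range only needs to touch the chunks whose ranges overlap it.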

When you create a hypertable, an index on the time column is automatically created.
However, filtering on other columns is common. Create additional indexes for performance.

Create an index on the hash column for faster individual transaction lookups.

In [7]:
query = """
CREATE INDEX hash_idx ON public.transactions USING HASH (hash);
"""
execute_sql(query)

'Rows affected: -1'

Create an index on the block_id column for faster block-level queries.

In [8]:
query = """
CREATE INDEX block_idx ON public.transactions (block_id);
"""
execute_sql(query)

'Rows affected: -1'

Create a unique index on time and hash to prevent duplicate records.

In [9]:
query = """
CREATE UNIQUE INDEX time_hash_idx ON public.transactions (time, hash);
"""
execute_sql(query)

'Rows affected: -1'

Download the sample dataset.
The bitcoin_sample.zip archive contains a .csv file with Bitcoin transactions for five days.

In [10]:
!wget https://assets.timescale.com/docs/downloads/bitcoin-blockchain/bitcoin_sample.zip

--2025-04-17 14:42:16--  https://assets.timescale.com/docs/downloads/bitcoin-blockchain/bitcoin_sample.zip
Resolving assets.timescale.com (assets.timescale.com)... 3.165.160.30, 3.165.160.123, 3.165.160.101, ...
Connecting to assets.timescale.com (assets.timescale.com)|3.165.160.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 175175921 (167M) [binary/octet-stream]
Saving to: ‘bitcoin_sample.zip’


2025-04-17 14:42:21 (39.3 MB/s) - ‘bitcoin_sample.zip’ saved [175175921/175175921]



Unzip the downloaded file.
In a notebook, prefix shell commands with `!`; in psql, use `\!` instead.
Make sure the zip file is in the current working directory.

In [11]:
!unzip bitcoin_sample.zip

Archive:  bitcoin_sample.zip
  inflating: tutorial_bitcoin_sample.csv  


Load the data from the CSV file into the transactions table using the COPY command.
Ensure 'tutorial_bitcoin_sample.csv' is in the correct path accessible by the psql client.
This might take a few minutes depending on your connection and client resources.

In [12]:
!psql -d $TS_CONNECTION -c "\COPY transactions FROM 'tutorial_bitcoin_sample.csv' CSV HEADER;"

COPY 2719085
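
If psql is not available, the same load can be driven from Python. The sketch below assumes a psycopg2 cursor like the one created earlier and uses its `copy_expert` method to stream the file through `COPY ... FROM STDIN`; `load_csv` is a hypothetical helper, and the table and file names are the ones used in the steps above.

```python
def load_csv(cursor, path, table="transactions"):
    """Stream a CSV file with a header row into a table via COPY FROM STDIN."""
    # The table name is interpolated directly, so pass only trusted identifiers.
    sql = f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)"
    with open(path, "r", encoding="utf-8") as f:
        cursor.copy_expert(sql, f)

# Usage, assuming `cursor` and `conn` from the connection cell:
# load_csv(cursor, "tutorial_bitcoin_sample.csv")
# conn.commit()
```

Streaming through STDIN keeps the file on the client side, which mirrors what psql's `\COPY` does.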


Clean up the downloaded files.

In [13]:
!rm tutorial_bitcoin_sample.csv bitcoin_sample.zip

## Querying your dataset

Now that the data is loaded, you can run queries.

Query for the five most recent non-coinbase transactions.

In [14]:
query = """
SELECT time, hash, block_id, weight
FROM transactions
WHERE is_coinbase IS NOT TRUE
ORDER BY time DESC
LIMIT 5;
"""
execute_sql(query)

Unnamed: 0,time,hash,block_id,weight
0,2023-11-21 23:57:55+00:00,9cf2419d8b5edabf65ed960da638003b6f4b845014af6d...,817877,437
1,2023-11-21 23:57:55+00:00,465ffa7ca6eb759f8ac564422c4021ddb0e235836ac463...,817877,437
2,2023-11-21 23:57:55+00:00,0b2d2ffacfa084cae5ffcefe5ae65a13f05151f8a09845...,817877,438
3,2023-11-21 23:57:55+00:00,d1a3d858517c5947d634be097701d32bc4b95d243cdca0...,817877,721
4,2023-11-21 23:57:55+00:00,fac6b32ca7aa4503084ed55ff792c5701977e128f341a0...,817877,442


Query to get transaction count, total weight, and total USD value for the 5 most recent blocks.
This uses a Common Table Expression (CTE) to first find the latest blocks.

In [15]:
query = """
WITH recent_blocks AS (
    SELECT block_id
    FROM transactions
    WHERE is_coinbase IS TRUE
    ORDER BY time DESC
    LIMIT 5
)
SELECT
    t.block_id,
    count(*) AS transaction_count,
    SUM(weight) AS block_weight,
    SUM(output_total_usd) AS block_value_usd
FROM transactions t
INNER JOIN recent_blocks b ON b.block_id = t.block_id
GROUP BY t.block_id;
"""
execute_sql(query)

Unnamed: 0,block_id,transaction_count,block_weight,block_value_usd
0,817875,3599,3992696,155546300.0
1,817874,2076,3992685,12807270.0
2,817876,2032,3992983,22790700.0
3,817873,3842,3997010,145384000.0
4,817877,2268,3993274,68447840.0


You can also analyze the data with time_bucket() and other hyperfunctions.
The following query computes the average transaction value in USD for each day.

In [16]:
query = """
SELECT time_bucket('1 day', time) AS bucket, avg(output_total_usd)
FROM transactions
GROUP BY bucket
ORDER BY bucket DESC;
"""
execute_sql(query)

Unnamed: 0,bucket,avg
0,2023-11-21 00:00:00+00:00,69031.205565
1,2023-11-20 00:00:00+00:00,54529.545866
2,2023-11-19 00:00:00+00:00,28969.39308
3,2023-11-18 00:00:00+00:00,28452.000654
4,2023-11-17 00:00:00+00:00,49540.638896
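
Conceptually, `time_bucket('1 day', time)` rounds each timestamp down to the start of its bucket, and the `GROUP BY` then aggregates per bucket. The pure-Python sketch below mimics that for day-wide buckets on made-up values; `bucket_start` and the sample rows are illustrative only, and flooring from the Unix epoch is an assumption that happens to align with UTC midnight.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def bucket_start(ts, width=timedelta(days=1)):
    """Round a timestamp down to the start of its fixed-width bucket."""
    return EPOCH + ((ts - EPOCH) // width) * width

# Made-up (time, output_total_usd) rows, not taken from the dataset.
rows = [
    (datetime(2023, 11, 20, 8, 30, tzinfo=timezone.utc), 100.0),
    (datetime(2023, 11, 20, 21, 5, tzinfo=timezone.utc), 300.0),
    (datetime(2023, 11, 21, 0, 0, 1, tzinfo=timezone.utc), 50.0),
]

values = defaultdict(list)
for ts, usd in rows:
    values[bucket_start(ts)].append(usd)

# Average per bucket, newest bucket first (like ORDER BY bucket DESC).
daily_avg = {b: sum(v) / len(v) for b, v in sorted(values.items(), reverse=True)}
for bucket, avg in daily_avg.items():
    print(bucket.isoformat(), avg)
```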


## Bonus: Store data efficiently

TimescaleDB allows compressing data to save storage space and potentially improve query performance
on large datasets. Compression converts data into a columnar format.

Configure compression on the hypertable.
'segment by' columns define how data is grouped within compressed chunks.
'order by' columns define the sort order within compressed chunks, improving compression ratios.

In [17]:
query = """
ALTER TABLE transactions SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'block_id', -- Example: segment by block_id
    timescaledb.compress_orderby = 'time DESC'     -- Example: order by time descending
);
"""
execute_sql(query)

'Rows affected: -1'
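
The effect of the segment-by and order-by settings can be pictured with a small, purely conceptual sketch: rows are grouped by `block_id`, sorted by `time` descending within each group, and each remaining column is stored as one array per group. This is not how TimescaleDB is implemented internally; the function name and sample rows are made up for illustration.

```python
from collections import defaultdict

def to_segments(rows, segment_by, order_by):
    """Group row dicts by one column and pivot each group into column arrays."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_by]].append(row)
    segments = {}
    for key, members in groups.items():
        members.sort(key=lambda r: r[order_by], reverse=True)  # order_by DESC
        columns = [c for c in members[0] if c != segment_by]
        segments[key] = {c: [r[c] for r in members] for c in columns}
    return segments

# Made-up rows with integer stand-ins for timestamps.
rows = [
    {"block_id": 1, "time": 10, "fee": 5},
    {"block_id": 2, "time": 11, "fee": 7},
    {"block_id": 1, "time": 12, "fee": 6},
]
print(to_segments(rows, "block_id", "time"))
```

Values that sit next to each other in one of these arrays tend to be similar, which is what gives per-column compression something to work with.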

Create a policy to automatically compress chunks older than a certain age (e.g., 7 days).
The database will periodically run this policy in the background.

In [18]:
query = """
SELECT add_compression_policy('transactions', INTERVAL '7 days');
"""
execute_sql(query)

Unnamed: 0,add_compression_policy
0,1000


Note: After enabling compression, older data chunks will be compressed according to the policy.
Queries generally work transparently on compressed and uncompressed data.
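
Which chunks a policy like this would pick up can be sketched as a simple age check. The example assumes a chunk becomes eligible once the end of its time range is at or before `now - compress_after`; the chunk ranges and the `eligible_chunks` helper are hypothetical. Because the sample data dates from November 2023, its chunks are already well past the 7-day threshold.

```python
from datetime import datetime, timedelta, timezone

def eligible_chunks(chunks, now, compress_after=timedelta(days=7)):
    """Return chunks whose time range ends at or before now - compress_after."""
    cutoff = now - compress_after
    return [c for c in chunks if c[1] <= cutoff]

utc = timezone.utc
# Hypothetical (start, end) chunk ranges: one old, one recent.
chunks = [
    (datetime(2023, 11, 16, tzinfo=utc), datetime(2023, 11, 23, tzinfo=utc)),
    (datetime(2025, 4, 10, tzinfo=utc), datetime(2025, 4, 17, tzinfo=utc)),
]
now = datetime(2025, 4, 17, 15, 0, tzinfo=utc)
print(eligible_chunks(chunks, now))
```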



## Measure the effect of compression

Note: Compressing the data takes some time, so results might not be available for a couple of minutes.

In [23]:
query = """
SELECT
    'transactions' AS hypertable,
    pg_size_pretty(before_compression_total_bytes) AS before_compression,
    pg_size_pretty(after_compression_total_bytes) AS after_compression
FROM hypertable_compression_stats('transactions');
"""
execute_sql(query)

Unnamed: 0,hypertable,before_compression,after_compression
0,transactions,1679 MB,265 MB
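
The two sizes reported above translate into a compression ratio and a space saving with a little arithmetic. The sketch below simply plugs in the values from this run; your numbers will differ.

```python
before_mb = 1679  # before_compression from the output above
after_mb = 265    # after_compression from the output above

ratio = before_mb / after_mb
saving = 1 - after_mb / before_mb
print(f"compression ratio: {ratio:.1f}x, space saved: {saving:.1%}")
```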

