# Exploring Apache Iceberg on Cloudflare R2

This notebook demonstrates querying federal regulations data through the **R2 Data Catalog** (Apache Iceberg REST catalog).

**What's different from raw Parquet?**
- Tables are managed: schema evolution, ACID transactions, and time travel
- Multi-engine access: DuckDB, PyIceberg, Spark, and Snowflake all see the same catalog
- Incremental updates: append or update rows without rewriting the whole dataset

Data source: [regulations.gov](https://www.regulations.gov/) via [Mirrulations](https://github.com/MoravianUniversity/mirrulations)

In [1]:
# Install dependencies (run once)
# !pip install duckdb pandas python-dotenv "pyiceberg[pyarrow]"

## 1. Connect via DuckDB

DuckDB 1.4+ can attach directly to an Iceberg REST catalog. Once attached, tables are queryable with standard SQL.

In [2]:
import os
import duckdb
from dotenv import load_dotenv

load_dotenv()

# Iceberg catalog config
CATALOG_URI = "https://catalog.cloudflarestorage.com/a18589c7a7a0fc4febecadfc9c71b105/spicy-regs"
WAREHOUSE = "a18589c7a7a0fc4febecadfc9c71b105_spicy-regs"
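TOKEN = os.getenv("R2_API_TOKEN")
if not TOKEN:
    # Fail early: otherwise the literal string 'None' is sent as the token
    raise RuntimeError("R2_API_TOKEN is not set (check your .env file)")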

# Initialize DuckDB with Iceberg support
conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg;")
conn.execute("INSTALL httpfs; LOAD httpfs;")

# Authenticate
conn.execute(f"""
    CREATE SECRET r2_secret (
        TYPE ICEBERG,
        TOKEN '{TOKEN}'
    );
""")

# Attach the catalog
conn.execute(f"""
    ATTACH '{WAREHOUSE}' AS spicy_regs (
        TYPE ICEBERG,
        ENDPOINT '{CATALOG_URI}'
    );
""")

print("✓ Connected to Iceberg catalog")
print(f"DuckDB {duckdb.__version__}")

✓ Connected to Iceberg catalog
DuckDB 1.4.3
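
The cell above interpolates the token and catalog values straight into SQL with f-strings. A minimal sketch of building the same two statements with quoting and an empty-token check follows. The helper names (`sql_quote`, `build_secret_sql`, `build_attach_sql`) and the example endpoint are hypothetical and not part of DuckDB; the statement shapes are copied from the cell above.

```python
def sql_quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def build_secret_sql(token: str, name: str = "r2_secret") -> str:
    """Build the CREATE SECRET statement used above, rejecting an empty token."""
    if not token:
        raise ValueError("token is empty; is R2_API_TOKEN set?")
    return f"CREATE SECRET {name} (TYPE ICEBERG, TOKEN {sql_quote(token)});"


def build_attach_sql(warehouse: str, endpoint: str, alias: str = "spicy_regs") -> str:
    """Build the ATTACH statement used above with quoted string values."""
    return (
        f"ATTACH {sql_quote(warehouse)} AS {alias} "
        f"(TYPE ICEBERG, ENDPOINT {sql_quote(endpoint)});"
    )


# Placeholder values for illustration only
print(build_secret_sql("example-token"))
print(build_attach_sql("my_warehouse", "https://example.invalid/catalog"))
```

In the notebook, the returned strings could be passed to `conn.execute(...)` in place of the inline f-strings.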


## 2. Discover the Catalog

Browse schemas (namespaces) and tables — just like a traditional database.

In [3]:
# List all tables in the catalog
conn.execute("SHOW ALL TABLES").fetchdf()

Unnamed: 0,database,schema,name,column_names,column_types,temporary
0,spicy_regs,regulations,comments,[__],[UNKNOWN],False
1,spicy_regs,regulations,dockets,[__],[UNKNOWN],False
2,spicy_regs,regulations,documents,[__],[UNKNOWN],False


In [4]:
# Set the active schema for shorter queries
conn.execute("USE spicy_regs.regulations")
print("✓ Active schema: spicy_regs.regulations")

✓ Active schema: spicy_regs.regulations


In [5]:
# Inspect the schema of each table
for table in ["dockets", "documents", "comments"]:
    print(f"\n── {table} ──")
    display(conn.execute(f"DESCRIBE {table}").fetchdf())


── dockets ──


Unnamed: 0,column_name,column_type,null,key,default,extra
0,docket_id,VARCHAR,YES,,,
1,agency_code,VARCHAR,YES,,,
2,title,VARCHAR,YES,,,
3,docket_type,VARCHAR,YES,,,
4,modify_date,VARCHAR,YES,,,
5,abstract,VARCHAR,YES,,,



── documents ──


Unnamed: 0,column_name,column_type,null,key,default,extra
0,document_id,VARCHAR,YES,,,
1,docket_id,VARCHAR,YES,,,
2,agency_code,VARCHAR,YES,,,
3,title,VARCHAR,YES,,,
4,document_type,VARCHAR,YES,,,
5,posted_date,VARCHAR,YES,,,
6,modify_date,VARCHAR,YES,,,
7,comment_start_date,VARCHAR,YES,,,
8,comment_end_date,VARCHAR,YES,,,
9,file_url,VARCHAR,YES,,,



── comments ──


Unnamed: 0,column_name,column_type,null,key,default,extra
0,comment_id,VARCHAR,YES,,,
1,docket_id,VARCHAR,YES,,,
2,agency_code,VARCHAR,YES,,,
3,title,VARCHAR,YES,,,
4,comment,VARCHAR,YES,,,
5,document_type,VARCHAR,YES,,,
6,posted_date,VARCHAR,YES,,,
7,modify_date,VARCHAR,YES,,,
8,receive_date,VARCHAR,YES,,,
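
Every column is `VARCHAR`, including the date fields, so any date logic needs an explicit cast or parse. The sketch below parses these strings on the Python side. It assumes values follow the `YYYY-MM-DDTHH:MM:SSZ` shape seen in the query outputs in this notebook. The helper name `parse_regs_timestamp` is hypothetical, and it returns `None` for anything unparseable, similar in spirit to SQL `TRY_CAST`.

```python
from datetime import datetime, timezone


def parse_regs_timestamp(value):
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' into an aware UTC datetime, or return None."""
    if not value:
        return None
    try:
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    # The trailing 'Z' means UTC, so attach the timezone explicitly
    return parsed.replace(tzinfo=timezone.utc)


print(parse_regs_timestamp("2026-01-27T17:23:28Z"))
print(parse_regs_timestamp("not a date"))
```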


## 3. Dataset Overview

Get row counts and basic stats — same queries as the Parquet notebook, but now reading from managed Iceberg tables.

In [6]:
# Row counts for all tables
for table in ["dockets", "documents", "comments"]:
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    agencies = conn.execute(f"SELECT COUNT(DISTINCT agency_code) FROM {table}").fetchone()[0]
    print(f"{table}: {count:,} rows, {agencies} agencies")

dockets: 393,580 rows, 194 agencies
documents: 2,122,180 rows, 315 agencies
comments: 31,615,846 rows, 178 agencies


## 4. Query Examples

### Top Agencies by Docket Count

In [7]:
conn.execute("""
    SELECT agency_code, COUNT(*) as docket_count
    FROM dockets
    GROUP BY agency_code
    ORDER BY docket_count DESC
    LIMIT 15
""").fetchdf()

Unnamed: 0,agency_code,docket_count
0,FDA,74470
1,FAA,58658
2,EPA,34100
3,DOT,14526
4,USCG,14064
5,FMCSA,10098
6,NOAA,9571
7,PHMSA,8527
8,HUD,6919
9,NRC,6635


### Recent EPA Dockets

In [8]:
conn.execute("""
    SELECT docket_id, title, docket_type, modify_date
    FROM dockets
    WHERE agency_code = 'EPA'
    ORDER BY modify_date DESC
    LIMIT 10
""").fetchdf()

Unnamed: 0,docket_id,title,docket_type,modify_date
0,EPA-R06-OW-2025-0307,Proposed Modification of The NPDES General Per...,Nonrulemaking,2026-01-27T17:23:28Z
1,EPA-R08-OAR-2025-2375,"Marathon Oil Company - Goodbird USA CTB, Synth...",Nonrulemaking,2026-01-27T17:15:25Z
2,EPA-R08-OAR-2025-0717,"Marathon Oil Company - Bulls Eye, Keith, Glisa...",Nonrulemaking,2026-01-27T14:50:18Z
3,EPA-R08-OAR-2025-2070,Montana - Revisions to Western Sugar SO2 Stipu...,Rulemaking,2026-01-23T14:38:50Z
4,EPA-R03-OAR-2025-2532,Air Plan Approval; Maryland; Clean Data Determ...,Rulemaking,2026-01-23T14:07:02Z
5,EPA-R05-OW-2025-2832,"Republic Services of Michigan I, LLC",Nonrulemaking,2026-01-21T18:30:31Z
6,EPA-R08-OAR-2025-0667,"Marathon Oil Company - Collins USA CTB, Synthe...",Nonrulemaking,2026-01-20T23:56:27Z
7,EPA-HQ-OW-2025-0322,Updated Definition of Waters of the United States,Rulemaking,2026-01-14T21:11:07Z
8,EPA-HQ-OPP-2024-0239,Pyriofenone on Apple and Cherry subgroup 12-12...,Rulemaking,2026-01-14T19:51:45Z
9,EPA-R08-OAR-2021-0418,"CR Group, LLC - Tekoi Landfill, Part 71 Renewa...",Nonrulemaking,2026-01-14T18:33:24Z
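
`ORDER BY modify_date DESC` sorts text here, not timestamps. That still gives chronological order as long as every value uses the same fixed-width UTC format, which is an assumption based on the rows shown above. A small check using four of those values:

```python
from datetime import datetime

# modify_date values copied from the output above, deliberately shuffled
modify_dates = [
    "2026-01-14T18:33:24Z",
    "2026-01-27T17:23:28Z",
    "2026-01-23T14:38:50Z",
    "2026-01-27T14:50:18Z",
]


def parse(value):
    """Parse the fixed 'YYYY-MM-DDTHH:MM:SSZ' format into a datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


by_string = sorted(modify_dates, reverse=True)           # lexicographic order
by_time = sorted(modify_dates, key=parse, reverse=True)  # chronological order
print(by_string == by_time)
```

If the column ever mixes formats or UTC offsets, the two orders can diverge, and casting to a timestamp before sorting would be the safer choice.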


### Documents with Open Comment Periods

In [9]:
conn.execute("""
    SELECT document_id, agency_code, title, 
           comment_start_date, comment_end_date
    FROM documents
    WHERE comment_end_date IS NOT NULL
      AND TRY_CAST(comment_end_date AS DATE) > CURRENT_DATE
    ORDER BY comment_end_date ASC
    LIMIT 10
""").fetchdf()

Unnamed: 0,document_id,agency_code,title,comment_start_date,comment_end_date
0,DOT-OST-2025-2085-0001,DOT,Request for Information: Transportation Resear...,2026-01-13T05:00:00Z,2026-02-13T04:59:59Z
1,MCC-2026-0001-0001,MCC,Agency Information Collection Activities; Prop...,2026-01-13T05:00:00Z,2026-02-13T04:59:59Z
2,ITC-2025-2237-0001,ITC,"Investigations; Determinations, Modifications,...",2025-12-01T05:00:00Z,2026-02-13T04:59:59Z
3,NASA_FRDOC_0001-1064,NASA,Agency Information Collection Activities; Prop...,2026-01-13T05:00:00Z,2026-02-13T04:59:59Z
4,DOL_FRDOC_0001-2609,DOL,Agency Information Collection Activities; Prop...,2026-01-13T05:00:00Z,2026-02-13T04:59:59Z
5,ITC-2025-2234-0001,ITC,"Investigations; Determinations, Modifications,...",2025-12-01T05:00:00Z,2026-02-13T04:59:59Z
6,FAA-2025-0562-0003,FAA,Agency Information Collection Activities; Prop...,2026-01-13T05:00:00Z,2026-02-13T04:59:59Z
7,FAA-2013-0259-4320,FAA,Waiver with Respect to Land: Indianapolis Down...,2026-01-13T05:00:00Z,2026-02-13T04:59:59Z
8,MCC-2026-0001-0001,MCC,Agency Information Collection Activities; Prop...,2026-01-13T05:00:00Z,2026-02-13T04:59:59Z
9,FHWA-2025-0499-0001,FHWA,Agency Information Collection Activities; Prop...,2025-12-15T05:00:00Z,2026-02-14T04:59:59Z
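
One `document_id` appears twice in the output above, which suggests the table may hold duplicate rows. A minimal sketch of flagging repeated ids in a result set follows; the ids are copied from the output above and the helper name `find_duplicates` is hypothetical.

```python
from collections import Counter

# document_id values copied from the query output above
document_ids = [
    "DOT-OST-2025-2085-0001",
    "MCC-2026-0001-0001",
    "ITC-2025-2237-0001",
    "NASA_FRDOC_0001-1064",
    "DOL_FRDOC_0001-2609",
    "ITC-2025-2234-0001",
    "FAA-2025-0562-0003",
    "FAA-2013-0259-4320",
    "MCC-2026-0001-0001",
    "FHWA-2025-0499-0001",
]


def find_duplicates(ids):
    """Return {id: count} for every id that occurs more than once."""
    return {key: n for key, n in Counter(ids).items() if n > 1}


duplicates = find_duplicates(document_ids)
print(duplicates)
```

The same check can be run in SQL with `GROUP BY document_id HAVING COUNT(*) > 1`.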


### Most Commented Dockets (Cross-Table Join)

Because all three tables are registered in the same catalog, a join refers to them by name, just as in a traditional database; no file paths are needed.

In [10]:
conn.execute("""
    SELECT 
        d.docket_id,
        d.agency_code,
        d.title,
        COUNT(c.comment_id) as comment_count
    FROM dockets d
    LEFT JOIN comments c ON d.docket_id = c.docket_id
    WHERE d.agency_code = 'EPA'
    GROUP BY d.docket_id, d.agency_code, d.title
    ORDER BY comment_count DESC
    LIMIT 10
""").fetchdf()

Unnamed: 0,docket_id,agency_code,title,comment_count
0,EPA-HQ-OAR-2023-0072,EPA,New Source Performance Standards for GHG Emiss...,690660
1,EPA-HQ-OAR-2022-0829,EPA,Multi-Pollutant Emissions Standards for Model ...,366420
2,EPA-HQ-OAR-2015-0072,EPA,Review of the National Ambient Air Quality Sta...,242821
3,EPA-HQ-OAR-2018-0794,EPA,National Emission Standards for Hazardous Air ...,242220
4,EPA-HQ-OAR-2022-0985,EPA,Greenhouse Gas Emissions Standards for Heavy-D...,155700
5,EPA-HQ-OW-2022-0801,EPA,National Primary Drinking Water Regulations: L...,129360
6,EPA-HQ-OW-2009-0819,EPA,Rulemaking for the Steam Electric Power Genera...,123156
7,EPA-HQ-OAR-2022-0730,EPA,New Source Performance Standards for the Synth...,117880
8,EPA-HQ-OW-2022-0114,EPA,Per- and polyfluoroalkyl substances (PFAS): Pe...,95030
9,EPA-HQ-OAR-2021-0317,EPA,"Standards of Performance for New, Reconstructe...",80058
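
A caveat on the counts above: other outputs in this notebook show repeated ids (`MCC-2026-0001-0001` in the open-comment query, `EPA-HQ-OA-2006-0734` in the PyIceberg scan), so the tables may contain duplicate rows. If `dockets` has duplicates, a `LEFT JOIN` to `comments` counts each comment once per matching docket row. The sketch below reproduces that effect on toy data with Python's built-in `sqlite3`, used here as a stand-in for DuckDB. The rows are hypothetical, and `COUNT(DISTINCT ...)` is one possible guard, not necessarily the right fix for this dataset.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE dockets (docket_id TEXT, agency_code TEXT);
    CREATE TABLE comments (comment_id TEXT, docket_id TEXT);
    -- D-1 is deliberately inserted twice to simulate a duplicate docket row
    INSERT INTO dockets VALUES ('D-1', 'EPA'), ('D-1', 'EPA'), ('D-2', 'EPA');
    INSERT INTO comments VALUES ('C-1', 'D-1'), ('C-2', 'D-1'), ('C-3', 'D-2');
    """
)

JOIN_SQL = """
    SELECT d.docket_id, {agg} AS comment_count
    FROM dockets d
    LEFT JOIN comments c ON d.docket_id = c.docket_id
    GROUP BY d.docket_id
    ORDER BY d.docket_id
"""

# Plain COUNT: each comment on D-1 is counted once per duplicate docket row
naive = con.execute(JOIN_SQL.format(agg="COUNT(c.comment_id)")).fetchall()

# COUNT(DISTINCT ...): each comment is counted once regardless of duplicates
deduped = con.execute(JOIN_SQL.format(agg="COUNT(DISTINCT c.comment_id)")).fetchall()

print("naive:  ", naive)
print("deduped:", deduped)
```

Running both aggregates against the real tables and comparing the results would show whether the figures above are affected.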


### Comment Volume by Month

In [11]:
conn.execute("""
    SELECT 
        EXTRACT(YEAR FROM TRY_CAST(posted_date AS DATE))::INT as year,
        EXTRACT(MONTH FROM TRY_CAST(posted_date AS DATE))::INT as month,
        COUNT(*) as comment_count
    FROM comments
    WHERE posted_date IS NOT NULL 
      AND TRY_CAST(posted_date AS DATE) IS NOT NULL
      AND TRY_CAST(posted_date AS DATE) >= DATE '2024-01-01'
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchdf()

Unnamed: 0,year,month,comment_count
0,2024,1,132054
1,2024,2,143178
2,2024,3,173163
3,2024,4,71426
4,2024,5,194535
5,2024,6,70233
6,2024,7,82098
7,2024,8,146114
8,2024,9,546095
9,2024,10,89850


## 5. Compare: Iceberg vs Raw Parquet

The same queries work against both. The key difference is how you reference the data.

In [12]:
import time

R2_PUBLIC_URL = "https://pub-5fc11ad134984edf8d9af452dd1849d6.r2.dev"

query = "SELECT agency_code, COUNT(*) as cnt FROM {source} GROUP BY 1 ORDER BY 2 DESC LIMIT 5"

# Iceberg table (single run; timings include network latency and are only indicative)
t0 = time.perf_counter()
iceberg_result = conn.execute(query.format(source="dockets")).fetchdf()
iceberg_time = time.perf_counter() - t0

# Raw Parquet
t0 = time.perf_counter()
parquet_result = conn.execute(
    query.format(source=f"read_parquet('{R2_PUBLIC_URL}/dockets.parquet')")
).fetchdf()
parquet_time = time.perf_counter() - t0

print(f"Iceberg: {iceberg_time:.2f}s")
print(f"Parquet:  {parquet_time:.2f}s")
print(f"\nResults match: {iceberg_result.equals(parquet_result)}")
display(iceberg_result)

Iceberg: 0.84s
Parquet:  2.22s

Results match: True


Unnamed: 0,agency_code,cnt
0,FDA,74470
1,FAA,58658
2,EPA,34100
3,DOT,14526
4,USCG,14064
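
A single timed run per source is sensitive to network jitter and to whatever was cached by earlier cells. One way to get a steadier comparison is to repeat each query and report the first run next to the median. The sketch below uses hypothetical helper names (`time_call`, `summarize`) and a stand-in workload so it runs without a catalog connection.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Report the first (cold) run separately from the median and minimum."""
    return {
        "first": durations[0],
        "median": statistics.median(durations),
        "min": min(durations),
    }


# Stand-in workload; replace with a query call in the notebook
timings = time_call(lambda: sum(range(100_000)), repeats=5)
print(summarize(timings))
```

With the notebook's connection, the callable could be `lambda: conn.execute(query.format(source="dockets")).fetchdf()`.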


## 6. PyIceberg Access

PyIceberg provides a Python-native way to interact with the catalog — useful for schema inspection, metadata access, and programmatic table management.

In [13]:
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="spicy_regs",
    warehouse=WAREHOUSE,
    uri=CATALOG_URI,
    token=TOKEN,
)

# List namespaces and tables
print("Namespaces:", catalog.list_namespaces())
print("\nTables:")
for table in catalog.list_tables("regulations"):
    print(f"  {'.'.join(table)}")

Namespaces: [('default',), ('regulations',)]

Tables:
  regulations.dockets
  regulations.documents
  regulations.comments


In [14]:
# Inspect table metadata
table = catalog.load_table(("regulations", "dockets"))

print(f"Table: {table.name()}")
print(f"Location: {table.location()}")
print(f"Snapshots: {len(table.metadata.snapshots)}")
print(f"\nSchema:")
for field in table.schema().fields:
    print(f"  {field.name}: {field.field_type}")

Table: ('regulations', 'dockets')
Location: s3://spicy-regs/__r2_data_catalog/019c5305-e4f0-7412-9920-1b9f69f538d7/019c5305-ef11-7840-9697-2dd6c92c6591
Snapshots: 1

Schema:
  docket_id: string
  agency_code: string
  title: string
  docket_type: string
  modify_date: string
  abstract: string


In [15]:
# Read data via PyIceberg → Arrow → Pandas
arrow_table = table.scan(
    selected_fields=("docket_id", "agency_code", "title"),
    limit=5
).to_arrow()

arrow_table.to_pandas()

Unnamed: 0,docket_id,agency_code,title
0,ACF-2005-0001,ACF,Grant to United States Conference of Catholic ...
1,ACF-2005-0011,ACF,"Administration on Children, Youth and Families..."
2,ACF-2005-0003,ACF,Agency information collection activities
3,ACF-2005-0007,ACF,"Administration on Children, Youth, and Familie..."
4,ACF-2005-0002,ACF,Agency information collection activities


In [16]:
# Filter with row-level predicates (pushed down to Iceberg)
from pyiceberg.expressions import EqualTo

epa_dockets = catalog.load_table(("regulations", "dockets")).scan(
    row_filter=EqualTo("agency_code", "EPA"),
    selected_fields=("docket_id", "title", "docket_type"),
    limit=10
).to_arrow()

print(f"EPA dockets (filtered at scan level): {len(epa_dockets)} rows")
epa_dockets.to_pandas()

EPA dockets (filtered at scan level): 10 rows


Unnamed: 0,docket_id,title,docket_type
0,EPA-HQ-OA-2007-0706,State Small Business Stationary source Technic...,Nonrulemaking
1,EPA-HQ-OA-2006-0734,EPA Training Dockets,Rulemaking
2,EPA-HQ-OA-2006-0734,EPA Training Dockets,Rulemaking
3,EPA-HQ-OA-2003-0007,Agency Information Collection Activities: Prop...,Nonrulemaking
4,EPA-HQ-OA-2005-0003,Description of Collaboration with the Environm...,Nonrulemaking
5,EPA-HQ-OA-2007-0680,Guidance on Selecting Age Groups or Monitoring...,Nonrulemaking
6,EPA-HQ-OA-2006-0734,EPA Training Dockets,Rulemaking
7,EPA-HQ-OA-2006-0513,ICR for Performance Track 1949.05,Nonrulemaking
8,EPA-HQ-OA-2007-0706,State Small Business Stationary source Technic...,Nonrulemaking
9,EPA-HQ-OA-2006-0074,Voluntary Customer Satisfaction Surveys,Nonrulemaking


## 7. Snapshot History (Time Travel)

Every write to an Iceberg table creates a snapshot. You can inspect the history and (in the future) query historical versions.

In [17]:
from datetime import datetime

for tbl_name in ["dockets", "documents", "comments"]:
    tbl = catalog.load_table(("regulations", tbl_name))
    print(f"\n── {tbl_name} ──")
    print(f"  Snapshots: {len(tbl.metadata.snapshots)}")
    for snap in tbl.metadata.snapshots:
        ts = datetime.fromtimestamp(snap.timestamp_ms / 1000)
        summary = snap.summary or {}
        # .get() may return None instead of the default, so fall back explicitly
        rows = summary.get("total-records") or "?"
        files = summary.get("total-data-files") or "?"
        print(f"  Snapshot {snap.snapshot_id}: {ts} | {rows} records, {files} data files")


── dockets ──
  Snapshots: 1
  Snapshot 5552947494974408533: 2026-02-12 13:04:00.641000 | ? records, ? data files

── documents ──
  Snapshots: 1
  Snapshot 537570208951649699: 2026-02-12 13:04:43.170000 | ? records, ? data files

── comments ──
  Snapshots: 1
  Snapshot 2285363217501209918: 2026-02-12 13:19:19.252000 | ? records, ? data files
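
The timestamps above are printed in the machine's local time zone because `datetime.fromtimestamp` is called without a `tz` argument. The sketch below converts `timestamp_ms` to UTC and illustrates the selection rule behind time travel: pick the newest snapshot committed at or before a given instant. The snapshot ids and timestamps are hypothetical, and `snapshot_time_utc` and `snapshot_as_of` are illustrative helpers, not PyIceberg functions.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snapshot_time_utc(timestamp_ms):
    """Convert epoch milliseconds to an aware UTC datetime without float rounding."""
    return EPOCH + timedelta(milliseconds=timestamp_ms)


def snapshot_as_of(snapshots, as_of_ms):
    """Return the id of the newest snapshot committed at or before as_of_ms, else None."""
    eligible = [(ts, snapshot_id) for snapshot_id, ts in snapshots if ts <= as_of_ms]
    return max(eligible)[1] if eligible else None


# Hypothetical (snapshot_id, timestamp_ms) pairs, ten minutes apart
history = [
    (101, 1_700_000_000_000),
    (102, 1_700_000_600_000),
    (103, 1_700_001_200_000),
]

print(snapshot_time_utc(history[0][1]).isoformat())
print(snapshot_as_of(history, 1_700_000_900_000))
```

With real tables, the pairs could be built from `tbl.metadata.snapshots` using each snapshot's `snapshot_id` and `timestamp_ms`, as in the loop above.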


## Summary

| Method | Best For |
|--------|----------|
| **DuckDB SQL** | Ad-hoc queries, joins, aggregations — most natural for analytics |
| **PyIceberg** | Schema inspection, metadata access, programmatic table management |
| **Raw Parquet** | Quick reads when you don't need catalog features |

All three read the same regulations dataset stored on Cloudflare R2.