In [1]:
import polars as pl
from pyiceberg.catalog.rest import RestCatalog

pl.Config.set_thousands_separator(",")
pl.Config.set_float_precision(2)

polars.config.Config

# Querying the data

Now we have some data and we understand a bit better how Iceberg works under the hood. It's time to actually start using it!

The key selling point of Iceberg is that many different query engines can read from the same data storage.
To show this off, let's run some simple queries using a few different query engines. In this demo, we'll focus on query engines that run locally. In practice, a common pattern is to use something like Snowflake or Databricks for the heavy lifting while keeping the option of using alternatives for other use cases.

First, we get a reference to our table. Many of these engines use Pyiceberg as a jumping-off point, either by interfacing with the Pyiceberg table directly or by using Pyiceberg to find the location of the current metadata.json file.

In [2]:
catalog = RestCatalog(
    "lakekeeper", uri="http://lakekeeper:8181/catalog", warehouse="lakehouse"
)
table = catalog.load_table("housing.staging_prices")
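
As a small illustration of the second route, a loaded Pyiceberg table exposes the path of its current metadata file through its `metadata_location` attribute. The helper below is a hypothetical convenience wrapper (it is not part of Pyiceberg) that collects that pointer so it could be handed to another engine.

```python
def current_metadata_pointer(table) -> dict[str, str]:
    """Return the pointer another engine would need to find the table.

    `table` is expected to be a Pyiceberg table such as the one loaded above.
    This helper is hypothetical and only reads `table.metadata_location`.
    """
    location = table.metadata_location
    return {
        "metadata_location": location,
        # The bare file name makes it easy to see which version of the
        # metadata file is currently in use.
        "metadata_file": location.rsplit("/", 1)[-1],
    }
```

Calling `current_metadata_pointer(table)` on the table loaded above would return the full path alongside the file name of the current metadata.json.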

## Pyiceberg

Let's first see how to query with Pyiceberg directly. For each of these examples, we'll do something simple: calculate the mean house price per month in 2024 (we loaded data for 2023 and 2024 in the previous notebook).

In [6]:
%%time

iceberg_results = table.scan(
    selected_fields=["price", "date_of_transfer"],
    row_filter="date_of_transfer >= '2024-01-01' and date_of_transfer <= '2024-12-31'",
)
iceberg_results.to_polars().group_by(pl.col("date_of_transfer").dt.month().alias("month_of_year")).agg(
    pl.col("price").mean()
).sort(by="month_of_year")

CPU times: user 54.2 ms, sys: 23.8 ms, total: 78 ms
Wall time: 81.5 ms


month_of_year,price
i8,f64
1,394006.70
2,386516.20
3,410261.50
4,405015.70
5,392130.10
…,…
8,393278.30
9,391477.49
10,395518.96
11,369189.63
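
The filter above uses an inclusive upper bound of `'2024-12-31'`. If `date_of_transfer` were stored as a timestamp rather than a date, that bound might drop rows from later on the final day. A minimal sketch of a half-open alternative follows; `year_row_filter` is a hypothetical helper, not something Pyiceberg provides.

```python
from datetime import date


def year_row_filter(column: str, year: int) -> str:
    """Build a half-open range filter covering one calendar year."""
    start = date(year, 1, 1)
    end = date(year + 1, 1, 1)  # exclusive upper bound
    return f"{column} >= '{start.isoformat()}' and {column} < '{end.isoformat()}'"
```

The resulting string could be passed as `row_filter=year_row_filter("date_of_transfer", 2024)` in the `table.scan(...)` call.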


## Polars
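Pyiceberg gives us only limited filtering and projection capabilities: it provides the building blocks for libraries that build on top of it. We used Polars to finish the job in the previous example, but Polars can read Iceberg directly, so we can skip the extra step. Note that the query below has no date filter, so it averages over every year in the table, which would explain why the figures differ slightly from the Pyiceberg result above.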

In [9]:
%%time
polars_df = (
    pl.scan_iceberg(table)
    .group_by(pl.col("date_of_transfer").dt.month().alias("month_of_year"))
    .agg(pl.col("price").mean())
    .sort(by="month_of_year")
    .collect()
)
polars_df

CPU times: user 80.4 ms, sys: 11.5 ms, total: 91.9 ms
Wall time: 65.4 ms


month_of_year,price
i8,f64
1,395660.43
2,386895.65
3,410764.53
4,404071.22
5,398286.85
…,…
8,402361.51
9,398748.30
10,395834.50
11,377496.83


## DuckDB
DuckDB is also an excellent choice for working with Iceberg, especially if you want to stick to SQL.

It does require some setup, since DuckDB can't reuse the Pyiceberg catalog we created above and needs its own credentials to attach to the REST catalog. The [duckdb-iceberg](https://github.com/duckdb/duckdb-iceberg) extension recently got additional sponsorship from AWS to improve Iceberg compatibility, so keep an eye on that.

In [10]:
import duckdb

In [11]:
# Create a duckdb connection
conn = duckdb.connect()
# Load the Iceberg extension for DuckDB
conn.install_extension("iceberg")
conn.load_extension("iceberg")

In [16]:
conn.sql("""
CREATE SECRET lakekeeper_secret (
TYPE ICEBERG,
TOKEN 'dummy_token'
);
""")


┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true    │
└─────────┘

In [18]:
conn.sql("""
ATTACH 'lakehouse' as datalake (
TYPE iceberg,
SECRET lakekeeper_secret,
ENDPOINT 'http://lakekeeper:8181/catalog'
)
""")

In [21]:
%%time
# We can read the Iceberg data using DuckDB
conn.sql("""
SELECT month(date_of_transfer) as transfer_month, mean(price) as mean_price
FROM datalake.housing.staging_prices
GROUP BY 1
ORDER BY 1
""").show()

┌────────────────┬────────────────────┐
│ transfer_month │     mean_price     │
│     int64      │       double       │
├────────────────┼────────────────────┤
│              1 │ 395660.43394136266 │
│              2 │  386895.6456542961 │
│              3 │ 410764.53313914494 │
│              4 │   404071.218302416 │
│              5 │  398286.8499751832 │
│              6 │  404138.3983623484 │
│              7 │  401309.5456024658 │
│              8 │   402361.507596128 │
│              9 │  398748.2988667712 │
│             10 │   395834.498216876 │
│             11 │ 377496.82664732926 │
│             12 │  395920.2946206615 │
├────────────────┴────────────────────┤
│ 12 rows                   2 columns │
└─────────────────────────────────────┘

CPU times: user 416 ms, sys: 142 ms, total: 557 ms
Wall time: 963 ms
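
The SQL query above averages over every year in the table. A sketch of how the 2024 restriction from the Pyiceberg example could be added in SQL is shown below. `monthly_mean_sql` is a hypothetical helper that only builds the query text, and it assumes the engine provides `year()`, `month()` and `avg()`.

```python
def monthly_mean_sql(table_name: str, year: int) -> str:
    """Build the monthly mean price query restricted to a single year."""
    return (
        "SELECT month(date_of_transfer) AS transfer_month, avg(price) AS mean_price\n"
        f"FROM {table_name}\n"
        # int() guards against anything other than a plain year being interpolated
        f"WHERE year(date_of_transfer) = {int(year)}\n"
        "GROUP BY 1\n"
        "ORDER BY 1"
    )
```

With the connection from this section, it could be run as `conn.sql(monthly_mean_sql("datalake.housing.staging_prices", 2024)).show()`.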


## Trino
Trino is another popular option, especially since AWS offers it as a serverless query engine through Athena. It is also SQL-based, so the query looks very similar, just written in Trino's SQL dialect.

In [22]:
import sqlalchemy as sa

engine = sa.create_engine("trino://trino:@trino:8080/lakekeeper")

sql = """
SELECT month(date_of_transfer) as transfer_month, avg(price) as mean_price 
FROM housing.staging_prices
GROUP BY 1
ORDER BY 1
"""

In [23]:
%%time
pl.read_database(sql, engine)

CPU times: user 199 ms, sys: 24.1 ms, total: 223 ms
Wall time: 1.35 s


transfer_month,mean_price
i64,f64
1,395660.43
2,386895.65
3,410764.53
4,404071.22
5,398286.85
…,…
8,402361.51
9,398748.30
10,395834.50
11,377496.83


## Daft
Daft is a relatively new player in the dataframe world. It is similar to Polars and also written in Rust, but it is designed for scaling out as well, and it has supported Iceberg from early on. Let's see if that helps.

In [24]:
import daft

In [25]:
%%time
(
    daft.read_iceberg(table)
    .groupby(daft.col("date_of_transfer").dt.month())
    .agg(daft.col("price").mean())
    .sort(by=daft.col("date_of_transfer"))
    .show(12)
)

date_of_transfer UInt32,price Float64
1,395660.43394136266
2,386895.6456542961
3,410764.53313914494
4,404071.218302416
5,398286.8499751832
...,...
8,402361.507596128
9,398748.2988667712
10,395834.498216876
11,377496.82664732926


CPU times: user 100 ms, sys: 40.2 ms, total: 141 ms
Wall time: 114 ms


## Query engines
We've now toured some of the query engines that are easy to run locally: Python with Pyiceberg, Rust with Polars and Daft, C++ with DuckDB, and finally Java with Trino. One important player we've left out is Spark. There is no denying that Iceberg was originally a Java project, and the Java Iceberg reference library is the most feature-complete. Pyiceberg is a close second, though.

![Iceberg Query Engines](images/iceberg_query_engine.svg)

In a real enterprise setup, you'll probably have a managed service like Databricks or Snowflake that you can rely on as your main Iceberg driver, but the beauty of Iceberg is that you don't have to. You can mix and match these query engines depending on the task at hand, without having to move the data anywhere.

# Exercise

Try running a query using your favourite query engine to calculate the average house price for your county. If you don't live in the UK, pick the funniest-sounding one. (I quite like WORCESTERSHIRE.)
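
One possible starting point is sketched below. It assumes the table has a column named `county` holding upper-case county names; both are guesses, so check the schema first. `county_mean_price_query` is a hypothetical helper that returns the query text plus its parameter list, using `?` placeholders.

```python
def county_mean_price_query(
    county: str, table_name: str = "datalake.housing.staging_prices"
) -> tuple[str, list[str]]:
    """Build a parameterised query for the mean price in one county.

    Assumes a `county` column with upper-case values (not confirmed here).
    """
    sql = (
        "SELECT county, avg(price) AS mean_price\n"
        f"FROM {table_name}\n"
        "WHERE county = ?\n"
        "GROUP BY county"
    )
    # Normalise the input so "Worcestershire" matches "WORCESTERSHIRE"
    return sql, [county.strip().upper()]
```

With the DuckDB connection from earlier, this could be run as `sql, params = county_mean_price_query("Worcestershire")` followed by `conn.execute(sql, params).fetchall()`.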