# BONUS: DuckDB

![DuckDB Logo](images/logos/DuckDB_Logo.png)

DuckDB is the new black in data engineering: an in-process analytical database built for performance. It focuses on making it easy to query data from anywhere and has bindings for most popular languages, including Python of course. It even compiles to WASM, letting us do cool stuff like [this](https://shell.duckdb.org/).

DuckDB integrates tightly with Apache Arrow: it can read and write Arrow memory directly, which makes it easy to interop with popular Python libraries.

In [1]:
import duckdb

In [2]:
sql = """SELECT * FROM 'data/10.csv' WHERE language = 'english'"""

duckdb.execute(sql).pl()

recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
i64,str,i64,i64,i64,i64,i64,f64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,i64,i64
147937429,"""english""",1696875102,1717510986,1,3,0,0.5268,0,1,0,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
166652969,"""english""",1717495511,1717495511,1,0,0,0.0,0,1,0,0,1,,76561199025372592,10,2,13452,502,13426,1717500702
166652933,"""english""",1717495460,1717538865,1,1,0,0.517767,0,1,0,0,1,,76561197975930688,436,180,3197,0,3197,1714457233
137537621,"""english""",1682843335,1717481847,1,0,0,0.0,0,0,0,0,1,,76561199148051920,43,6,10173,2,10173,1716744841
154253089,"""english""",1703375726,1717478605,1,0,0,0.0,0,1,0,0,1,,76561199179558352,50,25,195,92,195,1717478564
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
30564247,"""english""",1489733442,1574109131,1,0,1,0.41404,0,0,0,0,0,,76561198095892208,811,72,11,0,7,1541900660
22338865,"""english""",1460293069,1460293069,1,1,1,0.502488,0,0,0,0,0,,76561198149865296,18,1,113891,0,30245,1690362335
15043110,"""english""",1427259949,1571474648,1,0,0,0.0,0,0,0,0,0,,76561198072567552,38,34,7302,0,4661,1571062471
149284037,"""english""",1698800321,1698800321,1,0,0,0.0,0,0,1,0,1,,76561199052025216,46,3,3367,2694,3016,1698892194


DuckDB infers that we want to read a CSV file and calls its `read_csv` function implicitly. We can of course call it explicitly if we want to pass options to handle those messy CSV files.

In [3]:
sql = """SELECT filename, * FROM read_csv('data/10.csv', filename = true) WHERE language = 'english'"""
my_polars_df = duckdb.execute(sql).pl()
my_polars_df

filename,recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
str,i64,str,i64,i64,i64,i64,i64,f64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,i64,i64
"""data/10.csv""",147937429,"""english""",1696875102,1717510986,1,3,0,0.5268,0,1,0,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
"""data/10.csv""",166652969,"""english""",1717495511,1717495511,1,0,0,0.0,0,1,0,0,1,,76561199025372592,10,2,13452,502,13426,1717500702
"""data/10.csv""",166652933,"""english""",1717495460,1717538865,1,1,0,0.517767,0,1,0,0,1,,76561197975930688,436,180,3197,0,3197,1714457233
"""data/10.csv""",137537621,"""english""",1682843335,1717481847,1,0,0,0.0,0,0,0,0,1,,76561199148051920,43,6,10173,2,10173,1716744841
"""data/10.csv""",154253089,"""english""",1703375726,1717478605,1,0,0,0.0,0,1,0,0,1,,76561199179558352,50,25,195,92,195,1717478564
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""data/10.csv""",30564247,"""english""",1489733442,1574109131,1,0,1,0.41404,0,0,0,0,0,,76561198095892208,811,72,11,0,7,1541900660
"""data/10.csv""",22338865,"""english""",1460293069,1460293069,1,1,1,0.502488,0,0,0,0,0,,76561198149865296,18,1,113891,0,30245,1690362335
"""data/10.csv""",15043110,"""english""",1427259949,1571474648,1,0,0,0.0,0,0,0,0,0,,76561198072567552,38,34,7302,0,4661,1571062471
"""data/10.csv""",149284037,"""english""",1698800321,1698800321,1,0,0,0.0,0,0,1,0,1,,76561199052025216,46,3,3367,2694,3016,1698892194


Because DuckDB runs in-process and speaks Arrow, it interoperates easily with other analytical tools such as `polars` and `pandas`. Here we query the `my_polars_df` DataFrame from the previous cell directly by its variable name.

In [4]:
sql = """
SELECT CAST(received_for_free as bool) as received_for_free, 
AVG(votes_up) as num_upvotes  
FROM my_polars_df 
GROUP BY ALL
"""
duckdb.execute(sql).pl()

received_for_free,num_upvotes
bool,f64
False,1.384894
True,0.860875


## Reading remote data
A killer feature is native support for reading data directly from object stores, including common data lake formats such as Parquet. It can even query MySQL and Postgres!

DuckDB comes with a built-in secrets manager to handle credentials for connecting to remote stores, so let's set that up.

In [5]:
duckdb.execute("""CREATE OR REPLACE SECRET minio (
    TYPE S3,
    KEY_ID 'minio',
    SECRET 'minio1234',
    ENDPOINT 'minio:9000',
    URL_STYLE 'path',
    USE_SSL false,
    REGION 'us-east-1'
)
""");

Secrets can be persisted to disk or kept in memory. Here we keep the secret in memory, so it only lasts for the current session.

In [6]:
duckdb.execute("FROM duckdb_secrets()").pl()

name,type,provider,persistent,storage,scope,secret_string
str,str,str,bool,str,list[str],str
"""minio""","""s3""","""config""",False,"""memory""","[""s3://"", ""s3n://"", ""s3a://""]","""name=minio;type=s3;provider=co…"


With credentials in order, we can treat S3 as just another file location

In [13]:
r = duckdb.read_parquet('data/parquet/all_reviews.parquet')

In [23]:
r.value_counts('language')

┌────────────┬───────────────────┐
│  language  │ count("language") │
│  varchar   │       int64       │
├────────────┼───────────────────┤
│ polish     │           3088366 │
│ schinese   │          19799992 │
│ greek      │             82484 │
│ french     │           2878882 │
│ czech      │            816372 │
│ japanese   │            659812 │
│ tchinese   │           1281136 │
│ vietnamese │             66599 │
│ norwegian  │            180191 │
│ swedish    │            447369 │
│    ·       │               ·   │
│    ·       │               ·   │
│    ·       │               ·   │
│ ukrainian  │            463960 │
│ indonesian │              6158 │
│ english    │          57424874 │
│ german     │           4182469 │
│ turkish    │           3702329 │
│ hungarian  │            370701 │
│ koreana    │           2534100 │
│ thai       │            537746 │
│ finnish    │            309880 │
│ italian    │            795499 │
├────────────┴───────────────────┤
│ 29 rows (20 shown)

In [7]:
sql = "FROM 's3://datalake/extract/reviews/10.csv'"

duckdb.execute(sql).pl()

recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
i64,str,i64,i64,i64,i64,i64,f64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,i64,i64
147937429,"""english""",1696875102,1717510986,1,3,0,0.5268,0,1,0,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
166664841,"""russian""",1717510100,1717510100,1,0,0,0.0,0,1,0,0,1,,76561199161536896,24,11,436,71,385,1717512997
166664763,"""russian""",1717510009,1717510009,1,0,0,0.0,0,0,0,0,1,,76561198046827632,0,7,23750,7,23743,1717510490
166663001,"""turkish""",1717508182,1717508182,0,0,0,0.0,0,0,0,0,1,,76561199374468448,32,4,361,19,356,1717508513
166658743,"""brazilian""",1717503385,1717503385,1,1,0,0.52381,0,1,0,0,1,,76561198018922960,9,1,1497,0,1497,1478272196
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
,"""1351380029""",1,0,0,0,0,1.0,0,0,0,,76561198008269840,"""0.0""",2,95521,0,37952,1547225350,,
149330962,"""russian""",1698868683,1698868683,1,0,0,0.0,0,1,0,0,1,,76561199093871104,62,26,29,10,29,1698427415
149284037,"""english""",1698800321,1698800321,1,0,0,0.0,0,0,1,0,1,,76561199052025216,46,3,3367,2694,3016,1698892194
127959835,"""schinese""",1670214704,1698915106,1,0,0,0.0,0,1,0,0,1,,76561199209656688,47,11,1179,0,1179,1694817767


Since DuckDB can both read from and write to remote locations in a number of file formats, it's a great Swiss Army knife for ETL. Let's build a tiny pipeline to clean up the review data and convert it to Parquet.

In [8]:
sql = "COPY (SELECT * FROM 's3://datalake/extract/reviews/10.csv' WHERE recommendationid is not null) TO 's3://datalake/extract/duckdb/10.parquet' (FORMAT PARQUET)"
duckdb.sql(sql)

In [9]:
sql = "SELECT language, COUNT() as num_languages FROM 's3://datalake/extract/duckdb/10.parquet' GROUP BY ALL ORDER BY num_languages DESC"
duckdb.sql(sql).pl()

language,num_languages
str,i64
"""russian""",73425
"""english""",56543
"""spanish""",28800
"""brazilian""",15574
"""turkish""",14753
…,…
"""greek""",168
"""norwegian""",139
"""thai""",118
"""vietnamese""",50


Now that the data is in Parquet format, DuckDB can push column selections and filters down into the Parquet scan, so it reads only the columns and row groups a query needs. This helps us process larger-than-RAM files with ease.

In [11]:
sql = "SELECT language, COUNT() as num_rows FROM 'data/all_reviews.parquet' GROUP BY ALL"
duckdb.sql(sql).pl()

language,num_rows
str,i64
"""russian""",15069437
"""ukrainian""",463960
"""indonesian""",6158
"""tchinese""",1281136
"""czech""",816372
…,…
"""polish""",3088366
"""greek""",82484
"""brazilian""",5329227
"""spanish""",5461393


We can also parse multiple files using a glob - very handy for folders of data

In [12]:
sql = """
SELECT filename.parse_filename(true) as game_id, * EXCLUDE filename
FROM read_csv('s3://datalake/extract/reviews/*.csv', filename = true)
WHERE recommendationid is not null
LIMIT 100
"""
duckdb.execute(sql).pl()

game_id,recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
str,i64,str,i64,i64,i64,i64,i64,f64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,i64,i64
"""10""",147937429,"""english""",1696875102,1717510986,1,3,0,0.5268,0,1,0,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
"""10""",166664841,"""russian""",1717510100,1717510100,1,0,0,0.0,0,1,0,0,1,,76561199161536896,24,11,436,71,385,1717512997
"""10""",166664763,"""russian""",1717510009,1717510009,1,0,0,0.0,0,0,0,0,1,,76561198046827632,0,7,23750,7,23743,1717510490
"""10""",166663001,"""turkish""",1717508182,1717508182,0,0,0,0.0,0,0,0,0,1,,76561199374468448,32,4,361,19,356,1717508513
"""10""",166658743,"""brazilian""",1717503385,1717503385,1,1,0,0.52381,0,1,0,0,1,,76561198018922960,9,1,1497,0,1497,1478272196
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""10""",166533117,"""polish""",1717338208,1717338208,1,0,0,0.0,0,1,0,0,1,,76561198193120528,60,8,43649,0,43649,1712682796
"""10""",163650359,"""spanish""",1714167060,1717333972,1,0,0,0.0,0,1,0,0,1,,76561199226138608,39,21,2712,1839,1922,1717506042
"""10""",166525327,"""romanian""",1717329933,1717329933,1,0,0,0.0,0,1,0,0,1,,76561199491549024,10,1,951,349,804,1717480813
"""10""",166524383,"""russian""",1717328789,1717328789,1,0,0,0.0,0,1,0,0,1,,76561199098111520,0,1,5398,1884,4926,1717498719


Can we do this with Iceberg? Of course! Let's use the AWS data from before to show off a more common use case.

In [66]:
sql = """CREATE OR REPLACE SECRET pydata (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    SCOPE 's3://pydata-copenhagen-datalake'
)
"""
duckdb.sql(sql)

┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true    │
└─────────┘

In [49]:
from pyiceberg.catalog import load_catalog

In [50]:
catalog = load_catalog("aws_iceberg", **{"type": "glue", "glue.region": "eu-north-1"})

In [51]:
table = catalog.load_table("steam.reviews")

In [63]:
table.metadata_location

's3://pydata-copenhagen-datalake/staging/reviews/metadata/00000-27183f43-7d16-4636-951b-957ad32c9731.metadata.json'

In [73]:
sql = f"SELECT COUNT(*) FROM iceberg_scan('{table.metadata_location}')"

In [71]:
duckdb.install_extension('iceberg')
duckdb.load_extension('iceberg')

In [74]:
duckdb.sql(sql).pl()

count_star()
i64
127572881


In [None]:
sql = f"SELECT language, count(language) as num_languages FROM iceberg_scan('{table.metadata_location}') GROUP BY ALL"
duckdb.sql(sql).pl()
