# DuckDB Notebook — Unsolved


## 0. Setup

In this notebook you will work with DuckDB as an **in-process analytical database**.

We will:
- generate two datasets (`sales`, `products`)
- persist them as Parquet
- query them directly using SQL

This notebook is independent of Polars.


In [None]:
!pip install duckdb pandas pyarrow  # pyarrow backs pandas' to_parquet



In [None]:
import duckdb
from datetime import datetime, timedelta
import pandas as pd
import numpy as np

con = duckdb.connect()

In [None]:
# ----------------------------
# Dataset sizes
# ----------------------------
N_SALES = 300_000
N_PRODUCTS = 2_000

cities = ["Valencia", "Madrid", "Sevilla", "Bilbao", "Barcelona"]
categories = ["Electronics", "Fashion", "Home", "Sports", "Toys"]

# ----------------------------
# Products table
# ----------------------------
products = pd.DataFrame({
    "product_id": np.arange(1, N_PRODUCTS + 1),
    "category": [categories[i % len(categories)] for i in range(N_PRODUCTS)],
    "is_discontinued": (np.arange(1, N_PRODUCTS + 1) % 40 == 0)
})

# ----------------------------
# Sales table
# ----------------------------
sales = pd.DataFrame({
    "sale_id": np.arange(1, N_SALES + 1),
    "product_id": (np.arange(1, N_SALES + 1) * 37 % (N_PRODUCTS + 200)) + 1,
    "user_id": (np.arange(1, N_SALES + 1) * 13) % 50_000,
    "units": (np.arange(1, N_SALES + 1) * 7 % 5) + 1,
    "city": [cities[i % len(cities)] for i in range(N_SALES)],
    "has_discount": (np.arange(1, N_SALES + 1) % 10 == 0),
    "timestamp": [
        datetime(2025, 1, 1) + timedelta(seconds=i)
        for i in range(N_SALES)
    ]
})

sales["gross_value"] = sales["units"] * ((sales["sale_id"] % 200) + 5)

In [None]:
products.to_parquet("products.parquet", index=False)
sales.to_parquet("sales.parquet", index=False)

In [None]:
con = duckdb.connect()
# Quote the file name so DuckDB reads it as a Parquet file path
con.execute("SELECT * FROM 'products.parquet' LIMIT 10").df()

Unnamed: 0,product_id,category,is_discontinued
0,1,Electronics,False
1,2,Fashion,False
2,3,Home,False
3,4,Sports,False
4,5,Toys,False
5,6,Electronics,False
6,7,Fashion,False
7,8,Home,False
8,9,Sports,False
9,10,Toys,False


In [None]:
# Quote the file name so DuckDB reads it as a Parquet file path
con.execute("SELECT * FROM 'sales.parquet' LIMIT 10").df()

Unnamed: 0,sale_id,product_id,user_id,units,city,has_discount,timestamp,gross_value
0,1,38,13,3,Valencia,False,2025-01-01 00:00:00,18
1,2,75,26,5,Madrid,False,2025-01-01 00:00:01,35
2,3,112,39,2,Sevilla,False,2025-01-01 00:00:02,16
3,4,149,52,4,Bilbao,False,2025-01-01 00:00:03,36
4,5,186,65,1,Barcelona,False,2025-01-01 00:00:04,10
5,6,223,78,3,Valencia,False,2025-01-01 00:00:05,33
6,7,260,91,5,Madrid,False,2025-01-01 00:00:06,60
7,8,297,104,2,Sevilla,False,2025-01-01 00:00:07,26
8,9,334,117,4,Bilbao,False,2025-01-01 00:00:08,56
9,10,371,130,1,Barcelona,True,2025-01-01 00:00:09,15


## Exercise 1 — Read data

From `sales.parquet`, read only these columns:
- sale_id
- units
- timestamp

Return only 5 rows.


In [None]:
# TODO:
# - SELECT only required columns
# - LIMIT 5


In [None]:
con.execute("""
SELECT
FROM
LIMIT
""").df()


Unnamed: 0,sale_id,units,timestamp
0,1,3,2025-01-01 00:00:00
1,2,5,2025-01-01 00:00:01
2,3,2,2025-01-01 00:00:02
3,4,4,2025-01-01 00:00:03
4,5,1,2025-01-01 00:00:04


## Exercise 2 — Filtering and ordering

Filter sales with `gross_value > 1000`,
order by `gross_value` descending,
and return the top 10 rows (`sale_id`, `city`, `gross_value`).


In [None]:
# TODO:
# - WHERE gross_value > 1000
# - ORDER BY gross_value DESC
# - LIMIT 10


In [None]:
con.execute("""
SELECT
FROM
WHERE
ORDER
LIMIT
""").df()


Unnamed: 0,sale_id,city,gross_value
0,197,Madrid,1010
1,397,Madrid,1010
2,597,Madrid,1010
3,797,Madrid,1010
4,997,Madrid,1010
5,1197,Madrid,1010
6,1397,Madrid,1010
7,1597,Madrid,1010
8,1797,Madrid,1010
9,1997,Madrid,1010


## Exercise 3 — Derived column

Create a column `net_value`:
- apply a 10% discount when `has_discount` is true
- otherwise keep `gross_value`


In [None]:
# TODO:
# - Use CASE WHEN
# - Do NOT modify original data


In [None]:
con.execute("""
SELECT
  CASE
    WHEN ____ THEN ____
    ELSE _____
  END AS net_value
""").df()


Unnamed: 0,sale_id,gross_value,has_discount,net_value
0,1,18,False,18.0
1,2,35,False,35.0
2,3,16,False,16.0
3,4,36,False,36.0
4,5,10,False,10.0
5,6,33,False,33.0
6,7,60,False,60.0
7,8,26,False,26.0
8,9,56,False,56.0
9,10,15,True,13.5


## Exercise 4 — Strings and dates

- Normalize city to lowercase
- Extract year and month from timestamp


In [None]:
# TODO:
# - LOWER(city)
# - EXTRACT(year/month FROM timestamp)


In [None]:
con.execute("""
SELECT
  EXTRACT()...
LIMIT 10
""").df()


Unnamed: 0,city,city_norm,year,month
0,Valencia,valencia,2025,1
1,Madrid,madrid,2025,1
2,Sevilla,sevilla,2025,1
3,Bilbao,bilbao,2025,1
4,Barcelona,barcelona,2025,1
5,Valencia,valencia,2025,1
6,Madrid,madrid,2025,1
7,Sevilla,sevilla,2025,1
8,Bilbao,bilbao,2025,1
9,Barcelona,barcelona,2025,1


## Exercise 5 — Aggregations

Group by city and compute:
- number of orders
- total gross_value
- average gross_value


In [None]:
# TODO:
# - GROUP BY city
# - COUNT, SUM, AVG
# - Use aliases


In [None]:
con.execute("""
SELECT
  COUNT()
  SUM()
  AVG()
GROUP BY
ORDER BY
""").df()


Unnamed: 0,city,orders,total_value,avg_value
0,Madrid,60000,31350000.0,522.5
1,Bilbao,60000,25560000.0,426.0
2,Valencia,60000,18630000.0,310.5
3,Sevilla,60000,12660000.0,211.0
4,Barcelona,60000,6150000.0,102.5


## Exercise 6 — Invalid product references

Find sales that reference a product_id
that does NOT exist in products.


In [None]:
# TODO:
# - Use ANTI JOIN


In [None]:
con.execute("""
SELECT COUNT()
FROM ____ s
ANTI JOIN ____ p
ON _____
""").df()


Unnamed: 0,invalid_sales
0,27270


## Exercise 7 — Window functions

For each user, with sales ordered by timestamp:
- compute the running total of gross_value
- show the previous sale's gross_value


In [None]:
# TODO:
# - SUM(...) OVER (PARTITION BY user_id ORDER BY timestamp)
# - LAG(gross_value)

In [None]:
con.execute("""
SELECT
  ...,
  SUM(...) OVER (...) AS running_total,
  LAG(...) OVER (...) AS prev_value
FROM
LIMIT 10
""").df()


Unnamed: 0,user_id,timestamp,gross_value,running_total,prev_value
0,3,2025-01-01 05:20:30,108,108.0,
1,3,2025-01-01 19:13:50,108,216.0,108.0
2,3,2025-01-02 09:07:10,108,324.0,108.0
3,3,2025-01-02 23:00:30,108,432.0,108.0
4,3,2025-01-03 12:53:50,108,540.0,108.0
5,3,2025-01-04 02:47:10,108,648.0,108.0
6,6,2025-01-01 10:41:01,335,335.0,
7,6,2025-01-02 00:34:21,335,670.0,335.0
8,6,2025-01-02 14:27:41,335,1005.0,335.0
9,6,2025-01-03 04:21:01,335,1340.0,335.0


## Exercise 8 — Simple counts & flags

Count how many sales:

- have a discount
- do NOT have a discount

Return both values in the same result.


In [None]:
# TODO:
# - use COUNT(*)
# - use CASE WHEN to separate discounted vs non-discounted

con.execute("""
SELECT
  -- TODO: count discounted sales
  -- TODO: count non-discounted sales
FROM 'sales.parquet'
""").df()


## BI-ready table

Compute KPIs by product category.

The query must:
- apply discounts
- join sales with products
- exclude discontinued products
- compute number of orders and total revenue


In [None]:
# TODO:
# - compute net_value applying the discount
# - join sales with products to get category
# - exclude discontinued products
# - aggregate by category
# - order by revenue desc

In [None]:
con.execute("""
SELECT
  p.category,
  -- TODO: COUNT(*) AS orders
  -- TODO: SUM(...) AS revenue
FROM 'sales.parquet' s
INNER JOIN 'products.parquet' p
  ON ____
WHERE p.is_discontinued = _____
GROUP BY p.category
ORDER BY revenue DESC
""").df()

Unnamed: 0,category,orders,revenue
0,Toys,47726,24680685.0
1,Sports,54546,23236576.0
2,Home,54546,16936503.0
3,Fashion,54546,11508916.0
4,Electronics,54546,5318092.5
