In [1]:
import duckdb

# Load SQL extension
%load_ext sql

# Initialize 🦆 DuckDB connection
conn = duckdb.connect()

# Import database
%sql conn --alias duckdb
%sql IMPORT DATABASE '../../data/nps';



Config,value
feedback,True
autopandas,True
displaylimit,10
displaycon,False


Unnamed: 0,Count
0,224


Now that we've discussed `GROUP BY` and window functions, we can discuss some advanced filtering concepts: `QUALIFY` and `HAVING`.

`HAVING` is like `WHERE`, but for aggregates. It is evaluated after `GROUP BY` has produced the aggregated values, which lets you filter on them.

In [3]:
%%sql
SELECT
    p.fullname,
    SUM(c.numberofsitesfirstcomefirstserve + c.numberofsitesreservable) AS num_campsites
FROM nps_public_data.campgrounds c
INNER JOIN nps_public_data.parks p
    ON c.parkcode = p.parkcode
    AND p.designation = 'National Park'
-- This won't work: num_campsites is an aggregate, and WHERE runs before GROUP BY
-- WHERE num_campsites > 100
GROUP BY 1
-- This will
HAVING num_campsites > 100
ORDER BY 2 ASC
LIMIT 10;

Unnamed: 0,fullName,num_campsites
0,Channel Islands National Park,106.0
1,Guadalupe Mountains National Park,110.0
2,Black Canyon Of The Gunnison National Park,116.0
3,Wind Cave National Park,124.0
4,Theodore Roosevelt National Park,125.0
5,Mammoth Cave National Park,131.0
6,Great Basin National Park,136.0
7,Pinnacles National Park,139.0
8,Big Bend National Park,196.0
9,Crater Lake National Park,230.0
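
To make the difference between the two filters concrete, here is a minimal sketch that runs the same aggregation twice, once with `WHERE` and once with `HAVING`. It makes two assumptions that are not part of this notebook: it uses Python's built-in `sqlite3` module instead of DuckDB so that it runs without the NPS database, and the `campgrounds` table with its three parks is made-up toy data.

```python
import sqlite3

# Hypothetical toy data standing in for the campgrounds table (not the NPS data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campgrounds (park TEXT, name TEXT, sites INTEGER)")
conn.executemany(
    "INSERT INTO campgrounds VALUES (?, ?, ?)",
    [
        ("Park A", "North", 80),
        ("Park A", "South", 40),
        ("Park B", "East", 150),
        ("Park B", "West", 30),
        ("Park C", "Lake", 60),
    ],
)

# WHERE filters individual rows before grouping:
# only campgrounds with more than 100 sites reach the SUM.
where_rows = conn.execute(
    """
    SELECT park, SUM(sites) AS num_campsites
    FROM campgrounds
    WHERE sites > 100
    GROUP BY park
    ORDER BY park
    """
).fetchall()

# HAVING filters whole groups after aggregation:
# only parks whose total exceeds 100 sites are kept.
having_rows = conn.execute(
    """
    SELECT park, SUM(sites) AS num_campsites
    FROM campgrounds
    GROUP BY park
    HAVING SUM(sites) > 100
    ORDER BY park
    """
).fetchall()

print(where_rows)   # [('Park B', 150)]
print(having_rows)  # [('Park A', 120), ('Park B', 180)]
```

Park A only appears in the `HAVING` result: none of its campgrounds has more than 100 sites on its own, but its total does.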


`QUALIFY` is like `WHERE` and `HAVING`, _but_ it applies to window functions! This can be exceedingly helpful, because filtering on a window function's result would otherwise almost always require a CTE or subquery.

One particularly useful case for `QUALIFY` is in `ROW_NUMBER` or `RANK` queries:

In [4]:
%%sql
SELECT
    p.fullname as park_name,
    c.name as campground_name,

    -- For each park, which campground has the maximum number of campsites?
    c.numberofsitesfirstcomefirstserve + c.numberofsitesreservable as num_campsites,
    -- RANK, ROW_NUMBER, DENSE_RANK
    RANK() OVER (PARTITION BY park_name ORDER BY c.numberofsitesfirstcomefirstserve + c.numberofsitesreservable DESC) as park_campsites_rank,
    ROW_NUMBER() OVER (PARTITION BY park_name ORDER BY c.numberofsitesfirstcomefirstserve + c.numberofsitesreservable DESC) as campsites_row_num,
    DENSE_RANK() OVER (PARTITION BY park_name ORDER BY c.numberofsitesfirstcomefirstserve + c.numberofsitesreservable DESC) as campsites_dense_rank,
FROM nps_public_data.campgrounds c
INNER JOIN nps_public_data.parks p
    ON c.parkcode = p.parkcode
    AND p.designation = 'National Park'
-- Get the second largest campground for each park
QUALIFY park_campsites_rank = 2
ORDER BY park_name, park_campsites_rank ASC
LIMIT 12;

Unnamed: 0,park_name,campground_name,num_campsites,park_campsites_rank,campsites_row_num,campsites_dense_rank
0,Acadia National Park,Seawall Campground,202,2,2,2
1,Badlands National Park,Sage Creek Campground,0,2,2,2
2,Big Bend National Park,Chisos Basin Campground,56,2,2,2
3,Biscayne National Park,Elliott Key Campground,0,2,2,2
4,Black Canyon Of The Gunnison National Park,East Portal Campground,15,2,2,2
5,Bryce Canyon National Park,North Campground,96,2,2,2
6,Canyonlands National Park,Island in the Sky (Willow Flat) Campground,12,2,2,2
7,Capitol Reef National Park,Primitive campsites at Cathedral Campground,6,2,2,2
8,Channel Islands National Park,Santa Rosa Island Backcountry Beach Camping,30,2,2,2
9,Congaree National Park,Bluff Campground,12,2,2,2
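
For comparison, here is a rough sketch of the CTE-based query that `QUALIFY` replaces. It also shows how `RANK`, `ROW_NUMBER`, and `DENSE_RANK` differ once there is a tie, which the output above does not happen to show. The assumptions: it uses Python's built-in `sqlite3` module (SQLite has window functions but no `QUALIFY`, so the CTE is required there), the toy `campgrounds` rows are invented, and `second_by` is a hypothetical helper written only for this illustration.

```python
import sqlite3

# Hypothetical toy data with a tie in Park A (not the NPS data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campgrounds (park TEXT, name TEXT, sites INTEGER)")
conn.executemany(
    "INSERT INTO campgrounds VALUES (?, ?, ?)",
    [
        ("Park A", "North", 80),
        ("Park A", "South", 80),
        ("Park A", "Creek", 40),
        ("Park B", "East", 150),
        ("Park B", "West", 30),
    ],
)

# Without QUALIFY, the window functions are computed in a CTE
# and filtered with a plain WHERE in the outer query.
QUERY = """
WITH ranked AS (
    SELECT
        park,
        name,
        sites,
        RANK()       OVER (PARTITION BY park ORDER BY sites DESC) AS rnk,
        ROW_NUMBER() OVER (PARTITION BY park ORDER BY sites DESC) AS row_num,
        DENSE_RANK() OVER (PARTITION BY park ORDER BY sites DESC) AS dense_rnk
    FROM campgrounds
)
SELECT park, name, sites
FROM ranked
WHERE {column} = 2
ORDER BY park
"""


def second_by(column):
    """Return the rows where the given ranking column equals 2.

    `column` is interpolated into the SQL, so only pass the fixed
    names defined in the CTE above.
    """
    return conn.execute(QUERY.format(column=column)).fetchall()


by_rank = second_by("rnk")
by_row_num = second_by("row_num")
by_dense_rank = second_by("dense_rnk")

# RANK leaves a gap after a tie (1, 1, 3), so Park A has no rank 2.
print(by_rank)        # [('Park B', 'West', 30)]
# DENSE_RANK has no gaps (1, 1, 2), so Park A's third row is returned.
print(by_dense_rank)  # [('Park A', 'Creek', 40), ('Park B', 'West', 30)]
# ROW_NUMBER breaks the tie arbitrarily: Park A yields North or South.
print(by_row_num)
```

In DuckDB the outer query and the CTE collapse into a single `SELECT` ending in `QUALIFY rnk = 2`, which is the saving described above. Which ranking function to filter on depends on how ties should be handled.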


While advanced filters are a simple concept, understanding when and where to use them can save you an extra CTE... and possibly a few lines of code 😄