In [1]:
import duckdb

# Load SQL extension
%load_ext sql

# Initialize 🦆 DuckDB connection
conn = duckdb.connect()

# Import database
%sql conn --alias duckdb
%sql IMPORT DATABASE '../../data/nps';

Config,value
feedback,True
autopandas,True
displaylimit,10
displaycon,False


Unnamed: 0,Count
0,224


In our previous examples, we used `WHERE` to filter queries, but we can also do so in `JOIN`s. 

However, we need to be _very_ careful with how joins work.

In [2]:
%%sql
SELECT
    p.name,
    vc.name as visitor_center_name
FROM nps_public_data.parks p
LEFT JOIN nps_public_data.visitorcenters vc
    ON p.parkcode = vc.parkcode
WHERE 1 = 1
-- Filter base query (parks) for national monument
    AND p.designation = 'National Monument'
-- Filter on the joined table (!) for passport stamp locations.
-- What will happen to parks without visitor centers?
    AND vc.ispassportstamplocation
LIMIT 1

Unnamed: 0,name,visitor_center_name
0,Statue Of Liberty,Liberty Island Information Center


How many rows are returned with/without the `LEFT JOIN`? What does that say about the number of parks we're querying? Why do you think that is? An `INNER JOIN` returns the same rows as a `LEFT JOIN` followed by an `IS NOT NULL` filter on the joined table's key. Why is that?

We can compare the results with two CTEs and a `UNION ALL`.

In [3]:
%%sql
WITH filter_in_join AS (
    SELECT
        p.name,
        vc.name as visitor_center_name
    FROM nps_public_data.parks p
    INNER JOIN nps_public_data.visitorcenters vc
        ON p.parkcode = vc.parkcode
), filter_in_where AS (
    SELECT
        p.name,
        vc.name as visitor_center_name
    FROM nps_public_data.parks p
    LEFT JOIN nps_public_data.visitorcenters vc
        ON p.parkcode = vc.parkcode
    WHERE vc.parkcode IS NOT NULL
)
SELECT
    COUNT(*) as ct
FROM filter_in_join

UNION ALL

SELECT
    COUNT(*) as ct
FROM filter_in_where


Unnamed: 0,ct
0,703
1,703
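
To make the `LEFT JOIN` caveat concrete, here is a minimal sketch on toy data. It uses Python's built-in `sqlite3` as a stand-in for DuckDB, and the tables, rows, and the `is_stamp` column are made up for illustration (they are not the NPS dataset). It contrasts filtering the joined table in `WHERE` with putting the same condition in the `ON` clause.

```python
import sqlite3

# Toy data (hypothetical): Park C has no visitor center,
# and Park B's only visitor center is not a stamp location.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parks (parkcode TEXT, name TEXT);
    CREATE TABLE visitorcenters (parkcode TEXT, name TEXT, is_stamp INTEGER);
    INSERT INTO parks VALUES ('a', 'Park A'), ('b', 'Park B'), ('c', 'Park C');
    INSERT INTO visitorcenters VALUES
        ('a', 'Center A1', 1),
        ('b', 'Center B1', 0);
""")

# Condition on the joined table in WHERE: rows where vc.* is NULL fail the
# test, so the LEFT JOIN ends up behaving like an INNER JOIN.
filter_in_where = conn.execute("""
    SELECT p.name, vc.name
    FROM parks p
    LEFT JOIN visitorcenters vc ON p.parkcode = vc.parkcode
    WHERE vc.is_stamp = 1
    ORDER BY p.name
""").fetchall()

# Same condition in the ON clause: every park is kept, with NULL (None)
# where no visitor center satisfied the condition.
filter_in_on = conn.execute("""
    SELECT p.name, vc.name
    FROM parks p
    LEFT JOIN visitorcenters vc
        ON p.parkcode = vc.parkcode AND vc.is_stamp = 1
    ORDER BY p.name
""").fetchall()

print(filter_in_where)  # [('Park A', 'Center A1')]
print(filter_in_on)     # [('Park A', 'Center A1'), ('Park B', None), ('Park C', None)]
```

Which form is right depends on the question: "parks that have a stamp location" versus "all parks, with their stamp location if any".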


Some common ways of filtering data include:

1. Comparisons (`>`, `<`, `=`)
2. `BETWEEN`
3. `IN`
4. `IS NULL`
5. `LIKE` & `ILIKE` // `REGEXP`

Comparisons and `BETWEEN` are good for integers, but also timestamps and dates (as we'll see). `IN` can be helpful for lists of data, while `IS NULL` can help us when `NULL` values are a possibility.

`ILIKE`, `LIKE`, and `REGEXP` are all useful when pattern matching is at play.

We can filter numbers and dates with comparisons or `BETWEEN` statements:

In [4]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart > '2024-01-01'
    AND recurrencedatestart < '2024-01-23'
ORDER BY RANDOM()
LIMIT 2


Unnamed: 0,title,parkfullname,category,isfree,description
0,Acadian Cultural Center - Louisiana Talks & Tales,Jean Lafitte National Historical Park and Pres...,Regular Event,True,"<p>Join a ranger to learn about the history, c..."


In [5]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-23'
ORDER BY RANDOM()
LIMIT 2

Unnamed: 0,title,parkfullname,category,isfree,description
0,Acadian Cultural Center - Louisiana Talks & Tales,Jean Lafitte National Historical Park and Pres...,Regular Event,True,"<p>Join a ranger to learn about the history, c..."
1,Afternoon Stroll,Joshua Tree National Park,Regular Event,True,<p>Join a ranger for a 0.4 mile (0.6 km) guide...


What's the difference? `BETWEEN` is _inclusive_ of both endpoints, while `>` and `<` exclude them.

In [6]:
%%sql
SELECT
    'between' as f,
    COUNT(*) as ct
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-23'
GROUP BY f

UNION ALL

SELECT
    'greater than' as f,
    COUNT(*) as ct
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart > '2024-01-01'
    AND recurrencedatestart < '2024-01-23'
GROUP BY f

Unnamed: 0,f,ct
0,between,3
1,greater than,1
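
The same effect can be reproduced on a tiny, made-up table. This sketch again uses Python's built-in `sqlite3` as a stand-in for DuckDB. The `events` rows are hypothetical, and dates are stored as ISO-8601 text, which sorts in chronological order. It suggests that `BETWEEN a AND b` behaves like `>= a AND <= b`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (title TEXT, start_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("On the lower bound", "2024-01-01"),
        ("In the middle", "2024-01-10"),
        ("On the upper bound", "2024-01-23"),
        ("Outside the range", "2024-02-01"),
    ],
)

def count_where(condition: str) -> int:
    """Count rows of the toy events table matching a WHERE condition."""
    return conn.execute(f"SELECT COUNT(*) FROM events WHERE {condition}").fetchone()[0]

# BETWEEN includes both endpoints.
between_ct = count_where("start_date BETWEEN '2024-01-01' AND '2024-01-23'")
# Strict comparisons exclude both endpoints.
strict_ct = count_where("start_date > '2024-01-01' AND start_date < '2024-01-23'")
# >= and <= reproduce the BETWEEN result.
inclusive_ct = count_where("start_date >= '2024-01-01' AND start_date <= '2024-01-23'")

print(between_ct, strict_ct, inclusive_ct)  # 3 1 3
```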


Of course, we can also nest logic for multiple timeframes:

In [7]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    -- Fetch events with dates in January _or_ March
    AND (
            (recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-31') OR
            (recurrencedatestart BETWEEN '2024-03-01' AND '2024-03-31')
    ) 
ORDER BY RANDOM()
LIMIT 2

Unnamed: 0,title,parkfullname,category,isfree,description
0,2:30 pm: Village — Geology Talk Ranger Program...,Grand Canyon National Park,Regular Event,True,"<div class=""Component text-content-size text-c..."
1,Afternoon Stroll,Joshua Tree National Park,Regular Event,True,<p>Join a ranger for a 0.4 mile (0.6 km) guide...


Another handy way to filter datasets is through string matching. If you're familiar with Python, you probably know regex, but SQL has a few other, simpler ways. First, `LIKE`:

In [8]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND title LIKE '%Stroll%'
LIMIT 5

Unnamed: 0,title,parkfullname,category,isfree,description
0,Afternoon Stroll,Joshua Tree National Park,Regular Event,True,<p>Join a ranger for a 0.4 mile (0.6 km) guide...


But `LIKE` is case-sensitive, so it's easy to miss results.

In [None]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND title LIKE '%hike%'
LIMIT 5

Instead, we can use `ILIKE`, which is case-insensitive:

In [9]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND title ILIKE '%hike%'
LIMIT 5

Unnamed: 0,title,parkfullname,category,isfree,description
0,A Hike Through The (Cactus) Forest (East Distr...,Saguaro National Park,Regular Event,True,<p>Let your interest reach new heights and joi...
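
For readers coming from Python, the wildcard rules can be sketched as a translation to regex: `%` matches any run of characters and `_` matches exactly one. The helper `sql_like` and the sample titles below are hypothetical, and this is only an illustration of the semantics described above, not how DuckDB implements matching.

```python
import re

def sql_like(value: str, pattern: str, case_insensitive: bool = False) -> bool:
    """Approximate LIKE (or ILIKE when case_insensitive=True) with a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches zero or more characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    flags = re.DOTALL
    if case_insensitive:
        flags |= re.IGNORECASE
    # The whole value must match, just as LIKE compares the full string.
    return re.fullmatch("".join(parts), value, flags) is not None

# Made-up event titles for illustration.
titles = ["Sunset Hike", "Junior Ranger hike", "Afternoon Stroll"]

like_matches = [t for t in titles if sql_like(t, "%hike%")]
ilike_matches = [t for t in titles if sql_like(t, "%hike%", case_insensitive=True)]

print(like_matches)   # ['Junior Ranger hike']
print(ilike_matches)  # ['Sunset Hike', 'Junior Ranger hike']
```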


`LIKE` is also great for cleaning up messy columns:

In [10]:
%%sql 
SELECT 
    name,
    managedByOrganization,
FROM nps_public_data.parkinglots
LIMIT 10

Unnamed: 0,name,managedByOrganization
0,Corn Creek Road Parking Area,Nevada Department of Transportation
1,Glen Echo Park and Clara Barton National Histo...,National Park Service
2,Glen Echo Park and Clara Barton National Histo...,National Park Service
3,Kaloko Fishpond Parking Area,National Park Service
4,Main Visitor Parking Lot,NPS
5,Natchez Visitor Center Parking Lot,NPS
6,Parking lot Remodel and Construction,NPS
7,Quarai Oversized Vehicle Parking,NPS
8,Quarry Visitor Center Parking Lot,Dinosaur National Monument
9,Shark Valley Parking Lot,NPS


In [11]:
%%sql 
SELECT 
    CASE WHEN name ILIKE '%visitor%' THEN 'Visitor Center'
         WHEN name ILIKE '%parking%' THEN 'Parking Lot'
         ELSE 'Other'
    END as type,
    IF(managedByOrganization ILIKE '%NPS%', 'National Park Service', managedByOrganization) as managed_by,

FROM nps_public_data.parkinglots
LIMIT 10

Unnamed: 0,type,managed_by
0,Parking Lot,Nevada Department of Transportation
1,Parking Lot,National Park Service
2,Parking Lot,National Park Service
3,Parking Lot,National Park Service
4,Visitor Center,National Park Service
5,Visitor Center,National Park Service
6,Parking Lot,National Park Service
7,Parking Lot,National Park Service
8,Visitor Center,Dinosaur National Monument
9,Parking Lot,National Park Service


Depending on your flavor of SQL, there might be other ways to pattern match. DuckDB also supports `glob` and `regex` matching. Those are outside the scope of this course, but you can read more [here](https://duckdb.org/docs/sql/functions/patternmatching.html).

In [12]:
%%sql
SELECT fullname, states FROM nps_public_data.parks

Unnamed: 0,fullName,states
0,Federal Hall National Memorial,NY
1,Lewis & Clark National Historic Trail,"IA,ID,IL,IN,KS,KY,MO,MT,NE,ND,OH,OR,PA,SD,WA,WV"
2,National Capital Parks-East,DC
3,Adams National Historical Park,MA
4,George Washington Memorial Parkway,"DC,MD,VA"
...,...,...
466,Navajo National Monument,AZ
467,Cabrillo National Monument,CA
468,Golden Spike National Historical Park,UT
469,Fort Union Trading Post National Historic Site,"MT,ND"


Sometimes, we might need to construct a list to perform a more robust filter. We can use `split` and cast the result to a list of strings to turn the `states` field in parks into a list. Then, we can query the list directly instead of pattern matching on the raw string.

In this course, we'll challenge you to think critically about the structure of your data and how you can manipulate it to achieve a desired outcome.

In [13]:
%%sql
-- Which parks are fully or partially in Utah?
WITH park_states AS (
    SELECT 
        fullname,
        states AS states_string, 
        split(states, ',') ::string[] AS states_list
    FROM nps_public_data.parks p
    )
SELECT 
    * 
FROM park_states
WHERE list_contains(states_list, 'UT')
LIMIT 5

Unnamed: 0,fullName,states_string,states_list
0,Cedar Breaks National Monument,UT,[UT]
1,Arches National Park,UT,[UT]
2,Bryce Canyon National Park,UT,[UT]
3,California National Historic Trail,"CA,CO,ID,KS,MO,NE,NV,OR,UT,WY","[CA, CO, ID, KS, MO, NE, NV, OR, UT, WY]"
4,Canyonlands National Park,UT,[UT]


This allows for some nifty queries in DuckDB for cross-border parks:

In [None]:
%%sql
-- Which parks are in both Utah and Wyoming?
WITH park_states AS (
    SELECT 
        fullname,
        split(states, ',') ::string[] AS states_list
    FROM nps_public_data.parks p
    )
SELECT 
    * 
FROM park_states
WHERE list_has_all(states_list, ['UT', 'WY'])

In [None]:
%%sql
-- Which parks are in Utah and/or Wyoming?
WITH park_states AS (
    SELECT 
        fullname,
        split(states, ',') ::string[] AS states_list
    FROM nps_public_data.parks p
    )
SELECT 
    * 
FROM park_states
WHERE list_has_any(states_list, ['UT', 'WY'])
LIMIT 5
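
If the three list functions are hard to keep apart, a plain-Python analogy may help. This is a rough sketch of the semantics only: the helper names mirror the DuckDB functions but are ordinary Python functions written for illustration, and the three rows are copied from the outputs shown earlier.

```python
def split_states(states: str) -> list[str]:
    """Mimic split(states, ','): turn 'MT,ND' into ['MT', 'ND']."""
    return states.split(",")

def list_contains(items: list[str], value: str) -> bool:
    """True if value is one of the items."""
    return value in items

def list_has_all(items: list[str], wanted: list[str]) -> bool:
    """True if every wanted value is present."""
    return set(wanted).issubset(items)

def list_has_any(items: list[str], wanted: list[str]) -> bool:
    """True if at least one wanted value is present."""
    return not set(wanted).isdisjoint(items)

# Rows taken from the query outputs above.
parks = {
    "Arches National Park": "UT",
    "California National Historic Trail": "CA,CO,ID,KS,MO,NE,NV,OR,UT,WY",
    "Fort Union Trading Post National Historic Site": "MT,ND",
}

in_utah = [n for n, s in parks.items() if list_contains(split_states(s), "UT")]
in_ut_and_wy = [n for n, s in parks.items() if list_has_all(split_states(s), ["UT", "WY"])]
in_mt_or_wy = [n for n, s in parks.items() if list_has_any(split_states(s), ["MT", "WY"])]

print(in_utah)       # Arches and the California trail
print(in_ut_and_wy)  # only the California trail
print(in_mt_or_wy)   # the California trail and Fort Union
```

Splitting first also avoids the false positives a substring match on the raw string could produce when one value happens to appear inside another.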

We can also filter against a list of values using `IN`. This can be pretty handy for picking out multiple values:

In [None]:
%%sql
SELECT 
    fullname,
    states,
    description
FROM nps_public_data.parks p
WHERE name IN ('Arches', 'Bryce Canyon', 'Zion')

When we return rows, we can order the results using the `ORDER BY` clause. We can also group results with `GROUP BY`. We'll discuss grouping more in the next section on aggregations, but `GROUP BY` can be used to eliminate duplicates, like `DISTINCT`.

In [None]:
%%sql
SELECT
    fullname,
    states
FROM nps_public_data.parks
ORDER BY fullname DESC
LIMIT 5

In [None]:
%%sql
SELECT
    DISTINCT states
FROM nps_public_data.parks
LIMIT 5
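
As a quick check of the claim that `GROUP BY` can deduplicate like `DISTINCT`, here is a small sketch. It uses Python's built-in `sqlite3` as a stand-in for DuckDB, with a toy `parks` table holding four rows copied from the outputs above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parks (fullname TEXT, states TEXT)")
conn.executemany(
    "INSERT INTO parks VALUES (?, ?)",
    [
        ("Arches National Park", "UT"),
        ("Bryce Canyon National Park", "UT"),
        ("Navajo National Monument", "AZ"),
        ("Cabrillo National Monument", "CA"),
    ],
)

# DISTINCT removes duplicate values of the selected column.
distinct_states = conn.execute(
    "SELECT DISTINCT states FROM parks ORDER BY states"
).fetchall()

# GROUP BY collapses rows into one row per value, giving the same set here.
grouped_states = conn.execute(
    "SELECT states FROM parks GROUP BY states ORDER BY states"
).fetchall()

print(distinct_states)  # [('AZ',), ('CA',), ('UT',)]
print(grouped_states)   # [('AZ',), ('CA',), ('UT',)]
```

The `ORDER BY` is there only to make the two results directly comparable; without it, neither query guarantees a row order.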

Voila! That's a bit about joins, comparisons, and filtering!