In [1]:
import duckdb

# Load SQL extension
%load_ext sql

# Initialize 🦆 DuckDB connection
conn = duckdb.connect()

# Import database
%sql conn --alias duckdb
%sql IMPORT DATABASE '../../data/nps';

Config,value
feedback,True
autopandas,True
displaylimit,10
displaycon,False


Unnamed: 0,Count
0,224


Sometimes, _joins themselves_ can be incredibly useful for data transformation. Don't underestimate the power of _self-joins_ or `CROSS JOIN`s in SQL.

A self-join joins a table _to itself_ to answer a question. A `CROSS JOIN` pairs _every_ record on the left with _every_ record on the right. Both can be useful in certain cases, but be careful: a cross join can easily blow up a query. For example, a cross join of two 1,000-record tables produces 1,000 * 1,000 = 1,000,000 rows.
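
To make the row-count arithmetic concrete, here is a minimal sketch. It assumes Python's built-in `sqlite3` module in place of DuckDB so that it runs without the NPS dataset, and the tables `left_t` and `right_t` are hypothetical.

```python
import sqlite3

# Hypothetical toy tables; names and sizes are for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE left_t (id INTEGER)")
con.execute("CREATE TABLE right_t (id INTEGER)")
con.executemany("INSERT INTO left_t VALUES (?)", [(i,) for i in range(1000)])
con.executemany("INSERT INTO right_t VALUES (?)", [(i,) for i in range(1000)])

# Every left row is paired with every right row.
cross_count = con.execute(
    "SELECT COUNT(*) FROM left_t CROSS JOIN right_t"
).fetchone()[0]
print(cross_count)  # 1000000
```

The same growth applies to every additional table in the cross join, so it is worth estimating the output size before running one on real data.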

Here's a practical example: say we wanted to camp for two nights at Joshua Tree National Park, staying at a different campground each night.

How could we write a cross join that gives us every ordered pair of campgrounds we could possibly stay at?

In [2]:
%%sql
-- If I have two nights to camp at Joshua Tree,
-- how many different ways could I camp?
WITH joshua_tree_campgrounds AS (
    SELECT
        *
    FROM nps_public_data.campgrounds c
    INNER JOIN nps_public_data.parks p
        USING(parkcode)
    WHERE fullname = 'Joshua Tree National Park'
)
SELECT
    jtc.name,
    jtc2.name,
    ROW_NUMBER() OVER () as rn
FROM joshua_tree_campgrounds jtc
CROSS JOIN joshua_tree_campgrounds jtc2
WHERE jtc.name != jtc2.name
ORDER BY rn DESC

Unnamed: 0,name,name_1,rn
0,White Tank Campground,Sheep Pass Group Campground,72
1,White Tank Campground,Ryan Campground,71
2,White Tank Campground,Jumbo Rocks Campground,70
3,White Tank Campground,Indian Cove Campground,69
4,White Tank Campground,Cottonwood Campground,68
...,...,...,...
67,Black Rock Campground,Jumbo Rocks Campground,5
68,Black Rock Campground,Indian Cove Campground,4
69,Black Rock Campground,Cottonwood Campground,3
70,Black Rock Campground,Belle Campground,2


What about three nights?

In [3]:
%%sql
-- What about 3?
WITH joshua_tree_campgrounds AS (
    SELECT
        *
    FROM nps_public_data.campgrounds c
    INNER JOIN nps_public_data.parks p
        USING(parkcode)
    WHERE fullname = 'Joshua Tree National Park'
)
SELECT
    --     COUNT(*),
    --     COUNT(DISTINCT jtc.name)
    jtc.name,
    jtc2.name,
    jtc3.name,
    ROW_NUMBER() OVER () as rn
FROM joshua_tree_campgrounds jtc
CROSS JOIN joshua_tree_campgrounds jtc2
CROSS JOIN joshua_tree_campgrounds jtc3
WHERE 1 = 1
    AND jtc.name != jtc2.name
    AND jtc.name != jtc3.name
    AND jtc2.name != jtc3.name
ORDER BY rn DESC

Unnamed: 0,name,name_1,name_2,rn
0,Ryan Campground,White Tank Campground,Sheep Pass Group Campground,504
1,Jumbo Rocks Campground,White Tank Campground,Sheep Pass Group Campground,503
2,Indian Cove Campground,White Tank Campground,Sheep Pass Group Campground,502
3,Cottonwood Campground,White Tank Campground,Sheep Pass Group Campground,501
4,Belle Campground,White Tank Campground,Sheep Pass Group Campground,500
...,...,...,...,...
499,Ryan Campground,Black Rock Campground,Hidden Valley Campground,5
500,Jumbo Rocks Campground,Black Rock Campground,Hidden Valley Campground,4
501,Indian Cove Campground,Black Rock Campground,Hidden Valley Campground,3
502,Cottonwood Campground,Black Rock Campground,Hidden Valley Campground,2


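We can check both counts against a permutations calculator, since the order of nights matters and no campground repeats: [2 nights](https://www.calculatorsoup.com/calculators/discretemathematics/permutations.php?n=9&r=2&action=solve) (9 × 8 = 72), [3 nights](https://www.calculatorsoup.com/calculators/discretemathematics/permutations.php?n=9&r=3&action=solve) (9 × 8 × 7 = 504).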
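
The same check can be done locally. This is a small sketch using only the Python standard library. The value `n_campgrounds = 9` is taken from the `n=9` parameter in the calculator links, and the brute-force enumeration mimics the cross join with its `!=` filters.

```python
import math
from itertools import permutations

n_campgrounds = 9  # assumption: matches n=9 in the calculator links

# Closed-form count of ordered selections without repetition.
two_nights = math.perm(n_campgrounds, 2)
three_nights = math.perm(n_campgrounds, 3)
print(two_nights, three_nights)  # 72 504

# Brute force: enumerate every ordered triple of distinct campgrounds,
# which is what the triple cross join with != filters produces.
brute_three = sum(1 for _ in permutations(range(n_campgrounds), 3))
print(brute_three)  # 504
```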

We can also build on the earlier example of unnesting states to aggregate counts per state. Since our `parks` table _technically_ contains both parks and trails, we can self-join an unnested state CTE to count both parks and trails in the same query.

This answers the question "For states with a national park, how many national trails are in that state?"

In [4]:
%%sql
WITH park_list AS (
    SELECT
        fullname,
        designation,
        UNNEST(
            SPLIT(states, ',')::string[]
            ) as state
    FROM nps_public_data.parks p
)
SELECT
    parks.state,
    COUNT(DISTINCT parks.fullname) as num_parks,
    COUNT(DISTINCT trails.fullname) as num_trails
FROM park_list parks
LEFT JOIN park_list trails
    ON trails.state = parks.state
    AND trails.designation ILIKE '%trail%'
WHERE parks.designation = 'National Park'
GROUP BY 1
ORDER BY 2 DESC, 1
LIMIT 10

Unnamed: 0,state,num_parks,num_trails
0,CA,6,5
1,UT,5,4
2,AZ,3,3
3,CO,3,4
4,FL,3,0
5,WA,3,3
6,AK,2,0
7,HI,2,1
8,MT,2,2
9,NM,2,4


Pay very close attention to how and where we're filtering in the above query. The base query selects parks, so that filter appears in the `WHERE` clause. The self-join is designed to pull in trails, so the trail filter goes in the `JOIN` condition; moving it to the `WHERE` clause would drop states with no trails (such as FL and AK above), effectively turning the `LEFT JOIN` into an inner join. Finally, we count `DISTINCT` names because the join produces duplicate rows: each park row is repeated once per matching trail.
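
The effect of filter placement is easy to see on a toy table. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` rather than DuckDB, the `units` table and its four rows are invented for illustration, and SQLite's `LIKE` (case-insensitive for ASCII by default) stands in for `ILIKE`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical stand-in for the unnested park_list CTE.
con.execute("CREATE TABLE units (fullname TEXT, designation TEXT, state TEXT)")
con.executemany(
    "INSERT INTO units VALUES (?, ?, ?)",
    [
        ("Park A", "National Park", "CA"),
        ("Park B", "National Park", "FL"),
        ("Trail X", "National Historic Trail", "CA"),
        ("Monument M", "National Monument", "FL"),
    ],
)

base = """
    SELECT p.state,
           COUNT(DISTINCT p.fullname) AS num_parks,
           COUNT(DISTINCT t.fullname) AS num_trails
    FROM units p
    LEFT JOIN units t
        ON t.state = p.state {join_filter}
    WHERE p.designation = 'National Park' {where_filter}
    GROUP BY p.state
    ORDER BY p.state
"""

# Trail filter in the JOIN condition: states without trails are kept.
on_filter = con.execute(
    base.format(join_filter="AND t.designation LIKE '%trail%'", where_filter="")
).fetchall()
print(on_filter)  # [('CA', 1, 1), ('FL', 1, 0)]

# Trail filter in the WHERE clause: states without trails disappear.
where_filter = con.execute(
    base.format(join_filter="", where_filter="AND t.designation LIKE '%trail%'")
).fetchall()
print(where_filter)  # [('CA', 1, 1)]
```

In the second query the unmatched or non-trail rows are removed after the join, so FL never reaches the `GROUP BY`.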

So the two patterns we've covered are:

- **Self-joins:** A self-join joins a table to itself, allowing comparisons between rows of the same table. This is useful for querying hierarchical data or finding rows that share a common attribute.
- **Cross joins:** A cross join pairs each row from the first table with each row from the second, producing the Cartesian product of the two. The result has as many rows as the product of the row counts of the joined tables.
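
As an illustration of the hierarchical case, here is a minimal sketch of a self-join that resolves each row's parent. It again assumes Python's built-in `sqlite3` rather than DuckDB, and the `staff` table with its `manager_id` column is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical hierarchy: manager_id points at another row in the same table.
con.execute("CREATE TABLE staff (id INTEGER, name TEXT, manager_id INTEGER)")
con.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [(1, "Ada", None), (2, "Ben", 1), (3, "Cy", 1)],
)

# Alias the table twice: e is the employee, m is that employee's manager.
pairs = con.execute(
    """
    SELECT e.name, m.name
    FROM staff e
    LEFT JOIN staff m
        ON m.id = e.manager_id
    ORDER BY e.id
    """
).fetchall()
print(pairs)  # [('Ada', None), ('Ben', 'Ada'), ('Cy', 'Ada')]
```

The `LEFT JOIN` keeps the top of the hierarchy (Ada, who has no manager) in the result with a `None` in the manager column.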

These two patterns can help you manipulate data into a format suitable for further analysis, aggregation, or windowing. 