Recipe: Multi-Table Joins

Tier: Intermediate
Commands used: join, joinp (asof, non-equi), sqlp, select, sort, input, datefmt, count
Anchor datasets: NYC 311 + NOAA GHCN-Daily weather; wcp.csv + country-continent.csv; export-style multi-table merges

This recipe expands the original Cookbook "Multi-table join avoiding repeated columns" entry. The original is preserved at the bottom of Cookbook.

Problem

Three common join scenarios that go beyond a single inner join:

Lookup join (small) — enrich a big CSV with values from a tiny reference table.
Asof join (time-series) — match each event to the nearest prior event in another time-series. The classic case: NYC 311 complaints + NOAA daily weather.
Repeated-column wide merge — merging multiple exports that share their first N columns and only differ after column N.

Data

# 1. wcp.csv + country-continent.csv (Recipe 1)
curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/wcp.zip && unzip wcp.zip
curl -LO https://raw.githubusercontent.com/datawookie/data-diaspora/master/africa-data/country-continent.csv

# 2. NYC 311 + NOAA GHCN (Recipe 2)
#    NYC 311: the 1M-row bundled sample
ls resources/test/NYC_311_SR_2010-2020-sample-1M.csv
#    NOAA GHCN station data (Central Park, NYC)
curl -LO "https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/USW00094728.csv"

Solution — Recipe 1: lookup join

1. Inner join wcp + continent

qsv join Country wcp.csv country country-continent.csv > wcp_with_continent.csv

The first two args are the left side; the next two are the right. Output preserves every column of both inputs.

2. Left join — keep wcp rows even if no continent match

qsv join --left Country wcp.csv country country-continent.csv > wcp_left.csv

Empty right-side cells appear for unmatched rows.

3. Anti-join — find wcp countries with no continent in the lookup

qsv join --left-anti Country wcp.csv country country-continent.csv > unmatched.csv
qsv count unmatched.csv
# e.g. 31 missing countries

Useful for "what's in my data that I don't have a reference for?"

4. Semi-join — keep wcp rows that DO have a match (no right-side columns added)

qsv join --left-semi Country wcp.csv country country-continent.csv > matched.csv
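
The four join flavours above differ only in which left rows survive and whether right-side columns are attached. As a rough illustration, the awk sketch below mimics that behaviour on toy data. It is not how qsv implements joins: the file names cities.csv and lookup.csv and the helper toy_join are hypothetical, and it assumes plain unquoted CSV with unique lookup keys in column 1.

```shell
# Toy inputs (hypothetical stand-ins for wcp.csv and country-continent.csv)
workdir=$(mktemp -d)
printf 'Country,City\nke,Nairobi\nfr,Paris\nzz,Nowhere\n' > "$workdir/cities.csv"
printf 'country,continent\nke,Africa\nfr,Europe\n' > "$workdir/lookup.csv"

# toy_join MODE LEFT RIGHT, where MODE is inner, left, anti or semi.
# Joins on column 1 of both files and prints data rows only (no header).
toy_join() {
  awk -F, -v OFS=, -v mode="$1" '
    NR == FNR {                       # first file read: the lookup (right side)
      if (FNR == 1) { nf = NF }       # remember how many right columns exist
      else { right[$1] = $0 }
      next
    }
    FNR == 1 {                        # left header: build empty padding, then skip
      for (i = 1; i < nf; i++) pad = pad OFS
      next
    }
    {
      hit = ($1 in right)
      if (mode == "inner" && hit)  print $0, right[$1]
      if (mode == "left")          print $0, (hit ? right[$1] : pad)
      if (mode == "anti" && !hit)  print $0
      if (mode == "semi" && hit)   print $0
    }
  ' "$3" "$2"
}

for mode in inner left anti semi; do
  echo "== $mode =="
  toy_join "$mode" "$workdir/cities.csv" "$workdir/lookup.csv"
done
```

On this toy data the inner and semi results have the same rows, and the anti result is exactly the left row that the inner join dropped, which is a handy cross-check for the real outputs too.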

Solution — Recipe 2: NYC 311 × NOAA weather (asof join)

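The classic time-series enrichment problem: for each 311 complaint, what was the weather on the day of the complaint? The 311 timestamp is finer-grained than the daily weather record, and a weather series may have missing days, so an exact-match join is not always enough. An asof join matches each complaint to the nearest weather row instead.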

1. Prepare both sides

# Strip NOAA's preamble, keep just date + temp columns
qsv input --auto-skip USW00094728.csv \
  | qsv select 'DATE,TMAX,TMIN,PRCP' \
  | qsv datefmt DATE --formatstr '%Y-%m-%d' \
  | qsv sort --select DATE \
  > nyc_weather.csv

# Normalize 311 Created Date and sort by date
qsv datefmt 'Created Date' --formatstr '%Y-%m-%d' --new-column day \
  NYC_311_SR_2010-2020-sample-1M.csv \
  | qsv sort --select day \
  > nyc311_by_day.csv

Both sides must be sorted by the asof key — Polars enforces it.
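
A quick pre-flight check can confirm the sort order before the join runs. The sketch below is one possible way to do it with standard tools; the helper name column_is_sorted is hypothetical, and it relies on the keys being ISO dates (%Y-%m-%d), which sort chronologically when compared as plain bytes.

```shell
# column_is_sorted: reads a single-column CSV (header line + values) on stdin
# and succeeds only if the values are in ascending byte order.
# Duplicate adjacent values are allowed.
column_is_sorted() {
  tail -n +2 | LC_ALL=C sort -c 2>/dev/null
}

# Toy demonstration on inline data
printf 'day\n2020-01-01\n2020-01-02\n2020-01-02\n' | column_is_sorted && echo "sorted"
printf 'day\n2020-01-03\n2020-01-02\n' | column_is_sorted || echo "not sorted"

# Intended use with the files from this recipe (requires qsv):
#   qsv select day nyc311_by_day.csv | column_is_sorted
#   qsv select DATE nyc_weather.csv  | column_is_sorted
```

Letting qsv select extract the key column keeps the CSV parsing in qsv, so quoted fields in the 311 data do not confuse the check.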

2. Asof join

qsv joinp --asof \
  --strategy backward \
  day nyc311_by_day.csv DATE nyc_weather.csv \
  --coalesce > nyc311_weather.csv

Strategies: backward (nearest prior match), forward (nearest subsequent), nearest (either direction). For complaint-day weather, backward is the right choice: it picks the weather row dated on or before the complaint day, which is the same-day observation whenever one exists.

(Exact flag spellings vary across qsv versions — run qsv joinp --help to confirm the asof-related flag names in your build.)
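
To make the backward strategy concrete, here is a minimal awk sketch of its matching rule on toy data. It only imitates the semantics and is not the Polars implementation; the files, the helper asof_backward, and the two-column layout are all hypothetical, and both inputs are assumed to be sorted on column 1.

```shell
workdir=$(mktemp -d)
# Hypothetical toy data: the weather series has no row for 2020-01-03
printf 'DATE,TMAX\n2020-01-01,50\n2020-01-02,61\n2020-01-04,33\n' > "$workdir/weather.csv"
printf 'day,complaint\n2020-01-02,noise\n2020-01-03,heat\n2020-01-04,noise\n' > "$workdir/events.csv"

# asof_backward EVENTS WEATHER: for each event row, attach the value from the
# last weather row whose key is <= the event key (empty if there is none).
asof_backward() {
  awk -F, -v OFS=, '
    NR == FNR {                       # first file read: the weather series
      if (FNR > 1) { n++; key[n] = $1; val[n] = $2 }
      next
    }
    FNR == 1 { next }                 # skip the events header
    {
      while (i < n && key[i + 1] <= $1) i++   # advance the weather pointer
      print $0, (i > 0 ? val[i] : "")
    }
  ' "$2" "$1"
}

asof_backward "$workdir/events.csv" "$workdir/weather.csv"
```

The 2020-01-03 event has no same-day weather row, so it falls back to the 2020-01-02 reading, which is exactly what distinguishes an asof join from an equality join.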

3. Analyze: are noise complaints more common when it's hot?

qsv sqlp nyc311_weather.csv \
  "SELECT
     CASE
       WHEN TMAX > 300 THEN 'hot'        -- TMAX is in tenths of °C: 300 = 30.0 °C
       WHEN TMAX < 0 THEN 'cold'         -- daily high below 0.0 °C
       ELSE 'moderate'
     END AS temp_band,
     COUNT(*) AS complaints
   FROM nyc311_weather
   WHERE \"Complaint Type\" = 'Noise - Residential'
   GROUP BY temp_band
   ORDER BY complaints DESC"

(NOAA GHCN-Daily stores TMAX in tenths of degrees Celsius, so the thresholds above are 30.0 °C and 0.0 °C; adjust them to suit your question. Always sanity-check unit conventions when joining data across domains.)
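
Because TMAX is stored in tenths of °C, a threshold chosen in another unit has to be converted before it goes into the query. The two helpers below are a small sketch of that arithmetic; the function names are hypothetical and not part of qsv.

```shell
# tenths_c_to_c: GHCN-Daily integer (tenths of °C) -> °C with one decimal place
tenths_c_to_c() {
  awk -v t="$1" 'BEGIN { printf "%.1f\n", t / 10 }'
}

# f_to_tenths_c: Fahrenheit threshold -> tenths of °C, rounded to an integer
# (°C = (°F - 32) * 5 / 9, then multiplied by 10)
f_to_tenths_c() {
  awk -v f="$1" 'BEGIN { printf "%.0f\n", (f - 32) * 50 / 9 }'
}

tenths_c_to_c 300     # a stored TMAX of 300 is 30.0 °C
f_to_tenths_c 80      # an 80 °F cutoff expressed on the stored scale
```

For example, to treat days above 80 °F as hot, compare TMAX against the value printed by f_to_tenths_c 80 rather than against 80 or 800.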

Solution — Recipe 3: repeated-column wide merge

You have table_A.csv, table_B.csv, table_C.csv, table_D.csv. The first 10 columns are identical across all four files; columns 11+ are unique to each file. You want one wide CSV, joined on column 2.

cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
  # Pull just the join column + the unique extras from each table
  qsv join 2 combined.csv 1 <(qsv select 2,11- "$NEXT") > new.csv
  mv new.csv combined.csv
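done

<(...) is bash process substitution: qsv select becomes the right-side input without writing an intermediate file. 2,11- selects column 2 and everything from column 11 onward, so the join key is column 1 of the right-hand input. Because join keeps every column from both sides, each pass also appends another copy of the key column; drop the extras with a final qsv select if they are unwanted.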
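
The 2,11- range notation is easy to try on a toy header. The sketch below uses cut, which happens to accept the same notation, as a stand-in for qsv select; it only works on simple unquoted CSV, and the c1..c12 column names are made up for the illustration.

```shell
# One toy header row with 12 columns, c1..c12 (hypothetical names)
header='c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12'

# Keep column 2 plus everything from column 11 onward
printf '%s\n' "$header" | cut -d, -f2,11-
```

The original column 2 ends up first in the trimmed output, which is why the loop joins column 2 of combined.csv against column 1 of the right-hand input.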

Variations

Non-equi join — match by range, not equality

You have salary bands and an employee list:

# bands.csv
min,max,grade
0,50000,Junior
50001,100000,Mid
100001,200000,Senior
200001,9999999,Staff

qsv joinp --non-equi \
  'salary_left >= min_right AND salary_left <= max_right' \
  employees.csv bands.csv > graded.csv

The Polars SQL expression uses _left and _right suffixes to disambiguate column origins.
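
As a sanity check on what the range condition should produce, the awk sketch below applies the same salary >= min AND salary <= max rule to the bands above. It is an illustration only, not the joinp implementation: the employees.csv layout (name,salary) and the helper grade_salaries are assumptions, and the output is simplified to the employee row plus the grade.

```shell
workdir=$(mktemp -d)
cat > "$workdir/bands.csv" <<'EOF'
min,max,grade
0,50000,Junior
50001,100000,Mid
100001,200000,Senior
200001,9999999,Staff
EOF
# Hypothetical employees.csv; the recipe does not show its columns
printf 'name,salary\nAda,48000\nLin,100000\nSam,250000\n' > "$workdir/employees.csv"

# grade_salaries EMPLOYEES BANDS: print each employee row with every band
# whose [min, max] range contains the salary in column 2.
grade_salaries() {
  awk -F, -v OFS=, '
    NR == FNR {                       # first file read: the bands
      if (FNR > 1) { n++; lo[n] = $1 + 0; hi[n] = $2 + 0; grade[n] = $3 }
      next
    }
    FNR == 1 { next }                 # skip the employees header
    {
      for (b = 1; b <= n; b++)
        if ($2 + 0 >= lo[b] && $2 + 0 <= hi[b]) print $0, grade[b]
    }
  ' "$2" "$1"
}

grade_salaries "$workdir/employees.csv" "$workdir/bands.csv"
```

Boundary values such as 100000 are worth testing explicitly, since inclusive versus exclusive comparisons decide which band they land in.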

Cross join (Cartesian product)

qsv joinp --cross flavors.csv toppings.csv > all_combinations.csv

Filter before joining for speed

qsv joinp --filter-left "Borough = 'BROOKLYN'" \
  day nyc311_by_day.csv DATE nyc_weather.csv \
  --coalesce > brooklyn_weather.csv

Polars evaluates --filter-left / --filter-right BEFORE the join, which avoids carrying rows through the join that a later WHERE clause would discard anyway.

SQL-style join via `sqlp`

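qsv sqlp wcp.csv country-continent.csv \
  "SELECT w.AccentCity, w.Population, cc.continent
   FROM wcp AS w JOIN _t_2 AS cc ON w.Country = cc.country
   WHERE w.Population > 1000000
   ORDER BY w.Population DESC"

sqlp names each table after its file stem, and country-continent contains a hyphen, so the query refers to the second input by its positional alias _t_2 instead. (Run qsv sqlp --help to confirm the alias convention in your build.)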

Performance notes

join is in-memory. The right side is loaded into a hash table; the left side streams. It works well for "millions × thousands" joins; expect trouble with "millions × millions", where the right side may no longer fit in memory.
joinp is Polars-powered. Larger-than-RAM, multithreaded, supports asof and non-equi. Slower for tiny joins (Polars setup cost), much faster for large joins.
For repeated incremental joins (Recipe 3), each pass re-reads and re-writes the full intermediate file. If combined.csv grows large, batch all joins into a single sqlp query for a single pass.
