Joel Natividad edited this page May 13, 2026 · 2 revisions

Recipe: Multi-Table Joins

Tier: Intermediate
Commands used: join, joinp (asof, non-equi), sqlp, select, sort
Anchor datasets: NYC 311 + NOAA GHCN-Daily weather; wcp.csv + country-continent.csv; export-style multi-table merges

This recipe expands the original Cookbook "Multi-table join avoiding repeated columns" entry. The original is preserved at the bottom of Cookbook.

Problem

Three common join scenarios that go beyond a single inner join:

  1. Lookup join (small) — enrich a big CSV with values from a tiny reference table.
  2. Asof join (time-series) — match each event to the nearest prior event in another time-series. The classic case: NYC 311 complaints + NOAA daily weather.
  3. Repeated-column wide merge — merging multiple exports that share their first N columns and only differ after column N.

Data

# 1. wcp.csv + country-continent.csv (Recipe 1)
curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/wcp.zip && unzip wcp.zip
curl -LO https://raw.githubusercontent.com/datawookie/data-diaspora/master/africa-data/country-continent.csv

# 2. NYC 311 + NOAA GHCN (Recipe 2)
#    NYC 311: the 1M-row bundled sample
ls resources/test/NYC_311_SR_2010-2020-sample-1M.csv
#    NOAA GHCN station data (Central Park, NYC)
curl -LO "https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/USW00094728.csv"

Solution — Recipe 1: lookup join

1. Inner join wcp + continent

qsv join Country wcp.csv country country-continent.csv > wcp_with_continent.csv

The first two arguments (key column, then file) describe the left side; the next two describe the right. The output keeps every column of both inputs.

2. Left join — keep wcp rows even if no continent match

qsv join --left Country wcp.csv country country-continent.csv > wcp_left.csv

Empty right-side cells appear for unmatched rows.

3. Anti-join — find wcp countries with no continent in the lookup

qsv join --left-anti Country wcp.csv country country-continent.csv > unmatched.csv
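qsv count unmatched.csv
# counts unmatched wcp rows (one per city), not distinct countries
qsv select Country unmatched.csv | qsv dedup | qsv count
# e.g. 31 missing countries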

Useful for "what's in my data that I don't have a reference for?"

4. Semi-join — keep wcp rows that DO have a match (no right-side columns added)

qsv join --left-semi Country wcp.csv country country-continent.csv > matched.csv
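
The anti-join and the semi-join split the left file into two disjoint sets, so their row counts should add up to the left file's total. The sketch below illustrates that sanity check with plain awk on made-up data rather than qsv. The file names, rows, and key values are all hypothetical, and the naive comma split assumes no quoted fields.

```shell
# Made-up sample data (hypothetical, not the real wcp.csv)
workdir=$(mktemp -d)
printf 'Country,City\nke,Nairobi\nxx,Nowhere\nfr,Paris\n' > "$workdir/left.csv"
printf 'country,continent\nke,Africa\nfr,Europe\n' > "$workdir/lookup.csv"

# First file: remember every lookup key. Second file: skip the header,
# then count left rows whose key is absent (anti) or present (semi).
anti_rows=$(awk -F, 'NR == FNR { seen[$1] = 1; next }
                     FNR > 1 && !($1 in seen) { n++ }
                     END { print n + 0 }' "$workdir/lookup.csv" "$workdir/left.csv")
semi_rows=$(awk -F, 'NR == FNR { seen[$1] = 1; next }
                     FNR > 1 && ($1 in seen) { n++ }
                     END { print n + 0 }' "$workdir/lookup.csv" "$workdir/left.csv")
# Data rows in the left file, header excluded
total_rows=$(awk 'NR > 1' "$workdir/left.csv" | wc -l | tr -d ' ')

echo "anti=$anti_rows semi=$semi_rows total=$total_rows"
rm -rf "$workdir"
```

With qsv, the equivalent check is to compare qsv count on unmatched.csv and matched.csv against qsv count on wcp.csv.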

Solution — Recipe 2: NYC 311 × NOAA weather (asof join)

The classic time-series enrichment problem: for each 311 complaint, find the weather on the day it was filed. An exact-match join on the raw columns fails because 311 timestamps are finer-grained than the daily weather records.

1. Prepare both sides

# Strip any preamble, keep just the date, temperature, and precipitation columns
qsv input --auto-skip USW00094728.csv \
  | qsv select 'DATE,TMAX,TMIN,PRCP' \
  | qsv datefmt DATE --formatstr '%Y-%m-%d' \
  | qsv sort --select DATE \
  > nyc_weather.csv

# Normalize 311 Created Date and sort by date
qsv datefmt 'Created Date' --formatstr '%Y-%m-%d' --new-column day \
  NYC_311_SR_2010-2020-sample-1M.csv \
  | qsv sort --select day \
  > nyc311_by_day.csv

Both sides must be sorted by the asof key — Polars enforces it.

2. Asof join

qsv joinp --asof \
  --strategy backward \
  day nyc311_by_day.csv DATE nyc_weather.csv \
  --coalesce > nyc311_weather.csv

Strategies: backward (nearest prior match), forward (nearest subsequent), nearest (either direction). For complaint-day weather, backward is right (the weather observation for that date).

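(Exact flag spellings vary across qsv versions; run qsv joinp --help to confirm the asof-related flag names in your build, and check whether exact key matches are included by default. If they are not, backward returns the previous day's observation rather than the same day's.)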
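
To make the backward strategy concrete, here is a minimal awk sketch of the matching rule on two tiny made-up files. It is not how qsv or Polars implements the join; it only shows which weather row each event picks up. The sketch assumes both files are sorted by date, that exact matches count (DATE <= day), and that no field contains a quoted comma. All rows and values are hypothetical.

```shell
# Tiny made-up inputs, both sorted by date (ISO dates sort correctly as text)
workdir=$(mktemp -d)
printf 'DATE,TMAX\n2020-01-01,50\n2020-01-03,80\n' > "$workdir/weather.csv"
printf 'day,complaint\n2020-01-01,noise\n2020-01-02,heat\n2020-01-04,noise\n' > "$workdir/events.csv"

# Backward asof: each event takes the latest weather row with DATE <= day
asof_backward=$(awk -F, '
  NR == FNR {                                  # first file: load weather rows
    if (FNR > 1) { n++; d[n] = $1; t[n] = $2 }
    next
  }
  FNR == 1 { print $0 ",TMAX"; i = 0; next }   # second file: extend the header
  {
    while (i < n && d[i + 1] <= $1) i++        # advance to the last DATE <= day
    print $0 "," (i > 0 ? t[i] : "")           # empty cell if nothing precedes
  }
' "$workdir/weather.csv" "$workdir/events.csv")

echo "$asof_backward"
rm -rf "$workdir"
```

The 2020-01-02 event has no weather row of its own, so it inherits the 2020-01-01 reading; a forward strategy would take the 2020-01-03 reading instead.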

3. Analyze: are noise complaints more common when it's hot?

qsv sqlp nyc311_weather.csv \
  "SELECT
     CASE
       WHEN TMAX > 300 THEN 'hot'        -- NOAA stores temps as tenths of °C, so 300 = 30.0 °C
       WHEN TMAX < 0 THEN 'cold'
       ELSE 'moderate'
     END AS temp_band,
     COUNT(*) AS complaints
   FROM nyc311_weather
   WHERE \"Complaint Type\" = 'Noise - Residential'
   GROUP BY temp_band
   ORDER BY complaints DESC"

(NOAA GHCN-Daily stores TMAX as tenths of degrees Celsius, so a value of 300 means 30.0 °C; adjust the thresholds to suit your definition of hot and cold. Always sanity-check unit conventions when joining data across domains.)
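
One way to keep the units straight is to write thresholds in whole degrees Celsius and convert them before building the query. The sketch below does that with shell arithmetic; the variable names and the 30 °C / 0 °C cutoffs are illustrative assumptions, not values prescribed by the recipe.

```shell
# Hypothetical thresholds in whole degrees Celsius
hot_c=30
cold_c=0

# GHCN-Daily stores tenths of a degree, so multiply by 10
hot_tenths=$((hot_c * 10))
cold_tenths=$((cold_c * 10))

echo "hot: TMAX > $hot_tenths, cold: TMAX < $cold_tenths"
```

The converted values can then be interpolated into the sqlp CASE expression in place of hard-coded numbers.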

Solution — Recipe 3: repeated-column wide merge

You have table_A.csv, table_B.csv, table_C.csv, table_D.csv. The first 10 columns are identical across all four files; columns 11+ are unique to each file. You want one wide CSV, joined on column 2.

cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
  # Pull just the join column + the unique extras from each table
  qsv join 2 combined.csv 1 <(qsv select 2,11- "$NEXT") > new.csv
  mv new.csv combined.csv
done

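<(...) is bash process substitution: qsv select becomes the right-side input without writing an intermediate file. 2,11- selects column 2 and everything from column 11 onward. Because join keeps every column from both inputs, each pass also appends another copy of the key column; drop the extras with a final qsv select if they get in the way.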
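
The 2,11- selection reads as "column 2, then columns 11 through the last". POSIX cut uses the same range notation, so a tiny sketch with a made-up 12-column header shows what survives. cut is only a stand-in here: it is not CSV-aware and breaks on quoted commas, which qsv select handles.

```shell
# Hypothetical 12-column header, c1 through c12
header='c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12'

# Keep field 2 and fields 11 through the end of the line
kept=$(printf '%s\n' "$header" | cut -d, -f2,11-)

echo "$kept"
```

Only the join key (c2) and the file-specific extras (c11, c12) remain, which is what each pass of the loop feeds to the right side of the join.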

Variations

Non-equi join — match by range, not equality

You have salary bands and an employee list:

# bands.csv
min,max,grade
0,50000,Junior
50001,100000,Mid
100001,200000,Senior
200001,9999999,Staff

qsv joinp --non-equi \
  'salary_left >= min_right AND salary_left <= max_right' \
  employees.csv bands.csv > graded.csv

The Polars SQL expression uses _left and _right suffixes to disambiguate column origins.
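
To show what the range condition selects, the awk sketch below applies the same inclusive test (salary >= min and salary <= max) to the bands above and a made-up employees.csv. The employee names, salaries, and the name,salary column layout are assumptions for illustration, and the output layout is this sketch's own, not necessarily what joinp emits.

```shell
workdir=$(mktemp -d)
# bands.csv as listed above
printf 'min,max,grade\n0,50000,Junior\n50001,100000,Mid\n100001,200000,Senior\n200001,9999999,Staff\n' > "$workdir/bands.csv"
# Hypothetical employee list
printf 'name,salary\nAda,48000\nGrace,125000\n' > "$workdir/employees.csv"

graded=$(awk -F, '
  NR == FNR {                                   # first file: load the bands
    if (FNR > 1) { n++; lo[n] = $1; hi[n] = $2; grade[n] = $3 }
    next
  }
  FNR == 1 { print $0 ",min,max,grade"; next }  # second file: extend the header
  {
    # Emit one output row per band whose inclusive range contains the salary
    for (b = 1; b <= n; b++)
      if ($2 + 0 >= lo[b] + 0 && $2 + 0 <= hi[b] + 0)
        print $0 "," lo[b] "," hi[b] "," grade[b]
  }
' "$workdir/bands.csv" "$workdir/employees.csv")

echo "$graded"
rm -rf "$workdir"
```

Because the bands do not overlap, each employee lands in exactly one band; overlapping ranges would produce one output row per matching band.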

Cross join (Cartesian product)

qsv joinp --cross flavors.csv toppings.csv > all_combinations.csv

Filter before joining for speed

qsv joinp --filter-left "Borough = 'BROOKLYN'" \
  day nyc311_by_day.csv DATE nyc_weather.csv \
  --coalesce > brooklyn_weather.csv

Polars evaluates --filter-left / --filter-right BEFORE the join — saves shuffling rows that wouldn't survive a WHERE clause anyway.

SQL-style join via sqlp

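qsv sqlp wcp.csv country-continent.csv \
  "SELECT wcp.AccentCity, wcp.Population, cc.continent
   FROM wcp JOIN _t_2 AS cc ON wcp.Country = cc.country
   WHERE wcp.Population > 1000000
   ORDER BY wcp.Population DESC"

sqlp names each table after its file stem. Because country-continent contains a hyphen, the query refers to the second input by its positional alias _t_2 and gives it the short alias cc.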

Performance notes

  • join is in-memory. The right side is loaded into a hash table; the left side streams. Works great for "millions × thousands" joins. Fails for "millions × millions."
  • joinp is Polars-powered. Larger-than-RAM, multithreaded, supports asof and non-equi. Slower for tiny joins (Polars setup cost), much faster for large joins.
  • For repeated incremental joins (Recipe 3), each pass re-reads and re-writes the full intermediate file. If combined.csv grows large, batch all joins into a single sqlp query for a single pass.

See also
