-
Notifications
You must be signed in to change notification settings - Fork 104
Recipe Multi Table Joins
Tier: Intermediate
Commands used: join, joinp (asof, non-equi), sqlp, select, sort
Anchor datasets: NYC 311 + NOAA GHCN-Daily weather; wcp.csv + country_continent.csv; export-style multi-table merges
This recipe expands the original Cookbook "Multi-table join avoiding repeated columns" entry. The original is preserved at the bottom of Cookbook.
Three common join scenarios that go beyond a single inner join:
- Lookup join (small) — enrich a big CSV with values from a tiny reference table.
- Asof join (time-series) — match each event to the nearest prior event in another time-series. The classic case: NYC 311 complaints + NOAA daily weather.
- Repeated-column wide merge — merging multiple exports that share their first N columns and only differ after column N.
# 1. wcp.csv + country_continent.csv (Recipe 1)
curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/wcp.zip && unzip wcp.zip
curl -LO https://raw.githubusercontent.com/datawookie/data-diaspora/master/africa-data/country-continent.csv
# 2. NYC 311 + NOAA GHCN (Recipe 2)
# NYC 311: the 1M-row bundled sample
ls resources/test/NYC_311_SR_2010-2020-sample-1M.csv
# NOAA GHCN station data (Central Park, NYC)
curl -LO "https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/USW00094728.csv"qsv join Country wcp.csv country country-continent.csv > wcp_with_continent.csvThe first two args are the left side; the next two are the right. Output preserves every column of both inputs.
qsv join --left Country wcp.csv country country-continent.csv > wcp_left.csvEmpty right-side cells appear for unmatched rows.
qsv join --left-anti Country wcp.csv country country-continent.csv > unmatched.csv
qsv count unmatched.csv
# e.g. 31 missing countriesUseful for "what's in my data that I don't have a reference for?"
qsv join --left-semi Country wcp.csv country country-continent.csv > matched.csvThe classic time-series enrichment problem: for each 311 complaint, what was the weather on the day of the complaint — without an exact-match join (since the 311 timestamp is finer than the daily weather).
# Strip NOAA's preamble, keep just date + temp columns
qsv input --auto-skip USW00094728.csv \
| qsv select 'DATE,TMAX,TMIN,PRCP' \
| qsv datefmt DATE --formatstr '%Y-%m-%d' \
| qsv sort --select DATE \
> nyc_weather.csv
# Normalize 311 Created Date and sort by date
qsv datefmt 'Created Date' --formatstr '%Y-%m-%d' --new-column day \
NYC_311_SR_2010-2020-sample-1M.csv \
| qsv sort --select day \
> nyc311_by_day.csvBoth sides must be sorted by the asof key — Polars enforces it.
qsv joinp \
--strategy backward \
day nyc311_by_day.csv DATE nyc_weather.csv \
--coalesce > nyc311_weather.csvStrategies: backward (nearest prior match), forward (nearest subsequent), nearest (either direction). For complaint-day weather, backward is right (the weather observation for that date).
(Exact flag spellings vary across qsv versions — run qsv joinp --help to confirm the asof-related flag names in your build.)
qsv sqlp nyc311_weather.csv \
"SELECT
CASE
WHEN TMAX > 800 THEN 'hot' -- NOAA stores temps as tenths of C; 800 = 80°C? No — see note
WHEN TMAX < 0 THEN 'cold'
ELSE 'moderate'
END AS temp_band,
COUNT(*) AS complaints
FROM nyc311_weather
WHERE \"Complaint Type\" = 'Noise - Residential'
GROUP BY temp_band
ORDER BY complaints DESC"(NOAA GHCN-Daily stores TMAX as tenths of degrees Celsius; adjust thresholds accordingly. Always sanity-check unit conventions when joining cross-domain.)
You have table_A.csv, table_B.csv, table_C.csv, table_D.csv. The first 10 columns are identical across all four files; columns 11+ are unique to each file. You want one wide CSV, joined on column 2.
cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
# Pull just the join column + the unique extras from each table
qsv join 2 combined.csv 1 <(qsv select 2,11- "$NEXT") > new.csv
mv new.csv combined.csv
done<(...) is bash process substitution — qsv select becomes the right-side input without writing an intermediate file. 2,11- selects column 2 and everything from column 11 onward.
You have salary bands and an employee list:
# bands.csv
min,max,grade
0,50000,Junior
50001,100000,Mid
100001,200000,Senior
200001,9999999,Staffqsv joinp --non-equi \
'salary_left >= min_right AND salary_left <= max_right' \
employees.csv bands.csv > graded.csvThe Polars SQL expression uses _left and _right suffixes to disambiguate column origins.
qsv joinp --cross flavors.csv toppings.csv > all_combinations.csvqsv joinp --filter-left "Borough = 'BROOKLYN'" \
day nyc311_by_day.csv DATE nyc_weather.csv \
--coalesce > brooklyn_weather.csvPolars evaluates --filter-left / --filter-right BEFORE the join — saves shuffling rows that wouldn't survive a WHERE clause anyway.
qsv sqlp wcp.csv country-continent.csv \
"SELECT wcp.AccentCity, wcp.Population, country_continent.continent
FROM wcp JOIN country_continent ON wcp.Country = country_continent.country
WHERE wcp.Population > 1000000
ORDER BY wcp.Population DESC"-
joinis in-memory. The right side is loaded into a hash table; the left side streams. Works great for "millions × thousands" joins. Fails for "millions × millions." -
joinpis Polars-powered. Larger-than-RAM, multithreaded, supports asof and non-equi. Slower for tiny joins (Polars setup cost), much faster for large joins. - For repeated incremental joins (Recipe 3), each pass re-reads and re-writes the full intermediate file. If
combined.csvgrows large, batch all joins into a singlesqlpquery for a single pass.
- Joins & Set Ops — every flag explained
-
SQL & Polars —
sqlpfor joins-as-SQL,joinpdeep-dive - Polars asof-join docs
- Polars non-equi-join docs
- Recipe: Geographic Enrichment
- Recipe: Fetch & Cache — fetch NOAA station data
- Recipe: Larger-than-RAM CSV
- Cookbook (legacy)
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation