-
Notifications
You must be signed in to change notification settings - Fork 104
Joins and Set Ops
Tier: Intermediate
Commands covered: join, joinp, exclude, partition, split
Note
Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.
Two commands do joins (join and joinp) — pick based on size and join type. Three commands chop a CSV into pieces (exclude, partition, split).
| If you want to… | Use | Notes |
|---|---|---|
| Inner/left/right/outer/cross/anti/semi join (small to medium) | join |
In-memory hash join; auto-indexed; simple |
| Same, but for files > RAM or asof / non-equi joins | joinp |
Polars-powered; multithreaded; pre-join filtering |
| Remove rows of B from A based on join key (or keep intersection) | exclude |
Cheaper than join --left-anti for the same use case |
| Split into one file per distinct value of a column | partition |
Creates prefix-{value}.csv per group |
| Split into chunks of N rows / N total chunks / by file size | split |
Multithreaded with an index |
In-memory hash join across two CSV files. Auto-creates an in-memory index for the lookup side. Use this when both files fit in RAM — it's simpler than joinp and produces output in deterministic order.
Example: enrich wcp.csv with continent from a lookup table
# country_continent.csv has columns: country,continent
qsv join Country wcp.csv country country_continent.csv > wcp_with_continent.csvThe Country column comes from the left side (wcp.csv); country is the right side's join key. Output keeps all columns from both files.
Example: left join — keep all wcp rows even if no continent match
qsv join --left Country wcp.csv country country_continent.csv > wcp_left.csvExample: anti join — find wcp countries NOT in the lookup
qsv join --left-anti Country wcp.csv country country_continent.csv > unmatched.csvExample: semi join — find rows in wcp that DO have a continent match (no right-side columns added)
qsv join --left-semi Country wcp.csv country country_continent.csv > matched.csvExample: case-insensitive composite-key join
qsv join --ignore-case 'firstname,lastname' people.csv 'first_name,last_name' contacts.csvSee also: /docs/help/join.md, joinp — for files > RAM and asof joins, exclude — for the --left-anti case without bringing in right-side columns.
Polars-powered join. Multithreaded, larger-than-RAM, supports asof joins (perfect for time-series), non-equi joins (range-based), join-key validation, pre-join filtering, post-join SQL, and column coalescing. The flagship joinp capability is asof — there's no in-memory equivalent.
joinp returns (rows, cols) shape information to stderr. Output is always CSV (write Parquet/JSON/etc. by piping into sqlp). Both inputs need headers; stdin is not supported (use join or a temp file). The argument order is <columns1> <input1> <columns2> <input2> — left keys, left file, right keys, right file. Use comma-separated names for composite keys (the two sides must list the same number of columns).
For the full flag list see
/docs/help/joinp.md; the subsections below are the workflow view.
The default is an inner join. The full matrix of outer / anti / semi variants is available via flags:
| Flag | Result |
|---|---|
| (none) | inner — intersection on the keys |
--left / --right
|
left / right outer — keep all rows of that side, pad the other |
--full |
full outer — keep all rows of both sides |
--left-anti / --right-anti
|
rows of that side with no match on the other (output schema = that side) |
--left-semi / --right-semi
|
rows of that side that do have a match (output schema = that side) |
--cross |
Cartesian product — N × M rows; keys ignored (USE WITH CAUTION) |
# inner join (default): enrich cities with continent
qsv joinp Country wcp.csv country country_continent.csv > wcp_with_continent.csv
# left outer: keep every city row even when the lookup misses
qsv joinp --left Country wcp.csv country country_continent.csv > wcp_left.csv
# left-anti: cities whose Country is NOT in the lookup table
qsv joinp --left-anti Country wcp.csv country country_continent.csv > unmatched.csv
# full outer with coalesced keys (one 'id' column, not 'id' + 'id_right')
qsv joinp --full --coalesce id a.csv id b.csv > merged.csv
# cross join: every flavor × every topping
qsv joinp --cross flavors.csv toppings.csv > all_combinations.csvCaution
--cross returns the Cartesian product — N × M rows. On non-trivial inputs this explodes fast (a 100k × 100k cross is 10 billion rows); keep one side small or constrain the result with a downstream LIMIT / filter.
For inner joins the key columns are coalesced automatically; for other join types add --coalesce to merge the duplicated right-side key column into the left.
--non-equi <expr> joins on a Polars SQL boolean expression instead of equal keys. Reference columns with _left / _right suffixes to disambiguate the two sides. Ideal for band/range lookups.
# employees.csv: name,salary bands.csv: min_salary,max_salary,grade
qsv joinp --non-equi \
'salary_left >= min_salary_right AND salary_left <= max_salary_right' \
employees.csv bands.csv > graded.csv--asof matches each left row to the nearest right row rather than an equal one — the canonical pattern for aligning sparse observations (weather, quotes, calibrations) to a dense event log. Both inputs are sorted on the join keys automatically (pass --no-sort only if you know they're already sorted); --try-parsedates is auto-enabled so date/datetime keys normalize.
--strategy picks the match direction:
-
backward(default) — last right row whose key is<the left key -
forward— first right row whose key is>the left key -
nearest— closest right row in either direction
Add -X / --allow-exact-matches to turn the strict < / > into <= / >=.
# match each 311 event to the most recent prior weather observation
qsv joinp --asof --strategy backward \
created_date NYC_311_SR_2010-2020-sample-1M.csv \
obs_date noaa_ghcn_nyc.csv \
--coalesce > nyc311_with_weather.csv--left_by / --right_by constrain the nearest-key search to subgroups (an "asof_by" join) — e.g. match each trade to the latest quote for the same ticker:
qsv joinp --asof --left_by ticker --right_by ticker \
trade_time trades.csv quote_time quotes.csv > trades_with_quote.csvWith --strategy nearest, --tolerance bounds how far a match may be — a row count, or a duration string for temporal keys (1h, 30m, 3d12h, 1w, 1mo, 1q, 1y; suffix _saturating to clamp month-overflow dates instead of erroring):
# nearest reading per station, but only within 1 hour
qsv joinp --asof --strategy nearest --tolerance 1h \
--left_by station --right_by station \
ts readings.csv ts calibrations.csv > calibrated.csv--filter-left / --filter-right apply a Polars SQL WHERE expression to each side before the join — cheaper than filtering the joined result:
qsv joinp --filter-left "Borough = 'BROOKLYN'" \
'Complaint Type' nyc311.csv \
'Complaint Type' complaint_taxonomy.csv > brooklyn_classified.csv--sql-filter runs a Polars SQL query against the join result after the join — select/rename columns and filter rows in one pass. Select from the join_result table:
qsv joinp Country wcp.csv country country_continent.csv \
--sql-filter "SELECT AccentCity, Population, continent
FROM join_result WHERE Population > 1000000 ORDER BY Population DESC" \
> big_cities_by_continent.csv--validate checks join-key uniqueness before joining and errors on violation — catch unintended fan-out (a "one-to-one" join that's secretly one-to-many):
| Value | Asserts |
|---|---|
none (default) |
no check |
onetomany |
keys unique on the left |
manytoone |
keys unique on the right |
onetoone |
keys unique on both |
qsv joinp --validate onetoone id master.csv id supplement.csv > merged.csvBy default row order is unspecified (faster). --maintain-order (left, right, left_right, right_left) preserves a deterministic order for inner/left/right/full joins when you need it:
qsv joinp --left --maintain-order left Country wcp.csv country country_continent.csv > ordered.csvThese normalize the join keys without modifying the output columns — qsv builds temporary key columns, joins on them, then drops them:
-
-i/--ignore-case— case-insensitive matching -
-z/--ignore-leading-zeros— strip leading zeros (e.g.007↔7) on numeric and string keys -
-N/--norm-unicode <form>— Unicode-normalize keys (nfc,nfd,nfkc,nfkd)
# composite, case-insensitive key join
qsv joinp --ignore-case 'firstname,lastname' people.csv 'first_name,last_name' contacts.csv
# match zero-padded IDs against unpadded ones
qsv joinp --ignore-leading-zeros zip left.csv zip right.csvFor repeated joins over the same large files, cache a Polars schema to skip type inference. --cache-schema takes an integer mode:
| Mode | Behavior |
|---|---|
0 (default) |
infer from --infer-len rows; don't cache |
1 |
use a cached .pschema.json if newer than the input (deriving one from the stats cache or inference if missing) |
-1 |
treat every column as String; don't cache |
-2 |
treat every column as String; cache it |
The -1 / -2 all-string modes are the reliable fix when Polars strips leading zeros from a key it inferred as numeric. An existing, fresh .pschema.json is picked up automatically even at mode 0.
qsv joinp --cache-schema 1 Country wcp.csv country country_continent.csv > out.csvOutput is CSV: tune it with --datetime-format, --date-format, --time-format (chrono specifiers), --float-precision, --null-value, and --decimal-comma (requires a non-comma --delimiter). --nulls makes empty fields participate as join values instead of being ignored. Snappy .sz inputs are auto-decompressed and an --output result.csv.sz is auto-compressed. For OOM situations, --streaming or --low-memory.
See also: /docs/help/joinp.md, Polars asof-join docs, Polars non-equi-join docs, join — in-memory alternative, SQL & Polars, Recipe: Multi-Table Joins.
Remove rows from one CSV that appear in another (matching on specified columns). The -v flag inverts to give you the intersection instead.
exclude is functionally similar to qsv join --left-anti, but cheaper for the common case where you just want "rows in A not in B."
Example: list new records since last week's export
qsv exclude id new_export.csv id last_week.csv > delta.csvExample: intersection — records present in both (e.g., previously-processed-and-still-current)
qsv exclude -v id current.csv id processed.csv > still_valid.csvExample: composite-key exclude (case-insensitive)
qsv exclude --ignore-case 'firstname,lastname' people.csv 'firstname,lastname' bad_actors.csv > clean.csvExample: streaming pipeline with stdin
cat records.csv | qsv exclude id - id previously-processed.csv | qsv sort > new_sorted.csvSee also: /docs/help/exclude.md, join --left-anti, Recipe: Diff & Audit.
Split a CSV into one file per distinct value of a column. Filenames are derived from the value via a {} placeholder.
Example: partition NYC 311 by Borough
qsv partition Borough nyc311_by_borough --filename 'nyc311-{}.csv' \
NYC_311_SR_2010-2020-sample-1M.csv
ls nyc311_by_borough/
# nyc311-Bronx.csv nyc311-Brooklyn.csv nyc311-Manhattan.csv
# nyc311-Queens.csv nyc311-Staten_Island.csv nyc311-Unspecified.csvExample: partition wcp by Country into the current directory
qsv partition Country . --filename 'wcp-{}.csv' wcp.csv
ls wcp-*.csv | head
# wcp-ad.csv wcp-ae.csv wcp-af.csv wcp-ag.csv wcp-ai.csv ...Partitioning a 2.7M-row file by Country produces 209 files. Combine with downstream parallelism (e.g., xargs -P or foreach) for embarrassingly-parallel pipelines.
See also: /docs/help/partition.md, split — for size-based splitting, cat rows — to recombine after processing.
Split by rows (--size N), chunks (--chunks N), or file size (--kb-size N). Multithreaded when the input has an index and you're splitting by rows or chunks.
Example: chunk a 27M-row NYC 311 export into 1M-row files for parallel downstream processing
qsv index nyc311-full.csv
qsv split nyc311_chunks --size 1000000 --jobs 8 nyc311-full.csv
ls nyc311_chunks/
# 0.csv 1000000.csv 2000000.csv ...Example: exactly 10 chunks (good for parallel jobs that want one chunk per worker)
qsv split outdir --chunks 10 --jobs 4 large.csvExample: split into ~1 MB chunks for upload to a service with a size limit
qsv split upload_chunks --kb-size 1000 export.csvExample: split + gzip on the fly (one external command per chunk)
qsv split outdir -s 100000 --filter "gzip $FILE" big.csv
ls outdir/*.gzExample: padded filenames for predictable lexicographic order
qsv split outdir -s 100 --filename 'chunk_{}.csv' --pad 5 input.csv
# chunk_00000.csv chunk_00100.csv chunk_00200.csv ...See also: /docs/help/split.md, partition, cat rows, foreach, Performance Tuning.
- Command Reference (index)
-
SQL & Polars —
sqlp,joinp,pivotp,scoresql -
Transform & Reshape → cat — the inverse of
split/partition - Aggregation & Statistics — group-and-aggregate alternatives
- Cookbook → Multi-Table Joins
-
Cookbook → Diff & Audit —
excludein action - Recipe: Larger-than-RAM CSV
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation