Joins and Set Ops

Joins & Set Ops

Tier: Intermediate Commands covered: join, joinp, exclude, partition, split

Note

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

Two commands do joins (join and joinp) — pick based on size and join type. Three commands chop a CSV into pieces (exclude, partition, split).

Quick decision table

If you want to…	Use	Notes
Inner/left/right/outer/cross/anti/semi join (small to medium)	`join`	In-memory hash join; auto-indexed; simple
Same, but for files > RAM or asof / non-equi joins	`joinp`	Polars-powered; multithreaded; pre-join filtering
Remove rows of B from A based on join key (or keep intersection)	`exclude`	Cheaper than `join --left-anti` for the same use case
Split into one file per distinct value of a column	`partition`	Creates `prefix-{value}.csv` per group
Split into chunks of N rows / N total chunks / by file size	`split`	Multithreaded with an index

`join`

In-memory hash join across two CSV files. Auto-creates an in-memory index for the lookup side. Use this when both files fit in RAM — it's simpler than joinp and produces output in deterministic order.

Example: enrich wcp.csv with continent from a lookup table

# country_continent.csv has columns: country,continent
qsv join Country wcp.csv country country_continent.csv > wcp_with_continent.csv

The Country column comes from the left side (wcp.csv); country is the right side's join key. Output keeps all columns from both files.

Example: left join — keep all wcp rows even if no continent match

qsv join --left Country wcp.csv country country_continent.csv > wcp_left.csv

Example: anti join — find wcp countries NOT in the lookup

qsv join --left-anti Country wcp.csv country country_continent.csv > unmatched.csv

Example: semi join — find rows in wcp that DO have a continent match (no right-side columns added)

qsv join --left-semi Country wcp.csv country country_continent.csv > matched.csv

Example: case-insensitive composite-key join

qsv join --ignore-case 'firstname,lastname' people.csv 'first_name,last_name' contacts.csv

See also: /docs/help/join.md, joinp — for files > RAM and asof joins, exclude — for the --left-anti case without bringing in right-side columns.

`joinp`

Polars-powered join. Multithreaded, larger-than-RAM, supports asof joins (perfect for time-series), non-equi joins (range-based), join-key validation, pre-join filtering, post-join SQL, and column coalescing. The flagship joinp capability is asof — there's no in-memory equivalent.

joinp returns (rows, cols) shape information to stderr. Output is always CSV (write Parquet/JSON/etc. by piping into sqlp). Both inputs need headers; stdin is not supported (use join or a temp file). The argument order is <columns1> <input1> <columns2> <input2> — left keys, left file, right keys, right file. Use comma-separated names for composite keys (the two sides must list the same number of columns).

For the full flag list see /docs/help/joinp.md; the subsections below are the workflow view.

Join types

The default is an inner join. The full matrix of outer / anti / semi variants is available via flags:

Flag	Result
(none)	inner — intersection on the keys
`--left` / `--right`	left / right outer — keep all rows of that side, pad the other
`--full`	full outer — keep all rows of both sides
`--left-anti` / `--right-anti`	rows of that side with no match on the other (output schema = that side)
`--left-semi` / `--right-semi`	rows of that side that do have a match (output schema = that side)
`--cross`	Cartesian product — `N × M` rows; keys ignored (USE WITH CAUTION)

# inner join (default): enrich cities with continent
qsv joinp Country wcp.csv country country_continent.csv > wcp_with_continent.csv

# left outer: keep every city row even when the lookup misses
qsv joinp --left Country wcp.csv country country_continent.csv > wcp_left.csv

# left-anti: cities whose Country is NOT in the lookup table
qsv joinp --left-anti Country wcp.csv country country_continent.csv > unmatched.csv

# full outer with coalesced keys (one 'id' column, not 'id' + 'id_right')
qsv joinp --full --coalesce id a.csv id b.csv > merged.csv

# cross join: every flavor × every topping
qsv joinp --cross flavors.csv toppings.csv > all_combinations.csv

Caution

--cross returns the Cartesian product — N × M rows. On non-trivial inputs this explodes fast (a 100k × 100k cross is 10 billion rows); keep one side small or constrain the result with a downstream LIMIT / filter.

For inner joins the key columns are coalesced automatically; for other join types add --coalesce to merge the duplicated right-side key column into the left.

Non-equi joins (range matches)

--non-equi <expr> joins on a Polars SQL boolean expression instead of equal keys. Reference columns with _left / _right suffixes to disambiguate the two sides. Ideal for band/range lookups.

# employees.csv: name,salary   bands.csv: min_salary,max_salary,grade
qsv joinp --non-equi \
  'salary_left >= min_salary_right AND salary_left <= max_salary_right' \
  employees.csv bands.csv > graded.csv

Asof joins (nearest-key, time-series)

--asof matches each left row to the nearest right row rather than an equal one — the canonical pattern for aligning sparse observations (weather, quotes, calibrations) to a dense event log. Both inputs are sorted on the join keys automatically (pass --no-sort only if you know they're already sorted); --try-parsedates is auto-enabled so date/datetime keys normalize.

--strategy picks the match direction:

backward (default) — last right row whose key is < the left key
forward — first right row whose key is > the left key
nearest — closest right row in either direction

Add -X / --allow-exact-matches to turn the strict < / > into <= / >=.

# match each 311 event to the most recent prior weather observation
qsv joinp --asof --strategy backward \
  created_date NYC_311_SR_2010-2020-sample-1M.csv \
  obs_date noaa_ghcn_nyc.csv \
  --coalesce > nyc311_with_weather.csv

--left_by / --right_by constrain the nearest-key search to subgroups (an "asof_by" join) — e.g. match each trade to the latest quote for the same ticker:

qsv joinp --asof --left_by ticker --right_by ticker \
  trade_time trades.csv quote_time quotes.csv > trades_with_quote.csv

With --strategy nearest, --tolerance bounds how far a match may be — a row count, or a duration string for temporal keys (1h, 30m, 3d12h, 1w, 1mo, 1q, 1y; suffix _saturating to clamp month-overflow dates instead of erroring):

# nearest reading per station, but only within 1 hour
qsv joinp --asof --strategy nearest --tolerance 1h \
  --left_by station --right_by station \
  ts readings.csv ts calibrations.csv > calibrated.csv

Pre-join filtering & post-join SQL

--filter-left / --filter-right apply a Polars SQL WHERE expression to each side before the join — cheaper than filtering the joined result:

qsv joinp --filter-left "Borough = 'BROOKLYN'" \
  'Complaint Type' nyc311.csv \
  'Complaint Type' complaint_taxonomy.csv > brooklyn_classified.csv

--sql-filter runs a Polars SQL query against the join result after the join — select/rename columns and filter rows in one pass. Select from the join_result table:

qsv joinp Country wcp.csv country country_continent.csv \
  --sql-filter "SELECT AccentCity, Population, continent
                FROM join_result WHERE Population > 1000000 ORDER BY Population DESC" \
  > big_cities_by_continent.csv

Key validation, row order & coalescing

--validate checks join-key uniqueness before joining and errors on violation — catch unintended fan-out (a "one-to-one" join that's secretly one-to-many):

Value	Asserts
`none` (default)	no check
`onetomany`	keys unique on the left
`manytoone`	keys unique on the right
`onetoone`	keys unique on both

qsv joinp --validate onetoone id master.csv id supplement.csv > merged.csv

By default row order is unspecified (faster). --maintain-order (left, right, left_right, right_left) preserves a deterministic order for inner/left/right/full joins when you need it:

qsv joinp --left --maintain-order left Country wcp.csv country country_continent.csv > ordered.csv

Join-key transformations

These normalize the join keys without modifying the output columns — qsv builds temporary key columns, joins on them, then drops them:

-i / --ignore-case — case-insensitive matching
-z / --ignore-leading-zeros — strip leading zeros (e.g. 007 ↔ 7) on numeric and string keys
-N / --norm-unicode <form> — Unicode-normalize keys (nfc, nfd, nfkc, nfkd)

# composite, case-insensitive key join
qsv joinp --ignore-case 'firstname,lastname' people.csv 'first_name,last_name' contacts.csv

# match zero-padded IDs against unpadded ones
qsv joinp --ignore-leading-zeros zip left.csv zip right.csv

Schema, parsing & output options

For repeated joins over the same large files, cache a Polars schema to skip type inference. --cache-schema takes an integer mode:

Mode	Behavior
`0` (default)	infer from `--infer-len` rows; don't cache
`1`	use a cached `.pschema.json` if newer than the input (deriving one from the stats cache or inference if missing)
`-1`	treat every column as `String`; don't cache
`-2`	treat every column as `String`; cache it

The -1 / -2 all-string modes are the reliable fix when Polars strips leading zeros from a key it inferred as numeric. An existing, fresh .pschema.json is picked up automatically even at mode 0.

qsv joinp --cache-schema 1 Country wcp.csv country country_continent.csv > out.csv

Output is CSV: tune it with --datetime-format, --date-format, --time-format (chrono specifiers), --float-precision, --null-value, and --decimal-comma (requires a non-comma --delimiter). --nulls makes empty fields participate as join values instead of being ignored. Snappy .sz inputs are auto-decompressed and an --output result.csv.sz is auto-compressed. For OOM situations, --streaming or --low-memory.

See also: /docs/help/joinp.md, Polars asof-join docs, Polars non-equi-join docs, join — in-memory alternative, SQL & Polars, Recipe: Multi-Table Joins.

`exclude`

Remove rows from one CSV that appear in another (matching on specified columns). The -v flag inverts to give you the intersection instead.

exclude is functionally similar to qsv join --left-anti, but cheaper for the common case where you just want "rows in A not in B."

Example: list new records since last week's export

qsv exclude id new_export.csv id last_week.csv > delta.csv

Example: intersection — records present in both (e.g., previously-processed-and-still-current)

qsv exclude -v id current.csv id processed.csv > still_valid.csv

Example: composite-key exclude (case-insensitive)

qsv exclude --ignore-case 'firstname,lastname' people.csv 'firstname,lastname' bad_actors.csv > clean.csv

Example: streaming pipeline with stdin

cat records.csv | qsv exclude id - id previously-processed.csv | qsv sort > new_sorted.csv

See also: /docs/help/exclude.md, join --left-anti, Recipe: Diff & Audit.

`partition`

Split a CSV into one file per distinct value of a column. Filenames are derived from the value via a {} placeholder.

Example: partition NYC 311 by Borough

qsv partition Borough nyc311_by_borough --filename 'nyc311-{}.csv' \
  NYC_311_SR_2010-2020-sample-1M.csv
ls nyc311_by_borough/
# nyc311-Bronx.csv  nyc311-Brooklyn.csv  nyc311-Manhattan.csv
# nyc311-Queens.csv  nyc311-Staten_Island.csv  nyc311-Unspecified.csv

Example: partition wcp by Country into the current directory

qsv partition Country . --filename 'wcp-{}.csv' wcp.csv
ls wcp-*.csv | head
# wcp-ad.csv  wcp-ae.csv  wcp-af.csv  wcp-ag.csv  wcp-ai.csv  ...

Partitioning a 2.7M-row file by Country produces 209 files. Combine with downstream parallelism (e.g., xargs -P or foreach) for embarrassingly-parallel pipelines.

See also: /docs/help/partition.md, split — for size-based splitting, cat rows — to recombine after processing.

`split`

Split by rows (--size N), chunks (--chunks N), or file size (--kb-size N). Multithreaded when the input has an index and you're splitting by rows or chunks.

Example: chunk a 27M-row NYC 311 export into 1M-row files for parallel downstream processing

qsv index nyc311-full.csv
qsv split nyc311_chunks --size 1000000 --jobs 8 nyc311-full.csv
ls nyc311_chunks/
# 0.csv  1000000.csv  2000000.csv  ...

Example: exactly 10 chunks (good for parallel jobs that want one chunk per worker)

qsv split outdir --chunks 10 --jobs 4 large.csv

Example: split into ~1 MB chunks for upload to a service with a size limit

qsv split upload_chunks --kb-size 1000 export.csv

Example: split + gzip on the fly (one external command per chunk)

qsv split outdir -s 100000 --filter "gzip $FILE" big.csv
ls outdir/*.gz

Example: padded filenames for predictable lexicographic order

qsv split outdir -s 100 --filename 'chunk_{}.csv' --pad 5 input.csv
# chunk_00000.csv  chunk_00100.csv  chunk_00200.csv ...

See also: /docs/help/split.md, partition, cat rows, foreach, Performance Tuning.

Joins and Set Ops

Joins & Set Ops

Quick decision table

join

joinp

Join types

Non-equi joins (range matches)

Asof joins (nearest-key, time-series)

Pre-join filtering & post-join SQL

Key validation, row order & coalescing

Join-key transformations

Schema, parsing & output options

exclude

partition

split

See also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Get Started

Command Reference

Cookbook

Tuning & Internals

Ecosystem

Reference

Legacy

Clone this wiki locally

`join`

`joinp`

`exclude`

`partition`

`split`