Skip to content
Joel Natividad edited this page May 30, 2026 · 4 revisions

Joins & Set Ops

Tier: Intermediate Commands covered: join, joinp, exclude, partition, split

Note

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

Two commands do joins (join and joinp) — pick based on size and join type. Three commands chop a CSV into pieces (exclude, partition, split).

Quick decision table

If you want to… Use Notes
Inner/left/right/outer/cross/anti/semi join (small to medium) join In-memory hash join; auto-indexed; simple
Same, but for files > RAM or asof / non-equi joins joinp Polars-powered; multithreaded; pre-join filtering
Remove rows of B from A based on join key (or keep intersection) exclude Cheaper than join --left-anti for the same use case
Split into one file per distinct value of a column partition Creates prefix-{value}.csv per group
Split into chunks of N rows / N total chunks / by file size split Multithreaded with an index

join

In-memory hash join across two CSV files. Auto-creates an in-memory index for the lookup side. Use this when both files fit in RAM — it's simpler than joinp and produces output in deterministic order.

Example: enrich wcp.csv with continent from a lookup table

# country_continent.csv has columns: country,continent
qsv join Country wcp.csv country country_continent.csv > wcp_with_continent.csv

The Country column comes from the left side (wcp.csv); country is the right side's join key. Output keeps all columns from both files.

Example: left join — keep all wcp rows even if no continent match

qsv join --left Country wcp.csv country country_continent.csv > wcp_left.csv

Example: anti join — find wcp countries NOT in the lookup

qsv join --left-anti Country wcp.csv country country_continent.csv > unmatched.csv

Example: semi join — find rows in wcp that DO have a continent match (no right-side columns added)

qsv join --left-semi Country wcp.csv country country_continent.csv > matched.csv

Example: case-insensitive composite-key join

qsv join --ignore-case 'firstname,lastname' people.csv 'first_name,last_name' contacts.csv

See also: /docs/help/join.md, joinp — for files > RAM and asof joins, exclude — for the --left-anti case without bringing in right-side columns.

joinp

Polars-powered join. Multithreaded, larger-than-RAM, supports asof joins (perfect for time-series), non-equi joins (range-based), join-key validation, pre-join filtering, post-join SQL, and column coalescing. The flagship joinp capability is asof — there's no in-memory equivalent.

joinp returns (rows, cols) shape information to stderr. Output is always CSV (write Parquet/JSON/etc. by piping into sqlp). Both inputs need headers; stdin is not supported (use join or a temp file). The argument order is <columns1> <input1> <columns2> <input2> — left keys, left file, right keys, right file. Use comma-separated names for composite keys (the two sides must list the same number of columns).

For the full flag list see /docs/help/joinp.md; the subsections below are the workflow view.

Join types

The default is an inner join. The full matrix of outer / anti / semi variants is available via flags:

Flag Result
(none) inner — intersection on the keys
--left / --right left / right outer — keep all rows of that side, pad the other
--full full outer — keep all rows of both sides
--left-anti / --right-anti rows of that side with no match on the other (output schema = that side)
--left-semi / --right-semi rows of that side that do have a match (output schema = that side)
--cross Cartesian product — N × M rows; keys ignored (USE WITH CAUTION)
# inner join (default): enrich cities with continent
qsv joinp Country wcp.csv country country_continent.csv > wcp_with_continent.csv

# left outer: keep every city row even when the lookup misses
qsv joinp --left Country wcp.csv country country_continent.csv > wcp_left.csv

# left-anti: cities whose Country is NOT in the lookup table
qsv joinp --left-anti Country wcp.csv country country_continent.csv > unmatched.csv

# full outer with coalesced keys (one 'id' column, not 'id' + 'id_right')
qsv joinp --full --coalesce id a.csv id b.csv > merged.csv

# cross join: every flavor × every topping
qsv joinp --cross flavors.csv toppings.csv > all_combinations.csv

Caution

--cross returns the Cartesian product — N × M rows. On non-trivial inputs this explodes fast (a 100k × 100k cross is 10 billion rows); keep one side small or constrain the result with a downstream LIMIT / filter.

For inner joins the key columns are coalesced automatically; for other join types add --coalesce to merge the duplicated right-side key column into the left.

Non-equi joins (range matches)

--non-equi <expr> joins on a Polars SQL boolean expression instead of equal keys. Reference columns with _left / _right suffixes to disambiguate the two sides. Ideal for band/range lookups.

# employees.csv: name,salary   bands.csv: min_salary,max_salary,grade
qsv joinp --non-equi \
  'salary_left >= min_salary_right AND salary_left <= max_salary_right' \
  employees.csv bands.csv > graded.csv

Asof joins (nearest-key, time-series)

--asof matches each left row to the nearest right row rather than an equal one — the canonical pattern for aligning sparse observations (weather, quotes, calibrations) to a dense event log. Both inputs are sorted on the join keys automatically (pass --no-sort only if you know they're already sorted); --try-parsedates is auto-enabled so date/datetime keys normalize.

--strategy picks the match direction:

  • backward (default) — last right row whose key is < the left key
  • forward — first right row whose key is > the left key
  • nearest — closest right row in either direction

Add -X / --allow-exact-matches to turn the strict < / > into <= / >=.

# match each 311 event to the most recent prior weather observation
qsv joinp --asof --strategy backward \
  created_date NYC_311_SR_2010-2020-sample-1M.csv \
  obs_date noaa_ghcn_nyc.csv \
  --coalesce > nyc311_with_weather.csv

--left_by / --right_by constrain the nearest-key search to subgroups (an "asof_by" join) — e.g. match each trade to the latest quote for the same ticker:

qsv joinp --asof --left_by ticker --right_by ticker \
  trade_time trades.csv quote_time quotes.csv > trades_with_quote.csv

With --strategy nearest, --tolerance bounds how far a match may be — a row count, or a duration string for temporal keys (1h, 30m, 3d12h, 1w, 1mo, 1q, 1y; suffix _saturating to clamp month-overflow dates instead of erroring):

# nearest reading per station, but only within 1 hour
qsv joinp --asof --strategy nearest --tolerance 1h \
  --left_by station --right_by station \
  ts readings.csv ts calibrations.csv > calibrated.csv

Pre-join filtering & post-join SQL

--filter-left / --filter-right apply a Polars SQL WHERE expression to each side before the join — cheaper than filtering the joined result:

qsv joinp --filter-left "Borough = 'BROOKLYN'" \
  'Complaint Type' nyc311.csv \
  'Complaint Type' complaint_taxonomy.csv > brooklyn_classified.csv

--sql-filter runs a Polars SQL query against the join result after the join — select/rename columns and filter rows in one pass. Select from the join_result table:

qsv joinp Country wcp.csv country country_continent.csv \
  --sql-filter "SELECT AccentCity, Population, continent
                FROM join_result WHERE Population > 1000000 ORDER BY Population DESC" \
  > big_cities_by_continent.csv

Key validation, row order & coalescing

--validate checks join-key uniqueness before joining and errors on violation — catch unintended fan-out (a "one-to-one" join that's secretly one-to-many):

Value Asserts
none (default) no check
onetomany keys unique on the left
manytoone keys unique on the right
onetoone keys unique on both
qsv joinp --validate onetoone id master.csv id supplement.csv > merged.csv

By default row order is unspecified (faster). --maintain-order (left, right, left_right, right_left) preserves a deterministic order for inner/left/right/full joins when you need it:

qsv joinp --left --maintain-order left Country wcp.csv country country_continent.csv > ordered.csv

Join-key transformations

These normalize the join keys without modifying the output columns — qsv builds temporary key columns, joins on them, then drops them:

  • -i / --ignore-case — case-insensitive matching
  • -z / --ignore-leading-zeros — strip leading zeros (e.g. 0077) on numeric and string keys
  • -N / --norm-unicode <form> — Unicode-normalize keys (nfc, nfd, nfkc, nfkd)
# composite, case-insensitive key join
qsv joinp --ignore-case 'firstname,lastname' people.csv 'first_name,last_name' contacts.csv

# match zero-padded IDs against unpadded ones
qsv joinp --ignore-leading-zeros zip left.csv zip right.csv

Schema, parsing & output options

For repeated joins over the same large files, cache a Polars schema to skip type inference. --cache-schema takes an integer mode:

Mode Behavior
0 (default) infer from --infer-len rows; don't cache
1 use a cached .pschema.json if newer than the input (deriving one from the stats cache or inference if missing)
-1 treat every column as String; don't cache
-2 treat every column as String; cache it

The -1 / -2 all-string modes are the reliable fix when Polars strips leading zeros from a key it inferred as numeric. An existing, fresh .pschema.json is picked up automatically even at mode 0.

qsv joinp --cache-schema 1 Country wcp.csv country country_continent.csv > out.csv

Output is CSV: tune it with --datetime-format, --date-format, --time-format (chrono specifiers), --float-precision, --null-value, and --decimal-comma (requires a non-comma --delimiter). --nulls makes empty fields participate as join values instead of being ignored. Snappy .sz inputs are auto-decompressed and an --output result.csv.sz is auto-compressed. For OOM situations, --streaming or --low-memory.

See also: /docs/help/joinp.md, Polars asof-join docs, Polars non-equi-join docs, join — in-memory alternative, SQL & Polars, Recipe: Multi-Table Joins.

exclude

Remove rows from one CSV that appear in another (matching on specified columns). The -v flag inverts to give you the intersection instead.

exclude is functionally similar to qsv join --left-anti, but cheaper for the common case where you just want "rows in A not in B."

Example: list new records since last week's export

qsv exclude id new_export.csv id last_week.csv > delta.csv

Example: intersection — records present in both (e.g., previously-processed-and-still-current)

qsv exclude -v id current.csv id processed.csv > still_valid.csv

Example: composite-key exclude (case-insensitive)

qsv exclude --ignore-case 'firstname,lastname' people.csv 'firstname,lastname' bad_actors.csv > clean.csv

Example: streaming pipeline with stdin

cat records.csv | qsv exclude id - id previously-processed.csv | qsv sort > new_sorted.csv

See also: /docs/help/exclude.md, join --left-anti, Recipe: Diff & Audit.

partition

Split a CSV into one file per distinct value of a column. Filenames are derived from the value via a {} placeholder.

Example: partition NYC 311 by Borough

qsv partition Borough nyc311_by_borough --filename 'nyc311-{}.csv' \
  NYC_311_SR_2010-2020-sample-1M.csv
ls nyc311_by_borough/
# nyc311-Bronx.csv  nyc311-Brooklyn.csv  nyc311-Manhattan.csv
# nyc311-Queens.csv  nyc311-Staten_Island.csv  nyc311-Unspecified.csv

Example: partition wcp by Country into the current directory

qsv partition Country . --filename 'wcp-{}.csv' wcp.csv
ls wcp-*.csv | head
# wcp-ad.csv  wcp-ae.csv  wcp-af.csv  wcp-ag.csv  wcp-ai.csv  ...

Partitioning a 2.7M-row file by Country produces 209 files. Combine with downstream parallelism (e.g., xargs -P or foreach) for embarrassingly-parallel pipelines.

See also: /docs/help/partition.md, split — for size-based splitting, cat rows — to recombine after processing.

split

Split by rows (--size N), chunks (--chunks N), or file size (--kb-size N). Multithreaded when the input has an index and you're splitting by rows or chunks.

Example: chunk a 27M-row NYC 311 export into 1M-row files for parallel downstream processing

qsv index nyc311-full.csv
qsv split nyc311_chunks --size 1000000 --jobs 8 nyc311-full.csv
ls nyc311_chunks/
# 0.csv  1000000.csv  2000000.csv  ...

Example: exactly 10 chunks (good for parallel jobs that want one chunk per worker)

qsv split outdir --chunks 10 --jobs 4 large.csv

Example: split into ~1 MB chunks for upload to a service with a size limit

qsv split upload_chunks --kb-size 1000 export.csv

Example: split + gzip on the fly (one external command per chunk)

qsv split outdir -s 100000 --filter "gzip $FILE" big.csv
ls outdir/*.gz

Example: padded filenames for predictable lexicographic order

qsv split outdir -s 100 --filename 'chunk_{}.csv' --pad 5 input.csv
# chunk_00000.csv  chunk_00100.csv  chunk_00200.csv ...

See also: /docs/help/split.md, partition, cat rows, foreach, Performance Tuning.

See also

Clone this wiki locally