Joins and Set Ops

Tier: Intermediate
Commands covered: join, joinp, exclude, partition, split

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

Two commands do joins (join and joinp) — pick based on size and join type. Three commands chop a CSV into pieces (exclude, partition, split).

Quick decision table

| If you want to… | Use | Notes |
| --- | --- | --- |
| Inner/left/right/outer/cross/anti/semi join (small to medium) | `join` | In-memory hash join; auto-indexed; simple |
| Same, but for files > RAM or asof / non-equi joins | `joinp` | Polars-powered; multithreaded; pre-join filtering |
| Remove rows of B from A based on join key (or keep intersection) | `exclude` | Cheaper than `join --left-anti` for the same use case |
| Split into one file per distinct value of a column | `partition` | Creates `prefix-{value}.csv` per group |
| Split into chunks of N rows / N total chunks / by file size | `split` | Multithreaded with an index |

`join`

In-memory hash join across two CSV files. Auto-creates an in-memory index for the lookup side. Use this when both files fit in RAM — it's simpler than joinp and produces output in deterministic order.

Example: enrich wcp.csv with continent from a lookup table

# country_continent.csv has columns: country,continent
qsv join Country wcp.csv country country_continent.csv > wcp_with_continent.csv

The Country column comes from the left side (wcp.csv); country is the right side's join key. Output keeps all columns from both files.

Example: left join — keep all wcp rows even if no continent match

qsv join --left Country wcp.csv country country_continent.csv > wcp_left.csv

Example: anti join — find wcp countries NOT in the lookup

qsv join --left-anti Country wcp.csv country country_continent.csv > unmatched.csv

Example: semi join — find rows in wcp that DO have a continent match (no right-side columns added)

qsv join --left-semi Country wcp.csv country country_continent.csv > matched.csv
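The four join types above differ only in what happens to a left row once the lookup side has been probed. The following is a minimal Python sketch of that logic, not qsv's actual implementation; the helper names `build_index` and `hash_join` and the sample rows are hypothetical.

```python
from collections import defaultdict

def build_index(rows, key):
    """Map each key value to the right-side rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index

def hash_join(left, left_key, right, right_key, how="inner"):
    """Probe a hash index of `right` once per row of `left`."""
    index = build_index(right, right_key)
    # Right-side column names, used to pad unmatched rows in a left join.
    right_cols = list(right[0].keys()) if right else []
    out = []
    for row in left:
        matches = index.get(row[left_key], [])
        if how == "inner":
            out.extend({**row, **m} for m in matches)
        elif how == "left":
            if matches:
                out.extend({**row, **m} for m in matches)
            else:
                out.append({**row, **{c: "" for c in right_cols}})
        elif how == "left-semi":
            if matches:
                out.append(row)  # keep the left row, add no right columns
        elif how == "left-anti":
            if not matches:
                out.append(row)  # keep only rows with no match
        else:
            raise ValueError(f"unknown join type: {how}")
    return out

# Hypothetical sample data shaped like the files in the examples above.
wcp = [
    {"Country": "ad", "City": "andorra la vella"},
    {"Country": "zz", "City": "nowhere"},
]
continents = [{"country": "ad", "continent": "Europe"}]

for how in ("inner", "left", "left-semi", "left-anti"):
    print(how, hash_join(wcp, "Country", continents, "country", how))
```

Because the output follows the order of the left input, the result is deterministic, which matches the ordering claim made for join above.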

Example: case-insensitive composite-key join

qsv join --ignore-case 'firstname,lastname' people.csv 'first_name,last_name' contacts.csv

See also: /docs/help/join.md; joinp, for files > RAM and asof joins; exclude, for the --left-anti case without bringing in right-side columns.

`joinp`

Polars-powered join. It is multithreaded, handles larger-than-RAM inputs, and supports asof joins (well suited to time series), non-equi joins (range-based), and post-join column coalescing. The flagship joinp capability is the asof join, which the in-memory join command does not offer.

joinp prints the shape of the result, as (rows, cols), to stderr.

Example: enrich NYC 311 events with the nearest-prior weather observation (asof join)

# Both files must be sorted by the asof key (timestamp).
qsv joinp --asof --strategy backward --coalesce \
  timestamp NYC_311_SR_2010-2020-sample-1M.csv \
  timestamp noaa_ghcn_nyc.csv \
  > nyc311_with_weather.csv

(Exact flags vary — consult qsv joinp --help for the asof option names in your qsv version. Asof matches each left-side timestamp to the nearest prior right-side timestamp; perfect for joining sparse weather observations to dense event logs.)
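To make the "nearest prior" rule concrete, here is a small Python sketch of a backward asof match. It illustrates the semantics only and is not how Polars implements the join; `asof_backward` and the sample rows are hypothetical, and timestamps are assumed to be ISO-8601 strings so that string order equals time order.

```python
from bisect import bisect_right

def asof_backward(left, right, key):
    """Pair each left row with the latest right row whose key is <= the left key."""
    right_sorted = sorted(right, key=lambda r: r[key])
    right_keys = [r[key] for r in right_sorted]
    out = []
    for row in left:
        # Number of right keys that are <= this left key.
        pos = bisect_right(right_keys, row[key])
        match = right_sorted[pos - 1] if pos > 0 else None
        out.append((row, match))
    return out

# Hypothetical sample data: dense events, sparse weather observations.
events = [
    {"timestamp": "2020-01-01T07:00", "complaint": "Heat"},
    {"timestamp": "2020-01-01T09:30", "complaint": "Noise"},
    {"timestamp": "2020-01-01T12:00", "complaint": "Parking"},
]
weather = [
    {"timestamp": "2020-01-01T08:00", "temp_c": 3},
    {"timestamp": "2020-01-01T12:00", "temp_c": 7},
]

for event, obs in asof_backward(events, weather, "timestamp"):
    print(event["timestamp"], event["complaint"], obs)
```

The 07:00 event has no earlier observation and so gets no match, while an event that falls exactly on an observation time matches that observation.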

Example: non-equi join — match employees to a salary band

# bands.csv has: min_salary,max_salary,grade
# employees.csv has: name,salary
qsv joinp --non-equi 'salary_left >= min_salary_right AND salary_left <= max_salary_right' \
  employees.csv bands.csv > graded.csv
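The range condition in that expression can be read as a nested loop over both inputs. The sketch below shows that reading in Python under the column layout described in the comments above; `range_join` and the sample values are hypothetical, and a real engine would avoid the quadratic scan.

```python
def range_join(employees, bands):
    """Keep every (employee, band) pair where salary falls inside the band."""
    out = []
    for emp in employees:
        for band in bands:
            if band["min_salary"] <= emp["salary"] <= band["max_salary"]:
                out.append({**emp, **band})
    return out

# Hypothetical sample data matching the columns named in the example.
employees = [
    {"name": "Ada", "salary": 72000},
    {"name": "Bob", "salary": 30000},
]
bands = [
    {"min_salary": 0, "max_salary": 49999, "grade": "A"},
    {"min_salary": 50000, "max_salary": 99999, "grade": "B"},
]

for row in range_join(employees, bands):
    print(row["name"], row["grade"])
```

If bands overlap, an employee appears once per matching band, which is the same row multiplication you should expect from the join.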

Example: cross join — Cartesian product (use with caution)

qsv joinp --cross flavors.csv toppings.csv > all_combinations.csv

Example: filter the left side before joining (skip the rows you don't want)

qsv joinp --filter-left "Borough = 'BROOKLYN'" \
  'Complaint Type' nyc311.csv \
  'Complaint Type' complaint_taxonomy.csv > brooklyn_classified.csv

--filter-right exists too. Both filters are evaluated in Polars before the join happens, so the join itself processes fewer rows.

Example: full outer join with column coalescing

qsv joinp --full --coalesce 'id' a.csv 'id' b.csv > merged.csv

See also: /docs/help/joinp.md, Polars asof-join docs, Polars non-equi-join docs, SQL & Polars, Recipe: Multi-Table Joins.

`exclude`

Remove rows from one CSV that appear in another (matching on specified columns). The -v flag inverts to give you the intersection instead.

exclude is functionally similar to qsv join --left-anti, but cheaper for the common case where you just want "rows in A not in B."

Example: list new records since last week's export

qsv exclude id new_export.csv id last_week.csv > delta.csv

Example: intersection — records present in both (e.g., previously-processed-and-still-current)

qsv exclude -v id current.csv id processed.csv > still_valid.csv
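Both modes reduce to a membership test against the set of keys in the second file. A minimal Python sketch of that idea follows; it is an illustration rather than qsv's implementation, and `exclude`, its `invert` parameter, and the sample rows are hypothetical names chosen to mirror the -v flag.

```python
def exclude(left, left_key, right, right_key, invert=False):
    """Drop left rows whose key appears in right; with invert=True keep only those."""
    seen = {row[right_key] for row in right}
    return [row for row in left if (row[left_key] in seen) == invert]

# Hypothetical sample data.
new_export = [{"id": "1"}, {"id": "2"}, {"id": "3"}]
last_week = [{"id": "1"}, {"id": "2"}]

print(exclude(new_export, "id", last_week, "id"))               # delta
print(exclude(new_export, "id", last_week, "id", invert=True))  # intersection
```

Only the key values from the second input are held in the set, and no right-side columns ever reach the output.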

Example: composite-key exclude (case-insensitive)

qsv exclude --ignore-case 'firstname,lastname' people.csv 'firstname,lastname' bad_actors.csv > clean.csv

Example: streaming pipeline with stdin

cat records.csv | qsv exclude id - id previously-processed.csv | qsv sort > new_sorted.csv

See also: /docs/help/exclude.md, join --left-anti, Recipe: Diff & Audit.

`partition`

Split a CSV into one file per distinct value of a column. Filenames are derived from the value via a {} placeholder.

Example: partition NYC 311 by Borough

qsv partition Borough nyc311_by_borough --filename 'nyc311-{}.csv' \
  NYC_311_SR_2010-2020-sample-1M.csv
ls nyc311_by_borough/
# nyc311-Bronx.csv  nyc311-Brooklyn.csv  nyc311-Manhattan.csv
# nyc311-Queens.csv  nyc311-Staten_Island.csv  nyc311-Unspecified.csv

Example: partition wcp by Country into the current directory

qsv partition Country . --filename 'wcp-{}.csv' wcp.csv
ls wcp-*.csv | head
# wcp-ad.csv  wcp-ae.csv  wcp-af.csv  wcp-ag.csv  wcp-ai.csv  ...

Partitioning a 2.7M-row file by Country produces 209 files. Combine with downstream parallelism (e.g., xargs -P or foreach) for embarrassingly-parallel pipelines.
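As a rough mental model, partitioning groups rows by the column value and substitutes that value into the {} placeholder of the filename template. The Python sketch below groups in memory instead of writing files and does not clean up values for use as filenames, so treat it as an assumption-laden illustration; `partition` and the sample rows are hypothetical.

```python
from collections import defaultdict

def partition(rows, column, template="{}.csv"):
    """Group rows by column value, keyed by the filename the template yields."""
    groups = defaultdict(list)
    for row in rows:
        groups[template.format(row[column])].append(row)
    return dict(groups)

# Hypothetical sample rows shaped like wcp.csv.
rows = [
    {"Country": "ad", "City": "andorra la vella"},
    {"Country": "ae", "City": "dubai"},
    {"Country": "ad", "City": "canillo"},
]

for filename, group in partition(rows, "Country", "wcp-{}.csv").items():
    print(filename, len(group))
```

Each resulting group is independent of the others, which is what makes the per-file parallel processing mentioned above possible.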

See also: /docs/help/partition.md; split, for size-based splitting; cat rows, to recombine after processing.

`split`

Split by rows (--size N), chunks (--chunks N), or file size (--kb-size N). Multithreaded when the input has an index and you're splitting by rows or chunks.

Example: chunk a 27M-row NYC 311 export into 1M-row files for parallel downstream processing

qsv index nyc311-full.csv
qsv split nyc311_chunks --size 1000000 --jobs 8 nyc311-full.csv
ls nyc311_chunks/
# 0.csv  1000000.csv  2000000.csv  ...

Example: exactly 10 chunks (good for parallel jobs that want one chunk per worker)

qsv split outdir --chunks 10 --jobs 4 large.csv

Example: split into ~1 MB chunks for upload to a service with a size limit

qsv split upload_chunks --kb-size 1000 export.csv

Example: split + gzip on the fly (one external command per chunk)

# Single quotes keep the shell from expanding $FILE before qsv runs the command for each chunk.
qsv split outdir -s 100000 --filter 'gzip $FILE' big.csv
ls outdir/*.gz

Example: padded filenames for predictable lexicographic order

qsv split outdir -s 100 --filename 'chunk_{}.csv' --pad 5 input.csv
# chunk_00000.csv  chunk_00100.csv  chunk_00200.csv ...
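The listings in this section suggest that each chunk is named after the index of its first row, zero-padded to the --pad width. The Python sketch below reproduces that naming scheme as inferred from those listings; `chunk_names` is a hypothetical helper, and the scheme itself is an assumption to check against your qsv version.

```python
def chunk_names(total_rows, size, template="{}.csv", pad=0):
    """Name each chunk after the zero-padded index of its first row."""
    return [
        template.format(str(start).zfill(pad))
        for start in range(0, total_rows, size)
    ]

print(chunk_names(3_000_000, 1_000_000))
print(chunk_names(250, 100, template="chunk_{}.csv", pad=5))
```

Without padding, names such as 1000000.csv and 200000.csv do not sort in row order lexicographically, which is the problem --pad addresses.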

See also: /docs/help/split.md, partition, cat rows, foreach, Performance Tuning.
