Skip to content
Joel Natividad edited this page May 13, 2026 · 4 revisions

Joins & Set Ops

Tier: Intermediate Commands covered: join, joinp, exclude, partition, split

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

Two commands do joins (join and joinp) — pick based on size and join type. Three commands chop a CSV into pieces (exclude, partition, split).

Quick decision table

If you want to… Use Notes
Inner/left/right/outer/cross/anti/semi join (small to medium) join In-memory hash join; auto-indexed; simple
Same, but for files > RAM or asof / non-equi joins joinp Polars-powered; multithreaded; pre-join filtering
Remove rows of B from A based on join key (or keep intersection) exclude Cheaper than join --left-anti for the same use case
Split into one file per distinct value of a column partition Creates prefix-{value}.csv per group
Split into chunks of N rows / N total chunks / by file size split Multithreaded with an index

join

In-memory hash join across two CSV files. Auto-creates an in-memory index for the lookup side. Use this when both files fit in RAM — it's simpler than joinp and produces output in deterministic order.

Example: enrich wcp.csv with continent from a lookup table

# country_continent.csv has columns: country,continent
qsv join Country wcp.csv country country_continent.csv > wcp_with_continent.csv

The Country column comes from the left side (wcp.csv); country is the right side's join key. Output keeps all columns from both files.

Example: left join — keep all wcp rows even if no continent match

qsv join --left Country wcp.csv country country_continent.csv > wcp_left.csv

Example: anti join — find wcp countries NOT in the lookup

qsv join --left-anti Country wcp.csv country country_continent.csv > unmatched.csv

Example: semi join — find rows in wcp that DO have a continent match (no right-side columns added)

qsv join --left-semi Country wcp.csv country country_continent.csv > matched.csv

Example: case-insensitive composite-key join

qsv join --ignore-case 'firstname,lastname' people.csv 'first_name,last_name' contacts.csv

See also: /docs/help/join.md, joinp — for files > RAM and asof joins, exclude — for the --left-anti case without bringing in right-side columns.

joinp

Polars-powered join. Multithreaded, larger-than-RAM, supports asof joins (perfect for time-series), non-equi joins (range-based), and post-join column coalescing. The flagship joinp capability is asof — there's no in-memory equivalent.

joinp returns (rows, cols) shape information to stderr.

Example: enrich NYC 311 events with the nearest-prior weather observation (asof join)

# Both files must be sorted by the asof key (timestamp).
qsv joinp \
  --left-by timestamp --right-by timestamp \
  --strategy backward \
  'timestamp' NYC_311_SR_2010-2020-sample-1M.csv \
  'timestamp' noaa_ghcn_nyc.csv \
  --coalesce \
  > nyc311_with_weather.csv

(Exact flags vary — consult qsv joinp --help for the asof option names in your qsv version. Asof matches each left-side timestamp to the nearest prior right-side timestamp; perfect for joining sparse weather observations to dense event logs.)

Example: non-equi join — match transactions to a salary band

# bands.csv has: min_salary,max_salary,grade
# employees.csv has: name,salary
qsv joinp --non-equi 'salary_left >= min_salary_right AND salary_left <= max_salary_right' \
  employees.csv bands.csv > graded.csv

Example: cross join — Cartesian product (use with caution)

qsv joinp --cross flavors.csv toppings.csv > all_combinations.csv

Example: filter the left side before joining (skip the rows you don't want)

qsv joinp --filter-left "Borough = 'BROOKLYN'" \
  'Complaint Type' nyc311.csv \
  'Complaint Type' complaint_taxonomy.csv > brooklyn_classified.csv

--filter-right exists too. Both are evaluated in Polars before the join happens, saving cycles.

Example: full outer join with column coalescing

qsv joinp --full --coalesce 'id' a.csv 'id' b.csv > merged.csv

See also: /docs/help/joinp.md, Polars asof-join docs, Polars non-equi-join docs, SQL & Polars, Recipe: Multi-Table Joins.

exclude

Remove rows from one CSV that appear in another (matching on specified columns). The -v flag inverts to give you the intersection instead.

exclude is functionally similar to qsv join --left-anti, but cheaper for the common case where you just want "rows in A not in B."

Example: list new records since last week's export

qsv exclude id new_export.csv id last_week.csv > delta.csv

Example: intersection — records present in both (e.g., previously-processed-and-still-current)

qsv exclude -v id current.csv id processed.csv > still_valid.csv

Example: composite-key exclude (case-insensitive)

qsv exclude --ignore-case 'firstname,lastname' people.csv 'firstname,lastname' bad_actors.csv > clean.csv

Example: streaming pipeline with stdin

cat records.csv | qsv exclude id - id previously-processed.csv | qsv sort > new_sorted.csv

See also: /docs/help/exclude.md, join --left-anti, Recipe: Diff & Audit.

partition

Split a CSV into one file per distinct value of a column. Filenames are derived from the value via a {} placeholder.

Example: partition NYC 311 by Borough

qsv partition Borough nyc311_by_borough --filename 'nyc311-{}.csv' \
  NYC_311_SR_2010-2020-sample-1M.csv
ls nyc311_by_borough/
# nyc311-Bronx.csv  nyc311-Brooklyn.csv  nyc311-Manhattan.csv
# nyc311-Queens.csv  nyc311-Staten_Island.csv  nyc311-Unspecified.csv

Example: partition wcp by Country into the current directory

qsv partition Country . --filename 'wcp-{}.csv' wcp.csv
ls wcp-*.csv | head
# wcp-ad.csv  wcp-ae.csv  wcp-af.csv  wcp-ag.csv  wcp-ai.csv  ...

Partitioning a 2.7M-row file by Country produces 209 files. Combine with downstream parallelism (e.g., xargs -P or foreach) for embarrassingly-parallel pipelines.

See also: /docs/help/partition.md, split — for size-based splitting, cat rows — to recombine after processing.

split

Split by rows (--size N), chunks (--chunks N), or file size (--kb-size N). Multithreaded when the input has an index and you're splitting by rows or chunks.

Example: chunk a 27M-row NYC 311 export into 1M-row files for parallel downstream processing

qsv index nyc311-full.csv
qsv split nyc311_chunks --size 1000000 --jobs 8 nyc311-full.csv
ls nyc311_chunks/
# 0.csv  1000000.csv  2000000.csv  ...

Example: exactly 10 chunks (good for parallel jobs that want one chunk per worker)

qsv split outdir --chunks 10 --jobs 4 large.csv

Example: split into ~1 MB chunks for upload to a service with a size limit

qsv split upload_chunks --kb-size 1000 export.csv

Example: split + gzip on the fly (one external command per chunk)

qsv split outdir -s 100000 --filter "gzip $FILE" big.csv
ls outdir/*.gz

Example: padded filenames for predictable lexicographic order

qsv split outdir -s 100 --filename 'chunk_{}.csv' --pad 5 input.csv
# chunk_00000.csv  chunk_00100.csv  chunk_00200.csv ...

See also: /docs/help/split.md, partition, cat rows, foreach, Performance Tuning.

See also

Clone this wiki locally