-
Notifications
You must be signed in to change notification settings - Fork 102
Joins and Set Ops
Tier: Intermediate
Commands covered: join, joinp, exclude, partition, split
Per-command flag reference lives in
/docs/help/. This page is the workflow layer — when to reach for each command and how they compose.
Two commands do joins (join and joinp) — pick based on size and join type. Three commands chop a CSV into pieces (exclude, partition, split).
| If you want to… | Use | Notes |
|---|---|---|
| Inner/left/right/outer/cross/anti/semi join (small to medium) | join |
In-memory hash join; auto-indexed; simple |
| Same, but for files > RAM or asof / non-equi joins | joinp |
Polars-powered; multithreaded; pre-join filtering |
| Remove rows of B from A based on join key (or keep intersection) | exclude |
Cheaper than join --left-anti for the same use case |
| Split into one file per distinct value of a column | partition |
Creates prefix-{value}.csv per group |
| Split into chunks of N rows / N total chunks / by file size | split |
Multithreaded with an index |
In-memory hash join across two CSV files. Auto-creates an in-memory index for the lookup side. Use this when both files fit in RAM — it's simpler than joinp and produces output in deterministic order.
Example: enrich wcp.csv with continent from a lookup table
# country_continent.csv has columns: country,continent
qsv join Country wcp.csv country country_continent.csv > wcp_with_continent.csvThe Country column comes from the left side (wcp.csv); country is the right side's join key. Output keeps all columns from both files.
Example: left join — keep all wcp rows even if no continent match
qsv join --left Country wcp.csv country country_continent.csv > wcp_left.csvExample: anti join — find wcp countries NOT in the lookup
qsv join --left-anti Country wcp.csv country country_continent.csv > unmatched.csvExample: semi join — find rows in wcp that DO have a continent match (no right-side columns added)
qsv join --left-semi Country wcp.csv country country_continent.csv > matched.csvExample: case-insensitive composite-key join
qsv join --ignore-case 'firstname,lastname' people.csv 'first_name,last_name' contacts.csvSee also: /docs/help/join.md, joinp — for files > RAM and asof joins, exclude — for the --left-anti case without bringing in right-side columns.
Polars-powered join. Multithreaded, larger-than-RAM, supports asof joins (perfect for time-series), non-equi joins (range-based), and post-join column coalescing. The flagship joinp capability is asof — there's no in-memory equivalent.
joinp returns (rows, cols) shape information to stderr.
Example: enrich NYC 311 events with the nearest-prior weather observation (asof join)
# Both files must be sorted by the asof key (timestamp).
qsv joinp \
--left-by timestamp --right-by timestamp \
--strategy backward \
'timestamp' NYC_311_SR_2010-2020-sample-1M.csv \
'timestamp' noaa_ghcn_nyc.csv \
--coalesce \
> nyc311_with_weather.csv(Exact flags vary — consult qsv joinp --help for the asof option names in your qsv version. Asof matches each left-side timestamp to the nearest prior right-side timestamp; perfect for joining sparse weather observations to dense event logs.)
Example: non-equi join — match transactions to a salary band
# bands.csv has: min_salary,max_salary,grade
# employees.csv has: name,salary
qsv joinp --non-equi 'salary_left >= min_salary_right AND salary_left <= max_salary_right' \
employees.csv bands.csv > graded.csvExample: cross join — Cartesian product (use with caution)
qsv joinp --cross flavors.csv toppings.csv > all_combinations.csvExample: filter the left side before joining (skip the rows you don't want)
qsv joinp --filter-left "Borough = 'BROOKLYN'" \
'Complaint Type' nyc311.csv \
'Complaint Type' complaint_taxonomy.csv > brooklyn_classified.csv--filter-right exists too. Both are evaluated in Polars before the join happens, saving cycles.
Example: full outer join with column coalescing
qsv joinp --full --coalesce 'id' a.csv 'id' b.csv > merged.csvSee also: /docs/help/joinp.md, Polars asof-join docs, Polars non-equi-join docs, SQL & Polars, Recipe: Multi-Table Joins.
Remove rows from one CSV that appear in another (matching on specified columns). The -v flag inverts to give you the intersection instead.
exclude is functionally similar to qsv join --left-anti, but cheaper for the common case where you just want "rows in A not in B."
Example: list new records since last week's export
qsv exclude id new_export.csv id last_week.csv > delta.csvExample: intersection — records present in both (e.g., previously-processed-and-still-current)
qsv exclude -v id current.csv id processed.csv > still_valid.csvExample: composite-key exclude (case-insensitive)
qsv exclude --ignore-case 'firstname,lastname' people.csv 'firstname,lastname' bad_actors.csv > clean.csvExample: streaming pipeline with stdin
cat records.csv | qsv exclude id - id previously-processed.csv | qsv sort > new_sorted.csvSee also: /docs/help/exclude.md, join --left-anti, Recipe: Diff & Audit.
Split a CSV into one file per distinct value of a column. Filenames are derived from the value via a {} placeholder.
Example: partition NYC 311 by Borough
qsv partition Borough nyc311_by_borough --filename 'nyc311-{}.csv' \
NYC_311_SR_2010-2020-sample-1M.csv
ls nyc311_by_borough/
# nyc311-Bronx.csv nyc311-Brooklyn.csv nyc311-Manhattan.csv
# nyc311-Queens.csv nyc311-Staten_Island.csv nyc311-Unspecified.csvExample: partition wcp by Country into the current directory
qsv partition Country . --filename 'wcp-{}.csv' wcp.csv
ls wcp-*.csv | head
# wcp-ad.csv wcp-ae.csv wcp-af.csv wcp-ag.csv wcp-ai.csv ...Partitioning a 2.7M-row file by Country produces 209 files. Combine with downstream parallelism (e.g., xargs -P or foreach) for embarrassingly-parallel pipelines.
See also: /docs/help/partition.md, split — for size-based splitting, cat rows — to recombine after processing.
Split by rows (--size N), chunks (--chunks N), or file size (--kb-size N). Multithreaded when the input has an index and you're splitting by rows or chunks.
Example: chunk a 27M-row NYC 311 export into 1M-row files for parallel downstream processing
qsv index nyc311-full.csv
qsv split nyc311_chunks --size 1000000 --jobs 8 nyc311-full.csv
ls nyc311_chunks/
# 0.csv 1000000.csv 2000000.csv ...Example: exactly 10 chunks (good for parallel jobs that want one chunk per worker)
qsv split outdir --chunks 10 --jobs 4 large.csvExample: split into ~1 MB chunks for upload to a service with a size limit
qsv split upload_chunks --kb-size 1000 export.csvExample: split + gzip on the fly (one external command per chunk)
qsv split outdir -s 100000 --filter "gzip $FILE" big.csv
ls outdir/*.gzExample: padded filenames for predictable lexicographic order
qsv split outdir -s 100 --filename 'chunk_{}.csv' --pad 5 input.csv
# chunk_00000.csv chunk_00100.csv chunk_00200.csv ...See also: /docs/help/split.md, partition, cat rows, foreach, Performance Tuning.
- Command Reference (index)
-
SQL & Polars —
sqlp,joinp,pivotp,scoresql -
Transform & Reshape → cat — the inverse of
split/partition - Aggregation & Statistics — group-and-aggregate alternatives
- Cookbook → Multi-Table Joins
-
Cookbook → Diff & Audit —
excludein action - Recipe: Larger-than-RAM CSV
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation
- Recipes index
- Inspect an Unknown CSV
- Clean & Normalize
- Geographic Enrichment
- Date Enrichment
- CKAN Integration
- JSON Schema Validation
- Build a Data Pipeline
- Stats → Insights
- Fetch & Cache
- Larger-than-RAM CSV
- Diff & Audit
- Multi-table Joins
- Synthesize Fake Data