-
Notifications
You must be signed in to change notification settings - Fork 102
Recipe Larger than RAM
Tier: Advanced
Commands used: index, extsort, extdedup, sqlp, joinp, split, snappy, jemalloc tuning
Anchor dataset: NYC 311 full export (~16 GB / 27M rows)
A 16 GB CSV won't fit in your laptop's RAM. You still need to:
- sort, dedup, join, aggregate, and validate it
- run profiling stats
- export to Parquet for downstream
- do all of this in bounded memory — without OOM crashes
qsv handles this. Many commands stream; the rest have ext-* companion commands (extsort, extdedup) for the on-disk versions. Polars-backed commands (sqlp, joinp, pivotp) handle larger-than-RAM via lazy evaluation. Plus a few env vars that turn on bigger gains.
# NYC 311 full export from NYC Open Data. Roughly 16 GB / 27M rows.
# (Adjust the URL to the latest full export.)
curl -L -o nyc311-full.csv "https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD"
ls -lh nyc311-full.csv
# ~16G nyc311-full.csvIf you don't have 16 GB free, scale down — the techniques work the same on any "doesn't comfortably fit in RAM" dataset.
qsv index nyc311-full.csv
# ~14 seconds on an M2 Pro
ls -lh nyc311-full.csv*
# 16G nyc311-full.csv
# 27M nyc311-full.csv.idxThe index lets count, sample, slice work instantaneously and gives stats, frequency, split, schema multithreaded modes — see Performance Tuning.
Better yet, set the env var:
export QSV_AUTOINDEX_SIZE=100000000 # auto-index any file > 100 MBqsv stats --everything \
--cardinality-method approx \
--quantile-method approx \
--stats-jsonl \
nyc311-full.csv > stats.csvThe approx flags switch from exact (O(cardinality) memory) to Apache DataSketches (constant memory per column, ~1.5% error). Critical for high-cardinality columns like Unique Key.
Since 20.1.0, you no longer have to remember the
approxflags forstatsandfrequency: when qsv detects the file is likely bigger than RAM, it auto-enables them and emits awwarn!listing what got switched on. Pass them explicitly when you want to lock the choice in (e.g. in pipelines where the cache key should be stable), or pass--quantile-method exact/--cardinality-method exactto force precise calculation regardless. Thestatscache key now includes the chosen mode so mixed-mode runs don't return stale results.
You can also set a memory limit via env var:
export QSV_STATS_CHUNK_MEMORY_MB=2048 # max 2 GB per chunkqsv extsort --select 'Created Date' nyc311-full.csv > nyc311-by-date.csvextsort is a multithreaded external merge sort. It uses $QSV_TMPDIR (or /tmp) for spill files — make sure you have ~2× the input size free there.
qsv extdedup --select 'Unique Key' nyc311-full.csv --dupes-output dupes.csv > unique.csvUnlike dedup, extdedup preserves input order and uses a memory-mapped on-disk hash table. It works fine on 100 GB+ files.
For pure line-mode dedup of any text file (not just CSV):
qsv extdedup --no-headers huge.log > unique.logqsv sqlp nyc311-full.csv \
"SELECT Borough,
strftime('%Y-%m', \"Created Date\") AS year_month,
COUNT(*) AS complaints
FROM nyc311_full
GROUP BY Borough, year_month
ORDER BY year_month, complaints DESC" \
--format parquet --output nyc311-by-borough-month.parquetPolars evaluates the query lazily — it builds the plan, optimizes it, and streams the file in chunks. For maximum speed, generate a Polars schema first:
qsv schema --polars nyc311-full.csv
# Writes nyc311-full.pschema.json — sqlp picks it up automaticallyqsv joinp \
'Borough' nyc311-full.csv \
borough boroughs_lookup.csv \
--coalesce > nyc311-with-borough-info.csvjoinp is Polars-powered — multithreaded, larger-than-RAM. Use the asof / non-equi variants for time-series and range joins; see Recipe: Multi-Table Joins.
qsv snappy compress nyc311-full.csv > nyc311-full.csv.sz
ls -lh nyc311-full.csv*
# 16G nyc311-full.csv
# 6G nyc311-full.csv.sz (compression ratio varies)All other qsv commands transparently handle .sz files. Multithreaded compression is 5-6× faster than the implicit single-threaded compression when you just write to *.sz.
qsv split chunks/ --size 1000000 --jobs 8 nyc311-full.csv
ls chunks/
# 0.csv 1000000.csv 2000000.csv ...With an index and --jobs 8, split is multithreaded. Now you can run heavy commands in parallel:
ls chunks/*.csv \
| xargs -P 4 -I {} sh -c 'qsv geocode reverse Location {} > enriched/{}'
qsv cat rows enriched/*.csv > nyc311-enriched.csv| Variable | Default | Purpose |
|---|---|---|
QSV_AUTOINDEX_SIZE |
unset | Auto-create / refresh index for files > N bytes |
QSV_TMPDIR |
/tmp |
Spill directory for extsort / extdedup
|
QSV_STATS_CHUNK_MEMORY_MB |
dynamic | Max memory per chunk in stats
|
QSV_FREQ_CHUNK_MEMORY_MB |
dynamic | Same for frequency
|
QSV_MEMORY_CHECK |
unset (1 to enable) | Pre-check available RAM for in-memory commands |
QSV_MEMORY_HEADROOM_MB |
100 |
Reserve N MB above the qsv working set |
For the full list (23+ vars), see docs/ENVIRONMENT_VARIABLES.md and the Environment Variables wiki page.
qsv index nyc311-full.csv
qsv stats --everything --cardinality-method approx --quantile-method approx --stats-jsonl nyc311-full.csv
qsv schema nyc311-full.csv
qsv validate nyc311-full.csv nyc311-full.csv.schema.jsonEven on 27M rows, this completes in a few minutes thanks to the multithreaded validate (peak 780k rows/sec) and the streaming stats.
qsv to parquet outdir/ nyc311-full.csv
# Now DuckDB can query the Parquet directly:
duckdb -c "SELECT Borough, COUNT(*) FROM read_parquet('outdir/nyc311-full.parquet') GROUP BY Borough"qsv scoresql nyc311-full.csv \
"SELECT * FROM nyc311_full WHERE Borough = 'BROOKLYN'"
# Warns: SELECT * without LIMIT against a 16 GB file. Suggests narrower SELECT or LIMIT.qsv reads stdin, so pipe from any cloud tool:
aws s3 cp s3://my-bucket/huge.csv - | qsv sample --bernoulli 0.001 -For full-file processing in S3, download or mount via goofys/s3fs. The stdin path works for streaming-friendly commands (count, sample, search, …) and avoids the disk cost of the download.
- 27M rows in ~14s to index, instantaneous to count after.
-
statson the full file withapproxmethods completes in tens of seconds and uses bounded memory. -
extsortandextdedupare spill-to-disk multithreaded — bound only by disk space, not RAM. - The biggest single optimization is the stats cache + Polars schema combo (see Stats Cache & Caching).
- For maximum throughput, build qsv from source with
CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'— see Performance Tuning.
- Performance Tuning — index, jemalloc, target-cpu=native, env vars
- Environment Variables — every QSV_* var
-
Aggregation & Statistics —
extsort,extdedup,--cardinality-method approx -
SQL & Polars —
sqlp,joinp,pivotplazy mode -
Validation & Schema —
schema --polarsfor.pschema.json - Stats Cache & Caching — the speed-up secret
- Indexing, Compression & Diff
-
docs/PERFORMANCE.md— canonical reference - Recipe: Build a Data Pipeline — same pipeline on a smaller dataset
- Recipe: Diff & Audit — when both sides are larger-than-RAM
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation
- Recipes index
- Inspect an Unknown CSV
- Clean & Normalize
- Geographic Enrichment
- Date Enrichment
- CKAN Integration
- JSON Schema Validation
- Build a Data Pipeline
- Stats → Insights
- Fetch & Cache
- Larger-than-RAM CSV
- Diff & Audit
- Multi-table Joins
- Synthesize Fake Data