-
Notifications
You must be signed in to change notification settings - Fork 104
Recipe Larger than RAM
Tier: Advanced
Commands used: index, extsort, extdedup, sqlp, joinp, split, snappy, jemalloc tuning
Anchor dataset: NYC 311 full export (~16 GB / 27M rows)
A 16 GB CSV won't fit in your laptop's RAM. You still need to:
- sort, dedup, join, aggregate, and validate it
- run profiling stats
- export to Parquet for downstream
- do all of this in bounded memory — without OOM crashes
qsv handles this. Many commands stream; the rest have ext-* companion commands (extsort, extdedup) for the on-disk versions. Polars-backed commands (sqlp, joinp, pivotp) handle larger-than-RAM via lazy evaluation. Plus a few env vars that turn on bigger gains.
# NYC 311 full export from NYC Open Data. Roughly 16 GB / 27M rows.
# (Adjust the URL to the latest full export.)
curl -L -o nyc311-full.csv "https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD"
ls -lh nyc311-full.csv
# ~16G nyc311-full.csvIf you don't have 16 GB free, scale down — the techniques work the same on any "doesn't comfortably fit in RAM" dataset.
qsv index nyc311-full.csv
# ~14 seconds on an M2 Pro
ls -lh nyc311-full.csv*
# 16G nyc311-full.csv
# 27M nyc311-full.csv.idxThe index lets count, sample, slice work instantaneously and gives stats, frequency, split, schema multithreaded modes — see Performance Tuning.
Better yet, set the env var:
export QSV_AUTOINDEX_SIZE=100000000 # auto-index any file > 100 MBqsv stats --everything \
--cardinality-method approx \
--quantile-method approx \
--stats-jsonl \
nyc311-full.csv > stats.csvThe approx flags switch from exact (O(cardinality) memory) to Apache DataSketches (constant memory per column, ~1.5% error). Critical for high-cardinality columns like Unique Key.
Note
Since 20.1.0, you no longer have to remember the approx flags for stats and frequency: when qsv detects the file is likely bigger than RAM, it auto-enables them and emits a wwarn! listing what got switched on. Pass them explicitly when you want to lock the choice in (e.g. in pipelines where the cache key should be stable), or pass --quantile-method exact / --cardinality-method exact to force precise calculation regardless. The stats cache key now includes the chosen mode so mixed-mode runs don't return stale results.
You can also set a memory limit via env var:
export QSV_STATS_CHUNK_MEMORY_MB=2048 # max 2 GB per chunkqsv extsort --select 'Created Date' nyc311-full.csv > nyc311-by-date.csvextsort is a multithreaded external merge sort. It uses $QSV_TMPDIR (or /tmp) for spill files — make sure you have ~2× the input size free there.
qsv extdedup --select 'Unique Key' nyc311-full.csv --dupes-output dupes.csv > unique.csvUnlike dedup, extdedup preserves input order and uses a memory-mapped on-disk hash table. It works fine on 100 GB+ files.
For pure line-mode dedup of any text file (not just CSV):
qsv extdedup --no-headers huge.log > unique.logqsv sqlp nyc311-full.csv \
"SELECT Borough,
strftime('%Y-%m', \"Created Date\") AS year_month,
COUNT(*) AS complaints
FROM nyc311_full
GROUP BY Borough, year_month
ORDER BY year_month, complaints DESC" \
--format parquet --output nyc311-by-borough-month.parquetPolars evaluates the query lazily — it builds the plan, optimizes it, and streams the file in chunks. For maximum speed, generate a Polars schema first:
qsv schema --polars nyc311-full.csv
# Writes nyc311-full.pschema.json — sqlp picks it up automaticallyqsv joinp \
'Borough' nyc311-full.csv \
borough boroughs_lookup.csv \
--coalesce > nyc311-with-borough-info.csvjoinp is Polars-powered — multithreaded, larger-than-RAM. Use the asof / non-equi variants for time-series and range joins; see Recipe: Multi-Table Joins.
qsv snappy compress nyc311-full.csv > nyc311-full.csv.sz
ls -lh nyc311-full.csv*
# 16G nyc311-full.csv
# 6G nyc311-full.csv.sz (compression ratio varies)All other qsv commands transparently handle .sz files. Multithreaded compression is 5-6× faster than the implicit single-threaded compression when you just write to *.sz.
qsv split chunks/ --size 1000000 --jobs 8 nyc311-full.csv
ls chunks/
# 0.csv 1000000.csv 2000000.csv ...With an index and --jobs 8, split is multithreaded. Now you can run heavy commands in parallel:
ls chunks/*.csv \
| xargs -P 4 -I {} sh -c 'qsv geocode reverse Location {} > enriched/{}'
qsv cat rows enriched/*.csv > nyc311-enriched.csv| Variable | Default | Purpose |
|---|---|---|
QSV_AUTOINDEX_SIZE |
unset | Auto-create / refresh index for files > N bytes |
QSV_TMPDIR |
/tmp |
Spill directory for extsort / extdedup
|
QSV_STATS_CHUNK_MEMORY_MB |
dynamic | Max memory per chunk in stats
|
QSV_FREQ_CHUNK_MEMORY_MB |
dynamic | Same for frequency
|
QSV_MEMORY_CHECK |
unset (1 to enable) | Pre-check available RAM for in-memory commands |
QSV_MEMORY_HEADROOM_MB |
100 |
Reserve N MB above the qsv working set |
For the full list (23+ vars), see docs/ENVIRONMENT_VARIABLES.md and the Environment Variables wiki page.
qsv index nyc311-full.csv
qsv stats --everything --cardinality-method approx --quantile-method approx --stats-jsonl nyc311-full.csv
qsv schema nyc311-full.csv
qsv validate nyc311-full.csv nyc311-full.csv.schema.jsonEven on 27M rows, this completes in a few minutes thanks to the multithreaded validate (peak 780k rows/sec) and the streaming stats.
qsv to parquet outdir/ nyc311-full.csv
# Now DuckDB can query the Parquet directly:
duckdb -c "SELECT Borough, COUNT(*) FROM read_parquet('outdir/nyc311-full.parquet') GROUP BY Borough"qsv scoresql nyc311-full.csv \
"SELECT * FROM nyc311_full WHERE Borough = 'BROOKLYN'"
# Warns: SELECT * without LIMIT against a 16 GB file. Suggests narrower SELECT or LIMIT.qsv reads stdin, so pipe from any cloud tool:
aws s3 cp s3://my-bucket/huge.csv - | qsv sample --bernoulli 0.001 -For full-file processing in S3, download or mount via goofys/s3fs. The stdin path works for streaming-friendly commands (count, sample, search, …) and avoids the disk cost of the download.
- 27M rows in ~14s to index, instantaneous to count after.
-
statson the full file withapproxmethods completes in tens of seconds and uses bounded memory. -
extsortandextdedupare spill-to-disk multithreaded — bound only by disk space, not RAM. - The biggest single optimization is the stats cache + Polars schema combo (see Stats Cache & Caching).
- For maximum throughput, build qsv from source with
CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'— see Performance Tuning.
- Performance Tuning — index, jemalloc, target-cpu=native, env vars
- Environment Variables — every QSV_* var
-
Aggregation & Statistics —
extsort,extdedup,--cardinality-method approx -
SQL & Polars —
sqlp,joinp,pivotplazy mode -
Validation & Schema —
schema --polarsfor.pschema.json - Stats Cache & Caching — the speed-up secret
- Indexing, Compression & Diff
-
docs/PERFORMANCE.md— canonical reference - Recipe: Build a Data Pipeline — same pipeline on a smaller dataset
- Recipe: Diff & Audit — when both sides are larger-than-RAM
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation