Skip to content

Recipe Larger than RAM

Joel Natividad edited this page May 30, 2026 · 4 revisions

Recipe: Larger-than-RAM CSV

Tier: Advanced Commands used: index, extsort, extdedup, sqlp, joinp, split, snappy, jemalloc tuning Anchor dataset: NYC 311 full export (~16 GB / 27M rows)

Problem

A 16 GB CSV won't fit in your laptop's RAM. You still need to:

  • sort, dedup, join, aggregate, and validate it
  • run profiling stats
  • export to Parquet for downstream
  • do all of this in bounded memory — without OOM crashes

qsv handles this. Many commands stream; the rest have ext-* companion commands (extsort, extdedup) for the on-disk versions. Polars-backed commands (sqlp, joinp, pivotp) handle larger-than-RAM via lazy evaluation. Plus a few env vars that turn on bigger gains.

Data

# NYC 311 full export from NYC Open Data. Roughly 16 GB / 27M rows.
# (Adjust the URL to the latest full export.)
curl -L -o nyc311-full.csv "https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD"

ls -lh nyc311-full.csv
# ~16G  nyc311-full.csv

If you don't have 16 GB free, scale down — the techniques work the same on any "doesn't comfortably fit in RAM" dataset.

Solution

1. Index once

qsv index nyc311-full.csv
# ~14 seconds on an M2 Pro
ls -lh nyc311-full.csv*
# 16G  nyc311-full.csv
# 27M  nyc311-full.csv.idx

The index lets count, sample, slice work instantaneously and gives stats, frequency, split, schema multithreaded modes — see Performance Tuning.

Better yet, set the env var:

export QSV_AUTOINDEX_SIZE=100000000     # auto-index any file > 100 MB

2. Compute stats in bounded memory

qsv stats --everything \
  --cardinality-method approx \
  --quantile-method approx \
  --stats-jsonl \
  nyc311-full.csv > stats.csv

The approx flags switch from exact (O(cardinality) memory) to Apache DataSketches (constant memory per column, ~1.5% error). Critical for high-cardinality columns like Unique Key.

Note

Since 20.1.0, you no longer have to remember the approx flags for stats and frequency: when qsv detects the file is likely bigger than RAM, it auto-enables them and emits a wwarn! listing what got switched on. Pass them explicitly when you want to lock the choice in (e.g. in pipelines where the cache key should be stable), or pass --quantile-method exact / --cardinality-method exact to force precise calculation regardless. The stats cache key now includes the chosen mode so mixed-mode runs don't return stale results.

You can also set a memory limit via env var:

export QSV_STATS_CHUNK_MEMORY_MB=2048    # max 2 GB per chunk

3. Sort with extsort

qsv extsort --select 'Created Date' nyc311-full.csv > nyc311-by-date.csv

extsort is a multithreaded external merge sort. It uses $QSV_TMPDIR (or /tmp) for spill files — make sure you have ~2× the input size free there.

4. Dedup with extdedup

qsv extdedup --select 'Unique Key' nyc311-full.csv --dupes-output dupes.csv > unique.csv

Unlike dedup, extdedup preserves input order and uses a memory-mapped on-disk hash table. It works fine on 100 GB+ files.

For pure line-mode dedup of any text file (not just CSV):

qsv extdedup --no-headers huge.log > unique.log

5. Aggregate via Polars SQL (lazy evaluation)

qsv sqlp nyc311-full.csv \
  "SELECT Borough,
          strftime('%Y-%m', \"Created Date\") AS year_month,
          COUNT(*) AS complaints
   FROM nyc311_full
   GROUP BY Borough, year_month
   ORDER BY year_month, complaints DESC" \
  --format parquet --output nyc311-by-borough-month.parquet

Polars evaluates the query lazily — it builds the plan, optimizes it, and streams the file in chunks. For maximum speed, generate a Polars schema first:

qsv schema --polars nyc311-full.csv
# Writes nyc311-full.pschema.json — sqlp picks it up automatically

6. Join with joinp (larger-than-RAM)

qsv joinp \
  'Borough' nyc311-full.csv \
  borough boroughs_lookup.csv \
  --coalesce > nyc311-with-borough-info.csv

joinp is Polars-powered — multithreaded, larger-than-RAM. Use the asof / non-equi variants for time-series and range joins; see Recipe: Multi-Table Joins.

7. Compress for storage with Snappy (multithreaded)

qsv snappy compress nyc311-full.csv > nyc311-full.csv.sz
ls -lh nyc311-full.csv*
# 16G  nyc311-full.csv
# 6G   nyc311-full.csv.sz    (compression ratio varies)

All other qsv commands transparently handle .sz files. Multithreaded compression is 5-6× faster than the implicit single-threaded compression when you just write to *.sz.

8. Split into chunks for embarrassingly-parallel processing

qsv split chunks/ --size 1000000 --jobs 8 nyc311-full.csv
ls chunks/
# 0.csv  1000000.csv  2000000.csv  ...

With an index and --jobs 8, split is multithreaded. Now you can run heavy commands in parallel:

ls chunks/*.csv \
  | xargs -P 4 -I {} sh -c 'qsv geocode reverse Location {} > enriched/{}'
qsv cat rows enriched/*.csv > nyc311-enriched.csv

Memory-tuning env vars

Variable Default Purpose
QSV_AUTOINDEX_SIZE unset Auto-create / refresh index for files > N bytes
QSV_TMPDIR /tmp Spill directory for extsort / extdedup
QSV_STATS_CHUNK_MEMORY_MB dynamic Max memory per chunk in stats
QSV_FREQ_CHUNK_MEMORY_MB dynamic Same for frequency
QSV_MEMORY_CHECK unset (1 to enable) Pre-check available RAM for in-memory commands
QSV_MEMORY_HEADROOM_MB 100 Reserve N MB above the qsv working set

For the full list (23+ vars), see docs/ENVIRONMENT_VARIABLES.md and the Environment Variables wiki page.

Variations

One-shot "validate big" pipeline

qsv index nyc311-full.csv
qsv stats --everything --cardinality-method approx --quantile-method approx --stats-jsonl nyc311-full.csv
qsv schema nyc311-full.csv
qsv validate nyc311-full.csv nyc311-full.csv.schema.json

Even on 27M rows, this completes in a few minutes thanks to the multithreaded validate (peak 780k rows/sec) and the streaming stats.

DuckDB hand-off

qsv to parquet outdir/ nyc311-full.csv
# Now DuckDB can query the Parquet directly:
duckdb -c "SELECT Borough, COUNT(*) FROM read_parquet('outdir/nyc311-full.parquet') GROUP BY Borough"

Scoring queries before running them

qsv scoresql nyc311-full.csv \
  "SELECT * FROM nyc311_full WHERE Borough = 'BROOKLYN'"
# Warns: SELECT * without LIMIT against a 16 GB file. Suggests narrower SELECT or LIMIT.

Cloud / S3 inputs

qsv reads stdin, so pipe from any cloud tool:

aws s3 cp s3://my-bucket/huge.csv - | qsv sample --bernoulli 0.001 -

For full-file processing in S3, download or mount via goofys/s3fs. The stdin path works for streaming-friendly commands (count, sample, search, …) and avoids the disk cost of the download.

Performance notes

  • 27M rows in ~14s to index, instantaneous to count after.
  • stats on the full file with approx methods completes in tens of seconds and uses bounded memory.
  • extsort and extdedup are spill-to-disk multithreaded — bound only by disk space, not RAM.
  • The biggest single optimization is the stats cache + Polars schema combo (see Stats Cache & Caching).
  • For maximum throughput, build qsv from source with CARGO_BUILD_RUSTFLAGS='-C target-cpu=native' — see Performance Tuning.

See also

Clone this wiki locally