Recipe Larger than RAM

Recipe: Larger-than-RAM CSV

Tier: Advanced Commands used: index, extsort, extdedup, sqlp, joinp, split, snappy, jemalloc tuning Anchor dataset: NYC 311 full export (~16 GB / 27M rows)

Problem

A 16 GB CSV won't fit in your laptop's RAM. You still need to:

sort, dedup, join, aggregate, and validate it
run profiling stats
export to Parquet for downstream
do all of this in bounded memory — without OOM crashes

qsv handles this. Many commands stream; the rest have ext-* companion commands (extsort, extdedup) for the on-disk versions. Polars-backed commands (sqlp, joinp, pivotp) handle larger-than-RAM via lazy evaluation. Plus a few env vars that turn on bigger gains.

Data

# NYC 311 full export from NYC Open Data. Roughly 16 GB / 27M rows.
# (Adjust the URL to the latest full export.)
curl -L -o nyc311-full.csv "https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD"

ls -lh nyc311-full.csv
# ~16G  nyc311-full.csv

If you don't have 16 GB free, scale down — the techniques work the same on any "doesn't comfortably fit in RAM" dataset.

Solution

1. Index once

qsv index nyc311-full.csv
# ~14 seconds on an M2 Pro
ls -lh nyc311-full.csv*
# 16G  nyc311-full.csv
# 27M  nyc311-full.csv.idx

The index lets count, sample, slice work instantaneously and gives stats, frequency, split, schema multithreaded modes — see Performance Tuning.

Better yet, set the env var:

export QSV_AUTOINDEX_SIZE=100000000     # auto-index any file > 100 MB

2. Compute stats in bounded memory

qsv stats --everything \
  --cardinality-method approx \
  --quantile-method approx \
  --stats-jsonl \
  nyc311-full.csv > stats.csv

The approx flags switch from exact (O(cardinality) memory) to Apache DataSketches (constant memory per column, ~1.5% error). Critical for high-cardinality columns like Unique Key.

Since 20.1.0, you no longer have to remember the approx flags for stats and frequency: when qsv detects the file is likely bigger than RAM, it auto-enables them and emits a wwarn! listing what got switched on. Pass them explicitly when you want to lock the choice in (e.g. in pipelines where the cache key should be stable), or pass --quantile-method exact / --cardinality-method exact to force precise calculation regardless. The stats cache key now includes the chosen mode so mixed-mode runs don't return stale results.

You can also set a memory limit via env var:

export QSV_STATS_CHUNK_MEMORY_MB=2048    # max 2 GB per chunk

3. Sort with extsort

qsv extsort --select 'Created Date' nyc311-full.csv > nyc311-by-date.csv

extsort is a multithreaded external merge sort. It uses $QSV_TMPDIR (or /tmp) for spill files — make sure you have ~2× the input size free there.

4. Dedup with extdedup

qsv extdedup --select 'Unique Key' nyc311-full.csv --dupes-output dupes.csv > unique.csv

Unlike dedup, extdedup preserves input order and uses a memory-mapped on-disk hash table. It works fine on 100 GB+ files.

For pure line-mode dedup of any text file (not just CSV):

qsv extdedup --no-headers huge.log > unique.log

5. Aggregate via Polars SQL (lazy evaluation)

qsv sqlp nyc311-full.csv \
  "SELECT Borough,
          strftime('%Y-%m', \"Created Date\") AS year_month,
          COUNT(*) AS complaints
   FROM nyc311_full
   GROUP BY Borough, year_month
   ORDER BY year_month, complaints DESC" \
  --format parquet --output nyc311-by-borough-month.parquet

Polars evaluates the query lazily — it builds the plan, optimizes it, and streams the file in chunks. For maximum speed, generate a Polars schema first:

qsv schema --polars nyc311-full.csv
# Writes nyc311-full.pschema.json — sqlp picks it up automatically

6. Join with joinp (larger-than-RAM)

qsv joinp \
  'Borough' nyc311-full.csv \
  borough boroughs_lookup.csv \
  --coalesce > nyc311-with-borough-info.csv

joinp is Polars-powered — multithreaded, larger-than-RAM. Use the asof / non-equi variants for time-series and range joins; see Recipe: Multi-Table Joins.

7. Compress for storage with Snappy (multithreaded)

qsv snappy compress nyc311-full.csv > nyc311-full.csv.sz
ls -lh nyc311-full.csv*
# 16G  nyc311-full.csv
# 6G   nyc311-full.csv.sz    (compression ratio varies)

All other qsv commands transparently handle .sz files. Multithreaded compression is 5-6× faster than the implicit single-threaded compression when you just write to *.sz.

8. Split into chunks for embarrassingly-parallel processing

qsv split chunks/ --size 1000000 --jobs 8 nyc311-full.csv
ls chunks/
# 0.csv  1000000.csv  2000000.csv  ...

With an index and --jobs 8, split is multithreaded. Now you can run heavy commands in parallel:

ls chunks/*.csv \
  | xargs -P 4 -I {} sh -c 'qsv geocode reverse Location {} > enriched/{}'
qsv cat rows enriched/*.csv > nyc311-enriched.csv

Memory-tuning env vars

Variable	Default	Purpose
`QSV_AUTOINDEX_SIZE`	unset	Auto-create / refresh index for files > N bytes
`QSV_TMPDIR`	`/tmp`	Spill directory for `extsort` / `extdedup`
`QSV_STATS_CHUNK_MEMORY_MB`	dynamic	Max memory per chunk in `stats`
`QSV_FREQ_CHUNK_MEMORY_MB`	dynamic	Same for `frequency`
`QSV_MEMORY_CHECK`	unset (1 to enable)	Pre-check available RAM for in-memory commands
`QSV_MEMORY_HEADROOM_MB`	`100`	Reserve N MB above the qsv working set

For the full list (23+ vars), see docs/ENVIRONMENT_VARIABLES.md and the Environment Variables wiki page.

Variations

One-shot "validate big" pipeline

qsv index nyc311-full.csv
qsv stats --everything --cardinality-method approx --quantile-method approx --stats-jsonl nyc311-full.csv
qsv schema nyc311-full.csv
qsv validate nyc311-full.csv nyc311-full.csv.schema.json

Even on 27M rows, this completes in a few minutes thanks to the multithreaded validate (peak 780k rows/sec) and the streaming stats.

DuckDB hand-off

qsv to parquet outdir/ nyc311-full.csv
# Now DuckDB can query the Parquet directly:
duckdb -c "SELECT Borough, COUNT(*) FROM read_parquet('outdir/nyc311-full.parquet') GROUP BY Borough"

Scoring queries before running them

qsv scoresql nyc311-full.csv \
  "SELECT * FROM nyc311_full WHERE Borough = 'BROOKLYN'"
# Warns: SELECT * without LIMIT against a 16 GB file. Suggests narrower SELECT or LIMIT.

Cloud / S3 inputs

qsv reads stdin, so pipe from any cloud tool:

aws s3 cp s3://my-bucket/huge.csv - | qsv sample --bernoulli 0.001 -

For full-file processing in S3, download or mount via goofys/s3fs. The stdin path works for streaming-friendly commands (count, sample, search, …) and avoids the disk cost of the download.

Performance notes

27M rows in ~14s to index, instantaneous to count after.
stats on the full file with approx methods completes in tens of seconds and uses bounded memory.
extsort and extdedup are spill-to-disk multithreaded — bound only by disk space, not RAM.
The biggest single optimization is the stats cache + Polars schema combo (see Stats Cache & Caching).
For maximum throughput, build qsv from source with CARGO_BUILD_RUSTFLAGS='-C target-cpu=native' — see Performance Tuning.

Recipe Larger than RAM

Recipe: Larger-than-RAM CSV

Problem

Data

Solution

1. Index once

2. Compute stats in bounded memory

3. Sort with extsort

4. Dedup with extdedup

5. Aggregate via Polars SQL (lazy evaluation)

6. Join with joinp (larger-than-RAM)

7. Compress for storage with Snappy (multithreaded)

8. Split into chunks for embarrassingly-parallel processing

Memory-tuning env vars

Variations

One-shot "validate big" pipeline

DuckDB hand-off

Scoring queries before running them

Cloud / S3 inputs

Performance notes

See also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Get Started

Command Reference

Cookbook