Aggregation and Statistics

Aggregation & Statistics

Tier: Intermediate Commands covered: stats, moarstats, frequency, pragmastat, dedup, extdedup, extsort

Note

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

These commands turn raw rows into numbers. The headline feature is stats — 48 metrics in under a second on 2.7M rows. The runner-up is frequency — multithreaded with an index, Apache DataSketches-backed frequent_items mode for huge cardinalities.

Quick decision table

If you want to…	Use	Notes
Compute mean/median/stddev/etc. with guaranteed type inference	`stats`	48 metrics; sub-second on millions of rows
Add bivariate, robust, and outlier stats to an existing stats CSV	`moarstats`	25+ additional measures
See the top-N values per column (and how often each occurs)	`frequency`	Apache DataSketches mode for huge cardinalities
Robust median-of-pairwise stats (Hodges-Lehmann, Shamos)	`pragmastat`	Date/datetime aware via stats cache
Drop duplicate rows	`dedup`	Streaming with `--sorted`, in-memory otherwise
Drop duplicates from a file > RAM	`extdedup`	On-disk hash table; preserves input order
Sort a file > RAM	`extsort`	Multithreaded external merge sort

`stats`

The headline command. Computes up to 48 summary statistics with guaranteed data type inference (Null / String / Float / Integer / Date / DateTime / Boolean). Streaming by default; loads into memory for non-streaming stats like median / quartiles / cardinality. Multithreaded with an index.

A successful run writes a <filename>.stats.csv cache file that many other commands reuse (frequency, schema, pragmastat, pivotp, sqlp scoresql, etc.) — see Stats Cache & Caching.

Example: full profile of NYC 311 with date and boolean inference

qsv stats --everything --infer-dates --infer-boolean \
  NYC_311_SR_2010-2020-sample-1M.csv > nyc311-stats.csv

--everything adds cardinality, modes/antimodes, median, MAD, quartiles, IQR, fences, skewness, percentiles, and zero_padded_numeric (flags String columns that are entirely zero-padded numbers — zip codes, barcodes, padded IDs — so they aren't mistakenly loaded as integer/float downstream). --infer-dates recognizes 19 date formats; --infer-boolean detects 2-valued patterns like t/f, yes/no, 1/0.

The zero_padded_numeric column can also be enabled on its own with --zero-padded-numeric:

qsv stats --zero-padded-numeric data.csv

Example: pre-populate the stats cache to speed up downstream commands

# Run this once; every smart command (frequency, schema, pragmastat, pivotp, ...)
# will pick up the cache automatically.
qsv stats --cardinality --stats-jsonl NYC_311_SR_2010-2020-sample-1M.csv
ls NYC_311_SR_2010-2020-sample-1M.{stats.csv,stats.csv.data.jsonl}

Example: custom percentiles for a population study

qsv stats --quartiles --percentiles --percentile-list '5,10,25,50,75,90,95' \
  wcp.csv > wcp-percentiles.csv

The deciles and quintiles keywords expand to 10,20,30,...,90 and 20,40,60,80.

Example: weighted stats (e.g., revenue-weighted price)

qsv stats --weight 'Quantity' --quartiles \
  transactions.csv > price-weighted-by-volume.stats.csv

Example: approximate quantiles & cardinality on a 100M-row file

qsv stats --everything \
  --quantile-method approx \
  --cardinality-method approx \
  huge.csv

approx uses Apache DataSketches (t-digest for quantiles, HyperLogLog for cardinality) — constant memory per column, ~1 % rank error. Use this when exact memory blows up.

Important

New in 20.1.0 — automatic on OOM. When util::mem_file_check reports that the file is likely bigger than available RAM, stats now auto-enables --quantile-method approx and --cardinality-method approx and emits a wwarn! listing what got switched on. Pass --quantile-method exact / --cardinality-method exact to force the precise calculation regardless. The stats cache key includes the chosen mode, so switching between exact and approximate runs won't return stale results. Requires a little-endian target (Intel, AMD, Apple Silicon, ARM — everything except IBM s390x, which gets a clear error).

See also: /docs/help/stats.md, docs/STATS_DEFINITIONS.md — definitions of every metric, moarstats, Stats Cache & Caching.

`moarstats`

Adds 25+ univariate stats (Pearson's second skewness, IQR-to-range ratio, Z-scores of min/max/mode, quartile coefficient of dispersion, …) plus W3C XSD datatype mapping to an existing .stats.csv. Multithreaded with an index.

moarstats looks for <FILESTEM>.stats.csv and runs stats first if it doesn't exist.

Example: extend allegheny property sales stats

qsv moarstats allegheny_property_sales.csv > allegheny.moarstats.csv

Example: build the cache first, then add moar stats

qsv stats --everything --stats-jsonl allegheny_property_sales.csv
qsv moarstats allegheny_property_sales.csv

See also: /docs/help/moarstats.md, stats, pragmastat — the robust-stats counterpart.

`frequency`

Frequency distribution tables per column. Output as CSV (default), JSON, or TOON (compact LLM-friendly format). Multithreaded with an index. The stats cache gives a big speedup on ID-like (all-unique) columns.

For columns with millions of unique values, use --sketch-method frequent_items (Apache DataSketches Misra-Gries) — bounded memory, top-K with bounded error.

Example: top-10 NYC 311 complaint types

qsv frequency --select 'Complaint Type' --limit 10 NYC_311_SR_2010-2020-sample-1M.csv

field,value,count,percentage,rank
Complaint Type,Noise - Residential,71872,7.19,1
Complaint Type,HEATING,67234,6.72,2
Complaint Type,GENERAL CONSTRUCTION,52111,5.21,3
...
Complaint Type,Other,538295,53.83,0

Example: build a fresh-cached frequency table for downstream consumers

qsv index NYC_311_SR_2010-2020-sample-1M.csv
qsv stats --cardinality --stats-jsonl NYC_311_SR_2010-2020-sample-1M.csv
qsv frequency --frequency-jsonl NYC_311_SR_2010-2020-sample-1M.csv
ls NYC_311_SR_2010-2020-sample-1M.freq.csv.data.jsonl

Example: top-100 cities by frequency, JSON output for an LLM prompt

qsv frequency --select AccentCity --limit 100 --asc wcp.csv > top_cities.csv
qsv frequency --select AccentCity --limit 100 --asc --json wcp.csv > top_cities.json
qsv frequency --select AccentCity --limit 100 --asc --toon wcp.csv > top_cities.toon

Example: heavy-hitter detection on a huge cardinality column

qsv frequency --select 'Session ID' \
  --sketch-method frequent_items \
  --sketch-map-size 16384 \
  --limit 50 events.csv > hot_sessions.csv

Important

New in 20.1.0 — automatic on OOM. frequency now auto-enables --sketch-method frequent_items (Misra-Gries) when the file is likely bigger than RAM and emits a wwarn!. Pass --sketch-method exact to force exact frequencies regardless. Note that frequent_items tracks heavy hitters only — --asc, --ignore-case, --no-trim, --frequency-jsonl, --json, and weighted frequencies are incompatible and require exact. Requires a little-endian target.

Example: filter columns dynamically with a Luau expression on stats

# Skip floats and columns with > 1,000 nulls
qsv frequency --stats-filter "type == 'Float' or nullcount > 1000" data.csv

--stats-filter requires the luau feature and the stats cache.

See also: /docs/help/frequency.md, stats (pre-populate the cache for speed), Stats Cache & Caching.

`pragmastat`

Robust statistics from the Pragmastat library — designed for heavy-tailed, outlier-prone data where mean and stddev mislead. Computes Hodges-Lehmann center (median of pairwise averages) and Shamos spread (median of pairwise absolute differences), each with confidence bounds. Both estimators tolerate up to 29 % corruption.

By default, pragmastat appends 7 ps_* columns to the existing stats cache.

Example: robust spread for Allegheny property sales (highly skewed)

qsv stats -E --infer-dates --stats-jsonl allegheny_property_sales.csv
qsv pragmastat --select 'Sale Price' allegheny_property_sales.csv

The --standalone flag emits standalone CSV instead of appending to the cache.

Example: two-sample comparison — does Borough A differ from Borough B?

qsv pragmastat --twosample --select 'priceA,priceB' market_split.csv

Example: confirmatory test — is median latency below 100 ms?

qsv pragmastat --compare1 'center:100' --select latency_ms requests.csv

See also: /docs/help/pragmastat.md, Pragmastat library, stats, moarstats.

`dedup`

Drop duplicate rows. By default it sorts in memory then deduplicates — use --sorted to stream when input is already sorted (constant memory). With -D <file> it writes the duplicates to a separate file for audit.

Example: deduplicate Boston 311 by composite key

qsv dedup --select 'Created Date,Incident Address' boston311-100.csv > unique.csv

Example: sort then stream-dedup a contact list

qsv sort --select Email contacts.csv \
  | qsv dedup --select Email --sorted -D dupes.csv > unique.csv

Example: case-insensitive dedup of email addresses

qsv dedup --select Email --ignore-case contacts.csv > unique.csv

For files larger than RAM use extdedup.

See also: /docs/help/dedup.md, sort, sortcheck, extdedup.

`extdedup`

Memory-mapped, on-disk hash table dedup for arbitrarily large files. Unlike dedup, it preserves input order. Two modes: CSV (with --select) and line-by-line (any text file).

Example: dedup a 50 GB clickstream by session_id without loading it into memory

qsv extdedup --select session_id huge_clickstream.csv > unique.csv

Example: line-mode dedup on a log file

qsv extdedup --no-headers nginx-access.log > unique-lines.log

See also: /docs/help/extdedup.md, extsort, Recipe: Larger-than-RAM CSV.

`extsort`

Multithreaded external merge sort for files larger than RAM. Two modes: CSV (with --select, requires an index) and line-mode (any text file).

Example: sort a 16 GB NYC 311 export by Created Date for time-series joins

qsv index nyc311-full.csv
qsv extsort --select 'Created Date' nyc311-full.csv > nyc311-by-date.csv

Example: sort a 100 GB log file alphabetically (line mode)

qsv extsort --no-headers huge-log.txt > sorted-log.txt

After extsort, follow up with dedup --sorted for streaming dedup, or joinp asof for time-series joins.

See also: /docs/help/extsort.md, sort, dedup --sorted, Recipe: Larger-than-RAM CSV.

Whitespace markers

Both stats and frequency accept --vis-whitespace, which replaces whitespace characters in the output (e.g. min/max/mode/antimode and frequency values) with visible markers, so leading, trailing, and embedded whitespace is easy to spot. Spaces are only marked when a value is entirely spaces — a value of three spaces becomes 《_》《_》《_》, while embedded single spaces in mixed text are left as-is.

Character	Unicode	Marker
space (only when the whole value is spaces)	`U+0020`	`《_》`
tab	`U+0009`	`《→》`
newline	`U+000A`	`《¶》`
carriage return	`U+000D`	`《⏎》`
vertical tab	`U+000B`	`《⋮》`
form feed	`U+000C`	`《␌》`
next line	`U+0085`	`《␤》`
left-to-right mark	`U+200E`	`《␎》`
right-to-left mark	`U+200F`	`《␏》`
line separator	`U+2028`	`《␊》`
paragraph separator	`U+2029`	`《␍》`
non-breaking space	`U+00A0`	`《⍽》`
em space	`U+2003`	`《emsp》`
figure space	`U+2007`	`《figsp》`
zero-width space	`U+200B`	`《zwsp》`

The first three follow Rust's whitespace reference; the rest cover other common invisible characters. The canonical mapping is WHITESPACE_MARKERS in src/util.rs.

Aggregation and Statistics

Aggregation & Statistics

Quick decision table

stats

moarstats

frequency

pragmastat

dedup

extdedup

extsort

Whitespace markers

See also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Get Started

Command Reference

Cookbook

Tuning & Internals

Ecosystem

Reference

Legacy

Clone this wiki locally

`stats`

`moarstats`

`frequency`

`pragmastat`

`dedup`

`extdedup`

`extsort`