Skip to content

Aggregation and Statistics

Joel Natividad edited this page Jun 3, 2026 · 6 revisions

Aggregation & Statistics

Tier: Intermediate Commands covered: stats, moarstats, frequency, pragmastat, dedup, extdedup, extsort

Note

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

These commands turn raw rows into numbers. The headline feature is stats — 48 metrics in under a second on 2.7M rows. The runner-up is frequency — multithreaded with an index, Apache DataSketches-backed frequent_items mode for huge cardinalities.

Quick decision table

If you want to… Use Notes
Compute mean/median/stddev/etc. with guaranteed type inference stats 48 metrics; sub-second on millions of rows
Add bivariate, robust, and outlier stats to an existing stats CSV moarstats 25+ additional measures
See the top-N values per column (and how often each occurs) frequency Apache DataSketches mode for huge cardinalities
Robust median-of-pairwise stats (Hodges-Lehmann, Shamos) pragmastat Date/datetime aware via stats cache
Drop duplicate rows dedup Streaming with --sorted, in-memory otherwise
Drop duplicates from a file > RAM extdedup On-disk hash table; preserves input order
Sort a file > RAM extsort Multithreaded external merge sort

stats

The headline command. Computes up to 48 summary statistics with guaranteed data type inference (Null / String / Float / Integer / Date / DateTime / Boolean). Streaming by default; loads into memory for non-streaming stats like median / quartiles / cardinality. Multithreaded with an index.

A successful run writes a <filename>.stats.csv cache file that many other commands reuse (frequency, schema, pragmastat, pivotp, sqlp scoresql, etc.) — see Stats Cache & Caching.

Example: full profile of NYC 311 with date and boolean inference

qsv stats --everything --infer-dates --infer-boolean \
  NYC_311_SR_2010-2020-sample-1M.csv > nyc311-stats.csv

--everything adds cardinality, modes/antimodes, median, MAD, quartiles, IQR, fences, skewness, percentiles, and zero_padded_numeric (flags String columns that are entirely zero-padded numbers — zip codes, barcodes, padded IDs — so they aren't mistakenly loaded as integer/float downstream). --infer-dates recognizes 19 date formats; --infer-boolean detects 2-valued patterns like t/f, yes/no, 1/0.

The zero_padded_numeric column can also be enabled on its own with --zero-padded-numeric:

qsv stats --zero-padded-numeric data.csv

Example: pre-populate the stats cache to speed up downstream commands

# Run this once; every smart command (frequency, schema, pragmastat, pivotp, ...)
# will pick up the cache automatically.
qsv stats --cardinality --stats-jsonl NYC_311_SR_2010-2020-sample-1M.csv
ls NYC_311_SR_2010-2020-sample-1M.{stats.csv,stats.csv.data.jsonl}

Example: custom percentiles for a population study

qsv stats --quartiles --percentiles --percentile-list '5,10,25,50,75,90,95' \
  wcp.csv > wcp-percentiles.csv

The deciles and quintiles keywords expand to 10,20,30,...,90 and 20,40,60,80.

Example: weighted stats (e.g., revenue-weighted price)

qsv stats --weight 'Quantity' --quartiles \
  transactions.csv > price-weighted-by-volume.stats.csv

Example: approximate quantiles & cardinality on a 100M-row file

qsv stats --everything \
  --quantile-method approx \
  --cardinality-method approx \
  huge.csv

approx uses Apache DataSketches (t-digest for quantiles, HyperLogLog for cardinality) — constant memory per column, ~1 % rank error. Use this when exact memory blows up.

Important

New in 20.1.0 — automatic on OOM. When util::mem_file_check reports that the file is likely bigger than available RAM, stats now auto-enables --quantile-method approx and --cardinality-method approx and emits a wwarn! listing what got switched on. Pass --quantile-method exact / --cardinality-method exact to force the precise calculation regardless. The stats cache key includes the chosen mode, so switching between exact and approximate runs won't return stale results. Requires a little-endian target (Intel, AMD, Apple Silicon, ARM — everything except IBM s390x, which gets a clear error).

See also: /docs/help/stats.md, docs/STATS_DEFINITIONS.md — definitions of every metric, moarstats, Stats Cache & Caching.

moarstats

Adds 25+ univariate stats (Pearson's second skewness, IQR-to-range ratio, Z-scores of min/max/mode, quartile coefficient of dispersion, …) plus W3C XSD datatype mapping to an existing .stats.csv. Multithreaded with an index.

moarstats looks for <FILESTEM>.stats.csv and runs stats first if it doesn't exist.

Example: extend allegheny property sales stats

qsv moarstats allegheny_property_sales.csv > allegheny.moarstats.csv

Example: build the cache first, then add moar stats

qsv stats --everything --stats-jsonl allegheny_property_sales.csv
qsv moarstats allegheny_property_sales.csv

See also: /docs/help/moarstats.md, stats, pragmastat — the robust-stats counterpart.

frequency

Frequency distribution tables per column. Output as CSV (default), JSON, or TOON (compact LLM-friendly format). Multithreaded with an index. The stats cache gives a big speedup on ID-like (all-unique) columns.

For columns with millions of unique values, use --sketch-method frequent_items (Apache DataSketches Misra-Gries) — bounded memory, top-K with bounded error.

Example: top-10 NYC 311 complaint types

qsv frequency --select 'Complaint Type' --limit 10 NYC_311_SR_2010-2020-sample-1M.csv
field,value,count,percentage,rank
Complaint Type,Noise - Residential,71872,7.19,1
Complaint Type,HEATING,67234,6.72,2
Complaint Type,GENERAL CONSTRUCTION,52111,5.21,3
...
Complaint Type,Other,538295,53.83,0

Example: build a fresh-cached frequency table for downstream consumers

qsv index NYC_311_SR_2010-2020-sample-1M.csv
qsv stats --cardinality --stats-jsonl NYC_311_SR_2010-2020-sample-1M.csv
qsv frequency --frequency-jsonl NYC_311_SR_2010-2020-sample-1M.csv
ls NYC_311_SR_2010-2020-sample-1M.freq.csv.data.jsonl

Example: top-100 cities by frequency, JSON output for an LLM prompt

qsv frequency --select AccentCity --limit 100 --asc wcp.csv > top_cities.csv
qsv frequency --select AccentCity --limit 100 --asc --json wcp.csv > top_cities.json
qsv frequency --select AccentCity --limit 100 --asc --toon wcp.csv > top_cities.toon

Example: heavy-hitter detection on a huge cardinality column

qsv frequency --select 'Session ID' \
  --sketch-method frequent_items \
  --sketch-map-size 16384 \
  --limit 50 events.csv > hot_sessions.csv

Important

New in 20.1.0 — automatic on OOM. frequency now auto-enables --sketch-method frequent_items (Misra-Gries) when the file is likely bigger than RAM and emits a wwarn!. Pass --sketch-method exact to force exact frequencies regardless. Note that frequent_items tracks heavy hitters only — --asc, --ignore-case, --no-trim, --frequency-jsonl, --json, and weighted frequencies are incompatible and require exact. Requires a little-endian target.

Example: filter columns dynamically with a Luau expression on stats

# Skip floats and columns with > 1,000 nulls
qsv frequency --stats-filter "type == 'Float' or nullcount > 1000" data.csv

--stats-filter requires the luau feature and the stats cache.

See also: /docs/help/frequency.md, stats (pre-populate the cache for speed), Stats Cache & Caching.

pragmastat

Robust statistics from the Pragmastat library — designed for heavy-tailed, outlier-prone data where mean and stddev mislead. Computes Hodges-Lehmann center (median of pairwise averages) and Shamos spread (median of pairwise absolute differences), each with confidence bounds. Both estimators tolerate up to 29 % corruption.

By default, pragmastat appends 7 ps_* columns to the existing stats cache.

Example: robust spread for Allegheny property sales (highly skewed)

qsv stats -E --infer-dates --stats-jsonl allegheny_property_sales.csv
qsv pragmastat --select 'Sale Price' allegheny_property_sales.csv

The --standalone flag emits standalone CSV instead of appending to the cache.

Example: two-sample comparison — does Borough A differ from Borough B?

qsv pragmastat --twosample --select 'priceA,priceB' market_split.csv

Example: confirmatory test — is median latency below 100 ms?

qsv pragmastat --compare1 'center:100' --select latency_ms requests.csv

See also: /docs/help/pragmastat.md, Pragmastat library, stats, moarstats.

dedup

Drop duplicate rows. By default it sorts in memory then deduplicates — use --sorted to stream when input is already sorted (constant memory). With -D <file> it writes the duplicates to a separate file for audit.

Example: deduplicate Boston 311 by composite key

qsv dedup --select 'Created Date,Incident Address' boston311-100.csv > unique.csv

Example: sort then stream-dedup a contact list

qsv sort --select Email contacts.csv \
  | qsv dedup --select Email --sorted -D dupes.csv > unique.csv

Example: case-insensitive dedup of email addresses

qsv dedup --select Email --ignore-case contacts.csv > unique.csv

For files larger than RAM use extdedup.

See also: /docs/help/dedup.md, sort, sortcheck, extdedup.

extdedup

Memory-mapped, on-disk hash table dedup for arbitrarily large files. Unlike dedup, it preserves input order. Two modes: CSV (with --select) and line-by-line (any text file).

Example: dedup a 50 GB clickstream by session_id without loading it into memory

qsv extdedup --select session_id huge_clickstream.csv > unique.csv

Example: line-mode dedup on a log file

qsv extdedup --no-headers nginx-access.log > unique-lines.log

See also: /docs/help/extdedup.md, extsort, Recipe: Larger-than-RAM CSV.

extsort

Multithreaded external merge sort for files larger than RAM. Two modes: CSV (with --select, requires an index) and line-mode (any text file).

Example: sort a 16 GB NYC 311 export by Created Date for time-series joins

qsv index nyc311-full.csv
qsv extsort --select 'Created Date' nyc311-full.csv > nyc311-by-date.csv

Example: sort a 100 GB log file alphabetically (line mode)

qsv extsort --no-headers huge-log.txt > sorted-log.txt

After extsort, follow up with dedup --sorted for streaming dedup, or joinp asof for time-series joins.

See also: /docs/help/extsort.md, sort, dedup --sorted, Recipe: Larger-than-RAM CSV.

Whitespace markers

Both stats and frequency accept --vis-whitespace, which replaces whitespace characters in the output (e.g. min/max/mode/antimode and frequency values) with visible markers, so leading, trailing, and embedded whitespace is easy to spot. Spaces are only marked when a value is entirely spaces — a value of three spaces becomes 《_》《_》《_》, while embedded single spaces in mixed text are left as-is.

Character Unicode Marker
space (only when the whole value is spaces) U+0020 《_》
tab U+0009 《→》
newline U+000A 《¶》
carriage return U+000D 《⏎》
vertical tab U+000B 《⋮》
form feed U+000C 《␌》
next line U+0085 《␤》
left-to-right mark U+200E 《␎》
right-to-left mark U+200F 《␏》
line separator U+2028 《␊》
paragraph separator U+2029 《␍》
non-breaking space U+00A0 《⍽》
em space U+2003 《emsp》
figure space U+2007 《figsp》
zero-width space U+200B 《zwsp》

The first three follow Rust's whitespace reference; the rest cover other common invisible characters. The canonical mapping is WHITESPACE_MARKERS in src/util.rs.

See also

Clone this wiki locally