-
Notifications
You must be signed in to change notification settings - Fork 104
Aggregation and Statistics
Tier: Intermediate
Commands covered: stats, moarstats, frequency, pragmastat, dedup, extdedup, extsort
Note
Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.
These commands turn raw rows into numbers. The headline feature is stats — 48 metrics in under a second on 2.7M rows. The runner-up is frequency — multithreaded with an index, Apache DataSketches-backed frequent_items mode for huge cardinalities.
| If you want to… | Use | Notes |
|---|---|---|
| Compute mean/median/stddev/etc. with guaranteed type inference | stats |
48 metrics; sub-second on millions of rows |
| Add bivariate, robust, and outlier stats to an existing stats CSV | moarstats |
25+ additional measures |
| See the top-N values per column (and how often each occurs) | frequency |
Apache DataSketches mode for huge cardinalities |
| Robust median-of-pairwise stats (Hodges-Lehmann, Shamos) | pragmastat |
Date/datetime aware via stats cache |
| Drop duplicate rows | dedup |
Streaming with --sorted, in-memory otherwise |
| Drop duplicates from a file > RAM | extdedup |
On-disk hash table; preserves input order |
| Sort a file > RAM | extsort |
Multithreaded external merge sort |
The headline command. Computes up to 48 summary statistics with guaranteed data type inference (Null / String / Float / Integer / Date / DateTime / Boolean). Streaming by default; loads into memory for non-streaming stats like median / quartiles / cardinality. Multithreaded with an index.
A successful run writes a <filename>.stats.csv cache file that many other commands reuse (frequency, schema, pragmastat, pivotp, sqlp scoresql, etc.) — see Stats Cache & Caching.
Example: full profile of NYC 311 with date and boolean inference
qsv stats --everything --infer-dates --infer-boolean \
NYC_311_SR_2010-2020-sample-1M.csv > nyc311-stats.csv--everything adds cardinality, modes/antimodes, median, MAD, quartiles, IQR, fences, skewness, percentiles, and zero_padded_numeric (flags String columns that are entirely zero-padded numbers — zip codes, barcodes, padded IDs — so they aren't mistakenly loaded as integer/float downstream). --infer-dates recognizes 19 date formats; --infer-boolean detects 2-valued patterns like t/f, yes/no, 1/0.
The zero_padded_numeric column can also be enabled on its own with --zero-padded-numeric:
qsv stats --zero-padded-numeric data.csvExample: pre-populate the stats cache to speed up downstream commands
# Run this once; every smart command (frequency, schema, pragmastat, pivotp, ...)
# will pick up the cache automatically.
qsv stats --cardinality --stats-jsonl NYC_311_SR_2010-2020-sample-1M.csv
ls NYC_311_SR_2010-2020-sample-1M.{stats.csv,stats.csv.data.jsonl}Example: custom percentiles for a population study
qsv stats --quartiles --percentiles --percentile-list '5,10,25,50,75,90,95' \
wcp.csv > wcp-percentiles.csvThe deciles and quintiles keywords expand to 10,20,30,...,90 and 20,40,60,80.
Example: weighted stats (e.g., revenue-weighted price)
qsv stats --weight 'Quantity' --quartiles \
transactions.csv > price-weighted-by-volume.stats.csvExample: approximate quantiles & cardinality on a 100M-row file
qsv stats --everything \
--quantile-method approx \
--cardinality-method approx \
huge.csvapprox uses Apache DataSketches (t-digest for quantiles, HyperLogLog for cardinality) — constant memory per column, ~1 % rank error. Use this when exact memory blows up.
Important
New in 20.1.0 — automatic on OOM. When util::mem_file_check reports that the file is likely bigger than available RAM, stats now auto-enables --quantile-method approx and --cardinality-method approx and emits a wwarn! listing what got switched on. Pass --quantile-method exact / --cardinality-method exact to force the precise calculation regardless. The stats cache key includes the chosen mode, so switching between exact and approximate runs won't return stale results. Requires a little-endian target (Intel, AMD, Apple Silicon, ARM — everything except IBM s390x, which gets a clear error).
See also: /docs/help/stats.md, docs/STATS_DEFINITIONS.md — definitions of every metric, moarstats, Stats Cache & Caching.
Adds 25+ univariate stats (Pearson's second skewness, IQR-to-range ratio, Z-scores of min/max/mode, quartile coefficient of dispersion, …) plus W3C XSD datatype mapping to an existing .stats.csv. Multithreaded with an index.
moarstats looks for <FILESTEM>.stats.csv and runs stats first if it doesn't exist.
Example: extend allegheny property sales stats
qsv moarstats allegheny_property_sales.csv > allegheny.moarstats.csvExample: build the cache first, then add moar stats
qsv stats --everything --stats-jsonl allegheny_property_sales.csv
qsv moarstats allegheny_property_sales.csvSee also: /docs/help/moarstats.md, stats, pragmastat — the robust-stats counterpart.
Frequency distribution tables per column. Output as CSV (default), JSON, or TOON (compact LLM-friendly format). Multithreaded with an index. The stats cache gives a big speedup on ID-like (all-unique) columns.
For columns with millions of unique values, use --sketch-method frequent_items (Apache DataSketches Misra-Gries) — bounded memory, top-K with bounded error.
Example: top-10 NYC 311 complaint types
qsv frequency --select 'Complaint Type' --limit 10 NYC_311_SR_2010-2020-sample-1M.csvfield,value,count,percentage,rank
Complaint Type,Noise - Residential,71872,7.19,1
Complaint Type,HEATING,67234,6.72,2
Complaint Type,GENERAL CONSTRUCTION,52111,5.21,3
...
Complaint Type,Other,538295,53.83,0
Example: build a fresh-cached frequency table for downstream consumers
qsv index NYC_311_SR_2010-2020-sample-1M.csv
qsv stats --cardinality --stats-jsonl NYC_311_SR_2010-2020-sample-1M.csv
qsv frequency --frequency-jsonl NYC_311_SR_2010-2020-sample-1M.csv
ls NYC_311_SR_2010-2020-sample-1M.freq.csv.data.jsonlExample: top-100 cities by frequency, JSON output for an LLM prompt
qsv frequency --select AccentCity --limit 100 --asc wcp.csv > top_cities.csv
qsv frequency --select AccentCity --limit 100 --asc --json wcp.csv > top_cities.json
qsv frequency --select AccentCity --limit 100 --asc --toon wcp.csv > top_cities.toonExample: heavy-hitter detection on a huge cardinality column
qsv frequency --select 'Session ID' \
--sketch-method frequent_items \
--sketch-map-size 16384 \
--limit 50 events.csv > hot_sessions.csvImportant
New in 20.1.0 — automatic on OOM. frequency now auto-enables --sketch-method frequent_items (Misra-Gries) when the file is likely bigger than RAM and emits a wwarn!. Pass --sketch-method exact to force exact frequencies regardless. Note that frequent_items tracks heavy hitters only — --asc, --ignore-case, --no-trim, --frequency-jsonl, --json, and weighted frequencies are incompatible and require exact. Requires a little-endian target.
Example: filter columns dynamically with a Luau expression on stats
# Skip floats and columns with > 1,000 nulls
qsv frequency --stats-filter "type == 'Float' or nullcount > 1000" data.csv--stats-filter requires the luau feature and the stats cache.
See also: /docs/help/frequency.md, stats (pre-populate the cache for speed), Stats Cache & Caching.
Robust statistics from the Pragmastat library — designed for heavy-tailed, outlier-prone data where mean and stddev mislead. Computes Hodges-Lehmann center (median of pairwise averages) and Shamos spread (median of pairwise absolute differences), each with confidence bounds. Both estimators tolerate up to 29 % corruption.
By default, pragmastat appends 7 ps_* columns to the existing stats cache.
Example: robust spread for Allegheny property sales (highly skewed)
qsv stats -E --infer-dates --stats-jsonl allegheny_property_sales.csv
qsv pragmastat --select 'Sale Price' allegheny_property_sales.csvThe --standalone flag emits standalone CSV instead of appending to the cache.
Example: two-sample comparison — does Borough A differ from Borough B?
qsv pragmastat --twosample --select 'priceA,priceB' market_split.csvExample: confirmatory test — is median latency below 100 ms?
qsv pragmastat --compare1 'center:100' --select latency_ms requests.csvSee also: /docs/help/pragmastat.md, Pragmastat library, stats, moarstats.
Drop duplicate rows. By default it sorts in memory then deduplicates — use --sorted to stream when input is already sorted (constant memory). With -D <file> it writes the duplicates to a separate file for audit.
Example: deduplicate Boston 311 by composite key
qsv dedup --select 'Created Date,Incident Address' boston311-100.csv > unique.csvExample: sort then stream-dedup a contact list
qsv sort --select Email contacts.csv \
| qsv dedup --select Email --sorted -D dupes.csv > unique.csvExample: case-insensitive dedup of email addresses
qsv dedup --select Email --ignore-case contacts.csv > unique.csvFor files larger than RAM use extdedup.
See also: /docs/help/dedup.md, sort, sortcheck, extdedup.
Memory-mapped, on-disk hash table dedup for arbitrarily large files. Unlike dedup, it preserves input order. Two modes: CSV (with --select) and line-by-line (any text file).
Example: dedup a 50 GB clickstream by session_id without loading it into memory
qsv extdedup --select session_id huge_clickstream.csv > unique.csvExample: line-mode dedup on a log file
qsv extdedup --no-headers nginx-access.log > unique-lines.logSee also: /docs/help/extdedup.md, extsort, Recipe: Larger-than-RAM CSV.
Multithreaded external merge sort for files larger than RAM. Two modes: CSV (with --select, requires an index) and line-mode (any text file).
Example: sort a 16 GB NYC 311 export by Created Date for time-series joins
qsv index nyc311-full.csv
qsv extsort --select 'Created Date' nyc311-full.csv > nyc311-by-date.csvExample: sort a 100 GB log file alphabetically (line mode)
qsv extsort --no-headers huge-log.txt > sorted-log.txtAfter extsort, follow up with dedup --sorted for streaming dedup, or joinp asof for time-series joins.
See also: /docs/help/extsort.md, sort, dedup --sorted, Recipe: Larger-than-RAM CSV.
Both stats and frequency accept --vis-whitespace, which replaces whitespace characters in the output (e.g. min/max/mode/antimode and frequency values) with visible markers, so leading, trailing, and embedded whitespace is easy to spot. Spaces are only marked when a value is entirely spaces — a value of three spaces becomes 《_》《_》《_》, while embedded single spaces in mixed text are left as-is.
| Character | Unicode | Marker |
|---|---|---|
| space (only when the whole value is spaces) | U+0020 |
《_》 |
| tab | U+0009 |
《→》 |
| newline | U+000A |
《¶》 |
| carriage return | U+000D |
《⏎》 |
| vertical tab | U+000B |
《⋮》 |
| form feed | U+000C |
《␌》 |
| next line | U+0085 |
《》 |
| left-to-right mark | U+200E |
《␎》 |
| right-to-left mark | U+200F |
《␏》 |
| line separator | U+2028 |
《␊》 |
| paragraph separator | U+2029 |
《␍》 |
| non-breaking space | U+00A0 |
《⍽》 |
| em space | U+2003 |
《emsp》 |
| figure space | U+2007 |
《figsp》 |
| zero-width space | U+200B |
《zwsp》 |
The first three follow Rust's whitespace reference; the rest cover other common invisible characters. The canonical mapping is WHITESPACE_MARKERS in src/util.rs.
- Command Reference (index)
- Transform & Reshape — the preceding step in a typical pipeline
- Joins & Set Ops — combine after aggregating
-
SQL & Polars — for
GROUP BY+ window functions docs/STATS_DEFINITIONS.md- Stats Cache & Caching
-
Performance Tuning — when to index, when to use
approxmethods - Cookbook → Stats → Insights
- Cookbook → Larger-than-RAM CSV
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation