Joel Natividad edited this page Jun 3, 2026 · 4 revisions

Why qsv?

Tier: Beginner

You already have awk, pandas, csvkit, miller, xsv, duckdb, Excel, and Polars. Why pick up another CSV tool? Here's the short pitch.

Speed that changes how you work

qsv is fast enough that you stop noticing it. A few headline numbers, all measured against real public datasets:

  • 48 statistical measures for every column of a 2.7M-row CSV in under a second (benchmark). With an index, faster still.
  • Index a 15 GB / 28M-row NYC 311 dataset in ~14 seconds. After that, count, sample, and slice are instantaneous.
  • Validate a 1M-row CSV against a JSON Schema 2020-12 spec at up to 780,000 rows/sec — see docs/Validate.md.
  • Geocode 360,000 records per second against a local Geonames mirror — see docs/help/geocode.md.
  • Diff two 1M × 9-column CSVs in under 600 ms.

For full benchmarks and reproduction instructions, see docs/BENCHMARKS.md and qsv.dathere.com/benchmarks.
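
A minimal sketch of the index-then-query workflow behind the numbers above. The file cities.csv, its columns, and its values are made up for illustration, and the qsv calls only run if qsv is on your PATH. A four-row file says nothing about the benchmark figures; it only shows the shape of the commands.

```shell
# Hypothetical sample data; substitute your own CSV.
workdir=$(mktemp -d)
csv="$workdir/cities.csv"
printf '%s\n' \
  'city,country,population' \
  'Alphaville,us,4000' \
  'Betatown,us,3000' \
  'Gammaburg,ph,2000' \
  'Deltaport,ph,1000' > "$csv"

if command -v qsv >/dev/null 2>&1; then
  qsv index "$csv"                     # build an index alongside the CSV
  qsv count "$csv"                     # row count, header excluded
  qsv slice --start 1 --len 2 "$csv"   # two rows starting at zero-based row 1
  qsv sample 2 "$csv"                  # two randomly chosen rows
else
  echo "qsv not found on PATH; skipping the qsv calls" >&2
fi
```

On a large file, the index is what lets count, slice, and sample avoid rescanning the whole CSV.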

Tip

Why it matters: when your tooling is sub-second, you experiment more, you check assumptions more often, and you ship cleaner data. The first thing most new users say is "wait, that's how fast stats runs?"

Composable Unix-style pipelines

qsv follows the Unix philosophy: 70+ single-purpose commands that compose via stdin/stdout. The example from Getting Started chains five commands to find the top 10 US cities by population:

qsv search --select Country '^us$' wcp.csv \
  | qsv sort --select Population --numeric --reverse \
  | qsv slice --len 10 \
  | qsv select 'AccentCity,Region,Population' \
  | qsv table

No SQL parser, no DataFrame API, no schema declaration. Each step reads from stdin and writes to stdout, though sort has to buffer its input in memory before it can emit rows.

Batteries included

When you need more than streaming row ops, qsv has it built in:

  • SQL on CSV / Parquet / Arrow / JSONL: sqlp runs Polars SQL (PostgreSQL dialect) and can process files larger than RAM. See SQL & Polars.
  • joinp for asof, non-equi, and outer joins — Polars-powered, multithreaded, larger-than-RAM. See Joins & Set Ops.
  • Two embedded DSLs: Luau (version 0.720, with BEGIN/MAIN/END blocks and lookup tables) and Python (f-string expressions per row). See Scripting (Luau / Python).
  • MiniJinja templating — for report generation (template) and HTTP POST bodies (fetchpost). See HTTP & Web.
  • JSON Schema 2020-12 validation with custom keywords (currency, dynamicEnum, uniqueCombinedWith). See Validation & Schema.
  • Geocoding against an updatable local Geonames mirror — no network calls at runtime. See Geospatial.
  • HTTP fetching with HTTP/2 flow control, RFC RateLimit-aware throttling, Redis/disk caching, and jaq JSON extraction. See HTTP & Web.
  • AI-driven data dictionaries: describegpt produces neuro-symbolic descriptions and SQL RAG sessions against any OpenAI-compatible LLM (including local Ollama / Jan / LM Studio). See AI & Documentation.

No database required — but speaks Polars/SQL when you want it

You don't have to load anything into a database. Files are the table. But if you want SQL, point sqlp at one or many CSVs (or Parquet, or JSONL) and write a query — Polars handles the rest.
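
A hedged sketch of what "files are the table" looks like in practice. It assumes sqlp exposes each input file under its file stem (so cities.csv becomes the table cities) and that your qsv build includes sqlp. The sample file and its values are hypothetical.

```shell
# Hypothetical sample data; substitute your own CSV.
workdir=$(mktemp -d)
csv="$workdir/cities.csv"
printf '%s\n' \
  'city,country,population' \
  'Alphaville,us,4000' \
  'Betatown,us,3000' \
  'Gammaburg,ph,2000' \
  'Deltaport,ph,1000' > "$csv"

# Assumption: the table name is the file stem, i.e. "cities".
query='select country, sum(population) as total from cities group by country order by total desc'

if command -v qsv >/dev/null 2>&1; then
  # Run from the data directory so the file is referenced by its bare name.
  (cd "$workdir" && qsv sqlp cities.csv "$query")
else
  echo "qsv not found on PATH; skipping the sqlp call" >&2
fi
```

Nothing is imported or declared beforehand; the query runs directly against the file on disk.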

48 statistical measures, with guaranteed type inference

qsv stats doesn't sample-and-guess. For every column it produces:

  • Guaranteed data type inference (Null / String / Float / Integer / Date / DateTime / Boolean)
  • Sum, mean, stddev, variance, min, max, range, geometric/harmonic mean
  • Cardinality, mode/antimode (with weights), sortiness
  • Median, quartiles, percentiles, MAD, IQR, skewness, kurtosis
  • Plus extended outlier, robust, and bivariate stats via moarstats (55 more measures)

See docs/STATS_DEFINITIONS.md for the full list and Aggregation & Statistics for usage.
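
As a rough illustration, the sketch below profiles a tiny made-up file and keeps only a few of the output columns. The file name, column names, and values are hypothetical, and the column names passed to select (field, type, min, max) are assumed to match the stats output header of your qsv version.

```shell
# Hypothetical sample data with a numeric, a text, and a date column.
workdir=$(mktemp -d)
csv="$workdir/orders.csv"
printf '%s\n' \
  'order_id,customer,amount,ordered_on' \
  '1,acme,19.99,2024-01-05' \
  '2,acme,5.00,2024-01-06' \
  '3,globex,120.50,2024-02-01' \
  '4,initech,42.00,2024-02-14' > "$csv"

if command -v qsv >/dev/null 2>&1; then
  # --everything turns on the more expensive measures (median, quartiles,
  # cardinality, modes); --infer-dates lets stats classify date columns.
  qsv stats --everything --infer-dates "$csv" \
    | qsv select field,type,min,max \
    | qsv table
else
  echo "qsv not found on PATH; skipping the stats call" >&2
fi
```

Dropping the select step shows every measure stats produced for each column.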

A real ecosystem

qsv is the engine inside several adjacent projects. See Integrations for the full picture.

Four binary variants, one toolbox

Variant   Size      Best for
qsv       full      Day-to-day workstation use
qsvlite   ~16%      xsv migrants, minimal install, low-resource environments
qsvdp     ~16%      DataPusher+ / CKAN data pipelines
qsvmcp    smaller   MCP server deployments

See Binary Variants for the full feature matrix.

When to use qsv vs other tools

Short version:

  • qsv vs xsv — qsv is a fork of xsv with vastly more commands, multithreading, indexing, and active development. If you like xsv, install qsvlite and you have a drop-in.
  • qsv vs csvkit — qsv is roughly 10×–14× faster on real workloads and has more commands. csvkit is Python, easy to extend in Python; qsv is Rust, easy to script in shell.
  • qsv vs miller — overlap is significant; miller is more general (TSV/JSON/DKVP shapes), qsv is more CSV-specialized with deeper stats/validation. Use whichever you reach for first.
  • qsv vs DuckDB CSV reader — they complement each other. qsv to parquet + DuckDB is a great pipeline. qsv specializes in pre-DB cleaning, profiling, validation, and enrichment.
  • qsv vs pandas — pandas is in-memory and Python-native. qsv is streaming, shell-native, and faster for big-CSV profiling. The two coexist in many notebooks (see Integrations).

Full comparison: Comparison vs others.

Try it now

# Run qsv online — no install required:
open https://qsv.dathere.com

Or install it locally and follow Getting Started.
