Joel Natividad edited this page Jun 7, 2026 · 4 revisions

Glossary

Tier: Beginner

Terms used across qsv's docs and this wiki, alphabetical.

Antimode

The least frequently occurring value(s) in a column — the opposite of mode. Useful for spotting anomalies. Configurable max length via QSV_ANTIMODES_LEN (default 100 chars). Computed by stats --everything or stats --mode. See Aggregation & Statistics → stats.
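To make the definition concrete, here is a minimal Python sketch that finds both the mode and the antimode of a column. It illustrates the concept only; the function name is hypothetical, and qsv's own handling (for example the QSV_ANTIMODES_LEN truncation) is not reproduced.

```python
from collections import Counter

def mode_and_antimode(values):
    """Return (modes, antimodes): the most and least frequent values."""
    counts = Counter(values)
    highest = max(counts.values())
    lowest = min(counts.values())
    # Ties are kept, so each result is a sorted list of values.
    modes = sorted(v for v, c in counts.items() if c == highest)
    antimodes = sorted(v for v, c in counts.items() if c == lowest)
    return modes, antimodes

colors = ["red", "blue", "red", "green", "red", "blue"]
modes, antimodes = mode_and_antimode(colors)
print(modes, antimodes)
```

Here "green" appears once and is the antimode, while "red" appears three times and is the mode.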

Apache DataSketches

A library of approximate algorithms with bounded memory and bounded error. qsv uses three:

  • t-digest — approximate quantiles (--quantile-method approx)
  • HyperLogLog — approximate cardinality (--cardinality-method approx)
  • Frequent Items (Misra-Gries) — approximate top-K (frequency --sketch-method frequent_items)

Trade ~1.5% error for O(1) memory per column. Critical for stats and frequency on huge files.
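The bounded-memory idea behind the Frequent Items sketch can be illustrated with the textbook Misra-Gries algorithm. The Python below is a sketch of that classic algorithm under simple assumptions, not the Apache DataSketches implementation and not qsv's code; the function name and the parameter k are illustrative.

```python
def misra_gries(stream, k):
    """Track at most k-1 candidate heavy hitters in a single pass."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free slot: decrement every counter, dropping any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
summary = misra_gries(stream, k=3)
print(summary)
```

The counts it reports are lower bounds: each can undercount the true frequency by at most n/k for a stream of n items, which is the kind of bounded error the entry refers to.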

Asof join

A time-series join that matches each left-side row to the nearest right-side row by a key — typically the nearest prior timestamp. Implemented by joinp --strategy backward. Use case: enrich an events log with the weather observation that was in effect at the time of each event. See Recipe: Multi-Table Joins.
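The backward strategy can be sketched in plain Python with a binary search. This is an illustration of the matching rule only, assuming the right side is already sorted by timestamp; the function and sample data are hypothetical and do not reflect how joinp or Polars implement it.

```python
from bisect import bisect_right

def asof_join_backward(left, right):
    """Attach to each left row the latest right row whose timestamp is <= the left one.

    Both inputs are lists of (timestamp, value) tuples; right must be sorted by timestamp.
    """
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, event in left:
        pos = bisect_right(right_ts, ts)  # count of right rows with timestamp <= ts
        match = right[pos - 1][1] if pos > 0 else None
        joined.append((ts, event, match))
    return joined

events = [(5, "login"), (12, "purchase"), (30, "logout")]
weather = [(10, "sunny"), (20, "rain")]
print(asof_join_backward(events, weather))
```

The event at time 5 has no prior observation, so it gets no match; the later events pick up the most recent observation at or before their timestamp.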

Automagical (🪄)

qsv's term for commands that use the stats / frequency cache to work smarter without user intervention. frequency short-circuits all-unique columns; pivotp auto-picks aggregations by data type; validate derives schema rules from inferred stats. See Stats Cache & Caching.

BLAKE3

A modern cryptographic hash function — faster than SHA-2, mmap-friendly, multithreaded. qsv's blake3 command computes BLAKE3 hashes for cache keys, pipeline gating, and integrity checks. See Indexing, Compression & Diff → blake3.

Cardinality

The number of distinct values in a column. A column with all-unique values has cardinality equal to row count. Use stats --cardinality to compute. The cardinality value drives a lot of qsv's automagic — frequency short-circuits high-cardinality columns; validate decides whether to use an enum constraint.

CKAN

An open-source data portal platform (ckan.org). qsv has first-class CKAN integration: safenames for CKAN Datastore column-name rules, applydp for DataPusher+ pipelines, ckan:// URL scheme in validate dynamicEnum and Luau qsv_register_lookup. See Recipe: CKAN Integration.

CSV / TSV / SSV

Comma-, tab-, and semicolon-separated values respectively. qsv auto-detects the delimiter based on file extension; override with --delimiter or QSV_DEFAULT_DELIMITER.

dc: (disk-cache input prefix)

A dc:<name> input handle that reads a resource previously cached by qsv get. Any qsv command can use it (e.g. qsv stats dc:data.csv); the blob is decompressed on the fly and stale entries are auto-refreshed per the entry's TTL/refresh policy. See Get & Disk Cache.

dyncols (Dynamic columns)

%dyncols: {col1:fmt1},{col2:fmt2} is qsv's syntax (used by geocode and apply) for producing multiple new columns from a single command in one pass. The left side of each : is the new column name; the right is the format spec.

Extended Input Support (🗄️)

Commands marked 🗄️ accept directories, .infile-list files, snappy-compressed inputs, and stdin uniformly. Examples: cat, headers, sqlp, to, validate. See README → Extended Input Support.

Frequent Items sketch

See Apache DataSketches.

Frequency distribution

A table with the columns field, value, count, percentage, and rank. Built by qsv frequency. The companion to stats for categorical columns.

Index (qsv-style)

A <file>.idx sidecar that maps row numbers to byte offsets. Built by qsv index. Lets count, sample, slice work in O(1) and enables multithreading for stats, frequency, split, schema. See Indexing, Compression & Diff → index.
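The row-number-to-byte-offset idea can be sketched in a few lines of Python. This is a deliberately naive illustration: it assumes no newlines inside quoted fields, keeps the offsets in memory, and says nothing about the actual .idx file layout. The helper names are hypothetical.

```python
import io

def build_offsets(stream):
    """Record the byte offset at which each line starts."""
    offsets = []
    pos = stream.tell()
    line = stream.readline()
    while line:
        offsets.append(pos)
        pos = stream.tell()
        line = stream.readline()
    return offsets

def read_row(stream, offsets, n):
    """Jump straight to data row n (0-based); offsets[0] is the header line."""
    stream.seek(offsets[n + 1])
    return stream.readline().decode("utf-8").rstrip("\r\n")

data = io.BytesIO(b"id,name\n1,ann\n2,bob\n3,cy\n")
offsets = build_offsets(data)
print(len(offsets) - 1)            # row count without rescanning the data
print(read_row(data, offsets, 2))  # direct seek to the third data row
```

Once the offsets exist, counting rows is a length lookup and fetching a row is a single seek, which is also what lets a file be split into independent chunks for multithreading.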

JaQ / jq

jaq is a Rust implementation of the jq JSON query language. The --jaq flag of qsv fetch and qsv fetchpost uses it to extract values from JSON responses. github.com/01mf02/jaq.

Lazy Frame (Polars)

Polars's deferred-evaluation API. A LazyFrame builds up a plan without reading data until you .collect(). qsv's sqlp, joinp, and pivotp use lazy frames internally to enable larger-than-RAM and multithreaded execution.

Lookup table

A reference CSV indexed by its first column, used at runtime by Luau / template / validate. URL schemes: file://, http(s)://, dathere://, ckan://. See Lookup Tables.

MAD (Median Absolute Deviation)

A robust measure of variability — median of |x - median(x)|. Tolerates outliers far better than stddev. Computed by qsv stats --mad.
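The formula translates directly into Python's standard library. The following is a small sketch of the definition above, not qsv's implementation; the function name is illustrative.

```python
from statistics import median, stdev

def mad(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median(abs(x - center) for x in values)

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

print(mad(clean), mad(with_outlier))      # the MAD is the same for both lists
print(stdev(clean), stdev(with_outlier))  # the standard deviation is not
```

Replacing a single value with an extreme outlier leaves the MAD unchanged in this example, while the standard deviation grows substantially.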

MCP (Model Context Protocol)

An open protocol for exposing tools to LLMs. qsv ships an MCP server (qsvmcp binary + .mcpb bundle) that turns ~60 qsv commands into tools any MCP-aware client can call. See MCP Server.

MiniJinja

A Rust port of Jinja2, the Python templating engine. Used by qsv template, qsv fetchpost --payload-tpl, and qsv's describegpt prompts. See Scripting (Luau / Python) → template.

Neuro-symbolic

A hybrid AI approach that combines symbolic computation (deterministic algorithms — qsv's stats and frequency caches) with neural computation (LLM-generated narrative). qsv describegpt is qsv's flagship neuro-symbolic command. See AI & Documentation → describegpt.

Non-equi join

A join where the matching predicate is not equality — typically a range comparison. joinp --non-equi 'salary_left >= min_right AND salary_left <= max_right' matches employees to salary bands. See Joins & Set Ops → joinp.
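The salary-band predicate can be illustrated with a nested loop in Python. This is a conceptual sketch with made-up data and a hypothetical function name; a real engine would typically avoid comparing every pair of rows.

```python
def non_equi_join(employees, bands):
    """Pair each employee with every band whose [low, high] range contains the salary."""
    matches = []
    for name, salary in employees:
        for band, low, high in bands:
            if low <= salary <= high:  # range predicate instead of equality
                matches.append((name, salary, band))
    return matches

employees = [("ann", 52000), ("bob", 91000), ("cy", 130000)]
bands = [("junior", 0, 59999), ("mid", 60000, 99999), ("senior", 100000, 149999)]
print(non_equi_join(employees, bands))
```

Because the predicate is a range test, a row can match zero, one, or several rows on the other side, depending on whether the ranges overlap.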

Polars

A vectorized DataFrame engine in Rust (pola.rs). Powers qsv's sqlp, joinp, pivotp, scoresql, plus reading capabilities in count, lens, color, prompt, schema. Polars supports larger-than-RAM and is highly multithreaded.

Polars schema

A <file>.pschema.json sidecar that tells Polars the column dtypes upfront. Skips the inference scan; speeds up sqlp / joinp / pivotp. Generated by qsv schema --polars. See Validation & Schema → schema.

Pragmastat

A robust-statistics library by Andrey Akinshin (pragmastat.dev). qsv's pragmastat command computes:

  • Hodges-Lehmann center — median of pairwise averages (tolerates 29% data corruption)
  • Shamos spread — median of pairwise absolute differences

Better than mean/stddev for skewed or outlier-heavy data. See Aggregation & Statistics → pragmastat.
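Both estimators can be written as brute-force Python following the definitions in the list above. This is a sketch under two assumptions: that the center includes each value paired with itself, and that the spread uses only distinct pairs. It runs in quadratic time and is not the pragmastat library's implementation.

```python
from itertools import combinations, combinations_with_replacement
from statistics import median

def hodges_lehmann_center(values):
    """Median of pairwise averages (assumption: self-pairs are included)."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(values, 2))

def shamos_spread(values):
    """Median of pairwise absolute differences over distinct pairs."""
    return median(abs(a - b) for a, b in combinations(values, 2))

sample = [1, 2, 3, 4, 100]
print(hodges_lehmann_center(sample))  # stays near the bulk of the data
print(shamos_spread(sample))
print(sum(sample) / len(sample))      # the mean is pulled toward the outlier
```

For this sample the center is 3 and the spread is 2.5, while the mean is 22 because of the single value 100.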

Pseudonymization

Replacing identifying values with stable, repeatable identifiers. Same input always maps to same output. qsv pseudo does this. For irreversible hashing, use qsv blake3 --keyed. See Transform & Reshape → pseudo.

Reservoir sampling

The default sampling method in qsv sample when there's no index. Visits each row exactly once, keeps memory proportional to sample size, produces a uniform-random sample. See the Wikipedia article on reservoir sampling.
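The classic single-pass version of this technique, often called Algorithm R, fits in a few lines of Python. This is a sketch of the textbook algorithm; the function name and the seed parameter are illustrative, and whether qsv uses this exact variant is not stated here.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Uniformly sample k rows from an iterable of unknown length in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)  # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row  # keep the new row with probability k / (i + 1)
    return reservoir

sample = reservoir_sample(range(1000), 5, seed=42)
print(sample)
```

Only the k-element reservoir is held in memory, regardless of how many rows stream past. If the input has fewer than k rows, all of them are returned.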

RFC 4180

The CSV format spec: comma delimiter, double-quote field quoting, CRLF line endings. qsv validate <file> (no schema) checks RFC 4180 conformance plus UTF-8 encoding.
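The three conventions are easy to see in output from Python's standard csv module, whose default dialect uses a comma delimiter, double-quote quoting with doubled embedded quotes, and CRLF line endings. The example only illustrates the format; it does not show what qsv validate checks internally.

```python
import csv
import io

buffer = io.StringIO(newline="")  # newline="" keeps the writer's CRLF untouched
writer = csv.writer(buffer)       # default dialect: comma, double quote, CRLF
writer.writerow(["id", "quote"])
writer.writerow([1, 'She said "hi", then left'])
encoded = buffer.getvalue()
print(repr(encoded))

# Reading it back recovers the original field, with quotes and comma intact.
rows = list(csv.reader(io.StringIO(encoded, newline="")))
print(rows)
```

The field containing a comma and quotes is wrapped in double quotes, and each embedded quote is doubled.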

Semantic Markdown

Human-readable markdown enriched with light, agent-parseable conventions that a companion converter turns into structured JSON (semanticmd.org). Available as the semanticmd output format in qsv describegpt (20.2.0+), which emits the Data Dictionary this way. See AI & Documentation → Semantic Markdown output.

Sortiness

A metric in qsv stats measuring how close a column is to being sorted, on a scale from -1 (perfectly reverse-sorted) to +1 (perfectly sorted), with 0 being completely random. Useful for picking whether to use dedup vs dedup --sorted (the latter is constant-memory streaming).
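One way to get a score with this range is to compare adjacent pairs. The Python below is a hypothetical definition chosen only to match the -1 to +1 scale described above; this page does not give qsv's actual formula, and the two may differ (for example in how ties are treated).

```python
def sortiness(values):
    """Score adjacent pairs: +1 for each in-order pair, -1 for each out-of-order pair."""
    pairs = list(zip(values, values[1:]))
    if not pairs:
        return 0.0  # zero or one value: no adjacent pairs to score
    in_order = sum(1 for a, b in pairs if a <= b)  # assumption: ties count as in order
    out_of_order = len(pairs) - in_order
    return (in_order - out_of_order) / len(pairs)

print(sortiness([1, 2, 3, 4]))  # fully ascending
print(sortiness([4, 3, 2, 1]))  # fully descending
print(sortiness([1, 3, 2, 4]))  # mostly ascending
```

Under this assumed definition, a fully ascending list scores 1.0, a fully descending list scores -1.0, and mixed orderings fall in between.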

Stats cache

A <file>.stats.csv.data.jsonl sidecar produced by qsv stats --stats-jsonl. Many smart commands consume it. The single biggest performance win for qsv users. See Stats Cache & Caching.

Streaming command

A command that processes one row at a time without loading the full file. Most qsv commands are streaming. Non-streaming commands are marked 🤯 in the README legend.

TOON (Token-Oriented Object Notation)

A compact JSON encoding designed for LLM prompt efficiency (toonformat.dev). Available as an output format in qsv describegpt and qsv frequency.

t-digest

See Apache DataSketches.

WKT (Well-Known Text)

A standard text encoding for vector geometries (points, lines, polygons). qsv's apply operations wkt_point builds WKT POINT(lat lon) values; geoconvert --geometry <col> reads WKT-formatted geometry columns. See Geospatial.

XSD (W3C XML Schema Definition)

A formal schema language for XML. qsv moarstats maps inferred column types to their most-specific XSD datatype — useful for data publishing.

