Skip to content

Stats Cache and Caching

Joel Natividad edited this page Jun 7, 2026 · 3 revisions

Stats Cache & Caching

Tier: Advanced

qsv caches at four levels. The flagship is the stats cache — feeding pre-computed statistics into "smart" downstream commands so they don't redo work. The other three are the frequency cache, the fetch / fetchpost / describegpt HTTP cache, and the Luau lookup-table cache.

This page explains the file layouts, who reads them, when they invalidate, and how to deliberately bust them.

The big idea

qsv leans heavily on sidecar files next to your CSV. Run stats once, and downstream commands get smarter and faster without re-reading the source file. The convention is:

File Producer Consumers
<csv>.idx qsv index count, sample, slice, stats, frequency, split, schema, luau
<csv>.stats.csv qsv stats Human-readable; not parsed by other commands
<csv>.stats.csv.data.jsonl qsv stats --stats-jsonl The actual cache used by frequency, schema, validate, pragmastat, pivotp, sqlp scoresql, tojsonl, sample, describegpt
<csv>.stats.csv.json qsv stats stats itself — records the arguments and qsv version used to build the cache; checked on the next run to decide whether the cache is still valid or must be recomputed
<csv>.freq.csv.data.jsonl qsv frequency --frequency-jsonl describegpt, scoresql, future smart commands
<csv>.schema.json qsv schema validate
<csv>.pschema.json qsv schema --polars sqlp, joinp, pivotp

The stats cache

Layout

my_file.csv                        — source
my_file.csv.idx                    — row-offset index
my_file.stats.csv                  — human-readable stats CSV
my_file.stats.csv.data.jsonl       — machine-readable JSONL (the actual cache)
my_file.stats.csv.json             — cache signature (args + qsv version) used for invalidation

Producing it

The minimum to be useful:

qsv stats --stats-jsonl my_file.csv

The full version for max downstream benefit:

qsv stats --everything --infer-dates --infer-boolean --cardinality --stats-jsonl my_file.csv

Or — set an env var to let qsv create it automatically the next time a smart command needs it:

export QSV_STATSCACHE_MODE=force

Three modes:

  • auto (default) — use the cache if it exists and is fresh
  • force — auto-create the cache on first smart-command invocation
  • none — ignore the cache entirely

Who reads it (the "smart" commands, 🪄)

  • frequency — short-circuits all-unique columns (rowcount == cardinality) using a sentinel ALL_UNIQUE value; needs the cache to handle ID columns without OOM
  • schema — skips redundant type inference; faster generation
  • validate — uses inferred ranges, formats, and patterns from the cache
  • pragmastat — date-aware mode requires the cache (stats -E --infer-dates --stats-jsonl)
  • pivotp — smart aggregation auto-selection based on data type
  • tojsonl — smart JSON type inference (string/number/bool/null per column)
  • sample — extra checks for --systematic, --weighted, --cluster modes
  • sqlp scoresql — query plan analysis (filter selectivity, join cardinality)
  • describegpt — deterministic statistical context for the LLM

Invalidation

The cache is timestamp-keyed against the source CSV. If the source file's mtime changes, the cache is considered stale and silently regenerated (unless QSV_STATSCACHE_MODE=none).

Deliberate bust:

rm my_file.stats.csv my_file.stats.csv.data.jsonl
# Or force a rerun:
qsv stats --force --stats-jsonl my_file.csv

The frequency cache

Layout

my_file.freq.csv.data.jsonl        — frequency distribution per column

Producing it

qsv frequency --frequency-jsonl my_file.csv

Stored as JSONL with metadata per column. Two sentinel values appear for memory-bounded columns:

  • ALL_UNIQUE — when rowcount == cardinality (typical of ID columns)
  • HIGH_CARDINALITY — when cardinality exceeds the smaller of QSV_FREQ_HIGH_CARD_THRESHOLD (default 100) and QSV_FREQ_HIGH_CARD_PCT (default 90%) of rowcount

Who reads it

  • describegpt — frequency context for LLM data dictionaries
  • sqlp scoresql — filter-selectivity scoring

Invalidation

Same as the stats cache: stale if source CSV's mtime is newer. The cache is NOT used when --ignore-case, --no-trim, or --weight are active (those change how values are bucketed). Use --force to regenerate.

The Polars schema cache

Layout

my_file.pschema.json               — Polars schema (column types + Polars dtypes)

Producing it

qsv schema --polars my_file.csv

Who reads it

Every Polars-powered command: sqlp, joinp, pivotp (and lens, count, color, prompt, scoresql when reading Polars-supported formats).

Pre-computed Polars dtypes mean Polars skips the inference scan it would otherwise do. Big speedup for repeated queries against the same file. Generate it once, check it into git, every Polars command becomes faster.

Invalidation

If the file's columns change, regenerate:

qsv schema --polars --force my_file.csv

The HTTP cache (fetch / fetchpost / describegpt)

Four cache backends, chosen via CLI flag:

  1. In-memory LRU (default) — non-persistent, 2M entries, lost on process exit. Tune with --mem-cache-size.
  2. Disk (--disk-cache) — stored at ~/.qsv-cache/fetch/ (or --disk-cache-dir). TTL: 28 days (QSV_DISKCACHE_TTL_SECS).
  3. Redis (--redis-cache) — shared across machines. Default redis://127.0.0.1:6379/1. TTL: 28 days (QSV_REDIS_TTL_SECS).
  4. None (--no-cache) — for live data.

fetch keys

Cache keys are the URL + GET parameters. Two calls to the same URL return the cached value (within TTL).

fetchpost keys

Cache keys include the POST body — so two identical posts share a cache slot.

describegpt keys

Cache keys include the prompt and the data context — so identical questions against unchanged data are cached.

Invalidation

By TTL or by manual delete:

# Disk cache
rm -rf ~/.qsv-cache/fetch/

# Redis cache (database 1 for fetch, 2 for fetchpost, 3 for describegpt)
redis-cli -n 1 FLUSHDB

Or set QSV_DISKCACHE_TTL_REFRESH / QSV_REDIS_TTL_REFRESH to refresh TTL on cache hit (keeps hot URLs cached indefinitely).

See HTTP & Web and Recipe: Fetch & Cache.

The Luau lookup-table cache

When a Luau script calls qsv_register_lookup("table", "URL_OR_PATH"):

  • Local files are read directly.
  • Remote URLs (http://, https://, dathere://, ckan://) are downloaded and cached in $QSV_CACHE_DIR (default ~/.qsv-cache/).

The cache key includes the URL. Subsequent runs reuse the cached copy until you delete it.

See Lookup Tables and Scripting (Luau / Python) → luau.

Cache-busting cheatsheet

Cache Bust by
Stats / frequency / Polars schema / JSON schema qsv <cmd> --force <file> or rm <file>.stats.csv* etc.
Source-CSV-driven (auto) Modify the source CSV (mtime change auto-invalidates)
Index rm <file>.idx or qsv index --force <file>
HTTP disk cache rm -rf ~/.qsv-cache/fetch/
HTTP Redis cache redis-cli -n <db> FLUSHDB
Lookup-table cache rm -rf $QSV_CACHE_DIR/lookup-tables/

CI / pipeline gotchas

  • Commit .stats.csv.data.jsonl and .pschema.json to git for reference datasets you query repeatedly. Tiny files, huge speedup.
  • Don't commit .idx — they're regenerated cheaply and platform-specific.
  • In CI, prefer QSV_STATSCACHE_MODE=force — eliminates "did I remember to run stats first?" mistakes.
  • For multi-stage Make-based pipelines, list the cache files as explicit dependencies so make invalidates them when the source changes.

See also

Clone this wiki locally