Stats Cache and Caching

Stats Cache & Caching

qsv caches at four levels. The flagship is the stats cache — feeding pre-computed statistics into "smart" downstream commands so they don't redo work. The other three are the frequency cache, the fetch / fetchpost / describegpt HTTP cache, and the Luau lookup-table cache.

This page explains the file layouts, who reads them, when they invalidate, and how to deliberately bust them.

The big idea

qsv leans heavily on sidecar files next to your CSV. Run stats once, and downstream commands get smarter and faster without re-reading the source file. The convention is:

File	Producer	Consumers
`<csv>.idx`	`qsv index`	`count`, `sample`, `slice`, `stats`, `frequency`, `split`, `schema`, `luau`
`<csv>.stats.csv`	`qsv stats`	Human-readable; not parsed by other commands
`<csv>.stats.csv.data.jsonl`	`qsv stats --stats-jsonl`	The actual cache used by `frequency`, `schema`, `validate`, `pragmastat`, `pivotp`, `sqlp scoresql`, `tojsonl`, `sample`, `describegpt`
`<csv>.stats.csv.json`	`qsv stats`	`stats` itself — records the arguments and qsv version used to build the cache; checked on the next run to decide whether the cache is still valid or must be recomputed
`<csv>.freq.csv.data.jsonl`	`qsv frequency --frequency-jsonl`	`describegpt`, `scoresql`, future smart commands
`<csv>.schema.json`	`qsv schema`	`validate`
`<csv>.pschema.json`	`qsv schema --polars`	`sqlp`, `joinp`, `pivotp`

The stats cache

Layout

my_file.csv                        — source
my_file.csv.idx                    — row-offset index
my_file.stats.csv                  — human-readable stats CSV
my_file.stats.csv.data.jsonl       — machine-readable JSONL (the actual cache)
my_file.stats.csv.json             — cache signature (args + qsv version) used for invalidation

Producing it

The minimum to be useful:

qsv stats --stats-jsonl my_file.csv

The full version for max downstream benefit:

qsv stats --everything --infer-dates --infer-boolean --cardinality --stats-jsonl my_file.csv

Or — set an env var to let qsv create it automatically the next time a smart command needs it:

export QSV_STATSCACHE_MODE=force

Three modes:

auto (default) — use the cache if it exists and is fresh
force — auto-create the cache on first smart-command invocation
none — ignore the cache entirely

Who reads it (the "smart" commands, 🪄)

frequency — short-circuits all-unique columns (rowcount == cardinality) using a sentinel ALL_UNIQUE value; needs the cache to handle ID columns without OOM
schema — skips redundant type inference; faster generation
validate — uses inferred ranges, formats, and patterns from the cache
pragmastat — date-aware mode requires the cache (stats -E --infer-dates --stats-jsonl)
pivotp — smart aggregation auto-selection based on data type
tojsonl — smart JSON type inference (string/number/bool/null per column)
sample — extra checks for --systematic, --weighted, --cluster modes
sqlp scoresql — query plan analysis (filter selectivity, join cardinality)
describegpt — deterministic statistical context for the LLM

Invalidation

The cache is timestamp-keyed against the source CSV. If the source file's mtime changes, the cache is considered stale and silently regenerated (unless QSV_STATSCACHE_MODE=none).

Deliberate bust:

rm my_file.stats.csv my_file.stats.csv.data.jsonl
# Or force a rerun:
qsv stats --force --stats-jsonl my_file.csv

The frequency cache

Layout

my_file.freq.csv.data.jsonl        — frequency distribution per column

Producing it

qsv frequency --frequency-jsonl my_file.csv

Stored as JSONL with metadata per column. Two sentinel values appear for memory-bounded columns:

ALL_UNIQUE — when rowcount == cardinality (typical of ID columns)
HIGH_CARDINALITY — when cardinality exceeds the smaller of QSV_FREQ_HIGH_CARD_THRESHOLD (default 100) and QSV_FREQ_HIGH_CARD_PCT (default 90%) of rowcount

Who reads it

describegpt — frequency context for LLM data dictionaries
sqlp scoresql — filter-selectivity scoring

Invalidation

Same as the stats cache: stale if source CSV's mtime is newer. The cache is NOT used when --ignore-case, --no-trim, or --weight are active (those change how values are bucketed). Use --force to regenerate.

The Polars schema cache

Layout

my_file.pschema.json               — Polars schema (column types + Polars dtypes)

Producing it

qsv schema --polars my_file.csv

Who reads it

Every Polars-powered command: sqlp, joinp, pivotp (and lens, count, color, prompt, scoresql when reading Polars-supported formats).

Pre-computed Polars dtypes mean Polars skips the inference scan it would otherwise do. Big speedup for repeated queries against the same file. Generate it once, check it into git, every Polars command becomes faster.

Invalidation

If the file's columns change, regenerate:

qsv schema --polars --force my_file.csv

The HTTP cache (fetch / fetchpost / describegpt)

Four cache backends, chosen via CLI flag:

In-memory LRU (default) — non-persistent, 2M entries, lost on process exit. Tune with --mem-cache-size.
Disk (--disk-cache) — stored at ~/.qsv-cache/fetch/ (or --disk-cache-dir). TTL: 28 days (QSV_DISKCACHE_TTL_SECS).
Redis (--redis-cache) — shared across machines. Default redis://127.0.0.1:6379/1. TTL: 28 days (QSV_REDIS_TTL_SECS).
None (--no-cache) — for live data.

`fetch` keys

Cache keys are the URL + GET parameters. Two calls to the same URL return the cached value (within TTL).

`fetchpost` keys

Cache keys include the POST body — so two identical posts share a cache slot.

`describegpt` keys

Cache keys include the prompt and the data context — so identical questions against unchanged data are cached.

Invalidation

By TTL or by manual delete:

# Disk cache
rm -rf ~/.qsv-cache/fetch/

# Redis cache (database 1 for fetch, 2 for fetchpost, 3 for describegpt)
redis-cli -n 1 FLUSHDB

Or set QSV_DISKCACHE_TTL_REFRESH / QSV_REDIS_TTL_REFRESH to refresh TTL on cache hit (keeps hot URLs cached indefinitely).

See HTTP & Web and Recipe: Fetch & Cache.

The Luau lookup-table cache

When a Luau script calls qsv_register_lookup("table", "URL_OR_PATH"):

Local files are read directly.
Remote URLs (http://, https://, dathere://, ckan://) are downloaded and cached in $QSV_CACHE_DIR (default ~/.qsv-cache/).

The cache key includes the URL. Subsequent runs reuse the cached copy until you delete it.

See Lookup Tables and Scripting (Luau / Python) → luau.

Cache-busting cheatsheet

Cache	Bust by
Stats / frequency / Polars schema / JSON schema	`qsv <cmd> --force <file>` or `rm <file>.stats.csv*` etc.
Source-CSV-driven (auto)	Modify the source CSV (mtime change auto-invalidates)
Index	`rm <file>.idx` or `qsv index --force <file>`
HTTP disk cache	`rm -rf ~/.qsv-cache/fetch/`
HTTP Redis cache	`redis-cli -n <db> FLUSHDB`
Lookup-table cache	`rm -rf $QSV_CACHE_DIR/lookup-tables/`

CI / pipeline gotchas

Commit .stats.csv.data.jsonl and .pschema.json to git for reference datasets you query repeatedly. Tiny files, huge speedup.
Don't commit .idx — they're regenerated cheaply and platform-specific.
In CI, prefer QSV_STATSCACHE_MODE=force — eliminates "did I remember to run stats first?" mistakes.
For multi-stage Make-based pipelines, list the cache files as explicit dependencies so make invalidates them when the source changes.

Stats Cache and Caching

Stats Cache & Caching

The big idea

The stats cache

Layout

Producing it

Who reads it (the "smart" commands, 🪄)

Invalidation

The frequency cache

Layout

Producing it

Who reads it

Invalidation

The Polars schema cache

Layout

Producing it

Who reads it

Invalidation

The HTTP cache (fetch / fetchpost / describegpt)

fetch keys

fetchpost keys

describegpt keys

Invalidation

The Luau lookup-table cache

Cache-busting cheatsheet

CI / pipeline gotchas

See also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

`fetch` keys

`fetchpost` keys

`describegpt` keys