Stats Cache and Caching
Tier: Advanced
qsv caches at four levels. The flagship is the stats cache — feeding pre-computed statistics into "smart" downstream commands so they don't redo work. The other three are the frequency cache, the fetch / fetchpost / describegpt HTTP cache, and the Luau lookup-table cache.
This page explains the file layouts, who reads them, when they invalidate, and how to deliberately bust them.
qsv leans heavily on sidecar files next to your CSV. Run stats once, and downstream commands get smarter and faster without re-reading the source file. The convention is:
| File | Producer | Consumers |
|---|---|---|
| `<csv>.idx` | `qsv index` | count, sample, slice, stats, frequency, split, schema, luau |
| `<csv>.stats.csv` | `qsv stats` | Human-readable; not parsed by other commands |
| `<csv>.stats.csv.data.jsonl` | `qsv stats --stats-jsonl` | The actual cache used by frequency, schema, validate, pragmastat, pivotp, sqlp scoresql, tojsonl, sample, describegpt |
| `<csv>.stats.csv.json` | `qsv stats` | stats itself: records the arguments and qsv version used to build the cache; checked on the next run to decide whether the cache is still valid or must be recomputed |
| `<csv>.freq.csv.data.jsonl` | `qsv frequency --frequency-jsonl` | describegpt, scoresql, future smart commands |
| `<csv>.schema.json` | `qsv schema` | validate |
| `<csv>.pschema.json` | `qsv schema --polars` | sqlp, joinp, pivotp |
my_file.csv — source
my_file.csv.idx — row-offset index
my_file.stats.csv — human-readable stats CSV
my_file.stats.csv.data.jsonl — machine-readable JSONL (the actual cache)
my_file.stats.csv.json — cache signature (args + qsv version) used for invalidation
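The naming pattern in the listing above can be sketched as a small shell helper. This is only an illustration: `qsv_sidecars` is a hypothetical function, not part of qsv, and it assumes the rule implied by the `my_file.csv` example (the index appends `.idx` to the full file name, while the stats files replace the `.csv` extension).

```shell
# Hypothetical helper (not part of qsv): print the index and stats sidecar
# paths for a CSV, following the my_file.csv example above.
# Assumed naming rule: the index appends .idx to the full name,
# the stats files replace the trailing .csv.
qsv_sidecars() {
  local csv="$1"
  local stem="${csv%.csv}"   # strip a trailing .csv, if present
  printf '%s\n' \
    "${csv}.idx" \
    "${stem}.stats.csv" \
    "${stem}.stats.csv.data.jsonl" \
    "${stem}.stats.csv.json"
}
```

Calling `qsv_sidecars my_file.csv` prints the four paths, one per line, which is handy for checking which sidecars already exist.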
The minimum to be useful:

    qsv stats --stats-jsonl my_file.csv

The full version for max downstream benefit:

    qsv stats --everything --infer-dates --infer-boolean --cardinality --stats-jsonl my_file.csv

Or set an env var to let qsv create it automatically the next time a smart command needs it:

    export QSV_STATSCACHE_MODE=force

Three modes:

- `auto` (default): use the cache if it exists and is fresh
- `force`: auto-create the cache on first smart-command invocation
- `none`: ignore the cache entirely
Smart commands that read the stats cache:

- `frequency`: short-circuits all-unique columns (rowcount == cardinality) using a sentinel `ALL_UNIQUE` value; needs the cache to handle ID columns without running out of memory
- `schema`: skips redundant type inference; faster generation
- `validate`: uses inferred ranges, formats, and patterns from the cache
- `pragmastat`: date-aware mode requires the cache (`stats -E --infer-dates --stats-jsonl`)
- `pivotp`: smart aggregation auto-selection based on data type
- `tojsonl`: smart JSON type inference (string/number/bool/null per column)
- `sample`: extra checks for `--systematic`, `--weighted`, `--cluster` modes
- `sqlp scoresql`: query plan analysis (filter selectivity, join cardinality)
- `describegpt`: deterministic statistical context for the LLM
The cache is timestamp-keyed against the source CSV. If the source file's mtime changes, the cache is considered stale and silently regenerated (unless QSV_STATSCACHE_MODE=none).
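The mtime rule can be approximated from the shell, for example to decide in a script whether a stats run is needed. This is a minimal sketch, not qsv's internal logic: `stats_cache_is_stale` is a hypothetical name, the cache path follows the `my_file.csv` naming shown earlier, and "stale" is assumed to mean "cache missing or older than the source".

```shell
# Sketch of an mtime-based freshness check (an illustration of the rule
# above, not qsv's own code). Returns 0 (true) when the JSONL cache is
# missing or the source CSV has a newer modification time.
stats_cache_is_stale() {
  local csv="$1"
  local cache="${csv%.csv}.stats.csv.data.jsonl"
  [ ! -e "$cache" ] || [ "$csv" -nt "$cache" ]
}
```

A script might then guard the stats run with `if stats_cache_is_stale my_file.csv; then qsv stats --stats-jsonl my_file.csv; fi`.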
Deliberate bust:

    rm my_file.stats.csv my_file.stats.csv.data.jsonl
    # Or force a rerun:
    qsv stats --force --stats-jsonl my_file.csv

`my_file.freq.csv.data.jsonl`: frequency distribution per column

    qsv frequency --frequency-jsonl my_file.csv

Stored as JSONL with metadata per column. Two sentinel values appear for memory-bounded columns:
- `ALL_UNIQUE`: when `rowcount == cardinality` (typical of ID columns)
- `HIGH_CARDINALITY`: when cardinality exceeds the smaller of `QSV_FREQ_HIGH_CARD_THRESHOLD` (default 100) and `QSV_FREQ_HIGH_CARD_PCT` (default 90%) of rowcount
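One reading of the two sentinel rules, written out as shell arithmetic. This is a sketch of the description above rather than qsv's source: `classify_cardinality` and the `NORMAL` label are hypothetical, the defaults are the values quoted on this page, and the percentage is assumed to be a whole number applied with integer division.

```shell
# Sketch of the sentinel rule as described above (one interpretation of the
# text, not qsv's implementation). NORMAL is a made-up label for columns
# that get a regular frequency table.
classify_cardinality() {
  local cardinality="$1" rowcount="$2"
  local threshold="${QSV_FREQ_HIGH_CARD_THRESHOLD:-100}"
  local pct="${QSV_FREQ_HIGH_CARD_PCT:-90}"
  # pct% of rowcount, using integer division
  local pct_limit=$(( rowcount * pct / 100 ))
  # the smaller of the absolute threshold and the percentage-based limit
  local limit=$(( threshold < pct_limit ? threshold : pct_limit ))
  if [ "$cardinality" -eq "$rowcount" ]; then
    echo ALL_UNIQUE
  elif [ "$cardinality" -gt "$limit" ]; then
    echo HIGH_CARDINALITY
  else
    echo NORMAL
  fi
}
```

Under these assumptions, a 1000-row column with 150 distinct values exceeds the limit of 100 and would be marked `HIGH_CARDINALITY`, while one with 80 distinct values would not.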
Commands that read the frequency cache:

- `describegpt`: frequency context for LLM data dictionaries
- `sqlp scoresql`: filter-selectivity scoring
Invalidation works the same way as for the stats cache: the frequency cache is stale if the source CSV's mtime is newer. The cache is NOT used when `--ignore-case`, `--no-trim`, or `--weight` are active (those change how values are bucketed). Use `--force` to regenerate.
`my_file.pschema.json`: Polars schema (column types + Polars dtypes)

    qsv schema --polars my_file.csv

Read by every Polars-powered command: sqlp, joinp, pivotp (and lens, count, color, prompt, scoresql when reading Polars-supported formats).
Pre-computed Polars dtypes mean Polars skips the inference scan it would otherwise do, which is a big speedup for repeated queries against the same file. Generate the schema once and check it into git, and every Polars command that reads the file becomes faster.
If the file's columns change, regenerate:

    qsv schema --polars --force my_file.csv

Four cache backends, chosen via CLI flag:

- In-memory LRU (default): non-persistent, 2M entries, lost on process exit. Tune with `--mem-cache-size`.
- Disk (`--disk-cache`): stored at `~/.qsv-cache/fetch/` (or `--disk-cache-dir`). TTL: 28 days (`QSV_DISKCACHE_TTL_SECS`).
- Redis (`--redis-cache`): shared across machines. Default `redis://127.0.0.1:6379/1`. TTL: 28 days (`QSV_REDIS_TTL_SECS`).
- None (`--no-cache`): for live data.
For `fetch`, cache keys are the URL plus GET parameters, so two calls to the same URL return the cached value (within TTL).
For `fetchpost`, cache keys include the POST body, so two identical posts share a cache slot.
For `describegpt`, cache keys include the prompt and the data context, so identical questions against unchanged data are cached.
Entries expire by TTL, or you can delete them manually:

    # Disk cache
    rm -rf ~/.qsv-cache/fetch/
    # Redis cache (database 1 for fetch, 2 for fetchpost, 3 for describegpt)
    redis-cli -n 1 FLUSHDB

Or set `QSV_DISKCACHE_TTL_REFRESH` / `QSV_REDIS_TTL_REFRESH` to refresh TTL on cache hit (keeps hot URLs cached indefinitely).
See HTTP & Web and Recipe: Fetch & Cache.
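Because each HTTP-caching command uses its own Redis database, flushing the wrong one is an easy mistake. A small lookup helper can make scripts more explicit. This is a sketch: `qsv_redis_db` is a hypothetical function, and the numbers are simply the ones listed in the Redis comment above.

```shell
# Hypothetical helper (not part of qsv): map a command name to the Redis
# database number given on this page (1 = fetch, 2 = fetchpost,
# 3 = describegpt). Unknown names print an error and return non-zero.
qsv_redis_db() {
  case "$1" in
    fetch)       echo 1 ;;
    fetchpost)   echo 2 ;;
    describegpt) echo 3 ;;
    *) echo "unknown command: $1" >&2; return 1 ;;
  esac
}

# Example (not run here): flush only the fetchpost cache.
# redis-cli -n "$(qsv_redis_db fetchpost)" FLUSHDB
```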
When a Luau script calls qsv_register_lookup("table", "URL_OR_PATH"):
- Local files are read directly.
- Remote URLs (`http://`, `https://`, `dathere://`, `ckan://`) are downloaded and cached in `$QSV_CACHE_DIR` (default `~/.qsv-cache/`).
The cache key includes the URL. Subsequent runs reuse the cached copy until you delete it.
See Lookup Tables and Scripting (Luau / Python) → luau.
| Cache | Bust by |
|---|---|
| Stats / frequency / Polars schema / JSON schema | `qsv <cmd> --force <file>` or `rm <file>.stats.csv*` etc. |
| Source-CSV-driven (auto) | Modify the source CSV (mtime change auto-invalidates) |
| Index | `rm <file>.idx` or `qsv index --force <file>` |
| HTTP disk cache | `rm -rf ~/.qsv-cache/fetch/` |
| HTTP Redis cache | `redis-cli -n <db> FLUSHDB` |
| Lookup-table cache | `rm -rf $QSV_CACHE_DIR/lookup-tables/` |
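For the stats row of the table, the manual `rm` can be wrapped so that the source CSV and its index are never touched by an overly broad glob. A minimal sketch, assuming the sidecar naming from the `my_file.csv` example; `bust_stats_cache` is a hypothetical helper, not a qsv command.

```shell
# Hypothetical helper (not part of qsv): delete only the stats sidecars for
# one CSV so the next stats run rebuilds them. The source file and its .idx
# index are left alone.
bust_stats_cache() {
  local stem="${1%.csv}"
  rm -f -- \
    "${stem}.stats.csv" \
    "${stem}.stats.csv.data.jsonl" \
    "${stem}.stats.csv.json"
}
```

Naming the three files explicitly avoids the risk of a pattern such as `my_file.*` also removing the data file itself.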
- Commit `.stats.csv.data.jsonl` and `.pschema.json` to git for reference datasets you query repeatedly. Tiny files, huge speedup.
- Don't commit `.idx` files: they're regenerated cheaply and platform-specific.
- In CI, prefer `QSV_STATSCACHE_MODE=force`, which eliminates "did I remember to run stats first?" mistakes.
- For multi-stage Make-based pipelines, list the cache files as explicit dependencies so `make` invalidates them when the source changes.
- Performance Tuning: when each cache matters
- Environment Variables: `QSV_STATSCACHE_MODE`, `QSV_CACHE_DIR`, TTL vars
- `docs/PERFORMANCE.md`: canonical performance reference
- `docs/STATS_DEFINITIONS.md`: what's in the stats cache
- Aggregation & Statistics → stats
- Validation & Schema → schema
- SQL & Polars: `.pschema.json` for `sqlp` / `joinp` / `pivotp`
- HTTP & Web: fetch / fetchpost cache modes
- Lookup Tables: Luau cache
- Recipe: Build a Data Pipeline (practical cache layout)