Get and Disk Cache
Tier: Intermediate
Commands covered: get
Note
Per-command flag reference lives in /docs/help/get.md. This page is the workflow layer — when to reach for get, how the cache works, and how dc: lets every other command read cached data.
get fetches tabular data once from a local path, an HTTP(S) URL, a CKAN portal, or cloud object storage, and stores it in a managed, queryable disk cache. Once cached, the resource can be read by any qsv command via the dc: prefix — no re-download, no re-parse.
Think of it as the acquisition layer that sits in front of the rest of qsv: get owns "how do I get this file and keep it fresh?", and every other command just reads dc:<name>.
For each source, get:
- Stores the blob compressed (zstd by default) and content-addressed by BLAKE3 — identical content is stored once.
- Auto-builds a qsv index for the cached file, so downstream commands get instant random access and exact record counts.
- Records rich metadata: ETag, Last-Modified, compressed/uncompressed sizes, record count, and TTL.
- Revalidates instead of re-downloading. Re-fetches send a conditional request (If-None-Match / If-Modified-Since); an unchanged resource returns 304 Not Modified and the cached copy is kept.
- Streams large remote objects in parallel byte-ranges; tune with QSV_GET_PART_SIZE (default 8 MiB) and QSV_GET_CONCURRENCY (default 4, clamped 1–64).
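To make the byte-range and clamping arithmetic from the last bullet concrete, here is a minimal sketch in plain POSIX shell. It is not how qsv implements ranged downloads; byte_ranges and clamp_concurrency are hypothetical helper names used only for illustration, and the 20 MiB object size is an assumed example.

```shell
# Hypothetical helpers: a sketch of the arithmetic, not qsv's implementation.

# Clamp a requested worker count into the documented 1..64 range.
clamp_concurrency() {
  n=$1
  if [ "$n" -lt 1 ]; then n=1; fi
  if [ "$n" -gt 64 ]; then n=64; fi
  echo "$n"
}

# Print one inclusive "start-end" byte range per part of an object.
# Arguments: total size in bytes, part size in bytes.
byte_ranges() {
  total=$1
  part=$2
  start=0
  while [ "$start" -lt "$total" ]; do
    end=$((start + part - 1))
    # The final part is usually shorter than the part size.
    if [ "$end" -ge "$total" ]; then
      end=$((total - 1))
    fi
    echo "${start}-${end}"
    start=$((end + 1))
  done
}

part_size=$((8 * 1024 * 1024))   # documented QSV_GET_PART_SIZE default (8 MiB)
byte_ranges $((20 * 1024 * 1024)) "$part_size"
clamp_concurrency 100
```

With these inputs, a 20 MiB object splits into three ranges (two full 8 MiB parts and a 4 MiB remainder), and a requested concurrency of 100 is clamped to 64.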
Note
get is behind the get feature flag (✨). It is included in the standard qsv binary (distrib_features), qsvmcp, and qsvdp, but not in qsvlite. Cloud sources (s3://, gs://, az://) additionally require the get_cloud sub-feature. See Binary Variants.
| If you want to… | Use |
|---|---|
| Cache a local file (compressed + indexed) for reuse | qsv get path/to/file.csv |
| Fetch a URL once and reuse it across commands | qsv get https://… --name data.csv |
| Pull a CKAN resource by id | qsv get "ckan://<resource-id>" --name ref.csv |
| Pull a CKAN resource by name (resource_search) | qsv get "ckan://<name>?" --name ref.csv |
| Fetch from S3 / GCS / Azure (needs get_cloud) | qsv get s3://bucket/key.csv --name data.csv |
| Read cached data from any command | qsv stats dc:data.csv |
| List / inspect / verify what's cached | qsv get cache-list [--verify] |
| Drop old entries | qsv get cache-prune --older-than=30d |
| Retune one entry's freshness | qsv get cache-set-ttl / cache-set-policy |

| Source | Example | Notes |
|---|---|---|
| Local file | qsv get data/sales.csv | Compressed + indexed copy in the cache |
| HTTP / HTTPS | qsv get https://example.com/data.csv | Conditional revalidation, ranged parallel download |
| datHere lookup repo | qsv get dathere://us-states.csv | The datHere qsv-lookup-tables repo |
| CKAN by id | qsv get "ckan://abc123…" | Resource id on the configured CKAN portal |
| CKAN by name | qsv get "ckan://covid-vaccinations?" | Trailing ? triggers resource_search |
| AWS S3 (get_cloud) | qsv get s3://bucket/key.csv | S3 / S3-compatible endpoints |
| Google Cloud Storage (get_cloud) | qsv get gs://bucket/key.csv | |
| Azure Blob Storage (get_cloud) | qsv get az://container/key.csv | |

sftp:// is planned for a later release.
Cloud credentials are read from the standard AWS_* / AZURE_* / GOOGLE_* environment variables (and IAM roles). Use --cloud-opt key=value (repeatable) for one-off overrides such as region=us-east-1, a custom endpoint=…, or skip_signature=true for public buckets.
    # Fetch a CSV into the cache, then read it with another command
    qsv get https://example.com/data.csv --name data.csv
    qsv stats dc:data.csv

    # Seed a CKAN reference table by name
    qsv get "ckan://covid-vaccinations?" --name vax.csv

    # Cloud object storage (requires the get_cloud feature)
    qsv get s3://my-bucket/data.csv --name data.csv
    qsv get gs://my-bucket/data.csv --cloud-opt skip_signature=true

Once a resource is cached under a logical name, any qsv command can read it by prefixing that name with dc: (for example, dc:big.csv):

    qsv get https://example.com/big.csv --name big.csv
    qsv stats dc:big.csv
    qsv frequency dc:big.csv
    qsv sqlp dc:big.csv "SELECT * FROM big WHERE amount > 100"

- The logical name is set with --name. If omitted, it defaults to the source's terminal path segment (e.g. big.csv from …/big.csv). --name is ignored when multiple sources are given.
- Reading a dc: handle transparently decompresses the blob to a temp CSV with a sibling .idx.
- Stale dc: entries are auto-refreshed according to the entry's refresh policy (see below): a dc: read can trigger a conditional revalidation before handing the data to the command.
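The default-name rule in the first bullet can be sketched in POSIX shell as follows. This assumes "terminal path segment" simply means the text after the final slash; how qsv treats query strings or trailing slashes is not covered here, and default_cache_name is a hypothetical helper, not part of qsv.

```shell
# Hypothetical helper: derive a default cache name from a source string.
# Assumption: the terminal path segment is whatever follows the last "/".
default_cache_name() {
  src=$1
  printf '%s\n' "${src##*/}"
}

default_cache_name "https://example.com/big.csv"
default_cache_name "s3://my-bucket/data.csv"
default_cache_name "data/sales.csv"
```

Passing --name explicitly avoids relying on this derivation, which is worth doing when two sources end in the same file name.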
Each cache entry carries a TTL and a refresh policy that together decide what a dc: read does when the entry ages out:
| Option | Values | Default | Meaning |
|---|---|---|---|
| --ttl <secs> | seconds, or -1 to never expire | 2419200 (28 days) | How long an entry is considered fresh |
| --refresh <policy> | on-stale, always, never | on-stale | on-stale revalidates only past TTL; always revalidates every read; never serves the cached copy without checking |
| --force | n/a | off | Re-fetch immediately, even if a fresh copy exists |
| --compress <algo> | zstd, none | zstd | Transparent blob compression |

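The interaction between TTL and refresh policy described in the table can be sketched as a small decision function. This is an illustration of the documented semantics in POSIX shell, not qsv's code: should_revalidate is a hypothetical name, and treating an entry as stale once its age strictly exceeds the TTL is an assumption.

```shell
# Hypothetical helper: exit status 0 means "revalidate before serving".
# Arguments: refresh policy, entry age in seconds, TTL in seconds (-1 = never expires).
should_revalidate() {
  policy=$1
  age=$2
  ttl=$3
  case "$policy" in
    always) return 0 ;;   # revalidate on every read
    never)  return 1 ;;   # always serve the cached copy
    on-stale)
      # Assumption: an entry turns stale once its age exceeds the TTL.
      if [ "$ttl" -ge 0 ] && [ "$age" -gt "$ttl" ]; then
        return 0
      fi
      return 1 ;;
    *)
      echo "unknown policy: $policy" >&2
      return 2 ;;
  esac
}

# An entry fetched 30 days ago, checked against the default 28-day TTL.
if should_revalidate on-stale $((30 * 86400)) 2419200; then
  echo "revalidate"
else
  echo "serve cached copy"
fi
```

In this sketch the 30-day-old entry is past the 2419200-second TTL, so the on-stale policy asks for a revalidation; with a TTL of -1 the same entry would be served from the cache.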
| Command | Purpose |
|---|---|
| qsv get cache-list [--verify] | List cached entries (add --verify to recompute each blob's BLAKE3 and report OK/FAIL; exits non-zero on any failure) |
| qsv get cache-info | Summarize the cache (size, entry count) |
| qsv get cache-clear | Remove all cache entries |
| qsv get cache-prune --older-than=<val> | Remove entries older than an age; <val> is seconds or a value with an s/m/h/d/w suffix (e.g. 3600, 90m, 30d, 2w) |
| qsv get cache-set-ttl <name> --ttl=<secs> | Change one entry's TTL |
| qsv get cache-set-policy <name> --refresh=<policy> | Change one entry's refresh policy |

cache-list / cache-info accept --json for machine-readable output.
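Because cache-set-ttl takes plain seconds while cache-prune also accepts suffixed ages, it can help to see what the s/m/h/d/w suffixes work out to. The sketch below is a hypothetical POSIX shell helper (to_seconds is not a qsv command) that mirrors the documented suffixes; it does no input validation.

```shell
# Hypothetical helper: convert "3600", "90m", "30d", "2w" and similar to seconds.
# No validation: non-numeric input is not handled.
to_seconds() {
  val=$1
  case "$val" in
    *s) echo "${val%s}" ;;
    *m) echo $(( ${val%m} * 60 )) ;;
    *h) echo $(( ${val%h} * 3600 )) ;;
    *d) echo $(( ${val%d} * 86400 )) ;;
    *w) echo $(( ${val%w} * 604800 )) ;;
    *)  echo "$val" ;;    # bare number: already seconds
  esac
}

for v in 3600 90m 30d 2w; do
  printf '%s -> %s\n' "$v" "$(to_seconds "$v")"
done
```

For the examples in the table this gives 3600, 5400, 2592000, and 1209600 seconds respectively.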
    # Inspect, verify integrity, then prune
    qsv get cache-list
    qsv get cache-list --verify
    qsv get cache-prune --older-than=30d

    # Retune a single entry
    qsv get cache-set-ttl data.csv --ttl=86400
    qsv get cache-set-policy data.csv --refresh=never

| Option | Default | Env override |
|---|---|---|
| --ckan-api <url> | https://data.dathere.com/api/3/action | QSV_CKAN_API |
| --ckan-token <token> | none (only for private resources) | QSV_CKAN_TOKEN |
- --timeout <secs>: HTTP request timeout (default 30).
- --cache-dir <dir>: cache directory (default ~/.qsv-cache; overrides QSV_CACHE_DIR).
- -o, --output <file>: for a single source, also write the fetched (decompressed) data to <file> (use - for stdout).
- -q, --quiet: suppress progress/summary output on stderr.

| Variable | Effect | Default |
|---|---|---|
| QSV_CACHE_DIR | Cache directory | ~/.qsv-cache |
| QSV_GET_PART_SIZE | Byte-range part size for parallel downloads | 8 MiB |
| QSV_GET_CONCURRENCY | Parallel range-download workers (clamped 1–64) | 4 |
| QSV_CKAN_API | CKAN action API endpoint for ckan:// | datHere portal |
| QSV_CKAN_TOKEN | CKAN access token (private resources) | none |

See Environment Variables for the full list.
get and fetch / fetchpost solve different problems:
- get fetches one resource (a whole file) into a managed cache, so any command can read it via dc:. Acquisition + caching of datasets.
- fetch / fetchpost make one HTTP call per row to enrich a CSV from an API, caching individual responses. Per-row enrichment.
Reach for get to pull a dataset once and reuse it; reach for fetch to hit an API for every row.
- Command Reference (index)
- docs/help/get.md: canonical flag reference
- HTTP & Web → fetch / fetchpost: per-row HTTP enrichment
- Stats Cache & Caching — the wider qsv caching story
- Recipe: CKAN Integration
- Environment Variables
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense