Joel Natividad edited this page Jun 7, 2026 · 1 revision

Get & Disk Cache

Tier: Intermediate
Commands covered: get

Note

Per-command flag reference lives in /docs/help/get.md. This page is the workflow layer — when to reach for get, how the cache works, and how dc: lets every other command read cached data.

get fetches tabular data once from a local path, an HTTP(S) URL, a CKAN portal, or cloud object storage, and stores it in a managed, queryable disk cache. Once cached, the resource can be read by any qsv command via the dc: prefix — no re-download, no re-parse.

Think of it as the acquisition layer that sits in front of the rest of qsv: get owns "how do I get this file and keep it fresh?", and every other command just reads dc:<name>.

What get does on fetch

For each source, get:

  • Stores the blob compressed (zstd by default) and content-addressed by BLAKE3 — identical content is stored once.
  • Auto-builds a qsv index for the cached file, so downstream commands get instant random access and exact record counts.
  • Records rich metadata: ETag, Last-Modified, compressed/uncompressed sizes, record count, and TTL.
  • Revalidates instead of re-downloading. Re-fetches send a conditional request (If-None-Match / If-Modified-Since); an unchanged resource returns 304 Not Modified and the cached copy is kept.
  • Streams large remote objects in parallel byte-ranges — tune with QSV_GET_PART_SIZE (default 8 MiB) and QSV_GET_CONCURRENCY (default 4, clamped 1–64).
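To make "content-addressed" concrete, here is a minimal sketch of the idea, not qsv's implementation. It assumes a throwaway directory as the store and uses sha256sum as a stand-in hash (the page says qsv uses BLAKE3); the put_blob helper and the file names are hypothetical.

```shell
# Scratch directories standing in for a cache and a working folder.
# This is NOT qsv's real cache layout.
store="$(mktemp -d)"
work="$(mktemp -d)"

# Store a file under a key derived from its bytes and print that key.
# sha256sum is a stand-in; qsv is documented as using BLAKE3.
put_blob() {
  local src="$1" digest
  digest="$(sha256sum "$src" | cut -d' ' -f1)"
  # Copy only if no blob with this digest exists yet.
  [ -e "$store/$digest" ] || cp "$src" "$store/$digest"
  printf '%s\n' "$digest"
}

printf 'id,amount\n1,100\n' > "$work/a.csv"
printf 'id,amount\n1,100\n' > "$work/b.csv"   # same bytes, different name
printf 'id,amount\n2,250\n' > "$work/c.csv"   # different bytes

key_a="$(put_blob "$work/a.csv")"
key_b="$(put_blob "$work/b.csv")"
key_c="$(put_blob "$work/c.csv")"

# Three files were added, but a.csv and b.csv share one blob.
blob_count="$(ls -1 "$store" | wc -l | tr -d ' ')"
echo "blobs stored: $blob_count"
```

Because the key depends only on the bytes, the file name and the source location play no part in deduplication.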

Note

get is behind the get feature flag (✨). It is included in the standard qsv binary (distrib_features), qsvmcp, and qsvdp, but not in qsvlite. Cloud sources (s3://, gs://, az://) additionally require the get_cloud sub-feature. See Binary Variants.

Quick decision table

| If you want to… | Use |
| --- | --- |
| Cache a local file (compressed + indexed) for reuse | qsv get path/to/file.csv |
| Fetch a URL once and reuse it across commands | qsv get https://… --name data.csv |
| Pull a CKAN resource by id | qsv get "ckan://<resource-id>" --name ref.csv |
| Pull a CKAN resource by name (resource_search) | qsv get "ckan://<name>?" --name ref.csv |
| Fetch from S3 / GCS / Azure (needs get_cloud) | qsv get s3://bucket/key.csv --name data.csv |
| Read cached data from any command | qsv stats dc:data.csv |
| List / inspect / verify what's cached | qsv get cache-list [--verify] |
| Drop old entries | qsv get cache-prune --older-than=30d |
| Retune one entry's freshness | qsv get cache-set-ttl / cache-set-policy |

get

Sources

| Source | Example | Notes |
| --- | --- | --- |
| Local file | qsv get data/sales.csv | Compressed + indexed copy in the cache |
| HTTP / HTTPS | qsv get https://example.com/data.csv | Conditional revalidation, ranged parallel download |
| datHere lookup repo | qsv get dathere://us-states.csv | The datHere qsv-lookup-tables repo |
| CKAN by id | qsv get "ckan://abc123…" | Resource id on the configured CKAN portal |
| CKAN by name | qsv get "ckan://covid-vaccinations?" | Trailing ? triggers resource_search |
| AWS S3 (get_cloud) | qsv get s3://bucket/key.csv | S3 / S3-compatible endpoints |
| Google Cloud Storage (get_cloud) | qsv get gs://bucket/key.csv | |
| Azure Blob Storage (get_cloud) | qsv get az://container/key.csv | |

sftp:// is planned for a later release.

Cloud credentials are read from the standard AWS_* / AZURE_* / GOOGLE_* environment variables (and IAM roles). Use --cloud-opt key=value (repeatable) for one-off overrides such as region=us-east-1, a custom endpoint=…, or skip_signature=true for public buckets.

# Fetch a CSV into the cache, then read it with another command
qsv get https://example.com/data.csv --name data.csv
qsv stats dc:data.csv

# Seed a CKAN reference table by name
qsv get "ckan://covid-vaccinations?" --name vax.csv

# Cloud object storage (requires the get_cloud feature)
qsv get s3://my-bucket/data.csv --name data.csv
qsv get gs://my-bucket/data.csv --cloud-opt skip_signature=true

The dc: prefix

Once a resource is cached under a logical name, any qsv command can read it by prefixing that name with dc:, for example:

qsv get https://example.com/big.csv --name big.csv
qsv stats     dc:big.csv
qsv frequency dc:big.csv
qsv sqlp      dc:big.csv "SELECT * FROM big WHERE amount > 100"

  • The logical name is set with --name. If omitted, it defaults to the source's terminal path segment (e.g. big.csv from …/big.csv). --name is ignored when multiple sources are given.
  • Reading a dc: handle transparently decompresses the blob to a temp CSV with a sibling .idx.
  • Stale dc: entries are auto-refreshed according to the entry's refresh policy (see below) — a dc: read can trigger a conditional revalidation before handing the data to the command.
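The default-name rule in the first bullet can be approximated with shell parameter expansion. This is only a sketch: default_name is a hypothetical helper, and it assumes the "terminal path segment" is simply everything after the last slash. How qsv treats query strings or CKAN ids is not covered here.

```shell
# Derive a default logical name by taking everything after the last "/".
# Hypothetical helper; qsv's own rule may differ in edge cases.
default_name() {
  local src="$1"
  src="${src%/}"                 # tolerate a single trailing slash
  printf '%s\n' "${src##*/}"     # strip the longest prefix ending in "/"
}

default_name "https://example.com/big.csv"   # big.csv
default_name "data/sales.csv"                # sales.csv
default_name "s3://my-bucket/data.csv"       # data.csv
```

When two sources would end in the same segment, passing an explicit --name avoids relying on this default.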

Freshness: TTL and refresh policy

Each cache entry carries a TTL and a refresh policy that together decide what a dc: read does when the entry ages out:

| Option | Values | Default | Meaning |
| --- | --- | --- | --- |
| --ttl <secs> | seconds, or -1 to never expire | 2419200 (28 days) | How long an entry is considered fresh |
| --refresh <policy> | on-stale, always, never | on-stale | on-stale revalidates only past TTL; always revalidates every read; never serves the cached copy without checking |
| --force | | off | Re-fetch immediately, even if a fresh copy exists |
| --compress <algo> | zstd, none | zstd | Transparent blob compression |
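The interaction between TTL and refresh policy can be written out as a small decision function. This is a sketch of the behavior the table describes, not qsv's code: refresh_decision is a hypothetical helper, and it assumes "past TTL" means the entry's age is strictly greater than the TTL.

```shell
# Decide what a dc: read would do for an entry.
# Args: policy (on-stale|always|never), age in seconds, ttl in seconds (-1 = never expires).
# Prints "revalidate" or "serve-cached".
refresh_decision() {
  local policy="$1" age="$2" ttl="$3"
  case "$policy" in
    always) echo "revalidate" ;;
    never)  echo "serve-cached" ;;
    on-stale)
      # Assumption: an entry is stale once age exceeds a non-negative TTL.
      if [ "$ttl" -ge 0 ] && [ "$age" -gt "$ttl" ]; then
        echo "revalidate"
      else
        echo "serve-cached"
      fi ;;
    *) echo "unknown policy: $policy" >&2; return 1 ;;
  esac
}

default_ttl=$((28 * 24 * 60 * 60))   # 2419200 seconds, the documented default

refresh_decision on-stale 3600    "$default_ttl"   # serve-cached
refresh_decision on-stale 2500000 "$default_ttl"   # revalidate
refresh_decision on-stale 2500000 -1               # serve-cached
refresh_decision always   0       "$default_ttl"   # revalidate
refresh_decision never    99999999 "$default_ttl"  # serve-cached
```

Note that "revalidate" here means a conditional request, which can still end with the cached copy being kept if the source is unchanged.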

Cache-management subcommands

| Command | Purpose |
| --- | --- |
| qsv get cache-list [--verify] | List cached entries (add --verify to recompute each blob's BLAKE3 and report OK/FAIL; exits non-zero on any failure) |
| qsv get cache-info | Summarize the cache (size, entry count) |
| qsv get cache-clear | Remove all cache entries |
| qsv get cache-prune --older-than=<val> | Remove entries older than the given age; <val> is seconds or a value with an s/m/h/d/w suffix (e.g. 3600, 90m, 30d, 2w) |
| qsv get cache-set-ttl <name> --ttl=<secs> | Change one entry's TTL |
| qsv get cache-set-policy <name> --refresh=<policy> | Change one entry's refresh policy |

cache-list / cache-info accept --json for machine-readable output.

# Inspect, verify integrity, then prune
qsv get cache-list
qsv get cache-list --verify
qsv get cache-prune --older-than=30d

# Retune a single entry
qsv get cache-set-ttl   data.csv --ttl=86400
qsv get cache-set-policy data.csv --refresh=never
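Since --ttl takes plain seconds while --older-than also accepts suffixed ages, a small converter can help when scripting both. The sketch below is a hypothetical bash helper (to_seconds) that mirrors the s/m/h/d/w suffixes listed above; it is not how qsv parses the value internally.

```shell
# Convert an age such as 3600, 90m, 30d, or 2w to seconds (bash).
to_seconds() {
  local val="$1" num unit
  case "$val" in
    *[smhdw]) num="${val%?}"; unit="${val#"$num"}" ;;   # split number and suffix
    *)        num="$val";     unit="s" ;;               # bare number means seconds
  esac
  # Reject anything that is not a plain non-negative integer.
  case "$num" in
    ''|*[!0-9]*) echo "invalid age: $val" >&2; return 1 ;;
  esac
  # 10# forces base 10 so values like 08 are not read as octal.
  case "$unit" in
    s) echo $((10#$num)) ;;
    m) echo $((10#$num * 60)) ;;
    h) echo $((10#$num * 3600)) ;;
    d) echo $((10#$num * 86400)) ;;
    w) echo $((10#$num * 604800)) ;;
  esac
}

to_seconds 3600   # 3600
to_seconds 90m    # 5400
to_seconds 30d    # 2592000
to_seconds 2w     # 1209600
```

For example, the output of to_seconds 1d (86400) matches the --ttl value used in the cache-set-ttl example above.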

CKAN options

| Option | Default | Env override |
| --- | --- | --- |
| --ckan-api <url> | https://data.dathere.com/api/3/action | QSV_CKAN_API |
| --ckan-token <token> | none (only needed for private resources) | QSV_CKAN_TOKEN |

Other options

  • --timeout <secs> — HTTP request timeout (default 30).
  • --cache-dir <dir> — cache directory (default ~/.qsv-cache; overrides QSV_CACHE_DIR).
  • -o, --output <file> — for a single source, also write the fetched (decompressed) data to <file> (use - for stdout).
  • -q, --quiet — suppress progress/summary output on stderr.
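The cache-directory precedence stated above (the --cache-dir flag overrides QSV_CACHE_DIR, which overrides the ~/.qsv-cache default) can be sketched as follows. resolve_cache_dir is a hypothetical helper for illustration and does not call qsv.

```shell
# Resolve the cache directory: explicit flag value first, then QSV_CACHE_DIR,
# then the documented default. The argument stands in for a --cache-dir value.
resolve_cache_dir() {
  local flag_value="${1:-}"
  if [ -n "$flag_value" ]; then
    printf '%s\n' "$flag_value"
  elif [ -n "${QSV_CACHE_DIR:-}" ]; then
    printf '%s\n' "$QSV_CACHE_DIR"
  else
    printf '%s\n' "$HOME/.qsv-cache"
  fi
}

unset QSV_CACHE_DIR
resolve_cache_dir                      # $HOME/.qsv-cache
QSV_CACHE_DIR=/tmp/env-cache
resolve_cache_dir                      # /tmp/env-cache
resolve_cache_dir /tmp/flag-cache      # /tmp/flag-cache
```

The paths /tmp/env-cache and /tmp/flag-cache are placeholders chosen for the example.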

Environment variables

| Variable | Effect | Default |
| --- | --- | --- |
| QSV_CACHE_DIR | Cache directory | ~/.qsv-cache |
| QSV_GET_PART_SIZE | Byte-range part size for parallel downloads | 8 MiB |
| QSV_GET_CONCURRENCY | Parallel range-download workers (clamped 1–64) | 4 |
| QSV_CKAN_API | CKAN action API endpoint for ckan:// | datHere portal |
| QSV_CKAN_TOKEN | CKAN access token (private resources) | |

See Environment Variables for the full list.
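To get a feel for the two download-tuning variables, the arithmetic can be sketched in shell. This is an illustration under assumptions: clamp_concurrency and part_count are hypothetical helpers, the page does not say how qsv splits an object into parts, and rounding up to whole parts is assumed here.

```shell
# Clamp a requested worker count into the documented 1-64 range.
clamp_concurrency() {
  local n="$1"
  if [ "$n" -lt 1 ]; then n=1; fi
  if [ "$n" -gt 64 ]; then n=64; fi
  echo "$n"
}

# Number of byte-range parts for an object of a given size, rounding up.
# Assumption: the last part may be shorter than part_size.
part_count() {
  local size="$1" part_size="$2"
  echo $(( (size + part_size - 1) / part_size ))
}

part_size=$((8 * 1024 * 1024))                   # 8 MiB, the documented default

clamp_concurrency 0                              # 1
clamp_concurrency 4                              # 4
clamp_concurrency 128                            # 64
part_count $((100 * 1024 * 1024)) "$part_size"   # 13
```

Under these assumptions a 100 MiB object yields 13 parts, so the default of 4 workers would each handle several ranges.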

get vs fetch

These solve different problems:

  • get fetches one resource (a whole file) into a managed cache, so any command can read it via dc:. Acquisition + caching of datasets.
  • fetch / fetchpost make one HTTP call per row to enrich a CSV from an API, caching individual responses. Per-row enrichment.

Reach for get to pull a dataset once and reuse it; reach for fetch to hit an API for every row.
