Get and Disk Cache

Tier: Intermediate · Commands covered: `get`

Note

Per-command flag reference lives in /docs/help/get.md. This page is the workflow layer — when to reach for get, how the cache works, and how dc: lets every other command read cached data.

get fetches tabular data once from a local path, an HTTP(S) URL, a CKAN portal, or cloud object storage, and stores it in a managed, queryable disk cache. Once cached, the resource can be read by any qsv command via the dc: prefix — no re-download, no re-parse.

Think of it as the acquisition layer that sits in front of the rest of qsv: get owns "how do I get this file and keep it fresh?", and every other command just reads dc:<name>.

What `get` does on fetch

For each source, get:

Stores the blob compressed (zstd by default) and content-addressed by BLAKE3 — identical content is stored once.
Auto-builds a qsv index for the cached file, so downstream commands get instant random access and exact record counts.
Records rich metadata — ETag, Last-Modified, compressed/uncompressed sizes, record count, and TTL.
Revalidates instead of re-downloading. Re-fetches send a conditional request (If-None-Match / If-Modified-Since); an unchanged resource returns 304 Not Modified and the cached copy is kept.
Streams large remote objects in parallel byte-ranges — tune with QSV_GET_PART_SIZE (default 8 MiB) and QSV_GET_CONCURRENCY (default 4, clamped 1–64).
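
The revalidation step in the list above follows the standard HTTP conditional-request pattern. The sketch below is an illustration of that pattern only, not qsv's implementation; the function names `conditional_headers` and `handle_response` are hypothetical, and it assumes the cache stored the `ETag` and `Last-Modified` values from the previous fetch.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build revalidation headers from validators saved at the previous fetch.

    Hypothetical helper: a header is only sent for a validator we actually have.
    """
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def handle_response(status: int, cached_blob: bytes, body: bytes) -> bytes:
    """Keep the cached copy on 304 Not Modified; take the new body on 200."""
    if status == 304:
        return cached_blob
    if status == 200:
        return body
    raise RuntimeError(f"unexpected HTTP status {status}")


# Example: both validators are known, and the server answers 304.
headers = conditional_headers('"abc123"', "Wed, 21 Oct 2015 07:28:00 GMT")
blob = handle_response(304, cached_blob=b"cached bytes", body=b"")
```

When the server answers 304, no body is transferred, which is why an unchanged resource costs only a round trip.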

Note

get is behind the get feature flag (✨). It is included in the standard qsv binary (distrib_features), qsvmcp, and qsvdp, but not in qsvlite. Cloud sources (s3://, gs://, az://) additionally require the get_cloud sub-feature. See Binary Variants.

Quick decision table

If you want to…	Use
Cache a local file (compressed + indexed) for reuse	`qsv get path/to/file.csv`
Fetch a URL once and reuse it across commands	`qsv get https://… --name data.csv`
Pull a CKAN resource by id	`qsv get "ckan://<resource-id>" --name ref.csv`
Pull a CKAN resource by name (resource_search)	`qsv get "ckan://<name>?" --name ref.csv`
Fetch from S3 / GCS / Azure (needs `get_cloud`)	`qsv get s3://bucket/key.csv --name data.csv`
Read cached data from any command	`qsv stats dc:data.csv`
List / inspect / verify what's cached	`qsv get cache-list [--verify]`
Drop old entries	`qsv get cache-prune --older-than=30d`
Retune one entry's freshness	`qsv get cache-set-ttl` / `cache-set-policy`

`get`

Sources

Source	Example	Notes
Local file	`qsv get data/sales.csv`	Compressed + indexed copy in the cache
HTTP / HTTPS	`qsv get https://example.com/data.csv`	Conditional revalidation, ranged parallel download
datHere lookup repo	`qsv get dathere://us-states.csv`	The datHere qsv-lookup-tables repo
CKAN by id	`qsv get "ckan://abc123…"`	Resource id on the configured CKAN portal
CKAN by name	`qsv get "ckan://covid-vaccinations?"`	Trailing `?` triggers `resource_search`
AWS S3 (`get_cloud`)	`qsv get s3://bucket/key.csv`	S3 / S3-compatible endpoints
Google Cloud Storage (`get_cloud`)	`qsv get gs://bucket/key.csv`
Azure Blob Storage (`get_cloud`)	`qsv get az://container/key.csv`

sftp:// is planned for a later release.

Cloud credentials are read from the standard AWS_* / AZURE_* / GOOGLE_* environment variables (and IAM roles). Use --cloud-opt key=value (repeatable) for one-off overrides such as region=us-east-1, a custom endpoint=…, or skip_signature=true for public buckets.

```bash
# Fetch a CSV into the cache, then read it with another command
qsv get https://example.com/data.csv --name data.csv
qsv stats dc:data.csv

# Seed a CKAN reference table by name
qsv get "ckan://covid-vaccinations?" --name vax.csv

# Cloud object storage (requires the get_cloud feature)
qsv get s3://my-bucket/data.csv --name data.csv
qsv get gs://my-bucket/data.csv --cloud-opt skip_signature=true
```

The `dc:` prefix

Once a resource is cached under a logical name, any qsv command can read it by prefixing that name with `dc:`:

```bash
qsv get https://example.com/big.csv --name big.csv
qsv stats     dc:big.csv
qsv frequency dc:big.csv
qsv sqlp      dc:big.csv "SELECT * FROM big WHERE amount > 100"
```

The logical name is set with --name. If omitted, it defaults to the source's terminal path segment (e.g. big.csv from …/big.csv). --name is ignored when multiple sources are given.
Reading a dc: handle transparently decompresses the blob to a temp CSV with a sibling .idx.
Stale dc: entries are auto-refreshed according to the entry's refresh policy (see below) — a dc: read can trigger a conditional revalidation before handing the data to the command.

Freshness: TTL and refresh policy

Each cache entry carries a TTL and a refresh policy that together decide what a dc: read does when the entry ages out:

Option	Values	Default	Meaning
`--ttl <secs>`	seconds, or `-1` to never expire	`2419200` (28 days)	How long an entry is considered fresh
`--refresh <policy>`	`on-stale`, `always`, `never`	`on-stale`	`on-stale` revalidates only past TTL; `always` revalidates every read; `never` serves the cached copy without checking
`--force`	—	off	Re-fetch immediately, even if a fresh copy exists
`--compress <algo>`	`zstd`, `none`	`zstd`	Transparent blob compression
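
The way TTL and refresh policy combine on a `dc:` read can be pictured with a small decision function. This is a minimal sketch based only on the table above, not qsv's code; `needs_revalidation` is a hypothetical name, and treating `always` as overriding a `-1` TTL is an assumption.

```python
DEFAULT_TTL_SECS = 2_419_200  # 28 days, the documented default


def needs_revalidation(age_secs: float,
                       ttl_secs: int = DEFAULT_TTL_SECS,
                       policy: str = "on-stale") -> bool:
    """Return True if a read should revalidate before serving the cached copy."""
    if policy == "never":
        return False  # serve the cached copy without checking
    if policy == "always":
        return True   # assumption: applies even when the TTL is -1
    if policy == "on-stale":
        if ttl_secs == -1:
            return False  # -1 means the entry never expires
        return age_secs > ttl_secs
    raise ValueError(f"unknown refresh policy: {policy!r}")


# An entry fetched 30 days ago with the default TTL and policy is stale.
stale = needs_revalidation(age_secs=30 * 86_400)
```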

Cache-management subcommands

Command	Purpose
`qsv get cache-list [--verify]`	List cached entries (add `--verify` to recompute each blob's BLAKE3 and report OK/FAIL — exits non-zero on any failure)
`qsv get cache-info`	Summarize the cache (size, entry count)
`qsv get cache-clear`	Remove all cache entries
`qsv get cache-prune --older-than=<val>`	Remove entries older than an age — `<val>` is seconds or a value with an `s`/`m`/`h`/`d`/`w` suffix (e.g. `3600`, `90m`, `30d`, `2w`)
`qsv get cache-set-ttl <name> --ttl=<secs>`	Change one entry's TTL
`qsv get cache-set-policy <name> --refresh=<policy>`	Change one entry's refresh policy
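
The `--older-than` value format described above (plain seconds, or a number with an `s`/`m`/`h`/`d`/`w` suffix) maps to seconds as in the sketch below. It only illustrates the arithmetic; `parse_age` is a hypothetical helper, and how qsv itself handles whitespace or invalid input is not covered by this page.

```python
import re

# Seconds per unit suffix; a bare number is treated as seconds.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3_600, "d": 86_400, "w": 604_800}


def parse_age(value: str) -> int:
    """Convert a value such as '3600', '90m', '30d' or '2w' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdw]?)", value.strip())
    if match is None:
        raise ValueError(f"invalid age: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS.get(unit, 1)


# The four example values from the table, converted to seconds.
examples = {text: parse_age(text) for text in ("3600", "90m", "30d", "2w")}
```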

cache-list / cache-info accept --json for machine-readable output.

```bash
# Inspect, verify integrity, then prune
qsv get cache-list
qsv get cache-list --verify
qsv get cache-prune --older-than=30d

# Retune a single entry
qsv get cache-set-ttl   data.csv --ttl=86400
qsv get cache-set-policy data.csv --refresh=never
```

CKAN options

Option	Default	Env override
`--ckan-api <url>`	`https://data.dathere.com/api/3/action`	`QSV_CKAN_API`
`--ckan-token <token>`	— (only for private resources)	`QSV_CKAN_TOKEN`

Other options

--timeout <secs> — HTTP request timeout (default 30).
--cache-dir <dir> — cache directory (default ~/.qsv-cache; overrides QSV_CACHE_DIR).
-o, --output <file> — for a single source, also write the fetched (decompressed) data to <file> (use - for stdout).
-q, --quiet — suppress progress/summary output on stderr.

Environment variables

Variable	Effect	Default
`QSV_CACHE_DIR`	Cache directory	`~/.qsv-cache`
`QSV_GET_PART_SIZE`	Byte-range part size for parallel downloads	8 MiB
`QSV_GET_CONCURRENCY`	Parallel range-download workers (clamped 1–64)	4
`QSV_CKAN_API`	CKAN action API endpoint for `ckan://`	datHere portal
`QSV_CKAN_TOKEN`	CKAN access token (private resources)	—

See Environment Variables for the full list.
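
To see what `QSV_GET_PART_SIZE` and `QSV_GET_CONCURRENCY` control, the sketch below splits an object into byte ranges and clamps the worker count. It is a conceptual illustration under stated assumptions, not qsv's downloader: `plan_ranges`, `clamp_concurrency`, and `range_header` are hypothetical names, and the inclusive `bytes=start-end` form is the standard HTTP `Range` syntax.

```python
from typing import List, Tuple

DEFAULT_PART_SIZE = 8 * 1024 * 1024  # 8 MiB, the documented default


def clamp_concurrency(requested: int) -> int:
    """Clamp the worker count to the documented 1-64 range."""
    return max(1, min(64, requested))


def plan_ranges(total_size: int, part_size: int = DEFAULT_PART_SIZE) -> List[Tuple[int, int]]:
    """Split total_size bytes into inclusive (start, end) offsets, one per part."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [
        (start, min(start + part_size, total_size) - 1)
        for start in range(0, total_size, part_size)
    ]


def range_header(start: int, end: int) -> str:
    """Format one part as an HTTP Range header value."""
    return f"bytes={start}-{end}"


# A 20 MiB object with the default part size yields three parts.
parts = plan_ranges(20 * 1024 * 1024)
workers = clamp_concurrency(4)
```

A larger part size means fewer requests per object, while higher concurrency lets more parts download at once; the best values depend on the network and the server.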

`get` vs `fetch`

These solve different problems:

get fetches one resource (a whole file) into a managed cache so that any command can read it via dc:. Its job is dataset acquisition and caching.
fetch / fetchpost make one HTTP call per row to enrich a CSV from an API, caching the individual responses. Their job is per-row enrichment.

Reach for get to pull a dataset once and reuse it; reach for fetch to hit an API for every row.
