Get and Disk Cache

Tier: Intermediate
Commands covered: get

Note

Per-command flag reference lives in /docs/help/get.md. This page is the workflow layer — when to reach for get, how the cache works, and how dc: lets every other command read cached data.

get fetches tabular data once from a local path, an HTTP(S) URL, a CKAN portal, or cloud object storage, and stores it in a managed, queryable disk cache. Once cached, the resource can be read by any qsv command via the dc: prefix — no re-download, no re-parse.

Think of it as the acquisition layer that sits in front of the rest of qsv: get owns "how do I get this file and keep it fresh?", and every other command just reads dc:<name>.

What `get` does on fetch

For each source, get:

Stores the blob compressed (zstd by default) and content-addressed by BLAKE3 — identical content is stored once.
Sniffs the delimiter of CSV/TSV ingests so the cached copy parses correctly even when the source uses a non-comma delimiter.
Auto-builds a qsv index for the cached file, so downstream commands get instant random access and exact record counts.
Records rich metadata — ETag, Last-Modified, compressed/uncompressed sizes, record count, and TTL.
Revalidates instead of re-downloading. Re-fetches send a conditional request (If-None-Match / If-Modified-Since); an unchanged resource returns 304 Not Modified and the cached copy is kept.
Streams large remote objects in parallel byte-ranges — tune with QSV_GET_PART_SIZE (default 8 MiB) and QSV_GET_CONCURRENCY (default 4, clamped 1–64).
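The parallel byte-range behaviour in the last point can be pictured with a small planning sketch. This is an illustration in Python rather than qsv's actual code: the function names `clamp_concurrency` and `plan_ranges` are hypothetical, and only the defaults (8 MiB parts, 4 workers, clamp to 1–64) come from the text above.

```python
PART_SIZE_DEFAULT = 8 * 1024 * 1024  # 8 MiB, the documented QSV_GET_PART_SIZE default
CONCURRENCY_DEFAULT = 4              # the documented QSV_GET_CONCURRENCY default


def clamp_concurrency(requested: int) -> int:
    """Clamp a requested worker count to the documented 1 to 64 range."""
    return max(1, min(64, requested))


def plan_ranges(total_size: int, part_size: int = PART_SIZE_DEFAULT) -> list:
    """Split total_size bytes into inclusive (start, end) byte ranges.

    Each tuple corresponds to an HTTP header of the form
    "Range: bytes=start-end" (hypothetical helper, for illustration only).
    """
    if total_size < 0 or part_size <= 0:
        raise ValueError("total_size must be >= 0 and part_size must be > 0")
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1  # Range ends are inclusive
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    # A 20 MiB object splits into two full 8 MiB parts plus a 4 MiB tail.
    for start, end in plan_ranges(20 * 1024 * 1024):
        print(f"Range: bytes={start}-{end}")
```

A smaller part size yields more ranges to spread across workers, while the clamp keeps the number of simultaneous requests bounded.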

Note

get is behind the get feature flag (✨). It is included in the standard qsv binary (distrib_features), qsvmcp, and qsvdp, but not in qsvlite. Cloud sources (s3://, gs://, az://) additionally require the get_cloud sub-feature. See Binary Variants.

Quick decision table

If you want to…	Use
Cache a local file (compressed + indexed) for reuse	`qsv get path/to/file.csv`
Cache every matching file in one go	`qsv get '/data/*.csv'`
Peek at a remote file without caching it	`qsv get https://… --sample 10`
Fetch a URL once and reuse it across commands	`qsv get https://… --name data.csv`
Pull a CKAN resource by id	`qsv get "ckan://<resource-id>" --name ref.csv`
Pull a CKAN resource by name (resource_search)	`qsv get "ckan://<name>?" --name ref.csv`
Fetch from S3 / GCS / Azure (needs `get_cloud`)	`qsv get s3://bucket/key.csv --name data.csv`
Read cached data from any command	`qsv stats dc:data.csv`
List / inspect / verify what's cached	`qsv get cache-list [--verify]`
Drop old entries	`qsv get cache-prune --older-than=30d`
Retune one entry's freshness	`qsv get cache-set-ttl` / `cache-set-policy`

`get`

Sources

Source	Example	Notes
Local file	`qsv get data/sales.csv`	Compressed + indexed copy in the cache
Glob / directory	`qsv get '/data/*.csv'`	Fetches every matching tabular file (`.csv`/`.tsv`/`.tab`/`.ssv`); each cached separately; `--name` is ignored
HTTP / HTTPS	`qsv get https://example.com/data.csv`	Conditional revalidation, ranged parallel download
datHere lookup repo	`qsv get dathere://us-states.csv`	The datHere qsv-lookup-tables repo
CKAN by id	`qsv get "ckan://abc123…"`	Resource id on the configured CKAN portal
CKAN by name	`qsv get "ckan://covid-vaccinations?"`	Trailing `?` triggers `resource_search`
AWS S3 (`get_cloud`)	`qsv get s3://bucket/key.csv`	S3 / S3-compatible endpoints
Google Cloud Storage (`get_cloud`)	`qsv get gs://bucket/key.csv`
Azure Blob Storage (`get_cloud`)	`qsv get az://container/key.csv`

sftp:// is planned for a later release.

Cloud credentials are read from the standard AWS_* / AZURE_* / GOOGLE_* environment variables (and IAM roles). Use --cloud-opt key=value (repeatable) for one-off overrides such as region=us-east-1, a custom endpoint=…, or skip_signature=true for public buckets.

# Fetch a CSV into the cache, then read it with another command
qsv get https://example.com/data.csv --name data.csv
qsv stats dc:data.csv

# Seed a CKAN reference table by name
qsv get "ckan://covid-vaccinations?" --name vax.csv

# Cloud object storage (requires the get_cloud feature)
qsv get s3://my-bucket/data.csv --name data.csv
qsv get gs://my-bucket/data.csv --cloud-opt skip_signature=true

The `dc:` prefix

Once a resource is cached under a logical name, any qsv command can read it by prefixing that name with `dc:`, for example:

qsv get https://example.com/big.csv --name big.csv
qsv stats     dc:big.csv
qsv frequency dc:big.csv
qsv sqlp      dc:big.csv "SELECT * FROM big WHERE amount > 100"

The logical name is set with --name. If omitted, it defaults to the source's terminal path segment (e.g. big.csv from …/big.csv). --name is ignored when multiple sources are given.
Reading a dc: handle transparently decompresses the blob to a temp CSV with a sibling .idx.
Stale dc: entries are auto-refreshed according to the entry's refresh policy (see below) — a dc: read can trigger a conditional revalidation before handing the data to the command.
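The default-name rule above (terminal path segment of the source) can be sketched in a few lines of Python. The helper names `default_logical_name` and `dc_handle` are hypothetical, and dropping any query string before taking the last segment is an assumption of this sketch, not documented qsv behaviour.

```python
import posixpath
from urllib.parse import urlsplit


def default_logical_name(source: str) -> str:
    """Return the last path segment of a URL or local path.

    Assumption: query strings and fragments are not part of the name.
    """
    path = urlsplit(source).path
    return posixpath.basename(path.rstrip("/"))


def dc_handle(name: str) -> str:
    """Build the handle other commands would read, e.g. 'dc:big.csv'."""
    return f"dc:{name}"


if __name__ == "__main__":
    source = "https://example.com/exports/big.csv"
    print(dc_handle(default_logical_name(source)))
```

Passing --name explicitly avoids relying on this derivation, which helps when two sources end in the same file name.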

Sampling preview (`--sample`)

--sample N is a cheap peek at a single source — it streams just the first N data records (re-attaching the sniffed header row) to stdout or --output and caches nothing. No dc: entry is created. Because it stops reading early, a huge remote file is barely touched.

Note

This is not a statistical sample. For a random, representative subset use qsv sample instead.

Option	Meaning
`--sample <n>`	Stream the first `<n>` records of a single source (no caching)
`--offset <mb>`	Skip ~`<mb>` megabytes (via an HTTP Range request) before sampling, realigning to the next record boundary. Implies `--sample`; needs a Range-capable source
`--random`	Reservoir-sample instead of reading the head. Streams and parses the full source from the start (so quoted multi-line records stay intact) — slower, but uniform

# Peek at the first 10 rows of a remote CSV without caching it
qsv get https://example.com/big.csv --sample 10

# Skip ~500 MB, then peek 10 rows
qsv get https://example.com/big.csv --offset 500 --sample 10

# Uniform random preview of 20 rows
qsv get https://example.com/big.csv --sample 20 --random
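To see why --random has to read the whole source, here is a minimal reservoir-sampling sketch in Python (the classic Algorithm R). It is a generic illustration, not qsv's implementation, and the names `reservoir_sample` and `sample_csv` are hypothetical. The input goes through a real CSV parser so that a quoted field containing a newline stays one record.

```python
import csv
import io
import random


def reservoir_sample(rows, k, rng=None):
    """Pick k rows uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                reservoir[j] = row     # replace with probability k / (i + 1)
    return reservoir


def sample_csv(text, k, rng=None):
    """Return (header, sampled_rows) for CSV text, keeping multi-line records whole."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    return header, reservoir_sample(reader, k, rng)


if __name__ == "__main__":
    data = 'id,note\n1,"line one\nline two"\n2,b\n3,c\n4,d\n'
    header, rows = sample_csv(data, 2)
    print(header, rows)
```

Every record has to be seen once before the sample is final, which is why this mode is slower than reading the head of the file.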

Glob & directory sources

A glob (e.g. data/*.csv) or directory source fetches every matching tabular file (.csv/.tsv/.tab/.ssv) in one call — each cached separately under its own dc: handle. Supported for local paths and, with the get_cloud feature, cloud buckets/prefixes. --name is ignored when a source expands to multiple files.

# Cache every CSV in a local directory
qsv get '/data/*.csv'
qsv get /data/

# Cache every matching object under an S3 prefix (requires get_cloud)
qsv get 's3://my-bucket/exports/*.csv'

Freshness: TTL and refresh policy

Each cache entry carries a TTL and a refresh policy that together decide what a dc: read does when the entry ages out:

Option	Values	Default	Meaning
`--ttl <secs>`	seconds, or `-1` to never expire	`2419200` (28 days)	How long an entry is considered fresh
`--refresh <policy>`	`on-stale`, `always`, `never`	`on-stale`	`on-stale` revalidates only past TTL; `always` revalidates every read; `never` serves the cached copy without checking
`--force`	—	off	Re-fetch immediately, even if a fresh copy exists
`--compress <algo>`	`zstd`, `none`	`zstd`	Transparent blob compression
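The interaction between TTL and refresh policy can be summarised as a small decision function. This Python sketch encodes only what the table states; the function name `should_revalidate` is hypothetical, and treating "past TTL" as strictly greater than the TTL is an assumption about the boundary case.

```python
DEFAULT_TTL_SECS = 28 * 24 * 60 * 60  # 2419200, the documented default


def should_revalidate(age_secs: float, ttl_secs: int, policy: str) -> bool:
    """Decide whether a dc: read would trigger a conditional revalidation."""
    if policy == "always":
        return True                     # revalidate on every read
    if policy == "never":
        return False                    # serve the cached copy without checking
    if policy == "on-stale":
        if ttl_secs == -1:              # -1 means the entry never expires
            return False
        return age_secs > ttl_secs      # assumption: stale only once strictly past TTL
    raise ValueError(f"unknown refresh policy: {policy!r}")


if __name__ == "__main__":
    one_day = 24 * 60 * 60
    print(should_revalidate(one_day, DEFAULT_TTL_SECS, "on-stale"))
    print(should_revalidate(29 * one_day, DEFAULT_TTL_SECS, "on-stale"))
```

With the defaults, an entry is served as-is for 28 days and only then costs a conditional request on the next read.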

Cache-management subcommands

Command	Purpose
`qsv get cache-list [--verify]`	List cached entries (add `--verify` to recompute each blob's BLAKE3 and report OK/FAIL — exits non-zero on any failure)
`qsv get cache-info`	Summarize the cache (size, entry count)
`qsv get cache-clear`	Remove all cache entries
`qsv get cache-prune --older-than=<val>`	Remove entries older than an age — `<val>` is seconds or a value with an `s`/`m`/`h`/`d`/`w` suffix (e.g. `3600`, `90m`, `30d`, `2w`)
`qsv get cache-set-ttl <name> --ttl=<secs>`	Change one entry's TTL
`qsv get cache-set-policy <name> --refresh=<policy>`	Change one entry's refresh policy

cache-list / cache-info accept --json for machine-readable output.
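The idea behind content addressing and --verify can be shown with a toy store in Python. This is a sketch under stated assumptions, not qsv's on-disk layout: `store`, `verify`, and `digest` are hypothetical helpers, the blob is hashed uncompressed, and `hashlib.blake2b` stands in for BLAKE3, which is not in Python's standard library.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(data: bytes) -> str:
    # Stand-in hash: qsv uses BLAKE3; blake2b is used here only because it ships with Python.
    return hashlib.blake2b(data).hexdigest()


def store(cache_dir: Path, data: bytes) -> str:
    """Write data under its own digest, so identical content is stored once."""
    key = digest(data)
    path = cache_dir / key
    if not path.exists():
        path.write_bytes(data)
    return key


def verify(cache_dir: Path, keys) -> dict:
    """Recompute each blob's digest and report 'OK' or 'FAIL' per key."""
    report = {}
    for key in keys:
        path = cache_dir / key
        ok = path.exists() and digest(path.read_bytes()) == key
        report[key] = "OK" if ok else "FAIL"
    return report


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        first = store(cache, b"id,name\n1,x\n")
        second = store(cache, b"id,name\n1,x\n")  # same content, same key
        print(first == second, verify(cache, [first]))
```

Because the key is derived from the content, a mismatch on verification means the stored bytes changed after they were written.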

# Inspect, verify integrity, then prune
qsv get cache-list
qsv get cache-list --verify
qsv get cache-prune --older-than=30d

# Retune a single entry
qsv get cache-set-ttl   data.csv --ttl=86400
qsv get cache-set-policy data.csv --refresh=never
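The age values accepted by cache-prune --older-than (plain seconds, or a number with an s/m/h/d/w suffix) are easy to misread, so here is a small Python sketch that converts them to seconds. The function name `parse_age` is hypothetical, and rejecting anything outside that grammar is an assumption; qsv's own parser may be more lenient.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_age(value: str) -> int:
    """Convert '3600', '90m', '30d' or '2w' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhdw]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised age: {value!r}")
    number, unit = match.groups()
    # A bare number has no suffix and is taken as seconds.
    return int(number) * _UNIT_SECONDS.get(unit, 1)


if __name__ == "__main__":
    for example in ("3600", "90m", "30d", "2w"):
        print(example, "=", parse_age(example), "seconds")
```

Note that m here means minutes, so a 30-day window is written 30d.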

CKAN options

Option	Default	Env override
`--ckan-api <url>`	`https://data.dathere.com/api/3/action`	`QSV_CKAN_API`
`--ckan-token <token>`	— (only for private resources)	`QSV_CKAN_TOKEN`

Other options

--sample <n> / --offset <mb> / --random — preview a single source without caching (see Sampling preview above).
--timeout <secs> — HTTP request timeout (default 30).
--cache-dir <dir> — cache directory (default ~/.qsv-cache; overrides QSV_CACHE_DIR).
-o, --output <file> — for a single source, also write the fetched (decompressed) data to <file> (use - for stdout).
-q, --quiet — suppress progress/summary output on stderr.

Environment variables

Variable	Effect	Default
`QSV_CACHE_DIR`	Cache directory	`~/.qsv-cache`
`QSV_GET_PART_SIZE`	Byte-range part size for parallel downloads	8 MiB
`QSV_GET_CONCURRENCY`	Parallel range-download workers (clamped 1–64)	4
`QSV_CKAN_API`	CKAN action API endpoint for `ckan://`	datHere portal
`QSV_CKAN_TOKEN`	CKAN access token (private resources)	—

See Environment Variables for the full list.

`get` vs `fetch`

These solve different problems:

get fetches one resource (a whole file) into a managed cache so that any command can read it via dc:. Its job is acquiring and caching datasets.
fetch / fetchpost make one HTTP call per row to enrich a CSV from an API, caching individual responses. Their job is per-row enrichment.

Reach for get to pull a dataset once and reuse it; reach for fetch to hit an API for every row.
