Get and Disk Cache
Tier: Intermediate
Commands covered: get
Note
Per-command flag reference lives in /docs/help/get.md. This page is the workflow layer — when to reach for get, how the cache works, and how dc: lets every other command read cached data.
get fetches tabular data once from a local path, an HTTP(S) URL, a CKAN portal, or cloud object storage, and stores it in a managed, queryable disk cache. Once cached, the resource can be read by any qsv command via the dc: prefix — no re-download, no re-parse.
Think of it as the acquisition layer that sits in front of the rest of qsv: get owns "how do I get this file and keep it fresh?", and every other command just reads dc:<name>.
For each source, get:
- Stores the blob compressed (zstd by default) and content-addressed by BLAKE3 — identical content is stored once.
- Sniffs the delimiter of CSV/TSV ingests so the cached copy parses correctly even when the source uses a non-comma delimiter.
- Auto-builds a qsv index for the cached file, so downstream commands get instant random access and exact record counts.
- Records rich metadata: `ETag`, `Last-Modified`, compressed/uncompressed sizes, record count, and TTL.
- Revalidates instead of re-downloading. Re-fetches send a conditional request (`If-None-Match` / `If-Modified-Since`); an unchanged resource returns `304 Not Modified` and the cached copy is kept.
- Streams large remote objects in parallel byte-ranges; tune with `QSV_GET_PART_SIZE` (default 8 MiB) and `QSV_GET_CONCURRENCY` (default 4, clamped 1–64).
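
To make the byte-range idea concrete, here is a minimal sketch of how an object of a known size could be divided into fixed-size parts. `byte_ranges` is a hypothetical helper written for illustration, not qsv's actual code; it only prints the `Range` values and downloads nothing. Presumably up to `QSV_GET_CONCURRENCY` such ranges would be in flight at once.

```shell
# Hypothetical helper (an assumption, not part of qsv):
# list the HTTP Range values needed to cover an object.
# $1 = total size in bytes, $2 = part size in bytes
byte_ranges() {
  total=$1
  part=$2
  start=0
  while [ "$start" -lt "$total" ]; do
    end=$((start + part - 1))
    # The last part is usually shorter than the part size
    if [ "$end" -ge "$total" ]; then
      end=$((total - 1))
    fi
    echo "bytes=${start}-${end}"
    start=$((end + 1))
  done
}

# A 20 MiB object split into 8 MiB parts (the documented default part size)
byte_ranges $((20 * 1024 * 1024)) $((8 * 1024 * 1024))
# bytes=0-8388607
# bytes=8388608-16777215
# bytes=16777216-20971519
```

A larger part size means fewer requests per object; a smaller one gives the workers more pieces to spread across.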
Note
get is behind the get feature flag (✨). It is included in the standard qsv binary (distrib_features), qsvmcp, and qsvdp, but not in qsvlite. Cloud sources (s3://, gs://, az://) additionally require the get_cloud sub-feature. See Binary Variants.
| If you want to… | Use |
|---|---|
| Cache a local file (compressed + indexed) for reuse | `qsv get path/to/file.csv` |
| Cache every matching file in one go | `qsv get '/data/*.csv'` |
| Peek at a remote file without caching it | `qsv get https://… --sample 10` |
| Fetch a URL once and reuse it across commands | `qsv get https://… --name data.csv` |
| Pull a CKAN resource by id | `qsv get "ckan://<resource-id>" --name ref.csv` |
| Pull a CKAN resource by name (`resource_search`) | `qsv get "ckan://<name>?" --name ref.csv` |
| Fetch from S3 / GCS / Azure (needs `get_cloud`) | `qsv get s3://bucket/key.csv --name data.csv` |
| Read cached data from any command | `qsv stats dc:data.csv` |
| List / inspect / verify what's cached | `qsv get cache-list [--verify]` |
| Drop old entries | `qsv get cache-prune --older-than=30d` |
| Retune one entry's freshness | `qsv get cache-set-ttl` / `cache-set-policy` |
| Source | Example | Notes |
|---|---|---|
| Local file | `qsv get data/sales.csv` | Compressed + indexed copy in the cache |
| Glob / directory | `qsv get '/data/*.csv'` | Fetches every matching tabular file (`.csv`/`.tsv`/`.tab`/`.ssv`); each cached separately; `--name` is ignored |
| HTTP / HTTPS | `qsv get https://example.com/data.csv` | Conditional revalidation, ranged parallel download |
| datHere lookup repo | `qsv get dathere://us-states.csv` | The datHere qsv-lookup-tables repo |
| CKAN by id | `qsv get "ckan://abc123…"` | Resource id on the configured CKAN portal |
| CKAN by name | `qsv get "ckan://covid-vaccinations?"` | Trailing `?` triggers `resource_search` |
| AWS S3 (`get_cloud`) | `qsv get s3://bucket/key.csv` | S3 / S3-compatible endpoints |
| Google Cloud Storage (`get_cloud`) | `qsv get gs://bucket/key.csv` | |
| Azure Blob Storage (`get_cloud`) | `qsv get az://container/key.csv` | |

`sftp://` is planned for a later release.
Cloud credentials are read from the standard AWS_* / AZURE_* / GOOGLE_* environment variables (and IAM roles). Use --cloud-opt key=value (repeatable) for one-off overrides such as region=us-east-1, a custom endpoint=…, or skip_signature=true for public buckets.
```shell
# Fetch a CSV into the cache, then read it with another command
qsv get https://example.com/data.csv --name data.csv
qsv stats dc:data.csv

# Seed a CKAN reference table by name
qsv get "ckan://covid-vaccinations?" --name vax.csv

# Cloud object storage (requires the get_cloud feature)
qsv get s3://my-bucket/data.csv --name data.csv
qsv get gs://my-bucket/data.csv --cloud-opt skip_signature=true
```

Once a resource is cached under a logical name, any qsv command reads it by prefixing that name with `dc:`:
```shell
qsv get https://example.com/big.csv --name big.csv
qsv stats dc:big.csv
qsv frequency dc:big.csv
qsv sqlp dc:big.csv "SELECT * FROM big WHERE amount > 100"
```

- The logical name is set with `--name`. If omitted, it defaults to the source's terminal path segment (e.g. `big.csv` from `…/big.csv`). `--name` is ignored when multiple sources are given.
- Reading a `dc:` handle transparently decompresses the blob to a temp CSV with a sibling `.idx`.
- Stale `dc:` entries are auto-refreshed according to the entry's refresh policy (see below); a `dc:` read can trigger a conditional revalidation before handing the data to the command.
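
The default-name rule can be pictured with a short sketch. `default_name` is a hypothetical helper, not a qsv function, and it simply keeps whatever follows the last slash. How qsv itself treats query strings or trailing slashes is not covered here, so the sketch makes no attempt to handle them.

```shell
# Hypothetical helper (an assumption, not part of qsv):
# derive a logical cache name from a source string by keeping
# everything after the last slash.
default_name() {
  echo "${1##*/}"
}

default_name https://example.com/big.csv   # prints big.csv
default_name data/sales.csv                # prints sales.csv
```

Passing `--name` explicitly avoids relying on this rule when two sources share the same final path segment.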
--sample N is a cheap peek at a single source — it streams just the first N data records (re-attaching the sniffed header row) to stdout or --output and caches nothing. No dc: entry is created. Because it stops reading early, a huge remote file is barely touched.
Note
This is not a statistical sample. For a random, representative subset use qsv sample instead.
| Option | Meaning |
|---|---|
| `--sample <n>` | Stream the first N records of a single source (no caching) |
| `--offset <mb>` | Skip ~`<mb>` megabytes (via an HTTP Range request) before sampling, realigning to the next record boundary. Implies `--sample`; needs a Range-capable source |
| `--random` | Reservoir-sample instead of reading the head. Streams and parses the full source from the start (so quoted multi-line records stay intact); slower, but uniform |
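
For readers unfamiliar with reservoir sampling, the sketch below shows the general technique (Algorithm R) in `awk`. It is an illustration only, not how qsv implements `--random`: the `reservoir_sample` function and its seed argument are hypothetical, and it is line-based, so unlike qsv it would split quoted multi-line records.

```shell
# Hypothetical illustration of reservoir sampling (Algorithm R).
# $1 = sample size, $2 = random seed; reads CSV with a header row on stdin.
reservoir_sample() {
  awk -v k="$1" -v seed="$2" '
    BEGIN { srand(seed) }
    NR == 1 { print; next }          # pass the header row through
    {
      n++
      if (n <= k) {
        keep[n] = $0                 # fill the reservoir first
      } else {
        j = int(rand() * n) + 1      # uniform pick in 1..n
        if (j <= k) keep[j] = $0     # replace a slot with probability k/n
      }
    }
    END {
      m = (n < k) ? n : k
      for (i = 1; i <= m; i++) print keep[i]
    }
  '
}

# Keep 2 of 5 data rows, plus the header
printf 'id\n1\n2\n3\n4\n5\n' | reservoir_sample 2 42
```

Every record has to be read once, which is why a random preview costs a full pass over the source while a head sample can stop early.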
```shell
# Peek at the first 10 rows of a remote CSV without caching it
qsv get https://example.com/big.csv --sample 10

# Skip ~500 MB, then peek 10 rows
qsv get https://example.com/big.csv --offset 500 --sample 10

# Uniform random preview of 20 rows
qsv get https://example.com/big.csv --sample 20 --random
```

A glob (e.g. `data/*.csv`) or directory source fetches every matching tabular file (`.csv`/`.tsv`/`.tab`/`.ssv`) in one call, each cached separately under its own `dc:` handle. Supported for local paths and, with the `get_cloud` feature, cloud buckets/prefixes. `--name` is ignored when a source expands to multiple files.
```shell
# Cache every CSV in a local directory
qsv get '/data/*.csv'
qsv get /data/

# Cache every matching object under an S3 prefix (requires get_cloud)
qsv get 's3://my-bucket/exports/*.csv'
```

Each cache entry carries a TTL and a refresh policy that together decide what a `dc:` read does when the entry ages out:
| Option | Values | Default | Meaning |
|---|---|---|---|
| `--ttl <secs>` | seconds, or `-1` to never expire | `2419200` (28 days) | How long an entry is considered fresh |
| `--refresh <policy>` | `on-stale`, `always`, `never` | `on-stale` | `on-stale` revalidates only past TTL; `always` revalidates every read; `never` serves the cached copy without checking |
| `--force` | n/a | off | Re-fetch immediately, even if a fresh copy exists |
| `--compress <algo>` | `zstd`, `none` | `zstd` | Transparent blob compression |
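
The interaction between TTL and refresh policy can be summarized as a small decision function. This is a sketch of the rules as described in the table, not qsv's code: `needs_revalidation` is a hypothetical name, and treating an entry as stale only when its age is strictly greater than the TTL is an assumption.

```shell
# Hypothetical decision helper (an assumption, not part of qsv).
# $1 = refresh policy, $2 = entry age in seconds,
# $3 = TTL in seconds (-1 = never expires)
needs_revalidation() {
  policy=$1
  age=$2
  ttl=$3
  case "$policy" in
    always) echo yes ;;
    never)  echo no ;;
    on-stale)
      # Stale only if the entry can expire and has outlived its TTL
      if [ "$ttl" -ne -1 ] && [ "$age" -gt "$ttl" ]; then
        echo yes
      else
        echo no
      fi
      ;;
    *) echo "unknown policy: $policy" >&2; return 1 ;;
  esac
}

needs_revalidation on-stale 3000000 2419200   # prints yes (older than 28 days)
needs_revalidation on-stale 3000000 -1        # prints no (never expires)
needs_revalidation never 3000000 2419200      # prints no
```

Note that a "yes" here means a conditional request is sent, which may still end in `304 Not Modified` and leave the cached copy in place.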
| Command | Purpose |
|---|---|
| `qsv get cache-list [--verify]` | List cached entries (add `--verify` to recompute each blob's BLAKE3 and report OK/FAIL; exits non-zero on any failure) |
| `qsv get cache-info` | Summarize the cache (size, entry count) |
| `qsv get cache-clear` | Remove all cache entries |
| `qsv get cache-prune --older-than=<val>` | Remove entries older than an age; `<val>` is seconds or a value with an `s`/`m`/`h`/`d`/`w` suffix (e.g. `3600`, `90m`, `30d`, `2w`) |
| `qsv get cache-set-ttl <name> --ttl=<secs>` | Change one entry's TTL |
| `qsv get cache-set-policy <name> --refresh=<policy>` | Change one entry's refresh policy |
cache-list / cache-info accept --json for machine-readable output.
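
Because `--ttl` takes plain seconds while `--older-than` also accepts suffixed values, it can help to see what the suffixes work out to. The `age_to_seconds` function below is a hypothetical helper that mirrors the documented `s`/`m`/`h`/`d`/`w` formats; it is not part of qsv and does no input validation.

```shell
# Hypothetical helper (an assumption, not part of qsv):
# convert an age such as 90m, 30d or 2w into seconds.
age_to_seconds() {
  val=$1
  case "$val" in
    *s) echo $(( ${val%s} )) ;;
    *m) echo $(( ${val%m} * 60 )) ;;
    *h) echo $(( ${val%h} * 3600 )) ;;
    *d) echo $(( ${val%d} * 86400 )) ;;
    *w) echo $(( ${val%w} * 604800 )) ;;
    *)  echo "$val" ;;   # bare number: already seconds
  esac
}

age_to_seconds 90m   # prints 5400
age_to_seconds 30d   # prints 2592000
age_to_seconds 2w    # prints 1209600
```

The same arithmetic gives the default TTL: 28 days is 28 × 86400 = 2419200 seconds.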
```shell
# Inspect, verify integrity, then prune
qsv get cache-list
qsv get cache-list --verify
qsv get cache-prune --older-than=30d

# Retune a single entry
qsv get cache-set-ttl data.csv --ttl=86400
qsv get cache-set-policy data.csv --refresh=never
```

| Option | Default | Env override |
|---|---|---|
| `--ckan-api <url>` | `https://data.dathere.com/api/3/action` | `QSV_CKAN_API` |
| `--ckan-token <token>` | none (only for private resources) | `QSV_CKAN_TOKEN` |
- `--sample <n>` / `--offset <mb>` / `--random`: preview a single source without caching (see Sampling preview above).
- `--timeout <secs>`: HTTP request timeout (default `30`).
- `--cache-dir <dir>`: cache directory (default `~/.qsv-cache`; overrides `QSV_CACHE_DIR`).
- `-o, --output <file>`: for a single source, also write the fetched (decompressed) data to `<file>` (use `-` for stdout).
- `-q, --quiet`: suppress progress/summary output on stderr.
| Variable | Effect | Default |
|---|---|---|
| `QSV_CACHE_DIR` | Cache directory | `~/.qsv-cache` |
| `QSV_GET_PART_SIZE` | Byte-range part size for parallel downloads | 8 MiB |
| `QSV_GET_CONCURRENCY` | Parallel range-download workers (clamped 1–64) | 4 |
| `QSV_CKAN_API` | CKAN action API endpoint for `ckan://` | datHere portal |
| `QSV_CKAN_TOKEN` | CKAN access token (private resources) | none |
See Environment Variables for the full list.
These solve different problems:
- `get` fetches one resource (a whole file) into a managed cache, so any command can read it via `dc:`. Acquisition + caching of datasets.
- `fetch` / `fetchpost` make one HTTP call per row to enrich a CSV from an API, caching individual responses. Per-row enrichment.
Reach for get to pull a dataset once and reuse it; reach for fetch to hit an API for every row.
- Command Reference (index)
- `docs/help/get.md`: canonical flag reference
- HTTP & Web → fetch / fetchpost: per-row HTTP enrichment
- Stats Cache & Caching: the wider qsv caching story
- Recipe: CKAN Integration
- Environment Variables
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense