feat(get): new command for getting tabular data from various sources (#2263) by jqnatividad · Pull Request #3953 · dathere/qsv

jqnatividad · 2026-06-05T16:32:04Z

Summary

Implements Phase 1 of #2263: a new get command plus a dc: disk-cache subsystem that fetches tabular data from various sources into a managed, queryable cache that any qsv command can read via the dc: prefix (e.g. qsv stats dc:data.csv).

Sources supported in this phase: local files, http(s):// URLs, dathere://, and ckan:// (<id> and <name>?). Cloud storage (S3/Azure/GCS) and sftp:// are deferred to later phases.

What's included

src/diskcache.rs — a two-tier cache module:
- Always-compiled core: the shared source resolver (dathere:// / ckan:// / http), including the security-critical same-origin CKAN-token-strip check. lookup.rs is unified onto this, so there is now a single, audited resolver/fetch path instead of two.
- get-feature-gated rich cache: a content-addressed, zstd-compressed blob store + a per-entry JSON index recording BLAKE3, ETag, Last-Modified, compressed/uncompressed sizes, record count, and TTL. HTTP cache semantics (ETag / Cache-Control / 304 revalidation) come from a custom http_cache::CacheManager wired through http-cache-reqwest + reqwest-middleware. Every cached file is auto-indexed (qsv .idx) so dc: reads get instant counts & random access.
src/cmd/get.rs — qsv get <source>... plus cache-list / cache-info / cache-clear / cache-prune.
Config::new resolves dc:<name> to a decompressed temp CSV + sibling .idx, routing failures through format_error (get-gated, no panics).
Registered in qsv, qsvmcp and qsvdp; the get feature joins distrib_features, qsvmcp and datapusher_plus (not lite).
README row + auto-generated docs/help/get.md / ToC.

Design notes

Stable http-cache-reqwest 0.16 is pinned to reqwest 0.12; only the 1.0.0-alpha.6 pre-release targets qsv's reqwest 0.13. It resolves to a single reqwest version (no second major). Pinned with the url-standard feature.
Use cases (per the issue): efficiently seeds luau lookup tables & validate dynamicEnum reference data, and should speed up Datapusher+ harvesting.

Testing

tests/test_get.rs (actix mock server): ETag 304 revalidation (asserts the body is downloaded only once across two fetches via a shared in-process counter), content-addressed dedup, dc: indexed read, and cache clear/prune — all green.
Lookup-consumer tests (luau/template; local + http + dathere://) pass — no regression from the unification.
qsv, qsvmcp, qsvdp all compile; qsv also compiles without get (gating verified); cargo check -F all_features clean; clippy clean; cargo +nightly fmt applied.

Deferred (later phases of #2263)

Cloud storage via object_store, sftp://, derived stats/frequency caches, cache-set-ttl / cache-set-policy, --verify, and HTTP/3 / multipart downloads.

🤖 Generated with Claude Code

…2263) Add a `get` command + `dc:` disk-cache subsystem that fetches tabular data from local files, http(s) URLs, dathere:// and ckan:// into a managed, queryable cache reusable by any qsv command via the `dc:` prefix. - src/diskcache.rs: two-tier cache module. The always-compiled core shares the source resolver (dathere/ckan/http) incl. the same-origin CKAN token-strip security check; lookup.rs is unified onto it so there is a single, audited resolver path. The get-gated rich cache is a content-addressed zstd blob store + per-entry JSON index (BLAKE3, ETag, Last-Modified, compressed/uncompressed sizes, record count, TTL), with HTTP cache semantics via http-cache-reqwest + reqwest-middleware (custom CacheManager) and auto-indexing of cached files for instant dc: counts / random access. - src/cmd/get.rs: `qsv get <source>...` plus cache-list/info/clear/prune. - Config::new resolves `dc:<name>` to a decompressed temp CSV + sibling .idx, routing failures through format_error (get-gated). - Registered in qsv, qsvmcp and qsvdp; the `get` feature joins distrib_features, qsvmcp and datapusher_plus. - tests/test_get.rs: mock-server ETag 304 revalidation, content-addressed dedup, dc: indexed read, and cache clear/prune. Phase 1 of #2263; cloud storage (S3/Azure/GCS) and sftp are deferred. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codacy-production · 2026-06-05T16:32:58Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 337 complexity

Metric Results

Complexity 337

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

Address three roborev findings on the `get`/diskcache subsystem: - Medium: fetching an already-cached HTTP URL under a different --name failed because a fresh cache hit never calls CacheManager::put (so no alias for the new name was written). The manager now records the cache key it operated on (in get or put) and get_resource recovers the entry by that key, binding the requested name to it. - Medium: local entries stored the (possibly relative) given path as the refresh source, so a stale dc: refresh from another directory could read the wrong file. ingest_local now stores the canonicalized absolute path. - Low: reusing a logical name for a different source orphaned the old entry JSON/blob (duplicate names, inflated metadata). write_entry now writes the new entry first, removes the previously-aliased different entry, then writes the alias (content-addressed dedup keeps shared blobs intact). Adds regression tests: get_http_same_url_different_name (fresh-hit recovery) and get_name_reuse_replaces_entry (orphan removal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Follow-up to the previous alias hardening. Properly model the cache as URL-keyed *entries* (own the blob + metadata) with many name *aliases* pointing at them, with two-level reference counting: - Medium: the fresh-cache-hit recovery no longer mutates the shared entry's logical_name (which corrupted metadata for other aliases and left the newest name the only one cache-list showed). It now binds the requested name via a dedicated alias without touching the entry; cache-list enumerates aliases so every name is listed, and get_resource reports the requested name only in its returned metadata. - Low: re-fetching the same cache key after its content or compression changed now reclaims the previous blob/index when no other entry references it (write_entry compares prev vs new blob and frees the orphan). Cleanup paths are now alias-aware: deleting/pruning an entry removes all names pointing at it and frees the blob only when unreferenced; cache-info de-dupes by content hash and reports names + unique blobs. Adds regression tests for multi-name listing and changed-content blob reclamation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nup (roborev #2749) - Medium: --force / --refresh always now use http-cache CacheMode::Reload (always hit the origin, then update the cache) instead of deleting the requested alias. Deleting only the alias left a multi-aliased entry behind that could still be served as a fresh hit, so --force didn't actually re-fetch shared entries. - Medium: after every successful fetch, the requested name is now bound unconditionally to the entry the middleware actually used (via the observed cache key), repointing it when it previously pointed at a different (stale) entry — not only when the alias was missing. Previously a fresh hit under a name that already pointed elsewhere kept resolving the old data. - Low: delete_entry_by_keyhash now frees the data blob on an exact blob-path match (content hash AND compression) and reserves the content-hash-only check for the shared {blake3}.idx.zst index, so a compression variant of a shared hash is no longer left orphaned. Removes the now-unused delete_entry_by_name helper. Adds regression tests: get_force_refetches_shared_entry, get_fresh_hit_repoints_stale_name and get_compression_variant_blob_reclaimed (11 get tests pass). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The `get` feature was added to all_features, distrib_features, qsvmcp and datapusher_plus in Cargo.toml; update the docs that enumerate those bundles so the docs-drift-check CI gate passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Implements a new feature-gated get command and shared diskcache subsystem to fetch tabular data from local/HTTP/dathere:///ckan:// sources into a managed on-disk cache, and enables any qsv command to read cached entries via the dc: prefix. The PR also unifies lookup-table remote resolution onto the same CKAN/URI resolver to reduce duplicated (and security-sensitive) fetch logic.

Changes:

Added src/diskcache.rs with an always-compiled resolver/fetch core and a get-feature rich cache (content-addressed blobs, HTTP cache semantics, auto-indexing, dc: resolution).
Introduced src/cmd/get.rs plus integration into binaries/features, and added comprehensive integration tests (tests/test_get.rs).
Updated Config::new to resolve dc: inputs (when get is enabled) and updated lookup.rs to use the shared resolver/CKAN logic.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
tests/tests.rs	Wires `test_get` into the integration suite when the `get` feature is enabled.
tests/test_get.rs	Adds end-to-end tests for `get`, HTTP caching behavior, aliasing, dedup, `dc:` reads, and cache maintenance.
src/maindp.rs	Registers the `get` command and `diskcache` module for the `qsvdp` binary variant.
src/main.rs	Registers `get` (feature-gated) and includes the shared `diskcache` module in the main binary.
src/lookup.rs	Unifies URI/CKAN resolution and HTTP conditional GET behavior through `diskcache`.
src/diskcache.rs	New shared cache/resolver module + `get`-feature rich cache implementation and `dc:` resolver.
src/config.rs	Resolves `dc:` references up-front (feature-gated) and surfaces failures via `format_error`.
src/cmd/mod.rs	Registers the new `get` command module behind the `get` feature.
src/cmd/get.rs	Implements `qsv get` and cache management subcommands (`cache-list/info/clear/prune`).
README.md	Adds the `get` command to the command table and updates `qsvmcp` feature listing.
docs/help/TableOfContents.md	Adds `get` to the generated help ToC.
docs/help/get.md	Adds generated help documentation for `get`.
docs/FEATURES.md	Updates feature listings to include `get` in relevant feature groups/binaries.
docs/contributor/PROJECT_TECHNICAL_OVERVIEW.md	Updates contributor docs to reflect `get` in `qsvmcp`.
Cargo.toml	Adds new deps for HTTP caching/middleware and `zstd`, and defines the `get` feature set.
Cargo.lock	Locks new dependency versions introduced by the `get` feature.

Copilot findings: - set_qsv_cache_dir: propagate CliError instead of expand_tilde().unwrap() panicking when the home dir can't be determined (both branches). - resolve_uri_prefix: URL-encode the ckan://<name>? value before building the resource_search query (spaces/&/# no longer corrupt it). - resolve_ckan_resource: same-origin token-strip now compares against the effective API base (default when None), so the token isn't dropped for resources on the default CKAN origin. - atomic_write / ensure_indexed: use process+call-unique temp filenames so concurrent writers/indexers of the same target/content can't collide; clean up the temp on rename failure. - alias filenames: switch from the lossy safe_name() to a reversible, injective hex encoding (encode_name/decode_name) so distinct logical names never collide and cache-list shows the original name. - resolve_dc_path: give the dc: temp file a guaranteed tabular extension so Config's format check accepts extension-less handles. - cache-info: de-dupe on-disk totals by (blake3, compression), content totals by hash. DevSkim: add inline ignores (DS137138, DS162092) for the test-only local mock HTTP server (http://127.0.0.1). Adds regression test get_alias_names_do_not_collide. All 12 get tests + the lookup consumer suite pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The hex-encoded alias filename scheme had two problems: - Medium: hex doubles the filename length, so logical names longer than ~127 bytes exceeded the 255-byte filename limit and failed to cache. Alias filenames are now BLAKE3(name) — a fixed 64-char key — keeping them bounded regardless of name length while staying injective in practice. The original name is preserved inside the alias file content ("{keyhash}\n{name}") and used for cache-list display. - Medium: switching schemes left aliases written by the previous scheme unreachable. alias_keyhash() now falls back to the prior hex (encode_name) and safe_name alias files and migrates them to the canonical hashed alias on lookup; list_entries recovers the name from content, falling back to decoding legacy filenames. Adds regression test get_long_logical_name (200-char name). All 13 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The safe_name legacy-alias fallback is lossy: a different logical name can sanitize to the same filename. alias_keyhash now only adopts (migrates/removes) a safe_name legacy alias when the pointed entry's primary name actually matches the requested name, so resolving e.g. `dc:a b.csv` can't steal an existing `a_b.csv` legacy alias. The injective encode_name (hex) fallback is unaffected. Adds regression test get_legacy_safe_name_alias_not_stolen. All 14 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`get` is unreleased, so no cache exists that uses the prior `encode_name`/ `safe_name` alias filename schemes. Remove the legacy-migration machinery and the whole class of lossy-fallback edge cases it introduced: - alias_keyhash now resolves a name only via its canonical BLAKE3(name) alias. - Remove the now-unused encode_name/decode_name helpers. - list_entries reads the original name from the alias file content only. - Drop the legacy-only regression test. Alias filenames remain bounded BLAKE3 keys with the original name stored in the file content. 13 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds cloud object-storage sources and cache-management subcommands to `qsv get`, completing the originally-scoped feature set for #2263. Cloud storage (behind opt-in `get_cloud` sub-feature): - Fetch s3://, gs://, az:// (and s3a/adl/azure/abfs/abfss) via the object_store crate into the same content-addressed disk cache, readable by any command via the `dc:` prefix. ETag-based conditional revalidation (If-None-Match -> 304) gives the same "don't re-download unchanged" guarantee as the HTTP path. - Credentials come from the standard AWS_*/AZURE_*/GOOGLE_* environment (and IAM roles); a repeatable `--cloud-opt key=value` flag overrides region/endpoint/etc. - object_store is already compiled transitively via polars in distrib/qsvmcp/ datapusher_plus, so the marginal dependency weight there is ~zero (Cargo.lock gains no new crates). A bare `-F get` build stays lean — get_cloud is opt-in. Cache management: - `cache-set-ttl <name> --ttl` / `cache-set-policy <name> --refresh` mutate an existing entry's TTL / refresh policy in place (entry-level: affects every alias pointing at the entry). - `cache-list --verify` recomputes each blob's BLAKE3 and reports OK/FAIL per name, exiting non-zero on any integrity failure. `get_cloud` added to distrib_features, qsvmcp and datapusher_plus; the feature enumerations in FEATURES.md/README.md/PROJECT_TECHNICAL_OVERVIEW.md were updated and docs/help/get.md regenerated (docs-drift clean). Tests: +7 integration tests (S3 fetch + dc read, S3 ETag revalidation, S3 404, cache-set-ttl, cache-set-policy, verify-ok, verify-detects-corruption) using the actix mock with a path-style S3 endpoint override. All 20 get tests pass with get_cloud; 17 under plain get (the S3 tests are feature-gated). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cloud cache entries were keyed only by `CLOUD:{source}`, ignoring the store identity supplied via `--cloud-opt` / environment (endpoint, region, account). The same `s3://bucket/key` against two different S3-compatible endpoints (or Azure accounts) therefore collided in the cache, so a second fetch could revalidate against — or `--refresh never` could serve — data from the wrong backing store. Fix: - Derive a non-secret store identity (endpoint/region/account/… lower-cased, sorted, de-duplicated) from the resolved cloud options and fold it into the cache key (`CLOUD:{source}#k=v&...`). Credentials are excluded via a secret-key denylist so they never reach the on-disk cache key. - Persist that identity on the cache entry (new `cloud_identity` field, `#[serde(default)]`) and replay it on `dc:` auto-refresh so the refresh rebuilds the SAME store and resolves to the SAME entry. Credentials still come from the ambient environment and are never persisted. Adds `get_s3_identity_scopes_cache_key`: two fetches of one s3:// URL with different regions must download independently (a collision would make the second a 304 revalidation). All 21 get tests pass; clippy clean; no-cloud build unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The previous fix (#2756) built the cloud store-identity by sorting and de-duplicating whole `(key, value)` pairs, so a key supplied with two different values — e.g. `AWS_REGION=us-east-1` in the environment plus `--cloud-opt aws_region=eu-west-1` — kept BOTH. object_store, by contrast, builds the store with last-wins precedence (`--cloud-opt` overrides env), so the cache key diverged from the effective store, and a later `dc:` auto-refresh (which re-merges ambient env with the persisted identity) recomputed a different key and re-downloaded instead of revalidating the same entry. Fix: reduce the identity to a single value per (lower-cased) key using the same last-wins precedence object_store applies — env first, then `--cloud-opt` — via a BTreeMap (also yielding deterministic key-sorted order). Because the persisted identity is replayed as high-precedence `--cloud-opt` on refresh, it now wins over any ambient env var (even a changed one), so the recomputed key matches the original. Adds `get_s3_env_cloud_opt_override_refresh_stable`: an env-vs-`--cloud-opt` region override followed by a stale `dc:` refresh under a CHANGED `AWS_REGION` must revalidate the same entry (one download). All 22 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…2758) `get_s3_env_cloud_opt_override_refresh_stable` asserted only count==4 and body_sends==1, which `resolve_dc_path`'s best-effort fallback-to-stale would also satisfy on a *failed* refresh — so a broken `cloud_identity` replay could have passed the test without any successful refresh request. Add a `revalidations` (304) counter to the mock `GetWebServer` alongside the existing body-send counter, and assert the stale `dc:` read makes exactly one conditional request that the server answers with 304. This proves the identity replay rebuilt the correct store and resolved to the same entry, rather than silently serving the stale copy. The accessor is gated to get_cloud to avoid a dead-code warning in non-cloud builds. All 22 get tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 10 comments.

…Copilot #3953) Addresses 10 unresolved Copilot review comments on PR #3953: - tests/test_get.rs (7): the `cache-list` invocations that used `wrk.output(&mut list)` then inspected stdout now assert `out.status.success()` first (printing stderr on failure), so a non-zero exit can't be masked by a stdout substring match. - tests/test_get.rs: corrected a stale comment — alias filenames are BLAKE3 hashes and the original name is read from the alias file *content*, not "decoded from a reversible alias". - src/diskcache.rs: `set_qsv_cache_dir` doc no longer claims to "canonicalize" (it expands `~` and creates the dir; the path is otherwise returned as given). - src/diskcache.rs: `dc:` auto-refresh of a `ckan://` entry now re-resolves against the SAME CKAN instance it was fetched from. The effective `--ckan-api` base URL is persisted on the entry (new `ckan_api_url` field, `#[serde(default)]`) and replayed on refresh, falling back to `QSV_CKAN_API`/default only for pre-existing entries. Previously a non-default CKAN entry would silently refresh against the wrong instance. All 22 get tests pass; lookup suite non-regressing; clippy clean; fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The persisted-CKAN-API fix (Copilot #3953) set `ckan_api_url` on the cache entry for every HTTP fetch. Because `--ckan-api` always carries a default value, plain `http(s)://` entries were recorded — and surfaced by `cache-list --json` — with a CKAN API URL, contradicting the field's `None for non-CKAN entries` contract and making the metadata misleading. Fix: only populate `ckan_api_url` on the manager when `resolved.is_ckan` is true; non-CKAN HTTP entries persist `None`. Adds `get_http_entry_has_no_ckan_api_metadata`, asserting a plain http:// entry serializes `"ckan_api_url": null` and never the default CKAN URL. All 23 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-advanced-security AI found potential problems Jun 5, 2026

View reviewed changes

Comment thread tests/test_get.rs Fixed

Comment thread tests/test_get.rs Fixed

jqnatividad and others added 4 commits June 5, 2026 14:03

jqnatividad requested a review from Copilot June 5, 2026 19:50

Copilot started reviewing on behalf of jqnatividad June 5, 2026 19:50 View session

Copilot AI reviewed Jun 5, 2026

View reviewed changes

jqnatividad and others added 8 commits June 5, 2026 16:18

jqnatividad requested a review from Copilot June 6, 2026 02:49

Copilot started reviewing on behalf of jqnatividad June 6, 2026 02:50 View session

Copilot AI reviewed Jun 6, 2026

View reviewed changes

jqnatividad and others added 2 commits June 5, 2026 23:04

jqnatividad merged commit 2db0ebe into master Jun 6, 2026
18 of 19 checks passed

jqnatividad deleted the feat/get-command-2263 branch June 6, 2026 03:17

jqnatividad mentioned this pull request Jun 7, 2026

feat(get): Phase 3 — streaming + ranged/parallel downloads (HTTP & cloud), dc: stats cache #3958

Merged

Conversation

jqnatividad commented Jun 5, 2026

Summary

What's included

Design notes

Testing

Deferred (later phases of #2263)

Uh oh!

Uh oh!

Uh oh!

codacy-production Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Up to standards ✅

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codacy-production Bot commented Jun 5, 2026 •

edited

Loading