Skip to content

feat(get): new command for getting tabular data from various sources (#2263)#3953

Merged
jqnatividad merged 15 commits into
masterfrom
feat/get-command-2263
Jun 6, 2026
Merged

feat(get): new command for getting tabular data from various sources (#2263)#3953
jqnatividad merged 15 commits into
masterfrom
feat/get-command-2263

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

Implements Phase 1 of #2263: a new get command plus a dc: disk-cache subsystem that fetches tabular data from various sources into a managed, queryable cache that any qsv command can read via the dc: prefix (e.g. qsv stats dc:data.csv).

Sources supported in this phase: local files, http(s):// URLs, dathere://, and ckan:// (<id> and <name>?). Cloud storage (S3/Azure/GCS) and sftp:// are deferred to later phases.

What's included

  • src/diskcache.rs — a two-tier cache module:
    • Always-compiled core: the shared source resolver (dathere:// / ckan:// / http), including the security-critical same-origin CKAN-token-strip check. lookup.rs is unified onto this, so there is now a single, audited resolver/fetch path instead of two.
    • get-feature-gated rich cache: a content-addressed, zstd-compressed blob store + a per-entry JSON index recording BLAKE3, ETag, Last-Modified, compressed/uncompressed sizes, record count, and TTL. HTTP cache semantics (ETag / Cache-Control / 304 revalidation) come from a custom http_cache::CacheManager wired through http-cache-reqwest + reqwest-middleware. Every cached file is auto-indexed (qsv .idx) so dc: reads get instant counts & random access.
  • src/cmd/get.rsqsv get <source>... plus cache-list / cache-info / cache-clear / cache-prune.
  • Config::new resolves dc:<name> to a decompressed temp CSV + sibling .idx, routing failures through format_error (get-gated, no panics).
  • Registered in qsv, qsvmcp and qsvdp; the get feature joins distrib_features, qsvmcp and datapusher_plus (not lite).
  • README row + auto-generated docs/help/get.md / ToC.

Design notes

  • Stable http-cache-reqwest 0.16 is pinned to reqwest 0.12; only the 1.0.0-alpha.6 pre-release targets qsv's reqwest 0.13. It resolves to a single reqwest version (no second major). Pinned with the url-standard feature.
  • Use cases (per the issue): efficiently seeds luau lookup tables & validate dynamicEnum reference data, and should speed up Datapusher+ harvesting.

Testing

  • tests/test_get.rs (actix mock server): ETag 304 revalidation (asserts the body is downloaded only once across two fetches via a shared in-process counter), content-addressed dedup, dc: indexed read, and cache clear/prune — all green.
  • Lookup-consumer tests (luau/template; local + http + dathere://) pass — no regression from the unification.
  • qsv, qsvmcp, qsvdp all compile; qsv also compiles without get (gating verified); cargo check -F all_features clean; clippy clean; cargo +nightly fmt applied.

Deferred (later phases of #2263)

Cloud storage via object_store, sftp://, derived stats/frequency caches, cache-set-ttl / cache-set-policy, --verify, and HTTP/3 / multipart downloads.

🤖 Generated with Claude Code

…2263)

Add a `get` command + `dc:` disk-cache subsystem that fetches tabular data
from local files, http(s) URLs, dathere:// and ckan:// into a managed,
queryable cache reusable by any qsv command via the `dc:` prefix.

- src/diskcache.rs: two-tier cache module. The always-compiled core shares
  the source resolver (dathere/ckan/http) incl. the same-origin CKAN
  token-strip security check; lookup.rs is unified onto it so there is a
  single, audited resolver path. The get-gated rich cache is a
  content-addressed zstd blob store + per-entry JSON index (BLAKE3, ETag,
  Last-Modified, compressed/uncompressed sizes, record count, TTL), with HTTP
  cache semantics via http-cache-reqwest + reqwest-middleware (custom
  CacheManager) and auto-indexing of cached files for instant dc: counts /
  random access.
- src/cmd/get.rs: `qsv get <source>...` plus cache-list/info/clear/prune.
- Config::new resolves `dc:<name>` to a decompressed temp CSV + sibling .idx,
  routing failures through format_error (get-gated).
- Registered in qsv, qsvmcp and qsvdp; the `get` feature joins
  distrib_features, qsvmcp and datapusher_plus.
- tests/test_get.rs: mock-server ETag 304 revalidation, content-addressed
  dedup, dc: indexed read, and cache clear/prune.

Phase 1 of #2263; cloud storage (S3/Azure/GCS) and sftp are deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread tests/test_get.rs Fixed
Comment thread tests/test_get.rs Fixed
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented Jun 5, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 337 complexity

Metric Results
Complexity 337

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

jqnatividad and others added 4 commits June 5, 2026 14:03
Address three roborev findings on the `get`/diskcache subsystem:

- Medium: fetching an already-cached HTTP URL under a different --name failed
  because a fresh cache hit never calls CacheManager::put (so no alias for the
  new name was written). The manager now records the cache key it operated on
  (in get or put) and get_resource recovers the entry by that key, binding the
  requested name to it.
- Medium: local entries stored the (possibly relative) given path as the
  refresh source, so a stale dc: refresh from another directory could read the
  wrong file. ingest_local now stores the canonicalized absolute path.
- Low: reusing a logical name for a different source orphaned the old entry
  JSON/blob (duplicate names, inflated metadata). write_entry now writes the new
  entry first, removes the previously-aliased different entry, then writes the
  alias (content-addressed dedup keeps shared blobs intact).

Adds regression tests: get_http_same_url_different_name (fresh-hit recovery)
and get_name_reuse_replaces_entry (orphan removal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to the previous alias hardening. Properly model the cache as
URL-keyed *entries* (own the blob + metadata) with many name *aliases*
pointing at them, with two-level reference counting:

- Medium: the fresh-cache-hit recovery no longer mutates the shared entry's
  logical_name (which corrupted metadata for other aliases and left the newest
  name the only one cache-list showed). It now binds the requested name via a
  dedicated alias without touching the entry; cache-list enumerates aliases so
  every name is listed, and get_resource reports the requested name only in its
  returned metadata.
- Low: re-fetching the same cache key after its content or compression changed
  now reclaims the previous blob/index when no other entry references it
  (write_entry compares prev vs new blob and frees the orphan).

Cleanup paths are now alias-aware: deleting/pruning an entry removes all names
pointing at it and frees the blob only when unreferenced; cache-info de-dupes by
content hash and reports names + unique blobs. Adds regression tests for
multi-name listing and changed-content blob reclamation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nup (roborev #2749)

- Medium: --force / --refresh always now use http-cache CacheMode::Reload
  (always hit the origin, then update the cache) instead of deleting the
  requested alias. Deleting only the alias left a multi-aliased entry behind
  that could still be served as a fresh hit, so --force didn't actually
  re-fetch shared entries.
- Medium: after every successful fetch, the requested name is now bound
  unconditionally to the entry the middleware actually used (via the observed
  cache key), repointing it when it previously pointed at a different (stale)
  entry — not only when the alias was missing. Previously a fresh hit under a
  name that already pointed elsewhere kept resolving the old data.
- Low: delete_entry_by_keyhash now frees the data blob on an exact blob-path
  match (content hash AND compression) and reserves the content-hash-only check
  for the shared {blake3}.idx.zst index, so a compression variant of a shared
  hash is no longer left orphaned.

Removes the now-unused delete_entry_by_name helper. Adds regression tests:
get_force_refetches_shared_entry, get_fresh_hit_repoints_stale_name and
get_compression_variant_blob_reclaimed (11 get tests pass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `get` feature was added to all_features, distrib_features, qsvmcp and
datapusher_plus in Cargo.toml; update the docs that enumerate those bundles so
the docs-drift-check CI gate passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements a new feature-gated get command and shared diskcache subsystem to fetch tabular data from local/HTTP/dathere:///ckan:// sources into a managed on-disk cache, and enables any qsv command to read cached entries via the dc: prefix. The PR also unifies lookup-table remote resolution onto the same CKAN/URI resolver to reduce duplicated (and security-sensitive) fetch logic.

Changes:

  • Added src/diskcache.rs with an always-compiled resolver/fetch core and a get-feature rich cache (content-addressed blobs, HTTP cache semantics, auto-indexing, dc: resolution).
  • Introduced src/cmd/get.rs plus integration into binaries/features, and added comprehensive integration tests (tests/test_get.rs).
  • Updated Config::new to resolve dc: inputs (when get is enabled) and updated lookup.rs to use the shared resolver/CKAN logic.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/tests.rs Wires test_get into the integration suite when the get feature is enabled.
tests/test_get.rs Adds end-to-end tests for get, HTTP caching behavior, aliasing, dedup, dc: reads, and cache maintenance.
src/maindp.rs Registers the get command and diskcache module for the qsvdp binary variant.
src/main.rs Registers get (feature-gated) and includes the shared diskcache module in the main binary.
src/lookup.rs Unifies URI/CKAN resolution and HTTP conditional GET behavior through diskcache.
src/diskcache.rs New shared cache/resolver module + get-feature rich cache implementation and dc: resolver.
src/config.rs Resolves dc: references up-front (feature-gated) and surfaces failures via format_error.
src/cmd/mod.rs Registers the new get command module behind the get feature.
src/cmd/get.rs Implements qsv get and cache management subcommands (cache-list/info/clear/prune).
README.md Adds the get command to the command table and updates qsvmcp feature listing.
docs/help/TableOfContents.md Adds get to the generated help ToC.
docs/help/get.md Adds generated help documentation for get.
docs/FEATURES.md Updates feature listings to include get in relevant feature groups/binaries.
docs/contributor/PROJECT_TECHNICAL_OVERVIEW.md Updates contributor docs to reflect get in qsvmcp.
Cargo.toml Adds new deps for HTTP caching/middleware and zstd, and defines the get feature set.
Cargo.lock Locks new dependency versions introduced by the get feature.

Comment thread src/diskcache.rs Outdated
Comment thread src/diskcache.rs Outdated
Comment thread src/diskcache.rs
Comment thread src/diskcache.rs Outdated
Comment thread src/diskcache.rs Outdated
Comment thread src/diskcache.rs
Comment thread src/diskcache.rs
Comment thread src/diskcache.rs
Comment thread src/cmd/get.rs
jqnatividad and others added 8 commits June 5, 2026 16:18
Copilot findings:
- set_qsv_cache_dir: propagate CliError instead of expand_tilde().unwrap()
  panicking when the home dir can't be determined (both branches).
- resolve_uri_prefix: URL-encode the ckan://<name>? value before building the
  resource_search query (spaces/&/# no longer corrupt it).
- resolve_ckan_resource: same-origin token-strip now compares against the
  effective API base (default when None), so the token isn't dropped for
  resources on the default CKAN origin.
- atomic_write / ensure_indexed: use process+call-unique temp filenames so
  concurrent writers/indexers of the same target/content can't collide; clean
  up the temp on rename failure.
- alias filenames: switch from the lossy safe_name() to a reversible, injective
  hex encoding (encode_name/decode_name) so distinct logical names never collide
  and cache-list shows the original name.
- resolve_dc_path: give the dc: temp file a guaranteed tabular extension so
  Config's format check accepts extension-less handles.
- cache-info: de-dupe on-disk totals by (blake3, compression), content totals
  by hash.

DevSkim: add inline ignores (DS137138, DS162092) for the test-only local mock
HTTP server (http://127.0.0.1).

Adds regression test get_alias_names_do_not_collide. All 12 get tests + the
lookup consumer suite pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hex-encoded alias filename scheme had two problems:

- Medium: hex doubles the filename length, so logical names longer than ~127
  bytes exceeded the 255-byte filename limit and failed to cache. Alias
  filenames are now BLAKE3(name) — a fixed 64-char key — keeping them bounded
  regardless of name length while staying injective in practice. The original
  name is preserved inside the alias file content ("{keyhash}\n{name}") and
  used for cache-list display.
- Medium: switching schemes left aliases written by the previous scheme
  unreachable. alias_keyhash() now falls back to the prior hex (encode_name) and
  safe_name alias files and migrates them to the canonical hashed alias on
  lookup; list_entries recovers the name from content, falling back to decoding
  legacy filenames.

Adds regression test get_long_logical_name (200-char name). All 13 get tests
pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The safe_name legacy-alias fallback is lossy: a different logical name can
sanitize to the same filename. alias_keyhash now only adopts (migrates/removes)
a safe_name legacy alias when the pointed entry's primary name actually matches
the requested name, so resolving e.g. `dc:a b.csv` can't steal an existing
`a_b.csv` legacy alias. The injective encode_name (hex) fallback is unaffected.

Adds regression test get_legacy_safe_name_alias_not_stolen. All 14 get tests
pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`get` is unreleased, so no cache exists that uses the prior `encode_name`/
`safe_name` alias filename schemes. Remove the legacy-migration machinery and
the whole class of lossy-fallback edge cases it introduced:

- alias_keyhash now resolves a name only via its canonical BLAKE3(name) alias.
- Remove the now-unused encode_name/decode_name helpers.
- list_entries reads the original name from the alias file content only.
- Drop the legacy-only regression test.

Alias filenames remain bounded BLAKE3 keys with the original name stored in the
file content. 13 get tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds cloud object-storage sources and cache-management subcommands to `qsv get`,
completing the originally-scoped feature set for #2263.

Cloud storage (behind opt-in `get_cloud` sub-feature):
- Fetch s3://, gs://, az:// (and s3a/adl/azure/abfs/abfss) via the object_store
  crate into the same content-addressed disk cache, readable by any command via
  the `dc:` prefix. ETag-based conditional revalidation (If-None-Match -> 304)
  gives the same "don't re-download unchanged" guarantee as the HTTP path.
- Credentials come from the standard AWS_*/AZURE_*/GOOGLE_* environment (and IAM
  roles); a repeatable `--cloud-opt key=value` flag overrides region/endpoint/etc.
- object_store is already compiled transitively via polars in distrib/qsvmcp/
  datapusher_plus, so the marginal dependency weight there is ~zero (Cargo.lock
  gains no new crates). A bare `-F get` build stays lean — get_cloud is opt-in.

Cache management:
- `cache-set-ttl <name> --ttl` / `cache-set-policy <name> --refresh` mutate an
  existing entry's TTL / refresh policy in place (entry-level: affects every alias
  pointing at the entry).
- `cache-list --verify` recomputes each blob's BLAKE3 and reports OK/FAIL per name,
  exiting non-zero on any integrity failure.

`get_cloud` added to distrib_features, qsvmcp and datapusher_plus; the feature
enumerations in FEATURES.md/README.md/PROJECT_TECHNICAL_OVERVIEW.md were updated
and docs/help/get.md regenerated (docs-drift clean).

Tests: +7 integration tests (S3 fetch + dc read, S3 ETag revalidation, S3 404,
cache-set-ttl, cache-set-policy, verify-ok, verify-detects-corruption) using the
actix mock with a path-style S3 endpoint override. All 20 get tests pass with
get_cloud; 17 under plain get (the S3 tests are feature-gated).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cloud cache entries were keyed only by `CLOUD:{source}`, ignoring the store
identity supplied via `--cloud-opt` / environment (endpoint, region, account).
The same `s3://bucket/key` against two different S3-compatible endpoints (or
Azure accounts) therefore collided in the cache, so a second fetch could
revalidate against — or `--refresh never` could serve — data from the wrong
backing store.

Fix:
- Derive a non-secret store identity (endpoint/region/account/… lower-cased,
  sorted, de-duplicated) from the resolved cloud options and fold it into the
  cache key (`CLOUD:{source}#k=v&...`). Credentials are excluded via a secret-key
  denylist so they never reach the on-disk cache key.
- Persist that identity on the cache entry (new `cloud_identity` field,
  `#[serde(default)]`) and replay it on `dc:` auto-refresh so the refresh rebuilds
  the SAME store and resolves to the SAME entry. Credentials still come from the
  ambient environment and are never persisted.

Adds `get_s3_identity_scopes_cache_key`: two fetches of one s3:// URL with
different regions must download independently (a collision would make the second
a 304 revalidation). All 21 get tests pass; clippy clean; no-cloud build
unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous fix (#2756) built the cloud store-identity by sorting and
de-duplicating whole `(key, value)` pairs, so a key supplied with two different
values — e.g. `AWS_REGION=us-east-1` in the environment plus
`--cloud-opt aws_region=eu-west-1` — kept BOTH. object_store, by contrast, builds
the store with last-wins precedence (`--cloud-opt` overrides env), so the cache
key diverged from the effective store, and a later `dc:` auto-refresh (which
re-merges ambient env with the persisted identity) recomputed a different key and
re-downloaded instead of revalidating the same entry.

Fix: reduce the identity to a single value per (lower-cased) key using the same
last-wins precedence object_store applies — env first, then `--cloud-opt` — via a
BTreeMap (also yielding deterministic key-sorted order). Because the persisted
identity is replayed as high-precedence `--cloud-opt` on refresh, it now wins over
any ambient env var (even a changed one), so the recomputed key matches the
original.

Adds `get_s3_env_cloud_opt_override_refresh_stable`: an env-vs-`--cloud-opt`
region override followed by a stale `dc:` refresh under a CHANGED `AWS_REGION`
must revalidate the same entry (one download). All 22 get tests pass; clippy
clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2758)

`get_s3_env_cloud_opt_override_refresh_stable` asserted only count==4 and
body_sends==1, which `resolve_dc_path`'s best-effort fallback-to-stale would also
satisfy on a *failed* refresh — so a broken `cloud_identity` replay could have
passed the test without any successful refresh request.

Add a `revalidations` (304) counter to the mock `GetWebServer` alongside the
existing body-send counter, and assert the stale `dc:` read makes exactly one
conditional request that the server answers with 304. This proves the identity
replay rebuilt the correct store and resolved to the same entry, rather than
silently serving the stale copy. The accessor is gated to get_cloud to avoid a
dead-code warning in non-cloud builds. All 22 get tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 10 comments.

Comment thread tests/test_get.rs
Comment thread tests/test_get.rs
Comment thread tests/test_get.rs
Comment thread tests/test_get.rs
Comment thread tests/test_get.rs
Comment thread tests/test_get.rs
Comment thread tests/test_get.rs
Comment thread src/diskcache.rs Outdated
Comment thread src/diskcache.rs
Comment thread tests/test_get.rs Outdated
jqnatividad and others added 2 commits June 5, 2026 23:04
…Copilot #3953)

Addresses 10 unresolved Copilot review comments on PR #3953:

- tests/test_get.rs (7): the `cache-list` invocations that used
  `wrk.output(&mut list)` then inspected stdout now assert `out.status.success()`
  first (printing stderr on failure), so a non-zero exit can't be masked by a
  stdout substring match.
- tests/test_get.rs: corrected a stale comment — alias filenames are BLAKE3
  hashes and the original name is read from the alias file *content*, not
  "decoded from a reversible alias".
- src/diskcache.rs: `set_qsv_cache_dir` doc no longer claims to "canonicalize"
  (it expands `~` and creates the dir; the path is otherwise returned as given).
- src/diskcache.rs: `dc:` auto-refresh of a `ckan://` entry now re-resolves
  against the SAME CKAN instance it was fetched from. The effective `--ckan-api`
  base URL is persisted on the entry (new `ckan_api_url` field, `#[serde(default)]`)
  and replayed on refresh, falling back to `QSV_CKAN_API`/default only for
  pre-existing entries. Previously a non-default CKAN entry would silently refresh
  against the wrong instance.

All 22 get tests pass; lookup suite non-regressing; clippy clean; fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The persisted-CKAN-API fix (Copilot #3953) set `ckan_api_url` on the cache entry
for every HTTP fetch. Because `--ckan-api` always carries a default value, plain
`http(s)://` entries were recorded — and surfaced by `cache-list --json` — with a
CKAN API URL, contradicting the field's `None for non-CKAN entries` contract and
making the metadata misleading.

Fix: only populate `ckan_api_url` on the manager when `resolved.is_ckan` is true;
non-CKAN HTTP entries persist `None`. Adds `get_http_entry_has_no_ckan_api_metadata`,
asserting a plain http:// entry serializes `"ckan_api_url": null` and never the
default CKAN URL. All 23 get tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 2db0ebe into master Jun 6, 2026
18 of 19 checks passed
@jqnatividad jqnatividad deleted the feat/get-command-2263 branch June 6, 2026 03:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants