feat(get): new command for getting tabular data from various sources (#2263)#3953
Merged
Conversation
…2263) Add a `get` command + `dc:` disk-cache subsystem that fetches tabular data from local files, http(s) URLs, dathere:// and ckan:// into a managed, queryable cache reusable by any qsv command via the `dc:` prefix. - src/diskcache.rs: two-tier cache module. The always-compiled core shares the source resolver (dathere/ckan/http) incl. the same-origin CKAN token-strip security check; lookup.rs is unified onto it so there is a single, audited resolver path. The get-gated rich cache is a content-addressed zstd blob store + per-entry JSON index (BLAKE3, ETag, Last-Modified, compressed/uncompressed sizes, record count, TTL), with HTTP cache semantics via http-cache-reqwest + reqwest-middleware (custom CacheManager) and auto-indexing of cached files for instant dc: counts / random access. - src/cmd/get.rs: `qsv get <source>...` plus cache-list/info/clear/prune. - Config::new resolves `dc:<name>` to a decompressed temp CSV + sibling .idx, routing failures through format_error (get-gated). - Registered in qsv, qsvmcp and qsvdp; the `get` feature joins distrib_features, qsvmcp and datapusher_plus. - tests/test_get.rs: mock-server ETag 304 revalidation, content-addressed dedup, dc: indexed read, and cache clear/prune. Phase 1 of #2263; cloud storage (S3/Azure/GCS) and sftp are deferred. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 337 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Address three roborev findings on the `get`/diskcache subsystem: - Medium: fetching an already-cached HTTP URL under a different --name failed because a fresh cache hit never calls CacheManager::put (so no alias for the new name was written). The manager now records the cache key it operated on (in get or put) and get_resource recovers the entry by that key, binding the requested name to it. - Medium: local entries stored the (possibly relative) given path as the refresh source, so a stale dc: refresh from another directory could read the wrong file. ingest_local now stores the canonicalized absolute path. - Low: reusing a logical name for a different source orphaned the old entry JSON/blob (duplicate names, inflated metadata). write_entry now writes the new entry first, removes the previously-aliased different entry, then writes the alias (content-addressed dedup keeps shared blobs intact). Adds regression tests: get_http_same_url_different_name (fresh-hit recovery) and get_name_reuse_replaces_entry (orphan removal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to the previous alias hardening. Properly model the cache as URL-keyed *entries* (own the blob + metadata) with many name *aliases* pointing at them, with two-level reference counting: - Medium: the fresh-cache-hit recovery no longer mutates the shared entry's logical_name (which corrupted metadata for other aliases and left the newest name the only one cache-list showed). It now binds the requested name via a dedicated alias without touching the entry; cache-list enumerates aliases so every name is listed, and get_resource reports the requested name only in its returned metadata. - Low: re-fetching the same cache key after its content or compression changed now reclaims the previous blob/index when no other entry references it (write_entry compares prev vs new blob and frees the orphan). Cleanup paths are now alias-aware: deleting/pruning an entry removes all names pointing at it and frees the blob only when unreferenced; cache-info de-dupes by content hash and reports names + unique blobs. Adds regression tests for multi-name listing and changed-content blob reclamation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nup (roborev #2749) - Medium: --force / --refresh always now use http-cache CacheMode::Reload (always hit the origin, then update the cache) instead of deleting the requested alias. Deleting only the alias left a multi-aliased entry behind that could still be served as a fresh hit, so --force didn't actually re-fetch shared entries. - Medium: after every successful fetch, the requested name is now bound unconditionally to the entry the middleware actually used (via the observed cache key), repointing it when it previously pointed at a different (stale) entry — not only when the alias was missing. Previously a fresh hit under a name that already pointed elsewhere kept resolving the old data. - Low: delete_entry_by_keyhash now frees the data blob on an exact blob-path match (content hash AND compression) and reserves the content-hash-only check for the shared {blake3}.idx.zst index, so a compression variant of a shared hash is no longer left orphaned. Removes the now-unused delete_entry_by_name helper. Adds regression tests: get_force_refetches_shared_entry, get_fresh_hit_repoints_stale_name and get_compression_variant_blob_reclaimed (11 get tests pass). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `get` feature was added to all_features, distrib_features, qsvmcp and datapusher_plus in Cargo.toml; update the docs that enumerate those bundles so the docs-drift-check CI gate passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Implements a new feature-gated get command and shared diskcache subsystem to fetch tabular data from local/HTTP/dathere:///ckan:// sources into a managed on-disk cache, and enables any qsv command to read cached entries via the dc: prefix. The PR also unifies lookup-table remote resolution onto the same CKAN/URI resolver to reduce duplicated (and security-sensitive) fetch logic.
Changes:
- Added
src/diskcache.rswith an always-compiled resolver/fetch core and aget-feature rich cache (content-addressed blobs, HTTP cache semantics, auto-indexing,dc:resolution). - Introduced
src/cmd/get.rsplus integration into binaries/features, and added comprehensive integration tests (tests/test_get.rs). - Updated
Config::newto resolvedc:inputs (whengetis enabled) and updatedlookup.rsto use the shared resolver/CKAN logic.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/tests.rs | Wires test_get into the integration suite when the get feature is enabled. |
| tests/test_get.rs | Adds end-to-end tests for get, HTTP caching behavior, aliasing, dedup, dc: reads, and cache maintenance. |
| src/maindp.rs | Registers the get command and diskcache module for the qsvdp binary variant. |
| src/main.rs | Registers get (feature-gated) and includes the shared diskcache module in the main binary. |
| src/lookup.rs | Unifies URI/CKAN resolution and HTTP conditional GET behavior through diskcache. |
| src/diskcache.rs | New shared cache/resolver module + get-feature rich cache implementation and dc: resolver. |
| src/config.rs | Resolves dc: references up-front (feature-gated) and surfaces failures via format_error. |
| src/cmd/mod.rs | Registers the new get command module behind the get feature. |
| src/cmd/get.rs | Implements qsv get and cache management subcommands (cache-list/info/clear/prune). |
| README.md | Adds the get command to the command table and updates qsvmcp feature listing. |
| docs/help/TableOfContents.md | Adds get to the generated help ToC. |
| docs/help/get.md | Adds generated help documentation for get. |
| docs/FEATURES.md | Updates feature listings to include get in relevant feature groups/binaries. |
| docs/contributor/PROJECT_TECHNICAL_OVERVIEW.md | Updates contributor docs to reflect get in qsvmcp. |
| Cargo.toml | Adds new deps for HTTP caching/middleware and zstd, and defines the get feature set. |
| Cargo.lock | Locks new dependency versions introduced by the get feature. |
Copilot findings: - set_qsv_cache_dir: propagate CliError instead of expand_tilde().unwrap() panicking when the home dir can't be determined (both branches). - resolve_uri_prefix: URL-encode the ckan://<name>? value before building the resource_search query (spaces/&/# no longer corrupt it). - resolve_ckan_resource: same-origin token-strip now compares against the effective API base (default when None), so the token isn't dropped for resources on the default CKAN origin. - atomic_write / ensure_indexed: use process+call-unique temp filenames so concurrent writers/indexers of the same target/content can't collide; clean up the temp on rename failure. - alias filenames: switch from the lossy safe_name() to a reversible, injective hex encoding (encode_name/decode_name) so distinct logical names never collide and cache-list shows the original name. - resolve_dc_path: give the dc: temp file a guaranteed tabular extension so Config's format check accepts extension-less handles. - cache-info: de-dupe on-disk totals by (blake3, compression), content totals by hash. DevSkim: add inline ignores (DS137138, DS162092) for the test-only local mock HTTP server (http://127.0.0.1). Adds regression test get_alias_names_do_not_collide. All 12 get tests + the lookup consumer suite pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hex-encoded alias filename scheme had two problems:
- Medium: hex doubles the filename length, so logical names longer than ~127
bytes exceeded the 255-byte filename limit and failed to cache. Alias
filenames are now BLAKE3(name) — a fixed 64-char key — keeping them bounded
regardless of name length while staying injective in practice. The original
name is preserved inside the alias file content ("{keyhash}\n{name}") and
used for cache-list display.
- Medium: switching schemes left aliases written by the previous scheme
unreachable. alias_keyhash() now falls back to the prior hex (encode_name) and
safe_name alias files and migrates them to the canonical hashed alias on
lookup; list_entries recovers the name from content, falling back to decoding
legacy filenames.
Adds regression test get_long_logical_name (200-char name). All 13 get tests
pass; clippy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The safe_name legacy-alias fallback is lossy: a different logical name can sanitize to the same filename. alias_keyhash now only adopts (migrates/removes) a safe_name legacy alias when the pointed entry's primary name actually matches the requested name, so resolving e.g. `dc:a b.csv` can't steal an existing `a_b.csv` legacy alias. The injective encode_name (hex) fallback is unaffected. Adds regression test get_legacy_safe_name_alias_not_stolen. All 14 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`get` is unreleased, so no cache exists that uses the prior `encode_name`/ `safe_name` alias filename schemes. Remove the legacy-migration machinery and the whole class of lossy-fallback edge cases it introduced: - alias_keyhash now resolves a name only via its canonical BLAKE3(name) alias. - Remove the now-unused encode_name/decode_name helpers. - list_entries reads the original name from the alias file content only. - Drop the legacy-only regression test. Alias filenames remain bounded BLAKE3 keys with the original name stored in the file content. 13 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds cloud object-storage sources and cache-management subcommands to `qsv get`, completing the originally-scoped feature set for #2263. Cloud storage (behind opt-in `get_cloud` sub-feature): - Fetch s3://, gs://, az:// (and s3a/adl/azure/abfs/abfss) via the object_store crate into the same content-addressed disk cache, readable by any command via the `dc:` prefix. ETag-based conditional revalidation (If-None-Match -> 304) gives the same "don't re-download unchanged" guarantee as the HTTP path. - Credentials come from the standard AWS_*/AZURE_*/GOOGLE_* environment (and IAM roles); a repeatable `--cloud-opt key=value` flag overrides region/endpoint/etc. - object_store is already compiled transitively via polars in distrib/qsvmcp/ datapusher_plus, so the marginal dependency weight there is ~zero (Cargo.lock gains no new crates). A bare `-F get` build stays lean — get_cloud is opt-in. Cache management: - `cache-set-ttl <name> --ttl` / `cache-set-policy <name> --refresh` mutate an existing entry's TTL / refresh policy in place (entry-level: affects every alias pointing at the entry). - `cache-list --verify` recomputes each blob's BLAKE3 and reports OK/FAIL per name, exiting non-zero on any integrity failure. `get_cloud` added to distrib_features, qsvmcp and datapusher_plus; the feature enumerations in FEATURES.md/README.md/PROJECT_TECHNICAL_OVERVIEW.md were updated and docs/help/get.md regenerated (docs-drift clean). Tests: +7 integration tests (S3 fetch + dc read, S3 ETag revalidation, S3 404, cache-set-ttl, cache-set-policy, verify-ok, verify-detects-corruption) using the actix mock with a path-style S3 endpoint override. All 20 get tests pass with get_cloud; 17 under plain get (the S3 tests are feature-gated). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cloud cache entries were keyed only by `CLOUD:{source}`, ignoring the store
identity supplied via `--cloud-opt` / environment (endpoint, region, account).
The same `s3://bucket/key` against two different S3-compatible endpoints (or
Azure accounts) therefore collided in the cache, so a second fetch could
revalidate against — or `--refresh never` could serve — data from the wrong
backing store.
Fix:
- Derive a non-secret store identity (endpoint/region/account/… lower-cased,
sorted, de-duplicated) from the resolved cloud options and fold it into the
cache key (`CLOUD:{source}#k=v&...`). Credentials are excluded via a secret-key
denylist so they never reach the on-disk cache key.
- Persist that identity on the cache entry (new `cloud_identity` field,
`#[serde(default)]`) and replay it on `dc:` auto-refresh so the refresh rebuilds
the SAME store and resolves to the SAME entry. Credentials still come from the
ambient environment and are never persisted.
Adds `get_s3_identity_scopes_cache_key`: two fetches of one s3:// URL with
different regions must download independently (a collision would make the second
a 304 revalidation). All 21 get tests pass; clippy clean; no-cloud build
unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous fix (#2756) built the cloud store-identity by sorting and de-duplicating whole `(key, value)` pairs, so a key supplied with two different values — e.g. `AWS_REGION=us-east-1` in the environment plus `--cloud-opt aws_region=eu-west-1` — kept BOTH. object_store, by contrast, builds the store with last-wins precedence (`--cloud-opt` overrides env), so the cache key diverged from the effective store, and a later `dc:` auto-refresh (which re-merges ambient env with the persisted identity) recomputed a different key and re-downloaded instead of revalidating the same entry. Fix: reduce the identity to a single value per (lower-cased) key using the same last-wins precedence object_store applies — env first, then `--cloud-opt` — via a BTreeMap (also yielding deterministic key-sorted order). Because the persisted identity is replayed as high-precedence `--cloud-opt` on refresh, it now wins over any ambient env var (even a changed one), so the recomputed key matches the original. Adds `get_s3_env_cloud_opt_override_refresh_stable`: an env-vs-`--cloud-opt` region override followed by a stale `dc:` refresh under a CHANGED `AWS_REGION` must revalidate the same entry (one download). All 22 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2758) `get_s3_env_cloud_opt_override_refresh_stable` asserted only count==4 and body_sends==1, which `resolve_dc_path`'s best-effort fallback-to-stale would also satisfy on a *failed* refresh — so a broken `cloud_identity` replay could have passed the test without any successful refresh request. Add a `revalidations` (304) counter to the mock `GetWebServer` alongside the existing body-send counter, and assert the stale `dc:` read makes exactly one conditional request that the server answers with 304. This proves the identity replay rebuilt the correct store and resolved to the same entry, rather than silently serving the stale copy. The accessor is gated to get_cloud to avoid a dead-code warning in non-cloud builds. All 22 get tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Copilot #3953) Addresses 10 unresolved Copilot review comments on PR #3953: - tests/test_get.rs (7): the `cache-list` invocations that used `wrk.output(&mut list)` then inspected stdout now assert `out.status.success()` first (printing stderr on failure), so a non-zero exit can't be masked by a stdout substring match. - tests/test_get.rs: corrected a stale comment — alias filenames are BLAKE3 hashes and the original name is read from the alias file *content*, not "decoded from a reversible alias". - src/diskcache.rs: `set_qsv_cache_dir` doc no longer claims to "canonicalize" (it expands `~` and creates the dir; the path is otherwise returned as given). - src/diskcache.rs: `dc:` auto-refresh of a `ckan://` entry now re-resolves against the SAME CKAN instance it was fetched from. The effective `--ckan-api` base URL is persisted on the entry (new `ckan_api_url` field, `#[serde(default)]`) and replayed on refresh, falling back to `QSV_CKAN_API`/default only for pre-existing entries. Previously a non-default CKAN entry would silently refresh against the wrong instance. All 22 get tests pass; lookup suite non-regressing; clippy clean; fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The persisted-CKAN-API fix (Copilot #3953) set `ckan_api_url` on the cache entry for every HTTP fetch. Because `--ckan-api` always carries a default value, plain `http(s)://` entries were recorded — and surfaced by `cache-list --json` — with a CKAN API URL, contradicting the field's `None for non-CKAN entries` contract and making the metadata misleading. Fix: only populate `ckan_api_url` on the manager when `resolved.is_ckan` is true; non-CKAN HTTP entries persist `None`. Adds `get_http_entry_has_no_ckan_api_metadata`, asserting a plain http:// entry serializes `"ckan_api_url": null` and never the default CKAN URL. All 23 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements Phase 1 of #2263: a new
getcommand plus adc:disk-cache subsystem that fetches tabular data from various sources into a managed, queryable cache that any qsv command can read via thedc:prefix (e.g.qsv stats dc:data.csv).Sources supported in this phase: local files,
http(s)://URLs,dathere://, andckan://(<id>and<name>?). Cloud storage (S3/Azure/GCS) andsftp://are deferred to later phases.What's included
src/diskcache.rs— a two-tier cache module:dathere:///ckan:/// http), including the security-critical same-origin CKAN-token-strip check.lookup.rsis unified onto this, so there is now a single, audited resolver/fetch path instead of two.get-feature-gated rich cache: a content-addressed, zstd-compressed blob store + a per-entry JSON index recording BLAKE3, ETag, Last-Modified, compressed/uncompressed sizes, record count, and TTL. HTTP cache semantics (ETag / Cache-Control / 304 revalidation) come from a customhttp_cache::CacheManagerwired through http-cache-reqwest + reqwest-middleware. Every cached file is auto-indexed (qsv.idx) sodc:reads get instant counts & random access.src/cmd/get.rs—qsv get <source>...pluscache-list/cache-info/cache-clear/cache-prune.Config::newresolvesdc:<name>to a decompressed temp CSV + sibling.idx, routing failures throughformat_error(get-gated, no panics).getfeature joinsdistrib_features,qsvmcpanddatapusher_plus(notlite).docs/help/get.md/ ToC.Design notes
http-cache-reqwest 0.16is pinned to reqwest 0.12; only the1.0.0-alpha.6pre-release targets qsv's reqwest 0.13. It resolves to a single reqwest version (no second major). Pinned with theurl-standardfeature.luaulookup tables &validatedynamicEnumreference data, and should speed up Datapusher+ harvesting.Testing
tests/test_get.rs(actix mock server): ETag 304 revalidation (asserts the body is downloaded only once across two fetches via a shared in-process counter), content-addressed dedup,dc:indexed read, and cache clear/prune — all green.dathere://) pass — no regression from the unification.qsv,qsvmcp,qsvdpall compile;qsvalso compiles withoutget(gating verified);cargo check -F all_featuresclean; clippy clean;cargo +nightly fmtapplied.Deferred (later phases of #2263)
Cloud storage via
object_store,sftp://, derived stats/frequency caches,cache-set-ttl/cache-set-policy,--verify, and HTTP/3 / multipart downloads.🤖 Generated with Claude Code