feat(get): Phase 3 — streaming + ranged/parallel downloads (HTTP & cloud), dc: stats cache#3958
Conversation
Implements Phase 3 §2.3 items (1) streaming ingest and (2) cloud-path parallel ranged downloads for the `get` command, plus two fixes found along the way. - diskcache: add a streaming `BlobSink` (incremental BLAKE3 + streaming zstd to a temp file, atomically renamed to the content-addressed blob). It never buffers the whole object in memory, unlike `store_blob`. (`get_cloud`-gated for now since only the cloud path uses it.) - diskcache: rework `ingest_cloud` to fetch large cloud objects as concurrent, in-order byte-ranges streamed straight into the `BlobSink`, so peak memory is ~`concurrency * part_size` regardless of object size. The first-part ranged GET doubles as both the conditional revalidation request (preserving ETag/304 behavior) and the size probe (Content-Range carries the total), so there is no extra HEAD round-trip. Objects that fit in one part take a single request. Tunable via the new `QSV_GET_PART_SIZE` (default 8 MiB) and `QSV_GET_CONCURRENCY` (default 4) env vars (env-only, so they don't collide with object_store's `--cloud-opt` config keys). No new dependencies. - util: resolve the `dc:` prefix in `process_input` before its existence check. `dc:` was only resolved in `Config::new`, so commands that pre-validate inputs via `process_input` (cat, slice, joins, ...) wrongly rejected a valid `dc:` handle as a missing file. Now `dc:<name>` is resolved to its materialized CSV (with sibling .idx) up front. - diskcache: fix `alias_path` to pass `to_hex().as_str()` to `Path::join` (blake3's `ArrayString` impls `AsRef<str>`, not `AsRef<Path>`). A pre-existing break in minimal `get*` feature combos, masked by `all_features`. - tests: extend the mock S3 server with 206/Content-Range range support and a larger `big.csv` endpoint; add `get_s3_multipart_ranged_download` (byte-for-byte reassembly check via `--output`) and `get_dc_works_with_process_input_commands` (cat/slice/count over `dc:`). - docs: document `QSV_GET_PART_SIZE` and `QSV_GET_CONCURRENCY`. Validated: `all_features` build + full suite (3019 passed, 0 failed); clippy and `cargo +nightly fmt` clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 3 §2.2 (derived stats cache for dc: inputs). The investigation found
the stats-cache "smart" commands don't just miss a cache on dc: handles —
they ERROR, so this is a bug fix plus the persist-on-use cache feature.
Part A (src/util.rs) — fix: util::get_stats_records canonicalized the raw
arg_input, so a "dc:<name>" handle failed with `No such file or directory`.
Resolve the dc: prefix to the entry's materialized CSV path up front (and
feed that real path to the stats subprocess). This repairs `frequency`,
`schema`, `joinp`, `sample`, `pivotp`, and `profile` on dc: inputs, which
previously errored.
Part B (src/diskcache.rs) — persist-on-use stats cache: resolve_dc_path now
captures the `.stats.csv.data.jsonl` sidecar a smart command leaves next to
the temp CSV into a durable, content-addressed blob ({blake3}.stats.jsonl.zst)
on the next dc: access, and restores it (with a fresh mtime, like the .idx)
when the temp copy is gone or stale. So a warm stats cache survives temp-dir
cleanup. The blob is freed alongside the data/idx blobs in
delete_entry_by_keyhash (content-hash referenced). The BufWriter/Write
imports (used only by the get_cloud BlobSink) are gated so a bare `-F get`
build stays warning-free.
Scope (per the chosen MVP): stats `.data.jsonl` only, captured once per
content hash — a richer-mode consumer than the captured cache may still
recompute. Frequency cache and full `qsv stats`/`frequency` self-acceleration
are follow-ons.
tests/test_get.rs: get_dc_stats_cache_for_smart_commands asserts frequency &
schema work on dc: and that the stats blob is captured durably.
Validated: all_features build + full suite (487 unit + 3020 integration
passed, 0 failed); clippy and `cargo +nightly fmt` clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… middleware
Replace the http-cache-reqwest middleware on the HTTP fetch path with a new
`ingest_http` that mirrors the cloud path: an ETag/Last-Modified conditional
GET that doubles as a size probe (first ranged part), then parallel in-order
ranged downloads for large objects. This brings HTTP streaming/ranges (no more
whole-body buffering) and unifies the HTTP and cloud ingest paths.
- New `ingest_http(opts, root, final_url, auth_token, ckan_resource_hash,
is_ckan)`: 304 → not-modified; 206 + known total → stream first part then
fetch remaining ranges via buffered concurrency (If-Match guarded); 206 +
unknown total → re-fetch whole in one streamed pass; 200 → stream full body.
Cache key `HTTP:{opts.source}`; persists ETag + Last-Modified and ckan_api_url
only for ckan:// entries. Reuses QSV_GET_PART_SIZE / QSV_GET_CONCURRENCY.
- Rewrite `get_resource` HTTP branch to resolve CKAN then delegate to
ingest_http; remove QsvCacheManager + CacheManager impl, HttpStored,
StoredEntry.http, BoxError, and the http-cache/* imports.
- Un-gate BlobSink/BlobSinkWriter/env_u64/DEFAULT_PART_SIZE/
DEFAULT_DL_CONCURRENCY from get_cloud → get (shared by both ingest paths).
- Drop deps: http-cache, http-cache-reqwest, http-cache-semantics,
reqwest-middleware (4 crates gone from Cargo.lock); `get = ["zstd"]`.
- Tests: `_fresh` mock handlers now honor If-None-Match (304) since the unified
path always sends a conditional request; un-gate big_csv/serve_big/BIG_ETAG;
new `get_http_ranged_download` asserts fan-out (body_sends > 1) and
byte-for-byte reassembly.
Tradeoff: RFC 9111 Cache-Control max-age freshness is no longer honored (a
re-`qsv get` does an ETag/Last-Modified conditional GET → 304 instead of a
zero-request hit). Minor for qsv — dc: staleness is governed by RefreshPolicy/
TTL, not http-cache. Also resolves the §2.5 http-cache-alpha tech debt.
Full all_features suite: 3021 passed, 0 failed, 42 ignored.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address three Medium findings in the unified HTTP ranged downloader: - Lazy range generation: follow-up byte-ranges are now produced via `std::iter::from_fn` (consumed lazily by `buffered`) instead of a pre-built `Vec`, so the range list never materializes in full for a huge object + tiny `QSV_GET_PART_SIZE`. Next-offset uses `saturating_add` to avoid overflow on a hostile `Content-Range` total. - Weak-ETag If-Match: `If-Match` requires strong comparison, so a weak validator (`W/"..."`) made compliant origins answer 412 and fail the download. Now `If-Match` is sent only for strong ETags; a weak/absent ETag falls back to an `If-Unmodified-Since` guard from Last-Modified, else relies on per-range validation. - Per-range validation: each follow-up range response must be 206 with a Content-Range matching the requested start/end/total and a body of the exact requested length. A 200 full body, wrong slice, or short read now errors instead of being stitched into the blob (silent corruption). All 27 get tests pass (incl. get_http_ranged_download); fmt + clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ttp) The cloud ranged-download path pre-built a `Vec` of every outstanding byte-range before streaming, so memory grew with `total / part_size` (unbounded for a huge object + tiny QSV_GET_PART_SIZE), and the `start + part_size` next-offset could overflow on a hostile total — the same class of issue just fixed in `ingest_http` (roborev #2777). Generate the ranges lazily via `std::iter::from_fn` (consumed by `buffered`) so the list never fully materializes, and use `saturating_add` for the next offset. Behavior is otherwise unchanged; object_store still guards each range with If-Match and returns exactly the requested bytes (or errors), so no per-range validation is needed here as on the raw-reqwest HTTP path. All 27 get tests pass (incl. get_s3_multipart_ranged_download); fmt + clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…failure Address roborev #2776 (stats-cache cross-contamination + orphaned blob) and the remaining Low from #2775 (BlobSink temp-file leak). The #2775 Medium (unbounded cloud range Vec) was already fixed in 43aa583. #2776 Medium — stats-cache cross-contamination across extensions: The durable stats blob was keyed only by content hash, but a dc: handle's materialized extension selects the delimiter (.csv => comma, .tsv/.tab => tab, .ssv => semicolon), and two aliases of the same bytes that differ only in extension resolved to the SAME temp dir (keyed by blake3) — so they shared both the on-disk `.stats.csv.data.jsonl` sidecar (same temp stem) and the durable blob, producing wrong schema/frequency for the second delimiter. Now resolve_dc_path isolates each extension in its own temp subdir (`qsv-dc/{blake3}/{ext}/`) AND keys the stats blob by extension (`{blake3}.{ext}.stats.jsonl.zst`), so each delimiter gets its own correct cache. New shared `TABULAR_EXTS` const. #2776 Low — orphaned stats blob on refresh: write_entry's content-changed branch removed the old data + index blobs but not the stats blob; it now reclaims every per-extension stats blob for the old content (delete_entry_by_keyhash does the same). #2775 Low — BlobSink temp-file leak: If an ingest errored between BlobSink::new and finish(), the partial `blobs/ingest-*.tmp` was never removed. BlobSink.writer is now an Option and a Drop impl releases the handle (so Windows can unlink) and removes the temp unless finish() already took it. New regression test get_dc_stats_cache_not_shared_across_extensions proves a .csv alias (3 fields) and a .tsv alias (1 field) of the same bytes keep separate caches. All 28 get tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 124 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Pull request overview
This PR advances the get/dc: subsystem by switching remote ingestion to a bounded-memory, streaming implementation with parallel ranged downloads (HTTP and cloud), and by persisting a derived stats-cache for dc: inputs to avoid recomputation and survive temp-dir cleanup. It also removes the prior http-cache-reqwest middleware dependency cluster and replaces it with explicit conditional revalidation logic.
Changes:
- Implement streaming, content-addressed ingestion via a new
BlobSink, enabling parallel ranged downloads for both HTTP andobject_storecloud sources with env-tunable part size/concurrency. - Make
dc:usable across more command paths by resolvingdc:handles early inutil::process_inputandutil::get_stats_records, and add a durable per-extension stats-cache blob restored on demand. - Clean up dependencies by removing the
http-cache*/reqwest-middlewarestack and updating tests/docs accordingly.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/diskcache.rs |
Core implementation: streaming BlobSink, ranged+conditional HTTP/cloud ingest, dc: materialization with per-extension durable stats-cache blobs, and orphan cleanup updates. |
src/util.rs |
Resolve dc: before existence/canonicalize checks in process_input and get_stats_records so “smart” commands and process_input-based commands accept dc: inputs. |
tests/test_get.rs |
Adds ranged-response support in the mock server plus new integration tests for HTTP/cloud ranged downloads and dc: stats-cache regressions. |
docs/ENVIRONMENT_VARIABLES.md |
Documents new QSV_GET_PART_SIZE / QSV_GET_CONCURRENCY knobs (needs a small correction noted in review). |
Cargo.toml |
Removes http-cache* / reqwest-middleware deps from the get feature definition. |
Cargo.lock |
Lockfile updates reflecting removed HTTP-cache dependency cluster. |
…get` These tuning knobs are read by both ingest_http (HTTP path, `get` feature) and ingest_cloud (cloud path), and env_u64/DEFAULT_PART_SIZE/ DEFAULT_DL_CONCURRENCY are not get_cloud-gated. Correct the ENVIRONMENT_VARIABLES table, which described them as cloud-only and requiring the get_cloud feature. (roborev/Copilot review on #3958) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…svdp CI) (#3959) * fix(get): gate dc: tests that need cat/schema on feature_capable The qsvdp (datapusher_plus) build bundles `get`/`slice`/`count`/`frequency` but NOT `cat` or `schema`, so two get tests merged in #3958 failed the qsvdp CI (they pass under all_features, which has every command): - get_dc_works_with_process_input_commands uses `cat` - get_dc_stats_cache_for_smart_commands uses `schema` Gate both on `feature_capable` (the same convention test_profile.rs uses for a schema-using test). The dc: machinery itself stays covered on qsvdp by the other get tests — slice/count still exercise the process_input dc: path, and get_dc_stats_cache_not_shared_across_extensions covers frequency on dc:. Verified: `cargo test --features=datapusher_plus get_` → 26 passed, the two gated tests excluded; `cargo test -F all_features get_dc` → all 3 run and pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(get): gate only the cat assertion, keep slice dc: check on qsvdp (roborev #2782) Addresses the Low finding: gating the whole get_dc_works_with_process_input_commands test on feature_capable also dropped the `slice dc:` regression check from the qsvdp run, even though `slice` is bundled there and is the part that exercises util::process_input (count resolves dc: via Config::new, a different path). Now only the `cat` assertion is feature_capable-gated (qsvdp does not bundle cat); slice + count run on all builds, restoring process_input dc: coverage on reduced binaries. Verified: `cargo test --features=datapusher_plus get_dc` runs and passes get_dc_works_with_process_input_commands (cat excluded); `cargo test -F all_features` passes the full test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed fetch (#3960) The `get` USAGE (and the README one-liner it shares with the generated help) still described "HTTP cache semantics (ETag/Cache-Control via http-cache)", but the http-cache middleware was removed when the HTTP path was unified (#3958). The HTTP and cloud paths now do ETag/Last-Modified conditional revalidation and download large remote objects as parallel, streamed byte-ranges. - src/cmd/get.rs: rewrite the freshness sentence (conditional ETag/Last-Modified revalidation, not Cache-Control) and add a note on streaming/ranged downloads with the QSV_GET_PART_SIZE / QSV_GET_CONCURRENCY tuning knobs. - README.md: fix the `get` command one-liner (same stale claim). - docs/help/get.md, docs/help/TableOfContents.md: regenerated via `qsv --generate-help-md` (not hand-edited). Verified `qsv get --help` renders the corrected text; no http-cache/ Cache-Control references remain in the get docs. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Builds on the
getcommand from #2263 (Phase 1+2, merged in #3953). This is Phase 3 §2.2 + §2.3 (items 1–3): streaming, bounded-memory ingest with parallel ranged downloads on both the HTTP and cloud paths, a derived stats-cache fordc:handles, and removal of thehttp-cachealpha dependency cluster.What's included
Streaming + ranged downloads (§2.3 items 1–3)
BlobSink(incremental BLAKE3 + zstd straight to a temp file) so ingest never buffers the whole object in memory.ingest_cloud): the first-part ranged GET doubles as the conditional revalidation and the size probe; larger objects fetch as concurrent, in-order byte-ranges streamed into the sink. Peak memory ≈concurrency × part_size.ingest_http): replaces thehttp-cache-reqwestmiddleware with one ETag/Last-Modified conditional + streaming/ranged downloader that mirrors the cloud path. 304 → reuse; 206 + known total → stream first part then parallel ranges; 206 + unknown total → re-fetch whole; 200 → stream full body.object_storeconfig keys):QSV_GET_PART_SIZE(8 MiB),QSV_GET_CONCURRENCY(4).Derived stats cache for
dc:(§2.2)frequency,schema,joinp,sample,pivotp,profile) errored ondc:inputs because the rawdc:string failedcanonicalize. They now resolve the materialized CSV path up front..stats.csv.data.jsonlsidecar into a durable, content-addressed blob so warm runs skip recompute and survive temp-dir cleanup.dc:now also works withutil::process_inputcommands (cat,slice, joins, …), not just direct-Configcommands.Dependency cleanup
http-cache,http-cache-reqwest,http-cache-semantics,reqwest-middleware(4 crates gone fromCargo.lock);get = ["zstd"]. Resolves the pre-1.0-alpha tech debt. Trade-off: RFC 9111Cache-Controlmax-age freshness is no longer honored — a re-qsv getdoes an ETag/Last-Modified conditional GET (→ 304) instead of a zero-request hit. Minor for qsv, wheredc:staleness is governed byRefreshPolicy/TTL, not http-cache.Robustness fixes (local code review)
std::iter::from_fn+saturating_add) on both paths — the range list never fully materializes, no overflow on a hostileContent-Rangetotal.If-Matchonly for strong ETags (weakW/"…"falls back toIf-Unmodified-Since); each range response validated as206with a matchingContent-Range+ exact body length, so a200/wrong-slice/short-read can't silently corrupt the blob.dc:stats cache keyed by content hash and parsing extension, with per-extension temp subdirs, so.csvand.tsvaliases of the same bytes never cross-contaminate schema/frequency.BlobSinkcleans up its partial temp file on a failed/aborted ingest (Drop releases the handle first, so Windows can unlink).Deferred (not in this PR)
§2.3 item 4 (resumable
.partial), §2.1sftp://, §2.4 HTTP/3, and §2.2 follow-ons (frequency cache, full stats/frequency self-acceleration).Testing
all_featuressuite: 3022 passed, 0 failed, 42 ignored.getintegration tests (HTTP + S3 mock +dc:), including new ranged-download and cross-extension stats-cache regression tests.cargo +nightly fmt+cargo clippy --all-targets -F all_featuresclean.🤖 Generated with Claude Code