Skip to content

feat(get): Phase 3 — streaming + ranged/parallel downloads (HTTP & cloud), dc: stats cache#3958

Merged
jqnatividad merged 7 commits into
masterfrom
get-cloud-streaming-ranges
Jun 7, 2026
Merged

feat(get): Phase 3 — streaming + ranged/parallel downloads (HTTP & cloud), dc: stats cache#3958
jqnatividad merged 7 commits into
masterfrom
get-cloud-streaming-ranges

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Builds on the get command from #2263 (Phase 1+2, merged in #3953). This is Phase 3 §2.2 + §2.3 (items 1–3): streaming, bounded-memory ingest with parallel ranged downloads on both the HTTP and cloud paths, a derived stats-cache for dc: handles, and removal of the http-cache alpha dependency cluster.

What's included

Streaming + ranged downloads (§2.3 items 1–3)

  • New streaming BlobSink (incremental BLAKE3 + zstd straight to a temp file) so ingest never buffers the whole object in memory.
  • Cloud (ingest_cloud): the first-part ranged GET doubles as the conditional revalidation and the size probe; larger objects fetch as concurrent, in-order byte-ranges streamed into the sink. Peak memory ≈ concurrency × part_size.
  • HTTP unify (ingest_http): replaces the http-cache-reqwest middleware with one ETag/Last-Modified conditional + streaming/ranged downloader that mirrors the cloud path. 304 → reuse; 206 + known total → stream first part then parallel ranges; 206 + unknown total → re-fetch whole; 200 → stream full body.
  • Tuning knobs (env-only, so they never collide with object_store config keys): QSV_GET_PART_SIZE (8 MiB), QSV_GET_CONCURRENCY (4).

Derived stats cache for dc: (§2.2)

  • Fixes a real bug first: the stats-cache "smart" commands (frequency, schema, joinp, sample, pivotp, profile) errored on dc: inputs because the raw dc: string failed canonicalize. They now resolve the materialized CSV path up front.
  • Persist-on-use capture of the .stats.csv.data.jsonl sidecar into a durable, content-addressed blob so warm runs skip recompute and survive temp-dir cleanup.
  • dc: now also works with util::process_input commands (cat, slice, joins, …), not just direct-Config commands.

Dependency cleanup

  • Drops http-cache, http-cache-reqwest, http-cache-semantics, reqwest-middleware (4 crates gone from Cargo.lock); get = ["zstd"]. Resolves the pre-1.0-alpha tech debt. Trade-off: RFC 9111 Cache-Control max-age freshness is no longer honored — a re-qsv get does an ETag/Last-Modified conditional GET (→ 304) instead of a zero-request hit. Minor for qsv, where dc: staleness is governed by RefreshPolicy/TTL, not http-cache.

Robustness fixes (local code review)

  • Lazy range generation (std::iter::from_fn + saturating_add) on both paths — the range list never fully materializes, no overflow on a hostile Content-Range total.
  • HTTP follow-up ranges: If-Match only for strong ETags (weak W/"…" falls back to If-Unmodified-Since); each range response validated as 206 with a matching Content-Range + exact body length, so a 200/wrong-slice/short-read can't silently corrupt the blob.
  • dc: stats cache keyed by content hash and parsing extension, with per-extension temp subdirs, so .csv and .tsv aliases of the same bytes never cross-contaminate schema/frequency.
  • BlobSink cleans up its partial temp file on a failed/aborted ingest (Drop releases the handle first, so Windows can unlink).

Deferred (not in this PR)

§2.3 item 4 (resumable .partial), §2.1 sftp://, §2.4 HTTP/3, and §2.2 follow-ons (frequency cache, full stats/frequency self-acceleration).

Testing

  • Full all_features suite: 3022 passed, 0 failed, 42 ignored.
  • 28 get integration tests (HTTP + S3 mock + dc:), including new ranged-download and cross-extension stats-cache regression tests.
  • cargo +nightly fmt + cargo clippy --all-targets -F all_features clean.

🤖 Generated with Claude Code

jqnatividad and others added 6 commits June 6, 2026 18:29
Implements Phase 3 §2.3 items (1) streaming ingest and (2) cloud-path
parallel ranged downloads for the `get` command, plus two fixes found
along the way.

- diskcache: add a streaming `BlobSink` (incremental BLAKE3 + streaming
  zstd to a temp file, atomically renamed to the content-addressed blob).
  It never buffers the whole object in memory, unlike `store_blob`.
  (`get_cloud`-gated for now since only the cloud path uses it.)

- diskcache: rework `ingest_cloud` to fetch large cloud objects as
  concurrent, in-order byte-ranges streamed straight into the `BlobSink`,
  so peak memory is ~`concurrency * part_size` regardless of object size.
  The first-part ranged GET doubles as both the conditional revalidation
  request (preserving ETag/304 behavior) and the size probe (Content-Range
  carries the total), so there is no extra HEAD round-trip. Objects that
  fit in one part take a single request. Tunable via the new
  `QSV_GET_PART_SIZE` (default 8 MiB) and `QSV_GET_CONCURRENCY` (default 4)
  env vars (env-only, so they don't collide with object_store's
  `--cloud-opt` config keys). No new dependencies.

- util: resolve the `dc:` prefix in `process_input` before its existence
  check. `dc:` was only resolved in `Config::new`, so commands that
  pre-validate inputs via `process_input` (cat, slice, joins, ...) wrongly
  rejected a valid `dc:` handle as a missing file. Now `dc:<name>` is
  resolved to its materialized CSV (with sibling .idx) up front.

- diskcache: fix `alias_path` to pass `to_hex().as_str()` to `Path::join`
  (blake3's `ArrayString` impls `AsRef<str>`, not `AsRef<Path>`). A
  pre-existing break in minimal `get*` feature combos, masked by
  `all_features`.

- tests: extend the mock S3 server with 206/Content-Range range support
  and a larger `big.csv` endpoint; add `get_s3_multipart_ranged_download`
  (byte-for-byte reassembly check via `--output`) and
  `get_dc_works_with_process_input_commands` (cat/slice/count over `dc:`).

- docs: document `QSV_GET_PART_SIZE` and `QSV_GET_CONCURRENCY`.

Validated: `all_features` build + full suite (3019 passed, 0 failed);
clippy and `cargo +nightly fmt` clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 3 §2.2 (derived stats cache for dc: inputs). The investigation found
the stats-cache "smart" commands don't just miss a cache on dc: handles —
they ERROR, so this is a bug fix plus the persist-on-use cache feature.

Part A (src/util.rs) — fix: util::get_stats_records canonicalized the raw
arg_input, so a "dc:<name>" handle failed with `No such file or directory`.
Resolve the dc: prefix to the entry's materialized CSV path up front (and
feed that real path to the stats subprocess). This repairs `frequency`,
`schema`, `joinp`, `sample`, `pivotp`, and `profile` on dc: inputs, which
previously errored.

Part B (src/diskcache.rs) — persist-on-use stats cache: resolve_dc_path now
captures the `.stats.csv.data.jsonl` sidecar a smart command leaves next to
the temp CSV into a durable, content-addressed blob ({blake3}.stats.jsonl.zst)
on the next dc: access, and restores it (with a fresh mtime, like the .idx)
when the temp copy is gone or stale. So a warm stats cache survives temp-dir
cleanup. The blob is freed alongside the data/idx blobs in
delete_entry_by_keyhash (content-hash referenced). The BufWriter/Write
imports (used only by the get_cloud BlobSink) are gated so a bare `-F get`
build stays warning-free.

Scope (per the chosen MVP): stats `.data.jsonl` only, captured once per
content hash — a richer-mode consumer than the captured cache may still
recompute. Frequency cache and full `qsv stats`/`frequency` self-acceleration
are follow-ons.

tests/test_get.rs: get_dc_stats_cache_for_smart_commands asserts frequency &
schema work on dc: and that the stats blob is captured durably.

Validated: all_features build + full suite (487 unit + 3020 integration
passed, 0 failed); clippy and `cargo +nightly fmt` clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… middleware

Replace the http-cache-reqwest middleware on the HTTP fetch path with a new
`ingest_http` that mirrors the cloud path: an ETag/Last-Modified conditional
GET that doubles as a size probe (first ranged part), then parallel in-order
ranged downloads for large objects. This brings HTTP streaming/ranges (no more
whole-body buffering) and unifies the HTTP and cloud ingest paths.

- New `ingest_http(opts, root, final_url, auth_token, ckan_resource_hash,
  is_ckan)`: 304 → not-modified; 206 + known total → stream first part then
  fetch remaining ranges via buffered concurrency (If-Match guarded); 206 +
  unknown total → re-fetch whole in one streamed pass; 200 → stream full body.
  Cache key `HTTP:{opts.source}`; persists ETag + Last-Modified and ckan_api_url
  only for ckan:// entries. Reuses QSV_GET_PART_SIZE / QSV_GET_CONCURRENCY.
- Rewrite `get_resource` HTTP branch to resolve CKAN then delegate to
  ingest_http; remove QsvCacheManager + CacheManager impl, HttpStored,
  StoredEntry.http, BoxError, and the http-cache/* imports.
- Un-gate BlobSink/BlobSinkWriter/env_u64/DEFAULT_PART_SIZE/
  DEFAULT_DL_CONCURRENCY from get_cloud → get (shared by both ingest paths).
- Drop deps: http-cache, http-cache-reqwest, http-cache-semantics,
  reqwest-middleware (4 crates gone from Cargo.lock); `get = ["zstd"]`.
- Tests: `_fresh` mock handlers now honor If-None-Match (304) since the unified
  path always sends a conditional request; un-gate big_csv/serve_big/BIG_ETAG;
  new `get_http_ranged_download` asserts fan-out (body_sends > 1) and
  byte-for-byte reassembly.

Tradeoff: RFC 9111 Cache-Control max-age freshness is no longer honored (a
re-`qsv get` does an ETag/Last-Modified conditional GET → 304 instead of a
zero-request hit). Minor for qsv — dc: staleness is governed by RefreshPolicy/
TTL, not http-cache. Also resolves the §2.5 http-cache-alpha tech debt.

Full all_features suite: 3021 passed, 0 failed, 42 ignored.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address three Medium findings in the unified HTTP ranged downloader:

- Lazy range generation: follow-up byte-ranges are now produced via
  `std::iter::from_fn` (consumed lazily by `buffered`) instead of a
  pre-built `Vec`, so the range list never materializes in full for a
  huge object + tiny `QSV_GET_PART_SIZE`. Next-offset uses
  `saturating_add` to avoid overflow on a hostile `Content-Range` total.
- Weak-ETag If-Match: `If-Match` requires strong comparison, so a weak
  validator (`W/"..."`) made compliant origins answer 412 and fail the
  download. Now `If-Match` is sent only for strong ETags; a weak/absent
  ETag falls back to an `If-Unmodified-Since` guard from Last-Modified,
  else relies on per-range validation.
- Per-range validation: each follow-up range response must be 206 with a
  Content-Range matching the requested start/end/total and a body of the
  exact requested length. A 200 full body, wrong slice, or short read now
  errors instead of being stitched into the blob (silent corruption).

All 27 get tests pass (incl. get_http_ranged_download); fmt + clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ttp)

The cloud ranged-download path pre-built a `Vec` of every outstanding
byte-range before streaming, so memory grew with `total / part_size`
(unbounded for a huge object + tiny QSV_GET_PART_SIZE), and the
`start + part_size` next-offset could overflow on a hostile total — the
same class of issue just fixed in `ingest_http` (roborev #2777).

Generate the ranges lazily via `std::iter::from_fn` (consumed by
`buffered`) so the list never fully materializes, and use
`saturating_add` for the next offset. Behavior is otherwise unchanged;
object_store still guards each range with If-Match and returns exactly
the requested bytes (or errors), so no per-range validation is needed
here as on the raw-reqwest HTTP path.

All 27 get tests pass (incl. get_s3_multipart_ranged_download); fmt +
clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…failure

Address roborev #2776 (stats-cache cross-contamination + orphaned blob) and
the remaining Low from #2775 (BlobSink temp-file leak). The #2775 Medium
(unbounded cloud range Vec) was already fixed in 43aa583.

#2776 Medium — stats-cache cross-contamination across extensions:
The durable stats blob was keyed only by content hash, but a dc: handle's
materialized extension selects the delimiter (.csv => comma, .tsv/.tab => tab,
.ssv => semicolon), and two aliases of the same bytes that differ only in
extension resolved to the SAME temp dir (keyed by blake3) — so they shared both
the on-disk `.stats.csv.data.jsonl` sidecar (same temp stem) and the durable
blob, producing wrong schema/frequency for the second delimiter. Now
resolve_dc_path isolates each extension in its own temp subdir
(`qsv-dc/{blake3}/{ext}/`) AND keys the stats blob by extension
(`{blake3}.{ext}.stats.jsonl.zst`), so each delimiter gets its own correct
cache. New shared `TABULAR_EXTS` const.

#2776 Low — orphaned stats blob on refresh:
write_entry's content-changed branch removed the old data + index blobs but not
the stats blob; it now reclaims every per-extension stats blob for the old
content (delete_entry_by_keyhash does the same).

#2775 Low — BlobSink temp-file leak:
If an ingest errored between BlobSink::new and finish(), the partial
`blobs/ingest-*.tmp` was never removed. BlobSink.writer is now an Option and a
Drop impl releases the handle (so Windows can unlink) and removes the temp
unless finish() already took it.

New regression test get_dc_stats_cache_not_shared_across_extensions proves a
.csv alias (3 fields) and a .tsv alias (1 field) of the same bytes keep separate
caches. All 28 get tests pass; fmt + clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented Jun 7, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 124 complexity

Metric Results
Complexity 124

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR advances the get/dc: subsystem by switching remote ingestion to a bounded-memory, streaming implementation with parallel ranged downloads (HTTP and cloud), and by persisting a derived stats-cache for dc: inputs to avoid recomputation and survive temp-dir cleanup. It also removes the prior http-cache-reqwest middleware dependency cluster and replaces it with explicit conditional revalidation logic.

Changes:

  • Implement streaming, content-addressed ingestion via a new BlobSink, enabling parallel ranged downloads for both HTTP and object_store cloud sources with env-tunable part size/concurrency.
  • Make dc: usable across more command paths by resolving dc: handles early in util::process_input and util::get_stats_records, and add a durable per-extension stats-cache blob restored on demand.
  • Clean up dependencies by removing the http-cache* / reqwest-middleware stack and updating tests/docs accordingly.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/diskcache.rs Core implementation: streaming BlobSink, ranged+conditional HTTP/cloud ingest, dc: materialization with per-extension durable stats-cache blobs, and orphan cleanup updates.
src/util.rs Resolve dc: before existence/canonicalize checks in process_input and get_stats_records so “smart” commands and process_input-based commands accept dc: inputs.
tests/test_get.rs Adds ranged-response support in the mock server plus new integration tests for HTTP/cloud ranged downloads and dc: stats-cache regressions.
docs/ENVIRONMENT_VARIABLES.md Documents new QSV_GET_PART_SIZE / QSV_GET_CONCURRENCY knobs (needs a small correction noted in review).
Cargo.toml Removes http-cache* / reqwest-middleware deps from the get feature definition.
Cargo.lock Lockfile updates reflecting removed HTTP-cache dependency cluster.

Comment thread docs/ENVIRONMENT_VARIABLES.md Outdated
…get`

These tuning knobs are read by both ingest_http (HTTP path, `get` feature)
and ingest_cloud (cloud path), and env_u64/DEFAULT_PART_SIZE/
DEFAULT_DL_CONCURRENCY are not get_cloud-gated. Correct the
ENVIRONMENT_VARIABLES table, which described them as cloud-only and requiring
the get_cloud feature. (roborev/Copilot review on #3958)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 690a172 into master Jun 7, 2026
17 of 18 checks passed
@jqnatividad jqnatividad deleted the get-cloud-streaming-ranges branch June 7, 2026 02:06
jqnatividad added a commit that referenced this pull request Jun 7, 2026
…svdp CI) (#3959)

* fix(get): gate dc: tests that need cat/schema on feature_capable

The qsvdp (datapusher_plus) build bundles `get`/`slice`/`count`/`frequency`
but NOT `cat` or `schema`, so two get tests merged in #3958 failed the qsvdp
CI (they pass under all_features, which has every command):

- get_dc_works_with_process_input_commands uses `cat`
- get_dc_stats_cache_for_smart_commands uses `schema`

Gate both on `feature_capable` (the same convention test_profile.rs uses for a
schema-using test). The dc: machinery itself stays covered on qsvdp by the
other get tests — slice/count still exercise the process_input dc: path, and
get_dc_stats_cache_not_shared_across_extensions covers frequency on dc:.

Verified: `cargo test --features=datapusher_plus get_` → 26 passed, the two
gated tests excluded; `cargo test -F all_features get_dc` → all 3 run and pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(get): gate only the cat assertion, keep slice dc: check on qsvdp (roborev #2782)

Addresses the Low finding: gating the whole get_dc_works_with_process_input_commands
test on feature_capable also dropped the `slice dc:` regression check from the
qsvdp run, even though `slice` is bundled there and is the part that exercises
util::process_input (count resolves dc: via Config::new, a different path).

Now only the `cat` assertion is feature_capable-gated (qsvdp does not bundle
cat); slice + count run on all builds, restoring process_input dc: coverage on
reduced binaries.

Verified: `cargo test --features=datapusher_plus get_dc` runs and passes
get_dc_works_with_process_input_commands (cat excluded); `cargo test
-F all_features` passes the full test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request Jun 7, 2026
…ed fetch (#3960)

The `get` USAGE (and the README one-liner it shares with the generated help)
still described "HTTP cache semantics (ETag/Cache-Control via http-cache)", but
the http-cache middleware was removed when the HTTP path was unified (#3958).
The HTTP and cloud paths now do ETag/Last-Modified conditional revalidation and
download large remote objects as parallel, streamed byte-ranges.

- src/cmd/get.rs: rewrite the freshness sentence (conditional ETag/Last-Modified
  revalidation, not Cache-Control) and add a note on streaming/ranged downloads
  with the QSV_GET_PART_SIZE / QSV_GET_CONCURRENCY tuning knobs.
- README.md: fix the `get` command one-liner (same stale claim).
- docs/help/get.md, docs/help/TableOfContents.md: regenerated via
  `qsv --generate-help-md` (not hand-edited).

Verified `qsv get --help` renders the corrected text; no http-cache/
Cache-Control references remain in the get docs.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants