fix(data): tune fetch concurrency with throughput hill climb#98
Conversation
dirvine
left a comment
There was a problem hiding this comment.
I reviewed this PR via Hermes Agent. Here are my findings:
Summary
Well-designed replacement of AIMD with a throughput-seeking hill climber for fetch concurrency. The key insight — more parallel chunk GETs ≠ more throughput — is sound, and measuring bytes/sec goodput as the optimization target is the right approach.
CI Status
The one failing check (Unit Tests ubuntu-latest) is a pre-existing flaky test (cached_single.rs::roundtrip_save_load_delete, timestamp-based path comparison 1779790968 vs 1779790969), completely unrelated to this PR. All 303/303 unit tests pass on macOS, and all E2E, Clippy, Format, Security Audit checks pass on both platforms.
What I like
- Goodput-based hill climbing is a significant improvement over blind AIMD for fetch workloads
- Rolling ordered fetch scheduling (
rebucketed_ordered) correctly replacesbuffer_unordered+ manual sort — preserves ordering while allowing cap changes mid-download - The cold-start reduction (64→16) is justified: the climber will prove higher caps where beneficial
- Stress signals still cut concurrency immediately (50% halving) — preserves the safety net
- Well-tested: 4 dedicated hill climb tests covering upward rejection, acceptance, downward acceptance, and stress
Minor concerns (non-blocking)
-
hill_epoch_statsclonesepoch_latenciesVec under the mutex — for large windows or high-throughput downloads, this allocates and clones potentially hundreds of Durations under the lock. Considermem::take(&mut inner.hill.epoch_latencies)or swapping the Vec to avoid cloning, and do the p95/max computation outside the mutex guard. -
Unit fallback (
epoch_bytes == 0→ success count) — correct for unit tests, but means any future fetch path that forgets to report bytes silently uses count-based goodput. This could mask degraded throughput (many small-success responses at low concurrency). Consider adding awarn!()when the fallback is hit in production paths, or making the bytes parameter mandatory in the API. -
The
PERSIST_SCHEMAbump from 1→2 is correct butsnapshot()now returnsbest_concurrencyinstead ofcurrentfor the hill climber. This is the right value to persist (the stable best, not the probe value), but worth double-checking that reading old schema-1 state on upgrade works as expected — the deserializedChannelStart.fetchwill be 64 (old default) which then gets clamped toFETCH_COLD_START_CONCURRENCY=16on the nextAdaptiveController::new()call. This is a behavior change (was 64, now 16 on first load after upgrade), but the climber will quickly re-converge. Worth documenting.
Verdict
Approve. No correctness or safety blockers. The pre-existing flaky test should be fixed separately (the timestamp instability in cached_single.rs).
Semver: patch Use a byte-aware throughput hill climber for chunk fetch concurrency so downloads back away from caps that do not improve goodput. Apply the rolling fetch scheduler to file/data download paths so cap changes can take effect while large downloads are still in progress.
Semver: patch Measure fetch hill epochs from operation start time rather than first completion, and size epochs to cover full concurrency waves so upward probes are judged on steady goodput. Migrate schema-1 adaptive snapshots by preserving quote/store warm-starts while resetting fetch to the hill-climber cold start.
cbaa54e to
e71a48f
Compare
Summary
Problem
The previous adaptive controller treated fetches like quote/store operations and mostly optimized around success rate, timeout rate, and latency. That helped avoid obvious overload, but it could still overshoot download concurrency because a higher cap is not always higher throughput.
In practice, too many parallel chunk GETs can create extra connection pressure, peer timeouts, and slow tail fetches. The old approach could keep probing upward or stay too high even when the machine/network was already saturated.
New Approach
Fetch now uses a throughput-seeking hill climber instead of AIMD:
This means fetch concurrency is selected by measured download throughput, not by the assumption that more parallel chunk GETs are always better.
Scheduler Changes
The file/data download paths now report the number of bytes returned by successful chunk GETs through
observe_op_with_success_bytes.File downloads also use
rebucketed_orderedfor chunk fetch batches. This keeps the ordering guarantees required by self-encryption while allowing the active fetch cap to be re-read between smaller buckets. Large downloads can therefore react to the hill climber mid-download instead of being stuck with one cap for a whole batch.Persistence
The persisted snapshot schema is still bumped because fetch switched algorithms. Schema-1 snapshots are now migrated rather than fully discarded: learned quote/store caps are preserved, and only fetch is reset to the hill-climber cold start.
Semver
Patch. This is download concurrency tuning with no public API change.
Testing
cargo test -p ant-core data::client::adaptive::testscargo test -p ant-core data::client::file::testscargo check -p ant-corecargo fmt --all -- --checkcargo clippy --all-targets --all-features -- -D warnings