fix(data): tune fetch concurrency with throughput hill climb #98

Merged
jacderida merged 2 commits into rc-2026.5.4 from fix/adaptive-download-concurrency
May 27, 2026

@mickvandijke mickvandijke commented May 26, 2026

Summary

  • Replace fetch-side AIMD concurrency with a byte-aware throughput hill climber.
  • Feed successful chunk byte counts into fetch observations for both in-memory and file downloads.
  • Use rolling ordered fetch scheduling so large downloads can react to concurrency changes while a download is still in progress.

Problem

The previous adaptive controller treated fetches like quote/store operations and mostly optimized around success rate, timeout rate, and latency. That helped avoid obvious overload, but it could still overshoot download concurrency, because a higher cap does not always yield higher throughput.

In practice, too many parallel chunk GETs can create extra connection pressure, peer timeouts, and slow tail fetches. The old approach could keep probing upward or stay too high even when the machine/network was already saturated.

New Approach

Fetch now uses a throughput-seeking hill climber instead of AIMD:

  1. The controller starts fetch concurrency conservatively.
  2. Each epoch records successful payload bytes, successes, timeouts, network errors, and latency samples.
  3. Epoch timing starts from the sampled operation's actual start time, not from the first completion.
  4. Probe epochs cover full concurrency waves, so a higher cap is not judged from a partial wave.
  5. The controller calculates epoch goodput as bytes/sec.
  6. It probes a nearby higher or lower concurrency cap.
  7. An upward probe is accepted only if goodput improves materially.
  8. A downward probe is accepted when goodput is effectively retained, allowing the client to prefer lower pressure for the same throughput.
  9. Stress signals still cut concurrency immediately instead of waiting for a full probe cycle.

This means fetch concurrency is selected by measured download throughput, not by the assumption that more parallel chunk GETs are always better.
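
The accept/reject rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Epoch` and `HillClimber` types, the 10% gain and 95% retention thresholds, the quarter-of-cap probe step, and the alternating probe direction are all assumptions chosen for the example. Only the overall shape (baseline epoch, nearby probe, material-gain test upward, retention test downward, immediate halving on stress) comes from the description.

```rust
/// One measurement epoch. All names and fields here are illustrative.
#[derive(Debug, Clone, Copy)]
struct Epoch {
    bytes: u64,     // successful payload bytes observed in the epoch
    secs: f64,      // epoch duration, measured from operation start
    stressed: bool, // e.g. timeout or network-error rate above some limit
}

impl Epoch {
    /// Goodput in bytes per second.
    fn goodput(&self) -> f64 {
        if self.secs > 0.0 { self.bytes as f64 / self.secs } else { 0.0 }
    }
}

const UP_GAIN: f64 = 1.10; // assumed: an upward probe must gain at least 10%
const DOWN_RETAIN: f64 = 0.95; // assumed: a downward probe must keep at least 95%

#[derive(Debug)]
struct HillClimber {
    best: usize,          // best accepted cap so far
    best_goodput: f64,    // goodput measured at `best`
    probe: Option<usize>, // cap currently being tried, if any
    probe_up_next: bool,  // assumed policy: alternate probe direction
    min: usize,
    max: usize,
}

impl HillClimber {
    fn new(start: usize, min: usize, max: usize) -> Self {
        Self {
            best: start.clamp(min, max),
            best_goodput: 0.0,
            probe: None,
            probe_up_next: true,
            min,
            max,
        }
    }

    /// Cap the scheduler should use right now.
    fn current(&self) -> usize {
        self.probe.unwrap_or(self.best)
    }

    /// Feed one finished epoch and choose the next cap.
    fn on_epoch(&mut self, e: Epoch) {
        if e.stressed {
            // Stress cuts immediately instead of waiting for a probe verdict.
            self.best = (self.current() / 2).max(self.min);
            self.best_goodput = 0.0;
            self.probe = None;
            return;
        }
        let g = e.goodput();
        match self.probe.take() {
            None => {
                // Baseline epoch at `best`: record goodput, then pick a nearby probe.
                self.best_goodput = g;
                let step = (self.best / 4).max(1);
                let target = if self.probe_up_next {
                    (self.best + step).min(self.max)
                } else {
                    self.best.saturating_sub(step).max(self.min)
                };
                self.probe_up_next = !self.probe_up_next;
                if target != self.best {
                    self.probe = Some(target);
                }
            }
            Some(p) => {
                // Upward probes need a material gain; downward probes only need
                // to retain goodput, which favors lower pressure on ties.
                let accept = if p > self.best {
                    g >= self.best_goodput * UP_GAIN
                } else {
                    g >= self.best_goodput * DOWN_RETAIN
                };
                if accept {
                    self.best = p;
                    self.best_goodput = g;
                }
            }
        }
    }
}
```

With these assumed thresholds, a probe from 16 to 20 that improves goodput by only 2% is rejected and the cap returns to 16, while a later probe from 16 to 12 that loses only 2% is accepted.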

Scheduler Changes

The file/data download paths now report the number of bytes returned by successful chunk GETs through observe_op_with_success_bytes.

File downloads also use rebucketed_ordered for chunk fetch batches. This keeps the ordering guarantees required by self-encryption while allowing the active fetch cap to be re-read between smaller buckets. Large downloads can therefore react to the hill climber mid-download instead of being stuck with one cap for a whole batch.
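
To illustrate the re-bucketing idea, here is a thread-based stand-in. It is only a sketch: the real `rebucketed_ordered` is presumably async and its signature is not shown in this PR, so the function name `fetch_in_rebucketed_order`, the `AtomicUsize` cap, and the `u64` chunk addresses are all hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Fetch `addrs` in buckets whose width is re-read from `cap` before each
/// bucket. Results are returned in input order. (Hypothetical helper.)
fn fetch_in_rebucketed_order<T, F>(addrs: &[u64], cap: &AtomicUsize, fetch: F) -> Vec<T>
where
    T: Send,
    F: Fn(u64) -> T + Sync,
{
    let mut out = Vec::with_capacity(addrs.len());
    let mut next = 0;
    while next < addrs.len() {
        // Re-read the live cap so a controller update applies to this bucket.
        let width = cap.load(Ordering::Relaxed).max(1);
        let end = (next + width).min(addrs.len());
        let bucket = &addrs[next..end];

        // Run the bucket in parallel. Joining handles in spawn order keeps
        // the results ordered regardless of which fetch finishes first.
        let results: Vec<T> = thread::scope(|s| {
            let handles: Vec<_> = bucket
                .iter()
                .map(|&addr| {
                    let f = &fetch;
                    s.spawn(move || f(addr))
                })
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("fetch thread panicked"))
                .collect()
        });
        out.extend(results);
        next = end;
    }
    out
}
```

Because the cap is loaded once per bucket rather than once per download, a cap change made while bucket N is running takes effect from bucket N+1 onward.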

Persistence

The persisted snapshot schema is still bumped because fetch switched algorithms. Schema-1 snapshots are now migrated rather than fully discarded: learned quote/store caps are preserved, and only fetch is reset to the hill-climber cold start.
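
A sketch of what such a migration could look like is below. The identifiers `PERSIST_SCHEMA`, `FETCH_COLD_START_CONCURRENCY`, and `ChannelStart` appear in the review discussion, as do the values 2 and 16; the `Snapshot` struct, the field layout, and the `migrate` function are assumptions made for this example, and serialization is left out entirely.

```rust
const PERSIST_SCHEMA: u32 = 2;
const FETCH_COLD_START_CONCURRENCY: usize = 16;

/// Warm-start caps per channel (field layout assumed for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ChannelStart {
    quote: usize,
    store: usize,
    fetch: usize,
}

/// Persisted state, already deserialized (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Snapshot {
    schema: u32,
    start: ChannelStart,
}

/// Bring a loaded snapshot up to the current schema, or return `None`
/// when the schema is unknown and the snapshot should be discarded.
fn migrate(snapshot: Snapshot) -> Option<Snapshot> {
    match snapshot.schema {
        PERSIST_SCHEMA => Some(snapshot),
        1 => Some(Snapshot {
            schema: PERSIST_SCHEMA,
            start: ChannelStart {
                // The fetch cap was learned by a different algorithm, so
                // reset it; quote and store caps carry over unchanged.
                fetch: FETCH_COLD_START_CONCURRENCY,
                ..snapshot.start
            },
        }),
        _ => None,
    }
}
```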

Semver

Patch. This is download concurrency tuning with no public API change.

Testing

  • cargo test -p ant-core data::client::adaptive::tests
  • cargo test -p ant-core data::client::file::tests
  • cargo check -p ant-core
  • cargo fmt --all -- --check
  • cargo clippy --all-targets --all-features -- -D warnings

@dirvine dirvine left a comment

I reviewed this PR via Hermes Agent. Here are my findings:

Summary

Well-designed replacement of AIMD with a throughput-seeking hill climber for fetch concurrency. The key insight — more parallel chunk GETs ≠ more throughput — is sound, and measuring bytes/sec goodput as the optimization target is the right approach.

CI Status

The one failing check (Unit Tests ubuntu-latest) is a pre-existing flaky test (cached_single.rs::roundtrip_save_load_delete, timestamp-based path comparison 1779790968 vs 1779790969), completely unrelated to this PR. All 303/303 unit tests pass on macOS, and all E2E, Clippy, Format, Security Audit checks pass on both platforms.

What I like

  • Goodput-based hill climbing is a significant improvement over blind AIMD for fetch workloads
  • Rolling ordered fetch scheduling (rebucketed_ordered) correctly replaces buffer_unordered + manual sort — preserves ordering while allowing cap changes mid-download
  • The cold-start reduction (64→16) is justified: the climber will prove higher caps where beneficial
  • Stress signals still cut concurrency immediately (50% halving) — preserves the safety net
  • Well-tested: 4 dedicated hill climb tests covering upward rejection, acceptance, downward acceptance, and stress

Minor concerns (non-blocking)

  1. hill_epoch_stats clones epoch_latencies Vec under the mutex — for large windows or high-throughput downloads, this allocates and clones potentially hundreds of Durations under the lock. Consider mem::take(&mut inner.hill.epoch_latencies) or swapping the Vec to avoid cloning, and do the p95/max computation outside the mutex guard.

  2. Unit fallback (epoch_bytes == 0 → success count) — correct for unit tests, but means any future fetch path that forgets to report bytes silently uses count-based goodput. This could mask degraded throughput (many small-success responses at low concurrency). Consider adding a warn!() when the fallback is hit in production paths, or making the bytes parameter mandatory in the API.

  3. The PERSIST_SCHEMA bump from 1→2 is correct but snapshot() now returns best_concurrency instead of current for the hill climber. This is the right value to persist (the stable best, not the probe value), but worth double-checking that reading old schema-1 state on upgrade works as expected — the deserialized ChannelStart.fetch will be 64 (old default) which then gets clamped to FETCH_COLD_START_CONCURRENCY=16 on the next AdaptiveController::new() call. This is a behavior change (was 64, now 16 on first load after upgrade), but the climber will quickly re-converge. Worth documenting.
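
Concerns 1 and 2 could be addressed along these lines. This is a sketch under assumptions: `HillState`, `LatencyStats`, `take_latency_stats`, and `goodput_units` are hypothetical names, and only the `epoch_latencies` and `epoch_bytes` field names come from the review text. Note that `mem::take` empties the sample buffer, so it only fits if the stats call is meant to consume the epoch; it also leaves behind a zero-capacity `Vec`, so swapping in a pre-allocated buffer may be preferable on hot paths.

```rust
use std::mem;
use std::sync::Mutex;
use std::time::Duration;

/// Per-epoch accumulators (hypothetical shape).
#[derive(Default)]
struct HillState {
    epoch_latencies: Vec<Duration>,
    epoch_bytes: u64,
    epoch_successes: u64,
}

/// Latency summary for one epoch.
#[derive(Debug, PartialEq)]
struct LatencyStats {
    p95: Duration,
    max: Duration,
}

/// Move the samples out under the lock, then compute stats without it.
fn take_latency_stats(state: &Mutex<HillState>) -> Option<LatencyStats> {
    let mut samples = {
        // Tolerate a poisoned lock; the guard drops at the end of this block.
        let mut guard = state.lock().unwrap_or_else(|e| e.into_inner());
        mem::take(&mut guard.epoch_latencies)
    };
    if samples.is_empty() {
        return None;
    }
    // Sorting and percentile selection run with the mutex released.
    samples.sort_unstable();
    // Nearest-rank p95: rank = ceil(0.95 * n), converted to a 0-based index.
    let idx = ((samples.len() * 95 + 99) / 100).saturating_sub(1);
    Some(LatencyStats {
        p95: samples[idx],
        max: samples[samples.len() - 1],
    })
}

impl HillState {
    /// Units used for goodput, plus a flag that is true when the success-count
    /// fallback was used because no bytes were reported. A caller could log
    /// a warning when the flag is set on a production path.
    fn goodput_units(&self) -> (u64, bool) {
        if self.epoch_bytes == 0 && self.epoch_successes > 0 {
            (self.epoch_successes, true)
        } else {
            (self.epoch_bytes, false)
        }
    }
}
```

Returning the fallback flag instead of logging inside the function keeps the sketch dependency-free; the choice of logging macro is left to the caller.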

Verdict

Approve. No correctness or safety blockers. The pre-existing flaky test should be fixed separately (the timestamp instability in cached_single.rs).

Semver: patch

Use a byte-aware throughput hill climber for chunk fetch concurrency so downloads back away from caps that do not improve goodput.

Apply the rolling fetch scheduler to file/data download paths so cap changes can take effect while large downloads are still in progress.

Semver: patch

Measure fetch hill epochs from operation start time rather than first completion, and size epochs to cover full concurrency waves so upward probes are judged on steady goodput.

Migrate schema-1 adaptive snapshots by preserving quote/store warm-starts while resetting fetch to the hill-climber cold start.
@mickvandijke mickvandijke force-pushed the fix/adaptive-download-concurrency branch from cbaa54e to e71a48f Compare May 27, 2026 13:07
@mickvandijke mickvandijke marked this pull request as ready for review May 27, 2026 14:26
@mickvandijke mickvandijke changed the base branch from main to rc-2026.5.4 May 27, 2026 14:27
@jacderida jacderida merged commit 6f81964 into rc-2026.5.4 May 27, 2026
12 checks passed

3 participants