Production-grade async retry with shared budgets, jittered backoff, and per-attempt timeouts.
Naive retry loops cause retry storms: when a dependency fails, every caller retries simultaneously, multiplying load by `max_attempts`. The amplification compounds across call tiers (three layers retrying three times each can turn one request into up to 27), turning a partial outage into a total one. retryify solves this with:
- Shared retry budgets that cap the retry ratio across all callers
- Jittered backoff that decorrelates retry timing
- Rich predicates that handle real-world responses (HTTP 429, partial success)
- Dual timeouts that distinguish "this attempt is slow" from "give up entirely"
```rust
use retryify::*;
use std::time::Duration;

let result = retry(exponential())
    .with_full_jitter()
    .max_attempts(5)
    .per_attempt_timeout(Duration::from_secs(5))
    .total_timeout(Duration::from_secs(30))
    .run(
        // Predicate: decides per attempt whether the result warrants a retry.
        |result: &Result<Response, MyError>, _attempt| match result {
            Err(e) if e.is_transient() => RetryDecision::Retry,
            _ => RetryDecision::Stop,
        },
        // Operation: produces a fresh future for every attempt.
        || async { call_service().await },
    )
    .await?;
```

```rust
use retryify::*;
use std::time::Duration;

// Create once, clone to each retry site
let budget = RetryBudget::shared()
    .ratio(0.2)                      // 20 retries per 100 successes
    .min_per_second(1.0)             // floor during low traffic
    .window(Duration::from_secs(60))
    .build();

// Service A
let b = budget.clone();
retry(exponential()).with_full_jitter().budget(b).run(/* ... */);

// Service B shares the same token pool
let b = budget.clone();
retry(exponential()).with_full_jitter().budget(b).run(/* ... */);
```

When the budget is exhausted, retries halt immediately. This is the strongest protection against retry storms.
Backoff strategies:

| Strategy | Formula | Use case |
|---|---|---|
| `exponential()` | `base * multiplier^attempt` (capped at `max`) | Default for network calls |
| `linear(base, step)` | `base + step * attempt` | Predictable growth |
| `constant(delay)` | Fixed delay | Idempotent operations |
```rust
// Exponential with custom parameters
exponential()
    .base(Duration::from_millis(200))
    .multiplier(3.0)
    .max(Duration::from_secs(60))

// Linear: 100ms, 200ms, 300ms, ...
linear(Duration::from_millis(100), Duration::from_millis(100))

// Constant: always 500ms
constant(Duration::from_millis(500))
```

Jitter strategies:

| Strategy | Formula | Notes |
|---|---|---|
| `FullJitter` | `rand(0, base)` | AWS recommendation; maximum decorrelation |
| `EqualJitter` | `base/2 + rand(0, base/2)` | Guaranteed minimum spacing |
| `NoJitter` | `base` | Tests only; correlated retries in production cause cascading failures |
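For intuition, here is a minimal sketch of the `FullJitter` formula above, using the `rand` crate. It is illustrative only, not retryify's internals:

```rust
use rand::Rng;
use std::time::Duration;

// FullJitter: sleep = rand(0, base). Sampling in whole milliseconds
// keeps the sketch simple.
fn full_jitter(base: Duration) -> Duration {
    let ms = rand::thread_rng().gen_range(0..=base.as_millis() as u64);
    Duration::from_millis(ms)
}
```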
The builder enforces jitter selection at compile time — you cannot accidentally forget it:
```rust
// Won't compile without choosing a jitter strategy:
retry(exponential())
    .with_full_jitter() // or .without_jitter() for tests
    .max_attempts(5)
    // ...
```

retryify deliberately avoids a single `.timeout()` method. Ambiguous timeout semantics are a common source of production incidents.
| Method | Scope | On expiry |
|---|---|---|
| `.per_attempt_timeout(d)` | Single attempt | Attempt is cancelled, retry continues |
| `.total_timeout(d)` | Entire lifecycle | Returns `RetryError::DeadlineExceeded` |
Why this matters: A 5-second "timeout" could mean "cancel this one slow call and try again" or "give up on the entire operation." These are radically different behaviors. Making the distinction explicit prevents a class of outages where per-attempt timeouts were accidentally used as total deadlines (or vice versa).
The total timeout also clamps sleep durations: if only 2 seconds remain in the budget, a 10-second backoff is reduced to 2 seconds.
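A conceptual sketch of that clamping rule, with hypothetical names (this is not retryify API):

```rust
use std::time::{Duration, Instant};

// Never sleep past the total deadline: the backoff delay is capped
// at whatever time remains before the overall timeout expires.
fn clamp_backoff(backoff: Duration, deadline: Instant) -> Duration {
    let remaining = deadline.saturating_duration_since(Instant::now());
    backoff.min(remaining)
}
```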
Real retry logic is richer than retry/don't-retry:
- HTTP 429 responses include `Retry-After` headers
- Rate limiters may specify exact cooldown periods
- Some failures should use longer backoff than the default
`RetryDecision::RetryAfter(Duration)` captures this: the actual delay is `max(jittered_backoff, retry_after)`.
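For example, a predicate can propagate a server-specified cooldown. The `MyError::RateLimited` variant and its `cooldown` field are hypothetical application types, shown only for illustration:

```rust
|result: &Result<Response, MyError>, _attempt| match result {
    // Hypothetical error variant carrying the server's cooldown period.
    Err(MyError::RateLimited { cooldown }) => RetryDecision::RetryAfter(*cooldown),
    Err(e) if e.is_transient() => RetryDecision::Retry,
    _ => RetryDecision::Stop,
}
```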
Many retryable conditions are not errors:
- HTTP 429 (rate limited) — the request "succeeded" but must be retried
- HTTP 503 (service unavailable) — valid response, retryable condition
- Partial success responses that need full retry
Because the predicate sees the full `Result`, it can match on `Ok` variants, as in the sketch below.
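A sketch of such a predicate, assuming a hypothetical `Response::status()` accessor that returns a numeric status code:

```rust
|result: &Result<Response, MyError>, _attempt| match result {
    // "Successful" responses that nonetheless warrant a retry.
    Ok(resp) if resp.status() == 429 || resp.status() == 503 => RetryDecision::Retry,
    Err(e) if e.is_transient() => RetryDecision::Retry,
    _ => RetryDecision::Stop,
}
```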
Without a budget, N callers × M max_attempts = N*M requests hitting a failing dependency. A shared budget (token bucket) ensures the total retry rate stays proportional to the success rate, regardless of how many retry sites exist. This is the single most important mechanism for preventing retry-induced cascading failures.
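Conceptually, the budget behaves like the token bucket sketched below. The struct mirrors the `ratio` semantics above but is illustrative, not retryify's actual implementation:

```rust
// Successes deposit fractional tokens; each retry withdraws one whole
// token. With ratio = 0.2, 100 successes fund at most 20 retries.
struct TokenBucket {
    tokens: f64,
    ratio: f64,
}

impl TokenBucket {
    fn on_success(&mut self) {
        self.tokens += self.ratio;
    }

    fn try_withdraw(&mut self) -> bool {
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```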
Observability hooks:

```rust
// Closure hook
.on_retry(|event: &RetryEvent| {
    metrics::counter!("retries", 1, "attempt" => event.attempt.to_string());
})

// Structured tracing
.instrument() // emits tracing::warn! at target "retryify"
```

Minimum supported Rust version: 1.75 (for RPITIT support).

License: MIT OR Apache-2.0.