retryify

Production-grade async retry with shared budgets, jittered backoff, and per-attempt timeouts.

The problem

Naive retry loops cause retry storms — when a dependency fails, every caller retries simultaneously, multiplying load by max_attempts. This cascading amplification turns a partial outage into a total one. retryify solves this with:

  • Shared retry budgets that cap the retry ratio across all callers
  • Jittered backoff that decorrelates retry timing
  • Rich predicates that handle real-world responses (HTTP 429, partial success)
  • Dual timeouts that distinguish "this attempt is slow" from "give up entirely"

Quick start

use retryify::*;
use std::time::Duration;

let result = retry(exponential())
    .with_full_jitter()
    .max_attempts(5)
    .per_attempt_timeout(Duration::from_secs(5))
    .total_timeout(Duration::from_secs(30))
    .run(
        |result: &Result<Response, MyError>, _attempt| match result {
            Err(e) if e.is_transient() => RetryDecision::Retry,
            _ => RetryDecision::Stop,
        },
        || async { call_service().await },
    )
    .await?;

Shared budget

use retryify::*;
use std::time::Duration;

// Create once, clone to each retry site
let budget = RetryBudget::shared()
    .ratio(0.2)           // 20 retries per 100 successes
    .min_per_second(1.0)  // floor during low traffic
    .window(Duration::from_secs(60))
    .build();

// Service A
let b = budget.clone();
retry(exponential()).with_full_jitter().budget(b).run(/* ... */).await?;

// Service B: shares the same token pool
let b = budget.clone();
retry(exponential()).with_full_jitter().budget(b).run(/* ... */).await?;

When the budget is exhausted, retries halt immediately — the strongest protection against retry storms.

Backoff strategies

Strategy             Formula                                     Use case
exponential()        base * multiplier^attempt (capped at max)   Default for network calls
linear(base, step)   base + step * attempt                       Predictable growth
constant(delay)      Fixed delay                                 Idempotent operations

// Exponential with custom parameters
exponential().base(Duration::from_millis(200)).multiplier(3.0).max(Duration::from_secs(60))

// Linear: 100ms, 200ms, 300ms, ...
linear(Duration::from_millis(100), Duration::from_millis(100))

// Constant: always 500ms
constant(Duration::from_millis(500))
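The capped exponential formula above can be sketched directly in a few lines. This is an illustrative re-implementation for intuition, not retryify's internals; `exponential_delay` is a hypothetical helper name.

```rust
use std::time::Duration;

// Capped exponential backoff: base * multiplier^attempt, clamped to `max`.
// Illustrative only -- mirrors the formula above, not retryify's code.
fn exponential_delay(base: Duration, multiplier: f64, max: Duration, attempt: u32) -> Duration {
    base.mul_f64(multiplier.powi(attempt as i32)).min(max)
}

fn main() {
    let base = Duration::from_millis(200);
    // Attempts 0..5 with multiplier 3.0: 200ms, 600ms, 1.8s, 5.4s, 16.2s
    for attempt in 0..5 {
        println!("{:?}", exponential_delay(base, 3.0, Duration::from_secs(60), attempt));
    }
}
```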

Jitter strategies

Strategy      Formula                     Notes
FullJitter    rand(0, base)               AWS recommendation; maximum decorrelation
EqualJitter   base/2 + rand(0, base/2)    Guaranteed minimum spacing
NoJitter      base                        Tests only; correlated retries in production cause cascading failures
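The two production jitter formulas are easy to sketch. `full_jitter`, `equal_jitter`, and the embedded LCG below are illustrative stand-ins rather than retryify API; real code would draw randomness from a crate such as `rand`.

```rust
use std::time::Duration;

// Tiny LCG kept inline so the sketch stays dependency-free.
// Returns a value in [0, 1).
fn next_unit(seed: &mut u64) -> f64 {
    *seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    (*seed >> 11) as f64 / (1u64 << 53) as f64
}

// FullJitter: sleep = rand(0, base).
fn full_jitter(base: Duration, seed: &mut u64) -> Duration {
    base.mul_f64(next_unit(seed))
}

// EqualJitter: sleep = base/2 + rand(0, base/2).
fn equal_jitter(base: Duration, seed: &mut u64) -> Duration {
    let half = base / 2;
    half + half.mul_f64(next_unit(seed))
}

fn main() {
    let mut seed = 42;
    let base = Duration::from_secs(1);
    println!("full:  {:?}", full_jitter(base, &mut seed));
    println!("equal: {:?}", equal_jitter(base, &mut seed));
}
```

Note the trade-off visible in the formulas: full jitter can sleep for almost nothing, while equal jitter guarantees at least half the base delay between attempts.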

The builder enforces jitter selection at compile time — you cannot accidentally forget it:

// Won't compile without choosing a jitter strategy:
retry(exponential())
    .with_full_jitter()  // or .without_jitter() for tests
    .max_attempts(5)
    // ...

Timeout semantics

retryify deliberately avoids a single .timeout() method. Ambiguous timeout semantics are a common source of production incidents.

Method                    Scope              On expiry
.per_attempt_timeout(d)   Single attempt     Attempt is cancelled, retry continues
.total_timeout(d)         Entire lifecycle   Returns RetryError::DeadlineExceeded

Why this matters: A 5-second "timeout" could mean "cancel this one slow call and try again" or "give up on the entire operation." These are radically different behaviors. Making the distinction explicit prevents a class of outages where per-attempt timeouts were accidentally used as total deadlines (or vice versa).

The total timeout also clamps sleep durations: if only 2 seconds remain in the budget, a 10-second backoff is reduced to 2 seconds.
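The clamping rule can be sketched as follows; `clamped_sleep` is a hypothetical helper illustrating the behavior, not part of retryify's public API.

```rust
use std::time::{Duration, Instant};

// Never sleep past the total deadline: the backoff is capped at whatever
// remains in the overall budget. (Illustrative helper, not retryify API.)
fn clamped_sleep(backoff: Duration, deadline: Instant) -> Duration {
    let remaining = deadline.saturating_duration_since(Instant::now());
    backoff.min(remaining)
}

fn main() {
    let deadline = Instant::now() + Duration::from_secs(2);
    // A 10s backoff is clamped to the ~2s remaining in the total budget.
    println!("{:?}", clamped_sleep(Duration::from_secs(10), deadline));
}
```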

Design decisions

Why RetryDecision instead of bool

Real retry logic is richer than retry/don't-retry:

  • HTTP 429 responses include Retry-After headers
  • Rate limiters may specify exact cooldown periods
  • Some failures should use longer backoff than the default

RetryDecision::RetryAfter(Duration) captures this — the delay is max(jittered_backoff, retry_after).
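Under that rule, delay selection might look like the sketch below. The `Decision` enum and `delay_for` function are hypothetical mirrors of the behavior described, not retryify's actual types.

```rust
use std::time::Duration;

// Hypothetical mirror of the decision type, to show the delay rule.
#[derive(Debug, PartialEq)]
enum Decision {
    Stop,
    Retry,
    RetryAfter(Duration),
}

fn delay_for(decision: &Decision, jittered_backoff: Duration) -> Option<Duration> {
    match decision {
        Decision::Stop => None,
        Decision::Retry => Some(jittered_backoff),
        // Honour the server's hint, but never sleep less than the backoff.
        Decision::RetryAfter(hint) => Some(jittered_backoff.max(*hint)),
    }
}

fn main() {
    // A Retry-After of 5s overrides a 1s jittered backoff.
    println!("{:?}", delay_for(&Decision::RetryAfter(Duration::from_secs(5)), Duration::from_secs(1)));
}
```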

Why &Result<T, E> instead of &E

Many retryable conditions are not errors:

  • HTTP 429 (rate limited) — the request "succeeded" but must be retried
  • HTTP 503 (service unavailable) — valid response, retryable condition
  • Partial success responses that need full retry

The predicate sees the full result so it can match on Ok variants.

Why shared budgets

Without a budget, N callers × M max_attempts = N*M requests hitting a failing dependency. A shared budget (token bucket) ensures the total retry rate stays proportional to the success rate, regardless of how many retry sites exist. This is the single most important mechanism for preventing retry-induced cascading failures.
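The token-bucket idea can be sketched as a toy: successes deposit a fraction of a token, each retry withdraws a whole one. `Budget` here is a hypothetical illustration of the mechanism, not retryify's `RetryBudget`.

```rust
use std::sync::{Arc, Mutex};

// Toy token-bucket retry budget: each success deposits `ratio` tokens,
// each retry withdraws one. Sketch of the idea, not retryify's code.
#[derive(Clone)]
struct Budget {
    tokens: Arc<Mutex<f64>>,
    ratio: f64,
    cap: f64,
}

impl Budget {
    fn new(ratio: f64, cap: f64) -> Self {
        Budget { tokens: Arc::new(Mutex::new(cap)), ratio, cap }
    }

    // Called after every successful request.
    fn record_success(&self) {
        let mut t = self.tokens.lock().unwrap();
        *t = (*t + self.ratio).min(self.cap);
    }

    // Called before every retry; `false` means the budget is exhausted.
    fn try_retry(&self) -> bool {
        let mut t = self.tokens.lock().unwrap();
        if *t >= 1.0 {
            *t -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let budget = Budget::new(0.2, 10.0);
    // Clones share the same pool, so two call sites drain one budget.
    let (site_a, site_b) = (budget.clone(), budget.clone());
    let mut allowed = 0;
    for i in 0..12 {
        let site = if i % 2 == 0 { &site_a } else { &site_b };
        if site.try_retry() {
            allowed += 1;
        }
    }
    println!("retries allowed: {}", allowed); // 10
}
```

Because every clone shares the same `Arc`, the total retry rate stays proportional to the deposit rate from successes, no matter how many retry sites exist.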

Telemetry

// Closure hook
.on_retry(|event: &RetryEvent| {
    metrics::counter!("retries", 1, "attempt" => event.attempt.to_string());
})

// Structured tracing
.instrument()  // emits tracing::warn! at target "retryify"

Minimum supported Rust version

1.75 (for RPITIT support)

License

MIT OR Apache-2.0
