Skip to content

Conversation

@abtreece
Copy link
Owner

Summary

Implements exponential backoff with jitter for Redis connection retries to improve reliability and prevent thundering herd problems. Addresses issue #388.

Problem

Previously, createClient() would try each Redis machine sequentially with no delay between attempts and no retry logic:

  • No exponential backoff
  • No jitter to prevent thundering herd
  • Hardcoded 1-second timeouts
  • Can hammer failed nodes repeatedly

Solution

Added intelligent retry logic with configurable exponential backoff:

New Features

  1. RetryConfig struct - Tunable retry parameters:

    • MaxRetries - Maximum retry attempts (default: 3)
    • BaseDelay - Initial backoff delay (default: 100ms)
    • MaxDelay - Maximum backoff delay (default: 5s)
    • JitterFactor - Jitter factor to prevent thundering herd (default: 30%)
  2. Exponential Backoff Algorithm:

    backoff = min(baseDelay * 2^attempt, maxDelay) * (1 ± jitter)
    
    • Attempt 0: 100ms
    • Attempt 1: 200ms (±30% jitter)
    • Attempt 2: 400ms (±30% jitter)
    • Attempt 3: 800ms (±30% jitter)
    • Capped at 5s
  3. Enhanced Logging:

    • Debug logs for each retry attempt with backoff duration
    • Info log when connection succeeds after retries
    • Error context for failed connections

Benefits

  • ✅ Better handling of transient network failures
  • ✅ Reduced load on failing backend nodes (exponential backoff)
  • ✅ Prevents thundering herd via jitter
  • ✅ Tunable retry behavior (can be customized if needed)
  • ✅ Improved observability with detailed logging
  • ✅ No breaking changes - uses sensible defaults

Testing

  • 4 new test functions with 14 test cases:
    • TestDefaultRetryConfig - Validates default values
    • TestCalculateBackoff - Tests exponential backoff (6 cases)
    • TestCalculateBackoffWithJitter - Validates jitter range
    • TestCalculateBackoffMaxDelay - Ensures cap is respected
  • Updated existing tests to work with new retry config
  • All tests pass: go test ./... -short

Example Behavior

Before: Immediate sequential attempts to all machines, no delays

Trying redis1... failed
Trying redis2... failed  
Trying redis3... failed
Error: connection failed

After: Retries with exponential backoff

Trying redis1... failed
Retry 1/3 in 100ms... failed
Retry 2/3 in 180ms... failed (with jitter)
Retry 3/3 in 420ms... SUCCESS!

Resolves #388

Adds intelligent retry logic with exponential backoff and jitter to improve
Redis connection reliability and prevent thundering herd problems.

Changes:
- Add RetryConfig struct with tunable parameters (MaxRetries, BaseDelay,
  MaxDelay, JitterFactor)
- Implement calculateBackoff() with exponential backoff and jitter formula:
  backoff = min(baseDelay * 2^attempt, maxDelay) * (1 ± jitter)
- Update createClient() to retry connections with backoff between attempts
- Add comprehensive logging for retry attempts and failures
- Use sensible defaults: 3 retries, 100ms-5s backoff, 30% jitter

Benefits:
- Better handling of transient network failures
- Reduced load on failing backend nodes (exponential backoff)
- Prevents thundering herd via jitter
- Configurable retry behavior
- Improved observability with debug/info logging

Testing:
- Add 4 new test functions with 14 test cases
- Test default config, backoff calculation, jitter behavior, max delay cap
- All existing tests pass
- Full test suite passes

Resolves #388
@abtreece abtreece added the enhancement New feature or request label Jan 14, 2026
Copilot AI review requested due to automatic review settings January 14, 2026 01:02
@abtreece abtreece added priority: high High priority item area: backend Backend implementations labels Jan 14, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements exponential backoff with jitter for Redis connection retries to improve reliability and prevent thundering herd problems when connecting to Redis nodes.

Changes:

  • Added RetryConfig struct with configurable retry parameters (MaxRetries, BaseDelay, MaxDelay, JitterFactor)
  • Implemented exponential backoff algorithm with jitter in calculateBackoff() function
  • Updated createClient() to retry failed connections with exponential backoff instead of failing immediately
  • Added comprehensive test coverage for retry configuration and backoff calculation

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
pkg/backends/redis/client.go Implements retry configuration struct, exponential backoff calculation, and integrates retry logic into connection attempts
pkg/backends/redis/client_test.go Adds tests for default retry config, backoff calculation with/without jitter, and updates existing tests to use retry config

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if config.JitterFactor > 0 {
jitter := backoff * config.JitterFactor
// Random value between (backoff - jitter) and (backoff + jitter)
backoff = backoff - jitter + (rand.Float64() * 2 * jitter)
Copy link

Copilot AI Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of rand.Float64() without seeding or using a thread-safe random source can cause issues in concurrent scenarios. Consider using math/rand.NewSource() with time.Now().UnixNano() or the thread-safe crypto/rand package to ensure proper randomization across goroutines.

Suggested change
backoff = backoff - jitter + (rand.Float64() * 2 * jitter)
src := rand.NewSource(time.Now().UnixNano())
r := rand.New(src)
backoff = backoff - jitter + (r.Float64() * 2 * jitter)

Copilot uses AI. Check for mistakes.
func createClient(machines []string, password string, withReadTimeout bool) (*redis.Client, int, error) {
func createClient(machines []string, password string, withReadTimeout bool, retryConfig RetryConfig) (*redis.Client, int, error) {
var lastErr error

Copy link

Copilot AI Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When machines slice is empty, the function returns nil, 0, fmt.Errorf(...) with lastErr being nil, which will cause a nil pointer dereference in the error wrapping at line 146. The error message will show '%!w()' instead of a meaningful error. Consider returning a specific error for empty machines before the loop.

Suggested change
if len(machines) == 0 {
return nil, 0, fmt.Errorf("no redis machines configured")
}

Copilot uses AI. Check for mistakes.
Comment on lines +208 to 210
if err == nil {
t.Error("createClient with empty machines should return error")
}
Copy link

Copilot AI Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test should verify the error message content to ensure it properly indicates that no machines were provided, rather than just checking that an error exists.

Copilot uses AI. Check for mistakes.
@codecov
Copy link

codecov bot commented Jan 14, 2026

Codecov Report

❌ Patch coverage is 65.51724% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.40%. Comparing base (b2506b1) to head (c9875bb).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/backends/redis/client.go 65.51% 17 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #405      +/-   ##
==========================================
- Coverage   57.41%   57.40%   -0.01%     
==========================================
  Files          31       31              
  Lines        3198     3226      +28     
==========================================
+ Hits         1836     1852      +16     
- Misses       1230     1240      +10     
- Partials      132      134       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@claude
Copy link

claude bot commented Jan 14, 2026

Pull Request Review: Exponential Backoff for Redis Connection Retries

Summary

This PR implements exponential backoff with jitter for Redis connection retries, addressing issue #388. The implementation is well-structured and adds important reliability improvements to the Redis backend.


✅ Strengths

Code Quality

  • Clean architecture: The RetryConfig struct is well-designed and makes the retry behavior tunable
  • Good separation of concerns: calculateBackoff() is properly isolated and testable
  • Sensible defaults: The DefaultRetryConfig() provides production-ready values without requiring configuration
  • No breaking changes: Existing code continues to work with default behavior

Testing

  • Excellent test coverage: 4 new test functions with 14 test cases covering:
    • Default configuration validation
    • Exponential backoff calculations (6 cases)
    • Jitter range validation (100 iterations)
    • Maximum delay cap enforcement
  • Fast tests: Uses MaxRetries: 0 in tests with invalid addresses to prevent slow failures
  • Clear test names: Test cases follow Go conventions and are self-documenting

Documentation

  • Clear comments: Algorithm is documented with the formula in line 41
  • Helpful logging: Debug and info logs provide good observability
  • Good PR description: Includes problem statement, solution, benefits, and examples

⚠️ Issues & Recommendations

1. CRITICAL: Non-thread-safe Random Number Generation

Location: client.go:55

Issue: rand.Float64() uses the global math/rand package without seeding, which:

  • Returns deterministic values (always the same sequence) because it's never seeded
  • Is not thread-safe when called concurrently
  • In Go 1.20+, this was improved, but best practice is to use crypto/rand or seed explicitly

Current code:

backoff = backoff - jitter + (rand.Float64() * 2 * jitter)

Recommended fix (choose one):

Option A - Use math/rand/v2 (Go 1.22+, thread-safe and auto-seeded):

import (
    "math/rand/v2"
)

// In calculateBackoff:
backoff = backoff - jitter + (rand.Float64() * 2 * jitter)

Option B - Explicitly seed on init (if Go < 1.22):

import (
    "math/rand"
    "time"
)

func init() {
    rand.Seed(time.Now().UnixNano())
}

Option C - Create a local seeded rand source:

var rng = rand.New(rand.NewSource(time.Now().UnixNano()))

// In calculateBackoff:
backoff = backoff - jitter + (rng.Float64() * 2 * jitter)

Impact: Without seeding, all instances will use identical jitter values, defeating the thundering herd prevention. This is the most important issue to fix.


2. Error Wrapping Enhancement

Location: client.go:146

Issue: When all machines fail, only the last error is returned. If earlier machines had different failures, that context is lost.

Current code:

return nil, 0, fmt.Errorf("failed to connect to any redis node after retries: %w", lastErr)

Suggested enhancement:

// Track all errors
var allErrors []error
for _, address := range machines {
    // ... attempt connection ...
    if err != nil {
        allErrors = append(allErrors, fmt.Errorf("%s: %w", address, err))
    }
}

// Join all errors for better debugging
return nil, 0, fmt.Errorf("failed to connect to any redis node after retries: %w", errors.Join(allErrors...))

This would require importing errors package and using Go 1.20+ for errors.Join(). This is optional but would improve debugging in multi-machine scenarios.


3. Minor: Test Could Validate Empty Machines Error Message

Location: client_test.go:199-211

Current test:

if err == nil {
    t.Error("createClient with empty machines should return error")
}

Enhancement:

if err == nil {
    t.Error("createClient with empty machines should return error")
} else if !strings.Contains(err.Error(), "failed to connect to any redis node") {
    t.Errorf("unexpected error message: %v", err)
}

Minor improvement for more specific error validation.


4. Documentation: Clarify Retry vs Connection Attempts

Location: client.go:98-106

Issue: The variable name attempt is used within a retry loop, where attempt=0 is the initial connection try. This is correct, but the log message at line 101 might be slightly confusing:

log.Debug("Redis connection attempt %d/%d to %s failed, retrying in %v",
    attempt, retryConfig.MaxRetries, address, backoff)

When attempt=1, this logs "attempt 1/3", but it's actually the 2nd connection try (1 retry). Consider:

log.Debug("Redis connection retry %d/%d to %s after backoff %v",
    attempt, retryConfig.MaxRetries, address, backoff)

This makes it clearer that attempt is the retry count, not total connection attempts.


5. Consider: Configurable Timeouts

Location: client.go:113-114

Observation: The connection timeouts are still hardcoded to 1 second:

DialTimeout:  time.Second,
WriteTimeout: time.Second,

With exponential backoff, you might want longer timeouts on later retries. This is a future enhancement consideration, not a blocker for this PR.

Potential enhancement:

timeout := time.Second * time.Duration(1 + attempt) // 1s, 2s, 3s, 4s
opts.DialTimeout = timeout

🔒 Security Considerations

No security concerns identified:


⚡ Performance Considerations

Positive impacts:

  • Reduces hammering of failed nodes through exponential backoff
  • Jitter prevents thundering herd when multiple instances restart simultaneously
  • Default config (3 retries, 100ms base, 5s max) is reasonable for production

⚠️ Consideration:

  • Worst case: With 3 retries and multiple machines, startup could take longer if all are down
  • Example: 3 machines × (100ms + 200ms + 400ms) = ~2.1 seconds total delay
  • This is acceptable trade-off for reliability, but worth documenting

📋 Test Coverage Assessment

Strong unit test coverage:

  • Backoff calculation logic thoroughly tested
  • Jitter range validated statistically
  • Edge cases covered (empty machines, invalid addresses)
  • Configuration defaults validated

⚠️ Integration tests:

  • PR description mentions integration tests pass
  • Consider adding a test that validates retry behavior with a mock Redis that fails N times then succeeds
  • Current tests with MaxRetries: 0 bypass retry logic for speed

🎯 Recommendations Priority

Must Fix (before merge):

  1. Fix non-deterministic/thread-unsafe random number generation (Issue Consider switching integration tests to use docker #1)

Should Fix (before merge):

  1. Consider improving error context for multi-machine failures (Issue Fixes owner / group support #2)
  2. Clarify retry vs attempt terminology in logs (Issue Add hostname to template functions #4)

Nice to Have (future PR):

  1. More specific test error message validation (Issue Allow conf.d and templates directory to be symlinks #3)
  2. Adaptive timeout increases (Issue Setup GitHub Actions #5)
  3. Integration test for retry behavior

📝 Commit Message Note

Per CLAUDE.md:

All commits should be authored by the repository owner. Do not include Co-Authored-By lines or AI attribution in commit messages.

The current commit should remove the Co-Authored-By line when merged:

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

✅ Final Verdict

Recommendation: Approve with Minor Fixes

This is a well-implemented, well-tested PR that addresses a real production reliability issue. The code quality is high, and the approach is sound.

Required changes:

Optional improvements:

After fixing the random seeding issue, this PR will be production-ready and should be merged. Great work overall! 🎉

Fixes critical and optional issues identified in code review:

Critical Fix:
- Fix non-thread-safe random number generation for jitter
  - Add package-level seeded rand.Source for proper randomization
  - Ensures different jitter values across instances (prevents thundering herd)
  - Thread-safe for concurrent calls

Improvements:
- Aggregate errors from all machines for better debugging
  - Use errors.Join() to preserve context from each failed machine
  - Return more informative error messages
- Clarify log message terminology
  - Change "attempt" to "retry" for clearer distinction
  - Update log at line 106: "retry N/M" instead of "attempt N/M"
  - Update success log at line 136: "after N retries" instead of "after N attempts"

All tests pass.
@claude
Copy link

claude bot commented Jan 14, 2026

Pull Request Review - PR #405

Overall Assessment

This is a well-implemented PR that addresses issue #388 with a solid exponential backoff retry mechanism for Redis connections. The implementation is clean, well-tested, and follows Go best practices. I recommend approval with minor considerations.


✅ Strengths

1. Code Quality

  • Clean separation of concerns with RetryConfig struct and calculateBackoff helper function
  • Well-documented code with clear comments explaining the exponential backoff algorithm
  • Good use of Go idioms (error wrapping with errors.Join, proper resource cleanup)
  • Maintains backward compatibility - uses sensible defaults with no breaking changes

2. Algorithm Implementation

  • Correct exponential backoff: baseDelay * 2^attempt
  • Proper max delay capping to prevent excessive wait times
  • Jitter implementation prevents thundering herd problem
  • Formula is well-documented: backoff = min(baseDelay * 2^attempt, maxDelay) * (1 ± jitter)

3. Test Coverage

  • Comprehensive test suite with 14 test cases across 4 test functions
  • Tests cover edge cases: zero jitter, max delay capping, jitter range validation
  • Tests verify default configuration values
  • Good test naming and structure with table-driven tests

4. Observability

  • Enhanced logging for retry attempts with backoff duration
  • Success logging after retries helps with debugging
  • Improved error messages with context from all failed machines
  • Uses errors.Join for aggregated error reporting (Go 1.20+)

⚠️ Potential Issues & Considerations

1. Thread Safety Concern (Minor)

var rng = rand.New(rand.NewSource(time.Now().UnixNano()))

Issue: rand.New creates a non-thread-safe RNG. While the comment claims "thread-safety", rand.Rand is not safe for concurrent use.

Impact: Low - If multiple goroutines call createClient concurrently (e.g., from connectedClient and WatchPrefix), this could cause race conditions.

Recommendation: Either:

  • Use math/rand/v2 global functions (thread-safe in Go 1.22+)
  • Add a mutex around rng.Float64() calls
  • Use sync.Pool for per-goroutine RNG instances

Example fix:

// Option 1: Use sync/Mutex
var (
    rng = rand.New(rand.NewSource(time.Now().UnixNano()))
    rngMu sync.Mutex
)

func calculateBackoff(attempt int, config RetryConfig) time.Duration {
    // ...
    if config.JitterFactor > 0 {
        jitter := backoff * config.JitterFactor
        rngMu.Lock()
        randomValue := rng.Float64()
        rngMu.Unlock()
        backoff = backoff - jitter + (randomValue * 2 * jitter)
    }
    return time.Duration(backoff)
}

2. Retry Logic Behavior

Current implementation tries each machine MaxRetries + 1 times (0 to MaxRetries inclusive):

for attempt := 0; attempt <= retryConfig.MaxRetries; attempt++ {

Clarification Needed:

  • With MaxRetries: 3, this results in 4 total attempts per machine (initial + 3 retries)
  • The PR description says "Maximum retry attempts (default: 3)" - does this mean 3 total or 3 retries?

Current behavior:

  • Attempt 0: immediate (no backoff)
  • Attempt 1: ~100ms backoff
  • Attempt 2: ~200ms backoff
  • Attempt 3: ~400ms backoff
  • Total: 4 attempts per machine

This is fine if intentional, but the naming MaxRetries: 3 suggests 3 total attempts, not 3 retries + 1 initial.

Recommendation: Either rename to MaxAttempts or adjust loop to attempt < retryConfig.MaxRetries.

3. Configuration Exposure

RetryConfig is currently hardcoded to defaults with no way for users to customize it:

retryConfig := DefaultRetryConfig()

Consideration: Future enhancement could expose these via:

  • Environment variables (CONFD_REDIS_MAX_RETRIES, etc.)
  • Config file parameters
  • Command-line flags

This isn't blocking for this PR since the PR description mentions "Tunable retry behavior (can be customized if needed)", but the current implementation doesn't expose tuning yet.

4. Total Connection Time

With multiple machines and retries, total connection time could be significant:

  • 3 machines × 4 attempts × ~1s timeout = ~12 seconds worst case
  • Plus backoff delays: 100ms + 200ms + 400ms per machine = ~2.1s additional
  • Worst case: ~14 seconds before returning error

Consideration: This might be too long for some use cases. Consider:

  • Adding a global timeout across all machines/retries
  • Making DialTimeout configurable
  • Short-circuiting if context is cancelled

5. Error Message Change

The error message format changed significantly:

// Before: returns last error directly
// After: returns wrapped error with aggregation
fmt.Errorf("failed to connect to any redis node after retries: %w", errors.Join(allErrors...))

Impact: Any code parsing error messages might break, but this is generally fine for user-facing errors.


🔒 Security Review

No security concerns identified.

  • No user input validation issues
  • No credential handling changes
  • No new injection vectors
  • Proper resource cleanup (client.Close())

🚀 Performance Considerations

Positives:

  • Reduces load on failing nodes (exponential backoff)
  • Prevents thundering herd (jitter)
  • Doesn't impact happy path performance (first attempt succeeds immediately)

Considerations:

  • Increased latency on failures (by design - this is acceptable)
  • Additional memory: 1 RNG instance (negligible)
  • No CPU-intensive operations

📊 Test Coverage Assessment

Excellent test coverage for unit-testable code:

  • ✅ Default configuration validation
  • ✅ Exponential backoff calculation (6 cases)
  • ✅ Jitter range validation (100 iterations)
  • ✅ Max delay capping
  • ✅ Updated existing tests for new signature

Missing tests (acceptable for this PR):

  • Integration tests with actual retry behavior (noted in comments, handled by CI)
  • Concurrent access to RNG (would require race detector)
  • Context cancellation during retries

Recommendation: Run tests with race detector:

go test -race ./pkg/backends/redis/...

📝 Documentation

Good:

  • Clear PR description with examples
  • Inline code comments
  • Test comments explaining expected behavior

Could improve:

  • Add godoc comment for RetryConfig explaining field constraints (e.g., JitterFactor range 0.0-1.0)
  • Document the "attempts vs retries" semantics more explicitly

✨ Recommendations

Must Address:

  1. Fix thread-safety issue with rng package variable

Should Consider:

  1. Clarify naming: MaxRetries vs MaxAttempts semantics
  2. Document total worst-case connection time in comments
  3. Add validation for RetryConfig fields (e.g., JitterFactor should be 0.0-1.0)

Nice to Have:

  1. Add godoc examples for DefaultRetryConfig
  2. Consider future work: expose configuration to users
  3. Consider adding context.Context awareness in retry loop for early cancellation

Example: Suggested Validation

// Validate validates the retry configuration
func (r RetryConfig) Validate() error {
    if r.MaxRetries < 0 {
        return fmt.Errorf("MaxRetries must be >= 0, got %d", r.MaxRetries)
    }
    if r.BaseDelay < 0 {
        return fmt.Errorf("BaseDelay must be >= 0, got %v", r.BaseDelay)
    }
    if r.MaxDelay < r.BaseDelay {
        return fmt.Errorf("MaxDelay (%v) must be >= BaseDelay (%v)", r.MaxDelay, r.BaseDelay)
    }
    if r.JitterFactor < 0 || r.JitterFactor > 1 {
        return fmt.Errorf("JitterFactor must be in [0, 1], got %f", r.JitterFactor)
    }
    return nil
}

🎯 Verdict

APPROVE with minor fix for thread safety.

This PR successfully addresses issue #388 with a production-ready implementation. The code quality is high, testing is thorough, and the approach is sound. The only blocking issue is the thread-safety concern with the package-level RNG, which should be addressed before merge.

Estimated effort to address feedback: 15-30 minutes for the thread-safety fix.

Great work on this implementation! 🎉

@abtreece abtreece merged commit 9fc35e5 into main Jan 14, 2026
9 checks passed
@abtreece abtreece deleted the enhancement/redis-exponential-backoff branch January 14, 2026 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: backend Backend implementations enhancement New feature or request priority: high High priority item

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement exponential backoff for Redis connection retries

1 participant