feat: Phase 7.1 - Integrate Retry Logic in ProcessWorker #145

Merged
artcava merged 11 commits into develop from feature/issue-102-retry-logic-integration
Mar 2, 2026

artcava commented Mar 2, 2026

📝 Description

Implements comprehensive retry logic in ProcessWorker to handle transient failures with exponential backoff and coordinated message redelivery through RabbitMQ.

Closes #102

🎯 Changes Made

Core Implementation

  • ✅ Created RetryConfiguration class with exponential backoff calculation
  • ✅ Integrated retry logic in ProcessWorker.HandleProcessFailureAsync
  • ✅ Implemented error classification (retryable vs non-retryable)
  • ✅ Added PublishRetryMessageAsync method for delayed message publishing
  • ✅ Configured dependency injection for RetryConfiguration

Configuration

  • ✅ Added appsettings.json with retry configuration
  • ✅ Added appsettings.Development.json with dev-specific settings
  • ✅ Updated Program.cs to register RetryConfiguration

Testing

  • ✅ Created comprehensive unit tests for retry logic (RetryLogicTests)
  • ✅ Tests cover exponential backoff, max delay, and jitter

Documentation

  • ✅ Created docs/RETRY-LOGIC.md with complete implementation guide
  • ✅ Documented retry flow, configuration, and troubleshooting

🛠️ Technical Details

Exponential Backoff Formula

Delay = BaseDelay × (Multiplier ^ RetryCount)

Example with defaults (BaseDelaySeconds = 5, BackoffMultiplier = 2.0), before jitter:

  • Retry 0: 5s
  • Retry 1: 10s
  • Retry 2: 20s
  • Retry 3: 40s
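
The formula and the example values above can be reproduced with a short sketch. `CalculateDelaySeconds` is a hypothetical helper name, not necessarily the method in RetryConfiguration, and clamping to the maximum delay inside the same function is an assumption based on the MaxDelaySeconds setting.

```csharp
using System;

// Hypothetical helper: delay in seconds for a zero-based retry count.
// Defaults mirror the production values quoted in this PR (5s base, x2, 300s cap).
static double CalculateDelaySeconds(
    int retryCount,
    double baseDelaySeconds = 5,
    double multiplier = 2.0,
    double maxDelaySeconds = 300)
{
    if (retryCount < 0)
        throw new ArgumentOutOfRangeException(nameof(retryCount));

    // Delay = BaseDelay x (Multiplier ^ RetryCount), clamped to the cap (assumed).
    double delay = baseDelaySeconds * Math.Pow(multiplier, retryCount);
    return Math.Min(delay, maxDelaySeconds);
}

for (int retry = 0; retry <= 6; retry++)
    Console.WriteLine($"Retry {retry}: {CalculateDelaySeconds(retry)}s");
```

With these defaults, retry 6 would be 320s uncapped, so it is the first value clamped to 300s.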

Jitter Implementation

Adds up to ±15% randomization (a 30% wide band centered on the computed delay) to prevent thundering herd:

var jitter = random.NextDouble() * 0.3; // unitless fraction in [0, 0.3)
delaySeconds = delaySeconds * (1 + jitter - 0.15); // factor in [0.85, 1.15)
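
To make the bounds of the randomization easy to check, here is a minimal sketch of symmetric jitter with the spread passed in as a fraction. `ApplyJitter` and its `spread` parameter are hypothetical names introduced for illustration; the key point is that the random term stays a unitless factor and is never multiplied by the delay before being added to 1.

```csharp
using System;

// Hypothetical helper: scales a delay by a random factor in
// [1 - spread, 1 + spread). 'spread' is a fraction, e.g. 0.15 for +/-15%.
static double ApplyJitter(double delaySeconds, double spread, Random random)
{
    if (spread < 0 || spread >= 1)
        throw new ArgumentOutOfRangeException(nameof(spread));

    // NextDouble() returns a value in [0, 1); map it onto [-spread, +spread).
    double offset = (random.NextDouble() * 2.0 - 1.0) * spread;
    return delaySeconds * (1.0 + offset);
}

var random = new Random(42); // fixed seed only to make the demo repeatable
for (int i = 0; i < 5; i++)
    Console.WriteLine($"{ApplyJitter(20, 0.15, random):F2}s");
```

For a 20s delay and a spread of 0.15, every result falls between 17s and 23s, which keeps retries from many workers from landing on the same instant.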

Error Classification

Retryable:

  • TimeoutException
  • HttpRequestException
  • OperationCanceledException (graceful shutdown)

Non-Retryable:

  • InvalidOperationException (business logic violations)
  • NO_HANDLER_FOUND
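
The classification above could be expressed as a single switch. This is a sketch under assumptions: `IsRetryable` and `IsRetryableErrorCode` are hypothetical names, treating NO_HANDLER_FOUND as a string error code is a guess at how it is represented, and defaulting unknown exceptions to retryable is an assumption taken from the "most errors are retryable" design note in this PR.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical classifier mirroring the two lists above.
static bool IsRetryable(Exception ex) => ex switch
{
    TimeoutException => true,
    HttpRequestException => true,
    OperationCanceledException => true,  // also matches TaskCanceledException
    InvalidOperationException => false,  // business logic violation
    _ => true                            // assumption: unknown errors are retried
};

// Assumption: NO_HANDLER_FOUND is carried as an error code, not an exception.
static bool IsRetryableErrorCode(string errorCode) =>
    !string.Equals(errorCode, "NO_HANDLER_FOUND", StringComparison.Ordinal);

Console.WriteLine(IsRetryable(new TimeoutException()));           // retryable
Console.WriteLine(IsRetryable(new InvalidOperationException()));  // not retryable
Console.WriteLine(IsRetryableErrorCode("NO_HANDLER_FOUND"));      // not retryable
```

Note that type patterns also match subclasses, so anything deriving from InvalidOperationException (for example ObjectDisposedException) would be treated as non-retryable by this sketch.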

Retry Flow

  1. Handler execution fails
  2. Classify error type
  3. Call ProcessService.FailProcessAsync(canRetry)
  4. ProcessService checks RetryCount vs MaxRetries
  5. If a retry is allowed: calculate the delay with exponential backoff
  6. Publish delayed message to RabbitMQ
  7. Message redelivered after delay
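
Steps 2 to 5 of the flow can be condensed into one decision function. This is a sketch, not the code in ProcessWorker or ProcessService: `NextRetryDelay` is a hypothetical name, and the exact comparison between RetryCount and MaxRetries (here, retries stop once `retryCount >= maxRetries`) is an assumption.

```csharp
using System;

// Hypothetical decision step: returns the delay before the next attempt,
// or null when the process should be marked Failed.
static TimeSpan? NextRetryDelay(
    bool isRetryable, int retryCount, int maxRetries,
    double baseDelaySeconds = 5, double multiplier = 2.0, double maxDelaySeconds = 300)
{
    if (!isRetryable) return null;              // non-retryable: fail immediately
    if (retryCount >= maxRetries) return null;  // retry budget exhausted (assumed rule)

    double seconds = Math.Min(
        baseDelaySeconds * Math.Pow(multiplier, retryCount),
        maxDelaySeconds);
    return TimeSpan.FromSeconds(seconds);
}

// Simulate a process that always fails with a retryable error, maxRetries = 3.
int retryCount = 0;
while (NextRetryDelay(true, retryCount, maxRetries: 3) is TimeSpan delay)
{
    Console.WriteLine($"Attempt {retryCount + 2} after {delay.TotalSeconds}s");
    retryCount++;
}
Console.WriteLine($"Failed with retryCount = {retryCount}");
```

Jitter is left out here so the sequence is deterministic: three delayed attempts at 5s, 10s, and 20s, then a final failure with a retry count of 3.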

📋 Files Changed

New Files

  • src/StarGate.Core/Configuration/RetryConfiguration.cs
  • src/StarGate.Server/appsettings.json
  • src/StarGate.Server/appsettings.Development.json
  • tests/StarGate.Server.Tests/Workers/RetryLogicTests.cs
  • docs/RETRY-LOGIC.md

Modified Files

  • src/StarGate.Server/Workers/ProcessWorker.cs

    • Added IMessageBroker and RetryConfiguration dependencies
    • Enhanced HandleProcessFailureAsync with retry decision logic
    • Added PublishRetryMessageAsync method
    • Updated error classification
  • src/StarGate.Server/Program.cs

    • Added RetryConfiguration registration

✅ Testing

Unit Tests

All retry logic unit tests pass:

dotnet test tests/StarGate.Server.Tests --filter "FullyQualifiedName~Retry"

Test Coverage:

  • ✅ Exponential backoff calculation (5 test cases)
  • ✅ Max delay enforcement
  • ✅ Jitter randomization
  • ✅ Consistent delays without jitter
  • ✅ Default configuration values

Integration Testing Plan

  1. Start infrastructure: docker-compose up -d
  2. Create policy with maxRetries: 3
  3. Create failing process
  4. Verify retry timing:
    • Attempt 1: Immediate
    • Attempt 2: ~5s delay
    • Attempt 3: ~10s delay
    • Attempt 4: ~20s delay
    • Final: Status = Failed
  5. Verify retryCount = 3 in process

📚 Documentation

Comprehensive documentation added in docs/RETRY-LOGIC.md:

  • Architecture overview
  • Exponential backoff formula and examples
  • Jitter implementation and benefits
  • Error classification rules
  • Complete retry flow diagram
  • Configuration reference
  • RabbitMQ delayed message pattern
  • Testing instructions
  • Monitoring and observability
  • Troubleshooting guide

🔗 Dependencies

Depends On

Related To

⚙️ Configuration

Production (appsettings.json)

{
  "Retry": {
    "BaseDelaySeconds": 5,
    "MaxDelaySeconds": 300,
    "BackoffMultiplier": 2.0,
    "UseJitter": true
  }
}

Development (appsettings.Development.json)

{
  "Retry": {
    "BaseDelaySeconds": 3,
    "MaxDelaySeconds": 60,
    "BackoffMultiplier": 2.0,
    "UseJitter": true
  }
}
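
To see how the two profiles differ in practice, the sketch below reads a Retry section and reports the first retry at which the cap takes effect. It parses the JSON directly with System.Text.Json so that it stays self-contained; the real registration in Program.cs presumably goes through the .NET configuration system instead. `ReadRetry` and `FirstCappedRetry` are hypothetical helpers.

```csharp
using System;
using System.Text.Json;

// Reads the "Retry" section from a JSON string shaped like the files above.
static (double BaseDelay, double MaxDelay, double Multiplier, bool UseJitter) ReadRetry(string json)
{
    using JsonDocument doc = JsonDocument.Parse(json);
    JsonElement retry = doc.RootElement.GetProperty("Retry");
    return (
        retry.GetProperty("BaseDelaySeconds").GetDouble(),
        retry.GetProperty("MaxDelaySeconds").GetDouble(),
        retry.GetProperty("BackoffMultiplier").GetDouble(),
        retry.GetProperty("UseJitter").GetBoolean());
}

// First zero-based retry count whose uncapped delay exceeds the cap.
static int FirstCappedRetry(double baseDelay, double multiplier, double maxDelay)
{
    if (multiplier <= 1 || baseDelay <= 0)
        throw new ArgumentException("delay must grow for the cap to be reached");

    int retry = 0;
    while (baseDelay * Math.Pow(multiplier, retry) <= maxDelay) retry++;
    return retry;
}

var devJson = @"{ ""Retry"": { ""BaseDelaySeconds"": 3, ""MaxDelaySeconds"": 60,
                               ""BackoffMultiplier"": 2.0, ""UseJitter"": true } }";
var dev = ReadRetry(devJson);
Console.WriteLine(
    $"Development cap reached at retry {FirstCappedRetry(dev.BaseDelay, dev.Multiplier, dev.MaxDelay)}");
```

With the development values the uncapped sequence is 3, 6, 12, 24, 48, 96 seconds, so retry 5 is the first one clamped to 60s; with the production values the cap of 300s is first hit at retry 6.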

🐛 Known Issues / Limitations

None. Implementation follows issue #102 specifications completely.

📝 Checklist

  • Code follows CODING-CONVENTIONS.md
  • All unit tests pass
  • New functionality is covered by tests
  • Documentation updated (RETRY-LOGIC.md)
  • Configuration added to appsettings
  • Logging implemented for all retry operations
  • Error handling comprehensive
  • Dependencies properly injected
  • Commit messages follow git-flow conventions
  • PR references issue #102 (Phase 7.1: Integrate Retry Logic in ProcessWorker)

📦 Deployment Notes

Prerequisites

  • RabbitMQ must be configured with Dead Letter Exchange for delayed messages
  • Process policies must have MaxRetries configured
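
For readers unfamiliar with the Dead Letter Exchange approach to delays, the sketch below builds the declaration arguments for a consumer-less "wait" queue: a message sits there until its TTL expires, then RabbitMQ dead-letters it to the configured exchange, from which it is routed back to the work queue. The argument keys are standard RabbitMQ queue arguments; the helper name, exchange name, and routing key are hypothetical, and whether this PR uses a per-queue TTL or a per-message expiration is not stated here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: arguments for a wait queue that redelivers after 'delay'.
static Dictionary<string, object> BuildWaitQueueArguments(
    string deadLetterExchange, string deadLetterRoutingKey, TimeSpan delay)
{
    return new Dictionary<string, object>
    {
        // Where expired messages are republished.
        ["x-dead-letter-exchange"] = deadLetterExchange,
        // Routing key used on republish, so the message reaches the work queue.
        ["x-dead-letter-routing-key"] = deadLetterRoutingKey,
        // Per-queue message TTL in milliseconds.
        ["x-message-ttl"] = checked((int)delay.TotalMilliseconds)
    };
}

// Exchange and routing key names below are placeholders, not taken from the PR.
var queueArguments = BuildWaitQueueArguments(
    "stargate.processes", "process.execute", TimeSpan.FromSeconds(5));

foreach (var pair in queueArguments)
    Console.WriteLine($"{pair.Key} = {pair.Value}");
```

A dictionary like this would be passed as the arguments of the client library's queue declaration call. One design point worth checking during review: RabbitMQ expires messages only when they reach the head of a queue, so mixing different per-message TTLs in a single wait queue can hold a short delay behind a long one; one wait queue per distinct delay is a common way to avoid that.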

Configuration Steps

  1. Deploy code to environment
  2. Verify appsettings.json has Retry section
  3. Restart ProcessWorker service
  4. Monitor logs for retry events

Rollback Plan

If issues occur:

  1. Revert to previous version
  2. Existing processes will continue without retry logic
  3. No data migration required

🔍 Additional Notes

Design Decisions

  1. Jitter enabled by default: Prevents thundering herd problem
  2. Max delay cap: Prevents excessive wait times (5 minutes max)
  3. Dead Letter Exchange pattern: No RabbitMQ plugins required
  4. Error classification: Conservative approach (most errors are retryable)

Performance Impact

  • Minimal CPU overhead (simple exponential calculation)
  • Memory: Delayed messages stored in RabbitMQ queues
  • Network: One additional publish per retry
  • No blocking operations in ProcessWorker

Future Enhancements

See docs/RETRY-LOGIC.md for planned improvements:

  • Adaptive backoff based on system load
  • Per-error-type retry configuration
  • Circuit breaker integration
  • Metrics dashboard

Reviewer Notes:

This PR implements the complete retry logic as specified in issue #102. All acceptance criteria have been met:

✅ RetryConfiguration with exponential backoff
✅ Jitter to prevent thundering herd
✅ Error classification (retryable vs non-retryable)
✅ Integration with ProcessService retry handling
✅ Delayed message publishing via RabbitMQ
✅ Comprehensive unit tests
✅ Complete documentation
✅ Configuration in appsettings.json

Please review the implementation, tests, and documentation. Special attention to:

  • Retry delay calculation logic
  • Error classification rules
  • RabbitMQ message flow

artcava merged commit 0bf0eb5 into develop Mar 2, 2026
4 checks passed
artcava deleted the feature/issue-102-retry-logic-integration branch March 2, 2026 10:47