Phase 7.1: Implement Error Handling and Message Acknowledgment (#103) #146

Merged
artcava merged 4 commits into develop from feature/103-error-handling-acknowledgment
Mar 2, 2026

@artcava commented Mar 2, 2026

📋 Overview

Implements a comprehensive error handling and message acknowledgment strategy for ProcessWorker. It covers reliable message processing, error classification, correct ACK/NACK behavior, and integration with dead-letter queues.

Closes #103

🎯 Implemented Features

1. Error Classification System

  • ✅ Created ErrorClassifier with pattern matching for exception categorization
  • ✅ Implemented ErrorClassification model with ErrorCode, IsRetryable, ShouldRequeue, Severity
  • ✅ Added ErrorSeverity enum (Warning, Error, Critical)
  • ✅ Comprehensive unit tests for all error types

2. Dead Letter Exchange Configuration

  • ✅ Declared DLX: stargate.processes.dlx (topic, durable)
  • ✅ Declared DLQ: stargate.processes.dead-letter (durable)
  • ✅ Configured queue binding: DLQ → DLX with routing key #
  • ✅ Main queue configured with x-dead-letter-exchange arguments
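
The topology above can be sketched as plain constants plus the argument dictionary for the main queue. This is a minimal illustration, not the code from RabbitMqConsumer.cs: the class and member names are hypothetical, and only the exchange, queue, routing-key, and argument names come from this description.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the string values are the ones listed in this PR.
public static class DeadLetterTopology
{
    public const string MainQueue = "stargate.processes";
    public const string DeadLetterExchange = "stargate.processes.dlx"; // topic, durable
    public const string DeadLetterQueue = "stargate.processes.dead-letter"; // durable
    public const string DeadLetterBindingKey = "#"; // on a topic exchange, "#" matches every routing key

    // Arguments for declaring the main queue, so that messages rejected
    // with requeue=false are routed by the broker to the DLX.
    public static Dictionary<string, object> MainQueueArguments() =>
        new Dictionary<string, object>
        {
            ["x-dead-letter-exchange"] = DeadLetterExchange,
        };
}
```

The dictionary would be passed as the `arguments` parameter when the main queue is declared. Note that RabbitMQ rejects a redeclaration of an existing queue with different arguments (PRECONDITION_FAILED), so a `stargate.processes` queue created before this change has to be deleted and recreated.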

3. Poison Message Detection

  • ✅ Retry count tracking via x-retry-count header
  • ✅ Max retry limit: 5 retries
  • ✅ Auto-reject poison messages to DLQ
  • ✅ Enhanced logging for retry tracking

4. ProcessWorker Integration

  • ✅ Integrated ErrorClassifier for sophisticated error handling
  • ✅ ACK/NACK strategy based on error classification
  • ✅ Record error classification metadata in process failures
  • ✅ Comprehensive logging with classification details

📊 ACK/NACK Decision Matrix

| Error Type | ErrorCode | Retryable | Requeue | Action | Destination |
|---|---|---|---|---|---|
| JsonException | MALFORMED_MESSAGE | No | No | NACK | DLQ |
| TimeoutException | PROCESS_TIMEOUT | Yes | Yes | NACK | Main Queue |
| HttpRequestException | HTTP_ERROR | Yes | Yes | NACK | Main Queue |
| InvalidOperationException | INVALID_OPERATION | No | No | NACK | DLQ |
| ArgumentException | INVALID_ARGUMENT | No | No | NACK | DLQ |
| Unknown | UNKNOWN_ERROR | Yes | Yes | NACK | Main Queue |
| Success | N/A | N/A | N/A | ACK | Removed |
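
As a rough sketch, the Action and Destination columns of the matrix reduce to a single decision on the Requeue flag. The `AckAction` enum, the trimmed-down `ErrorClassification` record, and the `AckDecision` class below are hypothetical names for illustration and are not taken from ProcessWorker.cs.

```csharp
#nullable enable
using System;

// Hypothetical names; only the ACK/NACK semantics follow the matrix.
public enum AckAction
{
    Ack,            // success: message removed from the queue
    NackRequeue,    // NACK with requeue=true: back to the main queue
    NackDeadLetter  // NACK with requeue=false: broker routes to DLX, then DLQ
}

// Reduced version of the classification model, kept to what the decision needs.
public sealed record ErrorClassification(string ErrorCode, bool IsRetryable, bool ShouldRequeue);

public static class AckDecision
{
    // A null classification stands for "processing succeeded" in this sketch.
    public static AckAction Decide(ErrorClassification? failure)
    {
        if (failure is null)
        {
            return AckAction.Ack;
        }

        return failure.ShouldRequeue ? AckAction.NackRequeue : AckAction.NackDeadLetter;
    }
}
```

For example, `AckDecision.Decide(new ErrorClassification("MALFORMED_MESSAGE", false, false))` yields `AckAction.NackDeadLetter`, which matches the JsonException row.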

🧪 Testing

Unit Tests

dotnet test tests/StarGate.Core.Tests --filter "FullyQualifiedName~ErrorClassifier"

All tests passing:

  • ✅ JsonException → MALFORMED_MESSAGE (non-retryable)
  • ✅ TimeoutException → PROCESS_TIMEOUT (retryable)
  • ✅ HttpRequestException → HTTP_ERROR (retryable)
  • ✅ InvalidOperationException → INVALID_OPERATION (non-retryable)
  • ✅ ArgumentException → INVALID_ARGUMENT (non-retryable)
  • ✅ Unknown Exception → UNKNOWN_ERROR (retryable)

Integration Testing Instructions

  1. Verify RabbitMQ topology:

    docker-compose up -d rabbitmq
    open http://localhost:15672
    # Login: guest/guest

    Expected exchanges:

    • stargate.processes (main)
    • stargate.processes.dlx (dead-letter)

    Expected queues:

    • stargate.processes (with DLX config)
    • stargate.processes.dead-letter
  2. Test malformed message:

    • Publish invalid JSON directly to queue
    • Verify message moved to DLQ
    • Check logs for "MALFORMED_MESSAGE" classification
  3. Test poison message:

    • Create process that always fails
    • Verify after 5 retries, message goes to DLQ
    • Check message headers for retry count
  4. Monitor DLQ:

    • Access RabbitMQ Management UI
    • Navigate to stargate.processes.dead-letter queue
    • Analyze failure patterns

📝 Technical Details

Error Classification Logic

The ErrorClassifier uses pattern matching to map exceptions to classifications:

  • Non-Retryable Errors: Deterministic failures (malformed data, invalid arguments) that won't succeed on retry
  • Retryable Errors: Transient failures (timeouts, network errors) that may succeed on retry
  • Default Behavior: Unknown errors are treated as retryable (safe default)
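
The mapping described above could look roughly like the following switch expression. This is a sketch under assumptions, not the contents of ErrorClassifier.cs: the error codes and retry flags follow the decision matrix, but the severity assigned to each case is a guess, `JsonException` is assumed to be the System.Text.Json type, and the class name `ErrorClassifierSketch` is made up.

```csharp
#nullable enable
using System;
using System.Net.Http;
using System.Text.Json;

public enum ErrorSeverity { Warning, Error, Critical }

public sealed record ErrorClassification(
    string ErrorCode, bool IsRetryable, bool ShouldRequeue, ErrorSeverity Severity);

public static class ErrorClassifierSketch
{
    // Severity values are assumptions; the PR does not state them per error type.
    public static ErrorClassification Classify(Exception ex) => ex switch
    {
        // Deterministic failures: retrying cannot help.
        JsonException => new ErrorClassification("MALFORMED_MESSAGE", false, false, ErrorSeverity.Error),
        InvalidOperationException => new ErrorClassification("INVALID_OPERATION", false, false, ErrorSeverity.Error),
        ArgumentException => new ErrorClassification("INVALID_ARGUMENT", false, false, ErrorSeverity.Error),

        // Transient failures: a later attempt may succeed.
        TimeoutException => new ErrorClassification("PROCESS_TIMEOUT", true, true, ErrorSeverity.Warning),
        HttpRequestException => new ErrorClassification("HTTP_ERROR", true, true, ErrorSeverity.Warning),

        // Default: unknown errors are treated as retryable.
        _ => new ErrorClassification("UNKNOWN_ERROR", true, true, ErrorSeverity.Error),
    };
}
```

Type patterns also match derived exceptions, so in this sketch an `ArgumentNullException` is classified as INVALID_ARGUMENT and an `ObjectDisposedException` as INVALID_OPERATION.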

Dead Letter Exchange Flow

Main Queue → Process Fails → NACK (requeue=false) → DLX → DLQ

Poison Message Protection

Message → Attempt 1 → Fail → Retry (count=1)
       → Attempt 2 → Fail → Retry (count=2)
       → ...
       → Attempt 5 → Fail → Retry (count=5)
       → Attempt 6 → Detected as poison → DLQ
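
A minimal sketch of the retry-count check, assuming the count is read from the `x-retry-count` header on delivery and that a message arriving with a count of 5 is the sixth attempt in the flow above. The class and method names are hypothetical. The `>=` comparison follows the flow as drawn here; a strict `> 5` check would let one more attempt through.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text;

public static class PoisonMessageSketch
{
    public const string RetryHeader = "x-retry-count";
    public const int MaxRetries = 5;

    // Reads the retry count from message headers; a missing or unreadable header counts as 0.
    public static int ReadRetryCount(IDictionary<string, object>? headers)
    {
        if (headers is null || !headers.TryGetValue(RetryHeader, out var raw) || raw is null)
        {
            return 0;
        }

        return raw switch
        {
            int i => i,
            long l => (int)l,
            // String header values may be delivered as raw bytes by the client library.
            byte[] bytes when int.TryParse(Encoding.UTF8.GetString(bytes), out var parsed) => parsed,
            string s when int.TryParse(s, out var parsed) => parsed,
            _ => 0,
        };
    }

    // True once the message has already been retried MaxRetries times.
    public static bool IsPoison(IDictionary<string, object>? headers) =>
        ReadRetryCount(headers) >= MaxRetries;
}
```

A consumer would call `IsPoison` before processing and, when it returns true, reject the message with requeue=false so that the broker dead-letters it.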

🔄 Changes Summary

New Files

  • src/StarGate.Core/Errors/ErrorClassifier.cs
  • tests/StarGate.Core.Tests/Errors/ErrorClassifierTests.cs

Modified Files

  • src/StarGate.Infrastructure/Messaging/RabbitMQ/RabbitMqConsumer.cs

    • Added DLX configuration
    • Implemented poison message detection
    • Enhanced retry count tracking
  • src/StarGate.Server/Workers/ProcessWorker.cs

    • Integrated ErrorClassifier
    • Enhanced error handling with classification
    • Improved ACK/NACK decision logic

✅ Checklist

  • ErrorClassifier implemented with exception mapping
  • Error classification includes ErrorCode, IsRetryable, ShouldRequeue, Severity
  • Dead-letter exchange configured
  • Dead-letter queue created and bound
  • Main queue configured with DLX arguments
  • Poison message detection (retry count > 5)
  • Retry count tracked in message headers
  • Malformed messages NACK'd without requeue
  • Retryable errors NACK'd with requeue
  • Non-retryable errors NACK'd without requeue
  • Process failure recorded with classification
  • Comprehensive error logging
  • Unit tests for error classification
  • Code follows CODING-CONVENTIONS.md

📚 References

🏷️ Labels

phase-7 process-engine sprint-7.1 error-handling

⏱️ Effort

Actual: ~8 hours (aligned with estimate)

🔗 Dependencies

artcava and others added 4 commits March 2, 2026 11:53
- Add ErrorClassifier for exception categorization
- Add ErrorClassification model with ErrorCode, IsRetryable, ShouldRequeue, Severity
- Add ErrorSeverity enum (Warning, Error, Critical)
- Implement pattern matching for common exception types
- Add default handler for unknown exceptions
- Add Dead Letter Exchange (stargate.processes.dlx) configuration
- Add Dead Letter Queue (stargate.processes.dead-letter) setup
- Configure main queue with x-dead-letter-exchange arguments
- Implement retry count tracking in message headers (x-retry-count)
- Add poison message detection (max 5 retries)
- Auto-move poison messages to DLQ
- Enhanced logging for retry tracking and DLX operations
- Update EnsureQueueExists to create and bind DLX topology
- Replace simple exception type checking with ErrorClassifier
- Use error classification for ACK/NACK decisions
- Record error classification details in process failure
- Add comprehensive logging with classification metadata
- Implement Decision Matrix for error handling:
  * Malformed messages: NACK without requeue -> DLQ
  * Retryable errors: NACK with requeue
  * Non-retryable errors: NACK without requeue -> DLQ
- Enhanced error reporting with ErrorCode and Severity
@artcava merged commit 0aab1bf into develop Mar 2, 2026
4 checks passed
@artcava deleted the feature/103-error-handling-acknowledgment branch March 2, 2026 11:03