Problem
During rolling restarts, Eclipse Ditto produces thousands of DittoInternalErrorException (HTTP 500) responses. These are not actual internal errors — they are transient timeouts caused by shard rebalancing in the policies service.
The root cause is a timeout hierarchy mismatch in AbstractPersistenceSupervisor:
Supervisor
  └─ askEnforcerChild(signal)   ← 10s timeout (localEnforcerAskTimeout)
       └─ ThingEnforcerActor
            └─ PolicyCacheLoader
                 └─ AskWithRetry(policiesShardRegion)   ← 3s × 3 retries + backoff = ~42s worst case
The supervisor's 10s enforcer timeout fires while the enforcer is still legitimately retrying the policy shard (which can take up to 42s during rebalancing). The resulting AskTimeoutException is a Pekko-native exception, not a DittoRuntimeException, so getEnforcementExceptionAsRuntimeException() wraps it as DittoInternalErrorException (HTTP 500) — the generic fallback for unknown throwables.
This is incorrect: enforcement timeouts during shard rebalancing are transient and retryable. They should produce HTTP 503 (Service Unavailable), not HTTP 500 (Internal Server Error).
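The mismatch can be reproduced in isolation. The following is a minimal sketch, not Ditto code: `TimeoutHierarchyDemo`, `innerAskWithRetry` and `outerAsk` are hypothetical names, plain `CompletableFuture` stands in for the Pekko ask pattern, and the durations are scaled down by a factor of 100 (10s becomes 100ms, ~42s becomes 420ms).

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutHierarchyDemo {

    /** Hypothetical inner ask: answers only after its whole retry budget has elapsed. */
    static CompletableFuture<String> innerAskWithRetry(Duration retryBudget) {
        return CompletableFuture.supplyAsync(
                () -> "policy loaded",
                CompletableFuture.delayedExecutor(retryBudget.toMillis(), TimeUnit.MILLISECONDS));
    }

    /** Hypothetical outer ask: applies its own timeout, independent of the inner budget. */
    static String outerAsk(Duration outerTimeout, Duration innerRetryBudget) {
        try {
            return innerAskWithRetry(innerRetryBudget)
                    .orTimeout(outerTimeout.toMillis(), TimeUnit.MILLISECONDS)
                    .join();
        } catch (CompletionException e) {
            // orTimeout completes the future with a TimeoutException, which join() wraps.
            if (e.getCause() instanceof TimeoutException) {
                return "outer timeout";
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        // Outer timeout shorter than the inner retry budget: the caller gives up first.
        System.out.println(outerAsk(Duration.ofMillis(100), Duration.ofMillis(420)));
        // prints "outer timeout"

        // Outer timeout longer than the inner retry budget: the inner result arrives.
        System.out.println(outerAsk(Duration.ofMillis(1000), Duration.ofMillis(420)));
        // prints "policy loaded"
    }
}
```

In the first call the inner operation is still within its legitimate retry budget when the outer timeout fires, which is the situation described above for the supervisor and the enforcer.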
Impact
- Clients receive HTTP 500, which indicates an unexpected server-side failure rather than a transient condition. Many HTTP clients and load balancers do not retry 500s by default.
- Each occurrence is logged at ERROR level (the special DittoInternalErrorException logging path in handleSignalEnforcementResponse), generating noise that masks genuine internal errors.
- The enforcer child continues retrying the policy shard uselessly after the supervisor has already returned 500 to the caller.
Proposed Fix
Introduce EnforcementTimeoutException — a dedicated DittoRuntimeException subclass that maps enforcement AskTimeoutException to HTTP 503 instead of 500.
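A rough sketch of the intended mapping, under stated assumptions: the classes below are simplified stand-ins rather than Ditto's real types, `java.util.concurrent.TimeoutException` stands in for Pekko's AskTimeoutException, and `toRuntimeException` is a hypothetical helper that only mirrors the role of getEnforcementExceptionAsRuntimeException().

```java
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

public class EnforcementTimeoutMapping {

    /** Simplified stand-in for DittoRuntimeException: an exception carrying an HTTP status. */
    static class SketchRuntimeException extends RuntimeException {
        private final int httpStatus;

        SketchRuntimeException(int httpStatus, String message, Throwable cause) {
            super(message, cause);
            this.httpStatus = httpStatus;
        }

        int getHttpStatus() {
            return httpStatus;
        }
    }

    /** Stand-in for the generic fallback (DittoInternalErrorException, HTTP 500). */
    static final class InternalError extends SketchRuntimeException {
        InternalError(Throwable cause) {
            super(500, "Unexpected internal error", cause);
        }
    }

    /** Stand-in for the proposed EnforcementTimeoutException (HTTP 503). */
    static final class EnforcementTimeout extends SketchRuntimeException {
        EnforcementTimeout(Throwable cause) {
            super(503, "Enforcement timed out, please retry", cause);
        }
    }

    /** Hypothetical mapping: timeouts become 503, unknown throwables stay 500. */
    static SketchRuntimeException toRuntimeException(Throwable error) {
        // Futures usually wrap the real failure in a CompletionException.
        Throwable cause = (error instanceof CompletionException && error.getCause() != null)
                ? error.getCause()
                : error;
        if (cause instanceof SketchRuntimeException) {
            return (SketchRuntimeException) cause; // already carries its own status
        }
        if (cause instanceof TimeoutException) {
            return new EnforcementTimeout(cause);  // transient and retryable
        }
        return new InternalError(cause);           // generic fallback
    }

    public static void main(String[] args) {
        System.out.println(toRuntimeException(new TimeoutException("ask timed out")).getHttpStatus());
        // prints 503
        System.out.println(toRuntimeException(new IllegalStateException("bug")).getHttpStatus());
        // prints 500
    }
}
```

Checking for the timeout before falling through to the generic branch keeps 500 reserved for throwables that are genuinely unknown.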