ref(eco): Adds `report_timeout_errors` option to Task definitions, allows ProcessingDeadlineExceeded to be retried #592
Conversation
```python
ignore: tuple[type[BaseException], ...] | None = None,
times_exceeded: LastAction = LastAction.Discard,
delay: int | None = None,
retry_on_timeout: bool = True,
```
Given that we can specify ProcessingDeadlineException in `on` already, and with your change it can work, `retry_on_timeout` seems not entirely orthogonal here.

I'm interested in why we use `type[BaseException]` rather than `type[Exception]`, what with `should_retry` (at least prior to this) only allowing `Exception`. I can see an argument for it, but not the motivating case.

If we try to make `retry_on_timeout` stand alone, we potentially don't need to involve `should_retry` at all or do that localized import; we just check the bool. This seems oddly consistent with how task timeouts aren't the same thing as exceptions that escape: it's framework functionality that may or may not be an exception.
> given that we can specify ProcessingDeadlineException in `on` already and with your change it can work, `retry_on_timeout` seems not entirely orthogonal here.

I agree; I think we could fold this in, but it seemed distinct enough at first given the special handling this case currently has. Perhaps there's enough here to roll the exception handling in `worker_child` together with the generic exception handling and simplify all of this.
> I'm interested in why we use `type[BaseException]` rather than `type[Exception]`, what with `should_retry` (at least prior to this) only allowing `Exception`. I can see an argument for it, but not the motivating case.

I am as well. I'll let the taskbroker team weigh in, since they're more likely to have a strong opinion on this.
> I'm interested in why we use `type[BaseException]` rather than `type[Exception]`, what with `should_retry` (at least prior to this) only allowing `Exception`. I can see an argument for it, but not the motivating case.

I don't have a strong reason to keep `BaseException`. We could easily go to `Exception` and be compatible with application usage.
> given that we can specify ProcessingDeadlineException in `on` already and with your change it can work,

Including ProcessingDeadlineException in an `on` clause works for me. It exposes less API surface area and makes it more explicit which kinds of timeouts are captured, since the `retry_on_timeout` parameter doesn't explain which kinds of timeouts it covers.
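The `on` allow-list approach discussed above can be sketched as follows. This is a minimal stand-in, not the actual `taskbroker_client` implementation: the `ProcessingDeadlineExceeded` class, the `Retry` constructor signature, and the simplified attempts counter are all assumptions for illustration.

```python
# Minimal sketch of retrying a deadline signal via the `on` allow list.
# NOT the real taskbroker_client code; names and signatures are stand-ins.

class ProcessingDeadlineExceeded(BaseException):
    """Stand-in for the worker's processing-deadline signal."""

class Retry:
    def __init__(self, times: int = 3, on: tuple[type[BaseException], ...] = ()):
        self._times = times
        self._allowed_exception_types = on

    def should_retry(self, attempts: int, exc: BaseException) -> bool:
        # Attempts are exhausted once we have used up times - 1 retries
        # (simplified from the real RetryState bookkeeping).
        if attempts >= self._times - 1:
            return False
        return isinstance(exc, self._allowed_exception_types)

retry = Retry(times=3, on=(ProcessingDeadlineExceeded,))
print(retry.should_retry(1, ProcessingDeadlineExceeded()))  # True
print(retry.should_retry(1, ValueError("boom")))            # False
print(retry.should_retry(2, ProcessingDeadlineExceeded()))  # False (exhausted)
```

The point of the sketch is that once the allow list is typed as `tuple[type[BaseException], ...]`, no separate `retry_on_timeout` flag is needed.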
I ended up just broadening the allowable exception types in `should_retry` to include `BaseException`. To me it makes sense for ProcessingDeadline to continue to be categorized as a `BaseException` type, since it's not strictly an error but a system-level timer.
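The rationale for keeping the deadline signal on the `BaseException` branch can be demonstrated in plain Python: like `KeyboardInterrupt` and `SystemExit`, a `BaseException` subclass sails past broad `except Exception` blocks in task code, so only the framework's handler sees it. The class below is a stand-in, not the real one.

```python
# A BaseException subclass escapes `except Exception`, which is why a
# system-level timer signal is modeled this way. Stand-in class for
# illustration only.

class ProcessingDeadlineExceeded(BaseException):
    pass

def task_body() -> str:
    try:
        raise ProcessingDeadlineExceeded("deadline hit")
    except Exception:
        return "swallowed by task"  # never reached: not an Exception subclass

try:
    task_body()
    outcome = "returned normally"
except ProcessingDeadlineExceeded:
    outcome = "caught by framework handler"

print(outcome)  # caught by framework handler
```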
```python
)
next_state = TASK_ACTIVATION_STATUS_FAILURE
retry = task_func.retry
if retry and retry.should_retry(inflight.activation.retry_state, err):
```
Would you have tasks configured to do retries on ProcessingDeadlineExceeded?
The use case for us would be: we try a webhook dispatch, it fails with a timeout, and we retry x times total before bailing on the webhook attempt entirely. If there's another way to achieve this without hooking into the `ProcessingDeadlineExceeded` except block here, I'm game for that.
You could have the task processing deadline be longer than the cumulative timeouts of your HTTP request (and retries). That would let you do some retries within the task, handle those failures, and request a full task retry without having to catch processing deadlines.
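The in-task retry pattern suggested above can be sketched like this. Everything here is hypothetical: `dispatch_webhook` stands in for the real HTTP call (rigged to time out twice), and the backoff is elided. The key constraint is in the comment: the task's processing deadline must exceed `attempts * per_request_timeout` plus backoff.

```python
# Sketch of retrying an HTTP call inside the task, so the processing
# deadline is never hit and no deadline exception needs catching.
import time

def dispatch_webhook(url: str, timeout: float) -> bool:
    # Placeholder for the real request; pretend the first two calls time out.
    dispatch_webhook.calls = getattr(dispatch_webhook, "calls", 0) + 1
    if dispatch_webhook.calls < 3:
        raise TimeoutError("request timed out")
    return True

def deliver(url: str, attempts: int = 3, per_request_timeout: float = 5.0) -> bool:
    # The task's processing deadline should exceed
    # attempts * per_request_timeout (plus any backoff).
    for attempt in range(attempts):
        try:
            return dispatch_webhook(url, timeout=per_request_timeout)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # exhausted: let framework-level failure handling run
            time.sleep(0)  # backoff elided in this sketch
    return False

result = deliver("https://example.invalid/hook")
print(result)  # True, after two simulated timeouts
```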
Discussed off issue, but the problem we're hitting is with DNS resolution in certain cases (ngrok is typically the culprit), which we can't meaningfully apply a timeout to. This was a way for us to allow retries without reimplementing the same logic we already have here.
```python
if isinstance(exc, self._allowed_exception_types):
    return True

from taskbroker_client.worker.workerchild import ProcessingDeadlineExceeded
```
We could move `ProcessingDeadlineExceeded` to `taskbroker_client.types` to break the import cycle.
```python
# In the retry allow list or processing deadline is exceeded
# When processing deadline is exceeded, the subprocess raises a TimeoutError
if isinstance(exc, (TimeoutError, self._allowed_exception_types)):
```
I don't think this is a safe change. If tasks previously raised `TimeoutError`, they would automatically retry; with this change they will not.
Where is `TimeoutError` being raised in the code? I was a bit confused about where this will actually be an issue, since I only saw the `ProcessingDeadlineExceeded` exception being raised here:
Restored this in case it breaks anything. I do think there's a case for deprecating this behavior, but I don't want to regress anything in the meantime.
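The restored check keeps `TimeoutError` implicitly retriable alongside the configured allow list. One detail worth noting about the quoted expression: `isinstance` accepts nested tuples of types, so passing `(TimeoutError, self._allowed_exception_types)` works as written. A stand-alone sketch (the `allowed` tuple is a made-up example, not the real allow list):

```python
# Demonstrates the nested-tuple isinstance check used in the restored code.
allowed = (ValueError, KeyError)  # stand-in for self._allowed_exception_types

def is_retriable(exc: BaseException) -> bool:
    # isinstance flattens nested tuples of types recursively.
    return isinstance(exc, (TimeoutError, allowed))

print(is_retriable(TimeoutError()))  # True  (legacy behavior preserved)
print(is_retriable(ValueError()))    # True  (in the allow list)
print(is_retriable(RuntimeError()))  # False
```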
```diff
     # starts at 1.
     return state.attempts >= (self._times - 1)

-    def should_retry(self, state: RetryState, exc: Exception) -> bool:
+    def should_retry(self, state: RetryState, exc: BaseException) -> bool:
         # If there are no retries remaining we should not retry
         if self.max_attempts_reached(state):
             return False
```
Bug: The retry logic does not implicitly handle `ProcessingDeadlineExceeded` as intended. It requires an explicit `on=(ProcessingDeadlineExceeded,)` configuration, causing tasks to fail instead of retrying on timeout.

Severity: HIGH

Suggested Fix: Update the `should_retry` method in `retry.py` to also check `isinstance(exc, ProcessingDeadlineExceeded)`. This will make the exception implicitly retriable, aligning the code's behavior with the pull request's description. Also update the stale comment that incorrectly refers to `TimeoutError` being raised.
Prompt for AI Agent

```
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why
it's not valid.

Location: clients/python/src/taskbroker_client/retry.py#L81-L87

Potential issue: The pull request's stated goal is to make ProcessingDeadlineExceeded
exceptions implicitly retriable. However, the implementation does not achieve this.
The should_retry method checks for TimeoutError or exceptions explicitly passed in
the `on` parameter of the Retry object. Since ProcessingDeadlineExceeded does not
inherit from TimeoutError, it is not retried by default. Tasks that exceed their
processing deadline will fail immediately instead of being retried, contrary to the
feature's intent. This requires users to explicitly opt in via
retry=Retry(on=(ProcessingDeadlineExceeded,)), which contradicts the documented
behavior.
```
I think that's what we want now.
Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 91c9dff.
```python
if retry and retry.should_retry(inflight.activation.retry_state, err):
    next_state = TASK_ACTIVATION_STATUS_RETRY
else:
    next_state = TASK_ACTIVATION_STATUS_FAILURE
```
Sentry error captured before retry check causes duplicate reports (Medium Severity)

In the `ProcessingDeadlineExceeded` handler, `sentry_sdk.capture_exception` fires before `should_retry` is checked. When `report_timeout_errors` is True (the default) and retries are configured, every timeout attempt reports to Sentry, even those that will be retried. This is inconsistent with the `Exception` handler, which only captures errors when the task is not being retried.
markstory left a comment:

Looks good to me. Once this is merged and used in sentry, it would be good to remove the deadline exception workarounds we have in sentry.


Updates the python lib code with the following changes:

- Updates `retry.py` to include `BaseException` types (like `ProcessingDeadlineExceeded`) in retry allow lists.
- Adds retry handling to the `ProcessingDeadlineExceeded` handling branch.
- Adds a `report_timeout_errors` check, which allows us to opt out of Sentry reporting on a per-task definition basis.

This will allow developers to optionally specify retries on task timeouts.

Resolves ISWF-2447