ref(eco): Adds `report_timeout_errors` option to Task definitions, allows ProcessingDeadlineExceeded to be retried #592
Conversation
```python
ignore: tuple[type[BaseException], ...] | None = None,
times_exceeded: LastAction = LastAction.Discard,
delay: int | None = None,
retry_on_timeout: bool = True,
```
Given that we can specify ProcessingDeadlineException in `on` already, and with your change it can work, `retry_on_timeout` seems not entirely orthogonal here.

I'm interested in why we use `type[BaseException]` rather than `type[Exception]`, what with `should_retry` (at least prior to this) only allowing `Exception`. I can see an argument for it, but not the motivating case.

If we try to make `retry_on_timeout` stand alone, we potentially don't need to involve `should_retry` at all or do that localized import; we just check the bool. This seems oddly consistent with how task timeouts aren't the same thing as exceptions that escape: it's framework functionality that may or may not be an exception.
> given that we can specify ProcessingDeadlineException in `on` already and with your change it can work, `retry_on_timeout` seems not entirely orthogonal here.

I agree; I think we could fold this in, but it seemed distinct enough at first given the special handling this case currently has. Perhaps there's enough here to roll the exception handling in `worker_child` together with the generic exception handling and simplify all of this.
> I'm interested in why we use `type[BaseException]` rather than `type[Exception]`, what with `should_retry` (at least prior to this) only allowing `Exception`. I can see an argument for it, but not the motivating case.

I am as well. I'll let the taskbroker team weigh in, since they're more likely to have a strong opinion on this.
> I'm interested in why we use `type[BaseException]` rather than `type[Exception]`, what with `should_retry` (at least prior to this) only allowing `Exception`. I can see an argument for it, but not the motivating case.

I don't have a strong reason to keep `BaseException`. We could easily go to `Exception` and be compatible with application usage.
> given that we can specify ProcessingDeadlineException in `on` already and with your change it can work,

Including ProcessingDeadlineException in an `on` clause works for me. It exposes less API surface area and makes it more explicit which kinds of timeouts are captured, since the `retry_on_timeout` parameter doesn't explain which kinds of timeouts it covers.
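The `on` allow-list approach discussed above can be sketched as follows. This is a minimal stand-in, not the actual `taskbroker_client` implementation: the `ProcessingDeadlineExceeded` class, the `Retry` constructor signature, and the simplified attempts counter are all assumptions for illustration.

```python
# Minimal sketch of retrying a deadline signal via the `on` allow list.
# NOT the real taskbroker_client code; names and signatures are stand-ins.

class ProcessingDeadlineExceeded(BaseException):
    """Stand-in for the worker's processing-deadline signal."""

class Retry:
    def __init__(self, times: int = 3, on: tuple[type[BaseException], ...] = ()):
        self._times = times
        self._allowed_exception_types = on

    def should_retry(self, attempts: int, exc: BaseException) -> bool:
        # Attempts are exhausted once we have used up times - 1 retries
        # (simplified from the real RetryState bookkeeping).
        if attempts >= self._times - 1:
            return False
        return isinstance(exc, self._allowed_exception_types)

retry = Retry(times=3, on=(ProcessingDeadlineExceeded,))
print(retry.should_retry(1, ProcessingDeadlineExceeded()))  # True
print(retry.should_retry(1, ValueError("boom")))            # False
print(retry.should_retry(2, ProcessingDeadlineExceeded()))  # False (exhausted)
```

The point of the sketch is that once the allow list is typed as `tuple[type[BaseException], ...]`, no separate `retry_on_timeout` flag is needed.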
I ended up just broadening the allowable exception types in `should_retry` to include `BaseException`. To me it makes sense for ProcessingDeadline to continue to be categorized as a `BaseException` type, since it's not strictly an error but a system-level timer.
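The rationale for keeping the deadline signal on the `BaseException` branch can be demonstrated in plain Python: like `KeyboardInterrupt` and `SystemExit`, a `BaseException` subclass sails past broad `except Exception` blocks in task code, so only the framework's handler sees it. The class below is a stand-in, not the real one.

```python
# A BaseException subclass escapes `except Exception`, which is why a
# system-level timer signal is modeled this way. Stand-in class for
# illustration only.

class ProcessingDeadlineExceeded(BaseException):
    pass

def task_body() -> str:
    try:
        raise ProcessingDeadlineExceeded("deadline hit")
    except Exception:
        return "swallowed by task"  # never reached: not an Exception subclass

try:
    task_body()
    outcome = "returned normally"
except ProcessingDeadlineExceeded:
    outcome = "caught by framework handler"

print(outcome)  # caught by framework handler
```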
```python
)
next_state = TASK_ACTIVATION_STATUS_FAILURE
retry = task_func.retry
if retry and retry.should_retry(inflight.activation.retry_state, err):
```
Would you have tasks configured to do retries on ProcessingDeadlineExceeded?
The use case for us would be: we try a webhook dispatch, it fails with a timeout, and we retry x times total before bailing on the webhook attempt entirely. If there's another way to achieve this without hooking into the `ProcessingDeadlineExceeded` except block here, I'm game for that.
You could have the task processing deadline be longer than the cumulative timeouts of your HTTP request (and retries). That would let you do some retries within the task, handle those failures, and request a full task retry without having to catch processing deadlines.
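The in-task retry pattern suggested above can be sketched like this. Everything here is hypothetical: `dispatch_webhook` stands in for the real HTTP call (rigged to time out twice), and the backoff is elided. The key constraint is in the comment: the task's processing deadline must exceed `attempts * per_request_timeout` plus backoff.

```python
# Sketch of retrying an HTTP call inside the task, so the processing
# deadline is never hit and no deadline exception needs catching.
import time

def dispatch_webhook(url: str, timeout: float) -> bool:
    # Placeholder for the real request; pretend the first two calls time out.
    dispatch_webhook.calls = getattr(dispatch_webhook, "calls", 0) + 1
    if dispatch_webhook.calls < 3:
        raise TimeoutError("request timed out")
    return True

def deliver(url: str, attempts: int = 3, per_request_timeout: float = 5.0) -> bool:
    # The task's processing deadline should exceed
    # attempts * per_request_timeout (plus any backoff).
    for attempt in range(attempts):
        try:
            return dispatch_webhook(url, timeout=per_request_timeout)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # exhausted: let framework-level failure handling run
            time.sleep(0)  # backoff elided in this sketch
    return False

result = deliver("https://example.invalid/hook")
print(result)  # True, after two simulated timeouts
```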
Discussed off issue, but the problem we're hitting is with DNS resolution in certain cases (ngrok is typically the culprit), which we can't meaningfully apply a timeout to. This was a way for us to allow retries without reimplementing the same logic we already have here.
```python
if isinstance(exc, self._allowed_exception_types):
    return True

from taskbroker_client.worker.workerchild import ProcessingDeadlineExceeded
```
We could move `ProcessingDeadlineExceeded` to `taskbroker_client.types` to break the import cycle.
```python
# In the retry allow list or processing deadline is exceeded
# When processing deadline is exceeded, the subprocess raises a TimeoutError
if isinstance(exc, (TimeoutError, self._allowed_exception_types)):
```
I don't think this is a safe change. If tasks previously raised `TimeoutError`, they would automatically retry; with this change they will not.
Where is `TimeoutError` being raised in the code? I was a bit confused about where this will actually be an issue, since I only saw the `ProcessingDeadlineExceeded` exception being raised here:
Restored this in case it breaks anything. I do think there's a case for deprecating this behavior, but I don't want to regress anything in the meantime.
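The restored check keeps `TimeoutError` implicitly retriable alongside the configured allow list. One detail worth noting about the quoted expression: `isinstance` accepts nested tuples of types, so passing `(TimeoutError, self._allowed_exception_types)` works as written. A stand-alone sketch (the `allowed` tuple is a made-up example, not the real allow list):

```python
# Demonstrates the nested-tuple isinstance check used in the restored code.
allowed = (ValueError, KeyError)  # stand-in for self._allowed_exception_types

def is_retriable(exc: BaseException) -> bool:
    # isinstance flattens nested tuples of types recursively.
    return isinstance(exc, (TimeoutError, allowed))

print(is_retriable(TimeoutError()))  # True  (legacy behavior preserved)
print(is_retriable(ValueError()))    # True  (in the allow list)
print(is_retriable(RuntimeError()))  # False
```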
```diff
     # starts at 1.
     return state.attempts >= (self._times - 1)

-    def should_retry(self, state: RetryState, exc: Exception) -> bool:
+    def should_retry(self, state: RetryState, exc: BaseException) -> bool:
         # If there are no retries remaining we should not retry
         if self.max_attempts_reached(state):
             return False
```
Bug: The retry logic does not implicitly handle `ProcessingDeadlineExceeded` as intended. It requires an explicit `on=(ProcessingDeadlineExceeded,)` configuration, causing tasks to fail instead of retrying on timeout.

Severity: HIGH

Suggested Fix: Update the `should_retry` method in `retry.py` to also check `isinstance(exc, ProcessingDeadlineExceeded)`. This will make the exception implicitly retriable, aligning the code's behavior with the pull request's description. Also update the stale comment that incorrectly refers to `TimeoutError` being raised.
Prompt for AI Agent

```
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why
it's not valid.

Location: clients/python/src/taskbroker_client/retry.py#L81-L87

Potential issue: The pull request's stated goal is to make ProcessingDeadlineExceeded
exceptions implicitly retriable. However, the implementation does not achieve this.
The should_retry method checks for TimeoutError or exceptions explicitly passed in
the `on` parameter of the Retry object. Since ProcessingDeadlineExceeded does not
inherit from TimeoutError, it is not retried by default. Tasks that exceed their
processing deadline will fail immediately instead of being retried, contrary to the
feature's intent. This requires users to explicitly opt in via
retry=Retry(on=(ProcessingDeadlineExceeded,)), which contradicts the documented
behavior.
```
I think that's what we want now.
Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 91c9dff.
```python
if retry and retry.should_retry(inflight.activation.retry_state, err):
    next_state = TASK_ACTIVATION_STATUS_RETRY
else:
    next_state = TASK_ACTIVATION_STATUS_FAILURE
```
Sentry error captured before retry check causes duplicate reports (Medium Severity)

In the `ProcessingDeadlineExceeded` handler, `sentry_sdk.capture_exception` fires before `should_retry` is checked. When `report_timeout_errors` is True (the default) and retries are configured, every timeout attempt reports to Sentry, even those that will be retried. This is inconsistent with the `Exception` handler, which only captures errors when the task is not being retried.
markstory left a comment:

Looks good to me. Once this is merged and used in sentry, it would be good to remove the deadline exception workarounds we have in sentry.


Updates the python lib code with the following changes:

- Updates `retry.py` to include `BaseException` types (like `ProcessingDeadlineExceeded`) in retry allow lists.
- Adds retry handling to the `ProcessingDeadlineExceeded` handling branch.
- Adds a `report_timeout_errors` check, which allows us to opt out of Sentry reporting on a per-task definition basis.

This will allow developers to optionally specify retries on task timeouts.

Resolves ISWF-2447