
ref(eco): Adds report_timeout_errors options to Task definitions, allows ProcessingDeadlineExceeded to be retried#592

Merged
GabeVillalobos merged 3 commits into main from gv/add_processing_deadline_options_to_retry
Apr 20, 2026

Conversation

GabeVillalobos (Member) commented Apr 13, 2026

Updates the python lib code with the following changes:

  1. Broadens allowable retry exceptions in retry.py to include BaseException types (like ProcessingDeadlineExceeded)
  2. Updates workerchild to check for retries on the ProcessingDeadlineExceeded handling branch.
  3. Adds a new report_timeout_errors option that allows opting out of Sentry reporting on a per-task-definition basis.

This will allow developers to optionally specify retries on task timeouts.

Resolves ISWF-2447
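Taken together, the changes might be exercised as follows. This is a minimal, self-contained sketch: `Retry` and `RetryState` here are simplified stubs rather than the real taskbroker_client classes, and only the names `on`, `should_retry`, and `ProcessingDeadlineExceeded` come from the PR itself.

```python
# Minimal sketch of the behavior this PR enables. Retry/RetryState are
# simplified stubs, NOT the real taskbroker_client classes; only the names
# `on`, `should_retry`, and `ProcessingDeadlineExceeded` come from the PR.

class ProcessingDeadlineExceeded(BaseException):
    """Raised when a task exceeds its processing deadline."""

class RetryState:
    def __init__(self, attempts: int) -> None:
        self.attempts = attempts

class Retry:
    def __init__(self, times: int, on: tuple[type[BaseException], ...] = ()) -> None:
        self._times = times
        self._allowed_exception_types = on

    def max_attempts_reached(self, state: RetryState) -> bool:
        # attempts starts at 1 on first delivery in the real code
        return state.attempts >= (self._times - 1)

    def should_retry(self, state: RetryState, exc: BaseException) -> bool:
        # Broadened signature: accepts BaseException, so the deadline
        # signal can participate in the allow-list check.
        if self.max_attempts_reached(state):
            return False
        return isinstance(exc, self._allowed_exception_types)

retry = Retry(times=3, on=(ProcessingDeadlineExceeded,))
print(retry.should_retry(RetryState(attempts=0), ProcessingDeadlineExceeded()))  # True
print(retry.should_retry(RetryState(attempts=2), ProcessingDeadlineExceeded()))  # False
```

A task that wants timeout retries would then opt in by listing the exception in its `on` tuple, while leaving other tasks unaffected.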

Comment thread on clients/python/src/taskbroker_client/retry.py
GabeVillalobos force-pushed the gv/add_processing_deadline_options_to_retry branch from 69d0adc to b7f98fc on April 13, 2026 at 22:08
ignore: tuple[type[BaseException], ...] | None = None,
times_exceeded: LastAction = LastAction.Discard,
delay: int | None = None,
retry_on_timeout: bool = True,
Member

Given that we can already specify ProcessingDeadlineExceeded in on, and with your change it can work, retry_on_timeout seems not entirely orthogonal here.

I'm interested in why we use type[BaseException] rather than type[Exception], what with should_retry (at least prior to this) only allowing Exception. I can see an argument for it, but not the motivating case.

If we try to make retry_on_timeout stand alone, we potentially don't need to involve should_retry at all or do that localized import; we just check the bool. This seems oddly consistent with how task timeouts aren't the same thing as exceptions that escape; it's framework functionality that may or may not be an exception.
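The standalone-flag alternative described above could look something like this. All names here are illustrative, not code from the PR:

```python
# Hypothetical sketch of a standalone retry_on_timeout flag: the worker
# checks the bool directly on the deadline branch, without involving
# should_retry or a localized import. Names are illustrative only.

def next_state_on_deadline(retry_on_timeout: bool, attempts: int, max_attempts: int) -> str:
    # Deadline handling stays framework-level: no exception allow list,
    # just the flag plus a remaining-attempts check.
    if retry_on_timeout and attempts < max_attempts:
        return "retry"
    return "failure"

print(next_state_on_deadline(True, 0, 3))   # retry
print(next_state_on_deadline(False, 0, 3))  # failure
```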

GabeVillalobos (Member Author) commented Apr 13, 2026

> given that we can specify ProcessingDeadlineExceeded in on already and with your change it can work, retry_on_timeout seems not entirely orthogonal here.

I agree, I think we could fold this in, but it seemed distinct enough at first with the special handling this case currently has. Perhaps there's enough here to roll the exception handling in worker_child together with the generic exception handling to simplify all of this though.

> I'm interested in why we use type[BaseException] rather than type[Exception], what with should_retry (at least prior to this) only allowing Exception. I can see an argument for it, but not the motivating case.

I am as well.

I'll let the taskbroker team weigh in on this since they're more likely to have a strong opinion on this.

Member

> I'm interested in why we use type[BaseException] rather than type[Exception], what with should_retry (at least prior to this) only allowing Exception. I can see an argument for it, but not the motivating case.

I don't have a strong reason to keep BaseException. We could easily go to Exception and be compatible with application usage.

> given that we can specify ProcessingDeadlineExceeded in on already and with your change it can work,

Including ProcessingDeadlineExceeded in an on clause works for me. It exposes less API surface area and makes it more explicit which kinds of timeouts are captured, since the retry_on_timeout parameter doesn't convey which kinds of timeouts it covers.

Member Author

I ended up just broadening the allowable exception types to include BaseException in should_retry. To me, it makes sense for ProcessingDeadlineExceeded to continue to be categorized as a BaseException type, since it's not strictly an error but a system-level timer instead.
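One concrete consequence of keeping it a BaseException subclass, sketched with a stub class: a blanket `except Exception` inside task code cannot swallow the deadline signal, so the framework still observes the timeout.

```python
# Sketch of why BaseException matters here. The class below is a stub,
# not the real taskbroker_client exception.

class ProcessingDeadlineExceeded(BaseException):
    pass

def task_body() -> str:
    try:
        raise ProcessingDeadlineExceeded("deadline elapsed mid-task")
    except Exception:
        # Application-level catch-all: never fires for a BaseException
        # subclass, so the deadline signal escapes the task body.
        return "swallowed"

try:
    task_body()
    outcome = "swallowed"
except ProcessingDeadlineExceeded:
    outcome = "propagated to the worker"

print(outcome)  # propagated to the worker
```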

)
next_state = TASK_ACTIVATION_STATUS_FAILURE
retry = task_func.retry
if retry and retry.should_retry(inflight.activation.retry_state, err):
Member

Would you have tasks configured to do retries on ProcessingDeadlineExceeded?

GabeVillalobos (Member Author) commented Apr 14, 2026

The use case for us would be: we try a webhook dispatch, it fails with a timeout, so we retry x number of times total before bailing on the webhook attempt entirely. If there's another way to achieve this without hooking into the ProcessingDeadlineExceeded except block here, I'm game for that.

Member

You could have the task processing deadline be longer than the cumulative timeouts of your HTTP request (and retries). That would let you do some retries within the task, handle those failures, and request a full task retry without having to catch processing deadlines.
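The suggested pattern could be sketched like this. `dispatch_webhook` and `RetryError` are illustrative stand-ins, not taskbroker_client API:

```python
# Sketch of the suggested pattern: keep HTTP retries inside the task and
# budget their timeouts below the processing deadline. `dispatch_webhook`
# and `RetryError` are illustrative stand-ins, not real API.

class RetryError(Exception):
    """Raised to ask the framework for a full task retry."""

def dispatch_webhook(timeout: float) -> None:
    # Stand-in for the real HTTP call; always times out in this sketch.
    raise TimeoutError(f"no response within {timeout}s")

def send_with_in_task_retries(attempts: int = 3, per_request_timeout: float = 5.0) -> None:
    # With e.g. a 30s processing deadline, 3 attempts * 5s of request time
    # fits comfortably, so the deadline signal never needs to be caught.
    for attempt in range(attempts):
        try:
            dispatch_webhook(timeout=per_request_timeout)
            return
        except TimeoutError:
            if attempt == attempts - 1:
                raise RetryError("webhook timed out on every attempt") from None
```

The trade-off is that the per-request timeout must be enforceable, which is exactly what breaks down in the DNS case discussed below.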

Member Author

Discussed elsewhere, but the issue we're hitting is DNS resolution in certain cases (ngrok is typically the culprit here), which we can't meaningfully apply a timeout to. This was a way for us to allow retries without reimplementing the same logic we already have here.


if isinstance(exc, self._allowed_exception_types):
return True

from taskbroker_client.worker.workerchild import ProcessingDeadlineExceeded
Member

We could move ProcessingDeadlineExceeded to taskbroker_client.types to break the import cycle.


# In the retry allow list or processing deadline is exceeded
# When processing deadline is exceeded, the subprocess raises a TimeoutError
if isinstance(exc, (TimeoutError, self._allowed_exception_types)):
Member

I don't think this is a safe change. If tasks did have TimeoutError being raised it would previously automatically retry and now the task will not retry.

Member Author

Where is TimeoutError being raised in the code? I was a bit confused about where this is actually going to be an issue, since I only saw the ProcessingDeadlineExceeded exception being raised here:

with timeout_alarm(inflight.activation.processing_deadline_duration, handle_alarm):

GabeVillalobos (Member Author) commented Apr 20, 2026

Restored this in case it breaks anything. I do think there's a case to deprecate this behavior; I just don't want to regress anything in the meantime.

@GabeVillalobos GabeVillalobos marked this pull request as ready for review April 20, 2026 18:38
@GabeVillalobos GabeVillalobos requested a review from a team as a code owner April 20, 2026 18:38
Comment on lines 81 to 87
# starts at 1.
return state.attempts >= (self._times - 1)

-    def should_retry(self, state: RetryState, exc: Exception) -> bool:
+    def should_retry(self, state: RetryState, exc: BaseException) -> bool:
# If there are no retries remaining we should not retry
if self.max_attempts_reached(state):
return False

Bug: The retry logic does not implicitly handle ProcessingDeadlineExceeded as intended. It requires an explicit on=(ProcessingDeadlineExceeded,) configuration, causing tasks to fail instead of retrying on timeout.
Severity: HIGH

Suggested Fix

Update the should_retry method in retry.py to also check for isinstance(exc, ProcessingDeadlineExceeded). This will make the exception implicitly retriable, aligning the code's behavior with the pull request's description. Also, update the stale comment that incorrectly refers to TimeoutError being raised.


Member

I think that's what we want now.

cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 91c9dff.

if retry and retry.should_retry(inflight.activation.retry_state, err):
next_state = TASK_ACTIVATION_STATUS_RETRY
else:
next_state = TASK_ACTIVATION_STATUS_FAILURE

Sentry error captured before retry check causes duplicate reports

Medium Severity

In the ProcessingDeadlineExceeded handler, sentry_sdk.capture_exception fires before should_retry is checked. When report_timeout_errors is True (the default) and retries are configured, every timeout attempt reports to Sentry — even those that will be retried. This is inconsistent with the Exception handler, which only captures errors when the task is not being retried.
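The ordering Bugbot describes can be sketched as: decide the retry outcome first, and report to Sentry only for terminal failures. `capture_exception` below is a stub standing in for sentry_sdk.capture_exception, and `handle_deadline` is a hypothetical helper, not the real handler.

```python
# Sketch of the suggested ordering: check should_retry before reporting,
# so retried timeout attempts are not double-reported to Sentry.
# `capture_exception` stands in for sentry_sdk.capture_exception.

class ProcessingDeadlineExceeded(BaseException):
    pass

captured = []

def capture_exception(exc: BaseException) -> None:
    captured.append(exc)

def handle_deadline(exc: BaseException, will_retry: bool, report_timeout_errors: bool) -> str:
    if will_retry:
        return "retry"  # retried attempts are never reported
    if report_timeout_errors:
        capture_exception(exc)  # report only the terminal failure
    return "failure"
```

This mirrors the Exception handler's behavior, where capture happens only on the non-retry path.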


markstory (Member) left a comment

Looks good to me. Once this is merged and used in sentry, it would be good to remove the deadline exception workarounds we have in sentry.

@GabeVillalobos GabeVillalobos merged commit 5c504e5 into main Apr 20, 2026
32 of 34 checks passed
@GabeVillalobos GabeVillalobos deleted the gv/add_processing_deadline_options_to_retry branch April 20, 2026 20:27
linear-code (Bot) commented Apr 20, 2026

GabeVillalobos changed the title from "ref(eco): Adds retry_on_timeout, report_timeout_errors options to Retry and Task definitions respectively" to "ref(eco): Adds report_timeout_errors options to Task definitions, allows ProcessingDeadlineExceeded to be retried" on Apr 20, 2026