Summary
This PR is a first proof of concept for TaskGroup retries (related to: #21867), intended to get an end-to-end implementation in place so we can iterate on behavior and UX in follow-ups.

Key Changes
- TaskGroup retry configuration in the SDK (retries, retry_delay, retry_exponential_backoff, max_retry_delay, retry_condition, retry_fast_fail).
- TaskGroup per DagRun via new task_group_instance model and migration.
- TaskGroup retry conditions, clear group tasks for another attempt, and enforce retry delay.
- TaskGroup retry dependency to block scheduling while a group is waiting for its retry delay.

Retry Group Options
- retry_condition: Controls when a TaskGroup is considered failed and should retry. Built-ins: any_failed (default), all_failed. Can be a callable for custom logic, receiving task instances and optional context (task_group, task_group_id, ti).
- retry_fast_fail: Controls how quickly remaining group tasks are stopped once the retry condition is met.
  - False (default): let remaining tasks finish naturally.
  - True: running tasks are forced to fail, queued/scheduled tasks are skipped (teardown tasks are respected), enabling faster retry loops.

Design Notes
I did not add support for restarting only failing tasks (a retry_strategy or similar). That behavior is already covered by TaskInstance retries, so partial TaskGroup retries did not add meaningful value in this initial implementation.

UI Note
No UI changes were introduced in this PR. A potential follow-up could be adding a display-only state like “Up for Group Retry” to indicate tasks waiting on a TaskGroup retry delay.

No Partial (Selective) TaskGroup Retries (for now)
This implementation does not support restarting only the failing tasks within a TaskGroup.

My reasoning is that selective retries are already covered by TaskInstance retries, and introducing partial group retries at this stage would add semantic overlap and scheduler complexity without a clearly distinct use case.

That said, this is not a hard constraint of the design. If there are compelling scenarios where partial TaskGroup retries provide meaningful value beyond TaskInstance retries, we can revisit the model.
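
For reviewers who want a feel for the semantics of the options listed under Retry Group Options, here is a rough standalone sketch. It is not the code in this PR and does not import Airflow. The helper names should_retry and next_retry_delay are hypothetical, task instances are reduced to plain state strings (so the callable here receives only those states, not the task instances and context described above), and the doubling backoff formula is an assumption made for illustration only.

```python
from datetime import timedelta
from typing import Callable, Optional, Sequence, Union

# Hypothetical stand-in: each task instance is represented by its state string.
RetryCondition = Union[str, Callable[[Sequence[str]], bool]]


def should_retry(states: Sequence[str], retry_condition: RetryCondition = "any_failed") -> bool:
    """Decide whether a group with the given task states counts as failed."""
    if callable(retry_condition):
        # Custom logic: the callable decides based on the observed states.
        return bool(retry_condition(states))
    failed = [state == "failed" for state in states]
    if retry_condition == "any_failed":
        return any(failed)
    if retry_condition == "all_failed":
        # An empty group is never treated as failed.
        return bool(failed) and all(failed)
    raise ValueError(f"unknown retry_condition: {retry_condition!r}")


def next_retry_delay(
    attempt: int,
    retry_delay: timedelta,
    retry_exponential_backoff: bool = False,
    max_retry_delay: Optional[timedelta] = None,
) -> timedelta:
    """Return the wait before retry number `attempt` (1-based).

    Assumption: backoff doubles the delay on each attempt; the real
    implementation may use a different formula.
    """
    delay = retry_delay
    if retry_exponential_backoff:
        delay = retry_delay * (2 ** (attempt - 1))
    if max_retry_delay is not None:
        delay = min(delay, max_retry_delay)
    return delay
```

With these helpers, should_retry(["success", "failed"]) is true under the default any_failed but false under all_failed, and a 30-second retry_delay with backoff enabled and a one-minute max_retry_delay is capped at one minute from the second attempt onward.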