PoC of Task Group retries #61809

aeroyorch · 2026-02-12T09:52:27Z

Summary

This PR is a first proof of concept for TaskGroup retries (related to: #21867), intended to get an end‑to‑end implementation in place so we can iterate on behavior and UX in follow‑ups.

Key Changes

- Add TaskGroup retry configuration in the SDK (retries, retry_delay, retry_exponential_backoff, max_retry_delay, retry_condition, retry_fast_fail).
- Persist retry state per TaskGroup per DagRun via a new task_group_instance model and its migration.
- Implement scheduler logic to evaluate TaskGroup retry conditions, clear the group's tasks for another attempt, and enforce the retry delay.
- Add a TaskGroup retry dependency that blocks scheduling while a group is waiting out its retry delay.
- Add unit/integration tests for retry behavior, delay/backoff, and fast‑fail sibling cancellation.
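
To make the delay/backoff options concrete, here is a minimal standalone sketch of how a group-level retry delay could be derived from them. It does not import Airflow and is not the code in this PR: `GroupRetryConfig`, `can_retry`, and `next_retry_delay` are hypothetical names, and the plain doubling formula is an assumption (the actual implementation may add jitter or compute the delay differently).

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class GroupRetryConfig:
    """Hypothetical container that mirrors the option names listed above."""

    retries: int = 0
    retry_delay: timedelta = timedelta(minutes=5)
    retry_exponential_backoff: bool = False
    max_retry_delay: Optional[timedelta] = None


def can_retry(config: GroupRetryConfig, attempts_used: int) -> bool:
    """Return True while the group still has retry attempts left."""
    return attempts_used < config.retries


def next_retry_delay(config: GroupRetryConfig, attempt: int) -> timedelta:
    """Delay before retry number `attempt` (1-based).

    Assumption: backoff doubles the base delay on each attempt and is
    capped at max_retry_delay when that option is set.
    """
    delay = config.retry_delay
    if config.retry_exponential_backoff:
        delay = config.retry_delay * (2 ** (attempt - 1))
    if config.max_retry_delay is not None:
        delay = min(delay, config.max_retry_delay)
    return delay


config = GroupRetryConfig(
    retries=3,
    retry_delay=timedelta(seconds=30),
    retry_exponential_backoff=True,
    max_retry_delay=timedelta(seconds=90),
)
delays = [next_retry_delay(config, attempt) for attempt in (1, 2, 3)]
```

With these example values the uncapped third delay would be 120 seconds, so the cap is what limits it to `max_retry_delay`.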

Retry Group Options

- retry_condition: Controls when a TaskGroup is considered failed and should retry. Built‑ins: any_failed (default), all_failed. It can also be a callable for custom logic, which receives the task instances and an optional context (task_group, task_group_id, ti).
- retry_fast_fail: Controls how quickly the remaining group tasks are stopped once the retry condition is met.
  - False (default): let remaining tasks finish naturally.
  - True: running tasks are forced to fail and queued/scheduled tasks are skipped (teardown tasks are respected), enabling faster retry loops.
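
The semantics of the two options above can be sketched as follows. This is an illustration only, not the PR's implementation: it works on plain state strings rather than task instances, `should_retry` and `fast_fail_state` are hypothetical helpers, treating only the `failed` state as a failure is an assumption, and "teardown tasks are respected" is read here as "teardown tasks are left untouched".

```python
from typing import Callable, Iterable, List, Union

Condition = Union[str, Callable[[List[str]], bool]]


def any_failed(states: List[str]) -> bool:
    """Built-in default: retry as soon as at least one task has failed."""
    return any(state == "failed" for state in states)


def all_failed(states: List[str]) -> bool:
    """Retry only when every task in the group has failed."""
    return bool(states) and all(state == "failed" for state in states)


BUILT_INS = {"any_failed": any_failed, "all_failed": all_failed}


def should_retry(condition: Condition, states: Iterable[str]) -> bool:
    """Resolve a built-in name or use a custom callable as given."""
    check = BUILT_INS[condition] if isinstance(condition, str) else condition
    return bool(check(list(states)))


def fast_fail_state(state: str, is_teardown: bool = False) -> str:
    """State a sibling task would move to when retry_fast_fail is True.

    Assumption: teardown tasks keep their current state so they can still run.
    """
    if is_teardown:
        return state
    if state == "running":
        return "failed"
    if state in ("queued", "scheduled"):
        return "skipped"
    return state


def at_least_two_failed(states: List[str]) -> bool:
    """Example of custom logic: tolerate a single failure in the group."""
    return states.count("failed") >= 2


group_states = ["success", "failed", "running", "queued"]
retry_any = should_retry("any_failed", group_states)
retry_all = should_retry("all_failed", group_states)
retry_custom = should_retry(at_least_two_failed, group_states)
after_fast_fail = [fast_fail_state(state) for state in group_states]
```

In this example the group has one failed task, so `any_failed` triggers a retry while `all_failed` and the custom condition do not. With fast fail enabled, the running task is failed and the queued task is skipped, which lets the next attempt start without waiting for them.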

Design Notes

I did not add support for restarting only failing tasks (a retry_strategy or similar). That behavior is already covered by TaskInstance retries, so partial TaskGroup retries did not add meaningful value in this initial implementation.

UI Note

No UI changes were introduced in this PR. A potential follow‑up could be adding a display‑only state like “Up for Group Retry” to indicate tasks waiting on a TaskGroup retry delay.

No Partial (Selective) TaskGroup Retries (for now)

This implementation does not support restarting only the failing tasks within a TaskGroup.

My reasoning is that selective retries are already covered by TaskInstance retries, and introducing partial group retries at this stage would add semantic overlap and scheduler complexity without a clearly distinct use case.

That said, this is not a hard constraint of the design. If there are compelling scenarios where partial TaskGroup retries provide meaningful value beyond TaskInstance retries, we can revisit the model.

aeroyorch requested review from XD-DENG, amoghrajesh, ashb, bolkedebruin, ephraimbuddy and kaxil as code owners February 12, 2026 09:52

boring-cyborg bot added the area:DAG-processing, area:db-migrations, area:task-sdk, and kind:documentation labels Feb 12, 2026

aeroyorch marked this pull request as draft February 12, 2026 09:53

aeroyorch force-pushed the task-group-retries branch 6 times, most recently from c1f029a to 7b1e8a5 Compare February 12, 2026 18:06

PoC of Task Group retries

3d83a8f

aeroyorch force-pushed the task-group-retries branch from 7b1e8a5 to 3d83a8f Compare February 12, 2026 20:07
