@aeroyorch
Contributor

Summary

This PR is a first proof of concept for TaskGroup retries (related to #21867). The goal is to get an end‑to‑end implementation in place so we can iterate on behavior and UX in follow‑ups.

Key Changes

  • Add TaskGroup retry configuration in the SDK (retries, retry_delay, retry_exponential_backoff, max_retry_delay, retry_condition, retry_fast_fail).
  • Persist retry state per TaskGroup per DagRun via new task_group_instance model and migration.
  • Implement scheduler logic to evaluate TaskGroup retry conditions, clear group tasks for another attempt, and enforce retry delay.
  • Add a TaskGroup retry dependency to block scheduling while a group is waiting for its retry delay.
  • Add unit/integration tests for retry behavior, delay/backoff, and fast‑fail sibling cancellation.
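
To make the delay and backoff options concrete, here is a minimal standalone sketch of how a group retry delay could be derived from `retry_delay`, `retry_exponential_backoff`, and `max_retry_delay`, and how a "still waiting" check might gate scheduling. It is not the code in this PR: the doubling policy, the exponent cap, and the helper names `group_retry_delay` and `is_waiting_for_retry` are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def group_retry_delay(
    attempt: int,
    retry_delay: timedelta,
    retry_exponential_backoff: bool = False,
    max_retry_delay: Optional[timedelta] = None,
) -> timedelta:
    """Return the delay to wait before group retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    delay = retry_delay
    if retry_exponential_backoff:
        # Assumed policy: double the base delay for each previous attempt.
        # The exponent is capped (assumption) so timedelta cannot overflow.
        delay = retry_delay * (2 ** min(attempt - 1, 16))
    if max_retry_delay is not None:
        # Never wait longer than the configured ceiling.
        delay = min(delay, max_retry_delay)
    return delay


def is_waiting_for_retry(next_retry_at: Optional[datetime], now: datetime) -> bool:
    """True while the group's retry delay has not elapsed yet."""
    return next_retry_at is not None and now < next_retry_at


if __name__ == "__main__":
    base = timedelta(seconds=30)
    cap = timedelta(minutes=5)
    for attempt in range(1, 6):
        print(attempt, group_retry_delay(attempt, base, True, cap))

    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(is_waiting_for_retry(now + timedelta(seconds=10), now))
```

In this sketch, a dependency check would keep the group's tasks unscheduled for as long as `is_waiting_for_retry` returns true.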

Retry Group Options

  • retry_condition: Controls when a TaskGroup is considered failed and should retry. Built‑ins: any_failed (default), all_failed. Can be a callable for custom logic, receiving task instances and optional context (task_group, task_group_id, ti).
  • retry_fast_fail: Controls how quickly remaining group tasks are stopped once the retry condition is met.
    • False (default): let remaining tasks finish naturally.
    • True: running tasks are forced to fail, queued/scheduled tasks are skipped (teardown tasks are respected), enabling faster retry loops.
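
The semantics of the two options can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the scheduler code in this PR: `FakeTI` is a hypothetical stand-in for a task instance, states are plain strings, and passing the optional context as keyword arguments to a custom callable is an assumption about the calling convention.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class FakeTI:
    """Hypothetical stand-in for a task instance (not Airflow's TaskInstance)."""
    task_id: str
    state: str
    is_teardown: bool = False


def any_failed(tis: List[FakeTI]) -> bool:
    # Retry as soon as at least one task in the group has failed.
    return any(ti.state == "failed" for ti in tis)


def all_failed(tis: List[FakeTI]) -> bool:
    # Retry only when every task in a non-empty group has failed.
    return bool(tis) and all(ti.state == "failed" for ti in tis)


BUILTIN_CONDITIONS = {"any_failed": any_failed, "all_failed": all_failed}


def should_retry(condition: Union[str, Callable], tis: List[FakeTI], **context) -> bool:
    """Evaluate a built-in condition by name, or call a custom callable."""
    if callable(condition):
        # Assumption: optional context arrives as keyword arguments.
        return bool(condition(tis, **context))
    return BUILTIN_CONDITIONS[condition](tis)


def retry_if_loader_failed(tis: List[FakeTI], **context) -> bool:
    # Example custom condition: only a failed "load_*" task triggers a retry.
    return any(ti.task_id.startswith("load_") and ti.state == "failed" for ti in tis)


def apply_fast_fail(tis: List[FakeTI]) -> None:
    """Model retry_fast_fail=True: stop remaining work, leave teardowns alone."""
    for ti in tis:
        if ti.is_teardown:
            continue  # teardown tasks are respected
        if ti.state == "running":
            ti.state = "failed"
        elif ti.state in ("queued", "scheduled"):
            ti.state = "skipped"


if __name__ == "__main__":
    group = [
        FakeTI("extract", "success"),
        FakeTI("load_db", "failed"),
        FakeTI("transform", "running"),
        FakeTI("report", "queued"),
        FakeTI("cleanup", "scheduled", is_teardown=True),
    ]
    if should_retry("any_failed", group, task_group_id="etl"):
        apply_fast_fail(group)
    print([(ti.task_id, ti.state) for ti in group])
```

With `retry_fast_fail=False`, the `apply_fast_fail` step would simply be skipped and the remaining tasks would run to completion before the group is cleared for its next attempt.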

Design Notes

I did not add support for restarting only failing tasks (a retry_strategy or similar). That behavior is already covered by TaskInstance retries, so partial TaskGroup retries did not add meaningful value in this initial implementation.

UI Note

No UI changes were introduced in this PR. A potential follow‑up could be adding a display‑only state like “Up for Group Retry” to indicate tasks waiting on a TaskGroup retry delay.

No Partial (Selective) TaskGroup Retries (for now)

This implementation does not support restarting only the failing tasks within a TaskGroup.

My reasoning is that selective retries are already covered by TaskInstance retries, and introducing partial group retries at this stage would add semantic overlap and scheduler complexity without a clearly distinct use case.

That said, this is not a hard constraint of the design. If there are compelling scenarios where partial TaskGroup retries provide meaningful value beyond TaskInstance retries, we can revisit the model.
