Add task-level circuit breaker to pause failing tasks#67724
Conversation
Operators declare a failure budget: max failures within a rolling time window. When the threshold is hit the circuit opens and future scheduled instances are skipped until reset via the REST API or after a configurable cooldown period. New operator params (all optional, backward-compatible): circuit_breaker_max_failures: int | None circuit_breaker_window: timedelta | None circuit_breaker_reset_delay: timedelta | None New TaskCircuitBreaker model tracks one row per (dag_id, task_id). Alembic migration adds the table. Execution API records failures. Scheduler excludes open-circuit tasks before the QUEUED transition. Heartbeat auto-resets expired circuits every 60s. Three REST endpoints expose and reset circuit state per task and per Dag.
|
Found a few blockers in the current diff:
Drafted-by: Codex (GPT-5); reviewed by @Vamsi-klu before posting |
4e9fc08 to
0e6e67a
Compare
|
Hi @Vamsi-klu thank you for your comment, I've looked into your suggestions and put them into place |
- Rename migration to 0118 and chain it after 8812eb67b63c to resolve the duplicate Alembic head (was both 0117s revising acc215baed80) - Add row-level locking (with_for_update) and IntegrityError handling to record_failure() to prevent lost increments under concurrent writes - Batch reset_expired() and _skip_circuit_breaker_blocked_tis() at 100 rows per flush to avoid unbounded single-transaction scans - Move session to keyword-only position on all new @provide_session methods (reset_expired, _reset_expired_circuit_breakers, _skip_circuit_breaker_blocked_tis)
0e6e67a to
6cc2174
Compare
potiuk
left a comment
There was a problem hiding this comment.
Thanks for putting real work into this — automatically pausing chronically-failing tasks is a genuinely interesting idea. But the size and nature of the change mean it should start as a design discussion, not a code review, and I'd like to steer it there rather than iterate on the implementation.
This is a cross-cutting behavioral change to core: a new scheduler state transition (open circuits auto-SKIP SCHEDULED TIs, which propagates downstream via trigger rules), a new metadata table + migration, a new Execution-API path, and three new public REST endpoints. Changes that alter the scheduler state machine and add public API surface go through an AIP (Airflow Improvement Proposal) / dev-list discussion first, so the community can agree on the model — e.g. whether auto-SKIP is the right terminal transition, how it's opted into, and whether the failure budget belongs on the task — before anyone reviews ~1700 lines of implementation.
A few things also block the current code regardless of the design outcome:
- The Alembic migration collides with one already merged to
main(0118_filename + a second head off8812eb67b63c) — this is the root cause of the red CI. - The migration uses naive
sa.DateTime()while the model usesUtcDateTime; useTIMESTAMP(timezone=True). - The Execution-API failure path deserializes the full Dag per task failure to read three scalars — that config should ride on the TI/payload instead.
- The auto-
SKIPhas no opt-out/config and silently skips downstream subgraphs.
Suggested next step: open an AIP (or start a thread on dev@airflow.apache.org) describing the problem and proposed model, link it here, and we'll take the design discussion there. I'm converting this to draft in the meantime so it's clear it's pending that discussion rather than waiting on a code review — flip it back to ready once there's consensus.
AIP process: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
This review was drafted by an AI-assisted tool and confirmed by an Apache Airflow maintainer. After you've addressed the points above and pushed an update, an Apache Airflow maintainer — a real person — will take the next look. The findings cite the project's review criteria; if you think one is mis-applied, please reply on the PR and a maintainer will weigh in.
More on how Apache Airflow handles maintainer review: contributing-docs/05_pull_requests.rst.
Drafted-by: Claude Code (Opus 4.8); reviewed by @potiuk before posting
|
Hi @potiuk thank you very much for your comment and attention to this PR. I've followed your advice and opened an AIP here: https://cwiki.apache.org/confluence/display/AIRFLOW/Add+task-level+circuit+breaker+to+pause+failing+tasks I will also get to fixing the blockers you mentioned shortly |
Operators declare a failure budget: max failures within a rolling time window. When the threshold is hit the circuit opens and future scheduled instances are skipped until reset via the REST API or after a configurable cooldown period.
New operator params (all optional, backward-compatible):
circuit_breaker_max_failures: int | None
circuit_breaker_window: timedelta | None
circuit_breaker_reset_delay: timedelta | None
New TaskCircuitBreaker model tracks one row per (dag_id, task_id). Alembic migration adds the table. Execution API records failures. Scheduler excludes open-circuit tasks before the QUEUED transition. Heartbeat auto-resets expired circuits every 60s. Three REST endpoints expose and reset circuit state per task and per Dag.
Was generative AI tooling used to co-author this PR?
Generated-by: [Claude Sonnet 4.6] following the guidelines