
@kaxil commented Oct 16, 2025

When a task fails and fail_fast is enabled, the API server needs to stop the remaining tasks. Previously, checking the fail_fast setting required loading the entire 5-50 MB SerializedDAG on every task failure. The SerializedDAG does come from a cache, but when multiple API-server replicas are running, a given replica is likely not to have it in its local cache.

This change adds a fail_fast column to the dag table and checks it with a simple database lookup first. The SerializedDAG is loaded only when fail_fast=True (affecting ~1% of DAGs), avoiding unnecessary memory and I/O overhead in the remaining ~99% of cases.
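The lookup-first pattern described above can be sketched roughly as follows. This is a minimal illustration using an in-memory SQLite database, not the code from this PR: the table layout, the `make_demo_db`, `load_serialized_dag`, and `dag_to_stop_on_failure` names, and the sample DAG IDs are all hypothetical stand-ins.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the dag and serialized_dag tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dag (dag_id TEXT PRIMARY KEY, fail_fast INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")
    conn.executemany(
        "INSERT INTO dag VALUES (?, ?)",
        [("etl_daily", 0), ("strict_pipeline", 1)],
    )
    conn.executemany(
        "INSERT INTO serialized_dag VALUES (?, ?)",
        [("etl_daily", "<large blob>"), ("strict_pipeline", "<large blob>")],
    )
    return conn


def load_serialized_dag(conn, dag_id):
    """Stand-in for the expensive load of a multi-megabyte serialized DAG."""
    row = conn.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return row[0] if row else None


def dag_to_stop_on_failure(conn, dag_id, loader=load_serialized_dag):
    """Return the serialized DAG only if fail_fast is set; otherwise return None.

    The cheap single-column lookup runs first, so the expensive loader is
    skipped for every DAG that does not use fail_fast.
    """
    row = conn.execute(
        "SELECT fail_fast FROM dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    if row is None or not row[0]:
        return None  # common case: nothing to stop, nothing to load
    return loader(conn, dag_id)


if __name__ == "__main__":
    conn = make_demo_db()
    print(dag_to_stop_on_failure(conn, "etl_daily"))
    print(dag_to_stop_on_failure(conn, "strict_pipeline"))
```

The `loader` parameter exists only so the sketch can be exercised with a counting stub; the point is that the expensive call sits behind the boolean check rather than in front of it.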




@potiuk left a comment


Very good optimisation!


@amoghrajesh left a comment


Nice optimisation. I had given it some thought but didn't do it!

@kaxil kaxil merged commit cafa765 into apache:main Oct 16, 2025
117 checks passed
@kaxil kaxil deleted the add-fail-fast branch October 16, 2025 10:38
snreddygopu pushed a commit to Teradata/airflow that referenced this pull request Oct 16, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 17, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 19, 2025
TyrellHaywood pushed a commit to TyrellHaywood/airflow that referenced this pull request Oct 22, 2025