Fix float overflow for retry_exponential_backoff by alealandreev · Pull Request #47967 · apache/airflow

alealandreev · 2025-03-19T18:01:17Z

Fixed a float overflow in the exponential backoff calculation. I hit the error with retry_delay = 5 minutes, max_retry_delay = 1 hour, retry_exponential_backoff = True, and retries = 1000 in the DAG configuration. In this case the Scheduler breaks down at around the 1000th retry because of a float overflow (the delay is recalculated on every retry), and after 1000 retries the DAG is still trying to start. The total number of retries I observed was 1017, which is more than 1000. The cause is this formula at line 2657 of taskinstance.py: min_backoff = math.ceil(delay.total_seconds() * (2 ** (self.try_number - 1))).
We should limit the exponent to a reasonable value, for instance 30, and then guard against any remaining exceptions. This fix repairs the exponential backoff logic so that the float overflow can no longer happen.


…ty (#41382) When using older FAB providers on the new airflow, this function is called in the old provider and is no longer available in the new airflow. This PR brings this back to fix issue in main and v2-10-test branch where all DAGs fail because of lack of this function (cherry picked from commit 0576f55)


* Attempt to fix TriggerDagRunOperator for Database Isolation Tests * Finalize making tests run for triggerdagrunoperator in db isolation mode * Adjust query count assert for adjustments to serialization * Review feedback (cherry picked from commit 6b810b8)

…41369) * Skip core tests from start to SkipMixin for Database Isolation Mode * Skip core tests from start to SkipMixin for Database Isolation Mode, uups * Skip core tests from start to SkipMixin for Database Isolation Mode, uups (cherry picked from commit b87f987)

…ests (#41370) Fixing remaining Variable tests for db isolation mode, also fixing secret backend haven't called from EnvironmentVariablesBackend, Metastore and custom ones. This caused side effect to move the Variable.get() method to internal API (cherry picked from commit c98d1a1)


…iases are resolved into new datasets (#41398) * fix(datasets/manager): fix DagPriorityParsingRequest unique constraint error when dataset aliases are resolved into new datasets this happens when dynamic task mapping is used * refactor(dataset/manager): reword debug log Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com> * refactor(dataset/manager): remove unnecessary logging Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com> --------- Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com> (cherry picked from commit bf64cb6)

) The PROD image building fails currently in non-main because it attempts to build source provider packages rather than use them from PyPi when PR is run against "v-test" branch. This PR fixes it: * PROD images in non-main-targetted build will pull providers from PyPI rather than build them * they use PyPI constraints to install the providers * they use UV - which should speed up building of the images (cherry picked from commit 4d5f1c4) (cherry picked from commit bf0d412)

…41610) * Enable pull requests to be run from v*test branches (#41474) (#41476) Since we switch from direct push of cherry-picking to open PRs against v*test branch, we should enable PRs to run for the target branch. (cherry picked from commit a9363e6) * Prevent provider lowest-dependency tests to run in non-main branch (#41478) (#41481) When running tests in v2-10-test branch, lowest depenency tests are run for providers - because when calculating separate tests, the "skip_provider_tests" has not been used to filter them out. This PR fixes it. (cherry picked from commit 75da507) * Make PROD image building works in non-main PRs (#41480) (#41484) The PROD image building fails currently in non-main because it attempts to build source provider packages rather than use them from PyPi when PR is run against "v-test" branch. This PR fixes it: * PROD images in non-main-targetted build will pull providers from PyPI rather than build them * they use PyPI constraints to install the providers * they use UV - which should speed up building of the images (cherry picked from commit 4d5f1c4) * Add WebEncoder for trigger page rendering to avoid render failure (#41350) (#41485) Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com> * Incorrect try number subtraction producing invalid span id for OTEL airflow (issue #41501) (#41502) (#41535) * Fix for issue #39336 * removed unnecessary import (cherry picked from commit dd3c3a7) Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com> * Fix failing pydantic v1 tests (#41534) (#41541) We need to exclude some versions of Pydantic v1 because it conflicts with aws provider. (cherry picked from commit a033c5f) * Fix Non-DB test calculation for main builds (#41499) (#41543) Pytest has a weird behaviour that it will not collect tests from parent folder when subfolder of it is specified after the parent folder. This caused some non-db tests from providers folder have been skipped during main build. 
The issue in Pytest 8.2 (used to work before) is tracked at pytest-dev/pytest#12605 (cherry picked from commit d489826) * Add changelog for airflow python client 2.10.0 (#41583) (#41584) * Add changelog for airflow python client 2.10.0 * Update client version (cherry picked from commit 317a28e) * Make all test pass in Database Isolation mode (#41567) This adds dedicated "DatabaseIsolation" test to airflow v2-10-test branch.. The DatabaseIsolation test will run all "db-tests" with enabled DB isolation mode and running `internal-api` component - groups of tests marked with "skip-if-database-isolation" will be skipped. * Upgrade build and chart dependencies (#41570) (#41588) (cherry picked from commit c88192c) Co-authored-by: Jarek Potiuk <jarek@potiuk.com> * Limit watchtower as depenendcy as 3.3.0 breaks moin. (#41612) (cherry picked from commit 1b602d5) * Enable running Pull Requests against v2-10-stable branch (#41624) (cherry picked from commit e306e7f) * Fix tests/models/test_variable.py for database isolation mode (#41414) * Fix tests/models/test_variable.py for database isolation mode * Review feedback (cherry picked from commit 736ebfe) * Make latest botocore tests green (#41626) The latest botocore tests are conflicting with a few requirements and until apache-beam upcoming version is released we need to do some manual exclusions. Those exclusions should make latest botocore test green again. 
(cherry picked from commit a13ccbb) * Simpler task retrieval for taskinstance test (#41389) The test has been updated for DB isolation but the retrieval of task was not intuitive and it could lead to flaky tests possibly (cherry picked from commit f25adf1) * Skip database isolation case for task mapping taskinstance tests (#41471) Related: #41067 (cherry picked from commit 7718bd7) * Skipping tests for db isolation because similar tests were skipped (#41450) (cherry picked from commit e94b508) --------- Co-authored-by: Jarek Potiuk <jarek@potiuk.com> Co-authored-by: Brent Bovenzi <brent@astronomer.io> Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com> Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com> Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com> Co-authored-by: Bugra Ozturk <bugraoz93@users.noreply.github.com>


* able to change the 'Changed Row' display message after edit * added message in connection form to warn of empty fields * attempt to warn the specific fields cannot be empty * revert change because need to check fields before save is clicked * issues warning for specific fields that can't be deleted after save * removed the individual warnings * changed status to concise string * added more concise suggestion --------- Co-authored-by: Lucy Hu <90779522+lh5844@users.noreply.github.com>


#43567) It's possible that the start/end date are null when processing an executor event, and there is no point in adding an OTEL event in that case. Before this, we'd try and convert `None` to nanoseconds and blow up the scheduler. Note: I don't think `queued_dttm` can be empty, but figured it didn't hurt to guard against it just in case I've overlooked a way it can be possible. (cherry picked from commit fe41e15) (cherry picked from commit c83e524)


#43572) * Mark all tasks as skipped when failing a dag_run manually including tasks with None state (#43482) (cherry picked from commit eda6a8f) * Fix tests for 2.10.x --------- Co-authored-by: Abhishek <abhishek.bhakat@hotmail.com> (cherry picked from commit 72eef0f)

…d from an extended operator (#42849) (#43577) * refactor: Don't raise a warning when execute is called from an extended operator, as this should always be allowed. * refactored: Fixed import of test_utils in test_dag_run --------- Co-authored-by: David Blain <david.blain@infrabel.be> (cherry picked from commit 95c46ec) Co-authored-by: David Blain <info@dabla.be> (cherry picked from commit 2f29c57)


pierrejeambrun

Thank you for the PR, the branch looks broken (wrong rebase or wrong target branch).

Do you mind cleaning the branch to keep only the relevant change so we can proceed with the review?

pierrejeambrun

The branch still appears to be in a non-reviewable state.

Please clean the branch before asking for another review.

potiuk

Please remove all unrelated changes before doing anything here. There are plenty of commits in this PR.

ephraimbuddy and others added 30 commits August 9, 2024 07:24

Update default branches for 2-10

ac5c3e0

Typo fix dataset guide (#41353)

ac40b27

regenerate command hashes

1002266

Fix Gantt Task Tries (#41342)

f2a15d4

(cherry picked from commit cb80bda)

Revert "Send context using in venv operator (#41039)" (#41362)

0ee7aff

This reverts commit da55393. (cherry picked from commit 0a06cc6)

Update version to 2.10.0

0d87d27

Update RELEASE_NOTES.rst

4fce874

Fix tests/models/test_taskinstance.py for Database Isolation Tests (#…

a7d48cb

…41344) (cherry picked from commit f811ac3)

Fix pytests for Core except Variable for DB Isolation Mode (#41375)

ef147dd

(cherry picked from commit 60cbea5)

bump uv version to 0.2.34 (#41334)

34e2c9c

(cherry picked from commit b4a92f8)

Skip docs publishing on non-main brnaches (#41385)

09567dc

(cherry picked from commit 54c165c)

Fix mypy checks for new azure libraries (#41386)

6c6797c

(cherry picked from commit 68a6a05)

Fix tests/decorators/test_python.py for database isolation tests (#41387

8ea4eb1

) * Pass serialized parameter for dag_maker * Serialisation of object is on __exit__ moving out the dag definition out of dag_maker context (cherry picked from commit 278f3c4)

Fix try selector refresh (#41483) (#41503)

6f2121a

Remove debian bullseye support (#41568) (#41569)

29270af

(cherry picked from commit 5c323a9)

[Backport] Deprecate implicit default DAG schedule (#41469)

9a32f7d

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>

Fix UI rendering when XCom is INT, FLOAT, BOOL or NULL (#41516) (#41605)

29f61a0

Fix InletEventsAccessors type stub (#41572) (#41607)

36ea9e7

Co-authored-by: phi-friday <phi.friday@gmail.com>

Set better logging level for path wrapper (#41615) (#41668)

8ffe7d6

Co-authored-by: Computer Network Investigation <121175071+JSCU-CNI@users.noreply.github.com>

Change confirmation text (#41650) (#41679)

400dddc

Adding url sanitisation for extra links (#41665) (#41680)

ceb6051

(cherry picked from commit 6c463b3)

Splitting syspath preparation into stages (#41672) (#41694)

03e01e7

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>

jedcunningham and others added 7 commits November 1, 2024 15:18

Fix TrySelector for Mapped Tasks in Logs and Details Grid Panel (#43566)

8e79c7a

(cherry picked from commit d99c1c1) (cherry picked from commit af03e48)

mark test_task_workflow_trigger_success as flaky (#42972) (#43580)

1c7fba7

(cherry picked from commit f1735b4) Co-authored-by: GPK <gopidesupavan@gmail.com> (cherry picked from commit 0286c95)

Update RELEASE_NOTES.rst

c99887e

Retry exponential backoff float overflow fixed

f1cb6c3

alealandreev requested review from XD-DENG, ashb, bolkedebruin, dstandish, ephraimbuddy, hussein-awala, jedcunningham, kaxil, o-nikolas, pierrejeambrun, potiuk, uranusjr and vincbeck as code owners March 19, 2025 18:01

boring-cyborg bot added the area:API (Airflow's REST/HTTP API), area:CLI, area:dev-tools, and area:production-image (Production image improvements and fixes) labels Mar 19, 2025

alealandreev mentioned this pull request Mar 19, 2025

Retry exponential backoff max float overflow #47971

Closed

2 tasks

pierrejeambrun reviewed Mar 20, 2025

alealandreev requested a review from pierrejeambrun March 20, 2025 14:42

pierrejeambrun reviewed Mar 20, 2025

potiuk requested changes Mar 21, 2025

alealandreev closed this Mar 21, 2025

alealandreev wants to merge 161 commits into apache:main from alealandreev:alealandreev-chngs

20 participants
