
Fix float overflow for retry_exponential_backoff #47967

Closed
alealandreev wants to merge 161 commits into apache:main from alealandreev:alealandreev-chngs

Conversation


@alealandreev alealandreev commented Mar 19, 2025

Issue: #47971

Fixes a float overflow in the exponential backoff calculation. I hit the error with retry_delay = 5 minutes, max_retry_delay = 1 hour, retry_exponential_backoff = True and retries = 1000 in the DAG configuration. In this case the scheduler breaks down at around the 1000th retry because of a float overflow (the delay is recalculated on each retry), and after 1000 retries the DAG is still trying to start. The total number of retries I observed was 1017, which is more than the configured 1000. The cause is this formula on line 2657 of taskinstance.py: min_backoff = math.ceil(delay.total_seconds() * (2 ** (self.try_number - 1))).
We should cap the exponent at a reasonable value, for instance 30, and then guard against any remaining exceptions. This fix repairs the exponential backoff logic so that the float overflow can no longer happen.
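
For illustration, the failure mode and the proposed cap can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions, not the code in taskinstance.py and not the patch in this PR: the names `naive_min_backoff`, `capped_min_backoff` and `MAX_EXPONENT` are hypothetical, and every other step of the real retry calculation is left out. With a 5-minute delay, 300 * 2**1016 already exceeds the largest finite double (about 1.8e308), so the multiplication yields inf and math.ceil raises OverflowError at try number 1017, which is consistent with the count reported above.

```python
import math
from datetime import timedelta
from typing import Optional

# Hypothetical cap on the exponent; 30 is the value suggested in the description.
MAX_EXPONENT = 30


def naive_min_backoff(delay: timedelta, try_number: int) -> int:
    """The quoted formula, with no limit on the exponent."""
    return math.ceil(delay.total_seconds() * (2 ** (try_number - 1)))


def capped_min_backoff(
    delay: timedelta,
    try_number: int,
    max_delay: Optional[timedelta] = None,
) -> int:
    """Same formula, but the exponent is clamped to [0, MAX_EXPONENT]."""
    exponent = min(max(try_number - 1, 0), MAX_EXPONENT)
    seconds = math.ceil(delay.total_seconds() * (2 ** exponent))
    if max_delay is not None:
        # Never wait longer than the configured maximum delay.
        seconds = min(seconds, math.ceil(max_delay.total_seconds()))
    return seconds


if __name__ == "__main__":
    delay = timedelta(minutes=5)
    max_delay = timedelta(hours=1)
    try:
        naive_min_backoff(delay, 1017)
    except OverflowError as exc:
        print(f"naive formula failed: {exc}")
    print(capped_min_backoff(delay, 1017, max_delay))  # 3600
```

Clamping the exponent keeps the intermediate product far below the float range regardless of how large try_number grows, so no exception handling is needed around the multiplication itself.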



ephraimbuddy and others added 30 commits August 9, 2024 07:24
(cherry picked from commit cb80bda)
This reverts commit da55393.

(cherry picked from commit 0a06cc6)
…ty (#41382)

When using older FAB providers on the new airflow, this function is called
in the old provider and is no longer available in the new airflow. This
PR brings this back to fix issue in main and v2-10-test branch where all
DAGs fail because of lack of this function

(cherry picked from commit 0576f55)
* Attempt to fix TriggerDagRunOperator for Database Isolation Tests

* Finalize making tests run for triggerdagrunoperator in db isolation mode

* Adjust query count assert for adjustments to serialization

* Review feedback

(cherry picked from commit 6b810b8)
…41369)

* Skip core tests from start to SkipMixin for Database Isolation Mode

* Skip core tests from start to SkipMixin for Database Isolation Mode, uups

* Skip core tests from start to SkipMixin for Database Isolation Mode, uups

(cherry picked from commit b87f987)
…ests (#41370)

Fixes the remaining Variable tests for DB isolation mode, and also fixes the secret backend not being called from EnvironmentVariablesBackend, Metastore and custom ones. As a side effect, the Variable.get() method was moved to the internal API.

(cherry picked from commit c98d1a1)
(cherry picked from commit b4a92f8)
)

* Pass serialized parameter for dag_maker

* Serialisation of object is on __exit__ moving out the dag definition out of dag_maker context

(cherry picked from commit 278f3c4)
…iases are resolved into new datasets (#41398)

* fix(datasets/manager): fix DagPriorityParsingRequest unique constraint error when dataset aliases are resolved into new datasets

this happens when dynamic task mapping is used

* refactor(dataset/manager): reword debug log

Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>

* refactor(dataset/manager): remove unnecessary logging

Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>

---------

Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
(cherry picked from commit bf64cb6)
)

The PROD image building fails currently in non-main because it
attempts to build source provider packages rather than use them from
PyPi when PR is run against "v-test" branch.

This PR fixes it:

* PROD images in non-main-targeted build will pull providers from
  PyPI rather than build them
* they use PyPI constraints to install the providers
* they use UV - which should speed up building of the images

(cherry picked from commit 4d5f1c4)
(cherry picked from commit bf0d412)
…41610)

* Enable pull requests to be run from v*test branches (#41474) (#41476)

Since we switch from direct push of cherry-picking to open PRs
against v*test branch, we should enable PRs to run for the target
branch.

(cherry picked from commit a9363e6)

* Prevent provider lowest-dependency tests to run in non-main branch (#41478) (#41481)

When running tests in v2-10-test branch, lowest dependency tests
are run for providers - because when calculating separate tests,
the "skip_provider_tests" has not been used to filter them out.

This PR fixes it.

(cherry picked from commit 75da507)

* Make PROD image building works in non-main PRs (#41480) (#41484)

The PROD image building fails currently in non-main because it
attempts to build source provider packages rather than use them from
PyPi when PR is run against "v-test" branch.

This PR fixes it:

* PROD images in non-main-targeted build will pull providers from
  PyPI rather than build them
* they use PyPI constraints to install the providers
* they use UV - which should speed up building of the images

(cherry picked from commit 4d5f1c4)

* Add WebEncoder for trigger page rendering to avoid render failure (#41350) (#41485)

Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com>

* Incorrect try number subtraction producing invalid span id for OTEL airflow (issue #41501) (#41502) (#41535)

* Fix for issue #39336

* removed unnecessary import

(cherry picked from commit dd3c3a7)

Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>

* Fix failing pydantic v1 tests (#41534) (#41541)

We need to exclude some versions of Pydantic v1 because it conflicts
with aws provider.

(cherry picked from commit a033c5f)

* Fix Non-DB test calculation for main builds (#41499) (#41543)

Pytest has a weird behaviour: it will not collect tests
from a parent folder when a subfolder of it is specified after the
parent folder. This caused some non-db tests from the providers folder
to be skipped during the main build.

The issue in Pytest 8.2 (used to work before) is tracked at
pytest-dev/pytest#12605

(cherry picked from commit d489826)

* Add changelog for airflow python client 2.10.0 (#41583) (#41584)

* Add changelog for airflow python client 2.10.0

* Update client version

(cherry picked from commit 317a28e)

* Make all test pass in Database Isolation mode (#41567)

This adds dedicated "DatabaseIsolation" test to airflow v2-10-test
branch.

The DatabaseIsolation test will run all "db-tests" with enabled
DB isolation mode and running `internal-api` component - groups
of tests marked with "skip-if-database-isolation" will be skipped.

* Upgrade build and chart dependencies (#41570) (#41588)

(cherry picked from commit c88192c)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>

* Limit watchtower as dependency as 3.3.0 breaks main. (#41612)

(cherry picked from commit 1b602d5)

* Enable running Pull Requests against v2-10-stable branch (#41624)

(cherry picked from commit e306e7f)

* Fix tests/models/test_variable.py for database isolation mode (#41414)

* Fix tests/models/test_variable.py for database isolation mode

* Review feedback

(cherry picked from commit 736ebfe)

* Make latest botocore tests green (#41626)

The latest botocore tests are conflicting with a few requirements
and until apache-beam upcoming version is released we need to do
some manual exclusions. Those exclusions should make latest botocore
test green again.

(cherry picked from commit a13ccbb)

* Simpler task retrieval for taskinstance test (#41389)

The test has been updated for DB isolation but the retrieval of
task was not intuitive and it could lead to flaky tests possibly

(cherry picked from commit f25adf1)

* Skip database isolation case for task mapping taskinstance tests (#41471)

Related: #41067
(cherry picked from commit 7718bd7)

* Skipping tests for db isolation because similar tests were skipped (#41450)

(cherry picked from commit e94b508)

---------

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Brent Bovenzi <brent@astronomer.io>
Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com>
Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Co-authored-by: Bugra Ozturk <bugraoz93@users.noreply.github.com>
* able to change the 'Changed Row' display message after edit

* added message in connection form to warn of empty fields

* attempt to warn the specific fields cannot be empty

* revert change because need to check fields before save is clicked

* issues warning for specific fields that can't be deleted after save

* removed the individual warnings

* changed status to concise string

* added more concise suggestion



---------

Co-authored-by: Lucy Hu <90779522+lh5844@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: phi-friday <phi.friday@gmail.com>
Co-authored-by: Computer Network Investigation <121175071+JSCU-CNI@users.noreply.github.com>
Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
jedcunningham and others added 7 commits November 1, 2024 15:18
#43567)

It's possible that the start/end date are null when processing an
executor event, and there is no point in adding an OTEL event in that
case.

Before this, we'd try and convert `None` to nanoseconds and blow up the
scheduler.

Note: I don't think `queued_dttm` can be empty, but figured it didn't hurt to
guard against it just in case I've overlooked a way it can be possible.

(cherry picked from commit fe41e15)
(cherry picked from commit c83e524)
#43572)

* Mark all tasks as skipped when failing a dag_run manually including tasks with None state (#43482)

(cherry picked from commit eda6a8f)

* Fix tests for 2.10.x

---------

Co-authored-by: Abhishek <abhishek.bhakat@hotmail.com>
(cherry picked from commit 72eef0f)
…d from an extended operator (#42849) (#43577)

* refactor: Don't raise a warning when execute is called from an extended operator, as this should always be allowed.

* refactored: Fixed import of test_utils in test_dag_run

---------

Co-authored-by: David Blain <david.blain@infrabel.be>
(cherry picked from commit 95c46ec)

Co-authored-by: David Blain <info@dabla.be>
(cherry picked from commit 2f29c57)
(cherry picked from commit f1735b4)

Co-authored-by: GPK <gopidesupavan@gmail.com>
(cherry picked from commit 0286c95)
Member

@pierrejeambrun pierrejeambrun left a comment

Thank you for the PR. The branch looks broken (wrong rebase or wrong target branch).

Do you mind cleaning the branch to keep only the relevant change so we can proceed with the review?

Member

@pierrejeambrun pierrejeambrun left a comment

The branch still appears to be in a non-reviewable state.

Please clean the branch before asking for another review.

Member

@potiuk potiuk left a comment

Please remove all unrelated changes before doing anything here. There are far too many commits in this PR.


Labels

area:API (Airflow's REST/HTTP API)
area:CLI
area:dev-tools
area:production-image (Production image improvements and fixes)
