fix(db_cleanup): add --error-on-cleanup-failure flag to airflow db clean#65239
Open
hkc-8010 wants to merge 5 commits intoapache:mainfrom
Open
fix(db_cleanup): add --error-on-cleanup-failure flag to airflow db clean#65239hkc-8010 wants to merge 5 commits intoapache:mainfrom
hkc-8010 wants to merge 5 commits intoapache:mainfrom
Conversation
airflow db clean suppresses all per-table cleanup errors via _suppress_with_logging() and exits 0 even when tables could not be cleaned. This makes it impossible to detect silent failures in automated DAG-based maintenance workflows, which can lead to unchecked table growth and eventual migration failures on upgrade. This commit adds an opt-in --error-on-cleanup-failure flag that causes run_cleanup() to raise AirflowException (and the CLI to exit 1) if any table cleanup encountered an error. Default behaviour is unchanged. Additionally, a warning listing all tables that were not cleaned is now always emitted when failures occur, even without the flag, improving observability without requiring any opt-in. Changes: - airflow/utils/db_cleanup.py: update _suppress_with_logging to track whether an exception was suppressed via a SimpleNamespace context object; collect failed table names in run_cleanup(); emit a warning summary and optionally raise AirflowException. - airflow/cli/cli_config.py: add ARG_DB_ERROR_ON_CLEANUP_FAILURE and wire it into the db clean ActionCommand args list. - airflow/cli/commands/db_command.py: forward error_on_cleanup_failure from CLI args to run_cleanup(). - tests/utils/test_db_cleanup.py: add unit tests covering the new flag and the warning summary behaviour. Made-with: Cursor
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide
|
…-cleanup-failure - Use collections.abc.Generator for the _suppress_with_logging return type annotation (ruff UP035 compliant) instead of typing.Generator. - Expand the _suppress_with_logging docstring to describe the yielded SimpleNamespace context object and the failure-tracking behaviour. - Add a new "Detecting cleanup failures" section to docs/howto/usage-cli.rst documenting the --error-on-cleanup-failure flag and the --skip-archive recommendation for large tables. Made-with: Cursor
26d0c3d to
ecd0697
Compare
potiuk
approved these changes
Apr 14, 2026
8e6c101 to
dd0e46d
Compare
Member
|
please fix the issues |
- Fix mypy arg-type errors: OperationalError third argument must be a
BaseException, not None. Replace OperationalError("", {}, None) with
OperationalError("", {}, Exception("mock db error")) in three new
tests in test_db_cleanup.py.
- Fix ruff ISC violation: collapse implicit string concatenation in the
run_cleanup() warning call into a single string literal.
- Update existing CLI tests in test_db_command.py to include the new
error_on_cleanup_failure=False kwarg in all ten
assert_called_once_with assertions.
- Add test_error_on_cleanup_failure to test_db_command.py to verify the
--error-on-cleanup-failure flag is correctly forwarded to run_cleanup.
Made-with: Cursor
dd0e46d to
712de81
Compare
…d tests - Introduce a new news fragment detailing the addition of the ``--error-on-cleanup-failure`` flag to the ``airflow db clean`` command, allowing for better error handling during table cleanup. - Update unit tests in `test_db_cleanup.py` to ensure proper functionality of the new flag, including checks for raised exceptions and warning messages for failed tables. - Adjust the known exceptions list to reflect changes in `db_cleanup.py`.
jscheffl
reviewed
Apr 15, 2026
jscheffl
reviewed
Apr 15, 2026
jscheffl
reviewed
Apr 15, 2026
jscheffl
reviewed
Apr 15, 2026
jscheffl
reviewed
Apr 15, 2026
- Change the behavior of the `--error-on-cleanup-failure` flag to raise a RuntimeError instead of an AirflowException when table cleanup encounters errors. - Update the documentation and help text for the flag to clarify its functionality. - Ensure that warning messages for failed tables are always emitted, regardless of the flag's state. - Modify unit tests in `test_db_cleanup.py` to reflect the new error handling and verify the correct logging behavior. This update improves error visibility during automated workflows by ensuring that cleanup failures are properly reported.
Author
|
@jscheffl All your review comments have been addressed:
Would you mind taking another look when you get a chance? Thank you! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What do the changes do?
airflow db clean(the CLI command wrappingairflow.utils.db_cleanup.run_cleanup) currently suppresses all per-table cleanup errors via_suppress_with_logging()and always exits 0, even when one or more tables could not be cleaned. This makes it impossible for operators to detect that their database is not actually being cleaned without manually inspecting task logs and grepping for specific warning strings.This PR adds:
--error-on-cleanup-failureflag that causes the command to exit 1 (raiseAirflowExceptionfromrun_cleanup()) if any table cleanup encountered an error. Default behaviour is unchanged (backward-compatible).Why?
A large deployment ran
airflow db cleandaily for 14+ months. Each run silently failed on thelogtable because the CTAS archival step exceededstatement_timeout(300 s) and was rolled back. The DAG task showed green on every run. Thelogtable grew to 337 M rows / 151 GB. When the deployment was later upgraded to Airflow 3.x, a migration adding a new column + index to thelogtable could not complete withinstatement_timeout, leaving the deployment in a migration loop for ~15 hours.Root cause:
_suppress_with_logging()catchesOperationalError/ProgrammingError, logs a WARNING, and swallows the exception.run_cleanup()has no way to know any table failed.Changes
airflow/utils/db_cleanup.py_suppress_with_loggingyields aSimpleNamespace(failed=False)context so callers can detect suppression;run_cleanupcollects failed table names, emits a warning summary, and optionally raisesAirflowExceptionairflow/cli/cli_config.pyARG_DB_ERROR_ON_CLEANUP_FAILURE; wire intodb cleanActionCommandairflow/cli/commands/db_command.pyerror_on_cleanup_failurearg from CLI torun_cleanup()tests/utils/test_db_cleanup.pyairflow-core/docs/howto/usage-cli.rst--skip-archiverecommendationUsage
In a DAG-based cleanup workflow:
Note on
--skip-archiveWhen the CTAS archival step is itself the source of the timeout (as in the motivating incident), combining
--error-on-cleanup-failurewith--skip-archiveis recommended:--skip-archivedeletes rows directly without the costlyCREATE TABLE … AS SELECT, making the cleanup both faster and less likely to time out.Checklist
mainbranchFalse)airflow-core/docs/howto/usage-cli.rst— added "Detecting cleanup failures" section)