
fix(db_cleanup): add --error-on-cleanup-failure flag to airflow db clean #65239

Open
hkc-8010 wants to merge 5 commits into apache:main from hkc-8010:fix/db-cleanup-error-on-failure

hkc-8010 commented Apr 14, 2026

What do the changes do?

airflow db clean (the CLI command wrapping airflow.utils.db_cleanup.run_cleanup) currently suppresses all per-table cleanup errors via _suppress_with_logging() and always exits 0, even when one or more tables could not be cleaned. This makes it impossible for operators to detect that their database is not actually being cleaned without manually inspecting task logs and grepping for specific warning strings.

This PR adds:

  1. An opt-in --error-on-cleanup-failure flag that causes the command to exit 1 (raise AirflowException from run_cleanup()) if any table cleanup encountered an error. Default behaviour is unchanged (backward-compatible).
  2. A warning summary that lists all tables that were not cleaned due to errors. This summary is always emitted when failures occur, without requiring any opt-in, making the silent failure visible in task logs.

Why?

A large deployment ran airflow db clean daily for 14+ months. Each run silently failed on the log table because the CTAS archival step exceeded statement_timeout (300 s) and was rolled back. The DAG task showed green on every run. The log table grew to 337 M rows / 151 GB. When the deployment was later upgraded to Airflow 3.x, a migration adding a new column + index to the log table could not complete within statement_timeout, leaving the deployment in a migration loop for ~15 hours.

Root cause: _suppress_with_logging() catches OperationalError / ProgrammingError, logs a WARNING, and swallows the exception. run_cleanup() has no way to know any table failed.
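The root cause can be illustrated with a minimal, standard-library-only sketch. It is not the Airflow source: `FakeOperationalError`, `suppress_with_logging`, and `clean_all` are hypothetical stand-ins, and the log message is invented for illustration. The point is only that a context manager which catches and logs an error leaves the caller with no signal that anything failed.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("db_cleanup_sketch")


class FakeOperationalError(Exception):
    """Hypothetical stand-in for a database error such as a statement timeout."""


@contextmanager
def suppress_with_logging(table_name):
    """Log and swallow database errors, as described in the root cause above."""
    try:
        yield
    except FakeOperationalError:
        # The exception stops here; nothing is returned or re-raised.
        logger.warning("Error while cleaning table '%s'", table_name)


def clean_all(tables, failing):
    """Try every table and return the ones whose cleanup body completed."""
    cleaned = []
    for name in tables:
        with suppress_with_logging(name):
            if name in failing:
                raise FakeOperationalError("statement timeout")
            cleaned.append(name)
    # No exception escapes, so a wrapping CLI would still exit 0.
    return cleaned
```

Calling `clean_all(["log", "xcom"], {"log"})` returns normally even though `log` was never cleaned, which is the silent failure this PR targets.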

Changes

| File | Change |
|---|---|
| airflow/utils/db_cleanup.py | _suppress_with_logging yields a SimpleNamespace(failed=False) context so callers can detect suppression; run_cleanup collects failed table names, emits a warning summary, and optionally raises AirflowException |
| airflow/cli/cli_config.py | Add ARG_DB_ERROR_ON_CLEANUP_FAILURE; wire into db clean ActionCommand |
| airflow/cli/commands/db_command.py | Forward error_on_cleanup_failure arg from CLI to run_cleanup() |
| tests/utils/test_db_cleanup.py | Unit tests for the new flag and warning summary behaviour |
| airflow-core/docs/howto/usage-cli.rst | Added "Detecting cleanup failures" section documenting the new flag and --skip-archive recommendation |
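The failure-tracking pattern described for db_cleanup.py might look roughly like the sketch below. It is an approximation under stated assumptions, not the PR's code: `FakeOperationalError` and `run_cleanup_sketch` are hypothetical names, the message texts are invented, and `RuntimeError` is used in place of `AirflowException` so the example needs only the standard library.

```python
import logging
from contextlib import contextmanager
from types import SimpleNamespace

logger = logging.getLogger("db_cleanup_sketch")


class FakeOperationalError(Exception):
    """Hypothetical stand-in for a database error such as a statement timeout."""


@contextmanager
def suppress_with_logging(table_name):
    """Swallow database errors but record the failure on the yielded context."""
    ctx = SimpleNamespace(failed=False)
    try:
        yield ctx
    except FakeOperationalError:
        ctx.failed = True
        logger.warning("Error while cleaning table '%s'", table_name)


def run_cleanup_sketch(tables, failing, error_on_cleanup_failure=False):
    """Clean each table, collect the failed ones, and optionally raise."""
    failed_tables = []
    for name in tables:
        with suppress_with_logging(name) as ctx:
            if name in failing:
                raise FakeOperationalError("statement timeout")
        # ctx outlives the with-block, so the caller can inspect it.
        if ctx.failed:
            failed_tables.append(name)
    if failed_tables:
        summary = ", ".join(failed_tables)
        logger.warning("Tables not cleaned due to errors: %s", summary)
        if error_on_cleanup_failure:
            # Stand-in for the exception type the PR raises.
            raise RuntimeError(f"Cleanup failed for tables: {summary}")
    return failed_tables
```

Yielding a small mutable object keeps the suppression behaviour intact for existing callers while giving `run_cleanup` a way to learn that an error was swallowed.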

Usage

# Default: errors are suppressed, exits 0 (unchanged behaviour)
airflow db clean --clean-before-timestamp 2024-01-01 --yes

# Opt-in: exit 1 if any table cleanup failed
airflow db clean --clean-before-timestamp 2024-01-01 --yes --error-on-cleanup-failure

In a DAG-based cleanup workflow:

BashOperator(
    task_id="clean_db",
    bash_command=(
        "airflow db clean --yes "
        "--clean-before-timestamp '{{ macros.ds_add(ds, -21) }}' "
        "--error-on-cleanup-failure"
    ),
)

Note on --skip-archive

When the CTAS archival step is itself the source of the timeout (as in the motivating incident), combining --error-on-cleanup-failure with --skip-archive is recommended: --skip-archive deletes rows directly without the costly CREATE TABLE … AS SELECT, making the cleanup both faster and less likely to time out.
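For deployments that invoke the command from a script rather than a BashOperator, the recommended flag combination could be assembled as in the sketch below. The helper names `build_db_clean_command` and `run_db_clean` are hypothetical, and `--error-on-cleanup-failure` exists only once this PR is merged; the other flags are the ones already shown in the usage examples above.

```python
import subprocess


def build_db_clean_command(clean_before, skip_archive=True, error_on_cleanup_failure=True):
    """Build the argv list for `airflow db clean` with the flags discussed above."""
    cmd = ["airflow", "db", "clean", "--yes", "--clean-before-timestamp", clean_before]
    if skip_archive:
        # Delete rows directly instead of archiving them first.
        cmd.append("--skip-archive")
    if error_on_cleanup_failure:
        # Flag proposed by this PR; older Airflow versions will reject it.
        cmd.append("--error-on-cleanup-failure")
    return cmd


def run_db_clean(clean_before):
    """Run the cleanup and report whether the command exited with status 0."""
    result = subprocess.run(build_db_clean_command(clean_before), check=False)
    return result.returncode == 0
```

With the flag set, a `False` return from `run_db_clean` would indicate that at least one table was not cleaned, assuming the exit-code behaviour described in this PR.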

Checklist

  • My PR is targeted at the main branch
  • My changes are backward-compatible (new flag defaults to False)
  • I have added unit tests
  • I have updated the documentation (airflow-core/docs/howto/usage-cli.rst — added "Detecting cleanup failures" section)

airflow db clean suppresses all per-table cleanup errors via
_suppress_with_logging() and exits 0 even when tables could not be
cleaned. This makes it impossible to detect silent failures in automated
DAG-based maintenance workflows, which can lead to unchecked table
growth and eventual migration failures on upgrade.

This commit adds an opt-in --error-on-cleanup-failure flag that causes
run_cleanup() to raise AirflowException (and the CLI to exit 1) if any
table cleanup encountered an error. Default behaviour is unchanged.

Additionally, a warning listing all tables that were not cleaned is now
always emitted when failures occur, even without the flag, improving
observability without requiring any opt-in.

Changes:
- airflow/utils/db_cleanup.py: update _suppress_with_logging to track
  whether an exception was suppressed via a SimpleNamespace context
  object; collect failed table names in run_cleanup(); emit a warning
  summary and optionally raise AirflowException.
- airflow/cli/cli_config.py: add ARG_DB_ERROR_ON_CLEANUP_FAILURE and
  wire it into the db clean ActionCommand args list.
- airflow/cli/commands/db_command.py: forward error_on_cleanup_failure
  from CLI args to run_cleanup().
- tests/utils/test_db_cleanup.py: add unit tests covering the new flag
  and the warning summary behaviour.

Made-with: Cursor

boring-cyborg bot commented Apr 14, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

…-cleanup-failure

- Use collections.abc.Generator for the _suppress_with_logging return
  type annotation (ruff UP035 compliant) instead of typing.Generator.
- Expand the _suppress_with_logging docstring to describe the yielded
  SimpleNamespace context object and the failure-tracking behaviour.
- Add a new "Detecting cleanup failures" section to
  docs/howto/usage-cli.rst documenting the --error-on-cleanup-failure
  flag and the --skip-archive recommendation for large tables.

Made-with: Cursor
@hkc-8010 hkc-8010 force-pushed the fix/db-cleanup-error-on-failure branch from 26d0c3d to ecd0697 Compare April 14, 2026 18:26
@hkc-8010 hkc-8010 marked this pull request as ready for review April 14, 2026 18:38
@potiuk potiuk force-pushed the fix/db-cleanup-error-on-failure branch from 8e6c101 to dd0e46d Compare April 14, 2026 21:14
potiuk (Member) commented Apr 14, 2026

please fix the issues

- Fix mypy arg-type errors: OperationalError third argument must be a
  BaseException, not None. Replace OperationalError("", {}, None) with
  OperationalError("", {}, Exception("mock db error")) in three new
  tests in test_db_cleanup.py.
- Fix ruff ISC violation: collapse implicit string concatenation in the
  run_cleanup() warning call into a single string literal.
- Update existing CLI tests in test_db_command.py to include the new
  error_on_cleanup_failure=False kwarg in all ten
  assert_called_once_with assertions.
- Add test_error_on_cleanup_failure to test_db_command.py to verify the
  --error-on-cleanup-failure flag is correctly forwarded to run_cleanup.
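The forwarding check described in the last two bullets can be sketched with argparse and a mock. This is a simplified stand-in, not the code in test_db_command.py: `build_parser`, `cleanup_tables_sketch`, and `check_forwarding` are hypothetical, and the real command forwards many more keyword arguments than the two shown here.

```python
import argparse
from unittest.mock import MagicMock


def build_parser():
    """Tiny parser with only the two flags needed for the illustration."""
    parser = argparse.ArgumentParser(prog="db-clean-sketch")
    parser.add_argument("--yes", action="store_true")
    parser.add_argument("--error-on-cleanup-failure", action="store_true")
    return parser


def cleanup_tables_sketch(args, run_cleanup):
    """Forward parsed CLI arguments to the cleanup function."""
    run_cleanup(
        confirm=not args.yes,
        error_on_cleanup_failure=args.error_on_cleanup_failure,
    )


def check_forwarding(argv):
    """Run the command against a mock and return it for assertions."""
    run_cleanup = MagicMock()
    cleanup_tables_sketch(build_parser().parse_args(argv), run_cleanup)
    return run_cleanup
```

Because `assert_called_once_with` compares the full keyword set, every existing assertion has to list the new kwarg with its default once the command starts forwarding it, which is why the commit touches all of the existing tests.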

Made-with: Cursor
@hkc-8010 hkc-8010 force-pushed the fix/db-cleanup-error-on-failure branch from dd0e46d to 712de81 Compare April 15, 2026 04:17
…d tests

- Introduce a new news fragment detailing the addition of the ``--error-on-cleanup-failure`` flag to the ``airflow db clean`` command, allowing for better error handling during table cleanup.
- Update unit tests in `test_db_cleanup.py` to ensure proper functionality of the new flag, including checks for raised exceptions and warning messages for failed tables.
- Adjust the known exceptions list to reflect changes in `db_cleanup.py`.
Comment thread airflow-core/newsfragments/65239.bugfix.rst Outdated
Comment thread airflow-core/src/airflow/cli/cli_config.py Outdated
Comment thread airflow-core/src/airflow/utils/db_cleanup.py Outdated
Comment thread airflow-core/src/airflow/utils/db_cleanup.py Outdated
Comment thread airflow-core/tests/unit/utils/test_db_cleanup.py Outdated
- Change the behavior of the `--error-on-cleanup-failure` flag to raise a RuntimeError instead of an AirflowException when table cleanup encounters errors.
- Update the documentation and help text for the flag to clarify its functionality.
- Ensure that warning messages for failed tables are always emitted, regardless of the flag's state.
- Modify unit tests in `test_db_cleanup.py` to reflect the new error handling and verify the correct logging behavior.

This update improves error visibility during automated workflows by ensuring that cleanup failures are properly reported.
@hkc-8010 (Author)

@jscheffl All your review comments have been addressed:

  • Newsfragment: Removed.
  • Help text: Simplified to a single concise sentence.
  • AirflowException: Replaced with RuntimeError.
  • Duplicate output: Restructured so the summary warning is only emitted when the flag is not set — when error_on_cleanup_failure=True, the RuntimeError message already lists the failed tables, avoiding duplicate output.
  • caplog: Replaced all caplog usage in the affected tests with patch('airflow.utils.db_cleanup.logger') and assert_any_call assertions.
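The logger-patching approach from the last bullet can be illustrated as follows. This is a sketch under assumptions: `report_failures` and `captured_warnings` are hypothetical, the warning text is invented, and `patch.dict` on the function's globals is used here only so the example does not depend on a module path; the actual tests are described as targeting `patch('airflow.utils.db_cleanup.logger')`.

```python
import logging
from unittest.mock import MagicMock, patch

logger = logging.getLogger("db_cleanup_sketch")


def report_failures(failed_tables):
    """Emit one summary warning when any table failed to clean."""
    if failed_tables:
        logger.warning("Tables not cleaned due to errors: %s", ", ".join(failed_tables))


def captured_warnings(failed_tables):
    """Swap the module-level logger for a mock and return it after the call."""
    mock_logger = MagicMock()
    # Replace the name `logger` that report_failures resolves at call time.
    with patch.dict(report_failures.__globals__, {"logger": mock_logger}):
        report_failures(failed_tables)
    return mock_logger
```

Asserting on the mocked logger checks the exact call arguments rather than captured log records, so the test does not depend on how log propagation and handlers are configured.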

Would you mind taking another look when you get a chance? Thank you!

@hkc-8010 hkc-8010 requested a review from jscheffl April 18, 2026 05:33