fix(alerts): reset stuck WORKING reports to NOOP instead of ERROR on timeout #39779

Open
dpgaspar wants to merge 5 commits into apache:master from preset-io:danielgaspar/sc-104379/alert-schedule-not-working-properly-for-opentable

Conversation

@dpgaspar
Member

SUMMARY

When a Celery worker crashes (OOM, pod eviction) mid-execution, the report stays stuck in WORKING state in the database. On the next scheduled tick, ReportWorkingState.next() checks is_on_working_timeout():

Before: If the working_timeout has elapsed, it transitions to ERROR and raises ReportScheduleWorkingTimeoutError. For daily schedules, this means no retry until the next day — a 24-hour wait for something that could recover immediately.

After: It transitions to NOOP instead and returns (no exception). The next cron tick picks it up naturally and retries. This:

  • Fixes daily schedules: They recover on the next tick instead of waiting 24 hours
  • Avoids duplicate execution risk: No immediate retry that could clash with celery's broker requeue (~30 min)
  • No new config: Uses the existing working_timeout per-report setting — no additional threshold to coordinate
  • Single behavior: One config, one meaning — "how long before a WORKING state is considered stuck"
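
The before/after transition described above can be sketched in isolation. This is a minimal model under stated assumptions, not Superset's code: `State`, `WorkingTimeoutError`, `on_working_tick`, and the `legacy` flag are hypothetical names introduced for illustration; only `working_timeout` and the WORKING/ERROR/NOOP state names come from this PR.

```python
from datetime import datetime, timedelta
from enum import Enum, auto


class State(Enum):
    """Hypothetical subset of the report states named in this PR."""

    WORKING = auto()
    ERROR = auto()
    NOOP = auto()


class WorkingTimeoutError(Exception):
    """Hypothetical stand-in for ReportScheduleWorkingTimeoutError."""


def on_working_tick(
    started_at: datetime,
    now: datetime,
    working_timeout: int,
    legacy: bool = False,
) -> State:
    """Decide what a scheduler tick does with a report found in WORKING."""
    timed_out = now - started_at > timedelta(seconds=working_timeout)
    if not timed_out:
        # Within the timeout: assume a worker may still be running the report.
        return State.WORKING
    if legacy:
        # Old behavior: the timeout surfaces as an exception (ERROR state).
        raise WorkingTimeoutError("Report Schedule reached a working timeout.")
    # New behavior: reset quietly so a later tick can pick the report up.
    return State.NOOP
```

With a 60-second `working_timeout`, a tick 61 seconds after the start returns `State.NOOP`, while the same call with `legacy=True` raises.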

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A — backend behavior change only.

TESTING INSTRUCTIONS

  1. Create a report schedule with a working_timeout of e.g. 60 seconds
  2. Manually set the report's last_state to WORKING in the DB and create a ReportExecutionLog entry in WORKING state with end_dttm older than working_timeout
  3. Trigger the scheduler
  4. Before: Report transitions to ERROR, raises exception, daily schedules wait until next day
  5. After: Report transitions to NOOP, next cron tick retries it

Unit tests updated to validate the new behavior.

ADDITIONAL INFORMATION

  • Has associated issue: SC-104379
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

🤖 Generated with Claude Code

…timeout

When a Celery worker crashes (OOM, pod eviction) mid-execution, the report
stays stuck in WORKING state. Previously, after working_timeout elapsed,
it transitioned to ERROR — which for daily schedules meant no retry until
the next day (24-hour wait).

Now it transitions to NOOP instead, so the next scheduled cron tick picks
it up and retries naturally. This avoids duplicate execution risk (no
immediate retry that could clash with celery broker requeue) while ensuring
daily schedules recover promptly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bito-code-review
Contributor

bito-code-review Bot commented Apr 30, 2026

Code Review Agent Run #c62461

Actionable Suggestions - 0
Review Details
  • Files reviewed - 3 · Commit Range: 856735d..856735d
    • superset/commands/report/execute.py
    • tests/integration_tests/reports/commands_tests.py
    • tests/unit_tests/commands/report/execute_test.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


@dosubot Bot added the alert-reports (Namespace | Anything related to the Alert & Reports feature) and change:backend (Requires changing the backend) labels on Apr 30, 2026
@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 64.37%. Comparing base (dc1c0f6) to head (1df27dd).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| superset/commands/report/execute.py | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #39779      +/-   ##
==========================================
+ Coverage   64.35%   64.37%   +0.01%     
==========================================
  Files        2569     2569              
  Lines      134680   134619      -61     
  Branches    31254    31224      -30     
==========================================
- Hits        86679    86665      -14     
+ Misses      46505    46455      -50     
- Partials     1496     1499       +3     
| Flag | Coverage Δ | |
|---|---|---|
| hive | 39.67% <0.00%> | (?) |
| mysql | 59.94% <66.66%> | (+<0.01%) ⬆️ |
| postgres | 60.03% <66.66%> | (+<0.01%) ⬆️ |
| presto | 41.42% <0.00%> | (ø) |
| python | 61.56% <66.66%> | (+0.05%) ⬆️ |
| sqlite | 59.65% <66.66%> | (+<0.01%) ⬆️ |
| unit | 100.00% <ø> | (ø) |

…etrying immediately

When a Celery worker crashes (OOM, pod eviction) mid-execution, the report
stays stuck in WORKING state. Previously, after working_timeout elapsed,
it transitioned to ERROR — which for daily schedules meant the report
wouldn't actually retry until two days later (day 1: crash, day 2: timeout
fires → ERROR, day 3: cron sees ERROR → retries).

Now on working_timeout it resets to NOOP and immediately re-executes via
ReportNotTriggeredErrorState in the same tick. This is safe because by the
time working_timeout (typically >= 1 hour) has elapsed, any celery broker
requeue (~30 min) has already been attempted and rejected with
ReportSchedulePreviousWorkingError — so there is no duplicate execution risk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
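
The two-day delay described in this commit message can be checked with a toy simulation. It is only a sketch: it assumes exactly one cron tick per day and a `working_timeout` shorter than a day, and `first_retry_day` and its `immediate_retry` flag are hypothetical names, not part of Superset.

```python
from enum import Enum, auto


class State(Enum):
    WORKING = auto()
    ERROR = auto()


def first_retry_day(immediate_retry: bool, max_days: int = 7) -> int:
    """Return the day on which a report that crashed on day 1 runs again.

    Assumes one cron tick per day and a working_timeout shorter than a day,
    so the first tick after the crash always sees the timeout as elapsed.
    """
    state = State.WORKING  # day 1: the worker died, report left in WORKING
    for day in range(2, max_days + 1):
        if state is State.WORKING:
            if immediate_retry:
                # New behavior: reset to NOOP and re-execute in this tick.
                return day
            # Old behavior: this tick is spent recording the timeout.
            state = State.ERROR
            continue
        # A report in ERROR is executed again on the following tick.
        return day
    raise RuntimeError("no retry within the simulated window")
```

Under these assumptions `first_retry_day(False)` gives day 3 and `first_retry_day(True)` gives day 2, matching the sequence in the message.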
@netlify

netlify Bot commented Apr 30, 2026

Deploy Preview for superset-docs-preview ready!

Name Link
🔨 Latest commit 1df27dd
🔍 Latest deploy log https://app.netlify.com/projects/superset-docs-preview/deploys/69f84b2cf1ff6f0008a19aa0
😎 Deploy Preview https://deploy-preview-39779--superset-docs-preview.netlify.app

…meout

The retry via ReportNotTriggeredErrorState now actually executes, which
fails in CI (no webdriver). Updated the test to expect a CommandException
from the retry and verify the NOOP reset happened as an intermediate step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bito-code-review
Contributor

bito-code-review Bot commented May 1, 2026

Code Review Agent Run #2bbdc5

Actionable Suggestions - 0
Review Details
  • Files reviewed - 3 · Commit Range: 856735d..fd7a8a3
    • superset/commands/report/execute.py
    • tests/unit_tests/commands/report/execute_test.py
    • tests/integration_tests/reports/commands_tests.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


  self.update_report_schedule_and_log(
-     ReportState.ERROR,
-     error_message=str(exception_timeout),
+     ReportState.NOOP,
Member

By setting as NOOP and recovering the stuck state, would we be introducing an infinite loop? Asking because I see no "stop" mechanism for the attempts after automatically re-scheduling X amount of time?

@bito-code-review
Contributor

The change resets stuck reports to NOOP and retries immediately via ReportNotTriggeredErrorState, which could lead to repeated retries if the report execution keeps getting stuck after each timeout. However, the design assumes this is safe post-celery requeue attempts, preventing infinite loops in normal scenarios.
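
One possible "stop" mechanism for the question above is a cap on consecutive timeout resets. The sketch below is purely illustrative: `MAX_TIMEOUT_RESETS` and `StuckTracker` are hypothetical and do not exist in Superset or in this PR, and states are plain strings to keep the example self-contained.

```python
from dataclasses import dataclass

MAX_TIMEOUT_RESETS = 3  # hypothetical limit, not an existing Superset setting


@dataclass
class StuckTracker:
    """Counts consecutive working-timeout resets for one report."""

    consecutive_resets: int = 0

    def on_timeout(self) -> str:
        """Return the state to record when a WORKING report times out."""
        self.consecutive_resets += 1
        if self.consecutive_resets > MAX_TIMEOUT_RESETS:
            # Stop retrying silently: surface ERROR so owners are notified.
            return "ERROR"
        return "NOOP"

    def on_finished(self) -> None:
        """Any run that completes (success or handled failure) clears the streak."""
        self.consecutive_resets = 0
```

With the limit at 3, the fourth consecutive timeout is recorded as ERROR instead of NOOP, and a completed run resets the count.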

Contributor

@aminghadersohi left a comment

Code Review — apache/superset PR #39779

HEAD SHA: 1df27dde987637ad7a7ad6bf71565ec60e79d567
Files changed: superset/commands/report/execute.py, tests/integration_tests/reports/commands_tests.py, tests/unit_tests/commands/report/execute_test.py
Second opinion: Not needed — 3 files, ~80 lines, no security/migration/multi-tenant code.


Automated scan coverage

| # | Scan | Result |
|---|---|---|
| 1 | Inline imports inside functions | 1 pre-existing (StaleDataError in create_log), 0 new |
| 2 | Bare except Exception / swallowed exceptions | Pre-existing instances all have # pylint: disable=broad-except, 0 new |
| 3 | Optional[X] type hints (should be X \| None) | 0 new |
| 4 | datetime.utcnow() usage | 0 new (pre-existing throughout file) |
| 5 | Missing @transaction() on DB writes | New writes in ReportWorkingState.next() called from ReportScheduleStateMachine.run() which carries @transaction(): OK |
| 6 | db.session.add() without flush/commit | 0 new |
| 7 | Raw SQL / string interpolation in queries | 0 |
| 8 | Missing error handling on external calls | 0 |
| 9 | Walrus operator opportunities | 0 applicable cases |
| 10 | Missing type hints | 0 new violations; next() -> None correctly typed |
| 11 | f-strings in logger calls | 0; all new logger calls correctly use %s/%.2f |
| 12 | Hardcoded strings that should be constants | 1 (see MEDIUM #3) |
| 13 | Missing model migrations | N/A (no model changes) |
| 14 | Test assertion patterns | 1 concern (see HIGH #1) |

BLOCKER

None.


HIGH

H1 — Integration test dropped critical assertions

tests/integration_tests/reports/commands_tests.py ~L1762

The old test verified three concrete postconditions:

assert len(logs) == 2
assert ReportScheduleWorkingTimeoutError.message in [log.error_message for log in logs]
assert create_report_slack_chart_working.last_state == ReportState.ERROR

The new test collapses this to:

assert any("stuck" in (log.error_message or "").lower() for log in logs)

This passes as long as a log entry containing "stuck" exists — it does not verify:

  • The final last_state of the report schedule (the most important observable)
  • That the retry actually ran (WORKING log entry exists)
  • The number of log entries created

A regression where the state machine ends up in an unexpected final state (e.g., stays NOOP or reverts to WORKING) would not be caught. The test should be extended:

# After the with-block:
assert create_report_slack_chart_working.last_state in (ReportState.ERROR, ReportState.NOOP)
# verify the noop reset log exists
noop_logs = [log for log in logs if log.state == ReportState.NOOP]
assert len(noop_logs) >= 1
assert "stuck" in (noop_logs[0].error_message or "").lower()
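
To make the gap concrete, the two assertion styles can be compared on fake data. This is a sketch only: `FakeLog`, `weak_check`, and `strong_check` are hypothetical helpers, states are plain strings rather than `ReportState` members, and the message text is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FakeLog:
    """Hypothetical stand-in for a report execution log row."""

    state: str
    error_message: Optional[str] = None


def weak_check(logs: Sequence[FakeLog]) -> bool:
    """The collapsed assertion: any log mentioning 'stuck' is enough."""
    return any("stuck" in (log.error_message or "").lower() for log in logs)


def strong_check(last_state: str, logs: Sequence[FakeLog]) -> bool:
    """The extended assertions: final state plus a NOOP reset log."""
    noop_logs = [log for log in logs if log.state == "NOOP"]
    return (
        last_state in ("ERROR", "NOOP")
        and len(noop_logs) >= 1
        and "stuck" in (noop_logs[0].error_message or "").lower()
    )
```

If a regression left the schedule in WORKING, `weak_check` would still pass on a log list containing the reset entry, while `strong_check("WORKING", logs)` would fail.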

H2 — Persistent OOM-kill loop has no error signal

superset/commands/report/execute.py L954–983

The docstring correctly describes the recovery path for one-time worker crashes. But for reports that are consistently killed by OOM (not via a Python exception — no SoftTimeLimitExceeded raised, the process just dies), the cycle becomes:

WORKING → [OOM kill] → WORKING (stuck)
  → timeout → NOOP → immediate retry → [OOM kill again] → WORKING (stuck)
  → timeout → NOOP → ...

Under the old behavior each timeout cycle surfaced an ERROR state and sent an owner notification. Under the new behavior the ERROR state is only set if the retry raises a catchable exception. For OOM-killed retries that also get stuck, owners get no notification between cycles — only the NOOP reset log is written.

This concern is bounded (cycle rate = 1 per working_timeout, typically ≥ 1 hr), but it's a real visibility degradation for diagnosing persistently broken reports. The docstring's comment "this is safe because…" should explicitly acknowledge this tradeoff. A logger.warning that counts resets (or a metric increment) would preserve operational observability.
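
A reset counter of the kind suggested here could look roughly like the following. It is a sketch under assumptions: the log history is given as a list of state strings ordered oldest first, every NOOP entry in it is taken to be a timeout reset, and `resets_since_last_result`, `warn_on_repeated_resets`, and `threshold` are hypothetical names.

```python
import logging
from typing import Sequence

logger = logging.getLogger(__name__)


def resets_since_last_result(states: Sequence[str]) -> int:
    """Count NOOP resets logged after the most recent SUCCESS or ERROR."""
    count = 0
    for state in reversed(states):
        if state in ("SUCCESS", "ERROR"):
            break
        if state == "NOOP":
            count += 1
    return count


def warn_on_repeated_resets(
    report_id: int, states: Sequence[str], threshold: int = 2
) -> bool:
    """Emit a warning once a report has been reset `threshold` times in a row."""
    resets = resets_since_last_result(states)
    if resets >= threshold:
        # Lazy %-style arguments, matching the logging style noted in the scan.
        logger.warning(
            "Report %s was reset from a stuck WORKING state %s times in a row",
            report_id,
            resets,
        )
        return True
    return False
```

The count restarts after any SUCCESS or ERROR entry, so only an unbroken run of stuck executions triggers the warning.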


MEDIUM

M1 — Dead exception class: ReportScheduleWorkingTimeoutError

superset/commands/report/exceptions.py L212–214

The class is no longer imported or raised anywhere after this PR. It should be removed in the same commit. Leaving dead exception classes in exceptions.py misleads readers into thinking the timeout path still raises it, and future code may accidentally reference it.

# Remove this entire class:
class ReportScheduleWorkingTimeoutError(CommandException):
    status = 408
    message = _("Report Schedule reached a working timeout.")

M2 — Missing test for the success path after timeout reset

The unit and integration tests only cover the failure path (retry fails with CommandException). There is no test verifying that a stuck report which resets and retries successfully ends in ReportState.SUCCESS. The success path exercises a different code branch in ReportNotTriggeredErrorState.next() (no exception, update_report_schedule_and_log(SUCCESS) is called). A unit test mocking the retry to succeed would close this gap with minimal effort.
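
The shape of such a success-path test might be as follows. This is a reduced sketch, not the real test: `handle_working_timeout` is a hypothetical stand-in for the timeout branch, the mocks replace `update_report_schedule_and_log` and the retry, and states are plain strings.

```python
from unittest.mock import MagicMock


def handle_working_timeout(update_log, retry) -> None:
    """Hypothetical reduction of the timeout branch: reset, then retry."""
    update_log("NOOP", error_message="Report was stuck; resetting")
    retry()


def test_timeout_reset_then_successful_retry() -> None:
    states = []
    update_log = MagicMock(side_effect=lambda state, **_: states.append(state))
    # The retry succeeds and records SUCCESS through the same logging hook.
    retry = MagicMock(side_effect=lambda: update_log("SUCCESS"))

    handle_working_timeout(update_log, retry)

    retry.assert_called_once_with()
    # NOOP is the intermediate reset; SUCCESS is the final state.
    assert states == ["NOOP", "SUCCESS"]
```

A real version would patch the execution step inside the state machine instead of passing callables, but the assertions on ordering and final state would be the same.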

M3 — Error message string implicitly contracts with the test

superset/commands/report/execute.py L972, tests/integration_tests/reports/commands_tests.py ~L1769

The source hardcodes "...stuck (possibly due to a worker crash)..." and the test checks "stuck" in log.error_message.lower(). There is no shared constant binding them. If either changes independently, the test silently degrades (e.g., message changes to "stalled" — test fails for the wrong reason, or message removes "stuck" — test passes incorrectly if another log happens to contain "stuck").
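
One way to bind the two sides is a shared constant that both the command and the test import. The sketch below is illustrative: `WORKING_TIMEOUT_RESET_MESSAGE`, `build_reset_log_message`, and `is_reset_log` are hypothetical names, and the message wording is a placeholder rather than the string used in the PR.

```python
from typing import Optional

# Hypothetical module-level constant shared by the command and its tests.
WORKING_TIMEOUT_RESET_MESSAGE = "Report was stuck in WORKING; resetting to NOOP"


def build_reset_log_message(report_name: str) -> str:
    """Message the command would write to the execution log."""
    return f"{WORKING_TIMEOUT_RESET_MESSAGE}: {report_name}"


def is_reset_log(error_message: Optional[str]) -> bool:
    """Check a test can use instead of searching for the word 'stuck'."""
    return (error_message or "").startswith(WORKING_TIMEOUT_RESET_MESSAGE)
```

Because the test matches on the constant itself, rewording the message changes both sides at once, and an unrelated log that merely contains "stuck" no longer satisfies the check.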


NIT

N1 — call_args tuple indexing is fragile

tests/unit_tests/commands/report/execute_test.py ~L1139–1141

assert call_args[0][0] == ReportState.NOOP           # fragile positional
assert "stuck" in call_args[1]["error_message"].lower()

Prefer the named-attribute style available in modern unittest.mock:

assert call_args.args[0] == ReportState.NOOP
assert "stuck" in call_args.kwargs["error_message"].lower()

This makes intent explicit and avoids off-by-one errors if the call signature changes.
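
The equivalence of the two access styles can be seen on a standalone mock. In this sketch `update` is a hypothetical stand-in for the patched `update_report_schedule_and_log`, and the state is a plain string; the `.args` and `.kwargs` attributes of `call_args` are available from Python 3.8.

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for the patched update_report_schedule_and_log.
update = MagicMock()
update("NOOP", error_message="Report was stuck")

call_args = update.call_args

# Tuple indexing and named attributes expose the same data.
assert call_args[0][0] == call_args.args[0] == "NOOP"
assert call_args[1]["error_message"] == call_args.kwargs["error_message"]
assert "stuck" in call_args.kwargs["error_message"].lower()
```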


PRAISE

P1 — Docstring is exemplary. The ReportWorkingState class docstring (L929–950) is the clearest behavioral documentation in this module. It names both scenarios, states the invariant about broker requeue timing, and describes the resulting state transition. More like this.

P2 — logger.error → logger.warning is semantically correct. A working timeout caused by a worker crash is an operational event, not an application error. The level change reflects that accurately.

P3 — DRY recovery path. Calling ReportNotTriggeredErrorState(...).next() directly rather than duplicating the execution logic is the right call. The retry correctly inherits all the error-handling, state updates, and notification logic of the initial execution path.


Verdict

COMMENT — do not approve yet.

The semantic choice (NOOP + immediate retry vs. ERROR) is well-reasoned and correct for the primary use case (one-time worker crash). The unit test is solid. However:

  1. The integration test's weakened assertions (H1) don't verify the most important observable (last_state), which means a state-machine regression could slip through.
  2. ReportScheduleWorkingTimeoutError (M1) is dead code that belongs in the same PR.
  3. The persistent-OOM loop (H2) should be acknowledged with a log count or a note in the docstring, since owners lose the ERROR notification signal in that scenario.

These are all fixable with small additions — no architectural objection to the approach.

Copy link
Copy Markdown
Contributor

@aminghadersohi left a comment

Formalizing my earlier comment as request-changes since the PR hasn't been updated. Three items need to be addressed before approval:

H1 — Integration test weakened too much (tests/integration_tests/reports/commands_tests.py ~L1762)
The new assertion (any("stuck" in log.error_message...)) doesn't verify last_state or that the retry ran. A regression leaving the report in an unexpected final state (WORKING, NOOP) would silently pass. Minimum addition:

assert create_report_slack_chart_working.last_state in (ReportState.ERROR, ReportState.NOOP)
noop_logs = [log for log in logs if log.state == ReportState.NOOP]
assert len(noop_logs) >= 1

M1 — Dead exception class (superset/commands/report/exceptions.py)
ReportScheduleWorkingTimeoutError is no longer imported or raised anywhere after this PR. Remove it in the same commit.

H2 — Persistent OOM loop loses owner notification signal
For reports that are consistently OOM-killed (process dies, no Python exception), the cycle WORKING → NOOP → retry → WORKING (stuck) → … never surfaces an ERROR state and never notifies owners. The old behavior surfaced an ERROR on each timeout cycle. The docstring should acknowledge this tradeoff explicitly; a logger.warning with a reset counter or a note about monitoring would preserve operational visibility.

The semantic approach (NOOP + immediate retry instead of ERROR) is correct. Unit test is solid. Docstring is excellent. Just needs these fixes.

@aminghadersohi self-requested a review on May 20, 2026 22:20

Labels

alert-reports (Namespace | Anything related to the Alert & Reports feature), change:backend (Requires changing the backend), size/M

3 participants