fix(alerts): reset stuck WORKING reports to NOOP instead of ERROR on timeout#39779
Conversation
…timeout When a Celery worker crashes (OOM, pod eviction) mid-execution, the report stays stuck in WORKING state. Previously, after working_timeout elapsed, it transitioned to ERROR — which for daily schedules meant no retry until the next day (24-hour wait). Now it transitions to NOOP instead, so the next scheduled cron tick picks it up and retries naturally. This avoids duplicate execution risk (no immediate retry that could clash with celery broker requeue) while ensuring daily schedules recover promptly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review Agent Run #c62461Actionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #39779 +/- ##
==========================================
+ Coverage 64.35% 64.37% +0.01%
==========================================
Files 2569 2569
Lines 134680 134619 -61
Branches 31254 31224 -30
==========================================
- Hits 86679 86665 -14
+ Misses 46505 46455 -50
- Partials 1496 1499 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…etrying immediately When a Celery worker crashes (OOM, pod eviction) mid-execution, the report stays stuck in WORKING state. Previously, after working_timeout elapsed, it transitioned to ERROR — which for daily schedules meant the report wouldn't actually retry until two days later (day 1: crash, day 2: timeout fires → ERROR, day 3: cron sees ERROR → retries). Now on working_timeout it resets to NOOP and immediately re-executes via ReportNotTriggeredErrorState in the same tick. This is safe because by the time working_timeout (typically >= 1 hour) has elapsed, any celery broker requeue (~30 min) has already been attempted and rejected with ReportSchedulePreviousWorkingError — so there is no duplicate execution risk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
✅ Deploy Preview for superset-docs-preview ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
…meout The retry via ReportNotTriggeredErrorState now actually executes, which fails in CI (no webdriver). Updated the test to expect a CommandException from the retry and verify the NOOP reset happened as an intermediate step. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review Agent Run #2bbdc5Actionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
…working-properly-for-opentable
…working-properly-for-opentable
| self.update_report_schedule_and_log( | ||
| ReportState.ERROR, | ||
| error_message=str(exception_timeout), | ||
| ReportState.NOOP, |
There was a problem hiding this comment.
By setting as NOOP and recovering the stuck state, would we be introducing an infinite loop? Asking because I see no "stop" mechanism for the attempts after automatically re-scheduling X amount of time?
|
The change resets stuck reports to NOOP and retries immediately via ReportNotTriggeredErrorState, which could lead to repeated retries if the report execution keeps getting stuck after each timeout. However, the design assumes this is safe post-celery requeue attempts, preventing infinite loops in normal scenarios. |
aminghadersohi
left a comment
There was a problem hiding this comment.
Code Review — apache/superset PR #39779
HEAD SHA: 1df27dde987637ad7a7ad6bf71565ec60e79d567
Files changed: superset/commands/report/execute.py, tests/integration_tests/reports/commands_tests.py, tests/unit_tests/commands/report/execute_test.py
Second opinion: Not needed — 3 files, ~80 lines, no security/migration/multi-tenant code.
Automated scan coverage
| # | Scan | Result |
|---|---|---|
| 1 | Inline imports inside functions | 1 pre-existing (StaleDataError in create_log), 0 new |
| 2 | Bare except Exception / swallowed exceptions |
Pre-existing instances all have # pylint: disable=broad-except, 0 new |
| 3 | Optional[X] type hints (should be X | None) |
0 new |
| 4 | datetime.utcnow() usage |
0 new (pre-existing throughout file) |
| 5 | Missing @transaction() on DB writes |
New writes in ReportWorkingState.next() called from ReportScheduleStateMachine.run() which carries @transaction() — OK |
| 6 | db.session.add() without flush/commit |
0 new |
| 7 | Raw SQL / string interpolation in queries | 0 |
| 8 | Missing error handling on external calls | 0 |
| 9 | Walrus operator opportunities | 0 applicable cases |
| 10 | Missing type hints | 0 new violations — next() -> None correctly typed |
| 11 | f-strings in logger calls | 0 — all new logger calls correctly use %s/%.2f |
| 12 | Hardcoded strings that should be constants | 1 — see MEDIUM #3 |
| 13 | Missing model migrations | N/A — no model changes |
| 14 | Test assertion patterns | 1 concern — see HIGH #1 |
BLOCKER
None.
HIGH
H1 — Integration test dropped critical assertions
tests/integration_tests/reports/commands_tests.py ~L1762
The old test verified three concrete postconditions:
assert len(logs) == 2
assert ReportScheduleWorkingTimeoutError.message in [log.error_message for log in logs]
assert create_report_slack_chart_working.last_state == ReportState.ERRORThe new test collapses this to:
assert any("stuck" in (log.error_message or "").lower() for log in logs)This passes as long as a log entry containing "stuck" exists — it does not verify:
- The final
last_stateof the report schedule (the most important observable) - That the retry actually ran (WORKING log entry exists)
- The number of log entries created
A regression where the state machine ends up in an unexpected final state (e.g., stays NOOP or reverts to WORKING) would not be caught. The test should be extended:
# After the with-block:
assert create_report_slack_chart_working.last_state in (ReportState.ERROR, ReportState.NOOP)
# verify the noop reset log exists
noop_logs = [l for l in logs if l.state == ReportState.NOOP]
assert len(noop_logs) >= 1
assert "stuck" in (noop_logs[0].error_message or "").lower()H2 — Persistent OOM-kill loop has no error signal
superset/commands/report/execute.py L954–983
The docstring correctly describes the recovery path for one-time worker crashes. But for reports that are consistently killed by OOM (not via a Python exception — no SoftTimeLimitExceeded raised, the process just dies), the cycle becomes:
WORKING → [OOM kill] → WORKING (stuck)
→ timeout → NOOP → immediate retry → [OOM kill again] → WORKING (stuck)
→ timeout → NOOP → ...
Under the old behavior each timeout cycle surfaced an ERROR state and sent an owner notification. Under the new behavior the ERROR state is only set if the retry raises a catchable exception. For OOM-killed retries that also get stuck, owners get no notification between cycles — only the NOOP reset log is written.
This concern is bounded (cycle rate = 1 per working_timeout, typically ≥ 1 hr), but it's a real visibility degradation for diagnosing persistently broken reports. The docstring's comment "this is safe because…" should explicitly acknowledge this tradeoff. A logger.warning that counts resets (or a metric increment) would preserve operational observability.
MEDIUM
M1 — Dead exception class: ReportScheduleWorkingTimeoutError
superset/commands/report/exceptions.py L212–214
The class is no longer imported or raised anywhere after this PR. It should be removed in the same commit. Leaving dead exception classes in exceptions.py misleads readers into thinking the timeout path still raises it, and future code may accidentally reference it.
# Remove this entire class:
class ReportScheduleWorkingTimeoutError(CommandException):
status = 408
message = _("Report Schedule reached a working timeout.")M2 — Missing test for the success path after timeout reset
The unit and integration tests only cover the failure path (retry fails with CommandException). There is no test verifying that a stuck report which resets and retries successfully ends in ReportState.SUCCESS. The success path exercises a different code branch in ReportNotTriggeredErrorState.next() (no exception, update_report_schedule_and_log(SUCCESS) is called). A unit test mocking the retry to succeed would close this gap with minimal effort.
M3 — Error message string implicitly contracts with the test
superset/commands/report/execute.py L972, tests/integration_tests/reports/commands_tests.py ~L1769
The source hardcodes "...stuck (possibly due to a worker crash)..." and the test checks "stuck" in log.error_message.lower(). There is no shared constant binding them. If either changes independently, the test silently degrades (e.g., message changes to "stalled" — test fails for the wrong reason, or message removes "stuck" — test passes incorrectly if another log happens to contain "stuck").
NIT
N1 — call_args tuple indexing is fragile
tests/unit_tests/commands/report/execute_test.py ~L1139–1141
assert call_args[0][0] == ReportState.NOOP # fragile positional
assert "stuck" in call_args[1]["error_message"].lower()Prefer the named-attribute style available in modern unittest.mock:
assert call_args.args[0] == ReportState.NOOP
assert "stuck" in call_args.kwargs["error_message"].lower()This makes intent explicit and avoids off-by-one errors if the call signature changes.
PRAISE
P1 — Docstring is exemplary. The ReportWorkingState class docstring (L929–950) is the clearest behavioral documentation in this module. It names both scenarios, states the invariant about broker requeue timing, and describes the resulting state transition. More like this.
P2 — logger.error → logger.warning is semantically correct. A working timeout caused by a worker crash is an operational event, not an application error. The level change reflects that accurately.
P3 — DRY recovery path. Calling ReportNotTriggeredErrorState(...).next() directly rather than duplicating the execution logic is the right call. The retry correctly inherits all the error-handling, state updates, and notification logic of the initial execution path.
Verdict
COMMENT — do not approve yet.
The semantic choice (NOOP + immediate retry vs. ERROR) is well-reasoned and correct for the primary use case (one-time worker crash). The unit test is solid. However:
- The integration test's weakened assertions (H1) don't verify the most important observable (
last_state), which means a state-machine regression could slip through. ReportScheduleWorkingTimeoutError(M1) is dead code that belongs in the same PR.- The persistent-OOM loop (H2) should be acknowledged with a log count or a note in the docstring, since owners lose the ERROR notification signal in that scenario.
These are all fixable with small additions — no architectural objection to the approach.
aminghadersohi
left a comment
There was a problem hiding this comment.
Formalizing my earlier comment as request-changes since the PR hasn't been updated. Three items need to be addressed before approval:
H1 — Integration test weakened too much (tests/integration_tests/reports/commands_tests.py ~L1762)
The new assertion (any("stuck" in log.error_message...)) doesn't verify last_state or that the retry ran. A regression leaving the report in an unexpected final state (WORKING, NOOP) would silently pass. Minimum addition:
assert create_report_slack_chart_working.last_state in (ReportState.ERROR, ReportState.NOOP)
noop_logs = [l for l in logs if l.state == ReportState.NOOP]
assert len(noop_logs) >= 1M1 — Dead exception class (superset/commands/report/exceptions.py)
ReportScheduleWorkingTimeoutError is no longer imported or raised anywhere after this PR. Remove it in the same commit.
H2 — Persistent OOM loop loses owner notification signal
For reports that are consistently OOM-killed (process dies, no Python exception), the cycle WORKING → NOOP → retry → WORKING (stuck) → … never surfaces an ERROR state and never notifies owners. The old behavior surfaced an ERROR on each timeout cycle. The docstring should acknowledge this tradeoff explicitly; a logger.warning with a reset counter or a note about monitoring would preserve operational visibility.
The semantic approach (NOOP + immediate retry instead of ERROR) is correct. Unit test is solid. Docstring is excellent. Just needs these fixes.
SUMMARY
When a Celery worker crashes (OOM, pod eviction) mid-execution, the report stays stuck in
WORKINGstate in the database. On the next scheduled tick,ReportWorkingState.next()checksis_on_working_timeout():Before: If the
working_timeouthas elapsed, it transitions toERRORand raisesReportScheduleWorkingTimeoutError. For daily schedules, this means no retry until the next day — a 24-hour wait for something that could recover immediately.After: It transitions to
NOOPinstead and returns (no exception). The next cron tick picks it up naturally and retries. This:working_timeoutper-report setting — no additional threshold to coordinateBEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
N/A — backend behavior change only.
TESTING INSTRUCTIONS
working_timeoutof e.g. 60 secondslast_statetoWORKINGin the DB and create aReportExecutionLogentry inWORKINGstate withend_dttmolder thanworking_timeoutERROR, raises exception, daily schedules wait until next dayNOOP, next cron tick retries itUnit tests updated to validate the new behavior.
ADDITIONAL INFORMATION
🤖 Generated with Claude Code