fix(alerts): reset stuck WORKING reports to NOOP instead of ERROR on timeout by dpgaspar · Pull Request #39779 · apache/superset

dpgaspar · 2026-04-30T12:35:13Z

SUMMARY

When a Celery worker crashes (OOM, pod eviction) mid-execution, the report stays stuck in WORKING state in the database. On the next scheduled tick, ReportWorkingState.next() checks is_on_working_timeout():

Before: If the working_timeout has elapsed, it transitions to ERROR and raises ReportScheduleWorkingTimeoutError. For daily schedules, this means no retry until the next day — a 24-hour wait for something that could recover immediately.

After: It transitions to NOOP instead and returns (no exception). The next cron tick picks it up naturally and retries. This:

Fixes daily schedules: They recover on the next tick instead of waiting 24 hours
Avoids duplicate execution risk: No immediate retry that could clash with celery's broker requeue (~30 min)
No new config: Uses the existing working_timeout per-report setting — no additional threshold to coordinate
Single behavior: One config, one meaning — "how long before a WORKING state is considered stuck"

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A — backend behavior change only.

TESTING INSTRUCTIONS

Create a report schedule with a working_timeout of e.g. 60 seconds
Manually set the report's last_state to WORKING in the DB and create a ReportExecutionLog entry in WORKING state with end_dttm older than working_timeout
Trigger the scheduler
Before: Report transitions to ERROR, raises exception, daily schedules wait until next day
After: Report transitions to NOOP, next cron tick retries it

Unit tests updated to validate the new behavior.

ADDITIONAL INFORMATION

Has associated issue: SC-104379
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

🤖 Generated with Claude Code

…timeout When a Celery worker crashes (OOM, pod eviction) mid-execution, the report stays stuck in WORKING state. Previously, after working_timeout elapsed, it transitioned to ERROR — which for daily schedules meant no retry until the next day (24-hour wait). Now it transitions to NOOP instead, so the next scheduled cron tick picks it up and retries naturally. This avoids duplicate execution risk (no immediate retry that could clash with celery broker requeue) while ensuring daily schedules recover promptly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bito-code-review · 2026-04-30T12:35:23Z

Code Review Agent Run #c62461

Actionable Suggestions - 0

Review Details

Files reviewed - 3 · Commit Range: 856735d..856735d
- superset/commands/report/execute.py
- tests/integration_tests/reports/commands_tests.py
- tests/unit_tests/commands/report/execute_test.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

/review - Manually triggers a full AI review.
/pause - Pauses automatic reviews on this pull request.
/resume - Resumes automatic reviews.
/resolve - Marks all Bito-posted review comments as resolved.
/abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by

codecov · 2026-04-30T12:46:44Z

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 64.37%. Comparing base (dc1c0f6) to head (1df27dd).

Files with missing lines	Patch %	Lines
superset/commands/report/execute.py	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #39779      +/-   ##
==========================================
+ Coverage   64.35%   64.37%   +0.01%     
==========================================
  Files        2569     2569              
  Lines      134680   134619      -61     
  Branches    31254    31224      -30     
==========================================
- Hits        86679    86665      -14     
+ Misses      46505    46455      -50     
- Partials     1496     1499       +3

Flag	Coverage Δ
hive	`39.67% <0.00%> (?)`
mysql	`59.94% <66.66%> (+<0.01%)`	⬆️
postgres	`60.03% <66.66%> (+<0.01%)`	⬆️
presto	`41.42% <0.00%> (ø)`
python	`61.56% <66.66%> (+0.05%)`	⬆️
sqlite	`59.65% <66.66%> (+<0.01%)`	⬆️
unit	`100.00% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…etrying immediately When a Celery worker crashes (OOM, pod eviction) mid-execution, the report stays stuck in WORKING state. Previously, after working_timeout elapsed, it transitioned to ERROR — which for daily schedules meant the report wouldn't actually retry until two days later (day 1: crash, day 2: timeout fires → ERROR, day 3: cron sees ERROR → retries). Now on working_timeout it resets to NOOP and immediately re-executes via ReportNotTriggeredErrorState in the same tick. This is safe because by the time working_timeout (typically >= 1 hour) has elapsed, any celery broker requeue (~30 min) has already been attempted and rejected with ReportSchedulePreviousWorkingError — so there is no duplicate execution risk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

netlify · 2026-04-30T13:22:57Z

✅ Deploy Preview for superset-docs-preview ready!

Name	Link
🔨 Latest commit	`1df27dd`
🔍 Latest deploy log	https://app.netlify.com/projects/superset-docs-preview/deploys/69f84b2cf1ff6f0008a19aa0
😎 Deploy Preview	https://deploy-preview-39779--superset-docs-preview.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.
🤖 Make changes	Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

…meout The retry via ReportNotTriggeredErrorState now actually executes, which fails in CI (no webdriver). Updated the test to expect a CommandException from the retry and verify the NOOP reset happened as an intermediate step. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bito-code-review · 2026-05-01T04:10:26Z

Code Review Agent Run #2bbdc5

Actionable Suggestions - 0

Review Details

Files reviewed - 3 · Commit Range: 856735d..fd7a8a3
- superset/commands/report/execute.py
- tests/unit_tests/commands/report/execute_test.py
- tests/integration_tests/reports/commands_tests.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

/review - Manually triggers a full AI review.
/pause - Pauses automatic reviews on this pull request.
/resume - Resumes automatic reviews.
/resolve - Marks all Bito-posted review comments as resolved.
/abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by

…working-properly-for-opentable

Antonio-RiveroMartnez · 2026-05-04T14:50:24Z

            self.update_report_schedule_and_log(
-                ReportState.ERROR,
-                error_message=str(exception_timeout),
+                ReportState.NOOP,


By setting as NOOP and recovering the stuck state, would we be introducing an infinite loop? Asking because I see no "stop" mechanism for the attempts after automatically re-scheduling X amount of time?

bito-code-review · 2026-05-04T14:52:04Z

The change resets stuck reports to NOOP and retries immediately via ReportNotTriggeredErrorState, which could lead to repeated retries if the report execution keeps getting stuck after each timeout. However, the design assumes this is safe post-celery requeue attempts, preventing infinite loops in normal scenarios.

aminghadersohi

Code Review — apache/superset PR #39779

HEAD SHA: 1df27dde987637ad7a7ad6bf71565ec60e79d567
Files changed: superset/commands/report/execute.py, tests/integration_tests/reports/commands_tests.py, tests/unit_tests/commands/report/execute_test.py
Second opinion: Not needed — 3 files, ~80 lines, no security/migration/multi-tenant code.

Automated scan coverage

#	Scan	Result
1	Inline imports inside functions	1 pre-existing (`StaleDataError` in `create_log`), 0 new
2	Bare `except Exception` / swallowed exceptions	Pre-existing instances all have `# pylint: disable=broad-except`, 0 new
3	`Optional[X]` type hints (should be `X \| None`)	0 new
4	`datetime.utcnow()` usage	0 new (pre-existing throughout file)
5	Missing `@transaction()` on DB writes	New writes in `ReportWorkingState.next()` called from `ReportScheduleStateMachine.run()` which carries `@transaction()` — OK
6	`db.session.add()` without flush/commit	0 new
7	Raw SQL / string interpolation in queries	0
8	Missing error handling on external calls	0
9	Walrus operator opportunities	0 applicable cases
10	Missing type hints	0 new violations — `next() -> None` correctly typed
11	f-strings in logger calls	0 — all new logger calls correctly use `%s`/`%.2f`
12	Hardcoded strings that should be constants	1 — see MEDIUM #3
13	Missing model migrations	N/A — no model changes
14	Test assertion patterns	1 concern — see HIGH #1

BLOCKER

None.

HIGH

H1 — Integration test dropped critical assertions

tests/integration_tests/reports/commands_tests.py ~L1762

The old test verified three concrete postconditions:

assert len(logs) == 2
assert ReportScheduleWorkingTimeoutError.message in [log.error_message for log in logs]
assert create_report_slack_chart_working.last_state == ReportState.ERROR

The new test collapses this to:

assert any("stuck" in (log.error_message or "").lower() for log in logs)

This passes as long as a log entry containing "stuck" exists — it does not verify:

The final last_state of the report schedule (the most important observable)
That the retry actually ran (WORKING log entry exists)
The number of log entries created

A regression where the state machine ends up in an unexpected final state (e.g., stays NOOP or reverts to WORKING) would not be caught. The test should be extended:

# After the with-block:
assert create_report_slack_chart_working.last_state in (ReportState.ERROR, ReportState.NOOP)
# verify the noop reset log exists
noop_logs = [l for l in logs if l.state == ReportState.NOOP]
assert len(noop_logs) >= 1
assert "stuck" in (noop_logs[0].error_message or "").lower()

H2 — Persistent OOM-kill loop has no error signal

superset/commands/report/execute.py L954–983

The docstring correctly describes the recovery path for one-time worker crashes. But for reports that are consistently killed by OOM (not via a Python exception — no SoftTimeLimitExceeded raised, the process just dies), the cycle becomes:

WORKING → [OOM kill] → WORKING (stuck)
  → timeout → NOOP → immediate retry → [OOM kill again] → WORKING (stuck)
  → timeout → NOOP → ...

Under the old behavior each timeout cycle surfaced an ERROR state and sent an owner notification. Under the new behavior the ERROR state is only set if the retry raises a catchable exception. For OOM-killed retries that also get stuck, owners get no notification between cycles — only the NOOP reset log is written.

This concern is bounded (cycle rate = 1 per working_timeout, typically ≥ 1 hr), but it's a real visibility degradation for diagnosing persistently broken reports. The docstring's comment "this is safe because…" should explicitly acknowledge this tradeoff. A logger.warning that counts resets (or a metric increment) would preserve operational observability.

MEDIUM

M1 — Dead exception class: `ReportScheduleWorkingTimeoutError`

superset/commands/report/exceptions.py L212–214

The class is no longer imported or raised anywhere after this PR. It should be removed in the same commit. Leaving dead exception classes in exceptions.py misleads readers into thinking the timeout path still raises it, and future code may accidentally reference it.

# Remove this entire class:
class ReportScheduleWorkingTimeoutError(CommandException):
    status = 408
    message = _("Report Schedule reached a working timeout.")

M2 — Missing test for the success path after timeout reset

The unit and integration tests only cover the failure path (retry fails with CommandException). There is no test verifying that a stuck report which resets and retries successfully ends in ReportState.SUCCESS. The success path exercises a different code branch in ReportNotTriggeredErrorState.next() (no exception, update_report_schedule_and_log(SUCCESS) is called). A unit test mocking the retry to succeed would close this gap with minimal effort.

M3 — Error message string implicitly contracts with the test

superset/commands/report/execute.py L972, tests/integration_tests/reports/commands_tests.py ~L1769

The source hardcodes "...stuck (possibly due to a worker crash)..." and the test checks "stuck" in log.error_message.lower(). There is no shared constant binding them. If either changes independently, the test silently degrades (e.g., message changes to "stalled" — test fails for the wrong reason, or message removes "stuck" — test passes incorrectly if another log happens to contain "stuck").

NIT

N1 — `call_args` tuple indexing is fragile

tests/unit_tests/commands/report/execute_test.py ~L1139–1141

assert call_args[0][0] == ReportState.NOOP           # fragile positional
assert "stuck" in call_args[1]["error_message"].lower()

Prefer the named-attribute style available in modern unittest.mock:

assert call_args.args[0] == ReportState.NOOP
assert "stuck" in call_args.kwargs["error_message"].lower()

This makes intent explicit and avoids off-by-one errors if the call signature changes.

PRAISE

P1 — Docstring is exemplary. The ReportWorkingState class docstring (L929–950) is the clearest behavioral documentation in this module. It names both scenarios, states the invariant about broker requeue timing, and describes the resulting state transition. More like this.

P2 — logger.error → logger.warning is semantically correct. A working timeout caused by a worker crash is an operational event, not an application error. The level change reflects that accurately.

P3 — DRY recovery path. Calling ReportNotTriggeredErrorState(...).next() directly rather than duplicating the execution logic is the right call. The retry correctly inherits all the error-handling, state updates, and notification logic of the initial execution path.

Verdict

COMMENT — do not approve yet.

The semantic choice (NOOP + immediate retry vs. ERROR) is well-reasoned and correct for the primary use case (one-time worker crash). The unit test is solid. However:

The integration test's weakened assertions (H1) don't verify the most important observable (last_state), which means a state-machine regression could slip through.
ReportScheduleWorkingTimeoutError (M1) is dead code that belongs in the same PR.
The persistent-OOM loop (H2) should be acknowledged with a log count or a note in the docstring, since owners lose the ERROR notification signal in that scenario.

These are all fixable with small additions — no architectural objection to the approach.

aminghadersohi

Formalizing my earlier comment as request-changes since the PR hasn't been updated. Three items need to be addressed before approval:

H1 — Integration test weakened too much (tests/integration_tests/reports/commands_tests.py ~L1762)
The new assertion (any("stuck" in log.error_message...)) doesn't verify last_state or that the retry ran. A regression leaving the report in an unexpected final state (WORKING, NOOP) would silently pass. Minimum addition:

assert create_report_slack_chart_working.last_state in (ReportState.ERROR, ReportState.NOOP)
noop_logs = [l for l in logs if l.state == ReportState.NOOP]
assert len(noop_logs) >= 1

M1 — Dead exception class (superset/commands/report/exceptions.py)
ReportScheduleWorkingTimeoutError is no longer imported or raised anywhere after this PR. Remove it in the same commit.

H2 — Persistent OOM loop loses owner notification signal
For reports that are consistently OOM-killed (process dies, no Python exception), the cycle WORKING → NOOP → retry → WORKING (stuck) → … never surfaces an ERROR state and never notifies owners. The old behavior surfaced an ERROR on each timeout cycle. The docstring should acknowledge this tradeoff explicitly; a logger.warning with a reset counter or a note about monitoring would preserve operational visibility.

The semantic approach (NOOP + immediate retry instead of ERROR) is correct. Unit test is solid. Docstring is excellent. Just needs these fixes.

pull-request-size Bot added the size/M label Apr 30, 2026

dosubot Bot added alert-reports Namespace | Anything related to the Alert & Reports feature change:backend Requires changing the backend labels Apr 30, 2026

dpgaspar added 2 commits May 1, 2026 17:21

Merge branch 'master' into danielgaspar/sc-104379/alert-schedule-not-…

96cc8c5

…working-properly-for-opentable

Merge branch 'master' into danielgaspar/sc-104379/alert-schedule-not-…

1df27dd

…working-properly-for-opentable

dpgaspar requested review from Antonio-RiveroMartnez, Vitor-Avila, aminghadersohi and eschutho May 4, 2026 14:42

Antonio-RiveroMartnez reviewed May 4, 2026

View reviewed changes

aminghadersohi reviewed May 20, 2026

View reviewed changes

aminghadersohi requested changes May 20, 2026

View reviewed changes

aminghadersohi self-requested a review May 20, 2026 22:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(alerts): reset stuck WORKING reports to NOOP instead of ERROR on timeout#39779

fix(alerts): reset stuck WORKING reports to NOOP instead of ERROR on timeout#39779
dpgaspar wants to merge 5 commits into
apache:masterfrom
preset-io:danielgaspar/sc-104379/alert-schedule-not-working-properly-for-opentable

dpgaspar commented Apr 30, 2026

Uh oh!

bito-code-review Bot commented Apr 30, 2026 •

edited

Loading

Code Review Agent Run #c62461

Uh oh!

codecov Bot commented Apr 30, 2026 •

edited

Loading

Uh oh!

netlify Bot commented Apr 30, 2026 •

edited

Loading

Uh oh!

bito-code-review Bot commented May 1, 2026 •

edited

Loading

Code Review Agent Run #2bbdc5

Uh oh!

Antonio-RiveroMartnez May 4, 2026

Uh oh!

bito-code-review Bot commented May 4, 2026

Uh oh!

aminghadersohi left a comment

Uh oh!

aminghadersohi left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dpgaspar commented Apr 30, 2026

SUMMARY

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

Uh oh!

bito-code-review Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Agent Run #c62461

Uh oh!

codecov Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

netlify Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for superset-docs-preview ready!

Uh oh!

bito-code-review Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Agent Run #2bbdc5

Uh oh!

Antonio-RiveroMartnez May 4, 2026

Choose a reason for hiding this comment

Uh oh!

bito-code-review Bot commented May 4, 2026

Uh oh!

aminghadersohi left a comment

Choose a reason for hiding this comment

Code Review — apache/superset PR #39779

Automated scan coverage

BLOCKER

HIGH

H1 — Integration test dropped critical assertions

H2 — Persistent OOM-kill loop has no error signal

MEDIUM

M1 — Dead exception class: ReportScheduleWorkingTimeoutError

M2 — Missing test for the success path after timeout reset

M3 — Error message string implicitly contracts with the test

NIT

N1 — call_args tuple indexing is fragile

PRAISE

Verdict

Uh oh!

aminghadersohi left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bito-code-review Bot commented Apr 30, 2026 •

edited

Loading

codecov Bot commented Apr 30, 2026 •

edited

Loading

netlify Bot commented Apr 30, 2026 •

edited

Loading

bito-code-review Bot commented May 1, 2026 •

edited

Loading

M1 — Dead exception class: `ReportScheduleWorkingTimeoutError`

N1 — `call_args` tuple indexing is fragile