fix(worker): cap retry countdown to visibility timeout across all tasks #720
drazisil-codecov wants to merge 18 commits into main
Conversation
…imeout Prevents retry countdown from exceeding TASK_VISIBILITY_TIMEOUT_SECONDS, which could cause tasks to become invisible in the queue before being retried. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #720   +/-   ##
=======================================
  Coverage   92.24%   92.24%
=======================================
  Files        1302     1302
  Lines       47888    47896    +8
  Branches     1628     1628
=======================================
+ Hits        44175    44183    +8
  Misses       3404     3404
  Partials      309      309
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
…ceiling Extract retry countdown clamping into a helper `_clamp_retry_countdown` used by both retry paths in BundleAnalysisProcessorTask. The countdown is bounded between a floor of 30s and a ceiling of TASK_VISIBILITY_TIMEOUT_SECONDS - 30 (870s in production). The floor prevents a near-zero countdown when the visibility timeout is small (e.g. in tests or low-timeout environments). The ceiling ensures the ETA always fires before Redis can redeliver the task from `unacked`, with a 30s buffer instead of the previous 1s. Fixes CODECOV-59 Co-Authored-By: Claude <noreply@anthropic.com>
When TASK_VISIBILITY_TIMEOUT_SECONDS is small (e.g. in low-timeout environments or tests), TASK_VISIBILITY_TIMEOUT_SECONDS - 30 could be less than the floor of 30s, causing the floor to silently override the ceiling and making the ceiling meaningless. Compute _RETRY_COUNTDOWN_CEILING as max(visibility_timeout - 30, floor) so the two bounds are always consistent. Also removes the mocker patches from the clamp tests — they're no longer needed since the ceiling is now well-defined in all environments. Refs CODECOV-59 Co-Authored-By: Claude <noreply@anthropic.com>
Previously the floor was a hardcoded 30s, which could exceed the visibility timeout in low-timeout environments (e.g. 20s test env), meaning even the floor could trigger redelivery from unacked. Floor is now min(30, visibility_timeout // 2), so it scales down proportionally for small timeouts while staying at 30s in production. Examples: 900s visibility → floor=30s, ceiling=870s (unchanged) 20s visibility → floor=10s, ceiling=10s 10s visibility → floor=5s, ceiling=5s Refs CODECOV-59 Co-Authored-By: Claude <noreply@anthropic.com>
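A minimal sketch of the bounds computation this commit describes, assuming the constant names from the commit text and the 900s production visibility timeout cited in this PR:

```python
# Sketch of the scaled bounds described above (names assumed from the commit).
TASK_VISIBILITY_TIMEOUT_SECONDS = 900  # production value referenced in this PR

# Floor scales down for small visibility timeouts instead of a hardcoded 30s.
_RETRY_COUNTDOWN_FLOOR = min(30, TASK_VISIBILITY_TIMEOUT_SECONDS // 2)
# Ceiling keeps a 30s buffer but never drops below the floor.
_RETRY_COUNTDOWN_CEILING = max(TASK_VISIBILITY_TIMEOUT_SECONDS - 30, _RETRY_COUNTDOWN_FLOOR)

assert (_RETRY_COUNTDOWN_FLOOR, _RETRY_COUNTDOWN_CEILING) == (30, 870)
```

With a 20s timeout the same expressions give floor=10, ceiling=10, matching the examples above.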
…tasks Move the clamp_retry_countdown helper from bundle_analysis_processor to tasks/base.py so every task that schedules retries is protected from setting a countdown that exceeds the Celery visibility timeout. Any countdown longer than TASK_VISIBILITY_TIMEOUT_SECONDS causes Redis to redeliver the task from unacked before the ETA fires, creating duplicate executions that multiply exponentially. Clamping to [floor, ceiling] where both bounds are derived from TASK_VISIBILITY_TIMEOUT_SECONDS prevents this. Applied to: notify, manual_trigger, http_request, upload_processor, bundle_analysis_notify, bundle_analysis_processor, test_analytics_notifier, test_results_finisher. Refs CODECOV-59 Co-Authored-By: Claude <noreply@anthropic.com>
preprocess_upload, upload_finisher, and upload were missed in the previous commit. All three had unclamped LockRetry countdowns, and upload.py additionally takes max(lock_countdown, exponential_backoff) before scheduling the retry — compounding both unbounded sources. Refs CODECOV-59 Co-Authored-By: Claude <noreply@anthropic.com>
… tests Replace the hardcoded 5-hour cap in LockManager with TASK_VISIBILITY_TIMEOUT_SECONDS - 30, so LockRetry.countdown is bounded at the source rather than relying solely on call sites to clamp. Tasks still wrap with clamp_retry_countdown as a safety net. Move TestClampRetryCountdown from test_bundle_analysis_processor_task to test_base where it belongs. Refs CODECOV-59 Co-Authored-By: Claude <noreply@anthropic.com>
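A hedged sketch of the bounded backoff, assuming the `200 * 3**retry_num` formula quoted later in this PR description and the constant names from this commit (the real `LockManager` has more moving parts):

```python
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
MAX_RETRY_COUNTDOWN_SECONDS = TASK_VISIBILITY_TIMEOUT_SECONDS - 30  # was 60 * 60 * 5

def lock_retry_countdown(retry_num: int) -> int:
    # Cap the exponential backoff so the retry ETA always fires before
    # Redis can redeliver the task from unacked.
    return min(200 * 3**retry_num, MAX_RETRY_COUNTDOWN_SECONDS)

assert lock_retry_countdown(0) == 200
assert lock_retry_countdown(2) == 870  # uncapped would be 1800
```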
…otifier Lines 165 and 212 were missed in the previous pass. All three LockRetry handlers in this file now consistently wrap e.countdown. Refs CODECOV-59 Co-Authored-By: Claude <noreply@anthropic.com>
Clamping at the override ensures every self.retry() call is safe without requiring each call site to wrap its countdown. Removes all per-site clamp_retry_countdown() calls and their imports. Co-Authored-By: Claude <noreply@anthropic.com>
- Fix `countdown or 0` bug in BaseCodecovTask.retry() that was changing parameterless self.retry() calls from Celery's default 180s to 30s; now only clamps when countdown is explicitly provided - Update test_locked_exponential_backoff_retry_2 to assert countdown equals MAX_RETRY_COUNTDOWN_SECONDS since retry_num=2 always exceeds the visibility timeout cap (1800s uncapped > 870s cap) - Fix ruff formatting in test_analytics_notifier.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
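A sketch of what the fixed override looks like, assuming the names above (the real `BaseCodecovTask` is larger and this signature is simplified):

```python
import celery

class BaseCodecovTask(celery.Task):
    def clamp_retry_countdown(self, countdown):
        ...  # bounds the countdown as described in the commits above

    def retry(self, *args, countdown=None, **kwargs):
        # Only clamp an explicitly provided countdown. The old `countdown or 0`
        # turned parameterless retries (countdown=None) into countdown=0,
        # which the floor then bumped to 30s instead of Celery's 180s default.
        if countdown is not None:
            countdown = self.clamp_retry_countdown(countdown)
        return super().retry(*args, countdown=countdown, **kwargs)
```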
…omments Replace bare `30` literals with named constants to clarify intent: - `TASK_RETRY_COUNTDOWN_BUFFER_SECONDS` (shared/celery_config.py) — the safety margin subtracted from the visibility timeout ceiling, now shared between base.py and lock_manager.py instead of duplicated - `_RETRY_COUNTDOWN_FLOOR_SECONDS` (tasks/base.py) — the minimum retry delay, renamed for clarity Add comments explaining there is no hard requirement for these values and noting that our backoff is ~6× more aggressive than Celery's built-in default_retry_delay (180 s). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move TASK_RETRY_COUNTDOWN_FLOOR_SECONDS and TASK_RETRY_COUNTDOWN_CEILING_SECONDS into shared/celery_config.py so both tasks/base.py and services/lock_manager.py import them rather than each computing the same expression independently. base.py no longer needs TASK_VISIBILITY_TIMEOUT_SECONDS since all derived constants now live in celery_config. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rename TASK_RETRY_COUNTDOWN_FLOOR_SECONDS -> TASK_RETRY_COUNTDOWN_MIN_SECONDS and TASK_RETRY_COUNTDOWN_CEILING_SECONDS -> TASK_RETRY_COUNTDOWN_MAX_SECONDS for clarity. Clean up comments: drop the fragile 6x ratio (would go stale if the backoff base changes), remove the redundant MAX constant description (the expression is self-explanatory), and use short BUFFER/MIN labels instead of repeating full constant names. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace hardcoded 18000 (old 5-hour cap) with MAX_RETRY_COUNTDOWN_SECONDS and strengthen the assertion from <= to == since retry_num=10 always hits the cap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 30-second buffer in clamp_retry_countdown was arbitrary — no test could assert it must be 30 rather than any other value. The principled bound is: a task calling retry() may have been running for up to hard_time_limit_task seconds, so the safe countdown ceiling is TASK_VISIBILITY_TIMEOUT_SECONDS - hard_time_limit_task. This matches the same logic already used in get_lock_timeout(). Move clamp_retry_countdown from a standalone function to a method on BaseCodecovTask so it can read self.hard_time_limit_task. Falls back to TASK_RETRY_COUNTDOWN_MAX_SECONDS (visibility timeout - 30 s) for tasks with no configured hard limit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
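A worked example of that bound, with a hypothetical 480s hard limit against the 900s production visibility timeout:

```python
# Worked example (hard limit value assumed): a task delivered at t=0 may call
# retry() as late as t=480 (its hard time limit), so only 900 - 480 = 420
# seconds of the visibility window can safely remain as a retry countdown.
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
hard_time_limit_task = 480  # hypothetical per-task hard limit

safe_ceiling = TASK_VISIBILITY_TIMEOUT_SECONDS - hard_time_limit_task
assert safe_ceiling == 420
```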
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When tests patch BundleAnalysisProcessorTask.app with a MagicMock, self.app.conf.task_time_limit returns a MagicMock, which clamp_retry_countdown then tries to compare with > 0, raising TypeError. Add isinstance(hard_limit, int) check so non-int values fall through to the TASK_RETRY_COUNTDOWN_MAX_SECONDS fallback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
| """ | ||
| hard_limit = self.hard_time_limit_task | ||
| if isinstance(hard_limit, int) and hard_limit > 0: | ||
| return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit) |
Negative countdown when hard_time_limit exceeds visibility timeout
High Severity
clamp_retry_countdown computes TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit without guarding against hard_limit >= TASK_VISIBILITY_TIMEOUT_SECONDS. Several tasks have time_limit well above 900s (e.g., delete_owner at 2880s, mark_owner_for_deletion and sync_repos at 1440s). For those tasks the subtraction yields a large negative number (e.g., 900 − 2880 = −1980), and min(countdown, −1980) returns −1980. A negative Celery countdown schedules the retry in the past, causing immediate re-execution — the exact runaway-retry behavior this PR is trying to prevent.
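One possible guard, sketched (this is not code from the PR; `TASK_RETRY_COUNTDOWN_MIN_SECONDS` reuses the floor constant this PR introduces):

```python
def clamp_retry_countdown(self, countdown: int) -> int:
    # Hypothetical fix sketch: never let the per-task ceiling go negative.
    hard_limit = self.hard_time_limit_task
    if isinstance(hard_limit, int) and hard_limit > 0:
        # e.g. delete_owner's 2880s limit would make 900 - 2880 = -1980;
        # fall back to the floor instead of scheduling a retry in the past.
        ceiling = max(
            TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit,
            TASK_RETRY_COUNTDOWN_MIN_SECONDS,
        )
        return min(countdown, ceiling)
    return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)
```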
```python
        Falls back to TASK_RETRY_COUNTDOWN_MAX_SECONDS for tasks with no hard limit.
        """
        hard_limit = self.hard_time_limit_task
        if isinstance(hard_limit, int) and hard_limit > 0:
```
Type check excludes float hard time limits
Medium Severity
isinstance(hard_limit, int) rejects float values, but hard_time_limit_task can return floats via self.request.timelimit[0] (Celery stores time limits as int or float). When the hard limit is a float like 720.0, the per-task visibility cap is silently bypassed and the method falls through to the generic TASK_RETRY_COUNTDOWN_MAX_SECONDS. This is inconsistent with get_lock_timeout, which uses hard_time_limit_task without any type check.
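A sketch of a widened check (an assumption, not code from the PR): accept numeric hard limits while excluding `bool`, which is an `int` subclass:

```python
def clamp_retry_countdown(self, countdown: int) -> int:
    # Accept int *or* float hard limits; True/False are never meaningful
    # time limits, so bool is excluded explicitly.
    hard_limit = self.hard_time_limit_task
    if (
        isinstance(hard_limit, (int, float))
        and not isinstance(hard_limit, bool)
        and hard_limit > 0
    ):
        return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - int(hard_limit))
    return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)
```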
```diff
 log = logging.getLogger(__name__)


-MAX_RETRY_COUNTDOWN_SECONDS = 60 * 60 * 5
+MAX_RETRY_COUNTDOWN_SECONDS = TASK_RETRY_COUNTDOWN_MAX_SECONDS
```
we should probably just get rid of this variable
```diff
@@ -209,6 +212,20 @@ def hard_time_limit_task(self):
            return self.time_limit
        return self.app.conf.task_time_limit or 0
```
I think we should force this to int instead of doing the check later
Can a MagicMock be forced into an int? I will look.
```python
        Falls back to TASK_RETRY_COUNTDOWN_MAX_SECONDS for tasks with no hard limit.
        """
        hard_limit = self.hard_time_limit_task
        if isinstance(hard_limit, int) and hard_limit > 0:
```
if we do the above comment, you can just do `if self.hard_time_limit_task:`
The `isinstance` check is required for tests, where `self.app` is mocked and tests fail due to it not being an int. That is the only time the limit is ignored.
I think that means the way we mock this needs to change; we shouldn't build based on how we expect tests to be run.
Fair. I don't know if we can, since IIRC `self.app` is assigned by Celery at runtime, but I will investigate. I remember quite a bit of trouble getting that to exist when I started; I almost want to say it's a read-only property.
Godspeed, mocks are painful, but it's probably the right way to handle this if it's possible. If not, we should rethink how these are tested.
This property only exists while the task is being executed, unless I am misunderstanding https://docs.celeryq.dev/en/stable/userguide/tasks.html#task-request
I honestly cannot remember if we were even testing retries prior to the start of all my work. I will have to dig back.
Yes, these tests were created by me. This wasn't being tested before. I don't know if there is another way to do this; you will note that I'm patching it in to begin with: 427ec64
I'll look through my prior tries.
```python
            return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
        return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)
```
Seems like there's a logic disparity here; can you explain how we'd have hard_limit actually be 0 and why we don't care about the visibility timeout here?
Per comment above, this is only for tests where the timeout is not a number.


Any `self.retry(countdown=N)` where `N > TASK_VISIBILITY_TIMEOUT_SECONDS` causes Redis to redeliver the task from `unacked` before the ETA fires, creating duplicate executions that multiply exponentially. This is the root cause of the bundle analysis queue explosions documented in CODECOV-59.
What was broken
Two retry paths were affected across multiple tasks:
Lock-contention retries (`LockRetry.countdown`) — `LockManager` uses `200 * 3**retry_num`, which reaches ~1800s at `retry_num=2` (above the 900s visibility timeout). This formula predates PR #589 for bundle analysis tasks, but #589 extended `LockManager` to cover the entire upload pipeline (`UPLOAD`, `UPLOAD_PROCESSING`, `UPLOAD_FINISHER`, `PREPROCESS_UPLOAD`, `MANUAL_TRIGGER`), significantly expanding the blast radius.
Exponential backoff retries — Various tasks use formulas like `30 * 2**retries`, `20 * 2**retries`, `60 * 3**retries`. These all exceed 900s within a handful of retries and had no upper bound. (See the sketch after this list for how quickly each formula crosses the threshold.)
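For a concrete sense of scale, here is how the formulas quoted above grow against the 900s window (plain arithmetic, not code from the PR):

```python
# Retry delays in seconds for each backoff formula cited above.
for label, base, factor in [
    ("LockManager 200*3**n", 200, 3),
    ("30*2**n", 30, 2),
    ("20*2**n", 20, 2),
    ("60*3**n", 60, 3),
]:
    delays = [base * factor**n for n in range(7)]
    first_unsafe = next(n for n, d in enumerate(delays) if d > 900)
    print(f"{label}: exceeds 900s at retry {first_unsafe} ({delays[first_unsafe]}s)")

# LockManager exceeds at retry 2 (1800s); 30*2**n at retry 5 (960s);
# 20*2**n at retry 6 (1280s); 60*3**n at retry 3 (1620s).
```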
Fix
Two layers of defense:
`LockManager.MAX_RETRY_COUNTDOWN_SECONDS` is now bounded at `TASK_RETRY_COUNTDOWN_MAX_SECONDS` (was 5 hours).
`BaseCodecovTask.clamp_retry_countdown()` caps any explicitly-provided countdown before it is passed to Celery. Parameterless `self.retry()` calls (`countdown=None`) are left alone so Celery's built-in 180s default is preserved — 180s is already within the safe window, and overriding it would silently change retry cadence for cleanup tasks that rely on that default.
The cap is derived from the task's own `hard_time_limit_task`: a task calling `retry()` may have been running for up to that many seconds, so the safe ceiling is `TASK_VISIBILITY_TIMEOUT_SECONDS - hard_time_limit_task`. This is the same reasoning already used in `get_lock_timeout()`. Tasks with no configured hard limit fall back to `TASK_RETRY_COUNTDOWN_MAX_SECONDS` (visibility timeout − 30 s).
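Restated standalone for illustration (the 900s and 30s values match those cited above; `clamp` here is a stand-in for the method on `BaseCodecovTask`, not the exact diff):

```python
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
TASK_RETRY_COUNTDOWN_MAX_SECONDS = TASK_VISIBILITY_TIMEOUT_SECONDS - 30

def clamp(countdown, hard_limit=None):
    # Stand-in for BaseCodecovTask.clamp_retry_countdown().
    if isinstance(hard_limit, int) and hard_limit > 0:
        return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
    return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)

assert clamp(1800, hard_limit=120) == 780   # ETA fires before redelivery
assert clamp(1800) == 870                   # no hard limit: fallback ceiling
assert clamp(60, hard_limit=120) == 60      # short countdowns pass through
```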
Fixes CODECOV-59
Follows up on CCMRG-2042 and #589.
Note
Medium Risk
Changes core retry timing across many worker tasks; while intended to prevent Redis redelivery duplication, it could alter retry cadence (and edge-case countdown values) for long-running or misconfigured time-limit tasks.
Overview
Prevents runaway duplicate task execution by capping all explicit `self.retry(countdown=...)` delays to fit within the remaining Redis `TASK_VISIBILITY_TIMEOUT_SECONDS` window.
Adds `BaseCodecovTask.clamp_retry_countdown()` and applies it automatically in the overridden `BaseCodecovTask.retry()`, while introducing a shared `TASK_RETRY_COUNTDOWN_MAX_SECONDS` fallback for tasks without a hard time limit.
Also reduces `LockManager`'s exponential backoff cap from a hardcoded 5 hours to `TASK_RETRY_COUNTDOWN_MAX_SECONDS`, and updates/extends unit tests to assert the new clamping behavior.
Written by Cursor Bugbot for commit 8a3a18a. This will update automatically on new commits.