
fix(worker): cap retry countdown to visibility timeout across all tasks#720

Open
drazisil-codecov wants to merge 18 commits into main from joebecher/cap-bundle-analysis-retry-countdown

Conversation

@drazisil-codecov
Contributor

@drazisil-codecov drazisil-codecov commented Feb 23, 2026

Any `self.retry(countdown=N)` where `N > TASK_VISIBILITY_TIMEOUT_SECONDS` causes Redis to redeliver the task from `unacked` before the ETA fires, creating duplicate executions that multiply exponentially. This is the root cause of the bundle analysis queue explosions documented in CODECOV-59.

What was broken

Two retry paths were affected across multiple tasks:

Lock-contention retries (`LockRetry.countdown`) — `LockManager` uses `200 * 3**retry_num`, which reaches ~1800s at `retry_num=2` (above the 900s visibility timeout). This formula predates PR #589 for bundle analysis tasks, but #589 extended `LockManager` to cover the entire upload pipeline (`UPLOAD`, `UPLOAD_PROCESSING`, `UPLOAD_FINISHER`, `PREPROCESS_UPLOAD`, `MANUAL_TRIGGER`), significantly expanding the blast radius.

Exponential backoff retries — Various tasks use formulas like `30 * 2**retries`, `20 * 2**retries`, and `60 * 3**retries`. These all exceed 900s within a handful of retries and had no upper bound.
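For a concrete sense of the failure window, the growth of these formulas can be sketched (a quick illustration, not code from the PR; the constant value is the 900s timeout stated above):

```python
# Sketch: where each backoff formula crosses the 900s visibility timeout.
TASK_VISIBILITY_TIMEOUT_SECONDS = 900  # value stated above

lock_backoff = [200 * 3**n for n in range(4)]  # LockManager: 200 * 3**retry_num
expo_backoff = [30 * 2**n for n in range(7)]   # one of the task formulas

print(lock_backoff)  # [200, 600, 1800, 5400] -> unsafe from retry_num=2
print(expo_backoff)  # [30, 60, 120, 240, 480, 960, 1920] -> unsafe from retries=5

first_unsafe = next(
    n for n, c in enumerate(lock_backoff) if c > TASK_VISIBILITY_TIMEOUT_SECONDS
)
print(first_unsafe)  # 2
```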

Fix

Two layers of defense:

  1. `LockManager.MAX_RETRY_COUNTDOWN_SECONDS` is now `TASK_RETRY_COUNTDOWN_MAX_SECONDS` (previously a hardcoded 5 hours).

  2. `BaseCodecovTask.clamp_retry_countdown()` caps any explicitly-provided countdown before it is passed to Celery. Parameterless `self.retry()` calls (`countdown=None`) are left alone so Celery's built-in 180s default is preserved — 180s is already within the safe window, and overriding it would silently change retry cadence for cleanup tasks that rely on that default.

The cap is derived from the task's own `hard_time_limit_task`: a task calling `retry()` may have been running for up to that many seconds, so the safe ceiling is `TASK_VISIBILITY_TIMEOUT_SECONDS - hard_time_limit_task`. This is the same reasoning already used in `get_lock_timeout()`. Tasks with no configured hard limit fall back to `TASK_RETRY_COUNTDOWN_MAX_SECONDS` (visibility timeout − 30 s).

```python
# In BaseCodecovTask
def clamp_retry_countdown(self, countdown: int) -> int:
    hard_limit = self.hard_time_limit_task
    if hard_limit > 0:
        return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
    return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)
```
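A runnable sketch of how this clamp interacts with the overridden `retry()` (hypothetical class and example values, not the PR's exact code; a real override would delegate to Celery's `Task.retry`):

```python
# Hypothetical sketch of the override; names follow the PR, values are examples.
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
TASK_RETRY_COUNTDOWN_MAX_SECONDS = TASK_VISIBILITY_TIMEOUT_SECONDS - 30

class BaseCodecovTaskSketch:
    hard_time_limit_task = 480  # example per-task hard limit, in seconds

    def clamp_retry_countdown(self, countdown: int) -> int:
        hard_limit = self.hard_time_limit_task
        if isinstance(hard_limit, int) and hard_limit > 0:
            return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
        return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)

    def retry(self, countdown=None, **kwargs):
        # countdown=None passes through so Celery's built-in 180s default applies
        if countdown is not None:
            countdown = self.clamp_retry_countdown(countdown)
        return countdown  # a real override would call super().retry(countdown=...)

task = BaseCodecovTaskSketch()
print(task.retry(countdown=1800))  # 420 (900 - 480)
print(task.retry())                # None -> Celery default preserved
```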

Fixes CODECOV-59

Follows up on CCMRG-2042 and #589.


Note

Medium Risk
Changes core retry timing across many worker tasks; while intended to prevent Redis redelivery duplication, it could alter retry cadence (and edge-case countdown values) for long-running or misconfigured time-limit tasks.

Overview
Prevents runaway duplicate task execution by capping all explicit self.retry(countdown=...) delays to fit within the remaining Redis TASK_VISIBILITY_TIMEOUT_SECONDS window.

Adds BaseCodecovTask.clamp_retry_countdown() and applies it automatically in the overridden BaseCodecovTask.retry(), while introducing a shared TASK_RETRY_COUNTDOWN_MAX_SECONDS fallback for tasks without a hard time limit.

Also reduces LockManager’s exponential backoff cap from a hardcoded 5 hours to TASK_RETRY_COUNTDOWN_MAX_SECONDS, and updates/extends unit tests to assert the new clamping behavior.

Written by Cursor Bugbot for commit 8a3a18a. This will update automatically on new commits.

…imeout

Prevents retry countdown from exceeding TASK_VISIBILITY_TIMEOUT_SECONDS,
which could cause tasks to become invisible in the queue before being retried.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@linear

linear bot commented Feb 23, 2026

@drazisil-codecov drazisil-codecov marked this pull request as ready for review February 23, 2026 13:28
@sentry

sentry bot commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.24%. Comparing base (aae0aa4) to head (8a3a18a).
✅ All tests successful. No failed tests found.

Additional details and impacted files
```
@@           Coverage Diff           @@
##             main     #720   +/-   ##
=======================================
  Coverage   92.24%   92.24%
=======================================
  Files        1302     1302
  Lines       47888    47896    +8
  Branches     1628     1628
=======================================
+ Hits        44175    44183    +8
  Misses       3404     3404
  Partials      309      309
```
| Flag | Coverage Δ |
| --- | --- |
| apiunit | 96.36% <ø> (ø) |
| sharedintegration | 37.01% <100.00%> (+<0.01%) ⬆️ |
| sharedunit | 84.89% <100.00%> (+<0.01%) ⬆️ |
| workerintegration | 58.63% <76.92%> (+0.01%) ⬆️ |
| workerunit | 90.33% <84.61%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@codecov-notifications

codecov-notifications bot commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


…ceiling

Extract retry countdown clamping into a helper `_clamp_retry_countdown`
used by both retry paths in BundleAnalysisProcessorTask. The countdown
is bounded between a floor of 30s and a ceiling of
TASK_VISIBILITY_TIMEOUT_SECONDS - 30 (870s in production).

The floor prevents a near-zero countdown when the visibility timeout is
small (e.g. in tests or low-timeout environments). The ceiling ensures
the ETA always fires before Redis can redeliver the task from `unacked`,
with a 30s buffer instead of the previous 1s.

Fixes CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
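The bounds described in this commit can be sketched as follows (names assumed from the message above; the 900s value is the production visibility timeout):

```python
# Sketch of the commit's floor/ceiling clamp.
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
_RETRY_COUNTDOWN_FLOOR = 30
_RETRY_COUNTDOWN_CEILING = TASK_VISIBILITY_TIMEOUT_SECONDS - 30  # 870 in production

def _clamp_retry_countdown(countdown: int) -> int:
    # Bound the countdown to [floor, ceiling] so the ETA fires before redelivery.
    return max(_RETRY_COUNTDOWN_FLOOR, min(countdown, _RETRY_COUNTDOWN_CEILING))

print(_clamp_retry_countdown(5))     # 30 (floor)
print(_clamp_retry_countdown(1800))  # 870 (ceiling)
print(_clamp_retry_countdown(300))   # 300 (unchanged)
```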
When TASK_VISIBILITY_TIMEOUT_SECONDS is small (e.g. in low-timeout
environments or tests), TASK_VISIBILITY_TIMEOUT_SECONDS - 30 could be
less than the floor of 30s, causing the floor to silently override
the ceiling and making the ceiling meaningless.

Compute _RETRY_COUNTDOWN_CEILING as max(visibility_timeout - 30, floor)
so the two bounds are always consistent. Also removes the mocker patches
from the clamp tests — they're no longer needed since the ceiling is now
well-defined in all environments.

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
drazisil-codecov and others added 2 commits February 23, 2026 11:28
Previously the floor was a hardcoded 30s, which could exceed the
visibility timeout in low-timeout environments (e.g. 20s test env),
meaning even the floor could trigger redelivery from unacked.

Floor is now min(30, visibility_timeout // 2), so it scales down
proportionally for small timeouts while staying at 30s in production.

Examples:
  900s visibility → floor=30s, ceiling=870s (unchanged)
   20s visibility → floor=10s, ceiling=10s
   10s visibility → floor=5s,  ceiling=5s

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
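The scaled bounds and the examples above can be checked directly (a sketch; the function name is illustrative):

```python
def retry_bounds(visibility_timeout: int) -> tuple[int, int]:
    # Floor scales down for small timeouts; ceiling never drops below the floor.
    floor = min(30, visibility_timeout // 2)
    ceiling = max(visibility_timeout - 30, floor)
    return floor, ceiling

print(retry_bounds(900))  # (30, 870)
print(retry_bounds(20))   # (10, 10)
print(retry_bounds(10))   # (5, 5)
```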
…tasks

Move the clamp_retry_countdown helper from bundle_analysis_processor to
tasks/base.py so every task that schedules retries is protected from
setting a countdown that exceeds the Celery visibility timeout.

Any countdown longer than TASK_VISIBILITY_TIMEOUT_SECONDS causes Redis to
redeliver the task from unacked before the ETA fires, creating duplicate
executions that multiply exponentially. Clamping to [floor, ceiling] where
both bounds are derived from TASK_VISIBILITY_TIMEOUT_SECONDS prevents this.

Applied to: notify, manual_trigger, http_request, upload_processor,
bundle_analysis_notify, bundle_analysis_processor, test_analytics_notifier,
test_results_finisher.

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
@drazisil-codecov drazisil-codecov changed the title fix(worker): cap bundle analysis retry countdown to task visibility timeout fix(worker): cap retry countdown to visibility timeout across all tasks Feb 23, 2026
preprocess_upload, upload_finisher, and upload were missed in the
previous commit. All three had unclamped LockRetry countdowns, and
upload.py additionally takes max(lock_countdown, exponential_backoff)
before scheduling the retry — compounding both unbounded sources.

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
drazisil-codecov and others added 2 commits February 23, 2026 12:26
… tests

Replace the hardcoded 5-hour cap in LockManager with
TASK_VISIBILITY_TIMEOUT_SECONDS - 30, so LockRetry.countdown is
bounded at the source rather than relying solely on call sites to
clamp. Tasks still wrap with clamp_retry_countdown as a safety net.

Move TestClampRetryCountdown from test_bundle_analysis_processor_task
to test_base where it belongs.

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
…otifier

Lines 165 and 212 were missed in the previous pass. All three LockRetry
handlers in this file now consistently wrap e.countdown.

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
Clamping at the override ensures every self.retry() call is safe without
requiring each call site to wrap its countdown. Removes all per-site
clamp_retry_countdown() calls and their imports.

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix `countdown or 0` bug in BaseCodecovTask.retry() that was changing
  parameterless self.retry() calls from Celery's default 180s to 30s;
  now only clamps when countdown is explicitly provided
- Update test_locked_exponential_backoff_retry_2 to assert countdown
  equals MAX_RETRY_COUNTDOWN_SECONDS since retry_num=2 always exceeds
  the visibility timeout cap (1800s uncapped > 870s cap)
- Fix ruff formatting in test_analytics_notifier.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
drazisil-codecov and others added 2 commits February 25, 2026 10:09
…omments

Replace bare `30` literals with named constants to clarify intent:
- `TASK_RETRY_COUNTDOWN_BUFFER_SECONDS` (shared/celery_config.py) — the
  safety margin subtracted from the visibility timeout ceiling, now shared
  between base.py and lock_manager.py instead of duplicated
- `_RETRY_COUNTDOWN_FLOOR_SECONDS` (tasks/base.py) — the minimum retry
  delay, renamed for clarity

Add comments explaining there is no hard requirement for these values and
noting that our backoff is ~6× more aggressive than Celery's built-in
default_retry_delay (180 s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move TASK_RETRY_COUNTDOWN_FLOOR_SECONDS and TASK_RETRY_COUNTDOWN_CEILING_SECONDS
into shared/celery_config.py so both tasks/base.py and services/lock_manager.py
import them rather than each computing the same expression independently.

base.py no longer needs TASK_VISIBILITY_TIMEOUT_SECONDS since all derived
constants now live in celery_config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rename TASK_RETRY_COUNTDOWN_FLOOR_SECONDS -> TASK_RETRY_COUNTDOWN_MIN_SECONDS
and TASK_RETRY_COUNTDOWN_CEILING_SECONDS -> TASK_RETRY_COUNTDOWN_MAX_SECONDS for
clarity.

Clean up comments: drop the fragile 6x ratio (would go stale if the backoff base
changes), remove the redundant MAX constant description (the expression is
self-explanatory), and use short BUFFER/MIN labels instead of repeating full
constant names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codspeed-hq

codspeed-hq bot commented Feb 25, 2026

Merging this PR will not alter performance

✅ 9 untouched benchmarks


Comparing joebecher/cap-bundle-analysis-retry-countdown (8a3a18a) with main (afa4356)¹

Open in CodSpeed

Footnotes

  1. No successful run was found on main (aae0aa4) during the generation of this report, so afa4356 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
drazisil-codecov and others added 3 commits February 25, 2026 10:46
Replace hardcoded 18000 (old 5-hour cap) with MAX_RETRY_COUNTDOWN_SECONDS
and strengthen the assertion from <= to == since retry_num=10 always hits
the cap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 30-second buffer in clamp_retry_countdown was arbitrary — no test
could assert it must be 30 rather than any other value.

The principled bound is: a task calling retry() may have been running
for up to hard_time_limit_task seconds, so the safe countdown ceiling is
TASK_VISIBILITY_TIMEOUT_SECONDS - hard_time_limit_task. This matches the
same logic already used in get_lock_timeout().

Move clamp_retry_countdown from a standalone function to a method on
BaseCodecovTask so it can read self.hard_time_limit_task. Falls back to
TASK_RETRY_COUNTDOWN_MAX_SECONDS (visibility timeout - 30 s) for tasks
with no configured hard limit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When tests patch BundleAnalysisProcessorTask.app with a MagicMock,
self.app.conf.task_time_limit returns a MagicMock, which clamp_retry_countdown
then tries to compare with > 0, raising TypeError.

Add isinstance(hard_limit, int) check so non-int values fall through
to the TASK_RETRY_COUNTDOWN_MAX_SECONDS fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


```python
    """
    hard_limit = self.hard_time_limit_task
    if isinstance(hard_limit, int) and hard_limit > 0:
        return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
```

Negative countdown when hard_time_limit exceeds visibility timeout

High Severity

clamp_retry_countdown computes TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit without guarding against hard_limit >= TASK_VISIBILITY_TIMEOUT_SECONDS. Several tasks have time_limit well above 900s (e.g., delete_owner at 2880s, mark_owner_for_deletion and sync_repos at 1440s). For those tasks the subtraction yields a large negative number (e.g., 900 − 2880 = −1980), and min(countdown, −1980) returns −1980. A negative Celery countdown schedules the retry in the past, causing immediate re-execution — the exact runaway-retry behavior this PR is trying to prevent.
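The arithmetic in this report checks out (a sketch using the constants and example limit quoted above; the guard shown is illustrative, not the PR's fix):

```python
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
hard_limit = 2880  # e.g. delete_owner's time_limit, per this report

clamped = min(600, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
print(clamped)  # -1980: a negative countdown schedules the retry in the past

# One illustrative guard: never let the result drop below a small positive floor.
safe = max(clamped, 30)
print(safe)  # 30
```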


```python
    Falls back to TASK_RETRY_COUNTDOWN_MAX_SECONDS for tasks with no hard limit.
    """
    hard_limit = self.hard_time_limit_task
    if isinstance(hard_limit, int) and hard_limit > 0:
```

Type check excludes float hard time limits

Medium Severity

isinstance(hard_limit, int) rejects float values, but hard_time_limit_task can return floats via self.request.timelimit[0] (Celery stores time limits as int or float). When the hard limit is a float like 720.0, the per-task visibility cap is silently bypassed and the method falls through to the generic TASK_RETRY_COUNTDOWN_MAX_SECONDS. This is inconsistent with get_lock_timeout, which uses hard_time_limit_task without any type check.
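A minimal reproduction of the type-check gap (the function mirrors the reviewed snippet; constant values are taken from the PR description):

```python
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
TASK_RETRY_COUNTDOWN_MAX_SECONDS = TASK_VISIBILITY_TIMEOUT_SECONDS - 30

def clamp(countdown, hard_limit):
    # Mirrors the reviewed code: only an int hard limit gets the per-task cap.
    if isinstance(hard_limit, int) and hard_limit > 0:
        return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
    return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)

print(clamp(1000, 720))    # 180: per-task cap applied
print(clamp(1000, 720.0))  # 870: a float silently falls through to the generic cap
```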


```diff
 log = logging.getLogger(__name__)

-MAX_RETRY_COUNTDOWN_SECONDS = 60 * 60 * 5
+MAX_RETRY_COUNTDOWN_SECONDS = TASK_RETRY_COUNTDOWN_MAX_SECONDS
```
Contributor

we should probably just get rid of this variable

```diff
@@ -209,6 +212,20 @@ def hard_time_limit_task(self):
             return self.time_limit
         return self.app.conf.task_time_limit or 0
```
Contributor

I think we should force this to int instead of doing the check later

Contributor Author

Can a MagicMock be forced into an int? I will look.

```python
    Falls back to TASK_RETRY_COUNTDOWN_MAX_SECONDS for tasks with no hard limit.
    """
    hard_limit = self.hard_time_limit_task
    if isinstance(hard_limit, int) and hard_limit > 0:
```
Contributor

if we do the above comment, you can just do `if self.hard_time_limit_task:`

Contributor Author

@drazisil-codecov drazisil-codecov Feb 25, 2026

The `isinstance` check is required for tests, where `self.app` is mocked and tests fail due to it not being an int. That is the only time the limit is ignored.

Contributor

I think that means the way we mock this needs to change, we shouldn't build based on how we expect tests to be run

Contributor Author

Fair. I don't know if we can, since IIRC self.app is assigned by Celery at runtime, but I will investigate. I remember quite a bit of trouble getting that to exist when I started; I almost want to say it's a read-only property.

Contributor

godspeed, mocks are painful, but it's probably the right way to handle this if it's possible. if not, we should rethink how these are tested

Contributor Author

This property only exists while the task is being executed, unless I am misunderstanding https://docs.celeryq.dev/en/stable/userguide/tasks.html#task-request

Contributor Author

I honestly cannot remember if we were even testing retries prior to the start of all my work. I will have to dig back.

Contributor Author

@drazisil-codecov drazisil-codecov Feb 25, 2026

Yes, these tests were created by me. This wasn't being tested before. I don't know if there is another way to do this; you will note that I'm patching it in to begin with (427ec64).

I'll look through my prior tries.

Comment on lines +226 to +227

```python
        return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
    return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)
```
Contributor

Seems like there's a logic disparity here; can you explain how we'd have `hard_limit` actually be 0, and why we don't care about the visibility timeout here?

Contributor Author

Per comment above, this is only for tests where the timeout is not a number.
