
fix(worker): cap retry countdown to visibility timeout across all tasks#720

Open
drazisil-codecov wants to merge 18 commits into main from joebecher/cap-bundle-analysis-retry-countdown

Conversation

@drazisil-codecov
Contributor

@drazisil-codecov drazisil-codecov commented Feb 23, 2026

Any `self.retry(countdown=N)` where `N > TASK_VISIBILITY_TIMEOUT_SECONDS` causes Redis to redeliver the task from `unacked` before the ETA fires, creating duplicate executions that multiply exponentially. This is the root cause of the bundle analysis queue explosions documented in CODECOV-59.

What was broken

Two retry paths were affected across multiple tasks:

Lock-contention retries (`LockRetry.countdown`) — `LockManager` uses `200 * 3**retry_num`, which reaches ~1800s at `retry_num=2` (above the 900s visibility timeout). This formula predates PR #589 for bundle analysis tasks, but #589 extended `LockManager` to cover the entire upload pipeline (`UPLOAD`, `UPLOAD_PROCESSING`, `UPLOAD_FINISHER`, `PREPROCESS_UPLOAD`, `MANUAL_TRIGGER`), significantly expanding the blast radius.

Exponential backoff retries — Various tasks use formulas like `30 * 2**retries`, `20 * 2**retries`, and `60 * 3**retries`. These all exceed 900s within a handful of retries and had no upper bound.
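For a concrete sense of the failure window, the growth of these formulas can be sketched (a quick illustration, not code from the PR; the constant value is the 900s timeout stated above):

```python
# Sketch: where each backoff formula crosses the 900s visibility timeout.
TASK_VISIBILITY_TIMEOUT_SECONDS = 900  # value stated above

lock_backoff = [200 * 3**n for n in range(4)]  # LockManager: 200 * 3**retry_num
expo_backoff = [30 * 2**n for n in range(7)]   # one of the task formulas

print(lock_backoff)  # [200, 600, 1800, 5400] -> unsafe from retry_num=2
print(expo_backoff)  # [30, 60, 120, 240, 480, 960, 1920] -> unsafe from retries=5

first_unsafe = next(
    n for n, c in enumerate(lock_backoff) if c > TASK_VISIBILITY_TIMEOUT_SECONDS
)
print(first_unsafe)  # 2
```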

Fix

Two layers of defense:

  1. `LockManager.MAX_RETRY_COUNTDOWN_SECONDS` is now `TASK_RETRY_COUNTDOWN_MAX_SECONDS` (previously a hardcoded 5 hours).

  2. `BaseCodecovTask.clamp_retry_countdown()` caps any explicitly-provided countdown before it is passed to Celery. Parameterless `self.retry()` calls (`countdown=None`) are left alone so Celery's built-in 180s default is preserved — 180s is already within the safe window, and overriding it would silently change retry cadence for cleanup tasks that rely on that default.

The cap is derived from the task's own `hard_time_limit_task`: a task calling `retry()` may have been running for up to that many seconds, so the safe ceiling is `TASK_VISIBILITY_TIMEOUT_SECONDS - hard_time_limit_task`. This is the same reasoning already used in `get_lock_timeout()`. Tasks with no configured hard limit fall back to `TASK_RETRY_COUNTDOWN_MAX_SECONDS` (visibility timeout − 30 s).

```python
# In BaseCodecovTask
def clamp_retry_countdown(self, countdown: int) -> int:
    hard_limit = self.hard_time_limit_task
    if hard_limit > 0:
        return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
    return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)
```
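A runnable sketch of how this clamp interacts with the overridden `retry()` (hypothetical class and example values, not the PR's exact code; a real override would delegate to Celery's `Task.retry`):

```python
# Hypothetical sketch of the override; names follow the PR, values are examples.
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
TASK_RETRY_COUNTDOWN_MAX_SECONDS = TASK_VISIBILITY_TIMEOUT_SECONDS - 30

class BaseCodecovTaskSketch:
    hard_time_limit_task = 480  # example per-task hard limit, in seconds

    def clamp_retry_countdown(self, countdown: int) -> int:
        hard_limit = self.hard_time_limit_task
        if isinstance(hard_limit, int) and hard_limit > 0:
            return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
        return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)

    def retry(self, countdown=None, **kwargs):
        # countdown=None passes through so Celery's built-in 180s default applies
        if countdown is not None:
            countdown = self.clamp_retry_countdown(countdown)
        return countdown  # a real override would call super().retry(countdown=...)

task = BaseCodecovTaskSketch()
print(task.retry(countdown=1800))  # 420 (900 - 480)
print(task.retry())                # None -> Celery default preserved
```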

Fixes CODECOV-59

Follows up on CCMRG-2042 and #589.


Note

Medium Risk
Changes core retry timing across many worker tasks; while intended to prevent Redis redelivery duplication, it could alter retry cadence (and edge-case countdown values) for long-running or misconfigured time-limit tasks.

Overview
Prevents runaway duplicate task execution by capping all explicit self.retry(countdown=...) delays to fit within the remaining Redis TASK_VISIBILITY_TIMEOUT_SECONDS window.

Adds BaseCodecovTask.clamp_retry_countdown() and applies it automatically in the overridden BaseCodecovTask.retry(), while introducing a shared TASK_RETRY_COUNTDOWN_MAX_SECONDS fallback for tasks without a hard time limit.

Also reduces LockManager’s exponential backoff cap from a hardcoded 5 hours to TASK_RETRY_COUNTDOWN_MAX_SECONDS, and updates/extends unit tests to assert the new clamping behavior.

Written by Cursor Bugbot for commit 8a3a18a. This will update automatically on new commits.

…imeout

Prevents retry countdown from exceeding TASK_VISIBILITY_TIMEOUT_SECONDS,
which could cause tasks to become invisible in the queue before being retried.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@linear

linear bot commented Feb 23, 2026

@drazisil-codecov drazisil-codecov marked this pull request as ready for review February 23, 2026 13:28
@sentry

sentry bot commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.24%. Comparing base (aae0aa4) to head (8a3a18a).
✅ All tests successful. No failed tests found.

Additional details and impacted files
```
@@           Coverage Diff           @@
##             main     #720   +/-   ##
=======================================
  Coverage   92.24%   92.24%
=======================================
  Files        1302     1302
  Lines       47888    47896    +8
  Branches     1628     1628
=======================================
+ Hits        44175    44183    +8
  Misses       3404     3404
  Partials      309      309
```
| Flag | Coverage Δ |
| --- | --- |
| apiunit | 96.36% <ø> (ø) |
| sharedintegration | 37.01% <100.00%> (+<0.01%) ⬆️ |
| sharedunit | 84.89% <100.00%> (+<0.01%) ⬆️ |
| workerintegration | 58.63% <76.92%> (+0.01%) ⬆️ |
| workerunit | 90.33% <84.61%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@codecov-notifications

codecov-notifications bot commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


…ceiling

Extract retry countdown clamping into a helper `_clamp_retry_countdown`
used by both retry paths in BundleAnalysisProcessorTask. The countdown
is bounded between a floor of 30s and a ceiling of
TASK_VISIBILITY_TIMEOUT_SECONDS - 30 (870s in production).

The floor prevents a near-zero countdown when the visibility timeout is
small (e.g. in tests or low-timeout environments). The ceiling ensures
the ETA always fires before Redis can redeliver the task from `unacked`,
with a 30s buffer instead of the previous 1s.

Fixes CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
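The bounds described in this commit can be sketched as follows (names assumed from the message above; the 900s value is the production visibility timeout):

```python
# Sketch of the commit's floor/ceiling clamp.
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
_RETRY_COUNTDOWN_FLOOR = 30
_RETRY_COUNTDOWN_CEILING = TASK_VISIBILITY_TIMEOUT_SECONDS - 30  # 870 in production

def _clamp_retry_countdown(countdown: int) -> int:
    # Bound the countdown to [floor, ceiling] so the ETA fires before redelivery.
    return max(_RETRY_COUNTDOWN_FLOOR, min(countdown, _RETRY_COUNTDOWN_CEILING))

print(_clamp_retry_countdown(5))     # 30 (floor)
print(_clamp_retry_countdown(1800))  # 870 (ceiling)
print(_clamp_retry_countdown(300))   # 300 (unchanged)
```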
When TASK_VISIBILITY_TIMEOUT_SECONDS is small (e.g. in low-timeout
environments or tests), TASK_VISIBILITY_TIMEOUT_SECONDS - 30 could be
less than the floor of 30s, causing the floor to silently override
the ceiling and making the ceiling meaningless.

Compute _RETRY_COUNTDOWN_CEILING as max(visibility_timeout - 30, floor)
so the two bounds are always consistent. Also removes the mocker patches
from the clamp tests — they're no longer needed since the ceiling is now
well-defined in all environments.

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
drazisil-codecov and others added 2 commits February 23, 2026 11:28
Previously the floor was a hardcoded 30s, which could exceed the
visibility timeout in low-timeout environments (e.g. 20s test env),
meaning even the floor could trigger redelivery from unacked.

Floor is now min(30, visibility_timeout // 2), so it scales down
proportionally for small timeouts while staying at 30s in production.

Examples:
  900s visibility → floor=30s, ceiling=870s (unchanged)
   20s visibility → floor=10s, ceiling=10s
   10s visibility → floor=5s,  ceiling=5s

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
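The scaled bounds and the examples above can be checked directly (a sketch; the function name is illustrative):

```python
def retry_bounds(visibility_timeout: int) -> tuple[int, int]:
    # Floor scales down for small timeouts; ceiling never drops below the floor.
    floor = min(30, visibility_timeout // 2)
    ceiling = max(visibility_timeout - 30, floor)
    return floor, ceiling

print(retry_bounds(900))  # (30, 870)
print(retry_bounds(20))   # (10, 10)
print(retry_bounds(10))   # (5, 5)
```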
…tasks

Move the clamp_retry_countdown helper from bundle_analysis_processor to
tasks/base.py so every task that schedules retries is protected from
setting a countdown that exceeds the Celery visibility timeout.

Any countdown longer than TASK_VISIBILITY_TIMEOUT_SECONDS causes Redis to
redeliver the task from unacked before the ETA fires, creating duplicate
executions that multiply exponentially. Clamping to [floor, ceiling] where
both bounds are derived from TASK_VISIBILITY_TIMEOUT_SECONDS prevents this.

Applied to: notify, manual_trigger, http_request, upload_processor,
bundle_analysis_notify, bundle_analysis_processor, test_analytics_notifier,
test_results_finisher.

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
@drazisil-codecov drazisil-codecov changed the title fix(worker): cap bundle analysis retry countdown to task visibility timeout fix(worker): cap retry countdown to visibility timeout across all tasks Feb 23, 2026
preprocess_upload, upload_finisher, and upload were missed in the
previous commit. All three had unclamped LockRetry countdowns, and
upload.py additionally takes max(lock_countdown, exponential_backoff)
before scheduling the retry — compounding both unbounded sources.

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
drazisil-codecov and others added 2 commits February 23, 2026 12:26
… tests

Replace the hardcoded 5-hour cap in LockManager with
TASK_VISIBILITY_TIMEOUT_SECONDS - 30, so LockRetry.countdown is
bounded at the source rather than relying solely on call sites to
clamp. Tasks still wrap with clamp_retry_countdown as a safety net.

Move TestClampRetryCountdown from test_bundle_analysis_processor_task
to test_base where it belongs.

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
…otifier

Lines 165 and 212 were missed in the previous pass. All three LockRetry
handlers in this file now consistently wrap e.countdown.

Refs CODECOV-59
Co-Authored-By: Claude <noreply@anthropic.com>
Clamping at the override ensures every self.retry() call is safe without
requiring each call site to wrap its countdown. Removes all per-site
clamp_retry_countdown() calls and their imports.

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix `countdown or 0` bug in BaseCodecovTask.retry() that was changing
  parameterless self.retry() calls from Celery's default 180s to 30s;
  now only clamps when countdown is explicitly provided
- Update test_locked_exponential_backoff_retry_2 to assert countdown
  equals MAX_RETRY_COUNTDOWN_SECONDS since retry_num=2 always exceeds
  the visibility timeout cap (1800s uncapped > 870s cap)
- Fix ruff formatting in test_analytics_notifier.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
drazisil-codecov and others added 2 commits February 25, 2026 10:09
…omments

Replace bare `30` literals with named constants to clarify intent:
- `TASK_RETRY_COUNTDOWN_BUFFER_SECONDS` (shared/celery_config.py) — the
  safety margin subtracted from the visibility timeout ceiling, now shared
  between base.py and lock_manager.py instead of duplicated
- `_RETRY_COUNTDOWN_FLOOR_SECONDS` (tasks/base.py) — the minimum retry
  delay, renamed for clarity

Add comments explaining there is no hard requirement for these values and
noting that our backoff is ~6× more aggressive than Celery's built-in
default_retry_delay (180 s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move TASK_RETRY_COUNTDOWN_FLOOR_SECONDS and TASK_RETRY_COUNTDOWN_CEILING_SECONDS
into shared/celery_config.py so both tasks/base.py and services/lock_manager.py
import them rather than each computing the same expression independently.

base.py no longer needs TASK_VISIBILITY_TIMEOUT_SECONDS since all derived
constants now live in celery_config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rename TASK_RETRY_COUNTDOWN_FLOOR_SECONDS -> TASK_RETRY_COUNTDOWN_MIN_SECONDS
and TASK_RETRY_COUNTDOWN_CEILING_SECONDS -> TASK_RETRY_COUNTDOWN_MAX_SECONDS for
clarity.

Clean up comments: drop the fragile 6x ratio (would go stale if the backoff base
changes), remove the redundant MAX constant description (the expression is
self-explanatory), and use short BUFFER/MIN labels instead of repeating full
constant names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codspeed-hq

codspeed-hq bot commented Feb 25, 2026

Merging this PR will not alter performance

✅ 9 untouched benchmarks


Comparing joebecher/cap-bundle-analysis-retry-countdown (8a3a18a) with main (afa4356)¹

Open in CodSpeed

Footnotes

  1. No successful run was found on main (aae0aa4) during the generation of this report, so afa4356 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
drazisil-codecov and others added 3 commits February 25, 2026 10:46
Replace hardcoded 18000 (old 5-hour cap) with MAX_RETRY_COUNTDOWN_SECONDS
and strengthen the assertion from <= to == since retry_num=10 always hits
the cap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 30-second buffer in clamp_retry_countdown was arbitrary — no test
could assert it must be 30 rather than any other value.

The principled bound is: a task calling retry() may have been running
for up to hard_time_limit_task seconds, so the safe countdown ceiling is
TASK_VISIBILITY_TIMEOUT_SECONDS - hard_time_limit_task. This matches the
same logic already used in get_lock_timeout().

Move clamp_retry_countdown from a standalone function to a method on
BaseCodecovTask so it can read self.hard_time_limit_task. Falls back to
TASK_RETRY_COUNTDOWN_MAX_SECONDS (visibility timeout - 30 s) for tasks
with no configured hard limit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When tests patch BundleAnalysisProcessorTask.app with a MagicMock,
self.app.conf.task_time_limit returns a MagicMock, which clamp_retry_countdown
then tries to compare with > 0, raising TypeError.

Add isinstance(hard_limit, int) check so non-int values fall through
to the TASK_RETRY_COUNTDOWN_MAX_SECONDS fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


```python
    """
    hard_limit = self.hard_time_limit_task
    if isinstance(hard_limit, int) and hard_limit > 0:
        return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
```

Negative countdown when hard_time_limit exceeds visibility timeout

High Severity

clamp_retry_countdown computes TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit without guarding against hard_limit >= TASK_VISIBILITY_TIMEOUT_SECONDS. Several tasks have time_limit well above 900s (e.g., delete_owner at 2880s, mark_owner_for_deletion and sync_repos at 1440s). For those tasks the subtraction yields a large negative number (e.g., 900 − 2880 = −1980), and min(countdown, −1980) returns −1980. A negative Celery countdown schedules the retry in the past, causing immediate re-execution — the exact runaway-retry behavior this PR is trying to prevent.
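The arithmetic in this report checks out (a sketch using the constants and example limit quoted above; the guard shown is illustrative, not the PR's fix):

```python
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
hard_limit = 2880  # e.g. delete_owner's time_limit, per this report

clamped = min(600, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
print(clamped)  # -1980: a negative countdown schedules the retry in the past

# One illustrative guard: never let the result drop below a small positive floor.
safe = max(clamped, 30)
print(safe)  # 30
```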


```python
    Falls back to TASK_RETRY_COUNTDOWN_MAX_SECONDS for tasks with no hard limit.
    """
    hard_limit = self.hard_time_limit_task
    if isinstance(hard_limit, int) and hard_limit > 0:
```

Type check excludes float hard time limits

Medium Severity

isinstance(hard_limit, int) rejects float values, but hard_time_limit_task can return floats via self.request.timelimit[0] (Celery stores time limits as int or float). When the hard limit is a float like 720.0, the per-task visibility cap is silently bypassed and the method falls through to the generic TASK_RETRY_COUNTDOWN_MAX_SECONDS. This is inconsistent with get_lock_timeout, which uses hard_time_limit_task without any type check.
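A minimal reproduction of the type-check gap (the function mirrors the reviewed snippet; constant values are taken from the PR description):

```python
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
TASK_RETRY_COUNTDOWN_MAX_SECONDS = TASK_VISIBILITY_TIMEOUT_SECONDS - 30

def clamp(countdown, hard_limit):
    # Mirrors the reviewed code: only an int hard limit gets the per-task cap.
    if isinstance(hard_limit, int) and hard_limit > 0:
        return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
    return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)

print(clamp(1000, 720))    # 180: per-task cap applied
print(clamp(1000, 720.0))  # 870: a float silently falls through to the generic cap
```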


```diff
 log = logging.getLogger(__name__)

-MAX_RETRY_COUNTDOWN_SECONDS = 60 * 60 * 5
+MAX_RETRY_COUNTDOWN_SECONDS = TASK_RETRY_COUNTDOWN_MAX_SECONDS
```
Contributor

we should probably just get rid of this variable

```diff
@@ -209,6 +212,20 @@ def hard_time_limit_task(self):
             return self.time_limit
         return self.app.conf.task_time_limit or 0
```
Contributor

I think we should force this to int instead of doing the check later

Contributor Author

Can a MagicMock be forced into an int? I will look.

```python
    Falls back to TASK_RETRY_COUNTDOWN_MAX_SECONDS for tasks with no hard limit.
    """
    hard_limit = self.hard_time_limit_task
    if isinstance(hard_limit, int) and hard_limit > 0:
```
Contributor

if we do the above comment, you can just do `if self.hard_time_limit_task:`

Contributor Author

@drazisil-codecov drazisil-codecov Feb 25, 2026

The `isinstance` check is required for tests, where `self.app` is mocked and tests fail due to it not being an int. That is the only time the limit is ignored.

Contributor

I think that means the way we mock this needs to change, we shouldn't build based on how we expect tests to be run

Contributor Author

Fair. I don't know if we can, since IIRC self.app is assigned by Celery at runtime, but I will investigate. I remember quite a bit of trouble getting that to exist when I started; I almost want to say it's a read-only property.

Contributor

godspeed, mocks are painful, but it's probably the right way to handle this if it's possible. if not, we should rethink how these are tested

Contributor Author

This property only exists while the task is being executed, unless I am misunderstanding https://docs.celeryq.dev/en/stable/userguide/tasks.html#task-request

Contributor Author

I honestly cannot remember if we were even testing retries prior to the start of all my work. I will have to dig back.

Contributor Author

@drazisil-codecov drazisil-codecov Feb 25, 2026

Yes, these tests were created by me. This wasn't being tested before. I don't know if there is another way to do this; you will note that I'm patching it in to begin with (427ec64).

I'll look through my prior tries.

Comment on lines +226 to +227

```python
        return min(countdown, TASK_VISIBILITY_TIMEOUT_SECONDS - hard_limit)
    return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)
```
Contributor

Seems like there's a logic disparity here; can you explain how we'd have `hard_limit` actually be 0, and why we don't care about the visibility timeout here?

Contributor Author

Per comment above, this is only for tests where the timeout is not a number.
