Conversation

@drazisil-codecov (Contributor) commented Jan 5, 2026

Problem

The bundle analysis processor was experiencing infinite retry loops due to two related bugs:

  1. Wrong retry counter check: The code was checking self.request.retries instead of self.attempts when determining if max retries were exceeded
  2. Task recreation resetting counter: The apply_async method was always resetting attempts to 1, even when recreating tasks that had already been retried

Root Cause

Bug #1: Wrong retry counter

  • self.request.retries only counts intentional retries via self.retry()
  • self.attempts includes visibility timeout re-deliveries via the attempts header

When a task times out and is redelivered due to the visibility timeout, self.request.retries stays the same while self.attempts increases, so the max-retry check never triggers and the task keeps retrying.
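As a minimal sketch, the divergence between the two counters can be reproduced with stand-in classes (these are hypothetical shapes, not the real Celery request object or the Codecov base task):

```python
# Sketch of the two counters diverging (stand-in classes, not the
# actual Celery/Codecov implementation).

class FakeRequest:
    def __init__(self):
        self.retries = 0  # only incremented by explicit self.retry() calls


class FakeTask:
    max_retries = 10

    def __init__(self):
        self.request = FakeRequest()
        self.headers = {"attempts": 1}

    @property
    def attempts(self):
        # mirrors reading the "attempts" message header
        return self.headers["attempts"]

    def redeliver(self):
        # visibility-timeout re-delivery: the broker resends the message,
        # bumping the attempts header but NOT request.retries
        self.headers["attempts"] += 1


task = FakeTask()
for _ in range(12):
    task.redeliver()

# The buggy check never fires, even though the task was delivered 13 times:
print(task.request.retries >= task.max_retries)  # False
# Checking attempts instead catches the runaway task:
print(task.attempts >= task.max_retries)  # True
```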

Bug #2: Task recreation resetting counter

  • apply_async was always setting attempts: 1 in headers, overwriting any existing attempts value
  • When tasks were recreated via apply_async (e.g., via chain or re-scheduling), the retry counter would reset to 1
  • This allowed tasks to retry indefinitely even after hitting max retries
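The reset is plain Python dict behavior: in a dict literal, later keys override earlier ones, so the hard-coded "attempts": 1 always wins over the spread-in value. A minimal reproduction (hypothetical values; the real headers carry more fields):

```python
# Sketch of the header-construction bug: the literal "attempts": 1
# overwrites whatever **opt_headers spread in.

opt_headers = {"attempts": 5}  # a task that has already been retried

buggy_headers = {
    **opt_headers,   # spreads in {"attempts": 5}
    "attempts": 1,   # later key wins: counter silently reset
}
print(buggy_headers["attempts"])  # 1
```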

Example

Task 61998622-ac1a-4861-a459-98bb4f51b8ed kept retrying after hitting max retries because:

  • self.request.retries was 10 (at max_retries)
  • self.attempts was 11+ (exceeded due to visibility timeout re-deliveries)
  • The code checked self.request.retries >= self.max_retries which didn't account for the additional attempts
  • If the task was recreated via apply_async, the attempts header would reset to 1, allowing infinite retries

Solution

Fix #1: Use correct retry counter

  1. Use self.attempts instead of self.request.retries when checking max retries
  2. Pass self.attempts to lock_manager.locked() as documented in the method signature
  3. Pass max_retries to lock_manager.locked() for proper retry limit checking
  4. Use self._has_exceeded_max_attempts() helper method which correctly uses self.attempts
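Assuming _has_exceeded_max_attempts reduces to a comparison against the attempts header (the actual method lives on the task class and may log additional context), the corrected check is a sketch like:

```python
# Sketch of a _has_exceeded_max_attempts-style check (assumed shape,
# not the real Codecov base-task method).

def has_exceeded_max_attempts(attempts: int, max_retries: int) -> bool:
    # attempts counts every delivery, including visibility-timeout
    # re-deliveries, so this catches what request.retries misses
    return attempts >= max_retries


# request.retries may still be below the limit while attempts is not:
print(has_exceeded_max_attempts(attempts=11, max_retries=10))  # True
print(has_exceeded_max_attempts(attempts=5, max_retries=10))   # False
```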

Fix #2: Preserve attempts counter

  1. Preserve existing attempts value from opt_headers if present when calling apply_async
  2. Only default to attempts: 1 for new task creations
  3. This ensures retry counters are maintained across task recreations
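The steps above can be sketched as a standalone helper (build_headers is a hypothetical name; the real logic sits inline in BaseCodecovTask.apply_async):

```python
# Sketch of the fix: preserve an existing attempts value from
# opt_headers, defaulting to 1 only for brand-new tasks.

def build_headers(opt_headers: dict, created_timestamp: str) -> dict:
    return {
        **opt_headers,
        "created_timestamp": created_timestamp,
        "attempts": opt_headers.get("attempts", 1),  # preserve, else default
    }


# Recreated task keeps its counter; new task starts at 1:
print(build_headers({"attempts": 5}, "2026-01-05T00:00:00")["attempts"])  # 5
print(build_headers({}, "2026-01-05T00:00:00")["attempts"])               # 1
```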

Changes

  • apps/worker/tasks/bundle_analysis_processor.py: Fixed retry check to use self.attempts instead of self.request.retries
  • apps/worker/tasks/base.py: Fixed apply_async to preserve existing attempts header instead of always resetting to 1
  • apps/worker/tasks/tests/unit/test_bundle_analysis_processor_task.py: Added test to catch visibility timeout re-delivery scenario

Testing

Added test test_bundle_analysis_processor_task_max_retries_exceeded_visibility_timeout that simulates:

  • self.request.retries = 5 (below max_retries of 10)
  • self.attempts = 11 (exceeds max_retries due to visibility timeout re-deliveries)
  • Verifies task stops retrying and returns previous_result instead of continuing to retry

This test would have caught both bugs before they were deployed.
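The scenario can be sketched with a stand-in task (StubTask is hypothetical; the real test patches the Celery task and its request object via the pytest fixtures in the worker test suite):

```python
# Sketch of the regression scenario: attempts exceeds the limit even
# though request.retries is still below it.

class StubTask:
    max_retries = 10

    def __init__(self, request_retries: int, attempts: int):
        self.request_retries = request_retries  # mirrors self.request.retries
        self.attempts = attempts                # mirrors the attempts header

    def process(self, previous_result):
        # fixed behavior: compare attempts (not request_retries) to the limit
        if self.attempts >= self.max_retries:
            return previous_result  # stop retrying, hand back the prior result
        raise RuntimeError("would schedule a retry")


task = StubTask(request_retries=5, attempts=11)
print(task.process(previous_result={"status": "partial"}))  # {'status': 'partial'}
```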


Note

Addresses infinite retry loops in bundle analysis processing caused by visibility-timeout re-deliveries being ignored.

  • Use self.attempts and pass retry_num=self.attempts, max_retries=self.max_retries to LockManager.locked(); check limits via _has_exceeded_max_attempts() and enhance error logging
  • Preserve attempts in BaseCodecovTask.apply_async headers instead of resetting to 1
  • Add unit test test_bundle_analysis_processor_task_max_retries_exceeded_visibility_timeout to ensure tasks stop retrying when attempts exceed max

Written by Cursor Bugbot for commit 21fdf2f.

…eout

The bundle analysis processor was checking self.request.retries instead of
self.attempts when determining if max retries were exceeded. This caused
tasks to continue retrying after hitting max retries when visibility timeout
caused re-deliveries, because:

- self.request.retries only counts intentional retries via self.retry()
- self.attempts includes visibility timeout re-deliveries via the attempts header

When a task times out and gets redelivered due to visibility timeout,
self.request.retries stays the same while self.attempts increases, so
the max-retry check never triggers.

Changes:
- Use self.attempts instead of self.request.retries when checking max retries
- Pass self.attempts to lock_manager.locked() as documented
- Pass max_retries to lock_manager.locked() for proper retry limit checking
- Add test to catch visibility timeout re-delivery scenario

Fixes infinite retry loop for task 61998622-ac1a-4861-a459-98bb4f51b8ed
@sentry bot commented Jan 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.90%. Comparing base (367d3f1) to head (21fdf2f).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #633   +/-   ##
=======================================
  Coverage   93.90%   93.90%           
=======================================
  Files        1286     1286           
  Lines       46802    46803    +1     
  Branches     1517     1517           
=======================================
+ Hits        43951    43952    +1     
  Misses       2542     2542           
  Partials      309      309           
Flag               Coverage Δ
workerintegration  59.16% <100.00%> (+0.06%) ⬆️
workerunit         91.30% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@codecov-notifications bot commented Jan 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


The apply_async method was always setting attempts=1, even when opt_headers
already contained an attempts value. This caused tasks to reset their retry
counter when recreated via apply_async (e.g., via chain or re-scheduling).

Now we preserve the existing attempts value from opt_headers if present,
only defaulting to 1 for new task creations. This ensures retry counters
are properly maintained across task recreations.
log.error(
    "Bundle analysis processor exceeded max retries",
    extra={
        "attempts": attempts,

just use self.attempts here and remove the attempts = self.attempts assignment

  **opt_headers,
  "created_timestamp": current_time.isoformat(),
- "attempts": 1,
+ "attempts": attempts,

this can also be inlined as opt_headers.get("attempts", 1)


actually, is this needed? we call **opt_headers

@drazisil-codecov (Contributor, Author) commented:
Regarding the question "is this needed? we call **opt_headers":

Yes, the fix is needed. While **opt_headers spreads the existing headers (including any attempts value), the problem is in the original code:

headers = {
    **opt_headers,           # Spreads existing headers including any "attempts"
    "created_timestamp": ...,
    "attempts": 1,           # ⚠️ ALWAYS overwrites to 1!
}

In Python dicts, later keys override earlier ones. So even if opt_headers contains {"attempts": 5}, the final "attempts": 1 overwrites it back to 1.

This is exactly the bug causing the infinite retry loop: when tasks are recreated via apply_async, the retry counter gets reset to 1, allowing tasks to retry indefinitely.

The fix ensures we preserve the existing attempts value (or default to 1 for new tasks):

headers = {
    **opt_headers,
    "created_timestamp": ...,
    "attempts": opt_headers.get("attempts", 1),  # Preserves existing, defaults to 1
}

I'll inline this as you suggested in the earlier comment.

@drazisil-codecov drazisil-codecov added this pull request to the merge queue Jan 7, 2026
Merged via the queue into main with commit 0360937 Jan 7, 2026
40 checks passed
@drazisil-codecov drazisil-codecov deleted the fix/bundle-analysis-max-retries-visibility-timeout branch January 7, 2026 15:56
drazisil-codecov added a commit that referenced this pull request Jan 7, 2026
- Add debug logging for session creation/cleanup in apply_async and run
- Log session_id, task name, and whether transaction was open
- Preserve attempts header from PR #633 to fix visibility timeout tracking
- Logging helps verify session cleanup is working correctly in production
