fix(worker): use finite blocking_timeout in upload finisher locks #729

Open

thomasrockhu-codecov wants to merge 1 commit into main from tom/fix-finisher-blocking-timeout
Conversation

@thomasrockhu-codecov
Contributor

thomasrockhu-codecov commented Feb 27, 2026

Summary

  • The upload finisher creates LockManager with blocking_timeout=None in both _process_reports_with_lock and _handle_finisher_lock, which blocks worker threads indefinitely until the Redis lock becomes available.
  • For high-concurrency commits (e.g., stacks-network with 167 finisher tasks for one commit, mixpanel with 78), this causes all finisher tasks to block on the same per-commit lock, exhausting the worker pool and starving finisher tasks from all other repos.
  • Changes blocking_timeout to DEFAULT_BLOCKING_TIMEOUT_SECONDS (5s), enabling the existing LockRetry mechanism with exponential backoff. Worker threads are freed immediately instead of blocking for up to 600s (the soft time limit).
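The behavior change above can be illustrated with a minimal, self-contained sketch. No Redis is involved: a `threading.Lock` stands in for the per-commit Redis lock, and `LockRetry` mimics the worker's retry signal. The name `DEFAULT_BLOCKING_TIMEOUT_SECONDS` mirrors the PR; everything else (function names, backoff values) is illustrative, not the worker's actual implementation.

```python
import threading
import time

DEFAULT_BLOCKING_TIMEOUT_SECONDS = 5


class LockRetry(Exception):
    """Raised when the lock could not be acquired within the blocking timeout."""

    def __init__(self, countdown: int):
        self.countdown = countdown


def acquire_with_timeout(lock: threading.Lock, blocking_timeout, retry_num: int):
    # blocking_timeout=None would block the worker thread indefinitely;
    # a finite timeout frees the thread so the task can be rescheduled.
    timeout = -1 if blocking_timeout is None else blocking_timeout
    if not lock.acquire(timeout=timeout):
        # Illustrative exponential backoff: 20s, 40s, 80s, ...
        raise LockRetry(countdown=20 * 2**retry_num)
    return True


if __name__ == "__main__":
    held = threading.Lock()
    held.acquire()  # another "task" holds the per-commit lock
    start = time.monotonic()
    try:
        acquire_with_timeout(held, blocking_timeout=0.1, retry_num=2)
    except LockRetry as e:
        elapsed = time.monotonic() - start
        print(f"gave up after {elapsed:.1f}s, retry in {e.countdown}s")
```

With the old `blocking_timeout=None`, the `acquire` call in this sketch would never return while the lock is held; with a finite timeout, the task gives up quickly and reschedules itself with backoff.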

Root Cause

GCP logs showed:

  • stacks-network/stacks-core: 57 soft timeouts, every single one stuck at "Acquiring lock" for the full 600s — never did any work
  • mixpanel/analytics: 222 psycopg2.OperationalError from stale DB connections after blocking on the lock for minutes
  • Zero LockRetry events in 6 hours — the backoff mechanism exists but was completely bypassed by blocking_timeout=None

Test plan

  • Added test verifying LockManager is instantiated with finite blocking_timeout
  • Existing test_retry_on_report_lock and test_die_on_finisher_lock tests continue to pass (they test the LockRetry behavior that will now actually fire in production)
  • CI passes

Made with Cursor


Note

Medium Risk
Changes lock acquisition behavior in the upload_finisher worker by introducing a finite Redis lock wait, which can alter retry timing and throughput under contention. Scope is limited to lock configuration and covered by a new unit test, but it affects a hot path for upload processing.

Overview
Prevents upload_finisher workers from blocking indefinitely on per-commit Redis locks by switching LockManager instantiation in _process_reports_with_lock and _handle_finisher_lock from blocking_timeout=None to DEFAULT_BLOCKING_TIMEOUT_SECONDS, allowing the existing LockRetry backoff/retry path to execute.

Adds a unit test (test_lock_manager_uses_finite_blocking_timeout) to assert the finisher constructs LockManager with the configured finite blocking_timeout when lock contention triggers retries.

Written by Cursor Bugbot for commit 3e575fe. This will update automatically on new commits.

Comment on lines 419 to 425

```diff
      repoid=repoid,
      commitid=commitid,
      lock_timeout=self.get_lock_timeout(DEFAULT_LOCK_TIMEOUT_SECONDS),
-     blocking_timeout=None,
+     blocking_timeout=DEFAULT_BLOCKING_TIMEOUT_SECONDS,
  )

  try:
```
Bug: A shared Redis lock counter is incremented by all concurrent tasks, causing them to prematurely exceed the max_retries limit and terminate, even on their first attempt.
Severity: CRITICAL

Suggested Fix

The max retry check should be based on the individual task's attempt count (retry_num) rather than the shared Redis counter. The shared counter logic should be removed or redesigned to avoid causing cascading failures across independent tasks. The implementation should be updated to match the docstring's intent, which states that retry_num is used for max retry checking.
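The contrast between the two retry-limit strategies can be sketched as follows. A `defaultdict` stands in for the shared Redis counter; all names here are illustrative, not the worker's actual code.

```python
from collections import defaultdict

MAX_RETRIES = 5
shared_counter = defaultdict(int)  # keyed by lock name, shared across tasks


def should_give_up_shared(lock_name: str) -> bool:
    # Problematic pattern: every contending task increments the same counter,
    # so a task can exceed MAX_RETRIES on its very first attempt.
    shared_counter[lock_name] += 1
    return shared_counter[lock_name] > MAX_RETRIES


def should_give_up_per_task(retry_num: int) -> bool:
    # Suggested direction: each task tracks its own attempt count.
    return retry_num > MAX_RETRIES


if __name__ == "__main__":
    # Six tasks each make their FIRST attempt on the same lock:
    print([should_give_up_shared("upload_finisher/commit-abc") for _ in range(6)])
    # the sixth task gives up even though its own retry_num is 0
    print([should_give_up_per_task(0) for _ in range(6)])
    # per-task counting lets every task keep retrying independently
```

Under the shared counter, independent first-attempt tasks can cascade into `max_retries_exceeded` failures; per-task counting confines the limit to each task's own history.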

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: apps/worker/tasks/upload_finisher.py#L419-L425

Potential issue: In high-concurrency scenarios, the lock manager uses a shared Redis
counter to track failed lock acquisition attempts. When many tasks compete for the same
lock, each task that times out after 5 seconds increments this shared counter. Once the
counter reaches the `max_retries` limit (e.g., 5), all subsequent tasks attempting to
acquire the lock will immediately fail with a `max_retries_exceeded` error. This causes
tasks to terminate prematurely, even on their first attempt, because the failure count
is shared across all concurrent tasks rather than being tracked on a per-task basis.
This issue is exacerbated by the change from an indefinite blocking timeout to a
5-second timeout.


@sentry

sentry bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.24%. Comparing base (ed64332) to head (716ae09).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #729   +/-   ##
=======================================
  Coverage   92.24%   92.24%           
=======================================
  Files        1302     1302           
  Lines       47890    47890           
  Branches     1628     1628           
=======================================
  Hits        44177    44177           
  Misses       3404     3404           
  Partials      309      309           
Flag               Coverage   Δ
workerintegration  58.61%     <ø> (ø)
workerunit         90.33%     <ø> (ø)

Flags with carried forward coverage won't be shown.

@codecov-notifications

codecov-notifications bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


The upload finisher was creating LockManager with blocking_timeout=None,
which blocks the worker thread indefinitely until the lock is available.
For high-concurrency commits (100+ parallel CI jobs), this causes all
finisher tasks to block on the same lock, exhausting the worker pool and
starving all other repos.

Change both UPLOAD_PROCESSING and UPLOAD_FINISHER lock acquisitions to
use DEFAULT_BLOCKING_TIMEOUT_SECONDS (5s). This enables the existing
LockRetry mechanism with exponential backoff, freeing worker threads
immediately instead of blocking for up to 600s (the soft time limit).

Made-with: Cursor
@thomasrockhu-codecov force-pushed the tom/fix-finisher-blocking-timeout branch from 716ae09 to 3e575fe on March 5, 2026 at 22:37