fix(worker): use finite blocking_timeout in upload finisher locks #729

Open

thomasrockhu-codecov wants to merge 1 commit into main from tom/fix-finisher-blocking-timeout
Conversation

@thomasrockhu-codecov
Contributor

thomasrockhu-codecov commented Feb 27, 2026

Summary

  • The upload finisher creates LockManager with blocking_timeout=None in both _process_reports_with_lock and _handle_finisher_lock, which blocks worker threads indefinitely until the Redis lock becomes available.
  • For high-concurrency commits (e.g., stacks-network with 167 finisher tasks for one commit, mixpanel with 78), this causes all finisher tasks to block on the same per-commit lock, exhausting the worker pool and starving finisher tasks from all other repos.
  • Changes blocking_timeout to DEFAULT_BLOCKING_TIMEOUT_SECONDS (5s), enabling the existing LockRetry mechanism with exponential backoff. Worker threads are freed immediately instead of blocking for up to 600s (the soft time limit).
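The behavior change above can be illustrated with a minimal, self-contained sketch. No Redis is involved: a `threading.Lock` stands in for the per-commit Redis lock, and `LockRetry` mimics the worker's retry signal. The name `DEFAULT_BLOCKING_TIMEOUT_SECONDS` mirrors the PR; everything else (function names, backoff values) is illustrative, not the worker's actual implementation.

```python
import threading
import time

DEFAULT_BLOCKING_TIMEOUT_SECONDS = 5


class LockRetry(Exception):
    """Raised when the lock could not be acquired within the blocking timeout."""

    def __init__(self, countdown: int):
        self.countdown = countdown


def acquire_with_timeout(lock: threading.Lock, blocking_timeout, retry_num: int):
    # blocking_timeout=None would block the worker thread indefinitely;
    # a finite timeout frees the thread so the task can be rescheduled.
    timeout = -1 if blocking_timeout is None else blocking_timeout
    if not lock.acquire(timeout=timeout):
        # Illustrative exponential backoff: 20s, 40s, 80s, ...
        raise LockRetry(countdown=20 * 2**retry_num)
    return True


if __name__ == "__main__":
    held = threading.Lock()
    held.acquire()  # another "task" holds the per-commit lock
    start = time.monotonic()
    try:
        acquire_with_timeout(held, blocking_timeout=0.1, retry_num=2)
    except LockRetry as e:
        elapsed = time.monotonic() - start
        print(f"gave up after {elapsed:.1f}s, retry in {e.countdown}s")
```

With the old `blocking_timeout=None`, the `acquire` call in this sketch would never return while the lock is held; with a finite timeout, the task gives up quickly and reschedules itself with backoff.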

Root Cause

GCP logs showed:

  • stacks-network/stacks-core: 57 soft timeouts, every single one stuck at "Acquiring lock" for the full 600s — never did any work
  • mixpanel/analytics: 222 psycopg2.OperationalError from stale DB connections after blocking on the lock for minutes
  • Zero LockRetry events in 6 hours — the backoff mechanism exists but was completely bypassed by blocking_timeout=None

Test plan

  • Added test verifying LockManager is instantiated with finite blocking_timeout
  • Existing test_retry_on_report_lock and test_die_on_finisher_lock tests continue to pass (they test the LockRetry behavior that will now actually fire in production)
  • CI passes

Made with Cursor


Note

Medium Risk
Changes lock acquisition behavior in the upload_finisher worker by introducing a finite Redis lock wait, which can alter retry timing and throughput under contention. Scope is limited to lock configuration and covered by a new unit test, but it affects a hot path for upload processing.

Overview
Prevents upload_finisher workers from blocking indefinitely on per-commit Redis locks by switching LockManager instantiation in _process_reports_with_lock and _handle_finisher_lock from blocking_timeout=None to DEFAULT_BLOCKING_TIMEOUT_SECONDS, allowing the existing LockRetry backoff/retry path to execute.

Adds a unit test (test_lock_manager_uses_finite_blocking_timeout) to assert the finisher constructs LockManager with the configured finite blocking_timeout when lock contention triggers retries.

Written by Cursor Bugbot for commit 3e575fe. This will update automatically on new commits.

Comment on lines 419 to 425

```diff
      repoid=repoid,
      commitid=commitid,
      lock_timeout=self.get_lock_timeout(DEFAULT_LOCK_TIMEOUT_SECONDS),
-     blocking_timeout=None,
+     blocking_timeout=DEFAULT_BLOCKING_TIMEOUT_SECONDS,
  )

  try:
```
Bug: A shared Redis lock counter is incremented by all concurrent tasks, causing them to prematurely exceed the max_retries limit and terminate, even on their first attempt.
Severity: CRITICAL

Suggested Fix

The max retry check should be based on the individual task's attempt count (retry_num) rather than the shared Redis counter. The shared counter logic should be removed or redesigned to avoid causing cascading failures across independent tasks. The implementation should be updated to match the docstring's intent, which states that retry_num is used for max retry checking.
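The contrast between the two retry-limit strategies can be sketched as follows. A `defaultdict` stands in for the shared Redis counter; all names here are illustrative, not the worker's actual code.

```python
from collections import defaultdict

MAX_RETRIES = 5
shared_counter = defaultdict(int)  # keyed by lock name, shared across tasks


def should_give_up_shared(lock_name: str) -> bool:
    # Problematic pattern: every contending task increments the same counter,
    # so a task can exceed MAX_RETRIES on its very first attempt.
    shared_counter[lock_name] += 1
    return shared_counter[lock_name] > MAX_RETRIES


def should_give_up_per_task(retry_num: int) -> bool:
    # Suggested direction: each task tracks its own attempt count.
    return retry_num > MAX_RETRIES


if __name__ == "__main__":
    # Six tasks each make their FIRST attempt on the same lock:
    print([should_give_up_shared("upload_finisher/commit-abc") for _ in range(6)])
    # the sixth task gives up even though its own retry_num is 0
    print([should_give_up_per_task(0) for _ in range(6)])
    # per-task counting lets every task keep retrying independently
```

Under the shared counter, independent first-attempt tasks can cascade into `max_retries_exceeded` failures; per-task counting confines the limit to each task's own history.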

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: apps/worker/tasks/upload_finisher.py#L419-L425

Potential issue: In high-concurrency scenarios, the lock manager uses a shared Redis
counter to track failed lock acquisition attempts. When many tasks compete for the same
lock, each task that times out after 5 seconds increments this shared counter. Once the
counter reaches the `max_retries` limit (e.g., 5), all subsequent tasks attempting to
acquire the lock will immediately fail with a `max_retries_exceeded` error. This causes
tasks to terminate prematurely, even on their first attempt, because the failure count
is shared across all concurrent tasks rather than being tracked on a per-task basis.
This issue is exacerbated by the change from an indefinite blocking timeout to a
5-second timeout.


@sentry

sentry bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.24%. Comparing base (ed64332) to head (716ae09).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #729   +/-   ##
=======================================
  Coverage   92.24%   92.24%           
=======================================
  Files        1302     1302           
  Lines       47890    47890           
  Branches     1628     1628           
=======================================
  Hits        44177    44177           
  Misses       3404     3404           
  Partials      309      309           
Flag               Coverage   Δ
workerintegration  58.61%     <ø> (ø)
workerunit         90.33%     <ø> (ø)

Flags with carried forward coverage won't be shown.

@codecov-notifications

codecov-notifications bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


The upload finisher was creating LockManager with blocking_timeout=None,
which blocks the worker thread indefinitely until the lock is available.
For high-concurrency commits (100+ parallel CI jobs), this causes all
finisher tasks to block on the same lock, exhausting the worker pool and
starving all other repos.

Change both UPLOAD_PROCESSING and UPLOAD_FINISHER lock acquisitions to
use DEFAULT_BLOCKING_TIMEOUT_SECONDS (5s). This enables the existing
LockRetry mechanism with exponential backoff, freeing worker threads
immediately instead of blocking for up to 600s (the soft time limit).

Made-with: Cursor
@thomasrockhu-codecov force-pushed the tom/fix-finisher-blocking-timeout branch from 716ae09 to 3e575fe on March 5, 2026 at 22:37