fix(worker): use finite blocking_timeout in upload finisher locks #729
thomasrockhu-codecov wants to merge 1 commit into main
Conversation
```diff
             repoid=repoid,
             commitid=commitid,
             lock_timeout=self.get_lock_timeout(DEFAULT_LOCK_TIMEOUT_SECONDS),
-            blocking_timeout=None,
+            blocking_timeout=DEFAULT_BLOCKING_TIMEOUT_SECONDS,
         )

         try:
```
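The one-line change above swaps an indefinite wait for a bounded one. A minimal standalone sketch of the behavioral difference, using `threading.Lock` as a stand-in for the Redis lock (`SketchLockManager` is hypothetical, not the real `LockManager`):

```python
import threading

class SketchLockManager:
    """Hypothetical sketch: blocking_timeout=None parks the thread until
    the lock is free; a finite timeout returns False so the caller can
    schedule a retry instead of tying up the worker."""

    def __init__(self, blocking_timeout=None):
        self._lock = threading.Lock()
        self.blocking_timeout = blocking_timeout

    def acquire(self):
        if self.blocking_timeout is None:
            return self._lock.acquire()  # blocks indefinitely
        return self._lock.acquire(timeout=self.blocking_timeout)

manager = SketchLockManager(blocking_timeout=0.1)
manager._lock.acquire()     # simulate another task holding the lock
got_it = manager.acquire()  # returns False after 0.1s instead of hanging
print(got_it)               # → False
```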
Bug: A shared Redis lock counter is incremented by all concurrent tasks, causing them to prematurely exceed the max_retries limit and terminate, even on their first attempt.
Severity: CRITICAL
Suggested Fix
The max retry check should be based on the individual task's attempt count (retry_num) rather than the shared Redis counter. The shared counter logic should be removed or redesigned to avoid causing cascading failures across independent tasks. The implementation should be updated to match the docstring's intent, which states that retry_num is used for max retry checking.
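The difference between the shared-counter pattern and the suggested per-task check can be sketched as follows (all names hypothetical; this is an illustration of the failure mode, not the worker's actual code):

```python
# Buggy pattern: one failure counter shared by all concurrent tasks.
shared_failures = 0

def shared_counter_allows_retry(max_retries: int = 5) -> bool:
    global shared_failures
    shared_failures += 1
    return shared_failures <= max_retries

# Five sibling tasks time out first...
for _ in range(5):
    shared_counter_allows_retry()

# ...and a brand-new task is rejected on its very first attempt:
print(shared_counter_allows_retry())  # → False

# Suggested pattern: each task checks only its own attempt count.
def per_task_allows_retry(retry_num: int, max_retries: int = 5) -> bool:
    return retry_num < max_retries

print(per_task_allows_retry(retry_num=0))  # → True (first attempt succeeds)
```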
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: apps/worker/tasks/upload_finisher.py#L419-L425
Potential issue: In high-concurrency scenarios, the lock manager uses a shared Redis
counter to track failed lock acquisition attempts. When many tasks compete for the same
lock, each task that times out after 5 seconds increments this shared counter. Once the
counter reaches the `max_retries` limit (e.g., 5), all subsequent tasks attempting to
acquire the lock will immediately fail with a `max_retries_exceeded` error. This causes
tasks to terminate prematurely, even on their first attempt, because the failure count
is shared across all concurrent tasks rather than being tracked on a per-task basis.
This issue is exacerbated by the change from an indefinite blocking timeout to a
5-second timeout.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #729   +/-   ##
=======================================
  Coverage   92.24%   92.24%
=======================================
  Files        1302     1302
  Lines       47890    47890
  Branches     1628     1628
=======================================
  Hits        44177    44177
  Misses       3404     3404
  Partials      309      309
```

Flags with carried forward coverage won't be shown.
The upload finisher was creating `LockManager` with `blocking_timeout=None`, which blocks the worker thread indefinitely until the lock is available. For high-concurrency commits (100+ parallel CI jobs), this causes all finisher tasks to block on the same lock, exhausting the worker pool and starving all other repos.

Change both UPLOAD_PROCESSING and UPLOAD_FINISHER lock acquisitions to use `DEFAULT_BLOCKING_TIMEOUT_SECONDS` (5s). This enables the existing `LockRetry` mechanism with exponential backoff, freeing worker threads immediately instead of blocking for up to 600s (the soft time limit).

Made-with: Cursor
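The exponential-backoff scheduling that `LockRetry` enables can be sketched generically; the base delay and the 600s cap below are assumptions for illustration, not values taken from the worker code:

```python
def backoff_countdown(retry_num: int, base: int = 20, cap: int = 600) -> int:
    """Exponential backoff: base, 2*base, 4*base, ... capped at the
    soft time limit, so a retried task never waits longer than the
    limit that previously caused indefinite blocking."""
    return min(base * (2 ** retry_num), cap)

# Delays for the first six retries of a contended task:
print([backoff_countdown(n) for n in range(6)])  # → [20, 40, 80, 160, 320, 600]
```

Because the task returns between attempts, the worker thread is free to serve other repos during each countdown instead of holding a blocking Redis call open.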
Force-pushed from 716ae09 to 3e575fe.
Summary

- The upload finisher created `LockManager` with `blocking_timeout=None` in both `_process_reports_with_lock` and `_handle_finisher_lock`, which blocks worker threads indefinitely until the Redis lock becomes available.
- This change sets `blocking_timeout` to `DEFAULT_BLOCKING_TIMEOUT_SECONDS` (5s), enabling the existing `LockRetry` mechanism with exponential backoff. Worker threads are freed immediately instead of blocking for up to 600s (the soft time limit).

Root Cause
GCP logs showed:

- `psycopg2.OperationalError` from stale DB connections after blocking on the lock for minutes
- No `LockRetry` events in 6 hours — the backoff mechanism exists but was completely bypassed by `blocking_timeout=None`

Test plan
- Added a unit test asserting `LockManager` is instantiated with a finite `blocking_timeout`
- The `test_retry_on_report_lock` and `test_die_on_finisher_lock` tests continue to pass (they test the `LockRetry` behavior that will now actually fire in production)

Made with Cursor
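A test of the shape described in the plan above can be sketched with `unittest.mock`; the `handle_finisher_lock` stand-in below is hypothetical and only mirrors how the real `_handle_finisher_lock` would construct its lock manager:

```python
from unittest import mock

DEFAULT_BLOCKING_TIMEOUT_SECONDS = 5  # assumed constant from the worker code

def handle_finisher_lock(lock_manager_cls, repoid, commitid):
    # Hypothetical stand-in for _handle_finisher_lock: the only behavior
    # under test is how the LockManager is constructed.
    return lock_manager_cls(
        repoid=repoid,
        commitid=commitid,
        blocking_timeout=DEFAULT_BLOCKING_TIMEOUT_SECONDS,
    )

def test_lock_manager_uses_finite_blocking_timeout():
    lock_manager_cls = mock.Mock()
    handle_finisher_lock(lock_manager_cls, repoid=1, commitid="abc123")
    _, kwargs = lock_manager_cls.call_args
    # The regression being guarded against is blocking_timeout=None.
    assert kwargs["blocking_timeout"] is not None
    assert kwargs["blocking_timeout"] == DEFAULT_BLOCKING_TIMEOUT_SECONDS

test_lock_manager_uses_finite_blocking_timeout()
```

Asserting on the constructor's keyword arguments (rather than on timing) keeps the test fast and pins the exact configuration that caused the production issue.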
Note

Medium Risk

Changes lock acquisition behavior in the `upload_finisher` worker by introducing a finite Redis lock wait, which can alter retry timing and throughput under contention. Scope is limited to lock configuration and covered by a new unit test, but it affects a hot path for upload processing.

Overview

Prevents `upload_finisher` workers from blocking indefinitely on per-commit Redis locks by switching `LockManager` instantiation in `_process_reports_with_lock` and `_handle_finisher_lock` from `blocking_timeout=None` to `DEFAULT_BLOCKING_TIMEOUT_SECONDS`, allowing the existing `LockRetry` backoff/retry path to execute.

Adds a unit test (`test_lock_manager_uses_finite_blocking_timeout`) to assert the finisher constructs `LockManager` with the configured finite `blocking_timeout` when lock contention triggers retries.

Written by Cursor Bugbot for commit 3e575fe.