
Conversation


@jason-famedly (Contributor) commented Jan 2, 2026

Basing the retry interval around 5 seconds leaves a big window of waiting, especially since the window is doubled on each retry, during which another worker could be making progress but cannot.

Right now, the retry intervals in seconds look like [0.2, 5, 10, 20, 40, 80, 160, 320, (continues to double)]; shortly after that point, logging about excessive retry times starts.

With this change, retry intervals in seconds should look more like:

[
0.2, 
0.4, 
0.8, 
1.6, 
3.2, 
6.4, 
12.8, 
25.6, 
51.2, 
102.4,  # 1.7 minutes
204.8,  # 3.41 minutes
409.6,  # 6.83 minutes
819.2,  # 13.65 minutes  < logging about excessive times will start here, 13th iteration
900,  # 15 minutes
]

Further work in this area could be to make the cap, the retry-interval starting point, and the multiplier configurable depending on how frequently a given lock should be checked (see the data below for reasons why). Increasing the jitter range may also be a good idea.
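
For illustration, here is a minimal sketch of the intended backoff (assumptions: the 0.2-second starting interval listed above, simple doubling, and a 900-second cap; jitter is omitted). It is not the exact code in this PR, just a reproduction of the schedule:

```python
# Minimal sketch of the intended backoff: start at 0.2s, double on each retry,
# cap at 900s (15 minutes). Jitter is deliberately left out for clarity.
MAX_RETRY_INTERVAL_SECONDS = 900.0


def retry_schedule(start: float = 0.2, steps: int = 14) -> list[float]:
    intervals, current = [], start
    for _ in range(steps):
        intervals.append(current)
        current = min(MAX_RETRY_INTERVAL_SECONDS, current * 2)
    return intervals


print(retry_schedule())
# [0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4, 204.8, 409.6, 819.2, 900.0]
```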

@jason-famedly marked this pull request as ready for review on January 2, 2026 13:18
@jason-famedly requested a review from a team as a code owner on January 2, 2026 13:18

codecov bot commented Jan 2, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.14%. Comparing base (b52bebf) to head (b399876).

Files with missing lines | Patch % | Lines
synapse/handlers/worker_lock.py | 25.00% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@                   Coverage Diff                   @@
##           famedly-release/v1.145     #221   +/-   ##
=======================================================
  Coverage                   80.14%   80.14%           
=======================================================
  Files                         498      498           
  Lines                       71133    71133           
  Branches                    10683    10683           
=======================================================
+ Hits                        57007    57009    +2     
+ Misses                      10886    10885    -1     
+ Partials                     3240     3239    -1     
Files with missing lines | Coverage Δ
synapse/handlers/worker_lock.py | 87.75% <25.00%> (ø)

... and 3 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@jason-famedly force-pushed the jason/worker-lock-tweaks branch from 0d476d8 to 05e50a6 on January 2, 2026 18:48
@jason-famedly changed the title from "fix: Adjust timing and jitter on both styles of WorkerLocks to allow faster resolution of the lock" to "fix: Adjust timing on WorkerLocks to allow faster resolution of the lock" on Jan 5, 2026
@jason-famedly force-pushed the famedly-release/v1.143 branch from fa96612 to 26426ca on January 6, 2026 12:28
@jason-famedly marked this pull request as draft on January 7, 2026 11:19
Base automatically changed from famedly-release/v1.143 to master January 12, 2026 14:39
synapse/handlers/worker_lock.py:

```diff
 def _get_next_retry_interval(self) -> float:
     next = self._retry_interval
-    self._retry_interval = max(5, next * 2)
+    self._retry_interval = min(5.0, next * 2)
```

Contributor commented:

With this change, retry intervals in seconds should look more like [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0, 5.0, 5.0, (stays at 5.0)]

Any interest in upstreaming this change?

It was mentioned as part of the fix for element-hq/synapse#19315, to prevent getting to the point where the number grows so large that we see ValueError: Exceeds the limit (4300 digits) for integer string conversion; use sys.set_int_max_str_digits() to increase the limit.

The only concern with a low value of 5.0 seconds is that it increases the amount of database activity, since all of the workers check the lock more often. It's unclear whether the increased database activity is even a problem in practice. At the very least, we could introduce a high maximum like 15 minutes. 15 minutes is higher than the 10-minute sanity check we have, so we could still notice this failure mode:

```python
if self._retry_interval > 10 * ONE_MINUTE_SECONDS:  # >7 iterations
    logger.warning(
        "Lock timeout is getting excessive: %ss. There may be a deadlock.",
        self._retry_interval,
    )
```
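
As a quick sanity check of that interaction (a standalone sketch, not Synapse code; attempts_until_warning is a hypothetical helper, and the 0.2-second start is taken from the PR description), simulating both caps against the 10-minute threshold shows that a 5-second cap can never trigger the warning, while a 15-minute cap crosses it on the 13th attempt:

```python
from typing import Optional

ONE_MINUTE_SECONDS = 60
WARN_THRESHOLD = 10 * ONE_MINUTE_SECONDS  # the 600s sanity check above


def attempts_until_warning(start: float, cap: float, max_attempts: int = 50) -> Optional[int]:
    """Simulate capped doubling; return the attempt where the interval exceeds the threshold."""
    interval = start
    for attempt in range(1, max_attempts + 1):
        if interval > WARN_THRESHOLD:
            return attempt
        interval = min(cap, interval * 2)
    return None


print(attempts_until_warning(start=0.2, cap=5.0))    # None -> the warning can never fire
print(attempts_until_warning(start=0.2, cap=900.0))  # 13 -> failure mode still noticeable
```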

@jason-famedly (author) replied:

Hello! I think it would be great to upstream this. I have compiled some information on the declarations of WorkerLocks, as I agree that a ceiling of 5 seconds is much too low in general. I will leave that information here in another comment, but we are still investigating whether it is possible to submit upstream from this repo.

@jason-famedly (author) commented:

The lock name constants referenced below:

```python
NEW_EVENT_DURING_PURGE_LOCK_NAME = "new_event_during_purge_lock"
PURGE_PAGINATION_LOCK_NAME = "purge_pagination_lock"
ONE_TIME_KEY_UPLOAD = "one_time_key_upload_lock"
```

  • federation/federation_server.FederationServer._process_incoming_pdus_in_room_inner()

    • note: a LockStore.Lock(_INBOUND_EVENT_HANDLING_LOCK_NAME, room_id) is taken before this is called (from _handle_received_pdu()). I believe this is a background task.
    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False
  • handlers/e2e_keys.E2eKeysHandler._upload_one_time_keys_for_user()

    • lock: acquire_lock()
    • task: ONE_TIME_KEY_UPLOAD
    • lock_key: f"{user_id}_{device_id}"
  • handlers/federation.FederationHandler.maybe_backfill()

    • note: Inside a Linearizer context manager keyed to room_id
    • lock: acquire_read_write_lock()
    • task: PURGE_PAGINATION_LOCK_NAME
    • lock_key: room_id
    • write: False
  • handlers/message.EventCreationHandler.create_and_send_nonmember_event()

    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False
  • handlers/message.EventCreationHandler._send_dummy_events_to_fill_extremities()

    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False
  • handlers/pagination.PaginationHandler.purge_history()

    • lock: acquire_read_write_lock()
    • task: PURGE_PAGINATION_LOCK_NAME
    • lock_key: room_id
    • write: True
  • handlers/pagination.PaginationHandler.purge_room()

    • lock: acquire_multi_read_write_lock()
    • task:
      • PURGE_PAGINATION_LOCK_NAME
      • NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id (for both tasks)
    • write: True
  • handlers/room_member.RoomMemberHandler.update_membership()

    • note: This is inside two Linearizers, one for a given room and another for limiting appservice member changes to no more than 10 simultaneous
    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False
  • rest/client/room_upgrade_rest_servlet.RoomUpgradeRestServlet.on_POST()

    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False
  • storage/controllers/persist_events.EventsPersistenceStorageController._process_event_persist_queue_task()

    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False

Of those, these are more likely to see high-frequency lock attempts:

  • federation/federation_server.FederationServer._process_incoming_pdus_in_room_inner()
  • handlers/e2e_keys.E2eKeysHandler._upload_one_time_keys_for_user()
  • handlers/message.EventCreationHandler.create_and_send_nonmember_event()
  • handlers/message.EventCreationHandler._send_dummy_events_to_fill_extremities()
  • handlers/room_member.RoomMemberHandler.update_membership()
  • storage/controllers/persist_events.EventsPersistenceStorageController._process_event_persist_queue_task()

These possibly fall between medium- and high-frequency attempts:

  • handlers/federation.FederationHandler.maybe_backfill()
  • rest/client/room_upgrade_rest_servlet.RoomUpgradeRestServlet.on_POST()

These are possibly low-volume attempts:

  • handlers/pagination.PaginationHandler.purge_history()
  • handlers/pagination.PaginationHandler.purge_room()

It is probable that allowing a customizable retry interval to be passed into the WorkerLock on creation would be helpful, as sketched below.
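
As a rough illustration of what that could look like (a hypothetical sketch only; LockRetryConfig and its fields are not part of the existing WorkerLocksHandler API), the caller would choose a per-lock schedule when taking the lock:

```python
# Hypothetical sketch only: these parameter names do not exist in the current
# WorkerLocksHandler/WaitingLock API; they illustrate a per-lock retry schedule
# chosen by the caller when the lock is created.
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class LockRetryConfig:
    initial_interval: float = 0.2  # seconds before the first retry
    multiplier: float = 2.0        # growth factor per retry
    max_interval: float = 900.0    # cap (15 minutes)
    jitter: float = 0.1            # +/-10% randomisation of each sleep


class WaitingLockSketch:
    def __init__(self, config: Optional[LockRetryConfig] = None) -> None:
        self._config = config or LockRetryConfig()
        self._retry_interval = self._config.initial_interval

    def _get_next_retry_interval(self) -> float:
        current = self._retry_interval
        self._retry_interval = min(
            self._config.max_interval, current * self._config.multiplier
        )
        return current * random.uniform(
            1 - self._config.jitter, 1 + self._config.jitter
        )


# A hot lock such as ONE_TIME_KEY_UPLOAD could poll aggressively, while the
# purge locks could keep a slower, higher-capped schedule:
fast = WaitingLockSketch(LockRetryConfig(initial_interval=0.1, max_interval=5.0))
slow = WaitingLockSketch(LockRetryConfig(initial_interval=1.0, max_interval=900.0))
```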

@jason-famedly force-pushed the jason/worker-lock-tweaks branch from 05e50a6 to f58b379 on January 15, 2026 19:26
@jason-famedly changed the base branch from master to famedly-release/v1.145 on January 15, 2026 19:32
@jason-famedly force-pushed the jason/worker-lock-tweaks branch from f58b379 to b399876 on January 15, 2026 19:49
Base automatically changed from famedly-release/v1.145 to master January 19, 2026 09:00