
Conversation


@jason-famedly (Contributor) commented Jan 2, 2026

Basing the retry interval around 5 seconds leaves a big window of waiting, especially since the window is doubled on each retry, during which another worker could be making progress but cannot.

Right now, the retry intervals in seconds look like [0.2, 5, 10, 20, 40, 80, 160, 320, (continues to double)]; shortly after that point, logging about excessive retry times starts.

With this change, retry intervals in seconds should look more like:

[
0.2, 
0.4, 
0.8, 
1.6, 
3.2, 
6.4, 
12.8, 
25.6, 
51.2, 
102.4,  # 1.7 minutes
204.8,  # 3.41 minutes
409.6,  # 6.83 minutes
819.2,  # 13.65 minutes  < logging about excessive times will start here, 13th iteration
900,  # 15 minutes
]

Further work in this area could be to make the cap, the retry-interval starting point, and the multiplier configurable depending on how frequently a given lock should be checked (see the data below for reasons why). Increasing the jitter range may also be a good idea.
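
For illustration, here is a minimal sketch of the intended backoff (assumptions: the 0.2-second starting interval listed above, simple doubling, and a 900-second cap; jitter is omitted). It is not the exact code in this PR, just a reproduction of the schedule:

```python
# Minimal sketch of the intended backoff: start at 0.2s, double on each retry,
# cap at 900s (15 minutes). Jitter is deliberately left out for clarity.
MAX_RETRY_INTERVAL_SECONDS = 900.0


def retry_schedule(start: float = 0.2, steps: int = 14) -> list[float]:
    intervals, current = [], start
    for _ in range(steps):
        intervals.append(current)
        current = min(MAX_RETRY_INTERVAL_SECONDS, current * 2)
    return intervals


print(retry_schedule())
# [0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4, 204.8, 409.6, 819.2, 900.0]
```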

@jason-famedly marked this pull request as ready for review on January 2, 2026 13:18
@jason-famedly requested a review from a team as a code owner on January 2, 2026 13:18

codecov bot commented Jan 2, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.14%. Comparing base (b52bebf) to head (b399876).

Files with missing lines | Patch % | Lines
synapse/handlers/worker_lock.py | 25.00% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@                   Coverage Diff                   @@
##           famedly-release/v1.145     #221   +/-   ##
=======================================================
  Coverage                   80.14%   80.14%           
=======================================================
  Files                         498      498           
  Lines                       71133    71133           
  Branches                    10683    10683           
=======================================================
+ Hits                        57007    57009    +2     
+ Misses                      10886    10885    -1     
+ Partials                     3240     3239    -1     
Files with missing lines | Coverage Δ
synapse/handlers/worker_lock.py | 87.75% <25.00%> (ø)

... and 3 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@jason-famedly force-pushed the jason/worker-lock-tweaks branch from 0d476d8 to 05e50a6 on January 2, 2026 18:48
@jason-famedly changed the title from "fix: Adjust timing and jitter on both styles of WorkerLocks to allow faster resolution of the lock" to "fix: Adjust timing on WorkerLocks to allow faster resolution of the lock" on Jan 5, 2026
@jason-famedly force-pushed the famedly-release/v1.143 branch from fa96612 to 26426ca on January 6, 2026 12:28
@jason-famedly marked this pull request as draft on January 7, 2026 11:19
Base automatically changed from famedly-release/v1.143 to master January 12, 2026 14:39
synapse/handlers/worker_lock.py:

```diff
 def _get_next_retry_interval(self) -> float:
     next = self._retry_interval
-    self._retry_interval = max(5, next * 2)
+    self._retry_interval = min(5.0, next * 2)
```

Contributor commented:

With this change, retry intervals in seconds should look more like [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0, 5.0, 5.0, (stays at 5.0)]

Any interest in upstreaming this change?

It was mentioned as part of the fix for element-hq/synapse#19315, to prevent getting to the point where the number grows so large that we see ValueError: Exceeds the limit (4300 digits) for integer string conversion; use sys.set_int_max_str_digits() to increase the limit.

The only concern with a low value of 5.0 seconds is that it increases the amount of database activity, since all of the workers check the lock more often. It's unclear whether the increased database activity is even a problem in practice. At the very least, we could introduce a high maximum like 15 minutes. 15 minutes is higher than the 10-minute sanity check we have, so we could still notice this failure mode:

```python
if self._retry_interval > 10 * ONE_MINUTE_SECONDS:  # >7 iterations
    logger.warning(
        "Lock timeout is getting excessive: %ss. There may be a deadlock.",
        self._retry_interval,
    )
```
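
As a quick sanity check of that interaction (a standalone sketch, not Synapse code; attempts_until_warning is a hypothetical helper, and the 0.2-second start is taken from the PR description), simulating both caps against the 10-minute threshold shows that a 5-second cap can never trigger the warning, while a 15-minute cap crosses it on the 13th attempt:

```python
from typing import Optional

ONE_MINUTE_SECONDS = 60
WARN_THRESHOLD = 10 * ONE_MINUTE_SECONDS  # the 600s sanity check above


def attempts_until_warning(start: float, cap: float, max_attempts: int = 50) -> Optional[int]:
    """Simulate capped doubling; return the attempt where the interval exceeds the threshold."""
    interval = start
    for attempt in range(1, max_attempts + 1):
        if interval > WARN_THRESHOLD:
            return attempt
        interval = min(cap, interval * 2)
    return None


print(attempts_until_warning(start=0.2, cap=5.0))    # None -> the warning can never fire
print(attempts_until_warning(start=0.2, cap=900.0))  # 13 -> failure mode still noticeable
```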

@jason-famedly (author) replied:

Hello! I think it would be great to upstream this. I have compiled some information on the declarations of WorkerLocks, as I agree that a ceiling of 5 seconds is much too low in general. I will leave that information here in another comment, but we are still investigating whether it is possible to submit upstream from this repo.

@jason-famedly (author) commented:

The lock name constants referenced below:

```python
NEW_EVENT_DURING_PURGE_LOCK_NAME = "new_event_during_purge_lock"
PURGE_PAGINATION_LOCK_NAME = "purge_pagination_lock"
ONE_TIME_KEY_UPLOAD = "one_time_key_upload_lock"
```

  • federation/federation_server.FederationServer._process_incoming_pdus_in_room_inner()

    • note: a LockStore.Lock(_INBOUND_EVENT_HANDLING_LOCK_NAME, room_id) is taken before this is called (from _handle_received_pdu()). I believe this is a background task.
    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False
  • handlers/e2e_keys.E2eKeysHandler._upload_one_time_keys_for_user()

    • lock: acquire_lock()
    • task: ONE_TIME_KEY_UPLOAD
    • lock_key: f"{user_id}_{device_id}"
  • handlers/federation.FederationHandler.maybe_backfill()

    • note: Inside a Linearizer context manager keyed to room_id
    • lock: acquire_read_write_lock()
    • task: PURGE_PAGINATION_LOCK_NAME
    • lock_key: room_id
    • write: False
  • handlers/message.EventCreationHandler.create_and_send_nonmember_event()

    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False
  • handlers/message.EventCreationHandler._send_dummy_events_to_fill_extremities()

    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False
  • handlers/pagination.PaginationHandler.purge_history()

    • lock: acquire_read_write_lock()
    • task: PURGE_PAGINATION_LOCK_NAME
    • lock_key: room_id
    • write: True
  • handlers/pagination.PaginationHandler.purge_room()

    • lock: acquire_multi_read_write_lock()
    • task:
      • PURGE_PAGINATION_LOCK_NAME
      • NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id (for both tasks)
    • write: True
  • handlers/room_member.RoomMemberHandler.update_membership()

    • note: This is inside two Linearizers, one for a given room and another for limiting appservice member changes to no more than 10 simultaneous
    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False
  • rest/client/room_upgrade_rest_servlet.RoomUpgradeRestServlet.on_POST()

    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False
  • storage/controllers/persist_events.EventsPersistenceStorageController._process_event_persist_queue_task()

    • lock: acquire_read_write_lock()
    • task: NEW_EVENT_DURING_PURGE_LOCK_NAME
    • lock_key: room_id
    • write: False

Of those, these are more likely to see high-frequency lock attempts:

  • federation/federation_server.FederationServer._process_incoming_pdus_in_room_inner()
  • handlers/e2e_keys.E2eKeysHandler._upload_one_time_keys_for_user()
  • handlers/message.EventCreationHandler.create_and_send_nonmember_event()
  • handlers/message.EventCreationHandler._send_dummy_events_to_fill_extremities()
  • handlers/room_member.RoomMemberHandler.update_membership()
  • storage/controllers/persist_events.EventsPersistenceStorageController._process_event_persist_queue_task()

These possibly fall between medium- and high-frequency attempts:

  • handlers/federation.FederationHandler.maybe_backfill()
  • rest/client/room_upgrade_rest_servlet.RoomUpgradeRestServlet.on_POST()

These are possibly low-volume attempts:

  • handlers/pagination.PaginationHandler.purge_history()
  • handlers/pagination.PaginationHandler.purge_room()

It is probable that allowing a customizable retry interval to be passed into the WorkerLock on creation would be helpful, as sketched below.
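
As a rough illustration of what that could look like (a hypothetical sketch only; LockRetryConfig and its fields are not part of the existing WorkerLocksHandler API), the caller would choose a per-lock schedule when taking the lock:

```python
# Hypothetical sketch only: these parameter names do not exist in the current
# WorkerLocksHandler/WaitingLock API; they illustrate a per-lock retry schedule
# chosen by the caller when the lock is created.
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class LockRetryConfig:
    initial_interval: float = 0.2  # seconds before the first retry
    multiplier: float = 2.0        # growth factor per retry
    max_interval: float = 900.0    # cap (15 minutes)
    jitter: float = 0.1            # +/-10% randomisation of each sleep


class WaitingLockSketch:
    def __init__(self, config: Optional[LockRetryConfig] = None) -> None:
        self._config = config or LockRetryConfig()
        self._retry_interval = self._config.initial_interval

    def _get_next_retry_interval(self) -> float:
        current = self._retry_interval
        self._retry_interval = min(
            self._config.max_interval, current * self._config.multiplier
        )
        return current * random.uniform(
            1 - self._config.jitter, 1 + self._config.jitter
        )


# A hot lock such as ONE_TIME_KEY_UPLOAD could poll aggressively, while the
# purge locks could keep a slower, higher-capped schedule:
fast = WaitingLockSketch(LockRetryConfig(initial_interval=0.1, max_interval=5.0))
slow = WaitingLockSketch(LockRetryConfig(initial_interval=1.0, max_interval=900.0))
```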

@jason-famedly force-pushed the jason/worker-lock-tweaks branch from 05e50a6 to f58b379 on January 15, 2026 19:26
@jason-famedly changed the base branch from master to famedly-release/v1.145 on January 15, 2026 19:32
@jason-famedly force-pushed the jason/worker-lock-tweaks branch from f58b379 to b399876 on January 15, 2026 19:49
Base automatically changed from famedly-release/v1.145 to master January 19, 2026 09:00