fix: Adjust timing on WorkerLocks to allow faster resolution of the lock #221
base: master
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## famedly-release/v1.145 #221 +/- ##
=======================================================
Coverage 80.14% 80.14%
=======================================================
Files 498 498
Lines 71133 71133
Branches 10683 10683
=======================================================
+ Hits 57007 57009 +2
+ Misses 10886 10885 -1
+ Partials 3240 3239 -1
... and 3 files with indirect coverage changes
Continue to review full report in Codecov by Sentry.
Force-pushed from 0d476d8 to 05e50a6
Force-pushed from fa96612 to 26426ca
synapse/handlers/worker_lock.py (Outdated)
```diff
 def _get_next_retry_interval(self) -> float:
     next = self._retry_interval
-    self._retry_interval = max(5, next * 2)
+    self._retry_interval = min(5.0, next * 2)
```
With this change, retry intervals in seconds should look more like [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0, 5.0, 5.0, (stays at 5.0)]
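For illustration, a minimal standalone sketch of the capped doubling (it assumes the 0.1 s starting interval from the sequence above and omits the jitter; this is not the exact Synapse implementation):

```python
def capped_doubling(initial: float = 0.1, cap: float = 5.0, attempts: int = 10) -> list[float]:
    """Reproduce the doubling-with-cap retry intervals (jitter omitted)."""
    intervals = []
    interval = initial
    for _ in range(attempts):
        intervals.append(interval)
        interval = min(cap, interval * 2)
    return intervals


print(capped_doubling())  # [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0, 5.0, 5.0, 5.0]
```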
Any interest in upstreaming this change?
It was mentioned as part of the fix for element-hq/synapse#19315 to prevent the interval from growing so large that we see ValueError: Exceeds the limit (4300 digits) for integer string conversion; use sys.set_int_max_str_digits() to increase the limit.
The only concern with a low value of 5.0 seconds is that it increases the amount of database activity from all of the workers checking the lock more often. It's unclear if the increased database activity is even a problem in practice. At the very least, we could introduce some high maximum like 15 minutes. 15 minutes is higher than the 10-minute sanity check we have, so we could still notice this failure mode:
synapse/synapse/handlers/worker_lock.py, lines 279 to 283 in b958daf:
```python
if self._retry_interval > 10 * ONE_MINUTE_SECONDS:  # >7 iterations
    logger.warning(
        "Lock timeout is getting excessive: %ss. There may be a deadlock.",
        self._retry_interval,
    )
```
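For reference, a minimal sketch of what the suggested 15-minute ceiling could look like (the constant name for the cap is an assumption, and the existing jitter is omitted):

```python
ONE_MINUTE_SECONDS = 60

# Assumed name for the proposed ceiling; 15 minutes stays above the 10-minute
# sanity threshold, so the "excessive lock timeout" warning above can still fire.
MAX_RETRY_INTERVAL_SECONDS = 15 * ONE_MINUTE_SECONDS


def next_retry_interval(current: float) -> float:
    """Double the retry interval each attempt, but never exceed the ceiling."""
    return min(MAX_RETRY_INTERVAL_SECONDS, current * 2)
```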
Hello! I think it would be great to upstream. I have compiled some information on declarations of WorkerLocks, as I agree that a ceiling of 5 seconds is very much too low in general. I will leave the information here in another comment, but we are still investigating whether it is possible to submit upstream from this repo.
```python
NEW_EVENT_DURING_PURGE_LOCK_NAME = "new_event_during_purge_lock"
PURGE_PAGINATION_LOCK_NAME = "purge_pagination_lock"
ONE_TIME_KEY_UPLOAD = "one_time_key_upload_lock"
```
- federation/federation_server.FederationServer._process_incoming_pdus_in_room_inner()
  - note: a LockStore.Lock(_INBOUND_EVENT_HANDLING_LOCK_NAME, room_id) is taken before this is called (from _handle_received_pdu()). I believe this is a background task
  - lock: acquire_read_write_lock()
  - task: NEW_EVENT_DURING_PURGE_LOCK_NAME
  - lock_key: room_id
  - write: False
- handlers/e2e_keys.E2eKeysHandler._upload_one_time_keys_for_user()
  - lock: acquire_lock()
  - task: ONE_TIME_KEY_UPLOAD
  - lock_key: f"{user_id}_{device_id}"
- handlers/federation.FederationHandler.maybe_backfill()
  - note: Inside a Linearizer context manager keyed to room_id
  - lock: acquire_read_write_lock()
  - task: PURGE_PAGINATION_LOCK_NAME
  - lock_key: room_id
  - write: False
- handlers/message.EventCreationHandler.create_and_send_nonmember_event()
  - lock: acquire_read_write_lock()
  - task: NEW_EVENT_DURING_PURGE_LOCK_NAME
  - lock_key: room_id
  - write: False
- handlers/message.EventCreationHandler._send_dummy_events_to_fill_extremities()
  - lock: acquire_read_write_lock()
  - task: NEW_EVENT_DURING_PURGE_LOCK_NAME
  - lock_key: room_id
  - write: False
- handlers/pagination.PaginationHandler.purge_history()
  - lock: acquire_read_write_lock()
  - task: PURGE_PAGINATION_LOCK_NAME
  - lock_key: room_id
  - write: True
- handlers/pagination.PaginationHandler.purge_room()
  - lock: acquire_multi_read_write_lock()
  - tasks: PURGE_PAGINATION_LOCK_NAME, NEW_EVENT_DURING_PURGE_LOCK_NAME
  - lock_key: room_id (for both tasks)
  - write: True
- handlers/room_member.RoomMemberHandler.update_membership()
  - note: This is inside two Linearizers, one for a given room and another for limiting appservice member changes to no more than 10 simultaneous
  - lock: acquire_read_write_lock()
  - task: NEW_EVENT_DURING_PURGE_LOCK_NAME
  - lock_key: room_id
  - write: False
- rest/client/room_upgrade_rest_servlet.RoomUpgradeRestServlet.on_POST()
  - lock: acquire_read_write_lock()
  - task: NEW_EVENT_DURING_PURGE_LOCK_NAME
  - lock_key: room_id
  - write: False
- storage/controllers/persist_events.EventsPersistenceStorageController._process_event_persist_queue_task()
  - lock: acquire_read_write_lock()
  - task: NEW_EVENT_DURING_PURGE_LOCK_NAME
  - lock_key: room_id
  - write: False
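For context, most of the call sites above follow roughly this pattern (a simplified sketch rather than an exact excerpt; the free function and its parameters are assumptions for illustration):

```python
from synapse.handlers.worker_lock import (
    NEW_EVENT_DURING_PURGE_LOCK_NAME,
    WorkerLocksHandler,
)


async def do_work_in_room(worker_lock_handler: WorkerLocksHandler, room_id: str) -> None:
    # Take the shared (read) side of the purge lock for this room; the lock
    # handler retries with the backoff discussed in this PR until it is free.
    async with worker_lock_handler.acquire_read_write_lock(
        NEW_EVENT_DURING_PURGE_LOCK_NAME, room_id, write=False
    ):
        ...  # work that must not overlap with a purge of the room
```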
Of those, these are more likely to be high-frequency attempts:
- federation/federation_server.FederationServer._process_incoming_pdus_in_room_inner()
- handlers/e2e_keys.E2eKeysHandler._upload_one_time_keys_for_user()
- handlers/message.EventCreationHandler.create_and_send_nonmember_event()
- handlers/message.EventCreationHandler._send_dummy_events_to_fill_extremities()
- handlers/room_member.RoomMemberHandler.update_membership()
- storage/controllers/persist_events.EventsPersistenceStorageController._process_event_persist_queue_task()
These are probably somewhere between medium- and high-frequency attempts:
- handlers/federation.FederationHandler.maybe_backfill()
- rest/client/room_upgrade_rest_servlet.RoomUpgradeRestServlet.on_POST()
These are possibly low-volume attempts:
- handlers/pagination.PaginationHandler.purge_history()
- handlers/pagination.PaginationHandler.purge_room()
It would probably be helpful to allow a customizable retry interval to be passed into the WorkerLock on creation; a rough sketch of what that could look like follows.
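Purely hypothetical sketch (none of these names or parameters exist in Synapse today) of a per-lock retry configuration that a call site could supply at creation time:

```python
from dataclasses import dataclass


# Hypothetical configuration a WorkerLock could accept when it is created.
@dataclass(frozen=True)
class LockRetryConfig:
    initial_interval: float = 0.1  # seconds before the first retry
    multiplier: float = 2.0        # growth factor per failed attempt
    max_interval: float = 5.0      # ceiling; high-frequency locks want this low


# A low-volume lock such as the purge lock could tolerate a much higher ceiling:
PURGE_RETRY_CONFIG = LockRetryConfig(initial_interval=1.0, max_interval=15 * 60)
```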
Force-pushed from 05e50a6 to f58b379
…ch attempt and cap at 15 minutes
Force-pushed from f58b379 to b399876
Basing the time interval around 5 seconds leaves a big window of waiting, especially as this window is doubled on each retry, during which another worker could be making progress but cannot.
Right now, the retry interval in seconds looks like [0.2, 5, 10, 20, 40, 80, 160, 320, (continues to double)], after which logging about excessive lock timeouts should start.
With this change, retry intervals in seconds should look more like:
Further suggested work in this area could be to define the cap, the retry interval starting point, and the multiplier depending on how frequently each lock should be checked; see the lock declaration data in this thread for reasons why. Increasing the jitter range may also be a good idea; a rough sketch follows.
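As one possible starting point (illustrative only, not part of this PR), a standalone backoff helper with configurable starting point, multiplier, cap, and jitter range:

```python
import random


def backoff_intervals(
    start: float = 0.1,
    multiplier: float = 2.0,
    cap: float = 5.0,
    jitter: float = 0.1,  # symmetric jitter fraction, i.e. +/- 10%
    attempts: int = 10,
) -> list[float]:
    """Generate capped, jittered retry intervals for a worker lock."""
    intervals = []
    interval = start
    for _ in range(attempts):
        intervals.append(interval * random.uniform(1 - jitter, 1 + jitter))
        interval = min(cap, interval * multiplier)
    return intervals


# e.g. a low-volume purge lock could use a higher cap and a gentler multiplier:
print(backoff_intervals(start=1.0, multiplier=1.5, cap=15 * 60, jitter=0.2))
```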