Update `WorkerLock` tests to better stress the `WORKER_LOCK_MAX_RETRY_INTERVAL` #19772
Conversation
```diff
 # How long before an acquired lock times out.
-_LOCK_TIMEOUT_MS = 2 * 60 * 1000
+_LOCK_TIMEOUT = Duration(minutes=2)
```
Just a refactor to use Duration for _LOCK_TIMEOUT (no behavioral change)
```
This matters most when locks go stale as normally, when the lock holder releases, we
signal to other locks (with the same name/key) that they should try reacquiring the lock
immediately. But stale locks are never released and instead forcefully reaped behind the
scenes.
```
The original reasoning here came from #19755.
It was based on my flawed understanding of how the lock release notifications worked. It turns out we also `notify_lock_released(...)` over replication when other workers tell us about a release.
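As a minimal, self-contained illustration of that notification path (a toy model, not the real Synapse code; only the method names `notify_lock_released(...)` and `add_lock_released_callback(...)` are borrowed from the code discussed here):

```python
from typing import Callable, List, Tuple

# Toy stand-in for the Notifier's lock-released callback registry. In Synapse,
# WorkerLocksHandler registers a callback via add_lock_released_callback(...),
# and notify_lock_released(...) is invoked both for local releases and when a
# release arrives over replication from another worker.
LockReleasedCallback = Callable[[str, str, str], None]


class ToyNotifier:
    def __init__(self) -> None:
        self._callbacks: List[LockReleasedCallback] = []

    def add_lock_released_callback(self, cb: LockReleasedCallback) -> None:
        self._callbacks.append(cb)

    def notify_lock_released(
        self, instance_name: str, lock_name: str, lock_key: str
    ) -> None:
        # Called on a local release *and* when replication tells us another
        # worker released the lock.
        for cb in self._callbacks:
            cb(instance_name, lock_name, lock_key)


released: List[Tuple[str, str, str]] = []
notifier = ToyNotifier()
notifier.add_lock_released_callback(lambda i, n, k: released.append((i, n, k)))

# Local release on this worker:
notifier.notify_lock_released("worker1", "name", "key")
# Same entry point when worker2's release arrives over replication:
notifier.notify_lock_released("worker2", "name", "key")
assert released == [("worker1", "name", "key"), ("worker2", "name", "key")]
```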
```python
# Release the first lock (`lock1`). The second lock (`lock2`) should be
# automatically acquired by the `pump()` inside `get_success()`
self.get_success(lock1.__aexit__(None, None, None))
```
> `# [...] The second lock (lock2) should be automatically acquired by the pump() inside get_success()`
Basically, the behavior described by this comment circumvents the retry interval logic we're trying to stress, and the previous tests actually pass without any of the fixes from #19394 because of this happy-path flow.

To explain further: when a lock is released, we immediately try to re-acquire it. `Notifier.notify_lock_released(...)` calls any callbacks registered via `Notifier.add_lock_released_callback(...)` (which we do in `WorkerLocksHandler`), and that calls `release_lock()`, which resolves the deferred, wakes up the `timeout_deferred(...)`, and loops around the while-loop again to try to re-acquire the lock.

Instead, we want to avoid the lock-released notification path entirely and stress the retry interval, which helps in situations where the lock holder goes stale, is reaped, and the other locks want to try to acquire the lock. A rough sketch of that shape is below.
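For reference, here is an illustrative sketch of the new test shape. This is not the code from this PR: the accessor and handler names (`get_worker_locks_handler`, `acquire_lock`) and the async-context-manager usage are copied from the existing test style, and the clock-advance amount is a placeholder standing in for `_LOCK_TIMEOUT` plus `WORKER_LOCK_MAX_RETRY_INTERVAL`.

```python
from twisted.internet import defer

from tests import unittest


class WorkerLockTimeoutSketch(unittest.HomeserverTestCase):
    """Illustrative sketch; the real tests live in tests/handlers/test_worker_lock.py."""

    def test_stale_lock_is_reacquired_via_retry_interval(self) -> None:
        handler = self.hs.get_worker_locks_handler()

        # Hold the lock with `lock1` and never release it, so no
        # `notify_lock_released(...)` ever fires for `lock2` to piggyback on.
        lock1 = handler.acquire_lock("name", "key")
        self.get_success(lock1.__aenter__())

        # `lock2` queues up behind `lock1`.
        lock2 = handler.acquire_lock("name", "key")
        d2 = defer.ensureDeferred(lock2.__aenter__())
        self.assertFalse(d2.called)

        # Advance the clock past the lock timeout (2 minutes in this PR) so the
        # stale `lock1` is reaped behind the scenes, plus some margin so
        # `lock2`'s retry loop (bounded by WORKER_LOCK_MAX_RETRY_INTERVAL) runs
        # again and re-acquires the lock.
        self.reactor.advance(2 * 60 + 30)

        self.get_success(d2)
        self.get_success(lock2.__aexit__(None, None, None))
```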
I've tested to make sure these new tests fail with a version of Synapse before #19394:

1. `git checkout v1.152.0`
2. Paste the latest `tests/handlers/test_worker_lock.py` into the codebase
3. Shim a couple of values that don't exist in that Synapse version:
   ```python
   WORKER_LOCK_MAX_RETRY_INTERVAL = Duration(seconds=5)
   _LOCK_TIMEOUT = Duration(minutes=2)
   ```
4. `poetry install --extras all`
5. `SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.handlers.test_worker_lock`
6. Notice the tests fail as expected:
   ```
   tests.handlers.test_worker_lock
     WorkerLockTestCase
       test_lock_contention ...            [OK]
       test_timeouts_for_lock_locally ...  [FAIL]
       test_wait_for_lock_locally ...      [OK]
     WorkerLockWorkersTestCase
       test_timeouts_for_lock_worker ...   [FAIL]
       test_wait_for_lock_worker ...       [OK]
   ```
Thanks for the review @erikjohnston 🐎
Update `WorkerLock` tests to better stress the `WORKER_LOCK_MAX_RETRY_INTERVAL`. There is no behavioral change, only a change to the tests. See #19772 (comment) for an explanation of why the tests needed changing (and the diff comments).

Follow-up to #19394. The test discussion originally happened in #19394 (comment).
This spawned from thinking about the problem again.
Dev notes