Update WorkerLock tests to better stress the WORKER_LOCK_MAX_RETRY_INTERVAL #19772

Merged
MadLittleMods merged 5 commits into develop from madlittlemods/better-worker-lock-tests
May 12, 2026
Conversation

@MadLittleMods MadLittleMods commented May 11, 2026

Update WorkerLock tests to better stress the WORKER_LOCK_MAX_RETRY_INTERVAL. There is no behavioral change, only a change to the tests. See #19772 (comment) for an explanation of why the tests needed changing (and diff comments).

Follow-up to #19394. The test discussion originally happened in #19394 (comment).

This came out of thinking about the problem again.

Dev notes

SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.handlers.test_worker_lock
SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.handlers.test_worker_lock.WorkerLockWorkersTestCase.test_timeouts_for_lock_worker

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)


# How long before an acquired lock times out.
-_LOCK_TIMEOUT_MS = 2 * 60 * 1000
+_LOCK_TIMEOUT = Duration(minutes=2)

Just a refactor to use Duration for _LOCK_TIMEOUT (no behavioral change).

Comment on lines +64 to +67
This matters most when locks go stale as normally, when the lock holder releases, we
signal to other locks (with the same name/key) that they should try reacquiring the lock
immediately. But stale locks are never released and instead forcefully reaped behind the
scenes.
@MadLittleMods MadLittleMods May 11, 2026


The original reasoning here came from #19755

It was based on my flawed understanding of how the lock release notifications worked. It turns out we also call notify_lock_released(...) when other workers tell us about a release over replication.


# Release the first lock (`lock1`). The second lock(`lock2`) should be
# automatically acquired by the `pump()` inside `get_success()`
self.get_success(lock1.__aexit__(None, None, None))

# [...] The second lock(`lock2`) should be
# automatically acquired by the `pump()` inside `get_success()`

Basically, the behavior described by this comment circumvents the retry timeout interval logic we're trying to stress. And the previous tests actually pass without any of the fixes from #19394 because of this happy-path flow.

To explain further: when a lock is released, we immediately try to re-acquire it:

Notifier.notify_lock_released(...) calls any callbacks registered via Notifier.add_lock_released_callback(...). WorkerLocksHandler registers such a callback, and it calls release_lock(), which resolves the deferred, wakes up the timeout_deferred(...), and loops around the while-loop again to retry acquiring the lock.

Instead, we want to bypass the lock-released notification and stress the retry interval itself, since that is what helps when the lock holder goes stale, is reaped, and the other waiting locks need to try to acquire the lock.
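
The difference between the two wake-up paths can be illustrated with a toy model. This is only a sketch using asyncio rather than Twisted, and it is not Synapse's code: ToyLock, wait_for_lock, the doubling backoff, and the interval values are all hypothetical stand-ins chosen so the example runs quickly. It shows a waiter that is either woken right away by a release notification or, when a stale lock is reaped silently, only makes progress because the capped retry timer fires.

```python
import asyncio

# Hypothetical cap, named after the idea behind WORKER_LOCK_MAX_RETRY_INTERVAL.
# The value is shrunk so the sketch finishes in a fraction of a second.
MAX_RETRY_INTERVAL_S = 0.05


class ToyLock:
    """Stand-in for the lock storage (an assumption, not Synapse's class)."""

    def __init__(self, held: bool = False) -> None:
        self.held = held

    def try_acquire(self) -> bool:
        if self.held:
            return False
        self.held = True
        return True


async def wait_for_lock(lock: ToyLock, released: asyncio.Event) -> int:
    """Loop until the lock is acquired and return the number of attempts."""
    attempts = 0
    retry_interval = 0.01
    while True:
        attempts += 1
        if lock.try_acquire():
            return attempts
        try:
            # Happy path: a release notification wakes us before the timer.
            await asyncio.wait_for(released.wait(), timeout=retry_interval)
            released.clear()
        except asyncio.TimeoutError:
            # Stale-lock path: nobody notified us, only the timer got us here.
            pass
        # Back off, but never wait longer than the cap between attempts.
        retry_interval = min(retry_interval * 2, MAX_RETRY_INTERVAL_S)


async def demo_notified() -> int:
    """The holder releases and notifies, so the waiter retries immediately."""
    lock = ToyLock(held=True)
    released = asyncio.Event()
    waiter = asyncio.create_task(wait_for_lock(lock, released))
    await asyncio.sleep(0)  # let the waiter make its first (failed) attempt
    lock.held = False
    released.set()
    return await waiter


async def demo_stale() -> int:
    """The lock is reaped without any notification; only retries can win."""
    lock = ToyLock(held=True)
    released = asyncio.Event()

    async def reap_silently() -> None:
        await asyncio.sleep(0.1)
        lock.held = False  # no released.set() here, mimicking a reaped lock

    reaper = asyncio.create_task(reap_silently())
    attempts = await asyncio.wait_for(wait_for_lock(lock, released), timeout=2)
    await reaper
    return attempts


if __name__ == "__main__":
    print("attempts with notification:", asyncio.run(demo_notified()))
    print("attempts with silent reap:", asyncio.run(demo_stale()))
```

A test that only exercises the notified path never depends on the retry timer at all, which mirrors why the previous tests could pass without the fixes from #19394.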

I've tested to make sure these new tests fail with a version of Synapse before #19394:

  1. git checkout v1.152.0
  2. Paste the latest tests/handlers/test_worker_lock.py into the codebase
  3. Shim a couple values that don't exist in that Synapse version:
    WORKER_LOCK_MAX_RETRY_INTERVAL = Duration(seconds=5)
    _LOCK_TIMEOUT = Duration(minutes=2)
    
  4. poetry install --extras all
  5. SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.handlers.test_worker_lock
  6. Notice the tests fail as expected:
    tests.handlers.test_worker_lock
      WorkerLockTestCase
        test_lock_contention ...                                               [OK]
        test_timeouts_for_lock_locally ...                                   [FAIL]
        test_wait_for_lock_locally ...                                         [OK]
      WorkerLockWorkersTestCase
        test_timeouts_for_lock_worker ...                                    [FAIL]
        test_wait_for_lock_worker ...                                          [OK]
    

@MadLittleMods MadLittleMods marked this pull request as ready for review May 11, 2026 21:10
@MadLittleMods MadLittleMods requested a review from a team as a code owner May 11, 2026 21:10
@MadLittleMods MadLittleMods merged commit b8bd351 into develop May 12, 2026
46 checks passed
@MadLittleMods MadLittleMods deleted the madlittlemods/better-worker-lock-tests branch May 12, 2026 15:10
@MadLittleMods

Thanks for the review @erikjohnston 🐎
