[BP-1.15][tests] Mitigate likelihood to run into test stability issues caused by CURATOR-645 #20730
This is a 1.15 backport of PR #20709 (no conflicts detected during cherry-pick)
CURATOR-645 covers a bug in the LeaderLatch implementation that causes a race condition if a child node participating in the leader election is removed too quickly. In that case a different code branch is executed, which triggers a reset of the LeaderLatch instead of re-collecting the children to determine the next leader.
The issue occurs because LeaderLatch#checkLeadership is not executed transactionally, i.e. retrieving the children and setting up the watcher for the predecessor are not done atomically. This leads to a race condition where a child (the previous leader's node) is removed before the watcher is set up, and the situation is then handled incorrectly by calling reset.
Adding a short sleep here (simulating the leader actually doing some work) reduces the risk of hitting the race condition because it gives the concurrently running LeaderLatch instances more time to set up their watchers properly.
This is only meant as a temporary solution until CURATOR-645 is resolved and the Curator dependency on the Flink side is upgraded.
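
The race described above can be illustrated with a small, single-threaded model. This is only a sketch under stated assumptions, not Curator's actual code: `LeaderLatchRaceSketch`, `Outcome`, the `betweenSteps` hook, and the `latch-000N` node names are all hypothetical. It shows only that a gap between reading the children and watching the predecessor makes the outcome depend on when the previous leader's node disappears.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, simplified model of a non-atomic leadership check.
// It does not use Curator or ZooKeeper; names are illustrative only.
class LeaderLatchRaceSketch {

    enum Outcome { LEADER, WATCHING_PREDECESSOR, RESET }

    static Outcome checkLeadership(TreeSet<String> children, String self, Runnable betweenSteps) {
        // Step 1: take a sorted snapshot of the participating child nodes.
        List<String> snapshot = new ArrayList<>(children);
        int index = snapshot.indexOf(self);
        if (index < 0) {
            throw new IllegalArgumentException("self is not a participant: " + self);
        }
        if (index == 0) {
            return Outcome.LEADER;
        }
        String predecessor = snapshot.get(index - 1);

        // Gap between the two steps: other participants may change the children here.
        betweenSteps.run();

        // Step 2: try to watch the predecessor. In this model, a missing
        // predecessor leads to a reset instead of re-reading the children.
        if (!children.contains(predecessor)) {
            return Outcome.RESET;
        }
        return Outcome.WATCHING_PREDECESSOR;
    }

    public static void main(String[] args) {
        // Leader keeps its node long enough: the follower sets up its watcher.
        TreeSet<String> slow = new TreeSet<>(Arrays.asList("latch-0000", "latch-0001"));
        System.out.println(checkLeadership(slow, "latch-0001", () -> { }));

        // Leader removes its node inside the gap: the follower ends up in reset.
        TreeSet<String> fast = new TreeSet<>(Arrays.asList("latch-0000", "latch-0001"));
        System.out.println(checkLeadership(fast, "latch-0001", () -> fast.remove("latch-0000")));
    }
}
```

In this model, the sleep added by the PR corresponds to the leader not removing its node inside the gap, so the follower is more likely to take the `WATCHING_PREDECESSOR` path.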