[FLINK-30765][runtime] Aligns the LeaderElectionService.stop() contract #21742

XComp · 2023-01-20T14:25:08Z

PR order:

=== THIS PR === FLINK-30765
[FLINK-31768][runtime] Removes contender description #22379
[FLINK-31838][runtime] Moves thread handling for leader event listener calls from DefaultMultipleComponentLeaderElectionService into DefaultLeaderElectionService #22422
[FLINK-31773][runtime] Separate DefaultLeaderElectionService.start(LeaderContender) into two separate methods for starting the driver and registering a contender #22380
[FLINK-31776][runtime] Introduces LeaderElection interface #22384

What is the purpose of the change

This PR is about hardening the LeaderElectionService.stop() contract.

The current implementations of LeaderElectionService do not implement the stop() call consistently. Some (e.g. StandaloneLeaderElectionService) call revoke on the LeaderContender, whereas others (e.g. DefaultLeaderElectionService) don't. The MultipleComponentLeaderElectionService does call revoke on the LeaderContender instances, though.

This PR makes the contract of the LeaderElectionService.stop() call more consistent by aligning it with what happens when the leadership is revoked by the HA backend. The contract change reduces the assumptions made by the LeaderElectionService implementations (e.g. that the LeaderContender owns the LeaderElectionService and is, therefore, responsible for its lifecycle), which loosens the coupling between the two components.

This should reduce noise when going ahead with refactoring the interfaces (FLIP-285).

Brief change log

Updated the JavaDoc in LeaderElectionService.stop() to specify the contract
Added a revoke call to the LeaderElectionService.stop() implementations for the case where the individual instance still has the leadership acquired
Additionally, a hotfix commit was added to remove obsolete log code (if statements)
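
For illustration, the aligned contract described above can be sketched roughly as follows. This is a minimal sketch and not the code from this PR: StopContractSketch, Contender and SketchLeaderElectionService are hypothetical, simplified stand-ins for Flink's LeaderContender and LeaderElectionService, and the leadership bookkeeping is an assumption. The point is only that stop() takes the same revocation path as a leadership loss reported by the HA backend, and only if leadership is currently held.

```java
import java.util.UUID;

/** Hypothetical, simplified stand-ins; these are not Flink's actual classes. */
final class StopContractSketch {

    /** Simplified contender exposing only the two leadership callbacks. */
    interface Contender {
        void grantLeadership(UUID leaderSessionId);

        void revokeLeadership();
    }

    /** Simplified service illustrating the aligned stop() contract. */
    static final class SketchLeaderElectionService {
        private final Object lock = new Object();
        private Contender contender;
        private UUID issuedLeaderSessionId; // non-null while leadership is held
        private boolean running;

        void start(Contender contender) {
            synchronized (lock) {
                this.contender = contender;
                this.running = true;
            }
        }

        /** Would be triggered by the HA backend when leadership is acquired. */
        void onGrantLeadership(UUID leaderSessionId) {
            synchronized (lock) {
                if (!running) {
                    return;
                }
                issuedLeaderSessionId = leaderSessionId;
                contender.grantLeadership(leaderSessionId);
            }
        }

        /** Would be triggered by the HA backend when leadership is lost. */
        void onRevokeLeadership() {
            synchronized (lock) {
                if (running) {
                    revokeIfLeader();
                }
            }
        }

        /** Behaves like a backend-triggered revocation if leadership is still held. */
        void stop() {
            synchronized (lock) {
                if (!running) {
                    return; // already stopped, nothing to revoke
                }
                running = false;
                revokeIfLeader();
                contender = null;
            }
        }

        private void revokeIfLeader() {
            if (issuedLeaderSessionId != null) {
                issuedLeaderSessionId = null;
                contender.revokeLeadership();
            }
        }
    }
}
```

With this shape, a contender sees exactly one revokeLeadership() call per granted leadership, no matter whether the backend or stop() ended it, and a stop() on a non-leader instance triggers no callback at all.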

Verifying this change

The LeaderContender.revokeLeadership() call was also added to TestingLeaderElectionService to make each test rely on this contract.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? JavaDocs

flinkbot · 2023-01-20T14:35:21Z

CI report:

030f156 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

zentol · 2023-04-12T08:43:57Z

...time/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java

-                // Clear the old leader information on the external storage
-                leaderElectionDriver.writeLeaderInformation(LeaderInformation.empty());


Why is it fine to remove this? The test states that this is done because leadership was already lost.
Does that imply that from the perspective of a TM there is always a leader (after leadership was acquired at least once), because this information is never cleared but only over-written?

onRevokeLeadership() is called after the leadership is lost. All LeaderElectionDriver implementations do a check before writing data to the backend which prevents any writes if the leadership is lost:

ZooKeeperLeaderElectionDriver:198 (but we even ignore empty leader information in ZooKeeperLeaderElectionDriver:190). The same check exists in ZooKeeperMultipleComponentLeaderElectionDriver.java:154

KubernetesLeaderElectionDriver:139

KubernetesMultipleComponentLeaderElectionDriver:169

One could argue that the leadership could be regained before the write operation is omitted. In that case, we would miss writing the empty leader information. But leadership will be renegotiated anyway, which will trigger a write operation with the updated leader information soon after.
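
To make the argument above concrete, here is a small sketch of a driver that guards its writes with a leadership check. It is an illustration under assumptions, not the code of any of the drivers listed above: GuardedWriteSketch and SketchDriver are hypothetical names, leader information is reduced to a plain String, and the backend is an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in; not one of Flink's LeaderElectionDriver implementations. */
final class GuardedWriteSketch {

    /** Driver that only writes to its (in-memory) backend while leadership is held. */
    static final class SketchDriver {
        private boolean hasLeadership;
        private final List<String> writes = new ArrayList<>();

        void setLeadership(boolean hasLeadership) {
            this.hasLeadership = hasLeadership;
        }

        /** Returns true if the write reached the backend, false if it was skipped. */
        boolean writeLeaderInformation(String leaderInformation) {
            if (!hasLeadership) {
                return false; // leadership is lost: skip the write
            }
            writes.add(leaderInformation);
            return true;
        }

        List<String> writes() {
            return writes;
        }
    }
}
```

Under these assumptions, writing empty leader information after the leadership was lost never reaches the backend, so the removed call would have had no effect there.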

…stop DefaultLeaderElectionService.stop will now behave in the same way as if the leadership loss was triggered by the HA backend (iff the LeaderElectionService instance has the leadership acquired prior to the stop() call).

XComp · 2023-04-12T13:02:40Z

Thanks for the review: I squashed the commits and rebased the branch (which I realized I shouldn't do to keep the diff intact in the PR -.-).

flinkbot added the component=Runtime/Coordination label Jan 20, 2023

XComp force-pushed the FLINK-30765 branch from 9afce3d to a9bdb1b Compare January 20, 2023 16:49

XComp commented Jan 20, 2023 on ...time/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java (outdated)

XComp requested a review from zentol January 20, 2023 16:52

XComp force-pushed the FLINK-30765 branch from a9bdb1b to b867808 Compare February 14, 2023 14:16

XComp force-pushed the FLINK-30765 branch from 8a33943 to cb80f0d Compare March 27, 2023 13:55

XComp force-pushed the FLINK-30765 branch 3 times, most recently from 6266fdf to 1768d4f Compare April 11, 2023 12:17

XComp mentioned this pull request Apr 11, 2023

[FLINK-31773][runtime] Separate DefaultLeaderElectionService.start(LeaderContender) into two separate methods for starting the driver and registering a contender #22380

Merged

zentol self-assigned this Apr 11, 2023

XComp mentioned this pull request Apr 12, 2023

[FLINK-31768][runtime] Removes contender description #22379

Merged

zentol reviewed Apr 12, 2023

zentol approved these changes Apr 12, 2023

XComp added 2 commits April 12, 2023 14:58

[hotfix][runtime] Refactors logging code in DefaultLeaderElectionServ…

9e3682a

…ice. Signed-off-by: Matthias Pohl <matthias.pohl@aiven.io>

XComp force-pushed the FLINK-30765 branch from 148ce1a to 030f156 Compare April 12, 2023 12:59

XComp mentioned this pull request Apr 12, 2023

[FLINK-31776][runtime] Introduces LeaderElection interface #22384

Merged

XComp merged commit 3e4f653 into apache:master Apr 12, 2023

XComp mentioned this pull request May 5, 2023

[FLINK-32013][runtime] Moves the implicit ownership of the lifecycle management from LeaderContender to the HighAvailabilityServices #22526

Closed

This was referenced May 17, 2023

[FLINK-31781][runtime] Introduces the contender ID to the LeaderElectionService interface #22601

Merged

[FLINK-31786][runtime] Removes unused classes #22623

Merged

XComp mentioned this pull request Jul 3, 2023

[FLINK-32381][runtime] Replaces FatalErrorHandler by Listener.onError #22935

Merged
