Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-30765][runtime] Aligns the LeaderElectionService.stop() contract #21742

Merged
merged 2 commits into from
Apr 12, 2023

Conversation

XComp
Copy link
Contributor

@XComp XComp commented Jan 20, 2023

PR order:

What is the purpose of the change

This PR is about hardening the LeaderElectionService.stop() contract.

The current implementations of LeaderElectionService do not implement the stop() call consistently. Some (e.g. StandaloneLeaderElectionService call revoke on the LeaderContender) whereas others don't (e.g. DefaultLeaderElectionService). The MultipleComponentLeaderElectionService does call revoke on the LeaderContender instances, though.

This PR makes the contract of the LeaderElectionService.stop call more consistent, i.e. aligning it more with what's happening if the leadership is revoked by the HA backend. The contract change reduces the assumptions that are made by the LeaderElectionService implementations (e.g. that the LeaderContender owns the LeaderElectionService and is, therefore, responsible for its lifecycle) which losens the coupling between the two components.

This should reduce noise when going ahead with refactoring the interfaces (FLIP-285).

Brief change log

  • Updated the JavaDoc in LeaderElectionService.stop() to specify the contract
  • Add revoke call to LeaderElectionService.stop() implementations for the case where the individual instance still have the leadership acquired
  • Additionally, a hotfix commit was added to remove obsolete log code (if statements)

Verifying this change

The LeaderContender.revokeLeadership() call was also added to TestingLeaderElectionService to make each test rely on this contract.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? JavaDocs

@flinkbot
Copy link
Collaborator

flinkbot commented Jan 20, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Comment on lines -220 to -242
// Clear the old leader information on the external storage
leaderElectionDriver.writeLeaderInformation(LeaderInformation.empty());
Copy link
Contributor

@zentol zentol Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it fine to remove this? The test states that this is done because leadership was already lost.
Does that imply that from the perspective of a TM there is always a leader (after leadership was acquired at least once), because this information is never cleared but only over-written?

Copy link
Contributor Author

@XComp XComp Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onRevokeLeadership() is called after the leadership is lost. All LeaderElectionDriver implementations do a check before writing data to the backend which prevents any writes if the leadership is lost:

One could argue that the leadership could be regained while the operation isn't omitted, yet. Then, we would miss writing the empty leader information. But in that case, leadership will be renegotiated anyway which will trigger a write operation with the updated leader information soon after.

…ice.

Signed-off-by: Matthias Pohl <matthias.pohl@aiven.io>
…stop

DefaultLeaderElectionService.stop will now behave in the same way as if the leadership loss was triggered by the HA backend (iff the LeaderElectionService instance has the leadership acquired prior to the stop() call).
@XComp
Copy link
Contributor Author

XComp commented Apr 12, 2023

Thanks for the review: I squashed the commits and rebased the branch (which I realized I shouldn't do to keep the diff intact in the PR -.-).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants