Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-16171 Fix hybrid mode controller race #15238

Closed

Conversation

mumrah
Copy link
Contributor

@mumrah mumrah commented Jan 19, 2024

This patch causes the active KRaftMigrationDriver to reload the /migration ZK state after electing itself as the leader in ZK. This closes a race condition where the previous active controller could make an update to /migration after the new leader was elected. The update race was not actually a problem regarding the data since both controllers would be syncing the same state from KRaft to ZK, but the change to the znode causes the new controller to fail on the zk version check on /migration.

This patch also fixes a as-yet-unseen bug where the active controllers failing to elect itself via claimControllerLeadership would not retry.

@mumrah mumrah added the kraft label Jan 20, 2024
@mumrah mumrah marked this pull request as ready for review January 20, 2024 16:06
@mumrah mumrah requested a review from cmccabe January 20, 2024 16:06
@@ -139,14 +139,18 @@ public ZkMigrationLeadershipState setMigrationRecoveryState(ZkMigrationLeadershi

@Override
public ZkMigrationLeadershipState claimControllerLeadership(ZkMigrationLeadershipState state) {
this.state = state;
return state;
if (state.zkControllerEpochZkVersion() == ZkMigrationLeadershipState.EMPTY.zkControllerEpochZkVersion()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just compare against -2 (or at least have a constant for this like ZkMigrationLeadershipState.UNKNOWN_ZK_VERSION?)

@@ -635,12 +635,24 @@ class BecomeZkControllerEvent extends MigrationEvent {
public void run() throws Exception {
if (checkDriverState(MigrationDriverState.BECOME_CONTROLLER, this)) {
applyMigrationOperation("Claiming ZK controller leadership", zkMigrationClient::claimControllerLeadership);
if (migrationLeadershipState.zkControllerEpochZkVersion() == -1) {
if (migrationLeadershipState.zkControllerEpochZkVersion() == -2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a constant for this like ZkMigrationLeadershipState.UNKNOWN_ZK_VERSION?

Copy link
Contributor

@cmccabe cmccabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @mumrah . Left two minor comments.

cmccabe pushed a commit that referenced this pull request Jan 29, 2024
This patch causes the active KRaftMigrationDriver to reload the /migration ZK state after electing
itself as the leader in ZK. This closes a race condition where the previous active controller could
make an update to /migration after the new leader was elected. The update race was not actually a
problem regarding the data since both controllers would be syncing the same state from KRaft to ZK,
but the change to the znode causes the new controller to fail on the zk version check on
/migration.

This patch also fixes a as-yet-unseen bug where the active controllers failing to elect itself via
claimControllerLeadership would not retry.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
@cmccabe
Copy link
Contributor

cmccabe commented Jan 29, 2024

committed, thanks!

@cmccabe cmccabe closed this Jan 29, 2024
cmccabe pushed a commit that referenced this pull request Jan 29, 2024
This patch causes the active KRaftMigrationDriver to reload the /migration ZK state after electing
itself as the leader in ZK. This closes a race condition where the previous active controller could
make an update to /migration after the new leader was elected. The update race was not actually a
problem regarding the data since both controllers would be syncing the same state from KRaft to ZK,
but the change to the znode causes the new controller to fail on the zk version check on
/migration.

This patch also fixes a as-yet-unseen bug where the active controllers failing to elect itself via
claimControllerLeadership would not retry.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
This patch causes the active KRaftMigrationDriver to reload the /migration ZK state after electing
itself as the leader in ZK. This closes a race condition where the previous active controller could
make an update to /migration after the new leader was elected. The update race was not actually a
problem regarding the data since both controllers would be syncing the same state from KRaft to ZK,
but the change to the znode causes the new controller to fail on the zk version check on
/migration.

This patch also fixes a as-yet-unseen bug where the active controllers failing to elect itself via
claimControllerLeadership would not retry.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
cmccabe pushed a commit that referenced this pull request Mar 12, 2024
This patch causes the active KRaftMigrationDriver to reload the /migration ZK state after electing
itself as the leader in ZK. This closes a race condition where the previous active controller could
make an update to /migration after the new leader was elected. The update race was not actually a
problem regarding the data since both controllers would be syncing the same state from KRaft to ZK,
but the change to the znode causes the new controller to fail on the zk version check on
/migration.

This patch also fixes a as-yet-unseen bug where the active controllers failing to elect itself via
claimControllerLeadership would not retry.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
This patch causes the active KRaftMigrationDriver to reload the /migration ZK state after electing
itself as the leader in ZK. This closes a race condition where the previous active controller could
make an update to /migration after the new leader was elected. The update race was not actually a
problem regarding the data since both controllers would be syncing the same state from KRaft to ZK,
but the change to the znode causes the new controller to fail on the zk version check on
/migration.

This patch also fixes a as-yet-unseen bug where the active controllers failing to elect itself via
claimControllerLeadership would not retry.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
Phuc-Hong-Tran pushed a commit to Phuc-Hong-Tran/kafka that referenced this pull request Jun 6, 2024
This patch causes the active KRaftMigrationDriver to reload the /migration ZK state after electing
itself as the leader in ZK. This closes a race condition where the previous active controller could
make an update to /migration after the new leader was elected. The update race was not actually a
problem regarding the data since both controllers would be syncing the same state from KRaft to ZK,
but the change to the znode causes the new controller to fail on the zk version check on
/migration.

This patch also fixes a as-yet-unseen bug where the active controllers failing to elect itself via
claimControllerLeadership would not retry.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants