Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-13966 Set curClaimEpoch after we enqueue bootstrap write #12269

Merged
merged 7 commits into from Jun 23, 2022

Conversation

mumrah
Copy link
Contributor

@mumrah mumrah commented Jun 8, 2022

This patch does two things to tighten up races during bootstrap. One is to simply move the volatile write until after the "bootstrapMetadata" event has been enqueued. The other thing is to prepend "bootstrapMetadata" to the queue.

From the JIRA, this test was flaky when we see the "registerBroker" event precede the "boostrapMetadata" event

handleLeaderChange() start
appendWriteEvent(registerBroker)
appendWriteEvent(bootstrapMetadata)
handleLeaderChange() finish
registerBroker() -> writes broker registration to log
bootstrapMetadata() -> writes bootstrap metadata to log

The test harness in QuorumControllerTest is polling the controllers to see when their current leader epoch matches the expected leader epoch

        QuorumController activeController() throws InterruptedException {
        AtomicReference<QuorumController> value = new AtomicReference<>(null);
        TestUtils.retryOnExceptionWithTimeout(20000, 3, () -> {
            LeaderAndEpoch leader = logEnv.leaderAndEpoch();
            for (QuorumController controller : controllers) {
                if (OptionalInt.of(controller.nodeId()).equals(leader.leaderId()) &&
              --->  controller.curClaimEpoch() == leader.epoch()) { <---
                    value.set(controller);
                    break;
                }
            }

            if (value.get() == null) {
                throw new RuntimeException(String.format("Expected to see %s as leader", leader));
            }
        });

        return value.get();
    }

Prior to this patch, we were setting the curClaimEpoch volatile before scheduling the "bootstrapMetadata" event. This creates a race where the test code can see the controller as active before the bootstrap metadata has been enqueued. By moving the volatile write after the "bootstrapMetadata" event enqueueing, we can eliminate this race.

In production, there is nothing stopping clients from making requests to a controller that is in the process of starting up. Once the socket server begins processing requests, we will begin seeing events on the QuorumController queue. The check for active controller is done when events are processed, so we could have events enqueued at any time. By prepending the "bootstrapMetadata" event, we can ensure any events that see the controller as active will come after the bootstrap metadata has been processed.

If we handle a "registerBroker" event before "bootstrapMetadata" is enqueued, then we are sure to see curClaimEpoch = -1. If a "registerBroker" is waiting in the queue while we finish bootstrapping, then we are sure to enqueue "bootstrapMetadata" before it. This means "registerBroker" (and any other event) is guaranteed to see the controller as inactive or that it is active and the bootstrap metadata has been written.

Copy link
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the fix looks good. Is there any way to get a more reliable test case than QuorumControllerTest.testUnregisterBroker?

Copy link
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

@mumrah mumrah merged commit c6c9da0 into apache:trunk Jun 23, 2022
mjsax pushed a commit to confluentinc/kafka that referenced this pull request Jun 30, 2022
)

Also fixes flaky QuorumControllerTest#testInvalidBootstrapMetadata

Reviewers: Jason Gustafson <jason@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants