KAFKA-9212; Ensure LeaderAndIsr state updated in controller context during reassignment #7805

hachikuji · 2019-12-09T18:15:23Z

This is a cherry-pick of 5d0cb14. The main differences are 1) leader epoch validation is unconditionally disabled, and 2) the test case has been refactored due to the absence of the reassignment admin APIs.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

…uring reassignment (apache#7795)

KIP-320 improved fetch semantics by adding leader epoch validation. This relies on reliable propagation of leader epoch information from the controller. Unfortunately, we have encountered a bug during partition reassignment in which the leader epoch in the controller context does not get properly updated. This causes UpdateMetadata requests to be sent with stale epoch information, which results in the metadata caches on the brokers falling out of sync.

This bug has existed for a long time, but it is only a problem due to the new epoch validation done by the client. Because the client includes the stale leader epoch in its requests, the leader rejects them, yet the stale metadata cache on the brokers prevents the consumer from getting the latest epoch. Hence the consumer cannot make progress while a reassignment is ongoing.

Although it is straightforward to fix this problem in the controller for the new releases (which this patch does), it is not so easy to fix older brokers, which means new clients could still encounter brokers with this bug. To address this problem, this patch also modifies the client to treat the leader epoch returned from the Metadata response as "unreliable" if it comes from an older version of the protocol. The client in this case will discard the returned epoch and it won't be included in any requests.

Also, note that the correct epoch is still forwarded to replicas in the LeaderAndIsr request, so this bug does not affect replication.

Reviewers: Jun Rao <junrao@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Ismael Juma <ismael@juma.me.uk>
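
For readers who want a concrete picture of the client-side rule described in the commit message, here is a minimal sketch in Java. It is not the code from the patch: the class `LeaderEpochSketch`, the method `usableLeaderEpoch`, and the `firstReliableVersion` parameter are all hypothetical names chosen for illustration, and the version numbers used in `main` are placeholders rather than the cutoff the real client uses.

```java
import java.util.Optional;

public class LeaderEpochSketch {

    /**
     * Returns the leader epoch the client may attach to later requests.
     *
     * responseVersion      - version of the Metadata response the epoch came from
     * firstReliableVersion - hypothetical cutoff: the first version whose epochs
     *                        are trusted (an assumption, not taken from the patch)
     * reportedEpoch        - epoch carried in the response, if any
     */
    static Optional<Integer> usableLeaderEpoch(short responseVersion,
                                               short firstReliableVersion,
                                               Optional<Integer> reportedEpoch) {
        if (responseVersion < firstReliableVersion) {
            // Older protocol version: the broker may have a stale metadata
            // cache, so discard the epoch instead of sending it back.
            return Optional.empty();
        }
        return reportedEpoch;
    }

    public static void main(String[] args) {
        short cutoff = 9; // placeholder value for illustration only

        // Epoch from an older response version is dropped.
        System.out.println(usableLeaderEpoch((short) 8, cutoff, Optional.of(5)));

        // Epoch from a sufficiently new response version is kept.
        System.out.println(usableLeaderEpoch((short) 9, cutoff, Optional.of(5)));
    }
}
```

Returning an empty `Optional` rather than a sentinel value makes the "no epoch to validate" case explicit, so a caller cannot send a discarded epoch by accident.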

ijuma

LGTM

hachikuji · 2019-12-09T22:56:01Z

retest this please

hachikuji · 2019-12-10T02:14:13Z

The failures are all known flakes in 2.3 (mainly testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl). I ran them locally and they pass. I will go ahead and merge.

ijuma approved these changes Dec 9, 2019

hachikuji merged commit baf7766 into apache:2.3 Dec 10, 2019
