
KAFKA-13407: Always start controller when broker wins election #11476

Closed

Conversation

@edoardocomar (Contributor) commented Nov 8, 2021

Add a call to `onControllerFailover` into the code path where `elect` is
called and the broker discovers it has already been elected. We found
that by restarting the ZK leader we could occasionally trigger this code
path, and prior to this change it would not start a controller failover.
This left our Kafka cluster in a state where the `/controller` znode
existed and named the broker that had "won" the controller election,
but, in terms of runtime state, all the brokers had resigned from being
the controller. Without a running controller, restarting brokers would
typically cause partitions to become under-replicated, as the restarted
brokers never received the UpdateMetadata or LeaderAndISR requests
required to correctly lead or follow any of their replicas.
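
A minimal sketch of the kind of change described above, against the `elect()` path in `KafkaController.scala`. The surrounding structure (`zkClient`, `config`, `activeControllerId`, `onControllerFailover`) is paraphrased from the existing class; treat this as illustrative, not the exact diff:

```scala
// Sketch based on the PR description, not the actual patch; it is a fragment
// of KafkaController.elect(), relying on members of the existing class.
private def elect(): Unit = {
  activeControllerId = zkClient.getControllerId.getOrElse(-1)
  if (activeControllerId != -1) {
    if (activeControllerId == config.brokerId) {
      // The /controller znode already names this broker as the election
      // winner, but the runtime controller state may have been torn down,
      // so start the failover instead of silently returning.
      info(s"Broker $activeControllerId was already elected as the controller, starting failover")
      onControllerFailover()
    } else {
      debug(s"Broker $activeControllerId has been elected as the controller, so stopping the election process.")
    }
    return
  }
  // ... otherwise register this broker as controller in ZK and, if the
  // registration wins the election, call onControllerFailover() as before.
}
```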

Also add some info-level logging and more descriptive messages for the
log lines that were helpful in tracking the controller failover.

proposed fix for https://issues.apache.org/jira/browse/KAFKA-13407

Co-authored-by: Tina Selenge <tina.selenge@gmail.com>
Co-authored-by: Adrian Preston <prestona@uk.ibm.com>
Co-authored-by: Edoardo Comar <ecomar@euk.ibm.com.com>

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@mimaison (Member)

Is this also an issue on trunk? If so, let's fix trunk first, then we can backport it to older branches.

@edoardocomar (Contributor, Author)

Hi @mimaison!
We managed to reproduce this using 3.0 in our cloud clusters, but not locally on a laptop with 3 ZK nodes and 3 Kafka brokers, so latencies may play a part?

As for doing the fix in trunk, we attempted to reproduce it using the ducktape-on-Docker Kafka system tests, but had no luck:
https://github.com/edoardocomar/kafka/tree/controller_system_test

Any suggestions are welcome.

@dineshudayakumar commented Jan 19, 2022

I was seeing this issue with 3.0 in our on-prem 3-node K8s cluster.
After applying this patch, I wasn't able to reproduce it.
Is there a reason why this has not been merged into a release yet?

Thank you!

@ijuma requested review from cmccabe and junrao on February 5, 2022
@junrao (Contributor) commented Feb 24, 2022

@edoardocomar: Thanks for the PR. As I mentioned in the Jira, I believe this issue is fixed by https://issues.apache.org/jira/browse/KAFKA-13461

@mimaison (Member) commented May 3, 2022

This has been fixed in #11563, closing

@mimaison closed this on May 3, 2022
@edoardocomar deleted the KAFKA-13407 branch on January 13, 2023