
KAFKA-13407: Always start controller when broker wins election #11476

Closed

Conversation

@edoardocomar (Contributor) commented Nov 8, 2021

Add a call to `onControllerFailover` into the code path where `elect` is
called and the broker discovers it has already been elected. We found
that by restarting the ZK leader we could occasionally trigger this code
path, and prior to this change it would not start a controller failover.
This left our Kafka cluster in a state where the `/controller` znode
existed and named the broker that had "won" the controller election,
but, in terms of runtime state, all the brokers had resigned from being
the controller. Without a running controller, restarting brokers would
typically cause partitions to become under-replicated, as the restarted
brokers never received the UpdateMetadata or LeaderAndISR requests
required to correctly lead or follow any of their replicas.
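
A minimal sketch of the kind of change described above, against the `elect()` path in `KafkaController.scala`. The surrounding structure (`zkClient`, `config`, `activeControllerId`, `onControllerFailover`) is paraphrased from the existing class; treat this as illustrative, not the exact diff:

```scala
// Sketch based on the PR description, not the actual patch; it is a fragment
// of KafkaController.elect(), relying on members of the existing class.
private def elect(): Unit = {
  activeControllerId = zkClient.getControllerId.getOrElse(-1)
  if (activeControllerId != -1) {
    if (activeControllerId == config.brokerId) {
      // The /controller znode already names this broker as the election
      // winner, but the runtime controller state may have been torn down,
      // so start the failover instead of silently returning.
      info(s"Broker $activeControllerId was already elected as the controller, starting failover")
      onControllerFailover()
    } else {
      debug(s"Broker $activeControllerId has been elected as the controller, so stopping the election process.")
    }
    return
  }
  // ... otherwise register this broker as controller in ZK and, if the
  // registration wins the election, call onControllerFailover() as before.
}
```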

Also add some info-level logging and more descriptive messages for the
log lines that were helpful in tracking the controller failover.

proposed fix for https://issues.apache.org/jira/browse/KAFKA-13407

Co-authored-by: Tina Selenge <tina.selenge@gmail.com>
Co-authored-by: Adrian Preston <prestona@uk.ibm.com>
Co-authored-by: Edoardo Comar <ecomar@euk.ibm.com.com>

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@mimaison (Member)

Is this also an issue on trunk? If so, let's fix trunk first, then we can backport it to older branches.

@edoardocomar (Contributor, Author)

Hi @mimaison!
We managed to reproduce this using 3.0 in our cloud clusters, but not locally on a laptop with 3 ZK nodes and 3 Kafka brokers, so latencies may play a part?

As for doing the fix in trunk, we attempted to reproduce it using the ducktape-on-Docker Kafka system tests, but had no luck:
https://github.com/edoardocomar/kafka/tree/controller_system_test

Any suggestions are welcome.

@dineshudayakumar commented Jan 19, 2022

I was seeing this issue with 3.0 in our on-prem 3-node K8s cluster.
After applying this patch, I wasn't able to reproduce it.
Is there a reason why this has not been merged into a release yet?

Thank you!

@ijuma requested review from cmccabe and junrao on February 5, 2022
@junrao (Contributor) commented Feb 24, 2022

@edoardocomar: Thanks for the PR. As I mentioned in the Jira, I believe this issue is fixed by https://issues.apache.org/jira/browse/KAFKA-13461

@mimaison (Member) commented May 3, 2022

This has been fixed in #11563, closing

@mimaison closed this on May 3, 2022
@edoardocomar deleted the KAFKA-13407 branch on January 13, 2023