Leader controller loses all the callback handlers after leadership switch #394

Closed
jiajunwang opened this issue Aug 8, 2019 · 2 comments

@jiajunwang
Contributor

The leader controller may lose all of its callback handlers after a leadership switch.
To reproduce the issue, the cluster must be running in leader election mode (DistributedLeaderElection). Frequent leadership switches caused by ZK session expiration may then trigger the problem.

The symptom is that, although a leader controller exists, it does not process any ZK notifications, so the cluster is not managed.

@i3wangyi
Contributor

i3wangyi commented Aug 9, 2019

Do I understand this correctly? I want to get a clearer picture of the whole problem, since a number of race-condition issues have come up recently.

  1. The "frequent" leadership switch is caused by ZK session expiration.
  2. What caused the "frequent" ZK session expiration?
  3. Why didn't the new controller successfully register the callback handlers?
  4. Is it because the 1st new controller lost leadership before it finished the registration, and then the 2nd new controller took over?

@jiajunwang
Contributor Author

jiajunwang commented Aug 9, 2019

Do I understand this correctly? I want to get a clearer picture of the whole problem, since a number of race-condition issues have come up recently.

  1. The "frequent" leadership switch is caused by ZK session expiration.

Yes.

  2. What caused the "frequent" ZK session expiration?

This is not confirmed. It may be GC, a network issue, or a combination of both. The result is clear, though.

  3. Why didn't the new controller successfully register the callback handlers?

Please refer to the fix. The previous design cannot gracefully handle more than one leader node change event.
This use case was simply not considered.

  4. Is it because the 1st new controller lost leadership before it finished the registration, and then the 2nd new controller took over?

The 1st controller does lose the leadership, but that is not what causes the problem. The issue always occurs within a single controller: if that controller has a leftover, unprocessed controller change event, it is certain to end up in this bad state.
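
To make the "leftover event" scenario concrete, here is a minimal toy model in Java. It is not Helix code and does not describe the actual fix: `ToyController`, `LeaderChangeSketch`, the handler labels, and the idea of an event carrying a snapshot of the leadership state are all assumptions made for illustration. It only shows one way a stale leader change event, processed after leadership has been re-acquired, could leave a live leader with no handlers, and how reconciling against the current state avoids that.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.BooleanSupplier;

/** Toy model only; none of these names come from Helix. */
final class ToyController {
    private final Set<String> handlers = new LinkedHashSet<>();
    // Stands in for reading the current leader node from ZK (assumption).
    private final BooleanSupplier isLeaderNow;

    ToyController(BooleanSupplier isLeaderNow) {
        this.isLeaderNow = isLeaderNow;
    }

    /** Fragile: acts on the leadership state captured when the event was queued. */
    void onEventUsingSnapshot(boolean leaderWhenQueued) {
        apply(leaderWhenQueued);
    }

    /** Idempotent: ignores any payload and reconciles against the current state. */
    void onEventUsingCurrentState() {
        apply(isLeaderNow.getAsBoolean());
    }

    private void apply(boolean leader) {
        if (leader) {
            // Placeholder handler labels; adding to a set makes re-registration harmless.
            handlers.add("idealStates");
            handlers.add("liveInstances");
        } else {
            handlers.clear();
        }
    }

    int handlerCount() {
        return handlers.size();
    }
}

final class LeaderChangeSketch {
    /** Returns handler counts {fragile, idempotent} after a stale event is replayed. */
    static int[] replayLeftoverEvent() {
        boolean[] leader = {true}; // the controller holds leadership again
        ToyController fragile = new ToyController(() -> leader[0]);
        ToyController robust = new ToyController(() -> leader[0]);

        // The event for the re-acquired leadership is handled first...
        fragile.onEventUsingSnapshot(true);
        robust.onEventUsingCurrentState();
        // ...then a leftover event from the earlier loss is finally processed.
        fragile.onEventUsingSnapshot(false);
        robust.onEventUsingCurrentState();

        return new int[] {fragile.handlerCount(), robust.handlerCount()};
    }

    public static void main(String[] args) {
        int[] counts = replayLeftoverEvent();
        // prints "fragile=0, idempotent=2"
        System.out.println("fragile=" + counts[0] + ", idempotent=" + counts[1]);
    }
}
```

In this sketch the fragile variant ends with zero handlers while still being the leader, which matches the reported symptom; the idempotent variant stays consistent no matter how many events arrive or in what order.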
