Leader controller loses all the callback handlers after leadership switch #394

jiajunwang · 2019-08-08T21:48:13Z

The leader controller may lose all of its callback handlers after a leadership switch.
To reproduce the issue, the cluster must be running in leader election mode (DistributedLeaderElection). Frequent leadership switches caused by ZK session expiration can then trigger the problem.

The symptom is that a leader controller exists but does not process any ZK notifications, so the cluster is effectively unmanaged.

i3wangyi · 2019-08-09T22:23:16Z

Do I understand this correctly? I'd like a clearer picture of the whole thing, since a number of race-condition issues have come up recently.

The "frequent" leadership switch is caused by ZK session expiration.
What caused the "frequent" ZK session expiration?
Why didn't the new controller successfully register the callback handlers?
Is it because the first new controller lost leadership before it finished registering, and then a second new controller took over?

jiajunwang · 2019-08-09T22:43:51Z

> Do I understand this correctly? I'd like a clearer picture of the whole thing, since a number of race-condition issues have come up recently.
>
> The "frequent" leadership switch is caused by ZK session expiration.

Yes.

> What caused the "frequent" ZK session expiration?

This is not confirmed. It may be GC, a network issue, or a combination of the two. The effect is clear, though.

> Why didn't the new controller successfully register the callback handlers?

Please refer to the fix. The previous design cannot gracefully handle more than one leader node change event.
That case was simply not considered.

> Is it because the first new controller lost leadership before it finished registering, and then a second new controller took over?

The first controller does lose leadership, but that is not what causes the problem. The issue always occurs within a single controller: if that controller still has a leftover, unprocessed controller change event, it is certain to end up in this bad state.
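
To make the "leftover event" scenario concrete, here is a minimal sketch. It is not the code from the fix and not a reproduction of the actual bug; it assumes one plausible failure mode, where a handler treats every leader change event as a state flip. `LeaderChangeSketch`, `Controller`, both `onLeaderChange...` methods, and the handler names are all hypothetical and are not Helix APIs. The second variant shows one way to make event handling idempotent: reconcile the handlers with the leadership observed at processing time, so a repeated event is harmless.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names are Helix classes or APIs.
final class LeaderChangeSketch {

    /** Stand-in for a controller and its registered ZK callback handlers. */
    static final class Controller {
        // Illustrative handler names only.
        private static final List<String> HANDLER_NAMES =
                List.of("idealStates", "liveInstances");

        private final Set<String> handlers = new LinkedHashSet<>();
        private boolean assumedLeader = false;

        /** Fragile variant (assumed failure mode): every event flips the state. */
        void onLeaderChangeToggle() {
            assumedLeader = !assumedLeader;
            if (assumedLeader) {
                handlers.addAll(HANDLER_NAMES);
            } else {
                handlers.clear();
            }
        }

        /** Idempotent variant: reconcile handlers with the leadership seen now. */
        void onLeaderChangeReconcile(boolean isLeaderNow) {
            if (isLeaderNow) {
                handlers.addAll(HANDLER_NAMES); // re-adding is a no-op for a set
            } else {
                handlers.clear();
            }
        }

        int handlerCount() {
            return handlers.size();
        }
    }

    public static void main(String[] args) {
        Controller fragile = new Controller();
        fragile.onLeaderChangeToggle(); // becomes leader, registers handlers
        fragile.onLeaderChangeToggle(); // leftover event while still leader
        System.out.println("toggle:    " + fragile.handlerCount() + " handlers");

        Controller robust = new Controller();
        robust.onLeaderChangeReconcile(true); // becomes leader
        robust.onLeaderChangeReconcile(true); // leftover event while still leader
        System.out.println("reconcile: " + robust.handlerCount() + " handlers");
    }
}
```

In the toggle variant, the second event leaves the controller with zero handlers even though it is still the leader, which matches the symptom described at the top of this issue. The reconcile variant keeps both handlers no matter how many events arrive back to back.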

jiajunwang mentioned this issue Aug 8, 2019

Fix the CallbackHandler registration logic in DistributedLeaderElection #395

Merged

7 tasks

jiajunwang closed this as completed Aug 12, 2019
