Possible issues with IST donors going back to Synced too soon #106
Looks like a leftover from the recent intelligent node selection refactoring. The node tries to join twice.
I think this issue was introduced here: c2b489f#diff-3500af29aec4b4fd6a26e6e9194792c0R374
About (2), it's safe. A node can serve IST for several nodes, but can serve SST for only one node. And that's why we add
I think the whole process is safe. It's an optimization. Our intention is to offer more donors by instructing the node to go to the SYNCED state ASAP if it does IST. But it seems this breaks some contracts.
@ayurchen I think maybe it's better to delete this line, because perhaps it's just a premature optimization. What do you think?
About (1), a possible scenario:
Maybe related to #101.
Yan, To be frank I never got to the bottom of those lines and comments there. And now it still seems unclear. But the thing is that in the case of IST the node should join ASAP - i.e. right after it does empty SST. And should not try to join afterwards. Check line replicator_smm.cpp:1182. I think the whole
Yes, my point is, in normal process,
…ter when it exits. it is introduced because sometimes sst_sent may not be called.
…IST or SST for only one node, which is not we expect
Fix merged to 3.x, closing.
General Rule: normally, There are some possible error cases.
About case 1, if SST fails, then
About case 2, if IST fails when connecting to the receiver. Sender connects to receiver at
About case 3, if IST fails when sending trxs. Joiner will detect error at
So to fix case 3 without breaking the general rule, we need to call
Yan, I think you're wrong here. Precisely to allow the donor to serve several ISTs at once it should not be in DONOR state when IST is served. So it should join right after SST is done. That is:
So this is all there is. The issue of the JOINER hanging in case IST fails (at whatever stage) should be reported and fixed separately, as it affects only joiners and only in case of misconfiguration.
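The ordering argued for above (send JOIN as soon as SST is done, so the node leaves DONOR state while IST is still being served) can be pictured with a toy state model. This is only a sketch under stated assumptions: `DonorModel`, `NodeState` and every method name are hypothetical and do not correspond to anything in the Galera sources, and the real state machine has more states and transitions.

```cpp
#include <cstddef>

// Hypothetical, simplified node states; not the actual Galera state set.
enum class NodeState { SYNCED, DONOR, JOINED };

// Toy model of a donor that sends JOIN right after SST completes,
// independently of any IST stream it is still serving.
class DonorModel {
public:
    // A state transfer request arrives. Only a SYNCED node may accept it,
    // which is why at most one SST can be in progress at a time.
    bool begin_transfer() {
        if (state_ != NodeState::SYNCED) return false;
        state_ = NodeState::DONOR;
        return true;
    }

    // SST part finished (it may have been an empty SST when only IST is
    // needed). JOIN is sent here, whether or not IST is still running.
    void sst_done(bool ist_needed) {
        if (state_ != NodeState::DONOR) return;
        if (ist_needed) ++active_ists_;   // IST keeps running in background
        state_ = NodeState::JOINED;
    }

    // The node has caught up with the cluster and is a valid donor again.
    void caught_up() {
        if (state_ == NodeState::JOINED) state_ = NodeState::SYNCED;
    }

    // One background IST stream ended (successfully or not).
    void ist_finished() {
        if (active_ists_ > 0) --active_ists_;
    }

    NodeState state() const { return state_; }
    std::size_t active_ists() const { return active_ists_; }

private:
    NodeState state_ = NodeState::SYNCED;
    std::size_t active_ists_ = 0;
};
```

In this model the node can accept a second request while the first IST is still counted as active, because its state no longer depends on IST progress.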
Right. And that's current implementation
JOINER won't hang, but DONOR will. If the donor fails to send trxs (for an unknown reason) when only doing IST (so the SST request is empty), it will be in DONOR state forever and cannot serve state transfers any more. The commit b83db28 is to fix that problem. (call
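The failure mode described here can be reproduced in a few lines. The sketch below is purely illustrative: `State`, `JoinPolicy` and `serve_ist_only` are hypothetical names, not Galera code, and the two policies are only a simplified reading of the alternatives discussed in this thread (JOIN sent when IST sending succeeds, versus JOIN sent right after the empty SST).

```cpp
// Hypothetical, simplified donor states; not the actual Galera state set.
enum class State { DONOR, JOINED };

// Hypothetical switch: where the donor sends JOIN for an IST-only transfer.
enum class JoinPolicy { AfterIstSuccess, AfterEmptySst };

// Simulates one IST-only transfer (empty SST request) and returns the
// donor's state afterwards. ist_send_ok == false models a failure while
// sending trxs.
State serve_ist_only(JoinPolicy policy, bool ist_send_ok) {
    State s = State::DONOR;              // node was picked as donor

    if (policy == JoinPolicy::AfterEmptySst) {
        s = State::JOINED;               // JOIN sent before IST starts
    }

    if (!ist_send_ok) {
        return s;                        // IST sender bails out on error
    }

    if (policy == JoinPolicy::AfterIstSuccess) {
        s = State::JOINED;               // JOIN sent only on the success path
    }
    return s;
}
```

With `AfterIstSuccess` and a failed send, the function returns `State::DONOR`, which is the stuck donor; with `AfterEmptySst` the donor has already joined, so an IST failure cannot leave it in DONOR state.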
I have an MTR test case for this bug. Please let me know when I can push it.
Yan,
But
calls
Alex, I have written the general rule and error handling of state transfer in this comment #106 (comment).
if we call
Alex, sorry, I must have misunderstood you. It does call
Yes, so my concern is this method:
why is it calling
ok. seems I'm wrong. The reason why calls
We are gonna do IST and SST is not empty, so we expect
But seems I'm wrong. If we are gonna do IST and SST is not empty,
Yan, I think you're still looking at it from the wrong perspective. The And so far cases are 4
In the first two cases,
In the last two cases JOIN must be sent from
IST code should not be involved in this in any way.
On 2014-10-09 04:37, yan.zhang wrote:
OK. I was focusing too much on details. But you are right. Thanks for the explanation.
Refs codership/galera-features#101 - make sure service thread is flushed
I noticed an IST donor only stayed in the Donor state for a very brief period, but the IST took much longer in this log:
My concerns are: