Fix #10080 - OSD crashes during msgr reconnection #3029

Closed

@guangyy guangyy commented Nov 28, 2014

Event flow analysis:
Say OSDs A and B are peers. At some point B is marked down by the monitor (and thus removed from the OSD map), but its daemon keeps running. A marks its connection with B down and cleans up everything associated with it. When B is in again, it tries to connect to A and negotiate the message in/out sequence numbers. On A's side, A receives B's in_seq and drops messages whose sequence number is less than or equal to B's in_seq. In this case, A's out_seq may be inconsistent with B's in_seq.
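The event flow above can be replayed with a toy model. This is only a sketch: the `PeerState` class and its methods are hypothetical and are not taken from the messenger code, the field names simply mirror the `in_seq` / `out_seq` terms used in the description, and the starting counts (five messages sent and received) are made up for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PeerState:
    """Toy per-connection state (hypothetical, for illustration only)."""
    in_seq: int = 0    # highest sequence number received from the peer
    out_seq: int = 0   # highest sequence number sent to the peer
    requeued: deque = field(default_factory=deque)  # unacked outgoing seqs

    def mark_down(self):
        # A forgets everything it knew about the connection.
        self.in_seq = 0
        self.out_seq = 0
        self.requeued.clear()

    def drop_acked(self, peer_in_seq):
        # Drop queued messages the peer reports it has already received.
        while self.requeued and self.requeued[0] <= peer_in_seq:
            self.requeued.popleft()


a = PeerState(out_seq=5)   # assumed: A had sent 5 messages to B
b = PeerState(in_seq=5)    # assumed: B had received all 5

a.mark_down()              # B is marked down, A cleans up
a.drop_acked(b.in_seq)     # B reconnects and reports in_seq = 5

next_seq_from_a = a.out_seq + 1   # A would number its next message 1
expected_by_b = b.in_seq + 1      # B expects the next message to be 6
print(next_seq_from_a, expected_by_b)
```

In this model A's next message would carry a sequence number that B has already counted past, which is the inconsistency the description refers to.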

Solution:
On A's side, after receiving B's in_seq and doing the local clean-up, if A's out_seq is still less than B's in_seq, bump A's out_seq up so that the two sides are consistent with each other.
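A minimal sketch of that adjustment, assuming the bump sets A's `out_seq` to exactly B's `in_seq`. The function name `reconcile_out_seq` and its arguments are hypothetical and do not come from the patch itself.

```python
def reconcile_out_seq(out_seq, requeued, peer_in_seq):
    """Return (new_out_seq, remaining_queue) after the reconnect handshake.

    out_seq     -- A's current outgoing sequence number
    requeued    -- sequence numbers of A's queued, unacked messages
    peer_in_seq -- the in_seq that B reported while reconnecting
    """
    # Local clean-up: drop messages B has already received.
    remaining = [seq for seq in requeued if seq > peer_in_seq]

    # If A is still behind B after the clean-up, catch up (assumed target:
    # B's in_seq) so the next message A sends is the one B expects.
    if out_seq < peer_in_seq:
        out_seq = peer_in_seq

    return out_seq, remaining


# A lost its state (out_seq back at 0) while B remembers in_seq = 5.
print(reconcile_out_seq(0, [], 5))
# A kept its state: nothing is bumped, only acked messages are dropped.
print(reconcile_out_seq(7, [4, 5, 6, 7], 5))
```

The second call shows that the bump is a no-op when A's state survived, so only the lost-state case is affected.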

Fixes: #10080

Signed-off-by: Guang Yang <yguang@yahoo-inc.com>

@ghost ghost added bug-fix core labels Nov 28, 2014
@yuyuyu101
Member

This seems reasonable for the situation. But might a better solution be for A, by checking connect_seq or in_seq, to reply with a reset tag so that B cleans up its state and issues the connect message again?

@guangyy
Contributor Author

guangyy commented Nov 30, 2014

It seems to depend on which underlying assumption holds in this case:

  1. B does not need to reset its in/out queue, so the two sides only need to handshake on the in/out sequence.
  2. B needs to reset its in/out queue, so we clear everything on both sides and start from scratch.

My understanding is that our case belongs to 1 (please correct me if I am wrong here). One step further might be to reset B's in_seq to 0, but that is a little harder from an implementation perspective.
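The two assumptions can be contrasted in a toy sketch. The `Strategy` enum and `reconnect` function are hypothetical names used only for illustration; option 1 is modelled as A catching up to B's in_seq, and option 2 as both counters returning to 0.

```python
from enum import Enum


class Strategy(Enum):
    HANDSHAKE = 1  # option 1: B keeps its queues, only sequences are agreed
    RESET = 2      # option 2: both sides clear everything and start over


def reconnect(a_out_seq, b_in_seq, strategy):
    """Return (a_out_seq, b_in_seq) after reconnecting under a strategy."""
    if strategy is Strategy.RESET:
        # Both sides forget their history.
        return 0, 0
    # Handshake only: A never ends up behind what B has already received.
    return max(a_out_seq, b_in_seq), b_in_seq


print(reconnect(0, 5, Strategy.HANDSHAKE))
print(reconnect(0, 5, Strategy.RESET))
```

Either way the two counters agree afterwards; the difference is whether B's existing state is preserved or discarded.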

@gregsfortytwo
Member

There's something else wrong here. I'll discuss it more on the tracker, but in brief: if we lost some session state, then this is a flimsy band-aid, and I think that no matter the underlying cause it will be missing things. :(

@guangyy guangyy deleted the wip-10080 branch February 27, 2015 08:05