Fix #10080 - OSD crashes during msgr reconnection #3029
Closed
Event flow analysis:
Suppose OSDs A and B are peers. At some point the monitor marks B down (and thus removes it from the OSD map), but the B daemon keeps running. A marks its connection to B down and cleans up everything associated with it. When B is back in, it reconnects to A and the two sides negotiate the message in/out sequence numbers. On A's side, A receives B's in_seq and drops the messages whose sequence number is less than or equal to B's in_seq. In this case, A's out_seq may be inconsistent with B's in_seq.
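The flow above can be modeled with a small sketch. This is an illustration only, not the messenger code: the `SenderState` class and its methods are hypothetical names, and it assumes that "clean everything" resets A's out_seq to 0 and empties its queue of unacknowledged messages.

```python
from collections import deque


class SenderState:
    """Hypothetical model of A's per-connection send state (not the real messenger)."""

    def __init__(self):
        self.out_seq = 0        # sequence number of the last message sent
        self.unacked = deque()  # (seq, payload) pairs not yet acknowledged

    def send(self, payload):
        self.out_seq += 1
        self.unacked.append((self.out_seq, payload))
        return self.out_seq

    def mark_down(self):
        # Assumption: marking the connection down discards all local state.
        self.out_seq = 0
        self.unacked.clear()

    def handle_peer_in_seq(self, peer_in_seq):
        # Drop queued messages the peer reports it has already received.
        while self.unacked and self.unacked[0][0] <= peer_in_seq:
            self.unacked.popleft()


a = SenderState()
for i in range(5):
    a.send(f"msg-{i}")

peer_in_seq = 5                    # B received all five messages before being marked down
a.mark_down()                      # A cleans everything
a.handle_peer_in_seq(peer_in_seq)  # B reconnects and reports its in_seq

mismatch = a.out_seq < peer_in_seq    # A is now behind B
next_seq = a.send("after-reconnect")  # reuses a sequence number B has already counted
```

In this model, A's next message goes out with sequence number 1 while B's in_seq is 5, which is the inconsistency described above.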
Solution:
On A's side, after receiving B's in_seq and doing the local clean-up, if A's out_seq is still less than B's in_seq, bump up A's out_seq to make the two sides consistent with each other.
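A minimal sketch of this rule, under stated assumptions: `reconcile_after_reconnect` is a hypothetical helper, not a function from the patch, and it assumes "bump up" means raising A's out_seq to exactly B's in_seq.

```python
def reconcile_after_reconnect(out_seq, pending, peer_in_seq):
    """Return (new_out_seq, remaining_pending) for the sending side.

    out_seq     -- last sequence number the sender has used
    pending     -- list of (seq, payload) pairs not yet acknowledged
    peer_in_seq -- last sequence number the peer reports having received
    """
    # Local clean-up: drop messages the peer has already received.
    remaining = [(seq, payload) for seq, payload in pending if seq > peer_in_seq]

    # Assumed form of the fix: never stay behind the peer's in_seq.
    if out_seq < peer_in_seq:
        out_seq = peer_in_seq

    return out_seq, remaining


# Sender was reset (out_seq 0, empty queue) but the peer remembers in_seq 5.
bumped = reconcile_after_reconnect(0, [], 5)

# Sender is already ahead of the peer: out_seq is left alone.
unchanged = reconcile_after_reconnect(7, [(4, "a"), (6, "b")], 5)
```

With the bump in place, the next message sent after a reset gets sequence number 6 rather than 1. When the sender is already ahead, only the acknowledged messages are dropped.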
Fixes: #10080
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>