Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.0.x] Added missing destination check before processing message batch #515

Merged
merged 1 commit into from Oct 16, 2020

Conversation

SgtSilvio
Copy link
Contributor

We found a bug with message batching on the receiver side.
The bug leads to the new node silently ignoring messages from other nodes so it can not completely join the cluster.

Steps to reproduce:

  1. Form a cluster of 2 nodes
    node 1 has physical address ip1 and logical address A
    node 2 has physical address ip2 and logical address B
  2. Shutdown B
  3. A still sends unicast messages to B (for example responses that B did not wait for).
    These messages are not acknowledged and so UNICAST3 keeps them in its send table.
    The messages are also retransmitted in an interval.
  4. Start C on the same physical address ip2 as B while messages are still sent/retransmitted from A to B.
  5. The messages to B will arrive at C.
    TP.handleSingleMessage filters out single messages to the old destination B before calling MaxOneThreadPerSender.process via unicastDestMismatch.
    TP.handleMessageBatch or TP.processBatch does not filter out message batches to the old destination B before calling MaxOneThreadPerSender.process.
  6. If a message batch to the old destination B (without OOB and INTERNAL flags) is received, it starts a new batch in MaxOneThreadPerSender.MessageTable.Entry by calling MessageTable.get.
    The destination of the batch is set to the old destination B.
  7. If a message to the new destination C arrives quickly afterwards it is bundled into the same message batch inside MaxOneThreadPerSender.MessageTable.Entry.
    A message batch must never bundle messages to different destinations, so here is the problem.
  8. The whole batch is then ignored in SubmitToThreadPool.BatchHandler.run via the unicastDestMismatch check because the destination of the batch is set to B.
    This includes ignoring the message to the new destination C.
  9. Retransmissions retransmit the old message to B and the new message to C which then are again batched together and ignored.

Result: Messages arrive at the transport level but are never passed up the stack, so the new node is stuck although heartbeats work.

The fix we provide is very easy: the unicastDestMismatch check is performed for messages batches the same way it is performed for single messages before starting processing the message.

@SgtSilvio SgtSilvio changed the title Added missing destination check before processing message batch [4.0.x] Added missing destination check before processing message batch Oct 16, 2020
@belaban belaban merged commit 469cb0b into belaban:4.0.22 Oct 16, 2020
@belaban
Copy link
Owner

belaban commented Oct 16, 2020

Thanks for the detailed bug report and PR! Applied

@SgtSilvio
Copy link
Contributor Author

Thank you for the fast response and merging the PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants