-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
broker: avoid LOST due to EHOSTUNREACH messages during shutdown #5928
Conversation
I ran a quick test of current master and this branch (rebased on current master) to see if there was an impact on the time from
This was on a single node. Times are in seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Cool thanks! MWP away! |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5928 +/- ##
==========================================
- Coverage 83.31% 83.31% -0.01%
==========================================
Files 514 514
Lines 82959 83013 +54
==========================================
+ Hits 69120 69161 +41
- Misses 13839 13852 +13
|
Problem: comment in state_machine.c contains a spelling error. Fix it.
Problem: it is rather confusing to trace how the broker transitions out of GOODBYE state on all ranks, since the "goodbye" event is posted only in shutdown.c, which doesn't do much on followers. Define an action_goodbye() callback, following the pattern used in other states, and have it post the "goodbye" event for followers. This will enable other things to be done in the action callback too.
Problem: the log message "<hostname> (rank <N>) transitioning to LOST due to EHOSTUNREACH error on send" is not user friendly. Change it to <hostname> (rank <N>) has disconnected unexpectedly. Marking it LOST.
Problem: changes in the overlay shutdown handshake may affect the ability for follower ranks to get final log messages into the rank 0 log buffer. Change the test that ensures the broker transitions through expected states to use log-stderr-mode=local instead of the default "leader" mode, so the log messages are captured on stderr from all ranks not just via rank 0.
Problem: the shutdown handshake is insufficient to prevent the parent from accessing a child peer's socket after it has disconnected and logging errors about unexpected disconnection. Currently a child node immediately exits GOODBYE state, and tears down the overlay network, which sends a control status offline message and disconnects the socket. The parent processes the control status message asynchronously (whenever it emerges from the message queue). In the mean time, the parent could send messages (for example routine event broadcasts) to the child's socket. If the send fails with EHOSTUNREACH before the peer is marked offline, the unexpected disconnection is logged at LOG_ERR. Add a overlay.goodbye RPC which the child sends to its parent when it enters GOODBYE state. The child remains in GOODBYE state until a response is received (or a timeout), then it proceeds to EXIT state and the disconnects as before. While waiting for the response, further messages to the parent, like heartbeats or log messages, are suppressed. This is to avoid triggering a disconnect control message from the parent if the follower has been marked OFFLINE before the message is processed, as it surely will be given message order guarantees. The disconnect control messsage causes immediate disconnection - not logged on the parent, but LOG_CRIT on the child. Fixes flux-framework#5881
Problem: as noted in #5881, we observe many "transitioning to LOST due to EHOSTUNREACH" log messages during shutdown of a large system with a flat TBON.
This is due to a deficiency in the shutdown handshake. Once a tbon child completes rc3, it sends an offline status control message to the parent and immediately disconnects. The parent processes the status message asynchronously and may attempt to send messages to the child (for example routine broadcasts) after the child has closed the connection, which triggers this log message.
Improve the shutdown handshake by adding an RPC that the child must wait for before disconnecting.
In addition, make the message a little more user friendy so if it does appear it is more clear to users what is going on.