broker: improve handling of overlay network during shutdown #5883
Conversation
Problem: parent_cb() defines 'type', then redefines it in one of its blocks. Rename the variable in the block to avoid this.
LGTM!
@@ -777,9 +777,12 @@ static int overlay_sendmsg_child (struct overlay *ov, const flux_msg_t *msg)
    && (child = child_lookup_online (ov, uuid))) {
    flux_log (ov->h,
              LOG_ERR,
              "%s (rank %d) transitioning to LOST due to %s",
Commit message: till ➡️ still?
Fixed that and another typo. Doing some manual testing now.
Problem: if the parent broker reboots after hello was sent but before a response is received, the child broker is stuck, unable to connect. As noted in flux-framework#5881, we see a continuous stream of log messages on the client:

DROP upstream control topic - : message received before hello handshake completed

The messages may get started because:
1. child sends hello
2. parent reboots and misses the hello
3. child starts sending periodic heartbeat control messages to the parent after 5s, regardless of the hello handshake status (this is by design - see below)
4. parent expects to get a hello request before any other message, so it sends a disconnect control message
5. child logs the above message (which goes to the parent) but doesn't disconnect
6. parent receives log message, goto 4

Note that although the broker's downstream 0MQ ROUTER socket exposes TCP disconnects as you might expect for a regular socket, the upstream 0MQ DEALER socket does not. This is why the child does not detect the parent restart after sending hello - it has seamlessly reconnected to the new parent, but the hello request that was likely safely delivered to the old parent before it died is gone. The design to work around this passive reconnect behavior is to start the periodic heartbeat control messages early and force something to happen. The parent does the right thing by sending the disconnect in response to the heartbeat. Unfortunately, the child ignores the disconnect and a fast but unentertaining game of ping-pong ensues. To break the cycle, handle the control disconnect message in the child as originally intended. That will cause the broker to be restarted by systemd, and introductions can be restarted from scratch. This was first noted in flux-framework#5881.
Problem: if an instance is slow to shut down, nodes could restart and rejoin the dying instance. Set a flag when the broker enters cleanup state that causes any new clients that send a hello message to get an immediate control disconnect.
Problem: when nodes of a system instance are forcibly disconnected, they can reconnect fairly quickly because we allow systemd to restart them in 5s. While Flux is shutting down, hello requests are now immediately sent a control disconnect message, but if the fanout is large and the shutdown is slow, the whack-a-mole overhead may be non-negligible. Raise the systemd unit file RestartSec value from 5s to 30s.
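In the unit file, the change amounts to raising `RestartSec` (a standard systemd `[Service]` directive). This excerpt is illustrative; the actual flux unit file contains more than shown here:

```ini
# Excerpt from a systemd unit such as flux.service (illustrative).
# Restart=always brings a disconnected broker back; raising RestartSec
# spaces out reconnect storms while a large instance shuts down slowly.
[Service]
Restart=always
RestartSec=30s
```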
Problem: many "transitioning to LOST due to EHOSTUNREACH error on send" messages were logged during shutdown of a large instance. This is still not well understood, but we can perhaps get a little more information for next time. Add the previous state and the time the peer has spent in that state (in whole seconds). Hopefully this will help with flux-framework#5881 if it occurs again.
I didn't find anything wrong with this in testing, but I did discover one omission that could be corrected at a later date. Only the rank 0 broker prohibits peers from reconnecting during cleanup. The other ranks are still in run state and are perfectly happy to let peers return to service, only to ask them to shut down once cleanup completes. It's not wrong, but it seems like wasted effort that we could fix. Unfortunately it's a little more than I think I can get done today. Let's merge this and get on with it. Setting MWP.
Codecov Report
Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #5883      +/-   ##
==========================================
- Coverage   83.31%   83.30%   -0.01%
==========================================
  Files         514      514
  Lines       82767    82774       +7
==========================================
- Hits        68957    68955       -2
- Misses      13810    13819       +9
This should resolve some problems related to #5881, where a flux shutdown got stuck logging many "transitioning to LOST" messages, and some clients remained stuck in a loop through the next day, waiting for a hello response that would never come.
The hello deadlock should be resolved here. Instead of logging an error message and doing nothing when a disconnect control message is received during hello, we disconnect and start over.
The leader broker problems may be related to the fact that there is no mechanism to prevent clients that have rebooted from saying hello again while the instance is in shutdown state. Add a mechanism to send disconnect control messages to any client saying hello during shutdown. Also add the previous state and the time in that state to the LOST messages to get a little more info on what is happening if we see it again.
Finally, the systemd min restart time is raised from 5s to 30s to avoid storms of reconnects, since sending those disconnects could end up being a non-trivial overhead for a broker with many peers.