broker: improve handling of overlay network during shutdown #5883
Conversation
Problem: parent_cb() defines 'type', then redefines it in one of its blocks. Rename the variable in the block to avoid this.
LGTM!
@@ -777,9 +777,12 @@ static int overlay_sendmsg_child (struct overlay *ov, const flux_msg_t *msg)
    && (child = child_lookup_online (ov, uuid))) {
    flux_log (ov->h,
              LOG_ERR,
              "%s (rank %d) transitioning to LOST due to %s",
Commit message: till ➡️ still?
Fixed that and another typo. Doing some manual testing now.
Problem: if the parent broker reboots after hello was sent but before a response is received, the child broker is stuck, unable to connect. As noted in flux-framework#5881, we see a continuous stream of log messages on the client:

DROP upstream control topic - : message received before hello handshake completed

The messages may get started because:
1. child sends hello
2. parent reboots and misses the hello
3. child starts sending periodic heartbeat control messages to the parent after 5s, regardless of the hello handshake status (this is by design - see below)
4. parent expects to get a hello request before any other message, so it sends a disconnect control message
5. child logs the above message (which goes to the parent) but doesn't disconnect
6. parent receives log message, goto 4

Note that although the broker's downstream 0MQ ROUTER socket exposes TCP disconnects as you might expect for a regular socket, the upstream 0MQ DEALER socket does not. This is why the child does not detect the parent restart after sending hello - it has seamlessly reconnected to the new parent, but the hello request that was likely safely delivered to the old parent before it died is gone. The design to work around this passive reconnect behavior is to start the periodic heartbeat control messages early and force something to happen. The parent does the right thing by sending the disconnect in response to the heartbeat. Unfortunately, the child ignores the disconnect and a fast but unentertaining game of ping-pong ensues. To break the cycle, handle the control disconnect message in the child as originally intended. That will cause the broker to be restarted by systemd, and introductions can be restarted from scratch. This was first noted in flux-framework#5881.
Problem: if an instance is slow to shut down, nodes could restart and rejoin the dying instance. Set a flag when the broker enters cleanup state that causes any new clients that send a hello message to get an immediate control disconnect.
Problem: when nodes of a system instance are forcibly disconnected, they can reconnect fairly quickly because we allow systemd to restart them in 5s. While Flux is shutting down, hello requests are now immediately sent a control disconnect message, but if the fanout is large and the shutdown is slow, the whack-a-mole overhead may be non-negligible. Raise the systemd unit file RestartSec value from 5s to 30s.
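In the unit file, the change amounts to raising `RestartSec` (a standard systemd `[Service]` directive). This excerpt is illustrative; the actual flux unit file contains more than shown here:

```ini
# Excerpt from a systemd unit such as flux.service (illustrative).
# Restart=always brings a disconnected broker back; raising RestartSec
# spaces out reconnect storms while a large instance shuts down slowly.
[Service]
Restart=always
RestartSec=30s
```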
Problem: many "transitioning to LOST due to EHOSTUNREACH error on send" messages were logged during shutdown of a large instance. This is still not well understood, but we can perhaps get a little more information for next time. Add the previous state and the time the peer has spent in that state (in whole seconds). Hopefully this will help with flux-framework#5881 if it occurs again.
I didn't find anything wrong with this in testing, but I did discover one omission that could be corrected at a later date. Only the rank 0 broker prohibits peers from reconnecting during cleanup. The other ranks are still in run state and are perfectly happy to let peers return to service, only to ask them to shut down once cleanup completes. It's not wrong, but it seems like wasted effort that we could fix. Unfortunately it's a little more than I think I can get done today. Let's merge this and get on with it. Setting MWP.
Codecov Report
Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #5883      +/-   ##
==========================================
- Coverage   83.31%   83.30%   -0.01%
==========================================
  Files         514      514
  Lines       82767    82774       +7
==========================================
- Hits        68957    68955       -2
- Misses      13810    13819       +9
This should resolve some problems related to #5881, where a flux shutdown got stuck logging many "transitioning to LOST" messages, and some clients remained stuck in a loop through the next day, waiting for a hello response that would never come.
The hello deadlock should be resolved here. Instead of logging an error message and doing nothing when a disconnect control message is received during hello, we disconnect and start over.
The leader broker problems may be related to the fact that there is no mechanism to prevent clients that have rebooted from saying hello again while the instance is in shutdown state. Add a mechanism to send disconnect control messages to any client saying hello during shutdown. Also add the previous state and the time in that state to the LOST messages to get a little more info on what is happening if we see it again.
Finally, the systemd min restart time is raised from 5s to 30s to avoid storms of reconnects, since sending those disconnects could end up being a non-trivial overhead for a broker with many peers.