broker: improve handling of overlay network during shutdown #5883

Problem: parent_cb() defines 'type', then redefines it in one of its blocks. Rename the variable in the block to avoid this.
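As a minimal sketch of the kind of rename described above (the function, enum values, and variable names here are hypothetical stand-ins, not the real parent_cb() code), the inner block gets its own name so it no longer shadows the outer 'type'. Compilers such as GCC and Clang can flag this pattern with -Wshadow.

```c
#include <stddef.h>

/* Hypothetical message and control types, for illustration only. */
enum { MSG_CONTROL = 1, MSG_OTHER = 2 };
enum { CONTROL_DISCONNECT = 10 };

/* Returns the outer message type. The inner block stores the control
 * type in 'ctrl_type' rather than redeclaring 'type', so the two
 * values cannot be confused. */
int classify (int msg_type, int control_type, int *ctrl_out)
{
    int type = msg_type;

    if (type == MSG_CONTROL) {
        int ctrl_type = control_type; /* previously a second 'type' */
        if (ctrl_out != NULL)
            *ctrl_out = ctrl_type;
    }
    return type;
}
```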

Problem: if the parent broker reboots after hello was sent but before a response is received, the child broker is stuck unable to connect.

As noted in flux-framework#5881, we see a continuous stream of log messages on the client:

  DROP upstream control topic - : message received before hello handshake completed

The messages may get started because

1. child sends hello
2. parent reboots and misses the hello
3. child starts sending periodic heartbeat control messages to the parent after 5s regardless of the hello handshake status (this is by design - see below)
4. parent expects to get a hello request before any other message, so it sends a disconnect control message
5. child logs the above message (which goes to the parent) but doesn't disconnect
6. parent receives log message, goto 4

Note that although the broker's downstream 0MQ ROUTER socket exposes TCP disconnects as you might expect for a regular socket, the upstream 0MQ DEALER socket does not. This is why the child does not detect the parent restart after sending hello - it has seamlessly reconnected to the new parent, but the hello request was likely delivered to the old parent before it died and is gone.

The design to work around this passive reconnect behavior is to start the periodic heartbeat control messages early and force something to happen. The parent does the right thing by sending the disconnect in response to the heartbeat. Unfortunately, the child ignores the disconnect and a fast but unentertaining game of ping-pong ensues.

To break the cycle, handle the control disconnect message in the child as originally intended. That will cause the broker to be restarted by systemd and introductions can be restarted from scratch.

This was first noted in flux-framework#5881.
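The fix can be pictured with a small model of the child's control message handling. This is only a sketch under stated assumptions: the struct, enum, and function names are hypothetical and do not come from the broker source. It shows the intended ordering, where a disconnect is honored even before the hello handshake completes, while other early control messages are still dropped.

```c
#include <stdbool.h>

/* Hypothetical control message types, for illustration only. */
enum ctrl_type { CTRL_OTHER, CTRL_DISCONNECT };

/* Hypothetical model of the child's view of its parent connection. */
struct child {
    bool hello_done;   /* true once the hello response has arrived */
    bool disconnected; /* true once the child has given up on the parent */
    int dropped;       /* control messages dropped before the handshake */
};

/* Handle a control message received from the parent.
 * The disconnect check comes first, so it takes effect whether or not
 * the handshake has completed. Any other control message that arrives
 * before the handshake is counted as dropped. */
void child_handle_control (struct child *c, enum ctrl_type type)
{
    if (type == CTRL_DISCONNECT) {
        c->disconnected = true; /* the real broker would exit here */
        return;
    }
    if (!c->hello_done) {
        c->dropped++;
        return;
    }
    /* Post-handshake control traffic is not modeled in this sketch. */
}
```

With the check ordered this way, the loop in steps 4 through 6 ends the first time the parent's disconnect arrives.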

Problem: if an instance is slow to shut down, nodes could restart and rejoin the dying instance. Set a flag when the broker enters cleanup state that causes any new clients that send a hello message to get an immediate control disconnect.
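A minimal sketch of the flag described above, assuming hypothetical names (struct parent, parent_enter_cleanup(), and parent_handle_hello() are illustrative and not taken from the broker source): once the flag is set, every hello is answered with a disconnect instead of being accepted.

```c
#include <stdbool.h>

/* Hypothetical outcome of processing a hello request. */
enum hello_result { HELLO_ACCEPT, HELLO_DISCONNECT };

/* Hypothetical parent-side state. */
struct parent {
    bool shutdown; /* set when the broker enters cleanup state */
};

/* Called once, when the broker transitions to cleanup. */
void parent_enter_cleanup (struct parent *p)
{
    p->shutdown = true;
}

/* Decide how to answer a hello request from a (re)connecting child. */
enum hello_result parent_handle_hello (const struct parent *p)
{
    return p->shutdown ? HELLO_DISCONNECT : HELLO_ACCEPT;
}
```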

Problem: when nodes of a system instance are forcibly disconnected, they can reconnect fairly quickly because we allow systemd to restart them in 5s. While Flux is shutting down, hello requests are now immediately sent a control disconnect message, but if the fanout is large and the shutdown is slow, the whack-a-mole overhead may be non-negligible. Raise the systemd unit file RestartSec value from 5s to 30s.

Problem: many "transitioning to LOST due to EHOSTUNREACH error on send" messages were logged during shutdown of a large instance. This is still not well understood but we can perhaps get a little more information for next time. Add the previous state and the time the peer has spent in that state (in whole seconds) to the log message. Hopefully this will help with flux-framework#5881 if it occurs again.
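One way such a message could be built is sketched below. The function name, the exact wording of the suffix, and the use of double timestamps are assumptions for illustration, not the format the broker actually emits. The elapsed time is truncated to whole seconds.

```c
#include <stdio.h>

/* Format a hypothetical "transitioning to LOST" message that includes
 * the previous state name and the whole seconds spent in that state.
 * 'entered' and 'now' are timestamps in seconds.
 * Returns the snprintf() result. */
int format_lost_msg (char *buf,
                     size_t len,
                     const char *prev_state,
                     double entered,
                     double now)
{
    long secs = (long)(now - entered); /* truncate to whole seconds */

    return snprintf (buf,
                     len,
                     "transitioning to LOST due to EHOSTUNREACH error"
                     " on send (was %s for %lds)",
                     prev_state,
                     secs);
}
```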


Commits on Apr 12, 2024