Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broker: improve handling of overlay network during shutdown #5883

Merged
merged 5 commits into from Apr 13, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented Apr 12, 2024

This should resolve some problems related to #5881, where we had a flux shutdown get stuck logging many "transitioning to LOST" messages, and some clients stuck in a loop through the next day waiting for a hello response that would never come.

The hello deadlock should be resolved here. Instead of logging an error message and doing nothing when a disconnect control message is received during hello, we disconnect and start over.

The leader broker problems may be related to the fact that there is no mechanism to prevent clients that have rebooted from saying hello again while the instance is in shutdown state. Add a mechanism to send disconnect control messages to any client saying hello during shutdown. Also add the previous state and the time in that state to the LOST messages to get a little more info on what is happening if we see it again.

Finally, the systemd min restart time is raised from 5s to 30s to avoid storms of reconnects, since sending those disconnects could end up being a non-trivial overhead for a broker with many peers.

Problem: parent_cb() defines 'type', then redefines it in
one of its blocks.

Rename the variable in the block to avoid this.
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -777,9 +777,12 @@ static int overlay_sendmsg_child (struct overlay *ov, const flux_msg_t *msg)
&& (child = child_lookup_online (ov, uuid))) {
flux_log (ov->h,
LOG_ERR,
"%s (rank %d) transitioning to LOST due to %s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message: till ➡️ still?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed that and another typo. Doing some manual testing now.

Problem: if the parent broker reboots after hello was sent
but before a response is received, the child broker is stuck
unable to connect.

As noted in flux-framework#5881, we see a continuous stream of log messages on
the client:

   DROP upstream control topic - :
     message received before hello handshake completed

The messages may get started because
1. child sends hello
2. parent reboots and misses the hello
3. child starts sending periodic heartbeat control message to
   the parent after 5s regardless of the hello handshake status
   (this is by design - see below)
4. parent expects to get a hello request before any other message,
   so it sends a disconnect control message
5. child logs the above message (which goes to the parent)
   but doesn't disconnect
6. parent recieves log message, goto 4

Note that although the broker's downstream 0MQ ROUTER socket exposes
TCP disconnects as you might expect for a regular socket, the upstream
0MQ DEALER socket does not.  This is why the child does not detect the
parent restart after sending hello - it has seamlessly reconnected to
the new parent but the hello request that was likely safely delivered to
the old parent before it died and is gone.

The design to work around this passive reconnect behavior is to start
the periodic heartbeat control messages early and force something to happen.
The parent does the right thing by sending the disconnect in response to
the heartbeat.  Unfortunatley, the child ignores the disconnect and a
fast but unentertaining game of pingpong ensues.

To break the cycle, handle the control disconnect message in the child
as originally intended.  That will cause the broker to be restarted by
systemd and introductions can be restarted from scratch.

This first noted in flux-framework#5881.
Problem: if an instance is slow to shut down, nodes could restart
and rejoin the dying instance.

Set a flag when the broker enters cleanup state that causes any
new clients that send a hello message to get an immediate control
disconnect.
Problem: when nodes of a system instance are forcibly disconnected,
they can reconnect fairly quickly because we allow systemd to
restart them in 5s.

While Flux is shutting down, hello requests are now immediately sent
a control disconnect message, but if the fanout is large and the
shutdown is slow, the whack-a-mole overhead may be non-negligible.

Raise the systemd unit file RestartSec value from 5s to 30s.
Problem: many "transitioning to LOST due to EHOSTUNREACH error on send"
messages were logged during shutdown of a large instance.

This is still not well understood but we can perhaps get a little more
information for next time.

Add the previous state and the time the peer has spent in that
state (in whole seconds).

Hopefully will help with flux-framework#5881 if it occurs again.
@garlick
Copy link
Member Author

garlick commented Apr 13, 2024

I didn't find anything wrong with this in testing but I did discover one omission that could be corrected at a later date. Only the rank 0 broker prohibits peers from reconnecting during cleanup. The other ranks are still in run state and are perfectly happy to let peers return to service, only to ask them to shutdown once cleanup completes. It's not wrong, but it seems like wasted effort that we could fix. Unfortunately it's a little more than I think I can get done today.

Let's merge this and get on with it. Setting MWP.

Copy link

codecov bot commented Apr 13, 2024

Codecov Report

Merging #5883 (61bb419) into master (4bd465d) will decrease coverage by 0.01%.
The diff coverage is 82.35%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5883      +/-   ##
==========================================
- Coverage   83.31%   83.30%   -0.01%     
==========================================
  Files         514      514              
  Lines       82767    82774       +7     
==========================================
- Hits        68957    68955       -2     
- Misses      13810    13819       +9     
Files Coverage Δ
src/broker/state_machine.c 80.36% <100.00%> (-0.52%) ⬇️
src/broker/overlay.c 83.68% <80.00%> (-0.11%) ⬇️

... and 21 files with indirect coverage changes

@mergify mergify bot merged commit 5186f08 into flux-framework:master Apr 13, 2024
34 of 35 checks passed
@garlick garlick deleted the issue#5881 branch April 13, 2024 00:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants