Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broker: avoid LOST due to EHOSTUNREACH messages during shutdown #5928

Merged
merged 5 commits into from
May 3, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented May 2, 2024

Problem: as noted in #5881, we observe many "transitioning to LOST due to EHOSTUNREACH" log messages during shutdown of a large system with a flat TBON.

This is due to a deficiency in the shutdown handshake. Once a tbon child completes rc3, it sends an offline status control message to the parent and immediately disconnects. The parent processes the status message asynchronously and may attempt to send messages to the child (for example routine broadcasts) after the child has closed the connection, which triggers this log message.

Improve the shutdown handshake by adding an RPC that the child must wait for before disconnecting.

In addition, make the message a little more user friendy so if it does appear it is more clear to users what is going on.

@garlick
Copy link
Member Author

garlick commented May 2, 2024

I ran a quick test of current master and this branch (rebased on current master) to see if there was an impact on the time from shutdown->finalize on rank 0. This is with kary:32. A little bit slower but not much.

nodes   master  issue#5881
2       0.23    0.23
8       0.28    0.29
32      0.99    1.01
64      2.25    2.29
128     4.38    4.47
256     8.69    8.84
512     17.53   17.78

This was on a single node. Times are in seconds.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@garlick
Copy link
Member Author

garlick commented May 3, 2024

Cool thanks! MWP away!

Copy link

codecov bot commented May 3, 2024

Codecov Report

Attention: Patch coverage is 84.48276% with 9 lines in your changes are missing coverage. Please review.

Project coverage is 83.31%. Comparing base (1c299b9) to head (41e19da).
Report is 23 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5928      +/-   ##
==========================================
- Coverage   83.31%   83.31%   -0.01%     
==========================================
  Files         514      514              
  Lines       82959    83013      +54     
==========================================
+ Hits        69120    69161      +41     
- Misses      13839    13852      +13     
Files Coverage Δ
src/broker/state_machine.c 81.42% <100.00%> (-0.04%) ⬇️
src/broker/shutdown.c 81.45% <66.66%> (+0.15%) ⬆️
src/broker/overlay.c 83.51% <80.00%> (-0.18%) ⬇️

... and 11 files with indirect coverage changes

Problem: comment in state_machine.c contains a spelling error.

Fix it.
Problem: it is rather confusing to trace how the broker transitions
out of GOODBYE state on all ranks, since the "goodbye" event is
posted only in shutdown.c, which doesn't do much on followers.

Define an action_goodbye() callback, following the pattern used in
other states, and have it post the "goodbye" event for followers.

This will enable other things to be done in the action callback too.
Problem: the log message "<hostname> (rank <N>) transitioning to
LOST due to EHOSTUNREACH error on send" is not user friendly.

Change it to
  <hostname> (rank <N>) has disconnected unexpectedly.  Marking it LOST.
Problem: changes in the overlay shutdown handshake may affect the
ability for follower ranks to get final log messages into the rank
0 log buffer.

Change the test that ensures the broker transitions through expected
states to use log-stderr-mode=local instead of the default "leader"
mode, so the log messages are captured on stderr from all ranks not
just via rank 0.
Problem: the shutdown handshake is insufficient to prevent the
parent from accessing a child peer's socket after it has disconnected
and logging errors about unexpected disconnection.

Currently a child node immediately exits GOODBYE state, and tears
down the overlay network, which sends a control status offline message
and disconnects the socket.  The parent processes the control status
message asynchronously (whenever it emerges from the message queue).
In the mean time, the parent could send messages (for example routine
event broadcasts) to the child's socket.  If the send fails with
EHOSTUNREACH before the peer is marked offline, the unexpected
disconnection is logged at LOG_ERR.

Add a overlay.goodbye RPC which the child sends to its parent when it
enters GOODBYE state.  The child remains in GOODBYE state until a
response is received (or a timeout), then it proceeds to EXIT state and
the disconnects as before.

While waiting for the response, further messages to the parent,
like heartbeats or log messages, are suppressed.  This is to avoid
triggering a disconnect control message from the parent if the
follower has been marked OFFLINE before the message is processed,
as it surely will be given message order guarantees.  The disconnect
control messsage causes immediate disconnection - not logged on the
parent, but LOG_CRIT on the child.

Fixes flux-framework#5881
@mergify mergify bot merged commit e38a0cd into flux-framework:master May 3, 2024
33 checks passed
@garlick garlick deleted the issue#5881 branch May 3, 2024 02:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants