broker: avoid LOST due to EHOSTUNREACH messages during shutdown #5928

garlick · 2024-05-02T15:00:51Z

Problem: as noted in #5881, we observe many "transitioning to LOST due to EHOSTUNREACH" log messages during shutdown of a large system with a flat TBON.

This is due to a deficiency in the shutdown handshake. Once a TBON child completes rc3, it sends an offline status control message to the parent and immediately disconnects. The parent processes the status message asynchronously and may attempt to send messages to the child (for example, routine broadcasts) after the child has closed the connection, which triggers this log message.

Improve the shutdown handshake by adding an RPC that the child must wait for before disconnecting.
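For readers who want to see the race in miniature, here is a toy model of it. It is only a sketch, not broker code: the `Parent` class, its queue, and the `simulate()` helper are hypothetical names, and it assumes that once the parent has handled the child's request it no longer sends to that child.

```python
from collections import deque


class Parent:
    """Toy TBON parent (hypothetical): control messages are queued and
    processed later, while broadcasts go straight to child sockets."""

    def __init__(self):
        self.inbox = deque()    # control messages not yet processed
        self.connected = set()  # child ranks whose socket is still open
        self.online = set()     # child ranks the parent believes are online
        self.log = []

    def add_child(self, rank):
        self.connected.add(rank)
        self.online.add(rank)

    def process_inbox(self):
        # Asynchronous in the real broker; here it runs when called.
        while self.inbox:
            _kind, rank = self.inbox.popleft()
            self.online.discard(rank)

    def broadcast(self):
        # Send to every child still believed online.
        for rank in sorted(self.online):
            if rank not in self.connected:
                self.log.append(f"rank {rank}: send failed: EHOSTUNREACH")
                self.online.discard(rank)


def simulate(wait_for_response):
    parent = Parent()
    parent.add_child(1)
    if wait_for_response:
        # New handshake: the child stays connected until the parent has
        # handled its request (assumption: the response implies that).
        parent.inbox.append(("goodbye", 1))
        parent.process_inbox()
        parent.connected.discard(1)
    else:
        # Old handshake: queue the offline status and disconnect at once.
        parent.inbox.append(("offline", 1))
        parent.connected.discard(1)
    parent.broadcast()      # a routine broadcast arrives at a bad moment
    parent.process_inbox()
    return parent.log


print(simulate(wait_for_response=False))
print(simulate(wait_for_response=True))
```

In the first call the broadcast reaches a closed socket before the queued status message has been handled, so an error is logged. In the second call the child only disconnects after the parent has caught up, so the log stays empty.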

In addition, make the message a little more user friendly so that, if it does appear, it is clearer to users what is going on.

garlick · 2024-05-02T22:57:03Z

I ran a quick test of current master and this branch (rebased on current master) to see if there was an impact on the time from shutdown->finalize on rank 0. This is with kary:32. A little bit slower but not much.

nodes   master  issue#5881
2       0.23    0.23
8       0.28    0.29
32      0.99    1.01
64      2.25    2.29
128     4.38    4.47
256     8.69    8.84
512     17.53   17.78

This was on a single node. Times are in seconds.
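To put "a little bit slower" into numbers, the relative overhead per row can be computed from the table. This is only a quick sketch over the figures quoted above (one run each, two decimal places), so the percentages are rough.

```python
# (nodes, master seconds, issue#5881 branch seconds), copied from the table
timings = [
    (2, 0.23, 0.23),
    (8, 0.28, 0.29),
    (32, 0.99, 1.01),
    (64, 2.25, 2.29),
    (128, 4.38, 4.47),
    (256, 8.69, 8.84),
    (512, 17.53, 17.78),
]


def overhead_percent(master, branch):
    """Relative slowdown of the branch versus master, in percent."""
    return 100.0 * (branch - master) / master


for nodes, master, branch in timings:
    delta = branch - master
    print(f"{nodes:>5}  {delta:+.2f}s  {overhead_percent(master, branch):+.1f}%")

worst = max(overhead_percent(m, b) for _, m, b in timings)
```

On these measurements the relative overhead stays below 4% at every size, and the absolute difference at 512 nodes is 0.25 seconds.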

grondo

LGTM!

garlick · 2024-05-03T01:07:01Z

Cool thanks! MWP away!

codecov · 2024-05-03T01:07:22Z

Codecov Report

Attention: Patch coverage is 84.48276%, with 9 lines in your changes missing coverage. Please review.

Project coverage is 83.31%. Comparing base (1c299b9) to head (41e19da).
Report is 23 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5928      +/-   ##
==========================================
- Coverage   83.31%   83.31%   -0.01%     
==========================================
  Files         514      514              
  Lines       82959    83013      +54     
==========================================
+ Hits        69120    69161      +41     
- Misses      13839    13852      +13

Files	Coverage Δ
src/broker/state_machine.c	`81.42% <100.00%> (-0.04%)`	⬇️
src/broker/shutdown.c	`81.45% <66.66%> (+0.15%)`	⬆️
src/broker/overlay.c	`83.51% <80.00%> (-0.18%)`	⬇️

... and 11 files with indirect coverage changes

Problem: comment in state_machine.c contains a spelling error. Fix it.

Problem: it is rather confusing to trace how the broker transitions out of GOODBYE state on all ranks, since the "goodbye" event is posted only in shutdown.c, which doesn't do much on followers. Define an action_goodbye() callback, following the pattern used in other states, and have it post the "goodbye" event for followers. This will enable other things to be done in the action callback too.
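The per-state action callback pattern described here can be sketched roughly as follows. This is an illustration only, not the code in state_machine.c: the `Broker` class, its tables, and `action_exit()` are hypothetical, and only the GOODBYE and EXIT states and the "goodbye" event are taken from the commit message.

```python
from enum import Enum, auto


class State(Enum):
    GOODBYE = auto()
    EXIT = auto()


class Broker:
    """Hypothetical miniature state machine with one action per state."""

    def __init__(self, rank):
        self.rank = rank
        self.state = None
        self.events = []
        # State -> callback run on entry to that state.
        self.actions = {
            State.GOODBYE: self.action_goodbye,
            State.EXIT: self.action_exit,
        }
        # (state, event) -> next state.
        self.transitions = {(State.GOODBYE, "goodbye"): State.EXIT}

    def enter(self, state):
        self.state = state
        action = self.actions.get(state)
        if action is not None:
            action()

    def post(self, event):
        self.events.append(event)
        next_state = self.transitions.get((self.state, event))
        if next_state is not None:
            self.enter(next_state)

    def action_goodbye(self):
        # Followers post "goodbye" themselves; on rank 0 the event is
        # assumed to come from elsewhere (shutdown.c in the real broker).
        if self.rank > 0:
            self.post("goodbye")

    def action_exit(self):
        pass  # placeholder: nothing to do in this sketch


follower = Broker(rank=1)
follower.enter(State.GOODBYE)
print(follower.state, follower.events)
```

Keeping the follower's event in the action callback means there is one place per state to look when tracing transitions, and the callback is a natural home for any extra work done on entering GOODBYE.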

Problem: the log message "<hostname> (rank <N>) transitioning to LOST due to EHOSTUNREACH error on send" is not user friendly. Change it to "<hostname> (rank <N>) has disconnected unexpectedly. Marking it LOST."

Problem: changes in the overlay shutdown handshake may affect the ability of follower ranks to get final log messages into the rank 0 log buffer. Change the test that ensures the broker transitions through expected states to use log-stderr-mode=local instead of the default "leader" mode, so the log messages are captured on stderr from all ranks, not just via rank 0.

Problem: the shutdown handshake is insufficient to prevent the parent from accessing a child peer's socket after it has disconnected and logging errors about unexpected disconnection.

Currently a child node immediately exits GOODBYE state and tears down the overlay network, which sends a control status offline message and disconnects the socket. The parent processes the control status message asynchronously (whenever it emerges from the message queue). In the meantime, the parent could send messages (for example routine event broadcasts) to the child's socket. If the send fails with EHOSTUNREACH before the peer is marked offline, the unexpected disconnection is logged at LOG_ERR.

Add an overlay.goodbye RPC which the child sends to its parent when it enters GOODBYE state. The child remains in GOODBYE state until a response is received (or a timeout occurs), then it proceeds to EXIT state and disconnects as before.

While waiting for the response, further messages to the parent, like heartbeats or log messages, are suppressed. This is to avoid triggering a disconnect control message from the parent if the follower has been marked OFFLINE before the message is processed, as it surely will be given message order guarantees. The disconnect control message causes immediate disconnection, which is not logged on the parent but is logged at LOG_CRIT on the child.

Fixes flux-framework#5881
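The child side of this handshake (send the request, suppress other upstream traffic, wait for a response or a timeout) can be sketched with asyncio. This is a hedged illustration rather than the overlay.c implementation: the `Child` class, `send_to_parent()`, `demo()`, and the delay and timeout values are all hypothetical, and the parent's response is stood in for by a sleep.

```python
import asyncio


class Child:
    """Hypothetical follower waiting in GOODBYE for the goodbye response."""

    def __init__(self, response_delay, timeout):
        self.response_delay = response_delay  # simulated parent latency
        self.timeout = timeout                # assumed value, not from the PR
        self.goodbye_pending = False
        self.sent = []

    def send_to_parent(self, topic):
        # Heartbeats, log messages, etc. are dropped once goodbye is sent.
        if self.goodbye_pending:
            return False
        self.sent.append(topic)
        return True

    async def goodbye(self):
        self.sent.append("overlay.goodbye")
        self.goodbye_pending = True
        try:
            # Stand-in for waiting on the parent's RPC response.
            await asyncio.wait_for(
                asyncio.sleep(self.response_delay), self.timeout
            )
            return "response"
        except asyncio.TimeoutError:
            return "timeout"
        # Either way the caller proceeds to EXIT and disconnects as before.


async def demo(response_delay, timeout):
    child = Child(response_delay, timeout)
    task = asyncio.ensure_future(child.goodbye())
    await asyncio.sleep(0)  # let the request go out first
    heartbeat_sent = child.send_to_parent("heartbeat")
    outcome = await task
    return outcome, heartbeat_sent, child.sent


print(asyncio.run(demo(response_delay=0.01, timeout=1.0)))
print(asyncio.run(demo(response_delay=0.5, timeout=0.01)))
```

In both runs the heartbeat attempted during the wait is suppressed, so the only message the parent sees after the child enters GOODBYE is the goodbye request itself. The timeout bounds how long a child can be held up by an unresponsive parent.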

grondo approved these changes May 2, 2024

garlick added the merge-when-passing label May 3, 2024

garlick added 5 commits May 3, 2024 01:08

broker: fix typo in comment

68dba3b

broker: make LOST log message more understandable

f920737

garlick force-pushed the issue#5881 branch from 41e19da to 81b43b5 Compare May 3, 2024 01:08

mergify bot merged commit e38a0cd into flux-framework:master May 3, 2024
33 checks passed

garlick deleted the issue#5881 branch May 3, 2024 02:31
