broker: improve handling of overlay network during shutdown #5883

Merged
merged 5 commits into from
Apr 13, 2024

Commits on Apr 12, 2024

  1. broker: fix shadowed variable

    Problem: parent_cb() defines 'type', then redefines it in
    one of its blocks.
    
    Rename the variable in the block to avoid this.
    garlick committed Apr 12, 2024
    3c4c840
  2. broker: honor control disconnect during hello

    Problem: if the parent broker reboots after hello was sent
    but before a response is received, the child broker is stuck
    unable to connect.
    
    As noted in flux-framework#5881, we see a continuous stream of log messages on
    the client:
    
       DROP upstream control topic - :
         message received before hello handshake completed
    
    The messages may get started by the following sequence:
    1. child sends hello
    2. parent reboots and misses the hello
    3. child starts sending periodic heartbeat control message to
       the parent after 5s regardless of the hello handshake status
       (this is by design - see below)
    4. parent expects to get a hello request before any other message,
       so it sends a disconnect control message
    5. child logs the above message (which goes to the parent)
       but doesn't disconnect
    6. parent receives log message, goto 4
    
    Note that although the broker's downstream 0MQ ROUTER socket exposes
    TCP disconnects as you might expect for a regular socket, the upstream
    0MQ DEALER socket does not.  This is why the child does not detect the
    parent restart after sending hello: it has seamlessly reconnected to
    the new parent, but the hello request was likely delivered safely to
    the old parent before it died, and is now gone.
    
    The design to work around this passive reconnect behavior is to start
    the periodic heartbeat control messages early and force something to happen.
    The parent does the right thing by sending the disconnect in response to
    the heartbeat.  Unfortunately, the child ignores the disconnect and a
    fast but unentertaining game of pingpong ensues.
    
    To break the cycle, handle the control disconnect message in the child
    as originally intended.  That will cause the broker to be restarted by
    systemd and introductions can be restarted from scratch.
    
    This was first noted in flux-framework#5881.
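
The fix described above can be sketched as a small decision function on the child side. This is a minimal illustration, not the actual broker code: the enum names, `struct child`, and `child_handle_control()` are all hypothetical. The point it shows is that a control disconnect is acted on even while the hello handshake is still pending, while other pre-handshake control messages continue to be dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical control message types (illustrative names only). */
enum control_type { CONTROL_HEARTBEAT, CONTROL_STATUS, CONTROL_DISCONNECT };

/* What the child decides to do with one upstream control message. */
enum child_action { ACTION_DROP, ACTION_PROCESS, ACTION_SHUTDOWN };

/* Hypothetical child-side state: only the handshake status matters here. */
struct child {
    bool hello_complete;    /* true once the hello response has arrived */
};

enum child_action child_handle_control (const struct child *c,
                                        enum control_type type)
{
    assert (c != NULL);

    /* Honor disconnect regardless of handshake status, so the broker
     * can exit and be restarted instead of looping on dropped messages.
     */
    if (type == CONTROL_DISCONNECT)
        return ACTION_SHUTDOWN;

    /* Any other control message before the handshake is still dropped. */
    if (!c->hello_complete)
        return ACTION_DROP;

    return ACTION_PROCESS;
}
```

In this sketch, returning `ACTION_SHUTDOWN` stands in for whatever the child does to exit so that systemd can restart it.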
    garlick committed Apr 12, 2024
    24b2c82
  3. broker: prevent new clients during shutdown

    Problem: if an instance is slow to shut down, nodes could restart
    and rejoin the dying instance.
    
    Set a flag when the broker enters cleanup state that causes any
    new clients that send a hello message to get an immediate control
    disconnect.
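
The flag described above might look roughly like the following. This is a sketch under stated assumptions, not the broker's real data structures: `struct overlay`, the `broker_state` subset, and both function names are hypothetical. It shows a sticky flag that is set on entry to cleanup and consulted whenever a hello request arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of broker states (illustrative only). */
enum broker_state { STATE_RUN, STATE_CLEANUP, STATE_SHUTDOWN };

/* Possible answers to a hello request from a would-be child. */
enum hello_reply { HELLO_ACCEPT, HELLO_DISCONNECT };

/* Hypothetical overlay state holding the shutdown flag. */
struct overlay {
    bool shutdown_in_progress;
};

/* Called on each broker state transition.  The flag is sticky: once
 * cleanup has been entered it stays set for the rest of the shutdown.
 */
void overlay_on_state_change (struct overlay *ov, enum broker_state state)
{
    assert (ov != NULL);
    if (state == STATE_CLEANUP)
        ov->shutdown_in_progress = true;
}

/* Called when a hello request arrives from a new client. */
enum hello_reply overlay_on_hello (const struct overlay *ov)
{
    assert (ov != NULL);
    return ov->shutdown_in_progress ? HELLO_DISCONNECT : HELLO_ACCEPT;
}
```

Keeping the flag sticky means a node that restarts late in the shutdown is turned away just like one that restarts early.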
    garlick committed Apr 12, 2024
    Configuration menu
    9233873
    Browse the repository at this point in the history
  4. systemd: change RestartSec=5s to 30s

    Problem: when nodes of a system instance are forcibly disconnected,
    they can reconnect fairly quickly because we allow systemd to
    restart them in 5s.
    
    While Flux is shutting down, hello requests are now answered
    immediately with a control disconnect message, but if the fanout is
    large and the shutdown is slow, the whack-a-mole overhead may be
    non-negligible.
    
    Raise the systemd unit file RestartSec value from 5s to 30s.
    garlick committed Apr 12, 2024
    d4fa3dc
  5. broker: improve LOST error message

    Problem: many "transitioning to LOST due to EHOSTUNREACH error on send"
    messages were logged during shutdown of a large instance.
    
    This is still not well understood but we can perhaps get a little more
    information for next time.
    
    Add the previous state and the time the peer has spent in that
    state (in whole seconds).
    
    Hopefully this will help with flux-framework#5881 if it occurs again.
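
As a rough illustration of the extra detail described above, the message could be built like this. The function name, its parameters, and the exact wording of the format string are assumptions made for this sketch and are not taken from the commit. It only shows the previous state and the elapsed time, truncated to whole seconds, being added to the existing reason text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format a LOST transition message into buf (hypothetical helper).
 *   prev_state    - name of the state the peer was in before LOST
 *   secs_in_state - time spent in that state, in seconds
 *   reason        - existing reason text, e.g. the send error
 * Returns the snprintf() result: the length the full message needs,
 * which exceeds size - 1 if the output was truncated.
 */
int format_lost_message (char *buf,
                         size_t size,
                         const char *prev_state,
                         double secs_in_state,
                         const char *reason)
{
    assert (buf != NULL && prev_state != NULL && reason != NULL);

    /* Casting to int truncates toward zero, giving whole seconds. */
    return snprintf (buf,
                     size,
                     "transitioning to LOST from %s after %ds due to %s",
                     prev_state,
                     (int)secs_in_state,
                     reason);
}
```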
    garlick committed Apr 12, 2024
    61bb419