Add keep-alive watchdog (ping/pong) in networking for resilience #3530

devendran-m · 2022-12-23T11:25:47Z

To improve reliability and catch potentially stale connections early (rather than only when we actually want to send data over them), periodically exchange ping/pong messages with peers.
When a peer fails to respond to a ping in a timely manner, terminate the connection (but do not block the peer).

3540: Network watchdog r=marc-casperlabs a=marc-casperlabs

Closes #3530.

This PR adds a network watchdog in the form of ping/pong functionality:

* Nodes periodically send a `Ping` down every outgoing connection.
* Any node receiving a `Ping` responds with a `Pong`.
* Pings and pongs carry nonces, which prevents false positives on retries and stops peers from spamming pongs (after a certain number of invalid pongs, the peer is banned).
* If a ping times out, it is retried a few times.
* Once a certain number of ping timeouts is hit, the connection is terminated (but the peer is *not* banned).

The core motivation for adding this to 1.5 is to prevent unlikely but possible connection stalls caused by deadlock while interdependent nodes fetch backpressured tries from each other. As a side benefit, very slow or stalled connections are also terminated.

Test coverage is extensive for the actual logic (see `health.rs`) and attempts to cover every possible edge case. Proper integration into the layers above is also tested, but a certain amount of testing remains manual, as there is currently no good way to write a test that puts the nodes into the deadlocked state.

As a further side benefit, the node can now be queried for round-trip times to other nodes through `net-info` on the diagnostics port.

Security aspects:

* Ping floods are prevented through rate limiting.
* Pong floods are prevented through rate limiting; nodes also ban peers that send too many unasked pongs.
* A 2:1 cost ratio of ping:pong prevents blowing up a peer's memory through pings.

Co-authored-by: Marc Brinkmann <marc@casperlabs.io>
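The watchdog rules described above can be sketched as a small per-connection state machine. This is an illustrative sketch only, not the code in `health.rs`: the names `HealthConfig`, `ConnectionHealth`, `Verdict`, `tick` and `on_pong` are hypothetical, as are all thresholds. Nonces are passed in by the caller (in practice they would come from a random source), and rate limiting and the 2:1 cost ratio are not modeled.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tuning knobs; names and values are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct HealthConfig {
    /// How long to wait for a pong before counting a timeout.
    pub ping_timeout: Duration,
    /// Consecutive ping timeouts after which the connection is dropped.
    pub max_ping_timeouts: u32,
    /// Pongs with an unexpected nonce after which the peer is banned.
    pub max_invalid_pongs: u32,
}

/// What the caller should do with the connection after an event.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Nothing to do.
    Healthy,
    /// Send a ping carrying this nonce.
    SendPing(u64),
    /// Too many timeouts: terminate the connection, but do not ban the peer.
    Disconnect,
    /// Too many invalid pongs: ban the peer.
    Ban,
}

#[derive(Debug)]
pub struct ConnectionHealth {
    cfg: HealthConfig,
    /// Nonce and send time of the ping currently awaiting a pong.
    outstanding: Option<(u64, Instant)>,
    timeouts: u32,
    invalid_pongs: u32,
    /// Round-trip time of the most recent successful ping.
    pub last_rtt: Option<Duration>,
}

impl ConnectionHealth {
    pub fn new(cfg: HealthConfig) -> Self {
        Self { cfg, outstanding: None, timeouts: 0, invalid_pongs: 0, last_rtt: None }
    }

    /// Called periodically by the owner of the connection.
    pub fn tick(&mut self, now: Instant, fresh_nonce: u64) -> Verdict {
        let outstanding = self.outstanding;
        match outstanding {
            // No ping in flight: start a new one.
            None => {
                self.outstanding = Some((fresh_nonce, now));
                Verdict::SendPing(fresh_nonce)
            }
            // Ping in flight and still within its deadline.
            Some((_, sent)) if now.duration_since(sent) < self.cfg.ping_timeout => {
                Verdict::Healthy
            }
            // Ping timed out: retry with a new nonce, or give up.
            Some(_) => {
                self.timeouts += 1;
                if self.timeouts >= self.cfg.max_ping_timeouts {
                    self.outstanding = None;
                    Verdict::Disconnect
                } else {
                    self.outstanding = Some((fresh_nonce, now));
                    Verdict::SendPing(fresh_nonce)
                }
            }
        }
    }

    /// Called when a pong arrives from the peer.
    pub fn on_pong(&mut self, now: Instant, nonce: u64) -> Verdict {
        let outstanding = self.outstanding;
        match outstanding {
            // Pong matches the ping in flight: record RTT, reset timeouts.
            Some((expected, sent)) if expected == nonce => {
                self.last_rtt = Some(now.duration_since(sent));
                self.outstanding = None;
                self.timeouts = 0;
                Verdict::Healthy
            }
            // Stale or unasked pong: count it against the peer.
            _ => {
                self.invalid_pongs += 1;
                if self.invalid_pongs >= self.cfg.max_invalid_pongs {
                    Verdict::Ban
                } else {
                    Verdict::Healthy
                }
            }
        }
    }
}
```

Because every retry uses a fresh nonce, a late pong for an earlier ping does not mark the connection healthy; it counts as invalid. The two failure modes stay separate: timeouts lead to `Disconnect`, while invalid pongs lead to `Ban`.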

devendran-m added the "release blocker" label (PR to be merged before releasing) Dec 23, 2022

devendran-m assigned marc-casperlabs Dec 23, 2022

marc-casperlabs mentioned this issue Jan 3, 2023

Network watchdog #3540

Merged

casperlabs-bors-ng bot closed this as completed in 67c9c9b Jan 4, 2023
