Add keep-alive watchdog (ping/pong) in networking for resilience #3530

Closed
devendran-m opened this issue Dec 23, 2022 · 0 comments · Fixed by #3540

Comments

@devendran-m
Contributor

To improve reliability and catch potentially stale connections early (as opposed to when we actually want to send data over them), send a periodic ping/pong message to peers.
If a peer fails to respond to a ping in a timely manner, terminate the connection (but do not block the peer).

@devendran-m devendran-m added the release blocker PR to be merged before releasing label Dec 23, 2022
casperlabs-bors-ng bot added a commit that referenced this issue Jan 4, 2023
3540: Network watchdog r=marc-casperlabs a=marc-casperlabs

Closes #3530.

This PR adds a network watchdog in the form of a ping/pong functionality:

* Nodes will periodically send a `Ping` down every outgoing connection.
* Any node receiving a `Ping` will respond with a `Pong`.
* Pings and pongs carry nonces. This prevents false positives on retries and stops peers from spamming pongs (after a certain number of invalid pongs, the peer is banned).
* If a ping times out, it is retried a few times.
* Once a certain number of ping timeouts is hit, the connection is terminated (but the peer is *not* banned).
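
The rules above can be sketched as a small per-connection state machine. This is a minimal illustration under stated assumptions, not the code in `health.rs`: every type and field name (`HealthConfig`, `ConnectionHealth`, `Verdict`), the threshold semantics, and the choice to count a late pong for an already-retried ping as invalid are hypothetical. Nonces come from a counter here to keep the sketch dependency-free; a real implementation would presumably draw them from an RNG.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tuning knobs; the names and semantics are assumptions.
#[derive(Debug, Clone, Copy)]
pub struct HealthConfig {
    pub ping_interval: Duration,
    pub ping_timeout: Duration,
    pub max_ping_timeouts: u32,
    pub max_invalid_pongs: u32,
}

/// What the caller should do after feeding an event into the tracker.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Nothing to do right now.
    Healthy,
    /// Send a `Ping` carrying this nonce.
    SendPing(u64),
    /// Too many ping timeouts: drop the connection, but do not ban the peer.
    Disconnect,
    /// Too many invalid pongs: ban the peer.
    Ban,
}

#[derive(Debug)]
pub struct ConnectionHealth {
    /// Nonce and send time of the ping currently awaiting its pong.
    in_flight: Option<(u64, Instant)>,
    next_ping_at: Instant,
    timeouts: u32,
    invalid_pongs: u32,
    next_nonce: u64,
    last_rtt: Option<Duration>,
}

impl ConnectionHealth {
    pub fn new(first_nonce: u64, now: Instant) -> Self {
        Self {
            in_flight: None,
            next_ping_at: now,
            timeouts: 0,
            invalid_pongs: 0,
            next_nonce: first_nonce,
            last_rtt: None,
        }
    }

    /// Called periodically; decides whether to ping, retry, or give up.
    pub fn tick(&mut self, cfg: &HealthConfig, now: Instant) -> Verdict {
        match self.in_flight {
            // A ping is outstanding and still within its deadline.
            Some((_, sent)) if now.saturating_duration_since(sent) < cfg.ping_timeout => {
                Verdict::Healthy
            }
            // The outstanding ping timed out: retry with a fresh nonce or give up.
            Some(_) => {
                self.timeouts += 1;
                if self.timeouts >= cfg.max_ping_timeouts {
                    self.in_flight = None;
                    return Verdict::Disconnect;
                }
                self.send_ping(now)
            }
            // Idle and the next periodic ping is due.
            None if now >= self.next_ping_at => self.send_ping(now),
            None => Verdict::Healthy,
        }
    }

    fn send_ping(&mut self, now: Instant) -> Verdict {
        let nonce = self.next_nonce;
        self.next_nonce = self.next_nonce.wrapping_add(1);
        self.in_flight = Some((nonce, now));
        Verdict::SendPing(nonce)
    }

    /// Called when a `Pong` arrives; only the in-flight nonce is accepted.
    pub fn on_pong(&mut self, cfg: &HealthConfig, nonce: u64, now: Instant) -> Verdict {
        match self.in_flight {
            Some((expected, sent)) if expected == nonce => {
                self.last_rtt = Some(now.saturating_duration_since(sent));
                self.in_flight = None;
                self.timeouts = 0;
                self.next_ping_at = now + cfg.ping_interval;
                Verdict::Healthy
            }
            // Unknown, stale, or unsolicited nonce.
            _ => {
                self.invalid_pongs += 1;
                if self.invalid_pongs > cfg.max_invalid_pongs {
                    Verdict::Ban
                } else {
                    Verdict::Healthy
                }
            }
        }
    }

    /// Round-trip time of the most recent successful ping, if any.
    pub fn last_rtt(&self) -> Option<Duration> {
        self.last_rtt
    }
}
```

Passing `now` in explicitly instead of calling `Instant::now()` inside the tracker keeps the logic deterministic, which is what makes exhaustive edge-case testing of this kind of component practical.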

The core motivation for adding this to 1.5 is to prevent unlikely but possible connection stalls due to deadlock while interdependent nodes fetch backpressured tries from each other. As a side benefit, really slow or stalled connections are also terminated.

Test coverage of the actual logic (see `health.rs`) is extensive and attempts to cover every possible edge case. Proper integration into the layers above is also tested, but a certain amount of testing remains manual, as there is currently no easy way to write a test that puts the nodes into the deadlocked state.

As a nice side benefit, the node can now be queried for round-trip times to other nodes through `net-info` on the diagnostics port.

Security aspects:

* Ping floods are prevented through rate limiting.
* Pong floods are prevented through rate limiting; in addition, nodes ban peers that send too many unsolicited pongs.
* A 2:1 cost ratio of ping:pong prevents blowing up a peer's memory through pings.
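
The PR text does not say which rate-limiting scheme is used. As one common possibility, a token bucket could cap how many pings or pongs are accepted from a peer per unit of time. The sketch below is an assumption for illustration only: `TokenBucket` and its parameters are hypothetical names. The `cost` argument lets different messages be weighted differently, but how the 2:1 ping:pong ratio above maps onto such weights is not stated in the source and is not modeled here.

```rust
use std::time::Instant;

/// Hypothetical token-bucket limiter; not taken from the PR.
#[derive(Debug)]
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Starts with a full bucket.
    pub fn new(capacity: f64, refill_per_sec: f64, now: Instant) -> Self {
        Self {
            capacity,
            tokens: capacity,
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Refills according to elapsed time, then tries to spend `cost` tokens.
    /// Returns `false` if the message should be dropped or deferred.
    pub fn try_acquire(&mut self, cost: f64, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= cost {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}
```

A node would keep one such bucket per peer and call `try_acquire` before processing each incoming `Ping` or `Pong`, ignoring messages for which it returns `false`.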

Co-authored-by: Marc Brinkmann <marc@casperlabs.io>