
[Network] Add support for latency aware peer dialing #11814

Merged — 4 commits merged into main on Feb 1, 2024
Conversation

@JoshLind (Contributor) commented Jan 29, 2024

Description

This PR updates the networking stack to add support for latency-aware peer dialing. Instead of selecting peers to dial randomly, we now select peers based on peer ping latencies (i.e., the lower the peer latency, the higher the probability of dialing the peer). Experiments show this reduces end-to-end transaction latencies by several hundred milliseconds (for high-latency PFNs).
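As an illustration only (not the PR's actual code), the latency-weighted selection described above can be sketched as follows. The inverse of each peer's ping latency is used as its selection weight, and a tiny embedded PRNG stands in for a real randomness source; the peer IDs and latencies are made up:

```rust
/// Minimal linear congruential generator so this sketch has no external deps;
/// real code would use a proper randomness source.
struct Lcg(u64);

impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Top 53 bits mapped into [0, 1).
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Pick one peer with probability proportional to 1 / latency_ms,
/// so lower-latency peers are dialed more often.
fn select_peer_by_latency<'a>(peers: &[(&'a str, f64)], rng: &mut Lcg) -> Option<&'a str> {
    let total_weight: f64 = peers.iter().map(|(_, latency)| 1.0 / latency).sum();
    if total_weight <= 0.0 {
        return None; // no peers (or no usable latencies)
    }
    let mut target = rng.next_f64() * total_weight;
    for (peer_id, latency) in peers {
        target -= 1.0 / latency;
        if target <= 0.0 {
            return Some(*peer_id);
        }
    }
    peers.last().map(|(peer_id, _)| *peer_id) // guard against float rounding
}
```

With weights of 1/latency, a 10 ms peer is picked roughly 100x as often as a 1000 ms peer.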

The PR offers several commits:

  1. Make a tiny improvement to an existing error log.
  2. Add latency aware peer dialing to the network stack.
  3. Add some simple tests to verify peer dialing and selection logic.
  4. Adopt f64 for peer ping latencies (to improve metrics).

A couple notes about the PR:

  1. To identify peer ping latencies before dialing, we simply time how long it takes to establish a TCP connection to each peer. This seems to provide a reasonable proxy for real ping latencies.
  2. The feature is gated by a config flag and can be disabled (if required). If the config flag is disabled, the existing dialing logic is used.
  3. Latency aware peer dialing is only done for peers on the public network (as it makes little sense for validator and VFN networks, given these are all-to-all connections).
  4. We also keep the existing dialing prioritization, i.e., if a peer has already been dialed (and we've failed to establish a connection with them), we deprioritize that peer in the selection process. This prevents us from continuously attempting to dial the same unresponsive peers.
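Note 1 above can be illustrated with a small standalone sketch (not the PR's code): time how long a TCP connection attempt takes and treat the elapsed time as the peer's approximate ping latency.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Approximate a peer's ping latency by timing a TCP connection attempt.
/// Returns the elapsed time in milliseconds, or None if the peer was
/// unreachable within the timeout.
fn measure_connect_latency(addr: SocketAddr, timeout: Duration) -> Option<f64> {
    let start = Instant::now();
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(_stream) => Some(start.elapsed().as_secs_f64() * 1000.0),
        Err(_) => None, // unreachable peers yield no measurement
    }
}
```

This measures the TCP handshake rather than an ICMP ping, which is exactly why it works as a proxy: it reflects the round-trip time on the same path the real connection will use.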

Test Plan

New and existing test infrastructure (as well as manual verification).


trunk-io bot commented Jan 29, 2024

⏱️ 21h 11m total CI duration on this PR
Job Cumulative Duration Recent Runs
rust-unit-coverage 4h 30m 🟩
rust-smoke-coverage 3h 2m 🟥
rust-unit-tests 2h 44m 🟩🟩🟩🟩 (+1 more)
rust-smoke-tests 2h 4m 🟩🟩🟩🟩
windows-build 1h 42m 🟩🟩🟩🟩🟩 (+1 more)
execution-performance / single-node-performance 1h 32m 🟩🟩🟩🟩🟩
rust-images / rust-all 1h 16m 🟩🟩🟩🟩
rust-lints 46m 🟩🟩🟩🟩🟩 (+1 more)
forge-e2e-test / forge 45m 🟩🟩🟩
forge-compat-test / forge 39m 🟩🟩🟩
cli-e2e-tests / run-cli-tests 32m 🟩🟩🟩🟩
run-tests-main-branch 27m 🟩🟩🟩🟩🟩 (+1 more)
check 24m 🟩🟩🟩🟩🟩 (+1 more)
general-lints 15m 🟩🟩🟩🟩🟩 (+1 more)
check-dynamic-deps 14m 🟩🟩🟩🟩🟩 (+1 more)
indexer-grpc-e2e-tests / test-indexer-grpc-docker-compose 7m 🟩🟩🟩🟩
node-api-compatibility-tests / node-api-compatibility-tests 4m 🟩🟩🟩🟩
semgrep/ci 3m 🟩🟩🟩🟩🟩 (+1 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+1 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+1 more)
execution-performance / file_change_determinator 55s 🟩🟩🟩🟩🟩
file_change_determinator 52s 🟩🟩🟩🟩🟩
permission-check 24s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 24s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 21s 🟩🟩🟩🟩🟩
permission-check 17s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 15s 🟩🟩🟩🟩🟩 (+1 more)
determine-docker-build-metadata 10s 🟩🟩🟩🟩🟩

🚨 2 jobs on the last run were significantly faster/slower than expected

Job Duration vs 7d avg Delta
run-tests-main-branch 6m 4m +51%
rust-images / rust-all 17m 12m +37%


.into_iter()
.filter(|(_, peer)| peer.ping_latency_ms.is_none())
.collect::<Vec<_>>();
let num_peers_to_ping = peers_to_ping.len();
Contributor:

This should often be zero; if so, it would be nice to return early here?

@JoshLind (Author):

SGTM -- added it 😄
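The early return being discussed could look roughly like this (a sketch only; the function shape and types are assumptions based on the quoted snippet):

```rust
/// Hypothetical wrapper around the quoted snippet: skip the ping step
/// entirely when every eligible peer already has a latency measurement.
/// Peers are (id, optional ping latency in ms) pairs for illustration.
fn ping_missing_latency_peers(peers_to_ping: Vec<(u64, Option<f64>)>) -> usize {
    let num_peers_to_ping = peers_to_ping.len();
    if num_peers_to_ping == 0 {
        return 0; // nothing to ping; avoid a no-op round (the common case)
    }
    // ... ping the peers and record their latencies ...
    num_peers_to_ping
}
```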

// Identify the eligible peers that don't already have latency information
let peers_to_ping = eligible_peers
.into_iter()
.filter(|(_, peer)| peer.ping_latency_ms.is_none())
Contributor:

Next round: give ping_latency a TTL after which we should ping again?

@JoshLind (Author):

Yeah, if we want to make this more intelligent, we can definitely do that, e.g., give each ping measurement a TTL and refresh it periodically. We could then combine that with some type of support for disconnects, such that if you discover better peers to dial, you could "swap" a couple of your current connections with the more optimal peers (assuming you're already at your max outbound connection limit).

But, I figured that'd be best to leave to a future PR/effort (if we see the need). At least now, a simple node reboot should refresh the peer ping latencies and connections.
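The TTL idea sketched in the reply above could be modeled by stamping each measurement with the time it was taken (illustrative only; the type and method names are made up):

```rust
use std::time::{Duration, Instant};

/// A ping measurement stamped with when it was taken, so it can expire.
struct PingLatency {
    latency_ms: f64,
    measured_at: Instant,
}

impl PingLatency {
    fn new(latency_ms: f64) -> Self {
        Self {
            latency_ms,
            measured_at: Instant::now(),
        }
    }

    /// True once the measurement is older than the TTL and should be
    /// refreshed by pinging the peer again.
    fn is_stale(&self, ttl: Duration) -> bool {
        self.measured_at.elapsed() > ttl
    }
}
```

The dial loop would then treat stale entries like missing ones and re-ping them, which is the "refresh periodically" behavior described above.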

self.enable_latency_aware_dialing,
) {
// Ping the eligible peers (so that we can fetch missing ping latency information)
self.ping_eligible_peers(eligible_peers.clone()).await;
Contributor:

I'm a little sad that no connections will be tried until ping is done; this will mostly slow down a newly started node connecting by a few seconds? It looks like in the steady state things already have a ping time so there's no waiting for pings.

@JoshLind (Author):

Yeah, the pinging does unfortunately block the first dial (and any subsequent dials if there are peers we failed to ping last round). But, thankfully, the delay isn't super noticeable. For example, this dial loop is only invoked every 5 or 10 seconds anyway (depending on network/node). So, when you add the additional ping time on (a couple seconds), it's not very noticeable.

If we do notice it becoming a problem, we could try to make things async (or heavily reduce the timeout), but it should hopefully be reasonable enough to start with 😄

@bchocho (Contributor) reviewed:

LGTM!

@JoshLind JoshLind added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Feb 1, 2024

github-actions bot commented Feb 1, 2024

✅ Forge suite compat success on aptos-node-v1.8.3 ==> 15632560704cf2ef140e013941e6d65dc591cde8

Compatibility test results for aptos-node-v1.8.3 ==> 15632560704cf2ef140e013941e6d65dc591cde8 (PR)
1. Check liveness of validators at old version: aptos-node-v1.8.3
compatibility::simple-validator-upgrade::liveness-check : committed: 3588 txn/s, latency: 6947 ms, (p50: 6900 ms, p90: 9600 ms, p99: 16200 ms), latency samples: 172240
2. Upgrading first Validator to new version: 15632560704cf2ef140e013941e6d65dc591cde8
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1504 txn/s, latency: 18630 ms, (p50: 18400 ms, p90: 32200 ms, p99: 33700 ms), latency samples: 93300
3. Upgrading rest of first batch to new version: 15632560704cf2ef140e013941e6d65dc591cde8
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1780 txn/s, latency: 16028 ms, (p50: 19200 ms, p90: 22200 ms, p99: 22600 ms), latency samples: 92600
4. Upgrading second batch to new version: 15632560704cf2ef140e013941e6d65dc591cde8
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 2606 txn/s, latency: 11781 ms, (p50: 9900 ms, p90: 24700 ms, p99: 29300 ms), latency samples: 101640
5. check swarm health
Compatibility test for aptos-node-v1.8.3 ==> 15632560704cf2ef140e013941e6d65dc591cde8 passed
Test Ok


github-actions bot commented Feb 1, 2024

✅ Forge suite realistic_env_max_load success on 15632560704cf2ef140e013941e6d65dc591cde8

two traffics test: inner traffic : committed: 7247 txn/s, latency: 5264 ms, (p50: 4800 ms, p90: 7000 ms, p99: 11100 ms), latency samples: 3145240
two traffics test : committed: 100 txn/s, latency: 2187 ms, (p50: 2100 ms, p90: 2500 ms, p99: 4300 ms), latency samples: 1820
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.211, avg: 0.196", "QsPosToProposal: max: 0.146, avg: 0.141", "ConsensusProposalToOrdered: max: 0.563, avg: 0.539", "ConsensusOrderedToCommit: max: 0.490, avg: 0.459", "ConsensusProposalToCommit: max: 1.035, avg: 0.998"]
Max round gap was 1 [limit 4] at version 1258948. Max no progress secs was 5.10237 [limit 15] at version 1258948.
Test Ok

@JoshLind JoshLind merged commit a86669f into main Feb 1, 2024
50 checks passed
@JoshLind JoshLind deleted the ping_peers_net_2 branch February 1, 2024 18:59