
[Network] Add support for latency aware peer dialing #11814

Merged — 4 commits merged into main on Feb 1, 2024
Conversation

@JoshLind (Contributor) commented Jan 29, 2024

Description

This PR updates the networking stack to add support for latency-aware peer dialing. Instead of selecting peers to dial randomly, we now select peers based on peer ping latencies (i.e., the lower the peer latency, the higher the probability of dialing the peer). Experiments show this reduces end-to-end transaction latencies by several hundred milliseconds (for high-latency PFNs).
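As an illustration only (not the PR's actual code), the latency-weighted selection described above can be sketched as follows. The inverse of each peer's ping latency is used as its selection weight, and a tiny embedded PRNG stands in for a real randomness source; the peer IDs and latencies are made up:

```rust
/// Minimal linear congruential generator so this sketch has no external deps;
/// real code would use a proper randomness source.
struct Lcg(u64);

impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Top 53 bits mapped into [0, 1).
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Pick one peer with probability proportional to 1 / latency_ms,
/// so lower-latency peers are dialed more often.
fn select_peer_by_latency<'a>(peers: &[(&'a str, f64)], rng: &mut Lcg) -> Option<&'a str> {
    let total_weight: f64 = peers.iter().map(|(_, latency)| 1.0 / latency).sum();
    if total_weight <= 0.0 {
        return None; // no peers (or no usable latencies)
    }
    let mut target = rng.next_f64() * total_weight;
    for (peer_id, latency) in peers {
        target -= 1.0 / latency;
        if target <= 0.0 {
            return Some(*peer_id);
        }
    }
    peers.last().map(|(peer_id, _)| *peer_id) // guard against float rounding
}
```

With weights of 1/latency, a 10 ms peer is picked roughly 100x as often as a 1000 ms peer.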

The PR offers several commits:

  1. Make a tiny improvement to an existing error log.
  2. Add latency aware peer dialing to the network stack.
  3. Add some simple tests to verify peer dialing and selection logic.
  4. Adopt f64 for peer ping latencies (to improve metrics).

A couple notes about the PR:

  1. To identify peer ping latencies before dialing, we simply time how long it takes to establish a TCP connection to each peer. This seems to provide a reasonable proxy for real ping latencies.
  2. The feature is gated by a config flag and can be disabled (if required). If the config flag is disabled, the existing dialing logic is used.
  3. Latency aware peer dialing is only done for peers on the public network (as it makes little sense for validator and VFN networks, given these are all-to-all connections).
  4. We also keep the existing dialing prioritization, i.e., if a peer has already been dialed (and we've failed to establish a connection with them), we deprioritize that peer in the selection process. This prevents us from continuously attempting to dial the same unresponsive peers.
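Note 1 above can be illustrated with a small standalone sketch (not the PR's code): time how long a TCP connection attempt takes and treat the elapsed time as the peer's approximate ping latency.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Approximate a peer's ping latency by timing a TCP connection attempt.
/// Returns the elapsed time in milliseconds, or None if the peer was
/// unreachable within the timeout.
fn measure_connect_latency(addr: SocketAddr, timeout: Duration) -> Option<f64> {
    let start = Instant::now();
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(_stream) => Some(start.elapsed().as_secs_f64() * 1000.0),
        Err(_) => None, // unreachable peers yield no measurement
    }
}
```

This measures the TCP handshake rather than an ICMP ping, which is exactly why it works as a proxy: it reflects the round-trip time on the same path the real connection will use.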

Test Plan

New and existing test infrastructure (as well as manual verification).


trunk-io bot commented Jan 29, 2024

⏱️ 21h 11m total CI duration on this PR
Job Cumulative Duration Recent Runs
rust-unit-coverage 4h 30m 🟩
rust-smoke-coverage 3h 2m 🟥
rust-unit-tests 2h 44m 🟩🟩🟩🟩 (+1 more)
rust-smoke-tests 2h 4m 🟩🟩🟩🟩
windows-build 1h 42m 🟩🟩🟩🟩🟩 (+1 more)
execution-performance / single-node-performance 1h 32m 🟩🟩🟩🟩🟩
rust-images / rust-all 1h 16m 🟩🟩🟩🟩
rust-lints 46m 🟩🟩🟩🟩🟩 (+1 more)
forge-e2e-test / forge 45m 🟩🟩🟩
forge-compat-test / forge 39m 🟩🟩🟩
cli-e2e-tests / run-cli-tests 32m 🟩🟩🟩🟩
run-tests-main-branch 27m 🟩🟩🟩🟩🟩 (+1 more)
check 24m 🟩🟩🟩🟩🟩 (+1 more)
general-lints 15m 🟩🟩🟩🟩🟩 (+1 more)
check-dynamic-deps 14m 🟩🟩🟩🟩🟩 (+1 more)
indexer-grpc-e2e-tests / test-indexer-grpc-docker-compose 7m 🟩🟩🟩🟩
node-api-compatibility-tests / node-api-compatibility-tests 4m 🟩🟩🟩🟩
semgrep/ci 3m 🟩🟩🟩🟩🟩 (+1 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+1 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+1 more)
execution-performance / file_change_determinator 55s 🟩🟩🟩🟩🟩
file_change_determinator 52s 🟩🟩🟩🟩🟩
permission-check 24s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 24s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 21s 🟩🟩🟩🟩🟩
permission-check 17s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 15s 🟩🟩🟩🟩🟩 (+1 more)
determine-docker-build-metadata 10s 🟩🟩🟩🟩🟩

🚨 2 jobs on the last run were significantly faster/slower than expected

Job Duration vs 7d avg Delta
run-tests-main-branch 6m 4m +51%
rust-images / rust-all 17m 12m +37%


.into_iter()
.filter(|(_, peer)| peer.ping_latency_ms.is_none())
.collect::<Vec<_>>();
let num_peers_to_ping = peers_to_ping.len();
Contributor:

This should often be zero; if so, it would be nice to return early here?

@JoshLind (Author):

SGTM -- added it 😄
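The early return being discussed could look roughly like this (a sketch only; the function shape and types are assumptions based on the quoted snippet):

```rust
/// Hypothetical wrapper around the quoted snippet: skip the ping step
/// entirely when every eligible peer already has a latency measurement.
/// Peers are (id, optional ping latency in ms) pairs for illustration.
fn ping_missing_latency_peers(peers_to_ping: Vec<(u64, Option<f64>)>) -> usize {
    let num_peers_to_ping = peers_to_ping.len();
    if num_peers_to_ping == 0 {
        return 0; // nothing to ping; avoid a no-op round (the common case)
    }
    // ... ping the peers and record their latencies ...
    num_peers_to_ping
}
```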

// Identify the eligible peers that don't already have latency information
let peers_to_ping = eligible_peers
.into_iter()
.filter(|(_, peer)| peer.ping_latency_ms.is_none())
Contributor:

Next round: give ping_latency a TTL after which we should ping again?

@JoshLind (Author):

Yeah, if we want to make this more intelligent, we can definitely do that, e.g., give each ping measurement a TTL and refresh it periodically. We could then combine that with some type of support for disconnects, such that if you discover better peers to dial, you could "swap" a couple of your current connections with the more optimal peers (assuming you're already at your max outbound connection limit).

But, I figured that'd be best to leave to a future PR/effort (if we see the need). At least now, a simple node reboot should refresh the peer ping latencies and connections.
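The TTL idea sketched in the reply above could be modeled by stamping each measurement with the time it was taken (illustrative only; the type and method names are made up):

```rust
use std::time::{Duration, Instant};

/// A ping measurement stamped with when it was taken, so it can expire.
struct PingLatency {
    latency_ms: f64,
    measured_at: Instant,
}

impl PingLatency {
    fn new(latency_ms: f64) -> Self {
        Self {
            latency_ms,
            measured_at: Instant::now(),
        }
    }

    /// True once the measurement is older than the TTL and should be
    /// refreshed by pinging the peer again.
    fn is_stale(&self, ttl: Duration) -> bool {
        self.measured_at.elapsed() > ttl
    }
}
```

The dial loop would then treat stale entries like missing ones and re-ping them, which is the "refresh periodically" behavior described above.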

self.enable_latency_aware_dialing,
) {
// Ping the eligible peers (so that we can fetch missing ping latency information)
self.ping_eligible_peers(eligible_peers.clone()).await;
Contributor:

I'm a little sad that no connections will be tried until ping is done; this will mostly slow down a newly started node connecting by a few seconds? It looks like in the steady state things already have a ping time so there's no waiting for pings.

@JoshLind (Author):

Yeah, the pinging does unfortunately block the first dial (and any subsequent dials if there are peers we failed to ping last round). But, thankfully, the delay isn't super noticeable. For example, this dial loop is only invoked every 5 or 10 seconds anyway (depending on network/node). So, when you add the additional ping time on (a couple seconds), it's not very noticeable.

If we do notice it becoming a problem, we could try to make things async (or heavily reduce the timeout), but it should hopefully be reasonable enough to start with 😄

@bchocho (Contributor) reviewed:

LGTM!

@JoshLind JoshLind added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Feb 1, 2024

github-actions bot commented Feb 1, 2024

✅ Forge suite compat success on aptos-node-v1.8.3 ==> 15632560704cf2ef140e013941e6d65dc591cde8

Compatibility test results for aptos-node-v1.8.3 ==> 15632560704cf2ef140e013941e6d65dc591cde8 (PR)
1. Check liveness of validators at old version: aptos-node-v1.8.3
compatibility::simple-validator-upgrade::liveness-check : committed: 3588 txn/s, latency: 6947 ms, (p50: 6900 ms, p90: 9600 ms, p99: 16200 ms), latency samples: 172240
2. Upgrading first Validator to new version: 15632560704cf2ef140e013941e6d65dc591cde8
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1504 txn/s, latency: 18630 ms, (p50: 18400 ms, p90: 32200 ms, p99: 33700 ms), latency samples: 93300
3. Upgrading rest of first batch to new version: 15632560704cf2ef140e013941e6d65dc591cde8
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1780 txn/s, latency: 16028 ms, (p50: 19200 ms, p90: 22200 ms, p99: 22600 ms), latency samples: 92600
4. Upgrading second batch to new version: 15632560704cf2ef140e013941e6d65dc591cde8
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 2606 txn/s, latency: 11781 ms, (p50: 9900 ms, p90: 24700 ms, p99: 29300 ms), latency samples: 101640
5. check swarm health
Compatibility test for aptos-node-v1.8.3 ==> 15632560704cf2ef140e013941e6d65dc591cde8 passed
Test Ok


github-actions bot commented Feb 1, 2024

✅ Forge suite realistic_env_max_load success on 15632560704cf2ef140e013941e6d65dc591cde8

two traffics test: inner traffic : committed: 7247 txn/s, latency: 5264 ms, (p50: 4800 ms, p90: 7000 ms, p99: 11100 ms), latency samples: 3145240
two traffics test : committed: 100 txn/s, latency: 2187 ms, (p50: 2100 ms, p90: 2500 ms, p99: 4300 ms), latency samples: 1820
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.211, avg: 0.196", "QsPosToProposal: max: 0.146, avg: 0.141", "ConsensusProposalToOrdered: max: 0.563, avg: 0.539", "ConsensusOrderedToCommit: max: 0.490, avg: 0.459", "ConsensusProposalToCommit: max: 1.035, avg: 0.998"]
Max round gap was 1 [limit 4] at version 1258948. Max no progress secs was 5.10237 [limit 15] at version 1258948.
Test Ok

@JoshLind JoshLind merged commit a86669f into main Feb 1, 2024
50 checks passed
@JoshLind JoshLind deleted the ping_peers_net_2 branch February 1, 2024 18:59