
[Mempool] Improve peer priority selection. #12537

Merged · 1 commit merged into main from mempool_opt on Mar 26, 2024
Conversation

@JoshLind (Contributor) commented Mar 14, 2024:

Note: most of this PR is new unit tests.

Description

This PR improves the peer selection logic for mempool (i.e. deciding which peers to forward transactions to):

  • Currently, the logic just selects peers based on network types and roles, but this is somewhat inefficient (especially when nodes define seeds that override more performant peers).
  • To avoid this, the PR updates the selection logic to prioritize peers based on: (i) network type; (ii) distance from the validators (e.g., to avoid misconfigured and/or disconnected VFNs); and (iii) peer ping latencies (i.e., to favour closer peers). A rough sketch of this ordering is shown after this list.
  • To avoid excessively reprioritizing peers (which can be detrimental to mempool under load), we only update the peer priorities when: (i) our peers change; (ii) we're still waiting for the peer monitoring service to populate the peer ping latencies; or (iii) neither (i) nor (ii) is true and ~10 minutes (configurable) have elapsed since the last update. This avoids overly reprioritizing peers at steady state.
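For illustration, here is a minimal Rust sketch of the three-step ordering described above. The types and field names (PeerSummary, is_validator_network, validator_distance, ping_latency_secs) are hypothetical, not the PR's actual code; the real comparator lives in the mempool peer prioritization logic.

```rust
use std::cmp::Ordering;

// Hypothetical peer summary; the real code derives these values from the
// network id, peer monitoring metadata, and measured ping latencies.
struct PeerSummary {
    is_validator_network: bool,      // (i) network type
    validator_distance: Option<u64>, // (ii) hops from the validator set
    ping_latency_secs: Option<f64>,  // (iii) monitored ping latency
}

// Smaller values are better; peers missing the data rank last.
fn rank_smaller_first<T: PartialOrd>(a: &Option<T>, b: &Option<T>) -> Ordering {
    match (a, b) {
        (Some(x), Some(y)) => y.partial_cmp(x).unwrap_or(Ordering::Equal),
        (Some(_), None) => Ordering::Greater, // `a` has data, so it wins
        (None, Some(_)) => Ordering::Less,
        (None, None) => Ordering::Equal,
    }
}

// Greater means higher priority; the caller sorts by this comparator in
// descending order so the best peers land at the front of the list.
fn compare_peers(a: &PeerSummary, b: &PeerSummary) -> Ordering {
    a.is_validator_network
        .cmp(&b.is_validator_network) // (i) prefer the validator-facing network
        .then_with(|| rank_smaller_first(&a.validator_distance, &b.validator_distance)) // (ii)
        .then_with(|| rank_smaller_first(&a.ping_latency_secs, &b.ping_latency_secs)) // (iii)
}
```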

Testing Plan

New and existing test infrastructure. I also ran several PFN-only tests to ensure that the average broadcast latencies are reduced.

trunk-io bot commented Mar 14, 2024:

⏱️ 39h 28m total CI duration on this PR
Job Cumulative Duration Recent Runs
rust-unit-tests 7h 53m 🟩🟩🟩🟩🟩 (+14 more)
windows-build 5h 51m 🟩🟩🟩🟩🟩 (+17 more)
forge-e2e-test / forge 5h 42m 🟩🟩🟩🟩🟩 (+14 more)
rust-images / rust-all 5h 2m 🟩🟩🟩🟩🟩 (+13 more)
rust-unit-coverage 4h 21m 🟩
rust-smoke-coverage 3h 50m 🟩
rust-lints 2h 25m 🟩🟩🟩🟩🟩 (+14 more)
check 1h 20m 🟩🟩🟩🟩 (+18 more)
run-tests-main-branch 1h 8m 🟥🟥🟥🟥🟥 (+13 more)
general-lints 37m 🟩🟩🟩🟩🟩 (+14 more)
check-dynamic-deps 35m 🟩🟩🟩🟩🟩 (+16 more)
determine-test-metadata 19m 🟩🟩🟩🟩🟩 (+10 more)
semgrep/ci 9m 🟩🟩🟩🟩🟩 (+16 more)
file_change_determinator 3m 🟩🟩🟩🟩🟩 (+13 more)
file_change_determinator 3m 🟩🟩🟩🟩🟩 (+14 more)
file_change_determinator 3m 🟩🟩🟩🟩🟩 (+16 more)
permission-check 1m 🟩🟩🟩🟩🟩 (+16 more)
permission-check 1m 🟩🟩🟩🟩🟩 (+16 more)
permission-check 1m 🟩🟩🟩🟩🟩 (+16 more)
determine-docker-build-metadata 57s 🟩🟩🟩🟩🟩 (+13 more)
permission-check 51s 🟩🟩🟩🟩🟩 (+16 more)
permission-check 50s 🟩🟩🟩🟩🟩 (+15 more)
upload-to-codecov 12s 🟩

🚨 1 job on the last run was significantly faster/slower than expected

Job: rust-images / rust-all — duration 19m vs. 7d avg 15m (delta +21%)


@JoshLind added the CICD:run-forge-e2e-perf label (Run the e2e perf forge only) on Mar 14, 2024


codecov bot commented Mar 15, 2024:

Codecov Report

Attention: Patch coverage is 79.58115%, with 78 lines in your changes missing coverage. Please review.

Project coverage is 69.9%. Comparing base (b169adf) to head (31014fe).
Report is 1 commit behind head on main.

❗ Current head 31014fe differs from pull request most recent head f47df24. Consider uploading reports for the commit f47df24 to get more accurate results

Files Patch % Lines
mempool/src/shared_mempool/network.rs 79.5% 78 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #12537       +/-   ##
===========================================
+ Coverage    64.1%    69.9%     +5.7%     
===========================================
  Files         819     2284     +1465     
  Lines      182919   431953   +249034     
===========================================
+ Hits       117397   302142   +184745     
- Misses      65522   129811    +64289     



@JoshLind force-pushed the mempool_opt branch 3 times, most recently from dbf5335 to 86b450d, on March 15, 2024 20:33


pub fn ready_for_update(&self, peers_changed: bool) -> bool {
// If our peers have changed, or we haven't observed ping latencies
// for all peers yet, we should update the prioritized peers again.
if peers_changed || !self.observed_all_ping_latencies {
Contributor:
Can you explain the reasoning behind updating this if we have not observed ping latency for all peers?

@JoshLind (Contributor, Author) replied on Mar 21, 2024:

Aah, it's because ping latency information is only populated after a new peer connects (e.g., 30 seconds), but as soon as the peer connects, mempool gets notified and tries to prioritize the peer (without any latency information). We basically just keep updating our priorities (every few seconds) until all the ping latency information is populated for the current set of peers. This is enough to provide an optimal/best effort selection. After this point, we just move to the more stable schedule (e.g., every 10 mins).

Contributor:

I see, makes sense. It might be better to provide this context with a comment in the code.

@JoshLind (Contributor, Author):

SGTM! Will add a comment 😄
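To summarize this thread, here is a small sketch of the update gating being discussed. The struct and most field names are hypothetical (only ready_for_update and observed_all_ping_latencies mirror the excerpt above); the interval is the ~10 minutes mentioned in the description, and it is configurable.

```rust
use std::time::{Duration, Instant};

// Hypothetical bookkeeping kept alongside the prioritized peer list.
struct PrioritizedPeersState {
    observed_all_ping_latencies: bool,
    last_update_time: Instant,
    update_interval: Duration, // configurable, e.g. ~10 minutes
}

impl PrioritizedPeersState {
    fn ready_for_update(&self, peers_changed: bool) -> bool {
        // Update eagerly if the peer set changed, or while we're still waiting
        // for the peer monitoring service to report all peer ping latencies.
        if peers_changed || !self.observed_all_ping_latencies {
            return true;
        }
        // Otherwise, stick to the slow steady-state schedule.
        self.last_update_time.elapsed() >= self.update_interval
    }
}
```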

.iter()
.sorted_by(|peer_a, peer_b| {
let ordering = &self.peer_comparator.compare(peer_a, peer_b);
ordering.reverse() // Prioritize higher values (i.e., sorted by descending order)
Contributor:

@JoshLind - Maybe I am missing something, but this seems to be returning the list in ascending order of priority, because the peer_comparator returns the highest priority first?

@JoshLind (Contributor, Author):

It's because sorted_by sorts in ascending order, but we want high-priority peers to be at the beginning of the list (i.e., descending order, as mempool selects peers from the front of the list). So, the easiest way to do that is to reverse the ordering here 😄

@JoshLind (Contributor, Author) added on Mar 21, 2024:

(We can also do this by flipping peer_a and peer_b, if that's clearer 🤔)

Contributor:

Thanks - this makes sense to me.
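As a standalone illustration of this exchange (plain integers instead of the mempool types, assuming the itertools crate that provides sorted_by): reversing the comparison and flipping the arguments produce the same descending order.

```rust
use itertools::Itertools;

fn main() {
    let scores = vec![3, 1, 2];

    // sorted_by sorts in ascending order, so reverse the comparison...
    let by_reverse: Vec<_> = scores.iter().sorted_by(|a, b| a.cmp(b).reverse()).collect();

    // ...or, equivalently, flip the arguments being compared.
    let by_flip: Vec<_> = scores.iter().sorted_by(|a, b| b.cmp(a)).collect();

    assert_eq!(by_reverse, by_flip); // both yield [3, 2, 1]
}
```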

@sitalkedia (Contributor) left a comment:

LGTM

@bchocho (Contributor) left a comment:

Can we gate the priority sorting based on ping times?

One case I am concerned about: when there are a lot of outstanding items in mempool, getting a new peer into the priority peers means all the outstanding items have to be broadcast to the new peer (in FIFO order), incurring significant latency for incoming transactions as they wait in the queue to be broadcast. Not sure how problematic this could be in practice, though.

There's no foolproof solution I can think of to resolve this. Here are some ideas:

  • Constrain the number of peers to be re-prioritized in each 10 min interval to half of the peers (1 in our case)?
  • For a new peer, constrain the number of "old" txns we forward to the new peer.

// Update the prioritized peers
let mut prioritized_peers = self.prioritized_peers.write();
if new_prioritized_peers != *prioritized_peers {
info!(
Contributor:

Can we add an explicit metric for how many prioritized peers were changed? This can help us debug in case there are issues with frequent changes.


@JoshLind (Contributor, Author) commented Mar 25, 2024:

Can we gate the priority sorting based on ping times?

As discussed offline, I've updated the PR to do this. 😄 The "intelligent" sorting is enabled by default, but it can be disabled in the node config. If it is disabled, it means mempool will retain the same behaviour as before (i.e., only reprioritize on peer changes, and use a simpler prioritization algorithm).

Note: I have modified the original behaviour slightly: the simple prioritization algorithm no longer uses the "peer role"; instead, it sorts by network and then peer hash. This is because the peer roles are imprecise, so it makes sense to fall back to random selection (instead of prioritizing seeds) when the feature is disabled. A rough sketch of this fallback appears after this comment.

Let me know what you think!
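For reference, a rough sketch of the simple fallback described above, with hypothetical types and names (the real comparator and the node-config flag that gates the latency-aware sorting are not shown here): prefer the validator-facing network, then break ties by peer hash.

```rust
use std::cmp::Ordering;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical peer shape: a network marker plus a hashable identifier.
#[derive(Hash)]
struct Peer {
    on_validator_network: bool,
    id: u64,
}

fn peer_hash(peer: &Peer) -> u64 {
    let mut hasher = DefaultHasher::new();
    peer.hash(&mut hasher);
    hasher.finish()
}

// Simple fallback comparator: prefer the validator-facing network, then order
// by peer hash, which approximates random selection and ignores peer roles.
fn simple_compare(a: &Peer, b: &Peer) -> Ordering {
    a.on_validator_network
        .cmp(&b.on_validator_network)
        .then_with(|| peer_hash(a).cmp(&peer_hash(b)))
}
```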


@bchocho (Contributor) left a comment:

Thanks! One comment on the metrics, but approving to unblock

@@ -429,6 +430,20 @@ pub fn shared_mempool_pending_broadcasts(peer: &PeerNetworkId) -> IntGauge {
])
}

/// Counter tracking the number of peers that changed priority in shared mempool
static SHARED_MEMPOOL_PRIORITY_CHANGE_COUNT: Lazy<IntGaugeVec> = Lazy::new(|| {
@bchocho (Contributor):

Looks like IntGauge would be sufficient, instead of IntGaugeVec?

But actually I'm wondering if just a Histogram might work better? E.g., to count how many changes there were for the past 30 mins.

@JoshLind (Contributor, Author) replied on Mar 26, 2024:

Aah, thanks @bchocho -- I missed IntGauge 🤦 Will change to that. Note: I'm not sure a Histogram makes sense because we update this metric really infrequently (e.g., every 10 minutes). It's probably easiest to just graph the raw gauge value instead of trying to work with averages, rate changes, etc. But we can always update this if we need more 😄
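For completeness, a minimal sketch of the label-free gauge being discussed, assuming the prometheus-style registration macros and once_cell used by prometheus-based counters (the metric name and usage here are illustrative, not the PR's exact code):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge, IntGauge};

/// Gauge tracking how many peers changed priority during the last update
/// (illustrative metric name).
static SHARED_MEMPOOL_PRIORITY_CHANGE_COUNT: Lazy<IntGauge> = Lazy::new(|| {
    register_int_gauge!(
        "aptos_shared_mempool_priority_change_count",
        "Number of peers that changed priority during the last prioritization update"
    )
    .unwrap()
});

// Usage sketch: set the gauge after recomputing the prioritized peers.
// SHARED_MEMPOOL_PRIORITY_CHANGE_COUNT.set(num_changed_peers as i64);
```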


✅ Forge suite realistic_env_max_load success on f47df2487f02d4758fefb1d1a43f938d5af2d81d

two traffics test: inner traffic : committed: 8604 txn/s, latency: 4564 ms, (p50: 4400 ms, p90: 5300 ms, p99: 9900 ms), latency samples: 3708680
two traffics test : committed: 100 txn/s, latency: 1909 ms, (p50: 1800 ms, p90: 2100 ms, p99: 6200 ms), latency samples: 1780
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.211, avg: 0.202", "QsPosToProposal: max: 0.224, avg: 0.206", "ConsensusProposalToOrdered: max: 0.434, avg: 0.420", "ConsensusOrderedToCommit: max: 0.366, avg: 0.353", "ConsensusProposalToCommit: max: 0.787, avg: 0.774"]
Max round gap was 1 [limit 4] at version 1813521. Max no progress secs was 4.733936 [limit 15] at version 1813521.
Test Ok


✅ Forge suite compat success on aptos-node-v1.9.5 ==> f47df2487f02d4758fefb1d1a43f938d5af2d81d

Compatibility test results for aptos-node-v1.9.5 ==> f47df2487f02d4758fefb1d1a43f938d5af2d81d (PR)
1. Check liveness of validators at old version: aptos-node-v1.9.5
compatibility::simple-validator-upgrade::liveness-check : committed: 7061 txn/s, latency: 4702 ms, (p50: 4800 ms, p90: 7000 ms, p99: 7600 ms), latency samples: 247160
2. Upgrading first Validator to new version: f47df2487f02d4758fefb1d1a43f938d5af2d81d
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1001 txn/s, latency: 28959 ms, (p50: 31700 ms, p90: 42100 ms, p99: 43100 ms), latency samples: 57080
3. Upgrading rest of first batch to new version: f47df2487f02d4758fefb1d1a43f938d5af2d81d
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 346 txn/s, submitted: 450 txn/s, expired: 104 txn/s, latency: 38337 ms, (p50: 42300 ms, p90: 56600 ms, p99: 69100 ms), latency samples: 36715
4. upgrading second batch to new version: f47df2487f02d4758fefb1d1a43f938d5af2d81d
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 2036 txn/s, latency: 14645 ms, (p50: 16900 ms, p90: 18300 ms, p99: 19300 ms), latency samples: 93680
5. check swarm health
Compatibility test for aptos-node-v1.9.5 ==> f47df2487f02d4758fefb1d1a43f938d5af2d81d passed
Test Ok

@JoshLind merged commit 5fe67b8 into main on Mar 26, 2024 (78 of 98 checks passed).
@JoshLind deleted the mempool_opt branch on March 26, 2024 at 18:44.