p2p: adaptive connections services flags #28170

Merged
merged 6 commits into from Jan 31, 2024

Conversation

furszy
Member

@furszy furszy commented Jul 27, 2023

Derived from #28120 discussion.

By relocating the peer desirable services flags into the peer manager, we
allow the connections acceptance process to handle potential post-IBD
stalling scenarios.

The peer manager will be able to dynamically adjust the services flags
based on the node's proximity to the tip (back and forth), allowing the node
to recover from the following post-IBD scenario:
Suppose the node has successfully synced the chain, but later experienced
dropped connections and remained inactive for longer than the limited
peers threshold (the timeframe within which limited peers can provide blocks). In
such a case, upon reconnecting to the network, the node might only establish
connections with limited peers, filling up all available outbound slots and
leaving it unable to synchronize the chain (because limited peers will not provide
blocks older than the NODE_NETWORK_LIMITED_MIN_BLOCKS threshold).
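The adaptive behavior described above can be sketched as follows. This is an illustrative simplification, not the PR's actual implementation: the service-flag values and the NODE_NETWORK_LIMITED_MIN_BLOCKS constant mirror Bitcoin Core, but the `tip_blocks_behind` parameter is a hypothetical stand-in for the node's real proximity-to-tip calculation.

```cpp
#include <cstdint>

// Illustrative sketch only. The flag values and constant below mirror
// Bitcoin Core, but this function is a simplified stand-in for the PR's
// logic, and tip_blocks_behind is a hypothetical input (the real code
// derives the node's proximity to the tip from chain state).
enum ServiceFlags : uint64_t {
    NODE_NONE = 0,
    NODE_NETWORK = (1 << 0),          // serves the full block history
    NODE_WITNESS = (1 << 3),
    NODE_NETWORK_LIMITED = (1 << 10), // serves only recent blocks
};

// Limited peers are only guaranteed to serve the last 288 blocks.
static constexpr int NODE_NETWORK_LIMITED_MIN_BLOCKS = 288;

// Decide which services to require from new outbound connections based on
// how far our tip is behind the network.
ServiceFlags GetDesirableServiceFlags(int tip_blocks_behind)
{
    if (tip_blocks_behind > NODE_NETWORK_LIMITED_MIN_BLOCKS) {
        // Too far behind: limited peers cannot provide the blocks we are
        // missing, so require full nodes.
        return ServiceFlags(NODE_NETWORK | NODE_WITNESS);
    }
    // Near the tip: limited peers are acceptable.
    return ServiceFlags(NODE_NETWORK_LIMITED | NODE_WITNESS);
}
```

The key point is the "back and forth": the same node can move between the two return values as its tip proximity changes, rather than latching one answer forever.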

@DrahtBot
Contributor

DrahtBot commented Jul 27, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK vasild, naumenkogs, mzumsande, andrewtoth, achow101

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #29231 (logging: Update to new logging API by ajtowns)
  • #28016 (p2p: gives seednode priority over dnsseed if both are provided by sr-gi)
  • #27052 (test: rpc: add last block announcement time to getpeerinfo result by LarryRuane)
  • #15218 (validation: Flush state after initial sync by andrewtoth)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Member

@dergoegge dergoegge left a comment


I don't think the scenario you describe is likely to happen since we consider ourselves to be in IBD if our tip is older than 24h (which corresponds to ~144 blocks).

Suppose the node has successfully synced the chain, but later experienced
dropped connections and remained inactive for a duration longer than the limited
peers threshold (the timeframe within which limited peers can provide blocks). In
such cases, upon reconnecting to the network, the node might only establish
connections with limited peers, filling up all available outbound slots.

So this isn't true; upon restart the node would be behind by >24h and open connections to non-pruning peers (unless the block rate significantly increased beyond 1 block/10 min).


Getting rid of the g_initial_block_download_completed global would still be worthwhile but the other changes seem unnecessary to me.

src/net.h (comment on lines 669 to 1019):
/**
* Callback to determine whether the given set of service flags are sufficient
* for a peer to be "relevant".
*/
virtual bool HasDesirableServiceFlags(ServiceFlags services) = 0;
Member


This interface is not the right place for a callback like this because it is not a "net event".

Would be better to have a method on connman, e.g. CConnman::SetDesirableServiceFlags that is then called by peerman.

Member Author


This interface is not the right place for a callback like this because it is not a "net event".

Hmm ok, that would avoid locking cs_main in the open connections thread. Which is nice.

Member Author

@furszy furszy Aug 8, 2023


Have thought more about this and I'm not so convinced about it anymore.

The connection manager class lives on a lower level and manages only a fraction of the node's information, while this field is strictly linked to the node's overall status. Right now, HasDesirableServiceFlags() requires knowing the chain sync status.
So, by placing it inside CConnman, we add another level of indirection: most of the time, the peer manager class will call the CConnman setter and subsequently call the getter to utilize it, using the CConnman class mostly as a field container.

Also, it would complicate any expansion of the peer services selectivity logic, because different processes would need to update the flag stored in CConnman. For instance, the node could seek to exclusively connect to peers that provide block filters (up to a certain point, then pick other preferences on a case-by-case basis), could want to minimize the number of limited connections based on the chain download state or a detected stale chain, or could need to disallow certain peer types for a certain time, etc.

Considering this, the responsibility for managing this field seems to fit better inside the peer manager class.

Happy to hear your thoughts about this. Will keep thinking about it; we can still improve this to not require the cs_main lock even when it's placed inside the peer manager class.

Member


Because, most of the time, the peer manager class will call to the CConMan setter and subsequently call the getter to utilize it.

I wasn't saying to make connman the container for this data, I only meant to add a setter to inform connman of the things it needs to perform its job (making connections to desirable nodes). The main container for the flags would still be peerman.

Member Author

@furszy furszy Aug 17, 2023


Because, most of the time, the peer manager class will call to the CConMan setter and subsequently call the getter to utilize it.

I wasn't saying to make connman the container for this data, I only meant to add a setter to inform connman of the things it needs to perform its job (making connections to desirable nodes). The main container for the flags would still be peerman.

Ok. What I dislike about that approach is that we lose a lot of flexibility.

The PeerManager should be able to decide granularly whether to connect to a specific address before CConnman opens the connection and initiates the handshake.
The node's contextual information and the logic required to decide whether to open a socket to a particular address reside mostly in PeerManager (right now, we only have the stale chain condition, but there could be more). And, by adding another field inside CConnman, we would be duplicating data (the same desirable_services field inside both PeerManager and CConnman), which would need to be synchronized and could potentially lead to race conditions.

I mean, this misalignment could result in CConnman performing actions that are no longer in line with the latest PeerManager state. For instance, it might create connections to multiple peers with certain service flags when PeerManager only intended to connect to one peer supporting those service flags. This situation would eventually lead to PeerManager closing the newly created connections when it sends the version message, as these connections are no longer desirable, which is a waste of resources.

That is why I think the callback mechanism fits well here. PeerManager is able to evaluate each connection request and avoid executing the connection + handshake when we are going to disconnect the peer anyway.

Member


Circling back to this, I still don't think the interface choices in this PR are right.

Mostly it's just that the NetEventsInterface is not the place for a getter like this. It is meant for networking events (hence the name) such as opening/closing connections and processing/sending messages. So Approach NACK on misusing this interface.

The node's contextual information and logic required to take the decision of opening a socket to a particular address resides mostly in PeerManager (right now, we only have the stale chain condition, but there could be more)

This is not true? ThreadOpenConnections is a method on CConnman and handles the opening of network connections.

I would suggest that you take the same approach as we have already taken with CConnman::{SetTryNewOutboundPeer, StartExtraBlockRelayPeers}, i.e. peerman telling connman what to do. That approach avoids increasing the coupling between the two components.

Member Author

@furszy furszy Jan 19, 2024


The node's contextual information and logic required to take the decision of opening a socket to a particular address resides mostly in PeerManager (right now, we only have the stale chain condition, but there could be more)

This is not true? ThreadOpenConnections is a method on CConnman and handles the opening of network connections.

Yes, that's the layering violation that causes the coupling between PeerManager and CConnman that we both dislike and that shouldn't be there (at least not entirely). The connections manager alone does not have enough information to select the best candidate based on the node's context. CConnman cannot access the chain state, the sync progress, in-flight block requests, any services preference, or any other user customization to prefer certain peers over others, such as ones supporting a number of compact block filters or v2 encrypted connections. This is the reason why we are forced to set the bool flags you mentioned (which can easily get out of sync, causing wasted resources and races). Under this structure, the only thing we do is keep expanding the number of fields inside CConnman that are susceptible to the same issues (see below).

I would suggest that you take the same approach as we have already taken with CConnman::{SetTryNewOutboundPeer, StartExtraBlockRelayPeers}, i.e. peerman telling connman what to do. That approach avoids increasing the coupling between the two components.

I still stand by what I wrote previously at #28170 (comment).
Essentially, in my view, the approach is resource-wasteful and could lead to race conditions. And, following what I wrote above, it has a poor structure that increases the number of unrelated fields inside CConnman over time.

Overall, I'm in favor of decoupling these two components, but I'm not in favor of continuing to expand the current structure. The connections' desirable services flag can (and will) change continually over time, depending on the node's state, connection heuristics, and possible custom user configurations. It is a much more dynamic field than the ones you mention.

Mostly it's just that the NetEventsInterface is not the place for a getter like this. It is meant for networking events (hence the name) such as opening/closing connections and processing/sending messages.

Two points about this:

Firstly, it is not just a getter. The node needs to decide whether to connect to a certain address based on several factors that depend on the node's context, and caching a flag doesn't help here. I mentioned some possible use cases above. I don't think continuing to expand CConnman fields with data from upper layers is a good structure.

Secondly, I believe that with this statement we are somewhat in agreement that the logic of "to which address, supporting which services, should we connect?" shouldn't be part of CConnman. However, as long as it remains there, the process of deciding whether to connect to a certain address is a net event. Like you, I would prefer if it weren't there, and we could work on that. But I don't think this PR is introducing something incorrect right now.

Member


CConMan cannot access the chain state, the sync progress, in-flight block requests

I don't see how the connection opening logic needs access to any of those things.

number of compact block filters

Bitcoin Core doesn't download compact block filters.

v2 encrypted connections, etc.

Connman is doing this right now using our service flags and the service flag of the potential new peer. It has all the data it needs.

Essentially, in my view, the approach is resource-wasteful and could lead to race conditions.

Your approach has the exact same "flaw" w.r.t. connecting to desired service flags, e.g.:

  • connman calls HasDesirableServiceFlags for peer A, which returns true
  • connman proceeds with opening the connection to peer A
  • some event occurs (e.g. messages from other peers are processed) that would make HasDesirableServiceFlags return false for peer A
  • the connection to peer A is established

Peer A is now connected even though HasDesirableServiceFlags would return false. These races can't really be avoided, so it's best to stick with the interface design we've already got, imo.

Member Author


CConMan cannot access the chain state, the sync progress, in-flight block requests

I don't see how the connection opening logic needs access to any of those things.

I believe we are not in sync regarding the distinction between (1) the logic for deciding whether to connect to a specific address and (2) the thread/process responsible for opening such a connection. The former requires access to that contextual information; its usage is behind the extra_block_relay and try_another_outbound_peer bool flags, and also behind the existing HasAllDesirableServiceFlags() call.

What I intended to say in that sentence was that a node could have different connection strategies in the future, even though it doesn't have them now. These strategies would need access to the overall node state. That is why I'm saying that the current approach of adding fields inside CConnman with each new connection decision logic change makes the code harder to work with, rather than simplifying it.
Still, adding a callback as I do here isn't the holy grail either. It is slightly better for the reasons I state below, but in the future, with the introduction of more connection strategies, I think we should consider moving part of the ThreadOpenConnections logic to PeerManager.

number of compact block filters

Bitcoin Core doesn't download compact block filters.

It was just an example of a modification we could add to the connection decision-making process. We can replace it with a more general "reserve X outbound slots for peers supporting Y services", which could also depend on certain contextual factors. This is something either the user or we might want in the future, and like this one, there could be many other modifications.
But, in case it helps, I do have an experimental branch implementing compact block filters download and the entire BIP157 client mode.

v2 encrypted connections, etc.

Connman is doing this right now using our service flags and the service flag of the potential new peer. It has all the data it needs.

You are talking about the way we connect to potential v2 peers and fall back to v1 based on their response, right? My point was about preferring certain services over others, for a limited number of outbound connection slots, before opening the socket. This prioritization could change depending on the node's context, e.g. taking into consideration the chain progress and/or whether the user has configured the node differently.

Essentially, in my view, the approach is resource-wasteful and could lead to race conditions.

Your approach has the exact same "flaw" w.r.t. connecting to desired service flags, e.g.:

  • connman calls HasDesirableServiceFlags for peer A, which returns true
  • connman proceeds with opening the connection to peer A
  • some event occurs (e.g. messages from other peers are processed) that would make HasDesirableServiceFlags return false for peer A
  • the connection to peer A is established

Peer A is now connected even though HasDesirableServiceFlags would return false. These races can't really be avoided, so it's best to stick with the interface design we've already got, imo.

The difference between your proposed design and mine is essentially polling to set the value vs. calculating the value on demand.

In your design, the code needs to check data at regular intervals to update the flag in another component. If this process is executed too rapidly, it can harm performance, and if it is done too slowly, it wastes resources.
In the one I'm proposing, because the value is calculated only when it is needed, the race window is shortened.
Are we in agreement on that?

Still, following what I wrote in my previous comment, I would like to rethink ThreadOpenConnections, because otherwise we will keep adding fields to CConnman that are unrelated to the component itself.
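The polling-vs-on-demand distinction above can be sketched as follows. Every name here is hypothetical and the threshold is illustrative; the point is only that a cached flag can be stale between polls, while a callback reflects the latest state at the moment of use.

```cpp
#include <atomic>
#include <functional>

// Illustrative sketch only; all names are hypothetical. It contrasts the
// two designs discussed above: a flag cached in one component and
// refreshed by polling, versus a callback evaluated at the moment of use.

struct NodeState {
    std::atomic<int> blocks_behind{0}; // stand-in for the node's sync state
};

// Design A (cached flag): the consumer holds a copy that must be refreshed
// by polling; between refreshes the copy can be stale.
struct CachedDesign {
    std::atomic<bool> want_full_nodes{false};
    void Poll(const NodeState& state) { want_full_nodes = state.blocks_behind > 288; }
    bool WantFullNodes() const { return want_full_nodes; }
};

// Design B (callback): the consumer asks for the value only when it needs
// it, so the answer reflects the latest state and the race window shrinks
// to the gap between this call and the connection being established.
struct CallbackDesign {
    std::function<bool()> want_full_nodes;
    bool WantFullNodes() const { return want_full_nodes(); }
};
```

Neither design removes the race entirely (the state can still change after the decision), but design B never acts on a value older than the moment of the call.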

Member Author

@furszy furszy left a comment


I don't think the scenario you describe is likely to happen since we consider ourselves to be in IBD if our tip is older than 24h (which corresponds to ~144 blocks).

That's not entirely true; it misses an important part. We consider ourselves in IBD if our tip is older than 24h (or the custom max tip age) and we haven't completed IBD yet. Once IBD is completed, the node remains out of IBD for the entire software lifecycle.

Suppose the node has successfully synced the chain, but later experienced
dropped connections and remained inactive for a duration longer than the limited
peers threshold (the timeframe within which limited peers can provide blocks). In
such cases, upon reconnecting to the network, the node might only establish
connections with limited peers, filling up all available outbound slots.

So this isn't true; upon restart the node would be behind by >24h and open connections to non-pruning peers (unless the block rate significantly increased beyond 1 block/10 min).

This conclusion misses what I wrote above: the IBD flag is set to true but never reverted for the entire software lifecycle.

The scenario I described in the PR description is a real one, reproducible by running the #28120 test case, which exercises the proposed behavior.

Basically, same as the IBD flag, g_initial_block_download_completed is set to true but never reverted, which ends up with the node always connecting to limited peers after the first IBD completion, not accounting for prolonged stalling situations where the node is beyond the limited peers threshold due to an internet outage.

To test the behavior, you could drop the bug fix commit from that branch. You will see the post-IBD, super-stalled node connecting to a limited peer even when such a peer is not helpful, as it will not provide any of the historical blocks required to sync up the node.
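The latch described above can be sketched like this. It is an illustrative reduction of the real check in validation.cpp; the 24h constant mirrors Bitcoin Core's default max tip age, but the struct and its parameters are hypothetical simplifications.

```cpp
#include <cstdint>

// Illustrative reduction of the IBD latch described above; the real check
// lives in validation.cpp (IsInitialBlockDownload). The 24h constant
// mirrors Bitcoin Core's default max tip age, but the struct and its
// parameters are hypothetical simplifications.

static constexpr int64_t DEFAULT_MAX_TIP_AGE = 24 * 60 * 60; // seconds

struct IbdChecker {
    bool latched_out_of_ibd{false}; // set once, never reset

    bool IsInitialBlockDownload(int64_t tip_time, int64_t now)
    {
        // Once the node has left IBD, it never re-enters it, no matter
        // how stale the tip becomes later.
        if (latched_out_of_ibd) return false;
        if (now - tip_time > DEFAULT_MAX_TIP_AGE) return true;
        // Tip is recent enough: leave IBD and latch the result for the
        // remainder of the process lifetime.
        latched_out_of_ibd = true;
        return false;
    }
};
```

Applied to the scenario in the PR description: a node that completed IBD, went offline past the limited-peers window, and came back would still report "not in IBD", which is why the desirable-services logic cannot rely on the IBD flag alone.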

Contributor

@mzumsande mzumsande left a comment


Concept ACK
This seems to be similar to the suggestion by gmaxwell in #10387 (comment).
It is also a good thing to have protocol.h not depend on dynamic net_processing state.

@furszy furszy force-pushed the 2023_net_adaptable_desirable_flags branch from 2325ff0 to 26a9bb4 on August 17, 2023 22:18
@furszy furszy force-pushed the 2023_net_adaptable_desirable_flags branch 2 times, most recently from e00b965 to 8af5f25 on August 18, 2023 17:51
@furszy
Member Author

furszy commented Sep 20, 2023

per convo, cc @vasild

Member Author

@furszy furszy left a comment


Ok, fine. Then, just a comment nit: the recently added comments "Per BIP159, safety buffer" and "BIP159 connections safety window" had better be removed? Because that is not "A safety buffer of 144 blocks to handle chain reorganizations"?

Sure @vasild. Updated. Thanks! Small diff.

Contributor

@vasild vasild left a comment


ACK 27f260a

Wrt the interface addition discussed in #28170 (comment), I think it is not perfect but I do not see it as a blocker for this PR. Would be nice to see the discussion resolved in one way or another.

Member

@naumenkogs naumenkogs left a comment


ACK 27f260a

// Lastly, verify the stale tip checks can disallow limited peers connections after not receiving blocks for a prolonged period.
SetMockTime(GetTime<std::chrono::seconds>() + std::chrono::seconds{consensus.nPowTargetSpacing * NODE_NETWORK_LIMITED_ALLOW_CONN_BLOCKS + 1});
BOOST_CHECK(peerman->GetDesirableServiceFlags(peer_flags) == ServiceFlags(NODE_NETWORK | NODE_WITNESS));
}
Member


aff7d92

would be nice to test reacting to reorgs.

Member Author


Sure. As it would be really nice to progress on this PR, I will leave it for a follow-up. Thanks.

@DrahtBot DrahtBot requested review from andrewtoth and removed request for andrewtoth January 29, 2024 09:08
Contributor

@mzumsande mzumsande left a comment


Light Code Review ACK 27f260a

I don't have a strong opinion on the interface issue (I read the discussion and can see both points of view), but I do think that having a global variable g_initial_block_download_completed that lives in protocol.h and is updated dynamically from net_processing is an ugly layer violation too, so removing that is an important improvement from the architectural point of view.


// Check we start connecting to full nodes
ServiceFlags peer_flags{NODE_WITNESS | NODE_NETWORK_LIMITED};
BOOST_CHECK(peerman->GetDesirableServiceFlags(peer_flags) == ServiceFlags(NODE_NETWORK | NODE_WITNESS));
Contributor


could also have a check for HasAllDesirableServiceFlags each time GetDesirableServiceFlags is checked.

Member Author


Sure, noted. As it is a nit and I would love to move forward, I will leave it for a small follow-up. Thanks!

@DrahtBot DrahtBot requested review from andrewtoth and removed request for andrewtoth January 30, 2024 21:58
Contributor

@andrewtoth andrewtoth left a comment


ACK 27f260a

mineBlock(m_node, /*block_time=*/GetTime<std::chrono::seconds>());
BOOST_CHECK(peerman->GetDesirableServiceFlags(peer_flags) == ServiceFlags(NODE_NETWORK_LIMITED | NODE_WITNESS));

// Lastly, verify the stale tip checks can disallow limited peers connections after not receiving blocks for a prolonged period.
Contributor


As noted in #28170 (comment), stale tip checks don't factor into the current approach. This check seems redundant now with the one on lines 54-56.

Member Author

@furszy furszy Jan 31, 2024


As noted in #28170 (comment), stale tip checks don't factor into the current approach.

Sure. As it is a nit, I will update the comment in the follow-up PR. Thanks!

This check seems redundant now with the one on lines 54-56.

I think it's worth checking whether the services flags can still change after receiving blocks from the network. Lines 54-56 check updates prior to receiving any block.

@achow101
Member

ACK 27f260a

@achow101 achow101 merged commit 0b76874 into bitcoin:master Jan 31, 2024
16 checks passed
@furszy furszy deleted the 2023_net_adaptable_desirable_flags branch January 31, 2024 20:02
theStack added a commit to theStack/bitcoin that referenced this pull request Feb 28, 2024
…depth)

This adds coverage for the logic introduced in PR bitcoin#28170
("p2p: adaptive connections services flags").
theStack added a commit to theStack/bitcoin that referenced this pull request Mar 5, 2024
…depth)

This adds coverage for the logic introduced in PR bitcoin#28170
("p2p: adaptive connections services flags").
theStack added a commit to theStack/bitcoin that referenced this pull request Mar 11, 2024
…depth)

This adds coverage for the logic introduced in PR bitcoin#28170
("p2p: adaptive connections services flags").