p2p: adaptive connections services flags #28170

Merged
merged 6 commits into from Jan 31, 2024

Conversation

furszy
Member

@furszy furszy commented Jul 27, 2023

Derived from #28120 discussion.

By relocating the peer desirable services flags into the peer manager, we
allow the connections acceptance process to handle potential post-IBD
stalling scenarios.

The peer manager will be able to dynamically adjust the services flags
based on the node's proximity to the tip (back and forth), allowing the node
to recover from the following post-IBD scenario:
Suppose the node has successfully synced the chain, but later experienced
dropped connections and remained inactive for longer than the limited
peers threshold (the timeframe within which limited peers can provide blocks). In
such a case, upon reconnecting to the network, the node might only establish
connections with limited peers, filling up all available outbound slots and
leaving it unable to synchronize the chain (because limited peers will not provide
blocks older than the NODE_NETWORK_LIMITED_MIN_BLOCKS threshold).
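The adaptive behavior described above can be sketched as follows. This is an illustrative simplification, not the PR's actual implementation: the service-flag values and the NODE_NETWORK_LIMITED_MIN_BLOCKS constant mirror Bitcoin Core, but the `tip_blocks_behind` parameter is a hypothetical stand-in for the node's real proximity-to-tip calculation.

```cpp
#include <cstdint>

// Illustrative sketch only. The flag values and constant below mirror
// Bitcoin Core, but this function is a simplified stand-in for the PR's
// logic, and tip_blocks_behind is a hypothetical input (the real code
// derives the node's proximity to the tip from chain state).
enum ServiceFlags : uint64_t {
    NODE_NONE = 0,
    NODE_NETWORK = (1 << 0),          // serves the full block history
    NODE_WITNESS = (1 << 3),
    NODE_NETWORK_LIMITED = (1 << 10), // serves only recent blocks
};

// Limited peers are only guaranteed to serve the last 288 blocks.
static constexpr int NODE_NETWORK_LIMITED_MIN_BLOCKS = 288;

// Decide which services to require from new outbound connections based on
// how far our tip is behind the network.
ServiceFlags GetDesirableServiceFlags(int tip_blocks_behind)
{
    if (tip_blocks_behind > NODE_NETWORK_LIMITED_MIN_BLOCKS) {
        // Too far behind: limited peers cannot provide the blocks we are
        // missing, so require full nodes.
        return ServiceFlags(NODE_NETWORK | NODE_WITNESS);
    }
    // Near the tip: limited peers are acceptable.
    return ServiceFlags(NODE_NETWORK_LIMITED | NODE_WITNESS);
}
```

The key point is the "back and forth": the same node can move between the two return values as its tip proximity changes, rather than latching one answer forever.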

@DrahtBot
Contributor

DrahtBot commented Jul 27, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK vasild, naumenkogs, mzumsande, andrewtoth, achow101

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #29231 (logging: Update to new logging API by ajtowns)
  • #28016 (p2p: gives seednode priority over dnsseed if both are provided by sr-gi)
  • #27052 (test: rpc: add last block announcement time to getpeerinfo result by LarryRuane)
  • #15218 (validation: Flush state after initial sync by andrewtoth)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Member

@dergoegge dergoegge left a comment


I don't think the scenario you describe is likely to happen since we consider ourselves to be in IBD if our tip is older than 24h (which corresponds to ~144 blocks).

Suppose the node has successfully synced the chain, but later experienced
dropped connections and remained inactive for a duration longer than the limited
peers threshold (the timeframe within which limited peers can provide blocks). In
such cases, upon reconnecting to the network, the node might only establish
connections with limited peers, filling up all available outbound slots.

So this isn't true; upon restart the node would be behind by >24h and open connections to non-pruning peers (unless the block rate significantly increased beyond 1 block/10 min).


Getting rid of the g_initial_block_download_completed global would still be worthwhile but the other changes seem unnecessary to me.

src/net.h (comment on lines 669 to 1019):
/**
* Callback to determine whether the given set of service flags are sufficient
* for a peer to be "relevant".
*/
virtual bool HasDesirableServiceFlags(ServiceFlags services) = 0;
Member


This interface is not the right place for a callback like this because it is not a "net event".

Would be better to have a method on connman, e.g. CConnman::SetDesirableServiceFlags that is then called by peerman.

Member Author


This interface is not the right place for a callback like this because it is not a "net event".

Hmm ok, that would avoid locking cs_main in the open connections thread. Which is nice.

Member Author

@furszy furszy Aug 8, 2023


Have thought more about this and I'm not so convinced about it anymore.

The connection manager class lives on a lower level and manages only a fraction of the node's information, while this field is strictly linked to the node's overall status. Right now, HasDesirableServiceFlags() requires knowing the chain sync status.
So, by placing it inside CConnman, we add another level of indirection: most of the time, the peer manager class will call the CConnman setter and subsequently call the getter to utilize it, using the CConnman class mostly as a field container.

Also, it would complicate any expansion of the peer services selectivity logic, because different processes would need to update the flag stored in CConnman. For instance, the node could seek to exclusively connect to peers that provide block filters (up to a certain point, then pick other preferences on a case-by-case basis), could want to minimize the number of limited connections based on the chain download state or a detected stale chain, or could need to disallow certain peer types for a certain time, etc.

Considering this, the responsibility for managing this field seems to fit better inside the peer manager class.

Happy to hear your thoughts about this. Will keep thinking about it; we can still improve this to not require the cs_main lock even when it's placed inside the peer manager class.

Member


Because, most of the time, the peer manager class will call to the CConMan setter and subsequently call the getter to utilize it.

I wasn't saying to make connman the container for this data, I only meant to add a setter to inform connman of the things it needs to perform its job (making connections to desirable nodes). The main container for the flags would still be peerman.

Member Author

@furszy furszy Aug 17, 2023


Because, most of the time, the peer manager class will call to the CConMan setter and subsequently call the getter to utilize it.

I wasn't saying to make connman the container for this data, I only meant to add a setter to inform connman of the things it needs to perform its job (making connections to desirable nodes). The main container for the flags would still be peerman.

Ok. What I dislike about that approach is that we lose a lot of flexibility.

The PeerManager should be able to decide granularly whether to connect to a specific address before CConnman opens the connection and initiates the handshake.
The node's contextual information and the logic required to decide whether to open a socket to a particular address reside mostly in PeerManager (right now, we only have the stale chain condition, but there could be more). And, by adding another field inside CConnman, we would be duplicating data (the same desirable_services field inside both PeerManager and CConnman), which would need to be synchronized and could potentially lead to race conditions.

I mean, this misalignment could result in CConnman performing actions that are no longer in line with the latest PeerManager state. For instance, it might create connections to multiple peers with certain service flags when PeerManager only intended to connect to one peer supporting those service flags. This situation would eventually lead to PeerManager closing the newly created connections when it sends the version message, as these connections are no longer desirable, which is a waste of resources.

That is why I think the callback mechanism fits well here. PeerManager is able to evaluate each connection request and avoid executing the connection + handshake when we are going to disconnect the peer anyway.

Member


Circling back to this, I still don't think the interface choices in this PR are right.

Mostly it's just that the NetEventsInterface is not the place for a getter like this. It is meant for networking events (hence the name) such as opening/closing connections and processing/sending messages. So Approach NACK on misusing this interface.

The node's contextual information and logic required to take the decision of opening a socket to a particular address resides mostly in PeerManager (right now, we only have the stale chain condition, but there could be more)

This is not true? ThreadOpenConnections is a method on CConnman and handles the opening of network connections.

I would suggest that you take the same approach as we have already taken with CConnman::{SetTryNewOutboundPeer, StartExtraBlockRelayPeers}, i.e. peerman telling connman what to do. That approach avoids increasing the coupling between the two components.

Member Author

@furszy furszy Jan 19, 2024


The node's contextual information and logic required to take the decision of opening a socket to a particular address resides mostly in PeerManager (right now, we only have the stale chain condition, but there could be more)

This is not true? ThreadOpenConnections is a method on CConnman and handles the opening of network connections.

Yes, that's the layering violation that causes the coupling between PeerManager and CConnman that we both dislike and that shouldn't be there (at least not entirely). The connections manager alone does not have enough information to select the best candidate based on the node's context. CConnman cannot access the chain state, the sync progress, in-flight block requests, any services preference, or any other user customization to prefer certain peers over others, such as ones supporting a number of compact block filters or v2 encrypted connections. This is the reason why we are forced to set the bool flags you mentioned (which can easily get out of sync, causing wasted resources and races). Under this structure, the only thing we do is keep expanding the number of fields inside CConnman that are susceptible to the same issues (see below).

I would suggest that you take the same approach as we have already taken with CConnman::{SetTryNewOutboundPeer, StartExtraBlockRelayPeers}, i.e. peerman telling connman what to do. That approach avoids increasing the coupling between the two components.

I still stand by what I wrote previously at #28170 (comment).
Essentially, in my view, the approach is resource-wasteful and could lead to race conditions. And, following what I wrote above, it has a poor structure that increases the number of unrelated fields inside CConnman over time.

Overall, I'm in favor of decoupling these two components, but I'm not in favor of continuing to expand the current structure. The connections' desirable services flag can (and will) change continually over time, depending on the node's state, connection heuristics, and possible custom user configurations. It is a much more dynamic field than the ones you mention.

Mostly it's just that the NetEventsInterface is not the place for a getter like this. It is meant for networking events (hence the name) such as opening/closing connections and processing/sending messages.

Two points about this:

Firstly, it is not just a getter. The node needs to decide whether to connect to a certain address based on several factors that depend on the node's context, and caching a flag doesn't help here. I mentioned some possible use cases above. I don't think continuing to expand CConnman fields with data from upper layers is a good structure.

Secondly, I believe that with this statement we are somewhat in agreement that the logic of "to which address, supporting which services, should we connect?" shouldn't be part of CConnman. However, as long as it remains there, the process of deciding whether to connect to a certain address is a net event. Like you, I would prefer if it weren't there, and we could work on that. But I don't think this PR is introducing something incorrect right now.

Member


CConMan cannot access the chain state, the sync progress, in-flight block requests

I don't see how the connection opening logic needs access to any of those things.

number of compact block filters

Bitcoin Core doesn't download compact block filters.

v2 encrypted connections, etc.

Connman is doing this right now using our service flags and the service flag of the potential new peer. It has all the data it needs.

Essentially, in my view, the approach is resource-wasteful and could lead to race conditions.

Your approach has the exact same "flaw" w.r.t. connecting to desired service flags, e.g.:

  • connman calls HasDesirableServiceFlags for peer A, which returns true
  • connman proceeds with opening the connection to peer A
  • some event occurs (e.g. messages from other peers are processed) that would make HasDesirableServiceFlags return false for peer A
  • the connection to peer A is established

Peer A is now connected even though HasDesirableServiceFlags would return false. These races can't really be avoided, so it's best to stick with the interface design we've already got, imo.

Member Author


CConMan cannot access the chain state, the sync progress, in-flight block requests

I don't see how the connection opening logic needs access to any of those things.

I believe we are not in sync regarding the distinction between (1) the logic for deciding whether to connect to a specific address and (2) the thread/process responsible for opening such a connection. The former requires access to that contextual information; its usage is behind the extra_block_relay and try_another_outbound_peer bool flags, and also behind the existing HasAllDesirableServiceFlags() call.

What I intended to say in that sentence was that a node could have different connection strategies in the future, even though it doesn't have them now. These strategies would need access to the overall node state. That is why I'm saying that the current approach of adding fields inside CConnman with each new connection decision logic change makes the code harder to work with, rather than simplifying it.
Still, adding a callback as I do here isn't the holy grail either. It is slightly better for the reasons I state below, but in the future, with the introduction of more connection strategies, I think we should consider moving part of the ThreadOpenConnections logic to PeerManager.

number of compact block filters

Bitcoin Core doesn't download compact block filters.

It was just an example of a modification we could add to the connection decision-making process. We can replace it with a more general "reserve X outbound slots for peers supporting Y services", which could also depend on certain contextual factors. This is something either the user or we might want in the future, and like this one, there could be many other modifications.
But, in case it helps, I do have an experimental branch implementing compact block filters download and the entire BIP157 client mode.

v2 encrypted connections, etc.

Connman is doing this right now using our service flags and the service flag of the potential new peer. It has all the data it needs.

You are talking about the way we connect to potential v2 peers and fall back to v1 based on their response, right? My point was about preferring certain services over others, for a limited number of outbound connection slots, before opening the socket. This prioritization could change depending on the node's context, e.g. taking into consideration the chain progress and/or whether the user has configured the node differently.

Essentially, in my view, the approach is resource-wasteful and could lead to race conditions.

Your approach has the exact same "flaw" w.r.t. connecting to desired service flags, e.g.:

  • connman calls HasDesirableServiceFlags for peer A, which returns true
  • connman proceeds with opening the connection to peer A
  • some event occurs (e.g. messages from other peers are processed) that would make HasDesirableServiceFlags return false for peer A
  • the connection to peer A is established

Peer A is now connected even though HasDesirableServiceFlags would return false. These races can't really be avoided, so it's best to stick with the interface design we've already got, imo.

The difference between your proposed design and mine is essentially polling to set the value vs. calculating the value on demand.

In your design, the code needs to check data at regular intervals to update the flag in another component. If this process is executed too rapidly, it can harm performance, and if it is done too slowly, it wastes resources.
In the one I'm proposing, because the value is calculated only when it is needed, the race window is shortened.
Are we in agreement on that?

Still, following what I wrote in my previous comment, I would like to rethink ThreadOpenConnections, because otherwise we will keep adding fields to CConnman that are unrelated to the component itself.
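The polling-vs-on-demand distinction above can be sketched as follows. Every name here is hypothetical and the threshold is illustrative; the point is only that a cached flag can be stale between polls, while a callback reflects the latest state at the moment of use.

```cpp
#include <atomic>
#include <functional>

// Illustrative sketch only; all names are hypothetical. It contrasts the
// two designs discussed above: a flag cached in one component and
// refreshed by polling, versus a callback evaluated at the moment of use.

struct NodeState {
    std::atomic<int> blocks_behind{0}; // stand-in for the node's sync state
};

// Design A (cached flag): the consumer holds a copy that must be refreshed
// by polling; between refreshes the copy can be stale.
struct CachedDesign {
    std::atomic<bool> want_full_nodes{false};
    void Poll(const NodeState& state) { want_full_nodes = state.blocks_behind > 288; }
    bool WantFullNodes() const { return want_full_nodes; }
};

// Design B (callback): the consumer asks for the value only when it needs
// it, so the answer reflects the latest state and the race window shrinks
// to the gap between this call and the connection being established.
struct CallbackDesign {
    std::function<bool()> want_full_nodes;
    bool WantFullNodes() const { return want_full_nodes(); }
};
```

Neither design removes the race entirely (the state can still change after the decision), but design B never acts on a value older than the moment of the call.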

Member Author

@furszy furszy left a comment


I don't think the scenario you describe is likely to happen since we consider ourselves to be in IBD if our tip is older than 24h (which corresponds to ~144 blocks).

That's not entirely true; it misses an important part. We consider ourselves in IBD if our tip is older than 24h (or the custom max tip age) and we haven't completed IBD yet. Once IBD is completed, the node remains out of IBD for the entire software lifecycle.

Suppose the node has successfully synced the chain, but later experienced
dropped connections and remained inactive for a duration longer than the limited
peers threshold (the timeframe within which limited peers can provide blocks). In
such cases, upon reconnecting to the network, the node might only establish
connections with limited peers, filling up all available outbound slots.

So this isn't true; upon restart the node would be behind by >24h and open connections to non-pruning peers (unless the block rate significantly increased beyond 1 block/10 min).

This conclusion misses what I wrote above: the IBD flag is set to true but never reverted for the entire software lifecycle.

The scenario I described in the PR description is a real one, reproducible by running the #28120 test case, which exercises the proposed behavior.

Basically, same as the IBD flag, g_initial_block_download_completed is set to true but never reverted, which ends up with the node always connecting to limited peers after the first IBD completion, not accounting for prolonged stalling situations where the node is beyond the limited peers threshold due to an internet outage.

To test the behavior, you could drop the bug fix commit from that branch. You will see the post-IBD, super-stalled node connecting to a limited peer even when such a peer is not helpful, as it will not provide any of the historical blocks required to sync up the node.
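The latch described above can be sketched like this. It is an illustrative reduction of the real check in validation.cpp; the 24h constant mirrors Bitcoin Core's default max tip age, but the struct and its parameters are hypothetical simplifications.

```cpp
#include <cstdint>

// Illustrative reduction of the IBD latch described above; the real check
// lives in validation.cpp (IsInitialBlockDownload). The 24h constant
// mirrors Bitcoin Core's default max tip age, but the struct and its
// parameters are hypothetical simplifications.

static constexpr int64_t DEFAULT_MAX_TIP_AGE = 24 * 60 * 60; // seconds

struct IbdChecker {
    bool latched_out_of_ibd{false}; // set once, never reset

    bool IsInitialBlockDownload(int64_t tip_time, int64_t now)
    {
        // Once the node has left IBD, it never re-enters it, no matter
        // how stale the tip becomes later.
        if (latched_out_of_ibd) return false;
        if (now - tip_time > DEFAULT_MAX_TIP_AGE) return true;
        // Tip is recent enough: leave IBD and latch the result for the
        // remainder of the process lifetime.
        latched_out_of_ibd = true;
        return false;
    }
};
```

Applied to the scenario in the PR description: a node that completed IBD, went offline past the limited-peers window, and came back would still report "not in IBD", which is why the desirable-services logic cannot rely on the IBD flag alone.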

Contributor

@mzumsande mzumsande left a comment


Concept ACK
This seems to be similar to the suggestion by gmaxwell in #10387 (comment).
It is also a good thing to have protocol.h not depend on dynamic net_processing state.

@furszy furszy force-pushed the 2023_net_adaptable_desirable_flags branch from 2325ff0 to 26a9bb4 on August 17, 2023 22:18
@furszy furszy force-pushed the 2023_net_adaptable_desirable_flags branch 2 times, most recently from e00b965 to 8af5f25 on August 18, 2023 17:51
@furszy
Member Author

furszy commented Sep 20, 2023

per convo, cc @vasild

Member Author

@furszy furszy left a comment


Ok, fine. Then, just a comment nit: the recently added comments "Per BIP159, safety buffer" and "BIP159 connections safety window" had better be removed? Because that is not "A safety buffer of 144 blocks to handle chain reorganizations"?

Sure @vasild. Updated. Thanks! Small diff.

Contributor

@vasild vasild left a comment


ACK 27f260a

Wrt the interface addition discussed in #28170 (comment), I think it is not perfect but I do not see it as a blocker for this PR. Would be nice to see the discussion resolved in one way or another.

Member

@naumenkogs naumenkogs left a comment


ACK 27f260a

// Lastly, verify the stale tip checks can disallow limited peers connections after not receiving blocks for a prolonged period.
SetMockTime(GetTime<std::chrono::seconds>() + std::chrono::seconds{consensus.nPowTargetSpacing * NODE_NETWORK_LIMITED_ALLOW_CONN_BLOCKS + 1});
BOOST_CHECK(peerman->GetDesirableServiceFlags(peer_flags) == ServiceFlags(NODE_NETWORK | NODE_WITNESS));
}
Member


aff7d92

would be nice to test reacting to reorgs.

Member Author


Sure. As it would be really nice to progress on this PR, I will leave it for a follow-up. Thanks.

@DrahtBot DrahtBot requested review from andrewtoth and removed request for andrewtoth January 29, 2024 09:08
Contributor

@mzumsande mzumsande left a comment


Light Code Review ACK 27f260a

I don't have a strong opinion on the interface issue (I read the discussion and can see both points of view), but I do think that having a global variable g_initial_block_download_completed that lives in protocol.h and is updated dynamically from net_processing is an ugly layer violation too, so removing that is an important improvement from the architectural point of view.


// Check we start connecting to full nodes
ServiceFlags peer_flags{NODE_WITNESS | NODE_NETWORK_LIMITED};
BOOST_CHECK(peerman->GetDesirableServiceFlags(peer_flags) == ServiceFlags(NODE_NETWORK | NODE_WITNESS));
Contributor


could also have a check for HasAllDesirableServiceFlags each time GetDesirableServiceFlags is checked.

Member Author


Sure, noted. As it is a nit and I would love to move forward, I will leave it for a small follow-up. Thanks!

@DrahtBot DrahtBot requested review from andrewtoth and removed request for andrewtoth January 30, 2024 21:58
Contributor

@andrewtoth andrewtoth left a comment


ACK 27f260a

mineBlock(m_node, /*block_time=*/GetTime<std::chrono::seconds>());
BOOST_CHECK(peerman->GetDesirableServiceFlags(peer_flags) == ServiceFlags(NODE_NETWORK_LIMITED | NODE_WITNESS));

// Lastly, verify the stale tip checks can disallow limited peers connections after not receiving blocks for a prolonged period.
Contributor


As noted in #28170 (comment), stale tip checks don't factor into the current approach. This check seems redundant now with the one on lines 54-56.

Member Author

@furszy furszy Jan 31, 2024


As noted in #28170 (comment), stale tip checks don't factor into the current approach.

Sure. As it is a nit, I will update the comment in the follow-up PR. Thanks!

This check seems redundant now with the one on lines 54-56.

I think it's worth checking whether the services flags can still change after receiving blocks from the network. Lines 54-56 check updates prior to receiving any block.

@achow101
Member

ACK 27f260a

@achow101 achow101 merged commit 0b76874 into bitcoin:master Jan 31, 2024
16 checks passed
@furszy furszy deleted the 2023_net_adaptable_desirable_flags branch January 31, 2024 20:02
theStack added a commit to theStack/bitcoin that referenced this pull request Feb 28, 2024
…depth)

This adds coverage for the logic introduced in PR bitcoin#28170
("p2p: adaptive connections services flags").
theStack added a commit to theStack/bitcoin that referenced this pull request Mar 5, 2024
…depth)

This adds coverage for the logic introduced in PR bitcoin#28170
("p2p: adaptive connections services flags").
theStack added a commit to theStack/bitcoin that referenced this pull request Mar 11, 2024
…depth)

This adds coverage for the logic introduced in PR bitcoin#28170
("p2p: adaptive connections services flags").