Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

p2p: peer connection bug fixes #28248

Closed
wants to merge 15 commits into from

Conversation

jonatack
Copy link
Contributor

@jonatack jonatack commented Aug 9, 2023

This pull fixes several peer connection bugs in our p2p code, along with the logging that uncovered them:

  1. Fix detection of inbound peer connections in GetAddedNodeInfo.

  2. Fix addnode CJDNS peers not detected in GetAddedNodeInfo, causing ThreadOpenAddedConnections to continually retry to connect to them and RPC getaddednodeinfo incorrectly showing them as not connected.

  3. Fix ThreadOpenConnections not detecting inbound CJDNS connections and making automatic outbound connections to them.

  4. Fix detection of already connected peers in AlreadyConnectedToAddress().

  5. Fix detection of already connected peers when making outbound connections in ConnectNode.

  6. Do not accept inbound connections in CreateNodeFromAcceptedSocket from I2P peers we're already connected to, as building I2P tunnels is expensive.

  7. Fix making automatic outbound connections in ThreadOpenConnections to addnode peers, in order not to allocate our limited outbound slots to them and to ensure addnode connections benefit from their intended protections. Our addnode logic usually connects the addnode peers before the automatic outbound logic does, but not always, as a connection race can occur (see the commit message for further details and mainnet examples). When an addnode peer is connected as an automatic outbound peer and is the only connection we have to a network, it can be protected by our new outbound network-specific eviction logic and persist in the "wrong role". Fix these issues by checking if the selected address is an addnode peer in our automatic outbound connection logic.

Update the p2p logging with the improvements that allowed seeing/understanding/debugging the current behavior. Please see the commit messages for details.

Simplify MaybePickPreferredNetwork to return std::optional, make it a const class method, and add Clang thread-safety analysis annotation and related assertions.

@DrahtBot
Copy link
Contributor

DrahtBot commented Aug 9, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK brunoerg
Stale ACK vasild

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #28155 (net: improves addnode / m_added_nodes logic by sr-gi)
  • #27114 (p2p: Allow whitelisting outgoing connections by brunoerg)
  • #27052 (test: rpc: add last block announcement time to getpeerinfo result by LarryRuane)
  • #26859 (fuzz: extend ConsumeNetAddr() to return I2P and CJDNS addresses by vasild)
  • #25832 (tracing: network connection tracepoints by 0xB10C)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot DrahtBot added the P2P label Aug 9, 2023
@jonatack jonatack marked this pull request as ready for review August 10, 2023 00:56
@dergoegge
Copy link
Member

The title "p2p: outbound network diversity improvements" makes it sound like you are improving on the logic for making outbound connections but really you are refactoring and adding additional logging. Would you mind making the title more accurate?

Please see the commit messages for details.

The PR description is added to the merge commit, so in my opinion it makes sense to have at least a summary of what the PR does in the description.

From our docs:

The body of the pull request should contain sufficient description of what the patch does, and even more importantly, why, with 
justification and reasoning. You should include references to any discussions (for example, other issues or mailing list
discussions).

@jonatack jonatack changed the title p2p: outbound network diversity improvements p2p: outbound network diversity refactoring and logging improvements Aug 10, 2023
@jonatack
Copy link
Contributor Author

@dergoegge updated

@ajtowns
Copy link
Contributor

ajtowns commented Aug 11, 2023

This mostly seems like pointless refactoring (changing a switch to an || of two functions that themselves have switches). Doesn't seem worth the review effort to me.

@jonatack jonatack changed the title p2p: outbound network diversity refactoring and logging improvements p2p: outbound logic, logging and refactoring improvements Aug 12, 2023
@jonatack jonatack changed the title p2p: outbound logic, logging and refactoring improvements p2p: bugfixes, logic and logging improvements Aug 13, 2023
@jonatack jonatack force-pushed the 2023-08-network-diversity branch 2 times, most recently from 428169f to 875d7b0 Compare August 14, 2023 00:03
@jonatack
Copy link
Contributor Author

Updated to propose fixes for the issues observed with the improved logging.

src/net.cpp Outdated Show resolved Hide resolved
@jonatack
Copy link
Contributor Author

Updated for @vasild's review feedback to change m_added_nodes from a vector to an unordered set (3f4a1f0), which simplifies and speeds up the AddNode() and RemoveAddedNode() methods along with the commit that follows it, p2p: do not make automatic outbound connections to addnode peers (048e1af).

src/net.cpp Outdated Show resolved Hide resolved
src/net.cpp Outdated Show resolved Hide resolved
src/net.cpp Outdated Show resolved Hide resolved
src/net.cpp Outdated Show resolved Hide resolved
@DrahtBot DrahtBot mentioned this pull request Aug 24, 2023
@jonatack jonatack force-pushed the 2023-08-network-diversity branch 2 times, most recently from abc2507 to 7019c23 Compare September 6, 2023 06:20
@maflcko
Copy link
Member

maflcko commented Nov 1, 2023

Maybe mark as draft while CI is red?

@jonatack
Copy link
Contributor Author

jonatack commented Nov 1, 2023

Done. Note that only the latest push is red. I can push a version that is green, but have been rebasing on #28155 that I reckon will be merged soon, and adding missing test coverage. Concept ACKs on fixing the bugs would be encouraging 😃

@jonatack jonatack marked this pull request as draft November 1, 2023 20:07
@jonatack
Copy link
Contributor Author

jonatack commented Nov 1, 2023

Note also, that item 7 in the PR description involves a regression in v26.x.

@fanquake
Copy link
Member

fanquake commented Nov 1, 2023

Note also, that item 7 in the PR description involves a regression in v26.x.

What is the regression? Which change caused it?

@mzumsande
Copy link
Contributor

mzumsande commented Nov 1, 2023

Concept ACKs on fixing the bugs would be encouraging 😃

I think that some of the commits are straightforward (e.g. the missing MaybeFlipIPv6toCJDNS calls or the logging improvements), whereas others are fixing some non-ideal behavior in some cases, but have the risk of introducing new non-ideal behavior in other cases because the fix comes with downsides of its own.

For example, if it isn't possible to distinguish situations in which two connections involving the same IP are controlled by the same entity or different entities, any solution would be imperfect, either allowing duplicate connections or preventing legitimate behavior (that is possibly widely used in local setups). So it's unclear to me if the status quo on these is actually a bug or an intended best-effort solution / the result of a compromise.

Maybe some of the straightforward bugfixes could be split out into another PR?

@jonatack
Copy link
Contributor Author

jonatack commented Nov 1, 2023

Reported in https://github.com/bitcoin/bitcoin/pull/27213/files#r1291926369 and described in 21f8426, there is an interaction between the new network-specific automatic outbound logic and the addnode list: the former doesn't account for the latter.

Notable changes
===============

P2P and network changes
-----------------------

- Nodes with multiple reachable networks will actively try to have at least one
  outbound connection to each network. This improves individual resistance to
  eclipse attacks and network level resistance to partition attacks. Users no
  longer need to perform active measures to ensure being connected to multiple
  enabled networks. (#27213)

Node operators running over multiple networks may be likely to use -addnode to ensure connection to each of them (I do). They would be the ones most likely to be affected by the change with their addnode peers being allotted to rare outbound slots instead. It makes sense to exclude addnode peers from the automatic outbound logic, either in general or for the new network-specific one.

@jonatack
Copy link
Contributor Author

jonatack commented Nov 1, 2023

Maybe some of the straightforward bugfixes could be split out into another PR?

Yes -- looking at doing that, or dropping here for now any that aren't relatively clear.

src/net.cpp Show resolved Hide resolved
@DrahtBot
Copy link
Contributor

DrahtBot commented Nov 8, 2023

🐙 This pull request conflicts with the target branch and needs rebase.

@mzumsande
Copy link
Contributor

mzumsande commented Nov 15, 2023

Note also, that item 7 in the PR description involves a regression in v26.x.

I don't think that this can be called a regression:

  • There is nothing new about the root problem (making automatic connections to addnode peers in rare cases) in v26.
  • It can happen that such a peer could now be protected as the only one of its network, but to determine if it's a regression this needs to be compared with the status before.
  • Before we did protection-by-network, we would only ever get into the situation with an extra-outbound peer if we had a stale tip (30 minutes). Unless we are eclipsed or something weird is going on, this happens naturally a few times per day on average.
  • If we are in that situation, it is not very probable that we'd evict the outbound peer that we'd rather have as an addnode peer. First of all, we'd usually just evict the new extra-peer (unless it provides a new block, which should happen very rarely, especially when we have other high-bandwidth OB peers at the same time). But even in the rare cases, we often would not evict the specific peer to solve the addnode problem, that peer might be one of the 4 protected peers, or we might just pick another one.

I'd guess you could probably have waited for months to actually evict a mis-placed peer that way. So I'd say that the stale-tip eviction was never a functioning mechanism to resolve these kind of addnode problems (and never meant to be), and as a result, #27213 was not a regression - in practice, it might arguably be even an improvement instead of a regression, because in those cases where the peer in question is not the only one from its network, we'll get into the situation where we have an extra peer to evict more frequently than before.

@jonatack
Copy link
Contributor Author

@mzumsande The improved logging allows distinguishing when this is due to the new protection logic, and with occasional exceptions, it is.

@mzumsande
Copy link
Contributor

@mzumsande The improved logging allows distinguishing when this is due to the new protection logic, and with occasional exceptions, it is.

That's not my point. My point (see above) is that in earlier versions, the extra-outbound eviction (which would only happen in the stale-tip case then) was not a working or reliable mechanism to evict these peers, so there is no regression. Or did I miss something in my above explanation / do you have logs that pre-26 the stale-tip outbound eviction could somehow resolve this problem reliably?

@jonatack
Copy link
Contributor Author

Empirically, the regression is that we're actively adding addnode peers as outbound ones, and keeping them there. Since a couple of years I run a node with all three of our privacy networks and addnode entries for each, along with clearnet, and I didn't see my addnode peers being connected to as non-manual outbounds. This is new behavior for my node.

Aside, I'm adding unit tests for each change right now, and these are finding additional issues.

@mzumsande
Copy link
Contributor

mzumsande commented Nov 15, 2023

Empirically, the regression is that we're actively adding addnode peers as outbound ones, and keeping them there.

Oh, I thought you were talking about a race at startup - since the extra outbound should only kick in after 5 minutes, I think that should be enough to make connections to the addnode peers.

I think what might be happening later is that an addnode peer disconnects us, and if it was the only one from its network then there could be a race between picking another network-related outbound peer (which could be the peer that just disconnected us if addrman would select it) and reconnecting to the peer that evicted us with addnode thread. Could that be what you are seeing?

@jonatack
Copy link
Contributor Author

@mzumsande I think so, yes, along with the other races mentioned in the commit message 21f8426. I'll add the case you describe to it (thanks!)

@jonatack
Copy link
Contributor Author

jonatack commented Nov 15, 2023

I'd be interested to hear anyone's thoughts on this, from that commit message:

Finally, there does not seem to be a reason to make block-relay or short-lived feeler connections to addnode peers, as the addnode logic will ensure we connect to them if they are up, within the addnode connection limit.

This may not be optimal if one were to add many addnode entries, i.e. more than 8 that are online. I don't know if people do that. Maybe it would be prudent to allow feelers to addnodes.

@mzumsande
Copy link
Contributor

Maybe it would be prudent to allow feelers to addnodes.

I think it's fine to skip feelers for addnode peers. When we would skip these addresses for automatic connection, I don't see what making a feeler connection to them would achieve, considering that the point of feelers is to have better options for future automatic connections.

@DrahtBot
Copy link
Contributor

⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?

  • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
  • Is it no longer relevant? ➡️ Please close.
  • Did the author lose interest or time to work on this? ➡️ Please close it and mark it 'Up for grabs' with the label, so that it can be picked up in the future.

@jonatack
Copy link
Contributor Author

jonatack commented Apr 8, 2024

Will continue work on these bugfixes. Closing to be able to re-open it.

@jonatack jonatack closed this Apr 8, 2024
fanquake added a commit that referenced this pull request May 16, 2024
d0b0474 test: add GetAddedNodeInfo() CJDNS regression unit test (Jon Atack)
684da97 p2p, bugfix: detect addnode cjdns peers in GetAddedNodeInfo() (Jon Atack)

Pull request description:

  Addnode peers connected to us via the cjdns network are currently not detected by `CConnman::GetAddedNodeInfo()`, i.e. `fConnected` is always false. This causes the following issues:

  - RPC `getaddednodeinfo` incorrectly shows them as not connected

  - `CConnman::ThreadOpenAddedConnections()` continually retries to connect them

  Fix the issue and add a unit regression test. Extracted from #28248. Suggest running the test with:

  `./src/test/test_bitcoin -t net_peer_connection_tests -l test_suite`

ACKs for top commit:
  mzumsande:
    utACK d0b0474
  brunoerg:
    crACK d0b0474
  pinheadmz:
    ACK d0b0474

Tree-SHA512: a4d81425f79558f5792585611f3fe8ab999b82144daeed5c3ec619861c69add934c2b2afdad24c8488a0ade94f5ce8112f5555d60a1ce913d4f5a1cf5dbba55a
@bitcoin bitcoin deleted a comment from Waseem650 May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants