p2p: make block download logic aware of limited peers threshold #28120
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage: For detailed information about the code coverage, see the test coverage report.
Reviews: See the guideline for information on the review process. If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Conflicts: No conflicts as of last run.
Maybe getting disconnected could be beneficial in this rare situation? If our tip is that much behind but we're not in IBD, we can't help them and they can't help us, so our priority should be to get the missing blocks asap, and getting disconnected by useless peers seems like it's not such a bad thing towards that goal.
What would be bad is if all our current peers were limited, and we wouldn't try to request the needed blocks from anyone, but also wouldn't try to exchange them for better peers, so that we would be stuck.
Maybe getting disconnected could be beneficial in this rare situation? If our tip is that much behind but we're not in IBD, we can't help them and they can't help us, so our priority should be to get the missing blocks asap, and getting disconnected by useless peers seems like it's not such a bad thing towards that goal.
Agree, but while the outcome might be the desired one, the approach to achieve it isn't the best. It seems unwise to have the node send a message over the wire, knowing it will fail, just to force the other peer to disconnect.
The node should be smarter than that and not rely on external factors to make decisions.
Moreover, in cases where the other peer chooses to ignore the message without disconnecting (maybe because it runs software other than Bitcoin Core), the node will end up waiting for a response that will never arrive, wasting time and resources that could easily have been saved.
Also, while the node still has connection slots available, I don't see why it should disconnect a peer that is behaving properly.
Instead, I think we should keep the processes separated: the eviction process should be in charge of disconnecting peers that aren't (and will not be) providing any meaningful data, and the block download logic should only be responsible for deciding which blocks to request from each peer and which not to.
I agree with this part. I think the solution should be expanded to handle such cases, and it should be merged within the same PR.
Agree with @mzumsande and @naumenkogs. Maybe not the same PR if it's substantially different, but the new eviction logic should be merged before this fix.
What would be bad is if all our current peers were limited, and we wouldn't try to request the needed blocks from anyone, but also wouldn't try to exchange them for better peers, so that we would be stuck.
Right now, this will hardly happen. The reason is CheckForStaleTipAndEvictPeers. The logic is as follows:

When the node doesn't update the tip for a period longer than 30 minutes (TipMayBeStale()), it triggers the extra outbound connection process, which instructs the net layer to bypass the maximum outbound connection restriction and create another full-relay outbound connection to sync up the missing blocks.

Then, consider the worst-case scenario: the extra outbound connection is also a limited peer. In that case, 45 seconds after connecting to the extra peer (scheduler task timeout), CheckForStaleTipAndEvictPeers will call EvictExtraOutboundPeers, which will evict the peer that least recently announced a block header.

Then, if the chain still doesn't move, the process starts again, adding another extra outbound connection, and so on, until the node finds a peer that can provide the missing blocks.
Note:
While this sooner or later corrects the stalling situation, it's not ideal to keep establishing connections to limited peers in the extra outbound connection process while the node is stalled. To address this, I have created #28170, which disallows connections to limited peers when the node is far enough behind to know that limited peers cannot provide any of the required historical blocks (although they can still provide valid headers).
Force-pushed from 8e91439 to c1612ea.
rebased. Conflicts solved.
Approach ACK c1612ea
Force-pushed from 9eba981 to 534996a.
Updated per feedback, thanks for the review!
Fixed the two-block buffer extension (it is now a reduction). Small diff.
Force-pushed from 534996a to adfc9c7.
Updated per feedback, thanks for the review @vasild!
The node will now request blocks within the limited-peer threshold whenever the block download window contemplates them, not only when the previous blocks have arrived or are in flight. Added test coverage exercising this behavior and verifying that the node can get synced even when blocks arrive out of order.
Force-pushed from adfc9c7 to 930e531.
@@ -1451,6 +1451,7 @@ void PeerManagerImpl::FindNextBlocks(std::vector<const CBlockIndex*>& vBlocks, c
{
    std::vector<const CBlockIndex*> vToFetch;
    int nMaxHeight = std::min<int>(state->pindexBestKnownBlock->nHeight, nWindowEnd + 1);
    bool is_limited_peer = IsLimitedPeer(peer);
nit: const bool is_limited_peer{IsLimitedPeer(peer)};
@@ -1475,6 +1476,11 @@ void PeerManagerImpl::FindNextBlocks(std::vector<const CBlockIndex*>& vBlocks, c
            return;
        }

        // Don't request blocks that go further than what limited peers can provide
        if (is_limited_peer && (state->pindexBestKnownBlock->nHeight - pindex->nHeight >= static_cast<int>(NODE_NETWORK_LIMITED_MIN_BLOCKS) - 2 /* two blocks buffer for possible races */)) {
I think what @naumenkogs is referring to here is that if this condition is true, pindex will hit it on the first pindex in the loop and on every pindex up until the condition is no longer true. This can be made more efficient by simply bumping pindex to the first pindex that is past the network-limited blocks threshold.
So, we can either change the loop from range-based to a regular for-loop and advance the index, or move this check out of this loop and into the loop on line 1464 and break early if the condition is met.
Code Review ACK c5b5843
ACK c5b5843
Built and ran all functional and unit tests on x86 Ubuntu. Confirmed the test fails on master and passes with the patch in the branch. Reviewed code and existing discussion. Regarding the decision to keep the pruning peer connected at all, I agree with the author that the affected code is only responsible for requesting available blocks from a peer, and it is reasonable to take the limited flag into account. I like the cleanup in the first commit as well; it's easier to read the loop with continue.
For sanity I added checks in the test to confirm blocks from the pruned node were "found" by the full node but had 'confirmations': -1,
which makes sense because there is a gap between the full node's chain tip and the blocks available from the pruning peer.
ACK c5b5843
Even when the node believes it has completed IBD, it needs to avoid
requesting historical blocks from network-limited peers.
Otherwise, the limited peer will disconnect right away.
The simplest scenario would be a node that gets synced, drops
connections, and stays inactive for a while. Then, once it re-connects
(with IBD still completed), the node tries to fetch all the missing
blocks from any peer, and gets disconnected by the limited ones.
Note:
The behavior can be verified by cherry-picking the test commit alone
onto master. It will fail there.