Periodically make block-relay connections and sync headers #19858
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts: Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
Concept ACK
Concept ACK

The most interesting question seems to be the eviction criteria after this new block-relay peer gave us a new block, and we want to evict someone. I think the "evict the youngest" approach is reasonable: it would be very hard for an attacker to control our block-relay-only connections by just serving blocks faster when we connect to them periodically. They'd have to also maintain very long-lived connections to evict honest peers rather than their own Sybils. We still have a couple places with
Force-pushed from ac71c8b to b92887f.
Now that #19724 has been merged, this is ready for review.
I actually didn't mean to necessarily include that. In this PR that change is unnecessary (though an improvement), but I guess I don't want to do a wholesale review of all the
Force-pushed from b92887f to be956cd.
I'm leaning towards Concept ACK, but have you considered the impact on #17428? I fear it may reduce its usefulness.
m_connman.ForEachNode([&](CNode* pnode) {
    LockAssertion lock(::cs_main);
    if (!pnode->IsBlockOnlyConn() || pnode->fDisconnect) return;
    if (pnode->GetId() > youngest_peer.first) {
This is making an assumption on GetNewNodeId() being a monotonic counter function of connection order. It may silently break if we modify id selection to something else (like random nonces). Can we use nTimeConnected instead?
I'd also prefer not to have this assumption, and it seems easy to avoid.
If m_connman.ForEachNode is not ordered, this algorithm may mistakenly select second_youngest.
Sure I can change this, but keep in mind this logic is used a little further down already, in the existing outbound full-relay peer eviction algorithm.
EDIT: Actually, I think while this has some logical appeal it makes the code read strictly worse -- CNode::nTimeConnected is tracked in seconds, so it's perfectly plausible that you might have two nodes that tie, and you'd presumably break the tie using nodeid anyway! I'm just going to add a comment that we're using nodeid to make the determination of which peer is younger (ie higher NodeId).
As an alternative, have you considered caching them? We know both when we open such connections and when we drop them. It would avoid the risk of logic bugs and avoid iterating over every other connection type not concerned by such eviction.
I think it's more a future work direction.
I assume you mean caching at the time we open the connection? I think that is problematic, because in order to keep that extra state up to date in the event that peers get disconnected and we make new connections after that, you have to do a lot of additional error checking, which introduces added complexity. Doing all the checks in one place, at the point in time that we care about getting to a new/correct state when we're over our maximum, seems simpler to reason about.
//
// Then disconnect the peer, if we haven't learned anything new.
//
// The idea is to make eclipse attacks very difficult to pull off,
I would be more conservative in claims about this new eclipse counter-measure's effectiveness.

I believe we should have a hierarchy of eclipse attacks, classified by the resources an attacker requires to successfully perform them, which would then serve as a basis to evaluate a counter-measure against a given scenario. The fact that a stronger attack A can easily bypass the counter-measure for attack B doesn't invalidate the worthiness of counter-measure B.

For this new periodic-chain-sync counter-measure, I do agree it improves things against eviction-logic takeover or partial addrman poisoning. However, I guess it doesn't score well against total addrman poisoning or infrastructure-based eclipse.

As a suggestion, maybe we can prefix any mention of eclipse attacks with a hint at the scenario considered, like addrman-poisoning or eviction-gaming?
I re-read my comment, and I think it's pretty accurate. If other reviewers think that the language here is somehow too strong and implies this logic is doing something it isn't, I'll reconsider.
Note, by the way, that the behavior introduced here is beneficial to not just the node doing it, but to the whole network, as a node already connected to the honest network that is periodically connecting to new peers to sync tips with others is helping bridge the entire network.
src/net_processing.cpp (outdated)
@@ -4064,6 +4108,11 @@ void PeerLogicValidation::CheckForStaleTipAndEvictPeers(const Consensus::Params
    }
    m_stale_tip_check_time = time_in_seconds + STALE_CHECK_INTERVAL;
}

if (!m_initial_sync_finished && CanDirectFetch(consensusParams)) {
I suppose this doesn't protect against an initial-network-connection eclipse attack like DNS cache poisoning. Maybe trigger this logic anyway after some timer, based on an optimistic headers-chain-sync delay, upon observing that our tip isn't near the local clock?
That said, if you're effectively eclipsed since the beginning and don't have any good peers available in your addrman I don't think it would change anything.
Wouldn't the existing stale-tip check let us get new outbound peers in the case that our tip isn't updating at all?
I was assuming someone slowly feeding you the most-PoW valid chain, thus never triggering the stale-tip check. I think a broader definition of eclipse attack should include slow chain feeding, as it opens the door to offchain exploitation.

That said, I think eclipse attacks during the bootstrap of your network view are a special case, and we can address them later with smarter heuristics based on this work.
We do require that our initial headers-sync peer provide us with a headers chain that looks reasonable within a bounded amount of time (on the order of 20 minutes if I remember correctly -- the time scales with the expected honest chain length and very conservative notions of how long it takes to download headers). However if we're connecting blocks slowly, we can't distinguish between our own node being too slow to validate the entire blockchain (say due to local computing/memory/disk/network resources) or our peers colluding to collectively slow us down.
This seems like something that needs human intervention to determine that initial sync is in fact going too slowly.
src/net.h (outdated)
@@ -51,6 +51,8 @@ static const bool DEFAULT_WHITELISTFORCERELAY = false;
 static const int TIMEOUT_INTERVAL = 20 * 60;
 /** Run the feeler connection loop once every 2 minutes or 120 seconds. **/
 static const int FEELER_INTERVAL = 120;
+/** Run the chain-sync connection loop once every 5 minutes. **/
+static const int CHAIN_SYNC_INTERVAL = 300;
can we make this mockable from the beginning? (std::chrono)
Can we punt until someone also changes the feeler logic to be the same? Right now the logic for both is very similar, which I think helps readability. (Also, I find std::chrono to be harder to work with than the tools I know, so I'm afraid I'll introduce an error if I try to make the change myself.)
Not sure what you mean by "feeler logic to be the same", but I'm making feeler timings mockable as part of #19869, you're very welcome to review :)
My opinion is not very strong here, we can update it later, I just thought it's a good opportunity.
If #19869 is merged first, then I'll update the code here as well when I rebase.
@@ -94,6 +94,9 @@ class PeerLogicValidation final : public CValidationInterface, public NetEventsI
 private:
     int64_t m_stale_tip_check_time; //!< Next time to check for stale tip

+    /** Whether we've completed initial sync yet, for determining when to turn
+     * on extra block-relay peers. */
+    bool m_initial_sync_finished{false};
Should this ever be set back to false? For example, if we were offline for a week and we know we're catching up.
If you go offline for a week by shutting down bitcoind there is no issue; if you close your laptop or disconnect from the network though then yes you're right that we'll use these occasional peers to help us catch up, which is not the intent. However, we don't have a good way to distinguish that situation in our code right now... Arguably stale-tip checking shouldn't fire either in those circumstances but we don't try to do anything to limit that?
I'm inclined to leave this, and if we somehow improve our software to detect circumstances like that, then we can update this logic accordingly.
if you close your laptop or disconnect from the network though then yes you're right that we'll use these occasional peers to help us catch up, which is not the intent.
If this is the intent (to stop making these short-lived connections if we've fallen behind the tip), then I think we can achieve that fairly easily by removing this caching variable, making CanDirectFetch() a public method on PeerManager, and calling that function whenever we need to test whether we should make an additional block-relay-only connection:
diff --git a/src/net.cpp b/src/net.cpp
index 48977aeadf..1de1bda9a8 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -1957,7 +1957,7 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect)
conn_type = ConnectionType::BLOCK_RELAY;
} else if (GetTryNewOutboundPeer()) {
// OUTBOUND_FULL_RELAY
- } else if (nTime > nNextExtraBlockRelay && m_start_extra_block_relay_peers) {
+ } else if (nTime > nNextExtraBlockRelay && m_msgproc->CanDirectFetch()) {
// Periodically connect to a peer (using regular outbound selection
// methodology from addrman) and stay connected long enough to sync
// headers, but not much else.
diff --git a/src/net.h b/src/net.h
index 58a5b36918..c836161f83 100644
--- a/src/net.h
+++ b/src/net.h
@@ -635,6 +635,7 @@ public:
virtual bool SendMessages(CNode* pnode) = 0;
virtual void InitializeNode(CNode* pnode) = 0;
virtual void FinalizeNode(NodeId id, bool& update_connection_time) = 0;
+ virtual bool CanDirectFetch() const = 0;
protected:
/**
diff --git a/src/net_processing.cpp b/src/net_processing.cpp
index ad40d67a97..ef47d00e73 100644
--- a/src/net_processing.cpp
+++ b/src/net_processing.cpp
@@ -883,6 +883,11 @@ void RequestTx(CNodeState* state, const GenTxid& gtxid, std::chrono::microsecond
} // namespace
+bool PeerManager::CanDirectFetch() const
+{
+ return ::CanDirectFetch(m_chainparams.GetConsensus());
+}
+
// This function is used for testing the stale tip eviction logic, see
// denialofservice_tests.cpp
void UpdateLastBlockAnnounceTime(NodeId node, int64_t time_in_seconds)
@@ -1956,7 +1961,7 @@ void PeerManager::ProcessHeadersMessage(CNode& pfrom, const std::vector<CBlockHe
m_connman.PushMessage(&pfrom, msgMaker.Make(NetMsgType::GETHEADERS, ::ChainActive().GetLocator(pindexLast), uint256()));
}
- bool fCanDirectFetch = CanDirectFetch(m_chainparams.GetConsensus());
+ bool fCanDirectFetch = CanDirectFetch();
// If this set of headers is valid and ends in a block with at least as
// much work as our tip, download as much as possible.
if (fCanDirectFetch && pindexLast->IsValid(BLOCK_VALID_TREE) && ::ChainActive().Tip()->nChainWork <= pindexLast->nChainWork) {
@@ -3261,7 +3266,7 @@ void PeerManager::ProcessMessage(CNode& pfrom, const std::string& msg_type, CDat
}
// If we're not close to tip yet, give up and let parallel block fetch work its magic
- if (!fAlreadyInFlight && !CanDirectFetch(m_chainparams.GetConsensus()))
+ if (!fAlreadyInFlight && !CanDirectFetch())
return;
if (IsWitnessEnabled(pindex->pprev, m_chainparams.GetConsensus()) && !nodestate->fSupportsDesiredCmpctVersion) {
@@ -4073,7 +4078,7 @@ void PeerManager::CheckForStaleTipAndEvictPeers()
m_stale_tip_check_time = time_in_seconds + STALE_CHECK_INTERVAL;
}
- if (!m_initial_sync_finished && CanDirectFetch(m_chainparams.GetConsensus())) {
+ if (!m_initial_sync_finished && CanDirectFetch()) {
m_connman.StartExtraBlockRelayPeers();
m_initial_sync_finished = true;
}
diff --git a/src/net_processing.h b/src/net_processing.h
index 6e3e032831..88bf7ff2a6 100644
--- a/src/net_processing.h
+++ b/src/net_processing.h
@@ -93,6 +93,11 @@ public:
*/
void Misbehaving(const NodeId pnode, const int howmuch, const std::string& message);
+ /**
+ * Return whether our tip block's time is close enough to current time that we can directly fetch.
+ */
+ bool CanDirectFetch() const;
+
private:
/**
* Potentially mark a node discouraged based on the contents of a BlockValidationState object
That approach may be preferable for a couple of reasons:

- Placing the logic that checks/sets the condition under which we'll make additional block-relay-only peers in the same place that makes those connections makes it much easier to reason about those state transitions. Currently m_start_extra_block_relay_peers is set based on a loop in net_processing and then read on a timer in net.
- Caching the state internally makes it more difficult to test all the code paths. If the start_extra_block_relay_peers condition is set by a call into PeerManager, then that interface can be mocked and code paths can be hit more easily in unit/fuzz testing.
I think this would mean that we stop using block-relay-only peer rotation when our tip is stale, which might be when we want these connections happening the most? It becomes arguable whether we should just rely on stale-tip-checking + full outbound peer rotation at that point, but I think we would want to carefully reason about our protections in that scenario.
Tested ACK b3a515c over several weeks, though this change and behavior could benefit from test coverage and other follow-ups (refactoring, etc.) described in the review feedback. I did not verify the behavior of m_start_extra_block_relay_peers only being enabled after initial chain sync. Since my last review, one unneeded cs_main lock was removed.
A few empirical observations from running this code over time on a tor v3 node with mostly only 18 outbounds (~half very good manual ones) and 0, sometimes 1, inbounds:
- The timing happens as expected
- GetExtraFullOutboundCount() and GetExtraBlockRelayCount() mostly return 0 (97%+ of the time); when not, they return 1
- Most of the time (97%+), the newly added block-relay-only peer is disconnected within ~50 seconds (and the log says "(last block received at time 0)")
- Actual peer rotation is rare
- Without the patch to induce a 10% chance of initializing nLastBlockTime to current time, I no longer saw any keeping/evicting events
My main wishlist would be that the code be designed from the start to be testable and have regression coverage. I'm not as confident in reviewing or changing code without coverage.
Logging used for my testing: jonatack@8986db4 (a variant of Suhas' patch above).
Code Review ACK b3a515c, only change since last time is dropping a useless
AFAIU this PR, I'm not worried about the network load it introduces, whatever metric we pick (connection slots, application bandwidth, IP/TCP bandwidth). Assume a really worst-case scenario, where the victim node has always fallen behind a better-work chain by ~5 headers and has to download them every 5 min (EXTRA_BLOCK_RELAY_ONLY_PEER_INTERVAL). 80 * 5 * 12 * 24 * 30 = 3_456_000, so this node will consume about 3.46 MB per month from the network. If I get the math right, I think that's fairly acceptable. That said, I would be glad if we started to develop and sustain frameworks to evaluate questions like network load, which rightfully spring up in this kind of work. Beyond agreeing on security-model efficiency, having a sound quantitative model would ease reaching Concept ACK/NACK.
I know I'm in the minority on this, but I still feel we should have a sound discussion before dissociating rationale further from code by writing more documentation in the wiki instead of in-tree. Contra:
I'm a big fan of the code documentation approach which was taken for #19988, and I hope we stick more to this kind of code documentation standard in the future.
Post merge ACK b3a515c
Testing this, I'm seeing some cases where the extra block-relay peer is being evicted for sending a tx, eg:
2020-12-11T10:12:05Z Added connection peer=38
2020-12-11T10:12:06Z receive version message: /Satoshi:0.20.1/: version 70015, blocks=660884, us=xxx:8333, peer=38
2020-12-11T10:12:06Z New outbound peer connected: version: 70015, blocks=660884, peer=38 (block-relay)
2020-12-11T10:12:17Z received: inv (1261 bytes) peer=38
2020-12-11T10:12:17Z got inv: tx 565fed8bc9ff5fa333a8130ec399f23d24d4bcdd435778b1fd2a67278a980ee6 have peer=38
2020-12-11T10:12:17Z transaction (565fed8bc9ff5fa333a8130ec399f23d24d4bcdd435778b1fd2a67278a980ee6) inv sent in violation of protocol, disconnecting peer=38
2020-12-11T10:12:17Z disconnecting peer=38
2020-12-11T10:12:17Z Cleared nodestate for peer=38
which might be worth looking into further just in case those peers are actually running core like they say and we have a bug.
@@ -2557,7 +2557,7 @@ void PeerManager::ProcessMessage(CNode& pfrom, const std::string& msg_type, CDat
     LogPrintf("New outbound peer connected: version: %d, blocks=%d, peer=%d%s (%s)\n",
               pfrom.nVersion.load(), pfrom.nStartingHeight,
               pfrom.GetId(), (fLogIPs ? strprintf(", peeraddr=%s", pfrom.addr.ToString()) : ""),
-              pfrom.m_tx_relay == nullptr ? "block-relay" : "full-relay");
+              pfrom.IsBlockOnlyConn() ? "block-relay" : "full-relay");
Isn't this suggestion (changing m_tx_relay == nullptr to IsBlockOnlyConn() when logging a new outbound peer) already done? I think github is confusing me...

Anyway, changing that entire x ? "y" : "z" to just pfrom.ConnectionTypeAsString() seems like it might be better.
@@ -1820,18 +1820,32 @@ void CConnman::SetTryNewOutboundPeer(bool flag)
 // Also exclude peers that haven't finished initial connection handshake yet
 // (so that we don't decide we're over our desired connection limit, and then
 // evict some peer that has finished the handshake)
-int CConnman::GetExtraOutboundCount()
+int CConnman::GetExtraFullOutboundCount()
If this does get refactored further, making these bool HasExtraBlockRelay() and bool HasExtraFullOutbound(), returning count > m_max_x instead of max(0, count - m_max_x), would be an idea too.
// Check whether we have too many OUTBOUND_FULL_RELAY peers
if (m_connman.GetExtraFullOutboundCount() > 0) {
    // If we have more OUTBOUND_FULL_RELAY peers than we target, disconnect one.
Seems better to use the names from ConnectionTypeAsString in comments than the shouty enums (ie outbound-full-relay) here.
Looking into this a bit further: I'm only seeing this coming from nodes advertising themselves as 0.20.1, and the timing seems consistent with a 5s poisson delay after the connection is established. Not all 0.20.1 nodes are failing this way. I've just added some additional logging, and it looks like the txs they're advertising aren't particularly strange, but all the violating nodes seem to be running on cloud hosting (Digital Ocean, Amazon, Google Cloud). So it seems plausible that they're just buggy and lying about their version details?

One thing that might be worth considering: our sybil mitigation only works for concurrent connections -- our 10 regular outbounds all have to be in different netgroups because they're simultaneously connected, but 10 sequential extra block-relay-only connections could all end up in the same netgroup. We could fix this by keeping track of the last 10 extra connections we've tried, and trying to choose the next one from a different netgroup.
I'm surprised that your log isn't showing the "send version message" line. You obviously have NET logging enabled since you're seeing the "receive version message" line. These are outbound connections, so I'd expect to see "send version message" (in
Even if we did have that logging in
Perhaps peer message capture would help here (#19509).
Oh, with logips enabled, I see those nodes also don't appear to be telling me my IP address, instead reporting their own IP in both
Edited to add: I think about 12% of the extra block-relay-only connections my peer is opening get disconnected for this reason.

Example:
Note: we are signalling no tx relay in the version message we send
Note: they are claiming to be 0.20.1, and are telling us our ip is the same as their ip.
Everything looks fine so far. Except 8s later we get sent some txids, despite saying we don't want them. The txids themselves don't seem particularly suspicious.
@ajtowns I think this is an interesting idea -- it seems like it would be a strict improvement in security (in a mathematical sense, i.e. I can't imagine how our security could be any worse off with that approach); but I'm not sure the additional complexity is worth the potential gain? Not intrinsically opposed, but maybe this isn't low-hanging fruit either.
0c41c10 doc: Remove shouty enums in net_processing comments (Suhas Daftuar) Pull request description: This uses the `CNode::ConnectionTypeAsString()` strings in place of the all-caps enums in a couple of comments in `net_processing`, as suggested by ajtowns in #19858 (comment). ACKs for top commit: practicalswift: ACK 0c41c10 jnewbery: ACK 0c41c10 laanwj: ACK 0c41c10 Tree-SHA512: c8ab905e151ebb144c3f878277dc59d77591e4b39632658407b69b80b80d65825d5a391b01e2aea6af2fdf174c143dfe7d2f3eba84a020a58d7926458fdcd0a5
6d1e85f Clean up logging of outbound connection type (Suhas Daftuar) Pull request description: We have a function that converts `ConnectionType` enums to strings, so use it. Suggested by ajtowns in #19858 (comment) ACKs for top commit: amitiuttarwar: ACK 6d1e85f naumenkogs: ACK 6d1e85f Tree-SHA512: f5084d8b5257380696d9fde86a8873e190cd4553feb07fa49df39bbd9510bf5832d190a3bca1571c48370d16a17c7a34900857b21b27bec0777bfa710211d7bb
IIUC, since this merge, an extra outbound-full-relay connection is made upon a stale tip (3 block intervals), and an extra outbound-block-relay connection is now made every 5 minutes. Is there still value in the former?
Yes, I think there is. The stale-tip logic more aggressively seeks out a new peer to connect to (staying in that state until a new connection is actually made), while this logic fires once on a selection from addrman and gives up even if no connection is made. Moreover, the stale-tip logic is an eviction algorithm for full-relay peers, while this logic is only for block-relay peers. I think having rotation logic for both makes sense, though there is probably room to improve the interaction between these two behaviors in all the various cases we can think of.
@sdaftuar Thanks for explaining. Could this be a way to improve the interaction between the two behaviors?:
Perhaps the simpler starting point is just to implement (2)? Are there any downsides/known attacks if we do that? |
4740fe8 test: Add test for block relay only eviction (Martin Zumsande) Pull request description: Adds a unit test for block-relay-only eviction logic added in #19858, which was not covered by any tests before. The added test is very similar to the existing `stale_tip_peer_management` unit test, which tests the analogous logic for regular outbound peers. ACKs for top commit: glozow: reACK 4740fe8 rajarshimaitra: tACK 4740fe8 shaavan: ACK 4740fe8. Great work @ mzumsande! LarryRuane: ACK 4740fe8 Tree-SHA512: 5985afd7d8f7ae311903dbbf6b7d526e16309c83c88ae6dd6551960c0b186156310a6be0cf6b684f82ac1378d0fc5aa3717f0139e078471013fceb6aebe81bf6
Summary: ``` To make eclipse attacks more difficult, regularly initiate outbound connections and stay connected long enough to sync headers and potentially learn of new blocks. If we learn a new block, rotate out an existing block-relay peer in favor of the new peer. This augments the existing outbound peer rotation that exists -- currently we make new full-relay connections when our tip is stale, which we disconnect after waiting a small time to see if we learn a new block. As block-relay connections use minimal bandwidth, we can make these connections regularly and not just when our tip is stale. Like feeler connections, these connections are not aggressive; whenever our timer fires (once every 5 minutes on average), we'll try to initiate a new block-relay connection as described, but if we fail to connect we just wait for our timer to fire again before repeating with a new peer. ``` Backport of [[bitcoin/bitcoin#19858 | core#19858]]. Ref T1696. Test Plan: ninja all check-all Run IBD. Reviewers: #bitcoin_abc, PiRK Reviewed By: #bitcoin_abc, PiRK Maniphest Tasks: T1696 Differential Revision: https://reviews.bitcoinabc.org/D10907
To make eclipse attacks more difficult, regularly initiate outbound connections
and stay connected long enough to sync headers and potentially learn of new
blocks. If we learn a new block, rotate out an existing block-relay peer in
favor of the new peer.
This augments the existing outbound peer rotation that exists -- currently we
make new full-relay connections when our tip is stale, which we disconnect
after waiting a small time to see if we learn a new block. As block-relay
connections use minimal bandwidth, we can make these connections regularly and
not just when our tip is stale.
Like feeler connections, these connections are not aggressive; whenever our
timer fires (once every 5 minutes on average), we'll try to initiate a new
block-relay connection as described, but if we fail to connect we just wait for
our timer to fire again before repeating with a new peer.