
Periodically make block-relay connections and sync headers #19858


Merged
merged 4 commits into from
Dec 11, 2020

Conversation

sdaftuar
Member

@sdaftuar sdaftuar commented Sep 1, 2020

To make eclipse attacks more difficult, regularly initiate outbound connections
and stay connected long enough to sync headers and potentially learn of new
blocks. If we learn a new block, rotate out an existing block-relay peer in
favor of the new peer.

This augments the existing outbound peer rotation -- currently we make new
full-relay connections when our tip is stale, and disconnect them after
waiting a short time to see if we learn a new block. As block-relay
connections use minimal bandwidth, we can make these connections regularly and
not just when our tip is stale.

Like feeler connections, these connections are not aggressive; whenever our
timer fires (once every 5 minutes on average), we'll try to initiate a new
block-relay connection as described, but if we fail to connect we just wait for
our timer to fire again before repeating with a new peer.
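A minimal sketch of the averaged timer described above, assuming exponentially distributed inter-arrival times (the same idea behind Bitcoin Core's Poisson-style schedulers; the function name and signature here are illustrative, not the real API):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sketch: schedule the next connection attempt after an
// exponentially distributed delay, so attempts arrive as a Poisson process
// with the given average interval. uniform01 is a uniform draw in [0, 1).
int64_t NextAttemptTime(int64_t now_usec, int average_interval_seconds, double uniform01)
{
    // Inverse-CDF sampling of the exponential distribution.
    double delay_sec = -std::log1p(-uniform01) * average_interval_seconds;
    return now_usec + static_cast<int64_t>(delay_sec * 1e6);
}
```

With a 300-second average, a uniform draw of 1 - e^{-1} maps to a delay of exactly one mean interval.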

@sdaftuar
Member Author

sdaftuar commented Sep 1, 2020

This implements something similar to what is discussed in #16859.

Marking this as draft for now, as it builds on top of #19724.

@DrahtBot DrahtBot added the P2P label Sep 1, 2020
@DrahtBot
Contributor

DrahtBot commented Sep 1, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@practicalswift
Contributor

Concept ACK

@naumenkogs
Member

Concept ACK

The most interesting question seems to be the eviction criteria after this new block-relay peer gave us a new block, and we want to evict someone.

I think the "evict the youngest" approach is reasonable: it would be very hard for an attacker to control our block-relay-only connections by just serving blocks faster when we connect to them periodically. They'd have to also maintain very long-lived connections to evict honest peers rather than their own Sybils.


We still have a couple places with m_tx_relay == nullptr, and in some of them we mention block-relay-only in a related comment. Should we converge at least those to IsBlockOnlyConn() check?

@sdaftuar sdaftuar force-pushed the 2020-08-blocks-only-rotation branch 3 times, most recently from ac71c8b to b92887f on September 3, 2020 14:43
@sdaftuar sdaftuar marked this pull request as ready for review September 3, 2020 14:46
@sdaftuar
Member Author

sdaftuar commented Sep 3, 2020

Now that #19724 has been merged, this is ready for review.

@sdaftuar
Member Author

sdaftuar commented Sep 3, 2020

We still have a couple places with m_tx_relay == nullptr, and in some of them we mention block-relay-only in a related comment. Should we converge at least those to IsBlockOnlyConn() check?

I actually didn't mean to necessarily include that m_tx_relay == nullptr in the LogPrintf() that I did; the suggestion of changing to the IsBlockOnlyConn() check was leftover from a previous version of this work, where I was going to introduce a new connection type altogether and the m_tx_relay == nullptr check would have been incorrect for identifying the peer type.

In this PR that change is unnecessary (though an improvement), but I don't want to do a wholesale review of all the m_tx_relay == nullptr checks still in the code -- many of them are places where we check for nullptr before dereferencing, which I think makes sense to leave as-is. Places in the code where we use that check to decide how to treat a peer logically are different, and should be switched at some point. If there are places like that which make this new code harder to understand, please flag them and I can include fixes here; but since I'm not adding new peer types, I hope that won't be necessary.

@sdaftuar sdaftuar force-pushed the 2020-08-blocks-only-rotation branch from b92887f to be956cd on September 3, 2020 16:45

@ariard ariard left a comment


I'm leaning towards Concept ACK, but have you considered the interaction with #17428? I fear it may reduce its usefulness.

m_connman.ForEachNode([&](CNode* pnode) {
LockAssertion lock(::cs_main);
if (!pnode->IsBlockOnlyConn() || pnode->fDisconnect) return;
if (pnode->GetId() > youngest_peer.first) {

This makes an assumption that GetNewNodeId() is a monotonic counter reflecting connection order. It may silently break if we modify node id selection to something else (like random nonces). Can we use nTimeConnected instead?

Member


I'd also prefer not to have this assumption, and it seems easy to avoid.

Member


If m_connman.ForEachNode is not ordered, this algorithm may mistakenly select the second-youngest.

Member Author

@sdaftuar sdaftuar Sep 4, 2020


Sure I can change this, but keep in mind this logic is used a little further down already, in the existing outbound full-relay peer eviction algorithm.

EDIT: Actually, I think while this has some logical appeal, it makes the code read strictly worse -- CNode::nTimeConnected is tracked in seconds, so it's perfectly plausible that two nodes tie, and you'd presumably break the tie using nodeid anyway! I'm just going to add a comment that we're using nodeid to determine which peer is younger (i.e. higher NodeId).
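For illustration, the "youngest peer by NodeId" selection being discussed could be sketched as follows (simplified stand-in types, not the actual CNode/ForEachNode interface):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using NodeId = int64_t;

// Simplified stand-in for CNode, keeping only the fields this sketch needs.
struct Peer {
    NodeId id;
    bool block_relay_only;
    bool disconnecting;
};

// Among block-relay-only peers not already marked for disconnect, treat the
// one with the highest NodeId as the "youngest", since node ids are assigned
// in connection order (the assumption debated above).
NodeId SelectYoungestBlockRelayPeer(const std::vector<Peer>& peers)
{
    NodeId youngest = -1; // -1 means no candidate found
    for (const Peer& p : peers) {
        if (!p.block_relay_only || p.disconnecting) continue;
        if (p.id > youngest) youngest = p.id;
    }
    return youngest;
}
```

Note that a peer already flagged for disconnect is skipped, mirroring the `pnode->fDisconnect` check in the excerpt above.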


As an alternative, have you considered caching them? We know both when we open such connections and when we drop them. That would avoid the risk of a logic bug, and avoid iterating over every other connection type not affected by this eviction.

I think it's more of a future work direction.

Member Author


I assume you mean caching at the time we open the connection? I think that is problematic: in order to keep that extra state up to date when peers get disconnected and we make new connections afterwards, you have to do a lot of additional error checking, which introduces added complexity. Doing all the checks in one place, at the point in time that we care about getting to a new/correct state when we're over our maximum, seems simpler to reason about.

//
// Then disconnect the peer, if we haven't learned anything new.
//
// The idea is to make eclipse attacks very difficult to pull off,

I would be more conservative about claims of this new eclipse countermeasure's effectiveness.

I believe we should have a hierarchy of eclipse attacks, classified by the resources an attacker needs to successfully perform them; that would then serve as a basis for evaluating a countermeasure against a given scenario. The fact that a stronger attack A can easily bypass the countermeasure for attack B doesn't invalidate the worthiness of the countermeasure for B.

For this new periodic-chain-sync countermeasure, I do agree it improves our defenses against eviction-logic takeover or partial addrman poisoning. However, I guess it doesn't score well against total addrman poisoning or infrastructure-based eclipse.

As a suggestion, maybe we can prefix any mention of eclipse attacks with a hint of the scenario considered, like addrman-poisoning or eviction-gaming?

Member Author


I re-read my comment, and I think it's pretty accurate. If other reviewers think that the language here is somehow too strong and implies this logic is doing something it isn't, I'll reconsider.

Note, by the way, that the behavior introduced here is beneficial to not just the node doing it, but to the whole network, as a node already connected to the honest network that is periodically connecting to new peers to sync tips with others is helping bridge the entire network.

@@ -4064,6 +4108,11 @@ void PeerLogicValidation::CheckForStaleTipAndEvictPeers(const Consensus::Params
}
m_stale_tip_check_time = time_in_seconds + STALE_CHECK_INTERVAL;
}

if (!m_initial_sync_finished && CanDirectFetch(consensusParams)) {

I suppose this doesn't protect against an initial-network-connection eclipse attack like DNS cache poisoning. Maybe we should trigger this logic anyway after some timer based on an optimistic headers-chain sync delay, if we observe that our tip isn't near the local clock?

That said, if you're effectively eclipsed from the beginning and don't have any good peers available in your addrman, I don't think it would change anything.

Member Author


Wouldn't the existing stale-tip check let us get new outbound peers in the case that our tip isn't updating at all?


I was assuming someone slowly feeding you the most-PoW valid chain, thus never triggering the stale-tip check. I think a broader definition of eclipse attack should include slow chain feeding, as it opens the door to offchain exploitation.

That said, I think eclipse attacks during the bootstrap of your network view are a special case, and we can address them later with smarter heuristics based on this work.

Member Author


We do require that our initial headers-sync peer provide us with a headers chain that looks reasonable within a bounded amount of time (on the order of 20 minutes if I remember correctly -- the time scales with the expected honest chain length and very conservative notions of how long it takes to download headers). However if we're connecting blocks slowly, we can't distinguish between our own node being too slow to validate the entire blockchain (say due to local computing/memory/disk/network resources) or our peers colluding to collectively slow us down.

This seems like something that needs human intervention to determine that initial sync is in fact going too slowly.

src/net.h Outdated
@@ -51,6 +51,8 @@ static const bool DEFAULT_WHITELISTFORCERELAY = false;
static const int TIMEOUT_INTERVAL = 20 * 60;
/** Run the feeler connection loop once every 2 minutes or 120 seconds. **/
static const int FEELER_INTERVAL = 120;
/** Run the chain-sync connection loop once every 5 minutes. **/
static const int CHAIN_SYNC_INTERVAL = 300;
Member

@naumenkogs naumenkogs Sep 4, 2020


can we make this mockable from the beginning? (std::chrono)

Member Author


Can we punt until someone also changes the feeler logic to be the same? Right now the logic for both is very similar, which I think helps readability. (Also, I find std::chrono to be harder to work with than the tools I know, so I'm afraid I'll introduce an error if I try to make the change myself.)

Member


Not sure what you mean by "feeler logic to be the same", but I'm making feeler timings mockable as part of #19869, you're very welcome to review :)

My opinion is not very strong here, we can update it later, I just thought it's a good opportunity.
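As an illustration of the std::chrono alternative being discussed (the PR as written keeps a plain int constant to match FEELER_INTERVAL; this is only a sketch of the mockable variant, with an illustrative helper, not a claim about the final code):

```cpp
#include <cassert>
#include <chrono>

// A std::chrono-typed interval: the unit is part of the type, so call sites
// can't accidentally mix seconds and minutes.
static constexpr auto EXTRA_BLOCK_RELAY_ONLY_PEER_INTERVAL = std::chrono::minutes{5};

// Hypothetical helper: taking "now" as a parameter (rather than reading the
// wall clock) is what makes the timer mockable in unit tests.
bool TimerFired(std::chrono::seconds now, std::chrono::seconds next_fire)
{
    return now > next_fire;
}
```

The comparison mirrors the `nTime > nNextExtraBlockRelay` check in the patch, but a test can drive `now` with a fake clock instead of waiting five real minutes.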

Member Author


If #19869 is merged first, then I'll update the code here as well when I rebase.

@@ -94,6 +94,9 @@ class PeerLogicValidation final : public CValidationInterface, public NetEventsI
private:
int64_t m_stale_tip_check_time; //!< Next time to check for stale tip

/** Whether we've completed initial sync yet, for determining when to turn
* on extra block-relay peers. */
bool m_initial_sync_finished{false};
Member


Should this ever be set back to false? For example, if we were offline for a week and we know we're catching up.

Member Author


If you go offline for a week by shutting down bitcoind there is no issue; if you close your laptop or disconnect from the network though then yes you're right that we'll use these occasional peers to help us catch up, which is not the intent. However, we don't have a good way to distinguish that situation in our code right now... Arguably stale-tip checking shouldn't fire either in those circumstances but we don't try to do anything to limit that?

I'm inclined to leave this, and if we somehow improve our software to detect circumstances like that, then we can update this logic accordingly.

Contributor


if you close your laptop or disconnect from the network though then yes you're right that we'll use these occasional peers to help us catch up, which is not the intent.

If this is the intent (to stop making these short-lived connections if we've fallen behind the tip), then I think we can achieve that fairly easily by removing this caching variable, making CanDirectFetch() a public method on PeerManager and calling that function whenever we need to test if we should make an additional block-relay-only connection:

diff --git a/src/net.cpp b/src/net.cpp
index 48977aeadf..1de1bda9a8 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -1957,7 +1957,7 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect)
             conn_type = ConnectionType::BLOCK_RELAY;
         } else if (GetTryNewOutboundPeer()) {
             // OUTBOUND_FULL_RELAY
-        } else if (nTime > nNextExtraBlockRelay && m_start_extra_block_relay_peers) {
+        } else if (nTime > nNextExtraBlockRelay && m_msgproc->CanDirectFetch()) {
             // Periodically connect to a peer (using regular outbound selection
             // methodology from addrman) and stay connected long enough to sync
             // headers, but not much else.
diff --git a/src/net.h b/src/net.h
index 58a5b36918..c836161f83 100644
--- a/src/net.h
+++ b/src/net.h
@@ -635,6 +635,7 @@ public:
     virtual bool SendMessages(CNode* pnode) = 0;
     virtual void InitializeNode(CNode* pnode) = 0;
     virtual void FinalizeNode(NodeId id, bool& update_connection_time) = 0;
+    virtual bool CanDirectFetch() const = 0;
 
 protected:
     /**
diff --git a/src/net_processing.cpp b/src/net_processing.cpp
index ad40d67a97..ef47d00e73 100644
--- a/src/net_processing.cpp
+++ b/src/net_processing.cpp
@@ -883,6 +883,11 @@ void RequestTx(CNodeState* state, const GenTxid& gtxid, std::chrono::microsecond
 
 } // namespace
 
+bool PeerManager::CanDirectFetch() const
+{
+    return ::CanDirectFetch(m_chainparams.GetConsensus());
+}
+
 // This function is used for testing the stale tip eviction logic, see
 // denialofservice_tests.cpp
 void UpdateLastBlockAnnounceTime(NodeId node, int64_t time_in_seconds)
@@ -1956,7 +1961,7 @@ void PeerManager::ProcessHeadersMessage(CNode& pfrom, const std::vector<CBlockHe
             m_connman.PushMessage(&pfrom, msgMaker.Make(NetMsgType::GETHEADERS, ::ChainActive().GetLocator(pindexLast), uint256()));
         }
 
-        bool fCanDirectFetch = CanDirectFetch(m_chainparams.GetConsensus());
+        bool fCanDirectFetch = CanDirectFetch();
         // If this set of headers is valid and ends in a block with at least as
         // much work as our tip, download as much as possible.
         if (fCanDirectFetch && pindexLast->IsValid(BLOCK_VALID_TREE) && ::ChainActive().Tip()->nChainWork <= pindexLast->nChainWork) {
@@ -3261,7 +3266,7 @@ void PeerManager::ProcessMessage(CNode& pfrom, const std::string& msg_type, CDat
         }
 
         // If we're not close to tip yet, give up and let parallel block fetch work its magic
-        if (!fAlreadyInFlight && !CanDirectFetch(m_chainparams.GetConsensus()))
+        if (!fAlreadyInFlight && !CanDirectFetch())
             return;
 
         if (IsWitnessEnabled(pindex->pprev, m_chainparams.GetConsensus()) && !nodestate->fSupportsDesiredCmpctVersion) {
@@ -4073,7 +4078,7 @@ void PeerManager::CheckForStaleTipAndEvictPeers()
         m_stale_tip_check_time = time_in_seconds + STALE_CHECK_INTERVAL;
     }
 
-    if (!m_initial_sync_finished && CanDirectFetch(m_chainparams.GetConsensus())) {
+    if (!m_initial_sync_finished && CanDirectFetch()) {
         m_connman.StartExtraBlockRelayPeers();
         m_initial_sync_finished = true;
     }
diff --git a/src/net_processing.h b/src/net_processing.h
index 6e3e032831..88bf7ff2a6 100644
--- a/src/net_processing.h
+++ b/src/net_processing.h
@@ -93,6 +93,11 @@ public:
      */
     void Misbehaving(const NodeId pnode, const int howmuch, const std::string& message);
 
+    /**
+     * Return whether our tip block's time is close enough to current time that we can directly fetch.
+     */
+    bool CanDirectFetch() const;
+
 private:
     /**
      * Potentially mark a node discouraged based on the contents of a BlockValidationState object

That approach may be preferable for a couple of reasons:

  • Placing the logic that checks/sets the condition under which we'll make additional block-relay-only peers in the same place that it makes those connections makes it much easier to reason about those state transitions. Currently m_start_extra_block_relay_peers is set based on a loop in net_processing and then read on a timer in net.
  • Caching the state internally makes it more difficult to test all the code paths. If the start_extra_block_relay_peers condition is set by a call into PeerManager, then that interface can be mocked and code paths can be hit more easily in unit/fuzz testing.

Member Author


I think this would mean that we stop using block-relay-only peer rotation when our tip is stale, which might be when we want these connections happening the most? It becomes arguable whether we should just rely on stale-tip-checking + full outbound peer rotation at that point, but I think we would want to carefully reason about our protections in that scenario.

Member

@jonatack jonatack left a comment


Tested ACK b3a515c over several weeks, though this change and behavior could benefit from test coverage and other follow-ups (refactoring, etc.) described in the review feedback. I did not verify the behavior of m_start_extra_block_relay_peers only being enabled after initial chain sync. Since my last review, one unneeded cs_main lock was removed.

A few empirical observations from running this code over time on a tor v3 node with mostly only 18 outbounds (~half very good manual ones) and 0, sometimes 1, inbounds:

  • The timing happens as expected
  • GetExtraFullOutboundCount() and GetExtraBlockRelayCount() mostly return 0 (97%+ of the time); when not, they return 1
  • Most of the time (97%+), the newly added block-relay-only peer is disconnected within ~50 seconds (and the log says "(last block received at time 0)")
  • Actual peer rotation is rare
  • Without the patch to induce a 10% chance of initializing nLastBlockTime to current time, I no longer saw any keeping/evicting events

My main wishlist would be that the code be designed from the start to be testable and have regression coverage. I'm not as confident in reviewing or changing code without coverage.

@jonatack
Member

jonatack commented Dec 10, 2020

Logging used for my testing: jonatack@8986db4 (a variant of Suhas' patch above).

@ariard

ariard commented Dec 10, 2020

Code Review ACK b3a515c; the only change since last time is dropping a useless cs_main lock. I manually tested a previous version of the PR, and no substantial change has been introduced since then which would alter behavior IMO.


The trade-off here is that the more frequently we make connections (and therefore add edges to the network graph of nodes), the more security we have against partitioning attacks, but also the more network load we sustain. I don't think we have a great framework for evaluating the load, so I'd just eyeball it and observe after we deploy it, and we can always tune it if we think it's excessive or insufficient.

AFAIU this PR, I'm not worried about the network load it introduces, whatever metric we pick (connection slots / application bandwidth / IP/TCP bandwidth). Let's assume a really worst-case scenario, where the victim node is always behind a better-work chain by ~5 headers and has to download them every 5 minutes (EXTRA_BLOCK_RELAY_ONLY_PEER_INTERVAL). 80 * 5 * 12 * 24 * 30 = 3_456_000, so this node will consume about 3.5 MB per month from the network. If I have the math right, I think that's fairly acceptable.
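The back-of-envelope math above can be checked mechanically (80-byte headers, 5 headers per attempt, one attempt every 5 minutes, i.e. 12 per hour, over 30 days):

```cpp
#include <cassert>
#include <cstdint>

// Worst-case monthly header traffic for the scenario described above.
constexpr int64_t HEADER_BYTES = 80;
constexpr int64_t HEADERS_PER_ATTEMPT = 5;
constexpr int64_t ATTEMPTS_PER_HOUR = 12; // one every 5 minutes

constexpr int64_t MonthlyHeaderBytes()
{
    return HEADER_BYTES * HEADERS_PER_ATTEMPT * ATTEMPTS_PER_HOUR * 24 * 30;
}

static_assert(MonthlyHeaderBytes() == 3'456'000, "about 3.5 MB per month");
```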

That said, I would be glad if we started to develop and sustain frameworks to evaluate questions like network load, which rightfully come up in this kind of work. Beyond agreeing on security-model effectiveness, having a sound quantitative model would make reaching Concept ACK/NACK easier.

I thought I did document this in net.h, but I think it would make sense to put a writeup of how all this works on our wiki, which I'd be happy to do after this is merged.

I know I'm a minority opinion, but I still feel we should have a sound discussion before dissociating rationale further from code by writing more documentation in the wiki instead of in-tree.

Contra:

  • splitting code rationale from code paths increases the risk of documentation being airy, hard to link back, and easily outdated
  • rationale should be submitted to the same review process, whereas the wiki is free for project members to edit; IMO code rationale is as important as code correctness, because ultimately rationale drives how the code should be
  • it increases project reliance on centralized GH infrastructure instead of duplicated git copies

I'm a big fan of the code documentation approach that was taken for #19988, and I hope we stick more to this kind of code documentation standard in the future.

@maflcko maflcko merged commit 6a48063 into bitcoin:master Dec 11, 2020
Contributor

@ajtowns ajtowns left a comment


Post merge ACK b3a515c

Testing this, I'm seeing some cases where the extra block-relay peer is being evicted for sending a tx, eg:

2020-12-11T10:12:05Z Added connection peer=38
2020-12-11T10:12:06Z receive version message: /Satoshi:0.20.1/: version 70015, blocks=660884, us=xxx:8333, peer=38
2020-12-11T10:12:06Z New outbound peer connected: version: 70015, blocks=660884, peer=38 (block-relay)
2020-12-11T10:12:17Z received: inv (1261 bytes) peer=38
2020-12-11T10:12:17Z got inv: tx 565fed8bc9ff5fa333a8130ec399f23d24d4bcdd435778b1fd2a67278a980ee6  have peer=38
2020-12-11T10:12:17Z transaction (565fed8bc9ff5fa333a8130ec399f23d24d4bcdd435778b1fd2a67278a980ee6) inv sent in violation of protocol, disconnecting peer=38
2020-12-11T10:12:17Z disconnecting peer=38
2020-12-11T10:12:17Z Cleared nodestate for peer=38

which might be worth looking into further just in case those peers are actually running core like they say and we have a bug.

@@ -2557,7 +2557,7 @@ void PeerManager::ProcessMessage(CNode& pfrom, const std::string& msg_type, CDat
LogPrintf("New outbound peer connected: version: %d, blocks=%d, peer=%d%s (%s)\n",
pfrom.nVersion.load(), pfrom.nStartingHeight,
pfrom.GetId(), (fLogIPs ? strprintf(", peeraddr=%s", pfrom.addr.ToString()) : ""),
pfrom.m_tx_relay == nullptr ? "block-relay" : "full-relay");
pfrom.IsBlockOnlyConn() ? "block-relay" : "full-relay");
Contributor


Isn't this suggestion (changing m_tx_relay == nullptr to IsBlockOnlyConn() when logging a new outbound peer) already done? I think github is confusing me...

Anyway, changing that entire x ? "y" : "z" to just be pfrom.ConnectionTypeAsString() seems like it might be better.

@@ -1820,18 +1820,32 @@ void CConnman::SetTryNewOutboundPeer(bool flag)
// Also exclude peers that haven't finished initial connection handshake yet
// (so that we don't decide we're over our desired connection limit, and then
// evict some peer that has finished the handshake)
int CConnman::GetExtraOutboundCount()
int CConnman::GetExtraFullOutboundCount()
Contributor


If this does get refactored further, making these bool HasExtraBlockRelay() and bool HasExtraFullOutbound(), returning count > m_max_x instead of max(0, count - m_max_x), would be an idea too.
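A minimal sketch of that suggested shape (illustrative free functions, not the real CConnman members):

```cpp
#include <cassert>

// Instead of returning max(0, count - limit), answer the yes/no question
// callers actually ask: are we over our target for this connection type?
bool HasExtraBlockRelay(int block_relay_count, int max_block_relay)
{
    return block_relay_count > max_block_relay;
}

bool HasExtraFullOutbound(int full_outbound_count, int max_full_outbound)
{
    return full_outbound_count > max_full_outbound;
}
```

Callers like the eviction loop only test `> 0` on the current counts anyway, so a bool keeps the interface honest about what is consumed.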


// Check whether we have too many OUTBOUND_FULL_RELAY peers
if (m_connman.GetExtraFullOutboundCount() > 0) {
// If we have more OUTBOUND_FULL_RELAY peers than we target, disconnect one.
Contributor


Seems better to use the names from ConnectionTypeAsString in comments than the shouty enums (i.e. outbound-full-relay here).

sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Dec 11, 2020
@ajtowns
Contributor

ajtowns commented Dec 16, 2020

Testing this, I'm seeing some cases where the extra block-relay peer is being evicted for sending a tx, [...]

Looking into this a bit further: I'm only seeing this coming from nodes advertising themselves as 0.20.1, and the timing seems like it's consistent with a 5s poisson delay since the connection is established. Not all 0.20.1 nodes are failing this way. I've just added some additional logging, and it looks like the txs they're advertising aren't particularly strange, but all the violating nodes seem to be running on cloud hosting (digital ocean, amazon, google cloud). So seems plausible that they're just buggy and lying about their version details?

One thing that might be worth considering: our sybil mitigation only works for concurrent connections -- our 10 regular outbounds all have to be in different netgroups because they're simultaneously connected, but 10 sequential extra block-relay-only connections could all end up in the same netgroup. We could fix this by keeping track of the last 10 extra connections we've tried and choosing the next one from a different netgroup.
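A rough sketch of that netgroup-diversity idea (hypothetical helper class; Bitcoin Core computes netgroups from peer addresses, and a string stands in for the group here):

```cpp
#include <cassert>
#include <deque>
#include <string>

// Remember the netgroups of the last N extra block-relay-only connection
// attempts, so candidate selection can skip groups we've tried recently.
class RecentNetgroups {
public:
    explicit RecentNetgroups(size_t max) : m_max(max) {}

    bool IsRecentlyUsed(const std::string& group) const
    {
        for (const auto& g : m_recent) {
            if (g == group) return true;
        }
        return false;
    }

    void Add(const std::string& group)
    {
        m_recent.push_back(group);
        if (m_recent.size() > m_max) m_recent.pop_front(); // drop the oldest
    }

private:
    size_t m_max;
    std::deque<std::string> m_recent;
};
```

The connection loop would consult `IsRecentlyUsed()` before accepting a candidate from addrman, and `Add()` its netgroup once the attempt is made.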

@jnewbery
Contributor

Testing this, I'm seeing some cases where the extra block-relay peer is being evicted for sending a tx ...

I'm surprised that your log isn't showing the "send version message" line. You obviously have NET logging enabled since you're seeing the "receive version message" line. These are outbound connections, so I'd expect to see "send version message" (in InitializeNode()) between "Added connection peer" (in ConnectNode()) and "receive version message" (in ProcessMessage()). Very odd.

Even if we did have that logging in PushNodeVersion(), it doesn't include what the fRelay was set to. That'd be evidence that the peer is misbehaving and we don't have a bug.

Perhaps peer message capture would help here (#19509)

@ajtowns
Contributor

ajtowns commented Dec 16, 2020

Oh, with logips enabled, I see those nodes also don't appear to be telling me my IP address, instead reporting their own IP in both us= and peeraddr=. Might just be a proxy that's munging the version message both ways, and losing the relay bit as well as address info.

Edited to add:

I think about 12% of the extra block-relay-only connections my peer is opening get disconnected for this reason

Example:

2020-12-16T13:12:01Z Added connection to [their_ip]:8333 peer=157
2020-12-16T13:12:01Z sending version (103 bytes) peer=157
2020-12-16T13:12:01Z ABCD, sent version message peer=157 block-relay-only m_ignore_incoming_txs==0, pnode.m_tx_relay == nullptr, relaytxes=0

Note: we are signalling no tx relay in the version message we send

2020-12-16T13:12:01Z send version message: version 70016, blocks=661619, us=[::]:0, them=[their_ip]:8333, peer=157
2020-12-16T13:12:01Z received: version (102 bytes) peer=157
2020-12-16T13:12:01Z sending verack (0 bytes) peer=157
2020-12-16T13:12:01Z receive version message: /Satoshi:0.20.1/: version 70015, blocks=661619, us=[their_ip]:8333, peer=157, peeraddr=[their_ip]:8333

Note: they are claiming to be 0.20.1, and are telling us our ip is the same as their ip.

2020-12-16T13:12:01Z received: verack (0 bytes) peer=157
2020-12-16T13:12:01Z New outbound peer connected: version: 70015, blocks=661619, peer=157, peeraddr=[their_ip]:8333 (block-relay)
2020-12-16T13:12:01Z sending sendheaders (0 bytes) peer=157
2020-12-16T13:12:01Z sending sendcmpct (9 bytes) peer=157
2020-12-16T13:12:01Z sending sendcmpct (9 bytes) peer=157
2020-12-16T13:12:01Z sending ping (8 bytes) peer=157
2020-12-16T13:12:01Z initial getheaders (661618) to peer=157 (startheight:661619)
2020-12-16T13:12:01Z sending getheaders (1029 bytes) peer=157
2020-12-16T13:12:02Z received: sendheaders (0 bytes) peer=157
2020-12-16T13:12:02Z received: sendcmpct (9 bytes) peer=157
2020-12-16T13:12:02Z received: sendcmpct (9 bytes) peer=157
2020-12-16T13:12:02Z received: ping (8 bytes) peer=157
2020-12-16T13:12:02Z sending pong (8 bytes) peer=157
2020-12-16T13:12:02Z received: getheaders (1029 bytes) peer=157
2020-12-16T13:12:02Z getheaders 661619 to end from peer=157
2020-12-16T13:12:02Z sending headers (82 bytes) peer=157
2020-12-16T13:12:02Z received: feefilter (8 bytes) peer=157
2020-12-16T13:12:02Z received: feefilter of 0.00001000 BTC/kvB from peer=157
2020-12-16T13:12:02Z received: pong (8 bytes) peer=157
2020-12-16T13:12:02Z received: headers (82 bytes) peer=157

Everything looks fine so far, except that 8 seconds later we get sent some txids, despite signalling that we don't want them. The txids themselves don't seem particularly suspicious.

2020-12-16T13:12:10Z received: inv (1261 bytes) peer=157
2020-12-16T13:12:10Z got inv: tx 75bd60faafd08519ef2e792df0545c9392df41bf291d1134e7a5db9710ced23b  have peer=157
2020-12-16T13:12:10Z transaction (75bd60faafd08519ef2e792df0545c9392df41bf291d1134e7a5db9710ced23b) inv sent in violation of protocol, disconnecting peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 75bd60faafd08519ef2e792df0545c9392df41bf291d1134e7a5db9710ced23b have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 7b449d756f6dda691f992987e17bf3f7dc34e0ea0ce65d43622401bc8f60dd88 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx bcb93b03d029cb3ec2d778b7d52dd3ec6dd281c5a3a0a5db9fb2acda71638c44 new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 5b5a2b221caa73ed6b703add97c72f826ac70148b973cf375e7a53db0a1fb379 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 81921bce2a40d6a461996feec5a47b1a4121d0da247d837e6e4dea899230b659 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 7a34e32d7077960769e5c0cbacad4a12dda84418204a2652c491a570058fcb0a have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx af263f51587ff049cc64222d6660c63abb13458c92902d639882e2d3ab7287d6 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 3d4b1ecae99adc8c8077ae091d88ca269c71c76c18de7a871b3218d6ba08356d have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 390071b455d1ed37f6c16863e08fcc97a53331a1f0ecd5d8f397658c8524afac have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx d21e808b5d6dd2ed4e6e462b8beef3a4dcc3e791324bd7c8f7cc4c9f29dfba5d have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 9a7d584b6cd86fb404f9cd4f64278f00111db009ccbcf1c7838cc32990149e4e new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 79c47859b388197e2f54e84b8e4127c774cd411a1d7e212a9260f69616bed798 new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 43ee2d62b43292aa5f93456afeea622201f8617041d41b3a5270b8d069074787 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx f6ec29d439fe01ff4253086aa31ac82e03304fe1b8bb335b1afcf9be24140787 new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx d81d8167089003070e65d909c51365f5906b30b0b9d41a842b0e87bdaa4b3f95 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx f6e90b5f427eba175ef3b464556b2deb2a6b6239f8ea45f82c6c3f99702ab740 new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx e3f89991c70433914ca25c371a3ede00f3909bc72fc4c6debabe996f459fe095 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 4af670a50a53b3949528a375cd958818c436efafbf8cf2d9476ba251e0624d03 new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 99c9200154e040337c4d853897930c6fb6ba5e052f2a9b801aa9a85d08cd4563 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx a2706abb0f8470c0e84aa07801fef1a56b1d2184b316bd13f1a18716a27da43e new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 5d41a07d349067fadc2ca165c32c2a1f369756006b512590fee88db0d3b05538 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 0e25bc1d72e1d6585e8e0b5a6e30e228136d8f25e8af15600c5b23d16d99dce0 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 9c7d8fa67e81da350ae834bbefb9dc2955a18067512129fee672cc28c1118629 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 9d4bc391dc0f4ee61e39f0ed65ef3f611a461a2bd54b216aa0263265fa50cca4 new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 9d9a74778fbaa7605f3386142f3ac5169a333ca5a846863394ff3c9e9ae69778 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx bf464249cf02bc4caac19c04d6a0cb76ea1628531e6d52b0d8dc9e3d2f1a5477 new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 457df7291de77b3b5ca0871ad87621c819efab6fb1471842e88fb7959ab1e2ca new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 3f644cb1a138d6dd1bfd461470172b7dab8e161f5b6b056cc477eda781414aee have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 9e6dd4d37cc7a56fd188ffe99cc1dc523901e18f8b6ec02e8b99c5ed6f3f608a new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx e64861304f5f7d829c82d559c0c1c7c57c5cb5e51bdfd96f10fcf04abb858408 have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 13bc931cd0a09e546b9d1b16c66964b33cb872e5eec65d21c2344dec1b317baa have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 038ee30078d52b1af945b16578ff4c302040248be273809aa285f7667a8f7edc have peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 36a363550c4c8a5fbb38b69cfde0a938faa06fa300836d769cb085da963174a3 new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx b09beb484f21adcec10665aa526132331f117e748997a41d9330370c8ce580f5 new peer=157
2020-12-16T13:12:10Z   ABCD, inv tx 12b3adb1fc1ccfbeec6d0700b1fce0b8e6a2dbe85e227745dca4f388b8b2d926 have peer=157
2020-12-16T13:12:10Z disconnecting peer=157
2020-12-16T13:12:10Z Cleared nodestate for peer=157
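
The disconnect in the log above happens because a tx inv is a protocol violation on a connection where we signalled `relaytxes=0`. A simplified, self-contained sketch of that check (illustrative names; the real logic lives in `net_processing.cpp`):

```cpp
#include <cassert>
#include <string>

// Simplified model of a peer connection (hypothetical field names).
struct Peer {
    bool m_block_relay_only;   // we signalled relaytxes=0 in our version message
    bool m_disconnect{false};
};

// Handle one incoming "inv" item of the given type. Returns true if the
// message is acceptable, false if it violates the negotiated protocol.
bool HandleInv(Peer& peer, const std::string& inv_type)
{
    if (inv_type == "tx" && peer.m_block_relay_only) {
        // A tx inv on a block-relay-only link violates our relaytxes=0
        // signal, so mark the peer for disconnection.
        peer.m_disconnect = true;
        return false;
    }
    return true;
}
```

A block inv on the same connection would pass the check; only the tx inv triggers the disconnect seen at 13:12:10Z.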

@sdaftuar
Copy link
Member Author

One thing that might be worth considering: our sybil mitigation only works for concurrent connections -- our 10 regular outbounds all have to be in different netgroups because they're simultaneously connected, but 10 sequential extra block-relay-only connections could all end up in the same netgroup. We could fix this by keeping track of the last 10 extra connections we've tried, and trying to choose the next one from a different netgroup.

@ajtowns I think this is an interesting idea -- it seems like it would be a strict improvement in security (in a mathematical sense, i.e. I can't imagine how our security could be any worse off with that approach), but I'm not sure the additional complexity is worth the potential gain. Not intrinsically opposed, but maybe this isn't low-hanging fruit either.
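
The netgroup-diversity idea above could be sketched as a small rolling record of recently used netgroups (purely illustrative -- nothing like this class exists in Bitcoin Core):

```cpp
#include <cassert>
#include <deque>
#include <string>

// Remember the netgroups of the last N extra block-relay-only connection
// attempts, so the next candidate can be chosen from a netgroup we have
// not recently used.
class RecentNetgroupTracker {
public:
    explicit RecentNetgroupTracker(size_t max_entries) : m_max{max_entries} {}

    bool RecentlyUsed(const std::string& netgroup) const {
        for (const auto& g : m_recent) {
            if (g == netgroup) return true;
        }
        return false;
    }

    void Record(const std::string& netgroup) {
        m_recent.push_back(netgroup);
        if (m_recent.size() > m_max) m_recent.pop_front();  // drop oldest
    }

private:
    size_t m_max;
    std::deque<std::string> m_recent;
};
```

The caller would skip addrman candidates for which `RecentlyUsed()` returns true, falling back to any candidate if none qualify.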

laanwj added a commit that referenced this pull request Dec 17, 2020
0c41c10 doc: Remove shouty enums in net_processing comments (Suhas Daftuar)

Pull request description:

  This uses the `CNode::ConnectionTypeAsString()` strings in place of the all-caps enums in a couple of comments in `net_processing`, as suggested by ajtowns in #19858 (comment).

ACKs for top commit:
  practicalswift:
    ACK 0c41c10
  jnewbery:
    ACK 0c41c10
  laanwj:
    ACK 0c41c10

Tree-SHA512: c8ab905e151ebb144c3f878277dc59d77591e4b39632658407b69b80b80d65825d5a391b01e2aea6af2fdf174c143dfe7d2f3eba84a020a58d7926458fdcd0a5
maflcko pushed a commit that referenced this pull request Dec 18, 2020
6d1e85f Clean up logging of outbound connection type (Suhas Daftuar)

Pull request description:

  We have a function that converts `ConnectionType` enums to strings, so use it.

  Suggested by ajtowns in #19858 (comment)

ACKs for top commit:
  amitiuttarwar:
    ACK 6d1e85f
  naumenkogs:
    ACK 6d1e85f

Tree-SHA512: f5084d8b5257380696d9fde86a8873e190cd4553feb07fa49df39bbd9510bf5832d190a3bca1571c48370d16a17c7a34900857b21b27bec0777bfa710211d7bb
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Dec 18, 2020
6d1e85f Clean up logging of outbound connection type (Suhas Daftuar)

Pull request description:

  We have a function that converts `ConnectionType` enums to strings, so use it.

  Suggested by ajtowns in bitcoin#19858 (comment)

ACKs for top commit:
  amitiuttarwar:
    ACK 6d1e85f
  naumenkogs:
    ACK 6d1e85f

Tree-SHA512: f5084d8b5257380696d9fde86a8873e190cd4553feb07fa49df39bbd9510bf5832d190a3bca1571c48370d16a17c7a34900857b21b27bec0777bfa710211d7bb
@dhruv
Copy link
Contributor

dhruv commented Jan 1, 2021

IIUC, since this merge, an extra outbound-full-relay connection is made upon a stale tip (no new block for 3 block intervals), and an extra outbound-block-relay connection is now made every 5 minutes. Is there still value in the former?

@sdaftuar
Copy link
Member Author

sdaftuar commented Jan 1, 2021

IIUC, since this merge, an extra outbound-full-relay connection is made upon a stale tip (no new block for 3 block intervals), and an extra outbound-block-relay connection is now made every 5 minutes. Is there still value in the former?

Yes, I think there is. The stale-tip logic more aggressively seeks out a new peer to connect to (staying in that state until a new connection is actually made), while this logic fires once on a selection from addrman and gives up even if no connection is made. Moreover, the stale-tip logic is an eviction algorithm for full-relay peers, while this logic is only for block-relay peers. I think having rotation logic for both makes sense, though there is probably room to improve the interaction between these two behaviors in all the various cases we can think of.
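
For reference, the "fires once every 5 minutes on average" scheduling from the PR description is a Poisson process: successive attempts are separated by exponentially distributed intervals. A sketch under that assumption (names are illustrative, not Bitcoin Core's):

```cpp
#include <random>

// Average spacing between extra block-relay-only connection attempts,
// in seconds (5 minutes, per the PR description).
constexpr double EXTRA_BLOCK_RELAY_INTERVAL{300.0};

// Draw the delay until the next attempt. Exponential inter-arrival
// times give a Poisson process with the desired average rate, which
// makes the attempt times hard for an attacker to predict.
double NextAttemptDelaySeconds(std::mt19937& rng)
{
    std::exponential_distribution<double> dist{1.0 / EXTRA_BLOCK_RELAY_INTERVAL};
    return dist(rng);
}
```

If a given attempt fails to connect, nothing is retried immediately; the node simply waits for the next draw, matching the non-aggressive behavior described above.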

@dhruv
Copy link
Contributor

dhruv commented Jan 2, 2021

@sdaftuar Thanks for explaining. Could this be a way to improve the interaction between the two behaviors?

  1. Upon stale tip, seek extra outbound-block-relay more aggressively rather than a 5 minute interval (stay in that state until the tip is no longer stale).
  2. Novel block discovery by the extra outbound-block-relay triggers outbound-full-relay eviction.
  3. Eliminate extra outbound-full-relay connections.

Perhaps the simpler starting point is just to implement (2)? Are there any downsides/known attacks if we do that?
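
One way to read proposal (2) is as an eviction rule: when the extra peer delivers a novel block, drop the existing outbound peer that has gone longest without announcing one. A hypothetical sketch of that selection (these names and the policy itself are dhruv's suggestion, not merged behavior):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative model of an outbound full-relay peer.
struct OutboundPeer {
    int id;
    int64_t last_new_block_time;  // unix time we last got a novel block from this peer
};

// Returns the id of the peer to evict -- the one whose last novel block
// is oldest -- or -1 if there are no peers.
int SelectPeerToEvict(const std::vector<OutboundPeer>& peers)
{
    if (peers.empty()) return -1;
    auto worst = std::min_element(peers.begin(), peers.end(),
        [](const OutboundPeer& a, const OutboundPeer& b) {
            return a.last_new_block_time < b.last_new_block_time;
        });
    return worst->id;
}
```

This mirrors the tie-breaking the PR already applies among block-relay-only peers; whether applying it to full-relay peers opens new attacks is exactly the open question above.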

maflcko pushed a commit that referenced this pull request Dec 6, 2021
4740fe8 test: Add test for block relay only eviction (Martin Zumsande)

Pull request description:

  Adds a unit test for block-relay-only eviction logic added in #19858, which was not covered by any tests before. The added test is very similar to the existing `stale_tip_peer_management` unit test, which tests the analogous logic for regular outbound peers.

ACKs for top commit:
  glozow:
    reACK 4740fe8
  rajarshimaitra:
    tACK 4740fe8
  shaavan:
    ACK 4740fe8. Great work @ mzumsande!
  LarryRuane:
    ACK 4740fe8

Tree-SHA512: 5985afd7d8f7ae311903dbbf6b7d526e16309c83c88ae6dd6551960c0b186156310a6be0cf6b684f82ac1378d0fc5aa3717f0139e078471013fceb6aebe81bf6
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Dec 7, 2021
4740fe8 test: Add test for block relay only eviction (Martin Zumsande)

Pull request description:

  Adds a unit test for block-relay-only eviction logic added in bitcoin#19858, which was not covered by any tests before. The added test is very similar to the existing `stale_tip_peer_management` unit test, which tests the analogous logic for regular outbound peers.

ACKs for top commit:
  glozow:
    reACK 4740fe8
  rajarshimaitra:
    tACK bitcoin@4740fe8
  shaavan:
    ACK 4740fe8. Great work @ mzumsande!
  LarryRuane:
    ACK 4740fe8

Tree-SHA512: 5985afd7d8f7ae311903dbbf6b7d526e16309c83c88ae6dd6551960c0b186156310a6be0cf6b684f82ac1378d0fc5aa3717f0139e078471013fceb6aebe81bf6
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Jan 26, 2022
Summary:
```
To make eclipse attacks more difficult, regularly initiate outbound connections
and stay connected long enough to sync headers and potentially learn of new
blocks. If we learn a new block, rotate out an existing block-relay peer in
favor of the new peer.

This augments the existing outbound peer rotation that exists -- currently we
make new full-relay connections when our tip is stale, which we disconnect
after waiting a small time to see if we learn a new block. As block-relay
connections use minimal bandwidth, we can make these connections regularly and
not just when our tip is stale.

Like feeler connections, these connections are not aggressive; whenever our
timer fires (once every 5 minutes on average), we'll try to initiate a new
block-relay connection as described, but if we fail to connect we just wait for
our timer to fire again before repeating with a new peer.
```

Backport of [[bitcoin/bitcoin#19858 | core#19858]].

Ref T1696.

Test Plan:
  ninja all check-all
Run IBD.

Reviewers: #bitcoin_abc, PiRK

Reviewed By: #bitcoin_abc, PiRK

Maniphest Tasks: T1696

Differential Revision: https://reviews.bitcoinabc.org/D10907
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022