New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net_processing: make any misbehavior trigger immediate discouragement #29575
base: master
Are you sure you want to change the base?
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
concept ACK
going to run on a listening node with logging to test
Concept ACK |
1 similar comment
Concept ACK |
I have been monitoring peer misbehavior with the misbehaving tracepoint from #25832. In the past weeks, I've seen the following misbehavior across 10 nodes:
I've looked at a few of these I'll keep an eye on further misbehavior. |
Perhaps we can tolerate this if the block is less than 10 hours in the future and disconnect if it's more? It seems cheap enough to process slightly-too-far-in-the-future headers, and they are enormously expensive to produce. (for other reviewers, look for The peer might also be telling us about a reorg longer than
So we should not disconnect the first time around. The Sjors@fad8c5b illustrates how we could limit this to just once ( |
Sjors@00a7dcc is one approach of dealing with the time difference issue. If any of the headers are Some of the comments in that commit may be spurious; I can't think of a useful attack where a peer (e.g. Bob) sends us headers, three hours in the future, with fake PoW. All Bob would achieve is that we ask I also don't know if if it's even worth checking if a header is in the far future, but at least it makes for more useful log entries? (it's nice to know the difference between a peer with a broken clock and a peer that falsely announces a large reorg) |
I discussed this with @sdaftuar, @mzumsande, @sr-gi in person, and decided to instead just drop the Misbehavior score for unconnecting headers. The rationale is that this misbehavior was originally intended to prevent bandwidth wastage due to very broken (but likely non-malicious) nodes actually observed on the network that respond to GETHEADERS with a response unrelated to the request, triggering a request cycle. This has however largely been addressed by #25454, which limits the number of (not responded to) GETHEADERS requests in flight to one, with a 2 minute timeout. With that, the specific misbehavior is largely irrelevant (for inbound peers, it is now mostly harmless; for outbound peers, the eviction logic will eventually kick them out if they're not keeping up with headers at all). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at each of the limits in aee3b67 to confirm they're plenty-old enough to just disconnect anyone that breaks them.
Will look at the newly pushed commits later.
Reviewed dee6ca2 I'll run the log commit for a while. I'm still a bit on the fence about a903bba.
Some context here for other reviewers, when you look at
IIUC outbound peer eviction is handled in Also... It might still be useful, even if not needed for security, to know if a peer sent us disconnecting headers. Although the Tracking if the peer did this before, e.g. with something like |
@Sjors It's worth taking into account that despite its history as originally being a "DoS score", the discouragement logic isn't particularly good at preventing intentional abuse of resources; scoring systems are inherently easy to bypass: just do enough to keep the score acceptable, and be as wasteful otherwise (or even better, if many IP addresses are available, use several). The only real solution, in my opinion, for actual DoS protection is keeping track of how many resources (bandwidth, CPU, I/O, ...) we spend on behalf of every peer, and whenever those resources become a bottleneck for us, throttle the worst peers. So the purpose for the discouragement logic as I see it is avoiding wasting resources on (unintentionally) broken peers. These are somewhat distinct from useless peers (which is what the outbound eviction logic is aimed to prevent). There is an infinite amount of potential broken behavior one can imagine, and clearly we cannot have code for all of it. Another reason for the misbehavior detection is to help protocol implementers get a good idea for expectations around parts of the protocol that aren't formally specified (because it triggers disconnections, and possibly logging). And arguably, unconnecting headers isn't a protocol violation, at least in the context of BIP130 announcements, so disconnecting for something that's very rare but not impossible to trigger in what we'd consider a perfectly compliant protocol implementation would be confusing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK, a few doc nits and one question below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK. I left some comments inline, plus some general ones that may be addressable with the change of functionality:
-
Looks like
banmah.h
also needs updating, given there is no such thing as how much a peer misbehaves anymore:(line 37) // 2. Discouragement. If a peer misbehaves enough (see Misbehaving() in // net_processing.cpp)
-
p2p_filter.py
andp2p_invalid_messages.py
are asserting logs looking forMisbehaving
. Shouldn't it be better now to just check that the peer was disconnected in those cases? e.g:self.log.info('Check that too large filter is rejected') with self.nodes[0].assert_debug_log(['Misbehaving']): filter_peer.send_and_ping(msg_filterload(data=b'\xbb'*(MAX_BLOOM_FILTER_SIZE+1)))
@@ -1828,8 +1828,7 @@ bool PeerManagerImpl::MaybePunishNodeForBlock(NodeId nodeid, const BlockValidati | |||
return true; | |||
// Conflicting (but not necessarily invalid) data or different policy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment accurate? I don't see how it applies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, this seems to be a leftover comment that hasn't applied in years. This code used to handle both block and transaction validation failures, but transactions were moved elsewhere. It seems unrelated to this PR, so I won't touch it here.
@@ -1848,7 +1835,7 @@ bool PeerManagerImpl::MaybePunishNodeForTx(NodeId nodeid, const TxValidationStat | |||
break; | |||
// The node is providing invalid data: | |||
case TxValidationResult::TX_CONSENSUS: | |||
if (peer) Misbehaving(*peer, 100, ""); | |||
if (peer) Misbehaving(*peer, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why we don't even signal this as a consensus discrepancy? Is it logged somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the network processing layer, which acts in response to validation failures (to the extent that it needs to punish peers for it). The actual validation failure should be logged through the validation code itself, see ConnectTip()
there, which logs failures, and invokes the BlockChecked
signal that is handled by net_processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The message passed to Misbehaving
ranges in how explicit it is though (from the actual reason to the category, without really specifying what, e.g. "invalid compact block"). That's why I was expecting this to at least signal the reason is due to a consensus discrepancy
Since bitcoin#25454 we keep track of the last GETHEADERS request that was sent and wasn't responded to. So far, every incoming HEADERS message is treated as a response to whatever GETHEADERS was last sent, regardless of its contents. This commit makes this tracking more accurate, by only treating HEADERS messages which (1) are empty, (2) connect to our existing block header tree, or (3) are a continuation of a low-work headers sync as responses that clear the "outstanding GETHEADERS" state (m_last_getheaders_timestamp). That means that HEADERS messages which do not satisfy any of the above criteria will be ignored, not triggering a GETHEADERS, and potentially (for now, but see later commit) increase misbehavior score.
This misbehavior was originally intended to prevent bandwidth wastage due to actually observed very broken (but likely non-malicious) nodes that respond to GETHEADERS with a response unrelated to the request, triggering a request cycle. This has however largely been addressed by the previous commit, which causes non-connecting HEADERS that are received while a GETHEADERS has not been responded to, to be ignored, as long as they do not time out (2 minutes). With that, the specific misbehavior is largely irrelevant (for inbound peers, it is now harmless; for outbound peers, the eviction logic will eventually kick them out if they're not keeping up with headers at all).
With the Misbehavior score gone for non-connecting headers (see previous commit), there is no need to only treat headers messages with up to 8 headers as potential BIP130 announcements. BIP130 does not specify such a limit; it was purely a heuristic.
This removes the need to actually track misbehavior score (see further commit), because any Misbehaving node will immediately hit the discouragement threshold.
This is now all unused.
dee6ca2
to
e7ccaa0
Compare
Addressed most of the feedback I've gotten. As pointed out by @mzumsande in #29575 (comment), the protection I imagined existed against buggy clients causing a @sr-gi I have not addressed the |
I ran a slightly outdated version of this PR with the log commit e451338 for the last two weeks on a decently connected node. Only a single
This invalid header was sent 9 seconds after receiving a new valid block. I'll continue to run with the latest version... ACK e7ccaa0 |
Concept and code review ACK. I haven't done any testing (nor did I carefully review the tests), just read through the code and thought about the overall logical flow, which makes sense to me. |
It's been two weeks and I don't see a single |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review ACK e7ccaa0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of doc nits and a question below
1. Announce a header that doesn't connect. | ||
Expect: getheaders message | ||
2. Send headers chain. | ||
Expect: getdata for the missing blocks, tip update. | ||
b. Then send 9 more headers that don't connect. | ||
b. Then send 100 more headers that don't connect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in: 78515e1
nit: isn't this 99? (for i in range(1, NUM_HEADERS)
)
# Send an empty header as failed response to the receive getheaders, to | ||
# make marked responded to. New headers are not treated as announcements | ||
# otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in: ce238dd
nit: I found this wording a bit hard to follow, especially the "to make marked responded to" bit.
IIUC, this is sending an empty block as a response to the outstanding getheaders request from the previous loop iteration, otherwise sending blocks[i]
would be treated as the reply instead. Is that it? If so, I'd suggest something along the lines of:
# Send an empty header as failed response to the receive getheaders, to | |
# make marked responded to. New headers are not treated as announcements | |
# otherwise. | |
# Send an empty header as a failed response to the received getheaders | |
# (from the previous iteration). Otherwise, the new headers will be | |
# treated as a response instead of as an announcement. |
@@ -2628,6 +2628,8 @@ bool PeerManagerImpl::IsContinuationOfLowWorkHeadersSync(Peer& peer, CNode& pfro | |||
{ | |||
if (peer.m_headers_sync) { | |||
auto result = peer.m_headers_sync->ProcessNextHeaders(headers, headers.size() == MAX_HEADERS_RESULTS); | |||
// If it is a valid continuation, we should treat the existing getheaders request as responded to. | |||
if (result.success) peer.m_last_getheaders_timestamp = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in: ce238dd
I think the comment 6 lines below is inconsistent now:
// It should be impossible for the getheaders request to fail,
// because we should have cleared the last getheaders timestamp
// when processing the headers that triggered this call.
[...]
IIUC, result.request_more
cannot be true if result.success
is false (based on my understanding of HeadersSyncState::ProcessNextHeaders), therefore we will clear m_last_getheaders_timestamp
right before entering the request_more
scope.
I think this should also mean that there shouldn't be a way of bypassing this check (as the second part of the comment used to suggest), so the logging can be simplified
🐙 This pull request conflicts with the target branch and needs rebase. |
So far, discouragement of peers triggers when their misbehavior score exceeds 100 points. Most types of misbehavior increment the score by 100, triggering immediate discouragement, but some types do not. This PR makes all increments equal to either 100 (meaning any misbehavior will immediately cause disconnection and discouragement) or 0 (making the behavior effectively unconditionally allowed), and then removes the logic for score accumulation.
This simplifies the code a bit, but also makes protocol expectations clearer: if a peer misbehaves, they get disconnected. There is no good reason why certain types of protocol violations should be permitted 4 times (howmuch=20) or 9 times (howmuch=10), while many others are never allowed. Furthermore, the distinction between these looks arbitrary.
To assess whether this does not unduly affect existing other software on the network, I've been running this with an additional patch (see https://github.com/sipa/bitcoin/commits/202403_nomisbehave_log) that adds logging whenever behavior is detected whose misbehavior increment is changed to 100 by this PR. Over the past day I have not seen any instances on my well-connected node (~150 incoming connections), but will continue to monitor.
The specific types of misbehavior that are changed to 100 are:
block
which does not connect to our header tree (which necessarily must have been unsollicited). [used to be score 10]headers
with a non-continuous headers sequence. [used to be score 20]addr
oraddrv2
message [used to be score 20]inv
message [used to be score 20]headers
message [used to be score 20]The specific types of misbehavior that are changed to 0 are:
headers
message (which thus does not get treated as a BIP130 announcement) that does not connect to our block tree. [used to be score 10]I believe that none of these behaviors are unavoidable, except for the one marked (*) which can in theory happen still due to interaction between BIP130 and variations in system clocks (the max 2 hour in the future rule). This one has been removed entirely. In order to remove the impact of the bug it was designed to deal with, without relying on misbehavior, a separate improvement is included that makes
getheaders
-tracking more accurate.In another unrelated improvement, this also gets rid of the 8 header limit heuristic to determine whether an incoming non-connecting
headers
is a potential BIP130 announcement, as this rule is no longer needed to prevent spurious Misbehavior. Instead, any non-connectingheaders
is now treated as a potential announcement.