
net_processing: make any misbehavior trigger immediate discouragement #29575

Open · wants to merge 5 commits into master

Conversation

sipa (Member) commented Mar 6, 2024

So far, discouragement of peers triggers when their misbehavior score reaches 100 points. Most types of misbehavior increment the score by 100, triggering immediate discouragement, but some types do not. This PR makes all increments equal to either 100 (meaning any misbehavior will immediately cause disconnection and discouragement) or 0 (making the behavior effectively unconditionally allowed), and then removes the logic for score accumulation.

This simplifies the code a bit, but also makes protocol expectations clearer: if a peer misbehaves, they get disconnected. There is no good reason why certain types of protocol violations should be permitted 4 times (howmuch=20) or 9 times (howmuch=10), while many others are never allowed. Furthermore, the distinction between these looks arbitrary.
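The before/after behavior described above can be sketched in a few lines. This is a hypothetical Python model for illustration only; the real logic lives in C++ in net_processing.cpp, and the class and method names below are invented:

```python
# Hypothetical model (not Bitcoin Core code) contrasting the old
# score-accumulation behavior with the new all-or-nothing behavior.

DISCOURAGEMENT_THRESHOLD = 100

class PeerOld:
    """Old model: increments accumulate until the threshold is reached."""
    def __init__(self):
        self.score = 0
        self.should_discourage = False

    def misbehaving(self, howmuch):
        self.score += howmuch
        if self.score >= DISCOURAGEMENT_THRESHOLD:
            self.should_discourage = True

class PeerNew:
    """New model: any call to misbehaving() triggers discouragement."""
    def __init__(self):
        self.should_discourage = False

    def misbehaving(self):
        self.should_discourage = True

# Old model: a howmuch=20 offence is tolerated four times...
old = PeerOld()
for _ in range(4):
    old.misbehaving(20)
assert not old.should_discourage
old.misbehaving(20)  # ...and punished on the fifth.
assert old.should_discourage

# New model: the first offence is enough.
new = PeerNew()
new.misbehaving()
assert new.should_discourage
```

This makes concrete why "permitted 4 times (howmuch=20)" holds under the old model, and why the score field disappears entirely under the new one.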

To assess whether this unduly affects other existing software on the network, I've been running this with an additional patch (see https://github.com/sipa/bitcoin/commits/202403_nomisbehave_log) that adds logging whenever behavior is detected whose misbehavior increment is changed to 100 by this PR. Over the past day I have not seen any instances on my well-connected node (~150 incoming connections), but I will continue to monitor.

The specific types of misbehavior that are changed to 100 are:

  • Sending us a block which does not connect to our header tree (which necessarily must have been unsolicited). [used to be score 10]
  • Sending us a headers message with a non-continuous headers sequence. [used to be score 20]
  • Sending us more than 1000 addresses in a single addr or addrv2 message. [used to be score 20]
  • Sending us more than 50000 invs in a single inv message. [used to be score 20]
  • Sending us more than 2000 headers in a single headers message. [used to be score 20]

The specific types of misbehavior that are changed to 0 are:

  • Sending us 10 (*) separate BIP130 headers announcements that do not connect to our block tree. [used to be score 20]
  • Sending us more than 8 headers in a single headers message (which thus does not get treated as a BIP130 announcement) that do not connect to our block tree. [used to be score 10]

I believe all of these behaviors are avoidable, except the one marked (*), which can in theory still happen due to interaction between BIP130 and variations in system clocks (the max-2-hours-in-the-future rule). That one has been removed entirely. To remove the impact of the bug it was designed to deal with without relying on misbehavior, a separate improvement is included that makes getheaders tracking more accurate.

As another, unrelated improvement, this also gets rid of the 8-header limit heuristic used to determine whether an incoming non-connecting headers message is a potential BIP130 announcement, as this rule is no longer needed to prevent spurious misbehavior. Instead, any non-connecting headers message is now treated as a potential announcement.

DrahtBot (Contributor) commented Mar 6, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK Sjors, mzumsande
Concept ACK instagibbs, 1440000bytes, brunoerg, sr-gi, sdaftuar, willcl-ark

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #29736 (test: Extends wait_for_getheaders so a specific block hash can be checked by sr-gi)
  • #25832 (tracing: network connection tracepoints by 0xB10C)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@willcl-ark willcl-ark added the P2P label Mar 6, 2024
instagibbs (Member) left a comment

concept ACK

going to run on a listening node with logging to test

1440000bytes commented

Concept ACK

brunoerg (Contributor) commented Mar 6, 2024

Concept ACK

0xB10C (Contributor) commented Mar 7, 2024

I have been monitoring peer misbehavior with the misbehaving tracepoint from #25832. In the past weeks, I've seen the following misbehavior across 10 nodes:

  • header with invalid proof of work [100]: very frequent
  • mutated block [100]: these are frequent. I was running p2p: Don't process mutated blocks #29412 where we disconnect for each mutated block (compared to only disconnecting for mutated blocks we hadn't seen before)
  • invalid header received [10]: these are infrequent ("Sending us more than 8 headers in a single headers message")
  • headers message size = X [20]: only seen one occurrence ("Sending us more than 2000 headers in a single headers message that does not connect to our block tree")

I've looked at a few of these invalid header received misbehaviors and they all came from the same IP address with a (fake?) v26.0 Bitcoin Core user agent and always claiming their start height to be 153681 (in the version msg). Seems ok to disconnect these?

I'll keep an eye on further misbehavior.

Sjors (Member) commented Mar 7, 2024

due to interaction between BIP130 and variations in system clocks (the max 2 hour in the future rule)

Perhaps we can tolerate this if the block is less than 10 hours in the future and disconnect if it's more? It seems cheap enough to process slightly-too-far-in-the-future headers, and they are enormously expensive to produce.

(for other reviewers, look for MAX_NUM_UNCONNECTING_HEADERS_MSGS to see where this is handled)

The peer might also be telling us about a reorg longer than MAX_BLOCKS_TO_ANNOUNCE = 8. In UpdatedBlockTip we:

limit announcements in case of a huge reorganization.
Rely on the peer's synchronization mechanism in that case.

So we should not disconnect the first time around.

The HandleFewUnconnectingHeaders documentation should probably mention this.

Sjors@fad8c5b illustrates how we could limit this to just once (MAX_NUM_UNCONNECTING_HEADERS_MSGS = 2), since > 8 block reorgs should be rare. If we were to immediately disconnect (MAX_NUM_UNCONNECTING_HEADERS_MSGS = 1), then step 5 of p2p_sendheaders.py fails. This doesn't account for the time difference issue, and in any case it can be done in a followup. And of course we wouldn't need an integer; we could instead call it m_chicken_little.

Sjors (Member) commented Mar 7, 2024

Sjors@00a7dcc is one approach of dealing with the time difference issue. If any of the headers are > MAX_FUTURE_BLOCK_TIME * 10 in the future we disconnect with time-too-new. We could perhaps increase the number to allow for even more incorrect clocks, but I'm not sure how useful that is.

Some of the comments in that commit may be spurious; I can't think of a useful attack where a peer (e.g. Bob) sends us headers three hours in the future with fake PoW. All Bob would achieve is that we ask for the missing headers, which he either sends, and we disconnect him, or he doesn't send and tries again, at which point we disconnect him anyway.

I also don't know if it's even worth checking whether a header is in the far future, but at least it makes for more useful log entries? (It's nice to know the difference between a peer with a broken clock and a peer that falsely announces a large reorg.)
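The check Sjors proposes could look something like this. This is a hedged sketch: MAX_FUTURE_BLOCK_TIME and the *10 multiplier follow his description above, while the helper function itself is invented for illustration and is not Bitcoin Core code:

```python
# Illustrative far-future header check (hypothetical helper, not Core code).
# A header more than MAX_FUTURE_BLOCK_TIME * tolerance_factor ahead of our
# clock cannot plausibly be explained by ordinary clock skew.

MAX_FUTURE_BLOCK_TIME = 2 * 60 * 60  # 2 hours, in seconds

def header_time_too_new(header_time, now, tolerance_factor=10):
    """Return True if a header timestamp is far enough in the future that we
    would disconnect the sender with time-too-new."""
    return header_time > now + MAX_FUTURE_BLOCK_TIME * tolerance_factor

now = 1_700_000_000
assert not header_time_too_new(now + 3 * 60 * 60, now)   # 3h ahead: tolerated
assert header_time_too_new(now + 21 * 60 * 60, now)      # 21h ahead: disconnect
```

With the default factor of 10, anything up to 20 hours ahead is tolerated, which covers even badly wrong clocks while still rejecting absurd timestamps.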

sipa (Member, Author) commented Mar 7, 2024

I discussed this with @sdaftuar, @mzumsande, @sr-gi in person, and decided to instead just drop the Misbehavior score for unconnecting headers. The rationale is that this misbehavior was originally intended to prevent bandwidth wastage due to very broken (but likely non-malicious) nodes actually observed on the network that respond to GETHEADERS with a response unrelated to the request, triggering a request cycle. This has however largely been addressed by #25454, which limits the number of (not responded to) GETHEADERS requests in flight to one, with a 2 minute timeout. With that, the specific misbehavior is largely irrelevant (for inbound peers, it is now mostly harmless; for outbound peers, the eviction logic will eventually kick them out if they're not keeping up with headers at all).
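The #25454 mechanism sipa refers to can be modeled roughly as follows. This is an illustrative Python model, not the actual C++; the class and method names are invented, loosely mirroring m_last_getheaders_timestamp:

```python
# Toy model of "at most one GETHEADERS in flight, with a 2-minute timeout".
# A new request is only sent if nothing is outstanding or the previous
# request has timed out; a valid HEADERS response clears the state.

HEADERS_RESPONSE_TIME = 120  # 2-minute timeout, in seconds

class GetheadersTracker:
    def __init__(self):
        self.last_getheaders_timestamp = None  # None: nothing outstanding

    def maybe_send_getheaders(self, now):
        outstanding = (self.last_getheaders_timestamp is not None
                       and now - self.last_getheaders_timestamp < HEADERS_RESPONSE_TIME)
        if outstanding:
            return False  # suppress: previous request not yet answered
        self.last_getheaders_timestamp = now
        return True

    def on_headers_response(self):
        self.last_getheaders_timestamp = None  # clear the outstanding state

t = GetheadersTracker()
assert t.maybe_send_getheaders(0)       # first request goes out
assert not t.maybe_send_getheaders(60)  # still in flight: suppressed
assert t.maybe_send_getheaders(121)     # timed out: allowed again
t.on_headers_response()
assert t.maybe_send_getheaders(122)     # cleared: allowed immediately
```

This is why a buggy peer replying with unrelated headers can no longer trigger an unbounded request cycle: at most one unanswered GETHEADERS exists per peer per timeout window.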

Sjors (Member) left a comment

I looked at each of the limits in aee3b67 to confirm they're plenty old enough to just disconnect anyone that breaks them.

Will look at the newly pushed commits later.

Sjors (Member) commented Mar 7, 2024

Reviewed dee6ca2

I'll run the log commit for a while.

I'm still a bit on the fence about a903bba.

With that, the specific misbehavior is largely irrelevant (for inbound peers, it is now mostly harmless; ...

Some context here for other reviewers: when you look at SelectNodeToEvict, one of the criteria that protects against eviction is whether a peer sent us novel blocks. This means that peers that keep sending us useless headers will probably not enjoy this protection and will eventually get replaced. Unless they also give us useful data, in which case they're ... useful, and we can just humor their headers shenanigans.

... for outbound peers, the eviction logic will eventually kick them out if they're not keeping up with headers at all).

IIUC outbound peer eviction is handled in PeerManagerImpl::ConsiderEviction, which loses patience after CHAIN_SYNC_TIMEOUT + HEADERS_RESPONSE_TIME. That seems a bit late, though, if all our outbound peers are slow? Even if that's rare, it seems better to have as many "fit" outbound peers as possible.

Also, it might still be useful, even if not needed for security, to know whether a peer sent us non-connecting headers. Although the NET log still provides the log message received header %s: missing prev block %s, sending getheaders (%d) to end (peer=%d), we don't know the outcome.

Tracking if the peer did this before, e.g. with something like m_chicken_little, would make this easier. But not sure if it's worth the complexity. It seems all this would do is reveal that someone tried an attack that we believe to be useless.

sipa (Member, Author) commented Mar 7, 2024

@Sjors It's worth taking into account that despite its history as originally being a "DoS score", the discouragement logic isn't particularly good at preventing intentional abuse of resources; scoring systems are inherently easy to bypass: just do enough to keep the score acceptable, and be as wasteful otherwise (or even better, if many IP addresses are available, use several). The only real solution, in my opinion, for actual DoS protection is keeping track of how many resources (bandwidth, CPU, I/O, ...) we spend on behalf of every peer, and whenever those resources become a bottleneck for us, throttle the worst peers.

So the purpose for the discouragement logic as I see it is avoiding wasting resources on (unintentionally) broken peers. These are somewhat distinct from useless peers (which is what the outbound eviction logic is aimed to prevent). There is an infinite amount of potential broken behavior one can imagine, and clearly we cannot have code for all of it.
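The resource-accounting alternative sipa sketches in words might look something like this. It is entirely hypothetical; no such ledger exists in the codebase, and all names below are invented:

```python
# Hypothetical per-peer resource accounting: track units of resource
# (bandwidth, CPU, I/O) spent per peer, and when total usage exceeds our
# budget, throttle the heaviest consumers first.

from collections import defaultdict

class ResourceLedger:
    def __init__(self, budget):
        self.budget = budget              # total units we can spend per window
        self.spent = defaultdict(float)   # units spent per peer this window

    def charge(self, peer_id, units):
        self.spent[peer_id] += units

    def peers_to_throttle(self, n=1):
        """When over budget, return the n heaviest consumers; otherwise none."""
        if sum(self.spent.values()) <= self.budget:
            return []
        return sorted(self.spent, key=self.spent.get, reverse=True)[:n]

ledger = ResourceLedger(budget=100.0)
ledger.charge("peer_a", 30.0)
ledger.charge("peer_b", 80.0)
assert ledger.peers_to_throttle() == ["peer_b"]  # over budget: throttle the worst
```

Unlike a misbehavior score, this scheme has no fixed per-offence cost to game: a peer is throttled exactly when, and only when, its resource usage actually hurts us.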

Another reason for the misbehavior detection is to help protocol implementers get a good idea for expectations around parts of the protocol that aren't formally specified (because it triggers disconnections, and possibly logging). And arguably, unconnecting headers isn't a protocol violation, at least in the context of BIP130 announcements, so disconnecting for something that's very rare but not impossible to trigger in what we'd consider a perfectly compliant protocol implementation would be confusing.

mzumsande (Contributor) left a comment

Concept ACK, a few doc nits and one question below.

sr-gi (Member) left a comment

Concept ACK. I left some comments inline, plus some general ones that may be addressable with the change of functionality:

  • Looks like banman.h also needs updating, given there is no such thing as how much a peer misbehaves anymore:

    (line 37) // 2. Discouragement. If a peer misbehaves enough (see Misbehaving() in
    // net_processing.cpp)
    
  • p2p_filter.py and p2p_invalid_messages.py are asserting logs looking for Misbehaving. Wouldn't it be better now to just check that the peer was disconnected in those cases? e.g.:

    self.log.info('Check that too large filter is rejected')
    with self.nodes[0].assert_debug_log(['Misbehaving']):
        filter_peer.send_and_ping(msg_filterload(data=b'\xbb'*(MAX_BLOOM_FILTER_SIZE+1)))

@@ -1828,8 +1828,7 @@ bool PeerManagerImpl::MaybePunishNodeForBlock(NodeId nodeid, const BlockValidati
return true;
// Conflicting (but not necessarily invalid) data or different policy:
A reviewer (Member) commented:

Is this comment accurate? I don't see how it applies

sipa (Member, Author) replied Mar 12, 2024

Heh, this seems to be a leftover comment that hasn't applied in years. This code used to handle both block and transaction validation failures, but transactions were moved elsewhere. It seems unrelated to this PR, so I won't touch it here.

@@ -1848,7 +1835,7 @@ bool PeerManagerImpl::MaybePunishNodeForTx(NodeId nodeid, const TxValidationStat
break;
// The node is providing invalid data:
case TxValidationResult::TX_CONSENSUS:
if (peer) Misbehaving(*peer, 100, "");
if (peer) Misbehaving(*peer, "");
A reviewer (Member) commented:

Is there a reason why we don't even signal this as a consensus discrepancy? Is it logged somewhere else?

sipa (Member, Author) replied:

This is the network processing layer, which acts in response to validation failures (to the extent that it needs to punish peers for it). The actual validation failure should be logged through the validation code itself, see ConnectTip() there, which logs failures, and invokes the BlockChecked signal that is handled by net_processing.

A reviewer (Member) replied:

I see. The message passed to Misbehaving ranges in how explicit it is though (from the actual reason to the category, without really specifying what, e.g. "invalid compact block"). That's why I was expecting this to at least signal the reason is due to a consensus discrepancy

Since bitcoin#25454 we keep track of the last
GETHEADERS request that was sent and wasn't responded to. So far, every incoming
HEADERS message is treated as a response to whatever GETHEADERS was last sent,
regardless of its contents.

This commit makes this tracking more accurate, by only treating HEADERS messages
which (1) are empty, (2) connect to our existing block header tree, or (3) are a
continuation of a low-work headers sync as responses that clear the "outstanding
GETHEADERS" state (m_last_getheaders_timestamp).

That means that HEADERS messages which do not satisfy any of the above criteria
will be ignored, not triggering a GETHEADERS, and potentially (for now, but see
a later commit) increasing the misbehavior score.
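The three acceptance criteria from this commit message can be summarized as a single predicate. This is an illustrative sketch, not the actual net_processing code; the function and parameter names are invented:

```python
# Sketch of when a HEADERS message is treated as a response that clears the
# "outstanding GETHEADERS" state (m_last_getheaders_timestamp).

def clears_outstanding_getheaders(headers, connects_to_header_tree,
                                  continues_low_work_sync):
    """A HEADERS message clears the outstanding-GETHEADERS state only if it
    is (1) empty, (2) connects to our existing block header tree, or
    (3) is a continuation of a low-work headers sync."""
    if not headers:
        return True
    return connects_to_header_tree or continues_low_work_sync

assert clears_outstanding_getheaders([], False, False)          # (1) empty
assert clears_outstanding_getheaders(["h1"], True, False)       # (2) connecting
assert clears_outstanding_getheaders(["h1"], False, True)       # (3) low-work sync
assert not clears_outstanding_getheaders(["h1"], False, False)  # ignored
```

A message failing all three criteria leaves the state untouched, so it is ignored rather than treated as a reply to the last GETHEADERS.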
This misbehavior was originally intended to prevent bandwidth wastage due to
actually observed very broken (but likely non-malicious) nodes that respond
to GETHEADERS with a response unrelated to the request, triggering a request
cycle.

This has however largely been addressed by the previous commit, which causes
non-connecting HEADERS that are received while a GETHEADERS has not been
responded to, to be ignored, as long as they do not time out (2 minutes).
With that, the specific misbehavior is largely irrelevant (for inbound peers,
it is now harmless; for outbound peers, the eviction logic will eventually
kick them out if they're not keeping up with headers at all).
With the Misbehavior score gone for non-connecting headers (see previous
commit), there is no need to only treat headers messages with up to 8
headers as potential BIP130 announcements. BIP130 does not specify such
a limit; it was purely a heuristic.
This removes the need to actually track misbehavior score (see further commit), because any
Misbehaving node will immediately hit the discouragement threshold.
sipa (Member, Author) commented Mar 12, 2024
sipa commented Mar 12, 2024

Addressed most of the feedback I've gotten.

As pointed out by @mzumsande in #29575 (comment), the protection I imagined existed against buggy clients causing a getheaders/headers cycle didn't actually exist. I've instead added that as a separate improvement, becoming the new first commit.

@sr-gi I have not addressed the p2p_filter.py and p2p_invalid_messages.py using misbehavior logging. These tests use whitelisting to prevent disconnection, and I think reworking that isn't necessarily an improvement.

Sjors (Member) commented Mar 22, 2024

I ran a slightly outdated version of this PR with the log commit e451338 for the last two weeks on a decently connected node. Only a single MISBEHAVE log entry:

168782-2024-03-07T15:58:10.608477Z Saw new header hash=00000000000000000002c88981de747ae237d80b74a5deb1a17943a76b296e3a height=833594 peer=1501 peeraddr=x.x.x.x:35630
168948-2024-03-07T15:58:10.947732Z UpdateTip: new best=00000000000000000002c88981de747ae237d80b74a5deb1a17943a76b296e3a height=833594 version=0x2319e000 log2_work=94.778914 tx=973821560 date='2024-03-07T15:57:26Z' progress=1.000000 cache=16.8MiB(112927txo)
169198:2024-03-07T15:58:19.055670Z MISBEHAVE: invalid header

This invalid header was sent 9 seconds after receiving a new valid block.

I'll continue to run with the latest version...


ACK e7ccaa0

sdaftuar (Member) commented Apr 4, 2024

Concept and code review ACK. I haven't done any testing (nor did I carefully review the tests), just read through the code and thought about the overall logical flow, which makes sense to me.

Sjors (Member) commented Apr 5, 2024

It's been two weeks and I don't see a single MISBEHAVE log entry.

mzumsande (Contributor) left a comment

Code Review ACK e7ccaa0

willcl-ark (Member) commented

Concept ACK.

IIUC this would not quite be enough to close down #9530, as it would not permit a header building on top of an invalid block with a valid header, as proposed there. Perhaps what is proposed here is superior though, and ce238dd is enough to close that issue out if this goes in?

sr-gi (Member) left a comment

A couple of doc nits and a question below

1. Announce a header that doesn't connect.
Expect: getheaders message
2. Send headers chain.
Expect: getdata for the missing blocks, tip update.
b. Then send 9 more headers that don't connect.
b. Then send 100 more headers that don't connect.
A reviewer (Member) commented:

in: 78515e1

nit: isn't this 99? (for i in range(1, NUM_HEADERS))

Comment on lines +560 to +562
# Send an empty header as failed response to the receive getheaders, to
# make marked responded to. New headers are not treated as announcements
# otherwise.
sr-gi (Member) commented Apr 15, 2024

in: ce238dd

nit: I found this wording a bit hard to follow, especially the "to make marked responded to" bit.

IIUC, this is sending an empty headers message as a response to the outstanding getheaders request from the previous loop iteration; otherwise, sending blocks[i] would be treated as the reply instead. Is that it? If so, I'd suggest something along the lines of:

Suggested change
# Send an empty header as failed response to the receive getheaders, to
# make marked responded to. New headers are not treated as announcements
# otherwise.
# Send an empty header as a failed response to the received getheaders
# (from the previous iteration). Otherwise, the new headers will be
# treated as a response instead of as an announcement.

@@ -2628,6 +2628,8 @@ bool PeerManagerImpl::IsContinuationOfLowWorkHeadersSync(Peer& peer, CNode& pfro
{
if (peer.m_headers_sync) {
auto result = peer.m_headers_sync->ProcessNextHeaders(headers, headers.size() == MAX_HEADERS_RESULTS);
// If it is a valid continuation, we should treat the existing getheaders request as responded to.
if (result.success) peer.m_last_getheaders_timestamp = {};
sr-gi (Member) commented Apr 15, 2024

in: ce238dd

I think the comment 6 lines below is inconsistent now:

// It should be impossible for the getheaders request to fail,
// because we should have cleared the last getheaders timestamp
// when processing the headers that triggered this call.
[...]

IIUC, result.request_more cannot be true if result.success is false (based on my understanding of HeadersSyncState::ProcessNextHeaders), therefore we will clear m_last_getheaders_timestamp right before entering the request_more scope.

I think this should also mean that there shouldn't be a way of bypassing this check (as the second part of the comment used to suggest), so the logging can be simplified

@DrahtBot DrahtBot requested a review from sr-gi April 15, 2024 19:51
DrahtBot (Contributor) commented

🐙 This pull request conflicts with the target branch and needs rebase.
