Fix transaction relay bugs introduced in #14897 and expire transactions from peer in-flight map #15834
@@ -75,6 +75,8 @@ static constexpr int64_t INBOUND_PEER_TX_DELAY = 2 * 1000000;
 static constexpr int64_t GETDATA_TX_INTERVAL = 60 * 1000000;
 /** Maximum delay (in microseconds) for transaction requests to avoid biasing some peers over others. */
 static constexpr int64_t MAX_GETDATA_RANDOM_DELAY = 2 * 1000000;
+/** How long to wait (in microseconds) before expiring a getdata request to a peer */
+static constexpr int64_t TX_EXPIRY_INTERVAL = 10 * GETDATA_TX_INTERVAL;
Does the use of 10 here imply that-- assuming all peers INV a transaction to us at roughly the same time-- that we get no more robustness to transaction suppression than if we only had ten peers? If so, perhaps this should be equal to the maximum number of peers times the interval
No, TX_EXPIRY_INTERVAL only removes txs that are already in the in_flight list, which means we've already sent a GETDATA for them at least TX_EXPIRY_INTERVAL microseconds ago. EDIT: (And it doesn't remove them from other peers' queues, whether they've been sent a GETDATA already, or that's still pending.)
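A minimal sketch of the expiry behaviour described in this reply, under simplifying assumptions: `PeerTxState`, `ExpireInFlight` and the `std::string` txids are hypothetical stand-ins rather than the PR's code, and the exact comparison used in the PR may differ. Only the two constants are taken from the diff above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Constants as in the diff (microseconds).
static constexpr int64_t GETDATA_TX_INTERVAL = 60 * 1000000;
static constexpr int64_t TX_EXPIRY_INTERVAL = 10 * GETDATA_TX_INTERVAL;
static_assert(TX_EXPIRY_INTERVAL == 600 * int64_t{1000000}, "expiry is 10 minutes");

// Hypothetical, simplified per-peer state: txid -> time the GETDATA was sent.
struct PeerTxState {
    std::map<std::string, int64_t> in_flight;
};

// Drop only this peer's requests that were sent at least TX_EXPIRY_INTERVAL ago.
// Other peers' state is not touched. Returns the number of expired entries.
inline int ExpireInFlight(PeerTxState& peer, int64_t now)
{
    assert(now >= 0);
    int expired = 0;
    for (auto it = peer.in_flight.begin(); it != peer.in_flight.end();) {
        if (it->second <= now - TX_EXPIRY_INTERVAL) {
            it = peer.in_flight.erase(it);
            ++expired;
        } else {
            ++it;
        }
    }
    return expired;
}
```

Because the function only walks one peer's map, a request for the same txid that is queued or in flight with another peer is unaffected.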
However, I think there's another bug that causes problems here -- if you have two offers with nearby m_process_time values, then you'll hit the else clause of last_request_time <= nNow - GETDATA_TX_INTERVAL and call RequestTx(txid), which will just return because it's already present in m_tx_announced, and then you'll clear inv.hash (ie txid) from tx_process_time, and never actually request it from that peer. So I think this also needs:
} else {
// This transaction is in flight from someone else; queue
// up processing to happen after the download times out
// (with a slight delay for inbound peers, to prefer
// requests to outbound peers).
+ state.m_tx_download.m_tx_announced.erase(inv.hash);
RequestTx(&state, txid, nNow);
}
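The early return described above can be reproduced with a small model. This is only a sketch: `TxDownloadState` here is a stripped-down stand-in with `std::string` txids, `Reschedule` is a hypothetical helper that mimics the else branch, and the `erase_first` flag toggles the one-line change suggested in this comment. It is not the code that was eventually merged.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Simplified stand-in for the per-peer download state.
struct TxDownloadState {
    std::multimap<int64_t, std::string> m_tx_process_time; // when to reconsider a txid
    std::set<std::string> m_tx_announced;                  // txids this peer announced
};

// Simplified RequestTx: does nothing if the txid is already announced.
inline bool RequestTx(TxDownloadState& s, const std::string& txid, int64_t process_time)
{
    assert(!txid.empty());
    if (s.m_tx_announced.count(txid)) return false;
    s.m_tx_announced.insert(txid);
    s.m_tx_process_time.emplace(process_time, txid);
    return true;
}

// Hypothetical model of the else branch: the due entry has been consumed, and
// we try to queue the txid again for a later time. Returns true if it is
// still scheduled afterwards.
inline bool Reschedule(TxDownloadState& s, const std::string& txid, int64_t later, bool erase_first)
{
    for (auto it = s.m_tx_process_time.begin(); it != s.m_tx_process_time.end(); ++it) {
        if (it->second == txid) { s.m_tx_process_time.erase(it); break; }
    }
    if (erase_first) s.m_tx_announced.erase(txid); // the suggested extra line
    return RequestTx(s, txid, later);
}
```

Without the erase, `RequestTx` sees the txid in `m_tx_announced` and returns early, so the txid silently drops out of `m_tx_process_time` for that peer.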
Oh, that only works for the request side. If we get:
0s peer A: INV deadbeef...
1s peer B: INV deadbeef...
...
5s peer Z: INV deadbeef
and then we happen to query some unresponsive peers, we'll see:
10s: -> peer D: GETDATA deadbeef...
610s: [expired]
611s: -> peer Q: GETDATA deadbeef...
1211s: [expired]
At which point every peer will have decided "it's been >15m, I'll expire the deadbeef... tx from mapRelay", and then every node will respond with NOTFOUND for that tx...
Does the use of 10 here imply that-- assuming all peers INV a transaction to us at roughly the same time-- that we get no more robustness to transaction suppression than if we only had ten peers?
@gmaxwell I'm not sure I understand the question -- is the concern that an adversary could keep re-announcing a transaction to us, and every 10 minutes we'll retry that adversary instead of going to our other peers?
I think that does seem problematic, but since we prefer outbound peers over inbound ones, we should at least be able to query all our outbound peers before an adversary would be able to cause us to re-cycle back to them. I could, as you suggest, raise this to 125 minutes to eliminate this problem, but that seemed like an excessive time to me that we might (potentially) stop transaction relay. Of course in theory this code shouldn't really kick in except for misbehaving peers, so I don't have a strong intuition of where to set it.... Thoughts?
@ajtowns Thanks for catching that bug around RequestTx(). I'll be including a fix the next time I update this PR.
Oh, that only works for the request side. If we get:
0s peer A: INV deadbeef...
1s peer B: INV deadbeef...
...
5s peer Z: INV deadbeef
and then we happen to query some unresponsive peers, we'll see:
10s: -> peer D: GETDATA deadbeef...
610s: [expired]
611s: -> peer Q: GETDATA deadbeef...
1211s: [expired]
At which point every peer will have decided "it's been >15m, I'll expire the deadbeef... tx from mapRelay", and then every node will respond with NOTFOUND for that tx...
I think there's a misunderstanding here. The design here is supposed to be that every 1 minute (GETDATA_TX_INTERVAL), we will send a GETDATA for a txid that we want to a new peer, while every 10 minutes (TX_EXPIRY_INTERVAL) we'll clear out the in-flight request to a peer we've asked for a given txid, to make room for asking that peer for other txid's (since we cap the in-flight size to 100). Your example makes me think that you were looking at the 10-minute timer as applying to both new requests and expiring from the data structure, which should not be the case...?
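To make the two timers concrete, here is a rough sketch of the intended timeline for a single txid when every queried peer is unresponsive. It works in seconds for readability and ignores the random and inbound delays; `Attempt` and `Schedule` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Seconds here (the real constants are in microseconds).
static constexpr int64_t GETDATA_TX_INTERVAL = 60;
static constexpr int64_t TX_EXPIRY_INTERVAL = 10 * GETDATA_TX_INTERVAL;

struct Attempt {
    int64_t sent;    // when the GETDATA goes to this peer
    int64_t expires; // when this peer's in-flight entry is cleared
};

// A new peer is tried every GETDATA_TX_INTERVAL; each peer's in-flight entry
// is cleared TX_EXPIRY_INTERVAL after its own request was sent.
inline std::vector<Attempt> Schedule(int64_t first_request, int peers)
{
    assert(peers >= 0);
    std::vector<Attempt> out;
    for (int i = 0; i < peers; ++i) {
        const int64_t sent = first_request + i * GETDATA_TX_INTERVAL;
        out.push_back({sent, sent + TX_EXPIRY_INTERVAL});
    }
    return out;
}
```

With a first request at 10s, the second peer is asked at 70s rather than at 611s; the 610s mark only frees the first peer's in-flight slot.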
Yeah, I think I got the GetTxRequestTime check confused with checking whether it was inflight via another node at all, rather than just hadn't been asked for for a while.
My comment on the interval was mostly that we'll expire all past the first ten to offer to us before we ever get to them. The discussion with AJ sorted out my confusion.
src/net_processing.cpp
Outdated
for (CInv &inv : vInv) {
    if (inv.type == MSG_TX || inv.type == MSG_WITNESS_TX) {
        state->m_tx_download.m_tx_announced.erase(inv.hash);
        state->m_tx_download.m_tx_in_flight.erase(inv.hash);
Maybe:
if (m_tx_in_flight.erase(hash)) {
m_tx_announced.erase(hash);
}
Otherwise sending INV x1 x2 .. x100, then looping INV x; NOTFOUND x (without ever sending the corresponding TX message) will grow m_tx_process_time unboundedly, I think.
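The suggestion relies on the key-based `erase` of the standard associative containers returning the number of elements removed. A small sketch, assuming a simplified state with `std::string` txids (`TxDownloadState` and `OnNotFound` are illustrative names, not the PR's code):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Simplified stand-in for the per-peer download state.
struct TxDownloadState {
    std::set<std::string> m_tx_announced;
    std::map<std::string, int64_t> m_tx_in_flight; // txid -> request time
};

// Handle one NOTFOUND entry. Returns true if it matched an outstanding request.
inline bool OnNotFound(TxDownloadState& s, const std::string& txid)
{
    assert(!txid.empty());
    // erase(key) returns how many elements were removed (0 or 1 for a map).
    if (s.m_tx_in_flight.erase(txid) > 0) {
        s.m_tx_announced.erase(txid);
        return true;
    }
    return false; // not requested from this peer: leave m_tx_announced alone
}
```

A NOTFOUND for something that was never requested from this peer therefore cannot clear `m_tx_announced`, which is what would otherwise let the INV/NOTFOUND loop keep adding entries.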
Yeah, testing confirms we can get a memory leak this way -- test node gets from 60M to 1GB resident size after about 1M INV y1..y100; NOTFOUND y1..y100 pairs, after pre-filling in_flight and process with INV x1..x5000, and suggested change keeps resident size stable at 60M.
INV y; TX ytx also unconditionally clears m_tx_announced, but that's okay because redoing INV y won't re-add it to m_tx_announced because of the AlreadyHave() check. However, doing INV y1; TX y1; INV y2; TX y2; ..; INV yN; TX yN can still cause m_tx_process_time to have N entries. That's a lot harder to exploit, especially with the timeout this patch adds that constrains the y1..yN to be within the 20-minute period, but still maybe worth doing the same fix?
Might be good to call EraseTxRequest(hash); inside the if as well, so that we don't delay requesting from other peers when we've already received a NOTFOUND. Could leave that for #15505 though.
Are we failing to dequeue txn for non-witness peers that send us witness txn that we drop?
Concept ACK. Could you please update the comment in bitcoin/src/net_processing.cpp Lines 1462 to 1469 in 7c4e69c
to say:
@gmaxwell Not that I can see. It looks to me like the only way a peer could "send" us a transaction that doesn't get dequeued is if it fails to deserialize successfully; once we've deserialized I don't think there's a code path that would not result in updating the data structures.
So I don't forget: Instead of disconnecting we can make the ordering in the random fetching biased based on the size() of the INVs outstanding queue and the number of expired entries... so hosts that INV DoS us just end up de-preferred for fetching.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Conflicts: No conflicts as of last run.
Removing this from 0.18.0 because of #14897
8602d8b Revert "Change in transaction pull scheduling to prevent InvBlock-related attacks" (Suhas Daftuar) Pull request description: This is for 0.18, not master -- I propose we revert the getdata change for the 0.18 release, rather than try to continue patching it up. It seems like we've turned up several additional bugs that slipped through initial review (see #15776, #15834), and given the potential severe consequences of these bugs I think it'd make more sense for us to delay releasing this code until 0.19. Since the bugfix PRs are getting review, I think we can leave #14897 in master, but we can separately discuss if it should be reverted in master as well if anyone thinks that would be more appropriate. ACKs for commit 8602d8: Tree-SHA512: 0389cefc1bc74ac47f87866cf3a4dcfe202740a1baa3db074137e0aa5859672d74a50a34ccdb7cf43b3a3c99ce91e4ba1fb512d04d1d383d4cc184a8ada5543f
Previously there was an implicit bound based on the handling of m_tx_announced, but that approach is error-prone (particularly if we start automatically removing things from that set).
This prevents a bug where the in-flight queue for our peers will not be drained, resulting in not downloading any new transactions from our peers. Thanks to ajtowns for reporting this bug.
7c4e69c to 76fe8d0
76fe8d0 to 25f7109
I've updated this PR and believe the bugs have now been fixed. I've also got a test, but with the delays involved it takes a very long time to run (ie the test has to sit around and wait for various timeouts). I'm guessing people don't want to include a test in the test suite (even an extended test) that would take tens of minutes to complete. Would reviewers prefer me to switch the uses of GetTimeMicros() in the logic to GetTime() so that we can mock it, and then include a functional test that uses mocktime? If such tests are viewed as too contrived then I'll skip it and leave this PR as is.
Thanks for making progress here. Testing now.
Addressed some nits. Old version is 15834.1
@jonatack You can take a look at this if you'd like: sdaftuar@db8fc5a Took a bit over 30 minutes the last time I ran it.
utACK 308b767
Thanks @sdaftuar. Initial run passed in ~23 minutes.
(Edit: 2 more runs clocked in at 23 and 22 minutes. Am reviewing code and test.)
// it from our data structures for this peer.
auto in_flight_it = state->m_tx_download.m_tx_in_flight.find(inv.hash);
if (in_flight_it == state->m_tx_download.m_tx_in_flight.end()) {
    // Skip any further work if this is a spurious NOTFOUND
This comment is slightly confusing in this PR, but I think the code logic is correct. The idea is that if the tx isn't in_flight from you then there is no reason to try erasing it from m_tx_announced, because if it's in there you are still going to deal with it at some later point via m_tx_process_time.
I think "spurious" is still fine. Why would a node announce a tx to you and then follow up with a NOTFOUND without any further communication with you?
On the questions around why I structured the loop this way (and why I made a comment about skipping "any further work"): it is motivated by the additional code I plan to add in #15505. See for instance: f5dbb49
I reviewed the logic and tried to think through various failure modes and did a light review of the code.
Code review ACK 308b767
CNodeState *state = State(pfrom->GetId());
std::vector<CInv> vInv;
vRecv >> vInv;
if (vInv.size() <= MAX_PEER_TX_IN_FLIGHT + MAX_BLOCKS_IN_TRANSIT_PER_PEER) {
Was confused about why we're consulting MAX_BLOCKS_IN_TRANSIT_PER_PEER for tx-based logic, but I guess this is just to avoid wasting time reading too-long DoSy INVs?
in commit e32e084 Remove NOTFOUND transactions from in-flight data structures:
doc-nit: Maybe add a comment // We only send NOTFOUNDs for transactions, but for any peer, this message should never be larger than all in-flight inventory items?
Yep that is the reason I had in mind -- a peer might reasonably send us a NOTFOUND for something we sent a GETDATA for, but not more than that.
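The size guard can be read as a simple predicate on the number of entries in the NOTFOUND message. A sketch, assuming a cap of 100 in-flight transactions (as mentioned earlier in this thread) and an assumed value of 16 for the block limit; `ShouldProcessNotFound` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

static constexpr std::size_t MAX_PEER_TX_IN_FLIGHT = 100;         // "cap the in-flight size to 100"
static constexpr std::size_t MAX_BLOCKS_IN_TRANSIT_PER_PEER = 16; // assumed value for illustration

// A well-behaved peer can only answer NOTFOUND for things we asked it for,
// so a message listing more items than we could have in flight is skipped.
inline bool ShouldProcessNotFound(std::size_t inv_count)
{
    return inv_count <= MAX_PEER_TX_IN_FLIGHT + MAX_BLOCKS_IN_TRANSIT_PER_PEER;
}
```

Under these assumed values the largest message that is still examined has 116 entries.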
    continue;
}
state->m_tx_download.m_tx_in_flight.erase(in_flight_it);
state->m_tx_download.m_tx_announced.erase(inv.hash);
If you care to, I think the body of this loop can be written more simply as
if (inv.type == MSG_TX || inv.type == MSG_WITNESS_TX) {
// If we receive a NOTFOUND message for a txid we requested, erase
// it from our data structures for this peer.
if (state->m_tx_download.m_tx_in_flight.erase(inv.hash) > 0) {
state->m_tx_download.m_tx_announced.erase(inv.hash);
}
}
Looks fine, but should keep the "spurious" comment ofc.
@@ -3987,14 +3998,14 @@ bool PeerLogicValidation::SendMessages(CNode* pto)
// up processing to happen after the download times out
// (with a slight delay for inbound peers, to prefer
// requests to outbound peers).
RequestTx(&state, txid, nNow);
RequestTx() does the same as the two lines below, but performs .size() checks on the tx maps and ensures the txid is in m_tx_announced. Are we not just calling that because we don't care to do those checks as belt-and-suspenders, or do we specifically want to avoid populating m_tx_announced? I guess those checks are safe to avoid given txid was already in tx_process_time, but seems like it wouldn't hurt to do them anyway.
RequestTx() is inappropriate here and was a bug -- see @ajtowns' comment at #15834 (comment)
Oops, sorry I missed that. Thanks for the pointer.
@@ -699,7 +699,9 @@ void UpdateTxRequestTime(const uint256& txid, int64_t request_time) EXCLUSIVE_LO
 void RequestTx(CNodeState* state, const uint256& txid, int64_t nNow) EXCLUSIVE_LOCKS_REQUIRED(cs_main)
 {
     CNodeState::TxDownloadState& peer_download_state = state->m_tx_download;
-    if (peer_download_state.m_tx_announced.size() >= MAX_PEER_TX_ANNOUNCEMENTS || peer_download_state.m_tx_announced.count(txid)) {
+    if (peer_download_state.m_tx_announced.size() >= MAX_PEER_TX_ANNOUNCEMENTS ||
+            peer_download_state.m_tx_process_time.size() >= MAX_PEER_TX_ANNOUNCEMENTS ||
Maybe worth noting that I think m_tx_process_time is liable to outgrow m_tx_announced temporarily (e.g. we ask peer for txid, get back NOTFOUND, m_tx_announced entry immediately erased but m_tx_process_time entry hangs around until next SendMessages()) but I think this is negligible.
That shouldn't matter, since something else must have gone horribly wrong when there are more than MAX_PEER_TX_ANNOUNCEMENTS == 100k.
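A sketch of what the explicit bound buys, including the case raised above where m_tx_process_time temporarily holds more entries than m_tx_announced. The types are simplified stand-ins with `std::string` txids, and the `limit` parameter is a hypothetical addition so the bound can be exercised with small numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

static constexpr std::size_t MAX_PEER_TX_ANNOUNCEMENTS = 100000; // "100k" per the thread

// Simplified stand-in for the per-peer download state.
struct TxDownloadState {
    std::multimap<int64_t, std::string> m_tx_process_time;
    std::set<std::string> m_tx_announced;
};

// Refuse new work when either container has reached the limit, or when the
// txid is already known for this peer. Returns true if the txid was queued.
inline bool RequestTx(TxDownloadState& s, const std::string& txid, int64_t process_time,
                      std::size_t limit = MAX_PEER_TX_ANNOUNCEMENTS)
{
    assert(limit > 0);
    if (s.m_tx_announced.size() >= limit ||
        s.m_tx_process_time.size() >= limit ||
        s.m_tx_announced.count(txid)) {
        return false;
    }
    s.m_tx_announced.insert(txid);
    s.m_tx_process_time.emplace(process_time, txid);
    return true;
}
```

With the second condition in place, m_tx_process_time is bounded on its own terms instead of depending on how m_tx_announced happens to be maintained.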
Looks good. Left some style-nits or questions. Feel free to ignore them
ACK 308b767
Checked that each commit looks like an improvement.
Didn't test or compile.
Show signature and timestamp
Signature:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
ACK 308b76732f
Checked that each commit looks like an improvemnt.
Didn't test or compile.
-----BEGIN PGP SIGNATURE-----
iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUhnkQwAt1rsv6UDR4tMdCQbBQCD/OFCa2lvIDIv7JFpAkWRDVLo+bOKlsNFonnO
z1vQDf2quOAFcFJwgrBA1e+i1RVMuv142UuHrMQsL5XzlG8SC4+sAmLDNUypHOJU
VG9lwT+42yIpUeEDd0tROJ1XwfFdROIOZatDUFAms0hc4lQTyxRlKYsJAo3Q16yW
hH8VvGMlMqQm8Lnmwe7B3YTnTJbo7zMuFfi2bxyrRPds+cWJkKiIVdtzcmc4OFlv
MJRcGcS+iXhAgG1gHugqgSBrFX6frmzJK/DVH2xvgdj/Xa09YqxDUeXCYGdK+dp5
pJB517B533I+ctRkzK0VWXXCQeLXewtIPV5HlkAM3rUVVAA+JG+Z5ao9oQh1dehw
zcDYsXoKmbT+mPLGPhT3caoCwP5OQO5i2LTLrVMvR1vpp8lMwhwyPtqXQWFbMGj7
vniO+ovK1AU7IBt915ddDFxhSVnJ6gJHvNO8HLe467zONefaYIR3ByzNHDStXxOg
Du8Mmno8
=kBiZ
-----END PGP SIGNATURE-----
Timestamp of file with hash 5c821e3a0cb75ceb99ce84a2ae7942c0fce5d839a5cf299b21fed396a8c333bf -
-    if (peer_download_state.m_tx_announced.size() >= MAX_PEER_TX_ANNOUNCEMENTS || peer_download_state.m_tx_announced.count(txid)) {
+    if (peer_download_state.m_tx_announced.size() >= MAX_PEER_TX_ANNOUNCEMENTS ||
+            peer_download_state.m_tx_process_time.size() >= MAX_PEER_TX_ANNOUNCEMENTS ||
+            peer_download_state.m_tx_announced.count(txid)) {
in commit 23163b7 Add an explicit memory bound to m_tx_process_time:
style-nit: Could remove the .count() here and add the check below: if(!insert().second) return; // already have. (This is how the code looked in previous versions of Bitcoin Core.)
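For readers unfamiliar with the idiom in this nit: `std::set::insert` returns a pair whose `.second` is false when the element was already present, so the membership check and the insertion can be a single call. A minimal sketch with a hypothetical helper name:

```cpp
#include <cassert>
#include <set>
#include <string>

// Returns true if txid was newly recorded, false if it was already present.
inline bool MarkAnnounced(std::set<std::string>& announced, const std::string& txid)
{
    assert(!txid.empty());
    // insert() yields {iterator, bool}; the bool is false on a duplicate.
    return announced.insert(txid).second;
}
```

This saves one lookup compared with calling `count()` followed by `insert()`.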
Thanks for the review @jamesob and @MarcoFalke. I'm going to leave the nits alone for now, and if I end up having to update this PR again I'll consider taking the nits at that point. To other reviewers, I think what this PR needs most is additional testing (eg on mainnet).
Light ACK 308b767. Agree with additional testing. I've been looking at mainnet logs and the functional tests at sdaftuar@db8fc5a to see how they might be improved or run more quickly. Rough breakdown of current test run times in
After the Chaincode seminars these next couple weeks, will hopefully propose tests here or in a follow-up. |
One other thought for reviewers -- one thing I think we didn't consider very well when #14897 was merged is the effect this change has on the relay of dependent transactions.
As a reminder, when a batch of transactions is to be announced to a peer, bitcoind will sort the batch so that parents appear before children in the INV message. Before #14897, transactions announced by a given peer would be requested in the same order as the announcement (but it was possible for transactions announced by multiple peers to be received in an arbitrary order, of course). When a parent arrives after a child, we must rely on the orphan map to reconstruct the chain, which is inefficient and error-prone (for anti-DoS reasons we limit the size of the orphan map, so too many orphans could lead to relay failures). Given that #14897 tries to randomize getdata requests, there is an increased potential for transactions to be fetched out of order, which could be problematic.
My understanding of this issue is that we don't randomize the order of getdatas for transactions announced for the first time by outbound peers. Consequently, my guess is that this is not too big a deal. But note that (a) our inbound peers will generally announce to us faster than our outbounds (because of the way our software biases to sending with lower Poisson delays to outbound peers over inbound ones), and (b) in the future, proposals like Erlay might mean we do much more flooding over outbound links than inbound ones.
If we think this is in fact a problem, or will be in the future, then I think the easiest fix would be to not put any delay on transaction requests from inbound peers when we're learning of a transaction for the first time. I would like to hold off on addressing this issue in this PR, which I think is a strict improvement over where we are now, but I'm mentioning it in case others are more concerned about the potential severity.
Looks like this has about 6 ACKs. Unless there are objections, this will be merged within the next couple of weeks. I am running a node with extended logging for corner cases to see if anything obvious breaks. |
ACK 308b767
I've run the functional test @sdaftuar wrote (sdaftuar@db8fc5a) and verified that it fails before these changes and passes when cherry-picked on top of this branch.
I compiled and ran this on mainnet for a few hours. My datadir was somewhat out of date, so this code was tested in both IBD and at tip. I watched Parsing debug.log shows robust tx transmission from a variety of peers during testing. Below, the message count and peer ID are shown:
Tue 11 16:51 james/src/bitcoin 308b76732f* (308b76732f) btc
$ grep -E "got inv.*peer=" /data/bitcoin/debug.log | cut -d= -f2 | sort | uniq -c
26258 0
5473 1
16966 12
25632 2
23609 3
25232 6
24553 7
25578 8
25732 9
Tue 11 16:51 james/src/bitcoin 308b76732f* (308b76732f) btc
$ grep -E "sending getdata.*peer=" /data/bitcoin/debug.log | cut -d= -f2 | sort | uniq -c
2513 0
200 1
1010 12
2967 2
2124 3
2417 6
1761 7
2712 8
2661 9
Testing for a few hours doesn't rule out weird edge cases, but gives me pretty good confidence this change is fine.
@MarcoFalke isn't there an argument for merging this sooner rather than later so that it gets wider (if not incidental) test usage before 0.19 is released? |
Did some tests:
The excerpt of the debug log shows that peer 7 relays us a tx, which is included in a block shortly after. Then peer 400 announces the tx to us, and we think it is
|
ACK 308b767 (Tested two of the three bugs this pull fixes, see comment above)
…ire transactions from peer in-flight map

308b767 Fix bug around transaction requests (Suhas Daftuar)
f635a3b Expire old entries from the in-flight tx map (Suhas Daftuar)
e32e084 Remove NOTFOUND transactions from in-flight data structures (Suhas Daftuar)
23163b7 Add an explicit memory bound to m_tx_process_time (Suhas Daftuar)
218697b Improve NOTFOUND comment (Suhas Daftuar)

Pull request description:

#14897 introduced several bugs that could lead to a node no longer requesting transactions from one or more of its peers. Credit to ajtowns for originally reporting many of these bugs along with an originally proposed fix in #15776.

This PR does a few things:
- Fix a bug in NOTFOUND processing, where the in-flight map for a peer was keeping transactions it shouldn't
- Eliminate the possibility of a memory attack on the CNodeState `m_tx_process_time` data structure by explicitly bounding its size
- Remove entries from a peer's in-flight map after 10 minutes, so that we should always eventually resume transaction requests even if there are other bugs like the NOTFOUND one
- Fix a bug relating to the coordination of request times when multiple peers announce the same transaction

The expiry mechanism added here is something we'll likely want to remove in the future, but is belt-and-suspenders for now to try to ensure we don't have other bugs that could lead to transaction relay failing due to some unforeseen conditions.

ACKs for commit 308b76:
ajtowns: utACK 308b767
morcos: light ACK 308b767
laanwj: Code review ACK 308b767
jonatack: Light ACK 308b767.
jamesob: ACK 308b767
MarcoFalke: ACK 308b767 (Tested two of the three bugs this pull fixes, see comment above)
jamesob: Concept ACK 308b767
MarcoFalke: ACK 308b767

Tree-SHA512: 8865dca5294447859d95655e8699085643db60c22f0719e76e961651a1398251bc932494b68932e33f68d4f6084579ab3bed7d0e7dd4ac6c362590eaf9414eda
Summary: They all are backported at once to avoid leaving master in a buggy state.

This is Core PR14897: bitcoin/bitcoin#14897
* Change in transaction pull scheduling to prevent InvBlock-related attacks
Co-authored-by: Suhas Daftuar <sdaftuar@gmail.com>

This is Core PR15834: bitcoin/bitcoin#15834
* Remove NOTFOUND transactions from in-flight data structures
This prevents a bug where the in-flight queue for our peers will not be drained, resulting in not downloading any new transactions from our peers. Thanks to ajtowns for reporting this bug.
* Add an explicit memory bound to m_tx_process_time
Previously there was an implicit bound based on the handling of m_tx_announced, but that approach is error-prone (particularly if we start automatically removing things from that set).
* Improve NOTFOUND comment
* Expire old entries from the in-flight tx map
If a peer hasn't responded to a getdata request, eventually time out the request and remove it from the in-flight data structures. This is to prevent any bugs in our handling of those in-flight data structures from filling up the in-flight map and preventing us from requesting more transactions (such as the NOTFOUND bug, fixed in a previous commit).
Co-authored-by: Anthony Towns <aj@erisian.com.au>
* Fix bug around transaction requests
If a transaction is already in-flight when a peer announces a new tx to us, we schedule a time in the future to reconsider whether to download. At that future time, there was a bug that would prevent transactions from being rescheduled for potential download again (ie if the transaction was still in-flight at the time of reconsideration, such as from some other peer). Fix this.

This is Core PR16196: bitcoin/bitcoin#16196
* doc: Add release notes for 14897 & 15834

Test Plan:
make check
./test/functional/test_runner.py --extended

Reviewers: #bitcoin_abc, Fabien
Reviewed By: #bitcoin_abc, Fabien
Subscribers: Fabien
Differential Revision: https://reviews.bitcoinabc.org/D4574
Backport bitcoin#14897 and bitcoin#15834 and modify it to work with Dash messages
#14897 introduced several bugs that could lead to a node no longer requesting transactions from one or more of its peers. Credit to ajtowns for originally reporting many of these bugs along with an originally proposed fix in #15776.
This PR does a few things:
- Fix a bug in NOTFOUND processing, where the in-flight map for a peer was keeping transactions it shouldn't
- Eliminate the possibility of a memory attack on the CNodeState `m_tx_process_time` data structure by explicitly bounding its size
- Remove entries from a peer's in-flight map after 10 minutes, so that we should always eventually resume transaction requests even if there are other bugs like the NOTFOUND one
- Fix a bug relating to the coordination of request times when multiple peers announce the same transaction
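To make the first two items concrete, here is a minimal sketch of the idea, not the code from this PR. It assumes txids are plain strings, and `TxDownloadState`, `QueueAnnouncement`, `ProcessNotFound` and the bound `MAX_TRACKED_ANNOUNCEMENTS` are illustrative names and values chosen for this example; only `m_tx_process_time`, `m_tx_announced` and the in-flight map are named in the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-peer bound, chosen small for illustration only.
static constexpr std::size_t MAX_TRACKED_ANNOUNCEMENTS = 4;

// Simplified per-peer download state (assumption: txids are strings).
struct TxDownloadState {
    std::multimap<int64_t, std::string> m_tx_process_time; // process time -> txid
    std::set<std::string> m_tx_announced;                  // txids this peer announced
    std::map<std::string, int64_t> m_tx_in_flight;         // txid -> getdata send time
};

// Queue an announcement, checking the size of m_tx_process_time explicitly
// instead of relying on the bound on m_tx_announced alone.
// Returns true if the announcement was queued.
bool QueueAnnouncement(TxDownloadState& state, const std::string& txid, int64_t process_time)
{
    if (state.m_tx_announced.size() >= MAX_TRACKED_ANNOUNCEMENTS ||
        state.m_tx_process_time.size() >= MAX_TRACKED_ANNOUNCEMENTS ||
        state.m_tx_announced.count(txid) > 0) {
        return false;
    }
    state.m_tx_announced.insert(txid);
    state.m_tx_process_time.emplace(process_time, txid);
    assert(state.m_tx_process_time.size() <= MAX_TRACKED_ANNOUNCEMENTS);
    return true;
}

// Handle a NOTFOUND reply: drop each txid we had requested from this peer so
// the in-flight map can drain. Txids we never requested are ignored.
void ProcessNotFound(TxDownloadState& state, const std::vector<std::string>& txids)
{
    for (const std::string& txid : txids) {
        auto it = state.m_tx_in_flight.find(txid);
        if (it == state.m_tx_in_flight.end()) continue;
        state.m_tx_in_flight.erase(it);
        state.m_tx_announced.erase(txid);
    }
}
```

In this sketch a peer that answers every getdata with NOTFOUND no longer leaves stale entries behind, and a peer that floods announcements cannot grow either container past the bound.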
The expiry mechanism added here is something we'll likely want to remove in the future, but is belt-and-suspenders for now to try to ensure we don't have other bugs that could lead to transaction relay failing due to some unforeseen conditions.
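The expiry sweep can be sketched roughly as follows. This is an illustration under stated assumptions rather than the PR's implementation: the two constants are the ones quoted in the diff under review, while `PeerInFlight` and `ExpireInFlight` are hypothetical names, txids are plain strings, and any scheduling of how often the sweep runs is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Constants as quoted in the diff (microseconds).
static constexpr int64_t GETDATA_TX_INTERVAL = 60 * 1000000;
static constexpr int64_t TX_EXPIRY_INTERVAL = 10 * GETDATA_TX_INTERVAL;

// Hypothetical, simplified view of one peer's request state.
struct PeerInFlight {
    std::set<std::string> announced;          // txids this peer announced
    std::map<std::string, int64_t> in_flight; // txid -> time the getdata was sent
};

// Drop every request sent at or before (now - TX_EXPIRY_INTERVAL).
// Only this peer's entries are touched. Returns the number of expired requests.
int ExpireInFlight(PeerInFlight& peer, int64_t now)
{
    assert(now >= 0);
    int expired = 0;
    for (auto it = peer.in_flight.begin(); it != peer.in_flight.end();) {
        if (it->second <= now - TX_EXPIRY_INTERVAL) {
            peer.announced.erase(it->first);
            it = peer.in_flight.erase(it); // erase returns the next valid iterator
            ++expired;
        } else {
            ++it;
        }
    }
    return expired;
}
```

Because the sweep only looks at entries already in the in-flight map, it removes requests that were sent at least `TX_EXPIRY_INTERVAL` microseconds ago and leaves pending or in-flight requests to other peers alone.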