
Fix transaction relay bugs introduced in #14897 and expire transactions from peer in-flight map #15834

Merged
merged 5 commits into bitcoin:master on Jun 12, 2019

Conversation

Member

@sdaftuar sdaftuar commented Apr 17, 2019

#14897 introduced several bugs that could lead to a node no longer requesting transactions from one or more of its peers. Credit to ajtowns for originally reporting many of these bugs along with an originally proposed fix in #15776.

This PR does a few things:

  • Fix a bug in NOTFOUND processing, where the in-flight map for a peer was keeping transactions it shouldn't

  • Eliminate the possibility of a memory attack on the CNodeState m_tx_process_time data structure by explicitly bounding its size

  • Remove entries from a peer's in-flight map after 10 minutes, so that we should always eventually resume transaction requests even if there are other bugs like the NOTFOUND one

  • Fix a bug relating to the coordination of request times when multiple peers announce the same transaction

The expiry mechanism added here is something we'll likely want to remove in the future, but is belt-and-suspenders for now to try to ensure we don't have other bugs that could lead to transaction relay failing due to some unforeseen conditions.
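To make the expiry mechanism concrete, here is a minimal, self-contained sketch of the idea, using the TX_EXPIRY_INTERVAL constant added in this PR; the TxId alias and the free function below are illustrative stand-ins, not the actual net_processing.cpp code.

#include <cstdint>
#include <map>

using TxId = uint64_t;  // stand-in for uint256, for illustration only
static constexpr int64_t GETDATA_TX_INTERVAL = 60 * 1000000;            // 1 minute, in microseconds
static constexpr int64_t TX_EXPIRY_INTERVAL = 10 * GETDATA_TX_INTERVAL; // 10 minutes

// Drop any in-flight GETDATA entry that has gone unanswered for TX_EXPIRY_INTERVAL,
// so a peer that never responds cannot keep its per-peer in-flight map full forever.
void ExpireOldInFlight(std::map<TxId, int64_t>& tx_in_flight, int64_t now_usec)
{
    for (auto it = tx_in_flight.begin(); it != tx_in_flight.end();) {
        if (it->second <= now_usec - TX_EXPIRY_INTERVAL) {
            it = tx_in_flight.erase(it);  // give up on this request for this peer
        } else {
            ++it;
        }
    }
}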

@fanquake fanquake assigned fanquake and unassigned fanquake Apr 17, 2019
@fanquake fanquake added the P2P label Apr 17, 2019
@fanquake fanquake added this to the 0.18.0 milestone Apr 17, 2019
@@ -75,6 +75,8 @@ static constexpr int64_t INBOUND_PEER_TX_DELAY = 2 * 1000000;
static constexpr int64_t GETDATA_TX_INTERVAL = 60 * 1000000;
/** Maximum delay (in microseconds) for transaction requests to avoid biasing some peers over others. */
static constexpr int64_t MAX_GETDATA_RANDOM_DELAY = 2 * 1000000;
/** How long to wait (in microseconds) before expiring a getdata request to a peer */
static constexpr int64_t TX_EXPIRY_INTERVAL = 10 * GETDATA_TX_INTERVAL;
Contributor

Does the use of 10 here imply that -- assuming all peers INV a transaction to us at roughly the same time -- we get no more robustness to transaction suppression than if we only had ten peers? If so, perhaps this should be equal to the maximum number of peers times the interval.

Contributor

@ajtowns ajtowns Apr 17, 2019

No, TX_EXPIRY_INTERVAL only removes tx's that are already in the in_flight list, which means we've already sent a GETDATA for them at least TX_EXPIRY_INTERVAL microseconds ago. EDIT: (And doesn't remove them from other peers' queues, whether they've been sent a GETDATA already, or that's still pending)

Contributor

However, I think there's another bug that causes problems here -- if you have two offers with nearby m_tx_process_time values, then you'll hit the else clause of last_request_time <= nNow - GETDATA_TX_INTERVAL and call RequestTx(txid), which will just return because it's already present in m_tx_announced, and then you'll clear inv.hash (ie txid) from tx_process_time and never actually request it from that peer. So I think this also needs:

            } else {
                // This transaction is in flight from someone else; queue
                // up processing to happen after the download times out
                // (with a slight delay for inbound peers, to prefer
                // requests to outbound peers).
+               state.m_tx_download.m_tx_announced.erase(inv.hash);
                RequestTx(&state, txid, nNow);
            }

Contributor

Oh, that only works for the request side. If we get:

 0s peer A: INV deadbeef...
 1s peer B: INV deadbeef...
 ...
 5s peer Z: INV deadbeef

and then we happen to query some unresponsive peers, we'll see:

10s: -> peer D: GETDATA deadbeef...
610s: [expired]
611s: -> peer Q: GETDATA deadbeef...
1211s: [expired]

At which point every peer will have decided "it's been >15m, I'll expire the deadbeef... tx from mapRelay", and then every node will respond with NOTFOUND for that tx...

Member Author

Does the use of 10 here imply that -- assuming all peers INV a transaction to us at roughly the same time -- we get no more robustness to transaction suppression than if we only had ten peers?

@gmaxwell I'm not sure I understand the question -- is the concern that an adversary could keep re-announcing a transaction to us, and every 10 minutes we'll retry that adversary instead of going to our other peers?

I think that does seem problematic, but since we prefer outbound peers over inbound ones, we should at least be able to query all our outbound peers before an adversary would be able to cause us to re-cycle back to them. I could, as you suggest, raise this to 125 minutes to eliminate this problem, but that seemed like an excessive time to me that we might (potentially) stop transaction relay. Of course in theory this code shouldn't really kick in except for misbehaving peers, so I don't have a strong intuition of where to set it.... Thoughts?

Member Author

@ajtowns Thanks for catching that bug around RequestTx(). I'll be including a fix the next time I update this PR.

Oh, that only works for the request side. If we get:

 0s peer A: INV deadbeef...
 1s peer B: INV deadbeef...
 ...
 5s peer Z: INV deadbeef

and then we happen to query some unresponsive peers, we'll see:

10s: -> peer D: GETDATA deadbeef...
610s: [expired]
611s: -> peer Q: GETDATA deadbeef...
1211s: [expired]

At which point every peer will have decided "it's been >15m, I'll expire the deadbeef... tx from mapRelay", and then every node will respond with NOTFOUND for that tx...

I think there's a misunderstanding here. The design here is supposed to be that every 1 minute (GETDATA_TX_INTERVAL), we will send a GETDATA for a txid that we want to a new peer, while every 10 minutes (TX_EXPIRY_INTERVAL) we'll clear out the in-flight request to a peer we've asked for a given txid, to make room for asking that peer for other txid's (since we cap the in-flight size to 100). Your example makes me think that you were looking at the 10-minute timer as applying to both new requests and expiring from the data structure, which should not be the case...?
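To make the two timers concrete, below is a simplified, self-contained illustration of the per-txid request-time coordination. The real code uses GetTxRequestTime()/UpdateTxRequestTime() over a shared map; the names and types here are stand-ins for illustration only, not the net_processing.cpp implementation.

#include <cstdint>
#include <map>

using TxId = uint64_t;  // stand-in for uint256
static constexpr int64_t GETDATA_TX_INTERVAL = 60 * 1000000;  // 1 minute, in microseconds

// Shared "when did we last GETDATA this txid from anyone" map (stand-in for the
// real GetTxRequestTime()/UpdateTxRequestTime() helpers).
std::map<TxId, int64_t> g_last_request_time;

// When a txid's process time arrives for some peer, either request it now or
// defer, because another peer was asked within the last GETDATA_TX_INTERVAL.
bool ShouldRequestNow(const TxId& txid, int64_t now_usec)
{
    const auto it = g_last_request_time.find(txid);
    const int64_t last_request_time = (it == g_last_request_time.end()) ? 0 : it->second;
    if (last_request_time <= now_usec - GETDATA_TX_INTERVAL) {
        g_last_request_time[txid] = now_usec;  // record this request
        return true;                           // send GETDATA to this peer
    }
    return false;  // asked someone else recently; reschedule for later
}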

Contributor

Yeah, I think I got the GetTxRequestTime check confused with checking whether it was inflight via another node at all, rather than just hadn't been asked for for a while.

Contributor

My comment on the interval was mostly that we'll expire the offers from all but the first ten peers to offer it to us before we ever get to them. The discussion with AJ sorted out my confusion.

for (CInv &inv : vInv) {
if (inv.type == MSG_TX || inv.type == MSG_WITNESS_TX) {
state->m_tx_download.m_tx_announced.erase(inv.hash);
state->m_tx_download.m_tx_in_flight.erase(inv.hash);
Contributor

Maybe:

if (m_tx_in_flight.erase(hash)) {
    m_tx_announced.erase(hash);
}

Otherwise sending INV x1 x2 .. x100, then looping INV x; NOTFOUND x (without ever sending the corresponding TX message) will grow m_tx_process_time unboundedly, I think.

Contributor

Yeah, testing confirms we can get a memory leak this way -- the test node grows from 60M to 1GB resident size after about 1M INV y1..y100; NOTFOUND y1..y100 pairs, after pre-filling in_flight and process with INV x1..x5000, and the suggested change keeps resident size stable at 60M.

INV y; TX ytx also unconditionally clears m_tx_announced, but that's okay because redoing INV y won't re-add it to m_tx_announced because of the AlreadyHave() check. However, doing INV y1; TX y1; INV y2; TX y2; ..; INV yN; TX yN can still cause m_tx_process_time to have N entries. That's a lot harder to exploit, especially with the timeout this patch adds that constrains the y1..yN to be within the 20-minute period, but still maybe worth doing the same fix?

Contributor

Might be good to call EraseTxRequest(hash); inside the if as well, so that we don't delay requesting from other peers when we've already received a NOTFOUND. Could leave that for #15505 though.

Member Author

@sdaftuar sdaftuar Apr 22, 2019

The code in #15505 is checking for entries being in the in-flight map, so I'll take that suggestion here. But I think the best way to close the door on this would be to just bound the m_tx_process_time map explicitly, so I'll do that too.

I'll leave the EraseTxRequest() call for #15505.

@gmaxwell
Contributor

Are we failing to dequeue txn for non-witness peers that send us witness txn that we drop?

@maflcko
Member

maflcko commented Apr 17, 2019

Concept ACK

Could you please update the comment in

bitcoin/src/net_processing.cpp

Lines 1462 to 1469 in 7c4e69c

// Let the peer know that we didn't find what it asked for, so it doesn't
// have to wait around forever. Currently only SPV clients actually care
// about this message: it's needed when they are recursively walking the
// dependencies of relevant unconfirmed transactions. SPV clients want to
// do that because they want to know about (and store and rebroadcast and
// risk analyze) the dependencies of transactions relevant to them, without
// having to download the entire memory pool.
connman->PushMessage(pfrom, msgMaker.Make(NetMsgType::NOTFOUND, vNotFound));

to say:

@sdaftuar
Member Author

@ajtowns Thanks for the review and good catch on those additional issues you spotted. I'll continue to work on this PR for master but I now think that we should revert this for 0.18 rather than try to make a fix and backport. Please see #15839.

@sdaftuar
Member Author

Are we failing to dequeue txn for non-witness peers that send us witness txn that we drop?

@gmaxwell Not that I can see. It looks to me like the only way a peer could "send" us a transaction that doesn't get dequeued is if it fails to deserialize successfully; once we've deserialized I don't think there's a code path that would not result in updating the data structures.

@gmaxwell
Contributor

So I don't forget: Instead of disconnecting, we can make the ordering in the random fetching biased based on the size() of the peer's outstanding INV queue and the number of expired entries... so hosts that INV-DoS us just end up de-preferred for fetching.

@DrahtBot
Contributor

DrahtBot commented Apr 18, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

@laanwj
Member

laanwj commented Apr 18, 2019

Removing this from 0.18.0 because of #14897

@laanwj laanwj removed this from the 0.18.0 milestone Apr 18, 2019
@laanwj laanwj removed this from Blockers in High-priority for review Apr 18, 2019
maflcko pushed a commit that referenced this pull request Apr 18, 2019
8602d8b Revert "Change in transaction pull scheduling to prevent InvBlock-related attacks" (Suhas Daftuar)

Pull request description:

  This is for 0.18, not master -- I propose we revert the getdata change for the 0.18 release, rather than try to continue patching it up.  It seems like we've turned up several additional bugs that slipped through initial review (see #15776, #15834), and given the potential severe consequences of these bugs I think it'd make more sense for us to delay releasing this code until 0.19.

  Since the bugfix PRs are getting review, I think we can leave #14897 in master, but we can separately discuss if it should be reverted in master as well if anyone thinks that would be more appropriate.

ACKs for commit 8602d8:

Tree-SHA512: 0389cefc1bc74ac47f87866cf3a4dcfe202740a1baa3db074137e0aa5859672d74a50a34ccdb7cf43b3a3c99ce91e4ba1fb512d04d1d383d4cc184a8ada5543f
Previously there was an implicit bound based on the handling of m_tx_announced,
but that approach is error-prone (particularly if we start automatically
removing things from that set).
This prevents a bug where the in-flight queue for our peers will not be
drained, resulting in not downloading any new transactions from our peers.

Thanks to ajtowns for reporting this bug.
@sdaftuar
Member Author

sdaftuar commented May 3, 2019

I've updated this PR and believe the bugs have now been fixed. I've also got a test, but with the delays involved it takes a very long time to run (ie the test has to sit around and wait for various timeouts). I'm guessing people don't want to include a test in the test suite (even an extended test) that would take tens of minutes to complete.

Would reviewers prefer me to switch the uses of GetTimeMicros() in the logic to GetTime() so that we can mock it, and then include a functional test that uses mocktime? If such tests are viewed as too contrived then I'll skip it and leave this PR as is.
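For readers unfamiliar with mocktime: the point is that GetTime() consults a test-settable mock time while GetTimeMicros() (as used by this logic at the time) returns real wall-clock time, so mocktime-driven functional tests can only fast-forward the former. Below is a simplified stand-in model of that mechanism, not the actual util/time code.

#include <atomic>
#include <cstdint>
#include <ctime>

// Illustrative model only: a clock that honours a test-settable mock time, the
// way GetTime()/SetMockTime() do, so a functional test can jump timers forward
// without actually waiting.
static std::atomic<int64_t> g_mock_time{0};

void SetMockTimeForTest(int64_t t) { g_mock_time.store(t); }

int64_t MockableGetTime()
{
    const int64_t mock = g_mock_time.load();
    return mock != 0 ? mock : static_cast<int64_t>(std::time(nullptr));
}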

@gmaxwell
Contributor

Thanks for making progress here. Testing now.

@sdaftuar
Member Author

Addressed some nits. Old version is 15834.1

@sdaftuar
Member Author

If the functional test is already written, I'd run it and certainly review it as an exercise to understand it.

@jonatack You can take a look at this if you'd like: sdaftuar@db8fc5a

Took a bit over 30 minutes the last time I ran it.

@ajtowns
Contributor

ajtowns commented May 29, 2019

utACK 308b767

@jonatack
Contributor

jonatack commented May 30, 2019

@jonatack You can take a look at this if you'd like: sdaftuar@db8fc5a

Took a bit over 30 minutes the last time I ran it.

Thanks @sdaftuar. Initial run passed in ~23 minutes.

((HEAD detached at sdaftuar/test-15834))$  test/functional/feature_tx_download.py
2019-05-30T12:38:57.455000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_gtfeuu95
2019-05-30T12:39:07.025000Z TestFramework (INFO): Nodes are setup with balances
2019-05-30T12:52:10.085000Z TestFramework (INFO): ---> Generated txid c9d4f3abccaef1315b216a9300efb64d2c91c37dc25fa451ce8af70dae7794b0
2019-05-30T12:52:11.170000Z TestFramework (INFO): ---> Mempools synced
2019-05-30T12:52:11.171000Z TestFramework (INFO): Testing transaction requests
2019-05-30T13:01:36.534000Z TestFramework (INFO): Stopping nodes
2019-05-30T13:01:36.847000Z TestFramework (INFO): Cleaning up /tmp/bitcoin_func_test_gtfeuu95 on exit
2019-05-30T13:01:36.848000Z TestFramework (INFO): Tests successful

(Edit: 2 more runs clocked in at 23 and 22 minutes. Am reviewing code and test.)

// it from our data structures for this peer.
auto in_flight_it = state->m_tx_download.m_tx_in_flight.find(inv.hash);
if (in_flight_it == state->m_tx_download.m_tx_in_flight.end()) {
// Skip any further work if this is a spurious NOTFOUND
Member

This comment is slightly confusing in this PR, but I think the code logic is correct. The idea is that if the tx isn't in_flight from you then there is no reason to try erasing it from m_tx_announced, because if it's in there you are still going to deal with it at some later point via m_tx_process_time.

Member

re e32e084#r291253714:

I think "spurious" is still fine. Why would a node announce a tx to you and then follow up with a NOTFOUND without any further communication?

Member Author

I think the questions around why I structured the loop this way (and why I made a comment about skipping "any further work") are motivated by the additional code I plan to add in #15505. See for instance: f5dbb49

@morcos
Member

morcos commented Jun 6, 2019

I reviewed the logic and tried to think through various failure modes and did a light review of the code.
light ACK 308b767

@laanwj
Member

laanwj commented Jun 10, 2019

Code review ACK 308b767

@sdaftuar sdaftuar changed the title from "Fix NOTFOUND bug and expire getdata requests for transactions" to "Fix transaction relay bugs introduced in #14897 and expire transactions from peer in-flight map" Jun 10, 2019
Member

@jamesob jamesob left a comment

Concept ACK 308b767

Mostly clarification comments. Going to ACK after I've had a chance to try to apply the tests @ajtowns wrote in #15776 to this PR.

CNodeState *state = State(pfrom->GetId());
std::vector<CInv> vInv;
vRecv >> vInv;
if (vInv.size() <= MAX_PEER_TX_IN_FLIGHT + MAX_BLOCKS_IN_TRANSIT_PER_PEER) {
Member

Was confused about why we're consulting MAX_BLOCKS_IN_TRANSIT_PER_PEER for tx-based logic, but I guess this is just to avoid wasting time reading too-long DoSy INVs?

Member

in commit e32e084 Remove NOTFOUND transactions from in-flight data structures:

doc-nit: Maybe add a comment // We only send NOTFOUNDs for transactions, but for any peer, this message should never be larger than all in-flight inventory items?

Member Author

Yep that is the reason I had in mind -- a peer might reasonably send us a NOTFOUND for something we sent a GETDATA for, but not more than that.

continue;
}
state->m_tx_download.m_tx_in_flight.erase(in_flight_it);
state->m_tx_download.m_tx_announced.erase(inv.hash);
Member

If you care to, I think the body of this loop can be written more simply as

if (inv.type == MSG_TX || inv.type == MSG_WITNESS_TX) {
    // If we receive a NOTFOUND message for a txid we requested, erase
    // it from our data structures for this peer.
    if (state->m_tx_download.m_tx_in_flight.erase(inv.hash) > 0) {
        state->m_tx_download.m_tx_announced.erase(inv.hash);
    }
}

Member

re e32e084#r291901132:

Looks fine, but should keep the "spurious" comment ofc.

@@ -3987,14 +3998,14 @@ bool PeerLogicValidation::SendMessages(CNode* pto)
// up processing to happen after the download times out
// (with a slight delay for inbound peers, to prefer
// requests to outbound peers).
RequestTx(&state, txid, nNow);
Member

RequestTx() does the same as the two lines below, but performs .size() checks on the tx maps and ensures the txid is in m_tx_announced. Are we not just calling that because we don't care to do those checks as belt-and-suspenders, or do we specifically want to avoid populating m_tx_announced? I guess those checks are safe to avoid given txid was already in tx_process_time, but seems like it wouldn't hurt to do them anyway.

Member Author

RequestTx() is inappropriate here and was a bug -- see @ajtowns' comment at #15834 (comment)

Member

Oops, sorry I missed that. Thanks for the pointer.
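For context, the fix that ended up in this branch reschedules the txid for later processing directly, instead of calling RequestTx(); roughly along these lines (a sketch, not a verbatim excerpt of the committed change):

            } else {
                // This transaction is in flight from someone else; queue
                // up processing to happen after the download times out
                // (with a slight delay for inbound peers, to prefer
                // requests to outbound peers).
                int64_t next_process_time = CalculateTxGetDataTime(txid, nNow, !state.fPreferredDownload);
                tx_process_time.emplace(next_process_time, txid);
            }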

@@ -699,7 +699,9 @@ void UpdateTxRequestTime(const uint256& txid, int64_t request_time) EXCLUSIVE_LO
void RequestTx(CNodeState* state, const uint256& txid, int64_t nNow) EXCLUSIVE_LOCKS_REQUIRED(cs_main)
{
CNodeState::TxDownloadState& peer_download_state = state->m_tx_download;
if (peer_download_state.m_tx_announced.size() >= MAX_PEER_TX_ANNOUNCEMENTS || peer_download_state.m_tx_announced.count(txid)) {
if (peer_download_state.m_tx_announced.size() >= MAX_PEER_TX_ANNOUNCEMENTS ||
peer_download_state.m_tx_process_time.size() >= MAX_PEER_TX_ANNOUNCEMENTS ||
Member

Maybe worth noting that I think m_tx_process_time is liable to outgrow m_tx_announced temporarily (e.g. we ask peer for txid, get back NOTFOUND, m_tx_announced entry immediately erased but m_tx_process_time entry hangs around until next SendMessages()) but I think this is negligible.

Member

Re 23163b7#r292250774:

That shouldn't matter, since something else must have gone horribly wrong when there are more than MAX_PEER_TX_ANNOUNCEMENTS == 100k.

Member

@maflcko maflcko left a comment

Looks good. Left some style-nits or questions. Feel free to ignore them

ACK 308b767

Checked that each commit looks like an improvement.
Didn't test or compile.


if (peer_download_state.m_tx_announced.size() >= MAX_PEER_TX_ANNOUNCEMENTS || peer_download_state.m_tx_announced.count(txid)) {
if (peer_download_state.m_tx_announced.size() >= MAX_PEER_TX_ANNOUNCEMENTS ||
peer_download_state.m_tx_process_time.size() >= MAX_PEER_TX_ANNOUNCEMENTS ||
peer_download_state.m_tx_announced.count(txid)) {
Member

in commit 23163b7 Add an explicit memory bound to m_tx_process_time:

style-nit: Could remove the .count() here and add the check below: if(!insert().second) return; // already have. (This is how the code looked in previous versions of Bitcoin Core)





@sdaftuar
Member Author

Thanks for the review @jamesob and @MarcoFalke. I'm going to leave the nits alone for now, and if I end up having to update this PR again I'll consider taking the nits at that point.

To other reviewers, I think what this PR needs most is additional testing (eg on mainnet).

@jonatack
Contributor

Light ACK 308b767.

Agree with additional testing. I've been looking at mainnet logs and the functional tests at sdaftuar@db8fc5a to see how they might be improved or run more quickly.

Rough breakdown of current test run times in feature_tx_download.py:

  • test_in_flight_max in particular is a bottleneck at generally 13-15 (up to 25) minutes
  • test_inv_block: between 1 and 75 seconds
  • test_tx_requests: also a hotspot at 9-10 minutes

After the Chaincode seminars these next couple weeks, will hopefully propose tests here or in a follow-up.

@sdaftuar
Member Author

One other thought for reviewers -- one thing I think we didn't consider very well when #14897 was merged is the effect this change has on the relay of dependent transactions.

As a reminder, when a batch of transactions is to be announced to a peer, bitcoind will sort the batch so that parents appear before children in the INV message. Before #14897, transactions announced by a given peer would be requested in the same order as the announcement (but it was possible for transactions announced by multiple peers to be received in an arbitrary order, of course).

When a parent arrives after a child, we must rely on the orphan map to reconstruct the chain, which is inefficient and error-prone (for anti-DoS reasons we limit the size of the orphan map, so too many orphans could lead to relay failures). Given that #14897 tries to randomize getdata requests, there is an increased potential for transactions to be fetched out of order, which could be problematic.

My understanding of this issue is that we don't randomize the order of getdata's for transactions announced for the first time by outbound peers. Consequently, my guess is that this is not too big a deal. But note that (a) our inbound peers will generally announce to us faster than our outbounds (because of the way our software biases to sending with lower poisson delays to outbound peers over inbound ones), and (b) in the future, proposals like Erlay might mean we do much more flooding over outbound links than inbound ones.

If we think this is in fact a problem, or will be in the future, then I think the easiest fix would be to not put any delay on transaction requests from inbound peers when we're learning of a transaction for the first time. I would like to hold off on addressing this issue in this PR, which I think is a strict improvement over where we are now, but I'm mentioning it in case others are more concerned about the potential severity.
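To illustrate the kind of follow-up change being floated here (purely hypothetical, not part of this PR): the getdata-time calculation could skip the inbound delay when the announcement is the first one we have seen for the txid. The function name and extra parameter below are invented for illustration.

#include <cstdint>

static constexpr int64_t INBOUND_PEER_TX_DELAY = 2 * 1000000;  // 2 seconds, in microseconds

// Hypothetical variant: only apply the inbound delay when we have already heard
// about this txid from someone else, so a first announcement -- even from an
// inbound peer -- is fetched promptly and parent/child ordering is less likely
// to be scrambled.
int64_t HypotheticalGetDataTime(int64_t current_time, bool use_inbound_delay, bool first_announcement)
{
    int64_t process_time = current_time;
    if (use_inbound_delay && !first_announcement) process_time += INBOUND_PEER_TX_DELAY;
    return process_time;
}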

@maflcko
Member

maflcko commented Jun 11, 2019

Looks like this has about 6 ACKs. Unless there are objections, this will be merged within the next couple of weeks.

I am running a node with extended logging for corner cases to see if anything obvious breaks.

@jamesob
Member

jamesob commented Jun 11, 2019

ACK 308b767

I've run the functional test @sdaftuar wrote (sdaftuar@db8fc5a) and verified that it fails before these changes and passes when cherry-picked on top of this branch.

I compiled and ran this on mainnet for a few hours. My datadir was somewhat out of date, so this code was tested in both IBD and at tip. I watched getmempoolinfo to verify its tx count looked normal and varied within reason.

Parsing debug.log shows robust tx transmission from a variety of peers during testing. Below, the message count and peer ID are shown:

Tue 11 16:51 james/src/bitcoin 308b76732f* (308b76732f) btc
$ grep -E "got inv.*peer=" /data/bitcoin/debug.log | cut -d= -f2 | sort | uniq -c

  26258 0
   5473 1
  16966 12
  25632 2
  23609 3
  25232 6
  24553 7
  25578 8
  25732 9

Tue 11 16:51 james/src/bitcoin 308b76732f* (308b76732f) btc
$ grep -E "sending getdata.*peer=" /data/bitcoin/debug.log | cut -d= -f2 | sort | uniq -c

   2513 0
    200 1
   1010 12
   2967 2
   2124 3
   2417 6
   1761 7
   2712 8
   2661 9

Testing for a few hours doesn't rule out weird edge cases, but gives me pretty good confidence this change is fine.

@jamesob
Member

jamesob commented Jun 11, 2019

@MarcoFalke isn't there an argument for merging this sooner rather than later so that it gets wider (if not incidental) test usage before 0.19 is released?

@maflcko
Member

maflcko commented Jun 12, 2019

Did some tests:

  • ✔️ Race between a tx and the same tx in a block is correctly timed out:

The excerpt of the debug log shows that peer 7 relays us a tx, which is included in a block shortly after. Then peer 400 announces the tx to us, and we think it is new based on our heuristic tests. Peer 400 won't send us the tx anymore (for whatever reason), and it correctly times out on our side.

$ cat ~/.bitcoin/debug.log  | grep --extended-regexp '(98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d|00000000000000000007f6557ae653c28b86f6346f411981b0618ea423ed5ec0 height)'
2019-06-11T20:06:14Z got inv: tx 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d  new peer=7
2019-06-11T20:06:14Z Requesting witness-tx 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d peer=7
2019-06-11T20:06:14Z AcceptToMemoryPool: peer=7: accepted 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d (poolsz 20531 txn, 43706 kB)
2019-06-11T20:06:16Z got inv: tx 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d  have peer=2
2019-06-11T20:06:17Z got inv: tx 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d  have peer=306
2019-06-11T20:06:17Z got inv: tx 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d  have peer=321
2019-06-11T20:06:18Z got inv: tx 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d  have peer=57
2019-06-11T20:09:52Z UpdateTip: new best=00000000000000000007f6557ae653c28b86f6346f411981b0618ea423ed5ec0 height=580289 version=0x3fffe000 log2_work=90.730011 tx=423305186 date='2019-06-11T20:09:29Z' progress=1.000000 cache=20.7MiB(155741txo) warning='39 of last 100 blocks have unexpected version'
2019-06-11T20:10:13Z received getdata for: witness-tx 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d peer=0
2019-06-11T20:13:34Z got inv: tx 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d  new peer=400
2019-06-11T20:13:36Z Requesting witness-tx 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d peer=400
2019-06-11T20:29:04Z timeout of inflight tx 98a4959bf387f385b66e81e3ee7e8d6900cd31567cbc90048eb44597540a988d; id 400, ua /Satoshi:0.17.1/, version 70015, addr 111.37.183.111:9685, tx_relay 1, tx_sendmempool 0, fee_filter 1000, services 000000000000040d, time_connected 972, last_tx 0, last_block 0, last_mempool 0
  • ✔️ A transaction is requested after it was accepted to the mempool (and later mined into a block), due to broken orphan handling (e.g. "Improve AlreadyHave" #7874). Those cases are then at least correctly handled via NOTFOUNDs.
$ cat ~/.bitcoin/debug.log  | grep --extended-regexp '(13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7|00000000000000000024c923fe8873778e437290e4958234922df101956fa4f4 height|e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6)'
2019-06-12T10:00:54Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  new peer=15
2019-06-12T10:00:54Z Requesting witness-tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7 peer=15
2019-06-12T10:00:54Z AcceptToMemoryPool: peer=15: accepted 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7 (poolsz 20535 txn, 41033 kB)
2019-06-12T10:00:55Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=2596
2019-06-12T10:00:55Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=3747
2019-06-12T10:00:55Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=57
2019-06-12T10:00:56Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=2963
2019-06-12T10:00:56Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=3616
2019-06-12T10:00:56Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=124
2019-06-12T10:00:56Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=3957
2019-06-12T10:00:56Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=1206
2019-06-12T10:00:57Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=4655
2019-06-12T10:00:57Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=4769
2019-06-12T10:00:57Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=4269
2019-06-12T10:00:57Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=4983
2019-06-12T10:00:58Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=4258
2019-06-12T10:00:58Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=4391
2019-06-12T10:00:59Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=4576
2019-06-12T10:00:59Z got inv: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7  have peer=1281
2019-06-12T10:01:01Z received getdata for: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7 peer=995
2019-06-12T10:01:01Z received getdata for: tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7 peer=4068
2019-06-12T10:16:04Z UpdateTip: new best=00000000000000000024c923fe8873778e437290e4958234922df101956fa4f4 height=580366 version=0x20800000 log2_work=90.731743 tx=423489088 date='2019-06-12T10:15:54Z' progress=1.000000 cache=78.5MiB(585276txo) warning='36 of last 100 blocks have unexpected version'
2019-06-12T10:35:10Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  new peer=5226
2019-06-12T10:35:11Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  new peer=3396
2019-06-12T10:35:11Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  new peer=3384
2019-06-12T10:35:11Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  new peer=4576
2019-06-12T10:35:11Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  new peer=1281
2019-06-12T10:35:12Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  new peer=19
2019-06-12T10:35:12Z Requesting witness-tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6 peer=19
2019-06-12T10:35:12Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  new peer=124
2019-06-12T10:35:12Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  new peer=7
2019-06-12T10:35:12Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  new peer=15
2019-06-12T10:35:12Z AcceptToMemoryPool: peer=19: accepted e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6 (poolsz 13361 txn, 29469 kB)
2019-06-12T10:35:12Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=3747
2019-06-12T10:35:13Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=4391
2019-06-12T10:35:13Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=4655
2019-06-12T10:35:13Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=3616
2019-06-12T10:35:13Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=4269
2019-06-12T10:35:14Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=5201
2019-06-12T10:35:14Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=2963
2019-06-12T10:35:14Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=1206
2019-06-12T10:35:15Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=57
2019-06-12T10:35:16Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=2596
2019-06-12T10:35:23Z got inv: tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6  have peer=4258
2019-06-12T10:36:12Z Requesting witness-tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6 peer=15
2019-06-12T10:36:12Z stored orphan tx e6dcc6451e35996ce4d823781196830d1731c01d9d3e0fe1683b6e38675c4bf6 (mapsz 39 outsz 80)
2019-06-12T10:36:12Z Requesting witness-tx 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7 peer=15
2019-06-12T10:36:12Z Process NOTFOUND for 13f15e0e6e0e0f19b3aa84a7c6d9715991ea5b8ccf8dad9002bee7c25b596aa7 id 15, ua /Satoshi:0.17.0.1/, version 70015, addr 34.222.37.152:8333, tx_relay 1, tx_sendmempool 0, fee_filter 1000, services 000000000000040d, time_connected 56403, last_tx 21, last_block 133, last_mempool 0

@maflcko
Member

maflcko commented Jun 12, 2019

ACK 308b767 (Tested two of the three bugs this pull fixes, see comment above)


@maflcko maflcko merged commit 308b767 into bitcoin:master Jun 12, 2019
maflcko pushed a commit that referenced this pull request Jun 12, 2019
Fix transaction relay bugs introduced in #14897 and expire transactions from peer in-flight map

308b767 Fix bug around transaction requests (Suhas Daftuar)
f635a3b Expire old entries from the in-flight tx map (Suhas Daftuar)
e32e084 Remove NOTFOUND transactions from in-flight data structures (Suhas Daftuar)
23163b7 Add an explicit memory bound to m_tx_process_time (Suhas Daftuar)
218697b Improve NOTFOUND comment (Suhas Daftuar)

Pull request description:

  #14897 introduced several bugs that could lead to a node no longer requesting transactions from one or more of its peers.  Credit to ajtowns for originally reporting many of these bugs along with an originally proposed fix in #15776.

  This PR does a few things:

  - Fix a bug in NOTFOUND processing, where the in-flight map for a peer was keeping transactions it shouldn't

  - Eliminate the possibility of a memory attack on the CNodeState `m_tx_process_time` data structure by explicitly bounding its size

  - Remove entries from a peer's in-flight map after 10 minutes, so that we should always eventually resume transaction requests even if there are other bugs like the NOTFOUND one

  - Fix a bug relating to the coordination of request times when multiple peers announce the same transaction

  The expiry mechanism added here is something we'll likely want to remove in the future, but is belt-and-suspenders for now to try to ensure we don't have other bugs that could lead to transaction relay failing due to some unforeseen conditions.

ACKs for commit 308b76:
  ajtowns:
    utACK 308b767
  morcos:
    light ACK 308b767
  laanwj:
    Code review ACK 308b767
  jonatack:
    Light ACK 308b767.
  jamesob:
    ACK 308b767
  MarcoFalke:
    ACK 308b767 (Tested two of the three bugs this pull fixes, see comment above)
  jamesob:
    Concept ACK 308b767
  MarcoFalke:
    ACK 308b767

Tree-SHA512: 8865dca5294447859d95655e8699085643db60c22f0719e76e961651a1398251bc932494b68932e33f68d4f6084579ab3bed7d0e7dd4ac6c362590eaf9414eda
sidhujag added a commit to syscoin/syscoin that referenced this pull request Jun 12, 2019
Fix transaction relay bugs introduced in bitcoin#14897 and expire transactions from peer in-flight map
maflcko pushed a commit that referenced this pull request Jun 16, 2019
fa55dd8 doc: Add release notes for 14897 & 15834 (MarcoFalke)

Pull request description:

  #14897 & #15834

ACKs for commit fa55dd:
  fanquake:
    ACK fa55dd8

Tree-SHA512: 301742191f3d0e9383c6fe455d18d1e153168728e75dd29b7d0a0246af1cf024cc8199e82a42d74b3e6f5b556831763e0170ed0cb7b3082c7e0c57b05a5776db
jasonbcox pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Dec 6, 2019
Summary:
They all are backported at once to avoid leaving master in a buggy state.

This is Core PR14897: bitcoin/bitcoin#14897

* Change in transaction pull scheduling to prevent InvBlock-related attacks

Co-authored-by: Suhas Daftuar <sdaftuar@gmail.com>

This is Core PR15834: bitcoin/bitcoin#15834

 * Remove NOTFOUND transactions from in-flight data structures

This prevents a bug where the in-flight queue for our peers will not be
drained, resulting in not downloading any new transactions from our peers.

Thanks to ajtowns for reporting this bug.

 * Add an explicit memory bound to m_tx_process_time

Previously there was an implicit bound based on the handling of m_tx_announced,
but that approach is error-prone (particularly if we start automatically
removing things from that set).

 * Improve NOTFOUND comment

 * Expire old entries from the in-flight tx map

If a peer hasn't responded to a getdata request, eventually time out the request
and remove it from the in-flight data structures.  This is to prevent any bugs in
our handling of those in-flight data structures from filling up the in-flight
map and preventing us from requesting more transactions (such as the NOTFOUND
bug, fixed in a previous commit).

Co-authored-by: Anthony Towns <aj@erisian.com.au>

 * Fix bug around transaction requests

If a transaction is already in-flight when a peer announces a new tx to us, we
schedule a time in the future to reconsider whether to download. At that future
time, there was a bug that would prevent transactions from being rescheduled
for potential download again (ie if the transaction was still in-flight at the
time of reconsideration, such as from some other peer). Fix this.

This is Core PR16196: bitcoin/bitcoin#16196

 * doc: Add release notes for 14897 & 15834

Test Plan:
  make check
  ./test/functional/test_runner.py --extended

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Subscribers: Fabien

Differential Revision: https://reviews.bitcoinabc.org/D4574
codablock pushed a commit to codablock/dash that referenced this pull request Apr 7, 2020
Fix transaction relay bugs introduced in bitcoin#14897 and expire transactions from peer in-flight map
codablock added a commit to dashpay/dash that referenced this pull request Apr 8, 2020
Backport bitcoin#14897 and bitcoin#15834 and modify it to work with Dash messages
jonspock pushed a commit to jonspock/devault that referenced this pull request Sep 29, 2020
jonspock pushed a commit to jonspock/devault that referenced this pull request Sep 29, 2020
jonspock pushed a commit to devaultcrypto/devault that referenced this pull request Oct 10, 2020
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021