Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expire inflight GETDATA requests #15776

Closed
wants to merge 7 commits into from

Conversation

Projects
None yet
9 participants
@ajtowns
Copy link
Contributor

commented Apr 9, 2019

I think #14897 introduces a livelock between peers who don't reply to GETDATA requests with TX data.

Scenario is likely: Alice sends Bob and INV for a new tx, Bob sends GETDATA for the tx, Alice responds with TX, Bob doesn't have the parent so sends GETDATA for the parent, adding it to the tx_in_flight set, Alice replies NOTFOUND due to the parent having expired from the relay set due to being in the mempool for over 15 minutes, so Bob doesn't remove the entry from the tx_in_flight set. Once this happens 100 times (MAX_PEER_TX_IN_FLIGHT), Bob will not request any more tx info via GETDATA from Alice. If this happens to all Bob's peers, Bob will not receive tx data except when blocks are found. If Bob is not reachable for inbound nodes, this can plausibly happen for all Bob's outbound nodes.

I think this happened for my node this week (running master-ish), causing my mempool to be empty, despite having 8 good peers who continued sending INV notifications.

Patches are:

  • debugging info for getpeerinfo to be able to tell if either of the queues are growing
  • logging of 'notfound' messages
  • test cases for 'notfound' response to getdata, and no response at all to getdata
  • fix to expire inflight tx's that are reported as 'notfound'
  • fix to disconnect nodes that have 100 inflight tx getdata requests as well as additional tx's we'd like to request that have been waiting for 30mins or more

@fanquake fanquake added the P2P label Apr 9, 2019

@sdaftuar

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

That seems like a potentially very bad bug. However at first glance I don't see how fPauseSend could be implicated as a trigger; I don't believe we ever drop requests on the floor if the send buffer fills up, we just wait until the buffer drains and then continue as before.

Actually, I realize now that one way this could maybe trigger is if our peer sends a NOTFOUND in response to a getdata, I don't think we would reduce the in flight count (I think this is fixed in #15505 but this seems like a bug in 0.18). So over time, as NOTFOUND's pile up, the in-flight queue would get full and we'd stop requesting transactions. So we should definitely take some kind of fix for this, yikes.

@fanquake fanquake added this to the 0.18.0 milestone Apr 9, 2019

@DrahtBot

This comment has been minimized.

Copy link
Contributor

commented Apr 9, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #15505 ([p2p] Request NOTFOUND transactions immediately from other outbound peers, when possible by sdaftuar)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@gmaxwell

This comment has been minimized.

Copy link
Contributor

commented Apr 9, 2019

We should perhaps disconnect a peer that is consistently getdata DOSing us... If all your peers do that (for whatever reason) you're not going to gain txn, regardless of what happens with this. So this fix only helps the case where a peer is randomly responding to some getdatas.

The fact that we'd stop asking for data from peers if they were not responding wasn't lost on me in the original changes, but this seemed like a good thing (I can't caught the notfound interaction, either through just missing it or due to havign 15505 on the brain).

@sdaftuar

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

I think we should do two things:
(a) Fix the bug in NOTFOUND processing -- if a peer sends a NOTFOUND for a transaction, we should remove it from the in-flight map for that peer. I think we should definitely do this and backport to 0.18.

(b) As a safeguard, we should also do something if a peer's in-flight count hits the max so that we're not downloading new transactions from them. I think there are two responses that make sense to me: either we eventually disconnect them, or we expire the in-flight entry and make room for a new request. For 0.18, I think expiry makes more sense, as it's more robust to other bugs that may be lurking in this new code. Eventually, I think disconnection is where I expect things to go in the long-run (which is what we do for block getdata's that are dropped by the peer).

Also, I think I spotted another bug in this code related to emptying out the tx_in_flight set -- if a different peer serves us a transaction that we requested from another, then the entry doesn't get removed (upon receiving a TX message, we only erase from the delivering peer's in_flight set). So a "DoS"er that helpfully blasted us with transactions would be able to blind us to transactions from our other peers. The expiry solution would mitigate this, which is a point in its favor for merging into 0.18, but we should fix this too I think if we can do so easily.

@ajtowns ajtowns force-pushed the ajtowns:201904-inflighttxtimeout branch from c5ce6af to f6d1857 Apr 10, 2019

@ajtowns

This comment has been minimized.

Copy link
Contributor Author

commented Apr 10, 2019

at first glance I don't see how fPauseSend could be implicated as a trigger

Yep, I think you're right.

Actually, I realize now that one way this could maybe trigger is if our peer sends a NOTFOUND in response to a getdata, I don't think we would reduce the in flight count

Right; I think a plausible scenario is "INV tx1, GETDATA tx1, TX tx1, oh I'm missing a parent for tx1, GETDATA tx1p, nope that's over 15m old and you're not doing MEMPOOL bloom filters, so NOTFOUND" -- and I can duplicate that behaviour with real (regtest) nodes, too. Seems like it should mostly only happen in real life if you've just reconnected to the network after some time offline, and there's a reasonable backlog in the mempool (as otherwise parents that are older than 15m are probably already in a block)

Updated the PR to remove tx's from notfound messages from the inflight set and to disconnect (and log) any outbound peers whose inflight queue gets full despite that. Has test cases for for both scenarios.

@ajtowns

This comment has been minimized.

Copy link
Contributor Author

commented Apr 10, 2019

Also, I think I spotted another bug in this code related to emptying out the tx_in_flight set -- if a different peer serves us a transaction that we requested from another, then the entry doesn't get removed (upon receiving a TX message, we only erase from the delivering peer's in_flight set).

Maybe I'm missing something, but I think this should be okay -- if it's added to in_flight then we've sent a GETDATA to that peer, and whether or not any other peer sends a TX message, that peer should send a TX or NOTFOUND message, at which point the entry's unconditionally removed from in_flight as on of the very first steps...

@ajtowns ajtowns force-pushed the ajtowns:201904-inflighttxtimeout branch from f6d1857 to 12df73f Apr 10, 2019

@@ -178,6 +178,8 @@ static UniValue getpeerinfo(const JSONRPCRequest& request)
obj.pushKV("banscore", statestats.nMisbehavior);
obj.pushKV("synced_headers", statestats.nSyncHeight);
obj.pushKV("synced_blocks", statestats.nCommonHeight);
obj.pushKV("tx_inflight", statestats.nTxInFlight);
obj.pushKV("tx_process", statestats.nTxProcess);

This comment has been minimized.

Copy link
@morcos

morcos Apr 11, 2019

Member

do these need to be added to the help text?

This comment has been minimized.

Copy link
@ajtowns

ajtowns Apr 12, 2019

Author Contributor

are they even worth keeping? i mostly added them for debugging, and i am using them in one of the test cases, but they're pretty obscure info.

This comment has been minimized.

Copy link
@ajtowns

ajtowns Apr 12, 2019

Author Contributor

added help text anyway

@gmaxwell

This comment has been minimized.

Copy link
Contributor

commented Apr 11, 2019

I think it's okay-- even preferable-- to not remove the tx when we get the tx from another peer. We should either get the tx or a not found or we should eventually stop requesting tx from that peer (e.g. because we disconnected them...)

Show resolved Hide resolved src/net_processing.cpp Outdated

@ajtowns ajtowns force-pushed the ajtowns:201904-inflighttxtimeout branch from 12df73f to ab5a0e5 Apr 12, 2019

@jonasschnelli

This comment has been minimized.

Copy link
Member

commented Apr 15, 2019

Not sure if it is related but today I found – for the second time – that my test 0.18.0RC3 node has a "stuck" mempool (size of 6 or so)

@fanquake fanquake added this to Blockers in High-priority for review Apr 15, 2019

@jonasschnelli

This comment has been minimized.

Copy link
Member

commented Apr 15, 2019

Just looked at the debug.log and my node gets a lot of invs (but no getdata response) and mempool won't increase.

@jonasschnelli

This comment has been minimized.

Copy link
Member

commented Apr 15, 2019

After an addnode, I could see getdata's and a growing mempool

LogPrint(BCLog::NET, "received notfound (%u invsz) peer=%d\n", vInv.size(), pfrom->GetId());

if (vInv.size() > 0) {
LogPrint(BCLog::NET, "received notfound for: %s peer=%d\n", vInv[0].ToString(), pfrom->GetId());

This comment has been minimized.

Copy link
@naumenkogs

naumenkogs Apr 15, 2019

Contributor

Could you explain what's going on here? Why we're logging only the first notfound INV?

This comment has been minimized.

Copy link
@ajtowns

ajtowns Apr 16, 2019

Author Contributor

It's doing the same logging as getdata which also only logs the first entry. Didn't seem terribly valuable to log every notfound entry seen.

@ajtowns

This comment has been minimized.

Copy link
Contributor Author

commented Apr 16, 2019

Scenario is likely: Alice sends Bob and INV for a new tx, Bob sends GETDATA for the tx, Alice responds with TX, Bob doesn't have the parent so sends GETDATA for the parent, adding it to the tx_in_flight set, Alice replies NOTFOUND due to the parent having expired from the relay set due to being in the mempool for over 15 minutes, so Bob doesn't remove the entry from the tx_in_flight set.

I added logging for when my node sends a notfound to try and see what the reasons were. It looks like most of the time these are for tx's that aren't even in my mempool at all anymore, though that varies between peers:

  • peer=0 -- 2821 not in mempool, 0 in mempool

  • peer=3 -- 1495 not in mempool, 0 in mempool

  • peer=1 -- 1092 not in mempool, 0 in mempool

  • peer=5 -- 401 not in mempool, 117 in mempool

  • peer=4 -- 308 not in mempool, 0 in mempool

  • peer=2 -- 146 not in mempool, 0 in mempool

  • peer=102 -- 73 in mempool, 14 not in mempool

  • peer=204 -- 111 in mempool, 3 not in mempool

For the ones that are in the mempool, the times are all just >15m as expected (ranging from 902s to 17432s -- almost 5 hours), and the parent explanation seems fine for them.

Not sure what's causing other nodes to request tx's that aren't in my mempool anymore though. For the couple I've looked at, they seem to have been confirmed in prior blocks, eg:

2019-04-16T03:36:38Z got inv: tx 4c5f3fdd242b0f352716fb3bccca810585e726c2f1b435e7fdbcfb8ffc1faa4c  have peer=90
2019-04-16T03:46:25Z UpdateTip: new best=000000000000000000166ca879430bcbd524b0e3a4b333cf9646a4a4cc03c54f height=571829
2019-04-16T04:06:43Z UpdateTip: new best=...a417 height=571830 
2019-04-16T04:12:32Z received getdata for: witness-tx 4c5f3fdd242b0f352716fb3bccca810585e726c2f1b435e7fdbcfb8ffc1faa4c peer=1
2019-04-16T04:12:32Z will send notfound for: 4c5f3fdd242b0f352716fb3bccca810585e726c2f1b435e7fdbcfb8ffc1faa4c -1s in mempool peer=1

or

2019-04-16T00:45:15Z got inv: tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39  new peer=1
2019-04-16T00:45:15Z Requesting witness-tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39 peer=1
2019-04-16T00:45:17Z got inv: tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39  have peer=6
2019-04-16T00:45:21Z got inv: tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39  have peer=5
2019-04-16T00:51:06Z Requesting block 00000000000000000022a55d495ebbec42ebcf1de23a7b6db3c4fc0361efd044 from  peer=49
2019-04-16T00:51:06Z UpdateTip: new best=00000000000000000022a55d495ebbec42ebcf1de23a7b6db3c4fc0361efd044 height=571814
2019-04-16T00:51:06Z received: cmpctblock (20107 bytes) peer=2
2019-04-16T02:23:15Z received getdata for: witness-tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39 peer=2
2019-04-16T02:23:15Z will send notfound for: c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39 -1s in mempool peer=2
@MarcoFalke

This comment has been minimized.

Copy link
Member

commented Apr 16, 2019

Could update the comment that says have to wait around forever. Currently only SPV clients actually care ?

Also, I was wondering if this could be limited to just the bugfix without the RPC changes (and maybe without the tests)? You could still provide the branch with tests to the reviewers to run, but I think review should focus on the bugfix and shouldn't be distracted by RPC changes or tests. No strong opinion though.

@MarcoFalke

This comment has been minimized.

Copy link
Member

commented Apr 16, 2019

2019-04-16T02:23:15Z received getdata for: witness-tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39 peer=2
2019-04-16T02:23:15Z will send notfound for: c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39 -1s in mempool peer=2

I presume you are using self-made logging, since I couldn't find that logprint in the code base?

Also, I can't follow how the mempool should influence getdata at all. We are using mapRelay to store the transaction pointers and it is populated at the same time the inv gets sent. (Of course a timeout of an entry in mapRelay might cause the issue mentioned here, but that shouldn't occur so frequently)

@sdaftuar
Copy link
Member

left a comment

For robustness' sake, I don't think we should disconnect peers yet for ignoring transaction getdata requests. The logic for transaction download is new in 0.18, and if there are other bugs yet to be shaken out then we risk disconnecting honest peers, which would also be a problem.

Can we go back to the expiry approach?

const auto& tx_download = state.m_tx_download;
if (tx_download.m_tx_in_flight.size() >= MAX_PEER_TX_IN_FLIGHT) {
if (!tx_download.m_tx_process_time.empty() && (tx_download.m_tx_process_time.begin()->first + GETDATA_TX_TIMEOUT) <= time_in_seconds * 1000000) {
LogPrintf("Disconnecting outbound peer %d due to tx request backlog (%d inflight, %d queued)\n", pto->GetId(), tx_download.m_tx_in_flight.size(), tx_download.m_tx_process_time.size());

This comment has been minimized.

Copy link
@sdaftuar

sdaftuar Apr 16, 2019

Member

This will disconnect both inbound and outbound peers, I think?

This comment has been minimized.

Copy link
@ajtowns

ajtowns Apr 17, 2019

Author Contributor

Yeah, fixed

@ajtowns

This comment has been minimized.

Copy link
Contributor Author

commented Apr 17, 2019

I presume you are using self-made logging, since I couldn't find that logprint in the code base?

Yeah, it's just

+                auto txinfo = mempool.info(inv.hash);
+                LogPrint(BCLog::NET, "will send notfound for: %s %ds in mempool peer=%d\n", inv.hash.ToString(), (txinfo.tx ? GetTime() - txinfo.nTime : -1), pfrom->GetId());

before vNotFound.push_back(inv); in ProcessGetData().

Also, I can't follow how the mempool should influence getdata at all. We are using mapRelay to store the transaction pointers and it is populated at the same time the inv gets sent. (Of course a timeout of an entry in mapRelay might cause the issue mentioned here, but that shouldn't occur so frequently)

Yeah -- if it hasn't expired from mapRelay it should be fine, I'm trying to understand why a getdata is being sent after expiry from mapRelay. Apparently that's almost always when it's also been removed from the mempool.

Ah! I think what's happening is I'm getting a tx A, which spends B and C, and AcceptToMemoryPool is noticing I don't have B and setting fMissingInputs to true, at which point I'm doing RequestTx for both B and C, because AlreadyHave(C) returns false for both because they're not in the mempool, and because B's first two outputs aren't in the UTXO set it doesn't notice that I have C via the UTXO set. (EDIT to add bolded text)

So the cases where I'm sending getdata and getting back notfound seem to me to be:

  • I'm unnecessarily requesting tx data I already have, because I don't have some other parent of a tx I just received (most of the time)
  • My peer isn't forwarding me a parent that's still in the mempool but has expired from mapRelayTx (rarely, but non-zero)

@ajtowns ajtowns force-pushed the ajtowns:201904-inflighttxtimeout branch from 599e5db to 595b040 Apr 17, 2019

@ajtowns

This comment has been minimized.

Copy link
Contributor Author

commented Apr 17, 2019

Can we go back to the expiry approach?

I don't see why not. Closing this PR in favour of #15834 . Fixed some issues in the branch anyway so it's still useful for reference.

@ajtowns ajtowns closed this Apr 17, 2019

@fanquake fanquake removed this from Blockers in High-priority for review Apr 17, 2019

MarcoFalke added a commit that referenced this pull request Apr 18, 2019

Merge #15839: [0.18] Revert GetData randomization change (#14897)
8602d8b Revert "Change in transaction pull scheduling to prevent InvBlock-related attacks" (Suhas Daftuar)

Pull request description:

  This is for 0.18, not master -- I propose we revert the getdata change for the 0.18 release, rather than try to continue patching it up.  It seems like we've turned up several additional bugs that slipped through initial review (see #15776, #15834), and given the potential severe consequences of these bugs I think it'd make more sense for us to delay releasing this code until 0.19.

  Since the bugfix PRs are getting review, I think we can leave #14897 in master, but we can separately discuss if it should be reverted in master as well if anyone thinks that would be more appropriate.

ACKs for commit 8602d8:

Tree-SHA512: 0389cefc1bc74ac47f87866cf3a4dcfe202740a1baa3db074137e0aa5859672d74a50a34ccdb7cf43b3a3c99ce91e4ba1fb512d04d1d383d4cc184a8ada5543f

MarcoFalke added a commit that referenced this pull request Jun 12, 2019

Merge #15834: Fix transaction relay bugs introduced in #14897 and exp…
…ire transactions from peer in-flight map

308b767 Fix bug around transaction requests (Suhas Daftuar)
f635a3b Expire old entries from the in-flight tx map (Suhas Daftuar)
e32e084 Remove NOTFOUND transactions from in-flight data structures (Suhas Daftuar)
23163b7 Add an explicit memory bound to m_tx_process_time (Suhas Daftuar)
218697b Improve NOTFOUND comment (Suhas Daftuar)

Pull request description:

  #14897 introduced several bugs that could lead to a node no longer requesting transactions from one or more of its peers.  Credit to ajtowns for originally reporting many of these bugs along with an originally proposed fix in #15776.

  This PR does a few things:

  - Fix a bug in NOTFOUND processing, where the in-flight map for a peer was keeping transactions it shouldn't

  - Eliminate the possibility of a memory attack on the CNodeState `m_tx_process_time` data structure by explicitly bounding its size

  - Remove entries from a peer's in-flight map after 10 minutes, so that we should always eventually resume transaction requests even if there are other bugs like the NOTFOUND one

  - Fix a bug relating to the coordination of request times when multiple peers announce the same transaction

  The expiry mechanism added here is something we'll likely want to remove in the future, but is belt-and-suspenders for now to try to ensure we don't have other bugs that could lead to transaction relay failing due to some unforeseen conditions.

ACKs for commit 308b76:
  ajtowns:
    utACK 308b767
  morcos:
    light ACK 308b767
  laanwj:
    Code review ACK 308b767
  jonatack:
    Light ACK 308b767.
  jamesob:
    ACK 308b767
  MarcoFalke:
    ACK 308b767 (Tested two of the three bugs this pull fixes, see comment above)
  jamesob:
    Concept ACK 308b767
  MarcoFalke:
    ACK 308b767

Tree-SHA512: 8865dca5294447859d95655e8699085643db60c22f0719e76e961651a1398251bc932494b68932e33f68d4f6084579ab3bed7d0e7dd4ac6c362590eaf9414eda

sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Jun 12, 2019

sidhujag
Merge bitcoin#15834: Fix transaction relay bugs introduced in bitcoin…
…#14897 and exp…  …ire transactions from peer in-flight map

308b767 Fix bug around transaction requests (Suhas Daftuar)
f635a3b Expire old entries from the in-flight tx map (Suhas Daftuar)
e32e084 Remove NOTFOUND transactions from in-flight data structures (Suhas Daftuar)
23163b7 Add an explicit memory bound to m_tx_process_time (Suhas Daftuar)
218697b Improve NOTFOUND comment (Suhas Daftuar)

Pull request description:

  bitcoin#14897 introduced several bugs that could lead to a node no longer requesting transactions from one or more of its peers.  Credit to ajtowns for originally reporting many of these bugs along with an originally proposed fix in bitcoin#15776.

  This PR does a few things:

  - Fix a bug in NOTFOUND processing, where the in-flight map for a peer was keeping transactions it shouldn't

  - Eliminate the possibility of a memory attack on the CNodeState `m_tx_process_time` data structure by explicitly bounding its size

  - Remove entries from a peer's in-flight map after 10 minutes, so that we should always eventually resume transaction requests even if there are other bugs like the NOTFOUND one

  - Fix a bug relating to the coordination of request times when multiple peers announce the same transaction

  The expiry mechanism added here is something we'll likely want to remove in the future, but is belt-and-suspenders for now to try to ensure we don't have other bugs that could lead to transaction relay failing due to some unforeseen conditions.

ACKs for commit 308b76:
  ajtowns:
    utACK 308b767
  morcos:
    light ACK 308b767
  laanwj:
    Code review ACK 308b767
  jonatack:
    Light ACK 308b767.
  jamesob:
    ACK 308b767
  MarcoFalke:
    ACK 308b767 (Tested two of the three bugs this pull fixes, see comment above)
  jamesob:
    Concept ACK 308b767
  MarcoFalke:
    ACK 308b767

Tree-SHA512: 8865dca5294447859d95655e8699085643db60c22f0719e76e961651a1398251bc932494b68932e33f68d4f6084579ab3bed7d0e7dd4ac6c362590eaf9414eda
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.