Fix many orphan problem and stuck block download problem #4431

rebroad · 2014-06-27T13:18:10Z

This has taken me several days to code and test, and I've tested it thoroughly.

With the current code, that uses only one syncnode, with these changes, the node will never download orphan blocks. With net debugging enabled, you will get a good view of what it's doing.

It also downloads the blockchain very fast, and never allows more than 60 seconds (the default) of inactivity to occur, so recovers very quickly from stuck downloads. With my testing, this can safely be reduced to 10 seconds as if a downloads sticks for as long as 10 seconds, it never recovers, so rather than wait two minutes as the current code does (well, tries to do), it quickly switches to a new syncnode.

Please do test, and give comments. So far, with the orphan problem, many people are unable to run the node without limiting orphan blocks (which still doesn't fix the stuck download anyway).

Feel free to copy this branch and improve it. I've included quite a few commits to make it easier to see the development process, but I won't always have time to make the changes needed that people ask for, and from experience my pull requests often don't get pulled for some reason. I think this functionality is needed now based upon the issues people have been raising.

Diapolo · 2014-06-27T14:06:49Z

src/init.cpp

@@ -288,6 +288,7 @@ std::string HelpMessage(HelpMessageMode mode)
    strUsage += "  -genproclimit=<n>      " + _("Set the processor limit for when generation is on (-1 = unlimited, default: -1)") + "\n";
    strUsage += "  -help-debug            " + _("Show all debugging options (usage: --help -help-debug)") + "\n";
    strUsage += "  -logtimestamps         " + _("Prepend debug output with timestamp (default: 1)") + "\n";
+    strUsage += "  -logips                " + _("Include IP addresses in debug output") + "\n";


Please always list the default like in the other help strings.

I thought the default wasn't shown unless it was 1.

rebroad · 2014-07-04T06:51:08Z

I've removed the debugging messages, and simplified the commits. Also removed the nodeid stuff, and the getblocks changes as these are in separate pulls.

rebroad · 2014-07-04T07:13:22Z

@sipa this pull is much more readable now. Also, I've included the commit that splits MarkBlockAsReceived into two (also MarkBlockAsRequested), as I felt it was confusing that it was being used to mark blocks as requested but which are still in flight. Hope this is ok, but if not, I can remove this commit.

sipa · 2014-07-05T17:21:53Z

I hate to tell you after all the time you've spent on this, but headers-first will make most of this obsolete. It completely removes the concepts of "sync node" or "orphan blocks", and stops using "getblocks" altogether (fixing #4387 as well).

Decent logic for disconnecting blocks when they're stalling will still be useful, but many of the apparently stuck problems just caused by us not knowing what to request from whom will be gone.

ghost · 2014-07-05T17:54:11Z

@sipa Is there a PR yet for headers-first feature?

sipa · 2014-07-05T17:56:01Z

@Drak Very soon!

rebroad · 2014-07-16T16:41:53Z

I realise some of this pull request may disappear when header-first is merged, but this is still a useful fix to the many orphan problem right now, and also improves debug.log information. Some ACKs would be appreciated. @sipa @gmaxwell @laanwj @Drak @gavinandresen @Diapolo thanks

ghost · 2014-08-26T23:08:44Z

I appreciate you writing up fixes for this. Definitely important.

Will this be merged soon? The stalling due to orphans is so bad that IMO, it makes the client mostly useless. I have to baby sit it and restart it. I finally got it going with lowered max connections.

gmaxwell · 2014-08-26T23:10:44Z

This will almost certantly not be merged. Issues with orphans are resolved by #4468.

ghost · 2014-08-26T23:14:34Z

@gmaxwell Ah, thanks for linking that one over. Looking forward to seeing that one merged in, then.

rebroad · 2014-08-29T17:20:29Z

@gmaxwell This could have been merged in the mean time though, I'd have thought. It was pull-requested quite some time before #4468 came about, too.

gmaxwell · 2014-08-29T17:25:02Z

@rebroad 4468 is just the latest rebase of it.

rebroad · 2014-09-03T00:52:32Z

@gmaxwell This pull request is ready to merge (or at least has been at several stages), whereas #4468 isn't, and has never been so far - it currently breaks re-orgs, so it's clearly still under development, and no one knows when it will be ready to test.

The issues this pull fixes are urgent, and have been for some time now - the amount of bandwidth being wasted by duplicate blocks and orphans is unimpressing quite a few people who've downloaded the software and expected better.

What's the reason for not merging this now? #4468 can provide the benefit it provides when it's finally developed and ready, but I don't see why that should delay these urgent issues from being fixed now...

gmaxwell · 2014-09-03T01:05:38Z

#4468 is ready to merge except the testing infrastructure has not been updated to accommodate it, if not for that it would already be merged. Even with the it would already be merged except breaking pull tester in master would make all other pulls fail. It has no currently known problems and is suitable for testing, and I have several nodes running on it without issue.

When I looked at this patch previously it was performing a number of risky behaviors like disconnecting peers on low arbitrary timeouts. A partial adjustment that introduces new vulnerabilities and doesn't address the larger performance problems is not a good use of review or testing resources. We certantly wouldn't be rushing something like that out the door, so the claim of urgency is not helpful here.

rebroad · 2014-09-03T01:46:36Z

@gmaxwell The default timeouts were address as soon as they were mentioned, and changed to 2 minutes as requested. I still mantain, however, that 2 minutes is far riskier than my suggested 10 seconds, as at a 2 minute timeout this can significantly cause atypical nodes to delay block propagation by them issuing the latest block inv as quickly as possible (before they've even got the headers), and then ignoring the getdata block request. With my code, this is mitigated significantly, and it's highly unlikely and unusual for a node to not start sending a block within 10 seconds of it being requested - this is based on over two years of testing. On what are you basing your argument?

…sage tracking.

…e, or we've CaughtUp()

… to stop sending blocks.

…either disconnect the peer (as previous behaviour), or it will reselect a syncnode (e.g. when 500 blocks have been received from the syncnode).

gmaxwell · 2014-09-03T01:58:00Z

I though previously explained in this in adequate detail in your prior pull request that disconnected well behaving nodes: This kind approach makes the network fragile in a multitude of situations, including hosts on even slightly slow links. Effectively a two minute timeout increases the minimum bandwidth that can keep a node running bitcoin by more than 5 fold. This of behavior means that moderate or short lived DOS attacks (or just high utilization) can cause disconnection potentially allowing an attacker to partition the network.

Yes— malicious nodes attacking the network can potentially slow down receiving blocks (and thats still true with your change). This is part of why the standard advice (and widespread practice) for mining nodes is to not expose them to the public network directly. Increasing partition risk, or reducing the ability to run slow or overloaded links at all isn't a good trade-off.

I'd really encourage you to spend time testing headers first.

rebroad · 2014-09-03T02:03:36Z

@gmaxwell Strawman arguments aren't helping - no one is suggesting disconnecting well-behaving nodes. A node that doesn't start sending a block within 2 minutes of it being requested, I think we can both agree, is not a well-behaving node.

Since the default behaviour in this pull is set to 2 minutes, I'm confused as to why you're focusing on this point. Is there anything in this pull or for that matter, the nine commits that this pull comprises of, that you have any objections to?

gmaxwell · 2014-09-03T02:06:33Z

It may be wortking fine— could even be the only honest node you're connected to. Start is the wrong description of what the changes do, they may have sent immediately, and you've received 99.9% of it by the time two minutes pass. This is also a tangent, the changes here are not primarily related to steady state block relay.

rebroad · 2014-09-03T02:12:20Z

@gmaxwell Are you saying you want me to change the code so that it disconnects nodes if they haven't managed to send an entire block within two minutes? Surely if it's send 99.9% then it would be counterproductive to start the download from scratch from another node.

I'm sorry, I'm not understanding what your issue is still. Can you just state simply what you would need to see changed in this pull request before you'd be happy to ACK it?

pstratem · 2014-09-03T02:16:06Z

@rebroad this is a fairly large patch for something which will be entirely irrelevant when #4468 is merged

I appreciate the work to improve orphans that stall IBD (it's annoying to me too!) but really making these changes with headers first right around the corner is unnecessary

NACK

rebroad · 2014-09-03T02:17:40Z

@pstratem Hasn't headers first been "right around the corner" for over 2 years now?

jgarzik · 2014-09-03T02:21:41Z

headers-first is the consensus direction. It is not directly apparent, but many headers first-related patches are already in the tree, merged in tiny bites.

Any notable work in this area should be based on top of headers-first, to avoid wasting work.

gmaxwell · 2014-09-03T02:26:55Z

Two years? no way. It was originally slated for 0.9 but it was too big to review and test all at once, so it was split up and missed that release. As Jeff notes much of it has already been merged.

BitcoinPullTester · 2014-09-03T02:54:13Z

Automatic sanity-testing: FAILED BUILD/TEST, see http://jenkins.bluematt.me/pull-tester/p4431_cf203c45e2b2dadf74ec20494015680218dd5fa5/ for binaries and test log.

This could happen for one of several reasons:

It chanages paths in makefile.linux-mingw or otherwise changes build scripts in a way that made them incompatible with the automated testing scripts (please tweak those patches in qa/pull-tester)
It adds/modifies tests which test network rules (thanks for doing that), which conflicts with a patch applied at test time
It does not build on either Linux i386 or Win32 (via MinGW cross compile)
The test suite fails on either Linux i386 or Win32
The block test-cases failed (lookup the first bNN identifier which failed in https://github.com/TheBlueMatt/test-scripts/blob/master/FullBlockTestGenerator.java)

If you believe this to be in error, please ping BlueMatt on freenode or TheBlueMatt here.

This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/
Contact BlueMatt on freenode if something looks broken.

laanwj · 2014-09-03T06:21:27Z

@rebroad Instead of impatiently complaining how slow headers-first is being merged you could also help testing and reviewing it. Or even build on top of it. You know the code is available as pull request #4468 and it fully works?

rebroad · 2014-09-03T06:50:59Z

@laanwj I am testing it and building on top of it, however, I'd always understood that we were supposed to submit pull requests built on-top of master, so I'm not sure how you would want me to submit patches without including the entirity of #4468 along with every pull request I make.

laanwj · 2014-09-03T10:26:49Z

@rebroad You could submit patches to @sipa directly. Otherwise, just make sure that you don't conflict with him or do duplicate work.

This was referenced Jun 27, 2014

ProcessBlock: ORPHAN BLOCK 751, prev=00xxxxxx #4353

Closed

Proposal: disconnect clients sending unrequested txs. #4338

Closed

Fix stuck block detection #4198

Closed

Diapolo reviewed Jun 27, 2014
View reviewed changes

rebroad closed this Jul 4, 2014

rebroad reopened this Jul 4, 2014

rebroad mentioned this pull request Jul 7, 2014

Headers-first synchronization #4468

Merged

laanwj added the P2P label Jul 31, 2014

rebroad added 7 commits September 3, 2014 08:52

Add CaughtUp() function (needed by later commits)

bcd2c9b

Rearrange ProcessMessages + add nLastDataPos variable for partial mes…

d36e39a

…sage tracking.

Don't request blocks when block invs received unless it's the syncnod…

673ed30

…e, or we've CaughtUp()

Force selection of a new syncnode since the current syncnode is about…

3fae9f4

… to stop sending blocks.

Replacement stalled block logic. Depending on the situation, it will …

48474db

…either disconnect the peer (as previous behaviour), or it will reselect a syncnode (e.g. when 500 blocks have been received from the syncnode).

Add commandline option for sync timeout

284fcfa

Rename MarkBlockAsReceived to MarkBlockAsRequested to reduce confusion.

917aca9

rebroad added 2 commits September 3, 2014 08:52

Show downloaded block size and time taken to download.

39e8166

Debug when a block is incoming and show how many blocks still to come.

cf203c4

rebroad force-pushed the FixManyOrphanProblem branch from 85bb637 to cf203c4 Compare September 3, 2014 01:52

laanwj closed this Sep 3, 2014

rebroad mentioned this pull request Sep 10, 2014

slowing down sync after some time #4886

Closed

simondlr mentioned this pull request Sep 26, 2014

Max orphan blocks problem when syncing. dogecoin/dogecoin#717

Closed

bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix many orphan problem and stuck block download problem #4431

Fix many orphan problem and stuck block download problem #4431

rebroad commented Jun 27, 2014

Diapolo Jun 27, 2014

rebroad Jun 27, 2014

rebroad commented Jul 4, 2014

rebroad commented Jul 4, 2014

sipa commented Jul 5, 2014

ghost commented Jul 5, 2014

sipa commented Jul 5, 2014

rebroad commented Jul 16, 2014

ghost commented Aug 26, 2014

gmaxwell commented Aug 26, 2014

ghost commented Aug 26, 2014

rebroad commented Aug 29, 2014

gmaxwell commented Aug 29, 2014

rebroad commented Sep 3, 2014

gmaxwell commented Sep 3, 2014

rebroad commented Sep 3, 2014

gmaxwell commented Sep 3, 2014

rebroad commented Sep 3, 2014

gmaxwell commented Sep 3, 2014

rebroad commented Sep 3, 2014

pstratem commented Sep 3, 2014

rebroad commented Sep 3, 2014

jgarzik commented Sep 3, 2014

gmaxwell commented Sep 3, 2014

BitcoinPullTester commented Sep 3, 2014

laanwj commented Sep 3, 2014

rebroad commented Sep 3, 2014

laanwj commented Sep 3, 2014

Fix many orphan problem and stuck block download problem #4431

Fix many orphan problem and stuck block download problem #4431

Conversation

rebroad commented Jun 27, 2014

Diapolo Jun 27, 2014

Choose a reason for hiding this comment

rebroad Jun 27, 2014

Choose a reason for hiding this comment

rebroad commented Jul 4, 2014

rebroad commented Jul 4, 2014

sipa commented Jul 5, 2014

ghost commented Jul 5, 2014

sipa commented Jul 5, 2014

rebroad commented Jul 16, 2014

ghost commented Aug 26, 2014

gmaxwell commented Aug 26, 2014

ghost commented Aug 26, 2014

rebroad commented Aug 29, 2014

gmaxwell commented Aug 29, 2014

rebroad commented Sep 3, 2014

gmaxwell commented Sep 3, 2014

rebroad commented Sep 3, 2014

gmaxwell commented Sep 3, 2014

rebroad commented Sep 3, 2014

gmaxwell commented Sep 3, 2014

rebroad commented Sep 3, 2014

pstratem commented Sep 3, 2014

rebroad commented Sep 3, 2014

jgarzik commented Sep 3, 2014

gmaxwell commented Sep 3, 2014

BitcoinPullTester commented Sep 3, 2014

laanwj commented Sep 3, 2014

rebroad commented Sep 3, 2014

laanwj commented Sep 3, 2014