Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix many orphan problem and stuck block download problem #4431

Closed
wants to merge 9 commits into from

Conversation

rebroad
Copy link
Contributor

@rebroad rebroad commented Jun 27, 2014

This has taken me several days to code and test, and I've tested it thoroughly.

With the current code, that uses only one syncnode, with these changes, the node will never download orphan blocks. With net debugging enabled, you will get a good view of what it's doing.

It also downloads the blockchain very fast, and never allows more than 60 seconds (the default) of inactivity to occur, so recovers very quickly from stuck downloads. With my testing, this can safely be reduced to 10 seconds as if a downloads sticks for as long as 10 seconds, it never recovers, so rather than wait two minutes as the current code does (well, tries to do), it quickly switches to a new syncnode.

Please do test, and give comments. So far, with the orphan problem, many people are unable to run the node without limiting orphan blocks (which still doesn't fix the stuck download anyway).

Feel free to copy this branch and improve it. I've included quite a few commits to make it easier to see the development process, but I won't always have time to make the changes needed that people ask for, and from experience my pull requests often don't get pulled for some reason. I think this functionality is needed now based upon the issues people have been raising.

@@ -288,6 +288,7 @@ std::string HelpMessage(HelpMessageMode mode)
strUsage += " -genproclimit=<n> " + _("Set the processor limit for when generation is on (-1 = unlimited, default: -1)") + "\n";
strUsage += " -help-debug " + _("Show all debugging options (usage: --help -help-debug)") + "\n";
strUsage += " -logtimestamps " + _("Prepend debug output with timestamp (default: 1)") + "\n";
strUsage += " -logips " + _("Include IP addresses in debug output") + "\n";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please always list the default like in the other help strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the default wasn't shown unless it was 1.

@rebroad rebroad closed this Jul 4, 2014
@rebroad rebroad reopened this Jul 4, 2014
@rebroad
Copy link
Contributor Author

rebroad commented Jul 4, 2014

I've removed the debugging messages, and simplified the commits. Also removed the nodeid stuff, and the getblocks changes as these are in separate pulls.

@rebroad
Copy link
Contributor Author

rebroad commented Jul 4, 2014

@sipa this pull is much more readable now. Also, I've included the commit that splits MarkBlockAsReceived into two (also MarkBlockAsRequested), as I felt it was confusing that it was being used to mark blocks as requested but which are still in flight. Hope this is ok, but if not, I can remove this commit.

@sipa
Copy link
Member

sipa commented Jul 5, 2014

I hate to tell you after all the time you've spent on this, but headers-first will make most of this obsolete. It completely removes the concepts of "sync node" or "orphan blocks", and stops using "getblocks" altogether (fixing #4387 as well).

Decent logic for disconnecting blocks when they're stalling will still be useful, but many of the apparently stuck problems just caused by us not knowing what to request from whom will be gone.

@ghost
Copy link

ghost commented Jul 5, 2014

@sipa Is there a PR yet for headers-first feature?

@sipa
Copy link
Member

sipa commented Jul 5, 2014

@Drak Very soon!

@rebroad
Copy link
Contributor Author

rebroad commented Jul 16, 2014

I realise some of this pull request may disappear when header-first is merged, but this is still a useful fix to the many orphan problem right now, and also improves debug.log information. Some ACKs would be appreciated. @sipa @gmaxwell @laanwj @Drak @gavinandresen @Diapolo thanks

@laanwj laanwj added the P2P label Jul 31, 2014
@ghost
Copy link

ghost commented Aug 26, 2014

I appreciate you writing up fixes for this. Definitely important.

Will this be merged soon? The stalling due to orphans is so bad that IMO, it makes the client mostly useless. I have to baby sit it and restart it. I finally got it going with lowered max connections.

@gmaxwell
Copy link
Contributor

This will almost certantly not be merged. Issues with orphans are resolved by #4468.

@ghost
Copy link

ghost commented Aug 26, 2014

@gmaxwell Ah, thanks for linking that one over. Looking forward to seeing that one merged in, then.

@rebroad
Copy link
Contributor Author

rebroad commented Aug 29, 2014

@gmaxwell This could have been merged in the mean time though, I'd have thought. It was pull-requested quite some time before #4468 came about, too.

@gmaxwell
Copy link
Contributor

@rebroad 4468 is just the latest rebase of it.

@rebroad
Copy link
Contributor Author

rebroad commented Sep 3, 2014

@gmaxwell This pull request is ready to merge (or at least has been at several stages), whereas #4468 isn't, and has never been so far - it currently breaks re-orgs, so it's clearly still under development, and no one knows when it will be ready to test.

The issues this pull fixes are urgent, and have been for some time now - the amount of bandwidth being wasted by duplicate blocks and orphans is unimpressing quite a few people who've downloaded the software and expected better.

What's the reason for not merging this now? #4468 can provide the benefit it provides when it's finally developed and ready, but I don't see why that should delay these urgent issues from being fixed now...

@gmaxwell
Copy link
Contributor

gmaxwell commented Sep 3, 2014

#4468 is ready to merge except the testing infrastructure has not been updated to accommodate it, if not for that it would already be merged. Even with the it would already be merged except breaking pull tester in master would make all other pulls fail. It has no currently known problems and is suitable for testing, and I have several nodes running on it without issue.

When I looked at this patch previously it was performing a number of risky behaviors like disconnecting peers on low arbitrary timeouts. A partial adjustment that introduces new vulnerabilities and doesn't address the larger performance problems is not a good use of review or testing resources. We certantly wouldn't be rushing something like that out the door, so the claim of urgency is not helpful here.

@rebroad
Copy link
Contributor Author

rebroad commented Sep 3, 2014

@gmaxwell The default timeouts were address as soon as they were mentioned, and changed to 2 minutes as requested. I still mantain, however, that 2 minutes is far riskier than my suggested 10 seconds, as at a 2 minute timeout this can significantly cause atypical nodes to delay block propagation by them issuing the latest block inv as quickly as possible (before they've even got the headers), and then ignoring the getdata block request. With my code, this is mitigated significantly, and it's highly unlikely and unusual for a node to not start sending a block within 10 seconds of it being requested - this is based on over two years of testing. On what are you basing your argument?

@gmaxwell
Copy link
Contributor

gmaxwell commented Sep 3, 2014

I though previously explained in this in adequate detail in your prior pull request that disconnected well behaving nodes: This kind approach makes the network fragile in a multitude of situations, including hosts on even slightly slow links. Effectively a two minute timeout increases the minimum bandwidth that can keep a node running bitcoin by more than 5 fold. This of behavior means that moderate or short lived DOS attacks (or just high utilization) can cause disconnection potentially allowing an attacker to partition the network.

Yes— malicious nodes attacking the network can potentially slow down receiving blocks (and thats still true with your change). This is part of why the standard advice (and widespread practice) for mining nodes is to not expose them to the public network directly. Increasing partition risk, or reducing the ability to run slow or overloaded links at all isn't a good trade-off.

I'd really encourage you to spend time testing headers first.

@rebroad
Copy link
Contributor Author

rebroad commented Sep 3, 2014

@gmaxwell Strawman arguments aren't helping - no one is suggesting disconnecting well-behaving nodes. A node that doesn't start sending a block within 2 minutes of it being requested, I think we can both agree, is not a well-behaving node.

Since the default behaviour in this pull is set to 2 minutes, I'm confused as to why you're focusing on this point. Is there anything in this pull or for that matter, the nine commits that this pull comprises of, that you have any objections to?

@gmaxwell
Copy link
Contributor

gmaxwell commented Sep 3, 2014

It may be wortking fine— could even be the only honest node you're connected to. Start is the wrong description of what the changes do, they may have sent immediately, and you've received 99.9% of it by the time two minutes pass. This is also a tangent, the changes here are not primarily related to steady state block relay.

@rebroad
Copy link
Contributor Author

rebroad commented Sep 3, 2014

@gmaxwell Are you saying you want me to change the code so that it disconnects nodes if they haven't managed to send an entire block within two minutes? Surely if it's send 99.9% then it would be counterproductive to start the download from scratch from another node.

I'm sorry, I'm not understanding what your issue is still. Can you just state simply what you would need to see changed in this pull request before you'd be happy to ACK it?

@pstratem
Copy link
Contributor

pstratem commented Sep 3, 2014

@rebroad this is a fairly large patch for something which will be entirely irrelevant when #4468 is merged

I appreciate the work to improve orphans that stall IBD (it's annoying to me too!) but really making these changes with headers first right around the corner is unnecessary

NACK

@rebroad
Copy link
Contributor Author

rebroad commented Sep 3, 2014

@pstratem Hasn't headers first been "right around the corner" for over 2 years now?

@jgarzik
Copy link
Contributor

jgarzik commented Sep 3, 2014

headers-first is the consensus direction. It is not directly apparent, but many headers first-related patches are already in the tree, merged in tiny bites.

Any notable work in this area should be based on top of headers-first, to avoid wasting work.

@gmaxwell
Copy link
Contributor

gmaxwell commented Sep 3, 2014

Two years? no way. It was originally slated for 0.9 but it was too big to review and test all at once, so it was split up and missed that release. As Jeff notes much of it has already been merged.

@BitcoinPullTester
Copy link

Automatic sanity-testing: FAILED BUILD/TEST, see http://jenkins.bluematt.me/pull-tester/p4431_cf203c45e2b2dadf74ec20494015680218dd5fa5/ for binaries and test log.

This could happen for one of several reasons:

  1. It chanages paths in makefile.linux-mingw or otherwise changes build scripts in a way that made them incompatible with the automated testing scripts (please tweak those patches in qa/pull-tester)
  2. It adds/modifies tests which test network rules (thanks for doing that), which conflicts with a patch applied at test time
  3. It does not build on either Linux i386 or Win32 (via MinGW cross compile)
  4. The test suite fails on either Linux i386 or Win32
  5. The block test-cases failed (lookup the first bNN identifier which failed in https://github.com/TheBlueMatt/test-scripts/blob/master/FullBlockTestGenerator.java)

If you believe this to be in error, please ping BlueMatt on freenode or TheBlueMatt here.

This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/
Contact BlueMatt on freenode if something looks broken.

@laanwj
Copy link
Member

laanwj commented Sep 3, 2014

@rebroad Instead of impatiently complaining how slow headers-first is being merged you could also help testing and reviewing it. Or even build on top of it. You know the code is available as pull request #4468 and it fully works?

@laanwj laanwj closed this Sep 3, 2014
@rebroad
Copy link
Contributor Author

rebroad commented Sep 3, 2014

@laanwj I am testing it and building on top of it, however, I'd always understood that we were supposed to submit pull requests built on-top of master, so I'm not sure how you would want me to submit patches without including the entirity of #4468 along with every pull request I make.

@laanwj
Copy link
Member

laanwj commented Sep 3, 2014

@rebroad You could submit patches to @sipa directly. Otherwise, just make sure that you don't conflict with him or do duplicate work.

@bitcoin bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants