-
Notifications
You must be signed in to change notification settings - Fork 35.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix many orphan problem and stuck block download problem #4431
Conversation
@@ -288,6 +288,7 @@ std::string HelpMessage(HelpMessageMode mode) | |||
strUsage += " -genproclimit=<n> " + _("Set the processor limit for when generation is on (-1 = unlimited, default: -1)") + "\n"; | |||
strUsage += " -help-debug " + _("Show all debugging options (usage: --help -help-debug)") + "\n"; | |||
strUsage += " -logtimestamps " + _("Prepend debug output with timestamp (default: 1)") + "\n"; | |||
strUsage += " -logips " + _("Include IP addresses in debug output") + "\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please always list the default like in the other help strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the default wasn't shown unless it was 1.
I've removed the debugging messages, and simplified the commits. Also removed the nodeid stuff, and the getblocks changes as these are in separate pulls. |
@sipa this pull is much more readable now. Also, I've included the commit that splits MarkBlockAsReceived into two (also MarkBlockAsRequested), as I felt it was confusing that it was being used to mark blocks as requested but which are still in flight. Hope this is ok, but if not, I can remove this commit. |
I hate to tell you after all the time you've spent on this, but headers-first will make most of this obsolete. It completely removes the concepts of "sync node" or "orphan blocks", and stops using "getblocks" altogether (fixing #4387 as well). Decent logic for disconnecting blocks when they're stalling will still be useful, but many of the apparently stuck problems just caused by us not knowing what to request from whom will be gone. |
@sipa Is there a PR yet for headers-first feature? |
@Drak Very soon! |
I appreciate you writing up fixes for this. Definitely important. Will this be merged soon? The stalling due to orphans is so bad that IMO, it makes the client mostly useless. I have to baby sit it and restart it. I finally got it going with lowered max connections. |
This will almost certantly not be merged. Issues with orphans are resolved by #4468. |
@gmaxwell Ah, thanks for linking that one over. Looking forward to seeing that one merged in, then. |
@rebroad 4468 is just the latest rebase of it. |
@gmaxwell This pull request is ready to merge (or at least has been at several stages), whereas #4468 isn't, and has never been so far - it currently breaks re-orgs, so it's clearly still under development, and no one knows when it will be ready to test. The issues this pull fixes are urgent, and have been for some time now - the amount of bandwidth being wasted by duplicate blocks and orphans is unimpressing quite a few people who've downloaded the software and expected better. What's the reason for not merging this now? #4468 can provide the benefit it provides when it's finally developed and ready, but I don't see why that should delay these urgent issues from being fixed now... |
#4468 is ready to merge except the testing infrastructure has not been updated to accommodate it, if not for that it would already be merged. Even with the it would already be merged except breaking pull tester in master would make all other pulls fail. It has no currently known problems and is suitable for testing, and I have several nodes running on it without issue. When I looked at this patch previously it was performing a number of risky behaviors like disconnecting peers on low arbitrary timeouts. A partial adjustment that introduces new vulnerabilities and doesn't address the larger performance problems is not a good use of review or testing resources. We certantly wouldn't be rushing something like that out the door, so the claim of urgency is not helpful here. |
@gmaxwell The default timeouts were address as soon as they were mentioned, and changed to 2 minutes as requested. I still mantain, however, that 2 minutes is far riskier than my suggested 10 seconds, as at a 2 minute timeout this can significantly cause atypical nodes to delay block propagation by them issuing the latest block inv as quickly as possible (before they've even got the headers), and then ignoring the getdata block request. With my code, this is mitigated significantly, and it's highly unlikely and unusual for a node to not start sending a block within 10 seconds of it being requested - this is based on over two years of testing. On what are you basing your argument? |
…e, or we've CaughtUp()
… to stop sending blocks.
…either disconnect the peer (as previous behaviour), or it will reselect a syncnode (e.g. when 500 blocks have been received from the syncnode).
85bb637
to
cf203c4
Compare
I though previously explained in this in adequate detail in your prior pull request that disconnected well behaving nodes: This kind approach makes the network fragile in a multitude of situations, including hosts on even slightly slow links. Effectively a two minute timeout increases the minimum bandwidth that can keep a node running bitcoin by more than 5 fold. This of behavior means that moderate or short lived DOS attacks (or just high utilization) can cause disconnection potentially allowing an attacker to partition the network. Yes— malicious nodes attacking the network can potentially slow down receiving blocks (and thats still true with your change). This is part of why the standard advice (and widespread practice) for mining nodes is to not expose them to the public network directly. Increasing partition risk, or reducing the ability to run slow or overloaded links at all isn't a good trade-off. I'd really encourage you to spend time testing headers first. |
@gmaxwell Strawman arguments aren't helping - no one is suggesting disconnecting well-behaving nodes. A node that doesn't start sending a block within 2 minutes of it being requested, I think we can both agree, is not a well-behaving node. Since the default behaviour in this pull is set to 2 minutes, I'm confused as to why you're focusing on this point. Is there anything in this pull or for that matter, the nine commits that this pull comprises of, that you have any objections to? |
It may be wortking fine— could even be the only honest node you're connected to. Start is the wrong description of what the changes do, they may have sent immediately, and you've received 99.9% of it by the time two minutes pass. This is also a tangent, the changes here are not primarily related to steady state block relay. |
@gmaxwell Are you saying you want me to change the code so that it disconnects nodes if they haven't managed to send an entire block within two minutes? Surely if it's send 99.9% then it would be counterproductive to start the download from scratch from another node. I'm sorry, I'm not understanding what your issue is still. Can you just state simply what you would need to see changed in this pull request before you'd be happy to ACK it? |
@pstratem Hasn't headers first been "right around the corner" for over 2 years now? |
headers-first is the consensus direction. It is not directly apparent, but many headers first-related patches are already in the tree, merged in tiny bites. Any notable work in this area should be based on top of headers-first, to avoid wasting work. |
Two years? no way. It was originally slated for 0.9 but it was too big to review and test all at once, so it was split up and missed that release. As Jeff notes much of it has already been merged. |
Automatic sanity-testing: FAILED BUILD/TEST, see http://jenkins.bluematt.me/pull-tester/p4431_cf203c45e2b2dadf74ec20494015680218dd5fa5/ for binaries and test log. This could happen for one of several reasons:
If you believe this to be in error, please ping BlueMatt on freenode or TheBlueMatt here. This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ |
This has taken me several days to code and test, and I've tested it thoroughly.
With the current code, that uses only one syncnode, with these changes, the node will never download orphan blocks. With net debugging enabled, you will get a good view of what it's doing.
It also downloads the blockchain very fast, and never allows more than 60 seconds (the default) of inactivity to occur, so recovers very quickly from stuck downloads. With my testing, this can safely be reduced to 10 seconds as if a downloads sticks for as long as 10 seconds, it never recovers, so rather than wait two minutes as the current code does (well, tries to do), it quickly switches to a new syncnode.
Please do test, and give comments. So far, with the orphan problem, many people are unable to run the node without limiting orphan blocks (which still doesn't fix the stuck download anyway).
Feel free to copy this branch and improve it. I've included quite a few commits to make it easier to see the development process, but I won't always have time to make the changes needed that people ask for, and from experience my pull requests often don't get pulled for some reason. I think this functionality is needed now based upon the issues people have been raising.