trying to sync new testnet node - failing due to "stalling block download" #11037
Comments
Adding reference to #9213 |
@MarcoFalke I see I already mentioned this in #9213. I forgot that I had done that. I am running a mainnet Bitcoin Core node, also v0.14.2, and don't see this issue there. I also don't see it with my Bitcoin Cash node:
I have made a copy of my testnet3 directory, and am willing to run any tests that could help diagnose the problem. |
If it's the same issue that I think is causing #9213 then I think you can bypass this behavior by restarting your bitcoind and limiting it to just one outbound peer (either with My guess is that your node is in a state where it's missing fewer than 16 blocks out of a group of 1024 or whatever the current download window is, and so on startup it immediately tries to download all from the first peer, and then when the second connection comes up it marks the first peer as stalling, causing a very quick timeout, which then repeats... If this guess is right then you'll see your node download a few blocks; the tip should advance by about 1000; and then all should be right with the world again if you restart without the outbound peer limitation. |
I'll try that, thanks. So far I've tried changing the timeout from 2 seconds to 10. It didn't help. Each peer now takes 10 seconds to get disconnected, but I still don't get any blocks:
I can't accept incoming connections, if that's relevant. They're all outgoing. And I have height=949516. It was height=949507 when I first posted this issue, but it did manage to grab 8 blocks when I disabled the disconnection altogether. |
Is this going to be an issue on mainnet when segwit activates? Is there a fix for it? |
I notice the blocks are pretty big:
I can't download that much in 2 seconds. I wonder if that's the issue? |
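As a rough back-of-the-envelope check of that worry, the sketch below computes the throughput needed to fetch one block inside the 2-second stalling timeout. The 1.5 MB block size is the figure quoted elsewhere in this thread; the function names are made up for illustration, and the model ignores latency and protocol overhead.

```python
# Assumption: the stalling timeout is the 2 seconds mentioned in this thread.
STALL_TIMEOUT_SECONDS = 2.0


def required_throughput(block_bytes: int, timeout: float = STALL_TIMEOUT_SECONDS) -> float:
    """Bytes per second needed to receive one block within the timeout.

    Ignores round-trip latency and protocol overhead, so the real
    requirement is somewhat higher.
    """
    return block_bytes / timeout


def megabits_per_second(bytes_per_second: float) -> float:
    """Convert bytes/s to megabits/s (1 Mbit = 1,000,000 bits)."""
    return bytes_per_second * 8 / 1_000_000


if __name__ == "__main__":
    big_block = 1_500_000  # illustrative 1.5 MB testnet block
    needed = required_throughput(big_block)
    print(f"{needed:.0f} bytes/s, i.e. {megabits_per_second(needed):.1f} Mbit/s")
```

Under these assumptions a 1.5 MB block needs a sustained 750,000 bytes/s (6 Mbit/s) from a single peer, which a laggy connection may well not deliver.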
Now that I'm past those big blocks I am able to sync the chain using more than 1 connection. I'm getting around 100 blocks per second again now. |
Ok great. Yeah, it occurred to me that bigger blocks on testnet (post-segwit-activation there) could exacerbate this issue further; we should try to improve this behavior soon for mainnet... |
Any idea why I've never seen this on mainnet? |
It's stuck again. It was flying through the blocks for a while:
but then it slowed way down again when it hit a big block:
and then it started with the 'stalling' messages:
It's been over 20 minutes and I've not seen any more blocks. It's still disconnecting peers one at a time:
I guess I'll switch back to maxconnections=1 and let the sync complete, unless there's some diagnostic use in keeping my node unsynced for now. |
I'll need to dig into this more to better understand it but my guess is that mainnet isn't as bad because the blocks are probably more uniform in size and download speed -- I'd guess that most testnet blocks are close to empty, so you might be more likely to quickly download ~1000 blocks and think the peer who is stuck serving the handful of big blocks in the download window is stalling? |
Changing |
@dooglus Thanks for reporting back, that is good to know. |
I doubt I've ever been 1024 blocks behind on mainnet. That would explain why I don't see this issue there.
|
Here's my (hand-waving) understanding:

I had 8 peers. Each peer can only have 16 blocks "in flight" at a time, and we only download blocks which are within 1024 blocks of our best block. That means at most 128 blocks (8 × 16) can be "in flight" at a time, far fewer than the 1024-block window. When the window gets full, i.e. when we would be trying to fetch a block 1024 ahead of our tip, the peer which is currently trying to download the block at tip+1 gets 2 seconds to complete, and is then disconnected.

During normal (mainnet) operation, most blocks are about the same size and take about the same time to download. We very rarely fill the 1024-block window, and so this stalling disconnect code very rarely runs.

On testnet, however, we have the occasional 1.5 MB block followed by thousands of 200-byte blocks. While peer 0 is downloading a 1.5 MB block (at tip+1), the other peers can very quickly download a thousand 200-byte blocks, filling the 1024-block window and causing peer 0 to get disconnected for "stalling". The assumption seems to be that if other peers were able to download 1000 blocks before peer 0 was able to download one block, it must be that peer 0 was "stalling". In this case, however, peer 0 wasn't being slow; the other peers were just lucky to all be downloading empty blocks and so got an easy ride to the end of the window.

This is unlikely to be an issue on mainnet, because we don't ever get lots of small/empty blocks in a row. If one peer does fall 1000 blocks behind the rest, it's likely that it is stalling and so should be disconnected. |
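The hand-waving explanation above can be turned into a toy timing model. This is a minimal sketch under loud assumptions, not Bitcoin Core's actual logic: every peer is given the same bandwidth, requests go out in batches of 16 with one round trip per batch, peer 0 fetches only the block at tip+1, and the other peers split the rest of the window. The constants 1024, 16 and 2 seconds come from the comment above; `Scenario`, the function names, the bandwidth and the round-trip figure are invented for illustration.

```python
from dataclasses import dataclass
from math import ceil

WINDOW = 1024          # blocks ahead of the tip we are willing to fetch
MAX_IN_FLIGHT = 16     # blocks in flight per peer
STALL_TIMEOUT = 2.0    # seconds granted to the tip+1 peer once the window is full


@dataclass
class Scenario:
    tip_block_bytes: int     # size of the block at tip+1 (fetched by peer 0)
    other_block_bytes: int   # size of every other block in the window
    peers: int               # number of outbound peers
    bytes_per_sec: float     # assumed per-peer bandwidth
    round_trip: float        # assumed seconds per request batch


def time_for_tip_block(s: Scenario) -> float:
    """Seconds peer 0 needs to deliver the block at tip+1."""
    return s.round_trip + s.tip_block_bytes / s.bytes_per_sec


def time_to_fill_window(s: Scenario) -> float:
    """Seconds the other peers need to fetch the remaining WINDOW - 1 blocks."""
    others = s.peers - 1
    blocks = WINDOW - 1
    batches = ceil(blocks / (others * MAX_IN_FLIGHT))
    transfer = blocks * s.other_block_bytes / (others * s.bytes_per_sec)
    return batches * s.round_trip + transfer


def peer0_disconnected(s: Scenario) -> bool:
    """True if peer 0 is still busy when the window fills and the timeout expires."""
    if s.peers < 2:
        # With a single peer nobody else can race ahead and fill the window.
        return False
    return time_for_tip_block(s) > time_to_fill_window(s) + STALL_TIMEOUT


# Illustrative scenarios; bandwidth and latency are made-up numbers.
testnet_like = Scenario(1_500_000, 200, peers=8, bytes_per_sec=100_000, round_trip=0.2)
uniform_blocks = Scenario(1_000_000, 1_000_000, peers=8, bytes_per_sec=100_000, round_trip=0.2)
single_peer = Scenario(1_500_000, 200, peers=1, bytes_per_sec=100_000, round_trip=0.2)

if __name__ == "__main__":
    for name, s in [("testnet-like", testnet_like),
                    ("uniform blocks", uniform_blocks),
                    ("single peer", single_peer)]:
        print(name, peer0_disconnected(s))
```

In this model only the testnet-like scenario gets peer 0 disconnected. With uniformly sized blocks the window takes far longer to fill than the tip block takes to arrive, and with a single peer there is no one to race ahead, which is consistent with the one-outbound-peer workaround suggested earlier in the thread.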
I've been trying to sync testnet for a few weeks now. It keeps getting stuck. The logs are full of messages about "stalling block download".
Here's an example:
I built from source on Linux, against tag v0.14.2.
Three minutes later I've still not made any progress:
Is this a problem with testnet? With having segregated witness activated? With my Internet being a bit laggy?