Sometimes nodes send many duplicate blocks. This patch disconnects the peer that's lagging behind. #1382

Closed
wants to merge 1 commit into from

9 participants

@rebroad

This patch disconnects the peer that's lagging behind. This ideally shouldn't happen, but it can and does sometimes, so this code caters for that.

This code also becomes more important if the block download algorithm is changed in future (e.g. pull #1326, which is still in development), e.g. if concurrent block downloads from multiple peers are introduced and retries are sent to other peers when peers seem unresponsive, only to become responsive again later, by which time they are sending duplicate blocks. Obviously the unresponsive peers could be disconnected when deemed unresponsive, but my tests have shown it's more efficient to give them a chance to become responsive again (even though it does increase the chances of them sending duplicate blocks), and to disconnect them at that point.

Future patches could identify the duplicate blocks before they are downloaded, by the block size (looking up against blocks recently downloaded and against blocks requested from the peer sending the block). Ideally they'd be identified by something more specific than the block size, but currently block headers communicate only the size.

@TheBlueMatt

I really don't see the point in this. If we request a block and a node lags behind due to some network congestion, and then we get the block later, there is no reason to drop that peer. If the peer is constantly lagging behind, we may want to not request blocks from that peer, but dropping them could kill e.g. mobile clients.

@laanwj
Owner

NACK.

Only disconnect peers that really misbehave. Any disconnect rule you add has the potential to wreck the network in some (maybe unforeseen) circumstances. We should be really careful.

@rebroad

@TheBlueMatt @laanwj Please can you see Issue #1120 - do we really want to continue having nodes sending thousands of duplicate blocks to each other? If we don't disconnect the peer, then what are the alternative solutions to this problem? By the way, in case it wasn't obvious, this is relating to when a peer is catching up and is more than a thousand blocks behind the best block.

@rebroad rebroad Sometimes nodes send many duplicate blocks. This patch disconnects the peer that's running behind. This shouldn't happen, but it can and does sometimes, so this code caters for that.

This code also becomes more important as the block download algorithm is changed in future,
such as concurrent block downloads from multiple peers, and retries are sent to other peers
when peers seem unresponsive, only to become responsive again later, by which time
they are sending duplicate blocks.
50f7bbf
@csyangchen

Duplicated blocks (and txs as well!) are probably caused by the client sending getdata multiple times in the first place. It occurs when the workload is high, a typical situation during initial download. The operation turnaround time might exceed the request-resend timeout, triggering the duplicated requests.

So proving that a node intentionally sends duplicated blocks would need per-node tracking, which I don't think is worth the effort.

Also I'm wondering what the attack vector is for intentionally sending duplicated blocks/txs. Since the recv queue is rate-limited by size, it's more likely for a malicious node to send duplicated requests, to maximize the effectiveness of draining its neighbours' bandwidth.

@rebroad

@csyangchen this isn't meant to disconnect due to misbehaviour, but simply to reduce wasted bandwidth between two nodes. This is a problem still occurring in 0.6.3 with a good internet connection, so unless someone has an alternative suggestion for fixing this, I would appreciate some ACKs please.

@gmaxwell
Owner

I'm opposed to disconnecting peers for expected behavior.

@rebroad

@gmaxwell what do you suggest instead of disconnecting then? Or do you propose that we continue to waste bandwidth as is currently the case?

@gmaxwell
Owner

I don't completely understand the nature of the wasteful behavior and I can't reproduce it in 0.6.3, so I'm unable to make a suggestion. I'm just reasonably confident that disconnecting peers for expected behavior is not the right thing to do (and among other problems, poses partitioning risk).

The right thing to do, of course, would be to avoid the bad behaviour in the first place. Once it's avoided, then you could talk about possibly dumping peers which intentionally misbehave, although O(N) bandwidth attacks are not terribly interesting. (You can flood a peer without ever connecting to its bitcoin node, so there is little we can do.)

@jgarzik
Owner

This targets one highly specific flood, while ignoring all the other flooding possible within the protocol. An attacker can just pick another message to flood, including messages we don't notice, but do unpack-before-ignoring.

Given the other two naks, closing.

@jgarzik jgarzik closed this
@rebroad

Looking at the closing comment, I think this pull request has been misunderstood. It is not for protection against any sort of attack. The duplicate blocks happen naturally due to the way bitcoin works: the set timeouts, and the delays in the network due to processing. There is currently no way for a node to say "please stop sending me blocks" to a node that is responding to a request for blocks but took so long that another request was sent out to a second node, which started sending the blocks first. So the only way to save the wasted bandwidth (of both nodes) is to disconnect from the node that is lagging behind.

@exarkun

Just getting started with bitcoin, trying to initialize a wallet, and after days of waiting to catch up, it seems to me the issue being described here is real and some fix would be quite beneficial. By way of providing some real world data, here are some stats collected from my client's debug log:

exarkun@top:~/.bitcoin$ grep "already have block" debug.log  | wc -l
383867
exarkun@top:~/.bitcoin$ grep "ProcessBlock: ACCEPTED" debug.log  | wc -l
22486

That's nearly twenty times more duplicate blocks - redundant, useless traffic - than legitimate, helpful blocks. If there is no way to tell a peer to stop sending some blocks, and it is not desirable to disconnect from a peer which is now sending unneeded blocks (because they have been retrieved from somewhere else in the meanwhile), then it seems like some other fix at least is needed here. Perhaps the client shouldn't be so aggressive about sending out duplicate getblocks to multiple peers. However, as a temporary solution, disconnecting seems sensible to me. The client will establish connections to other peers (or even the same peers again) and continue exchanging data with them, and having gotten some blocks, not ask for them again over the new connection.

@rebroad

I'm glad to see other people confirming this issue. What other options are there, other than disconnecting the peer? If peers utilised more than one connection, then it would be possible to end the transfer without disconnecting the peer, but I'm not sure it's worth coding this just for that reason.

The problem is exacerbated by the fact that during block validation a peer stops responding to requests, so once the block validation completes, the peer suddenly sends a load of stuff that the requesting peer no longer needs. Another solution, then, would be to change the code to continue responding to requests even during block validation.

@jgarzik could this be re-opened please?

@sipa
Owner

The solution is to remember which block was requested from which peer, and not ask it again (unless the old one is really lagging).

@rebroad

@sipa how do you tell if the "old one is really lagging"?

@sipa
Owner

The core of the problem here is that we have two mechanisms:
1) when we think we don't have all blocks, pick a peer to download them all from
2) when a new block is announced and we don't have its parents, ask them from the peer who sent it

The problem is that these two mechanisms don't play nicely together, and when a new block is found during initial syncup, you end up doing both at once. That's purely a local implementation problem - there is no blame on either of the peers involved, so they should not be disconnected for doing what is asked of them.

The solution is writing something of a download manager, which keeps track of which peers have which blocks, tries to fetch them, and switches if a peer seems unresponsive. In my opinion, this is only really possible once we have headers-first sync, so the best chain is known before starting the actual block download. This would also mean the download can be parallelized.

That's not to say we don't need a more immediate solution. It's becoming more and more obvious that the sync mechanism is broken. I think we should aim for a simple solution now which solves the worst, and then really solve it when headers-first is implemented.

@rebroad

@sipa No one is suggesting disconnecting a peer for doing what is asked of it. We are suggesting disconnecting peers which are repeatedly, and in a sustained fashion, sending blocks which are not required.

You also haven't defined how to determine when a "peer seems unresponsive". I posit that this will always be a guess, and therefore unreliable. The simplest solution IMHO is to simply end the transfer of the unwanted blocks by the only mechanism currently available - disconnection. This would benefit the entire network, as currently a majority of traffic is wasted due to these disconnections not happening.

What is the disadvantage in disconnection? So far, we have discussed the advantages, and it doesn't appear anyone has suggested any disadvantages. The only reason for not doing this IMHO is if there are significant disadvantages.

I agree that a simple solution is needed, and I think this patch is it. I agree that a better solution could be developed later though.

@sipa
Owner

I don't think you get it. Peers only send blocks to us, because we ASK them to send us blocks. There is no code that just sends blocks - it's always in response to a getdata. If we receive the same block twice, it is because we asked it twice. That is our fault, not theirs. The problem is that we ask for blocks both as a sync mechanism, and in response to receiving an orphan.

As to determining unresponsiveness: some timeout between the getdata and the expected block message?

Regarding disadvantages: don't kill connections to nodes that behave as expected. You may lose the good ones.

@rebroad

@sipa I am aware that peers should only send blocks they've been asked for. There are already timeouts implemented, but these will always be arbitrary and only a guess to determine if a peer is lagging or not. They are therefore not reliable, and IMHO shouldn't be relied on. The problem cannot be fixed by tweaking timeouts, and if timeouts are set too long, then it will cause other problems too.

How do you define "good nodes"? The only nodes this patch will cause us to lose are bad ones - "bad ones" being nodes that are repeatedly and consistently sending us blocks we do not want or need. Labelling a node "good" just because it's doing what we've asked of it, isn't definitive, IMHO. It can be based on more criteria than this.

@sipa
Owner

Of course it is always a guess. You cannot know which nodes are good and which aren't for sure.

But this patch does not fix the problem. The problem is that we request the same block multiple times in certain cases, even when there is no reason to assume the peer is lagging. And then it disconnects peers that do what we ask them.

@rebroad

@sipa you're right of course that the code currently does request some blocks multiple times when it doesn't need to, but that is a separate issue and requires a separate patch to fix that. That fix, however, won't eradicate the situation that this patch mitigates.

There is an argument though that that patch should be a prerequisite patch to this one.

@sipa
Owner

I'll try to explain a bit more in detail why this is not a solution, but this is my last comment in this thread.

Assume we are doing initial block synchronization, so we're sending getblocks, receiving invs, downloading blocks, and sending a new getblocks. During that time, a new block is found. One node (typically one of the best!) is the first to announce this new block to us, so we download it, see that we don't have its parents, and we go on to request them from this good peer that was so kind to announce a new block to us. Now we are doing the getblocks/invs/getdata/blocks/getblocks sequence from two nodes at once: the original one we were synchronizing from, and the one that announced the new best block. You're eventually going to kill one of these - that will temporarily mitigate the situation of course, but the next time a new block is found while you're still not done synchronizing, the same thing repeats all over again. And every time, you have a large chance of killing your best-connected peer.

You say that requesting the same block multiple times is a separate issue. It is not, it is the only issue. And this is what needs fixing, not killing off peers which do the stupid thing we asked them to.

@rebroad

@sipa You're explaining things to me that I already know and understand, and I've already said I agree that there could be a patch to fix what you are describing in my previous comment, and that that patch might be better done before this one (which is still needed to cater for lagging nodes). You are failing to address the issue of lagging nodes providing blocks already received by lesser-lagging nodes, which is what this patch is intended for.

Having said that, I think that this patch is still better implemented now rather than waiting for the other patch to be done first.

@TheBlueMatt

I did some initial work on proper request management a while ago, but never got very far as I was working on bloom filter stuff, you can work on it more if you want: https://github.com/TheBlueMatt/bitcoin/commits/bloom%2Brelayblock

@gmaxwell
Owner

@rebroad What you are seeing is not due to "lagging peers", this is pretty easily tested. Sipa explained why you saw this. We only make the request out to one peer (except in the case sipa outlined) so lagging really has nothing to do with it.

@rebroad

@gmaxwell a block is requested from another peer if it's not received within 2 minutes. This 2 minutes is rather arbitrary and not a reliable way to determine that a block has failed to be downloaded. It will even request the same block from another peer even when the original peer is currently sending a block. The code behind this could do with some obvious improvements, and the 2 minute delay should be replaced with something not so time based, IMHO. Perhaps start the timer from the point at which block reception stops, rather than from the point of initial request.

@jcomeauictx

I've run into this several times while starting up new nodes. I've always managed to work around it, either by temporarily setting maxconnections=1 and listen=0, by adding another node under my control that has the current blockchain, or by simply waiting out the duplicate blocks. But if there's a fix available, collectively covering your ears and saying "lalalala I can't hear you" might not be the best approach. I appreciate rebroad's efforts to keep shining a light on this issue.

@sipa
Owner

See #3514 for a solution that doesn't kill peers that do what we ask.

Commits on May 29, 2012
  1. @rebroad

    Sometimes nodes send many duplicate blocks. This patch disconnects the peer that's running behind. This shouldn't happen, but it can and does sometimes, so this code caters for that.

    rebroad authored
    
    This code also becomes more important as the block download algorithm is changed in future,
    such as concurrent block downloads from multiple peers, and retries are sent to other peers
    when peers seem unresponsive, only to become responsive again later, by which time
    they are sending duplicate blocks.
Showing with 21 additions and 4 deletions.
  1. +19 −4 src/main.cpp
  2. +2 −0  src/net.h
23 src/main.cpp
@@ -1782,10 +1782,25 @@ bool ProcessBlock(CNode* pfrom, CBlock* pblock)
{
// Check for duplicate
uint256 hash = pblock->GetHash();
- if (mapBlockIndex.count(hash))
- return error("ProcessBlock() : already have block %d %s", mapBlockIndex[hash]->nHeight, hash.ToString().substr(0,20).c_str());
- if (mapOrphanBlocks.count(hash))
- return error("ProcessBlock() : already have block (orphan) %s", hash.ToString().substr(0,20).c_str());
+    if (mapBlockIndex.count(hash)) {
+        if (pfrom) {
+            pfrom->nDupBlocks++;
+            if (pfrom->nDupBlocks > 3) pfrom->fDisconnect = true;
+            printf("ProcessBlock() : already(%d) have block %d %s\n", pfrom->nDupBlocks, mapBlockIndex[hash]->nHeight, hash.ToString().substr(0,20).c_str());
+        } else
+            printf("ProcessBlock() : already have block %d %s\n", mapBlockIndex[hash]->nHeight, hash.ToString().substr(0,20).c_str());
+        return false;
+    }
+    if (mapOrphanBlocks.count(hash)) {
+        if (pfrom) {
+            pfrom->nDupBlocks++;
+            if (pfrom->nDupBlocks > 3) pfrom->fDisconnect = true;
+            printf("ProcessBlock() : already(%d) have block (orphan) %s\n", pfrom->nDupBlocks, hash.ToString().substr(0,20).c_str());
+        } else
+            printf("ProcessBlock() : already have block (orphan) %s\n", hash.ToString().substr(0,20).c_str());
+        return false;
+    }
+    if (pfrom) pfrom->nDupBlocks = 0; // reset the counter
// Preliminary checks
if (!pblock->CheckBlock())
2  src/net.h
@@ -148,6 +148,7 @@ class CNode
std::string addrName;
CService addrLocal;
int nVersion;
+ int nDupBlocks;
std::string strSubVer;
bool fOneShot;
bool fClient;
@@ -199,6 +200,7 @@ class CNode
addr = addrIn;
addrName = addrNameIn == "" ? addr.ToStringIPPort() : addrNameIn;
nVersion = 0;
+ nDupBlocks = 0;
strSubVer = "";
fOneShot = false;
fClient = false; // set by version message