There are cases where blocks get pushed to the chain::database out-of-order #247

Closed
bytemaster opened this issue Aug 18, 2015 · 18 comments

@bytemaster

https://bitsharestalk.org/index.php/topic,17962.msg229944.html#msg229944

The fork_database::push_block method requires that blocks are submitted in order by the network.

There appear to be some cases where blocks are pushed out-of-order, causing syncing to halt. Our options are to:

  1. make the fork database robust against out-of-order insertion, or
  2. prevent blocks from being inserted out-of-order.

This may or may not resolve the issues discussed in the bitsharestalk thread.

@theoreticalbts

I think this has to be handled at the p2p level.

The p2p code relies on the backend to tell it whether a block is valid, so that it can kick peers that send invalid blocks. As we discovered earlier this week when designing the solution to #237, witnesses' block signing keys might suddenly change, so you cannot validate the witness signature on any descendant of the head block beyond its immediate child.

Handling this on the fork DB level (alternative 1 above) would mean the backend accepts blocks it cannot validate, which breaks the p2p logic -- a malicious peer would be able to invent huge numbers of invalid blocks with bogus previous block hashes that will never connect, and you'd put them all in your fork DB.

I think what you need to do instead is, if you receive an unconnected block, you place it into a lookaside buffer maintained by the p2p code, indexed by previous block ID. When a peer has k entries in the lookaside buffer (where k is some hard coded constant, e.g. k = 4), you should stop getting new blocks from them. The size of the lookaside buffer is then bounded by k*num_peers*max_block_size. Disconnecting a peer should also delete that peer's entries in the lookaside buffer.

Whenever you connect a new block, the p2p code should check the lookaside buffer for unconnected blocks whose previous block ID matches the new head block ID, and push all such blocks. If a block from the lookaside buffer does not validate, the consequences should propagate back to the peer (i.e. the peer should be kicked for offering a bogus block). So each lookaside entry needs to have the ID of the peer it came from. Indeed it should be indexed on that peer ID, so the entries for a peer can be removed when the peer disconnects. Since it's an in-memory container with multiple indexes, it seems that a Boost multi_index_container indexed on peer ID and head block ID would be a good data structure to implement the lookaside buffer.

You might be able to get away with only removing blocks for disconnected peers by deferring removal until a "cleanup pass" when the container becomes full (i.e. the number of entries exceeds k*num_peers, or possibly some slightly larger bound). The cleanup pass would iterate through the full container. The deferred-removal approach allows you to avoid multi_index_container and instead use a regular std::unordered_map whose key is the (previous block ID, block ID) pair.
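
For concreteness, here is a minimal sketch of the multi_index_container layout described above. The type names (peer_id_type, block_id_type, signed_block, lookaside_entry) are placeholder stand-ins for illustration, not the real graphene definitions:

    // Sketch only: placeholder types, not the real graphene definitions.
    #include <cstdint>
    #include <string>
    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/hashed_index.hpp>
    #include <boost/multi_index/member.hpp>
    #include <boost/multi_index/tag.hpp>

    using peer_id_type  = std::uint64_t;  // placeholder for the real peer handle
    using block_id_type = std::string;    // placeholder; graphene uses a hash type

    struct signed_block { block_id_type id; block_id_type previous; /* ... */ };

    struct lookaside_entry
    {
       peer_id_type  peer;      // peer that sent the block; kicked if it proves bogus
       block_id_type previous;  // block ID this unconnected block claims to extend
       signed_block  block;     // the unconnected block itself
    };

    namespace bmi = boost::multi_index;
    struct by_peer;      // erase all of a peer's entries when it disconnects
    struct by_previous;  // after each push, look up blocks extending the new head

    using lookaside_buffer = bmi::multi_index_container<
       lookaside_entry,
       bmi::indexed_by<
          bmi::hashed_non_unique< bmi::tag<by_peer>,
             bmi::member<lookaside_entry, peer_id_type, &lookaside_entry::peer> >,
          bmi::hashed_non_unique< bmi::tag<by_previous>,
             bmi::member<lookaside_entry, block_id_type, &lookaside_entry::previous> >
       >
    >;

With this layout, equal_range on the by_previous index after each successful push yields the cached children to retry, and equal_range on the by_peer index yields everything to drop when a peer disconnects.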

@abitmore

It happened on my witness node in puppies' testnet.

p2p.log attached: https://drive.google.com/open?id=0B3xrm70jSHn4NWNFTDBJLTJHZ00

At first it was syncing (p2p.log line 15185):

2015-08-18T02:04:13 p2p:message read_loop process_block_during ] Successfully pushed block 71842 (id:000118a2245ede2df39a1f4ae6da4e12f62f8232)          node.cpp:3109

Then my witness produced 71843 (default.log):

2015-08-18T02:04:12 th_a:Witness Block Production block_production_loo ] slot: 1 scheduled_witness: 1.6.6 scheduled_time: 2015-08-18T02:04:13 now: 2015-08-18T02:04:13                  witness.cpp:239
2015-08-18T02:04:13 th_a:invoke handle_block         handle_block ] Got block #71842 from network                       application.cpp:342
2015-08-18T02:04:13  th_a:invoke get_item             get_item ] Request for item {"item_type":1001,"item_hash":"000118a2245ede2df39a1f4ae6da4e12f62f8232"}                     application.cpp:437
2015-08-18T02:04:13  th_a:invoke get_item             get_item ] Serving up block #71842                        application.cpp:445
2015-08-18T02:04:13 th_a:Witness Block Production block_production_loo ] slot: 1 scheduled_witness: 1.6.5155 scheduled_time: 2015-08-18T02:04:14 now: 2015-08-18T02:04:14                       witness.cpp:239
2015-08-18T02:04:13 th_a:Witness Block Production block_production_loo ] Witness 1.6.5155 production slot has arrived; generating a block now...                        witness.cpp:242
2015-08-18T02:04:13 th_a:Witness Block Production block_production_loo ] Generated block #71843 with timestamp 2015-08-18T02:04:14 at time 2015-08-18T02:04:14                  witness.cpp:255
2015-08-18T02:04:13  th_a:invoke get_item             get_item ] Request for item {"item_type":1001,"item_hash":"000118a30cd6a49579d36ad1217b098a544a798a"}                     application.cpp:437
2015-08-18T02:04:13  th_a:invoke get_item             get_item ] Serving up block #71843                        application.cpp:445
2015-08-18T02:04:14 th_a:Witness Block Production block_production_loo ] slot: 1 scheduled_witness: 1.6.1439 scheduled_time: 2015-08-18T02:04:15 now: 2015-08-18T02:04:15                       witness.cpp:239
2015-08-18T02:04:15 th_a:Witness Block Production block_production_loo ] slot: 2 scheduled_witness: 1.6.1 scheduled_time: 2015-08-18T02:04:16 now: 2015-08-18T02:04:16                  witness.cpp:239
2015-08-18T02:04:16 th_a:Witness Block Production block_production_loo ] slot: 3 scheduled_witness: 1.6.0 scheduled_time: 2015-08-18T02:04:17 now: 2015-08-18T02:04:17                  witness.cpp:239
2015-08-18T02:04:17 th_a:Witness Block Production block_production_loo ] slot: 4 scheduled_witness: 1.6.5 scheduled_time: 2015-08-18T02:04:18 now: 2015-08-18T02:04:18                  witness.cpp:239
2015-08-18T02:04:18 th_a:Witness Block Production block_production_loo ] slot: 5 scheduled_witness: 1.6.4 scheduled_time: 2015-08-18T02:04:19 now: 2015-08-18T02:04:19                  witness.cpp:239
2015-08-18T02:04:19 th_a:Witness Block Production block_production_loo ] slot: 6 scheduled_witness: 1.6.1435 scheduled_time: 2015-08-18T02:04:20 now: 2015-08-18T02:04:20                       witness.cpp:239
2015-08-18T02:04:20  th_a:invoke get_item             get_item ] Request for item {"item_type":1001,"item_hash":"000118a30cd6a49579d36ad1217b098a544a798a"}                     application.cpp:437
2015-08-18T02:04:20  th_a:invoke get_item             get_item ] Serving up block #71843                        application.cpp:445
2015-08-18T02:04:20 th_a:invoke handle_block         handle_block ] Got block #71844 from network                       application.cpp:342
2015-08-18T02:04:20 th_a:invoke handle_block         handle_block ] Error when pushing block:
10 assert_exception: Assert Exception
itr != _index.get<block_id>().end():
    {}
    th_a  fork_database.cpp:51 push_block

    {"new_block":{"previous":"000118a3ca0d1e6e5737f248ddb53836aa6212e4","timestamp":"2015-08-18T02:04:16","witness":"1.6.1","next_secret_hash":"6449e9c56a67b58d886f2064e929f17f94fe78f1","previous_secret":"68c24af208df75b7b3d6570206492ef3bee5c5b9","transaction_merkle_root":"0000000000000000000000000000000000000000","extensions":[],"witness_signature":"1f4becadfa61a394b550ce6d086fb5d9eb3da8c883ac703048c7baee139414089f6777d1c03787d982160d23ecef7df3df3ea8713c15441247c30c4ef5568e22ac","transactions":[]}}
    th_a  db_block.cpp:173 _push_block                  application.cpp:364
2015-08-18T02:04:20 th_a:Witness Block Production block_production_loo ] slot: 7 scheduled_witness: 1.6.2 scheduled_time: 2015-08-18T02:04:21 now: 2015-08-18T02:04:21                  witness.cpp:239
2015-08-18T02:04:21 th_a:invoke handle_block         handle_block ] Got block #71845 from network                       application.cpp:342
2015-08-18T02:04:21 th_a:invoke handle_block         handle_block ] Error when pushing block:
10 assert_exception: Assert Exception
itr != _index.get<block_id>().end():
    {}
    th_a  fork_database.cpp:51 push_block

    {"new_block":{"previous":"000118a4b95f7448500f912995feeffe47e61773","timestamp":"2015-08-18T02:04:17","witness":"1.6.0","next_secret_hash":"6449e9c56a67b58d886f2064e929f17f94fe78f1","previous_secret":"68c24af208df75b7b3d6570206492ef3bee5c5b9","transaction_merkle_root":"0000000000000000000000000000000000000000","extensions":[],"witness_signature":"2041523e0275f0ce506c797145c86f2b5ed5fb7d9eb9cff0f1cb9364e5a06c327e535dfd76846d56824d41ad69e502d2b88679b7060b33200128991dacf92f97ae","transactions":[]}}
    th_a  db_block.cpp:173 _push_block                  application.cpp:364

Then it stopped syncing (p2p.log line 15329):

2015-08-18T02:04:20 p2p:message read_loop process_block_during ] Failed to push block 71844 (id:000118a4b95f7448500f912995feeffe47e61773), client rejected block sent by peer           node.cpp:3198

It recovered after 12 minutes (p2p.log line 41458):

2015-08-18T02:30:32 p2p:send_sync_block_to_node_delegate send_sync_block_to_n ] Successfully pushed sync block 71843 (id:000118a3ca0d1e6e5737f248ddb53836aa6212e4)          node.cpp:2789

Then it continued syncing (p2p.log line 47464):

2015-08-18T02:30:36 p2p:send_sync_block_to_node_delegate send_sync_block_to_n ] Successfully pushed sync block 72042 (id:0001196a1eb9d0e2021fd4f6ac9fe452ea46d4b3)          node.cpp:2789

Then it tried to produce a block but failed (default.log):

2015-08-18T02:30:36 th_a:Witness Block Production block_production_loo ] slot: 1358 scheduled_witness: 1.6.5155 scheduled_time: 2015-08-18T02:30:37 now: 2015-08-18T02:30:37            witness.cpp:239
2015-08-18T02:30:36 th_a:Witness Block Production block_production_loo ] Witness 1.6.5155 production slot has arrived; generating a block now...            witness.cpp:242
2015-08-18T02:30:36 th_a:Witness Block Production          _push_block ] Failed to push new block:
3070000 undo_database_exception: undo database exception
The database does not have enough undo history to support a blockchain with so many missed blocks. Please add a checkpoint if you would like to continue applying blocks beyond this point.
    {"recently_missed":2716,"max_undo":1000}
    th_a  db_update.cpp:68 update_global_dynamic_data

    {"next_block.block_num()":72043}
    th_a  db_block.cpp:448 _apply_block         db_block.cpp:167

Then it got permanently stuck (p2p.log line 47709):

2015-08-18T02:30:37 p2p:send_sync_block_to_node_delegate send_sync_block_to_n ] Failed to push sync block 72043 (id:0001196bf9cf506fb01511e20c8566694b952a7d): client rejected sync block sent by peer: {"code":10,"name":"assert_exception","message":"Assert Exception","stack":[{"context":{"level":"error","file":"db_block.cpp","line":562,"method":"validate_block_header","hostname":"","thread_name":"th_a","timestamp":"2015-08-18T02:30:37"},"format":"_pending_block.timestamp <= next_block.timestamp: ","data":{"_pending_block.timestamp":"2015-08-18T02:30:37","next":"2015-08-18T02:08:00","blocknum":72043}},{"context":{"level":"warn","file":"db_block.cpp","line":448,"method":"_apply_block","hostname":"","thread_name":"th_a","timestamp":"2015-08-18T02:30:37"},"format":"","data":{"next_block.block_num()":72043}},{"context":{"level":"warn","file":"db_block.cpp","line":173,"method":"_push_block","hostname":"","thread_name":"th_a","timestamp":"2015-08-18T02:30:37"},"format":"","data":{"new_block":{"previous":"0001196a1eb9d0e2021fd4f6ac9fe452ea46d4b3","timestamp":"2015-08-18T02:08:00","witness":"1.6.0","next_secret_hash":"031a792684e0d99dab315ce6aea4fa92c85c6707","previous_secret":"cd722b49f39a4b7160e081ae2c26cdf71f8e9194","transaction_merkle_root":"0000000000000000000000000000000000000000","extensions":[],"witness_signature":"1f64cf6d9e48683ed2259772c5006c0a1ec195dc280c59b5941250fa1581e8b86c0d32ed49d47d9d57cc87b3088e2f52bc15c7dc500955fa334864a3ed7b08be63","transactions":[]}}},{"context":{"level":"warn","file":"application.cpp","line":373,"method":"handle_block","hostname":"","thread_name":"th_a","timestamp":"2015-08-18T02:30:37"},"format":"","data":{"blk_msg":{"block":{"previous":"0001196a1eb9d0e2021fd4f6ac9fe452ea46d4b3","timestamp":"2015-08-18T02:08:00","witness":"1.6.0","next_secret_hash":"031a792684e0d99dab315ce6aea4fa92c85c6707","previous_secret":"cd722b49f39a4b7160e081ae2c26cdf71f8e9194","transaction_merkle_root":"0000000000000000000000000000000000000000","extensions":[],"witness_signature":"1f64cf6d9e48683ed2259772c5006c0a1ec195dc280c59b5941250fa1581e8b86c0d32ed49d47d9d57cc87b3088e2f52bc15c7dc500955fa334864a3ed7b08be63","transactions":[]},"block_id":"0001196bf9cf506fb01511e20c8566694b952a7d"},"sync_mode":true}}]}           node.cpp:2813

//Edit: link of log file updated.

@bytemaster

I agree somewhat with what @theoreticalbts has to say, but he goes too far in pushing this off on the network code.

We currently have a vulnerability where an attacking node could produce fake blocks that do link to existing blocks, and those blocks would get pushed into the fork_database.

We also know that the active witnesses can only change once per maintenance interval, which means that, regardless of the content of any individual block, we know the valid set of witnesses for thousands of blocks into the future 99% of the time (assuming a 1-day maintenance interval). We also know exactly when the next maintenance block will be.

We also know with a high degree of probability that the set of witnesses is unlikely to change by much from day to day.

We also know that blocks cannot be produced "in the future".

Given that information, the network layer would filter all blocks that are "in the future", and the fork_database can easily restrict unlinked blocks to those signed by the current set of active witnesses.

This in turn will limit the potential "attackers" to the current set of active witnesses.

We also know that the same witness must have at least half a round between their slots.

With this information we can simply limit the unlinked blocks we cache to at most one block per active witness per block number (see the sketch below). With this limitation in place, it would require a witness to "double-sign" before it would even be theoretically possible for a node to get "stuck". A double-signing witness is something we would love to detect so that they could automatically be fired.
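
A rough sketch of that cap, under assumed placeholder types; cache_unlinked_block and the type names here are hypothetical illustrations, not the actual patch:

    #include <cstdint>
    #include <map>
    #include <stdexcept>
    #include <string>
    #include <utility>

    using witness_id_type = std::uint64_t;  // placeholder
    using block_id_type   = std::string;    // placeholder

    // (witness, block number) -> the single cached unlinked block for that slot
    std::map<std::pair<witness_id_type, std::uint32_t>, block_id_type> unlinked_cache;

    // Returns true if this block was cached; throws when the same witness has
    // already signed a *different* block at this height (double-sign evidence).
    bool cache_unlinked_block( witness_id_type witness, std::uint32_t block_num,
                               const block_id_type& id )
    {
       auto key = std::make_pair( witness, block_num );
       auto itr = unlinked_cache.find( key );
       if( itr == unlinked_cache.end() )
       {
          unlinked_cache.emplace( key, id );
          return true;
       }
       if( itr->second != id )
          throw std::runtime_error( "witness double-signed at this block number" );
       return false; // already cached this exact block
    }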

My ultimate conclusion is that this is something for the chain database / fork database to resolve and not something for the networking code to resolve. The networking code is already complex enough and it is far easier to reason about a set of blocks outside that code.

@bytemaster

With the changes described above we have one last corner case:

A new witness is elected after a maintenance block, and a node receives that witness's block prior to receiving the maintenance block.

If this were to happen, the block would be "rejected" and the peer that provided it would be disconnected. The network will have marked that it already received and rejected the block, and would not fetch it again, leaving the user stuck.

The solution is:

    if next_block_should_be_maintenance_block
       if received_block.block_num > head_block.block_num + 1
          saved_blocks.push(received_block)
          throw error pushing received_block
       else // received_block is the maintenance block
          push received_block (assuming it links)
          for( saved_block in saved_blocks )
             push( saved_block )
          saved_blocks.clear()

Under this algorithm the network layer would catch the errors and punish any nodes that accumulate too many errors pushing blocks around the maintenance interval.

@bytemaster

There is another corner case with the proposed algorithm of only allowing nodes to push blocks from the current set of active witnesses.

If there is a fork which branches off prior to a maintenance block, and the maintenance block changes the set of active witnesses, then the user would be unable to accept legitimate blocks from the new fork. In this case the user could get "stuck".

This means that syncing to a fork that branches prior to a maintenance block that changes the active set of witnesses would become impossible.

@bytemaster

If the scenario described above were to occur, the minority-fork witnesses would get "stuck" and be unable to produce blocks after less than 1 hour. If the split was close enough, then even the majority fork would get stuck after slightly longer than 1 hour if the minority fork did not switch over.

We are left with the challenge of only accepting blocks older than a maintenance interval IF they fill a missing slot AND they link to the current chain by hash. With this restriction, an attacking node could construct a fake chain that "fills" each of the missing slots from the main chain. Each node only gets to submit a SINGLE history, so it would mean an attack of MISSED_BLOCKS_IN_MAIN_CHAIN * NUMBER_OF_CONNECTED_ATTACKING_PEERS blocks,

where MISSED_BLOCKS_IN_MAIN_CHAIN only includes blocks missed in the last 1024 blocks. This means that there is an upper bound on the amount of memory an attacker could consume of 1024 * CONNECTED_ATTACKING_PEERS * MAX_BLOCK_SIZE.

This is a reasonable number and probably an acceptable attack surface.

@bytemaster

This issue needs to be divided into two parts:

  1. caching / recovering from out-of-order blocks.
  2. preventing abuse of the caching with bogus blocks.

The fork_database only allows blocks within head - 1024 < ALLOWED_BLOCK_NUM < head + 32, and an attacking peer can only present a single chain, so the maximum damage an attacker could do would be to produce 1056 bogus blocks even if no other protections were in place.

We are already protected by the timestamps, which cannot be from the future. We are also protected by the maximum undo history, which makes any block older than the maximum undo history unreachable. So if we have ~99% witness participation, the undo history will be very small, and therefore the attack surface will be even smaller (just a few blocks at most).
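
For concreteness, the acceptance window could look like the following check. This is a sketch only; the function name is hypothetical, and the constants 1024 and 32 are taken from the discussion above:

    #include <cstdint>

    // head_num - 1024 < block_num < head_num + 32, rearranged so the
    // subtraction cannot underflow an unsigned value near genesis
    bool block_num_in_window( std::uint32_t head_num, std::uint32_t block_num )
    {
       return block_num + 1024 > head_num && block_num < head_num + 32;
    }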

bytemaster added a commit that referenced this issue Aug 19, 2015
 - with this change the fork database now caches blocks that do not link
 and attempts to link the unlinked blocks after every new block is
 successfully linked.

 - with this change the fork database only tracks blocks as far back as
 the undo history allows.  This significantly reduces memory usage when
 there is high witness participation.

@theoreticalbts

I haven't read the patch yet, but just from the comments above, I haven't found anywhere that you consider the fact that a witness can change its block signing key.

In the discussion for #237, @nathanhourt and I briefly considered the possibility of changing block signing key updates to be deferred (i.e. not take effect until some time after being published), but this breaks revocation, which is an important use case.

The set of active signing keys is what matters for validation, and the set of active signing keys can change at any time.

@theoreticalbts

Also, the assumption that "an attacking peer can only present a single chain" may not be sound. That question can be broken into two parts:

  • (1) Does the code currently allow a peer to present multiple blocks at the same height?
  • (2) Should it allow a peer to present multiple blocks at the same height?

My answer to (1) is "I think so, because it would be easier to implement it that way, but I haven't read the code" and my answer to (2) is "Yes." Consider the case where a non-malicious peer ("Alice") is on a minority fork; she will broadcast blocks from that minority fork to Bob. If Alice later switches to the majority fork, then she needs to broadcast blocks from the majority fork to Bob -- and some of those blocks may have the same height as blocks she's already broadcast.

@nathanielhourt

Relevant commit eeeab17

@vikramrajkumar added this to the DevShares 2.0 milestone Aug 19, 2015
@abitmore

With the new commit, my witness node got stuck again. Same appearance. It never got another block with the same block number as the block my node generated, which didn't get included in the main chain.

@abitmore

In addition, it looks like it doesn't sync after being restarted without --resync-blockchain (same behavior with --replay-blockchain).

2287980ms th_a       application.cpp:487           get_blockchain_synop ] reference_point: 0000000000000000000000000000000000000000 number_of_blocks_after_reference_point: 0 result: ["0000000000000000000000000000000000000000"]
2289174ms th_a       application.cpp:487           get_blockchain_synop ] reference_point: 000007d0538457bc093edc2b68a2eef47fd09c2d number_of_blocks_after_reference_point: 0 result: ["0000000000000000000000000000000000000000"]
2290838ms th_a       application.cpp:487           get_blockchain_synop ] reference_point: 000007d0538457bc093edc2b68a2eef47fd09c2d number_of_blocks_after_reference_point: 0 result: ["0000000000000000000000000000000000000000"]
2292226ms th_a       application.cpp:487           get_blockchain_synop ] reference_point: 000007d0538457bc093edc2b68a2eef47fd09c2d number_of_blocks_after_reference_point: 0 result: ["0000000000000000000000000000000000000000"]

Seems endless.

@abitmore

If started with --resync-blockchain, the witness doesn't produce blocks.

@abitmore

My observer node got stuck as well, though the behavior is different. Logs:

    th_a  db_block.cpp:176 _push_block
3313909ms th_a       application.cpp:365           handle_block         ] Error when pushing block:
10 assert_exception: Assert Exception
item->num < _head->num + 32:
    {}
    th_a  fork_database.cpp:68 _push_block

    {"new_block":{"previous":"000050961ddce6db260c6efdea04536af4f0515c","timestamp":"2015-08-19T23:48:50","witness":"1.6.53","next_secret_hash":"c10ae8eaa2a224fa440d3cb45dc1b61fd210d74f","previous_secret":"5dd4023d81c595d884ce403f975da2d999c5cf63","transaction_merkle_
root":"0000000000000000000000000000000000000000","extensions":[],"witness_signature":"2027705b3f36cb981a150026ae9a9622d5233083d0577897790e1e5a832762987b3d174d86f035088883d2c1fea246d40fea384efd36df183e5bb5a24a1a0ffcb3","transactions":[]}}
    th_a  db_block.cpp:176 _push_block

@abitmore

p2p.log and default.log of observer node uploaded.
https://drive.google.com/open?id=0B3xrm70jSHn4MEhndDNCVTdIN0U
https://drive.google.com/open?id=0B3xrm70jSHn4enV4d2tQWTY3UXc
Final status:

{
  "head_block_num": 20425,
  "head_block_id": "00004fc9237ccd38d7b8f6c4fe588d206b21efde",
  "head_block_age": "8 hours old",
  "next_maintenance_time": "8 hours ago",
  "chain_id": "5508f5f743717fe2c78445364f62a72badd7532974d26f089af2062228b532eb",
...
}

//Edit: links of log files updated.

@abitmore

Observer node is stuck at:
{
  "head_block_num": 20425,
  "head_block_id": "00004fc9237ccd38d7b8f6c4fe588d206b21efde",
  ...
}

@nathanielhourt

If you're still getting stuck with the new commit, I think this must be a bug in the P2P code that can't be worked around in the blockchain processing code. @emfrias might be able to help.

@theoreticalbts

I'm pretty sure this has been fixed now.
