rpc_getblockfrompeer.py intermittent failure: assert_equal(pruneheight, 248); not(249 == 248) #27749

Closed
maflcko opened this issue May 25, 2023 · 8 comments · Fixed by #27784

@maflcko
Member

maflcko commented May 25, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current behaviour

https://cirrus-ci.com/task/4776911524593664?logs=ci#L3147

 test  2023-05-24T12:16:53.212000Z TestFramework (ERROR): Assertion failed 
                                   Traceback (most recent call last):
                                     File "/tmp/cirrus-build-825335453/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/test_framework/test_framework.py", line 132, in main
                                       self.run_test()
                                     File "/tmp/cirrus-build-825335453/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/rpc_getblockfrompeer.py", line 126, in run_test
                                       assert_equal(pruneheight, 248)
                                     File "/tmp/cirrus-build-825335453/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/test_framework/util.py", line 55, in assert_equal
                                       raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
                                   AssertionError: not(249 == 248)

Expected behaviour

.

Steps to reproduce

CI

Relevant log output

No response

How did you obtain Bitcoin Core

Compiled from source

What version of Bitcoin Core are you using?

current master

Operating system and version

[previous releases, qt5 dev package and depends packages, DEBUG] [focal]

Machine specifications

No response

@maflcko maflcko added the Tests label May 25, 2023
@fanquake
Member

@Sjors

@willcl-ark
Member

FWIW I have seen this error once before when running tests locally, but I was making changes to reconsiderblock at the time and assumed it was related to that.

@furszy
Member

furszy commented May 26, 2023

I wasn't able to reproduce it but my spider sense tells me that it will be fixed by #27770.

@theStack
Contributor

theStack commented May 29, 2023

After digging deeper and analyzing the failed CI log, I got a clearer picture of what's going on here. Comparing the "Leaving block file" debug messages from a successful and the failed test run gave the essential hint:

$ grep "Leaving block" successful_run.log
Leaving block file 0: CBlockFileInfo(blocks=249, size=65319, heights=0...248, time=2011-02-02...2023-05-28)
Leaving block file 1: CBlockFileInfo(blocks=251, size=65511, heights=249...499, time=2023-05-28...2023-05-28)

$ grep "Leaving block" failed_ci.log
Leaving block file 0: CBlockFileInfo(blocks=249, size=65319, heights=0...249, time=2011-02-02...2023-05-24)
Leaving block file 1: CBlockFileInfo(blocks=251, size=65511, heights=248...499, time=2023-05-24...2023-05-24)

Blocks are appended to blk*.dat files in whatever order they are received from the network (either directly as a "block" message or reconstructed via compact blocks), and it turns out that in this (obviously rare?) scenario node2 saw block 249 before block 248, so block 249 still makes it into blk00000.dat while block 248 already ends up in the next one, blk00001.dat. In that case the pruneblockchain return value is one higher than usual (249 vs. 248), because pruning operates on whole block files.

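For context, a rough sketch of the check that trips here (the node/variable names and the prune target are illustrative, not copied from rpc_getblockfrompeer.py):

# Inside the test's run_test(): node2 is the pruned node.
# Pruning removes whole blk*.dat files, so the reported "last block pruned"
# is normally 248 (the last block in blk00000.dat); if blocks 248/249 were
# written out of order, blk00000.dat instead ends at 249.
pruneheight = self.nodes[2].pruneblockchain(400)  # prune target is illustrative
assert_equal(pruneheight, 248)  # fails with "not(249 == 248)" in the bad case
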
I was surprised to see that even with only a single peer (node2 is only getting blocks from node0) blocks could arrive out of order, but that seems to be due to a mix of receiving full blocks and reconstructing compact blocks. I wrote up some hacky python-regex log-parsing magic to compare the order of the actual chain (as generated on node0) with the order in which it was received on node2, confirming what's stated above (a rough sketch of the parsing approach is included at the end of this comment):

$ ./analyze_issue_27749.py
node2 received block 201, OK.
node2 received block 202, OK.
node2 received block 203, OK.
node2 received block 204, OK.
node2 received block 205, OK.
node2 reconstructed block 206, OK.
node2 received block 207, OK.
node2 received block 208, OK.
node2 received block 209, OK.
node2 received block 210, OK.
node2 received block 211, OK.
!!! expected to receive/reconstruct block 212, but reconstructed block 213 !!!
!!! expected to receive/reconstruct block 213, but received block 212 !!!
node2 received block 214, OK.
node2 received block 215, OK.
node2 received block 216, OK.
node2 received block 217, OK.
node2 received block 218, OK.
node2 received block 219, OK.
node2 received block 220, OK.
node2 received block 221, OK.
node2 received block 222, OK.
node2 received block 223, OK.
node2 received block 224, OK.
node2 received block 225, OK.
node2 received block 226, OK.
node2 received block 227, OK.
node2 received block 228, OK.
node2 received block 229, OK.
node2 received block 230, OK.
node2 received block 231, OK.
node2 received block 232, OK.
node2 received block 233, OK.
node2 received block 234, OK.
node2 received block 235, OK.
node2 received block 236, OK.
node2 received block 237, OK.
node2 received block 238, OK.
node2 received block 239, OK.
node2 received block 240, OK.
node2 received block 241, OK.
node2 received block 242, OK.
node2 received block 243, OK.
node2 received block 244, OK.
node2 received block 245, OK.
node2 received block 246, OK.
node2 received block 247, OK.
!!! expected to receive/reconstruct block 248, but reconstructed block 249 !!!
!!! expected to receive/reconstruct block 249, but received block 248 !!!
node2 reconstructed block 250, OK.
node2 reconstructed block 251, OK.
node2 reconstructed block 252, OK.
node2 reconstructed block 253, OK.
node2 reconstructed block 254, OK.
node2 reconstructed block 255, OK.
node2 reconstructed block 256, OK.
..........
node2 reconstructed block 605, OK.

Not sure how to best solve this issue. Is there a way to enforce in-order block reception, at least when there is only a single peer? Or maybe the best solution would be to not involve the network on node2 at all (for the sake of the pruning part of the test) and just submitblock the blocks generated on node0? #27770 is of course always an alternative.
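
For reference, a rough sketch of the kind of log parsing used above (the actual analyze_issue_27749.py was not posted; the regexes and debug.log line formats below are assumptions and would need to be adapted to the real messages):

#!/usr/bin/env python3
"""Sketch: compare the order in which node0 extended its chain with the order
in which node2 received/reconstructed blocks.  The regexes are assumptions
about the debug.log contents, not the exact message formats."""
import re

CHAIN_RE = re.compile(r"UpdateTip: new best=\w+ height=(\d+)")              # node0's chain order
STORED_RE = re.compile(r"(received|reconstructed) block \w+ height=(\d+)")  # hypothetical pattern

def parse(logfile, pattern):
    with open(logfile) as f:
        return [m.groups() for line in f if (m := pattern.search(line))]

expected = [int(h) for (h,) in parse("node0/regtest/debug.log", CHAIN_RE)]
stored = [(how, int(h)) for (how, h) in parse("node2/regtest/debug.log", STORED_RE)]

# node2 only connects after the first blocks were mined, so drop the part of
# node0's history that node2 never had to fetch
first = stored[0][1]
expected = [h for h in expected if h >= first]

for want, (how, got) in zip(expected, stored):
    if want == got:
        print(f"node2 {how} block {got}, OK.")
    else:
        print(f"!!! expected to receive/reconstruct block {want}, but {how} block {got} !!!")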

@mzumsande
Contributor

Nice analysis!

Not sure how to best solve this issue. Is there a way to enforce in-order block reception, at least when there is only a single peer?

I think the cause is that a failed DoS check
if (pindex->nHeight <= m_chainman.ActiveChain().Height() + 2) (link)
makes the node revert to headers processing if node2's chain hasn't caught up enough (the new block is too far ahead of the existing chain), adding another round trip to the block download compared to compact block reconstruction. Once node2 has caught up and switches to compact block reconstruction, the two blocks can arrive on disk in a different order.
This can happen in the first place because node2 initially also needs to catch up with blocks 201-205 from the first part of the test (and it is rather slow to do so in the failed run!).

So I think that an alternative fix would be to add a sync_blocks call between node0 and node2 before letting node0 generate the 400 blocks. This should guarantee that all of these 400 blocks will be processed by compact block reconstruction.
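
Something along these lines in rpc_getblockfrompeer.py (a sketch of the idea, not necessarily the exact patch that ended up in #27784):

# Before the pruning sub-test: first make sure node2 has caught up with the
# blocks 201-205 generated in the earlier sub-tests, so that the 400 new
# blocks can all be relayed in order via compact block reconstruction.
self.sync_blocks([self.nodes[0], self.nodes[2]])
self.generate(self.nodes[0], 400)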

@furszy
Member

furszy commented May 29, 2023

Awesome investigation! It feels good when all dots are connected.

Apart from mzumsande's patch to enforce sequential block storage, which looks promising at first glance, I would like to refer to a portion of what I just wrote in #27770 (comment):

I believe the main conclusion of the analysis (at least mine) is that the accuracy of the pruneblockchain result is questionable. The real issue lies in the fact that the RPC command's result description claims to return the "height of the last block pruned", which may not be true since blocks could have been stored out of order. Therefore, obtaining the block height 249 as the "last block pruned" does not necessarily indicate that block 248 has been pruned from disk.

Considering this aspect, alongside mzumsande's fix, I hold the view that the test is quite fragile and that it would be unwise to keep the current hardcoded pruneblockchain numbers. These numbers have a significant discrepancy vs. the block file information (as indicated in comment) and aren't really documented anywhere.

@Sjors
Member

Sjors commented May 30, 2023

cc @fjahr since he worked on the prune part of this test.

mzumsande added a commit to mzumsande/bitcoin that referenced this issue May 30, 2023
This fixes an intermittent error, caused by blocks arriving
out of order due to how compact block relay may revert to headers
processing when the tip hasn't caught up, and resulting in slightly
different pruning behavior.

Making sure that all blocks from the previous tests are synced before
generating more blocks makes this impossible.
See Issue bitcoin#27749 for more details.
@mzumsande
Contributor

Opened #27784, which just adds an additional sync, as suggested in #27770 (comment).

sidhujag pushed a commit to syscoin/syscoin that referenced this issue May 31, 2023
9fe9074 test: add block sync to getblockfrompeer.py (Martin Zumsande)

Pull request description:

  This adds an additional `sync_blocks` call, fixing an intermittent error caused by blocks arriving out of order due to how compact block relay may revert to headers processing when the tip hasn't caught up, and resulting in slightly different pruning behavior.
  Making sure that all blocks from the previous tests are synced before generating more blocks makes this impossible.

  See bitcoin#27749 (comment) and bitcoin#27749 (comment) for a more detailed analysis.

  bitcoin#27770 is a more long-term approach to avoid having to deal with magic pruneheight numbers in the first place, but that PR introduces a new RPC and needs more discussion.

  Fixes bitcoin#27749.

ACKs for top commit:
  MarcoFalke:
    lgtm ACK 9fe9074
  theStack:
    ACK 9fe9074

Tree-SHA512: f3de1ea68725429aeef448c351ea812b805fa216912b112d7db9aceeddb1f2381b705c2577734b0d308e78ec5e0c4d26dc65fc2171f6e21f13061fc71d48216c