
Conversation

mzumsande (Contributor) commented Mar 2, 2023

The debug-only -fastprune option used in several tests is not always safe to use:
if a -fastprune node receives a block larger than the maximum blockfile size of 64KiB, bad things happen. The while loop in BlockManager::FindBlockPos never terminates, and the node runs out of memory because memory for m_blockfile_info is allocated in each iteration of the loop.
The same would happen if a naive user used -fastprune on anything other than regtest (so this can be tested by syncing on signet, for example; the first block that crashes the node is at height 2232).

Change the approach by raising the blockfile size to the size of the block, if that block otherwise wouldn't fit (idea by TheCharlatan).
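The adjusted-limit idea can be sketched as a small standalone helper. This is a hypothetical sketch, not the actual BlockManager code: the function name `AdjustedMaxBlockfileSize` is illustrative, and `0x8000000` (128 MiB) stands in for `MAX_BLOCKFILE_SIZE`.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the fix: under -fastprune the blockfile limit is 64 KiB, but if a
// single block would not fit, the limit is raised to the block size + 1 so
// that the file-selection loop in FindBlockPos can always terminate.
std::uint64_t AdjustedMaxBlockfileSize(std::uint64_t add_size, bool fastprune)
{
    const std::uint64_t kDefaultMax{0x8000000};  // stands in for MAX_BLOCKFILE_SIZE (128 MiB)
    std::uint64_t max_blockfile_size = fastprune ? std::uint64_t{0x10000} : kDefaultMax;
    if (fastprune && add_size >= max_blockfile_size) {
        // +1 keeps the ">=" loop condition satisfiable: a fresh file with
        // nSize == 0 can then hold the block.
        max_blockfile_size = add_size + 1;
    }
    // With the adjustment in place, the block always fits into an empty file.
    assert(add_size < max_blockfile_size);
    return max_blockfile_size;
}
```

For example, a 128 KiB (`0x20000`) block under -fastprune raises the limit to `0x20001`, while smaller blocks keep the usual 64 KiB fastprune limit.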

DrahtBot (Contributor) commented Mar 2, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

ACK: ryanofsky, TheCharlatan, pinheadmz

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #27125 (refactor, kernel: Decouple ArgsManager from blockstorage by TheCharlatan)
  • #27039 (blockstorage: do not flush block to disk if it is already there by pinheadmz)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

pinheadmz (Member) left a comment

concept ACK

I think assert is the appropriate move here to prevent the infinite memory loop! We might as well just crash early. It might also make sense to disallow -fastprune on anything besides regtest. I'm not sure if that is covered by ArgsManager::DEBUG_ONLY.

```diff
 bool finalize_undo = false;
 if (!fKnown) {
-    while (m_blockfile_info[nFile].nSize + nAddSize >= (gArgs.GetBoolArg("-fastprune", false) ? 0x10000 /* 64kb */ : MAX_BLOCKFILE_SIZE)) {
+    const unsigned int max_blockfile_size{gArgs.GetBoolArg("-fastprune", false) ? 0x10000 : MAX_BLOCKFILE_SIZE};
```
(Member) commented:

This cleanup is already addressed in #27039 ;-)

mzumsande (Author) replied:

Good, should make rebasing easy! Here, it's not a cleanup but needed to use max_blockfile_size in the added assert.

mzumsande (Author) commented:

It might also make sense to disallow -fastprune on anything besides regtest. I'm not sure if that is covered by ArgsManager::DEBUG_ONLY.

Not sure if suppressing options for specific networks is something we usually do (are there any other examples of this?). But I added some more explanation to the -fastprune documentation in init with the latest push.

TheCharlatan (Contributor) commented Mar 3, 2023

Not sure how I feel about this change. It is trivial to create blocks exceeding this limit on regtest. If this happens, it should at least log a message explaining why it failed. The size limit constants used by fastprune seem arbitrary to me. What is their rationale, just fewer resources allocated/used while running the pruning functional tests? If so, why are the revision files not size restricted as well?

mzumsande (Author) commented:

Not sure how I feel about this change. It is trivial to create blocks exceeding this limit on regtest. If this happens, it should at least log a message explaining why it failed.

I'll add a log message with my next push.

The size limit constants used by fastprune seem arbitrary to me. What is their rationale, just fewer resources allocated/used while running the pruning functional tests? If so, why are the revision files not size restricted as well?

It was introduced in c286a22 (#15946) without much discussion - I'd say it's a quick-and-dirty way to allow testing the pruning code in functional tests - tests like feature_index_prune.py would probably take hours without it.

I stumbled upon this while reviewing #27125, when I had the great idea to quickly sync on signet with -fastprune to make sure that it still works the same, and after my system crashed due to running out of memory I thought this could be handled a bit better...

@mzumsande mzumsande force-pushed the 202302_fastprune_oom branch from d526dbb to e930814 Compare March 21, 2023 19:29
mzumsande (Author) commented:

I changed the approach to @TheCharlatan's idea (added you as coauthor) - now we don't assert anymore but raise the size of the blockfile to the size of the added block if needed.

@mzumsande mzumsande changed the title blockstorage: add an assert to avoid running oom with -fastprune blockstorage: Adjust fastprune limit if block exceeds blockfile size Mar 21, 2023
TheCharlatan (Contributor) commented:

ACK e930814 fwiw :P
Thank you for following up on this @mzumsande.

pinheadmz (Member) left a comment

ACK e930814

I also wrote a test for this that freezes (infinite loop) on master but passes on this branch. Take it or leave it, I just wanted to see the mechanism work before acking:

pinheadmz@c576efa


@fanquake fanquake requested a review from ryanofsky March 22, 2023 18:01
ryanofsky (Contributor) left a comment

Code review ACK e930814, but I think it would be great to add the test written by @pinheadmz #27191 (review) because the blockmanager code is fragile and a regression could happen, and because the test setup is very clean so it could probably be used to branch out and test other things.

```cpp
if (gArgs.GetBoolArg("-fastprune", false)) {
    max_blockfile_size = 0x10000; // 64kiB
    if (nAddSize >= max_blockfile_size) {
        max_blockfile_size = nAddSize + 1;
```
(Contributor) commented:

In commit "blockstorage: Adjust fastprune limit if block exceeds blockfile size" (e930814)

What is the +1 for? Would be helpful to add a code comment saying what the extra byte is.

(Contributor) commented:

I checked this carefully again and no longer think that the + 1 is necessary.

mzumsande (Author) replied:

I think it is necessary: as long as the while condition says
`while (m_blockfile_info[nFile].nSize + nAddSize >= max_blockfile_size)`
it won't terminate if nAddSize == max_blockfile_size, because m_blockfile_info[nFile].nSize will be 0 if we skip to the next blockfile.

Maybe we could lose the + 1 if we changed the >= in the while loop into a >. However, I really don't want to change that critical code in this PR, because it would mean we'd try to squeeze another byte into all blockfiles, affecting block storage in normal operation as well, not just the test-only -fastprune mode.
So I'll add a comment.
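The termination argument above can be checked in isolation. The following is an editor's toy model of the file-selection loop, not the real FindBlockPos code; `FilesTriedUntilFit` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the file-selection loop: returns how many times we skip to a
// fresh (empty) file before the block fits, or -1 if no file can ever hold
// it -- the case that looped forever (allocating m_blockfile_info entries
// each iteration) on master.
int FilesTriedUntilFit(std::uint64_t add_size, std::uint64_t max_blockfile_size,
                       std::uint64_t first_file_used)
{
    // Without the +1 adjustment, add_size == max_blockfile_size never fits:
    // even an empty file (nSize == 0) fails "nSize + add_size >= max".
    if (add_size >= max_blockfile_size) return -1;
    std::uint64_t n_size = first_file_used;  // models m_blockfile_info[nFile].nSize
    int tried = 0;
    while (n_size + add_size >= max_blockfile_size) {
        n_size = 0;  // skip to a fresh, empty file
        ++tried;
    }
    return tried;
}
```

With the limit raised to `add_size + 1`, a 64 KiB block that doesn't fit into a partially used file terminates after a single skip; with the limit left exactly equal to `add_size`, the loop condition can never become false.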

(Contributor) commented:

Makes sense the +1 is needed to be consistent with >= below. It does look like there is a minor bug here that will cause block files to be at most MAX_BLOCKFILE_SIZE-1 bytes and never reach the maximum size. But if this is a bug, it goes back to 5382bcf when this code was introduced. Agree it would not be good to change that behavior here.

(Member) commented:

The functional test covers the +1 nicely as well. Without the +1 (and with the assertion on L632 removed), the test hangs until it times out at 30s.

```cpp
        max_blockfile_size = nAddSize + 1;
    }
}
while (m_blockfile_info[nFile].nSize + nAddSize >= max_blockfile_size) {
```
(Contributor) commented:

In commit "blockstorage: Adjust fastprune limit if block exceeds blockfile size" (e930814)

Would be good to add assert(nAddSize <= max_blockfile_size); here to make it clear what this function is assuming, and to catch the problem instead of going into an infinite loop if outside code changes.

mzumsande (Author) replied:

Done, although I think we should assert nAddSize < max_blockfile_size instead - see explanation above.

ryanofsky (Contributor) commented:

I think assert is the appropriate move here to prevent the infinite memory loop!

I think throwing an exception or calling abort() would be ok, but better to reserve the assert macro for conditions which are actually impossible and not use it for error handling, because that would water down the meaning of other asserts.

@mzumsande mzumsande force-pushed the 202302_fastprune_oom branch from e930814 to 19e9073 Compare April 18, 2023 18:37
mzumsande (Author) commented:

Thanks for the reviews! I pushed an update to address the outstanding comments and added the test by @pinheadmz (thanks!).

@mzumsande mzumsande force-pushed the 202302_fastprune_oom branch from 19e9073 to af16fc9 Compare April 19, 2023 14:17
maflcko (Member) commented Apr 19, 2023

unrelated: Red CI can be ignored, or if you care a lot, you can rebase

mzumsande and others added 2 commits April 19, 2023 11:25
If the added block exceeds the blockfile size in test-only
-fastprune mode, the node would get stuck in an infinite loop and
run out of memory.

Avoid this by raising the blockfile size to the size of the added block
in this situation.

Co-authored-by: TheCharlatan <seb.kung@gmail.com>
@mzumsande mzumsande force-pushed the 202302_fastprune_oom branch from af16fc9 to 8f14fc8 Compare April 19, 2023 15:25
ryanofsky (Contributor) left a comment

Code review ACK 8f14fc8. Added new assert, test, and comment since last review


@DrahtBot DrahtBot requested review from TheCharlatan and pinheadmz May 1, 2023 17:49
TheCharlatan (Contributor) left a comment

ACK 8f14fc8

pinheadmz (Member) left a comment

ACK 8f14fc8


@fanquake fanquake merged commit 8a373a5 into bitcoin:master May 2, 2023
@mzumsande mzumsande deleted the 202302_fastprune_oom branch May 2, 2023 11:54
```python
height = self.nodes[0].getblockcount() + 1
block = create_block(hashprev=tip, ntime=time, txlist=[tx], coinbase=create_coinbase(height=height))
add_witness_commitment(block)
block.solve()
```
(Member) commented:

Any reason to make this simple test more complicated than it needs to be? See #27553

sidhujag pushed a commit to syscoin/syscoin that referenced this pull request May 4, 2023
…ceeds blockfile size

8f14fc8 test: cover fastprune with excessive block size (Matthew Zipkin)
271c23e blockstorage: Adjust fastprune limit if block exceeds blockfile size (Martin Zumsande)

Pull request description:

  The debug-only `-fastprune` option used in several tests is not always safe to use:
  if a `-fastprune` node receives a block larger than the maximum blockfile size of 64KiB, bad things happen. The while loop in `BlockManager::FindBlockPos` never terminates, and the node runs out of memory because memory for `m_blockfile_info` is allocated in each iteration of the loop.
  The same would happen if a naive user used `-fastprune` on anything other than regtest (so this can be tested by syncing on signet, for example; the first block that crashes the node is at height 2232).

  Change the approach by raising the blockfile size to the size of the block, if that block otherwise wouldn't fit (idea by TheCharlatan).

ACKs for top commit:
  ryanofsky:
    Code review ACK 8f14fc8. Added new assert, test, and comment since last review
  TheCharlatan:
    ACK 8f14fc8
  pinheadmz:
    ACK 8f14fc8

Tree-SHA512: df2fea30613ef9d40ebbc2416eacb574f6d7d96847db5c33dda22a29a2c61a8db831aa9552734ea4477e097f253dbcb6dcb1395d43d2a090cc0588c9ce66eac3
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Apr 19, 2024
Summary:
If the added block exceeds the blockfile size in test-only
-fastprune mode, the node would get stuck in an infinite loop and
run out of memory.

Avoid this by raising the blockfile size to the size of the added block
in this situation.

Co-authored-by: TheCharlatan <seb.kung@gmail.com>

This is a backport of [[bitcoin/bitcoin#27191 | core#27191]]

Depends on D16028

Note that this fix does not address the non-test issue of blocks larger than 128 MiB. But at least now we get a meaningful assertion failure rather than an OOM crash.

Test Plan:
`ninja all check-all`

Check that without the change in blockstorage.cpp the new test slowly consumes  all the memory and the test eventually fails.

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Differential Revision: https://reviews.bitcoinabc.org/D16023
@bitcoin bitcoin locked and limited conversation to collaborators May 1, 2024