New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use non-atomic flushing with block replay #10148

Merged
merged 6 commits into from Jun 28, 2017

Conversation

@sipa
Member

sipa commented Apr 4, 2017

This implements an alternative solution to the flush-time memory usage peak, suggested by @gmaxwell.

Instead of relying on using atomic batch writes in LevelDB for the chainstate, we rely on the fact that we have an external log of updates to it already (called the blockchain).

This patch adds an extra "head blocks" to the chainstate, which gives the range of blocks for writes may be incomplete. At the start of a flush, we write this record, write the dirty dbcache entries in 16 MiB batches, and at the end we remove the heads record again. If it is present at startup it means we crashed during flush, and we rollback/roll forward blocks inside of it to get a consistent tip on disk before proceeding.

If a flush completes succesfully, the resulting database is compatible with previous versions (down to 0.8). If the node crashes in the middle of a flush, a version of the code with this patch is needed to recovery.

@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa Apr 4, 2017

Member

This badly needs testing, but I'm not sure how to simulate crashes in the middle of flushing (I've manually verified this patch can recover from failure by introducing an exit(0) in the middle of the flush code).

Member

sipa commented Apr 4, 2017

This badly needs testing, but I'm not sure how to simulate crashes in the middle of flushing (I've manually verified this patch can recover from failure by introducing an exit(0) in the middle of the flush code).

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Apr 4, 2017

Member

Cool!

but I'm not sure how to simulate crashes in the middle of flushing

I'll get to that :)

Member

laanwj commented Apr 4, 2017

Cool!

but I'm not sure how to simulate crashes in the middle of flushing

I'll get to that :)

@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa Apr 5, 2017

Member

Rebased, fixed a bug, and added a commit to allows simulating crashes after partial flushes.

Member

sipa commented Apr 5, 2017

Rebased, fixed a bug, and added a commit to allows simulating crashes after partial flushes.

Show outdated Hide outdated src/txdb.cpp Outdated
@gmaxwell

This comment has been minimized.

Show comment
Hide comment
@gmaxwell

gmaxwell Apr 5, 2017

Member

contrib/devtools/check-doc.py is unhappy that you added new arguments without asking for permission from the argument gods.

Member

gmaxwell commented Apr 5, 2017

contrib/devtools/check-doc.py is unhappy that you added new arguments without asking for permission from the argument gods.

Show outdated Hide outdated src/txdb.cpp Outdated
Show outdated Hide outdated src/dbwrapper.h Outdated
Show outdated Hide outdated src/init.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated
@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa Apr 12, 2017

Member

Addressed some of @TheBlueMatt's comments.

Member

sipa commented Apr 12, 2017

Addressed some of @TheBlueMatt's comments.

Show outdated Hide outdated src/coins.h Outdated
Show outdated Hide outdated src/validation.cpp Outdated
Show outdated Hide outdated src/txdb.cpp Outdated
@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa Apr 16, 2017

Member

Updated to deal with reorganizations. The disk format and recovery code can now also deal with multiple partially written branches. That functionality is not needed yet, but means we can switch to different partial flushing strategies later without breaking compatibility with older versions.

Member

sipa commented Apr 16, 2017

Updated to deal with reorganizations. The disk format and recovery code can now also deal with multiple partially written branches. That functionality is not needed yet, but means we can switch to different partial flushing strategies later without breaking compatibility with older versions.

Show outdated Hide outdated src/init.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated
@sipa

Thanks for the detailed review, @TheBlueMatt.

Show outdated Hide outdated src/validation.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated

@sipa sipa changed the title from Use non-atomic flushing with block replay to [WIP] Use non-atomic flushing with block replay Apr 17, 2017

Show outdated Hide outdated src/txdb.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated
@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Apr 18, 2017

Member

contrib/devtools/check-doc.py is unhappy that you added new arguments without asking for permission from the argument gods.

Not sure if you were being sarcastic here, but to be clear: that script checks whether options are documented (either as debug option or as normal option), not whether you have a signed permission on a stone tablet from $PANTHEON.

Going to test this on a node.

Member

laanwj commented Apr 18, 2017

contrib/devtools/check-doc.py is unhappy that you added new arguments without asking for permission from the argument gods.

Not sure if you were being sarcastic here, but to be clear: that script checks whether options are documented (either as debug option or as normal option), not whether you have a signed permission on a stone tablet from $PANTHEON.

Going to test this on a node.

@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa Apr 19, 2017

Member

Rebased, and removed the unused multi-reorg support in the recovery code.

Member

sipa commented Apr 19, 2017

Rebased, and removed the unused multi-reorg support in the recovery code.

@sdaftuar

This comment has been minimized.

Show comment
Hide comment
@sdaftuar

sdaftuar Apr 19, 2017

Member

@sipa Thanks for the simplification. I've read through the logic and I believe it is correct; will think about ways to carefully test.

There was some discussion on IRC but I wanted to summarize my current thoughts: this is still some complication we're adding to consensus, and I expect that the per-txout caching from #10195 will change the behavior and performance of the cache substantially, so I'd prefer to review that PR independently of this one to better consider the design decisions of this PR.
Edit: Just noticed you rebased #10195 without this, thanks!

Member

sdaftuar commented Apr 19, 2017

@sipa Thanks for the simplification. I've read through the logic and I believe it is correct; will think about ways to carefully test.

There was some discussion on IRC but I wanted to summarize my current thoughts: this is still some complication we're adding to consensus, and I expect that the per-txout caching from #10195 will change the behavior and performance of the cache substantially, so I'd prefer to review that PR independently of this one to better consider the design decisions of this PR.
Edit: Just noticed you rebased #10195 without this, thanks!

@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa Apr 19, 2017

Member

@sdaftuar My plan is adding a unit test that uses wrapper around CCoinsViewDB that drops 50% of the dirty entries before writing, and then issues ReplayBlock to recover from it. Then run a test that creates random regtest blocks with a few transactions, and frequently reorganizes.

Member

sipa commented Apr 19, 2017

@sdaftuar My plan is adding a unit test that uses wrapper around CCoinsViewDB that drops 50% of the dirty entries before writing, and then issues ReplayBlock to recover from it. Then run a test that creates random regtest blocks with a few transactions, and frequently reorganizes.

Show outdated Hide outdated src/validation.cpp Outdated
return error("ReplayBlocks(): reorganization from unknown block requested");
}
pindexOld = mapBlockIndex[hashHeads[1]];
pindexFork = LastCommonAncestor(pindexOld, pindexNew);

This comment has been minimized.

@NicolasDorier

NicolasDorier Apr 20, 2017

Member

I would add a assert(pindexFork);

@NicolasDorier

NicolasDorier Apr 20, 2017

Member

I would add a assert(pindexFork);

This comment has been minimized.

@sipa

sipa Apr 20, 2017

Member

Done.

@sipa

sipa Apr 20, 2017

Member

Done.

@NicolasDorier

This comment has been minimized.

Show comment
Hide comment
@NicolasDorier

NicolasDorier Apr 20, 2017

Member

CodeReview ACK. Would be more at ease with some functional tests.

An easy test is trying to sync a node for 1000 blocks, with dbbatchsize to 1 bytes (it is in MB right now though) and dbcrashratio to like 10, and see if it eventually manage to do it.

Member

NicolasDorier commented Apr 20, 2017

CodeReview ACK. Would be more at ease with some functional tests.

An easy test is trying to sync a node for 1000 blocks, with dbbatchsize to 1 bytes (it is in MB right now though) and dbcrashratio to like 10, and see if it eventually manage to do it.

@TheBlueMatt

utACK 29548bc minus one GetArg and needing squashing.

Show outdated Hide outdated src/txdb.cpp Outdated
@@ -59,12 +80,22 @@ bool CCoinsViewDB::BatchWrite(CCoinsMap &mapCoins, const uint256 &hashBlock) {
count++;
CCoinsMap::iterator itOld = it++;
mapCoins.erase(itOld);
if (batch.SizeEstimate() > batch_size) {
LogPrint(BCLog::COINDB, "Writing partial batch of %.2f MiB\n", batch.SizeEstimate() * (1.0 / 1048576.0));
db.WriteBatch(batch);

This comment has been minimized.

@morcos

morcos Apr 20, 2017

Member

Should we be looking for failure here and returning false if any of the WriteBatch calls fail?

@morcos

morcos Apr 20, 2017

Member

Should we be looking for failure here and returning false if any of the WriteBatch calls fail?

This comment has been minimized.

@sipa

sipa Apr 24, 2017

Member

WriteBatch throws a dbwrapper_error when writing fails.

@sipa

sipa Apr 24, 2017

Member

WriteBatch throws a dbwrapper_error when writing fails.

@morcos

This comment has been minimized.

Show comment
Hide comment
@morcos

morcos Apr 20, 2017

Member

Concept ACK. I've done a preliminary review and it looks pretty good, but I'd like to spend some more time thinking about it.

Member

morcos commented Apr 20, 2017

Concept ACK. I've done a preliminary review and it looks pretty good, but I'd like to spend some more time thinking about it.

@laanwj laanwj added this to Blockers in High-priority for review Apr 20, 2017

@laanwj laanwj added this to Blockers in High-priority for review Jun 8, 2017

Show outdated Hide outdated src/txdb.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated
Show outdated Hide outdated src/validation.cpp Outdated
@sdaftuar

Re-reviewed; it looks pretty good though I have a couple questions... Will work on testing.

Show outdated Hide outdated src/txdb.cpp Outdated
return true;
}
bool ReplayBlocks(const CChainParams& params, CCoinsView* view)

This comment has been minimized.

@sdaftuar

sdaftuar Jun 14, 2017

Member

Would we ever call this on a view that isn't a CCoinsViewDB? If not, perhaps the code would be slightly clearer if we passed in a CCoinsViewDB * instead, to clue future code readers in to what's going on here?

@sdaftuar

sdaftuar Jun 14, 2017

Member

Would we ever call this on a view that isn't a CCoinsViewDB? If not, perhaps the code would be slightly clearer if we passed in a CCoinsViewDB * instead, to clue future code readers in to what's going on here?

This comment has been minimized.

@sipa

sipa Jun 15, 2017

Member

I prefer making types as generic as possible, and I think this code should work fine even on a cache on top of a db view.

@sipa

sipa Jun 15, 2017

Member

I prefer making types as generic as possible, and I think this code should work fine even on a cache on top of a db view.

Show outdated Hide outdated src/txdb.cpp Outdated
Show outdated Hide outdated src/txdb.cpp Outdated
Show outdated Hide outdated src/init.cpp Outdated
Show outdated Hide outdated src/init.cpp Outdated
@sdaftuar

It occurs to me that we aren't bounding memory usage at all during recovery. This seems like it could cause problems for users who are running with a big -dbcache and then crash.

Perhaps we should suggest a -reindex if there are a lot of blocks to disconnect/connect, or abort and suggest a -reindex if the cache in ReplayBlocks exceeds the dbcache+mempool size?

Show outdated Hide outdated src/txdb.h Outdated
Show outdated Hide outdated src/validation.cpp Outdated
@sdaftuar

This comment has been minimized.

Show comment
Hide comment
@sdaftuar

sdaftuar Jun 22, 2017

Member

Also, I was able to write a functional test that exercises this logic (and which caught the undo.fCoinbase vs !fClean issue in ApplyTxInUndo()):
https://github.com/sdaftuar/bitcoin/commits/test-10148

Member

sdaftuar commented Jun 22, 2017

Also, I was able to write a functional test that exercises this logic (and which caught the undo.fCoinbase vs !fClean issue in ApplyTxInUndo()):
https://github.com/sdaftuar/bitcoin/commits/test-10148

@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa Jun 22, 2017

Member
  • Made the added argument to AddCoins check instead of overwrite. Its semantics are now to issue a HaveCoin to determine whether an overwrite happens, rather than just assume there is one. This has the advantage that (some) of the coins added in RollForwardBlock correctly get a FRESH marker, possibly reducing memory usage if the rollforward is large.
  • Added @sdaftuar 's dbcrash test (which still runs succesfully).

@sdaftuar As for memory usage during recovery, I think the correct approach is to make the recovery itself non-atomic later.

Member

sipa commented Jun 22, 2017

  • Made the added argument to AddCoins check instead of overwrite. Its semantics are now to issue a HaveCoin to determine whether an overwrite happens, rather than just assume there is one. This has the advantage that (some) of the coins added in RollForwardBlock correctly get a FRESH marker, possibly reducing memory usage if the rollforward is large.
  • Added @sdaftuar 's dbcrash test (which still runs succesfully).

@sdaftuar As for memory usage during recovery, I think the correct approach is to make the recovery itself non-atomic later.

@sdaftuar

This comment has been minimized.

Show comment
Hide comment
@sdaftuar

sdaftuar Jun 22, 2017

Member

Looks good, ACK. However, if we include the dbcrash test, then I think we also need something like sdaftuar@96a7ef4 in order for test/functional/test_runner.py --extended to not fail?

Member

sdaftuar commented Jun 22, 2017

Looks good, ACK. However, if we include the dbcrash test, then I think we also need something like sdaftuar@96a7ef4 in order for test/functional/test_runner.py --extended to not fail?

@sdaftuar

This comment has been minimized.

Show comment
Hide comment
@sdaftuar
Member

sdaftuar commented Jun 23, 2017

ACK 03604ea

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Jun 24, 2017

Member

Small merge conflict in check-doc.py

Member

laanwj commented Jun 24, 2017

Small merge conflict in check-doc.py

@gmaxwell

This comment has been minimized.

Show comment
Hide comment
@gmaxwell

gmaxwell Jun 26, 2017

Member

@sdaftuar Thats a great test.

@sipa needs conflicts fixed

Member

gmaxwell commented Jun 26, 2017

@sdaftuar Thats a great test.

@sipa needs conflicts fixed

sipa and others added some commits Apr 16, 2017

Dont create pcoinsTip until after ReplayBlocks.
This requires that we not access pcoinsTip in InitBlockIndex's
FlushStateToDisk (so we just skip it until later in AppInitMain)
and the LoadChainTip in LoadBlockIndex (which there is already one
later in AppinitMain, after ReplayBlocks, so skipping it there is
fine).

Includes some simplifications by Suhas Daftuar and Pieter Wuille.
@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa Jun 26, 2017

Member

Rebased, and squashed the last two commits.

Member

sipa commented Jun 26, 2017

Rebased, and squashed the last two commits.

@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa Jun 26, 2017

Member

Travis where art thou

Member

sipa commented Jun 26, 2017

Travis where art thou

[qa] Test non-atomic chainstate writes
Adds new functional test, dbcrash.py, which uses -dbcrashratio to exercise the
logic for recovering from a crash during chainstate flush.

dbcrash.py is added to the extended tests, as it may take ~10 minutes to run

Use _Exit() instead of exit() for crash simulation

This eliminates stderr output such as:
    terminate called without an active exception
or
    Assertion failed: (!pthread_mutex_destroy(&m)), function ~recursive_mutex, file /usr/local/include/boost/thread/pthread/recursive_mutex.hpp, line 104.

Eliminating the stderr output on crash simulation allows testing with
test_runner.py, which reports a test as failed if stderr is produced.
@sdaftuar

This comment has been minimized.

Show comment
Hide comment
@sdaftuar

sdaftuar Jun 27, 2017

Member

re-ACK 176c021

Member

sdaftuar commented Jun 27, 2017

re-ACK 176c021

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Jun 28, 2017

Member

Tested ACK 176c021

Member

laanwj commented Jun 28, 2017

Tested ACK 176c021

@laanwj laanwj merged commit 176c021 into bitcoin:master Jun 28, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

laanwj added a commit that referenced this pull request Jun 28, 2017

Merge #10148: Use non-atomic flushing with block replay
176c021 [qa] Test non-atomic chainstate writes (Suhas Daftuar)
d6af06d Dont create pcoinsTip until after ReplayBlocks. (Matt Corallo)
eaca1b7 Random db flush crash simulator (Pieter Wuille)
0580ee0 Adapt memory usage estimation for flushing (Pieter Wuille)
013a56a Non-atomic flushing using the blockchain as replay journal (Pieter Wuille)
b3a279c [MOVEONLY] Move LastCommonAncestor to chain (Pieter Wuille)

Tree-SHA512: 47ccc62303f9075c44d2a914be75bd6969ff881a857a2ff1227f05ec7def6f4c71c46680c5a28cb150c814999526797dc05cf2701fde1369c06169f46eccddee
@a415263

This comment has been minimized.

Show comment
Hide comment
@a415263

a415263 commented on b3a279c Jun 29, 2017

b3a279b

@laanwj laanwj removed this from Blockers in High-priority for review Jun 29, 2017

@TheBlueMatt

This comment has been minimized.

Show comment
Hide comment
@TheBlueMatt

TheBlueMatt Jul 7, 2017

Contributor

utACK-sans-tests once the fixes for init-order bugs here goes through in #10758.

Contributor

TheBlueMatt commented Jul 7, 2017

utACK-sans-tests once the fixes for init-order bugs here goes through in #10758.

@morcos

This comment has been minimized.

Show comment
Hide comment
@morcos

morcos Jul 20, 2017

Member

posthumous utACK-sans-tests and modulo some of the same bugs @TheBlueMatt found and fixed in #10758. I'll review that now.

Member

morcos commented Jul 20, 2017

posthumous utACK-sans-tests and modulo some of the same bugs @TheBlueMatt found and fixed in #10758. I'll review that now.

@jnewbery jnewbery referenced this pull request Jul 31, 2017

Closed

TODO for release notes 0.15.0 #9889

12 of 12 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment