Skip to content

Coin database checks #2145

Merged
merged 3 commits into from Jan 11, 2013

4 participants

@sipa
Bitcoin member
sipa commented Jan 3, 2013

-checklevel gets a new meaning:
0: verify blocks can be read from disk (like before)
1: verify (contextless) block validity (like before)
2: verify undo data can be read and matches checksums
3: verify coin database is consistent with the last few blocks (close to level 6 before)
4: verify all validity rules of the last few blocks (including signature checks)

Level 3 is the new default, as it's reasonably fast. As levels 3 and 4 are implemented using an in-memory rollback of the database, they are limited to as many blocks as possible without exceeding the limits set by -dbcache. The default of -dbcache=25 allows for some 150-200 blocks to be rolled back.

In case an error is found, the application quits with a message instructing the user to restart with -reindex. Better instructions, and automatic recovery (when possible) or automatic reindexing are left as future work.

In addition, this also changes the on-disk format of undo data (adding a checksum), as the correctness of the coindb checks depends on having intact undo data.

sipa added some commits Dec 30, 2012
@sipa sipa Make DisconnectBlock fault-tolerant 2cbd71d
@sipa sipa Add checksums to undo data
This should be compatible with older code that didn't write checksums.
8539361
@BitcoinPullTester

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/24e864547189c6b9347740f50d5fed656fc48816 for binaries and test log.

@BitcoinPullTester

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/aa38398d4bd32c6b90175d08b65ecb324701a368 for binaries and test log.

@jgarzik
Bitcoin member
jgarzik commented Jan 3, 2013

ACK

@gmaxwell
gmaxwell commented Jan 3, 2013

Some problems with making reindex automatic is that reindex is quite burdensome and intrusive so you may want to control when its run, and the other problem is that making it automatic may hide hardware or software errors that can also cause silent failures. A failure of this test is something which should never happen, even if the user is randomly powering off their machine.

@gmaxwell
gmaxwell commented Jan 3, 2013

So, I see that setting dbcache really large doesn't check more than 2500 blocks by default. Thats good— wouldn't want a long startup time. But setting dbcache to 0 results in it checking very few. I'm not sure that this is a good idea.

I'm also seeing some odd memory usage behavior. Starting with -dbcache=4000 -checkblocks=0 results in an RES of 2.3 GiB once the check is finished.

I also found
2013-01-03 20:18:52 Verifying last 215017 blocks at level 2
2013-01-03 20:23:16 No coin database inconsistencies in last 31137 blocks (6770582 transactions)

a little odd, since that it's actually doing is checking 215017 at level 0/1 and checking 31137 at 2.

Should the undo data also get checked at level 1? Obviously if it's broken it will break the level 2 tests, but since we do the level 2 tests on fewer block, some specific undo tests might make sense? (e.g. we could count the txins in the block and the undo data)

@sipa
Bitcoin member
sipa commented Jan 4, 2013

@gmaxwell there is a trivial undo test possible: verify their checksums. However, how meaningful is it to check undo data without checking the coin state - there is no scenario in which you need undo data but not coin state.

In way, the system is backwards. The coin state is by far the most important thing to verify, but unfortunately it is also almost the most expensive, and it already requires consistent blocks and undo data.

Regarding only checking very few with low dbcache... what would you suggest?

Also, the calculation for memory sizes based on transaction counts is only very approximately and spread over several types of caches (leveldb blktree, leveldb coindb, coins view cache). Doing the rollback pulls the transaction data in the global cache, and then changes to it in a memory-only cache on top of it, that is discarded. The transactions themself remain in the global cache though, afterwards.

@sipa
Bitcoin member
sipa commented Jan 4, 2013

@Diapolo such a flag would be possible, but why add it? The next startup the database will most likely still be inconsistent, so it would be detected again (and if it isn't, maybe it was something temporary...).

For GUI users, I would like to see something like "Your database is corrupted. Do you want to rebuild it now, or exit?". For bitcoind I think the right approach is indeed quitting immediately with an error message instructing the user to restart with -reindex, so they can run the rebuild on their own terms (or perhaps copy a database from another instance or backup they have running).

@sipa
Bitcoin member
sipa commented Jan 4, 2013

@gmaxwell Suggestions for better reporting are welcome :)

@gmaxwell
gmaxwell commented Jan 4, 2013

The reason I suggested a basic check of undo data is simply because unless dbcache is very large we test far fewer blocks at level 2 and there is currently no way to do even basic checking of all the undo data. This means that if dbcache is set small, we may not discover corrupted undo data which could be trivially discovered, until a reorg.

For small dbcache values— we could give it a minimum value... or could make it always some reasonable number even if we go over our coincachesize limit?

I really wish there were a way to randomly test the coins database as I expect the leveldb disk layout probably ends up with old and new data being exposed to different errors. Oh well.

In any case, tested with fuzzed undo (with and without undo checksums) and coins data. It's quite sensitive to errors.

After running it some with a corrupted leveldb and corrupted undos, after restoring the non-corrupted files, I'm getting

2013-01-04 07:09:47 ERROR: DisconnectBlock() : block and undo data inconsistent
2013-01-04 07:09:47 ERROR: VerifyDB() : *** irrecoverable inconsistency in block data

So either I screwed up, or the corruption was somehow propagated to other files. I note that many of the error messages are repeated, so I can't easily tell which test failed. If giving them distinct messages is too much work, perhaps just slipping in FILE and LINE would be helpful. (though this isn't an issue in the codebase unique to the checking pull)

@sipa sipa New database check routine
-checklevel gets a new meaning:
0: verify blocks can be read from disk (like before)
1: verify (contextless) block validity (like before)
2: verify undo files can be read and have good checksums
3: verify coin database is consistent with the last few blocks
   (close to level 6 before)
4: verify all validity rules of the last few blocks

Level 3 is the new default, as it's reasonably fast. As level 3 and
4 are implemented using an in-memory rollback of the database, they
are limited to as many blocks as possible without exceeding the
limits set by -dbcache. The default of -dbcache=25 allows for some
150-200 blocks to be rolled back.

In case an error is found, the application quits with a message
instructing the user to restart with -reindex. Better instructions,
and automatic recovery (when possible) or automatic reindexing are
left as future work.
1f355b6
@sipa
Bitcoin member
sipa commented Jan 4, 2013

@gmaxwell I've added a new level in between 1 and 2 (which verifies undo data). I've also changed the heuristic for determining how far to roll back a bit - it now aims for using +- 5-10 MB extra for validation, even with very small dbcache.

@BitcoinPullTester

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/1f355b66cd90f1bd96a862604a7216e8ba56f853 for binaries and test log.

@gmaxwell
gmaxwell commented Jan 4, 2013

ACK ACK ACK

@sipa
Bitcoin member
sipa commented Jan 6, 2013

Well the reported messages are still confusing. It's not actually checking 2500 blocks at level 3...

@gmaxwell

Because this increases our sensitivity to corruption so much, I'd rather pull now and get the messages just right in another pull.

@gmaxwell gmaxwell merged commit 1f4b80a into bitcoin:master Jan 11, 2013
@sipa sipa deleted the sipa:checkcoins branch May 3, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.