Assertion when restarting after a crash with pruning #9001

Open · tdaede opened this issue Oct 24, 2016 · 11 comments

@tdaede commented Oct 24, 2016

Steps to reproduce

  1. Run bitcoind on testnet with prune=1000
  2. Kill or crash bitcoind while it is syncing (such as running out of memory on a VPS)
  3. Start bitcoind again

Expected behaviour

Bitcoind starts again, potentially resyncing from scratch.

Actual behaviour

```
bitcoind: chain.cpp:96: CBlockIndex* CBlockIndex::GetAncestor(int): Assertion `pindexWalk->pprev' failed.
```
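For context, the failing assertion lives in the ancestor walk in chain.cpp. An abridged sketch of the 0.13-era code (the skip-pointer shortcuts are elided; only the pprev walk that trips the assert is shown):

```cpp
// chain.cpp (abridged sketch): walk back from this index to the requested
// height by following pprev. If an ancestor entry was never loaded into the
// block index, pprev is NULL mid-chain and the assertion fires.
CBlockIndex* CBlockIndex::GetAncestor(int height)
{
    if (height > nHeight || height < 0)
        return NULL;

    CBlockIndex* pindexWalk = this;
    int heightWalk = nHeight;
    while (heightWalk > height) {
        // (pskip shortcuts elided for brevity)
        assert(pindexWalk->pprev);  // the assertion reported above
        pindexWalk = pindexWalk->pprev;
        heightWalk--;
    }
    return pindexWalk;
}
```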

What version of bitcoin-core are you using?

0.13.1rc2 binaries

@fanquake (Member) commented Oct 24, 2016

What operating system are you using?
How much memory/disk space is available?

@unsystemizer (Contributor) commented Oct 24, 2016

> Such as running out of memory on a VPS

If that's the case, your filesystem may be corrupt, and of course the data on it as well. You should check it with fsck and make sure it's sound before suspecting bitcoind.

@jonasschnelli (Member) commented Oct 24, 2016

A sudden shutdown (crash/kill) of bitcoind may lead to database corruption (which results in a re-sync from scratch). I have often observed these types of corruption on VPSes.
Features like #8037 could provide relief in such cases...

IMO, applications with heavy database interaction like bitcoind (UTXO set interaction) tend to lose integrity in forced-shutdown (crash/kill) situations.

@tdaede (Author) commented Oct 24, 2016

@fanquake, this is Fedora 24 with 1GB of RAM. That said, this is 100% reproducible for me with the given settings. It shouldn't depend on available RAM.

@unsystemizer, the error is repeatable, and I don't think there is any situation where an OOM would lead to FS corruption.

Note that the issue isn't the database corruption itself, or that a re-sync is required - that's fine. The issue is that bitcoind hits an assert and exits immediately, rather than automatically starting a re-sync from scratch.

@jonasschnelli (Member) commented Oct 24, 2016

@tdaede: the assertion you hit is very likely caused by database corruption (in the block index).

@unsystemizer (Contributor) commented Oct 24, 2016

@tdaede - when you say it's repeatable: if you just restart bitcoind without fixing anything, I believe the failure repeats. It could also repeat if you shut down the system while bitcoind is running and corrupt the data in the same or a similar way. That doesn't mean it's a problem with Bitcoin Core. As I said, you need to prove with fsck that the filesystem is sound. Even then it may be a problem with LevelDB or something else that would have to be dealt with upstream.

On the claim that OOM can't lead to FS corruption: I disagree. See
https://unix.stackexchange.com/questions/12699/do-journaling-filesystems-guarantee-against-corruption-after-a-power-failure

@luke-jr (Member) commented Oct 26, 2016

@unsystemizer OOM isn't a power failure. It cannot cause filesystem corruption unless there are very serious kernel bugs.

@unsystemizer (Contributor) commented Oct 26, 2016

Or very serious bugs in the hypervisor or hardware drivers sitting below the VM, or somewhere else. It should still be shown that the filesystem is not corrupt, I think.

@jnewbery (Member) commented Jan 24, 2017

I think this is almost certainly not filesystem corruption. I can reproduce this failure mode by manually removing a single block from my block index when flushing to disk, and then starting bitcoind again. I hit the assert with this backtrace:

```
#0  CBlockIndex::GetAncestor (this=0x555556396400, height=8) at chain.cpp:105
#1  0x000055555592b7e6 in CBlockIndex::BuildSkip (this=0x555556395ec0) at chain.cpp:123
#2  0x000055555587de97 in LoadBlockIndexDB (chainparams=...) at validation.cpp:3526
#3  0x00005555558800f9 in LoadBlockIndex (chainparams=...) at validation.cpp:3815
#4  0x00005555555b1fb2 in AppInitMain (threadGroup=..., scheduler=...) at init.cpp:1428
#5  0x0000555555583f8f in AppInit (argc=2, argv=0x7fffffffe568) at bitcoind.cpp:167
#6  0x0000555555584668 in main (argc=2, argv=0x7fffffffe568) at bitcoind.cpp:196
```

If I do anything that corrupts the block index in a more intrusive way (e.g. removing the pprev pointer or changing any of the other header fields), then we fail at a different point: the block hash is no longer valid, so we fail CheckProofOfWork(). Critically, CBlockIndex::BuildSkip in LoadBlockIndexDB() is called after LoadBlockIndexGuts(), which loads all of the block indexes from disk. So it looks to me like this failure mode can probably only be hit if all the indexes on disk are valid, but one or more blocks are missing.
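For reference, BuildSkip in the era's chain.cpp is essentially the following (abridged); it is what sends us into the GetAncestor walk shown in the backtrace above:

```cpp
// chain.cpp (abridged sketch): LoadBlockIndexDB calls BuildSkip for every
// entry after LoadBlockIndexGuts() has loaded and linked them. GetAncestor
// then walks pprev pointers back through the ancestors, and asserts if one
// of them was never written to disk.
void CBlockIndex::BuildSkip()
{
    if (pprev)
        pskip = pprev->GetAncestor(GetSkipHeight(nHeight));
}
```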

@tdaede are you able to upload debug.log? You suspect that this may be something to do with pruning. I've had quite a close read of that code and I can't see where it would cause us to lose block indexes or not flush them to disk. If you have a debug.log, it might help pin down where we're losing the block index.

I'm also planning to open a PR which gives us slightly better diagnostics here by printing out the blockhash and height of the orphan block.
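A sketch of the kind of diagnostic this could be (a hypothetical illustration, not the actual PR; the surrounding walk is abridged from GetAncestor):

```cpp
// Hypothetical tweak to the GetAncestor walk (not the actual PR): log the
// orphaned entry before asserting, so debug.log records which block's
// parent is missing from the index.
CBlockIndex* pindexWalk = this;
int heightWalk = nHeight;
while (heightWalk > height) {
    if (pindexWalk->pprev == NULL) {
        LogPrintf("GetAncestor: no parent for block %s at walk height %d\n",
                  pindexWalk->GetBlockHash().ToString(), heightWalk);
        assert(pindexWalk->pprev);  // still fail, but with context in the log
    }
    pindexWalk = pindexWalk->pprev;
    heightWalk--;
}
```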

@tdaede (Author) commented Jan 25, 2017

@jnewbery I had an offline conversation with @gmaxwell, and apparently flushing is disabled during initial sync for speed, which means that the indexes can end up ahead of the block database. A simple fix would be to disable this optimization if it no longer gives significant speed gains.
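A minimal sketch of the hazard as described (illustrative only; the function names below are hypothetical placeholders, not Bitcoin Core's actual call sequence):

```cpp
// Illustrative only -- hypothetical names, not the real call graph. The point
// is the ordering: the index flush is deferred during initial sync for speed,
// so a crash mid-sync can leave the block files and the LevelDB block index
// describing different "latest" states.
void ConnectNewBlock(const CBlock& block)
{
    AddToBlockIndexInMemory(block);   // 1. in-memory index entry created
    AppendBlockToBlkFile(block);      // 2. raw block appended to blk*.dat
    if (!IsInitialBlockDownload()) {  // real function; flush skipped during IBD
        FlushBlockIndexToDisk();      // 3. hypothetical deferred flush
    }
    // a crash during initial sync loses whatever was only held in memory
}
```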

There is also no way to re-fetch the missing block when pruning is enabled, so you have to start over.

@jnewbery (Member) commented Jan 25, 2017

Thanks @tdaede. That makes sense, but I still don't understand how the block index database can get into this bad state. This assert is only hit if there's a block in the database with a parent which isn't in the database. I don't yet understand how not flushing during startup could get us into that state.
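For reference on how such an entry can exist at all: when LoadBlockIndexGuts() meets a record whose parent hash isn't in the map, it creates a blank stub for the parent. Abridged from the era's InsertBlockIndex (the stub has pprev == NULL, which is what the GetAncestor walk later trips over):

```cpp
// main.cpp (abridged sketch): resolve a block hash to its index entry while
// loading. If the hash was never flushed to the index database, a blank
// CBlockIndex is created for it -- no pprev, no data -- and the child record
// is linked to that stub.
CBlockIndex* InsertBlockIndex(uint256 hash)
{
    if (hash.IsNull())
        return NULL;

    // Return an existing entry
    BlockMap::iterator mi = mapBlockIndex.find(hash);
    if (mi != mapBlockIndex.end())
        return (*mi).second;

    // Create a new, empty entry
    CBlockIndex* pindexNew = new CBlockIndex();
    mi = mapBlockIndex.insert(std::make_pair(hash, pindexNew)).first;
    pindexNew->phashBlock = &((*mi).first);
    return pindexNew;
}
```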

I'll try to get hold of @gmaxwell later today to try to understand this a bit better.
