Sync fail due to possible DB corruption #397

Closed
serejandmyself opened this issue Oct 4, 2019 · 2 comments

@serejandmyself
Member

serejandmyself commented Oct 4, 2019

Current Behavior

I came across an issue while running a validator node.
Every so often, the system finds a checksum mismatch in a block and crashes with the error:
"Corruption on data-block checksum mismatch error".
All the obvious things have been tried: deleting the DB, re-syncing, starting a new validator, creating new accounts, reinstalling dependencies, etc.
The error keeps recurring.
The blocks are different each time, and the head block that the chain is synced up to is much higher than the block with the mismatch.
In fact, the validator works perfectly for a while before failing.
NOTE: OFTEN the chain keeps on syncing (6 - 12 hours after) if I leave it running, but of course it crashes again thereafter.

Expected Behavior

The chain should sync stably and continuously.

Reproduction

Not sure if it's possible to reproduce on purpose.

But it has been mentioned in one way or another in some places for other projects' DBs, e.g. BTC, ETH:

Log

This is what the error itself looks like at the point where the chain crashes, although the block number differs from time to time:
CONSENSUS FAILURE!!! module=consensus err="leveldb/table: corruption on data-block (pos=399680): checksum mismatch, want=0xcf6de1ec got=0x99ba8252 [file=97839418.ldb]" stack="goroutine 1022538 [running]:\nruntime/debug.Stack(0xc0f3301870, 0xfd53c0, 0xc0578403c0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9d\ngithub.com/tendermint/tendermint/consensus.
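
When comparing several crashes, it can help to pull the position, checksums and file name out of each message and see whether the same .ldb file or offset ever repeats. The following is only a minimal sketch: the pattern is derived from the single log line above, so other versions may word the message differently, and the function name parse_corruption is made up for illustration.

```python
import re

# Pattern built from the one corruption message quoted above; the named groups
# (pos, want, got, file) mirror the fields in that message.
CORRUPTION_RE = re.compile(
    r"corruption on data-block \(pos=(?P<pos>\d+)\): checksum mismatch, "
    r"want=(?P<want>0x[0-9a-fA-F]+) got=(?P<got>0x[0-9a-fA-F]+) "
    r"\[file=(?P<file>[^\]]+)\]"
)


def parse_corruption(line):
    """Return the corruption details found in a log line, or None if absent."""
    match = CORRUPTION_RE.search(line)
    if match is None:
        return None
    return {
        "pos": int(match.group("pos")),
        "want": int(match.group("want"), 16),
        "got": int(match.group("got"), 16),
        "file": match.group("file"),
    }


sample = (
    'err="leveldb/table: corruption on data-block (pos=399680): '
    'checksum mismatch, want=0xcf6de1ec got=0x99ba8252 [file=97839418.ldb]"'
)
info = parse_corruption(sample)
print(info)
```

Collecting these records over several crashes would show whether the corruption always hits different files, which is what I have seen so far.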

This is what the log looks like when the node tries to sync with the mismatch already in place:
(This is from a different crash than the one above, but it looks exactly the same.)
E[2019-10-01|07:01:45.455] Connection failed @ sendRoutine module=p2p peer=561ac562a79db5c7aebc4dbefd2d728836ce412e@0.0.0.0:26656 conn=MConn{93.125.26.210:26656} err="pong timeout"
E[2019-10-01|07:01:45.455] Stopping peer for error module=p2p peer="Peer{MConn{93.125.26.210:26656} 561ac562a79db5c7aebc4dbefd2d728836ce412e out}" err="pong timeout"
E[2019-10-01|07:01:45.539] Connection failed @ sendRoutine module=p2p peer=b34bcaa7536d0f7e09f775d56ceced3c29ba62c0@95.216.244.235:46656 conn=MConn{95.216.244.235:46656} err="pong timeout"
E[2019-10-01|07:01:45.539] Stopping peer for error module=p2p peer="Peer{MConn{95.216.244.235:46656} b34bcaa7536d0f7e09f775d56ceced3c29ba62c0 out}" err="pong timeout"
E[2019-10-01|07:02:00.651] Failed Sanity Check! Cant add old address to new bucket module=p2p book=/root/.cyberd/config/addrbook.json ka="&{Addr:b34bcaa7536d0f7e09f775d56ceced3c29ba62c0@95.216.244.235:46656 Src:6a0fb53aeedbd6882963413ad6cc5bd52cf01cdb@0.0.0.0:26656 Attempts:0 LastAttempt:2019-09-30 16:23:23.102398865 +0000 UTC m=+13558.026448007 LastSuccess:2019-09-30 16:23:23.102398865 +0000 UTC m=+13558.026448007 BucketType:2 Buckets:[50]}" bucket=102
E[2019-10-01|07:02:05.332] Error on broadcastTxCommit module=rpc err="Timed out waiting for tx to be included in a block
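
To get an overview of entries like these, for example which err values dominate after a crash, a small script can split the output into individual entries and count them. This is a rough sketch, not a full parser: it assumes every entry starts with a level letter and a bracketed timestamp as in the log above, it also copes with entries that have run together on one line, and the sample blob is a shortened copy of the entries above. The helper names are hypothetical.

```python
import re
from collections import Counter

# An entry is assumed to start with a level letter (E, I or D) followed by a
# bracketed timestamp, e.g. "E[2019-10-01|07:01:45.455]".
ENTRY_START = re.compile(r"[EID]\[\d{4}-\d{2}-\d{2}\|\d{2}:\d{2}:\d{2}\.\d{3}\]")


def split_entries(blob):
    """Split log text into one string per entry, based on entry start markers."""
    starts = [m.start() for m in ENTRY_START.finditer(blob)]
    ends = starts[1:] + [len(blob)]
    return [blob[s:e].strip() for s, e in zip(starts, ends)]


def error_summary(entries):
    """Count entries by the text of their err="..." field, if they have one."""
    counts = Counter()
    for entry in entries:
        match = re.search(r'err="([^"]*)', entry)
        counts[match.group(1) if match else "(no err field)"] += 1
    return counts


# Shortened sample: peer, conn, book and ka fields are left out for brevity.
blob = (
    'E[2019-10-01|07:01:45.455] Connection failed @ sendRoutine module=p2p err="pong timeout" '
    'E[2019-10-01|07:01:45.455] Stopping peer for error module=p2p err="pong timeout" '
    'E[2019-10-01|07:02:00.651] Failed Sanity Check! Cant add old address to new bucket module=p2p bucket=102'
)
entries = split_entries(blob)
summary = error_summary(entries)
for text, count in summary.most_common():
    print(count, text)
```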

Additional Information

System (local machine):

  • Ubuntu 18.04 64bits
  • X570 aorus elite MB
  • 32gb ram (3200 MHz)
  • Ryzen 5 3600 6 core

Some information from tendermint users (no one actually has a solution).
I have opened a similar issue on the tendermint git:

  • Possible issue with nondeterminism in the state machine (i.e. the tendermint app)
  • Possible issue with tendermint blocks database, not the abci app
  • Possible LevelDB corruption
  • Possible faulty memory (hardware) or a problem with the disk subsystem
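
To illustrate why faulty memory or disk would show up as exactly this kind of error: a single flipped bit in a stored block is enough to make the recomputed checksum differ from the stored one. The sketch below is illustrative only and is not the goleveldb code. It assumes, as I understand the LevelDB table format, that each block carries a masked CRC-32C computed over the block contents plus a one-byte compression type; the block contents here are made up.

```python
def _make_table():
    """Build the lookup table for CRC-32C (Castagnoli), reflected polynomial."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    """Plain CRC-32C of a bytes-like object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def leveldb_mask(crc):
    """Mask a CRC the way LevelDB does: rotate right by 15 bits, add a constant."""
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def block_checksum(payload, compression_type=0):
    """Checksum over payload plus the type byte (assumed layout, see above)."""
    return leveldb_mask(crc32c(bytes(payload) + bytes([compression_type])))


# Made-up block contents; "want" is the checksum stored when the block was written.
block = bytearray(b"example block contents" * 100)
want = block_checksum(block)

# Simulate one bit flipping in RAM or on disk, then recompute on read.
block[1234] ^= 0x01
got = block_checksum(block)

print("want=0x%08x got=0x%08x mismatch=%s" % (want, got, want != got))
```

A CRC always detects a single flipped bit, so the read fails no matter which block is hit, which would fit the mismatch showing up in a different file each time.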
@cyborgshead
Member

cyborgshead commented Nov 26, 2019

@serejandmyself going to close this, please reopen if the same happens with the new release

cyber automation moved this from To do to Done Nov 26, 2019
@melekes

melekes commented Apr 13, 2020

@litvintech have you fixed this? I don't see any linked PR.
