Can't restart node with cosmos-sdk v0.38.0 #5570

Closed
4 tasks
cyborgshead opened this issue Jan 27, 2020 · 4 comments · Fixed by #5579
@cyborgshead

cyborgshead commented Jan 27, 2020

Edited by @AdityaSripal

Cause of Bug

With the new pruning changes, IAVL only flushes to disk at each snapshot interval, which is defined by the SDK KeepEvery parameter. On restart, the application should replay blocks from the last persisted version (or from an empty state if nothing has been persisted). For the Tendermint process to restart the application correctly, however, the CommitInfo needs to contain the last persisted commit rather than the latest commit.
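
As a rough illustration of the gap described above, the following sketch computes which version would actually be on disk, assuming a version is flushed only when it is a multiple of KeepEvery. The function name lastPersistedVersion is hypothetical and is not part of the SDK or IAVL; it only models the arithmetic.

```go
package main

import "fmt"

// lastPersistedVersion is a hypothetical helper (not an SDK or IAVL API).
// It assumes a version reaches disk only when version % keepEvery == 0,
// and returns 0 when nothing has been persisted yet.
func lastPersistedVersion(latest, keepEvery int64) int64 {
	if latest <= 0 || keepEvery <= 0 {
		// Assumption for this sketch: treat invalid input as "nothing persisted".
		return 0
	}
	return latest - latest%keepEvery
}

func main() {
	// Latest committed height vs. the height that survives a restart.
	for _, latest := range []int64{10, 99, 100, 250} {
		fmt.Printf("latest=%d persisted=%d\n", latest, lastPersistedVersion(latest, 100))
	}
}
```

Under this assumption, a node stopped at height 10 with KeepEvery set to 100 has nothing on disk, so reporting height 10 as the last commit would not match the state the application can actually load.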

Solution

A couple of changes need to be integrated into the SDK:

  1. The CommitInfo needs to contain the hash of the latest persisted state only.
  2. The disk-flush interval needs to be much shorter, both to make restarting more convenient and to reduce the memory needed to hold intermediary state that has not yet been persisted. To make the pruning parameters more flexible, we need to introduce an additional parameter:
KeepRecent int64 // how many recent versions should we persist in memory
FlushEvery int64 // how often do we flush to disk
SnapshotEvery int64 // how often do we snapshot a version

Here {KeepRecent, FlushEvery} form the IAVL PruningOptions {KeepRecent, KeepEvery}.

On each commit of a FlushEvery version, the SDK will remove the previous FlushEvery version, unless that previous version is a snapshot version as defined by the SnapshotEvery parameter.
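
The rule above can be sketched as follows. This is a minimal model under stated assumptions, not the SDK's implementation: the PruningOptions struct and the methods FlushVersion, SnapshotVersion, and VersionToPrune are hypothetical names chosen for illustration, and KeepRecent is carried along but not used by the decision shown here.

```go
package main

import "fmt"

// PruningOptions mirrors the three parameters proposed in this issue.
// The type and its methods are hypothetical and only model the rule.
type PruningOptions struct {
	KeepRecent    int64 // recent versions kept in memory (unused in this sketch)
	FlushEvery    int64 // flush to disk every FlushEvery versions
	SnapshotEvery int64 // keep every SnapshotEvery-th version permanently
}

// FlushVersion reports whether version v is flushed to disk.
func (p PruningOptions) FlushVersion(v int64) bool {
	return p.FlushEvery > 0 && v%p.FlushEvery == 0
}

// SnapshotVersion reports whether version v is a snapshot version.
func (p PruningOptions) SnapshotVersion(v int64) bool {
	return p.SnapshotEvery > 0 && v%p.SnapshotEvery == 0
}

// VersionToPrune returns the previously flushed version that may be removed
// when version v is committed, or false if nothing should be removed.
func (p PruningOptions) VersionToPrune(v int64) (int64, bool) {
	if !p.FlushVersion(v) {
		return 0, false // not a flush commit, so nothing changes on disk
	}
	prev := v - p.FlushEvery
	if prev <= 0 || p.SnapshotVersion(prev) {
		return 0, false // no earlier flush, or it is a snapshot to keep
	}
	return prev, true
}

func main() {
	opts := PruningOptions{KeepRecent: 5, FlushEvery: 10, SnapshotEvery: 100}
	for _, v := range []int64{10, 15, 20, 110} {
		if prev, ok := opts.VersionToPrune(v); ok {
			fmt.Printf("commit %d: prune flushed version %d\n", v, prev)
		} else {
			fmt.Printf("commit %d: nothing to prune\n", v)
		}
	}
}
```

With FlushEvery set to 10 and SnapshotEvery set to 100 in this model, committing version 20 removes version 10, while committing version 110 leaves version 100 in place because it is a snapshot.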

Thanks to @ethanfrey and @zmanian for help diagnosing the issue and working out the solution.

End of edit

Summary of Bug

I started migrating cyber to the latest SDK, v0.38.0.
After refactoring the application and modules, it built and ran, but I found that after a node restart it crashes with a consensus failure every time. I spent the holidays trying to fix this, thinking it was a problem in my application, but then I tried a Gaia version bumped to v0.38 and hit the same issue.

Upgraded to 0.38.0 code, single node, start, stop, restart -> failure.

Stacktrace, restarting Gaia node

I[2020-01-27|14:46:12.284] starting ABCI with Tendermint                module=main 
panic: stored minter should not have been nil

goroutine 1 [running]:
github.com/cosmos/cosmos-sdk/x/mint/internal/keeper.Keeper.GetMinter(0xc00013c000, 0x52282e0, 0xc000b8caf0, 0xc00013c000, 0x52282e0, 0xc000b8cb30, 0x5228320, 0xc000b8cb70, 0xc000b95a20, 0x4, ...)
        github.com/cosmos/cosmos-sdk@v0.34.4-0.20200124164056-b647824716d9/x/mint/internal/keeper/keeper.go:57 +0x18f
github.com/cosmos/cosmos-sdk/x/mint.BeginBlocker(0x5238820, 0xc0000d8008, 0x524c360, 0xc0000a8e80, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        github.com/cosmos/cosmos-sdk@v0.34.4-0.20200124164056-b647824716d9/x/mint/abci.go:11 +0x8e
github.com/cosmos/cosmos-sdk/x/mint.AppModule.BeginBlock(...)
        github.com/cosmos/cosmos-sdk@v0.34.4-0.20200124164056-b647824716d9/x/mint/module.go:130
github.com/cosmos/cosmos-sdk/types/module.(*Manager).BeginBlock(0xc000139260, 0x5238820, 0xc0000d8008, 0x524c360, 0xc0000a8e80, 0xa, 0x0, 0x0, 0x0, 0x0, ...)
        github.com/cosmos/cosmos-sdk@v0.34.4-0.20200124164056-b647824716d9/types/module/module.go:297 +0x1ca
github.com/cosmos/gaia/app.(*GaiaApp).BeginBlocker(...)
        /Users/litvintech/Projects/gaia/app/app.go:299
github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).BeginBlock(0xc000b9fe00, 0xc000dd8680, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        github.com/cosmos/cosmos-sdk@v0.34.4-0.20200124164056-b647824716d9/baseapp/abci.go:136 +0x469
github.com/tendermint/tendermint/abci/client.(*localClient).BeginBlockSync(0xc0000cf620, 0xc000dd8680, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        github.com/tendermint/tendermint@v0.33.0/abci/client/local_client.go:231 +0x101
github.com/tendermint/tendermint/proxy.(*appConnConsensus).BeginBlockSync(0xc000cce940, 0xc000dd8680, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        github.com/tendermint/tendermint@v0.33.0/proxy/app_conn.go:69 +0x6b
github.com/tendermint/tendermint/state.execBlockOnProxyApp(0x52391e0, 0xc000b7cd00, 0x5245c00, 0xc000cce940, 0xc000c121c0, 0x524e260, 0xc000d6c000, 0x6, 0xc000dc04b0, 0xc)
        github.com/tendermint/tendermint@v0.33.0/state/execution.go:280 +0x3e1
github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc000b5a380, 0xa, 0x0, 0xc000dc0496, 0x6, 0xc000dc04b0, 0xc, 0x6, 0xc0000e4d80, 0x20, ...)
        github.com/tendermint/tendermint@v0.33.0/state/execution.go:131 +0x17a
github.com/tendermint/tendermint/consensus.(*Handshaker).replayBlock(0xc000d130b0, 0xa, 0x0, 0xc000dc0496, 0x6, 0xc000dc04b0, 0xc, 0x6, 0xc0000e4d80, 0x20, ...)
        github.com/tendermint/tendermint@v0.33.0/consensus/replay.go:475 +0x233
github.com/tendermint/tendermint/consensus.(*Handshaker).ReplayBlocks(0xc000ab90b0, 0xa, 0x0, 0xc000dc0496, 0x6, 0xc000dc04b0, 0xc, 0x6, 0xc0000e4d80, 0x20, ...)
        github.com/tendermint/tendermint@v0.33.0/consensus/replay.go:394 +0xe03
github.com/tendermint/tendermint/consensus.(*Handshaker).Handshake(0xc000d130b0, 0x524ef60, 0xc000ac6310, 0x80, 0x4d037c0)
        github.com/tendermint/tendermint@v0.33.0/consensus/replay.go:269 +0x485
github.com/tendermint/tendermint/node.doHandshake(0x524e260, 0xc000d6c000, 0xa, 0x0, 0xc000dc0496, 0x6, 0xc000dc04b0, 0xc, 0x6, 0xc0000e4d80, ...)
        github.com/tendermint/tendermint@v0.33.0/node/node.go:281 +0x19a
github.com/tendermint/tendermint/node.NewNode(0xc000b9f540, 0x5232e60, 0xc000b44000, 0xc000b8d350, 0x5217380, 0xc000ace920, 0xc000b8d4d0, 0x5032578, 0xc000b8d4e0, 0x52391e0, ...)
        github.com/tendermint/tendermint@v0.33.0/node/node.go:597 +0x343
github.com/cosmos/cosmos-sdk/server.startInProcess(0xc0000ef360, 0x5032dd8, 0x1d, 0x0, 0x0)
        github.com/cosmos/cosmos-sdk@v0.34.4-0.20200124164056-b647824716d9/server/start.go:157 +0x4c1
github.com/cosmos/cosmos-sdk/server.StartCmd.func1(0xc000370780, 0xc0000b5db0, 0x0, 0x1, 0x0, 0x0)
        github.com/cosmos/cosmos-sdk@v0.34.4-0.20200124164056-b647824716d9/server/start.go:67 +0xb4
github.com/spf13/cobra.(*Command).execute(0xc000370780, 0xc0000b5d90, 0x1, 0x1, 0xc000370780, 0xc0000b5d90)
        github.com/spf13/cobra@v0.0.5/command.go:826 +0x460
github.com/spf13/cobra.(*Command).ExecuteC(0xc0000f1900, 0x4ecdc0e, 0xc000ab5e90, 0x4185832)
        github.com/spf13/cobra@v0.0.5/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v0.0.5/command.go:864
github.com/tendermint/tendermint/libs/cli.Executor.Execute(0xc0000f1900, 0x5033220, 0x4eb1b22, 0x10)
        github.com/tendermint/tendermint@v0.33.0/libs/cli/setup.go:89 +0x3c
main.main()
        /Users/litvintech/Projects/gaia/cmd/gaiad/main.go:72 +0x8cb

This looks like a storage issue. It first halts in the mint module during BeginBlock, but I checked and the same happens with the other modules in OrderBeginBlockers.

store := ctx.KVStore(k.storeKey)
b := store.Get(types.MinterKey)
if b == nil {
	panic("stored minter should not have been nil")
}

Version

Cosmos-SDK release v0.38.0
Gaia b2f508950d11897fdc89924fad81b1045379a937

Steps to Reproduce

Take the Gaia commit given in the Version section and run:

./gaiad testnet --v=1 --output-dir=./mytestnet
./gaiad start --home=./mytestnet/node0/gaiad
stop node
./gaiad start --home=./mytestnet/node0/gaiad

Note

I initially asked @ethanfrey about this, and he confirmed that it is an SDK issue in the wasmd project: CosmWasm/wasmd#54

Update

@ethanfrey provided more in-depth details: CosmWasm/wasmd#54


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@alexanderbez
Contributor

ref: CosmWasm/wasmd#54 (comment)

@jackzampolin
Member

@AdityaSripal is working on a fix for this.

@ethanfrey
Contributor

I agree that the fix will work for the current sdk. However, it does break the generality of MultiStore.

A lot of work went into making this an abstract store that can use multiple sub-dbs, like an Ethereum Patricia tree, under one root. This pruning approach only applies to the IAVL substores. In 99+% of current cases this is the only substore used, so please make the fix and get v0.38.1 out. But also note that this adds tech debt (making rootmultistore only usable with the IAVL substore), so please open an issue for that and start working on a proper design that doesn't couple the two so closely.

@alexanderbez
Contributor

alexanderbez commented Jan 27, 2020

I think we can tackle this w/o introducing tech debt, which I've spent the better part of the last two months trying to reduce, so I know the pain. Instead of introducing changes to the root multistore, can we push the fix down to the IAVL store, most likely in SaveVersion, and do the custom logic and tracking there (essentially like we used to before we updated IAVL)? Surely, there must be a way.
