Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem: iavl stores end up with inconsistent versions after upgrade #13477

Closed
yihuang opened this issue Oct 8, 2022 · 8 comments
Closed

Problem: iavl stores end up with inconsistent versions after upgrade #13477

yihuang opened this issue Oct 8, 2022 · 8 comments

Comments

@yihuang
Copy link
Collaborator

yihuang commented Oct 8, 2022

Summary of Bug

We found a app hash mismatch in our testnet which upgrade to sdk 0.46 recently, we traced to a inconsistent iavl tree version number in a store that is added in the upgrade.

The hypothesis for how this happens is like this, the migration is slow, so some nodes manually killed the node during the commit event, which will trigger following bug:

  • first startup
    • handle store upgrades, there's a newly added feeibc store, set the initialVersion to latestVersion+1.
  • first commit
    • commit feeibc at initialVersion
    • kill node before flush metadata.
  • second startup, replay the same block
    • the commit info for feeibc don't exists, so call LoadVersion with target version 0, which will load whatever latest version of the iavl tree, which is latestVersion+1.
  • second commit
    • feeibc store is commited normally, with version increased to latestVersion + 2, while the other stores is at latestVersion + 1.

Then as long as feeibc store is empty, the issue get unnoticed, until some day it's been written to and trigger the app hash mismatch.

Version

Steps to Reproduce

@alexanderbez
Copy link
Contributor

@yihuang are the repro steps just based on hypothesis, or are you certain that Commit is abrupted prior to metadata being flushed?

@yihuang
Copy link
Collaborator Author

yihuang commented Oct 8, 2022

@yihuang are the repro steps just based on hypothesis, or are you certain that Commit is abrupted prior to metadata being flushed?

The steps are hypothesis, will try to reproduce locally at Monday.
What's observed is the bad node always save feeibc store one version bigger than the other stores, the commit info is like this:

...
feegrant 5504246
feeibc 5504247
feemarket 5504246
...

And the log shows the node is killed several times during the upgrade migration.

@alexanderbez
Copy link
Contributor

Hypothesis makes sense. Unfortunately, there's no direct way to use the same DB batch object when committing all the IAVL stores and the metadata. In other words, it's not atomic. Another reason why I really dislike this multi-logical-db store approach.

Note, the store refactor has an item for using a single logical store.

@yihuang
Copy link
Collaborator Author

yihuang commented Oct 10, 2022

@alexanderbez what do you think should be the best short term solution?

  • when a new iavl store is found not empty, reject by return error
  • when a new iavl store is found not empty, override the version
  • make the commit atomic.

@alexanderbez
Copy link
Contributor

when a new iavl store is found not empty, reject by return error

If we return an error, how will the module's store be used successfully? Will the chain operate correctly, specifically for the new module?

when a new iavl store is found not empty, override the version

Seems like the most plausible solution 👍

make the commit atomic

This is the ideal solution, but unfortunately, not possible in the current design. #12986 should solve this by using a single logical SC store, but that is a future upcoming refactor.

@yihuang
Copy link
Collaborator Author

yihuang commented Oct 11, 2022

when a new iavl store is found not empty, override the version

Seems like the most plausible solution 👍

I'm not sure how to do this though, do we ignore the loaded roots, and force it to be a empty iavl store, will it work when commit?

@alexanderbez
Copy link
Contributor

No, we just fix the version/height of the faulty module to be correct, in this case +1. This would happen in the root-MS.

@yihuang
Copy link
Collaborator Author

yihuang commented Oct 26, 2022

fixed by #13530

@yihuang yihuang closed this as completed Oct 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants