Consensus failure after validator is slashed #1197
My validator (address: 295C0821D6D2EC71772E86773CD7F46F072CB764) was supposed to have been slashed, but it somehow still sent out pre-vote messages at the same height, and the network then had a consensus failure.
related error message:
After an extensive debugging session, we were able to track the problem to a bug in the staking data model.
We are making some of the debugging tooling available in a new
Note the address here:
We observed the following sequence of events in the blockchain:
At this point,
At this point, we loop through that set of secondary records and, for each one, attempt to load the primary record. We include a sanity check to ensure that the primary record exists. But since the primary record had been deleted while the secondary one had not, the sanity check failed and we panicked with
The reason the secondary record did not get deleted is as follows:
Since the computed power changes, when
At height 60522, when the other validator unbonds and we loop through the secondary records, we find this stale record fails the sanity check, so we panic.
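To make the failure mode concrete, here is a minimal Go sketch of how a secondary record can go stale. It is a toy model, not the SDK's staking code: the `Store` type, its methods, and the idea that the secondary key is derived from the validator's power are all assumptions made for illustration (the description above suggests the key depends on the computed power, but does not spell out the layout).

```go
package main

import "fmt"

// Store is a hypothetical stand-in for the staking store: a primary map keyed
// by validator address and a secondary index keyed by (power, address).
type Store struct {
	primary   map[string]int64  // address -> power at the time of the write
	secondary map[string]string // index key -> address
}

func NewStore() *Store {
	return &Store{primary: map[string]int64{}, secondary: map[string]string{}}
}

// indexKey builds the assumed secondary key from power and address.
func indexKey(power int64, addr string) string {
	return fmt.Sprintf("%020d/%s", power, addr)
}

// Set writes the primary record and its secondary index entry.
func (s *Store) Set(addr string, power int64) {
	s.primary[addr] = power
	s.secondary[indexKey(power, addr)] = addr
}

// RemoveBuggy deletes the primary record, but derives the secondary key from a
// freshly recomputed power. If that power differs from the one used when the
// entry was written, the delete misses and the index entry is left behind.
func (s *Store) RemoveBuggy(addr string, recomputedPower int64) {
	delete(s.primary, addr)
	delete(s.secondary, indexKey(recomputedPower, addr))
}

// RemoveFixed derives the secondary key from the power that was actually stored.
func (s *Store) RemoveFixed(addr string) {
	if power, ok := s.primary[addr]; ok {
		delete(s.secondary, indexKey(power, addr))
		delete(s.primary, addr)
	}
}

// StaleEntries returns secondary keys whose primary record no longer exists.
func (s *Store) StaleEntries() []string {
	var stale []string
	for key, addr := range s.secondary {
		if _, ok := s.primary[addr]; !ok {
			stale = append(stale, key)
		}
	}
	return stale
}
```

In this toy model, calling `Set("val1", 100)` followed by `RemoveBuggy("val1", 101)` leaves one entry in `StaleEntries()`, which is the kind of record the later loop trips over. `RemoveFixed` is only one way to avoid the problem in the sketch; it is not a description of the change that was shipped.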
Please note we are taking an explicit "FAIL CLOSED" approach to the software: we include liberal sanity checks in the code to ensure certain invariants aren't being violated. This ensures that we discover bugs sooner rather than later and halt the blockchain to fix them, rather than letting the chain continue to run and potentially be exploited.
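As a rough illustration of what "fail closed" means in code, the following Go sketch panics as soon as an indexed validator has no primary record, instead of silently skipping it. The `Validator` type, `mustLoadValidator`, `checkIndex`, and `didPanic` are hypothetical names used only for this example and do not correspond to SDK functions.

```go
package main

import "fmt"

// Validator is a hypothetical, simplified primary record.
type Validator struct {
	Address string
	Power   int64
}

// mustLoadValidator returns the primary record for addr and panics if it is
// missing. Halting here is the fail-closed choice: a missing record means an
// invariant is already broken, so continuing would run on corrupt state.
func mustLoadValidator(primary map[string]Validator, addr string) Validator {
	v, ok := primary[addr]
	if !ok {
		panic(fmt.Sprintf("invariant violated: no primary record for indexed validator %s", addr))
	}
	return v
}

// checkIndex walks the secondary records (here just a list of addresses) and
// loads the primary record for each one, panicking on the first mismatch.
func checkIndex(primary map[string]Validator, indexed []string) {
	for _, addr := range indexed {
		mustLoadValidator(primary, addr)
	}
}

// didPanic reports whether f panicked; it is only used to demonstrate the check.
func didPanic(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return
}
```

A fail-open variant would log the missing record and move on; the trade-off is that the chain keeps running on state that is known to be inconsistent.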
Independent of this bug, the staking specification was updated to not have inflationary coins automatically bonded. This would remove the relevance of the exchange rate and actually prevent this kind of bug.
We will make an immediate release that includes some minor fixes and also removes bonded inflation for the time being, in order to allow a new testnet to start. Note this release will be breaking and will require a new chain-id and restarting from the genesis block.
In the near term, we will release a new version of the staking module that reflects the new design.
Additionally, we are working on randomized testing to catch more bugs sooner.
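For readers unfamiliar with the idea, randomized testing here means applying long random sequences of operations and checking an invariant after every step. The Go sketch below shows the general shape on a toy store; `toyStore`, its methods, and `runRandomOps` are invented for this example and are not the SDK's test harness.

```go
package main

import "math/rand"

// toyStore is a hypothetical model: a primary map plus a power-keyed index.
type toyStore struct {
	power map[string]int64          // address -> power
	index map[int64]map[string]bool // power -> set of addresses
}

func newToyStore() *toyStore {
	return &toyStore{power: map[string]int64{}, index: map[int64]map[string]bool{}}
}

// unbond removes both the primary record and its index entry.
func (s *toyStore) unbond(addr string) {
	p, ok := s.power[addr]
	if !ok {
		return
	}
	delete(s.index[p], addr)
	delete(s.power, addr)
}

// bond sets a validator's power, dropping any previous index entry first.
func (s *toyStore) bond(addr string, power int64) {
	s.unbond(addr)
	s.power[addr] = power
	if s.index[power] == nil {
		s.index[power] = map[string]bool{}
	}
	s.index[power][addr] = true
}

// invariantHolds checks that every index entry has a matching primary record.
func (s *toyStore) invariantHolds() bool {
	for p, addrs := range s.index {
		for addr := range addrs {
			if got, ok := s.power[addr]; !ok || got != p {
				return false
			}
		}
	}
	return true
}

// runRandomOps applies n random bond/unbond operations from a fixed seed and
// returns the step at which the invariant first breaks, or -1 if it never does.
func runRandomOps(seed int64, n int) int {
	rng := rand.New(rand.NewSource(seed))
	s := newToyStore()
	addrs := []string{"a", "b", "c", "d"}
	for i := 0; i < n; i++ {
		addr := addrs[rng.Intn(len(addrs))]
		if rng.Intn(2) == 0 {
			s.bond(addr, int64(rng.Intn(100)+1))
		} else {
			s.unbond(addr)
		}
		if !s.invariantHolds() {
			return i
		}
	}
	return -1
}
```

Using a fixed seed keeps any failure reproducible: the same seed replays the same operation sequence, so the step number returned by `runRandomOps` can be investigated directly.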
We are also beginning a careful code review in an effort to improve the structure of the code.
It's already been added to the STATUS.md: https://github.com/cosmos/cosmos-sdk/blob/master/cmd/gaia/testnets/STATUS.md#june-13-2018-230-est---published-postmortem-of-gaia-6001-failure
And yes, we will also link to it from the new release notes.
Do you think that's sufficient? Or should we also copy it, for instance, into a markdown file in the testnets/gaia-6001 folder?
I think just linking to this somewhere intuitive or that people would normally check between releases (e.g., release notes) would allow validators to understand what got into a new release and what was wrong with a previous release or why something was removed/added. This is quite sufficient, at least for me, thanks.
Fixed in v0.19.0 on master. See https://github.com/cosmos/cosmos-sdk/tree/master/cmd/gaia/testnets for connecting to the new gaia-6002.