Chain crashed with consensus failure, about 8 hrs after a hard fork. Delegator staking error. #7506
Comments
This seems to affect 0.39.1 @clevinson @ethanfrey @alexanderbez
Can you please! https://pastebin.com/ is a good place to paste large chunks. Your genesis file would be helpful as well.
This is virtually impossible to debug unless we can detect this in simulations with a specific seed (which we haven't). My suspicion is that something in the export/upgrade process may have corrupted state somehow.
Does it happen every time you restart with the same db? Does it happen to a new node that syncs from genesis? Please save a complete backup of the data and config dirs and see how well you can reproduce it.
I have a total backup of my .blzd folders on the ONLY two validators for this network. If I start either validator up with this .blzd state, I get this error, very deterministically. I reset the network, but I DID back up .blzd first, to ensure we could investigate. .blzd is not large. Please let me know what to do. I could upload both copies of .blzd (for the two validators) someplace for you, if that helps to resolve this. @alexanderbez I do not know how I could have done something wrong in the export or upgrade. I am no expert on this particular process, but I followed these instructions (which are very brief): Note that I had to start two validators even to get the new network running, which in itself ran counter to my understanding, given that one of the validators has enough power to create blocks. But that might be irrelevant.
@ethanfrey It happens when I start either of the two validators I have. They literally crash on start. As mentioned in the previous comment, please let me know if I should just give you guys a tarball of my two .blzd folders. You can then duplicate the behaviour (well, at least, I can).
Also, this CANNOT be a coincidence. The crash happened at block 6031, both times. By both times, I mean that it happened when I initially launched the new upgraded chain. Then, that crashed and burned. I reset both validators, started again, and it died again at the same block! Here is another dump from the first crash:
I am not an expert on the staking module at all... I have had nothing to do with it. I am just trying to figure out a reproducible test case for someone to debug. I think those blzd dirs as tarballs would be good to share (link from dropbox or other such?) Also, can you take a copy of one of those dirs, delete
Also, pointing to the exact code commit you are running is very helpful. I wonder if you have any code that interacts with the staking or distribution system somehow? Moving coins in fee collector or community pool maybe?
@ethanfrey Here are links to the .blzd folder (click the download icon on the top right... not always obvious): https://drive.google.com/file/d/1xYJPJCjUG0lGNwozsoIx5mZCoz1KeBy-/view?usp=sharing Please untar and try two different daemons. You will obviously need to build the daemon with our code and point the peers at each other (config.toml)... obvious stuff, I am sure. Link to the code (please use the test-cm-409 branch): https://github.com/bluzelle/curium We were using the 0.39.1 version of the Cosmos SDK. No code that directly talks to the staking or distribution system. Nothing quite that sophisticated, yet.
Anything else I can provide to help resolve this?
@njmurarka this is virtually impossible to debug :-/ but the best suggestion I can give is that it's most likely operator error, as we've executed many upgrades (both halted and live) w/o any issues.
@alexanderbez I am not too sure how I could have made an error. Of course, I am not denying it is possible. But what are the instructions then to do an upgrade? Right now, I have only got the following link (that I posted above): For convenience, I have posted the instructions I followed below:
I did not deviate from this (did not even whitelist jailed validators, partly because I could not tell what the syntax was for the whitelist command), but do note that I had to start two validators just to get the new network to start. I don't know if or how this fact is related to the crash, but I opened a separate bug for this "need to start two validators" requirement. I would really appreciate it if you could provide me with the instructions you follow to do an upgrade. On the matter of this bug I filed, shall I assume then that the .blzd folder tarballs I uploaded earlier as per @ethanfrey's request were not helpful? Thanks.
You'll need as many validators as it takes to get enough power online: could be two, could be 80. I recall you manually modified power or something? Did you manually modify anything at all?
I specifically ensured that the one validator I was bringing up in the new network had far more than enough power (> 70%) before I did the export. So with the new network, this validator alone should have enough power to start alone, right? Manually modified power?
How did you "ensure" this w/o tweaking anything?
I stopped one of the nodes. I exported the genesis file. I did not change its contents, and then I followed the directions listed above to start a new network with that exported genesis file. I did ensure, before exporting (many blocks before), that I had a validator (let's call it validatorBob) that had over 70% voting power. The reason was that when I started the new network, I ensured the first validator was the very same validatorBob node (same validator private key, etc...). So ok, I had validatorBob stake a lot more, to ensure this validator singularly had supermajority voting power. The rationale was so that I could bring up just one validator on the network to test the new network. Unrelated, but I also discovered that this validator alone would not start to create blocks, despite having that voting power. I had to start a "token" other validator (the power of this second validator was irrelevant) to get the new network going. Odd, but it does not in any obvious way seem related to the crash.
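For reference, the "enough power online" condition discussed here can be sketched as a strict greater-than-two-thirds check on voting power. This is only an illustration of the threshold, not Tendermint's actual code; `hasQuorum` and the power figures are hypothetical, and the powers that matter are the ones in the exported genesis, which may differ from what the old chain reported.

```go
package main

import "fmt"

// hasQuorum reports whether onlinePower is strictly more than 2/3 of
// totalPower. Integer arithmetic avoids floating-point rounding at the
// boundary. (Hypothetical helper, not part of Tendermint or the SDK.)
func hasQuorum(onlinePower, totalPower int64) bool {
	if totalPower <= 0 {
		return false
	}
	return 3*onlinePower > 2*totalPower
}

func main() {
	// Assumed figures: a single validator holding 70 of 100 power units.
	fmt.Println("70/100 online:", hasQuorum(70, 100))
	// Exactly two thirds is not enough; the threshold is strict.
	fmt.Println("2/3 online:", hasQuorum(2, 3))
}
```

If a single validator that passes this check on paper still cannot produce blocks, comparing the validator powers recorded in the exported genesis against the expected ones would be a reasonable first thing to look at.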
Does this help? I really would love some guidance on what I could have done wrong. It is really difficult to know what to do, as the instructions to upgrade a network are pretty short, so I can't see where I might have done something wrong. Thank you.
Update. I did the same as before but with a newer "export"... but used the same process. No crash yet at block 9,000. Still, isn't anyone here interested in finding out how and why it crashed? I did not "fix" anything. I can reproduce the issue readily... so @alexanderbez @ethanfrey I am under the impression we don't need to reproduce this in a simulator. Am I wrong?
Wdym by a "newer" export? Just at a later height?
Yes. Later height. No other difference. I am assuming I am not the only one bothered by the fact this "problem" happened. I am delighted it has not happened again, but it raises the question of why it happened. Like I said, I did not do anything outside of the scope of the instructions for an upgrade. Also, if you were to grab the two tarballs I provided above and try to deploy two quick validators pointing at each other, you will immediately see the crash, in the flesh. Unless there is evidence I did something wrong, I have to assume this could happen again to me or someone else.
Well, we haven't seen this problem before and we're doing a handful of upgrades for Stargate. But let's leave the issue for now. I don't have any suggestions for how to proceed atm.
Let's keep it open then. While I am SOMEWHAT ok that the problem is not re-occurring, as a rational person, it bothers me that it has not specifically been replicated and fixed. I have to take the safe stance that it could happen again, even if I am the only person unfortunate enough to have run into it. Let me know your thoughts.
Summary of Bug
I recently exported the genesis from my old network, as per ticket #7505. I was able to "successfully" launch a new network of two nodes, using this exported genesis.
I ran the new network for a while, and then, after about 6,000 blocks, I got the following error:
CONSENSUS FAILURE: Calculated final stake for delegator... greater than current stake final stake.
The code has not really changed all that much, and I am in fact still using the same version of the Cosmos SDK. So it is quite puzzling. Furthermore, it is worrisome that the old network ran for months without an issue, yet the fork crashed after 8 hrs.
I can provide trace and logs if needed. I kept a copy of everything.
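To make the error easier to reason about, here is a minimal sketch of the kind of sanity check the message describes: replay the recorded slashes against a delegator's starting stake and compare the result with the stake currently on record. It is an assumption-laden simplification, not the SDK's code; `slashEvent`, `finalStake`, `checkStake`, the basis-point representation, and the `margin` parameter are all hypothetical.

```go
package main

import "fmt"

// slashEvent is a hypothetical, simplified record of one validator slash.
type slashEvent struct {
	fractionBps int64 // slashed fraction in basis points (1/10000)
}

// finalStake replays each slash against the delegator's starting stake.
func finalStake(startingStake int64, events []slashEvent) int64 {
	stake := startingStake
	for _, e := range events {
		stake = stake * (10000 - e.fractionBps) / 10000
	}
	return stake
}

// checkStake returns an error when the replayed stake exceeds the current
// stake by more than margin, mirroring the shape of the reported failure.
func checkStake(startingStake, currentStake, margin int64, events []slashEvent) error {
	calc := finalStake(startingStake, events)
	if calc > currentStake+margin {
		return fmt.Errorf("calculated final stake %d greater than current stake %d", calc, currentStake)
	}
	return nil
}

func main() {
	// Consistent state: one 5% slash is recorded and the stake reflects it.
	events := []slashEvent{{fractionBps: 500}}
	fmt.Println(checkStake(1000000, 950000, 0, events))

	// Inconsistent state: the stake was reduced but the slash record is
	// missing, so the replayed stake comes out higher than the current one.
	fmt.Println(checkStake(1000000, 950000, 0, nil))
}
```

Under these assumptions, the check only fails when the starting stake, the slash history, and the current stake disagree with each other, which is why an export that drops or alters one of the three would be a plausible suspect.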
Some other similar mentions:
#4088
https://www.gitmemory.com/issue/cosmos/cosmos-sdk/4012/480477596
Version
cosmos-sdk v0.39.1
Steps to Reproduce
Above.