Launching a new network with exported genesis.json requires 2+ validators #7505
Comments
I can provide the genesis.json file, if that helps. I am able to reproduce this at will.
I was able to reproduce this issue on my local testnet. My voting power was split 80-20 between validators. The network should have started, but it did not unless there were at least two validators online.
@kaustubhkapatral Thanks for verifying and confirming this. Are we in agreement this is a bug? I know for a fact that a new network (not an upgrade) starts just fine with a single validator. So there is no logical reason in my mind why a local testnet with a single validator holding 80% voting power cannot run the network. If this is a bug, I am of the mind it is a pretty serious one. FYI, I was able to replicate this not just in the "upgrading scenario" but also in a simple case as follows:
If all is correct, this validator should start up, run, and create blocks. The validators that are not peered are not necessary, as this validator is enough. In fact, I found that this validator was not able to start the network.
@anilcse Hi Anil. Any thoughts here? Would love to help resolve or explain this :). My hypothesis (and 2 cents): when a network has multiple validators in the validator set, a validator (the first node) can only be started if it can peer with another node (the second node). It matters not what the peer list is... even if it is empty. It does not matter if that first node has even close to 100% voting power. The second node is needed literally to "jump start" the network (like starting a car whose battery is dead). Once the network is started, the second node can be turned off, and we are good. But if you stop that second node, and then stop that first node (so now nothing is running) and try to start that first node alone, you are back to square one. Start the second node again, and you are good. Best metaphor: a car with a dead battery that can be started with a jump, but turn off the car and it can't start again without another jump. Most of what I said above has been verified by me. I can do more tests if you want.
Yes @njmurarka, the observations are correct and I am able to reproduce the same. I am not 100% sure if this is normal/expected behaviour; I will defer this to @alexanderbez @alessio.
Hi gentlemen. Any thoughts on this?
I had a call with @alessio, and we were able to reproduce this issue on commercionetwork too. The setup we used is simple: just two validators starting from scratch after an export. Interesting stuff was revealed to us when running the daemon; the relevant check is Tendermint's BlockPool.IsCaughtUp:
```go
// IsCaughtUp returns true if this node is caught up, false - otherwise.
// TODO: relax conditions, prevent abuse.
func (pool *BlockPool) IsCaughtUp() bool {
	pool.mtx.Lock()
	defer pool.mtx.Unlock()

	// Need at least 1 peer to be considered caught up.
	if len(pool.peers) == 0 {
		pool.Logger.Debug("Blockpool has no peers")
		return false
	}

	// Some conditions to determine if we're caught up.
	// Ensures we've either received a block or waited some amount of time,
	// and that we're synced to the highest known height.
	// Note we use maxPeerHeight - 1 because to sync block H requires block H+1
	// to verify the LastCommit.
	receivedBlockOrTimedOut := pool.height > 0 || time.Since(pool.startTime) > 5*time.Second
	ourChainIsLongestAmongPeers := pool.maxPeerHeight == 0 || pool.height >= (pool.maxPeerHeight-1)
	isCaughtUp := receivedBlockOrTimedOut && ourChainIsLongestAmongPeers
	return isCaughtUp
}
```
Given the purpose of that check, I wonder if this is expected behavior or not. Shouldn't a validator be able to create blocks simply because it has enough power, rather than depending on whether it has seen peers or not?
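For readers following along, here is a minimal, self-contained sketch of how the fast-sync stage is assumed to gate on that check; the stubPool type and the polling loop below are illustrative stand-ins, not Tendermint source. The point is only that with zero peers, IsCaughtUp never becomes true, so the node never hands off to consensus and never proposes blocks:

```go
// Illustrative sketch only (assumed, simplified behaviour; not Tendermint source):
// the fast-sync reactor polls the block pool and hands off to the consensus
// reactor only once the pool reports it is caught up. The stub below mimics the
// zero-peer case, where IsCaughtUp() can never become true.
package main

import (
	"fmt"
	"time"
)

// stubPool mimics only the early-return behaviour of the BlockPool quoted above.
type stubPool struct{ numPeers int }

func (p *stubPool) IsCaughtUp() bool {
	// Mirrors the early return in the real code: no peers means never caught up.
	return p.numPeers > 0
}

func main() {
	pool := &stubPool{numPeers: 0} // a lone restarted validator sees no peers
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; i < 3; i++ {
		<-ticker.C
		if pool.IsCaughtUp() {
			fmt.Println("caught up: switch to consensus, block production begins")
			return
		}
		fmt.Println("not caught up: stay in fast sync, no blocks are produced")
	}
	fmt.Println("...and so on indefinitely, which matches the observed hang")
}
```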
@gsora That's great investigative work. Thanks for that. I hope the powers that be (including you) will be able to either rationalize this behaviour, or confirm it is a bug and resolve it. I am curious, though. If I start a virgin new network with a single validator, I can stop and start that validator just fine. The code above would not allow that validator to start after it has been stopped. Unless, of course, the code above only runs, and is only a condition for starting, when there have been multiple validators in the past. I was talking to @alessio and @tessr about this matter a bit in Discord chat, and my own feeling is that this is very contrary to expected behaviour. To me, it is pretty simple -- if I have the voting power, I get to create blocks. It should not be relevant what other validators have existed or still exist, as long as I hold sufficient power. I said essentially the same thing to them in that chat.
Excellent debugging work. I too am curious why this isn't an issue on a "clean" chain but is on an exported chain under this criteria. Why is the conditional falsey on export + restart, but truthy on a "clean" chain? WRT the why, @melekes can you possibly explain why such a conditional exists there?
Have you tried using v1 or v2 of the blockchain reactor? I wonder if this problem persists there as well. It should not, but if it does, then the blockchain reactor is not the culprit.
@alexanderbez As an experiment, I tried to NOT export the genesis. Rather, I just took the same genesis as the old one, verbatim, "copied" two nodes over (one of which had close to 95% voting power), peered them with each other directly, and started it. It worked. But I then removed the weaker-powered peer in this two-node network and tried to start only mister-95%. Same problem. It appears to me that the problem manifests the moment a network has had more than one validator, whether the network is the same network or an exported version of a network with more than one validator.
The condition seems reasonable to me. If there are no peers, we can't know whether we're caught up to the latest block or not. In this scenario, however, we should just switch to consensus / slower syncing.
It should when it is its turn to produce a block. But if it is not its turn, then it should halt until connectivity is restored / the timeout commit passes, and then it will again be its turn. We'll probably need to add a special case, though.
I wonder if backporting this to Tendermint 0.33 would solve this issue: tendermint/tendermint#5211
I am running this through my brain, everyone, and especially @melekes, given your comment that this "condition seems reasonable". My first thought was -- "yah... it might be, actually". But I don't see how it is possibly reasonable, now that I think about it. If I have 67%+ power (actually even 33.3%+), NOTHING can change without my say-so. No new nodes. No new delegations. No changes in power. No new blocks. My point? If I have that much voting power, I DO KNOW that I am caught up, since my absence makes it impossible for new blocks to have been created that I am not aware of. In other words, if I have 67%+ (actually even 33.3%+) power, I am caught up :). Is my logic wrong? Feel free to say so. If my logic is right, then as @melekes said, if I have that kind of voting power, I should be able to continue to consensus without peers (if I have 66.6%+... if I have only 33.3%+, I stall till I get enough peers to reach 66.6%+ collectively, but I do know no new blocks can exist yet that I am not aware of). I can see that there are three very distinct cases here.
@njmurarka That's what @tessr is saying -- back-porting that PR should address this (I think; please confirm, @melekes).
Sounds about right, but I think an improvement on this is that you don't have to be the only validator. As long as you have 66.6%+ voting power, that's enough of a reason to be able to proceed without waiting.
There may be reasons why that would be a bad idea, but I'm not entirely sure. It could very well be the case that we just need to check how much power we have.
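For concreteness, here is a minimal sketch of what such a power check could look like, assuming a simple supermajority rule; the caughtUp function and its parameters are made up for illustration and are not a proposed Tendermint patch:

```go
// Hedged sketch of the idea above (illustrative only): also consider the node
// caught up when its own validator holds a supermajority (>2/3) of the total
// voting power, since no block could have been committed without it while it
// was offline.
package main

import "fmt"

func caughtUp(havePeers bool, ownPower, totalPower int64) bool {
	if ownPower*3 > totalPower*2 {
		return true // nothing could have advanced without us; safe to proceed alone
	}
	return havePeers // otherwise fall back to the existing peer-based heuristic
}

func main() {
	fmt.Println(caughtUp(false, 70, 100)) // true: 70% > 2/3, start even with zero peers
	fmt.Println(caughtUp(false, 40, 100)) // false: not a supermajority, still need peers
}
```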
No, because here we have 2 or more validators.
Right, if you had 1/3+ of the voting power at the moment of failure, the network could not have created any new blocks (2/3+ is required). Created tendermint/tendermint#5696.
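To make the quorum arithmetic behind that statement concrete (a standalone illustration, not SDK or Tendermint code; the 70/34/20 splits are just example numbers):

```go
// A Tendermint commit needs strictly more than 2/3 of the total voting power,
// so a validator holding 1/3 or more cannot have missed any blocks while offline.
package main

import "fmt"

// canCommitWithout reports whether the remaining validators could still commit
// blocks while a validator holding myPower (out of totalPower) is offline.
func canCommitWithout(myPower, totalPower int64) bool {
	remaining := totalPower - myPower
	return remaining*3 > totalPower*2 // remaining must exceed 2/3 of the total
}

func main() {
	fmt.Println(canCommitWithout(70, 100)) // false: only 30% left, the chain halts without you
	fmt.Println(canCommitWithout(34, 100)) // false: 66% left, still short of the 2/3+ quorum
	fmt.Println(canCommitWithout(20, 100)) // true:  80% left, blocks can be made in your absence
}
```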
@melekes I did not understand your comment above.
My guess is that @tessr's referred-to backported code would not apply here, since that only affects use cases where there is exactly one validator in the validator set. My bug (this one) specifically happens when there are multiple validators in the set. Am I right? It probably answers my question above then too.
Yes.
Closing this issue since it's entirely on the Tendermint side and @melekes created an issue for us.
As @erikgrinaker pointed out, this is not always correct. If your node's disk is fried and you lost your state (i.e. you're forced to start from scratch), you're not caught up.
Summary of Bug
I wanted to hard fork my network (or even have a backup "clone" of it). The network has eight validators. I deliberately ensured one of these validators had 70% voting power.
I literally brought up a new VM, installed my application (blzd, blzcli, etc.) on it, copied over .blzd from that validator with 70% power, replaced the .blzd/config/genesis.json with the exported genesis.json (changing the chain-id and genesis time), and then started the validator with "blzd start".
Obviously it could not talk to the other validators, but regardless, it had 70% voting power, which is sufficient for it to generate blocks. But it would just sit and do nothing. I tried various things, including emptying the peer list and turning pex on and off. No difference. This single validator just would not start.
I then tried to start a second validator, "copying over" its data the same way (as with the 70% validator), but for another validator from the old network. I also went ahead and had both validators peer with each other.
Bingo! The two validators came to life and started to create blocks. In fact, I then killed the second validator's process, and the first one still kept going. But when I tried to stop and start the first validator, same problem again.
It almost seems like the network now expects at least two validators to be up and running. This is contrary to my understanding and seems like a bug. I should be able to start just one validator if I wanted, as long as it had power.
I know there is a flag called "--jail-whitelist" that lets me specify a list of validators to pre-jail (without slashing them) so that I can fork the network and start it without waiting for them to start their validators. I did not use this flag, so as expected, when I started my new chain (with 2 of the 8 validators running), 6 of them eventually got slashed and jailed.
Version
cosmos-sdk v0.39.1
Steps to Reproduce
Above.