Launching a new network with exported genesis.json requires 2+ validators #7505

Closed
njmurarka opened this issue Oct 9, 2020 · 23 comments

Comments

@njmurarka
Contributor

Summary of Bug

I wanted to hard fork my network (or even have a backup "clone" of it). The network has eight validators. I deliberately ensured one of these validators had 70% voting power.

I literally brought up a new VM, installed my application (blzd, blzcli, etc.) on it, copied over .blzd from that validator with 70% power, replaced .blzd/config/genesis.json with the exported genesis.json (changing the chain-id and genesis time), and then started the validator with "blzd start".
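
For illustration only, here is a minimal Go sketch of the kind of genesis tweak described above (it is not part of blzd or the SDK): load an exported genesis.json, change the chain-id and the genesis time, and write the result back. The file paths and the new chain-id are placeholders; chain_id and genesis_time are the standard top-level Tendermint genesis fields.

package main

import (
        "encoding/json"
        "os"
        "time"
)

func main() {
        raw, err := os.ReadFile("exported-genesis.json") // placeholder path
        if err != nil {
                panic(err)
        }

        var genesis map[string]interface{}
        if err := json.Unmarshal(raw, &genesis); err != nil {
                panic(err)
        }

        // Give the fork a fresh chain-id and start time so it cannot be
        // confused with the original network.
        genesis["chain_id"] = "mynet-fork-1" // placeholder chain-id
        genesis["genesis_time"] = time.Now().UTC().Format(time.RFC3339Nano)

        out, err := json.MarshalIndent(genesis, "", "  ")
        if err != nil {
                panic(err)
        }
        if err := os.WriteFile("genesis.json", out, 0o644); err != nil {
                panic(err)
        }
}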

Obviously it could not talk to the other validators, but regardless, it had 70% voting power, which is sufficient for it to generate blocks. But it would just sit and do nothing. I tried various things, including emptying the peer list and turning pex on and off. No difference. This single validator just would not start.

I then tried to start a second validator, doing the same "copy over" (as for that 70% validator) but for another validator from the old network. I also went ahead and had both validators peer with each other.

Bingo! The two validators came to life and started to create blocks. In fact, I then killed the second validator's process, and the first one still kept going. But when I tried to stop and start the first validator, same problem again.

It almost seems like the network now expects at least two validators to be up and running. This is contrary to my understanding and seems like a bug. I should be able to start just one validator if I wanted, as long as it had sufficient voting power.

I know there is a flag called "--jail-whitelist" that lets me specify a list of validators to pre-jail (without slashing them) so that I can fork the network and start it without waiting for them to start their validators. I did not use this flag, so as expected, when I started my new chain (with 2 of the 8 validators running), 6 of them eventually got slashed and jailed.

Version

cosmos-sdk v0.39.1

Steps to Reproduce

Above.


@njmurarka
Contributor Author

I can provide the genesis.json file, if that helps. I am able to reproduce this at will.

@kaustubhkapatral
Contributor

I was able to reproduce this issue on my local testnet. My voting power was split 80-20 between validators. The network should have started but did not unless there were at least two validators online.

@njmurarka
Contributor Author

@kaustubhkapatral Thanks for verifying and confirming this.

Are we in agreement this is a bug? I know for a fact that a new network (not an upgrade) starts just fine with a single validator. So there is no logical reason in my mind why a local testnet with a single validator holding 80% voting power cannot run the network.

If this is a bug, I am of the mind it is a pretty serious one.

FYI, I was able to replicate this not just in the "upgrading scenario" but also in a simple case as follows:

  1. Copy the .appd folder to a new machine. Ensure you are doing so for a validator with at least enough voting power to run the network alone (> 67%).
  2. On the new machine, to not mess up or double sign on the first network, clear out the peers list in config.toml.
  3. Run appd.

If all is correct, this validator should start up, run, and create blocks. The validators that are not peered are not necessary, as this validator alone is enough.

In fact, I found that this validator was not able to start the network.

@njmurarka
Contributor Author

@anilcse Hi Anil. Any thoughts here? Would love to help resolve or explain this :).

My hypothesis (and 2 cents):

When a network has multiple validators in the validator set, a validator (the first node) can only be started if it can peer with another node (the second node).

It matters not what the peer list is... even if it is empty. It does not matter if that first node has even close to 100% voting power.

The second node is needed literally to "jump start" the network (like starting a car whose battery is dead). Once the network is started, the second node can be turned off, and we are good.

But if you stop that second node, and then stop that first node (so now nothing is running) and try to start that first node alone, you are back to square one. Start the second node again, and you are good.

Best metaphor: a car with a dead battery that can be started with a jump, but turn off the car and it can't start again without another jump.

Most of what I said above has been verified by me. I can do more tests if you want.

@anilcse
Collaborator

anilcse commented Nov 10, 2020

Yes @njmurarka, the observations are correct and I am able to reproduce the same. I am not 100% sure if this is normal/expected behaviour; I will defer this to @alexanderbez @alessio

@njmurarka
Contributor Author

@alexanderbez @alessio

Hi gentlemen. Any thoughts on this?

@gsora
Contributor

gsora commented Nov 18, 2020

I had a call with @alessio, and we were able to reproduce this issue on commercionetwork too.

The setup we used is simple, just two validators starting from scratch after an unsafe-reset-all.

Running the daemon with --log_level=debug revealed some interesting stuff:

D[2020-11-18|17:41:13.150] Consensus ticker                             module=blockchain numPending=0 total=0 outbound=0 inbound=0
D[2020-11-18|17:41:13.150] Blockpool has no peers                       module=blockchain
D[2020-11-18|17:41:14.151] Consensus ticker                             module=blockchain numPending=0 total=0 outbound=0 inbound=0
D[2020-11-18|17:41:14.151] Blockpool has no peers                       module=blockchain

Grepping around in the Tendermint source code (version v0.33.8) traced that debug log string to the blockchain/v0/pool.go file:

// IsCaughtUp returns true if this node is caught up, false - otherwise.
// TODO: relax conditions, prevent abuse.
func (pool *BlockPool) IsCaughtUp() bool {
        pool.mtx.Lock()
        defer pool.mtx.Unlock()

        // Need at least 1 peer to be considered caught up.
        if len(pool.peers) == 0 {
                pool.Logger.Debug("Blockpool has no peers")
                return false
        }

        // Some conditions to determine if we're caught up.
        // Ensures we've either received a block or waited some amount of time,
        // and that we're synced to the highest known height.
        // Note we use maxPeerHeight - 1 because to sync block H requires block H+1
        // to verify the LastCommit.
        receivedBlockOrTimedOut := pool.height > 0 || time.Since(pool.startTime) > 5*time.Second
        ourChainIsLongestAmongPeers := pool.maxPeerHeight == 0 || pool.height >= (pool.maxPeerHeight-1)
        isCaughtUp := receivedBlockOrTimedOut && ourChainIsLongestAmongPeers
        return isCaughtUp
}

pool.peers seems to be always zero when

  1. the node has seen another validator
  2. there are no previously seen validators reachable

The purpose of that check isn't clear to us, but obviously without it the single node with the biggest voting power is able to start and produce blocks just fine.

I wonder if this is expected behavior or not.

Shouldn't a validator be able to create blocks just because it has enough power rather than whether it has seen peers or not?

@njmurarka
Contributor Author

@gsora That's great investigative work. Thanks for that. I hope the powers that be (including you) will be able to either rationalize this behaviour and confirm it is a bug and resolve it.

I am curious though. If I start a virgin new network with a single validator, I can stop and start that validator just fine. The code above would not allow that validator to start after it has been stopped, unless of course that code only runs as a startup condition when there have been multiple validators in the past.

I was talking to @alessio and @tessr about this matter a bit in Discord chat, and my own feeling is this is very contrary to expected behaviour. To me, it is pretty simple -- if I have the voting power, I get to create blocks. It should not be relevant what other validators have existed or still exist, as long as I hold sufficient power.

To quote what I said to them:

The super short version of this phenomenon (in light of the newest comments in the ticket): Once a validator has seen another, the well has been poisoned, and now, no singular validator can ever start, irrespective of voting power. 

@alexanderbez
Contributor

alexanderbez commented Nov 18, 2020

Excellent debugging work.

I too am curious why this isn't an issue on a "clean" chain but is on an exported chain under these criteria. Why is the conditional falsey on export + restart, but truthy on a "clean" chain?

As to the why: @melekes, can you possibly explain why such a conditional exists in BlockPool#IsCaughtUp?

@tac0turtle
Member

Grepping around in the Tendermint source code (version v0.33.8) traced that debug log string to the blockchain/v0/pool.go file:

Have you tried using v1 or v2 of the blockchain reactor? I wonder if this problem persists there as well. It should not, but if it does, then the blockchain reactor is not the culprit.

@njmurarka
Contributor Author

@alexanderbez As an experiment, I tried NOT exporting the genesis. Rather, I took the same genesis as the old network, verbatim, "copied" two nodes over (one of which had close to 95% voting power), peered them with each other directly, and started it. It worked. But I then removed the weaker-powered peer from this two-node network and tried to start only mister-95%. Same problem.

It appears to me that the problem manifests the moment a network has had more than one validator, whether it is the same network or an exported version of a network with more than one validator.

@melekes
Contributor

melekes commented Nov 19, 2020

@melekes can you possibly explain why such a conditional exists in BlockPool#IsCaughtUp?

The condition seems reasonable to me. If there are no peers, we can't know whether we're caught up to the latest block or not. In this scenario, however, we should just switch to consensus / slower syncing.

@melekes
Contributor

melekes commented Nov 19, 2020

Shouldn't a validator be able to create blocks just because it has enough power rather than whether it has seen peers or not?

It should when it's its turn to produce a block. But if it's not its turn, then it should halt until connectivity is restored / the commit timeout passes, and then it will be its turn again.

We'll probably need to add a special case though: if {node has 2/3+ power} then continue to consensus in the blockchain reactor for this to work.
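
A minimal sketch of what that special case might look like if it were added to the IsCaughtUp method quoted earlier (this is an assumption, not actual Tendermint code): ownPowerFraction is a hypothetical input representing this node's share of the validator set's total voting power, which the real reactor would have to obtain from the current state.

func (pool *BlockPool) IsCaughtUp(ownPowerFraction float64) bool {
        pool.mtx.Lock()
        defer pool.mtx.Unlock()

        // Hypothetical special case: with more than 2/3 of the voting power,
        // no block could have been committed without this node, so it is
        // caught up by definition and may switch to consensus with no peers.
        if ownPowerFraction > 2.0/3.0 {
                return true
        }

        // Existing behaviour: need at least 1 peer to be considered caught up.
        if len(pool.peers) == 0 {
                pool.Logger.Debug("Blockpool has no peers")
                return false
        }

        // Existing behaviour: received a block or waited long enough, and we
        // are synced to the highest known peer height.
        receivedBlockOrTimedOut := pool.height > 0 || time.Since(pool.startTime) > 5*time.Second
        ourChainIsLongestAmongPeers := pool.maxPeerHeight == 0 || pool.height >= (pool.maxPeerHeight-1)
        return receivedBlockOrTimedOut && ourChainIsLongestAmongPeers
}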

@tessr
Contributor

tessr commented Nov 19, 2020

I wonder if backporting this to Tendermint 0.33 would solve this issue: tendermint/tendermint#5211

@njmurarka
Contributor Author

njmurarka commented Nov 19, 2020

I am running this through my brain, everyone and especially @melekes, given your comment that this "condition seems reasonable". My first thought was -- "yah... it might be, actually".

But I don't see how it is possibly reasonable, now that I think about it.

If I have 67%+ power (actually even 33.3%+), NOTHING can change without my say-so. No new nodes. No new delegations. No changes in power. No new blocks. My point? If I have that much voting power, I DO KNOW that I am caught up, since my absence makes it impossible for new blocks to have been created that I am not aware of.

In other words, if I have 67%+ (actually even 33.3%+) power, I am caught up :).

Is my logic wrong? Feel free to say so.

If so, as @melekes said, then if I have that kind of voting power, I should be able to continue to consensus without peers (if I have 66.6%+... if I have only 33.3%+, I stall till I get enough peers to reach 66.6%+ collectively, but I do know no new blocks can exist yet that I am not aware of).

I can see how there are three very distinct cases:

  1. I have less than 33.3% voting power. I need to sync as I have no idea where the network has gotten to and it could have moved on without me. The rest of the network has enough voting power to create blocks. I will wait till I sync.
  2. I have > 33.3% voting power. The network could not have moved on without me. Why? Consensus cannot occur if I am not up and voting. So, the conclusion is the network is exactly where I last saw it. Now, if I did a local reset of my blockchain data, I can still sync to my peers, but no way they have created new blocks without my say-so.
  3. I have > 66.6% voting power. Like case 2, the network did not move on without me. But I can in fact also go ahead and create new blocks, as I have the power to do so (see the sketch below).
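
A hypothetical Go sketch of those three cases (not SDK or Tendermint code), expressed purely as a check on this node's share of the total voting power. The thresholds follow Tendermint's BFT rule that committing a block requires more than 2/3 of the power, so holding more than 1/3 is enough to veto progress; the type and function names are made up for illustration.

package main

import "fmt"

type syncDecision int

const (
        mustSyncFromPeers  syncDecision = iota // case 1: not enough power to veto; the network may have moved on
        caughtUpButWaiting                     // case 2: > 1/3, no new blocks possible without us
        canProduceAlone                        // case 3: > 2/3, may proceed to consensus alone
)

func decide(ownPower, totalPower int64) syncDecision {
        switch {
        case 3*ownPower > 2*totalPower: // more than 2/3 of the total power
                return canProduceAlone
        case 3*ownPower > totalPower: // more than 1/3 of the total power
                return caughtUpButWaiting
        default:
                return mustSyncFromPeers
        }
}

func main() {
        // The validator from the original report held 70% of the power, so by
        // this reasoning it should be able to produce blocks on its own.
        fmt.Println(decide(70, 100) == canProduceAlone) // prints: true
}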

@alexanderbez
Contributor

@njmurarka That's what @tessr is saying -- back-porting that PR should address this (I think, please confirm @melekes).

@njmurarka
Contributor Author

I wonder if backporting this to Tendermint 0.33 would solve this issue: tendermint/tendermint#5211

Sounds about right, but I think an improvement on this is that you don't have to be the only validator. As long as you have 66.6%+ voting power, that's enough of a reason to be able to proceed without waiting.

@alexanderbez
Contributor

There may be reasons why that may be a bad idea, but I'm not entirely sure. It could very well be the case that we just need to check how much power we have.

@melekes
Contributor

melekes commented Nov 20, 2020

I wonder if backporting this to Tendermint 0.33 would solve this issue: tendermint/tendermint#5211

no, because here we have 2 or more validators

In other words, if I have 67%+ (actually even 33.3%+) power, I am caught up :).

right, if you had 1/3+ of the voting power at the moment of failure, the network could not have created any new blocks (2/3+ is required).

created tendermint/tendermint#5696

@njmurarka
Contributor Author

njmurarka commented Nov 20, 2020

@melekes I did not understand your comment:

no, because here we have 2 or more validators

My guess is that @tessr's referred-to backported code would not apply here, since that only affects use cases where there is exactly one validator in the validator set. My bug (this one) specifically happens when there are multiple validators in the set.

Am I right? It probably answers my question above then too.

@melekes
Contributor

melekes commented Nov 20, 2020

My guess is that @tessr's referred-to backported code would not apply here, since that only affects use cases where there is exactly one validator in the validator set. My bug (this one) specifically happens when there are multiple validators in the set.

Am I right?

Yes

@alexanderbez
Contributor

Closing this issue since it's entirely on the Tendermint side and @melekes created an issue for us.

@melekes
Contributor

melekes commented Nov 25, 2020

In other words, if I have 67%+ (actually even 33.3%+) power, I am caught up :).

As @erikgrinaker pointed out, this is not always correct. If your node's disk is fried and you lost your state (i.e. you're forced to start from scratch), you're not caught up.
