EIP-1955: Specify the Cliquey proof-of-authority engine. #1955

Closed
wants to merge 25 commits into from


soc1c
Contributor

@soc1c soc1c commented Apr 20, 2019

Simple Summary

This document proposes a new proof-of-authority consensus engine that could be used by Ethereum testing and development networks in the future.

Abstract

Cliquey is the second iteration of the Clique proof-of-authority consensus protocol, previously discussed as "Clique v2". It comes with some usability and stability optimizations gained from creating the Görli and Kotti Classic cross-client proof-of-authority networks that were implemented in Geth, Parity Ethereum, Pantheon, Nethermind, and various other clients.

Motivation

The Kotti Classic and Görli testnets running different implementations of the Clique engine got stuck multiple times due to minor issues discovered. These issues were partially addressed on the mono-client Rinkeby network by optimizing the Geth code.

However, optimizations across multiple clients should be adequately specified and discussed. This working document is a result of a couple of months testing and running cross-client Clique networks, especially with the feedback gathered by several Pantheon, Nethermind, Parity Ethereum, and Geth engineers on different channels.

The overall goal is to simplify the setup and configuration of proof-of-authority networks, ensure that testnets avoid getting stuck, and mimic mainnet conditions.

Rationale

The following changes were introduced over Clique EIP-225 and should be discussed briefly.

  • Cliquey introduces a MIN_WAIT period before out-of-turn blocks may be published, which is not present in Clique. This addresses the issue of out-of-turn blocks often getting pushed into the network too fast, causing a lot of short reorganizations and in some rare cases causing the network to come to a halt. By holding back out-of-turn blocks, Cliquey allows in-turn validators to seal blocks even under non-optimal network conditions, such as high network latency or validators with unsynchronized clocks.
  • To further strengthen the role of in-turn blocks, an authority should continue to publish in-turn blocks even if an out-of-turn block was already received on the network. This prevents in-turn validators from being hindered in publishing their blocks and avoids potential network problems, such as reorganizations or the network getting stuck.
  • Additionally, the DIFF_INTURN was increased from 2 to 7 to avoid situations where two different chain heads have the same total difficulty. This prevents the network from getting stuck by making in-turn blocks significantly more heavy than out-of-turn blocks.
  • The SIGNER_LIMIT was removed from block sealing logic and is only required for voting. This allows the network to continue sealing blocks even if all but one of the validators are offline. The voting governance is not affected and still requires signer majority.
  • The block period should be less strict and slightly randomized to mimic mainnet conditions. Therefore, a random offset drawn uniformly from [-BLOCK_PERIOD/4, BLOCK_PERIOD/4] is added to it. With this, the average block time will still hover around BLOCK_PERIOD.
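The SIGNER_LIMIT bullet above can be illustrated with a minimal sketch. It assumes the Clique (EIP-225) definition SIGNER_LIMIT = floor(SIGNER_COUNT / 2) + 1, meaning a signer may seal at most one out of any SIGNER_LIMIT consecutive blocks. The function names, the `enforce_limit` switch, and the candidate ordering are hypothetical and only serve the illustration; this is not code from any client.

```python
def signer_limit(signer_count: int) -> int:
    """Clique (EIP-225): a signer may seal one of any SIGNER_LIMIT consecutive blocks."""
    return signer_count // 2 + 1


def blocks_sealed(signers, online, rounds, enforce_limit):
    """Return the signer of each sealed block; stop early if nobody may seal.

    Hypothetical model: the in-turn signer is tried first, then the others.
    """
    limit = signer_limit(len(signers))
    chain = []
    for number in range(rounds):
        in_turn = signers[number % len(signers)]
        candidates = [in_turn] + [s for s in signers if s != in_turn]
        # Signers of the last (limit - 1) blocks are barred when the limit applies.
        recent = chain[-(limit - 1):] if limit > 1 else []
        for signer in candidates:
            if signer not in online:
                continue
            if enforce_limit and signer in recent:
                continue
            chain.append(signer)
            break
        else:
            break  # no eligible signer left: the chain halts
    return chain


signers = ["A", "B", "C"]
# With the limit enforced, a lone signer seals one block and then stalls.
print(blocks_sealed(signers, {"A"}, 5, enforce_limit=True))
# With the limit removed from sealing, the lone signer keeps the chain moving.
print(blocks_sealed(signers, {"A"}, 5, enforce_limit=False))
```

Under these assumptions the second call also shows the flip side discussed in the review comments: a single signer can extend a chain on its own.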

Finally, without changing any consensus logic, we propose the ability to specify an initial list of validators at genesis configuration without tampering with the extraData.

@soc1c
Contributor Author

soc1c commented Apr 20, 2019

Requires #1954

References #1570 #225

@soc1c soc1c changed the title from "Specify the Cliquey proof-of-authority engine." to "EIP-1955: Specify the Cliquey proof-of-authority engine." on Apr 20, 2019
@tkstanczak
Contributor

tkstanczak commented Apr 20, 2019

Removing the SIGNER_LIMIT is dangerous as it would lead straightaway to a list of potential attacks.
Average block time being BLOCK_PERIOD, any signer in a network with N = 3 signers can prepare a competing chain. Assume BLOCK_PERIOD is 15 seconds and signer C decides to stop signing on the same chain as the others. Then any 90 seconds that would have produced the in-turn blocks A B C A B C would instead produce A B A B A B, which gives a difficulty of 10. Signer C can prepare a C C C C C C chain, which would have a difficulty of 10 too. Since C can now arbitrarily choose and accept any random time for signing, they can create 7 blocks instead of 6 (by always choosing the minimum time) and create an arbitrarily long alternative attack.

@tkstanczak
Contributor

tkstanczak commented Apr 20, 2019

With a variable block length and a minimum wait time, we will always wait for the in-turn block for the maximum time plus the wait time, which makes the attack from my previous comment even easier.

@tkstanczak
Contributor

Both the min wait time and publishing in-turn blocks in case of existing out-of-turn blocks were implemented in Nethermind. The latter strengthens the main chain, while the former allows malicious signers to gain greater power (by not following the min wait time rule). What we could do is include the difference in the consensus: for any out-of-turn block the timestamp has to be at least BLOCK_PERIOD + 1 instead of BLOCK_PERIOD.
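A minimal sketch of the timestamp rule suggested above, assuming the bound is measured from the parent block's timestamp (as with Clique's existing BLOCK_PERIOD check) and that timestamps are whole seconds. The function names are hypothetical.

```python
BLOCK_PERIOD = 15  # seconds; the example value used in this thread


def min_timestamp(parent_timestamp: int, in_turn: bool) -> int:
    """Earliest header timestamp allowed for a child block under the suggested rule."""
    extra = 0 if in_turn else 1  # out-of-turn blocks must be at least one second later
    return parent_timestamp + BLOCK_PERIOD + extra


def timestamp_valid(parent_timestamp: int, timestamp: int, in_turn: bool) -> bool:
    """Consensus-side check: reject headers that are stamped too early."""
    return timestamp >= min_timestamp(parent_timestamp, in_turn)


print(timestamp_valid(100, 115, in_turn=True))   # in-turn block at exactly BLOCK_PERIOD
print(timestamp_valid(100, 115, in_turn=False))  # out-of-turn block at the same time
print(timestamp_valid(100, 116, in_turn=False))  # out-of-turn block one second later
```

Because the check only looks at header fields, every node would reach the same verdict, which is what distinguishes it from a sealing-side delay that a malicious signer can simply ignore.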

@tkstanczak
Contributor

Also, as suggested in the original EIP discussion, I would suggest co-prime numbers for InTurn and OutOfTurn blocks: 7 and 3 respectively.

@soc1c
Contributor Author

soc1c commented Apr 21, 2019

Thanks for your review.

Removing the SIGNER_LIMIT is dangerous as it would lead straightaway to a list of potential attacks.

Can you expand on that? Aura does not have a signer limit and I'm not aware of any vectors here, but maybe they have other measures in place.

Signer C can prepare a C C C C C C chain, which would have a difficulty of 10 too.

Yes, very good catch. I restored the block timestamp constraints in b97026c - this should mitigate this entirely.

What we could do is include the difference in the consensus: for any out-of-turn block the timestamp has to be at least BLOCK_PERIOD + 1 instead of BLOCK_PERIOD.

I don't think we need to make this part of the consensus, I don't want to make it more complex than necessary. Otherwise, I'm not really opposed though.

7 and 3 respectively.

Can you explain why prime numbers? Also, why 3 and 7, not 2 and 7?

I see 7 would work, but I don't see why we should increase the baseline constant. See 75bbc88

@nicksavers
Contributor

It's not entirely clear to me whether this is A) Clique v2 or B) an additional consensus engine.
In case of A wouldn't it make sense to just name it Clique 2.0 and deprecate the old one?
In case of B wouldn't it make sense to still name it Clique 2.0 and generalize it more to support this change?

@axic
Member

axic commented Apr 29, 2019

It seems to be mostly about fixing issues in Clique as opposed to changing the design drastically, so I would agree with @nicksavers that naming this Clique 2.0 and having the supersedes: header would be the best for clarity.

@nicksavers
Contributor

Sounds more like Clique 1.1 to me tbh, but that's beside the point.

@soc1c
Contributor Author

soc1c commented May 4, 2019

Thanks for your review.

It's not entirely clear to me whether this is A) Clique v2 or B) an additional consensus engine.

It's more like a patch for Clique. Existing networks can activate EIP-1955 via hardfork to gain from the proposed improvements.

Sounds more like Clique 1.1 to me tbh, but that's beside the point.

If there was semantic versioning, this would indeed be 1.1; however, it was previously discussed as "v2" on various channels without actually implying a versioning scheme here.

having the supersedes: header would be the best for clarity.

It does not specify the entire Clique engine though, therefore I used the requires: header. Happy to change / add it, but I personally think requires makes it clearer.

I updated the spec to clarify and reflect the feedback.

@soc1c
Contributor Author

soc1c commented May 6, 2019

@karalabe
Member

karalabe commented May 6, 2019

I haven't yet thought over all the implications, but here are a few thoughts I have based on a couple of quick read-throughs:

  • Allowing the initial validator set to be optionally specified via the chain config seems like a really nice idea, though I don't think that should be part of the EIP spec. The consensus engine requires the full list of signers to be present in the block extra-data section. Your proposal does not change this requirement, just adds a bit of syntactic sugar as to how to get the data in there. I think it's more of a feature request (that I completely agree with) for Geth and maybe other clients to support wrt the genesis format.
  • I feel there has not really been much time put into investigating the effect of MIN_WAIT. The spec suggests MIN_WAIT to be BLOCK_PERIOD/2, but that means that for each missing signer, the network will actually hiccup. If I have 2 out of 4 signers live on a 15s network, every second block will be delayed by 7-8 seconds. That was explicitly a feature Clique was meant to solve compared to Parity's Aura protocol: to ensure that as long as the network is healthy, blocks have reliable times and don't do weird hiccups.
    • I am also not really convinced about the reason of existence for this. The spec states "out-of-turn blocks often getting pushed into the network too fast causing a lot of short reorganizations". Why is this a problem? Frequent mini reorgs force DApp developers to cater for mainnet conditions where blocks can get reorged (every uncle is a reorg). If anything, this is actually a desired behavior.
    • I am also not convinced by catering for network propagation issues and block clock drift. The point of a PoA network is that you have semi-reliable trusted authorities. If an authority does not have an acceptable network connection or cannot be bothered to sync its clock, imho the solution is to replace that validator, not to change the protocol. If your network is unstable due to badly implemented clients, fix the clients (or run an alternative one), not change the protocol to support unstable implementations.
  • You propose randomizing the block period in the [-BLOCK_PERIOD/4, BLOCK_PERIOD/4] range. However, you only proposed it as a miner strategy. How is that different to the current behavior of simply propagating blocks with some random delay? Unless the random drift is part of consensus and deterministic, the client who generates the earliest block will get it accepted, the rest is just ignored (i.e. current behavior).
    • A further issue with the proposal is that you permit the random drift component to be negative, but that clashes with the block timestamp, since you might end up with an invalid block. The protocol currently strictly enforces the BLOCK_PERIOD.
  • The proposal bumps the DIFF_INTURN to 7, which in all honesty seems like a random number you just picked. You would need to explore this number in detail and support why this number is better than anything else. The original choice of 2 was made to ensure that a) in-turn blocks cannot be forced out; and b) that signers cannot censor each other. With a 7/1 difficulty setting - it would seem to me that - a node can actually censor out a whole lot of blocks before it (up to 6). Maybe I'm wrong, but the point here is that there was nothing provided to rationalize this number. Why 7? Why not 5 or 11? I'd expect this number to be better being dynamic in the number of signers and not static. I think you'd need to provide some experiments, simulations, etc. to see how the network would behave with different difficulties, different numbers of signers, and different numbers of online signers.
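The "up to 6" figure in the last bullet follows from simple arithmetic, sketched below. The sketch assumes a fork choice by total difficulty and uses DIFF_NOTURN (the EIP-225 name for the out-of-turn difficulty, 1 in Clique); the helper names are hypothetical.

```python
def total_difficulty(turns, diff_inturn, diff_noturn):
    """Sum block difficulties; `turns` holds True for in-turn, False for out-of-turn."""
    return sum(diff_inturn if in_turn else diff_noturn for in_turn in turns)


def max_outweighed(diff_inturn, diff_noturn):
    """Largest k such that one in-turn block is strictly heavier than k out-of-turn blocks."""
    return (diff_inturn - 1) // diff_noturn


# Clique's original setting: one in-turn block only beats a single out-of-turn block.
print(max_outweighed(2, 1))
# The proposed 7/1 setting: one in-turn block beats a run of six out-of-turn blocks.
print(max_outweighed(7, 1))
print(total_difficulty([True], 7, 1), total_difficulty([False] * 6, 7, 1))
```

The same helper can be used to compare other candidate pairs before running the fuller simulations the review asks for; it says nothing about timing or about which signers are online.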

Generally my concern with the EIP is that it changes the block producing/acceptance logic, but provides no backing as to guarantee the censorship resistance of the proposed schema. I think that the most important thing for this EIP to move forward is to explore attack scenarios, and whether a minority of signers could have the capability to take over the network (or grief it offline). I'm not saying that we need to make everything absolutely bulletproof, but we should be able to prove that the proposal is better and won't just blow up in a similar way.

Another thing that would be of interest is to explain why Rinkeby is stable, but Görli keeps falling apart. If we could explain exactly what goes wrong with Görli, we might figure out which parts need fixing, and which parts can be left alone. Perhaps you are clear on this, then try to provide some more details.

@soc1c
Contributor Author

soc1c commented May 7, 2019

Thanks Peter, your feedback is very valuable for our work.

Your proposal does not change this requirement, just adds a bit of syntactic sugar as to how to get the data in there. I think it's more of a feature request (that I completely agree with) for Geth and maybe other clients to support wrt the genesis format.

I agree. I will clarify that this is an optional feature to be considered by clients but does not affect the signer consensus logic.

If I have 2 out of 4 signers live on a 15s network, every second block will be delayed by 7-8 seconds. That was explicitly a feature Clique was meant to solve compared to Parity's Aura protocol: to ensure that as long as the network is healthy, blocks have reliable times and don't do weird hiccups.

I understand your design decision to stabilize blocktimes as compared to Aura. However, this is a trade-off we are willing to embrace. A healthy network should have all signers online and sealing. If there is one or a few offline signers, this causes a lot of reorgs and in rare cases causes the network to get stuck. The drawback of having some blocktimes > 22 seconds is acceptable in our opinion and is not far off from main network blocktime fluctuations.
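The block times under discussion (2 of 4 signers live, 15 s period, MIN_WAIT of BLOCK_PERIOD/2) can be reproduced with a rough sketch. It assumes that the in-turn signer for block n is the signer at index n mod SIGNER_COUNT, that an out-of-turn block appears exactly MIN_WAIT after the slot when the in-turn signer is offline, and that SIGNER_LIMIT and any random wiggle are ignored. `block_intervals` is a hypothetical helper, not part of the spec.

```python
def block_intervals(signer_count, online, blocks, block_period=15.0, min_wait=None):
    """Seconds between consecutive blocks under the simplifying assumptions above.

    `online` is the set of signer indices that are currently sealing.
    """
    if min_wait is None:
        min_wait = block_period / 2  # the value suggested by the spec per the review
    intervals = []
    for number in range(blocks):
        in_turn = number % signer_count
        if in_turn in online:
            intervals.append(block_period)             # in-turn block, on time
        else:
            intervals.append(block_period + min_wait)  # out-of-turn block, held back
    return intervals


intervals = block_intervals(4, online={0, 2}, blocks=4)
print(intervals)                        # alternating 15.0 s and 22.5 s intervals
print(sum(intervals) / len(intervals))  # mean block time over the window
```

In this model every second block is delayed by 7.5 seconds, which matches both the "7-8 seconds" in the review and the "> 22 seconds" block times accepted in the reply.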

The ultimate goal is to give in-turn blocks always a significant head-start. Offline or unhealthy signers should be released from their duty anyways.

Why is this a problem? Frequent mini reorgs force DApp developers to cater for mainnet conditions where blocks can get reorged (every uncle is a reorg).

The problem is these reorgs are not only frequent but predictable. Every block is basically sealed by 2-4 signers at the same time on Görli (for example). This is barely comparable to uncles on mainnet, rather, this creates a lot of edge cases where the network gets stuck (Kotti testnet halts every week). There is a research document written by the Pantheon developers explicitly outlining these edge cases. I didn't include them in the spec but will link them here for reference: https://docs.google.com/document/d/1tmsr66sAPJmIZfSy5zck1uxnLzaXYxvzTNL5CF1pJQ0/

If an authority does not have an acceptable network connection or cannot be bothered to sync its clock, imho the solution is to replace that validator, not to change the protocol.

Clique currently defines a delay of rnd(0,500) milliseconds. Even with synced clocks and the best network connections, latencies and drifts are still in the three-digit millisecond range, which is in most cases - as stated in the document - bigger than the delay, especially if authorities happen to be in different geographic regions of the world.

Unless the random drift is part of consensus and deterministic, the client who generates the earliest block will get it accepted, the rest is just ignored (i.e. current behavior).

Aha, I see. We will spend some more thoughts on BLOCK_PERIOD.
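The related review concern, that a negative drift clashes with the strictly enforced BLOCK_PERIOD, can be made concrete with a small sketch. It assumes the drift is applied directly to the header timestamp and that validation requires the timestamp to be at least the parent timestamp plus BLOCK_PERIOD, as the review describes for Clique. The helper names are hypothetical, and floats are used for brevity even though real timestamps are whole seconds.

```python
import random

BLOCK_PERIOD = 15  # seconds; the example value used in this thread


def drifted_timestamp(parent_timestamp, rng):
    """Parent timestamp plus BLOCK_PERIOD plus a uniform drift in [-P/4, P/4]."""
    drift = rng.uniform(-BLOCK_PERIOD / 4, BLOCK_PERIOD / 4)
    return parent_timestamp + BLOCK_PERIOD + drift


def passes_period_check(parent_timestamp, timestamp):
    """The existing rule: a block may not be stamped earlier than parent + BLOCK_PERIOD."""
    return timestamp >= parent_timestamp + BLOCK_PERIOD


def rejected_fraction(samples, seed=0):
    """Share of drifted timestamps that the existing rule would reject."""
    rng = random.Random(seed)
    rejected = sum(
        not passes_period_check(0, drifted_timestamp(0, rng)) for _ in range(samples)
    )
    return rejected / samples


# Roughly half of the draws have a negative drift and would produce invalid blocks.
print(rejected_fraction(10_000))
```

Any randomization that is meant to coexist with the current timestamp rule would therefore need either a non-negative offset or a relaxed lower bound; which of the two the spec should pick is left open in the thread.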

The proposal bumps the DIFF_INTURN to 7, which in all honesty seems like a random number you just picked.

Indeed, initially we picked [3, 1], @tkstanczak proposed to be more significant with the diff at [7, 3], and finally we ended up with the rather arbitrary choice of [7, 1].

You would need to explore this number in detail and support why this number is better than anything else. The original choice of 2 was made to ensure that a) in-turn blocks cannot be forced out; and b) that signers cannot censor each other. With a 7/1 difficulty setting - it would seem to me that - a node can actually censor out a whole lot of blocks before it (up to 6).

This is a good point. The problem with [2, 1] is that some clients were not able to distinguish the best total difficulty and (probably due to network fragmentation?) the chain got stuck and clients couldn't decide on which head is the best. Having [3, 1] would clarify the significance of an in-turn block much more and make it easier for the network to reorganize. Allowing an in-turn block to invalidate 2 out-of-turn blocks is acceptable IMO. Using primes like [7, 3] as @tkstanczak suggested actually makes more sense here to only give this a slight preference. What do you think about [7, 3]?

Generally my concern with the EIP is that it changes the block producing/acceptance logic, but provides no backing as to guarantee the censorship resistance of the proposed schema. I think that the most important thing for this EIP to move forward is to explore attack scenarios, and whether a minority of signers could have the capability to take over the network (or grief it offline). I'm not saying that we need to make everything absolutely bulletproof, but we should be able to prove that the proposal is better and won't just blow up in a similar way.

This is something we should totally do. I'm personally willing to implement this in Parity once we have agreed on the basic specification. That way we can write and run tests and simulations. In general, are you interested in implementing Cliquey in Geth for testing/research purposes?

Another thing that would be of interest is to explain why Rinkeby is stable, but Görli keeps falling apart. If we could explain exactly what goes wrong with Görli, we might figure out which parts need fixing, and which parts can be left alone. Perhaps you are clear on this, then try to provide some more details.

To be honest, I only know that Rinkeby is stable because you said so. I spend much more time on Görli and Kotti. The reasons could be various; my main suspects are always the things that are not, or not clearly, part of the spec, or different clients' implementations not being 100% accurate. There is still a lot to consider.

The fact is that Görli and Kotti are much less stable. An interesting observation we have now on Görli versus Kotti is that Görli stabilized once we had many more validators (8 now) while Kotti with 3 keeps getting stuck.

@soc1c
Contributor Author

soc1c commented May 10, 2019

From Gitter

Adrian Sutton @ajsutton May 07 12:05
@soc1c I’m definitely a supporter of the MIN_WAIT. I generally don’t think the DIFF_INTURN is particularly important assuming you set a reasonable MIN_WAIT but I don’t think it hurts. I’m quite uncertain about removing SIGNER_LIMIT but I think tkstanczak’s comments cover that. And I wouldn’t be randomising block times - makes it unsuitable for private networks and I don’t think it really gains much in terms of simulating MainNet.

@tkstanczak
Contributor

I think that MIN_WAIT should be (if at all) introduced on the acceptor side and not the producer side. I mean that no validator should accept OutOfTurn blocks earlier than MIN_WAIT after the block time. This way no malicious validators can push their blocks just to increase the number of reorgs.
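A minimal sketch of what an acceptor-side MIN_WAIT could look like. Reading "the block time" as the parent timestamp plus BLOCK_PERIOD is an assumption, as is comparing against the receiving node's local clock; the function names and default values (15 s period, 7.5 s wait) are hypothetical illustrations taken from figures used elsewhere in this thread.

```python
def earliest_acceptance(parent_timestamp, in_turn, block_period=15.0, min_wait=7.5):
    """Local time before which a received block should not yet be accepted."""
    slot = parent_timestamp + block_period
    # In-turn blocks are acceptable from the slot start; out-of-turn ones MIN_WAIT later.
    return slot if in_turn else slot + min_wait


def should_accept_now(parent_timestamp, in_turn, local_time,
                      block_period=15.0, min_wait=7.5):
    """Acceptor-side gate: hold back out-of-turn blocks that arrive too early."""
    threshold = earliest_acceptance(parent_timestamp, in_turn, block_period, min_wait)
    return local_time >= threshold


print(should_accept_now(100, in_turn=False, local_time=120))  # too early, hold back
print(should_accept_now(100, in_turn=False, local_time=123))  # MIN_WAIT has elapsed
print(should_accept_now(100, in_turn=True, local_time=115))   # in-turn, accept at once
```

Since this gate depends on each node's own clock, nodes with drifting clocks may briefly disagree on whether a block is acceptable; a real design would have to decide whether early blocks are queued or dropped.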

@soc1c
Contributor Author

soc1c commented Jun 11, 2019

Needs some work. ⏳


6 participants