Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add getter to ExecutionEngine to get TTD #2643

Closed
dapplion opened this issue Oct 4, 2021 · 13 comments
Closed

Add getter to ExecutionEngine to get TTD #2643

dapplion opened this issue Oct 4, 2021 · 13 comments
Labels
Bellatrix CL+EL Merge

Comments

@dapplion
Copy link
Collaborator

dapplion commented Oct 4, 2021

If the execution and consensus layer have different total terminal difficulty (TTD) it results in bad failures:

  • If consensus TTD1 > execution TTD2: Execution client will stop producing blocks at TTD2 and consensus will be expecting a transition block with TTD1 that will never come, stalling execution forever.
  • If consensus TTD1 < execution TTD2
    • if execution allows prepare_payload calls: consensus and execution will fork maintaining two correct chains both with execution data. However, once TTD2 is reached execution will re-org to the consensus chain. The resulting canonical chain won't contain a long range of blocks without execution data but it will drop state that explorers and users consider canonical.
    • if execution NOT allows prepare_payload calls: consensus will try to create blocks but since prepare_payload fails each attempted block can only reference the merge block at TTD1. Once TTD2 happens execution will re-org to a head that points to the merge block dropping its chain for a long gap in history.

To reduce the chance of this failure cases, the consensus client could check that its TTD matches the execution client TTD. If TTDs don't match, panic or very visibly alert the user. To allow updating nodes in any order absence of this method should not result in a error. Also a different TTD could only result in panic if TTD has been manually overwritten.

CC @mkalinin

@MicahZoltu
Copy link
Contributor

Is TTD not already part of the synchronization API? cc @mkalinin

@mkalinin
Copy link
Collaborator

mkalinin commented Oct 6, 2021

What if we hardcode TTD on the EL (execution layer) side and have a --terminal-total-difficulty-override setting there as well? Then CL can poll this parameter via engine_getTransitionState method, polling matter because of EL can be restarted with an override and CL should always be on spot.

This approach should reduce the number of potential failure points, in particular in how to sync this value between the layers if it is defined in each of them with an option to be overridden.

@hwwhww hwwhww added the Bellatrix CL+EL Merge label Oct 7, 2021
@mkalinin
Copy link
Collaborator

mkalinin commented Oct 7, 2021

Is TTD not already part of the synchronization API? cc @mkalinin

You mean Engine API? It's not yet a part of it, at least not a part of its Interop Edition. And this is a good time to make a decision on how to handle the hardcoding and overriding parts of this paramter

@djrtwo
Copy link
Contributor

djrtwo commented Oct 12, 2021

+1 on only putting this in the EL and requiring CL to get it. Definitely reduces failures in synchronization between the two layers

@ajsutton
Copy link
Contributor

+1 on only putting this in the EL and requiring CL to get it. Definitely reduces failures in synchronization between the two layers

The downside however is that now the CL can't startup until the EL is already running and accepting requests. Previously it would have been able to startup, begin finding peers and begin optimistic sync.

Given the huge number of things that already have to be in-sync between the CL and EL, I'm not sure the complexity of having to retrieve this and continuously poll for changes is going to reduce bugs. It will definitely increase complexity significantly and that usually leads to an increase in bugs.

@mkalinin
Copy link
Collaborator

The downside however is that now the CL can't startup until the EL is already running and accepting requests. Previously it would have been able to startup, begin finding peers and begin optimistic sync.

CL can't validate terminal block conditions without functioning EL and it is still possible for CL to do an optimistic sync even though EL is not functioning at all. In the case of optimistic sync CL will have to get back to terminal block validation in the case if the transition hasn't been finalized yet, but in this case it also have to rely on a functioning EL party to verify that the execution of the block built on top of it is valid.

This is rather an attempt to reduce the surface of misconfiguration issues induced by users than bugs induced by developers. Do you think that polling this data from EL is more bug prone than polling the head and verifying TTD?

@mkalinin
Copy link
Collaborator

We also should consider an override setting added here #2682. This setting requires CL to be restarted with the respective activation epoch as well. It means that EL can't always be a single source of transition data and poll won't cover all the override setting use cases cc @djrtwo

@ajsutton
Copy link
Contributor

This is rather an attempt to reduce the surface of misconfiguration issues induced by users than bugs induced by developers. Do you think that polling this data from EL is more bug prone than polling the head and verifying TTD?

So I think this complexity is the crux of my concern more than how much the beacon node can do before the EL starts up. It may be that I'm not understanding the proposal properly, but my understanding is that instead of the beacon node having the TTD hard coded (with CLI option to override) it would poll the EL to get it. Having a hard coded TTD would definitely be less error prone in that case - having to poll for potentially changing configuration data is pretty complex to get right.

If however we can get the CL to not care about TTD at all then yes that probably is simpler. I'm not finding engine_getTransitionState in the API docs so not entirely sure of the details there. In an ideal world the EL would just deal with TTD and it would return empty when asked to create an ExecutionPayload if TTD hasn't been reached and fail validation for any ExecutionPayload that doesn't meet TTD (it would probably have to be told if this is the first ExecutionPayload to be included so it can check the parent block is prior to TTD). But I feel like I'm forgetting an important detail that means the CL needs to know the actual TTD.

Bottom line for me is that having a user specify the same CLI argument in two places is pretty straight forward (we already have that for a number of args between beacon node and validator clients), whereas loading potentially changing configuration data from a web API is significantly more complex and error prone. If dealing with TTD fits reasonably well into existing semantics (or lets us not do some parts of it) then it's probably worth it, if it's just to get the user to only set it in one place then it probably isn't worth it.

@djrtwo
Copy link
Contributor

djrtwo commented Oct 27, 2021

Okay, I've been thinking about this and I want to minimize complexity of this component while ensuring it likely won't be a source of failures. The that end, I suggest the following which keeps CL as the transition leader and the only point of overrides without requiring a new engine endpoint

Mechanism

We can use the following mechanism

Release and override procedure

  1. Release CL and EL both with hardcoded TTD
  2. Provide an override point for TTD/TBH on CL
  3. EL listens to fork choice commands from CL regardless of local TTD settings (as CL could have been overriden on either TTD or TBH)
  4. (optional) In event of override necessary, both EL and CL can release new versions with overrides depending on time/urgency

3675 spec changes

Requisite change to EIP 3675:

  • Modify the definition of TRANSITION_BLOCK in EIP-3675 to just be the 0th "PoS block" (regardless of the PoW parent properties). I.e. the removal of "TRANSITION_BLOCK MUST be a child of a terminal PoW block."

Discussion

Overrides to expedite the Merge

The above simplifying mechanism assumes that TTD and TBH will only be overridden to be earlier than originally set values, and thus by not eagerly communicating an updated terminal value to EL, the worst that happens is that EL imports PoW blocks until the previously hardcoded TTD (due to the definition of "Terminal PoW Block" in EIP 3675).

Expected usage of these overrides:

  • TTD gets overridden to be earlier than the hardcoded/shipped value in the event that the transition is going to take way to long given global hashpower conditions
  • TBH gets used in the event that there is an attack on PoW and thus the chain can no longer be expected to be stable and have a high security until TTD. (taking a liveness hit to set TBH as a block at has already happened)

Overrides to slow the Merge (not supported)

There is one other type of usage that we cannot account for with the above proposed mechanism -- setting of TTD to be later than the hard-coded TTD. In the above mechanism, if we don't communicate a TTD override to EL in this case, EL will never reach the new TTD (due to not importing past the hard-coded TTD) [this is @dapplion's first case -- consensus TTD1 > execution TTD2] and thus the transition will never occur. I argue that in all likelihood a TTD override would not be used in this direction (making the transition take longer) and would only instead only be used to make the transition happen sooner. So we can simplify and assume only one direction of changs.

@dapplion's cases

Restricting overrides to only expedite the Merge and doing overrides only from CL reduces the problem of what can go wrong to only dapplion's second case (i.e. consensus TTD1 < execution TTD2). To patch the potential issues here, we need to modify the definition of TRANSITION_BLOCK in EIP-3675 as noted above.

Now in the case of consensus TTD1 < execution TTD2 (which could happen with a CL-only override), CL TTD1 would be hit and CL would call prepare and insert payloads at TTD1. If EL is following the spec, it will begin to defer it's fork choice to the PoS event. Even if it imports PoW blocks to TTD2, the PoS forkchoice always wins and once finalized all these extraneous PoW blocks would be pruned. This means that the chain built from TTD1 would be respected by EL and that due to the consensus rules on CL, only chains built with a valid TTD1 transition block would be accepted.

^ The above analysis also holds for a TBH override (assuming it is only used to expedite the Merge)

@mkalinin
Copy link
Collaborator

What if we have the requisite change to the EIP but still keep override options on EL side? If user forgets to configure these settings on EL side then we get to the case when CL TTD1 < EL TTD2 while if it doesn't forget to then we get EL and CL overrides consistent with each other. I suspect there won't be the case when user overrides TTD with different values on each layer.

@djrtwo
Copy link
Contributor

djrtwo commented Oct 28, 2021

Yeah, I almost suggested putting TTD override on EL as well but instead left it as a suggestion to just do a new release (if time permits).

Two reasons:

  1. Reduce complexity of implementation when it is only critical to do so on the CL side (with the minor 3675 change)
  2. Avoid creating a foot-gun. If you dont have the override on EL, the user cannot accidentally create a situation where consensus TTD1 > execution TTD2 if they fail to update the correct piece of software. If we have it only in one place, if the user is attempting to update, they will succeed. If we have it in two places, they might just update one side and think they've done what they needed to do

Because of the above, I'd suggest not having the override in EL, but to always encourage updated releases of both CL and EL in the event that we want to ship an override

@ajsutton
Copy link
Contributor

Sounds like a good approach to me. Keeps the CL side nice and simple and while there's possible a little more complexity on the EL side it would need to do something to handle the corner case of the CL hitting TTD before it thought it should anyway so hopefully is simple.

And yes I think anytime we are telling people to use an override we'd also wind up doing an emergency release with it in, but the override is still useful for bigger setups that have their own pipeline for verifying updates before they reach MainNet production.

@mkalinin
Copy link
Collaborator

mkalinin commented Nov 3, 2021

Addressed in ethereum/EIPs#4397

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bellatrix CL+EL Merge
Projects
None yet
Development

No branches or pull requests

6 participants