op-node: Engine P2P sync #6243

ImTei · 2023-07-10T09:50:06Z

Good morning! This is Tei from Test in Prod. This PR contains two topics:

Triggering the Execution Engine’s P2P sync for a faster sync and Happy-path sync.
Fixing UX issues caused by P2P Sync (distance from unsafe to safe is too far).

And there is another related PR in op-geth repository to support snap sync.

Triggering Execution Engine’s P2P Sync

Specs

New flag added
- --l2.engine-p2p.enabled=false|true
- default: false
- will be stored in sync.Config.EngineP2PEnabled struct, and passed to EngineQueue
New fields added in EngineQueue
- syncCfg sync.Config
  - Stores EngineP2PEnabled and SkipSanityCheck
- engineSyncTarget eth.L2BlockRef
  - Target L2 block the engine is currently syncing to.
  - If the engine p2p sync is enabled, it can be different with unsafeHead. Otherwise, it must be the same with unsafeHead.
  - Added to SyncStatus
If EngineQueue is reset, set engineSyncTarget to unsafe head.
If the EngineP2PEnabled flag is true, EngineQueue.tryNextUnsafePayload() should behave as follows:
- Do not check if the current unsafe head matches the next unsafe payload’s parent when the client receives the next unsafe payload.
- Allow SYNCING and ACCEPTED status from engine_newPayloadV1
- Allow SYNCING status from engine_forkchoiceUpdatedV1
- If engine_forkchoiceUpdatedV1 returns SYNCING or VALID, update eq.engineSyncTarget to the new unsafe block.
- If engine_forkchoiceUpdatedV1 returns VALID, update eq.unsafeHead to the new unsafe block.
If the eq.unsafeHead and eq.engineSyncTarget are not matched, consider the engine currently syncing.
If the engine is syncing, EngineQueue does not do anything except notify new unsafe payloads to the engine.
- It just returns EngineP2PSyncing error if there’s no unsafe payload in the queue. This error makes the pipeline goes idle.
EngineQueue tries to apply the next unsafe head before applying the next safe attributes in eq.Step() function.
- Current implementations prioritize safe sync over unsafe sync. So engine queue does not try to apply the next unsafe payload until derivates all batches from the current L1 chain.
- After the engine P2P sync is done, there should be a very long gap between the unsafe and safe heads. We have to wait a very long time to consolidate all blocks in the gap by L1 derivation, and then we can continue unsafe sync.
- So we need this change to advance the unsafe head even while L1 derivation is going on in the background.
If eq.tryNextUnsafePayload() returns an EOF error, do not return immediately, and continue remaining tasks in Step() , such as safe sync.
- EOF error is returned when all the below conditions are satisfied:
  - Engine P2P sync is disabled
  - The next unsafe payload is at a higher number than the current unsafe head.
  - The unsafe payload does not build upon the current unsafe head
- So the EOF error means the engine queue must advance the chain head via L1 derivation.

Test

Added test cases in op-e2e:
- Test unsafe sync when engine P2P is disabled.
- Test unsafe sync when engine P2P is enabled. P2P between execution engines could not be tested in op-e2e. so it checks if only engineSyncTarget is advanced.
Actual P2P sync can be tested in Hive. But Hive must be updated to use the latest version of op-node before implementing the engine P2P sync test.
Engine P2P sync test on real chains (connected to one static peer)
- OP Goerli
  - Snap sync: 2h20m
  - Full sync: 6h40m
- OP Mainnet
  - Snap sync: 2h20m
  - Full sync: 12h50m

Fixing UX Issues

This part might be better to open up as another PR. Please let me know if you prefer that way!

Background

The safe head is too far behind the unsafe head when the engine p2p sync is finished (~7M blocks in Goerli).
op-node fails to launch when the user restarts the process after the p2p sync in the current scheme.
- reorg is too deep error comes out while resetting the pipeline.
The following descriptions explain the details & solution 👇.

Details & Solution

op-node resets the derivation pipeline upon launch to find the sync starting point. Spec.
Finding the sync starting point is implemented in sync.FindL2Heads() function.
FindL2Heads() function has two purposes: Finding unsafe and safe heads to start the sync.
FindL2Heads() function has one main loop as follows:
- Traverse blocks backward from the current unsafe block
- Find the unsafe starting point.
- Continues traversing back to find the safe starting point.
- It checks how far the iteration goes from the initial unsafe head to calculate the L2 reorg depth. If the reorg is too deep, raise reorg is too deep error.
  - Problem: It calculates reorg depth in every iteration, even if there’s no L2 reorg.
  - Solution: Calculating reorg depth should be done only if the unsafe starting point is updated to an older block, which causes L2 reorg.

Remaining issues

There’s another UX-related issue though the reorg depth issue is resolved: pipeline reset takes a too long time if the safe and unsafe heads are very far apart.
- Reason: FindL2Heads() iterates every block between unsafe and safe heads to find the safe starting point, which delays the process starts. We speculated intentions for the iteration:
  - Even if we found an unsafe starting point candidate, one of its ancestors may have a non-candidate L1 origin. We should update the unsafe starting point to the older block.
    - However, we found conflicts in the code—L197 allows the update, but L232 doesn’t.
  - The iteration should not only check the L1 origin but also other values (i.e. if L1 origins are contiguous, if sequence numbers are right)
- We implemented --skip-sanity-check flag to avoid the issue and accelerate the process start speed.

- Add new flag and sync.Config struct for engine p2p sync - Fix EngineQueue to support engine p2p sync - Add op-e2e test casees - Fix related components to pass sync config - Fix execution engine specs

changeset-bot · 2023-07-10T09:50:11Z

⚠️ No Changeset found

Latest commit: b3cac0e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

netlify · 2023-07-10T09:50:12Z

✅ Deploy Preview for opstack-docs canceled.

Name	Link
🔨 Latest commit	`68dcaf4`
🔍 Latest deploy log	https://app.netlify.com/sites/opstack-docs/deploys/64abd451415f8a000822a519

codecov · 2023-07-10T10:02:57Z

Codecov Report

Merging #6243 (b3cac0e) into develop (d80c145) will decrease coverage by 0.07%.
Report is 602 commits behind head on develop.
The diff coverage is 22.58%.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #6243      +/-   ##
===========================================
- Coverage    44.31%   44.25%   -0.07%     
===========================================
  Files          436      436              
  Lines        29025    29091      +66     
  Branches       709      709              
===========================================
+ Hits         12863    12873      +10     
- Misses       15085    15135      +50     
- Partials      1077     1083       +6

Flag	Coverage Δ
bedrock-go-tests	`43.04% <22.58%> (-0.08%)`	⬇️
cannon-go-tests	`61.78% <ø> (ø)`
common-ts-tests	`26.82% <ø> (ø)`
contracts-bedrock-tests	`58.86% <ø> (-0.11%)`	⬇️
core-utils-tests	`49.06% <ø> (ø)`
fault-detector-tests	`26.95% <ø> (ø)`
sdk-next-tests	`42.47% <ø> (ø)`
sdk-tests	`42.47% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
op-node/flags/flags.go	`62.50% <ø> (ø)`
op-node/node/config.go	`0.00% <ø> (ø)`
op-node/node/node.go	`0.64% <0.00%> (ø)`
op-node/rollup/derive/error.go	`72.97% <ø> (ø)`
op-node/rollup/derive/pipeline.go	`0.00% <0.00%> (ø)`
op-node/rollup/driver/driver.go	`0.00% <0.00%> (ø)`
op-node/rollup/driver/state.go	`0.00% <0.00%> (ø)`
op-node/service.go	`8.64% <0.00%> (-0.45%)`	⬇️
op-program/client/driver/driver.go	`75.67% <0.00%> (ø)`
op-node/rollup/derive/engine_queue.go	`42.77% <20.37%> (-2.13%)`	⬇️
... and 2 more

... and 2 files with indirect coverage changes

github-actions · 2023-07-25T02:02:29Z

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

protolambda

This looks great! I have a few suggestions to make the functionality/flags clearer to the user. Can you implement those?

op-node/flags/flags.go

op-node/rollup/sync/config.go

op-node/rollup/derive/pipeline.go

op-node/rollup/driver/state.go

Co-Authored-By: protolambda <19571989+protolambda@users.noreply.github.com>

ImTei · 2023-07-26T05:21:03Z

@protolambda Thank you for the review! I have applied your suggestions :)

mergify · 2023-07-31T16:18:37Z

This PR has been added to the merge queue, and will be merged soon.

mergify · 2023-07-31T16:18:46Z

Hey @ImTei, this pull request failed to merge and has been dequeued from the merge train. If you believe your PR failed in the merge train because of a flaky test, requeue it by commenting with @mergifyio requeue.
More details can be found on the Queue: Embarked in merge train check-run.

ghost · 2023-10-30T02:50:05Z

@ImTei @protolambda How do I enable this feature, seems like it is not documented yet?

ajsutton · 2023-10-30T02:54:01Z

It's been enabled by default for a fair few releases now. Just make sure you're using the latest release and then don't disable it or disable networking in op-node and it will work.

ghost · 2023-10-30T02:56:16Z

@ajsutton The sync process on op-geth will not conflict with op-node right?

ghost · 2023-10-30T03:00:49Z

@ajsutton I have found that if snap sync would be possible, it will require any kind of bootnodes or static nodes to connect and sync right? Is there any?

ajsutton · 2023-10-30T03:03:04Z

@kaliubuntu0206 My apologies - there are two similarly named features and I was thinking of the wrong one. p2p sync in op-node has been enabled by default to fill in any missed unsafe blocks. This is actually the op-geth snap sync functionality which is not yet enabled by default. The two systems don't interfere with each other.

To enable snap sync (this feature), for op-geth, set --syncmode snap and ensure you have not disabled discovery (ie don't have the --nodiscover flag). For op-mainnet and op-goerli you will still need to start from the downloaded data directory, for other networks that weren't migrated from pre-bedrock state you just need to ensure op-geth is configured for the right network.

Then for op-node set --l2.engine-sync=true.

Bootnodes can be shared across networks so the default op-geth boot nodes should work fine. However it is worth noting that currently there are extremely few peers available on most OP stack networks so it may be difficult to find peers that you can snap sync from.

ghost · 2023-10-30T03:05:19Z

@ajsutton Is there any bootnodes available for optimism mainnet to connect op-geth nodes? Also, I think it is important feature since no one wants to spend extra dollars for waiting the chain retrieval from L1

ajsutton · 2023-10-30T03:07:58Z

Bootnodes are designed so that they can be shared between networks so you don't need different boot nodes and are most likely to find peers using the default boot nodes. However there are just very very few peers available on OP stack networks currently.

ghost · 2023-10-30T03:09:48Z

@ajsutton Yes but with my understanding in order to sync between L1 nodes you will need a bootnode to find each other, and I see none here https://github.com/ethereum-optimism/op-geth/blob/optimism/params/bootnodes.go

ghost · 2023-10-30T03:10:44Z

Geth also has an history of very poor P2P discovery and many users aren't able to find even with the default bootnode exists. It would be great to have a list of optimism foundation op-geth nodes listed somewhere

ajsutton · 2023-10-30T03:12:28Z

The Mainnet boot nodes are used by default: https://github.com/ethereum-optimism/op-geth/blob/ceb5f3201eb74e9c32594596ea15755ec94f99c1/cmd/utils/flags.go#L1098-L1124

ghost · 2023-10-30T03:13:47Z

@ajsutton They are for --networkid=1 with my knowledge. I mean, are they able to share op-geth nodes as well? I think they will disconnect immediately.

ghost · 2023-10-30T03:15:01Z

They are also mainnet running nodes, you can see EF-Bootnode here https://ethstats.dev/

ajsutton · 2023-10-30T03:17:29Z

Yes, boot nodes are designed to be able to be shared across networks and the sets of peers almost always winds up being shared across bootnodes for different networks just from the way the distributed hash table works for discovery.

That's why the network ID is checked by clients as part of finding peers - to quickly detect peers that are found from the boot nodes that aren't on the same chain. There are a number of other other checks that happen like that as part of the initial handshake process.

There are very very few peers available on OP stack networks currently, you will see a lot of peers being immediately connected as irrelevant and it may take a long time to find ones that are useful because the vast majority of OP stack nodes currently are configured to disable the peer to peer networking.

ghost · 2023-10-30T03:19:48Z

@ajsutton I see, since I haven't really read the specification for P2P communication of ethereum nodes. I hope the feature could be enabled by default and the exclusive bootnodes could be provided for faster synchronization. ( And I think it is crucial since fetching blocks and transactions from Geth is very, very slow ). Thank you for answering my question

ImTei added 2 commits July 7, 2023 14:08

op-node: Implement engine P2P sync mode

0b3052c

- Add new flag and sync.Config struct for engine p2p sync - Fix EngineQueue to support engine p2p sync - Add op-e2e test casees - Fix related components to pass sync config - Fix execution engine specs

op-node: Fix reorg depth check, Add skip-sanity-check flag

68dcaf4

ImTei requested review from a team, protolambda and trianglesphere as code owners July 10, 2023 09:50

ImTei requested a review from tynes July 10, 2023 09:50

ImTei mentioned this pull request Jul 10, 2023

Support snap sync for OP chains ethereum-optimism/op-geth#105

Merged

github-actions bot added the Stale label Jul 25, 2023

protolambda removed the Stale label Jul 25, 2023

protolambda reviewed Jul 25, 2023

View reviewed changes

op-node/flags/flags.go Outdated Show resolved Hide resolved

op-node/rollup/sync/config.go Outdated Show resolved Hide resolved

op-node/rollup/derive/pipeline.go Outdated Show resolved Hide resolved

op-node/rollup/driver/state.go Outdated Show resolved Hide resolved

ImTei and others added 2 commits July 26, 2023 14:02

Rename flags and config variables for clarify to the user

e1a6045

Co-Authored-By: protolambda <19571989+protolambda@users.noreply.github.com>

Improve error handling

b3cac0e

ImTei requested a review from protolambda July 26, 2023 05:21

protolambda approved these changes Jul 31, 2023

View reviewed changes

mergify bot added the on-merge-train label Jul 31, 2023

mergify bot removed the on-merge-train label Jul 31, 2023

protolambda merged commit 3812408 into ethereum-optimism:develop Jul 31, 2023
68 of 71 checks passed

krish-nr mentioned this pull request Oct 7, 2023

feat: engine p2p sync feature bnb-chain/opbnb#62

Merged

ghost mentioned this pull request Oct 30, 2023

Enable snap sync base-org/node#132

Closed

ghost mentioned this pull request Nov 3, 2023

Location of Recent Snapshots #8027

Open

byteflyfunny mentioned this pull request Dec 10, 2023

[R4R]cherry pick from op mantlenetworkio/mantle-v2#66

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

op-node: Engine P2P sync #6243

op-node: Engine P2P sync #6243

ImTei commented Jul 10, 2023 •

edited

Loading

changeset-bot bot commented Jul 10, 2023 •

edited

Loading

netlify bot commented Jul 10, 2023 •

edited

Loading

codecov bot commented Jul 10, 2023 •

edited

Loading

github-actions bot commented Jul 25, 2023

protolambda left a comment

ImTei commented Jul 26, 2023

mergify bot commented Jul 31, 2023

mergify bot commented Jul 31, 2023

ghost commented Oct 30, 2023

ajsutton commented Oct 30, 2023

ghost commented Oct 30, 2023

ghost commented Oct 30, 2023

ajsutton commented Oct 30, 2023

ghost commented Oct 30, 2023

ajsutton commented Oct 30, 2023

ghost commented Oct 30, 2023

ghost commented Oct 30, 2023

ajsutton commented Oct 30, 2023

ghost commented Oct 30, 2023 •

edited by ghost

Loading

ghost commented Oct 30, 2023

ajsutton commented Oct 30, 2023

ghost commented Oct 30, 2023

op-node: Engine P2P sync #6243

op-node: Engine P2P sync #6243

Conversation

ImTei commented Jul 10, 2023 • edited Loading

Triggering Execution Engine’s P2P Sync

Specs

Test

Fixing UX Issues

Background

Details & Solution

Remaining issues

changeset-bot bot commented Jul 10, 2023 • edited Loading

⚠️ No Changeset found

netlify bot commented Jul 10, 2023 • edited Loading

✅ Deploy Preview for opstack-docs canceled.

codecov bot commented Jul 10, 2023 • edited Loading

Codecov Report

github-actions bot commented Jul 25, 2023

protolambda left a comment

Choose a reason for hiding this comment

ImTei commented Jul 26, 2023

mergify bot commented Jul 31, 2023

mergify bot commented Jul 31, 2023

ghost commented Oct 30, 2023

ajsutton commented Oct 30, 2023

ghost commented Oct 30, 2023

ghost commented Oct 30, 2023

ajsutton commented Oct 30, 2023

ghost commented Oct 30, 2023

ajsutton commented Oct 30, 2023

ghost commented Oct 30, 2023

ghost commented Oct 30, 2023

ajsutton commented Oct 30, 2023

ghost commented Oct 30, 2023 • edited by ghost Loading

ghost commented Oct 30, 2023

ajsutton commented Oct 30, 2023

ghost commented Oct 30, 2023

ImTei commented Jul 10, 2023 •

edited

Loading

changeset-bot bot commented Jul 10, 2023 •

edited

Loading

netlify bot commented Jul 10, 2023 •

edited

Loading

codecov bot commented Jul 10, 2023 •

edited

Loading

ghost commented Oct 30, 2023 •

edited by ghost

Loading