Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

op-node: Engine P2P sync #6243

Merged

Conversation

ImTei
Copy link
Contributor

@ImTei ImTei commented Jul 10, 2023

Good morning! This is Tei from Test in Prod. This PR contains two topics:

  • Triggering the Execution Engine’s P2P sync for a faster sync and Happy-path sync.
  • Fixing UX issues caused by P2P Sync (distance from unsafe to safe is too far).

And there is another related PR in op-geth repository to support snap sync.

Triggering Execution Engine’s P2P Sync

Specs

  • New flag added
    • --l2.engine-p2p.enabled=false|true
    • default: false
    • will be stored in sync.Config.EngineP2PEnabled struct, and passed to EngineQueue
  • New fields added in EngineQueue
    • syncCfg sync.Config
      • Stores EngineP2PEnabled and SkipSanityCheck
    • engineSyncTarget eth.L2BlockRef
      • Target L2 block the engine is currently syncing to.
      • If the engine p2p sync is enabled, it can be different with unsafeHead. Otherwise, it must be the same with unsafeHead.
      • Added to SyncStatus
  • If EngineQueue is reset, set engineSyncTarget to unsafe head.
  • If the EngineP2PEnabled flag is true, EngineQueue.tryNextUnsafePayload() should behave as follows:
    • Do not check if the current unsafe head matches the next unsafe payload’s parent when the client receives the next unsafe payload.
    • Allow SYNCING and ACCEPTED status from engine_newPayloadV1
    • Allow SYNCING status from engine_forkchoiceUpdatedV1
    • If engine_forkchoiceUpdatedV1 returns SYNCING or VALID, update eq.engineSyncTarget to the new unsafe block.
    • If engine_forkchoiceUpdatedV1 returns VALID, update eq.unsafeHead to the new unsafe block.
  • If the eq.unsafeHead and eq.engineSyncTarget are not matched, consider the engine currently syncing.
  • If the engine is syncing, EngineQueue does not do anything except notify new unsafe payloads to the engine.
    • It just returns EngineP2PSyncing error if there’s no unsafe payload in the queue. This error makes the pipeline goes idle.
  • EngineQueue tries to apply the next unsafe head before applying the next safe attributes in eq.Step() function.
    • Current implementations prioritize safe sync over unsafe sync. So engine queue does not try to apply the next unsafe payload until derivates all batches from the current L1 chain.
    • After the engine P2P sync is done, there should be a very long gap between the unsafe and safe heads. We have to wait a very long time to consolidate all blocks in the gap by L1 derivation, and then we can continue unsafe sync.
    • So we need this change to advance the unsafe head even while L1 derivation is going on in the background.
  • If eq.tryNextUnsafePayload() returns an EOF error, do not return immediately, and continue remaining tasks in Step() , such as safe sync.
    • EOF error is returned when all the below conditions are satisfied:
      • Engine P2P sync is disabled
      • The next unsafe payload is at a higher number than the current unsafe head.
      • The unsafe payload does not build upon the current unsafe head
    • So the EOF error means the engine queue must advance the chain head via L1 derivation.

Test

  • Added test cases in op-e2e:
    • Test unsafe sync when engine P2P is disabled.
    • Test unsafe sync when engine P2P is enabled. P2P between execution engines could not be tested in op-e2e. so it checks if only engineSyncTarget is advanced.
  • Actual P2P sync can be tested in Hive. But Hive must be updated to use the latest version of op-node before implementing the engine P2P sync test.
  • Engine P2P sync test on real chains (connected to one static peer)
    • OP Goerli
      • Snap sync: 2h20m
      • Full sync: 6h40m
    • OP Mainnet
      • Snap sync: 2h20m
      • Full sync: 12h50m

Fixing UX Issues

This part might be better to open up as another PR. Please let me know if you prefer that way!

Background

  • The safe head is too far behind the unsafe head when the engine p2p sync is finished (~7M blocks in Goerli).
  • op-node fails to launch when the user restarts the process after the p2p sync in the current scheme.
    • reorg is too deep error comes out while resetting the pipeline.
  • The following descriptions explain the details & solution 👇.

Details & Solution

  • op-node resets the derivation pipeline upon launch to find the sync starting point. Spec.
  • Finding the sync starting point is implemented in sync.FindL2Heads() function.
  • FindL2Heads() function has two purposes: Finding unsafe and safe heads to start the sync.
  • FindL2Heads() function has one main loop as follows:
    • Traverse blocks backward from the current unsafe block
    • Find the unsafe starting point.
    • Continues traversing back to find the safe starting point.
    • It checks how far the iteration goes from the initial unsafe head to calculate the L2 reorg depth. If the reorg is too deep, raise reorg is too deep error.
      • Problem: It calculates reorg depth in every iteration, even if there’s no L2 reorg.
      • Solution: Calculating reorg depth should be done only if the unsafe starting point is updated to an older block, which causes L2 reorg.

Remaining issues

  • There’s another UX-related issue though the reorg depth issue is resolved: pipeline reset takes a too long time if the safe and unsafe heads are very far apart.
    • Reason: FindL2Heads() iterates every block between unsafe and safe heads to find the safe starting point, which delays the process starts. We speculated intentions for the iteration:
      • Even if we found an unsafe starting point candidate, one of its ancestors may have a non-candidate L1 origin. We should update the unsafe starting point to the older block.
        • However, we found conflicts in the code—L197 allows the update, but L232 doesn’t.
      • The iteration should not only check the L1 origin but also other values (i.e. if L1 origins are contiguous, if sequence numbers are right)
    • We implemented --skip-sanity-check flag to avoid the issue and accelerate the process start speed.

ImTei added 2 commits July 7, 2023 14:08
- Add new flag and sync.Config struct for engine p2p sync
- Fix EngineQueue to support engine p2p sync
- Add op-e2e test casees
- Fix related components to pass sync config
- Fix execution engine specs
@ImTei ImTei requested a review from tynes July 10, 2023 09:50
@changeset-bot
Copy link

changeset-bot bot commented Jul 10, 2023

⚠️ No Changeset found

Latest commit: b3cac0e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@netlify
Copy link

netlify bot commented Jul 10, 2023

Deploy Preview for opstack-docs canceled.

Name Link
🔨 Latest commit 68dcaf4
🔍 Latest deploy log https://app.netlify.com/sites/opstack-docs/deploys/64abd451415f8a000822a519

@codecov
Copy link

codecov bot commented Jul 10, 2023

Codecov Report

Merging #6243 (b3cac0e) into develop (d80c145) will decrease coverage by 0.07%.
Report is 602 commits behind head on develop.
The diff coverage is 22.58%.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #6243      +/-   ##
===========================================
- Coverage    44.31%   44.25%   -0.07%     
===========================================
  Files          436      436              
  Lines        29025    29091      +66     
  Branches       709      709              
===========================================
+ Hits         12863    12873      +10     
- Misses       15085    15135      +50     
- Partials      1077     1083       +6     
Flag Coverage Δ
bedrock-go-tests 43.04% <22.58%> (-0.08%) ⬇️
cannon-go-tests 61.78% <ø> (ø)
common-ts-tests 26.82% <ø> (ø)
contracts-bedrock-tests 58.86% <ø> (-0.11%) ⬇️
core-utils-tests 49.06% <ø> (ø)
fault-detector-tests 26.95% <ø> (ø)
sdk-next-tests 42.47% <ø> (ø)
sdk-tests 42.47% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
op-node/flags/flags.go 62.50% <ø> (ø)
op-node/node/config.go 0.00% <ø> (ø)
op-node/node/node.go 0.64% <0.00%> (ø)
op-node/rollup/derive/error.go 72.97% <ø> (ø)
op-node/rollup/derive/pipeline.go 0.00% <0.00%> (ø)
op-node/rollup/driver/driver.go 0.00% <0.00%> (ø)
op-node/rollup/driver/state.go 0.00% <0.00%> (ø)
op-node/service.go 8.64% <0.00%> (-0.45%) ⬇️
op-program/client/driver/driver.go 75.67% <0.00%> (ø)
op-node/rollup/derive/engine_queue.go 42.77% <20.37%> (-2.13%) ⬇️
... and 2 more

... and 2 files with indirect coverage changes

@github-actions
Copy link
Contributor

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Copy link
Contributor

@protolambda protolambda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I have a few suggestions to make the functionality/flags clearer to the user. Can you implement those?

op-node/flags/flags.go Outdated Show resolved Hide resolved
op-node/rollup/sync/config.go Outdated Show resolved Hide resolved
op-node/rollup/derive/pipeline.go Outdated Show resolved Hide resolved
op-node/rollup/driver/state.go Outdated Show resolved Hide resolved
ImTei and others added 2 commits July 26, 2023 14:02
Co-Authored-By: protolambda <19571989+protolambda@users.noreply.github.com>
@ImTei
Copy link
Contributor Author

ImTei commented Jul 26, 2023

@protolambda Thank you for the review! I have applied your suggestions :)

@ImTei ImTei requested a review from protolambda July 26, 2023 05:21
@mergify
Copy link
Contributor

mergify bot commented Jul 31, 2023

This PR has been added to the merge queue, and will be merged soon.

@mergify
Copy link
Contributor

mergify bot commented Jul 31, 2023

Hey @ImTei, this pull request failed to merge and has been dequeued from the merge train. If you believe your PR failed in the merge train because of a flaky test, requeue it by commenting with @mergifyio requeue.
More details can be found on the Queue: Embarked in merge train check-run.

@mergify mergify bot removed the on-merge-train label Jul 31, 2023
@protolambda protolambda merged commit 3812408 into ethereum-optimism:develop Jul 31, 2023
68 of 71 checks passed
@ghost
Copy link

ghost commented Oct 30, 2023

@ImTei @protolambda How do I enable this feature, seems like it is not documented yet?

@ajsutton
Copy link
Contributor

It's been enabled by default for a fair few releases now. Just make sure you're using the latest release and then don't disable it or disable networking in op-node and it will work.

@ghost
Copy link

ghost commented Oct 30, 2023

@ajsutton The sync process on op-geth will not conflict with op-node right?

@ghost
Copy link

ghost commented Oct 30, 2023

@ajsutton I have found that if snap sync would be possible, it will require any kind of bootnodes or static nodes to connect and sync right? Is there any?

@ajsutton
Copy link
Contributor

@kaliubuntu0206 My apologies - there are two similarly named features and I was thinking of the wrong one. p2p sync in op-node has been enabled by default to fill in any missed unsafe blocks. This is actually the op-geth snap sync functionality which is not yet enabled by default. The two systems don't interfere with each other.

To enable snap sync (this feature), for op-geth, set --syncmode snap and ensure you have not disabled discovery (ie don't have the --nodiscover flag). For op-mainnet and op-goerli you will still need to start from the downloaded data directory, for other networks that weren't migrated from pre-bedrock state you just need to ensure op-geth is configured for the right network.

Then for op-node set --l2.engine-sync=true.

Bootnodes can be shared across networks so the default op-geth boot nodes should work fine. However it is worth noting that currently there are extremely few peers available on most OP stack networks so it may be difficult to find peers that you can snap sync from.

@ghost ghost mentioned this pull request Oct 30, 2023
@ghost
Copy link

ghost commented Oct 30, 2023

@ajsutton Is there any bootnodes available for optimism mainnet to connect op-geth nodes? Also, I think it is important feature since no one wants to spend extra dollars for waiting the chain retrieval from L1

@ajsutton
Copy link
Contributor

Bootnodes are designed so that they can be shared between networks so you don't need different boot nodes and are most likely to find peers using the default boot nodes. However there are just very very few peers available on OP stack networks currently.

@ghost
Copy link

ghost commented Oct 30, 2023

@ajsutton Yes but with my understanding in order to sync between L1 nodes you will need a bootnode to find each other, and I see none here https://github.com/ethereum-optimism/op-geth/blob/optimism/params/bootnodes.go

@ghost
Copy link

ghost commented Oct 30, 2023

Geth also has an history of very poor P2P discovery and many users aren't able to find even with the default bootnode exists. It would be great to have a list of optimism foundation op-geth nodes listed somewhere

@ajsutton
Copy link
Contributor

@ghost
Copy link

ghost commented Oct 30, 2023

@ajsutton They are for --networkid=1 with my knowledge. I mean, are they able to share op-geth nodes as well? I think they will disconnect immediately.

@ghost
Copy link

ghost commented Oct 30, 2023

They are also mainnet running nodes, you can see EF-Bootnode here https://ethstats.dev/

@ajsutton
Copy link
Contributor

Yes, boot nodes are designed to be able to be shared across networks and the sets of peers almost always winds up being shared across bootnodes for different networks just from the way the distributed hash table works for discovery.

That's why the network ID is checked by clients as part of finding peers - to quickly detect peers that are found from the boot nodes that aren't on the same chain. There are a number of other other checks that happen like that as part of the initial handshake process.

There are very very few peers available on OP stack networks currently, you will see a lot of peers being immediately connected as irrelevant and it may take a long time to find ones that are useful because the vast majority of OP stack nodes currently are configured to disable the peer to peer networking.

@ghost
Copy link

ghost commented Oct 30, 2023

@ajsutton I see, since I haven't really read the specification for P2P communication of ethereum nodes. I hope the feature could be enabled by default and the exclusive bootnodes could be provided for faster synchronization. ( And I think it is crucial since fetching blocks and transactions from Geth is very, very slow ). Thank you for answering my question

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants