Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

derivation: remove EngineQueue #10643

Merged
merged 3 commits into from
Jun 5, 2024
Merged

derivation: remove EngineQueue #10643

merged 3 commits into from
Jun 5, 2024

Conversation

protolambda
Copy link
Contributor

@protolambda protolambda commented May 24, 2024

Description

Removal of the engine-queue.

This decouples the execution-engine from the L2 derivation-pipeline, which now attributes L2 block attributes.

The "sync deriver" (in earlier PRs, syncStep function) encapsulates the synchronous sequence of sub-derivers, preserving the old behavior still.

The DerivationPipeline does not know about any related to the EngineController anymore, making it possible to schedule it asynchronous to the other derivation work.

Builds on top of #10642

Tests

Minor test changes:

  • Fixed sequencer action test: since derivation traverses L1 to find when L1 reorgs, it needs to traverse the old chain before it can report a reorg. If the reorg is detected upon a sequencing action, then the derivation will be reset, but not applied to the engine, until the derivation itself runs.

Additional context

Completes phase 1a of op-node derivers design-doc

Metadata

Fix https://github.com/ethereum-optimism/protocol-quest/issues/272

@protolambda protolambda self-assigned this May 24, 2024
Copy link
Contributor

semgrep-app bot commented May 24, 2024

Semgrep found 2 golang_fmt_errorf_no_params findings:

No fmt.Errorf invocations without fmt arguments allowed

Ignore this finding from golang_fmt_errorf_no_params.

Semgrep found 2 invalid-usage-of-modified-variable findings:

Variable unsafeInNode is likely modified and later used on error. In some cases this could result in panics due to a nil dereference

Ignore this finding from invalid-usage-of-modified-variable.

Semgrep found 1 todos_require_linear finding:

  • packages/core-utils/src/common/bn.ts

Please create a GitHub ticket for this TODO.

Ignore this finding from todos_require_linear.

@protolambda protolambda force-pushed the no-engine-queue branch 2 times, most recently from 0bbb879 to 82b6f03 Compare May 30, 2024 15:10
@protolambda protolambda changed the title derivation: remove EngineQueue (work in progress) derivation: remove EngineQueue May 30, 2024
@protolambda
Copy link
Contributor Author

Might need a rebase to fix the finalize-test flake

@protolambda protolambda marked this pull request as ready for review May 30, 2024 15:31
@protolambda protolambda requested review from a team and ajsutton as code owners May 30, 2024 15:31
@protolambda
Copy link
Contributor Author

Trying to debug what is wrong with op-program-compat here, I did have to make some changes to fit in the separation of ResetEngine from derivation. Regular tests pass, but something is still off about FP program tests.

@protolambda
Copy link
Contributor Author

Rebased on the rebased base

@ajsutton
Copy link
Contributor

The op-program-compat test is failing because it's now reading an additional preimage to what it used to. We can regenerate that test if needed and it will grab the extra preimages required, but it's slightly suspicious that the behaviour changed. it is in the l2 state trie reads though and we have seen that hash map order can cause an extra read in the case of a deletion with those sometimes (not an issue in cannon, but can be when running natively) - odd that we haven't hit it for that test before though given how much its run since it went in. @Inphi any chance you can look at the failing test and tell if it is this random difference quickly or not?

It could just be a new read op-node is doing which isn't an issue though (fault proofs used a fixed version of op-program so repeatability isn't a concern as long as this new behaviour is deterministic).

@ajsutton
Copy link
Contributor

Ok I'm hitting that test failure locally as well with the verify-mainnet-genesis case (cd op-program && make verify-mainnet-genesis). That test runs with --l1.head "0x4903424f6cc2cfba7c2bf8c8f48ca46721c963fa64b411cfee3697b781e3e5f1" --l2.start "105235063" --l2.end "105235064" which is the very first bedrock block. So I'm guessing we do have an additional read in that specific case now.

@ajsutton
Copy link
Contributor

Code looks sane, though I need to dig into it better to follow the actual flow.

I've captured fresh test data for the mainnet-genesis case - on this branch it winds up with 167758 preimages and on develop it has 167728. It quite consistently fails with the existing data and passes with the new data. So it is requesting something new, but it's not going crazy requesting a ton of extra data which is good.

Looking at the difference in logs I think it's not stopping derivation when it should and winds up importing a bunch more blocks. The old log stops at:

Derivation complete: reached L2 block" head=0x1e90d8496358cbe4998efcd3657d243bcdfdcf0f13127afa146210b9ef7576af:105235064

but the new one stops at:

Derivation complete: reached L2 block" head=0x02609db31f70efb1b906cc572d23ac9ef79f8a5cb8e3baff11b824720b55ebd5:105237751

And it's part of importing one of those additional blocks that it winds up needing the extra l2 state nodes. It ultimately works because op-program will get the output root at the claimed block even if derivation went past there (required for span batches) but we are doing more work than we need to which I'd like to understand. I've put the old (from develop) and new (from this branch) logs in a gist. I'd guess though that we're stepping through a full L1 block in each "Step" call the op-program driver makes now or not updating safe head until the L1 block has been fully processed which would have the same result. The last Advancing bq origin message we get in the new logs is indeed before we import any blocks which supports the argument we're applying all batches from the L1 block in a single "step".

@Inphi
Copy link
Contributor

Inphi commented May 31, 2024

I'll take a closer look later. But from glancing at the stacktrace, I see that the missing trie node was fetched during deletion:

github.com/ethereum/go-ethereum/core/state.(*StateDB).getDeletedStateObject(0xc0111dcf00, {0x9b, 0xbf, 0xb9, 0x91, 0x90, 0x62, 0xc2, 0x9a, 0x5e, ...})
        /go/pkg/mod/github.com/ethereum-optimism/op-geth@v1.101315.1/core/state/statedb.go:595 +0x39a
github.com/ethereum/go-ethereum/core/state.(*StateDB).getStateObject(...)
        /go/pkg/mod/github.com/ethereum-optimism/op-geth@v1.101315.1/core/state/statedb.go:550
github.com/ethereum/go-ethereum/core/state.(*StateDB).GetNonce(0x5?, {0x9b, 0xbf, 0xb9, 0x91, 0x90, 0x62, 0xc2, 0x9a, 0x5e, ...})
        /go/pkg/mod/github.com/ethereum-optimism/op-geth@v1.101315.1/core/state/statedb.go:293 +0x26
github.com/ethereum/go-ethereum/core.(*StateTransition).preCheck(0xc0113ee630)
        /go/pkg/mod/github.com/ethereum-optimism/op-geth@v1.101315.1/core/state_transition.go:314 +0x90
github.com/ethereum/go-ethereum/core.(*StateTransition).innerTransitionDb(0xc0113ee630)
        /go/pkg/mod/github.com/ethereum-optimism/op-geth@v1.101315.1/core/state_transition.go:449 +0x51
github.com/ethereum/go-ethereum/core.(*StateTransition).TransitionDb(0xc0113ee630)
        /go/pkg/mod/github.com/ethereum-optimism/op-geth@v1.101315.1/core/state_transition.go:412 +0xe5
github.com/ethereum/go-ethereum/core.ApplyMessage(0x9ac2629091b9bf9b?, 0xc3d7c993cd5ae15e?, 0xaa314db1?)
        /go/pkg/mod/github.com/ethereum-optimism/op-geth@v1.101315.1/core/state_transition.go:193 +0x57
github.com/ethereum/go-ethereum/core.applyTransaction(0xc01104f450, 0xc0001ce300, 0x11000000?, 0xc0111dcf00, 0xc010545280, {0xc2, 0x94, 0x72, 0xc9, 0xa8, ...}, ...)

It seems similar to the Go map iteration order issue.

@ajsutton
Copy link
Contributor

ajsutton commented Jun 1, 2024

Thanks @Inphi I dug a bit deeper after my initial comment and it is because op-program is now processing more blocks than before - these state accesses happen because one of those blocks contains a transaction that accesses them. So thankfully not the iteration order issue.

Copy link
Contributor

@ajsutton ajsutton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me - we just need to put in a fix for op-program not stopping at the target block. I'd suggest just having the if head.number >= d.targetBlockNum from the op-program driver.go run only if there isn't an error from the pipeline rather than on NotEnoughData. That actually picks up the new block slightly earlier even in the old code and avoids depending on implementation details - typically nil return value means the Step made some progress so we should check if we've reached the target block.

op-e2e/actions/l2_verifier.go Outdated Show resolved Hide resolved
op-program/client/driver/driver.go Outdated Show resolved Hide resolved
op-program/client/driver/driver.go Show resolved Hide resolved
@protolambda protolambda force-pushed the attributes-handler branch 2 times, most recently from 3f5004d to 3bc8330 Compare June 4, 2024 12:49
Base automatically changed from attributes-handler to develop June 4, 2024 13:29
Copy link

codecov bot commented Jun 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.92%. Comparing base (70912c0) to head (7f93249).

Current head 7f93249 differs from pull request most recent head aadb641

Please upload reports for the commit aadb641 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff              @@
##           develop   #10643       +/-   ##
============================================
+ Coverage    54.65%   81.92%   +27.27%     
============================================
  Files           37       10       -27     
  Lines         2944     1079     -1865     
  Branches       415        0      -415     
============================================
- Hits          1609      884      -725     
+ Misses        1303      163     -1140     
  Partials        32       32               
Flag Coverage Δ
cannon-go-tests 81.92% <ø> (+2.31%) ⬆️
chain-mon-tests ?
sdk-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

see 30 files with indirect coverage changes

@protolambda
Copy link
Contributor Author

Rebased on develop. Reviewing the op-program issue / comments of above now.

Copy link
Contributor

semgrep-app bot commented Jun 4, 2024

Semgrep found 5 sol-style-return-arg-fmt findings:

Named return arguments to functions must be appended with an underscore (_)

Ignore this finding from sol-style-return-arg-fmt.

@protolambda
Copy link
Contributor Author

Rebased on develop, trying to deal with Fjord flakes. At least the op-program tests seem to be passing.

@protolambda protolambda requested a review from ajsutton June 4, 2024 21:36
Copy link
Contributor

@ajsutton ajsutton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@protolambda protolambda added this pull request to the merge queue Jun 5, 2024
Merged via the queue into develop with commit 4a91c9a Jun 5, 2024
64 checks passed
@protolambda protolambda deleted the no-engine-queue branch June 5, 2024 13:50
rdovgan pushed a commit to rdovgan/optimism that referenced this pull request Jun 24, 2024
* op-node: remove engine queue

(squashed) remove debug line

* op-node: test VerifyNewL1Origin

* op-node: engine-queue removal review fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants