execution, cl: persist initial-cycle lifecycle by block TTL#20895
execution, cl: persist initial-cycle lifecycle by block TTL#20895JkLondon wants to merge 35 commits into
Conversation
…es/otterprunefixes
There was a problem hiding this comment.
Pull request overview
Persists “initial sync cycle” lifecycle state across restarts by writing a “tip reached” marker to the DB and deriving initial-cycle status from progress deltas over a block-based TTL, rather than a process-local startup heuristic.
Changes:
- Add
Sync.InitialCycleBlockTTL/--sync.initial-cycle-block-ttl(default1024) and snapshot it in config tests. - Persist and manage a “last tip reached block” marker in
kv.DatabaseInfo, including reset-path cleanup and lifecycle helpers. - Propagate known-tip hints (including Caplin/checkpoint-sync derived hints) into staged sync loop and execmodule FCU handling to compute initial-cycle consistently.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| node/ethconfig/config.go | Adds InitialCycleBlockTTL to sync config defaults/struct. |
| node/cli/flags.go | Introduces CLI flag and wires it into config application. |
| node/cli/default_flags.go | Registers the new sync lifecycle flag in default flag set. |
| execution/stagedsync/stageloop/stageloop.go | Switches stageloop to DB-derived initial-cycle decision and updates tip marker during sync/prune. |
| execution/stagedsync/stage_snapshots.go | Adjusts snapshot prune worker behavior based on CurrentSyncCycle.IsInitialCycle (but includes a stray println). |
| execution/stagedsync/stage_execute.go | Uses CurrentSyncCycle.IsInitialCycle in pruning behavior (but includes a stray println + stack). |
| execution/stagedsync/rawdbreset/reset_stages.go | Clears the persisted tip-reached marker on reset paths. |
| execution/stagedsync/initial_cycle_test.go | Adds tests for initial-cycle derivation and tip marker CRUD behavior. |
| execution/stagedsync/initial_cycle.go | Adds lifecycle helpers for computing initial-cycle and updating the tip marker. |
| execution/stagedsync/headerdownload/header_algos.go | Adds SetInitialCycle(bool) and routes AfterInitialCycle() through it. |
| execution/execmodule/forkchoice.go | Makes FCU pipeline initial-cycle incorporate lifecycle state + limited-big-jump and updates tip marker post-run. |
| execution/execmodule/executor.go | Adds known-tip hint plumbing and per-iteration initial-cycle decision + post-prune hook to update marker. |
| execution/execmodule/exec_module.go | Exposes SetKnownTipHint on ExecModule for upstream hint propagation. |
| execution/execmodule/chainreader/chain_reader.go | Forwards known-tip hints to the execution module when supported. |
| db/rawdb/accessors_chain.go | Adds rawdb accessors for reading/writing/deleting persisted tip-reached marker. |
| db/kv/kv_interface.go | Extracts Deleter interface and embeds it into Putter. |
| cmd/erigon/node/config_snapshot_test.go | Extends config snapshot to include initial-cycle TTL and adds flag/default tests. |
| cmd/caplin/caplin1/run.go | Propagates Caplin known-tip hints (local + checkpoint-sync head hint) into execution engine at startup. |
| cl/phase1/execution_client/execution_client_direct.go | Implements SetKnownTipHint by forwarding to chainRW. |
| cl/phase1/core/checkpoint_sync/util.go | Adds helper to fetch best-effort head execution block number from checkpoint endpoints. |
| cl/phase1/core/checkpoint_sync/remote_checkpoint_sync.go | Implements endpoint translation + JSON parsing to fetch head execution block number. |
| cl/phase1/core/checkpoint_sync/checkpoint_sync_test.go | Adds tests for head execution block number fetch and URI translation. |
| .gitignore | Ignores local Go cache/artifact directories. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…es/otterprunefixes
…es/otterprunefixes
…es/otterprunefixes
…es/otterprunefixes
…es/otterprunefixes
…es/otterprunefixes
…es/otterprunefixes
yperbasis
left a comment
There was a problem hiding this comment.
Correctness
fetchHeadExecutionBlockNumber: deferred Close error is silently dropped (cl/phase1/core/checkpoint_sync/remote_checkpoint_sync.go:111)
resp, err := http.DefaultClient.Do(req)
if err != nil { return 0, err }
defer func() {
err = resp.Body.Close()
}()
err here is the local variable from :=, not a named return. The deferred assignment runs after return 0, … has captured the return value, so the close error is unobservable. Either
ignore it explicitly (_ = resp.Body.Close()) or use named returns and merge:
func (r *RemoteCheckpointSync) fetchHeadExecutionBlockNumber(...) (n uint64, err error) {
...
defer func() {
if cerr := resp.Body.Close(); err == nil { err = cerr }
}()
(Note: GetLatestBeaconState above has the same pattern. The PR is copying a pre-existing bug — but it's worth fixing here since it's a brand-new function, and ideally fixing both.)
PipelineExecutor.SetKnownTipHint is not actually atomic-monotonic (execution/execmodule/executor.go)
if cur := pe.knownTipHint.Load(); blockNum > cur {
pe.knownTipHint.Store(blockNum)
}
This is a classic check-then-store TOCTOU. The comment claims "The value is monotonic (max wins)" but with concurrent writers a higher value can be overwritten by a lower one. Today
the only caller is Caplin startup (single-shot), but the contract advertised in the doc-comment isn't met. Fix with a CAS loop:
for {
cur := pe.knownTipHint.Load()
if blockNum <= cur { return }
if pe.knownTipHint.CompareAndSwap(cur, blockNum) { return }
}
Startup delay: per-URI timeout × N URIs
ReadRemoteHeadExecutionBlockNumber is called synchronously from RunCaplinService (cmd/caplin/caplin1/run.go:218–232). GetHeadExecutionBlockNumber iterates all checkpoint URIs and
applies r.timeout (500ms) to each. Mainnet has 5 default URIs → up to 2.5s of added startup latency if every endpoint is slow/dead. Two suggestions:
- Cap with an overall context deadline (context.WithTimeout(ctx, HeadExecutionBlockHintTimeout) once before the loop), or
- Race the URIs with errgroup and return on first success.
This is a "best-effort hint" — failing fast matters more than being thorough.
executeBlock's isInitialCycle parameter becomes dead (execution/stagedsync/exec3_serial.go:324)
After replacing the only use site at line 421 with se.accumulator != nil, the isInitialCycle parameter is no longer read. unparam will likely flag this. Either remove the parameter
(and the corresponding arg at line 174) or document why it's retained.
Semantic change: !IsInitialCycle → accumulator != nil for RecentReceipts
Both exec3_serial.go:421 and exec3_parallel.go:295 switch the gate on RecentReceipts.Add from "not in initial cycle" to "accumulator is set". This is justifiable (no subscribers ⇒ no
point populating the cache), but it's a behavior change, not a refactor:
- During initial cycle, the parallel/serial executors traditionally skipped RecentReceipts work entirely. Now they do it whenever an accumulator is present.
- Conversely, if there's ever a path where accumulator == nil while IsInitialCycle == false, RecentReceipts will no longer be populated where it used to be.
Worth verifying both cases empirically, and ideally the PR description should call this out — right now it's hidden inside a "lifecycle" PR.
Stale tip-reached marker on generic Reset
DeleteTipReached is wired into ResetState, ResetBlocks, and ResetExec, but not into Reset(ctx, db, stagesList...) (reset_stages.go:193). If someone calls Reset(ctx, db,
stages.Finish), Finish progress goes to 0 (which makes IsInitialCycle correctly return true regardless), but the stored marker is left at its old high-water value.
UpdateTipReachedFromProgress then refuses to update until finish climbs above the old marker, so the marker stays stale through the entire re-sync. Cosmetic — not load-bearing for
correctness — but worth either deleting in Reset too or commenting on why we don't bother.
Style / Smaller issues
.gitignore (/.gitignore)
- Adds gocache, .gomodcache, _artifacts — these look like personal dev/CI artifacts unrelated to the lifecycle change. Either move to a separate cleanup PR or justify why they belong
here. - Last line lacks a trailing newline (\ No newline at end of file in the diff). Add one.
stage_snapshots.go
- Spurious blank line added at line 432; no other change in this file. Drop it.
Tests in checkpoint_sync_test.go
- TestRemoteCheckpointSyncHeadExecutionBlockNumber mutates package global clparams.ConfigurableCheckpointsURLs without restoring it (no t.Cleanup). This will cross-contaminate any
test that runs after it in the same package. Save and restore. - Same test uses positional struct initialization: &RemoteCheckpointSync{&clparams.MainnetBeaconConfig, chainspec.MainnetChainID, time.Second}. The factory function
NewRemoteCheckpointSync exists and ReadRemoteHeadExecutionBlockNumber uses named init — match that style for consistency and so a field reorder doesn't silently break the test.
Two KnownTipHintReceiver interfaces with the same name
Defined in both cl/phase1/execution_client/interface.go and execution/execmodule/interface.go. Each is appropriate at its own seam (Caplin → ChainReader → ExecModule), but the
identical name across packages will confuse gopls "go to definition" navigation. Consider renaming one to disambiguate (e.g. ExecKnownTipHintReceiver on the inner one).
headBlockURI URL manipulation
String-finding /eth/v2/ in u.Path is fragile (custom proxy paths can vary). It's a best-effort hint endpoint with graceful degradation, so this is acceptable, but a brief comment
explaining the assumption ("checkpoint URLs are expected to follow …/eth/v2/debug/beacon/states/finalized; we strip everything after /eth/v2/ and substitute the head-block path")
would help future readers.
Default InitialCycleBlockTTL = 1024
At ~12s/slot that's roughly 3.4 hours on mainnet. A node that's been off overnight will re-enter initial cycle. Probably the intended trade-off (better safe than sorry on a long
downtime), but a one-line Why: 1024 ≈ ~3 hours, balancing "don't reuse stale fast-pipeline mode" vs. "don't punish brief restarts" in the flag's Usage string would make the choice
legible to operators.
Test coverage
Good for the pure helpers (IsInitialCycleFromProgress, UpdateTipReachedFromProgress, headBlockURI, the new HTTP path). Gaps:
- No integration test exercising the full restart-on-synced-node story end-to-end (write marker → restart → confirm prune uses non-initial-cycle flag). Hard to write but high-value
given how subtle the semantics are. - No test for the Reset* paths confirming the marker is cleared.
- No race test for SetKnownTipHint (which, given the issue above, would currently fail under -race if exercised concurrently).
Summary
The persistence approach is the right fix for the underlying bug, and the call-site comments do an honest job of explaining the dual InitialCycle semantics. Before merge I'd want at
minimum:
- Fix the defer ... = resp.Body.Close() no-op.
- Either fix SetKnownTipHint to be CAS-based or weaken its doc-comment.
- Bound the checkpoint-hint fetch with a single overall timeout.
- Drop the dead isInitialCycle parameter (or document why it stays).
- Restore globals in tests; revert the unrelated .gitignore / whitespace churn or split it out.
The RecentReceipts gate change deserves a separate sentence in the PR description so reviewers don't miss it.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 24 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if execctx.IsDomainAheadOfBlocks(ctx, tx, pe.logger) { | ||
| if err := stagedsync.UpdateTipReached(tx, pe.blockReader.FrozenBlocks()); err != nil { | ||
| return err | ||
| } |
| if headExecutionBlockNumber, ok, err := checkpoint_sync.ReadRemoteHeadExecutionBlockNumber(ctx, beaconConfig, config); err != nil { | ||
| log.Debug("[Checkpoint Sync] Could not fetch head execution block hint", "err", err) | ||
| } else if ok { | ||
| knownTipHint = max(knownTipHint, headExecutionBlockNumber) | ||
| } |
| pruneInitialCycle, err := stagedsync.IsInitialCycle(tx, e.syncCfg, e.blockReader.FrozenBlocks(), fcuBlockNum) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| if err := e.pipelineExecutor.RunPrune(e.bacgroundCtx, tx, pruneInitialCycle, pruneTimeout); err != nil { |
|
|
||
| SyncInitialCycleBlockTTLFlag = cli.Uint64Flag{ | ||
| Name: "sync.initial-cycle-block-ttl", | ||
| Usage: "Sets the block window for treating a recently synced node as past the initial sync cycle", |
closes #20854
Summary
Fix
IsInitialCyclestartup semantics so a restarted synced node no longer looks like a fresh initial sync just because the process restarted.This replaces the process-local startup flag and duration heuristic with persisted tip-reached state measured in blocks.
Changes
--sync.initial-cycle-block-ttl/Sync.InitialCycleBlockTTLwith default1024.kv.DatabaseInfo.limitedBigJumpbehavior while allowing lifecycle initial-cycle to participate in FCU pipeline mode.