Skip to content

execution, cl: persist initial-cycle lifecycle by block TTL#20895

Open
JkLondon wants to merge 35 commits into
mainfrom
jklondon/prune-features/otterprunefixes
Open

execution, cl: persist initial-cycle lifecycle by block TTL#20895
JkLondon wants to merge 35 commits into
mainfrom
jklondon/prune-features/otterprunefixes

Conversation

@JkLondon
Copy link
Copy Markdown
Member

@JkLondon JkLondon commented Apr 28, 2026

closes #20854

Summary

Fix IsInitialCycle startup semantics so a restarted synced node no longer looks like a fresh initial sync just because the process restarted.

This replaces the process-local startup flag and duration heuristic with persisted tip-reached state measured in blocks.

Changes

  • Add --sync.initial-cycle-block-ttl / Sync.InitialCycleBlockTTL with default 1024.
  • Persist the last block where the node reached tip in kv.DatabaseInfo.
  • Add rawdb accessors for reading, writing, and deleting the tip-reached marker.
  • Add stagedsync lifecycle helpers that compute initial-cycle from:
    • persisted tip-reached marker
    • headers progress
    • frozen blocks
    • optional known-tip hints
    • Finish progress
    • block TTL
  • Use the DB-backed lifecycle decision in staged-loop startup, staged-loop iterations, and execmodule startup processing.
  • Update the tip-reached marker after successful sync/prune when Finish has caught known tip.
  • Clear the marker from reset paths.
  • Set header downloader initial mode from the same lifecycle decision.
  • Propagate embedded Caplin known-tip hints into execmodule startup processing.
  • Fetch a best-effort CL head execution block hint from checkpoint sync endpoints so the first startup cycle sees the actual head, not only finalized checkpoint state.
  • Preserve FCU request-local limitedBigJump behavior while allowing lifecycle initial-cycle to participate in FCU pipeline mode.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Persists “initial sync cycle” lifecycle state across restarts by writing a “tip reached” marker to the DB and deriving initial-cycle status from progress deltas over a block-based TTL, rather than a process-local startup heuristic.

Changes:

  • Add Sync.InitialCycleBlockTTL / --sync.initial-cycle-block-ttl (default 1024) and snapshot it in config tests.
  • Persist and manage a “last tip reached block” marker in kv.DatabaseInfo, including reset-path cleanup and lifecycle helpers.
  • Propagate known-tip hints (including Caplin/checkpoint-sync derived hints) into staged sync loop and execmodule FCU handling to compute initial-cycle consistently.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
node/ethconfig/config.go Adds InitialCycleBlockTTL to sync config defaults/struct.
node/cli/flags.go Introduces CLI flag and wires it into config application.
node/cli/default_flags.go Registers the new sync lifecycle flag in default flag set.
execution/stagedsync/stageloop/stageloop.go Switches stageloop to DB-derived initial-cycle decision and updates tip marker during sync/prune.
execution/stagedsync/stage_snapshots.go Adjusts snapshot prune worker behavior based on CurrentSyncCycle.IsInitialCycle (but includes a stray println).
execution/stagedsync/stage_execute.go Uses CurrentSyncCycle.IsInitialCycle in pruning behavior (but includes a stray println + stack).
execution/stagedsync/rawdbreset/reset_stages.go Clears the persisted tip-reached marker on reset paths.
execution/stagedsync/initial_cycle_test.go Adds tests for initial-cycle derivation and tip marker CRUD behavior.
execution/stagedsync/initial_cycle.go Adds lifecycle helpers for computing initial-cycle and updating the tip marker.
execution/stagedsync/headerdownload/header_algos.go Adds SetInitialCycle(bool) and routes AfterInitialCycle() through it.
execution/execmodule/forkchoice.go Makes FCU pipeline initial-cycle incorporate lifecycle state + limited-big-jump and updates tip marker post-run.
execution/execmodule/executor.go Adds known-tip hint plumbing and per-iteration initial-cycle decision + post-prune hook to update marker.
execution/execmodule/exec_module.go Exposes SetKnownTipHint on ExecModule for upstream hint propagation.
execution/execmodule/chainreader/chain_reader.go Forwards known-tip hints to the execution module when supported.
db/rawdb/accessors_chain.go Adds rawdb accessors for reading/writing/deleting persisted tip-reached marker.
db/kv/kv_interface.go Extracts Deleter interface and embeds it into Putter.
cmd/erigon/node/config_snapshot_test.go Extends config snapshot to include initial-cycle TTL and adds flag/default tests.
cmd/caplin/caplin1/run.go Propagates Caplin known-tip hints (local + checkpoint-sync head hint) into execution engine at startup.
cl/phase1/execution_client/execution_client_direct.go Implements SetKnownTipHint by forwarding to chainRW.
cl/phase1/core/checkpoint_sync/util.go Adds helper to fetch best-effort head execution block number from checkpoint endpoints.
cl/phase1/core/checkpoint_sync/remote_checkpoint_sync.go Implements endpoint translation + JSON parsing to fetch head execution block number.
cl/phase1/core/checkpoint_sync/checkpoint_sync_test.go Adds tests for head execution block number fetch and URI translation.
.gitignore Ignores local Go cache/artifact directories.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread execution/stagedsync/stage_snapshots.go Outdated
Comment thread execution/stagedsync/stage_execute.go Outdated
Copy link
Copy Markdown
Member

@yperbasis yperbasis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness

fetchHeadExecutionBlockNumber: deferred Close error is silently dropped (cl/phase1/core/checkpoint_sync/remote_checkpoint_sync.go:111)

resp, err := http.DefaultClient.Do(req)
if err != nil { return 0, err }
defer func() {
err = resp.Body.Close()
}()

err here is the local variable from :=, not a named return. The deferred assignment runs after return 0, … has captured the return value, so the close error is unobservable. Either
ignore it explicitly (_ = resp.Body.Close()) or use named returns and merge:

func (r *RemoteCheckpointSync) fetchHeadExecutionBlockNumber(...) (n uint64, err error) {
...
defer func() {
if cerr := resp.Body.Close(); err == nil { err = cerr }
}()

(Note: GetLatestBeaconState above has the same pattern. The PR is copying a pre-existing bug — but it's worth fixing here since it's a brand-new function, and ideally fixing both.)

PipelineExecutor.SetKnownTipHint is not actually atomic-monotonic (execution/execmodule/executor.go)

if cur := pe.knownTipHint.Load(); blockNum > cur {
pe.knownTipHint.Store(blockNum)
}

This is a classic check-then-store TOCTOU. The comment claims "The value is monotonic (max wins)" but with concurrent writers a higher value can be overwritten by a lower one. Today
the only caller is Caplin startup (single-shot), but the contract advertised in the doc-comment isn't met. Fix with a CAS loop:

for {
cur := pe.knownTipHint.Load()
if blockNum <= cur { return }
if pe.knownTipHint.CompareAndSwap(cur, blockNum) { return }
}

Startup delay: per-URI timeout × N URIs

ReadRemoteHeadExecutionBlockNumber is called synchronously from RunCaplinService (cmd/caplin/caplin1/run.go:218–232). GetHeadExecutionBlockNumber iterates all checkpoint URIs and
applies r.timeout (500ms) to each. Mainnet has 5 default URIs → up to 2.5s of added startup latency if every endpoint is slow/dead. Two suggestions:

  • Cap with an overall context deadline (context.WithTimeout(ctx, HeadExecutionBlockHintTimeout) once before the loop), or
  • Race the URIs with errgroup and return on first success.

This is a "best-effort hint" — failing fast matters more than being thorough.

executeBlock's isInitialCycle parameter becomes dead (execution/stagedsync/exec3_serial.go:324)

After replacing the only use site at line 421 with se.accumulator != nil, the isInitialCycle parameter is no longer read. unparam will likely flag this. Either remove the parameter
(and the corresponding arg at line 174) or document why it's retained.

Semantic change: !IsInitialCycle → accumulator != nil for RecentReceipts

Both exec3_serial.go:421 and exec3_parallel.go:295 switch the gate on RecentReceipts.Add from "not in initial cycle" to "accumulator is set". This is justifiable (no subscribers ⇒ no
point populating the cache), but it's a behavior change, not a refactor:

  • During initial cycle, the parallel/serial executors traditionally skipped RecentReceipts work entirely. Now they do it whenever an accumulator is present.
  • Conversely, if there's ever a path where accumulator == nil while IsInitialCycle == false, RecentReceipts will no longer be populated where it used to be.

Worth verifying both cases empirically, and ideally the PR description should call this out — right now it's hidden inside a "lifecycle" PR.

Stale tip-reached marker on generic Reset

DeleteTipReached is wired into ResetState, ResetBlocks, and ResetExec, but not into Reset(ctx, db, stagesList...) (reset_stages.go:193). If someone calls Reset(ctx, db,
stages.Finish), Finish progress goes to 0 (which makes IsInitialCycle correctly return true regardless), but the stored marker is left at its old high-water value.
UpdateTipReachedFromProgress then refuses to update until finish climbs above the old marker, so the marker stays stale through the entire re-sync. Cosmetic — not load-bearing for
correctness — but worth either deleting in Reset too or commenting on why we don't bother.


Style / Smaller issues

.gitignore (/.gitignore)

  • Adds gocache, .gomodcache, _artifacts — these look like personal dev/CI artifacts unrelated to the lifecycle change. Either move to a separate cleanup PR or justify why they belong
    here.
  • Last line lacks a trailing newline (\ No newline at end of file in the diff). Add one.

stage_snapshots.go

  • Spurious blank line added at line 432; no other change in this file. Drop it.

Tests in checkpoint_sync_test.go

  • TestRemoteCheckpointSyncHeadExecutionBlockNumber mutates package global clparams.ConfigurableCheckpointsURLs without restoring it (no t.Cleanup). This will cross-contaminate any
    test that runs after it in the same package. Save and restore.
  • Same test uses positional struct initialization: &RemoteCheckpointSync{&clparams.MainnetBeaconConfig, chainspec.MainnetChainID, time.Second}. The factory function
    NewRemoteCheckpointSync exists and ReadRemoteHeadExecutionBlockNumber uses named init — match that style for consistency and so a field reorder doesn't silently break the test.

Two KnownTipHintReceiver interfaces with the same name

Defined in both cl/phase1/execution_client/interface.go and execution/execmodule/interface.go. Each is appropriate at its own seam (Caplin → ChainReader → ExecModule), but the
identical name across packages will confuse gopls "go to definition" navigation. Consider renaming one to disambiguate (e.g. ExecKnownTipHintReceiver on the inner one).

headBlockURI URL manipulation

String-finding /eth/v2/ in u.Path is fragile (custom proxy paths can vary). It's a best-effort hint endpoint with graceful degradation, so this is acceptable, but a brief comment
explaining the assumption ("checkpoint URLs are expected to follow …/eth/v2/debug/beacon/states/finalized; we strip everything after /eth/v2/ and substitute the head-block path")
would help future readers.

Default InitialCycleBlockTTL = 1024

At ~12s/slot that's roughly 3.4 hours on mainnet. A node that's been off overnight will re-enter initial cycle. Probably the intended trade-off (better safe than sorry on a long
downtime), but a one-line Why: 1024 ≈ ~3 hours, balancing "don't reuse stale fast-pipeline mode" vs. "don't punish brief restarts" in the flag's Usage string would make the choice
legible to operators.


Test coverage

Good for the pure helpers (IsInitialCycleFromProgress, UpdateTipReachedFromProgress, headBlockURI, the new HTTP path). Gaps:

  • No integration test exercising the full restart-on-synced-node story end-to-end (write marker → restart → confirm prune uses non-initial-cycle flag). Hard to write but high-value
    given how subtle the semantics are.
  • No test for the Reset* paths confirming the marker is cleared.
  • No race test for SetKnownTipHint (which, given the issue above, would currently fail under -race if exercised concurrently).

Summary

The persistence approach is the right fix for the underlying bug, and the call-site comments do an honest job of explaining the dual InitialCycle semantics. Before merge I'd want at
minimum:

  1. Fix the defer ... = resp.Body.Close() no-op.
  2. Either fix SetKnownTipHint to be CAS-based or weaken its doc-comment.
  3. Bound the checkpoint-hint fetch with a single overall timeout.
  4. Drop the dead isInitialCycle parameter (or document why it stays).
  5. Restore globals in tests; revert the unrelated .gitignore / whitespace churn or split it out.

The RecentReceipts gate change deserves a separate sentence in the PR description so reviewers don't miss it.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 260 to +263
if execctx.IsDomainAheadOfBlocks(ctx, tx, pe.logger) {
if err := stagedsync.UpdateTipReached(tx, pe.blockReader.FrozenBlocks()); err != nil {
return err
}
Comment thread cmd/caplin/caplin1/run.go
Comment on lines +226 to +230
if headExecutionBlockNumber, ok, err := checkpoint_sync.ReadRemoteHeadExecutionBlockNumber(ctx, beaconConfig, config); err != nil {
log.Debug("[Checkpoint Sync] Could not fetch head execution block hint", "err", err)
} else if ok {
knownTipHint = max(knownTipHint, headExecutionBlockNumber)
}
Comment on lines +839 to +843
pruneInitialCycle, err := stagedsync.IsInitialCycle(tx, e.syncCfg, e.blockReader.FrozenBlocks(), fcuBlockNum)
if err != nil {
return err
}
if err := e.pipelineExecutor.RunPrune(e.bacgroundCtx, tx, pruneInitialCycle, pruneTimeout); err != nil {
Comment thread node/cli/flags.go

SyncInitialCycleBlockTTLFlag = cli.Uint64Flag{
Name: "sync.initial-cycle-block-ttl",
Usage: "Sets the block window for treating a recently synced node as past the initial sync cycle",
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix prune t/o >2s after restart on ottersync and exec stages

3 participants