
log: replace SyncHandler with AsyncStreamHandler to fix goroutine contention#20148

Closed
lystopad wants to merge 9 commits into main from
feature/lystopad/fix-log-handler-mutex-contention

Conversation

@lystopad
Member

Problem

Under heavy validator load (thousands of Caplin validator goroutines logging
concurrently), all goroutines serialised on the sync.Mutex inside
SyncHandler. This caused complete process freezes on the Caplin beacon API
— a goroutine dump showed 8,295 of 9,875 goroutines blocked on the same mutex
address waiting for a log write.

Solution

Add AsyncHandler and AsyncStreamHandler to common/log/v3:

  • AsyncHandler(bufSize, h) — wraps any Handler with a buffered
    channel (asyncDefaultBufSize = 4096). Callers enqueue via
    select { case h.recs <- r: default: h.dropped.Add(1) } — they never
    block, even when the channel is full. A single drain goroutine writes
    records sequentially, eliminating the shared mutex. Dropped records are
    counted atomically and reported as a Warn entry after each batch.

  • AsyncStreamHandler(wr, fmtr) — convenience constructor:
    LazyHandler(AsyncHandler(asyncDefaultBufSize, streamHandler{wr, fmtr})).
    LazyHandler is outermost so Lazy values are evaluated in the caller's
    goroutine, not the drain goroutine.

Switch the predefined handlers and logging setup:

  • StdoutHandler / StderrHandler in root.go → AsyncStreamHandler
  • Both StreamHandler usages in node/logging/logging.go → AsyncStreamHandler

SyncHandler is preserved unchanged for tests and any synchronous pipelines
that require it explicitly.

Tests

  • TestAsyncHandler — basic record delivery via a waitHandler channel.
  • TestAsyncHandlerNeverBlocks — fills a bufSize=4 queue with a blocking
    handler; proves callers complete without blocking (2 s timeout).

Checklist

  • make lint passes
  • make erigon builds
  • New tests added

Co-Authored-By: Claude

…tention

Under heavy validator load (thousands of goroutines logging concurrently),
all goroutines serialised on the sync.Mutex inside SyncHandler, causing
complete process freezes on the Caplin beacon API.

Add AsyncHandler (non-blocking buffered channel + drain goroutine) and
AsyncStreamHandler (LazyHandler wrapping AsyncHandler wrapping streamHandler).
Callers enqueue with a select/default — they never block if the channel is
full; dropped records are counted and reported by the drain goroutine.

Switch StdoutHandler, StderrHandler (root.go) and both StreamHandler usages
in node/logging/logging.go to AsyncStreamHandler. SyncHandler is preserved
for tests and synchronous pipelines that need it explicitly.

Co-Authored-By: Claude
Contributor

Copilot AI left a comment


Pull request overview

Introduces a non-blocking async logging handler to eliminate mutex contention under high concurrent logging load (e.g., many validator goroutines), and switches key stream logging paths to use it.

Changes:

  • Added AsyncHandler and AsyncStreamHandler in common/log/v3 to enqueue log records without blocking callers and drain them via a single goroutine.
  • Updated predefined stdout/stderr handlers and node logging setup to use AsyncStreamHandler instead of StreamHandler/SyncHandler.
  • Added tests covering async delivery and the “never blocks when queue is full” guarantee.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
common/log/v3/handler.go Adds AsyncHandler / AsyncStreamHandler implementation and drop reporting.
common/log/v3/root.go Switches predefined StdoutHandler/StderrHandler to async stream handlers.
node/logging/logging.go Uses AsyncStreamHandler for console JSON logging and file logging stream handler.
common/log/v3/log_test.go Adds tests validating async delivery and non-blocking behavior under overload.


Comment thread common/log/v3/handler.go Outdated
Comment on lines +466 to +470
for m := range recs {
	_ = h.Log(m)
	if n := dropped.Swap(0); n > 0 {
		_ = h.Log(&Record{
			Time: time.Now(),

Copilot AI Apr 9, 2026


The drop-reporting behavior can become very noisy under sustained overload: whenever at least one record is dropped while processing a single record, this loop writes an additional Warn entry. At high log rates this can generate Warn spam and further slow the writer, increasing drop rates. Consider rate-limiting (e.g., emit at most once per N seconds) and aggregating the drop count over that interval.

Comment thread common/log/v3/handler.go Outdated
Comment on lines +454 to +455
// counter is incremented; the drain goroutine emits a Warn-level entry
// reporting the count after each batch so drops are always visible.

Copilot AI Apr 9, 2026


The docstring says drops are reported "after each batch" and "always visible", but the implementation logs the warning only after successfully draining a record. If the last events are drops and no subsequent record is drained (e.g., logging stops), the drop count may never be reported. Either adjust the wording to match the actual guarantee, or change the implementation to periodically emit drop counts even when the queue is saturated/idle.

Suggested change
// counter is incremented; the drain goroutine emits a Warn-level entry
// reporting the count after each batch so drops are always visible.
// counter is incremented. If the drain goroutine later processes another
// queued record, it emits a Warn-level entry reporting the accumulated
// drop count.

Comment thread common/log/v3/root.go
Comment on lines 18 to +20
root *logger
StdoutHandler = StreamHandler(os.Stdout, TerminalFormatNoColor()) //LogfmtFormat())
StderrHandler = StreamHandler(os.Stderr, TerminalFormatNoColor()) //LogfmtFormat())
StdoutHandler = AsyncStreamHandler(os.Stdout, TerminalFormatNoColor())
StderrHandler = AsyncStreamHandler(os.Stderr, TerminalFormatNoColor())

Copilot AI Apr 9, 2026


Switching the predefined StdoutHandler/StderrHandler to AsyncStreamHandler makes stdout/stderr logging asynchronous by default. This can drop the tail of logs on process shutdown (records still buffered and never drained) and there is currently no flush/stop hook to drain the queue. If shutdown-path log durability matters, consider providing an explicit drain/close mechanism or a documented synchronous option for final logs.

Co-authored-by: Giulio Rebuffo <giulio.rebuffo@gmail.com>
@erigon-copilot
Contributor

erigon-copilot Bot commented Apr 9, 2026

[SharovBot] I pushed a follow-up patch on top of this PR.

What I changed:

  • rate-limited async drop reporting so sustained overload does not emit a warning after every drained record;
  • aggregated dropped-record counts into periodic summaries instead;
  • fixed the AsyncHandler docs so they match the actual guarantee;
  • documented the stdout/stderr tradeoff explicitly: async logging avoids mutex-induced stalls, but shutdown can still lose buffered tail logs until the package grows an explicit flush/close path;
  • added a focused test covering the new rate-limited aggregated warning behavior.

Validation I ran locally:

  • go test ./common/log/v3 -count=1
  • go test ./node/logging -count=1

I have not added a flush/close API in this patch because that is a broader behavioral change and should be designed deliberately. For this PR, I think the implementation is now materially safer and the docs no longer overclaim.

@erigon-copilot erigon-copilot Bot marked this pull request as ready for review April 9, 2026 11:10
@erigon-copilot
Contributor

erigon-copilot Bot commented Apr 9, 2026

[SharovBot] Follow-up fix pushed for the red race-test shard.

Root cause:

  • the previous test made asyncDropReportInterval mutable at package scope;
  • parallel tests then raced with async drain goroutines reading that shared variable.

Fix:

  • restored the production interval to a constant;
  • introduced an internal helper that accepts the report interval explicitly for tests;
  • updated the rate-limit test to use the helper instead of mutating shared global state.

Validation:

  • go test ./common/log/v3 -race -count=1
  • go test ./node/logging -count=1

  • Drain loop efficiency: replaced time.Since(lastReport) (which calls time.Now() internally) with a pre-computed nextReport deadline — one time.Now() call per record instead of two. Also added a dropped.Load() guard before dropped.Swap(0) so the atomic write only happens when drops actually occurred.
@AskAlexSharov
Collaborator

Agree with the problem statement. The solution feels a bit wrong, though: a "logger which drops logs" can bite us.

Seems we didn't really need the current mutex: let me try an alternative approach and add some benches: #20454

Also, my opinion: Erigon must not write so many logs. If our RPC is writing tons of logs on each RPC request, then we must remove such logs. They are not useful for users or for us. (But the logger still must not cause mutex contention.)

@awskii
Member

awskii commented Apr 10, 2026

approved alternative #20454

@awskii awskii closed this Apr 10, 2026
github-merge-queue Bot pushed a commit that referenced this pull request Apr 10, 2026
an alternative for: #20148

Move the mutex from the log handler to the file writer. It seems we used a mutex in the handler to reduce allocs (via a shared buf, in the past), but we no longer use any shared buffer there.

`TestStreamHandlerNoContention` 5K goroutines: `5s -> 7ms`
