
log: replace SyncHandler with AsyncStreamHandler to fix goroutine contention#20148

Closed
lystopad wants to merge 9 commits into main from
feature/lystopad/fix-log-handler-mutex-contention

Conversation

@lystopad
Member

Problem

Under heavy validator load (thousands of Caplin validator goroutines logging
concurrently), all goroutines serialised on the sync.Mutex inside
SyncHandler. This caused complete process freezes on the Caplin beacon API
— a goroutine dump showed 8,295 of 9,875 goroutines blocked on the same mutex
address waiting for a log write.

Solution

Add AsyncHandler and AsyncStreamHandler to common/log/v3:

  • AsyncHandler(bufSize, h) — wraps any Handler with a buffered
    channel (asyncDefaultBufSize = 4096). Callers enqueue via
    select { case h.recs <- r: default: h.dropped.Add(1) } — they never
    block, even when the channel is full. A single drain goroutine writes
    records sequentially, eliminating the shared mutex. Dropped records are
    counted atomically and reported as a Warn entry after each batch.

  • AsyncStreamHandler(wr, fmtr) — convenience constructor:
    LazyHandler(AsyncHandler(asyncDefaultBufSize, streamHandler{wr, fmtr})).
    LazyHandler is outermost so Lazy values are evaluated in the caller's
    goroutine, not the drain goroutine.

Switch the predefined handlers and logging setup:

  • StdoutHandler / StderrHandler in root.go → AsyncStreamHandler
  • Both StreamHandler usages in node/logging/logging.go → AsyncStreamHandler

SyncHandler is preserved unchanged for tests and any synchronous pipelines
that require it explicitly.

Tests

  • TestAsyncHandler — basic record delivery via a waitHandler channel.
  • TestAsyncHandlerNeverBlocks — fills a bufSize=4 queue with a blocking
    handler; proves callers complete without blocking (2 s timeout).

Checklist

  • make lint passes
  • make erigon builds
  • New tests added

Co-Authored-By: Claude

…tention

Under heavy validator load (thousands of goroutines logging concurrently),
all goroutines serialised on the sync.Mutex inside SyncHandler, causing
complete process freezes on the Caplin beacon API.

Add AsyncHandler (non-blocking buffered channel + drain goroutine) and
AsyncStreamHandler (LazyHandler wrapping AsyncHandler wrapping streamHandler).
Callers enqueue with a select/default — they never block if the channel is
full; dropped records are counted and reported by the drain goroutine.

Switch StdoutHandler, StderrHandler (root.go) and both StreamHandler usages
in node/logging/logging.go to AsyncStreamHandler. SyncHandler is preserved
for tests and synchronous pipelines that need it explicitly.

Co-Authored-By: Claude
Contributor

Copilot AI left a comment


Pull request overview

Introduces a non-blocking async logging handler to eliminate mutex contention under high concurrent logging load (e.g., many validator goroutines), and switches key stream logging paths to use it.

Changes:

  • Added AsyncHandler and AsyncStreamHandler in common/log/v3 to enqueue log records without blocking callers and drain them via a single goroutine.
  • Updated predefined stdout/stderr handlers and node logging setup to use AsyncStreamHandler instead of StreamHandler/SyncHandler.
  • Added tests covering async delivery and the “never blocks when queue is full” guarantee.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
common/log/v3/handler.go Adds AsyncHandler / AsyncStreamHandler implementation and drop reporting.
common/log/v3/root.go Switches predefined StdoutHandler/StderrHandler to async stream handlers.
node/logging/logging.go Uses AsyncStreamHandler for console JSON logging and file logging stream handler.
common/log/v3/log_test.go Adds tests validating async delivery and non-blocking behavior under overload.


Comment thread common/log/v3/handler.go Outdated
Comment on lines +466 to +470
for m := range recs {
	_ = h.Log(m)
	if n := dropped.Swap(0); n > 0 {
		_ = h.Log(&Record{
			Time: time.Now(),

Copilot AI Apr 9, 2026


The drop-reporting behavior can become very noisy under sustained overload: whenever at least one record is dropped while processing a single record, this loop writes an additional Warn entry. At high log rates this can generate Warn spam and further slow the writer, increasing drop rates. Consider rate-limiting (e.g., emit at most once per N seconds) and aggregating the drop count over that interval.

Comment thread common/log/v3/handler.go Outdated
Comment on lines +454 to +455
// counter is incremented; the drain goroutine emits a Warn-level entry
// reporting the count after each batch so drops are always visible.

Copilot AI Apr 9, 2026


The docstring says drops are reported "after each batch" and "always visible", but the implementation logs the warning only after successfully draining a record. If the last events are drops and no subsequent record is drained (e.g., logging stops), the drop count may never be reported. Either adjust the wording to match the actual guarantee, or change the implementation to periodically emit drop counts even when the queue is saturated/idle.

Suggested change
// counter is incremented; the drain goroutine emits a Warn-level entry
// reporting the count after each batch so drops are always visible.
// counter is incremented. If the drain goroutine later processes another
// queued record, it emits a Warn-level entry reporting the accumulated
// drop count.

Comment thread common/log/v3/root.go
Comment on lines 18 to +20
root *logger
StdoutHandler = StreamHandler(os.Stdout, TerminalFormatNoColor()) //LogfmtFormat())
StderrHandler = StreamHandler(os.Stderr, TerminalFormatNoColor()) //LogfmtFormat())
StdoutHandler = AsyncStreamHandler(os.Stdout, TerminalFormatNoColor())
StderrHandler = AsyncStreamHandler(os.Stderr, TerminalFormatNoColor())

Copilot AI Apr 9, 2026


Switching the predefined StdoutHandler/StderrHandler to AsyncStreamHandler makes stdout/stderr logging asynchronous by default. This can drop the tail of logs on process shutdown (records still buffered and never drained) and there is currently no flush/stop hook to drain the queue. If shutdown-path log durability matters, consider providing an explicit drain/close mechanism or a documented synchronous option for final logs.

Co-authored-by: Giulio Rebuffo <giulio.rebuffo@gmail.com>
@erigon-copilot
Contributor

erigon-copilot Bot commented Apr 9, 2026

[SharovBot] I pushed a follow-up patch on top of this PR.

What I changed:

  • rate-limited async drop reporting so sustained overload does not emit a warning after every drained record;
  • aggregated dropped-record counts into periodic summaries instead;
  • fixed the AsyncHandler docs so they match the actual guarantee;
  • documented the stdout/stderr tradeoff explicitly: async logging avoids mutex-induced stalls, but shutdown can still lose buffered tail logs until the package grows an explicit flush/close path;
  • added a focused test covering the new rate-limited aggregated warning behavior.

Validation I ran locally:

  • go test ./common/log/v3 -count=1
  • go test ./node/logging -count=1

I have not added a flush/close API in this patch because that is a broader behavioral change and should be designed deliberately. For this PR, I think the implementation is now materially safer and the docs no longer overclaim.

@erigon-copilot erigon-copilot Bot marked this pull request as ready for review April 9, 2026 11:10
@erigon-copilot
Contributor

erigon-copilot Bot commented Apr 9, 2026

[SharovBot] Follow-up fix pushed for the red race-test shard.

Root cause:

  • the previous test made asyncDropReportInterval mutable at package scope;
  • parallel tests then raced with async drain goroutines reading that shared variable.

Fix:

  • restored the production interval to a constant;
  • introduced an internal helper that accepts the report interval explicitly for tests;
  • updated the rate-limit test to use the helper instead of mutating shared global state.

Validation:

  • go test ./common/log/v3 -race -count=1
  • go test ./node/logging -count=1

  • Drain loop efficiency: replaced time.Since(lastReport) (which calls time.Now() internally) with a pre-computed nextReport deadline — one time.Now() call per record instead of two. Also added a dropped.Load() guard before dropped.Swap(0) so the atomic write only happens when drops actually occurred.
@AskAlexSharov
Collaborator

Agree with the problem statement. The solution feels a bit wrong, though: a "logger which drops logs" can bite us.

Seems we didn't really need the current mutex: let me try an alternative approach and add some benches: #20454

Also, my opinion: Erigon must not write so many logs. If our RPC is writing tons of logs on each RPC request, then we must remove such logs. They are not useful for users or for us. (But the logger still must not cause mutex contention.)

@awskii
Member

awskii commented Apr 10, 2026

approved alternative #20454

@awskii awskii closed this Apr 10, 2026
github-merge-queue Bot pushed a commit that referenced this pull request Apr 10, 2026
an alternative for: #20148

Move the mutex from the log handler to the file writer. It seems we used a mutex in the handler to reduce allocs (via a shared buf, in the past), but we no longer use any shared buffer there.

`TestStreamHandlerNoContention` 5K goroutines: `5s -> 7ms`
