log: replace SyncHandler with AsyncStreamHandler to fix goroutine contention #20148
Conversation
Under heavy validator load (thousands of goroutines logging concurrently), all goroutines serialised on the sync.Mutex inside SyncHandler, causing complete process freezes on the Caplin beacon API. Add AsyncHandler (non-blocking buffered channel plus a drain goroutine) and AsyncStreamHandler (LazyHandler wrapping AsyncHandler wrapping streamHandler). Callers enqueue with a select/default, so they never block if the channel is full; dropped records are counted and reported by the drain goroutine. Switch StdoutHandler and StderrHandler (root.go) and both StreamHandler usages in node/logging/logging.go to AsyncStreamHandler. SyncHandler is preserved for tests and synchronous pipelines that need it explicitly. Co-Authored-By: Claude
Pull request overview
Introduces a non-blocking async logging handler to eliminate mutex contention under high concurrent logging load (e.g., many validator goroutines), and switches key stream logging paths to use it.
Changes:
- Added `AsyncHandler` and `AsyncStreamHandler` in `common/log/v3` to enqueue log records without blocking callers and drain them via a single goroutine.
- Updated predefined stdout/stderr handlers and node logging setup to use `AsyncStreamHandler` instead of `StreamHandler`/`SyncHandler`.
- Added tests covering async delivery and the "never blocks when queue is full" guarantee.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `common/log/v3/handler.go` | Adds `AsyncHandler` / `AsyncStreamHandler` implementation and drop reporting. |
| `common/log/v3/root.go` | Switches predefined `StdoutHandler`/`StderrHandler` to async stream handlers. |
| `node/logging/logging.go` | Uses `AsyncStreamHandler` for console JSON logging and the file logging stream handler. |
| `common/log/v3/log_test.go` | Adds tests validating async delivery and non-blocking behavior under overload. |
```go
for m := range recs {
	_ = h.Log(m)
	if n := dropped.Swap(0); n > 0 {
		_ = h.Log(&Record{
			Time: time.Now(),
```
The drop-reporting behavior can become very noisy under sustained overload: whenever at least one record is dropped while processing a single record, this loop writes an additional Warn entry. At high log rates this can generate Warn spam and further slow the writer, increasing drop rates. Consider rate-limiting (e.g., emit at most once per N seconds) and aggregating the drop count over that interval.
```go
// counter is incremented; the drain goroutine emits a Warn-level entry
// reporting the count after each batch so drops are always visible.
```
The docstring says drops are reported "after each batch" and "always visible", but the implementation logs the warning only after successfully draining a record. If the last events are drops and no subsequent record is drained (e.g., logging stops), the drop count may never be reported. Either adjust the wording to match the actual guarantee, or change the implementation to periodically emit drop counts even when the queue is saturated/idle.
```diff
-// counter is incremented; the drain goroutine emits a Warn-level entry
-// reporting the count after each batch so drops are always visible.
+// counter is incremented. If the drain goroutine later processes another
+// queued record, it emits a Warn-level entry reporting the accumulated
+// drop count.
```
```diff
 root *logger
-StdoutHandler = StreamHandler(os.Stdout, TerminalFormatNoColor()) //LogfmtFormat())
-StderrHandler = StreamHandler(os.Stderr, TerminalFormatNoColor()) //LogfmtFormat())
+StdoutHandler = AsyncStreamHandler(os.Stdout, TerminalFormatNoColor())
+StderrHandler = AsyncStreamHandler(os.Stderr, TerminalFormatNoColor())
```
Switching the predefined StdoutHandler/StderrHandler to AsyncStreamHandler makes stdout/stderr logging asynchronous by default. This can drop the tail of logs on process shutdown (records still buffered and never drained) and there is currently no flush/stop hook to drain the queue. If shutdown-path log durability matters, consider providing an explicit drain/close mechanism or a documented synchronous option for final logs.
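An explicit drain/close hook of the kind this comment asks for could be sketched as below. This is not the PR's implementation; `asyncSink`, `newAsyncSink`, and the string-record simplification are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// asyncSink is a minimal sketch of the flush/close mechanism the review
// asks for: Close stops accepting records, then waits for the drain
// goroutine to flush everything already queued.
type asyncSink struct {
	recs    chan string
	done    sync.WaitGroup
	closeMu sync.Mutex
	closed  bool
	out     []string // stands in for the wrapped handler's output
}

func newAsyncSink(buf int) *asyncSink {
	s := &asyncSink{recs: make(chan string, buf)}
	s.done.Add(1)
	go func() {
		defer s.done.Done()
		for m := range s.recs { // drains until the channel is closed
			s.out = append(s.out, m)
		}
	}()
	return s
}

// Log enqueues without blocking; records are dropped once closed or full.
func (s *asyncSink) Log(m string) {
	s.closeMu.Lock()
	defer s.closeMu.Unlock()
	if s.closed {
		return
	}
	select {
	case s.recs <- m:
	default:
	}
}

// Close flushes the tail of the queue before shutdown, so buffered
// records are not lost when the process exits.
func (s *asyncSink) Close() {
	s.closeMu.Lock()
	s.closed = true
	s.closeMu.Unlock()
	close(s.recs)
	s.done.Wait()
}

func main() {
	s := newAsyncSink(16)
	for i := 0; i < 5; i++ {
		s.Log(fmt.Sprintf("record %d", i))
	}
	s.Close()
	fmt.Println("flushed:", len(s.out)) // all queued records survive shutdown
}
```

Wiring such a `Close` into process shutdown would address the tail-loss concern without making the hot path synchronous.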
Co-authored-by: Giulio Rebuffo <giulio.rebuffo@gmail.com>
[SharovBot] I pushed a follow-up patch on top of this PR. What I changed:
Validation I ran locally:
I have not added a flush/close API in this patch because that is a broader behavioral change and should be designed deliberately. For this PR, I think the implementation is now materially safer and the docs no longer overclaim.
Co-authored-by: Giulio Rebuffo <giulio.rebuffo@gmail.com>
[SharovBot] Follow-up fix pushed for the red race-test shard. Root cause:
Fix:
Validation:
- Drain loop efficiency: Replaced time.Since(lastReport) (calls time.Now() internally) with a pre-computed nextReport deadline — one time.Now() call per record instead of two. Also added dropped.Load() guard before dropped.Swap(0) so the atomic write only happens when drops actually occurred.
Agree with the problem statement. The solution feels a bit wrong, because a "logger which drops logs" can bite us. Seems we didn't really need the current one. Also my opinion: Erigon must not write so many logs. If our RPC writes tons of logs on each RPC request, then we must remove such logs. They are not useful for users or for us. (But the logger still must not cause mutex contention.)
approved alternative #20454
An alternative for #20148: move the mutex from the log handler to the file writer. It seems we used a mutex in the handler to reduce allocs (via a shared buf in the past), but now we don't use any shared buffer there. `TestStreamHandlerNoContention` with 5K goroutines: `5s -> 7ms`.
Problem
Under heavy validator load (thousands of Caplin validator goroutines logging concurrently), all goroutines serialised on the `sync.Mutex` inside `SyncHandler`. This caused complete process freezes on the Caplin beacon API: a goroutine dump showed 8,295 of 9,875 goroutines blocked on the same mutex address waiting for a log write.
Solution
Add `AsyncHandler` and `AsyncStreamHandler` to `common/log/v3`:
- `AsyncHandler(bufSize, h)`: wraps any `Handler` with a buffered channel (`asyncDefaultBufSize = 4096`). Callers enqueue via `select { case h.recs <- r: default: h.dropped.Add(1) }`, so they never block even when the channel is full. A single drain goroutine writes records sequentially, eliminating the shared mutex. Dropped records are counted atomically and reported as a `Warn` entry after each batch.
- `AsyncStreamHandler(wr, fmtr)`: convenience constructor, `LazyHandler(AsyncHandler(asyncDefaultBufSize, streamHandler{wr, fmtr}))`. `LazyHandler` is outermost so `Lazy` values are evaluated in the caller's goroutine, not the drain goroutine.
Switch the predefined handlers and logging setup:
- `StdoutHandler` / `StderrHandler` in `root.go` → `AsyncStreamHandler`
- `StreamHandler` usages in `node/logging/logging.go` → `AsyncStreamHandler`
- `SyncHandler` is preserved unchanged for tests and any synchronous pipelines that require it explicitly.
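The enqueue side described above can be reduced to a few lines. This sketch follows the stated design (buffered channel, select/default enqueue, atomic drop counter) but is not the PR's exact implementation; the `record` and `asyncHandler` types here are simplified stand-ins.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type record struct{ msg string }

// asyncHandler models the enqueue path: a buffered channel of records
// plus an atomic counter for records dropped when the buffer is full.
type asyncHandler struct {
	recs    chan *record
	dropped atomic.Int64
}

// Log never blocks: if the channel is full, the record is dropped and counted.
func (h *asyncHandler) Log(r *record) {
	select {
	case h.recs <- r:
	default:
		h.dropped.Add(1)
	}
}

func main() {
	// Tiny buffer and no drain goroutine, to make the overflow behavior visible.
	h := &asyncHandler{recs: make(chan *record, 4)}
	for i := 0; i < 10; i++ {
		h.Log(&record{msg: fmt.Sprintf("msg %d", i)})
	}
	// With no consumer, the first 4 records fill the buffer and 6 are dropped.
	fmt.Println("queued:", len(h.recs), "dropped:", h.dropped.Load())
}
```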
Tests
- `TestAsyncHandler`: basic record delivery via a `waitHandler` channel.
- `TestAsyncHandlerNeverBlocks`: fills a `bufSize=4` queue with a blocking handler; proves callers complete without blocking (2 s timeout).
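The "never blocks" check can be illustrated with a stripped-down version of the same idea: enqueue far more records than the buffer holds while the consumer is stuck, and require the producer to finish well inside a timeout. This is a sketch of the test's shape, not the actual test code.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	recs := make(chan int, 4) // bufSize=4; nothing ever reads, simulating a blocked handler
	done := make(chan struct{})
	go func() {
		for i := 0; i < 1000; i++ {
			select {
			case recs <- i:
			default: // dropped instead of blocking
			}
		}
		close(done)
	}()
	select {
	case <-done:
		fmt.Println("producer finished without blocking")
	case <-time.After(2 * time.Second):
		fmt.Println("producer blocked (test would fail)")
	}
}
```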
Checklist
- `make lint` passes
- `make erigon` builds

Co-Authored-By: Claude