Search before asking
Version
Doris version: 4.0.4
brpc version: 1.4.0
Number of BEs: 20
Ingest rate: 400–500K EPS (cluster-wide)
Ingest Mode: group_commit sync_mode
Table type: DUPLICATE KEY
What's Wrong?
BE nodes crash with a segmentation fault (SIGSEGV) under sustained high-throughput ingestion. The crash occurs inside bvar::detail::SamplerCollector::run() and is caused by a race condition in brpc 1.4.0's AgentCombiner: when a thread exits while SamplerCollector is iterating the agent list, the collector dereferences already-freed memory (a use-after-free).
At high EPS, the 28 global bvar::Adder<int64_t> instances in metadata_adder.h are updated tens of thousands of times per second across many worker threads, making this race reliably reproducible. Any single BE exceeding ~15–20K EPS is at risk, and multiple BEs typically crash within 30 minutes.
Segmentation fault (core dumped)
0# doris::signal::(anonymous namespace)::FailureSignalHandler at be/src/common/signal_handler.h:420
1# PosixSignals::chained_handler in /usr/lib/jvm/java/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java/lib/server/libjvm.so
3# 0x00007F9881299520 in /lib/x86_64-linux-gnu/libc.so.6
4# bvar::Reducer<long, bvar::detail::AddTo<long>, bvar::detail::MinusFrom<long>>::SeriesSampler::take_sample() at thirdparty/installed/include/bvar/reducer.h:79
5# bvar::detail::SamplerCollector::run() in /opt/apache-doris/be/lib/doris_be
What You Expected?
BE nodes should remain stable under sustained high-throughput ingestion. No crashes or segmentation faults should occur regardless of EPS, as long as the hardware and configuration are within supported limits.
How to Reproduce?
- Deploy Doris 4.0.4 with 20 BEs, using group_commit sync_mode on a DUPLICATE KEY table.
- Drive sustained ingestion at 400–500K EPS across the cluster, or ~15–20K EPS on a single BE.
- Observe a BE SIGSEGV within approximately 30 minutes. The crash is in bvar::detail::SamplerCollector::run(), as shown in the stack trace above.
Anything Else?
This is a known upstream bug in brpc, already fixed via PR #2949 (merged April 17, 2025), which converts _combiner to a shared_ptr to eliminate the use-after-free.
Doris 4.0.4 pins brpc at 1.4.0 in thirdparty/vars.sh:208-209, which predates this fix. We propose bumping the brpc pin to a version that includes PR #2949, and are willing to contribute the PR. Feedback on the preferred target brpc version is welcome.
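To illustrate the ownership idea behind the shared_ptr change described above, here is a minimal sketch. It is not brpc's code: Combiner, Agent, and run_lifetime_demo are hypothetical, simplified stand-ins, and the sketch assumes the essence of the fix is that each per-thread agent holds a shared reference to the combiner, so the combiner cannot be freed while an agent or a sampler can still reach it. A mutex additionally serializes the sampler's iteration against agent removal.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for illustration; NOT brpc's classes.
class Combiner {
public:
    void add_agent(std::atomic<int64_t>* slot) {
        std::lock_guard<std::mutex> guard(mu_);
        slots_.push_back(slot);
    }
    // Called when an agent (think: an exiting thread) goes away.
    void commit_and_erase(std::atomic<int64_t>* slot) {
        std::lock_guard<std::mutex> guard(mu_);
        committed_ += slot->load(std::memory_order_relaxed);
        slots_.erase(std::remove(slots_.begin(), slots_.end(), slot),
                     slots_.end());
    }
    // What a sampler would call: sums committed and live per-agent values.
    int64_t combine() {
        std::lock_guard<std::mutex> guard(mu_);
        int64_t sum = committed_;
        for (auto* slot : slots_) {
            sum += slot->load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::mutex mu_;
    int64_t committed_ = 0;
    std::vector<std::atomic<int64_t>*> slots_;
};

class Agent {
public:
    explicit Agent(std::shared_ptr<Combiner> combiner)
        : combiner_(std::move(combiner)) {
        combiner_->add_agent(&value_);
    }
    ~Agent() { combiner_->commit_and_erase(&value_); }
    Agent(const Agent&) = delete;
    Agent& operator=(const Agent&) = delete;

    void add(int64_t v) { value_.fetch_add(v, std::memory_order_relaxed); }

private:
    std::shared_ptr<Combiner> combiner_;  // shared ownership keeps it alive
    std::atomic<int64_t> value_{0};
};

// Drops the owner's handle while an agent is still registered, then
// returns the total a sampler would observe at that point.
int64_t run_lifetime_demo() {
    auto owner = std::make_shared<Combiner>();
    std::weak_ptr<Combiner> probe = owner;
    int64_t total = 0;
    {
        Agent agent(owner);
        agent.add(40);
        agent.add(2);
        owner.reset();             // the owning variable goes away first
        assert(!probe.expired());  // the agent's reference keeps it alive
        total = probe.lock()->combine();
    }  // ~Agent commits into a combiner that is still valid
    assert(probe.expired());       // freed only after the last agent is gone
    return total;
}
```

With a raw pointer in Agent instead of the shared_ptr, the destructor call in this scenario would touch freed memory, which is the class of bug the stack trace above points at.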
Are you willing to submit PR?
Code of Conduct