[Bug] BE SIGSEGV in bvar::SamplerCollector::run() under high EPS — race condition in brpc 1.4.0 AgentCombiner #63193

@vchag

Search before asking

  • I had searched in the issues and found no similar issues.

Version

Doris version: 4.0.4
brpc version: 1.4.0
Number of BEs: 20
Ingest rate: 400–500K EPS (cluster-wide)
Ingest Mode: group_commit sync_mode
Table type: DUPLICATE KEY

What's Wrong?

BE nodes crash with a segmentation fault (SIGSEGV) under sustained high-throughput ingestion. The crash occurs inside bvar::detail::SamplerCollector::run() and is caused by a race condition in brpc 1.4.0's AgentCombiner: if a worker thread exits while SamplerCollector is iterating the agent list, the sampler dereferences memory that has already been freed.

At high EPS, the 28 global bvar::Adder<int64_t> instances in metadata_adder.h are updated tens of thousands of times per second across many worker threads, making this race reliably reproducible. Any single BE exceeding ~15–20K EPS is at risk, and multiple BEs typically crash within 30 minutes.

Segmentation fault (core dumped)
0# doris::signal::(anonymous namespace)::FailureSignalHandler
     at be/src/common/signal_handler.h:420
1# PosixSignals::chained_handler
     in /usr/lib/jvm/java/lib/server/libjvm.so
2# JVM_handle_linux_signal
     in /usr/lib/jvm/java/lib/server/libjvm.so
3# 0x00007F9881299520
     in /lib/x86_64-linux-gnu/libc.so.6
4# bvar::Reducer<long, bvar::detail::AddTo<long>,
   bvar::detail::MinusFrom<long>>::SeriesSampler::take_sample()
     at thirdparty/installed/include/bvar/reducer.h:79
5# bvar::detail::SamplerCollector::run()
     in /opt/apache-doris/be/lib/doris_be
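
To make the interaction between exiting worker threads and the sampler easier to reason about, here is a minimal, self-contained sketch of the general pattern. It is not brpc's AgentCombiner: `Agent`, `ToyCombiner`, and `run_short_lived_workers` are hypothetical names, and the single mutex is an assumption chosen for clarity. It shows per-thread agents being attached and detached while a sampler thread iterates the agent list. If `detach()` did not hold the same lock as `combine()`, the sampler could be left reading an agent that had just been freed, which is the class of bug described above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical per-thread slot; each worker writes only to its own Agent.
struct Agent {
    std::atomic<int64_t> value{0};
};

// Toy combiner (NOT brpc's AgentCombiner): one mutex guards the agent list.
class ToyCombiner {
public:
    // Called when a worker starts. std::list keeps element addresses stable.
    Agent* attach() {
        std::lock_guard<std::mutex> lock(mu_);
        agents_.emplace_back();
        return &agents_.back();
    }

    // Called when a worker exits: fold its value into retired_, then free the slot.
    void detach(Agent* agent) {
        assert(agent != nullptr);
        std::lock_guard<std::mutex> lock(mu_);
        retired_ += agent->value.load(std::memory_order_relaxed);
        agents_.remove_if([agent](const Agent& a) { return &a == agent; });
    }

    // Called by the sampler. Holding mu_ means no agent is freed mid-iteration.
    int64_t combine() {
        std::lock_guard<std::mutex> lock(mu_);
        int64_t sum = retired_;
        for (const Agent& a : agents_) {
            sum += a.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::mutex mu_;
    std::list<Agent> agents_;
    int64_t retired_ = 0;
};

// Spawns short-lived workers while a sampler thread reads continuously,
// then returns the final combined total.
int64_t run_short_lived_workers(int threads, int adds_per_thread) {
    ToyCombiner combiner;
    std::atomic<bool> done{false};
    std::thread sampler([&] {
        while (!done.load()) {
            combiner.combine();
        }
    });

    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i) {
        workers.emplace_back([&] {
            Agent* agent = combiner.attach();
            for (int j = 0; j < adds_per_thread; ++j) {
                agent->value.fetch_add(1, std::memory_order_relaxed);
            }
            combiner.detach(agent);  // thread exit: the agent is freed here
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    done.store(true);
    sampler.join();
    return combiner.combine();
}
```

Calling `run_short_lived_workers(8, 1000)` exercises the attach/update/detach cycle while the sampler is running. The toy version trades performance for safety by taking a lock on every sample, which a real combiner would likely avoid.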

What You Expected?

BE nodes should remain stable under sustained high-throughput ingestion. No crashes or segmentation faults should occur regardless of EPS, as long as the hardware and configuration are within supported limits.

How to Reproduce?

  1. Deploy Doris 4.0.4 with 20 BEs using group_commit sync_mode on a DUPLICATE KEY table.
  2. Drive sustained ingestion at 400–500K EPS across the cluster, or ~15–20K EPS on a single BE.
  3. Observe BE SIGSEGV within approximately 30 minutes. The crash is in bvar::detail::SamplerCollector::run(), as shown in the stack trace above.

Anything Else?

This is a known upstream bug in brpc, already fixed via PR #2949 (merged April 17, 2025), which converts _combiner to a shared_ptr to eliminate the use-after-free.
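
As an illustration of why shared ownership removes this kind of use-after-free, the following is a small sketch of the general technique, not the code from the upstream patch. `SharedCombiner` and `commit_after_owner_release` are hypothetical names, and the worker thread stands in for any party that may still touch the combiner after its original owner has let go of it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <thread>

// Hypothetical stand-in for a combiner; NOT brpc's implementation.
class SharedCombiner {
public:
    void commit(int64_t v) { total_.fetch_add(v, std::memory_order_relaxed); }
    int64_t value() const { return total_.load(std::memory_order_relaxed); }

private:
    std::atomic<int64_t> total_{0};
};

// The owner drops its reference before the worker performs its final commit.
// Because the worker holds its own shared_ptr copy, the combiner stays alive
// until the last holder is done with it.
int64_t commit_after_owner_release() {
    auto owner = std::make_shared<SharedCombiner>();
    std::atomic<bool> owner_gone{false};
    int64_t seen = -1;

    std::thread worker([agent_ref = owner, &owner_gone, &seen] {
        // Wait until the owner has released its reference.
        while (!owner_gone.load()) {
            std::this_thread::yield();
        }
        agent_ref->commit(42);  // safe: agent_ref keeps the object alive
        seen = agent_ref->value();
    });

    owner.reset();  // with a raw pointer, this would be the point of destruction
    assert(owner == nullptr);
    owner_gone.store(true);

    worker.join();  // join synchronizes the write to `seen`
    return seen;
}
```

With a raw pointer in place of `agent_ref`, the commit in the worker would touch freed memory once the owner had destroyed the object. With a `shared_ptr`, destruction is deferred until the last reference is dropped.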

Doris 4.0.4 pins brpc at 1.4.0 in thirdparty/vars.sh:208-209, which predates this fix. We propose bumping the brpc pin to a version that includes PR #2949, and are willing to contribute the PR. Feedback on the preferred target brpc version is welcome.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
