[Bug] BE SIGSEGV in bvar::SamplerCollector::run() under high EPS — race condition in brpc 1.4.0 AgentCombiner

### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.


### Version

- Doris version: 4.0.4
- brpc version: 1.4.0
- Number of BEs: 20
- Ingest rate: 400–500K EPS (cluster-wide)
- Ingest mode: `group_commit sync_mode`
- Table type: `DUPLICATE KEY`

### What's Wrong?

BE nodes crash with a segmentation fault (SIGSEGV) under sustained high-throughput ingestion. The crash occurs inside `bvar::SamplerCollector::run()` and is caused by a race condition in brpc 1.4.0's `AgentCombiner`: when a worker thread exits while `SamplerCollector` is iterating the agent list, the collector dereferences already-freed memory (a use-after-free).

At high EPS, the 28 global `bvar::Adder<int64_t>` instances in `metadata_adder.h` are updated tens of thousands of times per second across many worker threads, which makes the race reliably reproducible. Any single BE exceeding ~15–20K EPS is at risk, and multiple BEs typically crash within 30 minutes.

```
Segmentation fault (core dumped)
0# doris::signal::(anonymous namespace)::FailureSignalHandler
     at be/src/common/signal_handler.h:420
1# PosixSignals::chained_handler
     in /usr/lib/jvm/java/lib/server/libjvm.so
2# JVM_handle_linux_signal
     in /usr/lib/jvm/java/lib/server/libjvm.so
3# 0x00007F9881299520
     in /lib/x86_64-linux-gnu/libc.so.6
4# bvar::Reducer<long, bvar::detail::AddTo<long>,
   bvar::detail::MinusFrom<long>>::SeriesSampler::take_sample()
     at thirdparty/installed/include/bvar/reducer.h:79
5# bvar::detail::SamplerCollector::run()
     in /opt/apache-doris/be/lib/doris_be
```



### What You Expected?

BE nodes should remain stable under sustained high-throughput ingestion. No crashes or segmentation faults should occur regardless of EPS, as long as the hardware and configuration are within supported limits.

### How to Reproduce?

1. Deploy Doris 4.0.4 with 20 BEs using `group_commit sync_mode` on a `DUPLICATE KEY` table.
2. Drive sustained ingestion at 400–500K EPS across the cluster, or ~15–20K EPS on a single BE.
3. Observe BE SIGSEGV within approximately 30 minutes. The crash is in `bvar::SamplerCollector::run()` as shown in the stack trace above.
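
As a rough sanity check on the rates in step 2, the cluster-wide figure can be converted to a per-BE figure. The sketch below assumes ingestion is spread evenly across BEs, which may not hold in practice; `per_be_eps` and `in_risk_range` are hypothetical helper names, and the 15000 floor is simply the lower end of the ~15–20K EPS range quoted above.

```cpp
#include <cstdint>

// Average events per second each BE receives, assuming the cluster-wide
// rate is spread evenly across BEs (an assumption, not a measured value).
constexpr int64_t per_be_eps(int64_t cluster_eps, int64_t be_count) {
    return cluster_eps / be_count;
}

// Hypothetical helper: true when a per-BE rate reaches the lower bound of
// the ~15-20K EPS risk range quoted in this report.
constexpr bool in_risk_range(int64_t be_eps, int64_t risk_floor_eps = 15000) {
    return be_eps >= risk_floor_eps;
}
```

With 20 BEs, 400–500K EPS works out to 20–25K EPS per BE under that assumption, so every BE in this deployment sits in or above the range where the crash has been observed.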

### Anything Else?

This is a known upstream bug in brpc, already fixed via [PR #2949](https://github.com/apache/brpc/pull/2949) (merged April 17, 2025), which converts `_combiner` to a `shared_ptr` to eliminate the use-after-free.

> Doris 4.0.4 pins brpc at 1.4.0 in `thirdparty/vars.sh:208-209`, which predates this fix. We propose bumping the brpc pin to a version that includes PR #2949, and are willing to contribute the PR. Feedback on the preferred target brpc version is welcome.
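
To illustrate why shared ownership removes this class of use-after-free, here is a minimal standalone sketch. It is not brpc's code: `Combiner`, `Agent`, and `run_workers` are hypothetical stand-ins, and the real `AgentCombiner` is considerably more involved. The only point shown is that when each per-thread agent holds a `std::shared_ptr` to the combiner, the combiner stays alive until the last agent has committed at thread exit, even if the original owner has already dropped its reference.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a combiner that merges per-thread values.
class Combiner {
public:
    void commit(int64_t v) {
        std::lock_guard<std::mutex> lock(mu_);
        total_ += v;
    }
    int64_t total() const {
        std::lock_guard<std::mutex> lock(mu_);
        return total_;
    }

private:
    mutable std::mutex mu_;
    int64_t total_ = 0;
};

// Hypothetical per-thread agent. Holding a shared_ptr instead of a raw
// pointer guarantees the combiner outlives the agent's destructor.
class Agent {
public:
    explicit Agent(std::shared_ptr<Combiner> c) : combiner_(std::move(c)) {}
    ~Agent() { combiner_->commit(local_); }  // runs when the worker finishes
    void add(int64_t v) { local_ += v; }

private:
    std::shared_ptr<Combiner> combiner_;
    int64_t local_ = 0;
};

// Starts `threads` short-lived workers that each add `per_thread` ones,
// drops the owner's reference while they may still be running, and returns
// the combined total seen through the sampling side's reference.
int64_t run_workers(int threads, int per_thread) {
    auto owner = std::make_shared<Combiner>();
    std::shared_ptr<Combiner> sampler = owner;  // reference held by the sampler
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([c = owner, per_thread] {
            Agent agent(c);
            for (int j = 0; j < per_thread; ++j) agent.add(1);
        });  // Agent's destructor commits when the lambda returns
    }
    owner.reset();  // the owning variable goes away; agents keep the object alive
    for (auto& t : pool) t.join();
    assert(sampler != nullptr);
    return sampler->total();
}
```

Calling `run_workers(4, 1000)` returns 4000 regardless of how the thread exits interleave with `owner.reset()`. With a raw pointer in `Agent` and a uniquely owned `Combiner`, the same interleaving could leave an agent committing into freed memory.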

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

