branch-4.1: [fix](be) Fix SIGSEGV in bvar::take_sample caused by AgentCombiner/TLS Agent lifetime race under high EPS #64040 #64932

Merged
yiguolei merged 1 commit into branch-4.1 from auto-pick-64040-branch-4.1 on Jun 30, 2026

@github-actions

Contributor

Cherry-picked from #64040

…S Agent lifetime race under high EPS (#64040)

### What problem does this PR solve?

Issue Number: close 63193

Related PR: [#2949](apache/brpc#2949)


Problem Summary:

Under high throughput, a race condition in brpc's bvar subsystem causes a SIGSEGV during take_sample. When a thread's TLS Agent destructs after the AgentCombiner it points to (owned by a Reducer, IntRecorder, or Percentile) has already been freed, the agent dereferences a dangling raw pointer in its destructor via combiner->commit_and_erase(this).
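
The destruction order behind the crash can be reproduced in a toy form. The sketch below is not brpc code: `ToyCombiner`, `ToyAgent`, and `run_lifetime_demo` are hypothetical names, and the agent destructor only logs instead of dereferencing the stale pointer, so the demo itself stays well-defined. It shows only that a thread-local object can outlive the object it points back to.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Global event log, guarded by a mutex (illustrative only).
std::mutex g_log_mu;
std::vector<std::string> g_events;

void log_event(const std::string& e) {
    std::lock_guard<std::mutex> lk(g_log_mu);
    g_events.push_back(e);
}

// Hypothetical stand-in for the combiner owned by a bvar variable.
struct ToyCombiner {
    ~ToyCombiner() { log_event("combiner destroyed"); }
};

// Hypothetical stand-in for the per-thread agent with a raw back-pointer.
struct ToyAgent {
    ToyCombiner* combiner = nullptr;
    ~ToyAgent() {
        // The pre-fix design would call into the combiner here.
        // This sketch only logs, because the pointer may already dangle.
        log_event("agent destroyed");
    }
};

std::vector<std::string> run_lifetime_demo() {
    {
        std::lock_guard<std::mutex> lk(g_log_mu);
        g_events.clear();
    }
    auto* combiner = new ToyCombiner;
    std::mutex mu;
    std::condition_variable cv;
    bool attached = false;
    bool release = false;

    std::thread worker([&] {
        thread_local ToyAgent agent;  // destroyed when this thread exits
        agent.combiner = combiner;
        {
            std::lock_guard<std::mutex> lk(mu);
            attached = true;
        }
        cv.notify_all();
        std::unique_lock<std::mutex> lk(mu);
        cv.wait(lk, [&] { return release; });
    });

    {
        std::unique_lock<std::mutex> lk(mu);
        cv.wait(lk, [&] { return attached; });
    }
    delete combiner;  // the owner goes away while the worker is still alive
    {
        std::lock_guard<std::mutex> lk(mu);
        release = true;
    }
    cv.notify_all();
    worker.join();  // thread-local destructors have run by the time join returns
    return g_events;
}
```

Calling `run_lifetime_demo()` records the combiner's destruction before the agent's, which is the ordering under which a raw back-pointer dereference in the agent destructor becomes a use-after-free.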


The fix (backport of apache/brpc#2949) replaces the raw back-pointer from Agent to AgentCombiner with a weak_ptr, and makes the owning classes hold the combiner via shared_ptr. The agent destructor now calls combiner.lock(); if the combiner is already destroyed, lock() returns null and the destructor safely no-ops, eliminating the use-after-free.
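
A minimal sketch of the ownership pattern described above, assuming simplified stand-in classes rather than the real brpc types: `SafeCombiner`, `SafeAgent`, `SafeReducer`, and the `commit(int)` method are hypothetical, and `commit` only loosely mirrors the role of commit_and_erase.

```cpp
#include <memory>
#include <mutex>

// Hypothetical combiner: accumulates values flushed by agents.
class SafeCombiner {
public:
    void commit(int pending) {  // stand-in for commit_and_erase
        std::lock_guard<std::mutex> lk(mu_);
        total_ += pending;
    }
    int total() const {
        std::lock_guard<std::mutex> lk(mu_);
        return total_;
    }
private:
    mutable std::mutex mu_;
    int total_ = 0;
};

// Hypothetical per-thread agent holding a weak back-reference.
class SafeAgent {
public:
    explicit SafeAgent(const std::shared_ptr<SafeCombiner>& c) : combiner_(c) {}
    ~SafeAgent() { flush(); }  // no-op when the combiner is already gone

    void add(int v) { pending_ += v; }

    // Returns true if the pending value reached a live combiner.
    bool flush() {
        if (auto c = combiner_.lock()) {  // keeps the combiner alive for this call
            c->commit(pending_);
            pending_ = 0;
            return true;
        }
        return false;
    }
private:
    std::weak_ptr<SafeCombiner> combiner_;
    int pending_ = 0;
};

// Hypothetical owning class: holds the combiner via shared_ptr.
class SafeReducer {
public:
    SafeReducer() : combiner_(std::make_shared<SafeCombiner>()) {}
    std::shared_ptr<SafeCombiner> combiner() const { return combiner_; }
    int value() const { return combiner_->total(); }
private:
    std::shared_ptr<SafeCombiner> combiner_;
};
```

In this sketch, an agent flushed while its `SafeReducer` is alive commits its value, while an agent that outlives the reducer sees `lock()` return null and skips the commit. The temporary `shared_ptr` returned by `lock()` also prevents the combiner from being freed in the middle of the call.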

github-actions bot requested a review from yiguolei as a code owner on June 29, 2026 04:20

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hello-stephen

Contributor

run buildall

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 54.54% (20800/38140) |
| Line Coverage | 38.07% (198481/521404) |
| Region Coverage | 34.54% (155845/451141) |
| Branch Coverage | 35.51% (67961/191384) |

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 73.79% (27492/37257) |
| Line Coverage | 57.38% (297723/518859) |
| Region Coverage | 55.03% (249948/454208) |
| Branch Coverage | 56.40% (108096/191667) |

@yiguolei

Contributor

skip buildall

@yiguolei merged commit 95948e5 into branch-4.1 on Jun 30, 2026
31 of 34 checks passed
3 participants