
[fix](be) Fix runtime filter crash with shared hash table #63257

Open
BiteTheDDDDt wants to merge 1 commit into apache:branch-4.0 from BiteTheDDDDt:fix-rf-shared-hash-crash-4.0

@BiteTheDDDDt BiteTheDDDDt commented May 14, 2026

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary: Fix BE SIGSEGV in branch-4.0 when hash join build sink publishes runtime filters during close. The crash point is RuntimeFilterWrapper::set_state(), and the direct cause is that a RuntimeFilterProducer can hold a null _wrapper in the shared hash table runtime filter path.

Root cause

Bug introduced by: #49556

#49556 refactored the broadcast/shared hash table controller and introduced the shared runtime filter wrapper handoff:

  • The builder task stores runtime filter wrappers in HashJoinBuildSinkOperatorX::_runtime_filters.
  • Non-builder tasks read wrappers from that same map when use_shared_table=true and _should_build_hash_table=false.
  • HashJoinBuildSinkLocalState::close() unconditionally sets _signaled = true for shared hash table, even if the builder task was terminated before it built the hash table and filled _runtime_filters.
  • The non-builder runtime filter path uses DCHECK(runtime_filters.contains(...)) and then runtime_filters[filter_id]. In release builds the DCHECK is disabled, so a missing map entry inserts a default null shared_ptr.
  • The producer then calls _wrapper->set_state(), causing the SIGSEGV in RuntimeFilterWrapper::set_state().
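The map behaviour behind the fourth bullet can be reproduced in isolation. The following is a minimal sketch, not the Doris code: `Wrapper`, `WrapperMap`, `lookup_with_subscript`, and `lookup_with_find` are hypothetical names, and `std::map` stands in for whatever container `_runtime_filters` actually uses. It shows that `operator[]` on a missing key inserts a default (null) `shared_ptr`, while `find()` leaves the map unchanged.

```cpp
#include <map>
#include <memory>

// Hypothetical stand-in for RuntimeFilterWrapper; only the role mirrors the PR.
struct Wrapper {
    int state = 0;
    void set_state(int s) { state = s; }
};

// Assumption: the shared wrappers are keyed by an integer filter id.
using WrapperMap = std::map<int, std::shared_ptr<Wrapper>>;

// Mimics the pre-fix lookup: operator[] value-initializes a missing entry,
// so the caller receives a null shared_ptr and the map grows by one element.
std::shared_ptr<Wrapper> lookup_with_subscript(WrapperMap& filters, int filter_id) {
    return filters[filter_id];
}

// Non-mutating alternative: find() never inserts, so a missing id stays missing.
std::shared_ptr<Wrapper> lookup_with_find(const WrapperMap& filters, int filter_id) {
    auto it = filters.find(filter_id);
    if (it == filters.end()) {
        return nullptr;
    }
    return it->second;
}
```

Calling `set_state()` through the pointer returned by `lookup_with_subscript` for an unknown id would dereference null, which matches the crash described above once the debug-only check is compiled out.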

I verified the branch-4.0 blame: both the unconditional _signaled = true and the runtime_filters[filter_id] shared-wrapper path come from #49556.

Fix

Fixed on master by: #62056

#62056 fixed this shared hash table race by only setting _signaled = true when the builder task was not terminated. If the builder is terminated early, non-builder tasks return EOF and do not enter the shared hash table/runtime filter path with uninitialized shared state.
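The gating rule can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not the actual operator code: `SharedBuildState`, `close_builder`, `NonBuilderAction`, and `on_non_builder_wakeup` are hypothetical names, and the real implementation coordinates tasks through pipeline dependencies that are omitted here.

```cpp
// Hypothetical shared state visible to the builder and non-builder tasks.
struct SharedBuildState {
    bool signaled = false;  // true only once the shared hash table is usable
};

// Builder-side close: signal non-builders only if the builder was not
// terminated before it built the shared hash table.
void close_builder(SharedBuildState& shared, bool terminated) {
    if (!terminated) {
        shared.signaled = true;
    }
}

enum class NonBuilderAction { UseSharedTable, ReturnEof };

// Non-builder side: without a signal, return EOF instead of entering the
// shared hash table / runtime filter path with uninitialized state.
NonBuilderAction on_non_builder_wakeup(const SharedBuildState& shared) {
    return shared.signaled ? NonBuilderAction::UseSharedTable
                           : NonBuilderAction::ReturnEof;
}
```

The design choice is to fix the problem at its source: non-builders never observe a "ready" signal for state that was never populated, so the downstream lookups are not reached in the early-termination case.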

This PR is a branch-4.0 pick/backport of the necessary #62056 logic.

It also picks the relevant defensive idea from #60563: replace the runtime filter DCHECK + operator[] assumption with explicit Status::InternalError checks, so a missing/null wrapper returns an error instead of inserting a null wrapper and crashing.
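A defensive lookup of this kind might look like the sketch below. It does not use Doris's `Status` type; `LookupResult`, `get_shared_wrapper`, and `register_wrapper` are hypothetical names introduced for illustration, and the error text only loosely follows the message quoted in the review. The registration helper reflects the duplicate-insertion validation mentioned in the review summary, with the exact behaviour assumed.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Wrapper {};  // hypothetical stand-in for RuntimeFilterWrapper
using WrapperMap = std::map<int, std::shared_ptr<Wrapper>>;

// Hypothetical result type playing the role of an error status plus a value.
struct LookupResult {
    std::shared_ptr<Wrapper> wrapper;
    std::string error;  // empty means success
    bool ok() const { return error.empty(); }
};

// Reports a missing or null entry as an error instead of inserting a null
// wrapper that a later caller would dereference.
LookupResult get_shared_wrapper(const WrapperMap& filters, int filter_id) {
    auto it = filters.find(filter_id);
    if (it == filters.end() || it->second == nullptr) {
        return {nullptr, "runtime_filters does not contain valid filter_id " +
                                 std::to_string(filter_id)};
    }
    return {it->second, ""};
}

// Builder-side registration: returns false for a null wrapper or a duplicate
// id, leaving any existing entry untouched.
bool register_wrapper(WrapperMap& filters, int filter_id,
                      std::shared_ptr<Wrapper> wrapper) {
    if (wrapper == nullptr) {
        return false;
    }
    return filters.emplace(filter_id, std::move(wrapper)).second;
}
```

Because both helpers report failure through their return value, the checks stay active in release builds, unlike a debug-only assertion.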

Related PR

Release note

Fixed a BE crash when publishing runtime filters for shared hash table hash joins.

Check List (For Author)

  • Test: Manual test
    • build-support/check-format.sh
    • git diff --check
  • Behavior changed: No
  • Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#49556, apache#61768, apache#62056, apache#60563

Problem Summary: Shared hash table broadcast joins can wake non-builder hash join sink tasks after the builder was terminated early. The non-builder runtime filter path then reads a missing shared runtime filter wrapper from the map and may install a null wrapper, causing RuntimeFilterWrapper::set_state to segfault while publishing runtime filters during close. Only signal non-builder tasks when the builder actually built the shared hash table, and convert shared runtime filter wrapper map assumptions into Status errors instead of DCHECK/operator[].

### Release note

Fixed a BE crash when publishing runtime filters for shared hash table hash joins.

### Check List (For Author)

- Test: Manual test
    - build-support/check-format.sh
    - git diff --check
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 14, 2026 11:36
@hello-stephen

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?


Copilot AI left a comment


Pull request overview

This PR hardens the BE shared-hash-table runtime-filter path for broadcast hash joins, preventing non-builder tasks from using uninitialized shared runtime-filter wrappers after the builder is terminated early.

Changes:

  • Avoid marking the shared hash table as signaled when the builder task was terminated.
  • Replace shared runtime-filter map DCHECK/operator[] assumptions with explicit Status::InternalError validation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| be/src/runtime_filter/runtime_filter_producer_helper.cpp | Adds validation for shared runtime-filter wrapper lookup and duplicate insertion. |
| be/src/pipeline/exec/hashjoin_build_sink.cpp | Prevents signaling non-builder tasks when the shared hash-table builder was terminated before initialization. |


Comment on lines +129 to +134

    auto it = runtime_filters.find(filter_id);
    if (it == runtime_filters.end() || it->second == nullptr) {
        return Status::InternalError(
                "runtime_filters does not contain valid filter_id {} when not building "
                "hash table",
                filter_id);

Comment on lines +240 to +245

    // Only signal non-builder tasks when the builder actually built the hash table.
    // When the builder is terminated early, process_build_block() has not initialized the
    // shared hash table or runtime filter wrappers, so non-builders must return EOF.
    if (!_terminated) {
        p._signaled = true;
    }

@BiteTheDDDDt

run buildall

@hello-stephen

BE UT Coverage Report

Increment line coverage 44.44% (12/27) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.16% (19368/36435) |
| Line Coverage | 36.27% (180724/498221) |
| Region Coverage | 32.89% (140342/426648) |
| Branch Coverage | 33.76% (60731/179904) |

@hello-stephen

BE Regression && UT Coverage Report

Increment line coverage 44.44% (12/27) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 71.36% (25452/35667) |
| Line Coverage | 54.17% (269408/497309) |
| Region Coverage | 51.74% (222951/430884) |
| Branch Coverage | 53.11% (95872/180519) |

@hello-stephen

BE Regression && UT Coverage Report

Increment line coverage 44.44% (12/27) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 71.38% (25458/35667) |
| Line Coverage | 54.18% (269447/497309) |
| Region Coverage | 51.75% (222985/430884) |
| Branch Coverage | 53.12% (95900/180519) |
