[fix](group commit) fix can not get a block queue#63722
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR addresses failures in group-commit stream load where the BE could not obtain a usable LoadBlockQueue under concurrency, by improving error handling around FE RPCs and adding resubmission/timeout logic for creating group-commit plan fragments. It also adds a new high-concurrency regression test intended to reproduce/guard the issue.
Changes:
- Improve RPC-status handling when aborting/committing group-commit transactions (avoid masking transport errors as FE result statuses).
- Add a background “create plan” worker + resubmission/timeout mechanism to reduce “can not get a block queue” occurrences during plan creation.
- Add a high-concurrency async group-commit stream load regression test.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/load_p0/stream_load/test_group_commit_stream_load_high_concurrency_async.groovy | New regression test stressing async group-commit under high concurrency. |
| be/src/load/group_commit/wal/wal_table.cpp | Return/LOG transport (RPC) errors distinctly from FE rollback result status. |
| be/src/load/group_commit/group_commit_mgr.h | Add new create-plan worker plumbing and make _need_commit atomic. |
| be/src/load/group_commit/group_commit_mgr.cpp | Implement create-plan resubmission worker, timeout handling, and improved commit/rollback RPC error handling. |
| be/src/common/config.h | Declare new config group_commit_create_plan_timeout_ms. |
| be/src/common/config.cpp | Define default for group_commit_create_plan_timeout_ms. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/review |
There was a problem hiding this comment.
I found one blocking issue in the new group-commit plan creation worker. The new background thread does not initialize Doris thread context before executing BE code, which violates the BE runtime memory-tracking convention. Existing Copilot comments already covered the timeout comment mismatch, the add-load-id error message, and the p0 test size concern, so I did not duplicate those.
Checklist conclusions:
- Goal/test: The PR targets high-concurrency async group commit plan creation and adds a regression case, but the new worker thread setup is incomplete.
- Scope: The implementation is focused on group commit queue creation/retry, though it introduces a new manager thread.
- Concurrency: The new worker/condition-variable path is relevant; lock ordering looks consistent, but the worker thread lacks required thread-context initialization.
- Lifecycle:
GroupCommitMgr::stop()joins the new worker before shutting down the pool; no additional lifecycle blocker found. - Config: Adds a mutable timeout config with default/comment now aligned.
- Compatibility: No protocol/storage-format incompatibility found.
- Parallel paths: WAL rollback RPC handling was updated consistently in the touched paths.
- Tests: A stress regression test was added; existing review already noted the p0 size/flakiness risk.
- Observability: Existing logs are mostly sufficient for retries/failures.
- Transaction/persistence/data correctness: No additional transaction visibility or WAL recovery issue found in the changed logic.
- Performance: No additional performance blocker beyond the existing high-concurrency p0 test concern.
- User focus: No additional user-provided review focus was specified.
9bbc402 to
47e1354
Compare
47e1354 to
250bbcb
Compare
|
run buildall |
TPC-H: Total hot run time: 31323 ms |
TPC-DS: Total hot run time: 172487 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
5c6c86a to
3e7b817
Compare
|
run buildall |
TPC-H: Total hot run time: 31954 ms |
TPC-DS: Total hot run time: 172648 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)