[fix](arrow-flight) Harden split source error path to avoid BE crash on external table scan#64797
Conversation
…on external table scan When an Arrow Flight SQL query scans an external table in batch split mode, the BE fetches splits from the FE via `fetchSplitBatch`. If the fetch fails (e.g. the split source has already been released), the error path could crash the BE instead of failing the query gracefully. This hardens the error path on both sides: - BE `split_source_connector`: guard `error_msgs[0]` with an `empty()` check to avoid an out-of-bounds vector read when the FE returns a non-OK status without an error message. - BE `to_arrow_status`: truncate the error message handed to Arrow/gRPC to a length well below 8192. The message is carried in the gRPC trailer (an HTTP2 header) and may be percent-encoded; an oversized one can break the response or crash the flight transport status conversion. The 8192 limit was already documented in a comment but never enforced. - FE `fetchSplitBatch`: when the split source is released, return a structured `TStatus(NOT_FOUND)` with a message instead of throwing a bare `TException`, so the BE receives a well-formed error through the normal result path. This is the robustness layer for apache#62259: it makes the query fail gracefully rather than crashing the BE. The underlying split source lifecycle issue (the source being released before `DoGet`) is tracked separately in the issue. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 29237 ms |
TPC-DS: Total hot run time: 172364 ms |
ClickBench: Total hot run time: 25.32 s |
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
/review |
There was a problem hiding this comment.
I found one blocking issue in the Arrow Flight hardening path.
Checkpoint conclusions:
- Goal/test proof: the PR improves split-source failure handling, but the oversized Arrow/gRPC error path is not fully fixed; no deterministic test was added for expansion-heavy long messages.
- Scope/focus: the changes are small and targeted to FE split batch status handling plus BE error conversion.
- Concurrency/lifecycle: split sources are weak-reference managed and can disappear before BE fetch, so the new FE null-source status path is reachable; no new lock-order issue found.
- Configuration/compatibility: no new config or protocol field was added; the structured NOT_FOUND status is compatible with the existing BE status check.
- Parallel paths: scan failures propagate through the result buffer and Arrow Flight readers to
to_arrow_status(), so the truncation behavior affects the targeted path. - Tests/style:
git diff --checkis clean. I did not run FE/BE tests because this runner is missingthirdparty/installed/bin/protocandthirdparty/installed/bin/thrift.
Subagent conclusions: tests-session-config proposed TSC-1, which was independently verified and submitted as M-1. optimizer-rewrite reported no optimizer/rewrite findings. Final convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same current ledger/comment set.
User focus: no additional user-provided review focus was supplied.
| // percent-encoded (which can expand its size), so an oversized message can break the | ||
| // response or even crash the flight transport status conversion. Truncate it well below | ||
| // 8192 to leave headroom; the full message is already logged above. | ||
| constexpr size_t kMaxArrowStatusMsgLen = 4096; |
There was a problem hiding this comment.
The cap still leaves the Arrow Flight error path exposed to the same oversized-header failure. This limits the raw status.to_string_no_stack() to 4096 bytes, but the comment above says this string is carried in a gRPC trailer and may be percent-encoded before hitting the 8192-byte header limit. If the message contains mostly escapable bytes, 4096 raw bytes plus the suffix can expand to well over 8KB on the wire, so an external error string can still reproduce the Header size exceeded max allowed size path this PR is trying to harden. Please either make this truncation encode-aware or choose a raw limit that stays below the worst-case encoded size, and add a focused test for an expansion-heavy long status message.
|
run buildall |
TPC-H: Total hot run time: 29148 ms |
TPC-DS: Total hot run time: 172227 ms |
ClickBench: Total hot run time: 25.17 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
Automated review completed.
I found one remaining issue, submitted inline: the released split-source RPC contract now returns a structured NOT_FOUND result that the BE depends on, but there is no focused test covering that no-throw/non-empty-message branch.
Critical checkpoint conclusions:
- Goal/current behavior: the PR hardens external scan error propagation by returning structured FE status, guarding empty BE error messages, and truncating Arrow status text; the implementation direction is correct, but the new FE RPC contract is not test-proven.
- Scope: the patch is small and limited to the split-source fetch path and Arrow status conversion.
- Concurrency/lifecycle: parallel scanners share RemoteSplitSourceConnector under _range_lock; SplitSourceManager weak references and FileQueryScanNode.stop() make the released-source branch reachable. I found no new lock or lifecycle bug.
- Config/protocol/compatibility: no new config or thrift fields are added; the existing optional TStatus path is used. BE handles non-OK status and empty error message fallback.
- Parallel paths: remote split-source scanning and Arrow Flight local/remote reader status propagation were traced; no additional parallel-path issue survived review.
- Tests: no FE/BE/regression test covers the new released split-source status contract, which is the inline finding. The Arrow Flight expansion-heavy truncation test concern is already covered by existing thread discussion_r3472560365 and was not duplicated.
- Observability/performance: the new warning log and full-message logging before truncation are sufficient for this path; no material performance issue found.
- Validation: exact-range git diff --check passed. FE/BE tests were not run because thirdparty/installed, protoc, and thrift are missing in this runner.
- User focus: no additional user-provided focus points were present.
Subagent conclusions: optimizer-rewrite reported no new valuable findings. tests-session-config proposed TSC-1, which was accepted as M-1 and submitted inline. D-1 was suppressed as a duplicate of existing Arrow Flight thread discussion_r3472560365. Convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same current ledger/comment set.
| // on the Arrow Flight path, may feed an empty/invalid error string into the gRPC | ||
| // status conversion. See https://github.com/apache/doris/issues/62259 | ||
| LOG.warn("split source {} is released", request.getSplitSourceId()); | ||
| result.status = new TStatus(TStatusCode.NOT_FOUND); |
There was a problem hiding this comment.
This changes the released split-source contract from a thrown TException to a structured TFetchSplitBatchResult, but the new branch is not covered by a focused test. That contract is what the BE now relies on in RemoteSplitSourceConnector::get_next() to produce a normal Status with a non-empty message instead of going through the transport-exception path. A small FrontendServiceImplTest case can call fetchSplitBatch with an unregistered split source id and assert that it does not throw, returns TStatusCode.NOT_FOUND, and carries a non-empty error_msgs entry. Without that, this regression path can slip back to a bare exception or empty message and re-open the Arrow Flight failure mode this PR is hardening.
|
PR approved by at least one committer and no changes requested. |
…on external table scan (#64797) ### What problem does this PR solve? Issue Number: ref #62259 (partial — robustness layer only, does not close the issue) Related PR: N/A Problem Summary: When an Arrow Flight SQL query scans an external table (e.g. Iceberg) in batch split mode, the BE fetches file splits from the FE via the `fetchSplitBatch` thrift RPC *during* the scan. If that fetch fails — most notably when the split source has already been released — the error path could crash the BE (SIGSEGV in `arrow::flight::internal::TransportStatus::FromStatus`) instead of failing the query gracefully (see #62259). This PR is the **robustness layer** for that issue: it ensures any `fetchSplitBatch` failure makes the query fail gracefully rather than crashing the BE. It does **not** fix the underlying split source lifecycle problem (the source being released after `GetFlightInfo` but before `DoGet` on the Arrow Flight two-phase path), which is tracked separately in the issue. Changes: 1. **BE `split_source_connector`** — guard `result.status.error_msgs[0]` with an `empty()` check to avoid an out-of-bounds vector read when the FE returns a non-OK status without an error message. 2. **BE `to_arrow_status`** — truncate the error message handed to Arrow/gRPC to a length well below 8192. The message is carried in the gRPC trailer (an HTTP2 header) and may be percent-encoded, so an oversized one can break the response or crash the flight transport status conversion. The 8192 limit was already documented in a comment in this function but was never enforced. The full message is still logged on the BE. 3. **FE `fetchSplitBatch`** — when the split source has been released, return a structured `TStatus(NOT_FOUND)` with a message instead of throwing a bare `TException`. The BE then receives a well-formed, non-empty error through the normal result path instead of a thrift transport exception. ### Release note Fix a BE crash (SIGSEGV) that could happen on the error path of Arrow Flight SQL queries against external tables when fetching split batches from the FE fails. ### Check List (For Author) - Test - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [x] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [x] Other reason: This is defensive hardening of an error path (an out-of-bounds guard, a message-length cap, and returning a structured error status instead of throwing). It only triggers when `fetchSplitBatch` fails; existing tests cover the success path, and the crash depends on Arrow/gRPC transport internals that are hard to reproduce deterministically in CI. - Behavior changed: - [x] Yes. On `fetchSplitBatch` failure the query now fails with a clear error message instead of (potentially) crashing the BE, and the FE no longer throws a bare `TException` for a released split source (it returns a `NOT_FOUND` status). - Does this need documentation? - [x] No. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What problem does this PR solve?
Issue Number: ref #62259 (partial — robustness layer only, does not close the issue)
Related PR: N/A
Problem Summary:
When an Arrow Flight SQL query scans an external table (e.g. Iceberg) in batch split mode, the BE fetches file splits from the FE via the
fetchSplitBatchthrift RPC during the scan. If that fetch fails — most notably when the split source has already been released — the error path could crash the BE (SIGSEGV inarrow::flight::internal::TransportStatus::FromStatus) instead of failing the query gracefully (see #62259).This PR is the robustness layer for that issue: it ensures any
fetchSplitBatchfailure makes the query fail gracefully rather than crashing the BE. It does not fix the underlying split source lifecycle problem (the source being released afterGetFlightInfobut beforeDoGeton the Arrow Flight two-phase path), which is tracked separately in the issue.Changes:
split_source_connector— guardresult.status.error_msgs[0]with anempty()check to avoid an out-of-bounds vector read when the FE returns a non-OK status without an error message.to_arrow_status— truncate the error message handed to Arrow/gRPC to a length well below 8192. The message is carried in the gRPC trailer (an HTTP2 header) and may be percent-encoded, so an oversized one can break the response or crash the flight transport status conversion. The 8192 limit was already documented in a comment in this function but was never enforced. The full message is still logged on the BE.fetchSplitBatch— when the split source has been released, return a structuredTStatus(NOT_FOUND)with a message instead of throwing a bareTException. The BE then receives a well-formed, non-empty error through the normal result path instead of a thrift transport exception.Release note
Fix a BE crash (SIGSEGV) that could happen on the error path of Arrow Flight SQL queries against external tables when fetching split batches from the FE fails.
Check List (For Author)
Test
fetchSplitBatchfails; existing tests cover the success path, and the crash depends on Arrow/gRPC transport internals that are hard to reproduce deterministically in CI.Behavior changed:
fetchSplitBatchfailure the query now fails with a clear error message instead of (potentially) crashing the BE, and the FE no longer throws a bareTExceptionfor a released split source (it returns aNOT_FOUNDstatus).Does this need documentation?
🤖 Generated with Claude Code