
[fix](arrow-flight) Harden split source error path to avoid BE crash on external table scan #64797

Merged
yiguolei merged 3 commits into apache:master from morningman:fix-arrow-flight-split-source-error-path on Jun 26, 2026


@morningman

Contributor

What problem does this PR solve?

Issue Number: ref #62259 (partial — robustness layer only, does not close the issue)

Related PR: N/A

Problem Summary:

When an Arrow Flight SQL query scans an external table (e.g. Iceberg) in batch split mode, the BE fetches file splits from the FE via the fetchSplitBatch thrift RPC during the scan. If that fetch fails — most notably when the split source has already been released — the error path could crash the BE (SIGSEGV in arrow::flight::internal::TransportStatus::FromStatus) instead of failing the query gracefully (see #62259).

This PR is the robustness layer for that issue: it ensures any fetchSplitBatch failure makes the query fail gracefully rather than crashing the BE. It does not fix the underlying split source lifecycle problem (the source being released after GetFlightInfo but before DoGet on the Arrow Flight two-phase path), which is tracked separately in the issue.

Changes:

  1. BE split_source_connector — guard result.status.error_msgs[0] with an empty() check to avoid an out-of-bounds vector read when the FE returns a non-OK status without an error message.
  2. BE to_arrow_status — truncate the error message handed to Arrow/gRPC to a length well below 8192. The message is carried in the gRPC trailer (an HTTP2 header) and may be percent-encoded, so an oversized one can break the response or crash the flight transport status conversion. The 8192 limit was already documented in a comment in this function but was never enforced. The full message is still logged on the BE.
  3. FE fetchSplitBatch — when the split source has been released, return a structured TStatus(NOT_FOUND) with a message instead of throwing a bare TException. The BE then receives a well-formed, non-empty error through the normal result path instead of a thrift transport exception.
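The out-of-bounds guard in change 1 can be illustrated with a minimal sketch. `FakeTStatus` and `first_error_or_fallback` are hypothetical names used only for illustration; they are not the thrift-generated type or the actual code in `split_source_connector`, and the fallback wording is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the thrift-generated status type; only the
// fields relevant to the guard are modelled here.
struct FakeTStatus {
    int status_code = 0;                  // 0 means OK in this sketch
    std::vector<std::string> error_msgs;  // may be empty even when non-OK
};

// Returns a non-empty message for a non-OK status without ever indexing
// into an empty vector.
std::string first_error_or_fallback(const FakeTStatus& status) {
    if (status.error_msgs.empty()) {
        // Fallback text is illustrative; the real message may differ.
        return "fetch split batch failed with status code " +
               std::to_string(status.status_code) +
               " (no error message from FE)";
    }
    return status.error_msgs[0];
}
```

Calling `first_error_or_fallback` on a status with an empty `error_msgs` yields the fallback text, while a populated vector yields its first entry unchanged, so the caller always has a non-empty string to put into the query error.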

Release note

Fix a BE crash (SIGSEGV) that could happen on the error path of Arrow Flight SQL queries against external tables when fetching split batches from the FE fails.

Check List (For Author)

  • Test

    • [ ] Regression test
    • [ ] Unit Test
    • [ ] Manual test (add detailed scripts or steps below)
    • [x] No need to test or manual test. Explain why:
      • [ ] This is a refactor/code format and no logic has been changed.
      • [ ] Previous test can cover this change.
      • [ ] No code files have been changed.
      • [x] Other reason: This is defensive hardening of an error path (an out-of-bounds guard, a message-length cap, and returning a structured error status instead of throwing). It only triggers when fetchSplitBatch fails; existing tests cover the success path, and the crash depends on Arrow/gRPC transport internals that are hard to reproduce deterministically in CI.
  • Behavior changed:

    • [x] Yes. On fetchSplitBatch failure the query now fails with a clear error message instead of (potentially) crashing the BE, and the FE no longer throws a bare TException for a released split source (it returns a NOT_FOUND status).
  • Does this need documentation?

    • [x] No.

🤖 Generated with Claude Code

…on external table scan

When an Arrow Flight SQL query scans an external table in batch split
mode, the BE fetches splits from the FE via `fetchSplitBatch`. If the
fetch fails (e.g. the split source has already been released), the error
path could crash the BE instead of failing the query gracefully. This
hardens the error path on both sides:

- BE `split_source_connector`: guard `error_msgs[0]` with an `empty()`
  check to avoid an out-of-bounds vector read when the FE returns a
  non-OK status without an error message.
- BE `to_arrow_status`: truncate the error message handed to Arrow/gRPC
  to a length well below 8192. The message is carried in the gRPC
  trailer (an HTTP2 header) and may be percent-encoded; an oversized one
  can break the response or crash the flight transport status
  conversion. The 8192 limit was already documented in a comment but
  never enforced.
- FE `fetchSplitBatch`: when the split source is released, return a
  structured `TStatus(NOT_FOUND)` with a message instead of throwing a
  bare `TException`, so the BE receives a well-formed error through the
  normal result path.

This is the robustness layer for apache#62259: it makes the query fail
gracefully rather than crashing the BE. The underlying split source
lifecycle issue (the source being released before `DoGet`) is tracked
separately in the issue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
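
The raw length cap described for `to_arrow_status` in the commit message above can be sketched as follows. This is an illustrative helper under stated assumptions, not the code in the PR: the name `truncate_status_message` and the suffix text are hypothetical, and this sketch keeps the suffix inside the cap, whereas the review thread indicates the PR appends its suffix on top of the 4096-byte cap.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: caps `msg` at `max_len` bytes in total. When the
// message is cut, a marker is appended so readers can tell it was shortened.
std::string truncate_status_message(const std::string& msg, std::size_t max_len) {
    static const std::string kSuffix = "...[truncated]";  // illustrative marker
    if (msg.size() <= max_len) {
        return msg;  // short messages pass through unchanged
    }
    if (max_len <= kSuffix.size()) {
        return msg.substr(0, max_len);  // no room for the marker
    }
    // Reserve space for the marker so the result never exceeds max_len.
    return msg.substr(0, max_len - kSuffix.size()) + kSuffix;
}
```

The full message would still be logged before truncation, as the PR description states, so only the copy handed to Arrow/gRPC is shortened.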
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29237 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ff28ab696ca733750e8fa46ac9122f34cd2dedb8, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17625	3966	3969	3966
q2	1994	306	187	187
q3	10456	1423	822	822
q4	4678	466	341	341
q5	7536	868	594	594
q6	180	165	135	135
q7	766	854	619	619
q8	9486	1651	1733	1651
q9	6183	4518	4484	4484
q10	6797	1782	1521	1521
q11	448	265	240	240
q12	629	415	295	295
q13	18125	3417	2731	2731
q14	265	262	243	243
q15
q16	783	787	708	708
q17	1085	1009	961	961
q18	6746	5732	5686	5686
q19	1382	1257	1117	1117
q20	481	394	265	265
q21	5988	2614	2373	2373
q22	434	351	298	298
Total cold run time: 102067 ms
Total hot run time: 29237 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4296	4268	4343	4268
q2	329	357	227	227
q3	4572	4965	4393	4393
q4	2071	2169	1395	1395
q5	4394	4297	4328	4297
q6	231	182	125	125
q7	1760	1797	1744	1744
q8	2520	2244	2117	2117
q9	8195	8167	8079	8079
q10	4800	4795	4296	4296
q11	569	418	374	374
q12	741	863	621	621
q13	3237	3609	3003	3003
q14	313	323	280	280
q15
q16	715	739	641	641
q17	1342	1323	1342	1323
q18	7996	7538	7218	7218
q19	1172	1115	1109	1109
q20	2285	2261	1980	1980
q21	5337	4658	4516	4516
q22	506	451	400	400
Total cold run time: 57381 ms
Total hot run time: 52406 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172364 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ff28ab696ca733750e8fa46ac9122f34cd2dedb8, data reload: false

query5	4326	633	483	483
query6	454	188	168	168
query7	4863	565	310	310
query8	367	220	228	220
query9	8778	4052	4046	4046
query10	427	316	256	256
query11	5939	2368	2115	2115
query12	149	101	101	101
query13	1247	610	430	430
query14	6362	5377	5006	5006
query14_1	4399	4409	4378	4378
query15	209	196	175	175
query16	1020	464	408	408
query17	923	675	562	562
query18	2415	462	339	339
query19	193	178	135	135
query20	118	106	107	106
query21	218	138	118	118
query22	13762	13601	13415	13415
query23	17293	16485	16194	16194
query23_1	16382	16315	16366	16315
query24	7540	1793	1334	1334
query24_1	1328	1326	1322	1322
query25	574	477	392	392
query26	1289	303	176	176
query27	2716	568	362	362
query28	4489	2036	2041	2036
query29	1109	619	502	502
query30	314	237	194	194
query31	1110	1074	971	971
query32	109	63	61	61
query33	535	331	256	256
query34	1161	1143	662	662
query35	769	797	689	689
query36	1371	1403	1240	1240
query37	154	107	93	93
query38	1898	1721	1665	1665
query39	933	928	890	890
query39_1	876	870	893	870
query40	223	121	104	104
query41	74	68	68	68
query42	90	89	90	89
query43	318	324	276	276
query44	1445	803	803	803
query45	196	188	181	181
query46	1095	1173	741	741
query47	2416	2402	2236	2236
query48	409	428	306	306
query49	642	473	371	371
query50	998	369	258	258
query51	4504	4450	4386	4386
query52	85	82	74	74
query53	250	270	195	195
query54	292	230	233	230
query55	77	73	74	73
query56	256	238	220	220
query57	1439	1431	1318	1318
query58	258	219	222	219
query59	1628	1688	1474	1474
query60	321	234	227	227
query61	152	146	140	140
query62	694	649	576	576
query63	230	186	191	186
query64	2512	738	600	600
query65	4836	4716	4811	4716
query66	1801	459	343	343
query67	29798	29753	28920	28920
query68	3060	1698	957	957
query69	414	303	272	272
query70	1092	1025	957	957
query71	297	236	211	211
query72	2914	2776	2416	2416
query73	875	760	427	427
query74	5162	4989	4761	4761
query75	2617	2597	2228	2228
query76	2327	1161	775	775
query77	350	372	292	292
query78	12338	12262	11897	11897
query79	1419	1293	730	730
query80	594	473	368	368
query81	452	281	235	235
query82	567	155	122	122
query83	351	273	245	245
query84	306	149	117	117
query85	848	526	435	435
query86	378	313	293	293
query87	1851	1822	1773	1773
query88	3683	2783	2777	2777
query89	430	385	337	337
query90	1934	177	177	177
query91	170	158	127	127
query92	61	62	58	58
query93	1524	1376	952	952
query94	531	318	328	318
query95	682	464	342	342
query96	1072	787	374	374
query97	2742	2688	2554	2554
query98	220	206	205	205
query99	1206	1149	1028	1028
Total cold run time: 258035 ms
Total hot run time: 172364 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.32 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ff28ab696ca733750e8fa46ac9122f34cd2dedb8, data reload: false

query1	0.01	0.01	0.00
query2	0.09	0.06	0.05
query3	0.26	0.13	0.13
query4	1.63	0.14	0.13
query5	0.24	0.23	0.23
query6	1.25	1.08	1.13
query7	0.04	0.00	0.00
query8	0.06	0.04	0.04
query9	0.38	0.32	0.32
query10	0.56	0.55	0.54
query11	0.20	0.14	0.14
query12	0.17	0.15	0.14
query13	0.47	0.47	0.46
query14	1.03	1.01	1.01
query15	0.62	0.60	0.60
query16	0.32	0.33	0.32
query17	1.08	1.11	1.10
query18	0.22	0.21	0.21
query19	2.09	1.90	1.91
query20	0.01	0.01	0.01
query21	15.40	0.22	0.14
query22	4.87	0.06	0.05
query23	16.14	0.30	0.13
query24	2.97	0.44	0.35
query25	0.12	0.06	0.04
query26	0.75	0.22	0.14
query27	0.04	0.04	0.04
query28	3.51	0.92	0.56
query29	12.50	4.29	3.43
query30	0.29	0.16	0.15
query31	2.77	0.60	0.31
query32	3.21	0.60	0.49
query33	3.16	3.31	3.22
query34	15.47	4.17	3.54
query35	3.55	3.54	3.55
query36	0.56	0.45	0.41
query37	0.09	0.07	0.07
query38	0.06	0.04	0.04
query39	0.04	0.02	0.03
query40	0.18	0.17	0.15
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.03	0.04
Total cold run time: 96.58 s
Total hot run time: 25.32 s

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/4) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.22% (28457/38339)
Line Coverage 58.01% (309789/534020)
Region Coverage 54.62% (258497/473247)
Branch Coverage 56.04% (112464/200671)

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/4) 🎉
Increment coverage report
Complete coverage report

suxiaogang223 previously approved these changes on Jun 25, 2026
github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on Jun 25, 2026
@github-actions

Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Contributor

PR approved by anyone and no changes requested.

@morningman

Contributor Author

/review

@github-actions github-actions Bot left a comment

Contributor

I found one blocking issue in the Arrow Flight hardening path.

Checkpoint conclusions:

  • Goal/test proof: the PR improves split-source failure handling, but the oversized Arrow/gRPC error path is not fully fixed; no deterministic test was added for expansion-heavy long messages.
  • Scope/focus: the changes are small and targeted to FE split batch status handling plus BE error conversion.
  • Concurrency/lifecycle: split sources are weak-reference managed and can disappear before BE fetch, so the new FE null-source status path is reachable; no new lock-order issue found.
  • Configuration/compatibility: no new config or protocol field was added; the structured NOT_FOUND status is compatible with the existing BE status check.
  • Parallel paths: scan failures propagate through the result buffer and Arrow Flight readers to to_arrow_status(), so the truncation behavior affects the targeted path.
  • Tests/style: git diff --check is clean. I did not run FE/BE tests because this runner is missing thirdparty/installed/bin/protoc and thirdparty/installed/bin/thrift.

Subagent conclusions: tests-session-config proposed TSC-1, which was independently verified and submitted as M-1. optimizer-rewrite reported no optimizer/rewrite findings. Final convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same current ledger/comment set.

User focus: no additional user-provided review focus was supplied.

Comment thread be/src/format/arrow/arrow_utils.cpp Outdated
// percent-encoded (which can expand its size), so an oversized message can break the
// response or even crash the flight transport status conversion. Truncate it well below
// 8192 to leave headroom; the full message is already logged above.
constexpr size_t kMaxArrowStatusMsgLen = 4096;

Contributor

The cap still leaves the Arrow Flight error path exposed to the same oversized-header failure. This limits the raw status.to_string_no_stack() to 4096 bytes, but the comment above says this string is carried in a gRPC trailer and may be percent-encoded before hitting the 8192-byte header limit. If the message contains mostly escapable bytes, 4096 raw bytes plus the suffix can expand to well over 8KB on the wire, so an external error string can still reproduce the Header size exceeded max allowed size path this PR is trying to harden. Please either make this truncation encode-aware or choose a raw limit that stays below the worst-case encoded size, and add a focused test for an expansion-heavy long status message.
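
The expansion concern in this comment can be made concrete with a small sketch. It assumes the gRPC status message is percent-encoded so that every byte outside printable ASCII (0x20 to 0x7E), plus `%` itself, becomes three bytes (`%XX`); that rule, the function names, and the 8000-byte budget used in the usage note are assumptions for illustration, not code from Arrow, gRPC, or this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: printable ASCII except '%' passes through as one byte;
// every other byte is percent-encoded as three bytes ("%XX").
std::size_t encoded_size_of(unsigned char c) {
    const bool passthrough = c >= 0x20 && c <= 0x7E && c != '%';
    return passthrough ? 1 : 3;
}

// Size of `msg` on the wire after percent-encoding under the rule above.
std::size_t percent_encoded_size(const std::string& msg) {
    std::size_t total = 0;
    for (unsigned char c : msg) {
        total += encoded_size_of(c);
    }
    return total;
}

// Keeps the longest prefix of `msg` whose encoded size fits in `budget`.
std::string truncate_to_encoded_budget(const std::string& msg, std::size_t budget) {
    std::size_t used = 0;
    std::size_t keep = 0;
    while (keep < msg.size()) {
        const std::size_t cost =
                encoded_size_of(static_cast<unsigned char>(msg[keep]));
        if (used + cost > budget) {
            break;  // the next byte would push the encoded form over budget
        }
        used += cost;
        ++keep;
    }
    return msg.substr(0, keep);
}
```

Under this assumption, a 4096-byte message made entirely of escapable bytes encodes to 12288 bytes, well above 8192, which is the worst case the comment describes. `truncate_to_encoded_budget(msg, 8000)` would keep at most 2666 such bytes. A production version would also need to avoid cutting a multi-byte UTF-8 sequence in half and to leave room for any suffix and for header overhead.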

@morningman

Contributor Author

run buildall

github-actions bot removed the "approved" label (Indicates a PR has been approved by one committer.) on Jun 25, 2026
@hello-stephen

Contributor
TPC-H: Total hot run time: 29148 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 247ffef01cb05afc0a2e3f93e8d2e385210d812b, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17716	4024	4238	4024
q2	1999	338	190	190
q3	10280	1467	823	823
q4	4676	471	350	350
q5	7508	848	572	572
q6	181	169	136	136
q7	810	837	620	620
q8	9378	1598	1648	1598
q9	5584	4507	4508	4507
q10	6765	1796	1517	1517
q11	433	279	248	248
q12	632	422	288	288
q13	18106	3388	2763	2763
q14	266	271	241	241
q15
q16	794	775	707	707
q17	932	849	899	849
q18	6937	5825	5557	5557
q19	1196	1260	1077	1077
q20	472	417	268	268
q21	5671	2666	2508	2508
q22	430	355	305	305
Total cold run time: 100766 ms
Total hot run time: 29148 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4342	4291	4462	4291
q2	328	351	225	225
q3	4573	4980	4439	4439
q4	2063	2153	1359	1359
q5	4439	4292	4357	4292
q6	236	175	130	130
q7	1736	1672	1885	1672
q8	2709	2260	2173	2173
q9	8316	8379	8202	8202
q10	4832	4751	4386	4386
q11	566	436	387	387
q12	750	755	561	561
q13	3284	3618	2957	2957
q14	315	303	269	269
q15
q16	710	739	632	632
q17	1403	1347	1366	1347
q18	8032	7419	7186	7186
q19	1162	1161	1163	1161
q20	2224	2229	1979	1979
q21	5294	4571	4625	4571
q22	520	461	404	404
Total cold run time: 57834 ms
Total hot run time: 52623 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172227 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 247ffef01cb05afc0a2e3f93e8d2e385210d812b, data reload: false

query5	4323	640	472	472
query6	437	182	169	169
query7	4817	569	303	303
query8	340	184	168	168
query9	8729	4085	4093	4085
query10	454	312	286	286
query11	5917	2364	2133	2133
query12	155	115	97	97
query13	1271	612	415	415
query14	6292	5316	4996	4996
query14_1	4286	4309	4324	4309
query15	210	206	180	180
query16	1005	495	431	431
query17	957	750	604	604
query18	2448	494	352	352
query19	211	193	152	152
query20	112	110	110	110
query21	219	145	121	121
query22	13662	13598	13485	13485
query23	17553	16574	16196	16196
query23_1	16391	16320	16375	16320
query24	7478	1781	1299	1299
query24_1	1334	1340	1326	1326
query25	613	434	364	364
query26	1295	317	178	178
query27	2702	527	347	347
query28	4448	1992	2010	1992
query29	1052	619	481	481
query30	308	245	209	209
query31	1124	1080	945	945
query32	131	59	56	56
query33	515	336	245	245
query34	1184	1129	645	645
query35	768	790	663	663
query36	1447	1386	1240	1240
query37	155	106	93	93
query38	1911	1706	1668	1668
query39	930	920	896	896
query39_1	889	877	865	865
query40	216	124	102	102
query41	66	63	64	63
query42	94	88	85	85
query43	321	324	278	278
query44	1426	779	782	779
query45	207	201	186	186
query46	1084	1194	744	744
query47	2365	2413	2221	2221
query48	408	430	288	288
query49	574	425	328	328
query50	952	373	248	248
query51	4460	4380	4332	4332
query52	82	80	68	68
query53	244	256	190	190
query54	263	216	200	200
query55	74	71	66	66
query56	263	227	217	217
query57	1446	1442	1326	1326
query58	254	216	230	216
query59	1575	1651	1406	1406
query60	276	260	240	240
query61	150	154	141	141
query62	695	653	588	588
query63	232	184	194	184
query64	2522	768	635	635
query65	4866	4754	4803	4754
query66	1802	471	343	343
query67	28895	28977	28717	28717
query68	3128	1506	875	875
query69	404	300	265	265
query70	1056	982	973	973
query71	297	242	209	209
query72	2923	2622	2351	2351
query73	812	770	432	432
query74	5110	4981	4842	4842
query75	2592	2545	2170	2170
query76	2356	1176	809	809
query77	357	383	294	294
query78	12402	12364	11972	11972
query79	1233	1113	759	759
query80	541	471	373	373
query81	451	274	239	239
query82	242	157	121	121
query83	287	275	249	249
query84	289	144	116	116
query85	839	518	419	419
query86	320	299	301	299
query87	1827	1826	1778	1778
query88	3669	2754	2759	2754
query89	415	383	340	340
query90	2168	187	180	180
query91	165	160	127	127
query92	64	60	57	57
query93	1508	1459	904	904
query94	553	342	296	296
query95	684	494	349	349
query96	1063	823	370	370
query97	2700	2695	2570	2570
query98	213	204	204	204
query99	1173	1163	1063	1063
Total cold run time: 256172 ms
Total hot run time: 172227 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.17 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 247ffef01cb05afc0a2e3f93e8d2e385210d812b, data reload: false

query1	0.01	0.01	0.01
query2	0.10	0.06	0.05
query3	0.26	0.13	0.13
query4	1.61	0.13	0.14
query5	0.24	0.22	0.22
query6	1.27	1.03	1.08
query7	0.04	0.01	0.01
query8	0.10	0.04	0.03
query9	0.37	0.30	0.31
query10	0.56	0.56	0.58
query11	0.21	0.15	0.14
query12	0.18	0.15	0.14
query13	0.46	0.47	0.48
query14	1.02	1.02	1.01
query15	0.62	0.58	0.61
query16	0.32	0.34	0.33
query17	1.09	1.08	1.08
query18	0.22	0.22	0.21
query19	2.05	1.94	1.89
query20	0.01	0.02	0.01
query21	15.43	0.19	0.13
query22	4.97	0.05	0.06
query23	16.12	0.32	0.13
query24	2.95	0.43	0.30
query25	0.10	0.07	0.05
query26	0.76	0.21	0.15
query27	0.04	0.04	0.03
query28	3.54	0.90	0.57
query29	12.48	4.32	3.49
query30	0.28	0.14	0.16
query31	2.78	0.58	0.32
query32	3.22	0.60	0.49
query33	3.16	3.20	3.28
query34	15.64	4.20	3.50
query35	3.56	3.50	3.56
query36	0.57	0.44	0.42
query37	0.09	0.07	0.07
query38	0.04	0.04	0.03
query39	0.04	0.03	0.03
query40	0.18	0.16	0.15
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.03	0.04
Total cold run time: 96.86 s
Total hot run time: 25.17 s

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.93% (21591/39303)
Line Coverage 38.42% (206417/537312)
Region Coverage 34.46% (162379/471174)
Branch Coverage 35.50% (71157/200450)

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/4) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.11% (28430/38361)
Line Coverage 57.99% (309732/534114)
Region Coverage 54.69% (258837/473275)
Branch Coverage 56.03% (112451/200693)

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/8) 🎉
Increment coverage report
Complete coverage report

@morningman

Contributor Author

/review

@github-actions github-actions Bot left a comment

Contributor

Automated review completed.

I found one remaining issue, submitted inline: the released split-source RPC contract now returns a structured NOT_FOUND result that the BE depends on, but there is no focused test covering that no-throw/non-empty-message branch.

Critical checkpoint conclusions:

  • Goal/current behavior: the PR hardens external scan error propagation by returning structured FE status, guarding empty BE error messages, and truncating Arrow status text; the implementation direction is correct, but the new FE RPC contract is not test-proven.
  • Scope: the patch is small and limited to the split-source fetch path and Arrow status conversion.
  • Concurrency/lifecycle: parallel scanners share RemoteSplitSourceConnector under _range_lock; SplitSourceManager weak references and FileQueryScanNode.stop() make the released-source branch reachable. I found no new lock or lifecycle bug.
  • Config/protocol/compatibility: no new config or thrift fields are added; the existing optional TStatus path is used. BE handles non-OK status and empty error message fallback.
  • Parallel paths: remote split-source scanning and Arrow Flight local/remote reader status propagation were traced; no additional parallel-path issue survived review.
  • Tests: no FE/BE/regression test covers the new released split-source status contract, which is the inline finding. The Arrow Flight expansion-heavy truncation test concern is already covered by existing thread discussion_r3472560365 and was not duplicated.
  • Observability/performance: the new warning log and full-message logging before truncation are sufficient for this path; no material performance issue found.
  • Validation: exact-range git diff --check passed. FE/BE tests were not run because thirdparty/installed, protoc, and thrift are missing in this runner.
  • User focus: no additional user-provided focus points were present.

Subagent conclusions: optimizer-rewrite reported no new valuable findings. tests-session-config proposed TSC-1, which was accepted as M-1 and submitted inline. D-1 was suppressed as a duplicate of existing Arrow Flight thread discussion_r3472560365. Convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same current ledger/comment set.

// on the Arrow Flight path, may feed an empty/invalid error string into the gRPC
// status conversion. See https://github.com/apache/doris/issues/62259
LOG.warn("split source {} is released", request.getSplitSourceId());
result.status = new TStatus(TStatusCode.NOT_FOUND);

Contributor

This changes the released split-source contract from a thrown TException to a structured TFetchSplitBatchResult, but the new branch is not covered by a focused test. That contract is what the BE now relies on in RemoteSplitSourceConnector::get_next() to produce a normal Status with a non-empty message instead of going through the transport-exception path. A small FrontendServiceImplTest case can call fetchSplitBatch with an unregistered split source id and assert that it does not throw, returns TStatusCode.NOT_FOUND, and carries a non-empty error_msgs entry. Without that, this regression path can slip back to a bare exception or empty message and re-open the Arrow Flight failure mode this PR is hardening.
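
The contract this comment asks to pin down (no throw, `NOT_FOUND`, non-empty message) can be modelled in a few lines. The sketch below is a language-neutral model written in C++ for consistency with the BE code; it is not the Java FE implementation, and `FetchResult`, `SplitSourceRegistry`, and `fetch_split_batch` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class StatusCode { OK, NOT_FOUND };

// Hypothetical result type: a status plus either error messages or splits.
struct FetchResult {
    StatusCode code = StatusCode::OK;
    std::vector<std::string> error_msgs;
    std::vector<std::string> splits;
};

// Hypothetical registry of live split sources keyed by id.
using SplitSourceRegistry = std::map<long long, std::vector<std::string>>;

// Returns a structured NOT_FOUND result for an unknown (released) source
// instead of throwing, so the caller sees a well-formed error.
FetchResult fetch_split_batch(const SplitSourceRegistry& registry,
                              long long split_source_id) {
    FetchResult result;
    const auto it = registry.find(split_source_id);
    if (it == registry.end()) {
        result.code = StatusCode::NOT_FOUND;
        result.error_msgs.push_back("split source " +
                                    std::to_string(split_source_id) +
                                    " is released");
        return result;
    }
    result.splits = it->second;
    return result;
}
```

A focused test along the lines the comment suggests would call the function with an unregistered id and assert the three properties: the call returns normally, the code is `NOT_FOUND`, and the first error message is non-empty.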

@github-actions

Contributor

PR approved by at least one committer and no changes requested.

github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on Jun 26, 2026
yiguolei added the "dev/4.1.x" and "usercase" (Important user case type label) labels on Jun 26, 2026
yiguolei merged commit bb628eb into apache:master on Jun 26, 2026
36 of 37 checks passed
github-actions Bot pushed a commit that referenced this pull request Jun 26, 2026
…on external table scan (#64797)

### What problem does this PR solve?

Issue Number: ref #62259 (partial — robustness layer only, does not
close the issue)

Related PR: N/A

Problem Summary:

When an Arrow Flight SQL query scans an external table (e.g. Iceberg) in
batch split mode, the BE fetches file splits from the FE via the
`fetchSplitBatch` thrift RPC *during* the scan. If that fetch fails —
most notably when the split source has already been released — the error
path could crash the BE (SIGSEGV in
`arrow::flight::internal::TransportStatus::FromStatus`) instead of
failing the query gracefully (see #62259).

This PR is the **robustness layer** for that issue: it ensures any
`fetchSplitBatch` failure makes the query fail gracefully rather than
crashing the BE. It does **not** fix the underlying split source
lifecycle problem (the source being released after `GetFlightInfo` but
before `DoGet` on the Arrow Flight two-phase path), which is tracked
separately in the issue.

Changes:

1. **BE `split_source_connector`** — guard `result.status.error_msgs[0]`
with an `empty()` check to avoid an out-of-bounds vector read when the
FE returns a non-OK status without an error message.
2. **BE `to_arrow_status`** — truncate the error message handed to
Arrow/gRPC to a length well below 8192. The message is carried in the
gRPC trailer (an HTTP2 header) and may be percent-encoded, so an
oversized one can break the response or crash the flight transport
status conversion. The 8192 limit was already documented in a comment in
this function but was never enforced. The full message is still logged
on the BE.
3. **FE `fetchSplitBatch`** — when the split source has been released,
return a structured `TStatus(NOT_FOUND)` with a message instead of
throwing a bare `TException`. The BE then receives a well-formed,
non-empty error through the normal result path instead of a thrift
transport exception.

### Release note

Fix a BE crash (SIGSEGV) that could happen on the error path of Arrow
Flight SQL queries against external tables when fetching split batches
from the FE fails.

### Check List (For Author)

- Test
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [x] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [x] Other reason: This is defensive hardening of an error path (an
          out-of-bounds guard, a message-length cap, and returning a structured
          error status instead of throwing). It only triggers when
          `fetchSplitBatch` fails; existing tests cover the success path, and the
          crash depends on Arrow/gRPC transport internals that are hard to
          reproduce deterministically in CI.

- Behavior changed:
    - [x] Yes. On `fetchSplitBatch` failure the query now fails with a clear
      error message instead of (potentially) crashing the BE, and the FE no
      longer throws a bare `TException` for a released split source (it
      returns a `NOT_FOUND` status).

- Does this need documentation?
    - [x] No.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
yiguolei pushed a commit that referenced this pull request Jun 26, 2026
…id BE crash on external table scan #64797 (#64890)

Cherry-picked from #64797

Co-authored-by: Mingyu Chen (Rayner) <yunyou@selectdb.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.1.3-merged, reviewed, usercase (Important user case type label)


5 participants