[fix](pipeline) detect unread local exchange data on eos #63588

Open
Mryange wants to merge 1 commit into apache:master from Mryange:check-local-exchange-in-eos

@Mryange
Contributor

@Mryange Mryange commented May 25, 2026

What problem does this PR solve?

Problem Summary:

A local exchange source without a limit can currently trust eos even when unread data is still buffered in the exchanger. When that happens, the pipeline may finish with fewer output rows than input rows, and the failure is hidden behind a normal end-of-stream.

Root cause: the source path did not validate exchanger state after eos, and AdaptivePassthroughExchanger additionally has a tmp block path that lives outside the normal queue state.

This change adds a post-eos residual-data check for the no-limit source path, reports detailed queue and buffered-block diagnostics when unread data remains, and adds unit tests that cover both queue residuals and adaptive passthrough tmp-block residuals.
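
A minimal sketch of the post-eos residual check described above, assuming a simplified per-channel queue. `Block`, `ChannelState`, and `check_unread_after_eos` are hypothetical names for illustration, not the types or functions in this patch, and the message format is invented.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <string>

// Hypothetical stand-in for a block of rows; not the Doris Block type.
struct Block {
    std::size_t rows = 0;
};

// Hypothetical per-channel state of a local exchanger.
struct ChannelState {
    std::deque<Block> queue;  // blocks enqueued by sinks but not yet read by the source
    bool eos = false;         // true once the source has been told "end of stream"
};

// Returns a diagnostic message if eos was reported while data is still buffered.
// limit < 0 means "no limit"; with a limit, stopping before the queue drains is
// legitimate, so the check is skipped.
std::optional<std::string> check_unread_after_eos(const ChannelState& ch, int channel_id,
                                                  std::int64_t limit) {
    if (!ch.eos || limit >= 0) {
        return std::nullopt;  // only natural eos on a no-limit source is validated
    }
    if (ch.queue.empty()) {
        return std::nullopt;  // everything was consumed, eos is trustworthy
    }
    std::size_t rows = 0;
    for (const Block& b : ch.queue) {
        rows += b.rows;
    }
    return "local exchange eos with unread data: channel=" + std::to_string(channel_id) +
           ", blocks=" + std::to_string(ch.queue.size()) + ", rows=" + std::to_string(rows);
}
```

In this sketch the caller would turn a returned message into an error status instead of finishing the pipeline normally, so a row-count mismatch surfaces as a failure rather than a silent short result.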

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Mryange
Contributor Author

Mryange commented May 25, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

Found one blocking issue: the new BE unit tests directly access protected/private exchanger members, so the test target will not compile.

Critical checkpoint conclusions:

  • Goal/test proof: The production change aims to detect unread local-exchange data after natural EOS, and tests were added, but the new tests currently fail at compile time.
  • Scope/focus: The production change is small and focused; the blocker is limited to test construction.
  • Concurrency: The runtime check is executed after EOS from the source path. No new threads or dependency types are introduced; existing exchanger queue locking is reused.
  • Lifecycle/static initialization: No new static/global lifecycle hazards found.
  • Config/compatibility: No new config, storage format, or FE/BE protocol compatibility concern found.
  • Parallel paths: The virtual method covers all exchanger implementations through the template override, with an adaptive override for tmp-block state.
  • Conditional checks: The no-limit EOS gate is intentional to avoid flagging limit-short-circuit cases.
  • Test coverage/results: Tests cover queue residual and adaptive tmp residual intent, but they need to be rewritten to use public behavior or otherwise made legally accessible before they can validate the feature.
  • Observability: Error messages include exchange type, channel, queue info, and unread block details; sufficient for this diagnostic path.
  • Transaction/persistence/data writes: Not applicable.
  • Performance/memory: Check only runs on natural EOS and should not affect hot steady-state paths.

User focus: No additional user-provided review focus was supplied.
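
The "parallel paths" point can be pictured with a small sketch: one virtual check on the base class, a template override that covers every queue-based exchanger, and an adaptive override that also counts a temporary block held outside the queues. All class and member names here are illustrative assumptions, not the declarations in the patch.

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical interface; not the Doris ExchangerBase declaration.
class ExchangerBase {
public:
    virtual ~ExchangerBase() = default;
    // Number of buffered blocks that channel `id` has not consumed yet.
    virtual std::size_t unread_blocks(std::size_t id) const = 0;
};

// Generic exchanger: one queue per channel, shared by all block types.
template <typename BlockType>
class Exchanger : public ExchangerBase {
public:
    explicit Exchanger(std::size_t channels) : _queues(channels) {}
    void push(std::size_t id, BlockType b) { _queues[id].push_back(std::move(b)); }
    std::size_t unread_blocks(std::size_t id) const override { return _queues[id].size(); }

protected:
    std::vector<std::deque<BlockType>> _queues;
};

struct Block {
    std::size_t rows = 0;
};

// Adaptive variant: keeps one extra block outside the queues, so the
// queue-only check in the base template would miss it.
class AdaptiveExchanger final : public Exchanger<Block> {
public:
    using Exchanger<Block>::Exchanger;
    void stash(Block b) {
        _tmp = std::move(b);
        _has_tmp = true;
    }
    std::size_t unread_blocks(std::size_t id) const override {
        return Exchanger<Block>::unread_blocks(id) + (_has_tmp ? 1 : 0);
    }

private:
    Block _tmp;
    bool _has_tmp = false;
};
```

Calling `unread_blocks` through an `ExchangerBase` reference reaches the right override for each exchanger type, which is why a single call site in the source operator is enough in this sketch.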

auto wrapper = ExchangerBase::BlockWrapper::create_shared(std::move(in_block), nullptr, -1);
auto row_idx = std::make_shared<PODArray<uint32_t>>(1);
(*row_idx)[0] = 0;
ASSERT_TRUE(exchanger->_data_queue[0].enqueue(
Contributor

This test target will not compile because it accesses Exchanger<PartitionedBlock>::_data_queue, which is a protected member of the base class. The next test has the same compile-time problem with AdaptivePassthroughExchanger::_tmp_block and _tmp_eos, which are private. Please construct these residual states through the public sink/get_block behavior (or otherwise add a deliberate test-only access mechanism) instead of directly reaching into class internals.
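
For illustration, one shape a "deliberate test-only access mechanism" could take is a friend peer struct, sketched below with hypothetical names (`AdaptiveExchangerTestPeer`, `has_unread_tmp`). A build-level alternative is a compiler switch such as GCC/Clang's `-fno-access-control`, which disables access checking for the whole translation unit; this sketch makes no claim about which approach the Doris BE unit-test build uses.

```cpp
#include <cstddef>
#include <utility>

struct Block {
    std::size_t rows = 0;
};

// Hypothetical exchanger with private residual state.
class AdaptiveExchanger {
public:
    bool has_unread_tmp() const { return _has_tmp; }

private:
    friend struct AdaptiveExchangerTestPeer;  // the only sanctioned back door
    Block _tmp_block;
    bool _has_tmp = false;
};

// Defined in test code only; production code never names this type.
struct AdaptiveExchangerTestPeer {
    // Plants a residual tmp block so a test can exercise the post-eos check.
    static void set_tmp_block(AdaptiveExchanger& ex, Block b) {
        ex._tmp_block = std::move(b);
        ex._has_tmp = true;
    }
};
```

The friend approach keeps access checking on for everything else and documents exactly which internals tests depend on; a global compile option needs no production-code changes but gives every test unrestricted access.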

@Mryange
Contributor Author

Mryange commented May 25, 2026

/review Our BE UT build has a compile option that allows access to private members.

@github-actions
Contributor

OpenCode automated review failed and did not complete.

Error: Review step was skipped (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/26388203155

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@Mryange
Contributor Author

Mryange commented May 25, 2026

/review Our BE UT build has a compile option that allows access to private members.

Contributor

@github-actions github-actions Bot left a comment

Review result: no additional blocking issues found in the PR patch.

Critical checkpoint conclusions:

  • Goal/test: The change addresses hidden local-exchange eos with unread buffered data by validating queue/tmp-block state after eos on no-limit source operators; added BE unit tests cover queue residuals, adaptive tmp-block residuals, and adaptive copied-slice behavior.
  • Scope: The production change is small and focused on local exchange source/exchanger logic.
  • Concurrency: The new check runs only after eos from get_block; at that point running sink operators should be finished for the channel, so I did not find a concrete enqueue/check race. Existing exchanger queue locking patterns are preserved.
  • Lifecycle: No new static/global lifecycle concerns. The adaptive tmp-block lifecycle is explicitly checked.
  • Config/compatibility: No new configs, protocols, storage formats, or FE-BE compatibility concerns.
  • Parallel paths: The virtual check covers the Exchanger template used by shuffle, passthrough, pass-to-one, and broadcast; AdaptivePassthrough also checks its extra tmp-block path.
  • Tests/results: Tests are BE unit tests and target the new diagnostics. I did not run them in this review environment.
  • Observability: Error messages include channel, exchange type, queue info, and block diagnostics, which should be sufficient for this invariant failure.
  • Transaction/persistence/data writes: Not applicable.
  • Performance: The check is gated behind eos and no-limit, so it is not on the normal per-row/per-block hot path.

Focus response: I noted the existing thread about direct private/protected member access in BE unit tests. Given the user focus that BE UT has compile options for this access, I did not treat that existing concern as a new issue or duplicate it.

@Mryange
Contributor Author

Mryange commented May 25, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 32034 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 09d76de9463a6cd92bcf3e5da0dc4900a44947bb, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17794	4106	4035	4035
q2	
q3	10821	1419	838	838
q4	4679	488	351	351
q5	7559	2343	2116	2116
q6	247	178	135	135
q7	1028	801	637	637
q8	9355	1903	1706	1706
q9	5192	4970	4979	4970
q10	6372	2219	1864	1864
q11	442	265	241	241
q12	632	440	293	293
q13	18127	3416	2795	2795
q14	268	260	238	238
q15	
q16	827	779	708	708
q17	988	966	1010	966
q18	6866	5864	5578	5578
q19	1322	1368	1227	1227
q20	560	483	309	309
q21	6311	2873	2706	2706
q22	589	393	321	321
Total cold run time: 99979 ms
Total hot run time: 32034 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4850	4697	4840	4697
q2	
q3	5010	5357	4643	4643
q4	2180	2239	1465	1465
q5	5125	4703	4698	4698
q6	249	192	131	131
q7	1948	1807	1585	1585
q8	2438	2122	2176	2122
q9	7954	7448	7425	7425
q10	4770	4692	4243	4243
q11	568	391	365	365
q12	753	745	541	541
q13	3050	3320	2791	2791
q14	274	282	262	262
q15	
q16	695	714	624	624
q17	1325	1292	1308	1292
q18	7440	6853	6980	6853
q19	1119	1111	1142	1111
q20	2246	2233	1974	1974
q21	5381	4687	4526	4526
q22	535	487	398	398
Total cold run time: 57910 ms
Total hot run time: 51746 ms
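
The report does not label its columns. Judging from the numbers alone, each row appears to hold one cold run, two hot runs, and the faster of the two hot runs, with the totals summing the first and last columns. The sketch below checks that reading against the Round 1 rows; the column interpretation and all names (`QueryRun`, `sum_runs`) are assumptions, not something the benchmark scripts state here.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// One report row: first (cold) run, then two hot runs, all in milliseconds.
struct QueryRun {
    std::int64_t cold;
    std::int64_t hot1;
    std::int64_t hot2;
};

// Round 1 rows copied from the report; q2 and q15 list no timings and are omitted.
constexpr std::array<QueryRun, 20> kRound1 = {{
    {17794, 4106, 4035}, {10821, 1419, 838},  {4679, 488, 351},   {7559, 2343, 2116},
    {247, 178, 135},     {1028, 801, 637},    {9355, 1903, 1706}, {5192, 4970, 4979},
    {6372, 2219, 1864},  {442, 265, 241},     {632, 440, 293},    {18127, 3416, 2795},
    {268, 260, 238},     {827, 779, 708},     {988, 966, 1010},   {6866, 5864, 5578},
    {1322, 1368, 1227},  {560, 483, 309},     {6311, 2873, 2706}, {589, 393, 321},
}};

struct Totals {
    std::int64_t cold = 0;
    std::int64_t hot = 0;
};

// Sums the cold column and, per query, the faster of the two hot runs.
template <std::size_t N>
Totals sum_runs(const std::array<QueryRun, N>& runs) {
    Totals t;
    for (const QueryRun& r : runs) {
        t.cold += r.cold;
        t.hot += std::min(r.hot1, r.hot2);
    }
    return t;
}
```

With the Round 1 rows, `sum_runs(kRound1)` gives a cold total of 99979 and a hot total of 32034, matching the two "Total ... run time" lines reported for that round.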

@hello-stephen
Contributor

TPC-DS: Total hot run time: 172689 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 09d76de9463a6cd92bcf3e5da0dc4900a44947bb, data reload: false

query5	4304	668	509	509
query6	346	215	211	211
query7	4237	595	309	309
query8	325	244	219	219
query9	8807	4018	4086	4018
query10	459	344	296	296
query11	5819	2484	2304	2304
query12	187	126	125	125
query13	1290	607	431	431
query14	6121	5494	5223	5223
query14_1	4521	4506	4512	4506
query15	215	204	188	188
query16	957	453	437	437
query17	979	753	626	626
query18	2480	491	366	366
query19	219	212	171	171
query20	141	134	135	134
query21	220	140	120	120
query22	13836	13679	13319	13319
query23	17346	16617	16238	16238
query23_1	16502	16561	16323	16323
query24	7468	1787	1344	1344
query24_1	1318	1326	1334	1326
query25	588	501	448	448
query26	1325	325	181	181
query27	2693	548	352	352
query28	4444	2011	1991	1991
query29	1043	649	528	528
query30	313	242	204	204
query31	1133	1085	959	959
query32	108	82	76	76
query33	569	403	276	276
query34	1192	1118	660	660
query35	781	796	700	700
query36	1413	1436	1285	1285
query37	163	104	95	95
query38	3252	3172	3086	3086
query39	934	935	898	898
query39_1	885	881	869	869
query40	222	141	126	126
query41	64	63	63	63
query42	112	108	108	108
query43	333	330	298	298
query44	
query45	214	204	197	197
query46	1079	1191	736	736
query47	2403	2363	2296	2296
query48	397	416	298	298
query49	617	487	381	381
query50	1031	354	250	250
query51	4359	4294	4254	4254
query52	106	104	92	92
query53	248	277	201	201
query54	308	268	247	247
query55	99	92	90	90
query56	318	307	300	300
query57	1455	1444	1342	1342
query58	293	257	272	257
query59	1625	1712	1502	1502
query60	314	319	319	319
query61	157	156	156	156
query62	697	653	581	581
query63	245	196	212	196
query64	2461	820	620	620
query65	
query66	1751	478	365	365
query67	30137	30139	29973	29973
query68	
query69	464	342	302	302
query70	1048	981	1010	981
query71	301	272	269	269
query72	3047	2735	2407	2407
query73	813	782	450	450
query74	5112	4961	4816	4816
query75	2676	2644	2307	2307
query76	2279	1162	784	784
query77	409	417	345	345
query78	12478	12356	11895	11895
query79	1480	1059	727	727
query80	663	533	459	459
query81	469	287	244	244
query82	1367	164	121	121
query83	362	279	246	246
query84	315	149	110	110
query85	897	552	469	469
query86	413	330	310	310
query87	3467	3402	3244	3244
query88	3641	2751	2734	2734
query89	440	399	346	346
query90	1961	183	185	183
query91	187	175	141	141
query92	84	78	71	71
query93	1526	1438	827	827
query94	534	352	325	325
query95	669	377	442	377
query96	1085	806	337	337
query97	2739	2781	2596	2596
query98	235	250	244	244
query99	1150	1177	1021	1021
Total cold run time: 255314 ms
Total hot run time: 172689 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 78.00% (39/50) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.76% (28028/37999)
Line Coverage 57.69% (304387/527623)
Region Coverage 54.93% (255071/464393)
Branch Coverage 56.38% (110067/195230)

@Mryange
Contributor Author

Mryange commented May 26, 2026

run cloud_p0

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 78.00% (39/50) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.76% (28028/37999)
Line Coverage 57.69% (304387/527623)
Region Coverage 54.93% (255071/464393)
Branch Coverage 56.38% (110067/195230)
