Skip to content

[fix](be) Keep prefetch reader alive for async tasks#63796

Merged
Gabriel39 merged 2 commits into
apache:masterfrom
Gabriel39:fix_0528
May 29, 2026
Merged

[fix](be) Keep prefetch reader alive for async tasks#63796
Gabriel39 merged 2 commits into
apache:masterfrom
Gabriel39:fix_0528

Conversation

@Gabriel39
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: close #25509

Related PR: #61248

Problem Summary: S3/OSS prefetch timeout can cancel and close PrefetchBufferedReader while an async PrefetchBuffer task is still running. The task kept PrefetchBuffer alive but only stored the underlying FileReader as a raw pointer, so the owner could destroy the reader before the async task resumed on the error path and logged reader metadata. Keep a shared FileReader reference in each PrefetchBuffer so the async prefetch task cannot outlive the reader it dereferences, and add a unit test that covers close timeout while the prefetch read is blocked.

Release note

None

Check List (For Author)

  • Test: git diff --check; git diff --cached --check; attempted build-support/clang-format.sh and build-support/check-format.sh but llvm@16/clang-format is not installed; attempted ./run-be-ut.sh --run buffered_reader_test but it failed before running tests because JAVA_HOME points to JDK 11 and JDK_17 is not set
  • Behavior changed: No
  • Does this need documentation: No

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

### What problem does this PR solve?

Issue Number: close apache#25509

Related PR: apache#61248

Problem Summary: S3/OSS prefetch timeout can cancel and close PrefetchBufferedReader while an async PrefetchBuffer task is still running. The task kept PrefetchBuffer alive but only stored the underlying FileReader as a raw pointer, so the owner could destroy the reader before the async task resumed on the error path and logged reader metadata. Keep a shared FileReader reference in each PrefetchBuffer so the async prefetch task cannot outlive the reader it dereferences, and add a unit test that covers close timeout while the prefetch read is blocked.

### Release note

None

### Check List (For Author)

- Test: git diff --check; git diff --cached --check; attempted build-support/clang-format.sh and build-support/check-format.sh but llvm@16/clang-format is not installed; attempted ./run-be-ut.sh --run buffered_reader_test but it failed before running tests because JAVA_HOME points to JDK 11 and JDK_17 is not set
- Behavior changed: No
- Does this need documentation: No
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Gabriel39
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary
No blocking findings found in this PR. The change keeps the underlying FileReader alive through pending prefetch tasks by storing a FileReaderSPtr in each PrefetchBuffer, which matches the reported async lifetime failure path.

Critical checkpoint conclusions:

  • Goal and tests: The PR addresses the close-timeout/use-after-destroy path for async prefetch tasks, and adds a BE unit test covering a blocked prefetch read followed by close timeout.
  • Scope: The implementation is small and focused on ownership lifetime plus the targeted test.
  • Concurrency: The touched path is concurrent between query/close threads and the prefetch thread pool. Existing mutex/condition-variable state transitions are preserved; the new shared ownership only extends reader lifetime and does not add a new lock order.
  • Lifecycle: The main lifecycle concern is the async task outliving PrefetchBufferedReader; the shared pointer in PrefetchBuffer covers the reader dereferences after owner destruction. No static/global initialization concern was introduced.
  • Configuration: No new configuration item was added. The test temporarily changes buffered_reader_read_timeout_ms and restores it.
  • Compatibility: No storage format, RPC, function-symbol, or FE/BE protocol compatibility impact found.
  • Parallel paths: The modified constructor is the only PrefetchBuffer creation path found.
  • Conditional checks: No new special conditional behavior was introduced in production code.
  • Test coverage: Coverage is targeted to the reported lifetime race; I did not identify an additional missing negative case that should block this PR.
  • Test results: I reviewed the added test logic but did not run BE UT in this review environment.
  • Observability: Existing warning logs on prefetch failure retain path/offset/status context; no new observability appears required.
  • Transaction/persistence/data-write correctness: Not applicable to this IO reader lifetime change.
  • FE/BE variable passing: Not applicable.
  • Performance: The extra shared pointer copies are bounded by the small prefetch-buffer count and are not on the per-read hot copy path; no material performance issue found.

User focus points: No additional user-provided review focus was present.

@Gabriel39
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 32013 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 76fa3b3e9c38c25e0cab80528b2d942005be712c, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17823	4202	4134	4134
q2	q3	10794	1377	832	832
q4	4688	483	348	348
q5	7628	2256	2164	2164
q6	278	179	147	147
q7	917	814	641	641
q8	9537	1890	1675	1675
q9	6856	4989	4990	4989
q10	6433	2240	1885	1885
q11	436	282	247	247
q12	687	423	307	307
q13	18210	3445	2813	2813
q14	273	261	242	242
q15	q16	827	781	716	716
q17	941	953	1022	953
q18	6820	5647	5548	5548
q19	1216	1450	1097	1097
q20	509	439	288	288
q21	6084	2835	2671	2671
q22	447	360	316	316
Total cold run time: 101404 ms
Total hot run time: 32013 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4835	5043	4867	4867
q2	q3	4941	5388	4697	4697
q4	2129	2195	1438	1438
q5	4992	4739	4608	4608
q6	234	180	128	128
q7	1867	1766	1619	1619
q8	2271	1973	1937	1937
q9	7433	7409	7492	7409
q10	4768	4723	4247	4247
q11	545	383	356	356
q12	731	742	534	534
q13	3031	3398	2757	2757
q14	278	278	261	261
q15	q16	690	710	637	637
q17	1319	1269	1258	1258
q18	7271	6804	6790	6790
q19	1128	1080	1119	1080
q20	2230	2216	1960	1960
q21	5324	4582	4465	4465
q22	517	474	418	418
Total cold run time: 56534 ms
Total hot run time: 51466 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 172326 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 76fa3b3e9c38c25e0cab80528b2d942005be712c, data reload: false

query5	4339	679	538	538
query6	381	223	204	204
query7	4641	551	317	317
query8	329	240	232	232
query9	8810	4155	4120	4120
query10	434	353	282	282
query11	5771	2620	2165	2165
query12	188	129	130	129
query13	1285	576	440	440
query14	6144	5501	5191	5191
query14_1	4537	4510	4466	4466
query15	210	203	183	183
query16	992	449	440	440
query17	1114	725	587	587
query18	2748	486	354	354
query19	215	203	153	153
query20	143	130	130	130
query21	206	144	115	115
query22	13625	13592	13573	13573
query23	17230	16531	16369	16369
query23_1	16359	16459	16365	16365
query24	7507	1783	1311	1311
query24_1	1292	1320	1335	1320
query25	556	475	408	408
query26	1293	334	170	170
query27	2660	582	345	345
query28	4442	2047	1997	1997
query29	981	629	503	503
query30	305	241	198	198
query31	1139	1084	956	956
query32	89	76	74	74
query33	540	346	297	297
query34	1194	1153	656	656
query35	772	796	702	702
query36	1460	1403	1347	1347
query37	155	100	87	87
query38	3261	3156	3058	3058
query39	942	926	888	888
query39_1	874	881	875	875
query40	239	147	124	124
query41	69	61	64	61
query42	110	113	110	110
query43	328	332	297	297
query44	
query45	216	205	204	204
query46	1079	1178	734	734
query47	2374	2340	2251	2251
query48	377	416	300	300
query49	628	506	381	381
query50	992	347	250	250
query51	4531	4409	4282	4282
query52	105	104	93	93
query53	258	279	202	202
query54	320	268	266	266
query55	95	94	87	87
query56	305	314	327	314
query57	1466	1423	1319	1319
query58	307	276	284	276
query59	1618	1623	1453	1453
query60	334	351	318	318
query61	179	192	185	185
query62	702	645	579	579
query63	246	200	219	200
query64	2395	882	707	707
query65	
query66	1694	491	378	378
query67	29812	29792	29599	29599
query68	
query69	487	353	320	320
query70	1042	1013	1035	1013
query71	316	276	278	276
query72	3220	2880	2326	2326
query73	851	759	459	459
query74	5089	4974	4797	4797
query75	2674	2616	2280	2280
query76	2285	1136	764	764
query77	405	406	335	335
query78	12382	12437	11892	11892
query79	1464	1051	773	773
query80	1338	554	464	464
query81	520	285	242	242
query82	976	159	125	125
query83	354	309	253	253
query84	263	142	106	106
query85	932	536	515	515
query86	468	326	317	317
query87	3446	3416	3257	3257
query88	3618	2797	2721	2721
query89	483	392	351	351
query90	1907	179	176	176
query91	184	174	151	151
query92	90	81	74	74
query93	1539	1492	834	834
query94	767	307	327	307
query95	692	470	366	366
query96	1101	763	349	349
query97	2750	2743	2565	2565
query98	243	232	236	232
query99	1176	1155	1025	1025
Total cold run time: 255966 ms
Total hot run time: 172326 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.80% (28083/38054)
Line Coverage 57.78% (305324/528418)
Region Coverage 54.94% (255604/465237)
Branch Coverage 56.44% (110414/195621)

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 29, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

@Gabriel39 Gabriel39 merged commit 0bd933c into apache:master May 29, 2026
30 of 31 checks passed
github-actions Bot pushed a commit that referenced this pull request May 29, 2026
Problem Summary: S3/OSS prefetch timeout can cancel and close
PrefetchBufferedReader while an async PrefetchBuffer task is still
running. The task kept PrefetchBuffer alive but only stored the
underlying FileReader as a raw pointer, so the owner could destroy the
reader before the async task resumed on the error path and logged reader
metadata. Keep a shared FileReader reference in each PrefetchBuffer so
the async prefetch task cannot outlive the reader it dereferences, and
add a unit test that covers close timeout while the prefetch read is
blocked.
github-actions Bot pushed a commit that referenced this pull request May 29, 2026
Problem Summary: S3/OSS prefetch timeout can cancel and close
PrefetchBufferedReader while an async PrefetchBuffer task is still
running. The task kept PrefetchBuffer alive but only stored the
underlying FileReader as a raw pointer, so the owner could destroy the
reader before the async task resumed on the error path and logged reader
metadata. Keep a shared FileReader reference in each PrefetchBuffer so
the async prefetch task cannot outlive the reader it dereferences, and
add a unit test that covers close timeout while the prefetch read is
blocked.
yiguolei pushed a commit that referenced this pull request May 29, 2026
… (#63867)

Cherry-picked from #63796

Co-authored-by: Gabriel <liwenqiang@selectdb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.0.x dev/4.1.2-merged reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants