@liaoxin01
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #60148

Problem Summary:

Problem:
When a TabletStream is destroyed without pre_close() having been called (for example, in the on_idle_timeout scenario), the _flush_token destructor calls shutdown(), which triggers deadlock detection if it runs on a thread that belongs to the pool.

Root cause:

  • on_idle_timeout() calls brpc::StreamClose() directly, without calling LoadStream::close()
  • This starts the destruction chain without pre_close() ever being called on the TabletStreams
  • If flush tasks are still running, a TabletStream may be destroyed on a pool thread
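
The failure mode described in the root cause can be illustrated with a minimal sketch. It assumes a hypothetical single-thread worker, `TinyWorker`, whose `shutdown()` refuses to wait when it is called from the worker's own thread. This is not the Doris ThreadPool or ThreadPoolToken API, and returning `false` (instead of failing hard, as a real deadlock check might) is an assumption made only to keep the example testable.

```cpp
#include <functional>
#include <thread>
#include <utility>

// Hypothetical single-thread worker used only for illustration; this is not
// the Doris ThreadPool / ThreadPoolToken API.
class TinyWorker {
public:
    ~TinyWorker() {
        if (_thread.joinable()) _thread.join();
    }

    // Runs `task` on a dedicated thread and marks that thread as ours.
    void start(std::function<void()> task) {
        _thread = std::thread([this, task = std::move(task)] {
            t_current = this;
            task();
            t_current = nullptr;
        });
    }

    // Returns false instead of joining when called from the worker's own
    // thread, because a thread that waits for itself can never finish.
    bool shutdown() {
        if (t_current == this) return false;  // the "deadlock detection"
        if (_thread.joinable()) _thread.join();
        return true;
    }

private:
    // Set only inside the worker thread; stays nullptr on every other thread.
    static thread_local const TinyWorker* t_current;
    std::thread _thread;
};

thread_local const TinyWorker* TinyWorker::t_current = nullptr;

// Calls shutdown() once from inside the worker's task and once from the owner.
// Returns {result seen inside the task, result seen by the owner}.
std::pair<bool, bool> demo_self_shutdown() {
    TinyWorker worker;
    bool inside_result = true;
    worker.start([&] { inside_result = worker.shutdown(); });
    bool outside_result = worker.shutdown();  // joins, so inside_result is final
    return {inside_result, outside_result};
}
```

In this sketch `demo_self_shutdown()` yields `{false, true}`: the call made on the worker thread is rejected, while the call made by the owning thread waits normally. Destroying an object on a pool thread, as in the third bullet above, corresponds to the rejected call.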

Solution:

  • Add IndexStream::~IndexStream(), which calls wait_for_flush_tasks() on every TabletStream
  • Add TabletStream::wait_for_flush_tasks(), which waits for all flush tasks to complete
  • Together these ensure that _flush_token is shut down properly before a TabletStream is destroyed
  • Revert #60148 ([fix](load_stream) Fix use-after-free in TabletStream async lambdas), which introduced shared_from_this, as it is no longer needed
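
The ownership pattern in the solution can be sketched as follows. This is a simplified model under stated assumptions, not the code in be/src/runtime/load_stream.cpp: `TabletLike` and `IndexLike` are hypothetical stand-ins, plain `std::thread` objects replace the flush token, and `submit_flush` and `demo_index_destruction` are invented for the illustration. Only the names `wait_for_flush_tasks` and `_flush_tasks_done` come from the PR.

```cpp
#include <atomic>
#include <chrono>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a tablet stream that owns background flush work.
class TabletLike {
public:
    // Starts one background "flush" that bumps *flushed when it finishes.
    void submit_flush(std::atomic<int>* flushed) {
        std::lock_guard<std::mutex> lock(_mutex);
        _tasks.emplace_back([flushed] {
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
            flushed->fetch_add(1);
        });
    }

    // Blocks until every submitted flush has finished. Safe to call twice:
    // a later call sees _flush_tasks_done and returns immediately.
    void wait_for_flush_tasks() {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_flush_tasks_done) return;
        for (auto& task : _tasks) task.join();
        _flush_tasks_done = true;
    }

private:
    std::mutex _mutex;
    std::vector<std::thread> _tasks;
    bool _flush_tasks_done = false;
};

// Hypothetical stand-in for the owner of several tablet streams.
class IndexLike {
public:
    // The destructor body runs before the members are destroyed, so every
    // tablet has finished flushing by the time _tablets is torn down.
    ~IndexLike() {
        for (auto& tablet : _tablets) tablet->wait_for_flush_tasks();
    }

    TabletLike* add_tablet() {
        _tablets.push_back(std::make_unique<TabletLike>());
        return _tablets.back().get();
    }

private:
    std::vector<std::unique_ptr<TabletLike>> _tablets;
};

// Creates 3 tablets with 2 flushes each, then lets the owner go out of scope
// without any explicit close call. Returns the number of completed flushes.
int demo_index_destruction() {
    std::atomic<int> flushed{0};
    {
        IndexLike index;
        for (int i = 0; i < 3; ++i) {
            TabletLike* tablet = index.add_tablet();
            tablet->submit_flush(&flushed);
            tablet->submit_flush(&flushed);
        }
    }  // ~IndexLike waits here for all six flushes
    return flushed.load();
}
```

In this model, skipping the wait in `~IndexLike` would destroy joinable `std::thread` objects and terminate the program, which loosely mirrors the failing `shutdown()` described in the problem statement. With the wait in place, `demo_index_destruction()` returns 6 even though no explicit close call is made.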

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings January 27, 2026 14:44
@Thearas
Contributor

Thearas commented Jan 27, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

liaoxin01 added a commit to liaoxin01/doris that referenced this pull request Jan 27, 2026
…ait_for_flush_tasks is called before destruction apache#60284

Problem:
When TabletStream is destroyed without pre_close() being called (e.g., on_idle_timeout scenario),
the _flush_token destructor calls shutdown() which triggers deadlock detection if called from
the pool thread.

Root cause:
- on_idle_timeout() directly calls brpc::StreamClose() without calling LoadStream::close()
- This triggers the destruction chain without calling pre_close() on TabletStreams
- If flush tasks are still running, TabletStream may be destroyed in pool thread

Solution:
- Add IndexStream::~IndexStream() to ensure wait_for_flush_tasks() is called on all TabletStreams
- Add TabletStream::wait_for_flush_tasks() to wait for all flush tasks to complete
- This ensures _flush_token is properly handled before TabletStream destruction
- Revert commit c18ef17 (shared_from_this) as it is no longer needed
@liaoxin01
Contributor Author

run buildall

gavinchou previously approved these changes Jan 27, 2026
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jan 27, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.


Copilot AI left a comment

Pull request overview

This PR fixes a deadlock issue that occurs when TabletStream is destroyed without pre_close() being called (e.g., in the on_idle_timeout scenario). The deadlock happened because the flush token's destructor calls shutdown(), which performs deadlock detection and throws if called from the pool thread itself.

Changes:

  • Reverts PR #60148's use of shared_from_this() in async lambdas, replacing it with a raw this pointer
  • Adds TabletStream::wait_for_flush_tasks() to ensure flush tasks complete before destruction
  • Adds IndexStream::~IndexStream() to call wait_for_flush_tasks() on all tablet streams
  • Refactors TabletStream::pre_close() to call the new wait_for_flush_tasks() method

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
be/src/runtime/load_stream.h	Removes std::enable_shared_from_this from TabletStream, adds wait_for_flush_tasks() method and _flush_tasks_done flag, adds IndexStream destructor
be/src/runtime/load_stream.cpp	Implements wait_for_flush_tasks() and IndexStream destructor, reverts lambda captures from shared_from_this() to raw this, refactors pre_close()

@liaoxin01 liaoxin01 force-pushed the fix-load-stream-flush-token-deadlock-master branch from 86e21d6 to c36ec0e Compare January 27, 2026 15:05
@liaoxin01
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32864 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 86e21d6cdc818f195d0e7018ff8f9e00d1d2cc31, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17648	5202	5063	5063
q2	2076	319	213	213
q3	10177	1308	754	754
q4	10205	830	313	313
q5	7515	2211	1898	1898
q6	197	181	151	151
q7	888	735	596	596
q8	9264	1456	1085	1085
q9	5185	4766	5160	4766
q10	6771	1962	1575	1575
q11	518	294	291	291
q12	335	372	225	225
q13	17802	4137	3215	3215
q14	237	238	222	222
q15	906	823	826	823
q16	672	722	625	625
q17	634	832	447	447
q18	6715	6610	7497	6610
q19	1272	1028	656	656
q20	423	360	243	243
q21	2927	2324	2069	2069
q22	1119	1126	1024	1024
Total cold run time: 103486 ms
Total hot run time: 32864 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5636	5737	5508	5508
q2	275	343	254	254
q3	2352	2859	2436	2436
q4	1459	1808	1488	1488
q5	4721	4579	4716	4579
q6	231	182	138	138
q7	2022	1918	1868	1868
q8	2566	2356	2339	2339
q9	7642	7532	7593	7532
q10	2850	3080	2667	2667
q11	526	486	450	450
q12	669	729	598	598
q13	3914	4030	3213	3213
q14	268	284	257	257
q15	853	801	787	787
q16	636	664	638	638
q17	1075	1293	1270	1270
q18	7697	7386	7384	7384
q19	851	800	839	800
q20	2017	2056	1888	1888
q21	4498	4223	4104	4104
q22	1060	1076	962	962
Total cold run time: 53818 ms
Total hot run time: 51160 ms

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Jan 27, 2026
@doris-robot

TPC-H: Total hot run time: 32400 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c36ec0e50c07c1a4a193f586ec2bb833de8b1e7e, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	7828	5031	5012	5012
q2	1600	339	196	196
q3	1264	1287	721	721
q4	2194	794	322	322
q5	2042	1690	1729	1690
q6	238	199	155	155
q7	939	845	647	647
q8	2112	1116	1080	1080
q9	4980	4990	4885	4885
q10	6898	1969	1594	1594
q11	519	302	285	285
q12	307	388	235	235
q13	3577	4077	3217	3217
q14	245	245	238	238
q15	911	825	808	808
q16	672	705	633	633
q17	628	792	485	485
q18	6802	6462	6395	6395
q19	690	985	638	638
q20	399	359	229	229
q21	2607	1959	1944	1944
q22	1053	1000	991	991
Total cold run time: 48505 ms
Total hot run time: 32400 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5342	5303	5287	5287
q2	291	328	265	265
q3	2198	2672	2299	2299
q4	1403	1765	1310	1310
q5	4361	4125	4236	4125
q6	215	183	140	140
q7	1914	1859	1742	1742
q8	2485	2302	2310	2302
q9	7029	7016	7012	7012
q10	2650	2815	2431	2431
q11	538	463	464	463
q12	631	702	585	585
q13	3554	4053	3245	3245
q14	293	298	282	282
q15	857	811	814	811
q16	644	691	641	641
q17	1088	1279	1349	1279
q18	7470	7217	7274	7217
q19	880	855	835	835
q20	1964	2065	1907	1907
q21	4533	4167	4187	4167
q22	1049	1045	1011	1011
Total cold run time: 51389 ms
Total hot run time: 49356 ms

@doris-robot

ClickBench: Total hot run time: 28.29 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c36ec0e50c07c1a4a193f586ec2bb833de8b1e7e, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.05	0.05
query2	0.09	0.05	0.04
query3	0.26	0.08	0.08
query4	1.60	0.11	0.11
query5	0.27	0.25	0.26
query6	1.16	0.67	0.68
query7	0.03	0.02	0.03
query8	0.05	0.04	0.03
query9	0.56	0.51	0.48
query10	0.56	0.55	0.54
query11	0.15	0.09	0.10
query12	0.14	0.10	0.11
query13	0.64	0.61	0.62
query14	1.06	1.06	1.05
query15	0.88	0.86	0.86
query16	0.39	0.39	0.40
query17	1.16	1.09	1.11
query18	0.22	0.22	0.21
query19	2.01	1.97	2.06
query20	0.02	0.01	0.02
query21	15.40	0.25	0.14
query22	5.44	0.05	0.05
query23	16.21	0.27	0.11
query24	1.17	0.46	0.17
query25	0.09	0.07	0.07
query26	0.15	0.14	0.13
query27	0.09	0.04	0.06
query28	3.30	1.15	0.96
query29	12.54	3.95	3.20
query30	0.27	0.15	0.12
query31	2.81	0.64	0.39
query32	3.23	0.60	0.51
query33	3.20	3.30	3.38
query34	16.18	5.45	4.73
query35	4.77	4.85	4.88
query36	0.65	0.50	0.49
query37	0.11	0.07	0.07
query38	0.07	0.04	0.04
query39	0.05	0.03	0.03
query40	0.18	0.16	0.15
query41	0.09	0.03	0.02
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 97.39 s
Total hot run time: 28.29 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 77.78% (28/36) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.72% (19257/36526)
Line Coverage 36.09% (178879/495690)
Region Coverage 32.56% (138818/426357)
Branch Coverage 33.48% (60047/179347)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 86.11% (31/36) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.12% (26181/35804)
Line Coverage 56.17% (277745/494457)
Region Coverage 53.94% (232306/430675)
Branch Coverage 55.59% (100093/180063)

@liaoxin01 liaoxin01 force-pushed the fix-load-stream-flush-token-deadlock-master branch from c36ec0e to cc77be5 Compare January 28, 2026 01:38
@liaoxin01
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 33039 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit cc77be5a6cda491b88c7772015000e6620a1b236, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17748	5216	5060	5060
q2	2009	312	198	198
q3	10217	1353	763	763
q4	10219	950	335	335
q5	7566	2111	1986	1986
q6	224	183	151	151
q7	900	746	616	616
q8	9265	1374	1237	1237
q9	5112	4849	4810	4810
q10	6777	1937	1571	1571
q11	500	296	270	270
q12	342	396	234	234
q13	17770	4039	3286	3286
q14	238	250	220	220
q15	888	811	829	811
q16	687	670	634	634
q17	643	801	469	469
q18	6817	6690	6524	6524
q19	1238	1010	646	646
q20	415	362	238	238
q21	2753	2119	2001	2001
q22	1044	1028	979	979
Total cold run time: 103372 ms
Total hot run time: 33039 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5333	5321	5353	5321
q2	264	344	263	263
q3	2171	2673	2285	2285
q4	1391	1741	1316	1316
q5	4251	4258	4761	4258
q6	263	213	175	175
q7	2277	2072	1782	1782
q8	2581	2454	2432	2432
q9	7543	7564	7497	7497
q10	2817	3001	2679	2679
q11	566	525	449	449
q12	707	755	592	592
q13	3813	4226	3599	3599
q14	302	338	317	317
q15	862	810	826	810
q16	681	727	672	672
q17	1126	1360	1373	1360
q18	8590	8087	7926	7926
q19	939	896	928	896
q20	2092	2253	2004	2004
q21	4876	4573	4295	4295
q22	1075	1011	962	962
Total cold run time: 54520 ms
Total hot run time: 51890 ms

@doris-robot

ClickBench: Total hot run time: 28.37 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit cc77be5a6cda491b88c7772015000e6620a1b236, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.04	0.05
query2	0.11	0.05	0.04
query3	0.26	0.09	0.08
query4	1.60	0.11	0.12
query5	0.27	0.24	0.25
query6	1.16	0.67	0.67
query7	0.04	0.03	0.02
query8	0.05	0.04	0.04
query9	0.58	0.53	0.52
query10	0.58	0.56	0.57
query11	0.15	0.10	0.10
query12	0.14	0.11	0.10
query13	0.63	0.62	0.61
query14	1.05	1.04	1.04
query15	0.88	0.86	0.88
query16	0.40	0.40	0.39
query17	1.17	1.10	1.13
query18	0.23	0.21	0.21
query19	2.13	2.03	2.01
query20	0.02	0.02	0.02
query21	15.40	0.26	0.14
query22	5.28	0.05	0.06
query23	16.10	0.27	0.11
query24	2.46	0.28	0.72
query25	0.12	0.13	0.06
query26	0.15	0.13	0.13
query27	0.06	0.06	0.04
query28	4.67	1.12	0.96
query29	12.55	3.94	3.17
query30	0.27	0.12	0.14
query31	2.82	0.63	0.41
query32	3.23	0.59	0.51
query33	3.23	3.22	3.25
query34	16.19	5.42	4.75
query35	4.78	4.79	4.83
query36	0.69	0.50	0.49
query37	0.11	0.07	0.06
query38	0.07	0.04	0.04
query39	0.05	0.03	0.03
query40	0.18	0.17	0.15
query41	0.09	0.04	0.03
query42	0.05	0.03	0.04
query43	0.04	0.04	0.04
Total cold run time: 100.1 s
Total hot run time: 28.37 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 78.38% (29/37) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.76% (19270/36526)
Line Coverage 36.11% (179000/495690)
Region Coverage 32.56% (138810/426357)
Branch Coverage 33.50% (60086/179347)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 86.49% (32/37) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.23% (26219/35804)
Line Coverage 56.29% (278322/494457)
Region Coverage 54.12% (233069/430675)
Branch Coverage 55.76% (100400/180063)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 86.49% (32/37) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.23% (26219/35804)
Line Coverage 56.29% (278326/494457)
Region Coverage 54.12% (233085/430675)
Branch Coverage 55.76% (100401/180063)

yiguolei pushed a commit that referenced this pull request Jan 28, 2026
…ait_for_flush_tasks is called before destruction #60284 (#60285)

pick from:#60284