@XLPE XLPE commented Jun 19, 2025

What problem does this PR solve?

Issue Number: close #51941

Related PR: #xxx

Problem Summary:
Every time an FE node restarts, the following exception keeps appearing.

2025-06-19 10:59:51,722 WARN (Manual Analysis Job Executor-1|12257) [StatisticsCache.sendStats():241] Failed to sync stats to follower: TNetworkAddress(hostname:192.168.71.14, port:9520)
org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
	at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:216) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:73) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.doris.thrift.FrontendService$Client.sendUpdateStatsCache(FrontendService.java:1370) ~[fe-common-1.2-SNAPSHOT.jar:1.2-SNAPSHOT]
	at org.apache.doris.thrift.FrontendService$Client.updateStatsCache(FrontendService.java:1362) ~[fe-common-1.2-SNAPSHOT.jar:1.2-SNAPSHOT]
	at org.apache.doris.statistics.StatisticsCache.sendStats(StatisticsCache.java:239) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.statistics.StatisticsCache.syncColStats(StatisticsCache.java:229) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.statistics.BaseAnalysisTask.runQuery(BaseAnalysisTask.java:309) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.statistics.OlapAnalysisTask.doSample(OlapAnalysisTask.java:136) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.statistics.OlapAnalysisTask.doExecute(OlapAnalysisTask.java:96) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.statistics.BaseAnalysisTask.execute(BaseAnalysisTask.java:175) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.statistics.AnalysisTaskWrapper.lambda$new$0(AnalysisTaskWrapper.java:43) ~[doris-fe.jar:1.2-SNAPSHOT]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at org.apache.doris.statistics.AnalysisTaskWrapper.run(AnalysisTaskWrapper.java:66) ~[doris-fe.jar:1.2-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: java.net.SocketException: Broken pipe
	at sun.nio.ch.NioSocketImpl.implWrite(NioSocketImpl.java:420) ~[?:?]
	at sun.nio.ch.NioSocketImpl.write(NioSocketImpl.java:440) ~[?:?]
	at sun.nio.ch.NioSocketImpl$2.write(NioSocketImpl.java:826) ~[?:?]
	at java.net.Socket$SocketOutputStream.write(Socket.java:1035) ~[?:?]
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81) ~[?:?]
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142) ~[?:?]
	at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:211) ~[libthrift-0.16.0.jar:0.16.0]
	... 18 more

There are two root causes. First, the client does not catch IO exceptions, so connections that have already failed are returned to the pool and reused. Second, the isOpen() validation method relies on isConnected(), which cannot detect closed or broken connections.
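The isConnected() limitation can be reproduced with plain java.net.Socket, independent of Thrift. The following is a minimal sketch under stated assumptions, not the code from this PR: the class name StaleSocketDemo and the helpers weakIsOpen and strictIsOpen are hypothetical. It relies only on the documented behavior that Socket.isConnected() keeps returning true for a socket that was connected before it was closed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class StaleSocketDemo {

    // Weak check: only asks whether connect() ever succeeded on this socket.
    static boolean weakIsOpen(Socket s) {
        return s.isConnected();
    }

    // Hypothetical stricter check: also rejects locally closed or shut-down sockets.
    static boolean strictIsOpen(Socket s) {
        return s.isConnected() && !s.isClosed()
                && !s.isInputShutdown() && !s.isOutputShutdown();
    }

    // Opens a loopback connection, closes the client side, and returns
    // { weakIsOpen, strictIsOpen } as observed after the close.
    static boolean[] checkAfterClose() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            client.close();
            return new boolean[] { weakIsOpen(client), strictIsOpen(client) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = checkAfterClose();
        System.out.println("weak check says open:   " + r[0]);
        System.out.println("strict check says open: " + r[1]);
    }
}
```

Even the stricter check only sees local state. When the peer goes away (for example, when an FE restarts), the local socket still reports itself as connected and not closed until a read or write fails, so the failing client also has to be discarded at the point where the IO exception is caught.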

This pull request examines and resolves all related issues.
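To illustrate the pooling side of the problem, the general pattern is to return a client to the pool only after a successful call and to invalidate it otherwise. The sketch below is a hypothetical, self-contained model: FakeClient, SimplePool, and callOnce are made-up names, and it does not reproduce the Doris client pool API or the actual change in this PR.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

class PoolInvalidateDemo {

    // Hypothetical stand-in for a pooled RPC client.
    static final class FakeClient {
        boolean broken;   // simulates a connection whose peer has gone away
        boolean closed;

        void call() throws IOException {
            if (broken) {
                throw new IOException("Broken pipe");
            }
        }

        void close() {
            closed = true;
        }
    }

    // Hypothetical minimal pool: idle clients are reused, invalidated ones are closed.
    static final class SimplePool {
        final Deque<FakeClient> idle = new ArrayDeque<>();

        FakeClient borrow() {
            FakeClient c = idle.poll();
            return c != null ? c : new FakeClient();
        }

        void giveBack(FakeClient c) {
            idle.push(c);
        }

        void invalidate(FakeClient c) {
            c.close();
        }
    }

    // Runs one call. The client goes back to the pool only if the call succeeded;
    // after an IO failure it is invalidated so it can never be borrowed again.
    static boolean callOnce(SimplePool pool, FakeClient client) {
        boolean ok = false;
        try {
            client.call();
            ok = true;
        } catch (IOException e) {
            // A real caller would log the failure here; ok stays false.
        } finally {
            if (ok) {
                pool.giveBack(client);
            } else {
                pool.invalidate(client);
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        SimplePool pool = new SimplePool();
        FakeClient bad = pool.borrow();
        bad.broken = true;
        System.out.println("call succeeded: " + callOnce(pool, bad));
        System.out.println("idle clients:   " + pool.idle.size());
    }
}
```

The key design choice is the ok flag in the finally block: any exit path that did not complete the call, including an exception, ends in invalidation rather than a return to the pool.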

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Thearas commented Jun 19, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

XLPE commented Jun 19, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 34652 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a0b7abc080f6f9f5c6a031f2b9dce7fa6368c070, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	min hot (ms)
q1	17574	5207	5034	5034
q2	1937	296	199	199
q3	10273	1263	739	739
q4	10230	1016	534	534
q5	7545	2301	2349	2301
q6	179	172	134	134
q7	892	750	600	600
q8	9319	1486	1211	1211
q9	6685	5063	5092	5063
q10	6961	2360	2024	2024
q11	497	291	288	288
q12	364	369	223	223
q13	18366	3718	3073	3073
q14	231	223	216	216
q15	555	490	464	464
q16	425	431	379	379
q17	602	906	368	368
q18	7461	7287	7101	7101
q19	1971	937	573	573
q20	328	331	218	218
q21	4019	3222	2959	2959
q22	986	991	951	951
Total cold run time: 107400 ms
Total hot run time: 34652 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	min hot (ms)
q1	5287	5156	5110	5110
q2	248	321	228	228
q3	2137	2657	2326	2326
q4	1372	1799	1357	1357
q5	4227	4239	4404	4239
q6	208	168	129	129
q7	1997	1928	1817	1817
q8	2588	2648	2562	2562
q9	7038	7046	7066	7046
q10	3110	3256	2887	2887
q11	566	504	507	504
q12	702	776	645	645
q13	3591	3900	3418	3418
q14	273	286	273	273
q15	522	475	484	475
q16	431	487	450	450
q17	1171	1479	1421	1421
q18	7315	7240	7042	7042
q19	812	786	874	786
q20	1879	1985	1835	1835
q21	4850	4334	4359	4334
q22	1006	1014	984	984
Total cold run time: 51330 ms
Total hot run time: 49868 ms

@doris-robot

TPC-DS: Total hot run time: 185114 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a0b7abc080f6f9f5c6a031f2b9dce7fa6368c070, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	min hot (ms)
query1	1012	405	401	401
query2	6552	1861	1861	1861
query3	6751	227	224	224
query4	26390	23922	22893	22893
query5	4395	631	470	470
query6	296	210	199	199
query7	4627	499	292	292
query8	273	225	226	225
query9	8626	2612	2620	2612
query10	512	358	277	277
query11	15251	15103	14772	14772
query12	150	106	103	103
query13	1637	528	390	390
query14	9231	6172	6172	6172
query15	203	187	177	177
query16	7232	616	464	464
query17	1174	712	571	571
query18	1975	412	313	313
query19	188	181	156	156
query20	120	122	115	115
query21	212	122	109	109
query22	3948	4240	3975	3975
query23	33840	33130	33286	33130
query24	8435	2413	2397	2397
query25	524	462	396	396
query26	1226	267	155	155
query27	2782	519	351	351
query28	4295	2081	2050	2050
query29	779	547	439	439
query30	286	222	196	196
query31	934	828	736	736
query32	73	63	65	63
query33	566	361	311	311
query34	808	857	527	527
query35	776	815	732	732
query36	968	977	907	907
query37	111	101	81	81
query38	4095	4061	4064	4061
query39	1483	1431	1424	1424
query40	224	124	115	115
query41	75	71	67	67
query42	138	116	116	116
query43	528	522	481	481
query44	1363	847	829	829
query45	185	182	171	171
query46	858	1027	631	631
query47	1753	1782	1692	1692
query48	397	469	328	328
query49	768	511	407	407
query50	677	685	430	430
query51	4179	4147	4033	4033
query52	112	105	97	97
query53	225	253	191	191
query54	574	581	498	498
query55	86	81	84	81
query56	326	296	284	284
query57	1183	1195	1132	1132
query58	270	262	253	253
query59	2648	2694	2665	2665
query60	330	326	313	313
query61	131	128	131	128
query62	802	741	633	633
query63	232	192	185	185
query64	4338	1044	674	674
query65	4275	4168	4232	4168
query66	1129	405	317	317
query67	15677	15490	15210	15210
query68	7798	881	524	524
query69	460	307	274	274
query70	1174	1135	1076	1076
query71	429	333	299	299
query72	5761	4734	4768	4734
query73	657	600	356	356
query74	8853	9150	8890	8890
query75	3719	3220	2702	2702
query76	3459	1187	751	751
query77	756	417	291	291
query78	9991	10118	9267	9267
query79	2286	807	587	587
query80	644	532	443	443
query81	499	257	228	228
query82	458	128	97	97
query83	267	250	241	241
query84	245	102	97	97
query85	851	361	318	318
query86	386	318	310	310
query87	4561	4410	4282	4282
query88	3740	2327	2282	2282
query89	385	322	291	291
query90	1936	219	216	216
query91	139	152	116	116
query92	83	64	60	60
query93	1866	957	578	578
query94	679	406	310	310
query95	369	305	283	283
query96	497	567	273	273
query97	2676	2755	2615	2615
query98	261	208	209	208
query99	1434	1375	1259	1259
Total cold run time: 273538 ms
Total hot run time: 185114 ms

@doris-robot

ClickBench: Total hot run time: 29.51 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit a0b7abc080f6f9f5c6a031f2b9dce7fa6368c070, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.04	0.03	0.03
query2	0.07	0.04	0.03
query3	0.24	0.07	0.07
query4	1.61	0.11	0.10
query5	0.43	0.41	0.42
query6	1.18	0.66	0.66
query7	0.03	0.02	0.01
query8	0.05	0.04	0.03
query9	0.59	0.51	0.52
query10	0.57	0.58	0.57
query11	0.16	0.11	0.11
query12	0.14	0.12	0.12
query13	0.63	0.61	0.61
query14	0.80	0.82	0.83
query15	0.91	0.87	0.88
query16	0.38	0.40	0.39
query17	1.08	1.07	1.05
query18	0.24	0.21	0.22
query19	1.95	1.84	1.84
query20	0.01	0.01	0.02
query21	15.39	0.88	0.55
query22	0.76	1.20	0.84
query23	14.71	1.39	0.64
query24	6.67	2.60	0.45
query25	0.41	0.17	0.19
query26	0.66	0.16	0.15
query27	0.09	0.06	0.05
query28	9.57	0.88	0.44
query29	12.60	3.95	3.35
query30	0.25	0.09	0.07
query31	2.83	0.60	0.40
query32	3.24	0.56	0.46
query33	3.06	3.05	3.14
query34	15.85	5.37	4.82
query35	4.77	4.84	4.82
query36	0.69	0.51	0.49
query37	0.09	0.07	0.06
query38	0.05	0.04	0.03
query39	0.03	0.03	0.02
query40	0.17	0.15	0.13
query41	0.08	0.03	0.02
query42	0.03	0.02	0.03
query43	0.04	0.03	0.04
Total cold run time: 103.15 s
Total hot run time: 29.51 s

@github-actions github-actions bot added the approved label (Indicates a PR has been approved by one committer.) Jun 19, 2025
@github-actions

PR approved by at least one committer and no changes requested.

@github-actions

PR approved by anyone and no changes requested.

@xy720 xy720 left a comment

LGTM

@github-actions

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and feel free to ask a maintainer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 18, 2025
@github-actions github-actions bot closed this Dec 28, 2025

Labels

approved (Indicates a PR has been approved by one committer.), reviewed, Stale

Development

Successfully merging this pull request may close these issues.

[Bug] broken pipe when sync stats to follower