branch-3.0: [fix](tabletScheduler) Fix addTablet dead lock in tabletScheduler #45298 #45768

Merged
dataroaring merged 1 commit into branch-3.0 from auto-pick-45298-branch-3.0
Dec 25, 2024
@github-actions
Contributor

Cherry-picked from #45298

…5298)

Triggering the bug requires both of the following conditions; the second
is relatively hard to hit:
1. More than 2000 tablets are waiting to be fixed (in the pending
queue);
2. The lowest-priority scheduling entry in the pending queue has
previously experienced a clone failure, with fewer than 3 failures, and
has been put back into the pending queue. In addition, a new scheduling
request arrives that happens to belong to the same table as that entry
and has a higher priority than it.

The fix is to acquire the lock with tryLock when finalizing the
TabletSchedCtx (finalizeTabletCtx). If the lock cannot be obtained, the
current scheduling attempt fails and is rescheduled next time.
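
The tryLock approach described above can be illustrated with a minimal, standalone sketch. It is not the code from this PR: the class `TryLockFinalizeSketch`, the helper `tryReleaseResource`, the `tableLock` field, and the timeout values are all hypothetical names and numbers chosen for illustration. It only shows the general pattern of replacing a blocking `writeLock().lock()` with a bounded `tryLock`, so that a caller that already holds another monitor gives up instead of parking forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in; none of these names come from the Doris code base.
final class TryLockFinalizeSketch {
    // Stand-in for a table's fair read-write lock.
    static final ReentrantReadWriteLock tableLock = new ReentrantReadWriteLock(true);

    /**
     * Tries to take the write lock within timeoutMs and run the release action.
     * Returns false (instead of blocking) when the lock is not available, so the
     * caller can fail the current attempt and retry later.
     */
    static boolean tryReleaseResource(long timeoutMs, Runnable release) {
        boolean locked = false;
        try {
            locked = tableLock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (!locked) {
                return false; // lock busy: give up this round
            }
            release.run();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        } finally {
            if (locked) {
                tableLock.writeLock().unlock();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // No contention: the write lock is acquired and the action runs.
        System.out.println("uncontended: " + tryReleaseResource(100, () -> { }));

        // Contention: another thread holds the read lock while we try to write.
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            tableLock.readLock().lock();
            try {
                held.countDown();
                done.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                tableLock.readLock().unlock();
            }
        });
        holder.start();
        held.await();
        System.out.println("contended: " + tryReleaseResource(100, () -> { }));
        done.countDown();
        holder.join();
    }
}
```

In this sketch the contended call returns `false` after the timeout rather than waiting indefinitely, which is the behavior the fix relies on: the failed attempt can simply be retried in a later scheduling round.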


Fix
```
"colocate group clone checker" #7557 daemon prio=5 os_prio=0 cpu=686.24ms elapsed=6719.45s tid=0x00007f3e6c039ab0 nid=0x17b08 waiting on condition  [0x00007f3ec77fe000]
(1 similar threads)
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
        - parking to wait for  <0x000010014d223908> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base@17.0.2/LockSupport.java:211)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:715)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:938)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base@17.0.2/ReentrantReadWriteLock.java:959)
        at org.apache.doris.common.lock.MonitoredReentrantReadWriteLock$WriteLock.lock(MonitoredReentrantReadWriteLock.java:98)
        at org.apache.doris.catalog.Table.writeLockIfExist(Table.java:211)
        at org.apache.doris.clone.TabletSchedCtx.releaseResource(TabletSchedCtx.java:940)
        at org.apache.doris.clone.TabletSchedCtx.releaseResource(TabletSchedCtx.java:898)
        at org.apache.doris.clone.TabletScheduler.releaseTabletCtx(TabletScheduler.java:1743)
        at org.apache.doris.clone.TabletScheduler.finalizeTabletCtx(TabletScheduler.java:1625)
        at org.apache.doris.clone.TabletScheduler.addTablet(TabletScheduler.java:287)
        - locked <0x0000100009429110> (a org.apache.doris.clone.TabletScheduler)
        at org.apache.doris.clone.ColocateTableCheckerAndBalancer.matchGroups(ColocateTableCheckerAndBalancer.java:563)
        at org.apache.doris.clone.ColocateTableCheckerAndBalancer.runAfterCatalogReady(ColocateTableCheckerAndBalancer.java:340)
        at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58)
        at org.apache.doris.common.util.Daemon.run(Daemon.java:119)
```
@Thearas
Contributor

Thearas commented Dec 23, 2024

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring reopened this Dec 23, 2024
@Thearas
Contributor

Thearas commented Dec 23, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 40470 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6c11addc4a7a718254696eceba4c7f0acef688a5, data reload: false

------ Round 1 ----------------------------------
q1	17762	7361	7240	7240
q2	2044	191	166	166
q3	10711	1145	1212	1145
q4	10556	770	728	728
q5	7762	2782	2790	2782
q6	236	148	152	148
q7	954	615	600	600
q8	9574	1926	1925	1925
q9	8000	6395	6427	6395
q10	7022	2287	2262	2262
q11	456	265	259	259
q12	392	212	210	210
q13	17787	2924	2975	2924
q14	241	217	209	209
q15	548	515	529	515
q16	694	606	595	595
q17	968	624	571	571
q18	7181	6699	6497	6497
q19	1896	1046	1056	1046
q20	472	198	201	198
q21	4000	3093	3080	3080
q22	1036	990	975	975
Total cold run time: 110292 ms
Total hot run time: 40470 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7339	7192	7206	7192
q2	319	228	228	228
q3	2978	2853	2856	2853
q4	2044	1739	1763	1739
q5	5656	5660	5682	5660
q6	224	146	150	146
q7	2177	1778	1785	1778
q8	3278	3519	3464	3464
q9	8776	8830	8792	8792
q10	3514	3515	3494	3494
q11	587	504	503	503
q12	799	575	611	575
q13	16590	3131	3095	3095
q14	297	270	271	270
q15	568	512	516	512
q16	708	676	672	672
q17	1840	1615	1569	1569
q18	8274	7781	7557	7557
q19	5751	1580	1541	1541
q20	2101	1867	1828	1828
q21	5469	5208	5284	5208
q22	1137	1032	1022	1022
Total cold run time: 80426 ms
Total hot run time: 59698 ms

@doris-robot

TPC-DS: Total hot run time: 195541 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6c11addc4a7a718254696eceba4c7f0acef688a5, data reload: false

query1	1251	923	911	911
query2	6241	2139	2126	2126
query3	10850	4228	4129	4129
query4	66485	28878	23273	23273
query5	5428	456	450	450
query6	504	175	184	175
query7	6006	311	301	301
query8	330	238	225	225
query9	9278	2694	2684	2684
query10	504	278	254	254
query11	17700	15208	15705	15208
query12	158	103	104	103
query13	1560	426	412	412
query14	10357	7393	6708	6708
query15	221	171	174	171
query16	7647	478	457	457
query17	1054	550	555	550
query18	1903	303	299	299
query19	201	157	148	148
query20	117	109	106	106
query21	70	44	42	42
query22	4827	4711	4731	4711
query23	34467	34098	33861	33861
query24	5942	2901	2881	2881
query25	513	417	388	388
query26	674	171	164	164
query27	1738	289	296	289
query28	4468	2551	2506	2506
query29	680	442	427	427
query30	245	166	163	163
query31	1014	810	807	807
query32	66	54	51	51
query33	409	277	278	277
query34	897	485	499	485
query35	828	746	765	746
query36	1081	970	974	970
query37	115	72	68	68
query38	4039	4057	4052	4052
query39	1490	1508	1491	1491
query40	142	81	83	81
query41	49	45	45	45
query42	116	97	100	97
query43	535	496	498	496
query44	1164	799	825	799
query45	184	169	166	166
query46	1149	725	715	715
query47	1947	1884	1946	1884
query48	460	369	374	369
query49	737	377	366	366
query50	828	413	436	413
query51	7379	7184	7166	7166
query52	100	93	88	88
query53	253	183	184	183
query54	559	451	440	440
query55	74	75	76	75
query56	238	224	220	220
query57	1183	1117	1060	1060
query58	200	201	201	201
query59	3162	2828	2938	2828
query60	284	256	259	256
query61	131	128	132	128
query62	759	658	665	658
query63	209	184	187	184
query64	1583	742	718	718
query65	3269	3145	3159	3145
query66	717	307	310	307
query67	15756	15429	15310	15310
query68	4474	547	531	531
query69	431	267	266	266
query70	1166	1157	1077	1077
query71	325	268	249	249
query72	6741	4081	3918	3918
query73	738	342	334	334
query74	10133	8876	8854	8854
query75	3314	2620	2586	2586
query76	1849	980	1128	980
query77	475	258	262	258
query78	10749	9734	9527	9527
query79	7631	595	594	594
query80	2008	422	414	414
query81	554	239	243	239
query82	1223	116	118	116
query83	249	143	141	141
query84	283	78	79	78
query85	1683	302	301	301
query86	481	303	286	286
query87	4464	4348	4308	4308
query88	5429	2381	2380	2380
query89	412	292	290	290
query90	2059	185	188	185
query91	184	144	145	144
query92	67	48	50	48
query93	6504	527	527	527
query94	895	294	297	294
query95	341	246	244	244
query96	600	284	286	284
query97	3338	3114	3177	3114
query98	223	207	209	207
query99	1609	1300	1302	1300
Total cold run time: 335157 ms
Total hot run time: 195541 ms

@dataroaring dataroaring merged commit 7da76ae into branch-3.0 Dec 25, 2024
@github-actions github-actions bot deleted the auto-pick-45298-branch-3.0 branch December 25, 2024 01:46