
[Fix](cloud-mow) Fix correctness problem when there exists other interleaved txn between a txn's retries #50417


Merged
merged 7 commits into from
Apr 28, 2025

Conversation

bobhan1
Contributor

@bobhan1 bobhan1 commented Apr 25, 2025

What problem does this PR solve?

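The problematic scenario:

  1. txn=X acquires the lock for the first time and tries to commit at ver=V1. It sends calculation task1 to be1, which covers the computation on tablet=A and tablet=B.
  2. txn=X finishes the computation on tablet=A, writes the delete bitmap for ver=V1 to ms (the meta service), and records it in tablet=A's pending delete bitmap KV.
  3. txn=X voluntarily releases the lock because task1 timed out.
  4. txn=Y acquires the lock for the first time and tries to commit at ver=V1. It sends calculation task2 to be1, which covers tablet=A.
  5. txn=Y finishes the computation on tablet=A, uses tablet=A's pending delete bitmap KV to remove the delete bitmap written by txn=X, and writes its own delete bitmap for ver=V1.
  6. txn=Y voluntarily releases the lock because the computation timed out on some tablets.
  7. txn=X acquires the lock a second time and still tries to commit at ver=V1. It sends calculation task3 to be1, which covers the computation on tablet=A and tablet=B.
  8. At this point task1 on be1 has not finished yet, so task3 fails to register on be1.
  9. task1 finishes the computation on tablet=B and reports success to fe. txn=X commits successfully at ver=V1, but its ver=V1 delete bitmap on tablet=A has already been removed.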
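The nine steps above can be replayed in a small toy model. This is only a sketch under stated assumptions, not Doris code: `MetaService`, `Backend`, and `run_scenario` are hypothetical names, delete bitmaps are reduced to `(tablet, txn, version)` keys, and the BE is assumed to reject a task whose signature (here simply the txn name) is already running.

```python
from dataclasses import dataclass, field


@dataclass
class MetaService:
    """Toy stand-in for ms: delete bitmaps plus one pending KV per tablet."""
    delete_bitmaps: set = field(default_factory=set)  # {(tablet, txn, version)}
    pending: dict = field(default_factory=dict)       # tablet -> (txn, version)

    def write_delete_bitmap(self, txn, tablet, version):
        # Remove whatever the pending KV points at, then write our own bitmap
        # and make the pending KV point at it.
        stale = self.pending.get(tablet)
        if stale is not None:
            self.delete_bitmaps.discard((tablet, *stale))
        self.delete_bitmaps.add((tablet, txn, version))
        self.pending[tablet] = (txn, version)


@dataclass
class Backend:
    """Toy BE that refuses a task whose signature is already running (assumption)."""
    running: set = field(default_factory=set)

    def register(self, signature):
        if signature in self.running:
            return False
        self.running.add(signature)
        return True

    def finish(self, signature):
        self.running.discard(signature)


def run_scenario():
    ms, be1 = MetaService(), Backend()
    v1 = 1

    # Steps 1-3: txn X sends task1 (tablets A and B); A finishes, then X times out.
    assert be1.register("X")
    ms.write_delete_bitmap("X", "A", v1)

    # Steps 4-6: txn Y sends task2 (tablet A) and replaces X's bitmap on A.
    assert be1.register("Y")
    ms.write_delete_bitmap("Y", "A", v1)

    # Steps 7-8: X retries with task3, but task1 is still registered under "X".
    task3_registered = be1.register("X")

    # Step 9: the stale task1 finishes tablet B and reports success; X commits at V1.
    ms.write_delete_bitmap("X", "B", v1)
    be1.finish("X")

    return {
        "task3_registered": task3_registered,
        "x_bitmap_on_A": ("A", "X", v1) in ms.delete_bitmaps,
        "x_bitmap_on_B": ("B", "X", v1) in ms.delete_bitmaps,
    }


print(run_scenario())
```

In this model txn X ends up committed with a delete bitmap on tablet B but none on tablet A, which is the inconsistency described in step 9.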

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bobhan1 bobhan1 force-pushed the fix-cloud-mow-retry-dup-key branch 8 times, most recently from 1238a6c to 1a8f195 Compare April 25, 2025 10:24
@bobhan1 bobhan1 marked this pull request as ready for review April 25, 2025 10:24
@bobhan1
Contributor Author

bobhan1 commented Apr 26, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 33602 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 38f8dfb36abedfe61574fa42800cbc28a8bfd2f4, data reload: false

------ Round 1 ----------------------------------
q1	25626	4962	4993	4962
q2	2056	275	181	181
q3	10391	1264	711	711
q4	10239	989	522	522
q5	7570	2308	2333	2308
q6	180	160	132	132
q7	910	720	587	587
q8	9303	1288	1053	1053
q9	6858	5075	5118	5075
q10	6847	2279	1898	1898
q11	487	282	263	263
q12	339	354	225	225
q13	17785	3618	3064	3064
q14	234	221	217	217
q15	535	500	471	471
q16	457	447	399	399
q17	596	848	367	367
q18	7479	7181	7017	7017
q19	1616	935	537	537
q20	325	327	226	226
q21	4056	3361	2394	2394
q22	1053	1039	993	993
Total cold run time: 114942 ms
Total hot run time: 33602 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5071	5035	5081	5035
q2	238	330	227	227
q3	2208	2664	2285	2285
q4	1441	1776	1383	1383
q5	4337	4377	4379	4377
q6	217	169	133	133
q7	2018	1896	1758	1758
q8	2580	2588	2487	2487
q9	7200	7264	7041	7041
q10	2966	3189	2737	2737
q11	577	511	488	488
q12	666	800	605	605
q13	3513	3842	3251	3251
q14	295	319	279	279
q15	531	485	487	485
q16	459	502	462	462
q17	1138	1565	1362	1362
q18	7848	7724	7469	7469
q19	811	819	948	819
q20	1933	1971	1859	1859
q21	5208	4989	4886	4886
q22	1080	1114	1042	1042
Total cold run time: 52335 ms
Total hot run time: 50470 ms
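The totals in this report are consistent with the following reading of the columns, which is inferred from the numbers rather than stated by the bot: the first number is the cold run, the next two are hot runs, and the last is the smaller of the two hot runs. The cold total sums the first column and the hot total sums the last. A minimal sketch on the first three Round 1 rows only (`totals` is a hypothetical helper):

```python
ROWS = """
q1 25626 4962 4993 4962
q2 2056 275 181 181
q3 10391 1264 711 711
"""


def totals(text):
    """Return (cold_total_ms, hot_total_ms) for whitespace-separated result rows."""
    cold = hot = 0
    for line in text.strip().splitlines():
        _name, first, second, third, best = line.split()
        # The last column should equal the faster of the two hot runs.
        assert int(best) == min(int(second), int(third))
        cold += int(first)
        hot += int(best)
    return cold, hot


print(totals(ROWS))  # (38073, 5854)
```

Applied to all 22 rows of Round 1, the same sums give the reported 114942 ms cold and 33602 ms hot.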

@doris-robot

TPC-DS: Total hot run time: 193276 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 38f8dfb36abedfe61574fa42800cbc28a8bfd2f4, data reload: false

query1	1397	1087	1080	1080
query2	6128	1910	1926	1910
query3	11160	4803	4560	4560
query4	25232	23852	23821	23821
query5	5540	630	464	464
query6	324	222	186	186
query7	3981	477	275	275
query8	290	244	237	237
query9	8483	2577	2572	2572
query10	550	314	261	261
query11	15207	15043	14838	14838
query12	160	113	112	112
query13	1556	522	404	404
query14	8908	6318	6310	6310
query15	210	196	172	172
query16	7282	641	490	490
query17	1153	793	611	611
query18	2181	421	333	333
query19	219	201	179	179
query20	126	120	118	118
query21	213	141	116	116
query22	4439	4634	4362	4362
query23	34534	33732	33338	33338
query24	8609	2505	2466	2466
query25	504	459	386	386
query26	1316	267	151	151
query27	2774	497	346	346
query28	4565	2160	2157	2157
query29	726	554	436	436
query30	275	230	191	191
query31	918	856	812	812
query32	76	61	62	61
query33	528	366	319	319
query34	843	891	546	546
query35	835	876	769	769
query36	963	1022	910	910
query37	120	101	80	80
query38	4255	4344	4168	4168
query39	1554	1429	1410	1410
query40	217	123	106	106
query41	55	61	52	52
query42	120	107	131	107
query43	528	530	511	511
query44	1378	814	826	814
query45	186	180	164	164
query46	863	1034	664	664
query47	1838	1845	1785	1785
query48	388	431	299	299
query49	766	498	417	417
query50	680	705	425	425
query51	4221	4318	4291	4291
query52	113	111	103	103
query53	229	266	195	195
query54	595	580	540	540
query55	89	87	79	79
query56	319	323	293	293
query57	1179	1159	1122	1122
query58	261	256	258	256
query59	2820	2953	2801	2801
query60	337	316	307	307
query61	140	129	122	122
query62	781	734	662	662
query63	242	200	205	200
query64	4126	1033	672	672
query65	4508	4380	4360	4360
query66	903	420	321	321
query67	16132	15576	15269	15269
query68	8351	885	512	512
query69	530	310	266	266
query70	1212	1109	1106	1106
query71	479	324	294	294
query72	5754	4696	4631	4631
query73	689	663	348	348
query74	8889	8926	8753	8753
query75	4034	3179	2708	2708
query76	3782	1191	740	740
query77	789	366	288	288
query78	10019	10116	9229	9229
query79	2218	835	568	568
query80	605	510	448	448
query81	494	257	227	227
query82	532	126	97	97
query83	249	263	233	233
query84	249	99	89	89
query85	744	344	304	304
query86	357	310	294	294
query87	4452	4354	4297	4297
query88	3789	2252	2236	2236
query89	395	318	286	286
query90	1944	209	219	209
query91	147	141	112	112
query92	71	62	60	60
query93	1821	953	575	575
query94	644	426	308	308
query95	367	289	283	283
query96	501	564	279	279
query97	3147	3245	3099	3099
query98	225	213	193	193
query99	1429	1403	1310	1310
Total cold run time: 281056 ms
Total hot run time: 193276 ms

@doris-robot

ClickBench: Total hot run time: 29.25 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 38f8dfb36abedfe61574fa42800cbc28a8bfd2f4, data reload: false

query1	0.04	0.05	0.03
query2	0.12	0.10	0.11
query3	0.25	0.20	0.19
query4	1.60	0.19	0.19
query5	0.60	0.57	0.60
query6	1.17	0.72	0.73
query7	0.02	0.02	0.02
query8	0.05	0.04	0.03
query9	0.56	0.51	0.50
query10	0.56	0.57	0.56
query11	0.16	0.11	0.11
query12	0.15	0.12	0.11
query13	0.61	0.59	0.60
query14	1.20	1.19	1.18
query15	0.87	0.87	0.86
query16	0.39	0.38	0.38
query17	1.01	1.02	1.02
query18	0.20	0.19	0.19
query19	1.88	1.76	1.74
query20	0.01	0.01	0.01
query21	15.40	0.91	0.53
query22	0.75	1.13	0.83
query23	14.85	1.38	0.63
query24	7.26	1.26	0.32
query25	0.38	0.17	0.10
query26	0.62	0.17	0.14
query27	0.05	0.04	0.04
query28	9.06	0.86	0.42
query29	12.54	3.98	3.36
query30	0.25	0.10	0.07
query31	2.82	0.59	0.39
query32	3.23	0.55	0.45
query33	3.10	3.08	3.03
query34	15.82	5.11	4.50
query35	4.60	4.55	4.60
query36	0.67	0.50	0.48
query37	0.09	0.06	0.06
query38	0.06	0.04	0.03
query39	0.03	0.02	0.03
query40	0.15	0.14	0.13
query41	0.08	0.02	0.02
query42	0.03	0.03	0.02
query43	0.03	0.03	0.03
Total cold run time: 103.32 s
Total hot run time: 29.25 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.14% (14684/27124)
Line Coverage 43.01% (127882/297352)
Region Coverage 41.86% (65485/156453)
Branch Coverage 36.46% (33018/90566)

@hello-stephen
Contributor

BE Regression P0 && UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 55.35% (14739/26630)
Line Coverage 44.74% (132821/296860)
Region Coverage 41.83% (76507/182878)
Branch Coverage 35.88% (37000/103124)

@bobhan1 bobhan1 force-pushed the fix-cloud-mow-retry-dup-key branch from 38f8dfb to f019ba0 Compare April 27, 2025 02:14
@bobhan1
Contributor Author

bobhan1 commented Apr 27, 2025

run buildall

zhannngchen
zhannngchen previously approved these changes Apr 27, 2025
Contributor

@zhannngchen zhannngchen left a comment


LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 27, 2025
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

@doris-robot

TPC-H: Total hot run time: 33957 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f019ba056e1ec4206b7d6e2f6a8c186efbbf0369, data reload: false

------ Round 1 ----------------------------------
q1	25935	5079	4984	4984
q2	2079	284	191	191
q3	10459	1267	707	707
q4	10262	1006	529	529
q5	8654	2477	2326	2326
q6	198	162	130	130
q7	926	728	611	611
q8	9301	1290	1047	1047
q9	6915	5253	5213	5213
q10	6811	2279	1879	1879
q11	470	285	261	261
q12	340	354	222	222
q13	17758	3693	3065	3065
q14	225	228	212	212
q15	540	503	487	487
q16	445	440	401	401
q17	603	860	400	400
q18	7635	7279	7061	7061
q19	1504	946	551	551
q20	337	334	222	222
q21	4323	2622	2494	2494
q22	1080	1023	964	964
Total cold run time: 116800 ms
Total hot run time: 33957 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5149	5080	5052	5052
q2	243	328	232	232
q3	2140	2642	2290	2290
q4	1482	1819	1506	1506
q5	4585	4414	4326	4326
q6	210	166	121	121
q7	1946	1872	1766	1766
q8	2557	2538	2462	2462
q9	7237	7224	7219	7219
q10	2955	3134	2726	2726
q11	566	488	489	488
q12	684	771	579	579
q13	3494	3817	3381	3381
q14	273	287	262	262
q15	521	473	473	473
q16	461	506	483	483
q17	1143	1528	1385	1385
q18	7688	7504	7474	7474
q19	784	854	974	854
q20	1963	2017	1835	1835
q21	5153	4698	4608	4608
q22	1110	1011	978	978
Total cold run time: 52344 ms
Total hot run time: 50500 ms

@doris-robot

TPC-DS: Total hot run time: 185466 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f019ba056e1ec4206b7d6e2f6a8c186efbbf0369, data reload: false

query1	1022	471	486	471
query2	6552	1877	1858	1858
query3	6736	221	217	217
query4	31959	23341	22983	22983
query5	4591	645	469	469
query6	260	192	180	180
query7	4033	487	280	280
query8	294	246	222	222
query9	5196	2561	2568	2561
query10	414	315	268	268
query11	15223	14823	14879	14823
query12	167	117	105	105
query13	997	500	410	410
query14	9479	5964	5978	5964
query15	190	185	172	172
query16	7280	635	438	438
query17	1162	696	560	560
query18	1975	392	297	297
query19	188	179	184	179
query20	116	120	113	113
query21	211	135	114	114
query22	4071	4175	4031	4031
query23	33809	32973	33009	32973
query24	8327	2363	2354	2354
query25	541	472	407	407
query26	1240	262	146	146
query27	2745	490	336	336
query28	4343	2114	2105	2105
query29	772	556	422	422
query30	281	211	186	186
query31	922	860	754	754
query32	69	65	65	65
query33	535	354	301	301
query34	797	834	507	507
query35	804	801	726	726
query36	932	981	915	915
query37	119	96	89	89
query38	4171	4226	4129	4129
query39	1494	1377	1400	1377
query40	217	120	108	108
query41	56	53	51	51
query42	119	104	105	104
query43	504	499	465	465
query44	1265	793	791	791
query45	173	181	167	167
query46	836	1014	614	614
query47	1750	1822	1726	1726
query48	368	418	296	296
query49	783	513	412	412
query50	653	687	395	395
query51	4127	4040	4086	4040
query52	106	102	96	96
query53	221	250	182	182
query54	604	579	519	519
query55	92	84	81	81
query56	322	315	311	311
query57	1134	1145	1093	1093
query58	270	263	256	256
query59	2753	2821	2694	2694
query60	333	345	320	320
query61	152	151	153	151
query62	819	737	683	683
query63	223	185	190	185
query64	4355	998	669	669
query65	4256	4233	4266	4233
query66	1134	419	307	307
query67	15704	15741	15378	15378
query68	8016	870	511	511
query69	452	299	258	258
query70	1146	1136	1059	1059
query71	419	322	290	290
query72	5538	4826	4955	4826
query73	866	704	350	350
query74	8838	9132	8679	8679
query75	3440	3193	2823	2823
query76	3950	1191	746	746
query77	766	371	280	280
query78	9987	10170	9279	9279
query79	1624	807	571	571
query80	715	517	454	454
query81	495	257	225	225
query82	198	130	102	102
query83	262	260	239	239
query84	250	109	84	84
query85	757	349	304	304
query86	333	286	294	286
query87	4437	4497	4311	4311
query88	2875	2225	2237	2225
query89	397	325	283	283
query90	1798	215	229	215
query91	147	142	110	110
query92	65	63	59	59
query93	1062	979	600	600
query94	607	427	309	309
query95	377	290	286	286
query96	497	573	275	275
query97	3144	3215	3124	3124
query98	232	218	199	199
query99	1328	1387	1270	1270
Total cold run time: 272364 ms
Total hot run time: 185466 ms

@bobhan1 bobhan1 force-pushed the fix-cloud-mow-retry-dup-key branch from 41f4215 to f34231c Compare April 28, 2025 02:39
@bobhan1
Contributor Author

bobhan1 commented Apr 28, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 33980 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f34231cf47b283dd5468807910207f3a61ca3903, data reload: false

------ Round 1 ----------------------------------
q1	26578	5075	5075	5075
q2	2085	282	183	183
q3	10390	1252	719	719
q4	10231	992	535	535
q5	7787	2459	2332	2332
q6	184	169	135	135
q7	923	742	612	612
q8	9320	1295	1034	1034
q9	6855	5169	5177	5169
q10	6827	2354	1898	1898
q11	484	285	276	276
q12	359	354	212	212
q13	17793	3683	3075	3075
q14	223	230	222	222
q15	542	492	491	491
q16	433	438	369	369
q17	610	857	368	368
q18	7740	7293	7052	7052
q19	1645	988	564	564
q20	331	330	221	221
q21	3814	3363	2460	2460
q22	1031	1009	978	978
Total cold run time: 116185 ms
Total hot run time: 33980 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5154	5100	5152	5100
q2	247	321	232	232
q3	2117	2657	2294	2294
q4	1352	1799	1440	1440
q5	4638	4449	4389	4389
q6	215	166	125	125
q7	1992	1898	1750	1750
q8	2601	2702	2596	2596
q9	7206	7149	7208	7149
q10	2970	3160	2731	2731
q11	563	499	483	483
q12	684	745	623	623
q13	3500	3849	3332	3332
q14	288	294	261	261
q15	517	481	489	481
q16	470	487	451	451
q17	1179	1557	1418	1418
q18	7649	7593	7556	7556
q19	850	863	943	863
q20	1987	2013	1825	1825
q21	4924	4679	4492	4492
q22	1056	1024	989	989
Total cold run time: 52159 ms
Total hot run time: 50580 ms

@doris-robot

TPC-DS: Total hot run time: 185641 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f34231cf47b283dd5468807910207f3a61ca3903, data reload: false

query1	1036	477	512	477
query2	6563	1805	1786	1786
query3	6739	227	221	221
query4	26198	23785	22927	22927
query5	4399	610	465	465
query6	292	213	184	184
query7	4623	493	293	293
query8	307	251	238	238
query9	8621	2581	2542	2542
query10	484	323	281	281
query11	15689	15028	14858	14858
query12	171	111	107	107
query13	1669	529	402	402
query14	9539	6264	6254	6254
query15	216	185	162	162
query16	7244	612	481	481
query17	1172	707	563	563
query18	1968	399	329	329
query19	189	184	161	161
query20	127	115	117	115
query21	209	121	107	107
query22	4001	4122	4097	4097
query23	33930	33050	32956	32956
query24	8464	2390	2395	2390
query25	524	446	404	404
query26	1243	278	155	155
query27	2743	501	333	333
query28	4328	2093	2072	2072
query29	782	581	442	442
query30	282	211	186	186
query31	946	864	747	747
query32	70	66	63	63
query33	564	353	310	310
query34	795	852	509	509
query35	805	805	760	760
query36	952	1002	893	893
query37	122	103	80	80
query38	4243	4117	4164	4117
query39	1459	1392	1432	1392
query40	219	128	112	112
query41	62	65	63	63
query42	122	106	109	106
query43	519	506	475	475
query44	1293	801	816	801
query45	188	178	176	176
query46	849	1040	643	643
query47	1747	1797	1726	1726
query48	384	418	302	302
query49	814	525	433	433
query50	667	689	406	406
query51	4094	4101	4058	4058
query52	112	107	107	107
query53	245	261	185	185
query54	610	603	532	532
query55	84	83	86	83
query56	335	307	296	296
query57	1101	1171	1101	1101
query58	256	254	247	247
query59	2657	2632	2516	2516
query60	337	315	293	293
query61	130	125	127	125
query62	800	744	651	651
query63	228	187	188	187
query64	4335	1001	651	651
query65	4351	4274	4277	4274
query66	1128	414	309	309
query67	15915	15445	15186	15186
query68	8015	888	512	512
query69	456	306	268	268
query70	1252	1136	1128	1128
query71	459	316	300	300
query72	5297	4631	4681	4631
query73	672	575	350	350
query74	9239	9052	8929	8929
query75	3836	3244	2721	2721
query76	3692	1215	768	768
query77	797	374	291	291
query78	10135	10222	9297	9297
query79	2169	811	566	566
query80	597	506	452	452
query81	502	265	215	215
query82	458	123	102	102
query83	254	255	236	236
query84	252	106	87	87
query85	789	359	352	352
query86	395	281	259	259
query87	4406	4450	4386	4386
query88	3628	2210	2215	2210
query89	385	313	277	277
query90	1878	207	222	207
query91	145	146	114	114
query92	73	65	56	56
query93	1590	949	580	580
query94	674	418	307	307
query95	372	294	283	283
query96	484	557	272	272
query97	3215	3262	3097	3097
query98	224	202	200	200
query99	1487	1435	1285	1285
Total cold run time: 275117 ms
Total hot run time: 185641 ms

@doris-robot

ClickBench: Total hot run time: 28.59 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f34231cf47b283dd5468807910207f3a61ca3903, data reload: false

query1	0.04	0.03	0.03
query2	0.12	0.11	0.10
query3	0.26	0.20	0.20
query4	1.59	0.19	0.10
query5	0.58	0.54	0.54
query6	1.21	0.73	0.71
query7	0.02	0.01	0.01
query8	0.05	0.03	0.03
query9	0.58	0.52	0.50
query10	0.57	0.58	0.57
query11	0.15	0.11	0.11
query12	0.15	0.11	0.11
query13	0.62	0.61	0.59
query14	0.77	0.79	0.81
query15	0.88	0.87	0.85
query16	0.37	0.40	0.39
query17	1.04	1.04	1.03
query18	0.20	0.20	0.19
query19	1.93	1.84	1.79
query20	0.01	0.01	0.01
query21	15.41	0.92	0.54
query22	0.77	1.16	0.73
query23	14.90	1.37	0.63
query24	7.34	1.03	0.30
query25	0.31	0.22	0.10
query26	0.61	0.16	0.14
query27	0.05	0.05	0.05
query28	9.27	0.93	0.42
query29	12.57	3.97	3.33
query30	0.25	0.09	0.06
query31	2.84	0.60	0.37
query32	3.23	0.54	0.46
query33	3.02	3.02	3.05
query34	15.71	5.08	4.51
query35	4.52	4.50	4.50
query36	0.66	0.49	0.48
query37	0.09	0.06	0.06
query38	0.04	0.04	0.03
query39	0.03	0.02	0.03
query40	0.16	0.14	0.12
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 103.06 s
Total hot run time: 28.59 s

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/15) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.67% (14738/26959)
Line Coverage 43.74% (129066/295065)
Region Coverage 42.46% (65886/155163)
Branch Coverage 37.03% (33199/89654)

@hello-stephen
Contributor

BE Regression P0 && UT Coverage Report

Increment line coverage 0.00% (0/15) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 55.82% (14771/26463)
Line Coverage 45.47% (133936/294581)
Region Coverage 42.39% (76962/181569)
Branch Coverage 36.43% (37238/102212)

@bobhan1
Contributor Author

bobhan1 commented Apr 28, 2025

run feut

Contributor

@dataroaring dataroaring left a comment


LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 28, 2025
Contributor

PR approved by at least one committer and no changes requested.

Contributor

@zhannngchen zhannngchen left a comment


LGTM

@dataroaring dataroaring merged commit d22a859 into apache:master Apr 28, 2025
24 of 26 checks passed
bobhan1 added a commit to bobhan1/doris that referenced this pull request Apr 29, 2025
…rleaved txn between a txn's retries (apache#50417)

dataroaring pushed a commit that referenced this pull request May 6, 2025
…s other interleaved txn between a txn's retries (#50417) (#50526)

pick #50417
dataroaring pushed a commit that referenced this pull request May 15, 2025
…et_id) being executed concurrently (#50847)

### What problem does this PR solve?

After #50417, there may be multiple
calc delete bitmap tasks with different signatures for the same (txn_id,
tablet_id) load on the same BE. We use _rowset_update_lock to prevent them
from being executed concurrently, which would cause correctness problems.

For example, the rowset meta and the segment data objects can become
mismatched when a transient rowset writer performs concurrent writes on
the same rowset in the partial update publish phase:
```
W20250513 15:50:55.371588  1049 file_reader.cpp:36] [NOT_FOUND]failed to read from :   code=NOT_FOUND, type=16, request_id=failed to read
W20250513 15:50:55.371667  1049 beta_rowset.cpp:202] failed to open segment. data/1747122561886/020000000000000125473fbacc484a4f8c46478ab6f64b90_2.dat under rowset 020000000000000125473fbacc484a4f8c46478ab6f64b90 : [NOT_FOUND]failed to read from :   code=NOT_FOUND, type=16, request_id=failed to read
```
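The serialization described in this commit message can be sketched as follows. This is an illustration only, not the Doris implementation: the commit text says only that `_rowset_update_lock` is used, so the lock registry keyed by `(txn_id, tablet_id)`, the `lock_for` helper, and `run_tasks` are all assumptions made for the example.

```python
import threading
import time
from collections import defaultdict

_registry_guard = threading.Lock()
_locks = defaultdict(threading.Lock)  # hypothetical: (txn_id, tablet_id) -> lock


def lock_for(txn_id, tablet_id):
    """Return the lock shared by every task working on this (txn_id, tablet_id)."""
    with _registry_guard:
        return _locks[(txn_id, tablet_id)]


def run_tasks(signatures, txn_id=1, tablet_id=100):
    """Run one task per signature and return the peak number running at once."""
    in_flight = 0
    max_in_flight = 0
    counter_guard = threading.Lock()

    def calc_delete_bitmap(signature):
        nonlocal in_flight, max_in_flight
        # Tasks with different signatures still contend on the same lock.
        with lock_for(txn_id, tablet_id):
            with counter_guard:
                in_flight += 1
                max_in_flight = max(max_in_flight, in_flight)
            time.sleep(0.01)  # stand-in for rewriting the rowset
            with counter_guard:
                in_flight -= 1

    threads = [
        threading.Thread(target=calc_delete_bitmap, args=(sig,))
        for sig in signatures
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max_in_flight


print(run_tasks(["sig-1", "sig-2", "sig-3"]))
```

Because every task takes the same lock before touching the rowset, the peak concurrency stays at one even though the three signatures differ.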
bobhan1 added a commit to bobhan1/doris that referenced this pull request May 16, 2025
…et_id) being executed concurrently (apache#50847)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…rleaved txn between a txn's retries (apache#50417)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…et_id) being executed concurrently (apache#50847)

Labels
approved Indicates a PR has been approved by one committer. dev/3.0.6-merged p0_w reviewed

5 participants