[Fix](cloud-mow) avoid calc delete bitmap tasks on same (txn_id, tablet_id) being executed concurrently #50847

@bobhan1 bobhan1 commented May 13, 2025

What problem does this PR solve?

After #50417, multiple calc delete bitmap tasks with different signatures may exist for the same (txn_id, tablet_id) load on the same BE. This PR uses _rowset_update_lock to prevent them from executing concurrently, which would otherwise cause correctness problems.

For example, the rowset meta and the segment data objects can become mismatched when concurrent tasks write to the same rowset through the transient rowset writer during the publish phase of a partial update:

W20250513 15:50:55.371588  1049 file_reader.cpp:36] [NOT_FOUND]failed to read from :   code=NOT_FOUND, type=16, request_id=failed to read
W20250513 15:50:55.371667  1049 beta_rowset.cpp:202] failed to open segment. data/1747122561886/020000000000000125473fbacc484a4f8c46478ab6f64b90_2.dat under rowset 020000000000000125473fbacc484a4f8c46478ab6f64b90 : [NOT_FOUND]failed to read from :   code=NOT_FOUND, type=16, request_id=failed to read
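
To illustrate the idea, here is a minimal, self-contained sketch of serializing tasks that target the same (txn_id, tablet_id) with a per-tablet mutex. It is not the actual Doris code: the `Tablet` and `CalcDeleteBitmapTask` structs and the `max_concurrent_tasks` helper are hypothetical stand-ins, and only the lock name `_rowset_update_lock` comes from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a tablet; only the lock name is taken from the PR text.
struct Tablet {
    std::mutex _rowset_update_lock;
};

// Hypothetical task: several may exist for one (txn_id, tablet_id),
// each with a different signature.
struct CalcDeleteBitmapTask {
    int64_t signature;
    int64_t txn_id;
    int64_t tablet_id;
};

// Runs `num_tasks` tasks for the same (txn_id, tablet_id) on separate threads
// and returns the largest number of tasks observed inside the critical
// section at the same time.
int max_concurrent_tasks(int num_tasks) {
    Tablet tablet;
    std::atomic<int> in_flight{0};
    std::atomic<int> max_seen{0};

    auto run = [&](CalcDeleteBitmapTask task) {
        (void)task;  // the task payload is not needed for this illustration
        // Every task on this tablet takes the same lock before touching the rowset.
        std::lock_guard<std::mutex> guard(tablet._rowset_update_lock);
        int now = ++in_flight;
        assert(now == 1);  // holds only because the lock is held
        int prev = max_seen.load();
        while (prev < now && !max_seen.compare_exchange_weak(prev, now)) {
        }
        std::this_thread::yield();  // placeholder for the rowset rewrite work
        --in_flight;
    };

    std::vector<std::thread> workers;
    for (int i = 0; i < num_tasks; ++i) {
        // Different signatures, same (txn_id, tablet_id).
        workers.emplace_back(run, CalcDeleteBitmapTask{i, 1001, 42});
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return max_seen.load();
}
```

Calling `max_concurrent_tasks(8)` starts eight tasks on one tablet. Because each task holds the lock for its whole critical section, at most one of them is ever inside it, so two tasks cannot rewrite the same rowset at once.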

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Thearas commented May 13, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bobhan1 bobhan1 force-pushed the avoid-concurrent-calc-dbm-task-on-same-txn_id-tablet_id branch from 9d515f4 to b174c1b Compare May 13, 2025 07:55

bobhan1 commented May 13, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 33931 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6cd698212f2152264f45a2475c2f642fcd4edb01, data reload: false

------ Round 1 ----------------------------------
q1	26338	5048	4990	4990
q2	2068	291	179	179
q3	10390	1269	722	722
q4	10215	1002	509	509
q5	7530	2387	2361	2361
q6	187	164	132	132
q7	919	758	621	621
q8	9310	1335	1119	1119
q9	6844	5062	5133	5062
q10	6825	2339	1926	1926
q11	466	292	275	275
q12	354	355	215	215
q13	17760	3657	3132	3132
q14	231	227	215	215
q15	532	486	491	486
q16	425	430	395	395
q17	606	876	382	382
q18	7847	7129	6999	6999
q19	1512	951	555	555
q20	347	336	231	231
q21	4084	2726	2447	2447
q22	1039	1029	978	978
Total cold run time: 115829 ms
Total hot run time: 33931 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5147	5049	5064	5049
q2	242	325	229	229
q3	2181	2614	2269	2269
q4	1304	1799	1369	1369
q5	4455	4384	4428	4384
q6	221	169	127	127
q7	2027	1938	1776	1776
q8	2615	2673	2504	2504
q9	7259	7146	6955	6955
q10	3074	3187	2757	2757
q11	585	522	494	494
q12	656	775	606	606
q13	3519	3872	3308	3308
q14	312	301	291	291
q15	525	489	489	489
q16	453	512	437	437
q17	1129	1495	1424	1424
q18	7760	7565	7461	7461
q19	803	805	1127	805
q20	1981	2058	1873	1873
q21	5064	4831	4775	4775
q22	1045	1057	1028	1028
Total cold run time: 52357 ms
Total hot run time: 50410 ms

@doris-robot

TPC-DS: Total hot run time: 193988 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6cd698212f2152264f45a2475c2f642fcd4edb01, data reload: false

query1	1406	1066	1041	1041
query2	6148	1852	1889	1852
query3	11085	4580	4475	4475
query4	54058	25003	23547	23547
query5	5002	583	470	470
query6	364	222	210	210
query7	4960	515	297	297
query8	323	275	263	263
query9	5769	2617	2613	2613
query10	448	323	268	268
query11	15179	15087	14858	14858
query12	174	114	108	108
query13	1083	527	422	422
query14	10306	6471	6419	6419
query15	207	214	184	184
query16	7101	675	490	490
query17	1083	753	600	600
query18	1539	428	324	324
query19	212	208	179	179
query20	131	128	121	121
query21	210	136	111	111
query22	4329	4395	4341	4341
query23	34535	33650	33665	33650
query24	6661	2427	2428	2427
query25	451	469	407	407
query26	731	277	164	164
query27	2307	523	346	346
query28	2986	2174	2163	2163
query29	592	560	440	440
query30	284	224	194	194
query31	869	862	822	822
query32	70	61	64	61
query33	445	377	337	337
query34	795	875	546	546
query35	802	826	750	750
query36	966	993	896	896
query37	119	102	78	78
query38	4190	4324	4127	4127
query39	1502	1467	1500	1467
query40	235	142	107	107
query41	56	55	62	55
query42	129	122	110	110
query43	530	523	508	508
query44	1408	853	854	853
query45	187	197	170	170
query46	875	1052	661	661
query47	1878	1872	1792	1792
query48	418	459	352	352
query49	698	532	409	409
query50	685	710	423	423
query51	4276	4287	4209	4209
query52	117	118	111	111
query53	229	262	197	197
query54	618	591	518	518
query55	84	85	87	85
query56	314	302	285	285
query57	1202	1200	1151	1151
query58	282	263	262	262
query59	2791	2865	2705	2705
query60	331	325	320	320
query61	131	133	120	120
query62	730	752	673	673
query63	227	196	205	196
query64	2000	1049	720	720
query65	4460	4366	4304	4304
query66	733	411	301	301
query67	16096	15592	15478	15478
query68	7920	895	523	523
query69	560	306	269	269
query70	1226	1114	1100	1100
query71	521	324	304	304
query72	5858	4937	4887	4887
query73	1238	683	356	356
query74	8979	9363	8622	8622
query75	4142	3242	2720	2720
query76	4226	1206	784	784
query77	677	397	292	292
query78	10220	10235	9267	9267
query79	2613	790	575	575
query80	604	505	441	441
query81	495	248	217	217
query82	419	124	98	98
query83	350	249	227	227
query84	296	101	83	83
query85	781	361	316	316
query86	366	318	280	280
query87	4462	4482	4517	4482
query88	3788	2329	2301	2301
query89	415	318	287	287
query90	1847	218	211	211
query91	143	152	110	110
query92	74	65	61	61
query93	2124	957	580	580
query94	668	416	310	310
query95	377	303	286	286
query96	501	586	285	285
query97	3197	3218	3108	3108
query98	243	200	201	200
query99	1443	1426	1263	1263
Total cold run time: 301915 ms
Total hot run time: 193988 ms

@doris-robot

ClickBench: Total hot run time: 29.69 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 6cd698212f2152264f45a2475c2f642fcd4edb01, data reload: false

query1	0.03	0.04	0.03
query2	0.12	0.11	0.12
query3	0.25	0.19	0.20
query4	1.60	0.19	0.20
query5	0.60	0.60	0.58
query6	1.20	0.72	0.73
query7	0.03	0.02	0.02
query8	0.04	0.04	0.04
query9	0.57	0.51	0.51
query10	0.55	0.58	0.57
query11	0.15	0.11	0.11
query12	0.15	0.12	0.12
query13	0.62	0.60	0.60
query14	0.80	0.81	0.82
query15	0.87	0.85	0.87
query16	0.38	0.39	0.40
query17	1.06	1.04	1.03
query18	0.23	0.21	0.21
query19	1.89	1.83	1.83
query20	0.01	0.01	0.01
query21	15.39	0.92	0.56
query22	0.76	1.41	0.97
query23	14.73	1.37	0.65
query24	7.72	1.39	0.68
query25	0.44	0.19	0.16
query26	0.72	0.17	0.13
query27	0.06	0.05	0.05
query28	8.80	0.89	0.47
query29	12.53	4.01	3.29
query30	0.26	0.10	0.07
query31	2.82	0.61	0.39
query32	3.23	0.56	0.48
query33	3.01	3.10	3.07
query34	15.81	5.09	4.47
query35	4.55	4.50	4.52
query36	0.66	0.49	0.49
query37	0.09	0.07	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.18	0.15	0.14
query41	0.08	0.03	0.03
query42	0.04	0.03	0.02
query43	0.04	0.04	0.03
Total cold run time: 103.15 s
Total hot run time: 29.69 s

@bobhan1 bobhan1 force-pushed the avoid-concurrent-calc-dbm-task-on-same-txn_id-tablet_id branch from 32959f9 to 5f80ddc Compare May 13, 2025 10:39
@hello-stephen

BE UT Coverage Report

Increment line coverage 0.00% (0/22) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 55.78% (14894/26700)
Line Coverage 44.59% (131794/295599)
Region Coverage 43.64% (66270/151854)
Branch Coverage 38.26% (33963/88762)


bobhan1 commented May 14, 2025

run p0


bobhan1 commented May 14, 2025

run cloud_p0


bobhan1 commented May 14, 2025

run p0

@zhannngchen zhannngchen left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label May 14, 2025

PR approved by at least one committer and no changes requested.


PR approved by anyone and no changes requested.

@zhannngchen

run be ut

@zhannngchen

run buildall

@doris-robot

TPC-H: Total hot run time: 34087 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 5f80ddc04b03d74f4a38cefa15f4558120a8edd4, data reload: false

------ Round 1 ----------------------------------
q1	26216	5089	5019	5019
q2	2082	278	186	186
q3	10472	1287	728	728
q4	10244	1027	543	543
q5	8037	2459	2359	2359
q6	187	162	133	133
q7	912	743	586	586
q8	9303	1349	1097	1097
q9	6735	5089	5104	5089
q10	6876	2337	1881	1881
q11	479	291	285	285
q12	354	347	219	219
q13	17767	3691	3112	3112
q14	229	230	217	217
q15	537	492	486	486
q16	442	432	376	376
q17	632	871	390	390
q18	7635	7153	7161	7153
q19	1352	969	570	570
q20	352	346	231	231
q21	4253	2684	2451	2451
q22	1054	1014	976	976
Total cold run time: 116150 ms
Total hot run time: 34087 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5102	5081	5069	5069
q2	239	323	234	234
q3	2219	2625	2302	2302
q4	1329	1766	1320	1320
q5	4559	4479	4395	4395
q6	219	164	126	126
q7	1941	1912	1787	1787
q8	2574	2523	2523	2523
q9	7103	7102	7114	7102
q10	3025	3213	2768	2768
q11	579	532	491	491
q12	691	789	621	621
q13	3509	3925	3283	3283
q14	276	290	269	269
q15	518	481	466	466
q16	459	487	428	428
q17	1159	1566	1417	1417
q18	7789	7572	7481	7481
q19	819	923	1157	923
q20	1963	2037	1895	1895
q21	5093	4729	4563	4563
q22	1068	1054	960	960
Total cold run time: 52233 ms
Total hot run time: 50423 ms

@doris-robot

TPC-DS: Total hot run time: 186425 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 5f80ddc04b03d74f4a38cefa15f4558120a8edd4, data reload: false

query1	1009	477	487	477
query2	6584	1794	1804	1794
query3	6745	234	225	225
query4	27333	23826	23575	23575
query5	4347	621	478	478
query6	307	214	217	214
query7	4623	498	294	294
query8	297	276	235	235
query9	8597	2607	2611	2607
query10	484	306	269	269
query11	15579	15249	14825	14825
query12	170	108	107	107
query13	1663	550	422	422
query14	9769	6291	6201	6201
query15	207	192	168	168
query16	7648	664	474	474
query17	1194	723	579	579
query18	2024	400	307	307
query19	212	188	183	183
query20	124	122	116	116
query21	209	122	110	110
query22	4174	4090	4021	4021
query23	34133	32991	33173	32991
query24	8491	2455	2407	2407
query25	549	451	391	391
query26	1243	266	154	154
query27	2737	502	333	333
query28	4309	2103	2086	2086
query29	732	563	444	444
query30	281	212	189	189
query31	936	887	752	752
query32	75	63	65	63
query33	554	386	306	306
query34	801	826	535	535
query35	762	820	722	722
query36	956	1010	903	903
query37	117	101	84	84
query38	4186	4204	4036	4036
query39	1443	1410	1406	1406
query40	218	122	104	104
query41	59	58	56	56
query42	131	108	108	108
query43	506	480	465	465
query44	1285	815	820	815
query45	184	174	167	167
query46	844	1026	632	632
query47	1750	1810	1713	1713
query48	389	408	309	309
query49	796	524	447	447
query50	638	697	405	405
query51	4161	4111	4073	4073
query52	115	105	104	104
query53	226	250	192	192
query54	583	567	512	512
query55	92	83	89	83
query56	316	304	282	282
query57	1113	1161	1087	1087
query58	270	259	255	255
query59	2511	2558	2452	2452
query60	335	327	295	295
query61	128	130	124	124
query62	760	745	647	647
query63	225	186	190	186
query64	4363	1015	674	674
query65	4328	4268	4238	4238
query66	1048	414	312	312
query67	15847	15467	15415	15415
query68	8302	880	516	516
query69	472	309	270	270
query70	1290	1126	1140	1126
query71	483	324	301	301
query72	5561	4780	4902	4780
query73	748	663	351	351
query74	9163	9097	8701	8701
query75	3910	3314	2712	2712
query76	3736	1200	749	749
query77	781	394	301	301
query78	9989	10222	9341	9341
query79	2010	823	566	566
query80	649	518	446	446
query81	470	259	225	225
query82	462	126	95	95
query83	287	268	239	239
query84	290	101	82	82
query85	792	348	322	322
query86	331	308	297	297
query87	4513	4387	4412	4387
query88	2833	2277	2287	2277
query89	390	318	284	284
query90	1938	207	207	207
query91	145	143	110	110
query92	81	63	58	58
query93	1108	920	576	576
query94	667	424	291	291
query95	379	288	281	281
query96	493	573	285	285
query97	3211	3196	3134	3134
query98	231	215	207	207
query99	1429	1441	1273	1273
Total cold run time: 275969 ms
Total hot run time: 186425 ms

@doris-robot

ClickBench: Total hot run time: 29.38 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 5f80ddc04b03d74f4a38cefa15f4558120a8edd4, data reload: false

query1	0.04	0.04	0.03
query2	0.12	0.10	0.11
query3	0.26	0.20	0.20
query4	1.59	0.20	0.19
query5	0.61	0.64	0.61
query6	1.18	0.73	0.73
query7	0.02	0.02	0.02
query8	0.04	0.04	0.04
query9	0.58	0.53	0.52
query10	0.57	0.58	0.57
query11	0.16	0.11	0.11
query12	0.15	0.11	0.11
query13	0.63	0.60	0.59
query14	0.79	0.80	0.82
query15	0.88	0.87	0.85
query16	0.38	0.38	0.39
query17	1.06	1.06	1.05
query18	0.22	0.21	0.21
query19	1.95	1.79	1.84
query20	0.02	0.01	0.01
query21	15.40	0.92	0.54
query22	0.75	1.12	0.62
query23	15.01	1.40	0.60
query24	6.94	0.85	1.42
query25	0.52	0.24	0.16
query26	0.60	0.17	0.13
query27	0.06	0.06	0.04
query28	9.33	0.90	0.44
query29	12.52	3.97	3.29
query30	0.25	0.09	0.06
query31	2.84	0.61	0.39
query32	3.24	0.55	0.47
query33	3.02	3.05	3.09
query34	15.82	5.16	4.53
query35	4.56	4.50	4.51
query36	0.65	0.50	0.49
query37	0.08	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.16	0.14	0.13
query41	0.08	0.03	0.02
query42	0.03	0.03	0.02
query43	0.04	0.04	0.02
Total cold run time: 103.23 s
Total hot run time: 29.38 s

@hello-stephen

BE UT Coverage Report

Increment line coverage 0.00% (0/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 55.81% (14899/26698)
Line Coverage 44.60% (131869/295693)
Region Coverage 43.64% (66284/151882)
Branch Coverage 38.26% (33966/88780)

@hello-stephen

BE Regression && UT Coverage Report

Increment line coverage 66.67% (2/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.40% (20852/26262)
Line Coverage 72.54% (214365/295502)
Region Coverage 70.71% (126111/178343)
Branch Coverage 64.41% (65282/101358)

@hello-stephen

BE Regression && UT Coverage Report

Increment line coverage 66.67% (2/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.33% (20835/26264)
Line Coverage 72.51% (214374/295629)
Region Coverage 70.69% (126102/178388)
Branch Coverage 64.39% (65288/101392)


bobhan1 commented May 15, 2025

run feut

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit a930cac into apache:master May 15, 2025
26 of 28 checks passed
bobhan1 added a commit to bobhan1/doris that referenced this pull request May 16, 2025
…et_id) being executed concurrently (apache#50847)

dataroaring pushed a commit that referenced this pull request May 16, 2025
…txn_id, tablet_id) being executed concurrently (#50847) (#50964)

pick #50847
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…et_id) being executed concurrently (apache#50847)

Labels
approved (Indicates a PR has been approved by one committer.), dev/3.0.6-merged, p0_w, reviewed