branch-3.0: [fix](transaction) remove visible rowset from memory during deletion transaction #50066 #50103

Merged: 1 commit into branch-3.0 on Apr 22, 2025

Conversation

github-actions[bot]
Contributor

Cherry-picked from #50066

…transaction (#50066)

### What problem does this PR solve?

The following error is reported repeatedly in the log:
```
W20250415 17:54:01.151019 218169 storage_engine.cpp:785] failed to clear transaction. txn_id=1002, partition_id=1744709709398, tablet_id=1744709709485, status=[E-228]could not delete transaction from engine, just remove it from memory not delete from disk, because related rowset already published. partition_id: 1744709709398, transaction_id: 1002, tablet: 1744709709485.fb491f2a6f29dad0-28fe3fdfba3272b5, rowset id: 020000000000001fc6430de53121366b7d7bc36d82a1ae92, version: [2-2], state: VISIBLE
W20250415 17:54:01.152154 218169 storage_engine.cpp:785] failed to clear transaction. txn_id=1002, partition_id=1744709709398, tablet_id=1744709709493, status=[E-228]could not delete transaction from engine, just remove it from memory not delete from disk, because related rowset already published. partition_id: 1744709709398, transaction_id: 1002, tablet: 1744709709493.7a47a3bf7dcc70f0-353230158f6c2390, rowset id: 0200000000000019c6430de53121366b7d7bc36d82a1ae92, version: [2-2], state: VISIBLE
W20250415 17:54:01.152177 218169 storage_engine.cpp:785] failed to clear transaction. txn_id=1002, partition_id=1744709709398, tablet_id=1744709709509, status=[E-228]could not delete transaction from engine, just remove it from memory not delete from disk, because related rowset already published. partition_id: 1744709709398, transaction_id: 1002, tablet: 1744709709509.7f49efdb3b6c2c1d-ed627da9d22c9884, rowset id: 020000000000001bc6430de53121366b7d7bc36d82a1ae92, version: [2-2], state: VISIBLE
```
This bug can occur through the following sequence of events:
1. After a load into a table with three replicas, the publish task fails
on one of the replicas after it has already marked the rowset as visible.
2. The transaction becomes visible, and FE then clears the visible transaction.
3. The BE node on which the publish failed does not delete the transaction
from memory and keeps reporting it to FE. FE cannot find the transaction and
issues a clear command. However, the deletion fails because the rowset
is already published.
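
The sequence above can be illustrated with a minimal, hypothetical model. This is a sketch only and is not the Doris implementation: `ToyTxnManager`, `RowsetState`, `DeleteResult`, and the `drop_visible_from_memory` flag are names invented for illustration, and the status returned after the fix is an assumption. The sketch only shows why keeping the in-memory entry for an already published rowset makes every later clear command fail again, and why dropping the entry (while leaving the on-disk rowset alone) stops the loop.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified model. None of these names come from the Doris code base.
enum class RowsetState { COMMITTED, VISIBLE };

struct DeleteResult {
    bool ok;
    std::string message;
};

class ToyTxnManager {
public:
    using Key = std::pair<int64_t, int64_t>;  // (txn_id, tablet_id)

    void add(int64_t txn_id, int64_t tablet_id, RowsetState state) {
        txns_[Key{txn_id, tablet_id}] = state;
    }

    bool contains(int64_t txn_id, int64_t tablet_id) const {
        return txns_.count(Key{txn_id, tablet_id}) > 0;
    }

    // drop_visible_from_memory == false models the behavior described above:
    // the entry for a published rowset stays in memory, so each clear fails again.
    // drop_visible_from_memory == true models the intent of the fix (assumed):
    // forget the in-memory entry but never touch the published rowset on disk.
    DeleteResult delete_txn(int64_t txn_id, int64_t tablet_id,
                            bool drop_visible_from_memory) {
        auto it = txns_.find(Key{txn_id, tablet_id});
        if (it == txns_.end()) {
            return {true, "transaction not tracked"};
        }
        if (it->second == RowsetState::VISIBLE) {
            if (drop_visible_from_memory) {
                txns_.erase(it);
            }
            return {false, "rowset already published; on-disk rowset kept"};
        }
        txns_.erase(it);
        return {true, "transaction and rowset removed"};
    }

private:
    std::map<Key, RowsetState> txns_;
};
```

In this model, calling `delete_txn` with `drop_visible_from_memory` set to `false` fails on every call for a `VISIBLE` entry, which mirrors the repeated warnings in the log. With the flag set to `true`, the first call still reports that the rowset is published, but the entry is gone afterwards, so the node no longer has anything to report for that transaction.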
@github-actions github-actions bot requested a review from dataroaring as a code owner April 16, 2025 09:53
@Thearas
Contributor

Thearas commented Apr 16, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring reopened this Apr 16, 2025
@Thearas
Contributor

Thearas commented Apr 16, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 39842 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 761faa6956b4bf13e2c2cec6d2a2801af8dde55e, data reload: false

------ Round 1 ----------------------------------
q1	17583	6688	6578	6578
q2	2065	167	155	155
q3	10641	1039	1130	1039
q4	10558	757	710	710
q5	7720	2842	2838	2838
q6	216	132	131	131
q7	946	613	593	593
q8	9358	1959	2022	1959
q9	6545	6383	6369	6369
q10	7001	2228	2328	2228
q11	461	258	252	252
q12	389	209	207	207
q13	17768	2947	2978	2947
q14	230	201	221	201
q15	499	467	467	467
q16	651	577	597	577
q17	962	578	561	561
q18	7032	6681	6733	6681
q19	1392	1074	1128	1074
q20	481	226	207	207
q21	3968	3103	3138	3103
q22	1090	965	980	965
Total cold run time: 107556 ms
Total hot run time: 39842 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6589	6559	6495	6495
q2	322	231	231	231
q3	2858	2766	2800	2766
q4	2049	1801	1800	1800
q5	5741	5729	5715	5715
q6	212	124	125	124
q7	2231	1794	1782	1782
q8	3358	3496	3505	3496
q9	8702	8830	8783	8783
q10	3578	3543	3549	3543
q11	585	490	505	490
q12	795	617	606	606
q13	8762	3184	3116	3116
q14	301	281	270	270
q15	504	466	480	466
q16	719	666	657	657
q17	1843	1622	1608	1608
q18	8219	7557	7623	7557
q19	1667	1436	1541	1436
q20	2059	1856	1859	1856
q21	5530	5406	5312	5312
q22	1108	1027	1007	1007
Total cold run time: 67732 ms
Total hot run time: 59116 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 40.05% (10491/26198)
Line Coverage 30.73% (88214/287062)
Region Coverage 29.82% (45437/152346)
Branch Coverage 26.21% (23046/87922)

@doris-robot

TPC-DS: Total hot run time: 196125 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 761faa6956b4bf13e2c2cec6d2a2801af8dde55e, data reload: false

query1	1294	922	902	902
query2	6226	1966	1976	1966
query3	10807	4333	4286	4286
query4	60421	29651	23622	23622
query5	5223	462	449	449
query6	406	179	182	179
query7	5405	313	321	313
query8	308	236	224	224
query9	8461	2595	2580	2580
query10	441	280	256	256
query11	18839	15225	15677	15225
query12	152	103	101	101
query13	1440	428	421	421
query14	9645	6768	7125	6768
query15	197	176	175	175
query16	7197	471	439	439
query17	1109	583	586	583
query18	1912	321	324	321
query19	211	157	160	157
query20	120	107	108	107
query21	221	103	105	103
query22	4656	4417	4553	4417
query23	34316	34032	33676	33676
query24	6165	2939	2941	2939
query25	516	419	439	419
query26	664	179	167	167
query27	1710	352	365	352
query28	4119	2480	2435	2435
query29	720	470	462	462
query30	261	170	161	161
query31	983	809	842	809
query32	70	58	61	58
query33	429	309	316	309
query34	904	497	533	497
query35	859	749	738	738
query36	1084	977	964	964
query37	121	70	72	70
query38	4014	4065	3949	3949
query39	1502	1480	1469	1469
query40	212	103	106	103
query41	48	50	49	49
query42	123	96	103	96
query43	554	491	511	491
query44	1192	829	822	822
query45	185	172	164	164
query46	1177	712	742	712
query47	2029	1880	1951	1880
query48	458	374	385	374
query49	706	397	376	376
query50	850	438	443	438
query51	7194	7248	7227	7227
query52	103	87	91	87
query53	258	180	183	180
query54	581	481	467	467
query55	75	77	83	77
query56	262	246	248	246
query57	1310	1165	1101	1101
query58	211	201	212	201
query59	3092	2939	2808	2808
query60	288	252	267	252
query61	110	104	150	104
query62	755	652	674	652
query63	214	184	188	184
query64	1385	680	624	624
query65	3253	3206	3172	3172
query66	704	295	298	295
query67	15828	15578	15521	15521
query68	4102	585	574	574
query69	429	268	263	263
query70	1114	1148	1028	1028
query71	358	257	257	257
query72	6347	4019	3961	3961
query73	741	352	356	352
query74	10595	9366	8975	8975
query75	3372	2639	2648	2639
query76	2240	1056	1053	1053
query77	474	271	269	269
query78	10614	9586	9603	9586
query79	2151	604	607	604
query80	1347	473	428	428
query81	519	238	237	237
query82	1117	88	84	84
query83	156	145	138	138
query84	282	88	84	84
query85	967	309	288	288
query86	351	303	296	296
query87	4373	4265	4269	4265
query88	3773	2399	2348	2348
query89	409	286	292	286
query90	2002	180	185	180
query91	183	145	147	145
query92	58	47	52	47
query93	3083	558	573	558
query94	780	281	291	281
query95	359	263	261	261
query96	616	278	283	278
query97	3302	3115	3163	3115
query98	217	200	192	192
query99	1517	1282	1303	1282
Total cold run time: 314233 ms
Total hot run time: 196125 ms

@doris-robot

ClickBench: Total hot run time: 33.34 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 761faa6956b4bf13e2c2cec6d2a2801af8dde55e, data reload: false

query1	0.03	0.03	0.03
query2	0.06	0.03	0.04
query3	0.24	0.06	0.07
query4	1.63	0.10	0.10
query5	0.53	0.50	0.52
query6	1.15	0.74	0.72
query7	0.02	0.02	0.02
query8	0.03	0.03	0.03
query9	0.55	0.51	0.52
query10	0.55	0.54	0.55
query11	0.15	0.10	0.11
query12	0.13	0.12	0.11
query13	0.61	0.61	0.59
query14	2.85	2.75	2.72
query15	0.89	0.81	0.81
query16	0.36	0.38	0.40
query17	1.03	1.00	1.01
query18	0.24	0.21	0.23
query19	1.95	1.84	2.05
query20	0.02	0.01	0.01
query21	15.36	0.58	0.59
query22	2.50	2.35	2.45
query23	16.95	1.01	0.97
query24	3.27	1.47	1.46
query25	0.28	0.12	0.28
query26	0.33	0.14	0.14
query27	0.04	0.04	0.04
query28	9.48	0.52	0.44
query29	12.60	3.25	3.24
query30	0.25	0.06	0.06
query31	2.85	0.39	0.38
query32	3.28	0.45	0.46
query33	2.95	3.02	3.03
query34	17.24	4.49	4.50
query35	4.54	4.52	4.51
query36	0.66	0.51	0.48
query37	0.08	0.06	0.07
query38	0.05	0.04	0.03
query39	0.04	0.02	0.02
query40	0.16	0.12	0.12
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 106.07 s
Total hot run time: 33.34 s

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 9009e51 into branch-3.0 Apr 22, 2025
22 of 24 checks passed
@github-actions github-actions bot deleted the auto-pick-50066-branch-3.0 branch April 22, 2025 02:16
4 participants