[fix](fe) Keep schema change waiting on conflict txn abort failure #65196

Open
Yukang-Lian wants to merge 1 commit into apache:master from Yukang-Lian:codex/schema-change-conflict-txn-abort

@Yukang-Lian Yukang-Lian commented Jul 3, 2026

Collaborator

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary:
A cloud schema change job in WAITING_TXN calls checkFailedPreviousLoadAndAbort() to abort failed conflict transactions so that it can make progress. If a conflict transaction is concurrently cleaned up, already aborted, already visible, or otherwise no longer abortable, abortTransaction() can throw UserException. Before this patch, runWaitingTxnJob() propagated that exception as AlterCancelException, so AlterJobV2.run() cancelled the schema change job even though the job should simply wait and re-check conflict transactions in the next round.

This patch makes conflict transaction abort best-effort inside SchemaChangeJobV2: an abort failure keeps the job in WAITING_TXN and lets the next scheduler round re-query transaction state. The docker regression case now reports the observed schema change state and fails immediately if the job is CANCELLED or stuck, instead of falling back to an opaque assertEquals(1,2).
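
The best-effort behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in SchemaChangeJobV2: `TxnManager`, `JobState`, `BestEffortAbortSketch`, and the local `UserException` class are hypothetical stand-ins, and treating "no abort failed" as "ready to run" is a simplifying assumption (the real job re-queries transaction state in the next round).

```java
import java.util.List;

// Hypothetical stand-in for Doris's UserException; only the name comes from the PR text.
class UserException extends Exception {
    UserException(String msg) {
        super(msg);
    }
}

// Hypothetical facade reduced to the single call the PR description mentions.
interface TxnManager {
    void abortTransaction(long txnId, String reason) throws UserException;
}

// Hypothetical job states; CANCELLED is listed only to show it is never returned below.
enum JobState { WAITING_TXN, RUNNING, CANCELLED }

class BestEffortAbortSketch {
    /**
     * Tries to abort every failed conflict transaction.
     * An abort failure is logged and swallowed instead of being rethrown.
     * Returns the number of aborts that failed.
     */
    static int abortFailedConflictTxns(TxnManager mgr, List<Long> failedTxnIds) {
        int failures = 0;
        for (long txnId : failedTxnIds) {
            try {
                mgr.abortTransaction(txnId, "schema change conflict");
            } catch (UserException e) {
                // Txn may already be aborted, visible, or cleaned up: not fatal for the job.
                failures++;
                System.err.println("abort of txn " + txnId + " failed, retry next round: "
                        + e.getMessage());
            }
        }
        return failures;
    }

    /**
     * One scheduler round. Assumption for this sketch: the job proceeds only when
     * no abort failed; otherwise it stays in WAITING_TXN and is re-checked later.
     */
    static JobState runWaitingTxnRound(TxnManager mgr, List<Long> failedTxnIds) {
        int failures = abortFailedConflictTxns(mgr, failedTxnIds);
        return failures == 0 ? JobState.RUNNING : JobState.WAITING_TXN;
    }
}
```

The point of the sketch is the catch block: the exception never escapes the round, so nothing upstream can turn it into a job cancellation.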

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - ./run-fe-ut.sh --run org.apache.doris.alter.CloudIndexTest#testSchemaChangeWaitsWhenConflictTxnAbortFails
    - ./run-fe-ut.sh --run org.apache.doris.alter.CloudIndexTest
    - git diff --check -- fe/fe-core/src/main/java/org/apache/doris/alter/SchemaChangeJobV2.java fe/fe-core/src/test/java/org/apache/doris/alter/CloudIndexTest.java regression-test/suites/schema_change_p0/test_abort_txn_by_fe.groovy
- Behavior changed: No
- Does this need documentation: No
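
The regression-case change mentioned in the problem summary (fail immediately on CANCELLED and report the observed state) might look like the polling helper below. It is a hedged sketch rather than the Groovy code in test_abort_txn_by_fe.groovy: `SchemaChangeWaitSketch`, `waitForFinished`, and the `FINISHED` state string are assumptions made for illustration.

```java
import java.util.function.Supplier;

class SchemaChangeWaitSketch {
    /**
     * Polls a state supplier until the job reports FINISHED.
     * Fails fast on CANCELLED and includes the last observed state in every failure,
     * instead of a placeholder assertion such as assertEquals(1, 2).
     */
    static String waitForFinished(Supplier<String> stateSupplier, int maxPolls, long sleepMillis) {
        String state = "UNKNOWN";
        for (int i = 0; i < maxPolls; i++) {
            state = stateSupplier.get();
            if ("FINISHED".equals(state)) {
                return state;
            }
            if ("CANCELLED".equals(state)) {
                // No point in waiting further: the job will not recover.
                throw new AssertionError("schema change was cancelled, state=" + state);
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting, last state=" + state);
            }
        }
        // Stuck job: report what was actually observed.
        throw new AssertionError("schema change did not finish after " + maxPolls
                + " polls, last state=" + state);
    }
}
```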
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Yukang-Lian Yukang-Lian marked this pull request as ready for review July 3, 2026 07:25
@Yukang-Lian

Collaborator Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29536 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8a9e9e8f81e6b881f4d0d7bd743b216449a5df63, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17875	4069	3966	3966
q2	2005	319	196	196
q3	10339	1460	814	814
q4	4726	478	337	337
q5	7723	862	570	570
q6	234	169	143	143
q7	771	840	623	623
q8	10274	1552	1620	1552
q9	6008	4453	4437	4437
q10	6804	1800	1537	1537
q11	511	340	308	308
q12	723	553	430	430
q13	18116	3343	2757	2757
q14	271	274	238	238
q15	q16	791	775	707	707
q17	1061	993	1031	993
q18	6806	5728	5540	5540
q19	1191	1318	1096	1096
q20	742	663	550	550
q21	5727	2663	2441	2441
q22	432	375	301	301
Total cold run time: 103130 ms
Total hot run time: 29536 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4372	4256	4269	4256
q2	279	314	206	206
q3	4556	4925	4404	4404
q4	2072	2173	1365	1365
q5	4447	4317	4335	4317
q6	234	177	126	126
q7	1797	2059	1611	1611
q8	2468	2156	2087	2087
q9	7936	7981	7840	7840
q10	4757	4717	4263	4263
q11	563	412	381	381
q12	938	791	545	545
q13	3293	3496	2963	2963
q14	314	296	270	270
q15	q16	725	736	644	644
q17	1347	1336	1320	1320
q18	8063	7316	6888	6888
q19	1143	1073	1089	1073
q20	2229	2198	1949	1949
q21	5272	4604	4474	4474
q22	521	462	419	419
Total cold run time: 57326 ms
Total hot run time: 51401 ms
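
The report does not label its columns. One reading that is consistent with both rounds above is: cold run, two hot runs, then the faster of the two hot runs, with "Total cold run time" summing the first column and "Total hot run time" summing the last. The sketch below encodes that inferred rule; `BenchTotalsSketch` is a hypothetical name and the three sample rows are q6, q8, and q17 from Round 1.

```java
class BenchTotalsSketch {
    // Each row is {cold, hot1, hot2} in milliseconds (column meaning is inferred, not documented).

    /** Sums the cold-run column. */
    static long totalCold(long[][] rows) {
        long sum = 0;
        for (long[] row : rows) {
            sum += row[0];
        }
        return sum;
    }

    /** Sums the faster of the two hot runs for each query. */
    static long totalHot(long[][] rows) {
        long sum = 0;
        for (long[] row : rows) {
            sum += Math.min(row[1], row[2]);
        }
        return sum;
    }

    // Sample rows copied from Round 1: q6, q8, q17.
    static final long[][] SAMPLE = {
        {234, 169, 143},
        {10274, 1552, 1620},
        {1061, 993, 1031},
    };
}
```

Applied to all 21 rows of a round, the same rule reproduces the totals printed in the report.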

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172988 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8a9e9e8f81e6b881f4d0d7bd743b216449a5df63, data reload: false

query5	4313	632	491	491
query6	453	223	205	205
query7	4844	604	337	337
query8	353	185	168	168
query9	8746	3995	3995	3995
query10	475	373	300	300
query11	5847	2379	2155	2155
query12	160	105	100	100
query13	1293	605	431	431
query14	6288	5346	4991	4991
query14_1	4281	4289	4296	4289
query15	207	205	184	184
query16	1025	475	462	462
query17	1138	751	583	583
query18	2712	486	357	357
query19	217	191	162	162
query20	122	113	110	110
query21	232	154	135	135
query22	13575	13631	13404	13404
query23	17326	16540	16116	16116
query23_1	16176	16187	16291	16187
query24	7503	1773	1326	1326
query24_1	1354	1318	1308	1308
query25	578	469	402	402
query26	1375	357	223	223
query27	2519	604	394	394
query28	4427	2004	2013	2004
query29	1105	628	503	503
query30	338	260	238	238
query31	1140	1104	978	978
query32	108	66	65	65
query33	530	317	259	259
query34	1164	1135	642	642
query35	751	790	688	688
query36	1406	1383	1221	1221
query37	159	109	93	93
query38	1879	1714	1652	1652
query39	923	913	900	900
query39_1	907	907	870	870
query40	290	164	132	132
query41	70	63	61	61
query42	93	93	92	92
query43	314	324	276	276
query44	1407	794	763	763
query45	209	191	181	181
query46	1087	1175	714	714
query47	2361	2432	2201	2201
query48	380	420	292	292
query49	573	417	315	315
query50	1017	412	331	331
query51	4423	4339	4339	4339
query52	84	84	74	74
query53	261	270	206	206
query54	267	220	213	213
query55	77	68	67	67
query56	286	291	285	285
query57	1415	1421	1322	1322
query58	288	259	246	246
query59	1554	1654	1429	1429
query60	292	261	248	248
query61	154	148	153	148
query62	707	654	588	588
query63	249	205	202	202
query64	2465	765	585	585
query65	4885	4802	4766	4766
query66	1796	508	385	385
query67	29598	29635	28789	28789
query68	3078	1488	1014	1014
query69	402	314	278	278
query70	1051	980	962	962
query71	400	308	332	308
query72	2911	2683	2388	2388
query73	825	773	448	448
query74	5083	4964	4753	4753
query75	2607	2570	2230	2230
query76	2330	1187	781	781
query77	354	384	304	304
query78	12424	12539	11888	11888
query79	1395	1168	747	747
query80	1180	548	450	450
query81	508	325	280	280
query82	559	155	120	120
query83	365	312	310	310
query84	293	169	133	133
query85	958	584	511	511
query86	407	298	287	287
query87	1824	1811	1732	1732
query88	3697	2798	2780	2780
query89	450	423	354	354
query90	1901	199	194	194
query91	201	195	165	165
query92	66	64	58	58
query93	1686	1574	931	931
query94	629	373	324	324
query95	801	509	544	509
query96	1027	809	338	338
query97	2702	2664	2554	2554
query98	218	206	198	198
query99	1159	1145	1034	1034
Total cold run time: 258474 ms
Total hot run time: 172988 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.24 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 8a9e9e8f81e6b881f4d0d7bd743b216449a5df63, data reload: false

query1	0.00	0.00	0.00
query2	0.10	0.05	0.05
query3	0.26	0.14	0.16
query4	1.61	0.13	0.14
query5	0.24	0.22	0.23
query6	1.26	1.07	1.06
query7	0.03	0.01	0.01
query8	0.05	0.04	0.03
query9	0.37	0.32	0.31
query10	0.58	0.55	0.58
query11	0.19	0.14	0.13
query12	0.18	0.15	0.14
query13	0.47	0.46	0.47
query14	1.02	1.00	1.03
query15	0.60	0.58	0.58
query16	0.34	0.33	0.32
query17	1.05	1.17	1.10
query18	0.23	0.21	0.21
query19	1.99	2.00	1.97
query20	0.02	0.01	0.01
query21	15.43	0.21	0.15
query22	4.90	0.05	0.06
query23	16.12	0.32	0.12
query24	3.01	0.43	0.30
query25	0.12	0.05	0.05
query26	0.72	0.21	0.16
query27	0.05	0.04	0.03
query28	3.49	0.92	0.57
query29	12.52	4.36	3.50
query30	0.27	0.16	0.17
query31	2.77	0.60	0.31
query32	3.23	0.60	0.49
query33	3.20	3.25	3.18
query34	15.63	4.24	3.52
query35	3.51	3.44	3.49
query36	0.56	0.42	0.42
query37	0.09	0.07	0.06
query38	0.05	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.16	0.15
query41	0.09	0.03	0.04
query42	0.04	0.03	0.03
query43	0.04	0.03	0.04
Total cold run time: 96.65 s
Total hot run time: 25.24 s

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage 71.43% (5/7) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/98) 🎉
Increment coverage report
Complete coverage report
