
[fix](fe) Tolerate missing rollup partition metadata during replay #62623

Open
Hastyshell wants to merge 1 commit into apache:master from Hastyshell:fix-rollup-replay-missing-partition

Conversation

@Hastyshell
Collaborator

Summary

  • skip rollup replay entries whose partition no longer exists in table metadata
  • skip rollup replay entries whose partition data property is missing instead of failing FE startup
  • log job, table, and partition ids so operators can unblock startup and keep tracing the inconsistent metadata

Testing

  • ./build.sh --fe

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: FE replay can hit a stale rollup partition id whose partition or data property is no longer available in table metadata, which crashes startup with a NullPointerException. Skip those inconsistent partitions during rollup replay so FE can finish recovery and operators can replace the binary to unblock startup.
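
The guard described in the problem summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patch itself: `RollupReplaySketch`, `TableMeta`, and `replay` are hypothetical stand-ins and do not correspond to class or method names in the Doris FE code, and the storage medium string is only a placeholder for the partition's data property.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch; none of these names are taken from the Doris code base.
public class RollupReplaySketch {
    private static final Logger LOG = Logger.getLogger(RollupReplaySketch.class.getName());

    /** Simplified table metadata: known partitions and their data property. */
    static final class TableMeta {
        final Map<Long, String> partitionNames = new HashMap<>();
        final Map<Long, String> storageMediumByPartition = new HashMap<>();
    }

    /**
     * Replays the rollup entries of one job and returns the partition ids that
     * were applied. Entries whose partition or data property is missing are
     * logged with job, table, and partition ids, then skipped.
     */
    static List<Long> replay(long jobId, long tableId, List<Long> rollupPartitionIds, TableMeta table) {
        List<Long> applied = new ArrayList<>();
        for (long partitionId : rollupPartitionIds) {
            if (!table.partitionNames.containsKey(partitionId)) {
                LOG.warning(String.format(
                        "skip rollup replay, partition not found: job=%d table=%d partition=%d",
                        jobId, tableId, partitionId));
                continue;
            }
            String medium = table.storageMediumByPartition.get(partitionId);
            if (medium == null) {
                // Without this check, dereferencing the data property would throw a NullPointerException.
                LOG.warning(String.format(
                        "skip rollup replay, data property not found: job=%d table=%d partition=%d",
                        jobId, tableId, partitionId));
                continue;
            }
            // A real replay would register the rollup tablets under `medium` here.
            applied.add(partitionId);
        }
        return applied;
    }

    public static void main(String[] args) {
        TableMeta table = new TableMeta();
        table.partitionNames.put(1L, "p1");
        table.storageMediumByPartition.put(1L, "HDD");
        table.partitionNames.put(2L, "p2"); // partition exists, data property missing
        // Partition 3 is referenced by the job but absent from table metadata.
        System.out.println(replay(100L, 10L, Arrays.asList(1L, 2L, 3L), table));
    }
}
```

Skipping with a warning trades strictness for availability: startup completes, and the logged ids leave a trail for tracing how the metadata became inconsistent.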

### Release note

None

### Check List (For Author)

- Test: FE build
    - Manual test / No need to test (startup replay guard only)
- Behavior changed: Yes (rollup replay now skips inconsistent partitions instead of failing startup)
- Does this need documentation: No
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@luwei16
Contributor

luwei16 commented May 19, 2026

run buildall

@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on May 19, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@hello-stephen
Contributor

TPC-H: Total hot run time: 30802 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e03c155274388978f5e0d29caa8d828c20b22d15, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17627	3941	3910	3910
q2	q3	10780	1341	788	788
q4	4676	471	347	347
q5	7549	2217	2064	2064
q6	238	182	140	140
q7	932	784	632	632
q8	9335	1705	1581	1581
q9	6692	4874	4874	4874
q10	6444	2093	1790	1790
q11	434	271	242	242
q12	693	424	291	291
q13	18214	3347	2748	2748
q14	264	267	235	235
q15	q16	820	792	707	707
q17	1013	989	861	861
q18	6735	5809	5442	5442
q19	1195	1310	978	978
q20	641	469	293	293
q21	6097	2655	2559	2559
q22	460	369	320	320
Total cold run time: 100839 ms
Total hot run time: 30802 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4594	4555	4671	4555
q2	q3	4839	5223	4723	4723
q4	2130	2281	1422	1422
q5	4859	4622	4628	4622
q6	232	185	137	137
q7	1885	1750	1511	1511
q8	2217	1887	1867	1867
q9	7192	7258	7135	7135
q10	4472	4423	3982	3982
q11	536	397	358	358
q12	716	713	506	506
q13	3049	3400	2881	2881
q14	269	292	252	252
q15	q16	675	699	608	608
q17	1253	1262	1227	1227
q18	7239	6883	6808	6808
q19	1150	1106	1061	1061
q20	2196	2212	1930	1930
q21	5293	4557	4413	4413
q22	517	454	424	424
Total cold run time: 55313 ms
Total hot run time: 50422 ms
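
The report does not label its columns. Judging only from the numbers, each row appears to hold a cold run, two hot runs, and the smaller of the two hot runs, with the totals being column sums. The sketch below checks that reading against the Round 1 rows; the class and method names are hypothetical, and the column interpretation is an inference from the data rather than documented behavior of the tpch-tools scripts.

```java
public class TpchTotals {
    // Round 1 rows copied from the report: label(s), cold, hot1, hot2, best hot.
    static final String[] ROUND1 = {
        "q1 17627 3941 3910 3910",
        "q2 q3 10780 1341 788 788",
        "q4 4676 471 347 347",
        "q5 7549 2217 2064 2064",
        "q6 238 182 140 140",
        "q7 932 784 632 632",
        "q8 9335 1705 1581 1581",
        "q9 6692 4874 4874 4874",
        "q10 6444 2093 1790 1790",
        "q11 434 271 242 242",
        "q12 693 424 291 291",
        "q13 18214 3347 2748 2748",
        "q14 264 267 235 235",
        "q15 q16 820 792 707 707",
        "q17 1013 989 861 861",
        "q18 6735 5809 5442 5442",
        "q19 1195 1310 978 978",
        "q20 641 469 293 293",
        "q21 6097 2655 2559 2559",
        "q22 460 369 320 320",
    };

    /** Returns {coldTotal, hotTotal}, using min(hot1, hot2) as each row's hot time. */
    static long[] totals(String[] rows) {
        long cold = 0;
        long hot = 0;
        for (String row : rows) {
            String[] fields = row.trim().split("\\s+");
            int n = fields.length;
            // Read from the right so rows with two labels ("q2 q3 ...") still parse.
            cold += Long.parseLong(fields[n - 4]);
            hot += Math.min(Long.parseLong(fields[n - 3]), Long.parseLong(fields[n - 2]));
        }
        return new long[] {cold, hot};
    }

    public static void main(String[] args) {
        long[] t = totals(ROUND1);
        System.out.println("cold=" + t[0] + " ms, hot=" + t[1] + " ms");
    }
}
```

For Round 1 this yields 100839 ms cold and 30802 ms hot, matching the totals the bot reported.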

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169715 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e03c155274388978f5e0d29caa8d828c20b22d15, data reload: false

query5	4314	666	518	518
query6	320	231	200	200
query7	4222	583	294	294
query8	325	229	223	223
query9	8821	4082	4035	4035
query10	439	342	296	296
query11	5845	2422	2246	2246
query12	182	127	122	122
query13	1273	577	434	434
query14	5918	5343	5026	5026
query14_1	4296	4342	4338	4338
query15	207	200	184	184
query16	1022	457	437	437
query17	1113	720	577	577
query18	2450	471	348	348
query19	213	204	163	163
query20	138	135	132	132
query21	215	136	130	130
query22	13542	13553	13406	13406
query23	17245	16398	16010	16010
query23_1	16166	16175	16197	16175
query24	7421	1760	1332	1332
query24_1	1313	1332	1313	1313
query25	600	506	445	445
query26	1348	333	174	174
query27	2672	566	345	345
query28	4457	2001	1962	1962
query29	1014	638	519	519
query30	312	242	214	214
query31	1109	1067	939	939
query32	96	81	76	76
query33	537	368	305	305
query34	1188	1126	652	652
query35	773	787	690	690
query36	1353	1343	1128	1128
query37	158	107	93	93
query38	3191	3145	3095	3095
query39	930	922	899	899
query39_1	880	876	875	875
query40	238	152	131	131
query41	73	70	70	70
query42	113	114	119	114
query43	331	328	286	286
query44	
query45	214	204	202	202
query46	1065	1212	706	706
query47	2323	2315	2224	2224
query48	411	421	300	300
query49	650	511	403	403
query50	990	360	271	271
query51	4439	4255	4331	4255
query52	110	110	103	103
query53	267	295	213	213
query54	339	291	271	271
query55	96	94	91	91
query56	331	325	319	319
query57	1439	1453	1322	1322
query58	311	290	290	290
query59	1614	1697	1469	1469
query60	337	339	312	312
query61	190	198	150	150
query62	687	620	550	550
query63	258	212	210	210
query64	2439	843	650	650
query65	
query66	1717	475	364	364
query67	29994	29876	29887	29876
query68	
query69	453	330	306	306
query70	1048	1013	1005	1005
query71	311	284	267	267
query72	3018	2681	2415	2415
query73	844	738	418	418
query74	5038	4895	4713	4713
query75	2645	2627	2247	2247
query76	2268	1107	740	740
query77	389	398	343	343
query78	12079	12110	11651	11651
query79	1548	1013	771	771
query80	1320	538	465	465
query81	519	274	235	235
query82	913	159	122	122
query83	318	284	252	252
query84	258	140	150	140
query85	933	523	457	457
query86	451	321	334	321
query87	3420	3360	3223	3223
query88	3548	2667	2659	2659
query89	447	390	337	337
query90	1899	184	191	184
query91	178	170	137	137
query92	78	80	76	76
query93	1617	1456	878	878
query94	707	341	311	311
query95	679	481	347	347
query96	1052	790	345	345
query97	2710	2661	2543	2543
query98	238	231	233	231
query99	1081	1108	1008	1008
Total cold run time: 253667 ms
Total hot run time: 169715 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/11) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 27.27% (3/11) 🎉
Increment coverage report
Complete coverage report


Labels

approved (Indicates a PR has been approved by one committer.), reviewed


3 participants