[improve](partition) Increase partition limit defaults to 20000 and add near-limit metrics #61511

Open
dataroaring wants to merge 5 commits into master from improve/partition-limit-defaults-and-metrics
Conversation

@dataroaring
Contributor

Summary

  • Raise max_dynamic_partition_num default from 500 to 20000 and max_auto_partition_num from 2000 to 20000 to match modern production workloads
  • Add warning logs when partition counts exceed 80% of their configured limits, enabling proactive detection before hard failures
  • Add Prometheus counter metrics (auto_partition_near_limit_count, dynamic_partition_near_limit_count) for monitoring/alerting
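
The three bullets above describe a three-way outcome per check: reject above the limit, warn and count above 80% of it, otherwise pass. A minimal sketch of that control flow follows. The class name, the `Result` enum, and the `AtomicLong` counter are hypothetical stand-ins for illustration; the PR itself uses Doris's own logging and metric classes, and it throws or returns an error status on the over-limit path.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the near-limit check; not the Doris implementation. */
final class PartitionLimitCheck {
    enum Result { OK, NEAR_LIMIT, OVER_LIMIT }

    // Stand-in for a counter such as auto_partition_near_limit_count.
    static final AtomicLong NEAR_LIMIT_COUNT = new AtomicLong();

    static Result check(int partitionNum, int limit) {
        if (partitionNum > limit) {
            // The real code rejects the request on this path.
            return Result.OVER_LIMIT;
        }
        // 80% of the limit, computed in long arithmetic.
        long threshold = limit * 8L / 10;
        if (partitionNum > threshold) {
            NEAR_LIMIT_COUNT.incrementAndGet();
            System.err.printf("partition count %d is approaching limit %d (>80%%)%n",
                    partitionNum, limit);
            return Result.NEAR_LIMIT;
        }
        return Result.OK;
    }

    public static void main(String[] args) {
        System.out.println(check(100, 20000));
        System.out.println(check(16001, 20000));
        System.out.println(check(20001, 20000));
        System.out.println("near-limit events: " + NEAR_LIMIT_COUNT.get());
    }
}
```

Note that in this sketch a count exactly equal to the limit is still only a warning, since the hard failure is triggered by strictly exceeding the limit.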

Test plan

  • Verify existing dynamic partition tests pass with new default (tests explicitly set config values, so unaffected)
  • Verify auto-partition limit check still errors correctly when exceeded
  • Verify warning logs appear when partition count is between 80%-100% of limit
  • Verify new metrics appear in /metrics Prometheus endpoint
  • Test Prometheus alert rule: rate(doris_fe_auto_partition_near_limit_count[5m]) > 0
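
For readers less familiar with PromQL, the alert rule in the last bullet fires whenever the counter grew inside the 5-minute window. The sketch below is a deliberately simplified model of that idea using only two samples. `CounterRate`, `simpleRate`, and `wouldFire` are hypothetical names, and Prometheus's actual `rate()` also handles counter resets and extrapolates to the window boundaries, which this sketch ignores.

```java
/** Simplified model of a "rate(counter[window]) > 0" alert; not Prometheus's algorithm. */
final class CounterRate {
    /** Per-second increase between the first and last sample of a window. */
    static double simpleRate(long firstValue, long lastValue, double windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        return (lastValue - firstValue) / windowSeconds;
    }

    /** True when the counter increased during the window. */
    static boolean wouldFire(long firstValue, long lastValue, double windowSeconds) {
        return simpleRate(firstValue, lastValue, windowSeconds) > 0;
    }

    public static void main(String[] args) {
        // No near-limit events in the last 5 minutes: the alert stays quiet.
        System.out.println(wouldFire(3, 3, 300));
        // Three new near-limit events in the last 5 minutes: the alert fires.
        System.out.println(wouldFire(3, 6, 300));
    }
}
```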

🤖 Generated with Claude Code

… to 20000 and add near-limit metrics

Raise the default limits from 500/2000 to 20000 for both max_dynamic_partition_num
and max_auto_partition_num to better match modern production workloads.

Add warning logs and Prometheus counter metrics (auto_partition_near_limit_count,
dynamic_partition_near_limit_count) that fire when partition counts exceed 80%
of their configured limits, enabling proactive monitoring before hard failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 19, 2026 06:38
@Thearas
Contributor

Thearas commented Mar 19, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?


Copilot AI left a comment

Pull request overview

This PR updates FE partition-limit defaults and adds observability to help detect near-limit partition growth before hard failures.

Changes:

  • Increase default limits: max_dynamic_partition_num and max_auto_partition_num to 20000.
  • Add “near limit” warning logs when partition counts exceed 80% of configured limits.
  • Register two new Prometheus counter metrics for near-limit events.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File | Description
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java | Adds near-limit warning + counter increment in auto-partition createPartition limit check.
fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java | Declares and registers new near-limit counter metrics.
fe/fe-core/src/main/java/org/apache/doris/common/util/DynamicPartitionUtil.java | Adds near-limit warning + counter increment during dynamic partition property analysis.
fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Raises default values for dynamic and auto partition limits to 20000.
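
To illustrate what "declares and registers new near-limit counter metrics" amounts to from a scraper's point of view, here is a self-contained sketch of a tiny counter registry that renders the Prometheus text exposition format. `NearLimitMetrics`, `register`, and `render` are hypothetical and do not mirror the MetricRepo API. The `doris_fe_` prefix is taken from the alert rule in the PR description; applying the same prefix to the dynamic-partition counter is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical mini-registry; it only imitates how registered counters get exposed. */
final class NearLimitMetrics {
    // Prefix as seen in the alert rule (doris_fe_auto_partition_near_limit_count).
    private static final String PREFIX = "doris_fe_";

    private final Map<String, AtomicLong> counters = new LinkedHashMap<>();

    /** Registers a counter by short name, or returns the existing one. */
    AtomicLong register(String name) {
        return counters.computeIfAbsent(name, k -> new AtomicLong());
    }

    /** Renders all counters in Prometheus text exposition format. */
    String render() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, AtomicLong> e : counters.entrySet()) {
            String fullName = PREFIX + e.getKey();
            sb.append("# TYPE ").append(fullName).append(" counter\n");
            sb.append(fullName).append(' ').append(e.getValue().get()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        NearLimitMetrics metrics = new NearLimitMetrics();
        AtomicLong auto = metrics.register("auto_partition_near_limit_count");
        metrics.register("dynamic_partition_near_limit_count");
        auto.incrementAndGet(); // one auto-partition near-limit event
        System.out.print(metrics.render());
    }
}
```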

    result.setStatus(errorStatus);
    LOG.warn("send create partition error status: {}", result);
    return result;
} else if (partitionNum > autoPartitionLimit * 0.8) {
Comment on lines +4407 to +4409
LOG.warn("Table {}.{} auto partition count {} is approaching limit {} (>80%)."
        + " Consider increasing max_auto_partition_num.",
        db.getFullName(), olapTable.getName(), partitionNum, autoPartitionLimit);
Comment on lines +647 to +653
if (expectCreatePartitionNum > Config.max_dynamic_partition_num) {
    throw new DdlException("Too many dynamic partitions: "
            + expectCreatePartitionNum + ". Limit: " + Config.max_dynamic_partition_num);
} else if (expectCreatePartitionNum > Config.max_dynamic_partition_num * 0.8) {
    LOG.warn("Dynamic partition count {} is approaching limit {} (>80%)."
            + " Consider increasing max_dynamic_partition_num.",
            expectCreatePartitionNum, Config.max_dynamic_partition_num);
dataroaring and others added 3 commits March 19, 2026 00:05
- Fix indentation alignment for multi-line log statements in DynamicPartitionUtil.java
- Fix indentation alignment for multi-line log statements in FrontendServiceImpl.java
- Ensure string concatenation lines align properly with checkstyle requirements
…eImpl

- Align string concatenation line to 24 spaces to match checkstyle requirements
- Ensure consistent indentation with DynamicPartitionUtil.java
- Swap MetricRepo/MetaContext imports to fix lexicographic order (CheckStyle)
- Remove accidentally committed .orig files
@dataroaring
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26983 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 06c666fcf54bcb427cf1d4e090747e3cd826a261, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17660	4523	4287	4287
q2
q3	10653	772	513	513
q4	4675	360	249	249
q5	7556	1218	1033	1033
q6	175	175	145	145
q7	776	860	660	660
q8	9668	1444	1309	1309
q9	5553	4743	4689	4689
q10	6323	1905	1669	1669
q11	470	255	248	248
q12	739	581	464	464
q13	18049	2928	2172	2172
q14	232	237	214	214
q15
q16	742	748	668	668
q17	741	844	437	437
q18	6003	5401	5263	5263
q19	1376	973	607	607
q20	543	480	372	372
q21	4523	1822	1674	1674
q22	415	350	310	310
Total cold run time: 96872 ms
Total hot run time: 26983 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4904	4690	4533	4533
q2
q3	3900	4385	3869	3869
q4	891	1193	813	813
q5	4105	4393	4334	4334
q6	178	183	143	143
q7	1732	1636	1517	1517
q8	2475	2717	2550	2550
q9	7537	7518	7360	7360
q10	3738	3962	3540	3540
q11	517	462	439	439
q12	512	602	480	480
q13	2692	3290	2354	2354
q14	283	300	277	277
q15
q16	727	750	717	717
q17	1209	1405	1415	1405
q18	7298	6783	6583	6583
q19	901	864	926	864
q20	2078	2173	2060	2060
q21	4009	3441	3325	3325
q22	468	414	372	372
Total cold run time: 50154 ms
Total hot run time: 47535 ms
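
The report does not label its columns. Summing the first numeric column of each round reproduces the "Total cold run time" figure, and summing the last numeric column reproduces "Total hot run time", so the sketch below treats them that way; the interpretation of the two middle columns as individual hot runs is an assumption. `BenchmarkTotals` is a hypothetical helper, not part of the benchmark tooling, and the sample rows are the first few rows of Round 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that recomputes cold/hot totals from report rows. */
final class BenchmarkTotals {
    /** Returns {coldTotal, hotTotal}; rows without four timings are skipped. */
    static long[] totals(List<String> rows) {
        long cold = 0;
        long hot = 0;
        for (String row : rows) {
            List<Long> timings = new ArrayList<>();
            for (String token : row.trim().split("\\s+")) {
                if (token.matches("\\d+")) {
                    timings.add(Long.parseLong(token));
                }
            }
            if (timings.size() < 4) {
                continue; // e.g. a query name with no reported timings
            }
            cold += timings.get(0);                  // first numeric column
            hot += timings.get(timings.size() - 1);  // last numeric column
        }
        return new long[] {cold, hot};
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "q1\t17660\t4523\t4287\t4287",
                "q2",
                "q3\t10653\t772\t513\t513",
                "q4\t4675\t360\t249\t249");
        long[] t = totals(rows);
        System.out.println("cold=" + t[0] + " ms, hot=" + t[1] + " ms");
    }
}
```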

@doris-robot

TPC-DS: Total hot run time: 169628 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 06c666fcf54bcb427cf1d4e090747e3cd826a261, data reload: false

query5	4332	634	507	507
query6	341	238	249	238
query7	4232	478	271	271
query8	345	251	232	232
query9	8708	2689	2695	2689
query10	537	386	367	367
query11	6928	5077	4861	4861
query12	182	134	126	126
query13	1272	479	346	346
query14	5814	3754	3517	3517
query14_1	2848	2855	2798	2798
query15	209	197	183	183
query16	980	467	438	438
query17	999	758	637	637
query18	2457	463	358	358
query19	214	217	202	202
query20	135	131	128	128
query21	216	136	113	113
query22	13367	14556	14662	14556
query23	16198	15862	15593	15593
query23_1	15814	15886	15837	15837
query24	8423	1616	1219	1219
query24_1	1242	1230	1212	1212
query25	624	463	413	413
query26	1238	258	152	152
query27	2771	480	299	299
query28	4465	1812	1819	1812
query29	867	567	474	474
query30	297	220	188	188
query31	995	945	872	872
query32	79	71	69	69
query33	506	332	305	305
query34	899	862	520	520
query35	655	678	607	607
query36	1085	1077	968	968
query37	147	97	84	84
query38	2971	2903	2980	2903
query39	856	842	816	816
query39_1	834	812	822	812
query40	236	156	140	140
query41	64	62	59	59
query42	266	266	264	264
query43	249	253	225	225
query44	
query45	202	192	191	191
query46	904	1013	629	629
query47	2114	2158	2045	2045
query48	313	329	240	240
query49	647	462	384	384
query50	683	282	231	231
query51	4117	4175	3999	3999
query52	269	268	256	256
query53	285	335	290	290
query54	303	283	278	278
query55	93	89	80	80
query56	324	331	312	312
query57	1967	1913	1720	1720
query58	290	275	273	273
query59	2768	3015	2748	2748
query60	352	337	320	320
query61	160	151	160	151
query62	623	588	551	551
query63	311	279	274	274
query64	5130	1282	1031	1031
query65	
query66	1457	500	364	364
query67	24458	24353	24182	24182
query68	
query69	414	309	278	278
query70	975	1006	952	952
query71	345	314	308	308
query72	2875	2750	2530	2530
query73	557	551	319	319
query74	9674	9602	9493	9493
query75	2872	2740	2470	2470
query76	2288	1033	692	692
query77	370	367	306	306
query78	11067	11216	10514	10514
query79	1145	776	573	573
query80	803	653	549	549
query81	512	263	229	229
query82	1359	154	118	118
query83	361	269	252	252
query84	280	117	99	99
query85	941	488	447	447
query86	375	297	303	297
query87	3154	3126	3005	3005
query88	3591	2663	2657	2657
query89	426	375	356	356
query90	1841	181	184	181
query91	176	164	148	148
query92	76	82	73	73
query93	936	840	499	499
query94	501	334	297	297
query95	589	338	392	338
query96	657	521	237	237
query97	2476	2479	2405	2405
query98	244	225	222	222
query99	1031	1011	943	943
Total cold run time: 250460 ms
Total hot run time: 169628 ms

Address Copilot review comments on #61511:

- DynamicPartitionUtil: Capture Config.max_dynamic_partition_num into a
  local variable to avoid inconsistent reads of the mutable config, and
  use 'dynamicPartitionLimit * 8L / 10' (long arithmetic) instead of
  '* 0.8' to avoid implicit int-to-double conversion.

- FrontendServiceImpl: Use 'autoPartitionLimit * 8 / 10' instead of
  '* 0.8' to keep the threshold comparison in integer math. The config
  was already captured to a local variable in the original PR.
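
The arithmetic in this commit message can be checked in isolation. The sketch below compares the three threshold forms mentioned (`* 0.8`, `* 8L / 10`, `* 8 / 10`) and shows a single snapshot read of a mutable setting. `ThresholdMath` and the `maxDynamicPartitionNum` field are hypothetical stand-ins, and it assumes the limits are Java `int` values. Under that assumption the double and long forms agree for integer counts, while the plain `int` form would only misbehave for limits above `Integer.MAX_VALUE / 8`, far beyond the 20000 default.

```java
/** Hypothetical comparison of the threshold expressions discussed in the commit. */
final class ThresholdMath {
    // Stand-in for a mutable config field such as Config.max_dynamic_partition_num.
    static volatile int maxDynamicPartitionNum = 20000;

    static boolean nearLimitDouble(int count, int limit) {
        return count > limit * 0.8;        // implicit int-to-double conversion
    }

    static boolean nearLimitLong(int count, int limit) {
        return count > limit * 8L / 10;    // long arithmetic, no overflow for int limits
    }

    static boolean nearLimitInt(int count, int limit) {
        return count > limit * 8 / 10;     // int arithmetic, overflows for very large limits
    }

    /** Reads the mutable setting once so both comparisons see the same value. */
    static boolean checkDynamic(int count) {
        int limit = maxDynamicPartitionNum;
        return count <= limit && nearLimitLong(count, limit);
    }

    public static void main(String[] args) {
        int mismatches = 0;
        for (int limit = 0; limit <= 2000; limit++) {
            for (int count = 0; count <= limit + 1; count++) {
                if (nearLimitDouble(count, limit) != nearLimitLong(count, limit)) {
                    mismatches++;
                }
            }
        }
        System.out.println("double vs long mismatches: " + mismatches);

        int hugeLimit = 300_000_000; // above Integer.MAX_VALUE / 8
        System.out.println("int form, count=1:  " + nearLimitInt(1, hugeLimit));
        System.out.println("long form, count=1: " + nearLimitLong(1, hugeLimit));
    }
}
```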
@dataroaring
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 27021 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit edfdb4c244f7a2e9b9a3ec65bd029afd356956d5, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17616	4469	4302	4302
q2
q3	10652	773	529	529
q4	4674	347	253	253
q5	7552	1201	995	995
q6	177	176	148	148
q7	786	857	681	681
q8	9292	1481	1375	1375
q9	4967	4748	4684	4684
q10	6296	1916	1697	1697
q11	466	264	242	242
q12	754	582	471	471
q13	18042	2970	2187	2187
q14	231	234	219	219
q15
q16	752	748	669	669
q17	732	865	426	426
q18	5898	5444	5381	5381
q19	1138	995	638	638
q20	542	499	382	382
q21	4522	1885	1458	1458
q22	546	383	284	284
Total cold run time: 95635 ms
Total hot run time: 27021 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4889	4680	4530	4530
q2
q3	3857	4306	3788	3788
q4	899	1257	813	813
q5	4114	4458	4390	4390
q6	187	176	141	141
q7	1755	1650	1571	1571
q8	2601	2734	2583	2583
q9	7546	7374	7374	7374
q10	3840	4033	3712	3712
q11	525	441	425	425
q12	495	623	470	470
q13	2870	3176	2491	2491
q14	283	294	281	281
q15
q16	717	753	733	733
q17	1172	1341	1325	1325
q18	7190	6833	6606	6606
q19	1122	982	939	939
q20	2122	2218	2005	2005
q21	4104	3481	3399	3399
q22	454	432	391	391
Total cold run time: 50742 ms
Total hot run time: 47967 ms

@doris-robot

TPC-DS: Total hot run time: 169135 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit edfdb4c244f7a2e9b9a3ec65bd029afd356956d5, data reload: false

query5	4323	636	486	486
query6	338	241	214	214
query7	4216	467	262	262
query8	343	248	232	232
query9	8717	2720	2743	2720
query10	545	390	353	353
query11	6834	5131	4908	4908
query12	183	133	130	130
query13	1274	481	355	355
query14	5743	3705	3489	3489
query14_1	2851	2844	2846	2844
query15	202	193	178	178
query16	981	471	469	469
query17	915	735	654	654
query18	2454	464	348	348
query19	218	214	187	187
query20	143	129	130	129
query21	215	139	119	119
query22	13250	14039	14454	14039
query23	16556	15746	15761	15746
query23_1	15833	15937	15802	15802
query24	7402	1612	1256	1256
query24_1	1238	1228	1224	1224
query25	605	468	404	404
query26	1237	254	162	162
query27	2769	484	297	297
query28	4416	1832	1843	1832
query29	835	563	472	472
query30	296	225	186	186
query31	1009	949	870	870
query32	87	68	69	68
query33	512	335	276	276
query34	903	896	542	542
query35	641	703	620	620
query36	1137	1115	963	963
query37	134	96	85	85
query38	3026	2937	2965	2937
query39	869	837	815	815
query39_1	815	807	837	807
query40	240	165	139	139
query41	78	61	60	60
query42	265	268	267	267
query43	252	267	240	240
query44	
query45	204	191	189	189
query46	909	1009	635	635
query47	2526	2108	2068	2068
query48	322	314	229	229
query49	632	458	394	394
query50	718	280	234	234
query51	4104	4077	4007	4007
query52	263	264	257	257
query53	293	344	281	281
query54	303	272	259	259
query55	97	93	82	82
query56	322	311	323	311
query57	1954	1883	1722	1722
query58	284	271	265	265
query59	2808	2984	2747	2747
query60	337	333	324	324
query61	157	155	157	155
query62	615	576	532	532
query63	306	281	280	280
query64	5034	1274	1010	1010
query65	
query66	1482	458	356	356
query67	24468	24342	24222	24222
query68	
query69	416	310	280	280
query70	988	981	944	944
query71	336	304	297	297
query72	2817	2738	2445	2445
query73	557	544	325	325
query74	9682	9565	9439	9439
query75	2856	2768	2471	2471
query76	2280	1027	667	667
query77	359	378	306	306
query78	11027	11067	10475	10475
query79	2413	790	586	586
query80	1792	630	569	569
query81	542	261	219	219
query82	1007	154	120	120
query83	333	262	244	244
query84	308	122	94	94
query85	922	485	450	450
query86	415	297	293	293
query87	3125	3169	3010	3010
query88	3584	2696	2673	2673
query89	436	373	344	344
query90	2069	173	178	173
query91	170	166	146	146
query92	76	74	72	72
query93	1032	867	506	506
query94	667	322	293	293
query95	598	342	324	324
query96	649	522	223	223
query97	2462	2511	2409	2409
query98	236	224	231	224
query99	1021	985	892	892
Total cold run time: 252289 ms
Total hot run time: 169135 ms

4 participants