Skip to content

[fix](auto partition) keep load row metrics monotonic for auto partition#64109

Merged
sollhui merged 1 commit into
apache:masterfrom
sollhui:fix_auto_partition_metrics
Jun 5, 2026
Merged

[fix](auto partition) keep load row metrics monotonic for auto partition#64109
sollhui merged 1 commit into
apache:masterfrom
sollhui:fix_auto_partition_metrics

Conversation

@sollhui
Copy link
Copy Markdown
Contributor

@sollhui sollhui commented Jun 4, 2026

What problem does this PR solve?

When loading into an auto partition table, BE may temporarily batch rows whose target partitions do not exist yet. The previous logic first incremented doris_be_load_rows when receiving the input block, then decremented it when those rows were moved into the auto-partition batching block, and finally incremented it again when the batched rows were replayed after partition creation.

Because doris_be_load_rows is exposed as a counter, the negative adjustment made the metric non-monotonic and could show row count drops during auto partition loads.

This PR keeps the existing RuntimeState row compensation logic unchanged, but stops applying negative deltas to the global BE load row/byte metrics. It also skips global metric increments when replaying the internal batched block, so each source row is counted once.

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented Jun 4, 2026

run buildall

@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented Jun 4, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: no blocking issues found in this PR.

Critical checkpoint conclusions:

  • Goal and proof: The PR aims to keep global BE load row/byte metrics monotonic for auto-partition batching while preserving RuntimeState progress compensation. The code accomplishes this by removing negative global metric deltas when rows are moved into the batching block and by skipping global metric increments while replaying the internal batched block. No targeted test is added, so the remaining risk is regression coverage for this metric-only behavior.
  • Scope and clarity: The change is small and focused on the three relevant BE writer/distribution locations.
  • Concurrency: The modified code does not introduce new shared state or new concurrency. The _deal_batched flag is used in the existing writer/replay flow, and the new local snapshot is taken after _send_new_partition_batch(), which preserves the intended distinction between recursive replay writes and normal source writes.
  • Lifecycle/static initialization: No new lifecycle ownership, static initialization, or release-order behavior is introduced.
  • Configuration: No configuration item is added.
  • Compatibility: No wire protocol, storage format, or function symbol compatibility impact was found.
  • Parallel paths: Both VTabletWriter and VTabletWriterV2 are updated consistently. I also checked the row-distribution path where the negative delta was previously emitted.
  • Conditional checks: The new condition only guards global DorisMetrics increments and follows the existing _deal_batched replay state. RuntimeState row/byte accounting remains unchanged.
  • Tests: No tests are added or modified. Given this is metric accounting, a targeted auto-partition load metric regression would reduce risk, but I did not find a functional blocker in the implementation.
  • Observability: The change directly corrects observability metrics and does not require additional logs or metrics.
  • Transaction/persistence/data writes: The PR does not change transaction visibility, rowset writes, persistence, or data routing semantics.
  • FE/BE variable passing: No new FE/BE variables or thrift fields are added.
  • Performance: The added boolean snapshot and conditional metric increments are negligible; no performance issue found.
  • Other issues: No duplicate existing review threads were present, and no additional user-provided focus points were specified.

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.92% (21050/39038)
Line Coverage 37.58% (200098/532440)
Region Coverage 33.65% (156877/466177)
Branch Coverage 34.62% (68649/198304)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (10/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.65% (28174/38254)
Line Coverage 57.78% (306896/531168)
Region Coverage 54.63% (257217/470799)
Branch Coverage 56.00% (111532/199153)

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 4, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

PR approved by anyone and no changes requested.

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29427 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9c27b471a7fe1481a84e3b702b1c332e46cf3185, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17766	4009	4039	4009
q2	q3	10737	1386	872	872
q4	4690	481	344	344
q5	7537	905	588	588
q6	186	183	140	140
q7	779	823	651	651
q8	9346	1569	1627	1569
q9	5877	4521	4522	4521
q10	6793	1796	1523	1523
q11	433	280	254	254
q12	641	436	290	290
q13	18191	3389	2820	2820
q14	268	262	248	248
q15	q16	837	790	722	722
q17	889	950	884	884
q18	6958	5722	5605	5605
q19	1310	1323	1119	1119
q20	509	400	268	268
q21	6213	2856	2669	2669
q22	477	398	331	331
Total cold run time: 100437 ms
Total hot run time: 29427 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5085	4781	4759	4759
q2	q3	4770	5324	4599	4599
q4	2148	2184	1408	1408
q5	4847	4898	4771	4771
q6	241	178	126	126
q7	1855	1766	1649	1649
q8	2399	2115	2115	2115
q9	8024	7687	7362	7362
q10	4713	4712	4230	4230
q11	529	385	355	355
q12	728	740	531	531
q13	3039	3420	2810	2810
q14	267	276	245	245
q15	q16	674	699	613	613
q17	1287	1255	1250	1250
q18	7107	6643	6688	6643
q19	1161	1073	1095	1073
q20	2223	2200	1950	1950
q21	5278	4664	4395	4395
q22	501	451	412	412
Total cold run time: 56876 ms
Total hot run time: 51296 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 170258 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9c27b471a7fe1481a84e3b702b1c332e46cf3185, data reload: false

query5	4349	641	493	493
query6	449	195	190	190
query7	4836	566	321	321
query8	366	215	208	208
query9	8775	4047	4055	4047
query10	459	319	264	264
query11	5954	2346	2130	2130
query12	159	112	110	110
query13	1252	630	448	448
query14	6407	5419	5124	5124
query14_1	4414	4422	4372	4372
query15	210	203	181	181
query16	1058	466	472	466
query17	1158	730	606	606
query18	2718	495	352	352
query19	219	190	146	146
query20	116	109	109	109
query21	222	146	117	117
query22	13749	13636	13344	13344
query23	17458	16449	16120	16120
query23_1	16283	16232	16255	16232
query24	7744	1792	1317	1317
query24_1	1316	1336	1343	1336
query25	590	471	416	416
query26	1309	313	171	171
query27	2664	573	348	348
query28	4408	2076	2042	2042
query29	1112	656	519	519
query30	314	240	205	205
query31	1121	1096	975	975
query32	108	72	63	63
query33	533	331	303	303
query34	1176	1156	635	635
query35	757	764	688	688
query36	1401	1378	1238	1238
query37	150	104	91	91
query38	3204	3138	3046	3046
query39	940	937	912	912
query39_1	866	886	862	862
query40	223	122	108	108
query41	66	62	63	62
query42	96	96	92	92
query43	313	320	279	279
query44	
query45	199	188	177	177
query46	1090	1216	748	748
query47	2359	2432	2207	2207
query48	398	388	308	308
query49	616	481	354	354
query50	977	346	262	262
query51	4366	4295	4236	4236
query52	90	90	80	80
query53	246	268	198	198
query54	263	211	196	196
query55	80	76	71	71
query56	248	231	243	231
query57	1412	1386	1302	1302
query58	241	222	209	209
query59	1564	1668	1392	1392
query60	281	250	234	234
query61	159	190	157	157
query62	701	648	586	586
query63	237	182	189	182
query64	2513	788	641	641
query65	
query66	1723	472	343	343
query67	29264	29725	29578	29578
query68	
query69	431	306	277	277
query70	992	974	949	949
query71	291	215	215	215
query72	3052	2742	2377	2377
query73	879	797	434	434
query74	5162	4957	4801	4801
query75	2664	2595	2257	2257
query76	2306	1169	777	777
query77	362	383	290	290
query78	12489	12410	11876	11876
query79	1409	996	728	728
query80	945	468	399	399
query81	511	283	249	249
query82	582	159	123	123
query83	347	279	253	253
query84	259	147	113	113
query85	914	580	438	438
query86	413	293	257	257
query87	3366	3390	3218	3218
query88	3682	2822	2774	2774
query89	437	382	329	329
query90	1910	188	183	183
query91	176	162	139	139
query92	63	62	59	59
query93	1479	1434	846	846
query94	635	354	299	299
query95	691	370	341	341
query96	1100	852	360	360
query97	2682	2690	2554	2554
query98	214	209	209	209
query99	1154	1176	1048	1048
Total cold run time: 252055 ms
Total hot run time: 170258 ms

Copy link
Copy Markdown
Contributor

@liaoxin01 liaoxin01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (10/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.66% (28177/38254)
Line Coverage 57.78% (306908/531168)
Region Coverage 54.65% (257269/470799)
Branch Coverage 56.00% (111530/199153)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (10/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.66% (28178/38254)
Line Coverage 57.78% (306912/531168)
Region Coverage 54.65% (257307/470799)
Branch Coverage 56.01% (111536/199153)

@sollhui sollhui merged commit 43cf4fd into apache:master Jun 5, 2026
35 of 36 checks passed
@sollhui sollhui deleted the fix_auto_partition_metrics branch June 5, 2026 06:14
github-actions Bot pushed a commit that referenced this pull request Jun 5, 2026
…ion (#64109)

### What problem does this PR solve?

When loading into an auto partition table, BE may temporarily batch rows
whose target partitions do not exist yet. The previous logic first
incremented `doris_be_load_rows` when receiving the input block, then
decremented it when those rows were moved into the auto-partition
batching block, and finally incremented it again when the batched rows
were replayed after partition creation.

Because `doris_be_load_rows` is exposed as a counter, the negative
adjustment made the metric non-monotonic and could show row count drops
during auto partition loads.

This PR keeps the existing `RuntimeState` row compensation logic
unchanged, but stops applying negative deltas to the global BE load
row/byte metrics. It also skips global metric increments when replaying
the internal batched block, so each source row is counted once.
github-actions Bot pushed a commit that referenced this pull request Jun 5, 2026
…ion (#64109)

### What problem does this PR solve?

When loading into an auto partition table, BE may temporarily batch rows
whose target partitions do not exist yet. The previous logic first
incremented `doris_be_load_rows` when receiving the input block, then
decremented it when those rows were moved into the auto-partition
batching block, and finally incremented it again when the batched rows
were replayed after partition creation.

Because `doris_be_load_rows` is exposed as a counter, the negative
adjustment made the metric non-monotonic and could show row count drops
during auto partition loads.

This PR keeps the existing `RuntimeState` row compensation logic
unchanged, but stops applying negative deltas to the global BE load
row/byte metrics. It also skips global metric increments when replaying
the internal batched block, so each source row is counted once.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/3.1.x dev/4.0.x dev/4.1.x reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants