[fix](paimon) infer manifest format from split file format in cpp reader#60795

Open
xylaaaaa wants to merge 5 commits into apache:master from xylaaaaa:fix/paimoncpp-manifest-format-from-split

Conversation

@xylaaaaa (Contributor) commented Feb 22, 2026

Problem

Follow-up to #60676.

When the FE does not pass the full table options in scan ranges, paimon-cpp may default manifest.format to avro.
In environments without avro support, PaimonCppReader initialization can then fail with:
Could not find a FileFormatFactory implementation class for format avro.

Solution

In PaimonCppReader::_build_options, if the split-level file_format is set and non-empty, then for each of the following keys that is missing or empty in the options:

  • set file.format from split file_format
  • set manifest.format from split file_format

This keeps paimon-cpp format resolution consistent with the actual split format and avoids unintended avro fallback.

Verification

  • Incremental BE build succeeded for doris_be target.
  • Change scope is limited to be/src/vec/exec/format/table/paimon_cpp_reader.cpp.

Copilot AI review requested due to automatic review settings February 22, 2026 13:40
@hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull request overview

Adjusts Doris BE’s Paimon C++ reader option construction to avoid incorrect/default manifest format selection when FE scan ranges omit table options, by inferring formats from split metadata.

Changes:

  • Infer paimon::Options::FILE_FORMAT from split-level paimon_params.file_format when the option is missing/empty.
  • Infer paimon::Options::MANIFEST_FORMAT from split-level paimon_params.file_format when the option is missing/empty.

Comment on lines 314 to 328
// FE currently may not pass paimon table options in scan ranges.
// Avoid paimon-cpp defaulting manifest.format to avro when split file format is known.
if (_range.__isset.table_format_params && _range.table_format_params.__isset.paimon_params &&
    _range.table_format_params.paimon_params.__isset.file_format &&
    !_range.table_format_params.paimon_params.file_format.empty()) {
    const auto& split_file_format = _range.table_format_params.paimon_params.file_format;
    auto file_format_it = options.find(paimon::Options::FILE_FORMAT);
    if (file_format_it == options.end() || file_format_it->second.empty()) {
        options[paimon::Options::FILE_FORMAT] = split_file_format;
    }
    auto manifest_format_it = options.find(paimon::Options::MANIFEST_FORMAT);
    if (manifest_format_it == options.end() || manifest_format_it->second.empty()) {
        options[paimon::Options::MANIFEST_FORMAT] = split_file_format;
    }
}

Copilot AI Feb 22, 2026

The new option inference logic isn't covered by the existing PaimonCppReader unit tests. Consider adding a test asserting that, when split-level file_format is set and options lacks (or holds an empty value for) paimon::Options::FILE_FORMAT / MANIFEST_FORMAT, _build_options() (or an observable init path) populates them, and that it does not override non-empty values.
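
A test along these lines could look like the following minimal sketch. It does not use the real PaimonCppReader or the paimon-cpp headers: `infer_formats` and `set_if_missing_or_empty` are hypothetical helpers that mirror the inference in the snippet above over a plain `std::map`, and the key strings `file.format` and `manifest.format` are taken from the PR description rather than from `paimon::Options`. A real test would presumably go through `_build_options()` or the reader's init path in the existing unit test suite.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for paimon::Options::FILE_FORMAT / MANIFEST_FORMAT.
// The key strings come from the PR description, not from paimon-cpp headers.
const std::string kFileFormat = "file.format";
const std::string kManifestFormat = "manifest.format";

using OptionMap = std::map<std::string, std::string>;

// Sets `key` to `value` only if the key is absent or maps to an empty string.
void set_if_missing_or_empty(OptionMap& options, const std::string& key,
                             const std::string& value) {
    auto it = options.find(key);
    if (it == options.end() || it->second.empty()) {
        options[key] = value;
    }
}

// Mirrors the inference under review: an empty split format means "unknown",
// so nothing is inferred in that case.
void infer_formats(OptionMap& options, const std::string& split_file_format) {
    if (split_file_format.empty()) {
        return;
    }
    set_if_missing_or_empty(options, kFileFormat, split_file_format);
    set_if_missing_or_empty(options, kManifestFormat, split_file_format);
}

// Exercises the cases the review comment asks for.
void check_inference_cases() {
    // Case 1: both keys absent, so both are populated from the split format.
    OptionMap missing;
    infer_formats(missing, "parquet");
    assert(missing.at(kFileFormat) == "parquet");
    assert(missing.at(kManifestFormat) == "parquet");

    // Case 2: keys present but empty are treated the same as absent.
    OptionMap empty_values{{kFileFormat, ""}, {kManifestFormat, ""}};
    infer_formats(empty_values, "orc");
    assert(empty_values.at(kFileFormat) == "orc");
    assert(empty_values.at(kManifestFormat) == "orc");

    // Case 3: non-empty values must survive untouched.
    OptionMap preset{{kFileFormat, "orc"}, {kManifestFormat, "avro"}};
    infer_formats(preset, "parquet");
    assert(preset.at(kFileFormat) == "orc");
    assert(preset.at(kManifestFormat) == "avro");

    // Case 4: unknown split format leaves the options as they were.
    OptionMap untouched;
    infer_formats(untouched, "");
    assert(untouched.empty());
}
```

Case 3 is the one most worth keeping in a real test, since it guards against the inference silently overriding a manifest format that the FE did pass.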

Comment on lines 314 to 328

Copilot AI Feb 22, 2026

This behavior is slightly broader than the PR description (“only when table options are missing/empty”): it will also set FILE_FORMAT/MANIFEST_FORMAT when other table options are present but just these keys are absent/empty. If that’s intended, the PR description should be updated; if not, consider tightening the condition to only apply when the table-level paimon options map is missing/empty.
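
To make the two readings concrete, here is a minimal sketch of the tighter alternative, again over a plain `std::map` rather than the real reader. `infer_formats_strict` is a hypothetical name, the key strings are those from the PR description, and `bucket` is used only as an example of an unrelated option. It applies the inference only when no table options were passed at all, whereas the code under review checks each key independently.

```cpp
#include <cassert>
#include <map>
#include <string>

using OptionMap = std::map<std::string, std::string>;

// Strict reading of the PR description: infer formats only when the
// table-level options map is entirely empty. Returns true if it inferred.
bool infer_formats_strict(OptionMap& options, const std::string& split_file_format) {
    if (split_file_format.empty() || !options.empty()) {
        return false;
    }
    options["file.format"] = split_file_format;
    options["manifest.format"] = split_file_format;
    return true;
}

// Shows where the strict variant differs from the per-key check in the PR.
void check_strict_variant() {
    // No table options at all: formats are inferred.
    OptionMap none;
    assert(infer_formats_strict(none, "parquet"));
    assert(none.at("manifest.format") == "parquet");

    // An unrelated option is present: the strict variant infers nothing,
    // so manifest.format stays unset.
    OptionMap partial{{"bucket", "4"}};
    assert(!infer_formats_strict(partial, "parquet"));
    assert(partial.count("manifest.format") == 0);
}
```

Under this variant, a map carrying only unrelated keys would be left without `manifest.format`, so the avro fallback described in the PR could presumably still occur; the per-key check avoids that at the cost of being broader than the description states.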

@xylaaaaa (Contributor, Author)

run buildall

@doris-robot

TPC-H: Total hot run time: 28640 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 0d91196b69ceea46a7b74db43daca438dddf9880, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17628	4481	4309	4309
q2	q3	10648	780	524	524
q4	4680	349	251	251
q5	7554	1187	1023	1023
q6	177	177	149	149
q7	779	829	661	661
q8	9284	1431	1319	1319
q9	4896	4728	4738	4728
q10	6820	1872	1626	1626
q11	476	269	236	236
q12	707	565	463	463
q13	17791	4215	3427	3427
q14	225	234	227	227
q15	943	800	781	781
q16	747	712	660	660
q17	730	851	405	405
q18	6195	5406	5261	5261
q19	1119	964	600	600
q20	518	496	382	382
q21	4402	1815	1362	1362
q22	343	282	246	246
Total cold run time: 96662 ms
Total hot run time: 28640 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4396	4347	4349	4347
q2	q3	1770	2161	1723	1723
q4	847	1148	760	760
q5	3997	4288	4314	4288
q6	173	170	139	139
q7	1726	1606	1491	1491
q8	2433	2630	2515	2515
q9	7204	7737	7533	7533
q10	2812	2960	2443	2443
q11	527	431	418	418
q12	526	590	464	464
q13	4055	4490	3649	3649
q14	291	308	286	286
q15	871	830	823	823
q16	749	787	784	784
q17	1215	1659	1296	1296
q18	7357	6870	6731	6731
q19	943	910	882	882
q20	2105	2173	1994	1994
q21	4151	3484	3392	3392
q22	480	445	404	404
Total cold run time: 48628 ms
Total hot run time: 46362 ms

@doris-robot

TPC-DS: Total hot run time: 183793 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0d91196b69ceea46a7b74db43daca438dddf9880, data reload: false

query5	4763	630	508	508
query6	337	219	210	210
query7	4218	467	280	280
query8	344	269	228	228
query9	8717	2713	2702	2702
query10	534	369	319	319
query11	16981	16771	16490	16490
query12	184	128	130	128
query13	1250	448	341	341
query14	6126	3154	2934	2934
query14_1	2787	2790	2747	2747
query15	201	192	177	177
query16	1002	474	464	464
query17	1049	709	578	578
query18	2466	428	337	337
query19	200	195	169	169
query20	132	124	128	124
query21	221	151	114	114
query22	5018	6001	5728	5728
query23	17622	17129	16931	16931
query23_1	17188	16975	16980	16975
query24	7431	1609	1228	1228
query24_1	1229	1210	1220	1210
query25	560	488	424	424
query26	1230	266	157	157
query27	2780	483	300	300
query28	4532	1871	1847	1847
query29	820	579	491	491
query30	309	244	214	214
query31	865	720	655	655
query32	83	76	75	75
query33	529	348	293	293
query34	925	974	573	573
query35	636	671	593	593
query36	1098	1140	1004	1004
query37	140	93	84	84
query38	2934	2917	2872	2872
query39	918	865	849	849
query39_1	823	821	823	821
query40	238	159	136	136
query41	70	65	63	63
query42	105	101	100	100
query43	370	376	358	358
query44	
query45	201	189	183	183
query46	881	977	598	598
query47	2117	2161	2050	2050
query48	317	320	234	234
query49	640	468	381	381
query50	678	288	220	220
query51	4089	4081	4017	4017
query52	106	107	98	98
query53	292	335	292	292
query54	320	296	289	289
query55	93	81	82	81
query56	323	334	342	334
query57	1381	1345	1272	1272
query58	299	279	286	279
query59	2558	2711	2522	2522
query60	361	333	340	333
query61	180	196	151	151
query62	620	588	526	526
query63	311	276	265	265
query64	4872	1226	979	979
query65	
query66	1440	453	350	350
query67	16361	16308	16185	16185
query68	
query69	398	286	290	286
query70	1002	1005	955	955
query71	330	302	299	299
query72	2816	2710	2466	2466
query73	541	540	309	309
query74	10010	9880	9707	9707
query75	2847	2726	2592	2592
query76	2290	1013	670	670
query77	353	373	314	314
query78	11123	11296	10696	10696
query79	3121	799	597	597
query80	1781	613	537	537
query81	574	282	244	244
query82	1003	148	117	117
query83	333	260	244	244
query84	252	123	95	95
query85	908	504	432	432
query86	436	297	305	297
query87	3074	3080	2965	2965
query88	3524	2630	2626	2626
query89	425	366	335	335
query90	1940	176	166	166
query91	167	152	132	132
query92	76	74	72	72
query93	1268	828	488	488
query94	632	312	279	279
query95	600	396	322	322
query96	639	516	224	224
query97	2493	2550	2410	2410
query98	232	214	214	214
query99	964	1007	950	950
Total cold run time: 255979 ms
Total hot run time: 183793 ms

@hello-stephen (Contributor)

BE UT Coverage Report

Increment line coverage 0.00% (0/13) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.52% (19553/37230)
Line Coverage 36.14% (182414/504767)
Region Coverage 32.48% (141563/435802)
Branch Coverage 33.42% (61323/183487)

@hello-stephen (Contributor)

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/13) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.20% (26711/36488)
Line Coverage 56.50% (284506/503527)
Region Coverage 53.91% (237302/440186)
Branch Coverage 55.59% (102393/184191)
