[refactor](BE) split EncodingInfo defaults into 4 explicit maps #63622

Open
csun5285 wants to merge 2 commits into apache:master from csun5285:dev/encoding-info-refactor

Conversation

@csun5285
Contributor

@csun5285 csun5285 commented May 25, 2026

Replace the EncodingPreference + runtime hook machinery in EncodingInfoResolver with four explicit maps and four matching get methods:

  • _legacy_default_map -> get_legacy_default_encoding(type)
  • _v3_default_map -> get_v3_default_encoding(type)
  • _value_seek_default_map -> get_value_seek_encoding(type)
  • _encoding_map -> get(type, encoding, out)

No on-disk format change; the resolved encodings written into ColumnMetaPB match the pre-refactor outputs for both legacy and V3 tablets.

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Replace the EncodingPreference + runtime hook machinery in EncodingInfoResolver
with four explicit maps and four matching get methods:

  - _legacy_default_map     -> get_legacy_default_encoding(type)
  - _v3_default_map         -> get_v3_default_encoding(type)
  - _value_seek_default_map -> get_value_seek_encoding(type)
  - _encoding_map           -> get(type, encoding, out)

Each (type, encoding, role) registration in the resolver constructor goes
through a small set of named helpers — _add, _set_legacy_default,
_set_v3_default, _set_value_seek_default — and the constructor groups all
registrations by role, so what is in each map can be audited at a glance.
Per-type V1/V2 vs V3 defaults are read directly off the list instead of being
derived from the previous predicate-driven hooks (binary types -> PLAIN_V2 /
integer types -> PLAIN).
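
The four-map layout described above can be sketched roughly as follows. This is a minimal illustration, not the actual Doris code: the class name ResolverSketch, the two-value FieldType enum, the simplified Encoding enum, and every concrete default chosen below are assumptions made for the example, and get() here only reports whether a pair is registered instead of filling an out parameter. The assertion inside the set_* helpers (a default must already be a registered pair) is likewise an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>

// Simplified stand-ins for the Doris enums; only two types are modeled here.
enum class FieldType { INT, VARCHAR };
enum class Encoding { UNKNOWN, PLAIN, PLAIN_V2, BIT_SHUFFLE, DICT, PREFIX };

class ResolverSketch {
    using DefaultMap = std::map<FieldType, Encoding>;

public:
    ResolverSketch() {
        // Role 1: every supported (type, encoding) pair.
        add(FieldType::INT, Encoding::PLAIN);
        add(FieldType::INT, Encoding::BIT_SHUFFLE);
        add(FieldType::VARCHAR, Encoding::PLAIN);
        add(FieldType::VARCHAR, Encoding::PLAIN_V2);
        add(FieldType::VARCHAR, Encoding::DICT);
        add(FieldType::VARCHAR, Encoding::PREFIX);

        // Role 2: legacy defaults (illustrative picks, not the real table).
        set_legacy_default(FieldType::INT, Encoding::BIT_SHUFFLE);
        set_legacy_default(FieldType::VARCHAR, Encoding::DICT);

        // Role 3: V3 defaults (illustrative picks).
        set_v3_default(FieldType::INT, Encoding::PLAIN);
        set_v3_default(FieldType::VARCHAR, Encoding::PLAIN_V2);

        // Role 4: value-seek defaults; INT deliberately has none here.
        set_value_seek_default(FieldType::VARCHAR, Encoding::PREFIX);
    }

    Encoding get_legacy_default_encoding(FieldType t) const { return lookup(_legacy_default_map, t); }
    Encoding get_v3_default_encoding(FieldType t) const { return lookup(_v3_default_map, t); }
    Encoding get_value_seek_encoding(FieldType t) const { return lookup(_value_seek_default_map, t); }

    // Stand-in for get(type, encoding, out): true if the pair is registered.
    bool get(FieldType t, Encoding e) const {
        return _encoding_map.count(std::make_pair(t, e)) > 0;
    }

private:
    void add(FieldType t, Encoding e) { _encoding_map.emplace(t, e); }

    // Sketch-only invariant: a default must point at a registered pair.
    void set_legacy_default(FieldType t, Encoding e) { assert(get(t, e)); _legacy_default_map[t] = e; }
    void set_v3_default(FieldType t, Encoding e) { assert(get(t, e)); _v3_default_map[t] = e; }
    void set_value_seek_default(FieldType t, Encoding e) { assert(get(t, e)); _value_seek_default_map[t] = e; }

    // A missing entry maps to UNKNOWN, used here as the "no default" sentinel.
    static Encoding lookup(const DefaultMap& m, FieldType t) {
        auto it = m.find(t);
        return it == m.end() ? Encoding::UNKNOWN : it->second;
    }

    std::set<std::pair<FieldType, Encoding>> _encoding_map;
    DefaultMap _legacy_default_map;
    DefaultMap _v3_default_map;
    DefaultMap _value_seek_default_map;
};
```

Because each role has its own map and its own getter, a reader can check one role without tracing any predicate logic; in this sketch, ResolverSketch().get_value_seek_encoding(FieldType::INT) yields the UNKNOWN sentinel simply because no entry was registered.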

Write-path callers now resolve the default upfront and stamp the concrete
encoding onto ColumnMetaPB before construction, so ScalarColumnWriter::init
and IndexedColumnWriter::init no longer special-case DEFAULT_ENCODING or
write the resolved value back to the meta -- they return InternalError if
encoding is still DEFAULT_ENCODING and otherwise look it up directly:

  - SegmentWriter::init_column_meta and
    VerticalSegmentWriter::_init_column_meta compute the V3 flag from the
    tablet schema once at the top of _create_column_writer and pass it down
    as a bool argument; the meta is stamped with a concrete encoding.
  - variant_column_writer_impl::_init_column_meta gains a
    use_v3_default_encoding parameter, threaded through from each caller's
    ColumnWriterOptions.
  - The null/array-length/map-length aux child writer creators in
    column_writer.cpp resolve the encoding for their fixed FieldType
    directly.
  - primary_key_index uses get_value_seek_encoding;
    zone_map_index uses get_legacy_default_encoding (preserving the
    pre-refactor behavior where it always passed an empty preference).
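
The "resolve first, then construct" contract for the write path might look like the sketch below. Everything in it is a simplified assumption for illustration: ColumnMetaSketch and StatusSketch stand in for ColumnMetaPB and the Doris Status type, the two default functions use made-up mappings, and writer_init only models the DEFAULT_ENCODING rejection described above.

```cpp
#include <string>

enum class FieldType { INT, VARCHAR };
enum class Encoding { DEFAULT_ENCODING, PLAIN, PLAIN_V2, BIT_SHUFFLE, DICT };

// Stand-in for ColumnMetaPB; starts out unresolved.
struct ColumnMetaSketch {
    FieldType type = FieldType::INT;
    Encoding encoding = Encoding::DEFAULT_ENCODING;
};

// Stand-in for a Status return value.
struct StatusSketch {
    bool ok;
    std::string msg;
};

// Hypothetical default tables; the real per-type defaults live in the resolver.
inline Encoding legacy_default(FieldType t) {
    return t == FieldType::INT ? Encoding::BIT_SHUFFLE : Encoding::DICT;
}
inline Encoding v3_default(FieldType t) {
    return t == FieldType::INT ? Encoding::PLAIN : Encoding::PLAIN_V2;
}

// Caller side: stamp a concrete encoding onto the meta before any writer exists.
inline void init_column_meta(ColumnMetaSketch* meta, FieldType type,
                             bool use_v3_default_encoding) {
    meta->type = type;
    meta->encoding = use_v3_default_encoding ? v3_default(type) : legacy_default(type);
}

// Writer side: no special-casing, an unresolved default is simply an error.
inline StatusSketch writer_init(const ColumnMetaSketch& meta) {
    if (meta.encoding == Encoding::DEFAULT_ENCODING) {
        return {false, "encoding must be resolved before writer init"};
    }
    return {true, ""};
}
```

In this shape the writer never mutates the meta, so the encoding a caller stamps is the one that is looked up and written out.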

Read-path callers (column_reader, indexed_column_reader, page_io, the dict
fast path in column_reader, binary_dict_page's internal lookups) drop the
trailing EncodingPreference{} argument from EncodingInfo::get; they were
already passing the concrete on-disk encoding so no behavior change.

ColumnWriterOptions::encoding_preference becomes a single
use_v3_default_encoding bool; PageBuilderOptions::encoding_preference
becomes use_plain_v2_for_binary_dict (only BinaryDictPageBuilder cares).
The EncodingPreference struct is removed.
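
As a rough illustration of the narrowed option, a dictionary page builder could derive its plain-encoding choice from the single bool once, at construction. The types below (PageBuilderOptionsSketch, BinaryDictBuilderSketch) and the accessor name are hypothetical stand-ins, not the real Doris classes.

```cpp
enum class Encoding { PLAIN, PLAIN_V2 };

// Stand-in for PageBuilderOptions after the change: one bool, no preference struct.
struct PageBuilderOptionsSketch {
    bool use_plain_v2_for_binary_dict = false;
};

// Stand-in for a dictionary page builder that needs a plain encoding for binary data.
class BinaryDictBuilderSketch {
public:
    explicit BinaryDictBuilderSketch(const PageBuilderOptionsSketch& opts)
            : _binary_plain_encoding_type(opts.use_plain_v2_for_binary_dict
                                                  ? Encoding::PLAIN_V2
                                                  : Encoding::PLAIN) {}

    // The one place the choice is stored; every internal use reads this field.
    Encoding binary_plain_encoding_type() const { return _binary_plain_encoding_type; }

private:
    Encoding _binary_plain_encoding_type;
};
```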

Three small follow-ups bundled in:
- Remove PageReadOptions::is_dict_page; the dict-page check in
  PageIO::read_and_decompress_page now reads footer->type() ==
  DICTIONARY_PAGE, and ColumnReader::read_page drops the corresponding
  bool parameter.
- Merge BinaryDictPageBuilder::_dict_word_page_encoding_type and
  _fallback_binary_encoding_type into a single _binary_plain_encoding_type
  field; they were always equal (both derived from
  use_plain_v2_for_binary_dict) and conceptually represent the same choice.
- For PLAIN_ENCODING_V2, throw on non-Slice types at EncodingInfo
  construction time. All current registrations are Slice; the throw fails
  loudly if a future non-Slice registration is added.
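
The construction-time guard for PLAIN_ENCODING_V2 could be sketched along these lines. This is an assumption-laden illustration: EncodingInfoSketch, the local Slice struct, and the use of std::invalid_argument are stand-ins chosen for the example and do not reflect the exception type or class layout used in Doris.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// Minimal stand-in for a Slice (pointer + length view over bytes).
struct Slice {
    const char* data;
    std::size_t size;
};

enum class Encoding { PLAIN, PLAIN_V2 };

// CppType models the in-memory type a FieldType maps to.
template <typename CppType>
class EncodingInfoSketch {
public:
    explicit EncodingInfoSketch(Encoding e) : _encoding(e) {
        // Fail at registration time rather than at first use.
        if (e == Encoding::PLAIN_V2 && !std::is_same<CppType, Slice>::value) {
            throw std::invalid_argument("PLAIN_V2 is only valid for Slice-backed types");
        }
    }

    Encoding encoding() const { return _encoding; }

private:
    Encoding _encoding;
};
```

With this guard, constructing EncodingInfoSketch<Slice>(Encoding::PLAIN_V2) succeeds, while the same call with std::int32_t as the type parameter throws as soon as the registration runs.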

Tests updated: encoding_info_test exercises the new four-method API and
verifies legacy vs V3 split per type; binary_dict_page_test passes a plain
bool instead of building EncodingPreference instances; variant and schema
util tests pass use_v3_default_encoding=false through _init_column_meta.

No on-disk format change; the resolved encodings written into ColumnMetaPB
match the pre-refactor outputs for both legacy and V3 tablets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@csun5285
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 32287 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b1e1a5698074aa88e6fa9a12ab270e3d63b40a42, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17639	4217	4073	4073
q2
q3	10786	1456	844	844
q4	4692	484	347	347
q5	7594	2375	2200	2200
q6	263	187	144	144
q7	1027	783	655	655
q8	9389	1660	1620	1620
q9	5418	5069	4998	4998
q10	6408	2233	1890	1890
q11	440	273	255	255
q12	648	430	301	301
q13	18111	3408	2794	2794
q14	267	266	251	251
q15
q16	832	780	712	712
q17	984	960	1014	960
q18	7084	5764	5518	5518
q19	1303	1429	1243	1243
q20	598	427	291	291
q21	6229	2866	2868	2866
q22	465	379	325	325
Total cold run time: 100177 ms
Total hot run time: 32287 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4970	4931	4916	4916
q2
q3	5088	5244	4691	4691
q4	2406	2253	1452	1452
q5	4989	4705	4828	4705
q6	239	188	138	138
q7	1903	1772	1581	1581
q8	2524	2175	2173	2173
q9	7506	7563	7471	7471
q10	4776	4757	4290	4290
q11	549	400	371	371
q12	752	749	553	553
q13	3121	3436	2826	2826
q14	270	287	257	257
q15
q16	677	709	609	609
q17	1327	1302	1300	1300
q18	7422	7056	6905	6905
q19	1152	1110	1094	1094
q20	2258	2211	1965	1965
q21	5420	4732	4609	4609
q22	541	484	421	421
Total cold run time: 57890 ms
Total hot run time: 52327 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 173107 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b1e1a5698074aa88e6fa9a12ab270e3d63b40a42, data reload: false

query5	4329	672	523	523
query6	327	221	198	198
query7	4227	595	308	308
query8	327	262	226	226
query9	8850	4146	4125	4125
query10	443	346	303	303
query11	5794	2634	2226	2226
query12	181	128	126	126
query13	1300	675	430	430
query14	6249	5560	5265	5265
query14_1	4566	4547	4505	4505
query15	215	217	188	188
query16	1021	485	397	397
query17	1153	755	626	626
query18	2485	496	369	369
query19	222	218	174	174
query20	145	140	134	134
query21	219	140	124	124
query22	13681	13541	13513	13513
query23	17428	16539	16240	16240
query23_1	16329	16413	16508	16413
query24	7551	1818	1341	1341
query24_1	1321	1350	1361	1350
query25	561	472	426	426
query26	1350	324	182	182
query27	2723	593	342	342
query28	4513	2111	2025	2025
query29	984	628	510	510
query30	311	245	203	203
query31	1131	1085	971	971
query32	91	75	70	70
query33	540	352	296	296
query34	1204	1131	675	675
query35	787	806	697	697
query36	1416	1409	1243	1243
query37	156	113	94	94
query38	3216	3189	3086	3086
query39	955	929	922	922
query39_1	870	872	875	872
query40	226	165	127	127
query41	65	65	63	63
query42	111	110	110	110
query43	337	343	295	295
query44	
query45	218	209	199	199
query46	1085	1245	766	766
query47	2416	2409	2278	2278
query48	411	416	317	317
query49	626	496	386	386
query50	972	357	249	249
query51	4361	4327	4414	4327
query52	107	105	96	96
query53	262	292	208	208
query54	319	277	257	257
query55	93	96	85	85
query56	317	298	321	298
query57	1432	1434	1331	1331
query58	300	279	275	275
query59	1626	1668	1398	1398
query60	320	318	315	315
query61	151	151	158	151
query62	692	656	578	578
query63	244	209	207	207
query64	2415	833	668	668
query65	
query66	1734	489	364	364
query67	30221	29557	29974	29557
query68	
query69	473	337	306	306
query70	1042	1024	1008	1008
query71	305	280	275	275
query72	3030	2772	2441	2441
query73	855	786	435	435
query74	5124	4959	4823	4823
query75	2708	2646	2293	2293
query76	2271	1184	780	780
query77	409	425	350	350
query78	12329	12517	11915	11915
query79	1282	1052	808	808
query80	591	546	454	454
query81	458	279	246	246
query82	238	157	121	121
query83	274	284	252	252
query84	267	144	114	114
query85	844	539	451	451
query86	366	335	349	335
query87	3406	3409	3277	3277
query88	3640	2789	2759	2759
query89	430	397	340	340
query90	2128	188	187	187
query91	179	167	139	139
query92	81	81	84	81
query93	1486	1392	838	838
query94	547	343	306	306
query95	698	472	377	377
query96	1056	834	358	358
query97	2734	2765	2645	2645
query98	241	230	228	228
query99	1201	1144	1023	1023
Total cold run time: 253938 ms
Total hot run time: 173107 ms

Adds four characterization tests that lock the per-type defaults and
the (type, encoding) registration matrix of EncodingInfoResolver:

  - locked_v3_default_per_type        — 31 types
  - locked_legacy_default_per_type    — 31 types
  - locked_value_seek_per_type        — 31 types (8 of which expect
                                        UNKNOWN_ENCODING, the
                                        no-value-seek-default sentinel)
  - locked_encoding_map_completeness  — 81 positive + 15 negative
                                        (type, encoding) entries

The expectation tables were authored against the pre-refactor commit
(68d4eb3) and verified to pass there; on this commit they continue
to pass with the new four-method API (get_v3_default_encoding /
get_legacy_default_encoding / get_value_seek_encoding / get). So the
refactor is byte-for-byte behavior-preserving for every (type, role)
pair, not just the spot-checked ones.
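
A characterization test of this kind boils down to comparing a frozen expectation table against the getter. The sketch below shows the pattern only; the three-type FieldType enum, the stand-in get_v3_default_encoding, and the count_mismatches helper are assumptions for the example and do not reproduce the 31-row tables from the real test.

```cpp
#include <cstddef>
#include <vector>

enum class FieldType { INT, BIGINT, VARCHAR };
enum class Encoding { UNKNOWN, PLAIN, PLAIN_V2 };

// Stand-in for the function under test; the mapping is illustrative only.
inline Encoding get_v3_default_encoding(FieldType t) {
    return t == FieldType::VARCHAR ? Encoding::PLAIN_V2 : Encoding::PLAIN;
}

// One row of a locked expectation table.
struct Expectation {
    FieldType type;
    Encoding expected;
};

// Returns how many rows of the locked table disagree with the getter.
template <typename Getter>
std::size_t count_mismatches(const std::vector<Expectation>& locked, Getter getter) {
    std::size_t mismatches = 0;
    for (const Expectation& row : locked) {
        if (getter(row.type) != row.expected) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

The table is written once against the old behavior and never regenerated from the new code, which is what lets a zero mismatch count serve as evidence that the refactor preserved the defaults.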

Also fixes 15 sites in column_reader_cache_test.cpp that constructed
ColumnMetaPB with set_encoding(EncodingTypePB::DEFAULT_ENCODING) and
expected EncodingInfo::get to auto-resolve it. After the refactor,
EncodingInfo::get returns InternalError on DEFAULT_ENCODING (the
caller contract is "resolve the default before lookup"). The fix
replaces each site with the same resolver the production write path
uses:

    meta.set_encoding(EncodingInfo::get_legacy_default_encoding(
            static_cast<FieldType>(meta.type())));

which matches what SegmentWriter::init_column_meta now does. The 11
ColumnReaderCacheTest failures from the upstream BE-UT run (build
952710) all clear with this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@csun5285
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31498 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ebb96fc4a6ab398cc891b04c4acaa659e3a0f741, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17675	4108	4052	4052
q2
q3	10780	1402	822	822
q4	4684	482	341	341
q5	7559	2248	2128	2128
q6	312	176	141	141
q7	937	790	631	631
q8	9391	1754	1534	1534
q9	6822	4982	4954	4954
q10	6455	2241	1937	1937
q11	434	269	239	239
q12	689	429	299	299
q13	18190	3361	2780	2780
q14	266	259	233	233
q15
q16	831	794	709	709
q17	994	952	960	952
q18	6757	5721	5528	5528
q19	1179	1359	1084	1084
q20	530	453	272	272
q21	6163	2773	2540	2540
q22	450	376	322	322
Total cold run time: 101098 ms
Total hot run time: 31498 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4845	4848	4946	4848
q2
q3	4882	5263	4851	4851
q4	2125	2222	1424	1424
q5	4996	4793	4670	4670
q6	228	183	141	141
q7	1871	1751	1565	1565
q8	2429	1986	1956	1956
q9	7548	7541	7430	7430
q10	4790	4722	4256	4256
q11	549	417	361	361
q12	736	743	544	544
q13	2966	3400	2795	2795
q14	279	285	252	252
q15
q16	685	710	626	626
q17	1295	1283	1277	1277
q18	7239	6834	6931	6834
q19	1160	1124	1085	1085
q20	2230	2237	1968	1968
q21	5327	4644	4472	4472
q22	527	462	408	408
Total cold run time: 56707 ms
Total hot run time: 51763 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 173081 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ebb96fc4a6ab398cc891b04c4acaa659e3a0f741, data reload: false

query5	4312	659	527	527
query6	353	225	199	199
query7	4232	553	301	301
query8	324	260	227	227
query9	8844	4192	4145	4145
query10	446	350	306	306
query11	5709	2550	2182	2182
query12	181	130	123	123
query13	1282	606	448	448
query14	6214	5632	5317	5317
query14_1	4682	4681	4659	4659
query15	221	211	189	189
query16	1006	456	437	437
query17	1162	755	616	616
query18	2453	487	367	367
query19	227	214	185	185
query20	138	135	135	135
query21	218	143	120	120
query22	13799	13584	13434	13434
query23	17482	16550	16392	16392
query23_1	16324	16395	16400	16395
query24	7645	1761	1335	1335
query24_1	1337	1324	1321	1321
query25	570	477	426	426
query26	1288	327	170	170
query27	2687	574	363	363
query28	4478	2033	2049	2033
query29	1018	639	489	489
query30	319	249	206	206
query31	1133	1093	959	959
query32	101	77	75	75
query33	539	354	299	299
query34	1186	1146	677	677
query35	770	810	703	703
query36	1388	1406	1275	1275
query37	154	107	92	92
query38	3247	3233	3087	3087
query39	934	920	899	899
query39_1	875	887	894	887
query40	234	146	130	130
query41	67	65	64	64
query42	115	111	111	111
query43	351	352	311	311
query44	
query45	223	209	199	199
query46	1131	1212	785	785
query47	2445	2411	2246	2246
query48	419	405	305	305
query49	638	508	425	425
query50	1123	354	249	249
query51	4438	4354	4318	4318
query52	108	108	95	95
query53	268	285	211	211
query54	309	282	282	282
query55	94	95	89	89
query56	307	314	313	313
query57	1481	1449	1400	1400
query58	311	288	272	272
query59	1692	1757	1537	1537
query60	322	329	322	322
query61	163	160	155	155
query62	708	641	585	585
query63	247	211	217	211
query64	2418	806	626	626
query65	
query66	2240	483	368	368
query67	29892	29677	29458	29458
query68	
query69	480	350	309	309
query70	1063	1012	1040	1012
query71	306	283	270	270
query72	3024	2644	2426	2426
query73	840	745	441	441
query74	5127	4952	4850	4850
query75	2695	2643	2265	2265
query76	2312	1134	783	783
query77	409	429	328	328
query78	12227	12367	11757	11757
query79	1268	1016	752	752
query80	559	544	461	461
query81	449	280	242	242
query82	246	158	125	125
query83	277	284	256	256
query84	267	141	113	113
query85	849	551	449	449
query86	360	368	360	360
query87	3393	3382	3260	3260
query88	3595	2737	2752	2737
query89	431	391	339	339
query90	2167	184	179	179
query91	178	165	138	138
query92	81	80	75	75
query93	1362	1353	917	917
query94	529	353	313	313
query95	685	398	439	398
query96	1074	793	372	372
query97	2741	2748	2592	2592
query98	236	231	230	230
query99	1168	1154	1027	1027
Total cold run time: 254509 ms
Total hot run time: 173081 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 92.74% (281/303) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.83% (28056/38001)
Line Coverage 57.71% (304513/527640)
Region Coverage 54.83% (254631/464427)
Branch Coverage 56.39% (110048/195169)

Comment thread be/src/storage/segment/vertical_segment_writer.cpp

  void SegmentWriter::init_column_meta(ColumnMetaPB* meta, uint32_t column_id,
-                                      const TabletColumn& column, TabletSchemaSPtr tablet_schema) {
+                                      const TabletColumn& column, bool use_v3_default_encoding) {
Contributor

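I think it would be better to pass the tablet schema directly. If we add more switches or properties to the schema in the future, we cannot keep extending the parameter list.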

-                      EncodingPreference encoding_preference,
-                      bool optimize_value_seek);
+    // Default column encoding for the legacy (V1/V2 segment) write path.
+    static EncodingTypePB get_legacy_default_encoding(FieldType type);
Contributor

rename to get_v2_default_encoding


3 participants