[refactor](BE) split EncodingInfo defaults into 4 explicit maps#63622
Open
csun5285 wants to merge 2 commits into
Open
[refactor](BE) split EncodingInfo defaults into 4 explicit maps#63622csun5285 wants to merge 2 commits into
csun5285 wants to merge 2 commits into
Conversation
Replace the EncodingPreference + runtime hook machinery in EncodingInfoResolver
with four explicit maps and four matching get methods:
- _legacy_default_map -> get_legacy_default_encoding(type)
- _v3_default_map -> get_v3_default_encoding(type)
- _value_seek_default_map -> get_value_seek_encoding(type)
- _encoding_map -> get(type, encoding, out)
Each (type, encoding, role) registration in the resolver constructor goes
through a small set of named helpers — _add, _set_legacy_default,
_set_v3_default, _set_value_seek_default — and the constructor groups all
registrations by role, so what is in each map can be audited at a glance.
Per-type V1/V2 vs V3 defaults are read directly off the list instead of being
derived from the previous predicate-driven hooks (binary types -> PLAIN_V2 /
integer types -> PLAIN).
Write-path callers now resolve the default upfront and stamp the concrete
encoding onto ColumnMetaPB before construction, so ScalarColumnWriter::init
and IndexedColumnWriter::init no longer special-case DEFAULT_ENCODING or
write the resolved value back to the meta -- they return InternalError if
encoding is still DEFAULT_ENCODING and otherwise look it up directly:
- SegmentWriter::init_column_meta and
VerticalSegmentWriter::_init_column_meta compute the V3 flag from the
tablet schema once at the top of _create_column_writer and pass it down
as a bool argument; the meta is stamped with a concrete encoding.
- variant_column_writer_impl::_init_column_meta gains a
use_v3_default_encoding parameter, threaded through from each caller's
ColumnWriterOptions.
- The null/array-length/map-length aux child writer creators in
column_writer.cpp resolve the encoding for their fixed FieldType
directly.
- primary_key_index uses get_value_seek_encoding;
zone_map_index uses get_legacy_default_encoding (preserving the
pre-refactor behavior where it always passed an empty preference).
Read-path callers (column_reader, indexed_column_reader, page_io, the dict
fast path in column_reader, binary_dict_page's internal lookups) drop the
trailing EncodingPreference{} argument from EncodingInfo::get; they were
already passing the concrete on-disk encoding so no behavior change.
ColumnWriterOptions::encoding_preference becomes a single
use_v3_default_encoding bool; PageBuilderOptions::encoding_preference
becomes use_plain_v2_for_binary_dict (only BinaryDictPageBuilder cares).
The EncodingPreference struct is removed.
Two small follow-ups bundled in:
- Remove PageReadOptions::is_dict_page; the dict-page check in
PageIO::read_and_decompress_page now reads footer->type() ==
DICTIONARY_PAGE, and ColumnReader::read_page drops the corresponding
bool parameter.
- Merge BinaryDictPageBuilder::_dict_word_page_encoding_type and
_fallback_binary_encoding_type into a single _binary_plain_encoding_type
field; they were always equal (both derived from
use_plain_v2_for_binary_dict) and conceptually represent the same choice.
- For PLAIN_ENCODING_V2, throw on non-Slice types at EncodingInfo
construction time. All current registrations are Slice; the throw fails
loudly if a future non-Slice registration is added.
Tests updated: encoding_info_test exercises the new four-method API and
verifies legacy vs V3 split per type; binary_dict_page_test passes a plain
bool instead of building EncodingPreference instances; variant and schema
util tests pass use_v3_default_encoding=false through _init_column_meta.
No on-disk format change; the resolved encodings written into ColumnMetaPB
match the pre-refactor outputs for both legacy and V3 tablets.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Contributor
Author
|
run buildall |
Contributor
TPC-H: Total hot run time: 32287 ms |
Contributor
TPC-DS: Total hot run time: 173107 ms |
Adds four characterization tests that lock the per-type defaults and
the (type, encoding) registration matrix of EncodingInfoResolver:
- locked_v3_default_per_type — 31 types
- locked_legacy_default_per_type — 31 types
- locked_value_seek_per_type — 31 types (8 of which expect
UNKNOWN_ENCODING, the
no-value-seek-default sentinel)
- locked_encoding_map_completeness — 81 positive + 15 negative
(type, encoding) entries
The expectation tables were authored against the pre-refactor commit
(68d4eb3) and verified to pass there; on this commit they continue
to pass with the new four-method API (get_v3_default_encoding /
get_legacy_default_encoding / get_value_seek_encoding / get). So the
refactor is byte-for-byte behavior-preserving for every (type, role)
pair, not just the spot-checked ones.
Also fixes 15 sites in column_reader_cache_test.cpp that constructed
ColumnMetaPB with set_encoding(EncodingTypePB::DEFAULT_ENCODING) and
expected EncodingInfo::get to auto-resolve it. After the refactor,
EncodingInfo::get returns InternalError on DEFAULT_ENCODING (the
caller contract is "resolve the default before lookup"). The fix
replaces each site with the same resolver the production write path
uses:
meta.set_encoding(EncodingInfo::get_legacy_default_encoding(
static_cast<FieldType>(meta.type())));
which matches what SegmentWriter::init_column_meta now does. The 11
ColumnReaderCacheTest failures from the upstream BE-UT run (build
952710) all clear with this change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
|
run buildall |
Contributor
TPC-H: Total hot run time: 31498 ms |
Contributor
TPC-DS: Total hot run time: 173081 ms |
Contributor
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
yiguolei
reviewed
May 26, 2026
yiguolei
reviewed
May 26, 2026
|
|
||
| void SegmentWriter::init_column_meta(ColumnMetaPB* meta, uint32_t column_id, | ||
| const TabletColumn& column, TabletSchemaSPtr tablet_schema) { | ||
| const TabletColumn& column, bool use_v3_default_encoding) { |
Contributor
There was a problem hiding this comment.
我觉得还是直接传递tablet schema 更好一些,未来我们如果再给schema 加开关或者属性,不能一直扩参数列表
yiguolei
reviewed
May 26, 2026
| EncodingPreference encoding_preference, | ||
| bool optimize_value_seek); | ||
| // Default column encoding for the legacy (V1/V2 segment) write path. | ||
| static EncodingTypePB get_legacy_default_encoding(FieldType type); |
Contributor
There was a problem hiding this comment.
rename to get_v2_default_encoding
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replace the EncodingPreference + runtime hook machinery in EncodingInfoResolver with four explicit maps and four matching get methods:
No on-disk format change; the resolved encodings written into ColumnMetaPB match the pre-refactor outputs for both legacy and V3 tablets.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)