[refactor](inverted-index) replace void* query_value with typed paraminterface#62960
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
a842dba to
5fee30d
Compare
|
run buildall |
5fee30d to
636dac4
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Automated review completed.
Critical checkpoint conclusions:
- Goal/test proof: The PR replaces inverted-index query values from raw pointers/typed wrapper objects with , removes the obsolete path, and adds expanded BE unit coverage for BKD range queries and type-limit sentinels. The changed tests are aligned with the new reader interface.
- Scope/focus: The implementation is mostly focused on the query-value refactor and shared key-encoding helper reuse; no unrelated GitHub PR files were present in the actual PR diff.
- Concurrency/lifecycle: No new concurrent mutable state or lock ordering changes were introduced. now owns the , which removes the previous raw pointer lifetime risk.
- Config/compatibility: No new config, persisted format, edit log, storage format, or FE/BE protocol compatibility changes were introduced.
- Parallel paths: The main inverted-index call sites in expression pushdown, storage predicates, match/multi-match, IP, arrays, and tests were updated to pass ; I did not find a remaining live raw path for inverted index readers.
- Error handling: New status-returning BKD encoding dispatch reports unsupported field types instead of silently continuing. Existing CLucene boundaries remain unchanged.
- Data correctness: BKD min/max sentinel handling now uses on storage/key types, with added specializations for and ; I did not find an additional correctness regression not already covered by existing review threads.
- Memory/observability/performance: No large new allocations, ownership cycles, or observability gaps stood out. Moving from raw pointers to owned values is a small per-query copy but removes unsafe lifetime coupling.
- Test coverage: BE unit tests were expanded for BKD-supported primitive types and sentinel regressions. No regression-test result files were added, which is acceptable for this internal BE refactor.
- User focus: No additional user-provided review focus was supplied.
Existing review context was checked first; I did not duplicate the existing comments about storage-type specialization or the old explanation.
There was a problem hiding this comment.
Automated review completed.
Critical checkpoint conclusions:
- Goal/test proof: The PR replaces inverted-index query values from raw pointers/typed wrapper objects with Field, removes the obsolete TypeInfo::set_to_min/max path, and adds expanded BE unit coverage for BKD range queries and type-limit sentinels. The changed tests are aligned with the new reader interface.
- Scope/focus: The implementation is mostly focused on the query-value refactor and shared key-encoding helper reuse; no unrelated GitHub PR files were present in the actual PR diff.
- Concurrency/lifecycle: No new concurrent mutable state or lock ordering changes were introduced. InvertedIndexParam now owns the Field, which removes the previous raw pointer lifetime risk.
- Config/compatibility: No new config, persisted format, edit log, storage format, or FE/BE protocol compatibility changes were introduced.
- Parallel paths: The main inverted-index call sites in expression pushdown, storage predicates, match/multi-match, IP, arrays, and tests were updated to pass Field; I did not find a remaining live raw query_value path for inverted index readers.
- Error handling: New status-returning BKD encoding dispatch reports unsupported field types instead of silently continuing. Existing CLucene boundaries remain unchanged.
- Data correctness: BKD min/max sentinel handling now uses type_limit on storage/key types, with added specializations for decimal12_t and uint24_t; I did not find an additional correctness regression not already covered by existing review threads.
- Memory/observability/performance: No large new allocations, ownership cycles, or observability gaps stood out. Moving from raw pointers to owned Field values is a small per-query copy but removes unsafe lifetime coupling.
- Test coverage: BE unit tests were expanded for BKD-supported primitive types and sentinel regressions. No regression-test result files were added, which is acceptable for this internal BE refactor.
- User focus: No additional user-provided review focus was supplied.
Existing review context was checked first; I did not duplicate the existing comments about type_limit storage-type specialization or the old KeyCoder explanation.
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
run buildall |
TPC-H: Total hot run time: 29387 ms |
TPC-DS: Total hot run time: 169793 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
|
||
| // DECIMALV2 storage. Largest representable DecimalV2 value (18 digits . 9 digits). | ||
| template <> | ||
| struct type_limit<decimal12_t> { |
There was a problem hiding this comment.
我觉得这个定义可能是有问题的。 前面那些定义,都是关联到计算层的value上的,但是你加的这两个,似乎又是存储层的value
| return doris::Status::InternalError("unsupported BKD field type {}", static_cast<int>(ft)); | ||
| } | ||
|
|
||
| static doris::Status encode_bkd_max_ascending(doris::FieldType ft, const doris::KeyCoder* coder, |
There was a problem hiding this comment.
对于decimal 和 datetime 这种类型,他是有scale的,此时我们min和max 怎么体现呢?
There was a problem hiding this comment.
这个 encoding 不需要考虑 scale,因为是把里面的 int64 或者 int128 来进行encoding,scale 只是缩放比例,对于同一个列的缩放比例是一样的,所以不需要考虑。
There was a problem hiding this comment.
datetime 这种也不影响,他的scale 只是表示小数秒精度,不同的精度的min 和max 都是000000 和 999999,不影响大小比较。
| SCOPED_RAW_TIMER(&context->stats->inverted_index_query_timer); | ||
|
|
||
| std::string search_str = *reinterpret_cast<const std::string*>(query_value); | ||
| std::string search_str = query_value.get<PrimitiveType::TYPE_STRING>(); |
There was a problem hiding this comment.
typename PrimitiveTypeTraits::CppType& Field::get() { 去把field 里这个函数的定义改一下,里面加一个检查,当t != type的时候抛异常把
| const T* value = (const T*)(iter->get_value()); | ||
| RETURN_IF_ERROR(InvertedIndexQueryParamFactory::create_query_value<Type>( | ||
| value, query_param)); | ||
| field_value = Field::create_field<Type>(*value); |
There was a problem hiding this comment.
in list predicate 的时候,只有string 类型不是计算层的类型吗? 其他的,比如date 都是计算层类型吗?
或者我们更广泛的说,predicate 里,运算的时候,都是按照计算层在计算吗?
比如date 类型,string的padding 之类的
There was a problem hiding this comment.
predicate 里都是按照计算层在计算,读出来的列读到columnxxx 里面就是计算层的column
…encode_field inverted_index_query_param.h has no remaining includers; the matching test file is a stub. Remove both. Also drop the FieldType template parameter from bkd_encode_field — the storage value's bytes are already correct for KeyCoder's compile-time CppType, so the explicit key_t conversion was unnecessary. bkd_encode_min/max still need the CppTypeTraits<FT>::CppType for the right type_limit sentinel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both RowCursor::_encode_field and BKD's encode_bkd_field_ascending did the same Field -> storage value -> KeyCoder dispatch with their own copy of the (FieldType, PrimitiveType) table. Extract the conversion helper and the dispatch X-macro into storage/field_key_encoder.h so both call sites share one source of truth. - field.h: expose StorageField::key_coder() for callers that already have a KeyCoder-shaped helper. - field_key_encoder.h: new header with full_encode_field_as_key / encode_field_as_key templates plus DORIS_APPLY_FOR_KEY_ENCODABLE_NON_STRING_TYPES X-macro. - row_cursor.cpp: 19 hand-written cases collapse into one macro expansion; encode_non_string_field<PT> wrapper removed. - inverted_index_reader.cpp: drops local bkd_encode_field<PT> and BKD_TYPE_CASES; the three encode_bkd_*_ascending functions reuse the shared macro. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Field-to-key encoding helpers and the dispatch X-macro fit naturally next to KeyCoder rather than in a stand-alone header, since they are thin wrappers around KeyCoder calls. Inline them into storage/key_coder.h and remove storage/field_key_encoder.h. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…match indexed type `Field::get<PT>()` DCHECKs that the Field's primitive type tag equals `PT`, but predicates like `arr_col = []` reach `encode_bkd_field_ascending` via `FunctionComparison<EqualsOp>` with the entire const ARRAY literal as the query Field, so `actual = TYPE_ARRAY` while the BKD index records the inner scalar (e.g. IPV4). Under ASAN the assert aborts the BE with "requested IPV4, actual ARRAY" -- before the void*->Field refactor the old factory rejected non-scalar types via NotSupported and the engine fell back, this defense was lost when the typed dispatch moved into BKD. Validate the Field type before dispatching to `full_encode_field_as_key<PT>` and return INVERTED_INDEX_EVALUATE_SKIPPED on mismatch so `SegmentIterator::_apply_index_expr` downgrades to scalar evaluation instead of crashing on the assert. Scalar predicates (`int_col = 1`, `array_contains(int_arr, 2)`) keep matching as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pe_limit bkd_encode_min/max are now templated by PrimitiveType instead of FieldType. The +/- infinity sentinel is taken from type_limit<CppType> in the compute layer (e.g. DecimalV2Value::get_min_decimal, VecDateTimeValue::datetime_min_value) and projected onto the storage POD via PrimitiveTypeConvertor<PT>::to_storage_field_type. This single-sources every limit constant: DecimalV2 bounds live only on DecimalV2Value, DATE bounds only on VecDateTimeValue. The two storage-layer type_limit<> specialisations added for decimal12_t and uint24_t in the previous PR are no longer required and are removed along with their includes and the half-bounded-BKD comment block. core/type_limit.h is now exclusively a compute-layer header. Tests: the two sanity-probe tests that asserted on the deleted specialisations are removed; verify_bkd_range_queries (one TEST_F per BKD-supported PT) still exercises the same +/- infinity codepath end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DCHECK in both Field::get<T>() overloads was debug-only, so a release build silently returned a reinterpret_cast of unrelated storage bytes when T disagreed with the stored type. Replace it with a runtime check that throws Exception(FatalError(...)), matching the rest of field.cpp's error style and surfacing the bug in production. The existing INVERTED_INDEX_EVALUATE_SKIPPED gate in encode_bkd_field_ascending still fires first for the legitimate ARRAY-query-vs-scalar-BKD-index case so the predicate falls back to scalar evaluation; the throw inside Field::get is defense-in-depth for any future caller that forgets to gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… non-string keys `encode_field_as_key<PT>` was a near-duplicate of `full_encode_field_as_key`. After `DORIS_APPLY_FOR_KEY_ENCODABLE_NON_STRING_TYPES` excluded strings from its dispatch, the `index_size` argument became dead: every non-string `KeyCoderTraits<FT>::encode_ascending` just delegates to `full_encode_ascending` and ignores `index_size`. Drop the helper. Also tighten `full_encode_field_as_key` with a compile-time assert via a new `is_key_encodable_non_string_type(PrimitiveType)` constexpr derived from the existing macro, so future callers can't accidentally feed it a string or nested/aggregate type. `RowCursor::_encode_field` simplifies accordingly: its non-string branch no longer needs to distinguish `full_encode`, since both paths produce byte-identical output for fixed-width keys. The flag still matters for the string branch and is preserved there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ld_as_key
Adds a TEST_F that walks an ascending sequence of compute-layer values
for every (PrimitiveType, FieldType) pair that goes through
`PrimitiveTypeConvertor<PT>` + `KeyCoder` and asserts the encoded byte
order matches.
Coverage (21 key-encodable non-string types):
- BOOLEAN, TINYINT/SMALLINT/INT/BIGINT/LARGEINT (incl. numeric_limits)
- FLOAT, DOUBLE (finite values + +/-infinity, skipping NaN/-0 covered
separately by FloatComprehensiveOrdering)
- DECIMAL32/64/128I/256: 2-3 scale interpretations each (scale=0
boundary + a representative mid scale + max scale where applicable)
- DECIMALV2: full int128 split via decimal12_t {int_value, frac_value}
- DATE V1, DATETIME V1, DATEV2, DATETIMEV2 (scales 0/3/6 incl.
microsecond=999999 -> next-second carry), TIMESTAMPTZ
- IPV4, IPV6
Locks in:
- the compute-layer < <=> encoded byte < contract for the
`full_encode_field_as_key` path used by both `RowCursor::encode_key`
(short-key index, MOW PK) and `encode_bkd_field_ascending` (BKD
inverted index)
- the implicit "scale does not enter the encode path" property,
documented per scale-labelled subgroup -- scale/frac live on
TabletColumn metadata above this layer and never reach
PrimitiveTypeConvertor or KeyCoder
Raw decimal values are constructed as `whole * pow10_iN(scale) + frac`
so they read like `whole.frac` at the labelled scale; IPv4 values use
an `ip4(a, b, c, d)` lambda; date/time use existing `pack_d` / `pack_dt`
or `create_from_olap_date/datetime` helpers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3ee2b4a to
56f5489
Compare
|
run buildall |
|
/review |
|
PR approved by at least one committer and no changes requested. |
There was a problem hiding this comment.
Automated review completed for PR 62960.
Findings: no new blocking issues found beyond the existing inline review threads already present in .opencode-review.f6cqRi/pr_review_threads.md.
Critical checkpoint conclusions:
- Goal and coverage: the PR replaces raw inverted-index query values with typed
Fieldvalues, centralizes BKD encoding throughKeyCoder, removes obsolete min/max TypeInfo APIs, and adds broad BKD key/range tests. The changed implementation appears to accomplish that goal. - Scope and focus: changes are mostly focused on inverted-index query parameter ownership and key encoding; the larger test churn is directly related to the API change.
- Concurrency: no new shared mutable query state, lock ordering, or background-thread behavior was introduced. Existing reader/cache lifecycles are preserved.
- Lifecycle/static initialization: no new cross-translation-unit static dependency or unusual ownership cycle was found in the changed code.
- Configuration/compatibility: no new config item, persisted format, thrift protocol, or rolling-upgrade compatibility concern was introduced.
- Parallel paths: string, fulltext, BKD, predicate, expression-function, and array/IP callers were updated to pass
Fieldquery values; I did not find an omitted production caller. - Conditional checks/error handling: BKD type mismatch now returns
INVERTED_INDEX_EVALUATE_SKIPPED, which is consistent with fallback evaluation. Existing CLucene error boundaries remain in place. - Testing: BE unit tests were updated and expanded for key encoding and BKD range behavior across many scalar types. I did not run the test binaries in this review environment.
- Observability: no new observability need was identified for this refactor; existing query/cache stats and logs remain applicable.
- Transaction/persistence/data-write correctness: not applicable; no transaction, delete-bitmap, storage-format, or committed-data visibility path was changed.
- Performance/memory: no obvious hot-path regression was found. Passing
Fieldby value introduces copies for string query constants, but these are per predicate/query value and replace previous heap-backed param wrappers.
User focus: no additional user-provided review focus was specified.
TPC-H: Total hot run time: 29964 ms |
TPC-DS: Total hot run time: 170742 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
3 similar comments
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
A typed query-param interface replaces the void*:
produces the correct +inf byte string.
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)