[fix](be) Fix NOT_IMPLEMENTED_ERROR for length() on dict-encoded varchar columns#63498
Conversation
…har columns ### What problem does this PR solve? Issue Number: N/A Problem Summary: `SELECT length(col_varchar)` on a low-cardinality (dict-encoded) column failed with: NOT_IMPLEMENTED_ERROR: Method insert_offsets_from_lengths is not supported Root cause (two-step): 1. When `enable_low_cardinality_optimize=true`, the predicate column is created as `ColumnDictI32` by `Schema::get_predicate_column_ptr()`. The `only_read_offsets` code path in `BinaryDictPageDecoder` resolves dict codes to lengths and calls `dst->insert_offsets_from_lengths()`, but `ColumnDictI32` does not implement that method. 2. After converting `ColumnDictI32` to `PredicateColumnType<TYPE_STRING>` via `convert_to_predicate_column_if_dictionary()`, the same call failed again because `PredicateColumnType` also lacked the implementation. Fix: - `binary_dict_page.cpp`: call `convert_to_predicate_column_if_dictionary()` on `dst` before invoking `insert_offsets_from_lengths` in both `next_batch()` and `read_by_rowids()` (covers dict-encoded pages and the plain-encoded fallback path). - `predicate_column.h`: implement `insert_offsets_from_lengths` for the `StringRef` specialisation of `PredicateColumnType`. A single backing buffer is allocated from the internal Arena and zero-filled; each element records the correct length so that downstream `filter_by_selector` / `copy_column_data_by_selector` can materialise the right-sized strings into the output `ColumnString`, giving the correct result for `length()`. ### Release note Fix crash/error when calling `length()` on a varchar/char column that uses the low-cardinality (dictionary) optimisation. ### Check List (For Author) - Test: Manual test (query reproduced and confirmed fixed) - Behavior changed: No - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 31033 ms |
TPC-DS: Total hot run time: 170661 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
1 similar comment
|
/review |
There was a problem hiding this comment.
Review summary: I did not find any blocking correctness issues in this PR. The change addresses the reported low-cardinality/dictionary OFFSET_ONLY failure by converting dictionary columns before insert_offsets_from_lengths and by adding PredicateColumnType string support for length-only entries. The new regression test covers length()/char_length() on dict-encoded varchar columns with the default low-cardinality optimization enabled.
Critical checkpoint conclusions:
- Goal and proof: The PR goal is to avoid NOT_IMPLEMENTED_ERROR for length()/char_length() on dict-encoded varchar columns; the code path and regression test align with that goal.
- Scope/minimality: The change is small and focused on the failing storage read path and predicate column method.
- Concurrency/lifecycle: No new shared state, locking, thread, or non-trivial lifecycle behavior is introduced.
- Configuration/compatibility: No new configuration, storage format, RPC, or serialized format changes; mixed-version compatibility is not affected.
- Parallel paths: Both next_batch() and read_by_rowids() dict paths are updated; the existing non-dict forwarding path already converts before delegating.
- Error handling: Existing Status propagation is preserved; no ignored Status or silent failure path added.
- Memory/performance: The implementation allocates valid backing memory for selected StringRefs, matching downstream filter/copy expectations. No additional obvious hot-path regression beyond the expected length-only placeholder buffer was found.
- Transactions/data visibility/MoW: Not involved.
- Test coverage/results: Regression coverage is present with generated .out. I did not run the regression suite in this review environment.
- Observability: No new observability appears necessary for this localized read-path fix.
User focus response: No additional user-provided review focus was present.
There was a problem hiding this comment.
Requesting changes for a correctness issue in the new OFFSET_ONLY predicate-column path.
Critical checkpoint conclusions:
- Goal and tests: The PR targets length()/char_length() on low-cardinality dict-encoded varchar columns and adds a regression test, but the implementation has an empty-string materialization issue and the test does not actually cover nullable NULL varchar rows.
- Scope: The change is focused on BinaryDictPageDecoder and PredicateColumnType, with a small regression test.
- Concurrency and lifecycle: No new shared state, locks, threads, or non-trivial lifecycle/static initialization concerns were introduced.
- Configuration and compatibility: No new config, storage format, or FE/BE protocol compatibility changes.
- Parallel paths: Both next_batch() and read_by_rowids() dict paths were updated; existing plain-page OFFSET_ONLY path remains parallel.
- Error handling: No Status is newly ignored; the main concern is undefined behavior from null StringRef pointers during later selector materialization.
- Memory and performance: Memory is arena-owned like adjacent predicate string handling, but the new dummy backing pointers must remain valid for zero-length strings too.
- Data correctness: Non-empty length preservation is addressed; empty strings can hit undefined behavior when selected and copied to ColumnString.
- Observability: No new observability is needed for this targeted storage decode fix.
- User focus: No additional user-provided review focus was specified.
| for (size_t i = 0; i < num; ++i) { | ||
| // For zero-length strings, data pointer is null; insert_many_strings | ||
| // and filter_by_selector both guard on size > 0 before dereferencing. | ||
| data[org_elem_num + i].data = (lengths[i] > 0) ? (buf + offset) : nullptr; |
There was a problem hiding this comment.
This leaves zero-length rows with a null StringRef::data, but PredicateColumnType::filter_by_selector() later forwards these refs to ColumnString::insert_many_strings_without_reserve(). That code does pointer arithmetic/comparison on strings[i].data + len before checking length != 0, so an empty string read through this OFFSET_ONLY predicate path can invoke undefined behavior when the predicate column is materialized by selector filtering. Please keep zero-length refs pointing at valid arena memory as well, for example by allocating at least one byte when total_bytes == 0 and assigning buf + offset for every row.
There was a problem hiding this comment.
这里是没问题的,因为insert default就是这个行为
| (6, 6, NULL, '2023-12-18', 'w', 'l', 24), | ||
| (8, -8278102, '2023-12-13', '2023-12-11', 'x', 'c', 28), | ||
| (15971, 8, NULL, '2015-06-11', 'h', 'r', 41), | ||
| (6, 5, '2023-12-11', '2023-12-17', 'd', 'q', 5) |
There was a problem hiding this comment.
The test comment below says length() returns NULL for NULL varchar values, but every inserted value for col_varchar_5__undef_signed is non-NULL; only col_date_undef_signed has NULLs. This means the nullable NULL path for dict-encoded OFFSET_ONLY length() is not exercised. Please insert at least one NULL in the nullable varchar column and regenerate the .out file so this regression is covered.
|
run buildall |
TPC-H: Total hot run time: 30971 ms |
TPC-DS: Total hot run time: 169658 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Review summary for PR 63498:
I did not find additional non-duplicate blocking issues beyond the existing inline review context already present on this PR. The added nullable-varchar NULL rows address the previously raised regression-test coverage gap for NULL passthrough.
One non-blocking coverage note: the char_length queries in the new regression are sanity checks, not coverage of this decoder change. In Nereids, OFFSET-only access is collected for Length, while char_length maps to CharacterLength and reads the string data normally. The changed path is therefore covered by the length(...) cases, not by the char_length(...) cases.
Critical checkpoint conclusions:
- Goal and proof: the code targets
length()on dict-encoded varchar columns in OFFSET-only reads; the newlength(...)regression cases exercise nullable and non-null varchar paths. - Scope: the code change is small and localized to dict-page OFFSET-only conversion plus predicate-column length insertion.
- Concurrency and lifecycle: no new shared state, locks, threads, or special lifecycle/static initialization were introduced.
- Configuration and compatibility: no new config or storage/protocol format change was introduced.
- Parallel paths: dict
next_batchandread_by_rowidspaths were both updated; plain fallback already delegates after dictionary conversion. - Tests: regression coverage exists for
length()and NULL passthrough;char_lengthcases are not evidence for the changed OFFSET-only path as noted above. - Observability, transactions, persistence, data writes: not applicable to this change.
- Performance: the implementation avoids reading real string bytes from storage but still allocates dummy bytes when materialization is needed; no separate performance blocker found in this review.
User focus: no additional user-provided review focus was specified.
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: N/A
Problem Summary:
SELECT length(col_varchar)on a low-cardinality (dict-encoded) column failed with:NOT_IMPLEMENTED_ERROR: Method insert_offsets_from_lengths is not supported
Root cause (two-step):
enable_low_cardinality_optimize=true, the predicate column is created asColumnDictI32bySchema::get_predicate_column_ptr(). Theonly_read_offsetscode path inBinaryDictPageDecoderresolves dict codes to lengths and callsdst->insert_offsets_from_lengths(), butColumnDictI32does not implement that method.ColumnDictI32toPredicateColumnType<TYPE_STRING>viaconvert_to_predicate_column_if_dictionary(), the same call failed again becausePredicateColumnTypealso lacked the implementation.Fix:
binary_dict_page.cpp: callconvert_to_predicate_column_if_dictionary()ondstbefore invokinginsert_offsets_from_lengthsin bothnext_batch()andread_by_rowids()(covers dict-encoded pages and the plain-encoded fallback path).predicate_column.h: implementinsert_offsets_from_lengthsfor theStringRefspecialisation ofPredicateColumnType. A single backing buffer is allocated from the internal Arena and zero-filled; each element records the correct length so that downstreamfilter_by_selector/copy_column_data_by_selectorcan materialise the right-sized strings into the outputColumnString, giving the correct result forlength().Release note
Fix crash/error when calling
length()on a varchar/char column that uses the low-cardinality (dictionary) optimisation.Check List (For Author)
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)