[fix](be) Validate dictionary codewords before decoding#63617
Closed
mrhhsg wants to merge 3 commits into
Closed
Conversation
### What problem does this PR solve?
Issue Number: None
Problem Summary: Dictionary-encoded string pages use codewords from the data page to index the page dictionary during decoding. Corrupted codeword data can reference entries outside the dictionary and lead to out-of-bounds dictionary reads. Validate dictionary codewords against the dictionary size before materializing string columns, predicate columns, dictionary columns, and offset-only length reads. Invalid codewords now fail with a corruption status instead of indexing outside the dictionary.
### Release note
None
### Check List (For Author)
- Test:
- Unit Test: ./run-be-ut.sh --run --filter=ColumnStringTest.insert_many_dict_data*:PredicateColumnTest.InsertManyDictData*:ColumnDictionaryTest.insert_many_dict_data*:BinaryDictPageTest.TestRejectInvalidDictCodeword* -j 16
- Behavior changed: Yes. Corrupted dictionary-encoded string pages with invalid codewords are rejected as corruption instead of being decoded.
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Problem Summary: The column array view unit test still used the removed COW helper assume_mutable after rebasing to the current master. Use assert_mutable for the newly created unique column before wrapping it in ColumnNullable so the BE UT target compiles.
### Release note
None
### Check List (For Author)
- Test:
- Unit Test: ./run-be-ut.sh --run --filter=ColumnStringTest.insert_many_dict_data*:PredicateColumnTest.InsertManyDictData*:ColumnDictionaryTest.insert_many_dict_data*:BinaryDictPageTest.TestRejectInvalidDictCodeword* -j 16
- Behavior changed: No
- Does this need documentation: No
Member
Author
|
/review |
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Contributor
There was a problem hiding this comment.
Review result: no blocking issues found in the PR-scoped diff. The added validation checks reject negative/out-of-range dictionary codewords before indexing dictionary arrays, and BinaryDictPageDecoder converts those failures to Status::Corruption in the read paths covered by the change.
Critical checkpoint conclusions:
- Goal/test proof: The PR appears to target safe handling of invalid dictionary codewords. Added BE unit tests cover ColumnString, ColumnDictionary, PredicateColumn, and BinaryDictPageDecoder normal/offset-only read paths.
- Scope/focus: The code changes are small and focused on dictionary codeword validation plus tests.
- Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, configuration, storage-format, or FE-BE protocol compatibility concerns were introduced.
- Parallel paths: The affected ColumnString, ColumnDictionary, PredicateColumn, and BinaryDictPageDecoder paths are covered consistently for the PR scope.
- Error handling: Invalid data now produces Corruption instead of indexing unchecked; decoder callers use RETURN_IF_CATCH_EXCEPTION where column implementations throw.
- Data correctness: No committed-data visibility or version/delete-bitmap semantics are affected.
- Performance: Validation adds a linear precheck in dictionary insertion paths; this is proportional to decoded rows and acceptable for corruption safety.
- Test coverage: Relevant unit tests were added. I did not run tests in this review runner.
- Observability: No additional observability needed for this localized corruption handling.
User focus response: no additional user-provided review focus was present.
### What problem does this PR solve?
Issue Number: None
Problem Summary: The dictionary string decode validation can be performed while building offsets, avoiding an additional scan over the codeword array in ColumnString. Keep the rollback guard so invalid codewords or later size checks do not leave partially appended offsets.
### Release note
None
### Check List (For Author)
- Test:
- Unit Test: ./run-be-ut.sh --run --filter=ColumnStringTest.insert_many_dict_data*:PredicateColumnTest.InsertManyDictData*:ColumnDictionaryTest.insert_many_dict_data*:BinaryDictPageTest.TestRejectInvalidDictCodeword* -j 16
- Behavior changed: No
- Does this need documentation: No
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Issue Number: None
Problem Summary: Dictionary-encoded string pages use codewords from the data page to index the page dictionary during decoding. Corrupted codeword data can reference entries outside the dictionary and lead to out-of-bounds dictionary reads. This PR validates dictionary codewords against the dictionary size before materializing string columns, predicate columns, dictionary columns, and offset-only length reads. Invalid codewords now fail with a corruption status instead of indexing outside the dictionary.
This PR also updates a stale column array view unit test helper usage required by the current master build after rebasing.
Release note
None
Check List (For Author)