Skip to content

[fix](be) Validate dictionary codewords before decoding#63617

Closed
mrhhsg wants to merge 3 commits into
apache:masterfrom
mrhhsg:fix/validate-dict-codewords
Closed

[fix](be) Validate dictionary codewords before decoding#63617
mrhhsg wants to merge 3 commits into
apache:masterfrom
mrhhsg:fix/validate-dict-codewords

Conversation

@mrhhsg
Copy link
Copy Markdown
Member

@mrhhsg mrhhsg commented May 25, 2026

What problem does this PR solve?

Issue Number: None

Problem Summary: Dictionary-encoded string pages use codewords from the data page to index the page dictionary during decoding. Corrupted codeword data can reference entries outside the dictionary and lead to out-of-bounds dictionary reads. This PR validates dictionary codewords against the dictionary size before materializing string columns, predicate columns, dictionary columns, and offset-only length reads. Invalid codewords now fail with a corruption status instead of indexing outside the dictionary.

This PR also updates a stale column array view unit test helper usage required by the current master build after rebasing.

Release note

None

Check List (For Author)

  • Test:
    • Unit Test: ./run-be-ut.sh --run --filter=ColumnStringTest.insert_many_dict_data*:PredicateColumnTest.InsertManyDictData*:ColumnDictionaryTest.insert_many_dict_data*:BinaryDictPageTest.TestRejectInvalidDictCodeword* -j 16
    • Manual test: build-support/clang-format.sh
    • Manual test: build-support/check-format.sh
    • Manual test: git diff --check
  • Behavior changed: Yes. Corrupted dictionary-encoded string pages with invalid codewords are rejected as corruption instead of being decoded.
  • Does this need documentation: No

mrhhsg added 2 commits May 25, 2026 18:09
### What problem does this PR solve?

Issue Number: None

Problem Summary: Dictionary-encoded string pages use codewords from the data page to index the page dictionary during decoding. Corrupted codeword data can reference entries outside the dictionary and lead to out-of-bounds dictionary reads. Validate dictionary codewords against the dictionary size before materializing string columns, predicate columns, dictionary columns, and offset-only length reads. Invalid codewords now fail with a corruption status instead of indexing outside the dictionary.

### Release note

None

### Check List (For Author)

- Test:
    - Unit Test: ./run-be-ut.sh --run --filter=ColumnStringTest.insert_many_dict_data*:PredicateColumnTest.InsertManyDictData*:ColumnDictionaryTest.insert_many_dict_data*:BinaryDictPageTest.TestRejectInvalidDictCodeword* -j 16
- Behavior changed: Yes. Corrupted dictionary-encoded string pages with invalid codewords are rejected as corruption instead of being decoded.
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Problem Summary: The column array view unit test still used the removed COW helper assume_mutable after rebasing to the current master. Use assert_mutable for the newly created unique column before wrapping it in ColumnNullable so the BE UT target compiles.

### Release note

None

### Check List (For Author)

- Test:
    - Unit Test: ./run-be-ut.sh --run --filter=ColumnStringTest.insert_many_dict_data*:PredicateColumnTest.InsertManyDictData*:ColumnDictionaryTest.insert_many_dict_data*:BinaryDictPageTest.TestRejectInvalidDictCodeword* -j 16
- Behavior changed: No
- Does this need documentation: No
@mrhhsg
Copy link
Copy Markdown
Member Author

mrhhsg commented May 25, 2026

/review

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: no blocking issues found in the PR-scoped diff. The added validation checks reject negative/out-of-range dictionary codewords before indexing dictionary arrays, and BinaryDictPageDecoder converts those failures to Status::Corruption in the read paths covered by the change.

Critical checkpoint conclusions:

  • Goal/test proof: The PR appears to target safe handling of invalid dictionary codewords. Added BE unit tests cover ColumnString, ColumnDictionary, PredicateColumn, and BinaryDictPageDecoder normal/offset-only read paths.
  • Scope/focus: The code changes are small and focused on dictionary codeword validation plus tests.
  • Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, configuration, storage-format, or FE-BE protocol compatibility concerns were introduced.
  • Parallel paths: The affected ColumnString, ColumnDictionary, PredicateColumn, and BinaryDictPageDecoder paths are covered consistently for the PR scope.
  • Error handling: Invalid data now produces Corruption instead of indexing unchecked; decoder callers use RETURN_IF_CATCH_EXCEPTION where column implementations throw.
  • Data correctness: No committed-data visibility or version/delete-bitmap semantics are affected.
  • Performance: Validation adds a linear precheck in dictionary insertion paths; this is proportional to decoded rows and acceptable for corruption safety.
  • Test coverage: Relevant unit tests were added. I did not run tests in this review runner.
  • Observability: No additional observability needed for this localized corruption handling.

User focus response: no additional user-provided review focus was present.

### What problem does this PR solve?

Issue Number: None

Problem Summary: The dictionary string decode validation can be performed while building offsets, avoiding an additional scan over the codeword array in ColumnString. Keep the rollback guard so invalid codewords or later size checks do not leave partially appended offsets.

### Release note

None

### Check List (For Author)

- Test:
    - Unit Test: ./run-be-ut.sh --run --filter=ColumnStringTest.insert_many_dict_data*:PredicateColumnTest.InsertManyDictData*:ColumnDictionaryTest.insert_many_dict_data*:BinaryDictPageTest.TestRejectInvalidDictCodeword* -j 16
- Behavior changed: No
- Does this need documentation: No
@mrhhsg mrhhsg closed this May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants