[fix](parquet) Fix wrong condition#63509
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
I found a blocking correctness issue in the Parquet bounds check.
Critical checkpoint conclusions:
- Goal/test: the PR appears intended to make the byte-array bounds check overflow-safe, but the current expression still performs the potentially overflowing 32-bit addition before widening. I did not find a test that proves malformed large lengths are rejected.
- Scope: the change is small and focused, but the implementation does not accomplish the stated safety property.
- Concurrency/lifecycle: no concurrency or special lifecycle concern in this decoder path.
- Config/compatibility/persistence: no new configs, storage format changes, or edit-log compatibility concerns found.
- Parallel paths: the same overflow pattern is repeated in all three modified checks in this file and should be fixed consistently.
- Tests: malformed byte-array length coverage appears missing for this overflow case.
- Observability/performance: no additional observability or performance issue found beyond the correctness problem below.
User focus: no additional user-provided review focus was supplied.
| decode_fixed32_le(reinterpret_cast<const uint8_t*>(_data->data) + _offset); | ||
| _offset += 4; | ||
| if (UNLIKELY(_offset + length) > _data->size) { | ||
| if (UNLIKELY(cast_set<uint64_t>(_offset + length) > _data->size)) { |
There was a problem hiding this comment.
This still overflows before the cast: _offset and length are both uint32_t, so a malformed value such as _offset = 8 and length = UINT32_MAX wraps in _offset + length before cast_set<uint64_t> sees it. That can pass the bounds check and then skip/read outside the Slice. Cast one operand before adding, or compare with subtraction such as length > _data->size - _offset. The same pattern appears in the two decode paths below.
|
/review |
There was a problem hiding this comment.
Review completed. I did not find additional blocking issues in the current PR head. The previously raised overflow concern in byte_array_plain_decoder.cpp is addressed by casting _offset before adding length in all three changed checks.
Critical checkpoint conclusions:
- Goal/test: The goal is to prevent integer overflow in byte-array plain decoder bounds checks. The current code accomplishes this for the reviewed overflow path. No new test is included; risk is limited by the small localized change, but a malformed Parquet regression/unit test would improve coverage.
- Scope: The modification is small, clear, and focused.
- Concurrency/lifecycle/config/compatibility: Not applicable; no shared state, lifecycle, configuration, protocol, or storage-format change is introduced.
- Parallel paths: All three equivalent BYTE_ARRAY plain decode/skip paths in this file were updated consistently.
- Error handling/data correctness: Existing Status error propagation is preserved, and malformed input now fails before using an out-of-bounds slice.
- Performance/observability: No meaningful performance or observability impact.
- User focus: No additional user-provided review focus was specified.
|
run buildall |
TPC-H: Total hot run time: 31276 ms |
TPC-DS: Total hot run time: 169463 ms |
|
/review |
There was a problem hiding this comment.
I found one blocking correctness/safety issue in the PR. The plain decoder overflow concern already raised in the existing review thread is known context and was not duplicated here.
Critical checkpoint conclusions:
- Goal/test: The PR aims to harden malformed Parquet byte-array length handling and updates an HDFS TVF regression expectation, but the dictionary decoder fix is still incomplete for truncated dictionary pages.
- Scope: The change is small and focused, but the dictionary bounds condition does not cover the actual bytes consumed.
- Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, config, wire/storage-format, or compatibility concerns found.
- Parallel paths: Plain and dictionary paths are both relevant; the dictionary path has a distinct remaining OOB case.
- Error handling/data correctness/memory safety: The dictionary decoder can read/copy outside the page before returning
Status::Corruption, so malformed input can still violate memory safety. - Tests: The regression test exercises a plain-decoder error path, but I did not see coverage proving malformed dictionary byte-array pages are rejected before any out-of-bounds access.
- Observability/performance: No additional observability or performance concerns found for this small change.
- User focus: No additional user-provided review focus was specified.
| return Status::Corruption("Wrong data length in dictionary"); | ||
| } | ||
| uint32_t l = decode_fixed32_le(_dict.get() + offset_cursor); | ||
| if (UNLIKELY(offset_cursor + l > cast_set<uint32_t>(length) || |
There was a problem hiding this comment.
This check still allows truncated dictionary pages to be read out of bounds. decode_fixed32_le(_dict.get() + offset_cursor) consumes 4 bytes, but the preceding check only rejects offset_cursor > length; for example length = 2 and num_values = 1 reaches the decode with only two bytes available. The payload check also compares offset_cursor + l before accounting for the 4-byte prefix, so length = 4, l = 1 passes here, then the second loop copies one byte from dict_item_address + 4 before reporting corruption. Please validate length - offset_cursor >= 4 before decoding the prefix and then validate l <= length - offset_cursor after advancing past the prefix, using subtraction or widened arithmetic to avoid overflow.
|
/review |
There was a problem hiding this comment.
The PR still has blocking correctness/safety issues already captured in the existing inline review threads, so I am not duplicating those comments. In particular, the byte-array plain decoder still needs overflow-safe length checks in all decode/skip paths, and the byte-array dictionary decoder still needs to verify that at least 4 bytes remain before reading each length prefix and that the payload length fits after advancing past the prefix.
Critical checkpoint conclusions:
- Goal/test: the goal appears to be hardening malformed Parquet byte-array handling and updating the regression expectation; the current code does not fully accomplish it because malformed pages can still drive out-of-bounds reads before returning an error.
- Scope: the change is small and focused, but the bounds checks are incomplete.
- Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle, config, or compatibility concerns found in the changed code.
- Parallel paths: the same plain-decoder length pattern appears in skip, content decode, and filtered-content decode; all need the same overflow-safe treatment.
- Tests/results: the regression test exercises the ARROW-17100 failure path, but the code still lacks coverage for the truncated dictionary-prefix/payload cases and overflow cases described in the existing comments.
- Observability: existing error propagation is sufficient once the bounds checks are correct.
- Transaction/persistence/data writes/FE-BE variables: not applicable.
- Performance: no additional performance issue found.
User focus: no additional user-provided review focus was specified.
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Parquet byte array decoders still used addition-based bounds checks for malformed length-prefixed data. The plain decoder could overflow _offset + length before rejecting a value, and the dictionary decoder could decode or copy from truncated pages before validating that enough bytes remained for the length prefix and payload. This change validates remaining bytes with subtraction before decoding or advancing offsets, rejects negative dictionary page lengths, and adds unit coverage for truncated prefixes, truncated payloads, and overflow-sized values.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check
- Could not run build-support/clang-format.sh because llvm@16/clang-format is not installed in this environment
- Could not run BE UT because run-be-ut.sh requires JDK-17, but this environment only has JDK-11
- Behavior changed: No
- Does this need documentation: No
|
/review |
There was a problem hiding this comment.
No blocking issues found in the current GitHub PR diff. The final head revision fixes the previously raised byte-array plain overflow and dictionary truncated-page concerns by checking remaining bytes before decoding/advancing, and adds focused BE unit coverage for malformed prefixes/payloads and overflow-sized plain values.
Critical checkpoint conclusions:
- Goal and proof: The PR targets malformed Parquet byte-array bounds checks; the final code accomplishes this for the reviewed plain and dictionary paths, with added unit tests.
- Scope: The effective GitHub diff is small and focused on Parquet decoder validation plus the related regression expectation update.
- Concurrency/lifecycle: No new shared state, locking, threads, or non-obvious lifecycle/static initialization concerns were introduced.
- Configuration/compatibility: No new config, protocol, storage format, or rolling-upgrade compatibility concern.
- Parallel paths: The plain decoder skip/content/filtered-content paths were updated consistently; dictionary set_dict validates both sizing and copy passes.
- Error handling/data correctness: New checks return non-OK Status before out-of-bounds decode/copy; Status propagation is preserved.
- Tests: Added BE unit tests cover the core malformed cases. I did not run BE UTs in this review; I ran git diff --check for the final head commit.
- Observability/performance: No new observability need; added checks are constant-time and appropriate for corrupted input handling.
User focus: No additional user-provided review focus was specified.
|
run buildall |
TPC-H: Total hot run time: 31534 ms |
TPC-DS: Total hot run time: 170029 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
PR approved by at least one committer and no changes requested. |
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)