[fix](be) Validate Arrow input buffers before column conversion#64796
[fix](be) Validate Arrow input buffers before column conversion#64796Mryange wants to merge 4 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
99899b8 to
ccada0f
Compare
|
/review |
There was a problem hiding this comment.
I found three blocking issues in the Arrow input validation changes. The main problems are that the new config does not actually cover all Arrow-to-Doris serde paths, the list/map validation reintroduces aligned typed reads on offset buffers that Doris already handled with unaligned loads, and the string/binary path can still dereference a missing value-data buffer for zero-length non-null values.
Validation performed: reviewed the full PR diff and surrounding Arrow conversion call paths; checked existing inline review context and found no duplicates; verified the Arrow 17.0.0 API behavior for the offset read issue; ran git diff --check successfully. I did not run BE unit tests because this worktree is missing .worktree_initialized, custom_env.sh, thirdparty/installed, and initialized submodules, and the required setup script would need files outside the current checkout.
| DECLARE_Int32(arrow_flight_sql_port); | ||
|
|
||
| // Validate Arrow input buffers before converting them to Doris columns. | ||
| DECLARE_Bool(enable_arrow_input_validation); |
There was a problem hiding this comment.
This flag is exposed as validating Arrow input before Doris column conversion, but it only protects the serdes touched in this PR. The same virtual read_column_from_arrow path is used by shared callers such as convert_from_arrow_batch() for Python UDF/UDAF results and by external Arrow readers, yet sibling serdes still dereference Arrow buffers without consulting the flag: JSONB reads value_offsets()/value_data(), decimals call DecimalArray::Value(), DateV2/DateTimeV2 read raw_values(), IPv4 dereferences buffers[1], and IPv6 reads string value_data()/value_offset(). A malformed Arrow result for one of those return types can still take the old unchecked path, so the new config gives a false sense that Arrow input validation is enabled for all Doris conversions. Please either apply the same range/buffer checks to the remaining read_column_from_arrow implementations that access Arrow buffers, or narrow the config/comment/tests to the exact types this PR actually protects.
| start, end, offsets.length()); | ||
| } | ||
|
|
||
| int64_t previous_offset = offsets.Value(start); |
There was a problem hiding this comment.
This validation path reads list/map offsets with offsets.Value(), which in Arrow 17 is a typed raw_values()[i] load. That reintroduces the aligned Int32 access the existing runtime code avoided with unaligned_load<int32_t>() just below the validation call. Doris already has UBSan-oriented positive tests that construct these Arrow offset buffers unaligned (offsets_address % 4 == 1) and then call the list/map readers, so with validation enabled by default those tests can fail before reaching the unaligned-safe conversion path. Please read the offsets in check_arrow_offsets_range() through the same byte-pointer plus unaligned_load<int32_t>() pattern, and keep a validation-enabled unaligned list/map test so this does not regress again.
| assert_cast<ColumnType&>(column).insert_data( | ||
| reinterpret_cast<const char*>(raw_data), length); | ||
| if (config::enable_arrow_input_validation) { | ||
| check_arrow_value_range(*concrete_array, start_offset, length, buffer_size); |
There was a problem hiding this comment.
The range check still lets a missing value-data buffer through when the non-null value is empty. In this branch buffer_size becomes 0 when value_data() is null, and check_arrow_value_range(..., start_offset=0, length=0, buffer_size=0) succeeds. The next statement still evaluates buffer->data() + start_offset, so a malformed Arrow StringArray/BinaryArray with offsets {0, 0} and no value buffer crashes instead of producing the validation error this PR is meant to add. Please either reject a missing value-data buffer before the loop when any non-null row will be read, or avoid dereferencing buffer for zero-length values. The same issue exists in the large string/binary branch below.
|
/review |
There was a problem hiding this comment.
Automated review completed on head 8de9140.
I found one additional correctness issue in the new Arrow validation helper: malformed nullable Arrow input can report nulls while omitting the validity bitmap, and the current validation path accepts it before Doris builds the nullable null map.
Critical checkpoint conclusions:
- Goal/test: the PR adds default-enabled Arrow buffer validation for selected BE Arrow readers, but the nullable validity-bitmap case is still not correctly rejected and needs a regression test.
- Scope: the changes are focused on BE Arrow serde validation, tests, and benchmarks.
- Concurrency/lifecycle: no new runtime concurrency or lifecycle issue found; test config mutation uses an RAII restore helper.
- Config/compatibility: a new BE config is added; existing review thread 3495786715 already covers the broader config-surface mismatch with unprotected sibling serdes.
- Parallel paths: existing threads already cover unaligned list/map offset validation and missing zero-length string value-data handling, so I did not duplicate those comments.
- Tests/results: the new BE UT is included by the recursive be/test CMake glob, but it does not cover the missing validity-bitmap/null_count case. I did not run BE UTs because thirdparty/installed is absent in this runner.
- Performance/observability: no separate performance or observability issue was substantiated beyond the validation correctness gaps.
User focus: no additional user-provided review focus was present.
Subagent conclusions: optimizer-rewrite proposed OPT-1, accepted as inline comment F-001. tests-session-config reported TSC-001 through TSC-003 as duplicates of existing review threads and found no distinct new tests/session/config issue. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the same current ledger/comment set.
Validation performed: static review of all changed files and surrounding Arrow serde paths; git show --check over the listed PR files was clean.
| inline void check_arrow_validity_bitmap(const arrow::Array& array) { | ||
| arrow_validation_detail::check_arrow_length_and_offset(array); | ||
| const auto& buffers = array.data()->buffers; | ||
| if (buffers.empty() || !buffers[0]) { |
There was a problem hiding this comment.
This still lets a malformed nullable Arrow array through when it reports nulls but omits the validity bitmap. check_arrow_validity_bitmap() returns as soon as buffers[0] is missing, and then DataTypeNullableSerDe::read_column_from_arrow() immediately builds the Doris null map with arrow_array->IsNull(i). In Arrow 17, a primitive array without a validity bitmap does not have per-row null bits to consult; a partial positive null_count can therefore be consumed as all-valid instead of being rejected. Please reject a missing validity bitmap whenever the Arrow metadata says nulls may be present, and add a nullable test like an Int64Array with null_bitmap=nullptr and null_count=1.
24f208f to
633ae15
Compare
633ae15 to
72d3d2c
Compare
|
/review |
There was a problem hiding this comment.
Automated review completed. I found two additional blocking issues beyond the existing inline threads: sliced Arrow arrays can pass the new validation while the touched readers still ignore array.offset(), and the new validation helpers throw Exception through shared Status-returning conversion APIs without a catch-and-convert boundary.
Critical checkpoints: the PR goal is clear, but the current code does not fully accomplish safe Arrow input validation because valid sliced arrays can decode wrong rows and validation failures can escape the expected Status path. The change is mostly focused, with no concurrency, lifecycle, transaction, persistence, or FE/BE protocol concerns found. A BE config is added with default enabled behavior; the earlier existing thread already covers that the advertised coverage is broader than the touched serdes. Parallel paths were checked; existing threads already cover sibling serde coverage, unaligned offset loads, missing string value buffers, and missing nullable validity bitmaps, so I did not repeat those. Test coverage includes new BE unit tests, but it lacks sliced-array coverage and caller-level convert_from_arrow_batch() status propagation coverage. I could not run local build/tests because this checkout is shallow and thirdparty/installed is missing.
User focus: no additional user-provided focus points were supplied.
Subagent conclusions: OPT-1 became inline comment M-1, TSC-1 became inline comment M-2, D-1 through D-4 were merged as duplicates of existing review threads, and convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the same final ledger/comment set.
| arrow_validation_detail::throw_invalid_arrow(array, "data buffer is missing"); | ||
| } | ||
|
|
||
| const size_t offset = static_cast<size_t>(array.offset()); |
There was a problem hiding this comment.
This validation path still accepts sliced Arrow arrays, but the readers below are not offset-aware. For example, a zero-copy sliced Int64Array can have offset=1,length=2 over a buffer containing [1,2,3]; check_arrow_fixed_width_buffer() validates offset + length bytes, but DataTypeNumberSerDe::read_column_from_arrow() later copies from buffer->data() + start, so the caller gets [1,2] instead of the logical slice [2,3]. The same mismatch exists in the touched string/list/map paths that read value_offsets()->data() or offsets()->raw_values() with start directly. Please either reject non-zero array.offset() for these raw-buffer readers, or make all of them use offset-aware APIs/pointers and add sliced-array tests for fixed-width, string, list, and map inputs.
| } | ||
|
|
||
| inline void throw_invalid_arrow(std::string_view arrow_type, std::string_view message) { | ||
| throw Exception(ErrorCode::INVALID_ARGUMENT, "Invalid Arrow {}: {}", arrow_type, message); |
There was a problem hiding this comment.
These helpers throw doris::Exception from read_column_from_arrow() implementations whose public contract is Status, but not every shared caller has a catch-and-convert boundary. In particular, FromRecordBatchToBlockConverter::convert() calls the serde through plain RETURN_IF_ERROR, and convert_from_arrow_batch() is then used by Python UDF/UDAF result conversion through another RETURN_IF_ERROR. A malformed Arrow result can therefore bypass the normal Doris Status propagation instead of returning an invalid-argument status. Please either make the validation helpers return Status, or wrap the shared read_column_from_arrow() caller paths with the existing catch-and-convert macro and add a caller-level test through convert_from_arrow_batch().
|
/review |
There was a problem hiding this comment.
Automated review pass completed. I did not find any new non-duplicate inline issue beyond the existing review threads already attached to this PR.
Critical checkpoint conclusions:
- Task goal: the PR is intended to validate Arrow input buffers before Doris column conversion. The current diff adds a config flag, validation helpers, coverage in selected serdes, BE unit tests, and a benchmark.
- Existing known issues: the active inline threads already cover the remaining high-risk validation gaps, including incomplete sibling-serde buffer validation, exception propagation through shared callers, sliced-array handling, missing validity bitmaps, missing string value buffers for empty values, and unaligned list/map offset reads. I treated these as already-known context and did not resubmit them.
- Scope/focus: no additional user focus was provided; I reviewed the changed files and directly related Arrow conversion paths.
- Concurrency/lifecycle/storage compatibility: no new thread, lock, transaction, persistence, or storage-format path was introduced by this PR.
- Config/compatibility: the new BE config defaults to enabling validation. Remaining disabled-validation behavior that looked suspicious was verified as inherited from the PR base rather than introduced by this diff.
- Tests/style: the new BE unit test file is picked up by the BE test CMake glob, and
git diff --check 64df172ca8582442770ae9edb5a5db09b80e723f 4c2b695de5b8c973b5eaef1973d1a9f9fed874ceis clean. I did not run BE unit tests becausethirdparty/installedis missing in this checkout.
Subagent conclusions:
optimizer-rewriteproposed OPT-1; main verification dismissed it as inherited from the PR base because therow_count == 0 || buffer == nullptrguard already exists before this PR.tests-session-configreported one duplicate of an existing GitHub thread and one duplicate of OPT-1; no new non-duplicate test/config issue remained.- Convergence round 1 ended with both live subagents replying
NO_NEW_VALUABLE_FINDINGSfor the same current ledger and empty proposed final inline comment set.
|
run buildall |
|
/review |
There was a problem hiding this comment.
Follow-up automated review for head 94418b2720a033ffda4e5f8c6dedb0d68ad34942.
I did not find any new non-duplicate inline issue to submit. The remaining substantiated Arrow validation concerns I could verify are already covered by existing inline threads: exception propagation from throwing validation helpers, partial raw-buffer validation coverage in sibling serdes, the zero-length string/binary missing value-buffer path, sliced-array offset handling, missing validity bitmap handling, and unaligned list/map offset validation. I did not re-submit those as duplicate comments.
Critical checkpoints: the PR goal is still Arrow input-buffer validation before Doris column conversion; the implementation is focused on BE serde/config/test/benchmark code; no FE optimizer/rewrite, transaction, storage-format, or concurrency path is changed. The new BE unit test is included by the recursive be/test/CMakeLists.txt glob and git diff --check is clean. The config is a non-mutable BE flag defaulting on; sibling serde coverage remains the key correctness concern and is already tracked in the existing review thread. Parallel Arrow conversion callers and serde families were checked for duplicate suppression.
Test/validation limits: I did not run BE UTs or format checks because this checkout is missing .worktree_initialized, thirdparty/installed, initialized submodules, and the required clang-format v16 toolchain.
Subagent conclusions: optimizer-rewrite found no optimizer/rewrite-specific issue for this BE-only diff. tests-session-config found no new test/config/style issue beyond existing threads and noted the same validation limits. Final convergence round 1 ended with both subagents reporting NO_NEW_VALUABLE_FINDINGS for the current ledger/comment set.
User focus: no additional user-provided review focus was present.
|
run buildall |
TPC-H: Total hot run time: 29660 ms |
TPC-DS: Total hot run time: 174428 ms |
ClickBench: Total hot run time: 25.29 s |
TPC-H: Total hot run time: 29885 ms |
TPC-DS: Total hot run time: 173699 ms |
ClickBench: Total hot run time: 25.18 s |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 29874 ms |
TPC-DS: Total hot run time: 173516 ms |
ClickBench: Total hot run time: 25.22 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Doris converts Arrow arrays into Doris columns through
DataTypeSerDe::read_column_from_arrow. If the Arrow producer sends malformed array metadata, such as truncated validity bitmaps, truncated offsets buffers, non-monotonic string/list offsets, or offsets pointing past the child/value buffer, the existing conversion code may read invalid Arrow memory and crash BE.Root cause: the Arrow-to-Doris serde path trusted Arrow array metadata before accessing Arrow buffers. Several hot paths call
IsNull(),Value(), raw value offsets, list offsets, or child arrays directly, so malformed Arrow buffers can trigger out-of-bounds reads before Doris reports a clean error.This PR adds lightweight, type-specific Arrow input validation before those buffer accesses. The checks are modeled as local preflight checks rather than full
ValidateFull(): validity bitmap size, fixed-width data buffer size, boolean bitmap size, binary/string offsets buffer size, per-value data range, and list/map offsets monotonicity plus child length bounds. A BE configenable_arrow_input_validationis added and defaults totrue.The change also fixes an existing
FixedSizeBinaryArraysliced-read null check: the loop uses a relative index afterGetValue(start), butIsNull()expects the original Arrow row index, so it must checkstart + offset_i.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)