[bug](arrow) fix arrow type change cause read data coredump#62681
[bug](arrow) fix arrow type change cause read data coredump#62681zhangstar333 wants to merge 4 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
Findings:
be/src/core/data_type_serde/data_type_string_serde.cpp: the new int32 path usesBinaryArray::value_offset(), which dereferencesraw_value_offsets_directly. That reintroduces the misaligned-offset UB/crash that the existingArrowMemNotAligned*tests were guarding against.be/src/format/arrow/arrow_block_convertor.cpp: widening everyutf8field tolarge_utf8is not safe today. Several serdes that map to Arrowutf8(JSONB,VARIANT,DATE/DATETIME,IPV6,LARGEINT, etc.) still hard-cast the builder toarrow::StringBuilder, so they will fail as soon as a large column crosses the new threshold.be/src/format/arrow/arrow_block_convertor.cpp: returning a per-batch widened schema breaks Arrow Flight's fixed-schema contract.ArrowFlightBatchReaderBase::schema()still exposes the original schema fetched from the result buffer, and Arrow'sRecordBatchStreamrejects later batches whose schema differs.be/src/format/arrow/arrow_block_convertor.cpp: the ranged overload now constructs theRecordBatchwith_block.rows()instead ofactual_rows, producing an invalid batch whenever a caller requests a subrange.
Critical checkpoints:
- Goal / correctness: The goal is to safely read Arrow string/binary data when the backing offsets are actually int64. The receiver-side idea is valid, but the current patch does not accomplish it safely end to end because it regresses aligned-offset handling and introduces sender/streaming regressions.
- Minimality / focus: Mostly focused, but the sender-side widening change broadens behavior beyond the original bug and impacts unrelated
utf8-mapped types. - Concurrency: No new concurrency or locking risks found in these paths.
- Lifecycle / static init: No special lifecycle or static-initialization issues found.
- Config changes: None.
- Compatibility / parallel paths: Not preserved. Arrow Flight still advertises the original schema, while this patch can emit widened batch schemas; parallel
utf8writer paths in other serdes were not updated forLargeStringBuilder. - Conditional checks: The new offset-width heuristic is understandable, but the int32 branch removed the previous byte-wise offset loads that made misaligned Arrow buffers safe.
- Test coverage: Added unit tests cover the int64-offset mismatch, but they do not cover the widened-schema sender path, Arrow Flight schema handling, or the ranged
convert_to_arrow_batchpath. - Test results: Not run in this review environment.
- Observability: No additional observability appears necessary for this scope.
- Transaction / persistence / data writes / FE-BE variable passing: Not applicable.
- Performance: No blocking performance issue beyond the correctness regressions above.
- Other issues: The returned
RecordBatchrow count regression in the ranged overload should be fixed before merge.
Requesting changes.
| } else { | ||
| for (auto offset_i = start; offset_i < end; ++offset_i) { | ||
| if (!concrete_array->IsNull(offset_i)) { | ||
| const int32_t start_offset = concrete_array->value_offset(offset_i); |
There was a problem hiding this comment.
The previous code used memcpy here so wrapped Arrow buffers did not need aligned int32_t access. arrow::BinaryArray::value_offset() / value_length() dereference raw_value_offsets_[i] directly, so this change reopens the misaligned-read UB that the existing ArrowMemNotAligned and ArrowMemNotAlignedNestedArr tests are meant to catch. Any StringArray built from a misaligned offsets buffer (for example arrow::Buffer::Wrap(ptr + 1, ...)) will hit undefined behavior again. Please keep the byte-wise offset loads on this int32 path instead of switching back to value_offset().
| _cur_type = _block.get_by_position(idx).type; | ||
| auto column = _cur_col->convert_to_full_column_if_const(); | ||
| auto arrow_type = _schema->field(idx)->type(); | ||
| if (arrow_type->name() == "utf8" && column->byte_size() >= MAX_ARROW_UTF8) { |
There was a problem hiding this comment.
This promotion is too broad. convert_to_arrow_type() maps several non-string Doris types to Arrow utf8 (JSONB, VARIANT, DATE/DATETIME, IPV6, LARGEINT, ...), but their write_column_to_arrow() implementations still hard-cast array_builder to arrow::StringBuilder (for example data_type_jsonb_serde.cpp, data_type_variant_serde.cpp, data_type_date_or_datetime_serde.cpp, data_type_ipv6_serde.cpp, data_type_number_serde.cpp). Once one of those columns crosses the 2 GiB threshold, this branch will hand it a LargeStringBuilder and the serialization path will fail. The widening either needs to be limited to serdes that already support LargeStringBuilder, or all utf8 writers need to be updated together.
| } | ||
| } | ||
| *out = arrow::RecordBatch::Make(_schema, actual_rows, std::move(_arrays)); | ||
| auto schema = schema_changed ? arrow::schema(std::move(fields)) : _schema; |
There was a problem hiding this comment.
Changing only the returned RecordBatch schema is still not enough for Arrow Flight. ArrowFlightBatchReaderBase::schema() returns the original schema fetched from ArrowFlightResultBlockBuffer, and arrow::flight::RecordBatchStream initializes its IPC writer once with that fixed schema. Later WriteRecordBatch(*batch) checks batch.schema()->Equals(schema_, false) and rejects mismatches, so the first oversized utf8 batch here will fail with Tried to write record batch with different schema instead of being streamed. The source buffer / reader schema has to be widened consistently as well, not just the per-batch object.
| } | ||
| *out = arrow::RecordBatch::Make(_schema, actual_rows, std::move(_arrays)); | ||
| auto schema = schema_changed ? arrow::schema(std::move(fields)) : _schema; | ||
| *out = arrow::RecordBatch::Make(std::move(schema), _block.rows(), std::move(_arrays)); |
There was a problem hiding this comment.
The 7-argument overload computes actual_rows, but this constructor now uses _block.rows() instead. When a caller requests a subrange (start_row, end_row), each column array is built with actual_rows elements while the RecordBatch advertises the full block row count, leaving an internally inconsistent batch. This should stay actual_rows.
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
|
Hi @zhangstar333, I was thinking if we would need to rebuild schema while writing as well when large_utf8 change happens? ` |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
|
run buildall |
What problem does this PR solve?
large_utf8 using OffsetType = Int64Type;
utf8() using OffsetType = Int32Type;
so if change the arrow type, read arrow offset data should use int64 instead of int32, maybe cause native offset length.
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)