[bug](arrow) fix arrow type change cause read data coredump#62681

Open
zhangstar333 wants to merge 4 commits into apache:master from zhangstar333:arrow_utf8

@zhangstar333
Contributor

@zhangstar333 zhangstar333 commented Apr 21, 2026

What problem does this PR solve?

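In Arrow, large_utf8 uses OffsetType = Int64Type, whereas utf8 uses OffsetType = Int32Type.

So when the Arrow type changes from utf8 to large_utf8, the offset data must be read as int64 instead of int32. Reading it as int32 may produce an invalid (for example, negative) offset length, which leads to the coredump below: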
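
To illustrate the mismatch, here is a minimal standalone sketch that does not use Arrow at all. The helpers `load_offset`, `value_length`, and `make_int64_offsets` are hypothetical names introduced only for this example. It builds an int64 offsets buffer and reads it with both widths:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: read the i-th offset from a raw offsets buffer whose
// element type is chosen by the caller (int32_t for utf8, int64_t for large_utf8).
template <typename OffsetT>
int64_t load_offset(const uint8_t* offsets, int64_t i) {
    assert(offsets != nullptr && i >= 0);
    OffsetT v;
    std::memcpy(&v, offsets + static_cast<size_t>(i) * sizeof(OffsetT), sizeof(OffsetT));
    return static_cast<int64_t>(v);
}

// Length of the i-th string value is offsets[i + 1] - offsets[i].
template <typename OffsetT>
int64_t value_length(const uint8_t* offsets, int64_t i) {
    return load_offset<OffsetT>(offsets, i + 1) - load_offset<OffsetT>(offsets, i);
}

// Build a large_utf8-style offsets buffer (int64 entries) for the given values.
std::vector<uint8_t> make_int64_offsets(const std::vector<std::string>& values) {
    std::vector<int64_t> offsets{0};
    for (const auto& s : values) {
        offsets.push_back(offsets.back() + static_cast<int64_t>(s.size()));
    }
    std::vector<uint8_t> raw(offsets.size() * sizeof(int64_t));
    std::memcpy(raw.data(), offsets.data(), raw.size());
    return raw;
}
```

With the values "ab" and "cde", `value_length<int64_t>` returns the real lengths, while `value_length<int32_t>` on the same buffer interprets each half of an int64 as a separate offset and returns wrong lengths. On a little-endian machine one of those differences comes out negative, and such a length converted to an unsigned size would turn into a huge copy, which is consistent with the memcpy frame in the trace.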

*** SIGSEGV invalid permissions for mapped object (@0x14a185d7c000) received by PID 589278 (TID 592608 OR 0x14a2d1447640) from PID 18446744071660093440; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /mnt/disk2/zhangsida/branch-40/be/src/common/signal_handler.h:420
 1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /mnt/disk2/zhangsida/install_data/jdk-17.0.2/lib/server/libjvm.so
 2# JVM_handle_linux_signal in /mnt/disk2/zhangsida/install_data/jdk-17.0.2/lib/server/libjvm.so
 3# 0x000014AAA263FC30 in /lib64/libc.so.6
 4# inline_memcpy(void*, void const*, unsigned long) at /mnt/disk2/zhangsida/branch-40/be/src/glibc-compatibility/memcpy/memcpy_x86_64.cpp:201
 5# memcpy at /mnt/disk2/zhangsida/branch-40/be/src/glibc-compatibility/memcpy/memcpy_x86_64.cpp:219
 6# doris::vectorized::ColumnStr<unsigned int>::insert_data(char const*, unsigned long) in /mnt/disk2/zhangsida/branch-40/output/be/lib/doris_be
 7# doris::vectorized::DataTypeStringSerDeBase<doris::vectorized::ColumnStr<unsigned int> >::read_column_from_arrow(doris::vectorized::IColumn&, arrow::Array const*, long, long, cctz::time_zone const&) const at /mnt/disk2/zhangsida/branch-40/be/src/vec/data_types/serde/data_type_string_serde.cpp:278
 8# doris::vectorized::DataTypeNullableSerDe::read_column_from_arrow(doris::vectorized::IColumn&, arrow::Array const*, long, long, cctz::time_zone const&) const at /mnt/disk2/zhangsida/branch-40/be/src/vec/data_types/serde/data_type_nullable_serde.cpp:351
 9# doris::vectorized::RemoteDorisReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/format/table/remote_doris_reader.cpp:91
10# doris::vectorized::FileScanner::_get_block_wrapped(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/file_scanner.cpp:468
11# doris::vectorized::FileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/file_scanner.cpp:405
12# doris::vectorized::Scanner::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner.cpp:110
13# doris::vectorized::Scanner::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner.cpp:83
14# doris::vectorized::ScannerScheduler::_scanner_scan(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner_scheduler.cpp:178
15# doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner_scheduler.cpp:76
16# doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}::operator()() const at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner_scheduler.cpp:75
17# bool std::__invoke_impl<bool, doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}&>(std::__invoke_other, doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:63
18# std::enable_if<is_invocable_r_v<bool, doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}&>, bool>::type std::__invoke_r<bool, doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}&>(doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:116
19# std::_Function_handler<bool (), doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
20# std::function<bool ()>::operator()() const at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:593
21# doris::vectorized::ScannerSplitRunner::process_for(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner_scheduler.cpp:419
22# doris::vectorized::PrioritizedSplitRunner::process() at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/executor/time_sharing/prioritized_split_runner.cpp:104
23# doris::vectorized::TimeSharingTaskExecutor::_dispatch_thread() at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/executor/time_sharing/time_sharing_task_executor.cpp:566
24# void std::__invoke_impl<void, void (doris::vectorized::TimeSharingTaskExecutor::*&)(), doris::vectorized::TimeSharingTaskExecutor*&>(std::__invoke_memfun_deref, void (doris::vectorized::TimeSharingTaskExecutor::*&)(), doris::vectorized::TimeSharingTaskExecutor*&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:76
25# std::__invoke_result<void (doris::vectorized::TimeSharingTaskExecutor::*&)(), doris::vectorized::TimeSharingTaskExecutor*&>::type std::__invoke<void (doris::vectorized::TimeSharingTaskExecutor::*&)(), doris::vectorized::TimeSharingTaskExecutor*&>(void (doris::vectorized::TimeSharingTaskExecutor::*&)(), doris::vectorized::TimeSharingTaskExecutor*&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:98
26# void std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>::__call<void, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/functional:515
27# void std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>::operator()<, void>() at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/functional:600
28# void std::__invoke_impl<void, std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>&>(std::__invoke_other, std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:63
29# std::enable_if<is_invocable_r_v<void, std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>&>, void>::type std::__invoke_r<void, std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>&>(std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:119
30# std::_Function_handler<void (), std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()> >::_M_invoke(std::_Any_data const&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
31# std::function<void ()>::operator()() const at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:593
32# doris::Thread::supervise_thread(void*) at /mnt/disk2/zhangsida/branch-40/be/src/util/thread.cpp:460
33# start_thread in /lib64/libc.so.6
34# clone3 in /lib64/libc.so.6

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@zhangstar333 zhangstar333 changed the title from "update" to "[bug](arrow) fix arrow type change cause read data coredump" Apr 21, 2026
@zhangstar333
Contributor Author

/review

Contributor

@github-actions github-actions Bot left a comment

Findings:

  1. be/src/core/data_type_serde/data_type_string_serde.cpp: the new int32 path uses BinaryArray::value_offset(), which dereferences raw_value_offsets_ directly. That reintroduces the misaligned-offset UB/crash that the existing ArrowMemNotAligned* tests were guarding against.
  2. be/src/format/arrow/arrow_block_convertor.cpp: widening every utf8 field to large_utf8 is not safe today. Several serdes that map to Arrow utf8 (JSONB, VARIANT, DATE/DATETIME, IPV6, LARGEINT, etc.) still hard-cast the builder to arrow::StringBuilder, so they will fail as soon as a large column crosses the new threshold.
  3. be/src/format/arrow/arrow_block_convertor.cpp: returning a per-batch widened schema breaks Arrow Flight's fixed-schema contract. ArrowFlightBatchReaderBase::schema() still exposes the original schema fetched from the result buffer, and Arrow's RecordBatchStream rejects later batches whose schema differs.
  4. be/src/format/arrow/arrow_block_convertor.cpp: the ranged overload now constructs the RecordBatch with _block.rows() instead of actual_rows, producing an invalid batch whenever a caller requests a subrange.

Critical checkpoints:

  • Goal / correctness: The goal is to safely read Arrow string/binary data when the backing offsets are actually int64. The receiver-side idea is valid, but the current patch does not accomplish it safely end to end because it regresses misaligned-offset handling and introduces sender/streaming regressions.
  • Minimality / focus: Mostly focused, but the sender-side widening change broadens behavior beyond the original bug and impacts unrelated utf8-mapped types.
  • Concurrency: No new concurrency or locking risks found in these paths.
  • Lifecycle / static init: No special lifecycle or static-initialization issues found.
  • Config changes: None.
  • Compatibility / parallel paths: Not preserved. Arrow Flight still advertises the original schema, while this patch can emit widened batch schemas; parallel utf8 writer paths in other serdes were not updated for LargeStringBuilder.
  • Conditional checks: The new offset-width heuristic is understandable, but the int32 branch removed the previous byte-wise offset loads that made misaligned Arrow buffers safe.
  • Test coverage: Added unit tests cover the int64-offset mismatch, but they do not cover the widened-schema sender path, Arrow Flight schema handling, or the ranged convert_to_arrow_batch path.
  • Test results: Not run in this review environment.
  • Observability: No additional observability appears necessary for this scope.
  • Transaction / persistence / data writes / FE-BE variable passing: Not applicable.
  • Performance: No blocking performance issue beyond the correctness regressions above.
  • Other issues: The returned RecordBatch row count regression in the ranged overload should be fixed before merge.

Requesting changes.

} else {
for (auto offset_i = start; offset_i < end; ++offset_i) {
if (!concrete_array->IsNull(offset_i)) {
const int32_t start_offset = concrete_array->value_offset(offset_i);

The previous code used memcpy here so wrapped Arrow buffers did not need aligned int32_t access. arrow::BinaryArray::value_offset() / value_length() dereference raw_value_offsets_[i] directly, so this change reopens the misaligned-read UB that the existing ArrowMemNotAligned and ArrowMemNotAlignedNestedArr tests are meant to catch. Any StringArray built from a misaligned offsets buffer (for example arrow::Buffer::Wrap(ptr + 1, ...)) will hit undefined behavior again. Please keep the byte-wise offset loads on this int32 path instead of switching back to value_offset().
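
A minimal sketch of such a byte-wise load, independent of Arrow. The helper name `load_offset_unaligned` is hypothetical and is not taken from the Doris code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: load the index-th int32 offset from a byte buffer that
// may not be 4-byte aligned. std::memcpy has no alignment requirement, so this
// avoids the undefined behavior of dereferencing a misaligned int32_t*.
inline int32_t load_offset_unaligned(const uint8_t* offsets_bytes, int64_t index) {
    assert(offsets_bytes != nullptr && index >= 0);
    int32_t value;
    std::memcpy(&value, offsets_bytes + static_cast<size_t>(index) * sizeof(int32_t),
                sizeof(value));
    return value;
}
```

The length of value i is then `load_offset_unaligned(p, i + 1) - load_offset_unaligned(p, i)`, which stays well defined even when `p` points one byte past an aligned address. Compilers typically lower a fixed-size memcpy like this to a plain load on platforms that tolerate unaligned access.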

_cur_type = _block.get_by_position(idx).type;
auto column = _cur_col->convert_to_full_column_if_const();
auto arrow_type = _schema->field(idx)->type();
if (arrow_type->name() == "utf8" && column->byte_size() >= MAX_ARROW_UTF8) {

This promotion is too broad. convert_to_arrow_type() maps several non-string Doris types to Arrow utf8 (JSONB, VARIANT, DATE/DATETIME, IPV6, LARGEINT, ...), but their write_column_to_arrow() implementations still hard-cast array_builder to arrow::StringBuilder (for example data_type_jsonb_serde.cpp, data_type_variant_serde.cpp, data_type_date_or_datetime_serde.cpp, data_type_ipv6_serde.cpp, data_type_number_serde.cpp). Once one of those columns crosses the 2 GiB threshold, this branch will hand it a LargeStringBuilder and the serialization path will fail. The widening either needs to be limited to serdes that already support LargeStringBuilder, or all utf8 writers need to be updated together.
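
One possible shape for the narrower check, as a sketch only. `ColumnInfo`, `writer_supports_large_builder`, `kMaxUtf8Bytes`, and `should_promote_to_large_utf8` are hypothetical names, and the threshold is assumed to be the int32 offset limit rather than the actual value of MAX_ARROW_UTF8:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: utf8 stores offsets as int32, so a single array can address at
// most INT32_MAX bytes of string data. The real MAX_ARROW_UTF8 may differ.
constexpr int64_t kMaxUtf8Bytes = std::numeric_limits<int32_t>::max();

// Hypothetical per-column summary used to decide on promotion.
struct ColumnInfo {
    bool arrow_type_is_utf8;             // schema field type is utf8
    bool writer_supports_large_builder;  // serde can write into a large-offset builder
    int64_t byte_size;                   // total bytes held by the column
};

// Promote only when all three conditions hold, so serdes that still hard-cast
// to the 32-bit string builder are never handed a large builder.
inline bool should_promote_to_large_utf8(const ColumnInfo& col) {
    assert(col.byte_size >= 0);
    return col.arrow_type_is_utf8 && col.writer_supports_large_builder &&
           col.byte_size >= kMaxUtf8Bytes;
}
```

This sketch only decides whether to promote. What to do with an oversized column whose writer cannot take a large builder (return an error, split the batch, or something else) is left open here.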

}
}
*out = arrow::RecordBatch::Make(_schema, actual_rows, std::move(_arrays));
auto schema = schema_changed ? arrow::schema(std::move(fields)) : _schema;

Changing only the returned RecordBatch schema is still not enough for Arrow Flight. ArrowFlightBatchReaderBase::schema() returns the original schema fetched from ArrowFlightResultBlockBuffer, and arrow::flight::RecordBatchStream initializes its IPC writer once with that fixed schema. Later WriteRecordBatch(*batch) checks batch.schema()->Equals(schema_, false) and rejects mismatches, so the first oversized utf8 batch here will fail with Tried to write record batch with different schema instead of being streamed. The source buffer / reader schema has to be widened consistently as well, not just the per-batch object.
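
A toy model of that fixed-schema contract, not the Arrow Flight API. `Schema` and `FixedSchemaStream` are hypothetical stand-ins that only mimic the "schema is set once, later batches must match" behavior described above:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a schema: one Arrow type name per field.
using Schema = std::vector<std::string>;

// Toy stream that fixes its schema at construction, the way the comment above
// describes the IPC writer being initialized once.
class FixedSchemaStream {
public:
    explicit FixedSchemaStream(Schema schema) : schema_(std::move(schema)) {
        assert(!schema_.empty());
    }

    // Returns false and writes nothing when the batch schema differs from the
    // schema the stream was opened with.
    bool write_batch(const Schema& batch_schema) {
        if (batch_schema != schema_) {
            return false;
        }
        ++batches_written_;
        return true;
    }

    int batches_written() const { return batches_written_; }

private:
    Schema schema_;
    int batches_written_ = 0;
};
```

In this model a stream opened with {"int32", "utf8"} accepts matching batches but rejects one whose second field was widened to "large_utf8", which is why the widening would have to be applied where the stream's schema is produced as well.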

}
*out = arrow::RecordBatch::Make(_schema, actual_rows, std::move(_arrays));
auto schema = schema_changed ? arrow::schema(std::move(fields)) : _schema;
*out = arrow::RecordBatch::Make(std::move(schema), _block.rows(), std::move(_arrays));

The 7-argument overload computes actual_rows, but this constructor now uses _block.rows() instead. When a caller requests a subrange (start_row, end_row), each column array is built with actual_rows elements while the RecordBatch advertises the full block row count, leaving an internally inconsistent batch. This should stay actual_rows.
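
A small sketch of the invariant at stake, using toy types rather than Arrow. `ToyBatch`, `rows_in_range`, and `is_consistent` are hypothetical, and the half-open range [start_row, end_row) is an assumption about how actual_rows is computed:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for a record batch: the advertised row count plus the length
// of each column array.
struct ToyBatch {
    int64_t num_rows;
    std::vector<int64_t> column_lengths;
};

// Assumption: the requested subrange is half-open, [start_row, end_row).
inline int64_t rows_in_range(int64_t start_row, int64_t end_row) {
    assert(0 <= start_row && start_row <= end_row);
    return end_row - start_row;
}

// A batch is internally consistent only if every column has num_rows elements.
inline bool is_consistent(const ToyBatch& batch) {
    return std::all_of(batch.column_lengths.begin(), batch.column_lengths.end(),
                       [&](int64_t len) { return len == batch.num_rows; });
}
```

For a 100-row block and the subrange [10, 30), each column holds 20 elements, so a batch advertising 20 rows is consistent while one advertising 100 rows is not.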

@zhangstar333
Contributor Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 95.56% (43/45) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.29% (20354/38193)
Line Coverage 36.84% (191781/520537)
Region Coverage 33.16% (149188/449912)
Branch Coverage 34.26% (65230/190386)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 95.56% (43/45) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.76% (26841/37406)
Line Coverage 55.01% (285488/518940)
Region Coverage 52.12% (236642/454069)
Branch Coverage 53.47% (102110/190959)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 95.56% (43/45) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.76% (26841/37406)
Line Coverage 55.01% (285491/518940)
Region Coverage 52.11% (236633/454069)
Branch Coverage 53.47% (102104/190959)

@zhangstar333
Contributor Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 94.87% (37/39) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.21% (26646/37421)
Line Coverage 53.64% (278370/518957)
Region Coverage 47.04% (214114/455174)
Branch Coverage 50.36% (96924/192470)

@zhangstar333
Contributor Author

run buildall

@subhash-arcana

Hi @zhangstar333, I was wondering whether we also need to rebuild the schema on the write path when the large_utf8 change happens, in be/src/util/arrow/block_convertor.cpp. I am attaching a patch for review:
large_utf8.patch

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 77.91% (67/86) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.71% (27630/37484)
Line Coverage 57.47% (298861/519988)
Region Coverage 54.59% (248364/454949)
Branch Coverage 56.22% (107589/191358)

@zhangstar333
Contributor Author

> Hi @zhangstar333, I was wondering whether we also need to rebuild the schema on the write path when the large_utf8 change happens, in be/src/util/arrow/block_convertor.cpp. I am attaching a patch for review: large_utf8.patch

@subhash-arcana
5f01461#diff-c39041117265c90e18e794846a18235d92b1b3089f41bb14360baa583558090cR111
I changed the schema in the first commit, and the reviewer left some comments on it; you could have a look.
Changing the schema may be useless, because the Arrow stream reads the schema only once.

@zhangstar333
Contributor Author

run buildall
