[bug](arrow) fix arrow type change cause read data coredump by zhangstar333 · Pull Request #62681 · apache/doris

zhangstar333 · 2026-04-21T12:46:16Z

What problem does this PR solve?

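Arrow's large_utf8 type uses OffsetType = Int64Type, while utf8() uses OffsetType = Int32Type.

So if the Arrow type changes from utf8 to large_utf8, the reader must read the Arrow offset data as int64 instead of int32. Otherwise it may compute a negative offset or length, which leads to the crash below.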
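
The width mismatch described above can be illustrated with a minimal sketch. It does not use the Arrow library: `StringLayout` and `value_at` are hypothetical names introduced here, and the offsets buffer is assumed to be suitably aligned for its integer type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for the two Arrow string layouts discussed above.
enum class StringLayout {
    Utf8,      // offsets are int32
    LargeUtf8  // offsets are int64
};

// Return value `i` given offsets of type OffsetT.
// Assumes `offsets` is aligned for OffsetT and holds at least i + 2 entries.
template <typename OffsetT>
std::string_view value_at(const OffsetT* offsets, const char* data, int64_t i) {
    const OffsetT begin = offsets[i];
    const OffsetT end = offsets[i + 1];
    assert(begin >= 0 && end >= begin);
    return std::string_view(data + begin, static_cast<size_t>(end - begin));
}

// Pick the offset width from the layout instead of always assuming int32.
inline std::string_view value_at(StringLayout layout, const void* offsets,
                                 const char* data, int64_t i) {
    if (layout == StringLayout::LargeUtf8) {
        return value_at(static_cast<const int64_t*>(offsets), data, i);
    }
    return value_at(static_cast<const int32_t*>(offsets), data, i);
}
```

Reading an int64 offsets buffer through the int32 branch would pair unrelated halves of adjacent offsets, which is how a bogus length can end up in the memcpy call.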

*** SIGSEGV invalid permissions for mapped object (@0x14a185d7c000) received by PID 589278 (TID 592608 OR 0x14a2d1447640) from PID 18446744071660093440; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /mnt/disk2/zhangsida/branch-40/be/src/common/signal_handler.h:420
 1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /mnt/disk2/zhangsida/install_data/jdk-17.0.2/lib/server/libjvm.so
 2# JVM_handle_linux_signal in /mnt/disk2/zhangsida/install_data/jdk-17.0.2/lib/server/libjvm.so
 3# 0x000014AAA263FC30 in /lib64/libc.so.6
 4# inline_memcpy(void*, void const*, unsigned long) at /mnt/disk2/zhangsida/branch-40/be/src/glibc-compatibility/memcpy/memcpy_x86_64.cpp:201
 5# memcpy at /mnt/disk2/zhangsida/branch-40/be/src/glibc-compatibility/memcpy/memcpy_x86_64.cpp:219
 6# doris::vectorized::ColumnStr<unsigned int>::insert_data(char const*, unsigned long) in /mnt/disk2/zhangsida/branch-40/output/be/lib/doris_be
 7# doris::vectorized::DataTypeStringSerDeBase<doris::vectorized::ColumnStr<unsigned int> >::read_column_from_arrow(doris::vectorized::IColumn&, arrow::Array const*, long, long, cctz::time_zone const&) const at /mnt/disk2/zhangsida/branch-40/be/src/vec/data_types/serde/data_type_string_serde.cpp:278
 8# doris::vectorized::DataTypeNullableSerDe::read_column_from_arrow(doris::vectorized::IColumn&, arrow::Array const*, long, long, cctz::time_zone const&) const at /mnt/disk2/zhangsida/branch-40/be/src/vec/data_types/serde/data_type_nullable_serde.cpp:351
 9# doris::vectorized::RemoteDorisReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/format/table/remote_doris_reader.cpp:91
10# doris::vectorized::FileScanner::_get_block_wrapped(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/file_scanner.cpp:468
11# doris::vectorized::FileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/file_scanner.cpp:405
12# doris::vectorized::Scanner::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner.cpp:110
13# doris::vectorized::Scanner::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner.cpp:83
14# doris::vectorized::ScannerScheduler::_scanner_scan(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner_scheduler.cpp:178
15# doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner_scheduler.cpp:76
16# doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}::operator()() const at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner_scheduler.cpp:75
17# bool std::__invoke_impl<bool, doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}&>(std::__invoke_other, doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:63
18# std::enable_if<is_invocable_r_v<bool, doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}&>, bool>::type std::__invoke_r<bool, doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}&>(doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:116
19# std::_Function_handler<bool (), doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_0::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
20# std::function<bool ()>::operator()() const at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:593
21# doris::vectorized::ScannerSplitRunner::process_for(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/scan/scanner_scheduler.cpp:419
22# doris::vectorized::PrioritizedSplitRunner::process() at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/executor/time_sharing/prioritized_split_runner.cpp:104
23# doris::vectorized::TimeSharingTaskExecutor::_dispatch_thread() at /mnt/disk2/zhangsida/branch-40/be/src/vec/exec/executor/time_sharing/time_sharing_task_executor.cpp:566
24# void std::__invoke_impl<void, void (doris::vectorized::TimeSharingTaskExecutor::*&)(), doris::vectorized::TimeSharingTaskExecutor*&>(std::__invoke_memfun_deref, void (doris::vectorized::TimeSharingTaskExecutor::*&)(), doris::vectorized::TimeSharingTaskExecutor*&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:76
25# std::__invoke_result<void (doris::vectorized::TimeSharingTaskExecutor::*&)(), doris::vectorized::TimeSharingTaskExecutor*&>::type std::__invoke<void (doris::vectorized::TimeSharingTaskExecutor::*&)(), doris::vectorized::TimeSharingTaskExecutor*&>(void (doris::vectorized::TimeSharingTaskExecutor::*&)(), doris::vectorized::TimeSharingTaskExecutor*&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:98
26# void std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>::__call<void, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/functional:515
27# void std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>::operator()<, void>() at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/functional:600
28# void std::__invoke_impl<void, std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>&>(std::__invoke_other, std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:63
29# std::enable_if<is_invocable_r_v<void, std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>&>, void>::type std::__invoke_r<void, std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>&>(std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()>&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:119
30# std::_Function_handler<void (), std::_Bind<void (doris::vectorized::TimeSharingTaskExecutor::*(doris::vectorized::TimeSharingTaskExecutor*))()> >::_M_invoke(std::_Any_data const&) at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
31# std::function<void ()>::operator()() const at /mnt/disk2/zhangsida/install_data/ldb_toolchain_025/ldb_toolchain/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:593
32# doris::Thread::supervise_thread(void*) at /mnt/disk2/zhangsida/branch-40/be/src/util/thread.cpp:460
33# start_thread in /lib64/libc.so.6
34# clone3 in /lib64/libc.so.6

Problem Summary:

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

hello-stephen · 2026-04-21T12:46:22Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

zhangstar333 · 2026-04-21T12:48:59Z

/review

github-actions

Findings:

be/src/core/data_type_serde/data_type_string_serde.cpp: the new int32 path uses BinaryArray::value_offset(), which dereferences raw_value_offsets_ directly. That reintroduces the misaligned-offset UB/crash that the existing ArrowMemNotAligned* tests were guarding against.
be/src/format/arrow/arrow_block_convertor.cpp: widening every utf8 field to large_utf8 is not safe today. Several serdes that map to Arrow utf8 (JSONB, VARIANT, DATE/DATETIME, IPV6, LARGEINT, etc.) still hard-cast the builder to arrow::StringBuilder, so they will fail as soon as a large column crosses the new threshold.
be/src/format/arrow/arrow_block_convertor.cpp: returning a per-batch widened schema breaks Arrow Flight's fixed-schema contract. ArrowFlightBatchReaderBase::schema() still exposes the original schema fetched from the result buffer, and Arrow's RecordBatchStream rejects later batches whose schema differs.
be/src/format/arrow/arrow_block_convertor.cpp: the ranged overload now constructs the RecordBatch with _block.rows() instead of actual_rows, producing an invalid batch whenever a caller requests a subrange.

Critical checkpoints:

Goal / correctness: The goal is to safely read Arrow string/binary data when the backing offsets are actually int64. The receiver-side idea is valid, but the current patch does not accomplish it safely end to end because it regresses misaligned-offset handling and introduces sender/streaming regressions.
Minimality / focus: Mostly focused, but the sender-side widening change broadens behavior beyond the original bug and impacts unrelated utf8-mapped types.
Concurrency: No new concurrency or locking risks found in these paths.
Lifecycle / static init: No special lifecycle or static-initialization issues found.
Config changes: None.
Compatibility / parallel paths: Not preserved. Arrow Flight still advertises the original schema, while this patch can emit widened batch schemas; parallel utf8 writer paths in other serdes were not updated for LargeStringBuilder.
Conditional checks: The new offset-width heuristic is understandable, but the int32 branch removed the previous byte-wise offset loads that made misaligned Arrow buffers safe.
Test coverage: Added unit tests cover the int64-offset mismatch, but they do not cover the widened-schema sender path, Arrow Flight schema handling, or the ranged convert_to_arrow_batch path.
Test results: Not run in this review environment.
Observability: No additional observability appears necessary for this scope.
Transaction / persistence / data writes / FE-BE variable passing: Not applicable.
Performance: No blocking performance issue beyond the correctness regressions above.
Other issues: The returned RecordBatch row count regression in the ranged overload should be fixed before merge.

Requesting changes.

github-actions · 2026-04-21T13:19:51Z

+        } else {
+            for (auto offset_i = start; offset_i < end; ++offset_i) {
+                if (!concrete_array->IsNull(offset_i)) {
+                    const int32_t start_offset = concrete_array->value_offset(offset_i);


The previous code used memcpy here so wrapped Arrow buffers did not need aligned int32_t access. arrow::BinaryArray::value_offset() / value_length() dereference raw_value_offsets_[i] directly, so this change reopens the misaligned-read UB that the existing ArrowMemNotAligned and ArrowMemNotAlignedNestedArr tests are meant to catch. Any StringArray built from a misaligned offsets buffer (for example arrow::Buffer::Wrap(ptr + 1, ...)) will hit undefined behavior again. Please keep the byte-wise offset loads on this int32 path instead of switching back to value_offset().
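
A byte-wise offset load of the kind this comment asks to keep could look roughly like the sketch below. The helper names `load_offset32` and `make_misaligned` are hypothetical and are not taken from the Doris source; the second helper only mimics an offsets buffer wrapped at `ptr + 1`.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Read the i-th int32 offset from a byte buffer without requiring alignment.
// memcpy into a local avoids dereferencing a possibly misaligned int32_t*.
inline int32_t load_offset32(const uint8_t* offsets, int64_t i) {
    int32_t value;
    std::memcpy(&value, offsets + i * static_cast<int64_t>(sizeof(int32_t)),
                sizeof(value));
    return value;
}

// Helper for experiments: copy `values` one byte into fresh storage, so that
// `storage.data() + 1` behaves like a misaligned offsets buffer.
inline std::vector<uint8_t> make_misaligned(const std::vector<int32_t>& values) {
    std::vector<uint8_t> storage(1 + values.size() * sizeof(int32_t));
    std::memcpy(storage.data() + 1, values.data(),
                values.size() * sizeof(int32_t));
    return storage;
}
```

Compilers typically lower a fixed-size memcpy like this to a plain load on targets that allow unaligned access, so the safety usually costs little.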

github-actions · 2026-04-21T13:19:51Z

        _cur_type = _block.get_by_position(idx).type;
        auto column = _cur_col->convert_to_full_column_if_const();
        auto arrow_type = _schema->field(idx)->type();
        if (arrow_type->name() == "utf8" && column->byte_size() >= MAX_ARROW_UTF8) {


This promotion is too broad. convert_to_arrow_type() maps several non-string Doris types to Arrow utf8 (JSONB, VARIANT, DATE/DATETIME, IPV6, LARGEINT, ...), but their write_column_to_arrow() implementations still hard-cast array_builder to arrow::StringBuilder (for example data_type_jsonb_serde.cpp, data_type_variant_serde.cpp, data_type_date_or_datetime_serde.cpp, data_type_ipv6_serde.cpp, data_type_number_serde.cpp). Once one of those columns crosses the 2 GiB threshold, this branch will hand it a LargeStringBuilder and the serialization path will fail. The widening either needs to be limited to serdes that already support LargeStringBuilder, or all utf8 writers need to be updated together.
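
One way to picture the narrower promotion suggested here is the sketch below. Everything in it is an assumption made for illustration: the `LogicalType` labels, the `writer_supports_large` predicate, and the exact limit (the value of MAX_ARROW_UTF8 is not shown in this thread, so INT32_MAX is used as the largest byte count that int32 offsets can address).

```cpp
#include <cstdint>
#include <limits>

// Hypothetical labels for some of the Doris types named in the review.
enum class LogicalType { String, Jsonb, Variant, DateTime, Ipv6, LargeInt };

enum class ArrowStringKind { Utf8, LargeUtf8 };

// utf8 offsets are int32, so one array can address at most INT32_MAX bytes.
constexpr int64_t kMaxUtf8Bytes = std::numeric_limits<int32_t>::max();

// Assumption for this sketch: only plain strings have a writer that accepts
// a large string builder; every other type still expects the int32 builder.
constexpr bool writer_supports_large(LogicalType t) {
    return t == LogicalType::String;
}

// Widen only when the data does not fit AND the writer can handle it.
// Returns false when neither layout is usable so the caller can report an
// error instead of handing the writer a builder it cannot cast.
inline bool choose_string_kind(LogicalType t, int64_t byte_size,
                               ArrowStringKind* out) {
    if (byte_size <= kMaxUtf8Bytes) {
        *out = ArrowStringKind::Utf8;
        return true;
    }
    if (writer_supports_large(t)) {
        *out = ArrowStringKind::LargeUtf8;
        return true;
    }
    return false;
}
```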

github-actions · 2026-04-21T13:19:51Z

        }
    }
-    *out = arrow::RecordBatch::Make(_schema, actual_rows, std::move(_arrays));
+    auto schema = schema_changed ? arrow::schema(std::move(fields)) : _schema;


Changing only the returned RecordBatch schema is still not enough for Arrow Flight. ArrowFlightBatchReaderBase::schema() returns the original schema fetched from ArrowFlightResultBlockBuffer, and arrow::flight::RecordBatchStream initializes its IPC writer once with that fixed schema. Later, WriteRecordBatch(*batch) checks batch.schema()->Equals(schema_, false) and rejects mismatches, so the first oversized utf8 batch here will fail with "Tried to write record batch with different schema" instead of being streamed. The source buffer / reader schema has to be widened consistently as well, not just the per-batch object.
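
The fixed-schema behavior described in this comment can be modeled with a toy writer. The types below (`Field`, `Schema`, `Batch`, `FixedSchemaWriter`) are simplified stand-ins invented for this sketch and are not Arrow classes; they only mirror the idea that the schema is captured once and every later batch is compared against it.

```cpp
#include <string>
#include <utility>
#include <vector>

// Minimal stand-ins for a schema and a batch; these are not Arrow types.
struct Field {
    std::string name;
    std::string type;
    bool operator==(const Field& other) const {
        return name == other.name && type == other.type;
    }
};
using Schema = std::vector<Field>;

struct Batch {
    Schema schema;
};

// A writer that fixes its schema at construction and rejects anything else.
class FixedSchemaWriter {
public:
    explicit FixedSchemaWriter(Schema schema) : schema_(std::move(schema)) {}

    // Returns false for a batch whose schema differs from the initial one.
    bool write(const Batch& batch) {
        if (!(batch.schema == schema_)) {
            return false;
        }
        ++written_;
        return true;
    }

    int written() const { return written_; }

private:
    Schema schema_;
    int written_ = 0;
};
```

In this model, a batch whose column was widened from utf8 to large_utf8 is rejected even though its data is valid, which matches the failure mode the comment predicts.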

github-actions · 2026-04-21T13:19:51Z

    }
-    *out = arrow::RecordBatch::Make(_schema, actual_rows, std::move(_arrays));
+    auto schema = schema_changed ? arrow::schema(std::move(fields)) : _schema;
+    *out = arrow::RecordBatch::Make(std::move(schema), _block.rows(), std::move(_arrays));


The 7-argument overload computes actual_rows, but this constructor now uses _block.rows() instead. When a caller requests a subrange (start_row, end_row), each column array is built with actual_rows elements while the RecordBatch advertises the full block row count, leaving an internally inconsistent batch. This should stay actual_rows.
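
The inconsistency described here can be checked with a small sketch. The function names are hypothetical, and treating the requested range as half-open, [start_row, end_row), is an assumption of this example rather than something the thread states.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Rows covered by the range [start_row, end_row), clamped to the block.
// The half-open convention is an assumption made for this sketch.
inline int64_t actual_rows(int64_t block_rows, int64_t start_row,
                           int64_t end_row) {
    const int64_t begin = std::max<int64_t>(0, std::min(start_row, block_rows));
    const int64_t end = std::max(begin, std::min(end_row, block_rows));
    return end - begin;
}

// A batch is consistent only if every column has exactly `num_rows` elements.
inline bool batch_is_consistent(int64_t num_rows,
                                const std::vector<int64_t>& column_lengths) {
    for (int64_t length : column_lengths) {
        if (length != num_rows) {
            return false;
        }
    }
    return true;
}
```

With a 100-row block and the range 10 to 30, each column holds 20 elements, so a batch that advertises 100 rows fails the check while one that advertises 20 passes.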

zhangstar333 · 2026-04-21T14:49:25Z

run buildall

hello-stephen · 2026-04-21T16:10:17Z

BE UT Coverage Report

Increment line coverage 95.56% (43/45) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	53.29% (20354/38193)
Line Coverage	36.84% (191781/520537)
Region Coverage	33.16% (149188/449912)
Branch Coverage	34.26% (65230/190386)

hello-stephen · 2026-04-21T17:24:58Z

BE Regression && UT Coverage Report

Increment line coverage 95.56% (43/45) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	71.76% (26841/37406)
Line Coverage	55.01% (285488/518940)
Region Coverage	52.12% (236642/454069)
Branch Coverage	53.47% (102110/190959)

hello-stephen · 2026-04-22T04:49:00Z

BE Regression && UT Coverage Report

Increment line coverage 95.56% (43/45) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	71.76% (26841/37406)
Line Coverage	55.01% (285491/518940)
Region Coverage	52.11% (236633/454069)
Branch Coverage	53.47% (102104/190959)

zhangstar333 · 2026-04-23T07:41:09Z

run buildall

hello-stephen · 2026-04-23T10:29:42Z

BE Regression && UT Coverage Report

Increment line coverage 94.87% (37/39) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	71.21% (26646/37421)
Line Coverage	53.64% (278370/518957)
Region Coverage	47.04% (214114/455174)
Branch Coverage	50.36% (96924/192470)

zhangstar333 · 2026-04-27T07:40:47Z

run buildall

subhash-arcana · 2026-04-27T08:36:20Z

Hi @zhangstar333, I was wondering whether we also need to rebuild the schema on the write side when the large_utf8 change happens, in be/src/util/arrow/block_convertor.cpp. I am attaching a patch for review:
large_utf8.patch

hello-stephen · 2026-04-27T09:51:53Z

BE Regression && UT Coverage Report

Increment line coverage 77.91% (67/86) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.71% (27630/37484)
Line Coverage	57.47% (298861/519988)
Region Coverage	54.59% (248364/454949)
Branch Coverage	56.22% (107589/191358)

zhangstar333 · 2026-04-29T02:53:23Z

Hi @zhangstar333, I was thinking if we would need to rebuild schema while writing as well when large_utf8 change happens? In be/src/util/arrow/block_convertor.cpp. adding the patch for review large_utf8.patch

@subhash-arcana
5f01461#diff-c39041117265c90e18e794846a18235d92b1b3089f41bb14360baa583558090cR111
I changed the schema in the first commit, and the reviewer left some comments on it; you could take a look.
Changing the schema may be useless, since the Arrow stream reads the schema only once.

zhangstar333 · 2026-05-08T03:38:08Z

run buildall

zhangstar333 changed the title ~~update~~ [bug](arrow) fix arrow type change cause read data coredump Apr 21, 2026

github-actions Bot previously requested changes Apr 21, 2026

zhangstar333 added 4 commits May 8, 2026 11:33

update

f93e551

update

a0ac73e

update

f8f75cf

update review

11400dd

zhangstar333 force-pushed the arrow_utf8 branch from d8f162e to 11400dd Compare May 8, 2026 03:37
