[fix](serde) Support large string arrow builder for variant serde#63718
[fix](serde) Support large string arrow builder for variant serde#63718eldenmoon wants to merge 1 commit into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
There was a problem hiding this comment.
Pull request overview
This PR fixes a BE crash during Parquet OUTFILE export of VARIANT columns by making DataTypeVariantSerDe::write_column_to_arrow handle both Arrow utf8 (arrow::StringBuilder) and large_utf8 (arrow::LargeStringBuilder) builders, and adds a unit test that exercises the LargeStringBuilder path.
Changes:
- Add Arrow builder type dispatch for VARIANT Arrow serialization (
STRINGvsLARGE_STRING) and return a clear error for unsupported Arrow types. - Refactor the per-row Arrow append logic into a shared templated helper.
- Add a BE unit test to reproduce and validate the
LargeStringBuilderserialization path.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| be/src/core/data_type_serde/data_type_variant_serde.cpp | Dispatch VARIANT Arrow serialization across StringBuilder and LargeStringBuilder, avoiding invalid builder casts. |
| be/test/core/data_type_serde/data_type_serde_test.cpp | Add UT coverage for VARIANT serialization into arrow::LargeStringBuilder. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| var.serialize_one_row_to_string(i, &serialized_value, options); | ||
| const auto serialized_size = | ||
| cast_set<typename BuilderType::offset_type>(serialized_value.size()); | ||
| RETURN_IF_ERROR(checkArrowStatus(builder.Append(serialized_value.data(), serialized_size), | ||
| column.get_name(), builder.type()->name())); |
Issue Number: close DORIS-24846
Related PR: #xxx
Problem Summary: Fix BE crash when serializing VARIANT columns through Arrow large string builders during Parquet OUTFILE export.
Fix BE crash when exporting VARIANT columns to Parquet OUTFILE with large Arrow string batches.
- Test: Unit Test
- `./run-be-ut.sh --run --filter='DataTypeSerDeTest.VariantWriteColumnToArrowSupportsLargeString'`
- `./run-be-ut.sh --run --filter='DataTypeSerDeTest.*'`
- `PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/check-format.sh`
- Behavior changed: Yes. VARIANT Arrow serialization now supports `large_utf8` builders instead of aborting on a bad builder cast.
- Does this need documentation: No
289c900 to
21648cc
Compare
|
run buildall |
TPC-H: Total hot run time: 31565 ms |
|
/review |
TPC-DS: Total hot run time: 172080 ms |
There was a problem hiding this comment.
Code review completed. I did not find additional blocking issues in this PR, and I did not duplicate the already-open inline thread about cast_set overflow in data_type_variant_serde.cpp.
Critical checkpoint conclusions:
- Goal/test: The PR targets the VARIANT-to-Arrow crash when the Arrow converter upgrades utf8 to large_utf8 for large batches. The updated SerDe now dispatches both
arrow::StringBuilderandarrow::LargeStringBuilder, and the added BE UT covers the LargeStringBuilder path. - Scope: The change is small and focused on
DataTypeVariantSerDe::write_column_to_arrow, with a localized helper mirroring the string SerDe pattern. - Concurrency/lifecycle: No new shared state, threads, locks, or special lifecycle/static initialization concerns were introduced.
- Configuration/compatibility: No config items, storage format changes, or FE/BE protocol changes were introduced.
- Parallel paths: The behavior now aligns with the existing string SerDe large_utf8 handling used by the same Arrow block conversion path.
- Error handling: Unsupported Arrow builder types now return
InvalidArgumentinstead of failing a bad cast. Existing review context already covers the cast-overflow concern, so I am not reopening it here. - Testing: A targeted BE unit test was added for the LargeStringBuilder path. I did not run tests in this review runner.
- Observability/performance: No additional observability appears necessary for this narrow serialization fix; per-row serialization behavior is unchanged from the previous implementation.
- User focus: No additional user-provided review focus was specified.
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Problem Summary:
DataTypeVariantSerDe::write_column_to_arrowalways cast the Arrow builder toarrow::StringBuilder. During Parquet OUTFILE export, the Arrow block converter can switch utf8 columns tolarge_utf8when a batch is large, which gives variant serialization anarrow::LargeStringBuilderand crashes BE on the bad cast.This patch handles both
arrow::StringBuilderandarrow::LargeStringBuilderfor VARIANT Arrow serialization and adds a BE UT that reproduces the LargeStringBuilder path.Release note
Fix BE crash when exporting VARIANT columns to Parquet OUTFILE with large Arrow string batches.
Check List (For Author)
./run-be-ut.sh --run --filter='DataTypeSerDeTest.VariantWriteColumnToArrowSupportsLargeString'./run-be-ut.sh --run --filter='DataTypeSerDeTest.*'PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/check-format.shlarge_utf8builders instead of aborting on a bad builder cast.Notes
build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN --base upstream/masterwas attempted. It is blocked by existing diagnostics in this path, includingcore/types.hunmatchedNOLINTENDand pre-existing modernize/readability findings indata_type_variant_serde.cpp/data_type_serde_test.cpp; the new signed/unsigned warning introduced while developing this patch was fixed before the final tests.