branch-4.1 cherry-pick [feat](format) support native format #61286

yiguolei merged 1 commit into apache:branch-4.1 from
Conversation

Thank you for your contribution to Apache Doris. Please clearly describe your PR:
Pull request overview
Cherry-picks upstream work to add Doris Native (“native”) file format support across FE/BE for export/outfile and scanning, with specific improvements for VARIANT schema properties.
Changes:
- Add FORMAT_NATIVE to file format enums/properties and wire it through FE utilities and Nereids load scanning.
- Implement the BE native writer/reader (VNativeTransformer, NativeReader) and integrate them with the file result writer, the scanner, and the internal-service schema fetch.
- Add regression and BE unit tests covering the native round-trip and S3 OUTFILE/export flows.
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| regression-test/suites/export_p0/test_export_variant_10k_columns.groovy | New regression test for exporting/loading wide VARIANT data using native format |
| regression-test/suites/export_p0/outfile/native/test_outfile_native.groovy | New OUTFILE-to-S3 + S3 TVF queryback regression test for native |
| regression-test/data/export_p0/outfile/native/test_outfile_native.out | Expected output for the OUTFILE native regression test |
| gensrc/thrift/PlanNodes.thrift | Adds FORMAT_NATIVE to TFileFormatType |
| fe/fe-core/src/main/java/org/apache/doris/tablefunction/ExternalFileTableValuedFunction.java | Preserves VARIANT properties when reconstructing types from external schema |
| fe/fe-core/src/main/java/org/apache/doris/nereids/load/NereidsLoadScanProvider.java | Treats native like arrow for using real column types during load |
| fe/fe-core/src/main/java/org/apache/doris/datasource/property/fileformat/NativeFileFormatProperties.java | New file format properties class for native |
| fe/fe-core/src/main/java/org/apache/doris/datasource/property/fileformat/FileFormatProperties.java | Registers native in file format factory |
| fe/fe-core/src/main/java/org/apache/doris/common/util/Util.java | Maps "native" format name to FORMAT_NATIVE |
| fe/fe-core/src/main/java/org/apache/doris/common/util/FileFormatConstants.java | Adds "native" constant |
| be/test/vec/exec/format/native/native_reader_writer_test.cpp | New BE UT suite covering native read/write, schema probing, and many types |
| be/src/vec/sink/writer/vfile_result_writer.cpp | Adds native writer support and format name mapping |
| be/src/vec/runtime/vnative_transformer.h / .cpp | Implements native file writer with header + block framing |
| be/src/vec/functions/cast/cast_to_variant.h | Refactors variant casting logic and improves wrapper behavior |
| be/src/vec/functions/cast/cast_to_jsonb.h | Comment typo fix |
| be/src/vec/exec/scan/file_scanner.cpp | Adds native reader integration and enables casting with FunctionContext |
| be/src/vec/exec/format/native/native_reader.h / .cpp | Implements native file reader with header validation + schema probing |
| be/src/vec/exec/format/native/native_format.h | Defines file header/constants and documents native layout |
| be/src/service/internal_service.cpp | Adds native format handling for fetch_table_schema |
```cpp
buff->assign(len, '\0');
Slice data_slice(buff->data(), len);
bytes_read = 0;
RETURN_IF_ERROR(_file_reader->read_at(_current_offset, data_slice, &bytes_read));
if (bytes_read != len) {
    return Status::InternalError(
            "Failed to read native block body from file {}, expect {}, "
            "actual {}",
            _scan_range.path, len, bytes_read);
}
```
NativeReader allocates len bytes directly from the file-provided block length without validating it against remaining file size (or any upper bound). A malformed/corrupt file can trigger huge allocations (OOM) before read_at fails. Add a validation before buff->assign(...), e.g. ensure len <= (_file_size - _current_offset) and optionally enforce a reasonable maximum native block size to protect BE.
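The guard this comment asks for can be sketched in isolation as follows. This is a hedged, standalone sketch, not the Doris implementation: the function name `is_valid_block_len`, the parameter layout, and the 512 MiB cap are all illustrative assumptions; the real upper bound would presumably come from a BE config value.

```cpp
#include <cstdint>

// Illustrative upper bound on a single native block; in a real BE this would
// be a configurable limit, not a hard-coded constant.
constexpr uint64_t kMaxNativeBlockBytes = 512ULL * 1024 * 1024; // 512 MiB

// Returns true when a block of `len` bytes starting at `current_offset` can
// plausibly exist inside a file of `file_size` bytes. Checking this before
// buff->assign(len, ...) prevents huge allocations from a corrupt length.
bool is_valid_block_len(uint64_t len, uint64_t current_offset, uint64_t file_size) {
    if (current_offset > file_size) return false;       // offset already past EOF
    if (len > file_size - current_offset) return false; // block overruns the file
    if (len > kMaxNativeBlockBytes) return false;       // defend against corrupt lengths
    return true;
}
```

With such a check in place, a malformed length fails fast with a clear error instead of attempting an allocation that may OOM the BE.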
```cpp
reader = vectorized::NativeReader::create_unique(profile.get(), params, range, &io_ctx,
                                                 nullptr);
```
NativeReader is constructed with RuntimeState* state = nullptr here, but NativeReader::init_reader() passes _state into FileFactory::get_reader_options(_state, ...). If get_reader_options (or other downstream logic) assumes non-null, this will crash during schema fetch. Pass a valid RuntimeState (if available in this RPC path) or update NativeReader/FileFactory usage to safely handle a null state for schema-only operations.
Suggested change:

```diff
- reader = vectorized::NativeReader::create_unique(profile.get(), params, range, &io_ctx,
-                                                  nullptr);
+ RuntimeState dummy_state;
+ reader = vectorized::NativeReader::create_unique(profile.get(), params, range, &io_ctx,
+                                                  &dummy_state);
```
```cpp
// - Detects the optional file header by checking the first 8 bytes against DORIS_NATIVE_MAGIC.
// - If the header is present, it skips 12 bytes and then starts reading blocks as
//   [uint64_t block_size][PBlock bytes]...
// - If the header is absent (legacy files), it starts reading blocks from offset 0.
```
The documentation says the header is optional and NativeReader can read legacy files without it, but validate_and_consume_header() currently requires the magic header and errors if it is absent. Either implement the documented backward-compatible behavior (detect header presence and fall back to offset 0) or update this comment to match the actual requirement.
Suggested change:

```diff
- // - Detects the optional file header by checking the first 8 bytes against DORIS_NATIVE_MAGIC.
- // - If the header is present, it skips 12 bytes and then starts reading blocks as
- //   [uint64_t block_size][PBlock bytes]...
- // - If the header is absent (legacy files), it starts reading blocks from offset 0.
+ // - Expects a file header at the beginning of the file and verifies the first 8 bytes
+ //   against DORIS_NATIVE_MAGIC.
+ // - After validating the header, it skips 12 bytes and then starts reading blocks as
+ //   [uint64_t block_size][PBlock bytes]...
```
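If the documented backward-compatible behavior were implemented instead, the detection could look roughly like this standalone sketch. The magic bytes here are placeholders (not the real DORIS_NATIVE_MAGIC value), and `first_block_offset` is an illustrative helper, not the Doris API; the 12-byte header length mirrors the layout described in the comment (8-byte magic plus 4 trailing bytes).

```cpp
#include <cstdint>
#include <cstring>

// Placeholder 8-byte magic; the real constant is DORIS_NATIVE_MAGIC.
constexpr char kNativeMagic[8] = {'D', 'O', 'R', 'N', 'A', 'T', 'V', '1'};
constexpr uint64_t kHeaderBytes = 12; // 8-byte magic + 4 more header bytes, per the comment

// Returns the offset at which block framing starts: past the header when the
// magic matches, or 0 for legacy files written without a header.
uint64_t first_block_offset(const char* file_bytes, uint64_t file_size) {
    if (file_size >= kHeaderBytes &&
        std::memcmp(file_bytes, kNativeMagic, sizeof(kNativeMagic)) == 0) {
        return kHeaderBytes; // header present: skip it before reading blocks
    }
    return 0; // legacy file (or file too short to hold a header): blocks begin at offset 0
}
```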
```groovy
def value_type = "VARIANT<PROPERTIES (\"variant_max_subcolumns_count\" = \"2048\")>"
if (new Random().nextInt(2) == 0) {
    value_type = "text"
}
```
This introduces randomness into a P0 regression test, which can make CI runs flaky and non-reproducible (schema differs between runs). Make the choice deterministic (e.g., run both cases explicitly in separate steps / separate suites, or base it on a fixed seed) so failures are reproducible.
```groovy
def waiting_export = { export_label ->
    while (true) {
        def res = sql """ show export where label = "${export_label}" """
        logger.info("export state: " + res[0][2])
        if (res[0][2] == "FINISHED") {
            def json = parseJson(res[0][11])
            assert json instanceof List
            // assertEquals("1", json.fileNumber[0][0])
            log.info("outfile_path: ${json.url[0][0]}")
            return json.url[0][0];
        } else if (res[0][2] == "CANCELLED") {
            throw new IllegalStateException("""export failed: ${res[0][10]}""")
        } else {
            sleep(5000)
        }
    }
}
```
The export wait loop has no timeout, so a stuck export can hang the suite indefinitely. Add a maximum wait time (or max iterations) and fail with a clear message when exceeded; alternatively reuse Awaitility here (similar to the load wait later) for consistent timeouts.
```groovy
// Insert 10 rows of test data (the last row is all NULL)
StringBuilder sb = new StringBuilder()
int i = 1
for (; i < 10000; i ++) {
    sb.append("""
        (${i}, '2024-01-01', '2024-01-01 00:00:00', 's${i}', ${i}, ${i % 128}, true, ${i}.${i}),
    """)
}
sb.append("""
    (${i}, '2024-01-01', '2024-01-01 00:00:00', NULL, NULL, NULL, NULL, NULL)
""")
sql """ INSERT INTO ${tableName} VALUES ${sb.toString()} """
```
The comment says 'Insert 10 rows', but the loop inserts 9,999 rows plus a final row (10,000 total). Update the comment to match the actual data volume so future readers understand the test’s intent and runtime characteristics.
```groovy
// specific language governing permissions and limitations
// under the License.

import org.codehaus.groovy.runtime.IOGroovyMethods
```
This import is not used anywhere in the file. Removing it reduces noise and avoids misleading readers about expected I/O helpers.
Suggested change:

```diff
- import org.codehaus.groovy.runtime.IOGroovyMethods
```
```java
// Currently no predefined fields are carried in PTypeNode for VARIANT, so use empty list and default
// values for other properties.
type = new VariantType(new ArrayList<>(), maxSubcolumns,
        /*enableTypedPathsToSparse*/ false,
        /*variantMaxSparseColumnStatisticsSize*/ 10000,
        /*variantSparseHashShardCount*/ 0);
```
This constructs VariantType with several hard-coded magic values (e.g. variantMaxSparseColumnStatisticsSize = 10000, shard count 0). If defaults/configs change elsewhere, ExternalFileTVF may diverge from normal VARIANT semantics. Prefer using a centralized factory/default-builder (or Config-backed defaults) so only variant_max_subcolumns_count is overridden here, and other properties stay consistent with the rest of FE/BE.
Suggested change:

```diff
- // Currently no predefined fields are carried in PTypeNode for VARIANT, so use empty list and default
- // values for other properties.
- type = new VariantType(new ArrayList<>(), maxSubcolumns,
-         /*enableTypedPathsToSparse*/ false,
-         /*variantMaxSparseColumnStatisticsSize*/ 10000,
-         /*variantSparseHashShardCount*/ 0);
+ // Currently no predefined fields are carried in PTypeNode for VARIANT, so use empty list and
+ // Config-backed defaults for other properties to keep semantics consistent across FE/BE.
+ type = new VariantType(new ArrayList<>(), maxSubcolumns,
+         Config.enable_typed_paths_to_sparse,
+         Config.variant_max_sparse_column_statistics_size,
+         Config.variant_sparse_hash_shard_count);
```
```diff
+ DCHECK(_state != nullptr);
+ auto ctx = FunctionContext::create_context(_state, {}, {});
  RETURN_IF_ERROR(
-         func_cast->execute(nullptr, *_src_block_ptr, {idx}, idx, arg.column->size()));
+         func_cast->execute(ctx.get(), *_src_block_ptr, {idx}, idx, arg.column->size()));
```
A new FunctionContext is created for every column cast inside the loop. This can add overhead for wide schemas (and native/arrow explicitly targets real typed columns). Consider creating a single FunctionContext once per _cast_to_input_block invocation and reusing it for all casts in the loop, unless specific casts require isolated contexts.
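The suggested hoisting can be illustrated with a minimal stand-in. The `Context` type and the `make`/`cast` helpers below are hypothetical, not the Doris `FunctionContext` API; the point is only the shape of the fix: allocate the context once per invocation and reuse it across the per-column loop, instead of allocating inside the loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for FunctionContext; counts how many were created.
struct Context {
    static inline int allocations = 0;
    Context() { ++allocations; }
};

// Simulated per-column cast that needs a non-null context.
void cast_column(Context* ctx, std::size_t /*column_idx*/) { assert(ctx != nullptr); }

// Hoisted version: one context per invocation, reused for every column cast.
// Returns the number of contexts allocated (1, regardless of column count),
// versus num_columns allocations if the creation sat inside the loop.
int cast_all_columns(std::size_t num_columns) {
    Context::allocations = 0;
    auto ctx = std::make_unique<Context>(); // created once, outside the loop
    for (std::size_t idx = 0; idx < num_columns; ++idx) {
        cast_column(ctx.get(), idx);
    }
    return Context::allocations;
}
```

For a wide schema (e.g. the 10k-column VARIANT test in this PR), that turns thousands of context allocations per block into one, provided no individual cast requires an isolated context.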
run buildall
cherry-pick #58711