branch-4.1 cherry-pick [feat](format) support native format #61286

Merged
yiguolei merged 1 commit into apache:branch-4.1 from eldenmoon:4.1-native
Mar 13, 2026

@eldenmoon
Member

cherry-pick #58711

@eldenmoon eldenmoon requested a review from yiguolei as a code owner March 12, 2026 12:20
Copilot AI review requested due to automatic review settings March 12, 2026 12:20
@Thearas
Contributor

Thearas commented Mar 12, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull request overview

Cherry-picks upstream work to add Doris Native (“native”) file format support across FE/BE for export/outfile and scanning, with specific improvements for VARIANT schema properties.

Changes:

  • Add FORMAT_NATIVE to file format enums/properties and wire it through FE utilities and Nereids load scanning.
  • Implement BE native writer/reader (VNativeTransformer, NativeReader) and integrate with file result writer + scanner + internal service schema fetch.
  • Add regression and BE unit tests covering native round-trip and S3 OUTFILE/export flows.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| regression-test/suites/export_p0/test_export_variant_10k_columns.groovy | New regression test for exporting/loading wide VARIANT data using native format |
| regression-test/suites/export_p0/outfile/native/test_outfile_native.groovy | New OUTFILE-to-S3 + S3 TVF queryback regression test for native |
| regression-test/data/export_p0/outfile/native/test_outfile_native.out | Expected output for the OUTFILE native regression test |
| gensrc/thrift/PlanNodes.thrift | Adds FORMAT_NATIVE to TFileFormatType |
| fe/fe-core/src/main/java/org/apache/doris/tablefunction/ExternalFileTableValuedFunction.java | Preserves VARIANT properties when reconstructing types from external schema |
| fe/fe-core/src/main/java/org/apache/doris/nereids/load/NereidsLoadScanProvider.java | Treats native like arrow for using real column types during load |
| fe/fe-core/src/main/java/org/apache/doris/datasource/property/fileformat/NativeFileFormatProperties.java | New file format properties class for native |
| fe/fe-core/src/main/java/org/apache/doris/datasource/property/fileformat/FileFormatProperties.java | Registers native in file format factory |
| fe/fe-core/src/main/java/org/apache/doris/common/util/Util.java | Maps "native" format name to FORMAT_NATIVE |
| fe/fe-core/src/main/java/org/apache/doris/common/util/FileFormatConstants.java | Adds "native" constant |
| be/test/vec/exec/format/native/native_reader_writer_test.cpp | New BE UT suite covering native read/write, schema probing, and many types |
| be/src/vec/sink/writer/vfile_result_writer.cpp | Adds native writer support and format name mapping |
| be/src/vec/runtime/vnative_transformer.h / .cpp | Implements native file writer with header + block framing |
| be/src/vec/functions/cast/cast_to_variant.h | Refactors variant casting logic and improves wrapper behavior |
| be/src/vec/functions/cast/cast_to_jsonb.h | Comment typo fix |
| be/src/vec/exec/scan/file_scanner.cpp | Adds native reader integration and enables casting with FunctionContext |
| be/src/vec/exec/format/native/native_reader.h / .cpp | Implements native file reader with header validation + schema probing |
| be/src/vec/exec/format/native/native_format.h | Defines file header/constants and documents native layout |
| be/src/service/internal_service.cpp | Adds native format handling for fetch_table_schema |


Comment on lines +336 to +345
    buff->assign(len, '\0');
    Slice data_slice(buff->data(), len);
    bytes_read = 0;
    RETURN_IF_ERROR(_file_reader->read_at(_current_offset, data_slice, &bytes_read));
    if (bytes_read != len) {
        return Status::InternalError(
                "Failed to read native block body from file {}, expect {}, "
                "actual {}",
                _scan_range.path, len, bytes_read);
    }
Copilot AI Mar 12, 2026

NativeReader allocates len bytes directly from the file-provided block length without validating it against remaining file size (or any upper bound). A malformed/corrupt file can trigger huge allocations (OOM) before read_at fails. Add a validation before buff->assign(...), e.g. ensure len <= (_file_size - _current_offset) and optionally enforce a reasonable maximum native block size to protect BE.
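
A minimal sketch of the guard suggested above, assuming the reader already knows the file size and its current offset. `check_block_len` and `kMaxNativeBlockBytes` are hypothetical names, and the 1 GiB cap is an illustrative value rather than one taken from the PR:

```cpp
#include <cstdint>
#include <string>

// Hypothetical upper bound on one serialized block (illustrative value only).
constexpr uint64_t kMaxNativeBlockBytes = 1ULL << 30; // 1 GiB

// Returns an empty string when `len` is safe to allocate, otherwise a reason
// that the caller could wrap in an error status before touching the buffer.
std::string check_block_len(uint64_t len, uint64_t file_size, uint64_t current_offset) {
    if (current_offset > file_size) {
        return "current offset is past the end of the file";
    }
    // Compare against the remaining bytes instead of computing
    // current_offset + len, which could overflow for a corrupt length.
    const uint64_t remaining = file_size - current_offset;
    if (len > remaining) {
        return "block length " + std::to_string(len) + " exceeds the remaining " +
               std::to_string(remaining) + " bytes";
    }
    if (len > kMaxNativeBlockBytes) {
        return "block length " + std::to_string(len) + " exceeds the configured cap";
    }
    return {};
}
```

In the reader, a check of this shape would run before `buff->assign(len, '\0')`, so a corrupt length fails fast instead of driving a large allocation.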

Comment on lines +861 to +862
reader = vectorized::NativeReader::create_unique(profile.get(), params, range, &io_ctx,
nullptr);
Copilot AI Mar 12, 2026

NativeReader is constructed with RuntimeState* state = nullptr here, but NativeReader::init_reader() passes _state into FileFactory::get_reader_options(_state, ...). If get_reader_options (or other downstream logic) assumes non-null, this will crash during schema fetch. Pass a valid RuntimeState (if available in this RPC path) or update NativeReader/FileFactory usage to safely handle a null state for schema-only operations.

Suggested change
- reader = vectorized::NativeReader::create_unique(profile.get(), params, range, &io_ctx,
-         nullptr);
+ RuntimeState dummy_state;
+ reader = vectorized::NativeReader::create_unique(profile.get(), params, range, &io_ctx,
+         &dummy_state);

Comment on lines +41 to +44
// - Detects the optional file header by checking the first 8 bytes against DORIS_NATIVE_MAGIC.
// - If the header is present, it skips 12 bytes and then starts reading blocks as
// [uint64_t block_size][PBlock bytes]...
// - If the header is absent (legacy files), it starts reading blocks from offset 0.
Copilot AI Mar 12, 2026

The documentation says the header is optional and NativeReader can read legacy files without it, but validate_and_consume_header() currently requires the magic header and errors if it is absent. Either implement the documented backward-compatible behavior (detect header presence and fall back to offset 0) or update this comment to match the actual requirement.
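
If the backward-compatible option were chosen, header detection could look roughly like the sketch below. The magic bytes are a placeholder (the real `DORIS_NATIVE_MAGIC` value lives in native_format.h and is not shown in this thread), and `first_block_offset` is a hypothetical helper; only the 8-byte magic check and the 12-byte skip come from the quoted comment:

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Placeholder magic bytes; NOT the real DORIS_NATIVE_MAGIC value.
constexpr std::array<char, 8> kPlaceholderMagic = {'P', 'L', 'A', 'C', 'E', 'H', 'L', 'D'};
// Per the quoted comment, 12 bytes are skipped when the header is present.
constexpr size_t kHeaderBytes = 12;

// Returns the offset of the first [block_size][PBlock bytes] record:
// kHeaderBytes when the prefix starts with the magic, 0 for a legacy file.
size_t first_block_offset(const char* prefix, size_t prefix_len) {
    if (prefix_len >= kHeaderBytes &&
        std::memcmp(prefix, kPlaceholderMagic.data(), kPlaceholderMagic.size()) == 0) {
        return kHeaderBytes;
    }
    return 0;
}
```

One caveat with this approach: a legacy file whose first bytes happen to match the magic would be misread as having a header, which may be a reason to prefer requiring the header and updating the comment instead.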

Suggested change
- // - Detects the optional file header by checking the first 8 bytes against DORIS_NATIVE_MAGIC.
- // - If the header is present, it skips 12 bytes and then starts reading blocks as
- // [uint64_t block_size][PBlock bytes]...
- // - If the header is absent (legacy files), it starts reading blocks from offset 0.
+ // - Expects a file header at the beginning of the file and verifies the first 8 bytes
+ // against DORIS_NATIVE_MAGIC.
+ // - After validating the header, it skips 12 bytes and then starts reading blocks as
+ // [uint64_t block_size][PBlock bytes]...

Comment on lines +158 to +161
    def value_type = "VARIANT<PROPERTIES (\"variant_max_subcolumns_count\" = \"2048\")>"
    if (new Random().nextInt(2) == 0) {
        value_type = "text"
    }
Copilot AI Mar 12, 2026

This introduces randomness into a P0 regression test, which can make CI runs flaky and non-reproducible (schema differs between runs). Make the choice deterministic (e.g., run both cases explicitly in separate steps / separate suites, or base it on a fixed seed) so failures are reproducible.

Comment on lines +38 to +54
    def waiting_export = { export_label ->
        while (true) {
            def res = sql """ show export where label = "${export_label}" """
            logger.info("export state: " + res[0][2])
            if (res[0][2] == "FINISHED") {
                def json = parseJson(res[0][11])
                assert json instanceof List
                // assertEquals("1", json.fileNumber[0][0])
                log.info("outfile_path: ${json.url[0][0]}")
                return json.url[0][0];
            } else if (res[0][2] == "CANCELLED") {
                throw new IllegalStateException("""export failed: ${res[0][10]}""")
            } else {
                sleep(5000)
            }
        }
    }
Copilot AI Mar 12, 2026

The export wait loop has no timeout, so a stuck export can hang the suite indefinitely. Add a maximum wait time (or max iterations) and fail with a clear message when exceeded; alternatively reuse Awaitility here (similar to the load wait later) for consistent timeouts.
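
The bounded-wait pattern itself is language-independent. The sketch below shows it in C++ rather than Groovy, purely as an illustration: `wait_for_finished` and the `poll` callback are hypothetical stand-ins for the `show export` query, and the state strings mirror the ones in the quoted test:

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Polls until the job reports "FINISHED" or the deadline passes.
// Returns true on success and false on timeout; throws if the job is cancelled.
// `poll` is a hypothetical callback returning the current export state.
bool wait_for_finished(const std::function<std::string()>& poll,
                       std::chrono::milliseconds timeout,
                       std::chrono::milliseconds interval) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        const std::string state = poll();
        if (state == "FINISHED") {
            return true;
        }
        if (state == "CANCELLED") {
            throw std::runtime_error("export cancelled");
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false; // caller fails the test with a clear timeout message
        }
        std::this_thread::sleep_for(interval);
    }
}
```

The key design point is that the deadline is computed once from a monotonic clock, so a stuck export ends the loop after `timeout` regardless of how long each poll takes.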

Comment on lines +66 to +77
// Insert 10 rows of test data (the last row is all NULL)
StringBuilder sb = new StringBuilder()
int i = 1
for (; i < 10000; i ++) {
sb.append("""
(${i}, '2024-01-01', '2024-01-01 00:00:00', 's${i}', ${i}, ${i % 128}, true, ${i}.${i}),
""")
}
sb.append("""
(${i}, '2024-01-01', '2024-01-01 00:00:00', NULL, NULL, NULL, NULL, NULL)
""")
sql """ INSERT INTO ${tableName} VALUES ${sb.toString()} """
Copilot AI Mar 12, 2026

The comment says 'Insert 10 rows', but the loop inserts 9,999 rows plus a final row (10,000 total). Update the comment to match the actual data volume so future readers understand the test’s intent and runtime characteristics.

// specific language governing permissions and limitations
// under the License.

import org.codehaus.groovy.runtime.IOGroovyMethods
Copilot AI Mar 12, 2026

This import is not used anywhere in the file. Removing it reduces noise and avoids misleading readers about expected I/O helpers.

Suggested change
- import org.codehaus.groovy.runtime.IOGroovyMethods

Comment on lines +399 to +404
// Currently no predefined fields are carried in PTypeNode for VARIANT, so use empty list and default
// values for other properties.
type = new VariantType(new ArrayList<>(), maxSubcolumns,
/*enableTypedPathsToSparse*/ false,
/*variantMaxSparseColumnStatisticsSize*/ 10000,
/*variantSparseHashShardCount*/ 0);
Copilot AI Mar 12, 2026

This constructs VariantType with several hard-coded magic values (e.g. variantMaxSparseColumnStatisticsSize = 10000, shard count 0). If defaults/configs change elsewhere, ExternalFileTVF may diverge from normal VARIANT semantics. Prefer using a centralized factory/default-builder (or Config-backed defaults) so only variant_max_subcolumns_count is overridden here, and other properties stay consistent with the rest of FE/BE.

Suggested change
- // Currently no predefined fields are carried in PTypeNode for VARIANT, so use empty list and default
- // values for other properties.
- type = new VariantType(new ArrayList<>(), maxSubcolumns,
-         /*enableTypedPathsToSparse*/ false,
-         /*variantMaxSparseColumnStatisticsSize*/ 10000,
-         /*variantSparseHashShardCount*/ 0);
+ // Currently no predefined fields are carried in PTypeNode for VARIANT, so use empty list and
+ // Config-backed defaults for other properties to keep semantics consistent across FE/BE.
+ type = new VariantType(new ArrayList<>(), maxSubcolumns,
+         Config.enable_typed_paths_to_sparse,
+         Config.variant_max_sparse_column_statistics_size,
+         Config.variant_sparse_hash_shard_count);

Comment on lines +617 to +620
  DCHECK(_state != nullptr);
  auto ctx = FunctionContext::create_context(_state, {}, {});
  RETURN_IF_ERROR(
-         func_cast->execute(nullptr, *_src_block_ptr, {idx}, idx, arg.column->size()));
+         func_cast->execute(ctx.get(), *_src_block_ptr, {idx}, idx, arg.column->size()));
Copilot AI Mar 12, 2026

A new FunctionContext is created for every column cast inside the loop. This can add overhead for wide schemas (and native/arrow explicitly targets real typed columns). Consider creating a single FunctionContext once per _cast_to_input_block invocation and reusing it for all casts in the loop, unless specific casts require isolated contexts.
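
The reuse pattern can be sketched with stand-in types. `CastContext`, `cast_column`, and `cast_all` below are hypothetical and are not Doris's `FunctionContext` API; the sketch only shows building the context lazily, at most once per invocation, and sharing it across the loop:

```cpp
#include <memory>
#include <vector>

// Counts constructions so the reuse is observable (illustration only).
int g_contexts_created = 0;

// Stand-in for a per-call context that is costly to build.
struct CastContext {
    CastContext() { ++g_contexts_created; }
};

// Stand-in for a cast that needs a context; here it just doubles the value.
int cast_column(const CastContext& /*ctx*/, int value) {
    return value * 2;
}

// Casts every column, creating the context on first use and reusing it after.
std::vector<int> cast_all(const std::vector<int>& columns) {
    std::unique_ptr<CastContext> ctx; // built lazily, at most once per call
    std::vector<int> out;
    out.reserve(columns.size());
    for (int column : columns) {
        if (!ctx) {
            ctx = std::make_unique<CastContext>();
        }
        out.push_back(cast_column(*ctx, column));
    }
    return out;
}
```

Lazy creation keeps the no-cast path free of any context cost. Whether sharing is safe in the real code depends on whether any cast stores per-call state in the context, which this thread does not establish.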

@eldenmoon eldenmoon changed the title from "branch-4.1 cherry-pick [feat](variant) support native format" to "branch-4.1 cherry-pick [feat](format) support native format" Mar 12, 2026
@hello-stephen
Contributor

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.15% (1788/2259)
Line Coverage 64.47% (31930/49530)
Region Coverage 65.30% (15977/24468)
Branch Coverage 55.86% (8498/15214)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/19) 🎉
Increment coverage report
Complete coverage report

@eldenmoon
Member Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.15% (1788/2259)
Line Coverage 64.50% (31947/49530)
Region Coverage 65.34% (15987/24468)
Branch Coverage 55.88% (8501/15214)

@doris-robot

BE UT Coverage Report

Increment line coverage 65.01% (275/423) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.03% (19202/36209)
Line Coverage 36.24% (178998/493876)
Region Coverage 32.81% (138581/422357)
Branch Coverage 33.84% (60281/178124)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 81.04% (342/422) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.40% (25308/35444)
Line Coverage 54.03% (266344/492991)
Region Coverage 51.60% (220135/426631)
Branch Coverage 53.10% (94924/178767)

@yiguolei yiguolei merged commit c572db1 into apache:branch-4.1 Mar 13, 2026
21 of 25 checks passed