[fix](variant) materialize NG compaction regular paths#63104
[fix](variant) materialize NG compaction regular paths#63104eldenmoon wants to merge 3 commits intoapache:masterfrom
Conversation
NestedGroup compaction targets normally do not carry extracted subcolumns, so preserving existing target subcolumns can miss regular paths that sit beside NG arrays. Collect Variant extended info for NG roots, keep NG physical children owned by the NestedGroup writer, and materialize only regular non-NG paths through the existing data-types helper. Typed paths continue to use get_compaction_typed_columns(), while doc-mode and normal non-NG Variant flows keep their existing ordering. Also record materialized regular paths in sub_path_set and update schema util coverage for the new target-schema contract.
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
There was a problem hiding this comment.
Pull request overview
This PR adjusts how VARIANT NestedGroup (NG) compaction schemas are materialized so that regular (non-NG) sibling paths are derived from rowset metadata rather than relying on any pre-existing extracted subcolumns in the compaction target schema. It also tightens the “materialized regular path” bookkeeping to avoid downstream duplication during streaming compaction.
Changes:
- Rework NG compaction schema generation to materialize only regular non-NG paths (typed paths still handled via
get_compaction_typed_columns()). - Record materialized regular extracted paths into
PathsSetInfo::sub_path_setwhen materializing from data types. - Update/extend unit tests to reflect the new NG compaction target-schema contract and
sub_path_setbehavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| be/src/exec/common/variant_util.cpp | Refactors NG compaction schema building; filters regular paths outside NG and records materialized extracted paths in sub_path_set. |
| be/test/exec/common/schema_util_test.cpp | Updates NG compaction schema test expectations; adds assertions for sub_path_set population in data-types materialization. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // get_extended_compaction_schema rebuilds from the base columns. Real compaction targets do | ||
| // not carry pre-existing extracted columns, so NG compaction must rely on rowset metadata for | ||
| // regular paths and on get_compaction_typed_columns() for typed paths. | ||
| EXPECT_EQ(target_schema->num_columns(), 1); | ||
| const PathInData typed_path("v1.owner", true); | ||
| const auto typed_column_index = target_schema->field_index(typed_path); | ||
| ASSERT_NE(typed_column_index, -1); | ||
|
|
||
| const auto& preserved_typed_column = target_schema->column(typed_column_index); | ||
| EXPECT_TRUE(preserved_typed_column.path_info_ptr()->get_is_typed()); | ||
| EXPECT_EQ(preserved_typed_column.type(), FieldType::OLAP_FIELD_TYPE_STRING); | ||
| EXPECT_EQ(target_schema->field_index(typed_path), -1); | ||
|
|
||
| const auto* path_set_info = target_schema->try_path_set_info(1); | ||
| ASSERT_NE(path_set_info, nullptr); | ||
| ASSERT_TRUE(path_set_info->typed_path_set.contains("owner")); | ||
| EXPECT_EQ(path_set_info->typed_path_set.at("owner").indexes.size(), 1); | ||
| EXPECT_EQ(path_set_info->typed_path_set.at("owner").indexes[0]->index_name(), "v1_owner_idx"); | ||
| EXPECT_FALSE(path_set_info->typed_path_set.contains("owner")); |
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found one blocking regression in the compaction schema helper. The change removes typed-path registration from get_compaction_subcolumns_from_data_types(), so typed PathInData entries are appended as typed extracted columns but their metadata is stored in subcolumn_indexes instead of typed_path_set. This contradicts the existing unit-test expectations in get_compaction_subcolumns_from_data_types and can make later typed-column index lookup miss those indexes via TabletSchema::inverted_indexs(const TabletColumn&).
Critical checkpoint conclusions:
- Goal/test: the PR aims to rebuild NG compaction schemas from rowset metadata and materialize regular non-NG paths; the new behavior is partly covered, but the helper regression breaks an existing typed-path contract. The already-known missing positive NG test thread was not duplicated.
- Scope: the change is focused, but the helper behavior change is broader than the NG path filtering and is not safe as written.
- Concurrency/lifecycle: no new concurrent state, locks, threads, or special lifecycle management were introduced in the reviewed diff.
- Config/compatibility/protocol: no new config, FE-BE protocol, or storage-format fields are introduced.
- Parallel paths: both shared-nothing and cloud compaction call this schema builder; the helper regression can affect any path that passes typed
PathInDataentries toget_compaction_subcolumns_from_data_types(). - Tests/results: existing
SchemaUtilTest.get_compaction_subcolumns_from_data_typesshould fail or expose the metadata mismatch; NG positive coverage remains already raised in an existing thread. - Observability/transactions/persistence/data writes: no new observability, transaction, or persistence mechanism is involved; compaction schema correctness affects data/index materialization.
- Performance: no additional performance blocker found beyond the correctness issue.
Focus points: no additional user-provided review focus was specified.
ec35057 to
e50415a
Compare
|
/review |
There was a problem hiding this comment.
Blocking issue remains in the variant compaction schema changes: the existing inline thread on be/src/exec/common/variant_util.cpp:1244 identifies that typed paths are no longer preserved in typed_path_set by get_compaction_subcolumns_from_data_types(), which breaks the established typed-column/index lookup contract. I did not add a duplicate inline comment for that already-known thread.
Critical checkpoint conclusions:
- Goal/test: The goal appears to be materializing regular non-NG paths for NestedGroup compaction while leaving NG-owned paths to the streaming writer. The current code does not fully satisfy the existing typed-path contract; tests were adjusted around this behavior, but the existing review thread points out a regression against existing expectations.
- Scope: The change is mostly focused, but it changes a public helper contract beyond the NG-specific caller.
- Concurrency/lifecycle: No new concurrency, locking, or special lifecycle/static initialization concerns found in the reviewed diff.
- Config/compatibility: No new config or storage-format/protocol compatibility change found.
- Parallel paths: The helper is shared by NG and non-NG all-materialized compaction paths, which is the core risk noted in the existing comment.
- Conditional logic: The existing thread at variant_util.cpp:1303 covers unreachable/confusing NG branching; I did not duplicate it.
- Tests: Existing thread at schema_util_test.cpp:1237 covers missing positive NG regular-path materialization coverage; no additional user focus was provided.
- Observability: No additional observability requirement identified.
- Transaction/persistence/data writes: The change affects compaction output schema/data layout; no transaction or edit-log path was changed.
- FE/BE variables: No new transmitted variables.
- Performance: No clear new performance issue beyond correctness concerns.
User focus: review_focus.txt states that there is no additional user-provided focus, so the review covered the whole PR with no extra focus-specific findings.
|
run buildall |
TPC-H: Total hot run time: 29686 ms |
TPC-DS: Total hot run time: 172162 ms |
|
run buildall |
TPC-H: Total hot run time: 29793 ms |
TPC-H: Total hot run time: 29596 ms |
TPC-DS: Total hot run time: 171643 ms |
TPC-DS: Total hot run time: 171600 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
NestedGroup compaction targets normally do not carry extracted subcolumns, so preserving existing target subcolumns can miss regular paths that sit beside NG arrays.
Collect Variant extended info for NG roots, keep NG physical children owned by the NestedGroup writer, and materialize only regular non-NG paths through the existing data-types helper. Typed paths continue to use get_compaction_typed_columns(), while doc-mode and normal non-NG Variant flows keep their existing ordering.
Also record materialized regular paths in sub_path_set and update schema util coverage for the new target-schema contract.