cherry-pick branch-4.1 : [Improve](Variant) Keep first duplicate Variant JSON path #63697
Open
eldenmoon wants to merge 1 commit into
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63082

Problem Summary:

Pick apache#63082 to branch-4.0.

Add disabled-by-default duplicate Variant JSON path checking so that Variant JSON parsing can keep the first value for duplicate normalized leaf paths and avoid inconsistent subcolumns during load/parsing. Adapt the implementation to branch-4.0's vec/json parser and parse2column flow.

### Release note

Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate Variant JSON leaf path when enabled. The default value is false.

### Check List (For Author)

- Test: Unit Test
    - Unit Test: ./run-be-ut.sh --run --filter='JsonParserTest.*DuplicateJsonPath*:SchemaUtilTest.TestParseVariantColumnsDuplicateJsonPathCheck'
    - Format: PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH ./build-support/clang-format.sh
    - Static check: git diff --cached --check
    - Regression test: Not run (no local ASAN output cluster available in the fresh branch-4.0 checkout)
- Behavior changed: Yes. When variant_enable_duplicate_json_path_check is enabled, duplicate normalized Variant JSON leaf paths keep the first parsed value. The default value is false.
- Does this need documentation: No

(cherry picked from commit f563b2e)
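To illustrate what "keep the first value for duplicate normalized leaf paths" could mean, here is a minimal standalone sketch. It is not the Doris implementation: `Leaf`, `normalize_path`, and `keep_first` are hypothetical names, leaves are assumed to arrive already flattened in encounter order, and the pass-through behavior when the check is disabled is a simplification for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for one parsed leaf: the key segments as they appeared
// in the JSON text (a segment may itself contain dots) plus the raw value.
struct Leaf {
    std::vector<std::string> segments;
    std::string value;
};

// Join segments with '.', so {"a", "b"} (nested) and {"a.b"} (dotted key)
// both normalize to the same path "a.b".
std::string normalize_path(const std::vector<std::string>& segments) {
    std::string out;
    for (std::size_t i = 0; i < segments.size(); ++i) {
        if (i != 0) out += '.';
        out += segments[i];
    }
    return out;
}

// When the check is enabled, keep only the first value seen for each
// normalized path. When disabled, this sketch passes every leaf through.
std::vector<std::pair<std::string, std::string>> keep_first(
        const std::vector<Leaf>& leaves, bool check_enabled) {
    std::vector<std::pair<std::string, std::string>> result;
    std::unordered_set<std::string> seen;
    for (const Leaf& leaf : leaves) {
        std::string path = normalize_path(leaf.segments);
        // insert().second is false when the path was already recorded.
        if (check_enabled && !seen.insert(path).second) continue;
        result.emplace_back(std::move(path), leaf.value);
    }
    return result;
}
```

For an input equivalent to `{"a": {"b": 1}, "a.b": 2, "c": 3}`, this sketch keeps `a.b = 1` and `c = 3` when the check is on.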
Contributor
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
Member
Author
run buildall
Contributor
Pull request overview
This cherry-pick introduces an optional BE-side behavior change for Variant JSON parsing: when enabled, duplicate JSON paths (including normalization between dotted keys and nested objects) keep the first encountered value instead of erroring/overwriting.
Changes:
- Add BE config variant_enable_duplicate_json_path_check and plumb it into Variant parsing/segment writers.
- Enhance the JSON parser and Variant subcolumn materialization to de-duplicate normalized paths (keep-first semantics).
- Add unit + regression coverage for duplicate-path scenarios (SQL insert + stream load + compaction).
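The first two changes can be pictured with a rough sketch of how a process-wide flag might be copied into per-call parse options and consulted during traversal. Everything here is an assumption for illustration: `g_variant_enable_duplicate_json_path_check`, `ParseConfig`, `ParseContext`, `Node`, `traverse`, and `parse_members` are hypothetical names and do not reflect the actual types in json_parser.h.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical global standing in for the BE config flag (default false).
static bool g_variant_enable_duplicate_json_path_check = false;

// Hypothetical per-call options; the caller fills them from the global flag.
struct ParseConfig {
    bool check_duplicate_json_path = false;
};

ParseConfig config_from_flags() {
    ParseConfig config;
    config.check_duplicate_json_path = g_variant_enable_duplicate_json_path_check;
    return config;
}

// Minimal JSON-like object member: a scalar leaf when children is empty.
struct Node {
    std::string key;
    std::string scalar;
    std::vector<Node> children;  // ordered; duplicate keys are allowed
};

// Hypothetical traversal state carried through the recursion.
struct ParseContext {
    ParseConfig config;
    std::unordered_set<std::string> seen_paths;
    std::vector<std::pair<std::string, std::string>> leaves;
};

// Depth-first walk that records each leaf under its dotted path and, when the
// check is enabled, skips any leaf whose path was already recorded.
void traverse(const Node& node, const std::string& prefix, ParseContext& ctx) {
    const std::string path = prefix.empty() ? node.key : prefix + "." + node.key;
    if (node.children.empty()) {
        if (ctx.config.check_duplicate_json_path && !ctx.seen_paths.insert(path).second) {
            return;  // duplicate normalized path: the first value wins
        }
        ctx.leaves.emplace_back(path, node.scalar);
        return;
    }
    for (const Node& child : node.children) traverse(child, path, ctx);
}

std::vector<std::pair<std::string, std::string>> parse_members(
        const std::vector<Node>& top_level, const ParseConfig& config) {
    ParseContext ctx;
    ctx.config = config;
    for (const Node& member : top_level) traverse(member, "", ctx);
    return ctx.leaves;
}
```

Passing the flag through a per-call struct rather than reading the global inside the traversal keeps the parser testable with either setting, which is one plausible reason to plumb it this way.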
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| regression-test/suites/variant_p0/duplicate_json_path.groovy | New regression suite validating keep-first behavior across inserts/stream load and after compaction. |
| regression-test/data/variant_p0/duplicate_json_path.json | Stream load data covering duplicate key/path edge cases. |
| be/test/vec/jsonb/json_parser_test.cpp | Unit tests for keep-first behavior and dotted-vs-nested normalization at the JSON parser level. |
| be/test/vec/common/schema_util_test.cpp | Unit test ensuring Variant subcolumn parsing keeps the first value for duplicate paths. |
| be/src/vec/json/parse2column.cpp | Normalize “plain” paths when duplicate-path check is enabled so dotted/nested map to the same Variant subcolumns. |
| be/src/vec/json/json_parser.h | Extend parse config/context to support duplicate-path checking. |
| be/src/vec/json/json_parser.cpp | Implement keep-first de-duplication during traversal via appendValueIfNotDuplicate. |
| be/src/vec/data_types/serde/data_type_variant_serde.cpp | Apply the new BE config when deserializing Variant from JSON. |
| be/src/olap/rowset/segment_v2/vertical_segment_writer.cpp | Pass duplicate-path config into Variant parsing during segment writing. |
| be/src/olap/rowset/segment_v2/segment_writer.cpp | Pass duplicate-path config into Variant parsing during segment writing. |
| be/src/olap/rowset/segment_creator.cpp | Pass duplicate-path config into Variant parsing during segment flushing. |
| be/src/common/config.h | Declare variant_enable_duplicate_json_path_check. |
| be/src/common/config.cpp | Define default value for variant_enable_duplicate_json_path_check. |
    }
    if (subcolumn->size() != old_num_rows) {
        throw doris::Exception(ErrorCode::INVALID_ARGUMENT,
                               "subcolumn {} size missmatched, may contains duplicated entry",
Contributor
BE Regression && UT Coverage Report
Increment line coverage
Increment coverage report
Pick #63082 to branch-4.0.