
cherry-pick branch-4.1 : [Improve](Variant) Keep first duplicate Variant JSON path #63697

Open. eldenmoon wants to merge 1 commit into apache:branch-4.0 from eldenmoon:codex/pick-63082-branch-4.0


@eldenmoon
Member

@eldenmoon eldenmoon commented May 26, 2026

Pick #63082 to branch-4.0.

### What problem does this PR solve?

Issue Number: None

Related PR: apache#63082

Problem Summary: Pick apache#63082 to branch-4.0. Add disabled-by-default duplicate Variant JSON path checking so Variant JSON parsing can keep the first value for duplicate normalized leaf paths and avoid inconsistent subcolumns during load/parsing. Adapt the implementation to branch-4.0's vec/json parser and parse2column flow.

### Release note

Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate Variant JSON leaf path when enabled. The default value is false.

### Check List (For Author)

- Test: Unit Test
    - Unit Test: ./run-be-ut.sh --run --filter='JsonParserTest.*DuplicateJsonPath*:SchemaUtilTest.TestParseVariantColumnsDuplicateJsonPathCheck'
    - Format: PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH ./build-support/clang-format.sh
    - Static check: git diff --cached --check
    - Regression test: Not run (no local ASAN output cluster available in the fresh branch-4.0 checkout)
- Behavior changed: Yes. When variant_enable_duplicate_json_path_check is enabled, duplicate normalized Variant JSON leaf paths keep the first parsed value. The default value is false.
- Does this need documentation: No
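
The keep-first behavior described above can be illustrated with a minimal sketch. This is not the code from the PR: `PathValue` and `keep_first_per_path` are hypothetical names, and the sketch assumes the leaf paths have already been normalized to plain strings.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical illustration only, not the Doris implementation.
// Each entry is a (normalized leaf path, value) pair in parse order.
using PathValue = std::pair<std::string, std::string>;

// Keeps the first value seen for each path and drops later duplicates,
// preserving the order in which paths first appeared.
std::vector<PathValue> keep_first_per_path(const std::vector<PathValue>& entries) {
    std::unordered_set<std::string> seen;
    std::vector<PathValue> result;
    for (const auto& entry : entries) {
        // insert() returns {iterator, bool}; the bool is false for a repeat.
        if (seen.insert(entry.first).second) {
            result.push_back(entry);
        }
    }
    // Every surviving entry corresponds to exactly one distinct path.
    assert(result.size() == seen.size());
    return result;
}
```

Under these assumptions, the input `("a.b", "1"), ("c", "x"), ("a.b", "2")` would yield `("a.b", "1"), ("c", "x")`: the second `a.b` is dropped instead of replacing the first.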

(cherry picked from commit f563b2e)
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@eldenmoon
Member Author

run buildall

@eldenmoon eldenmoon changed the title from "[fix](be) Keep first duplicate Variant JSON path" to "cherry-pick branch-4.1 : [Improve](Variant) Keep first duplicate Variant JSON path" on May 26, 2026
@eldenmoon eldenmoon marked this pull request as ready for review May 27, 2026 01:27
Copilot AI review requested due to automatic review settings May 27, 2026 01:27
Contributor

Copilot AI left a comment


Pull request overview

This cherry-pick introduces an optional BE-side behavior change for Variant JSON parsing: when enabled, duplicate JSON paths (including normalization between dotted keys and nested objects) keep the first encountered value instead of erroring/overwriting.

Changes:

  • Add BE config variant_enable_duplicate_json_path_check and plumb it into Variant parsing/segment writers.
  • Enhance the JSON parser and Variant subcolumn materialization to de-duplicate normalized paths (keep-first semantics).
  • Add unit + regression coverage for duplicate-path scenarios (SQL insert + stream load + compaction).
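
The normalization between dotted keys and nested objects mentioned in the overview can be sketched as follows. This is a toy model under stated assumptions, not the parser in be/src/vec/json: `JsonNode`, `leaf`, `object`, and `flatten_keep_first` are hypothetical names, values are modelled as strings, and a node with no members is treated as a leaf, so empty objects are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical toy JSON tree; a node with no members is treated as a leaf.
struct JsonNode {
    std::string scalar;              // value when the node is a leaf
    std::vector<std::string> keys;   // member names, in document order
    std::vector<JsonNode> children;  // member values, parallel to keys
    bool is_leaf() const { return keys.empty(); }
};

using Member = std::pair<std::string, JsonNode>;
using PathValue = std::pair<std::string, std::string>;

JsonNode leaf(std::string value) {
    JsonNode n;
    n.scalar = std::move(value);
    return n;
}

JsonNode object(std::vector<Member> members) {
    JsonNode n;
    for (auto& m : members) {
        n.keys.push_back(std::move(m.first));
        n.children.push_back(std::move(m.second));
    }
    assert(n.keys.size() == n.children.size());
    return n;
}

// Depth-first walk that joins keys with '.', so the dotted key "a.b" and the
// nested object {"a": {"b": ...}} produce the same path. Only the first value
// recorded for a path is kept.
void flatten_keep_first(const JsonNode& node, const std::string& prefix,
                        std::unordered_set<std::string>& seen,
                        std::vector<PathValue>& out) {
    if (node.is_leaf()) {
        if (seen.insert(prefix).second) {
            out.emplace_back(prefix, node.scalar);
        }
        return;
    }
    for (std::size_t i = 0; i < node.keys.size(); ++i) {
        const std::string path =
                prefix.empty() ? node.keys[i] : prefix + "." + node.keys[i];
        flatten_keep_first(node.children[i], path, seen, out);
    }
}

std::vector<PathValue> flatten(const JsonNode& root) {
    std::unordered_set<std::string> seen;
    std::vector<PathValue> out;
    flatten_keep_first(root, "", seen, out);
    return out;
}
```

For a document shaped like `{"a.b": 1, "a": {"b": 2}, "c": 3}`, this sketch records `a.b` once with the value from the dotted key, because that member is visited first, and then records `c`.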

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| regression-test/suites/variant_p0/duplicate_json_path.groovy | New regression suite validating keep-first behavior across inserts/stream load and after compaction. |
| regression-test/data/variant_p0/duplicate_json_path.json | Stream load data covering duplicate key/path edge cases. |
| be/test/vec/jsonb/json_parser_test.cpp | Unit tests for keep-first behavior and dotted-vs-nested normalization at the JSON parser level. |
| be/test/vec/common/schema_util_test.cpp | Unit test ensuring Variant subcolumn parsing keeps the first value for duplicate paths. |
| be/src/vec/json/parse2column.cpp | Normalize “plain” paths when duplicate-path check is enabled so dotted/nested map to the same Variant subcolumns. |
| be/src/vec/json/json_parser.h | Extend parse config/context to support duplicate-path checking. |
| be/src/vec/json/json_parser.cpp | Implement keep-first de-duplication during traversal via appendValueIfNotDuplicate. |
| be/src/vec/data_types/serde/data_type_variant_serde.cpp | Apply the new BE config when deserializing Variant from JSON. |
| be/src/olap/rowset/segment_v2/vertical_segment_writer.cpp | Pass duplicate-path config into Variant parsing during segment writing. |
| be/src/olap/rowset/segment_v2/segment_writer.cpp | Pass duplicate-path config into Variant parsing during segment writing. |
| be/src/olap/rowset/segment_creator.cpp | Pass duplicate-path config into Variant parsing during segment flushing. |
| be/src/common/config.h | Declare variant_enable_duplicate_json_path_check. |
| be/src/common/config.cpp | Define default value for variant_enable_duplicate_json_path_check. |


}
if (subcolumn->size() != old_num_rows) {
throw doris::Exception(ErrorCode::INVALID_ARGUMENT,
"subcolumn {} size missmatched, may contains duplicated entry",
@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 95.45% (63/66) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 71.36% (25455/35673) |
| Line Coverage | 54.23% (269759/497449) |
| Region Coverage | 51.71% (223025/431324) |
| Branch Coverage | 53.15% (95992/180589) |
