[fix](function) deduplicate map keys after string-to-map cast#63713
[fix](function) deduplicate map keys after string-to-map cast#63713Mryange wants to merge 2 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
I found one correctness issue in the string-to-map cast normalization. The PR fixes top-level duplicate keys, but nested MAP values parsed from the same string still keep duplicate keys, so the behavior remains inconsistent for nested MAP types.
Critical checkpoint conclusions:
- Goal/test: The PR partially accomplishes the goal and tests the top-level case, but misses nested MAP values produced by the same string deserialization path.
- Scope: The change is focused and small.
- Concurrency/lifecycle: No new concurrency or special lifecycle concerns found.
- Config/compatibility/protocol: No new config, persisted format, or FE-BE protocol compatibility concerns found.
- Parallel paths: Table ingestion already uses recursive map deduplication; this cast path should match it for nested maps.
- Tests: Added regression coverage is useful but incomplete for nested MAP duplicate-key behavior.
- Observability/transactions/data writes: Not applicable to this expression-only change.
- Performance: No material performance issue found beyond the existing dedup cost.
Focus points: No additional user-provided review focus was present.
| result_column_name); | ||
| } | ||
|
|
||
| RETURN_IF_ERROR(map_column->deduplicate_keys()); |
There was a problem hiding this comment.
This only deduplicates the outer ColumnMap. DataTypeMapSerDe::_from_string() recursively deserializes MAP values but does not deduplicate them, so a cast like CAST('{"outer":{"a":1,"a":2}}' AS MAP<STRING, MAP<STRING, INT>>) will still leave duplicate keys in the nested map and element_at(..., 'a') can keep returning the first value there. That means the PR fixes the top-level repro but not the same inconsistency for nested MAP types. Please call deduplicate_keys(true) here (and in the non-nullable branch) and add a nested MAP regression case.
| RETURN_IF_ERROR(map_column->deduplicate_keys()); | |
| RETURN_IF_ERROR(map_column->deduplicate_keys(true)); |
|
/review |
There was a problem hiding this comment.
Review result: no additional blocking issues found in the latest PR head. The previously raised nested MAP concern is addressed by using deduplicate_keys(true) and adding the nested MAP regression case.
Critical checkpoint conclusions:
- Goal/test: The change targets duplicated keys when casting JSON strings to MAP. The implementation now deduplicates string-to-map cast results, including nested MAP values, and adds top-level and nested regression coverage.
- Scope: The code change is small and limited to the string-to-map wrapper plus regression expectations.
- Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, configuration, storage-format, or FE-BE protocol concerns identified.
- Parallel paths: The changed path is specifically string-to-map casts; map-to-map behavior was already deduplicating keys separately. Array/struct casts are unaffected.
- Error handling: The wrapped cast Status is propagated, and deduplication Status is checked.
- Tests: Added deterministic single-row qt_sql cases for cast output, size, element_at, and nested MAP behavior. I did not run the regression suite in this review environment.
- Observability/performance: No new observability needed. Deduplication adds expected work only on string-to-map cast output.
User focus: No additional user-provided review focus was specified.
|
run buildall |
TPC-H: Total hot run time: 31406 ms |
TPC-DS: Total hot run time: 171888 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run p0 |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Problem Summary:
Casting a JSON string with duplicated object keys to MAP kept all duplicated entries because the string-to-complex cast path returned the generic wrapper directly and skipped ColumnMap::deduplicate_keys(). This made string-to-map casts inconsistent with MAP constructor semantics where the last value wins.
Reproduction SQL:
Before this fix:
After this fix:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)