[feature](jsonb) add json_object_flatten scalar function#62825
[feature](jsonb) add json_object_flatten scalar function#62825eldenmoon merged 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
|
OpenCode automated review failed and did not complete. Error: Review step was failure (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
c319896 to
0230239
Compare
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
Findings
variant_flattencan dereference a nulltimezonewhen the variant contains typedTIMESTAMPTZvalues.FunctionVariantFlatten::execute_impl()creates a defaultFormatOptions, butDataTypeTimeStampTzSerDe::serialize_one_cell_to_json()requiresoptions.timezone. A typed variant path such asvariant<'ts':timestamptz>will hit this path during flattening.- Top-level empty or all-null arrays violate the advertised keep-arrays semantics.
serialize_one_row_flattened_to_string()gates onis_visible_root_value(), and that helper marks array roots like[]or[null]as invisible, so the function falls back to emitting{}instead of the original array.
Critical Checkpoints
- Goal: Add
variant_flattento emit a flat JSON string from aVARIANT. The current code does not fully accomplish that goal because the two cases above can either return the wrong result or crash. The added BE/FE tests cover nested objects and storage layouts, but they do not cover top-level empty/all-null arrays or temporal typed leaves. - Scope: The PR is otherwise focused, but the new helper in
ColumnVariantdiverges from existing variant/string serialization behavior in important edge cases. - Concurrency: No new shared-state, lock-order, or thread-safety concerns were introduced.
- Lifecycle / static init: No special lifecycle or static-initialization issues beyond normal function registration.
- Config: No new configuration items were added.
- Compatibility: No FE-BE protocol or storage-format compatibility issue is introduced by the patch itself.
- Parallel code paths: Existing string/JSON serialization paths already thread timezone context and preserve scalar root arrays;
variant_flattenshould match those behaviors. - Conditional checks: The
is_visible_root_value()gate is not valid for the documented keep-arrays behavior on root arrays. - Test coverage: Missing for top-level
[]/[null]and typed temporal leaves (especiallyTIMESTAMPTZ). - Test result updates: The new regression
.outfile looks consistent for the covered cases. - Observability: No additional observability changes appear necessary for this function.
- Transaction / persistence / data writes: Not applicable for this scalar-function change.
- FE/BE variable propagation: FE and BE registration/nullability propagation look aligned for the covered path.
- Performance: The implementation is reasonably focused; correctness is the blocking concern.
- Additional focus points: No additional user-provided review focus.
Please address the blocking correctness issues above before merge.
0230239 to
d698867
Compare
|
/review |
d698867 to
963e180
Compare
Add json_object_flatten(JSONB) -> JSONB. Turn a nested JSONB object into a
single-level JSONB whose keys are the dot-joined paths to each leaf, NiFi
FlattenJson "keep-arrays" semantics:
{"a":{"b":2}} -> {"a.b":2}
{"a":{"b":{"c":3}}} -> {"a.b.c":3}
{"a":[{"b":1}]} -> {"a":[{"b":1}]}
Only objects are walked; arrays / scalars / nulls / empty objects are kept
as opaque leaf values. Top-level non-object values pass through unchanged.
Nullability propagates from the input.
Implementation lives next to the existing sort_json_object_keys /
normalize_json_numbers_to_double under the FunctionJsonbTransform template
in function_jsonb_transform.cpp — a single recursive walk over JsonbValue
emitting flat keys via JsonbWriter; no parse / DOM / reparse round-trip.
For VARIANT data, callers cast to JSONB first (`json_object_flatten(cast(v
as jsonb))`); the regression suite covers both JSONB-column and VARIANT-via-
cast call sites. Inputs with array leaves are intentionally excluded from
the regression baseline because Doris's JSONB-to-JSON rendering may insert
or omit a space after array commas in different contexts; that surface is
covered exhaustively in the BE gtest using exact-equality assertions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
963e180 to
e29ff9d
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Found 2 issues:
be/src/exprs/function/function_jsonb_transform.cpp:json_object_flattenwrites synthesized dotted keys withstatic_cast<uint8_t>(prefix.size()). Even with the default 255-byte per-segment input limit, valid inputs like two 128-byte nested keys produce a 257-byte flattened path. This silently truncates the emitted JSONB key instead of preserving or rejecting it, so the function can return a corrupted object for valid input.fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/JsonObjectFlatten.java: the docstring tells users to calljson_object_flatten(to_json(v)), butto_jsondoes not acceptVARIANTin Nereids. The supported pattern in this snapshot isjson_object_flatten(cast(v as jsonb)).
Critical checkpoint conclusions:
- Goal / correctness: The PR’s goal is to add
json_object_flatten(JSONB) -> JSONB. FE/BE registration and the common flatten semantics are wired up, but the implementation is not correct for all valid inputs because overlong flattened paths are mishandled. - Scope / focus: The change is reasonably focused to one BE transform, FE registration, and tests.
- Concurrency: No new concurrent state or lock interaction is involved.
- Lifecycle / static initialization: No special lifecycle or static-init concerns found.
- Config: No new config is added. The existing JSON key-length limit still applies, but this new writer path bypasses the checked conversion that other JSONB-building code uses.
- Compatibility: No protocol, persisted metadata, or storage-format compatibility issue was introduced beyond the new scalar function being additive.
- Parallel code paths: The JSONB transform path is covered; no other sibling execution path looked missing in this snapshot.
- Special conditions: Top-level non-object pass-through and keep-arrays semantics are covered and looked consistent.
- Test coverage: Added regression and BE unit tests cover many ordinary cases, but they miss the >255-byte flattened-key case and therefore do not catch the blocker above.
- Test result files: The added
.outfile matches the covered regression queries. - Observability: No extra observability appears necessary for this scalar function.
- Transaction / persistence: Not involved.
- Data writes / atomicity: Not involved.
- FE-BE variable propagation: No new transmitted fields are involved.
- Performance: The walk is linear in the JSON tree and reuses one prefix buffer, so no obvious hot-path performance problem stood out.
- Other issues: No additional user-provided focus was supplied; no extra focus-specific issue was found.
|
|
||
| if (val->isObject() && val->unpack<ObjectVal>()->numElem() > 0) { | ||
| flatten_json_object_into(jsonb_writer, val->unpack<ObjectVal>(), prefix); | ||
| } else { |
There was a problem hiding this comment.
This write path is missing the checked key-length handling Doris already uses when building JSONB from parsed JSON. prefix is a synthesized dotted path, so it can exceed 255 bytes even when every input segment is individually valid under the default limit: for example, two nested 128-byte keys produce a 257-byte flat key. With static_cast<uint8_t> that silently wraps and only the truncated prefix bytes are written, so json_object_flatten returns a corrupted object instead of preserving or rejecting the key. Please validate the flattened key length here and use the checked cast path rather than a raw static_cast, and add a BE test for a >255-byte dotted key.
| * (NiFi FlattenJson "keep-arrays" semantics — arrays stay as opaque values): | ||
| * {"a":{"b":2}} -> {"a.b":2} | ||
| * {"a":[{"b":1}]} -> {"a":[{"b":1}]} | ||
| * To flatten a VARIANT, wrap it with `to_json`: json_object_flatten(to_json(v)). |
There was a problem hiding this comment.
This usage example does not match the actual FE signature rules in this snapshot. json_object_flatten can accept a VARIANT only via the JSONB cast path used in the regression test (cast(v as jsonb)), while to_json itself does not accept VARIANT in Nereids. Please update the doc comment so it does not advertise an unsupported call pattern.
There was a problem hiding this comment.
I found blocking correctness issues in the current implementation.
- : flattened dot-path keys can overflow JSONB's 1-byte key-length field and are silently truncated/corrupted instead of failing.
- : the FE contract around is inconsistent. The doc recommends even though Nereids does not accept , while the generic castable signature also lets through via implicit cast. That cast is lossy for empty-object/null-only paths, so the same logical JSON returns different answers in JSONB vs VARIANT storage.
Critical checkpoint conclusions:
- Goal / completeness: the PR wires the new JSONB function through FE and BE and adds tests, but it does not fully accomplish the goal because oversized flattened keys are not handled correctly and the supported contract is not coherent.
- Scope / focus: the change is reasonably small and focused.
- Concurrency: no new concurrent path or lock interaction is involved.
- Lifecycle / static init: no special lifecycle or static initialization issue found.
- Config / compatibility: no new config, persistence, or protocol compatibility change is introduced.
- Parallel paths: the direct JSONB path and the -via-cast path behave differently; that divergence needs an explicit fix or restriction.
- Special conditions: generated dot-path keys are not checked against JSONB key-length limits.
- Tests: the BE unit tests cover many normal JSONB cases, and there is a regression suite, but there is no coverage for oversized flattened keys. The new variant regression cases also expose lossy behavior rather than validating parity.
- Observability / persistence / transactions / data writes: not applicable for this patch.
- Performance: prefix reuse is reasonable; I did not find a separate performance blocker.
User focus: no additional user-provided focus.
There was a problem hiding this comment.
Corrected summary for my earlier request-changes review:
I still see two blocking issues.
be/src/exprs/function/function_jsonb_transform.cpp:json_object_flattenbuilds synthesized dotted keys and writes them with a rawstatic_cast<uint8_t>(prefix.size()). That can overflow the JSONB one-byte key-length field for valid nested inputs and silently truncate or corrupt the output instead of failing.fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/JsonObjectFlatten.java: the documentedVARIANTusage is inconsistent. The Javadoc advertisesjson_object_flatten(to_json(v)), butto_jsondoes not acceptVariantTypein Nereids. Meanwhilejson_object_flatten(v)is accepted through implicitVARIANT -> JSONBcast, and the new regression output already shows that path loses empty-object and null-only leaves compared with direct JSONB input.
Critical checkpoint conclusions:
- Goal and completeness: partially met; function is wired through FE and BE, but these two user-visible correctness issues remain.
- Scope and focus: the patch is small and focused.
- Concurrency: no new concurrency or locking risk found.
- Lifecycle and static initialization: no issue found.
- Config and compatibility: no new config, persistence, or protocol compatibility issue found.
- Parallel code paths: JSONB and VARIANT-via-cast do not currently preserve the same semantics.
- Tests: useful BE unit coverage and regression coverage were added, but there is no oversized-key coverage, and the variant regression currently records lossy behavior rather than validating parity.
- Observability, transactions, persistence: not applicable here.
- Performance: no separate blocker found.
User focus: no additional user-provided focus.
There was a problem hiding this comment.
I still see two blocking issues.
be/src/exprs/function/function_jsonb_transform.cpp:json_object_flattenbuilds synthesized dotted keys and writes them with a rawstatic_cast<uint8_t>(prefix.size()). That can overflow the JSONB one-byte key-length field for valid nested inputs and silently truncate or corrupt the output instead of failing.fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/JsonObjectFlatten.java: the documentedVARIANTusage is inconsistent. The Javadoc advertisesjson_object_flatten(to_json(v)), butto_jsondoes not acceptVariantTypein Nereids. Meanwhilejson_object_flatten(v)is accepted through implicitVARIANT -> JSONBcast, and the new regression output already shows that path loses empty-object and null-only leaves compared with direct JSONB input.
Critical checkpoint conclusions:
- Goal and completeness: partially met; function is wired through FE and BE, but these two user-visible correctness issues remain.
- Scope and focus: the patch is small and focused.
- Concurrency: no new concurrency or locking risk found.
- Lifecycle and static initialization: no issue found.
- Config and compatibility: no new config, persistence, or protocol compatibility issue found.
- Parallel code paths: JSONB and VARIANT-via-cast do not currently preserve the same semantics.
- Tests: useful BE unit coverage and regression coverage were added, but there is no oversized-key coverage, and the variant regression currently records lossy behavior rather than validating parity.
- Observability, transactions, persistence: not applicable here.
- Performance: no separate blocker found.
User focus: no additional user-provided focus.
FE UT Coverage ReportIncrement line coverage |
|
PR approved by at least one committer and no changes requested. |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
Introduce `json_object_flatten(JSONB) -> JSONB`. Turn a nested JSONB
object into a single-level JSONB whose keys are the dot-joined paths to
each leaf, mirroring NiFi `FlattenJson` *keep-arrays* semantics:
```
{"a":{"b":2}} -> {"a.b":2}
{"a":{"b":{"c":3}}} -> {"a.b.c":3}
{"a":[{"b":1}]} -> {"a":[{"b":1}]}
```
apache/doris-website#3589
Introduce
json_object_flatten(JSONB) -> JSONB. Turn a nested JSONB object into a single-level JSONB whose keys are the dot-joined paths to each leaf, mirroring NiFiFlattenJsonkeep-arrays semantics:apache/doris-website#3589
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)