Skip to content

[feature](jsonb) add json_object_flatten scalar function#62825

Merged
eldenmoon merged 1 commit intoapache:masterfrom
csun5285:feat-variant-flatten
Apr 27, 2026
Merged

[feature](jsonb) add json_object_flatten scalar function#62825
eldenmoon merged 1 commit intoapache:masterfrom
csun5285:feat-variant-flatten

Conversation

@csun5285
Copy link
Copy Markdown
Contributor

@csun5285 csun5285 commented Apr 24, 2026

Introduce json_object_flatten(JSONB) -> JSONB. Turn a nested JSONB object into a single-level JSONB whose keys are the dot-joined paths to each leaf, mirroring NiFi FlattenJson keep-arrays semantics:

{"a":{"b":2}}        -> {"a.b":2}
{"a":{"b":{"c":3}}}  -> {"a.b.c":3}
{"a":[{"b":1}]}      -> {"a":[{"b":1}]}

apache/doris-website#3589

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 24, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@csun5285
Copy link
Copy Markdown
Contributor Author

run buildall

@csun5285
Copy link
Copy Markdown
Contributor Author

/review

@github-actions
Copy link
Copy Markdown
Contributor

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24890200960

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 94.00% (94/100) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.49% (26810/37501)
Line Coverage 53.87% (280122/519959)
Region Coverage 47.26% (215525/456005)
Branch Coverage 50.55% (97518/192903)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 28.89% (13/45) 🎉
Increment coverage report
Complete coverage report

@csun5285 csun5285 force-pushed the feat-variant-flatten branch from c319896 to 0230239 Compare April 25, 2026 02:49
@csun5285
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 94.00% (94/100) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.91% (27709/37490)
Line Coverage 57.64% (299763/520065)
Region Coverage 54.82% (249318/454824)
Branch Coverage 56.39% (107933/191404)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 81.25% (13/16) 🎉
Increment coverage report
Complete coverage report

@csun5285
Copy link
Copy Markdown
Contributor Author

/review

Comment thread be/src/core/column/column_variant.cpp Outdated
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  1. variant_flatten can dereference a null timezone when the variant contains typed TIMESTAMPTZ values. FunctionVariantFlatten::execute_impl() creates a default FormatOptions, but DataTypeTimeStampTzSerDe::serialize_one_cell_to_json() requires options.timezone. A typed variant path such as variant<'ts':timestamptz> will hit this path during flattening.
  2. Top-level empty or all-null arrays violate the advertised keep-arrays semantics. serialize_one_row_flattened_to_string() gates on is_visible_root_value(), and that helper marks array roots like [] or [null] as invisible, so the function falls back to emitting {} instead of the original array.

Critical Checkpoints

  • Goal: Add variant_flatten to emit a flat JSON string from a VARIANT. The current code does not fully accomplish that goal because the two cases above can either return the wrong result or crash. The added BE/FE tests cover nested objects and storage layouts, but they do not cover top-level empty/all-null arrays or temporal typed leaves.
  • Scope: The PR is otherwise focused, but the new helper in ColumnVariant diverges from existing variant/string serialization behavior in important edge cases.
  • Concurrency: No new shared-state, lock-order, or thread-safety concerns were introduced.
  • Lifecycle / static init: No special lifecycle or static-initialization issues beyond normal function registration.
  • Config: No new configuration items were added.
  • Compatibility: No FE-BE protocol or storage-format compatibility issue is introduced by the patch itself.
  • Parallel code paths: Existing string/JSON serialization paths already thread timezone context and preserve scalar root arrays; variant_flatten should match those behaviors.
  • Conditional checks: The is_visible_root_value() gate is not valid for the documented keep-arrays behavior on root arrays.
  • Test coverage: Missing for top-level [] / [null] and typed temporal leaves (especially TIMESTAMPTZ).
  • Test result updates: The new regression .out file looks consistent for the covered cases.
  • Observability: No additional observability changes appear necessary for this function.
  • Transaction / persistence / data writes: Not applicable for this scalar-function change.
  • FE/BE variable propagation: FE and BE registration/nullability propagation look aligned for the covered path.
  • Performance: The implementation is reasonably focused; correctness is the blocking concern.
  • Additional focus points: No additional user-provided review focus.

Please address the blocking correctness issues above before merge.

Comment thread be/src/core/column/column_variant.cpp Outdated
Comment thread be/src/exprs/function/function_variant_flatten.cpp Outdated
@csun5285 csun5285 force-pushed the feat-variant-flatten branch from 0230239 to d698867 Compare April 27, 2026 03:32
@eldenmoon
Copy link
Copy Markdown
Member

/review

@csun5285 csun5285 force-pushed the feat-variant-flatten branch from d698867 to 963e180 Compare April 27, 2026 03:51
@csun5285 csun5285 changed the title [feature](variant) add variant_flatten scalar function [feature](jsonb) add json_object_flatten scalar function Apr 27, 2026
Add json_object_flatten(JSONB) -> JSONB. Turn a nested JSONB object into a
single-level JSONB whose keys are the dot-joined paths to each leaf, NiFi
FlattenJson "keep-arrays" semantics:

  {"a":{"b":2}}        -> {"a.b":2}
  {"a":{"b":{"c":3}}}  -> {"a.b.c":3}
  {"a":[{"b":1}]}      -> {"a":[{"b":1}]}

Only objects are walked; arrays / scalars / nulls / empty objects are kept
as opaque leaf values. Top-level non-object values pass through unchanged.
Nullability propagates from the input.

Implementation lives next to the existing sort_json_object_keys /
normalize_json_numbers_to_double under the FunctionJsonbTransform template
in function_jsonb_transform.cpp — a single recursive walk over JsonbValue
emitting flat keys via JsonbWriter; no parse / DOM / reparse round-trip.

For VARIANT data, callers cast to JSONB first (`json_object_flatten(cast(v
as jsonb))`); the regression suite covers both JSONB-column and VARIANT-via-
cast call sites. Inputs with array leaves are intentionally excluded from
the regression baseline because Doris's JSONB-to-JSON rendering may insert
or omit a space after array commas in different contexts; that surface is
covered exhaustively in the BE gtest using exact-equality assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@csun5285 csun5285 force-pushed the feat-variant-flatten branch from 963e180 to e29ff9d Compare April 27, 2026 03:55
@csun5285
Copy link
Copy Markdown
Contributor Author

run buildall

@csun5285
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 2 issues:

  1. be/src/exprs/function/function_jsonb_transform.cpp: json_object_flatten writes synthesized dotted keys with static_cast<uint8_t>(prefix.size()). Even with the default 255-byte per-segment input limit, valid inputs like two 128-byte nested keys produce a 257-byte flattened path. This silently truncates the emitted JSONB key instead of preserving or rejecting it, so the function can return a corrupted object for valid input.
  2. fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/JsonObjectFlatten.java: the docstring tells users to call json_object_flatten(to_json(v)), but to_json does not accept VARIANT in Nereids. The supported pattern in this snapshot is json_object_flatten(cast(v as jsonb)).

Critical checkpoint conclusions:

  • Goal / correctness: The PR’s goal is to add json_object_flatten(JSONB) -> JSONB. FE/BE registration and the common flatten semantics are wired up, but the implementation is not correct for all valid inputs because overlong flattened paths are mishandled.
  • Scope / focus: The change is reasonably focused to one BE transform, FE registration, and tests.
  • Concurrency: No new concurrent state or lock interaction is involved.
  • Lifecycle / static initialization: No special lifecycle or static-init concerns found.
  • Config: No new config is added. The existing JSON key-length limit still applies, but this new writer path bypasses the checked conversion that other JSONB-building code uses.
  • Compatibility: No protocol, persisted metadata, or storage-format compatibility issue was introduced beyond the new scalar function being additive.
  • Parallel code paths: The JSONB transform path is covered; no other sibling execution path looked missing in this snapshot.
  • Special conditions: Top-level non-object pass-through and keep-arrays semantics are covered and looked consistent.
  • Test coverage: Added regression and BE unit tests cover many ordinary cases, but they miss the >255-byte flattened-key case and therefore do not catch the blocker above.
  • Test result files: The added .out file matches the covered regression queries.
  • Observability: No extra observability appears necessary for this scalar function.
  • Transaction / persistence: Not involved.
  • Data writes / atomicity: Not involved.
  • FE-BE variable propagation: No new transmitted fields are involved.
  • Performance: The walk is linear in the JSON tree and reuses one prefix buffer, so no obvious hot-path performance problem stood out.
  • Other issues: No additional user-provided focus was supplied; no extra focus-specific issue was found.


if (val->isObject() && val->unpack<ObjectVal>()->numElem() > 0) {
flatten_json_object_into(jsonb_writer, val->unpack<ObjectVal>(), prefix);
} else {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This write path is missing the checked key-length handling Doris already uses when building JSONB from parsed JSON. prefix is a synthesized dotted path, so it can exceed 255 bytes even when every input segment is individually valid under the default limit: for example, two nested 128-byte keys produce a 257-byte flat key. With static_cast<uint8_t> that silently wraps and only the truncated prefix bytes are written, so json_object_flatten returns a corrupted object instead of preserving or rejecting the key. Please validate the flattened key length here and use the checked cast path rather than a raw static_cast, and add a BE test for a >255-byte dotted key.

* (NiFi FlattenJson "keep-arrays" semantics — arrays stay as opaque values):
* {"a":{"b":2}} -> {"a.b":2}
* {"a":[{"b":1}]} -> {"a":[{"b":1}]}
* To flatten a VARIANT, wrap it with `to_json`: json_object_flatten(to_json(v)).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This usage example does not match the actual FE signature rules in this snapshot. json_object_flatten can accept a VARIANT only via the JSONB cast path used in the regression test (cast(v as jsonb)), while to_json itself does not accept VARIANT in Nereids. Please update the doc comment so it does not advertise an unsupported call pattern.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found blocking correctness issues in the current implementation.

  1. : flattened dot-path keys can overflow JSONB's 1-byte key-length field and are silently truncated/corrupted instead of failing.
  2. : the FE contract around is inconsistent. The doc recommends even though Nereids does not accept , while the generic castable signature also lets through via implicit cast. That cast is lossy for empty-object/null-only paths, so the same logical JSON returns different answers in JSONB vs VARIANT storage.

Critical checkpoint conclusions:

  • Goal / completeness: the PR wires the new JSONB function through FE and BE and adds tests, but it does not fully accomplish the goal because oversized flattened keys are not handled correctly and the supported contract is not coherent.
  • Scope / focus: the change is reasonably small and focused.
  • Concurrency: no new concurrent path or lock interaction is involved.
  • Lifecycle / static init: no special lifecycle or static initialization issue found.
  • Config / compatibility: no new config, persistence, or protocol compatibility change is introduced.
  • Parallel paths: the direct JSONB path and the -via-cast path behave differently; that divergence needs an explicit fix or restriction.
  • Special conditions: generated dot-path keys are not checked against JSONB key-length limits.
  • Tests: the BE unit tests cover many normal JSONB cases, and there is a regression suite, but there is no coverage for oversized flattened keys. The new variant regression cases also expose lossy behavior rather than validating parity.
  • Observability / persistence / transactions / data writes: not applicable for this patch.
  • Performance: prefix reuse is reasonable; I did not find a separate performance blocker.

User focus: no additional user-provided focus.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected summary for my earlier request-changes review:

I still see two blocking issues.

  1. be/src/exprs/function/function_jsonb_transform.cpp: json_object_flatten builds synthesized dotted keys and writes them with a raw static_cast<uint8_t>(prefix.size()). That can overflow the JSONB one-byte key-length field for valid nested inputs and silently truncate or corrupt the output instead of failing.
  2. fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/JsonObjectFlatten.java: the documented VARIANT usage is inconsistent. The Javadoc advertises json_object_flatten(to_json(v)), but to_json does not accept VariantType in Nereids. Meanwhile json_object_flatten(v) is accepted through implicit VARIANT -> JSONB cast, and the new regression output already shows that path loses empty-object and null-only leaves compared with direct JSONB input.

Critical checkpoint conclusions:

  • Goal and completeness: partially met; function is wired through FE and BE, but these two user-visible correctness issues remain.
  • Scope and focus: the patch is small and focused.
  • Concurrency: no new concurrency or locking risk found.
  • Lifecycle and static initialization: no issue found.
  • Config and compatibility: no new config, persistence, or protocol compatibility issue found.
  • Parallel code paths: JSONB and VARIANT-via-cast do not currently preserve the same semantics.
  • Tests: useful BE unit coverage and regression coverage were added, but there is no oversized-key coverage, and the variant regression currently records lossy behavior rather than validating parity.
  • Observability, transactions, persistence: not applicable here.
  • Performance: no separate blocker found.

User focus: no additional user-provided focus.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see two blocking issues.

  1. be/src/exprs/function/function_jsonb_transform.cpp: json_object_flatten builds synthesized dotted keys and writes them with a raw static_cast<uint8_t>(prefix.size()). That can overflow the JSONB one-byte key-length field for valid nested inputs and silently truncate or corrupt the output instead of failing.
  2. fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/JsonObjectFlatten.java: the documented VARIANT usage is inconsistent. The Javadoc advertises json_object_flatten(to_json(v)), but to_json does not accept VariantType in Nereids. Meanwhile json_object_flatten(v) is accepted through implicit VARIANT -> JSONB cast, and the new regression output already shows that path loses empty-object and null-only leaves compared with direct JSONB input.

Critical checkpoint conclusions:

  • Goal and completeness: partially met; function is wired through FE and BE, but these two user-visible correctness issues remain.
  • Scope and focus: the patch is small and focused.
  • Concurrency: no new concurrency or locking risk found.
  • Lifecycle and static initialization: no issue found.
  • Config and compatibility: no new config, persistence, or protocol compatibility issue found.
  • Parallel code paths: JSONB and VARIANT-via-cast do not currently preserve the same semantics.
  • Tests: useful BE unit coverage and regression coverage were added, but there is no oversized-key coverage, and the variant regression currently records lossy behavior rather than validating parity.
  • Observability, transactions, persistence: not applicable here.
  • Performance: no separate blocker found.

User focus: no additional user-provided focus.

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 8.33% (1/12) 🎉
Increment coverage report
Complete coverage report

Copy link
Copy Markdown
Member

@eldenmoon eldenmoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Apr 27, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 90.32% (28/31) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.34% (20417/38274)
Line Coverage 36.90% (192460/521559)
Region Coverage 33.20% (149662/450768)
Branch Coverage 34.34% (65518/190776)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (31/31) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.86% (27697/37499)
Line Coverage 57.61% (299712/520264)
Region Coverage 54.85% (249619/455094)
Branch Coverage 56.37% (107949/191506)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 91.67% (11/12) 🎉
Increment coverage report
Complete coverage report

@eldenmoon eldenmoon merged commit 18fceee into apache:master Apr 27, 2026
32 of 33 checks passed
@eldenmoon eldenmoon added dev/3.1.x usercase Important user case type label and removed dev/3.1.x labels Apr 27, 2026
github-actions Bot pushed a commit that referenced this pull request Apr 27, 2026
Introduce `json_object_flatten(JSONB) -> JSONB`. Turn a nested JSONB
object into a single-level JSONB whose keys are the dot-joined paths to
each leaf, mirroring NiFi `FlattenJson` *keep-arrays* semantics:

```
{"a":{"b":2}}        -> {"a.b":2}
{"a":{"b":{"c":3}}}  -> {"a.b.c":3}
{"a":[{"b":1}]}      -> {"a":[{"b":1}]}
```
apache/doris-website#3589
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/3.1.x dev/4.0.x dev/4.0.x-conflict dev/4.1.x usercase Important user case type label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants