Improve JsonExtractScalarTransformFunction: type coercion, BIG_DECIMAL_ARRAY, guards #18429

Merged: Jackie-Jiang merged 1 commit into apache:master from Jackie-Jiang:json_extract_scalar_coerce on May 6, 2026

Jackie-Jiang commented May 5, 2026

Summary

Restructures JsonExtractScalarTransformFunction for correctness, consistency, and performance. The headline issue was a ClassCastException when the JSON path resolved to elements of a different runtime type than resultsType (e.g. extracting a JSON array of numbers as STRING_ARRAY). The broader rewrite then factored out much of the repeated per-row dispatch and added the missing BIG_DECIMAL_ARRAY MV path.

Bug fix: per-element coercion in MV paths

The MV transform methods declared their result list as List<Integer> / List<Long> / etc. and cast result.get(j) directly. When the JsonPath resolved to elements of a different runtime type, the cast threw. Each MV result list is now a List<Object>, and every element is routed through a type-specific helper:

  • toInt(value, isBoolean): Number → intValue(); when isBoolean, follow Pinot's numeric BOOLEAN convention (any non-zero Number → 1), Boolean → 1/0, and String via BooleanUtils.toInt(String).
  • toLong(value, isTimestamp): Number → longValue(); when isTimestamp, parse strings via TimestampUtils.toMillisSinceEpoch (accepts ISO-8601 + numeric millis); otherwise NumberUtils.parseJsonLong.
  • toFloat / toDouble / toBigDecimal: Number cast / type-specific cast with parse*(toString()) fallback.
  • toString: String pass-through; otherwise JsonUtils.objectToString (single-pass JSON serialization, faster than the old objectToJsonNode(...).toString()).

These helpers are shared between the SV and MV transform methods, eliminating ~10 nearly-identical conversion blocks.
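The per-element coercion described above can be sketched roughly as follows. This is a standalone illustration using only the JDK, not the code from this PR: the class name `CoercionSketch`, the `booleanStringToInt` stand-in, and the use of `Instant.parse` for ISO-8601 strings are all assumptions made for the example, and the real `BooleanUtils`, `TimestampUtils`, and `NumberUtils` helpers may accept more forms.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.List;

public class CoercionSketch {
  // Hypothetical stand-in for the string-to-boolean rule; the real helper may differ.
  public static int booleanStringToInt(String s) {
    return s.equalsIgnoreCase("true") || s.equals("1") ? 1 : 0;
  }

  // Coerces one extracted element to int. In this sketch a Boolean maps to 1/0
  // regardless of isBoolean; the actual method may be stricter.
  public static int toInt(Object value, boolean isBoolean) {
    if (value instanceof Number) {
      Number number = (Number) value;
      // BOOLEAN result: any non-zero number maps to 1.
      return isBoolean ? (number.doubleValue() != 0 ? 1 : 0) : number.intValue();
    }
    if (value instanceof Boolean) {
      return ((Boolean) value) ? 1 : 0;
    }
    String s = value.toString();
    return isBoolean ? booleanStringToInt(s) : Integer.parseInt(s);
  }

  // Coerces one extracted element to long. For TIMESTAMP results, strings may be
  // numeric millis or ISO-8601 instants (assumption: Instant.parse is close enough here).
  public static long toLong(Object value, boolean isTimestamp) {
    if (value instanceof Number) {
      return ((Number) value).longValue();
    }
    String s = value.toString();
    if (isTimestamp) {
      try {
        return Long.parseLong(s);
      } catch (NumberFormatException e) {
        return Instant.parse(s).toEpochMilli();
      }
    }
    return Long.parseLong(s);
  }

  // MV path: the list is List<Object>, so mixed element types no longer break a cast.
  public static int[] toIntArray(List<Object> elements, boolean isBoolean) {
    int[] result = new int[elements.size()];
    for (int j = 0; j < result.length; j++) {
      result[j] = toInt(elements.get(j), isBoolean);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Object> mixed = Arrays.asList(1, "2", 3.9);
    System.out.println(Arrays.toString(toIntArray(mixed, false)));
    System.out.println(toLong("1970-01-01T00:00:01Z", true));
  }
}
```

The point of the sketch is the shape of the fix: the element type is inspected at runtime instead of being assumed from the declared list type.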

New transform method: BIG_DECIMAL_ARRAY

transformToBigDecimalValuesMV was missing, so the BIG_DECIMAL_ARRAY result type previously fell through to the base class, which can't extract from JSON. It is now implemented with the same coercion pattern.
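As a rough illustration of what a BIG_DECIMAL_ARRAY path with per-element coercion might look like, the sketch below converts one row's extracted list into a `BigDecimal[]`. The class and method names are hypothetical, and the fallback via `new BigDecimal(value.toString())` is an assumption about the `parse*(toString())` pattern rather than the PR's exact code.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;

public class BigDecimalArraySketch {
  // BigDecimal passes through untouched; anything else is parsed from its string form.
  public static BigDecimal toBigDecimal(Object value) {
    if (value instanceof BigDecimal) {
      return (BigDecimal) value;
    }
    return new BigDecimal(value.toString());
  }

  // One row of an MV result: each element is coerced independently.
  public static BigDecimal[] toBigDecimalArray(List<Object> elements) {
    BigDecimal[] result = new BigDecimal[elements.size()];
    for (int j = 0; j < result.length; j++) {
      result[j] = toBigDecimal(elements.get(j));
    }
    return result;
  }

  public static void main(String[] args) {
    List<Object> row = Arrays.asList(new BigDecimal("12345678901234567890.123456789"), 7, "0.5");
    System.out.println(Arrays.toString(toBigDecimalArray(row)));
  }
}
```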

Parser-context selection

  • BIG_DECIMAL and STRING (both SV and MV) now use JSON_PARSER_CONTEXT_WITH_BIG_DECIMAL, matching the pre-existing behavior of the SV BigDecimal and SV String paths. This preserves full numeric precision for BIG_DECIMAL and produces canonical-form serialization for STRING (e.g. 1.0E20 stringifies as 100000000000000000000).
  • Numeric SV/MV stay on the default parser since narrowing to int / long / float / double yields equivalent results within double precision and BigDecimal parsing is several times slower.
  • New helper getResultExtractorWithBigDecimal(valueBlock) mirroring the default getResultExtractor.
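The trade-off behind the parser choice can be seen with plain JDK types. The sketch below is an analogy only: it compares parsing a JSON number literal through `double` with parsing it through `BigDecimal`, and it assumes (without confirmation from the source) that plain-string rendering is a fair model of the "canonical form" mentioned above. The class and method names are hypothetical.

```java
import java.math.BigDecimal;

public class ParserPrecisionSketch {
  // Models the default parser: the literal is narrowed to a double first.
  public static String viaDouble(String jsonNumber) {
    return Double.toString(Double.parseDouble(jsonNumber));
  }

  // Models the BigDecimal parser: every digit of the literal is retained.
  public static String viaBigDecimal(String jsonNumber) {
    return new BigDecimal(jsonNumber).toPlainString();
  }

  public static void main(String[] args) {
    String exponentForm = "1.0E20";
    String longDecimal = "12345678901234567890.123456789"; // 29 significant digits
    System.out.println(viaDouble(exponentForm));
    System.out.println(viaBigDecimal(exponentForm));
    System.out.println(viaDouble(longDecimal));
    System.out.println(viaBigDecimal(longDecimal));
  }
}
```

A double carries roughly 17 significant decimal digits, so the 29-digit literal cannot round-trip through it, which is why the BigDecimal parser is reserved for the result types that need it.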

Stored-type guards

All 12 SV/MV transform methods now guard with _storedType != DataType.<X> ? super.transformTo*Values*V() : .... This closes the cross-type correctness hole where a caller asks for an int from a STRING-typed function: the base class now handles the conversion through the function's actual result type. Previously only the INT and LONG SV methods had this guard.
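A minimal sketch of the guard pattern, under simplifying assumptions: only two types exist, the base class converts by reading through the other typed accessor, and all class names (`StoredTypeGuardSketch`, `BaseFunction`, `JsonFunction`) are hypothetical stand-ins rather than Pinot's real hierarchy.

```java
public class StoredTypeGuardSketch {
  public enum DataType { INT, STRING }

  // Generic conversion path: each accessor reads through the other one and converts.
  public abstract static class BaseFunction {
    public int[] transformToIntValuesSV() {
      String[] strings = transformToStringValuesSV();
      int[] result = new int[strings.length];
      for (int i = 0; i < strings.length; i++) {
        result[i] = Integer.parseInt(strings[i]);
      }
      return result;
    }

    public String[] transformToStringValuesSV() {
      int[] ints = transformToIntValuesSV();
      String[] result = new String[ints.length];
      for (int i = 0; i < ints.length; i++) {
        result[i] = Integer.toString(ints[i]);
      }
      return result;
    }
  }

  public static class JsonFunction extends BaseFunction {
    private final DataType _storedType;
    private final Object[] _extracted;

    public JsonFunction(DataType storedType, Object... extracted) {
      _storedType = storedType;
      _extracted = extracted;
    }

    @Override
    public int[] transformToIntValuesSV() {
      // Guard: only extract directly when the function's stored type really is INT.
      if (_storedType != DataType.INT) {
        return super.transformToIntValuesSV();
      }
      int[] result = new int[_extracted.length];
      for (int i = 0; i < result.length; i++) {
        result[i] = ((Number) _extracted[i]).intValue();
      }
      return result;
    }

    @Override
    public String[] transformToStringValuesSV() {
      // Guard: only extract directly when the stored type is STRING.
      if (_storedType != DataType.STRING) {
        return super.transformToStringValuesSV();
      }
      String[] result = new String[_extracted.length];
      for (int i = 0; i < result.length; i++) {
        result[i] = String.valueOf(_extracted[i]);
      }
      return result;
    }
  }
}
```

In this sketch, asking a STRING-typed function for ints falls back to the base class, which calls the overridden string accessor and parses the result, so the direct-cast path never sees values of the wrong type.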

Default-value handling

_defaultValue is pre-converted to the canonical stored-type form once in init():

  • INT / LONG / FLOAT / DOUBLE / BIG_DECIMAL / TIMESTAMP / STRING — already returned the right boxed type from the literal accessor.
  • BOOLEAN now stored as Integer 0 / 1 to match its INT storedType, so per-row consumers can unbox directly without a Boolean → Integer conversion.

Per-row default extraction is now a single direct cast at the top of each transform method, eliminating the instanceof Number / parse*(toString()) chain that was previously repeated in every method.
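The idea of converting the default once and then using a single cast per call could look something like the sketch below. Names such as `canonicalizeDefault` are hypothetical, the enum is reduced to three types, and the behavior when a row has neither a value nor a default (throwing) is an assumption for the example.

```java
public class DefaultValueSketch {
  public enum DataType { INT, BOOLEAN, STRING }

  // Done once at init time: BOOLEAN literals become Integer 0 / 1 to match INT storage.
  public static Object canonicalizeDefault(DataType dataType, Object literal) {
    if (literal == null) {
      return null;
    }
    if (dataType == DataType.BOOLEAN) {
      return ((Boolean) literal) ? 1 : 0;
    }
    return literal;
  }

  // Per call: one direct cast of the default, no instanceof / parse chain per row.
  public static int[] transformToIntValuesSV(Object[] extracted, Object canonicalDefault) {
    Integer defaultValue = (Integer) canonicalDefault;
    int[] result = new int[extracted.length];
    for (int i = 0; i < extracted.length; i++) {
      Object value = extracted[i];
      if (value != null) {
        result[i] = ((Number) value).intValue();
      } else if (defaultValue != null) {
        result[i] = defaultValue;
      } else {
        // Assumption for this sketch: a missing value without a default is an error.
        throw new IllegalArgumentException("No value and no default for row " + i);
      }
    }
    return result;
  }
}
```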

Cached members

_dataType and _storedType are cached as fields in init(), so the transform methods avoid repeated getDataType() / getStoredType() calls.

Tests

Added comprehensive coverage for the new behavior using FluentQueryTest with synthetic JSON:

  • BOOLEAN coercion across Number (non-zero convention) / Boolean / String forms.
  • TIMESTAMP coercion across numeric epoch millis (number and string), ISO-8601 strings, and JDBC-format strings.
  • STRING serialization of non-String JSON values (numbers, booleans, arrays, objects).
  • INT_ARRAY heterogeneous-element coercion (mixed Number and string-form numbers).
  • STRING_ARRAY heterogeneous-element JSON serialization.
  • BIG_DECIMAL_ARRAY precision preservation (29-digit decimal that exceeds Double precision).
  • Cross-type guard via the base class (CAST(jsonExtractScalar(..., 'STRING') AS LONG)).


codecov-commenter commented May 5, 2026

Codecov Report

❌ Patch coverage is 79.04192% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.57%. Comparing base (b870804) to head (23c84e7).

Files with missing lines Patch % Lines
...m/function/JsonExtractScalarTransformFunction.java 79.04% 21 Missing and 14 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18429      +/-   ##
============================================
- Coverage     63.61%   63.57%   -0.04%     
  Complexity     1717     1717              
============================================
  Files          3252     3252              
  Lines        199051   199114      +63     
  Branches      30838    30855      +17     
============================================
- Hits         126618   126592      -26     
- Misses        62352    62443      +91     
+ Partials      10081    10079       -2     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 63.57% <79.04%> (-0.04%) ⬇️
temurin 63.57% <79.04%> (-0.04%) ⬇️
unittests 63.57% <79.04%> (-0.04%) ⬇️
unittests1 55.67% <79.04%> (+0.02%) ⬆️
unittests2 34.91% <0.00%> (-0.07%) ⬇️

Jackie-Jiang added the bug and query labels May 5, 2026
Jackie-Jiang force-pushed the json_extract_scalar_coerce branch from c6266a3 to 18c179e May 5, 2026 21:27
Jackie-Jiang changed the title from "JsonExtractScalar: coerce per element in MV paths and add BIG_DECIMAL_ARRAY" to "Improve JsonExtractScalarTransformFunction: type coercion, BIG_DECIMAL_ARRAY, guards" May 5, 2026
Jackie-Jiang force-pushed the json_extract_scalar_coerce branch from 18c179e to aa85969 May 5, 2026 21:31
Jackie-Jiang requested a review from Copilot May 5, 2026 21:50

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Improves JsonExtractScalarTransformFunction by adding per-element coercion for MV outputs, supporting BIG_DECIMAL_ARRAY, and introducing stored-type guards to route cross-type requests through the base conversion path.

Changes:

  • Refactors SV/MV transforms to use shared coercion helpers and avoid ClassCastException on heterogeneous JsonPath results.
  • Adds BIG_DECIMAL_ARRAY MV extraction and BigDecimal-preserving parser selection for STRING/BIG_DECIMAL.
  • Expands tests to cover coercion behavior, precision, and cross-type guard behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File | Description
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/JsonExtractScalarTransformFunction.java | Refactors JSON extraction/coercion, adds BIG_DECIMAL_ARRAY, stored-type guards, and BigDecimal parser pathway.
pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/JsonExtractScalarTransformFunctionTest.java | Adds new FluentQueryTest coverage for coercion, precision, serialization, and cross-type conversions.

…L_ARRAY, guards

Restructures `JsonExtractScalarTransformFunction` for correctness, consistency, and
performance. Stacked on the PinotDataType / FunctionUtils refactor in the parent
commit; review only this commit on top of apache#18428.

Per-element coercion (the bug):
- The MV transform methods declared their result list as `List<Integer>` /
  `List<Long>` / etc. and cast `result.get(j)` directly. When the JsonPath resolved
  to elements of a different runtime type (e.g. `STRING_ARRAY` over a JSON array of
  numbers), the cast threw `ClassCastException`. Switched each MV result list to
  `List<Object>` and route per-element conversion through type-specific helpers.

Type-specific coercion helpers (shared by SV and MV):
- `toInt(value, isBoolean)` — `Number` → `intValue()`; for BOOLEAN result, follows
  Pinot's numeric convention (any non-zero `Number` → 1), `Boolean` → 1/0, and
  String forms via `BooleanUtils.toInt(String)`.
- `toLong(value, isTimestamp)` — `Number` → `longValue()`; for TIMESTAMP result,
  String forms parsed via `TimestampUtils.toMillisSinceEpoch` (ISO-8601 + numeric);
  otherwise `NumberUtils.parseJsonLong`.
- `toFloat`, `toDouble`, `toBigDecimal`, `toString` — straight `Number` /
  type-specific cast with `parse*(toString())` / `JsonUtils.objectToString`
  fallback.

New transform method:
- Added `transformToBigDecimalValuesMV`. `BIG_DECIMAL_ARRAY` previously fell through
  to the base class which can't extract from JSON.

Parser-context selection:
- `BIG_DECIMAL` and `STRING` SV/MV use `JSON_PARSER_CONTEXT_WITH_BIG_DECIMAL` to
  preserve full numeric precision and produce canonical-form string serialization.
  Numeric SV/MV stay on the default parser since narrowing to int / long / float /
  double yields equivalent results within double precision.
- New helper `getResultExtractorWithBigDecimal(valueBlock)` for the
  BigDecimal-parser path, mirroring the default `getResultExtractor`.

Stored-type guards on every transform method:
- All 12 SV/MV transform methods now guard with
  `_storedType != DataType.<X> ? super.transformTo*Values*V() : ...`. Closes the
  cross-type correctness hole where a caller asks for an int from a STRING-typed
  function — the base class now handles the conversion.

Default-value handling:
- `_defaultValue` is pre-converted to the canonical stored-type form once in
  `init()` (most types via the literal accessors; `BOOLEAN` literal stored as
  `Integer` 0 / 1 to match the `INT` storedType). Per-row default extraction is now
  a single direct cast at the top of each transform method, eliminating the
  `instanceof Number` / `parse*(toString())` chain that was repeated in every
  method.

Cached members:
- `_dataType` and `_storedType` cached as fields in `init()` so the transform
  methods avoid repeated `getDataType()` / `getStoredType()` invocations.

Tests:
- Added comprehensive coverage for the new behavior using `FluentQueryTest` with
  synthetic JSON: BOOLEAN coercion (Number / Boolean / String forms), TIMESTAMP
  coercion (numeric millis, ISO-8601, JDBC-format strings), STRING serialization
  for non-String JSON values, INT_ARRAY / STRING_ARRAY heterogeneous-element
  coercion, BIG_DECIMAL_ARRAY precision preservation, and the cross-type guard via
  the base class.
Jackie-Jiang force-pushed the json_extract_scalar_coerce branch from aa85969 to 23c84e7 May 5, 2026 22:18
Jackie-Jiang merged commit 5cb954d into apache:master May 6, 2026
11 checks passed
Jackie-Jiang deleted the json_extract_scalar_coerce branch May 6, 2026 00:38

xiangfu0 commented May 6, 2026

Docs follow-up opened: pinot-contrib/pinot-docs#801

xiangfu0 added a commit to pinot-contrib/pinot-docs that referenced this pull request May 6, 2026

## Summary
- clarify `JSONEXTRACTSCALAR` result typing for query-facing docs
- document the new `BIG_DECIMAL_ARRAY` support and the
precision-preserving contract for decimal extraction
- spell out the coercion and JSON serialization behavior users can rely
on for `STRING`, numeric, boolean, and timestamp result types

## Structural changes
- updated the existing `functions/json/jsonextractscalar.md` reference
page only

## Source cross-check
- verified behavior against
`pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/JsonExtractScalarTransformFunction.java`
- verified the user-facing contract against
`pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/JsonExtractScalarTransformFunctionTest.java`

## Validation
- `git diff --check`
- lightweight markdown link validation for
`functions/json/jsonextractscalar.md`

Labels

bug Something is not working as expected query Related to query processing

4 participants