Improve JsonExtractScalarTransformFunction: type coercion, BIG_DECIMAL_ARRAY, guards#18429
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18429 +/- ##
============================================
- Coverage 63.61% 63.57% -0.04%
Complexity 1717 1717
============================================
Files 3252 3252
Lines 199051 199114 +63
Branches 30838 30855 +17
============================================
- Hits 126618 126592 -26
- Misses 62352 62443 +91
+ Partials 10081 10079 -2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
c6266a3 to
18c179e
Compare
18c179e to
aa85969
Compare
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Improves JsonExtractScalarTransformFunction by adding per-element coercion for MV outputs, supporting BIG_DECIMAL_ARRAY, and introducing stored-type guards to route cross-type requests through the base conversion path.
Changes:
- Refactors SV/MV transforms to use shared coercion helpers and avoid
ClassCastExceptionon heterogeneous JsonPath results. - Adds
BIG_DECIMAL_ARRAYMV extraction and BigDecimal-preserving parser selection forSTRING/BIG_DECIMAL. - Expands tests to cover coercion behavior, precision, and cross-type guard behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/JsonExtractScalarTransformFunction.java | Refactors JSON extraction/coercion, adds BIG_DECIMAL_ARRAY, stored-type guards, and BigDecimal parser pathway. |
| pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/JsonExtractScalarTransformFunctionTest.java | Adds new FluentQueryTest coverage for coercion, precision, serialization, and cross-type conversions. |
…L_ARRAY, guards Restructures `JsonExtractScalarTransformFunction` for correctness, consistency, and performance. Stacked on the PinotDataType / FunctionUtils refactor in the parent commit; review only this commit on top of apache#18428. Per-element coercion (the bug): - The MV transform methods declared their result list as `List<Integer>` / `List<Long>` / etc. and cast `result.get(j)` directly. When the JsonPath resolved to elements of a different runtime type (e.g. `STRING_ARRAY` over a JSON array of numbers), the cast threw `ClassCastException`. Switched each MV result list to `List<Object>` and route per-element conversion through type-specific helpers. Type-specific coercion helpers (shared by SV and MV): - `toInt(value, isBoolean)` — `Number` → `intValue()`; for BOOLEAN result, follows Pinot's numeric convention (any non-zero `Number` → 1), `Boolean` → 1/0, and String forms via `BooleanUtils.toInt(String)`. - `toLong(value, isTimestamp)` — `Number` → `longValue()`; for TIMESTAMP result, String forms parsed via `TimestampUtils.toMillisSinceEpoch` (ISO-8601 + numeric); otherwise `NumberUtils.parseJsonLong`. - `toFloat`, `toDouble`, `toBigDecimal`, `toString` — straight `Number` / type-specific cast with `parse*(toString())` / `JsonUtils.objectToString` fallback. New transform method: - Added `transformToBigDecimalValuesMV`. `BIG_DECIMAL_ARRAY` previously fell through to the base class which can't extract from JSON. Parser-context selection: - `BIG_DECIMAL` and `STRING` SV/MV use `JSON_PARSER_CONTEXT_WITH_BIG_DECIMAL` to preserve full numeric precision and produce canonical-form string serialization. Numeric SV/MV stay on the default parser since narrowing to int / long / float / double yields equivalent results within double precision. - New helper `getResultExtractorWithBigDecimal(valueBlock)` for the BigDecimal-parser path, mirroring the default `getResultExtractor`. Stored-type guards on every transform method: - All 12 SV/MV transform methods now guard with `_storedType != DataType.<X> ? super.transformTo*Values*V() : ...`. Closes the cross-type correctness hole where a caller asks for an int from a STRING-typed function — the base class now handles the conversion. Default-value handling: - `_defaultValue` is pre-converted to the canonical stored-type form once in `init()` (most types via the literal accessors; `BOOLEAN` literal stored as `Integer` 0 / 1 to match the `INT` storedType). Per-row default extraction is now a single direct cast at the top of each transform method, eliminating the `instanceof Number` / `parse*(toString())` chain that was repeated in every method. Cached members: - `_dataType` and `_storedType` cached as fields in `init()` so the transform methods avoid repeated `getDataType()` / `getStoredType()` invocations. Tests: - Added comprehensive coverage for the new behavior using `FluentQueryTest` with synthetic JSON: BOOLEAN coercion (Number / Boolean / String forms), TIMESTAMP coercion (numeric millis, ISO-8601, JDBC-format strings), STRING serialization for non-String JSON values, INT_ARRAY / STRING_ARRAY heterogeneous-element coercion, BIG_DECIMAL_ARRAY precision preservation, and the cross-type guard via the base class.
aa85969 to
23c84e7
Compare
|
Docs follow-up opened: pinot-contrib/pinot-docs#801 |
) ## Summary - clarify `JSONEXTRACTSCALAR` result typing for query-facing docs - document the new `BIG_DECIMAL_ARRAY` support and the precision-preserving contract for decimal extraction - spell out the coercion and JSON serialization behavior users can rely on for `STRING`, numeric, boolean, and timestamp result types ## Structural changes - updated the existing `functions/json/jsonextractscalar.md` reference page only ## Source cross-check - verified behavior against `pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/JsonExtractScalarTransformFunction.java` - verified the user-facing contract against `pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/JsonExtractScalarTransformFunctionTest.java` ## Validation - `git diff --check` - lightweight markdown link validation for `functions/json/jsonextractscalar.md`
Summary
Restructures
JsonExtractScalarTransformFunctionfor correctness, consistency, and performance. The headline issue was aClassCastExceptionwhen the JSON path resolved to elements of a different runtime type thanresultsType(e.g. extracting a JSON array of numbers asSTRING_ARRAY); the broader rewrite then folded out a lot of repeated per-row dispatch and added the missingBIG_DECIMAL_ARRAYMV path.Bug fix: per-element coercion in MV paths
The MV transform methods declared their result list as
List<Integer>/List<Long>/ etc. and castresult.get(j)directly. When the JsonPath resolved to elements of a different runtime type, the cast threw. Switched each MV result list toList<Object>and route every element through a type-specific helper:toInt(value, isBoolean)—Number→intValue(); whenisBoolean, follow Pinot's numeric BOOLEAN convention (any non-zeroNumber→ 1),Boolean→ 1/0, and String viaBooleanUtils.toInt(String).toLong(value, isTimestamp)—Number→longValue(); whenisTimestamp, parse strings viaTimestampUtils.toMillisSinceEpoch(accepts ISO-8601 + numeric millis); otherwiseNumberUtils.parseJsonLong.toFloat/toDouble/toBigDecimal—Numbercast / type-specific cast withparse*(toString())fallback.toString—Stringpass-through; otherwiseJsonUtils.objectToString(single-pass JSON serialization, faster than the oldobjectToJsonNode(...).toString()).These helpers are shared between the SV and MV transform methods, eliminating ~10 nearly-identical conversion blocks.
New transform method:
BIG_DECIMAL_ARRAYtransformToBigDecimalValuesMVwas missing —BIG_DECIMAL_ARRAYresult type previously fell through to the base class which can't extract from JSON. Implemented with the same coercion pattern.Parser-context selection
BIG_DECIMALandSTRING(both SV and MV) now useJSON_PARSER_CONTEXT_WITH_BIG_DECIMAL, matching the SV-BigDecimal and SV-String pre-existing behavior. This preserves full numeric precision forBIG_DECIMALand produces canonical-form serialization forSTRING(e.g.1.0E20stringifies as100000000000000000000).int/long/float/doubleyields equivalent results within double precision and BigDecimal parsing is several times slower.getResultExtractorWithBigDecimal(valueBlock)mirroring the defaultgetResultExtractor.Stored-type guards
All 12 SV/MV transform methods now guard with
_storedType != DataType.<X> ? super.transformTo*Values*V() : .... Closes the cross-type correctness hole where a caller asks for anintfrom aSTRING-typed function — the base class now handles the conversion through the function's actual result type. Previously only INT/LONG SV had this guard.Default-value handling
_defaultValueis pre-converted to the canonical stored-type form once ininit():INT/LONG/FLOAT/DOUBLE/BIG_DECIMAL/TIMESTAMP/STRING— already returned the right boxed type from the literal accessor.BOOLEANnow stored asInteger0 / 1 to match itsINTstoredType, so per-row consumers can unbox directly without aBoolean → Integerconversion.Per-row default extraction is now a single direct cast at the top of each transform method, eliminating the
instanceof Number/parse*(toString())chain that was previously repeated in every method.Cached members
_dataTypeand_storedTypeare cached as fields ininit(), so the transform methods avoid repeatedgetDataType()/getStoredType()calls.Tests
Added comprehensive coverage for the new behavior using
FluentQueryTestwith synthetic JSON:Number(non-zero convention) /Boolean/Stringforms.Numberand string-form numbers).BIG_DECIMAL_ARRAYprecision preservation (29-digit decimal that exceedsDoubleprecision).CAST(jsonExtractScalar(..., 'STRING') AS LONG)).