Refactor PinotDataType / FunctionUtils dispatch to value+instanceof#18428
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Refactors Pinot type inference from exact Class<?> matching to value-based instanceof dispatch so runtime subclasses, especially JDBC Timestamp subclasses, resolve to the expected Pinot types. It also separates primitive vs boxed boolean-array handling and updates affected call sites/tests.
Changes:
- Replaced
Class<?>-based dispatch inPinotDataTypeandFunctionUtilswith value-based inference and updated callers to pass values instead of.getClass(). - Split boolean-array handling into
PRIMITIVE_BOOLEAN_ARRAY(boolean[]) andBOOLEAN_ARRAY(Boolean[]), and added related conversion/map updates. - Added and updated tests for new dispatch behavior, including vendor
Timestampsubclasses and new array mappings.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/DataTypeTransformerUtils.java | Switches transform-time source type inference to value-based dispatch. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/defaultcolumn/BaseDefaultColumnHandler.java | Infers derived-column output type from runtime values instead of classes. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java | Uses runtime values for map-entry stats type detection. |
| pinot-common/src/test/java/org/apache/pinot/common/utils/PinotDataTypeTest.java | Reworks PinotDataType tests for value-based dispatch and boolean-array split. |
| pinot-common/src/test/java/org/apache/pinot/common/function/FunctionUtilsTest.java | Adds coverage for getArgumentType, parameter types, and column data types. |
| pinot-common/src/test/java/org/apache/pinot/common/evaluator/InbuiltFunctionEvaluatorTest.java | Updates evaluator tests to use value-based argument type inference. |
| pinot-common/src/main/java/org/apache/pinot/common/utils/PinotDataType.java | Implements runtime-value dispatch and new boxed/primitive boolean-array behavior. |
| pinot-common/src/main/java/org/apache/pinot/common/function/scalar/DataTypeConversionFunctions.java | Updates cast source-type resolution and fixes the array guard condition. |
| pinot-common/src/main/java/org/apache/pinot/common/function/FunctionUtils.java | Replaces argument type lookup logic and extends type maps for new array forms. |
| pinot-common/src/main/java/org/apache/pinot/common/function/FunctionInvoker.java | Converts function arguments using value-derived Pinot types. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18428 +/- ##
============================================
- Coverage 63.59% 63.59% -0.01%
Complexity 1717 1717
============================================
Files 3252 3252
Lines 199119 199132 +13
Branches 30857 30875 +18
============================================
+ Hits 126627 126631 +4
- Misses 62425 62427 +2
- Partials 10067 10074 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…L_ARRAY, guards Restructures `JsonExtractScalarTransformFunction` for correctness, consistency, and performance. Stacked on the PinotDataType / FunctionUtils refactor in the parent commit; review only this commit on top of apache#18428. Per-element coercion (the bug): - The MV transform methods declared their result list as `List<Integer>` / `List<Long>` / etc. and cast `result.get(j)` directly. When the JsonPath resolved to elements of a different runtime type (e.g. `STRING_ARRAY` over a JSON array of numbers), the cast threw `ClassCastException`. Switched each MV result list to `List<Object>` and route per-element conversion through type-specific helpers. Type-specific coercion helpers (shared by SV and MV): - `toInt(value, isBoolean)` — `Number` → `intValue()`; for BOOLEAN result, follows Pinot's numeric convention (any non-zero `Number` → 1), `Boolean` → 1/0, and String forms via `BooleanUtils.toInt(String)`. - `toLong(value, isTimestamp)` — `Number` → `longValue()`; for TIMESTAMP result, String forms parsed via `TimestampUtils.toMillisSinceEpoch` (ISO-8601 + numeric); otherwise `NumberUtils.parseJsonLong`. - `toFloat`, `toDouble`, `toBigDecimal`, `toString` — straight `Number` / type-specific cast with `parse*(toString())` / `JsonUtils.objectToString` fallback. New transform method: - Added `transformToBigDecimalValuesMV`. `BIG_DECIMAL_ARRAY` previously fell through to the base class which can't extract from JSON. Parser-context selection: - `BIG_DECIMAL` and `STRING` SV/MV use `JSON_PARSER_CONTEXT_WITH_BIG_DECIMAL` to preserve full numeric precision and produce canonical-form string serialization. Numeric SV/MV stay on the default parser since narrowing to int / long / float / double yields equivalent results within double precision. - New helper `getResultExtractorWithBigDecimal(valueBlock)` for the BigDecimal-parser path, mirroring the default `getResultExtractor`. Stored-type guards on every transform method: - All 12 SV/MV transform methods now guard with `_storedType != DataType.<X> ? super.transformTo*Values*V() : ...`. Closes the cross-type correctness hole where a caller asks for an int from a STRING-typed function — the base class now handles the conversion. Default-value handling: - `_defaultValue` is pre-converted to the canonical stored-type form once in `init()` (most types via the literal accessors; `BOOLEAN` literal stored as `Integer` 0 / 1 to match the `INT` storedType). Per-row default extraction is now a single direct cast at the top of each transform method, eliminating the `instanceof Number` / `parse*(toString())` chain that was repeated in every method. Cached members: - `_dataType` and `_storedType` cached as fields in `init()` so the transform methods avoid repeated `getDataType()` / `getStoredType()` invocations. Tests: - Added comprehensive coverage for the new behavior using `FluentQueryTest` with synthetic JSON: BOOLEAN coercion (Number / Boolean / String forms), TIMESTAMP coercion (numeric millis, ISO-8601, JDBC-format strings), STRING serialization for non-String JSON values, INT_ARRAY / STRING_ARRAY heterogeneous-element coercion, BIG_DECIMAL_ARRAY precision preservation, and the cross-type guard via the base class.
…L_ARRAY, guards Restructures `JsonExtractScalarTransformFunction` for correctness, consistency, and performance. Stacked on the PinotDataType / FunctionUtils refactor in the parent commit; review only this commit on top of apache#18428. Per-element coercion (the bug): - The MV transform methods declared their result list as `List<Integer>` / `List<Long>` / etc. and cast `result.get(j)` directly. When the JsonPath resolved to elements of a different runtime type (e.g. `STRING_ARRAY` over a JSON array of numbers), the cast threw `ClassCastException`. Switched each MV result list to `List<Object>` and route per-element conversion through type-specific helpers. Type-specific coercion helpers (shared by SV and MV): - `toInt(value, isBoolean)` — `Number` → `intValue()`; for BOOLEAN result, follows Pinot's numeric convention (any non-zero `Number` → 1), `Boolean` → 1/0, and String forms via `BooleanUtils.toInt(String)`. - `toLong(value, isTimestamp)` — `Number` → `longValue()`; for TIMESTAMP result, String forms parsed via `TimestampUtils.toMillisSinceEpoch` (ISO-8601 + numeric); otherwise `NumberUtils.parseJsonLong`. - `toFloat`, `toDouble`, `toBigDecimal`, `toString` — straight `Number` / type-specific cast with `parse*(toString())` / `JsonUtils.objectToString` fallback. New transform method: - Added `transformToBigDecimalValuesMV`. `BIG_DECIMAL_ARRAY` previously fell through to the base class which can't extract from JSON. Parser-context selection: - `BIG_DECIMAL` and `STRING` SV/MV use `JSON_PARSER_CONTEXT_WITH_BIG_DECIMAL` to preserve full numeric precision and produce canonical-form string serialization. Numeric SV/MV stay on the default parser since narrowing to int / long / float / double yields equivalent results within double precision. - New helper `getResultExtractorWithBigDecimal(valueBlock)` for the BigDecimal-parser path, mirroring the default `getResultExtractor`. Stored-type guards on every transform method: - All 12 SV/MV transform methods now guard with `_storedType != DataType.<X> ? super.transformTo*Values*V() : ...`. Closes the cross-type correctness hole where a caller asks for an int from a STRING-typed function — the base class now handles the conversion. Default-value handling: - `_defaultValue` is pre-converted to the canonical stored-type form once in `init()` (most types via the literal accessors; `BOOLEAN` literal stored as `Integer` 0 / 1 to match the `INT` storedType). Per-row default extraction is now a single direct cast at the top of each transform method, eliminating the `instanceof Number` / `parse*(toString())` chain that was repeated in every method. Cached members: - `_dataType` and `_storedType` cached as fields in `init()` so the transform methods avoid repeated `getDataType()` / `getStoredType()` invocations. Tests: - Added comprehensive coverage for the new behavior using `FluentQueryTest` with synthetic JSON: BOOLEAN coercion (Number / Boolean / String forms), TIMESTAMP coercion (numeric millis, ISO-8601, JDBC-format strings), STRING serialization for non-String JSON values, INT_ARRAY / STRING_ARRAY heterogeneous-element coercion, BIG_DECIMAL_ARRAY precision preservation, and the cross-type guard via the base class.
|
Summarizing replies to Copilot's review: Re:
|
…L_ARRAY, guards Restructures `JsonExtractScalarTransformFunction` for correctness, consistency, and performance. Stacked on the PinotDataType / FunctionUtils refactor in the parent commit; review only this commit on top of apache#18428. Per-element coercion (the bug): - The MV transform methods declared their result list as `List<Integer>` / `List<Long>` / etc. and cast `result.get(j)` directly. When the JsonPath resolved to elements of a different runtime type (e.g. `STRING_ARRAY` over a JSON array of numbers), the cast threw `ClassCastException`. Switched each MV result list to `List<Object>` and route per-element conversion through type-specific helpers. Type-specific coercion helpers (shared by SV and MV): - `toInt(value, isBoolean)` — `Number` → `intValue()`; for BOOLEAN result, follows Pinot's numeric convention (any non-zero `Number` → 1), `Boolean` → 1/0, and String forms via `BooleanUtils.toInt(String)`. - `toLong(value, isTimestamp)` — `Number` → `longValue()`; for TIMESTAMP result, String forms parsed via `TimestampUtils.toMillisSinceEpoch` (ISO-8601 + numeric); otherwise `NumberUtils.parseJsonLong`. - `toFloat`, `toDouble`, `toBigDecimal`, `toString` — straight `Number` / type-specific cast with `parse*(toString())` / `JsonUtils.objectToString` fallback. New transform method: - Added `transformToBigDecimalValuesMV`. `BIG_DECIMAL_ARRAY` previously fell through to the base class which can't extract from JSON. Parser-context selection: - `BIG_DECIMAL` and `STRING` SV/MV use `JSON_PARSER_CONTEXT_WITH_BIG_DECIMAL` to preserve full numeric precision and produce canonical-form string serialization. Numeric SV/MV stay on the default parser since narrowing to int / long / float / double yields equivalent results within double precision. - New helper `getResultExtractorWithBigDecimal(valueBlock)` for the BigDecimal-parser path, mirroring the default `getResultExtractor`. Stored-type guards on every transform method: - All 12 SV/MV transform methods now guard with `_storedType != DataType.<X> ? super.transformTo*Values*V() : ...`. Closes the cross-type correctness hole where a caller asks for an int from a STRING-typed function — the base class now handles the conversion. Default-value handling: - `_defaultValue` is pre-converted to the canonical stored-type form once in `init()` (most types via the literal accessors; `BOOLEAN` literal stored as `Integer` 0 / 1 to match the `INT` storedType). Per-row default extraction is now a single direct cast at the top of each transform method, eliminating the `instanceof Number` / `parse*(toString())` chain that was repeated in every method. Cached members: - `_dataType` and `_storedType` cached as fields in `init()` so the transform methods avoid repeated `getDataType()` / `getStoredType()` invocations. Tests: - Added comprehensive coverage for the new behavior using `FluentQueryTest` with synthetic JSON: BOOLEAN coercion (Number / Boolean / String forms), TIMESTAMP coercion (numeric millis, ISO-8601, JDBC-format strings), STRING serialization for non-String JSON values, INT_ARRAY / STRING_ARRAY heterogeneous-element coercion, BIG_DECIMAL_ARRAY precision preservation, and the cross-type guard via the base class.
|
do we need to handle PRIMITIVE_BOOLEAN_ARRAY or TIMESTAMP_ARRAY in ScalarTransformFunctionWrapper .getNonLiteralValues |
Replaces the Class-based map / chain dispatch in PinotDataType.getSingleValueType / getMultiValueType and FunctionUtils.getArgumentType with an instanceof chain that takes the value directly. Fixes the long-standing exact-class match bug where vendor JDBC Timestamp subclasses (e.g. BigQuery Simba's TimestampTz) fell through to OBJECT and broke downstream conversion. Highlights: - PinotDataType.getSingleValueType / getMultiValueType: take Object instead of Class, dispatch via instanceof chain in canonical Pinot type order. Always non-null (OBJECT / OBJECT_ARRAY for unrecognized). - Split BOOLEAN_ARRAY into PRIMITIVE_BOOLEAN_ARRAY (boolean[]) and BOOLEAN_ARRAY (Boolean[]), parallel to PRIMITIVE_INT_ARRAY / INTEGER_ARRAY etc. Fixes the silent asymmetry where BOOLEAN_ARRAY stored as primitive while every other *_ARRAY was boxed. - Rename toBooleanArray (returns boolean[]) to toPrimitiveBooleanArray; new toBooleanArray returns Boolean[]. Matches int / long / float / double naming. - toObjectArray now handles boolean[] alongside int[] / long[] / float[] / double[]. - Reorder default to*Array methods to canonical Pinot type order. FunctionUtils: - getArgumentType: take Object, always non-null. Delegates SV to PinotDataType.getSingleValueType, MV reference arrays to PinotDataType.getMultiValueType (via element sampling), primitive arrays handled locally. - Add boolean[] / Timestamp[] entries to PARAMETER_TYPE_MAP and COLUMN_DATA_TYPE_MAP. - Remove unused DATA_TYPE_MAP and getDataType (zero callers; the map mapped Java array classes to the element-type DataType which lost the SV/MV distinction). Callers updated to pass values instead of classes (drops .getClass() at every call site): FunctionInvoker, BaseDefaultColumnHandler, DataTypeConversionFunctions, MapColumnPreIndexStatsCollector, DataTypeTransformerUtils. Tests: - New FunctionUtilsTest covering getArgumentType / getParameterType / getColumnDataType, including vendor Timestamp subclass case and boolean[] / Timestamp[] additions. - PinotDataTypeTest converted to value-based assertions in canonical order, added PRIMITIVE_BOOLEAN_ARRAY ↔ BOOLEAN_ARRAY cross-form conversions and Timestamp subclass cases for both getSingleValueType and getMultiValueType.
Good point. Added |
d5ab8da to
e55d282
Compare
Summary
Replaces the
Class-based map / chain dispatch inPinotDataType.getSingleValueType/getMultiValueTypeandFunctionUtils.getArgumentTypewith aninstanceofchain that takes the value directly. Fixes a long-standing exact-class match bug where vendor JDBCTimestampsubclasses (e.g. BigQuery Simba'sTimestampTz) fell through toOBJECTand broke downstream conversion.PinotDataType
getSingleValueType(Class<?>)→getSingleValueType(Object)andgetMultiValueType(Class<?>)→getMultiValueType(Object)—instanceofdispatch in canonical Pinot type order. Always non-null (OBJECT/OBJECT_ARRAYfor unrecognized types). Subclasses of non-final types (Timestamp,Map, etc.) match their parent type naturally.BOOLEAN_ARRAYintoPRIMITIVE_BOOLEAN_ARRAY(boolean[]) andBOOLEAN_ARRAY(Boolean[]) — parallel toPRIMITIVE_INT_ARRAY/INTEGER_ARRAY. Fixes the silent asymmetry whereBOOLEAN_ARRAYstored as primitive while every other*_ARRAYwas boxed.toBooleanArray(returnsboolean[]) totoPrimitiveBooleanArray; newtoBooleanArrayreturnsBoolean[]. Matchesint/long/float/doublenaming.toObjectArraynow handlesboolean[]alongsideint[]/long[]/float[]/double[].to*Arraymethods to canonical Pinot type order (INT → LONG → FLOAT → DOUBLE → BIG_DECIMAL → BOOLEAN → TIMESTAMP → STRING → BYTES → DATE → TIME → UUID).FunctionUtils
getArgumentType(Class<?>)→getArgumentType(Object), always non-null. Delegates SV dispatch toPinotDataType.getSingleValueTypeand MV reference-array dispatch toPinotDataType.getMultiValueTypevia element sampling; primitive arrays handled locally (since they can't be element-sampled into a boxed type).boolean[]/Timestamp[]entries toPARAMETER_TYPE_MAPandCOLUMN_DATA_TYPE_MAPso scalar functions can declare these as parameter / return types.DATA_TYPE_MAPandgetDataType— zero callers; the map mapped Java array classes to the element-typeDataTypewhich lost the SV/MV distinction.ScalarTransformFunctionWrapper
PRIMITIVE_BOOLEAN_ARRAYandTIMESTAMP_ARRAYcases togetNonLiteralValues. Both were reachable via the newboolean[]/Timestamp[]entries inPARAMETER_TYPE_MAPbut previously hit thedefaultbranch and threwIllegalStateException("Unsupported parameter type: ..."). Per-row conversion mirrors the SVBOOLEAN/TIMESTAMPcases (int[][]→boolean[][]andlong[][]→Timestamp[][]).Caller updates
Drops
.getClass()at every call site:FunctionInvoker.convertTypesBaseDefaultColumnHandler.createDerivedColumnV1IndicesDataTypeConversionFunctions.castMapColumnPreIndexStatsCollector.createKeyStatsCollectorDataTypeTransformerUtils.transformValueTests
FunctionUtilsTestcoveringgetArgumentType/getParameterType/getColumnDataType, including the vendorTimestampsubclass case and the newboolean[]/Timestamp[]map entries.PinotDataTypeTestconverted to value-based assertions in canonical order, addedPRIMITIVE_BOOLEAN_ARRAY↔BOOLEAN_ARRAYcross-form conversions andTimestampsubclass cases for bothgetSingleValueTypeandgetMultiValueType.ScalarTransformFunctionWrapperTestcases —testCountTrueBooleansTransformFunctionandtestSumTimestampMillisTransformFunction— exercise the newPRIMITIVE_BOOLEAN_ARRAY/TIMESTAMP_ARRAYdispatch end-to-end via test-scope@ScalarFunctionhelpers (countTrueBooleans(boolean[]),sumTimestampMillis(Timestamp[])) registered throughFunctionRegistry's classpath scan.