Expose DataSource from TransformFunction#18435
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a TransformFunction#getDataSource() API and threads segment DataSource through ColumnContext so index-backed transform functions can reliably access indexes even across transform-to-transform chains.
Changes:
- Introduces
TransformFunction#getDataSource()and updatesColumnContext.fromTransformFunction(...)to preserve data sources. - Updates multiple transform functions to implement/propagate the new data source exposure.
- Refactors
TEXT_MATCHandjsonExtractIndexto consume the input’s exposedDataSourceinstead of re-looking up identifiers viacolumnContextMap.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/TransformFunction.java | Adds new getDataSource() API to the transform function contract. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/ColumnContext.java | Preserves DataSource when deriving ColumnContext from a transform. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/IdentifierTransformFunction.java | Exposes backing DataSource for raw identifier transforms. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/TextMatchTransformFunction.java | Switches to using input DataSource for text index access. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/JsonExtractIndexTransformFunction.java | Switches to using input DataSource for JSON index access. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/ItemTransformFunction.java | Uses _mapValue.getDataSource() and exposes the key DataSource. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/BaseTransformFunction.java | Supplies a null getDataSource() implementation for base transforms. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/LiteralTransformFunction.java | Implements getDataSource() as null for literals. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/ArrayLiteralTransformFunction.java | Implements getDataSource() as null for array literals. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/GenerateArrayTransformFunction.java | Implements getDataSource() as null for generated arrays. |
| pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/IdentifierTransformFunctionTest.java | Adds coverage to assert DataSource propagation via transform + ColumnContext. |
| pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/DateTimeConversionTransformFunctionTest.java | Updates a test stub to compile with the new interface method. |
305f0da to
1a336cb
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18435 +/- ##
============================================
+ Coverage 63.40% 63.58% +0.17%
- Complexity 1668 1717 +49
============================================
Files 3252 3252
Lines 198661 199138 +477
Branches 30770 30875 +105
============================================
+ Hits 125965 126620 +655
+ Misses 62632 62441 -191
- Partials 10064 10077 +13
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| * {@code null} otherwise. | ||
| */ | ||
| @Nullable | ||
| DataSource getDataSource(); |
Summary
TransformFunction#getDataSource()so transform functions can explicitly expose a backing segmentDataSourcewhen one exists.ColumnContext.fromTransformFunction(...)to preserve exposed data sources across transform operator chains.jsonExtractIndexandTEXT_MATCHto consume the source from their input transform instead of re-looking up raw identifiers incolumnContextMap.Why
Some index-backed transform functions need access to the segment
DataSourcefor the expression they consume. Before this change, that access was only available throughColumnContextfor raw columns, which made transform-to-transform chains lose the data source even when a transform result still mapped directly to a segment source.DataSource vs Dictionary Contract
DataSourceandDictionaryare related, but neither strictly derives from the other in the generalTransformFunctionAPI.getDataSource()answers whether the transform result can expose a backing segment source and its indexes.getDictionary()answers whether the transform result can be read through dictionary ids.getDataSource() != nulldoes not implygetDictionary() != null; raw columns can have a data source without dictionary encoding.getDictionary() != nulldoes not implygetDataSource() != null; computed transforms can expose dictionary-backed output without being a direct physical data source.getDictionary()should matchgetDataSource().getDictionary().User Manual
For transform function authors:
getDataSource()explicitly on everyTransformFunctionimplementation.DataSourceonly when the transform output maps directly to a segment data source, such as an identifier or a map item key source.nullfor computed expressions and literals.argument.getDataSource()when a function needs indexes from its input expression.Sample Queries
These existing index-backed forms continue to work, with the data source now flowing through the transform-function API:
Validation
./mvnw -pl pinot-core -am -DskipTests compile./mvnw -pl pinot-core -am -Dtest=IdentifierTransformFunctionTest,DateTimeConversionTransformFunctionTest,JsonExtractIndexTransformFunctionTest -Dsurefire.failIfNoSpecifiedTests=false -DfailIfNoTests=false test./mvnw -pl pinot-core -am -Dtest=TextMatchTransformFunctionTest -Dsurefire.failIfNoSpecifiedTests=false -DfailIfNoTests=false test