Expose DataSource from TransformFunction#18435

Closed
xiangfu0 wants to merge 1 commit into apache:master from xiangfu0:codex/expose-transform-datasource

@xiangfu0 xiangfu0 commented May 7, 2026

Summary

  • Add TransformFunction#getDataSource() so transform functions can explicitly expose a backing segment DataSource when one exists.
  • Wire ColumnContext.fromTransformFunction(...) to preserve exposed data sources across transform operator chains.
  • Update identifier, literal, array literal, generated array, base, and map item transform implementations to implement the new method explicitly.
  • Switch jsonExtractIndex and TEXT_MATCH to consume the source from their input transform instead of re-looking up raw identifiers in columnContextMap.
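
As a rough illustration of the second bullet, the sketch below shows how a context derived from a transform function could carry the exposed data source forward. Every type here (`DataSource`, `TransformFunction`, `IdentifierTransform`, `LiteralTransform`, `ColumnContext`) is a simplified, hypothetical stand-in written for this example, not the real Pinot class; only the method names `getDataSource()` and `fromTransformFunction(...)` come from the summary above.

```java
// Hypothetical stand-ins for illustration; these are NOT the real Pinot classes.
final class ColumnContextSketch {

  /** Stand-in for a segment data source (the real one also carries index readers). */
  static final class DataSource {
    private final String _columnName;

    DataSource(String columnName) {
      _columnName = columnName;
    }

    String getColumnName() {
      return _columnName;
    }
  }

  /** Stand-in for the transform contract with the new accessor. */
  interface TransformFunction {
    /** Returns the backing segment data source, or null when the output is computed. */
    DataSource getDataSource();
  }

  /** Identifier transform: the output is the column itself, so the source is exposed. */
  static final class IdentifierTransform implements TransformFunction {
    private final DataSource _dataSource;

    IdentifierTransform(DataSource dataSource) {
      _dataSource = dataSource;
    }

    @Override
    public DataSource getDataSource() {
      return _dataSource;
    }
  }

  /** Literal transform: no segment column backs a constant. */
  static final class LiteralTransform implements TransformFunction {
    @Override
    public DataSource getDataSource() {
      return null;
    }
  }

  /** Stand-in for the context handed to the next function in the chain. */
  static final class ColumnContext {
    private final DataSource _dataSource;

    private ColumnContext(DataSource dataSource) {
      _dataSource = dataSource;
    }

    /** Carries the transform's exposed data source (possibly null) into the context. */
    static ColumnContext fromTransformFunction(TransformFunction transformFunction) {
      return new ColumnContext(transformFunction.getDataSource());
    }

    DataSource getDataSource() {
      return _dataSource;
    }
  }

  public static void main(String[] args) {
    DataSource skills = new DataSource("skills");
    ColumnContext fromIdentifier = ColumnContext.fromTransformFunction(new IdentifierTransform(skills));
    ColumnContext fromLiteral = ColumnContext.fromTransformFunction(new LiteralTransform());
    System.out.println(fromIdentifier.getDataSource().getColumnName()); // prints "skills"
    System.out.println(fromLiteral.getDataSource() == null);            // prints "true"
  }
}
```

The point of the sketch is only that the context no longer needs a raw column name to find the source; it takes whatever the upstream transform chooses to expose.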

Why

Some index-backed transform functions need access to the segment DataSource for the expression they consume. Before this change, that access was only available through ColumnContext for raw columns, which made transform-to-transform chains lose the data source even when a transform result still mapped directly to a segment source.

DataSource vs Dictionary Contract

DataSource and Dictionary are related, but neither strictly derives from the other in the general TransformFunction API.

  • getDataSource() answers whether the transform result can expose a backing segment source and its indexes.
  • getDictionary() answers whether the transform result can be read through dictionary ids.
  • getDataSource() != null does not imply getDictionary() != null; raw columns can have a data source without dictionary encoding.
  • getDictionary() != null does not imply getDataSource() != null; computed transforms can expose dictionary-backed output without being a direct physical data source.
  • When a transform result directly maps to a physical source, getDictionary() should match getDataSource().getDictionary().
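
The contract above can be read as a small consistency check. The sketch below is one possible encoding, under two loud assumptions: the types are hypothetical stand-ins rather than the Pinot classes, and a non-null `getDataSource()` is treated as "maps directly to a physical source". The helper `honorsContract` is invented for this example and compares dictionaries by identity for simplicity.

```java
// Hypothetical stand-ins for illustration; these are NOT the real Pinot classes.
final class DictionaryContractSketch {

  /** Stand-in for a dictionary; identity is all this sketch needs. */
  static final class Dictionary {
  }

  /** Stand-in for a segment data source; a null dictionary models a raw column. */
  static final class DataSource {
    private final Dictionary _dictionary;

    DataSource(Dictionary dictionary) {
      _dictionary = dictionary;
    }

    Dictionary getDictionary() {
      return _dictionary;
    }
  }

  /** Stand-in for the two accessors discussed in the contract. */
  interface TransformFunction {
    DataSource getDataSource();

    Dictionary getDictionary();
  }

  /** Builds a transform that reports the given source and dictionary. */
  static TransformFunction transform(DataSource dataSource, Dictionary dictionary) {
    return new TransformFunction() {
      @Override
      public DataSource getDataSource() {
        return dataSource;
      }

      @Override
      public Dictionary getDictionary() {
        return dictionary;
      }
    };
  }

  /**
   * Assumed reading of the contract: without a data source nothing is cross-checked,
   * and with one the two dictionary views must agree (both may be null for raw columns).
   */
  static boolean honorsContract(TransformFunction function) {
    DataSource dataSource = function.getDataSource();
    if (dataSource == null) {
      return true;
    }
    return function.getDictionary() == dataSource.getDictionary();
  }

  public static void main(String[] args) {
    Dictionary dictionary = new Dictionary();
    // Raw column: data source without dictionary encoding.
    System.out.println(honorsContract(transform(new DataSource(null), null)));
    // Computed transform: dictionary-backed output without a physical source.
    System.out.println(honorsContract(transform(null, dictionary)));
    // Direct mapping whose dictionary disagrees with its source: violates the last rule.
    System.out.println(honorsContract(transform(new DataSource(dictionary), new Dictionary())));
  }
}
```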

User Manual

For transform function authors:

  • Implement getDataSource() explicitly on every TransformFunction implementation.
  • Return the backing DataSource only when the transform output maps directly to a segment data source, such as an identifier or a map item key source.
  • Return null for computed expressions and literals.
  • Prefer argument.getDataSource() when a function needs indexes from its input expression.
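
A minimal sketch of the last bullet, assuming a hypothetical index-backed function: it asks its argument for the data source during construction and fails fast when none is exposed. `TextMatchLikeFunction`, `hasTextIndex()`, and the other types are invented stand-ins for this example; the real Pinot `DataSource` and `TextMatchTransformFunction` have different shapes.

```java
// Hypothetical stand-ins for illustration; these are NOT the real Pinot classes.
final class IndexConsumerSketch {

  /** Stand-in data source that only records whether a text index exists. */
  static final class DataSource {
    private final String _columnName;
    private final boolean _hasTextIndex;

    DataSource(String columnName, boolean hasTextIndex) {
      _columnName = columnName;
      _hasTextIndex = hasTextIndex;
    }

    String getColumnName() {
      return _columnName;
    }

    boolean hasTextIndex() {
      return _hasTextIndex;
    }
  }

  /** Stand-in for the transform contract with the new accessor. */
  interface TransformFunction {
    DataSource getDataSource();
  }

  /** Index-backed function that reads the source from its argument, not from a name lookup. */
  static final class TextMatchLikeFunction implements TransformFunction {
    private final DataSource _source;

    TextMatchLikeFunction(TransformFunction argument) {
      DataSource source = argument.getDataSource();
      if (source == null) {
        throw new IllegalArgumentException("Argument does not expose a segment data source");
      }
      if (!source.hasTextIndex()) {
        throw new IllegalArgumentException("No text index on column: " + source.getColumnName());
      }
      _source = source;
    }

    /** The match result is computed, so this function exposes no data source itself. */
    @Override
    public DataSource getDataSource() {
      return null;
    }

    String indexedColumn() {
      return _source.getColumnName();
    }
  }

  public static void main(String[] args) {
    DataSource skills = new DataSource("skills", true);
    TransformFunction identifier = () -> skills; // identifier-style argument
    TransformFunction computed = () -> null;     // computed argument, nothing to expose

    System.out.println(new TextMatchLikeFunction(identifier).indexedColumn()); // prints "skills"
    try {
      new TextMatchLikeFunction(computed);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Failing at construction time is a design choice of this sketch, not something the PR prescribes; a real function might instead fall back to a scan-based evaluation.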

Sample Queries

These existing index-backed forms continue to work, with the data source now flowing through the transform-function API:

SELECT TEXT_MATCH(skills, 'sewing') AS match
FROM testTable;
SELECT jsonExtractIndex(jsonSV, '$.intVal', 'INT')
FROM testTable;

Validation

  • ./mvnw -pl pinot-core -am -DskipTests compile
  • ./mvnw -pl pinot-core -am -Dtest=IdentifierTransformFunctionTest,DateTimeConversionTransformFunctionTest,JsonExtractIndexTransformFunctionTest -Dsurefire.failIfNoSpecifiedTests=false -DfailIfNoTests=false test
  • ./mvnw -pl pinot-core -am -Dtest=TextMatchTransformFunctionTest -Dsurefire.failIfNoSpecifiedTests=false -DfailIfNoTests=false test

@xiangfu0 marked this pull request as ready for review May 7, 2026 03:42
@xiangfu0 requested a review from Copilot May 7, 2026 03:42
@xiangfu0 changed the title from "[codex] Expose DataSource from TransformFunction" to "Expose DataSource from TransformFunction" May 7, 2026

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a TransformFunction#getDataSource() API and threads segment DataSource through ColumnContext so index-backed transform functions can reliably access indexes even across transform-to-transform chains.

Changes:

  • Introduces TransformFunction#getDataSource() and updates ColumnContext.fromTransformFunction(...) to preserve data sources.
  • Updates multiple transform functions to implement/propagate the new data source exposure.
  • Refactors TEXT_MATCH and jsonExtractIndex to consume the input’s exposed DataSource instead of re-looking up identifiers via columnContextMap.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

File | Description
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/TransformFunction.java | Adds new getDataSource() API to the transform function contract.
pinot-core/src/main/java/org/apache/pinot/core/operator/ColumnContext.java | Preserves DataSource when deriving ColumnContext from a transform.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/IdentifierTransformFunction.java | Exposes backing DataSource for raw identifier transforms.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/TextMatchTransformFunction.java | Switches to using input DataSource for text index access.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/JsonExtractIndexTransformFunction.java | Switches to using input DataSource for JSON index access.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/ItemTransformFunction.java | Uses _mapValue.getDataSource() and exposes the key DataSource.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/BaseTransformFunction.java | Supplies a null getDataSource() implementation for base transforms.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/LiteralTransformFunction.java | Implements getDataSource() as null for literals.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/ArrayLiteralTransformFunction.java | Implements getDataSource() as null for array literals.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/GenerateArrayTransformFunction.java | Implements getDataSource() as null for generated arrays.
pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/IdentifierTransformFunctionTest.java | Adds coverage to assert DataSource propagation via transform + ColumnContext.
pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/DateTimeConversionTransformFunctionTest.java | Updates a test stub to compile with the new interface method.

@xiangfu0 force-pushed the codex/expose-transform-datasource branch from 305f0da to 1a336cb May 7, 2026 03:57

codecov-commenter commented May 7, 2026

Codecov Report

❌ Patch coverage is 61.90476% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.58%. Comparing base (22df831) to head (1a336cb).
⚠️ Report is 67 commits behind head on master.

Files with missing lines | Patch % | Lines
...ator/transform/function/ItemTransformFunction.java | 0.00% | 3 Missing ⚠️
...rm/function/JsonExtractIndexTransformFunction.java | 57.14% | 2 Missing and 1 partial ⚠️
...nsform/function/ArrayLiteralTransformFunction.java | 0.00% | 1 Missing ⚠️
...sform/function/GenerateArrayTransformFunction.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18435      +/-   ##
============================================
+ Coverage     63.40%   63.58%   +0.17%     
- Complexity     1668     1717      +49     
============================================
  Files          3252     3252              
  Lines        198661   199138     +477     
  Branches      30770    30875     +105     
============================================
+ Hits         125965   126620     +655     
+ Misses        62632    62441     -191     
- Partials      10064    10077      +13     
Flag | Coverage Δ
custom-integration1 | 100.00% <ø> (ø)
integration | 100.00% <ø> (ø)
integration1 | 100.00% <ø> (ø)
integration2 | 0.00% <ø> (ø)
java-21 | 63.58% <61.90%> (+0.17%) ⬆️
temurin | 63.58% <61.90%> (+0.17%) ⬆️
unittests | 63.58% <61.90%> (+0.17%) ⬆️
unittests1 | 55.67% <61.90%> (+0.29%) ⬆️
unittests2 | 34.90% <0.00%> (-0.01%) ⬇️

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

* {@code null} otherwise.
*/
@Nullable
DataSource getDataSource();
@Jackie-Jiang added the query (Related to query processing) and enhancement (Improvement to existing functionality) labels May 7, 2026
@xiangfu0 closed this May 7, 2026
@xiangfu0 deleted the codex/expose-transform-datasource branch May 7, 2026 18:47
4 participants