GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression (#35570)

### Rationale for this change

To close #34252

### What changes are included in this PR?

This is a proposal to try to solve:
1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
- [x] Draft a Substrait Extended Expression to test (this will be generated by a third-party project such as Isthmus)
- [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
- [x] Create JNI Wrapper for ScannerBuilder::Project 
- [x] Create JNI API
- [x] Testing coverage
- [x] Documentation

The current problem is `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`: the projection cannot be inferred by column position, only by column name. This problem is solved by #35798.

This PR depends on the following PRs/issues:
- #34834
- #34227
- #35579

2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
- [x] Working to identify activities
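
The documentation change in this commit passes serialized Substrait expressions to `ScanOptions` as `ByteBuffer` values. As a minimal sketch of that hand-off, the helper below copies a Base64-encoded serialized message into a direct `ByteBuffer` using only the JDK. The class name `SubstraitBuffers`, the method `toDirectBuffer`, and the placeholder payload are hypothetical and not part of this PR; the choice of a direct (off-heap) buffer is an assumption about what a JNI-backed consumer expects.

```java
import java.nio.ByteBuffer;
import java.util.Base64;

// Hypothetical helper, not part of the Arrow API.
final class SubstraitBuffers {

    private SubstraitBuffers() {}

    /**
     * Decodes a Base64 string holding a serialized message and copies the
     * bytes into a newly allocated direct ByteBuffer.
     */
    static ByteBuffer toDirectBuffer(String base64Message) {
        byte[] decoded = Base64.getDecoder().decode(base64Message);
        // Assumption: native (JNI) code reads the bytes by address,
        // so an off-heap buffer is allocated instead of wrapping the array.
        ByteBuffer buffer = ByteBuffer.allocateDirect(decoded.length);
        buffer.put(decoded);
        // Reset position to 0 and set limit to the number of bytes written.
        buffer.flip();
        return buffer;
    }

    public static void main(String[] args) {
        // Placeholder bytes only; this is NOT a valid Substrait message.
        String placeholder = Base64.getEncoder().encodeToString(new byte[] {1, 2, 3});
        ByteBuffer buffer = toDirectBuffer(placeholder);
        System.out.println("direct=" + buffer.isDirect() + ", bytes=" + buffer.remaining());
    }
}
```

A buffer produced this way could then be supplied wherever the documentation example calls `getSubstraitExpressionFilter()` or `getSubstraitExpressionProjection()`, provided the payload is a real serialized extended expression.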

### Are these changes tested?

Initial unit test added.

### Are there any user-facing changes?

No
* Closes: #34252

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
5 people committed Sep 20, 2023
1 parent 6b1bcae commit 00481a2
Showing 7 changed files with 770 additions and 24 deletions.
29 changes: 24 additions & 5 deletions docs/source/java/dataset.rst
@@ -132,12 +132,10 @@ within method ``Scanner::schema()``:
 .. _java-dataset-projection:

-Projection
-==========
+Projection (Subset of Columns)
+==============================

-User can specify projections in ScanOptions. For ``FileSystemDataset``, only
-column projection is allowed for now, which means, only column names
-in the projection list will be accepted. For example:
+User can specify projections in ScanOptions. For example:

 .. code-block:: Java
@@ -159,6 +157,27 @@ Or use shortcut construtor:
 Then all columns will be emitted during scanning.

+Projection (Produce New Columns) and Filters
+============================================
+
+User can specify projections (new columns) or filters in ScanOptions using Substrait. For example:
+
+.. code-block:: Java
+
+    ByteBuffer substraitExpressionFilter = getSubstraitExpressionFilter();
+    ByteBuffer substraitExpressionProject = getSubstraitExpressionProjection();
+    // Use Substrait APIs to create an Expression and serialize to a ByteBuffer
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .substraitExpressionProjection(getSubstraitExpressionProjection())
+        .build();
+
+.. seealso::
+
+   :doc:`Executing Projections and Filters Using Extended Expressions <substrait>`
+        Projections and Filters using Substrait.
+
 Read Data from HDFS
 ===================
