[C++] Support specifying filters and projections with Substrait expressions #33985

ianmcook · 2023-02-01T15:59:30Z

Describe the enhancement requested

In addition to representing full plans, Substrait can also be used to represent expressions (see substrait-io/substrait#405). It would be nice if Dataset and Acero could consume Substrait expressions and use them to specify filters and projections.

I would love to see us expose functions that:

* Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
* Receive a list of Substrait scalar expressions and use them to project a Dataset
* Receive a Boolean-valued Substrait scalar expression and use it to add a Filter node to the ExecPlan
* Receive a list of Substrait scalar expressions and use them to add a Project node to the ExecPlan

Component(s)

C++

ianmcook · 2023-02-01T16:47:20Z

This plus substrait-io/substrait-java#128 would allow users to specify filters and projections as SQL expressions and execute them with Dataset and Acero.

westonpace · 2023-02-01T17:07:36Z

The Arrow equivalent of Substrait's expression is arrow::compute::Expression. So I think an extended expression proto file would roughly translate to std::vector<std::pair<std::string, arrow::compute::Expression>> (not actually suggesting we use this API, just describing).

If we added an API for that then those compute expressions could be used when building Acero filters & projects.

Note: this implies the user is not using Substrait to express their actual queries. That is fine: we have non-Substrait APIs for filter & project in pyarrow (e.g. Array.filter) and R (dplyr), so there is certainly room for it.

ianmcook · 2023-02-17T21:56:06Z

@westonpace to clarify: with the implementation you envision, would this also give us the ability to pass Substrait expressions to arrow::dataset::ScannerBuilder::Filter() and arrow::dataset::ScannerBuilder::Project()? That would be very, very cool indeed.

westonpace · 2023-02-17T22:37:56Z

You would need to use this variant of Project() but otherwise yes.

…ssions (#34834)

### Rationale for this change

Substrait provides a library-independent way to represent compute expressions. By serializing and deserializing pyarrow compute expressions to Substrait we can allow interoperability with other libraries.

Originally it was thought this would not be needed because users would be sending entire query plans (which contain expressions) back and forth, so there was no need to work with expressions by themselves. However, as more and more APIs and integration points emerge, it turns out there are situations where serializing expressions by themselves is useful. For example, the proposed datasets protocol, or the Java JNI datasets implementation (which uses Arrow-C++'s datasets).

### What changes are included in this PR?

In Arrow-C++ we add two new methods to serialize and deserialize a collection of named, bound expressions to Substrait's ExtendedExpression message.

In pyarrow we expose these two methods and also add utility methods to pyarrow.compute.Expression to convert a single expression to/from Substrait (these will be encoded as an ExtendedExpression message with one expression named "expression").

In addition, this PR exposed that we do not have very many bindings for arrow-functions to substrait-functions (previous work has mostly focused on the reverse direction). This PR adds many (though not all) new bindings.

In addition, this PR adds ToProto for cast and both FromProto and ToProto support for the SingularOrList expression type (we convert is_in to SingularOrList and convert SingularOrList to an or list).

This should provide support for all the sargable operators except between (there is no Arrow-C++ between function) and like (we still don't have arrow->substrait bindings for the string functions), which should be a sufficient set of expressions for a first release.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

There are new features, as described above, but no backwards incompatible changes.

### Caveats

There are a fair number of minor inconsistencies or surprises, many of which can be smoothed over by follow-up work.

#### Bound Expressions

Arrow-C++ has long had a distinction between "unbound expressions" (e.g. `a + b`) and "bound expressions" (e.g. `a:i32 + b:i32`). A bound expression is an expression that has been bound to a schema of some kind. Field references are resolved and the output type is known for every node of the AST. Pyarrow has hidden this complexity and most pyarrow compute expressions that the user encounters will be unbound expressions.

Substrait is only capable (currently) of representing bound expressions. As a result, in order to serialize expressions, the user will need to provide an input schema. This can be an inconvenience for some workflows. To resolve this, I would like to eventually add support for unbound expressions to Substrait (substrait-io/substrait#515).

Another minor annoyance of bound expressions is that an unbound pyarrow.compute.Expression object will not be equal to a bound pyarrow.compute.Expression object. It would make testing easier if we had a `pyarrow.compute.Expression.equals` variant that did not examine bound fields.

#### Named field references

Pyarrow datasets users are used to working with named field references. For example, one can set a filter `pc.equal(ds.field("x"), 7)`. Substrait, since it requires everything to be bound, considers named references to be superfluous and does everything in terms of numeric indices into the base schema. So the above expression, after round tripping, would become something like `pc.equal(ds.field(3), 7)` (assuming `"x"` is at index `3` in the schema used for serialization).

This is something that can be overcome in the future if Substrait adds support for unbound expressions. Or, if that doesn't happen, it could still be implemented as a Substrait expression hint (this would allow named references to be used even if the user wants to work with bound expressions).

#### UDFs

UDFs ARE supported by this PR. This covers both "builtin arrow functions that do not exist in substrait (e.g. shift_left)" and "custom UDFs added with `register_scalar_function`". By default, UDFs will not be allowed when converting to Substrait because the resulting message would not be portable (e.g. you can't expect an external system to know about your custom UDFs). However, you can set the `allow_udfs` flag to True and these will be allowed. The Substrait representation will have the URI `urn:arrow:substrait_simple_extension_function`.

**Options**: Although UDFs are allowed, we do not yet support UDFs that take function options. These are trickier to convert to Substrait (though it should be possible in the future if someone is motivated enough).

#### Rough Edges

There are a few corner cases:

* The function `is_in` converts to Substrait's `SingularOrList`. On conversion back to Arrow this becomes an or list. In other words, the function `is_in(5, [1, 2, 5])` converts to `5 == 1 || 5 == 2 || 5 == 5`. This is because Substrait's or list is more expressive and allows things like `5 == field_ref(0) || 5 == 7`, which cannot be expressed as an `is_in` function.
* Arrow functions can either be converted to Substrait or are considered UDFs. However, there are a small number of functions which can "sometimes" be converted to Substrait depending on the function options. At the moment I think this is only the `is_null` function. The `is_null` function has an option `nan_is_null` which will allow you to consider `NaN` as a null value. Substrait has no single function that evaluates both `NULL` and `NaN` as true. In the meantime you can use `is_null || is_nan`. In the future, should someone want to, they could add special logic to convert this case.
* Closes: #33985

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
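
Two of the caveats above (named references becoming numeric indices, and `is_in` coming back as an or list) may be easier to follow with a small model. The following is a library-free sketch in plain Python, not the Arrow or Substrait implementation: the classes `Field`, `Literal`, and `Call` and the helpers `bind`, `expand_is_in`, and `evaluate` are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Sequence, Tuple, Union


# Hypothetical toy expression nodes; these are NOT pyarrow or Substrait types.
@dataclass(frozen=True)
class Field:
    ref: Union[str, int]  # a name while unbound, an index once bound


@dataclass(frozen=True)
class Literal:
    value: Any


@dataclass(frozen=True)
class Call:
    function: str
    args: Tuple[Any, ...]


def bind(expr, schema_names: Sequence[str]):
    """Resolve named field references to positional indices in the schema."""
    if isinstance(expr, Field):
        if isinstance(expr.ref, str):
            return Field(list(schema_names).index(expr.ref))
        return expr
    if isinstance(expr, Call):
        return Call(expr.function, tuple(bind(a, schema_names) for a in expr.args))
    return expr  # literals need no binding


def expand_is_in(value, options: Sequence[Any]):
    """Rewrite is_in(value, options) as equal(...) calls joined with 'or'."""
    if not options:
        return Literal(False)
    result = Call("equal", (value, Literal(options[0])))
    for option in options[1:]:
        result = Call("or", (result, Call("equal", (value, Literal(option)))))
    return result


def evaluate(expr, row: Sequence[Any]):
    """Evaluate a bound expression against one row given as a tuple of values."""
    if isinstance(expr, Literal):
        return expr.value
    if isinstance(expr, Field):
        return row[expr.ref]
    args = [evaluate(a, row) for a in expr.args]
    if expr.function == "equal":
        return args[0] == args[1]
    if expr.function == "or":
        return args[0] or args[1]
    raise ValueError(f"unsupported function: {expr.function}")


# "x" sits at index 3, matching the example in the text above.
schema_names = ["a", "b", "c", "x"]
unbound = Call("equal", (Field("x"), Literal(7)))
bound = bind(unbound, schema_names)

# is_in(field 3, [1, 2, 5]) becomes field3 == 1 || field3 == 2 || field3 == 5.
or_list = expand_is_in(Field(3), [1, 2, 5])
```

In this model `bound` compares unequal to `unbound`, which mirrors the testing annoyance described under "Bound Expressions": the two expressions mean the same thing for this schema, but one carries a name and the other an index.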

ianmcook added the Type: enhancement label Feb 1, 2023

github-actions bot added the Component: C++ label Feb 1, 2023

ianmcook mentioned this issue Feb 17, 2023

[Java] Push-down filtering in Java #14782

Open

ianmcook mentioned this issue Feb 18, 2023

[Java] Filter and project Datasets with Substrait expressions #34252

Closed

github-actions bot mentioned this issue Apr 1, 2023

GH-33985: [C++] Add substrait serialization/deserialization for expressions #34834

Merged

github-actions bot assigned westonpace Apr 1, 2023

westonpace closed this as completed in #34834 Aug 22, 2023

westonpace added this to the 14.0.0 milestone Aug 22, 2023
