[C++] Add Substrait support for arrow-specific types (non-paramaeterized) #40695

westonpace · 2024-03-20T23:37:54Z

Describe the enhancement requested

The Arrow<->Substrait conversion currently only works with the types that are supported by both Arrow and Substrait. I would like to use Substrait expression conversion for filter pushdown (polars can convert to a pyarrow.compute expression, and datafusion can consume a substrait expression, and I would like to bridge the two).

This is currently blocked by the fact that polars uses large_string by default and pyarrow.compute expressions fail to serialize if they contain a large_string type.

Since I know that both the source and destination are arrow I should be able to use the arrow-specific types (substrait will consider them user defined types).

To simplify things, this request only asks for support for non-parameterized types. Arrow-specific parameterized types (e.g. decimal256, large_string, etc.) can come in a future request.

Component(s)

C++

### Rationale for this change See #40695 ### What changes are included in this PR? This PR does a few things: * Substrait is upgraded to the latest version * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611). * Support is added for the following arrow-specific types: * fp16 * date_millis * time_seconds * time_millis * time_nanos * large_string * large_binary When adding support for the new timestamp types I also relaxed the restrictions on the time zone column. Substrait puts time zone information in the function and not the type. In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`. However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone. With this latest change, we treat any non-empty timezone as a timezone_tz type. In addition, I have enabled conversions from "encoded types" to their unencoded representation. E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`. At a logical expression / plan perspective these encodings are irrelevant. If anything, they may belong in a more physical plan representation. Should a need for them arise we can dig into it more later. However, I believe it is better to err on the side of generating "something" rather than failing in these cases. I don't consider this last change critical and can back it out if need be. ### Are these changes tested? Yes, I added new unit tests ### Are there any user-facing changes? Yes, via the Substrait conversion. These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported". * GitHub Issue: #40695 Lead-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>

westonpace · 2024-04-11T02:48:12Z

Issue resolved by pull request 40696
#40696

### Rationale for this change See #40695 ### What changes are included in this PR? This PR does a few things: * Substrait is upgraded to the latest version * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611). * Support is added for the following arrow-specific types: * fp16 * date_millis * time_seconds * time_millis * time_nanos * large_string * large_binary When adding support for the new timestamp types I also relaxed the restrictions on the time zone column. Substrait puts time zone information in the function and not the type. In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`. However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone. With this latest change, we treat any non-empty timezone as a timezone_tz type. In addition, I have enabled conversions from "encoded types" to their unencoded representation. E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`. At a logical expression / plan perspective these encodings are irrelevant. If anything, they may belong in a more physical plan representation. Should a need for them arise we can dig into it more later. However, I believe it is better to err on the side of generating "something" rather than failing in these cases. I don't consider this last change critical and can back it out if need be. ### Are these changes tested? Yes, I added new unit tests ### Are there any user-facing changes? Yes, via the Substrait conversion. These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported". * GitHub Issue: #40695 Lead-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>

### Rationale for this change See apache#40695 ### What changes are included in this PR? This PR does a few things: * Substrait is upgraded to the latest version * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611). * Support is added for the following arrow-specific types: * fp16 * date_millis * time_seconds * time_millis * time_nanos * large_string * large_binary When adding support for the new timestamp types I also relaxed the restrictions on the time zone column. Substrait puts time zone information in the function and not the type. In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`. However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone. With this latest change, we treat any non-empty timezone as a timezone_tz type. In addition, I have enabled conversions from "encoded types" to their unencoded representation. E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`. At a logical expression / plan perspective these encodings are irrelevant. If anything, they may belong in a more physical plan representation. Should a need for them arise we can dig into it more later. However, I believe it is better to err on the side of generating "something" rather than failing in these cases. I don't consider this last change critical and can back it out if need be. ### Are these changes tested? Yes, I added new unit tests ### Are there any user-facing changes? Yes, via the Substrait conversion. These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported". * GitHub Issue: apache#40695 Lead-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>

westonpace added the Type: enhancement label Mar 20, 2024

github-actions bot added the Component: C++ label Mar 20, 2024

westonpace mentioned this issue Mar 20, 2024

GH-40695 [C++] Expand Substrait type support #40696

Merged

github-actions bot assigned westonpace Mar 20, 2024

This was referenced Mar 22, 2024

[C++] Add Substrait support for arrow-specific types (paramaeterized) #40740

Open

[C++] Add support for precision timestamp literals #40741

Open

westonpace added this to the 16.0.0 milestone Apr 11, 2024

westonpace closed this as completed Apr 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] Add Substrait support for arrow-specific types (non-paramaeterized) #40695

[C++] Add Substrait support for arrow-specific types (non-paramaeterized) #40695

westonpace commented Mar 20, 2024

westonpace commented Apr 11, 2024

[C++] Add Substrait support for arrow-specific types (non-paramaeterized) #40695

[C++] Add Substrait support for arrow-specific types (non-paramaeterized) #40695

Comments

westonpace commented Mar 20, 2024

Describe the enhancement requested

Component(s)

westonpace commented Apr 11, 2024