Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add Substrait support for arrow-specific types (non-paramaeterized) #40695

Closed
westonpace opened this issue Mar 20, 2024 · 1 comment
Closed

Comments

@westonpace
Copy link
Member

Describe the enhancement requested

The Arrow<->Substrait conversion currently only works with the types that are supported by both Arrow and Substrait. I would like to use Substrait expression conversion for filter pushdown (polars can convert to a pyarrow.compute expression, and datafusion can consume a substrait expression, and I would like to bridge the two).

This is currently blocked by the fact that polars uses large_string by default and pyarrow.compute expressions fail to serialize if they contain a large_string type.

Since I know that both the source and destination are arrow I should be able to use the arrow-specific types (substrait will consider them user defined types).

To simplify things, this request only asks for support for non-parameterized types. Arrow-specific parameterized types (e.g. decimal256, large_string, etc.) can come in a future request.

Component(s)

C++

westonpace added a commit that referenced this issue Apr 11, 2024
### Rationale for this change

See #40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: #40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 16.0.0 milestone Apr 11, 2024
@westonpace
Copy link
Member Author

Issue resolved by pull request 40696
#40696

raulcd pushed a commit that referenced this issue Apr 11, 2024
### Rationale for this change

See #40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: #40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue Apr 15, 2024
### Rationale for this change

See apache#40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: apache#40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
### Rationale for this change

See apache#40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: apache#40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
### Rationale for this change

See apache#40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: apache#40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant