GH-40695 [C++] Expand Substrait type support #40696
Thanks for working on this!
# of values. For example, unsigned integer types are very similar to their integer
# counterparts, but have a different range of values. These types are defined here
# as extension types.
#
Could you explain why `large_binary` should be an extension type but `binary_view` should only be an encoding? I think it'd provide a useful guide for future authors who need to pick where to put a type.
I think I explain this above (I have updated the wording slightly)?
# Certain Arrow data types are, from Substrait's point of view, encodings.
# These include dictionary, the view types (e.g. binary view, list view),
# and REE.
#
# These types are not logically distinct from the type they are encoding.
# Specifically, the types meet the following criteria:
# * There is no value in the decoded type that cannot be represented
# as a value in the encoded type and vice versa.
# * Functions have the same meaning when applied to the encoded type
#
# Note: if two types have a different range (e.g. string and large_string) then
# they do not satisfy the above criteria and are not encodings.
#
# These types will never have a Substrait equivalent. In the Substrait point
# of view these are execution details.
So `large_string` and `string` are different types because `concat(<string-with-2B-characters>, 'x')` will have a different output for `string` and `large_string` (it will output an error given `string` and a valid value given `large_string`). However, there are no possible inputs that could lead to a different function output between `string` and `string_view`.
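The range argument above can be illustrated with a minimal sketch. It is a model only, not Arrow or Substrait code: `ConcatLength` is a hypothetical helper that computes the byte length of a concatenation when lengths must fit in a given offset type, and it returns `std::nullopt` where a real kernel would report an error.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical model: byte length of concat(lhs, rhs) when every length must
// fit in `Offset` (int32_t stands in for string, int64_t for large_string).
template <typename Offset>
std::optional<Offset> ConcatLength(Offset lhs, Offset rhs) {
  assert(lhs >= 0 && rhs >= 0);  // lengths are never negative
  // Overflow check done without actually overflowing.
  if (rhs > std::numeric_limits<Offset>::max() - lhs) return std::nullopt;
  return static_cast<Offset>(lhs + rhs);
}
```

With a 32-bit offset the call fails once the combined length passes the 32-bit maximum, while the 64-bit variant returns a value for the same inputs. A view encoding changes neither the inputs nor the outputs, which is why it does not count as a distinct type under the criteria quoted above.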
Makes sense, thanks. I had forgotten that Substrait specifies that strings may not be longer than 2GB.
Force-pushed from 67839a1 to 962b2a8.
Only nitpicks.
google::protobuf::UInt64Value value;
if (!user_defined_->value().UnpackTo(&value)) {
  return Status::Invalid(
      "Failed to unpack user defined integer literal to UInt64Value");
nit: a function to create this Status taking arguments like `"integer literal"` and `"UInt64Value"` would reduce the string literal bloat in the binary because `"Failed to unpack user defined "` wouldn't have to be inlined so many times in the literals.
Adding `ARROW_PREDICT_FALSE` to the condition would also reduce the inlining of code in the block because it becomes cold code for the compiler.
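A rough sketch of how the two suggestions above might fit together. Everything here is a stand-in rather than the PR's code: `SketchStatus` replaces `arrow::Status`, `SKETCH_PREDICT_FALSE` replaces `ARROW_PREDICT_FALSE`, and `FailedToUnpack` and `CheckUnpacked` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>

// Stand-in for ARROW_PREDICT_FALSE (the real macro comes from Arrow's headers).
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREDICT_FALSE(x) (__builtin_expect(!!(x), 0))
#else
#define SKETCH_PREDICT_FALSE(x) (x)
#endif

// Stand-in for arrow::Status, reduced to what this sketch needs.
struct SketchStatus {
  bool ok;
  std::string message;
};

// Hypothetical helper: the shared prefix is stored once, here, instead of
// being repeated inside a separate literal at every call site.
SketchStatus FailedToUnpack(std::string_view what, std::string_view proto_type) {
  assert(!what.empty() && !proto_type.empty());
  std::string message = "Failed to unpack user defined ";
  message.append(what).append(" to ").append(proto_type);
  return SketchStatus{false, std::move(message)};
}

// Mirrors the shape of the reviewed code: the failure branch is marked unlikely.
SketchStatus CheckUnpacked(bool unpacked) {
  if (SKETCH_PREDICT_FALSE(!unpacked)) {
    return FailedToUnpack("integer literal", "UInt64Value");
  }
  return SketchStatus{true, ""};
}
```

Each call site then carries only the short, varying fragments, and the error path is hinted as cold. Whether this measurably shrinks the binary depends on the compiler and build flags.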
  return EncodeUserDefined(*s.type, value);
}
Status Visit(const UInt32Scalar& s) {
  google::protobuf::UInt64Value value;
Shouldn't this be `UInt32Value`? And what about the 16-bit version above?
In protobuf all unsigned integers are encoded the same way and there is no uint16 or uint8 so I figured it would be simplest and most consistent to just use uint64 for everything. It's something of an arbitrary decision and, even if the receiver decodes it as a uint32, it will still work.
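To illustrate the point about encoding, here is a small sketch of the base-128 varint scheme that protobuf uses for unsigned integers on the wire. It models only the varint payload: the wrapper message framing and the `Any` type URL are not modeled, and `EncodeVarint` and `DecodeVarint` are illustrative names, not protobuf APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes `value` as a base-128 varint: 7 payload bits per byte, with the high
// bit set on every byte except the last.
std::vector<std::uint8_t> EncodeVarint(std::uint64_t value) {
  std::vector<std::uint8_t> out;
  while (value >= 0x80) {
    out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(value));
  return out;
}

// Decodes a single varint; a 64-bit value needs at most 10 bytes.
std::uint64_t DecodeVarint(const std::vector<std::uint8_t>& bytes) {
  assert(!bytes.empty() && bytes.size() <= 10);
  std::uint64_t value = 0;
  int shift = 0;
  for (std::uint8_t byte : bytes) {
    value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    shift += 7;
  }
  return value;
}
```

The encoded bytes depend only on the numeric value, not on the declared width, so a value that fits in 32 bits produces the same varint whether the sender declared it as a 32-bit or a 64-bit field.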
auto user_defined = std::make_unique<Lit::UserDefined>();
user_defined->set_type_reference(anchor);
auto value_any = std::make_unique<google::protobuf::Any>();
value_any->PackFrom(value);
user_defined->set_allocated_value(value_any.release());
lit_->set_allocated_user_defined(user_defined.release());
You can use destination-passing style to avoid extra allocations with protobuf types:
auto *user_defined = lit_->mutable_user_defined();
user_defined->set_type_reference(anchor);
user_defined->mutable_value()->PackFrom(value);
A change in style for the whole file really, so not a blocker for this PR specifically.
…. Add support for more arrow-specific types that were not previously supported
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Force-pushed from 962b2a8 to ab8237a.
Thanks @felipecrv. I've added follow-up PRs for the whole-file optimizations. I do plan on adding support for parameterized types (e.g. decimal256, etc.) in the next release and I'll take a stab at those when I do that. Otherwise, I believe I have addressed your comments and will merge on green.
Sounds great!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 7f64fff. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 11 possible false positives for unstable benchmarks that are known to sometimes produce them.
### Rationale for this change

See #40695

### What changes are included in this PR?

This PR does a few things:

* Substrait is upgraded to the latest version
* Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
* Support is added for the following arrow-specific types:
  * fp16
  * date_millis
  * time_seconds
  * time_millis
  * time_nanos
  * large_string
  * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column. Substrait puts time zone information in the function and not the type. In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")))`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type for having a non-UTC time zone. With this latest change, we treat any non-empty timezone as a timestamp_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation. E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`. From a logical expression / plan perspective these encodings are irrelevant. If anything, they may belong in a more physical plan representation. Should a need for them arise we can dig into it more later. However, I believe it is better to err on the side of generating "something" rather than failing in these cases. I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion. These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".

* GitHub Issue: #40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
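The two conversion rules described in the PR summary (encodings collapse to the type they encode, and any non-empty time zone selects the time-zone-aware timestamp) can be sketched with a toy type model. This is a sketch under stated assumptions, not the PR's implementation: `SketchType` and `ToLogicalName` are hypothetical, and the returned names (`i32`, `timestamp`, `timestamp_tz`) are illustrative labels rather than the exact Substrait messages the converter emits.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, heavily reduced stand-in for an Arrow data type.
struct SketchType {
  enum class Kind { kInt32, kTimestamp, kDictionary, kRunEndEncoded };
  Kind kind;
  std::string timezone;                    // meaningful only for kTimestamp
  std::shared_ptr<SketchType> value_type;  // meaningful only for encodings
};

// Maps a type to an illustrative logical name.
std::string ToLogicalName(const SketchType& type) {
  switch (type.kind) {
    case SketchType::Kind::kDictionary:
    case SketchType::Kind::kRunEndEncoded:
      // Encodings are execution details: recurse into the encoded type.
      assert(type.value_type != nullptr);
      return ToLogicalName(*type.value_type);
    case SketchType::Kind::kTimestamp:
      // Any non-empty time zone, UTC or not, selects the tz-aware variant.
      return type.timezone.empty() ? "timestamp" : "timestamp_tz";
    case SketchType::Kind::kInt32:
      return "i32";
  }
  return "unknown";
}
```

In this model a dictionary of `INT32` and a plain `INT32` map to the same logical name, so the encoding is dropped on conversion and is not restored on a round trip.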