-
Notifications
You must be signed in to change notification settings - Fork 1.8k
feat: extend substrait type support, including type variations #5775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
waynexia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @andygrove @nseekhao, please take a look when you have time, thanks!
There are some issues with testing and would greatly appreciate any advice you may have ❤️
| // Literal in this cases have incorrect type. This is not a good case | ||
| #[tokio::test] | ||
| async fn other_type_literal() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Literals in this case have different types with the columns 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use the arrow_cast function to get specific types -- like arrow_cast('foo', LargeUtf8) for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL, thanks for your help!
| Ok(()) | ||
| } | ||
|
|
||
| async fn roundtrip_all_types(sql: &str) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I wrote cases like "select cast(a as boolean) from data" I got an error saying that there is no column a in the table (however "select a from data" works). I'm not sure if it's expected. But only cast uses type for now. Do you have any suggestions about testing new types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the following syntax work?
select a::boolean from data work?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works! I rewrite this case. But only a few SQL types are supported for now (I guess they are defined in this function).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the ::boolean syntax is in terms of SQL types (not Arrow types). The SQL --> Arrow type mapping are in
https://arrow.apache.org/datafusion/user-guide/sql/data_types.html
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andygrove / @waynexia is there someone who knows the datafusion-substrait crate that can help review this PR?
cc @mbrobbel for who has contributed to this crate. I can also split this PR into two parts for types and literals to reduce the review burden. |
|
If no one else reviews this I will find time to do so later today |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like an improvement to me -- thank you very much @waynexia
I am concerned about the use of unsafe / transmute but that was not introduced by this PR and thus I don't think it is needed to merge
datafusion/substrait/src/lib.rs
Outdated
| pub mod logical_plan; | ||
| pub mod physical_plan; | ||
| pub mod serializer; | ||
| mod variation_const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of these modules are pub but the newly added variation_const is not -- is that intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
|
|
||
| //! Type variation constants | ||
| pub const DEFAULT_TYPE_REF: u32 = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with substrait's type variations --perhaps it would help to add a reference here to the subtrait documents that describe this and what they mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I add some documents and links to mod-level documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very helpful
| }, | ||
| Some(LiteralType::I32(n)) => match lit.type_variation_reference { | ||
| DEFAULT_TYPE_REF => ScalarValue::Int32(Some(*n)), | ||
| UNSIGNED_INTEGER_TYPE_REF => ScalarValue::UInt32(Some(unsafe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of unsafe here seems strange to me for two reasons:
- It is inconsistent with Int16/UInt16 use
n as i16/n as u16 - it doesn't seem necessary --it is perfectly safe to cast i32 to u32 -- for example https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=cfd7670d0675e2353f4f8c58616b3cec
I realize you just moved this code around, but I think it would be good to get rid of transmute
This comment applies to Int64 as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is perfectly safe to cast i32 to u32
This is because the "producer" (encoder) transmutes u32 to i32, which changes the actual number. E.g., when encoding an u32::MAX. Thus here we should also transmute back to get the original value.
These transmutations are dangerous and not straightforward. What do you think about replacing them with this cast method? Though it's also a transmute underlying, it would make the code looks more concrete like the following, and I believe those type annotations (::<u32,i32>) can be omitted.
// producer
ScalarValue::UInt32(Some(n)) => LiteralType::I32(cast::<u32,i32>(n)}),
// consumer
UNSIGNED_INTEGER_TYPE_REF => ScalarValue::UInt32(Some(cast::<i32,u32>(n))),There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found a way to remove the unsafe code, see #5946
| let ctx = SessionContext::new(); | ||
| let mut explicit_options = CsvReadOptions::new(); | ||
| let schema = Schema::new(vec![ | ||
| Field::new("a", DataType::Boolean, true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally find this test setup easier to use and understand if the fields were named for the type (for exmaple "bool_col" rather than "a"
The reason is that using just letters means that it is hard to understand what a particular test is testing without looking at this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion, it looks cleaner now 👍
| f = 0 AND | ||
| g = 0 AND | ||
| h = 0 AND | ||
| i = 0;", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is any reason for not testing j and k (f32 and f64) as well?
| // Literal in this cases have incorrect type. This is not a good case | ||
| #[tokio::test] | ||
| async fn other_type_literal() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use the arrow_cast function to get specific types -- like arrow_cast('foo', LargeUtf8) for example
| Ok(()) | ||
| } | ||
|
|
||
| async fn roundtrip_all_types(sql: &str) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the following syntax work?
select a::boolean from data work?Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
waynexia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing this @alamb ❤️ I've pushed a new commit to solve comments, please take a look
datafusion/substrait/src/lib.rs
Outdated
| pub mod logical_plan; | ||
| pub mod physical_plan; | ||
| pub mod serializer; | ||
| mod variation_const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
| }, | ||
| Some(LiteralType::I32(n)) => match lit.type_variation_reference { | ||
| DEFAULT_TYPE_REF => ScalarValue::Int32(Some(*n)), | ||
| UNSIGNED_INTEGER_TYPE_REF => ScalarValue::UInt32(Some(unsafe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is perfectly safe to cast i32 to u32
This is because the "producer" (encoder) transmutes u32 to i32, which changes the actual number. E.g., when encoding an u32::MAX. Thus here we should also transmute back to get the original value.
These transmutations are dangerous and not straightforward. What do you think about replacing them with this cast method? Though it's also a transmute underlying, it would make the code looks more concrete like the following, and I believe those type annotations (::<u32,i32>) can be omitted.
// producer
ScalarValue::UInt32(Some(n)) => LiteralType::I32(cast::<u32,i32>(n)}),
// consumer
UNSIGNED_INTEGER_TYPE_REF => ScalarValue::UInt32(Some(cast::<i32,u32>(n))),|
|
||
| //! Type variation constants | ||
| pub const DEFAULT_TYPE_REF: u32 = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I add some documents and links to mod-level documentation.
| let ctx = SessionContext::new(); | ||
| let mut explicit_options = CsvReadOptions::new(); | ||
| let schema = Schema::new(vec![ | ||
| Field::new("a", DataType::Boolean, true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion, it looks cleaner now 👍
| // Literal in this cases have incorrect type. This is not a good case | ||
| #[tokio::test] | ||
| async fn other_type_literal() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL, thanks for your help!
| Ok(()) | ||
| } | ||
|
|
||
| async fn roundtrip_all_types(sql: &str) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works! I rewrite this case. But only a few SQL types are supported for now (I guess they are defined in this function).
|
Thanks again @waynexia |
…e#5775) * feat: extend types Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * feat: extend literal Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix clippy warnings Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * add some cases Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * clean up Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * add documentations and accomplish test Signed-off-by: Ruihang Xia <waynestxia@gmail.com> --------- Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Which issue does this PR close?
Closes #5717.
Rationale for this change
Extend the type and literal support in substrait. Most of simple type/literal are supported (exception:
Date64literal), and compound types exceptFixedSizeUtf8,FixedSizeList,Dictionaryetc.What changes are included in this PR?
Are these changes tested?
Some of, see comment below
Are there any user-facing changes?
Not breaking