Skip to content

Conversation

@waynexia
Copy link
Member

@waynexia waynexia commented Mar 29, 2023

Which issue does this PR close?

Closes #5717.

Rationale for this change

Extend the type and literal support in substrait. Most of simple type/literal are supported (exception: Date64 literal), and compound types except FixedSizeUtf8, FixedSizeList, Dictionary etc.

What changes are included in this PR?

Are these changes tested?

Some of, see comment below

Are there any user-facing changes?

Not breaking

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@github-actions github-actions bot added the substrait Changes to the substrait crate label Mar 29, 2023
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia waynexia marked this pull request as ready for review March 31, 2023 13:28
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Copy link
Member Author

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andygrove @nseekhao, please take a look when you have time, thanks!

There are some issues with testing and would greatly appreciate any advice you may have ❤️

Comment on lines 317 to 319
// Literal in this cases have incorrect type. This is not a good case
#[tokio::test]
async fn other_type_literal() -> Result<()> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Literals in this case have different types with the columns 😢

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use the arrow_cast function to get specific types -- like arrow_cast('foo', LargeUtf8) for example

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL, thanks for your help!

Ok(())
}

async fn roundtrip_all_types(sql: &str) -> Result<()> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I wrote cases like "select cast(a as boolean) from data" I got an error saying that there is no column a in the table (however "select a from data" works). I'm not sure if it's expected. But only cast uses type for now. Do you have any suggestions about testing new types?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the following syntax work?

select a::boolean from data work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works! I rewrite this case. But only a few SQL types are supported for now (I guess they are defined in this function).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ::boolean syntax is in terms of SQL types (not Arrow types). The SQL --> Arrow type mapping are in

https://arrow.apache.org/datafusion/user-guide/sql/data_types.html

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andygrove / @waynexia is there someone who knows the datafusion-substrait crate that can help review this PR?

@waynexia
Copy link
Member Author

waynexia commented Apr 5, 2023

@andygrove / @waynexia is there someone who knows the datafusion-substrait crate that can help review this PR?

cc @mbrobbel for who has contributed to this crate. I can also split this PR into two parts for types and literals to reduce the review burden.

@alamb
Copy link
Contributor

alamb commented Apr 5, 2023

If no one else reviews this I will find time to do so later today

@alamb alamb changed the title feat: extend substrait types feat: extend substrait type support Apr 6, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an improvement to me -- thank you very much @waynexia

I am concerned about the use of unsafe / transmute but that was not introduced by this PR and thus I don't think it is needed to merge

pub mod logical_plan;
pub mod physical_plan;
pub mod serializer;
mod variation_const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of these modules are pub but the newly added variation_const is not -- is that intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!


//! Type variation constants
pub const DEFAULT_TYPE_REF: u32 = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with substrait's type variations --perhaps it would help to add a reference here to the subtrait documents that describe this and what they mean

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I add some documents and links to mod-level documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very helpful

},
Some(LiteralType::I32(n)) => match lit.type_variation_reference {
DEFAULT_TYPE_REF => ScalarValue::Int32(Some(*n)),
UNSIGNED_INTEGER_TYPE_REF => ScalarValue::UInt32(Some(unsafe {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of unsafe here seems strange to me for two reasons:

  1. It is inconsistent with Int16/UInt16 use n as i16 / n as u16
  2. it doesn't seem necessary --it is perfectly safe to cast i32 to u32 -- for example https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=cfd7670d0675e2353f4f8c58616b3cec

I realize you just moved this code around, but I think it would be good to get rid of transmute

This comment applies to Int64 as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is perfectly safe to cast i32 to u32

This is because the "producer" (encoder) transmutes u32 to i32, which changes the actual number. E.g., when encoding an u32::MAX. Thus here we should also transmute back to get the original value.

These transmutations are dangerous and not straightforward. What do you think about replacing them with this cast method? Though it's also a transmute underlying, it would make the code looks more concrete like the following, and I believe those type annotations (::<u32,i32>) can be omitted.

// producer
ScalarValue::UInt32(Some(n)) => LiteralType::I32(cast::<u32,i32>(n)}),

// consumer
UNSIGNED_INTEGER_TYPE_REF => ScalarValue::UInt32(Some(cast::<i32,u32>(n))),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found a way to remove the unsafe code, see #5946

let ctx = SessionContext::new();
let mut explicit_options = CsvReadOptions::new();
let schema = Schema::new(vec![
Field::new("a", DataType::Boolean, true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally find this test setup easier to use and understand if the fields were named for the type (for exmaple "bool_col" rather than "a"

The reason is that using just letters means that it is hard to understand what a particular test is testing without looking at this function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, it looks cleaner now 👍

f = 0 AND
g = 0 AND
h = 0 AND
i = 0;",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is any reason for not testing j and k (f32 and f64) as well?

Comment on lines 317 to 319
// Literal in this cases have incorrect type. This is not a good case
#[tokio::test]
async fn other_type_literal() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use the arrow_cast function to get specific types -- like arrow_cast('foo', LargeUtf8) for example

Ok(())
}

async fn roundtrip_all_types(sql: &str) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the following syntax work?

select a::boolean from data work?

@alamb alamb changed the title feat: extend substrait type support feat: extend substrait type support, including type variations Apr 6, 2023
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Copy link
Member Author

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing this @alamb ❤️ I've pushed a new commit to solve comments, please take a look

pub mod logical_plan;
pub mod physical_plan;
pub mod serializer;
mod variation_const;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

},
Some(LiteralType::I32(n)) => match lit.type_variation_reference {
DEFAULT_TYPE_REF => ScalarValue::Int32(Some(*n)),
UNSIGNED_INTEGER_TYPE_REF => ScalarValue::UInt32(Some(unsafe {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is perfectly safe to cast i32 to u32

This is because the "producer" (encoder) transmutes u32 to i32, which changes the actual number. E.g., when encoding an u32::MAX. Thus here we should also transmute back to get the original value.

These transmutations are dangerous and not straightforward. What do you think about replacing them with this cast method? Though it's also a transmute underlying, it would make the code looks more concrete like the following, and I believe those type annotations (::<u32,i32>) can be omitted.

// producer
ScalarValue::UInt32(Some(n)) => LiteralType::I32(cast::<u32,i32>(n)}),

// consumer
UNSIGNED_INTEGER_TYPE_REF => ScalarValue::UInt32(Some(cast::<i32,u32>(n))),


//! Type variation constants
pub const DEFAULT_TYPE_REF: u32 = 0;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I add some documents and links to mod-level documentation.

let ctx = SessionContext::new();
let mut explicit_options = CsvReadOptions::new();
let schema = Schema::new(vec![
Field::new("a", DataType::Boolean, true),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, it looks cleaner now 👍

Comment on lines 317 to 319
// Literal in this cases have incorrect type. This is not a good case
#[tokio::test]
async fn other_type_literal() -> Result<()> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL, thanks for your help!

Ok(())
}

async fn roundtrip_all_types(sql: &str) -> Result<()> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works! I rewrite this case. But only a few SQL types are supported for now (I guess they are defined in this function).

@alamb alamb merged commit 98197d5 into apache:main Apr 10, 2023
@alamb
Copy link
Contributor

alamb commented Apr 10, 2023

Thanks again @waynexia

@waynexia waynexia deleted the substrait-type branch April 11, 2023 02:58
korowa pushed a commit to korowa/arrow-datafusion that referenced this pull request Apr 13, 2023
…e#5775)

* feat: extend types

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* feat: extend literal

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* fix clippy warnings

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* add some cases

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* clean up

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* add documentations and accomplish test

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

---------

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Timestamp types for substrait

2 participants