Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Substrait's VirtualTables #10531

Merged
merged 10 commits into from
May 28, 2024

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented May 15, 2024

Which issue does this PR close?

Closes #10530

This (well, now together with the earlier PRs for chunks of this) is my first (but hopefully not last!) contribution to DataFusion, and I'm pretty new with Rust as well - so all feedback on both style and details is very much welcome!

Rationale for this change

What changes are included in this PR?

Adds support for Substrait's VirtualTables, ie. tables with data baked-in into the Substrait plan instead of being read from a source.

Adds conversion in both ways (Substrait -> DataFusion and DataFusion -> Substrait)

Adds support for both empty VirtualTables (mapped into LogicalPlan::EmptyTable) and those containing values (mapped into LogicalPlan::Values).

Are these changes tested?

Tested through round-trip tests, similar to other Substrait support.

Are there any user-facing changes?

  • More support through Substrait

@@ -1277,6 +1407,56 @@ pub(crate) fn from_substrait_literal(lit: &Literal) -> Result<ScalarValue> {
s,
)
}
Some(LiteralType::List(l)) => {
Copy link
Contributor Author

@Blizzara Blizzara May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily need these for this PR, so if it's preferrable I'm happy to split them into separate PRs - I just added them while coding to confirm that lists and structs inside lists are handled correctly.

Same applies for the ones in producer.rs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is convenient, using separate PRs would make the review process easier. Perhaps one PR could focus on the virtual table, while the other two could address the list and struct separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split those into #10615, #10622, #10640

let ctx = create_context().await?;
let row1 = vec![lit(1), lit("a")];
let row2 = vec![lit(2), lit("b")];
let plan = LogicalPlanBuilder::values(vec![row1, row2])?.build()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find VALUES in the DataFusion SQL - is it not implemented yet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is implemented, and you can use it like this.

roundtrip("VALUES (1, 'a', [[1,2,3], []], STRUCT(true, 1))").await

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I'm blind, thanks!

I tried switching to that, but it produces somewhat ugly plans as it uses make_array and struct as UDFs, making the assert fail:

  left: "Values: (Int64(1), Utf8(\"a\"), List([[-213.1, , 5.5, 2.0, 1.0], []]) AS make_array(make_array(Float64(-213.1),NULL,Float64(5.5),Float64(2),Float64(1)),make_array()), Struct({c0:true,c1:1}) AS struct(Boolean(true),Int64(1))), (Int64(NULL), Utf8(NULL), List(), Struct({c0:,c1:}))"
 right: "Values: (Int64(1), Utf8(\"a\"), List([[-213.1, , 5.5, 2.0, 1.0], []]), Struct({c0:true,c1:1})), (Int64(NULL), Utf8(NULL), List(), Struct({c0:,c1:}))"

The AS aliases it produces seem to get ignored (the column names are still column1, column2, column3, ..), so it doesn't seem to make sense to map them into Substrait (and I'm not even sure if that'd be possible). Any ideas how to go around that?

Copy link
Member

@jonahgao jonahgao May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. For Values plan, these aliases should be able to be ignored.
I think we could use assert_expected_plan during testing to ignore aliases, for example:

#[tokio::test]
async fn roundtrip_values() -> Result<()> {
    assert_expected_plan(
        "VALUES (1, 'a', [[1,2,3], []], STRUCT(true, 1))",
        "Values: (Int64(1), Utf8(\"a\"), List([[1, 2, 3], []]), Struct({c0:true,c1:1}))",
    )
    .await
}

Additionally, aliases should be ignored when handling LogicalPlan::Values in to_substrait_rel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, that works - 18b7f08.

@Blizzara Blizzara marked this pull request as ready for review May 16, 2024 16:24
@alamb
Copy link
Contributor

alamb commented May 17, 2024

Thanks @Blizzara -- I started the CI on this PR

@Blizzara
Copy link
Contributor Author

@jonahgao @alamb I think this is ready by me :)

Some(LiteralType::Null(ntype)) => from_substrait_null(ntype)?,
Some(LiteralType::List(l)) => {
let element_dt: Option<&DataType> = match dt {
Some(DataType::List(l)) => Ok(Some(l.data_type())),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if dt is redundant here and can be removed. We still need to check whether it is a LargeList, just as we check type_variation_reference below.

Is it possible to only extract element_dt from elements[0]? I guess that elements are not empty because substrait has a special type called EmptyList.

 pub struct List {
            /// A homogeneously-typed list of one or more expressions that form the
            /// list entries. To specify an empty list, use Literal.empty_list
            /// (otherwise type information would be missing).
            #[prost(message, repeated, tag="1")]
            pub values: ::prost::alloc::vec::Vec<super::super::Expression>,
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we already do that (look at the first element) if dt is not given. I think the reason I included dt was because Substrait types don't include the field names for structs, so not using dt would mean using something like names and name_idx instead (like in from_substrait_type_with_names).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your explanation @Blizzara . My concern is that using two types for List simultaneously is a bit verbose unless it is necessary.

I tested the following two cases: the first one panicked, and the second one also failed.

#[tokio::test]
async fn roundtrip_list_iteral() -> Result<()> {
    roundtrip("SELECT [] from data").await
}

#[tokio::test]
async fn roundtrip_struct_iteral() -> Result<()> {
    roundtrip("select struct(1 as name0, 3.14 as name1, 'e', true as name3) from data")
        .await
}
---- cases::roundtrip_logical_plan::roundtrip_list_iteral stdout ----
thread 'cases::roundtrip_logical_plan::roundtrip_list_iteral' panicked at consumer.rs:1425:41:
index out of bounds: the len is 0 but the index is 0

I think lists and structs are more complex than the virtual table, so it's better for them to have separate PRs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was because I wasn't handling empty lists properly (producer created Lists rather than EmptyList) - that was fixed in the Lists PR. Thanks for catching!

.expect("expected schema to exist for virtual table");

let fields = from_substrait_named_struct(&base_schema)?;
let schema = DFSchemaRef::new(DFSchema::try_from(Schema::new(fields))?);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that the schema here may not be necessary, perhaps it can be restored through LogicalPlanBuilder::values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably could, except for the column names - LogicalPlanBuilder::values names everything to just column1 etc. Do you see some benefit to using that (and then handling the column names somehow) instead of passing in the schema explicitly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I once wanted to remove the schema, but we need to use it to pass the struct's names, so the current approach should be fine.

@Blizzara
Copy link
Contributor Author

@jonahgao Thanks for the review - I pushed a new version but it builds on top of #10622 and #10640 so will need those merged and this rebased first before this PRs looks clean.

This is mostly what it used to be with one more major change: I removed passing the datatype into from_substrait_literal and instead do the same thing with names as I did for from_substrait_type. That maybe seems a bit cleaner (even though I dislike the name handling, but I don't see a better way to deal with it based on how it works in Substrait :/)

@Blizzara Blizzara marked this pull request as draft May 23, 2024 16:28
@Blizzara Blizzara force-pushed the avo/substrait-virtual-tables branch 4 times, most recently from 01ca26f to 908e533 Compare May 24, 2024 14:37
Adds support for Substrait's VirtualTables, ie. tables with data baked-in into the Substrait plan instead of being read from a source.

Adds conversion in both ways (Substrait -> DataFusion and DataFusion -> Substrait)
and a roundtrip test.
@Blizzara Blizzara force-pushed the avo/substrait-virtual-tables branch from 908e533 to 02e7bc0 Compare May 24, 2024 14:56
@Blizzara Blizzara marked this pull request as ready for review May 24, 2024 14:56
@Blizzara
Copy link
Contributor Author

@jonahgao @alamb this last one is ready now too :)

}
Ok(names)
}
DataType::List(l) => names_dfs(l.data_type()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to consider DataType::LargeList here because to_substrait_type supports it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed 9c83ba4

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to create a LargeList using VALUES for testing, without writing a billion values in there? 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arrow_cast is a solution .

VALUES(arrow_cast([1,2], 'LargeList(Int64)'));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I added that in 4b8babe, though it seems that arrow_cast doesn't yet support structs, so the test still wouldn't have caught this. Unless you can think of some workaround? I'm not familiar at all with arrow_cast and the like 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so too, let's keep it as a TODO.


#[tokio::test]
async fn roundtrip_empty_relation() -> Result<()> {
roundtrip("SELECT * FROM (VALUES (1)) LIMIT 0").await
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've slightly modified this test and printed out the schema on the consumer side, but it seems to be incorrect.

#[tokio::test]
async fn roundtrip_empty_relation() -> Result<()> {
    roundtrip("SELECT * FROM (VALUES ([STRUCT(1 as a)], 2)) LIMIT 0").await
}
plan2 schema: DFSchema { inner: Schema { fields: [
Field { name: "column1", data_type: List(Field { name: "item", data_type: Struct([Field { name: "c0", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 
# Its name should be "column2"
Field { name: "a", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, 
field_qualifiers: [None, None], functional_dependencies: FunctionalDependencies { deps: [] } }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need a test to verify the schemas. For this test, the plan str on both sides would just be EmptyRelation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed in 24d07c9 and added schema verification to most tests

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me👍. Thank you for your contribution @Blizzara .

@alamb alamb merged commit 2762754 into apache:main May 28, 2024
23 checks passed
@Blizzara Blizzara deleted the avo/substrait-virtual-tables branch May 28, 2024 15:00
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Add support for Substrait VirtualTables

Adds support for Substrait's VirtualTables, ie. tables with data baked-in into the Substrait plan instead of being read from a source.

Adds conversion in both ways (Substrait -> DataFusion and DataFusion -> Substrait)
and a roundtrip test.

* fix clippy

* Add support for empty relations

* Fix consuming Structs inside Lists and Structs

Also adds roundtrip schema assertions for cases where possible

* Rename from_substrait_struct -> from_substrait_struct_type for clarity

* Add DataType::LargeList to to_substrait_named_struct

* cargo fmt --all

* Add validation that names list matches schema exactly

* Add a LargeList into VALUES test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Substrait VirtualTables
3 participants