Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More properly handle nullability of types/literals in Substrait #10640

Merged
merged 3 commits into from
May 24, 2024

Conversation

Blizzara
Copy link
Contributor

Which issue does this PR close?

Extracted from #10531

Builds on top of #10622 so I'll rebase this once it's merged.

Rationale for this change

More closely propagate null-ability through Substrait conversions

What changes are included in this PR?

Handles the nullability of struct fields during Substrait conversions. This isn't perfect; some things are still assumed to just always be nullable (e.g. Literal list elements).

Are these changes tested?

Tested through round-trip unit tests

Are there any user-facing changes?

Nulls are handled more accurately

This isn't perfect; some things are still assumed to just always be nullable (e.g. Literal list elements).
But it's still giving a closer match than just assuming everything is nullable.
@Blizzara Blizzara marked this pull request as ready for review May 23, 2024 17:48
@Blizzara
Copy link
Contributor Author

@alamb @jonahgao - this next one is up for review/merge now :) And thanks for merging the struct one!

@@ -1686,7 +1691,7 @@ fn to_substrait_bounds(window_frame: &WindowFrame) -> Result<(Bound, Bound)> {
))
}

fn to_substrait_literal(value: &ScalarValue) -> Result<Literal> {
fn to_substrait_literal(value: &ScalarValue, nullable: bool) -> Result<Literal> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to ExprSchemable::nullable(), the nullability of a literal depends on its ScalarValue, so I think the nullable argument for this function could be omitted.

fn to_substrait_literal(value: &ScalarValue, nullable: bool) -> Result<Literal> {
    if value.is_null() {
        return Ok(Literal {
            nullable: true,
            type_variation_reference: DEFAULT_TYPE_REF,
            literal_type: Some(LiteralType::Null(to_substrait_type(
                &value.data_type(),
                true,
            )?)),
        });
    }

At the same time, we can also remove try_to_substrait_null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right, thanks! That does indeed simplify things. Fixed in cbd2264

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍

@jonahgao
Copy link
Member

jonahgao commented May 24, 2024

I plan to merge this PR now as it might conflict with #10646. cc @waynexia

@jonahgao jonahgao merged commit 8bedecc into apache:main May 24, 2024
23 checks passed
@Blizzara Blizzara deleted the avo/substrait-nullability branch May 24, 2024 13:22
@alamb
Copy link
Contributor

alamb commented May 25, 2024

Thanks @Blizzara and @jonahgao

jayzhan211 pushed a commit to jayzhan211/arrow-datafusion that referenced this pull request May 26, 2024
…he#10640)

* More properly handle nullability of types/literals in Substrait

This isn't perfect; some things are still assumed to just always be nullable (e.g. Literal list elements).
But it's still giving a closer match than just assuming everything is nullable.

* Avoid cloning and creating DataFusionError

Co-authored-by: Jonah Gao <jonahgao@msn.com>

* simplify Literal/ScalarValue null handling

---------

Co-authored-by: Jonah Gao <jonahgao@msn.com>
Blizzara added a commit to Blizzara/datafusion that referenced this pull request Jun 11, 2024
DataFusion (= Arrow) is quite strict about nullability, specifically,
when using e.g. LogicalPlan::Values, the given schema must match the
given literals exactly - including nullability.
This is non-trivial to do when converting schema and literals separately.

The existing implementation for from_substrait_literal already creates
lists that are always nullable
(see ScalarValue::new_list => array_into_list_array).
This reverts part of apache#10640 to
align from_substrait_type with that behavior.

This is the error I was hitting:
```
ArrowError(InvalidArgumentError("column types must match schema types, expected
List(Field { name: \"item\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found
List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0"), None)
```
alamb pushed a commit that referenced this pull request Jun 12, 2024
…0874)

* Ignore nullability of list elements when consuming Substrait

DataFusion (= Arrow) is quite strict about nullability, specifically,
when using e.g. LogicalPlan::Values, the given schema must match the
given literals exactly - including nullability.
This is non-trivial to do when converting schema and literals separately.

The existing implementation for from_substrait_literal already creates
lists that are always nullable
(see ScalarValue::new_list => array_into_list_array).
This reverts part of #10640 to
align from_substrait_type with that behavior.

This is the error I was hitting:
```
ArrowError(InvalidArgumentError("column types must match schema types, expected
List(Field { name: \"item\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found
List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0"), None)
```

* use `Field::new_list_field` in `array_into_(large_)list_array`

just for consistency, to reduce the places where "item" is written out

* add a test for non-nullable lists
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…he#10640)

* More properly handle nullability of types/literals in Substrait

This isn't perfect; some things are still assumed to just always be nullable (e.g. Literal list elements).
But it's still giving a closer match than just assuming everything is nullable.

* Avoid cloning and creating DataFusionError

Co-authored-by: Jonah Gao <jonahgao@msn.com>

* simplify Literal/ScalarValue null handling

---------

Co-authored-by: Jonah Gao <jonahgao@msn.com>
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…ache#10874)

* Ignore nullability of list elements when consuming Substrait

DataFusion (= Arrow) is quite strict about nullability, specifically,
when using e.g. LogicalPlan::Values, the given schema must match the
given literals exactly - including nullability.
This is non-trivial to do when converting schema and literals separately.

The existing implementation for from_substrait_literal already creates
lists that are always nullable
(see ScalarValue::new_list => array_into_list_array).
This reverts part of apache#10640 to
align from_substrait_type with that behavior.

This is the error I was hitting:
```
ArrowError(InvalidArgumentError("column types must match schema types, expected
List(Field { name: \"item\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found
List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0"), None)
```

* use `Field::new_list_field` in `array_into_(large_)list_array`

just for consistency, to reduce the places where "item" is written out

* add a test for non-nullable lists
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants