Skip to content

Substrait consumer's schema check can fail for list (probs also map) columns #13437

@Blizzara

Description

@Blizzara

Describe the bug

When constructing a DF plan from Substrait, we confirm that the data types of the columns DF sees matches what the Substrait plan expects. However, the check here can fail if the inner name of an List field differs:

"Field 'categories' in Substrait schema has a different type 
(List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })) than the corresponding field in the table schema 
(List(Field { name: \"element\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })).

backtrace:    0: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
   1: datafusion_substrait::logical_plan::consumer::ensure_schema_compatability
   2: datafusion_substrait::logical_plan::consumer::from_substrait_rel::{{closure}}

This happens because we call the inner field "item" (specifically we use Field::new_list_field which calls it "item"), but Arrow doesn't mandate that. And Spark's toArrowSchema uses "element" instead: https://github.com/apache/spark/blob/11e47064d0c73ab4fc6c960153845b45356db20f/sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala#L114

For Map columns, Spark's toArrowSchema seems to use the same name as we use here, so it doesn't collide - but Arrow doesn't mandate that naming either so it's liable to fail for someone somewhere.

I can think of at least two ways to fix this:
a) normalize the names in consumer.rs before comparing
b) normalize the names in dfschema.rs::datatype_is_logically_equal before comparing

thoughts? cc @vbarua

To Reproduce

No response

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions