-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Describe the bug
When constructing a DF plan from Substrait, we confirm that the data types of the columns DF sees matches what the Substrait plan expects. However, the check here can fail if the inner name of an List field differs:
"Field 'categories' in Substrait schema has a different type
(List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })) than the corresponding field in the table schema
(List(Field { name: \"element\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })).
backtrace: 0: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
1: datafusion_substrait::logical_plan::consumer::ensure_schema_compatability
2: datafusion_substrait::logical_plan::consumer::from_substrait_rel::{{closure}}
This happens because we call the inner field "item" (specifically we use Field::new_list_field which calls it "item"), but Arrow doesn't mandate that. And Spark's toArrowSchema uses "element" instead: https://github.com/apache/spark/blob/11e47064d0c73ab4fc6c960153845b45356db20f/sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala#L114
For Map columns, Spark's toArrowSchema seems to use the same name as we use here, so it doesn't collide - but Arrow doesn't mandate that naming either so it's liable to fail for someone somewhere.
I can think of at least two ways to fix this:
a) normalize the names in consumer.rs before comparing
b) normalize the names in dfschema.rs::datatype_is_logically_equal before comparing
thoughts? cc @vbarua
To Reproduce
No response
Expected behavior
No response
Additional context
No response