Consolidate datafusion, arrow-cpp, and substrait's handling of non-substrait arrow types

### Is your feature request related to a problem or challenge?

I am trying to take a filter expression created by pyarrow and convert it into a filter expression for Datafusion to satisfy.  I am using Substrait to do this.  Everything works fine when I use the standard Substrait types.  However, when I use normal Arrow types that are not Substrait types (e.g. unsigned integers, large containers) I run into problems.

It seems that arrow-cpp (admittedly, me, in this case) and datafusion have taken different approaches to handling these limitations.

In arrow-cpp the types that expand or change the valid range of values (e.g. unsigned integers, large containers) are converted to extension types.  This process is documented in https://github.com/apache/arrow/blob/main/format/substrait/extension_types.yaml

In datafusion it appears these types are expected to use the nearest substrait match (e.g. signed integer, small container) with a type variation.

### Describe the solution you'd like

I am admittedly biased (given I implemented one of the two disagreeing components) but I favor the extension types approach.  Type variations are defined in Substrait as this:

> Type variations may be used to represent differences in representation between different consumers. For example, an engine might support dictionary encoding for a string, or could be using either a row-wise or columnar representation of a struct. All variations of a type are expected to have the same semantics when operated on by functions or other expressions.

Given that definition, I do not think it is valid to say that an unsigned integer is a variation of a signed integer (they do not have the same outputs for all functions).  I do believe things like the view types and dictionary encoding are valid type variations.

### Describe alternatives you've considered

The alternative would be to change arrow-cpp to also use type variations.  Though I'd like some consensus from the Substrait community that this is a valid use of type variations before taking that approach.

At the moment I am working around this issue by simply removing any non-standard types from the input schema (this works as long as the filter isn't referencing those types).

### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Consolidate datafusion, arrow-cpp, and substrait's handling of non-substrait arrow types #12181

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Consolidate datafusion, arrow-cpp, and substrait's handling of non-substrait arrow types #12181

Description

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions