Skip to content

Consolidate datafusion, arrow-cpp, and substrait's handling of non-substrait arrow types #12181

@westonpace

Description

@westonpace

Is your feature request related to a problem or challenge?

I am trying to take a filter expression created by pyarrow and convert it into a filter expression for Datafusion to satisfy. I am using Substrait to do this. Everything works fine when I use the standard Substrait types. However, when I use normal Arrow types that are not Substrait types (e.g. unsigned integers, large containers) I run into problems.

It seems that arrow-cpp (admittedly, me, in this case) and datafusion have taken different approaches to handling these limitations.

In arrow-cpp the types that expand or change the valid range of values (e.g. unsigned integers, large containers) are converted to extension types. This process is documented in https://github.com/apache/arrow/blob/main/format/substrait/extension_types.yaml

In datafusion it appears these types are expected to use the nearest substrait match (e.g. signed integer, small container) with a type variation.

Describe the solution you'd like

I am admittedly biased (given I implemented one of the two disagreeing components) but I favor the extension types approach. Type variations are defined in Substrait as this:

Type variations may be used to represent differences in representation between different consumers. For example, an engine might support dictionary encoding for a string, or could be using either a row-wise or columnar representation of a struct. All variations of a type are expected to have the same semantics when operated on by functions or other expressions.

Given that definition, I do not think it is valid to say that an unsigned integer is a variation of a signed integer (they do not have the same outputs for all functions). I do believe things like the view types and dictionary encoding are valid type variations.

Describe alternatives you've considered

The alternative would be to change arrow-cpp to also use type variations. Though I'd like some consensus from the Substrait community that this is a valid use of type variations before taking that approach.

At the moment I am working around this issue by simply removing any non-standard types from the input schema (this works as long as the filter isn't referencing those types).

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions