-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Open
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
In the presences of multiple grouping sets, the output schema for a AggregateRel consists of:
- All expressions in the groupingExpressions field
- All measures
- An extra i32 column to indicate which grouping a row belongs to.
In Datafusion, the default order of the schema seems to be
- All expressions in the groupingExpressions field
- An extra i32 column to indicate which grouping a row belongs to.
- All measures
This different in schemas does not appear to be handled in the consumer.
To Reproduce
The following Substrait plan
{
"extensionUris": [
{
"extensionUriAnchor": 1,
"uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml"
}
],
"extensions": [
{
"extensionFunction": {
"extensionUriReference": 1,
"functionAnchor": 1,
"name": "sum:i8"
}
}
],
"relations": [
{
"root": {
"input": {
"aggregate": {
"common": {
"emit": {
"outputMapping": [
0, 1, 2
]
}
},
"input": {
"read": {
"baseSchema": {
"names": [
"c0",
"c1"
],
"struct": {
"nullability": "NULLABILITY_REQUIRED",
"types": [
{
"i8": {
"nullability": "NULLABILITY_NULLABLE"
}
},
{
"i8": {
"nullability": "NULLABILITY_NULLABLE"
}
}
]
}
},
"common": {
"direct": {}
},
"virtualTable": {}
}
},
"groupingExpressions": [
{
"selection": {
"directReference": {
"structField": {}
},
"rootReference": {}
}
},
{
"selection": {
"directReference": {
"structField": {
"field": 1
}
},
"rootReference": {}
}
}
],
"groupings": [
{
"expressionReferences": [0]
},
{
"expressionReferences": [1]
},
{
"expressionReferences": [0, 1]
}
],
"measures": [
{
"measure": {
"arguments": [
{
"value": {
"selection": {
"directReference": {
"structField": {}
},
"rootReference": {}
}
}
}
],
"functionReference": 1,
"invocation": "AGGREGATION_INVOCATION_ALL",
"outputType": {
"i8": {
"nullability": "NULLABILITY_NULLABLE"
}
},
"phase": "AGGREGATION_PHASE_INITIAL_TO_RESULT"
}
}
]
}
},
"names": [
"c0",
"c1",
"summation"
]
}
}
],
"version": {
"minorNumber": 29
}
}
which captures a query like
SELECT c0, c1, sum(0) AS summation
FROM <some table>
GROUP BY GROUPING SETS((c0), (c1), (c0, c1))
yields the following DataFusion plan
Projection: c0, c1, __grouping_id AS summation
Aggregate: groupBy=[[GROUPING SETS ((c0), (c1), (c0, c1))]], aggr=[[sum(c0)]]
EmptyRelation: rows=0
With this plan, output column 2 should correspond to the summation, but in the Datafusion plan it is the __grouping_id
column.
Expected behavior
We would expect something like instead
Projection: c0, c1, sum(c0) AS summation
Aggregate: groupBy=[[GROUPING SETS ((c0), (c1), (c0, c1))]], aggr=[[sum(c0)]]
EmptyRelation: rows=0
Additional context
I suspect that fixing this in the consumer will also reveal bugs in the producer around handling for multiple groupings sets.
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working