Skip to content

Handling of Aggregates with multiple groupings sets is incorrect #17910

@vbarua

Description

@vbarua

Describe the bug

In the presences of multiple grouping sets, the output schema for a AggregateRel consists of:

  1. All expressions in the groupingExpressions field
  2. All measures
  3. An extra i32 column to indicate which grouping a row belongs to.

In Datafusion, the default order of the schema seems to be

  1. All expressions in the groupingExpressions field
  2. An extra i32 column to indicate which grouping a row belongs to.
  3. All measures

This different in schemas does not appear to be handled in the consumer.

To Reproduce

The following Substrait plan

{
  "extensionUris": [
    {
      "extensionUriAnchor": 1,
      "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml"
    }
  ],
  "extensions": [
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 1,
        "name": "sum:i8"
      }
    }
  ],
  "relations": [
    {
      "root": {
        "input": {
          "aggregate": {
            "common": {
              "emit": {
                "outputMapping": [
                  0, 1, 2
                ]
              }
            },
            "input": {
              "read": {
                "baseSchema": {
                  "names": [
                    "c0",
                    "c1"
                  ],
                  "struct": {
                    "nullability": "NULLABILITY_REQUIRED",
                    "types": [
                      {
                        "i8": {
                          "nullability": "NULLABILITY_NULLABLE"
                        }
                      },
                      {
                        "i8": {
                          "nullability": "NULLABILITY_NULLABLE"
                        }
                      }
                    ]
                  }
                },
                "common": {
                  "direct": {}
                },
                "virtualTable": {}
              }
            },
            "groupingExpressions": [
              {
                "selection": {
                  "directReference": {
                    "structField": {}
                  },
                  "rootReference": {}
                }
              },
              {
                "selection": {
                  "directReference": {
                    "structField": {
                      "field": 1
                    }
                  },
                  "rootReference": {}
                }
              }
            ],
            "groupings": [
              {
                "expressionReferences": [0]
              },
              {
                "expressionReferences": [1]
              },
              {
                "expressionReferences": [0, 1]
              }
            ],
            "measures": [
              {
                "measure": {
                  "arguments": [
                    {
                      "value": {
                        "selection": {
                          "directReference": {
                            "structField": {}
                          },
                          "rootReference": {}
                        }
                      }
                    }
                  ],
                  "functionReference": 1,
                  "invocation": "AGGREGATION_INVOCATION_ALL",
                  "outputType": {
                    "i8": {
                      "nullability": "NULLABILITY_NULLABLE"
                    }
                  },
                  "phase": "AGGREGATION_PHASE_INITIAL_TO_RESULT"
                }
              }
            ]
          }
        },
        "names": [
          "c0",
          "c1",
          "summation"
        ]
      }
    }
  ],
  "version": {
    "minorNumber": 29
  }
}

which captures a query like

SELECT c0, c1, sum(0) AS summation
FROM <some table>
GROUP BY GROUPING SETS((c0), (c1), (c0, c1))

yields the following DataFusion plan

Projection: c0, c1, __grouping_id AS summation
  Aggregate: groupBy=[[GROUPING SETS ((c0), (c1), (c0, c1))]], aggr=[[sum(c0)]]
    EmptyRelation: rows=0

With this plan, output column 2 should correspond to the summation, but in the Datafusion plan it is the __grouping_id column.

Expected behavior

We would expect something like instead

Projection: c0, c1, sum(c0) AS summation
  Aggregate: groupBy=[[GROUPING SETS ((c0), (c1), (c0, c1))]], aggr=[[sum(c0)]]
    EmptyRelation: rows=0

Additional context

I suspect that fixing this in the consumer will also reveal bugs in the producer around handling for multiple groupings sets.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions