[SPARK-46536][SQL] Support GROUP BY calendar_interval_type #44538
Conversation
   */
  @Unstable
- public final class CalendarInterval implements Serializable {
+ public final class CalendarInterval implements Serializable, Comparable<CalendarInterval> {
We (i.e. @MaxGekk) have added standard year-month and day-time intervals, which are much better than calendar interval.
  @Override
  public int compareTo(CalendarInterval o) {
    if (this.months != o.months) {
Comparing intervals does not necessarily short-circuit via months. It could result in 1 month > 0 months 32 days, which is obviously wrong. Besides, 1 month can be 28 ~ 30 days, making the legacy calendar interval type incomparable.
We should add a comment explaining that this is a lexicographic ordering: it has no actual semantic meaning, but it makes it possible to find identical interval instances. We should do the same thing for the map type so that we can group by map values.
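To make the ordering under discussion concrete, here is a minimal standalone sketch of a lexicographic compareTo over the three interval fields. The class name CalendarIntervalSketch is hypothetical; only the field layout (months, days, microseconds) mirrors Spark's actual CalendarInterval, and this is an illustration, not the real implementation.

```java
// Hypothetical stand-in for Spark's CalendarInterval, shown only to
// illustrate the lexicographic ordering discussed in this review thread.
public class CalendarIntervalSketch implements Comparable<CalendarIntervalSketch> {
  final int months;
  final int days;
  final long microseconds;

  public CalendarIntervalSketch(int months, int days, long microseconds) {
    this.months = months;
    this.days = days;
    this.microseconds = microseconds;
  }

  // Field-by-field (lexicographic) comparison. It carries no calendar
  // semantics: it does not decide whether 1 month is longer than 30 days.
  // It is, however, a consistent total order in which identical intervals
  // compare as 0, which is all that grouping and sorting machinery needs.
  @Override
  public int compareTo(CalendarIntervalSketch o) {
    if (this.months != o.months) {
      return Integer.compare(this.months, o.months);
    }
    if (this.days != o.days) {
      return Integer.compare(this.days, o.days);
    }
    return Long.compare(this.microseconds, o.microseconds);
  }

  public static void main(String[] args) {
    CalendarIntervalSketch oneMonth = new CalendarIntervalSketch(1, 0, 0);
    CalendarIntervalSketch thirtyTwoDays = new CalendarIntervalSketch(0, 32, 0);
    // Under this ordering, 1 month sorts after 0 months 32 days, even though
    // their real-world durations are not comparable.
    System.out.println(oneMonth.compareTo(thirtyTwoDays) > 0); // true
  }
}
```

Note that this order intentionally makes no claim about durations; it only guarantees that equal intervals are adjacent when sorted.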
@stefankandic did you generate this using IDEA?
Added the comments.
@cloud-fan the method signature was generated by IntelliJ, but I implemented the logic.
Resolved review threads on the following files:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashMapGenerator.scala
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SortAggregateSuite.scala
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
    )

    for (conf <- configurations) {
      withSQLConf(conf -> "true") {
We should set more configs to trigger the sort fallback of the (object) hash aggregate.
Added a new config that also sets the fallback threshold to 1; let me know if that is not enough.
  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
    val numRows = 50
    val configurations = Seq(
      Seq.empty[(String, String)], // hash aggregate is used by default
We also need to set spark.sql.TungstenAggregate.testFallbackStartsAt, see the code in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L81
I had to disable codegen in order to hit the fallback logic, but hopefully it now tests it properly.
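As background for why the fallback path motivates Comparable at all, here is a minimal sketch (all names hypothetical, not Spark's implementation): the hash aggregate path only needs equals/hashCode on the grouping key, while the sort-based fallback sorts rows by the key and therefore needs a total order, even a semantics-free lexicographic one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the two aggregation strategies, using a simplified interval
// key whose fields mirror Spark's CalendarInterval (months, days, microseconds).
public class AggFallbackSketch {
  record Interval(int months, int days, long microseconds)
      implements Comparable<Interval> {
    // Lexicographic order: no calendar meaning, but a consistent total order.
    @Override
    public int compareTo(Interval o) {
      if (months != o.months) return Integer.compare(months, o.months);
      if (days != o.days) return Integer.compare(days, o.days);
      return Long.compare(microseconds, o.microseconds);
    }
  }

  // Hash aggregate path: the record's generated equals/hashCode suffice.
  public static Map<Interval, Integer> hashCount(List<Interval> rows) {
    Map<Interval, Integer> counts = new HashMap<>();
    for (Interval i : rows) counts.merge(i, 1, Integer::sum);
    return counts;
  }

  // Sort fallback path: sort by the total order, then count runs of equal
  // keys. Produces the same groups as the hash path.
  public static Map<Interval, Integer> sortCount(List<Interval> rows) {
    List<Interval> sorted = new ArrayList<>(rows);
    sorted.sort(Interval::compareTo);
    Map<Interval, Integer> counts = new LinkedHashMap<>();
    for (Interval i : sorted) counts.merge(i, 1, Integer::sum);
    return counts;
  }

  public static void main(String[] args) {
    List<Interval> rows = List.of(
        new Interval(1, 0, 0), new Interval(0, 32, 0), new Interval(1, 0, 0));
    System.out.println(hashCount(rows).equals(sortCount(rows))); // true
  }
}
```

Because equal keys are adjacent after sorting, any total order that is consistent with equals yields correct groups, which is why the semantics-free lexicographic order is acceptable here.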
LGTM if all tests pass
Merged to master.
What changes were proposed in this pull request?
Allow GROUP BY on columns of type CalendarInterval.
Why are the changes needed?
Currently, Spark GROUP BY only allows orderable data types, otherwise the plan analysis fails: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L197-L203
However, this is too strict, as GROUP BY only cares about equality, not ordering. The CalendarInterval type is not orderable (given 1 month and 30 days, we don't know which one is larger), but it has well-defined equality. In fact, we already support
SELECT DISTINCT calendar_interval_type
in some cases (when the hash aggregate is picked by the planner).
Does this PR introduce any user-facing change?
Yes, users will now be able to GROUP BY columns of type CalendarInterval.
How was this patch tested?
By adding new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No