Skip to content

Conversation

stefankandic
Copy link
Contributor

@stefankandic stefankandic commented Dec 29, 2023

What changes were proposed in this pull request?

Allow group by on columns of type CalendarInterval

Why are the changes needed?

Currently, Spark GROUP BY only allows orderable data types, otherwise the plan analysis fails: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L197-L203

However, this is too strict as GROUP BY only cares about equality, not ordering. The CalendarInterval type is not orderable (1 month and 30 days, we don't know which one is larger), but has well-defined equality. In fact, we already support SELECT DISTINCT calendar_interval_type in some cases (when hash aggregate is picked by the planner).

Does this PR introduce any user-facing change?

Yes, users will now be able to do group by on columns of type CalendarInterval

How was this patch tested?

By adding new UTs

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Dec 29, 2023
@stefankandic stefankandic changed the title [WIP][SPARK-46536]Support GROUP BY calendar_interval_type [SPARK-46536][SQL] Support GROUP BY calendar_interval_type Dec 29, 2023
@stefankandic stefankandic marked this pull request as ready for review December 29, 2023 17:00
*/
@Unstable
public final class CalendarInterval implements Serializable {
public final class CalendarInterval implements Serializable, Comparable<CalendarInterval> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Please check the behavior
#27262
I'm not sure. @yaooqinn

@srielau
Copy link
Contributor

srielau commented Dec 30, 2023

We (i.e. @MaxGekk) have added standard year-month and day-time intervals which are much better than calendar interval.
It is of questionable value to improve this old type.
@cloud-fan @MaxGekk WDYT?


@Override
public int compareTo(CalendarInterval o) {
if (this.months != o.months) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing intervals does not necessarily short circuits via months. We could result in 1 month > 0 months 32 days, which is wrong, obviously.

Besides, 1 month can be 28 ~ 30 days, making the legacy calendar interval type uncomparable

Copy link
Contributor

@cloud-fan cloud-fan Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some comments to explain that this is alphabet ordering. It does not have actual meaning but just makes it possible to find identical interval instances.

We should do the same thing for map type so that we can group by map values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefankandic did you generate this using IDEA?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the comments.

@cloud-fan method was generated by intellij but I implemented the logic

)

for (conf <- configurations) {
withSQLConf(conf -> "true") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should set more configs to trigger sort fallback of (object) hash aggregate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added new config which also sets fallback threshold to 1, let me know if that is not enough

test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
val numRows = 50
val configurations = Seq(
Seq.empty[(String, String)], // hash aggregate is used by default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to disable codegen in order to hit the fallback logic, but hopefully it now tests it properly

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if all tests pass

@HyukjinKwon
Copy link
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants