
[SPARK-39567][SQL] Support ANSI intervals in the percentile functions #36965

Closed
wants to merge 10 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Jun 23, 2022

What changes were proposed in this pull request?

In the PR, I propose to extend PercentileBase to support the ANSI interval types (year-month and day-time intervals) in the Percentile, PercentileDisc, Median and PercentileCont expressions. As an intermediate step, input intervals are cast to double values, and the result of the calculation is cast back to an interval.

Why are the changes needed?

To improve user experience with Spark SQL.

Does this PR introduce any user-facing change?

No. Before the changes, the percentile functions fail with the error:

spark-sql> CREATE OR REPLACE TEMPORARY VIEW intervals AS SELECT * FROM VALUES
         > (0, INTERVAL '0' SECOND),
         > (0, INTERVAL '20' SECOND),
         > (0, INTERVAL '25' SECOND),
         > (0, INTERVAL '30' SECOND)
         > AS intervals(k, dt);
spark-sql> SELECT median(dt), percentile(dt, 0.25), percentile_cont(0.75)
         > WITHIN GROUP (ORDER BY dt) FROM intervals;
Error in query: cannot resolve 'median(intervals.dt)' due to data type mismatch: argument 1 requires numeric type, however, 'intervals.dt' is of interval second type.; line 1 pos 7;

After:

spark-sql> SELECT median(dt), percentile(dt, 0.25), percentile_cont(0.75)
         > WITHIN GROUP (ORDER BY dt) FROM intervals;
0 00:00:22.500000000	0 00:00:15.000000000	0 00:00:26.250000000
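The results above follow the standard linear-interpolation percentile over the intervals cast to double. The following is a minimal Python sketch of that cast-to-double approach, not Spark's actual Scala implementation; the helper name `percentile_interp` is hypothetical:

```python
# Sketch of the cast-to-double approach (hypothetical helper, not Spark code).
def percentile_interp(values_us, p):
    """Linear-interpolation percentile over interval values in microseconds."""
    xs = sorted(values_us)
    pos = p * (len(xs) - 1)        # fractional rank, as in percentile_cont
    lo = int(pos)
    frac = pos - lo
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + frac * (xs[hi] - xs[lo])

# The four DAY TO SECOND intervals from the example, cast to microseconds.
SECOND = 1_000_000
dt = [0 * SECOND, 20 * SECOND, 25 * SECOND, 30 * SECOND]

print(percentile_interp(dt, 0.50) / SECOND)  # median               -> 22.5 s
print(percentile_interp(dt, 0.25) / SECOND)  # percentile(dt, 0.25) -> 15.0 s
print(percentile_interp(dt, 0.75) / SECOND)  # percentile_cont(0.75)-> 26.25 s
```

Casting the interpolated double back to a day-time interval yields the `0 00:00:22.5`, `0 00:00:15` and `0 00:00:26.25` results shown above.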

How was this patch tested?

By running the modified tests:

$ build/sbt "test:testOnly *PercentileQuerySuite"
$ build/sbt "test:testOnly *PercentileSuite"
$ build/sbt -Phive "test:testOnly *HiveWindowFunctionQuerySuite"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *ExpressionsSchemaSuite"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"

@github-actions github-actions bot added the SQL label Jun 23, 2022
@MaxGekk
Member Author

MaxGekk commented Jun 23, 2022

@cloud-fan Are you ok with this approach?

@cloud-fan
Contributor

this approach LGTM

@MaxGekk MaxGekk changed the title [WIP][SPARK-39567][SQL] Support ANSI intervals in the percentile functions [SPARK-39567][SQL] Support ANSI intervals in the percentile functions Jun 26, 2022
@MaxGekk MaxGekk marked this pull request as ready for review June 26, 2022 09:51
@MaxGekk MaxGekk requested a review from cloud-fan June 26, 2022 09:51
@MaxGekk
Member Author

MaxGekk commented Jun 26, 2022

I guess this is not related to the changes:

2022-06-26T10:41:06.1199982Z Error in .f(.x[[i]], ...) : Failed to parse Rd in histogram.Rd
2022-06-26T10:41:06.1201593Z ℹ there is no package called ‘ggplot2’
2022-06-26T10:41:06.1201944Z Caused by error in `loadNamespace()`:
2022-06-26T10:41:06.1202347Z ! there is no package called ‘ggplot2’

@MaxGekk
Member Author

MaxGekk commented Jun 27, 2022

Merging to master. Thank you, @cloud-fan, for the review.

@MaxGekk MaxGekk closed this in d83f6dd Jun 27, 2022
MaxGekk added a commit that referenced this pull request Aug 22, 2022
…le functions

### What changes were proposed in this pull request?
In the PR, I propose to change the result type of `PercentileBase` (the `Percentile`, `PercentileDisc`, `Median` and `PercentileCont` expressions):
- For any year-month interval types return `YEAR TO MONTH INTERVAL`
- For any day-time interval types return `DAY TO SECOND INTERVAL`

The PR changes behavior of the following functions: `median()`, `percentile()`, `percentile_cont()` and `percentile_disc()`.

This PR is a follow up of #36965.

### Why are the changes needed?
1. To not lose the fractional part of the result when it is possible, and to behave in the same way as for numeric types, for which the functions return double.
2. To be consistent with `MEAN`, which returns wide ANSI interval types, for instance:
```sql
spark-sql> SELECT typeof(mean(distinct cast(col as interval hour))) FROM VALUES (1), (2), (2), (3), (4), (NULL) AS tab(col);
interval day to second
```

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```sql
spark-sql> SELECT typeof(median(distinct cast(col as interval hour))) FROM VALUES (1), (2), (2), (3), (4), (NULL) AS tab(col);
interval hour
```

After:
```sql
spark-sql> SELECT typeof(median(distinct cast(col as interval hour))) FROM VALUES (1), (2), (2), (3), (4), (NULL) AS tab(col);
interval day to second
```
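Widening the result type matters because the median of whole-hour inputs can fall between two hours. A small Python sketch of the arithmetic (not Spark code; plain integers stand in for `interval hour` values):

```python
# Sketch: why a widened DAY TO SECOND result type is needed.
# Median of the distinct hour intervals 1h, 2h, 3h, 4h is 2.5 hours.
hours = sorted({1, 2, 2, 3, 4})               # distinct -> [1, 2, 3, 4]
mid = len(hours) // 2
median_h = (hours[mid - 1] + hours[mid]) / 2  # even count: average the middle pair

# An `interval hour` result type could only hold whole hours, while
# `interval day to second` keeps the 30-minute fraction.
total_seconds = int(median_h * 3600)
print(f"{total_seconds // 3600}:{(total_seconds % 3600) // 60:02d}:00")
```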

### How was this patch tested?
By running new tests:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z percentiles.sql"
```

Closes #37595 from MaxGekk/median-result-ansi-intervals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>