
[SPARK-39567][SQL] Support ANSI intervals in the percentile functions #36965

Closed
wants to merge 10 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Jun 23, 2022

What changes were proposed in this pull request?

In the PR, I propose to extend PercentileBase to support the ANSI interval types (year-month and day-time intervals) in the Percentile, PercentileDisc, Median and PercentileCont expressions. As an intermediate step, input intervals are cast to double values, and the result of the calculation is cast back to an interval.

Why are the changes needed?

To improve user experience with Spark SQL.

Does this PR introduce any user-facing change?

No. Before the changes, the percentile functions fail with the error:

spark-sql> CREATE OR REPLACE TEMPORARY VIEW intervals AS SELECT * FROM VALUES
         > (0, INTERVAL '0' SECOND),
         > (0, INTERVAL '20' SECOND),
         > (0, INTERVAL '25' SECOND),
         > (0, INTERVAL '30' SECOND)
         > AS intervals(k, dt);
spark-sql> SELECT median(dt), percentile(dt, 0.25), percentile_cont(0.75)
         > WITHIN GROUP (ORDER BY dt) FROM intervals;
Error in query: cannot resolve 'median(intervals.dt)' due to data type mismatch: argument 1 requires numeric type, however, 'intervals.dt' is of interval second type.; line 1 pos 7;

After:

spark-sql> SELECT median(dt), percentile(dt, 0.25), percentile_cont(0.75)
         > WITHIN GROUP (ORDER BY dt) FROM intervals;
0 00:00:22.500000000	0 00:00:15.000000000	0 00:00:26.250000000
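The results above follow the standard linear-interpolation percentile over the intervals cast to double. The following is a minimal Python sketch of that cast-to-double approach, not Spark's actual Scala implementation; the helper name `percentile_interp` is hypothetical:

```python
# Sketch of the cast-to-double approach (hypothetical helper, not Spark code).
def percentile_interp(values_us, p):
    """Linear-interpolation percentile over interval values in microseconds."""
    xs = sorted(values_us)
    pos = p * (len(xs) - 1)        # fractional rank, as in percentile_cont
    lo = int(pos)
    frac = pos - lo
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + frac * (xs[hi] - xs[lo])

# The four DAY TO SECOND intervals from the example, cast to microseconds.
SECOND = 1_000_000
dt = [0 * SECOND, 20 * SECOND, 25 * SECOND, 30 * SECOND]

print(percentile_interp(dt, 0.50) / SECOND)  # median               -> 22.5 s
print(percentile_interp(dt, 0.25) / SECOND)  # percentile(dt, 0.25) -> 15.0 s
print(percentile_interp(dt, 0.75) / SECOND)  # percentile_cont(0.75)-> 26.25 s
```

Casting the interpolated double back to a day-time interval yields the `0 00:00:22.5`, `0 00:00:15` and `0 00:00:26.25` results shown above.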

How was this patch tested?

By running the modified tests:

$ build/sbt "test:testOnly *PercentileQuerySuite"
$ build/sbt "test:testOnly *PercentileSuite"
$ build/sbt -Phive "test:testOnly *HiveWindowFunctionQuerySuite"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *ExpressionsSchemaSuite"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"

@github-actions github-actions bot added the SQL label Jun 23, 2022
@MaxGekk
Member Author

MaxGekk commented Jun 23, 2022

@cloud-fan Are you ok with this approach?

@cloud-fan
Contributor

this approach LGTM

@MaxGekk MaxGekk changed the title [WIP][SPARK-39567][SQL] Support ANSI intervals in the percentile functions [SPARK-39567][SQL] Support ANSI intervals in the percentile functions Jun 26, 2022
@MaxGekk MaxGekk marked this pull request as ready for review June 26, 2022 09:51
@MaxGekk MaxGekk requested a review from cloud-fan June 26, 2022 09:51
@MaxGekk
Member Author

MaxGekk commented Jun 26, 2022

I guess this is not related to the changes:

2022-06-26T10:41:06.1199982Z Error in .f(.x[[i]], ...) : Failed to parse Rd in histogram.Rd
2022-06-26T10:41:06.1201593Z ℹ there is no package called ‘ggplot2’
2022-06-26T10:41:06.1201944Z Caused by error in `loadNamespace()`:
2022-06-26T10:41:06.1202347Z ! there is no package called ‘ggplot2’

@MaxGekk
Member Author

MaxGekk commented Jun 27, 2022

Merging to master. Thank you, @cloud-fan, for the review.

@MaxGekk MaxGekk closed this in d83f6dd Jun 27, 2022
MaxGekk added a commit that referenced this pull request Aug 22, 2022
…le functions

### What changes were proposed in this pull request?
In the PR, I propose to change the result type of `PercentileBase` (the `Percentile`, `PercentileDisc`, `Median` and `PercentileCont` expressions):
- For any year-month interval types return `YEAR TO MONTH INTERVAL`
- For any day-time interval types return `DAY TO SECOND INTERVAL`

The PR changes behavior of the following functions: `median()`, `percentile()`, `percentile_cont()` and `percentile_disc()`.

This PR is a follow up of #36965.

### Why are the changes needed?
1. To not lose the fractional part of the result when it is possible, and to behave in the same way as for numeric types, for which the functions return double.
2. To be consistent with `MEAN`, which returns wide ANSI interval types, for instance:
```sql
spark-sql> SELECT typeof(mean(distinct cast(col as interval hour))) FROM VALUES (1), (2), (2), (3), (4), (NULL) AS tab(col);
interval day to second
```

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```sql
spark-sql> SELECT typeof(median(distinct cast(col as interval hour))) FROM VALUES (1), (2), (2), (3), (4), (NULL) AS tab(col);
interval hour
```

After:
```sql
spark-sql> SELECT typeof(median(distinct cast(col as interval hour))) FROM VALUES (1), (2), (2), (3), (4), (NULL) AS tab(col);
interval day to second
```
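Widening the result type matters because the median of whole-hour inputs can fall between two hours. A small Python sketch of the arithmetic (not Spark code; plain integers stand in for `interval hour` values):

```python
# Sketch: why a widened DAY TO SECOND result type is needed.
# Median of the distinct hour intervals 1h, 2h, 3h, 4h is 2.5 hours.
hours = sorted({1, 2, 2, 3, 4})               # distinct -> [1, 2, 3, 4]
mid = len(hours) // 2
median_h = (hours[mid - 1] + hours[mid]) / 2  # even count: average the middle pair

# An `interval hour` result type could only hold whole hours, while
# `interval day to second` keeps the 30-minute fraction.
total_seconds = int(median_h * 3600)
print(f"{total_seconds // 3600}:{(total_seconds % 3600) // 60:02d}:00")
```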

### How was this patch tested?
By running new tests:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z percentiles.sql"
```

Closes #37595 from MaxGekk/median-result-ansi-intervals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>