[SPARK-37614][SQL] Support ANSI Aggregate Function: regr_avgx & regr_avgy #34868

beliefer · 2021-12-11T12:51:50Z

What changes were proposed in this pull request?

REGR_AVGX and REGR_AVGY are ANSI aggregate functions.

REGR_AVGX returns the mean of the independent_variable_expression for all non-null data pairs of the dependent and independent variable arguments.
Syntax: REGR_AVGX(dependent_variable_expression, independent_variable_expression)
The equation for computing REGR_AVGX is: REGR_AVGX = SUM(x)/n, where the sum and the count n are taken over the non-null (y, x) pairs.

Examples:

> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
  2.75
> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
  3.0
> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
  3.0

REGR_AVGY returns the mean of the dependent_variable_expression for all non-null data pairs of the dependent and independent variable arguments.
Syntax: REGR_AVGY(dependent_variable_expression, independent_variable_expression)
The equation for computing REGR_AVGY is: REGR_AVGY = SUM(y)/n, where the sum and the count n are taken over the non-null (y, x) pairs.

Examples:

> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
  1.75
> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
  1.6666666666666667
> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
  1.5
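
A minimal sketch of the semantics described above, written in plain Python rather than Spark code: a row contributes only when both y and x are non-null, and the result is the mean of x (or y) over those rows. The function names, the use of `None` for SQL NULL, and the choice to return `None` when no pair qualifies are assumptions made for illustration; this is not the implementation in the PR.

```python
from typing import Iterable, List, Optional, Tuple

# A row is (y, x); None stands in for SQL NULL (an assumption of this sketch).
Pair = Tuple[Optional[float], Optional[float]]


def _non_null_pairs(rows: Iterable[Pair]) -> List[Tuple[float, float]]:
    """Keep only the rows where both y and x are non-null."""
    return [(y, x) for y, x in rows if y is not None and x is not None]


def regr_avgx(rows: Iterable[Pair]) -> Optional[float]:
    """Mean of the independent variable x over non-null (y, x) pairs."""
    pairs = _non_null_pairs(rows)
    if not pairs:
        return None  # nothing to average
    return sum(x for _, x in pairs) / len(pairs)


def regr_avgy(rows: Iterable[Pair]) -> Optional[float]:
    """Mean of the dependent variable y over non-null (y, x) pairs."""
    pairs = _non_null_pairs(rows)
    if not pairs:
        return None  # nothing to average
    return sum(y for y, _ in pairs) / len(pairs)


# The three data sets used in the examples above.
all_present = [(1, 2), (2, 2), (2, 3), (2, 4)]
one_null_x = [(1, 2), (2, None), (2, 3), (2, 4)]
null_x_and_y = [(1, 2), (2, None), (None, 3), (2, 4)]

print(regr_avgx(all_present), regr_avgy(all_present))
print(regr_avgx(one_null_x), regr_avgy(one_null_x))
print(regr_avgx(null_x_and_y), regr_avgy(null_x_and_y))
```

On the second data set the `(2, null)` row is dropped, so both functions average over three pairs; on the third, the `(null, 3)` row is dropped as well and only two pairs remain.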

dependent_variable_expression: the dependent variable for the regression. A dependent variable is something that is measured in response to a treatment. The expression cannot contain any ordered analytical or aggregate functions.

independent_variable_expression: the independent variable for the regression. An independent variable is a treatment: something that is varied under your control to test the behavior of another variable. The expression cannot contain any ordered analytical or aggregate functions.

Mainstream databases that support REGR_AVGX and REGR_AVGY are shown below:
Teradata
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/KkJgUSq2O6JRU3bCK~0cug
Snowflake
https://docs.snowflake.com/en/sql-reference/functions/regr_avgx.html
Oracle
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9
Vertica
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Aggregate/REGR_AVGX.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAggregate%20Functions%7C_____24
DB2
https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count
H2
http://www.h2database.com/html/functions-aggregate.html#regr_avgx
PostgreSQL
https://www.postgresql.org/docs/8.4/functions-aggregate.html
Sybase
https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-avgx-function.html
Exasol
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm

Why are the changes needed?

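REGR_AVGX and REGR_AVGY are ANSI aggregate functions that the mainstream databases listed above already support, so adding them improves Spark SQL's ANSI coverage.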

Does this PR introduce any user-facing change?

'Yes'. New feature.

How was this patch tested?

New tests.

SparkQA · 2021-12-11T13:50:55Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50568/

SparkQA · 2021-12-11T14:36:26Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50568/

SparkQA · 2021-12-11T15:09:20Z

Test build #146093 has finished for PR 34868 at commit 9bfd4cb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class RegrAvgX(left: Expression, right: Expression) extends RegrAvg
case class RegrAvgY(left: Expression, right: Expression) extends RegrAvg
trait RegrAvg extends UnevaluableAggregate with ImplicitCastInputTypes with BinaryLike[Expression]

SparkQA · 2021-12-12T02:48:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50575/

SparkQA · 2021-12-12T03:47:55Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50575/

SparkQA · 2021-12-12T07:02:31Z

Test build #146101 has finished for PR 34868 at commit 4220627.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-17T08:19:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50796/

SparkQA · 2021-12-17T09:02:46Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50796/

SparkQA · 2021-12-17T09:47:29Z

Test build #146322 has finished for PR 34868 at commit f3262db.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-17T14:13:59Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50806/

SparkQA · 2021-12-17T14:58:56Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50806/

SparkQA · 2021-12-17T15:37:47Z

Test build #146333 has finished for PR 34868 at commit 677124b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-18T02:18:16Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50818/

SparkQA · 2021-12-18T02:20:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50819/

SparkQA · 2021-12-18T03:01:20Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50818/

SparkQA · 2021-12-18T03:06:53Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50819/

SparkQA · 2021-12-18T04:17:58Z

Test build #146344 has finished for PR 34868 at commit 742e73a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-18T06:18:13Z

Test build #146345 has finished for PR 34868 at commit f710cdd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2021-12-18T08:08:43Z

ping @MaxGekk @gengliangwang cc @cloud-fan

SparkQA · 2021-12-20T08:27:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50867/

cloud-fan · 2021-12-20T08:53:01Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regressions.scala

@@ -0,0 +1,124 @@
+/*


I'd make the file name clearer: linearRegression.scala

SparkQA · 2021-12-20T09:40:06Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50867/

SparkQA · 2021-12-20T12:48:39Z

Test build #146392 has finished for PR 34868 at commit 06d6790.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-20T14:26:27Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50875/

SparkQA · 2021-12-20T15:22:43Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50875/

SparkQA · 2021-12-20T18:46:38Z

Test build #146400 has finished for PR 34868 at commit 3423d55.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class RegrCount(left: Expression, right: Expression) extends RegressionAggregate
trait RegrAvg extends RegressionAggregate
case class RegrAvgX(left: Expression, right: Expression) extends RegrAvg
case class RegrAvgY(left: Expression, right: Expression) extends RegrAvg

amaliujia · 2022-02-10T00:43:05Z

@beliefer can you document the new functions in the PR description?

Basically, can you list:

The function signature (name and argument types)
Valid argument input ranges, maybe with some examples
Invalid argument inputs, maybe with some examples
Other special cases, if any

There is a usage comment for each expression. We can also copy the documentation above into those comments. The current usage description is too vague to understand the function specification:

  usage = """
     _FUNC_(expr1, expr2) - Returns the average of the independent variable for non-null pairs in
                            a group.
  """

Based on the function specification, we can further check if we have enough test coverage.

It's very important to have a function specification. Even though there are links to other dialects' documentation, we still want our own documentation to state exactly how the function is the same as or different from the others.

beliefer · 2022-02-10T03:51:07Z

@amaliujia PR description updated.

amaliujia · 2022-02-10T04:24:08Z

@beliefer can you refer to https://issues.apache.org/jira/browse/SPARK-38063 as an example to update the function specification?

Basically instead of reading code, let's first document and discuss what the functions should be, especially on their argument type, input range, illegal/legal inputs, etc.

After we catch edge cases by walking through the specification, writing the code and reviewing it will also be easier.

cloud-fan · 2022-02-24T13:06:15Z

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala

+  usage = """
+     _FUNC_(expr1, expr2) - Returns the average of the independent variable for non-null pairs in
+                            a group.
+  """,


Can you try DESC FUNCTION locally? I'm afraid a multi-line string doesn't work well in the function doc.

Yes. I should use // scalastyle:off line.size.limit.

If it's one line, can we use "..."?

cloud-fan · 2022-02-24T13:08:29Z

sql/core/src/test/resources/sql-functions/sql-expression-schema.md

-| org.apache.spark.sql.catalyst.expressions.Ceil | ceil | SELECT ceil(-0.1) | struct<CEIL(-0.1):decimal(1,0)> |
-| org.apache.spark.sql.catalyst.expressions.Ceil | ceiling | SELECT ceiling(-0.1) | struct<ceiling(-0.1):decimal(1,0)> |
+| org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$ | ceil | SELECT ceil(-0.1) | struct<CEIL(-0.1):decimal(1,0)> |
+| org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$ | ceiling | SELECT ceiling(-0.1) | struct<ceiling(-0.1):decimal(1,0)> |


interesting... existing tests didn't capture them.

Yes. The problem occurs in many of my PRs.

cloud-fan

LGTM except for one comment.

gengliangwang · 2022-02-24T13:48:02Z

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala

+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = """
+     _FUNC_(expr1, expr2) - Returns the average of the independent variable for non-null pairs in a group, where `expr1` is the dependent variable and `expr2` is the independent variable.


On second thought, let's change (expr1, expr2) to (y, x) so that users can understand the names "avgx" and "avgy":
https://docs.snowflake.com/en/sql-reference/functions/regr_avgx.html
https://docs.snowflake.com/en/sql-reference/functions/regr_avgy.html

gengliangwang · 2022-02-24T13:48:18Z

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala

+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = """
+     _FUNC_(expr1, expr2) - Returns the average of the dependent variable for non-null pairs in a group, where `expr1` is the dependent variable and `expr2` is the independent variable.


gengliangwang · 2022-02-24T13:57:54Z

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala

+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);


Let's add one example or test case checking that both regr_avgx(1, null) and regr_avgy(1, null) return null.
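
The case the reviewer asks about can be illustrated with a small plain-Python model (not Spark code), where `None` stands in for SQL NULL. The helper name `regr_avg` and its `pick` parameter are hypothetical. With the single row (1, null), no pair has both values non-null, so there is nothing to average and the model returns `None`, matching the null result the comment expects.

```python
def regr_avg(rows, pick):
    """Average pick(y, x) over the rows where both y and x are non-null.

    Returns None (modeling SQL NULL) when no row qualifies.
    """
    values = [pick(y, x) for y, x in rows if y is not None and x is not None]
    return sum(values) / len(values) if values else None


# One row whose independent variable is null: the pair is discarded.
single_row = [(1, None)]

avgx = regr_avg(single_row, lambda y, x: x)  # models regr_avgx(1, null)
avgy = regr_avg(single_row, lambda y, x: y)  # models regr_avgy(1, null)

print(avgx, avgy)
```

Note that the y-average is also null here even though y itself is 1: the filter applies to the pair, not to the column being averaged.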

cloud-fan · 2022-02-25T04:39:14Z

The last commit just fixes the doc string and the previous commit passed all tests. I'm merging it to master, thanks!

beliefer · 2022-02-25T06:26:24Z

@cloud-fan @gengliangwang Thank you for the review.

github-actions bot added the SQL label Dec 11, 2021

beliefer force-pushed the SPARK-37614 branch from 4220627 to c09ccc6 Compare December 17, 2021 06:49

beliefer force-pushed the SPARK-37614 branch from f3262db to 677124b Compare December 17, 2021 12:51

cloud-fan reviewed Dec 20, 2021

beliefer force-pushed the SPARK-37614 branch from 3423d55 to 53856a5 Compare December 28, 2021 06:24

beliefer added 8 commits February 24, 2022 16:12

[SPARK-37614][SQL] Support ANSI Aggregate Function: regr_avgx & regr_…

e6f7c86

…avgy

Update code

6bf0dd4

Update code

c261f0d

Update code

e389b5d

Update code

6dfcf6b

Update code

214db27

Update code

80722dc

Update code

5d07782

beliefer force-pushed the SPARK-37614 branch from e5ece31 to 5d07782 Compare February 24, 2022 09:37

cloud-fan reviewed Feb 24, 2022

cloud-fan approved these changes Feb 24, 2022

Update code

894ee46

gengliangwang reviewed Feb 24, 2022

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala (outdated, resolved)

gengliangwang reviewed Feb 24, 2022

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala (outdated, resolved)

beliefer and others added 2 commits February 24, 2022 21:46

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expr…

f8e2d1e

…essions/aggregate/linearRegression.scala Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expr…

9e0025a

…essions/aggregate/linearRegression.scala Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>

gengliangwang reviewed Feb 24, 2022

Update code

52137fc

gengliangwang reviewed Feb 24, 2022

gengliangwang approved these changes Feb 24, 2022

beliefer added 2 commits February 25, 2022 09:20

Update code

0743de1

Update code

a61dd3b

cloud-fan closed this in 95f06f3 Feb 25, 2022
