Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37614][SQL] Support ANSI Aggregate Function: regr_avgx & regr_avgy #34868

Closed
wants to merge 14 commits into from

Conversation

beliefer
Copy link
Contributor

@beliefer beliefer commented Dec 11, 2021

What changes were proposed in this pull request?

REGR_AVGX and REGR_AVGY are ANSI aggregate functions.

REGR_AVGX returns the mean of the independent_variable_expression for all non-null data pairs of the dependent and independent variable arguments.
Syntax: REGR_AVGX(dependent_variable_expression, independent_variable_expression)
The equation for computing REGR_AVGX is: REGR_AVGX = SUM(x)/n

Examples:

> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
  2.75
> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
  3.0
> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
  3.0

REGR_AVGY returns the mean of the dependent_variable_expression for all non-null data pairs of the dependent and independent variable arguments.
Syntax: REGR_AVGY(dependent_variable_expression, independent_variable_expression)
The equation for computing REGR_AVGY is: REGR_AVGY = SUM(y)/n

Examples:

> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
  1.75
> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
  1.6666666666666667
> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
  1.5

dependent_variable_expression: the dependent variable for the regression. A dependent variable is something that is measured in response to a treatment. The expression cannot contain any ordered analytical or aggregate functions.

independent_variable_expression: the independent variable for the regression. An independent variable is a treatment: something that is varied under your control to test the behavior of another variable.

The expression cannot contain any ordered analytical or aggregate functions.

The mainstream database supports REGR_AVGX and REGR_AVGY show below:
Teradata
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/KkJgUSq2O6JRU3bCK~0cug
Snowflake
https://docs.snowflake.com/en/sql-reference/functions/regr_avgx.html
Oracle
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9
Vertica
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Aggregate/REGR_AVGX.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAggregate%20Functions%7C_____24
DB2
https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count
H2
http://www.h2database.com/html/functions-aggregate.html#regr_avgx
Postgresql
https://www.postgresql.org/docs/8.4/functions-aggregate.html
Sybase
https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-avgx-function.html
Exasol
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm

Why are the changes needed?

REGR_AVGX and REGR_AVGY are very useful.

Does this PR introduce any user-facing change?

'Yes'. New feature.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Dec 11, 2021
@SparkQA
Copy link

SparkQA commented Dec 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50568/

@SparkQA
Copy link

SparkQA commented Dec 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50568/

@SparkQA
Copy link

SparkQA commented Dec 11, 2021

Test build #146093 has finished for PR 34868 at commit 9bfd4cb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RegrAvgX(left: Expression, right: Expression) extends RegrAvg
  • case class RegrAvgY(left: Expression, right: Expression) extends RegrAvg
  • trait RegrAvg extends UnevaluableAggregate with ImplicitCastInputTypes with BinaryLike[Expression]

@SparkQA
Copy link

SparkQA commented Dec 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50575/

@SparkQA
Copy link

SparkQA commented Dec 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50575/

@SparkQA
Copy link

SparkQA commented Dec 12, 2021

Test build #146101 has finished for PR 34868 at commit 4220627.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50796/

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50796/

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Test build #146322 has finished for PR 34868 at commit f3262db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50806/

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50806/

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Test build #146333 has finished for PR 34868 at commit 677124b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50818/

@SparkQA
Copy link

SparkQA commented Dec 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50819/

@SparkQA
Copy link

SparkQA commented Dec 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50818/

@SparkQA
Copy link

SparkQA commented Dec 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50819/

@SparkQA
Copy link

SparkQA commented Dec 18, 2021

Test build #146344 has finished for PR 34868 at commit 742e73a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 18, 2021

Test build #146345 has finished for PR 34868 at commit f710cdd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Copy link
Contributor Author

ping @MaxGekk @gengliangwang cc @cloud-fan

@SparkQA
Copy link

SparkQA commented Dec 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50867/

@@ -0,0 +1,124 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make the file name clearer: linearRegression.scala

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@SparkQA
Copy link

SparkQA commented Dec 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50867/

@SparkQA
Copy link

SparkQA commented Dec 20, 2021

Test build #146392 has finished for PR 34868 at commit 06d6790.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50875/

@SparkQA
Copy link

SparkQA commented Dec 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50875/

@SparkQA
Copy link

SparkQA commented Dec 20, 2021

Test build #146400 has finished for PR 34868 at commit 3423d55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RegrCount(left: Expression, right: Expression) extends RegressionAggregate
  • trait RegrAvg extends RegressionAggregate
  • case class RegrAvgX(left: Expression, right: Expression) extends RegrAvg
  • case class RegrAvgY(left: Expression, right: Expression) extends RegrAvg

@amaliujia
Copy link
Contributor

amaliujia commented Feb 10, 2022

@beliefer can you document the new functions in the PR description?

Basically can you list

  1. the function signature (name and argument type)
  2. Argument valid input range. Maybe some examples
  3. Not valid argument inputs. Maybe some examples.
  4. other special cases if any

There are usage comment for each expression. We can also copy documentation above into those comment. Current usage description is too vague to understand the function specification:

  usage = """
     _FUNC_(expr1, expr2) - Returns the average of the independent variable for non-null pairs in
                            a group.
  """

Based on the function specification, we can further check if we have enough test coverage.

It's very important to have function specification. Even though there are links to other dialects' documentation. We still want our documentation to know how exactly it is same or different from others.

@beliefer
Copy link
Contributor Author

@amaliujia PR description updated.

@amaliujia
Copy link
Contributor

amaliujia commented Feb 10, 2022

@beliefer can you refer to https://issues.apache.org/jira/browse/SPARK-38063 as an example to update the function specification?

Basically instead of reading code, let's first document and discuss what the functions should be, especially on their argument type, input range, illegal/legal inputs, etc.

After we catch edges cases by walking through the specification, writing code and code review will also be easier.

usage = """
_FUNC_(expr1, expr2) - Returns the average of the independent variable for non-null pairs in
a group.
""",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try DESC FUNCTION locally? I'm afraid multi-line string doesn't work well in function doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I should use // scalastyle:off line.size.limit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's one line, can we use "..."?

| org.apache.spark.sql.catalyst.expressions.Ceil | ceil | SELECT ceil(-0.1) | struct<CEIL(-0.1):decimal(1,0)> |
| org.apache.spark.sql.catalyst.expressions.Ceil | ceiling | SELECT ceiling(-0.1) | struct<ceiling(-0.1):decimal(1,0)> |
| org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$ | ceil | SELECT ceil(-0.1) | struct<CEIL(-0.1):decimal(1,0)> |
| org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$ | ceiling | SELECT ceiling(-0.1) | struct<ceiling(-0.1):decimal(1,0)> |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting... existing tests didn't capture them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The problem occurs in many of my PR.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one comment.

beliefer and others added 2 commits February 24, 2022 21:46
…essions/aggregate/linearRegression.scala

Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>
…essions/aggregate/linearRegression.scala

Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>
// scalastyle:off line.size.limit
@ExpressionDescription(
usage = """
_FUNC_(expr1, expr2) - Returns the average of the independent variable for non-null pairs in a group, where `expr1` is the dependent variable and `expr2` is the independent variable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, let's change (expr1, expr2) to (y, x) so that users can understanding the names of "avgx" and "avgy"
https://docs.snowflake.com/en/sql-reference/functions/regr_avgx.html
https://docs.snowflake.com/en/sql-reference/functions/regr_avgy.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

// scalastyle:off line.size.limit
@ExpressionDescription(
usage = """
_FUNC_(expr1, expr2) - Returns the average of the dependent variable for non-null pairs in a group, where `expr1` is the dependent variable and `expr2` is the independent variable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

""",
examples = """
Examples:
> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add one example or test case for checking both regr_avgx(1, null) and regr_avgy(1, null) return null

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@cloud-fan
Copy link
Contributor

The last commit just fixes the doc string and the previous commit passed all tests. I'm merging it to master, thanks!

@cloud-fan cloud-fan closed this in 95f06f3 Feb 25, 2022
@beliefer
Copy link
Contributor Author

@cloud-fan @gengliangwang Thank you for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants