New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-37614][SQL] Support ANSI Aggregate Function: regr_avgx & regr_avgy #34868
Conversation
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #146093 has finished for PR 34868 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #146101 has finished for PR 34868 at commit
|
4220627
to
c09ccc6
Compare
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #146322 has finished for PR 34868 at commit
|
f3262db
to
677124b
Compare
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #146333 has finished for PR 34868 at commit
|
Kubernetes integration test starting |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test status failure |
Test build #146344 has finished for PR 34868 at commit
|
Test build #146345 has finished for PR 34868 at commit
|
ping @MaxGekk @gengliangwang cc @cloud-fan |
Kubernetes integration test starting |
@@ -0,0 +1,124 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd make the file name clearer: linearRegression.scala
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
Kubernetes integration test status failure |
Test build #146392 has finished for PR 34868 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #146400 has finished for PR 34868 at commit
|
3423d55
to
53856a5
Compare
@beliefer can you document the new functions in the PR description? Basically can you list
There are usage comment for each expression. We can also copy documentation above into those comment. Current usage description is too vague to understand the function specification:
Based on the function specification, we can further check if we have enough test coverage. It's very important to have function specification. Even though there are links to other dialects' documentation. We still want our documentation to know how exactly it is same or different from others. |
@amaliujia PR description updated. |
@beliefer can you refer to https://issues.apache.org/jira/browse/SPARK-38063 as an example to update the function specification? Basically instead of reading code, let's first document and discuss what the functions should be, especially on their argument type, input range, illegal/legal inputs, etc. After we catch edges cases by walking through the specification, writing code and code review will also be easier. |
e5ece31
to
5d07782
Compare
usage = """ | ||
_FUNC_(expr1, expr2) - Returns the average of the independent variable for non-null pairs in | ||
a group. | ||
""", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you try DESC FUNCTION
locally? I'm afraid multi-line string doesn't work well in function doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I should use // scalastyle:off line.size.limit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's one line, can we use "..."
?
| org.apache.spark.sql.catalyst.expressions.Ceil | ceil | SELECT ceil(-0.1) | struct<CEIL(-0.1):decimal(1,0)> | | ||
| org.apache.spark.sql.catalyst.expressions.Ceil | ceiling | SELECT ceiling(-0.1) | struct<ceiling(-0.1):decimal(1,0)> | | ||
| org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$ | ceil | SELECT ceil(-0.1) | struct<CEIL(-0.1):decimal(1,0)> | | ||
| org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$ | ceiling | SELECT ceiling(-0.1) | struct<ceiling(-0.1):decimal(1,0)> | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting... existing tests didn't capture them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The problem occurs in many of my PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for one comment.
...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
Outdated
Show resolved
Hide resolved
...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
Outdated
Show resolved
Hide resolved
…essions/aggregate/linearRegression.scala Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>
…essions/aggregate/linearRegression.scala Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>
// scalastyle:off line.size.limit | ||
@ExpressionDescription( | ||
usage = """ | ||
_FUNC_(expr1, expr2) - Returns the average of the independent variable for non-null pairs in a group, where `expr1` is the dependent variable and `expr2` is the independent variable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, let's change (expr1, expr2) to (y, x) so that users can understanding the names of "avgx" and "avgy"
https://docs.snowflake.com/en/sql-reference/functions/regr_avgx.html
https://docs.snowflake.com/en/sql-reference/functions/regr_avgy.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
// scalastyle:off line.size.limit | ||
@ExpressionDescription( | ||
usage = """ | ||
_FUNC_(expr1, expr2) - Returns the average of the dependent variable for non-null pairs in a group, where `expr1` is the dependent variable and `expr2` is the independent variable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
""", | ||
examples = """ | ||
Examples: | ||
> SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add one example or test case for checking both regr_avgx(1, null)
and regr_avgy(1, null)
return null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
The last commit just fixes the doc string and the previous commit passed all tests. I'm merging it to master, thanks! |
@cloud-fan @gengliangwang Thank you for the review. |
What changes were proposed in this pull request?
REGR_AVGX
andREGR_AVGY
are ANSI aggregate functions.REGR_AVGX
returns the mean of the independent_variable_expression for all non-null data pairs of the dependent and independent variable arguments.Syntax:
REGR_AVGX(dependent_variable_expression, independent_variable_expression)
The equation for computing
REGR_AVGX
is:REGR_AVGX = SUM(x)/n
Examples:
REGR_AVGY
returns the mean of the dependent_variable_expression for all non-null data pairs of the dependent and independent variable arguments.Syntax:
REGR_AVGY(dependent_variable_expression, independent_variable_expression)
The equation for computing
REGR_AVGY
is:REGR_AVGY = SUM(y)/n
Examples:
dependent_variable_expression: the dependent variable for the regression. A dependent variable is something that is measured in response to a treatment. The expression cannot contain any ordered analytical or aggregate functions.
independent_variable_expression: the independent variable for the regression. An independent variable is a treatment: something that is varied under your control to test the behavior of another variable.
The expression cannot contain any ordered analytical or aggregate functions.
The mainstream database supports
REGR_AVGX
andREGR_AVGY
show below:Teradata
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/KkJgUSq2O6JRU3bCK~0cug
Snowflake
https://docs.snowflake.com/en/sql-reference/functions/regr_avgx.html
Oracle
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9
Vertica
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Aggregate/REGR_AVGX.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAggregate%20Functions%7C_____24
DB2
https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count
H2
http://www.h2database.com/html/functions-aggregate.html#regr_avgx
Postgresql
https://www.postgresql.org/docs/8.4/functions-aggregate.html
Sybase
https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-avgx-function.html
Exasol
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm
Why are the changes needed?
REGR_AVGX
andREGR_AVGY
are very useful.Does this PR introduce any user-facing change?
'Yes'. New feature.
How was this patch tested?
New tests.