-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-37641][SQL] Support ANSI Aggregate Function: regr_r2 #34894
Conversation
Test build #146176 has finished for PR 34894 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #146179 has finished for PR 34894 at commit
|
retest this please |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #146187 has finished for PR 34894 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #146391 has finished for PR 34894 at commit
|
dab7c93
to
c1b22f7
Compare
ping @MaxGekk @gengliangwang cc @cloud-fan |
case class RegrR2(x: Expression, y: Expression) extends PearsonCorrelation(x, y, true) { | ||
override def prettyName: String = "regr_r2" | ||
override val evaluateExpression: Expression = { | ||
lazy val corr = ck / sqrt(xMk * yMk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are building expressions here, not calculating the result. I don't think this lazy val is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think it should be good to make sure we won't calculate the corr
value if xMk or yMk is 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm it's just Expression.
override val evaluateExpression: Expression = { | ||
lazy val corr = ck / sqrt(xMk * yMk) | ||
If(n === 0.0, Literal.create(null, DoubleType), | ||
If(n === 1.0, divideByZeroEvalResult, corr * corr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
divideByZeroEvalResult
is for legacy behaviors, which we don't need to follow in new functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, is there any reference implementation or algorithm introduction to explain this corr * corr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We know how to evaluate corr, reference
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Corr.scala
Line 126 in 52e7602
If(n === 1.0, divideByZeroEvalResult, ck / sqrt(xMk * yMk))) |
a8fc177
to
4925a76
Compare
96023b4
to
dd399a6
Compare
|
||
// scalastyle:off line.size.limit | ||
@ExpressionDescription( | ||
usage = "_FUNC_(y, x) - Returns the number of non-null number pairs in a group, where `y` is the dependent variable and `x` is the independent variable.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function description is not correct, please update.
override val evaluateExpression: Expression = { | ||
val corr = ck / sqrt(xMk * yMk) | ||
If(xMk === 0.0, Literal.create(null, DoubleType), | ||
If(yMk === 0.0, Literal.create(1.0, DoubleType), corr * corr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be better to compute the corr
here. So you don't compute the value when xMk or yMk is 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm it's just Expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. there are expressions here.
The changes LGTM, cc @cloud-fan can you confirm? |
thanks, merging to master/3.3! |
### What changes were proposed in this pull request? This PR used to support ANSI aggregate Function: `regr_r2` **Syntax**: REGR_R2(y, x) **Arguments**: - **y**:The dependent variable. This must be an expression that can be evaluated to a numeric type. - **x**:The independent variable. This must be an expression that can be evaluated to a numeric type. **Examples**: `select k, regr_r2(v, v2) from aggr group by k;` | k | regr_r2(v, v2) | |--|---------------| | 1 | [NULL] | | 2 | 0.9976905312 | The mainstream database supports `regr_r2` show below: **Teradata** https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/exhFe2f_YyGqKFakYYUn2A **Snowflake** https://docs.snowflake.com/en/sql-reference/functions/regr_r2.html **Oracle** https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9 **DB2** https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count **H2** http://www.h2database.com/html/functions-aggregate.html#regr_r2 **Postgresql** https://www.postgresql.org/docs/8.4/functions-aggregate.html **Sybase** https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-r2-function.html **Exasol** https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm ### Why are the changes needed? `regr_r2` is very useful. ### Does this PR introduce _any_ user-facing change? 'Yes'. New feature. ### How was this patch tested? New tests. Closes #34894 from beliefer/SPARK-37641. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b01d81e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@jiangxb1987 Thank you for the review. @cloud-fan Thank you too. |
What changes were proposed in this pull request?
This PR used to support ANSI aggregate Function:
regr_r2
Syntax: REGR_R2(y, x)
Arguments:
Examples:
select k, regr_r2(v, v2) from aggr group by k;
The mainstream database supports
regr_r2
show below:Teradata
https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/exhFe2f_YyGqKFakYYUn2A
Snowflake
https://docs.snowflake.com/en/sql-reference/functions/regr_r2.html
Oracle
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9
DB2
https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count
H2
http://www.h2database.com/html/functions-aggregate.html#regr_r2
Postgresql
https://www.postgresql.org/docs/8.4/functions-aggregate.html
Sybase
https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-r2-function.html
Exasol
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm
Why are the changes needed?
regr_r2
is very useful.Does this PR introduce any user-facing change?
'Yes'. New feature.
How was this patch tested?
New tests.