Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37641][SQL] Support ANSI Aggregate Function: regr_r2 #34894

Closed
wants to merge 8 commits into from

Conversation

beliefer
Copy link
Contributor

@beliefer beliefer commented Dec 14, 2021

What changes were proposed in this pull request?

This PR used to support ANSI aggregate Function: regr_r2

Syntax: REGR_R2(y, x)
Arguments:

  • y:The dependent variable. This must be an expression that can be evaluated to a numeric type.
  • x:The independent variable. This must be an expression that can be evaluated to a numeric type.

Examples:
select k, regr_r2(v, v2) from aggr group by k;

k regr_r2(v, v2)
1 [NULL]
2 0.9976905312

The mainstream database supports regr_r2 show below:
Teradata
https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/exhFe2f_YyGqKFakYYUn2A
Snowflake
https://docs.snowflake.com/en/sql-reference/functions/regr_r2.html
Oracle
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9
DB2
https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count
H2
http://www.h2database.com/html/functions-aggregate.html#regr_r2
Postgresql
https://www.postgresql.org/docs/8.4/functions-aggregate.html
Sybase
https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-r2-function.html
Exasol
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm

Why are the changes needed?

regr_r2 is very useful.

Does this PR introduce any user-facing change?

'Yes'. New feature.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Dec 14, 2021
@SparkQA
Copy link

SparkQA commented Dec 14, 2021

Test build #146176 has finished for PR 34894 at commit 267e94c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RegrR2(x: Expression, y: Expression)

@SparkQA
Copy link

SparkQA commented Dec 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50649/

@SparkQA
Copy link

SparkQA commented Dec 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50649/

@SparkQA
Copy link

SparkQA commented Dec 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50652/

@SparkQA
Copy link

SparkQA commented Dec 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50652/

@SparkQA
Copy link

SparkQA commented Dec 14, 2021

Test build #146179 has finished for PR 34894 at commit afe3707.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Dec 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50661/

@SparkQA
Copy link

SparkQA commented Dec 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50661/

@SparkQA
Copy link

SparkQA commented Dec 14, 2021

Test build #146187 has finished for PR 34894 at commit afe3707.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50866/

@SparkQA
Copy link

SparkQA commented Dec 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50866/

@SparkQA
Copy link

SparkQA commented Dec 20, 2021

Test build #146391 has finished for PR 34894 at commit dab7c93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Copy link
Contributor Author

ping @MaxGekk @gengliangwang cc @cloud-fan

@beliefer beliefer changed the title [WIP][SPARK-37641][SQL] Support ANSI Aggregate Function: regr_r2 [SPARK-37641][SQL] Support ANSI Aggregate Function: regr_r2 Feb 27, 2022
case class RegrR2(x: Expression, y: Expression) extends PearsonCorrelation(x, y, true) {
override def prettyName: String = "regr_r2"
override val evaluateExpression: Expression = {
lazy val corr = ck / sqrt(xMk * yMk)
Copy link
Contributor

@cloud-fan cloud-fan Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are building expressions here, not calculating the result. I don't think this lazy val is necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think it should be good to make sure we won't calculate the corr value if xMk or yMk is 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm it's just Expression.

override val evaluateExpression: Expression = {
lazy val corr = ck / sqrt(xMk * yMk)
If(n === 0.0, Literal.create(null, DoubleType),
If(n === 1.0, divideByZeroEvalResult, corr * corr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

divideByZeroEvalResult is for legacy behaviors, which we don't need to follow in new functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, is there any reference implementation or algorithm introduction to explain this corr * corr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

We know how to evaluate corr, reference

If(n === 1.0, divideByZeroEvalResult, ck / sqrt(xMk * yMk)))

@beliefer beliefer force-pushed the SPARK-37641 branch 2 times, most recently from a8fc177 to 4925a76 Compare March 13, 2022 02:38
@beliefer beliefer force-pushed the SPARK-37641 branch 3 times, most recently from 96023b4 to dd399a6 Compare March 24, 2022 01:40

// scalastyle:off line.size.limit
@ExpressionDescription(
usage = "_FUNC_(y, x) - Returns the number of non-null number pairs in a group, where `y` is the dependent variable and `x` is the independent variable.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function description is not correct, please update.

override val evaluateExpression: Expression = {
val corr = ck / sqrt(xMk * yMk)
If(xMk === 0.0, Literal.create(null, DoubleType),
If(yMk === 0.0, Literal.create(1.0, DoubleType), corr * corr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be better to compute the corr here. So you don't compute the value when xMk or yMk is 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm it's just Expression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. there are expressions here.

@jiangxb1987
Copy link
Contributor

The changes LGTM, cc @cloud-fan can you confirm?

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.3!

@cloud-fan cloud-fan closed this in b01d81e Mar 29, 2022
cloud-fan pushed a commit that referenced this pull request Mar 29, 2022
### What changes were proposed in this pull request?
This PR used to support ANSI aggregate Function: `regr_r2`

**Syntax**: REGR_R2(y, x)
**Arguments**:
- **y**:The dependent variable. This must be an expression that can be evaluated to a numeric type.
- **x**:The independent variable. This must be an expression that can be evaluated to a numeric type.

**Examples**:
`select k, regr_r2(v, v2) from aggr group by k;`

| k | regr_r2(v, v2) |
|--|---------------|
| 1 |    [NULL]         |
| 2 | 0.9976905312   |

The mainstream database supports `regr_r2` show below:
**Teradata**
https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/exhFe2f_YyGqKFakYYUn2A
**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/regr_r2.html
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9
**DB2**
https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count
**H2**
http://www.h2database.com/html/functions-aggregate.html#regr_r2
**Postgresql**
https://www.postgresql.org/docs/8.4/functions-aggregate.html
**Sybase**
https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-r2-function.html
**Exasol**
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm

### Why are the changes needed?
`regr_r2` is very useful.

### Does this PR introduce _any_ user-facing change?
'Yes'. New feature.

### How was this patch tested?
New tests.

Closes #34894 from beliefer/SPARK-37641.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b01d81e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@beliefer
Copy link
Contributor Author

@jiangxb1987 Thank you for the review. @cloud-fan Thank you too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants