Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum} #40070

Closed
wants to merge 5 commits into from

Conversation

amaliujia
Copy link
Contributor

What changes were proposed in this pull request?

Adding more API to agg including max,min,mean,count,avg,sum.

Why are the changes needed?

API coverage

Does this PR introduce any user-facing change?

NO

How was this patch tested?

UT

@amaliujia
Copy link
Contributor Author

@hvanhovell

@amaliujia
Copy link
Contributor Author

amaliujia commented Feb 17, 2023

I need to update golden files in this PR.

.setFunctionName("avg")
.addArguments(inputExpr)
.setIsDistinct(false)
functions.avg(columnName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You currently don't return this function, but the result of builder.build(). If you do, it should be functions.avg(columnName).expr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did right replacement and hit a proto -> plan test generation failure.

I am planing look into that separately. I am gonna need some time to learn how to debug org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite

case name =>
builder.getUnresolvedFunctionBuilder
.setFunctionName(name)
.addArguments(inputExpr)
.addArguments(df(columnName).expr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Column.fn instead?

Copy link
Contributor Author

@amaliujia amaliujia Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold on that I will revert this part. It seems hit an issue somewhere if I switch to use functions API. I need to understand more on the functions implementation.

I will debug this separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my earlier comment, and also the tests are broken.

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the tests and fix strToExpr. Looks good otherwise.

* @since 3.4.0
*/
def count(): Long = {
groupBy().count().collect().head.getLong(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't I implement that?

lol...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my local branch that I rebased today, there is no this API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well you are right... I thought I added it.

*/
@scala.annotation.varargs
def mean(colNames: String*): DataFrame = {
toDF(colNames.map(colName => functions.mean(colName)).toSeq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need toSeq here? I though scala varags are always a Seq...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm I see. Removing those toSeq.

@@ -109,7 +109,7 @@ class RelationalGroupedDataset protected[sql] (
agg(exprs.asScala.toMap)
}

private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
private[this] def strToColumn(expr: String, inputExpr: proto.Expression): Column = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

private[this] def strToColumn(expr: String, inputExpr: Column): Column = {
  expr.toLowerCase(Locale.ROOT) match {
    case "avg" | "average" | "mean" => functions.avg(inputExpr)
    case "stddev" | "std" => functions.avg(inputExpr)
    case "count" | "size" => functions.count(inputExpr) // Analyzer will take care of * expansion
    case name => Column.fn(name, inputExpr)
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you are suggesting now. Done.

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One small comment.

@hvanhovell
Copy link
Contributor

Merging.

hvanhovell pushed a commit that referenced this pull request Feb 18, 2023
### What changes were proposed in this pull request?

Adding more API to `agg` including max,min,mean,count,avg,sum.

### Why are the changes needed?

API coverage
### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT

Closes #40070 from amaliujia/rw-agg2.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 74f53b8)
Signed-off-by: Herman van Hovell <herman@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
### What changes were proposed in this pull request?

Adding more API to `agg` including max,min,mean,count,avg,sum.

### Why are the changes needed?

API coverage
### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT

Closes apache#40070 from amaliujia/rw-agg2.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 74f53b8)
Signed-off-by: Herman van Hovell <herman@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants