[SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum} #40070

amaliujia · 2023-02-17T20:37:36Z

What changes were proposed in this pull request?

Adds more aggregation APIs to agg: max, min, mean, count, avg, and sum.

Why are the changes needed?

API coverage

Does this PR introduce any user-facing change?

NO

How was this patch tested?

UT

amaliujia · 2023-02-17T20:37:49Z

@hvanhovell

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

amaliujia · 2023-02-17T20:54:27Z

I need to update golden files in this PR.

hvanhovell · 2023-02-17T20:54:44Z

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)


You currently don't return this function, but the result of builder.build(). If you do, it should be functions.avg(columnName).expr.

I think I made the right replacement, but I hit a proto -> plan test generation failure.

I am planning to look into that separately. I will need some time to learn how to debug org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite.

hvanhovell · 2023-02-17T20:55:02Z

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

      case name =>
        builder.getUnresolvedFunctionBuilder
          .setFunctionName(name)
-          .addArguments(inputExpr)
+          .addArguments(df(columnName).expr)


Use Column.fn instead?

Hold on, I will revert this part. Switching to the functions API seems to hit an issue somewhere. I need to understand the functions implementation better.

I will debug this separately.

See my earlier comment, and also the tests are broken.

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala

hvanhovell

Please update the tests and fix strToExpr. Looks good otherwise.

hvanhovell · 2023-02-17T21:15:10Z

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala

+   * @since 3.4.0
+   */
+  def count(): Long = {
+    groupBy().count().collect().head.getLong(0)


Didn't I implement that?

lol...

In my local branch, which I rebased today, this API does not exist.

Well you are right... I thought I added it.

hvanhovell · 2023-02-17T21:16:23Z

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

+   */
+  @scala.annotation.varargs
+  def mean(colNames: String*): DataFrame = {
+    toDF(colNames.map(colName => functions.mean(colName)).toSeq)


Do we need toSeq here? I thought Scala varargs are always a Seq...

Hmm, I see. Removing those toSeq calls.
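The varargs point can be checked outside Spark. The sketch below is a minimal, self-contained illustration and not code from this PR: `VarargsSketch`, `meanAll`, and the string-returning `mean` are hypothetical stand-ins for `RelationalGroupedDataset.mean` and `functions.mean`. It relies only on the fact that a `String*` parameter is typed as `Seq[String]` inside the method body, so `map` already yields a `Seq` and a trailing `.toSeq` adds nothing.

```scala
object VarargsSketch {
  // Hypothetical stand-in for functions.mean(colName): returns a label
  // instead of a Column so the example has no Spark dependency.
  def mean(colName: String): String = s"mean($colName)"

  // Inside the body, colNames is a Seq[String], so map returns a Seq[String]
  // directly and no .toSeq conversion is needed.
  @scala.annotation.varargs
  def meanAll(colNames: String*): Seq[String] = colNames.map(mean)
}
```

Calling `VarargsSketch.meanAll("a", "b")` compiles without any conversion because the result of `map` already conforms to the declared `Seq[String]` return type.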

hvanhovell · 2023-02-17T21:27:39Z

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

@@ -109,7 +109,7 @@ class RelationalGroupedDataset protected[sql] (
    agg(exprs.asScala.toMap)
  }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToColumn(expr: String, inputExpr: proto.Expression): Column = {


How about:

private[this] def strToColumn(expr: String, inputExpr: Column): Column = {
  expr.toLowerCase(Locale.ROOT) match {
    case "avg" | "average" | "mean" => functions.avg(inputExpr)
    case "stddev" | "std" => functions.stddev(inputExpr)
    case "count" | "size" => functions.count(inputExpr) // Analyzer will take care of * expansion
    case name => Column.fn(name, inputExpr)
  }
}

I see what you are suggesting now. Done.
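The dispatch pattern discussed in this thread can be tried in isolation. The following is a rough sketch under stated assumptions, not the merged implementation: `Col`, `Ref`, and `Fn` are hypothetical stand-ins for the client's `Column`, and building `Fn(name, ...)` stands in for calls such as `functions.avg` or `Column.fn`. Mapping the `stddev` and `std` aliases to a `stddev` call is also an assumption of this sketch.

```scala
import java.util.Locale

object StrToColumnSketch {
  // Hypothetical, minimal expression model standing in for Column.
  sealed trait Col
  final case class Ref(name: String) extends Col
  final case class Fn(name: String, args: Seq[Col]) extends Col

  // Normalize the aggregate name, fold known aliases into one canonical
  // function, and pass any other name through unchanged.
  def strToColumn(expr: String, input: Col): Col =
    expr.toLowerCase(Locale.ROOT) match {
      case "avg" | "average" | "mean" => Fn("avg", Seq(input))
      case "stddev" | "std"           => Fn("stddev", Seq(input))
      case "count" | "size"           => Fn("count", Seq(input))
      case name                       => Fn(name, Seq(input))
    }
}
```

Taking an already-built column as input, rather than a raw proto expression, keeps the alias handling in one place and lets the fallback case reuse the same construction path as the named cases.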

hvanhovell

LGTM. One small comment.

hvanhovell · 2023-02-18T00:48:56Z

Merging.

### What changes were proposed in this pull request?
Adding more API to `agg` including max,min,mean,count,avg,sum.

### Why are the changes needed?
API coverage

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
UT

Closes #40070 from amaliujia/rw-agg2.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 74f53b8)
Signed-off-by: Herman van Hovell <herman@databricks.com>


[SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}.

6f40669

github-actions bot added CONNECT SQL labels Feb 17, 2023

hvanhovell reviewed Feb 17, 2023

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (outdated)

hvanhovell reviewed Feb 17, 2023

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

hvanhovell reviewed Feb 17, 2023

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (outdated)

hvanhovell reviewed Feb 17, 2023

connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala

hvanhovell requested changes Feb 17, 2023

amaliujia added 2 commits February 17, 2023 13:13

update

75f855f

update

2e0b752

hvanhovell reviewed Feb 17, 2023

update

e08c52d

hvanhovell reviewed Feb 17, 2023

hvanhovell approved these changes Feb 17, 2023

update

380b36c

hvanhovell closed this in 74f53b8 Feb 18, 2023
