
[SPARK-41391][SQL] The output column name of groupBy.agg(count_distinct) is incorrect #40116

Closed
wants to merge 16 commits

Conversation

@ritikam2 (Contributor) commented Feb 22, 2023

What changes were proposed in this pull request?

Correct the output column name of groupBy.agg(count_distinct) so that "*" is expanded into the underlying column names and the output column name includes the DISTINCT keyword.

Why are the changes needed?

The output column name for groupBy.agg(count_distinct) is incorrect, while similar queries in Spark SQL produce the correct output column name. For groupBy.agg queries on a DataFrame, "*" is not expanded into column names in the output column, and the DISTINCT keyword is missing.

// initializing data
scala> val df = spark.range(1, 10).withColumn("value", lit(1))
df: org.apache.spark.sql.DataFrame = [id: bigint, value: int]
scala> df.createOrReplaceTempView("table")

// DataFrame aggregate queries with incorrect output columns
scala> df.groupBy("id").agg(count_distinct($"*"))
res3: org.apache.spark.sql.DataFrame = [id: bigint, count(unresolvedstar()): bigint]
scala> df.groupBy("id").agg(count_distinct($"value"))
res1: org.apache.spark.sql.DataFrame = [id: bigint, count(value): bigint]

// Spark Sql aggregate queries with correct output column
scala> spark.sql(" SELECT id, COUNT(DISTINCT *) FROM table GROUP BY id ")
res4: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT id, value): bigint]
scala> spark.sql(" SELECT id, COUNT(DISTINCT value) FROM table GROUP BY id ")
res2: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): bigint]

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

@github-actions bot added the SQL label Feb 22, 2023
@zhengruifeng (Contributor)

I guess you may need to go to the “Actions” tab on your forked repository and enable the “Build and test” and “Report test results” workflows.

https://spark.apache.org/contributing.html

@ritikam2 (Contributor, Author)

Did that

@srowen (Member) commented Feb 25, 2023

Eh, this does not explain the issue at all. Please do so.

@ritikam2 (Contributor, Author)

I have enabled the workflows on the branch. Is there something else that I need to do?

@ritikam2 (Contributor, Author)

Sean, I'm not sure which issue you were referring to. I updated the “Why are the changes needed” section of the pull request to mirror what Zheng had already put in his pull request.

@srowen (Member) commented Feb 26, 2023

This is about SPARK-41391? It also doesn't contain a simple description of what you're reporting, just code snippets. I can work it out, but this could be explained in just a few sentences.

@srowen (Member) commented Feb 26, 2023

Please fix the PR description too https://spark.apache.org/contributing.html

@ritikam2 ritikam2 changed the title [WIP]SPARK-41391 Fix SPARK-41391[SQL][WIP] Feb 26, 2023
@ritikam2 (Contributor, Author)

Sean, I tried to correct the two things you pointed out. Let me know if that works.

@srowen (Member) commented Feb 26, 2023

Looks better. The title should start with [SPARK-41391] to link it. Please include the description in the title; there is nothing there now.

@ritikam2 ritikam2 changed the title SPARK-41391[SQL][WIP] SPARK-41391[SQL][WIP] The output column name of groupBy.agg(count_distinct) is incorrect Feb 26, 2023
@ritikam2 ritikam2 changed the title SPARK-41391[SQL][WIP] The output column name of groupBy.agg(count_distinct) is incorrect SPARK-41391[SQL] The output column name of groupBy.agg(count_distinct) is incorrect Feb 27, 2023
@zhengruifeng zhengruifeng changed the title SPARK-41391[SQL] The output column name of groupBy.agg(count_distinct) is incorrect [SPARK-41391][SQL] The output column name of groupBy.agg(count_distinct) is incorrect Feb 27, 2023
@ritikam2 (Contributor, Author)

Not sure how my check-ins are causing a javadoc generation error.

@srowen (Member) commented Feb 27, 2023

It's the [[Star]] in the scaladoc you added. Just don't make it a reference
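For illustration, a sketch of the kind of change Sean is describing (hypothetical doc text, not the exact patch): scaladoc treats [[...]] as a link and doc generation fails when the link target cannot be resolved, while backticks simply render the name as inline code.

```scala
// Breaks doc generation: [[Star]] is a scaladoc link whose target
// may not be resolvable from the documentation classpath.
/** Returns true if `exprs` contains a [[Star]]. */

// Safe: backticks render Star as inline code, with no link resolution.
/** Returns true if `exprs` contains a `Star`. */
```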

@ritikam2 (Contributor, Author) commented Mar 1, 2023

Is there anything else that I need to do for the fix to be accepted?

@srowen (Member) commented Mar 1, 2023

@cloud-fan or @HyukjinKwon do you have an opinion?

/**
* Returns true if `exprs` contains a star.
*/
def containsStar(exprs: Seq[Expression]): Boolean =
Contributor (review comment):

It should be private. Since it is only used once, I think we can inline it.
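The suggestion above can be sketched with toy expression types (hypothetical names, not Spark's real Catalyst classes) to show the shape of the star check and why, with a single caller, it can be inlined as one `exists` expression:

```scala
// Toy stand-in for Catalyst's Expression hierarchy (hypothetical types).
sealed trait Expr
case object Star extends Expr
final case class Col(name: String) extends Expr
final case class Func(name: String, isDistinct: Boolean, args: Seq[Expr]) extends Expr

// Recursively checks whether any expression contains a star; at a single
// call site this body can be inlined as a plain exists(...) call.
def hasStar(exprs: Seq[Expr]): Boolean = exprs.exists {
  case Star             => true
  case Func(_, _, args) => hasStar(args)
  case _                => false
}
```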

@@ -89,9 +89,18 @@ class RelationalGroupedDataset protected[sql](
case expr: NamedExpression => expr
case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
UnresolvedAlias(a, Some(Column.generateAlias))
case ag: UnresolvedFunction if (containsStar(Seq(ag))) || ag.isDistinct =>
Contributor (review comment):

It's weird to have this special case. Shall we always use UnresolvedAlias?

@ritikam2 (Contributor, Author) commented Mar 2, 2023

Not sure why the suggested changes made the build fail in the catalyst, hive-thriftserver, and sql-other test modules.

2023-03-01T22:23:36.6700903Z Error instrumenting class: org.apache.spark.sql.execution.streaming.state.SchemaHelper$SchemaV2Reader
2023-03-01T22:23:36.8662344Z Error instrumenting class: org.apache.spark.sql.v2.avro.AvroScan
2023-03-01T22:23:36.8712474Z Error instrumenting class: org.apache.spark.api.python.DoubleArrayWritable

/**
* Returns true if `exprs` contains a star.
*/
@inline final private def containsStar(exprs: Seq[Expression]): Boolean =
Member (review comment):

Let's probably remove this.

@ritikam2 (Contributor, Author) commented Mar 3, 2023

Any comments? Apparently, making every expr an UnresolvedAlias is not working.

@cloud-fan (Contributor)

Apparently having all expr as unresolvedAlias is not working.

Can you share the test failures? Maybe we just need to update the tests with the different alias name.

@ritikam2 (Contributor, Author) commented Mar 8, 2023

@cloud-fan (Contributor)

I think the test is easy to fix. It wants to test the aggregate function result, but not the generated alias, so we just change the testing query to add alias explicitly.

val avgDF = intervalData.select(
      avg($"year-month").as("a1"),
      avg($"year").as("a2"),
      ...

@ritikam2 (Contributor, Author)

I think the test is easy to fix. It wants to test the aggregate function result, but not the generated alias, so we just change the testing query to add alias explicitly.

val avgDF = intervalData.select(
      avg($"year-month").as("a1"),
      avg($"year").as("a2"),
      ...

Couple of questions:

  1. Is it required and documented that we should add an alias to aggregate functions? If that is not a requirement, then fixing the test case is potentially covering up an issue.
  2. The thread leaks reported in the sql-other tests are not just from DataFrameAggregateSuite, but from multiple other suites:

2023-03-03T04:05:16.9822203Z 04:05:16.978 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 393.0 failed 1 times; aborting job
2023-03-03T04:05:16.9866693Z [info] - SPARK-30668: use legacy timestamp parser in to_timestamp (154 milliseconds)
2023-03-03T04:05:17.0464670Z [info] - SPARK-30752: convert time zones on a daylight saving day (62 milliseconds)
2023-03-03T04:05:17.1930942Z [info] - SPARK-30766: date_trunc of old timestamps to hours and days (142 milliseconds)
2023-03-03T04:05:17.3358608Z [info] - SPARK-30793: truncate timestamps before the epoch to seconds and minutes (146 milliseconds)
2023-03-03T04:05:17.3824844Z 04:05:17.382 WARN org.apache.spark.sql.DateFunctionsSuite:
2023-03-03T04:05:17.3846873Z ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.DateFunctionsSuite,

@cloud-fan (Contributor)

The auto-generated alias name is fragile and we are trying to improve it at #40126

Can you give some examples of how the new update changes the alias name? If it's not reasonable, we should keep the previous code.

@ritikam2 (Contributor, Author) commented Mar 14, 2023

The auto-generated alias name is fragile and we are trying to improve it at #40126

Can you give some examples of how the new update changes the alias name? If it's not reasonable, we should keep the previous code.

I am attaching a file showing some failures from when all of the aggregate expressions were made UnresolvedAlias. My latest check-in, where I only make the aggregate expressions that contain "*" into UnresolvedAlias, works; the build went through. So it is essentially the unresolvedstar() produced by toPrettySQL for an agg expr with a star that the Analyzer is not able to resolve.
sqlOtherTests.txt

@ritikam2 (Contributor, Author) commented Mar 15, 2023

Can anyone tell me how I am getting this single quote in the count expression? Attaching a picture. This can potentially cause problems down the line, where tree nodes compared in transformDownWithPruning are not the same because of this single quote.
[Screenshot: Screen Shot 2023-03-15 at 9 34 51 AM]

@cloud-fan (Contributor)

The single quote indicates that the expression is unresolved, I think it doesn't matter here.
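As a toy illustration of that printing convention (hypothetical types, not Spark's real classes): Catalyst renders unresolved nodes with a leading single quote, so 'count(...) simply marks an expression the Analyzer has not yet resolved.

```scala
// Toy sketch of the convention: unresolved nodes print with a leading
// single quote, resolved ones print plainly (hypothetical classes).
sealed trait Node {
  def resolved: Boolean
  def name: String
  override def toString: String = if (resolved) name else s"'$name"
}
final class ResolvedFn(val name: String) extends Node { val resolved = true }
final class UnresolvedFn(val name: String) extends Node { val resolved = false }
```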

@@ -40,12 +40,15 @@ abstract class SQLImplicits extends LowPrioritySQLImplicits {
*/
implicit class StringToColumn(val sc: StringContext) {
def $(args: Any*): ColumnName = {
new ColumnName(sc.s(args: _*))
if (sc.parts.length == 1 && sc.parts.contains("*")) {
Contributor (review comment):

What does this change fix?

Contributor Author (review reply):

Yup, this is redundant. Removed it in the latest build.

if (containsStar(Seq(expr))) {
UnresolvedAlias(expr, None)
} else {
Alias(expr, toPrettySQL(expr))()
Contributor (review comment):

If we want a surgical fix, shall we fix how toPrettySQL handles star?

Contributor Author (review reply):

If we want a surgical fix, shall we fix how toPrettySQL handles star?

Sure, we can fix toPrettySQL. But the best we can do there is produce count(distinct *), which is not the same as what spark.sql produces.

If we want to duplicate the spark.sql behavior, the best option is to create an UnresolvedAlias for expressions containing "*", as pushed in the latest build.

@ritikam2 (Contributor, Author) commented Mar 22, 2023

Perhaps the following would be a better solution: instead of looking for a star, any UnresolvedFunction should get an UnresolvedAlias. Any comments?

private[this] def alias(expr: Expression): NamedExpression = expr match {
  case expr: NamedExpression => expr
  case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
    UnresolvedAlias(a, Some(Column.generateAlias))
  case expr: Expression =>
    if (expr.isInstanceOf[UnresolvedFunction]) {
      UnresolvedAlias(expr, None)
    } else {
      Alias(expr, toPrettySQL(expr))()
    }
}

@cloud-fan (Contributor) commented Mar 24, 2023

any UnresolvedFunction should have UnresolvedAlias.

SGTM. Or more aggressively, any expression should have UnresolvedAlias, and update failed tests.

@ritikam2 (Contributor, Author)

Right. This is a simple one-file fix plus a test case, versus the other approach, which may involve a number of files.

@ritikam2 (Contributor, Author)

Please see if this fix can be pulled.

@@ -89,7 +89,12 @@ class RelationalGroupedDataset protected[sql](
case expr: NamedExpression => expr
case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
UnresolvedAlias(a, Some(Column.generateAlias))
case expr: Expression => Alias(expr, toPrettySQL(expr))()
case expr: Expression =>
Contributor (review comment):

nit:

case u: UnresolvedFunction => UnresolvedAlias(expr, None)
case expr: Expression => Alias(expr, toPrettySQL(expr))() 

@@ -40,12 +40,11 @@ abstract class SQLImplicits extends LowPrioritySQLImplicits {
*/
implicit class StringToColumn(val sc: StringContext) {
def $(args: Any*): ColumnName = {
new ColumnName(sc.s(args: _*))
new ColumnName(sc.s(args: _*))
Contributor (review comment):

Unnecessary change.

@@ -45,7 +45,7 @@ abstract class SQLImplicits extends LowPrioritySQLImplicits {
}

// Primitives

Contributor (review comment):

I think this will fail scalastyle check

@cloud-fan (Contributor)

Thanks, merging to master!

@cloud-fan cloud-fan closed this in cb7d082 Mar 31, 2023