[SPARK-41391][SQL] The output column name of `groupBy.agg(count_distinct)` is incorrect #38917

zhengruifeng · 2022-12-05T11:49:58Z

What changes were proposed in this pull request?

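Correct the output column name of groupBy.agg(count_distinct) so that it includes the DISTINCT keyword, consistent with select(count_distinct) and with the equivalent SQL query.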

Why are the changes needed?

before this PR: [id: bigint, count(value): bigint]

scala> val df = spark.range(1, 10).withColumn("value", lit(1))
df: org.apache.spark.sql.DataFrame = [id: bigint, value: int]

scala> df.select(count_distinct($"value"))
res0: org.apache.spark.sql.DataFrame = [count(DISTINCT value): bigint]

scala> df.groupBy("id").agg(count_distinct($"value"))
res1: org.apache.spark.sql.DataFrame = [id: bigint, count(value): bigint]

scala> df.select(sum_distinct($"value"))
res2: org.apache.spark.sql.DataFrame = [sum(DISTINCT value): bigint]

scala> df.groupBy("id").agg(sum_distinct($"value"))
res3: org.apache.spark.sql.DataFrame = [id: bigint, sum(DISTINCT value): bigint]

scala> df.createOrReplaceTempView("table")

scala> spark.sql(" SELECT id, COUNT(DISTINCT value) FROM table GROUP BY id ")
res5: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): bigint]

after this PR: [id: bigint, count(DISTINCT value): bigint]

scala> val df = spark.range(1, 10).withColumn("value", lit(1))
df: org.apache.spark.sql.DataFrame = [id: bigint, value: int]

scala> df.select(count_distinct($"value"))
res0: org.apache.spark.sql.DataFrame = [count(DISTINCT value): bigint]

scala> df.groupBy("id").agg(count_distinct($"value"))
res1: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): bigint]

Does this PR introduce any user-facing change?

Yes. The default output column name of groupBy.agg(count_distinct(...)) changes from count(value) to count(DISTINCT value).
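
Code that reads the aggregated column by its default name is sensitive to a change like this one. The following is a minimal sketch, not part of this PR, of pinning the name with an explicit alias so that it no longer depends on the default. The object name, the local SparkSession settings, and the alias `distinct_values` are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count_distinct, lit}

object ExplicitAliasSketch {
  // Builds the same DataFrame as the REPL session in the description and
  // names the aggregate explicitly instead of relying on the default name.
  def aliasedAgg(spark: SparkSession): DataFrame = {
    val df = spark.range(1, 10).withColumn("value", lit(1))
    // "distinct_values" is a hypothetical alias chosen for this sketch.
    df.groupBy("id").agg(count_distinct(col("value")).as("distinct_values"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-alias-sketch")
      .getOrCreate()
    try {
      println(aliasedAgg(spark).columns.mkString(", "))
    } finally {
      spark.stop()
    }
  }
}
```

With the alias in place, the second column is named distinct_values whether or not the default name carries the DISTINCT keyword.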

How was this patch tested?

Added a unit test in DataFrameSuite.

amaliujia

LGTM nice catch!

amaliujia · 2022-12-06T03:52:55Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

@@ -1134,6 +1134,11 @@ class DataFrameSuite extends QueryTest
    checkAnswer(approxSummaryDF, approxSummaryResult)
  }

+  test("SPARK-41391: Correct the output column name of groupBy.agg(count_distinct)") {
+    val df = person.groupBy("id").agg(count_distinct(col("name")))
+    assert(df.columns === Array("id", "count(DISTINCT name)"))


nit: does it make sense to compare with columns from the SQL example?

nice, will update
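
One way the suggestion above could be read is to derive the expected column names from the equivalent SQL query instead of hard-coding them. The sketch below is a guess at that idea, not the code that was pushed. It uses the DataFrame from the PR description instead of the suite's `person` fixture, and the object name and the view name `count_distinct_demo` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count_distinct, lit}

object CompareWithSqlSketch {
  // Returns (column names from the DataFrame API, column names from SQL).
  def columnNames(spark: SparkSession): (Array[String], Array[String]) = {
    val df = spark.range(1, 10).withColumn("value", lit(1))
    // Assumption: "count_distinct_demo" is a view name made up for this sketch.
    df.createOrReplaceTempView("count_distinct_demo")
    val viaApi = df.groupBy("id").agg(count_distinct(col("value")))
    val viaSql = spark.sql(
      "SELECT id, COUNT(DISTINCT value) FROM count_distinct_demo GROUP BY id")
    (viaApi.columns, viaSql.columns)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("compare-with-sql-sketch")
      .getOrCreate()
    try {
      val (api, sql) = columnNames(spark)
      println(s"API: ${api.mkString(", ")}")
      println(s"SQL: ${sql.mkString(", ")}")
      // Whether the two agree depends on whether the Spark build in use
      // contains a fix for this issue, so the result is only printed.
      println(s"match: ${api.sameElements(sql)}")
    } finally {
      spark.stop()
    }
  }
}
```

In a suite, the final comparison would become an assertion that the two arrays are equal.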

zhengruifeng · 2022-12-06T10:38:58Z

The "sql - other" CI job keeps failing; I need a bit more time to investigate.

zhengruifeng · 2022-12-06T11:16:21Z

We also need to take * into account: groupBy.agg(count_distinct($"*")) currently produces the output column count(unresolvedstar()), while select and SQL expand the star:

scala> df.select(count_distinct(col("*")))
res12: org.apache.spark.sql.DataFrame = [count(DISTINCT id, value): bigint]

scala> df.groupBy("id").agg(count_distinct($"*"))
res13: org.apache.spark.sql.DataFrame = [id: bigint, count(unresolvedstar()): bigint]

scala> spark.sql(" SELECT COUNT(DISTINCT *) FROM table ")
res14: org.apache.spark.sql.DataFrame = [count(DISTINCT id, value): bigint]

scala> spark.sql(" SELECT id, COUNT(DISTINCT *) FROM table GROUP BY id ")
res15: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT id, value): bigint]
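
For comparison, the sessions above show SQL expanding `*` to `id, value`. A caller can do the same expansion by hand and pass the columns explicitly, which keeps the star out of the aggregate expression altogether. This is only a sketch of a possible workaround under that assumption, not the fix proposed in this PR; the object name and session settings are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count_distinct, lit}

object ExplicitColumnsSketch {
  // Counts distinct (id, value) combinations per id by listing every column
  // explicitly instead of passing col("*").
  def distinctOverAllColumns(spark: SparkSession): DataFrame = {
    val df = spark.range(1, 10).withColumn("value", lit(1))
    val cols = df.columns.map(col)
    // count_distinct takes one Column followed by varargs.
    df.groupBy("id").agg(count_distinct(cols.head, cols.tail.toSeq: _*))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-columns-sketch")
      .getOrCreate()
    try {
      val result = distinctOverAllColumns(spark)
      println(result.columns.mkString(", "))
      result.show()
    } finally {
      spark.stop()
    }
  }
}
```

The default name of the aggregate column is deliberately not stated here, since it is the subject of this PR and may differ between Spark versions.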

zhengruifeng · 2022-12-06T12:33:20Z

This PR causes the existing test "SPARK-27581: DataFrame count_distinct("*") shouldn't fail with AnalysisException" to fail:

2022-12-06T10:00:45.0030472Z [info] - SPARK-27581: DataFrame count_distinct("*") shouldn't fail with AnalysisException *** FAILED *** (12 milliseconds)
2022-12-06T10:00:45.0035652Z [info]   org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'count'.
2022-12-06T10:00:45.0041055Z [info]   at org.apache.spark.sql.errors.QueryCompilationErrors$.invalidStarUsageError(QueryCompilationErrors.scala:465)
2022-12-06T10:00:45.0045499Z [info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1759)
2022-12-06T10:00:45.0050364Z [info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1715)
2022-12-06T10:00:45.0054806Z [info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:566)
2022-12-06T10:00:45.0060748Z [info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
2022-12-06T10:00:45.0065073Z [info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:566)
2022-12-06T10:00:45.0075998Z [info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:563)
2022-12-06T10:00:45.0080489Z [info]   at scala.collection.immutable.List.map(List.scala:293)

I believe the analyzer needs to be changed to fix this issue. Let me close this PR and ping @cloud-fan and @viirya to take a look, since I think it is related to #24482.

scala> val df = spark.range(1, 10).withColumn("value", lit(1))
df: org.apache.spark.sql.DataFrame = [id: bigint, value: int]

scala> df.createOrReplaceTempView("table")

scala> df.groupBy("id").agg(count_distinct($"value"))
res1: org.apache.spark.sql.DataFrame = [id: bigint, count(value): bigint]

scala> spark.sql(" SELECT id, COUNT(DISTINCT value) FROM table GROUP BY id ")
res2: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): bigint]

scala> df.groupBy("id").agg(count_distinct($"*"))
res3: org.apache.spark.sql.DataFrame = [id: bigint, count(unresolvedstar()): bigint]

scala> spark.sql(" SELECT id, COUNT(DISTINCT *) FROM table GROUP BY id ")
res4: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT id, value): bigint]

github-actions bot added the SQL label Dec 5, 2022

zhengruifeng mentioned this pull request Dec 5, 2022

[SPARK-41381][CONNECT][PYTHON] Implement count_distinct and sum_distinct functions #38914

Closed

init

016df1a

zhengruifeng force-pushed the sql_fix_count_distinct_name branch from 8ae25d5 to 016df1a Compare December 6, 2022 03:41

amaliujia approved these changes Dec 6, 2022

HyukjinKwon approved these changes Dec 6, 2022

address comments

b69edeb

zhengruifeng closed this Dec 6, 2022

zhengruifeng deleted the sql_fix_count_distinct_name branch March 30, 2023 03:23
