[SPARK-47685][SQL] Restore the support for `Stream` type in `Dataset#groupBy` by LuciferYang · Pull Request #45811 · apache/spark

LuciferYang · 2024-04-02T05:44:11Z

What changes were proposed in this pull request?

While reviewing the changes in SPARK-45685, I found an existing use case that is no longer supported:

Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): _*).sum("id")

[info] - SPARK-38221: group by `Stream` of complex expressions should not fail *** FAILED *** (51 milliseconds)
[info]   org.apache.spark.SparkException: Task not serializable
[info]   at org.apache.spark.util.SparkClosureCleaner$.clean(SparkClosureCleaner.scala:45)
[info]   at org.apache.spark.SparkContext.clean(SparkContext.scala:2718)
[info]   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:908)
[info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:411)
[info]   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:907)
[info]   at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:762)
...

This is a long-standing usage pattern, and although the Stream type has been deprecated since Scala 2.13.0, it has not been removed. This PR therefore restores support for the Stream type in Dataset#groupBy.
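The diff excerpts on this page do not show the full fix, so the following is only a minimal sketch of one way to support lazy grouping sequences: force them into a strict List before they are used further. The object and method names are hypothetical, the sketch uses only the Scala 2.13 standard library, and it is not the code in RelationalGroupedDataset.

```scala
import scala.annotation.nowarn

object StreamMaterializeSketch {
  // Hypothetical helper (an assumption for illustration, not Spark code):
  // convert lazy sequences into a strict List and pass other sequences through.
  // The annotation silences the deprecation warning caused by naming `Stream`.
  @nowarn("cat=deprecation")
  def materialize[A](exprs: Seq[A]): Seq[A] = exprs match {
    case s: Stream[A]   => s.toList // deprecated since Scala 2.13.0, still present
    case l: LazyList[A] => l.toList // the replacement recommended by Scala 2.13
    case other          => other    // already strict, returned unchanged
  }
}
```

Calling StreamMaterializeSketch.materialize on a Stream or a LazyList returns a List with the same elements, while a Vector or List is returned as-is.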

Why are the changes needed?

Restore the support for Stream type in Dataset#groupBy

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GitHub Actions
Restored the test case for dataset group by Stream.

Was this patch authored or co-authored using generative AI tooling?

No

LuciferYang · 2024-04-02T05:44:58Z

sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

  import RelationalGroupedDataset._

  private[this] def toDF(aggExprs: Seq[Expression]): DataFrame = {
+    @scala.annotation.nowarn("cat=deprecation")


Need to suppress the deprecation warning for the use of Stream
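As a rough illustration of this kind of suppression, the sketch below puts scala.annotation.nowarn on a single local definition, so only deprecation warnings raised inside that definition are silenced. The object and method names are hypothetical and the snippet is independent of Spark.

```scala
import scala.annotation.nowarn

object NowarnScopeSketch {
  def firstTwo(): List[Int] = {
    // The annotation applies to this local val only, so deprecated usages
    // elsewhere in the method or file would still be reported.
    @nowarn("cat=deprecation")
    val result = Stream.from(1).take(2).toList
    result
  }
}
```

Scoping the annotation to one definition, rather than the whole class or file, keeps other deprecation warnings visible.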

LuciferYang · 2024-04-02T05:45:13Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala

  }

+  test("SPARK-38221: group by `Stream` of complex expressions should not fail") {
+    @scala.annotation.nowarn("cat=deprecation")


Need to suppress the deprecation warning for the use of Stream

LuciferYang · 2024-04-02T05:46:31Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala

  }

-  test("SPARK-38221: group by stream of complex expressions should not fail") {
+  test("SPARK-45685: group by `LazyList` of complex expressions should not fail") {


This test case was essentially added in SPARK-45685; it has been renamed to make its description clearer.

LuciferYang · 2024-04-02T05:46:42Z

cc @dongjoon-hyun

dongjoon-hyun

+1, LGTM (Pending CIs).

Thank you for finding and fixing the regression, @LuciferYang.

LuciferYang · 2024-04-02T09:38:50Z

Merged into master for Spark 4.0. Thanks @dongjoon-hyun @HyukjinKwon @zhengruifeng

LuciferYang added 2 commits April 2, 2024 13:28

fix

01d6df8

nowarn

472739f

github-actions bot added the SQL label Apr 2, 2024

dongjoon-hyun approved these changes Apr 2, 2024

HyukjinKwon approved these changes Apr 2, 2024

zhengruifeng approved these changes Apr 2, 2024

LuciferYang closed this in 03f4e45 Apr 2, 2024

LuciferYang wants to merge 2 commits into apache:master from LuciferYang:SPARK-47685
