[SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset by eejbyfeldt · Pull Request #43188 · apache/spark

eejbyfeldt · 2023-09-30T09:36:33Z

What changes were proposed in this pull request?

Support for InMemoryTableScanExec in AQE was added in #39624, but that patch introduced a bug that appears when a Dataset is persisted using StorageLevel.NONE. Before that patch, a query like:

import org.apache.spark.storage.StorageLevel
spark.createDataset(Seq(1, 2)).persist(StorageLevel.NONE).count()

would correctly return 2. But after that patch it incorrectly returns 0. This is because AQE incorrectly determines based on the runtime statistics that are collected here:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

Line 294 in eac5a8c

rowCountStats.add(batch.numRows)

that the input is empty. The problem is that the action that should make sure the statistics are collected here

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala

Lines 285 to 291 in eac5a8c

    sparkContext.submitJob(
      rdd,
      (_: Iterator[CachedBatch]) => (),
      (0 until rdd.getNumPartitions).toSeq,
      (_: Int, _: Unit) => (),
      ()
    )

never uses the iterator, and with StorageLevel.NONE the persisting does not use the iterator either, so the correct statistics are never gathered.
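The laziness at play can be illustrated without Spark. The following is a minimal sketch, not the code from InMemoryRelation: `LazyStatsSketch`, `Batch`, `batches`, `countWhenIgnored` and `countWhenDrained` are hypothetical names, and a plain callback stands in for the `rowCountStats` accumulator. It only shows that a side effect placed inside `Iterator.map` never runs when the job function ignores its iterator.

```scala
object LazyStatsSketch {
  // Hypothetical stand-in for a cached batch: just a group of rows.
  type Batch = Seq[Int]

  /** Builds a lazy batch iterator that reports row counts as batches are pulled. */
  def batches(onRows: Long => Unit): Iterator[Batch] =
    Iterator(Seq(1, 2)).map { batch =>
      onRows(batch.size.toLong) // plays the role of rowCountStats.add(batch.numRows)
      batch
    }

  /** Row count observed when the job function ignores its iterator. */
  def countWhenIgnored(): Long = {
    var rows = 0L
    // Same shape as the function passed to submitJob: it never pulls a batch.
    val job: Iterator[Batch] => Unit = _ => ()
    job(batches(n => rows += n))
    rows
  }

  /** Row count observed when something drains the iterator first. */
  def countWhenDrained(): Long = {
    var rows = 0L
    batches(n => rows += n).foreach(_ => ())
    rows
  }

  def main(args: Array[String]): Unit = {
    println(countWhenIgnored()) // prints 0
    println(countWhenDrained()) // prints 2
  }
}
```

In this sketch the drained case plays the role of a storage level that actually stores the batches, while the ignored case corresponds to StorageLevel.NONE, where nothing pulls from the iterator.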

The proposed fix in this patch simply makes calling persist with StorageLevel.NONE a no-op. Changing the action so that it always "empties" the iterator would also work, but that seems like unnecessary work in most normal circumstances.
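The shape of such a no-op guard might look roughly like the sketch below. This is an illustration under stated assumptions, not the actual change to CacheManager: `Level`, `NoneLevel`, `MemoryAndDisk`, `SketchCacheManager`, `cacheQuery` and `isCached` here are simplified, hypothetical stand-ins rather than Spark's real classes.

```scala
// Hypothetical, simplified stand-ins; these are not Spark's real types.
sealed trait Level
case object NoneLevel extends Level     // stands in for StorageLevel.NONE
case object MemoryAndDisk extends Level // stands in for a level that really stores data

final class SketchCacheManager {
  private val cached = scala.collection.mutable.Map.empty[String, Level]

  /** Registers a plan for caching, unless the level would store nothing. */
  def cacheQuery(planId: String, level: Level): Unit = level match {
    case NoneLevel =>
      // No data would ever be stored, so register nothing. The query then
      // runs as if persist had never been called.
      ()
    case other =>
      if (!cached.contains(planId)) cached(planId) = other
  }

  def isCached(planId: String): Boolean = cached.contains(planId)
}
```

With this guard, a plan "persisted" at the none level is never registered, so later planning in this sketch has no cached relation whose statistics it could misread.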

Why are the changes needed?

The current code has a correctness issue.

Does this PR introduce any user-facing change?

Yes, fixes the correctness issue.

How was this patch tested?

New and existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

srowen

Looks OK. I think this needs to go into 3.5 too?

sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

eejbyfeldt · 2023-10-02T08:01:30Z

Looks OK. I think this needs to go into 3.5 too?

Yes, this fix is also needed in the 3.5 branch.

WeichenXu123 · 2023-10-02T09:37:36Z

@eejbyfeldt The Spark package release pipeline has had some issues recently. Once it is fixed, I will release a new version of graphframes for Spark 3.5.

eejbyfeldt · 2023-10-02T13:05:06Z

@WeichenXu123 this commit should also go into the 3.5 branch, since it is also affected by the correctness bug. Also, based on the commit message of a0c9ab6, it does not look like you merged it using the merge_spark_pr.py script?

WeichenXu123 · 2023-10-04T09:19:38Z

@WeichenXu123 this commit should also go into the 3.5 branch, since it is also affected by the correctness bug. Also, based on the commit message of a0c9ab6, it does not look like you merged it using the merge_spark_pr.py script?

Yes, my fault :) We should use merge_spark_pr.py to merge it.

I will backport it to spark 3.5.

WeichenXu123 · 2023-10-04T09:20:28Z

@eejbyfeldt I found there are some conflicts when I cherry-pick this commit a0c9ab6 onto Spark 3.5.

Could you file a separate PR against Spark 3.5? Thanks!

eejbyfeldt · 2023-10-04T10:33:16Z

@eejbyfeldt I found there are some conflicts when I cherry-pick this commit a0c9ab6 onto Spark 3.5.

Could you file a separate PR against Spark 3.5? Thanks!

#43213

dongjoon-hyun · 2023-10-04T15:32:44Z

Thank you, @eejbyfeldt and all.

To @WeichenXu123 .
Please use our merge script. It has many more features to help Apache Spark committers. 😄

https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py

mridulm · 2023-10-04T15:57:49Z

Wondering if there is a way to disable that "squash and merge" button @dongjoon-hyun :-)

WeichenXu123 · 2023-10-05T08:32:19Z

Thank you, @eejbyfeldt and all.

To @WeichenXu123 . Please use our merge script. It has many more features to help Apache Spark committers. 😄

https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py

Sure. :)

SPARK-45386: Fix correctness issue with StorageLevel.NONE

08fd200

github-actions bot added the SQL label Sep 30, 2023

eejbyfeldt changed the title ~~SPARK-45386: Fix correctness issue with StorageLevel.NONE on Dataset~~ SPARK-45386: Fix correctness issue with persist using StorageLevel.NONE on Dataset Sep 30, 2023

eejbyfeldt changed the title ~~SPARK-45386: Fix correctness issue with persist using StorageLevel.NONE on Dataset~~ [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset Sep 30, 2023

mridulm reviewed Sep 30, 2023

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala Outdated

Move to CacheManager

4434ebc

eejbyfeldt requested a review from mridulm October 1, 2023 07:08

eejbyfeldt mentioned this pull request Oct 1, 2023

Add Spark 3.5.0 support graphframes/graphframes#436

Merged

WeichenXu123 approved these changes Oct 1, 2023

srowen approved these changes Oct 1, 2023

sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

Add comment

e228dfd

WeichenXu123 merged commit a0c9ab6 into apache:master Oct 2, 2023

dongjoon-hyun mentioned this pull request Jul 16, 2024

[MINOR][TESTS] Remove unused test jar (udf_noA.jar) #47309

Merged

WeichenXu123 merged 3 commits into apache:master from eejbyfeldt:SPARK-45386
