[SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset#43188
Conversation
srowen
left a comment
There was a problem hiding this comment.
Looks OK. I think this needs to go into 3.5 too?
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
Show resolved
Hide resolved
Yes, this fix is also needed in the 3.5 branch. |
|
@eejbyfeldt Spark package releasing pipeline has some issue recently, once it is fixed I will release new version graphframe for spark 3.5 |
|
@WeichenXu123 this commit should also go into the 3.5 branch since also affected by the correctness bug. Also based on the commit message of a0c9ab6 it did not look like you merged it using the merge_spark_pr.py script? |
Yes, my fault :) We should use merge_spark_pr.py to merge it. I will backport it to spark 3.5. |
|
@eejbyfeldt I found there's some conflicts when I cherry-pick this commit a0c9ab6 to spark 3.5 could you file a separate PR against spark 3.5 ? Thanks! |
…evel.NONE on Dataset (apache#43188) * SPARK-45386: Fix correctness issue with StorageLevel.NONE * Move to CacheManager * Add comment
|
|
Thank you, @eejbyfeldt and all. To @WeichenXu123 . |
|
Wondering if there is a way to disable that "squash and merge" button @dongjoon-hyun :-) |
Sure. :) |
What changes were proposed in this pull request?
Support for InMememoryTableScanExec in AQE was added in #39624, but this patch contained a bug when a Dataset is persisted using
StorageLevel.NONE. Before that patch a query like:would correctly return 2. But after that patch it incorrectly returns 0. This is because AQE incorrectly determines based on the runtime statistics that are collected here:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
Line 294 in eac5a8c
that the input is empty. The problem is that the action that should make sure the statistics are collected here
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala
Lines 285 to 291 in eac5a8c
never use the iterator and when we have
StorageLevel.NONEthe persisting will also not use the iterator and we will not gather the correct statistics.The proposed fix in the patch just make calling persist with StorageLevel.NONE a no-op. Changing the action since it always "emptied" the iterator would also work but seems like that would be unnecessary work in a lot of normal circumstances.
Why are the changes needed?
The current code has a correctness issue.
Does this PR introduce any user-facing change?
Yes, fixes the correctness issue.
How was this patch tested?
New and existing unit tests.
Was this patch authored or co-authored using generative AI tooling?
No