[SPARK-28548][SQL] explain() shows wrong result for persisted DataFrames after some operations #25280
Wait, it seems the query works well in v2.4.3; which commit affects the behaviour? |
|
It's by design that a dataframe can't change its physical plan once the physical plan is materialized. That said, I think the current behavior is correct and v2.4.3 is wrong. I think this problem was fixed by #24654. If you do want to run the plan with cached data, maybe we can do |
|
@cloud-fan I have already noticed #24654. The problem mentioned in that PR is that the pre-analyzed logical plan was always reused and re-analyzed in the explain command even though the analyzed logical plan was already materialized. This solution takes that problem into account: if the analyzed logical plan is already materialized, we use it; otherwise we create one. |
|
We actually use the dataset's query execution to execute it. Suppose we execute a dataset, so that its physical plan is materialized, and then persist it. In 2.4.3, although df.explain shows a cached plan, I think execution still uses the physical plan without the cache? Does this fix have the same issue? That said, in 2.4.3, df.explain shows the query plan with the current status (cache, temp view, and so on), which I think doesn't really match dataset execution. For example:
val df = spark.range(10)
df.explain // shows the query plan without cache
df.collect() // executes without cache
df.persist
df.explain // shows the query plan with cache
df.collect() // still executes without cache |
|
Btw, can't we just return a new Dataset in persist, like this? |
|
@maropu this is a behavior change and can break many queries silently. |
|
@viirya With my change, we get the following result. After some operations including |
|
... where are we on this? This seems to be a severe correctness-impacting issue? |
|
This is not a correctness issue. This is just an undocumented property, "a dataframe can't change its physical plan once the physical plan is materialized", causing confusion. After so many days, I have a fresh idea now: when |
|
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
|
btw, we don't need to describe something about this behaviour in the migration guide if the current one is correct? @gatorsmile @cloud-fan |
|
I wouldn't mind documenting it, but it sounds more like a bug fix given my reading, and it seems to affect only a debug API ( |
|
ok, thanks for the check. |
What changes were proposed in this pull request?
After performing some operations on a Dataset and then persisting it, Dataset.explain shows the wrong result.
One of those operations is explain() itself.
Here is an example.
The expected result is as follows.
But I got this.
This issue occurs because withCachedData in QueryExecution is materialized early when explain() or similar methods are called, so this patch prevents that.
How was this patch tested?
Additional test cases in ExplainSuite.scala.
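For readers unfamiliar with why an early explain() can "freeze" a plan, here is a minimal toy sketch of the mechanism discussed in this thread. It is not Spark's real code: CacheRegistry, ToyQueryExecution, explainEager and explainFresh are hypothetical names invented for illustration, and plans are modeled as plain strings. The only idea borrowed from the PR is that a lazily computed, cache-aware plan keeps its first value once something forces it. The explainFresh variant is just one assumed way to avoid forcing it, not necessarily what the patch does.

```scala
// Toy model only: hypothetical classes, NOT Spark's real implementation.
object LazyPlanSketch {
  // Stand-in for a cache manager: remembers which logical plans were persisted.
  final class CacheRegistry {
    private val cached = scala.collection.mutable.Set.empty[String]
    def persist(plan: String): Unit = cached += plan
    // Replace a persisted plan with a marker for a cached relation.
    def substitute(plan: String): String =
      if (cached.contains(plan)) s"InMemoryRelation($plan)" else plan
  }

  // Stand-in for a query execution whose cache-aware plan is a lazy val.
  final class ToyQueryExecution(logical: String, registry: CacheRegistry) {
    // Computed at most once; later persist() calls cannot change it.
    lazy val withCachedData: String = registry.substitute(logical)

    // Forces the lazy val, so the first call fixes the result.
    def explainEager(): String = withCachedData

    // Recomputes from the registry without touching the lazy val (assumed alternative).
    def explainFresh(): String = registry.substitute(logical)
  }

  def main(args: Array[String]): Unit = {
    val registry = new CacheRegistry
    val qe = new ToyQueryExecution("Range(0, 10)", registry)

    println(qe.explainEager()) // Range(0, 10)
    registry.persist("Range(0, 10)")
    println(qe.explainEager()) // Range(0, 10): stale, the lazy val was already forced
    println(qe.explainFresh()) // InMemoryRelation(Range(0, 10))
  }
}
```

In this toy model the stale second output corresponds to the "wrong result" described in the PR, while the fact that the materialized value never changes corresponds to the "a dataframe can't change its physical plan once the physical plan is materialized" property mentioned in the discussion.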