
Conversation

@sarutak (Member) commented Jul 28, 2019

What changes were proposed in this pull request?

After certain operations are performed on a Dataset which is then persisted, Dataset.explain shows a wrong result.
One of those operations is explain() itself.
Here is an example.

val df = spark.range(10)
df.explain
df.persist
df.explain

The expected result is as follows.

== Physical Plan ==
*(1) ColumnarToRow
+- InMemoryTableScan [id#7L]
      +- InMemoryRelation [id#7L], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) Range (0, 10, step=1, splits=12)

But this is what I actually got.

== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)

This issue is caused by withCachedData in QueryExecution being materialized too early when explain() or similar methods are called. This patch prevents that early materialization.

How was this patch tested?

Additional test cases in ExplainSuite.scala.

@sarutak force-pushed the fix-cache-ignored-issue branch from 1114600 to 3b16d9a on July 28, 2019 18:50
@SparkQA commented Jul 28, 2019

Test build #108278 has finished for PR 25280 at commit 3b16d9a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 28, 2019

Test build #108279 has finished for PR 25280 at commit e76e2b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Jul 29, 2019

Wait, it seems the query works well in v2.4.3:

scala> df.explain
== Physical Plan ==
*(1) Range (0, 10, step=1, splits=4)

scala> df.persist
res1: df.type = [id: bigint]

scala> df.explain
== Physical Plan ==
*(1) InMemoryTableScan [id#0L]
   +- InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) Range (0, 10, step=1, splits=4)
scala> 

Which commit changed the behaviour?

@cloud-fan (Contributor)

It's by design that a DataFrame can't change its physical plan once the physical plan is materialized. That is, df.persist has no effect if df.explain has been called beforehand.
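This "frozen once materialized" behavior follows from how Scala's lazy vals work. A minimal, Spark-free sketch (the names QueryExecution and executedPlan only mirror Spark's; the logic is purely illustrative):

```scala
// Illustration of why a materialized plan cannot change: a `lazy val` is
// computed once on first access and never recomputed, so later state
// changes (like df.persist) are invisible to it.
object LazyPlanDemo {
  var cached = false // stands in for the session's cache state

  class QueryExecution {
    // Materialized on first read; subsequent reads return the frozen value.
    lazy val executedPlan: String =
      if (cached) "InMemoryTableScan" else "Range"
  }

  def main(args: Array[String]): Unit = {
    val qe = new QueryExecution
    println(qe.executedPlan) // prints "Range" -- this materializes the plan
    cached = true            // analogous to calling df.persist afterwards
    println(qe.executedPlan) // still prints "Range": the lazy val is frozen
  }
}
```

A QueryExecution created after the state change would of course see the cache, which is why a fresh Dataset (e.g. via toDF) picks it up.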

I think the current behavior is correct and v2.4.3's is wrong. I think that problem was fixed by #24654.

If you do want to run the plan with the cached data, maybe you can do val df2 = df.toDF() and execute df2 instead.

@cloud-fan (Contributor) commented Jul 29, 2019

Actually, this does expose a problem. Before #24654, df.explain didn't materialize the physical plan, but now it does. cc @viirya do you have any ideas? I think this one is hard to fix.

@sarutak (Member, Author) commented Jul 29, 2019

@cloud-fan I had already noticed #24654. The problem addressed in that PR is that the pre-analyzed logical plan was always reused and re-analyzed by the explain command, even when the analyzed logical plan was already materialized.

This solution takes that problem into account: if the analyzed logical plan is already materialized, we use it; otherwise, we create one.

@viirya (Member) commented Jul 29, 2019

We actually use the Dataset's query execution to execute it. Suppose we have executed a Dataset, so its physical plan is materialized, and then persist it. In 2.4.3, although df.explain shows a cached plan, I think execution still uses the physical plan without the cache. Does this fix also have that issue?

That said, in 2.4.3, df.explain shows the query plan reflecting the current state (cache, temp views, etc.), which doesn't really match the Dataset's actual execution.

Like:

val df = spark.range(10)
df.explain // show query plan without cache
df.collect() // execution without cache 
df.persist
df.explain // show query plan with cache
df.collect() // still execution without cache

@SparkQA commented Jul 29, 2019

Test build #108331 has finished for PR 25280 at commit 98bcee4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Plans(

@maropu (Member) commented Jul 29, 2019

Btw, can't we just return a new Dataset from persist, like this?

val df = spark.range(10)
df.explain // show query plan without cache
df.collect // execution without cache

val cachedDf = df.persist

df.explain // show query plan without cache
df.collect // execution without cache

cachedDf.explain // show query plan with cache
cachedDf.collect // execution with cache
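The semantics maropu describes can be sketched in plain Scala (no Spark): persist returns a new dataset-like value with its own, fresh lazy plan, leaving the original untouched. FakeDataset and its members are invented for illustration, not Spark APIs.

```scala
// persist() returns a NEW object whose lazy plan is computed fresh,
// so the original object's frozen plan is unaffected.
final class FakeDataset(cached: Boolean) {
  lazy val plan: String = if (cached) "InMemoryTableScan" else "Range"
  def persist(): FakeDataset = new FakeDataset(cached = true)
}

object ReturnNewDatasetDemo {
  def main(args: Array[String]): Unit = {
    val df = new FakeDataset(cached = false)
    println(df.plan)            // "Range"
    val cachedDf = df.persist() // new object; df itself is unchanged
    println(df.plan)            // still "Range"
    println(cachedDf.plan)      // "InMemoryTableScan"
  }
}
```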

@cloud-fan (Contributor)

@maropu this is a behavior change and could break many queries silently.

@sarutak (Member, Author) commented Jul 31, 2019

@viirya With my change, we get the following result.

val df = spark.range(10)
df.explain  // show query plan without cache
df.collect  // execution without cache
df.persist
df.explain // show query plan without cache
df.collect // execution without cache
df.queryExecution.executedPlan.find(_.isInstanceOf[InMemoryTableScanExec]) // None

After collect, persist is still ignored, but this result differs from 2.4.3's and matches the master branch.
As you mentioned, df.collect and some other operations materialize the executedPlan and cause this problem.

Some operations, including df.show, don't have this problem because they implicitly create a new root plan at execution time, so I wonder whether creating a dummy root plan when executing collect or similar operations would resolve this type of problem.

@felixcheung (Member)

... where are we on this? This seems to be a severe, correctness-impacting issue?

@cloud-fan (Contributor)

This is not a correctness issue. It's just that an undocumented property ("a DataFrame can't change its physical plan once the physical plan is materialized") is causing confusion.

After so many days, I have a fresh idea now: when df.cache is called, reset the cached plan, the optimized plan, and the physical plans. We can implement the lazy evaluation manually to support this, instead of relying on Scala's lazy vals.
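A minimal, Spark-free sketch of this manual-lazy-evaluation idea: a hand-rolled lazy holder that, unlike a built-in `lazy val`, can be reset. The names (ResettableLazy, reset) are invented for illustration, not Spark APIs.

```scala
// A resettable substitute for `lazy val`: compute once on demand,
// but allow invalidation so the value can be recomputed later.
final class ResettableLazy[A](compute: () => A) {
  private var value: Option[A] = None
  def get: A = synchronized {
    if (value.isEmpty) value = Some(compute())
    value.get
  }
  // Called by e.g. df.cache to invalidate the materialized plans.
  def reset(): Unit = synchronized { value = None }
}

object ResettableLazyDemo {
  var cached = false // stands in for the session's cache state
  val executedPlan =
    new ResettableLazy(() => if (cached) "InMemoryTableScan" else "Range")

  def main(args: Array[String]): Unit = {
    println(executedPlan.get) // "Range"
    cached = true             // analogous to df.persist
    executedPlan.reset()      // the key step: drop the frozen value
    println(executedPlan.get) // "InMemoryTableScan"
  }
}
```

With `lazy val`, the second read would still return "Range"; the explicit reset is what lets df.cache take effect on an already-materialized plan.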

@github-actions

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 26, 2019
@github-actions github-actions bot closed this Dec 27, 2019
@maropu (Member) commented Dec 27, 2019

Btw, don't we need to describe this behaviour in the migration guide, even if the current behaviour is the correct one? @gatorsmile @cloud-fan

@HyukjinKwon (Member)

I wouldn't mind documenting it, but it sounds more like a bug fix given my reading, and it seems to only affect a debug API (explain). So I suspect it's fine not to document it for now.

@maropu (Member) commented Dec 28, 2019

Ok, thanks for the check.
