
[SPARK-34076][SQL] SQLContext.dropTempTable fails if cache is non-empty #31136

Closed
wants to merge 2 commits into master from sunchao:SPARK-34076

Conversation

@sunchao (Member) commented Jan 11, 2021

### What changes were proposed in this pull request?

This changes `CatalogImpl.dropTempView` and `CatalogImpl.dropGlobalTempView` to use the analyzed logical plan instead of `viewDef`, which is unresolved.

### Why are the changes needed?

Currently, `CatalogImpl.dropTempView` is implemented as follows:

```scala
override def dropTempView(viewName: String): Boolean = {
  sparkSession.sessionState.catalog.getTempView(viewName).exists { viewDef =>
    sparkSession.sharedState.cacheManager.uncacheQuery(
      sparkSession, viewDef, cascade = false)
    sessionCatalog.dropTempView(viewName)
  }
}
```

Here, the logical plan `viewDef` is not resolved, and when it is passed to `uncacheQuery`, the call can fail in `sameResult`, where canonicalized plans are compared. The error message looks like:

```
Invalid call to qualifier on unresolved object, tree: 'key
```

This can be reproduced via:

```scala
sql(s"CREATE TEMPORARY VIEW $v AS SELECT key FROM src LIMIT 10")
sql(s"CREATE TABLE $t AS SELECT * FROM src")
sql(s"CACHE TABLE $t")
dropTempTable(v)
```
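The failure mode can be illustrated with a toy model (all names here are hypothetical, not Spark's actual API): comparing plans by canonical form only works once both sides are resolved, which is why handing an unresolved `viewDef` to the cache lookup blows up.

```scala
// Toy model (not Spark's API): plans are compared by canonical form, and an
// unresolved node has no canonical form yet -- analogous to calling
// sameResult on an unanalyzed viewDef during cache lookup.
sealed trait ToyPlan { def canonicalized: String }

case class ResolvedRef(name: String) extends ToyPlan {
  // A resolved reference canonicalizes by normalizing the identifier.
  def canonicalized: String = name.toLowerCase
}

case class UnresolvedRef(name: String) extends ToyPlan {
  // An unresolved reference cannot be canonicalized, so comparison throws.
  def canonicalized: String =
    throw new UnsupportedOperationException(
      s"Invalid call to canonicalized on unresolved object, tree: '$name")
}

def toySameResult(a: ToyPlan, b: ToyPlan): Boolean =
  a.canonicalized == b.canonicalized
```

Comparing two resolved references succeeds, while any comparison involving an unresolved reference throws, mirroring the error above.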

### Does this PR introduce _any_ user-facing change?

The only user-facing change is that `SQLContext.dropTempTable`, which previously could fail in the above scenario, now works with this fix.

### How was this patch tested?

Added new unit tests.

@github-actions github-actions bot added the SQL label Jan 11, 2021
@SparkQA commented Jan 12, 2021

Test build #133946 has finished for PR 31136 at commit 708adc4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38534/

@SparkQA commented Jan 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38534/

@sunchao sunchao marked this pull request as draft January 12, 2021 01:12
@viirya (Member) left a comment

```scala
sql(s"CREATE TEMPORARY VIEW $v AS SELECT key FROM src LIMIT 10")
sql(s"CREATE TABLE $t AS SELECT * FROM src")
sql(s"CACHE TABLE $t")
dropTempTable(v)
```

Is the table `$t` related to the temp view `$v`? Why does caching the table `$t` affect the temp view `$v`?

@sunchao (Member, Author) commented Jan 12, 2021

This is just to make sure the cache is non-empty, so that the cache comparison is performed when dropping the view. The issue occurs during that comparison.

```diff
  sparkSession.sharedState.cacheManager.uncacheQuery(
-   sparkSession, viewDef, cascade = false)
+   sparkSession.table(viewName), cascade = false)
```
@viirya (Member)

Shall we add an assert into `uncacheQuery` to make sure all passed-in logical plans are analyzed? e.g.,

```scala
def uncacheQuery(
    spark: SparkSession,
    plan: LogicalPlan,
    cascade: Boolean,
    blocking: Boolean = false): Unit = {
  assert(plan.analyzed, "the plan input to `uncacheQuery` should be analyzed")
  ...
}
```

@sunchao (Member, Author) commented Jan 12, 2021

I'm not totally sure about this. It seems common for a temporary view to be dropped after the view/table it depends on is already gone, as demonstrated by the test failures, so this assertion would fail in those cases.

Another approach is to ignore possible non-fatal errors during query analysis and uncaching, similar to what's done in `DropTableCommand`.

@SparkQA commented Jan 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38568/

@SparkQA commented Jan 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38568/

@sunchao sunchao marked this pull request as ready for review January 12, 2021 20:33
@SparkQA commented Jan 12, 2021

Test build #133980 has finished for PR 31136 at commit 712b079.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
  sparkSession.sharedState.cacheManager.uncacheQuery(
    sparkSession, plan.analyzed, cascade = false)
} catch {
  case NonFatal(_) => // ignore
```
@viirya (Member)

If there is a table-or-view-not-found error, does that mean we skip uncaching the temp view?

@sunchao (Member, Author) commented Jan 13, 2021

This is a good question. I think currently this won't happen because:

  1. When dropping a table or permanent view, we drop all the caches that reference it in a cascading fashion, so by the time we drop the caches themselves, they are already invalidated.
  2. On the other hand, we currently store temp views as analyzed logical plans, so they won't be analyzed again upon retrieval, which means we won't run into the error you mentioned. However, this also means the plans themselves can become stale and potentially generate incorrect results. [SPARK-34052][SQL] store SQL text for a temp view created using "CACHE TABLE .. AS SELECT" #31107 proposes to change this, following similar changes done by [SPARK-33142][SPARK-33647][SQL] Store SQL text for SQL temp view #30567, so that the behavior of temporary views and their caches is more aligned with that of permanent views.
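Point 1 above can be sketched with a toy cascading cache (a hypothetical model, not Spark's actual `CacheManager`): dropping an entry with `cascade = true` also invalidates every cached entry that references it, transitively.

```scala
import scala.collection.mutable

// Toy cache manager (hypothetical): uncaching with cascade = true drops
// every cached entry that (transitively) references the dropped entry.
class ToyCacheManager {
  private val cached = mutable.Set.empty[String]
  // entry -> names of the entries it references
  private val refs = mutable.Map.empty[String, Set[String]]

  def cache(name: String, references: Set[String] = Set.empty): Unit = {
    cached += name
    refs(name) = references
  }

  def uncache(name: String, cascade: Boolean): Unit = {
    cached -= name
    if (cascade) {
      // Invalidate dependents recursively; snapshot the set before iterating.
      cached.toSet
        .filter(e => refs.getOrElse(e, Set.empty).contains(name))
        .foreach(uncache(_, cascade = true))
    }
  }

  def isCached(name: String): Boolean = cached.contains(name)
}
```

With a cached table `t` and a cached view `v` referencing it, uncaching `t` with cascade also invalidates `v`, so the later drop of `v` never sees a stale dependent cache entry.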

@HyukjinKwon (Member)

cc @cloud-fan and @MaxGekk too FYI

```scala
  sparkSession.sharedState.cacheManager.uncacheQuery(
    sparkSession, plan.analyzed, cascade = false)
} catch {
  case NonFatal(_) => // ignore
```
Contributor

Is it safer to not add this try-catch? If there are failures, then the uncache fails and we should not swallow the exception, right?

@sunchao (Member, Author) commented Jan 13, 2021

This is also discussed above. Suppose we have:

```sql
CREATE TEMPORARY VIEW v1 AS SELECT * FROM v2
```

A Spark user can drop `v2` first, followed by `v1`. In this case the `uncacheQuery` will fail because `v2` is already gone, and consequently the view would not be dropped. This seems to be a quite common scenario.

Also, this follows `DropTableCommand`, which swallows any non-fatal exception and proceeds to drop the view.
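The best-effort pattern described above can be sketched as a small standalone helper (a hypothetical illustration, not the PR's actual code), using `scala.util.control.NonFatal` the same way `DropTableCommand` does:

```scala
import scala.util.control.NonFatal

// Hypothetical sketch of the best-effort pattern: attempt to uncache, swallow
// any non-fatal error (e.g. a dependency like v2 is already gone), then
// proceed to drop the view regardless.
def dropWithBestEffortUncache(uncache: () => Unit, drop: () => Boolean): Boolean = {
  try {
    uncache()
  } catch {
    case NonFatal(_) => // ignore: uncaching is best-effort here
  }
  drop()
}
```

Even when the uncache step throws (as in the `v1`/`v2` scenario above), the drop still runs, which is the behavior the PR settles on.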

Contributor

thanks for the explanation!

@cloud-fan (Contributor)

thanks, merging to master/3.1!

@cloud-fan cloud-fan closed this in 62d82b5 Jan 13, 2021
cloud-fan pushed a commit that referenced this pull request Jan 13, 2021
Closes #31136 from sunchao/SPARK-34076.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 62d82b5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@viirya (Member) commented Jan 13, 2021

lgtm too.

@sunchao sunchao deleted the SPARK-34076 branch January 13, 2021 17:52