[SPARK-34269][SQL] Simplify SQL view resolution #31368

cloud-fan · 2021-01-27T19:00:57Z

What changes were proposed in this pull request?

Currently, SQL (temp or permanent) view resolution is done in 2 steps:

1. In SessionCatalog, we get the view metadata, parse the view SQL string, and wrap it with View.
2. At the beginning of the optimizer, we run EliminateView, which drops the wrapper View and applies some special logic to match the view schema.

Step 2 is tricky, as we need to retain the output attribute expr IDs while also adding an extra Project to add cast and alias. This PR simplifies view resolution by building a complete plan (with cast and alias added) in SessionCatalog, so that we only have 1 step.
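
To make the single-step idea concrete, here is a minimal sketch in plain Scala. It does not use Catalyst: `UnresolvedAttr`, `UpCast`, `Alias`, `Project`, `Parsed`, `Field`, and `buildViewPlan` are hypothetical toy stand-ins, and data types are plain strings. It only illustrates wrapping the parsed view query in a Project that picks each column by name, casts it to the recorded type, and aliases it to the recorded name.

```scala
// Toy stand-ins for Catalyst expressions and plans (assumed names, not Spark's classes).
sealed trait Expr
final case class UnresolvedAttr(name: String) extends Expr
final case class UpCast(child: Expr, dataType: String) extends Expr
final case class Alias(child: Expr, name: String) extends Expr

sealed trait Plan
final case class Parsed(sqlText: String) extends Plan
final case class Project(projectList: Seq[Expr], child: Plan) extends Plan

// One column of the schema recorded when the view was created.
final case class Field(name: String, dataType: String)

object ViewPlanSketch {
  // Build the complete view plan in one step: select each query column by
  // name, cast it to the recorded type, and alias it to the recorded name.
  def buildViewPlan(
      viewText: String,
      queryColumnNames: Seq[String],
      schema: Seq[Field]): Plan = {
    require(queryColumnNames.length == schema.length,
      "query column names and view schema must have the same length")
    val projectList = queryColumnNames.zip(schema).map { case (col, field) =>
      Alias(UpCast(UnresolvedAttr(col), field.dataType), field.name)
    }
    Project(projectList, Parsed(viewText))
  }
}
```

Because the cast and alias are already part of the plan returned from the catalog, a later rule in this toy model would have nothing left to do except drop the wrapper.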

Why are the changes needed?

Code simplification. It also fixes issues like #31352

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing tests

cloud-fan · 2021-01-27T19:03:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

-          // output, nor with the query column names, throw an AnalysisException.
-          // If the view's child output can't up cast to the view output,
-          // throw an AnalysisException, too.
-          case v @ View(desc, _, output, child) if child.resolved && !v.sameOutput(child) =>


This is not needed anymore, because:

- View.output now directly comes from child.output.
- The UpCast is added to the plan, and will go through its own error reporting branch.

cloud-fan · 2021-01-27T19:05:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

@@ -230,7 +230,7 @@ object LogicalPlanIntegrity {
      // NOTE: we still need to filter resolved expressions here because the output of
      // some resolved logical plans can have unresolved references,
      // e.g., outer references in `ExistenceJoin`.
-      p.output.filter(_.resolved).map { a => (a.exprId, a.dataType) }
+      p.output.filter(_.resolved).map { a => (a.exprId, a.dataType.asNullable) }


@maropu We can eliminate cast for complex types that are compatible (only nullability is different), so the previous logic could fail valid queries.

should we add a test for the query that failed with the previous logic?

The view tests fail without this change. It's a test-only thing (the check is skipped in production) that we don't need to backport, so I didn't spend time putting this into a separate PR with tests.

got it. thanks.
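
For readers following this thread, a minimal sketch of why normalizing nullability matters for this kind of check. It is plain Scala with hypothetical toy types (`DType`, `IntT`, `ArrayT`, `Attr`, `hasConflict`), not the real `LogicalPlanIntegrity` code; the assumption is only that the check flags an expr ID that appears with more than one data type.

```scala
// Toy data types; asNullable relaxes every nullability flag, recursively.
sealed trait DType { def asNullable: DType }
case object IntT extends DType { def asNullable: DType = this }
final case class ArrayT(element: DType, containsNull: Boolean) extends DType {
  def asNullable: DType = ArrayT(element.asNullable, containsNull = true)
}

// Toy output attribute: an expr ID plus its data type.
final case class Attr(exprId: Long, dataType: DType)

object IntegritySketch {
  // Returns true when some expr ID is paired with more than one data type.
  def hasConflict(attrs: Seq[Attr], ignoreNullability: Boolean): Boolean = {
    val keys = attrs.map { a =>
      (a.exprId, if (ignoreNullability) a.dataType.asNullable else a.dataType)
    }
    keys.distinct.groupBy(_._1).exists(_._2.length > 1)
  }
}
```

In this model, two outputs that share an expr ID and differ only in `containsNull` are reported as a conflict by the strict comparison, but not once both sides are normalized with `asNullable`.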

cloud-fan · 2021-01-27T19:05:55Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

-    }
-    eliminated.canonicalized
-  }
+  override def doCanonicalize(): LogicalPlan = child.canonicalized


@imback82 now the problem goes away.

cloud-fan · 2021-01-27T19:06:28Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala

@@ -655,28 +654,6 @@ class AnalysisSuite extends AnalysisTest with Matchers {
    }
  }

-  test("SPARK-25691: AliasViewChild with different nullabilities") {


This test is not needed anymore because EliminateView is super simple now.

cloud-fan · 2021-01-27T19:07:26Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

@@ -625,7 +625,8 @@ case class DescribeTableCommand(
        throw new AnalysisException(
          s"DESC PARTITION is not allowed on a temporary view: ${table.identifier}")
      }
-      describeSchema(catalog.lookupRelation(table).schema, result, header = false)
+      val schema = catalog.getTempViewOrPermanentTableMetadata(table).schema


The view plan can be unresolved (with cast and alias added), so we should use the recorded view schema.

cloud-fan · 2021-01-27T19:07:59Z

cc @linhongliu-db @imback82 @maropu @viirya

SparkQA · 2021-01-27T21:15:09Z

Test build #134565 has finished for PR 31368 at commit 640a36b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala

maropu · 2021-01-28T00:24:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

+    val viewPlan = if (viewColumnNames.nonEmpty) {
+      assert(viewColumnNames.length == metadata.schema.length)
+      // For view queries like `SELECT * FROM t`, the schema of the referenced table/view may
+      // change after the view has been created. We need to add an extra SELECT to pick the columns


For view queries like SELECT * FROM t, the schema of the referenced table/view may change after the view has been created.

Do we already have tests for the case above somewhere?

I think so, the comment is copied from the previous code.

Hm? Is the comment "For view queries like SELECT * FROM t..." copied in this PR? I don't see its original place here.

This code seems to be copied from EliminateView, but its original comment is different. The comment in EliminateView is more about resolving the attributes of the view text.

It's from https://github.com/apache/spark/pull/31368/files#diff-782f0d0b0d5fa6cf642285962eb0c831d9807e3f9ec2810f964292da89547e1aL38

I changed it a little bit to match the current context.

sunchao · 2021-01-28T02:23:12Z

Interesting. This seems to overlap with SPARK-34108 but it appears that it doesn't solve the issue in the JIRA.

cloud-fan · 2021-01-28T04:42:33Z

@sunchao I fixed some places, can you try again?

SparkQA · 2021-01-28T05:31:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39175/

SparkQA · 2021-01-28T05:36:58Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39175/

sunchao · 2021-01-28T05:39:50Z

@cloud-fan it's working now - thanks! I'll close the JIRA as duplicate.

cloud-fan · 2021-01-28T05:44:33Z

@sunchao it's still valuable to keep your PR and add tests :)

sunchao · 2021-01-28T05:51:25Z

@cloud-fan sure - I can reopen it later to include more test coverage for this.

viirya · 2021-01-28T07:09:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

+        Alias(UpCast(UnresolvedAttribute.quoted(col), field.dataType), field.name)(
+          explicitMetadata = Some(field.metadata))
+      }
+      Project(projectList, parsedPlan)


If the child plan's output is the same as the view's schema, this projection will be removed by optimization, right?

viirya · 2021-01-28T07:15:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

+    val viewPlan = if (viewColumnNames.nonEmpty) {
+      assert(viewColumnNames.length == metadata.schema.length)
+      // For view queries like `SELECT * FROM t`, the schema of the referenced table/view may
+      // change after the view has been created. We need to add an extra SELECT to pick the columns


Hm? Is the comment "For view queries like SELECT * FROM t..." copied in this PR? I don't see its original place here.

This code seems to be copied from EliminateView, but its original comment is different. The comment in EliminateView is more about resolving the attributes of the view text.

viirya · 2021-01-28T07:19:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

+    } else {
+      // For view created before Spark 2.2.0, the view text is already fully qualified, the plan
+      // output is the same with the view output.
+      parsedPlan


For the issue "the schema of the referenced table/view is changed ...", doesn't this branch suffer from it too? The view text being fully qualified doesn't mean it is immune to schema changes in the referenced table/view, does it?

Before Spark 2.2.0, we generated the SQL from the logical plan, and the logical plan already had an extra Project to add aliases; see https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala#L214

Ah, but I should still add the cast, to match the behavior before this PR.

I see. Then I think the comment here can be updated together. The original comment is about output qualification.

SparkQA · 2021-01-28T08:47:41Z

Test build #134588 has finished for PR 31368 at commit 2c66b39.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-28T09:08:01Z

Test build #134605 has finished for PR 31368 at commit dfc9d9d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-28T09:29:20Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39193/

SparkQA · 2021-01-28T09:33:53Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39193/

SparkQA · 2021-01-28T12:32:27Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39204/

SparkQA · 2021-01-28T13:21:48Z

Test build #134616 has finished for PR 31368 at commit a467ed6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-28T14:02:10Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39204/

SparkQA · 2021-01-28T16:28:28Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39209/

SparkQA · 2021-01-28T16:46:49Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39209/

SparkQA · 2021-01-28T20:18:02Z

Test build #134621 has finished for PR 31368 at commit 2587e53.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

imback82

+1, changes look fine to me.

imback82 · 2021-01-28T18:39:35Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

+  // creation. We should remove this extra Project during canonicalize if it does nothing.
+  // See more details in `SessionCatalog.fromCatalogTable`.
+  private def canRemoveProject(p: Project): Boolean = {
+    p.output.length == p.child.output.length && p.projectList.zipWithIndex.forall {


nit: you can do p.projectList.zip(p.child.output).forall instead so that you don't need to reference the output by index?
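
A small sketch of the suggestion in plain Scala, independent of Catalyst. `byIndex`, `byZip`, and the `same` predicate are hypothetical names; the point is only that zipping the two sequences gives the same pairwise check without indexing into the second one. The explicit length comparison is still needed, since `zip` silently truncates to the shorter sequence.

```scala
object ZipSketch {
  // Pairwise check that looks up the second sequence by index.
  def byIndex[A, B](xs: Seq[A], ys: Seq[B])(same: (A, B) => Boolean): Boolean =
    xs.length == ys.length && xs.zipWithIndex.forall { case (x, i) => same(x, ys(i)) }

  // Equivalent pairwise check using zip, with no index lookups.
  def byZip[A, B](xs: Seq[A], ys: Seq[B])(same: (A, B) => Boolean): Boolean =
    xs.length == ys.length && xs.zip(ys).forall { case (x, y) => same(x, y) }
}
```

Besides being shorter, the zip form avoids repeated `ys(i)` lookups, which are linear-time on a `List`.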

imback82 · 2021-01-28T18:45:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

@@ -230,7 +230,7 @@ object LogicalPlanIntegrity {
      // NOTE: we still need to filter resolved expressions here because the output of
      // some resolved logical plans can have unresolved references,
      // e.g., outer references in `ExistenceJoin`.
-      p.output.filter(_.resolved).map { a => (a.exprId, a.dataType) }
+      p.output.filter(_.resolved).map { a => (a.exprId, a.dataType.asNullable) }


should we add a test for the query that failed with the previous logic?

viirya · 2021-01-28T21:04:33Z

sql/core/src/test/resources/sql-tests/results/group-by-filter.sql.out

-:              +- Project [dept_id#x, dept_name#x, state#x]
-:                 +- SubqueryAlias DEPT
-:                    +- LocalRelation [dept_id#x, dept_name#x, state#x]
+:              +- Project [cast(dept_id#x as int) AS dept_id#x, cast(dept_name#x as string) AS dept_name#x, cast(state#x as string) AS state#x]


There are some newly added casts. Are they redundant?

Redundant casts in the analysis phase look fine to me.

Can we add a rule at the end of the Analyzer that, once the plan is resolved, checks the top Project and removes the Cast if it is redundant? The UpCast seems to guard against the referenced table changing before view analysis, but we can remove it after analysis.

It would be better to delay adding the casts (until after the parsed view plan is resolved), so that we can skip adding casts for views whose schema has not changed. But I can't find an easy way to do it and this is really not a big deal (the optimizer will remove redundant casts), so I went with the simple approach for maintainability.

sounds okay.

viirya

Looks fine, just one question about some newly added casts in the query plan.

AmplabJenkins · 2021-01-29T06:36:44Z

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134639/

cloud-fan · 2021-01-29T06:45:56Z

GA passed, merging to master, thanks for the review!

…nd project removal

### What changes were proposed in this pull request?

This adds a few test cases for looking up cached temporary/permanent view created using clauses such as `ORDER BY` or `LIMIT`.

### Why are the changes needed?

Due to `EliminateView` and how canonization is done for `View`, which inserts an extra project operator, cache lookup could fail in the following simple example:

```sql
> CREATE TABLE t (key bigint, value string) USING parquet
> CACHE TABLE v1 AS SELECT * FROM t ORDER BY key
> SELECT * FROM t ORDER BY key
```

#31368 addresses this issue by removing the project operator if `canRemoveProject` check is successful. This PR adds a few tests.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

This PR just adds unit tests.

Closes #31182 from sunchao/SPARK-34108.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

…egate's grouping expression

### What changes were proposed in this pull request?

This PR is a follow-up to #31368 to add a test case that has a subquery with "view" in aggregate's grouping expression. The existing test tests a subquery with dataframe's temp view, so it doesn't contain a `View` node.

### Why are the changes needed?

To increase the test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new test.

Closes #31352 from imback82/grouping_expr.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

### What changes were proposed in this pull request?

Currently, SQL (temp or permanent) view resolution is done in 2 steps:

1. In `SessionCatalog`, we get the view metadata, parse the view SQL string, and wrap it with `View`.
2. At the beginning of the optimizer, we run `EliminateView`, which drops the wrapper `View` and applies some special logic to match the view schema.

Step 2 is tricky, as we need to retain the output attribute expr IDs while also adding an extra `Project` to add cast and alias. This PR simplifies view resolution by building a complete plan (with cast and alias added) in `SessionCatalog`, so that we only have 1 step.

### Why are the changes needed?

Code simplification. It also fixes issues like apache#31352

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes apache#31368 from cloud-fan/try.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>


cloud-fan mentioned this pull request Jan 27, 2021

[SPARK-34269][SQL][TESTS][FOLLOWUP] Test a subquery with view in aggregate's grouping expression #31352

Closed

github-actions bot added the SQL label Jan 27, 2021

maropu reviewed Jan 28, 2021

cloud-fan force-pushed the try branch from 640a36b to 2c66b39 Compare January 28, 2021 04:40

sunchao mentioned this pull request Jan 28, 2021

[SPARK-34269][SQL][TESTS][FOLLOWUP] Add test cases for cache lookup and project removal #31182

Closed

viirya reviewed Jan 28, 2021

cloud-fan force-pushed the try branch from 2c66b39 to dfc9d9d Compare January 28, 2021 08:16

simplify view resolution

a467ed6

cloud-fan force-pushed the try branch from dfc9d9d to a467ed6 Compare January 28, 2021 11:30

fix thriftserver test

2587e53

imback82 reviewed Jan 28, 2021

viirya reviewed Jan 28, 2021

maropu approved these changes Jan 29, 2021

address comment

2aebf05

imback82 approved these changes Jan 29, 2021

viirya approved these changes Jan 29, 2021

cloud-fan closed this in b891862 Jan 29, 2021

tprelle mentioned this pull request Feb 24, 2021

[SPARK-34528][SQL] Named explicitly field in struct of a catalog view #31639

Closed
