[SPARK-34269][SQL] Simplify SQL view resolution #31368

Closed
wants to merge 3 commits into from

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Jan 27, 2021

What changes were proposed in this pull request?

Currently, SQL (temp or permanent) view resolution is done in 2 steps:

  1. In SessionCatalog, we get the view metadata, parse the view SQL string, and wrap it with View.
  2. At the beginning of the optimizer, we run EliminateView, which drops the wrapper View and applies some special logic to match the view schema.

Step 2 is tricky, as we need to retain the output attr expr id while also adding an extra Project to add cast and alias. This PR simplifies view resolution by building a complete plan (with cast and alias added) in SessionCatalog, so that we only have 1 step.
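
As a rough illustration of the one-step approach described above, here is a minimal sketch that builds the full view plan (cast and alias included) up front. It uses toy stand-ins rather than Spark's Catalyst classes: `Field`, `UnresolvedAttr`, `ParsedQuery`, and `buildViewPlan` are hypothetical names introduced only for this example, and data types are modeled as plain strings.

```scala
object ViewPlanSketch {
  // Toy stand-ins for Catalyst classes; none of these are Spark's real types.
  case class Field(name: String, dataType: String)

  sealed trait Expr
  case class UnresolvedAttr(name: String) extends Expr
  case class UpCast(child: Expr, dataType: String) extends Expr
  case class Alias(child: Expr, name: String) extends Expr

  sealed trait Plan
  case class ParsedQuery(sql: String) extends Plan
  case class Project(projectList: Seq[Expr], child: Plan) extends Plan
  case class View(child: Plan) extends Plan

  // Build the whole view plan in one step: pick each recorded query column by
  // name, up-cast it to the recorded view type, and alias it to the view column.
  def buildViewPlan(
      viewText: String,
      queryColumns: Seq[String],
      viewSchema: Seq[Field]): Plan = {
    require(queryColumns.length == viewSchema.length, "column count must match the view schema")
    val projectList = queryColumns.zip(viewSchema).map { case (col, field) =>
      Alias(UpCast(UnresolvedAttr(col), field.dataType), field.name)
    }
    View(Project(projectList, ParsedQuery(viewText)))
  }

  // Hypothetical view: query columns (key, value) exposed as (k, v).
  val example: Plan = buildViewPlan(
    "SELECT * FROM t",
    Seq("key", "value"),
    Seq(Field("k", "bigint"), Field("v", "string")))
}
```

Because the casts and aliases are already part of the plan, a later step only has to drop the `View` wrapper; it no longer needs to reconcile the child output with the view schema.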

Why are the changes needed?

Code simplification. It also fixes issues like #31352

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing tests

// output, nor with the query column names, throw an AnalysisException.
// If the view's child output can't up cast to the view output,
// throw an AnalysisException, too.
case v @ View(desc, _, output, child) if child.resolved && !v.sameOutput(child) =>
Contributor Author

This is not needed anymore, because

  1. View.output now directly comes from child.output
  2. The UpCast is added to the plan, and will go through its own error reporting branch.

@@ -230,7 +230,7 @@ object LogicalPlanIntegrity {
// NOTE: we still need to filter resolved expressions here because the output of
// some resolved logical plans can have unresolved references,
// e.g., outer references in `ExistenceJoin`.
- p.output.filter(_.resolved).map { a => (a.exprId, a.dataType) }
+ p.output.filter(_.resolved).map { a => (a.exprId, a.dataType.asNullable) }
Contributor Author

@maropu We can eliminate cast for complex types that are compatible (only nullability is different), so the previous logic could fail valid queries.

Contributor

should we add a test for the query that failed with the previous logic?

Contributor Author

@cloud-fan cloud-fan Jan 29, 2021

The view tests fail without this change. It's a test-only issue (the check is skipped in production) that we don't need to backport, so I didn't spend time putting this into a separate PR with tests.

Contributor

got it. thanks.
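
To make the nullability point in this thread concrete, here is a small sketch of why comparing `dataType.asNullable` tolerates types that differ only in nullability. It is a simplified stand-in, not the actual `LogicalPlanIntegrity` code: the `DataType` hierarchy, `Attr`, and `sameTypes` below are toy definitions assumed for illustration.

```scala
object NullabilitySketch {
  // Toy data types; Spark's real DataType hierarchy is much richer.
  sealed trait DataType { def asNullable: DataType }
  case object IntType extends DataType { def asNullable: DataType = this }
  case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType {
    def asNullable: DataType = ArrayType(elementType.asNullable, containsNull = true)
  }

  case class Attr(exprId: Long, dataType: DataType)

  // True when every expression ID present in both outputs maps to the same
  // type after `key` is applied to each side.
  def sameTypes(before: Seq[Attr], after: Seq[Attr], key: DataType => DataType): Boolean = {
    val expected = before.map(a => a.exprId -> key(a.dataType)).toMap
    after.forall(a => expected.get(a.exprId).forall(_ == key(a.dataType)))
  }

  // Same expression ID, but the array element nullability differs.
  val before = Seq(Attr(1L, ArrayType(IntType, containsNull = false)))
  val after = Seq(Attr(1L, ArrayType(IntType, containsNull = true)))

  val strictCheck: Boolean = sameTypes(before, after, identity)
  val relaxedCheck: Boolean = sameTypes(before, after, _.asNullable)
}
```

In this toy model the strict comparison rejects the pair while the relaxed one accepts it, which mirrors the situation described above where a cast between compatible complex types has been eliminated.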

}
eliminated.canonicalized
}
override def doCanonicalize(): LogicalPlan = child.canonicalized
Contributor Author

@imback82 now the problem goes away.

@@ -655,28 +654,6 @@ class AnalysisSuite extends AnalysisTest with Matchers {
}
}

test("SPARK-25691: AliasViewChild with different nullabilities") {
Contributor Author

This test is not needed anymore because EliminateView is super simple now.

@@ -625,7 +625,8 @@ case class DescribeTableCommand(
throw new AnalysisException(
s"DESC PARTITION is not allowed on a temporary view: ${table.identifier}")
}
- describeSchema(catalog.lookupRelation(table).schema, result, header = false)
+ val schema = catalog.getTempViewOrPermanentTableMetadata(table).schema
Contributor Author

The view plan can be unresolved (with cast and alias added), so we should use the recorded view schema.

@cloud-fan
Contributor Author

@SparkQA

SparkQA commented Jan 27, 2021

Test build #134565 has finished for PR 31368 at commit 640a36b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val viewPlan = if (viewColumnNames.nonEmpty) {
assert(viewColumnNames.length == metadata.schema.length)
// For view queries like `SELECT * FROM t`, the schema of the referenced table/view may
// change after the view has been created. We need to add an extra SELECT to pick the columns
Member

For view queries like SELECT * FROM t, the schema of the referenced table/view may change after the view has been created.

Do we already have tests for the case above somewhere?

Contributor Author

I think so, the comment is copied from the previous code.

Member

Ah, ok.

Member

Hm? Is the comment "For view queries like SELECT * FROM t..." copied in this PR? I don't see its original place here.

This code seems to be copied from EliminateView, but the original comment is different: EliminateView's comment is more about resolving the attributes of the view text.

Contributor Author

@sunchao
Member

sunchao commented Jan 28, 2021

Interesting. This seems to overlap with SPARK-34108 but it appears that it doesn't solve the issue in the JIRA.

@cloud-fan
Contributor Author

@sunchao I fixed some places, can you try again?

@SparkQA

SparkQA commented Jan 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39175/

@SparkQA

SparkQA commented Jan 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39175/

@sunchao
Member

sunchao commented Jan 28, 2021

@cloud-fan it's working now - thanks! I'll close the JIRA as duplicate.

@cloud-fan
Contributor Author

@sunchao it's still valuable to keep your PR and add tests :)

@sunchao
Member

sunchao commented Jan 28, 2021

@cloud-fan sure - I can reopen it later to include more test coverage for this.

Alias(UpCast(UnresolvedAttribute.quoted(col), field.dataType), field.name)(
explicitMetadata = Some(field.metadata))
}
Project(projectList, parsedPlan)
Member

If the child plan's output is same as view's schema, this projection will be removed by optimization, right?

Contributor Author

yea

} else {
// For view created before Spark 2.2.0, the view text is already fully qualified, the plan
// output is the same with the view output.
parsedPlan
Member

Doesn't this branch also suffer from the issue that "the schema of the referenced table/view is changed ..."? A fully qualified view text doesn't mean the view is unaffected when the referenced table/view changes its schema, does it?

Contributor Author

Before Spark 2.2.0, we generate SQL from logical plan, and the logical plan already has extra Project to add alias, see https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala#L214

Contributor Author

ah but I should still add cast, to match the behavior before this PR.

Member

I see. Then I think the comment here can be updated together. The original comment is about output qualification.

@SparkQA

SparkQA commented Jan 28, 2021

Test build #134588 has finished for PR 31368 at commit 2c66b39.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 28, 2021

Test build #134605 has finished for PR 31368 at commit dfc9d9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39193/

@SparkQA

SparkQA commented Jan 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39193/

@SparkQA

SparkQA commented Jan 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39204/

@SparkQA

SparkQA commented Jan 28, 2021

Test build #134616 has finished for PR 31368 at commit a467ed6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39204/

@SparkQA

SparkQA commented Jan 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39209/

@SparkQA

SparkQA commented Jan 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39209/

@SparkQA

SparkQA commented Jan 28, 2021

Test build #134621 has finished for PR 31368 at commit 2587e53.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@imback82 imback82 left a comment

+1, changes look fine to me.

// creation. We should remove this extra Project during canonicalize if it does nothing.
// See more details in `SessionCatalog.fromCatalogTable`.
private def canRemoveProject(p: Project): Boolean = {
p.output.length == p.child.output.length && p.projectList.zipWithIndex.forall {
Contributor

nit: you can do p.projectList.zip(p.child.output).forall instead so that you don't need to reference the output by index?
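
The nit above can be sketched as follows. The per-column condition used by the real `canRemoveProject` is not shown in this thread, so `passesThrough` below is a hypothetical placeholder (a simple expression ID match), and `Attr` is a toy type rather than Catalyst's attribute class. The sketch only contrasts the two iteration styles.

```scala
object ZipSketch {
  case class Attr(name: String, exprId: Long)

  // Hypothetical per-column predicate; the real condition is not shown in the thread.
  def passesThrough(projected: Attr, childOutput: Attr): Boolean =
    projected.exprId == childOutput.exprId

  // Index-based form: looks up the child output by position.
  def byIndex(projectList: Seq[Attr], childOutput: Seq[Attr]): Boolean =
    projectList.length == childOutput.length && projectList.zipWithIndex.forall {
      case (a, i) => passesThrough(a, childOutput(i))
    }

  // Zip-based form: pairs each projected column with the child column directly.
  def byZip(projectList: Seq[Attr], childOutput: Seq[Attr]): Boolean =
    projectList.length == childOutput.length && projectList.zip(childOutput).forall {
      case (a, c) => passesThrough(a, c)
    }
}
```

The length check is still needed in the zip-based form, since `zip` silently truncates to the shorter of the two sequences.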

: +- Project [dept_id#x, dept_name#x, state#x]
: +- SubqueryAlias DEPT
: +- LocalRelation [dept_id#x, dept_name#x, state#x]
: +- Project [cast(dept_id#x as int) AS dept_id#x, cast(dept_name#x as string) AS dept_name#x, cast(state#x as string) AS state#x]
Member

There are some newly added cast. Are they redundant?

Member

Redundant casts in an analyzing phase looks fine to me.

Contributor

Can we add a rule at the end of the Analyzer that, once the plan is resolved, checks the top Project and removes the Cast if it is redundant? The UpCast seems to guard against the referenced table changing before view analysis, but we can remove it after analysis.

Contributor Author

It would be better to delay adding the cast (until after the parsed view plan is resolved), so that we can skip adding casts for views whose schema has not changed. But I can't find an easy way to do it, and this is really not a big deal (the optimizer will remove redundant casts), so I went with the simple approach for maintainability.

Member

sounds okay.
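
For readers wondering what "removing redundant casts" amounts to in the thread above, here is a toy sketch of the idea: a cast whose child already has the target type can be dropped. This is not Spark's optimizer rule; `Expr`, `Attr`, `Cast`, `Alias`, and `simplify` are simplified definitions assumed for illustration, with data types modeled as strings.

```scala
object RedundantCastSketch {
  // Toy expression tree; data types are plain strings for brevity.
  sealed trait Expr { def dataType: String }
  case class Attr(name: String, dataType: String) extends Expr
  case class Cast(child: Expr, dataType: String) extends Expr
  case class Alias(child: Expr, name: String) extends Expr {
    def dataType: String = child.dataType
  }

  // Drop any cast whose (already simplified) child has the target type.
  def simplify(e: Expr): Expr = e match {
    case Cast(child, to) =>
      val c = simplify(child)
      if (c.dataType == to) c else Cast(c, to)
    case Alias(child, name) => Alias(simplify(child), name)
    case attr: Attr => attr
  }
}
```

Applied to a projection such as `cast(dept_id as int) AS dept_id` over an `int` column, this leaves only the alias, while a cast that actually changes the type is kept.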

Member

@viirya viirya left a comment

Looks fine, just one question about some newly added cast in query plan.

@AmplabJenkins

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134639/

@cloud-fan
Contributor Author

GA passed, merging to master, thanks for the review!

@cloud-fan cloud-fan closed this in b891862 Jan 29, 2021
viirya pushed a commit that referenced this pull request Jan 30, 2021
…nd project removal

### What changes were proposed in this pull request?

This adds a few test cases for looking up cached temporary/permanent view created using clauses such as `ORDER BY` or `LIMIT`.

### Why are the changes needed?

Due to `EliminateView` and how canonicalization is done for `View`, which inserts an extra project operator, cache lookup could fail in the following simple example:
```sql
> CREATE TABLE t (key bigint, value string) USING parquet
> CACHE TABLE v1 AS SELECT * FROM t ORDER BY key
> SELECT * FROM t ORDER BY key
```

#31368 addresses this issue by removing the project operator if `canRemoveProject` check is successful. This PR adds a few tests.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

This PR just adds unit tests.

Closes #31182 from sunchao/SPARK-34108.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
dongjoon-hyun pushed a commit that referenced this pull request Jan 31, 2021
…egate's grouping expression

### What changes were proposed in this pull request?

This PR is a follow-up to #31368 to add a test case that has a subquery with "view" in aggregate's grouping expression. The existing test tests a subquery with dataframe's temp view, so it doesn't contain a `View` node.

### Why are the changes needed?

To increase the test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new test.

Closes #31352 from imback82/grouping_expr.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
Closes apache#31368 from cloud-fan/try.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>