New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34269][SQL] Simplify SQL view resolution #31368
Conversation
// output, nor with the query column names, throw an AnalysisException. | ||
// If the view's child output can't up cast to the view output, | ||
// throw an AnalysisException, too. | ||
case v @ View(desc, _, output, child) if child.resolved && !v.sameOutput(child) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed anymore, because
View.output
now directly comes fromchild.output
- The
UpCast
is added to the plan, and will go through its own error reporting branch.
@@ -230,7 +230,7 @@ object LogicalPlanIntegrity { | |||
// NOTE: we still need to filter resolved expressions here because the output of | |||
// some resolved logical plans can have unresolved references, | |||
// e.g., outer references in `ExistenceJoin`. | |||
p.output.filter(_.resolved).map { a => (a.exprId, a.dataType) } | |||
p.output.filter(_.resolved).map { a => (a.exprId, a.dataType.asNullable) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu We can eliminate cast for complex types that are compatible (only nullability is different), so the previous logic could fail valid queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add a test for the query that failed with the previous logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The view tests fail without this change. It's a test only thing (the check is skipped in production) that we don't need to backport, so I didn't spend time putting this into a separate PR with tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. thanks.
} | ||
eliminated.canonicalized | ||
} | ||
override def doCanonicalize(): LogicalPlan = child.canonicalized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 now the problem goes away.
@@ -655,28 +654,6 @@ class AnalysisSuite extends AnalysisTest with Matchers { | |||
} | |||
} | |||
|
|||
test("SPARK-25691: AliasViewChild with different nullabilities") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is not needed anymore because EliminateView
is super simple now.
@@ -625,7 +625,8 @@ case class DescribeTableCommand( | |||
throw new AnalysisException( | |||
s"DESC PARTITION is not allowed on a temporary view: ${table.identifier}") | |||
} | |||
describeSchema(catalog.lookupRelation(table).schema, result, header = false) | |||
val schema = catalog.getTempViewOrPermanentTableMetadata(table).schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the view plan can be unresolved (with cast and alias added), we should use the recorded view schema.
Test build #134565 has finished for PR 31368 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala
Show resolved
Hide resolved
val viewPlan = if (viewColumnNames.nonEmpty) { | ||
assert(viewColumnNames.length == metadata.schema.length) | ||
// For view queries like `SELECT * FROM t`, the schema of the referenced table/view may | ||
// change after the view has been created. We need to add an extra SELECT to pick the columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For view queries like
SELECT * FROM t
, the schema of the referenced table/view may change after the view has been created.
We already have some tests for the case above somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, the comment is copied from the previous code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm? Is the comment "For view queries like SELECT * FROM t
..." copied in this PR? I don't see its original place here.
This code seems copied from EliminateView
, but its original comment is different. The EliminateView
's comment is more about resolution of attribute of view text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it a little bit to match the current context.
Interesting. This seems to overlap with SPARK-34108 but it appears that it doesn't solve the issue in the JIRA. |
@sunchao I fixed some places, can you try again? |
Kubernetes integration test starting |
Kubernetes integration test status failure |
@cloud-fan it's working now - thanks! I'll close the JIRA as duplicate. |
@sunchao it's still valuable to keep your PR and add tests :) |
@cloud-fan sure - I can reopen it later to include more test coverage for this. |
Alias(UpCast(UnresolvedAttribute.quoted(col), field.dataType), field.name)( | ||
explicitMetadata = Some(field.metadata)) | ||
} | ||
Project(projectList, parsedPlan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the child plan's output is same as view's schema, this projection will be removed by optimization, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea
val viewPlan = if (viewColumnNames.nonEmpty) { | ||
assert(viewColumnNames.length == metadata.schema.length) | ||
// For view queries like `SELECT * FROM t`, the schema of the referenced table/view may | ||
// change after the view has been created. We need to add an extra SELECT to pick the columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm? Is the comment "For view queries like SELECT * FROM t
..." copied in this PR? I don't see its original place here.
This code seems copied from EliminateView
, but its original comment is different. The EliminateView
's comment is more about resolution of attribute of view text.
} else { | ||
// For view created before Spark 2.2.0, the view text is already fully qualified, the plan | ||
// output is the same with the view output. | ||
parsedPlan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the issue "the schema of the referenced table/view is changed ...", doesn't this also suffer from it too? The view text is fully qualified doesn't mean it has no problem that the referenced table/view changes schema. Isn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before Spark 2.2.0, we generate SQL from logical plan, and the logical plan already has extra Project
to add alias, see https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala#L214
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah but I should still add cast, to match the behavior before this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Then I think the comment here can be updated together. The original comment is about output qualification.
Test build #134588 has finished for PR 31368 at commit
|
Test build #134605 has finished for PR 31368 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test starting |
Test build #134616 has finished for PR 31368 at commit
|
Kubernetes integration test status failure |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #134621 has finished for PR 31368 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, changes look fine to me.
// creation. We should remove this extra Project during canonicalize if it does nothing. | ||
// See more details in `SessionCatalog.fromCatalogTable`. | ||
private def canRemoveProject(p: Project): Boolean = { | ||
p.output.length == p.child.output.length && p.projectList.zipWithIndex.forall { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can do p.projectList.zip(p.child.output).forall
instead so that you don't need to reference the output by index?
@@ -230,7 +230,7 @@ object LogicalPlanIntegrity { | |||
// NOTE: we still need to filter resolved expressions here because the output of | |||
// some resolved logical plans can have unresolved references, | |||
// e.g., outer references in `ExistenceJoin`. | |||
p.output.filter(_.resolved).map { a => (a.exprId, a.dataType) } | |||
p.output.filter(_.resolved).map { a => (a.exprId, a.dataType.asNullable) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add a test for the query that failed with the previous logic?
: +- Project [dept_id#x, dept_name#x, state#x] | ||
: +- SubqueryAlias DEPT | ||
: +- LocalRelation [dept_id#x, dept_name#x, state#x] | ||
: +- Project [cast(dept_id#x as int) AS dept_id#x, cast(dept_name#x as string) AS dept_name#x, cast(state#x as string) AS state#x] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some newly added cast
. Are they redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant casts in an analyzing phase looks fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a rule at the end of Analyzer
if the plan is resolved to check the top Project
and reduce theCast
if is redundant ? The UpCast seems to avoid the table reference changed before view analysis but we can remove it after analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to delay the cast adding (after the parsed view plan is resolved), so that we can skip adding cast for views that have no schema changing. But I can't find an easy way to do it and this is really not a big deal (optimizer willl remove redundant casts), so I go with the simple approach for maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine, just one question about some newly added cast
in query plan.
Refer to this link for build results (access rights to CI server needed): |
GA passed, merging to master, thanks for the review! |
…nd project removal ### What changes were proposed in this pull request? This adds a few test cases for looking up cached temporary/permanent view created using clauses such as `ORDER BY` or `LIMIT`. ### Why are the changes needed? Due to `EliminateView` and how canonization is done for `View`, which inserts an extra project operator, cache lookup could fail in the following simple example: ```sql > CREATE TABLE t (key bigint, value string) USING parquet > CACHE TABLE v1 AS SELECT * FROM t ORDER BY key > SELECT * FROM t ORDER BY key ``` #31368 addresses this issue by removing the project operator if `canRemoveProject` check is successful. This PR adds a few tests. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? This PR just adds unit tests. Closes #31182 from sunchao/SPARK-34108. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…egate's grouping expression ### What changes were proposed in this pull request? This PR is a follow-up to #31368 to add a test case that has a subquery with "view" in aggregate's grouping expression. The existing test tests a subquery with dataframe's temp view, so it doesn't contain a `View` node. ### Why are the changes needed? To increase the test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test. Closes #31352 from imback82/grouping_expr. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request? The currently SQL (temp or permanent) view resolution is done in 2 steps: 1. In `SessionCatalog`, we get the view metadata, parse the view SQL string, and wrap it with `View`. 2. At the beginning of the optimizer, we run `EliminateView`, which drops the wrapper `View`, and apply some special logic to match the view schema. Step 2 is tricky, as we need to retain the output attr expr id, while we need to add an extra `Project` to add cast and alias. This PR simplifies the view solution by building a completed plan (with cast and alias added) in `SessionCatalog`, so that we only have 1 step. ### Why are the changes needed? Code simplification. It also fixes issues like apache#31352 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes apache#31368 from cloud-fan/try. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…nd project removal ### What changes were proposed in this pull request? This adds a few test cases for looking up cached temporary/permanent view created using clauses such as `ORDER BY` or `LIMIT`. ### Why are the changes needed? Due to `EliminateView` and how canonization is done for `View`, which inserts an extra project operator, cache lookup could fail in the following simple example: ```sql > CREATE TABLE t (key bigint, value string) USING parquet > CACHE TABLE v1 AS SELECT * FROM t ORDER BY key > SELECT * FROM t ORDER BY key ``` apache#31368 addresses this issue by removing the project operator if `canRemoveProject` check is successful. This PR adds a few tests. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? This PR just adds unit tests. Closes apache#31182 from sunchao/SPARK-34108. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…egate's grouping expression ### What changes were proposed in this pull request? This PR is a follow-up to apache#31368 to add a test case that has a subquery with "view" in aggregate's grouping expression. The existing test tests a subquery with dataframe's temp view, so it doesn't contain a `View` node. ### Why are the changes needed? To increase the test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test. Closes apache#31352 from imback82/grouping_expr. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
The currently SQL (temp or permanent) view resolution is done in 2 steps:
SessionCatalog
, we get the view metadata, parse the view SQL string, and wrap it withView
.EliminateView
, which drops the wrapperView
, and apply some special logic to match the view schema.Step 2 is tricky, as we need to retain the output attr expr id, while we need to add an extra
Project
to add cast and alias. This PR simplifies the view solution by building a completed plan (with cast and alias added) inSessionCatalog
, so that we only have 1 step.Why are the changes needed?
Code simplification. It also fixes issues like #31352
Does this PR introduce any user-facing change?
No
How was this patch tested?
existing tests