[SPARK-18609][SPARK-18841][SQL] Fix redundant Alias removal in the optimizer [Backport-2.1] #16843
This is a backport of 73ee739
What changes were proposed in this pull request?
The optimizer tries to remove redundant alias-only projections from the query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies and removes such a project and rewrites the project's attributes in the entire tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE) and the duplicated part contains the alias-only project; in this case the rewrite will break the tree.
This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.
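To make the scoping idea concrete, here is a minimal sketch on a toy plan model. It is not Catalyst code: the `Plan`, `Scan`, `Project`, `Filter` and `Join` types and the `removeAliases` function are hypothetical, and plain string names stand in for Catalyst's attribute expression IDs. The sketch assumes the fix amounts to a bottom-up traversal in which each node returns a mapping of removed aliases that only its ancestors apply, so a sibling subtree is never rewritten.

```scala
object AliasRemovalSketch {
  // Hypothetical toy plan nodes; strings stand in for attribute expression IDs.
  sealed trait Plan
  case class Scan(output: Seq[String]) extends Plan
  // Each pair is (inputAttr, outputAttr), i.e. `inputAttr AS outputAttr`.
  case class Project(aliases: Seq[(String, String)], child: Plan) extends Plan
  case class Filter(attr: String, child: Plan) extends Plan
  case class Join(left: Plan, right: Plan) extends Plan

  def output(p: Plan): Seq[String] = p match {
    case Scan(o)       => o
    case Project(a, _) => a.map(_._2)
    case Filter(_, c)  => output(c)
    case Join(l, r)    => output(l) ++ output(r)
  }

  // Returns the rewritten plan plus a mapping (removed alias -> underlying
  // attribute). The mapping is handed upwards only, so it is applied by the
  // ancestors of a removed Project and never by unrelated subtrees.
  def removeAliases(plan: Plan, blacklist: Set[String]): (Plan, Map[String, String]) =
    plan match {
      case s: Scan => (s, Map.empty)

      case Project(aliases, child) =>
        val (newChild, m) = removeAliases(child, blacklist)
        val remapped = aliases.map { case (in, out) => (m.getOrElse(in, in), out) }
        // Alias-only: the project merely renames the child's output, in order.
        val aliasOnly = remapped.map(_._1) == output(newChild)
        val blocked = remapped.exists { case (_, out) => blacklist(out) }
        if (aliasOnly && !blocked) {
          (newChild, m ++ remapped.map { case (in, out) => out -> in })
        } else {
          // A retained project hides the child's attributes from its ancestors.
          (Project(remapped, newChild), Map.empty)
        }

      case Filter(attr, child) =>
        val (newChild, m) = removeAliases(child, blacklist)
        (Filter(m.getOrElse(attr, attr), newChild), m)

      case Join(l, r) =>
        // Each side is rewritten with its own mapping only.
        val (nl, ml) = removeAliases(l, blacklist)
        val (nr, mr) = removeAliases(r, blacklist)
        (Join(nl, nr), ml ++ mr)
    }

  def main(args: Array[String]): Unit = {
    val plan = Join(
      Filter("b", Project(Seq("a" -> "b"), Scan(Seq("a")))),
      Filter("b", Scan(Seq("b"))))
    val (rewritten, _) = removeAliases(plan, Set.empty)
    // The left filter now reads "a"; the right filter still reads its own "b".
    assert(rewritten == Join(Filter("a", Scan(Seq("a"))), Filter("b", Scan(Seq("b")))))
    println(rewritten)
  }
}
```

In this toy example a tree-wide rewrite of `b` to `a` would also change the right-hand filter, which would then reference an attribute its `Scan` does not produce. Passing `Set("b")` as the blacklist keeps the left-hand `Project` in place instead of removing it.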
The current tree transformation infrastructure works very well if the transformation at hand requires little or only global contextual information. In this case we need to know both the attributes that are not to be moved and which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involve traversing the query plan multiple times (which is super slow). I have moved around some code in `TreeNode`, `QueryPlan` and `LogicalPlan` to make this much more straightforward; this basically allows you to manually traverse the tree.
How was this patch tested?
I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have added integration tests to the `SQLQueryTestSuite.union` and `SQLQueryTestSuite.cte` test cases.