Commit
[SPARK-34974][SQL] Improve subquery decorrelation framework
### What changes were proposed in this pull request?

This PR implements the decorrelation technique in the paper "Unnesting Arbitrary Queries" by T. Neumann; A. Kemper (http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf). It currently supports Filter, Project, Aggregate, Join, and UnaryNode that passes CheckAnalysis. This feature can be controlled by the config `spark.sql.optimizer.decorrelateInnerQuery.enabled` (default: true).

A few notes:

1. This PR does not relax any constraints in CheckAnalysis for correlated subqueries, even though some cases can be supported by this new framework, such as aggregate with correlated non-equality predicates. This PR focuses on adding the new framework and making sure all existing cases can be supported. Constraints can be relaxed gradually in the future via separate PRs.
2. The new framework is only enabled for correlated scalar subqueries, as the first step. EXISTS/IN subqueries can be supported in the future.

### Why are the changes needed?

Currently, Spark has limited support for correlated subqueries. It only allows `Filter` to reference outer query columns and does not support non-equality predicates when the subquery is aggregated. This new framework will allow more operators to host outer column references and support correlated non-equality predicates and more types of operators in correlated subqueries.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit and SQL query tests and new optimizer plan tests.

Closes #32072 from allisonwang-db/spark-34974-decorrelation.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
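
### Illustrative sketch (not part of the patch)

To make the idea of decorrelation concrete, here is a toy sketch in plain Python. It is not Spark code and does not reproduce the optimizer rule added by this PR. It only illustrates the general shape of the rewrite for a scalar subquery such as `SELECT a, (SELECT SUM(y) FROM t2 WHERE t2.x < t1.a) FROM t1`: collect the distinct outer values, join the inner table against that domain, aggregate per domain value, then left outer join back to the outer rows. The tables `t1` and `t2`, their columns, and all function names are hypothetical. As noted above, CheckAnalysis still rejects an aggregate with a correlated non-equality predicate after this PR, so the query shape is used here purely as an example.

```python
from collections import defaultdict

# Hypothetical toy tables, for illustration only.
t1 = [{"a": 1}, {"a": 2}, {"a": 2}, {"a": 5}]
t2 = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 4, "y": 40}]


def sql_sum(values):
    """SQL-style SUM: returns None (NULL) over an empty input."""
    values = list(values)
    return sum(values) if values else None


def correlated(outer, inner):
    """Evaluate the subquery once per outer row (nested-loop semantics)."""
    return [
        (row["a"], sql_sum(r["y"] for r in inner if r["x"] < row["a"]))
        for row in outer
    ]


def decorrelated(outer, inner):
    """Compute the same result without re-running the subquery per row."""
    # 1. Domain: distinct values of the outer column the subquery references.
    domain = {row["a"] for row in outer}
    # 2. Join the inner table with the domain on the formerly correlated
    #    predicate, collecting inner values grouped by the domain value.
    groups = defaultdict(list)
    for d in domain:
        for r in inner:
            if r["x"] < d:
                groups[d].append(r["y"])
    agg = {d: sql_sum(ys) for d, ys in groups.items()}
    # 3. Left outer join back to the outer rows; unmatched rows get None.
    return [(row["a"], agg.get(row["a"])) for row in outer]


naive_result = correlated(t1, t2)
rewritten_result = decorrelated(t1, t2)
print(rewritten_result)
```

Both functions return the same rows for this input. The sketch uses `SUM` because an empty group and an unmatched left-outer-join row both yield NULL; an aggregate like `COUNT`, which returns 0 over empty input, would need extra handling after the join.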