[SPARK-23223][SQL] Make stacking dataset transforms more performant
## What changes were proposed in this pull request?

It is a common pattern to apply multiple transforms to a `Dataset` (using `Dataset.withColumn`, for example). This is currently quite expensive because we run `CheckAnalysis` on the full plan and create an encoder for each intermediate `Dataset`.

This PR extends the usage of the `AnalysisBarrier` to include `CheckAnalysis`. By doing this we hide the already analyzed plan from `CheckAnalysis`, because the barrier is a `LeafNode`. The `AnalysisBarrier` is in the `FinishAnalysis` phase of the optimizer.

We also make binding the `Dataset` encoder lazy. The bound encoder is only needed when we materialize the dataset.

## How was this patch tested?

Existing tests should cover this.

Author: Herman van Hovell <firstname.lastname@example.org>

Closes #20402 from hvanhovell/SPARK-23223.

(cherry picked from commit 2d903cf)

Signed-off-by: Herman van Hovell <email@example.com>
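The pattern this PR targets can be sketched as below. This is an illustrative example, not code from the PR: `spark` is assumed to be an active `SparkSession`, and the column names are made up.

```scala
import org.apache.spark.sql.functions.col

// Each withColumn call returns a new Dataset. Before this change, every
// intermediate Dataset re-ran CheckAnalysis over the whole plan and eagerly
// bound an encoder, so chaining N transforms did repeated full-plan work.
val base = spark.range(1000).toDF("id")
val transformed = base
  .withColumn("doubled", col("id") * 2)
  .withColumn("squared", col("id") * col("id"))
  .withColumn("bucket", col("id") % 10)

// With the already-analyzed plan hidden behind an AnalysisBarrier and
// encoder binding made lazy, the bound encoder is only produced when the
// Dataset is materialized, e.g. by an action such as:
transformed.collect()
```

Because the barrier is a `LeafNode`, `CheckAnalysis` stops at it and never revisits the subtree that was already validated for the previous `Dataset` in the chain.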
Showing with 25 additions and 21 deletions.
- +14 −2 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
- +1 −0 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
- +1 −2 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisTest.scala
- +6 −2 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
- +2 −14 sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
- +1 −1 sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala