[SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query) #30138
Conversation
cc @cloud-fan, @maropu and @viirya if you guys have time to take a look, thanks. This is the follow-up from #29804.
if (!session.sessionState.conf.adaptiveExecutionEnabled) {
def getOrCloneSessionWithConfigsOff(
    session: SparkSession,
    configurations: Seq[String]): SparkSession = {
nit: to be more type-safe, how about `Seq[ConfigEntry[Boolean]]`?
+1
+1
sure, updated, it's safer.
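For reference, a minimal sketch of the signature change being discussed (the surrounding trait name is made up, just to contrast the two signatures; only the parameter type is the point):

```scala
import org.apache.spark.internal.config.ConfigEntry
import org.apache.spark.sql.SparkSession

// Hypothetical container, only to show the before/after signatures.
trait SessionConfigHelperSketch {
  // Before: stringly typed, any config key (boolean or not) could be passed in.
  def getOrCloneSessionWithConfigsOffByKey(
      session: SparkSession,
      configurations: Seq[String]): SparkSession

  // After: typed entries, so only boolean SQL configs compile.
  def getOrCloneSessionWithConfigsOff(
      session: SparkSession,
      configurations: Seq[ConfigEntry[Boolean]]): SparkSession
}
```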
withTable("t1") { | ||
withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") { | ||
df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1") | ||
sql("CACHE TABLE tempTable AS SELECT i FROM t1") |
why not just `CACHE TABLE t1`?
Either way is fine for me; if you think it's too redundant I can also change that.
yea let's be simpler.
it can also save the uncache call at the end, as the table will be dropped anyway.
@cloud-fan - sure, updated.
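A sketch of the simplified test shape after the two suggestions above (assuming the `withTable`/`withSQLConf` helpers and `df1` from the quoted diff; not the exact merged test):

```scala
// Cache the bucketed table directly instead of a derived temp table.
// withTable drops t1 at the end, so no explicit UNCACHE is needed.
withTable("t1") {
  withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
    df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
    sql("CACHE TABLE t1")
    assert(spark.catalog.isCached("t1"))
  }
}
```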
 * 1. AQE
 * 2. Automatic bucketed table scan
 */
private val configsOff = Seq(
nit: How about `configsOff` -> `forceDisableConfigs`?
@maropu - sure, updated.
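Based on the quoted diff and the rename above, the field in `CacheManager` presumably ends up roughly like this (a sketch shown with its imports; the exact comment wording may differ from the merged code):

```scala
import org.apache.spark.internal.config.ConfigEntry
import org.apache.spark.sql.internal.SQLConf

// Configs force-disabled while compiling the plan to cache, so the cached
// plan keeps a stable outputPartitioning that later queries can leverage:
// 1. AQE
// 2. Automatic bucketed table scan
private val forceDisableConfigs: Seq[ConfigEntry[Boolean]] = Seq(
  SQLConf.ADAPTIVE_EXECUTION_ENABLED,
  SQLConf.AUTO_BUCKETED_SCAN_ENABLED)
```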
val inMemoryRelation = sessionWithAqeOff.withActive {
  val qe = sessionWithAqeOff.sessionState.executePlan(planToCache)
// Turn off configs so that the outputPartitioning of the underlying plan can be leveraged.
val sessionWithConfigsOff = getOrCloneSessionWithConfigsOff(
nit: it seems we don't need this line break:
`val sessionWithConfigsOff = getOrCloneSessionWithConfigsOff(query.sparkSession, configsOff)`
@maropu - updated.
 */
def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
if (!session.sessionState.conf.adaptiveExecutionEnabled) {
def getOrCloneSessionWithConfigsOff(
Since this method is not only for AQE now, could you move it into a more suitable place, e.g., `object SparkSession` or somewhere?
+1, move it to some other general object.
Sounds good, it makes sense to me; moved to `object SparkSession`.
I know this is old, but why must all these configurations (e.g., AQE) be disabled for CacheManager?
That's because a performance regression can happen. Could you check the previous discussion, e.g., #29804 (comment)?
I feel introducing configs to enable them (i.e. allowing users to enable AQE for cached queries, or to enable auto bucketed scan for cached queries) is dangerous, as users can cause correctness bugs in their pipelines if they use them blindly.
@c21 I had the same thought at first, but I can't find a negative case. Can you point out a case that can cause a correctness bug?
If I'm not missing something, it only affects performance via an extra shuffle. For correctness, assume a cached plan with AQE enabled:
- For lazy cache: the AQE framework will ensure the correctness of the new query with the cached plan.
- For force cache: if the output partitioning or ordering of the cached plan has been affected by AQE, then Spark will use `EnsureRequirements` to guarantee correctness.
Is this related to correctness? I thought this was performance-related because they can change output partitions implicitly.
@ulysses-you, @maropu - sorry, my bad. This and AQE are for performance only, not correctness. Then I am fine with either adding or not adding another config.
Could you join the discussion in https://issues.apache.org/jira/browse/SPARK-35332? I think the JIRA ticket is related to this topic.
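To illustrate the performance point above with a standalone example (table and column names are made up; whether the shuffle is actually avoided depends on the chosen join strategy and bucket count):

```scala
// Write a bucketed table, bucketed by column `i` into 8 buckets.
spark.range(100).selectExpr("id AS i", "id AS j")
  .write.format("parquet").bucketBy(8, "i").saveAsTable("bucketed_t")

// Cache a query over it. Because AQE and auto bucketed scan are force-disabled
// while compiling the plan to cache, the cached plan keeps the 8-bucket
// hash partitioning on `i`.
spark.sql("CACHE TABLE cached_t AS SELECT i, j FROM bucketed_t")

// A later join on the bucket column can then reuse that partitioning and
// avoid re-shuffling the cached side; if the cached plan's partitioning had
// been changed, EnsureRequirements would insert an extra shuffle instead.
spark.sql("SELECT * FROM cached_t c JOIN bucketed_t b ON c.i = b.i").explain()
```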
Addressed all comments and the PR is ready for review again, thanks. cc @cloud-fan, @maropu and @viirya.
 *
 * @since 3.1.0
 */
def getOrCloneSessionWithConfigsOff(
`private[spark]`.
Curious why we need to add this? What's the issue we are preventing? Also, why `private[spark]` but not `private[sql]`?
Oh, `private[sql]` is better. I don't think we should expose this as public.
@viirya - sure, updated.
@cloud-fan - wondering if you think the PR is ready to go? Thanks.
val sessionWithAqeOff = getOrCloneSessionWithAqeOff(query.sparkSession)
val inMemoryRelation = sessionWithAqeOff.withActive {
  val qe = sessionWithAqeOff.sessionState.executePlan(planToCache)
// Turn off configs so that the outputPartitioning of the underlying plan can be leveraged.
nit: this comment seems duplicated with the one above.
@viirya - removed.
retest this please
 * Returns a cloned SparkSession with all specified configurations disabled, or
 * the original SparkSession if all configurations are already disabled.
 *
 * @since 3.1.0
this is not needed for internal APIs.
@cloud-fan - removed.
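Putting the review feedback together (typed config entries, living in the `SparkSession` companion object, `private[sql]`, no `@since` tag), the helper presumably ends up roughly like this. This is a sketch written as if inside Spark's `org.apache.spark.sql` package, where `sessionState` and `cloneSession()` are accessible; the merged code may differ in details:

```scala
import org.apache.spark.internal.config.ConfigEntry

object SparkSession {
  // ... other companion members ...

  /**
   * Returns a cloned SparkSession with all specified configurations disabled, or
   * the original SparkSession if all configurations are already disabled.
   */
  private[sql] def getOrCloneSessionWithConfigsOff(
      session: SparkSession,
      configurations: Seq[ConfigEntry[Boolean]]): SparkSession = {
    val configsEnabled = configurations.filter(session.sessionState.conf.getConf(_))
    if (configsEnabled.isEmpty) {
      // Nothing to turn off: reuse the session as-is.
      session
    } else {
      // Clone the session and disable only the configs that are currently on.
      val newSession = session.cloneSession()
      configsEnabled.foreach { conf =>
        newSession.sessionState.conf.setConf(conf, false)
      }
      newSession
    }
  }
}
```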
Thanks! Merged to master.
Thanks @maropu, @viirya and @cloud-fan for review!
Closes apache#30138 from c21/enable-auto-bucket. Authored-by: Cheng Su <chengsu@fb.com>. Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>.
What changes were proposed in this pull request?
This PR enables auto bucketed table scan by default, with the exception of disabling it only for cached queries (similar to AQE). The reason for disabling auto scan for cached queries is that the cached query's output partitioning can be leveraged later to avoid shuffle and sort when doing join and aggregate.
Why are the changes needed?
Enabling auto bucketed table scan by default is useful as it can optimize queries automatically under the hood, without user interaction.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a unit test for cached queries in `DisableUnnecessaryBucketedScanSuite.scala`. Also changed a number of existing unit tests to disable auto bucketed scan so that they keep working.
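The cached-query test added to `DisableUnnecessaryBucketedScanSuite.scala` likely looks something along these lines (a sketch: the test name, data, and assertions are illustrative, and the imports of `SQLConf`, `InMemoryRelation`, and `HashPartitioning` are assumed):

```scala
test("auto bucketed scan is force-disabled for the plan compiled for caching") {
  withTable("t1") {
    withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
      spark.range(10).selectExpr("id AS i").write
        .format("parquet").bucketBy(8, "i").saveAsTable("t1")
      sql("CACHE TABLE t1")
      assert(spark.catalog.isCached("t1"))

      // The cached relation should still expose the bucketed scan's hash
      // partitioning on `i`, so later joins/aggregates on `i` can reuse it
      // without an extra shuffle.
      val relation = spark.table("t1").queryExecution.withCachedData
        .collectFirst { case r: InMemoryRelation => r }
      assert(relation.isDefined)
      assert(relation.get.cachedPlan.outputPartitioning.isInstanceOf[HashPartitioning])
    }
  }
}
```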