
[SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query) #30138

Closed
c21 wants to merge 7 commits into apache:master from c21:enable-auto-bucket

Conversation

@c21 (Contributor) commented Oct 23, 2020

What changes were proposed in this pull request?

This PR enables auto bucketed table scan by default, with the exception of cached queries, where it stays disabled (similar to AQE). Auto scan is disabled for cached queries because the cached query's output partitioning can be leveraged later to avoid shuffle and sort when doing joins and aggregates.

Why are the changes needed?

Enabling auto bucketed table scan by default is useful because it can optimize queries automatically under the hood, without user interaction.
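As a rough illustration, users who do want to preserve the bucketed output partitioning can still opt out per session. This is a minimal sketch; the exact config key string is assumed here, mirroring the SQLConf.AUTO_BUCKETED_SCAN_ENABLED entry referenced later in this thread.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: with this PR the optimization is on by default, so no action is
// needed to benefit from it; the setting below shows how to opt out explicitly.
val spark = SparkSession.builder()
  .appName("auto-bucketed-scan-example")
  .master("local[*]")
  .getOrCreate()

// Assumed key backing SQLConf.AUTO_BUCKETED_SCAN_ENABLED.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
```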

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test for cached queries in DisableUnnecessaryBucketedScanSuite.scala. Also changed a number of existing unit tests to disable auto bucketed scan so they keep working.

@c21 (Contributor, Author) commented Oct 23, 2020

cc @cloud-fan, @maropu and @viirya if you have time to take a look, thanks. This is the follow-up from #29804.

@SparkQA commented Oct 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34797/

if (!session.sessionState.conf.adaptiveExecutionEnabled) {
def getOrCloneSessionWithConfigsOff(
session: SparkSession,
configurations: Seq[String]): SparkSession = {
Contributor: nit: to be more type safe, how about Seq[ConfigEntry[Boolean]]?

Member: +1

Member: +1

Contributor (Author): sure, updated, it's safer.
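For illustration, a minimal sketch of what the type-safe variant discussed here could look like. The shape is inferred from the quoted diff and is not necessarily the exact merged code; it also relies on internal accessors (sessionState), so it only compiles inside Spark's sql package.

```scala
import org.apache.spark.internal.config.ConfigEntry
import org.apache.spark.sql.SparkSession

// Sketch only: taking ConfigEntry[Boolean] instead of raw config-key strings
// lets the compiler reject non-boolean configs.
def getOrCloneSessionWithConfigsOff(
    session: SparkSession,
    configurations: Seq[ConfigEntry[Boolean]]): SparkSession = {
  // Only clone the session when at least one of the configs is actually enabled.
  val enabledConfigs = configurations.filter(session.sessionState.conf.getConf(_))
  if (enabledConfigs.isEmpty) {
    session
  } else {
    val newSession = session.cloneSession()
    enabledConfigs.foreach(conf => newSession.sessionState.conf.setConf(conf, false))
    newSession
  }
}
```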

withTable("t1") {
withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
sql("CACHE TABLE tempTable AS SELECT i FROM t1")
Contributor: why not just CACHE TABLE t1?

Contributor (Author): Either way is fine for me, if you think it's too redundant I can also change that.

Contributor: yea let's be simpler.

Contributor: it can also save the uncache at the end, as the table will be dropped at the end.

Contributor (Author): @cloud-fan - sure, updated.
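For illustration, the simplified test being agreed on here might look roughly like this. This is a sketch only; df1, withTable and sql are the suite's existing helpers, and the assertions are elided.

```scala
// Sketch: cache the bucketed table directly instead of a separate temp table.
// withTable drops t1 at the end, which also removes the cached data, so no
// explicit UNCACHE TABLE is needed.
withTable("t1") {
  df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
  sql("CACHE TABLE t1")
  // ... assertions that the cached query keeps the bucketed output partitioning ...
}
```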

@SparkQA commented Oct 23, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34797/

* 1. AQE
* 2. Automatic bucketed table scan
*/
private val configsOff = Seq(
Member: nit: How about configsOff -> forceDisableConfigs?

Contributor (Author): @maropu - sure, updated.
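For reference, a sketch of what the renamed field could look like; the exact declaration is assumed from the quoted comment header listing AQE and automatic bucketed table scan, not copied from the merged diff.

```scala
import org.apache.spark.internal.config.ConfigEntry
import org.apache.spark.sql.internal.SQLConf

// Sketch: configurations force-disabled while building the cached plan so that
// its output partitioning and ordering stay stable and reusable:
// 1. AQE
// 2. Automatic bucketed table scan
private val forceDisableConfigs: Seq[ConfigEntry[Boolean]] = Seq(
  SQLConf.ADAPTIVE_EXECUTION_ENABLED,
  SQLConf.AUTO_BUCKETED_SCAN_ENABLED)
```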

val inMemoryRelation = sessionWithAqeOff.withActive {
val qe = sessionWithAqeOff.sessionState.executePlan(planToCache)
// Turn off configs so that the outputPartitioning of the underlying plan can be leveraged.
val sessionWithConfigsOff = getOrCloneSessionWithConfigsOff(
Member: nit: it seems we don't need this line break;

      val sessionWithConfigsOff = getOrCloneSessionWithConfigsOff(query.sparkSession, configsOff)

Contributor (Author): @maropu - updated.

*/
def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
if (!session.sessionState.conf.adaptiveExecutionEnabled) {
def getOrCloneSessionWithConfigsOff(
Member: Since this method is not only for AQE now, could you move it to a more suitable place, e.g., object SparkSession or somewhere similar?

Member: +1, move it to another general object.

Contributor (Author): Sounds good, it makes sense to me; moved to object SparkSession.

I know this is old, but why must all the configurations (e.g., AQE) be disabled for CacheManager?

Member: That's because a performance regression can happen. Could you check the previous discussion, e.g., #29804 (comment)?

Contributor (Author): I feel introducing configs to enable them (i.e. allowing users to enable AQE or auto bucketed scan for cached queries) is dangerous, as users could introduce correctness bugs into their pipelines by using them blindly.

Contributor: @c21 I had the same thought at first, but I can't find a negative case. Can you point out a case that can cause a correctness bug?

If I'm not missing something, it only affects performance through an extra shuffle. For correctness, assume a cached plan with AQE enabled:

  • For a lazy cache, the AQE framework will ensure the correctness of the new query that uses the cached plan.
  • For a force cache, if the output partitioning or ordering of the cached plan has been affected by AQE, then Spark will use EnsureRequirements to guarantee correctness.

Member: Is this related to correctness? I thought this was performance related because they can change output partitions implicitly.

Contributor (Author): @ulysses-you, @maropu - sorry, my bad. This and AQE are performance-only, not correctness. So I am fine with either adding or not adding another config.

Member (@maropu, May 8, 2021): Could you join the discussion in https://issues.apache.org/jira/browse/SPARK-35332 ? I think the JIRA ticket is related to this topic.

@SparkQA commented Oct 23, 2020

Test build #130197 has finished for PR 30138 at commit 7315be8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21 (Contributor, Author) left a comment

Addressed all comments and the PR is ready for review again, thanks. cc @cloud-fan, @maropu and @viirya.

@SparkQA commented Oct 23, 2020

Test build #130216 has finished for PR 30138 at commit da54eaa.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34816/

@SparkQA commented Oct 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34817/

@SparkQA commented Oct 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34816/

@SparkQA commented Oct 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34817/

@SparkQA commented Oct 24, 2020

Test build #130217 has finished for PR 30138 at commit 09c6ca9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*
* @since 3.1.0
*/
def getOrCloneSessionWithConfigsOff(
Member: private[spark].

Contributor (Author): Curious why we need to add this? What's the issue we are preventing? Also, why private[spark] but not private[sql]?

Member (@viirya, Oct 24, 2020): Oh, private[sql] is better. I don't think we should expose this as public.

Contributor (Author): @viirya - sure, updated.

@SparkQA commented Oct 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34828/

@SparkQA commented Oct 24, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34828/

@SparkQA commented Oct 24, 2020

Test build #130228 has finished for PR 30138 at commit 8c1d11e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21 (Contributor, Author) commented Oct 26, 2020

@cloud-fan - wondering if you think the PR is ready to go? Thanks.

val sessionWithAqeOff = getOrCloneSessionWithAqeOff(query.sparkSession)
val inMemoryRelation = sessionWithAqeOff.withActive {
val qe = sessionWithAqeOff.sessionState.executePlan(planToCache)
// Turn off configs so that the outputPartitioning of the underlying plan can be leveraged.
Member: nit: this comment seems duplicated with above.

Contributor (Author): @viirya - removed.
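To make the quoted snippets easier to follow, here is a rough sketch of how CacheManager could wire the helper in. The names come from the quoted diff context; storageLevel, tableName and the exact InMemoryRelation call are assumptions for illustration, not text from this PR.

```scala
// Sketch only: build the cached plan with AQE and auto bucketed scan disabled,
// so the resulting InMemoryRelation keeps a stable, reusable output partitioning.
val sessionWithConfigsOff = SparkSession.getOrCloneSessionWithConfigsOff(
  query.sparkSession, forceDisableConfigs)

val inMemoryRelation = sessionWithConfigsOff.withActive {
  val qe = sessionWithConfigsOff.sessionState.executePlan(planToCache)
  InMemoryRelation(storageLevel, qe, tableName)
}
```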

@SparkQA commented Oct 26, 2020

Test build #130262 has finished for PR 30138 at commit 9e6aaf4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21 (Contributor, Author) commented Oct 26, 2020

retest this please

* Returns a cloned SparkSession with all specified configurations disabled, or
* the original SparkSession if all configurations are already disabled.
*
* @since 3.1.0
Contributor: this is not needed for internal APIs.

Contributor (Author): @cloud-fan - removed.

@SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34862/

@SparkQA commented Oct 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34862/

@SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34867/

@SparkQA commented Oct 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34867/

@SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34870/

@SparkQA commented Oct 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34870/

@maropu closed this in 1042d49 on Oct 26, 2020
@maropu (Member) commented Oct 26, 2020

Thanks! Merged to master.

@SparkQA commented Oct 26, 2020

Test build #130266 has finished for PR 30138 at commit 9e6aaf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 26, 2020

Test build #130270 has finished for PR 30138 at commit 070da00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21 (Contributor, Author) commented Oct 26, 2020

Thanks @maropu , @viirya and @cloud-fan for review!

@c21 deleted the enable-auto-bucket branch on October 26, 2020 at 17:01
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Oct 27, 2020
[SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Closes apache#30138 from c21/enable-auto-bucket.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>