Skip to content

[SPARK-31037][SQL] refine AQE config names#27793

Closed
cloud-fan wants to merge 2 commits intoapache:masterfrom
cloud-fan:aqe
Closed

[SPARK-31037][SQL] refine AQE config names#27793
cloud-fan wants to merge 2 commits intoapache:masterfrom
cloud-fan:aqe

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Mar 4, 2020

What changes were proposed in this pull request?

When introducing AQE to others, I feel the config names are a bit incoherent and hard to use.
This PR refines the config names:

  1. remove the "shuffle" prefix. AQE is all about shuffle and we don't need to add the "shuffle" prefix everywhere.
  2. targetPostShuffleInputSize is obscure, rename to advisoryPartitionSizeInBytes.
  3. reducePostShufflePartitions doesn't match the actual optimization, rename to coalescePartitions
  4. minNumPostShufflePartitions is obscure, rename it minPartitionNum under the coalescePartitions namespace
  5. maxNumPostShufflePartitions is confusing with the word "max", rename it initialPartitionNum
  6. skewedJoinOptimization is too verbose. skew join is a well-known terminology in database area, we can just say skewJoin

Why are the changes needed?

Make the config names easy to understand.

Does this PR introduce any user-facing change?

deprecate the config spark.sql.adaptive.shuffle.targetPostShuffleInputSize

How was this patch tested?

N/A

@cloud-fan
Copy link
Contributor Author

cc @JkSelf @maryannxue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3.0.0

@SparkQA
Copy link

SparkQA commented Mar 4, 2020

Test build #119314 has finished for PR 27793 at commit 186a6af.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enabled

@SparkQA
Copy link

SparkQA commented Mar 4, 2020

Test build #119316 has finished for PR 27793 at commit 2ebd979.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enable -> "enabled"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"number" -> "minimum number"

@SparkQA
Copy link

SparkQA commented Mar 4, 2020

Test build #119324 has finished for PR 27793 at commit 9ce3f04.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 4, 2020

Test build #119325 has finished for PR 27793 at commit 2dffcd5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make these shorter and easier to remember? Personally, maybe change spark.sql.adaptive.coalesceShufflePartitions.enabled to spark.sql.adaptive.mergePartitions, spark.sql.adaptive.coalesceShufflePartitions.initialPartitionNum to spark.sql.adaptive.mergePartitions.initialNum and spark.sql.adaptive.coalesceShufflePartitions.minPartitionNum to spark.sql.adaptive.mergePartiontions.minNum.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shuffle is where this behavior happens and have been written in the doc field, we may not need to enforce it in the config name, and it seems that we do not have any other places under adaptive to coalesce partitions. And merge might be easier to spell than coalesce :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with "coalescePartitions".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we also rename advisoryShufflePartitionSizeInBytes to advisoryPartitionSizeInBytes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

advisoryPartitionSizeInBytes looks good to me.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we uniform the verb naming in reduce, coalesce, or merge both in here configuration name and the optimization rule name of ReduceNumShufflePartitions ? If we use the coalescePartitions, It is better to modify the optimization rule name from ReduceNumShufflePartitions to CoalesceShufflePartitions ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we should, but we can do it in another PR. The code name is not user facing and doesn't need to made into 3.0.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local shuffle reader may optimize the local reader both build side and probe side?

@SparkQA
Copy link

SparkQA commented Mar 5, 2020

Test build #119363 has finished for PR 27793 at commit 3daf7df.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 5, 2020

Test build #119366 has finished for PR 27793 at commit 5a63fdb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 5, 2020

Test build #119368 has finished for PR 27793 at commit f2dea40.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -67,8 +67,8 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
* SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?change it to ADVISORY_PARTITION_SIZE_IN_BYTES?

val SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE =
buildConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize")
.internal()
.doc("(Deprecated since Spark 3.0)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also tell users what is the new conf that replaces it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc here is not user-facing. End users will see a message to suggest the new config by https://github.com/apache/spark/pull/27793/files#diff-9a6b543db706f1a90f790783d6930a13R2494

"the number of post-shuffle partitions based on map output statistics.")
val ADVISORY_PARTITION_SIZE_IN_BYTES =
buildConf("spark.sql.adaptive.advisoryPartitionSizeInBytes")
.doc("The advisory size in bytes of the shuffle partition during adaptive optimization. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The advisory size in bytes of the shuffle partition during adaptive optimization (when '${ADAPTIVE_EXECUTION_ENABLED.key}' is true).

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except a few minor comments.

@SparkQA
Copy link

SparkQA commented Mar 5, 2020

Test build #119381 has finished for PR 27793 at commit 8d18db8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

thanks for review, merging to master/3.0!

@cloud-fan cloud-fan closed this in ba86524 Mar 5, 2020
cloud-fan added a commit that referenced this pull request Mar 5, 2020
When introducing AQE to others, I feel the config names are a bit incoherent and hard to use.
This PR refines the config names:
1. remove the "shuffle" prefix. AQE is all about shuffle and we don't need to add the "shuffle" prefix everywhere.
2. `targetPostShuffleInputSize` is obscure, rename to `advisoryShufflePartitionSizeInBytes`.
3. `reducePostShufflePartitions` doesn't match the actual optimization, rename to `coalesceShufflePartitions`
4. `minNumPostShufflePartitions` is obscure, rename it `minPartitionNum` under the `coalesceShufflePartitions` namespace
5. `maxNumPostShufflePartitions` is confusing with the word "max", rename it `initialPartitionNum`
6. `skewedJoinOptimization` is too verbose. skew join is a well-known terminology in database area, we can just say `skewJoin`

Make the config names easy to understand.

deprecate the config `spark.sql.adaptive.shuffle.targetPostShuffleInputSize`

N/A

Closes #27793 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

When introducing AQE to others, I feel the config names are a bit incoherent and hard to use.
This PR refines the config names:
1. remove the "shuffle" prefix. AQE is all about shuffle and we don't need to add the "shuffle" prefix everywhere.
2. `targetPostShuffleInputSize` is obscure, rename to `advisoryShufflePartitionSizeInBytes`.
3. `reducePostShufflePartitions` doesn't match the actual optimization, rename to `coalesceShufflePartitions`
4. `minNumPostShufflePartitions` is obscure, rename it `minPartitionNum` under the `coalesceShufflePartitions` namespace
5. `maxNumPostShufflePartitions` is confusing with the word "max", rename it `initialPartitionNum`
6. `skewedJoinOptimization` is too verbose. skew join is a well-known terminology in database area, we can just say `skewJoin`

### Why are the changes needed?

Make the config names easy to understand.

### Does this PR introduce any user-facing change?

deprecate the config `spark.sql.adaptive.shuffle.targetPostShuffleInputSize`

### How was this patch tested?

N/A

Closes apache#27793 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants

Comments