[SPARK-31037][SQL] refine AQE config names #27793
cloud-fan wants to merge 2 commits into apache:master
Test build #119314 has finished for PR 27793 at commit
Test build #119316 has finished for PR 27793 at commit
"number" -> "minimum number"
Test build #119324 has finished for PR 27793 at commit
Test build #119325 has finished for PR 27793 at commit
Can we make these shorter and easier to remember? Personally, I would change spark.sql.adaptive.coalesceShufflePartitions.enabled to spark.sql.adaptive.mergePartitions, spark.sql.adaptive.coalesceShufflePartitions.initialPartitionNum to spark.sql.adaptive.mergePartitions.initialNum, and spark.sql.adaptive.coalesceShufflePartitions.minPartitionNum to spark.sql.adaptive.mergePartitions.minNum.
This behavior happens at shuffle, and the doc field already says so, so we may not need to enforce it in the config name; it also seems there is no other place under adaptive that coalesces partitions. And "merge" might be easier to spell than "coalesce" :)
I'm fine with "coalescePartitions".
Shall we also rename advisoryShufflePartitionSizeInBytes to advisoryPartitionSizeInBytes?
advisoryPartitionSizeInBytes looks good to me.
Can we unify the verb (reduce, coalesce, or merge) across both the configuration name here and the optimization rule name ReduceNumShufflePartitions? If we use coalescePartitions, would it be better to rename the optimization rule from ReduceNumShufflePartitions to CoalesceShufflePartitions?
I think we should, but we can do it in another PR. The code name is not user-facing and doesn't need to make it into 3.0.
The local shuffle reader may optimize both the build side and the probe side?
Test build #119363 has finished for PR 27793 at commit
Test build #119366 has finished for PR 27793 at commit
Test build #119368 has finished for PR 27793 at commit
@@ -67,8 +67,8 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
 * SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE.
Change it to ADVISORY_PARTITION_SIZE_IN_BYTES?
val SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE =
  buildConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize")
    .internal()
    .doc("(Deprecated since Spark 3.0)")
Should we also tell users which new conf replaces it?
The doc here is not user-facing. End users will see a message suggesting the new config via https://github.com/apache/spark/pull/27793/files#diff-9a6b543db706f1a90f790783d6930a13R2494
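The deprecation flow described in this reply can be sketched as a small lookup from a deprecated key to the advice shown to the user. This is a minimal illustration, not Spark's actual `SQLConf` code: the `DeprecatedConfigs` object, the `Deprecation` case class, `warningFor`, and the exact message wording are all hypothetical. Only the two config keys come from this PR.

```scala
// Hypothetical stand-in for a deprecation registry; not Spark's actual SQLConf code.
object DeprecatedConfigs {
  // One deprecated key, the version that deprecated it, and the advice shown to users.
  final case class Deprecation(key: String, version: String, advice: String)

  private val entries: Map[String, Deprecation] = Seq(
    Deprecation(
      "spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
      "3.0",
      "Use 'spark.sql.adaptive.advisoryPartitionSizeInBytes' instead.")
  ).map(d => d.key -> d).toMap

  // Returns the warning to log when `key` is set, or None if the key is not deprecated.
  // The message wording is an assumption made for this sketch.
  def warningFor(key: String): Option[String] =
    entries.get(key).map { d =>
      s"The SQL config '${d.key}' has been deprecated in Spark v${d.version}. ${d.advice}"
    }
}
```

Keeping the advice in a registry rather than in the `.doc(...)` string matches the point made above: the doc of an internal config is not shown to end users, while a warning emitted when the old key is set is.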
    "the number of post-shuffle partitions based on map output statistics.")
val ADVISORY_PARTITION_SIZE_IN_BYTES =
  buildConf("spark.sql.adaptive.advisoryPartitionSizeInBytes")
    .doc("The advisory size in bytes of the shuffle partition during adaptive optimization. " +
The advisory size in bytes of the shuffle partition during adaptive optimization (when '${ADAPTIVE_EXECUTION_ENABLED.key}' is true).
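The suggested wording relies on Scala string interpolation to embed another config's key in the doc text, so the doc stays correct if that key is ever renamed. A minimal sketch, assuming a hypothetical `ConfEntry` case class in place of Spark's config builder; the key `spark.sql.adaptive.enabled` and the first doc string are filled in for illustration:

```scala
object DocInterpolation {
  // Hypothetical minimal config entry: just a key and a doc string.
  final case class ConfEntry(key: String, doc: String)

  // Declared first so it is initialized before the entry that references it.
  val ADAPTIVE_EXECUTION_ENABLED: ConfEntry =
    ConfEntry("spark.sql.adaptive.enabled", "When true, enable adaptive query execution.")

  // The doc string interpolates the other entry's key instead of hard-coding it.
  val ADVISORY_PARTITION_SIZE_IN_BYTES: ConfEntry = ConfEntry(
    "spark.sql.adaptive.advisoryPartitionSizeInBytes",
    "The advisory size in bytes of the shuffle partition during adaptive optimization " +
      s"(when '${ADAPTIVE_EXECUTION_ENABLED.key}' is true).")
}
```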
gatorsmile left a comment
LGTM except a few minor comments.
Test build #119381 has finished for PR 27793 at commit
Thanks for the review, merging to master/3.0!
### What changes were proposed in this pull request?

When introducing AQE to others, I feel the config names are a bit incoherent and hard to use.
This PR refines the config names:
1. remove the "shuffle" prefix. AQE is all about shuffle and we don't need to add the "shuffle" prefix everywhere.
2. `targetPostShuffleInputSize` is obscure, rename to `advisoryShufflePartitionSizeInBytes`.
3. `reducePostShufflePartitions` doesn't match the actual optimization, rename to `coalesceShufflePartitions`
4. `minNumPostShufflePartitions` is obscure, rename it `minPartitionNum` under the `coalesceShufflePartitions` namespace
5. `maxNumPostShufflePartitions` is confusing with the word "max", rename it `initialPartitionNum`
6. `skewedJoinOptimization` is too verbose. skew join is a well-known terminology in database area, we can just say `skewJoin`

### Why are the changes needed?

Make the config names easy to understand.

### Does this PR introduce any user-facing change?

deprecate the config `spark.sql.adaptive.shuffle.targetPostShuffleInputSize`

### How was this patch tested?

N/A

Closes #27793 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
When introducing AQE to others, I feel the config names are a bit incoherent and hard to use.
This PR refines the config names:
1. `targetPostShuffleInputSize` is obscure, rename to `advisoryPartitionSizeInBytes`.
2. `reducePostShufflePartitions` doesn't match the actual optimization, rename to `coalescePartitions`
3. `minNumPostShufflePartitions` is obscure, rename it `minPartitionNum` under the `coalescePartitions` namespace
4. `maxNumPostShufflePartitions` is confusing with the word "max", rename it `initialPartitionNum`
5. `skewedJoinOptimization` is too verbose. skew join is a well-known terminology in database area, we can just say `skewJoin`

Why are the changes needed?
Make the config names easy to understand.
Does this PR introduce any user-facing change?
deprecate the config `spark.sql.adaptive.shuffle.targetPostShuffleInputSize`

How was this patch tested?
N/A
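
The renames listed in the PR description can be summarized as a lookup table from old to new short names. This is only an illustrative sketch: the `AqeConfigRenames` object and `renamed` helper are hypothetical and not part of Spark. The names are the short forms used in the description, not full `spark.sql.adaptive.*` keys, and the placement of `initialPartitionNum` under `coalescePartitions` is inferred from the review discussion.

```scala
// Hypothetical helper summarizing the renames; not part of Spark.
object AqeConfigRenames {
  // Old short name -> new short name, as listed in the PR description.
  val renames: Map[String, String] = Map(
    "targetPostShuffleInputSize" -> "advisoryPartitionSizeInBytes",
    "reducePostShufflePartitions" -> "coalescePartitions",
    "minNumPostShufflePartitions" -> "coalescePartitions.minPartitionNum",
    "maxNumPostShufflePartitions" -> "coalescePartitions.initialPartitionNum",
    "skewedJoinOptimization" -> "skewJoin"
  )

  // Returns the new name for a renamed config, or the input unchanged otherwise.
  def renamed(name: String): String = renames.getOrElse(name, name)
}
```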