
[SPARK-40211][CORE][SQL] Allow customize initial partitions number in take() behavior #37661

Closed · wants to merge 7 commits

Conversation

@liuzqt (Contributor) commented Aug 25, 2022

What changes were proposed in this pull request?

SPARK-40211 adds an `initialNumPartitions` config parameter to allow customizing the number of initial partitions to try in `take()`.

Why are the changes needed?

Currently, the initial number of partitions to try is hardcoded to 1, which can cause unnecessary overhead. Setting this new configuration to a high value effectively mitigates the “run multiple jobs” overhead of `take()`; a higher-than-1-but-still-small value (say, 10) gives a middle-ground trade-off.
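To make the trade-off concrete, here is a plain-Scala sketch (not Spark's actual implementation; the row-count estimate is simplified) of `take()`'s partition-scanning loop, showing how a larger initial partition count reduces the number of jobs launched:

```scala
// Sketch of take()'s iterative partition scanning. `initialNumPartitions`
// and `scaleUpFactor` mirror the configs discussed in this PR.
def jobsToTake(totalParts: Long, rowsPerPart: Long, num: Long,
               initialNumPartitions: Int, scaleUpFactor: Int = 4): Int = {
  var partsScanned = 0L
  var collected = 0L
  var jobs = 0
  var numPartsToTry = math.max(initialNumPartitions, 1).toLong
  while (collected < num && partsScanned < totalParts) {
    if (partsScanned > 0) {
      numPartsToTry =
        if (collected == 0) partsScanned * scaleUpFactor
        else {
          // grow based on how many rows the scanned partitions yielded,
          // capped by the scale-up factor
          val estimate = (1.5 * num * partsScanned / collected).toLong - partsScanned
          math.min(math.max(estimate, 1L), partsScanned * scaleUpFactor)
        }
    }
    val p = math.min(numPartsToTry, totalParts - partsScanned)
    collected += p * rowsPerPart // pretend each partition yields rowsPerPart rows
    partsScanned += p
    jobs += 1 // each iteration corresponds to one Spark job
  }
  jobs
}
```

With 1000 one-row partitions and `take(100)`, starting from 1 partition launches several successive jobs, while starting from 100 partitions finishes in a single job.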

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test

- if (results.size == 0) {
-   numPartsToTry = partsScanned * 4L
+ if (results.isEmpty) {
+   numPartsToTry = partsScanned * scaleUpFactor
Contributor


It looks like this is fixing a pre-existing bug where the RDD_LIMIT_SCALE_UP_FACTOR wasn't being applied to the AsyncRDDActions version of take(). Nice!

core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala (review thread resolved)
@liuzqt liuzqt requested a review from JoshRosen August 25, 2022 20:49
@JoshRosen JoshRosen changed the title [SPARK-40211][CORE][SQL]allow customize initial partitions number in take() behavior [SPARK-40211][CORE][SQL] Allow customize initial partitions number in take() behavior Aug 26, 2022
@JoshRosen (Contributor) left a comment:


LGTM pending tests.

It looks like GitHub Actions might not be properly enabled for your fork. Can you follow the instructions linked from https://spark.apache.org/contributing.html to enable actions in your fork, then push an empty commit to re-trigger tests?

Go to the “Actions” tab on your forked repository and enable the “Build and test” and “Report test results” workflows.

@mridulm (Contributor) left a comment:


Just a minor comment, looks good to me.
Nice change @liuzqt !

@@ -84,18 +87,18 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
      } else {
        // The number of partitions to try in this iteration. It is ok for this number to be
        // greater than totalParts because we actually cap it at totalParts in runJob.
-       var numPartsToTry = 1L
+       var numPartsToTry = Math.max(self.conf.get(RDD_LIMIT_INITIAL_NUM_PARTITIONS), 1)
Contributor


Enforce it in Config itself and always use self.conf.get(RDD_LIMIT_INITIAL_NUM_PARTITIONS) ?

For RDD_LIMIT_INITIAL_NUM_PARTITIONS:

    ...
    .intConf
    .checkValue(_ > 0, "")
    ...
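The idea of the suggestion is to validate in the config definition itself so call sites can read the value directly without re-clamping. A hypothetical plain-Scala sketch (Spark's real `ConfigBuilder` API differs; the names and the config key shown here are illustrative):

```scala
// Illustrative stand-in for a config entry that enforces its own invariant,
// so consumers never need Math.max(..., 1) at the call site.
final case class ConfEntry(key: String, default: Int, doc: String) {
  def get(settings: Map[String, String]): Int = {
    val v = settings.get(key).map(_.toInt).getOrElse(default)
    // the checkValue(_ > 0, ...) idea: reject bad values at read time
    require(v > 0, s"$key must be positive, got $v")
    v
  }
}

val rddLimitInitialNumPartitions = ConfEntry(
  "spark.rdd.limit.initialNumPartitions", // illustrative key
  default = 1,
  doc = "Initial number of partitions to try in take()-like operations")
```

With the invariant enforced here, `var numPartsToTry = conf.get(...)` no longer needs a `Math.max(..., 1)` wrapper.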

Contributor Author


Nice idea, modified accordingly.

@JoshRosen (Contributor):

LGTM.

I'm going to merge this to master (Spark 3.4.0). Thanks!

@JoshRosen JoshRosen closed this in 1178bce Aug 27, 2022
HyukjinKwon pushed a commit that referenced this pull request Aug 30, 2022
…4.9f463a9

### What changes were proposed in this pull request?
This PR aims to upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

PS:
1. Following #37489
2. Fixes https://issues.apache.org/jira/browse/SPARK-40221
### Why are the changes needed?
The last upgrade occurred 2 years ago.
Maven repo versions: https://mvnrepository.com/artifact/org.antipathy/mvn-scalafmt

The new version specifically updates the following:
1. SimonJPegg/mvn_scalafmt@6b9e0a4
2. mvn_scalafmt_2.11 (Scala 2.11) is deprecated: SimonJPegg/mvn_scalafmt@9f3d109

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GitHub Actions, plus manual testing:
1. mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false
2. mvn -Pscala-2.13 scalafmt:format -Dscalafmt.skip=false
3. Downloaded the #37661 branch (SPARK-40211) and executed ./dev/scalafmt

Closes #37727 from panbingkun/upgrade_scalafmt.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
chenzhx pushed a commit to chenzhx/spark that referenced this pull request Nov 3, 2022
… take() behavior

[SPARK-40211](https://issues.apache.org/jira/browse/SPARK-40211) adds an `initialNumPartitions` config parameter to allow customizing the number of initial partitions to try in `take()`.

Currently, the initial number of partitions to try is hardcoded to `1`, which can cause unnecessary overhead. Setting this new configuration to a high value effectively mitigates the “run multiple jobs” overhead of `take()`; a higher-than-1-but-still-small value (say, 10) gives a middle-ground trade-off.

NO

Unit test

Closes apache#37661 from liuzqt/SPARK-40211.

Authored-by: Ziqi Liu <ziqi.liu@databricks.com>
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
(The same commit was subsequently mirrored with an identical message by chenzhx, yabola, leejaywei, hellozepp, zheniantoushipashi, and RolatZhang to Kyligence/spark and hellozepp/spark between Nov 2022 and Aug 2023.)