
Spark 3.4: Adaptive split size #7714

Merged
merged 1 commit into apache:master on Jul 28, 2023

Conversation

@aokolnychyi (Contributor) commented May 26, 2023

This PR is an alternative to #7688 and to what was initially envisioned in #7465.

@@ -232,6 +234,13 @@ protected synchronized List<ScanTaskGroup<T>> taskGroups() {
    return taskGroups;
  }

  private long targetSplitSize() {
    long scanSize = tasks().stream().mapToLong(ScanTask::sizeBytes).sum();
    int parallelism = sparkContext().defaultParallelism();
@aokolnychyi (author) commented:

This gives the number of cores in the cluster or spark.default.parallelism if set explicitly.
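
For context, a minimal sketch (assumed helper names, not the PR's actual code) of how these two values can combine into a scan-size-based split size so that the scan yields roughly one split per available slot:

    // Hypothetical helper, for illustration only.
    class SplitSizeSketch {
      static long suggestedSplitSize(long scanSize, int parallelism) {
        // ceiling division so the remainder does not end up as an extra tiny split
        return Math.max((scanSize + parallelism - 1) / parallelism, 1L);
      }
    }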

@aokolnychyi (author) also commented on the scanSize line of the same hunk:

This would even handle runtime filtering, so the number of splits may differ after runtime filtering.

@aokolnychyi (author) commented Jun 23, 2023:

I gave this a bit of testing on a cluster. In some cases, I saw quite a bit of degradation when the split size was adjusted to a higher value: shuffle write time increased dramatically when I was processing entire records. I think this is because Spark has to sort records by reducer ID during the map phase of a shuffle when the hash shuffle manager is not used (> 200 reducers). There were cases where it helped, but it seems too risky to do by default.

I will rework this approach to only pick a smaller split size to utilize all cluster slots.
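
In other words, a hedged sketch of the reworked idea (the method below is an assumption for illustration; the exact bounds in the merged code may differ): the adaptive value is only allowed to shrink the configured split size, never to grow it.

    // Hypothetical illustration of "only pick a smaller split size": never exceed
    // the configured split size, only shrink it when the scan is small enough that
    // the configured size would leave some cluster slots without work.
    class AdaptiveSplitSizeSketch {
      static long targetSplitSize(long scanSize, int parallelism, long configuredSplitSize) {
        long adaptive = Math.max(scanSize / parallelism, 1L);
        return Math.min(configuredSplitSize, adaptive);
      }
    }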

@aokolnychyi (author) commented:
@rdblue, I've updated the approach after testing it on the cluster. Could you take another look?

@aokolnychyi reopened this Jun 27, 2023
  private long targetSplitSize() {
    if (readConf().adaptiveSplitSizeEnabled()) {
      long scanSize = tasks().stream().mapToLong(ScanTask::sizeBytes).sum();
      int parallelism = sparkContext().defaultParallelism();
A reviewer (Contributor) commented:

Why use the default parallelism instead of the shuffle partitions setting? Is this set correctly by default when using dynamic allocation? I've always used the shuffle partitions because that's more likely to be tuned correctly for a job.

@aokolnychyi (author) replied:

The default parallelism is populated via TaskScheduler from SchedulerBackend:

  override def defaultParallelism(): Int = {
    conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
  }

The core count is being updated each time an executor is added/dropped so dynamic allocation should work.

@aokolnychyi (author) added:

There may be issues if the parallelism is set explicitly via spark.default.parallelism, but I doubt people actually set that. We need to know the number of slots in the cluster, and this seems to be the closest estimate. What do you think?

@aokolnychyi (author) added:

I took another look and believe the current logic would perform better than the number of shuffle partitions.

A reviewer (Contributor) commented:

@aokolnychyi We will use spark.dynamicAllocation.initialExecutors * spark.executor.cores as the parallelism if dynamic resource allocation is enabled for a newly submitted application. The initial number of executors may be small (such as 2) at application startup. Could this be a problem?
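
To make the concern concrete (numbers below are hypothetical, following the configuration named above):

    spark.dynamicAllocation.initialExecutors = 2
    spark.executor.cores                     = 4
    defaultParallelism at planning time      = 2 * 4 = 8
    slots after scale-up (e.g. 50 executors) = 50 * 4 = 200

A split size chosen against 8 slots at planning time may therefore be far too coarse for the cluster the job actually runs on after scale-up.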

A reviewer (Contributor) replied:

> The core count is being updated each time an executor is added/dropped so dynamic allocation should work.

I don't think it would because the job may be planned before the initial stage is submitted and the cluster scales up. I think shuffle parallelism is the most reliable way to know how big to go.

@aokolnychyi (author) replied:

@rdblue, I meant the core count would adjust once the cluster scales up. The initial job may not benefit from this. I wasn't sure whether that is a big deal given that acquiring new executors is generally slow.

I feel we should use the current core count if dynamic allocation is disabled (which we can check). When dynamic allocation is enabled, we can rely on the number of shuffle partitions or check the dynamic allocation config (e.g. we know the core count per executor and the max number of executors). It seems the dynamic allocation config would give us a more precise estimate.

Thoughts, @rdblue @ConeyLiu?
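
A hedged sketch of that proposal (the class and method names, and the fallback to spark.sql.shuffle.partitions when the max executor count is unbounded, are assumptions for illustration, not the merged code):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Hypothetical helper: estimate the number of slots available to the job.
    class ParallelismEstimateSketch {
      static int estimateParallelism(SparkConf conf, JavaSparkContext sparkContext) {
        boolean dynamicAllocation = conf.getBoolean("spark.dynamicAllocation.enabled", false);
        if (!dynamicAllocation) {
          // dynamic allocation off: trust the scheduler's current core count
          return sparkContext.defaultParallelism();
        }
        int maxExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors", Integer.MAX_VALUE);
        int executorCores = conf.getInt("spark.executor.cores", 1);
        if (maxExecutors == Integer.MAX_VALUE) {
          // no upper bound configured: fall back to the shuffle parallelism
          return conf.getInt("spark.sql.shuffle.partitions", 200);
        }
        // dynamic allocation on with a bound: upper bound on slots from the allocation config
        return maxExecutors * executorCores;
      }
    }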

A reviewer (Contributor) replied:

> I feel we should use the current core count if dynamic allocation is disabled (which we can check).

I agree with this. It should be easy to check that and to get the parallelism.

> When dynamic allocation is enabled, we can rely on the number of shuffle partitions or check the dynamic allocation config (e.g. we know the core count per executor and the max number of executors). It seems the dynamic allocation config would give us a more precise estimate.

In my opinion, I would rather calculate the parallelism from the max number of executors, because the number of shuffle partitions seems more like a parameter for the shuffle (reduce) stage parallelism.

@@ -80,10 +80,18 @@ abstract class SparkScan implements Scan, SupportsReportStatistics {
    this.branch = readConf.branch();
  }

  protected JavaSparkContext sparkContext() {
A reviewer (Contributor) commented:

Can we get the parallelism a different way? What about exposing just the parallelism and not actually the conf or context?

@aokolnychyi (author) replied:

I moved the method to SparkScan and neither sparkContext nor readConf are exposed now.

@@ -232,6 +232,16 @@ protected synchronized List<ScanTaskGroup<T>> taskGroups() {
    return taskGroups;
  }

  private long targetSplitSize() {
A reviewer (Contributor) commented:

Don't we need this in other places as well? Or does everything go through SparkPartitioningAwareScan now?

@aokolnychyi (author) replied:

I'll check but I think so.

@aokolnychyi (author) added:

The only place we miss is SparkChangelogScan, as it plans tasks directly. We can update it later to plan files first, or it will automatically inherit this functionality once we support adaptive split sizes in core.

@rdblue (Contributor) left a review:

Overall, this is a good change. I like that it is simpler than the other approach, which is limited by passing parallelism through the scan anyway; that makes the alternative require a lot more changes for not much benefit.

My only issue is how it's plugged in. It seems like we should be using shuffle parallelism, I don't think I'd add a Spark SQL property, and I'd prefer it to be a little cleaner (not exposing sparkContext()). We also need to make sure this is applied everywhere, but I think this version was just for demonstration, not really meant to be committed yet?

@aokolnychyi merged commit 869301b into apache:master on Jul 28, 2023
41 checks passed
@puchengy (Contributor) commented:
@aokolnychyi, is there a plan to port this to Spark 3.2? Thanks.

@aokolnychyi (author) replied:
@puchengy, we can; we have to discuss the best way to determine the parallelism, though.
