[SPARK-37528][SQL][CORE] Schedule Tasks By Input Size #34791

ulysses-you · 2021-12-03T06:01:32Z

What changes were proposed in this pull request?

In general, a larger input data size means a longer running time. So ideally, we can let DAGScheduler submit the tasks with the largest input size first, which can reduce the running time of the whole stage.

design doc

This PR adds two cases as the initial implementation:

- DataSource V1 file scan for leaf nodes
- coalesced partition specs for AQE

Why are the changes needed?

For example, suppose one stage has 4 tasks with running times [1s, 3s, 2s, 4s], and defaultParallelism is 2.

With the normal order, the running time of the stage is 7s (each line lists the tasks run on one slot):
- 1, 2
- 3, 4
With the biggest tasks first, the running time of the stage is 5s:
- 4, 1
- 3, 2

In a worse case, suppose we have a skewed task set [1s, 3s, 3s, 7s, 7s, 20s], again with defaultParallelism 2:

With the normal order, the running time of the stage is 30s:
- 1, 3, 7
- 3, 7, 20
With the biggest tasks first, the running time of the stage is 21s:
- 20, 1
- 7, 7, 3, 3
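
The arithmetic in these examples can be checked with a small simulation. This is only a sketch, not code from the PR: it assumes a simplified model in which each task, in submission order, is handed to whichever slot becomes free first, and it ignores locality, scheduling delay, and everything else the real scheduler considers. The function names `stage_running_time` and `big_task_first` are made up for illustration. Under this model, the six-task skewed set takes 30s in submission order and 21s with the biggest tasks first.

```python
import heapq


def stage_running_time(task_times, parallelism):
    """Greedy list scheduling: each task, taken in the given order,
    runs on the slot that becomes free first (simplified model)."""
    slots = [0] * parallelism  # time at which each slot becomes free
    heapq.heapify(slots)
    for t in task_times:
        free_at = heapq.heappop(slots)  # earliest available slot
        heapq.heappush(slots, free_at + t)
    return max(slots)  # the stage ends when the last slot finishes


def big_task_first(task_times):
    """Hypothetical helper: order tasks by descending running time."""
    return sorted(task_times, reverse=True)


balanced = [1, 3, 2, 4]
skewed = [1, 3, 3, 7, 7, 20]

print(stage_running_time(balanced, 2))                  # 7
print(stage_running_time(big_task_first(balanced), 2))  # 5
print(stage_running_time(skewed, 2))                    # 30
print(stage_running_time(big_task_first(skewed), 2))    # 21
```

In the sketch the sort key is the running time itself. The PR uses input size as a proxy for running time, so the real benefit depends on how well the two correlate.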

Does this PR introduce any user-facing change?

Yes, a new config spark.scheduler.sortTasksByInputSize.enabled in the core module that decides whether tasks are allowed to be sorted by input size.

How was this patch tested?

Added tests in:

- org.apache.spark.scheduler.DAGSchedulerSuite
- org.apache.spark.sql.execution.ScheduleTasksByInputSizeSuiteBase

SparkQA · 2021-12-03T07:36:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50359/

SparkQA · 2021-12-03T08:40:08Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50359/

SparkQA · 2021-12-03T09:08:30Z

Test build #145884 has finished for PR 34791 at commit 2ccebe0.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

ulysses-you · 2021-12-03T09:47:17Z

retest this please

SparkQA · 2021-12-03T10:40:29Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50372/

SparkQA · 2021-12-03T11:41:00Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50372/

SparkQA · 2021-12-03T12:53:58Z

Test build #145897 has finished for PR 34791 at commit 2ccebe0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2022-02-09T02:24:58Z

The idea looks reasonable.
If this functionality can only be used in SQL with AQE enabled, what about just making it an AQEShuffleReadRule which inserts/updates AQEShuffleReadExecs with reordered ShufflePartitionSpecs?

cc @cloud-fan

cloud-fan · 2022-02-10T05:32:44Z

The general idea is to make tasks report more statistics so that the task scheduler can schedule them better. This is really a big feature and I'm a bit hesitant to merge any partial improvements without an overall design.

mridulm · 2022-02-10T16:38:36Z

Let us hold off on this until #35185 has been merged - else users will have no way to identify partitions in a stage.

mridulm · 2022-02-10T16:40:05Z

Btw, this is weak precedence assuming all tasks match the same locality (or no locality) - we should word the config documentation appropriately.

ulysses-you · 2022-02-11T01:48:21Z

> Btw, this is weak precedence assuming all tasks match the same locality (or no locality) - we should word the config documentation appropriately.

That's true. It changes nothing when the tasks have different locality. It only makes a best effort to run larger tasks first without breaking the existing task scheduling.
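
The "weak precedence" described in this exchange can be pictured with a short sketch. It is not the PR's implementation: it assumes, purely for illustration, that pending tasks are already grouped under a locality key (a made-up host name here) and that each task is a `(task_id, input_size)` pair. Size ordering is applied only inside each group, so it never overrides the locality grouping.

```python
def sort_within_locality(pending):
    """Sort each locality group by descending input size.

    `pending` maps a hypothetical locality key to a list of
    (task_id, input_size) pairs. Groups are never merged or reordered,
    so locality decisions are left untouched; Python's sort is stable,
    so tasks of equal size keep their original relative order.
    """
    return {
        key: sorted(tasks, key=lambda task: task[1], reverse=True)
        for key, tasks in pending.items()
    }


pending = {
    "host-a": [(0, 10), (1, 50)],
    "host-b": [(2, 30), (3, 5), (4, 30)],
}
print(sort_within_locality(pending))
# {'host-a': [(1, 50), (0, 10)], 'host-b': [(2, 30), (4, 30), (3, 5)]}
```

If every task lands in the same group (same locality or none), this reduces to a plain biggest-first order, which is the case the config documentation would need to describe.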


github-actions · 2022-09-29T00:32:04Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added CORE SQL labels Dec 3, 2021

Support reorder tasks during scheduling by shuffle partition size in AQE

2ccebe0

ulysses-you force-pushed the reorder-task branch from e16c42a to 2ccebe0 Compare December 3, 2021 06:06

ulysses-you added 4 commits March 25, 2022 15:42

Merge branch 'master' of https://github.com/apache/spark into reorder…

042e092

…-task

fix

16aaae7

Merge branch 'master' of https://github.com/apache/spark into reorder…

86f20c0

…-task

test

61e73f0

ulysses-you changed the title ~~[SPARK-37528][SQL][CORE] Support reorder tasks during scheduling by shuffle partition size in AQE~~ [SPARK-37528][SQL][CORE] Schedule Tasks By Input Size Apr 1, 2022

ulysses-you added 4 commits April 7, 2022 14:19

Merge branch 'master' of https://github.com/apache/spark into reorder…

33b7132

…-task

simplify

ee44d03

test

f8b6928

nit

c6cbe48

github-actions bot added the Stale label Sep 29, 2022

github-actions bot closed this Sep 30, 2022
