[SPARK-37528][SQL][CORE] Schedule Tasks By Input Size #34791
Conversation
retest this please
The idea looks reasonable. cc @cloud-fan
The general idea is to make tasks report more statistics so that the task scheduler can schedule them better. This is really a big feature and I'm a bit hesitant to merge any partial improvements without an overall design.
Let us hold off on this until #35185 has been merged - else users will have no way to identify partitions in a stage.
Btw, this is weak precedence assuming all tasks match the same locality (or no locality) - we should word the config documentation appropriately.
True, it does not affect tasks that have different locality preferences. It just tries its best to make larger tasks run first without breaking the existing task scheduling.
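A minimal Scala sketch of that "weak precedence", with purely illustrative names (this is not the actual TaskSetManager code): locality level stays the primary ordering key, and input size only breaks ties among tasks at the same level.

```scala
// Illustrative only: locality level (lower is better) remains the primary key,
// so input size never overrides locality; it only puts the larger tasks first
// among tasks that are equally good locality-wise.
case class PendingTask(index: Int, localityLevel: Int, inputBytes: Long)

def scheduleOrder(tasks: Seq[PendingTask]): Seq[PendingTask] =
  tasks.sortBy(t => (t.localityLevel, -t.inputBytes))
```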
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
In general, a larger input data size means a longer running time, so ideally the DAGScheduler can submit the tasks with larger input sizes first. This can reduce the overall running time of the stage.
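For illustration only (this is not the PR's actual code), the core idea can be sketched as ordering a stage's partitions by their input bytes before they are offered to the scheduler; the helper below and its inputs are hypothetical:

```scala
// Illustrative sketch: given the input bytes of each partition in a stage,
// return the order in which the partitions would be offered to the scheduler,
// largest input first.
def submissionOrder(bytesByPartition: Array[Long]): Seq[Int] =
  bytesByPartition.indices.sortBy(i => -bytesByPartition(i))

submissionOrder(Array(10L, 400L, 70L)) // Seq(1, 2, 0): the 400-byte partition goes first
```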
design doc
This PR adds two cases as the initial implementation:
Why are the changes needed?
For example, suppose one stage has 4 tasks with different running times [1s, 3s, 2s, 4s] and defaultParallelism is 2. Assigning each task to the first slot that frees up, running them in submission order finishes the stage in 7s, while running the larger tasks first finishes it in 5s.
The gap is worse for a skewed task set [1s, 3s, 3s, 7s, 7s, 20s] with the same defaultParallelism of 2: submission order takes 30s, while largest-first takes 21s.
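To make the comparison easy to reproduce, here is a tiny Scala sketch (not part of the PR) that list-schedules the durations above onto 2 slots and reports the stage makespan:

```scala
// Tiny greedy simulator: tasks are taken in the given order and each one is
// assigned to whichever slot frees up first; the result is the stage makespan.
def makespan(durations: Seq[Int], slots: Int = 2): Int = {
  val finishTimes = Array.fill(slots)(0)
  durations.foreach { d =>
    val i = finishTimes.indexOf(finishTimes.min)
    finishTimes(i) += d
  }
  finishTimes.max
}

makespan(Seq(1, 3, 2, 4))                                       // 7
makespan(Seq(1, 3, 2, 4).sorted(Ordering[Int].reverse))         // 5
makespan(Seq(1, 3, 3, 7, 7, 20))                                // 30
makespan(Seq(1, 3, 3, 7, 7, 20).sorted(Ordering[Int].reverse))  // 21
```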
Does this PR introduce any user-facing change?
Yes, it adds a new config:
spark.scheduler.sortTasksByInputSize.enabled
in the core module, which controls whether tasks within a stage are sorted by input size.
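A usage sketch, assuming a build that includes this change (the config below is the one proposed by this PR; setting it explicitly avoids relying on its default value):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Opt in to scheduling tasks by input size; this config only exists in builds
// that include the change proposed in this PR.
val conf = new SparkConf()
  .set("spark.scheduler.sortTasksByInputSize.enabled", "true")

val spark = SparkSession.builder().config(conf).getOrCreate()
```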
How was this patch tested?
Added tests in: