[SPARK-30448][Core] accelerator aware scheduling enforce cores as limiting resource #27138
Conversation
Test build #116322 has finished for PR 27138 at commit
@@ -2818,12 +2825,28 @@ object SparkContext extends Logging {
      // multiple executor resources.
      val resourceNumSlots = Math.floor(execAmount * taskReq.numParts / taskReq.amount).toInt
      if (resourceNumSlots < numSlots) {
        if (shouldCheckExecCores) {
          throw new IllegalArgumentException("The number of slots on an executor has to be " +
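The slot computation in the diff above can be illustrated with a small standalone sketch. The object and method names here are hypothetical; only the formula (`execAmount * taskReq.numParts / taskReq.amount`, floored) follows the diff, where `numParts` accounts for fractional per-task resource requests:

```scala
// Hypothetical sketch of the per-resource slot computation shown in the
// diff above; not the actual SparkContext code.
object SlotCheckSketch {
  // Slots a single resource allows per executor. In the real code the
  // per-task requirement is taskReq.amount / taskReq.numParts, so the
  // slot count is execAmount * numParts / amount, floored.
  def resourceNumSlots(execAmount: Long, taskAmount: Long, numParts: Int): Int =
    math.floor(execAmount.toDouble * numParts / taskAmount).toInt

  def main(args: Array[String]): Unit = {
    // 4 GPUs per executor, 1 GPU per task -> 4 slots
    assert(resourceNumSlots(4, 1, 1) == 4)
    // 4 GPUs per executor, 2 GPUs per task -> 2 slots
    assert(resourceNumSlots(4, 2, 1) == 2)
    // Fractional request: amount 1 split into 2 parts = 0.5 GPU per task -> 8 slots
    assert(resourceNumSlots(4, 1, 2) == 8)
    println("ok")
  }
}
```

If this per-resource slot count is lower than the cpu-based slot count, the resource (not cores) is limiting, which is the condition the exception guards against.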
only throw this exception when dynamic allocation is enabled?
I think it's safer to always require it, just in case there are other places in the code that use cores and task cpus to determine slots. I know in doing the stage-level scheduling work there were a bunch of places that did this, but I would have to go back through to see if they only apply during dynamic allocation.
Actually, one example of this is #27126
sounds good
@@ -911,7 +927,7 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventually {
      """{"name": "gpu","addresses":["0", "1", "2", "3", "4", "5", "6", "7", "8"]}""")

      val conf = new SparkConf()
-       .setMaster("local-cluster[3, 3, 1024]")
+       .setMaster("local-cluster[3, 1, 1024]")
IIUC we require 3 cores on each executor in this case.
The test passes with it like this. The requirement looks like it's 3 executors, not 3 tasks per executor, and it waits for all executors to be up.
ah yes
LGTM
Test build #116323 has finished for PR 27138 at commit
retest this please
Test build #116327 has finished for PR 27138 at commit
merged this, thanks @jiangxb1987
What changes were proposed in this pull request?
This PR makes sure cores is the limiting resource when using accelerator-aware scheduling, and fixes a few issues with SparkContext.checkResourcesPerTask.
For the first version of accelerator-aware scheduling (SPARK-27495), the SPIP had a condition that we could support dynamic allocation because we were going to have a strict requirement that we don't waste any resources. This meant that the number of slots each executor has could be calculated from the number of cores and task cpus, just as is done today.
Somewhere along the line of development we relaxed that and now only warn when we are wasting resources. This breaks the dynamic allocation logic if the limiting resource is no longer cores, because dynamic allocation uses the cores and task cpus to calculate the number of executors it needs. This means we will request fewer executors than we really need to run everything. We have to enforce that cores is always the limiting resource, so we should throw if it's not.
The only issue with enforcing this is on cluster managers (standalone and Mesos coarse-grained) where we don't know the executor cores up front by default: the spark.executor.cores config defaults to 1, but when the executor is started it gets all the cores of the Worker by default. So we have to add logic specifically to handle that case, and since we can't enforce the requirement there, we can only warn when dynamic allocation is enabled for those.
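As a hedged illustration of the dynamic-allocation problem described above (the object, helper names, and numbers are made up for this sketch; only the idea that executor requests are derived from cpu slots comes from the description):

```scala
// Hypothetical sketch: dynamic allocation estimates needed executors from
// cpu slots only, so if another resource (e.g. GPUs) is the real limit,
// it under-requests executors.
object LimitingResourceSketch {
  def cpuSlotsPerExecutor(executorCores: Int, taskCpus: Int): Int =
    executorCores / taskCpus

  def gpuSlotsPerExecutor(executorGpus: Int, taskGpus: Int): Int =
    executorGpus / taskGpus

  def main(args: Array[String]): Unit = {
    // 4 cores with 1 cpu per task -> 4 cpu slots per executor,
    // but only 1 GPU with 1 GPU per task -> 1 real slot per executor.
    val cpuSlots = cpuSlotsPerExecutor(4, 1)
    val gpuSlots = gpuSlotsPerExecutor(1, 1)
    val pendingTasks = 8

    // Dynamic allocation sizes the request from cpu slots...
    val requested = math.ceil(pendingTasks.toDouble / cpuSlots).toInt
    // ...but the GPU limit determines how many executors are really needed.
    val needed = math.ceil(pendingTasks.toDouble / gpuSlots).toInt

    assert(requested == 2) // under-request: only 2 executors asked for
    assert(needed == 8)    // 8 executors actually needed to run all tasks
    println("ok")
  }
}
```

Enforcing that cores is the limiting resource makes the cpu-slot estimate equal to the true slot count, so the two numbers above can never diverge.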
Why are the changes needed?
Bug in dynamic allocation if cores is not the limiting resource, and the warnings were not correct.
Does this PR introduce any user-facing change?
no
How was this patch tested?
Unit tests added, and manually tested the conditions in local mode, local-cluster mode, standalone mode, and YARN.