
[SPARK-29149][YARN] Update YARN cluster manager For Stage Level Scheduling #27583

Closed
wants to merge 42 commits into from

Conversation

@tgravescs (Contributor) commented Feb 14, 2020

What changes were proposed in this pull request?

YARN-side changes for stage level scheduling. The previous PR, for the dynamic allocation changes, was #27313.

Modified the data structures to store things on a per-ResourceProfile basis.
I tried to keep the code changes to a minimum; the main request loop now goes through each ResourceProfile, and the logic inside for each one stays very close to what it was.
On submission we now have to give each ResourceProfile a separate YARN Priority, because YARN doesn't support asking for containers with different resources at the same Priority. We just use the profile id as the priority level.
Using a different Priority per profile actually makes things easier when the containers come back, since we can match them to the ResourceProfile they were requested for.
The expectation is that YARN will only give you a container with the resource amounts you requested or more; it should never give you a container that doesn't satisfy your resource requests.

If you want to see the full feature changes, you can look at https://github.com/apache/spark/pull/27053/files for reference.
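To make the Priority-per-profile idea concrete, here is a minimal sketch (not the code in this PR; the helper name and container sizes are made up for illustration) of requesting containers where the ResourceProfile id doubles as the YARN Priority, using the standard AMRMClient API:

```scala
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// Hypothetical helper: build one container request for a given ResourceProfile.
// The profile id is reused as the YARN Priority so containers with different
// resource sizes never share a priority level.
def containerRequestFor(rpId: Int, memoryMB: Int, vCores: Int): ContainerRequest = {
  val capability = Resource.newInstance(memoryMB, vCores) // memory in MB, vcores
  val priority = Priority.newInstance(rpId)               // profile id as priority
  new ContainerRequest(capability, null, null, priority)  // no node/rack constraints
}

// Example usage (sizes are illustrative):
//   amClient.addContainerRequest(containerRequestFor(rpId = 0, memoryMB = 4096, vCores = 2))
//   amClient.addContainerRequest(containerRequestFor(rpId = 1, memoryMB = 8192, vCores = 4))
```

When a container comes back, its Priority identifies which ResourceProfile it was requested for, which is the matching simplification described above.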

Why are the changes needed?

To support stage level scheduling on YARN.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Tested manually on a YARN cluster, plus unit tests.

@tgravescs (Contributor, Author)

> General question about priority, I did not find much here [1].
> How is the value of priority interpreted? Is it simply to "tag" requests?
> Or are higher priority requests 'prioritized' over lower priority requests from an application (to a queue)?
>
> How does it compare with [2]? Will that be cleaner (using tags)?
>
> [1] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-api/apidocs/org/apache/hadoop/yarn/api/records/Priority.html
>
> [2] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-api/apidocs/org/apache/hadoop/yarn/api/records/SchedulingRequest.html

I don't think the Priority is documented very well at all. We ran into this issue with TEZ, where you can't have different container sizes within the same Priority. A priority is what it sounds like: higher priorities get allocated first. For Spark I don't think this matters, since we finish a stage before proceeding to the next; if we had a slow-start feature like MapReduce, then it would. It does mean that if you have 2 stages with different ResourceProfiles running at the same time, one of those stages' containers would be prioritized over the other, but again I don't think that is an issue. If you can think of a case where it would be, let me know. There is actually a way to get around using different priorities, but you have to turn on an optional feature in YARN, like tags. Since that feature is optional I didn't want to rely on it, and I didn't see any issues with using the Priority.

I haven't looked at the SchedulingRequest in detail, but it's more about placement and gang scheduling - https://issues.apache.org/jira/browse/YARN-6592. That is definitely interesting, but I would prefer to do it separately from this, unless you see an issue with the Priority? I can look at it more to see if it would get around having to use Priority, but the SchedulingRequest itself also has a priority, though with separate resource sizing. I would almost bet it has the same restriction, but maybe it uses the tags to get around this.

@tgravescs (Contributor, Author)

I guess the one case I can think of is if you are running spark in a job server scenario the priorities could favor certain jobs more if they used ResourceProfiles vs using the default profile. I think we could document this for now.

@tgravescs (Contributor, Author)

note that YARN-6592 only went into hadoop 3.1.0 so it wouldn't work for older versions, which might go back to your version question.

@mridulm (Contributor) commented Feb 25, 2020

I was not advocating for SchedulingRequest, just wanted to understand whether the requirement matched what was supported by SchedulingRequest (though it was probably designed for something else, conceptually it seemed to apply based on my cursory read).

Given the lack of availability in earlier Hadoop versions, we can punt on using SchedulingRequest; it is something we can look at in the future when the minimum Hadoop version changes.

About priority: given that it has scheduling semantics associated with it, I was not sure if overloading it would be a problem. I had not thought about the job server use case, but that is an excellent point!
Given this, do we want to change the priority of the default profile to a very high value? Otherwise all resource profiles will have a higher priority than the default.

@tgravescs (Contributor, Author)

@dongjoon-hyun Sorry to bug you again, similar question here: how do I rerun the checks? I clicked on Details but I don't have any "rerun" button. I'm logged in with my GitHub Apache account. Do I need permissions, or am I logged in wrong?

@tgravescs (Contributor, Author)

Sure, we can make the default profile the highest priority. I put a note in the documentation JIRA as well to make sure we document the behavior.

@tgravescs (Contributor, Author) commented Feb 25, 2020

Sorry, I forgot: the default profile is actually already the highest priority. In YARN lower numbers are higher priority, and the default profile has id 0. So my example above is backwards: a job server would favor the default profile over the custom ones, but that seems fine as default behavior and we can document it for now.
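To make the ordering concrete, a small sketch (profile id 5 below is just a hypothetical custom profile id):

```scala
import org.apache.hadoop.yarn.api.records.Priority

// Per the discussion above: in YARN a lower priority value is scheduled first,
// and the default ResourceProfile has id 0, so it already maps to the highest priority.
val defaultProfilePriority = Priority.newInstance(0) // default profile
val customProfilePriority  = Priority.newInstance(5) // hypothetical custom profile id
// Requests at priority 0 (default profile) are favored over those at priority 5.
```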

@tgravescs (Contributor, Author)

Updated the locking to use synchronized everywhere and removed the concurrent structures, since most of them were only being used by the metrics system and things have changed since they were originally added. I also moved some things around, trying to put them in sections that are easier to read; if that is too confusing I can move things back.

@tgravescs (Contributor, Author)

Note that most accesses are synchronized in allocateResources; the other places are separately synchronized and called either from ApplicationMasterSource or AMEndpoint.

@SparkQA commented Feb 25, 2020

Test build #118928 has finished for PR 27583 at commit f9c1a05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm (Contributor) left a comment

Looks much cleaner. Can you please make sure that all variables are appropriately locked before access/update? It looks like build-time validation is not enabled or not catching those. Please let me know if I am missing something!


def getNumReleasedContainers: Int = releasedContainers.size()
def getNumExecutorsStarting: Int = {
@mridulm (Contributor)

synchronized on this? I was expecting static analysis via @GuardedBy to catch this at build time; apparently we don't have that validation.
Can you also check the use of some of the other variables as well? targetNumExecutorsPerResourceProfileId, etc. also seems to have similar issues.

@tgravescs (Contributor, Author) Feb 27, 2020

I went through all the variables; they are all protected via a higher-up call. We can add in more synchronized blocks if we want to nest them (re-entrant), just to make it more readable?
For instance, this one is only called from allocateResources, which is synchronized, and that is the case with most of these.

@tgravescs (Contributor, Author) Feb 27, 2020

I went ahead and added a synchronized call in each function where those variables are touched. I believe re-entrant synchronized is cheap, so it shouldn't add much overhead, and it helps with readability and protects against future breakages. If this is not what you intended, let me know.
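As a rough illustration of the pattern discussed here (re-entrant synchronized on each method plus @GuardedBy as documentation); the class, field, and method names below are simplified stand-ins, not the PR's actual members:

```scala
import javax.annotation.concurrent.GuardedBy
import scala.collection.mutable

class AllocatorLockingSketch {
  // @GuardedBy only documents the lock; it is not verified at build time,
  // which is why the static analysis mentioned above did not catch anything.
  @GuardedBy("this")
  private val targetPerResourceProfileId = new mutable.HashMap[Int, Int]()

  // Outer entry point holds the lock for the whole allocation pass.
  def allocateResources(): Unit = synchronized {
    updateTarget(rpId = 0, target = 2)
  }

  // Also synchronized: JVM monitors are re-entrant, so calling this from
  // allocateResources is cheap, and it stays safe if a future caller
  // invokes it without already holding the lock.
  private def updateTarget(rpId: Int, target: Int): Unit = synchronized {
    targetPerResourceProfileId(rpId) = target
  }

  def getTarget(rpId: Int): Int = synchronized {
    targetPerResourceProfileId.getOrElse(rpId, 0)
  }
}
```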

(Truncated commit message:) …make things more readable. This doesn't change the actual locking because all of these places were already synchronized by the calling functions.
@SparkQA commented Feb 27, 2020

Test build #119042 has finished for PR 27583 at commit 9e79f1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor, Author)

I introduced a bug; fixing it now.

@tgravescs (Contributor, Author)

@dongjoon-hyun can you kick off the checks again? Also, how do I get permissions? I don't see any rerun buttons.

@SparkQA commented Feb 28, 2020

Test build #119056 has finished for PR 27583 at commit 14b6251.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 28, 2020

Test build #119065 has finished for PR 27583 at commit bd3509c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -336,7 +338,7 @@ private[yarn] class YarnAllocator(
     val resource = Resource.newInstance(totalMem, cores)
     ResourceRequestHelper.setResourceRequests(customResources.toMap, resource)
     logDebug(s"Created resource capability: $resource")
-    rpIdToYarnResource(rp.id) = resource
+    rpIdToYarnResource.putIfAbsent(rp.id, resource)
@mridulm (Contributor)

Can there be a race such that rp.id is already present in the map?
And if there is, should we be overwriting it here?

@tgravescs (Contributor, Author)

No, not at the moment anyway; this function is synchronized and nowhere else adds it, so only one can run at a time. I put in putIfAbsent but it doesn't really matter: ResourceProfile ids are unique and ResourceProfiles are immutable. Even if this code ran in multiple threads at the same time, the result should be exactly the same, so we would put the same thing in twice and it wouldn't matter which one got inserted first.
Strictly speaking it doesn't need to be a concurrent hashmap due to the locking of the calling functions, but to be more strict about it and to help with future changes I made it one.
If you think it's clearer one way or the other, let me know and I can modify it.

@mridulm (Contributor)

We changed rpIdToYarnResource from mutable.HashMap to ConcurrentHashMap in commit e89a8b5 above ... I wanted to make sure this was only for concurrent reads, and not for writes which might insert keys here in parallel.
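A minimal sketch of the idiom under discussion, assuming a java.util.concurrent.ConcurrentHashMap keyed by ResourceProfile id (names simplified, not the PR's exact code):

```scala
import java.util.concurrent.ConcurrentHashMap
import org.apache.hadoop.yarn.api.records.Resource

// Keyed by ResourceProfile id; values are the YARN Resource built for that profile.
val rpIdToYarnResourceSketch = new ConcurrentHashMap[Int, Resource]()

def registerProfileResource(rpId: Int, resource: Resource): Unit = {
  // putIfAbsent keeps the existing value if another thread inserted the key first.
  // Because profile ids are unique and the Resource built for a given profile is
  // always equivalent, losing such a race would be harmless either way.
  rpIdToYarnResourceSketch.putIfAbsent(rpId, resource)
}
```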

@mridulm (Contributor) commented Feb 28, 2020

Looks good to me, just had a minor query.

@tgravescs (Contributor, Author)

test this please

@SparkQA commented Feb 28, 2020

Test build #119093 has finished for PR 27583 at commit bd3509c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in 0e2ca11 on Feb 28, 2020
@tgravescs (Contributor, Author)

Thanks @mridulm, I appreciate the reviews. Merged this to master.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27583 from tgravescs/SPARK-29149.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>