SPARK-1937: fix issue with task locality #892
Changes from 7 commits
@@ -54,8 +54,15 @@ private[spark] class TaskSetManager(
     clock: Clock = SystemClock)
   extends Schedulable with Logging
 {
+  // Remember when this TaskSetManager is created
+  val creationTime = clock.getTime()
   val conf = sched.sc.conf

+  // The period we wait for new executors to come up.
+  // After this period, tasks in pendingTasksWithNoPrefs will be considered as PROCESS_LOCAL
+  private val WAIT_NEW_EXEC_TIMEOUT = conf.getLong("spark.scheduler.waitNewExecutorTime", 3000L)
+  private var waitingNewExec = true
+
   /*
    * Sometimes if an executor is dead or in an otherwise invalid state, the driver
    * does not realize right away leading to repeated task failures. If enabled,
@@ -118,7 +125,7 @@ private[spark] class TaskSetManager(
   private val pendingTasksForRack = new HashMap[String, ArrayBuffer[Int]]

   // Set containing pending tasks with no locality preferences.
-  val pendingTasksWithNoPrefs = new ArrayBuffer[Int]
+  var pendingTasksWithNoPrefs = new ArrayBuffer[Int]

   // Set containing all pending tasks (also used as a stack, as above).
   val allPendingTasks = new ArrayBuffer[Int]
@@ -182,15 +189,16 @@ private[spark] class TaskSetManager(
     for (loc <- tasks(index).preferredLocations) {
       for (execId <- loc.executorId) {
         if (sched.isExecutorAlive(execId)) {
-          addTo(pendingTasksForExecutor.getOrElseUpdate(execId, new ArrayBuffer))
           hadAliveLocations = true
         }
+        addTo(pendingTasksForExecutor.getOrElseUpdate(execId, new ArrayBuffer))
       }
       if (sched.hasExecutorsAliveOnHost(loc.host)) {
-        addTo(pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer))
-        for (rack <- sched.getRackForHost(loc.host)) {
-          addTo(pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer))
-        }
         hadAliveLocations = true
       }
+      addTo(pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer))
+      for (rack <- sched.getRackForHost(loc.host)) {
+        addTo(pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer))
+        hadAliveLocations = true
+      }
     }

Review thread on the executor-branch `hadAliveLocations = true`:

Reviewer: Isn't this check redundant with the one on line 197 now?

Author: This check is used when we add a task to pending lists. If the task has any preferred location available (executor / host / rack), we won't add it to pendingTasksWithNoPrefs. Do you mean the check for executor and host is redundant?

Reviewer: Yes -- if the executor is alive (so if the if statement on line 191 evaluates to true), then there will certainly be an executor alive on the host (the if statement on line 196), and hadAliveLocations will be set to true on line 197. So this line is not needed.

Author: I see. Thanks :)

Review thread on the rack-branch `hadAliveLocations = true`:

Reviewer: I guess technically we might have no hosts in this rack, but right now our TaskScheduler doesn't track that. Maybe we should open another JIRA to track it. I can imagine this happening in really large clusters.

Author: Do you mean the TaskScheduler should provide something like "hasHostOnRack", and we have to check that before setting hadAliveLocations to true?

Reviewer: Yeah, but we can do it in another JIRA.

Author: Sure :-)
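To make the new control flow concrete, here is a hedged Python model of the revised addPendingTask bookkeeping (`FakeSched` and the dict layout are invented stand-ins, not Spark APIs): tasks now enter the executor/host/rack lists unconditionally, and `hadAliveLocations` only decides whether a task also lands in the no-prefs list.

```python
from collections import defaultdict

class FakeSched:
    """Invented stand-in for the three scheduler queries used in the hunk."""
    def __init__(self, alive_execs=(), alive_hosts=(), racks=None):
        self.alive_execs = set(alive_execs)
        self.alive_hosts = set(alive_hosts)
        self.racks = racks or {}

    def is_executor_alive(self, e): return e in self.alive_execs
    def has_executors_alive_on_host(self, h): return h in self.alive_hosts
    def get_rack_for_host(self, h): return self.racks.get(h)

def add_pending_task(index, preferred_locations, sched, lists):
    """preferred_locations: (executor_id_or_None, host) pairs."""
    had_alive_locations = False
    for exec_id, host in preferred_locations:
        if exec_id is not None:
            if sched.is_executor_alive(exec_id):
                had_alive_locations = True
            # Now unconditional, so the list survives executor churn:
            lists["for_executor"][exec_id].append(index)
        if sched.has_executors_alive_on_host(host):
            had_alive_locations = True
        lists["for_host"][host].append(index)
        rack = sched.get_rack_for_host(host)
        if rack is not None:
            lists["for_rack"][rack].append(index)
            had_alive_locations = True  # rack liveness untracked; see review note

    if not preferred_locations or not had_alive_locations:
        lists["no_prefs"].append(index)

def new_lists():
    return {"for_executor": defaultdict(list), "for_host": defaultdict(list),
            "for_rack": defaultdict(list), "no_prefs": []}
```

With `FakeSched(alive_hosts={"hostA"})`, a task preferring only a dead `hostB` still enters `for_host["hostB"]` but is also queued in `no_prefs`; a task preferring the live `hostA` is not.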
@@ -361,7 +369,8 @@ private[spark] class TaskSetManager(
     }

     // Look for no-pref tasks after rack-local tasks since they can run anywhere.
-    for (index <- findTaskFromList(execId, pendingTasksWithNoPrefs)) {
+    for (index <- findTaskFromList(execId, pendingTasksWithNoPrefs)
+      if (!waitingNewExec || tasks(index).preferredLocations.isEmpty)) {
       return Some((index, TaskLocality.PROCESS_LOCAL))
     }
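The added guard reads as a single predicate. A minimal Python rendering (the function name is invented):

```python
def can_take_from_no_prefs(waiting_new_exec, preferred_locations):
    # Mirrors: if (!waitingNewExec || tasks(index).preferredLocations.isEmpty)
    # While still waiting for executors, only tasks with genuinely empty
    # preferences may be scheduled out of pendingTasksWithNoPrefs.
    return (not waiting_new_exec) or not preferred_locations
```

So a task parked in pendingTasksWithNoPrefs only because its preferred locations were unavailable stays held back during the waiting period, while tasks with no preferences at all run immediately.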
@@ -391,6 +400,9 @@ private[spark] class TaskSetManager(
     if (allowedLocality > maxLocality) {
       allowedLocality = maxLocality // We're not allowed to search for farther-away tasks
     }
+    if (waitingNewExec && curTime - creationTime > WAIT_NEW_EXEC_TIMEOUT) {
+      waitingNewExec = false
+    }

     findTask(execId, host, allowedLocality) match {
       case Some((index, taskLocality)) => {

Review thread on this hunk:

@mridulm: Editing previous comment: Instead of restricting schedule to up to PROCESS_LOCAL, this will now relax it all the way till RACK_LOCAL - which is incorrect (myLocalityLevels might not have PROCESS_LOCAL but could have NODE_LOCAL, for example).

Author: @mridulm - Thanks for replying. In my opinion, however, relaxing the allowed locality won't change the scheduling order. NODE_LOCAL tasks (if any) still get scheduled before RACK_LOCAL ones. And if we allow RACK_LOCAL but get a NODE_LOCAL task, currentLocalityIndex will be updated so that next time we will use NODE_LOCAL as the constraint.

Author: @mridulm, I think I got your point. Restricting the allowed locality can help achieve some delay scheduling. Anyway, this change is meant to keep pendingTasksWithNoPrefs from messing up the scheduling. It's better to address this in another JIRA.

Author: Hi @mridulm, another possible change: maxLocality always starts from PROCESS_LOCAL; how about making it start from the highest level of myLocalityLevels? Do you think this makes sense?

@mridulm: Back at desktop, so can elaborate better. Simple scenario extending my earlier example: suppose there is only one task t1 left and two executors become available. We start with PROCESS_LOCAL as maxLocality - and suppose enough time had elapsed so allowedLocality == RACK_LOCAL or ANY. In this case, if resourceOffer is called on exec1 first, we get a RACK_LOCAL schedule. The reason for that if condition was exactly to prevent this. I am actually surprised I did not have any testcase to catch this ...
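The race @mridulm describes can be sketched in a few lines of Python. This is a toy model: the names, the two-host setup, and the simplified `find_task` are all invented (the real findTask consults more lists and delay-scheduling state). With allowedLocality already relaxed, whichever offer arrives first wins, so a rack-local offer can claim the last task that a slightly later node-local offer would have run NODE_LOCAL.

```python
def find_task(host, allowed, pending_for_host, pending_for_rack, rack_of):
    # Most-local list first, falling through only as far as `allowed` permits.
    if pending_for_host.get(host):
        return pending_for_host[host].pop(), "NODE_LOCAL"
    if allowed in ("RACK_LOCAL", "ANY"):
        rack = rack_of.get(host)
        if pending_for_rack.get(rack):
            return pending_for_rack[rack].pop(), "RACK_LOCAL"
    return None

# One task (index 1) prefers host2; host1 and host2 share a rack.
rack_of = {"host1": "rackA", "host2": "rackA"}
pending_for_host = {"host2": [1]}
pending_for_rack = {"rackA": [1]}

# resourceOffer happens to hit host1 first while allowedLocality is relaxed:
print(find_task("host1", "RACK_LOCAL", pending_for_host, pending_for_rack, rack_of))
# -> (1, 'RACK_LOCAL'), though an offer on host2 would have been NODE_LOCAL
```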
@@ -738,4 +750,20 @@ private[spark] class TaskSetManager(
     logDebug("Valid locality levels for " + taskSet + ": " + levels.mkString(", "))
     levels.toArray
   }
+
+  // Re-compute pendingTasksWithNoPrefs since new preferred locations may become available
+  def executorAdded() {
+    def newLocAvail(index: Int): Boolean = {
+      for (loc <- tasks(index).preferredLocations) {
+        if (sched.hasExecutorsAliveOnHost(loc.host) ||
+          (loc.executorId.isDefined && sched.isExecutorAlive(loc.executorId.get)) ||
+          sched.getRackForHost(loc.host).isDefined) {
+          return true
+        }
+      }
+      false
+    }
+    logInfo("Re-computing pending task lists.")
+    pendingTasksWithNoPrefs = pendingTasksWithNoPrefs.filter(!newLocAvail(_))
+  }
 }

Review comments on newLocAvail:

Reviewer (on the isExecutorAlive clause): Similar to above, I think this line is just a more specific version of the previous one -- so is redundant.

Reviewer (on the getRackForHost clause): Same here.
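The recomputation is a pure filter over the no-prefs list; a hedged Python equivalent (the `Sched` stub is invented):

```python
def recompute_no_prefs(pending_no_prefs, task_prefs, sched):
    """Drop tasks from the no-prefs list once any preferred location
    (host, executor, or rack) has become available."""
    def new_loc_avail(index):
        for exec_id, host in task_prefs[index]:
            if (sched.has_executors_alive_on_host(host)
                    or (exec_id is not None and sched.is_executor_alive(exec_id))
                    or sched.get_rack_for_host(host) is not None):
                return True
        return False
    return [i for i in pending_no_prefs if not new_loc_avail(i)]

class Sched:
    """Invented stub: only hostA has come alive."""
    def has_executors_alive_on_host(self, h): return h == "hostA"
    def is_executor_alive(self, e): return False
    def get_rack_for_host(self, h): return None

task_prefs = {0: [(None, "hostA")], 1: [(None, "hostB")]}
print(recompute_no_prefs([0, 1], task_prefs, Sched()))  # -> [1]
```

Note that the dropped tasks are already present in the executor/host/rack lists, since addPendingTask now populates those unconditionally; only their no-prefs membership changes.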
General discussion:

Reviewer: Is this something the application can just do: if it wants to wait 3 seconds before scheduling anything on non-local executors, just sleep for 3 seconds before trying to launch any jobs? I'm wary of adding more config options to the scheduler.

Author: This waiting period is only intended for pendingTasksWithNoPrefs. Suppose pendingTasksWithNoPrefs contains tasks whose preference is unavailable. Within this waiting period, we want to try pendingTasksForExecutor, pendingTasksForHost and pendingTasksForRack first, because tasks in these lists do have some locality. And when an executor is added, we remove tasks that newly have locality from pendingTasksWithNoPrefs. Then, after the waiting period, we believe no executor will come for tasks still remaining in pendingTasksWithNoPrefs, so they can be scheduled as PROCESS_LOCAL. Note that tasks in pendingTasksForHost can still get scheduled even within the period; we're just holding back on pendingTasksWithNoPrefs. I think that's better than holding back the whole application and scheduling nothing.

Reviewer: It doesn't make sense to put this here because it will apply to every TaskSet, no matter how late into the application it was submitted, so you'll get a 3-second latency on every TaskSet that is missing one of its preferred nodes. Can we not add this as part of this patch, and simply make the change to put tasks in the node- and rack-local lists even if no nodes are available in those right now? Then later we can update the code that calls resourceOffer to treat tasks that have preferred locations but are missing executors for them specially.