
[SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. #27207

Closed
wants to merge 15 commits

Conversation

bmarcott
Contributor

@bmarcott bmarcott commented Jan 15, 2020

What changes were proposed in this pull request?

Delay scheduling (see http://elmeleegy.com/khaled/papers/delay_scheduling.pdf) is an optimization that sacrifices fairness for data locality in order to improve cluster and workload throughput.

One useful definition of "delay" here is how much time has passed since the TaskSet was using its fair share of resources.

However it is impractical to calculate this delay, as it would require running simulations assuming no delay scheduling. Tasks would be run in different orders with different run times.

Currently the heuristic used to estimate this delay is the time since a task was last launched for a TaskSet. The problem is that it essentially does not account for resource utilization, potentially leaving the cluster heavily underutilized.

This PR modifies the heuristic in an attempt to move closer to the useful definition of delay above.
The newly proposed delay is the time since a TaskSet last launched a task and did not reject any resources due to delay scheduling when offered its "fair share".

See the last comments of #26696 for more discussion.
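As an illustration only (not the code in this patch), here is a minimal, self-contained sketch of the bookkeeping this heuristic implies, with simplified names:

```scala
import scala.collection.mutable

// Toy model of the heuristic, not the actual Spark classes: the locality wait
// "clock" for a TaskSet restarts only when a task is launched while no resources
// have been rejected due to delay scheduling since the last full resource offer.
object DelayTimerSketch {
  // keyed by stage id; true if there have been no delay-scheduling rejects
  // since the last "full" (all-free-resources) offer
  private val noDelayScheduleRejects = mutable.HashMap.empty[Int, Boolean]
  private val timerStartMillis = mutable.HashMap.empty[Int, Long]

  def recordOffer(
      stageId: Int,
      isAllFreeResources: Boolean,
      launchedAnyTask: Boolean,
      hadDelayScheduleReject: Boolean): Unit = {
    if (isAllFreeResources) {
      // a full offer opens a fresh observation window
      noDelayScheduleRejects(stageId) = !hadDelayScheduleReject
    } else if (hadDelayScheduleReject) {
      noDelayScheduleRejects(stageId) = false
    }
    if (launchedAnyTask && noDelayScheduleRejects.getOrElse(stageId, true)) {
      // the "delay" compared against spark.locality.wait is measured from here
      timerStartMillis(stageId) = System.currentTimeMillis()
    }
  }
}
```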

Why are the changes needed?

The cluster can become heavily underutilized, as described in SPARK-18886.

How was this patch tested?

TaskSchedulerImplSuite

@cloud-fan
@tgravescs
@squito

@cloud-fan
Contributor

ok to test

// keyed by task set stage id
// value is true if there have been no resources rejected due to delay scheduling
// since the last "full" resource offer
private val noDelayScheduleRejects = new mutable.HashMap[Int, Boolean]()
Contributor

can we ask each TSM to track it? We can pass the isAllFreeResources parameter to TSM when calling resourceOffer

Contributor Author

What do you see as the pros/cons of keeping this map here vs. putting it in the TSM?
The TaskSchedulerImpl is the only one setting/updating the variable, based on knowledge only it has (whether there was a delay-schedule reject).

@SparkQA

SparkQA commented Jan 15, 2020

Test build #116771 has finished for PR 27207 at commit d52de6b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

So please update the description with information from the other PR. The description should basically give a high-level design of this approach, with enough detail that someone can read the code and confirm it is doing what you are proposing: what cases it covers and what cases you know it doesn't.

So the case you mentioned it doesn't cover:

The case I am referring to is: imagine you have 2 resources and an "all resource offer" is scheduled every second. When TSM1 is submitted, it'll also get an "all resource offer", and assume it rejects both, causing a preexisting TSM2 to utilize them. Assume those 2 tasks finish, and the freed resources are offered one by one to TSM1, which accepts both, all within 1 second (before any "all resource offer"). This should reset the timer, but it won't in the implementation.

So the issue here is that we aren't really tracking when all resources are used; we are proxying that.
To really calculate the free slots, though, is pretty complex when you take blacklisting into account (we have both application- and taskset-level blacklisting).
I'm kind of thinking at this point the above case is ok: it favors not delaying, and it will be fixed up on the next "all resource offer".

One thing I don't think I like is that if you are fully scheduled, we keep trying to schedule "all resources", but if there are no free resources, then we continue to reset the timer. This means it takes a long time to fall back in the case where you have multiple tasksets, the first taskset rejects an offer, the second one takes it, and the tasks finish at a rate such that you get an all-resources offer in between the task finishes. In this scenario the first taskset can get starved. We would perhaps need to track this separately.

I need to take another walk through all the scenarios as well.

} while (launchedTaskAtCurrentMaxLocality)
}

if (!hadAnyDelaySchedulingReject) {
Contributor

nit - I think the variable names are a bit confusing because you have noDelay... and then hadAnyDelay...; I think it would be easier to read if they were consistent.

@bmarcott
Contributor Author

bmarcott commented Jan 16, 2020

@tgravescs
Thanks for the comments.

so please update the description with information from the other PR

Which one of my snippets from the previous PR was most clear to you? I can put that one in the description.

One thing I don't think I like....

Really good point on this scenario. It's bad even in a variant of the scenario you described where the all-resource offer has only 1 executor and the first taskset accepts it.
Let me know if you have any good ideas here 😉

Is it OK if I do follow-up changes, such as variable names, unit tests, and other backend schedulers, only once we iron out the problematic scenarios?

@tgravescs
Contributor

Yes, it's fine not to do any minor changes until we decide on the design.
I'm not so worried about the 1-executor case if the first taskset takes it, because that one should be higher priority and that should work as expected: when there isn't anywhere to put the tasks, have them wait a bit to try to get locality. It's the case with the fair scheduler, where the taskset that isn't the highest priority can get starved, that I'm more concerned with.

I wonder if we can add a separate tracking/check in TaskSchedulerImpl that sees whether a taskset keeps rejecting the non-all-resource offers but then resets on the all-resource offers; after some number of those, we stop resetting it. Thoughts on that?

@bmarcott
Contributor Author

bmarcott commented Jan 22, 2020

That may be reasonable to do, but I'd like to avoid adding more tracking/accounting if possible; I already don't like the boolean map I added.
What do you think about adding back the old condition of "must have launched a task"?
So the new condition for reset would be: must have no rejects and launch a task on an all-resource offer.
There is still starvation, but it should be no more starvation than with the current master code, I believe.

@cloud-fan
Contributor

I'm fine with a non-perfect solution as long as it's no worse than the existing one and fixes the locality wait problem.

@tgravescs
Contributor

So the conditions would be like:

  1. offer 1 resource that was rejected - no timer reset
  2. offer all resources with no rejects and launched a task - timer is reset
  3. offer 1 resource, no reject - timer is reset
  4. offer 1 resource that was rejected - no timer reset
  5. offer 1 resource, no reject - no timer reset, because the previous offer was rejected

I think that makes sense, it definitely addresses the case I was talking about with 2 task sets.

I was also wondering about leaving the old logic in there but configured off by default. While I don't like it, it would be safer in case we haven't thought of some corner case, and the user could turn it back on if necessary. Thoughts?

@bmarcott
Contributor Author

@cloud-fan
thanks for the input

@tgravescs
yep, that sequence and explanation match my understanding

I was also wondering about leaving the old logic in there but configured off by default. While I don't like it, it would be safer if we haven't thought of some corner case and user could turn it back on if necessary, thoughts?

I'm not opposed to adding a new config that disables the new way and enables the old way, but is this standard practice in Spark for riskier changes?

I would feel more comfortable knowing there was a fallback too ;)

@tgravescs
Contributor

It's a case-by-case basis. While we don't like adding more configs and code to maintain, for a somewhat risky change with all the corner cases and different ways to run, I think it's warranted. Actually, what we can do is leave the config undocumented, default to the new algorithm, and put a deprecation message by it so we can remove it in, say, 3.1 if we want.

@tgravescs
Contributor

@bmarcott will you be able to update this?

@bmarcott
Contributor Author

yea, I'll try to take a look this weekend. Thanks for checking in.

@bmarcott
Contributor Author

bmarcott commented Feb 2, 2020

@tgravescs
I updated the PR with the new config to switch to the legacy behavior, as well as some doc/variable renaming.
I also updated the description of this PR.

After reading more through SchedulerBackend impls:

  1. I decided to assume resource offers are full (all free resources) by default, to match previous behavior for any SchedulerBackends not touched in this PR.
  2. The only backend I found that doesn't use CoarseGrainedSchedulerBackend is MesosFineGrainedSchedulerBackend; I decided not to touch it since it is deprecated, and from its design it looks like "all free resources" can't be tracked.

Test updates are still needed, but wanted to get your feedback first.
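For illustration, a hedged sketch of what that default means for backends; the types here are toy stand-ins rather than the real Spark classes, and only updated backends pass false for their single-executor, event-driven offers:

```scala
// Toy stand-ins for the real Spark types, for illustration only.
case class WorkerOffer(executorId: String, host: String, cores: Int)

trait TaskSchedulerLike {
  // defaults to a "full" offer, so backends not touched by the PR keep their old behavior
  def resourceOffers(
      offers: IndexedSeq[WorkerOffer],
      isAllFreeResources: Boolean = true): Seq[Seq[String]]
}

class BackendSketch(scheduler: TaskSchedulerLike, executors: Map[String, WorkerOffer]) {
  // periodic path: offer every executor's free resources at once
  def makeOffersForAll(): Unit =
    scheduler.resourceOffers(executors.values.toIndexedSeq, isAllFreeResources = true)

  // event-driven path: a single executor just freed resources (e.g. a task finished)
  def makeOffersFor(executorId: String): Unit =
    executors.get(executorId).foreach { offer =>
      scheduler.resourceOffers(IndexedSeq(offer), isAllFreeResources = false)
    }
}
```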

@SparkQA

SparkQA commented Feb 2, 2020

Test build #117728 has finished for PR 27207 at commit e0ac12e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 2, 2020

Test build #117727 has finished for PR 27207 at commit 2924c5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@Ngone51 Ngone51 left a comment


This makes me feel that we're trying to put cluster utilization before data locality while assigning resources. Right?

@bmarcott
Contributor Author

bmarcott commented Feb 4, 2020

@Ngone51 thanks for taking a look.
I wouldn't say it puts one thing before another. The idea is to move closer to what a reasonable definition of scheduling delay (locality.wait time) is: how long you want to sacrifice "fairness" (using your fair share of resources) in favor of data locality.

I'll do the touch ups you mentioned after there is enough support on the design.

@cloud-fan
Contributor

The newly proposed delay is the time since a TaskSet last launched a task and did not reject any resources due to delay scheduling when offered its "fair share".

Thanks for updating the PR description! The proposal makes sense to me.

Is it possible to centralize the delay scheduling code? Now it's in both TaskSchedulerImpl and TaskSetManager, which makes it a bit hard to understand as you need to think about the interactions between them.

@tgravescs
Contributor

I'll take a closer look later. I think it's fine not to support the Mesos fine-grained scheduler.

Contributor Author

@bmarcott bmarcott left a comment


I appreciate everyone's feedback. I've updated some nits locally and will follow up on tests later.

@@ -469,8 +486,10 @@ private[spark] class TaskSetManager(
extraResources,
serializedTask)
}
(taskDescription,
taskDescription.isEmpty && maxLocality == TaskLocality.ANY && pendingTasks.all.nonEmpty)
Contributor Author

👍
What are your thoughts on this:
Instead of making assumptions about what taskDescription.isEmpty means, maybe it'd be better to pass maxLocality into dequeueTask and then change its logic near the bottom to be something like:

    if (TaskLocality.isAllowed(allowedLocality, TaskLocality.ANY)) {
      dequeue(pendingTaskSetToUse.all).foreach { index =>
        return Some((index, TaskLocality.ANY, speculative))
      }
    } else if (maxLocality == TaskLocality.ANY && pendingTasks.all.nonEmpty) {
      hasReject = true
    }

@bmarcott
Contributor Author

bmarcott commented Feb 10, 2020

@cloud-fan

Is it possible to centralize the delay scheduling code? Now it's in both TaskSchedulerImpl and TaskSetManager, which makes it a bit hard to understand as you need to think about the interactions between them.

I am not sure of a good way to centralize this because:

  1. The TSM is called multiple times with various offers and we need to keep track of what happened across those calls
  2. TSM today doesn't differentiate whether it didn't launch a task due to blacklisting or due to delay scheduling, hence the new boolean returned (see the sketch below).
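As a toy sketch of point 2 (simplified stand-in types, not the real TaskSetManager signature), the idea is that the extra boolean flags a reject that happened purely because of delay scheduling:

```scala
// Illustration only: alongside the (optional) launched task, the TSM now reports
// whether this offer was rejected specifically because of delay scheduling, as
// opposed to blacklisting or simply having no pending work.
object ResourceOfferSketch {
  object TaskLocality extends Enumeration {
    val PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY = Value
  }
  final case class TaskDescription(taskId: Long)

  def resourceOffer(
      dequeued: Option[TaskDescription],
      maxLocality: TaskLocality.Value,
      hasPendingTasks: Boolean): (Option[TaskDescription], Boolean) = {
    // "delay scheduling reject": nothing was launched even at the widest allowed
    // locality level (ANY) although work is still pending
    val delayScheduleReject =
      dequeued.isEmpty && maxLocality == TaskLocality.ANY && hasPendingTasks
    (dequeued, delayScheduleReject)
  }
}
```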

@SparkQA

SparkQA commented Feb 10, 2020

Test build #118111 has finished for PR 27207 at commit 61671b3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 10, 2020

Test build #118113 has finished for PR 27207 at commit f341ebb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2020

Test build #118214 has finished for PR 27207 at commit e39680b.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2020

Test build #118223 has finished for PR 27207 at commit 894bebb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 3, 2020

Test build #120771 has finished for PR 27207 at commit 24c8ad9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

@bmarcott I accepted your LinkedIn request, but we can discuss here as well. What manual testing have you done? I was hoping to do some myself but haven't had time yet. We can continue to test after it's merged in.

@bmarcott
Contributor Author

bmarcott commented Apr 4, 2020

I have not tested this manually yet, nor have I played around with using multiple executors locally before.
Maybe I can just use spark standalone: https://spark.apache.org/docs/latest/spark-standalone.html

@bmarcott
Contributor Author

bmarcott commented Apr 6, 2020

Running into problems running standalone; posted a question on Stack Overflow.

...updated...
I was using the wrong port. Answered my own question.

@tgravescs
Contributor

@bmarcott did you get it to work on standalone cluster?

@bmarcott
Contributor Author

bmarcott commented Apr 8, 2020

@tgravescs I did set up everything and can run jobs. Now I am trying to create a use case where the tasks prefer a particular executor. Do you have any ideas on a way to easily repro this issue?

@cloud-fan
Contributor

Can you create a custom RDD which generates random data and set the preferred location to one executor?
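A hedged sketch of that suggestion (illustrative names; it assumes the "executor_&lt;host&gt;_&lt;executorId&gt;" string form that Spark's TaskLocation parsing treats as an executor-level preference, while a bare hostname would give only node-level preference):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Every partition claims the same preferred location, so delay scheduling will try to
// pack all tasks onto one executor unless the locality wait expires.
case class SkewedPartition(index: Int) extends Partition

class SkewedLocalityRDD(sc: SparkContext, numParts: Int, preferredLocation: String)
  extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numParts)(SkewedPartition(_))

  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(preferredLocation) // e.g. "executor_myhost_0" to prefer executor 0 on host "myhost"

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    Thread.sleep(1000) // mimic the 1-second tasks used in the manual test below
    Iterator.single(split.index)
  }
}

// usage sketch: new SkewedLocalityRDD(sc, 1000, "executor_myhost_0").count()
```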

@bmarcott
Contributor Author

bmarcott commented Apr 9, 2020

Thanks @cloud-fan
That is the approach I was going to start with.
In the wild/production I hit this issue with a join, but I forget the exact reason/context.

@bmarcott
Contributor Author

bmarcott commented Apr 9, 2020

First manual test looks good. I ran a simple app which creates 1000 partitions, all preferring executor 0.
Each task sleeps 1 second. Here is the app's code.
The new code utilized all executors, whereas with the legacy flag enabled, only PROCESS_LOCAL tasks were run, making it much slower.

Run with the new code:
../../projects/apache/spark/bin/spark-submit --class "TestLocalityWait" --master spark://localhost:7077 --conf spark.executor.instances=4 --conf spark.executor.cores=2 target/scala-2.12/simple-project_2.12-1.0.jar

Processed 326 partitions in 35 seconds:
[screenshot]

Many tasks run at ANY locality level:
[screenshot]


Run with legacy flag set to true:
../../projects/apache/spark/bin/spark-submit --class "TestLocalityWait" --master spark://localhost:7077 --conf spark.executor.instances=4 --conf spark.executor.cores=2 --conf spark.locality.wait.legacyResetOnTaskLaunch=true target/scala-2.12/simple-project_2.12-1.0.jar

Processed 146 partitions in 1.3 min:
image

All tasks run at PROCESS_LOCAL locality level
[screenshot]

@cloud-fan
Contributor

cloud-fan commented Apr 9, 2020

Thanks for the manual testing! Great job!

Merging to master!

@cloud-fan cloud-fan closed this in 8b48629 Apr 9, 2020
@cloud-fan
Contributor

It conflicts with 3.0, can you send a new PR for 3.0?

@dongjoon-hyun
Member

dongjoon-hyun commented Apr 10, 2020

Hi, all.
The last test run was 7 days ago. This causes a UT failure on all master Jenkins jobs, in a test which was added by another PR. I made a follow-up to recover the master branch.

@cloud-fan
Contributor

thanks for fixing!

@bmarcott
Contributor Author

bmarcott commented Apr 10, 2020

@cloud-fan I thought 3.0 was already branched and frozen? I will open a PR for it if I am given the go-ahead.
Thanks for all the review and feedback along the way (@tgravescs as well!)

I am doing further manual testing and finding that either I'm testing wrong, or locality wait isn't being respected even when I set spark.locality.wait higher. Looking into it... (there may be a several-day delay)
I believe I may need to add a special case for when there is an all-resource offer with 0 offers (there will be no launched task, so no reset).

@cloud-fan
Contributor

This is a high-value perf fix, but in hindsight it seems too risky for 3.0. @tgravescs are you OK with having this fix in master only?

@tgravescs
Contributor

yeah at this point I'm fine with leaving it in master only.

@dongjoon-hyun
Member

Hi, @cloud-fan. If the decision is final, could you resolve SPARK-18886 with the fixed version, 3.1.0? The JIRA issue is still open. Thanks.

@cloud-fan
Contributor

done

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…ilization due to delay scheduling

Closes apache#27207 from bmarcott/nmarcott-fulfill-slots-2.

Authored-by: Nicholas Marcott <481161+bmarcott@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Apr 22, 2020
… no task was launched

### What changes were proposed in this pull request?
Remove the requirement to launch a task in order to reset the locality wait timer.

### Why are the changes needed?
Recently #27207 was merged, but contained a bug which leads to undesirable behavior.

The crux of the issue is that single resource offers couldn't reset the timer if there had been a previous reject followed by an allResourceOffer with no available resources.
This led to a problem where, once the locality level reached ANY, single resource offers were all accepted, leaving allResourceOffers with no resources to utilize (hence no task being launched on an all-resource offer -> no timer reset). The task set manager would be stuck at the ANY locality level.

Noting down here the downsides of using the reset conditions below, in case we want to follow up.
As this is quite complex, I could easily be missing something, so please comment/respond if you have more bad-behavior scenarios or find something wrong here:
The format is:

> **Reset condition**
>  - the unwanted side effect
>      - the cause/use case

Below references to locality increase/decrease mean:
```
PROCESS_LOCAL, NODE_LOCAL ... .. ANY
    ------ locality decrease --->
   <----- locality increase -----
```

**Task launch:**
- locality decrease:
   - Blacklisting, FAIR/FIFO scheduling, or task resource requirements can minimize tasks launched
 - locality increase:
   - single task launch decreases locality despite many tasks remaining

**No delay schedule reject since last allFreeResource offer**
- locality decrease:
   - locality wait less than the allFreeResource offer frequency, which occurs at least once per second
- locality increase:
   - single resource (or none) not rejected despite many tasks remaining (other lower priority tasks utilizing resources)

**Current impl - No delay schedule reject since last (allFreeResource offer + task launch)**
- locality decrease:
  - all from above
- locality increase:
   - single resource accepted and task launched despite many tasks remaining

The current impl is an improvement on the legacy one (task launch) in that the unintended locality decrease case is similar, and the unintended locality increase case only occurs when the cluster is fully utilized.

For the locality increase cases, perhaps a config which specifies a certain % of tasks in a taskset to finish before resetting locality levels would be helpful.

**If** that was considered a good approach then perhaps removing the task launch as a requirement would eliminate most of the downsides listed above.
Lemme know if you have more ideas for eliminating locality increase downside of **No delay schedule reject since last allFreeResource offer**

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
TaskSchedulerImplSuite

Also manually tested similar to how I tested in #27207 using [this simple app](https://github.com/bmarcott/spark-test-apps/blob/master/src/main/scala/TestLocalityWait.scala).

With the new changes, given locality wait of 10s the behavior is generally:
10 seconds of locality being respected, followed by a single full utilization of resources using ANY locality level, followed by 10 seconds of locality being respected, and so on

If the legacy flag is enabled (spark.locality.wait.legacyResetOnTaskLaunch=true), the behavior is only scheduling PROCESS_LOCAL tasks (only utilizing a single executor)

cloud-fan
tgravescs

Closes #28188 from bmarcott/nmarcott-locality-fix.

Authored-by: Nicholas Marcott <481161+bmarcott@users.noreply.github.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
}
}

/**
* Called by cluster manager to offer resources on slaves. We respond by asking our active task
* sets for tasks in order of priority. We fill each node with tasks in a round-robin manner so
* that tasks are balanced across the cluster.
Member

Can we update the description of this function and explain the parameter "isAllFreeResources"?

Contributor Author

Yes, I can add something like: if true, then the parameter offers contains all workers and their free resources. See the delay scheduling comments in the class description.
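For example, a possible shape for that doc, sketched against the signature discussed below (wording adapted from the comment above, not necessarily the exact text that was merged):

```scala
/**
 * ...existing description above...
 *
 * @param offers free resources currently available on a set of workers
 * @param isAllFreeResources if true, `offers` contains all workers and all of their free
 *                           resources (a "full" offer); if false, it is a partial,
 *                           event-driven offer (e.g. a single executor freeing resources).
 *                           See the delay scheduling comments in the class description.
 */
def resourceOffers(
    offers: IndexedSeq[WorkerOffer],
    isAllFreeResources: Boolean = true): Seq[Seq[TaskDescription]]
```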

Contributor

If there are changes suggested and required, please file a separate new JIRA for this and link them. This PR has been merged and we have had too many followups at this point.

def resourceOffers(offers: IndexedSeq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
def resourceOffers(
offers: IndexedSeq[WorkerOffer],
isAllFreeResources: Boolean = true): Seq[Seq[TaskDescription]] = synchronized {
Member

Do we need to set the default value?

Contributor Author

I think I originally did this to not break the API and to maintain something closer to the previous behavior for callers who hadn't migrated to setting it to false.
Lemme know if this is the wrong approach.

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…ilization due to delay scheduling
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
… no task was launched