
[SPARK-29994][CORE] Add WILDCARD task location #26633

Closed
wants to merge 7 commits

Conversation

maryannxue
Contributor

What changes were proposed in this pull request?

This PR adds a new WILDCARD task location that can match any host. This WILDCARD location can be used together with other regular locations in the list of preferred locations to indicate that the task can be assigned to any host/executor if none of the preferred locations is available.
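As an illustration (not part of this PR's diff; the RDD and host names below are hypothetical), an RDD could mix the wildcard with regular host preferences like this, relying on the "*" string form that the earlier revision of this PR parses in TaskLocation; the final revision wires the wildcard in differently (see the author's later comment):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical RDD: prefers host1/host2, but the trailing "*" tells the
// scheduler it may run the task anywhere if those hosts are busy.
class WildcardFriendlyRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
  override protected def getPartitions: Array[Partition] =
    Array(new Partition { override def index: Int = 0 })

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(1, 2, 3)

  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq("host1", "host2", "*")  // regular hosts plus the wildcard fallback
}
```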

Why are the changes needed?

This is motivated by the requirement from LocalShuffledRowRDD. When the number of initial mappers of LocalShuffledRowRDD is smaller than the number of worker nodes, it can cause serious regressions if short-running tasks all wait on their preferred locations while they could have otherwise finished quickly on non-preferred locations too.

We have a "locality wait time" configuration that allows a task set to downgrade its locality requirement after a certain time has passed. However, this configuration affects all task sets in the scheduler, and different tasks incur different penalties for a locality miss. Thus, we need this finer-grained option that lets individual tasks opt out of locality.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added UT.

@maryannxue
Contributor Author

@maryannxue changed the title from [SPARK-29994] Add WILDCARD task location to [SPARK-29994][CORE] Add WILDCARD task location on Nov 22, 2019
* preferred locations to indicate that the task can be assigned to any host if it cannot get any
* desired location immediately.
*/
private [spark] case class WildcardLocation() extends TaskLocation {
Member

case object/object?

Contributor

yea we can use object here.
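A rough sketch of that suggestion (assuming the host member can simply report the wildcard string; the actual PR may handle it differently):

```scala
// Since a wildcard carries no state, a case object avoids allocating instances.
private[spark] case object WildcardLocation extends TaskLocation {
  // Assumption: the host field is not meaningful for a wildcard; "*" is a placeholder.
  override def host: String = "*"
}
```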

@SparkQA

SparkQA commented Nov 22, 2019

Test build #114270 has finished for PR 26633 at commit bb91814.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2019

Test build #114415 has finished for PR 26633 at commit 72a946c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

From my POV this is a shortcut that allows some tasks to bypass the 3-second locality wait imposed by delay scheduling. The change looks good to me because it avoids changing the global locality wait time, so its influence can be restricted in a controlled way.

One concern I should raise: how do we keep the WILDCARD locality from being used improperly, i.e., included in preferredLocations where the penalty of a locality miss is nontrivial?

Also cc @squito @tgravescs

@maryannxue
Contributor Author

I don't think there is a need to restrict it. Every RDD should "know" its own locality preference as well as the penalty for a locality miss. If we ever needed to make sure the WILDCARD is being used properly, we would also have to worry about whether the other, regular preferred locations are returned correctly and truly reflect the best possible locality choice.
That said, the 3-second global locality wait time is not going to work for all cases, and the WILDCARD location alone is not fine-grained enough either. Ideally the penalty, or the importance of locality, should be "encoded" in the location itself in the form of a wait time, e.g., "wildcard, 2s", so that the task can still wait a minimum amount of time before getting randomly assigned. However, this would involve big changes to the current Spark scheduler.
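Purely as an illustration of that idea (nothing like this exists in Spark or in this PR; the class name and field are made up), a location carrying its own wait budget might look like:

```scala
// Hypothetical only: a wildcard that still asks the scheduler to wait up to
// `waitMs` for a better placement before giving up on locality. It would have
// to sit alongside TaskLocation (the trait is private[spark] and may be sealed).
private[spark] case class TimedWildcardLocation(waitMs: Long) extends TaskLocation {
  override def host: String = "*"   // placeholder; no real host for a wildcard
}

// e.g. a task's preferred locations could then be expressed as:
//   Seq(TaskLocation("host1"), TimedWildcardLocation(2000L))   // "wildcard, 2s"
```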

@jiangxb1987
Contributor

LGTM

@jiangxb1987
Contributor

Could we please hold this PR for a couple more days to allow more eyes on it, since it's a critical improvement to Spark Core?

@maryannxue
Contributor Author

Sure, @jiangxb1987 !

@tgravescs
Contributor

I'm not seeing how this is different than any other task in Spark. In my opinion the locality fallback in Spark is broken. I generally recommend people set the locality delay to 0 on the node side because you can get very weird results where tasks wait way too long to be scheduled. On most networks these days it's better to just run the task somewhere than to wait for locality. I realize though that there are other conditions this was added for, and I've never spent the time to go look at a proper solution for it.
This just seems like a workaround to avoid doing the right fix, and one that can only be used by this specialized RDD. Maybe that is ok for now, but I would like to make sure that is very clear, and I somewhat hesitate because this is essentially a public API that would be hard to remove if it starts to get used elsewhere.

I'm assuming the intention is to just have LocalShuffledRowRDD always add it to the preferred locations? I'm a bit surprised that change isn't in here as well, since it seems relatively small and would show the use of it. Maybe I'm wrong and it's large, though, in which case it would make sense to split it apart.

Are there other specific use cases you have for this?

I think this should be discussed more before going in.

@squito

@cloud-fan
Contributor

cloud-fan commented Nov 26, 2019

I generally recommend people set the locality delay to 0 on the node side because you can get very weird results where tasks wait way too long to be scheduled.

well, if we look at resource utilization, it may be better to wait for locality, and save resources for other jobs/tasks.

This is really a hard problem, and the default 3 seconds locality wait may not be optimal either. We can only know the optimal solution if we know what jobs/tasks will be submitted in the future.

For LocalShuffledRowRDD, we don't need an optimal solution. We only need to avoid regressions. Compared to the normal shuffle reader, which fetches shuffle blocks from different hosts, this new WILDCARD location won't make things worse. It tries to satisfy the locality, and it maximizes resource utilization like the normal shuffle reader does.

Is it possible to make this thing internal? e.g. do not document it publicly. This is not a perfect solution, but I'm afraid there is no perfect solution. This at least gives us an option: if locality is not that important for certain tasks, you can use WILDCARD to let Spark schedule your tasks on other hosts.

@maryannxue
Contributor Author

Thanks for the feedback, @tgravescs! This is a workaround. A complete solution would be to bring the current locality fallback to the task level instead of the node level, as I said in the previous comment. An RDD knows the importance of locality based on its job and/or data size and could set a wait time for itself. Setting the cluster/node-level wait time would definitely affect other workloads and is not a solution here.

I could add the usage by LocalShuffledRowRDD in this PR, but I was thinking it is a Spark SQL change and better put in a separate PR. I'm fine either way; it's a one-line code change plus some code comments.
Back to its usage. Normally a LocalShuffledRowRDD can just read locally, using its map output locations. And we try to match its parallelism with the original ShuffledRowRDD, so we get the same level of parallelism and better locality. But the problem is when we have fewer mappers (from the shuffle map stage) than worker nodes, e.g., 5 vs. 10: if we stick to the preferred locations, the LocalShuffledRowRDD will suffer from the locality wait and be even slower than the original ShuffledRowRDD. If we had a way to know we'd end up in this situation, we could back out of the LocalShuffledRowRDD optimization and fall back to the regular shuffle. But the number of nodes is dynamic and we can't decide at compile time.
The WILDCARD location makes this totally adaptive. Given that Spark always offers resources from the highest locality level to the lowest, a node can never be "stolen" (for a specific task set) as long as it has unassigned tasks with a higher locality match. So when the number of mappers is equal to or larger than the number of worker nodes, even with the WILDCARD location specified, the read can be almost completely local. Otherwise, if the number of mappers is smaller, you get roughly a "number_of_mappers/number_of_nodes" locality hit rate. In the worst case, if the number of mappers is significantly smaller, the LocalShuffledRowRDD regresses to behave like a regular ShuffledRowRDD but cannot be worse.

@attilapiros
Contributor

I am wondering whether you should return true here for WildcardLocation, without checking the host equality.

def isAcceptable: Boolean = acceptableExecutors.exists {
  case loc: ExecutorCacheTaskLocation => loc.executorId == executorId
  case loc: TaskLocation => loc.host == host
}
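A sketch of the change the comment suggests, extending the snippet above (it assumes WildcardLocation is, or becomes, a case object as discussed earlier):

```scala
def isAcceptable: Boolean = acceptableExecutors.exists {
  case WildcardLocation => true  // a wildcard preference accepts any executor/host
  case loc: ExecutorCacheTaskLocation => loc.executorId == executorId
  case loc: TaskLocation => loc.host == host
}
```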

case _ =>
}
pendingTaskSetToAddTo.forHost.getOrElseUpdate(loc.host, new ArrayBuffer) += index
if (loc == WildcardLocation) {
Contributor

Nit: I would avoid this if-branch by introducing a new function that takes the pendingTaskSetToAddTo and the resolveRacks flag and then handles the forHost and forRack parts from these lines:

pendingTaskSetToAddTo.forHost.getOrElseUpdate(loc.host, new ArrayBuffer) += index
if (resolveRacks) {
  sched.getRackForHost(loc.host).foreach { rack =>
    pendingTaskSetToAddTo.forRack.getOrElseUpdate(rack, new ArrayBuffer) += index
  }
}

Then call this new function from the relevant case branches and add a new case for WildcardLocation.
I think this way it would be easier to follow what happens where.
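A sketch of what such a helper could look like (hypothetical; the names mirror the snippet above, and the helper itself is not in the PR):

```scala
// Factors the per-host and per-rack bookkeeping out of the location match,
// so a WildcardLocation case can stay a one-liner.
private def addPendingHostAndRack(
    pendingTaskSetToAddTo: PendingTasksByLocality,
    resolveRacks: Boolean,
    host: String,
    index: Int): Unit = {
  pendingTaskSetToAddTo.forHost.getOrElseUpdate(host, new ArrayBuffer) += index
  if (resolveRacks) {
    sched.getRackForHost(host).foreach { rack =>
      pendingTaskSetToAddTo.forRack.getOrElseUpdate(rack, new ArrayBuffer) += index
    }
  }
}
```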

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... you can match on the case object this way:

 case e: WildcardLocation.type =>
   pendingTaskSetToAddTo.noPrefs += index

Contributor

does case WildcardLocation => work?

Contributor

Yes, it does. It is even better, as here nothing is needed from the matched object (which would be identical to WildcardLocation anyway).
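In context, the simpler pattern would read roughly like this (a sketch using the names from the snippets above):

```scala
loc match {
  case WildcardLocation =>
    // a wildcard preference behaves like "no preference" for pending-task bookkeeping
    pendingTaskSetToAddTo.noPrefs += index
  case _ =>
    // executor/host/rack locations handled as before
}
```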

@squito
Contributor

squito commented Nov 26, 2019

For a while I've been saying that we should set the locality wait to 0 in general (the biggest problem IMO is https://issues.apache.org/jira/browse/SPARK-18886). You will notice even Kay comments in the discussion there:

But often for Spark, you have one job running alone, in which case delay scheduling should arguably be turned off altogether

Spark 3.0 might be a good chance to change that default.

@tgravescs
Contributor

Fixing the scheduler locality algorithm is definitely a bigger change. The locality delay to me should be a per-task delay: if a task doesn't get scheduled in 3 seconds, then fall back to the next locality level. Right now it waits for any task to have gone unscheduled for 3 seconds at that locality level. I know Kay has an argument for the FairScheduler use case, but I don't know that I agree with that, or that it isn't handled by the per-task delay. If you really want your task to wait that long for locality you can simply set it higher. I'm not sure what code changes would be required to make that change, though, or how ugly the code would get if we really wanted to keep the old behavior behind a config.

| But the problem is when we have fewer mappers (from the shuffle map stage) than worker nodes, e.g., 5 vs. 10: if we stick to the preferred locations, the LocalShuffledRowRDD will suffer from the locality wait and be even slower than the original ShuffledRowRDD.

I'm not sure I follow this statement. If you have fewer mappers, let's say you have 5 and you have 10 worker nodes (assuming this is standalone mode, or do you mean executors?), the 5 maps will run on 5 of those nodes. Your LocalShuffledRowRDD uses the map output locations as the preferred locations, so why wouldn't the scheduler schedule on those nodes? Are you saying the 10 worker nodes (not sure if you mean executors or workers?) are being used by others (either a job or a stage), so some might be busy, and the delay from waiting is more than just reading over the network? Is this case with dynamic allocation or not? It sounds to me like the normal case where you ran on some executors, you may not have the same executors when your reduce phase runs, so your scheduling is delayed because you can't get node locality. I think you could have the same thing with ShuffledRowRDD with a small number of maps/reducers.

The issue I want to understand is why we are special-casing this one RDD for a performance improvement when, in my opinion, the majority of jobs would benefit from not having to wait for locality (as implemented today). Changing the default like Imran mentioned might be a good first step, and then fixing the algorithm would be the second, in my opinion.

Do you think a default node locality wait of 0 would solve your problem? Obviously if a user does set it then it gets applied again, though.

@maryannxue
Contributor Author

Changing the default locality wait time to 0 (or whatever it is) is based on the assumption that no workload suffers a serious penalty from a locality miss, because we are looking at shuffles only. There can be exceptions where locality does matter a lot and it would be worth some wait time.

Back to the local shuffle reader. I'll explain from the very beginning what it does.
Background: In Adaptive Query Execution, we may decide to change a sort-merge join to a broadcast-hash join in the middle of query execution. A sort-merge join usually (not always) comes with both sides topped by a shuffle and a sort. Once the shuffle is materialized (the map stage is done), we come to a natural sync point where we can try re-planning based on the shuffle stats and data metrics we've got so far. So if one side is small enough to be broadcast, we can change the SMJ into a BHJ, but note that the shuffles underneath have either completed or at least started.
LocalShuffleReader: We obviously can't save the materialization (map stage) cost, but what we can do is avoid the network cost of the reduce stage of a shuffle, because a BHJ does not care about the shuffle partitioning at all. The LocalShuffledRowRDD alters the behavior of a shuffle reduce stage by fetching data from a single mapper only and setting the preferred location to where that mapper's output is. Note that there can be multiple LocalShuffledRowRDD partitions for each mapper, which means a LocalShuffledRowRDD partition can fetch all of the output or part of the output from a single mapper.
Problem: The number of shuffle mappers is decided by the number of partitions of the upstream operator, and it can sometimes be small, especially when the upstream operator is a file scan. If it's smaller than the number of worker nodes (we are talking about "hosts" here, and it doesn't matter much how many executors there are per host, because the locality specifies hosts only), the tasks will all be scheduled on those hosts and not the others. For example, suppose we have 10 worker nodes, each with 4 executors; one of the join tables has 5 partitions after the scan, and the default shuffle partition number is 200. After the shuffle map stage we'll get 5 mappers, each having 200 shuffle blocks. In the reduce stage, if we use a regular ShuffledRowRDD, we'll get 200 shuffle reduce tasks, and assuming there's no skew, the 200 tasks will have no location preference and be evenly scheduled on the 4 * 10 = 40 executors. If instead we use the LocalShuffledRowRDD, in order to match parallelism, we'd like to have 200 output partitions too, which means there will be 200 / 5 = 40 tasks for fetching data from each mapper. Yet, because of the locality preference, all these 200 tasks will be scheduled on 5 hosts only, while the other 5 hosts will not share the workload.

@tgravescs
Contributor

Thanks for the explanation.

There can be exceptions where locality does matter a lot and it would be worth some wait time.

What are these cases? I'm sure there are some, but based on what I and many others I've talked to have seen, they are the exception rather than the rule. It doesn't matter whether it's HDFS data or shuffle data. People are setting this to zero now anyway, so changing the default makes sense to me.

we'd like to have 200 output partitions too, which means there will be 200 / 5 = 40 tasks for fetching data from each mapper.

I don't follow this logic: how do you go from 200 output partitions to 40 tasks? I would expect 200 output partitions to have 200 tasks. It doesn't matter too much, as the main issue is your next sentence.

all these 200 tasks will be scheduled on 5 hosts only, while the other 5 hosts will not share the workload

Now this part I understand. But it goes back to what I said before: I don't see how this is any different than any other RDD. About a month ago, we had a job that was hitting this same thing during shuffle. We had 20 nodes, but only 1 node was being scheduled on, and it had a significant impact on the job time; we set the locality delay to 0 and worked around the issue. Another example, running on YARN: let's say I have 10 tasks reading HDFS data, I get 10 nodes, and 5 of those nodes actually have HDFS blocks on them. With locality turned on, those 5 nodes will be loaded up and, depending on how long they take, could keep the other 5 tasks from running as quickly as they should.

So what happens if these 5 mappers have very large data or skewed data? Is it better to skip locality? That is going to add to the network usage; who is to say I want that vs. waiting? It might be that the tasks take long enough that it actually does fall back to rack locality; it depends on the harmonics of when tasks finish and are scheduled. I'm willing to bet that in general a locality delay of 0 is still more performant, which is why we go back to locality delay 0 for a default. If users actually need locality then they can turn it back on. Now that will also affect your RDD here as well, but this seems like a very specific case.

@tgravescs
Contributor

what kind of performance impact do you see if you just don't set preferred locations at all in your RDD?

@maryannxue
Contributor Author

I don't follow this logic: how do you go from 200 output partitions to 40 tasks? I would expect 200 output partitions to have 200 tasks. It doesn't matter too much, as the main issue is your next sentence.

It's 200 tasks overall, but each mapper has 40. It's that simple, but it doesn't really matter.

what kind of performance impact do you see if you just don't set preferred locations at all in your RDD?

It would be no different from ShuffledRowRDD, so why would we bother with the LocalShuffledRowRDD in the first place?

But it goes back to what I said before: I don't see how this is any different than any other RDD.

It is no different from any of the other RDDs you mentioned. The only difference is that this RDD has a definitive "baseline" or "goal": it aims to perform no worse than a regular shuffle and better if possible. For other RDDs, I can't say what the target is or what performance impact is considered acceptable.

Yes, setting the locality wait to 0 would solve the problem of LocalShuffledRowRDD perfectly, and the effect would be equivalent to this PR's proposal for a single task set alone. This leaves us only one difference of opinion: the exceptions, which I choose not to disclose.

@jiangxb1987
Contributor

Please correct me if I'm wrong, but it seems to me that the LocalShuffledRowRDD is a special case of ShuffledRowRDD, with the only difference that ShuffledRowRDD normally doesn't have preferred locations, while LocalShuffledRowRDD has a preferred-locations list. At execution time, if the preferred-locations list is short, it's highly possible that the tasks from LocalShuffledRowRDD will wait for preferred locations (executors/hosts) due to delay scheduling, which sometimes makes the wait time even longer than the task duration.

Setting the locality wait time to 0 could be an answer to this case (and possibly many other use cases, too). But on the other hand, it would cause regressions in other jobs/stages where task locality is critical (the exceptions Thomas mentioned). We just can't ignore those regressions, and I can imagine how much effort it would take to fix them for the exception cases.

How about we accept the current PR as a temporary solution to work around the delay scheduling issue, so that RDDs that don't want to wait for perfect locality can just add WILDCARD to their preferredLocations? To me it's better than setting the locality wait time to 0 directly, as it won't affect other workloads.

@jiangxb1987
Contributor

jiangxb1987 commented Nov 26, 2019

Of course, it's worth a separate JIRA/PR to discuss changing the default value of delay scheduling.

@tgravescs
Contributor

I'm on the other side: I would rather see the default set to 0, which I think most people do anyway and which I believe will help a lot of other cases, than add extra one-off maintenance code here.

But if others disagree I'm ok with this; it just needs to be heavily documented as an internal-only developer API that should go away.

Do we know how much performance difference this makes and how often?

@@ -70,7 +80,9 @@ private[spark] object TaskLocation {
def apply(str: String): TaskLocation = {
val hstr = str.stripPrefix(inMemoryLocationTag)
if (hstr.equals(str)) {
if (str.startsWith(executorLocationTag)) {
if (str == "*") {
WildcardLocation
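The fragment above loses the diff's +/- markers; the following is a sketch of how the wildcard string plausibly slots into TaskLocation.apply in this revision (a reconstruction, not the committed code; the author later notes that the final revision no longer recognizes "*" in apply):

```scala
def apply(str: String): TaskLocation = {
  val hstr = str.stripPrefix(inMemoryLocationTag)
  if (hstr.equals(str)) {
    if (str == "*") {
      WildcardLocation                      // the new wildcard form
    } else if (str.startsWith(executorLocationTag)) {
      val hostAndExecutorId = str.stripPrefix(executorLocationTag)
      val splits = hostAndExecutorId.split("_", 2)
      require(splits.length == 2, "Illegal executor location format: " + str)
      val Array(host, executorId) = splits
      new ExecutorCacheTaskLocation(host, executorId)
    } else {
      new HostTaskLocation(str)
    }
  } else {
    new HDFSCacheTaskLocation(hstr)
  }
}
```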
Contributor

If this goes in, I would like to see documentation stating that this is a DeveloperApi and a temporary workaround that will be removed once locality is fixed.

@tgravescs
Contributor

I agree the default-setting change needs a bigger conversation, but if that conversation is going to happen, then in my opinion we shouldn't check this in until it has happened.

I have not seen a real argument for why this RDD is different from any other. But if we fix the real issue with locality then it helps everything. The argument that it's a special version of ShuffledRowRDD and that you sometimes hit this locality issue doesn't convince me. I can hit the locality issue with ShuffledRowRDD, and I might not hit the issue with LocalShuffledRowRDD. Why not change ShuffledRowRDD or HadoopRDD to use this as well, since I can hit the same issue there? The only argument I can see is limited scope, but at the same time, does it only kick in when you hit the described case of mappers < reducers and more executors than mappers? If it turns it on more than that, then one could argue you aren't following the semantics Spark defines for locality wait.

I don't see any concrete numbers here on the performance impact, how much this affects users, or why we should special-case this. If it has a huge impact then I can see why we would special-case it, but I haven't seen any evidence of that. Do we have any cases where this is seen in production? Is there a negative impact from the user just setting the node locality wait to 0?

Again, the main issue I have is that once it's introduced anyone can use it in an RDD; therefore I consider it a public interface. You say it has limited impact and is only used by adaptive execution, but once introduced there is nothing stopping others from using it.

Adding more people to get opinions.
@vanzin @dongjoon-hyun @srowen

@maryannxue
Contributor Author

@tgravescs

People are setting this to zero now anyway, so changing the default makes sense to me.

Not sure how representative these "people" are, so let's bring the whole thing to the dev list for discussion.

Again, the main issue I have is that once it's introduced anyone can use it in an RDD; therefore I consider it a public interface. You say it has limited impact and is only used by adaptive execution, but once introduced there is nothing stopping others from using it.

I actually see it the other way. If we do see a regular ShuffledRowRDD suffer from the locality wait when it happens to have a preferred location (because REDUCER_PREF_LOCS_FRACTION and some other conditions are satisfied), we might end up finding this new location handy.
I don't see why I need to make an argument about how LocalShuffledRowRDD is different from other RDDs. On the contrary, if other RDDs have the same requirement, they can opt to use this approach as well. But I have to point out that we can't rule out the possibility that there are other RDDs that wouldn't enjoy a world of no locality wait at all.
As I pointed out at the very beginning in #26633 (comment), each RDD should know its own locality preference as well as the importance of that locality. If we ever had to worry about this location being used improperly, we'd have to worry about whether any other regular location is returned correctly by the RDD as well.

That said, this is a partial solution only. I'd like to see a complete fix as well, but I don't think we should go completely the other way by changing the default wait time to 0.

@cloud-fan
Contributor

Seems we all agree that the delay scheduling is problematic. And theoretically delay scheduling can be critical for some jobs (we can easily think of a custom RDD to prove this).

Setting locality wait to 0 by default is another topic. Even if we do it, people that run jobs that need delay scheduling still need to set the locality wait. For these users, we need this WILDCARD location feature to enable AQE.

That said, changing the default locality wait to 0 doesn't solve the problem. It doesn't mean locality wait is always 0. As long as it can be non-zero, we need to deal with it in AQE.

@attilapiros
Contributor

It would be good to go through all the places where TaskLocation or its host member is used.

Like this place, which builds a Seq[String] from the locations' host values:

def currPrefLocs(part: Partition, prev: RDD[_]): Seq[String] = {
  prev.context.getPreferredLocs(prev, part.index).map(tl => tl.host)
}

This will later be used to create only one PartitionGroup for all the partitions whose preferred location is WildcardLocation.
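One way such call sites might need to change (a sketch, not something proposed in this PR):

```scala
// Sketch: skip the wildcard so CoalescedRDD does not treat "*" as a real host.
def currPrefLocs(part: Partition, prev: RDD[_]): Seq[String] = {
  prev.context.getPreferredLocs(prev, part.index)
    .filterNot(_ == WildcardLocation)   // assumption: the wildcard carries no host info
    .map(tl => tl.host)
}
```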

@tgravescs
Contributor

Nobody has answered my questions above as to why this RDD should be treated differently, or what the impact of this is. You just keep saying this is for adaptive scheduling. As far as I can see, this is purely another instance of https://issues.apache.org/jira/browse/SPARK-18886, and I don't see why we aren't using the same workaround or really fixing the real issue.

As I pointed out at the very beginning in #26633 (comment), each RDD should know its own locality preference as well as the importance of that locality. If we ever had to worry about this location being used improperly, we'd have to worry about whether any other regular location is returned correctly by the RDD as well.

I don't agree. HadoopRDD, for instance, knows its locality, but how important that locality is depends heavily on the user and cluster. I don't see how the LocalShuffledRowRDD is any different. You are saying the user never cares about locality for this RDD; please explain why, and how it is different from HadoopRDD. If we were to turn this on for HadoopRDD, we would essentially be bypassing the locality settings.

Even if we do it, people that run jobs that need delay scheduling still need to set the locality wait. For these users, we need this WILDCARD location feature to enable AQE.

Again, why is AQE different? Let's say I really want my HadoopRDD to use locality, but then the ShuffledRDD hits this issue. As a user I can't just turn locality off for my ShuffledRDD, so what makes the LocalShuffledRowRDD any different? From what has been described here, this is a very particular case: you have more nodes and reducers than maps, and the maps finish very quickly (probably within 3 seconds). Other RDDs can hit the same issue under the same conditions.

@tgravescs
Contributor

I don't see why I need to make an argument about how LocalShuffledRowRDD is different from other RDDs.

You do, because you are doing a one-off hack here that Spark has to maintain, that adds a new public API, and that we can't use for anything else. Why not fix locality for all RDDs, as they can hit the same issue?

@maryannxue
Contributor Author

maryannxue commented Nov 27, 2019

why not fix locality for all RDDs as they can hit the same issue.

So what fix exactly are you talking about here?

@maryannxue
Contributor Author

@attilapiros Very good point. I'll go through all references of host as well as TaskLocation.apply.

@cloud-fan
Contributor

It is the same problem as https://issues.apache.org/jira/browse/SPARK-18886 . It would be great if we can solve that problem first, but seems there is no conclusion yet.

There is one difference in LocalShuffledRowRDD: it has a baseline. It is converted from the normal shuffle reader, and it shouldn't be slower than it. The normal shuffle reader mostly doesn't have locality (a reducer needs to read blocks from many hosts), so the WILDCARD location is a good solution. It kind of turns off the locality wait for LocalShuffledRowRDD, making it no slower than the normal shuffle reader.

Do we have an ETA for when we can resolve https://issues.apache.org/jira/browse/SPARK-18886? We can't simply remove the locality wait, since nowadays we usually run many jobs on a Spark cluster. It's unclear to me what the best solution to it is.

BTW, this feature won't be documented and it's not that public to me. Users can only learn about it by reading the discussion here. We can still remove it later if https://issues.apache.org/jira/browse/SPARK-18886 is resolved. To me this is just a workaround to turn off delay scheduling for certain tasks instead of globally, which does have value.

@cloud-fan
Contributor

Any more comments? This workaround makes sense to me given there is no perfect solution for https://issues.apache.org/jira/browse/SPARK-18886 yet. It is also an important fix for turning on AQE by default in 3.0.

@tgravescs
Contributor

In my opinion the ideal thing is to fix SPARK-18886. This is the perfect time: it's a new major release, and this isn't something impacting production now, so we don't really need a "quick fix". I disagree with your comment that there is no perfect solution; no one here has tried, and really no one here has given me any metrics as to why this fix is so important.

But I realize that is a lot more change, so I'm ok with this going in as a temporary fix. Please update based on the comments made; I want to make sure this is clearly documented in the code as a hack that will go away and that no one else should use.

Also, can someone give me any performance metrics? How much of a difference does the LocalShuffledRowRDD make?

@bmarcott
Contributor

bmarcott commented Dec 3, 2019

Could someone help review my proposed solution for SPARK-18886 here:
#26696

The idea is to only reset scheduling delay timers if allocated slots, based on the scheduling policy (FIFO vs FAIR), are fully utilized.

@maryannxue
Contributor Author

@tgravescs Our benchmark comparing AQE with LSR (local shuffle reader) against AQE without LSR showed that before the locality wait fix, there were 2 queries with over 10% regressions; after the fix, there was no regression and one query had over 27% improvement.

@attilapiros Thank you again for the careful review! I changed the code so that WildcardLocation is only recognized and handled internally by DAGScheduler and TaskSetManager, and calling TaskLocation.apply will not instantiate a WildcardLocation, so the new location has no effect or impact on any other usage of TaskLocation. I also refined addPendingTask so that the code change is minimized.

I also added javadoc stating the limited application and the experimental nature of the WildcardLocation class.

@SparkQA

SparkQA commented Dec 3, 2019

Test build #114802 has finished for PR 26633 at commit 78a123a.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114823 has finished for PR 26633 at commit e821543.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114863 has finished for PR 26633 at commit e821543.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

Thanks for the performance stats. Obviously it's going to vary some depending on job size and such; when you say 10% and 27%, what are the total job run times?

@squito
Contributor

squito commented Dec 5, 2019

Are you sure that in your case the slowdown is even caused by SPARK-18886? Even when that is solved, you could still end up with one 3s wait for almost all tasks.

I really do see how this can help, I see why folks want this. But my hesitance is that we're going to start putting in these random changes to delay scheduling, which will make the code even harder to understand; users will end up with even more knobs to tune; and we may be stuck with this even after SPARK-18886 because it would still be a performance regression against this change.

I agree with Tom's point -- I don't see how we know that ignoring locality waits is right for just this one RDD but not for others. Though I want the default locality wait set to 0, I could see a cluster admin wanting to increase the locality wait because they know their cluster is very network constrained. In fact this may be against the wishes of one particular Spark application, but still best for the cluster as a whole. In that case, you really might want a 3s wait on LocalShuffledRowRDD.

@maryannxue
Contributor Author

Thank you @squito , for the feedback!

In fact this may be against the wishes of one particular Spark application, but still best for the cluster as a whole. In that case, you really might want a 3s wait on LocalShuffledRowRDD.

Actually, as I stated in previous comments, the only difference LocalShuffledRowRDD bears from other RDDs is that it has been optimized from ShuffledRowRDD, and that optimization comes with a worst-case guarantee: this new RDD cannot perform any worse than the original ShuffledRowRDD, both for the application and for the cluster.

@squito
Contributor

squito commented Dec 10, 2019

In fact this may be against the wishes of one particular Spark application, but still best for the cluster as a whole. In that case, you really might want a 3s wait on LocalShuffledRowRDD.

Actually, as I stated in previous comments, the only difference LocalShuffledRowRDD bears from other RDDs is that it has been optimized from ShuffledRowRDD, and that optimization comes with a worst-case guarantee: this new RDD cannot perform any worse than the original ShuffledRowRDD, both for the application and for the cluster.

I think you're comparing LocalShuffledRowRDD vs. ShuffledRowRDD; I'm comparing LocalShuffledRowRDD with this change vs. LocalShuffledRowRDD without it.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions bot added the Stale label Mar 20, 2020
@github-actions bot closed this Mar 21, 2020