[SPARK-35022][CORE] Task Scheduling Plugin in Spark #32136
Conversation
May need to have more test cases. But I'd like to ask for the community's opinions first.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137245 has finished for PR 32136 at commit
cc @mridulm and @tgravescs FYI
@viirya why is the current locality scheduling mechanism not good enough? What problem are you trying to fix?
#30812 is a previous attempt to use the locality mechanism to stabilize state store locations. Basically what I want to do is to avoid Spark scheduling streaming tasks that use a state store (let me call them stateful tasks) to arbitrary executors. In short, it wastes resources on the state store and costs extra time restoring the state store on different executors. For this use-case, the current locality mechanism seems like a hacky approach, as we can only blindly assign stateful tasks to executors evenly; we do not know if the assignment makes sense to the scheduler. It makes me think that we may need an API we can use to provide scheduling suggestions.
I don't think we can guarantee it. It's a best effort and tasks should be able to run on any executor, though tasks can have preferred executors (locality). Otherwise, we need to revisit many design decisions like how to avoid infinite wait, how to auto-scale, etc.
Can you elaborate? If it's a problem of delay scheduling, let's fix it instead.
```scala
 * This trait provides a plugin interface for suggesting task scheduling to Spark
 * scheduler.
 */
private[spark] trait TaskSchedulingPlugin {
```
Why is this private to Spark if people are supposed to implement it?
This should be marked as a developer API.
I'm not sure yet whether we want it to be open for implementation outside Spark, so it's just private for now. It can be made public if we reach a consensus.
If it is not to be exposed, then why make it a plugin? Isn't the risk that you are now overgeneralizing something for a non-existent benefit?
Can't we add a locality constraint that forces you to schedule on an executor instead?
As I replied on the doc too, locality is static during the assignment of a task set. I also considered this option, e.g. dynamically changing locality during task scheduling, but I think the change would be larger, and it would more easily affect current code/apps.
If it is not to be exposed, then why make it a plugin? Isn't the risk that you are now overgeneralizing something for a non-existent benefit?
I agree the class name might be confusing. I thought of it as a plugin; for now it is not a plugin you can plug into Spark but a private API.
I'm not sure now if we want it to be open to implement outside Spark.
IMHO I'm thinking the opposite on this. I'm not sure we want to add an implementation to the Spark codebase unless the plugin implementation proves that it works well for all possible cases.
I'd rather be OK with letting Spark users (including us) hack and customize the scheduler at their own risk, but I have to be conservative if we want to change the behavior of Spark. If changing Spark's default behavior to cope with state is the final goal, I think the behavioral change itself is a major one requiring an SPIP, not something we can simply go after doing a POC on the happy case.
I would like to see more overview and design details. I think the idea of having something here is good because some people may want to cluster tasks, while others might want to spread them out. You might want to place them based on hardware or something else. I want to understand how flexible the plugin API you are proposing is. I just saw https://docs.google.com/document/d/1wfEaAZA7t02P6uBH4F3NGuH_qjK5e4X05v1E5pWNhlQ/edit# which has a few details. It would be good to link it from the description. Questions:
I would like to see how this applies to the other use cases I mentioned above before putting this in.
Although there is locality configuration, the configuration is used for data locality purposes
a) In the case of streaming workloads, I think the locality info here is about the state store instead of data. e.g.,
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDD.scala
Lines 51 to 54 in 0494dc9
```scala
override def getPreferredLocations(partition: Partition): Seq[String] = {
  val stateStoreProviderId = getStateProviderId(partition)
  storeCoordinator.flatMap(_.getLocation(stateStoreProviderId)).toSeq
}
```
So I think that locality preference scheduling (or delay scheduling) would also apply to it (the state store location).
b) That being said, I actually had the same concern when I touched the streaming code, because I know that delay scheduling doesn't guarantee that the final scheduling location is the preferred location provided by the task. So the cost of reloading the state store could still exist.
I want to make sure first which case you're trying to fix here: a or b?
Sorry, I think this sentence is misleading. I don't mean to break the "tasks should be able to run on any executor" design; this API doesn't break it. What we want is to be able to set a constraint/condition on choosing which task is scheduled to an executor. So basically tasks are still able to run on any executor; it's just that, for some purposes, we need a particular executor to pick up a particular task.
In #30812, @zsxwing, @HeartSaVioR and I had a long discussion about using locality for stateful tasks. You can see my original approach was to use locality, but overall it was considered too hacky, and now I share their points. You may catch up on the comments there. Basically I think the problem is that, for a stateful job, we want to evenly distribute tasks to all executors and keep the executor-task mapping relatively stable. With locality, we can only assign tasks to executors blindly. For example, the scheduler knows more about executor capacity and knows which executors should be assigned tasks, but in SS we don't have such info (and should not have it either).
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137307 has finished for PR 32136 at commit
Let me clarify the difference between a and b. So (a) is about using locality for the state store location and (b) is that locality cannot guarantee the actual location. Right? Please let me know if I misunderstand. I have tried to use locality for tasks with state stores in #30812. As you know (in the code snippet), SS actually already uses locality for the state store location. However it has a few problems: 1. it uses the previous state store location as locality, so if there is no previous location info, we still let Spark pick an executor arbitrarily. 2. It depends on the initially chosen executor-state store mapping, so if Spark chooses a sub-optimal mapping, locality doesn't work well for later batches. 3. Forcibly assigning state stores to executors can possibly lead to unreasonable scheduling decisions; for example, we don't know if the executor satisfies the resource requirements.
Linked it from the description. Thanks for the reminder.
For now it still respects locality. My thought is not to interfere with the scheduler too much. For each locality level, the scheduler will try to pick one task from the list of tasks for that particular locality level. The API jumps in at that moment and lets the scheduler know which tasks are most preferred on the executor. If users don't want locality to take effect, they can disable the locality configs. Otherwise, from the API perspective, if it doesn't want a particular task to be on an executor, it can also let the scheduler know (i.e., by not including it in the returned list).
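To make the mechanics concrete, here is a minimal sketch of what such a hook could look like; the trait name matches the PR, but the method signature and the toy implementation are my assumptions, not the actual proposal:

```scala
// Hedged sketch only; the actual interface in the PR may look different. The idea:
// given the task indexes that are schedulable at the current locality level, the
// plugin returns the indexes it prefers for this executor, in priority order, and
// the scheduler tries to dequeue from that list first.
trait TaskSchedulingPlugin {
  def rankTasks(executorId: String, host: String, eligibleTaskIndexes: Seq[Int]): Seq[Int]
}

// A toy implementation that spreads task indexes across a known number of executors
// (illustrative only, not part of the PR).
class SpreadEvenlyPlugin(numExecutors: Int) extends TaskSchedulingPlugin {
  override def rankTasks(
      executorId: String,
      host: String,
      eligibleTaskIndexes: Seq[Int]): Seq[Int] = {
    val slot = math.abs(executorId.hashCode) % numExecutors
    eligibleTaskIndexes.sortBy(i => math.abs(i % numExecutors - slot))
  }
}
```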
We may not need to know. The parameter can be discussed. As
Yea, I thought about it, but forgot to add the comment. I will do it in the next commit.
Good point. I think the reason it runs before blacklisting is that it is easier to fit into the current logic and seems safer. Currently the scheduler iterates over each task in the list, and then checks the blacklist before picking it up. If I let the API get the list after blacklisting, it seems it might be a larger change to the dequeue logic.
IIUC, the plugin does not affect or change how the scheduler acts on barrier tasks. In the current dequeue logic, the scheduler doesn't behave differently for barrier tasks versus general tasks. For now, if the scheduler cannot schedule all barrier tasks at once, it will reset the assigned resource offers. It is the same with the plugin.
Yes.
So, how would the plugin help when there's no previous location info?
I think this is the point that matches my point of case b above. But looking at the code, it seems the plugin is still applied after locality scheduling?
I don't get this. Do you mean some executors may not be suitable for having a state store?
Correct me if I'm wrong: Spark tries its best to schedule SS tasks on executors that have existing state store data. This is already the case and is implemented via the preferred location. The problem we are solving here is the first micro-batch, where there is no existing state store data and we want to schedule the tasks of the first micro-batch evenly on the cluster. This is to avoid skews in the future where many SS tasks are running on very few executors.
For example, in the plugin implementation, it can distribute tasks to executors more evenly.
Locality still works. Generally, this plugin API doesn't break locality if it is set.
For example, an executor may not be capable of running the task. If we blindly assign a stateful task to an executor, we don't actually know whether the executor is capable of running it; only the scheduler knows that info. I think this is the major point discussed in my previous PR.
I think the scheduler already distributes tasks evenly when there's no locality preference as we'll shuffle the executors before scheduling:
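(The snippet referenced here is roughly the offer shuffling in TaskSchedulerImpl, paraphrased from memory; the exact code may differ between versions.)

```scala
// TaskSchedulerImpl (paraphrased): resource offers are randomly shuffled before task
// assignment so that work is not always piled onto the same executors.
protected def shuffleOffers(offers: IndexedSeq[WorkerOffer]): IndexedSeq[WorkerOffer] = {
  Random.shuffle(offers)
}
```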
Doesn't it work?
For the code snippet, doesn't it depend on whether all executors are available at the moment the offers are made? It seems unreliable due to something like a race condition. For example, when running an SS job with state stores, it is easy to see the initial tasks scheduled to only a subset of the executors.
Yes, that's true. But normally, even in the case of offering a single resource released from a single task, it seems unlikely that tasks would be scheduled unevenly unless resources are really scarce.
Do you have logs related to the scheduling? I'd like to see how it happens.
As I saw in previous tests, whether all tasks are evenly distributed to all executors is a matter of chance. Sometimes they are, but sometimes only some of the executors get tasks in the first batch.
I don't have logs now. It is a plain and simple SS job reading from Kafka.
That is correct. However, even beyond the first micro-batch, we currently use preferred locations plus a non-trivial locality config (e.g., 10h) to force Spark to schedule tasks to their previous locations. I think it is not flexible because locality is a global setting: a non-trivial locality config might cause sub-optimal results for other stages, and it requires end-users to set it. It makes me feel that it is not a user-friendly approach.
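For context, the workaround described above amounts to something like the following (illustrative; I'm assuming the global spark.locality.wait setting is what is being raised to 10h):

```scala
import org.apache.spark.SparkConf

// Illustrative only: a very large locality wait keeps the scheduler waiting for the
// preferred (previous state store) executor, but it affects every stage in the app.
val conf = new SparkConf().set("spark.locality.wait", "10h")
```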
Kubernetes integration test starting
BTW, @Ngone51, is dynamic allocation required for stage-level scheduling? At least that is what I got from reading the code. But it seems dynamic allocation has some issues (or doesn't work well?) with SS (e.g. SPARK-24815)? Right? @HeartSaVioR
It's required for the classic use case (when you really need to change executor resources dynamically) but not this case. In this case, we only leverage the task scheduling characteristic of the stage level scheduling. Thus, adding the task request for the state store resource only should be enough. |
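If I read this right, that would look roughly like the existing stage-level scheduling API with a task-only resource request; the "statestore" resource name and how executors would report it are assumptions here, not existing behavior:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

// Hypothetical sketch: ask for a "statestore" task resource so the stateful stage's
// tasks are only placed on executors advertising that resource.
def withStateStoreRequirement[T](stateful: RDD[T]): RDD[T] = {
  val taskReqs = new TaskResourceRequests().resource("statestore", 1)
  val profile = new ResourceProfileBuilder().require(taskReqs).build()
  stateful.withResources(profile)
}
```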
Thanks for explaining some of the SS-specific things. The stage-level scheduling changes you are proposing are definitely not what it does now, and we would need a few features we were wanting to do anyway, like reusing existing containers, but that is all doable; it just needs to be implemented. It sounds like this would really be a special stage-level scheduling feature that is SS state store specific, because the user would not specify an exact requirement, just that tasks must be scheduled on executors with their specific state store. This means that if an executor goes down, it has to wait for something else on an executor to start up the task. It seems like this feature could exist outside of creating a new ResourceProfile with the stage-level scheduling APIs, and the user should be able to specify this option, which would only work with StateStoreRDD. Is it useful outside of that? I don't see how, unless we added another plugin point for the executor to report back any resource and then came up with some API it could call to map the taskId to the resourceId reported back to it.
I think this would mean the scheduler would have some specific logic to be able to match a task id to a state store id, right? Otherwise stage-level scheduling would schedule a task on anything in that list, which at that point seems to make the list not relevant if Spark knows how to do some sort of mapping.
Good to know that. So, does it mean I can remove the dynamic allocation check for our case without affecting classic stage-level scheduling?
I think this could be configurable. Users could configure it to schedule anywhere and load from checkpointed data, or in the other case just fail the streaming app after a certain period.
Hmm, @Ngone51, is that true? For other resources like GPUs it makes sense, but in this case we need a specific (task id)-(resource id, e.g. state store id, PVC claim name, etc.) binding.
Which check are you referring to?
Yes. Besides, in the Spark default use case, we can also move the state store resources to other active executors (
That's true. We need the mapping. I thought using the existing task info (e.g., partitionId) would be enough to match the right state store, but that looks wrong. We'd have to add extra info for the mapping.
@tgravescs Yes, the target might be achieved directly with the mapping. But I think that would be the last choice, as the community wants to introduce less invasive changes when working across modules. And that's the reason @viirya proposed plugin APIs first and we're now discussing the possibility of reusing an existing feature - stage-level scheduling.
I think it's only used for |
There is an assertion that dynamic allocation should be enabled under stage-level scheduling. I mean, if we remove that assertion, will it affect the normal cases of stage-level scheduling?
That's correct, and it is one major point behind this API. It is proposed to stay as separate as possible so it does not affect Spark outside the SS code path. Personally I'd prefer a separate design. You may know more about how much invasive change we need to support the use-case in stage-level scheduling. Let me know if you want to revisit the decision so I can follow up with the direction. I'm willing to continue on this API or turn to stage-level scheduling with some changes if you think that is better.
(Sorry about the late reply.)
Could you post the code snapshot? -- I’m thinking about how to establish the “mapping” with stage level scheduling. My current idea is:
, and probably using a map (from task index to its StateStoreProviderId) to store the mapping. @tgravescs @viirya WDYT?
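As a rough illustration of that mapping (the StateStoreProviderId fields below are simplified stand-ins for the real SS class):

```scala
// Simplified stand-in for org.apache.spark.sql.execution.streaming.state.StateStoreProviderId.
case class StateStoreProviderId(operatorId: Long, partitionId: Int, storeName: String)

// The mapping described above: task index (the partition id for a state store stage)
// to the state store provider that task must use, e.g. for 4 partitions:
val taskToStore: Map[Int, StateStoreProviderId] =
  (0 until 4).map(i => i -> StateStoreProviderId(0L, i, "default")).toMap
```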
I agree with this; it's a matter of coming up with the right design to solve the problem and possibly others (in the case of the plugin). If we discuss alternatives that become too complex we should drop them. But we should have the discussion, like we are doing.
If the user can't specify it themselves with the stage-level API, are you saying Spark would do it internally for the user?
We can relax the requirement if something like this is specified. If we were to allow new ResourceProfiles to fit into existing containers, that requirement would be relaxed for that case as well. We just need to make sure it's clear to the user so jobs don't hang waiting for containers they will never get.
So essentially this is extending the locality feature, and then the only thing you would need in the stage-level scheduling API is the ability to say "use this new locality algorithm for this stage"?
Yes. And we can add a new conf for users to control the behavior.
Yes. I'm thinking a bit more: we probably don't even need the stage-level scheduling API ability. After knowing the "mapping", we can use it directly in
E.g., in ResourceProfileManager:

```scala
private[spark] def isSupported(rp: ResourceProfile): Boolean = {
  ...
  val YarnOrK8sNotDynAllocAndNotDefaultProfile =
    isNotDefaultProfile && (isYarn || isK8s) && !dynamicEnabled // <= Remove it?
  ...
}
```
Isn't the mapping still executor id <-> state store? The executor id could change due to executor loss. A more robust mapping, e.g. for our use-case, might be PVC id <-> state store.
Yes, agreed. I appreciate the discussion.
The mapping between executor id and state store is necessarily needed, and it can be achieved by the existing framework - as mentioned above, we probably don't even need the stage-level scheduling API ability if we follow the
@tgravescs what's your opinion on the
Hm? We don't need a mapping between tasks <-> state store. Do you mean PVC id <-> task (i.e. state store)?
Sounds like it makes sense. So let me rephrase it, and correct me if I misunderstand it. Basically, we introduce new task location When
BTW,
Your rephrase looks good except for one point here: "task (i.e. state store)"? Do you mean a task is a kind of state store? Is it a typo? I actually expect it to be a mapping between PVC and task id.
I don't understand this. I assume each state store must be bound to a specific location. Why can't we schedule the task?
A specific state store is bound to a task, e.g. task 0 is bound to state store 0. This cannot be changed. But where the task is scheduled can generally change; this is the current situation for the HDFS-backed state store. In other words, the task and its state store move together if Spark schedules the task to a different executor. So the mapping between PVC and task id also implies an (implicit) mapping between the PVC and the task's state store. That is why I added "i.e." there. Sorry for any confusion.
For the current HDFS-backed state store, the location can change: it changes when Spark schedules the task with the state store to a new executor. Once it is moved to a different executor, Spark will reload the checkpointed data from HDFS to construct the state store at the new executor (location). For the resource-specific case (e.g. PVC), the location is generally fixed, because it is bound to a specific resource on the executor. But in a case like executor loss where the resource is re-mountable, Spark can schedule the task with the state store to a new executor re-mounted with the resource.
I see. And to ensure we're on the same page - for the |
Yea, I have not tried it yet as we are still in the discussion phase. But my idea is to retrieve PVC info from the scheduler backend (k8s) when it retrieves executor info. I guess it doesn't return such info now. So in the state store RDD, when preparing preferred locations, it queries the scheduler backend (if it is k8s) to get the PVC info and fill it in. At the task scheduler side, during scheduling of a task set, it looks at the required resources to meet the task requirement in the location. Sounds okay?
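A hypothetical sketch of that idea (ExecutorPvcInfo and the helper are made up for illustration; only the executor_<host>_<id> location format matches what ExecutorCacheTaskLocation produces):

```scala
// Hypothetical: the k8s scheduler backend would report the PVC mounted by each executor.
case class ExecutorPvcInfo(executorId: String, host: String, pvcName: Option[String])

// When preparing preferred locations for a state store bound to a given PVC, suggest
// only executors that currently mount that PVC.
def preferredLocationsForPvc(statePvc: String, executors: Seq[ExecutorPvcInfo]): Seq[String] =
  executors.collect {
    case ExecutorPvcInfo(id, host, Some(pvc)) if pvc == statePvc => s"executor_${host}_$id"
  }
```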
sgtm.
I am trying to catch up on this discussion, and it is a very long thread already :-) Thanks for all the discussion!
Yeah, it would be great to have a summary, and also please describe exactly the flow when an executor is lost and how everything is updated.
@mridulm @tgravescs Yeah, I will update the doc. Thanks for the discussion!
Note that I just updated the doc.
Let me close this as we have some conclusions.
Is the plan to continue discussion in the doc and JIRA then?
Yea, we can continue discussion in the JIRA if needed.
Oh, we can continue the discussion here too if you prefer. I just think that since we are not going to review the code here, maybe it is good to close the PR.
What changes were proposed in this pull request?
This patch proposes to add a plugin API for providing scheduling suggestions to the Spark task scheduler.
Design doc: https://docs.google.com/document/d/1wfEaAZA7t02P6uBH4F3NGuH_qjK5e4X05v1E5pWNhlQ/edit?usp=sharing
Why are the changes needed?
The Spark scheduler schedules tasks to executors in an arbitrary (maybe not an accurate description, but I cannot find a better term) way; the scheduler schedules the tasks by itself. Although there is locality configuration, it is used for data locality purposes. Generally we cannot suggest to the scheduler where a task should be scheduled. Normally this is not a problem because a general task is executor-agnostic.
But for special tasks, for example stateful tasks in Structured Streaming, the state store is maintained on the executor side. Changing the task location means reloading checkpointed data from the last batch. This has disadvantages from a performance perspective and also imposes some limitations when we want to implement certain features in Structured Streaming. We need an API to tell the Spark scheduler how to schedule tasks.
Does this PR introduce any user-facing change?
No. This API should be developer-only and is currently for Spark internal use only.
How was this patch tested?
Unit test.