[SPARK-26268][CORE] Do not resubmit tasks when executors are lost #24462
Conversation
Note that this is a work in progress/suggestion for how this might be done. I don't know if we want to use the property in this exact way, but the "externalness" of the shuffle service should be independently togglable.
Do you mean you customized an external shuffle manager to store data outside the cluster yourself? If so, I think maybe you should make this change in your own branch.
@gczsjdy Yes, it's pluggable, but I am not sure whether it's good to add this config, since it's almost unused unless you add a custom shuffle manager and compile a new Spark package.
@@ -1791,7 +1791,8 @@ private[spark] class DAGScheduler(
     // if the cluster manager explicitly tells us that the entire worker was lost, then
     // we know to unregister shuffle output. (Note that "worker" specifically refers to the process
     // from a Standalone cluster, where the shuffle service lives in the Worker.)
-    val fileLost = workerLost || !env.blockManager.externalShuffleServiceEnabled
+    val fileLost = workerLost || (!env.blockManager.externalShuffleServiceEnabled &&
+      !sc.conf.get(config.EXTERNAL_SHUFFLE_ENABLED))
With worker lost and EXTERNAL_SHUFFLE_ENABLED = true, shouldn't fileLost be false, since the shuffle file metadata doesn't reside in the Worker?
I think this should be: !sc.conf.get(config.EXTERNAL_SHUFFLE_ENABLED) && (workerLost || !env.blockManager.externalShuffleServiceEnabled)
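To sanity-check the suggested ordering, here is a minimal sketch with the config reads replaced by plain booleans (`pluggableExternalShuffle` stands in for the hypothetical `spark.shuffle.external.enabled` value; names are illustrative, not the real DAGScheduler code):

```scala
// Sketch of the condition suggested above, isolated so the logic can be checked.
def fileLost(workerLost: Boolean,
             builtInShuffleServiceEnabled: Boolean,
             pluggableExternalShuffle: Boolean): Boolean =
  !pluggableExternalShuffle && (workerLost || !builtInShuffleServiceEnabled)

// With an external-to-Spark shuffle store, files survive even a worker loss:
assert(!fileLost(workerLost = true, builtInShuffleServiceEnabled = false,
  pluggableExternalShuffle = true))
// Existing behavior is unchanged when the new config is off:
assert(fileLost(workerLost = true, builtInShuffleServiceEnabled = true,
  pluggableExternalShuffle = false))
```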
Good catch--I had missed this since I never tested it in standalone mode (only YARN).
The DAGScheduler assumes that shuffle data is lost if an executor is lost and the Spark external shuffle service is not enabled. However, external shuffle managers other than the default external shuffle service can store shuffle data outside of the Spark cluster itself. In this case, completed map tasks need not be rerun even if the corresponding executors are lost. The new `spark.shuffle.external.enabled` property allows this new external shuffle behavior to be toggled (independently of the built-in Spark external shuffle service).
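For illustration, a deployment using a custom external-to-Spark shuffle manager might then be configured like this (the manager class name is hypothetical):

```
spark.shuffle.manager            org.example.shuffle.HdfsShuffleManager
spark.shuffle.service.enabled    false
spark.shuffle.external.enabled   true
```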
@liupc The problem with the current situation is that you have to recompile and ship all of Spark core if you want to use a custom external shuffle manager. This configuration change allows you to use a custom shuffle manager as a "plugin" that you install alongside Spark. This substantially speeds up compilation and means that you don't need to make error-prone changes to the scheduler.
IIUC, there is another problem: even if the shuffle data is stored outside the cluster, nobody would serve fetches of these blocks when an executor is lost. If the completed map tasks are not rerun, how can subsequent shuffles fetch the data?
In this case, the shuffle data server is external to Spark. This is the entire reason we do not want to rerun completed map tasks.
I understand the motivation here, but I think this should be handled as part of the new shuffle storage plugin mechanism under https://issues.apache.org/jira/browse/SPARK-25299. We've actually been discussing the right way to expose this kind of knob to a plugin, if it's necessary. I don't think we've gotten around to writing up docs about this yet, sorry. So while I don't think we should make the change in this PR, I'd be really interested in hearing more about the system you are using for storing shuffle data.
I do not understand this change, especially the part:
"The new spark.shuffle.external.enabled property allows this new external shuffle behavior to be toggled (independently of the built-in Spark external shuffle service)."
Independently? But if the old `spark.shuffle.service.enabled` is false and the new `spark.shuffle.external.enabled` is true, how do you get the shuffle blocks? In the `BlockManager` you are not using the `ExternalShuffleClient` for accessing the blocks:
spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
Lines 190 to 196 in 37d4931

    private[spark] val shuffleClient = if (externalShuffleServiceEnabled) {
      val transConf = SparkTransportConf.fromSparkConf(conf, "shuffle", numUsableCores)
      new ExternalShuffleClient(transConf, securityManager,
        securityManager.isAuthenticationEnabled(), conf.get(config.SHUFFLE_REGISTRATION_TIMEOUT))
    } else {
      blockTransferService
    }
Moreover, this change very much overloads the word "external", as the Spark built-in shuffle service is already external. What about "independent" or some other synonym? (On the other hand, I am just interested in the idea behind this; currently I am not convinced by the change.)
@liupc Under remote shuffle, if certain executors are lost, we can still fetch the shuffle data from remote storage (i.e. HDFS); we don't need executors to 'serve' the shuffle data. Also, I don't understand why you said this is too special... 'a shuffle data server external to Spark' may become the new common case...
@gczsjdy I understand this use case, and it's a common need, but the current pluggable implementation is very flexible: the whole implementation of the shuffle manager can be customized. This PR, however, relies heavily on the shuffle manager being of the external kind, which is why I think we should not do this in the master branch, or we should put the two together. So maybe the suggestion from @squito is the good way to do it.
@liupc Thanks for the explanation.
To give context: this is to support an HDFS shuffle implementation, which I have not yet had a chance to upload. This shuffle implementation could live outside of Spark itself, but it does need this configuration param added. I've been following along with SPARK-25299, but it looks like these scheduler changes have not yet been introduced. This is one component of SPARK-25299 that we need to figure out, and I think it makes sense to have it broken out as its own blocking issue. (In general, that jira should be decomposed into issues that can be addressed/discussed with smaller scope.)

@attilapiros I'm aware of the issues you point out here. If the in-Spark "external" shuffle service is used, you need to manually set this new property to … This approach requires users to understand the implications of the new configurations, but only those who wish to use HDFS shuffle (or some other external shuffle implementation). Adding a flag is the simplest way to allow the flexibility required to introduce an "external-to-Spark" shuffle manager without introducing core interface changes. While this could be configured incorrectly, that's unlikely, and many of our configurations require understanding before changing from the default value.

A better long-term approach could be something like: add a method to the …
We have also developed our own shuffle manager. Here is the code:
I think @attilapiros has a good point here. And now I'm wondering how your (@bsidhom) pluggable shuffle manager achieves the shuffle process (map & reduce) with …

And I have a similar question for @jealous: does your shuffle manager still work with …
Hi @Ngone51, our shuffle manager is an in-place replacement of the vanilla shuffle manager. It allows the user to write the shuffle files to external storage. The files can be located by app id, shuffle id, mapper id, and reducer id, so we don't need an extra registry for the shuffle files. And yes, we don't rely on an …
@bsidhom I agree that it would be ideal if there's a field in …
@Ngone51 As @jealous mentions, shuffle implementations that maintain their own state outside of the Spark system do not need executors or the built-in Spark shuffle service to serve shuffle data.

@gczsjdy By "outside of Spark" I mean that shuffle manager implementations do not need to live inside of (a fork of) the Spark repo but can instead be built and shipped as "plugins"*. Building separately makes it easier and faster to compile, test, and iterate. That's how we currently build the HDFS/HCFS shuffle manager. You can then distribute the jar with a Spark deployment by dropping it under the …

Unfortunately, because the shuffle manager class is instantiated at the time of …

* Note that because …
Agreed, I think that is a better way to expose this. But I'd rather not put in a config here in the meantime, and I'd like to wait a bit on the other shuffle plugin work to settle what that API should be exactly. The old ShufflePlugin API was really just there to manage Sort vs. Hash shuffle, back when that was introduced long ago. I think it needs some updates for the new proliferation of distributed shuffle managers (I'm trying to say "distributed" instead of "external" for this new class of shuffle managers).
Hi @jealous, can you give a link or source file name for that part? I'd like to learn more about the implementation details.
Hi @Ngone51, check our project page here:
FWIW, it's at least clear to me now that this will be needed even with SPARK-25299, as that work is mostly orthogonal to this (as discussed here: https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit?disco=AAAADAYd_10). I think it makes sense to expose this as a method on …
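A sketch of what exposing this as a method on the shuffle manager might look like (the trait and method names here are hypothetical, not an agreed API):

```scala
// Hypothetical stand-in for the ShuffleManager interface, for illustration only.
trait ShuffleManagerSketch {
  /** Whether map output survives the loss of the executor that wrote it. */
  def shuffleDataSurvivesExecutorLoss: Boolean = false
}

// A distributed shuffle manager (e.g. HDFS-backed) would override the default.
class HdfsShuffleManagerSketch extends ShuffleManagerSketch {
  override def shuffleDataSurvivesExecutorLoss: Boolean = true
}

// The scheduler could then consult the manager instead of a global config:
def fileLost(mgr: ShuffleManagerSketch, workerLost: Boolean,
             extShuffleServiceEnabled: Boolean): Boolean =
  !mgr.shuffleDataSurvivesExecutorLoss && (workerLost || !extShuffleServiceEnabled)
```

This keeps the decision with the component that actually knows where the shuffle data lives, instead of asking users to set a free-standing flag consistently with their chosen manager.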
@squito I agree with you, but I still want to make sure I understand it right: the function at https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit?disco=AAAADN6g3wY resembles this PR's work. However, it only works when users leverage the new …

So do we actually have plans to push this PR forward? @bsidhom Could we create a field in …
Actually, I think the way it would work with the shuffle storage API is that you would return …
@yifeih It's very interesting, but I didn't 100% get it. The new …
The new …
@yifeih Thank you, I understand now. But can your way (making … In other words, what the Driver decides to do when invalidating an executor (what this PR works on) and how the Executors tell the Driver the …
I took another look at @yifeih's changes, and I think she's right, that will be sufficient. Now your custom shuffle manager should just return a …

It seems like even now this would almost work, except that having a …

Most importantly, we should just document the semantics of returning a null executorId in the MapStatus as part of the ShuffleManager contract.
Yup, as @squito pointed out, it's actually probably better to allow …
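A minimal sketch of the MapStatus convention being discussed, with the real types replaced by simple case classes (the real `MapStatus` carries a `BlockManagerId`): an absent executor id means "stored externally", so the scheduler keeps that output when the writing executor dies.

```scala
// Simplified stand-ins for BlockManagerId/MapStatus, for illustration only.
case class Location(executorId: Option[String], host: Option[String])
case class MapOutput(mapId: Int, location: Location)

// On executor loss, invalidate only output actually hosted on that executor;
// output with no executor id lives in external storage and stays valid.
def outputsToInvalidate(outputs: Seq[MapOutput], lostExecutor: String): Seq[MapOutput] =
  outputs.filter(_.location.executorId.contains(lostExecutor))

val outputs = Seq(
  MapOutput(0, Location(Some("exec-1"), Some("node-a"))), // executor-local data
  MapOutput(1, Location(None, None))                      // external (e.g. DFS) data
)
assert(outputsToInvalidate(outputs, "exec-1").map(_.mapId) == Seq(0))
```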
@squito I ran into a case that cannot be handled without this PR:
This is what I mentioned: …
Why do you want to store the data files on HDFS but the index files on the executors? This seems to have the worst of both worlds -- the (bad) resiliency of local storage, and the (bad) performance of remote reads. Or is the index file backed up somewhere as well? I definitely understand the general problem of knowing what to do about shuffle data when an executor is lost. So I understand why you want to do something like what this PR does. But it probably makes more sense to address this as part of the other shuffle API changes, if possible, not as another config. You might not be able to do everything you want -- in particular, the new API does not support multiple locations for shuffle data. We decided that was out of scope for now (but maybe a future enhancement). Is that what you're looking for -- one copy on the executor's local disk, and another copy on HDFS?
@squito Index and data files are both stored on DFS; the difference is that data files are read directly from DFS, while for index files a reducer fetches them from the executor('s cache) that wrote them, and only if the required index files are not in the cache are they loaded from DFS. This approach simulates the external shuffle service's cache, but it lives in the Executor instead of in another Java process. It needs a reasonable place (and that's the coordinated map executor) to cache index files. Returning a …
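The index-file caching described above can be sketched roughly like this (names are hypothetical; the real logic would live inside the custom shuffle manager):

```scala
import scala.collection.mutable

// Sketch: serve index bytes from an in-executor cache, falling back to DFS on a
// miss. A reducer first asks the executor that wrote the index; only on a cache
// miss (eviction, or a freshly restarted executor) is the index read from DFS.
class IndexFileCache(loadFromDfs: String => Array[Byte]) {
  private val cache = mutable.Map.empty[String, Array[Byte]]

  def get(indexPath: String): Array[Byte] =
    cache.getOrElseUpdate(indexPath, loadFromDfs(indexPath))
}
```

Usage sketch: repeated lookups of the same index path hit the cache and trigger only one DFS read, which is where the claimed latency savings come from.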
Ok, I see what you're trying to do -- and yeah, I don't think you can do it with the API we are proposing. That is a bummer, but we also need to draw a reasonable balance here, so I don't know if this case is really compelling enough to extend the API. Are you sure your cache feature really saves you that much? You've still got to make a remote read for the index file …
@squito Yeah, it saves us a lot: in a TPC-DS 1T benchmark, 30% of queries get a 1.1x+ performance boost, and 13% get 1.2x+. There's still a remote read, but only once (if index files are not swapped out because of insufficient cache space), and this feature can take advantage of the internal network bandwidth inside the computing cluster, relieving the compute-storage network, which may be the bottleneck of the workload. By 'reasonable balance', did you mean not considering complex conditions? I think it's probably beneficial to make this clear through discussion. Making this work in the long term is also fine by me. I was trying to make the point that the current solution of not resubmitting map tasks by modifying …
The "reasonable balance" I was talking about was between extending the Spark API to cover more use cases while still keeping it supportable and maintainable, by making incremental improvements where we can. For example, I completely agree that having multiple block locations is reasonable, but I'd rather not take it on right now. Thanks for providing your benchmarks; those are pretty compelling numbers... still, I'm hesitant to add a config for just this one use case right now. I'm open to hearing from others about more use cases that would benefit from this.
Thank you @squito
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!