[SPARK-14957][Yarn] Adopt healthy dir to store executor meta#12735
suyan wants to merge 1 commit into apache:master
Conversation
Test build #57126 has finished for PR 12735 at commit
I don't understand the problem or your change based on your JIRA. Are you trying to test for writability? I don't think we want to arbitrarily require certain free space. This patch looks like it deletes directories, though, which seems wrong.
I agree with @srowen; can you please describe what you are trying to do in more detail? Why are you looking for multiple of the db files and deleting older ones? When would there be more than one? Any sort of minimum disk space would need to be configurable. Note that in Hadoop 2.5 they added a getRecoveryPath() routine to the AuxiliaryService that gets a path to put the ldb. I don't know if we decided on what Hadoop version we are supporting in 2.x; @srowen do you remember if that was decided?
Hadoop 2.2 is still the base build for 2.x. If you think this is a small additional good reason to raise the requirement, let's highlight that.
Thanks; this is pretty minor and wouldn't warrant changing that.
We get the local dirs from the YARN config; there are a lot of dirs, but currently we always adopt the first one, right? First, assume no meta file exists yet: maybe we can choose a healthier disk dir for storing the meta file, to avoid unnecessary exceptions.
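A minimal sketch of the "choose a healthy dir" idea above. This is an illustration under assumptions, not Spark's actual code: the class and method names are hypothetical, and "healthy" is taken to mean the dir exists (or can be created) and is writable.

```java
import java.io.File;

// Hypothetical helper illustrating "choose a healthy dir": instead of
// always taking localDirs[0], return the first dir that exists (or can
// be created) and that the process can write into.
public class HealthyDirChooser {
  public static File chooseHealthyDir(String[] localDirs) {
    for (String dir : localDirs) {
      File candidate = new File(dir);
      // A dir is "healthy" if it already exists or can be created,
      // and it is writable; otherwise try the next one.
      if ((candidate.isDirectory() || candidate.mkdirs()) && candidate.canWrite()) {
        return candidate;
      }
    }
    return null; // no usable local dir found
  }
}
```

With this, a read-only or missing first dir would simply be skipped instead of causing the meta file creation to fail.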
As for why there could be multiple meta files: assume we find one, but its disk has become a read-only filesystem. Do we still use it, or choose another healthy dir and create a new one? If we choose the second, we can't delete the old one (the filesystem is read-only), so two files would exist...
To be honest, I haven't walked through the whole YARN shuffle service process; I just fixed a problem our users reported, where they couldn't connect to the shuffle service because the level db was created in a non-existent folder. I will take some time to reproduce the problem and understand this better. @tgravescs I will look into the getRecoveryPath API...
If the disk is bad or missing, there is nothing else you can do but create a new db since, as you say, deleting wouldn't work. Note I think all it does is log a message, because we didn't want it to kill the entire nodemanager, but I think we should change that. We should throw an exception if registration doesn't work, because the way the nodemanager currently works, it doesn't fail until someone goes to fetch the data. If it failed quickly when the executor registered with it, that would be preferable, but that is a YARN bug.

If you are going to look at the getRecoveryPath API, I think we can do it without reflection by defining our own setRecoveryPath function in YarnShuffleService (leave the override annotation off so it works with older versions of Hadoop). Have it default to some invalid value; if it's Hadoop 2.5 or greater, it will get called and set to a real value. Then in our code we can check whether it is set: if it is, use it; if not, fall back to the current implementation. Note that setRecoveryPath is the only one we really need to define, since getRecoveryPath is protected, but to be safe we might also implement that. We can store our own path.

I think between the throw and using the new API, we shouldn't need to check the disks. The recovery path is supposed to be set by administrators to something very resilient, such that if it's down, nothing on the node would work.
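The fallback scheme described above could be sketched roughly as follows. This is an illustration, not Spark's actual implementation: the real Hadoop 2.5+ AuxiliaryService.setRecoveryPath takes an org.apache.hadoop.fs.Path, but plain Strings are used here so the sketch stands alone, and the class name is hypothetical.

```java
// Illustrative sketch of defining setRecoveryPath without an @Override
// annotation. On Hadoop 2.5+ the NodeManager calls setRecoveryPath() and
// we store the admin-configured location; on older versions the method
// is simply never called, the field stays null, and we fall back to the
// old behavior of using the first YARN local dir.
public class ShuffleServiceSketch {
  private String recoveryPath = null; // stays null on Hadoop < 2.5

  // No @Override: this compiles even when the Hadoop version on the
  // classpath has no AuxiliaryService.setRecoveryPath to override.
  public void setRecoveryPath(String path) {
    this.recoveryPath = path;
  }

  // Prefer the NodeManager-provided recovery path; otherwise fall back
  // to the pre-2.5 behavior of taking the first local dir.
  public String dbDir(String[] localDirs) {
    return (recoveryPath != null) ? recoveryPath : localDirs[0];
  }
}
```

The design point is that no reflection is needed: the same class works against both old and new Hadoop, and the new API is picked up automatically when the NodeManager supports it.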
Actually, after looking a bit more, Spark does fail fast if the shuffle service isn't there: very soon after startup the BlockManager registers with the shuffle service, so if it didn't come up, the executors should fail quickly. Is this what you were seeing?
Yes, I agree with @tgravescs.
OK, sounds like we should close this PR.
@tgravescs, yes, the executor will fail fast...
What changes were proposed in this pull request?
Adopt a healthy dir to store executor meta.
How was this patch tested?
Unit tests