entire jobTree crashes when a job's pickle file isn't written correctly #26
Comments
a) Atomic file creation will prevent a partial file from being present on error. (In reply to Joel Armstrong.)
|
Didn't mean to imply that GPFS itself was truncating the file somehow -- just that there were issues with reads/writes blocking for long times, which could easily have led to the process writing only a partial file and triggering a previously unnoticed race condition, or simply getting killed at an inopportune time. Atomic file creation is a great idea for this file; I think that's the correct fix here. |
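For readers following along, here is a minimal sketch of the atomic file creation idea discussed above, assuming Python 3 and a filesystem where rename is atomic within a single directory. The function name `write_pickle_atomically` and the `.tmp-job-` prefix are hypothetical and are not taken from jobTree's code. Whether a shared filesystem keeps this guarantee while it is misbehaving is also an assumption.

```python
import os
import pickle
import tempfile


def write_pickle_atomically(obj, path):
    """Write obj to path so a reader sees the old file or the complete new one.

    Hypothetical helper for illustration; not jobTree's actual implementation.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-job-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the bytes to disk first
        # Swap the finished file into place in a single rename.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the temp file and leave the target untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is killed before the rename, the target path keeps whatever complete contents it had before (or does not exist), so a reader should not see a half-written pickle.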
We already try to do this; see: https://github.com/benedictpaten/jobTree/blob/master/src/job.py#L43 (In reply to Joel Armstrong, Mon, Feb 2, 2015 at 9:58 AM.)
|
Oh wow, hadn't looked at the code yet. That seems pretty foolproof. Could this have the same root cause as the issue we had with the cactus sequences file? I'm stumped. |
Actually, the code can be simplified. The ".updating" file was there ... If the filesystem itself went wrong, I think all bets are off. (In reply to Joel Armstrong, Mon, Feb 2, 2015 at 10:23 AM.)
|
I have just encountered the same error in jobTree as part of a progressiveCactus run. The error message is below:
It is possible that there are some filesystem issues in play here, as our main shared filesystem had some recent instability, but at least in theory that was fixed >24 hours ago and after I started the progressiveCactus run. I should also note that this is using the Harvard informatics SLURM fork of jobTree (https://github.com/harvardinformatics/jobTree), but I think the relevant code here is unchanged. Edited to add: the directory with the failed job identified above has a 0-length job file and a 0-length updating file in it, in case that is relevant. |
I just had several jobTrees fail (presumably due to a filesystem problem) with this error:
I haven't looked into jobTree internals in depth, but I think this is due to the job writing a completely blank pickle file (the t2/t3/t1/t2/job file has size 0). I'll look to see if we can be resilient to these types of errors and just retry the job if this happens. It wouldn't have helped in this particular case, since all jobs were failing in the same way, but presumably this could also happen if a single job is killed in the wrong way.
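As a rough illustration of the "be resilient and retry" idea above, the sketch below treats a missing, empty, or truncated pickle as "no usable job state" instead of letting the exception take down the whole run. It assumes Python 3. `load_job_or_none` is a hypothetical name, and what the caller does with a `None` result (for example, re-queueing the job) is left open, since the thread does not say how jobTree would handle it.

```python
import os
import pickle


def load_job_or_none(path):
    """Return the unpickled job, or None if the file is missing, empty, or truncated.

    Hypothetical helper for illustration; not part of jobTree.
    """
    try:
        if os.path.getsize(path) == 0:
            return None  # a 0-length job file, as reported in this issue
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        # Missing file, unreadable file, or a pickle that ends too early.
        return None
```

A caller could then check for `None` and reschedule that one job. As noted above, this would not help when every job fails the same way, but it might contain the damage when a single job is killed mid-write.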