
Slurm killing only parts of jobs? #65

Open
jeanlain opened this issue May 21, 2018 · 0 comments
jeanlain commented May 21, 2018

I'm not sure whether this is a bug or expected behaviour, but I've noticed a problem with jobs submitted on a cluster running Slurm 17.02.7.

Each job is an R script that launches system commands in parallel (each command runs the blastn executable, which is used to find homologies between DNA sequences).
The R function looks roughly like this:

runBlast <- function(query, db, temp, done) {
    # blastn writes its results to the file name stored in 'temp'
    system(paste("blastn -query", query, "-db", db, "-out", temp))
    # the result file is renamed upon completion of the blastn command
    file.rename(temp, done)
}

With mcMap(), I launch a dozen runBlast() calls in parallel from R. For each runBlast() call there is a forked R process (using ~0% CPU) and a child blastn process that does the actual work.
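For reference, the parallel calls look roughly like this (the file names, database name and core count below are placeholders, not my actual values):

library(parallel)

# hypothetical input/output file names, one set per blastn task
queries <- sprintf("query_%02d.fasta", 1:12)
temps   <- sprintf("result_%02d.tmp",  1:12)
dones   <- sprintf("result_%02d.done", 1:12)

# one forked R worker per call; each worker spawns a child blastn process
mcMap(runBlast, query = queries, db = "myDB", temp = temps, done = dones,
      mc.cores = 12)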

I have noticed that some result files were incomplete, even though they had been renamed by the file.rename() call and blastn reported no issue.

This seems to happen when Slurm reports:

slurmstepd: Exceeded step memory limit at some point

But the job wasn't interrupted and is reported as "completed".

So I think that when memory usage exceeds the requested amount, under some conditions Slurm kills a blastn process but not (or at least not immediately) the parent R process, which simply goes on to rename the file, since the file.rename() call runs after the system() call returns.

Anyhow, this is highly problematic for me: there is no indication of which process was killed, so there is no easy way to tell which blastn task was affected. Everything looks as if it finished properly.
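The only safeguard I can think of is to check the exit status returned by system() before renaming, so that a killed blastn no longer looks like a success. A minimal sketch (assuming an OOM-killed child actually comes back with a non-zero status, which I haven't verified on this cluster):

runBlast <- function(query, db, temp, done) {
    status <- system(paste("blastn -query", query, "-db", db, "-out", temp))
    # a child killed by SIGKILL typically comes back as status 137 (128 + 9);
    # only rename (i.e. mark the task as done) when blastn exited cleanly
    if (status != 0) {
        warning("blastn failed for ", query, " with status ", status)
        return(FALSE)
    }
    file.rename(temp, done)
}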

Note that when Slurm kills a whole job (and reports it as "cancelled"), this doesn't seem to happen.

So I guess my question is: does Slurm kill only "parts" of jobs, i.e. kill some child processes without (immediately) killing the parent process?
