nodes sitting idle #741
Daryl;
Oh...
It only looks like 11Gb max RAM consumption was reported. That should be fine, so maybe it's a SLURM reporting issue.
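One way to cross-check what SLURM actually recorded for a finished job (generic sacct usage; the job ID is a placeholder):

```bash
# Show SLURM's recorded peak memory, requested memory and exit state for a job;
# replace 12345 with the engine or controller job ID from squeue/sacct.
sacct -j 12345 --format=JobID,JobName,MaxRSS,ReqMem,Elapsed,State,ExitCode
```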
Daryl; …
An update. Rerunning the pipeline with the original memory request consistently causes these idle nodes. The logs stop reporting shortly after initialization and are only informative after the jobs get killed due to time limits or manual intervention. Is there a way for me to monitor the engine processes for these types of shutdown?
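For a quick manual check, something along these lines will show whether the engines are still doing anything on their nodes (the node name is a placeholder, and grepping for ipengine/bcbio assumes the standard IPython engine process names):

```bash
# Which of my jobs are running, and on which nodes?
squeue -u "$USER" -o "%.10i %.20j %.10T %N"
# Log into one of those nodes and see what is actually running there.
ssh node0123 "ps -u $USER -o pid,etime,pcpu,rss,args | grep -E 'ipengine|bcbio'"
```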
Daryl; Does increasing the memory in …
I increased the memory and the jobs are still sitting idle. This time I don't see the memory error in the logs. I'll upload the logs; maybe you can see something.
I hung out on a node and watched. It looks like it's the alignment step that is dying. I'm using local scratch based on an environment variable for tmp (code updated to development). Nothing appears to be in the tmp directory; might that indicate an issue? If so, would a failure to create tmp directories be reflected in the logs?
Leaving tmp as the default, I can see files being generated, but all the nodes still go dormant.
Daryl; I dug around in the different tools and it looks like @lomereiter has been working on eliminating some memory leaks in sambamba (biod/sambamba#116). I grabbed the latest development version so if you update the tools with:
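(presumably the usual tools-only upgrade; the exact flags here are an assumption)

```bash
# Rebuild the third-party tools, which pulls in the newer sambamba,
# without touching the bcbio code or data installs.
bcbio_nextgen.py upgrade --tools
```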
maybe this will avoid the underlying issue and keep things running. Sorry about all the problems, and please let us know if this doesn't fix it.
What a pleasure it is getting code to work on other people's systems! The out logs are empty and the err logs stop reporting at the same time the …
After scancel: …
Looks like the upgrade had an error with cmake. Running the failed line: …
Installing cmake without docs avoids the Sphinx problem and allows the bcbio upgrade to complete.
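For anyone hitting the same thing, roughly what that looks like against the bcbio-managed brew (treat the --without-docs option as an assumption based on the commit referenced below):

```bash
# Build cmake without its Sphinx-generated documentation, then retry the upgrade.
brew install cmake --without-docs
bcbio_nextgen.py upgrade --tools
```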
…ild without-docs if not previously build into bottle (bcbio/bcbio-nextgen#741)
Daryl; …
Using the latest update seems to have got me over the alignment hump. Thanks. I've run into a similar problem at another point where the SLURM jobs still seem to be running, but the nodes are essentially idle and logging has stopped. Checking the nodes, I can see python and slurm_script processes running under my name but nothing happening. Attached are some selected sections from the submission err.log.
Daryl;
… to see if that resolves the issue? Hope this also works to avoid the memory errors you're seeing from SLURM. Thanks much for the patience and help debugging.
Getting closer. Any idea what step I should try bumping the memory for?
Daryl; …
The latest log shows some GATK errors. I didn't realize GATK was used in the FreeBayes joint calling pipeline. Regardless, it looks like FreeBayes is running on the nodes now that I increased the memory on the submission script.
Daryl; The error you're seeing is the same as previously. The input files are missing BAM indexes. This is going to make FreeBayes really slow, as it'll have to read over the entire BAM files instead of jumping directly to the regions of interest. I'm confused as to why we're lacking indexes. What version of bcbio and sambamba do you have? It's trying to index the files earlier in the process but fails for some reason. You'll want to resolve that to get reasonable speed out of the calling. Hope this helps.
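As a stopgap, the indexes can also be (re)built by hand before restarting; a minimal sketch, assuming the default align directory layout (paths are placeholders):

```bash
# Index any sorted BAM that is missing its .bai so FreeBayes can jump
# straight to the target regions instead of streaming the whole file.
for bam in align/*/*sort.bam; do
    [ -e "${bam}.bai" ] || sambamba index -t 8 "$bam"
done
```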
Is it looking for the *sort.bam indices in the default align directory? I see nonempty indices for all the BAMs, but there were a few that still had tmp files, i.e. …
From the log, how do I figure out which files are missing?
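One thing worth checking in that directory is whether any existing .bai is older than its BAM, since a leftover index from an interrupted run can look fine but be out of date; a quick sketch:

```bash
# Report any index that is older than (or missing relative to) its BAM
# under the default align directory.
for bam in align/*/*sort.bam; do
    [ "${bam}.bai" -ot "$bam" ] && echo "stale or missing index: ${bam}.bai"
done
```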
Daryl; …
I'll start running a separate test case. The issues might be related to the frequent restarting using different versions of the code. I doubled the memory for the submission script, but it still failed. Based on sacct, it looks like it's the controller job that is going over the default 2G. Where do I increase the memory for that step?
…') through ipython-cluster-helper and document #741
Daryl;
… to bump it to 6Gb, for instance. You'll need to upgrade …
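A sketch of what that might look like on the run command, assuming the resource flag is the controller-memory (conmem) setting exposed through ipython-cluster-helper; the queue name and config path are placeholders, so double-check the exact flag against the docs:

```bash
# Request 6Gb for the IPython controller job instead of the scheduler default.
bcbio_nextgen.py ../config/project.yaml -t ipython -s slurm -q general \
    -n 500 -r conmem=6
```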
Excellent! Simply restarting the pipeline got it moving again and it's running the …
Going forward, if there is anything you can do on the logging side to …
Is the observed behaviour of idle jobs consistent with a killed controller …
This was only 143 high coverage exomes, so not too much. I'm going to try …
Things are currently running at the QC step, but I'm seeing errors in the log: …
Daryl; …
Ah, multiple processes could be the issue. I have an sbatch script which submits the bcbio job. It looked like this was running out of memory, so I increased the allocation from the default 4G to 8G. Due to the cluster setup, this required increasing the number of cores from 1 to 2. After I did that, I was seeing multiple bcbio processes in squeue. I just assumed this was a reporting quirk. My setup is below.
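For reference, a wrapper of roughly that shape (2 cores, 8Gb, srun in front of the bcbio call); the partition, time limit and paths are illustrative placeholders rather than the actual script:

```bash
#!/bin/bash
#SBATCH -p general
#SBATCH -n 1
#SBATCH -c 2
#SBATCH --mem=8G
#SBATCH -t 7-00:00:00
#SBATCH -o bcbio-submit.out.log
#SBATCH -e bcbio-submit.err.log

# Top-level bcbio process; it submits its own controller and engine jobs
# to SLURM from inside this allocation.
srun bcbio_nextgen.py ../config/project.yaml -t ipython -s slurm -q general -n 500
```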
Reverting back to 1 CPU and restarting causes a memory issue on the submission script. sacct reports it completed without error.
Daryl; …
I'll rejig the submission script. I was just following some examples. The srun in this case is unnecessary.
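i.e. just dropping the srun and calling bcbio directly inside the allocation:

```bash
# Rejigged wrapper body: call bcbio directly; it handles its own SLURM
# submissions for the controller and engine jobs.
bcbio_nextgen.py ../config/project.yaml -t ipython -s slurm -q general -n 500
```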
I need a bit of help understanding how the scheduling is working.
I've restarted a joint-caller pipeline on a SLURM cluster using IPython (500 cores requested). A limited set of engine jobs has been kicked off; they initially complete alignments and then go dormant with only a python script running. Is this expected? It could take a while for the other engine jobs in the array to start. Do all jobs in the array have to be running concurrently, or can jobs that have completed a stage be recycled?
I guess I'm trying to get a sense of whether I should drop the number of cores requested to an extremely conservative number when on a busy cluster.