Guidelines and scripts for efficiently running freebayes in parallel #465
@ekg are you envisioning using …
I would definitely use OpenMP to implement parallelization in freebayes.
Looking at the documentation as it pertains to compiling freebayes, there is a one-liner: "Note that the freebayes-parallel script and the programs on which it depends are not installed by this command." Is there any other documentation about enabling a parallel build? I'm not a biologist but rather the administrator of an HPC cluster, and I suspect that there may be additional steps involved.
You need GNU parallel and the tools in vcflib.
OK, we have GNU parallel; do the documented compilation instructions not build the tools in vcflib?
OK, the reason I ask is that if I do the following: `git clone git@github.com:ekg/freebayes.git --branch v1.2.0 --recurse-submodules`, then descend into that directory and run `make`, everything goes well until: `../vcflib/src/Variant.h:18:21: fatal error: tabix.hpp: No such file or directory` (edit: same error if I just run `make` in the vcflib directory...)
@ekg What is the purpose of the `pv -l` step in the sambamba step? That is not available on the cluster I'm using, but the `pv` utility seems to be a simple progress meter, which won't be useful for a batch job on the cluster in any case. Is this invoking a different `pv` command than the progress meter utility?
Remove `pv -l`; it is not important.
For anyone interested in running freebayes in parallel jobs while optimizing the way regions are created: I created toil_freebayes, a wrapper tool around freebayes that lets you run freebayes using toil. I split jobs by regions that have a fixed number of reads, instead of a fixed number of base pairs, to reduce the variance of runtime and memory usage across different regions, as @ekg has discussed above. I use the split_bed_by_reads script to create the regions from a BED file. The wrapper tool was mainly developed to support freebayes within our bioinformatics pipeline infrastructure. Hope this is useful to someone; happy to receive feedback and contributions.
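A minimal sketch of that read-balanced splitting idea (hypothetical function and window data, not the actual split_bed_by_reads implementation): group consecutive fixed-width windows into regions whose total read count is roughly constant, assuming all windows belong to one chromosome and are sorted.

```python
def split_by_reads(windows, reads_per_chunk):
    """Group consecutive (chrom, start, end, n_reads) windows into
    regions whose total read count is roughly reads_per_chunk."""
    regions, current, total = [], [], 0
    for chrom, start, end, n_reads in windows:
        current.append((chrom, start, end))
        total += n_reads
        if total >= reads_per_chunk:
            regions.append((current[0][0], current[0][1], current[-1][2]))
            current, total = [], 0
    if current:  # flush any remainder as a final (smaller) region
        regions.append((current[0][0], current[0][1], current[-1][2]))
    return regions

# Equal-width windows with very uneven read counts: the deep window
# closes its region early, so each job sees a similar amount of data.
windows = [("chr1", 0, 100, 50), ("chr1", 100, 200, 400),
           ("chr1", 200, 300, 60), ("chr1", 300, 400, 40)]
print(split_by_reads(windows, 100))  # [('chr1', 0, 200), ('chr1', 200, 400)]
```

The real tool also has to respect chromosome boundaries and handle zero-coverage gaps; this sketch only shows the balancing principle.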
Also, depending on your application, the dDocent pipeline (ddocent.com) can help with efficiently parallelizing freebayes.
— Jon Puritz, PhD, University of Rhode Island
Don't want to engage in too much issue necromancy, but it seems these approaches are all single-sample or pooled; many-sample joint calling is trickier. Any ideas there? Also, any word about bad effects from chopping up regions generally? I've been using a script that only cuts at gaps in the reference, since that should be safe, but as references get better the available safe cut points are becoming rarer. If there isn't a significant downside to cutting at arbitrary points, that would simplify things.
Does this mean it is not possible to use freebayes-parallel for joint, pooled calling?
I've been using freebayes all the time for joint calling of many samples (e.g. RADseq, exome capture) without any major issues.
I know; I was specifically thinking about the freebayes-parallel wrapper. I now parallelise it within Snakemake, which seems to get around these issues anyway. I've written about it in this blog post.
Yeah, sorry, that's what I meant.
I've been using Snakemake as well. I stopped using the more complicated approach of only splitting on Ns and haven't had any obvious problems.
@janxkoci Interesting. I could never get it to work for some reason; perhaps it was the pooled aspect. UPDATE: retested, and freebayes-parallel can perform both pooled and joint calling. @travc Thanks Travis, the script looks useful. I may incorporate it into a pipeline; if I do, I'll let you know, and I'll update the blog post. Currently my naive approach works fine, but the final genome region can occasionally take a while, so this should help.
@sanjaynagi Slurm and other cluster job managers tend to require that you tell them how much time (and other resources) to allocate per job. There are a lot of simple scripts to divide up the regions, including several that do it based on the approximate number of reads. The problem was that they tend to require a number of segments, whereas I wanted to specify an approximate data size per segment, so the job resource requirements would be more constant not just within a single dataset/run but also between runs. Not sure that makes sense. Anyway, if you aren't running on a cluster, it doesn't really matter that much.
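The point about sizing by data volume rather than segment count can be sketched like this (hypothetical numbers; `target_reads` would match whatever per-job memory/walltime budget you request from Slurm or PBS):

```python
import math

def plan_segments(total_reads, target_reads):
    """Derive the segment count from a fixed per-job read budget, so
    resource requests stay comparable across datasets of any size."""
    return max(1, math.ceil(total_reads / target_reads))

# A 60M-read run and a 300M-read run get different segment counts,
# but each job still sees ~5M reads and can share one resource request.
print(plan_segments(60_000_000, 5_000_000))   # 12 jobs
print(plan_segments(300_000_000, 5_000_000))  # 60 jobs
```

Fixing the per-segment size (rather than the segment count) is what keeps a single `sbatch`-style time/memory request valid across runs.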
We use Torque PBS and a modern cluster with decent-enough nodes (80 cores & 192 GB RAM per node) where I can run … I hope I understood you correctly, though: I generally use multiplexed libraries, where we pool samples, but they are individually barcoded and can be demultiplexed after sequencing. Then I map them individually and call freebayes on all BAM files to get a single VCF. I think this is called joint calling (at least that's how I recall the term from the dark days of using GATK; ugh, I hope those are over).
@travc That makes sense now, thanks. The server we have in-house does not use Slurm/SGE etc., so I haven't needed this thus far. @janxkoci Yeah, the data I am referring to would be actual pooled samples, which cannot be demultiplexed. In this case, I am calling multiple samples at the same time (joint calling), but these samples are also pooled (and thus I use the --pooled-discrete argument of freebayes).
Great work, all! @sanjaynagi and @travc, I am thinking we could add your scripts to the freebayes repo, plus a short description in the README discussing the GNU parallel, Snakemake and Slurm options. Does that make sense? If we make directories ./examples/snakemake and ./examples/slurm, you could drop your code there, or in ./scripts if it is a callable script. Please do a PR so we have your contribution visible.
@pjotrp Cool, I'll make a PR soon.
@pjotrp I'm also working with freebayes & Snakemake. I like @sanjaynagi's idea of having Snakemake handle parallelizing freebayes, but I'd prefer not to use his R script, since it would mean having to install R/tidyverse/data.table in addition to freebayes (which would add complexity in HPC/cloud computing situations). I've submitted a pull request (#689) to modify …
This issue is marked stale because it has been open 120 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
This issue was closed for lack of activity. Feel free to re-open if someone feels like working on it. |
This is an important step when making jobs for freebayes to run on a cluster. I've run into this recently and I've also discussed pitfalls in this process with several users. It seems that one of the major impediments to running freebayes in a highly parallel fashion is high variance in job memory and runtimes. This can be mitigated by making target regions that respect the underlying read density. (Another consideration is the allelic complexity, which I will discuss at the end of this post.)
The following snippet of shell script implements the depth-aware chunking. The basic idea is to use sambamba to produce a coverage map. We then count the total per-base coverage (`$base_cov`), divide this by the desired number of jobs (`$nchunks`), and then run another script on the output to produce the `--region`-compatible targets.
Now `target.regions` has a list of regions that you can run freebayes over, using e.g. the freebayes-parallel script or a cluster job-submission script of your own design. The bin width determination is based on coverage, so this tends to resolve the common issue that very deep regions have extremely long runtimes, often many orders of magnitude longer than an average region.
The other major issue that affects runtime can be allelic complexity combined with sample (or ploidy) number. To deal with this, it can help to limit the number of considered alleles to a low integer such as 7 or 10, via the option `--use-best-n-alleles`. In cases of extremely high ploidy and excessive depth, it can help to go as low as 2 or 3.
TODO to resolve this PR: put this documentation into the README and scripts directory.
Bonus: adjust freebayes to take the region list directly and run in parallel, without freebayes-parallel. This would require significant coding but is a frequent request.
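The coverage-based binning described in this issue can be sketched in plain Python (a simplified, hypothetical stand-in for the sambamba coverage-map pipeline, not the original shell snippet): split a per-base depth array into regions whose summed coverage is roughly `base_cov / nchunks` each.

```python
def coverage_to_regions(chrom, cov, nchunks):
    """Split a per-base coverage array into roughly nchunks regions of
    equal summed coverage (the depth-aware analogue of fixed-width
    windows). Region strings use freebayes' chrom:start-end format."""
    base_cov = sum(cov)             # total per-base coverage
    per_chunk = base_cov / nchunks  # coverage budget per region
    regions, start, acc = [], 0, 0
    for pos, depth in enumerate(cov):
        acc += depth
        if acc >= per_chunk:        # budget spent: close this region
            regions.append(f"{chrom}:{start}-{pos + 1}")
            start, acc = pos + 1, 0
    if start < len(cov):            # flush the remainder
        regions.append(f"{chrom}:{start}-{len(cov)}")
    return regions

# A deep spike closes its region early, so no single job inherits a
# disproportionately expensive stretch of the genome.
cov = [1, 1, 1, 30, 1, 1, 1, 1, 1, 1]
print(coverage_to_regions("chr1", cov, 2))  # ['chr1:0-4', 'chr1:4-10']
```

Each resulting string can be passed to freebayes via `--region`, e.g. one region per cluster job.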