
Guidelines and scripts for efficiently running freebayes in parallel #465

Closed
ekg opened this issue Apr 19, 2018 · 26 comments

Comments

@ekg
Collaborator

ekg commented Apr 19, 2018

Splitting the genome into balanced jobs is an important step when running freebayes on a cluster. I've run into this recently, and I've also discussed pitfalls in the process with several users. It seems that one of the major impediments to running freebayes in a highly parallel fashion is high variance in job memory and runtimes. This can be mitigated by making target regions that respect the underlying read density. (Another consideration is allelic complexity, which I will discuss at the end of this post.)

The following shell snippet implements the depth-aware chunking. The basic idea is to use sambamba to produce a coverage map. We then count the total per-base coverage ($base_cov), divide it by the desired number of jobs ($nchunks), and run an awk script over the coverage map to produce --region-compatible targets:

# Produce a per-base coverage map (chrom, pos, depth) with sambamba; pv -l just shows progress
sambamba depth base --combined aln.bam | cut -f 1-3 | pv -l | pigz >aln.bam.coverage.gz
# Sum the per-base coverage over the whole genome (skipping the header line)
base_cov=$(zcat aln.bam.coverage.gz | awk 'NR>1 { x += $3; } END { print x }')
# Emit regions that each contain roughly base_cov/nchunks bases of coverage
nchunks=1000; zcat aln.bam.coverage.gz | awk 'BEGIN { bin='$base_cov' / '$nchunks' } NR==1 { next } NR==2 { chr=$1; pos=$2; last=$2; } ($1==chr && sum < bin) { sum += $3; last=$2 } ($1!=chr || sum > bin) { print chr":"pos"-"last; sum = $3; chr=$1; pos=$2; last=$2; } END { print chr":"pos"-"last; } ' >targets.regions
# Call variants over the resulting regions, 32 at a time
freebayes-parallel targets.regions 32 -f ref.fa aln.bam >out.vcf

Now targets.regions holds a list of regions that you can run freebayes over, using e.g. the freebayes-parallel script or a cluster job submission script of your own design. The bin widths are determined by coverage, so this tends to resolve the common issue that very deep regions have extremely long runtimes, often many orders of magnitude longer than an average region.
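
For instance, a rough sketch of submitting one freebayes job per region on a SLURM cluster (the sbatch usage and file names here are illustrative, not part of the original snippet):

# Submit one freebayes job per region; region strings like chr1:0-100000 are valid in file names
while read region; do
  sbatch --wrap "freebayes -f ref.fa --region $region aln.bam >out.${region}.vcf"
done <targets.regions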

The other major issue that affects runtime is allelic complexity combined with the number of samples (or the ploidy). To deal with this, it can help to cap the number of considered alleles at a low integer such as 7 or 10, via the option --use-best-n-alleles. In cases of extremely high ploidy and excessive depth, it can help to go as low as 2 or 3.
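
As a hedged example (the value 4 and the file names are placeholders, not a recommendation from this post):

# Cap the number of alleles evaluated at each site to bound runtime in complex regions
freebayes -f ref.fa --use-best-n-alleles 4 aln.bam >out.vcf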

TODO to resolve this issue: put this documentation into the README and the scripts directory.

Bonus: adjust freebayes to take the region list directly and run in parallel, without freebayes-parallel. This would require significant coding but is a frequent request.

@tseemann
Contributor

@ekg are you envisioning using -fopenmp or a more explicit threading library?

@ekg
Collaborator Author

ekg commented May 2, 2018

I would definitely use OpenMP to implement parallelization in freebayes.

@drwhitehouse

Looking at the documentation on compiling freebayes, there is a one-liner:

"Note that the freebayes-parallel script and the programs on which it depends are not installed by this command."

Is there any other documentation about enabling a parallel build? I'm not a biologist but rather the administrator of an HPC cluster, and I suspect that there may be additional steps involved.

@ekg
Collaborator Author

ekg commented Sep 13, 2018 via email

@drwhitehouse

drwhitehouse commented Sep 13, 2018

O.k., we have GNU parallel - do the documented compilation instructions not build the tools in vcflib?
I have a script here from a colleague who no longer works here, and it refers to "make openmp" in the vcflib repo directory, so I am guessing there might be some additional steps.

@drwhitehouse

drwhitehouse commented Sep 13, 2018

O.k., the reason I ask is that if I do the following:

git clone git@github.com:ekg/freebayes.git --branch v1.2.0 --recurse-submodules

If I then descend into that directory and "make", everything goes well.
If, however, I first descend into the vcflib directory and "make openmp", then cd .. && make, the build fails with:

../vcflib/src/Variant.h:18:21: fatal error: tabix.hpp: No such file or directory

(edit: same error if I just "make" in the vcflib directory...)

@rwhetten
Contributor

@ekg What is the purpose of the pv -l step in the sambamba command? It is not available on the cluster I'm using, but the 'pv' utility seems to be a simple progress meter, which won't be useful for a batch job on the cluster in any case. Is this invoking a different 'pv' command than the progress-meter utility?

@ekg
Collaborator Author

ekg commented Jan 26, 2019 via email

@juanesarango

juanesarango commented Feb 18, 2019

For anyone interested in running freebayes in parallel jobs while optimizing the way regions are created, I created toil_freebayes, a wrapper around freebayes that allows you to run it using Toil.

I split jobs into regions that contain a fixed number of reads, instead of a fixed number of base pairs, to reduce the variance of runtime and memory usage across regions, as @ekg has discussed above. I use the split_bed_by_reads script to create the regions from a BED file.

The wrapper tool was mainly developed to support freebayes within our bioinformatics pipelines infrastructure. Hope this is useful to someone, and happy to receive feedback and contributions.
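
A rough shell sketch of the same idea (the window size and per-chunk read target are illustrative, and this is not the split_bed_by_reads script itself):

# Count reads per fixed window, then merge consecutive windows until each chunk
# holds roughly `target` reads
cut -f1,2 ref.fa.fai >ref.genome
bedtools makewindows -g ref.genome -w 10000 >windows.bed
bedtools coverage -counts -a windows.bed -b aln.bam \
  | awk -v target=1000000 '
      $1 != chr { if (chr != "" && cend > cstart) print chr"\t"cstart"\t"cend;
                  chr=$1; cstart=$2; sum=0 }
      { sum += $4; cend=$3 }
      sum >= target { print chr"\t"cstart"\t"cend; cstart=cend; sum=0 }
      END { if (cend > cstart) print chr"\t"cstart"\t"cend }' >chunks.bed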

@jpuritz

jpuritz commented Feb 18, 2019 via email

@travc
Contributor

travc commented Nov 14, 2019

Don't want to engage in too much issue necromancy, but it seems like these approaches are all single-sample or pooled... Many-sample joint calling is trickier. Any ideas there?

Oh, also, any word about bad effects from chopping up regions in general? I've been using a script which only cuts at gaps in the reference, since that should be safe, but as references get better the available safe cut points are becoming rarer. If there isn't a significant downside to cutting at arbitrary points, that would simplify things.

@sanjaynagi
Contributor

Don't want to engage in too much issue necromancy, but it seems like these approaches are all single or pool... Many sample joint calling is trickier. Any ideas there?

Does this mean it is not possible to use freebayes-parallel for joint, pooled calling?

@pjotrp pjotrp added this to the v1.3.5 milestone Dec 17, 2020
@janxkoci

does this mean it is not possible to use freebayes parallel for joint, pooled calling?

I've been using freebayes all the time for joint calling of many samples (e.g. RADseq, exome capture) without any major issue.

@sanjaynagi
Contributor

I know - I was specifically thinking about the freebayes-parallel wrapper. I now parallelise it within snakemake, which seems to get around these issues anyway. I've written about it in this blog post.

@janxkoci

this was specifically thinking about the freebayes-parallel wrapper

Yeah sorry, that's what I meant..

@travc
Contributor

travc commented Jan 25, 2021

I've been using snakemake as well. I stopped using the more complicated approach of only splitting on Ns and haven't had any obvious problems.
You might find this script useful, especially if you are running on a cluster where each job needs to be roughly the same size.
https://gist.github.com/travc/0c53df2c8eca81c3ebc36616869930ec

@sanjaynagi
Contributor

sanjaynagi commented Jan 25, 2021

@janxkoci Interesting. I could never get it to work for some reason - perhaps it was the pooled aspect.

UPDATE ^ - retested and freebayes-parallel can perform both pooled and joint calling.

@travc Thanks Travis, the script looks useful. I may incorporate it into a pipeline; if I do, I'll let you know and I'll update the blog post. Currently my naive approach works fine, but the final genome region can occasionally take a while, so this should help.

@travc
Contributor

travc commented Jan 25, 2021

@sanjaynagi Slurm and other cluster job managers tend to require you to tell them how much time (and other resources) to allocate per job. There are a lot of simple scripts to divide up the regions, including several which do it based on the approximate number of reads. The problem was that they tend to require you to give them a number of segments, whereas I wanted to specify an approximate data size per segment so that the job resource requirements would be more constant, not just across a single dataset/run but also between runs.

Not sure that makes sense. Anyways, if you aren't running on a cluster, it doesn't really matter that much.

@janxkoci

janxkoci commented Jan 25, 2021

We use Torque PBS and a modern cluster with decent-enough nodes (80 cores & 192 GB RAM per node), where I can run freebayes-parallel on a single node with my dataset, so the job script can be quite simple.

I hope I understood you right though - I generally use multiplexed libraries, where we pool samples, but they are individually barcoded and can be demultiplexed after sequencing. Then I map them individually and call freebayes on all the BAM files to get a single VCF. I think this is called joint calling (at least that's how I recall the term from the dark days of using GATK - ugh, I hope those are over).
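
A minimal sketch of that kind of many-sample joint call (file names are placeholders):

# Each sample is mapped to its own BAM; freebayes reads them all via --bam-list
# and emits a single joint VCF
ls sample*.bam >bam.list
freebayes -f ref.fa --bam-list bam.list >joint.vcf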

@sanjaynagi
Contributor

@travc That makes sense now, thanks. The server we have in-house does not use SLURM/SGE etc., so I haven't needed this thus far.

@janxkoci Yeah, so the data I am referring to would be actual pooled samples, which cannot be demultiplexed. In this case, I am calling multiple samples at the same time (joint calling), but these samples are also pooled (and thus I use the --pooled-discrete argument of freebayes).
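
A minimal sketch of that pooled setup (the ploidy value and file names are placeholders; here each BAM is assumed to be a pool of 10 diploid individuals):

# --pooled-discrete models each BAM as a pool of a known number of genome copies
freebayes -f ref.fa --pooled-discrete --ploidy 20 pool1.bam pool2.bam >pools.vcf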

@pjotrp
Collaborator

pjotrp commented Jan 31, 2021

Great work, all! @sanjaynagi and @travc, I am thinking we could add your scripts to the freebayes repo, plus a short section in the README discussing the GNU parallel, snakemake, and slurm options. Does that make sense? If we make directories ./examples/snakemake and ./examples/slurm, you could drop your code there, or in ./scripts if it is a callable script. Please open a PR so your contribution is visible.

@sanjaynagi
Contributor

@pjotrp Cool, I'll make a PR soon.

@mglubber
Contributor

@pjotrp I'm also working with freebayes & snakemake. I like @sanjaynagi's idea of having Snakemake handle parallelizing Freebayes, but I'd prefer not to use his R script, since it would mean having to install R/tidyverse/data.table in addition to Freebayes (which would add complexity in HPC/cloud computing situations). I've submitted a pull request (#689) to modify fasta_generate_regions.py to work with @sanjaynagi's Snakemake example (i.e. split into N total chunks rather than N-sized chunks; write output to bed files; allow for filtering chromosomes). The output with default options should be the same, so that the script still works with existing pipelines/examples.
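
For context, the fixed-size splitting that the unmodified script performs, as shown in the freebayes README (file names are illustrative):

# Split the reference into 100 kb windows and call over them with 36 parallel processes
freebayes-parallel <(fasta_generate_regions.py ref.fa.fai 100000) 36 -f ref.fa aln.bam >var.vcf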

@github-actions

This issue is marked stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions github-actions bot added the Stale label Jun 13, 2021
@github-actions

This issue was closed for lack of activity. Feel free to re-open if someone feels like working on it.
