
Guidelines and scripts for efficiently running freebayes in parallel #465

Closed
ekg opened this issue Apr 19, 2018 · 26 comments

Comments

@ekg
Collaborator

ekg commented Apr 19, 2018

Splitting the genome into balanced jobs is an important step when running freebayes on a cluster. I've run into this recently, and I've also discussed pitfalls in the process with several users. It seems that one of the major impediments to running freebayes in a highly parallel fashion is high variance in job memory and runtimes. This can be mitigated by making target regions that respect the underlying read density. (Another consideration is allelic complexity, which I will discuss at the end of this post.)

The following shell snippet implements the depth-aware chunking. The basic idea is to use sambamba to produce a coverage map. We then count the total per-base coverage ($base_cov), divide it by the desired number of jobs ($nchunks), and run an awk script over the coverage map to produce --region-compatible targets:

# Produce a per-base coverage map (chrom, pos, depth) with sambamba; pv -l just shows progress
sambamba depth base --combined aln.bam | cut -f 1-3 | pv -l | pigz >aln.bam.coverage.gz
# Sum the per-base coverage over the whole genome (skipping the header line)
base_cov=$(zcat aln.bam.coverage.gz | awk 'NR>1 { x += $3; } END { print x }')
# Emit regions that each contain roughly base_cov/nchunks bases of coverage
nchunks=1000; zcat aln.bam.coverage.gz | awk 'BEGIN { bin='$base_cov' / '$nchunks' } NR==1 { next } NR==2 { chr=$1; pos=$2; last=$2; } ($1==chr && sum < bin) { sum += $3; last=$2 } ($1!=chr || sum > bin) { print chr":"pos"-"last; sum = $3; chr=$1; pos=$2; last=$2; } END { print chr":"pos"-"last; } ' >targets.regions
# Call variants over the resulting regions, 32 at a time
freebayes-parallel targets.regions 32 -f ref.fa aln.bam >out.vcf

Now targets.regions holds a list of regions that you can run freebayes over, using e.g. the freebayes-parallel script or a cluster job submission script of your own design. The bin widths are determined by coverage, so this tends to resolve the common issue that very deep regions have extremely long runtimes, often many orders of magnitude longer than an average region.
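
For instance, a rough sketch of submitting one freebayes job per region on a SLURM cluster (the sbatch usage and file names here are illustrative, not part of the original snippet):

# Submit one freebayes job per region; region strings like chr1:0-100000 are valid in file names
while read region; do
  sbatch --wrap "freebayes -f ref.fa --region $region aln.bam >out.${region}.vcf"
done <targets.regions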

The other major issue that affects runtime is allelic complexity combined with the number of samples (or the ploidy). To deal with this, it can help to cap the number of considered alleles at a low integer such as 7 or 10, via the option --use-best-n-alleles. In cases of extremely high ploidy and excessive depth, it can help to go as low as 2 or 3.
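
As a hedged example (the value 4 and the file names are placeholders, not a recommendation from this post):

# Cap the number of alleles evaluated at each site to bound runtime in complex regions
freebayes -f ref.fa --use-best-n-alleles 4 aln.bam >out.vcf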

TODO to resolve this issue: put this documentation into the README and the scripts directory.

Bonus: adjust freebayes to take the region list directly and run in parallel, without freebayes-parallel. This would require significant coding but is a frequent request.

@tseemann
Contributor

@ekg are you envisioning using -fopenmp or a more explicit threading library?

@ekg
Collaborator Author

ekg commented May 2, 2018

I would definitely use OpenMP to implement parallelization in freebayes.

@drwhitehouse

Looking at the documentation on compiling freebayes, there is a one-liner:

"Note that the freebayes-parallel script and the programs on which it depends are not installed by this command."

Is there any other documentation about enabling a parallel build? I'm not a biologist but rather the administrator of an HPC cluster, and I suspect that there may be additional steps involved.

@ekg
Collaborator Author

ekg commented Sep 13, 2018 via email

@drwhitehouse

drwhitehouse commented Sep 13, 2018

O.k., we have GNU parallel - do the documented compilation instructions not build the tools in vcflib?
I have a script here from a colleague who no longer works here, and it refers to "make openmp" in the vcflib repo directory, so I am guessing there might be some additional steps.

@drwhitehouse

drwhitehouse commented Sep 13, 2018

O.k., the reason I ask is that if I do the following:

git clone git@github.com:ekg/freebayes.git --branch v1.2.0 --recurse-submodules

If I then descend into that directory and "make", everything goes well.
If, however, I first descend into the vcflib directory and "make openmp", then cd .. && make, the build fails with:

../vcflib/src/Variant.h:18:21: fatal error: tabix.hpp: No such file or directory

(edit: same error if I just "make" in the vcflib directory...)

@rwhetten
Contributor

@ekg What is the purpose of the pv -l step in the sambamba command? It is not available on the cluster I'm using, but the 'pv' utility seems to be a simple progress meter, which won't be useful for a batch job on the cluster in any case. Is this invoking a different 'pv' command than the progress-meter utility?

@ekg
Collaborator Author

ekg commented Jan 26, 2019 via email

@juanesarango

juanesarango commented Feb 18, 2019

For anyone interested in running freebayes in parallel jobs while optimizing the way regions are created, I created toil_freebayes, a wrapper around freebayes that allows you to run it using Toil.

I split jobs into regions that contain a fixed number of reads, instead of a fixed number of base pairs, to reduce the variance of runtime and memory usage across regions, as @ekg has discussed above. I use the split_bed_by_reads script to create the regions from a BED file.

The wrapper tool was mainly developed to support freebayes within our bioinformatics pipelines infrastructure. Hope this is useful to someone, and happy to receive feedback and contributions.
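
A rough shell sketch of the same idea (the window size and per-chunk read target are illustrative, and this is not the split_bed_by_reads script itself):

# Count reads per fixed window, then merge consecutive windows until each chunk
# holds roughly `target` reads
cut -f1,2 ref.fa.fai >ref.genome
bedtools makewindows -g ref.genome -w 10000 >windows.bed
bedtools coverage -counts -a windows.bed -b aln.bam \
  | awk -v target=1000000 '
      $1 != chr { if (chr != "" && cend > cstart) print chr"\t"cstart"\t"cend;
                  chr=$1; cstart=$2; sum=0 }
      { sum += $4; cend=$3 }
      sum >= target { print chr"\t"cstart"\t"cend; cstart=cend; sum=0 }
      END { if (cend > cstart) print chr"\t"cstart"\t"cend }' >chunks.bed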

@jpuritz

jpuritz commented Feb 18, 2019 via email

@travc
Contributor

travc commented Nov 14, 2019

Don't want to engage in too much issue necromancy, but it seems like these approaches are all single-sample or pooled... Many-sample joint calling is trickier. Any ideas there?

Oh, also, any word about bad effects from chopping up regions in general? I've been using a script which only cuts at gaps in the reference, since that should be safe, but as references get better the available safe cut points are becoming rarer. If there isn't a significant downside to cutting at arbitrary points, that would simplify things.

@sanjaynagi
Contributor

Don't want to engage in too much issue necromancy, but it seems like these approaches are all single or pool... Many sample joint calling is trickier. Any ideas there?

Does this mean it is not possible to use freebayes-parallel for joint, pooled calling?

@pjotrp pjotrp added this to the v1.3.5 milestone Dec 17, 2020
@janxkoci

does this mean it is not possible to use freebayes parallel for joint, pooled calling?

I've been using freebayes all the time for joint calling of many samples (e.g. RADseq, exome capture) without any major issue.

@sanjaynagi
Contributor

I know - I was specifically thinking about the freebayes-parallel wrapper. I now parallelise it within snakemake, which seems to get around these issues anyway. I've written about it in this blog post.

@janxkoci

this was specifically thinking about the freebayes-parallel wrapper

Yeah sorry, that's what I meant..

@travc
Contributor

travc commented Jan 25, 2021

I've been using snakemake as well. I stopped using the more complicated approach of only splitting on Ns and haven't had any obvious problems.
You might find this script useful, especially if you are running on a cluster where each job needs to be roughly the same size.
https://gist.github.com/travc/0c53df2c8eca81c3ebc36616869930ec

@sanjaynagi
Contributor

sanjaynagi commented Jan 25, 2021

@janxkoci Interesting. I could never get it to work for some reason - perhaps it was the pooled aspect.

UPDATE ^ - retested and freebayes-parallel can perform both pooled and joint calling.

@travc Thanks Travis, the script looks useful. I may incorporate it into a pipeline; if I do, I'll let you know and I'll update the blog post. Currently my naive approach works fine, but the final genome region can occasionally take a while, so this should help.

@travc
Contributor

travc commented Jan 25, 2021

@sanjaynagi Slurm and other cluster job managers tend to require you to tell them how much time (and other resources) to allocate per job. There are a lot of simple scripts to divide up the regions, including several which do it based on the approximate number of reads. The problem was that they tend to require you to give them a number of segments, whereas I wanted to specify an approximate data size per segment so that the job resource requirements would be more constant, not just across a single dataset/run but also between runs.

Not sure that makes sense. Anyways, if you aren't running on a cluster, it doesn't really matter that much.

@janxkoci

janxkoci commented Jan 25, 2021

We use Torque PBS and a modern cluster with decent-enough nodes (80 cores & 192 GB RAM per node), where I can run freebayes-parallel on a single node with my dataset, so the job script can be quite simple.

I hope I understood you right though - I generally use multiplexed libraries, where we pool samples, but they are individually barcoded and can be demultiplexed after sequencing. Then I map them individually and call freebayes on all the BAM files to get a single VCF. I think this is called joint calling (at least that's how I recall the term from the dark days of using GATK - ugh, I hope those are over).
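
A minimal sketch of that kind of many-sample joint call (file names are placeholders):

# Each sample is mapped to its own BAM; freebayes reads them all via --bam-list
# and emits a single joint VCF
ls sample*.bam >bam.list
freebayes -f ref.fa --bam-list bam.list >joint.vcf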

@sanjaynagi
Contributor

@travc That makes sense now, thanks. The server we have in-house does not use SLURM/SGE etc., so I haven't needed this thus far.

@janxkoci Yeah, so the data I am referring to would be actual pooled samples, which cannot be demultiplexed. In this case, I am calling multiple samples at the same time (joint calling), but these samples are also pooled (and thus I use the --pooled-discrete argument of freebayes).
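
A minimal sketch of that pooled setup (the ploidy value and file names are placeholders; here each BAM is assumed to be a pool of 10 diploid individuals):

# --pooled-discrete models each BAM as a pool of a known number of genome copies
freebayes -f ref.fa --pooled-discrete --ploidy 20 pool1.bam pool2.bam >pools.vcf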

@pjotrp
Collaborator

pjotrp commented Jan 31, 2021

Great work, all! @sanjaynagi and @travc, I am thinking we could add your scripts to the freebayes repo, plus a short section in the README discussing the GNU parallel, snakemake, and slurm options. Does that make sense? If we make directories ./examples/snakemake and ./examples/slurm, you could drop your code there, or in ./scripts if it is a callable script. Please open a PR so your contribution is visible.

@sanjaynagi
Contributor

@pjotrp Cool, I'll make a PR soon.

@mglubber
Contributor

@pjotrp I'm also working with freebayes & snakemake. I like @sanjaynagi's idea of having Snakemake handle parallelizing Freebayes, but I'd prefer not to use his R script, since it would mean having to install R/tidyverse/data.table in addition to Freebayes (which would add complexity in HPC/cloud computing situations). I've submitted a pull request (#689) to modify fasta_generate_regions.py to work with @sanjaynagi's Snakemake example (i.e. split into N total chunks rather than N-sized chunks; write output to bed files; allow for filtering chromosomes). The output with default options should be the same, so that the script still works with existing pipelines/examples.
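
For context, the fixed-size splitting that the unmodified script performs, as shown in the freebayes README (file names are illustrative):

# Split the reference into 100 kb windows and call over them with 36 parallel processes
freebayes-parallel <(fasta_generate_regions.py ref.fa.fai 100000) 36 -f ref.fa aln.bam >var.vcf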

@github-actions

This issue is marked stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions github-actions bot added the Stale label Jun 13, 2021
@github-actions

This issue was closed for lack of activity. Feel free to re-open if someone feels like working on it.
