Home
In order to use accuMUlate
you will need to compile it from source. The software has been tested with Ubuntu and OSX (using XCode 9). If you encounter any errors in trying to compile the software please open an issue and we will help you work out what's going wrong.
You can get the source code for accuMUlate
by downloading a realease, or cloning this repository.
git clone https://github.com/dwinter/accuMUlate.git
accuMUlate
uses CMake
to generate executable files on different operating systems. If you need to install CMake it is likely that your operating system's package manager can help you. With CMake installed you can prepare to compile accuMUlate
by moving into the "build" directory and running cmake
:
cd build/
cmake ..
CMake
may throw errors and this stage, letting you know that some of the required packages are not installed. Two of the required packages, Eigen and Boost program options can usually be installed via a package manager. The third (Bamtools) is available as a precompiled package on some systems, but might need to be compiled from source. If you need to install bamtools start by downloading the latest release and unpacking the contents. The package can then be installed following the project's own documentation.
mkdir build
cd build
cmake ..make
make
make install
By default, the installation step will place files in /usr/local/
. You may not have write access to this directory, or you might just prefer to install bamtools locally. If so, you can set a custom installation address using cmake
. Here I create a folder called local
in my home directory and install the files there.
mkdir ~/local/
cmake -DCMAKE_INSTALL_PREFIX=~/local/ ..
make
make install
If the prerequisites are installed compiling accuMUlate
should be as simple as entering the build/
directory and typing
cmake ..
make
Which will create an executable file called accuMUlate
in the build/
directory.
accuMUlate
requires a single BAM file containing data from all of your samples mapped to a reference genome. Because all of the sequencing data is included in a single file, it is crucial that the header of this BAM file contains read-group information, allowing reads to be assigned to samples. accuMUlate
treats the sample (SM:
) field of the read-group tag as the identifier for a unique mutation accumulation line (or ancestor).
If your data is contained in multiple BAM files (perhaps one per sample), a single one can be generated by samtools merge
(using the -r
flag to create @RG headers) or Picard's MergeSamFiles
.
The accuMUlate
model includes data from an ancestor. If you have not sequenced the ancestor of your experiment you can nevertheless run the program (accuMUlate
will use the data from descendants to infer the ancestral state at each locus). You will, however, need to include a readgroup for the ancestor in the header of the bam file. Like all readgroups, this will need ID
an SM
fields.
To run accuMUlate
you will need a bam file and a reference genome. Although all of the options to accuMUlate
can be set via the command line, in practice it will usually be easier to set at least some of these via a .ini
file, which can be specified using the -c
argument. Details of all of the options that can be passed to the program are described here. The accuMulate-tools
repository contains scripts that may be useful for producing entries in an .ini file. For example, the sample information and GC-content can be extractd from a bam file header and reference genome.
samtools view -H bam/realigned.bam | ./extract_samples.py [ANCESTOR-NAME] - >> params.ini
./GC_content.py [REFERENCE-GENOME] >> params.ini
Once the files are prepared, accuMUlate
can be called (with results printed to screen, or written to a file using the -o
option):
./accuMUlate -c parameters.ini -b alignment.bam -r reference_genome.fasta > putative_mutations.tsv
accuMUlate
produces a tab-delimited table of results. Because the first three rows match those of a bed file (chromosome, start, end), this file can be used as such in downstream analysis or subsequent calls to accuMUlate
(e.g. to test the impact of parameter values of mutation probabilities). In addition to the position of putative mutations, the output contains data on the mutant sample, mutant and ancestral genotypes and a suite of summary statistics that can be used to identify possible false positives.
A complete description of the output format is provided here, and detailed information on the summary statistics accuMUlate
calculates for each site is available on this page.
Very often the goal of a mutation accumulation experiment is to estimate the rate of mutation. This requires knowing both the number of mutations (the numerator) and the number of sites at which you had the power to detect mutations (the denominator). accuMUlate
comes with a program called denominate
that tries to calculate the number of callable sites. It does this by simulating mutations in your data (by changing the bases in sequencing reads) and attempting to call a mutation. denominate
takes all of the arguments accuMUlate
does, but also allows you to set the following filtering criteria:
--min-depth arg (=0) Minimum sequencing depth for a site to be included
--max-depth arg (=4294967295) Maximum sequencing depth for a site to be included
--min-mutant-strand arg (=0) Minimum number of alleles supporting the mutant on each strand
--max-anc-in-mutant arg (=4294967295) Maximum number of ancestral alleles in mutant sample
--max-mutant-in-anc arg (=4294967295) Maximum number of mutant alleles in ancestral samples
--max-MQ-AD arg (=inf) Maximum value of the AD test for mapping quality differences
--max-insert-AD arg (=inf) Maximum value of the AD test for insert length differences
--min-strand-pval arg (=0) Minimum p-value for strand bias
--min-mapping-pval arg (=0) Minimum p-value for paired-mapping bias
As with accuMUlate
all of these arguments can be specified via a config file
./denominate -c denom_params.ini -r ref_genome.fasta -b alignment.fasta > denom.out
The output of denominate
is a single row of counts of callable sites, the first four values are the number of ancestrally "A", "C", "G" and "T" bases that could be called for the first sample and each subsequent sample is represented by four more columns. The accumulate-tools repostiory includes an R script with functions to read these files and manipulate them.
A rmarkdown document demonstrating a down-stream analysis can be seen in athal_analysis.Rmd
file in this repository.
Running accuMUlate
on whole-genome data with many samples can be time-consuming. Fortunately, the mutation-calling model treats each site in the genome independently, meaning the process can be parallelized very easily. The program itself can only make use of a single processor, but it is possible to make use of multiple processors using GNU parallel.
The first step in running accuMUlate
in parallel is to generate a set of non-overlapping windows from the reference genome. This can be achieved using the bedtools
command makewindows
, where the -w
option to specifies a window size. To use bedtools makewindows
, the reference genome must first be converted into a dictionary containing the chromosome name and chromosome length, a python script to do this is available in accuMUlate-tools
.
dictionary_converter.py [reference_genome] > reference_genome.dict
To use accuMUlate
in parallel, each window generated should be written into its own file (here in a temporary directory)
mkdir -p tmp
bedtools makewindows -g reference_genome.dict -w 100000 | split -l 1 - tmp/
Finally, it is possible to combine GNU parallel
and the -i
option of accuMUlate
to call mutations from each of these windows:
parallel -j [n. CPUs] ./accuMUlate -c parameters.ini -b alignment.bam -r reference_genome.fasta -i {} ::: temporary_directory/* > unsorted_out.txt`
When running accuMUlate
with multiple processors, the results are printed in whatever order they are generated, and can be sorted into the correct order using the command sort
:
sort -k1,1 -k2,2n unsorted_out.txt` > results_final.out