Skip to content
David Winter edited this page Oct 30, 2017 · 39 revisions

Compiling

In order to use accuMUlate you will need to compile it from source. The software has been tested with Ubuntu and OSX (using XCode 9). If you encounter any errors in trying to compile the software please open an issue and we will help you work out what's going wrong.

Getting the source code

You can get the source code for accuMUlate by downloading a realease, or cloning this repository.

git clone https://github.com/dwinter/accuMUlate.git

Prerequisites

accuMUlate uses CMake to generate executable files on different operating systems. If you need to install CMake it is likely that your operating system's package manager can help you. With CMake installed you can prepare to compile accuMUlate by moving into the "build" directory and running cmake:

cd build/
cmake ..

CMake may throw errors and this stage, letting you know that some of the required packages are not installed. Two of the required packages, Eigen and Boost program options can usually be installed via a package manager. The third (Bamtools) is available as a precompiled package on some systems, but might need to be compiled from source. If you need to install bamtools start by downloading the latest release and unpacking the contents. The package can then be installed following the project's own documentation.

mkdir build
cd build
cmake ..make
make
make install

By default, the installation step will place files in /usr/local/. You may not have write access to this directory, or you might just prefer to install bamtools locally. If so, you can set a custom installation address using cmake. Here I create a folder called local in my home directory and install the files there.

mkdir ~/local/
cmake -DCMAKE_INSTALL_PREFIX=~/local/ ..
make
make install

Compile

If the prerequisites are installed compiling accuMUlate should be as simple as entering the build/ directory and typing

cmake ..
make

Which will create an executable file called accuMUlate in the build/ directory.

Preparing your BAM files

accuMUlate requires a single BAM file containing data from all of your samples mapped to a reference genome. Because all of the sequencing data is included in a single file, it is crucial that the header of this BAM file contains read-group information, allowing reads to be assigned to samples. accuMUlate treats the sample (SM:) field of the read-group tag as the identifier for a unique mutation accumulation line (or ancestor).

If your data is contained in multiple BAM files (perhaps one per sample), a single one can be generated by samtools merge (using the -r flag to create @RG headers) or Picard's MergeSamFiles.

The accuMUlate model includes data from an ancestor. If you have not sequenced the ancestor of your experiment you can nevertheless run the program (accuMUlate will use the data from descendants to infer the ancestral state at each locus). You will, however, need to include a readgroup for the ancestor in the header of the bam file. Like all readgroups, this will need ID an SM fields.

Running accuMUlate

To run accuMUlate you will need a bam file and a reference genome. Although all of the options to accuMUlate can be set via the command line, in practice it will usually be easier to set at least some of these via a .ini file, which can be specified using the -c argument. Details of all of the options that can be passed to the program are described here. The accuMulate-tools repository contains scripts that may be useful for producing entries in an .ini file. For example, the sample information and GC-content can be extractd from a bam file header and reference genome.

samtools view -H bam/realigned.bam | ./extract_samples.py [ANCESTOR-NAME] - >> params.ini
./GC_content.py [REFERENCE-GENOME] >> params.ini 

Once the files are prepared, accuMUlate can be called (with results printed to screen, or written to a file using the -o option):

./accuMUlate -c parameters.ini -b alignment.bam -r reference_genome.fasta > putative_mutations.tsv

Making sense of the results

accuMUlate produces a tab-delimited table of results. Because the first three rows match those of a bed file (chromosome, start, end), this file can be used as such in downstream analysis or subsequent calls to accuMUlate (e.g. to test the impact of parameter values of mutation probabilities). In addition to the position of putative mutations, the output contains data on the mutant sample, mutant and ancestral genotypes and a suite of summary statistics that can be used to identify possible false positives.

A complete description of the output format is provided here, and detailed information on the summary statistics accuMUlate calculates for each site is available on this page.

Calculating a denominator

Very often the goal of a mutation accumulation experiment is to estimate the rate of mutation. This requires knowing both the number of mutations (the numerator) and the number of sites at which you had the power to detect mutations (the denominator). accuMUlate comes with a program called denominate that tries to calculate the number of callable sites. It does this by simulating mutations in your data (by changing the bases in sequencing reads) and attempting to call a mutation. denominate takes all of the arguments accuMUlate does, but also allows you to set the following filtering criteria:

  --min-depth arg (=0)                  Minimum sequencing depth for a site to be included
  --max-depth arg (=4294967295)         Maximum sequencing depth for a site to   be included
  --min-mutant-strand arg (=0)          Minimum number of alleles supporting the mutant on each strand 
  --max-anc-in-mutant arg (=4294967295) Maximum number of ancestral alleles in  mutant sample
  --max-mutant-in-anc arg (=4294967295) Maximum number of mutant alleles in ancestral samples
  --max-MQ-AD arg (=inf)                Maximum value of the AD test for  mapping quality differences
  --max-insert-AD arg (=inf)            Maximum value of the AD test for insert length differences
  --min-strand-pval arg (=0)            Minimum p-value for strand bias
  --min-mapping-pval arg (=0)           Minimum p-value for paired-mapping bias

As with accuMUlate all of these arguments can be specified via a config file

./denominate -c denom_params.ini -r ref_genome.fasta -b alignment.fasta > denom.out

The output of denominate is a single row of counts of callable sites, the first four values are the number of ancestrally "A", "C", "G" and "T" bases that could be called for the first sample and each subsequent sample is represented by four more columns. The accumulate-tools repostiory includes an R script with functions to read these files and manipulate them.

A rmarkdown document demonstrating a down-stream analysis can be seen in athal_analysis.Rmd file in this repository.

Running in parallel

Running accuMUlate on whole-genome data with many samples can be time-consuming. Fortunately, the mutation-calling model treats each site in the genome independently, meaning the process can be parallelized very easily. The program itself can only make use of a single processor, but it is possible to make use of multiple processors using GNU parallel.

Generating genomic windows from a reference genome

The first step in running accuMUlate in parallel is to generate a set of non-overlapping windows from the reference genome. This can be achieved using the bedtools command makewindows, where the -w option to specifies a window size. To use bedtools makewindows, the reference genome must first be converted into a dictionary containing the chromosome name and chromosome length, a python script to do this is available in accuMUlate-tools.

dictionary_converter.py [reference_genome] > reference_genome.dict

To use accuMUlate in parallel, each window generated should be written into its own file (here in a temporary directory)

mkdir -p tmp
bedtools makewindows -g reference_genome.dict -w 100000 | split -l 1 - tmp/

Calling mutations from each window

Finally, it is possible to combine GNU parallel and the -i option of accuMUlate to call mutations from each of these windows:

parallel -j [n. CPUs] ./accuMUlate -c parameters.ini -b alignment.bam -r reference_genome.fasta -i {} ::: temporary_directory/* > unsorted_out.txt`

When running accuMUlate with multiple processors, the results are printed in whatever order they are generated, and can be sorted into the correct order using the command sort:

sort -k1,1 -k2,2n unsorted_out.txt` > results_final.out
Clone this wiki locally