Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
Introduction ------------ MaCS is a simulator of the coalescent process that simulates geneologies spatially across chromosomes as a Markovian process. The algorithm is similar to the Wiuf and Hein algorithm (Wiuf and Hein, 1999) in that an ancestral recombination graph (ARG) is stored in memory. Where the algorithm deviates from the Wiuf and Hein are the following: 1) Recombination events occur only on the local geneology at the current position on the sequence instead of anywhere on the ARG, but can coalesce to any lineage on the ARG including those not on the local geneology (i.e. a non-ancestral edge) 2) Waiting times (i.e. the distance between recombinations on the sequence) are calculated from exponential draws with intensity based on the local geneology's branch length instead of the ARG length 3) The algorithm is n-th order Markovian where n is based on a parameter the user enters. This makes the algorithm more general than one like FastCoal which is 1st order. Higher values provide a better approximation to the coalescent. These changes make the algorithm substantially more efficient than the Wiuf and Hein with little loss in accuracy. MaCS also supports all the demographic history semantics of MS. Typing ./macs with no arguments at the command line lists the usage parameters. Most command line arguments are the same as those in ms. This document briefly summarizes how one compiles and runs MaCS. Requirements: g++ C++ boost development library (http://www.boost.org) - If one has yum package manager, this is a matter of simply typing (under root): yum install boost-devel Compilation: There are two executables. MaCS is the simulator itself. msformatter takes in generated data from the simulator and generates output that is compatible with that from Hudson's ms. To compile everything: make all Compiling the simulator: make macs Compiling the MS formatter: make msformatter For moderate sample sizes and sequence lengths, a typical command line would like like: ./macs 100 1e6 -T -t .001 -r .001 -h 1e2 -R example_input/hotspot.txt -F example_input/ascertainment.txt 0 2>trees.txt | ./msformatter > haplotypes.txt The program is designed to send run-time information such as tree statistics to STDERR and simulated output to STDOUT. This allows one to pipe desired output to other programs that may process data differently (e.g. a database loader) such as msformatter. The command line arguments above says: Simulate 100 sequences on a region 1e6 basepairs long. The per base pair mutation and recombination rate scaled at 4N is .001. The h parameter approximates the Markovian order by instructing the program to include all geneologies from the current point to 1e2 base pairs to the left if one considers simulation proceeding from the left end to the right end of the entire sequence. Any branches unique to the geneology that is beyond 1e2 base pairs is pruned from the ARG. -T tells MACS to output the local trees in Newick format similar to MS output. The option -R instructs the program to read in a variable recombination file. The first line of the file hotspot.txt says that from position 0 to .3 (unit scaled on entire sequence to be simulated), the cM to Mb ratio is 0.57. If there are any coordinates not covered by ranges in the file, the ratio defaults to 1. The option -F instructs the program to read in a SNP ascertainment file. Entering 1 after the filename instructs the program to assume any derived allele frequency (DAF) > .5 to have a DAF of 1-DAF. This might be useful for scenarios where one is interested only in the minor allele frequency and the identity of the derived allele is unknown. The flag 0 disables this behavior. The first line of the file ascertainment.txt instructs the program to completely filter out all SNPs with DAF of range 0 to 0.01. The second line says to filter SNPs with DAF from 0.01 to 0.05 to the point where SNPs in this DAF range comprise 1% of the SNPs output. For very large sample sizes and/or sequence lengths, we recommend storing output in a text file to postprocess later (e.g. import into a database) ./macs 10000 1e9 -t .001 -r .001 -h 1e2 -F example_input/ascertainment.txt 0 2>trees.txt 1> sites.txt For comments or questions, please post an issue at http://code.google.com/p/macs