GitHub - gchen98/macs: Automatically exported from code.google.com/p/macs

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Introduction
------------

MaCS is a simulator of the coalescent process that simulates genealogies spatially across chromosomes as a Markovian process. The algorithm is similar to the Wiuf and Hein algorithm (Wiuf and Hein, 1999) in that an ancestral recombination graph (ARG) is stored in memory. It deviates from the Wiuf and Hein algorithm in the following ways:

1) Recombination events occur only on the local genealogy at the current position on the sequence instead of anywhere on the ARG, but the recombining lineage can coalesce to any lineage on the ARG, including those not on the local genealogy (i.e. a non-ancestral edge).
2) Waiting times (i.e. the distances between recombinations on the sequence) are calculated from exponential draws with intensity based on the local genealogy's branch length instead of the ARG length.
3) The algorithm is n-th order Markovian, where n is based on a parameter the user enters. This makes the algorithm more general than one like FastCoal, which is 1st order. Higher values provide a better approximation to the coalescent.

These changes make the algorithm substantially more efficient than the Wiuf and Hein algorithm with little loss in accuracy.
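
As a rough illustration of point 2, the distance to the next recombination breakpoint can be drawn from an exponential distribution whose rate grows with the total branch length of the local genealogy. The Python sketch below is not taken from the MaCS source. The function name, the parameter names, and the exact scaling of the rate (here simply the per-site rate times the branch length) are assumptions made for illustration.

```python
import random


def next_breakpoint_distance(local_tree_length, rho_per_site, rng=random):
    """Draw the distance (in sites) to the next recombination breakpoint.

    Assumption: rate = rho_per_site * local_tree_length. MaCS may apply a
    different scaling constant; only the dependence on the local tree's
    branch length is taken from the description above.
    """
    rate = rho_per_site * local_tree_length
    if rate <= 0:
        return float("inf")  # no recombination can occur
    return rng.expovariate(rate)


# Example: a local tree with total branch length 4.0 and rate .001 per site.
rng = random.Random(1)
distance = next_breakpoint_distance(4.0, 0.001, rng)
```

Because the rate uses only the local tree rather than the whole ARG, it is cheap to compute at every step along the sequence.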

MaCS also supports all the demographic history semantics of ms. Running ./macs with no arguments at the command line lists the usage parameters. Most command line arguments are the same as those in ms.

This document briefly summarizes how one compiles and runs MaCS.

Requirements:
g++
C++ boost development library (http://www.boost.org)
- If the yum package manager is available, install it by typing (as root):
yum install boost-devel

Compilation:
There are two executables. macs is the simulator itself. msformatter takes the data generated by the simulator and produces output compatible with that of Hudson's ms.

To compile everything:
make all

Compiling the simulator:
make macs

Compiling the MS formatter:
make msformatter

For moderate sample sizes and sequence lengths, a typical command line would look like:

./macs 100 1e6 -T -t .001 -r .001 -h 1e2 -R example_input/hotspot.txt -F example_input/ascertainment.txt 0 2>trees.txt | ./msformatter > haplotypes.txt

The program is designed to send run-time information such as tree statistics to STDERR and simulated output to STDOUT. This allows one to pipe the desired output to other programs, such as msformatter, that may process the data differently (e.g. a database loader).

The command line arguments above say:
Simulate 100 sequences on a region 1e6 base pairs long. The per base pair mutation rate (-t) and recombination rate (-r), each scaled by 4N, are both .001. The -h parameter approximates the Markovian order by instructing the program to include all genealogies from the current point to 1e2 base pairs to the left, if one considers simulation proceeding from the left end to the right end of the entire sequence. Any branches unique to a genealogy that lies beyond 1e2 base pairs are pruned from the ARG. -T tells MaCS to output the local trees in Newick format, similar to ms output.
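
The effect of -h can be pictured as a sliding window over the local trees already simulated. The Python sketch below only tracks which trees still fall inside the window; it does not model the ARG or the branch pruning that MaCS actually performs. The function name and the (end_position, label) tuple layout are hypothetical.

```python
from collections import deque


def prune_history(history, current_pos, h):
    """Drop local trees that ended more than h base pairs left of current_pos.

    history is a deque of (end_position, label) tuples ordered left to right
    (a hypothetical layout chosen for this sketch). Returns the pruned labels.
    """
    pruned = []
    while history and history[0][0] < current_pos - h:
        pruned.append(history.popleft()[1])
    return pruned


# Three local trees ending at 50, 180 and 260 bp; simulation is now at 300 bp.
history = deque([(50, "tree0"), (180, "tree1"), (260, "tree2")])
removed = prune_history(history, current_pos=300, h=100)
```

With h set to 100, only trees ending at or after position 200 stay in memory, so a larger h keeps more history at the cost of more memory.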

The option -R instructs the program to read in a variable recombination file. The first line of the file hotspot.txt says that from position 0 to .3 (unit scaled on entire sequence to be simulated), the cM to Mb ratio is 0.57. If there are any coordinates not covered by ranges in the file, the ratio defaults to 1.
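
The lookup described above can be sketched in Python as follows. The file layout assumed here (three whitespace-separated fields per line: start, end, ratio) and the half-open treatment of each range are assumptions, not a statement of how MaCS parses the file. Only the first line of the example data is taken from the description above.

```python
def parse_rate_map(text):
    """Parse lines of 'start end ratio' into a list of tuples.

    The three-column layout is an assumption made for this sketch.
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        start, end, ratio = map(float, fields)
        segments.append((start, end, ratio))
    return segments


def ratio_at(segments, position, default=1.0):
    """Return the cM/Mb ratio at a position scaled to the interval [0, 1]."""
    for start, end, ratio in segments:
        if start <= position < end:  # half-open range: an assumption
            return ratio
    return default  # uncovered coordinates default to 1


segments = parse_rate_map("0 0.3 0.57\n")
inside = ratio_at(segments, 0.1)   # covered by the first range
outside = ratio_at(segments, 0.5)  # not covered, falls back to the default
```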
The option -F instructs the program to read in a SNP ascertainment file. Entering 1 after the filename instructs the program to treat any derived allele frequency (DAF) > .5 as having a DAF of 1-DAF. This might be useful for scenarios where one is interested only in the minor allele frequency and the identity of the derived allele is unknown. The flag 0, as used in the command above, disables this behavior. The first line of the file ascertainment.txt instructs the program to completely filter out all SNPs with a DAF in the range 0 to 0.01. The second line says to filter SNPs with a DAF from 0.01 to 0.05 to the point where SNPs in this DAF range comprise 1% of the SNPs output.
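
The folding rule controlled by the flag after the filename can be written as a one-line transformation. The Python sketch below illustrates only that rule; the function name is hypothetical and the frequency-bin filtering itself is not modeled.

```python
def fold_daf(daf, fold):
    """Return the frequency used for ascertainment.

    With fold enabled (flag 1), a DAF above .5 is replaced by 1 - DAF,
    i.e. the minor allele frequency. With fold disabled (flag 0), the
    DAF is returned unchanged.
    """
    if fold and daf > 0.5:
        return 1.0 - daf
    return daf


folded = fold_daf(0.8, fold=True)     # treated as roughly 0.2
unfolded = fold_daf(0.8, fold=False)  # stays at 0.8
```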

For very large sample sizes and/or sequence lengths, we recommend storing the output in text files to postprocess later (e.g. to import into a database):

./macs 10000 1e9 -t .001 -r .001 -h 1e2 -F example_input/ascertainment.txt 0 2>trees.txt 1> sites.txt

For comments or questions, please post an issue at http://code.google.com/p/macs