cosi2: Efficient simulator of exact and approximate coalescent with selection
cosi2 is an efficient coalescent simulator with support for selection, population structure, variable recombination rates, and gene conversion. It supports exact and approximate simulation modes.
This is the user’s manual for cosi2 coalescent simulator.
Downloading and installing cosi2
cosi2 can be downloaded from http://broadinstitute.org/mpg/cosi2 .
Quick start instructions:
tar xvfz cosi-2.0.tar.gz cd cosi-2.0 ./configure make
coalescent -p paramfile -m
will execute the simulation, writing results to standard output in ms simulator format.
The basic invocation format is
coalescent -p paramfile -m
The -p paramfile option specifies the parameter file, which is a text file that describes the demographic model to be simulated. The -m option requests output to be written to stdout in the format of Richard Hudson’s ms simulator.
Format of the parameter file
The parameter file defines the population structure and other controlling parameters for the run, using keywords. Comments are indicated by “#” at the beginning of a line.
Specifying the simulated region
The length of the simulated region, in basepairs, is specified as:
length <length in bp>
The mutation rate is specified as:
mutation_rate <mutation rate per bp per generation>
The recombination rate is specified as follows:
specifies the file describing the genetic map. The file has two columns, separated by whitespace. The first column gives a basepair position; the second column sets the crossover recombination rate, per generation, from that point until either the end of the region or the position specified by the next line. The basepair positions in the first column must be in strictly increasing order. The first line of the genetic map file also specifies the recombination rate from the beginning of the region to the basepair position of that line. Thus a one-line genetic map file specifies a constant recombination rate for the region.
The command-line parameter
-R can be used to specify the genetic map file, overriding any specified in the
recosim, described later in this manual, can be used to generate a genetic map file with a
specified distribution of recombination hotspots.
Gene conversion is specified by the following parameters:
gene_conversion_relative_rate <rate of gene conversion initiation relative to crossover recombination rate> gene_conversion_mean_tract_length <mean gene conversion tract length in bp> gene_conversion_min_tract_length <minimum gene conversion tract length in bp>
gene_conversion_relative_rate is a proportionality factor giving the ratio of the
gene conversion initiation rate to the crossover recombination rate. If omitted, it is set to zero,
disabling gene conversion. Gene conversion tract length is distributed geometrically, with the mean
gene_conversion_mean_tract_length; the distribution is sampled until a value of at least
gene_conversion_min_tract_length is obtained.
Specifying the populations
Any population that appears in the simulation, either as a source of samples or in the history of those samples, must be defined in the parameter file; at least one sampled population is required. The syntax for defining a population is
pop_define <pop id> <label> pop_size <pop id> <size> sample_size <pop id> <n samples>
<pop id> is an integer ID of the population, used to refer to the population when specifying demographic
<label> is human-readable name for the population (a string, with no spaces, and not put in quotes).
Population size, here and elsewhere, is specified as the number of diploids in the population.
Sample size is specified as the number of haploid samples.
For example, the following entries
pop_define 1 European pop_size 1 10000 sample_size 1 50
define population 1 (with the label “European”) and set the effective present-day population size to be 10,000 diploid individuals and the number of sampled haploid chromosomes to be 50.
Specifying the demographic history
Parameters that define the demographic history of the populations are specified as follows. They can be supplied in
any order. In the entries below, the time
<T> is measured in generations (which can be fractional) and increases
going into the past (present = 0). Labels are used only to provide documentation; they must be enclosed in double
<pop id> denoes the integer population id from the
pop_define line. Population sizes are given
as the number of diploid individuals.
pop_event change_size <label> <pop id> <T> <size for time > T>
Set the size of the population
<pop id> from time
<T> pastward. The setting affects the population size
<T> pastward until the next (in the pastward direction)
pop_event that affects population size.
Note that the size at generation 0 is set by a
pop_size line when the population is defined.
pop_event exp_change_size <label> <pop id> <Tend> <Tstart> <final size> <start size>
Specify exponential expansion (in the forward time sense) from size
<start size> at generation
<Tstart> to size
<final size> at generation
<Tstart> must be pastward of
(numerically greater than)
<Tend>. Population size pastward from
<start size> until
changed by another
pop_event. For example,
pop_event exp_change_size "expansion" 1 50 500 10000 1000
represents an exponential population increase in population 1 that started 500 generations ago and ended 50 generations ago, increasing from 1000 to 10000.
pop_event bottleneck <label> <pop id> <T> <inbreeding coefficient>
defines a population bottleneck (point-like reduction in population size) in population
<pop id> at time
with bottleneck strength given by
pop_event migration_rate <label> <source pop id> <target pop id> <T> <probability of migration/chrom/gen>
sets the migration rate from population
<source pop id> to population
<new pop id> from time
as the probability of migration per chromosome per generation.
pop_event split <label> <source pop id> <new pop id> <T>
specifies a population split goint forward (population join going pastward). In the forward sense,
this specifies the origin of population
<new pop id>, when it splits off from population
<source pop id>;
prior to time
<T>, the population =<new pop id> does not exist (is empty).
pop_event admix <label> <admixed pop id> <source pop id> <T> <fraction of admixed chroms from source>
specifies an admixture event.
Specifying the selected sweep:
pop_event sweep <pop> <Tend> <selection coefficient> <position of causal allele (0.0-1.0)> <freq at Tend>
<pop> gives the population in which the advantageous allele is born.
<Tend> gives the generation at which sweep
ends (in the forward-time sense). The position of the advantageous allele is specified as a floating-point number in
the range 0.0-1.0, giving its relative position within the simulated region (for example, 0.5 puts the advantageous
allele in the middle of the region). The frequency of the advantageous allele at time
<Tend> is specified as a
number in the range 0.0-1.0 (not as a chromosome count).
random_seed <integer seed>
specifies the random seed used for the simulations. This value can be overridden by the
-r command-line option.
tells cosi to use the infinite-sites model. This is the default with the new
ms output format, but with the legacy
output format (described below) this option must be specified; otherwise, mutations are filtered so that at most one
lands within each basepair-length stretch of the simulated region.
The following describes the main command-line options of
Specifying the model: -p [ --paramfile ] arg parameter file -R [ --recombfile ] arg genetic map file (if specified, overrides the one in paramfile) -n [ --nsims ] arg (=1) number of simulations to output -r [ --seed ] arg (=0) random seed (0 to use current time) -J [ --trajfile ] arg file from which to read sweep trajectory. It has two columns: first column gives the generation, second gives the fraction of chromosomes in the sweep population carrying the derived (advantageous) allele. -u [ --max-coal-dist ] arg (=1) for Markovian approximation mode, the level of approximation: the maximum distance between node hulls for coalescence to be allowed. Distance is specified as a fraction of the total length of the simulated region, in the range [0.0-1.0]; 1.0 (default) means no approximation. Specifying the output format: -o [ --outfilebase ] arg base name for output files in cosi format -m [ --outms ] write output to stdout in ms format Specifying output details: -P [ --output-precision ] arg number of decimal places used for floats in the outputs -M [ --write-mut-ages ] output mutation ages -L [ --write-recomb-locs ] output recombination locations Misc options: -h [ --help ] produce help message -V [ --version ] print version info and compile-time options -v [ --verbose ] verbose output -g [ --show-progress ] [=arg(=10)] print a progress message every N simulations
The main output format, specified by the
-m option, is that used by Richard Hudson’s ms simulator .
The samples are written to the standard output.
-m, the generation at which each mutation occurred can be additionally output by adding the -M option.
The mutation times are output following a “muttimes:” header on a line immediately after the “positions:” line of each
simulated sample. (See ms documentation for details of ms output format). Also, recombination locations can be output
by adding the
-L option; recombination locations are then output following a “recomblocs: ” header on a line after the
“positions” line, and after the “muttimes” line if present. Precision (number of digits after the decimal point) of the
output may controlled by adding
-P <ndigits> command-line option; it is especially useful when using recombination
hotspots, since many recombination locations will then share the first few decimal digits.
A legacy format, used by older versions of
cosi, is also supported. It is specified by the
-o basename is specified, then for each population
the mutation locations are written to the file named
basename.pos-<pop id> and the haplotypes are written
basename.hap-<pop id>. If multiple simulations are done in one run with the
-n command-line option,
the simulation number is appended to
basename. In the
.hap- files, ‘1’ denotes the derived allele and ‘2’
the ancestral allele. Note that in the legacy format, a finite-sites mutation model is used unless
infinite_sites yes is specified in the parameter file.
Generating recombination maps using
recosimulate <parameter file name> <region size (bp)>.
recosimulate without arguments prints the list of
valid parameters. Note that the binary name is
but the program is called
recosim and its source is in the
Valid entries in the parameter file are as follows. They can be in any order, and all are optional.
outfile <output file name> [default="model.out"] model <0,1> [default=0] model 0: uniform recombination, constant or drawn from distribution. model 1: model 0 + gamma-distributed hotspots. baserate <mean recomb (cM/Mb)> [default=1.0] distribution <recomb distr. file name> [default=none (const value)] space <mean hotspot spacing (bp)> [default=9000] distance_shape <gamma function shape param> [default=1.0] intensity_shape <gamma function shape param> [default=0.3] local_shape <gamma function shape param, local variation> [default=0.3] local_size <size of region of local variation (bp) (e.g. 100000)> [default=50000000] bkgd <fraction in flat bkgd> [default=0.1] random_seed <integer seed> (0=>picked by program based on time and PID) [default=0]
If model 0 is selected, the recombination rate for the region will be
constant. The constant value can be set directly, using the
baserate keyword, or alternatively the value can be chosen at random
from a distribution file supplied with the
baserate.) The format for the
distribution file is three records per line:
bin_start bin_end cumulative_fraction
bin_end specify a range of recombination rates (in
cM/Mb) and the cumulative fraction is the probability that the
recombination rate lies within this or earlier bins. Entries should
be in order of increasing rate; see
examples/bestfit/autosomes_deCODE.distr for an example.
If model 1 is selected, the recombination rate varies across the
region; the variation can be on both local and fine scales. With this
distribution parameters are still valid,
but they now set the expected value of the recombination rate in the
entire region, the value around which local rates vary. A fraction of
the mean rate can be kept constant across the region, using the
parameter. The remainder of the recombination rate varies locally in
windows across the region, with the size of the window controlled by
local_size; that is, if
local_size is set to 100 kb, a
new value is chosen every 100 kb. The value is chosen from a gamma
distribution (with shape parameter set by
local_shape), with a mean
value determined by the regional rate (and the background fraction).
Within each window, recombination is clustered into point-like
hotspots of recombination. These have a gamma-distributed intensity
with shape parameter
intensity_shape (and mean determined by the
local rate), and a gamma-distributed spacing with shape parameter
distance_shape and mean set by parameter
With model=1 and a small value for
intensity_shape, there is
a small but extended tail at very high recombination rates; when
simulating long sequences, this can make the coalescent simulator take
orders of magnitude longer on a small fraction of runs. I have
therefore found it useful to truncate the tail within
is a commented-out line for doing so in the code.)
A final option is
random_seed, which permits the user to specify a
seed for the random number generator; this is useful for debugging or
recreating a previous run. If a seed of zero is supplied, or
the keyword is not found, a random seed will be generated from the
time and process id of the job. In any case, the random seed used is
always output to stdout during execution.
User-supplied recombination map
As an alternative to using
recosim, you can supply your own recombination map to
coalescent; the (tab-delimited) file format is
<position (kb)> <recomb prob/bp/generation>
Each line specifies the recombination rate that will be used from that position until the next specified position, or the end of the sequence. The first line also specifies the recombination rate from the beginning of the simulated region to that line’s position.
Some working examples are included in the
“Cosi2: an efficient simulator of exact and approximate coalescent with selection”, by Ilya Shlyakhter Pardis C. Sabeti and Stephen F. Schaffner, Bioinformatics (2014) 30 (23): 3427-3429. doi: 10.1093/bioinformatics/btu562 .