# Quick Guide

Welcome to the *quick guide* tutorial for *ipcoal*. This page introduces the major concepts of coalescent simulation and gives a concise overview of several types of statistical evolutionary analyses that can be performed in *ipcoal*. The documentation demonstrates *ipcoal* through a series of examples that combine analyses with visualizations, in the style of a Jupyter notebook, where Python code can be executed interactively. The Python package *toytree* is installed alongside *ipcoal* and should typically be imported with it, as shown below, since the two are intended to work hand-in-hand.

Follow along in this guide to learn how to set up a parameterized demographic model; simulate coalescent genealogies and sequences; calculate likelihoods of observed data; apply phylogenetic inference tools to sequences; and compare inferred gene trees to simulated genealogies. Explore the documentation to learn more details about each of these steps and explore other features of *ipcoal*.

In [1]:
import ipcoal
import toytree

### A population model
A core concept in evolutionary genetics is the use of population demographic models to represent a set of simplifying assumptions about how gene copies are replicated and inherited from one generation to the next (see Kingman coalescent). Complex evolutionary simulations can extend and relax many of these simplifying assumptions to develop highly detailed models of molecular evolution to emulate almost any natural phenomenon. However, simple models of evolution are also very useful, as they provide a means to establish null expectations for patterns of diversity and divergence in the absence of more complex processes. The coalescent model sits in the latter category, as a model for simulating neutral evolution given the parameters of a population demographic model. The major application of the coalescent is typically to treat the genealogical variation at any particular region of the genome as a random variable, and to examine the distribution of genealogies from many parts of the genome -- and/or DNA sequences evolved on those genealogies -- to infer features of the demographic model under which they evolved.

The main object in *ipcoal* is the `ipcoal.Model` class, which represents a parameterized population demographic model. It can be used to perform coalescent simulations, to fit simulated data to model parameters, and to perform additional analyses on simulated trees or sequences. It is easy to set up parameterized `Model` objects to represent a single population, multiple structured populations (i.e., a species tree), or a network of connected populations (i.e., a phylogenetic network).

### Single population
The simplest population demographic model is a single panmictic population with constant diploid effective population size (Ne). In *ipcoal* this can be initiated as a `Model` class object provided with an `Ne` parameter. Here I also set the `nsamples` and `seed_trees` parameters to set the number of gene copies to sample, and a random seed for coalescent simulation, respectively. More details about these function calls to simulate trees and draw visualizations will be explained further below. The key concept here is to understand that the genealogy (shown with green edges) represents one randomly sampled history for a set of six gene copies in a population with diploid effective size of 1000. The visualization, which shows a gene tree embedded in a grey rectangle, represents an *embedding* of the genealogy within the demographic model (i.e., the model is a container within which the genealogy must fit). In this example the gene copies coalesce to a common ancestor a little over 2000 generations ago.

In [119]:
# setup a single-population demographic model
model = ipcoal.Model(Ne=1000, nsamples=6, seed_trees=123)

# simulate one genealogical tree under this model
model.sim_trees(nloci=1, nsites=1)

# draw the first genealogy embedded in the demographic model
model.draw_demography(idx=0);
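
As a rough check on the time scale in this example, the expected time to the most recent common ancestor (TMRCA) under the standard Kingman coalescent can be computed by hand. The sketch below does not use *ipcoal*. It assumes diploid scaling, in which k lineages wait on average 4Ne / (k(k-1)) generations before the next coalescence, and the function names are illustrative only.

```python
import random


def expected_tmrca(ne: float, nsamples: int) -> float:
    """Expected generations until nsamples gene copies share one ancestor."""
    # With k lineages the mean waiting time is 4Ne / (k * (k - 1)) generations.
    return sum(4 * ne / (k * (k - 1)) for k in range(2, nsamples + 1))


def simulate_tmrca(ne: float, nsamples: int, rng: random.Random) -> float:
    """Draw one TMRCA by summing exponential waiting times."""
    total = 0.0
    for k in range(nsamples, 1, -1):
        rate = k * (k - 1) / (4 * ne)  # coalescence rate per generation
        total += rng.expovariate(rate)
    return total


rng = random.Random(123)
draws = [simulate_tmrca(1000, 6, rng) for _ in range(20_000)]
mean_tmrca = sum(draws) / len(draws)

print(f"expected TMRCA: {expected_tmrca(1000, 6):.1f} generations")
print(f"simulated mean: {mean_tmrca:.1f} generations")
```

The expectation is 4Ne(1 - 1/n), about 3333 generations for Ne=1000 and six gene copies, so a genealogy that coalesces a little over 2000 generations ago is an unremarkable draw from a wide distribution.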

### Population divergence
As a slightly more complex example, we can next set up a simulation that involves population structure in the form of a species tree. Here we use `toytree` to create a `ToyTree` object that represents a tree with three lineages that have equally spaced divergence times and a crown height of 1e4 generations (note: the `toytree` library can be used to create a species tree in a number of ways). This tree object can be passed as the first argument to `ipcoal.Model` to designate the number of populations and their relationships. We can then set a single Ne to apply to all populations, or different Ne values for each. Here we set a single value. Similarly, we could sample the same number of gene copies from each population, or sample different numbers from each, as we do here using a dict. You can hover your cursor over the visualization to view additional information about each population interval in the demographic model. This will list each interval's index ID, Ne value, and length in units of generations (t$_g$) and coalescent units (t$_c$). Here we can see by *embedding* the simulated genealogy into the species tree that the gene copies from the orange clade ("r1") do not all coalesce before the last divergence event, causing incomplete lineage sorting.

In [120]:
# set up a 3-tip species tree with root at 1e4 generations
sptree = toytree.rtree.imbtree(ntips=3, treeheight=1e4)

# set up a demographic model using the species tree
model = ipcoal.Model(tree=sptree, Ne=5000, nsamples={0: 2, 1: 3, 2: 4}, seed_trees=1234)

# simulate one genealogy
model.sim_trees(1)

# draw the genealogy embedded in the species tree model
model.draw_demography(idx=0);

<div class="admonition tip">
    <p class="admonition-title">Branch lengths on demographic models</p>
    <p>See Demography and Species Trees for tips on translating branch lengths on empirical trees from units of absolute time or coalescent units to generations.</p>
</div>
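
The unit conversions mentioned in the tip can be sketched in a few lines. This example assumes the diploid convention in which one coalescent unit equals 2Ne generations. The helper names are hypothetical and are not part of *ipcoal*, and the placement of the two divergences at 5e3 and 1e4 generations is an assumption based on the equally spaced tree described above.

```python
import math


def generations_to_coalescent_units(t_g: float, ne: float) -> float:
    """Convert generations to coalescent units, assuming t_c = t_g / (2 * Ne)."""
    return t_g / (2 * ne)


def coalescent_units_to_generations(t_c: float, ne: float) -> float:
    """Inverse conversion: t_g = t_c * 2 * Ne."""
    return t_c * 2 * ne


def years_to_generations(years: float, generation_time: float) -> float:
    """Convert absolute time to generations given years per generation."""
    return years / generation_time


# Internal branch of the 3-tip example: 5000 generations at Ne=5000
# (assuming the two divergences sit at 5e3 and 1e4 generations).
t_c = generations_to_coalescent_units(5000, 5000)

# Probability that two lineages entering the branch have not coalesced
# by the time they reach its older end.
p_no_coal = math.exp(-t_c)

print(f"t_c = {t_c}, P(no coalescence) = {p_no_coal:.3f}")
```

Under these assumptions the internal branch is 0.5 coalescent units long, which leaves roughly a 61% chance that a given pair of lineages remains distinct across it. This is why incomplete lineage sorting is unsurprising in the example above.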

### *ipcoal* and *msprime*
Coalescent simulations in *ipcoal* are performed using function calls from *msprime*, which stores results in the TreeSequence format implemented in *tskit*. We strive to keep *ipcoal* up to date with new versions of *msprime* and *tskit* and to implement requested features like new substitution models, rate maps, or demographic modelling functions. However, we do not aim to implement every feature of *msprime*, as that would be redundant. In *ipcoal*, users can also access `TreeSequence` objects as the result of simulations (see Interaction with tskit), but in our default setting these objects are discarded and only a summarized tabular result is stored. In this way, *ipcoal* should be viewed as a complementary tool to *msprime* and *tskit*, not as a replacement. It relies heavily on these tools for simulations; however, *ipcoal* has an entirely separate code base for its data analysis tools (see Phylogenetic Inference and Likelihood).

### Simulation functions
An `ipcoal.Model` object has three methods for coalescent simulation: `sim_trees()`, `sim_loci()`, and `sim_snps()` (see Simulation Functions). Each of these serves a different purpose and accepts a number of arguments to modify its behavior. Under the hood, they represent different algorithms that make function calls in *msprime*. If you intend to set up highly complex simulations it may often be advantageous to perform your simulations in *msprime* directly, rather than using *ipcoal*. The main advantages of *ipcoal* come from the use of these functions, and from its more limited scope and minimalist ethos, which make it easier to simulate and analyze data focused on phylogenetic trees (e.g., newick trees or sequence alignments).

#### sim_trees
The `sim_trees` function is the simplest and fastest simulation function. It generates only coalescent trees as a result, and does not add mutations. It takes two arguments, `nloci` and `nsites`. In *ipcoal* we always treat loci as being independent of one another. You can think of them as separate chromosomes. The length of each locus is represented by some number of sites. To simulate completely unlinked genealogies we can request the genealogy from a single site (`nsites=1`) from multiple independent loci. If you set `nsites` > 1, and the `Model`'s recombination rate is > 0, then recombination events can occur within a locus, giving rise to multiple linked genealogies (i.e., a tree sequence or ARG).

In [191]:
# simulate a single genealogy (i.e., for 1 locus at one site)
model.sim_trees()

In [192]:
# same as above (showing default arguments)
model.sim_trees(nloci=1, nsites=1)

In [193]:
# simulate 10 independent loci each containing one genealogy
model.sim_trees(10)

In [194]:
# simulate 1 locus of len=10000. May contain multiple trees if recomb.
model.sim_trees(nloci=1, nsites=1e4)

#### sim_loci
The `sim_loci` function is a simple extension of the `sim_trees` function that adds mutations to the simulated trees using the mutation rate and substitution model stored to the `Model` object. If you are only interested in tree data then you should use `sim_trees`, whereas if you are interested in both trees and sequences then you should use this function.

In [196]:
# simulate 2 loci each 100bp in length
model.sim_loci(nloci=2, nsites=100)
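
To get a feel for how much variation `sim_loci` should produce, the expected number of segregating sites in a single panmictic population can be approximated with Watterson's formula, E[S] = 4 · Ne · μ · L · a_n, where a_n is the sum of 1/i for i = 1..n-1. The sketch below is independent of *ipcoal*. The mutation rate of 1e-8 is an assumed value chosen for illustration, and the formula treats every mutation as creating a new variable site.

```python
def harmonic_number(nsamples: int) -> float:
    """a_n = sum of 1/i for i = 1 .. nsamples - 1."""
    return sum(1 / i for i in range(1, nsamples))


def expected_segregating_sites(ne, mut, nsites, nsamples):
    """Watterson expectation for a constant-size diploid population."""
    theta_per_site = 4 * ne * mut
    return theta_per_site * nsites * harmonic_number(nsamples)


# Assumed values: Ne=1e5, mu=1e-8 per site per generation, 5 gene copies.
for nsites in (100, 1_000, 10_000):
    exp_s = expected_segregating_sites(1e5, 1e-8, nsites, 5)
    print(f"nsites={nsites:>6}: expect about {exp_s:.1f} variable sites")
```

Under these assumptions a 100bp locus is expected to carry fewer than one variable site, while a 10kb locus is expected to carry roughly 83, so short loci will often be invariant unless Ne or the mutation rate is large.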

#### sim_snps
The `sim_snps` function is the most complex. It is used to generate *unlinked SNPs*. This allows you to perform coalescent simulations with mutation while also conditioning on the observation of variation. Several options in this function affect how the conditioning works, and these can be toggled to trade off potential biases against speed. Note that this function can be very slow if both the mutation rate and the coalescent times of your trees are very small.

In [197]:
# simulate 5 unlinked SNPs
model.sim_snps(nsnps=5)
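
The slowdown can be understood with a simple rejection-sampling argument: if a SNP is obtained by simulating a genealogy, dropping mutations on it, and keeping the site only when it is variable, then the expected number of attempts per SNP is the inverse of the probability that at least one mutation lands on the tree. The sketch below illustrates that argument only. It is not *ipcoal*'s implementation, the Ne and mutation rate values are assumptions, and the probability is evaluated at the mean tree length as a rough guide.

```python
import math


def prob_variable_site(mut: float, total_branch_length: float) -> float:
    """P(at least one mutation) at one site, with Poisson-distributed mutations."""
    return 1 - math.exp(-mut * total_branch_length)


def expected_total_branch_length(ne: float, nsamples: int) -> float:
    """Mean summed branch length (generations), single diploid population."""
    return 4 * ne * sum(1 / i for i in range(1, nsamples))


for ne in (1e3, 1e4, 1e5):
    length = expected_total_branch_length(ne, nsamples=5)
    p_var = prob_variable_site(1e-8, length)
    print(f"Ne={ne:.0e}: P(variable)={p_var:.2e}, ~{1 / p_var:.0f} tries per SNP")
```

Because the probability of a variable site scales with the product of mutation rate and tree length, reducing either one by a factor of ten raises the expected number of attempts by about the same factor.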

### Linked genealogies
An important distinction that we highlight in *ipcoal* is whether you are simulating linked or unlinked data. Unlinked data represents independent draws from a distribution, whereas linked data represents correlated draws, meaning that the next data point is influenced by the previous one. In the context of a genome we expect that regions located on different chromosomes are independent of each other, whereas sites that are located close together on the same chromosome are not independent. In the context of the coalescent, the correlation among nearby regions of the genome reflects the fact that one or more samples share the same ancestors in both regions. Recombination causes this similarity to decay since it has the effect of causing different genomic regions to trace back to different sampled ancestors. The ability to simulate correlated tree sequences over large genomic regions using algorithms that approximate the coalescent with recombination has opened many new opportunities for studying genome-wide genealogical patterns.

There are instances where we may be interested in simulating linked data to study the effect of recombination, or alternatively, we may sometimes wish to simulate unlinked data. Many population genetic and phylogenetic inference tools assume that data are unlinked. One useful application of *ipcoal* is to generate linked and unlinked datasets to explore the effect of linkage on analytical results.

<div class="admonition question">
    <p class="admonition-title">Why do you need to specify `nsites` when simulating genealogies?</p>
    <p>If the simulation includes recombination (which it does by default) then a single locus extending over more than one site may actually represent multiple coalescent genealogies if a recombination crossover occurred in the history of the samples at that locus. See the linked genealogy example below.</p>
</div>
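
A rough expectation for how `nsites` affects the number of genealogies follows from a branch-length argument: at each of the `nsites - 1` junctions between adjacent sites, recombination events fall on the local genealogy at a rate of `recomb` per generation of branch length. The sketch below applies this to a single constant-size diploid population. It is an approximation that does not use *ipcoal*, the parameter values are illustrative, and because some recombination events leave the genealogy unchanged the result is better read as an upper guide than as a prediction of the number of rows in a results table.

```python
def expected_breakpoints(ne, recomb, nsites, nsamples):
    """Expected recombination events on the local trees of one locus."""
    harmonic = sum(1 / i for i in range(1, nsamples))
    mean_tree_length = 4 * ne * harmonic  # generations, summed over branches
    return recomb * (nsites - 1) * mean_tree_length


# A single site has no junctions, so it can only ever hold one genealogy.
print(expected_breakpoints(ne=1e5, recomb=1e-9, nsites=1, nsamples=5))

# Longer loci accumulate breakpoints in proportion to their length.
for nsites in (1_000, 10_000, 100_000):
    n_breaks = expected_breakpoints(1e5, 1e-9, nsites, 5)
    print(f"nsites={nsites:>7}: about {n_breaks:.1f} breakpoints expected")
```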

### Simulation results
The results of a simulation function call are stored to the `Model` object. The main results are the `.df` dataframe and `.seqs` numpy array.

#### dataframe (.df)
All simulation functions generate a `.df` dataframe as a result. This is a pandas DataFrame object that can be used to access the results. **What does this table show?** You can see in the **unlinked** results table that 10 different loci are represented, numbered 0-9 in the “locus” column. Each locus is represented by only a single site, stretching from start=0 to end=1. Each is 1bp in length and contains no SNPs, since a single site rarely carries a mutation and here we are interested only in the genealogies. The tree index column (tidx) is 0 in every row, meaning that each genealogy is the first (and here only) tree simulated in that locus. Finally, the results of greatest interest are in the final column, genealogy, which contains newick strings.

Now look at the **linked** results table. In contrast to the previous table we see that multiple rows correspond to each locus ID. The first genealogy stretches from position 0 (start=0) to position 3380 (end=3380) and it is 3380bp in length. Following down the table we can see that recombination has broken this locus into several intervals of different lengths, each represented by a slightly different genealogy. Each genealogy has a different tree index (tidx) indicating its order along the locus.

Of course it is hard to tell from the table how different these genealogies are. The next step is to use visualization tools and statistical analyses to compare trees.

In [229]:
# simulate unlinked genealogies
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123)
mod.sim_loci(nloci=10, nsites=1)
mod.df

| | locus | start | end | nbps | nsnps | tidx | genealogy |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 0 | 0 | (p_1:287537.351532411179... |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | (p_3:110157.870294456748... |
| 2 | 2 | 0 | 1 | 1 | 0 | 0 | ((p_0:1313.6091112858539... |
| 3 | 3 | 0 | 1 | 1 | 0 | 0 | (p_0:314684.858664838480... |
| 4 | 4 | 0 | 1 | 1 | 0 | 0 | (p_1:147864.053017845842... |
| 5 | 5 | 0 | 1 | 1 | 0 | 0 | (p_0:93326.6807769358565... |
| 6 | 6 | 0 | 1 | 1 | 0 | 0 | (p_3:143619.984664270363... |
| 7 | 7 | 0 | 1 | 1 | 0 | 0 | ((p_2:70699.778029327673... |
| 8 | 8 | 0 | 1 | 1 | 0 | 0 | ((p_0:14776.914259654118... |
| 9 | 9 | 0 | 1 | 1 | 0 | 0 | ((p_1:17689.798992805852... |


In [230]:
# simulate linked genealogies
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123)
mod.sim_loci(nloci=2, nsites=1e4)
mod.df

| | locus | start | end | nbps | nsnps | tidx | genealogy |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3380 | 3380 | 23 | 0 | ((p_1:12194.366367953536... |
| 1 | 0 | 3380 | 4888 | 1508 | 11 | 1 | ((p_1:12194.366367953536... |
| 2 | 0 | 4888 | 6243 | 1355 | 11 | 2 | ((p_1:12194.366367953536... |
| 3 | 0 | 6243 | 6811 | 568 | 4 | 3 | (p_4:504559.306961862370... |
| 4 | 0 | 6811 | 6910 | 99 | 0 | 4 | ((p_1:12194.366367953536... |
| 5 | 0 | 6910 | 6971 | 61 | 1 | 5 | ((p_1:12194.366367953536... |
| 6 | 0 | 6971 | 8463 | 1492 | 29 | 6 | ((p_1:12194.366367953536... |
| 7 | 0 | 8463 | 10000 | 1537 | 12 | 7 | ((p_1:12194.366367953536... |
| 8 | 1 | 0 | 7642 | 7642 | 22 | 0 | ((p_2:13777.964596574553... |
| 9 | 1 | 7642 | 7935 | 293 | 2 | 1 | ((p_0:25416.063867531960... |
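
The structure of the linked table can be checked with a few lines of plain Python. The rows below are copied from the table above, with the truncated newick column omitted. The checks confirm that `nbps` equals `end - start` and that consecutive intervals within a locus tile it without gaps. With a real `Model` the same columns are available on the pandas DataFrame `mod.df`.

```python
from collections import defaultdict

# (locus, start, end, nbps, nsnps, tidx) copied from the linked results table
rows = [
    (0, 0, 3380, 3380, 23, 0),
    (0, 3380, 4888, 1508, 11, 1),
    (0, 4888, 6243, 1355, 11, 2),
    (0, 6243, 6811, 568, 4, 3),
    (0, 6811, 6910, 99, 0, 4),
    (0, 6910, 6971, 61, 1, 5),
    (0, 6971, 8463, 1492, 29, 6),
    (0, 8463, 10000, 1537, 12, 7),
    (1, 0, 7642, 7642, 22, 0),
    (1, 7642, 7935, 293, 2, 1),
]

by_locus = defaultdict(list)
for locus, start, end, nbps, nsnps, tidx in rows:
    assert nbps == end - start  # interval length matches its coordinates
    by_locus[locus].append((start, end, nsnps))

for locus, intervals in by_locus.items():
    # each interval should begin where the previous one ended
    contiguous = all(a[1] == b[0] for a, b in zip(intervals, intervals[1:]))
    ntrees = len(intervals)
    total_snps = sum(nsnps for _, _, nsnps in intervals)
    print(f"locus {locus}: {ntrees} trees shown, {total_snps} SNPs, contiguous={contiguous}")
```

Only the rows displayed above are included, so the second locus appears shorter than its full 10,000 sites.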


In [228]:
# draw the first tree
mod.draw_genealogy(idx=0, scale_bar=1e3);

# draw N trees on the same scale
mod.draw_genealogies(idxs=range(4), scale_bar=1e3, shared_axes=True);

#### sequences (.seqs)
The other major result of simulation is sequences. These are stored efficiently as a numpy array in the `.seqs` attribute of a `Model` class object. Users can interact with this array directly, or use a number of convenience functions. Most often, users will want to use the `write` functions to write data to a number of different formats, or to use tools for analyzing the sequence variation directly.

In [255]:
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123)
mod.sim_loci(nloci=1, nsites=70)
mod.seqs

array([[[0, 0, 0, 0, 3, 3, 1, 3, 3, 2, 0, 3, 0, 2, 2, 3, 0, 3, 2, 0, 0,
         1, 2, 0, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 1, 3, 3, 1, 2, 0, 2, 0,
         3, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 2, 1, 3, 2, 1, 3, 1, 3, 2, 1,
         1, 0, 1, 2, 3, 1, 1],
        [0, 0, 0, 0, 3, 3, 1, 3, 3, 2, 0, 3, 0, 2, 2, 3, 0, 3, 2, 0, 0,
         1, 2, 0, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 1, 3, 3, 1, 2, 0, 2, 0,
         3, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 2, 1, 3, 2, 1, 3, 1, 3, 2, 1,
         1, 0, 1, 2, 3, 1, 1],
        [0, 0, 0, 0, 3, 3, 1, 3, 3, 2, 0, 3, 0, 2, 2, 3, 0, 3, 2, 0, 0,
         1, 2, 0, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 1, 3, 3, 1, 2, 0, 2, 0,
         3, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 2, 1, 3, 2, 1, 3, 1, 3, 2, 1,
         1, 0, 1, 2, 3, 1, 1],
        [0, 0, 0, 0, 3, 3, 1, 3, 3, 2, 0, 3, 0, 2, 2, 3, 0, 3, 2, 0, 0,
         1, 2, 0, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 1, 3, 3, 1, 2, 0, 2, 0,
         3, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 2, 1, 3, 2, 1, 3, 1, 3, 2, 1,
         1, 0, 1, 2, 3, 1, 1],
        [0, 

In [250]:
# simulate 1 locus and write as phylip format
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123)
mod.sim_loci(nloci=1, nsites=70)
print(mod.write_concat_to_phylip())

5 70
p_0        AAAATTCTTGATAGGTATGAACGAACGTGTTATGCTTCGAGATACAACCACACGCTGCTCTGCCACGTCC
p_1        AAAATTCTTGATAGGTATGAACGAACGTGTTATGCTTCGAGATACAACCACACGCTGCTCTGCCACGTCC
p_2        AAAATTCTTGATAGGTATGAACGAACGTGTTATGCTTCGAGATACAACCACACGCTGCTCTGCCACGTCC
p_3        AAAATTCTTGATAGGTATGAACGAACGTGTTATGCTTCGAGATACAACCACACGCTGCTCTGCCACGTCC
p_4        AAAATTCTTGATAGGTATGAACGAACGTGTTATGCTTCGAGATACAACCACACGCTGCTCTGCCACGTCC
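
Comparing the integer array with the phylip output shows how the two representations correspond: the values 0, 1, 2, and 3 line up with A, C, G, and T. The sketch below verifies this for the first row using a plain list copied from the array above. With a real `Model`, `mod.seqs` is a numpy array and the same lookup applies.

```python
# first row of the .seqs array shown above (70 sites)
first_row = [
    0, 0, 0, 0, 3, 3, 1, 3, 3, 2, 0, 3, 0, 2, 2, 3, 0, 3, 2, 0, 0,
    1, 2, 0, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 1, 3, 3, 1, 2, 0, 2, 0,
    3, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 2, 1, 3, 2, 1, 3, 1, 3, 2, 1,
    1, 0, 1, 2, 3, 1, 1,
]

# mapping inferred by comparing the array with the phylip output
BASES = "ACGT"

decoded = "".join(BASES[i] for i in first_row)
print(decoded)

# the sequence printed for p_0 by write_concat_to_phylip() above
phylip_p_0 = "AAAATTCTTGATAGGTATGAACGAACGTGTTATGCTTCGAGATACAACCACACGCTGCTCTGCCACGTCC"
print(decoded == phylip_p_0)
```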


In [252]:
# simulate 5 unlinked SNPs and write as VCF format
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123)
mod.sim_snps(5)
mod.write_vcf()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,p_0,p_1,p_2,p_3,p_4
0,1,1,.,G,A,99,PASS,.,GT,0|0,0|0,0|0,0|0,1|1
1,2,1,.,C,T,99,PASS,.,GT,0|0,0|0,1|1,0|0,0|0
2,3,1,.,T,G,99,PASS,.,GT,1|1,0|0,1|1,1|1,1|1
3,4,1,.,G,A,99,PASS,.,GT,1|1,0|0,1|1,0|0,0|0
4,5,1,.,C,T,99,PASS,.,GT,1|1,0|0,0|0,0|0,1|1
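
Genotype columns in this output can be summarized directly. The sketch below counts ALT alleles per SNP from rows copied from the table above, using only string operations. The helper name is hypothetical, and for real analyses a dedicated VCF parser would be preferable.

```python
# (CHROM, POS, REF, ALT, genotypes for p_0..p_4) copied from the table above
vcf_rows = [
    (1, 1, "G", "A", ["0|0", "0|0", "0|0", "0|0", "1|1"]),
    (2, 1, "C", "T", ["0|0", "0|0", "1|1", "0|0", "0|0"]),
    (3, 1, "T", "G", ["1|1", "0|0", "1|1", "1|1", "1|1"]),
    (4, 1, "G", "A", ["1|1", "0|0", "1|1", "0|0", "0|0"]),
    (5, 1, "C", "T", ["1|1", "0|0", "0|0", "0|0", "1|1"]),
]


def alt_allele_frequency(genotypes):
    """Fraction of allele calls equal to '1' across phased genotype strings."""
    alleles = [a for gt in genotypes for a in gt.split("|")]
    return alleles.count("1") / len(alleles)


freqs = [alt_allele_frequency(gts) for *_, gts in vcf_rows]
for (chrom, pos, ref, alt, _), freq in zip(vcf_rows, freqs):
    print(f"CHROM {chrom} POS {pos} {ref}>{alt}: ALT frequency {freq:.1f}")
```

Each SNP sits on its own CHROM at POS 1, which is consistent with the SNPs being unlinked.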


In [115]:
# simulate linked genealogies
linked = ipcoal.Model(Ne=1e5, nsamples=5, recomb=1e-9, seed_trees=1234)
linked.sim_trees(nloci=1, nsites=100000)
linked.df.head(10)

| | locus | start | end | nbps | nsnps | tidx | genealogy |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1727 | 1727 | 0 | 0 | ((p_3:67790.975409283491... |
| 1 | 0 | 1727 | 3972 | 2245 | 0 | 1 | ((p_3:67790.975409283491... |
| 2 | 0 | 3972 | 5462 | 1490 | 0 | 2 | ((p_3:67790.975409283491... |
| 3 | 0 | 5462 | 11745 | 6283 | 0 | 3 | (p_0:216324.742533758428... |
| 4 | 0 | 11745 | 12782 | 1037 | 0 | 4 | (((p_1:28107.65126587526... |
| 5 | 0 | 12782 | 17877 | 5095 | 0 | 5 | ((p_0:112441.59260623497... |
| 6 | 0 | 17877 | 19547 | 1670 | 0 | 6 | ((p_0:112441.59260623497... |
| 7 | 0 | 19547 | 21444 | 1897 | 0 | 7 | ((p_0:112441.59260623497... |
| 8 | 0 | 21444 | 21464 | 20 | 0 | 8 | (p_4:513071.323730750300... |
| 9 | 0 | 21464 | 23818 | 2354 | 0 | 9 | (p_4:534342.730756848352... |


In [116]:
linked.draw_genealogies(shared_axes=True, scale_bar=1000);

In [118]:
linked.draw_tree_sequence();

NotImplementedError: TODO..

### Simulating sequences

In [80]:
model.draw_seqview(idx=0, start=0, end=100);

In [75]:
# set Model to store TreeSequences as results
model.store_tree_sequences = True

# simulate a 10Kb locus
model.sim_loci(1, 10000)

# draw genealogy with substitutions
model.draw_genealogy(idx=0, show_substitutions=True);

Core features of the *ipcoal* + *toytree* framework include:
- Methods to describe and visualize demographic models.
- Methods to access tree sequences directly.
- Methods to access coalescent trees as a summarized dataframe.
- Methods for simulating conditioned unlinked SNPs.
- Methods for simulating multi-locus linked or unlinked data.
- Methods for writing SNPs or entire loci in a number of file formats.
- Methods for inferring phylogenetic trees from simulated data.
- Methods for calculating distances/statistics on trees.

### Accessing TreeSequences
A demographic model provides the simplest

TreeSequence objects can be very large in terms of memory usage. In general, `ipcoal` is intended

In [71]:
# setup a single-population demographic model
model = ipcoal.Model(Ne=1000, nsamples=10, store_tree_sequences=True)

# simulate two unlinked genealogies (one per locus)
model.sim_trees(nloci=2, nsites=1)

# access the tree sequences
model.ts_dict

{0: <tskit.trees.TreeSequence at 0x7fee67206c20>,
 1: <tskit.trees.TreeSequence at 0x7fee677d0760>}

If you are only interested in sampling a TreeSequence object, this can also be done in a simple way...

In [75]:
# setup a single-population demographic model
model = ipcoal.Model(Ne=1000, nsamples=10, store_tree_sequences=True)

# sample one TreeSequence
ts = model.get_tree_sequence(nsites=10)

# show the TreeSequence
repr(ts)

'<tskit.trees.TreeSequence object at 0x7fee690232e0>'

## Sequences

## Dataframe summary

In [77]:
# setup a single-population demographic model
model = ipcoal.Model(Ne=1000, nsamples=10, store_tree_sequences=True)

model.sim_trees(nloci=2, nsites=100)

model.df

| | locus | start | end | nbps | nsnps | tidx | genealogy |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 100 | 100 | 0 | 0 | ((p_2:1724.1829204845337... |
| 1 | 1 | 0 | 100 | 100 | 0 | 0 | (p_5:4708.95883675882396... |


The species tree is the primary *model* on which *ipcoal* is designed to simulate genealogies and sequences within the multispecies coalescent framework. One of the primary features of *ipcoal* is that it can parse a tree to build a demographic model (which is used by the `msprime` coalescent simulator) that describes when and how different populations (lineages) are able to coalesce with each other. You can think of coalescence on a species tree as several distinct coalescent processes occurring within panmictic populations that are simply connected to each other by the tree structure (see [Degnan and Rosenberg 2009](https://www.sciencedirect.com/science/article/pii/S0169534709000846) for a nice description).
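
For the smallest non-trivial case, three species with one gene copy sampled from each, the multispecies coalescent gives simple closed-form probabilities. The gene tree matches the species tree with probability 1 - (2/3)e^{-t}, where t is the length of the internal species-tree branch in coalescent units, and each of the two discordant topologies has probability (1/3)e^{-t}. The sketch below evaluates these formulas. It assumes t = t_g / (2Ne), uses illustrative parameter values, and does not call *ipcoal*.

```python
import math


def three_taxon_gene_tree_probs(t_g: float, ne: float) -> dict:
    """Gene tree topology probabilities for species tree ((A,B),C)."""
    t_c = t_g / (2 * ne)  # internal branch length in coalescent units
    p_deep = math.exp(-t_c)  # A and B fail to coalesce on the internal branch
    return {
        "((A,B),C)": 1 - (2 / 3) * p_deep,  # concordant
        "((A,C),B)": p_deep / 3,  # discordant
        "((B,C),A)": p_deep / 3,  # discordant
    }


for ne in (1_000, 5_000, 50_000):
    probs = three_taxon_gene_tree_probs(t_g=5_000, ne=ne)
    print(ne, {k: round(v, 3) for k, v in probs.items()})
```

As Ne grows relative to the branch length, the three topologies approach equal probability, which illustrates how larger populations connected by short branches produce more gene tree discordance.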

To simulate genealogies and sequences on a tree we need to first define the tree. This can be done by either loading an inferred tree from a newick string or by generating a random tree. For this we will use the tree manipulation and visualization library [toytree](https://toytree.readthedocs.io). In the example cell below I use toytree to generate a random tree with a set number of tips, a total tree height, and a random seed, and store it as a variable named `tree1`. 

We can visualize this tree by calling `.draw()` on the toytree object. Here I provide the argument `tree_style='p'` to set a drawing style suited to representing 'population trees' (i.e., species trees). This style provides numeric labels on the nodes of the tree, shows tip labels, and adds a scalebar for the height of the tree.

In [3]:
ipcoal.Model

ipcoal.model.Model