# Species trees


### Topic outline:
1. A visual introduction to genealogical variation.
2. Connecting genealogies to species trees (demographic model).
3. Connecting genealogies to sequence variation (observations).
4. Connecting genealogies to gene trees (inference).

### Learning objectives: 
By the end of this notebook series you should:
1. Be familiar with the `toytree` and `ipcoal` Python libraries.
2. Recognize the power of coalescent simulations to test hypotheses.
3. Have an improved understanding of gene-tree/species-tree concepts.


### Additional recommended reading:

- [Rosenberg and Nordborg (2002) Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics](https://eaton-lab.org/slides/genomics/readings/Rosenberg-and-Nordborg-2002.pdf)

### The toytree and ipcoal packages
The [toytree](https://toytree.readthedocs.io) and [ipcoal](https://ipcoal.readthedocs.io) Python packages are designed to be used together in Jupyter notebooks. With them you can interactively create, manipulate, and visualize tree data objects, simulate the coalescent process, and generate sequence data. The `ipcoal` package is built as a wrapper around the popular `msprime` coalescent simulator and extends its functionality for phylogenetic-scale analyses. Put simply, these two tools together allow us to easily explore and learn about the coalescent.

In [1]:
import ipcoal
import toytree

### The species tree
In this example we will start with a known, parameterized species tree model from which we will sample genealogical histories. Parameters of this model, such as the effective population sizes (Ne), affect the amount of genealogical variation. A species tree model defines a topology (branching order) with edge lengths (in units of generations) and an effective population size on each edge. <span style="color:tomato; font-weight:600"> A species tree effectively defines a container in which genealogies are embedded</span>. In contrast to a single-population coalescent, where any pair of samples has the potential to coalesce in each generation, a species tree enforces barriers between samples that are in different populations until they are joined in an ancestral interval. As described in the suggested reading by Rosenberg and Nordborg (2002), a species tree can be thought of as many individual coalescent populations stitched together.

Let's start by using `toytree` to generate a random species topology that we will use to represent a species tree. This will have a crown height of 1M generations, and evenly distributed speciation events between the root and tips.

In [2]:
# generate a random species tree with 10 tips and a crown age of 1M generations
stree = toytree.rtree.unittree(10, treeheight=1e6, seed=123)

# draw the species tree
stree.draw(ts='c', tip_labels=True, edge_widths=10);

### An embedded genealogy

To sample a random coalescent genealogy given this species tree we can use the `ipcoal` package. Here we will create an `ipcoal.Model` object and provide it the species tree object that we defined above (`stree`) as a parameter. We also provide it a value for the Ne in every species tree interval. Here the Ne value is relatively low compared to the edge lengths and thus discordance on the species tree is expected to be low. 

A random coalescent genealogy is shown in the visualization below. <span style="color:tomato; font-weight:600">The genealogy (shown with orange edges) matches the species tree topology (shown with black edges above) almost identically</span>, meaning that there is no discordance in this example. If you run the code block below multiple times, each execution will simulate a different random coalescent genealogy that is constrained by the species tree. Because the Ne value is very low relative to the species tree interval lengths (in units of generations), coalescence occurs very rapidly within each species tree interval. You will notice only very slight variation among repetitions in the timing of each coalescent event.

In [3]:
# setup a coalescent simulator for this species tree and Ne value
model1 = ipcoal.Model(tree=stree, Ne=1e4)

# sample a genealogy from this model
model1.sim_trees(1)

# draw the genealogy
model1.draw_genealogy(idx=0, edge_colors='orange', fixed_order=stree.get_tip_labels());

### Incomplete lineage sorting (ILS)
Now let's look at what happens when the Ne value is increased. Here we set Ne to 2e5 across the entire species tree, representing a higher level of expected discordance. The sampled genealogy below no longer matches the species tree topology. This is because many gene copies do not coalesce in the first species tree interval in which they have the opportunity to do so. Instead, they persist backwards in time through multiple species tree intervals before eventually coalescing. By that point, there are many other gene copies to coalesce with, and thus the relationships among gene copies become effectively random. Consider the extreme case where Ne is nearly infinite. The barriers between populations imposed by the species tree have no effect, and all samples will eventually coalesce in the common ancestor interval of the species tree. Thus, the scenario becomes identical to a single-population coalescent. The example below lies somewhere between small and infinite Ne: the species tree imposes some constraints, but many deep coalescences also occur.

In [4]:
# sample a genealogy from this species tree with much higher Ne
model2 = ipcoal.Model(tree=stree, Ne=2e5)
model2.sim_trees(1)

# draw the genealogy
model2.draw_genealogy(idx=0, edge_colors='orange');
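
The effect of Ne on deep coalescence can also be checked numerically without any tree at all. The block below is a minimal sketch, assuming the standard diploid coalescent approximation in which the waiting time for two lineages to coalesce is exponentially distributed with a mean of 2Ne generations. The function name `fraction_not_coalesced` and the branch length of 1e5 generations are hypothetical choices for illustration; they are not read from `stree` and are not part of `ipcoal`.

```python
import math
import random


def fraction_not_coalesced(ne, branch_gens, nreps=20_000, seed=123):
    """Simulate pairs of lineages entering a branch and return the fraction
    that fail to coalesce before the branch ends (a deep coalescence)."""
    rng = random.Random(seed)
    # pairwise coalescence rate per generation in a diploid population
    rate = 1.0 / (2.0 * ne)
    misses = sum(rng.expovariate(rate) > branch_gens for _ in range(nreps))
    return misses / nreps


# hypothetical branch length in generations (an assumption, not taken from stree)
BRANCH_GENS = 1e5

for ne in (1e4, 2e5):
    simulated = fraction_not_coalesced(ne, BRANCH_GENS)
    # analytic expectation for an exponential waiting time
    expected = math.exp(-BRANCH_GENS / (2 * ne))
    print(f"Ne={ne:g}: simulated={simulated:.3f}, expected={expected:.3f}")
```

With the smaller Ne almost every pair coalesces within the branch, whereas with the larger Ne most pairs persist into the ancestral interval, which is the pattern seen in the two genealogy drawings.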

### When does ILS occur?

Recall that the expected time until a sample of $k$ gene copies has fully coalesced in a single-population model is approximately $4N$ generations (more precisely $4N(1 - 1/k)$, which approaches $4N$ as $k$ grows). Thus, we can estimate very roughly whether ILS is likely to occur by examining whether any of the species tree intervals are shorter than $4N$ generations in length. This simply requires us to translate their lengths from units of generations into units of $N$. For example, if $N$=1000 and a branch is 1000 generations long, then its length can be described as $1N$. By convention, we usually express species tree branch lengths in units of $2N$, since that is the number of gene copies in a diploid population. <span style="color:tomato; font-weight:600">The length of a species tree branch in units of $2N$ generations is called a "coalescent unit"</span>. Branches that span 2 or more coalescent units (i.e., at least $4N$ generations) are expected to exhibit relatively little ILS, while those that are much shorter are more likely to exhibit ILS.
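
The "approximately $4N$" figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the standard diploid coalescent result that, while $i$ lineages remain, the expected wait until the next coalescent event is $4N / (i(i-1))$ generations. The function name `expected_tmrca` and the value N=1000 are illustrative choices only.

```python
def expected_tmrca(n_diploid, k):
    """Expected number of generations until k sampled gene copies
    trace back to a single common ancestor in one population."""
    # sum the expected waiting times while i = k, k-1, ..., 2 lineages remain
    return sum(4 * n_diploid / (i * (i - 1)) for i in range(2, k + 1))


N = 1000  # illustrative diploid population size
for k in (2, 10, 100):
    # the sum telescopes to 4N * (1 - 1/k), shown here for comparison
    closed_form = 4 * N * (1 - 1 / k)
    print(f"k={k}: summed={expected_tmrca(N, k):.1f}, closed form={closed_form:.1f}")
```

For two samples the expectation is $2N$ generations, and it climbs toward (but never exceeds) $4N$ as more samples are added, which is why $4N$ serves as a convenient rough threshold.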

The two plots below show our species tree with branch lengths transformed into coalescent units for two different Ne values that we used above. When Ne is very low (1e4) we saw that there was very little coalescent variation. This makes sense since the species tree edges are all >10 coalescent units. By contrast, when Ne is much higher (2e5) most species tree edges are <1 coalescent unit in length, and are thus likely to experience ILS. 

In [5]:
# plot the species tree with edge lengths rescaled to coalescent units for N=1e4
c, a, m = stree.mod.edges_multiplier(1/(2 * 1e4)).draw(ts='c', edge_widths=10);
a.y.label.text = "Coalescent time units (2N)"
a.label.text = "N=1e4"

# plot the species tree with edge lengths rescaled to coalescent units for N=2e5
c, a, m = stree.mod.edges_multiplier(1/(2 * 2e5)).draw(ts='c', edge_widths=10);
a.y.label.text = "Coalescent time units (2N)"
a.label.text = "N=2e5"
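
The same rescaling can be done by hand to see what the scale bars represent. The sketch below assumes only the definition given above (coalescent units = generations / 2N) and the crown height of 1e6 generations passed to `treeheight`. The helper `to_coalescent_units` and the example branch length of 1e5 generations are hypothetical and are not read from `stree`.

```python
def to_coalescent_units(generations, ne):
    """Convert a length in generations to coalescent units (2N generations each)."""
    return generations / (2 * ne)


TREE_HEIGHT = 1e6     # crown height used when the species tree was generated
EXAMPLE_BRANCH = 1e5  # hypothetical internal branch length, for illustration only

for ne in (1e4, 2e5):
    height_cu = to_coalescent_units(TREE_HEIGHT, ne)
    branch_cu = to_coalescent_units(EXAMPLE_BRANCH, ne)
    # rough rule from the text: branches shorter than 2 coalescent units (4N) may show ILS
    flag = "ILS plausible" if branch_cu < 2 else "ILS unlikely"
    print(f"Ne={ne:g}: tree height={height_cu:g} units, branch={branch_cu:g} units ({flag})")
```

Under these assumptions the whole tree spans 50 coalescent units when Ne=1e4 but only 2.5 units when Ne=2e5, so the same branch moves from well above the rough threshold to well below it.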

### Summary

- Species trees impose constraints that prevent coalescence among gene copies embedded in different branches. 
- Each branch of a species tree can be treated like a single population coalescent in which random mating occurs. 
- If Ne is very small, coalescent genealogies are expected to match the species tree.
- If two sampled gene copies do not coalesce in the first interval in which they have the opportunity, they have the potential to coalesce with other samples in earlier (ancestral) intervals. This is an example of "deep coalescence", also called "incomplete lineage sorting".
- Incomplete lineage sorting can cause coalescent genealogies to not match the species tree.