# Population Structure and Incomplete Lineage Sorting

### Topic outline:

1. A visual introduction to genealogical discordance.
2. Connecting genealogies to species trees (demographic models).
3. Connecting genealogies to Fst and ABBA-BABA.

### Learning objectives: 
By the end of this notebook series you should:
1. Understand how deep coalescence can cause genealogical discordance.
2. Understand how Ne and Tau both affect the probability of ILS.
3. Understand why concatenation can lead to biased species tree inference.


## Coalescent simulations
The [toytree](https://toytree.readthedocs.io) and [ipcoal](https://ipcoal.readthedocs.io) Python packages are designed to be used together in Jupyter notebooks. With them you can interactively create, manipulate, and visualize tree data objects, simulate the coalescent process, and generate sequence data. Together they are useful tools for learning about the coalescent and multispecies coalescent processes.

In [1]:
import toytree
import ipcoal

## Coalescent tree for a single population
In all of your notebook exercises up to this point we have considered only a single population. We learned that we can model the expected genetic variation among k samples in a population as an outcome of mutations occurring along the branches of the genealogical history of those samples. Different genomic regions will trace back different genealogical histories, and we can model the distribution of genealogies from this population as a probability distribution. A genealogy sampled from this distribution can be treated as a random variable, generated by sampling waiting times between coalescent events according to the coalescent process. By examining many genealogies we can learn about the model (population) that generated them. 

The only parameter affecting the rate of coalescence in this scenario is `Ne` of the population. We can refer to this single-population model as a *demographic model*. It is in fact the simplest possible demographic model, and the topic of this notebook will be to begin to describe more complex types of demographic models. In general, these models can be visualized like in the example below, as a container or shape within which a genealogy can be embedded. Below is an example genealogy sampled by coalescent simulation that is embedded in its demographic model container, represented by a grey shaded rectangle.

In [2]:
# simulate 1 coalescent genealogy of k samples embedded in a single population demographic model
model = ipcoal.Model(Ne=1000, nsamples=5, seed_trees=123)
model.sim_trees(1)
model.draw_demography(idx=0);

## Population structure
In this notebook we will add complexity to our demographic model in the form of *population structure*, which also adds parameters to our evolutionary model. To understand the effect of population structure on the genealogical relationships of a set of gene copies, let's first consider the visualization below. Imagine that we are interested in modeling the evolution of five gene copies using our single population demographic model, like above. However, in this scenario we set an upper limit to the age of our population, at $\tau$=1000 generations. As a consequence, not all of the gene copies coalesce before reaching the end of the simulated population interval; several ancestors of these gene copies exist *farther back in time* than the history of this population. To accommodate this, we will need to also model the ancestral population that came before it, as well as any other descendant populations of that ancestor.

In [3]:
# draw a coalescent genealogy of k samples embedded in a single population
model = ipcoal.Model(Ne=1000, nsamples=5, seed_trees=123)
model.sim_trees(1)
model.draw_demography(idx=0, container_root_height=1000);

The ancestral population may differ from the descendant population in a number of ways. It could have a different Ne value, and thus a different rate of coalescence, and, even more importantly for our lesson here, *it may contain other gene copies*. If other gene copies are now present in the ancestral population, then we must also consider that the remaining uncoalesced gene copies in our population have the opportunity to coalesce with those other samples. 

In the example below, I create a demographic model composed of two sibling populations and an ancestral population (3 populations total). For simplicity, we will sample only a single gene copy from the new sibling population, while still sampling the same five gene copies from our original population. We can see now that as we enter the ancestral interval, we again have three (green) gene copies that have not yet coalesced from the left-descendant population, and one gene copy (orange) that enters the ancestral interval from the right-descendant population. 

Once these four samples are present together in the ancestral interval, we model the process of coalescence within this interval the same way we did previously for a single population: we sample an exponential waiting time until the next coalescent event, with an expected value of $4Ne / k(k-1)$ generations (that is, a rate of $k(k-1) / 4Ne$ per generation), and randomly sample which two gene copies coalesce at that time. The main difference is that now it is possible for the green and orange samples to coalesce with each other.
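The sampling procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration under the diploid scaling used in this notebook, not the ipcoal implementation; the function name `simulate_coalescence` and the integer lineage labels are hypothetical choices made for this example.

```python
import random


def simulate_coalescence(k, Ne, rng):
    """Simulate coalescent events for k gene copies in one diploid population.

    Returns a list of (waiting_time_in_generations, (lineage_a, lineage_b)).
    """
    lineages = list(range(k))   # labels of the currently uncoalesced lineages
    next_label = k              # label assigned to each new ancestral lineage
    events = []
    while len(lineages) > 1:
        n = len(lineages)
        # rate of coalescence per generation when n lineages remain
        rate = n * (n - 1) / (4 * Ne)
        # exponential waiting time with mean 1 / rate = 4Ne / (n(n-1))
        wait = rng.expovariate(rate)
        # choose uniformly at random which two lineages coalesce
        a, b = rng.sample(lineages, 2)
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_label)
        next_label += 1
        events.append((wait, (a, b)))
    return events


# example: the four gene copies present in the ancestral interval
rng = random.Random(123)
for wait, pair in simulate_coalescence(k=4, Ne=1000, rng=rng):
    print(f"lineages {pair} coalesce after {wait:.1f} generations")
```

Because the pair is chosen uniformly at random, nothing in this procedure favors joining copies that came from the same descendant population once they share an ancestral interval.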

In [4]:
# simulate coalescent histories embedded in a 2-population demographic model
sptree = toytree.tree("(A:1000,B:1000);")
model = ipcoal.Model(sptree, Ne=1000, nsamples={"A": 5, "B": 1}, seed_trees=123)
model.sim_trees(2)

In this first simulated genealogy, by chance, the green copies all coalesce with each other before their common ancestor coalesces with the orange gene copy. Thus, the green samples form a monophyletic clade.

In [5]:
# draw one simulated genealogy embedding
model.draw_demography(idx=0, container_blend=True, container_interval_minwidth=5);

However, the result above is just one possible outcome of the coalescent process in this structured population model. Each genomic region can trace back a different genealogical history, and some of the time, we may observe a pattern like below, where some of the green copies share a more recent common ancestor with the orange clade than they do with other samples of the green clade. In this case, the samples from the first population do not form a monophyletic clade. This observation is an example of **incomplete lineage sorting**. The samples in a lineage have not completely sorted into a monophyletic clade before entering into an ancestral interval, and consequently, their genealogy may be discordant with the relationships among populations/species. 

In [6]:
# draw another simulated genealogy embedding
model.draw_demography(idx=1, container_blend=True);

## Demographic model parameters
A single "Ne" parameter is no longer enough to specify our demographic model. Each population can have its own Ne value, and we must also define the times at which populations split, referred to as Tau. For example, the three-population scenario below has 7 parameters: 5 Ne values (one for each of the three sampled populations and one for each of their two ancestral populations) and 2 Tau values.

Together, these 7 parameters define the demographic history of these populations, which determines the probability of different genealogical histories. When the Ne values are large, coalescence occurs more slowly, and when the intervals are short, it is more likely that coalescent events will occur deeper in time than population divergences. Thus, together the Ne and Tau values determine the probability of incomplete lineage sorting.
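The parameter count generalizes to any fully bifurcating species tree. As a rough sketch, assume every interval (each sampled population and each ancestral population, including the root) has its own Ne and every divergence event has its own Tau. A rooted bifurcating tree with n tips has 2n - 1 nodes, of which n - 1 are internal. The helper name below is hypothetical.

```python
def count_demographic_parameters(ntips):
    """Count Ne and Tau parameters for a rooted bifurcating tree with ntips tips."""
    if ntips < 1:
        raise ValueError("ntips must be at least 1")
    n_ne = 2 * ntips - 1   # one Ne per interval (tips plus ancestral populations)
    n_tau = ntips - 1      # one Tau per divergence event (internal node)
    return {"Ne": n_ne, "Tau": n_tau, "total": n_ne + n_tau}


# single population, two populations, three populations
for ntips in (1, 2, 3):
    print(ntips, count_demographic_parameters(ntips))
```

With `ntips=1` this reduces to the single-population model with one Ne parameter, and with `ntips=3` it gives the 7 parameters described above.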

## When does ILS occur?
Compare the two simulations below. The first has long intervals (large Tau values) relative to Ne, whereas the second has much shorter intervals. In the first, the gene copies from populations 1 and 2 coalesce before the end of their ancestral interval, whereas in the second, most coalescent events occur much deeper than the divergence events, leading to ILS. The **ratio between interval length in units of generations and Ne determines the probability of deep coalescence**. This can be expressed by a compound value $t_g / 2Ne$, referred to as an interval's length in *coalescent units*. Intervals that are longer in coalescent units exhibit less ILS than intervals that are shorter in coalescent units. Because we expect most samples in a population to coalesce within ~4Ne generations, which equals 2 coalescent units, we expect all samples will usually coalesce within an interval if its length is >2 coalescent units.

By hovering your cursor over the grey intervals in the visualization below you will see a pop-up that shows values for that interval, including its length in generations (shown as Tg), its Ne value, and its length in coalescent units (shown as Tc). 

In [130]:
# define the seven model parameters
Nes = (3000, 3000, 3000, 3000, 3000)
Taus = (50000, 100000)

# setup demographic model and simulate genealogy
sptree = (
    toytree.rtree.baltree(3)
    .set_node_data("Ne", Nes)
    .set_node_data("height", {3: Taus[0], 4: Taus[1]})
)
model = ipcoal.Model(sptree, nsamples=1, seed_trees=123)
model.sim_trees(1)
model.draw_demography(idx=0, container_blend=True);

In [131]:
# define the seven model parameters
Nes = (3000, 3000, 3000, 3000, 3000)
Taus = (2000, 4000)

# setup demographic model and simulate genealogy
sptree = (
    toytree.rtree.baltree(3)
    .set_node_data("Ne", Nes)
    .set_node_data("height", {3: Taus[0], 4: Taus[1]})
)
model = ipcoal.Model(sptree, nsamples=1, seed_trees=123)
model.sim_trees(1)
model.draw_demography(idx=0, container_blend=True);
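The interval lengths in the two simulations above can also be converted to coalescent units by hand. The sketch below assumes, as in those code cells, one gene copy per population, so exactly two lineages enter the internal interval between the two divergence times. Under that assumption the probability that they fail to coalesce within the interval is $e^{-t}$, where $t$ is the interval length in coalescent units. The function names are illustrative only.

```python
import math


def coalescent_units(t_generations, Ne):
    """Convert an interval length in generations to coalescent units (t_g / 2Ne)."""
    return t_generations / (2 * Ne)


def prob_no_coalescence(t_generations, Ne):
    """Probability that two lineages do NOT coalesce within the interval."""
    return math.exp(-coalescent_units(t_generations, Ne))


Ne = 3000
scenarios = {
    "long intervals": (50000, 100000),   # Taus from the first simulation
    "short intervals": (2000, 4000),     # Taus from the second simulation
}
for label, (tau0, tau1) in scenarios.items():
    length = tau1 - tau0  # internal interval length in generations
    print(
        f"{label}: Tg={length}, "
        f"Tc={coalescent_units(length, Ne):.3f}, "
        f"P(deep coalescence)={prob_no_coalescence(length, Ne):.4f}"
    )
```

The first scenario has an internal interval of more than 8 coalescent units, so deep coalescence is very unlikely, whereas the second is only a third of a coalescent unit long, so the two lineages usually pass into the root interval uncoalesced.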

## Species trees 
Another term that is used to describe a demographic model is a "species tree". This is particularly the case when the goal is to infer the topology, or container tree, of the demographic model that can best explain a set of observed data. In other words, this term is often used in phylogenetics, where one of the central goals is to infer the species tree as the phylogenetic history for a set of species or populations by examining variation at many gene trees.

The example below represents a typical "phylogenomic" dataset, where multiple populations or species are each represented by a small number of samples. Here we simulate many loci, where each locus represents sequence data evolved on a different simulated genealogy. This is repeated for two different scenarios, the first with very low ILS and the second with very high ILS.

#### Low ILS scenario
Coalescence occurs rapidly, almost always within the first ancestral interval in which it can occur, and so essentially no ILS occurs.

In [141]:
# setup a species tree model
sptree = toytree.rtree.imbtree(5, treeheight=1e6, seed=123)
model0 = ipcoal.Model(sptree, Ne=2e4, seed_trees=123, recomb=0)

# simulate sequences and report
model0.sim_loci(250, 200)
print("simulated {0} {2}bp loci for {1} samples".format(*model0.seqs.shape))

# draw the first genealogy embedded 
model0.draw_demography(idx=0, container_blend=True);

simulated 250 200bp loci for 5 samples


#### High ILS scenario
Coalescence occurs very slowly, usually deeper than the first ancestral interval in which it can occur. This is especially true for the internal intervals 5 and 6 which are much shorter than the others.

In [134]:
# setup a species tree model
sptree = toytree.rtree.imbtree(5).set_node_data("height", {5: 2e5, 6: 2.5e5, 7: 3e5, 8: 1e6})
model1 = ipcoal.Model(sptree, Ne=1e6, seed_trees=123, recomb=0)

# simulate sequences and report
model1.sim_loci(1000, 200)
print("simulated {0} {2}bp loci for {1} samples".format(*model1.seqs.shape))

# draw the first genealogy embedded 
model1.draw_demography(idx=0, container_blend=True, container_root_height=200000);

simulated 1000 200bp loci for 5 samples


## Phylogenomics
It is clear from the visualizations above that the genealogies closely match the species tree in the first scenario (low ILS), but do not match the species tree in the second scenario (high ILS). This type of discordance can make it difficult to infer the correct phylogenetic history for the set of populations. It is also the reason why we cannot rely on a single gene tree to represent the phylogeny of a set of populations, since each gene tree may exhibit a different pattern. Instead, to accurately infer the species tree we must examine many gene trees. This is the idea behind multi-locus phylogenetics, also called "phylogenomics", which examines variation across many genomic regions to infer a best species tree hypothesis.

### Concatenation bias
Early approaches to multi-locus phylogenetics relied on a simple method of combining data from different genomic regions called "concatenation", where the sequence data from different loci are simply stitched together, end-to-end, to create one super locus. The idea is that the gene tree inferred from this super locus is likely to match the most common, or average, gene tree across the dataset, which in turn should be closest to the true species tree history. Although this approach is still commonly used today, it is well known that this expectation does not always hold: under some species tree scenarios the most common gene tree does not match the species tree. This is particularly true when ILS is very common.
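The concatenation step itself is simple to express. Below is a minimal sketch using plain Python dictionaries rather than any ipcoal data structure; the `concatenate_loci` helper, the sample names, and the toy sequences are all made up for illustration.

```python
def concatenate_loci(loci):
    """Stitch per-locus alignments end-to-end into one super locus.

    loci: list of dicts, each mapping sample name -> aligned sequence string.
    Returns one dict mapping sample name -> concatenated sequence.
    """
    names = sorted(loci[0])
    for locus in loci:
        # every locus must contain the same samples
        if sorted(locus) != names:
            raise ValueError("all loci must contain the same samples")
        # sequences within a locus must be aligned (equal length)
        if len({len(seq) for seq in locus.values()}) != 1:
            raise ValueError("sequences within a locus must have equal length")
    return {name: "".join(locus[name] for locus in loci) for name in names}


# toy example: two short loci for three samples
loci = [
    {"r0": "ACGT", "r1": "ACGA", "r2": "TCGA"},
    {"r0": "GG", "r1": "GC", "r2": "CC"},
]
supermatrix = concatenate_loci(loci)
for name, seq in supermatrix.items():
    print(name, seq)
```

Note that once the loci are joined, the information about where one locus ends and the next begins is discarded, so a tree inferred from the result has to explain all sites with a single genealogy.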

This bias is demonstrated below, where we perform maximum likelihood tree inference on a concatenated alignment of the data from each of the two scenarios above, representing low versus high ILS. As we see, inference in the low ILS scenario recovers the true species tree, whereas inference in the high ILS scenario returns an incorrect phylogeny.

In [142]:
# infer a concatenation tree for the low ILS scenario and draw it
concat_tree = ipcoal.phylo.infer_raxml_ng_tree(model0)
concat_tree.mod.root_on_minimal_ancestor_deviation("r4").ladderize().draw();

In the high ILS scenario, phylogenetic inference on the concatenated data set returns an incorrect tree.

In [136]:
# infer a concatenation tree for the high ILS scenario and draw it (it is wrong!)
concat_tree = ipcoal.phylo.infer_raxml_ng_tree(model1)
concat_tree.mod.root_on_minimal_ancestor_deviation("r4").ladderize().draw();

### Species tree inference
You may be thinking that the high ILS scenario we simulated is unsolvable. Perhaps there is just way too much ILS to infer the correct tree. But this is not the case. There is in fact a set of methods that can infer the correct species tree even in the presence of very high levels of ILS. These methods are called "species tree inference" methods, and in contrast to concatenation, they are expected to consistently yield the correct result when provided enough data. Below is an example where we enter the genealogies from the high ILS scenario as input to a species tree inference method called ASTRAL, and we see that on the same data set it is able to recover the correct species tree topology.

The species tree inference approach differs from concatenation in that, instead of assuming there is a single gene tree that can best explain the variation in a large sequence alignment, it assumes that each locus has a different gene tree history, and tries to infer a single species tree container that can best predict that distribution of gene trees.

In [138]:
# infer a species tree from the high ILS genealogy data using ASTRAL
ipcoal.phylo.infer_astral_tree(model1.df.genealogy).root("r4").draw(use_edge_lengths=False);
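To build some intuition for what a "distribution of gene trees" looks like, here is a deliberately simplified sketch. It is not how ASTRAL works internally. It assumes a three-population species tree ((A,B),C) with one gene copy per population and an internal interval of `t` coalescent units. The A and B lineages coalesce within that interval with probability $1 - e^{-t}$; otherwise all three lineages enter the root interval, where each pair is equally likely to coalesce first. All names are hypothetical.

```python
import math
import random
from collections import Counter

TOPOLOGIES = ["((A,B),C)", "((A,C),B)", "((B,C),A)"]


def sim_triplet_topology(t, rng):
    """Sample one gene tree topology for species tree ((A,B),C)."""
    # A and B coalesce inside the internal interval with probability 1 - exp(-t)
    if rng.random() < 1 - math.exp(-t):
        return "((A,B),C)"
    # otherwise three lineages reach the root and a random pair joins first
    return rng.choice(TOPOLOGIES)


def topology_frequencies(t, nloci, seed=123):
    """Simulate nloci gene trees and return the frequency of each topology."""
    rng = random.Random(seed)
    counts = Counter(sim_triplet_topology(t, rng) for _ in range(nloci))
    return {topo: counts[topo] / nloci for topo in TOPOLOGIES}


for t in (3.0, 0.3, 0.03):
    freqs = topology_frequencies(t, nloci=10000)
    expected = 1 - (2 / 3) * math.exp(-t)
    print(f"t={t}: {freqs} (expected concordant frequency {expected:.3f})")
```

In this three-population case the concordant topology remains the most frequent one, but as the interval shortens the two discordant topologies approach it in frequency. A species tree method uses the whole set of frequencies, including the roughly equal frequencies of the two discordant topologies, rather than treating discordant loci as noise.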

## Conclusions

- Demographic models can be expanded beyond a single-population model with a single Ne parameter to complex models composed of many populations and many parameters.
- Genealogical histories can exhibit discordance (mismatch) with the demographic model structure when coalescent events occur deeper in time than the divergence events between populations. This is termed incomplete lineage sorting (ILS).
- In structured population demographic models the coalescent process is modeled similar to the way it is in a single-population model, but separately within each interval. It can be thought of as "stitching together" many single-population coalescent models.
- The probability of ILS is determined by the length of intervals in a structured demographic model in coalescent units, which is their length in generations divided by 2Ne. ILS is less likely when either Ne is smaller, or the length of intervals is longer.
- Species trees are another term used to describe demographic models, particularly within the field of phylogenetics. The goal of phylogenetics is often to test alternative species tree hypotheses to find the one that best explains the observed variation in gene trees.
- The most common gene tree across the genome does not always match the species tree, so species tree inference cannot simply search for the most common gene tree. That is essentially what "concatenation" does, which is why it can lead to an incorrect tree inference. Instead, species tree inference involves methods to find the species tree that best explains *a distribution of gene trees*, usually inferred from many different loci (hundreds or thousands) sampled from genomes.