# nb2: Gene tree estimation


### Notebook outline:
1. A visual introduction to genealogical variation. 
2. Connecting genealogies to species trees (demographic model).
3. Connecting genealogies to sequence variation (observations).
4. Connecting genealogies to gene trees (inference).

### Learning objectives: 
By the end of this notebook series you should:
1. Be familiar with the `toytree` and `ipcoal` Python libraries.
2. Recognize the power of coalescent simulations to test hypotheses.
3. Have an improved understanding of gene-tree/species-tree concepts.


### Additional recommended reading:

- [Rosenberg and Nordborg (2002) Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics](https://eaton-lab.org/slides/genomics/readings/Rosenberg-and-Nordborg-2002.pdf)

### The ipyrad-analysis package
The ipyrad-analysis (ipa) package is a wrapper for conveniently running many types of phylogenetic inference tools in Jupyter notebooks. We will use it here to call the maximum likelihood inference software `RAxML` for gene tree inference.

In [34]:
import ipcoal
import toytree
import ipyrad.analysis as ipa

### Terminology (genealogies and gene trees)
This notebook aims to demonstrate and reinforce an understanding of the difference between the *true* genealogical history of a set of samples, and the *inferred* gene tree history that is estimated from observable sequence data. The following terminology is helpful in this respect. 


**Genealogies** are the true unobserved histories of a set of samples from one or more populations. The samples share a common ancestor at some point in their past, and the relationships among the samples and their ancestors trace back a tree-like history. Different regions of a genome alignment will have different genealogical histories. The boundaries between these regions are also unobservable, and can only be estimated from information in observable sequences.

**Gene trees** are estimates of the genealogies. Because we cannot observe genealogies directly, we must infer them from the data that we can observe, in the form of substitutions that evolved on those genealogies. Because we cannot observe the boundaries between regions with different genealogical histories, gene tree estimation often involves concatenating data from multiple linked genealogies.


In this notebook we will examine some clear examples where inferred gene trees differ from the true genealogies due to estimation error. This is a common source of error in phylogenetic analyses that contributes to phylogenetic uncertainty. 

### Simulations
In the examples below we will simulate sequence data on genealogies sampled from a species tree model with recombination. Recombination can cause multiple distinct genealogies to underlie the history of the samples across the length of a locus or chromosome.
If only a single genealogical history is present, then we expect that sequence data evolved on this genealogy will provide strong evidence for a single gene tree matching the genealogy. If multiple genealogies are present, then the concatenated sequence data evolved on those genealogies will tend to support a gene tree that reflects an average of their relationships. Several sources of error can affect gene tree estimation: (1) insufficient variation within a locus to resolve relationships; (2) genealogical variation within a locus, which causes the inferred gene tree to differ from the underlying genealogies; and (3) homoplasy (multiple mutations at the same site). Other sources of error, such as sequencing and alignment error, are not included in this example.

In [42]:
# generate a random species tree with 10 tips and a crown age of 1e6 generations
tree = toytree.rtree.unittree(10, treeheight=1e6, seed=123)

# draw the species tree
tree.draw(ts='c', tip_labels=True);

In this example we simulate a single locus that is 100bp in length on a species tree with very little discordance (low Ne). A print statement reports the number of genealogies that occur across this length. A gene tree is inferred with RAxML and its topology is compared, using the Robinson-Foulds (RF) distance, to the species tree and to the first genealogy in the simulated locus. In this case the inferred gene tree does not match the underlying genealogy or the species tree because there is little information in the 100bp locus.

<div class="alert alert-success">
    <b>Action:</b> 
Try editing the `nsites=100` value below to a larger number and re-executing the cell to see its effect on the inferred gene tree. 
</div>

In [80]:
# setup simulation with small Ne
model1 = ipcoal.Model(tree=tree, Ne=1e4, recomb=1e-9)

# simulate a short uninformative locus
model1.sim_loci(nloci=1, nsites=100)
print("{} genealogies in {} sites".format(model1.df.shape[0], model1.seqs.shape[2]))

# infer a raxml gene tree at this locus
model1.write_concat_to_phylip(name="test", outdir="/tmp")
rax = ipa.raxml(data="/tmp/test.phy", T=1, N=10)
rax.run(force=True)

# draw the genealogy and gene tree
t0 = toytree.tree(model1.df.genealogy[0])
t0.draw(edge_colors='orange', ts='c', tip_labels=True);
t1 = toytree.tree(rax.trees.bipartitions)
t1.draw(ts='c', tip_labels=True);

# does the gene tree match the first genealogy in the locus?
rf = t1.treenode.robinson_foulds(t0.treenode, unrooted_trees=True)[0]
print("inferred gene tree matches the first genealogy: {} (rf={})".format(rf == 0, rf))

# does the gene tree match the species tree?
rf = t1.treenode.robinson_foulds(tree.treenode, unrooted_trees=True)[0]
print("inferred gene tree matches the species tree: {} (rf={})".format(rf == 0, rf))

1 genealogies in 100 sites
wrote concat locus (10 x 100bp) to /tmp/test.phy
job test finished successfully
inferred gene tree matches the first genealogy: False (rf=6)
inferred gene tree matches the species tree: False (rf=6)
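
The `rf` values printed above are Robinson-Foulds distances: the number of bipartitions (splits of the tips into two groups) that are found in one tree but not the other. As a minimal sketch of the idea, and not the `toytree` implementation, the code below computes this distance for small trees written as nested tuples. The function names and the two five-tip trees are hypothetical and chosen only for illustration.

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions (splits) of a nested-tuple tree."""

    def tips(node):
        # a tip is a string; an internal node is a tuple of child nodes
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(tips(child) for child in node))

    all_tips = tips(tree)
    ref = min(all_tips)  # arbitrary reference tip used to name each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = tips(node)
        other = all_tips - below
        # ignore trivial splits that separate fewer than two tips
        if len(below) > 1 and len(other) > 1:
            # store the side without the reference tip, so that the same
            # split always has the same representation (unrooted comparison)
            found.add(other if ref in below else below)
        for child in node:
            visit(child)

    visit(tree)
    return found


def rf_distance(tree1, tree2):
    """Count the splits present in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


# hypothetical five-tip trees that disagree about the placement of b and c
tree_a = ((("a", "b"), "c"), ("d", "e"))
tree_b = ((("a", "c"), "b"), ("d", "e"))
print(rf_distance(tree_a, tree_b))
```

Each of the two example trees contains one split that the other lacks, so their distance is 2. Two fully resolved unrooted trees with n tips can differ by at most 2(n - 3) splits, which is 14 for the 10-tip trees used in this notebook.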


When we extend the length of the locus to make it more informative (e.g., 2Kb) the resulting gene tree is better resolved and matches the genealogy and species tree.
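
A back-of-the-envelope calculation suggests why locus length matters so much here. The sketch below assumes a per-site, per-generation mutation rate of 1e-8; this value is an assumption for illustration, since the mutation rate is not set explicitly in this notebook. It asks how many substitutions are expected along a single root-to-tip lineage for a tree that is 1e6 generations deep.

```python
# Values assumed for illustration only.
MUT = 1e-8          # assumed per-site, per-generation mutation rate
TREE_HEIGHT = 1e6   # crown age of the species tree in generations


def expected_subs_per_lineage(nsites, mut=MUT, generations=TREE_HEIGHT):
    """Expected substitutions on one root-to-tip path across a locus."""
    return nsites * mut * generations


for nsites in (100, 2000):
    subs = expected_subs_per_lineage(nsites)
    print("{} sites: ~{:.0f} expected substitutions per lineage".format(nsites, subs))
```

Under these assumptions a 100bp locus carries about one substitution per lineage, while a 2Kb locus carries about twenty. Internal branches are much shorter than the full tree height, so in a short locus many of them are expected to carry no substitutions at all, leaving the corresponding splits unresolved.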

In [81]:
# setup simulation with small Ne
model1 = ipcoal.Model(tree=tree, Ne=1e4)

# simulate a longer, more informative locus (2Kb)
model1.sim_loci(1, 2000)
print("{} genealogies in {} sites".format(model1.df.shape[0], model1.seqs.shape[2]))

# infer a gene tree at this locus
model1.write_concat_to_phylip(name="test", outdir="/tmp")
rax = ipa.raxml(data="/tmp/test.phy", T=1, N=10)
rax.run(force=True)

# draw the genealogy and gene tree
t0 = toytree.tree(model1.df.genealogy[0])
t0.draw(edge_colors='orange', ts='c', tip_labels=True);
t1 = toytree.tree(rax.trees.bipartitions)
t1.draw(ts='c', tip_labels=True);

# does the gene tree match the first genealogy in the locus?
rf = t1.treenode.robinson_foulds(t0.treenode, unrooted_trees=True)[0]
print("inferred gene tree matches the first genealogy: {} (rf={})".format(rf == 0, rf))

# does the gene tree match the species tree?
rf = t1.treenode.robinson_foulds(tree.treenode, unrooted_trees=True)[0]
print("inferred gene tree matches the species tree: {} (rf={})".format(rf == 0, rf))

3 genealogies in 2000 sites
wrote concat locus (10 x 2000bp) to /tmp/test.phy
job test finished successfully
inferred gene tree matches the first genealogy: True (rf=0)
inferred gene tree matches the species tree: True (rf=0)


As we learned in the last notebook, when we increase the effective population size any sampled genealogy is more likely to differ from the species tree. Let's examine this effect with sequence data and gene tree inference. Here I simulate a 2Kb locus again but on a species tree with Ne=2e5.
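
The dependence of discordance on Ne can be made concrete with the standard result for a rooted three-taxon species tree: if the internal branch lasts t generations, a genealogy differs in topology from the species tree with probability (2/3) * exp(-t / 2Ne), assuming a diploid Ne. The sketch below evaluates this for several values of Ne. The internal branch length of 1e5 generations is hypothetical and is not taken from the 10-tip tree used here, so the numbers are only indicative.

```python
import math


def prob_discordant(t_generations, Ne):
    """P(genealogy topology differs from species tree), rooted three-taxon case."""
    # internal branch length in coalescent units (diploid population)
    coal_units = t_generations / (2 * Ne)
    # the two lineages fail to coalesce on the internal branch with
    # probability exp(-coal_units); 2 of the 3 equally likely resolutions
    # in the ancestral population are then discordant
    return (2 / 3) * math.exp(-coal_units)


T_INTERNAL = 1e5  # hypothetical internal branch length in generations

for Ne in (1e4, 2e5, 2e6):
    p = prob_discordant(T_INTERNAL, Ne)
    print("Ne={:.0e}: P(discordant)={:.3f}".format(Ne, p))
```

With this hypothetical branch length, discordance is expected in well under 1% of genealogies at Ne=1e4 but in more than half of them at Ne=2e5. The probability approaches its upper limit of 2/3 as Ne grows further.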

In [82]:
# setup simulation with larger Ne
model1 = ipcoal.Model(tree=tree, Ne=2e5)

# simulate a 2Kb locus
model1.sim_loci(1, 2000)
print("{} genealogies in {} sites".format(model1.df.shape[0], model1.seqs.shape[2]))

# infer a gene tree at this locus
model1.write_concat_to_phylip(name="test", outdir="/tmp")
rax = ipa.raxml(data="/tmp/test.phy", T=1, N=10)
rax.run(force=True)

# draw the genealogy and gene tree
t0 = toytree.tree(model1.df.genealogy[0])
t0.draw(edge_colors='orange', ts='c', tip_labels=True);
t1 = toytree.tree(rax.trees.bipartitions)
t1.draw(ts='c', tip_labels=True);

# does the gene tree match the first genealogy in the locus?
rf = t1.treenode.robinson_foulds(t0.treenode, unrooted_trees=True)[0]
print("inferred gene tree matches the first genealogy: {} (rf={})".format(rf == 0, rf))

# does the gene tree match the species tree?
rf = t1.treenode.robinson_foulds(tree.treenode, unrooted_trees=True)[0]
print("inferred gene tree matches the species tree: {} (rf={})".format(rf == 0, rf))

12 genealogies in 2000 sites
wrote concat locus (10 x 2000bp) to /tmp/test.phy
job test finished successfully
inferred gene tree matches the first genealogy: False (rf=12)
inferred gene tree matches the species tree: False (rf=6)


When there is greater genealogical discordance (high Ne), the multiple linked genealogies within a locus are more likely to differ from one another, and thus concatenation of their sequences is more likely to affect gene tree inference. This problem becomes more severe when the tree is very deep in units of generations, Ne is very large, and recombination rates are high. In the example below I set Ne=2e6, and I print the number of genealogies that are contained within a 10Kb locus.
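
A rough way to see why these three factors matter is that recombination events accumulate in proportion to the recombination rate, the number of sites, and the total branch length of the genealogy in generations, which grows with both tree depth and Ne. The sketch below is an order-of-magnitude approximation only, not how `ipcoal` counts genealogies, and not every recombination event changes the tree topology. The recombination rate of 1e-9 matches the value set in the first simulation above; the total branch lengths are hypothetical.

```python
def expected_breakpoints(nsites, recomb, total_branch_length):
    """Approximate number of recombination events along a locus.

    recomb is the per-site, per-generation rate, and total_branch_length
    is the summed length of all genealogy branches in generations.
    """
    # recombination can occur between each pair of adjacent sites
    return recomb * (nsites - 1) * total_branch_length


RECOMB = 1e-9  # per-site, per-generation recombination rate

# hypothetical total branch lengths for a shallow and a deep genealogy
for total_length in (3e6, 3e7):
    for nsites in (2000, 10000):
        n = expected_breakpoints(nsites, RECOMB, total_length)
        print("length={:.0e}, sites={}: ~{:.0f} breakpoints".format(total_length, nsites, n))
```

Increasing either the locus length or the total branch length by a factor of ten increases the expected number of breakpoints by about the same factor, which is why long loci simulated with large Ne contain so many distinct genealogies.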

In [83]:
# setup simulation with very large Ne
model1 = ipcoal.Model(tree=tree, Ne=2e6)

# simulate a long (10Kb) locus
model1.sim_loci(1, 10000)
print("{} genealogies in {} sites".format(model1.df.shape[0], model1.seqs.shape[2]))

# infer a gene tree at this locus
model1.write_concat_to_phylip(name="test", outdir="/tmp")
rax = ipa.raxml(data="/tmp/test.phy", T=1, N=10)
rax.run(force=True)

# draw the genealogy and gene tree
t0 = toytree.tree(model1.df.genealogy[0])
t0.draw(edge_colors='orange', ts='c', tip_labels=True);
t1 = toytree.tree(rax.trees.bipartitions)
t1.draw(ts='c', tip_labels=True);

# does the gene tree match the first genealogy in the locus?
rf = t1.treenode.robinson_foulds(t0.treenode, unrooted_trees=True)[0]
print("inferred gene tree matches the first genealogy: {} (rf={})".format(rf == 0, rf))

# does the gene tree match the species tree?
rf = t1.treenode.robinson_foulds(tree.treenode, unrooted_trees=True)[0]
print("inferred gene tree matches the species tree: {} (rf={})".format(rf == 0, rf))

281 genealogies in 10000 sites
wrote concat locus (10 x 10000bp) to /tmp/test.phy
job test finished successfully
inferred gene tree matches the first genealogy: False (rf=12)
inferred gene tree matches the species tree: False (rf=14)


### Gene tree estimation errors
You can see that among the examples above the inferred gene trees rarely matched either the species tree or even the true underlying genealogy (or genealogies) for a given locus. This demonstrates the connection between <kbd>genealogies</kbd> -> <kbd>sequence variation</kbd> -> <kbd>gene tree inference</kbd>. Please proceed to the next notebook (nb-3) where this is further examined in the context of species tree inference.