# nb1: Genealogical variation and sequence variation


### Notebook outline:
1. A visual introduction to genealogical variation.
2. Connecting genealogies to species trees (demographic model).
3. Connecting genealogies to sequence variation (observations).
4. Connecting genealogies to gene trees (inference).

### Learning objectives: 
By the end of this notebook series you should:
1. Be familiar with the `toytree` and `ipcoal` Python libraries.
2. Recognize the power of coalescent simulations to test hypotheses.
3. Have an improved understanding of gene-tree/species-tree concepts.


### Additional recommended reading:

- [Rosenberg and Nordborg (2002) Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics](https://eaton-lab.org/slides/genomics/readings/Rosenberg-and-Nordborg-2002.pdf)

### The ipyrad-analysis package
The ipyrad-analysis (ipa) package is a wrapper for conveniently running many types of phylogenetic inference tools in Jupyter notebooks. We will use it here to call the maximum likelihood inference software `RAxML` for gene tree inference.

In [1]:
import ipcoal
import toytree
import ipyrad.analysis as ipa

### Terminology (genealogies and gene trees)
**Genealogies** are the true, unobserved histories of a set of samples from one or more populations. The samples share a common ancestor at some point in their past, and the relationships among the samples and their ancestors trace back a true tree-like history.

**Gene trees** are estimates of the genealogies. Because we cannot observe genealogies directly, we must infer their structure from the data that we can observe, in the form of mutations that evolved on those genealogies. 

In this notebook we will examine some clear examples where inferred gene trees differ from the true genealogies due to estimation error. This is a common source of error in phylogenetic analyses that contributes to phylogenetic uncertainty. 

In [2]:
# generate a random species tree with 10 tips and a crown age of 1M generations (treeheight=1e6)
tree = toytree.rtree.unittree(10, treeheight=1e6, seed=123)

# draw the species tree
tree.draw(ts='c', tip_labels=True);

In [10]:
# setup simulation with small Ne
model1 = ipcoal.Model(tree=tree, Ne=1e4)

# simulate a short uninformative locus
model1.sim_loci(1, 100)

# infer a gene tree at this locus
model1.infer_gene_trees()

# draw the genealogy and gene tree
toytree.tree(model1.df.genealogy[0]).draw(edge_colors='orange', ts='c', tip_labels=True);
toytree.tree(model1.df.inferred_tree[0]).draw(ts='c', tip_labels=True);

wrote concat locus (10 x 100bp) to /tmp/7587.phy
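
One way to see why such a short locus carries little phylogenetic signal is to count how many mutations we expect on a single branch of the genealogy. The sketch below is only a back-of-envelope calculation: the mutation rate of 1e-8 per site per generation and the 100,000-generation branch are illustrative assumptions, not values read from the model above.

```python
# ASSUMPTION: per-site, per-generation mutation rate (illustrative value only)
MU = 1e-8

def expected_mutations(branch_gens, nsites, mu=MU):
    """Expected substitutions on one branch = rate x time x number of sites."""
    return mu * branch_gens * nsites

# hypothetical internal branch lasting 100,000 generations
branch = 1e5
for nsites in (100, 2000):
    print(nsites, "bp:", expected_mutations(branch, nsites))
```

Under these assumptions a 100 bp locus is expected to carry roughly 0.1 mutations on that branch, so the split it defines will usually have no supporting substitution at all, whereas a 2000 bp locus is expected to carry about 2.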


We can extend the length of the locus to make it more informative, in this case to 2Kb. The resulting gene tree is better resolved and more closely matches the genealogy.

In [28]:
# setup simulation with small Ne
model1 = ipcoal.Model(tree=tree, Ne=1e4)

# simulate a longer, more informative locus (2000 bp)
model1.sim_loci(1, 2000)

# infer a gene tree at this locus
model1.infer_gene_trees()

# draw the genealogy and gene tree
t0 = toytree.tree(model1.df.genealogy[0])
t0.draw(edge_colors='orange', ts='c', tip_labels=True);
t1 = toytree.tree(model1.df.inferred_tree[0])
t1.draw(ts='c', tip_labels=True);

# does the genealogy match the gene tree?
rf = t0.treenode.robinson_foulds(t1.treenode, unrooted_trees=True)[0]
if rf:
    print("inferred gene tree does not match the genealogy")
else:
    print("inferred gene tree matches the genealogy")

wrote concat locus (10 x 2000bp) to /tmp/7587.phy
inferred gene tree matches the genealogy
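
The comparison above relies on the Robinson-Foulds (RF) distance, which counts the splits (bipartitions of the tips) found in one unrooted tree but not the other; a distance of 0 means the two topologies are identical. The following is a minimal pure-Python sketch of that idea on small nested-tuple trees. It is a simplified stand-in for illustration, not the implementation used by `toytree`, and the helper names (`leaves`, `splits`, `rf_distance`) and the two example trees are hypothetical.

```python
def leaves(node):
    """Return the set of tip names below a nested-tuple node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def splits(tree):
    """Collect the non-trivial splits of a tree, treated as unrooted."""
    all_tips = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        other = all_tips - clade
        # skip trivial splits that separate fewer than two tips
        if len(clade) > 1 and len(other) > 1:
            # an unordered pair, so A|B and B|A count as the same split
            found.add(frozenset((clade, other)))
        for child in node:
            visit(child)

    visit(tree)
    return found

def rf_distance(tree1, tree2):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree1) ^ splits(tree2))

# two hypothetical five-tip trees that differ in one clade
tree_a = ((("a", "b"), "c"), ("d", "e"))
tree_b = ((("a", "c"), "b"), ("d", "e"))
print(rf_distance(tree_a, tree_a), rf_distance(tree_a, tree_b))
```

Because splits are stored as unordered pairs, the position of the root does not affect the result, which mirrors the `unrooted_trees=True` option used in the cell above.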


As we learned in the last notebook, when we increase the effective population size any sampled genealogy is more likely to differ from the species tree. Let's examine this effect with sequence data and gene tree inference. Here I simulate a 2Kb locus again, but on a species tree with Ne=2e5.

In [26]:
# setup simulation with a larger Ne
model1 = ipcoal.Model(tree=tree, Ne=2e5)

# simulate a 2000 bp locus
model1.sim_loci(1, 2000)

# infer a gene tree at this locus
model1.infer_gene_trees()

# draw the genealogy and gene tree
t0 = toytree.tree(model1.df.genealogy[0])
t0.draw(edge_colors='orange', ts='c', tip_labels=True);
t1 = toytree.tree(model1.df.inferred_tree[0])
t1.draw(ts='c', tip_labels=True);

# does the genealogy match the gene tree?
rf = t0.treenode.robinson_foulds(t1.treenode, unrooted_trees=True)[0]
if rf:
    print("inferred gene tree does not match the genealogy")
else:
    print("inferred gene tree matches the genealogy")

wrote concat locus (10 x 2000bp) to /tmp/7587.phy
inferred gene tree does not match the genealogy
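
The dependence of discordance on Ne can be quantified in the simplest case of a three-taxon species tree, where the probability that a genealogy's topology differs from the species tree is (2/3)·exp(-T), with T the internal branch length in coalescent units (t / 2Ne generations for diploids). The sketch below evaluates that formula; the internal branch length of 100,000 generations is a hypothetical value chosen for illustration and is not taken from the 10-tip tree used in this notebook.

```python
import math

def p_discordant(t_gens, ne):
    """P(genealogy topology differs from a 3-taxon species tree)."""
    # internal branch length in coalescent units (2Ne generations, diploid)
    t_coal = t_gens / (2 * ne)
    return (2 / 3) * math.exp(-t_coal)

# ASSUMPTION: hypothetical internal branch of 100,000 generations
t_internal = 1e5
for ne in (1e4, 2e5, 2e6):
    print(f"Ne={ne:.0e}  P(discordant)={p_discordant(t_internal, ne):.3f}")
```

With this branch length the probability of discordance is below 1% at Ne=1e4 but rises to roughly one half at Ne=2e5 and approaches its upper limit of 2/3 at Ne=2e6.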


When genealogical discordance is greater, it also becomes more likely that, if a locus contains multiple linked genealogies concatenated together, those genealogies will have different topologies. This problem grows when the tree is very deep in units of generations, Ne is very large, and recombination rates are high. In the example below I set Ne=2e6 and print the number of genealogies contained within the 10Kb locus.

In [25]:
# setup simulation with a very large Ne
model1 = ipcoal.Model(tree=tree, Ne=2e6)

# simulate a long (10 Kb) locus and count the genealogies it contains
model1.sim_loci(1, 10000)
print("10Kb locus contains {} genealogies".format(model1.df.shape[0]))

# infer a gene tree at this locus
model1.infer_gene_trees()

# draw the genealogy and gene tree
t0 = toytree.tree(model1.df.genealogy[0])
t0.draw(edge_colors='orange', ts='c', tip_labels=True);
t1 = toytree.tree(model1.df.inferred_tree[0])
t1.draw(ts='c', tip_labels=True);

# does the genealogy match the gene tree?
rf = t0.treenode.robinson_foulds(t1.treenode, unrooted_trees=True)[0]
if rf:
    print("inferred gene tree does not match the genealogy")
else:
    print("inferred gene tree matches the genealogy")

10Kb locus contains 212 genealogies
wrote concat locus (10 x 10000bp) to /tmp/7587.phy
inferred gene tree does not match the genealogy
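
To get a feel for why a single locus can contain so many genealogies, we can estimate the expected number of recombination breakpoints along it. The sketch below uses the standard single-population coalescent expectation: the expected total tree length is 4Ne·(1 + 1/2 + ... + 1/(n-1)) generations, and each generation of branch length offers a chance `recomb` of a breakpoint between each pair of adjacent sites. Two things here are assumptions: the recombination rate of 1e-9 per site per generation is an illustrative value, and treating the 10 samples as one panmictic population ignores the species tree structure, so the result is only a rough guide.

```python
def expected_breakpoints(ne, recomb, nsites, nsamples):
    """Expected recombination breakpoints along a locus (single population)."""
    # harmonic sum 1 + 1/2 + ... + 1/(n-1)
    harmonic = sum(1 / i for i in range(1, nsamples))
    # expected total branch length of a genealogy, in generations (diploid)
    expected_tree_length = 4 * ne * harmonic
    # a breakpoint can fall in any of the (nsites - 1) gaps between sites
    return recomb * (nsites - 1) * expected_tree_length

# ASSUMPTION: recomb=1e-9 is an illustrative per-site, per-generation rate
for ne in (1e4, 2e5, 2e6):
    est = expected_breakpoints(ne, recomb=1e-9, nsites=10000, nsamples=10)
    print(f"Ne={ne:.0e}  expected breakpoints ~ {est:.1f}")
```

Under these assumptions the expectation scales linearly with Ne, from about one breakpoint at Ne=1e4 to a couple of hundred at Ne=2e6, which is the same order of magnitude as the number of genealogies printed above.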
