# nb1: Genealogical variation and sequence variation


### Notebook outline:
1. A visual introduction to genealogical variation.
2. Connecting genealogies to species trees (demographic model).
3. Connecting genealogies to sequence variation (observations).
4. Connecting genealogies to gene trees (inference).

### Learning objectives: 
By the end of this notebook series you should:
1. Be familiar with the `toytree` and `ipcoal` Python libraries.
2. Recognize the power of coalescent simulations to test hypotheses.
3. Have an improved understanding of gene-tree/species-tree concepts.


### Additional recommended reading:

- [Rosenberg and Nordborg (2002) Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics](https://eaton-lab.org/slides/genomics/readings/Rosenberg-and-Nordborg-2002.pdf)

### The toytree and ipcoal packages
These two Python packages are designed to be used together within Jupyter notebooks to interactively create, manipulate, and visualize tree data objects, and to simulate the coalescent process and generate sequence data. The `ipcoal` package is built as a wrapper around the popular `msprime` coalescent simulator and extends its functionality for phylogenetic-scale analyses.

In [249]:
import ipcoal
import toytree

### Start with a species tree
In this example we will start with a known parameterized species tree model from which we will sample genealogical histories. The amount of genealogical variation depends on the parameters of this model, specifically the effective population size (Ne). A species tree acts as a container within which genealogies must fit. It defines a topology with edge lengths in units of generations, and an effective population size on each edge.

In [251]:
# generate a random species tree with 10 tips and a crown age of 1M generations
tree = toytree.rtree.unittree(10, treeheight=1e6, seed=123)

# draw the species tree
tree.draw(ts='c', tip_labels=True);

To sample a genealogy from the species tree above we will create an `ipcoal.Model` object, passing it the species tree and an Ne value as parameters. Here the Ne value is low relative to the edge lengths, so discordance with the species tree is very low. The genealogy (tree with orange edges) matches the species tree topology (tree with black edges).

In [253]:
# sample a genealogy from this species tree with Ne=1e4
model1 = ipcoal.Model(tree=tree, Ne=1e4)
model1.sim_trees(1)
model1.draw_genealogy(idx=0, edge_colors='orange');

Another way to visualize this is to examine coalescent times with respect to the divergence/speciation times of the lineages. Here you can see that coalescent events occur almost instantaneously within each edge of the species tree. Each sample is colored differently going backwards in time from the tips of the species tree until they coalesce with another sample. 

In [254]:
model1.draw_demography(idx=0, spacer=1, height=300);
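
To put a number on "almost instantaneously", here is a minimal sketch of the standard coalescent result for two lineages in a diploid population of constant size: the probability that they coalesce within an edge of `t` generations is `1 - exp(-t / (2 * Ne))`. The edge length used below is a hypothetical value chosen for illustration, not one read from the tree above, and the helper name is our own.

```python
import math


def prob_coalesce_within_edge(edge_gens: float, Ne: float) -> float:
    """Probability that two lineages coalesce within an edge.

    Assumes a diploid population of constant size Ne and an edge
    length measured in generations.
    """
    return 1.0 - math.exp(-edge_gens / (2.0 * Ne))


# hypothetical edge length (generations); not taken from the tree above
EDGE_GENS = 1e5

# Ne matches the value passed to model1
p_low_ne = prob_coalesce_within_edge(EDGE_GENS, Ne=1e4)
print(f"Ne=1e4: P(coalesce within edge) = {p_low_ne:.4f}")
```

With an edge ten times longer than Ne, two lineages entering the edge are nearly certain to coalesce before reaching its parent node.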

Now let's look at what happens when the Ne value is increased. Here we set Ne to 2e5 across the entire tree, which raises the expected level of discordance. The sampled genealogy below no longer matches the species tree topology. Similarly, when we examine the coalescent times with respect to the species tree divergences in the next plot, we can see that many deep coalescent events occurred near the root of the species tree. When lineages fail to coalesce within the edge in which they first meet, the relationships among samples in the genealogy are no longer required to match the species tree. Try executing the cells below multiple times to examine stochastic coalescent variation over multiple sampled genomic regions.

In [268]:
# sample a genealogy from this species tree with Ne=2e5
model2 = ipcoal.Model(tree=tree, Ne=2e5)
model2.sim_trees(1)

# draw the genealogy
model2.draw_genealogy(idx=0, edge_colors='orange');

# draw the genealogy within a container
model2.draw_demography(idx=0, spacer=1, height=300);
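
The effect of Ne on discordance can be sketched with the classic three-taxon result: for a species tree ((A,B),C) with an internal edge of `T` generations and a diploid population of size Ne, the probability that a sampled genealogy has a different topology than the species tree is `(2/3) * exp(-T / (2 * Ne))`. The 10-tip tree above is more complex than this, and the internal edge length below is a hypothetical value, so treat the numbers only as a guide to the trend.

```python
import math


def prob_discordant_3taxa(internal_gens: float, Ne: float) -> float:
    """P(genealogy topology != species tree) for a 3-taxon species tree.

    Assumes one sample per species, constant diploid Ne, and an
    internal edge length measured in generations.
    """
    return (2.0 / 3.0) * math.exp(-internal_gens / (2.0 * Ne))


# hypothetical internal edge length (generations)
INTERNAL_GENS = 1e5

# the two Ne values used for model1 and model2
discordance = {ne: prob_discordant_3taxa(INTERNAL_GENS, ne) for ne in (1e4, 2e5)}
for ne, prob in discordance.items():
    print(f"Ne={ne:g}: P(discordant) = {prob:.4f}")
```

Under these assumptions discordance is rare at Ne=1e4 but affects roughly half of all genealogies at Ne=2e5.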

### Setting and checking demographic parameters
In addition to setting Ne as a fixed value across the tree, you can also set variable Ne values across different edges of the tree. The best way to accomplish this is by using toytree to set values to nodes of the tree. This allows you to visually assess that your simulation was properly set up. Here I use the `.set_node_values` function of the tree to set greater Ne values on several edges by referencing their node index numbers. 

In [290]:
# create a new tree copy with Ne values mapped to nodes
vtree = tree.set_node_values(
    feature="Ne",
    values={i: 2e5 for i in (6, 7, 8, 9, 12, 15, 17)},
    default=1e4,
)

# draw with ts='p' to show Ne on edges
vtree.draw(ts='p');

In [291]:
# create simulator 
model3 = ipcoal.Model(tree=vtree)

# sample 1 genealogy
model3.sim_trees(1)

# draw genealogy within container 
model3.draw_demography(idx=0, spacer=1, height=300);

# draw genealogy alone to show discordance
model3.draw_genealogy(idx=0, edge_colors='orange');

So far we have been sampling only one genealogy at a time. But in most cases our interest in coalescent variation is to examine the distribution of variation over many sampled genealogies. This can be visualized in the example below by drawing an overlapping cloud of multiple sampled genealogies.

In [292]:
# sample 100 genealogies
model3.sim_trees(100)

# draw a cloud tree of 100 samples
mtre = toytree.mtree(model3.df.genealogy)
mtre.draw_cloud_tree(layout='d', edge_colors='orange');
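
The spread seen in a cloud tree reflects the randomness of coalescent waiting times. As a minimal sketch that ignores the species tree entirely and assumes a single panmictic diploid population, we can draw those waiting times directly with the standard library: while `k` lineages remain, the time to the next coalescence is exponential with rate `k * (k - 1) / (4 * Ne)` per generation. The function and variable names here are our own and are not part of `ipcoal`.

```python
import random


def sample_tmrca(nsamples: int, Ne: float, rng: random.Random) -> float:
    """Draw one time to the most recent common ancestor (generations)."""
    tmrca = 0.0
    for k in range(nsamples, 1, -1):
        # rate at which any pair among k lineages coalesces
        rate = k * (k - 1) / (4.0 * Ne)
        tmrca += rng.expovariate(rate)
    return tmrca


rng = random.Random(123)
N_SAMPLES, NE, N_REPS = 10, 2e5, 2000

tmrcas = [sample_tmrca(N_SAMPLES, NE, rng) for _ in range(N_REPS)]
mean_tmrca = sum(tmrcas) / N_REPS

# theoretical expectation: 4 * Ne * (1 - 1/n)
expected_tmrca = 4.0 * NE * (1.0 - 1.0 / N_SAMPLES)

print(f"simulated mean TMRCA: {mean_tmrca:.0f} generations")
print(f"expected TMRCA:       {expected_tmrca:.0f} generations")
print(f"range: {min(tmrcas):.0f} to {max(tmrcas):.0f} generations")
```

The wide range between the smallest and largest draws is the same stochastic variation that the overlapping orange edges display.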

### Genealogical variation and sequence variation
Across a genome alignment different regions will trace back to different coalescent histories, because each genome is a mosaic of segments inherited from different ancestors (the effect of recombination). Using `ipcoal` we can examine this variation in a number of ways. Let's first examine *unlinked* variation, representing genealogies that are completely independent of each other. These can be sampled by using the functions `sim_trees`, `sim_loci`, or `sim_snps`. The latter two functions simulate sequence data on the sampled genealogies.

In [294]:
# sample 10 unlinked genealogies
model = ipcoal.Model(tree, Ne=2e5)
model.sim_trees(10)

# draw several unlinked genealogies
mtre = toytree.mtree(model.df.genealogy)
mtre.draw(ts='c', layout='d', tip_labels=True, shared_axes=True, height=200);

In [295]:
# same drawing but with fixed tip order to highlight discordance
mtre.draw(ts='c', layout='d', tip_labels=True, shared_axes=True, height=200, fixed_order=True);

### Simulate sequence data
Simulated sequence data can be exported to a number of file formats. A visualization function is also available to validate the observed variation. The `sim_snps` function below will continue to run simulations until it produces the requested number of variable sites. Each SNP is statistically unlinked from the others. The second function, `sim_loci`, returns the requested number of sites whether or not they accumulate mutations during the simulation. The sites within a locus are in linkage disequilibrium, meaning they are statistically non-independent, or linked.

In [296]:
# sample sequence data on genealogies
model.sim_snps(10)

# visualize the sequence variation
model.draw_seqview(show_text=True);

In [297]:
# sample sequence data on genealogies
model.sim_loci(nloci=1, nsites=30)

# visualize the sequence variation
model.draw_seqview(show_text=True);
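
Because `sim_loci` returns invariant sites as well as variable ones, it can be useful to count how many sites in a locus actually vary. Here is a minimal sketch on a small made-up alignment (a list of equal-length strings, one per sample); the helper name and the toy data are hypothetical and are not produced by `ipcoal`.

```python
def count_variable_sites(seqs: list[str]) -> int:
    """Count alignment columns that contain more than one distinct base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


# toy alignment: 4 samples x 10 sites (made up for illustration)
toy_alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACGAACGTCC",
]

n_variable = count_variable_sites(toy_alignment)
print(f"{n_variable} of {len(toy_alignment[0])} sites are variable")
```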

### Simulating linked data
In some cases we may be interested in simulating the effect of genetic linkage, for example to model the spatial variation in genealogies along a chromosome, or to properly model the expected variation within large loci that could arise from recombination. Let's examine that now. This simulation is similar to the one above where we sampled 10 unlinked genealogies, but here we sample multiple linked genealogies. You can see in the plot that each genealogy differs slightly from the one that comes before it, perhaps only in branch lengths. This is the effect of genetic linkage. The genealogies in neighboring genomic blocks share many of the same genealogical ancestors, and thus are highly correlated.

In [304]:
# create a model with recombination to sample linked genealogies
model = ipcoal.Model(tree, Ne=2e5, recomb=1e-9)

# sample trees from 1 10Kb locus
model.sim_trees(nloci=1, nsites=10000)

# draw several linked genealogies
mtre = toytree.mtree(model.df.genealogy)
mtre.draw(ts='c', layout='d', tip_labels=True, shared_axes=True, height=200);

# draw several linked genealogies (highlight discordance)
mtre.draw(ts='c', layout='d', tip_labels=True, shared_axes=True, height=200, fixed_order=True);
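
How many linked genealogies should we expect within one locus? A rough back-of-the-envelope sketch: the expected number of recombination events in the history of a sample is approximately the per-site rate times the number of sites times the total branch length of the genealogy. Below, the total branch length is approximated by its expectation in a single panmictic diploid population, `4 * Ne * sum(1/i for i in 1..n-1)`, which ignores the species tree structure. Not every event changes the tree, so this is only a loose guide to the number of distinct genealogies.

```python
def expected_total_branch_length(nsamples: int, Ne: float) -> float:
    """Expected total genealogy length (generations), one panmictic population."""
    harmonic = sum(1.0 / i for i in range(1, nsamples))
    return 4.0 * Ne * harmonic


def expected_recomb_events(recomb: float, nsites: int, total_length: float) -> float:
    """Approximate expected number of recombination events in a locus."""
    return recomb * nsites * total_length


# values matching the model above: 10 samples, Ne=2e5, recomb=1e-9, 10 Kb
total_length = expected_total_branch_length(nsamples=10, Ne=2e5)
n_events = expected_recomb_events(recomb=1e-9, nsites=10_000, total_length=total_length)

print(f"expected total branch length: {total_length:.0f} generations")
print(f"expected recombination events: {n_events:.1f}")
```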

### Simulate sequence data
We can use this framework to simulate more realistic sequence data that approximates the type of data that is produced by existing genome sequencing technologies. This could include many short loci which would look similar to RAD-seq, or a small number of longer loci, approximating hybrid capture probes. 

In [309]:
# generate sequence data for 100 loci each 150bp in length
model.sim_loci(nloci=100, nsites=150)

The `.df` dataframe contains the simulated sequence of genealogies across the length of each simulated locus. In the example below you can find loci that are represented by multiple genealogies, each covering some part of its extent. When we simulate sequence data on this locus the resulting sequence data represents the concatenation of sequence mutations evolved on more than one genealogical history. This can introduce error into gene tree estimation if loci are very long. 

In [343]:
# the genealogical variation is also stored in a dataframe
model.df.head()

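|   | locus | start | end | nbps | nsnps | tidx | genealogy |
|---|-------|-------|-----|------|-------|------|-----------|
| 0 | 0 | 0 | 99 | 99 | 6 | 0 | ((r5:1.03988e+06,(r8:977... |
| 1 | 0 | 99 | 150 | 51 | 2 | 1 | ((r5:1.03988e+06,(r8:816... |
| 2 | 1 | 0 | 150 | 150 | 20 | 0 | (((r2:910862,(r0:448776,... |
| 3 | 2 | 0 | 150 | 150 | 13 | 0 | ((r6:732175,(r7:675391,r... |
| 4 | 3 | 0 | 9 | 9 | 2 | 0 | ((r2:567949,(r0:248826,r... |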
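
To find loci that span more than one genealogy, count the rows per locus. The sketch below uses only the standard library and re-enters the `locus`, `start`, and `end` values from the five rows shown above by hand, rather than querying `model.df`. Note that `head()` truncates the table, so locus 3 appears here with only its first interval.

```python
from collections import Counter

# (locus, start, end) copied from the five rows displayed above
rows = [
    (0, 0, 99),
    (0, 99, 150),
    (1, 0, 150),
    (2, 0, 150),
    (3, 0, 9),
]

# number of genealogy intervals recorded for each locus
intervals_per_locus = Counter(locus for locus, _start, _end in rows)

# loci whose sequence evolved on more than one genealogy
multi_genealogy_loci = sorted(
    locus for locus, count in intervals_per_locus.items() if count > 1
)
print(f"loci with more than one genealogy: {multi_genealogy_loci}")
```

In these rows, locus 0 is split at site 99, so its 150 sites evolved on two different genealogies.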


In [312]:
# the sequence matrix has 100 loci, 10 samples, and 150 sites
model.seqs.shape

(100, 10, 150)

In [314]:
# view the first locus
model.draw_seqview(idx=0, start=0, end=50);

In [315]:
# write the loci as a concatenated alignment
model.write_concat_to_phylip(name="test", outdir="/tmp")

wrote concat locus (10 x 15000bp) to /tmp/test.phy
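
The reported size follows from joining the loci end to end: 100 loci of 150 sites each give 10 samples x 15000 sites. As a minimal sketch of what concatenation means, here is a pure-Python version on a tiny made-up dataset that writes a simple sequential PHYLIP-style text. The helper names and toy data are hypothetical, and the exact layout that `ipcoal` writes may differ.

```python
def concat_loci(loci: list[dict[str, str]]) -> dict[str, str]:
    """Join each sample's sequences across loci, in locus order."""
    names = sorted(loci[0])
    return {name: "".join(locus[name] for locus in loci) for name in names}


def to_phylip(alignment: dict[str, str]) -> str:
    """Format an alignment as simple sequential PHYLIP-style text."""
    ntaxa = len(alignment)
    nsites = len(next(iter(alignment.values())))
    lines = [f"{ntaxa} {nsites}"]
    for name, seq in alignment.items():
        # pad names to a fixed width so sequences line up
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines)


# toy data: 2 loci x 3 samples x 4 sites (made up for illustration)
toy_loci = [
    {"r0": "ACGT", "r1": "ACGA", "r2": "ACGT"},
    {"r0": "GGCA", "r1": "GGCA", "r2": "GGTA"},
]

concat = concat_loci(toy_loci)
phylip_text = to_phylip(concat)
print(phylip_text)
```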


In [316]:
# write the first 10 loci as individual phylip files
model.write_loci_to_phylip(outdir="/tmp", idxs=range(10))

wrote 10 loci (10 x 150bp) to /tmp/[...].phy


### Migration (gene flow)
You can easily visualize and model migration in coalescent simulations. This can be used to validate methods for detecting hybrid introgression, or to examine its effect on other phylogenetic inference methods. To draw admixture edges you only need to designate the source and destination node indices. To designate admixture in a simulation you must provide (source, destination, timing, proportion), where the timing of introgression can be entered as a proportion of the shared edge length between two edges (e.g., 0.5).

In [327]:
# draw a hybrid edge on a tree by designating the source and destination node indices
tree.draw(ts='p', admixture_edges=(5, 6));
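
One plausible reading of "a proportion of the shared edge length" is sketched below: take the time interval during which both edges exist, and place the event that fraction of the way along it. This is our assumption about the interpretation, not a description of the `ipcoal` internals, and the edge times used are hypothetical values rather than ones read from the tree above.

```python
def admixture_time(
    source_edge: tuple[float, float],
    dest_edge: tuple[float, float],
    proportion: float,
) -> float:
    """Convert a proportional timing to an absolute time in generations.

    Each edge is given as (start, end) in generations before present,
    i.e. the height of its node and the height of that node's parent.
    """
    shared_start = max(source_edge[0], dest_edge[0])
    shared_end = min(source_edge[1], dest_edge[1])
    if shared_end <= shared_start:
        raise ValueError("edges do not overlap in time")
    return shared_start + proportion * (shared_end - shared_start)


# hypothetical edges: one spans 0 to 4e5, the other 1e5 to 6e5 generations
event_time = admixture_time((0.0, 4e5), (1e5, 6e5), proportion=0.5)
print(f"admixture event at {event_time:.0f} generations before present")
```

Here the two edges coexist between 1e5 and 4e5 generations ago, so a timing of 0.5 places the event at the midpoint of that interval.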

Here we simulate genealogies on the phylogenetic network above using relatively small effective population sizes, such that there is very little discordance except that which is caused by the additional admixture edge, which allows these two divergent samples to occasionally coalesce rapidly.

In [338]:
# simulate unlinked data with admixture
model4 = ipcoal.Model(tree, Ne=2e4, admixture_edges=[(5, 6, 0.5, 0.25)])

# sample genealogies from a species network
model4.sim_trees(200)

# draw a cloud tree of 200 samples
mtre = toytree.mtree(model4.df.genealogy)
mtre.draw_cloud_tree(
    layout='d',
    edge_style={"stroke-opacity": 0.01},
    fixed_order=tree.get_tip_labels(),
);

When we increase the effective population size such that discordance is caused by both incomplete lineage sorting and admixture, the admixture signal is less clear. In this case, distinguishing these two sources of discordance from each other may require a larger number of sampled loci. This demonstrates why sampling thousands of loci from across the genome is sometimes required to test evolutionary hypotheses.

In [341]:
# simulate unlinked data with admixture
model4 = ipcoal.Model(tree, Ne=1e5, admixture_edges=[(5, 6, 0.5, 0.25)])

# sample genealogies from a species network
model4.sim_trees(200)

# draw a cloud tree of 200 samples
mtre = toytree.mtree(model4.df.genealogy)
mtre.draw_cloud_tree(
    layout='d',
    edge_style={"stroke-opacity": 0.01},
    fixed_order=tree.get_tip_labels(),
    height=400,
);