## Testing phylogenetic inference methods using simulations

### Notebook MSC-3: SNP-based quartet inference

In this notebook we will simulate *unlinked SNPs*, each representing a single variable site sampled from one of many locations throughout the genome, to test the SVDquartets algorithm. This type of data is generally obtained by whole-genome sequencing or RAD-seq.

In [2]:
import toytree
import ipcoal
import ipyrad.analysis as ipa

### Phylogenomic inference methods (tetrad quartet species tree inference)

Large multi-locus datasets are typically analyzed in one of three ways to infer a species tree efficiently. This notebook focuses on the third method shown below, in which the tree inference problem is first decomposed into many separate quartet inference problems. Each quartet tree is inferred from a genome-wide sample of unlinked SNPs. The estimated quartet trees are then joined into a supertree that is a statistically consistent estimate of the species tree topology under the multispecies coalescent (MSC). This method was first developed and implemented in the **SVDquartets** software, but we will run the same algorithm as implemented in the **tetrad** program below.

<img src="https://eaton-lab.org/slides/data-svg/consensus-pre.svg" style="width:85%">
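
The decomposition step can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not tetrad's implementation. The tip names `r0` to `r7` are an assumption (they follow the `r7` label used for rooting later in this notebook), and `quartet_splits` is a hypothetical helper. For 8 tips there are 70 possible quartets, and each quartet has three possible unrooted topologies.

```python
from itertools import combinations

# assumed tip labels for an 8-tip tree
tips = ["r{}".format(i) for i in range(8)]

# every unordered set of four tips is one quartet inference problem
quartets = list(combinations(tips, 4))
print(len(quartets))

def quartet_splits(quartet):
    """Return the three possible unrooted resolutions of a quartet."""
    a, b, c, d = quartet
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

# show the three candidate topologies for the first quartet
for left, right in quartet_splits(quartets[0]):
    print("{},{} | {},{}".format(left[0], left[1], right[0], right[1]))
```

A quartet method scores the three candidate splits for each quartet, keeps the best one, and then assembles the chosen quartet trees into a supertree.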

## The true species tree

The imbalanced (comb-shaped) tree topology below represents the true species tree history that we hope to infer from sequence data. By setting demographic parameters on this species tree history we can create a difficult phylogenetic inference problem that involves very high levels of genealogical discordance. 

The example below is a famous case where large effective population sizes (or short edge lengths) on several consecutive internal edges cause such high levels of genealogical discordance that a gene tree topology conflicting with the species tree occurs more frequently than the topology that matches it. This scenario has been termed the "anomaly zone". Phylogenetic inference methods that are statistically consistent under the multispecies coalescent model can correctly infer the true species tree in this scenario, whereas other methods can be misled into inferring an incorrect tree.


The tree is drawn below with edge lengths in units of **generations** and edge widths representing **effective population size ($N_e$)**.

In [3]:
# get an imbalanced species tree with crown age of 5M generations
tree = toytree.rtree.imbtree(8, treeheight=5e6)

# set Ne values on nodes of the tree
tree = tree.set_node_values(
    feature="Ne", 
    values={i: 2e7 for i in (9,10,11)},
    default=1e6,
)

# draw the tree showing parameters
tree.draw(ts='p', edge_type='p');

In [23]:
# short edge lengths in coalescent units
node = tree.idx_dict[10]
print("coalescent units: {:.3f}".format(node.dist / (2 * node.Ne)))

# long edge lengths in coal units
node = tree.idx_dict[8]
print("coalescent units: {:.3f}".format(node.dist / (2 * node.Ne)))

coalescent units: 0.018
coalescent units: 0.357
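
To see why these edge lengths matter, a classic three-taxon result gives the probability that a gene tree matches the species tree as $1 - \frac{2}{3}e^{-T}$, where $T$ is the internal edge length in coalescent units. The sketch below applies this formula to the two edges printed above. The edge length of `5e6 / 7` generations is an assumption (an 8-tip imbalanced tree with seven equally spaced node heights), chosen because it reproduces the printed values. The function names are hypothetical, and the triplet formula is only a rough guide for an 8-tip tree.

```python
import math

def coal_units(generations, ne):
    """Convert an edge length in generations to coalescent units (diploid)."""
    return generations / (2 * ne)

def prob_concordant(t):
    """Probability that a 3-taxon gene tree matches the species tree."""
    return 1 - (2 / 3) * math.exp(-t)

# assumed edge length: treeheight of 5e6 split evenly over 7 node heights
edge = 5e6 / 7

for ne in (2e7, 1e6):
    t = coal_units(edge, ne)
    print("Ne={:.0e}  T={:.3f}  P(match)={:.3f}".format(ne, t, prob_concordant(t)))
```

On the short edge the match probability is only slightly above 1/3, the value expected when the three lineages coalesce in random order. On the long edge it rises to a little over one half.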


### Set up the simulation

In [24]:
# set up a coalescent simulator with per-site mutation and recombination rates
mod = ipcoal.Model(tree, mut=5e-8, recomb=1e-9)

In [32]:
# simulate one genealogy and examine it embedded in the species tree
mod.sim_trees(1)
toytree.container(mod, spacer=1, idx=0);

In [12]:
# sample 4 genealogies from this species tree
mod.sim_trees(4)
mod.draw_genealogies(fixed_order=True);

### Infer species tree from unlinked SNPs using tetrad (SVDquartets algorithm)

In [46]:
# simulate 10000 unlinked SNPs and write to a file
mod.sim_snps(10000)
mod.write_snps_to_hdf5(name='test', outdir='/tmp')

wrote 10000 SNPs to /tmp/test.snps.hdf5


In [47]:
# setup tetrad analysis
tet = ipa.tetrad(name='test', data='/tmp/test.snps.hdf5', workdir='/tmp', nboots=10)

# run distributed inference
tet.run(auto=True, quiet=True, force=True)

# draw inferred tree
toytree.tree(tet.trees.tree).root("r7").draw(ts='s', node_labels="support");

loading snps array [8 taxa x 10000 snps]
max unlinked SNPs per quartet [nloci]: 10000
quartet sampler [full]: 70 / 70
[####################] 100% 0:00:00 | boot rep. 10 | avg SNPs/qrt: 7647 
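
The scoring step for a single quartet can be sketched with NumPy. This is a simplified illustration of the SVDquartets idea, not tetrad's code. Site patterns for four tips are counted in a 4x4x4x4 array, the array is flattened into a 16x16 matrix for each of the three possible splits, and each split is scored by how far its matrix is from rank 10 (the square root of the sum of squared singular values from the 11th onward). The lowest score wins. Everything named here (`evolve`, `svd_score`, the change probabilities, the seed) is an assumption for illustration, and the data are simulated on a single fixed tree ((A,B),(C,D)) with a Jukes-Cantor-like process rather than under the full coalescent.

```python
import numpy as np

rng = np.random.default_rng(123)

def evolve(states, p_change):
    """Copy states, changing each site to a different base with prob p_change."""
    change = rng.random(states.size) < p_change
    shift = rng.integers(1, 4, size=states.size)
    return np.where(change, (states + shift) % 4, states)

# simulate sites on the fixed quartet tree ((A,B),(C,D)); bases coded 0-3
nsites = 20000
x = rng.integers(0, 4, size=nsites)    # ancestor of A and B
y = evolve(x, 0.3)                     # ancestor of C and D
a, b = evolve(x, 0.1), evolve(x, 0.1)
c, d = evolve(y, 0.1), evolve(y, 0.1)

# count site patterns, indexed as [A, B, C, D], and convert to frequencies
counts = np.zeros((4, 4, 4, 4))
np.add.at(counts, (a, b, c, d), 1)
freqs = counts / nsites

def svd_score(freqs, order):
    """Distance of a split's 16x16 flattening from the nearest rank-10 matrix."""
    flat = freqs.transpose(order).reshape(16, 16)
    sv = np.linalg.svd(flat, compute_uv=False)
    return float(np.sqrt(np.sum(sv[10:] ** 2)))

# the axis order puts the first pair on the rows, the second pair on the columns
scores = {
    "AB|CD": svd_score(freqs, (0, 1, 2, 3)),
    "AC|BD": svd_score(freqs, (0, 2, 1, 3)),
    "AD|BC": svd_score(freqs, (0, 3, 1, 2)),
}
best = min(scores, key=scores.get)
print(scores)
print("best split:", best)
```

Because the sites were generated on ((A,B),(C,D)), the flattening for AB|CD is close to low rank and should receive the smallest score, while the other two flattenings remain close to full rank.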

### Infer a tree with raxml (concatenation)
Here we simulate 1000 loci that are each 1000bp in length, which is typical of a modern phylogenomic dataset. Because each individual locus contains few variable sites, a simple approach is to concatenate all of the loci into a single large alignment (supermatrix) and infer one tree from it.

In [33]:
# simulate 1000 loci each 1000bp in length and write the supermatrix to file
mod.sim_loci(nloci=1000, nsites=1000)
mod.write_concat_to_phylip(name="test", outdir="/tmp")

wrote concat locus (8 x 1000000bp) to /tmp/test.phy
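
What concatenation does to the data can be shown with a toy example. This sketch uses three made-up 4bp loci for three tips, with hypothetical helper names, and writes a relaxed PHYLIP-style string. The exact layout that ipcoal writes (name padding, line wrapping) may differ.

```python
# toy alignments: one dict per locus, mapping tip name to sequence
loci = [
    {"r0": "ACGT", "r1": "ACGA", "r2": "ACTT"},
    {"r0": "GGCA", "r1": "GGCA", "r2": "GACA"},
    {"r0": "TTAC", "r1": "TAAC", "r2": "TTAC"},
]

def concatenate(loci):
    """Join the sequences of every locus end to end for each tip."""
    names = sorted(loci[0])
    return {name: "".join(locus[name] for locus in loci) for name in names}

def to_phylip(matrix):
    """Format a name-to-sequence dict as a relaxed PHYLIP string."""
    nsites = len(next(iter(matrix.values())))
    lines = ["{} {}".format(len(matrix), nsites)]
    lines += ["{} {}".format(name, seq) for name, seq in matrix.items()]
    return "\n".join(lines)

supermatrix = concatenate(loci)
print(to_phylip(supermatrix))
```

After concatenation the locus boundaries are gone, so a tree inferred from the supermatrix treats every site as if it shared one genealogy. That assumption is the one that fails under high genealogical discordance.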


In [34]:
# setup raxml inference command
rax = ipa.raxml(name='test', data="/tmp/test.phy", workdir="/tmp")

# run inference 
rax.run(force=True)

# draw inferred tree
toytree.tree(rax.trees.bipartitions).root("r7").draw(ts='s', node_labels="support");

job test finished successfully


### Infer a species tree with ASTRAL3 (multi-step)
This involves first estimating a gene tree for every locus, and then using the gene trees as input to ASTRAL.

Normally, this method is quite fast and efficient, but since we are working on a small cloud-based instance for this workshop, we have few computing cores available. In testing I found the gene tree inference step below to take about 30 minutes.
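
The two-step logic (gene trees first, then a summary of them) can be illustrated in miniature. This sketch is not ASTRAL's algorithm, which scores quartets across gene trees with many tips. It draws rooted three-tip gene tree topologies directly from the coalescent for a species tree ((A,B),C) instead of inferring them from sequences, and then summarizes them by tallying topologies. The edge length of 0.357 coalescent units is the longer edge computed earlier in this notebook. The function name and the seed are assumptions.

```python
import math
import random
from collections import Counter

random.seed(42)

def sim_triplet_topology(t):
    """Draw one gene tree topology for species tree ((A,B),C).

    t is the internal edge length in coalescent units. Lineages A and B
    coalesce on that edge with probability 1 - exp(-t). Otherwise all three
    lineages reach the root population and a random pair coalesces first.
    """
    if random.random() < 1 - math.exp(-t):
        return "((A,B),C)"
    return random.choice(["((A,B),C)", "((A,C),B)", "((B,C),A)"])

# step 1: a collection of gene trees (simulated here, one per locus)
gene_trees = [sim_triplet_topology(0.357) for _ in range(1000)]

# step 2: summarize the gene trees by tallying their topologies
counts = Counter(gene_trees)
summary = counts.most_common(1)[0][0]
print(counts)
print("most frequent topology:", summary)
```

For three tips the most frequent gene tree topology is expected to match the species tree, and the two discordant topologies should appear at roughly equal frequencies.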

In [48]:
# simulate 1000 loci each 500bp in length and write the loci to an HDF5 file
mod.sim_loci(nloci=1000, nsites=500)
mod.write_loci_to_hdf5(name="test", outdir="/tmp")

wrote 1000 loci to /tmp/test.seqs.hdf5


In [None]:
# setup raxml gene tree inference for every locus.
ts = ipa.treeslider(
    name='test', 
    data="/tmp/test.seqs.hdf5", 
    workdir="/tmp",
    inference_args={'f': 'd', 'x': None, "N": 10},
)
ts.run(auto=True, force=True)

building database: nwindows=1000; minsnps=1
[###############     ]  77% 0:08:25 | inferring trees 

In [37]:
# setup astral inference from inferred gene trees
ast = ipa.astral(name='test', data="...", workdir="/tmp")
ast.run()

# draw the inferred species tree
toytree.tree(ast.tree).draw();