## Phylogenetic inference: hands-on

In [33]:
import toytree
import ipcoal
import msprime

In [34]:
# conda install raxml-ng phylip -c conda-forge -c bioconda

### Select a true tree on which to generate data

To validate that our phylogenetic inference methods are working correctly, we will simulate data on a known tree and then test the ability of an inference method to reconstruct a result matching the true tree. We start by generating a tree topology with `toytree` and then simulate sequences on this tree with `ipcoal`.

In [35]:
true_tree = toytree.rtree.imbtree(ntips=10, treeheight=1e6)

In [36]:
true_tree.draw(scale_bar=True);

### Simulate sequences on the true tree
We can pass this tree to `ipcoal` to set up a coalescent simulation. For now, we will set the effective population size (Ne) to a very low number so that there is effectively no incomplete lineage sorting (i.e., genealogies will match the species tree), and we will also set recombination to 0 so that all sites are simulated on a single genealogy.
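To see why a very small Ne effectively removes incomplete lineage sorting, a rough calculation can help. The sketch below uses the standard three-taxon result that a gene tree disagrees with the species tree with probability (2/3)·exp(-t / 2Ne), where t is the internal branch length in generations. The function name and the example branch length of 100,000 generations are illustrative assumptions, not values read from `ipcoal`.

```python
import math

def discordance_probability(branch_gens, ne):
    """P(gene tree != species tree) for a rooted triplet with diploid Ne."""
    coalescent_units = branch_gens / (2.0 * ne)
    return (2.0 / 3.0) * math.exp(-coalescent_units)

# hypothetical internal branch of 100,000 generations
for ne in (2, 10_000, 1_000_000):
    print(ne, discordance_probability(1e5, ne))
```

With Ne=2 the branch spans tens of thousands of coalescent units, so the probability is numerically zero; it only approaches 2/3 when Ne is large relative to the branch length.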

In [81]:
# setup the coalescent model
model1 = ipcoal.Model(
    tree=true_tree, Ne=2, mut=1e-8, recomb=0, seed_mutations=123, seed_trees=321)

In [82]:
# simulate a (small) chromosome
model1.sim_loci(nloci=1, nsites=30)

In [83]:
# view the sequence data
model1.draw_seqview(show_text=True);

- Question 1: Do you think we can infer the correct phylogeny from this sequence data?
- Question 2: What parameters of the tree, model or function calls above should/could we change to make the data more informative?
- Question 3: What are some of the assumptions that are being made to generate this data?

### Writing sequence data to files

You can call the `.write_...` functions of the `ipcoal.Model` object to write sequence data to a number of file formats that are commonly used by phylogenetic inference software tools. The most common formats are `.phy` (phylip format), `.nex` (nexus format), and `.vcf` (variant call format). You can call these functions without any arguments to have the data returned as a string, or you can enter a filepath to have the string written to a file. Both are demonstrated below.

In [92]:
# write to a file (/tmp/test.phy)
model1.write_concat_to_phylip(name="test", outdir="/tmp")

wrote concat locus (10 x 30bp) to /tmp/test.phy


In [93]:
# view in PHY data file format
print(model1.write_concat_to_phylip())

10 30
r0         AAAATTCTTGATAGGTATGAACGAACGTGT
r1         AAAATTCTTGATAGGTATGAACGAACGTGT
r2         AAAATTCTTGATAGGTATGAACGAACGTGT
r3         AAAATTCTTGATAGGTATGAACGAACGTGT
r4         AAAATTCTTGATAGGTATGAACGAACGTGT
r5         AAAATTCTTGATAGGTATGAACGAACGTGT
r6         AAAATTCTTGATAGGTATGAACGAACGTGT
r7         AAAATTCTCGATAGGTATGAACGAACGTGT
r8         AAAATTCTTGATAGGTATGAACGAACGTGT
r9         AAAATTCTTGATAGGTATGAACGAACGTGT
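The PHYLIP layout printed above is simple enough to write by hand: a header line with the number of taxa and sites, then one padded name and sequence per line. The following is a minimal sketch of that layout, not `ipcoal`'s own writer; the function name `to_phylip` and the `name_width` parameter are hypothetical.

```python
def to_phylip(seqs, name_width=10):
    """Format a {name: sequence} dict as a sequential PHYLIP string."""
    lengths = {len(seq) for seq in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    # header: number of taxa, number of sites
    lines = [f"{len(seqs)} {lengths.pop()}"]
    for name, seq in seqs.items():
        # left-justify the name, then one space, then the sequence
        lines.append(f"{name:<{name_width}} {seq}")
    return "\n".join(lines)

phy = to_phylip({"r0": "AAAATTCTTG", "r7": "AAAATTCTCG"})
print(phy)
```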


In [94]:
# view in NEX data file format
print(model1.write_concat_to_nexus())

#nexus
begin data;
  dimensions ntax=10 nchar=30;
  format datatype=DNA missing=N gap=- interleave=yes;
  matrix

  r0	AAAATTCTTGATAGGTATGAACGAACGTGT
  r1	AAAATTCTTGATAGGTATGAACGAACGTGT
  r2	AAAATTCTTGATAGGTATGAACGAACGTGT
  r3	AAAATTCTTGATAGGTATGAACGAACGTGT
  r4	AAAATTCTTGATAGGTATGAACGAACGTGT
  r5	AAAATTCTTGATAGGTATGAACGAACGTGT
  r6	AAAATTCTTGATAGGTATGAACGAACGTGT
  r7	AAAATTCTCGATAGGTATGAACGAACGTGT
  r8	AAAATTCTTGATAGGTATGAACGAACGTGT
  r9	AAAATTCTTGATAGGTATGAACGAACGTGT

	;
end;


In [95]:
# view in VCF data file format (appears wrapped to a second line here)
print(model1.write_vcf())

   CHROM  POS ID REF ALT  QUAL FILTER INFO FORMAT   r0   r1   r2   r3   r4  \
0      1    9  .   T   C    99   PASS    .     GT  0|0  0|0  0|0  0|0  0|0   

    r5   r6   r7   r8   r9  
0  0|0  0|0  1|1  0|0  0|0  
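A VCF keeps only the variable positions of the alignment. As a rough illustration (not how `ipcoal` builds its VCF), the sketch below scans alignment columns and reports 1-indexed positions where more than one base is observed, using the 30 bp sequences printed above. The helper name `variable_sites` is hypothetical.

```python
def variable_sites(seqs):
    """Return 1-indexed positions where more than one base is observed."""
    columns = zip(*seqs.values())
    return [pos for pos, col in enumerate(columns, start=1) if len(set(col)) > 1]

# rebuild the alignment shown above: nine identical samples plus r7
ref = "AAAATTCTTGATAGGTATGAACGAACGTGT"
alignment = {f"r{i}": ref for i in range(10)}
alignment["r7"] = ref[:8] + "C" + ref[9:]

print(variable_sites(alignment))
```

The single variable column is position 9, matching the `POS` field of the one VCF record, where only `r7` carries the alternate allele.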


### Simulating sequence data files
Let's now set up another model object and simulate a larger amount of sequence data, including a few more complex arguments for specifying the substitution model. Below I show both a JC69 and an HKY model. The HKY model takes a transition-to-transversion ratio parameter (kappa) as well as equilibrium base frequencies (the long-term average frequency of each base). We'll use the HKY model going forward.
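For intuition about what kappa and the equilibrium frequencies control, here is a minimal sketch of the textbook HKY85 instantaneous rate matrix, in which the rate from base i to base j is the frequency of j, multiplied by kappa when the change is a transition (A<->G or C<->T). This is the standard parameterization and is not meant to reproduce the `transition_matrix` attribute that `msprime` reports; the function name and the A, C, G, T ordering are assumptions made for this example.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_rate_matrix(kappa, freqs):
    """Unnormalized HKY85 rate matrix; rows and columns ordered A, C, G, T."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, src in enumerate(BASES):
        for j, dst in enumerate(BASES):
            if i != j:
                scale = kappa if (src, dst) in TRANSITIONS else 1.0
                q[i][j] = scale * freqs[j]
        # diagonal makes each row sum to zero
        q[i][i] = -sum(q[i])
    return q

hky = hky_rate_matrix(kappa=2.0, freqs=(0.3, 0.2, 0.2, 0.3))
# JC69 is the special case with kappa=1 and equal base frequencies
jc = hky_rate_matrix(kappa=1.0, freqs=(0.25, 0.25, 0.25, 0.25))
for row in hky:
    print(row)
```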

In [98]:
# parameterize a subst model object from msprime
jc_model = msprime.JC69()

# show its transition matrix repr
jc_model.transition_matrix

array([[0.        , 0.33333333, 0.33333333, 0.33333333],
       [0.33333333, 0.        , 0.33333333, 0.33333333],
       [0.33333333, 0.33333333, 0.        , 0.33333333],
       [0.33333333, 0.33333333, 0.33333333, 0.        ]])

In [99]:
# parameterize a subst model object from msprime
hky_model = msprime.HKY(kappa=2, equilibrium_frequencies=(0.3, 0.2, 0.2, 0.3))

# show its transition matrix repr
hky_model.transition_matrix

array([[0.18181818, 0.18181818, 0.36363636, 0.27272727],
       [0.27272727, 0.        , 0.18181818, 0.54545455],
       [0.54545455, 0.18181818, 0.        , 0.27272727],
       [0.27272727, 0.36363636, 0.18181818, 0.18181818]])

In [100]:
# parameterize Model 
model = ipcoal.Model(
    tree=true_tree, Ne=2, recomb=0, mut=1e-8,
    subst_model=hky_model,
    seed_mutations=123, seed_trees=321,
)

In [101]:
# simulate a large chromosome (100_000 sites)
model.sim_loci(nloci=1, nsites=1e5)

In [102]:
# view summary of data (>5000 SNPs)
model.df

   locus  start     end    nbps  nsnps  tidx                    genealogy
0      0      0  100000  100000   5797     0  (r0:1000008.252799180219...
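The number of SNPs can be sanity-checked with a back-of-the-envelope expectation: mutation rate × number of sites × total branch length. The sketch below assumes that the internal nodes of the imbalanced tree are evenly spaced between the tips and the root height (an assumption about how the tree was built, not something verified here) and ignores the few extra generations added by the coalescent at Ne=2. Both function names are hypothetical.

```python
def caterpillar_tree_length(ntips, treeheight):
    """Total branch length of a fully imbalanced ultrametric tree,
    assuming internal nodes are evenly spaced from 0 to treeheight."""
    step = treeheight / (ntips - 1)
    heights = [step * k for k in range(1, ntips)]
    # the lowest node subtends two tips; every higher node subtends
    # one tip (edge = its height) and one internal edge (edge = step)
    total = 2 * heights[0]
    total += sum(h + step for h in heights[1:])
    return total

def expected_mutations(mut, nsites, tree_length):
    """Expected number of mutation events over the whole alignment."""
    return mut * nsites * tree_length

length = caterpillar_tree_length(ntips=10, treeheight=1e6)
print(length)
print(expected_mutations(1e-8, 30, length))
print(expected_mutations(1e-8, 1e5, length))
```

Under these assumptions the expectation is about 1.8 mutations for the 30 bp locus and about 6000 for the 100 kb locus, the same order as the `nsnps` value in the summary above. The observed SNP count is expected to be somewhat lower, because several mutations can hit the same site.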


In [104]:
# write data to a file
model.write_concat_to_phylip(name="concat-hky-100K", outdir="/tmp")

wrote concat locus (10 x 100000bp) to /tmp/concat-hky-100K.phy


### Infer a tree using raxml

The program RAxML is one of the most widely used tools for maximum-likelihood tree inference from sequence data. The latest version is named `raxml-ng`, and the older version is now referred to as `raxml-standard`. We will use the `-ng` version here, which was installed from conda. Installation instructions and the full documentation are available on GitHub: https://github.com/amkozlov/raxml-ng.

`raxml-ng` is a command line program, intended to be called from a terminal. Here we will use a feature of jupyter notebooks that allows you to enter code into a cell (which normally executes Python code) and to tell it to instead execute as if it were a terminal (bash shell). This is indicated by the `%%bash` header in the cell below. Let's start by examining the `--help` call to see available options (it is much easier to read if you widen your browser window).

Note: I use the `--redo` argument in all examples below. This is just so that you can re-run a cell without raxml-ng complaining that the result file already exists.
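If you prefer to stay in Python rather than use a `%%bash` cell, the same command can be launched with the standard `subprocess` module. This is only a sketch: the helper names `raxml_command` and `run_raxml` are hypothetical, the flags are the ones used in this notebook, and the run is skipped when `raxml-ng` is not found on your PATH.

```python
import shutil
import subprocess

def raxml_command(msa, model, mode, extra=()):
    """Build an argument list for raxml-ng; `mode` is e.g. "--search"."""
    return ["raxml-ng", mode, "--msa", msa, "--model", model, "--redo", *extra]

def run_raxml(cmd):
    """Run the command if raxml-ng is installed and return its stdout, else None."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

cmd = raxml_command(
    "/tmp/concat-hky-100K.phy", "HKY", "--search",
    extra=["--workers", "1", "--threads", "1"],
)
print(" ".join(cmd))
```

Calling `run_raxml(cmd)` would then execute the search and return the log text, which can be convenient when you want to parse the final log-likelihood programmatically.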

In [143]:
%%bash

raxml-ng --help


RAxML-NG v. 1.1 released on 29.11.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 4 cores, 15 GB RAM

Usage: raxml-ng [OPTIONS]

Commands (mutually exclusive):
  --help                                     display help information
  --version                                  display version information
  --evaluate                                 evaluate the likelihood of a tree (with model+brlen optimization)
  --search                                   ML tree search (default: 10 parsimony + 10 random starting trees)
  --bootstrap                                bootstrapping (default: use bootstopping to auto-detect #replicates)
  --all                              

## Get starting trees (--start)
By default, raxml-ng generates 20 different starting trees from which it begins the maximum likelihood search: 10 random trees and 10 parsimony trees (built by randomized stepwise addition, so they can differ from one another). Normally you would not run this step on its own, since it is performed internally when you call one of the main commands (e.g., `--search` or `--all`), but here I call `--start` so that we can examine these trees before proceeding.
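Parsimony trees are chosen to minimize the number of character changes needed to explain the alignment. As a rough illustration of how such a score is computed, here is a sketch of Fitch's algorithm for a single site on a rooted binary tree written as nested tuples. It is a teaching example under those assumptions, not raxml-ng's implementation, and the function name is hypothetical.

```python
def fitch_score(tree, states):
    """Return (possible root states, minimum number of changes) for one site.

    `tree` is a tip name (str) or a pair of subtrees; `states` maps tip
    names to their observed base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_score(left, states)
    right_set, right_cost = fitch_score(right, states)
    shared = left_set & right_set
    if shared:
        # children agree on at least one state: no change needed here
        return shared, left_cost + right_cost
    # children disagree: take the union and count one change
    return left_set | right_set, left_cost + right_cost + 1

site = {"a": "A", "b": "A", "c": "C", "d": "C"}
print(fitch_score((("a", "b"), ("c", "d")), site)[1])
print(fitch_score((("a", "c"), ("b", "d")), site)[1])
```

The tree that groups tips sharing the same base needs one change, while the alternative grouping needs two, so parsimony prefers the first tree for this site.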

In [146]:
! raxml-ng --msa /tmp/concat-hky-100K.phy --model HKY --start --redo


RAxML-NG v. 1.1 released on 29.11.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 4 cores, 15 GB RAM

RAxML-NG was called at 28-Mar-2022 15:09:55 as follows:

raxml-ng --msa /tmp/concat-hky-100K.phy --model HKY --start --redo

Analysis options:
  run mode: Starting tree generation
  start tree(s): random (10) + parsimony (10)
  random seed: 1648494595
  SIMD kernels: AVX2
  parallelization: coarse-grained (auto), PTHREADS (auto)


[00:00:00] Reading alignment from file: /tmp/concat-hky-100K.phy
[00:00:00] Loaded alignment with 10 taxa and 100000 sites

Alignment comprises 1 partitions and 304 patterns

Partition 0: noname
Model: HKY+FO
Alig

In [147]:
# view the starting trees
start_trees = toytree.mtree("/tmp/concat-hky-100K.phy.raxml.startTree")
start_trees.draw(shape=(5, 4));

## Evaluate likelihood of a single tree (--evaluate)

Let's now explore the next step in maximum likelihood tree inference, which is to optimize the model parameters (branch lengths and substitution model parameters) to find the values that maximize the likelihood of the data on a given tree. During a full ML analysis this is repeated on every starting tree, as well as on every new tree topology that is proposed from those starting trees during the heuristic search. Here, however, we start by evaluating the likelihood of a single fixed tree at a time using the `--evaluate` command of raxml-ng.
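To make "the likelihood of a tree" concrete, here is a minimal sketch of Felsenstein's pruning algorithm for one alignment column under the JC69 model with fixed branch lengths. It is a toy illustration, not what raxml-ng runs internally (which uses HKY here and also optimizes branch lengths and model parameters). The nested-tuple tree format, in which each internal node is a pair of `(child, branch_length)` entries, and all function names are assumptions made for this example.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """Probability of base i becoming base j over branch length t (subs/site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def partials(node, site):
    """Conditional likelihood of the data below `node` for each base at `node`."""
    if isinstance(node, str):
        return [1.0 if base == site[node] else 0.0 for base in BASES]
    (left, t_left), (right, t_right) = node
    p_left, p_right = partials(left, site), partials(right, site)
    out = []
    for x in BASES:
        sum_left = sum(jc69_prob(x, y, t_left) * p_left[k] for k, y in enumerate(BASES))
        sum_right = sum(jc69_prob(x, y, t_right) * p_right[k] for k, y in enumerate(BASES))
        out.append(sum_left * sum_right)
    return out

def site_loglik(tree, site):
    """Log-likelihood of one column, with equal base frequencies at the root."""
    return math.log(sum(0.25 * p for p in partials(tree, site)))

site = {"a": "A", "b": "A", "c": "C", "d": "C"}
good_tree = (((("a", 0.1), ("b", 0.1)), 0.1), ((("c", 0.1), ("d", 0.1)), 0.1))
poor_tree = (((("a", 0.1), ("c", 0.1)), 0.1), ((("b", 0.1), ("d", 0.1)), 0.1))
print(site_loglik(good_tree, site), site_loglik(poor_tree, site))
```

The log-likelihood of a whole alignment is the sum of these per-site values, and the tree that groups tips sharing the same base scores higher for this site.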

In [163]:
# let's write just the first tree (a random tree) and last tree (parsimony tree) to files.
start_trees[0].write("/tmp/random-tree.nwk")
start_trees[-1].write("/tmp/parsimony-tree.nwk")

In [164]:
%%bash

# evaluate likelihood of the random tree
raxml-ng --evaluate --msa /tmp/concat-hky-100K.phy --model HKY --redo --tree /tmp/random-tree.nwk


RAxML-NG v. 1.1 released on 29.11.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 4 cores, 15 GB RAM

RAxML-NG was called at 28-Mar-2022 15:52:15 as follows:

raxml-ng --evaluate --msa /tmp/concat-hky-100K.phy --model HKY --redo --tree /tmp/random-tree.nwk

Analysis options:
  run mode: Evaluate tree likelihood
  start tree(s): user
  random seed: 1648497135
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX2
  parallelization: coarse-grained (auto), PTHREADS (auto)


[00:00:00] Reading alignment from file: /tmp/concat-hky-100K.phy
[00:00:00] Loaded 

In [165]:
%%bash

# evaluate likelihood of the parsimony tree
raxml-ng --evaluate --msa /tmp/concat-hky-100K.phy --model HKY --redo --tree /tmp/parsimony-tree.nwk


RAxML-NG v. 1.1 released on 29.11.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 4 cores, 15 GB RAM

RAxML-NG was called at 28-Mar-2022 15:52:19 as follows:

raxml-ng --evaluate --msa /tmp/concat-hky-100K.phy --model HKY --redo --tree /tmp/parsimony-tree.nwk

Analysis options:
  run mode: Evaluate tree likelihood
  start tree(s): user
  random seed: 1648497139
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX2
  parallelization: coarse-grained (auto), PTHREADS (auto)


[00:00:00] Reading alignment from file: /tmp/concat-hky-100K.phy
[00:00:00] Load

### Questions

- Examine the final log-likelihood in each of the two evaluations above. Which tree is a better fit to the data?
- Examine the final optimized model parameters: were they correctly estimated in both evaluations, given that we know the true substitution model?

## Tree search (--search)

Now for a more complete version of the maximum likelihood method. The `--search` command essentially combines `--start` and `--evaluate`: it generates the starting trees and evaluates their likelihoods, and then performs a heuristic search by proposing new tree topologies through subtree pruning and regrafting (SPR) moves from each tree. Here I limit the number of workers and threads to one each, which makes it run slower but makes the output easier to read, because only one tree is analyzed at a time. Finally, it writes the best inferred tree to a file (bestTree), the optimized result from each of the 20 starting trees to a file (mlTrees), and the optimized model parameters to a file (bestModel).

In [179]:
%%bash

raxml-ng --search --msa /tmp/concat-hky-100K.phy --model HKY --redo --workers 1 --threads 1


RAxML-NG v. 1.1 released on 29.11.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 4 cores, 15 GB RAM

RAxML-NG was called at 28-Mar-2022 16:05:15 as follows:

raxml-ng --search --msa /tmp/concat-hky-100K.phy --model HKY --redo --workers 1 --threads 1

Analysis options:
  run mode: ML tree search
  start tree(s): random (10) + parsimony (10)
  random seed: 1648497915
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  fast spr radius: AUTO
  spr subtree cutoff: 1.000000
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX2
  parallelization: NONE/sequential


[00:00:00] Reading alignment from file: /tm

[00:00:00 -185563.604451] Model parameter optimization (eps = 10.000000)
[00:00:00 -183110.305762] AUTODETECT spr round 1 (radius: 5)
[00:00:00 -175594.912837] AUTODETECT spr round 2 (radius: 10)
[00:00:00 -175594.912835] SPR radius for FAST iterations: 5 (autodetect)
[00:00:00 -175594.912835] Model parameter optimization (eps = 3.000000)
[00:00:00 -175594.853091] FAST spr round 1 (radius: 5)
[00:00:00 -175594.853091] Model parameter optimization (eps = 1.000000)
[00:00:00 -175594.853071] SLOW spr round 1 (radius: 5)
[00:00:00 -175594.853071] SLOW spr round 2 (radius: 10)
[00:00:00 -175594.853071] Model parameter optimization (eps = 0.100000)

[00:00:00] ML tree search #9, logLikelihood: -175594.853071

[00:00:00 -327047.432499] Initial branch length optimization
[00:00:00 -184894.414403] Model parameter optimization (eps = 10.000000)
[00:00:00 -182447.760408] AUTODETECT spr round 1 (radius: 5)
[00:00:00 -176034.347595] AUTODETECT spr round 2 (radius: 10)
[00:00:00 -176034.347322] SPR 

[00:00:01 -175594.853059] Model parameter optimization (eps = 3.000000)
[00:00:01 -175594.853056] FAST spr round 1 (radius: 5)
[00:00:01 -175594.853056] Model parameter optimization (eps = 1.000000)
[00:00:01 -175594.853056] SLOW spr round 1 (radius: 5)
[00:00:01 -175594.853056] SLOW spr round 2 (radius: 10)
[00:00:01 -175594.853056] Model parameter optimization (eps = 0.100000)

[00:00:01] ML tree search #20, logLikelihood: -175594.853056


Optimized model parameters:

   Partition 0: noname
   Rate heterogeneity: NONE
   Base frequencies (ML): 0.299877 0.199778 0.200865 0.299480 
   Substitution rates (ML): 1.000000 1.959646 1.000000 1.000000 1.959646 1.000000 


Final LogLikelihood: -175594.853056

AIC score: 351231.706111 / AICc score: 351231.715353 / BIC score: 351431.477546
Free parameters (model + branch lengths): 21

Best ML tree saved to: /tmp/concat-hky-100K.phy.raxml.bestTree
All ML trees saved to: /tmp/concat-hky-100K.phy.raxml.mlTrees
Optimized model saved to: /tmp/concat-
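The AIC, AICc, and BIC scores in the output above follow from the final log-likelihood, the number of free parameters, and the number of alignment sites. The sketch below recomputes them with the standard formulas, taking the 21 free parameters to be the 17 branch lengths of an unrooted 10-tip tree plus kappa and three free base frequencies; the function name is hypothetical.

```python
import math

def information_criteria(loglik, k, n):
    """Return (AIC, AICc, BIC) for k free parameters and n sites."""
    aic = 2 * k - 2 * loglik
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2 * loglik
    return aic, aicc, bic

ntips = 10
branch_lengths = 2 * ntips - 3     # branches in an unrooted binary tree
hky_params = 1 + 3                 # kappa + three free base frequencies
k = branch_lengths + hky_params

aic, aicc, bic = information_criteria(-175594.853056, k, 100_000)
print(k, round(aic, 3), round(aicc, 3), round(bic, 3))
```

These values agree with the scores reported by raxml-ng up to rounding of the printed log-likelihood.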

### Results

In [168]:
# load the "bestTree" file
best_tree = toytree.tree("/tmp/concat-hky-100K.phy.raxml.bestTree")
best_tree.draw();

In [174]:
# root the "bestTree", ladderize it, and draw
rooted_best_tree = best_tree.root("r0").mod.ladderize()
rooted_best_tree.draw();

Did the ML search converge on the same result from all starting trees?

In [177]:
# load the "mlTrees" file
mtree = toytree.mtree("/tmp/concat-hky-100K.phy.raxml.mlTrees")
mtree.treelist = [i.root("r0").mod.ladderize() for i in mtree.treelist]
mtree.draw(shape=(5, 4));
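Comparing 20 drawings by eye is error-prone, so it can help to count distinct topologies programmatically. The sketch below is independent of `toytree`: it assumes rooted trees written as nested tuples of tip names and reduces each one to a canonical form in which the order of children does not matter. The function names are hypothetical.

```python
from collections import Counter

def canonical(tree):
    """Return a child-order-independent form of a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return tree
    # sort children by their printed form so rotated nodes compare equal
    return tuple(sorted((canonical(child) for child in tree), key=repr))

def count_topologies(trees):
    """Count how many trees share each distinct rooted topology."""
    return Counter(canonical(tree) for tree in trees)

t1 = ((("a", "b"), "c"), "d")
t2 = ("d", ("c", ("b", "a")))      # same topology as t1, nodes rotated
t3 = ((("a", "c"), "b"), "d")      # different topology

counts = count_topologies([t1, t2, t3])
print(len(counts), counts.most_common(1)[0][1])
```

If every ML search converged on the same topology, the counter holds a single entry.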

## Support values (--bootstrap)

Bootstrapping is the process of re-sampling a dataset with replacement until the resampled dataset is the same size as the original, and then repeating the analysis on it. It is a way of measuring support for your results under the assumption that your sample is a subset of the total data possible. It is commonly employed in ML phylogenetic inference, where it involves re-sampling sites (columns of the alignment) and repeating the tree inference. This can be performed in `raxml-ng` with the `--bootstrap` command, or with the `--all` command, which performs the `--search` as we did above and also performs bootstrapping. Let's go ahead and run an `--all` analysis and request 100 bootstrap trees. This saves all 100 bootstrap trees to a file and also produces a single tree that includes bootstrap support values summarized for each node in the newick string.
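The resampling step itself is short. Here is a minimal sketch, assuming the alignment is held as a dict of equal-length strings: it draws column indices with replacement and applies the same draw to every sample, so that each resampled column remains an intact site pattern. The function name and the toy alignment are illustrative only.

```python
import random

def bootstrap_columns(seqs, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    nsites = len(next(iter(seqs.values())))
    picks = [rng.randrange(nsites) for _ in range(nsites)]
    # apply the same column picks to every sample
    return {name: "".join(seq[i] for i in picks) for name, seq in seqs.items()}

alignment = {"r0": "ACGTACGT", "r1": "ACGTTCGT", "r2": "ACGAACGT"}
replicate = bootstrap_columns(alignment, random.Random(42))
for name, seq in replicate.items():
    print(name, seq)
```

Each replicate alignment would then be passed through the same tree search as the original data.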

In [182]:
%%bash

raxml-ng --all --msa /tmp/concat-hky-100K.phy --model HKY --bs-trees 100 --redo


RAxML-NG v. 1.1 released on 29.11.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 4 cores, 15 GB RAM

RAxML-NG was called at 28-Mar-2022 16:05:50 as follows:

raxml-ng --all --msa /tmp/concat-hky-100K.phy --model HKY --bs-trees 100 --redo

Analysis options:
  run mode: ML tree search + bootstrapping (Felsenstein Bootstrap)
  start tree(s): random (10) + parsimony (10)
  bootstrap replicates: 100
  random seed: 1648497950
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX2
  parallelization: coarse-grained (auto), PTHREADS (auto)


[00:00:00] Reading 

[00:00:01] [worker #2] Bootstrap tree #63, logLikelihood: -175452.044393
[00:00:01] [worker #0] Bootstrap tree #69, logLikelihood: -175486.396384
[00:00:01] [worker #1] Bootstrap tree #70, logLikelihood: -175188.454845
[00:00:01] [worker #3] Bootstrap tree #68, logLikelihood: -175964.246694
[00:00:01] [worker #2] Bootstrap tree #67, logLikelihood: -175617.488172
[00:00:01] [worker #0] Bootstrap tree #73, logLikelihood: -175049.083070
[00:00:02] [worker #2] Bootstrap tree #71, logLikelihood: -176076.845996
[00:00:02] [worker #3] Bootstrap tree #72, logLikelihood: -175365.436333
[00:00:02] [worker #1] Bootstrap tree #74, logLikelihood: -175541.628860
[00:00:02] [worker #3] Bootstrap tree #76, logLikelihood: -175607.969089
[00:00:02] [worker #2] Bootstrap tree #75, logLikelihood: -175461.433511
[00:00:02] [worker #0] Bootstrap tree #77, logLikelihood: -176108.945246
[00:00:02] [worker #1] Bootstrap tree #78, logLikelihood: -175751.026094
[00:00:02] [worker #3] Bootstrap tree #80, logLikel

### Finally, draw tree with supports

In [191]:
# load newick
final_tree = toytree.tree("/tmp/concat-hky-100K.phy.raxml.support")

In [229]:
# root and ladderize
final_tree = final_tree.root("r0").mod.ladderize()

In [228]:
# draw with style
final_tree.draw(
    width=400, height=400, 
    tip_labels_style={"font-size": 16},
    node_labels="support",
    node_labels_style={"baseline-shift": -10, "-toyplot-anchor-shift": -12},
    node_sizes=10,
);
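The support value on each node is the percentage of bootstrap trees that contain the same grouping of tips. The sketch below illustrates that tally on rooted nested-tuple trees; raxml-ng works with bipartitions of unrooted trees, so treat this as a simplified picture rather than its actual procedure. The function names and the toy trees are hypothetical.

```python
from collections import Counter

def clades(tree):
    """Return the set of tip sets found below each internal node."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips

    walk(tree)
    return found

def support_values(best, replicates):
    """Percent of replicate trees containing each clade of the best tree."""
    counts = Counter()
    for rep in replicates:
        counts.update(clades(rep))
    return {clade: 100.0 * counts[clade] / len(replicates) for clade in clades(best)}

best = (("a", "b"), ("c", "d"))
reps = [best, best, (("a", "c"), ("b", "d")), best]
support = support_values(best, reps)
print(support[frozenset({"a", "b"})])
```

In this toy case the clade (a, b) appears in three of the four replicates, giving a support of 75.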