# Tree thinking: interpreting phylogenetic trees


## Learning objectives:

By the end of this notebook you will:

1. Be able to interpret relationships on a phylogenetic tree. 
2. Have increased understanding of the newick format for storing tree data. 
3. Know how to root, unroot, and re-root phylogenetic trees.
4. Be more familiar with the `toytree` Python package.

### Understanding evolution using tree-thinking

The philosopher of science Robert O'Hara stated, "*It is impossible to really understand evolution without an ability to accurately interpret phylogenetic trees*", and that "*evolution itself is a theory of evolutionary trees*" (O'Hara 1988, 1997). In biology generally, and in this book already, you have seen several examples of phylogenetic trees. But have you thought carefully about how to interpret these trees, and what information is being presented? There are several common mistakes that most beginners make when interpreting phylogenetic trees, and that even practiced biologists often make as well. Recognizing and avoiding these mistakes will make you a better biologist, by allowing you to better question and interpret hypotheses about evolutionary relationships from trees.

### Phylogenetic trees
A phylogenetic tree is a hypothesis of the inferred evolutionary relationships among a set of samples, which are the units represented at the tips. Sometimes, when we have fossil or ancient DNA data, we may have additional information about ancestral (internal) nodes in a tree. At the extreme, we may even have historical samples that are direct ancestors of later samples, such as in experimental evolution studies. However, most of the time the observed samples are represented at the tips, and all of the other information in a tree represents a hypothesis. It is an attempt to describe a model of how a set of samples are related through evolutionary history.

The way we interact with phylogenetic trees most frequently is as images. Most people have seen a phylogeny in a museum, or television program, or in the news. Phylogenies are ubiquitous throughout biology, where they are used to describe the relationships among species, populations, individuals (genealogies), genes, and gene copies. However, not all tree diagrams are phylogenies. Many other types of hierarchically structured data can also be visualized as trees, but often with a very different interpretation.

Phylogenetic trees are distinct in that they are explicitly intended to represent evolution. This has consequences for how internal nodes in phylogenetic trees should be interpreted (as common ancestors), for the direction in which they should be read (root-to-tip, or vice versa), and for the types of relationships and information that can be extracted (i.e., how to describe evolutionary relationships).

### More than an image

Although we most often see phylogenetic trees as images, they can also be interpreted as statistical models. At its minimum, a phylogenetic tree represents a set of ancestor-descendant relationships. In addition, it may include information such as edge lengths that describe the magnitude of divergence between sets of samples. It may also include probabilities or weights as measures of confidence or support for the relationships. Many additional types of information can be contained in a phylogenetic tree that together represent a rich description of evolution, and often relate to parameters of the statistical inference method that was used to infer the phylogeny.

The best way to start to understand phylogenies as a type of data, as opposed to simply a drawing, is to explore the use of trees as data objects. Below we will use the `toytree` Python package to load, manipulate, draw, and deconstruct trees to better understand the type of information contained in phylogenetic trees.

In [None]:
import toytree

### Newick tree format

The text string below defines a tree in [newick format](https://en.wikipedia.org/wiki/Newick_format). When researchers are working with phylogenetic trees as data, this is the main type of data they are working with: simple text files!

This format could contain just the relationships -- described by nested parentheses like below -- or it can contain additional information such as branch lengths and/or support values, which we'll see later. You can see how the *nested* hierarchical relationship of a phylogeny (clades nested within clades) is easily represented by a *nested* set of parentheses. 

In the Python code block below, a <b>string</b> (the text contained within quotations) is <b>stored</b> to a <b>variable</b> called `newick1`. The name of the variable is arbitrary; we could have named it anything. Now we can reuse this object by referring to <i>newick1</i>, as we will see below.

In [None]:
# create a string variable storing a tree in newick format.
newick1 = "(gibbon, (orangutan, (gorilla, (chimp, human))));"
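
To see why nested parentheses map so naturally onto a tree, here is a minimal sketch of a parser that turns a name-only newick string into nested Python tuples. It is written in plain Python and does not use `toytree`; the function name `parse_newick` is made up for this illustration, and the sketch assumes the string contains only tip names, commas, and parentheses (no branch lengths or support values).

```python
def parse_newick(newick):
    """Parse a name-only newick string into nested tuples of tip names."""
    text = newick.strip().rstrip(";").replace(" ", "")
    pos = 0

    def peek():
        # return the current character, or "" at the end of the string
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        # otherwise read a tip name up to the next delimiter
        start = pos
        while peek() and peek() not in "(),":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unbalanced or malformed newick string")
    return tree


nested = parse_newick("(gibbon, (orangutan, (gorilla, (chimp, human))));")
print(nested)
```

Each pair of parentheses becomes one tuple, so clades nested within clades become tuples nested within tuples.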

### A tree object
We can now parse this newick string to represent it as a `ToyTree` object in Python by using the function `toytree.tree()`. This object has many functions associated with it for manipulating, drawing, comparing, and extracting information about trees. As you can see below, the ToyTree itself is just a Python class object in memory. Next we will start to call functions of this object to investigate its structure.

In [None]:
# parse the 'newick1' string into a ToyTree object named 'tree1'
tree1 = toytree.tree(newick1)

In [None]:
# this ToyTree is just a Python class object in memory.
tree1

### Drawing and interpreting trees
Below is a visualization of this tree created by calling the `.draw` function of the `ToyTree` object with a styling argument that shows an integer label for every node.

This tree is rooted, with the gibbon as the outgroup (more on what this means in a minute). Because we know that the tree is rooted, we can interpret the evolutionary relationships by reading from the tips towards the root of the tree. To do so, select any two tips and trace back along their edges until you find the node where they meet. That is their *most recent common ancestor* (MRCA). Ancestors are by definition older than their descendants. Two samples which share a younger MRCA  are more closely related than two samples that share an older MRCA. Using this method, we can identify which samples are more closely related to each other than others.

In [None]:
# call .draw function of tree1 to return a drawing that will display
tree1.draw(tree_style='s');

<div class="alert alert-success">
    <h3>Action 1:</h3> 
    
In terms of the numbered labels on nodes: (1) which node represents the common ancestor of chimp and human? (2) Which is the common ancestor of orangutan and gibbon? (3) Which of those pairs is more closely related, and why? 
</div>

<div class="alert alert-warning">

<details>
  <summary><h3>Click for answer.</h3></summary>
  
1. Node 5 is the common ancestor of Chimp and Human.
2. Node 8 is the common ancestor of Orangutan and Gibbon.
3. Chimp and Human are more closely related, because they share a more recent common ancestor. We know this because Node 8 is an ancestor of Node 5.
</details>


</div>

### Pitfall #1: thinking the order of the tips is relevant

A common pitfall when reading phylogenies is to think that because two tips are close to each other visually on the tree that they must also be closely related. This is wrong. As we just learned, the proper way to interpret evolutionary relationships on a tree is to trace back along the edges from the tips towards the root to find common ancestors. 

This point can be made clear by examining the same tree as above with nodes arbitrarily rotated such that the tip order changes, but the tree topology itself remains the same. The tree below is one such example, where node 7 has been rotated. One might erroneously read this tree to think that it now shows the Orangutan and Human are more closely related than in the previous tree, when in fact the tree still shows the exact same relationships it did before.

<i> The tree looks different, but if we read it correctly we can see that the relationships have not changed</i>.

In [None]:
tree1.mod.rotate_node(7).draw(tree_style='s');

Consider also the following tree, in which the tip order has been greatly changed. This is of course not a standard way to visualize trees, since the overlapping edges make it harder to interpret, but nevertheless, the tree topology remains the same. This is a clear example where interpreting the closeness of the tips would be misleading. One might think that the Gibbon is most closely related to Humans, when in fact this tree still shows Human and Chimp to be most closely related, and shows that the Gibbon is most distant and is equally related to all other samples (it shares the same MRCA with all other samples).

In [None]:
tree1.draw(tree_style='s', edge_type='c', fixed_position=[3, 1, 0, 2, 4],);

So far we have been finding the MRCA visually, but of course this task can be automated with code when our tree is represented as a data object, and not just a drawing. This becomes quite useful when working with very large trees, and is used frequently in computational tasks involving tree data. Below we use the function `get_mrca_node` of a ToyTree object, which can take any number of tip node names as arguments and returns the Node object that is their MRCA.

In [None]:
# the .get_mrca_node function will return the common ancestor 
print(tree1.get_mrca_node("human", "chimp"))
print(tree1.get_mrca_node("human", "chimp", "gorilla"))
print(tree1.get_mrca_node("gorilla", "gibbon"))
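
The "trace back from the tips until the paths meet" procedure can also be sketched in plain Python. The example below is only an illustration of the idea, not how `toytree` implements `get_mrca_node`. The child-to-parent map is written by hand for the ape tree, and the internal node names (`"n5"` to `"n8"`) and the function names are hypothetical labels chosen for this sketch.

```python
# child -> parent map for the ape tree, written by hand.
# Internal node names ("n5" .. "n8") are illustrative labels only.
parent = {
    "chimp": "n5", "human": "n5",
    "gorilla": "n6", "n5": "n6",
    "orangutan": "n7", "n6": "n7",
    "gibbon": "n8", "n7": "n8",
}


def path_to_root(node):
    """Return the list of nodes visited when walking from node to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def find_mrca(*tips):
    """Return the most recent node shared by the root paths of all tips."""
    paths = [path_to_root(tip) for tip in tips]
    shared = set(paths[0]).intersection(*paths[1:])
    # walking tip-to-root, the first shared node is the most recent one
    for node in paths[0]:
        if node in shared:
            return node


print(find_mrca("human", "chimp"))
print(find_mrca("human", "chimp", "gorilla"))
print(find_mrca("gorilla", "gibbon"))
```

Because every path ends at the root, any set of tips always shares at least one ancestor; the MRCA is simply the first one encountered.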

### Internal labels and clades
Before proceeding we will digress to discuss the node labels that we have been using thus far to refer to the internal nodes of the tree. If you look back at the newick string that was used to create the trees that we have been working with so far you will see that they do not have these integer labels in the data. So where did they come from? 

`toytree` assigns a unique integer label to every node when a ToyTree object is created. This is simply a way to uniquely refer to every node. The numbers are assigned in order, first to the tips, and then to internal nodes in increasing order until the root is reached. However, if the tree topology *is changed* then the internal node numbering will change as well. As we just saw, rotating nodes does not change the topology; an actual change to the relationships, however, would. So beware when reading the integer labels (referred to as *node idx labels*).

A safer, and more evolutionarily meaningful, way to refer to internal nodes of a tree is to refer to *who they are a common ancestor of*. In other words, to refer to a node by the **clade** that it forms. By definition, a clade is the group of all samples in a tree that descend from a given common ancestor. A group of samples that forms a clade is said to be **monophyletic**.

For example, we would refer to Node 5 from above as the common ancestor of the clade that includes Human and Chimp. This is shown in the code below by getting the MRCA of a set of taxa and then printing which samples are members of the clade descended from that node. As you can see in the last example, we find the MRCA of Gorilla and Gibbon (just 2 samples), but the clade descended from this MRCA node includes all samples (5 samples). This is because Gorilla and Gibbon alone do not form a monophyletic group.

In [None]:
# print the MRCA Node object, and the tip names descended from it.
mrca = tree1.get_mrca_node("human", "chimp")
print(mrca, mrca.get_leaf_names())

mrca = tree1.get_mrca_node("human", "chimp", "gorilla")
print(mrca, mrca.get_leaf_names())

mrca = tree1.get_mrca_node("gorilla", "gibbon")
print(mrca, mrca.get_leaf_names())
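
The definition of monophyly can be turned into a small test: a set of samples is monophyletic if it exactly matches the set of tips descended from some node. The sketch below uses plain nested tuples rather than a ToyTree, and the function names (`tips_of`, `all_clades`, `is_monophyletic`) are hypothetical names introduced only for this example.

```python
# the ape tree written by hand as nested tuples (one tuple per internal node)
tree = ("gibbon", ("orangutan", ("gorilla", ("chimp", "human"))))


def tips_of(node):
    """Return the set of tip names descended from a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(tips_of(child) for child in node))


def all_clades(node):
    """Return the tip set of this node and of every node below it."""
    clades = {tips_of(node)}
    if not isinstance(node, str):
        for child in node:
            clades |= all_clades(child)
    return clades


def is_monophyletic(tree, names):
    """True if the names are exactly the descendants of some node."""
    return frozenset(names) in all_clades(tree)


print(is_monophyletic(tree, ["human", "chimp"]))
print(is_monophyletic(tree, ["human", "chimp", "gorilla"]))
print(is_monophyletic(tree, ["gorilla", "gibbon"]))
```

Gorilla and Gibbon fail the test for the same reason described above: the smallest clade that contains both of them also contains every other sample.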

<div class="alert alert-success">
    <h3>Action 2:</h3> 
    
Which of the four trees below shows a different phylogenetic relationship from the others?
</div>

In [None]:
# load and draw trees (.mtree is used to work w/ multiple trees)
test1 = '(gibbon,(orangutan,(gorilla,(chimp,human))));'
test2 = '((((human,chimp),gorilla),orangutan),gibbon);'
test3 = '((((human,chimp),orangutan),gorilla),gibbon);'
test4 = '(((gorilla,(human,chimp)),orangutan),gibbon);'
toytree.mtree([test1, test2, test3, test4]).draw(ts='s', node_sizes=15);

<div class="alert alert-warning">

<details>
  <summary><h3>Click for answer.</h3></summary>
  
Only the third tree is different. In this tree the Orangutan forms a clade with Human and Chimp, whereas in the other three trees the Gorilla forms a clade with Human and Chimp, and the drawings simply differ by rotations of nodes.
    
</details>


</div>
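
Comparing rooted trees by eye gets harder as they grow, but the comparison can be sketched in code: two rooted trees show the same relationships if they contain the same set of clades, regardless of how nodes are rotated. The example below is a plain-Python illustration that does not use `toytree`. The helper names are hypothetical, and the conversion trick (quoting the names and evaluating the result as nested tuples) assumes simple name-only newick strings.

```python
import ast
import re


def newick_to_tuples(newick):
    """Convert a name-only newick string to nested tuples of tip names."""
    body = newick.strip().rstrip(";")
    quoted = re.sub(r"\w+", lambda match: repr(match.group()), body)
    return ast.literal_eval(quoted)


def collect_clades(node, clades):
    """Add the tip set of every internal node to clades; return node's tips."""
    if isinstance(node, str):
        return frozenset([node])
    tips = frozenset().union(*(collect_clades(child, clades) for child in node))
    clades.add(tips)
    return tips


def clades_of(newick):
    """Return the set of clades (as tip sets) in a rooted newick tree."""
    clades = set()
    collect_clades(newick_to_tuples(newick), clades)
    return clades


tests = [
    "(gibbon,(orangutan,(gorilla,(chimp,human))));",
    "((((human,chimp),gorilla),orangutan),gibbon);",
    "((((human,chimp),orangutan),gorilla),gibbon);",
    "(((gorilla,(human,chimp)),orangutan),gibbon);",
]
reference = clades_of(tests[0])
same_as_first = [clades_of(newick) == reference for newick in tests]
print(same_as_first)
```

Sets ignore order, so rotating a node (swapping the order of its children) never changes the result, whereas moving a sample into a different clade does.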

### Rooting trees: polarizing evolution

Earlier I mentioned that we would assume the tree was rooted. We'll now dive deeper into what that means. A phylogenetic tree can be rooted or unrooted, which indicates whether or not *we know* the direction of evolution on the tree, i.e., which node is the root (the common ancestor of all samples in the tree).

It turns out that most of the methods that we have for inferring phylogenetic trees are only able to infer an unrooted tree, and rely on additional evidence afterwards to identify the rooting. As an example, consider a data set composed of morphological measurements on many different mammals that is used to infer a phylogeny using the method of parsimony. This would involve proposing a phylogenetic tree hypothesis, and then counting how many evolutionary *changes* of each character are required to explain the observed trait data at the tips of the tree. Because we simply count the number of *changes*, we do not need to know whether each change represents a gain or a loss of a character. This turns out to be a strength of many phylogenetic inference methods: they can yield a tree hypothesis without requiring knowledge of the rooting of a tree. However, as evolutionary biologists, we *do want to know* how a tree should be rooted.

So, the most frequent solution is to use an *outgroup*. That is to say, we include a sample in the phylogenetic analysis that we *know* (strongly believe) is more distantly related to all of the other samples than they are to one another. Such information may come from alternative data (DNA versus morphology), previous analyses, fossils, or other sources. With an outgroup decided, and using the unrooted tree that we inferred, we can then root the tree by inserting a Node on the branch between the ingroup and outgroup samples, since we know that they must all share a MRCA together. Let's walk through this process visually to better understand it.

### Unrooted versus rooted trees
What is the difference between a rooted and unrooted tree? In truth, they can be hard to distinguish visually, and so it is actually best practice to state whether a tree is rooted or not when it is shown. However, most of the time, it is generally assumed from the style in which the tree is visualized whether it is rooted or not. For example, unrooted trees are generally shown using an "undirected" layout style, where a direction of root to tips is not shown, like below.

In [None]:
# a style dict that we will use to style unrooted tree drawings
style = {
    "layout": "unrooted",
    "tip_labels_style": {"font-size": 15},
    "node_sizes": 16,
    "node_labels": "idx",
    "node_colors": "lightgrey",
}

In [None]:
# draw tree1 as an unrooted tree.
tree1.unroot().draw(**style);

While the visual layout of a tree can indicate whether or not the rooting is known (or is being emphasized), a tree data object (ToyTree) is explicit in whether or not it is rooted. To understand this, let's start from how trees are stored and loaded from newick strings. 

The primary way in which a tree is represented as rooted versus unrooted *when working with trees as data* is in whether or not the top level Node (root) is dichotomous (branching into 2 child clades), or non-dichotomous (branching into >2 child clades). This is coded into the newick strings below, where one has two clades in the outer-most parentheses `(x,y);`, and the other has three `(x,y,z);`.

In [None]:
# create a rooted and unrooted tree with 4 samples
rooted = toytree.tree("((a,b),(c,d));")
unrooted = toytree.tree("(a,b,(c,d));")
toytree.mtree([rooted, unrooted]).draw(ts='s', node_sizes=14, tip_labels_style={"font-size": 15});

As you can see above, even though the tree on the right *can* be drawn using a directed layout (left to right), it is still in fact an unrooted tree, and so the drawing is misleading. We might interpret it to show that samples "c" and "d" share an exclusive most recent common ancestor, as they do in the rooted tree on the left, but we don't actually know that for the tree on the right, since there is no known root. (The node labeled 5 in the tree on the right simply indicates that there is a node representing the trichotomy in the tree object.)

This means that *we don't actually know* which nodes are descended from whom. This is easier to understand if we draw the same trees using the undirected layout, below. As you can see, the rooted tree on the left contains an extra node (6) that represents the root. By contrast, the unrooted tree on the right contains no such node.

In [None]:
toytree.mtree([rooted, unrooted]).draw(**style);
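
The newick convention described above (two clades in the outermost parentheses versus three or more) can be checked directly from the text, without building a tree object. The sketch below is a plain-Python illustration; the function name `root_degree` is hypothetical, and it assumes a well-formed newick string whose names contain no commas or parentheses.

```python
def root_degree(newick):
    """Count the child clades inside the outermost parentheses."""
    depth = 0
    children = 1
    for char in newick:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        elif char == "," and depth == 1:
            # a comma at depth 1 separates two children of the top-level node
            children += 1
    return children


for newick in ["((a,b),(c,d));", "(a,b,(c,d));"]:
    degree = root_degree(newick)
    label = "rooted" if degree == 2 else "unrooted (or unresolved at the root)"
    print(newick, degree, label)
```

A count of 2 corresponds to the dichotomous root of the rooted tree, and a count of 3 to the trichotomy that is conventionally read as an unrooted tree.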

### "Pinching" to root trees
The best way to think about rooting of evolutionary trees is to think of the edges as lengths of string connecting the nodes of the tree. The process of rooting is analogous to selecting some point along a string (edge), pinching it, and pulling it back to form a new node that will be the root. Let's try this on our unrooted tree by inserting the root node at different edges of the tree.

In the first example below we insert a node to make "a" and "b" form a clade, which makes this tree topology the same as the rooted tree above.
We can now plot this "re-rooted" tree in both layout styles, and see that it is rooted, and that we can interpret that "a" and "b" share a MRCA. 

In [None]:
# root the tree by inserting a node along the edge that makes "a" and "b" a clade.
unrooted.root("a", "b").draw(**style);
unrooted.root("a", "b").draw(ts='s');

But what if we rooted it elsewhere? This is key to my statement above that "we don't actually know which nodes are descended from whom" on an unrooted tree. For example, if we rooted the tree on sample "a", then "a" shares a MRCA with a clade formed by "b", "c" and "d". In other words, given this rooting, "a" and "b" *do not* share an exclusive MRCA as they do in the other example above.

In [None]:
# root the tree by inserting a node along the edge leading to "a", making "a" the outgroup.
unrooted.root("a").draw(**style);
unrooted.root("a").draw(ts='s');
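
One way to convince yourself that both rootings come from the same unrooted tree is to compare what changes and what does not. The clades differ between the two rootings, but the *splits* (the two groups of tips found on either side of each internal edge) are identical. The sketch below illustrates this with hand-written nested tuples in plain Python; the function names are hypothetical and it is not how `toytree` represents trees.

```python
# the two rootings from above, written by hand as nested tuples
rooted_on_ab = (("a", "b"), ("c", "d"))
rooted_on_a = ("a", ("b", ("c", "d")))


def tip_sets(node, found):
    """Add the tip set of every internal node to found; return node's tips."""
    if isinstance(node, str):
        return frozenset([node])
    tips = frozenset().union(*(tip_sets(child, found) for child in node))
    found.add(tips)
    return tips


def clades(tree):
    """Return all clades except the trivial one containing every tip."""
    found = set()
    everything = tip_sets(tree, found)
    found.discard(everything)
    return found


def splits(tree):
    """Return the unordered splits that have at least two tips on each side."""
    found = set()
    everything = tip_sets(tree, found)
    result = set()
    for side in found:
        other = everything - side
        if len(side) > 1 and len(other) > 1:
            result.add(frozenset([side, other]))
    return result


print(clades(rooted_on_ab) == clades(rooted_on_a))
print(splits(rooted_on_ab) == splits(rooted_on_a))
```

The first comparison is `False` because only one rooting contains the clade of "a" and "b"; the second is `True` because both rootings separate "a" and "b" from "c" and "d" along the same internal edge.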

**Aside:** 
As a way of storing tree information, the newick format alone has a shortcoming in that a trichotomy at the root of the tree can actually have two different meanings: (1) the tree is unrooted; or (2) the tree is rooted but the deepest split is unresolved (it is a polytomy). As we've just learned, these two scenarios would lead to trees that should be interpreted differently. Fortunately, other tree file formats such as NEXUS can include additional metadata that indicates whether or not a tree is rooted, and researchers will typically be able to use their taxonomic knowledge of a system to infer this as well.

### Edge length information (Divergence times, i.e., ages of clades)
Additional information such as the ages of clades is easy to include in the newick format. Below you can see that the lengths of branches are simply numeric values that follow a colon after each tip name or closing parenthesis (that is, after each node of the tree). Below we use a different tree style for plotting (`tree_style='n'`) since this style will show branch length differences.

The units of this plot are not indicated. Thus we do not know if it is thousands of years, millions of years, or if the units are even meant to represent time. **Branch lengths on a tree can represent different things**. Sometimes we represent the number of *character differences* separating taxa as units of branch lengths, and these characters could be counted from morphological data, or genetic substitutions, or even by counting other features of the genome such as inversions or transposable elements. 

If a tree is inferred from DNA sequence data then the branch lengths could represent the number of observed DNA differences between species. Converting units of nucleotide or amino acid substitutions into units of time is a tricky business that involves making assumptions about the rate of mutations and the generation times of organisms.

In [None]:
newick2 = "(gibbon:3,(orangutan:2,(gorilla:1,(chimp:0.25,human:0.25):0.75):1):1);"
toytree.tree(newick2).draw(tree_style='n', scale_bar=True);

A tree where branch lengths represent DNA substitutions, as opposed to time, is unlikely to have all the tips align. This is because different lineages may have different rates of evolution, or, even if their rates are the same, some may have accumulated more mutations by chance.

Below is a mock example of what an inferred tree might look like when the edges are substitutions instead of time. This is sometimes called a phylogram while the above tree is a chronogram. Generally, though, we refer to both as phylogenies and simply label the axes and figure legends to describe what they represent. 

In [None]:
newick3 = "(gibbon:0.03,(orangutan:0.02,(gorilla:0.01,(chimp:0.0075,human:0.0025):0.0075):0.001):0.001);"
tree3 = toytree.tree(newick3)
c, a, m = tree3.draw(tree_style='n', scale_bar=True);
a.x.label.text = "divergence (nucleotide substitutions)"
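
Whether the tips "align" can also be checked numerically by summing the branch lengths from the root to each tip. The sketch below is a plain-Python illustration that reads branch lengths straight from the two newick strings used above. The function names `root_to_tip_distances` and `tips_align` are hypothetical, and the small parser assumes a well-formed string (it skips any internal node labels and treats a missing length as 0).

```python
def root_to_tip_distances(newick):
    """Return {tip name: summed branch length from the root to that tip}."""
    text = newick.strip().rstrip(";").replace(" ", "")
    pos = 0

    def read_until(stops):
        # read characters up to (not including) the next stop character
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return text[start:pos]

    def parse_node():
        # returns {tip: distance from this node's parent} for the subtree
        nonlocal pos
        tips = {}
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            tips.update(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                tips.update(parse_node())
            pos += 1  # consume ")"
            read_until(":,)")  # skip an optional internal node label
        else:
            tips[read_until(":,)")] = 0.0
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1  # consume ":"
            length = float(read_until(",)"))
        return {name: dist + length for name, dist in tips.items()}

    return parse_node()


def tips_align(distances, tolerance=1e-9):
    """True if every tip is the same distance from the root."""
    values = list(distances.values())
    return max(values) - min(values) < tolerance


newick2 = "(gibbon:3,(orangutan:2,(gorilla:1,(chimp:0.25,human:0.25):0.75):1):1);"
newick3 = "(gibbon:0.03,(orangutan:0.02,(gorilla:0.01,(chimp:0.0075,human:0.0025):0.0075):0.001):0.001);"

for newick in (newick2, newick3):
    distances = root_to_tip_distances(newick)
    rounded = {name: round(dist, 4) for name, dist in distances.items()}
    print(rounded, tips_align(distances))
```

In the first tree every tip is the same distance from the root, as expected when branch lengths represent time and all samples are from the present. In the second tree the distances differ among tips, which is the typical pattern when branch lengths represent substitutions.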

<div class="alert alert-success">
    <h3>Action 3:</h3> 
    
Try to write a newick string for the relationships of six taxa using letters for names (a-f), and plot the tree. Try to write it so that `'a'` and `'b'` are sister taxa (share an exclusive MRCA), and so that `'f'` is the outgroup sample. Look back at previous examples for guidance. 

If your newick string is malformed then the `toytree.tree` function will raise an Error. If this happens modify your newick string and try again. 

</div>

<div class="alert alert-warning">

<details>
  <summary><h3>Click for answer.</h3></summary>
  
```python

# simple solution: only create minimum requested clades
newick = "(f,((a,b),(c,d,e)));" 
toytree.tree(newick).draw(ts='s');
    
# full solution: the remaining samples are also resolved (arbitrarily)
newick = "(f,((a,b),(c,(d,e))));" 
toytree.tree(newick).draw(ts='s'); 
```
    
</details>


</div>