# **WEEK 7**

In this lab, we'll be returning to Chapter 4 of the Applied Bioinformatics textbook, which covers multiple sequence alignment and phylogenetics.

First, since this is a new notebook we'll have to install BioPython using the code block below.

In [None]:
##Install BioPython in Jupyter Notebook
%pip install biopython

## Part 1: Multiple sequence alignment

Multiple sequence alignment involves scaling up the concept of pairwise alignment: instead of aligning two sequences against each other, you align three or more sequences against each other.

If you haven't read Section 4.1 of the textbook, do it now.

You'll note that they discuss four methods of multiple sequence alignment: dynamic programming, progressive alignment, iterative alignment, and genome alignment. The categorization of MSA algorithms is kind of blurry and this is one categorization among many; the main ones we talked about in class are pairwise vs iterative, but there are many algorithms which use an adapted version of either approach or a combination of the two. The most important things to remember for now are:

1. Multiple sequence alignment is computationally difficult

2. All MSA algorithms have to balance speed vs complexity. The more complex the algorithm, the more likely it is to be accurate, but the slower it tends to be.
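To put a rough number on point 1: an exact dynamic programming alignment of N sequences fills an N-dimensional table with one cell for every combination of positions. The sketch below just counts those cells; the function name `dp_cells` and the sequence length of 100 are made up for illustration and are not from the textbook.

```python
def dp_cells(num_sequences: int, sequence_length: int) -> int:
    """Count the cells in a full dynamic programming table.

    Each sequence contributes one axis of (length + 1) positions,
    so the table size grows exponentially with the number of sequences.
    """
    return (sequence_length + 1) ** num_sequences


# Compare how the table grows as we add sequences of length 100
for n in (2, 3, 5, 10):
    print(f"{n} sequences: {dp_cells(n, 100):,} cells")
```

Two sequences are easy, but the table becomes unmanageable very quickly, which is why practical MSA tools rely on faster heuristic approaches.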

Up until this point we've been using BioPython for all our bioinformatic analyses: you might be asking at this point whether you can use BioPython to carry out a multiple sequence alignment. Of course you can! However, it is a bit more complicated than you might think.

In [None]:
##Align sequences using the ClustalW algorithm implemented in BioPython
from Bio.Align.Applications import ClustalwCommandline


in_file = "example_sequences.fasta"

clustalw_cline = ClustalwCommandline("clustalw2", infile=in_file)

print(clustalw_cline)

clustalw2 -infile=example_sequences.fasta

from Bio import AlignIO
align = AlignIO.read("example_sequences.aln", "clustal")
print(align)

The code above *should* perform an MSA using the ClustalW algorithm, but it doesn't. That's because BioPython doesn't directly contain the code necessary to run the ClustalW algorithm; it has what's called a **wrapper**, which allows you to run the algorithm without having to exit the coding environment you're currently using. However, this assumes that there is something inside the wrapper. In this case, the BioPython module is looking for the ClustalW software inside the wrapper, and is finding... nothing!
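If you want to check for yourself whether there is anything "inside the wrapper", one option is to ask Python whether the program exists on this machine. This is a minimal sketch using `shutil.which` from the standard library; the helper name `find_executable` is just for illustration.

```python
import shutil


def find_executable(name: str):
    """Return the full path of a program on the PATH, or None if it is absent."""
    return shutil.which(name)


path = find_executable("clustalw2")
if path is None:
    print("clustalw2 is not installed, so the wrapper has nothing to call.")
else:
    print(f"clustalw2 found at {path}")
```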

Instead of installing ClustalW from source in a Jupyter notebook, we're going to use the online server where ClustalW is handily installed for you. You can find the online version [here](https://www.genome.jp/tools-bin/clustalw).

**Try using the Clustal online server to align the following sequences, and paste the results into the text box below. Make sure you set the parameters correctly and save the output in FASTA format. Include both the clustalw.fasta section and the clustalw.dnd section.**

```
>Sequence_1
ATCGATCGATCGAATCGATCG
>Sequence_2
ATCGATATCGATCGATCGATCG
>Sequence_3
ATCGATCGATCGATCGACCTCG
```



\---------------------------
You can add your answer in this text box.
\---------------------------

**What do you notice about the order of the sequences in your alignment? Why do you think that might be? (Hint: ClustalW is a *progressive* alignment algorithm)**

\---------------------------
You can add your answer in this text box.
\---------------------------

**How might this be different if ClustalW was an *iterative* alignment algorithm?**

\---------------------------
You can add your answer in this text box.
\---------------------------

## Part 2: Tree Building
Tree building involves visualization of multiple sequence alignments, including the quantification of similarity / probability that this is the best alignment as node values / branch lengths. When we think of trees, we think of visual images with branches and leaves and nodes, but computers don't care about those. Trees are often stored in Newick format, which is much more compact and efficient (but much less readable for us poor humans!)
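To see how the nesting of brackets in Newick encodes the shape of a tree, here is a rough sketch that counts how many brackets each leaf sits inside. It only handles the simplest kind of Newick string (plain leaf names, no branch lengths or internal labels), and both the function name `leaf_depths` and the example tree are invented for illustration.

```python
def leaf_depths(newick: str) -> dict:
    """Return how many nested brackets enclose each leaf name.

    Assumes a simple Newick string: leaf names only, no branch lengths
    and no labels on internal nodes.
    """
    depths = {}
    depth = 0
    name = ""
    for char in newick:
        if char in "(),;":
            # A separator ends whatever leaf name we were reading
            if name:
                depths[name] = depth
                name = ""
            if char == "(":
                depth += 1
            elif char == ")":
                depth -= 1
        elif not char.isspace():
            name += char
    return depths


print(leaf_depths("(A,(B,(C,D)));"))
```

Leaves that share the innermost pair of brackets are each other's closest relatives in the tree.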

The textbook contains an example of a tree written in Newick format, which we will reproduce below (with some minor edits to make it work in Jupyter format, you're welcome). Before we start, make sure you've read Section 4.2.1 of the textbook.

In [47]:
##Import the relevant packages
from Bio import Phylo

##Save example as treefile
tree_content = "((A,B),(C,D));"
output_file = "tree.txt"

with open(output_file, "w") as tree_file:
    tree_file.write(tree_content)

##Visualise tree
tree = Phylo.read("tree.txt","newick")
Phylo.draw_ascii(tree)

                                        _____________________________________ A
  _____________________________________|
 |                                     |_____________________________________ B
_|
 |                                      _____________________________________ C
 |_____________________________________|
                                       |_____________________________________ D



**Run the code block. What do you notice about the tree?**

\---------------------------
You can add your answer in this text box.
\---------------------------

This is a great representation of how the sequences are related, but it doesn't include any information on weighting. Weighting is a way of drawing a tree that takes into account how similar two sequences are; this is why two sequences that are close together have short branches and two sequences that are very different have long branches. Newick format can also take this information into account.
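In Newick, a branch length is written after a colon following the name, for example `A:0.1`. The sketch below pulls those name and length pairs out of a made-up tree string with a regular expression. It is a quick illustration rather than a full parser: the helper name `leaf_branch_lengths` is hypothetical, and it assumes simple leaf names made of letters, digits, underscores and dots.

```python
import re


def leaf_branch_lengths(newick: str) -> dict:
    """Map each named leaf to the branch length written after its colon.

    Assumes simple names and ignores lengths on unnamed internal branches.
    """
    pairs = re.findall(r"([A-Za-z_][\w.]*):([0-9.eE+-]+)", newick)
    return {name: float(length) for name, length in pairs}


# Made-up example tree with a length on every branch
example = "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);"
lengths = leaf_branch_lengths(example)
print(lengths)
print("Longest leaf branch:", max(lengths, key=lengths.get))
```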

There is some information on pairwise distance calculation in Section 4.2 of the textbook, but huge plot twist: we have actually ALREADY generated a weighted tree in Newick format! Look up above where you pasted the clustalw.dnd output - is it starting to look familiar? We're going to use the code block below to visualize it.

In [None]:
##Import the relevant packages
from Bio import Phylo

##Save example as treefile
newick_tree_string = "(\n  Sequence_1:0.24459,\n  Sequence_2:0.13636,\n  Sequence_3:0.13636\n);"
output_file = "weighted_tree.txt"

with open(output_file, "w") as tree_file:
    tree_file.write(newick_tree_string)

##Visualise tree
tree = Phylo.read("weighted_tree.txt","newick")
Phylo.draw_ascii(tree)

  __________________________________________________________________ Sequence_1
 |
_|____________________________________ Sequence_2
 |
 |____________________________________ Sequence_3



**What is the name for a node with more than two branches? What type of tree cannot have these nodes? Which sequence has the longest branch, and what does that mean?**

\---------------------------
You can add your answer in this text box.
\---------------------------

To establish branch lengths, you have to have a substitution model of evolution; basically, a set of *assumptions* you are willing to work under about how quickly a given site can be expected to change (these are sort of like a substitution matrix, but they include time as a variable). These are described in a lot more detail than you need to know in Section 4.3.2; the most important thing to remember here is that before we build a tree we have to select a substitution model, and there are programmes available (or integrated into the software you use like IQ-TREE2) that can help you select the best model.
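As a concrete (and deliberately simple) illustration, the sketch below computes the proportion of differing sites between two aligned DNA sequences, then corrects it with the Jukes-Cantor model, one of the simplest nucleotide substitution models. Jukes-Cantor is chosen here as an assumption for the example, not because this lab uses it, and the function names and toy sequences are invented.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ("-") are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("No ungapped columns to compare")
    return differences / compared


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for nucleotides (needs p < 0.75)."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - (4 / 3) * p)


p = p_distance("ATCGATCGAT", "ATCGATCGTT")
print("Observed proportion of differences:", p)
print("Jukes-Cantor corrected distance:", jukes_cantor_distance(p))
```

The corrected distance comes out slightly larger than the observed proportion because the model assumes some sites have changed more than once, and those extra changes are invisible when you simply count differences.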

## Part 3: Tree Searching and Evaluating Quality of a Phylogenetic Tree

So now we've figured out how to make a multiple sequence alignment and visualise / score a single phylogenetic tree. However, in real life we're usually working with many sequences, and very complex evolutionary relationships. There might be *many* potential trees, and very subtle differences between them which aren't always captured by that initial alignment and guide tree.

The solution to this problem is pretty simple: generate lots of very similar trees and evaluate all of them (this is also called "searching the tree space"). Then you use statistics to determine which of the trees you generate is most likely to be the correct one.
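Why not simply evaluate every possible tree? Because there are far too many of them. For n labelled sequences, the number of distinct unrooted, fully bifurcating trees is (2n-5)!!, the product of the odd numbers from 3 up to 2n-5. The sketch below computes that count; the function name `count_unrooted_trees` is just for illustration.

```python
def count_unrooted_trees(num_taxa: int) -> int:
    """Number of unrooted, fully bifurcating trees for n labelled taxa.

    Uses the double factorial (2n - 5)!! = 3 * 5 * ... * (2n - 5).
    """
    if num_taxa < 3:
        raise ValueError("Need at least 3 taxa for an unrooted tree")
    count = 1
    for odd in range(3, 2 * num_taxa - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10, 20):
    print(f"{n} taxa: {count_unrooted_trees(n):,} possible trees")
```

Even for a small dataset the count runs into the millions, so tree-building software has to explore a manageable neighbourhood of candidate trees rather than all of them.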

The mechanisms for generating the many similar trees are described in 4.3.10.


**What two mechanisms are used to generate similar trees to the original tree?**

\---------------------------
You can add your answer in this text box.
\---------------------------

These methods generate trees. We also need a way of evaluating which is the best tree within these tree spaces, and this is where our two statistical methods from the pre-Reading Week lectures come in: Maximum Likelihood and Bayesian.

We're going to generate a Maximum Likelihood and Bayesian tree and compare them using online servers, but first we need some sequence data.
```
>Protein_1
MAGHTGFTYILGKPVDKVVECTEKACVNELSQRAVTWLGNHIDPADPTVEVRGGHLTECQRELGDKANFTLRNIPFIMLRLHQPDMVAQIAAKQKWGRL
>Protein_2
MAGHTGFTYILGKPVDKVVECTEKACVNELSQRAVTWLGNHIDPADPTVEVRGGHLTECQRELGDKANFTLRNIPFIMLRLHQPDMVAQIAAKQKWGRL
>Protein_3
MAGHTGFTYILGKPVDKVVECTEKACVNELSQRAVTWLGNHIDPADPTVEVRGGHLTECQRELGDKANFTLRNIPFIMLRLHQPDMVAQIAAKQKWGRL
>Protein_4
ILGKPVDKVVECTEKACVNELSQRAVTWLGNHIDPADPTVEVRGGHLTECQRELGDKANFTLRNIPFIMLRLHQPDMVAQIAAKQKWGRLMAGHTGF
>Protein_5
ILGKPVDKVVECTEKACVNELSQRAVTWLGNHIDPADPTVEVRGGHLTECQRELGDKANFTLRNIPFIMLRLHQPDMVAQIAAKQKWGRLMAGHTGF
>Protein_6
ILGKPVDKVVECTEKACVNELSQRAVTWLGNHIDPADPTVEVRGGHLTECQRELGDKANFTLRNIPFIMLRLHQPDMVAQIAAKQKWGRLMAGHTGF
>Protein_7
MAGHTGFTYILGKPVDKVVECTEKACVNELSQRAVTWLGNHIDPADPTVEVRGGHLTECQRELGDKANFTLRNIPFIMLRLHQPDMVAQIAAKQKWGRL
>Protein_8
MAGHTGFTYILGKPVDKVVECTEKACVNELSQRAVTWLGNHIDPADPTVEVRGGHLTECQRELGDKANFTLRNIPFIMLRLHQPDMVAQIAAKQKWGRL
>Protein_9
ILGKPVDKVVECTEKACVNELSQRAVTWLGNHIDPADPTVEVRGGHLTECQRELGDKANFTLRNIPFIMLRLHQPDMVAQIAAKQKWGRLMAGHTGF
>Protein_10
ILGKPVDKVVECTEKACVNELSQRAVTWLGNHIDPADPTVEVRGGHLTECQRELGDKANFTLRNIPFIMLRLHQPDMVAQIAAKQKWGRLMAGHTGF
```
You can probably tell just by looking at this dataset which sequences are most likely to be closely related. Let's see if your assumptions are correct!




We're going to use the IQ-TREE2 online server to generate a maximum likelihood tree from this data. Click [here](http://iqtree.cibiv.univie.ac.at/) to access it. It only takes alignment files, so first you'll have to align the above sequences, save them as a file and upload them to the server in FASTA format. Luckily, we just learned a method to do just that.

Once you have your file, upload it to the relevant box and leave the rest of the settings alone (this is also called *running with default parameters*). This might take a few minutes, but once it's complete you can access the results by clicking the "Analysis results" tab.

What does the resulting tree look like? Use the code block below to visualize the consensus tree.

In [None]:
##Import the relevant packages
from Bio import Phylo

##Save example as treefile
newick_tree_string = "<paste your Newick tree here>"
output_file = "weighted_tree.txt"

with open(output_file, "w") as tree_file:
    tree_file.write(newick_tree_string)

##Visualise tree
tree = Phylo.read("weighted_tree.txt","newick")
Phylo.draw_ascii(tree)

 , Protein_9
 |
 |                                           _____________________ Protein_10
 |                                          |
 |                                          |                     , Protein_1
 |                     _____________________|                     |
 |                    |                     |                     | Protein_8
 |                    |                     |                     |
 |____________________|                     |_____________________| Protein_7
_|                    |                                           |
 |                    |                                           | Protein_3
 |                    |
 |                    |_____________________ Protein_2
 |
 | Protein_4
 |
 | Protein_5
 |
 | Protein_6



**What is a consensus tree? What does it represent?**

\---------------------------
You can add your answer in this text box.
\---------------------------

One of the reasons we tend to use maximum likelihood methods first is because they run a LOT more quickly than Bayesian methods. Because of this (and because Bayesian servers that are free, online, and don't crash every five minutes are few and far between) I have already run the equivalent analysis using a Bayesian method. The resulting Newick file is saved in the notebook as nwktree.txt.

Visualize the Newick file.

In [48]:
##You can put your code in this block.
from Bio import Phylo

tree = Phylo.read("nwktree.txt","newick")
Phylo.draw_ascii(tree)

  ____________ Protein_9
 |
 |             ____________ Protein_10
 |____________|
 |            |____________ Protein_7
 |
 |                          ____________ Protein_6
 |             ____________|
_|            |            |             ____________ Protein_2
 |            |            |____________|
 |____________|                         |             ____________ Protein_3
 |            |                         |____________|
 |            |                                      |____________ Protein_8
 |            |
 |            |____________ Protein_1
 |
 |             ____________ Protein_4
 |____________|
              |____________ Protein_5



**What are the similarities between these trees? What are the differences?**

\---------------------------
You can add your answer in this text box.
\---------------------------

© Elisabeth Richardson, 2023. Adapted from Applied Bioinformatics by David A. Hendricks under a [CC-by-4.0 license](https://creativecommons.org/licenses/by/4.0/) and shared under same.