<h1><font color='DarkBlue'>PRACTICAL PHYLOGENETICS NOTEBOOK</font></h1>
<hr>
Dr Dave Lunt d.h.lunt@hull.ac.uk

<h2><font color='Blue'>Goals of these experiments</font></h2>

This Jupyter notebook will take you through two case studies using phylogenetic analysis to understand biological questions. This will involve aligning sequences and building maximum likelihood phylogenetic trees, followed by annotation and interpretation.

We hope that this will give you:
- Experience in analysing DNA sequence data
- An understanding of the steps involved in phylogenetics
- Knowledge of the complexities of the specific case studies we are using

You will write up one of the case study analyses you perform today for your assessment.

<h2><font color='Blue'>Introduction to Jupyter computational notebooks</font></h2>

<font color=red>**FIRSTLY, DO NOT PANIC. EVERYTHING YOU NEED TO KNOW ABOUT COMPUTERS AND CODE WILL BE TAUGHT HERE. YOU WILL BE ABLE TO DO THIS EVEN IF YOU HAVE LITTLE EXPERIENCE WITH COMPUTERS**</font>

This class of students has mixed prior experience, however, so if you did not do the bioinformatics practicals in Genetic Analysis last semester then please make yourself known and we will give you a 5 minute catch-up to make your life easier.

If you are familiar with Jupyter notebooks then you can skip this section and move to "A NEW SPECIES OF APE?" below.

This document that you are reading now is a **Jupyter Notebook**. It is a web-browser-based text editor that is also able to execute scripts, i.e. code. Today we are using the programming language `python`, probably the most used language in bioinformatics, but we could also run `R`, `bash` or many other languages. Scripts are found in the grey cells (see below) and have something like `In [ ]:` or `[1]` to their left in the margin.

To execute a script, click the cell below and then press SHIFT+ENTER, or click the triangular "Run" button in the tool bar above. Try running the code below now.

In [1]:
print('Hey there, good job in running the python print command!')

Hey there, good job in running the python print command!


This code cell (containing "print('Hey...") should have executed when you pressed SHIFT+ENTER, and its output was printed below the cell ("Hey there, good job..."). All Jupyter commands run in a similar fashion.

Can you identify which parts of this notebook are code, which parts output, and which parts documentation like this sentence? Discuss with us if you are in doubt.

1. Try editing the code below and re-running. Replace "good job" with "even better job".
2. Instead of the `Run` button at the top you can click in the cell and press SHIFT+ENTER to run the code. Most people find this faster. Edit the cell below, then give it a try:

In [2]:
print('Hey there, good job!')

Hey there, good job!


<h4><font color='Blue'>ACTION:</font></h4>

Now edit the cell above to have two print statements. On a new line type `print('Your new phrase')` and then run it. It might be easier to copy/paste and just change the pasted phrase. If it doesn't run well, you have a typo. Yes, it's always a typo.

**Congratulations, you have now run, copy/pasted and edited cells. Those are all the skills you will need today**

This iterative edit-and-run approach is how much of modern biological data is explored and analysed. This mix of code and explanation you are seeing in this Jupyter notebook is called "literate programming".

This notebook will take you through the analysis of the two case studies found in the practical handbook. To keep this notebook concise, background information is excluded from it and is only available in the practical handbook, so you will need to work with both documents. For each case study you will need to run several cells just as you did above. The programs will then align and clean the DNA sequences, build a tree and annotate it. **In most cases you will only need to run the cell just as you did above. In a few cases you will be able to tweak the script just a bit following clear instructions**. Good luck!

<h1><font color='Blue'>STUDY1: A NEW SPECIES OF APE?</font></h1>

![orangutan males](images/Bornean,_Sumatran_&_Tapanuli_orangs.jpg)

_Figure 1:_ Male Bornean, Sumatran and Tapanuli orangutans, three suggested species [wikipedia](https://en.wikipedia.org/wiki/Orangutan). 

The first aim of today is to investigate what phylogenetics can tell us about different species of great ape. It is, of course, complex. You might like to think how you would conceptually go about trying to get information using a phylogenetic approach.

**Table 1: Latin names and common names of species in this practical.** As always, Googling is encouraged.

| Name | Common name | Name | Common name |
| ---------------- |:---------------------- | -------------- |:---------- |
| Macaca macaca | Macaque (outgroup) | Homo sapiens sapiens | Modern humans |
| Hylobates lar | Gibbon (outgroup) | Homo sapiens neanderthalensis | Neanderthals (extinct) |
| Gorilla gorilla | Western gorilla | Homo sapiens denisovan | Denisovans (extinct) |
| Gorilla beringei | Eastern/mountain gorilla | Pongo abelii | Sumatran orangutan |
| Pan troglodytes | Chimpanzee | Pongo pygmaeus | Bornean orangutan |
| Pan paniscus | Bonobo | Pongo tapanuliensis | Tapanuli orangutan |

<h2><font color='Blue'>How much data do you have?</font></h2>
Your working directory has some DNA sequence files in fasta format. There are a number of ways to determine the number of sequences in a file; here is a quick one-liner.

Edit the cell to replace `name.fas` with the correct file `data/ape.fas`. Press SHIFT+ENTER to run the cell as usual.

In [1]:
!echo "Number of sequences: "; grep -c ">" data/ape.fas

Number of sequences: 
19


It should have displayed the number of sequences in the `ape.fas` file.
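
For those curious how the one-liner works, here is a minimal Python sketch of the same idea: every record in a fasta file begins with a title line starting with `>`, so counting those lines counts the sequences. The function name `count_fasta_records` and the tiny example file are made up for illustration, and unlike `grep -c ">"` this version only counts lines that *begin* with `>`.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count lines that start with '>' (one title line per fasta record)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# A tiny made-up fasta file so the example runs anywhere
example = ">seq1\nACGT\n>seq2\nACGA\nTTGA\n>seq3\nAC-T\n"
with tempfile.NamedTemporaryFile("w", suffix=".fas", delete=False) as tmp:
    tmp.write(example)
    tmp_path = tmp.name

n_records = count_fasta_records(tmp_path)
os.remove(tmp_path)  # tidy up the temporary file
print("Number of sequences:", n_records)
```

To try it on the practical data you would call `count_fasta_records("data/ape.fas")` instead of using the temporary file.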

Below we will use a few python packages to allow more complex analyses. In the next example we are going to find the number and total length of sequences using a useful code package called BioPython [1].

Remember: The code below has explanations of what each section does (explanations begin with the # symbol) as some people are interested in seeing bioinformatics code in action. **But you do not have to know python or understand this code. Just run the cell as usual.**

In [1]:
# --------------------------------------------
# Python code to report on number of sequences 
# in a file by using BioPython
# --------------------------------------------

# import BioPython code so we can use it
from Bio.SeqIO.FastaIO import SimpleFastaParser

# set counts to zero before starting
count = 0
total_len = 0

# open the data file and give it a handle (nickname)
with open("data/ape.fas") as in_handle:
    # for each title line add 1 to the count of records,
    # and add the length of the sequence to a count called total_len
    for title, seq in SimpleFastaParser(in_handle):
        count += 1
        total_len += len(seq)

# print the results in a readable format
print("The file contained %i records with total sequence length of %i nucleotides" % (count, total_len))

<h4><font color='Blue'>QUESTIONS:</font></h4>

- Can you see which part of the above code specifies the fasta file `ape.fas`?

- How could you run this on a different file in the data directory called `testseqs.fasta`? 

You don't need any python knowledge to answer these. The idea here is that in much of bioinformatics you can modify someone else's code to point at your data file and everything will work. 

<h4><font color='Blue'>ACTIONS:</font></h4>

Try it, just change the name above and re-run the cell, or ask for help if you can't quite see it. Remember that the file is within the `data` directory. If you've done it correctly (watch for typos) then the number and length of sequence reported will change.

<hr>
<h2><font color='Blue'>Aligning the sequences</font></h2>
In order to carry out a valid analysis you have to align the DNA sequences. If you're not quite sure why, look at the images below and discuss with a demonstrator. 

![Aligned DNA sequence](./images/aligned.png "A DNA sequence alignment")
_A DNA sequence alignment. Each character (column) can be directly compared across the different species_

![Un-aligned DNA sequence](./images/unaligned.png "An incomplete  DNA sequence alignment")
_A set of DNA sequences not completely aligned. Each character (column) cannot be directly compared across the different species as some are 'shifted' so even though they are very similar, they look enormously different when just comparing down each column (character)_
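
The effect described in the captions above can be sketched in a few lines of Python. The sequences here are invented for illustration: the second species is missing one nucleotide, and we compare column by column with and without a gap inserted at the right place. The function name `fraction_identical_columns` is hypothetical, not part of any package used today.

```python
def fraction_identical_columns(seq_a, seq_b):
    """Fraction of columns in which two equal-length sequences share the same character."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


# Made-up sequences: species_2 lacks one nucleotide relative to species_1
species_1 = "ATGCCGTTAGCA"
species_2_aligned = "ATG-CGTTAGCA"    # a gap marks the missing nucleotide
species_2_unaligned = "ATGCGTTAGCA-"  # same nucleotides, but everything is shifted

aligned_score = fraction_identical_columns(species_1, species_2_aligned)
unaligned_score = fraction_identical_columns(species_1, species_2_unaligned)

print("Aligned:   %.2f of columns identical" % aligned_score)
print("Unaligned: %.2f of columns identical" % unaligned_score)
```

The two versions of species 2 contain exactly the same nucleotides, yet the unaligned comparison makes the species look far more different than they are.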

To align the sequences we will use a program called MAFFT [2]. What piece of information will we have to add to the code? Yes, the name of the input DNA sequence file to be aligned.

<h4><font color='Blue'>ACTIONS:</font></h4>

- Change the name of the file in the following code to be `ape.fas`
- Run the cell

In [2]:
# ---------------------------
# Align sequences using MAFFT
# ---------------------------

!mafft --auto --quiet data/ape.fas > ape.afa

Did it work? Can you find the `ape.afa` file? The ".afa" extension stands for 'aligned fasta'.

<hr>
<h2><font color='Blue'>QC the alignment</font></h2>
Trimal [3] quality controls the alignment, removing badly aligned regions and alignment artefacts.
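
To give a feel for what "removing badly aligned regions" can mean, here is a deliberately simplified Python sketch that drops alignment columns made up mostly of gaps. This is an illustration only: it uses a fixed cut-off (`max_gap_fraction`, an assumed parameter), whereas the `-gappyout` option used below chooses its threshold automatically from the alignment. The function names and the toy alignment are invented for this example.

```python
def gap_fraction_per_column(alignment):
    """Return, for each column, the fraction of sequences that have a gap ('-')."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    return [sum(seq[i] == "-" for seq in alignment) / n_seqs for i in range(length)]


def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only the columns whose gap fraction is at or below the cut-off."""
    fractions = gap_fraction_per_column(alignment)
    keep = [i for i, fraction in enumerate(fractions) if fraction <= max_gap_fraction]
    return ["".join(seq[i] for i in keep) for seq in alignment]


# A toy alignment: columns 4 and 8 are gaps in three of the four sequences
toy_alignment = [
    "ATG-CGT-A",
    "ATG-CGTTA",
    "ATGACGT-A",
    "ATG-CGT-A",
]

trimmed = drop_gappy_columns(toy_alignment)
for seq in trimmed:
    print(seq)
```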

In [3]:
# ------------------------------------------
# Quality control the alignment using trimal
# ------------------------------------------

!trimal -in ape.afa -out ape_trimmed.afa -gappyout -keepheader

<hr>
<h2><font color='Blue'>Tree reconstruction</font></h2>
This section will reconstruct a maximum likelihood phylogenetic tree using the sequence alignment you have produced. We will use the program FastTree [4].

In [1]:
# -------------------------
# Build tree using FastTree
# -------------------------

!FastTree -gtr -nt ape_trimmed.afa > ape.nwk

FastTree Version 2.1.10 Double precision (No SSE3)
Alignment: ape_trimmed.afa
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
Initial topology in 0.00 seconds
Refining topology: 16 rounds ME-NNIs, 2 rounds ME-SPRs, 8 rounds ML-NNIs
Total branch-length 0.477 after 0.03 sec
ML-NNI round 1: LogLk = -1805.940 NNIs 2 max delta 0.00 Time 0.05
GTR Frequencies: 0.3150 0.3149 0.1056 0.2645
GTR rates(ac ag at cg ct gt) 2.3845 14.9825 1.2609 1.5635 13.8255 1.0000
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.732 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -1589.440 NNIs 0 max delta 0.00 Time 0.

<hr>

The tree-building program "FastTree" gives lots of output, but you can just ignore all those details. When it is finished a file called `ape.nwk` will appear. This is a Newick tree file (.nwk) containing the tree as bracket notation text.

<hr>
<h2><font color='Blue'>Tree Annotation and Viewing</font></h2>

The tree alone (example below) is in bracket notation format (called Newick) and not very meaningful to examine.

```
((A,B),(C,D));
```
Instead we are going to display it as a graphic, and then annotate it to be easier to interpret. To do this we are going to use a tree graphics program called ToyTree [5].
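
To see how the bracket notation encodes a tree, the sketch below pulls the tip names out of a Newick string and measures how deeply the brackets are nested. It is a teaching aid that assumes simple, unquoted labels; the function names are hypothetical, the second example string (with branch lengths and support values) is made up, and a real analysis should use a proper parser such as ToyTree.

```python
import re


def newick_tip_labels(newick):
    """Return tip labels from a simple Newick string (assumes unquoted labels)."""
    # A tip label directly follows '(' or ','; labels after ')' belong to internal nodes
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def newick_max_depth(newick):
    """Return the greatest nesting depth of the brackets."""
    depth = max_depth = 0
    for char in newick:
        if char == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif char == ")":
            depth -= 1
    return max_depth


simple = "((A,B),(C,D));"
with_lengths = "((A:0.1,B:0.2)0.95:0.05,(C:0.3,D:0.1)0.80:0.02);"

print(newick_tip_labels(simple))
print(newick_tip_labels(with_lengths))
print("Nesting depth:", newick_max_depth(simple))
```

Each pair of brackets groups the taxa that share a common ancestor, so `(A,B)` says that A and B are each other's closest relatives.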

<h4><font color='Blue'>QUESTION:</font></h4>

What treefile (.nwk) has just been written by the build tree cell above?

<h4><font color='Blue'>ACTION:</font></h4>

Take your newick treefile name and enter it into the cell below to replace "tree.nwk"

In [3]:
# -----------------------------------
# Drawing the phylogeny using ToyTree
# -----------------------------------
# import the code so we can use it here
import toytree       # a tree plotting library
import toyplot       # a general plotting library
# import numpy as np   # a numerical library, give it the shorthand 'np'

# read the newick format tree file, give it the name 'newick'
newick = "ape.nwk" # change this to point at your .nwk treefile
tre = toytree.tree(newick, tree_format=1)

tre.draw();

If you see a graphic image of a phylogenetic tree, congratulations! If not, please ask for a little help; it's probably a quick fix for a demonstrator.

<h4><font color='Blue'>NOW ROOT THE TREE</font></h4>

Your tree will probably look very odd because it isn't yet rooted correctly. Use the next cell to root it by entering "Macaca" (macaque) instead of "outgroup".
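
Choosing the outgroup by a piece of text only works if that text picks out the right tips and no others. The sketch below shows the idea with plain Python, assuming the text is matched as a substring of each tip label. The tip labels and the helper `match_outgroup` are invented for illustration and are not the labels in your data file.

```python
def match_outgroup(tip_labels, wildcard):
    """Return the tip labels that contain the wildcard text."""
    return [label for label in tip_labels if wildcard in label]


# Hypothetical tip labels, made up for this example
tips = [
    "Macaca_1",
    "Hylobates_lar_1",
    "Pan_paniscus_1",
    "Pongo_abelii_1",
    "Pongo_pygmaeus_1",
]

outgroup = match_outgroup(tips, "Macaca")
ambiguous = match_outgroup(tips, "Pongo")

print("Matches for 'Macaca':", outgroup)
print("Matches for 'Pongo': ", ambiguous)
```

A text such as "Pongo" would match more than one species here, which is why the outgroup text needs to be specific.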

In [4]:
# ----------------
# Root and re-draw
# ----------------
# root and draw the tree
rtre = tre.root(wildcard="Macaca") # specify the outgroup taxon
rtre.draw(height=600, tip_labels_align=True); # draw the tree

You should now have a tree that reveals a lot about the relationships between these species. It will be easier to interpret in a report, though, if you annotate and colour it by taxon.

<h4><font color='Blue'>NOW ANNOTATE THE TREE:</font></h4>
Although you now have 'the answer' it is not so easy to study this tree. You will need to compare the divergences between the two species of orangutan and compare those to the divergences between the two species of chimpanzee. In this simple tree it's not too hard, but in general phylogeneticists label and colour trees to maintain focus on the correct comparisons. You are now going to use the script below to colour the tips by their species identity.

Run the cell and examine the tree

In [5]:
# -----------------------------
# Colouring the tree label text
# -----------------------------

# set list of colours depending on the taxon label text
# numbers like "#5384a3" are colour 'hex' codes, cyan in
# this case (google "hex colour codes" for other options)

colorlist = ["blue" if "Pan_paniscus" in tip
             else "darkblue" if "Pan_troglodytes" in tip 
             else "red" if "Pongo_abelii" in tip 
             else "brown" if "Pongo_pygmaeus" in tip
             else "#5384a3" for tip in rtre.get_tip_labels()]

# draw the tree
# draw() returns three objects; only the canvas is needed for saving later
canvas, axes, mark = rtre.draw(
    width=600,  # set dimensions of the figure
    height=600,
    scalebar=True,  # scale bar of divergence levels
    tip_labels_align=True,
    tip_labels=True,
    tip_labels_colors=colorlist,
    node_labels=None,
    node_sizes=[0 if i else 8 for i in rtre.get_node_values(None, 1, 0)],
    node_markers="s", # use "o" for circles instead of squares
    node_colors=toytree.colors[0],
)

You now have all the skills to edit this script and change colours. Pick some you like and rerun.

<h3><font color=red>IMPORTANT, SAVE YOUR FILE</font></h3>
Make sure that you save and take away a copy of your tree image in a format suitable to insert into your final report. Run the cell below, then find the files in your working directory and save them somewhere accessible.

In [6]:
# --------------------------------
# Save the tree as a graphics file
# --------------------------------

# import code to draw graphics files
import toyplot.pdf
import toyplot.svg
import toyplot.html

# draw graphics files
toyplot.svg.render(canvas, "ape.svg")
toyplot.pdf.render(canvas, "ape.pdf")
toyplot.html.render(canvas, "ape.html")

## Pause

You have just loaded a data file, aligned it, quality controlled the alignment, constructed a maximum likelihood phylogenetic tree, and created an annotated figure of the phylogeny. Well done!

This was quite a lot of work to do the first time, trying to understand how to pass a specific data file through the analytical stages to create a phylogeny. Fortunately, as you learned above, doing it again on a different data file just requires a simple change, i.e. specifying a different file.

Below you can quite rapidly analyse a "big ape" data set, containing a lot more sequences, by rerunning the same commands with different data. It should not take long. In the second case study (below) we are going to swap from apes to HIV, but again it should be rapid because the commands will be very similar.

<h4><font color='Blue'>SPECIES DIFFERENTIATION</font></h4>
Looking at the tree, it would seem the two *Pan* clades are as distant from each other as the two *Pongo* clades.
 
The scale along the bottom is genetic distance from 0 to 1, so 0.06 would be 6%. Find the common ancestor node of each genus. What is the distance between the common ancestor of both *Pan* species and the tree tips? What is that value for *Pongo*? Is it very different?
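
As a worked example of reading a distance as a percentage, the sketch below calculates the simplest kind of genetic distance, the proportion of sites that differ between two aligned sequences. The sequences and the function name `p_distance` are made up for illustration. Note the assumption: the scale on your tree comes from the model-based analysis run by FastTree, so its values will not exactly equal this raw proportion, but the interpretation as "differences per site" is similar.

```python
def p_distance(seq_a, seq_b):
    """Proportion of compared sites that differ, skipping sites where either has a gap."""
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # gaps carry no information about substitutions
        compared += 1
        if a != b:
            differences += 1
    return differences / compared


# Two made-up aligned sequences, 20 sites long, differing at one site
seq_1 = "ACGTACGTACGTACGTACGT"
seq_2 = "ACGTACGAACGTACGTACGT"

distance = p_distance(seq_1, seq_2)
print("Distance: %.2f (%.0f%% of sites differ)" % (distance, distance * 100))
```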

<hr>
<h2><font color='Blue'>A big data analysis of great apes</font></h2>

One very useful aspect of using code to carry out analyses is that once you have written it, and it works, it's very little effort to re-run it on any number or any size of other data sets.
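
One way to picture this reuse: the three commands you ran for the ape data differ between data sets only in their file names. The sketch below builds the command text for any data set from a single name. It only prints the commands and does not run them, and the helper `pipeline_commands` is hypothetical; the program options are the ones already used in this notebook.

```python
def pipeline_commands(fasta_path, prefix):
    """Build the align, trim and tree-building commands for one data set."""
    aligned = f"{prefix}.afa"
    trimmed = f"{prefix}_trimmed.afa"
    tree = f"{prefix}.nwk"
    return [
        f"mafft --auto --quiet {fasta_path} > {aligned}",
        f"trimal -in {aligned} -out {trimmed} -gappyout -keepheader",
        f"FastTree -gtr -nt {trimmed} > {tree}",
    ]


# The same three steps for two data sets, changing only the name
for dataset in ["ape", "big_ape"]:
    print("Commands for", dataset)
    for command in pipeline_commands(f"data/{dataset}.fas", dataset):
        print("   ", command)
```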

Here I have collected from GenBank whole mitochondrial genomes (about 16,000 nucleotides) from a lot of great apes including humans, Neanderthals, and species of gorillas in addition to the species you have just analysed. The file is large but we can just run the same code again. If you want to find out how much data you have, you could insert a cell, paste in the code to quantify sequences (the "How much data do you have?" cell from the ape example) and run it on the new fasta file. This is optional. For efficiency reasons I've compressed the code below a little, but it's the same as you have just run.

This big analysis gives you the opportunity to decide whether the similarity of divergence between groups that you have just observed is true more widely. When the class have produced their big trees we will all discuss what the divergence levels might mean.

Run this cell to align, trim the alignment, and then build a tree. It might take a few minutes to complete. When there is a number not an asterisk in the left margin then it is complete.

In [7]:
# Align
!mafft --auto --quiet data/big_ape.fas > big_ape.afa
print("\nThe sequence alignment has finished")
# Trim
!trimal -in big_ape.afa -out big_ape_trimmed.afa -gappyout -keepheader
print("The alignment trimming has finished")
# Tree build
print("The phylogenetic tree construction has started\n")
!FastTree -gtr -nt big_ape_trimmed.afa > big_ape.nwk
print("\nThe phylogenetic analysis has finished")


The sequence alignment has finished
The alignment trimming has finished
The phylogenetic tree construction has started

FastTree Version 2.1.10 Double precision (No SSE3)
Alignment: big_ape_trimmed.afa
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
Ignored unknown character n (seen 42 times)
Ignored unknown character r (seen 1 times)
Ignored unknown character s (seen 1 times)
Initial topology in 0.26 seconds
Refining topology: 23 rounds ME-NNIs, 2 rounds ME-SPRs, 11 rounds ML-NNIs
Total branch-length 0.536 after 3.67 sec
ML-NNI round 1: LogLk = -77072.124 NNIs 4 max delta 7.85 Time 5.47
GTR Frequencies: 0.3092 0.3142 0.1304 0.2462
GTR rates(ac ag at cg ct gt) 5.1842 40.6082 3.0165 1.5506 43.5488 1.0000
Switched to using 20 rate categorie

Expect this to take a couple of minutes. Remember if there is an asterisk in the top left "`In [*]:`" then it is still working, when it is a number it is finished. 

If this completed without errors then you can just run the cell below and see the output tree. You may want to adjust colours and re-run a few times. If you had an error, see if you can spot what went wrong, but seek assistance if not.
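
When adjusting colours, the chained `if ... else` expressions used in the colouring cells can become hard to read as more groups are added. One alternative, sketched here under the assumption that each group can be recognised by a piece of text in the tip label, is a list of (text, colour) rules checked in order. The helper `colour_for_tip` and the example tip labels are hypothetical.

```python
def colour_for_tip(tip, colour_rules, default="#5384a3"):
    """Return the colour of the first rule whose text appears in the tip label."""
    for text, colour in colour_rules:
        if text in tip:
            return colour
    return default  # used when no rule matches


# Rules are checked from top to bottom, so put the most specific text first
colour_rules = [
    ("Hylobates", "black"),
    ("Pan", "darkblue"),
    ("Pongo", "red"),
    ("Homo", "green"),
    ("Gorilla", "brown"),
]

# Hypothetical tip labels, made up for this example
example_tips = ["Pan_paniscus_1", "Homo_sapiens_sapiens_1", "Macaca_1"]
example_colours = [colour_for_tip(tip, colour_rules) for tip in example_tips]
print(example_colours)
```

With a real tree, the list comprehension would loop over the tip labels of the rooted tree instead of `example_tips`.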

In [9]:
# import the code so we can use it here
import toytree       # a tree plotting library
import toyplot       # a general plotting library

# --------------------
# Read in the big tree
# --------------------
# read the newick format tree file, give it the name 'bignewick'
bignewick = "big_ape.nwk" # change this to point at your .nwk treefile
btre = toytree.tree(bignewick, tree_format=1)

# ----------------
# Root the tree
# ----------------
# root and draw the tree
brtre = btre.root(wildcard="Hylobates") # specify the outgroup taxon

# -----------------------------
# Colouring the tree label text
# -----------------------------

# set list of colours depending on the taxon label text
# numbers like #5384a3 are color hex codes (google it for other options)
colorlist = ["black" if "Hylobates" in tip
             else "darkblue" if "Pan" in tip 
             else "red" if "Pongo" in tip 
             else "green" if "Homo" in tip 
             else "brown" if "Gorilla" in tip 
             else "#5384a3" for tip in brtre.get_tip_labels()] # cyan

# draw() returns three objects; only the canvas is needed for saving later
canvas, axes, mark = brtre.draw(
    width=800,  # set dimensions of the figure
    height=1600,
    scalebar=True,  # scale bar of divergence levels
    tip_labels_align=True,
    tip_labels=True,
    tip_labels_colors=colorlist,
    node_labels=None,
    node_sizes=[0 if i else 8 for i in brtre.get_node_values(None, 1, 0)],
    node_markers="o", # use "o" for circles or "s" for squares
    node_colors=toytree.colors[0],
)

# a smaller, simpler alternative:
# canvas, axes, mark = brtre.draw(width=400, height=800);

In [10]:
# --------------------------------
# Save the tree as a graphics file
# --------------------------------

# import code to draw graphics files
import toyplot.pdf
import toyplot.svg
import toyplot.html

toyplot.svg.render(canvas, "bigape.svg")
toyplot.pdf.render(canvas, "bigape.pdf")
toyplot.html.render(canvas, "bigape.html")

Alternatively, the cell below draws the same tree but gives each species or lineage its own colour.

In [11]:
# ---------------------------------------------
# Alternative: colour each lineage separately
# ---------------------------------------------

# read the tree file
newick = "big_ape.nwk"
tre = toytree.tree(newick, tree_format=1)

# specify the outgroup to be the gibbon (Hylobates)
rtre = tre.root(wildcard="Hylobates")

# change these options until you are happy with the design
colorlist = ["#d6557c" if "Pan" in tip # pink
             else "blue" if "Gorilla_b" in tip # blue
             else "#4169E1" if "Gorilla_g" in tip # royalblue
             else "#008000" if "Homo_sapiens_sapiens" in tip # green
             else "#32CD32" if "neanderthal" in tip # limegreen
             else "#006400" if "denisovan" in tip # darkgreen
             else "red" if "Pongo_abelii" in tip
             else "orange" if "Pongo_pygmaeus" in tip
             else "brown" if "Pongo_tapanuliensis" in tip
             else "#5384a3" for tip in rtre.get_tip_labels()] # cyan


# draw the tree using these colours and some other standard options
# draw() returns three objects; only the canvas is needed for saving later
canvas, axes, mark = rtre.draw(
    height=1200,
    scalebar=True,
    node_labels=None,
    node_sizes=[0 if i else 8 for i in rtre.get_node_values(None, 1, 0)],
    node_markers="s",
    node_colors=toytree.colors[0],
    tip_labels_align=True,
    tip_labels_colors=colorlist
);

# The following code cell is needed to save it as a graphics file

Run this following cell to save your tree to a graphics format. You will need this for your report.

In [28]:
# change the output file names below and run the cell.
# remember to save these files and take them away
import toyplot.svg
import toyplot.html
import toyplot.pdf

# draw() returns three objects; only the canvas is needed for saving
canvas, axes, mark = rtre.draw(
    height=1200,
    scalebar=True,
    node_labels=None,
    node_sizes=[0 if i else 8 for i in rtre.get_node_values(None, 1, 0)],
    node_markers="s",
    node_colors=toytree.colors[0],
    tip_labels_align=True,
    tip_labels_colors=colorlist
);


toyplot.svg.render(canvas, "big_ape2.svg")
toyplot.pdf.render(canvas, "big_ape2.pdf")

<h2><font color='Blue'>Well Done</font></h2>

You are now finished with case study 1, the apes. Case study two, the origins of HIV, will be much faster now you have experience.

Please feel free to take a short break here.
<hr>

<h1><font color='DodgerBlue'>STUDY2: WHAT ARE THE ORIGINS OF HIV?</font></h1>

![HIV](images/HIV.png)

**Has HIV (human immunodeficiency virus) coevolved with humans or does it have a recent zoonotic origin?**

How can you test this? You now have all the skills required. We are going to repeat some of the work described by Sharp and Hahn (2011) in their paper "Origins of HIV and the AIDS Pandemic". Their figure 4 is very informative, and Zimmer and Emlen redraw it in their (3rd edition) Figure 8.12. 

Today you are going to reanalyse the HIV and SIV sequence data from great apes to produce a similar figure and answer the question set above regarding zoonotic transfer.

<h3><font color='DodgerBlue'>Sequence data</font></h3>

I have prepared a fasta file for you containing SIV and HIV sequence data from the *env* gene (Google it). It is called `SIVHIVspecies_ENV.fasta` and is in the `data` directory.

It would be useful for you to know how many sequences are in the file. You calculated this earlier today for different files.

<h3><font color='DodgerBlue'>Sequence alignment and tree reconstruction</font></h3>

In [35]:
#HIV
print("\nThe analysis has begun, this will take a few minutes, please be patient")

# Align
!mafft --auto --quiet data/SIVHIVspecies_ENV.fasta > SIVHIVspecies_ENV.afa
print("\nThe sequence alignment has finished")
# Trim
!trimal -in SIVHIVspecies_ENV.afa -out SIVHIVspecies_ENV_trimmed.afa -gappyout -keepheader
print("The alignment trimming has finished")
# Tree build
print("The phylogenetic tree construction has started\n")
!FastTree -gtr -nt SIVHIVspecies_ENV_trimmed.afa > SIVHIVspecies_ENV.nwk
print("\nThe phylogenetic analysis has finished")


The analysis has begun, this will take a few minutes, please be patient

The sequence alignment has finished
The alignment trimming has finished
The phylogenetic tree construction has started

FastTree Version 2.1.10 Double precision (No SSE3)
Alignment: SIVHIVspecies_ENV_trimmed.afa
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
Ignored unknown character k (seen 8 times)
Ignored unknown character m (seen 5 times)
Ignored unknown character n (seen 1 times)
Ignored unknown character r (seen 31 times)
Ignored unknown character s (seen 4 times)
Ignored unknown character w (seen 9 times)
Ignored unknown character y (seen 14 times)
Initial topology in 0.05 seconds
Refining topology: 21 rounds ME-NNIs, 2 rounds ME-SPRs, 11 rounds ML-NNIs
Total branch-length 4.225 after 0.52 sec
ML-NNI round 1:

In [36]:
# import the code so we can use it later, just run this cell
import toytree       # a tree plotting library
import toyplot       # a general plotting library
import toyplot.pdf
import toyplot.svg

In [49]:
newick = "SIVHIVspecies_ENV.nwk"
tre = toytree.tree(newick, tree_format=1)
rtre = tre.root(wildcard="schwein")

# change these options until you are happy with the design
colorlist = ["green" if "P.t" in tip 
             else "red" if "G." in tip 
             else "blue" if "H.sapiens" in tip 
             else "cyan" for tip in rtre.get_tip_labels()]

# draw() returns three objects; only the canvas is needed for saving
canvas, axes, mark = rtre.draw(
    node_labels=None,
    width=600,
    height=600,
    node_sizes=[0 if i else 8 for i in rtre.get_node_values(None, 1, 0)],
    node_markers="s",
    node_colors=toytree.colors[0],
    tip_labels_align=True,
    tip_labels_colors=colorlist);

# change the output file names below
# remember to download (File-->Download) and take them away
import toyplot.html  # needed for the HTML export below
#toyplot.html.render(canvas, "name.html")
toyplot.html.render(canvas, "tree-plot.html");

You can modify the code to colour it in more appropriately if you wish.

The name contains a lot of metadata. The letter after H.sapiens indicates the HIV1 group to which that sequence belongs. Then comes the country from which it was isolated, then the year, then a sample identifier and an accession number. So `H.sapiens.O.CM.96.LA51YBF35.KU168294` is HIV1 group O, from Cameroon, isolated in 1996, with sample code LA51YBF35 and GenBank accession number KU168294.
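
If you would like to pull this metadata out of a name automatically, for example to decide colours by HIV1 group, the sketch below splits the example name at each full stop. It assumes every human sequence name has exactly the seven parts described above, and that two-digit years of 50 or more belong to the 1900s; both are assumptions made for this illustration, and the function name `parse_hiv_label` is hypothetical.

```python
def parse_hiv_label(label):
    """Split an 'H.sapiens.GROUP.COUNTRY.YY.SAMPLE.ACCESSION' name into named fields."""
    parts = label.split(".")
    if len(parts) != 7 or parts[:2] != ["H", "sapiens"]:
        raise ValueError("name does not match the expected 7-part layout: " + label)
    group, country, year, sample, accession = parts[2:]
    yy = int(year)
    # ASSUMPTION: two-digit years of 50 or more are 19xx, smaller ones are 20xx
    full_year = 1900 + yy if yy >= 50 else 2000 + yy
    return {
        "group": group,
        "country": country,
        "year": full_year,
        "sample": sample,
        "accession": accession,
    }


parsed = parse_hiv_label("H.sapiens.O.CM.96.LA51YBF35.KU168294")
for field, value in parsed.items():
    print(field, "=", value)
```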

It may help you to talk about HIV1 group O, or group P, or group N in your explanations and in your annotations of the tree.

Make sure you then download the annotated tree and save it somewhere safe for your report.

In [16]:
# change the output file names below and run the cell.
# remember to save these files and take them away

# Note: do not create a new canvas here with toyplot.Canvas(). That would replace
# the tree canvas drawn above with an empty one, and the saved file would be blank.
import toyplot.html
#toyplot.svg.render(canvas, "name.svg")
#toyplot.pdf.render(canvas, "name.pdf")
toyplot.html.render(canvas, "name.html")


## Interpreting the tree

The tree shows the diversity of great ape immunodeficiency viruses. Here are some questions that you could write about in your report. They are suggestions only, you can set yourself different or extra questions also, you decide.

**Does the pattern of S/HIV represent the evolutionary history of the species? Has the virus speciated along with these apes or is it more complex than that?**

**If you think there has been a zoonotic spread of SIV, ie a transfer to humans, is there a single origin or multiple transfers of HIV1?**

**Can you determine anything about the geography of the transfer of the pandemic strain (M)? Think about the source. What is their geographic range? Maybe you wish to include a map in your report?**

It is useful to think about how you will provide phylogenetic evidence for your answer to each of the questions above. How can you annotate a phylogeny to demonstrate the evidence for your conclusion? Poor reports will rely largely on written descriptions of a tree; excellent reports make a powerful link between the text and the figure, using annotations to make their points very clear.


## Writing your assessed report

There is extensive help on the canvas site on what to include and how to structure your report. You should, however, discuss your conclusions and figures with a demonstrator or myself before leaving. Please make sure that you have downloaded image files of any trees that you need to include in your report. This Jupyter lab environment may continue to work but I can't guarantee its availability past the end of the practical (that is out of my hands).

<hr>
<h2><font color='Blue'>What skills have you acquired?</font></h2>

If you have completed this practical I think you have now shown your competency in a range of important practical and conceptual skills:
1. Understanding the use of phylogenetic trees
2. Basic use of Jupyter notebooks
3. Basic use of BioPython to characterise sequence data files
4. Basic use of python to align DNA sequence data and build a phylogenetic tree
5. Use of python to programmatically annotate a phylogenetic tree

These are the sorts of phrases you could include on your CV if you wished.

## Software References

1. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25: 1422–1423. doi:10.1093/bioinformatics/btp163
2. Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 2008;9: 286–298. doi:10.1093/bib/bbn013
3. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25: 1972–1973. doi:10.1093/bioinformatics/btp348
4. Price MN, Dehal PS, Arkin AP. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE. 2010. p. e9490. doi:10.1371/journal.pone.0009490
5. Eaton DAR. Toytree: A minimalist tree visualization and manipulation library for Python. Methods Ecol Evol. 2020;11: 187–191. doi:10.1111/2041-210X.13313
