# Building a reference

The first step is to develop your scientific question and to determine whether a phylogenetic tree can actually answer it. If phylogenetics can help, you must first build a reference for your analysis. Today we will look at two examples: one where you have to consider orthologous genes, and another where geographical location and time play a part in answering your question.

So, let's start by identifying our sequences. In some cases you know exactly what you are looking for, and in other cases you have no idea what your initial sequence is. Either way, you start with a [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) search to identify similar sequences.

Other references:
- [NCBI influenza DB](https://www.ncbi.nlm.nih.gov/genomes/FLU/Database/nph-select.cgi?go=database)
- [NCBI virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/)

## 1. Use BLAST to identify the following sequences:

NCBI is located here: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/). BLAST can be selected in the right-hand panel.

![blast_ncbi](images/blast_ncbi.png)

Alternatively, you can go directly to the BLAST page: [https://blast.ncbi.nlm.nih.gov/Blast.cgi](https://blast.ncbi.nlm.nih.gov/Blast.cgi)
Which BLAST program you select depends on your sequence; this can be changed later if necessary.

![blast_selection](images/blast_selection.png)

Legend:
1. This is where you enter your sequence. Alternatively, you can enter an NCBI accession number.
2. You can also choose to search a particular database or organism. The default will search the entire BLAST non-redundant (nr) database.
3. Within each BLAST program, there are subprograms that determine how your BLAST search is performed.
4. Once you have adjusted the necessary parameters, you are ready to submit your search. This can be quick or take several minutes, depending on the sequence and the server load.

![blast_entry](images/blast_entry.png)

Now take sequences 1 and 2 and identify them.

Sequence 1:

Sequence 2:

From here we would use our BLAST results to start building our reference file, which is a multi-FASTA file. However, that is typically not enough, and we would need to identify additional sequences from the literature or other databases. Luckily for us, a hard-working lab member has already done all of the legwork. Both sequences have reference files that can be found here:
- [sequence 1](data/opsins.fasta)
- [sequence 2](data/zika.fast)

The reference file for sequence 1 was extracted from this [paper](https://bmcecolevol.biomedcentral.com/articles/10.1186/s12862-018-1276-0), while the file for sequence 2 was taken from the Nextstrain tutorial, which can be found [here](https://docs.nextstrain.org/en/latest/tutorials/zika.html).
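
Before aligning, it can help to sanity-check a multi-FASTA reference file. The following is a minimal Python sketch, not part of the original workflow: the helper names `read_fasta` and `summarize` and the example records are hypothetical, and it only counts records and reports IDs that occur more than once.

```python
from collections import Counter
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(records):
    """Return (record count, sorted IDs that appear more than once).

    The ID is taken as the header text up to the first whitespace.
    """
    ids = []
    for header, _ in records:
        parts = header.split()
        ids.append(parts[0] if parts else "")
    counts = Counter(ids)
    return len(ids), sorted(i for i, n in counts.items() if n > 1)


# Hypothetical in-memory example; replace with open("your_file.fasta").
example = io.StringIO(
    ">seqA sample one\nATGC\nATGC\n>seqB\nATGG\n>seqA duplicate\nATGA\n"
)
records = list(read_fasta(example))
count, duplicates = summarize(records)
print(count, duplicates)
```

Duplicate IDs are worth resolving at this stage, since the alignment, tree, and metadata files all refer to sequences by name.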

## 2. Create Multiple Sequence Alignments using MAFFT

Now that we have our reference files, we need to identify the homologous regions across our sequences. This process is known as Multiple Sequence Alignment (MSA). There are many tools available to complete this process, but today we will be using the [MAFFT alignment server](https://mafft.cbrc.jp/alignment/server/). MAFFT has both an online server and a command-line interface. Today we will use the default parameters, but for more information on MAFFT's algorithms and parameters you can read [here](https://mafft.cbrc.jp/alignment/software/algorithms/algorithms.html), and for additional tips read [here](https://mafft.cbrc.jp/alignment/software/tips0.html).

> &#x26a0;&#xfe0f; **One problem that often occurs at this step is that sequence names may become truncated, which causes referencing issues later. For instance, MAFFT cuts names off after 15 characters or at the first space. You should be aware of this and edit your FASTA reference files accordingly.**


![mafft](images/mafft.png)
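
One way to avoid the name-truncation problem is to shorten the FASTA headers yourself before aligning. The sketch below is an illustration under stated assumptions: the 15-character limit is taken from the warning above, the function name `short_names` and the example headers are hypothetical, and the original headers are assumed to be unique.

```python
import re

MAX_LEN = 15  # limit taken from the truncation warning above


def short_names(headers, max_len=MAX_LEN):
    """Return (original, short) pairs where each short name has no
    whitespace, is at most max_len characters, and is unique."""
    pairs, used = [], set()
    for header in headers:
        # Replace runs of whitespace with "_" and cut to the limit.
        base = re.sub(r"\s+", "_", header.strip())[:max_len]
        name, n = base, 1
        # On a collision, make room for a numeric suffix and retry.
        while name in used:
            suffix = f"_{n}"
            name = base[: max_len - len(suffix)] + suffix
            n += 1
        used.add(name)
        pairs.append((header, name))
    return pairs


# Hypothetical headers, for illustration only.
headers = [
    "Homo sapiens opsin 1 long wave sensitive",
    "Homo sapiens opsin 1 medium wave sensitive",
    "zika|PRVABC59",
]
pairs = short_names(headers)
for original, short in pairs:
    print(f"{short}\t{original}")
```

Keeping the printed table lets you map the short names back to the full descriptions once the tree is built.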


MAFFT produces a text-based alignment with no color. Fully conserved positions are marked with **'*'**, followed in decreasing order of conservation by **':'** and **'.'**. There are programs that can visualize your MSA in color to make the conservation easier to see; people typically use these when they want to identify conserved domains or trim their alignments.

Highly conserved region             |  Gapped region
:-------------------------:|:-------------------------:
![](images/alignment_c.png)  |  ![](images/alignment_g.png)
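
To make the `*` marks concrete, here is a minimal Python sketch that recomputes them for a toy alignment. It only reproduces the fully conserved (`*`) columns; the `:` and `.` marks depend on residue similarity groups and are not attempted here. The function name `conservation_line` and the three aligned sequences are hypothetical.

```python
def conservation_line(aligned):
    """Return a string with '*' under every column in which all
    sequences share the same non-gap character, and ' ' elsewhere."""
    lengths = {len(seq) for seq in aligned}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = {c.upper() for c in column}
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


# Toy alignment, for illustration only ('-' is a gap).
alignment = ["ATG-CA", "ATGGCA", "ATGTCT"]
marks = conservation_line(alignment)
for seq in alignment:
    print(seq)
print(marks)
```

Columns containing a gap are never marked as conserved in this sketch, which matches the intuition that gapped regions are the ones to inspect or trim.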

Prebaked alignment files:
- [opsins](data/opsin.aln)
- [zika](data/zika.aln)

## 3. Reconstructing the tree using IQ-TREE

IQ-TREE, like other tree inference software, can be used on both DNA and protein sequences. Additionally, IQ-TREE has a built-in model selection module in case you don't know which substitution model is best for your tree inference. An in-depth tutorial on IQ-TREE can be found [here](http://www.iqtree.org/doc/Web-Server-Tutorial).

![iqtree](images/iqtree.png)
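
If you prefer the command line to the web server, a roughly equivalent run can be sketched as follows. This is an assumption-laden illustration rather than the workflow used in this tutorial: it assumes IQ-TREE 2 is installed under the executable name `iqtree2`, and the helper names `iqtree_command` and `run_iqtree` are hypothetical. The options used are `-s` (alignment file), `-m MFP` (model selection followed by tree inference), and `-B` (number of ultrafast bootstrap replicates).

```python
import shutil
import subprocess


def iqtree_command(alignment, executable="iqtree2", bootstraps=1000):
    """Build an IQ-TREE 2 command line as a list of arguments."""
    return [executable, "-s", alignment, "-m", "MFP", "-B", str(bootstraps)]


def run_iqtree(alignment):
    """Run IQ-TREE if the executable is on PATH; return True if it ran."""
    cmd = iqtree_command(alignment)
    if shutil.which(cmd[0]) is None:
        return False  # IQ-TREE is not installed; nothing is executed
    subprocess.run(cmd, check=True)
    return True


# Only print the command here; call run_iqtree(...) to execute it.
command = iqtree_command("data/opsin.aln")
print(" ".join(command))
```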

Prebaked tree files:
- [opsin](data/opsin.aln.tre)
- [zika](data/zika.aln.tre)

## 4. Tree visualization with Microreact

We will use [Microreact](https://microreact.org/showcase) to generate our visualization today. 


Highlighted clades             |  Geographical location + tree
:-------------------------:|:-------------------------:
![](images/opsins_full.png)  |  ![](images/zika.png)

The only mandatory file for Microreact is a comma- or tab-separated values file (CSV/TSV) containing the IDs of your sequences of interest. This file can have several optional columns (e.g. lat, lon, color, date, country, etc.). These columns add further ways to categorize the tree or map. However, we want to visualize a tree, so we will have to complement the CSV file with a tree file ([nwk](https://en.wikipedia.org/wiki/Newick_format), tre).

> The geolocation link has not worked for me, so instead I used [https://www.latlong.net/](https://www.latlong.net/) to generate my latitudes and longitudes.

![micro](images/microreact.png)
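
Because the metadata file and the tree are linked by sequence ID, it is worth checking that the two agree before uploading. The sketch below is a minimal illustration: the column names (`id`, `country`, `latitude`, `longitude`), the coordinates, and the toy tree are assumptions made for the example, and the leaf extraction only handles unquoted Newick labels.

```python
import csv
import io
import re


def newick_leaves(newick):
    """Return leaf labels from a Newick string (unquoted labels only).

    Leaf labels directly follow '(' or ','; internal node labels follow
    ')' and are therefore skipped.
    """
    found = re.findall(r"[(,]\s*([^():,;\[\]]+)", newick)
    return [label.strip() for label in found]


def compare_ids(metadata_text, newick, id_column="id"):
    """Return (leaves missing from metadata, metadata IDs missing from tree)."""
    rows = csv.DictReader(io.StringIO(metadata_text))
    meta_ids = {row[id_column] for row in rows}
    leaves = set(newick_leaves(newick))
    return sorted(leaves - meta_ids), sorted(meta_ids - leaves)


# Toy inputs, for illustration only.
tree = "((A:0.1,B:0.2)90:0.3,C:0.4);"
metadata = (
    "id,country,latitude,longitude\n"
    "A,Brazil,-14.2,-51.9\n"
    "B,Colombia,4.6,-74.1\n"
    "D,Mexico,23.6,-102.5\n"
)
missing_from_metadata, missing_from_tree = compare_ids(metadata, tree)
print("In tree but not in metadata:", missing_from_metadata)
print("In metadata but not in tree:", missing_from_tree)
```

Mismatches found this way often trace back to sequence names that were truncated or edited during alignment.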


Metadata files:
- [opsin](data/opsin_metatdata.csv)
- [zika](data/metatdata.csv)