
# Using Comparative Genomics to Identify Genes

We have predicted some genes, but what does this actually do for us? Biologically, we want to assign a function
to genes so that we can try to understand what they are doing. Now, we are going to use the second approach outlined
at the beginning of section 4 to both identify genes and attempt to assign a biological function to them.

In the previous section, we identified genes on one chromosome of _P. falciparum_. _P. falciparum_ has been extensively
studied, and a large proportion of the genes in its genome have been assigned some function. However, let's pretend
that the genes we just identified are from an exciting new organism that we just sequenced, and we want to find a
potential function for these genes. In many cases, an organism that is relatively closely related to the one that we
are interested in has already been sequenced, and gene discovery as outlined above has already been performed. Here is
a phylogenetic tree of different _Plasmodium_ species from the paper:

M. Andreina Pacheco et al. _**Malarial parasite diversity in chimpanzees: The value of comparative approaches to ascertain the evolution of Plasmodium falciparum antigens**_. Malaria Journal (2013).

![](images/comparative_1.png)

As this figure shows, several species are closely related to _P. falciparum_, and many of them have been sequenced.
Today, we are going to use genes identified in the chimpanzee parasite, _P. reichenowi_, to train Augustus as we did above.

## Exercise

To identify genes in our assembly based on genes from _P. reichenowi_, we first have to align _P. reichenowi_ genes to our
assembly. To do this, we use the protein sequence of _P. reichenowi_ genes and *not* the nucleotide sequence. This is
because the genetic code is "degenerate": different sets of 3 nucleotides, or codons, can encode the same
amino acid, as shown below:

![](images/comparative_2.jpg)

In this chart, the first base of a codon is in the innermost circle, the second base in the next circle, and the final
base in the last circle. In other words, working from the inside out, TGC codes for the amino acid cysteine. This chart also
demonstrates how the genetic code is "degenerate". For example, the amino acid threonine can be encoded by four
different codons: ACA, ACT, ACG, or ACC.

This means that closely related organisms can have different nucleotide sequences but still have the same amino acid
sequence. In the image below, even though _P. falciparum_ has an "A" at position 9 and _P. reichenowi_ has a "G", both
species' proteins have a threonine as the third residue:

![](images/comparative_3.png)

As such, if we align the amino acid sequence of _P. reichenowi_ genes to our _P. falciparum_ assembly instead of the
DNA sequence, the alignment tolerates more nucleotide differences, since each amino acid can match several different codons.
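To make the degeneracy concrete, here is a minimal sketch in shell. The `translate` helper and the two 9-base sequences are hypothetical and chosen only for illustration (they are not taken from either genome), and the codon table is deliberately partial.

```shell
# Translate a nucleotide string codon by codon.
# ASSUMPTION: this is a partial codon table covering only the codons used in
# this illustration; any other codon is reported as "X".
translate() {
    printf '%s\n' "$1" | awk '
        BEGIN {
            aa["ATG"] = "M"; aa["AAA"] = "K"; aa["AAG"] = "K"
            aa["TGC"] = "C"; aa["TGT"] = "C"
            aa["ACA"] = "T"; aa["ACT"] = "T"; aa["ACG"] = "T"; aa["ACC"] = "T"
            aa["TAA"] = "*"; aa["TAG"] = "*"; aa["TGA"] = "*"
        }
        {
            protein = ""
            # Step through the sequence three bases at a time.
            for (i = 1; i + 2 <= length($0); i += 3) {
                codon = substr($0, i, 3)
                protein = protein ((codon in aa) ? aa[codon] : "X")
            }
            print protein
        }'
}

translate ATGAAAACA   # hypothetical sequence with "A" at position 9
translate ATGAAAACG   # same sequence with "G" at position 9
```

Both calls print `MKT`: the two sequences differ at position 9, yet the third residue is threonine in each case.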

**Questions:**

1. Using the codon chart above, can you align the following amino acid sequence to the nucleotide sequence below (just
label the amino acids above the matching base)?
2. What was difficult about this alignment?
3. Did you notice something at the end of your alignment?

![](images/comparative_activity_1.png)

Now, let's try this again, but with a slightly more challenging alignment:

![](images/comparative_activity_2.png)

**Questions:**

1. What was difficult in this example?
2. Do you think this is an issue, or is there something biology-related going on?

## GenomeThreader

While our exercise above used a very short amino acid sequence, we need a tool that can perform this basic task for
all genes in the _P. reichenowi_ genome across our entire _P. falciparum_ assembly. To do this, we are going to use the tool
"GenomeThreader". GenomeThreader is available [online](https://genomethreader.org/) and via bioconda.

To prepare for this exercise, we have already downloaded all _P. reichenowi_ genes from
[PlasmoDB](https://plasmodb.org/plasmo/app/) and, to save time, filtered them down to only the genes that will align to
your assembly. Let's take a look at this file:

In [None]:
head -n 19 Preichenowi.prot.fa

This command will show you the first 19 lines of this file. You will notice that this file appears similar to the
previous .fasta or .fa files that you have looked at, but now includes more letters. These are the
[IUPAC amino acid codes](https://www.bioinformatics.org/sms/iupac.html), which are used to represent the 20 standard
amino acids.
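Beyond `head`, a couple of one-liners can give a quick overview of a protein FASTA file. The sketch below is only illustrative: `fasta_lengths` is a hypothetical helper name, and it assumes each record starts with a `>` header line followed by one or more sequence lines.

```shell
# Print each record name and its sequence length (tab-separated).
# The length counts every character on the sequence lines, including any "*".
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2)   # header up to the first space, without ">"
            len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Only run on the course file if it is present in the current directory.
if [ -f Preichenowi.prot.fa ]; then
    grep -c '^>' Preichenowi.prot.fa               # number of protein records
    fasta_lengths Preichenowi.prot.fa | head -n 5  # first five names and lengths
fi
```

Counting header lines with `grep -c '^>'` is a common sanity check that a file holds as many sequences as you expect.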

If you take a closer look at this file, you may notice the character `*`.

**Question:** What do you think the `*` character represents?

Now, we are going to use GenomeThreader (via the `gth` command) to align these proteins to your assembly.

In [None]:
gth -genomic PB.masked.fasta -protein Preichenowi.chr5.prot.fa -gff3out -skipalignmentout -paralogs -gcmincoverage 80 -prseedlength 20 -prminmatchlen 20 -prhdist 2 -o gth.gff3 -finalstopcodon

The options above tell GenomeThreader to:

* `-gff3out`: print [gff3](https://m.ensembl.org/info/website/upload/gff3.html) format output (written to the file `gth.gff3` given by `-o`)
* `-skipalignmentout`: Do not print anything other than gff3
* `-paralogs`: Allow the protein sequence to match multiple times to our assembly to find genes that are closely related, i.e. paralogs.
* `-gcmincoverage 80`: Report only matches that cover AT LEAST 80% of the protein sequence
* `-prseedlength 20`, `-prminmatchlen 20`, and `-prhdist 2`: Control the initial protein matches that "seed" the alignment: the seed length, the minimum match length, and the number of mismatches (Hamming distance) a match may contain.
* `-finalstopcodon`: ensures that each gene is annotated with a "stop".

This command should take a minute or two to run. If it does not complete or takes too long, you can find a copy of the output in `annotation_backups/`.
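Once `gth.gff3` exists, a quick tally of feature types is a useful check that the run produced something sensible. This is a minimal sketch, assuming standard tab-separated GFF3 with the feature type in column 3; `count_gff3_features` is a hypothetical helper name.

```shell
# Count how many times each feature type (column 3) occurs in a GFF3 file.
# Comment lines starting with "#" and lines with fewer than 9 columns are skipped.
count_gff3_features() {
    awk -F'\t' '
        !/^#/ && NF >= 9 { n[$3]++ }
        END { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}

# Only run on the GenomeThreader output if it is present.
if [ -f gth.gff3 ]; then
    count_gff3_features gth.gff3
fi
```

If the output is empty, the alignment probably did not finish, and the backup copy is the safer file to continue with.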

This file can also be used as the input to Augustus to find genes like we did above. We don't have enough time today,
but feel free to come back and give it a try! If you want to try, start at section 4.3 with the
commands below, then proceed through the tutorial from there:

In [None]:
gth2gtf.pl gth.gff3 bonafide.gth.gtf

computeFlankingRegion.pl bonafide.gth.gtf

## Examining Results in IGV

Now that we have aligned our proteins to our genome assembly, return to IGV to see if we can learn anything else about
that gene we highlighted at the end of section 4.6. If you closed IGV, see section 4.6 on how to load your data again.

Now, let's load your new gene "annotation" information generated by GenomeThreader into IGV:

**Go to _"File -> Load From File"_. Select "gth.gff3" and click _"Open"_.**

Now, return to the gene that we previously examined at the end of section 4.6 by going to the coordinates:

    tig00000001:165,000-171,500

Now, you should see an additional model below your original gene predictions. It should be named something like
"mRNA21" - this doesn't seem very informative! However, if we click the gene model, we should see an image like the
following:

![](images/comparative_4.png)
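The details IGV shows in this pop-up come from the attribute column (column 9) of `gth.gff3`, so a similar lookup can be sketched on the command line. The helper name `gff3_targets_in_region` is hypothetical, and the sketch assumes the attributes are `;`-separated `key=value` pairs with the aligned protein stored under `Target`, as in the pop-up.

```shell
# List features overlapping a region, with their Target attribute.
# Arguments: GFF3 file, sequence name, region start, region end (no commas).
gff3_targets_in_region() {
    awk -F'\t' -v seq="$2" -v lo="$3" -v hi="$4" '
        !/^#/ && $1 == seq && $4 <= hi && $5 >= lo {
            target = "."                       # placeholder when no Target is present
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^Target=/) target = substr(attrs[i], 8)
            print $3 "\t" $4 "\t" $5 "\t" target
        }
    ' "$1"
}

# Only run on the GenomeThreader output if it is present.
if [ -f gth.gff3 ]; then
    gff3_targets_in_region gth.gff3 tig00000001 165000 171500
fi
```

The first word of the `Target` value is the name of the aligned protein. The numbers after it give the aligned positions on that protein.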

The "Target" column tells us the original name of the gene in _P. reichenowi_ that we aligned to our assembly. Now, we
are going to use a web resource to figure out what this gene does. On your desktop, click the "Firefox" emblem:

![](images/comparative_5.png)

Once Firefox has loaded, go to the website:

https://plasmodb.org

Being able to query online sequence resources and databases such as PlasmoDB is an important skill. PlasmoDB contains
sequencing data, protein information, and more for a large number of _Plasmodium_ species. To prepare this course for you,
we used a large number of resources from PlasmoDB. PlasmoDB is part of a larger database, VEuPathDB, which documents a
wide range of eukaryotic parasites such as _P. falciparum_. While we will only briefly go into PlasmoDB today, we highly
recommend you familiarize yourself with tools such as PlasmoDB. Other fantastic resources include the
[UCSC genome browser](https://genome.ucsc.edu/) and [Ensembl](https://www.ensembl.org/index.html), which
provide a wealth of information for hundreds of organisms.

Now, in the search bar at the top of PlasmoDB, enter the first part of the gene name we found above:

![](images/comparative_6.png)

Hit the magnifying glass or "enter".

**Questions:**

1. What is the gene that we identified in IGV?
2. Can you name a function of this gene? How did you find the answer?

Congratulations, you have reached the end of the Genome Annotation tutorial!