# Genome Annotation

## Introduction

During the last practical assignment (Genome Assembly), we generated an assembly of chromosome 5 of the eukaryotic intracellular parasite
_Plasmodium falciparum_. Today we are going to use our assembly to learn how to compare different genome assembly versions and
identify sequences that may have some biological function. Identifying such sequences is known as "Genome Annotation" and is
graphically represented below:

![](images/intro_schematic.png)

Genome Annotation is about taking a raw assembled genome (top; like the one you created in the "Assembly" module) and
annotating it with features that biologists are interested in (bottom). While we will not be annotating all the features outlined
above today, this schematic should give you a general idea of what this practical is about.

## Learning Outcomes

On completion of the tutorial, you can expect to be able to:

* Align two different reference genome assemblies against one another.
* Identify repetitive DNA sequences.
* Align RNA-seq data to an assembly and use it to identify genes.
* Use comparative genomics to identify similar proteins in other organisms.

## Tutorial sections

This tutorial comprises the following sections:

1. [Comparing Reference Genomes](reference_alignment.ipynb)
2. [Identifying repetitive DNA](repetitive_dna.ipynb)
3. [Gene Discovery](gene_discovery.ipynb)
4. [Gene Annotation using Comparative Genomics](comparative_genomics.ipynb)

## Authors

This tutorial was written by [Eugene Gardner](https://github.com/eugenegardner).

## Running the commands from this tutorial

You can follow this tutorial by typing all the commands you see into a terminal window. This is similar to the “Command Prompt” window on MS Windows systems, which allows the user to type DOS commands to manage files.

To get started, open a new terminal window and type the command below:

In [None]:
cd ~/course_data/annotation/data

## Let's get started!

This tutorial requires that you have mummer, RepeatMasker, hisat2, samtools, augustus, and GenomeThreader installed
on your computer. These are already installed on the virtual machine you are using. To check that these are installed,
run the following commands:

In [None]:
nucmer
RepeatMasker
hisat2
samtools
augustus
gth

This should return the help message for each of these programs. If you want to install this software yourself, please see the software websites:

* the [mummer](https://github.com/mummer4/mummer) website
* the [RepeatMasker](https://www.repeatmasker.org/) website
* the [hisat2](http://daehwankimlab.github.io/hisat2/) website
* the [samtools](http://www.htslib.org/) website
* the [augustus](https://github.com/Gaius-Augustus/Augustus) website
* the [GenomeThreader](https://genomethreader.org/) website
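
As an alternative to running each program by hand, the installation check can be sketched as a small shell loop. This is a minimal sketch rather than part of the original course material: the helper name `check_tools` is hypothetical, and it only tests whether each program name can be found on your PATH, not whether the program runs correctly.

```shell
# Report whether each named program can be found on the PATH.
# Returns 0 if every program was found, 1 otherwise.
# (check_tools is a hypothetical helper name used for illustration.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Program names as they are typed in this tutorial
# (nucmer is part of mummer; gth is the GenomeThreader executable).
check_tools nucmer RepeatMasker hisat2 samtools augustus gth \
    || echo "At least one program was not found on the PATH."
```

Any program reported as `MISSING` would need to be installed from the websites listed above before continuing.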

We also need to add custom scripts to our PATH environment variable so that we can run them easily from the command line. Run the following command:

In [None]:
export PATH=/home/manager/course_data/annotation/scripts/:$PATH

Now let's check that you can run the scripts we just added to the PATH:

In [None]:
computeFlankingRegion.pl
augustus

Both of these commands should print help information. If you get an error, close your terminal and
restart this tutorial.
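
If a script is still not found, it can help to confirm that the scripts directory really is on your PATH. The following is a minimal sketch assuming a POSIX-style shell such as bash. The helper name `dir_on_path` is hypothetical, and the directory is the one used in the `export` command earlier in this tutorial.

```shell
# Succeed if the directory given as $1 is one of the entries of PATH.
# A trailing slash on either side is ignored.
# (dir_on_path is a hypothetical helper name used for illustration.)
dir_on_path() {
    wanted=${1%/}
    old_ifs=$IFS
    IFS=:
    for entry in $PATH; do
        if [ "${entry%/}" = "$wanted" ]; then
            IFS=$old_ifs
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Directory added to the PATH earlier in this tutorial.
scripts_dir=/home/manager/course_data/annotation/scripts

if dir_on_path "$scripts_dir"; then
    echo "scripts directory is on the PATH"
else
    echo "scripts directory is NOT on the PATH; re-run the export command"
fi

# Show which file the shell would run for the script, if any.
command -v computeFlankingRegion.pl || echo "computeFlankingRegion.pl not found"
```

Note that `export` only affects the terminal session it is run in, so the command has to be repeated in every new terminal window.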

Once you have confirmed that you can run all of the above commands, proceed to
[Comparing Reference Genomes](reference_alignment.ipynb).