# Module 5: Genome annotation

## Overview 

Genome annotation is the process of identifying and labeling all the relevant features on a genome sequence. At minimum, this should include coordinates of predicted coding regions and their putative products, but it is desirable to go beyond this to non-coding RNAs, signal peptides and so on.

*Further reading*: https://academic.oup.com/bioinformatics/article/30/14/2068/2390517


### Install condacolab

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

### Install software

In [None]:
# Install Prokka (-y answers the confirmation prompt, which a notebook cell cannot do interactively)
!conda install -y -c conda-forge -c bioconda -c defaults prokka

In [None]:
# Check if Prokka is installed
!prokka --version

### Download data

In [None]:
!wget

## Genome annotation 

We will use a software tool called [Prokka](https://github.com/tseemann/prokka) to annotate the draft genome sequence produced after running SPAdes. Prokka is a “wrapper”; it collects together several pieces of software (from various authors), and so avoids “re-inventing the wheel”.

[Prokka](https://github.com/tseemann/prokka) finds and annotates features (both protein-coding regions and RNA genes, i.e. tRNA and rRNA) present on a sequence. It annotates protein-coding regions in two steps: first, the coding regions on the genome are identified using Prodigal; second, the function of each encoded protein is predicted by similarity to proteins in one of many protein or protein-domain databases. Prokka can annotate bacterial, archaeal and viral genomes quickly, and it generates standard output files in GenBank, EMBL and GFF formats.

Run Prokka on the assembled contigs:

In [None]:
# Run Prokka
!prokka contigs.fasta

An explanation of this command is as follows:

**prokka**: the tool being run

**contigs.fasta**: the input file (this file is the output from SPAdes)
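
The command above relies on Prokka's defaults for naming its output. As a minimal sketch, the same run can be given an explicit output folder and file prefix with Prokka's `--outdir`, `--prefix` and `--cpus` options. The helper name `build_prokka_command` and the values `prokka_out`, `sample` and `2` are assumptions chosen for illustration, not part of this module.

```python
import shutil
import subprocess
from pathlib import Path


def build_prokka_command(contigs, outdir="prokka_out", prefix="sample", cpus=2):
    """Return a Prokka command line as a list of arguments (illustrative helper)."""
    return [
        "prokka",
        "--outdir", outdir,    # folder that will hold every output file
        "--prefix", prefix,    # base name shared by the output files
        "--cpus", str(cpus),   # number of CPU threads to use
        str(contigs),          # input assembly in FASTA format
    ]


cmd = build_prokka_command("contigs.fasta")
print(" ".join(cmd))

# Only run when Prokka is installed and the input file is present.
if shutil.which("prokka") and Path("contigs.fasta").exists():
    subprocess.run(cmd, check=True)
```

Prokka stops if the output folder already exists, so choose a new `outdir` for each run or add the `--force` option to overwrite it.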

Once Prokka has finished, a new folder containing the Prokka output will be present in your working directory. Examine each of its output files.

- The GFF and GBK files contain all of the information about the annotated features (in different formats).
- The .txt file contains a summary of the number of features annotated.
- The .faa file contains the protein sequences of the genes annotated.
- The .ffn file contains the nucleotide sequences of the genes annotated.
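
To get a feel for what the GFF file holds, the features can be tallied by type. The sketch below assumes the standard nine-column, tab-separated GFF3 layout and stops at a `##FASTA` line, after which a GFF3 file may embed the sequences. The function name `count_gff_features` and the three example records are made up for illustration.

```python
from collections import Counter


def count_gff_features(lines):
    """Count features by type (column 3) in GFF3 text."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence section starts here; no more feature rows
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Invented records in GFF3 layout, used only to exercise the function.
example = [
    "##gff-version 3\n",
    "contig_1\tProdigal:2.6\tCDS\t100\t400\t.\t+\t0\tID=EX_00001;product=hypothetical protein\n",
    "contig_1\tAragorn:1.2\ttRNA\t500\t575\t.\t-\t.\tID=EX_00002;product=tRNA-Ala\n",
    "contig_1\tProdigal:2.6\tCDS\t700\t1000\t.\t+\t0\tID=EX_00003;product=hypothetical protein\n",
    "##FASTA\n",
    ">contig_1\n",
    "ATGC\n",
]
feature_counts = count_gff_features(example)
print(dict(feature_counts))
```

For a real run, pass an open file handle instead of the example list, for instance `open("PROKKA_12252022/PROKKA_12252022.gff")`, where the path is only an example. The totals should be comparable with the summary in the .txt file.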

## Viewing genome annotation in IGV 

You will require the following files to view genome annotation in IGV:

1. The reference genome, which is the .fna output of Prokka. This sequence is the reference against which the annotations are displayed.
2. The .gff file, which is also an output of Prokka.
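
Before opening IGV, it can help to confirm that both files exist in the Prokka output folder. The following is a small sketch under the assumption that the folder is named `PROKKA_12252022`, the same example name used in this section. The helper `find_igv_inputs` is hypothetical.

```python
from pathlib import Path


def find_igv_inputs(outdir):
    """Return (fna_path, gff_path) from an output folder; None marks a missing file."""
    folder = Path(outdir)
    if not folder.is_dir():
        return (None, None)
    fna = sorted(folder.glob("*.fna"))  # reference sequence
    gff = sorted(folder.glob("*.gff"))  # annotation track
    return (fna[0] if fna else None, gff[0] if gff else None)


fna_path, gff_path = find_igv_inputs("PROKKA_12252022")
print("Reference genome:", fna_path)
print("Annotation file:", gff_path)
```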

Launch IGV using the methods outlined in the section [Data, Tools and Computational Platforms (IGV)].

Load the reference sequence: in the menu bar, click Genomes > Load Genome from File, then search for and select PROKKA_12252022.fna (as an example).

Load the GFF file: go to File > Load from File and select PROKKA_12252022.gff (as an example).

___