## Gene Annotation

### What is it?
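Genome annotation is the process of identifying genes and other functional features in an assembled genome sequence. It is central to the genome sequencing life cycle.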

![](https://image.slidesharecdn.com/apolloworkshop-curation-intro-bipaa-2017-03-22-170322120431/95/curation-introduction-apollo-workshop-4-1024.jpg?cb=1490184398)

### Why is it important?

Some goals of manual annotation are:

- To establish almost exhaustive lists of genes playing a key role in some crucial process
- To look at your genes of interest closely enough to be comfortable that you have the right ones before writing a paper.
- To provide names for the known genes – based primarily on homology to what is known in other organisms.
- To fix obvious errors in the automated gene models and improve them where additional data are available, i.e. to get the intron/exon coordinates right.


## Introduction to Genome Annotation

```python
from IPython.display import IFrame

# Embed the lecture slides (a PDF in the notebook's working directory) inline.
IFrame("Buell_Lecture_GenomeAnnotation.pdf", width=600, height=300)
```

## General Feature Format (GFF)

GFF is a tab-delimited text format used to describe genes and other features of DNA, RNA and protein sequences.

### Fields

Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'

1. **seqname** - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
2. **source** - name of the program that generated this feature, or the data source (database or project name)
3. **feature** - feature type name, e.g. Gene, Variation, Similarity
4. **start** - Start position of the feature, with sequence numbering starting at 1.
5. **end** - End position of the feature, with sequence numbering starting at 1.
6. **score** - A floating point value.
7. **strand** - defined as + (forward) or - (reverse).
8. **frame** - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
9. **attribute** - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
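
The nine fields above can be read with a few lines of Python. The following is a minimal sketch, not a full GFF parser: the names `GffFeature` and `parse_gff_line` are hypothetical, and the example line is made up for illustration. It assumes attributes are written either as `tag=value` (GFF3 style) or as `tag "value"` (GFF2/GTF style), and it treats '.' as a missing value.

```python
from typing import Dict, NamedTuple, Optional


class GffFeature(NamedTuple):
    """One feature line, using the field names from the list above."""
    seqname: str
    source: str
    feature: str
    start: int               # 1-based
    end: int                 # 1-based, inclusive
    score: Optional[float]   # None when the column is '.'
    strand: str
    frame: Optional[int]     # None when the column is '.'
    attributes: Dict[str, str]


def parse_gff_line(line: str) -> GffFeature:
    """Split one tab-separated feature line into typed fields (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) == 8:
        cols.append("")  # the final (attribute) field is allowed to be empty
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attr = cols

    attributes: Dict[str, str] = {}
    for pair in attr.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        if "=" in pair:                      # assumed GFF3 style: tag=value
            tag, value = pair.split("=", 1)
        else:                                # assumed GFF2/GTF style: tag "value"
            tag, _, value = pair.partition(" ")
        attributes[tag.strip()] = value.strip().strip('"')

    return GffFeature(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if frame == "." else int(frame),
        attributes,
    )


# Made-up example line, for illustration only.
example = "1\tensembl\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo"
rec = parse_gff_line(example)
print(rec.seqname, rec.feature, rec.start, rec.end, rec.attributes)
```

Because start and end are both 1-based and inclusive, the length of a feature is `end - start + 1`. For real projects, an established GFF library is likely a safer choice than a hand-written parser like this one.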


For submission to GenBank, annotations in GFF can be used to create a submission file. See [this guide](https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/).