Genome Annotation Pipeline

Annotation pipeline developed for large insect genomes (e.g. Lymantria dispar spp.)

Disclaimer

👉 Do not use this pipeline 'as is' | Optimization required 👈

This pipeline was developped and optimized to work on a specific cluster using the Slurm Workload Manager. Codes and scripts included in this pipeline are thus "platform-specific" and must be adjusted according to your needs, dataset, available computer resources, and workload manager.

It is meant as a reference to help guide people who would like to have an example of how to perform the task of assembling and annotating a large and highly repeated genome (e.g. insects, plants), and what program to use in what order and which software parameters to focus on.

Overview

This pipeline takes adavantage of multiple open source softwares, programs and utility scripts to predict gene features, annotate them with gene product names, GO terms and KEGG KO numbers. It returns a complete genome annotation in GFF3 format and also produces a 'Gene Feature Table' with the complete annotation info per gene, including genomic contig in which the gene is found, and all the little pieces of information about their respective structure (3'UTR, 5'UTR, exons/introns, start/stop codon, cds) and annotation (gene product, transcript ID, GO terms, KO numbers).

Prerequisites

Python 2.7 | BioPython
blastplus (available on NCBI's webpage)
UniprotKB-SwissProt and nr blast databases
gnu parallel
RepeatMasker and RepeatModeler (available here)
exonerate (available here)
samtools (available here)
genemark (available here)
scipio (available here)
EVidenceModeler (available here)
PASA (available here)
augustus (available here)

Description

The global script called runAll.sh in the main working directory details every step of the pipeline, with embeded explanations as to which input(s) to use, what are the usefull outputs, which file to place where in the general pipeline directory structure and how to run the programs. IT IS NOT MEANT TO BE RAN AS IS: it needs to be adjusted to your input files and dataset with the corresponding file names. By changing file names, specific locations of the installed programs/softwares and computer resources (which really depend on the size of the initial raw sequencing data and expected genome size), it is possible to make it run as a whole, but I do not recommend it. Most of the steps in this pipeline will probably need further adjustments and "fine-tuning", depending on the type of data being processed. This is not an automated annotation pipeline fixed for a certain type of organism. It can be used on basically any organism because every step is independant and can be adjusted as needed. It is thus very useful in being versataile, but it requires a little bit of coding and adjustments.

I recommend dividing each step of the pipeline (well identified and numbered in the runAll.sh script) into an independent script that can be ran on its own with appropriate Grid management script/information (number of CPUs, memory, log files, what queue to use, etc.). This allows a better control over each step and parameters can be adjusted to fit your needs, output and input files can be inspected, moved to different directories, checked and modified as needed.

There is also an example in the directory called 'ZZ.example.runTrainedVersion'. This example shows how to use the pipeline if AUGUSTUS has been trained with a custom genome before. Let's say you've generated one independent script for each step in the runAll.sh main script and you've trained AUGUSTUS and predicted genes in a given genome. Now, you want to annotate another, closely related, genome that just got sequenced. For instance, AUGUSTUS has been trained with a frog genome and you want to annotate a new frog genome of the same taxonomic family using the frog-trained version of AUGUSTUS. Well, just run the scripts according to their respective number: each script run a step of the pipeline. Just modify the file names directly in the scripts, if needed change some parameters, and then submit the script to the cluster or run it locally on a (powerful) machine.

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
00.archive		00.archive
01.raw.data		01.raw.data
02.repeat.modeler.masker		02.repeat.modeler.masker
03.genemark		03.genemark
04.scipio		04.scipio
05.exonerate		05.exonerate
06.pasa		06.pasa
07.EVidenceModeler		07.EVidenceModeler
08.augustusTraining		08.augustusTraining
09.augustus		09.augustus
10.functionalAnnotation		10.functionalAnnotation
ZZ.scripts		ZZ.scripts
LICENSE		LICENSE
README.md		README.md
runAll.sh		runAll.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Genome Annotation Pipeline

Disclaimer

Overview

Prerequisites

Description

About

Releases

Packages

Languages

License

fohebert/GenomeAnnotation

Folders and files

Latest commit

History

Repository files navigation

Genome Annotation Pipeline

Disclaimer

Overview

Prerequisites

Description

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages