Fungal genome annotation scripts

funannotate

funannotate is a pipeline for genome annotation (built specifically for fungi, but it could work with other eukaryotes). Genome annotation is a complicated process that uses software from many sources, so the hardest part of using funannotate will be getting all of the dependencies installed. After that, funannotate requires only a few simple commands to go from a genome assembly all the way to a functionally annotated genome (InterPro, PFAM, MEROPS, CAZymes, GO terms, etc.) that is ready for submission to NCBI. Moreover, funannotate incorporates a lightweight comparative genomics package that can get you started looking at differences between fungal genomes.

###Installation

funannotate will likely run on any POSIX system, although it has only been tested on Mac OSX and Ubuntu.

###Setup

See the installation instructions for details. funannotate comes with a shell script, setup.sh, to help you install the external, Python, and Perl dependencies. The script will also download and format the databases that funannotate requires; these occupy quite a bit of space, currently ~24 GB uncompressed. Whenever possible, funannotate checks external dependencies at runtime, but setup.sh will help you during the initial setup.

To run the setup script, navigate to the funannotate home directory and type:

#to display help menu
$ ./setup.sh -h
    To download databases and check dependencies:   ./setup.sh
    To just download databases:  ./setup.sh db
    To just check dependencies:  ./setup.sh dep

#to check dependencies and download databases
$ ./setup.sh
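
Because the databases need roughly 24 GB once uncompressed, it can be worth checking free disk space before running setup.sh. The following is a minimal sketch, not part of funannotate: the function names free_gb and has_free_gb are hypothetical, and it assumes the databases end up on the same filesystem as the current directory.

```shell
#!/bin/sh
# Sketch only: free_gb and has_free_gb are illustrative helpers, not
# funannotate commands. The 24 GB figure is taken from the text above.

free_gb() {
    # df -P forces single-line POSIX output; -k reports 1024-byte blocks.
    # Field 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

has_free_gb() {
    # usage: has_free_gb REQUIRED_GB DIRECTORY
    [ "$(free_gb "$2")" -ge "$1" ]
}

if has_free_gb 24 .; then
    echo "Enough free space here for the uncompressed databases"
else
    echo "Less than 24 GB free here; the database setup may run out of space"
fi
```

The check rounds down to whole gigabytes and does not account for temporary space used by the compressed downloads, so treat it as a lower bound.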

###Funannotate help menu

To see the help menu, simply type funannotate in the terminal window. Similarly, running a subcommand without any arguments (e.g. funannotate predict) prints the options available for that script; this is consistent across all of the funannotate commands.

$  funannotate

Usage:       funannotate <command> <arguments>
version:     0.3.2

Description: Funannotate is a genome prediction, annotation, and comparison pipeline.
    
Command:     clean          Find/remove small repetitive contigs
             sort           Sort by size and rename contig headers (recommended)
             species        list pre-trained Augustus species
             
             predict        Run gene prediction pipeline
             annotate       Assign functional annotation to gene predictions
             compare        Compare funannotated genomes
             
             fix            Remove adapter/primer contamination from NCBI error report          

###Using funannotate: a simple walkthrough

Move into the sample_data directory of funannotate.

#for example, funannotate installed in $HOMEBREW/Cellar/funannotate/0.1.3/libexec
$ cd $HOMEBREW/Cellar/funannotate/0.1.3/libexec/sample_data

#run funannotate predict on genome 1
$ funannotate predict -i genome1.fasta -o genome1 -s "Genome one" \
    --isolate fun1 --name GN01_ --augustus_species botrytis_cinerea \
    --protein_evidence proteins.fa --transcript_evidence transcripts.fa --cpus 6

This command should complete in ~5 minutes and will produce an output folder named genome1 that contains the results. To save time, here we use the pre-trained botrytis_cinerea parameters to run AUGUSTUS; normally funannotate will train AUGUSTUS for you, depending on which input you give it.

#generate functional annotation for genome 1
$ funannotate annotate -i genome1 -e youremail@domain.edu --cpus 6

The second command will add functional annotation to your protein models. It should complete in ~15-30 minutes, depending on how long the remote query to the InterProScan server takes. Note that you could also run InterProScan locally; funannotate requires the results to be in XML format, one file per protein. The results are in the annotate_results folder, which has all the files necessary for NCBI WGS submission (.tbl, .sqn, .contigs.fsa, .agp). A GBK flatfile is also provided.
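
A quick way to confirm that the submission files were written is to look for each extension in the results folder. This is a sketch under assumptions: only the four extensions come from the text above, the helper name missing_submission_files is hypothetical, and the path genome1/annotate_results assumes the output folder used in this walkthrough.

```shell
#!/bin/sh
# Sketch only: lists which of the NCBI submission file types named above
# are absent from a folder. File basenames are not assumed, only extensions.

missing_submission_files() {
    dir=$1
    for ext in tbl sqn contigs.fsa agp; do
        # ls succeeds only if at least one file matches the pattern
        if ! ls "$dir"/*."$ext" >/dev/null 2>&1; then
            echo ".$ext"
        fi
    done
}

# Only run the check if the (assumed) results folder exists.
if [ -d genome1/annotate_results ]; then
    missing_submission_files genome1/annotate_results
fi
```

An empty result means every expected file type is present; any extension printed is one to investigate before submission.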

You can now run similar commands for genome2.fasta and genome3.fasta

#predict genome2
$ funannotate predict -i genome2.fasta -o genome2 -s "Genome two" \
    --isolate fun2 --name GN02_ --augustus_species botrytis_cinerea \
    --protein_evidence proteins.fa --transcript_evidence transcripts.fa --cpus 6
    
#annotate genome2
$ funannotate annotate -i genome2 -e youremail@domain.edu --cpus 6

#predict genome3
$ funannotate predict -i genome3.fasta -o genome3 -s "Genome three" \
    --isolate fun3 --name GN03_ --augustus_species botrytis_cinerea \
    --protein_evidence proteins.fa --transcript_evidence transcripts.fa --cpus 6

#annotate genome3
$ funannotate annotate -i genome3 -e youremail@domain.edu --cpus 6
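
Since the three genomes differ only in their number and label, the commands above can also be generated in a loop. The sketch below is a dry run that prints the commands rather than executing them; the function name walkthrough_cmds is hypothetical, and every option shown is copied from the commands above.

```shell
#!/bin/sh
# Sketch only: prints the predict/annotate commands for genome1..genome3.
# Nothing is executed; pipe the output to `sh` to actually run the commands.

walkthrough_cmds() {
    i=1
    for label in one two three; do
        printf 'funannotate predict -i genome%s.fasta -o genome%s -s "Genome %s"' "$i" "$i" "$label"
        printf ' --isolate fun%s --name GN0%s_ --augustus_species botrytis_cinerea' "$i" "$i"
        printf ' --protein_evidence proteins.fa --transcript_evidence transcripts.fa --cpus 6\n'
        printf 'funannotate annotate -i genome%s -e youremail@domain.edu --cpus 6\n' "$i"
        i=$((i + 1))
    done
}

walkthrough_cmds
```

Printing first makes it easy to review the exact commands before committing CPU time, for example with `walkthrough_cmds | sh` once they look right.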

You can now run some "lightweight" comparative genomics on these funannotated genomes:

$ funannotate compare -i genome1 genome2 genome3 --outgroup Botrytis_cinerea

You can now visualize the results by opening up the index.html file produced in the funannotate_compare folder. A phylogeny inferred from RAxML, genome stats, orthologs, InterPro summary, PFAM summary, MEROPS, CAZymes, and GO ontology enrichment results are all included in the browser-based output. Additionally, the raw data is available in appropriate files in the output directory.

###Using funannotate: a more realistic walkthrough

Here is a step by step tutorial for annotating a genome using funannotate with a genome assembly and RNA-seq data. This is a list of the data that I have available:

#my genome assembly
genome.scaffolds.fa

#my RNA Seq Reads
forward_R1.fastq
reverse_R2.fastq
  1. Find/remove small repetitive contigs. The assumption here is a haploid fungal genome. (Optional)
funannotate clean -i genome.scaffolds.fa -o genome.cleaned.fa
  2. Now sort contigs by size and relabel the headers. (Optional)
funannotate sort -i genome.cleaned.fa -o genome.final.fa
  3. Align the RNA-seq reads to the cleaned genome. I'll use hisat2 here, but you could use a different aligner; either way, the BAM file needs to be sorted, e.g. with samtools sort.
#build reference
hisat2-build genome.final.fa genome

#now align reads to reference, using 12 cpus
hisat2 --max-intronlen 3000 -p 12 -x genome -1 forward_R1.fastq -2 reverse_R2.fastq \
       | samtools view -bS - | samtools sort -o RNA_seq.bam - 
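
Because the BAM file must be sorted, it can help to confirm the sort order recorded in its header before passing it on. The sketch below checks the SO tag on the @HD header line, which samtools sort sets to coordinate. The helper name is_coordinate_sorted is hypothetical, and the check trusts the header rather than inspecting the alignments themselves.

```shell
#!/bin/sh
# Sketch only: reads SAM header text on stdin and succeeds if the @HD line
# declares coordinate sort order (SO:coordinate).

is_coordinate_sorted() {
    grep '^@HD' | grep -q 'SO:coordinate'
}

# Demonstration on a hand-written header, so this runs without samtools.
printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:scaffold_1\tLN:1000\n' \
    | is_coordinate_sorted && echo "sorted"

# Real usage, guarded so it only runs if samtools and the BAM file exist.
if command -v samtools >/dev/null 2>&1 && [ -f RNA_seq.bam ]; then
    samtools view -H RNA_seq.bam | is_coordinate_sorted \
        || echo "RNA_seq.bam is not marked as coordinate-sorted" >&2
fi
```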

3b) You can also run something like Trinity/PASA/TransDecoder with the RNA-seq reads; for instructions, see the Trinity and PASA documentation.

#run Trinity de novo
Trinity.pl --seqType fq --left forward_R1.fastq --right reverse_R2.fastq --max_memory 50G --CPU 6 \
    --jaccard_clip --trimmomatic --normalize_reads

#run Trinity genome guided
Trinity.pl --genome_guided_bam RNA_seq.bam --max_memory 50G --genome_guided_max_intron 3000 --CPU 6

#Concatenate the output of each
cat Trinity.fasta Trinity-GG.fasta > transcripts.fasta

#create transcript accessions
$PASA_HOME/misc_utilities/accession_extractor.pl < Trinity.fasta > tdn.accs

#now run PASA
$PASA_HOME/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome.final.fa \
    -t transcripts.fasta --TDN tdn.accs --ALIGNERS blat,gmap

#run PASA mediated TransDecoder
$PASA_HOME/scripts/pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta genome.assemblies.fasta \
    --pasa_transcripts_gff3 genome.pasa_assemblies.gff3
  4. Now that you have an assembly, RNA_seq.bam, PASA_assemblies.gff3, and transcripts, you can run funannotate like so:
funannotate predict -i genome.final.fa -o fun_out --species "Fungus specious" \
    --pasa_gff genome.fasta.transdecoder.genome.gff3 --rna_bam RNA_seq.bam \
    --transcript_evidence transcripts.fasta --cpus 12

This command will first run RepeatModeler on your genome, soft-mask repeats using RepeatMasker, align UniProtKB proteins to the genome using tblastn/exonerate, align transcripts.fasta to the genome using GMAP, launch BRAKER to train/run AUGUSTUS and GeneMark-ET, combine all predictions and evidence into gene models using EVidenceModeler, predict tRNAs, filter bad gene models, rename gene models, and finally convert to GenBank format.
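
Since this step chains many external programs, a quick pre-flight check of your PATH can save a failed run. funannotate checks dependencies itself where it can, and `./setup.sh dep` does the same during setup, so the sketch below is only a lightweight supplement. The helper name missing_tools is hypothetical, and the executable names passed to it are assumptions about how each program is installed on your system.

```shell
#!/bin/sh
# Sketch only: prints each named command that is not found on PATH.

missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names below are assumed; adjust them to match your install.
missing_tools RepeatModeler RepeatMasker tblastn exonerate gmap augustus
```

No output means everything in the list was found; each printed name is a program to install or add to your PATH.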

  5. You should now examine the output in the fun_out/predict_results folder. Be sure to look at the NCBI discrepancy report to identify any gene models that need to be adjusted manually.

  6. If you are interested in secondary metabolism gene clusters, submit your genome.gbk file to the antiSMASH web server. You can then download the results in GenBank format, and funannotate can parse them.

  7. To add functional annotation to your genome, run the following command:

funannotate annotate -i fun_out -e youremail@domain.edu --antismash scaffold_1.final.gbk \
     --sbt my_ncbi_template.sbt --cpus 12

Your results will be located in the fun_out/annotate_results folder. It contains the files necessary for the NCBI WGS submission system, assuming you have passed in an appropriate submission template; otherwise, funannotate uses a generic template.

####What if I've already run Maker2, can I use funannotate?

Yes, you can. As of v0.1.5, you can pass your Maker2 GFF file to the --maker_gff option of funannotate predict, which will parse the alignment evidence and ab initio gene predictions from Maker into a format for EVidenceModeler. The --maker_gff option bypasses the gene predictions and evidence alignments done by funannotate and proceeds directly to EVM; the rest of the script then runs normally (filtering gene models and converting to GenBank). For example:

#simple example
funannotate predict -i genome1.fasta --species "Genome one" -o test_output \
    --maker_gff maker_genome1.all.gff --cpus 6

#maker + pasa?
funannotate predict -i genome1.fasta --species "Genome one" -o test_output \
    --maker_gff maker_genome1.all.gff --pasa_gff my_pasa.gff --cpus 6