eggNOG mapper v2
`eggnog-mapper` is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database to transfer functional information from fine-grained orthologs only.
Common uses of eggNOG-mapper include the annotation of novel genomes, transcriptomes or even metagenomic gene catalogs.
The use of orthology predictions for functional annotation permits a higher precision than traditional homology searches (i.e. BLAST searches), as it avoids transferring annotations from close paralogs (duplicate genes with a higher chance of being involved in functional divergence).
Benchmarks comparing different eggNOG-mapper options against BLAST and InterProScan are [available here](https://github.com/jhcepas/emapper-benchmark/blob/master/benchmark_analysis.ipynb).
EggNOG-mapper is also available as a public online resource: http://eggnog-mapper.embl.de
- Expanded database of precomputed orthology assignments, now based on eggNOG v5.0. This includes 5090 representative genomes (4445 bacteria, 168 archaea and 477 eukaryota), as well as 2502 viral proteomes.
- HMMER search mode is deprecated. See the FAQ entry: FAQ---Frequently-Asked-Questions#why-i-cannot-choose-hmmer-search-mode-in-version-20
- Updated functional sources (e.g. KEGG, GeneOntology)
- New columns in the output annotation file :
1. query_name
2. seed eggNOG ortholog
3. seed ortholog evalue
4. seed ortholog score
5. Predicted taxonomic group
6. Predicted protein name
7. Gene Ontology terms
8. EC number
9. KEGG_ko
10. KEGG_Pathway
11. KEGG_Module
12. KEGG_Reaction
13. KEGG_rclass
14. BRITE
15. KEGG_TC
16. CAZy
17. BiGG Reaction
18. tax_scope: eggNOG taxonomic level used for annotation
19. eggNOG OGs
20. bestOG (deprecated, use smallest from eggnog OGs)
21. COG Functional Category
22. eggNOG free text description
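As a rough illustration of how these columns can be consumed downstream, the following standalone Python 3 sketch reads an annotation table into dictionaries. It assumes the file is tab-separated and that comment or header lines start with `#`. The keys in `COLUMNS` and the function `read_annotations` are illustrative names chosen here, not identifiers defined by eggNOG-mapper.

```python
import csv
import io

# One key per output column, in the order listed above.
# The key spellings are hypothetical; adapt them to your own conventions.
COLUMNS = [
    "query_name", "seed_eggNOG_ortholog", "seed_ortholog_evalue",
    "seed_ortholog_score", "predicted_taxonomic_group",
    "predicted_protein_name", "GO_terms", "EC_number", "KEGG_ko",
    "KEGG_Pathway", "KEGG_Module", "KEGG_Reaction", "KEGG_rclass",
    "BRITE", "KEGG_TC", "CAZy", "BiGG_Reaction", "tax_scope",
    "eggNOG_OGs", "bestOG", "COG_functional_category",
    "eggNOG_free_text_description",
]


def read_annotations(handle):
    """Yield one dict per annotated query from a tab-separated handle."""
    # Assumption: comment/header lines start with '#'; blank lines are skipped.
    rows = (line for line in handle
            if line.strip() and not line.startswith("#"))
    for fields in csv.reader(rows, delimiter="\t", quoting=csv.QUOTE_NONE):
        # zip() stops at the shorter sequence, so short rows are tolerated.
        yield dict(zip(COLUMNS, fields))


if __name__ == "__main__":
    # Made-up two-line input, only to show the call.
    example = io.StringIO("# a comment line\nquery1\tseedA\t1e-10\t55.0\n")
    for record in read_annotations(example):
        print(record["query_name"], record["seed_ortholog_score"])
```

All values are kept as strings; converting the e-value and score to numbers is left to the caller.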
- Python 2.7
- wget
- DIAMOND binaries available (otherwise using the ones packaged with eggNOG-mapper)
- BioPython (required only if using the `--translate` option)
- ~40GB for the eggNOG annotation database
- ~10GB for sequence database
- Download and decompress the latest version of eggnog-mapper from
https://github.com/jhcepas/eggnog-mapper/releases. The program does not require compilation nor installation.
- or clone this git repository (master branch):
git clone https://github.com/jhcepas/eggnog-mapper.git
To download the necessary databases, run the `download_eggnog_data.py` script shipped with eggNOG-mapper.
This will fetch and decompress all precomputed eggNOG data into the data/ directory.
To start an annotation job, provide a FASTA file containing your query sequences, and run `emapper.py`
python emapper.py -i test/p53.fa --output p53_maNOG -m diamond
The following recommendations are based on our experience annotating huge genomic and metagenomic datasets (>100M proteins).
eggNOG-mapper works in two phases: 1) finding seed orthologous sequences, and 2) expanding annotations. Phase 1 is mainly CPU intensive, while phase 2 is dominated by disk operations. You can therefore optimize the annotation of huge files by running each phase on a different setup.
1) Split your input FASTA file into chunks, each containing a moderate number of sequences (1M sequences per file worked well in our tests). We usually work with FASTA files where each sequence is on a single line, so splitting is very simple: every record takes exactly two lines, and writing 2,000,000 lines per chunk gives 1M sequences per file.
split -l 2000000 -a 3 -d input_file.faa input_file.chunk_
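The `split -l` approach only works when every sequence occupies a single line. For FASTA files with wrapped sequences, a record-aware splitter is needed. The following is a minimal standalone Python 3 sketch of that idea; `split_fasta` and its parameters are hypothetical names, and the default chunk names (`<input>.chunk_000`, `<input>.chunk_001`, ...) are chosen only so that they still match the `*.chunk_*` pattern used in the next step.

```python
import os
import tempfile


def split_fasta(path, seqs_per_chunk=1000000, prefix=None):
    """Write consecutive chunks of seqs_per_chunk records; return their paths."""
    prefix = prefix or path + ".chunk_"
    chunk_paths = []
    out = None
    n_in_chunk = 0
    try:
        with open(path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # Open a new chunk for the first record and whenever
                    # the current chunk already holds seqs_per_chunk records.
                    if out is None or n_in_chunk == seqs_per_chunk:
                        if out is not None:
                            out.close()
                        chunk_path = "%s%03d" % (prefix, len(chunk_paths))
                        chunk_paths.append(chunk_path)
                        out = open(chunk_path, "w")
                        n_in_chunk = 0
                    n_in_chunk += 1
                # Lines before the first header (if any) are ignored.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths


if __name__ == "__main__":
    # Tiny made-up input, only to show the call.
    workdir = tempfile.mkdtemp()
    fasta_path = os.path.join(workdir, "input_file.faa")
    with open(fasta_path, "w") as fh:
        fh.write(">seq1\nMKV\nLLA\n>seq2\nMAT\n>seq3\nMTT\n")
    print(split_fasta(fasta_path, seqs_per_chunk=2))
```

Because records are counted by header line rather than by line number, wrapped and single-line FASTA files are handled the same way.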
2) Use diamond mode. Each chunk can be processed independently in a cluster node, and you should tell `emapper.py` not to run the annotation phase yet. This way you can parallelize diamond searches as much as you want, even when running from a shared file system. Assuming an example with 100M proteins, the above command will generate 100 file chunks, and each should run diamond using 16 cores. The necessary commands that need to be submitted to the cluster queue can be generated with something like this:
# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
    echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f;
done
The annotation phase needs to query `data/eggnog.db` intensively. This file is a sqlite3 database, so it is highly recommended that the file lives under the fastest local disk possible. For instance, we store `eggnog.db` in SSD disks or, if possible, under `/dev/shm` (memory based filesystem).
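Staging the database on a fast disk can be scripted. The sketch below is a standalone Python 3 helper, not part of eggNOG-mapper: it copies a database file into a fast directory (for example `/dev/shm`) only if enough free space is available, and otherwise falls back to the original path. The function name `stage_database` is hypothetical. How `emapper.py` is then pointed at the staged copy (for example through its `--data_dir` option, if your version provides it) is left outside the sketch.

```python
import os
import shutil
import tempfile


def stage_database(db_path, fast_dir):
    """Copy db_path into fast_dir if it fits; return the path to use."""
    target = os.path.join(fast_dir, os.path.basename(db_path))
    needed = os.path.getsize(db_path)
    # Reuse an existing copy of the same size instead of copying again.
    if os.path.exists(target) and os.path.getsize(target) == needed:
        return target
    # Not enough room on the fast disk: keep using the original file.
    if shutil.disk_usage(fast_dir).free < needed:
        return db_path
    shutil.copy2(db_path, target)
    return target


if __name__ == "__main__":
    # Made-up stand-in for data/eggnog.db, only to show the call.
    data_dir = tempfile.mkdtemp()
    fast_dir = tempfile.mkdtemp()
    db = os.path.join(data_dir, "eggnog.db")
    with open(db, "wb") as fh:
        fh.write(b"\0" * 4096)
    print(stage_database(db, fast_dir))
```

The size comparison is only a cheap check that a previous copy completed; it does not verify the file contents.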
3) Concatenate all the chunk_*.emapper.seed_orthologs files.
cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
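With many cluster jobs, it is easy to concatenate the results while one chunk is still missing. As a hedged alternative to plain `cat`, the following standalone Python 3 sketch refuses to merge unless every chunk has its seed ortholog table. It assumes each chunk was run with `-o` set to the chunk name, so that the table is called `<chunk>.emapper.seed_orthologs`; `merge_seed_orthologs` is a hypothetical helper name.

```python
import os
import shutil
import tempfile

SUFFIX = ".emapper.seed_orthologs"


def merge_seed_orthologs(chunk_paths, merged_path):
    """Concatenate the seed ortholog table of every chunk into merged_path.

    Returns the number of tables merged; raises IOError if any is missing.
    """
    tables = [chunk + SUFFIX for chunk in sorted(chunk_paths)]
    missing = [t for t in tables if not os.path.isfile(t)]
    if missing:
        # Fail before writing anything, so no partial result is produced.
        raise IOError("missing seed ortholog tables: %s" % ", ".join(missing))
    with open(merged_path, "wb") as merged:
        for table in tables:
            with open(table, "rb") as part:
                shutil.copyfileobj(part, merged)
    return len(tables)


if __name__ == "__main__":
    # Made-up chunk results, only to show the call.
    workdir = tempfile.mkdtemp()
    chunks = [os.path.join(workdir, "input_file.chunk_%03d" % i)
              for i in range(3)]
    for i, chunk in enumerate(chunks):
        with open(chunk + SUFFIX, "w") as fh:
            fh.write("query%d\tseed%d\n" % (i, i))
    merged = os.path.join(workdir, "input_file.emapper.seed_orthologs")
    print(merge_seed_orthologs(chunks, merged))
```

The list of chunk paths is passed in explicitly (for instance, the list produced when splitting), because a `*.chunk_*` glob would also match the result files themselves.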
4) Run the orthologs search and annotation phase on a single multi-core machine (10 cores in our example), reading from a fast disk.
We usually annotate at a rate of 300-400 proteins per second using 10 CPU cores and with `eggnog.db` under `/dev/shm`, but you can of course run many such instances in parallel. If you are running `emapper.py` from a conda environment, check [this issue](https://github.com/jhcepas/eggnog-mapper/issues/80).
emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10
and _voilà_, you got your annotations.
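To get a feeling for what the quoted annotation rate means for the 100M-protein example, the arithmetic can be sketched as follows. The function name `annotation_hours` is hypothetical, and the `n_instances` parameter assumes that parallel instances scale linearly, which may not hold if they compete for the same disk.

```python
def annotation_hours(n_proteins, proteins_per_second, n_instances=1):
    """Estimated wall-clock hours for the annotation phase."""
    # Assumption: throughput adds up linearly across parallel instances.
    seconds = n_proteins / float(proteins_per_second * n_instances)
    return seconds / 3600.0


if __name__ == "__main__":
    for rate in (300, 400):
        hours = annotation_hours(100e6, rate)
        print("%d proteins/s: %.1f hours" % (rate, hours))
    # 300 proteins/s: 92.6 hours
    # 400 proteins/s: 69.4 hours
```

At the quoted rates, a single instance would need roughly three to four days for 100M proteins, which is why running several annotation instances in parallel is worth considering.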
Please cite the following two papers if you use eggNOG-mapper v2:
- Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Jaime Huerta-Cepas, Damian Szklarczyk, Lars Juhl Jensen, Christian von Mering and Peer Bork. Submitted (2016).
- eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork. Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085