eggNOG mapper v2

Jaime Huerta-Cepas edited this page May 17, 2019 · 3 revisions

Overview

`eggnog-mapper` is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database to transfer functional information from fine-grained orthologs only.

Common uses of eggNOG-mapper include the annotation of novel genomes, transcriptomes or even metagenomic gene catalogs.

The use of orthology predictions for functional annotation permits a higher precision than traditional homology searches (i.e. BLAST searches), as it avoids transferring annotations from close paralogs (duplicate genes with a higher chance of being involved in functional divergence).

Benchmarks comparing different eggNOG-mapper options against BLAST and InterProScan are available [here](https://github.com/jhcepas/emapper-benchmark/blob/master/benchmark_analysis.ipynb).

EggNOG-mapper is also available as a public online resource: http://eggnog-mapper.embl.de

What's new in eggNOG-mapper v2

v2.0.0

  • Expanded database of precomputed orthology assignments, now based on eggNOG v5.0. This includes 5090 representative genomes (4445 bacteria, 168 archaea and 477 eukaryota), as well as 2502 viral proteomes.
  • HMMer search mode is deprecated. See the FAQ entry: FAQ---Frequently-Asked-Questions#why-i-cannot-choose-hmmer-search-mode-in-version-20
  • Updated functional sources (e.g. KEGG, GeneOntology)
  • New columns in the output annotation file:
1. query_name
2. seed eggNOG ortholog
3. seed ortholog evalue
4. seed ortholog score
5. Predicted taxonomic group
6. Predicted protein name
7. Gene Ontology terms
8. EC number
9. KEGG_ko
10. KEGG_Pathway
11. KEGG_Module
12. KEGG_Reaction
13. KEGG_rclass
14. BRITE
15. KEGG_TC
16. CAZy
17. BiGG Reaction
18. tax_scope: eggNOG taxonomic level used for annotation
19. eggNOG OGs
20. bestOG (deprecated, use smallest from eggnog OGs)
21. COG Functional Category
22. eggNOG free text description
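
As an illustration of how the columns listed above might be consumed downstream, here is a minimal Python 3 sketch (independent of the Python 2.7 that `emapper.py` itself requires). It assumes the annotation file is tab-separated, that comment lines start with `#`, and that fields appear in the order of the list. The short labels in `COLUMNS` and the function name `read_annotations` are hypothetical names chosen for this example, not headers written by eggNOG-mapper.

```python
# Hypothetical short labels for the output columns, in the order listed above.
COLUMNS = [
    "query_name", "seed_eggnog_ortholog", "seed_ortholog_evalue",
    "seed_ortholog_score", "predicted_taxonomic_group",
    "predicted_protein_name", "go_terms", "ec_number", "kegg_ko",
    "kegg_pathway", "kegg_module", "kegg_reaction", "kegg_rclass", "brite",
    "kegg_tc", "cazy", "bigg_reaction", "tax_scope", "eggnog_ogs", "best_og",
    "cog_functional_category", "eggnog_description",
]


def read_annotations(path):
    """Yield one dict per annotated query, keyed by the labels in COLUMNS."""
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and comment lines (assumed to start with '#').
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            # Pad short rows so that every label is present in the result.
            fields += [""] * (len(COLUMNS) - len(fields))
            yield dict(zip(COLUMNS, fields))
```

Each yielded dictionary can then be filtered by key, for example keeping only rows whose `kegg_ko` value is not empty.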

Installation

Requirements

Software Requirements

  • Python 2.7
  • wget
  • DIAMOND binaries (if none are available, the ones packaged with eggNOG-mapper are used)
  • BioPython (required only if using the `--translate` option)

Storage Requirements

  • ~40GB for the eggNOG annotation database
  • ~10GB for sequence database

Download

  • Download and decompress the latest version of eggnog-mapper from
  https://github.com/jhcepas/eggnog-mapper/releases. The program requires
  neither compilation nor installation.
  • Or clone the git repository (master branch):
git clone https://github.com/jhcepas/eggnog-mapper.git

Fetch databases

To download the necessary databases, run the following script:

download_eggnog_data.py 

This will fetch and decompress all precomputed eggNOG data into the data/ directory.

Basic usage

To start an annotation job, provide a FASTA file containing your query sequences and run `emapper.py`:

python emapper.py -i test/p53.fa --output p53_maNOG -m diamond

Setting up large annotation jobs

The following recommendations are based on our experience annotating huge genomic and metagenomic datasets (>100M proteins).

eggNOG-mapper works in two phases: 1) finding seed orthologous sequences and 2) expanding annotations. Phase 1 is mainly CPU intensive, while phase 2 is dominated by disk operations. You can therefore optimize the annotation of huge files by running each phase on a different setup.

Phase 1. Homology searches

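1) Split your input FASTA file into chunks, each containing a moderate number of sequences (1M sequences per file worked well in our tests). We usually work with FASTA files where each sequence is on a single line, so splitting is very simple: every record takes exactly two lines (header and sequence), and 2,000,000 lines per chunk therefore hold 1M sequences.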

split -l 2000000 -a 3 -d input_file.faa input_file.chunk_
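
The `split -l` command relies on every record taking exactly two lines. If your FASTA file wraps sequences over several lines, a line-based split can cut a record in half. The following Python 3 sketch splits on record boundaries instead. The function name `split_fasta` and its parameters are hypothetical, and the chunk names simply imitate the three-digit numeric suffixes produced by `split -a 3 -d`.

```python
def split_fasta(path, prefix, seqs_per_chunk=1000000):
    """Write chunks of at most `seqs_per_chunk` records; return their paths."""
    chunk_paths = []
    out = None
    n_in_chunk = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A header starts a new record; open a new chunk when needed.
                if out is None or n_in_chunk == seqs_per_chunk:
                    if out is not None:
                        out.close()
                    chunk_path = "%s%03d" % (prefix, len(chunk_paths))
                    chunk_paths.append(chunk_path)
                    out = open(chunk_path, "w")
                    n_in_chunk = 0
                n_in_chunk += 1
            # Lines that appear before the first header are dropped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Calling `split_fasta("input_file.faa", "input_file.chunk_")` would write `input_file.chunk_000`, `input_file.chunk_001` and so on, each holding up to 1M records regardless of line wrapping.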


2) Use diamond mode. Each chunk can be processed independently in a cluster node, and you should tell `emapper.py` not to run the annotation phase yet. This way you can parallelize diamond searches as much as you want, even when running from a shared file system. Assuming an example with 100M proteins, the above command will generate 100 file chunks, and each should run diamond using 16 cores. The necessary commands that need to be submitted to the cluster queue can be generated with something like this:

# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f; 
done
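
The same command list can also be built in Python 3, which makes it easier to quote unusual file names and to avoid picking up result files from an earlier run (a glob such as `*.chunk_*` also matches `*.chunk_000.emapper.seed_orthologs`). This is only a sketch: the function name `emapper_commands` is hypothetical, and the flags are exactly those used in the shell loop above.

```python
import shlex


def emapper_commands(chunk_paths, cpus=16):
    """Return one diamond-only emapper.py command line per input chunk."""
    commands = []
    for chunk in sorted(chunk_paths):
        # Skip emapper result files left over from a previous run.
        if ".emapper." in chunk:
            continue
        args = ["./emapper.py", "-m", "diamond", "--no_annot",
                "--no_file_comments", "--cpu", str(cpus),
                "-i", chunk, "-o", chunk]
        # Quote every argument so that odd file names survive the shell.
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands
```

The returned strings can be written one per line to a file and handed to whatever job scheduler your cluster uses.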

Phase 2. Orthology and functional annotation

The annotation phase needs to query `data/eggnog.db` intensively. This file is a sqlite3 database, so it is highly recommended that the file lives under the fastest local disk possible. For instance, we store `eggnog.db` in SSD disks or, if possible, under `/dev/shm` (memory based filesystem).
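
Copying the database to a faster location could be scripted along these lines. This Python 3 sketch checks that the target file system has enough free space before copying, which matters for a memory-backed location such as `/dev/shm`. The function name `stage_database` is hypothetical, and the sketch does not cover how `emapper.py` is pointed at the copy; check `emapper.py --help` in your installation for the available options.

```python
import os
import shutil


def stage_database(db_path, fast_dir):
    """Copy `db_path` into `fast_dir` if there is room; return the new path."""
    needed = os.path.getsize(db_path)
    free = shutil.disk_usage(fast_dir).free
    if free < needed:
        raise RuntimeError(
            "%s has %d bytes free, but %d are needed" % (fast_dir, free, needed))
    target = os.path.join(fast_dir, os.path.basename(db_path))
    shutil.copyfile(db_path, target)
    return target
```

For example, `stage_database("data/eggnog.db", "/dev/shm")` would return `/dev/shm/eggnog.db` once the copy has finished.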

3) Concatenate all chunk_*.emapper.seed_orthologs files.

cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
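
With many chunks spread over a cluster, a failed job can silently leave a gap in the concatenated file. A minimal Python 3 sketch of a safer merge is shown below: it refuses to concatenate until every chunk has its result file. The function names are hypothetical; the `.emapper.seed_orthologs` suffix is the one used in the `cat` command above.

```python
import os
import shutil

SEED_SUFFIX = ".emapper.seed_orthologs"


def missing_seed_files(chunk_paths):
    """Return the chunks that have no seed orthologs file yet."""
    return [c for c in chunk_paths if not os.path.exists(c + SEED_SUFFIX)]


def concatenate_seed_files(chunk_paths, merged_path):
    """Concatenate all per-chunk seed orthologs files into `merged_path`."""
    missing = missing_seed_files(chunk_paths)
    if missing:
        # Fail before writing anything, so no partial merge is left behind.
        raise RuntimeError("no results yet for: %s" % ", ".join(missing))
    with open(merged_path, "wb") as merged:
        for chunk in sorted(chunk_paths):
            with open(chunk + SEED_SUFFIX, "rb") as part:
                shutil.copyfileobj(part, merged)
```

Here `chunk_paths` is the list of chunk input files (for example `input_file.chunk_000`), not the result files themselves.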

4) Run the orthology search and annotation phase on a single multi-core machine (10 cores in our example), reading from a fast disk.

emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10

We usually annotate at a rate of 300-400 proteins per second using 10 CPU cores and with `eggnog.db` under `/dev/shm`, but you can of course run many of those instances in parallel. If you are running `emapper.py` from a conda environment, check [this issue](https://github.com/jhcepas/eggnog-mapper/issues/80).

And _voilà_, you have your annotations.
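
To plan the annotation phase, the rate quoted above can be turned into a rough wall-clock estimate. At 300-400 proteins per second, a single instance would need roughly 69 to 93 hours for 100M proteins. The Python 3 sketch below does this arithmetic. The function name is hypothetical, and the `n_instances` parameter assumes that parallel instances scale linearly, which may not hold if they compete for the same disk.

```python
def annotation_hours(n_proteins, proteins_per_second, n_instances=1):
    """Estimate wall-clock hours for the annotation phase.

    Assumes every instance sustains `proteins_per_second` independently.
    """
    seconds = n_proteins / float(proteins_per_second * n_instances)
    return seconds / 3600.0
```

For instance, `annotation_hours(100e6, 400)` and `annotation_hours(100e6, 300)` give the two ends of the range above, and passing `n_instances=4` divides both by four under the linear-scaling assumption.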

Citation

Please cite the following two papers if you use eggNOG-mapper v2:

[1] Fast genome-wide functional annotation through orthology assignment by
      eggNOG-mapper. Jaime Huerta-Cepas, Damian Szklarczyk, Lars Juhl Jensen,
      Christian von Mering and Peer Bork. Submitted (2016).

[2] eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated
      orthology resource based on 5090 organisms and 2502 viruses. Jaime
      Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia
      K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars
      J Jensen, Christian von Mering, Peer Bork. Nucleic Acids Res. 2019 Jan 8;
      47(Database issue): D309–D314. doi: 10.1093/nar/gky1085