
Continue from where it broke off when running emapper.py #249

Closed
li0604 opened this issue Nov 18, 2020 · 16 comments

@li0604

li0604 commented Nov 18, 2020

Hi everyone,

Because the input file is large, is there a parameter I could add to continue from where the run broke off?

@Cantalapiedra
Collaborator

Hi @li0604 ,

From which step would you like to resume?
If the search finished entirely you could run only the annotation step with -m no_search --annotate_hits_table seed_orthologs.file
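For example, assuming the interrupted run used the output prefix "myrun" (the file names and directory below are placeholders):

emapper.py -m no_search --annotate_hits_table myrun.emapper.seed_orthologs -o myrun_annot --output_dir outdir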

@li0604
Author

li0604 commented Nov 21, 2020 via email

@Cantalapiedra
Collaborator

Hi Qingmei,

what are the contents of your output folder?

@li0604
Author

li0604 commented Nov 24, 2020

It is in this form. The output file is named output_file.emapper.annotations.

Part of the contents:

k10_contig_1_1 | 335992.SAR11_0510 | 5.40E-19 | 100.1 | unclassified | Alphaproteobacteria | glc | 2.3.3.9 | ko:K0163ko00620,ko00630,ko01100,ko01110,ko01120,ko01200,map00620,map00630,map01100,map01110,map01120,map01200 | M00012 | R00472 | RC00004,RC00308,RC02747 | ko00000,ko00001,ko00002,ko01000 | acteria | 1MVEV@1224,2TS9R@28211,4PUK@82117,COG2225@1,COG2225@2 | NA|NA|NA | C | Involved in the glycolate utilization. Catalyzes the condensation and subsequent hydrolysis of acetyl-coenzyme A (acetyl-CoA) and glyoxylate to form malate and CoA
k10_contig_1_2 | 1400524.KL370779_gene592 | 9.60E-26 | 122.5 | unclassified | Alphaproteobacteria | accA | 2.1.3.15,6.4.1.2 | ko:K01962,ko:K01963 | ko00061,ko00620,ko00640,ko00720,ko01100,ko01110,ko01120,ko01130,ko01200,ko01212,map00061,map00620,map00640,map00720,map01100,map01110,map01120,map01130,map01200,map01212 | M00082,M00376 | R00742,R04386 | RC00040,RC00253,RC00367 | ko00000,ko00001,ko00002,ko01000 | acteria | 1MURN@1224,2TR6V@28211,4P8Q@82117,COG0825@1,COG0825@2 | NA|NA|NA | I | Component of the acetyl coenzyme A carboxylase (ACC) complex. First, biotin carboxylase catalyzes the carboxylation of biotin on its carrier protein
k10_contig_2_1 | 857087.Metme_0730 | 7.50E-84 | 316.6 | Methylococcales | tldD | GO:0005575,GO:0005622,GO:0005623,GO:0005737,GO:0005829,GO:0006508,GO:0006807,GO:0008150,GO:0008152,GO:0019538,GO:0043170,GO:0044238,GO:0044424,GO:0044444,GO:0044464,GO:0071704,GO:1901564 | ko:K03568 | ko00000,ko01002 | acteria | 1MUSK@1224,1RMA5@1236,1XE3Y@135618,COG0312@1,COG0312@2 | NA|NA|NA | S | modulator of DNA gyrase

@Cantalapiedra
Collaborator

Hi,

unfortunately there is no way to directly resume the annotation step, although it would be a nice feature to implement.
With current versions you could:

  • Just re-run the annotation step from scratch, using "-m no_search --annotate_hits_table output_file.emapper.seed_orthologs"
  • Create a new seed orthologs file by removing the entries which are already in output_file.emapper.annotations, and run the previous command with just those entries. You could try something like:

join -v 1 -t $'\t' <(grep -v "^#" seed_orthologs_file | sort) <(grep -v "^#" annotations_file | sort) | cut -f 1-4 > remaining_seed_orthologs
emapper.py -m no_search --annotate_hits_table remaining_seed_orthologs -o remaining ...
cat output_file.emapper.annotations remaining.emapper.annotations > all.emapper.annotations
rm remaining_seed_orthologs remaining.emapper.annotations output_file.emapper.annotations
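The join -v 1 keeps only the seed-ortholog rows whose query name (first column) is not yet present in the annotations file, which is why both inputs are sorted first. As an optional sanity check after the merge (just a sketch; it assumes the query name is the first column of the tab-separated annotations file):

# should print nothing: no query annotated twice after merging
grep -v "^#" all.emapper.annotations | cut -f 1 | sort | uniq -d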

I hope this helps.

Best,
Carlos

@li0604
Author

li0604 commented Nov 24, 2020 via email

@Cantalapiedra
Collaborator

Glad to hear that.
Let's see if we can implement the --resume option for annotations anytime soon.

Best,
Carlos

@li0604
Author

li0604 commented Nov 24, 2020 via email

@Sofie8

Sofie8 commented Jan 9, 2021

Hi, I am also in favour of a --resume option for the annotation step. What is the --resume option currently 'resuming'? I thought it was resuming the annotation step, but after 72 hours of runtime I exceeded my walltime, and on restarting it just erased everything :-( (and goodbye to my computing credits...). I am using emapper.py as implemented in atlas.

We split into subsets of 500,000, but on a single machine with 36 threads and 198 GB of memory, 72 hours is not enough to finish one subset... Does it scale well with more threads, or do you have a suggestion for how best to set up my jobs? I also have one big-mem node available (36 threads, 760 GB RAM), or an AMD machine (64 nodes, 256 GB RAM). Thanks!

@Cantalapiedra
Collaborator

Hi @Sofie8 ,

Sorry to hear that about your computing credits. I am not sure what atlas is. The --resume option is a somewhat old option used for hmmer searches; there is no actual resume option for the diamond, mmseqs or annotation steps.

Besides that, when running large datasets I would recommend not only splitting the dataset, but also splitting the emapper steps. Not sure if you are doing that already. It would be something like this (depending on the emapper version):

emapper -m diamond -i input.fasta -o test --output_dir outdir
emapper -m no_search --annotate_hits_table outdir/test.emapper.seed_orthologs -o test --output_dir outdir
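If a single annotation job is still too long for your walltime, one option (not built into emapper, just a sketch) is to chunk the seed orthologs table and annotate each chunk as a separate job:

# split the search results into chunks of 100,000 hits (chunk size is arbitrary)
grep -v "^#" outdir/test.emapper.seed_orthologs | split -l 100000 - outdir/chunk_
for f in outdir/chunk_*; do
    emapper -m no_search --annotate_hits_table "$f" -o "$(basename "$f")" --output_dir outdir
done
# then concatenate the per-chunk *.emapper.annotations files into a single table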

The more threads, the faster it usually is. Also, on the nodes with 256GB or more you could use -m mmseqs instead of -m diamond, which should be faster, if your emapper version includes the mmseqs option.

Also, in the latest versions there is an option to load the annotation DB into memory (--dbmem), which should speed up the annotation step quite a bit. See https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2-*refactor*#Annotation_Options
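For example (flag availability depends on the emapper version; file names are the same placeholders as above):

emapper -m mmseqs -i input.fasta -o test --output_dir outdir
emapper -m no_search --annotate_hits_table outdir/test.emapper.seed_orthologs -o test --output_dir outdir --dbmem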

Best,
Carlos

@Sofie8

Sofie8 commented Jan 13, 2021

Hi @Cantalapiedra ,

Thanks for your answer!

Yes, atlas is the metagenome analysis pipeline from Silas: https://github.com/metagenome-atlas/atlas/issues/351

I have now run the eggnog annotation step outside of atlas:
emapper.py --annotate_hits_table Genecatalog/subsets/genes/remaining_seed_orthologs \
    --no_file_comments --resume -o Genecatalog/subsets/genes/subset2 --cpu 36 \
    --data_dir /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2 2>> Genecatalog/subsets/genes/logs/subset2/eggNOG_annotate_hits_table.log

following this to split the file and join the results back together:
join -v 1 -t $'\t' <(grep -v "^#" subset2.emapper.seed_orthologs | sort) <(grep -v "^#" subset2.emapper.annotations | sort) | cut -f 1-4 > remaining_seed_orthologs
emapper.py -m no_search --annotate_hits_table remaining_seed_orthologs -o remaining ...
cat output_file.emapper.annotations remaining.emapper.annotations > all.emapper.annotations

But after cat, combine_egg_nogg_annotations says:
Error Expected 22 fields in line 320861, saw 64

Is cat doing something with a blank line between the two files, or putting things on one line?
It is exactly at the point where the two files merge that it fails...
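One way to check that (just a sketch; the line number is taken from the error above, and a missing trailing newline in the first file would make cat glue its last record to the first record of the second file):

sed -n '320860,320862p' all.emapper.annotations | cat -A
tail -c 1 output_file.emapper.annotations | od -c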

Ok for the other suggestions!
@SilasK can this be useful for further improving the genecatalog step? Note also that --resume is not actually resuming, so if the annotation step breaks in atlas, or does not complete, it starts the annotation all over again. So in my case I have to split into subsets smaller than 500,000 to finish within 3 days (36 threads, 198 GB RAM).
In the latest release there are also:

  • the option --mmseqs
  • the option --dbmem

Best,
Sofie

@SilasK
Contributor

SilasK commented Jan 14, 2021

See my response on: metagenome-atlas/atlas#351

@Cantalapiedra
Collaborator

Cantalapiedra commented Jan 14, 2021

Hi,

just a reminder that the --dbmem option needs around 40GB of free memory, that using mmseqs requires downloading the corresponding eggnog-mapper mmseqs database (using the download script), and that the --mmseqs option itself requires a lot of memory to run.
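For reference, the mmseqs database is fetched with the bundled download script; in recent versions it is something like the following (the exact flag may differ between versions, so check download_eggnog_data.py --help):

download_eggnog_data.py -M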

Therefore, in both cases it is recommended to run fewer jobs with more sequences per job, and to set the number of jobs per computer or cluster node according to the available memory.

Best,
Carlos

@SilasK
Contributor

SilasK commented Jan 14, 2021

Is the mmseqs version already officially released? I didn't know that.
How much memory does mmseqs use? Do you use profiles or search mode?

If I'm not mistaken, during emapper.py --annotate_hits_table you don't use mmseqs, do you?

@Cantalapiedra
Collaborator

Hi,

MMseqs can be used for the search step in the "refactor" branch, which we hope to merge soon with the "master" one. https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2-*refactor*

So far we are using search mode. To be honest, I don't know exactly how much (peak) memory it uses. I am currently running some jobs on nodes with 236GB, and that seems to be enough. For less than 200GB I would use diamond, or hmmer in server mode.

You are right, --annotate_hits_table (along with -m no_search in the refactor version) is used to run the annotation step without running the previous search step, so no MMseqs (nor diamond) is involved there. The --dbmem option is used during the annotation step though: using a bit less than 40GB of memory, it loads the sqlite3 DB into memory before annotating (which can be a convenient replacement for using /dev/shm).
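For completeness, the /dev/shm approach I mean is roughly this (paths and prefixes are placeholders):

cp /path/to/eggnog_data/eggnog.db /dev/shm/
emapper.py -m no_search --annotate_hits_table test.emapper.seed_orthologs -o test --data_dir /dev/shm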

Best,
Carlos

@Cantalapiedra
Collaborator

--resume currently resumes most of the emapper stages, since version 2.1.0.
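For example, re-running the interrupted command with the same output prefix and output directory plus --resume should pick up the parts that already finished (names below are placeholders):

emapper.py -m diamond -i input.fasta -o test --output_dir outdir --resume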
