
Continue from where it broke off when running emapper.py #249

Closed
li0604 opened this issue Nov 18, 2020 · 16 comments

@li0604

li0604 commented Nov 18, 2020

Hi everyone,

Because the input file is large, is there a parameter I could add to continue from where the run broke off?

@Cantalapiedra
Collaborator

Hi @li0604 ,

From which step would you like to resume?
If the search finished entirely you could run only the annotation step with -m no_search --annotate_hits_table seed_orthologs.file
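For example, assuming the interrupted run used the output prefix "myrun" (the file names and directory below are placeholders):

emapper.py -m no_search --annotate_hits_table myrun.emapper.seed_orthologs -o myrun_annot --output_dir outdir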

@li0604
Author

li0604 commented Nov 21, 2020 via email

@Cantalapiedra
Collaborator

Hi Qingmei,

what are the contents of your output folder?

@li0604
Author

li0604 commented Nov 24, 2020

It is in this form. The output file is named output_file.emapper.annotations.

Part of the contents:

k10_contig_1_1 | 335992.SAR11_0510 | 5.40E-19 | 100.1 | unclassified | Alphaproteobacteria | glc | 2.3.3.9 | ko:K0163ko00620,ko00630,ko01100,ko01110,ko01120,ko01200,map00620,map00630,map01100,map01110,map01120,map01200 | M00012 | R00472 | RC00004,RC00308,RC02747 | ko00000,ko00001,ko00002,ko01000 | acteria | 1MVEV@1224,2TS9R@28211,4PUK@82117,COG2225@1,COG2225@2 | NA|NA|NA | C | Involved in the glycolate utilization. Catalyzes the condensation and subsequent hydrolysis of acetyl-coenzyme A (acetyl-CoA) and glyoxylate to form malate and CoA
k10_contig_1_2 | 1400524.KL370779_gene592 | 9.60E-26 | 122.5 | unclassified | Alphaproteobacteria | accA | 2.1.3.15,6.4.1.2 | ko:K01962,ko:K01963 | ko00061,ko00620,ko00640,ko00720,ko01100,ko01110,ko01120,ko01130,ko01200,ko01212,map00061,map00620,map00640,map00720,map01100,map01110,map01120,map01130,map01200,map01212 | M00082,M00376 | R00742,R04386 | RC00040,RC00253,RC00367 | ko00000,ko00001,ko00002,ko01000 | acteria | 1MURN@1224,2TR6V@28211,4P8Q@82117,COG0825@1,COG0825@2 | NA|NA|NA | I | Component of the acetyl coenzyme A carboxylase (ACC) complex. First, biotin carboxylase catalyzes the carboxylation of biotin on its carrier protein
k10_contig_2_1 | 857087.Metme_0730 | 7.50E-84 | 316.6 | Methylococcales | tldD | GO:0005575,GO:0005622,GO:0005623,GO:0005737,GO:0005829,GO:0006508,GO:0006807,GO:0008150,GO:0008152,GO:0019538,GO:0043170,GO:0044238,GO:0044424,GO:0044444,GO:0044464,GO:0071704,GO:1901564 | ko:K03568 | ko00000,ko01002 | acteria | 1MUSK@1224,1RMA5@1236,1XE3Y@135618,COG0312@1,COG0312@2 | NA|NA|NA | S | modulator of DNA gyrase

@Cantalapiedra
Collaborator

Hi,

unfortunately there is no way to directly resume the annotation step, although it would be a nice feature to implement.
With current versions you could:

  • Just re-run the annotation step from scratch, using "-m no_search --annotate_hits_table output_file.emapper.seed_orthologs"
  • Create a new seed orthologs file by removing the entries which are already in output_file.emapper.annotations, and run the previous command with just those entries. You could try something like:

join -v 1 -t $'\t' <(grep -v "^#" seed_orthologs_file | sort) <(grep -v "^#" annotations_file | sort) | cut -f 1-4 > remaining_seed_orthologs
emapper.py -m no_search --annotate_hits_table remaining_seed_orthologs -o remaining ...
cat output_file.emapper.annotations remaining.emapper.annotations > all.emapper.annotations
rm remaining_seed_orthologs remaining.emapper.annotations output_file.emapper.annotations
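The join -v 1 keeps only the seed-ortholog rows whose query name (first column) is not yet present in the annotations file, which is why both inputs are sorted first. As an optional sanity check after the merge (just a sketch; it assumes the query name is the first column of the tab-separated annotations file):

# should print nothing: no query annotated twice after merging
grep -v "^#" all.emapper.annotations | cut -f 1 | sort | uniq -d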

I hope this helps.

Best,
Carlos

@li0604
Author

li0604 commented Nov 24, 2020 via email

@Cantalapiedra
Collaborator

Glad to hear that.
Let's see if we can implement the --resume option for annotations anytime soon.

Best,
Carlos

@li0604
Author

li0604 commented Nov 24, 2020 via email

@Sofie8

Sofie8 commented Jan 9, 2021

Hi, I am also in favour of a --resume option for the annotation step. What is the --resume option currently 'resuming'? I thought it was resuming the annotation step, but after 72 hours of runtime I exceeded my walltime, and on restarting it just erased everything :-( (and goodbye to my computing credits...). I am using emapper.py as implemented in atlas.

We split into subsets of 500,000, but on a single machine with 36 threads and 198 GB of memory, 72 hours is not enough to finish one subset... Does it scale well with more threads, or do you have a suggestion for how best to set up my jobs? I also have one big-mem node available (36 threads, 760 GB RAM), or an AMD machine (64 nodes, 256 GB RAM). Thanks!

@Cantalapiedra
Collaborator

Hi @Sofie8 ,

Sorry to hear that about your computing credits. I am not sure what atlas is. The --resume option is a somewhat old option used for hmmer searches; there is no actual resume option for the diamond, mmseqs or annotation steps.

Besides that, when running large datasets I would recommend not only splitting the dataset, but also splitting the emapper steps. Not sure if you are doing that already. It would be something like this (depending on the emapper version):

emapper -m diamond -i input.fasta -o test --output_dir outdir
emapper -m no_search --annotate_hits_table outdir/test.emapper.seed_orthologs -o test --output_dir outdir
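If a single annotation job is still too long for your walltime, one option (not built into emapper, just a sketch) is to chunk the seed orthologs table and annotate each chunk as a separate job:

# split the search results into chunks of 100,000 hits (chunk size is arbitrary)
grep -v "^#" outdir/test.emapper.seed_orthologs | split -l 100000 - outdir/chunk_
for f in outdir/chunk_*; do
    emapper -m no_search --annotate_hits_table "$f" -o "$(basename "$f")" --output_dir outdir
done
# then concatenate the per-chunk *.emapper.annotations files into a single table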

The more threads, the faster it usually is. Also, on the nodes with 256GB or more you could use -m mmseqs instead of -m diamond, which should be faster, if your emapper version includes the mmseqs option.

Also, in the latest versions there is an option to load the annotation DB into memory (--dbmem), which should speed up the annotation step quite a bit. See https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2-*refactor*#Annotation_Options
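For example (flag availability depends on the emapper version; file names are the same placeholders as above):

emapper -m mmseqs -i input.fasta -o test --output_dir outdir
emapper -m no_search --annotate_hits_table outdir/test.emapper.seed_orthologs -o test --output_dir outdir --dbmem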

Best,
Carlos

@Sofie8

Sofie8 commented Jan 13, 2021

Hi @Cantalapiedra ,

Thanks for your answer!

Yes, atlas is the metagenome analysis pipeline from Silas: https://github.com/metagenome-atlas/atlas/issues/351

I have now run the eggnog annotation step outside of atlas:
emapper.py --annotate_hits_table Genecatalog/subsets/genes/remaining_seed_orthologs \
    --no_file_comments --resume -o Genecatalog/subsets/genes/subset2 --cpu 36 \
    --data_dir /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2 2>> Genecatalog/subsets/genes/logs/subset2/eggNOG_annotate_hits_table.log

following this to split the file and join the results back together:
join -v 1 -t $'\t' <(grep -v "^#" subset2.emapper.seed_orthologs | sort) <(grep -v "^#" subset2.emapper.annotations | sort) | cut -f 1-4 > remaining_seed_orthologs
emapper.py -m no_search --annotate_hits_table remaining_seed_orthologs -o remaining ...
cat output_file.emapper.annotations remaining.emapper.annotations > all.emapper.annotations

But after cat, combine_egg_nogg_annotations says:
Error Expected 22 fields in line 320861, saw 64

Is cat doing something with a blank line between the two files, or putting things on one line?
It is exactly at the point where the two files merge that it fails...
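One way to check that (just a sketch; the line number is taken from the error above, and a missing trailing newline in the first file would make cat glue its last record to the first record of the second file):

sed -n '320860,320862p' all.emapper.annotations | cat -A
tail -c 1 output_file.emapper.annotations | od -c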

Ok for the other suggestions!
@SilasK can this be useful for further improving the genecatalog step? Note also that --resume is not actually resuming, so if the annotation step breaks in atlas, or does not complete, it starts the annotation all over again. So in my case I have to split into subsets smaller than 500,000 to finish within 3 days (36 threads, 198 GB RAM).
In the latest release there are also:

  • the option --mmseqs
  • the option --dbmem

Best,
Sofie

@SilasK
Contributor

SilasK commented Jan 14, 2021

See my response on: metagenome-atlas/atlas#351

@Cantalapiedra
Collaborator

Cantalapiedra commented Jan 14, 2021

Hi,

just a reminder that the --dbmem option needs around 40GB of free memory, that using mmseqs requires downloading the corresponding eggnog-mapper mmseqs database (using the download script), and that the --mmseqs option itself requires a lot of memory to run.
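For reference, the mmseqs database is fetched with the bundled download script; in recent versions it is something like the following (the exact flag may differ between versions, so check download_eggnog_data.py --help):

download_eggnog_data.py -M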

Therefore, in both cases it is recommended to run fewer jobs with more sequences per job, and to set the number of jobs per computer or cluster node according to the available memory.

Best,
Carlos

@SilasK
Contributor

SilasK commented Jan 14, 2021

Is the mmseqs version already officially released? I didn't know that.
How much memory does mmseqs use? Do you use profiles or search mode?

If I'm not mistaken, during emapper.py --annotate_hits_table you don't use mmseqs, do you?

@Cantalapiedra
Collaborator

Hi,

MMseqs can be used for the search step in the "refactor" branch, which we hope to merge soon with the "master" one. https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2-*refactor*

So far we are using search mode. To be honest, I don't know exactly how much (peak) memory it uses. I am currently running some jobs on nodes with 236GB, and that seems to be enough. For less than 200GB I would use diamond, or hmmer in server mode.

You are right, --annotate_hits_table (along with -m no_search in the refactor version) is used to run the annotation step without running the previous search step, so no MMseqs (nor diamond) is involved there. The --dbmem option is used during the annotation step though: using a bit less than 40GB of memory, it loads the sqlite3 DB into memory before annotating (which can be a convenient replacement for using /dev/shm).
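For completeness, the /dev/shm approach I mean is roughly this (paths and prefixes are placeholders):

cp /path/to/eggnog_data/eggnog.db /dev/shm/
emapper.py -m no_search --annotate_hits_table test.emapper.seed_orthologs -o test --data_dir /dev/shm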

Best,
Carlos

@Cantalapiedra
Collaborator

--resume currently resumes most of the emapper stages, since version 2.1.0.
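For example, re-running the interrupted command with the same output prefix and output directory plus --resume should pick up the parts that already finished (names below are placeholders):

emapper.py -m diamond -i input.fasta -o test --output_dir outdir --resume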
