Continue from where it broke off when running emapper.py #249
Hi @li0604 ,
From which step would you like to resume?
Dear professor,
I am glad to hear from you. The annotation step was interrupted. Could it be resumed?
Thanks a lot!
Yours,
Qingmei
On 2020-11-20 15:14:09, "Carlos P Cantalapiedra" <notifications@github.com> wrote:
Hi @li0604 ,
From which step would you like to resume?
If the search finished entirely you could run only the annotation step with -m no_search --annotate_hits_table seed_orthologs.file
Hi Qingmei, what are the contents of your output folder?
It is in this form (part of the contents; one line of the annotations table, shown here with "|" separating the tab-delimited fields):
k10_contig_1_1 | 335992.SAR11_0510 | 5.40E-19 | 100.1 | unclassified | Alphaproteobacteria | glc | 2.3.3.9 | ko:K01638 | ko00620,ko00630,ko01100,ko01110,ko01120,ko01200,map00620,map00630,map01100,map01110,map01120,map01200 | M00012 | R00472 | RC00004,RC00308,RC02747 | ko00000,ko00001,ko00002,ko01000 | Bacteria | 1MVEV@1224,2TS9R@28211,4PUK@82117,COG2225@1,COG2225@2 | NA|NA|NA | C | Involved in the glycolate utilization. Catalyzes the condensation and subsequent hydrolysis of acetyl-coenzyme A (acetyl-CoA) and glyoxylate to form malate and CoA
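For reference, each such record is one tab-separated line of the .emapper.annotations output. A quick way to check how far an interrupted annotation run got is to count the non-comment lines. This is a sketch on a toy file; the real file name will match your -o prefix:

```shell
# Toy annotations file standing in for the interrupted emapper output:
# one '#' comment line plus two annotated queries
printf '# header\nk10_contig_1_1\t335992.SAR11_0510\nk10_contig_1_2\t335992.SAR11_0511\n' > output_file.emapper.annotations

# Count how many queries were annotated before the break
grep -vc '^#' output_file.emapper.annotations   # prints 2
```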
Hi,
unfortunately there is no way to directly resume the annotation step, although it would be a nice feature to implement.
With current versions you could:
1. Just re-run the annotation step using "-m no_search --annotate_hits_table output_file.emapper.annotations".
2. Create a new seed orthologs file removing the entries which are already within output_file.emapper.annotations, and run the previous command with just those entries. You could try something like:
join -v 1 -t $'\t' <(grep -v "^#" seed_orthologs_file | sort) <(grep -v "^#" annotations_file | sort) | cut -f 1-4 > remaining_seed_orthologs
emapper.py -m no_search --annotate_hits_table remaining_seed_orthologs -o remaining ...
cat output_file.emapper.annotations remaining.emapper.annotations > all.emapper.annotations
rm remaining_seed_orthologs remaining.emapper.annotations output_file.emapper.annotations
I hope this helps.
Best,
Carlos

Dear professor,
That's very kind of you. I think this suggestion will be helpful for me.
Thanks a lot!
Yours sincerely,
Qingmei
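The join-based workaround above can be tried on toy data first. This sketch uses hypothetical query names and a 4-column seed orthologs layout purely for illustration; `join -v 1` keeps the seed-ortholog lines whose first field has no match in the annotations file:

```shell
# Toy seed orthologs file: 3 queries, tab-separated, 4 columns
printf '# header\nq1\thitA\t1e-10\t50\nq2\thitB\t1e-09\t40\nq3\thitC\t1e-08\t30\n' > seed_orthologs_file
# Toy annotations file: only q1 and q3 were annotated before the run broke off
printf '# header\nq1\thitA\tstuff\nq3\thitC\tstuff\n' > annotations_file

# Keep only the seed-ortholog lines whose query (field 1) has no annotation yet
join -v 1 -t $'\t' \
  <(grep -v "^#" seed_orthologs_file | sort) \
  <(grep -v "^#" annotations_file | sort) \
  | cut -f 1-4 > remaining_seed_orthologs

cat remaining_seed_orthologs   # only the q2 line remains
```

The resulting remaining_seed_orthologs file is what would then be fed to `emapper.py -m no_search --annotate_hits_table`.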
Glad to hear that.
Let's see if we can implement the --resume option for annotations anytime soon.
Best,
Carlos

If the --resume option can be realized, that would be wonderful.
Best,
Qingmei.
Hi, I am also in favour of a --resume option for the annotations step. What is the --resume option currently 'resuming'? I thought it was resuming the annotation step, but after 72 hours of run time I exceeded my walltime, and on restarting it just erased everything :-( (and goodbye to my computing credits..). I am using emapper.py as implemented in atlas.
We split the input into subsets of 500,000 sequences, but on a single machine with 36 threads and 198 GB of memory, 72 hours is not enough to finish one subset... Does it scale well with more threads, or do you have a suggestion for how I should best specify my jobs? I also have one big-memory node available (36 threads, 760 GB RAM), and an AMD machine (64 cores, 256 GB RAM). Thanks!
Hi @Sofie8 , sorry to hear that about your computing credits. I am not sure what atlas is.
The --resume option is a somewhat old option used for hmmer searches. There is no actual resume option for the diamond, mmseqs or annotation steps.
Besides that, I would recommend not only splitting the dataset, but also splitting the emapper steps, when running large datasets. Not sure if you are doing it already. It would be something like (depending on emapper version): emapper -m diamond -i input.fasta -o test --output_dir outdir
The more threads, the faster it is, usually. Also, on the nodes with 256GB or more you could use -m mmseqs instead of -m diamond, which should be faster, if your emapper version includes the mmseqs option. Also, in the latest versions there is an option to load the annotation DB into memory (--dbmem), which should speed up the annotation step quite a bit. See https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2-*refactor*#Annotation_Options
Best,
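The split-the-dataset idea can be sketched as follows. The awk splitter is illustrative (chunk size 2 for the toy file; use e.g. 500,000 in practice), and the emapper commands are only echoed here rather than executed, since the exact flags depend on the emapper version installed:

```shell
# Write a toy FASTA, then split it into chunks of 2 sequences each
printf '>s1\nAAA\n>s2\nCCC\n>s3\nGGG\n>s4\nTTT\n>s5\nACG\n' > input.fasta
awk '/^>/{ if (n % 2 == 0) { file = sprintf("chunk_%d.fasta", n/2) } n++ } { print > file }' input.fasta

# For each chunk, run the search step, then the annotation step (commands
# echoed only; flags and output names may differ between emapper versions)
for f in chunk_*.fasta; do
  echo "emapper.py -m diamond --no_annot -i $f -o ${f%.fasta}"
  echo "emapper.py -m no_search --annotate_hits_table ${f%.fasta}.emapper.seed_orthologs -o ${f%.fasta}"
done
```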
Hi @Cantalapiedra , thanks for your answer! Yes, atlas is the metagenome analysis pipeline from Silas: https://github.com/metagenome-atlas/atlas/issues/351
I did the eggnog annotation step outside of atlas now, following this to split the file and join the results back together: But after cat, combine_egg_nogg_annotations says: Is cat doing something with a blank line between the two files, or putting things on one line? OK for the other suggestions!
Best,
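One likely culprit when `cat`-ing per-chunk emapper outputs is the `#` comment lines each file starts with, which plain `cat` leaves in the middle of the combined file. A sketch of combining two chunk outputs while keeping only the first file's comment lines (toy files with hypothetical contents; real outputs have more columns):

```shell
# Two toy per-chunk annotation files, each with emapper-style '#' comment lines
printf '# emapper version: x\n#query\tannotation\nq1\tA\n' > chunk1.emapper.annotations
printf '# emapper version: x\n#query\tannotation\nq2\tB\n' > chunk2.emapper.annotations

# Keep the comment lines from the first file only; strip them from the rest
{ cat chunk1.emapper.annotations; grep -v '^#' chunk2.emapper.annotations; } > all.emapper.annotations

cat all.emapper.annotations   # two '#' lines, then the q1 and q2 rows
```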
See my response on: metagenome-atlas/atlas#351
Hi, just a reminder that the --dbmem option needs around 40GB of free memory, and that using mmseqs requires downloading the corresponding eggnog-mapper mmseqs database (using the download script); that option (-m mmseqs) also requires a lot of memory to run. Therefore, in both cases it is recommended to run fewer jobs with more sequences per job, and to set the number of jobs per computer or cluster node according to the memory available.
Best,
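The jobs-per-node arithmetic implied above can be made explicit. The numbers are illustrative, assuming ~40GB of memory per --dbmem job:

```shell
# How many concurrent --dbmem jobs fit on one node, assuming ~40GB each?
node_mem_gb=256
mem_per_job_gb=40
jobs=$(( node_mem_gb / mem_per_job_gb ))   # integer division: 6
echo "A ${node_mem_gb}GB node fits about $jobs concurrent --dbmem jobs"
```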
Is the mmseqs version already officially released? I didn't know that. If I'm not mistaken, during the
Hi,
MMseqs can be used for the search step in the "refactor" branch, which we hope to merge soon with the "master" one: https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2-*refactor*
So far we are using search mode. To be honest, I don't know exactly how much (peak) memory it uses. I am currently running some jobs on nodes with 236GB, and that seems to be enough. For less than 200GB I would use diamond, or hmmer in server mode.
You are right: --annotate_hits_table (along with -m no_search in the refactor version) is used to run the annotation step without running the previous search step, so no MMseqs (nor diamond) is involved there. The --dbmem option is used during the annotation step though; it loads the sqlite3 DB into memory (around 40GB) before annotating, which could be convenient to replace the use of /dev/shm.
Best,
--resume currently resumes most of the emapper stages, since version 2.1.0.
Hi everyone,
Because the input file is large, I wonder whether I could add some parameters to continue from where the run broke off.