Clarify output if spades is killed: error code -9 #76

fungs · 2018-02-02T10:08:56Z

Hi,

I ran metaspades.py (v3.11.1, Linux x86_46 pre-compiled) with for a regular Illumina paired-end library but assembly does not run through. Judging from the error message, it seems not to be an issue with memory. If it is, it needs to be clarified!

metaspades.py -o assembly_metaspades -1 r1.clean.fq.gz -2 r2.clean.fq.gz -s s.clean.fq.gz --phred-offset 33 -t 34 -m 250

This is the shortened output:

  7:14:10.214   171G / 173G  INFO   StageManager             (stage.cpp                 : 126)   STAGE == Preliminary Distance Estimation
  7:14:10.214   171G / 173G  INFO    General                 (distance_estimation.cpp   : 224)   Processing library #0
  7:14:10.214   171G / 173G  INFO    General                 (distance_estimation.cpp   : 130)   Weight Filter Done
  7:14:10.214   171G / 173G  INFO   DistanceEstimator        (distance_estimation.hpp   : 116)   Using SIMPLE distance estimator
  9:17:49.559   185G / 193G  INFO    General                 (distance_estimation.cpp   :  35)   Filtering info
  9:17:49.559   185G / 193G  INFO    General                 (pair_info_filters.hpp     : 242)   Start filtering; index size: 579977625
  9:22:57.444   200G / 209G  INFO    General                 (pair_info_filters.hpp     : 263)   Done filtering
  9:22:57.944   184G / 209G  INFO    General                 (distance_estimation.cpp   : 172)   Refining clustered pair information
  9:24:50.102   184G / 209G  INFO    General                 (distance_estimation.cpp   : 174)   The refining of clustered pair information has been finished
  9:24:50.102   184G / 209G  INFO    General                 (distance_estimation.cpp   : 176)   Improving paired information
  9:41:24.351   187G / 219G  INFO   PairInfoImprover         (pair_info_improver.hpp    : 103)   Paired info stats: missing = 38388342; contradictional = 0
 10:00:26.002   188G / 225G  INFO   PairInfoImprover         (pair_info_improver.hpp    : 103)   Paired info stats: missing = 6099975; contradictional = 0
 10:00:26.002   188G / 225G  INFO    General                 (distance_estimation.cpp   : 183)   Filling scaffolding index
 10:00:26.002   188G / 225G  INFO   DistanceEstimator        (distance_estimation.hpp   : 116)   Using SMOOTHING distance estimator
 12:21:36.894   192G / 225G  INFO    General                 (distance_estimation.cpp   :  35)   Filtering info
 12:21:36.905   192G / 225G  INFO    General                 (pair_info_filters.hpp     : 242)   Start filtering; index size: 97970244
 12:23:04.656   196G / 225G  INFO    General                 (pair_info_filters.hpp     : 263)   Done filtering
 12:23:05.117   192G / 225G  INFO    General                 (distance_estimation.cpp   : 228)   Clearing raw paired index


== Error ==  system call for: "['spades', 'K55/configs/config.info', K55/configs/mda_mode.info', 'K55/configs/meta_mode.info']" finished abnormally, err code: -9

So it seems to fail for the large k-mer size 55. What is the meaning of the error code -9?

Thanks!

The text was updated successfully, but these errors were encountered:

asl · 2018-02-02T10:10:43Z

The place looks weird. Does the problem reproduce when you re-run the SPAdes job?

fungs · 2018-02-02T12:03:11Z

yes, I did, same thing

asl · 2018-02-02T13:54:36Z

Ok. Most probably we'll need the data to reproduce and fix the issue. Will it be possible for you to share it with us?

fungs · 2018-02-02T15:17:40Z

Sorry, I don't think I can share these data at this point. I will try to run it on another machine, see if it results in the same error. It must have to do with data size, as I assembled splits of one fourth successfully.

Let me see if I can get a stripped version to reproduce the error. Thanks for your prompt responses, you are really motivated there! You don't see that very often on academic open-source projects.

fungs · 2018-02-02T15:18:15Z

I coudn't find out what the exit code means by digging through the source code, maybe you can!

asl · 2018-02-02T19:44:56Z

Here the process was killed (-9 is SIGKILL signal). The biggest question is the reason. Do you happen to have enough free RAM available?

Momosandy · 2018-02-04T21:04:19Z

Sames issue with hammer

fungs · 2018-02-05T10:18:13Z

I went through the (user-readable) system log but couldn't find anything about killing the process. Overall, this is what I suspected first. I will try to run in on different hardware to see if that runs through.

Momosandy · 2018-02-05T10:22:21Z

Thank you Johannes, What is the limit I am allowed in terms of number of thread? and CPU on Abel? How much RAM am I allocated on Abel? How could I get some more memory in order to run this analysis? Sundy Maurice Researcher Mycology Group Section for Genetics and Evolutionary Biology University of Oslo Norway sandym@ibv.uio.no<mailto:sandym@ibv.uio.no> +47-22845918 On 5 Feb 2018, at 11:18, Johannes Dröge <notifications@github.com<mailto:notifications@github.com>> wrote: I went through the (user-readable) system log but couldn't find anything about killing the process. Overall, this is what I suspected first. I will try to run in on different hardware to see if that runs through. — You are receiving this because you commented. Reply to this email directly, view it on GitHub<#76 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AYTZpmIgx_vUlqZ-U-sukcAvuVFu8SXPks5tRtVlgaJpZM4R28-6>.

fungs · 2018-02-05T10:29:18Z

@asl: I ran again on this hardware and could find an out-of-memory kill in the log, + it stopped in another step. I guess the only thing remaining for the development side is to better report such issues and communicate to the user

@Momosandy: Hi, I don't know the limits for your specific system, you need to talk to your tech support. On cluster systems, there are usually different queues and options to allocation resources like RAM. If it crashes on your system, try to run the following to verify:

dmesg -T

It should give you a long list of kernel messages, among them should be an entry such as in my case:

[Sa Feb  3 00:21:41 2018] Out of memory: Kill process 136340 (spades) score 270 or sacrifice child

Momosandy · 2018-02-05T10:32:32Z

Hi Johannes, thank you for testing it on your side. @ ANton: Any idea? Sundy Maurice Researcher Mycology Group Section for Genetics and Evolutionary Biology University of Oslo Norway sandym@ibv.uio.no<mailto:sandym@ibv.uio.no> +47-22845918 On 5 Feb 2018, at 11:29, Johannes Dröge <notifications@github.com<mailto:notifications@github.com>> wrote: @asl<https://github.com/asl>: I ran again on this hardware and could find an out-of-memory kill in the log, + it stopped in another step. I guess the only thing remaining for the development side is to better report such issues and communicate to the user @Momosandy<https://github.com/momosandy>: Hi, I don't know the limits for your specific system, you need to talk to your tech support. On cluster systems, there are usually different queues and options to allocation resources like RAM. If it crashes on your system, try to run the following to verify: dmesg -T It should give you a long list of kernel messages, among them should be an entry such as in my case: [Sa Feb 3 00:21:41 2018] Out of memory: Kill process 136340 (spades) score 270 or sacrifice child — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#76 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AYTZpgmgWRDEFg9kzq7i1YuXnnoeDR4Tks5tRtf_gaJpZM4R28-6>.

asl · 2018-02-05T15:18:19Z

@Momosandy I have no idea about your cluster configuration, you'd better ask your system administrator

fungs · 2018-02-07T12:10:45Z

I changed the title, please consider changing the handling of signals so the user knows what is going on. In my case, the process was killed by the system due to high memory usage.

asl · 2018-02-07T12:14:39Z

@fungs Unfortunately, there is no way to tell that the process was killed due to OOM killer. We're only seeing SIGKILL signal. Though when we know the reason (e.g. the system failed to allocate RAM for us and let us know this) we do report it.

fungs · 2018-02-07T15:54:52Z

yes, report that the process was killed!

fungs · 2018-02-08T10:33:18Z

Maybe you know what error code -9 means but since error codes are not standardized and not all users on a system have access to system logs, it would be much better to point the user to the source of the error as best as possible (also to avoid unnecessary issues on github).

mmokrejs · 2018-02-09T11:03:57Z

I agree SPAdes should colect dmesg -T in such cases. In a clustered environment, it is hard to get access to the same computational node after the job has exited. I wouldn't mind if output of free command was captured too.

fungs · 2018-02-09T15:39:16Z

@mmokrejs: that would probably go beyond what most programs do (and it is quite system-specific). But it would be good to give a clear message that the process was killed and by what signal. That is often enough to figure out what might have gone wrong.

mmokrejs · 2018-02-09T16:56:54Z

It does not matter if any of these fails on an obscure platform or not. It is just the very right moment to capture some system info in this situation. Many users are not aware of syslog files and dmesg and as I said, it is quite an effort to get to them after the job finished.

asl · 2018-02-12T03:37:39Z

We cannot do this reliably - on many systems reading kernel log (aka dmesg) would require root privileges. We certainly do not want to handle errors in error path (double errors are nightmare to handle), instead we would like error path to be as straightforward as possible. Users could check the result code and do whatever state capturing they want, e.g. run dmesg, free, df, squeue, etc. – whatever diagnostic tools are available on the system.

mmokrejs · 2018-02-12T17:14:24Z

As I said, it is hard to get access back to the computational node, it could have been even rebooted meanwhile. I do not mind if execuition of e.g. squeue fails as it is not available on the system, my point is: try to get any info which is available.

fungs changed the title ~~Finished abnormally with error code -9~~ Clarify output if spades is killed: error code -9 Feb 7, 2018

Docolli13 mentioned this issue Mar 8, 2018

finished abnormally err: -9 #89

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify output if spades is killed: error code -9 #76

Clarify output if spades is killed: error code -9 #76

fungs commented Feb 2, 2018

asl commented Feb 2, 2018

fungs commented Feb 2, 2018

asl commented Feb 2, 2018

fungs commented Feb 2, 2018

fungs commented Feb 2, 2018

asl commented Feb 2, 2018

Momosandy commented Feb 4, 2018

fungs commented Feb 5, 2018

Momosandy commented Feb 5, 2018 via email

fungs commented Feb 5, 2018

Momosandy commented Feb 5, 2018 via email

asl commented Feb 5, 2018

fungs commented Feb 7, 2018

asl commented Feb 7, 2018

fungs commented Feb 7, 2018

fungs commented Feb 8, 2018

mmokrejs commented Feb 9, 2018 •

edited

Loading

fungs commented Feb 9, 2018

mmokrejs commented Feb 9, 2018 •

edited

Loading

asl commented Feb 12, 2018

mmokrejs commented Feb 12, 2018

Clarify output if spades is killed: error code -9 #76

Clarify output if spades is killed: error code -9 #76

Comments

fungs commented Feb 2, 2018

asl commented Feb 2, 2018

fungs commented Feb 2, 2018

asl commented Feb 2, 2018

fungs commented Feb 2, 2018

fungs commented Feb 2, 2018

asl commented Feb 2, 2018

Momosandy commented Feb 4, 2018

fungs commented Feb 5, 2018

Momosandy commented Feb 5, 2018 via email

fungs commented Feb 5, 2018

Momosandy commented Feb 5, 2018 via email

asl commented Feb 5, 2018

fungs commented Feb 7, 2018

asl commented Feb 7, 2018

fungs commented Feb 7, 2018

fungs commented Feb 8, 2018

mmokrejs commented Feb 9, 2018 • edited Loading

fungs commented Feb 9, 2018

mmokrejs commented Feb 9, 2018 • edited Loading

asl commented Feb 12, 2018

mmokrejs commented Feb 12, 2018

mmokrejs commented Feb 9, 2018 •

edited

Loading

mmokrejs commented Feb 9, 2018 •

edited

Loading