Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify output if spades is killed: error code -9 #76

Open
fungs opened this issue Feb 2, 2018 · 21 comments
Open

Clarify output if spades is killed: error code -9 #76

fungs opened this issue Feb 2, 2018 · 21 comments

Comments

@fungs
Copy link

fungs commented Feb 2, 2018

Hi,

I ran metaspades.py (v3.11.1, Linux x86_46 pre-compiled) with for a regular Illumina paired-end library but assembly does not run through. Judging from the error message, it seems not to be an issue with memory. If it is, it needs to be clarified!

metaspades.py -o assembly_metaspades -1 r1.clean.fq.gz -2 r2.clean.fq.gz -s s.clean.fq.gz --phred-offset 33 -t 34 -m 250

This is the shortened output:

  7:14:10.214   171G / 173G  INFO   StageManager             (stage.cpp                 : 126)   STAGE == Preliminary Distance Estimation
  7:14:10.214   171G / 173G  INFO    General                 (distance_estimation.cpp   : 224)   Processing library #0
  7:14:10.214   171G / 173G  INFO    General                 (distance_estimation.cpp   : 130)   Weight Filter Done
  7:14:10.214   171G / 173G  INFO   DistanceEstimator        (distance_estimation.hpp   : 116)   Using SIMPLE distance estimator
  9:17:49.559   185G / 193G  INFO    General                 (distance_estimation.cpp   :  35)   Filtering info
  9:17:49.559   185G / 193G  INFO    General                 (pair_info_filters.hpp     : 242)   Start filtering; index size: 579977625
  9:22:57.444   200G / 209G  INFO    General                 (pair_info_filters.hpp     : 263)   Done filtering
  9:22:57.944   184G / 209G  INFO    General                 (distance_estimation.cpp   : 172)   Refining clustered pair information
  9:24:50.102   184G / 209G  INFO    General                 (distance_estimation.cpp   : 174)   The refining of clustered pair information has been finished
  9:24:50.102   184G / 209G  INFO    General                 (distance_estimation.cpp   : 176)   Improving paired information
  9:41:24.351   187G / 219G  INFO   PairInfoImprover         (pair_info_improver.hpp    : 103)   Paired info stats: missing = 38388342; contradictional = 0
 10:00:26.002   188G / 225G  INFO   PairInfoImprover         (pair_info_improver.hpp    : 103)   Paired info stats: missing = 6099975; contradictional = 0
 10:00:26.002   188G / 225G  INFO    General                 (distance_estimation.cpp   : 183)   Filling scaffolding index
 10:00:26.002   188G / 225G  INFO   DistanceEstimator        (distance_estimation.hpp   : 116)   Using SMOOTHING distance estimator
 12:21:36.894   192G / 225G  INFO    General                 (distance_estimation.cpp   :  35)   Filtering info
 12:21:36.905   192G / 225G  INFO    General                 (pair_info_filters.hpp     : 242)   Start filtering; index size: 97970244
 12:23:04.656   196G / 225G  INFO    General                 (pair_info_filters.hpp     : 263)   Done filtering
 12:23:05.117   192G / 225G  INFO    General                 (distance_estimation.cpp   : 228)   Clearing raw paired index


== Error ==  system call for: "['spades', 'K55/configs/config.info', K55/configs/mda_mode.info', 'K55/configs/meta_mode.info']" finished abnormally, err code: -9

So it seems to fail for the large k-mer size 55. What is the meaning of the error code -9?

Thanks!

@asl
Copy link
Member

asl commented Feb 2, 2018

The place looks weird. Does the problem reproduce when you re-run the SPAdes job?

@fungs
Copy link
Author

fungs commented Feb 2, 2018

yes, I did, same thing

@asl
Copy link
Member

asl commented Feb 2, 2018

Ok. Most probably we'll need the data to reproduce and fix the issue. Will it be possible for you to share it with us?

@fungs
Copy link
Author

fungs commented Feb 2, 2018

Sorry, I don't think I can share these data at this point. I will try to run it on another machine, see if it results in the same error. It must have to do with data size, as I assembled splits of one fourth successfully.

Let me see if I can get a stripped version to reproduce the error. Thanks for your prompt responses, you are really motivated there! You don't see that very often on academic open-source projects.

@fungs
Copy link
Author

fungs commented Feb 2, 2018

I coudn't find out what the exit code means by digging through the source code, maybe you can!

@asl
Copy link
Member

asl commented Feb 2, 2018

Here the process was killed (-9 is SIGKILL signal). The biggest question is the reason. Do you happen to have enough free RAM available?

@Momosandy
Copy link

Sames issue with hammer

@fungs
Copy link
Author

fungs commented Feb 5, 2018

I went through the (user-readable) system log but couldn't find anything about killing the process. Overall, this is what I suspected first. I will try to run in on different hardware to see if that runs through.

@Momosandy
Copy link

Momosandy commented Feb 5, 2018 via email

@fungs
Copy link
Author

fungs commented Feb 5, 2018

@asl: I ran again on this hardware and could find an out-of-memory kill in the log, + it stopped in another step. I guess the only thing remaining for the development side is to better report such issues and communicate to the user

@Momosandy: Hi, I don't know the limits for your specific system, you need to talk to your tech support. On cluster systems, there are usually different queues and options to allocation resources like RAM. If it crashes on your system, try to run the following to verify:

dmesg -T

It should give you a long list of kernel messages, among them should be an entry such as in my case:

[Sa Feb  3 00:21:41 2018] Out of memory: Kill process 136340 (spades) score 270 or sacrifice child

@Momosandy
Copy link

Momosandy commented Feb 5, 2018 via email

@asl
Copy link
Member

asl commented Feb 5, 2018

@Momosandy I have no idea about your cluster configuration, you'd better ask your system administrator

@fungs fungs changed the title Finished abnormally with error code -9 Clarify output if spades is killed: error code -9 Feb 7, 2018
@fungs
Copy link
Author

fungs commented Feb 7, 2018

I changed the title, please consider changing the handling of signals so the user knows what is going on. In my case, the process was killed by the system due to high memory usage.

@asl
Copy link
Member

asl commented Feb 7, 2018

@fungs Unfortunately, there is no way to tell that the process was killed due to OOM killer. We're only seeing SIGKILL signal. Though when we know the reason (e.g. the system failed to allocate RAM for us and let us know this) we do report it.

@fungs
Copy link
Author

fungs commented Feb 7, 2018

yes, report that the process was killed!

@fungs
Copy link
Author

fungs commented Feb 8, 2018

Maybe you know what error code -9 means but since error codes are not standardized and not all users on a system have access to system logs, it would be much better to point the user to the source of the error as best as possible (also to avoid unnecessary issues on github).

@mmokrejs
Copy link

mmokrejs commented Feb 9, 2018

I agree SPAdes should colect dmesg -T in such cases. In a clustered environment, it is hard to get access to the same computational node after the job has exited. I wouldn't mind if output of free command was captured too.

@fungs
Copy link
Author

fungs commented Feb 9, 2018

@mmokrejs: that would probably go beyond what most programs do (and it is quite system-specific). But it would be good to give a clear message that the process was killed and by what signal. That is often enough to figure out what might have gone wrong.

@mmokrejs
Copy link

mmokrejs commented Feb 9, 2018

It does not matter if any of these fails on an obscure platform or not. It is just the very right moment to capture some system info in this situation. Many users are not aware of syslog files and dmesg and as I said, it is quite an effort to get to them after the job finished.

@asl
Copy link
Member

asl commented Feb 12, 2018

We cannot do this reliably - on many systems reading kernel log (aka dmesg) would require root privileges. We certainly do not want to handle errors in error path (double errors are nightmare to handle), instead we would like error path to be as straightforward as possible. Users could check the result code and do whatever state capturing they want, e.g. run dmesg, free, df, squeue, etc. – whatever diagnostic tools are available on the system.

@mmokrejs
Copy link

As I said, it is hard to get access back to the computational node, it could have been even rebooted meanwhile. I do not mind if execuition of e.g. squeue fails as it is not available on the system, my point is: try to get any info which is available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants