Clarify output if spades is killed: error code -9 #76
Comments
The place looks weird. Does the problem reproduce when you re-run the SPAdes job?
yes, I did, same thing
Ok. Most probably we'll need the data to reproduce and fix the issue. Would it be possible for you to share it with us?
Sorry, I don't think I can share these data at this point. I will try to run it on another machine to see if it results in the same error. It must have to do with data size, as I assembled splits of one fourth successfully. Let me see if I can get a stripped-down version to reproduce the error. Thanks for your prompt responses, you are really motivated there! You don't see that very often in academic open-source projects.
I couldn't find out what the exit code means by digging through the source code, maybe you can!
Here the process was killed (-9 is the SIGKILL signal). The biggest question is the reason. Do you happen to have enough free RAM available?
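For background, a negative return code in this convention means the child process died from a signal, so -9 is SIGKILL. A minimal Python sketch (illustrative only, not SPAdes code) shows the mapping:

```python
import signal
import subprocess

# Illustration only: a child killed by a signal is reported by
# subprocess as a NEGATIVE return code, so -9 means signal 9 (SIGKILL).
# Here the child shell kills itself with SIGKILL.
proc = subprocess.run(["sh", "-c", "kill -9 $$"])
print(proc.returncode)                        # -9
print(signal.Signals(-proc.returncode).name)  # SIGKILL
```

The same convention applies to any process launched via `subprocess`, which is why the SPAdes pipeline log shows "error code -9" when the kernel kills a stage.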
Same issue with hammer
I went through the (user-readable) system log but couldn't find anything about killing the process. Overall, this is what I suspected first. I will try to run it on different hardware to see if that runs through.
Thank you Johannes,
What is the limit on the number of threads and CPUs I am allowed on Abel? How much RAM am I allocated on Abel?
How could I get some more memory in order to run this analysis?
Sundy Maurice
Researcher
Mycology Group
Section for Genetics and Evolutionary Biology
University of Oslo
Norway
sandym@ibv.uio.no
+47-22845918
@asl: I ran again on this hardware and could find an out-of-memory kill in the log, and it stopped in another step. I guess the only thing remaining on the development side is to better report such issues and communicate them to the user. @Momosandy: Hi, I don't know the limits for your specific system, you need to talk to your tech support. On cluster systems, there are usually different queues and options to allocate resources like RAM. If it crashes on your system, try to run the following to verify:
dmesg -T
It should give you a long list of kernel messages, among them should be an entry such as in my case:
[Sa Feb 3 00:21:41 2018] Out of memory: Kill process 136340 (spades) score 270 or sacrifice child
Hi Johannes,
thank you for testing it on your side.
@ Anton: Any idea?
@Momosandy I have no idea about your cluster configuration, you'd better ask your system administrator
I changed the title. Please consider changing the handling of signals so the user knows what is going on. In my case, the process was killed by the system due to high memory usage.
@fungs Unfortunately, there is no way to tell that the process was killed by the OOM killer. We only see the SIGKILL signal. Though when we do know the reason (e.g. the system failed to allocate RAM for us and let us know), we report it.
yes, report that the process was killed!
Maybe you know what error code -9 means, but since error codes are not standardized and not all users on a system have access to system logs, it would be much better to point the user to the source of the error as best as possible (also to avoid unnecessary issues on GitHub).
I agree SPAdes should collect
@mmokrejs: that would probably go beyond what most programs do (and it is quite system-specific). But it would be good to give a clear message that the process was killed and by what signal. That is often enough to figure out what might have gone wrong.
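Such a message could be produced along these lines. This is a hypothetical sketch, not SPAdes code; `report_exit` is an invented helper name:

```python
import signal

def report_exit(returncode: int) -> str:
    # Hypothetical helper: translate a child's return code into a
    # user-facing message, naming the signal if the child was killed.
    if returncode < 0:
        name = signal.Signals(-returncode).name
        hint = ""
        if name == "SIGKILL":
            # SIGKILL is commonly sent by the kernel OOM killer
            hint = " (often the kernel OOM killer; check 'dmesg' for details)"
        return f"process was terminated by signal {name}{hint}"
    return f"process finished with exit code {returncode}"

print(report_exit(-9))
print(report_exit(0))
```

A one-line hint like this would have pointed the reporter straight at the out-of-memory kill without digging through system logs.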
It does not matter whether any of these fails on an obscure platform or not. It is just the right moment to capture some system info in this situation. Many users are not aware of syslog files and dmesg, and as I said, it is quite an effort to get to them after the job has finished.
We cannot do this reliably: on many systems reading the kernel log (aka dmesg) would require root privileges. We certainly do not want to handle errors in the error path (double errors are a nightmare to handle); instead, we would like the error path to be as straightforward as possible. Users could check the result code and do whatever state capturing they want, e.g. run dmesg, free, df, squeue, etc. – whatever diagnostic tools are available on the system.
As I said, it is hard to get access back to the computational node; it could even have been rebooted meanwhile. I do not mind if execution of e.g. squeue fails because it is not available on the system. My point is: try to get any info which is available.
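The "capture whatever is available" idea could look like the following sketch (hypothetical code; `capture_state` is an invented name): tools that are missing are simply skipped, so nothing fails on systems without, say, squeue.

```python
import shutil
import subprocess

def capture_state(tools=("dmesg", "free", "df", "squeue")):
    # Hypothetical post-mortem capture: run each diagnostic tool that
    # happens to be installed and record whatever output it produces.
    snapshot = {}
    for tool in tools:
        if shutil.which(tool) is None:
            continue  # not available on this system, e.g. squeue off-cluster
        result = subprocess.run([tool], capture_output=True, text=True)
        snapshot[tool] = result.stdout
    return snapshot
```

Note this does not resolve the maintainer's objection: a non-root `dmesg` may still produce nothing on restricted systems, in which case its entry is just empty output.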
Hi,
I ran metaspades.py (v3.11.1, Linux x86_64 pre-compiled) for a regular Illumina paired-end library but the assembly does not run through. Judging from the error message, it seems not to be an issue with memory. If it is, it needs to be clarified!
This is the shortened output:
So it seems to fail for the large k-mer size 55. What is the meaning of the error code -9?
Thanks!