Segmentation fault with view (random access) on Lustre FS #219
Hi, Could you try to compile the tool from source? It would be helpful to see the backtrace of a debug build. |
I may be running into similar issues using bcbio that we've not been able to reproduce outside of a specific compute environment (Raijin @ NCI Australia). The biggest hurdle to debugging is not being able to build a debug version from source, as LDC is not available here. While I am trying to have it installed globally, would it be possible to have a current sambamba binary with debug flags set? |
Looking at the stack trace the problem is an interplay of bgzf and the GC on specific machines. @chapmanb also complained about a similar issue. I recently built sambamba with a newer ldc and it may behave differently. @tcezard and @ohofmann: are you able to identify the machine that segfaults so we can reproduce this issue? I can send you sambamba with and without debug support for testing. |
@pjotrp Yes, absolutely -- Brad put together a reproducible example, we just haven't managed to crash it anywhere other than Raijin, and the support team there is still busy trying to install LDC so we can build sambamba from scratch. Happy to give the test version a whirl. |
OK, I'll build it and make it available on Thursday or Friday when I have good internet again. I'll also provide an ldc build using http://lists.gnu.org/archive/html/guix-devel/2017-01/msg01322.html so you can build yourself. |
Using GNU Guix I have created a relocatable version of sambamba with full debug information. Download the tarball from
md5sum is 6eaefc19adcf2dbce60cf18a15faea4a. Unpack the
You can find the sambamba binary in the target dir. To test the backtrace you can trigger a segfault simply with
Please test. Maybe the problem goes away without optimizations. I realise the debug output will be less complete for you because the source paths are missing. The quick workaround is to check out the sambamba files in /home/wrk/izip/git/opensource/D/sambamba. |
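Since the tarball is published together with an md5sum, it is worth checking the download before unpacking it. The following is a minimal sketch of that check in Python, assuming the tarball has already been fetched; the function names are illustrative and the file path is whatever you saved the download as.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's MD5 digest with the published value (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_expected(path_to_tarball, "6eaefc19adcf2dbce60cf18a15faea4a")` should return `True` for an intact download of the build announced above; `path_to_tarball` is a placeholder for the local file name.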
Man, I am challenged. Reopening. |
Symbols are in this file http://biogems.info/contrib/genenetwork/sambamba.debug.tgz. If you check out the source dir you should be able to get full debug. Please try. Mail me if you need more instructions. |
I managed to deploy this, but of course our reproducible example to cause core dump no longer reproduces.. at least not with this build. I am now running the whole bcbio workflow (somatic WGS) to see if this holds up or if we need to come up with a new test case. More in a bit. |
That would be good news. ldc was updated. If it holds up, can you also take a look at performance - we don't want it to degrade ;). In the next step I can update LLVM to latest too. |
I haven't fully followed this thread, but just a couple of notes in case they are relevant:
Sorry if some of this does not apply here. |
I am on a slight tangent because I am using GNU Guix to build sambamba with or without symbols. These binaries can now be installed anywhere and do not require docker - you can try above URLs. We are trying to track down this particular bug - when that is done I can propose creating releases from Guix (that will also work in Docker). Building from source with ldc is a bit of a challenge and with Guix we can at least use distribution agnostic binaries that require no admin rights to install and run. |
I'm not following you – nothing I'm aware of requires Docker; I do not run sambamba in Docker. It is being used for CI builds/releases, and with/without symbols is already supported. What I'm suggesting is that the segfault symptom may be hiding a different underlying issue. For example, if you can trigger a segfault by providing an invalid command line, you may have the same issue I experienced with the official binaries a few months back (again: not Docker-related). You seemed to suggest that this was the case – If it's already clear to you that there's a "real" segfault - i.e. sambamba itself is dereferencing NULL or similar, then disregard my comments. |
Okay. False alarm.. it seems I am still segfaulting depending on what machine I end up on:
I've not been able to run this with gdb:
This fails right away whereas the production run crash happens after writing the sorted file to disk. I suspect the missing libraries cause issues when trying to debug the threaded run? |
They should not cause issues as all libraries are in fact found. What is missing are debug symbols which were in a separate download. Anyway, I think the good news is that we are homing in on the problem, which looks a bit like http://www.digitalmars.com/d/archives/digitalmars/D/learn/GC_dead-locking_47223.html. In the next step I'll send you a download of a new sambamba build with debug symbols included and instructions. We try again and should get to the point where it fails in sambamba itself, dumping stack traces of all threads. |
A few things here:
|
@sambrightman Thanks, the
@pjotrp - over to you. |
dlang/druntime@b22d813 looks familiar, especially since it only contains one instruction. Let me check what is in the runtime. |
Yes, the instruction still sits in the ldc runtime we use. I'll create an update. @ohofmann thanks. I'll send you another one to try soon. Question, have you isolated the machine this happens on, or is it random on the cluster? |
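To answer whether the crash is tied to particular machines or random across the cluster, one option is to run the failing command several times on each node and tally the outcomes per hostname. The sketch below is only an illustration under that assumption: `describe_exit` and `tally_runs` are hypothetical helper names, and the command to pass in is whatever invocation crashes in production (it is not shown in this thread).

```python
import collections
import signal
import socket
import subprocess


def describe_exit(returncode):
    """Translate a subprocess return code into a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX a negative return code means the child was killed by that signal.
        return signal.Signals(-returncode).name
    return f"exit {returncode}"


def tally_runs(argv, repeats):
    """Run argv `repeats` times and count outcomes keyed by (hostname, label)."""
    host = socket.gethostname()
    counts = collections.Counter()
    for _ in range(repeats):
        completed = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        counts[(host, describe_exit(completed.returncode))] += 1
    return counts
```

Submitting this as a small job to each node and comparing the `SIGSEGV` counts would show whether the failures cluster on specific hosts.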
…uld fix the Program received signal SIGSEGV, Segmentation fault. 0x00000000006a4254 in rt.sections_elf_shared.finiTLSRanges() () in biod#219 by updating Guix ldc runtime to 1.1.0.
comment on dlang/druntime@b22d813
I built sambamba with the latest ldc and runtime 1.1.0.
Instructions for installing Guix relocatable sambamba and debugging
Fetch
The md5sum should be
unpack
run the installer with target dir, e.g.
Now you should be able to run
To run with debugger you should see
it will complain about a CRC, so we need to fetch the original debug file with the command
now run
and you should see
If you see view.d:151, the symbols are loaded correctly (@sambrightman: I cause the segfault). If we get another segfault, show us all threads with
If you want listings you can add the source directory, also included with the installer, e.g.
which shows the actual line
@ohofmann I hope this version of sambamba+ldc fixes our issue |
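The exact debugger commands from the instructions above did not survive in this copy of the thread. As a hedged illustration only, gdb can run a program non-interactively and print a backtrace of every thread once it stops, using its `-batch`, `-ex` and `--args` options together with the `thread apply all bt` command. The Python wrapper below is a sketch under the assumption that `gdb` is on the PATH; the function names and the log path are illustrative and not part of the installer.

```python
import subprocess


def gdb_all_threads_argv(program_argv):
    """Build a gdb command line that runs the program and then asks for
    a backtrace of every thread (useful when it stops on SIGSEGV)."""
    return [
        "gdb", "-batch",
        "-ex", "run",
        "-ex", "thread apply all bt",
        "--args", *program_argv,
    ]


def capture_backtraces(program_argv, log_path):
    """Run the program under gdb and write all gdb output to log_path."""
    with open(log_path, "w") as log:
        completed = subprocess.run(
            gdb_all_threads_argv(program_argv),
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return completed.returncode
```

The resulting log file can then be attached to the issue instead of pasting terminal output.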
Thanks @pjotrp -- can you check the |
I'll check. Can you run ./install.sh -v -d TARGETDIR so I get the full stack trace? Note that you need to remove the old folder TARGETDIR before installing (it won't overwrite). |
Of course:
|
That is what I thought. You may need to do
beforehand. Try using a different prefix to test. I'll add a --force switch to the installer and test again. |
@pjotrp , the directory doesn't exist beforehand. If it does the installer complains right away:
If it doesn't it creates the directory and lots of subdirs and starts setting things up, but fails at the glibc dir. |
Thanks for the detailed report. I'll look into it. |
It is a strange error because that directory is only created once. Somehow the file system reports it (still) exists. To nail it down I updated the tarball (md5sum 52def24ae8371a3b286252fddd7cf2f5). Please run with ./install.sh TARGET -v -d --force. You can send the log to pjotr.public01@thebird.nl |
After a bit of offline back and forth ended up with a new trace:
|
Looks like the upstream fix related to dlang/druntime@b22d813 in latest Druntime did not fix the issue. I am going to patch out the instruction that throws the exception and we retry. |
New version appears to fix the issue:
|
@ohofmann what is the status? |
Still got to finish a single WGS run. Brad is happily debugging on the NCI environment. Sorry, I’ll report back the minute I can confirm it’s working.
|
I added a new binary install of sambamba 0.6.6-pre3 with debug information on https://github.com/pjotrp/sambamba#troubleshooting. The issue around read sorting in 'sambamba depth' should be resolved by 48ac7aa. Please test this version on your HPC. When it works we'll make a proper release which will run faster. |
Fixed in 0.6.6-pre3 |
Note that you need the patched ldc compiler to have this fix. |
https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/ led us to suspect the problem is an Intel bug: On Mon, Jan 22, 2018 at 11:39:39AM +1100, Oliver Hofmann wrote:
That processor is listed as having this problem. I think we have found A microcode update is available. Also switching off hyperthreading See also https://www.digitaltrends.com/computing/intel-hyperthreading-bug-kaby-skylake/ |
We had two bugs. One is fixed in ldc dlang/druntime#1655. The other is an unfixable Intel Xeon bug, see #335 for more information. |
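When comparing nodes that crash with nodes that do not, it can help to record each node's processor model and microcode revision. On x86 Linux these appear as the `model name` and `microcode` fields of `/proc/cpuinfo`. The sketch below only collects those fields; `cpu_summary` is an illustrative name, and deciding whether a given processor is affected is left to the vendor's errata.

```python
def cpu_summary(cpuinfo_text):
    """Return the set of (model name, microcode) pairs found in
    /proc/cpuinfo-style text, one record per logical CPU."""
    pairs = set()
    model = microcode = None
    # The extra empty string flushes the final record if the text
    # does not end with a blank line.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical CPU's record.
            if model is not None or microcode is not None:
                pairs.add((model, microcode))
            model = microcode = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "model name":
            model = value.strip()
        elif key == "microcode":
            microcode = value.strip()
    return pairs
```

On a Linux node, `cpu_summary(open("/proc/cpuinfo").read())` typically yields a single pair, which can be logged next to the hostname of each run.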
Hi,
I should start by saying that our Lustre file system is new and we're only setting things up.
I'm running a sambamba view command that randomly segfaults when the data is on our Lustre file system.
The command I'm using is
When run on lustre I get this output
Here is the gdb output backtrace if that helps:
When I run the same command on the local filesystem there are no segmentation faults.
Is there any known problem with the Lustre file system?
I also ran the equivalent samtools view command, but that did not result in segfaults.