
markdup segmentation fault #393

Closed
Dr-Handsome opened this issue Mar 28, 2019 · 59 comments

@Dr-Handsome commented Mar 28, 2019

sambamba 0.6.8 from bioconda

I ran markdup on a cluster; it produced a segmentation fault and left a core.* file behind. The error is consistent across machines with different CPUs (both Xeon and AMD Opteron). Any thoughts?

@steffenheyne commented Mar 29, 2019

Probably a similar issue here. I'm also on conda v0.6.8.

For test purposes I have a very small BAM, and the crash seems related to the number of threads used. The same small BAM works fine with -t 1..8; with -t 9 it segfaults. Maybe some threads get zero data!?

$ sambamba markdup --remove-duplicates -t 9 in.bam out.bam

sambamba 0.6.8 by Artem Tarasov and Pjotr Prins (C) 2012-2018
    LDC 1.13.0 / DMD v2.083.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)

finding positions of the duplicate reads in the file...
  sorted 324798 end pairs
     and 324 single ends (among them 0 unmatched pairs)
  collecting indices of duplicate reads...   done in 57 ms
  found 361424 duplicates
collected list of positions in 0 min 0 sec
Segmentation fault (core dumped)
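A thread-count sweep like the one described can be scripted; a minimal sketch (hedged: in.bam and out.bam are placeholder file names, and the useful range of thread counts depends on the machine):

```shell
# Sweep sambamba markdup over a range of thread counts and report
# which ones crash. in.bam/out.bam are placeholder file names.
sweep_threads() {
    bam=$1; out=$2
    for t in $(seq 1 12); do
        if sambamba markdup --remove-duplicates -t "$t" "$bam" "$out"; then
            echo "-t $t: ok"
        else
            echo "-t $t: failed (exit $?)"
        fi
    done
}

# Only run the sweep when sambamba is actually on the PATH:
command -v sambamba >/dev/null 2>&1 && sweep_threads in.bam out.bam || true
```

The guard at the end means the sketch is a no-op on machines without sambamba installed.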
@steffenheyne commented Mar 29, 2019

Ah, maybe this is due to a high number of duplicates. I increased --hash-table-size to 4194304 and then it ran through without a segfault.
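For reference, a hedged sketch of that workaround invocation. --hash-table-size and --overflow-list-size are existing markdup options, but the values below are illustrative rather than tuned, and in.bam/out.bam are placeholders:

```shell
# Workaround attempt: enlarge markdup's internal tables for libraries
# with many duplicates. Values are illustrative only.
markdup_big_tables() {
    sambamba markdup --remove-duplicates \
        --hash-table-size=4194304 \
        --overflow-list-size=600000 \
        -t 4 "$1" "$2"
}

# Guarded call so the sketch is harmless without sambamba installed:
command -v sambamba >/dev/null 2>&1 && markdup_big_tables in.bam out.bam || true
```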

@steffenheyne commented Mar 29, 2019

So actually the segfaults seem more random: I changed parameters and sometimes it crashes, sometimes it runs through... so far it is unpredictable what causes this.

Could this be a bioconda version/build issue? The underlying OS is Ubuntu 18.04.

@steffenheyne commented Mar 29, 2019

Here is a dump:
catchsegv.1.txt

@pjotrp (Member) commented Apr 1, 2019

Thanks for the dump. I am happy to chase the problem. You can see it is happening in the garbage collector. Can you try the latest binary release and see if this does the same? https://github.com/biod/sambamba/releases

Also, what hardware are you on? Some Xeons have threading problems.

@pjotrp pjotrp self-assigned this Apr 1, 2019
@pjotrp pjotrp added the bug label Apr 1, 2019
@pjotrp pjotrp added this to the 0.7.0 milestone Apr 1, 2019
@steffenheyne commented Apr 2, 2019

Different cloud servers, but the last one was an Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz.

dump from cloned master:

catchsegv.2.txt

I compiled with make debug, but the dump looks less informative than the one from the conda version.

@pjotrp (Member) commented Apr 2, 2019

I think that is one of the Xeons with unreliable hyperthreading. Sambamba is one of the rare tools that brings this out by utilizing all cores. See also #335. Check the list published by Intel.

@steffenheyne commented Apr 2, 2019

Mhmm, that's bad. So using fewer cores could be more stable?

I also get a segfault with sambamba depth ....

@pjotrp (Member) commented Apr 2, 2019

One option is to turn off hyperthreading. Sadly that is hard in most HPC environments. Alternatively don't use those machines or rerun until it gets through. You can blame Intel.
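The "rerun until it gets through" option can be automated; a minimal sketch (the commented-out sambamba call is a placeholder):

```shell
# Retry a flaky command until it exits cleanly, up to a maximum number
# of attempts; propagates the last exit code if it never succeeds.
retry() {
    max=$1; shift
    n=1
    until "$@"; do
        rc=$?
        if [ "$n" -ge "$max" ]; then
            echo "giving up after $n attempts (exit $rc)" >&2
            return "$rc"
        fi
        echo "attempt $n failed (exit $rc), retrying..." >&2
        n=$((n + 1))
    done
}

# Example usage (placeholder command):
# retry 5 sambamba markdup --remove-duplicates -t 9 in.bam out.bam
```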

@steffenheyne commented Apr 3, 2019

Yeah, thanks!

Btw, I downgraded to v0.6.6 via conda and this seems more stable; so far I don't get any segfaults with the same data.

Does this version use a different underlying library or something like that?

@pjotrp (Member) commented Apr 3, 2019

It is possible. What version of LLVM and ldc is it using? The underlying tools are evolving fast. Sambamba's markdup did not change essentially between those versions.

@steffenheyne commented Apr 3, 2019

It looks like the 0.6.6 conda version uses your pre-built packages from GitHub
(https://github.com/biod/sambamba/releases/download/v0.6.6/sambamba_v0.6.6_linux.tar.bz2),

whereas 0.6.8 is built from scratch on the conda servers, as given here:
https://github.com/bioconda/bioconda-recipes/blob/master/recipes/sambamba/build.sh

@pjotrp (Member) commented Apr 3, 2019

@steffenheyne commented Apr 3, 2019

Unfortunately this doesn't help: the pre-built 0.6.9 also segfaults, as does a locally built 0.7.0-pre1.

What I observed: with 1-8 threads I rarely (never?) got segfaults with any version. Strangely, the local build also survived 9 threads, but higher thread counts segfault often (always?).

@steffenheyne commented Apr 3, 2019

Yeah, with the manually downloaded (not conda) pre-built 0.6.6 it has (so far) never crashed, with any number of threads.

@RichardCorbett commented Apr 5, 2019

Hi folks,
I'm using sambamba-0.6.9-linux-static on CentOS 6 and 7 and getting what looks like a similar error.

I ran 40 different samples and only 10 hit this error, but for those I tried the command on a few machines and they all segfaulted.

Do you have any suggestions?

thanks,
Richard

/projects/COLO829SomaticTitration/sambamba-0.6.9-linux-static markdup --tmpdir /projects/COLO829SomaticTitration /projects/COLO829SomaticTitration/normals/A36973_5_lanes_dupsFlagged.bam.spec.bam.p1.bam /projects/COLO829SomaticTitration/normals/A36973_5_lanes_dupsFlagged.bam.spec.bam.p1.bam.dupmark.bam

sambamba 0.6.9 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.14.0 / DMD v2.084.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)

finding positions of the duplicate reads in the file...
  sorted 147834456 end pairs
     and 4860622 single ends (among them 1137 unmatched pairs)
  collecting indices of duplicate reads...   done in 35457 ms
  found 2953009 duplicates
collected list of positions in 26 min 49 sec
marking duplicates...
Segmentation fault (core dumped)

@pjotrp (Member) commented Apr 5, 2019

@RichardCorbett what CPU are you on? Does using an older binary of sambamba work?

@RichardCorbett commented Apr 5, 2019

CPU tested:
Intel Xeon E5-2650 @ 2.20 GHz

I just had some filesystem problems overnight, so retesting is going to be a challenge. In case it is relevant: I was testing on a BAM file from BWA aln 0.5.7, which includes non-zero mapping qualities for unaligned reads.

@pjotrp (Member) commented Apr 5, 2019

Yeah, this may be another Xeon with problems. It will be interesting to see whether older versions of sambamba+llvm+ldc show the same problem.

@steffenheyne can you confirm you are still good with the 0.6.6 binary? And for both of you: is there any way I could access one of these machines so I can debug the problem? If 0.6.6 works I could use the same build chain to compile a new binary and see whether that keeps going.

@pjotrp (Member) commented Apr 5, 2019

sambamba 0.6.6 was built with

    LDC 0.17.1
    using DMD v2.068.2
    using LLVM 3.8.0
@pjotrp (Member) commented Apr 5, 2019

Another thing we could try is building sambamba with the GNU D compiler. That would narrow down the diagnosis: whether the problem lies with LLVM or with the D runtime.

@RichardCorbett commented Apr 5, 2019

A second overnight test run reproduced the error on a machine with this CPU:
144 x Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz

@pjotrp (Member) commented Apr 5, 2019

Great. I am starting to suspect the D runtime. I should try hitting one of our machines badly.

@dpryan79 (Contributor) commented Apr 9, 2019

I'll try rebuilding 0.6.6 in bioconda with more recent pinnings to see if (A) it still works without segfaulting and (B) it can still be combined with other recent packages in an environment. That will allow us (@steffenheyne and me) to get around the current issue. I agree that this is likely a D runtime issue.

@pjotrp (Member) commented Apr 11, 2019

Thanks. I am too busy to sort it out now, but should have time by Easter.

@sirselim commented Apr 26, 2019

Just adding my observation of this issue as well.
Processors: 2x Intel Xeon Gold 5118 (12 cores, Skylake-based)

I'm using conda Sambamba (0.6.8 and now 0.6.9) in a Snakemake workflow and getting segfaults at the markdup stage.

I set the markdup rule to call 0.6.6 from conda and things seem to be processing well.

@pjotrp (Member) commented Apr 26, 2019

Thanks. This is very annoying; good thing we have 0.6.6.

@pjotrp pjotrp added the in progress label May 29, 2019
@pjotrp (Member) commented May 29, 2019

gdb -ex 'handle SIGUSR1 SIGUSR2 nostop noprint' --args ./sambamba-0.6.9-linux-static markdup SRR2096206.bam test.bam

Running in the debugger gives

error reading variable: access outside bounds of object referenced via synthetic pointer

#2  0x0000000000613feb in rt_finalizeFromGC ()
#3  0x000000000062755e in _D2gc4impl12conservativeQw3Gcx5sweepMFNbZm ()
#4  0x00000000006251c3 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm ()
#5  0x0000000000626687 in _D2gc4impl12conservativeQw3Gcx8bigAllocMFNbmKmkxC8TypeInfoZPv ()
#6  0x0000000000623984 in _D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQEgQEgQEeQEp10mallocTimelS_DQFiQFiQFgQFr10numMallocslTmTkTmTxQCzZQFcMFNbKmKkKmKxQDsZQDl ()
#7  0x00000000006259f3 in _DThn16_2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#8  0x000000000060bdb0 in gc_qalloc ()
#9  0x0000000000608be2 in _D4core6memory2GC6qallocFNaNbmkxC8TypeInfoZSQBqQBo8BlkInfo_ ()
#10 0x00000000006135cf in _d_newarrayU ()
#11 0x00000000004c47e5 in _D3std5array__T14arrayAllocImplVbi0TAhTmZQBaFNaNbmZQp (_param_0=140731428614032)
    at /gnu/store/qzcgba5bflcgxsinabfpfy7j5nddvj4f-ldc-1.14.0/include/d/std/array.d:950
#12 _D3std5array__T18uninitializedArrayTAhTmZQBaFNaNbNemZQr (_param_0=140731428614032)
    at /gnu/store/qzcgba5bflcgxsinabfpfy7j5nddvj4f-ldc-1.14.0/include/d/std/array.d:819
#13 _D3bio3std3hts3bam9readrange__T12BamReadRangeSQBsQBrQBqQBpQBo14withoutOffsetsZQBv8allocateMFNaNbNfmZAh (this=<optimized out>, size=<optimized out>)
    at /export/export/local/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:178
#14 _D3bio3std3hts3bam9readrange__T12BamReadRangeSQBsQBrQBqQBpQBo14withoutOffsetsZQBv8readNextMFZv (this=0x7ffe87e7c500)
    at /export/export/local/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:166
#15 0x00000000004eab28 in _D3std11parallelism8TaskPool__T3mapS3bioQBm3hts3bam11multireader17adjustTagsInRangeZ__TQCeTSQDm5range__T11ZipShortestVEQEn8typecons__T4FlagVAyaa18_616c6c4b6e6f776e53616d654c656e677468ZQByi0TSQHh9algorithm9iteration__T9MapResultSQHkQIvQHjQHiQHh__T13multiBamReadsTSQItQKeQIsQIr9readrange__T12BamReadRangeSQKhQLsQKgQKfQBo14withoutOffsetsZQBvZQDpFQDdmZ9__lambda3TSQNuQKi__TQKfVQJvi0TQEnTSQOsQLg__T6RepeatTmZQkZQLqZQHlTSQPxQMl__TQBfTCQPdQQoQPc5utils15samheadermerger15SamHeaderMergerZQDiTQDwZQOtZQRsMFQPpmmZ3Map7fillBufMFASQTwQPj__T5TupleTSQTfQUqQTeQTdQTc__T12MultiBamReadTSQUnQVyQUmQUl4read7BamReadZQBqTQGpZQCyZQDn () at /export/export/local/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:101
#16 0x00000000004eb433 in _D3std11parallelism__T3runTDFASQBd8typecons__T5TupleTS3bioQCe3hts3bam11multireader__T12MultiBamReadTSQBvQDyQBuQBt4read7BamReadZQBqTCQDaQFdQCz5utils15samheadermerger15SamHeaderMergerZQFgZQGbTQGfZQGqFQGpKQGrZQGv (
    fpOrDelegate=<error reading variable: access outside bounds of object referenced via synthetic pointer>, _param_1=...)
    at /gnu/store/qzcgba5bflcgxsinabfpfy7j5nddvj4f-ldc-1.14.0/include/d/std/parallelism.d:764
#17 _D3std11parallelism__T4TaskSQBaQz3runTDFASQBo8typecons__T5TupleTS3bioQCp3hts3bam11multireader__T12MultiBamReadTSQBvQEjQBuQBt4read7BamReadZQBqTCQDaQFoQCz5utils15samheadermerger15SamHeaderMergerZQFgZQGbTQGfZQHb4implFPvZv (myTask=0x7ffff7f006b8 "")
    at /gnu/store/qzcgba5bflcgxsinabfpfy7j5nddvj4f-ldc-1.14.0/include/d/std/parallelism.d:452
#18 0x00000000005ef2c1 in _D3std11parallelism8TaskPool5doJobMFPSQBkQBj12AbstractTaskZv ()
#19 0x00000000005ef40b in std.parallelism.TaskPool.startWorkLoop() ()
#20 0x000000000060976c in thread_entryPoint ()
#21 0x00000000005c5cb9 in start_thread ()
#22 0x000000000068e3db in clone ()
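As an aside, the SIGUSR1/SIGUSR2 handling in the gdb invocation above is needed because D's runtime uses those signals to suspend threads for garbage collection. A hedged sketch for pulling a backtrace out of one of the leftover core.* files (the binary and core file names are placeholders):

```shell
# Print a backtrace from a core dump. The binary must be the exact one
# that produced the core; the core file name depends on the system's
# kernel.core_pattern setting.
inspect_core() {
    binary=$1; corefile=$2
    gdb -batch \
        -ex 'handle SIGUSR1 SIGUSR2 nostop noprint' \
        -ex 'bt' \
        "$binary" "$corefile"
}

# Example (placeholder names):
# inspect_core ./sambamba-0.6.9-linux-static core.12345
```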
@pjotrp (Member) commented May 29, 2019

A rebuild of sambamba ran without crash the first time with

 LDC 1.10.0 / DMD v2.080.1 / LLVM6.0.1 / bootstrap LDC - the LLVM D compiler (0.17.4)
@pjotrp (Member) commented May 29, 2019

I put out a new binary release which you can use for testing. As a next step I'll need to build with a more recent compiler (again). I'll try to compile with the new DIP1000 as in #399.

@pjotrp (Member) commented May 31, 2019

A build with the latest version of LDC 1.16.0-beta consistently segfaults:

  LDC 1.16.0 / DMD v2.086.0 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)

The stack trace differs from the earlier one:

0x00000000005dd9c4 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()
(gdb) bt                                                                                                                 
#0  0x00000000005dd9c4 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()       
#1  0x00000000004ef65e in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQD
xQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf10yieldForceMFNcNdNeZQCz (this=0x2828282828282828)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/parallelism.d:605
#2  0x000000000060155b in rt_finalizeFromGC ()
#3  0x0000000000615d49 in _D2gc4impl12conservativeQw3Gcx5sweepMFNbZm ()
#4  0x0000000000612c15 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm ()
#5  0x0000000000614681 in _D2gc4impl12conservativeQw3Gcx8bigAllocMFNbmKmkxC8TypeInfoZPv ()
#6  0x00000000006110de in _D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQEgQEgQEeQEp10mallocT
imelS_DQFiQFiQFgQFr10numMallocslTmTkTmTxQCzZQFcMFNbKmKkKmKxQDsZQDl ()
#7  0x0000000000613663 in _DThn16_2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#8  0x00000000005f8b20 in gc_qalloc ()
#9  0x00000000005f5822 in _D4core6memory2GC6qallocFNaNbmkxC8TypeInfoZSQBqQBo8BlkInfo_ ()
#10 0x0000000000600b3f in _d_newarrayU ()
#11 0x00000000004c9435 in _D3std5array__T14arrayAllocImplVbi0TAhTmZQBaFNaNbmZQp (_param_0=140731236642144)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/array.d:953
#12 _D3std5array__T18uninitializedArrayTAhTmZQBaFNaNbNemZQr (_param_0=140731236642144)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/array.d:822
#13 _D3bio3std3hts3bam9readrange__T12BamReadRangeSQBsQBrQBqQBpQBo14withoutOffsetsZQBv8allocateMFNaNbNfmZAh (this=<optimized out>, size=<optimized out>)
    at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:178
#14 _D3bio3std3hts3bam9readrange__T12BamReadRangeSQBsQBrQBqQBpQBo14withoutOffsetsZQBv8readNextMFZv (this=0x7ffe8a36c900)
    at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:166

which looks like bgzf decompress and the GC sweep are interacting...

@pjotrp (Member) commented May 31, 2019

#13 at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:178

points at

            _alloc_buffer = uninitializedArray!(ubyte[])(max(size, 65536));

which triggers a GC sweep and segfaults.

The stack trace is informative and it looks like we have invalid pointers on the stack.

@pjotrp (Member) commented May 31, 2019

The debug version of sambamba renders

0x000000000070e814 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()
(gdb) bt
#0  0x000000000070e814 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()
#1  0x000000000059a2a5 in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQDxQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf10yieldForceMFNcNdNeZQCz (this=0x2828282828282828)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/parallelism.d:605
#2  0x0000000000599b0b in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQDxQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf6__dtorMFNfZv (this=0x2828282828282828)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/parallelism.d:747
#3  0x000000000074064c in object.TypeInfo_Struct.destroy(void*) const ()
#4  0x000000000074a1f7 in rt.lifetime.finalize_array(void*, ulong, const(TypeInfo_Struct)) ()
#5  0x000000000074b456 in rt.lifetime.finalize_array2(void*, ulong) ()
#6  0x000000000074b74a in rt_finalizeFromGC ()
#7  0x000000000076c798 in _D2gc4impl12conservativeQw3Gcx5sweepMFNbZm ()
#8  0x0000000000767ff0 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm ()
#9  0x0000000000769f1c in _D2gc4impl12conservativeQw3Gcx8bigAllocMFNbmKmkxC8TypeInfoZPv ()
#10 0x0000000000765123 in _D2gc4impl12conservativeQw3Gcx5allocMFNbmKmkxC8TypeInfoZPv ()
#11 0x0000000000765059 in _D2gc4impl12conservativeQw14ConservativeGC12mallocNoSyncMFNbmkKmxC8TypeInfoZPv ()
#12 0x0000000000764fa2 in _D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQEgQEgQEeQEp10mallocTimelS_DQFiQFiQFgQFr10numMallocslTmTkTmTxQCzZQFcMFNbKmKkKmKxQDsZQDl ()
#13 0x00000000007651e6 in _D2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#14 0x0000000000768aa8 in _DThn16_2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#15 0x000000000073d3a1 in gc_qalloc ()
#16 0x00000000007374ca in _D4core6memory2GC6qallocFNaNbmkxC8TypeInfoZSQBqQBo8BlkInfo_ ()
#17 0x00000000007499d7 in _D2rt8lifetime12__arrayAllocFNaNbmxC8TypeInfoxQlZS4core6memory8BlkInfo_ ()
#18 0x000000000074ab37 in _d_newarrayU ()
#19 0x000000000074abed in _d_newarrayT ()
#20 0x000000000051b07e in _D7contrib6undead6stream14BufferedStream6__ctorMFCQBwQBrQBn6StreammZCQCpQCkQCgQCc (this=0x7ffe8a37f380, source=0x7ffe8a30ce40, bufferSize=134217728) at /home/wrk/izip/git/opensource/D/sambamba/BioD/contrib/undead/stream.d:1628
#21 0x000000000054f772 in bio.std.hts.bam.reader.BamReader.getNativeEndianSourceStream() (this=0x7ffe8a312d00)
    at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/reader.d:517
...

where the last line refers to another allocation

    return new BufferedStream(file, _buffer_size);
@pjotrp (Member) commented May 31, 2019

This may be what I am looking for:

#0  0x000000000070e814 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()
#1  0x000000000059a2a5 in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQDxQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf10yieldForceMFNcNdNeZQCz (this=0x2026261b0d812821)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/parallelism.d:605
#2  0x0000000000599b0b in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQDxQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf6__dtorMFNfZv (this=0x2026261b0d812821)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/parallelism.d:747
#3  0x000000000074064c in object.TypeInfo_Struct.destroy(void*) const ()
#4  0x000000000074a1f7 in rt.lifetime.finalize_array(void*, ulong, const(TypeInfo_Struct)) ()
#5  0x000000000074b456 in rt.lifetime.finalize_array2(void*, ulong) ()
#6  0x000000000074b74a in rt_finalizeFromGC ()
@pjotrp (Member) commented Jun 1, 2019

Just now, for the first time, sambamba ran without crashing :). The problem is out-of-order execution.

@pjotrp (Member) commented Jun 2, 2019

I think the problem is with scopedTask, where a task is added to the thread pool using the stack rather than the heap. In particular this section

https://github.com/biod/BioD/blob/master/bio/core/bgzf/inputstream.d#L391

where a task gets created and pushed onto a roundbuf. When the garbage collector kicks in after reading the BAM file, it wants to destroy this object, but the object is in an inconsistent state (maybe the thread already got cleaned up, or it tries to clean up twice). I managed to prevent the segfault by disabling the garbage collector, but obviously that won't do.

The roundbuf is probably used to keep a task connected with its bgzf unpacking buffer. I am not sure why this is necessary. I am also not convinced a thread pool is that much of a benefit for bgzf unpacking, as the single-threaded routine I wrote last year is blazingly fast. Need to figure out what the best approach is...
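For diagnosis only: druntime lets you switch the GC off from outside the program, via the DRT_GCOPT environment variable or the --DRT-gcopt command-line switch (assuming the binary keeps druntime's default runtime option handling). It is not a fix, since memory use then grows without bound, but if the segfault disappears it supports the GC-interaction theory. A hedged sketch:

```shell
# Run markdup with the D garbage collector disabled (diagnostic only;
# memory is never reclaimed). in.bam/out.bam are placeholders.
markdup_no_gc() {
    DRT_GCOPT="disable:1" sambamba markdup -t 9 "$1" "$2"
}

# Guarded call so the sketch is harmless without sambamba installed:
command -v sambamba >/dev/null 2>&1 && markdup_no_gc in.bam out.bam || true
```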

@pjotrp (Member) commented Jun 10, 2019

Adding a tracing destructor to the struct whose finalization segfaults:

struct DecompressedBgzfBlock {
    ~this() {
      stderr.writeln("destroy DecompressedBgzfBlock ",start_offset,":",end_offset," ",decompressed_data.sizeof);
    };
    ulong start_offset;
    ulong end_offset;
    ubyte[] decompressed_data;
}

Running a decompress block typically reads

destroy DecompressedBgzfBlock 4945:5091 16
destroy DecompressedBgzfBlock 4616:4945 16
destroy DecompressedBgzfBlock 4287:4616 16
destroy DecompressedBgzfBlock 3958:4287 16
destroy DecompressedBgzfBlock 3629:3958 16
destroy DecompressedBgzfBlock 3300:3629 16
destroy DecompressedBgzfBlock 2971:3300 16
destroy DecompressedBgzfBlock 2642:2971 16
destroy DecompressedBgzfBlock 2313:2642 16
destroy DecompressedBgzfBlock 1984:2313 16
destroy DecompressedBgzfBlock 1655:1984 16
destroy DecompressedBgzfBlock 1326:1655 16
destroy DecompressedBgzfBlock 997:1326 16
destroy DecompressedBgzfBlock 668:997 16
destroy DecompressedBgzfBlock 339:668 16
destroy DecompressedBgzfBlock 0:339 16

but before a segfault we get

destroy DecompressedBgzfBlock 3184080310709005360:3467820302580068384 16
destroy DecompressedBgzfBlock 3184080310725782576:3467820302580068384 16
destroy DecompressedBgzfBlock 3467820298285101088:3467820319995076652 16
destroy DecompressedBgzfBlock 3184361785685716016:3539877896617996576 16
destroy DecompressedBgzfBlock 2318280822927401004:3184080310725782576 16
destroy DecompressedBgzfBlock 3539878068416698400:2318281922439028780 16
destroy DecompressedBgzfBlock 3467820302580068640:2318280822927401004 16
destroy DecompressedBgzfBlock 2318280822927401004:3184361785702493232 16
destroy DecompressedBgzfBlock 3467820298285101088:2318280822927401004 16
destroy DecompressedBgzfBlock 3467820298285101088:2318281922439028780 16
destroy DecompressedBgzfBlock 2318281922439028780:2318281922439360048 16
destroy DecompressedBgzfBlock 2318286380681145392:3184080310725782576 16
destroy DecompressedBgzfBlock 3467820298285101344:2318281922439028780 16
destroy DecompressedBgzfBlock 3539877892323029024:2318281922439094316 16
Program exited with code -11

Even after making sure start_offset is initialized to 0, it looks like these blocks have become invalid while the garbage collector still tries to clean them up.

@pjotrp (Member) commented Jun 10, 2019

Artem wrote:

Create a new task and put it on the roundbuffer using some magic

My conclusion is that that 'magic' no longer works. Creating a task on the stack and moving it to the roundbuffer on the heap confuses the garbage collector, which is kind of unsurprising. With markdup I can't disable the GC cleanup, so it needs some surgery.

@ekg commented Jun 19, 2019

We are having the same problem. Is the only solution at present to downgrade to 0.6.6?

@pjotrp (Member) commented Jun 19, 2019

It is one of those things that takes a few days of work to fix. It is on my list :/

@ekg commented Jun 20, 2019

It might save some users a bit of time if you made a warning-labeled binary release of 0.6.6-stable and dropped it at the top of the releases page.

I know how much of a pain it can be to track down stuff like this. We just had a big problem due to the Spectre/Meltdown patches changing the way that multithreaded interleaved system calls work. I wonder if this problem is related.

@pjotrp (Member) commented Jun 20, 2019

Actually the latest binary release 0.7.0 works.

https://github.com/biod/sambamba/releases/tag/v0.7.0

The problem is with later versions of the D compiler.

@dpryan79 dpryan79 mentioned this issue Jun 24, 2019
@rikrdo89 commented Sep 16, 2019

I got the same "Segmentation fault (core dumped)" error using sambamba 0.7.0 installed with conda and running on an LSF cluster... :/

@pjotrp (Member) commented Sep 16, 2019

Conda builds with a recent ldc; that is the problem at this point.

@pjotrp (Member) commented Sep 23, 2019

A heads up.

At this point, build sambamba with an older ldc, like the binary released on GitHub.

We decided to replace the original BAM reader with a new one I wrote almost two years ago and have been testing. Rather than fix the GC-related issue, we are going to use the new bamreader, which is simpler and therefore (hopefully) easier to maintain. @NickRoz1, who worked as a Google Summer of Code student for me on column-based BAMs, is doing that work.

One important difference is that the worker threads are no longer part of the reader itself. My theory is that performance should be similar ;)

@vanottee commented Nov 4, 2019

Same segmentation fault issue here; I had installed on an HPC system with bioconda. Is there a way to know when the conda build has been updated? Alternatively, if I download the 0.7.0 source code, how do I actually run a tool like markdup? Thanks!
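For reference, a from-source build typically looks like the sketch below. This is hedged: the make target and the output path are assumptions that can differ between releases, and you need git plus the ldc2 D compiler installed.

```shell
# Clone and build sambamba from source, then run a tool from the
# resulting binary. Target/output names are assumptions.
build_sambamba() {
    git clone --recursive https://github.com/biod/sambamba.git &&
    cd sambamba &&
    make &&
    ./bin/sambamba --version
    # then e.g.: ./bin/sambamba markdup in.bam out.bam
}

# Call build_sambamba manually once git and ldc2 are available.
```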

@pjotrp (Member) commented Nov 25, 2019

I am working on fixing this bug. Track progress here https://thebird.nl/blog/work/rotate.html

@pjotrp (Member) commented Nov 28, 2019

@sjackman we have a new release of sambamba that fixes a long-standing bug with the D runtime and should now compile with all versions of ldc. See also https://travis-ci.org/biod/sambamba

@sjackman commented Nov 29, 2019

Thanks for the heads up, Pjotr. Once you've tagged a new release, would you like to open a PR to bump the version of Sambamba in Brewsci/bio? You need to change only two lines, and you can do it from the GitHub web interface if you like. See https://github.com/brewsci/homebrew-bio/blob/c5b38cfea1eff4b18ae19d9dded00f990769de84/Formula/sambamba.rb#L6-L7

@pjotrp (Member) commented Nov 29, 2019

@sjackman it feels dirty to edit through the web interface. But hopefully it works :)

@sjackman commented Nov 29, 2019

Hehe. Thanks!
