sambamba-markdup: not enough data in stream via /dev/stdin #421

heuermh · 2019-12-17T16:45:28Z

While I realize our use case is somewhat complicated (running sambamba via docker executed in an Apache Spark cluster), I am hoping the underlying issue may be simple.

Is there something about BAM format streamed to sambamba over /dev/stdin that would cause a not enough data in stream error?

Alternatively, might it be possible to read from stdin and write to stdout directly?

bigdatagenomics/cannoli#203

If not clear from the log below, we're using sambamba version 0.7.1 via Docker image tag quay.io/biocontainers/sambamba:0.7.1--h148d290_0. Similar issues are seen when using version 0.6.8 installed locally via homebrew (0.7.1 does not appear to be available via brewsci/bio/sambamba yet).

$ ./bin/cannoli-submit \
  sambambaMarkdup \
  -use_docker \
  NA12878.alignedHg38.duplicateMarked.baseRealigned.bam \
  NA12878.alignedHg38.duplicateMarked.baseRealigned.markdup.alignments.adam
...
19/12/17 10:31:18 ERROR Utils: Aborting task
java.lang.RuntimeException: Piped command List(docker, run, -i, --rm,
quay.io/biocontainers/sambamba:0.7.1--h148d290_0, sambamba, markdup,
/dev/stdin, /dev/stdout) exited with error code 1.
...
sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.13.0 / DMD v2.083.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)
...
  sorted 1276 end pairs
     and 39 single ends (among them 26 unmatched pairs)
  collecting indices of duplicate reads...   done in 9 ms
  found 2 duplicates
collected list of positions in 0 min 1 sec
sambamba-markdup: not enough data in stream
  sorted 2432 end pairs
     and 86 single ends (among them 45 unmatched pairs)
  collecting indices of duplicate reads...   done in 16 ms
  found 4 duplicates
collected list of positions in 0 min 1 sec
sambamba-markdup: not enough data in stream
  sorted 1402 end pairs
     and 59 single ends (among them 47 unmatched pairs)
  collecting indices of duplicate reads...   done in 4 ms
  found 2 duplicates
collected list of positions in 0 min 1 sec
sambamba-markdup: not enough data in stream

The text was updated successfully, but these errors were encountered:

pjotrp · 2019-12-17T16:55:51Z

See #93

heuermh · 2019-12-17T17:08:06Z

I tried reordering the command as in the issue you referenced

List(docker, run, -i, --rm, quay.io/biocontainers/sambamba:0.7.1--h148d290_0,
sambamba, markdup, -o, /dev/stdout, /dev/stdin)
...
sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.13.0 / DMD v2.083.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)

sambamba-markdup: Unrecognized option -o

Regardless, it is our application that is streaming BAM format to /dev/stdin though, not sambamba. Are you looking for an end-of-block or end-of-file marker, or is this not enough data in stream error due to an incomplete record?

pjotrp · 2019-12-18T07:48:31Z

Sambamba expects a proper BAM stream. I can't validate this way.

heuermh · 2019-12-18T17:35:58Z

Sambamba expects a proper BAM stream. I can't validate this way.

I'm sorry, I'm not sure what you mean here, as far as I know we are sending a properly formatted BAM stream. It works well with several other bioinformatics tools.

I am trying to learn whether sambamba is doing validation differently than other tools, or if there might be an issue with using /dev/stdin instead of using stdin from a shell pipe.

pjotrp · 2019-12-25T06:34:11Z

It may help if you send an example we can reproduce.

pjotrp · 2020-11-30T15:58:21Z

No activity

heuermh · 2020-11-30T18:13:43Z

Hello @pjotrp, sorry for not responding earlier. This is still an issue even with the latest 0.7.1 container.

To reproduce, running sambamba locally

$ git clone https://github.com/heuermh/cannoli
$ checkout sambamba-markdup
$ mvn install
...
$ ./bin/cannoli-submit \
  sambambaMarkdup \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.markdup.alignments.adam
...

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.1 / DMD v2.090.1 / LLVM9.0.1 / bootstrap LDC - the LLVM D compiler (1.20.1)

finding positions of the duplicate reads in the file...

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.1 / DMD v2.090.1 / LLVM9.0.1 / bootstrap LDC - the LLVM D compiler (1.20.1)

finding positions of the duplicate reads in the file...
  sorted 109868 end pairs
     and 2807 single ends (among them 261 unmatched pairs)
  collecting indices of duplicate reads...   done in 89 ms
  found 28940 duplicates
collected list of positions in 0 min 7 sec
sambamba-markdup: not enough data in stream

With sambamba via docker

$ docker pull quay.io/biocontainers/sambamba:0.7.1--h984e79f_3
...
$ ./bin/cannoli-submit \
  sambambaMarkdup \
  -use_docker \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.markdup.docker.alignments.adam
...

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)


sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)

finding positions of the duplicate reads in the file...
finding positions of the duplicate reads in the file...
  sorted 109868 end pairs
     and 2807 single ends (among them 261 unmatched pairs)
  collecting indices of duplicate reads...   done in 113 ms
  found 28940 duplicates
collected list of positions in 0 min 9 sec
sambamba-markdup: not enough data in stream

Prerequisites are OpenJDK, Apache Maven, and Docker, if running that way. Any BAM file of non-trivial size will do.

Either of these could be a big help:

support reading from stdin directly instead of via /dev/stdin
--verbose option, or another way to provide more details about why validation is failing

Thank you in advance!

pjotrp · 2020-11-30T20:22:26Z

Not enough data in stream gets reported when a file ends/closes before the expected end. BAM has block sizes, so the stream stops before parsing is done and an error gets reported. Probably best to try building a simple test case outside a container and see if you can replicate the problem.

If you can not replicate we could build a sambamba with debugging information and you can do a back trace.

pjotrp · 2020-11-30T20:29:10Z

Happy to help a fellow coder.

pjotrp · 2020-12-01T06:51:22Z

I just confirmed the issue with

cat test/issue_204.bam | ./bin/sambamba-0.8.0-linux-amd64-static markdup /dev/stdin test.bam

sambamba 0.8.0
 by Artem Tarasov and Pjotr Prins (C) 2012-2020
    LDC 1.10.0 / DMD v2.080.1 / LLVM6.0.1 / bootstrap LDC - the LLVM D compiler (0.17.4)

finding positions of the duplicate reads in the file...
  sorted 277 end pairs
     and 5 single ends (among them 5 unmatched pairs)
  collecting indices of duplicate reads...   done in 0 ms
  found 62 duplicates
collected list of positions in 0 min 0 sec
sambamba-markdup: not enough data in stream

markdup using a file runs normally. Other sambamba commands do not show this issue.

BTW on your comment support reading from stdin directly instead of via /dev/stdin: they are the one and same thing.

heuermh · 2020-12-01T20:01:17Z

Good catch! Sorry I missed trying that way (cat | sambamba markdup /dev/stdin) before; I only noticed that it worked ok on files.

pjotrp · 2020-12-02T11:49:44Z

The problem is that markdup, to avoid having the whole BAM file in RAM, reads the input file twice - once for getting the positions, then another round for writing the output file. Reading twice from stdin obviously won't work, so you get this error: "not enough data in stream".

In a shell this can work

sambamba-0.8.0 markdup /dev/stdin \
  ex1_header.dedup.bam < test/issue_204.sorted.bam

but that may imply an overhead because you are dealing with a file rather than an interprocess communication.

pjotrp · 2020-12-02T12:36:16Z

Looking at the code I think this can be made a single pass for a sorted BAM. That would make it faster. A future version could also hold an unsorted BAM file in RAM because 128GB machines are getting increasingly common.

pwaltman · 2021-06-30T17:40:23Z

Are there any plans to fix this behavior? Technically, I'd argue that it's still a bug, as sambamba accepts input that has been piped to it from another application, but fails to process the input correctly (or as expected).

Basically, the documentation implies that sambamba-markdup can be used to pipe commands together, but will throw this error if one tries to combine utilities into a single command, e.g.

bwa mem -R $RG_string $REFERENCE -t 8 fastq_R1.fq fastq_R2.fq \
| samtools view -b -S -u - \
   sambamba markdup -t 8 /dev/stdin $OUTFILE

Until this is fixed/updated, I don't understand the rationale for letting markdup accept input from stdin, since this would seem like the primary scenario in which this functionality would be used.

pjotrp · 2021-06-30T18:13:16Z

You are right, we should not allow piping. Feel free to submit a PR.

pjotrp closed this as completed Nov 30, 2020

pjotrp reopened this Nov 30, 2020

pjotrp added bug markdup labels Dec 1, 2020

pjotrp self-assigned this Dec 1, 2020

pjotrp added this to the v1.0.0 milestone Dec 1, 2020

pjotrp added the reproducible! label Dec 1, 2020

pjotrp added enhancement and removed bug labels Dec 2, 2020

pjotrp added a commit that referenced this issue Jul 31, 2021

Added test for issue #421

d349abd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sambamba-markdup: not enough data in stream via /dev/stdin #421

sambamba-markdup: not enough data in stream via /dev/stdin #421

heuermh commented Dec 17, 2019

pjotrp commented Dec 17, 2019

heuermh commented Dec 17, 2019 •

edited

pjotrp commented Dec 18, 2019

heuermh commented Dec 18, 2019

pjotrp commented Dec 25, 2019

pjotrp commented Nov 30, 2020

heuermh commented Nov 30, 2020

pjotrp commented Nov 30, 2020

pjotrp commented Nov 30, 2020

pjotrp commented Dec 1, 2020

heuermh commented Dec 1, 2020

pjotrp commented Dec 2, 2020

pjotrp commented Dec 2, 2020

pwaltman commented Jun 30, 2021

pjotrp commented Jun 30, 2021

sambamba-markdup: not enough data in stream via /dev/stdin #421

sambamba-markdup: not enough data in stream via /dev/stdin #421

Comments

heuermh commented Dec 17, 2019

pjotrp commented Dec 17, 2019

heuermh commented Dec 17, 2019 • edited

pjotrp commented Dec 18, 2019

heuermh commented Dec 18, 2019

pjotrp commented Dec 25, 2019

pjotrp commented Nov 30, 2020

heuermh commented Nov 30, 2020

pjotrp commented Nov 30, 2020

pjotrp commented Nov 30, 2020

pjotrp commented Dec 1, 2020

heuermh commented Dec 1, 2020

pjotrp commented Dec 2, 2020

pjotrp commented Dec 2, 2020

pwaltman commented Jun 30, 2021

pjotrp commented Jun 30, 2021

heuermh commented Dec 17, 2019 •

edited