Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sambamba-markdup: not enough data in stream via /dev/stdin #421

Open
heuermh opened this issue Dec 17, 2019 · 15 comments
Open

sambamba-markdup: not enough data in stream via /dev/stdin #421

heuermh opened this issue Dec 17, 2019 · 15 comments

Comments

@heuermh
Copy link

heuermh commented Dec 17, 2019

While I realize our use case is somewhat complicated (running sambamba via docker executed in an Apache Spark cluster), I am hoping the underlying issue may be simple.

Is there something about BAM format streamed to sambamba over /dev/stdin that would cause a not enough data in stream error?

Alternatively, might it be possible to read from stdin and write to stdout directly?

bigdatagenomics/cannoli#203

If not clear from the log below, we're using sambamba version 0.7.1 via Docker image tag quay.io/biocontainers/sambamba:0.7.1--h148d290_0. Similar issues are seen when using version 0.6.8 installed locally via homebrew (0.7.1 does not appear to be available via brewsci/bio/sambamba yet).

$ ./bin/cannoli-submit \
  sambambaMarkdup \
  -use_docker \
  NA12878.alignedHg38.duplicateMarked.baseRealigned.bam \
  NA12878.alignedHg38.duplicateMarked.baseRealigned.markdup.alignments.adam
...
19/12/17 10:31:18 ERROR Utils: Aborting task
java.lang.RuntimeException: Piped command List(docker, run, -i, --rm,
quay.io/biocontainers/sambamba:0.7.1--h148d290_0, sambamba, markdup,
/dev/stdin, /dev/stdout) exited with error code 1.
...
sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.13.0 / DMD v2.083.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)
...
  sorted 1276 end pairs
     and 39 single ends (among them 26 unmatched pairs)
  collecting indices of duplicate reads...   done in 9 ms
  found 2 duplicates
collected list of positions in 0 min 1 sec
sambamba-markdup: not enough data in stream
  sorted 2432 end pairs
     and 86 single ends (among them 45 unmatched pairs)
  collecting indices of duplicate reads...   done in 16 ms
  found 4 duplicates
collected list of positions in 0 min 1 sec
sambamba-markdup: not enough data in stream
  sorted 1402 end pairs
     and 59 single ends (among them 47 unmatched pairs)
  collecting indices of duplicate reads...   done in 4 ms
  found 2 duplicates
collected list of positions in 0 min 1 sec
sambamba-markdup: not enough data in stream
@pjotrp
Copy link
Member

pjotrp commented Dec 17, 2019

See #93

@heuermh
Copy link
Author

heuermh commented Dec 17, 2019

I tried reordering the command as in the issue you referenced

List(docker, run, -i, --rm, quay.io/biocontainers/sambamba:0.7.1--h148d290_0,
sambamba, markdup, -o, /dev/stdout, /dev/stdin)
...
sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.13.0 / DMD v2.083.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)

sambamba-markdup: Unrecognized option -o

Regardless, it is our application that is streaming BAM format to /dev/stdin though, not sambamba. Are you looking for an end-of-block or end-of-file marker, or is this not enough data in stream error due to an incomplete record?

@pjotrp
Copy link
Member

pjotrp commented Dec 18, 2019

Sambamba expects a proper BAM stream. I can't validate this way.

@heuermh
Copy link
Author

heuermh commented Dec 18, 2019

Sambamba expects a proper BAM stream. I can't validate this way.

I'm sorry, I'm not sure what you mean here, as far as I know we are sending a properly formatted BAM stream. It works well with several other bioinformatics tools.

I am trying to learn whether sambamba is doing validation differently than other tools, or if there might be an issue with using /dev/stdin instead of using stdin from a shell pipe.

@pjotrp
Copy link
Member

pjotrp commented Dec 25, 2019

It may help if you send an example we can reproduce.

@pjotrp
Copy link
Member

pjotrp commented Nov 30, 2020

No activity

@pjotrp pjotrp closed this as completed Nov 30, 2020
@heuermh
Copy link
Author

heuermh commented Nov 30, 2020

Hello @pjotrp, sorry for not responding earlier. This is still an issue even with the latest 0.7.1 container.

To reproduce, running sambamba locally

$ git clone https://github.com/heuermh/cannoli
$ checkout sambamba-markdup
$ mvn install
...
$ ./bin/cannoli-submit \
  sambambaMarkdup \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.markdup.alignments.adam
...

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.1 / DMD v2.090.1 / LLVM9.0.1 / bootstrap LDC - the LLVM D compiler (1.20.1)

finding positions of the duplicate reads in the file...

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.1 / DMD v2.090.1 / LLVM9.0.1 / bootstrap LDC - the LLVM D compiler (1.20.1)

finding positions of the duplicate reads in the file...
  sorted 109868 end pairs
     and 2807 single ends (among them 261 unmatched pairs)
  collecting indices of duplicate reads...   done in 89 ms
  found 28940 duplicates
collected list of positions in 0 min 7 sec
sambamba-markdup: not enough data in stream

With sambamba via docker

$ docker pull quay.io/biocontainers/sambamba:0.7.1--h984e79f_3
...
$ ./bin/cannoli-submit \
  sambambaMarkdup \
  -use_docker \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.markdup.docker.alignments.adam
...

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)


sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)

finding positions of the duplicate reads in the file...
finding positions of the duplicate reads in the file...
  sorted 109868 end pairs
     and 2807 single ends (among them 261 unmatched pairs)
  collecting indices of duplicate reads...   done in 113 ms
  found 28940 duplicates
collected list of positions in 0 min 9 sec
sambamba-markdup: not enough data in stream

Prerequisites are OpenJDK, Apache Maven, and Docker, if running that way. Any BAM file of non-trivial size will do.

Either of these could be a big help:

  • support reading from stdin directly instead of via /dev/stdin
  • --verbose option, or another way to provide more details about why validation is failing

Thank you in advance!

@pjotrp pjotrp reopened this Nov 30, 2020
@pjotrp
Copy link
Member

pjotrp commented Nov 30, 2020

Not enough data in stream gets reported when a file ends/closes before the expected end. BAM has block sizes, so the stream stops before parsing is done and an error gets reported. Probably best to try building a simple test case outside a container and see if you can replicate the problem.

If you can not replicate we could build a sambamba with debugging information and you can do a back trace.

@pjotrp
Copy link
Member

pjotrp commented Nov 30, 2020

Happy to help a fellow coder.

@pjotrp
Copy link
Member

pjotrp commented Dec 1, 2020

I just confirmed the issue with

cat test/issue_204.bam | ./bin/sambamba-0.8.0-linux-amd64-static markdup /dev/stdin test.bam

sambamba 0.8.0
 by Artem Tarasov and Pjotr Prins (C) 2012-2020
    LDC 1.10.0 / DMD v2.080.1 / LLVM6.0.1 / bootstrap LDC - the LLVM D compiler (0.17.4)

finding positions of the duplicate reads in the file...
  sorted 277 end pairs
     and 5 single ends (among them 5 unmatched pairs)
  collecting indices of duplicate reads...   done in 0 ms
  found 62 duplicates
collected list of positions in 0 min 0 sec
sambamba-markdup: not enough data in stream

markdup using a file runs normally. Other sambamba commands do not show this issue.

BTW on your comment support reading from stdin directly instead of via /dev/stdin: they are the one and same thing.

@pjotrp pjotrp self-assigned this Dec 1, 2020
@pjotrp pjotrp added this to the v1.0.0 milestone Dec 1, 2020
@heuermh
Copy link
Author

heuermh commented Dec 1, 2020

Good catch! Sorry I missed trying that way (cat | sambamba markdup /dev/stdin) before; I only noticed that it worked ok on files.

@pjotrp
Copy link
Member

pjotrp commented Dec 2, 2020

The problem is that markdup, to avoid having the whole BAM file in RAM, reads the input file twice - once for getting the positions, then another round for writing the output file. Reading twice from stdin obviously won't work, so you get this error: "not enough data in stream".

In a shell this can work

sambamba-0.8.0 markdup /dev/stdin \
  ex1_header.dedup.bam < test/issue_204.sorted.bam

but that may imply an overhead because you are dealing with a file rather than an interprocess communication.

@pjotrp
Copy link
Member

pjotrp commented Dec 2, 2020

Looking at the code I think this can be made a single pass for a sorted BAM. That would make it faster. A future version could also hold an unsorted BAM file in RAM because 128GB machines are getting increasingly common.

@pjotrp pjotrp added enhancement and removed bug labels Dec 2, 2020
@pwaltman
Copy link

Are there any plans to fix this behavior? Technically, I'd argue that it's still a bug, as sambamba accepts input that has been piped to it from another application, but fails to process the input correctly (or as expected).

Basically, the documentation implies that sambamba-markdup can be used to pipe commands together, but will throw this error if one tries to combine utilities into a single command, e.g.

bwa mem -R $RG_string $REFERENCE -t 8 fastq_R1.fq fastq_R2.fq \
| samtools view -b -S -u - \
   sambamba markdup -t 8 /dev/stdin $OUTFILE

Until this is fixed/updated, I don't understand the rationale for letting markdup accept input from stdin, since this would seem like the primary scenario in which this functionality would be used.

@pjotrp
Copy link
Member

pjotrp commented Jun 30, 2021

You are right, we should not allow piping. Feel free to submit a PR.

pjotrp added a commit that referenced this issue Jul 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants