sambamba-markdup: not enough data in stream via /dev/stdin #421
See #93.
I tried reordering the command as in the issue you referenced.
Regardless, it is our application that is streaming BAM format to sambamba.
Sambamba expects a proper BAM stream; I can't validate it this way.
I'm sorry, I'm not sure what you mean here; as far as I know we are sending a properly formatted BAM stream, and it works well with several other bioinformatics tools. I am trying to learn whether there is something specific to streaming over /dev/stdin that causes this error.
It may help if you send an example we can reproduce.
Hello @pjotrp, sorry for not responding earlier. This is still an issue even with the latest 0.7.1 container. To reproduce, running
With
Prerequisites are OpenJDK, Apache Maven, and Docker, if running that way. Any BAM file of non-trivial size will do. Either of these could be a big help:
Thank you in advance!
"Not enough data in stream" gets reported when a file ends or closes before the expected end. BAM has block sizes, so when the stream stops before parsing is done, an error is reported. It is probably best to build a simple test case outside a container and see if you can replicate the problem. If you cannot replicate it, we could build a sambamba with debugging information and you can get a backtrace.
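The mechanics can be illustrated without sambamba at all. BAM is BGZF, a blocked gzip variant, so plain gzip (standing in here for a BGZF reader; an illustration, not sambamba's actual code path) fails the same way when a stream ends before a block is complete:

```shell
# Write a complete compressed stream, then simulate one that closes early.
printf 'some data\n' | gzip > whole.gz
head -c 5 whole.gz > cut.gz    # stream ends mid-block

gunzip -t whole.gz && echo "whole stream: OK"
gunzip -t cut.gz 2>/dev/null || echo "truncated stream: reader hit EOF mid-block"
```

The same thing happens to sambamba when the pipe feeding /dev/stdin closes before the final BGZF block has arrived.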
Happy to help a fellow coder.
I just confirmed the issue with
markdup using a file runs normally, and other sambamba commands do not show this issue. BTW, on your comment "support reading from stdin directly instead of via /dev/stdin": they are one and the same thing.
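That equivalence is easy to check: on Linux, /dev/stdin is just another name for file descriptor 0, so reading from it consumes the same stream as reading stdin directly:

```shell
# Both names refer to the identical pipe; /dev/stdin adds no buffering
# or indirection of its own.
echo hello | cat /dev/stdin    # prints "hello"
```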
Good catch! Sorry I missed trying that way.
The problem is that markdup, to avoid holding the whole BAM file in RAM, reads the input file twice: once to collect positions, then a second time to write the output file. Reading twice from stdin obviously won't work, so you get this error: "not enough data in stream". In a shell you can work around this by writing the stream to a temporary file first,
but that may imply an overhead because you are dealing with a file rather than interprocess communication.
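The two-pass constraint is easy to demonstrate without sambamba: a pipe can be read only once, while a file supports any number of passes. A sketch, using `wc -l` as a stand-in for markdup's two reads (output shown is for GNU coreutils on Linux):

```shell
# Two passes over the same pipe: the second finds the stream exhausted.
printf 'a\nb\n' | sh -c 'echo "pass 1: $(wc -l) lines"; echo "pass 2: $(wc -l) lines"'
# pass 1: 2 lines
# pass 2: 0 lines

# Landing the stream in a temporary file first allows both passes.
tmp=$(mktemp)
printf 'a\nb\n' > "$tmp"
echo "pass 1: $(wc -l < "$tmp") lines"   # pass 1: 2 lines
echo "pass 2: $(wc -l < "$tmp") lines"   # pass 2: 2 lines
rm -f "$tmp"
```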
Looking at the code, I think this could be made a single pass for a sorted BAM, which would also make it faster. A future version could additionally hold an unsorted BAM file in RAM, since 128 GB machines are getting increasingly common.
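As a sketch of that idea (an illustration only, not sambamba's actual algorithm, and with a deliberately simplified duplicate key): in coordinate-sorted input, records sharing a key are adjacent, so comparing each record with the previous one is enough to mark duplicates in a single pass. Plain text stands in for BAM records here:

```shell
# Column 1 plays the role of the (reference, position) key; a real
# duplicate key would also involve strand and mate position.
printf 'chr1:100 readA\nchr1:100 readB\nchr1:200 readC\n' |
awk '$1 == prev { $0 = $0 " DUP" } { prev = $1; print }'
# chr1:100 readA
# chr1:100 readB DUP
# chr1:200 readC
```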
Are there any plans to fix this behavior? Technically, I'd argue that it's still a bug: sambamba accepts input that has been piped to it from another application, but fails to process that input correctly (or as expected). The documentation implies that sambamba-markdup can be used to pipe commands together, yet it throws this error if one tries to combine utilities into a single command.
Until this is fixed or updated, I don't understand the rationale for letting markdup accept input from stdin at all, since piping would seem to be the primary scenario for that functionality.
You are right, we should not allow piping. Feel free to submit a PR. |
While I realize our use case is somewhat complicated (running sambamba via Docker, executed in an Apache Spark cluster), I am hoping the underlying issue may be simple.
Is there something about BAM format streamed to sambamba over /dev/stdin that would cause a "not enough data in stream" error? Alternatively, might it be possible to read from stdin and write to stdout directly?
bigdatagenomics/cannoli#203
If not clear from the log below, we're using sambamba version 0.7.1 via the Docker image tag quay.io/biocontainers/sambamba:0.7.1--h148d290_0. Similar issues are seen when using version 0.6.8 installed locally via Homebrew (0.7.1 does not appear to be available via brewsci/bio/sambamba yet).