-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
waiting for background processes after task command exits #211
Comments
@mckinsel FFS! very first thing we try and immediately hit this awful heisenbug! This may be a nasty issue for the SmartSeq2 pipelines which use the quoted HISAT2 tasks, because it can produce a truncated yet readable BAM file, leading to "successful" but wrong/incomplete results. I was able to zero in on the source only because of the "BAM EOF marker not present" warnings (not errors) printed by downstream tools, which were displayed to me thanks to @MDunitz. |
Well this is interesting. My view is that miniwdl should just pass along this bash behavior, since mitigations might make the behavior even more confusing and could lead to things like tasks hanging indefinitely only in miniwdl. That said, it's kind of a bummer that shellcheck doesn't give you a warning. So this means that any of the BAMs produced by the SS2 pipeline could be truncated even though the workflow succeeded right? But the truncation is probably not too many records? |
Technically yea, but I doubt it's any kind of huge problem -- only the tail of the file would be affected, and as noted Cromwell seems to have some larger time constant which makes it more likely to "win" the race condition than seen with miniwdl so far. It can be detected by lack of the BAM EOF marker a 28-byte string constant at the end of the output BAM file. Samtools, when opening a file lacking it, prints a warning EOF marker is absent. The input is probably truncated to standard error, but otherwise proceeds hoping for the best. |
PS, I will write a PR on skylab revising these HISAT2 tasks. It just entails creating a mkfifo pipe to connect hisat2 with samtools, and then explicitly waiting for both processes to exit. |
When the task command exits, the docker container stops promptly along with any processes which may have been left running in the background. Now, reasonable arguments can be made that this is correct behavior; but it promptly caused a nasty, nondeterministic, largely-silent data corruption in the first HCA workflow we attempted to run with miniwdl. This arose from the following final statement in a task command:
https://github.com/HumanCellAtlas/skylab/blob/950eab4c94573283e25fdceceaf9270634018529/library/tasks/HISAT2.wdl#L161-L181
It uses a command substitution to stream the results of hisat2 into samtools. But this spawns samtools as a background process which is liable to be stopped, prematurely whilst still writing the output BAM file, as soon as hisat2 exits. Bash actually may not even provide a reasonable way to wait for a background process spawned by command substitution like this!
Experiments suggest that Cromwell doesn't really wait for the background process to exit, either; but empirically it seems to take a little longer than miniwdl to tear down the container, making this problem less likely to manifest. Can or should miniwdl do anything to mitigate this? or should we just advise people to avoid this pattern of streaming into a command substitution, in favor of an explicit mkfifo & wait?
To repro with Cromwell, try this contrived task:
The log seems more likely to come out empty for larger values of w:
The text was updated successfully, but these errors were encountered: