Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

waiting for background processes after task command exits #211

Closed
mlin opened this issue Jul 18, 2019 · 4 comments
Closed

waiting for background processes after task command exits #211

mlin opened this issue Jul 18, 2019 · 4 comments
Labels
interop Bears on spec compatibility question Further information is requested
Projects

Comments

@mlin
Copy link
Collaborator

mlin commented Jul 18, 2019

When the task command exits, the docker container stops promptly along with any processes which may have been left running in the background. Now, reasonable arguments can be made that this is correct behavior; but it promptly caused a nasty, nondeterministic, largely-silent data corruption in the first HCA workflow we attempted to run with miniwdl. This arose from the following final statement in a task command:

    hisat2 -t \
      -x ${ref_name}/${ref_name} \
      -1 $FQ1 \
      -2 $FQ2 \
      ...
      -p ${cpu} -S >(samtools view -1 -h -o ${output_basename}.bam)

https://github.com/HumanCellAtlas/skylab/blob/950eab4c94573283e25fdceceaf9270634018529/library/tasks/HISAT2.wdl#L161-L181

It uses a command substitution to stream the results of hisat2 into samtools. But this spawns samtools as a background process which is liable to be stopped, prematurely whilst still writing the output BAM file, as soon as hisat2 exits. Bash actually may not even provide a reasonable way to wait for a background process spawned by command substitution like this!

Experiments suggest that Cromwell doesn't really wait for the background process to exit, either; but empirically it seems to take a little longer than miniwdl to tear down the container, making this problem less likely to manifest. Can or should miniwdl do anything to mitigate this? or should we just advise people to avoid this pattern of streaming into a command substitution, in favor of an explicit mkfifo & wait?

To repro with Cromwell, try this contrived task:

version 1.0
task t {
    input {
        Int w
    }
    command {
        touch log
        echo wow > >(sleep ${w}; dd of=log)
    }
    output {
        File log = "log"
        Float logsize = size("log")
    }
}

The log seems more likely to come out empty for larger values of w:

$ miniwdl cromwell orphan.wdl w=1
Cromwell input: {
  "w": 1
}
+ java -DLOG_LEVEL=warn -DLOG_MODE=pretty -jar /tmp/cromwell-44.jar run /home/mlin/src/czi/miniwdl/orphan3.wdl -o /home/mlin/src/czi/miniwdl/20190717_235020_t/cromwell/options.json -i /home/mlin/src/czi/miniwdl/20190717_235020_t/inputs.json
[2019-07-17 23:50:29,04] [warn] 'docker.hash-lookup.gcr-api-queries-per-100-seconds' is being deprecated, use 'docker.hash-lookup.gcr.throttle' instead (see reference.conf)
{
  "outputs": {
    "t.log": "/home/mlin/src/czi/miniwdl/20190717_235020_t/cromwell/cromwell-executions/t/341d3cb0-9912-4e43-af51-765b19af4f81/call-t/execution/log",
    "t.logsize": 4.0
  },
  "id": "341d3cb0-9912-4e43-af51-765b19af4f81"
}
{
  "outputs": {
    "t.log": "/home/mlin/src/czi/miniwdl/20190717_235020_t/cromwell/cromwell-executions/t/341d3cb0-9912-4e43-af51-765b19af4f81/call-t/execution/log",
    "t.logsize": 4.0
  },
  "id": "341d3cb0-9912-4e43-af51-765b19af4f81",
  "dir": "/home/mlin/src/czi/miniwdl/20190717_235020_t"
}
$ miniwdl cromwell orphan.wdl w=5
Cromwell input: {
  "w": 5
}
+ java -DLOG_LEVEL=warn -DLOG_MODE=pretty -jar /tmp/cromwell-44.jar run /home/mlin/src/czi/miniwdl/orphan3.wdl -o /home/mlin/src/czi/miniwdl/20190717_235106_t/cromwell/options.json -i /home/mlin/src/czi/miniwdl/20190717_235106_t/inputs.json
[2019-07-17 23:51:15,35] [warn] 'docker.hash-lookup.gcr-api-queries-per-100-seconds' is being deprecated, use 'docker.hash-lookup.gcr.throttle' instead (see reference.conf)
{
  "outputs": {
    "t.log": "/home/mlin/src/czi/miniwdl/20190717_235106_t/cromwell/cromwell-executions/t/969cb3e6-7d25-4618-9b34-d6794cba51bf/call-t/execution/log",
    "t.logsize": 0.0
  },
  "id": "969cb3e6-7d25-4618-9b34-d6794cba51bf"
}
{
  "outputs": {
    "t.log": "/home/mlin/src/czi/miniwdl/20190717_235106_t/cromwell/cromwell-executions/t/969cb3e6-7d25-4618-9b34-d6794cba51bf/call-t/execution/log",
    "t.logsize": 0.0
  },
  "id": "969cb3e6-7d25-4618-9b34-d6794cba51bf",
  "dir": "/home/mlin/src/czi/miniwdl/20190717_235106_t"
}
@mlin mlin created this issue from a note in miniwdl (Backlog) Jul 18, 2019
@mlin mlin added the interop Bears on spec compatibility label Jul 18, 2019
@mlin
Copy link
Collaborator Author

mlin commented Jul 18, 2019

@mckinsel FFS! very first thing we try and immediately hit this awful heisenbug!

This may be a nasty issue for the SmartSeq2 pipelines which use the quoted HISAT2 tasks, because it can produce a truncated yet readable BAM file, leading to "successful" but wrong/incomplete results. I was able to zero in on the source only because of the "BAM EOF marker not present" warnings (not errors) printed by downstream tools, which were displayed to me thanks to @MDunitz.

@mckinsel
Copy link
Contributor

Well this is interesting. My view is that miniwdl should just pass along this bash behavior, since mitigations might make the behavior even more confusing and could lead to things like tasks hanging indefinitely only in miniwdl. That said, it's kind of a bummer that shellcheck doesn't give you a warning.

So this means that any of the BAMs produced by the SS2 pipeline could be truncated even though the workflow succeeded right? But the truncation is probably not too many records?

@mlin
Copy link
Collaborator Author

mlin commented Jul 18, 2019

So this means that any of the BAMs produced by the SS2 pipeline could be truncated even though the workflow succeeded right? But the truncation is probably not too many records?

Technically yea, but I doubt it's any kind of huge problem -- only the tail of the file would be affected, and as noted Cromwell seems to have some larger time constant which makes it more likely to "win" the race condition than seen with miniwdl so far. It can be detected by lack of the BAM EOF marker a 28-byte string constant at the end of the output BAM file. Samtools, when opening a file lacking it, prints a warning EOF marker is absent. The input is probably truncated to standard error, but otherwise proceeds hoping for the best.

@mlin
Copy link
Collaborator Author

mlin commented Jul 18, 2019

PS, I will write a PR on skylab revising these HISAT2 tasks. It just entails creating a mkfifo pipe to connect hisat2 with samtools, and then explicitly waiting for both processes to exit.

mlin added a commit that referenced this issue Jul 18, 2019
@mlin mlin added the question Further information is requested label Jul 18, 2019
@mlin mlin closed this as completed Aug 23, 2019
miniwdl automation moved this from Backlog to Done Aug 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interop Bears on spec compatibility question Further information is requested
Projects
miniwdl
  
Done
Development

No branches or pull requests

2 participants