Race condition while checking stderr #2378

Closed
danbills opened this issue Jun 26, 2017 · 10 comments

Comments

@danbills
Contributor

commented Jun 26, 2017

As mentioned in this forum post, there appears to be a race between Cromwell checking stderr and the file actually being written and flushed to disk.

The WDL workflow consistently failed with a file-not-found error, always for the stderr file. When I looked the file up manually, it was always there, and the task itself had finished with rc=0. By then, however, the main Cromwell process had already failed with a return code of 1 because of the file-not-found error.

@katevoss katevoss added the PO Cleanup label Sep 21, 2017
@katevoss


commented Sep 27, 2017

@danbills can you explain more about when the error occurs? Do you have an idea of how much effort it would take to fix?

@danbills

Contributor Author

commented Sep 28, 2017

I was relaying information from the forum and am not familiar enough with the issue to describe the situation or effort to fix. Perhaps @Horneth or @mcovarr could shed some light?

@Horneth

Contributor

commented Sep 28, 2017

This PR probably helped (90f1e17) by not checking stderr if we don't need to.

Increasing the number of IO retries won't change anything because we don't retry on a "FileNotFound".
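
To illustrate the point about retries, here is a minimal Python sketch of a read helper that treats a missing file as a transient condition. It is not Cromwell's code (Cromwell is written in Scala); the function name `read_with_retry` and its parameters are hypothetical and exist only for this example.

```python
import time


def read_with_retry(path, attempts=5, delay_seconds=0.1):
    """Read a text file, retrying when the file does not exist yet.

    Hypothetical helper for illustration only; not part of Cromwell.
    """
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "r") as handle:
                return handle.read()
        except FileNotFoundError:
            # Give up after the last attempt; otherwise wait and try again.
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)
```

A retry like this only covers the case where the file appears late. It would not help if the file exists but is still empty or partially written when it is read.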

I wonder whether this is the same thing we sometimes see in Travis, where stdout is empty, only worse this time (the file doesn't even exist).

@katevoss


commented Oct 12, 2017

As a user running workflows, I want to see my stderr output even when Cromwell gets a "FileNotFound" response so that I can debug my workflow.

  • Effort: Small to Medium
    • We're not sure of the exact way to fix it, so for now we have been patching the issue.
  • Risk: Small
  • Business value: Small
    • There is a workaround: look up the stderr file manually.
@Horneth

Contributor

commented Oct 12, 2017

@katevoss if the issue is what I think it is, then I'm not sure how to fix it, so I don't know how much time it would take. We've tried to fix it a few times already and failed.
Also, I wouldn't call manually looking up stderr a workaround if Cromwell fails the workflow for an invalid reason.

@katevoss


commented Oct 12, 2017

@geoffjentry any thoughts on how to fix this issue?

@katevoss katevoss added the 🐛Bug label Oct 12, 2017
@geoffjentry

Contributor

commented Oct 12, 2017

My intuition is what @Horneth said, that this is an artifact of something we've seen before. It is possible to sorta-fix-it-fix-it (e.g. write out a "i'm totally done" file at the very end and not read anything until then) but that has its own problems, including some which lead to the "sorta-" prefix there.
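
The "I'm totally done" file idea can be sketched roughly as follows. This is a Python illustration under stated assumptions, not how Cromwell works: the file names `stderr`, `done`, and `done.tmp`, and both function names, are hypothetical.

```python
import os
import time


def write_outputs_then_marker(directory, stderr_text):
    """Writer side: write stderr, force it to disk, then create the marker."""
    stderr_path = os.path.join(directory, "stderr")
    with open(stderr_path, "w") as handle:
        handle.write(stderr_text)
        handle.flush()
        os.fsync(handle.fileno())
    # Create the marker under a temporary name and rename it into place,
    # so a reader never sees a half-created marker.
    marker_tmp = os.path.join(directory, "done.tmp")
    marker = os.path.join(directory, "done")
    with open(marker_tmp, "w") as handle:
        handle.write("ok\n")
    os.replace(marker_tmp, marker)


def read_stderr_when_done(directory, timeout_seconds=5.0, poll_seconds=0.05):
    """Reader side: do not read stderr until the marker file exists."""
    marker = os.path.join(directory, "done")
    deadline = time.monotonic() + timeout_seconds
    while not os.path.exists(marker):
        if time.monotonic() >= deadline:
            raise TimeoutError("marker file did not appear in time")
        time.sleep(poll_seconds)
    with open(os.path.join(directory, "stderr"), "r") as handle:
        return handle.read()
```

The sketch only shows the ordering (outputs first, marker last). It does not address the problems mentioned above; for example, on a filesystem where new files become visible late, the marker itself could plausibly be subject to the same delay.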

It's worth noting that this isn't a problem unique to Cromwell, it pops up fairly regularly with these things.

@katevoss


commented Oct 12, 2017

In that case is it a won't-fix situation? We can keep patching if necessary.

@geoffjentry

Contributor

commented Oct 12, 2017

That's the path we've been taking. If this becomes a big enough headache at some point, it would be worth paying the pain to try harder to fix it, but for now it's not, IMO.

@katevoss katevoss removed the PO Cleanup label Jan 18, 2018
@ruchim

Contributor

commented Sep 27, 2018

Closing it as we won't be fixing this for now. It can be re-opened by anyone who experiences this issue again in the future.

@ruchim ruchim closed this Sep 27, 2018
5 participants