Task communication failures should not cause a task to fail #115

Closed
dpmatthews opened this issue Sep 13, 2012 · 9 comments

@dpmatthews
Contributor

Currently, tasks fail if any "cylc task" command fails. However, a communication failure in a task should only ever be a warning: it doesn't mean that the task itself has failed, only that it hasn't managed to report its status back to the server. Consider the case where your suite has died for some reason. In most situations it's preferable that any tasks that have already been submitted run through to completion. For example, if you've submitted a large job that has to queue for ages to get the resources it needs, you're not going to want it to fail just because the "cylc task started" command fails.

  • Note 1: One exception to this is the new "cylc task broadcast" command. Since this alters the behaviour of other tasks, this must continue to fail if it can't contact the server.
  • Note 2: There will need to be some changes to how the server behaves when it has missed messages from a task. For example, if a task completes without all its internal outputs completed then the internal outputs should get completed automatically (this doesn't happen at the moment).
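
A minimal sketch of the behaviour requested above, not cylc's actual code. It assumes a hypothetical `send` callable that raises `ConnectionError` when the suite cannot be reached; `report` and `must_reach_server` are likewise illustrative names. Ordinary task messages downgrade a communication failure to a warning, while a broadcast-style command (Note 1) still lets the error propagate.

```python
import logging

log = logging.getLogger("task_messaging_sketch")


def report(send, message, must_reach_server=False):
    """Try to deliver `message`; return True if it got through.

    `send` is a hypothetical callable that raises ConnectionError when the
    suite cannot be contacted. With must_reach_server=True (the broadcast
    case in Note 1) the error propagates; otherwise it is only logged.
    """
    try:
        send(message)
    except ConnectionError as exc:
        if must_reach_server:
            raise
        log.warning("could not report %r to the suite: %s", message, exc)
        return False
    return True
```

With this shape the job script can ignore the return value for "started" or "succeeded" messages and carry on with its real work.
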
@matthewrmshin
Contributor

Note 2:

One way to do this is to have each submitted task keep a status file at, say, $CYLC_TASK_LOG_ROOT.status with the following contents:

  • Time when the task starts.
  • Time when the task exits.
  • Exit status.
  • Any other messages.

If any previous communications between the submitted task and the suite failed, the suite would be able to recover the above information from the submitted task's status file.
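
As a rough illustration of the status file idea, here is a sketch that assumes a simple `key=value` line format. The format, the key names (`start_time`, `exit_time`, `exit_status`, `message`) and both helper functions are assumptions made for the example, not something defined in this thread.

```python
import os
import tempfile
import time


def write_status(path, **fields):
    """Append key=value lines to the status file (hypothetical format)."""
    with open(path, "a") as handle:
        for key, value in fields.items():
            handle.write(f"{key}={value}\n")


def read_status(path):
    """Parse the status file; repeated 'message' lines are kept in order."""
    status = {"messages": []}
    with open(path) as handle:
        for line in handle:
            key, sep, value = line.rstrip("\n").partition("=")
            if not sep:
                continue  # skip lines that are not key=value
            if key == "message":
                status["messages"].append(value)
            else:
                status[key] = value
    return status


# Usage: the job records its own progress, independent of the network.
status_path = os.path.join(tempfile.mkdtemp(), "job.status")
write_status(status_path, start_time=int(time.time()))
write_status(status_path, message="halfway there")
write_status(status_path, exit_time=int(time.time()), exit_status=0)
print(read_status(status_path))
```

Appending rather than rewriting means a job killed part-way through still leaves its start time on disk.
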

@cylc
Collaborator

cylc commented Sep 13, 2012

Dave - agreed entirely. The current situation basically conforms to my default first-cut treatment for any feature - if something goes wrong then abort, and put in more nuanced behaviour later as required.

@cylc
Collaborator

cylc commented Sep 13, 2012

I guess we'll need some kind of polling mechanism (perhaps checking for Matt's .status file) in case the "cylc task succeeded" call fails to get through at all. And if that fails, assume the task failed or use a new "unknown" task state?

See also the ideas on polling in issue #67.
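
A sketch of what such a poll might look like, assuming the status file holds `key=value` lines with `start_time` and `exit_status` keys. That layout, the function name and the state names it returns are all assumptions made for illustration. The "unknown" state covers the case where nothing useful can be read.

```python
import os


def poll_job_state(status_path):
    """Infer a task state from its status file (hypothetical format)."""
    if not os.path.exists(status_path):
        return "unknown"
    fields = {}
    with open(status_path) as handle:
        for line in handle:
            key, sep, value = line.strip().partition("=")
            if sep:
                fields[key] = value
    if "exit_status" in fields:
        # The job finished; its exit code decides between the two end states.
        return "succeeded" if fields["exit_status"] == "0" else "failed"
    if "start_time" in fields:
        return "running"
    return "unknown"
```
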

@cylc
Collaborator

cylc commented Sep 13, 2012

The full solution, including status files and polling, could be quite involved. But the intermediate solution - don't abort tasks if messaging calls fail, and assume all internal outputs complete if a task reports success without first reporting the internal outputs complete - is relatively easy.
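
The second half of that intermediate solution could look something like the sketch below. The `outputs` table (output name mapped to a completed flag), the function name and the literal "succeeded" message are illustrative assumptions, not cylc's internal data structures.

```python
def handle_message(outputs, message):
    """Update a task's output table in response to an incoming message.

    `outputs` maps output name -> bool (completed?). Both the table and the
    literal "succeeded" message are illustrative assumptions.
    """
    if message == "succeeded":
        # Messages for earlier outputs may have been lost, so a success
        # report implies every registered output is complete.
        for name in outputs:
            outputs[name] = True
    elif message in outputs:
        outputs[message] = True
    return outputs
```
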

@dpmatthews
Contributor Author

See also #86.

@ghost ghost assigned hjoliver Oct 5, 2012
@hjoliver
Member

See also #114.

@matthewrmshin
Contributor

Job status put in place by #282.

@hjoliver
Member

status update

  • Task message calls now retry a configurable number of times, and don't abort the job script if they ultimately fail to get through.
  • Re: "Note 2": any remaining outputs will be automatically completed on receipt of a succeeded message, once #364 ("215 job submission retry (no polling of submitted tasks yet)") is merged.
  • Re: Matt's addendum to "Note 2": we have the task status files, but they are not retrieved by the suite yet (and internal output messages are not written to them?)
  • Re: "Note 1": broadcast is not a messaging command, so it is not affected by the new(ish) messaging retries and non-failure.
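
For readers unfamiliar with the retry behaviour in the first bullet, a minimal sketch follows. It is not the code that was merged: `send`, `max_tries` and `retry_interval` are hypothetical names, and `send` is assumed to raise `ConnectionError` when the suite cannot be reached.

```python
import time


def send_with_retries(send, message, max_tries=3, retry_interval=0.0):
    """Call send(message) up to max_tries times; never raise on failure.

    Returns True once a call succeeds, False if every attempt raised
    ConnectionError. `send` and both settings are hypothetical names.
    """
    for attempt in range(1, max_tries + 1):
        try:
            send(message)
            return True
        except ConnectionError:
            # Wait before the next attempt, but not after the last one.
            if attempt < max_tries:
                time.sleep(retry_interval)
    return False
```

Returning False instead of raising is what keeps the job script running when the message never gets through.
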

To simplify things, I suggest we close this issue once #364 is merged and move cylc's use of the task status files to a new issue: #381

@hjoliver
Member

hjoliver commented Apr 9, 2013

Closing this as per previous comment.

@hjoliver hjoliver closed this as completed Apr 9, 2013