Handling of task execution failure due to resource problems #612

Open
hackermd opened this issue Mar 5, 2017 · 2 comments

hackermd commented Mar 5, 2017

When a task cannot be submitted due to problems with resources (e.g. too much memory allocated), the task remains in state NEW and the collection is not updated:

ERROR    | Invalid state ''NEW'' returned by task 26.
DEBUG    | Opening LocalTransport...
DEBUG    | Checking status of the following PIDs:
DEBUG    | Recovered resource information from files in /home/tissuemaps/.gc3/shellcmd.d: available memory: 2095.9MB, memory used by jobs: 0MB
DEBUG    | Performing matching of resource(s) localhost to task '27' ...
DEBUG    | Checking resource 'localhost' for compatibility with application requirements
INFO     | Rejecting resource 'localhost': requested more memory (3500MB) that resource provides (2.04678e+06KiB, 2.04678e+06KiB per CPU core)
DEBUG    | Task compatiblity check returned 0 matching resources
WARNING  | No compatible resources for task '27' - cannot submit it
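The log reports the same limit in two units (2095.9MB and 2.04678e+06KiB), which makes the rejection harder to read at a glance. A minimal sketch of the comparison, assuming MB here means 10^6 bytes and KiB means 1024 bytes; the function names are illustrative and not part of GC3Pie:

```python
def kib_to_mb(kib):
    """Convert KiB (1024 bytes) to decimal MB (10**6 bytes)."""
    return kib * 1024 / 1e6


def fits(requested_mb, available_kib):
    """Return True if the requested memory fits in what the resource provides."""
    return requested_mb <= kib_to_mb(available_kib)


# Values taken from the log above.
available_kib = 2.04678e6
requested_mb = 3500

print(round(kib_to_mb(available_kib), 1))   # available memory in MB
print(fits(requested_mb, available_kib))    # whether task 27 can be placed
```

Under these assumptions both figures describe the same amount of memory, and the 3500MB request exceeds it, so no retry on `localhost` can succeed without changing the requested resources.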

In this case, the state of the parent task collection needs to be set to STOPPED to break the engine.progress() loop and allow re-submission of the task collection with modified resource parameters.

Related to #610.

riccardomurri commented

> In this case, the state of the parent task collection needs to be set to STOPPED to break the engine.progress() loop and allow re-submission of the task collection with modified resource parameters.

Well, for one, there might not be any parent task (and the Engine should neither know nor care).
Second, the change you suggest would break session-based scripts as they are currently written (and those are, at the moment, the majority use case for GC3Pie), so it's definitely a no-go as suggested.

I'm definitely in favor of detecting this kind of submission failure (= task
cannot run due to constraints with the current state of resources) in order to,
e.g., avoid re-trying to submit a task that is known not to work. However, I
need to think a bit about a good API for that.

Given this API, you could implement this kind of parent notification yourself by
overriding the update_state() method with something like this::

class MyTaskCollection(...):
    # ...
    def update_state(self, **extra):
        # `task_has_failed_submission` is a placeholder for the
        # detection API discussed above; it does not exist yet.
        if any(task_has_failed_submission(task) for task in self.tasks):
            return Run.State.STOPPED
        return super(MyTaskCollection, self).update_state(**extra)
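
To see the pattern run end to end, here is a self-contained sketch that uses stand-in classes instead of GC3Pie. Everything in it (`State`, `FakeTask`, `BaseCollection`, `try_submit`, the `submission_failed` flag) is hypothetical and only mimics the shape of the snippet above; the real detection API is still to be designed:

```python
# Stand-ins for illustration only; these are NOT the gc3libs classes.
class State:
    NEW = "NEW"
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"


class FakeTask:
    def __init__(self, name, requested_mb):
        self.name = name
        self.requested_mb = requested_mb
        self.state = State.NEW
        self.submission_failed = False  # hypothetical flag


class BaseCollection:
    def __init__(self, tasks):
        self.tasks = tasks

    def update_state(self, **extra):
        # Default behaviour of this stand-in: always report RUNNING.
        return State.RUNNING


def task_has_failed_submission(task):
    # Hypothetical check; stands in for the API that does not exist yet.
    return task.submission_failed


class MyTaskCollection(BaseCollection):
    def update_state(self, **extra):
        # Stop the whole collection as soon as one task cannot be placed.
        if any(task_has_failed_submission(t) for t in self.tasks):
            return State.STOPPED
        return super().update_state(**extra)


def try_submit(task, available_mb):
    """Flag the task if it requests more memory than is available."""
    if task.requested_mb > available_mb:
        task.submission_failed = True
    else:
        task.state = State.RUNNING


def run_demo(requests_mb, available_mb=2095.9):
    tasks = [FakeTask(str(i), mb) for i, mb in enumerate(requests_mb)]
    for task in tasks:
        try_submit(task, available_mb)
    return MyTaskCollection(tasks).update_state()


print(run_demo([1000, 3500]))  # one task asks for more than is available
print(run_demo([1000, 1500]))  # every task fits
```

The override keeps the stopping decision inside the collection, so the engine needs no knowledge of parent tasks and session-based scripts that use other collection classes are unaffected.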


hackermd commented Mar 6, 2017

I see your point! Thanks for the workaround. I will update my task collections accordingly.
