Add timeout configuration on atom #320

Open
wjdhollow opened this issue Aug 26, 2016 · 3 comments

Comments

@wjdhollow
Contributor

wjdhollow commented Aug 26, 2016

A build can take a long time, but ClusterRunner is designed to break a large task into small chunks, so the expectation is that a subjob finishes within a reasonable amount of time. If it does not, nodes will not be deallocated in a timely manner, if at all, which can severely impact the cluster. For example, there have been occurrences where a single subjob was stuck indefinitely.

I think that subjob durations should be restricted to a finite time limit.

@tjlee0909

Doesn't seem like a bad idea to add an "atom_timeout" field or something of the sort to clusterrunner.yaml.
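
As a rough illustration of what enforcing such a field could look like on the worker side, here is a minimal Python sketch. It is not ClusterRunner's actual implementation: `run_atom` and its `atom_timeout` parameter are hypothetical names, and it assumes each atom is a single command launched with the standard library's `subprocess.run`.

```python
import subprocess
import sys


def run_atom(command, atom_timeout=None):
    """Run one atom's command (hypothetical helper, not ClusterRunner code).

    command      -- argument list, e.g. ["pytest", "tests/test_foo.py"]
    atom_timeout -- limit in seconds; None means wait indefinitely

    Returns (exit_code, timed_out). exit_code is None when the limit was hit.
    """
    try:
        # subprocess.run kills the child and reaps it before raising
        # TimeoutExpired, so no direct child process is left running.
        completed = subprocess.run(command, timeout=atom_timeout)
    except subprocess.TimeoutExpired:
        return None, True
    return completed.returncode, False


if __name__ == "__main__":
    # An atom that sleeps longer than its limit is reported as timed out.
    slow_atom = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_atom(slow_atom, atom_timeout=0.5))
```

If the atom were run through a shell instead, the timeout would kill only the shell process itself, so anything the shell spawned would need separate handling (for example, a process group).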

@josephharrington
Contributor

Fixing build cancellation would solve part of this problem (the client can cancel a job when it has taken longer than it wants to wait). Currently cancellation will not interrupt in-progress atoms.

I agree with TJ that if we added this, it should be an atom_timeout rather than a subjob_timeout. Subjobs are an intermediate, internal batching that users don't have control over. Users do have control over their atoms.

If we were to add a default atom timeout, it should be very large and configurable in the clusterrunner.conf.
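
One way the "large default, overridable" idea could be resolved is sketched below in Python. Everything here is an assumption for illustration: the `atom_timeout` key, the `resolve_atom_timeout` helper, and the 24-hour default are hypothetical, and the job settings are assumed to be already parsed into a plain dict.

```python
# Hypothetical cluster-wide default: deliberately very large (24 hours).
DEFAULT_ATOM_TIMEOUT_SECONDS = 24 * 60 * 60


def resolve_atom_timeout(job_config, cluster_default=DEFAULT_ATOM_TIMEOUT_SECONDS):
    """Return the effective atom timeout in seconds.

    job_config      -- dict of job settings (assumed already parsed)
    cluster_default -- fallback when the job does not set "atom_timeout"
    """
    value = job_config.get("atom_timeout", cluster_default)
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("atom_timeout must be a number of seconds")
    if value <= 0:
        raise ValueError("atom_timeout must be positive")
    return float(value)
```

A job that sets nothing would get the large default, while a job with `{"atom_timeout": 600}` would be limited to ten minutes per atom.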

@wjdhollow
Contributor Author

Atom timeout makes sense.

I filed this issue because it looked like a bunch of subjobs were stuck on the dashboard. After talking with Joey, it sounds like some may be false positives caused by #287.

@cmcginty changed the title from "Timeout on sub job duration?" to "Add timeout configuration on atom" on Jan 4, 2018