Adhering to computation cost budget better #30
You could look at this answer. From there, dask has a way to essentially cancel running work.
@eddiebergman what do we do with the interrupted evaluation?
Based on a lookover, the "hot-loop" is here, with the break condition here (lines 750 to 751 in 54ce41c).

**To return on time**

I would probably do something along the lines of this for the dask case; this should basically kill all jobs running in dask and wait for all of them to return. The wait part isn't fully necessary, but in principle it should be fine:

```python
self.client.close()
for future in self.futures:
    future.cancel()
# return_when must be passed as a keyword; a positional second argument
# would be interpreted as the timeout
concurrent.futures.wait(self.futures, return_when=concurrent.futures.ALL_COMPLETED)
```

Dask has the property that you can cancel running jobs, but in the non-dask case (here), where you're just calling the function directly in the same process, you can't cancel it. Killing it would mean killing the whole thing (lines 572 to 574 in 54ce41c).
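To make the running-vs-pending distinction concrete, here is a small self-contained sketch using the standard-library `concurrent.futures` (the thread pool, `slow` function, and timings are illustrative, not from the repo): only queued futures can be cancelled, while the one already executing returns `False` from `cancel()` and runs to completion.

```python
import concurrent.futures
import time

def slow(x):
    time.sleep(0.5)
    return x

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
    futures = [ex.submit(slow, i) for i in range(3)]
    time.sleep(0.1)  # let the first future actually start running
    cancelled = [f.cancel() for f in futures]
    # Wait for everything to finish or be cancelled before shutting down
    concurrent.futures.wait(futures, return_when=concurrent.futures.ALL_COMPLETED)

print(cancelled)  # [False, True, True]: only the pending futures were cancelled
```

The same distinction is why an in-process function call cannot be interrupted at all: there is no pending state to cancel, and the "worker" is the main process itself.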
To circumvent this, you would need to run it in a subprocess of some kind and use signals.

**To inform the process so you can save**

This is much harder, especially when you don't control the target function. The first thing you need is the handle of the process that is running the target function. Then you can send it a signal:

```python
import psutil

process = psutil.Process(<process-id of the thing to signal>)
process.terminate()
```

The correct procedure here by OS standards is to clean up the program and finish soon. The way to do this is to use Python's `signal` module:

```python
import signal

def callback(signal_num, framestack) -> None:
    # ... cleanup, save a model, whatever
    ...

signal.signal(signal.SIGTERM, callback)
```

The tricky part is that users have to specify this, i.e. their target function is going to be called, and this callback has to be registered once inside the process that is running the target function. I do not know how you'd like to do that. I think your best approach is simply to give an example and move on. Trying to automatically handle this stuff would be a nightmare to do and maintain.

**P.S.**

This won't work when using a custom remote dask server, as you have no way to send a signal to the other machine running the process (or maybe dask does?); it only works if things are done with local processes. Perhaps dask has some unified way of handling this.
The current implementation waits for all started jobs when the runtime budget is exhausted. This does make sense when using function evaluations or number of iterations as budget, but not when specifying the maximum computation cost in seconds.
Toy failure mode:
The computational budget is 1 h, but a new job that would take, e.g., 30 min is submitted after 59 min of optimization. The optimizer would then wait for this job to finish and therefore overshoot the maximum computational budget of 1 h.
For now, a quick fix could be to simply stop all workers when the runtime budget is exhausted; however, this would result in potentially lost compute time. It might therefore also be interesting to think of a way to checkpoint the optimizer's state in order to resume training.
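A complementary mitigation for the failure mode above is to stop submitting once the remaining budget can no longer cover a job's estimated runtime, rather than cancelling after the fact. A toy sketch, where `run_with_budget`, `est_job_s`, and `toy_job` are hypothetical names for illustration, not part of the codebase:

```python
import time

def run_with_budget(run_job, jobs, budget_s, est_job_s):
    """Run jobs only while the estimated runtime still fits the budget
    (hypothetical helper, not the repo's scheduler)."""
    start = time.monotonic()
    done = []
    for job in jobs:
        remaining = budget_s - (time.monotonic() - start)
        if remaining < est_job_s:
            break                      # stop early instead of overshooting
        done.append(run_job(job))
    return done

def toy_job(j):
    time.sleep(0.2)                    # pretend each evaluation takes ~0.2 s
    return j

finished = run_with_budget(toy_job, range(10), budget_s=1.0, est_job_s=0.25)
print(finished)  # only the jobs that fit within the 1 s budget
```

The trade-off is a runtime estimate: a conservative `est_job_s` never overshoots but may leave budget unused, which is where checkpointing interrupted jobs would help.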