Adhering to computation cost budget better #30
You could look at this answer. From there, dask has a way to essentially cancel running work.
@eddiebergman what do we do with the interrupted evaluation?
Based on a lookover, the "hot-loop" is here, with the break condition here (lines 750 to 751 in 54ce41c).

**To return on time**

I would probably do something along the lines of this for the dask case; this should basically kill all jobs running in dask and wait for all of them to return. The wait part isn't fully necessary, but in principle it should be fine:

```python
self.client.close()
for future in self.futures:
    future.cancel()
# return_when must be passed as a keyword; a positional second argument
# would be interpreted as the timeout
concurrent.futures.wait(self.futures, return_when=concurrent.futures.ALL_COMPLETED)
```

Dask has the property that you can cancel running jobs, but in the non-dask case (here), where you're just calling the function directly in the same process, you can't cancel it. Killing it would mean killing the whole thing (lines 572 to 574 in 54ce41c).
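To make the running-vs-pending distinction concrete, here is a small self-contained sketch using the standard-library `concurrent.futures` (the thread pool, `slow` function, and timings are illustrative, not from the repo): only queued futures can be cancelled, while the one already executing returns `False` from `cancel()` and runs to completion.

```python
import concurrent.futures
import time

def slow(x):
    time.sleep(0.5)
    return x

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
    futures = [ex.submit(slow, i) for i in range(3)]
    time.sleep(0.1)  # let the first future actually start running
    cancelled = [f.cancel() for f in futures]
    # Wait for everything to finish or be cancelled before shutting down
    concurrent.futures.wait(futures, return_when=concurrent.futures.ALL_COMPLETED)

print(cancelled)  # [False, True, True]: only the pending futures were cancelled
```

The same distinction is why an in-process function call cannot be interrupted at all: there is no pending state to cancel, and the "worker" is the main process itself.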
To circumvent this, you would need to run it in a subprocess of some kind and use signals.

**To inform the process so you can save**

This is much harder, especially when you don't control the target function. The first thing you need is the handle of the process that is running the target function. Then you can send it a signal:

```python
import psutil

process = psutil.Process(<process-id of the thing to signal>)
process.terminate()
```

The correct procedure here by OS standards is to clean up the program and finish soon. The way to do this is to use Python's `signal` module:

```python
import signal

def callback(signal_num, framestack) -> None:
    # ... cleanup, save a model, whatever
    ...

signal.signal(signal.SIGTERM, callback)
```

The tricky part is that users have to specify this, i.e. their target function is going to be called, and this callback has to be registered once inside the process that is running the target function. I do not know how you'd like to do that. I think your best approach is simply to give an example and move on. Trying to automatically handle this stuff would be a nightmare to do and maintain.

**P.S.**

This won't work when using a custom remote dask server, as you have no way to send a signal to the other machine running the process (or maybe dask does?); it only works if things are done with local processes. Perhaps dask has some unified way of handling this.
The current implementation waits for all started jobs when the runtime budget is exhausted. This does make sense when using function evaluations or number of iterations as budget, but not when specifying the maximum computation cost in seconds.
Toy failure mode:
The computational budget is 1 h, but a new job that would take, e.g., 30 min is submitted after 59 min of optimization. The optimizer would then wait for this job to finish and therefore overshoot the maximum computational budget of 1 h.
For now, a quick fix could be to simply stop all workers when the runtime budget is exhausted; however, this would result in potentially lost compute time. It might therefore also be interesting to think of a way to checkpoint the optimizer's state in order to resume training.
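A complementary mitigation for the failure mode above is to stop submitting once the remaining budget can no longer cover a job's estimated runtime, rather than cancelling after the fact. A toy sketch, where `run_with_budget`, `est_job_s`, and `toy_job` are hypothetical names for illustration, not part of the codebase:

```python
import time

def run_with_budget(run_job, jobs, budget_s, est_job_s):
    """Run jobs only while the estimated runtime still fits the budget
    (hypothetical helper, not the repo's scheduler)."""
    start = time.monotonic()
    done = []
    for job in jobs:
        remaining = budget_s - (time.monotonic() - start)
        if remaining < est_job_s:
            break                      # stop early instead of overshooting
        done.append(run_job(job))
    return done

def toy_job(j):
    time.sleep(0.2)                    # pretend each evaluation takes ~0.2 s
    return j

finished = run_with_budget(toy_job, range(10), budget_s=1.0, est_job_s=0.25)
print(finished)  # only the jobs that fit within the 1 s budget
```

The trade-off is a runtime estimate: a conservative `est_job_s` never overshoots but may leave budget unused, which is where checkpointing interrupted jobs would help.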