
What are the reasons for 'Worker died unexpectedly'? #33

Closed
tzc1994 opened this issue Dec 12, 2018 · 5 comments

Comments

@tzc1994

tzc1994 commented Dec 12, 2018

Hi, when I use this AutoML tool I get the error "Worker died unexpectedly". What are the possible reasons for this, and how can I solve it?
Thanks!

@shukon
Collaborator

shukon commented Dec 12, 2018

It is always easier (and often only possible at all) to answer if you provide a stack trace of the error message, details of your hardware and environment, and a minimal working example (so the error can be reproduced and, if necessary, fixed).

@tzc1994
Author

tzc1994 commented Dec 12, 2018

@shukon Thank you for the reply!
I am using this tool to search the number of filters in one layer of a CNN, with PyTorch as the backend.
I started from the MNIST example and its PyTorch code file and modified it for my project.
Here is my `get_configspace`:

```python
@staticmethod
def get_configspace():
    """
    Builds the configuration space with the needed hyperparameters.
    It is easy to add different types of hyperparameters: besides float
    hyperparameters on a log scale, it can also handle categorical input
    parameters.
    :return: ConfigurationSpace object
    """
    cs = CS.ConfigurationSpace()
    # default_value must be a float here, not the string '1e-2'
    lr = CSH.UniformFloatHyperparameter('lr', lower=1e-6, upper=1e-1, default_value=1e-2, log=True)
    sgd_momentum = CSH.UniformFloatHyperparameter('sgd_momentum', lower=0.0, upper=0.99, default_value=0.9, log=False)
    cs.add_hyperparameters([lr, sgd_momentum])
    #filter_num1 = CSH.UniformIntegerHyperparameter('filter_num1', lower=16, upper=48, default_value=32, log=True)
    #filter_num2 = CSH.UniformIntegerHyperparameter('filter_num2', lower=32, upper=96, default_value=64, log=True)
    #filter_num3 = CSH.UniformIntegerHyperparameter('filter_num3', lower=32, upper=96, default_value=64, log=True)
    #filter_num4 = CSH.UniformIntegerHyperparameter('filter_num4', lower=64, upper=192, default_value=128, log=True)
    #filter_num5 = CSH.UniformIntegerHyperparameter('filter_num5', lower=64, upper=192, default_value=128, log=True)
    #filter_num6 = CSH.UniformIntegerHyperparameter('filter_num6', lower=128, upper=384, default_value=256, log=True)
    #filter_num7 = CSH.UniformIntegerHyperparameter('filter_num7', lower=128, upper=384, default_value=256, log=True)
    #filter_num8 = CSH.UniformIntegerHyperparameter('filter_num8', lower=128, upper=384, default_value=256, log=True)
    filter_num9 = CSH.UniformIntegerHyperparameter('filter_num9', lower=128, upper=384, default_value=256, log=True)
    #filter_num10 = CSH.UniformIntegerHyperparameter('filter_num10', lower=128, upper=384, default_value=256, log=True)
    #filter_num11 = CSH.UniformIntegerHyperparameter('filter_num11', lower=128, upper=384, default_value=256, log=True)
    #filter_num12 = CSH.UniformIntegerHyperparameter('filter_num12', lower=256, upper=768, default_value=512, log=True)
    #filter_num13 = CSH.UniformIntegerHyperparameter('filter_num13', lower=256, upper=768, default_value=512, log=True)

    cs.add_hyperparameters([filter_num9])
    return cs
```
I am only searching one layer for now.
The resulting log output is as follows:

```
DEBUG:hpbandster:DISPATCHER: Finished worker discovery
DEBUG:hpbandster.run_0.worker.ubuntu1604.16378:WORKER: shutting down now!
DEBUG:hpbandster:DISPATCHER: Starting worker discovery
DEBUG:hpbandster:DISPATCHER: Found 0 potential workers, 1 currently in the pool.
INFO:hpbandster:DISPATCHER: removing dead worker, hpbandster.run_0.worker.ubuntu1604.16378140503776417536
INFO:hpbandster:Job (0, 0, 0) was not completed
DEBUG:hpbandster:HBMASTER: number of workers changed to 0
DEBUG:hpbandster:adjust_queue_size: lock accquired
INFO:hpbandster:HBMASTER: adjusted queue size to (-1, 0)
DEBUG:hpbandster:DISPATCHER: job (0, 0, 0) finished
DEBUG:hpbandster:DISPATCHER: Trying to submit another job.
DEBUG:hpbandster:HBMASTER: running jobs: 1, queue sizes: (-1, 0) -> wait
DEBUG:hpbandster:DISPATCHER: jobs to submit = 0, number of idle workers = 0 -> waiting!
DEBUG:hpbandster:DISPATCHER: register_result: lock acquired
DEBUG:hpbandster:DISPATCHER: job (0, 0, 0) on hpbandster.run_0.worker.ubuntu1604.16378140503776417536 finished
DEBUG:hpbandster:job_id: (0, 0, 0)
kwargs: {'config': {'filter_num9': 252, 'lr': 0.029349927740529143, 'sgd_momentum': 0.6664149273388896}, 'budget': 10.0, 'working_directory': '.'}
result: None
exception: Worker died unexpectedly.

DEBUG:hpbandster:job_callback for (0, 0, 0) started
DEBUG:hpbandster:job_callback for (0, 0, 0) got condition
WARNING:hpbandster:job (0, 0, 0) failed with exception
Worker died unexpectedly.
DEBUG:hpbandster:Only 1 run(s) for budget 10.000000 available, need more than 5 -> can't build model!
DEBUG:hpbandster:job_callback for (0, 0, 0) finished
DEBUG:hpbandster:DISPATCHER: Finished worker discovery
DEBUG:hpbandster:DISPATCHER: Starting worker discovery
DEBUG:hpbandster:DISPATCHER: Found 0 potential workers, 0 currently in the pool.
DEBUG:hpbandster:DISPATCHER: Finished worker discovery
```

Also, I would like to know: what is the maximum number of hyperparameters this tool can search?
Thanks!

@sfalkner
Collaborator

The message in the second line,
`WORKER: shutting down now!`, suggests that your worker script terminated without waiting for any jobs. Could you please post the part of the script responsible for the worker here?
I think I know what the problem is, but want to confirm with your code first.
Thanks!

@tzc1994
Author

tzc1994 commented Dec 13, 2018

@sfalkner @shukon
Thanks, I have solved this problem. The reason is that the loss returned from the compute function was a PyTorch CUDA tensor, so I used tensor.cpu() and float() to copy the CUDA tensor to the CPU. Luckily, it worked!
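
The fix described above can be sketched as a small helper (the name `to_python_float` is illustrative, not part of HPBandSter; it is written duck-typed so it also works without torch installed):

```python
def to_python_float(loss):
    """Convert a loss value (possibly a CUDA tensor) to a built-in float.

    Tensor-like objects (e.g. a 0-d torch.Tensor) expose .item(), which
    already moves the value to the CPU and unwraps it; plain numbers are
    simply passed through float().
    """
    if hasattr(loss, "item"):
        return float(loss.item())
    return float(loss)
```

In the worker's compute method, return `{'loss': to_python_float(loss), ...}` instead of the raw tensor, so the result can be pickled and sent back to the master.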

@sfalkner
Collaborator

Make sure all values in the info dictionary are Python built-in types as well; that can also lead to workers dying. If your issue is resolved, please close it. Thank you!
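
A small sketch of what that means in practice (the helper `sanitize_info` and its name are mine, not part of HPBandSter; it duck-types tensor-like values so it runs without torch or numpy installed):

```python
def sanitize_info(info):
    """Return a copy of the info dict with tensor/array values converted
    to Python built-in types, so the result can be pickled safely."""
    out = {}
    for key, value in info.items():
        if hasattr(value, "item") and getattr(value, "shape", None) == ():
            out[key] = value.item()    # 0-d tensor / scalar -> built-in number
        elif hasattr(value, "tolist"):
            out[key] = value.tolist()  # tensor/array with entries -> list
        else:
            out[key] = value           # already a built-in type
    return out
```

Calling this on the `info` dict just before returning from compute keeps any stray tensors out of the result.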

@tzc1994 tzc1994 closed this as completed Dec 13, 2018