
What are the reasons for 'Worker died unexpectedly'? #33

Closed
tzc1994 opened this issue Dec 12, 2018 · 5 comments

Comments

@tzc1994

tzc1994 commented Dec 12, 2018

Hi, when I use this AutoML tool I get the error "Worker died unexpectedly". What are the possible reasons for this, and how can I solve it?
Thanks!

@shukon
Collaborator

shukon commented Dec 12, 2018

It is always easier (and often only possible at all) to answer if you provide a stack trace of the error message, details of your hardware and environment, and a minimal working example (so the error can be reproduced and, if necessary, fixed).

@tzc1994
Author

tzc1994 commented Dec 12, 2018

@shukon Thank you for the reply!
I am using this tool to search the number of filters in one layer of a CNN, with PyTorch as the backend.
I started from the MNIST example and its PyTorch code file and modified it for my project.
Here is my `get_configspace`:

```python
@staticmethod
def get_configspace():
    """
    Builds the configuration space with the needed hyperparameters.
    It is easy to add different types of hyperparameters: besides float
    hyperparameters on a log scale, it can also handle categorical input
    parameters.
    :return: ConfigurationSpace object
    """
    cs = CS.ConfigurationSpace()
    # default_value must be a float here, not the string '1e-2'
    lr = CSH.UniformFloatHyperparameter('lr', lower=1e-6, upper=1e-1, default_value=1e-2, log=True)
    sgd_momentum = CSH.UniformFloatHyperparameter('sgd_momentum', lower=0.0, upper=0.99, default_value=0.9, log=False)
    cs.add_hyperparameters([lr, sgd_momentum])
    #filter_num1 = CSH.UniformIntegerHyperparameter('filter_num1', lower=16, upper=48, default_value=32, log=True)
    #filter_num2 = CSH.UniformIntegerHyperparameter('filter_num2', lower=32, upper=96, default_value=64, log=True)
    #filter_num3 = CSH.UniformIntegerHyperparameter('filter_num3', lower=32, upper=96, default_value=64, log=True)
    #filter_num4 = CSH.UniformIntegerHyperparameter('filter_num4', lower=64, upper=192, default_value=128, log=True)
    #filter_num5 = CSH.UniformIntegerHyperparameter('filter_num5', lower=64, upper=192, default_value=128, log=True)
    #filter_num6 = CSH.UniformIntegerHyperparameter('filter_num6', lower=128, upper=384, default_value=256, log=True)
    #filter_num7 = CSH.UniformIntegerHyperparameter('filter_num7', lower=128, upper=384, default_value=256, log=True)
    #filter_num8 = CSH.UniformIntegerHyperparameter('filter_num8', lower=128, upper=384, default_value=256, log=True)
    filter_num9 = CSH.UniformIntegerHyperparameter('filter_num9', lower=128, upper=384, default_value=256, log=True)
    #filter_num10 = CSH.UniformIntegerHyperparameter('filter_num10', lower=128, upper=384, default_value=256, log=True)
    #filter_num11 = CSH.UniformIntegerHyperparameter('filter_num11', lower=128, upper=384, default_value=256, log=True)
    #filter_num12 = CSH.UniformIntegerHyperparameter('filter_num12', lower=256, upper=768, default_value=512, log=True)
    #filter_num13 = CSH.UniformIntegerHyperparameter('filter_num13', lower=256, upper=768, default_value=512, log=True)

    cs.add_hyperparameters([filter_num9])
    return cs
```
I am only searching one layer for now.
The resulting log output is as follows:

```
DEBUG:hpbandster:DISPATCHER: Finished worker discovery
DEBUG:hpbandster.run_0.worker.ubuntu1604.16378:WORKER: shutting down now!
DEBUG:hpbandster:DISPATCHER: Starting worker discovery
DEBUG:hpbandster:DISPATCHER: Found 0 potential workers, 1 currently in the pool.
INFO:hpbandster:DISPATCHER: removing dead worker, hpbandster.run_0.worker.ubuntu1604.16378140503776417536
INFO:hpbandster:Job (0, 0, 0) was not completed
DEBUG:hpbandster:HBMASTER: number of workers changed to 0
DEBUG:hpbandster:adjust_queue_size: lock accquired
INFO:hpbandster:HBMASTER: adjusted queue size to (-1, 0)
DEBUG:hpbandster:DISPATCHER: job (0, 0, 0) finished
DEBUG:hpbandster:DISPATCHER: Trying to submit another job.
DEBUG:hpbandster:HBMASTER: running jobs: 1, queue sizes: (-1, 0) -> wait
DEBUG:hpbandster:DISPATCHER: jobs to submit = 0, number of idle workers = 0 -> waiting!
DEBUG:hpbandster:DISPATCHER: register_result: lock acquired
DEBUG:hpbandster:DISPATCHER: job (0, 0, 0) on hpbandster.run_0.worker.ubuntu1604.16378140503776417536 finished
DEBUG:hpbandster:job_id: (0, 0, 0)
kwargs: {'config': {'filter_num9': 252, 'lr': 0.029349927740529143, 'sgd_momentum': 0.6664149273388896}, 'budget': 10.0, 'working_directory': '.'}
result: None
exception: Worker died unexpectedly.

DEBUG:hpbandster:job_callback for (0, 0, 0) started
DEBUG:hpbandster:job_callback for (0, 0, 0) got condition
WARNING:hpbandster:job (0, 0, 0) failed with exception
Worker died unexpectedly.
DEBUG:hpbandster:Only 1 run(s) for budget 10.000000 available, need more than 5 -> can't build model!
DEBUG:hpbandster:job_callback for (0, 0, 0) finished
DEBUG:hpbandster:DISPATCHER: Finished worker discovery
DEBUG:hpbandster:DISPATCHER: Starting worker discovery
DEBUG:hpbandster:DISPATCHER: Found 0 potential workers, 0 currently in the pool.
DEBUG:hpbandster:DISPATCHER: Finished worker discovery
```

Also, I would like to know: what is the maximum number of hyperparameters this tool can search?
Thanks!

@sfalkner
Collaborator

The message in the second line,
`WORKER: shutting down now!`, suggests that your worker script terminated without waiting for any jobs. Could you please post the part of the script responsible for the worker here?
I think I know what the problem is, but want to confirm with your code first.
Thanks!

@tzc1994
Author

tzc1994 commented Dec 13, 2018

@sfalkner @shukon
Thanks, I have solved this problem. The reason is that the loss returned from the compute function was a PyTorch CUDA tensor, so I used tensor.cpu() and float() to copy the CUDA tensor to the CPU. Luckily, it worked!
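
The fix described above can be sketched as a small helper (the name `to_python_float` is illustrative, not part of HPBandSter; it is written duck-typed so it also works without torch installed):

```python
def to_python_float(loss):
    """Convert a loss value (possibly a CUDA tensor) to a built-in float.

    Tensor-like objects (e.g. a 0-d torch.Tensor) expose .item(), which
    already moves the value to the CPU and unwraps it; plain numbers are
    simply passed through float().
    """
    if hasattr(loss, "item"):
        return float(loss.item())
    return float(loss)
```

In the worker's compute method, return `{'loss': to_python_float(loss), ...}` instead of the raw tensor, so the result can be pickled and sent back to the master.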

@sfalkner
Collaborator

Make sure all values in the info dictionary are Python built-in types as well; that can also lead to workers dying. If your issue is resolved, please close it. Thank you!
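
A small sketch of what that means in practice (the helper `sanitize_info` and its name are mine, not part of HPBandSter; it duck-types tensor-like values so it runs without torch or numpy installed):

```python
def sanitize_info(info):
    """Return a copy of the info dict with tensor/array values converted
    to Python built-in types, so the result can be pickled safely."""
    out = {}
    for key, value in info.items():
        if hasattr(value, "item") and getattr(value, "shape", None) == ():
            out[key] = value.item()    # 0-d tensor / scalar -> built-in number
        elif hasattr(value, "tolist"):
            out[key] = value.tolist()  # tensor/array with entries -> list
        else:
            out[key] = value           # already a built-in type
    return out
```

Calling this on the `info` dict just before returning from compute keeps any stray tensors out of the result.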

@tzc1994 tzc1994 closed this as completed Dec 13, 2018