Stuck at INFO:root:__broadcast_initial_config_to_client. MSG_TYPE_S2C_INIT_CONFIG. #1
Comments
Hi, this happens when MPI is not configured correctly.
to
This way, you can test whether your MPI configuration is correct.
Thanks very much! As you suggested, I tried using the MPI send() operation instead of the broadcast operation, but it still doesn't work. I found the root problem: when I run the code in Docker, other problems appear. "shm" was set too small, so there is not enough shared memory. Finally, thank you for responding to my problem; your code is very nice~
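For readers who hit the same shared-memory limit, here is a minimal sketch for checking how much space the container's shared-memory mount has before launching training. It assumes a Linux container where shared memory is mounted at `/dev/shm`; the helper name `shm_report` and the 1 GiB threshold are illustrative assumptions, not part of this project.

```python
import os
import shutil


def shm_report(path="/dev/shm", min_free_bytes=1 << 30):
    """Return (total_bytes, free_bytes, ok) for a mount, or None if it is absent.

    The 1 GiB default threshold is an arbitrary assumption; pick a value
    that matches your model and number of worker processes.
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total, usage.free, usage.free >= min_free_bytes


report = shm_report()
if report is None:
    print("/dev/shm not found on this system")
else:
    total, free, ok = report
    print(f"/dev/shm total={total / 2**20:.0f} MiB free={free / 2**20:.0f} MiB ok={ok}")
```

Docker's default `/dev/shm` size is 64 MB, which can be raised with the `--shm-size` option of `docker run`. On a Kubernetes cluster the equivalent is usually a memory-backed volume mounted at `/dev/shm`, which the cluster administrator may need to set up.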
I am glad to hear that you like my implementation. Can you run my program now? You also need to check your physical configuration to make sure MPI communication actually works. When the program is stuck without logging for more than 3 minutes, it means the multiprocessing program has hit a bug. It could be 1) GPU memory is not enough; try reducing your batch size or worker number. 2) MPI configuration. Make sure the bandwidth is enough to transfer a model of this size. For 1), you have to retune your hyper-parameters.
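The "no logging for more than 3 minutes" rule of thumb above can be checked automatically. The following is a minimal sketch using only the standard `logging` module; the class name `LastLogTime` and the constant `STALL_SECONDS` are hypothetical and not part of this project's code.

```python
import logging
import time

STALL_SECONDS = 180  # the 3-minute rule of thumb from the comment above


class LastLogTime(logging.Handler):
    """Logging handler that remembers when the last record was emitted."""

    def __init__(self):
        super().__init__()
        self.last_seen = time.monotonic()

    def emit(self, record):
        # Every log record counts as a sign of progress.
        self.last_seen = time.monotonic()

    def stalled(self, now=None, limit=STALL_SECONDS):
        """Return True if no record has been seen for more than `limit` seconds."""
        now = time.monotonic() if now is None else now
        return (now - self.last_seen) > limit


# Attach the tracker to the root logger so INFO:root:... lines are counted.
tracker = LastLogTime()
root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(tracker)
```

A background thread or an external supervisor could poll `tracker.stalled()` and, when it returns True, dump stack traces or abort the run instead of waiting indefinitely.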
Thank you for the reminder. My experiments run on a K8s cluster, and the account I have been allocated has 8 V100 (32G) GPUs, so memory should not be a big problem. I will contact the cluster administrator to restart the Docker container later. If I succeed in running your code, I will reply to you as soon as possible.
Hi, I ran the code successfully on a single-GPU server. Thank you~
Great! We plan to release a distributed learning library soon. If you use our code for research or a project, please cite the FedNAS paper and our framework/library paper. Thanks.
INFO:root:train_dl_global number = 6250
INFO:root:test_dl_global number = 1250
<class 'darts.model_search.Network'>
INFO:root:__broadcast_initial_config_to_client. MSG_TYPE_S2C_INIT_CONFIG.
Hello, when I run the code it gets stuck at this step and produces no further output. Is there some configuration I need to set? Thanks.