
stuck at INFO:root:__broadcast_initial_config_to_client. MSG_TYPE_S2C_INIT_CONFIG. #1

Closed
renyan1998 opened this issue Apr 24, 2020 · 6 comments

Comments

@renyan1998

INFO:root:train_dl_global number = 6250
INFO:root:test_dl_global number = 1250
<class 'darts.model_search.Network'>
INFO:root:__broadcast_initial_config_to_client. MSG_TYPE_S2C_INIT_CONFIG.
Hello, when I run the code it gets stuck at this step and produces no further output. Is there some configuration I need to set? Thank you.

@chaoyanghe
Owner

chaoyanghe commented Apr 24, 2020

Hi, this happens when MPI is not configured correctly.
Please follow the MPI configuration instructions in the README.md.
Before running my program, it is better to use a simple MPI program to test whether the send() and broadcast() MPI operations work correctly. Or you can try changing the following code:

    def init_config(self):
        self.__broadcast_initial_config_to_client()
        """
        comm.bcast (tree structure) is faster than a loop send/receive operation:
        https://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/
        """
        # for process_id in range(1, self.size):
        #     self.__send_initial_config_to_client(process_id)

to

    def init_config(self):
        # self.__broadcast_initial_config_to_client()
        """
        comm.bcast (tree structure) is faster than a loop send/receive operation:
        https://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/
        """
        for process_id in range(1, self.size):
            self.__send_initial_config_to_client(process_id)

This way, you can test whether your MPI configuration is correct.
The send() MPI operation is a little slower than broadcast() when the number of workers is large, but in FL research the worker count is usually smaller than 1000, so this will not noticeably increase the communication time.
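
For example, a minimal mpi4py script along these lines (just a sketch; the file name and the message payload are placeholders) can verify both operations before running FedNAS:

    # minimal_mpi_test.py -- run with: mpiexec -np 4 python minimal_mpi_test.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # 1) test broadcast(): rank 0 sends a small config dict to all processes
    config = {"msg_type": "S2C_INIT_CONFIG"} if rank == 0 else None
    config = comm.bcast(config, root=0)
    print("rank %d got broadcast: %s" % (rank, config))

    # 2) test send()/recv(): rank 0 sends to each worker individually
    if rank == 0:
        for process_id in range(1, size):
            comm.send(config, dest=process_id, tag=0)
    else:
        data = comm.recv(source=0, tag=0)
        print("rank %d got send: %s" % (rank, data))

If this small script also hangs, the problem is in the MPI/cluster setup rather than in FedNAS itself.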

@renyan1998
Author

Thanks very much! As you suggested, I tried the send() MPI operation instead of broadcast(), but it still didn't work. I then found the root cause: when I run the code in Docker, other problems appear. The "shm" size was set too small, so there is not enough shared memory to use. Thank you for responding to my questions, and your code is very nice~
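
For reference, Docker's default /dev/shm is only 64 MB; a larger size can be requested with the --shm-size flag when starting the container (the value below is just an example):

    docker run --shm-size=8g <your-image>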

@chaoyanghe
Owner

I am glad to hear that you like my implementation.

Can you run my program now?

Yeah, you also need to check your physical configuration to make sure the MPI communication can actually run. When the program is stuck without logging for more than 3 minutes, it means the multiprocessing program has hit a problem. It could be 1) GPU memory is not enough; try to reduce your batch size or worker number, or 2) the MPI configuration; make sure the bandwidth is enough for the model size.

For 1), you have to retune your hyper-parameters.
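
As a quick check for 1), you can watch GPU memory with nvidia-smi while the job appears stuck; the number of processes passed to mpirun via -np (one server plus the clients) determines how many model replicas share that memory. For example (the entry-point name here is hypothetical):

    nvidia-smi                    # inspect current GPU memory usage
    mpirun -np 9 python main.py   # 1 server + 8 clients; reduce -np or the batch size if memory runs out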

@renyan1998
Author

Thank you for the reminder. My experiments run on a K8s cluster, and the account allocated to me has 8 V100 (32 GB) GPUs, so memory should not be a big problem. I will contact the cluster administrator to restart the Docker container later; once I succeed in running your code, I will reply to you as soon as possible.

@renyan1998
Author

Hi, I ran the code successfully on a single-GPU server. Thank you~

@chaoyanghe
Owner

Great!

We plan to release a distributed learning library soon. If you use our code for research or a project, please cite the FedNAS paper and our framework/library paper, thanks.

@chaoyanghe chaoyanghe changed the title from "Stuck while running" to "stuck at INFO:root:__broadcast_initial_config_to_client. MSG_TYPE_S2C_INIT_CONFIG." Apr 24, 2020