Turning on async (BYTEPS_ENABLE_ASYNC) crashes the bps server #357
For now, the asynchronous mode for PyTorch is supported only when you use the DistributedOptimizer approach, like this example. Your code currently uses the DDP wrapper, for which we haven't implemented the async mode yet.
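For reference, a minimal sketch of that DistributedOptimizer-style setup (modeled on the byteps.torch API used by the linked example; the model and learning rate below are placeholders):

```python
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())

model = torch.nn.Linear(10, 10).cuda()  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap the plain optimizer instead of using the DDP wrapper.
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# One-time broadcast of initial parameters and optimizer state from rank 0.
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)
```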
Hey @ymjiang, thanks for the info! Nevertheless, after switching to benchmark_byteps.py, the issue is still there. FYI:
This error indicates that you did not kill the previous bpslaunch process. @ruipeterpan, run `pkill bpslaunch; pkill python3` first.
@ruipeterpan Did you launch any worker in your example above? The server should do nothing but wait when there is no worker.
@vycezhong If only the server gets launched, it starts the ZMQ recv thread and waits without an error. As soon as the workers are launched, the server crashes.
OK, I can reproduce it. I will look into it.
@ruipeterpan Could you please test whether #359 fixes your issue?
@vycezhong thanks for the fix! The server-crashing issue is resolved by #359, but I'm seeing some weird behavior in the training loss curve after applying the changes in the PR. I'll spend some time double-checking to make sure it's not a problem on my end.
@ruipeterpan You also need to enable async mode for the workers.
I had already toggled BYTEPS_ENABLE_ASYNC on all workers, servers, and the scheduler for both the async-mode and sync-mode runs.
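For context, a hedged sketch of one way to make sure the flag reaches every role when launching through bpslaunch (the addresses, counts, and the benchmark_byteps.py command are placeholders; the usual ps-lite DMLC_* variables are assumed):

```python
import os
import subprocess

# Environment shared by the scheduler, servers, and workers (placeholder values).
COMMON = {
    "BYTEPS_ENABLE_ASYNC": "1",          # must be set on every role, not only the server
    "DMLC_PS_ROOT_URI": "172.31.41.74",  # placeholder scheduler address
    "DMLC_PS_ROOT_PORT": "1234",
    "DMLC_NUM_WORKER": "4",
    "DMLC_NUM_SERVER": "1",
}

def launch(role, cmd=("bpslaunch",), extra=None):
    """Launch one BytePS role with the shared environment plus per-role overrides."""
    env = {**os.environ, **COMMON, "DMLC_ROLE": role, **(extra or {})}
    return subprocess.Popen(list(cmd), env=env)

launch("scheduler")
launch("server")
# Workers wrap the actual training command.
launch("worker", ("bpslaunch", "python3", "benchmark_byteps.py"),
       extra={"DMLC_WORKER_ID": "0"})
```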
@ymjiang I think it should be the code at byteps/byteps/torch/__init__.py, line 204 (commit 6957bc3).
I do not get it. The referenced code is byteps/byteps/server/server.cc, line 284 (commit 6957bc3).
@ruipeterpan Would you test v0.2.4? My loss curve with v0.2.4 seems fine (it is at least decreasing).
The first incoming ... (see byteps/byteps/common/operations.cc, line 357, commit 6957bc3)
@ruipeterpan Please try this commit: 7ac1dc7
@ruipeterpan I think you may need to reduce the learning rate for the async mode. I am not sure what value is appropriate, but could you try
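The value to try was not given above; purely as an illustration, here is a sketch of plugging a smaller learning rate into the async run (the scaling rule is my own placeholder, not a recommendation from this thread):

```python
import torch
import byteps.torch as bps

bps.init()
model = torch.nn.Linear(10, 10)      # placeholder model

base_lr = 0.01                       # whatever the sync-mode run used
async_lr = base_lr / bps.size()      # placeholder heuristic: shrink with the number of workers

optimizer = torch.optim.SGD(model.parameters(), lr=async_lr)
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
```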
@ymjiang I suggest removing the copy and initializing the buffer with zeros (byteps/byteps/server/server.cc, lines 284 to 285, commit 6957bc3):

memset(stored->tensor, 0, stored->len);

But this did not completely solve the problem. Parameter broadcasting should be synchronous. Right now we rely on the time delay of the non-root workers to get things done right; see byteps/byteps/torch/__init__.py, lines 283 to 290 (commit 6957bc3).
@ruipeterpan Please try this commit: 18699f8
@vycezhong Here's what I got using 18699f8 with 4 workers + 1 server. I don't know if this is related, but I should note that in the first epoch of the first run, worker 0 got a loss of ~9 while all other workers got ~2.3. In subsequent executions of the scripts, this issue was gone. The following screenshots are from those subsequent runs. Here's what I got using 18699f8 with 4 workers + 4 servers. Thank you!
Hi, I encountered the same problem again in the current version of BPS.

[05:05:15] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[05:05:15] byteps/server/server.cc:430: BytePS server engine uses 8 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[05:05:15] src/postoffice.cc:25: Creating Van: zmq
[05:05:15] src/./zmq_van.h:66: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[05:05:15] src/./zmq_van.h:71: BYTEPS_ZMQ_NTHREADS set to 4
[05:05:15] src/van.cc:441: Bind to [role=server, ip=172.31.41.74, port=38517, is_recovery=0, aux_id=-1]
[05:05:15] src/./zmq_van.h:299: Start ZMQ recv thread
[05:05:29] src/van.cc:387: S[8] is connected to others
[05:05:29] 3rdparty/ps-lite/include/dmlc/logging.h:276: [05:05:29] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2268c) [0x7fa77e84d68c]
[bt] (1) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x22acd) [0x7fa77e84dacd]
[bt] (2) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fa77e846ea2]
[bt] (3) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fa77e848fc2]
[bt] (4) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2472e) [0x7fa77e84f72e]
[bt] (5) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x41491) [0x7fa77e86c491]
[bt] (6) /opt/conda/lib/libstdc++.so.6(+0xc9039) [0x7fa77e77f039]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa77ee116db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa77eb3a61f]
terminate called after throwing an instance of 'dmlc::Error'
what(): [05:05:29] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2268c) [0x7fa77e84d68c]
[bt] (1) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x22acd) [0x7fa77e84dacd]
[bt] (2) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fa77e846ea2]
[bt] (3) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fa77e848fc2]
[bt] (4) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2472e) [0x7fa77e84f72e]
[bt] (5) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x41491) [0x7fa77e86c491]
[bt] (6) /opt/conda/lib/libstdc++.so.6(+0xc9039) [0x7fa77e77f039]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa77ee116db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa77eb3a61f]
Aborted (core dumped)
Traceback (most recent call last):
File "/opt/conda/bin/bpslaunch", line 253, in <module>
launch_bps()
File "/opt/conda/bin/bpslaunch", line 249, in launch_bps
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.

I also use the example provided here: example
Server
Worker 0
Worker 1
I can run without this bug in the synchronous version, and other synchronous training codes are also fine. Is there any new bug related to the asynchronous training?

_init__.py", line 398, in broadcast_optimizer_state
76: Stopping W[9]
p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
[05:17:18] src/van.cc:104: W[9] is stopped
p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
[05:17:18] src/./zmq_van.h:81: W all threads joined and destroyed
Traceback (most recent call last):
File "/opt/conda/bin/bpslaunch", line 253, in <module>
launch_bps()
File "/opt/conda/bin/bpslaunch", line 239, in launch_bps
t[i].join()
File "/opt/conda/bin/bpslaunch", line 34, in join
raise self.exc
File "/opt/conda/bin/bpslaunch", line 27, in run
self.ret = self._target(*self._args, **self._kwargs)
File "/opt/conda/bin/bpslaunch", line 193, in worker
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)

Thanks for your attention.
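As a side note on the worker traceback above: the TypeError can be reproduced in isolation, since torch.Tensor([p]) fails when p is None, which appears to be what broadcast_optimizer_state hits when an optimizer hyper-parameter has no value. A standalone repro sketch (my own, not the BytePS code path itself):

```python
import torch

p = None  # e.g. an optimizer param-group entry that was never initialized
try:
    torch.Tensor([p]).cuda()  # .cuda() is never reached; the constructor raises first
except TypeError as e:
    print(e)  # "must be real number, not NoneType", matching the worker log
```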
Describe the bug
Turning on asynchronous training (export BYTEPS_ENABLE_ASYNC=1) crashes the bps server (during SendPullResponse in byteps/server/server.cc).

Expected behavior
The expected behavior is for the training to go error-free, just like in synchronous training.
Stack trace from the crashed server
These are produced by turning on BYTEPS_ENABLE_GDB, setting BYTEPS_LOG_LEVEL to INFO and PS_VERBOSE to 2.

To Reproduce
Steps to reproduce the behavior:
Building the Docker image
BytePS setup
Environment (please complete the following information):
A few other things