Turning on async (BYTEPS_ENABLE_ASYNC) crashes the bps server #357
For now, the asynchronous mode for PyTorch is supported only when you use the DistributedOptimizer approach, like this example. Your code currently uses the DDP wrapper, for which we haven't implemented the async mode yet.
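For reference, a minimal sketch of that DistributedOptimizer-style setup (modeled on the byteps.torch API used by the linked example; the model and learning rate below are placeholders):

```python
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())

model = torch.nn.Linear(10, 10).cuda()  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap the plain optimizer instead of using the DDP wrapper.
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# One-time broadcast of initial parameters and optimizer state from rank 0.
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)
```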
Hey @ymjiang, thanks for the info! Nevertheless, after switching to benchmark_byteps.py, the issue is still there. FYI:
This error indicates that you did not kill the previous bpslaunch process. @ruipeterpan, run `pkill bpslaunch; pkill python3` first.
@ruipeterpan Did you launch any worker in your example above? The server should do nothing but wait when there is no worker.
@vycezhong If only the server gets launched, it starts the ZMQ recv thread and waits without an error. As soon as the workers are launched, the server crashes.
OK, I can reproduce it. I will look into it.
@ruipeterpan Could you please test whether #359 fixes your issue?
@vycezhong thanks for the fix! The server-crashing issue is resolved by #359, but I'm seeing some weird behavior in the training loss curve after applying the changes in the PR. I'll spend some time double-checking to make sure it's not a problem on my end.
@ruipeterpan You also need to enable async mode for the workers.
I had already toggled BYTEPS_ENABLE_ASYNC on all workers, servers, and the scheduler for both the async-mode and sync-mode runs.
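For context, a hedged sketch of one way to make sure the flag reaches every role when launching through bpslaunch (the addresses, counts, and the benchmark_byteps.py command are placeholders; the usual ps-lite DMLC_* variables are assumed):

```python
import os
import subprocess

# Environment shared by the scheduler, servers, and workers (placeholder values).
COMMON = {
    "BYTEPS_ENABLE_ASYNC": "1",          # must be set on every role, not only the server
    "DMLC_PS_ROOT_URI": "172.31.41.74",  # placeholder scheduler address
    "DMLC_PS_ROOT_PORT": "1234",
    "DMLC_NUM_WORKER": "4",
    "DMLC_NUM_SERVER": "1",
}

def launch(role, cmd=("bpslaunch",), extra=None):
    """Launch one BytePS role with the shared environment plus per-role overrides."""
    env = {**os.environ, **COMMON, "DMLC_ROLE": role, **(extra or {})}
    return subprocess.Popen(list(cmd), env=env)

launch("scheduler")
launch("server")
# Workers wrap the actual training command.
launch("worker", ("bpslaunch", "python3", "benchmark_byteps.py"),
       extra={"DMLC_WORKER_ID": "0"})
```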
@ymjiang I think it should be the code at byteps/byteps/torch/__init__.py, line 204 (commit 6957bc3).
I do not get it. The referenced code is byteps/byteps/server/server.cc, line 284 (commit 6957bc3).
@ruipeterpan Would you test v0.2.4? My loss curve with v0.2.4 seems fine (it is at least decreasing).
The first incoming ... (see byteps/byteps/common/operations.cc, line 357, commit 6957bc3)
@ruipeterpan Please try this commit: 7ac1dc7
@ruipeterpan I think you may need to reduce the learning rate for the async mode. I am not sure what value is appropriate, but could you try
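The value to try was not given above; purely as an illustration, here is a sketch of plugging a smaller learning rate into the async run (the scaling rule is my own placeholder, not a recommendation from this thread):

```python
import torch
import byteps.torch as bps

bps.init()
model = torch.nn.Linear(10, 10)      # placeholder model

base_lr = 0.01                       # whatever the sync-mode run used
async_lr = base_lr / bps.size()      # placeholder heuristic: shrink with the number of workers

optimizer = torch.optim.SGD(model.parameters(), lr=async_lr)
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
```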
@ymjiang I suggest removing the copy and initializing the buffer with zeros (byteps/byteps/server/server.cc, lines 284 to 285, commit 6957bc3):

memset(stored->tensor, 0, stored->len);

But this did not completely solve the problem. Parameter broadcasting should be synchronous. Right now we rely on the time delay of the non-root workers to get things done right; see byteps/byteps/torch/__init__.py, lines 283 to 290 (commit 6957bc3).
@ruipeterpan Please try this commit: 18699f8
@vycezhong Here's what I got using 18699f8 with 4 workers + 1 server. I don't know if this is related, but I should note that in the first epoch of the first run, worker 0 got a loss of ~9 while all other workers got ~2.3. In subsequent executions of the scripts, this issue was gone. The following screenshots are from those subsequent runs. Here's what I got using 18699f8 with 4 workers + 4 servers. Thank you!
Hi, I encountered the same problem again in the current version of BPS.

[05:05:15] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[05:05:15] byteps/server/server.cc:430: BytePS server engine uses 8 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[05:05:15] src/postoffice.cc:25: Creating Van: zmq
[05:05:15] src/./zmq_van.h:66: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[05:05:15] src/./zmq_van.h:71: BYTEPS_ZMQ_NTHREADS set to 4
[05:05:15] src/van.cc:441: Bind to [role=server, ip=172.31.41.74, port=38517, is_recovery=0, aux_id=-1]
[05:05:15] src/./zmq_van.h:299: Start ZMQ recv thread
[05:05:29] src/van.cc:387: S[8] is connected to others
[05:05:29] 3rdparty/ps-lite/include/dmlc/logging.h:276: [05:05:29] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2268c) [0x7fa77e84d68c]
[bt] (1) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x22acd) [0x7fa77e84dacd]
[bt] (2) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fa77e846ea2]
[bt] (3) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fa77e848fc2]
[bt] (4) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2472e) [0x7fa77e84f72e]
[bt] (5) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x41491) [0x7fa77e86c491]
[bt] (6) /opt/conda/lib/libstdc++.so.6(+0xc9039) [0x7fa77e77f039]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa77ee116db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa77eb3a61f]
terminate called after throwing an instance of 'dmlc::Error'
what(): [05:05:29] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2268c) [0x7fa77e84d68c]
[bt] (1) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x22acd) [0x7fa77e84dacd]
[bt] (2) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fa77e846ea2]
[bt] (3) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fa77e848fc2]
[bt] (4) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2472e) [0x7fa77e84f72e]
[bt] (5) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x41491) [0x7fa77e86c491]
[bt] (6) /opt/conda/lib/libstdc++.so.6(+0xc9039) [0x7fa77e77f039]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa77ee116db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa77eb3a61f]
Aborted (core dumped)
Traceback (most recent call last):
File "/opt/conda/bin/bpslaunch", line 253, in <module>
launch_bps()
File "/opt/conda/bin/bpslaunch", line 249, in launch_bps
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.

I also use the example provided here: example
Server
Worker 0
Worker 1
I can run without this bug in the synchronous version, and other synchronous training codes are also fine. Is there any new bug related to the asynchronous training?

_init__.py", line 398, in broadcast_optimizer_state
76: Stopping W[9]
p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
[05:17:18] src/van.cc:104: W[9] is stopped
p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
[05:17:18] src/./zmq_van.h:81: W all threads joined and destroyed
Traceback (most recent call last):
File "/opt/conda/bin/bpslaunch", line 253, in <module>
launch_bps()
File "/opt/conda/bin/bpslaunch", line 239, in launch_bps
t[i].join()
File "/opt/conda/bin/bpslaunch", line 34, in join
raise self.exc
File "/opt/conda/bin/bpslaunch", line 27, in run
self.ret = self._target(*self._args, **self._kwargs)
File "/opt/conda/bin/bpslaunch", line 193, in worker
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)

Thanks for your attention.
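As a side note on the worker traceback above: the TypeError can be reproduced in isolation, since torch.Tensor([p]) fails when p is None, which appears to be what broadcast_optimizer_state hits when an optimizer hyper-parameter has no value. A standalone repro sketch (my own, not the BytePS code path itself):

```python
import torch

p = None  # e.g. an optimizer param-group entry that was never initialized
try:
    torch.Tensor([p]).cuda()  # .cuda() is never reached; the constructor raises first
except TypeError as e:
    print(e)  # "must be real number, not NoneType", matching the worker log
```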
Describe the bug
Turning on asynchronous training (export BYTEPS_ENABLE_ASYNC=1) crashes the bps server (during SendPullResponse in byteps/server/server.cc).

Expected behavior
The expected behavior is for the training to go error-free, just like in synchronous training.
Stack trace from the crashed server
These are produced by turning on BYTEPS_ENABLE_GDB, setting BYTEPS_LOG_LEVEL to INFO and PS_VERBOSE to 2.

To Reproduce
Steps to reproduce the behavior:
Building the Docker image
BytePS setup
Environment (please complete the following information):
A few other things