diff --git a/docs/env.md b/docs/env.md
index 37727607f..e0bd273fc 100644
--- a/docs/env.md
+++ b/docs/env.md
@@ -80,22 +80,36 @@ The most important one is the number of GPUs per PCIe switches. You should confi
 export BYTEPS_PCIE_SWITCH_SIZE=x
 ```
 
-The rest do not impact the performance much. However, you can still experiment them if you have time.
-
-First, you can configure the tensor partition size. A smaller size improves BytePS pipelining, but may have higher other overhead like NCCL coordination, ZMQ message headers, etc. The default and recommended value is 1024000 (in bytes).
+You can also configure the tensor partition size. A smaller size improves BytePS pipelining, but may incur higher overheads elsewhere, such as NCCL coordination and ZMQ message headers. The default and recommended value is 4096000 (in bytes).
 
 ```
 export BYTEPS_PARTITION_BYTES=y
 ```
 
-Then, you can increase the number of concurrent NCCL streams used in local merging. However, this may lead to occasional hanging problem due to NCCL implementation.
+The rest do not impact the performance much. However, you can still experiment with them if you have time.
+
+You can increase the number of concurrent NCCL streams used in local merging. However, this may lead to occasional hangs due to the NCCL implementation.
 
 ```
 export BYTEPS_NCCL_NUM_RINGS=z
 ```
 
-Finally, BytePS uses group NCCL calls to reduce NCCL invoking overhead. You can try to increase the group sizes:
+BytePS uses group NCCL calls to reduce NCCL invocation overhead. You can try to increase the group size:
+
+```
+export BYTEPS_NCCL_GROUP_SIZE=w
+```
+
+Servers can also be the performance bottleneck, e.g., when there is only one server but multiple workers.
 
+You can try to increase the number of push threads on the servers (default is 1):
+
+```
+export SERVER_PUSH_NTHREADS=v
 ```
-export BYTEPS_NCCL_NUM_RINGS=w
+
+Increasing the number of engine CPU threads may also improve server performance:
+
 ```
+export MXNET_CPU_WORKER_NTHREADS=p
+```
\ No newline at end of file
diff --git a/docs/running.md b/docs/running.md
index 7d2ebc9b0..9cba7ea98 100644
--- a/docs/running.md
+++ b/docs/running.md
@@ -11,7 +11,7 @@ On worker 0, run:
 ```
 DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
 DMLC_WORKER_ID=0 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
-launcher/launcher.py YOUR_COMMAND
+python launcher/launcher.py YOUR_COMMAND
 ```
 
 On worker 1, run (only DMLC_WORKER_ID is different from above):
@@ -19,21 +19,21 @@ On worker 1, run (only DMLC_WORKER_ID is different from above):
 ```
 DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
 DMLC_WORKER_ID=1 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
-launcher/launcher.py YOUR_COMMAND
+python launcher/launcher.py YOUR_COMMAND
 ```
 
 On the server, run (remove DMLC_WORKER_ID, and set role to server):
 
 ```
 DMLC_ROLE=server DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
-DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 launcher/launcher.py
+DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
 ```
 
 On the scheduler, run (remove DMLC_WORKER_ID, and set role to scheduler):
 
 ```
 DMLC_ROLE=scheduler DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
-DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 launcher/launcher.py
+DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
 ```
 
-The order of above commands does not matter.
\ No newline at end of file
+The order of the above commands does not matter.
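To show how the env.md knobs above fit together, here is a minimal sketch of a worker-side setup script. All numeric values stand in for the docs' `x`/`z`/`w`/`v`/`p` placeholders and are illustrative assumptions, not recommendations; only `BYTEPS_PARTITION_BYTES=4096000` is the default documented in this patch.

```shell
#!/bin/sh
# Illustrative BytePS tuning, per docs/env.md. Values below are
# placeholder guesses (tune for your cluster), except
# BYTEPS_PARTITION_BYTES, which uses the documented default.

export BYTEPS_PCIE_SWITCH_SIZE=4        # GPUs per PCIe switch (machine-specific)
export BYTEPS_PARTITION_BYTES=4096000   # tensor partition size in bytes (default)
export BYTEPS_NCCL_NUM_RINGS=2          # concurrent NCCL streams; too many may hang
export BYTEPS_NCCL_GROUP_SIZE=4         # NCCL calls batched per group
export SERVER_PUSH_NTHREADS=2           # server-side push threads (default is 1)
export MXNET_CPU_WORKER_NTHREADS=4      # server engine CPU threads

echo "partition=$BYTEPS_PARTITION_BYTES group=$BYTEPS_NCCL_GROUP_SIZE"

# Then launch as in docs/running.md, e.g. on worker 0:
# DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
# DMLC_WORKER_ID=0 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
# python launcher/launcher.py YOUR_COMMAND
```

The server-only variables (`SERVER_PUSH_NTHREADS`, `MXNET_CPU_WORKER_NTHREADS`) are harmless on workers but only take effect on server processes.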