docs: improve the docs
ymjiang committed Jun 19, 2019
1 parent de4ea95 commit cfc7ec1
Showing 2 changed files with 24 additions and 11 deletions.
25 changes: 19 additions & 6 deletions docs/env.md
@@ -80,22 +80,35 @@ The most important one is the number of GPUs per PCIe switches. You should confi
export BYTEPS_PCIE_SWITCH_SIZE=x
```

You can also configure the tensor partition size. A smaller size improves BytePS pipelining, but may incur other overheads such as NCCL coordination and ZMQ message headers. The default and recommended value is 4096000 (in bytes).

```
export BYTEPS_PARTITION_BYTES=y
```

The rest do not impact performance much, but you can still experiment with them if you have time.

You can increase the number of concurrent NCCL streams used in local merging. However, this may lead to occasional hangs due to the NCCL implementation.

```
export BYTEPS_NCCL_NUM_RINGS=z
```

BytePS uses group NCCL calls to reduce NCCL invocation overhead. You can try increasing the group size:

```
export BYTEPS_NCCL_GROUP_SIZE=w
```

Servers can also be the performance bottleneck, e.g., when there is only one server but multiple workers.
You can try increasing the number of push threads on the servers (the default is 1):

```
export SERVER_PUSH_NTHREADS=v
```

Increasing the number of engine CPU threads may also improve server performance:

```
export MXNET_CPU_WORKER_NTHREADS=p
```
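Taken together, the knobs above can be collected into a single environment setup. The sketch below is illustrative only: the numeric values are placeholders to show the shape of a configuration, not tuned recommendations, and should be adjusted to your machine.

```shell
# Illustrative BytePS tuning environment; values are placeholders, not recommendations.
export BYTEPS_PCIE_SWITCH_SIZE=4        # GPUs per PCIe switch on this machine
export BYTEPS_PARTITION_BYTES=4096000   # tensor partition size (default/recommended)
export BYTEPS_NCCL_NUM_RINGS=2          # concurrent NCCL streams; >1 may occasionally hang
export BYTEPS_NCCL_GROUP_SIZE=4         # NCCL calls grouped per invocation
export SERVER_PUSH_NTHREADS=2           # server-side push threads (default is 1)
export MXNET_CPU_WORKER_NTHREADS=4      # server engine CPU threads
```

Exporting these before launching the worker or server processes is enough; BytePS reads them from the environment at startup.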
10 changes: 5 additions & 5 deletions docs/running.md
@@ -11,29 +11,29 @@ On worker 0, run:
```
DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_WORKER_ID=0 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
python launcher/launcher.py YOUR_COMMAND
```

On worker 1, run (only DMLC_WORKER_ID is different from above):

```
DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_WORKER_ID=1 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
python launcher/launcher.py YOUR_COMMAND
```

On the server, run (remove DMLC_WORKER_ID, and set role to server):

```
DMLC_ROLE=server DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
```

On the scheduler, run (remove DMLC_WORKER_ID, and set role to scheduler):

```
DMLC_ROLE=scheduler DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
```

The order of the above commands does not matter.
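For a quick single-machine smoke test, the four roles can be started from one shell. The helper below is a hypothetical convenience wrapper, not part of BytePS; it assumes you run it from the repository root, and `YOUR_COMMAND` stands for your training command as in the examples above.

```shell
# Hypothetical helper (not part of BytePS): launch the scheduler, one server,
# and two workers on one machine. Assumes the BytePS repository root as cwd.
launch_all() {
  export DMLC_PS_ROOT_URI=127.0.0.1   # everything on localhost for the test
  export DMLC_PS_ROOT_PORT=9000
  export DMLC_NUM_WORKER=2
  export DMLC_NUM_SERVER=1

  DMLC_ROLE=scheduler python launcher/launcher.py &
  DMLC_ROLE=server    python launcher/launcher.py &
  DMLC_ROLE=worker DMLC_WORKER_ID=0 python launcher/launcher.py YOUR_COMMAND &
  DMLC_ROLE=worker DMLC_WORKER_ID=1 python launcher/launcher.py YOUR_COMMAND &
  wait   # block until all four roles exit
}
```

Calling `launch_all` starts all four roles in the background and waits for them. In a real multi-machine deployment you would instead run each command on its own host, with `DMLC_PS_ROOT_URI` set to the scheduler's IP as shown above.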
